Big Data Handling Techniques

A large volume of data poses new challenges, such as overloaded memory and algorithms that never stop running. It forces you to adapt and expand your repertoire of techniques. Figure 5.1 shows a mind map that will gradually unfold as we go through the steps: problems, solutions, and tips.
5.1. PROBLEMS
A computer only has a limited amount of RAM. When you try to squeeze more data into this memory than actually fits, the OS will start swapping out memory blocks to disk, which is far less efficient than having it all in memory. But only a few algorithms are designed to handle large data sets; most of them load the whole data set into memory at once, which causes the out-of-memory error. Other algorithms need to hold multiple copies of the data in memory or store intermediate results. All of these aggravate the problem.

Another limited resource is time. Certain algorithms don't take time into account. Other algorithms can't end in a reasonable amount of time even when they need to process only a few megabytes of data.
A third problem when dealing with large data sets is that components of the computer can start to form a bottleneck while leaving other systems idle. Certain programs don't feed data fast enough to the processor because they have to read data from the hard drive, which is one of the slowest components on a computer.

Fig. 5.1. Overview of problems encountered when working with more data than can fit in memory: not enough memory, processes that never end, some components forming a bottleneck while others remain idle, and not enough speed.

5.2. GENERAL TECHNIQUES FOR HANDLING LARGE VOLUMES OF DATA

Never-ending algorithms, out-of-memory errors, and speed issues are the most common challenges you face when working with large data. The solutions can be divided into three categories: using the correct algorithms, choosing the right data structures, and using the right tools (figure 5.2).

Fig. 5.2. Overview of solutions for handling large data sets: choose the right algorithms, choose the right data structures, choose the right tools, and apply the general tips.

5.2.1. Choosing the right algorithm

Choosing the right algorithm can solve more problems than adding more or better hardware. An algorithm that's well suited for handling large data doesn't need to load the entire data set into memory to make predictions. Ideally, the algorithm also supports parallelized calculations. There are three types of algorithms that can do that: online algorithms, block algorithms, and MapReduce algorithms, as shown in figure 5.3.

Fig. 5.3. Overview of techniques to adapt algorithms to large data sets: online algorithms, block matrices, and MapReduce.

(i) Online Learning Algorithms

Several, but not all, machine learning algorithms can be trained using one observation at a time instead of taking all the data into memory. Upon the arrival of a new data point, the model is trained and the observation can be forgotten; its effect is now incorporated into the model's parameters. For example, a model used to predict the weather can use different parameters (like atmospheric pressure or temperature) in different regions. When the data from one region is loaded into the algorithm, it forgets about this raw data and moves on to the next region. This "use and forget" way of working is the perfect solution for the memory problem, as a single observation is unlikely to ever be big enough to fill up all the memory of a modern-day computer.

Listing 5.1 shows how to apply this principle to a perceptron with online learning. A perceptron is one of the least complex machine learning algorithms, used for binary classification (0 or 1): for instance, deciding whether a customer will buy or not.
Listing 5.1. Training a perceptron by observation (if we reach the maximum number of allowed runs, or epochs, we stop looking for a solution)
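A minimal sketch of such an online perceptron (the class name, learning rate, and toy observations below are illustrative and not taken from the original listing):

    import numpy as np

    class OnlinePerceptron:
        """A tiny perceptron for binary (0/1) classification, trained one observation at a time."""

        def __init__(self, n_features, learning_rate=0.1):
            self.w = np.zeros(n_features)   # one weight per predictor variable
            self.b = 0.0                    # bias term
            self.learning_rate = learning_rate

        def predict(self, x):
            return 1 if np.dot(self.w, x) + self.b > 0 else 0

        def partial_fit(self, x, y):
            # Learn from a single observation and then forget it: its effect is
            # now incorporated into the weights and the bias.
            error = y - self.predict(x)
            self.w += self.learning_rate * error * np.asarray(x, dtype=float)
            self.b += self.learning_rate * error

    # Feed observations one by one, for example while streaming them from disk.
    model = OnlinePerceptron(n_features=2)
    for x, y in [([1.0, 0.0], 1), ([0.0, 1.0], 0), ([1.0, 1.0], 1)]:
        model.partial_fit(x, y)
    print(model.predict([1.0, 0.0]))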
Online algorithms are closely related to mini-batch learning, in which the algorithm is fed small batches of observations so that it handles only a few at a time.

(ii) Dividing a Large Matrix into Many Small Ones

By cutting a large data table into small matrices we can, for instance, still do a linear regression. The logic behind this matrix splitting and how a linear regression can be calculated with matrices can be found in the sidebar. It suffices to know for now that the Python libraries we are about to use will take care of the matrix splitting, and that the linear regression variable weights can be calculated using matrix calculus: the coefficients are given by (XᵀX)⁻¹(Xᵀy). The Python tools to accomplish the task are the following:
• bcolz is a Python library that can store data arrays compactly and uses the hard drive when the array no longer fits into main memory.
• Dask is a library that enables you to optimize the flow of calculations and makes performing calculations in parallel easier. It doesn't come packaged with the default Anaconda setup, so make sure to run conda install dask in your virtual environment before running the code below. Some errors have been reported when importing Dask on 64-bit Python. Dask depends on a few other libraries (such as toolz), but the dependencies should be taken care of automatically by pip or conda.
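Before the listing below can run, the predictors (ar) and the target (y) need to exist as bcolz carrays; a minimal setup sketch, assuming bcolz is installed (the array size, file locations, and toy target are chosen only for illustration):

    import numpy as np
    import bcolz as bc

    n = 1000                                   # number of observations (illustrative)
    # Predictors: an n x 5 array stored compactly on disk by bcolz.
    ar = bc.carray(np.random.normal(size=(n, 5)), rootdir='ar.bcolz', mode='w')
    ar.flush()
    # Target: a linear combination of the predictors plus noise, also stored with bcolz.
    y_np = ar[:].dot(np.array([1.0, 2.0, 3.0, 4.0, 5.0])) + np.random.normal(size=n)
    y = bc.carray(y_np.reshape(n, 1), rootdir='y.bcolz', mode='w')
    y.flush()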
The following listing demonstrates block matrix calculations with these libraries.

    import dask.array as da
    import numpy as np

    # Block matrices are created for the predictors (ar) and the target (y). A block
    # matrix is a matrix cut into pieces (blocks). da.from_array reads the data from
    # disk or RAM (wherever it currently resides); chunks=(5,5) means every block is
    # a 5x5 matrix (unless fewer than 5 observations or variables are left).
    dax = da.from_array(ar, chunks=(5, 5))
    dy = da.from_array(y, chunks=(5, 5))

    # XTX is defined (lazily) as the X matrix multiplied with its transposed version,
    # and Xy as the y vector multiplied with the transposed X matrix. Both are
    # building blocks of the matrix formula for linear regression; neither is
    # calculated yet.
    XTX = dax.T.dot(dax)
    Xy = dax.T.dot(dy)

    # The coefficients are calculated using the matrix linear regression formula:
    # np.linalg.inv() is the matrix inversion, and .dot() multiplies one matrix with
    # another. There is no need for a block matrix inversion because XTX is a square
    # matrix of size (nr of predictors x nr of predictors), which is fortunate
    # because Dask doesn't yet support block matrix inversion.
    coefficients = np.linalg.inv(XTX.compute()).dot(Xy.compute())

    # The coefficients are put back into a block matrix: we got a numpy array from
    # the last step, so we need to explicitly convert it back to a "da" array.
    coef = da.from_array(coefficients, chunks=(5, 5))

    # Flush the memory data; it's no longer needed to keep the large matrices in memory.
    ar.flush()
    y.flush()

    # Score the model (make predictions).
    predictions = dax.dot(coef).compute()
    print predictions
(iii) MapReduce

MapReduce algorithms are easy to understand with an analogy: imagine you have to count all the votes for the national elections. The country has 25 parties, 1,500 voting offices, and 2 million people. You can choose to gather all the voting tickets from every office individually and count them centrally, or you can ask the local offices to count the votes for the 25 parties and hand over the results, which you can then aggregate by party. MapReducers follow a process similar to the second way of working. They first map values to a key and then do an aggregation on that key during the reduce phase. Take a look at the following listing's pseudo code to get a better feeling for this.

Listing 5.4. MapReduce pseudo code example

    For each person in voting office:
        Yield (voted_party, 1)
    For each vote in voting office:
        add_vote_to_party()

One of the advantages of MapReduce algorithms is that they're easy to parallelize and distribute. This explains their success in distributed environments such as Hadoop, but they can also be used on individual computers.
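The same vote-counting idea can be sketched in plain Python, mapping every vote to a (party, 1) pair and then aggregating by key (the party names below are made up for illustration):

    from collections import defaultdict

    votes = ["party_a", "party_b", "party_a", "party_c", "party_a"]  # illustrative input

    # Map phase: emit a (key, value) pair for every vote.
    mapped = [(party, 1) for party in votes]

    # Reduce phase: aggregate the values per key.
    totals = defaultdict(int)
    for party, count in mapped:
        totals[party] += count

    print(dict(totals))   # {'party_a': 3, 'party_b': 1, 'party_c': 1}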
5.2.2. Choosing the right data structure

Algorithms can make or break your program, but the way you store your data is of equal importance. Data structures have different storage requirements, but they also influence the performance of CRUD (create, read, update, and delete) and other operations on the data set. Figure 5.4 shows that you have many different data structures to choose from, three of which will be discussed here: sparse data, tree data, and hash data.

Fig. 5.4. Overview of data structures often applied to data science when working with large data: sparse data, trees, and hashes.

(i) Sparse Data

A sparse data set contains relatively little information compared to its number of entries (observations). Look at figure 5.5: almost everything is "0", with just a single "1" present in the second observation on variable 9.

Fig. 5.5. Example of a sparse matrix: almost everything is 0; other values are the exception in a sparse matrix.

Data like this might look ridiculous, but it is often what you get when converting textual data to binary data. Imagine a set of 100,000 completely unrelated Twitter tweets. Most of them probably have fewer than 30 words, but together they might contain hundreds or thousands of distinct words. Encoding each word as a binary variable, with "1" representing "present in this tweet" and "0" meaning "not present in this tweet", results in sparse data indeed. The resulting large matrix can cause memory problems even though it contains little information.

Luckily, data like this can be stored compacted. In the case of figure 5.5 it could look like this:

    data = [(2, 9, 1)]

Row 2, column 9 holds the value 1.

Support for working with sparse matrices is growing in Python. Many algorithms now support or return sparse matrices.
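SciPy, for instance, ships sparse matrix types; a small sketch that stores the matrix of figure 5.5 (4 observations, 16 variables, a single non-zero entry) in coordinate format:

    from scipy.sparse import coo_matrix

    # One non-zero entry: the second observation (row index 1) on variable 9 (column index 8).
    rows = [1]
    cols = [8]
    values = [1]

    sparse = coo_matrix((values, (rows, cols)), shape=(4, 16))
    print(sparse)            # only the stored (row, column, value) triplet is kept in memory
    print(sparse.toarray())  # expands back to the full, mostly-zero matrix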
(ii) Tree Structures

Trees are a class of data structure that allows you to retrieve information much faster than scanning through a table. A tree always has a root value and subtrees of children, each with its own children, and so on. Simple examples would be a family tree or a biological tree and the way it splits into branches, twigs, and leaves. Simple decision rules make it easy to find the child tree in which the data resides. Figure 5.6 shows how a tree structure enables you to get to the relevant information quickly.

Fig. 5.6. Example of a tree data structure: decision rules such as age categories (age < 12, 12 ≤ age < 78, age ≥ 78) can be used to quickly locate a person in a family tree; the leaf level holds the individual records.

In figure 5.6 you start the search at the top and first choose an age category, because apparently that's the factor that cuts away the most alternatives. This goes on and on until you reach the expected outcome. (The Akinator, a djinn in a magical lamp that tries to guess a person in your mind by asking a few questions about him or her, works the same way.)

Trees are also popular in databases. Databases prefer not to scan the table from the first line until the last, but to use a device called an index to avoid this. Indices are often based on data structures such as trees and hash tables to find observations faster. The use of an index speeds up the process of finding data enormously.

(iii) Hash Tables

Hash tables are data structures that calculate a key for every value in the data and put the keys in a bucket. This way you can quickly retrieve the information by looking in the right bucket when you encounter the data. Dictionaries in Python are a hash table implementation, and they're a close relative of key-value stores. Hash tables are used extensively in databases as indices for fast information retrieval.
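A Python dictionary already gives you this bucketed lookup; a tiny sketch (the keys and values are invented for illustration):

    # A dict is a hash table: the key is hashed, and the value is stored in the
    # bucket that matches the hash, so a lookup doesn't scan the whole collection.
    ratings = {"cust_001": 4, "cust_002": 5}

    print(hash("cust_001"))      # the hash value decides which bucket is used
    print(ratings["cust_001"])   # retrieval goes straight to that bucket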
5.2.3. Selecting the right tools

With the right class of algorithms and data structures in place, it's time to choose the right tool for the job. The right tool can be a Python library, or at least a tool that's controlled from Python, as shown in figure 5.7.

Fig. 5.7. Overview of tools that can be used when working with large data: Python tools (Numexpr, Bcolz, Numba, Blaze, Theano, Cython) and using Python as a master to control other tools.

5.2.3.1. Python Tools

Python has a number of libraries that can help you deal with large data. They range from smarter data structures over code optimizers to just-in-time compilers. The following is a list of libraries we like to use when confronted with large data:
• Cython - Cython, a superset of Python, tackles the speed problem by forcing the programmer to specify the data type while developing the program. Once the compiler has this information, it runs programs much faster.
• Numexpr - Numexpr is at the core of many of the big data packages, as NumPy is for in-memory packages. Numexpr is a numerical expression evaluator for NumPy, but it can be many times faster than the original NumPy (a short sketch follows this list).
• Numba - Numba helps you achieve greater speed by compiling your code right before you execute it, also known as just-in-time compiling.
• Bcolz - Bcolz helps you overcome the out-of-memory problem that can occur when using NumPy. It can store and work with arrays in an optimal compressed form. It not only slims down the data needs but also uses Numexpr in the background to reduce the calculations needed when performing calculations with bcolz arrays.
• Blaze - Blaze is the "Pythonic way" of working with data. Blaze will translate your Python code into SQL but can handle many more data stores than relational databases, such as CSV files, Spark, and others. Blaze delivers a unified way of working with many databases and data libraries.
• Theano - Theano enables you to work directly with the graphical processing unit (GPU) and do symbolic simplifications whenever possible, and it comes with an excellent just-in-time compiler.
• Dask - Dask enables you to optimize your flow of calculations and execute them efficiently. It also enables you to distribute calculations.
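The Numexpr sketch promised above (the array sizes are arbitrary):

    import numpy as np
    import numexpr as ne

    a = np.random.rand(1000000)
    b = np.random.rand(1000000)

    # The whole expression is compiled and evaluated in chunks, avoiding the large
    # temporary arrays that plain NumPy would allocate for a*b and a*b + 3.
    result = ne.evaluate("a * b + 3")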
5.3. GENERAL PROGRAMMING TIPS FOR DEALING WITH LARGE DATA SETS

The tricks that work in a general programming context still apply for data science. Several might be worded slightly differently, but the principles are essentially the same for all programmers. The general tricks are divided into three parts, as shown in the figure 5.8 mind map:
• Don't reinvent the wheel. Use tools and libraries developed by others.
• Get the most out of your hardware. Your machine is never used to its full potential; with simple adaptions you can make it work harder.
• Reduce the computing need. Slim down your memory and processing needs as much as possible.

Fig. 5.8. Overview of general programming best practices when working with large data: don't reinvent the wheel, get the most out of your hardware, reduce the computing needs.

5.3.1. Don't reinvent the wheel

"Don't repeat anyone" is probably even better than "don't repeat yourself." Add value with your actions: make sure that they matter. Solving a problem that has already been solved is a waste of time. As a data scientist, you have two large rules that can help you deal with large data and make you much more productive, to boot:
• Exploit the power of databases. The first reaction most data scientists have when working with large data sets is to prepare their analytical base tables inside a database. This method works well when the features you want to prepare are fairly simple.
• Use optimized libraries. Creating libraries like Mahout, Weka, and other machine learning algorithms requires time and knowledge. They are highly optimized and incorporate best practices and state-of-the-art technologies.

5.3.2. Get the most out of your hardware

Resources on a computer can be idle while other resources are over-utilized. This slows down programs and can even make them fail. Sometimes it's possible (and necessary) to shift the workload from an overtaxed resource to an underutilized resource using the following techniques:
• Feed the CPU compressed data. A simple trick to avoid CPU starvation is to feed the CPU compressed data instead of the inflated (raw) data. This will shift more work from the hard disk to the CPU, which is exactly what you want to do, because a hard disk can't keep up with the CPU in most modern computer architectures.
• Make use of the GPU. The GPU is enormously efficient in parallelizable jobs but has less cache than the CPU.
• Use multiple threads. It's still possible to parallelize computations on your CPU. You can achieve this with normal Python threads.
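A minimal sketch of the multithreading tip using the standard threading module (the chunks and the worker function are placeholders; for pure-Python CPU-bound work the multiprocessing module is often the better fit):

    import threading

    def process_chunk(chunk, results, index):
        # Placeholder worker: replace the sum with the real per-chunk computation.
        results[index] = sum(chunk)

    chunks = [range(0, 1000), range(1000, 2000), range(2000, 3000)]
    results = [None] * len(chunks)
    threads = [threading.Thread(target=process_chunk, args=(chunk, results, i))
               for i, chunk in enumerate(chunks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()            # wait until every chunk has been processed
    print(sum(results))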
5.3.3. Reduce your computing needs

The best way to avoid having large data problems is by removing as much of the work as possible up front and letting the computer work only on the part that can't be skipped. The following list contains methods that help achieve this:
• Profile your code and remediate slow pieces of code. Not every piece of code needs to be optimized; use a profiler to detect slow parts inside your program and remediate those parts.
• Use compiled code whenever possible, certainly when loops are involved. Whenever possible use functions from packages that are optimized for numerical computations instead of implementing everything yourself. The code in these packages is often highly optimized and compiled.
• Otherwise, compile the code yourself. When an existing package can't be used, use either a just-in-time compiler or implement the slowest parts of your code in a lower-level language such as C or Fortran and integrate this with your codebase.
• Avoid pulling data into memory. When working with data that doesn't fit in memory, avoid pulling everything into memory. A simple way of doing this is by reading data in chunks and parsing the data on the fly.
• Use generators to avoid intermediate data storage. Generators help you return data per observation instead of in batches. This way you avoid storing intermediate results (see the sketch after this list).
• Use as little data as possible. If no large-scale algorithm is available and you aren't willing to implement such a technique yourself, you can still train your model on only a sample of the original data.
• Use your math skills to simplify calculations as much as possible. Take the following equation, for example: (a + b)² = a² + 2ab + b². The left side will be computed much faster than the right side of the equation; even for this trivial example, it could make a difference when talking about big chunks of data.
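A minimal sketch of the chunked-reading and generator tips mentioned above (the file name and the parsing step are placeholders):

    def observations(path):
        # Read the file line by line and yield one parsed observation at a time,
        # so the full data set never has to sit in memory.
        with open(path) as f:
            for line in f:
                yield line.strip().split(",")   # placeholder parsing step

    total = 0
    for row in observations("large_file.csv"):  # placeholder file name
        total += 1                               # process each observation on the fly
    print(total)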
5.4. CASE STUDY 1: PREDICTING MALICIOUS URLS

The internet is probably one of the greatest inventions of modern times. It has boosted humanity's development, but not everyone uses this great invention with honorable intentions. Many companies (Google, for one) try to protect users from fraud by detecting malicious websites. Doing so is no easy task, because the internet has billions of web pages to scan. In this case study it is shown how to work with a data set that no longer fits in memory.
• Data - The data in this case study was made available as part of a research project. The project contains data from 120 days, and each observation has approximately 3,200,000 features. The target variable contains 1 if it's a malicious website and -1 otherwise.
• The Scikit-learn library - This library should be installed in the Python environment.

5.4.1. Step 1: Defining the research goal

The goal of our project is to detect whether certain URLs can be trusted or not. Because the data is so large, we aim to do this in a memory-friendly way.

5.4.2. Step 2: Acquiring the URL data

Start by downloading the data from http://sysnet.ucsd.edu/projects/url/#datasets and placing it in a folder. Choose the data in SVMLight format. SVMLight is a text-based format with one observation per row; to save space, it leaves out the zeros.
The following listing and figure 5.9 show what happens when you try to read in one file out of the 120 and create the normal matrix, as most algorithms expect. The todense() method changes the data from a special file format to a normal matrix where every entry contains a value.

    MemoryError                    Traceback (most recent call last)
    <ipython-input-532-d196c05088ce> in <module>()
          5 print "there are %d files" % len(files)
          6 X,y = load_svmlight_file(files[6], n_features=3500000)
    ----> 7 X.todense()

Fig. 5.9. A memory error occurs while loading a single file

Tools and Techniques

A memory error occurs while loading a single file. Luckily, there are a few tricks to deal with this; we try these techniques over the course of the case study:
• Use a sparse representation of the data.
• Feed the algorithm compressed data instead of raw data.
• Use an online algorithm to make predictions.

Data that contains little information compared to zeros is called sparse data. This can be saved more compactly if you store the data as [(0,0,1),(4,4,1)] instead of [[1,0,0,0,0], [0,0,0,0,0], [0,0,0,0,0], [0,0,0,0,0], [0,0,0,0,1]]. One of the file formats that implements this is SVMLight, and that is why the data was downloaded in this format.

To get this information there is a need to keep the data compressed while checking for the maximum number of observations and variables. We also need to read in the data file by file. This way you consume even less memory. A second trick is to feed the CPU compressed files: in our example, the data is already packed in the tar.gz format, so a file is unpacked only when it is needed, without writing it to the hard disk (the slowest part of your computer).

Listing 5.6. Checking the data size

    import tarfile
    from sklearn.datasets import load_svmlight_file
    import numpy as np

    # The uri variable holds the location in which you saved the files.
    uri = '<path to the downloaded tar.gz archive>'
    tar = tarfile.open(uri, "r:gz")

    # We don't know how many observations and features we have, so initialize both at 0.
    max_obs = 0
    max_vars = 0
    i = 0
    split = 5

    for tarinfo in tar:                  # unpack the files one by one to reduce the memory needed
        if i > split:                    # stop when we reach 5 files
            break
        if tarinfo.isfile():
            f = tar.extractfile(tarinfo.name)
            X, y = load_svmlight_file(f)     # helper function to load a single SVMLight file
            # Adjust the maximum number of observations and variables when necessary (big file).
            max_obs = np.maximum(max_obs, X.shape[0])
            max_vars = np.maximum(max_vars, X.shape[1])
            # Fraction of non-zero entries in this file.
            print "number of non-zero entries %f" % (X.nnz / (X.shape[0] * float(X.shape[1])))
        i += 1

    print "max X %s, max y dimension %s" % (max_obs, max_vars)   # print the results

Part of the code needs some extra explanation. In this code we loop through the SVM files inside the tar archive. We unpack the files one by one to reduce the memory needed. As these files are in the SVMLight format, a helper function, load_svmlight_file(), is used to load a specific file. Then we can see how many observations and variables a file has by checking the shape of the resulting data set. The output includes the fraction of non-zero entries, around 0.000033: sparse data indeed.

5.4.4. Step 4: Model building

Now that we're aware of the dimensions of our data, we can apply the same two tricks (a sparse representation of compressed data) and add the third (using an online algorithm), in the following listing.
Listing 5.7. Creating a model to distinguish the malicious from the normal URLs

    from sklearn.linear_model import SGDClassifier
    from sklearn.metrics import classification_report
    from sklearn.datasets import load_svmlight_file
    import tarfile

    # The target variable can be 1 or -1 (malicious website or not, as described above).
    classes = [-1, 1]
    # Set up the stochastic gradient descent classifier.
    sgd = SGDClassifier(loss="log")
    # We know the number of features from the data exploration.
    n_features = 3231952
    split = 5
    i = 0                                  # initialize the file counter at 0

    # All files together take up around 2.05 Gb. The trick is to leave the data
    # compressed and unpack the files one by one to reduce the memory needed.
    tar = tarfile.open(uri, "r:gz")
    for tarinfo in tar:
        if i > split:                      # stop at the 5th file (instead of all of
            break                          # them, for demonstration purposes)
        if tarinfo.isfile():
            f = tar.extractfile(tarinfo.name)
            # Use the helper function load_svmlight_file() to load a specific file.
            X, y = load_svmlight_file(f, n_features=n_features)
            if i < split:
                # Third important trick: the online algorithm. It can be fed data
                # points file by file (batches).
                sgd.partial_fit(X, y, classes=classes)
            if i == split:
                # Stop when we reach 5 files and print the results.
                print classification_report(sgd.predict(X), y)
        i += 1

The code in the previous listing looks fairly similar to what we did before, apart from the stochastic gradient descent classifier (SGDClassifier). Here, we trained the algorithm iteratively by presenting the observations in one file at a time with the partial_fit() function. Looping through only the first 5 files gives the output shown in table 5.1. The table shows classification diagnostic measures: precision, recall, F1-score, and support.

Table 5.1. Classification problem: can a website be trusted or not?

                 precision    recall    f1-score    support
    -1              0.97       0.99       0.98       14045
     1              0.97       0.94       0.96        5955
    avg/total       0.97       0.97       0.97       20000

Only 3% (1 - 0.97) of the malicious sites aren't detected (precision), and 6% (1 - 0.94) of the sites detected are falsely accused (recall). This is a decent result, so we conclude that the methodology works.

5.5. BUILDING A RECOMMENDER SYSTEM INSIDE A DATABASE

In this example the hash table data structure is used, and Python is used to control other tools.

5.5.1. Tools and techniques

The following tools and techniques are required.

5.5.1.1. Tools

• MySQL database - You need a MySQL database to work with; download it from www.mysql.com.
• MySQL database connection Python library - To connect to this server from Python you need to install SQLAlchemy or another library capable of communicating with MySQL. We use MySQLdb; on Windows, use Conda right off the bat to install it. First install Binstar (another package management service) and look for the appropriate mysql-python package for your Python setup:

    conda install binstar
    binstar search -t conda mysql-python

The following command, entered into the Windows command line, worked (after activating the Python environment):

    conda install --channel https://conda.binstar.org/krisvanneste mysql-python

• You also need the pandas Python library, but that should already be installed by now.

5.5.1.2. Techniques

A simple recommender system will look for customers who've rented similar movies and then suggest those that the others have watched. This technique is called k-nearest neighbors in machine learning.

A customer who behaves similarly isn't necessarily the most similar customer. A technique is used that finds similar customers (local optima) without guaranteeing that it finds the best customer (global optimum). A common technique used to solve this is called Locality-Sensitive Hashing. The idea behind Locality-Sensitive Hashing is simple:
Construct functions that map similar customers close together (they're put in a bucket with the same label) and make sure that objects that are different are put in different buckets. This is also how DNA works: all the information is stored in one long string.

Central to this idea is a function that performs the mapping; this function is called a hash function: a function that maps any range of input to a fixed output. It doesn't matter how many columns go in (scalable input); it brings them back to a single column (fixed output). The simplest hash function concatenates the values from several random columns. Three hash functions have been set up to find similar customers. The three functions each take the values of three movies:
• The first function takes the values of movies 10, 15, and 28.
• The second function takes the values of movies 7, 18, and 22.
• The last function takes the values of movies 16, 19, and 30.

This will ensure that the customers who are in the same bucket share at least several movies. But the customers inside one bucket might still differ on the movies that weren't included in the hashing functions. To solve this we need to compare the customers within the bucket with each other. For this, a new distance measure is created. The distance used to compare customers is called the hamming distance. The hamming distance is used to calculate how much two strings differ: the distance is defined as the number of different characters in a string. Table 5.2 offers an example of the hamming distance.

Table 5.2. Examples of calculating the hamming distance

    String 1    String 2    Hamming distance
    Hat         Cat         1

Table 5.3. Combining the information from the different movie columns into one movies column: the 0/1 values of the individual movie columns are concatenated into a single long string per customer.

This allows you to calculate the hamming distance much more efficiently: by handling the combined column as bits, you can exploit the XOR operator (^). The outcome of the XOR operator is as follows:

    1 ^ 1 = 0
    1 ^ 0 = 1
    0 ^ 1 = 1
    0 ^ 0 = 0
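In Python the same bitwise trick looks like this (the two example bit strings are arbitrary):

    a = int("1111", 2)   # bit string of one customer
    b = int("1100", 2)   # bit string of another customer

    # XOR sets a bit wherever the two strings differ; counting the set bits
    # gives the hamming distance.
    hamming_distance = bin(a ^ b).count("1")
    print(hamming_distance)   # 2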
With this in place, the process to find similar customers becomes very simple. Let's first look at it in pseudo code.

Preprocessing:
• Define p (for instance, 3) functions that select k (for instance, 3) entries from the vector of movies. Here we take 3 functions (p) that each take 3 (k) movies.
• Apply these functions to every point and store them in a separate column. (In literature each function is called a hash function and each column will store a bucket.)

5.5.2. Step 1: The research goal

The goal is to suggest movies to a customer based on the viewing history of similar customers. We skip the data exploration step and move straight into model building.

5.5.3. Step 2: Data preparation

The data the boss has collected is shown in table 5.4. We create this data ourselves for the sake of demonstration.

Table 5.4. Excerpt from the client database and the movies customers rented

    Customer      Movie 1    Movie 2    Movie 3    ...    Movie 32
    Jack Dani        1          1
    Wilhelmson       1          1
    Jane Dane        0          0          1
    Xi Liu           1
    Eros Mazo        1          1

For each customer you get an indication of whether they've rented the movie before (1) or not (0).

First, let's connect Python to MySQL to create our data. Make a connection to MySQL using your username and password. In the following listing we used a database called "test". Replace the user, password, and database name with the appropriate values for your setup and retrieve the connection and the cursor. A database cursor is a control structure that remembers where you are currently in the database.
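A minimal connection sketch, assuming the MySQLdb driver and placeholder credentials for the "test" database (the variable mc is reused by the listings that follow):

    import MySQLdb
    import pandas as pd

    user = 'root'          # replace with your own username
    password = '******'    # replace with your own password
    database = 'test'

    mc = MySQLdb.connect('localhost', user, password, database)
    cursor = mc.cursor()   # the cursor remembers where you are in the database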
Next, simulate a database with generated customers and create a few observations:

    nr_customers = 100
    colnames = ["movie%d" % i for i in range(1, 33)]
    pd.np.random.seed(2015)
    generated_customers = pd.np.random.randint(0, 2, 32 * nr_customers).reshape(nr_customers, 32)

    # Store the data inside a Pandas data frame and write the data frame to a MySQL
    # table called "cust". If this table already exists, replace it.
    data = pd.DataFrame(generated_customers, columns=list(colnames))
    data.to_sql('cust', mc, flavor='mysql', index=True,
                index_label='cust_id', if_exists='replace')

This creates 100 customers and randomly assigns whether they did or didn't see a certain movie, for 32 movies in total. The data is first created in a Pandas data frame and then turned into SQL code. Note: you might run across a warning when running this code.

To efficiently query the database, additional data preparation is needed. That includes the following things:
• Creating bit strings. The bit strings are compressed versions of the columns' content (the 0 and 1 values). First these binary values are concatenated; then the resulting bit string is reinterpreted as a number. This might sound abstract now but will become clearer in the code.
• Defining hash functions. The hash functions will in fact create the bit strings.
• Adding an index to the table, to quicken data retrieval.

Creating Bit Strings

An intermediate table suited for querying is prepared: apply the hash functions and represent each sequence of bits as a decimal number. Finally, place them in a table. First, create the bit strings: convert a string such as "11111111" to a binary or a numeric value to make the hamming function work. A numeric representation, as shown in the next listing, has been opted for. We represent the string as a numeric value. The string is a concatenation of zeros (0) and ones (1) because these indicate whether someone has seen a certain movie or not. The strings are then regarded as bit code: for example, 0011 is the same as the number 3. What createNum() does is take in 8 values, concatenate these 8 column values into a string, and then turn the byte code of the string into a number.

Listing 5.9. Creating bit strings

    def createNum(x1, x2, x3, x4, x5, x6, x7, x8):
        # Concatenate the 8 column values into a bit string and turn it into a number.
        return [int('%d%d%d%d%d%d%d%d' % (i1, i2, i3, i4, i5, i6, i7, i8), 2)
                for (i1, i2, i3, i4, i5, i6, i7, i8) in zip(x1, x2, x3, x4, x5, x6, x7, x8)]

    # Test if the function works correctly. Binary code 1111 is the same as 15
    # (= 1*8 + 1*4 + 1*2 + 1*1). If an assert fails it raises an error;
    # otherwise nothing happens.
    assert int('1111', 2) == 15
    assert int('1100', 2) == 12
    assert createNum([1,1], [1,1], [1,1], [1,1], [1,1], [1,1], [1,0], [1,0]) == [255, 252]

    # Translate the movie columns to bit strings in numeric form. Each bit string
    # represents 8 movies: 4 * 8 = 32 movies. (You could use a single 32-bit string
    # instead of 4 * 8 to keep the code short.)
    store = pd.DataFrame()
    store['bit1'] = createNum(data.movie1, data.movie2, data.movie3, data.movie4,
                              data.movie5, data.movie6, data.movie7, data.movie8)
    store['bit2'] = createNum(data.movie9, data.movie10, data.movie11, data.movie12,
                              data.movie13, data.movie14, data.movie15, data.movie16)
    store['bit3'] = createNum(data.movie17, data.movie18, data.movie19, data.movie20,
                              data.movie21, data.movie22, data.movie23, data.movie24)
    store['bit4'] = createNum(data.movie25, data.movie26, data.movie27, data.movie28,
                              data.movie29, data.movie30, data.movie31, data.movie32)

By converting the information of 32 columns into 4 numbers, we get the compact format shown in figure 5.10 for the first 2 observations (customer movie view history):

    store[0:2]

Fig. 5.10. First 2 customers' information on all 32 movies after bit-string-to-numeric conversion

        bit1   bit2   bit3   bit4
    0     10     62     42    182
    1     23     28    223    180

The next step is to create the hash functions, which sample the data to determine whether two customers have similar behavior.

Creating a Hash Function

The hash functions take the values of the movies for a customer. As decided in the theory part of this case study, we create 3 hash functions: the first function combines movies 10, 15, and 28; the second combines movies 7, 18, and 22; and the third one combines movies 16, 19, and 30. The following code listing shows how this is done.

Listing 5.10. Creating hash functions

    def hash_fn(x1, x2, x3):
        # Exactly like createNum(), but without the final conversion to a number
        # and for 3 columns instead of 8.
        return [b'%d%d%d' % (i, j, k) for (i, j, k) in zip(x1, x2, x3)]

    # Test if it works correctly (if no error is raised, it works). It samples
    # columns, but all observations are selected.
    assert hash_fn([1, 0], [1, 1], [0, 0]) == [b'110', b'010']

    # Create hash values from the customer movies (10, 15, 28), (7, 18, 22),
    # and (16, 19, 30).
    store['bucket1'] = hash_fn(data.movie10, data.movie15, data.movie28)
    store['bucket2'] = hash_fn(data.movie7, data.movie18, data.movie22)
    store['bucket3'] = hash_fn(data.movie16, data.movie19, data.movie30)

    # Store this information in the database.
    store.to_sql('movie_comparison', mc, flavor='mysql', index=True,
                 index_label='cust_id', if_exists='replace')

The hash function concatenates the values from the different movies into a binary value, like what happened before in the createNum() function, only this time it is not converted to a number and only 3 movies are taken as input instead of 8. The assert shows how it concatenates the 3 values for every observation. When a client has bought movie 10 but not movies 15 and 28, it returns b'100' for bucket 1. When the client bought movies 7 and 18, but not 22, it returns b'110' for bucket 2. If we look at the current result, we see the 4 variables created earlier (bit1, bit2, bit3, bit4) together with the buckets created from the 9 handpicked movies (figure 5.11).

Fig. 5.11. Information from the bit string compression and the 9 sample movies

        bit1   bit2   bit3   bit4   bucket1   bucket2   bucket3
    0     10     62     42    182       011       100       011
    1     23     28    223    180       001       111       001
Adding an Index to the Table

You must add indices to speed up data retrieval, as needed in a real-time system. This is shown in the following listing.

Listing 5.11. Creating an index

    # Create a function to easily create indices; indices will quicken retrieval.
    def createIndex(column, cursor):
        sql = 'CREATE INDEX %s ON movie_comparison (%s);' % (column, column)
        cursor.execute(sql)

    # Put an index on each bit bucket.
    createIndex('bucket1', cursor)
    createIndex('bucket2', cursor)
    createIndex('bucket3', cursor)

With the data indexed, we now move on to the "model building" part.

5.5.4. Step 3: Model building

To use the hamming distance in the database, we need to define it as a function.

Creating the Hamming Distance Function

A user-defined function is implemented. This function can calculate the distance for a 32-bit integer (actually 4 * 8 bits), as shown in the following listing.

Listing 5.12. Creating the hamming distance

    # Define the function. It takes 8 input arguments: 4 bit strings of length 8 for
    # the first customer and another 4 bit strings of length 8 for the second
    # customer. This way we can compare 2 customers side by side for 32 movies.
    Sql = '''
    CREATE FUNCTION HAMMINGDISTANCE(
      A0 BIGINT, A1 BIGINT, A2 BIGINT, A3 BIGINT,
      B0 BIGINT, B1 BIGINT, B2 BIGINT, B3 BIGINT
    )
    RETURNS INT DETERMINISTIC
    RETURN
      BIT_COUNT(A0 ^ B0) +
      BIT_COUNT(A1 ^ B1) +
      BIT_COUNT(A2 ^ B2) +
      BIT_COUNT(A3 ^ B3);
    '''
    # The function is stored in the database. You can only do this once; running
    # this code a second time results in an OperationalError:
    # (1304, 'FUNCTION HAMMINGDISTANCE already exists').
    cursor.execute(Sql)

    # To check the function, run an SQL statement with 8 fixed strings. Notice the
    # b before each string, indicating that you're passing bit values.
    Sql = '''SELECT HAMMINGDISTANCE(
      b'11111111', b'00000000', b'11011111', b'11111111',
      b'11111111', b'10001001', b'11011111', b'11111111')'''
    pd.read_sql(Sql, mc)

If all is well, the output of this code should be 3, which indicates that the two series of strings differ in only 3 places.

5.5.5. Step 4: Presentation and automation

The application needs to perform two steps when confronted with a given customer:
• Look for similar customers.
• Suggest movies the customer has yet to see, based on what he or she has already viewed and the viewing history of the similar customers.
First things first: select ourselves a lucky customer.

Finding a Similar Customer

Time to perform real-time queries. In the following listing, customer 27 is the lucky one who'll get his next movies selected for him. But first we select customers with a similar viewing history. We do two-step sampling. In the first sampling, the bucket must be exactly the same as the one of the selected customer (it is based on 9 movies): the selected people must have seen (or not seen) these 9 movies exactly like our customer did. The second sampling is a ranking based on the 4 bit strings, which take into account all the movies in the database.

Listing 5.13. Finding similar customers

    # Pick a customer from the database.
    customer_id = 27
    sql = "select * from movie_comparison where cust_id = %s" % customer_id
    cust_data = pd.read_sql(sql, mc)

    # Show the 3 customers that most resemble customer 27 (customer 27 itself
    # ends up first).
    sql = """select cust_id, hammingdistance(bit1, bit2, bit3, bit4,
        %s, %s, %s, %s) as distance
        from movie_comparison
        where bucket1 = '%s' or bucket2 = '%s' or bucket3 = '%s'
        order by distance limit 3""" % (cust_data.bit1[0], cust_data.bit2[0],
        cust_data.bit3[0], cust_data.bit4[0], cust_data.bucket1[0],
        cust_data.bucket2[0], cust_data.bucket3[0])
    shortlist = pd.read_sql(sql, mc)

Table 5.5 shows customers 2 and 97 to be the most similar to customer 27. As the data was generated randomly, anyone replicating this example might receive different results. Finally, a movie for customer 27 to watch is selected.
Table 5.5. The most similar customers to customer 27

    cust_id    distance
    27         0
    2
    97         9

Finding a New Movie

We need to look at movies customer 27 hasn't seen yet, but the nearest customer has, as shown in the following listing. This is also a good check to see whether the distance function worked correctly. Although this may not be the closest customer, it's a good match for customer 27. By using the hashed indexes, enormous speed is gained when querying large databases.

Listing 5.14. Finding an unseen movie

    # Select the movies customers 27, 2, and 97 have seen.
    cust = pd.read_sql('select * from cust where cust_id in (27, 2, 97)', mc)
    # Transpose for convenience.
    dif = cust.T
    # Select the movies customer 27 didn't see yet.
    dif[dif[0] != dif[1]]

Table 5.6 shows that you can recommend movie 12, 15, or 31 based on customer 2's behaviour.

Table 5.6. Movies from customer 2 can be used as suggestions for customer 27

    cust_id    2    27    97
    Movie3     1     1
    Movie9     1     1
    Movie11    0     1     1
    Movie12
    Movie15    1
    Movie16    1
    Movie25    0     1

Mission accomplished. A happy movie addict can now indulge himself with a new movie, tailored to his preferences.

TWO MARKS QUESTIONS AND ANSWERS

1. What are the general problems you face while handling large data?
• Managing massive amounts of data. It's in the name: big data is big.
• Integrating data from multiple sources.
• Ensuring data quality.
• Keeping data secure.
• Selecting the right big data tools.
• Scaling systems and costs efficiently.
• Lack of skilled data professionals.
• Organizational resistance.

2. What are the problems encountered when working with more data than can fit in memory?
• Not enough memory.
• Processes that never end.
• Some components form a bottleneck while others remain idle.
• Not enough speed.

3. Mention the solutions for handling large data sets.
• Choose the right algorithms.
• Choose the right data structures.
• Choose the right tools.

4. What are the algorithms involved in handling large data sets?
• Online algorithms
• Block matrices
• MapReduce

5. Define perceptron.
A perceptron is one of the least complex machine learning algorithms, used for binary classification (0 or 1): for instance, deciding whether a customer will buy or not.