Big Data Handling Techniques

A large volume of data poses new challenges, such as overloaded memory and algorithms that never stop running. It forces you to adapt and expand your repertoire of techniques. Figure 5.1 shows a mind map that will gradually unfold as we go through the steps: problems, solutions, and tips.
5.1. PROBLEMS
A computer only has a limited amount of RAM. When you try to squeeze more data into this memory than actually fits, the OS will start swapping out memory blocks to disk, which is far less efficient than having it all in memory. But only a few algorithms are designed to handle large data sets; most of them load the whole data set into memory at once, which causes the out-of-memory error. Other algorithms need to hold multiple copies of the data in memory or store intermediate results. All of these aggravate the problem.

Another limited resource is time. Certain algorithms don't take time into account. Other algorithms can't end in a reasonable amount of time even when they need to process only a few megabytes of data.
A third problem when dealing with large data sets is that components of the computer can start to form a bottleneck while leaving other systems idle. Certain programs don't feed data fast enough to the processor because they have to read data from the hard drive, which is one of the slowest components on a computer.

Fig. 5.1. Overview of problems encountered when working with more data than can fit in memory: not enough memory, processes that never end, some components forming a bottleneck while others remain idle, and not enough speed.

5.2. GENERAL TECHNIQUES FOR HANDLING LARGE VOLUMES OF DATA

Never-ending algorithms, out-of-memory errors, and speed issues are the most common challenges you face when working with large data. The solutions can be divided into three categories: using the correct algorithms, choosing the right data structures, and using the right tools (figure 5.2).

Fig. 5.2. Overview of solutions for handling large data sets: choose the right algorithms, choose the right data structures, choose the right tools, and apply the general tips.

5.2.1. Choosing the right algorithm

Choosing the right algorithm can solve more problems than adding more or better hardware. An algorithm that's well suited for handling large data doesn't need to load the entire data set into memory to make predictions. Ideally, the algorithm also supports parallelized calculations. There are three types of algorithms that can do that: online algorithms, block algorithms, and MapReduce algorithms, as shown in figure 5.3.

Fig. 5.3. Overview of techniques to adapt algorithms to large data sets: online algorithms, block matrices, and MapReduce.

(i) Online Learning Algorithms

Several, but not all, machine learning algorithms can be trained using one observation at a time instead of taking all the data into memory. Upon the arrival of a new data point, the model is trained and the observation can be forgotten; its effect is now incorporated into the model's parameters. For example, a model used to predict the weather can use different parameters (like atmospheric pressure or temperature) in different regions. When the data from one region is loaded into the algorithm, it forgets about this raw data and moves on to the next region. This "use and forget" way of working is the perfect solution for the memory problem, as a single observation is unlikely to ever be big enough to fill up all the memory of a modern-day computer.

Listing 5.1 shows how to apply this principle to a perceptron with online learning. A perceptron is one of the least complex machine learning algorithms, used for binary classification (0 or 1): for instance, deciding whether a customer will buy or not.
Listing 5.1. Training a perceptron by observation (if we reach the maximum number of allowed runs, or epochs, we stop looking for a solution)
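A minimal sketch of such an online perceptron (the class name, learning rate, and toy observations below are illustrative and not taken from the original listing):

    import numpy as np

    class OnlinePerceptron:
        """A tiny perceptron for binary (0/1) classification, trained one observation at a time."""

        def __init__(self, n_features, learning_rate=0.1):
            self.w = np.zeros(n_features)   # one weight per predictor variable
            self.b = 0.0                    # bias term
            self.learning_rate = learning_rate

        def predict(self, x):
            return 1 if np.dot(self.w, x) + self.b > 0 else 0

        def partial_fit(self, x, y):
            # Learn from a single observation and then forget it: its effect is
            # now incorporated into the weights and the bias.
            error = y - self.predict(x)
            self.w += self.learning_rate * error * np.asarray(x, dtype=float)
            self.b += self.learning_rate * error

    # Feed observations one by one, for example while streaming them from disk.
    model = OnlinePerceptron(n_features=2)
    for x, y in [([1.0, 0.0], 1), ([0.0, 1.0], 0), ([1.0, 1.0], 1)]:
        model.partial_fit(x, y)
    print(model.predict([1.0, 0.0]))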
Online algorithms are closely related to mini-batch learning, in which the algorithm is fed small batches of observations so that it handles only a few at a time.

(ii) Dividing a Large Matrix into Many Small Ones

By cutting a large data table into small matrices we can, for instance, still do a linear regression. The logic behind this matrix splitting and how a linear regression can be calculated with matrices can be found in the sidebar. It suffices to know for now that the Python libraries we are about to use will take care of the matrix splitting, and that the linear regression variable weights can be calculated using matrix calculus: the coefficients are given by (XᵀX)⁻¹(Xᵀy). The Python tools to accomplish the task are the following:
• bcolz is a Python library that can store data arrays compactly and uses the hard drive when the array no longer fits into main memory.
• Dask is a library that enables you to optimize the flow of calculations and makes performing calculations in parallel easier. It doesn't come packaged with the default Anaconda setup, so make sure to run conda install dask in your virtual environment before running the code below. Some errors have been reported when importing Dask on 64-bit Python. Dask depends on a few other libraries (such as toolz), but the dependencies should be taken care of automatically by pip or conda.
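Before the listing below can run, the predictors (ar) and the target (y) need to exist as bcolz carrays; a minimal setup sketch, assuming bcolz is installed (the array size, file locations, and toy target are chosen only for illustration):

    import numpy as np
    import bcolz as bc

    n = 1000                                   # number of observations (illustrative)
    # Predictors: an n x 5 array stored compactly on disk by bcolz.
    ar = bc.carray(np.random.normal(size=(n, 5)), rootdir='ar.bcolz', mode='w')
    ar.flush()
    # Target: a linear combination of the predictors plus noise, also stored with bcolz.
    y_np = ar[:].dot(np.array([1.0, 2.0, 3.0, 4.0, 5.0])) + np.random.normal(size=n)
    y = bc.carray(y_np.reshape(n, 1), rootdir='y.bcolz', mode='w')
    y.flush()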
The following listing demonstrates block matrix calculations with these libraries.

    import dask.array as da
    import numpy as np

    # Block matrices are created for the predictors (ar) and the target (y). A block
    # matrix is a matrix cut into pieces (blocks). da.from_array reads the data from
    # disk or RAM (wherever it currently resides); chunks=(5,5) means every block is
    # a 5x5 matrix (unless fewer than 5 observations or variables are left).
    dax = da.from_array(ar, chunks=(5, 5))
    dy = da.from_array(y, chunks=(5, 5))

    # XTX is defined (lazily) as the X matrix multiplied with its transposed version,
    # and Xy as the y vector multiplied with the transposed X matrix. Both are
    # building blocks of the matrix formula for linear regression; neither is
    # calculated yet.
    XTX = dax.T.dot(dax)
    Xy = dax.T.dot(dy)

    # The coefficients are calculated using the matrix linear regression formula:
    # np.linalg.inv() is the matrix inversion, and .dot() multiplies one matrix with
    # another. There is no need for a block matrix inversion because XTX is a square
    # matrix of size (nr of predictors x nr of predictors), which is fortunate
    # because Dask doesn't yet support block matrix inversion.
    coefficients = np.linalg.inv(XTX.compute()).dot(Xy.compute())

    # The coefficients are put back into a block matrix: we got a numpy array from
    # the last step, so we need to explicitly convert it back to a "da" array.
    coef = da.from_array(coefficients, chunks=(5, 5))

    # Flush the memory data; it's no longer needed to keep the large matrices in memory.
    ar.flush()
    y.flush()

    # Score the model (make predictions).
    predictions = dax.dot(coef).compute()
    print predictions
(iii) MapReduce

MapReduce algorithms are easy to understand with an analogy: imagine you have to count all the votes for the national elections. The country has 25 parties, 1,500 voting offices, and 2 million people. You can choose to gather all the voting tickets from every office individually and count them centrally, or you can ask the local offices to count the votes for the 25 parties and hand over the results, which you can then aggregate by party. MapReducers follow a process similar to the second way of working. They first map values to a key and then do an aggregation on that key during the reduce phase. Take a look at the following listing's pseudo code to get a better feeling for this.

Listing 5.4. MapReduce pseudo code example

    For each person in voting office:
        Yield (voted_party, 1)
    For each vote in voting office:
        add_vote_to_party()

One of the advantages of MapReduce algorithms is that they're easy to parallelize and distribute. This explains their success in distributed environments such as Hadoop, but they can also be used on individual computers.
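The same vote-counting idea can be sketched in plain Python, mapping every vote to a (party, 1) pair and then aggregating by key (the party names below are made up for illustration):

    from collections import defaultdict

    votes = ["party_a", "party_b", "party_a", "party_c", "party_a"]  # illustrative input

    # Map phase: emit a (key, value) pair for every vote.
    mapped = [(party, 1) for party in votes]

    # Reduce phase: aggregate the values per key.
    totals = defaultdict(int)
    for party, count in mapped:
        totals[party] += count

    print(dict(totals))   # {'party_a': 3, 'party_b': 1, 'party_c': 1}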
5.2.2. Choosing the right data structure

Algorithms can make or break your program, but the way you store your data is of equal importance. Data structures have different storage requirements, but they also influence the performance of CRUD (create, read, update, and delete) and other operations on the data set. Figure 5.4 shows that you have many different data structures to choose from, three of which will be discussed here: sparse data, tree data, and hash data.

Fig. 5.4. Overview of data structures often applied to data science when working with large data: sparse data, trees, and hashes.

(i) Sparse Data

A sparse data set contains relatively little information compared to its number of entries (observations). Look at figure 5.5: almost everything is "0", with just a single "1" present in the second observation on variable 9.

Fig. 5.5. Example of a sparse matrix: almost everything is 0; other values are the exception in a sparse matrix.

Data like this might look ridiculous, but it is often what you get when converting textual data to binary data. Imagine a set of 100,000 completely unrelated Twitter tweets. Most of them probably have fewer than 30 words, but together they might contain hundreds or thousands of distinct words. Encoding each word as a binary variable, with "1" representing "present in this tweet" and "0" meaning "not present in this tweet", results in sparse data indeed. The resulting large matrix can cause memory problems even though it contains little information.

Luckily, data like this can be stored compacted. In the case of figure 5.5 it could look like this:

    data = [(2, 9, 1)]

Row 2, column 9 holds the value 1.

Support for working with sparse matrices is growing in Python. Many algorithms now support or return sparse matrices.
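SciPy, for instance, ships sparse matrix types; a small sketch that stores the matrix of figure 5.5 (4 observations, 16 variables, a single non-zero entry) in coordinate format:

    from scipy.sparse import coo_matrix

    # One non-zero entry: the second observation (row index 1) on variable 9 (column index 8).
    rows = [1]
    cols = [8]
    values = [1]

    sparse = coo_matrix((values, (rows, cols)), shape=(4, 16))
    print(sparse)            # only the stored (row, column, value) triplet is kept in memory
    print(sparse.toarray())  # expands back to the full, mostly-zero matrix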
(ii) Tree Structures

Trees are a class of data structure that allows you to retrieve information much faster than scanning through a table. A tree always has a root value and subtrees of children, each with its own children, and so on. Simple examples would be a family tree or a biological tree and the way it splits into branches, twigs, and leaves. Simple decision rules make it easy to find the child tree in which the data resides. Figure 5.6 shows how a tree structure enables you to get to the relevant information quickly.

Fig. 5.6. Example of a tree data structure: decision rules such as age categories (age < 12, 12 ≤ age < 78, age ≥ 78) can be used to quickly locate a person in a family tree; the leaf level holds the individual records.

In figure 5.6 you start the search at the top and first choose an age category, because apparently that's the factor that cuts away the most alternatives. This goes on and on until you reach the expected outcome. (The Akinator, a djinn in a magical lamp that tries to guess a person in your mind by asking a few questions about him or her, works the same way.)

Trees are also popular in databases. Databases prefer not to scan the table from the first line until the last, but to use a device called an index to avoid this. Indices are often based on data structures such as trees and hash tables to find observations faster. The use of an index speeds up the process of finding data enormously.

(iii) Hash Tables

Hash tables are data structures that calculate a key for every value in the data and put the keys in a bucket. This way you can quickly retrieve the information by looking in the right bucket when you encounter the data. Dictionaries in Python are a hash table implementation, and they're a close relative of key-value stores. Hash tables are used extensively in databases as indices for fast information retrieval.
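A Python dictionary already gives you this bucketed lookup; a tiny sketch (the keys and values are invented for illustration):

    # A dict is a hash table: the key is hashed, and the value is stored in the
    # bucket that matches the hash, so a lookup doesn't scan the whole collection.
    ratings = {"cust_001": 4, "cust_002": 5}

    print(hash("cust_001"))      # the hash value decides which bucket is used
    print(ratings["cust_001"])   # retrieval goes straight to that bucket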
5.2.3. Selecting the right tools

With the right class of algorithms and data structures in place, it's time to choose the right tool for the job. The right tool can be a Python library, or at least a tool that's controlled from Python, as shown in figure 5.7.

Fig. 5.7. Overview of tools that can be used when working with large data: Python tools (Numexpr, Bcolz, Numba, Blaze, Theano, Cython) and using Python as a master to control other tools.

5.2.3.1. Python Tools

Python has a number of libraries that can help you deal with large data. They range from smarter data structures over code optimizers to just-in-time compilers. The following is a list of libraries we like to use when confronted with large data:
• Cython - Cython, a superset of Python, tackles the speed problem by forcing the programmer to specify the data type while developing the program. Once the compiler has this information, it runs programs much faster.
• Numexpr - Numexpr is at the core of many of the big data packages, as NumPy is for in-memory packages. Numexpr is a numerical expression evaluator for NumPy, but it can be many times faster than the original NumPy (a short sketch follows this list).
• Numba - Numba helps you achieve greater speed by compiling your code right before you execute it, also known as just-in-time compiling.
• Bcolz - Bcolz helps you overcome the out-of-memory problem that can occur when using NumPy. It can store and work with arrays in an optimal compressed form. It not only slims down the data needs but also uses Numexpr in the background to reduce the calculations needed when performing calculations with bcolz arrays.
• Blaze - Blaze is the "Pythonic way" of working with data. Blaze will translate your Python code into SQL but can handle many more data stores than relational databases, such as CSV files, Spark, and others. Blaze delivers a unified way of working with many databases and data libraries.
• Theano - Theano enables you to work directly with the graphical processing unit (GPU) and do symbolic simplifications whenever possible, and it comes with an excellent just-in-time compiler.
• Dask - Dask enables you to optimize your flow of calculations and execute them efficiently. It also enables you to distribute calculations.
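The Numexpr sketch promised above (the array sizes are arbitrary):

    import numpy as np
    import numexpr as ne

    a = np.random.rand(1000000)
    b = np.random.rand(1000000)

    # The whole expression is compiled and evaluated in chunks, avoiding the large
    # temporary arrays that plain NumPy would allocate for a*b and a*b + 3.
    result = ne.evaluate("a * b + 3")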
5.3. GENERAL PROGRAMMING TIPS FOR DEALING WITH LARGE DATA SETS

The tricks that work in a general programming context still apply for data science. Several might be worded slightly differently, but the principles are essentially the same for all programmers. The general tricks are divided into three parts, as shown in the figure 5.8 mind map:
• Don't reinvent the wheel. Use tools and libraries developed by others.
• Get the most out of your hardware. Your machine is never used to its full potential; with simple adaptions you can make it work harder.
• Reduce the computing need. Slim down your memory and processing needs as much as possible.

Fig. 5.8. Overview of general programming best practices when working with large data: don't reinvent the wheel, get the most out of your hardware, reduce the computing needs.

5.3.1. Don't reinvent the wheel

"Don't repeat anyone" is probably even better than "don't repeat yourself." Add value with your actions: make sure that they matter. Solving a problem that has already been solved is a waste of time. As a data scientist, you have two large rules that can help you deal with large data and make you much more productive, to boot:
• Exploit the power of databases. The first reaction most data scientists have when working with large data sets is to prepare their analytical base tables inside a database. This method works well when the features you want to prepare are fairly simple.
• Use optimized libraries. Creating libraries like Mahout, Weka, and other machine learning algorithms requires time and knowledge. They are highly optimized and incorporate best practices and state-of-the-art technologies.

5.3.2. Get the most out of your hardware

Resources on a computer can be idle while other resources are over-utilized. This slows down programs and can even make them fail. Sometimes it's possible (and necessary) to shift the workload from an overtaxed resource to an underutilized resource using the following techniques:
• Feed the CPU compressed data. A simple trick to avoid CPU starvation is to feed the CPU compressed data instead of the inflated (raw) data. This will shift more work from the hard disk to the CPU, which is exactly what you want to do, because a hard disk can't keep up with the CPU in most modern computer architectures.
• Make use of the GPU. The GPU is enormously efficient in parallelizable jobs but has less cache than the CPU.
• Use multiple threads. It's still possible to parallelize computations on your CPU. You can achieve this with normal Python threads.
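A minimal sketch of the multithreading tip using the standard threading module (the chunks and the worker function are placeholders; for pure-Python CPU-bound work the multiprocessing module is often the better fit):

    import threading

    def process_chunk(chunk, results, index):
        # Placeholder worker: replace the sum with the real per-chunk computation.
        results[index] = sum(chunk)

    chunks = [range(0, 1000), range(1000, 2000), range(2000, 3000)]
    results = [None] * len(chunks)
    threads = [threading.Thread(target=process_chunk, args=(chunk, results, i))
               for i, chunk in enumerate(chunks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()            # wait until every chunk has been processed
    print(sum(results))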
5.3.3. Reduce your computing needs

The best way to avoid having large data problems is by removing as much of the work as possible up front and letting the computer work only on the part that can't be skipped. The following list contains methods that help achieve this:
• Profile your code and remediate slow pieces of code. Not every piece of code needs to be optimized; use a profiler to detect slow parts inside your program and remediate those parts.
• Use compiled code whenever possible, certainly when loops are involved. Whenever possible use functions from packages that are optimized for numerical computations instead of implementing everything yourself. The code in these packages is often highly optimized and compiled.
• Otherwise, compile the code yourself. When an existing package can't be used, use either a just-in-time compiler or implement the slowest parts of your code in a lower-level language such as C or Fortran and integrate this with your codebase.
• Avoid pulling data into memory. When working with data that doesn't fit in memory, avoid pulling everything into memory. A simple way of doing this is by reading data in chunks and parsing the data on the fly.
• Use generators to avoid intermediate data storage. Generators help you return data per observation instead of in batches. This way you avoid storing intermediate results (see the sketch after this list).
• Use as little data as possible. If no large-scale algorithm is available and you aren't willing to implement such a technique yourself, you can still train your model on only a sample of the original data.
• Use your math skills to simplify calculations as much as possible. Take the following equation, for example: (a + b)² = a² + 2ab + b². The left side will be computed much faster than the right side of the equation; even for this trivial example, it could make a difference when talking about big chunks of data.
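A minimal sketch of the chunked-reading and generator tips mentioned above (the file name and the parsing step are placeholders):

    def observations(path):
        # Read the file line by line and yield one parsed observation at a time,
        # so the full data set never has to sit in memory.
        with open(path) as f:
            for line in f:
                yield line.strip().split(",")   # placeholder parsing step

    total = 0
    for row in observations("large_file.csv"):  # placeholder file name
        total += 1                               # process each observation on the fly
    print(total)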
5.4. CASE STUDY 1: PREDICTING MALICIOUS URLS

The internet is probably one of the greatest inventions of modern times. It has boosted humanity's development, but not everyone uses this great invention with honorable intentions. Many companies (Google, for one) try to protect users from fraud by detecting malicious websites. Doing so is no easy task, because the internet has billions of web pages to scan. In this case study it is shown how to work with a data set that no longer fits in memory.
• Data - The data in this case study was made available as part of a research project. The project contains data from 120 days, and each observation has approximately 3,200,000 features. The target variable contains 1 if it's a malicious website and -1 otherwise.
• The Scikit-learn library - This library should be installed in the Python environment.

5.4.1. Step 1: Defining the research goal

The goal of our project is to detect whether certain URLs can be trusted or not. Because the data is so large, we aim to do this in a memory-friendly way.

5.4.2. Step 2: Acquiring the URL data

Start by downloading the data from http://sysnet.ucsd.edu/projects/url/#datasets and placing it in a folder. Choose the data in SVMLight format. SVMLight is a text-based format with one observation per row; to save space, it leaves out the zeros.
The following listing and figure 5.9 show what happens when you try to read in one file out of the 120 and create the normal matrix, as most algorithms expect. The todense() method changes the data from a special file format to a normal matrix where every entry contains a value.

    MemoryError                    Traceback (most recent call last)
    <ipython-input-532-d196c05088ce> in <module>()
          5 print "there are %d files" % len(files)
          6 X,y = load_svmlight_file(files[6], n_features=3500000)
    ----> 7 X.todense()

Fig. 5.9. A memory error occurs while loading a single file

Tools and Techniques

A memory error occurs while loading a single file. Luckily, there are a few tricks to deal with this; we try these techniques over the course of the case study:
• Use a sparse representation of the data.
• Feed the algorithm compressed data instead of raw data.
• Use an online algorithm to make predictions.

Data that contains little information compared to zeros is called sparse data. This can be saved more compactly if you store the data as [(0,0,1),(4,4,1)] instead of [[1,0,0,0,0], [0,0,0,0,0], [0,0,0,0,0], [0,0,0,0,0], [0,0,0,0,1]]. One of the file formats that implements this is SVMLight, and that is why the data was downloaded in this format.

To get this information there is a need to keep the data compressed while checking for the maximum number of observations and variables. We also need to read in the data file by file. This way you consume even less memory. A second trick is to feed the CPU compressed files: in our example, the data is already packed in the tar.gz format, so a file is unpacked only when it is needed, without writing it to the hard disk (the slowest part of your computer).

Listing 5.6. Checking the data size

    import tarfile
    from sklearn.datasets import load_svmlight_file
    import numpy as np

    # The uri variable holds the location in which you saved the files.
    uri = '<path to the downloaded tar.gz archive>'
    tar = tarfile.open(uri, "r:gz")

    # We don't know how many observations and features we have, so initialize both at 0.
    max_obs = 0
    max_vars = 0
    i = 0
    split = 5

    for tarinfo in tar:                  # unpack the files one by one to reduce the memory needed
        if i > split:                    # stop when we reach 5 files
            break
        if tarinfo.isfile():
            f = tar.extractfile(tarinfo.name)
            X, y = load_svmlight_file(f)     # helper function to load a single SVMLight file
            # Adjust the maximum number of observations and variables when necessary (big file).
            max_obs = np.maximum(max_obs, X.shape[0])
            max_vars = np.maximum(max_vars, X.shape[1])
            # Fraction of non-zero entries in this file.
            print "number of non-zero entries %f" % (X.nnz / (X.shape[0] * float(X.shape[1])))
        i += 1

    print "max X %s, max y dimension %s" % (max_obs, max_vars)   # print the results

Part of the code needs some extra explanation. In this code we loop through the SVM files inside the tar archive. We unpack the files one by one to reduce the memory needed. As these files are in the SVMLight format, a helper function, load_svmlight_file(), is used to load a specific file. Then we can see how many observations and variables a file has by checking the shape of the resulting data set. The output includes the fraction of non-zero entries, around 0.000033: sparse data indeed.

5.4.4. Step 4: Model building

Now that we're aware of the dimensions of our data, we can apply the same two tricks (a sparse representation of compressed data) and add the third (using an online algorithm), in the following listing.
Listing 5.7. Creating a model to distinguish the malicious from the normal URLs

    from sklearn.linear_model import SGDClassifier
    from sklearn.metrics import classification_report
    from sklearn.datasets import load_svmlight_file
    import tarfile

    # The target variable can be 1 or -1 (malicious website or not, as described above).
    classes = [-1, 1]
    # Set up the stochastic gradient descent classifier.
    sgd = SGDClassifier(loss="log")
    # We know the number of features from the data exploration.
    n_features = 3231952
    split = 5
    i = 0                                  # initialize the file counter at 0

    # All files together take up around 2.05 Gb. The trick is to leave the data
    # compressed and unpack the files one by one to reduce the memory needed.
    tar = tarfile.open(uri, "r:gz")
    for tarinfo in tar:
        if i > split:                      # stop at the 5th file (instead of all of
            break                          # them, for demonstration purposes)
        if tarinfo.isfile():
            f = tar.extractfile(tarinfo.name)
            # Use the helper function load_svmlight_file() to load a specific file.
            X, y = load_svmlight_file(f, n_features=n_features)
            if i < split:
                # Third important trick: the online algorithm. It can be fed data
                # points file by file (batches).
                sgd.partial_fit(X, y, classes=classes)
            if i == split:
                # Stop when we reach 5 files and print the results.
                print classification_report(sgd.predict(X), y)
        i += 1

The code in the previous listing looks fairly similar to what we did before, apart from the stochastic gradient descent classifier (SGDClassifier). Here, we trained the algorithm iteratively by presenting the observations in one file at a time with the partial_fit() function. Looping through only the first 5 files gives the output shown in table 5.1. The table shows classification diagnostic measures: precision, recall, F1-score, and support.

Table 5.1. Classification problem: can a website be trusted or not?

                 precision    recall    f1-score    support
    -1              0.97       0.99       0.98       14045
     1              0.97       0.94       0.96        5955
    avg/total       0.97       0.97       0.97       20000

Only 3% (1 - 0.97) of the malicious sites aren't detected (precision), and 6% (1 - 0.94) of the sites detected are falsely accused (recall). This is a decent result, so we conclude that the methodology works.

5.5. BUILDING A RECOMMENDER SYSTEM INSIDE A DATABASE

In this example the hash table data structure is used, and Python is used to control other tools.

5.5.1. Tools and techniques

The following tools and techniques are required.

5.5.1.1. Tools

• MySQL database - You need a MySQL database to work with; download it from www.mysql.com.
• MySQL database connection Python library - To connect to this server from Python you need to install SQLAlchemy or another library capable of communicating with MySQL. We use MySQLdb; on Windows, use Conda right off the bat to install it. First install Binstar (another package management service) and look for the appropriate mysql-python package for your Python setup:

    conda install binstar
    binstar search -t conda mysql-python

The following command, entered into the Windows command line, worked (after activating the Python environment):

    conda install --channel https://conda.binstar.org/krisvanneste mysql-python

• You also need the pandas Python library, but that should already be installed by now.

5.5.1.2. Techniques

A simple recommender system will look for customers who've rented similar movies and then suggest those that the others have watched. This technique is called k-nearest neighbors in machine learning.

A customer who behaves similarly isn't necessarily the most similar customer. A technique is used that finds similar customers (local optima) without guaranteeing that it finds the best customer (global optimum). A common technique used to solve this is called Locality-Sensitive Hashing. The idea behind Locality-Sensitive Hashing is simple:
Construct functions that map similar customers close together (they're put in a bucket with the same label) and make sure that objects that are different are put in different buckets. This is also how DNA works: all the information is stored in one long string.

Central to this idea is a function that performs the mapping; this function is called a hash function: a function that maps any range of input to a fixed output. It doesn't matter how many columns go in (scalable input); it brings them back to a single column (fixed output). The simplest hash function concatenates the values from several random columns. Three hash functions have been set up to find similar customers. The three functions each take the values of three movies:
• The first function takes the values of movies 10, 15, and 28.
• The second function takes the values of movies 7, 18, and 22.
• The last function takes the values of movies 16, 19, and 30.

This will ensure that the customers who are in the same bucket share at least several movies. But the customers inside one bucket might still differ on the movies that weren't included in the hashing functions. To solve this we need to compare the customers within the bucket with each other. For this, a new distance measure is created. The distance used to compare customers is called the hamming distance. The hamming distance is used to calculate how much two strings differ: the distance is defined as the number of different characters in a string. Table 5.2 offers an example of the hamming distance.

Table 5.2. Examples of calculating the hamming distance

    String 1    String 2    Hamming distance
    Hat         Cat         1

Table 5.3. Combining the information from the different movie columns into one movies column: the 0/1 values of the individual movie columns are concatenated into a single long string per customer.

This allows you to calculate the hamming distance much more efficiently: by handling the combined column as bits, you can exploit the XOR operator (^). The outcome of the XOR operator is as follows:

    1 ^ 1 = 0
    1 ^ 0 = 1
    0 ^ 1 = 1
    0 ^ 0 = 0
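In Python the same bitwise trick looks like this (the two example bit strings are arbitrary):

    a = int("1111", 2)   # bit string of one customer
    b = int("1100", 2)   # bit string of another customer

    # XOR sets a bit wherever the two strings differ; counting the set bits
    # gives the hamming distance.
    hamming_distance = bin(a ^ b).count("1")
    print(hamming_distance)   # 2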
With this in place, the process to find similar customers becomes very simple. Let's first look at it in pseudo code.

Preprocessing:
• Define p (for instance, 3) functions that select k (for instance, 3) entries from the vector of movies. Here we take 3 functions (p) that each take 3 (k) movies.
• Apply these functions to every point and store them in a separate column. (In literature each function is called a hash function and each column will store a bucket.)

5.5.2. Step 1: The research goal

The goal is to suggest movies to a customer based on the viewing history of similar customers. We skip the data exploration step and move straight into model building.

5.5.3. Step 2: Data preparation

The data the boss has collected is shown in table 5.4. We create this data ourselves for the sake of demonstration.

Table 5.4. Excerpt from the client database and the movies customers rented

    Customer      Movie 1    Movie 2    Movie 3    ...    Movie 32
    Jack Dani        1          1
    Wilhelmson       1          1
    Jane Dane        0          0          1
    Xi Liu           1
    Eros Mazo        1          1

For each customer you get an indication of whether they've rented the movie before (1) or not (0).

First, let's connect Python to MySQL to create our data. Make a connection to MySQL using your username and password. In the following listing we used a database called "test". Replace the user, password, and database name with the appropriate values for your setup and retrieve the connection and the cursor. A database cursor is a control structure that remembers where you are currently in the database.
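A minimal connection sketch, assuming the MySQLdb driver and placeholder credentials for the "test" database (the variable mc is reused by the listings that follow):

    import MySQLdb
    import pandas as pd

    user = 'root'          # replace with your own username
    password = '******'    # replace with your own password
    database = 'test'

    mc = MySQLdb.connect('localhost', user, password, database)
    cursor = mc.cursor()   # the cursor remembers where you are in the database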
Next, simulate a database with generated customers and create a few observations:

    nr_customers = 100
    colnames = ["movie%d" % i for i in range(1, 33)]
    pd.np.random.seed(2015)
    generated_customers = pd.np.random.randint(0, 2, 32 * nr_customers).reshape(nr_customers, 32)

    # Store the data inside a Pandas data frame and write the data frame to a MySQL
    # table called "cust". If this table already exists, replace it.
    data = pd.DataFrame(generated_customers, columns=list(colnames))
    data.to_sql('cust', mc, flavor='mysql', index=True,
                index_label='cust_id', if_exists='replace')

This creates 100 customers and randomly assigns whether they did or didn't see a certain movie, for 32 movies in total. The data is first created in a Pandas data frame and then turned into SQL code. Note: you might run across a warning when running this code.

To efficiently query the database, additional data preparation is needed. That includes the following things:
• Creating bit strings. The bit strings are compressed versions of the columns' content (the 0 and 1 values). First these binary values are concatenated; then the resulting bit string is reinterpreted as a number. This might sound abstract now but will become clearer in the code.
• Defining hash functions. The hash functions will in fact create the bit strings.
• Adding an index to the table, to quicken data retrieval.

Creating Bit Strings

An intermediate table suited for querying is prepared: apply the hash functions and represent each sequence of bits as a decimal number. Finally, place them in a table. First, create the bit strings: convert a string such as "11111111" to a binary or a numeric value to make the hamming function work. A numeric representation, as shown in the next listing, has been opted for. We represent the string as a numeric value. The string is a concatenation of zeros (0) and ones (1) because these indicate whether someone has seen a certain movie or not. The strings are then regarded as bit code: for example, 0011 is the same as the number 3. What createNum() does is take in 8 values, concatenate these 8 column values into a string, and then turn the byte code of the string into a number.

Listing 5.9. Creating bit strings

    def createNum(x1, x2, x3, x4, x5, x6, x7, x8):
        # Concatenate the 8 column values into a bit string and turn it into a number.
        return [int('%d%d%d%d%d%d%d%d' % (i1, i2, i3, i4, i5, i6, i7, i8), 2)
                for (i1, i2, i3, i4, i5, i6, i7, i8) in zip(x1, x2, x3, x4, x5, x6, x7, x8)]

    # Test if the function works correctly. Binary code 1111 is the same as 15
    # (= 1*8 + 1*4 + 1*2 + 1*1). If an assert fails it raises an error;
    # otherwise nothing happens.
    assert int('1111', 2) == 15
    assert int('1100', 2) == 12
    assert createNum([1,1], [1,1], [1,1], [1,1], [1,1], [1,1], [1,0], [1,0]) == [255, 252]

    # Translate the movie columns to bit strings in numeric form. Each bit string
    # represents 8 movies: 4 * 8 = 32 movies. (You could use a single 32-bit string
    # instead of 4 * 8 to keep the code short.)
    store = pd.DataFrame()
    store['bit1'] = createNum(data.movie1, data.movie2, data.movie3, data.movie4,
                              data.movie5, data.movie6, data.movie7, data.movie8)
    store['bit2'] = createNum(data.movie9, data.movie10, data.movie11, data.movie12,
                              data.movie13, data.movie14, data.movie15, data.movie16)
    store['bit3'] = createNum(data.movie17, data.movie18, data.movie19, data.movie20,
                              data.movie21, data.movie22, data.movie23, data.movie24)
    store['bit4'] = createNum(data.movie25, data.movie26, data.movie27, data.movie28,
                              data.movie29, data.movie30, data.movie31, data.movie32)

By converting the information of 32 columns into 4 numbers, we get the compact format shown in figure 5.10 for the first 2 observations (customer movie view history):

    store[0:2]

Fig. 5.10. First 2 customers' information on all 32 movies after bit-string-to-numeric conversion

        bit1   bit2   bit3   bit4
    0     10     62     42    182
    1     23     28    223    180

The next step is to create the hash functions, which sample the data to determine whether two customers have similar behavior.

Creating a Hash Function

The hash functions take the values of the movies for a customer. As decided in the theory part of this case study, we create 3 hash functions: the first function combines movies 10, 15, and 28; the second combines movies 7, 18, and 22; and the third one combines movies 16, 19, and 30. The following code listing shows how this is done.

Listing 5.10. Creating hash functions

    def hash_fn(x1, x2, x3):
        # Exactly like createNum(), but without the final conversion to a number
        # and for 3 columns instead of 8.
        return [b'%d%d%d' % (i, j, k) for (i, j, k) in zip(x1, x2, x3)]

    # Test if it works correctly (if no error is raised, it works). It samples
    # columns, but all observations are selected.
    assert hash_fn([1, 0], [1, 1], [0, 0]) == [b'110', b'010']

    # Create hash values from the customer movies (10, 15, 28), (7, 18, 22),
    # and (16, 19, 30).
    store['bucket1'] = hash_fn(data.movie10, data.movie15, data.movie28)
    store['bucket2'] = hash_fn(data.movie7, data.movie18, data.movie22)
    store['bucket3'] = hash_fn(data.movie16, data.movie19, data.movie30)

    # Store this information in the database.
    store.to_sql('movie_comparison', mc, flavor='mysql', index=True,
                 index_label='cust_id', if_exists='replace')

The hash function concatenates the values from the different movies into a binary value, like what happened before in the createNum() function, only this time it is not converted to a number and only 3 movies are taken as input instead of 8. The assert shows how it concatenates the 3 values for every observation. When a client has bought movie 10 but not movies 15 and 28, it returns b'100' for bucket 1. When the client bought movies 7 and 18, but not 22, it returns b'110' for bucket 2. If we look at the current result, we see the 4 variables created earlier (bit1, bit2, bit3, bit4) together with the buckets created from the 9 handpicked movies (figure 5.11).

Fig. 5.11. Information from the bit string compression and the 9 sample movies

        bit1   bit2   bit3   bit4   bucket1   bucket2   bucket3
    0     10     62     42    182       011       100       011
    1     23     28    223    180       001       111       001
Adding an Index to the Table

You must add indices to speed up data retrieval, as needed in a real-time system. This is shown in the following listing.

Listing 5.11. Creating an index

    # Create a function to easily create indices; indices will quicken retrieval.
    def createIndex(column, cursor):
        sql = 'CREATE INDEX %s ON movie_comparison (%s);' % (column, column)
        cursor.execute(sql)

    # Put an index on each bit bucket.
    createIndex('bucket1', cursor)
    createIndex('bucket2', cursor)
    createIndex('bucket3', cursor)

With the data indexed, we now move on to the "model building" part.

5.5.4. Step 3: Model building

To use the hamming distance in the database, we need to define it as a function.

Creating the Hamming Distance Function

A user-defined function is implemented. This function can calculate the distance for a 32-bit integer (actually 4 * 8 bits), as shown in the following listing.

Listing 5.12. Creating the hamming distance

    # Define the function. It takes 8 input arguments: 4 bit strings of length 8 for
    # the first customer and another 4 bit strings of length 8 for the second
    # customer. This way we can compare 2 customers side by side for 32 movies.
    Sql = '''
    CREATE FUNCTION HAMMINGDISTANCE(
      A0 BIGINT, A1 BIGINT, A2 BIGINT, A3 BIGINT,
      B0 BIGINT, B1 BIGINT, B2 BIGINT, B3 BIGINT
    )
    RETURNS INT DETERMINISTIC
    RETURN
      BIT_COUNT(A0 ^ B0) +
      BIT_COUNT(A1 ^ B1) +
      BIT_COUNT(A2 ^ B2) +
      BIT_COUNT(A3 ^ B3);
    '''
    # The function is stored in the database. You can only do this once; running
    # this code a second time results in an OperationalError:
    # (1304, 'FUNCTION HAMMINGDISTANCE already exists').
    cursor.execute(Sql)

    # To check the function, run an SQL statement with 8 fixed strings. Notice the
    # b before each string, indicating that you're passing bit values.
    Sql = '''SELECT HAMMINGDISTANCE(
      b'11111111', b'00000000', b'11011111', b'11111111',
      b'11111111', b'10001001', b'11011111', b'11111111')'''
    pd.read_sql(Sql, mc)

If all is well, the output of this code should be 3, which indicates that the two series of strings differ in only 3 places.

5.5.5. Step 4: Presentation and automation

The application needs to perform two steps when confronted with a given customer:
• Look for similar customers.
• Suggest movies the customer has yet to see, based on what he or she has already viewed and the viewing history of the similar customers.
First things first: select ourselves a lucky customer.

Finding a Similar Customer

Time to perform real-time queries. In the following listing, customer 27 is the lucky one who'll get his next movies selected for him. But first we select customers with a similar viewing history. We do two-step sampling. In the first sampling, the bucket must be exactly the same as the one of the selected customer (it is based on 9 movies): the selected people must have seen (or not seen) these 9 movies exactly like our customer did. The second sampling is a ranking based on the 4 bit strings, which take into account all the movies in the database.

Listing 5.13. Finding similar customers

    # Pick a customer from the database.
    customer_id = 27
    sql = "select * from movie_comparison where cust_id = %s" % customer_id
    cust_data = pd.read_sql(sql, mc)

    # Show the 3 customers that most resemble customer 27 (customer 27 itself
    # ends up first).
    sql = """select cust_id, hammingdistance(bit1, bit2, bit3, bit4,
        %s, %s, %s, %s) as distance
        from movie_comparison
        where bucket1 = '%s' or bucket2 = '%s' or bucket3 = '%s'
        order by distance limit 3""" % (cust_data.bit1[0], cust_data.bit2[0],
        cust_data.bit3[0], cust_data.bit4[0], cust_data.bucket1[0],
        cust_data.bucket2[0], cust_data.bucket3[0])
    shortlist = pd.read_sql(sql, mc)

Table 5.5 shows customers 2 and 97 to be the most similar to customer 27. As the data was generated randomly, anyone replicating this example might receive different results. Finally, a movie for customer 27 to watch is selected.
Table 5.5. The most similar customers to customer 27

    cust_id    distance
    27         0
    2
    97         9

Finding a New Movie

We need to look at movies customer 27 hasn't seen yet, but the nearest customer has, as shown in the following listing. This is also a good check to see whether the distance function worked correctly. Although this may not be the closest customer, it's a good match for customer 27. By using the hashed indexes, enormous speed is gained when querying large databases.

Listing 5.14. Finding an unseen movie

    # Select the movies customers 27, 2, and 97 have seen.
    cust = pd.read_sql('select * from cust where cust_id in (27, 2, 97)', mc)
    # Transpose for convenience.
    dif = cust.T
    # Select the movies customer 27 didn't see yet.
    dif[dif[0] != dif[1]]

Table 5.6 shows that you can recommend movie 12, 15, or 31 based on customer 2's behaviour.

Table 5.6. Movies from customer 2 can be used as suggestions for customer 27

    cust_id    2    27    97
    Movie3     1     1
    Movie9     1     1
    Movie11    0     1     1
    Movie12
    Movie15    1
    Movie16    1
    Movie25    0     1

Mission accomplished. A happy movie addict can now indulge himself with a new movie, tailored to his preferences.

TWO MARKS QUESTIONS AND ANSWERS

1. What are the general problems you face while handling large data?
• Managing massive amounts of data. It's in the name: big data is big.
• Integrating data from multiple sources.
• Ensuring data quality.
• Keeping data secure.
• Selecting the right big data tools.
• Scaling systems and costs efficiently.
• Lack of skilled data professionals.
• Organizational resistance.

2. What are the problems encountered when working with more data than can fit in memory?
• Not enough memory.
• Processes that never end.
• Some components form a bottleneck while others remain idle.
• Not enough speed.

3. Mention the solutions for handling large data sets.
• Choose the right algorithms.
• Choose the right data structures.
• Choose the right tools.

4. What are the algorithms involved in handling large data sets?
• Online algorithms
• Block matrices
• MapReduce

5. Define perceptron.
A perceptron is one of the least complex machine learning algorithms, used for binary classification (0 or 1): for instance, deciding whether a customer will buy or not.