Cheatsheet - BigData
Patterns for Predictive Analytics
By Ricky Ho

Predictive Models:
Linear Regression
Regression with Regularization
Logistic Regression
Neural Network
And more...
LINEAR REGRESSION

Linear regression models the output as a linear combination of the input variables:

y = Ɵ0 + Ɵ1x1 + Ɵ2x2 + …

…where y is the output numeric value, and xi is the input numeric value. The learning algorithm will learn the set of parameters Ɵi such that the sum of square error (yactual - yestimate)^2 is minimized.
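The example below refers to prestige_train and prestige_test, which are not constructed in the card. A minimal sketch of that preparation, assuming the Prestige dataset from the car package (the every-fifth-row split mirrors the iris split used later and is otherwise an arbitrary choice):

> library(car)                      # provides the Prestige data set
> prestige <- Prestige              # columns: education, income, women, prestige, census, type
> # Hold out every 5th row for testing; keep the rest for training
> testidx <- which(1:nrow(prestige) %% 5 == 0)
> prestige_train <- prestige[-testidx, ]
> prestige_test <- prestige[testidx, ]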
Here is the sample code that uses the R language to predict the output "prestige" from a set of input variables:

> model <- lm(prestige~., data=prestige_train)
> # Use the model to predict the output of test data
> prediction <- predict(model, newdata=prestige_test)
> # Check for the correlation with actual result
> cor(prediction, prestige_test$prestige)
[1] 0.9376719009
> summary(model)
Call:
lm(formula = prestige ~ ., data = prestige_train)

Residuals:
        Min          1Q      Median          3Q         Max
-13.9078951  -5.0335742   0.3158978   5.3830764  17.8851752

Coefficients:
                  Estimate    Std. Error  t value     Pr(>|t|)
(Intercept) -20.7073113585 11.4213272697 -1.81304    0.0743733 .
education     4.2010288017  0.8290800388  5.06710 0.0000034862 ***
income        0.0011503739  0.0003510866  3.27661    0.0016769 **
women         0.0363017610  0.0400627159  0.90612    0.3681668
census        0.0018644881  0.0009913473  1.88076    0.0644172 .
typeprof     11.3129416488  7.3932217287  1.53018    0.1307520
typewc        1.9873305448  4.9579992452  0.40083    0.6898376
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.41604 on 66 degrees of freedom
  (4 observations deleted due to missingness)
Multiple R-squared: 0.820444,   Adjusted R-squared: 0.8041207
F-statistic: 50.26222 on 6 and 66 DF,  p-value: < 0.00000000000000022204

The coefficient column gives an estimation of Ɵi, and the associated p-value gives the confidence of each estimated Ɵi. For example, features not marked with at least one * can be safely ignored. In the above model, education and income have a high influence on prestige.

The goal of minimizing the square error makes linear regression very sensitive to outliers that greatly deviate in the output. It is a common practice to identify those outliers, remove them, and then rerun the training.
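One way to do that last step (a sketch of my own, not code from the original card) is to flag training points with large standardized residuals and refit:

> # Flag fitted points whose standardized residual exceeds 2 in absolute value
> outliers <- names(which(abs(rstandard(model)) > 2))
> # Refit the linear model without those rows
> model2 <- lm(prestige~., data=prestige_train[!(rownames(prestige_train) %in% outliers), ])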
REGRESSION WITH REGULARIZATION

To avoid an over-fitting problem (the trained model fits too well with the training data and is not generalized enough), the regularization technique is used to shrink the magnitude of Ɵi. This is done by adding a penalty (a function of the sum of Ɵi) into the cost function.

In L2 regularization (also known as Ridge regression), ΣƟi^2 will be added to the cost function. In L1 regularization (also known as Lasso regression), Σ|Ɵi| will be added to the cost function.

Both L1 and L2 will shrink the magnitude of Ɵi. For variables that are inter-dependent, L2 tends to spread the shrinkage such that all interdependent variables are equally influential. On the other hand, L1 tends to keep one variable and shrink all the other dependent variables to values very close to zero. In other words, L1 shrinks the variables in an uneven manner so that it can also be used to select input variables.

Combining L1 and L2, the general form of the cost function becomes the sum of square error plus a mixed penalty (the elastic-net form):

cost = Σ(yactual - yestimate)^2 + λ * [α * Σ|Ɵi| + (1 - α) * ΣƟi^2]

…where λ controls the overall amount of shrinkage and α controls the mix: α = 1 is pure L1 (Lasso) and α = 0 is pure L2 (Ridge).
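The original card's regularization code is not reproduced in this extract; a minimal sketch using the glmnet package on the same Prestige split (the package choice and alpha=0.5 are assumptions, not from the original) might look like this:

> library(glmnet)
> # glmnet needs a numeric design matrix; drop rows with missing values first
> train <- na.omit(prestige_train)
> test <- na.omit(prestige_test)
> x <- model.matrix(prestige~., data=train)[, -1]
> y <- train$prestige
> # alpha mixes the penalties: alpha=1 is pure L1 (Lasso), alpha=0 is pure L2 (Ridge)
> cvfit <- cv.glmnet(x, y, alpha=0.5)
> # Predict the test set at the lambda chosen by cross-validation
> prediction <- predict(cvfit, newx=model.matrix(prestige~., data=test)[, -1], s="lambda.min")
> cor(prediction, test$prestige)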
LOGISTIC REGRESSION becomes the following:
The code below trains a binary classifier on the iris data that predicts whether an observation is of the species setosa:

> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
> # Prepare training and testing data
> testidx <- which(1:length(iris[,1])%%5 == 0)
> iristrain <- iris[-testidx,]
> iristest <- iris[testidx,]
> newcol = data.frame(isSetosa=(iristrain$Species == 'setosa'))
> traindata <- cbind(iristrain, newcol)
> head(traindata)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species isSetosa
1          5.1         3.5          1.4         0.2  setosa     TRUE
2          4.9         3.0          1.4         0.2  setosa     TRUE
3          4.7         3.2          1.3         0.2  setosa     TRUE
4          4.6         3.1          1.5         0.2  setosa     TRUE
6          5.4         3.9          1.7         0.4  setosa     TRUE
7          4.6         3.4          1.4         0.3  setosa     TRUE
> formula <- isSetosa ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
> logisticModel <- glm(formula, data=traindata, family="binomial")
Warning messages:
1: glm.fit: algorithm did not converge
2: glm.fit: fitted probabilities numerically 0 or 1 occurred
> # Predict the probability for test data
> prob <- predict(logisticModel, newdata=iristest, type='response')
> round(prob, 3)
  5  10  15  20  25  30  35  40  45  50  55  60  65  70  75  80  85  90  95 100
  1   1   1   1   1   1   1   1   1   1   0   0   0   0   0   0   0   0   0   0
105 110 115 120 125 130 135 140 145 150
  0   0   0   0   0   0   0   0   0   0
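The model returns probabilities rather than class labels; a small follow-up sketch (the 0.5 cutoff is an arbitrary choice, not from the original card) turns them into class predictions and checks them against the actual species:

> # Classify as setosa when the predicted probability exceeds 0.5
> predicted <- prob > 0.5
> table(predicted, actual=(iristest$Species == 'setosa'))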
NEURAL NETWORK
The neuralnet package below trains a feed-forward network with one hidden layer of 3 units; the categorical species label is first binarized into three output columns:

> library(neuralnet)
> nnet_iristrain <- iristrain
> # Binarize the categorical output
> nnet_iristrain <- cbind(nnet_iristrain, iristrain$Species == 'setosa')
> nnet_iristrain <- cbind(nnet_iristrain, iristrain$Species == 'versicolor')
> nnet_iristrain <- cbind(nnet_iristrain, iristrain$Species == 'virginica')
> names(nnet_iristrain)[6] <- 'setosa'
> names(nnet_iristrain)[7] <- 'versicolor'
> names(nnet_iristrain)[8] <- 'virginica'
> nn <- neuralnet(setosa+versicolor+virginica ~ Sepal.Length + Sepal.Width +
                  Petal.Length + Petal.Width, data=nnet_iristrain, hidden=c(3))
> plot(nn)
> mypredict <- compute(nn, iristest[-5])$net.result
> # Consolidate multiple binary output back to categorical output
> maxidx <- function(arr) {
+     return(which(arr == max(arr)))
+ }
> idx <- apply(mypredict, c(1), maxidx)
> prediction <- c('setosa', 'versicolor', 'virginica')[idx]
> table(prediction, iristest$Species)

SUPPORT VECTOR MACHINE

SVM with a kernel function is a highly effective model and works well across a wide range of problem sets. Although it is a binary classifier, it can be easily extended to multi-class classification by training a group of binary classifiers and using "one vs all" or "one vs one" as predictors.

> library(e1071)
> tune <- tune.svm(Species~., data=iristrain, gamma=10^(-6:-1), cost=10^(1:4))
> summary(tune)
Parameter tuning of 'svm':
- sampling method: 10-fold cross validation
- best parameters:
  gamma  cost
  0.001 10000
- best performance: 0.03333333
> model <- svm(Species~., data=iristrain, type="C-classification",
              kernel="radial", probability=T, gamma=0.001, cost=10000)
> prediction <- predict(model, iristest, probability=T)
> table(iristest$Species, prediction)
            prediction
             setosa versicolor virginica
  setosa         10          0         0
  versicolor      0         10         0
  virginica       0          3         7
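Because the model above was trained with probability=T, the per-class probabilities can also be extracted from the prediction object (a small sketch, not part of the original card):

> # Class probabilities are attached to the prediction as an attribute
> prob <- attr(prediction, "probabilities")
> head(round(prob, 3))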
NAIVE BAYES

From a probabilistic viewpoint, the predictive problem can be viewed as a conditional probability estimation: trying to find Y where P(Y | X) is maximized.

From the Bayesian rule, P(Y | X) == P(X | Y) * P(Y) / P(X)

This is equivalent to finding Y where P(X | Y) * P(Y) is maximized. Let's say the input X contains 3 categorical features: X1, X2, X3. In the general case, we assume each variable can potentially influence any other variable. Therefore the joint distribution becomes:

P(X1, X2, X3 | Y) == P(X1 | Y) * P(X2 | X1, Y) * P(X3 | X1, X2, Y)

Notice how in the last term of the above equation, the number of entries is exponentially proportional to the number of input variables. Naïve Bayes makes the problem tractable by assuming each input variable is conditionally independent of the others given Y, so the joint distribution simplifies to P(X1 | Y) * P(X2 | Y) * P(X3 | Y).

But it is possible that some patterns never show up in the training data, e.g., P(X1=a | Y=y) is 0. To deal with this situation, we pretend to have seen the data of each possible value one more time than we actually have:

P(X1=a | Y=y) == (count(a, y) + 1) / (count(y) + m)

…where m is the number of possible values in X1.

When the input features are numeric, say a = 2.75, we can assume X1 follows a normal distribution: find the mean and standard deviation of X1 and then estimate P(X1=a) using the normal density function.

Here is how we use Naïve Bayes in R:

> library(e1071)
> # Can handle both categorical and numeric input variables,
> # but the output must be categorical
> model <- naiveBayes(Species~., data=iristrain)
> prediction <- predict(model, iristest[,-5])
> table(prediction, iristest[,5])
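The add-one adjustment described above is Laplace smoothing, which e1071's naiveBayes exposes directly; setting it to 1 gives exactly the add-one case:

> # laplace=1 applies add-one smoothing to the conditional counts of categorical features
> model <- naiveBayes(Species~., data=iristrain, laplace=1)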
K-NEAREST NEIGHBORS

A contrast to model-based learning is K-Nearest Neighbor. This is also called instance-based learning because it doesn't even learn a single model. The training process involves memorizing all the training data. To predict a new data point, we find the closest K (a tunable parameter) neighbors from the training set and let them vote for the final prediction.

On the iris test set, the resulting confusion matrix looks like this (the call that produced it is sketched below):

prediction   setosa versicolor virginica
  setosa         10          0         0
  versicolor      0         10         1
  virginica       0          0         9

The strength of K-nearest neighbor is its simplicity. No model needs to be trained. Incremental learning is automatic when more data arrives (and old data can be deleted as well). The weakness of KNN, however, is that it doesn't handle high numbers of dimensions well.
DECISION TREE of Decision Tree is that once learned it cannot be updated
Based on a tree of decision nodes, the learning approach is to recursively divide the training data into buckets of homogeneous members through the most discriminative dividing criteria possible. The measurement of "homogeneity" is based on the output label; when it is a numeric value, the measurement will be the variance of the bucket; when it is a category, the measurement will be the entropy, or "gini index," of the bucket.

During the training, various dividing criteria based on the input will be tried (and used in a greedy manner); when the input is a category (Mon, Tue, Wed, etc.), it will first be turned into binary (isMon, isTue, isWed, etc.), and then true/false will be used as the decision boundary to evaluate homogeneity; when the input is a numeric or ordinal value, the lessThan/greaterThan at each training-data input value will serve as the decision boundary.

The training process stops when there is no significant gain in homogeneity after further splitting the tree. The members of the bucket represented at the leaf node will vote for the prediction: the majority wins when the output is a category, and the members' average is taken when the output is numeric.

Here is an example in R (the plot and text calls draw the learned tree):

> library(rpart)
> # Train the decision tree
> treemodel <- rpart(Species~., data=iristrain)
> plot(treemodel)
> text(treemodel, use.n=T)
> # Predict using the decision tree
> prediction <- predict(treemodel, newdata=iristest, type='class')
> # Use a contingency table to see how accurate it is
> table(prediction, iristest$Species)
prediction   setosa versicolor virginica
  setosa         10          0         0
  versicolor      0         10         3
  virginica       0          0         7

The good part of the Tree is that it can take different data types of input and output variables, which can be categorical, binary and numeric values. It can handle missing attributes and outliers well. The Decision Tree is also good at explaining the reasoning for its prediction and therefore gives good insight about the underlying data.

The limitation of the Decision Tree is that each decision boundary at each split point is a concrete binary decision. Also, the decision criteria consider only one input attribute at a time, not a combination of multiple input variables. Another weakness of the Decision Tree is that once learned it cannot be updated incrementally: when new training data arrives, you have to throw away the old tree and retrain all data from scratch. In practice, standalone decision trees are rarely used because their predictive accuracy is relatively low. Tree ensembles (described below) are the common way to use decision trees.
TREE ENSEMBLES
Instead of picking a single model, the Ensemble Method combines multiple models in a certain way to fit the training data. Here are the two primary ways: "bagging" and "boosting." In "bagging", we take a subset of training data (pick n random samples out of the N training data, with replacement) to train up each model. After multiple models are trained, we use a voting scheme to predict future data.

Random Forest is one of the most popular bagging models; in addition to selecting n training samples out of N for each tree, at each decision node of the tree it randomly selects m input features from the total M input features (m ~ M^0.5) and learns a decision tree from that. Finally, each tree in the forest votes for the result.

Here is the R code to use Random Forest:

> library(randomForest)
> # Train 500 trees, randomly selected attributes at each split
> model <- randomForest(Species~., data=iristrain, ntree=500)
> # Predict using the forest
> prediction <- predict(model, newdata=iristest, type='class')
> table(prediction, iristest$Species)
> importance(model)
             MeanDecreaseGini
Sepal.Length         7.807602
Sepal.Width          1.677239
Petal.Length        31.145822
Petal.Width         38.617223

"Boosting" is another approach in the Ensemble Method. Instead of sampling the input features, it samples the training data records. It puts more emphasis, though, on the training data that is wrongly predicted in previous iterations. Initially, each training data point is equally weighted. At each iteration, the data that is wrongly classified will have its weight increased.

Gradient Boosting Method is one of the most popular boosting methods. It is based on incrementally adding a function that fits the residuals.

Set i = 0 at the beginning, and repeat until convergence:

• Learn a function Fi(X) to predict Y. Basically, find F that minimizes the expected L(F(X) - Y), where L is the loss function of the residual.
• Learn another function gi(X) to predict the gradient of the above loss.
• Update Fi+1(X) = Fi(X) - a * gi(X), where a is the learning rate, stepping against the predicted gradient.
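The card's boosting code is not reproduced in this extract; a minimal sketch using the gbm package on the binary isSetosa outcome built earlier (the package choice and parameter values are illustrative assumptions) might look like:

> library(gbm)
> # Boost 1000 shallow trees with a small learning rate on the 0/1 setosa outcome
> boostmodel <- gbm(as.numeric(isSetosa) ~ Sepal.Length + Sepal.Width +
                    Petal.Length + Petal.Width,
                    data=traindata, distribution="bernoulli",
                    n.trees=1000, shrinkage=0.01, interaction.depth=2)
> # Predicted probability of setosa for each test observation
> prob <- predict(boostmodel, newdata=iristest, n.trees=1000, type="response")
> table(prob > 0.5, iristest$Species == 'setosa')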