0% found this document useful (0 votes)

17 views7 pages

Cheatsheet - BigData

This document provides an introduction to predictive analytics and summarizes several machine learning models that can be used for predictive modeling, including linear regression, logistic regression, and neural networks. It uses the iris and Prestige datasets to demonstrate linear regression modeling. Key steps discussed are preparing training and test datasets, fitting a linear regression model to the training data, and using the model to make predictions on the test data.

Uploaded by

rokr58

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

17 views7 pages

Cheatsheet - BigData

Uploaded by

rokr58

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 7

#158

Get More Refcardz! Visit refcardz.com

Big Data Machine Learning:

CONTENTS INCLUDE

n
Predictive Models
n
Linear Regression
n
Logisitic Regression Patterns for Predictive Analytics
n
Regression with Regularization
Neural Network
By Ricky Ho
n

n
And more...

INTRODUCTION > library(car)

> summary(Prestige)
education income women
Min. : 6.38000 Min. : 611.000 Min. : 0.00000
Predictive Analytics is about predicting future outcome based on 1st Qu.: 8.44500
Median :10.54000
1st Qu.: 4106.000
Median : 5930.500
1st Qu.: 3.59250
Median :13.60000
analyzing data collected previously. It includes two phases: Mean :10.73804 Mean : 6797.902 Mean :28.97902
3rd Qu.:12.64750 3rd Qu.: 8187.250 3rd Qu.:52.20250
Max. :15.97000 Max. :25879.000 Max. :97.51000
1. Training phase: Learn a model from training data prestige
Min. :14.80000 Min.
census
:1113.000
type
bc :44
2. Predicting phase: Use the model to predict the 1st Qu.:35.22500 1st Qu.:3120.500 prof:31
Median :43.60000 Median :5135.000 wc :23
unknown or future outcome Mean :46.83333 Mean :5401.775 NA’s: 4
3rd Qu.:59.27500 3rd Qu.:8312.500
Max. :87.20000 Max. :9517.000
> head(Prestige)
PREDICTIVE MODELS education income women prestige census type
gov.administrators 13.11 12351 11.16 68.8 1113 prof
general.managers 12.26 25879 4.02 69.1 1130 prof
accountants 12.77 9271 15.70 63.4 1171 prof
We can choose many models, each based on a set of different purchasing.officers 11.42 8865 9.11 56.8 1175 prof
chemists 14.62 8403 11.68 73.5 2111 prof
assumptions regarding the underlying distribution of data. physicists 15.64 11030 5.13 77.6 2113 prof
> testidx <- which(1:nrow(Prestige)%%4==0)
Therefore, we are interested in two general types of problems > prestige_train <- Prestige[-testidx,]
in this discussion: 1. Classification—about predicting a category > prestige_test <- Prestige[testidx,]

(a value that is discrete, finite with no ordering implied), and 2.

Regression—about predicting a numeric quantity (a value that’s
continuous and infinite with ordering). LINEAR REGRESSION

Linear regression has the longest, most well-understood history

For classification problems, we use the “iris” data set and predict
in statistics, and is the most popular machine learning model.
its “species” from its “width” and “length” measures of sepals
It is based on the assumption that a linear relationship exists
and petals. Here is how we set up our training and testing data:
between the input and output variables, as follows:
> summary(iris) y = Ө0 + Ө1x1 + Ө 2x2 + …
Sepal.Length Sepal.Width Petal.Length Petal.Width
Min. :4.300000 Min. :2.000000 Min. :1.000 Min. :0.100000
1st Qu.:5.100000 1st Qu.:2.800000 1st Qu.:1.600 1st Qu.:0.300000
Median :5.800000 Median :3.000000 Median :4.350 Median :1.300000
Mean :5.843333 Mean :3.057333 Mean :3.758 Mean :1.199333
3rd Qu.:6.400000 3rd Qu.:3.300000 3rd Qu.:5.100 3rd Qu.:1.800000
Max. :7.900000 Max. :4.400000 Max. :6.900 Max. :2.500000
Species
setosa :50
versicolor:50
virginica :50

> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa …where y is the output numeric value, and xi is the input numeric
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
value.
6 5.4 3.9 1.7 0.4 setosa
>
> # Prepare training and testing data
> testidx <- which(1:length(iris[,1])%%5 == 0)

> iristrain <- iris[-testidx,]

> iristest <- iris[testidx,]

To illustrate a regression problem (where the output we predict

is a numeric quantity), we’ll use the “Prestige” data set imported
from the “car” package to create our training and testing data.
Machine Learning

DZone, Inc. | www.dzone.com

2 Machine Learning

The learning algorithm will learn the set of parameters such that
> newcol = data.frame(isSetosa=(iristrain$Species == ‘setosa’))
the sum of square error (yactual - yestimate)2 is minimized. > traindata <- cbind(iristrain, newcol)
Here is the sample code that uses the R language to predict the > head(traindata)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species isSetosa
output “prestige” from a set of input variables: 1 5.1 3.5 1.4 0.2 setosa TRUE
2 4.9 3.0 1.4 0.2 setosa TRUE
3 4.7 3.2 1.3 0.2 setosa TRUE
4 4.6 3.1 1.5 0.2 setosa TRUE
> model <- lm(prestige~., data=prestige_train) 6 5.4 3.9 1.7 0.4 setosa TRUE
> # Use the model to predict the output of test data 7 4.6 3.4 1.4 0.3 setosa TRUE
> prediction <- predict(model, newdata=prestige_test) > formula <- isSetosa ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
> # Check for the correlation with actual result > logisticModel <- glm(formula, data=traindata, family=”binomial”)
> cor(prediction, prestige_test$prestige) Warning messages:
[1] 0.9376719009 1: glm.fit: algorithm did not converge
> summary(model) 2: glm.fit: fitted probabilities numerically 0 or 1 occurred
Call: > # Predict the probability for test data
lm(formula = prestige ~ ., data = prestige_train) > prob <- predict(logisticModel, newdata=iristest, type=’response’)
Residuals: > round(prob, 3)
Min 1Q Median 3Q Max 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100
-13.9078951 -5.0335742 0.3158978 5.3830764 17.8851752 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
Coefficients: 105 110 115 120 125 130 135 140 145 150
Estimate Std. Error t value Pr(>|t|) 0 0 0 0 0 0 0 0 0 0
(Intercept) -20.7073113585 11.4213272697 -1.81304 0.0743733 .
education 4.2010288017 0.8290800388 5.06710 0.0000034862 ***
income 0.0011503739 0.0003510866 3.27661 0.0016769 **
women 0.0363017610 0.0400627159 0.90612 0.3681668
census 0.0018644881 0.0009913473 1.88076 0.0644172 .
typeprof 11.3129416488 7.3932217287 1.53018 0.1307520 REGRESSION WITH REGULARIZATION
typewc 1.9873305448 4.9579992452 0.40083 0.6898376
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.41604 on 66 degrees of freedom To avoid an over-fitting problem (the trained model fits too
(4 observations deleted due to missingness)
Multiple R-squared: 0.820444, Adjusted R-squared: 0.8041207
well with the training data and is not generalized enough), the
F-statistic: 50.26222 on 6 and 66 DF, p-value: < 0.00000000000000022204 regularization technique is used to shrink the magnitude of ƟӨi.
This is done by adding a penalty (a function of the sum of ƟӨi) into
the cost function.
The coefficient column gives an estimation of ƟӨi, and an
associated p-value gives the confidence of each estimated ƟӨi. In L2 regularization (also known as Ridge regression), Өi2 will be
For example, features not marked with at least one * can be added to the cost function. In L1 regularization (also known as
safely ignored. Lasso regression), Σ ||Өi|| will be added to the cost function.
Both L1, L2 will shrink the magnitude of Өi. For variables that
In the above model, education and income has a high influence are inter-dependent, L2 tends to spread the shrinkage such that
to the prestige. all interdependent variables are equally influential. On the other
hand, L1 tends to keep one variable and shrink all the other
The goal of minimizing the square error makes linear regression dependent variables to values very close to zero. In other words,
very sensitive to outliers that greatly deviate in the output. It is L1 shrinks the variables in an uneven manner so that it can also be
a common practice to identify those outliers, remove them, and used to select input variables.
then rerun the training.
Combining L1 and L2, the general form of the cost function
LOGISTIC REGRESSION becomes the following:

Cost == Non-regularization-cost + λ (α.Σ ||Ɵi|| + (1- α).Σ Ɵi2)

In a classification problem, the output is binary rather than
numeric. We can imagine doing a linear regression and then Notice the 2 tunable parameters, lambda, and alpha. Lambda
compressing the numeric output into a 0..1 range using the logit controls the degree of regularization (0 means no regularization
function 1/(1+e-t), shown here: and infinity means ignoring all input variables because all
coefficients of them will be zero). Alpha controls the degree of
mix between L1 and L2 (0 means pure L2 and 1 means pure L1).
Glmnet is a popular regularization package. The alpha parameter
needs to be supplied based on the application’s need, i.e.,
its need for selecting a reduced set of variables. Alpha=1
is preferred. The library provides a cross-validation test to
automatically choose the better lambda value.
Let’s repeat the above linear regression example and use
y = 1/(1 + e -(Ө 0 + Ө1 x 1 +ƟӨ2 x 2 + …)) regularization this time. We pick alpha = 0.7 to favor L1
regularization.
…where y is the 0 .. 1 value, and xi is the input numeric value.
> library(glmnet)
The learning algorithm will learn the set of parameters such > cv.fit <- cv.glmnet(as.matrix(prestige_train[,c(-4, -6)]), as.vector(prestige_
train[,4]), nlambda=100, alpha=0.7, family=”gaussian”)
that the cost (yactual * log yestimate + (1 - yactual) * log(1 - yestimate)) is > plot(cv.fit)
> coef(cv.fit)
minimized. 5 x 1 sparse Matrix of class “dgCMatrix”
1
(Intercept) 6.3876684930151
Here is the sample code that uses the R language to perform a education 3.2111461944976
income 0.0009473793366
binary classification using iris data. women 0.0000000000000
census 0.0000000000000
> prediction <- predict(cv.fit, newx=as.matrix(prestige_test[,c(-4, -6)]))
> cor(prediction, as.vector(prestige_test[,4]))
[,1]
1 0.9291181193

This is the cross-validation plot. It shows the best lambda with

minimal-root,mean-square error.

DZone, Inc. | www.dzone.com

3 Machine Learning

> library(neuralnet)
> nnet_iristrain <-iristrain
> #Binarize the categorical output
> nnet_iristrain <- cbind(nnet_iristrain, iristrain$Species == ‘setosa’)
> nnet_iristrain <- cbind(nnet_iristrain, iristrain$Species == ‘versicolor’)
> nnet_iristrain <- cbind(nnet_iristrain, iristrain$Species == ‘virginica’)
> names(nnet_iristrain)[6] <- ‘setosa’
> names(nnet_iristrain)[7] <- ‘versicolor’
> names(nnet_iristrain)[8] <- ‘virginica’
> nn <- neuralnet(setosa+versicolor+virginica ~ Sepal.Length + Sepal.Width +
Petal.Length + Petal.Width, data=nnet_iristrain, hidden=c(3))
> plot(nn)
> mypredict <- compute(nn, iristest[-5])$net.result
> # Consolidate multiple binary output back to categorical output
> maxidx <- function(arr) {
+ return(which(arr == max(arr)))
+ }
> idx <- apply(mypredict, c(1), maxidx)
> prediction <- c(‘setosa’, ‘versicolor’, ‘virginica’)[idx]
> table(prediction, iristest$Species)

prediction setosa versicolor virginica

setosa 10 0 0
versicolor 0 10 3
virginica 0 0 7

NEURAL NETWORK

A Neural Network emulates the structure of a human brain as a

network of neurons that are interconnected to each other. Each
neuron is technically equivalent to a logistic regression unit.

Neural networks are very good at learning non-linear functions.

They can even learn multiple outputs simultaneously, though
the training time is relatively long, which makes the network
susceptible to local minimum traps. This can be mitigated by
doing multiple rounds and picking the best-learned model.

SUPPORT VECTOR MACHINE

A Support Vector Machine provides a binary classification

mechanism based on finding a hyperplane between a set of
samples with +ve and -ve outputs. It assumes the data is linearly
separable.

In this setting, neurons are organized in multiple layers where

every neuron at layer i connects to every neuron at layer i+1 and
nothing else. The tuning parameters in a neural network include
the number of hidden layers (commonly set to 1), the number of
neurons in each layer (which should be same for all hidden layers
and usually at 1 to 3 times the input variables), and the learning
rate. On the other hand, the number of neurons at the output
layer depends on how many binary outputs need to be learned.
In a classification problem, this is typically the number of possible
values at the output category.
The problem can be structured as a quadratic programming
The learning happens via an iterative feedback mechanism optimization problem that maximizes the margin subjected to
where the error of training data output is used to adjust the a set of linear constraints (i.e., data output on one side of the
corresponding weights of input. This adjustment propagates to line must be +ve while the other side must be -ve). This can be
previous layers and the learning algorithm is known as “back- solved with the quadratic programming technique.
propagation.” Here is an example:

DZone, Inc. | www.dzone.com

4 Machine Learning

If the data is not linearly separable due to noise (the majority

is still linearly separable), then an error term will be added to
penalize the optimization.
If the data distribution is fundamentally non-linear, the trick is
to transform the data to a higher dimension so the data will be
linearly separable.The optimization term turns out to be a dot
product of the transformed points in the high-dimension space,
which is found to be equivalent to performing a kernel function in
the original (before transformation) space.

The kernel function provides a cheap way to equivalently

transform the original point to a high dimension (since we don’t
actually transform it) and perform the quadratic optimization in
that high-dimension space. Since P(X | Y) == P(X1 | Y) * P(X2 | Y) * P(X3 | Y), we need to find
the Y that maximizes P(X1 | Y) * P(X2 | Y) * P(X3 | Y) * P(Y)
There are a couple of tuning parameters (e.g., penalty and cost),
so transformation is usually conducted in 2 steps—finding the Each term on the right hand side can be learned by counting the
optimal parameter and then training the SVM model using that training data. Therefore we can estimate P(Y | X) and pick Y to
parameter. Here are some example codes in R: maximize its value.

> library(e1071)
But it is possible that some patterns never show up in training
> tune <- tune.svm(Species~., data=iristrain, gamma=10^(-6:-1), cost=10^(1:4)) data, e.g., P(X1=a | Y=y) is 0. To deal with this situation, we
> summary(tune)
Parameter tuning of ‘svm’: pretend to have seen the data of each possible value one more
- sampling method: 10-fold cross validation
- best parameters: time than we actually have.
gamma cost
0.001 10000
- best performance: 0.03333333 P(X1=a | Y=y) == (count(a, y) + 1) / (count(y) + m)
> model <- svm(Species~., data=iristrain, method=”C-classification”,
kernel=”radial”, probability=T, gamma=0.001, cost=10000)
> prediction <- predict(model, iristest, probability=T)
> table(iristest$Species, prediction)
…where m is the number of possible values in X1.
prediction
setosa versicolor virginica
setosa 10 0 0 When the input features are numeric, say a = 2.75, we can assume
versicolor 0 10 0
virginica 0 3 7
X1 is the normal distribution. Find out the mean and standard
> deviation of X1 and then estimate P(X1=a) using the normal
distribution function.
SVM with a Kernel function is a highly effective model and works Here is how we use Naïve Bayes in R:
well across a wide range of problem sets. Although it is a binary > library(e1071)
> # Can handle both categorical and numeric input variables, but output must be
classifier, it can be easily extended to a multi-class classification categorical
by training a group of binary classifiers and using “one vs all” or > model <- naiveBayes(Species~., data=iristrain)
> prediction <- predict(model, iristest[,-5])
“one vs one” as predictors. > table(prediction, iristest[,5])

prediction setosa versicolor virginica

setosa 10 0 0
SVM predicts the output based on the distance to the dividing versicolor 0 10 2
hyperplane. This doesn’t directly estimate the probability of the virginica 0 0 8

prediction. We therefore use the calibration technique to find a

logistic regression model between the distance of the hyperplane Notice the independence assumption is not true in most
and the binary output. Using that regression model, we then get cases.Nevertheless, the system still performs incredibly well.
our estimation. Onestrength of Naïve Bayes is that it is highly scalable and can
learn incrementally—all we have to do is count the observed
variables and update the probability distribution.
BAYESIAN NETWORK AND NAÏVE BAYES

K-NEAREST NEIGHBORS
From a probabilistic viewpoint, the predictive problem can be
viewed as a conditional probability estimation; trying to find Y
where P(Y | X) is maximized. A contrast to model-based learning is K-Nearest neighbor. This is
also called instance-based learning because it doesn’t even learn
From the Bayesian rule, P(Y | X) == P(X | Y) * P(Y) / P(X) a single model. The training process involves memorizing all the
training data. To predict a new data point, we found the closest
This is equivalent to finding Y where P(X | Y) * P(Y) is maximized. K (a tunable parameter) neighbors from the training set and let
Let’s say the input X contains 3 categorical features— X1, X2, them vote for the final prediction.
X3. In the general case, we assume each variable can potentially
influence any other variable. Therefore the joint distribution
becomes:

P(X | Y) = P(X1 | Y) * P(X2 | X1, Y) * P(X3 | X1, X2, Y)

Notice how in the last term of the above equation, the number
of entries is exponentially proportional to the number of input
variables.

DZone, Inc. | www.dzone.com

5 Machine Learning

To determine the “nearest neighbors,” a distance function

needs to be defined (e.g., a Euclidean distance function is a
common one for numeric input variables). The voting can also be
weighted among the K-neighbors based on their distance from
the new data point.
Here is the R code using K-nearest neighbor for classification.
> library(class)
> train_input <- as.matrix(iristrain[,-5])
> train_output <- as.vector(iristrain[,5])
> test_input <- as.matrix(iristest[,-5])
> prediction <- knn(train_input, test_input, train_output, k=5)
> table(prediction, iristest$Species)

prediction setosa versicolor virginica The good part of the Tree is that it can take different data types
setosa 10 0 0
versicolor 0 10 1 of input and output variables that can be categorical, binary and
>
virginica 0 0 9 numeric values. It can handle missing attributes and outliers
well. Decision Tree is also good in explaining reasoning for its
prediction and therefore gives good insight about the underlying
The strength of K-nearest neighbor is its simplicity. No model
data.
needs to be trained. Incremental learning is automatic when
more data arrives (and old data can be deleted as well). The The limitation of Decision Tree is that each decision boundary
weakness of KNN, however, is that it doesn’t handle high at each split point is a concrete binary decision. Also, the
numbers of dimensions well. decision criteria considers only one input attribute at a time, not
a combination of multiple input variables. Another weakness
DECISION TREE of Decision Tree is that once learned it cannot be updated
incrementally. When new training data arrives, you have to throw
away the old tree and retrain all data from scratch. In practice,
Based on a tree of decision nodes, the learning approach is to
standalone decision trees are rarely used because their accuracy
recursively divide the training data into buckets of homogeneous
ispredictive and relatively low . Tree ensembles (described
members through the most discriminative dividing criteria
below) are the common way to use decision trees.
possible. The measurement of “homogeneity” is based on
the output label; when it is a numeric value, the measurement
TREE ENSEMBLES
will be the variance of the bucket; when it is a category, the
measurement will be the entropy, or “gini index,” of the bucket.
Instead of picking a single model, Ensemble Method combines
multiple models in a certain way to fit the training data. Here are
the two primary ways: “bagging” and “boosting.” In “bagging”,
we take a subset of training data (pick n random sample out of
N training data, with replacement) to train up each model. After
multiple models are trained, we use a voting scheme to predict
future data.
Random Forest is one of the most popular bagging models; in
addition to selecting n training data out of N at each decision
node of the tree, it randomly selects m input features from the
total M input features (m ~ M^0.5). Then it learns a decision tree
from that. Finally, each tree in the forest votes for the result.
Here is the R code to use Random Forest:
> library(randomForest)
#Train 100 trees, random selected attributes
During the training, various dividing criteria based on the input > model <- randomForest(Species~., data=iristrain, nTree=500)
will be tried (and used in a greedy manner); when the input is a #Predict using the forest
> prediction <- predict(model, newdata=iristest, type=’class’)
category (Mon, Tue, Wed, etc.), it will first be turned into binary > table(prediction, iristest$Species)
> importance(model)
(isMon, isTue, isWed, etc.,) and then it will use true/false as a MeanDecreaseGini
decision boundary to evaluate homogeneity; when the input is Sepal.Length
Sepal.Width
7.807602
1.677239
a numeric or ordinal value, the lessThan/greaterThan at each Petal.Length 31.145822
Petal.Width 38.617223
training-data input value will serve as the decision boundary.
The training process stops when there is no significant gain in “Boosting” is another approach in Ensemble Method. Instead
homogeneity after further splitting the Tree. The members of of sampling the input features, it samples the training data
the bucket represented at leaf node will vote for the prediction; records. It puts more emphasis, though, on the training data that
the majority wins when the output is a category. The member’s is wrongly predicted in previous iterations. Initially, each training
average is taken when the output is a numeric. data is equally weighted. At each iteration, the data that is
wrongly classified will have its weight increased.
Here is an example in R:
Gradient Boosting Method is one of the most popular boosting
> library(rpart)
> #Train the decision tree methods. It is based on incrementally adding a function that fits
> treemodel <- rpart(Species~., data=iristrain)
> plot(treemodel)
the residuals.
> text(treemodel, use.n=T)
> #Predict using the decision tree Set i = 0 at the beginning, and repeat until convergence.
> prediction <- predict(treemodel, newdata=iristest, type=’class’)
> #Use contingency table to see how accurate it is • Learn a function Fi(X) to predict Y. Basically, find F that
> table(prediction, iristest$Species)
prediction setosa versicolor virginica minimizes the expected(L(F(X) – Y)), where L is the lost
setosa 10 0 0
versicolor 0 10 3 function of the residual
virginica 0 0 7
> names(nnet_iristrain)[8] <- ‘virginica’ • Learning another function gi(X) to predict the gradient of
Here is the Tree model that has been learned: the above function

DZone, Inc. | www.dzone.com

6 Machine Learning

• Update Fi+1 = Fi + a.gi(X), where a is the learning rate

> prediction <- predict.gbm(model, iris2[45:55,], type=”response”, n.trees=1000)
Below is Gradient-Boosted Tree using the decision tree as the > round(prediction, 3)
[1] 0.127 0.131 0.127 0.127 0.127 0.127 0.687 0.688 0.572 0.734 0.722
learning model F. Here is the sample code in R: > summary(model)
var rel.inf
1 Petal.Length 61.4203761582
2 Petal.Width 34.7557511871
> library(gbm) 3 Sepal.Width 3.5407662531
> iris2 <- iris 4 Sepal.Length 0.2831064016
> newcol = data.frame(isVersicolor=(iris2$Species==’versicolor’))
> iris2 <- cbind(iris2, newcol)
> iris2[45:55,]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species isVersicolor The GBM R package also gave the relative importance of the
45 5.1 3.8 1.9 0.4 setosa FALSE
46 4.8 3.0 1.4 0.3 setosa FALSE
input features, as shown in the bar graph.
47 5.1 3.8 1.6 0.2 setosa FALSE
48 4.6 3.2 1.4 0.2 setosa FALSE
49 5.3 3.7 1.5 0.2 setosa FALSE
50 5.0 3.3 1.4 0.2 setosa FALSE
51 7.0 3.2 4.7 1.4 versicolor TRUE
52 6.4 3.2 4.5 1.5 versicolor TRUE
53 6.9 3.1 4.9 1.5 versicolor TRUE
54 5.5 2.3 4.0 1.3 versicolor TRUE
55 6.5 2.8 4.6 1.5 versicolor TRUE
> formula <- isVersicolor ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.
Width
> model <- gbm(formula, data=iris2, n.trees=1000, interaction.depth=2,
distribution=”bernoulli”)
Iter TrainDeviance ValidDeviance StepSize Improve
1 1.2714 -1.#IND 0.0010 0.0008
2 1.2705 -1.#IND 0.0010 0.0004
3 1.2688 -1.#IND 0.0010 0.0007
4 1.2671 -1.#IND 0.0010 0.0008
5 1.2655 -1.#IND 0.0010 0.0008
6 1.2639 -1.#IND 0.0010 0.0007
7 1.2621 -1.#IND 0.0010 0.0008
8 1.2614 -1.#IND 0.0010 0.0003
9 1.2597 -1.#IND 0.0010 0.0008
10 1.2580 -1.#IND 0.0010 0.0008
100 1.1295 -1.#IND 0.0010 0.0008
200 1.0090 -1.#IND 0.0010 0.0005
300 0.9089 -1.#IND 0.0010 0.0005
400 0.8241 -1.#IND 0.0010 0.0004
500 0.7513 -1.#IND 0.0010 0.0004
600 0.6853 -1.#IND 0.0010 0.0003
700 0.6266 -1.#IND 0.0010 0.0003
800 0.5755 -1.#IND 0.0010 0.0002
900 0.5302 -1.#IND 0.0010 0.0002
1000 0.4901 -1.#IND 0.0010 0.0002

ABOUT THE AUTHOR RECOMMENDED BOOK

Ricky has spent the last 20 years developing and Introduction to Data Mining covers all aspects
designing large scale software systems including of data mining, taking both theoretical and
software gateways, fraud detection, cloud practical approaches to introduce a complex field
computing, web analytics, and online advertising. to those learning data mining for the first time.
He has played different roles from architect to Copious figures and examples bridge the gap
developer and consultant in helping companies from abstract to hands-on. The book requires
to apply statistics, machine learning, and optimization techniques to only basic background in statistics, and requires
extract useful insight from their raw data, and also predict business no background in databases. Includes detailed
trends. Ricky has 9 patents in the areas of distributed systems, cloud treatment of predictive modeling, association analysis, clustering,
computing and real-time analytics. He is very passionate about anomaly detection, visualization, and more. http://www-users.cs.umn.
algorithms and problem solving. He is an active blogger and maintains edu/~kumar/dmbook/index.php
a technical blog to share his ideas at http://horicky.blogspot.com

#82

Browse our collection of over 150 Free Cheat Sheets

Get More Refcardz! Visit refcardz.com

CONTENTS INCLUDE:
■

■
About Cloud Computing
Usage Scenarios Getting Started with

Aldon Cloud#64Computing
■
Underlying Concepts
Cost
by...
■

Upcoming Refcardz
youTechnologies ®
■
Data
t toTier
brough Comply.
borate.
Platform Management and more...
■

Chan
ge. Colla By Daniel Rubio

tion:
dz. com

also minimizes the need to make design changes to support

CON

tegra ternvasll
ABOUT CLOUD COMPUTING one time events. TEN TS
INC ■
HTML LUD E:

us Ind Anti-PPaat
Basics
Automated growthHTM
ref car

Web applications have always been deployed on servers & scalable

L vs XHT technologies

nuorn
■
Valid
ation one time events, cloud ML
connected to what is now deemed the ‘cloud’. Having the capability to support

Scala Collections
Usef
Du
ti
■

ul M.
computing platforms alsoulfacilitate
Open the gradual growth curves

n an
Page Source

o
■

s
Vis it

However, the demands and technology used on such servers Structur

C
faced by web applications. Tools

Core
By Key Structur e Elements
■

Patte
E: has changed substantially in recent years, especially with al Elem
INC LUD gration the entrance of service providers like Amazon, Google and Large scale growth scenarios involvingents
specialized
NTS and mor equipment
rdz !

ous Inte Change

HTML
CO NTE Microsoft. es e... away by
(e.g. load balancers and clusters) are all but abstracted
Continu at Every

VisualVM
e chang
About ns to isolat
relying on a cloud computing platform’s technology.
Software i-patter
■
n space
Re fca

e Work
Build
riptio
and Ant
Desc
These companies have a Privat
are in long deployed
trol repos
itory
webmana applications
ge HTM
L BAS
■

Patterns Control lop softw n-con to

■
that adapt and scale
Deve
les toto large user
a versio ing and making them
bases, In addition, several cloud computing ICSplatforms support data
ment ize merg
rn
Version e... Patte it all fi minim le HTM
Manage s and mor e Work knowledgeable in amany ine to
mainl aspects related tos multip
cloud computing. tier technologies that Lexceed the precedent set by Relational
space Comm and XHT
■

Build
re

Privat lop on that utilize HTML MLReduce,

Practice

Opa
■
Deve code lines a system Database Systems (RDBMS): is usedMap are web service APIs,
d as thescalethe By An
sitory of work prog foundati
Ge t Mo

Buil Repo
This Refcard active
will introduce are within
to you to cloud riente computing, with an
d units
RATION etc. Some platforms ram support large grapRDBMS deployments.
■
The src
dy Ha
softw
e ine loping and Java s written in hical on of
INTEG attribute
task-o it all
softwar emphasis onDeve es by
Mainl
S these
ines providers, chang so youComm can better understand
also rece JavaScri user interfac web develop and the desc rris
Vis i

codel
INUOU Task Level
ding e code as the
ww w.dzone.com

NT of buil trol what it is a cloudnize

line Policy sourc es as aplatform
computing can offer your ut web output ive data pt. Server-s e in clien ment. the ima alt attribute ribes whe
T CO cess ion con Code Orga it chang e witho likew mec ide lang t-side ge is describe re the ima
ABOU the pro ject’s vers applications. and subm with uniqu
e name
are from
sourc CLOUD COMPUTING ise hanism. fromAND
PLATFORMS web unavaila
was onc use HTML The eme pages and uages like ge file
it
(CI) is Nested s alte
rd z!

a pro evel Comm the build softw ble. rnate

Data Warehousing
build um
UNDERLYING CONCEPTS can be
gration ed to blem Task-L Label
activit
ies to
the bare
minim e
standard a very loos
and XHT rging use HTM PHP Tags tags text that found,
ous Inte committ to a pro USAGE SCENARIOS ate all
Autom configurat
ion cies to t
ization, ely-defi
ML as
thei
Ajax
tech L can is disp
Continu ry change cannot be (and freq
nden ymen layed
tion ned lang r visual eng nologies
Re fca

Build al depe need

a solu ineffective
deplo t
Label manu d tool the same for stan but as if
(i.e., ) nmen overlap
eve , Build stalle
t, use target enviro Amazon EC2: Industry standard it has
software and uagvirtualization ine. HTM uen
with terns ns (i.e. problem ated ce pre-in whether dards e with b></ , so <a>< tly are) nest
has bec become
Autom Redu ymen a> is
ory. via pat ticular d deploEAR) in each very little L
reposit -patter s that Pay only cieswhat you consume
tagge or Amazon’s cloud
the curr you cho platform
computing isome
heavily based moron fine. b></ ed insid
lained ) and anti the par solution duce nden For each (e.g. WAR es t
ent stan ose to writ more e imp a></ e
not lega each othe
x” b> is
be exp text to “fi are al Depe
Web application deployment ge until
nden a few years
t librari agonmen
t enviro was similar that will softwaredard
industry standard and virtualization app
e HTM technology.
orta nt,
tterns to pro Minim packa
Mo re

CI can ticular con used can rity all depe all targe
s will L or XHT arent. Reg the HTM l, but r. Tags
etimes Anti-pa they
tend
es, but to most phone services:
y Integ alizeplans with le that
late fialloted resources, ts with an and XHT simplify all help ML, und ardless
L VS
XHTM <a><
in a par hes som
Centr
end,
Binar
pro cess. the practic enti ng incurred costgeme
nt
whether e a single
such
temp
resources on
were consumed
t enviro
nmen
orthenot. Virtualization MLaare your
othe
you prov
erst of L
b></
in muchallows physical piece of hardware to
ide abe HTM
rily bad
Mana based
approac ed with the cial, but, implem anding
Creat targe es to actually r web
nden
cy rties are nt
of the solid L has
into differe coding.resources
necessa pared to
chang
efi Depe prope itting utilized by multiple operating
function systems.simplerThis allows foundati job adm been arou
associat to be ben
er
Ge t

te builds commo
are not Fortuna
late Verifi e comm than on
n com Cloud computing asRun it’sremo
known etoday has changed this.
befor alitytohas irably, nd for
They they
etc.
Temp Build (e.g. bandwidth, n memory, CPU) be allocated exclusively to tely exp that som
lts whe
ually,
appear effects. Privat y, contin Every elem mov used
to be, HTML
ected. job has e time
ed resu The various resourcesPerfo rm a
consumed by webperio applications
dicall (e.g. nt team
pag
individual operating entsinstances. ed to CSS
system Brow Early . Whi
gration expand
opme because
adverse unintend d Builds sitory Build r to devel common e (HTML . ser HTM
ed far le it has don
ous Inte web dev manufacture L had very
Stage Repo
e bandwidth, memory, CPU) are tallied
an Integ
ration on a per-unit CI serve basis or XHT
duc tinu e Build from
extensio .) All are more e its
e.com

pro ard rm ML shar limit

tern.
ack rs add than
Con Refc Privat
from zero) by Perfo elopers
(starting all majorated cloud ed man ed layout
feedb computing platforms. As a user of Amazon’s esse
term e, this
on
n. HTM EC2 cloud
ntiallycomputing es certplatform, you are result anybody
the pat
occur came
gration as based is
of the ” cycl such Build Send
autom as they builds proc
assigned esso L system
files shou plai wayain elem The late a lack of stan up with clev y competi
supp

DZone, Inc.
soon an operating in the same as on allents
hosting
tinu ous Inte tional use and test concepts Integ
ration ors as entat
ion with r
ld not
n text
in st web dar er wor ng stan
ort.
ven “build include
docum
Con be cr kar dar
the con s to the standar
oper
rate devel
While CI to Gene
efer ion of
Cloud Computing

the not
s on
expand

150 Preston Executive Dr.

Suite 200
Cary, NC 27513
DZone communities deliver over 6 million pages each month to 888.678.0399
more than 3.3 million software developers, architects and decision 919.678.0300
makers. DZone offers something for everyone, including news,
Refcardz Feedback Welcome
$7.95

tutorials, cheat sheets, blogs, feature articles, source code and more.
refcardz@dzone.com
“DZone is a developer’s dream,” says PC Magazine.
Sponsorship Opportunities
Copyright © 2012 DZone, Inc. All rights reserved. No part of this publication may be reproduced, stored in a
retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, sales@dzone.com Version 1.0
without prior written permission of the publisher.

Big Data Machine Learning
100% (1)
Big Data Machine Learning
6 pages
4503 Rc158 010d Machinelearning 1
100% (1)
4503 Rc158 010d Machinelearning 1
6 pages
WEEK
No ratings yet
WEEK
17 pages
Final Data Lab
No ratings yet
Final Data Lab
21 pages
R Programming for Data Science
No ratings yet
R Programming for Data Science
20 pages
7708 - MBA PredAnanBigDataNov21
No ratings yet
7708 - MBA PredAnanBigDataNov21
11 pages
Ekstrøm, Claus Thorn - Sørensen, Helle - Introduction To Statistical Data Analysis For The Life Sciences-CRC Press (2014)
No ratings yet
Ekstrøm, Claus Thorn - Sørensen, Helle - Introduction To Statistical Data Analysis For The Life Sciences-CRC Press (2014)
521 pages
Data Scinece Practical File
No ratings yet
Data Scinece Practical File
23 pages
The Multivariate Social Scientist Introductory Statistics Using Generalized Linear Models Sofroniou
100% (10)
The Multivariate Social Scientist Introductory Statistics Using Generalized Linear Models Sofroniou
49 pages
Chapter 08 Inference
No ratings yet
Chapter 08 Inference
34 pages
Big Data Mid Term
No ratings yet
Big Data Mid Term
14 pages
Data Science Project
No ratings yet
Data Science Project
31 pages
Math631 Course Notes
No ratings yet
Math631 Course Notes
281 pages
Statistics 231 Course Notes
No ratings yet
Statistics 231 Course Notes
204 pages
Introduction To Bayesian Methods in Ecology and Natural Resources Exclusive Download
100% (12)
Introduction To Bayesian Methods in Ecology and Natural Resources Exclusive Download
15 pages
STAT 231 Course Notes W16 Print
No ratings yet
STAT 231 Course Notes W16 Print
424 pages
BDA Lab Manual (12 Weeks)
No ratings yet
BDA Lab Manual (12 Weeks)
22 pages
AI Lec 3
No ratings yet
AI Lec 3
36 pages
Intro LOGIT
No ratings yet
Intro LOGIT
46 pages
S 15 Notes
No ratings yet
S 15 Notes
216 pages
Stat 231 Course Notes
100% (1)
Stat 231 Course Notes
326 pages
Applied Statistics With Python
100% (4)
Applied Statistics With Python
320 pages
Modern Applied Regressions
No ratings yet
Modern Applied Regressions
298 pages
20BCE1205 Lab6
No ratings yet
20BCE1205 Lab6
12 pages
Machine Learning - Lab Record
No ratings yet
Machine Learning - Lab Record
43 pages
AI & ML Lab Journal for MCA Students
No ratings yet
AI & ML Lab Journal for MCA Students
77 pages
Statistics For Applied Science 200l
No ratings yet
Statistics For Applied Science 200l
122 pages
Big Data
No ratings yet
Big Data
17 pages
FE - Final Project Report【0809v2】
No ratings yet
FE - Final Project Report【0809v2】
12 pages
BAN5
No ratings yet
BAN5
2 pages
The Multivariate Social Scientist Introductory Statistics Using Generalized Linear Models Sofroniou Complete Edition
No ratings yet
The Multivariate Social Scientist Introductory Statistics Using Generalized Linear Models Sofroniou Complete Edition
168 pages
BDA MSC It
No ratings yet
BDA MSC It
35 pages
Analyzing and Modeling Rank Data
No ratings yet
Analyzing and Modeling Rank Data
28 pages
Statistical Regression
No ratings yet
Statistical Regression
32 pages
BSc. AC-Sem IV
No ratings yet
BSc. AC-Sem IV
19 pages
STAT 231 Course Notes Winter
100% (1)
STAT 231 Course Notes Winter
358 pages
Tree Models in Insurance Pricing
No ratings yet
Tree Models in Insurance Pricing
142 pages
1
No ratings yet
1
130 pages
Bayesian Statistical Methods (Brian J. Reich, Sujit K. Ghosh)
No ratings yet
Bayesian Statistical Methods (Brian J. Reich, Sujit K. Ghosh)
288 pages
Basic Stats
No ratings yet
Basic Stats
49 pages
SSRN 3526707
No ratings yet
SSRN 3526707
5 pages
1
No ratings yet
1
19 pages
The R Primer 2nd Edition Claus Thorn Ekstrøm Download Full Chapters
No ratings yet
The R Primer 2nd Edition Claus Thorn Ekstrøm Download Full Chapters
170 pages
Introduction To Statistical Modelling PDF
100% (1)
Introduction To Statistical Modelling PDF
133 pages
Course Notes
No ratings yet
Course Notes
141 pages
Rinku Mitra MLID241017
No ratings yet
Rinku Mitra MLID241017
18 pages
Introduction To Statistics
No ratings yet
Introduction To Statistics
46 pages
Logit Probit
No ratings yet
Logit Probit
66 pages
Statistical Machine Learning Assignment
No ratings yet
Statistical Machine Learning Assignment
5 pages
The Multivariate Social Scientist Introductory Statistics Using Generalized Linear Models Sofroniou Instant Download
100% (10)
The Multivariate Social Scientist Introductory Statistics Using Generalized Linear Models Sofroniou Instant Download
71 pages
ANOVA3
No ratings yet
ANOVA3
194 pages
Class Notes
No ratings yet
Class Notes
147 pages
Islp 1
No ratings yet
Islp 1
15 pages
Introductory Statics For The Life and Biomedical Sciences
100% (1)
Introductory Statics For The Life and Biomedical Sciences
348 pages
Akritas Probability & Statistics With R For Engineers and Scientists
No ratings yet
Akritas Probability & Statistics With R For Engineers and Scientists
256 pages
Statistical Thinking From Scratch: A Primer For Scientists M. D. Edge Instant Download
No ratings yet
Statistical Thinking From Scratch: A Primer For Scientists M. D. Edge Instant Download
147 pages
DSAI
No ratings yet
DSAI
7 pages
DevOps CICD Pipeline Workshop
100% (3)
DevOps CICD Pipeline Workshop
123 pages
Continuous Delivery Essentials
No ratings yet
Continuous Delivery Essentials
7 pages
Opus by Owl
No ratings yet
Opus by Owl
34 pages
Process Synchronization Buffer Queue: Producer-Consumer Problem
No ratings yet
Process Synchronization Buffer Queue: Producer-Consumer Problem
10 pages
Scientific Paper. Ronald Tres Reyes
No ratings yet
Scientific Paper. Ronald Tres Reyes
6 pages
Effective Management of Educational Institutions in Tanzania
No ratings yet
Effective Management of Educational Institutions in Tanzania
12 pages
Thesis Proposal Structure
100% (2)
Thesis Proposal Structure
7 pages
Spinozas Ethics A Critical Guide PDF
100% (6)
Spinozas Ethics A Critical Guide PDF
24 pages
Research Paper Structure Guide
No ratings yet
Research Paper Structure Guide
10 pages
Grade 10 Science Lesson Plan
No ratings yet
Grade 10 Science Lesson Plan
3 pages
PHD EvaluationPracticumStudent
No ratings yet
PHD EvaluationPracticumStudent
2 pages
Listening Speaking Skills
100% (3)
Listening Speaking Skills
57 pages
3 - Light in Architecture and Psychology
0% (1)
3 - Light in Architecture and Psychology
17 pages
ED605013
No ratings yet
ED605013
12 pages
New Routes To Power: Towards A Typology of Power Mediation
No ratings yet
New Routes To Power: Towards A Typology of Power Mediation
22 pages
G1 Term2 Ch6 Water
No ratings yet
G1 Term2 Ch6 Water
8 pages
Essential Skills of Medical Teacher
100% (6)
Essential Skills of Medical Teacher
305 pages
DYP - Chapter 1 - G8
No ratings yet
DYP - Chapter 1 - G8
3 pages
Toru Shirai's Biography Reliability Study
No ratings yet
Toru Shirai's Biography Reliability Study
12 pages
Topic 4 KBSR Science Curriculum I
No ratings yet
Topic 4 KBSR Science Curriculum I
16 pages
BA 177 Refelction (Exam 2) - 11may 2020
No ratings yet
BA 177 Refelction (Exam 2) - 11may 2020
9 pages
PUC Position Paper RUBRIC
No ratings yet
PUC Position Paper RUBRIC
2 pages
Narasaraopeta Engineering College Accreditation Report
No ratings yet
Narasaraopeta Engineering College Accreditation Report
313 pages
Process Essay Writing Guide
No ratings yet
Process Essay Writing Guide
3 pages
Planning, Monitoring and Evaluation
No ratings yet
Planning, Monitoring and Evaluation
40 pages
Exploring The Psychological Benefits of Hardship A Critical Reassessment of Posttraumatic Growth 1st Edition Eranda Jayawickreme PDF Download
100% (3)
Exploring The Psychological Benefits of Hardship A Critical Reassessment of Posttraumatic Growth 1st Edition Eranda Jayawickreme PDF Download
56 pages
Trifa Raluca - Crossing Boundaries - A New Methodological Model For The Evaluation of Industrial Heritage
No ratings yet
Trifa Raluca - Crossing Boundaries - A New Methodological Model For The Evaluation of Industrial Heritage
2 pages
Data Poisoning
No ratings yet
Data Poisoning
3 pages
Chapter 3
No ratings yet
Chapter 3
4 pages
Being Manager Leader and Coach
No ratings yet
Being Manager Leader and Coach
78 pages
Sensory Integration - A Companion To The NCAEP Report
No ratings yet
Sensory Integration - A Companion To The NCAEP Report
6 pages
EDUC 2220-Educational Technology Lesson Plan Narrative Writing
No ratings yet
EDUC 2220-Educational Technology Lesson Plan Narrative Writing
4 pages
EDU 532 - Lesson Planning BULAWIT HIPOLITO PINEDA..
No ratings yet
EDU 532 - Lesson Planning BULAWIT HIPOLITO PINEDA..
12 pages
Ielts Vaccine Listening
No ratings yet
Ielts Vaccine Listening
73 pages

Cheatsheet - BigData

Uploaded by

Cheatsheet - BigData

Uploaded by

#158

Get More Refcardz! Visit refcardz.com

Big Data Machine Learning:

INTRODUCTION > library(car)

(a value that is discrete, finite with no ordering implied), and 2.

Linear regression has the longest, most well-understood history

> iristrain <- iris[-testidx,]

To illustrate a regression problem (where the output we predict

DZone, Inc. | www.dzone.com

Cost == Non-regularization-cost + λ (α.Σ ||Ɵi|| + (1- α).Σ Ɵi2)

This is the cross-validation plot. It shows the best lambda with

DZone, Inc. | www.dzone.com

prediction setosa versicolor virginica

A Neural Network emulates the structure of a human brain as a

Neural networks are very good at learning non-linear functions.

SUPPORT VECTOR MACHINE

A Support Vector Machine provides a binary classification

In this setting, neurons are organized in multiple layers where

DZone, Inc. | www.dzone.com

If the data is not linearly separable due to noise (the majority

The kernel function provides a cheap way to equivalently

prediction setosa versicolor virginica

prediction. We therefore use the calibration technique to find a

P(X | Y) = P(X1 | Y) * P(X2 | X1, Y) * P(X3 | X1, X2, Y)

DZone, Inc. | www.dzone.com

To determine the “nearest neighbors,” a distance function

DZone, Inc. | www.dzone.com

• Update Fi+1 = Fi + a.gi(X), where a is the learning rate

ABOUT THE AUTHOR RECOMMENDED BOOK

Browse our collection of over 150 Free Cheat Sheets

also minimizes the need to make design changes to support

Web applications have always been deployed on servers & scalable

However, the demands and technology used on such servers Structur

ous Inte Change

Patterns Control lop softw n-con to

Privat lop on that utilize HTML MLReduce,

NT of buil trol what it is a cloudnize

a pro evel Comm the build softw ble. rnate

Build al depe need

pro ard rm ML shar limit

150 Preston Executive Dr.

You might also like