CP 3

The document discusses various techniques for modeling relationships between variables using regression and classification trees, including decision trees, regression splines, smoothing splines, local regression, and generalized additive models (GAM). It covers the basic structure and optimization goals of decision trees, as well as pros and cons compared to linear models. Regression tree techniques are also discussed for classification problems using classification error rate, Gini index, and cross-entropy for evaluating splits.


library(ISLR)
attach(Wage)

# Polynomial Regression I
fit = lm(wage ~ poly(age, 4), data=Wage)                   # orthogonal polynomials
coef(summary(fit))                                         # print coefficient table
fit2  = lm(wage ~ poly(age, 4, raw=T), data=Wage)          # raw (non-orthogonal) polynomials
fit2a = lm(wage ~ age + I(age^2) + I(age^3) + I(age^4), data=Wage)
fit2b = lm(wage ~ cbind(age, age^2, age^3, age^4), data=Wage)   # same fitted values, different coefficients
agelims  = range(age)
age.grid = seq(from=agelims[1], to=agelims[2])             # integer grid over the observed age range
preds = predict(fit, newdata=list(age=age.grid), se=TRUE)  # make prediction
se.bands = cbind(preds$fit + 2*preds$se.fit, preds$fit - 2*preds$se.fit)   # standard error band at 2 se
par(mfrow=c(1,2), mar=c(4.5,4.5,1,1), oma=c(0,0,4,0))      # 1x2 grid; mar: (bottom,left,top,right); oma: outer margin
plot(age, wage, xlim=agelims, cex=.5, col='darkgrey')
title('D-4 Poly', outer=T)
lines(age.grid, preds$fit, lwd=2, col='blue')              # add fit curve
matlines(age.grid, se.bands, lwd=1, col='blue', lty=3)     # add standard error band
fit.1 = lm(wage ~ age, data=Wage)
fit.2 = lm(wage ~ poly(age, 2), data=Wage)
fit.3 = lm(wage ~ poly(age, 3), data=Wage)
fit.4 = lm(wage ~ poly(age, 4), data=Wage)
fit.5 = lm(wage ~ poly(age, 5), data=Wage)
anova(fit.1, fit.2, fit.3, fit.4, fit.5)                   # choose the degree: stop where the added term's p-value becomes insignificant

# Polynomial Regression II (logistic)
fit = glm(I(wage > 250) ~ poly(age, 4), data=Wage, family=binomial)   # create fit
preds = predict(fit, newdata=list(age=age.grid), se=T)     # make prediction (on the logit scale)
# alternative: preds = predict(fit, newdata=list(age=age.grid), type='response', se=T)
pfit = exp(preds$fit) / (1 + exp(preds$fit))               # convert logit to probability estimate
se.bands.logit = cbind(preds$fit + 2*preds$se.fit, preds$fit - 2*preds$se.fit)
se.bands = exp(se.bands.logit) / (1 + exp(se.bands.logit)) # standard error band at 2 se, on the probability scale
plot(age, I(wage > 250), xlim=agelims, type='n', ylim=c(0, .2))
points(jitter(age), I((wage > 250)/5), cex=.5, pch='|', col='darkgrey')   # jitter(): 'rug plot' so values do not overlap
lines(age.grid, pfit, lwd=2, col='blue')
matlines(age.grid, se.bands, lwd=1, col='blue', lty=3)     # plot i) fit, ii) 2-se bands

# Step Functions
table(cut(age, 4))                                         # counts in 4 'age buckets'
fit = lm(wage ~ cut(age, 4), data=Wage)                    # piecewise-constant (partitioned) fit
coef(summary(fit))

# Splines I (regression splines)
library(splines)
fit = lm(wage ~ bs(age, knots=c(25, 40, 60)), data=Wage)   # bs(): matrix of basis functions for the specified knots
pred = predict(fit, newdata=list(age=age.grid), se=T)      # make prediction
plot(age, wage, col='gray')
lines(age.grid, pred$fit, lwd=2)
lines(age.grid, pred$fit + 2*pred$se, lty='dashed')
lines(age.grid, pred$fit - 2*pred$se, lty='dashed')
dim(bs(age, knots=c(25, 40, 60)))                          # two ways to check df
attr(bs(age, df=6), 'knots')                               # knots placed at quantiles of age

# Splines II (natural splines)
fit2 = lm(wage ~ ns(age, df=4), data=Wage)
pred2 = predict(fit2, newdata=list(age=age.grid), se=T)
lines(age.grid, pred2$fit, col='red', lwd=2)

# Splines III (smoothing splines)
fit  = smooth.spline(age, wage, df=16)
fit2 = smooth.spline(age, wage, cv=TRUE)                   # leave-one-out CV selects fit2$df of about 6.8
plot(age, wage, xlim=agelims, cex=.5, col='darkgrey')
lines(fit,  col='red',  lwd=2)
lines(fit2, col='blue', lwd=2)

# Local Regression
fit  = loess(wage ~ age, span=.2, data=Wage)               # span=.2: neighborhood consists of 20% of the observations
fit2 = loess(wage ~ age, span=.5, data=Wage)
plot(age, wage, xlim=agelims, cex=.5, col='darkgrey')
lines(age.grid, predict(fit,  data.frame(age=age.grid)), col='red',  lwd=2)
lines(age.grid, predict(fit2, data.frame(age=age.grid)), col='blue', lwd=2)

# GAM
gam1 = lm(wage ~ ns(year, 4) + ns(age, 5) + education, data=Wage)   # ns() for year & age; education enters as a regular qualitative predictor
library(gam)
gam.m3 = gam(wage ~ s(year, 4) + s(age, 5) + education, data=Wage)
par(mfrow=c(1,3))
plot(gam.m3, se=T, col='blue')                             # one panel per predictor, each showing that predictor's fitted contribution to the response
gam.m1 = gam(wage ~ s(age, 5) + education, data=Wage)
gam.m2 = gam(wage ~ year + s(age, 5) + education, data=Wage)
anova(gam.m1, gam.m2, gam.m3, test='F')                    # model comparison
gam.lo   = gam(wage ~ s(year, df=4) + lo(age, span=.7) + education, data=Wage)   # lo(): local regression term
gam.lo.i = gam(wage ~ lo(year, age, span=.5) + education, data=Wage)             # local-regression interaction of year and age
gam.lr = gam(I(wage > 250) ~ year + s(age, df=5) + education, family=binomial, data=Wage)   # logistic GAM
par(mfrow=c(1,3))
plot(gam.lr, se=T, col='green')
7 Tree-Based Models

7.1 Decision Trees

7.1.1 Model of DT

In a typical decision tree (DT) task, we have n observations x_1, ..., x_n with p predictors, and we would like to compute an estimate \hat{y}_i for each response y_i. Graphically, the standard example illustrates how the predictors years and hits are used to predict a baseball player's salary. In that example, each of the two predictors is split in two at a dividing point chosen to minimize the RSS (defined below). The tree can equivalently be represented as a partition of the predictor space into decision regions, as in Fig. 7.2.

Having the basic setup of a decision tree task in mind, we now formulate the prediction rule and the optimization goal of a decision tree.

• Prediction
– Given the set of possible values of the predictors X_1, ..., X_p, partition this predictor space into J distinct and non-overlapping regions R_1, ..., R_J.
– For every observation x_i falling in region R_j, the prediction \hat{y}_i is the mean of the training responses y_i in R_j, denoted \hat{y}_{R_j}.

• Optimization Goal

\sum_{j=1}^{J} \sum_{i \in R_j} (y_i - \hat{y}_{R_j})^2    (7.1)

Essentially, in constructing a decision tree, we make two decisions:

• The cutting points s_1, ..., s_k, one per predictor used; the cutting point s_j splits predictor X_j into two decision regions:

R_1(j, s_j) = \{ X \mid X_j < s_j \}  and  R_2(j, s_j) = \{ X \mid X_j \geq s_j \}    (7.2)

• The sequence of predictors X_1, ..., X_k, where k ≤ p, by which the partitioning of the decision space is carried out. The sequence should minimize the combined RSS over all decision regions:

\sum_{j=1}^{J} \sum_{i: x_i \in R_j(j, s_j)} (y_i - \hat{y}_{R_j})^2    (7.3)

In practice it is clearly inefficient to scan through all possible sequences (i.e. all possible tree structures). Further, for the sake of a simple model we would like to involve as few predictors (and thus decision regions) as possible, at a reasonable cost in RSS; the more predictors we use, the lower the RSS will be on the training set, but this risks overfitting the model. Therefore, the optimization goal in Eq. 7.1 is modified with a penalty term so that the number of terminal nodes of the tree is also kept small (Eq. 7.4, where |T| is the number of terminal nodes and m indexes the decision regions):

\sum_{m=1}^{|T|} \sum_{i: x_i \in R_m} (y_i - \hat{y}_{R_m})^2 + \alpha |T|    (7.4)
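To make the split rule of Eq. 7.2 and Eq. 7.3 concrete, here is a minimal R sketch (not part of the original notes; the function name best.split and its brute-force search are illustrative only) of how a single greedy split is chosen: for every predictor j and every candidate cut point s, compute the two-region RSS and keep the pair that minimizes it.

# illustrative helper: exhaustive search for the single best split (j, s) under Eq. 7.3
best.split = function(X, y) {                      # X: data frame of predictors, y: numeric response
  best = list(rss = Inf)
  for (j in seq_along(X)) {
    for (s in sort(unique(X[[j]]))) {
      left  = y[X[[j]] <  s]                       # region R1(j, s)
      right = y[X[[j]] >= s]                       # region R2(j, s)
      if (length(left) == 0 || length(right) == 0) next
      rss = sum((left - mean(left))^2) + sum((right - mean(right))^2)
      if (rss < best$rss) best = list(j = names(X)[j], s = s, rss = rss)
    }
  }
  best                                             # predictor, cut point, and resulting RSS
}
# recursive binary splitting applies this same search again within each newly created region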

The sequence selection can be carried out with some variation of the forward/backward/hybrid selection procedures (cf. Ch 5.1), which are not elaborated here. To guard against overfitting, each candidate tree is also subjected to cross-validation, in which the MSE is computed to evaluate that particular tree's performance.
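In practice the greedy growing, the cost-complexity penalty of Eq. 7.4, and the cross-validation step are handled by the tree package. A minimal sketch on the Hitters data from ISLR (the baseball salary example mentioned above; object names such as tree.fit, cv.fit, and pruned are arbitrary):

library(tree)
hit = na.omit(Hitters)                             # Salary has missing values
tree.fit = tree(log(Salary) ~ Years + Hits, data=hit)
plot(tree.fit); text(tree.fit)                     # tree diagram with the split rule at each node
cv.fit = cv.tree(tree.fit)                         # CV over the cost-complexity (alpha) sequence of Eq. 7.4
plot(cv.fit$size, cv.fit$dev, type='b')            # deviance (RSS) against the number of terminal nodes |T|
pruned = prune.tree(tree.fit, best=3)              # keep the subtree with 3 terminal nodes
plot(pruned); text(pruned)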
A regression tree used in a classification task (a classification tree) differs in both the way predictions are made and the optimization goal.

• Prediction: each observation is assigned to the most commonly occurring class among the training observations in its decision region.

• Optimization Goals, where \hat{p}_{mk} is the proportion of training observations in the mth region that are from the kth class:

– Classification Error Rate: E = 1 - \max_k (\hat{p}_{mk})    (7.5)

– Gini Index: G = \sum_{k=1}^{K} \hat{p}_{mk} (1 - \hat{p}_{mk})    (7.6)

– Cross-Entropy: D = - \sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}    (7.7)

The Gini index is a measure of node purity: a small value indicates that a node contains predominantly observations from a single class. Cross-entropy likewise measures node purity. In building a classification tree, the Gini index or cross-entropy is used to evaluate the quality of a particular split when growing the tree, since both are sensitive to node purity; the classification error rate is preferable when pruning if the objective is the prediction accuracy of the final tree. Finally, note that node purity is important because it reduces the uncertainty of a decision when information is incomplete.
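As a sketch of these criteria in practice, the example below grows a classification tree on the Carseats data from the ISLR package (a dataset not used elsewhere in these notes, chosen only for illustration): tree() splits by deviance (cross-entropy) by default and accepts split='gini' for the Gini index, while prune.misclass() prunes using the classification error rate.

library(tree)
cs = Carseats
cs$High = factor(ifelse(cs$Sales > 8, 'Yes', 'No'))   # binarize the response
tree.cs = tree(High ~ . - Sales, data=cs)             # splits chosen by deviance (cross-entropy); split='gini' would use the Gini index
summary(tree.cs)                                      # reports the training misclassification error rate
cv.cs = cv.tree(tree.cs, FUN=prune.misclass)          # prune guided by the classification error rate
best.size = cv.cs$size[which.min(cv.cs$dev)]          # |T| with the lowest CV error
pruned.cs = prune.misclass(tree.cs, best=best.size)
plot(pruned.cs); text(pruned.cs, pretty=0)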
7.1.2 DT: Pros & Cons

Many tasks can be approached with either a DT or a linear model, so we need to decide which one is better suited to a particular data set and task. A general rule of thumb is as follows: a linear model works better if the relationship between the predictors and the response is close to linear; if this relationship is highly non-linear and complex, a DT is the better bet. More generally, the pros and cons of DT are listed as follows:
