Week 3

Introductory Applied Machine Learning

Generalization, Overfitting, Evaluation

Victor Lavrenko and Nigel Goddard


School of Informatics
Generalization
• Training data: {xi, yi}
  • examples that we used to train our predictor
  • e.g. all emails that our users labelled ham / spam
• Future data: {xi, ?}
  • examples that our classifier has never seen before
  • e.g. emails that will arrive tomorrow
• Want to do well on future data, not training
  • not very useful: we already know yi
  • easy to be perfect on training data (DT, kNN, kernels)
  • does not mean you will do well on future data
  • can over-fit to idiosyncrasies of our training data
Under- and Over-fitting
• Over-fitting:
  • predictor too complex (flexible)
  • fits “noise” in the training data
    • patterns that will not re-appear
  • predictor F over-fits the data if:
    • we can find another predictor F’
    • which makes more mistakes on training data: Etrain(F’) > Etrain(F)
    • but fewer mistakes on unseen future data: Egen(F’) < Egen(F)
• Under-fitting:
  • predictor too simplistic (too rigid)
  • not powerful enough to capture salient patterns in data
  • can find another predictor F’ with smaller Etrain and Egen
[Figure: an under-fit F and an over-fit F compared on training and future data]
Under- and Over-fitting examples
• Regression:
  • predictor too inflexible: cannot capture the pattern
  • predictor too flexible: fits noise in the data
• Classification:
[Figure: under-fit vs. over-fit examples for regression and classification]


Flexible vs. inflexible predictors
• Each dataset needs a different level of “flexibility”
  • depends on task complexity + available data
  • want a “knob” to get rigid / flexible predictors
• Most learning algorithms have such knobs:
  • regression: order of the polynomial
  • NB: number of attributes, limits on σ², ε
  • DT: #nodes in the tree / pruning confidence
  • kNN: number of nearest neighbors
  • SVM: kernel type, cost parameter
• Tune to minimize generalization error (see the sketch below)
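A minimal sketch of turning such a knob for polynomial regression, under assumptions not in the slides: a hypothetical noisy sine dataset and numpy's polyfit/polyval as the learner, with training and held-out error compared at each polynomial order.

```python
# Sketch (not from the slides): vary the "flexibility knob" for polynomial
# regression and compare training vs. held-out error on made-up data.
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    x = rng.uniform(0.0, 1.0, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, n)
    return x, y

x_train, y_train = make_data(30)   # training set
x_test, y_test = make_data(1000)   # stand-in for "future" data

for order in [0, 1, 3, 9]:         # the knob: order of the polynomial
    coeffs = np.polyfit(x_train, y_train, order)
    mse = lambda x, y: np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(f"order={order}  E_train={mse(x_train, y_train):.3f}  "
          f"E_test={mse(x_test, y_test):.3f}")
# E_train can only go down as the order grows; E_test typically falls and
# then rises again once the polynomial starts fitting noise (over-fitting).
```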
Training vs. Generalization Error
• Training error: the error measured on the training examples
  • Etrain = (1/n) Σi error(f(xi), yi), averaged over the n training examples
  • f(xi) = the value we predicted, yi = the true value
• Generalization error: how well we will do on future data
  • don’t know what future data xi will be
  • don’t know what labels yi it will have
  • but we know the “range” of all possible {x,y}
    • e.g. x: all possible 20x20 black/white bitmaps, y: {0,1,…,9} (digits)
  • Egen = Σx,y p(x,y) · error(f(x), y), where p(x,y) is how often we expect to see such an x and y
  • can never compute it as before: cannot enumerate all possible x,y
• Are the two the same? If different, by how much? Usually Etrain ≤ Egen
Estimating Generalization Error
• Testing error: the same error measure, computed over a testing set
  • set aside part of the training data (the testing set)
  • learn a predictor without using any of this data
  • predict values for the testing set, compute the error
  • gives an estimate of the true generalization error
  • if the testing set is an unbiased sample from p(x,y): Etest → Egen as n → ∞
• How close? Depends on n
  • Ex: binary classification, 100 instances
  • assume: 75 classified correctly, 25 incorrectly
  • Etest = 0.25, Egen around 0.25, but how close?
Confidence Interval for Future Error
• What range of errors can we expect for future test sets?
  • Etest ± ΔE such that 95% of future test sets fall within that interval
• Etest is an unbiased estimate of E = true error rate
  • E = probability our system will misclassify a random instance
  • take a random set of n instances, how many are misclassified? our test set is one such set
  • flip an E-biased coin n times, how many heads will we get?
  • Binomial distribution with mean = nE, variance = nE(1-E)
  • Efuture = #misclassified / n, approximately Gaussian with mean E, variance = E(1-E)/n
  • 2/3 of future test sets will have error in E ± √(E(1-E)/n)
• p% confidence interval for future error: Etest ± zp·√(Etest(1-Etest)/n)
  • for n = 100 examples, p = 0.95 and E = 0.25
  • σ = √(0.25×0.75/100) = 0.043
  • CI = 0.25 ± 1.96·σ = 0.25 ± 0.08
  • n = 100, p = 0.99 → CI = 0.25 ± 0.11
  • n = 10000, p = 0.95 → CI = 0.25 ± 0.008
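A small sketch of this interval calculation; z = 1.96 and z = 2.58 are the standard normal quantiles for 95% and 99% confidence used in the numbers above.

```python
# Sketch of the normal-approximation interval: Etest ± z * sqrt(Etest(1-Etest)/n)
import math

def error_confidence_interval(n_wrong, n_total, z=1.96):
    e = n_wrong / n_total                            # observed test error rate
    half_width = z * math.sqrt(e * (1 - e) / n_total)
    return e - half_width, e + half_width

print(error_confidence_interval(25, 100))            # ~(0.17, 0.33), i.e. 0.25 ± 0.08
print(error_confidence_interval(25, 100, z=2.58))    # 0.25 ± ~0.11
print(error_confidence_interval(2500, 10000))        # 0.25 ± ~0.008
```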
Training, Validation, Testing sets
• Training set: construct classifier
  • NB: count frequencies, DT: pick attributes to split on
• Validation set: pick algorithm + knob settings
  • pick best-performing algorithm (NB vs. DT vs. …)
  • fine-tune knobs (tree depth, k in kNN, c in SVM …)
• Testing set: estimate future error rate
  • never report best of many runs
  • run only once, or report results of every run
• Split randomly to avoid bias
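A minimal sketch of such a random three-way split in plain numpy; the 60/20/20 proportions are an illustrative assumption, not something the slides prescribe.

```python
# Sketch: random train / validation / test split of n instances.
import numpy as np

def split_indices(n, frac_train=0.6, frac_val=0.2, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)                  # shuffle to avoid ordering bias
    n_train = int(frac_train * n)
    n_val = int(frac_val * n)
    return (idx[:n_train],                    # training: construct the classifier
            idx[n_train:n_train + n_val],     # validation: pick algorithm / knobs
            idx[n_train + n_val:])            # testing: one final error estimate

train_idx, val_idx, test_idx = split_indices(1000)
```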
Cross-validation
• Conflicting priorities when splitting the dataset
  • estimate future error as accurately as possible
    • large testing set: big ntest → tight confidence interval
  • learn classifier as accurately as possible
    • large training set: big ntrain → better estimates
  • training and testing cannot overlap: ntrain + ntest = const
• Idea: evaluate Train → Test, then Test → Train, average results
  • every point is both training and testing, never at the same time
  • reduces chances of getting an unusual (biased) testing set
• 5-fold cross-validation
  • randomly split the data into 5 sets
  • test on each in turn (train on 4 others)
  • average the results over 5 folds
  • more common: 10-fold
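A plain-numpy sketch of k-fold cross-validation; `fit` and `error` are hypothetical callables standing in for whatever learner and error measure is being evaluated.

```python
# Sketch: k-fold cross-validation over disjoint random folds.
import numpy as np

def cross_val_error(X, y, fit, error, k=10, seed=0):
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)   # k random, disjoint folds
    fold_errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train_idx], y[train_idx])           # train on the other k-1 folds
        fold_errors.append(error(model, X[test_idx], y[test_idx]))  # test on held-out fold
    return np.mean(fold_errors)                           # average over the k folds
```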
Leave-one-out
• n-fold cross-validation (n = total number of instances)
  • predict each instance, training on all (n-1) other instances
• Pros and cons:
  • best possible classifier learned: n-1 training examples
  • high computational cost: re-learn everything n times
    • not an issue for instance-based methods like kNN
    • there are tricks to make such learning faster
  • classes not balanced in training / testing sets
    • random data, 2 equi-probable classes → wrong 100% of the time
    • testing balance: {1 of A, 0 of B} vs. training: {n/2 of B, n/2-1 of A}
    • duplicated data → nothing can beat 1NN (0% error)
    • wouldn’t happen with 10-fold cross-validation
Stratification
• Problems with leave-one-out:
  • training / testing sets have classes in different proportions
  • not limited to leave-one-out:
    • K-fold cross-validation: random splits → imbalance
• Stratification
  • keep class labels balanced across training / testing sets
  • simple way to guard against unlucky splits
  • recipe:
    • randomly split each class into K parts
    • assemble the i-th part from all classes to make the i-th fold
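A sketch of that recipe in numpy: shuffle each class separately, cut it into K parts, and build the i-th fold from the i-th part of every class.

```python
# Sketch: stratified folds with roughly equal class proportions in each fold.
import numpy as np

def stratified_folds(y, k=10, seed=0):
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(k)]
    for label in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == label))  # this class, shuffled
        for i, part in enumerate(np.array_split(idx, k)):
            folds[i].extend(part)                          # i-th part -> i-th fold
    return [np.array(fold) for fold in folds]
# Each fold now has roughly the same class proportions as the whole dataset.
```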


Evaluation measures
• Are we doing well? Is system A better than B?
• A measure of how (in)accurate a system is on a task
  • in many cases Error (Accuracy / PC) is not the best measure
  • using the appropriate measure will help select the best algorithm
• Classification
  • how often we classify something right / wrong
• Regression
  • how close are we to what we’re trying to predict
• Unsupervised
  • how well do we describe our data
  • in general – really hard
Classification measures: basics
• Confusion matrix for two-class classification:

                            Predict positive?
                            Yes     No
  Really positive?   Yes    TP      FN
                     No     FP      TN

• Want: large diagonal, small FP, FN
[Figure: all testing instances split by whether the system predicts positive and whether they are really positive, giving the True Positives (TP), False Positives (FP), False Negatives (FN) and True Negatives (TN)]
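A minimal sketch of tallying these four counts from 0/1 true labels and predictions; the toy arrays are purely illustrative.

```python
# Sketch: two-class confusion counts from 0/1 labels and predictions.
import numpy as np

def confusion_counts(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    return tp, fp, fn, tn

print(confusion_counts([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))   # (2, 1, 1, 1)
```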
Classification Error
• Classification error = (FP + FN) / (TP + FP + TN + FN)
• Accuracy = (1 – error) = (TP + TN) / (TP + FP + TN + FN)
• Basic measure of “goodness” of a classifier
• Problem: cannot handle unbalanced classes
  • ex1: predict whether an earthquake is about to happen
    • earthquakes happen very rarely, so accuracy is very good if we always predict “No”
    • solution: make FNs much more “costly” than FPs
  • ex2: web search: decide if a webpage is relevant to the user
    • 99.9999% of pages are not relevant to any query → best accuracy is to retrieve nothing
    • solution: use measures that don’t involve TN (recall / precision)


Accuracy and un-balanced classes
• You’re predicting Nobel prize (+) vs. not (-)
• A human would prefer classifier A
• Accuracy will prefer classifier B (fewer errors)
• Accuracy is a poor metric here
[Figure: classifiers A and B on data with very few positive (+) instances among many negatives (-)]
Misses and False Alarms
• False Alarm rate = False Positive rate = FP / (FP+TN)
  • % of negatives we misclassified as positive
• Miss rate = False Negative rate = FN / (TP+FN)
  • % of positives we misclassified as negative
• Recall = True Positive rate = TP / (TP+FN)
  • % of positives we classified correctly (1 – Miss rate)
• Precision = TP / (TP + FP)
  • % positive out of what we predicted was positive
• Meaningless to report just one of these
  • trivial to get 100% recall or 0% false alarm
  • typical: recall/precision or Miss / FA rate or TP/FP rate


Evaluation (recap)
• Predicting class C (e.g. spam)
• Classifier can make two types of mistakes:
  • FP: false positives – non-spam emails mistakenly classified as spam
  • FN: false negatives – spam emails mistakenly classified as non-spam
  • TP/TN: true positives/negatives – correctly classified spam/non-spam
• Common error/accuracy measures:
  • Classification Error = (FP + FN) / (TP + FP + TN + FN)
  • Accuracy = 1 – Error = (TP + TN) / (TP + FP + TN + FN)
  • False Alarm = False Positive rate = FP / (FP+TN)
  • Miss = False Negative rate = FN / (TP+FN)
  • Recall = True Positive rate = TP / (TP+FN)
  • Precision = TP / (TP+FP)
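A small sketch computing the measures above from the four confusion counts; it assumes both classes and both predictions actually occur, so no denominator is zero.

```python
# Sketch: standard classification measures from TP, FP, FN, TN.
def classification_measures(tp, fp, fn, tn):
    n = tp + fp + fn + tn
    return {
        "error": (fp + fn) / n,
        "accuracy": (tp + tn) / n,
        "false alarm rate": fp / (fp + tn),   # FP rate
        "miss rate": fn / (tp + fn),          # FN rate
        "recall": tp / (tp + fn),             # TP rate
        "precision": tp / (tp + fp),
    }

print(classification_measures(tp=40, fp=10, fn=20, tn=30))
```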


Utility and Cost
• Sometimes need a single-number evaluation measure
  • optimizing the learner (automatically), competitive evaluation
  • may know the costs of different errors, e.g. earthquakes:
    • false positive: cost of preventive measures (evacuation, lost profit)
    • false negative: cost of recovery (reconstruction, liability)
• Detection cost: weighted average of FP, FN rates
  • Cost = CFP · FP + CFN · FN [event detection]
• F-measure: harmonic mean of recall and precision
  • F1 = 2 / (1/Recall + 1/Precision) [Information Retrieval]
• Domain-specific measures:
  • e.g. observed profit/loss from +/- market prediction
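A tiny sketch of these two single-number summaries; the cost weights are hypothetical, chosen only to make false negatives (missed events) more expensive, in the spirit of the earthquake example.

```python
# Sketch: detection cost and F1 from error rates / recall and precision.
def detection_cost(fp_rate, fn_rate, c_fp=1.0, c_fn=10.0):
    return c_fp * fp_rate + c_fn * fn_rate     # weighted combination of error rates

def f1_score(recall, precision):
    return 2 / (1 / recall + 1 / precision)    # harmonic mean

print(detection_cost(fp_rate=0.20, fn_rate=0.05))   # 0.7 with these example costs
print(f1_score(recall=0.8, precision=0.5))          # ~0.615
```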
Thresholds in Classification
• Two systems have the following performance:
  • A: True Positive rate = 50%, False Positive rate = 20%
  • B: True Positive rate = 100%, False Positive rate = 60%
• Which is better? (assuming no a-priori utilities)
  • a very misleading question
  • A and B could be the exact same system
  • operating at different thresholds


ROC curves
• Many algorithms compute a “confidence” f(x)
  • threshold to get a decision: spam if f(x) > t, non-spam if f(x) ≤ t
  • Naïve Bayes: P(spam|x) > 0.5, Linear/Logistic/SVM: wᵀx > 0, Decision Tree: p+/p- > 1
• Threshold t determines the error rates
  • False Positive rate = P(f(x) > t | ham), True Positive rate = P(f(x) > t | spam)
• Receiver Operating Characteristic (ROC):
  • plot TP rate vs. FP rate as t varies from ∞ to -∞
  • shows the performance of a system across all possible thresholds
  • AUC = area under the ROC curve, a popular alternative to Accuracy
[Figure: ROC curves for systems A, B and C in the TP-rate vs. FP-rate plane; perfect performance is the top-left corner]
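A sketch of tracing the ROC curve by sweeping the threshold over the confidence scores f(x), then integrating the area under it; the scores and labels in the example are made up, and ties between scores are not handled specially.

```python
# Sketch: ROC curve and AUC from confidence scores and 0/1 labels.
import numpy as np

def roc_curve(scores, labels):
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    order = np.argsort(-scores)              # sweep t from +inf downwards
    tp = np.cumsum(labels[order] == 1)       # positives with f(x) > t
    fp = np.cumsum(labels[order] == 0)       # negatives with f(x) > t
    tpr = tp / max(tp[-1], 1)                # True Positive rate
    fpr = fp / max(fp[-1], 1)                # False Positive rate
    return np.concatenate(([0.0], fpr)), np.concatenate(([0.0], tpr))

def auc(fpr, tpr):
    # area under the curve by the trapezoid rule
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))

fpr, tpr = roc_curve([0.9, 0.8, 0.6, 0.4, 0.2], [1, 1, 0, 1, 0])
print(auc(fpr, tpr))   # ~0.83: 5 of the 6 (positive, negative) pairs are ranked correctly
```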
Evaluating regression
• Classification: count how often we are wrong
• Regression: predict numbers yi from inputs xi
  • always wrong, but by how much?
  • distance between predicted & true values over the testing set
• (Root) mean squared error:
  • popular, well-understood, nicely differentiable
  • sensitive to single large errors (outliers)
• Mean absolute error:
  • less sensitive to outliers
• Correlation coefficient:
  • insensitive to mean & scale
Mean Squared Error
• Average (squared) deviation from the truth
• Very sensitive to outliers
  • 99 predictions exact, 1 off by $10 has the same (large) effect on MSE as all 100 wrong by $1
• Sensitive to mean / scale
• Relative squared error (Weka):
  • RSE = MSE of the predictor / MSE when using the mean as the predictor
  • the mean µy = 1/n Σi yi is a good baseline


Mean Absolute Error
• Mean Absolute Error (MAE):
  • less sensitive to outliers
  • many small errors count the same as one large error of the same total amount
  • best 0th-order baseline: median{yi}
    • not the mean, as for MSE
• Median Absolute Deviation (MAD): med{|f(xi) - yi|}
  • robust, completely ignores outliers
  • can define a similar squared error: median{(f(xi) - yi)²}
  • difficult to work with (can’t take derivatives)
• Sensitive to mean, scale


Correlation Coefficient
• Completely insensitive to mean / scale:
  • CC compares the prediction relative to its mean with the truth relative to its mean
• Intuition: did you capture the relative ordering?
  • output larger f(xi) for larger yi
  • output smaller f(xi) for smaller yi
• Useful for ranking tasks:
  • e.g. recommend a movie to a user
• Important to visualize the data
  • very different predictors can have the same CC (same CC for 4 predictors)
[Figure: scatter plots of predicted vs. true values for 4 different predictors with the same correlation coefficient]
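A plain-numpy sketch of the regression measures from these slides (MSE, RMSE, MAE, MAD, relative squared error, and the correlation coefficient), applied to made-up values.

```python
# Sketch: regression error measures from true and predicted values.
import numpy as np

def regression_measures(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    mse = np.mean(err ** 2)
    baseline_mse = np.mean((np.mean(y_true) - y_true) ** 2)   # predict the mean
    return {
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "MAE": np.mean(np.abs(err)),
        "MAD": np.median(np.abs(err)),                  # robust, ignores outliers
        "relative squared error": mse / baseline_mse,   # Weka-style RSE
        "correlation": np.corrcoef(y_true, y_pred)[0, 1],
    }

print(regression_measures([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 4.4]))
```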
Summary
• Training vs. generalization error
  • under-fitting and over-fitting
• Estimate how well your system is doing its job
  • how does it compare to other approaches?
  • what will be the error rate on future data?
• Training and testing
  • cross-validation, leave-one-out, stratification, significance
• Evaluation measures
  • accuracy, miss / false alarm rates, detection cost
  • ROC curves
  • regression: (root) mean squared/absolute error, correlation
Copyright © 2014 Victor Lavrenko
