Unit 1

Machine Learning is defined as the study of algorithms that improve their performance on a task through experience, with applications ranging from spam filtering to medical image recognition. It encompasses various learning types such as supervised, unsupervised, and reinforcement learning, each with distinct methodologies and use cases. The document also discusses the importance of feature selection, dimensionality reduction, and the practical implementation of machine learning systems.

Machine Learning
TE IT

What is Machine Learning?
“Learning is any process by which a system improves performance from experience.”
- Herbert Simon

Definition by Tom Mitchell (1998):
Machine Learning is the study of algorithms that
• improve their performance P
• at some task T
• with experience E.
A well-defined learning task is given by <P, T, E>.
What is Machine Learning?

Consider the example of a spam filter: an email program that watches which emails the user marks as spam or not spam.

T: To decide whether an email is spam or not

E: Observing the labels (spam / not spam) of the available email data

P: The fraction of emails correctly classified as spam / not spam
Samuel’s Checkers-Player
“Machine Learning: Field of study that gives computers the ability to learn without being explicitly programmed.” - Arthur Samuel (1959)
Defining the Learning Task
Improve on task T, with respect to performance metric P, based on experience E.

T: Playing checkers
P: Percentage of games won against an arbitrary opponent
E: Playing practice games against itself

T: Recognizing hand-written words
P: Percentage of words correctly classified
E: Database of human-labeled images of handwritten words

T: Driving on four-lane highways using vision sensors
P: Average distance traveled before a human-judged error
E: A sequence of images and steering commands recorded while observing a human driver.
Slide credit: Ray Mooney
Traditional Programming

Data + Program → Computer → Output

Machine Learning

Data + Output → Computer → Program
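To make the contrast concrete, here is a minimal sketch (assuming scikit-learn is available): instead of hand-coding a rule, we give the computer example inputs and outputs, and it returns the "program" (a fitted model). The numbers are invented for illustration.

```python
# A minimal sketch of the Data + Output -> Program idea, using scikit-learn.
# The data below are made-up illustrative numbers, not from the slides.
from sklearn.linear_model import LinearRegression

# Example inputs (hours studied) and outputs (exam score)
X = [[1.0], [2.0], [3.0], [4.0], [5.0]]
y = [52, 58, 67, 71, 80]

# "Learning" produces the program: a fitted model we can run on new inputs
model = LinearRegression().fit(X, y)
print(model.predict([[6.0]]))  # apply the learned program to a new input
```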
Machine Learning process
When Do We Use Machine Learning?
ML is used when:
• Human expertise does not exist (navigating on Mars)
• Humans can’t explain their expertise (speech recognition)
• Models must be customized (personalized medicine)
• Models are based on huge amounts of data (genomics)

Learning isn’t always useful:
• There is no need to “learn” to calculate payroll
A classic example of a task that requires machine learning:
It is very hard to say what makes a 2.
Some more examples of tasks that are best
solved by using a learning algorithm
• Recognizing patterns:
– Facial identities or facial expressions
– Handwritten or spoken words
– Medical images
• Generating patterns:
– Generating images or motion sequences
• Recognizing anomalies:
– Unusual credit card transactions
– Unusual patterns of sensor readings in a nuclear power plant
• Prediction:
– Future stock prices or currency exchange rates
Sample Applications
• Web search
• Computational biology
• Finance
• E-commerce
• Space exploration
• Robotics
• Information extraction
• Social networks
• Debugging software

Career opportunities
Search engine companies: Google, Yahoo, Bing (Microsoft)

Data science vendors: Palantir, Teradata, Pixar, SAS, Alpine Labs, Pivotal, Tableau; more than 8000 companies are hiring data scientists.

Social networks:
• Facebook
• Twitter
• LinkedIn
• Instagram
• Tumblr

Financial related:
• Visa
• Amazon
• Apple
• eBay
• Bank of America
• Capital One
• PayPal
• GE Capital

Engineering related:
• Intel
• IBM, oil industry
• HCL Technologies
• Wipro Technologies
• Verizon
• Boeing, SAP
• Oracle
• EMC
Data and types: Scales of Measurement

• Qualitative: nominal and ordinal scales
• Quantitative: interval and ratio scales; quantitative data may be discrete or continuous
Features and Patterns:
Pattern recognition is the process of finding regularities and similarities in data using machine learning algorithms.
In machine learning and pattern recognition, a feature is an individual measurable property or characteristic of a phenomenon.
Choosing informative, discriminating, and independent features is a crucial element of effective algorithms in pattern recognition, classification, and regression.
Attributes are the items of data that are used in machine learning. Attributes are also referred to as variables, fields, or predictors.
Learning Tasks: Descriptive and Predictive Tasks
Types of Learning

• Supervised (inductive) learning


– Given: training data + desired outputs (labels)
• Unsupervised learning
– Given: training data (without desired outputs)
• Semi-supervised learning
– Given: training data + a few desired outputs
• Reinforcement learning
– Rewards from sequence of actions

Based on slide by Pedro Domingos
Types of Learning
Learning Paradigms

Supervised (inductive) learning
Supervised Learning: Regression
• Given (x1, y1), (x2, y2), ..., (xn, yn)
• Learn a function f(x) to predict y given x
– y is real-valued == regression

[Figure: September Arctic Sea Ice Extent (1,000,000 sq km) vs. Year, 1970-2020]
Data from G. Witt. Journal of Statistics Education, Volume 21
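As a concrete sketch (the numbers below are made-up stand-ins for (year, extent) pairs, not the actual sea-ice dataset), regression with scikit-learn might look like this:

```python
# A minimal regression sketch with scikit-learn; the data are illustrative
# stand-ins for (year, sea-ice extent) pairs, not the real measurements.
import numpy as np
from sklearn.linear_model import LinearRegression

years = np.array([[1980], [1990], [2000], [2010], [2020]])
extent = np.array([7.8, 6.2, 6.3, 4.9, 3.9])  # million sq km (made up)

model = LinearRegression().fit(years, extent)
print(model.predict(np.array([[2030]])))  # real-valued prediction for a new x
```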
Supervised Learning: Classification
• Given (x1, y1), (x2, y2), ..., (xn, yn)
• Learn a function f(x) to predict y given x
– y is categorical == classification

Example: predict whether a tumor is cancerous from its size.
Cancer (Malignant / Benign): y = 1 (Malignant), y = 0 (Benign); x = Tumor Size

Based on example by Andrew Ng
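A minimal classification sketch in the spirit of this example (the tumor sizes and labels below are invented for illustration):

```python
# Logistic regression on a toy version of the tumor-size example.
# Sizes and labels are invented for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression

tumor_size = np.array([[1.0], [1.5], [2.0], [3.0], [3.5], [4.0]])
label = np.array([0, 0, 0, 1, 1, 1])  # 0 = benign, 1 = malignant

clf = LogisticRegression().fit(tumor_size, label)
print(clf.predict([[2.6]]))        # hard class prediction (0 or 1)
print(clf.predict_proba([[2.6]]))  # class probabilities
```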
Unsupervised Learning
• Given x1 , x2 , ..., x n (without labels)
• Output hidden structure behind the x’s
– E.g., clustering

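For instance, clustering with k-means (a minimal sketch on made-up 2-D points):

```python
# K-means clustering sketch: find hidden group structure without labels.
# The points are made up; in practice x would be your unlabeled data.
import numpy as np
from sklearn.cluster import KMeans

x = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(x)
print(kmeans.labels_)           # discovered cluster assignment per point
print(kmeans.cluster_centers_)  # discovered cluster centers
```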
Unsupervised Learning
Genomics application: group individuals by genetic similarity.
[Figure: heat map of genes (rows) vs. individuals (columns)]
[Source: Daphne Koller]
Unsupervised Learning
Other applications: organizing computing clusters, social network analysis, market segmentation, astronomical data analysis.

Image credit: NASA/JPL-Caltech/E. Churchwell (Univ. of Wisconsin, Madison)
Slide credit: Andrew Ng
Unsupervised Learning
• Independent component analysis – separate a combined signal into its original sources

Image credit: statsoft.com. Audio from http://www.ism.ac.jp/~shiro/research/blindsep.html
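A minimal sketch of ICA with scikit-learn's FastICA; the two source signals and the mixing matrix are synthetic:

```python
# ICA sketch: recover independent sources from their linear mixtures.
# The sine/square sources and mixing matrix are synthetic examples.
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)               # source 1: sinusoid
s2 = np.sign(np.sin(3 * t))      # source 2: square wave
S = np.c_[s1, s2]

A = np.array([[1.0, 0.5], [0.5, 1.0]])  # mixing matrix
X = S @ A.T                              # observed mixed signals

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)  # estimated (unmixed) sources
print(S_est.shape)
```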
Reinforcement Learning
• Given a sequence of states and actions with (delayed) rewards, output a policy
– A policy is a mapping from states → actions that tells you what to do in a given state
• Examples:
– Credit assignment problem
– Game playing
– Robot in a maze
– Balance a pole on your hand
The Agent-Environment Interface

Agent and environment interact at discrete time steps t = 0, 1, 2, ...
• Agent observes the state at step t: s_t ∈ S
• produces an action at step t: a_t ∈ A(s_t)
• gets the resulting reward: r_{t+1} ∈ ℝ
• and the resulting next state: s_{t+1}

... s_t —a_t→ r_{t+1}, s_{t+1} —a_{t+1}→ r_{t+2}, s_{t+2} —a_{t+2}→ r_{t+3}, s_{t+3} ...

Slide credit: Sutton & Barto
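This interaction loop is easy to express in code. A minimal sketch follows; the toy environment and the random policy are stand-ins for illustration, not a specific algorithm:

```python
# Sketch of the agent-environment loop: s_t -> a_t -> r_{t+1}, s_{t+1}.
# Environment dynamics and policy are toy stand-ins for illustration.
import random

def env_step(state, action):
    """Toy environment: reward +1 for action 1; state moves accordingly."""
    reward = 1.0 if action == 1 else 0.0
    next_state = max(0, state + (1 if action == 1 else -1))
    return reward, next_state

def policy(state):
    """Stand-in policy: choose an action at random."""
    return random.choice([0, 1])

state = 0
for t in range(5):
    action = policy(state)                    # a_t = pi(s_t)
    reward, state = env_step(state, action)   # environment returns r_{t+1}, s_{t+1}
    print(t, action, reward, state)
```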
Reinforcement Learning

https://www.youtube.com/watch?v=4cgWya-wjgY
Framing a Learning Problem

Designing a Learning System
• Choose the training experience
• Choose exactly what is to be learned
– i.e. the target function
• Choose how to represent the target function
• Choose a learning algorithm to infer the target function from the experience

[Diagram: Environment/Experience → Training data → Learner → Knowledge → Performance Element; testing data feeds the Performance Element]
Training vs. Test Distribution
• We generally assume that the training and test examples are independently drawn from the same overall distribution of data
– We call this “i.i.d.”, which stands for “independent and identically distributed”
• If examples are not independent, this requires collective classification
• If the test distribution is different, this requires transfer learning
ML in a Nutshell
• Tens of thousands of machine learning
algorithms
– Hundreds new every year

• Every ML algorithm has three components:


– Representation
– Optimization
– Evaluation

Various Function Representations
• Numerical functions
– Linear regression
– Neural networks
– Support vector machines
• Symbolic functions
– Decision trees
– Rules in propositional logic
– Rules in first-order predicate logic
• Instance-based functions
– Nearest-neighbor
– Case-based
• Probabilistic Graphical Models
– Naïve Bayes
– Bayesian networks
– Hidden-Markov Models (HMMs)
– Probabilistic Context Free Grammars (PCFGs)
– Markov networks
Various Search/Optimization Algorithms
• Gradient descent
– Perceptron
– Backpropagation
• Dynamic Programming
– HMM Learning
– PCFG Learning
• Divide and Conquer
– Decision tree induction
– Rule learning
• Evolutionary Computation
– Genetic Algorithms (GAs)
– Genetic Programming (GP)
– Neuro-evolution
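As a concrete example of the first family, here is a minimal sketch of gradient descent on a one-parameter least-squares objective; the data and learning rate are illustrative choices, not from the slides:

```python
# Gradient descent sketch: minimize mean squared error for y ~ w * x.
# Data and learning rate are illustrative choices.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])  # roughly y = 2x

w, lr = 0.0, 0.01
for step in range(200):
    grad = 2 * np.mean((w * x - y) * x)  # d/dw of mean((w*x - y)^2)
    w -= lr * grad                        # move against the gradient
print(w)  # should approach ~2.0
```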
Evaluation
• Accuracy
• Precision and recall
• Squared error
• Likelihood
• Posterior probability
• Cost / Utility
• Margin
• Entropy
• K-L divergence
• etc.

ML in Practice
• Understand domain, prior knowledge, and goals
• Data integration, selection, cleaning, pre-processing, etc.
• Learn models
• Interpret results
• Consolidate and deploy discovered knowledge
(Loop over these steps as needed.)

Based on a slide by Pedro Domingos
Lessons Learned about Learning
• Learning can be viewed as using direct or indirect experience to approximate a chosen target function.
• Function approximation can be viewed as a search through a space of hypotheses (representations of functions) for one that best fits a set of training data.
• Different learning methods assume different hypothesis spaces (representation languages) and/or employ different search techniques.
Slide credit: Ray Mooney
Features
• Mathematically, features are functions that map from the instance space to some set of feature values called the domain of the feature.
• Features vary in nature, e.g.:
o A set of integers, such as the number of occurrences of a particular word
o Boolean, such as true/false for whether an email is spam or ham
o An arbitrary finite set of colors, shapes, etc.
Data and Dimensionality

• Increasing the number of features will not always improve classification accuracy.
• In practice, the inclusion of more features might actually lead to worse performance.
• The number of training examples required increases exponentially with the dimensionality d: with k bins per feature, a histogram over d features has k^d cells (e.g., for k = 3: 3 bins in 1-D, 3^2 = 9 bins in 2-D, 3^3 = 27 bins in 3-D).
Dimensionality reduction
• Too many factors form the basis for the final classification or regression; these factors are variables called features.
• The higher the number of features, the harder it gets to visualize the training set.
• Many of these features are correlated, and hence redundant.
• Dimensionality reduction is a family of techniques in machine learning and statistics to reduce the number of random variables under consideration: feature selection and feature extraction.

 Some features (dimensions) bear little or no useful information (e.g., color of hair for a car selection)
 Can drop some features
 Have to estimate which features can be dropped from the data

 Several features can be combined together without loss, or even with gain, of information (e.g., income of all family members for a loan application)
 Some features can be combined together
 Have to estimate which features to combine from the data
Why Dimensionality Reduction
• What is the objective?
Choose an optimum set of features of lower dimensionality to improve classification accuracy.

• Different methods can be used to reduce dimensionality:
• Feature extraction
• Feature selection

• Benefits:
• Reduces time complexity: less computation
• Reduces space complexity: fewer parameters
• Saves the cost of observing the feature
• Simpler models are more robust on small datasets
• More interpretable; simpler explanation
• Data visualization (structure, groups, outliers, etc.) if plotted in 2 or 3 dimensions
Data and Dimensionality: Dimensionality Reduction (cont’d)

Feature selection: chooses a subset of the original features.

x = [x_1, x_2, ..., x_N]^T → y = [x_{i1}, x_{i2}, ..., x_{iK}]^T, with K << N

• Find a smaller subset of a many-dimensional data set to create a data model.
• Find the k features of the d dimensions that give us the most information and discard the other (d − k) dimensions.
• Subset selection is one of the most widely used methods.
Feature Extraction

Feature extraction: finds a set of new features (i.e., through some mapping f()) from the existing features.

From a mathematical point of view, finding an optimum mapping y = f(x) is equivalent to optimizing an objective criterion.

Different methods use different objective criteria, e.g.,
• Minimize information loss: represent the data as accurately as possible in the lower-dimensional space.
• Maximize discriminatory information: enhance the class-discriminatory information in the lower-dimensional space.
Feature Extraction
Popular linear feature extraction methods:
• Principal Components Analysis (PCA): seeks a projection that preserves as much information in the data as possible.
• Linear Discriminant Analysis (LDA): seeks a projection that best discriminates the data.

Many other methods:
• Making features as independent as possible (Independent Component Analysis or ICA).
• Retaining interesting directions (Projection Pursuit).
• Embedding to lower-dimensional manifolds (Isomap, Locally Linear Embedding or LLE).
Subset-selection
• Goal: find the best subset of the set of features.
• The best subset contains the least number of dimensions that most contribute to accuracy.
• Used in both regression and classification problems.
• There are 2^d possible subsets of d variables.
• It is not possible to test all of them unless d is small.
• Instead, heuristics are designed to get a reasonable (but not optimal) solution in reasonable (polynomial) time.
Subset-selection
 Forward search (a code sketch follows this list)
 Start from an empty set of features
 Try each of the remaining features
 Estimate the classification/regression error for adding each specific feature
 Select the feature that gives the maximum improvement in validation error
 Stop when there is no significant improvement

 Backward search
 Start with the original set of size d
 Drop the features with the smallest impact on error
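A minimal sketch of forward search (greedy feature selection); the model, the cross-validated score criterion, and the random data are illustrative assumptions:

```python
# Greedy forward feature selection sketch. The model (linear regression)
# and the cross-validated score criterion are illustrative choices.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def forward_search(X, y, max_features):
    selected, remaining = [], list(range(X.shape[1]))
    best_score = -np.inf
    while remaining and len(selected) < max_features:
        # Score every candidate feature added to the current subset
        scores = [(cross_val_score(LinearRegression(), X[:, selected + [j]], y,
                                   cv=5).mean(), j) for j in remaining]
        score, j = max(scores)
        if score <= best_score:   # stop: no significant improvement
            break
        best_score = score
        selected.append(j)
        remaining.remove(j)
    return selected

# Usage with random illustrative data: only features 0 and 2 matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(size=100)
print(forward_search(X, y, max_features=3))  # likely picks features 0 and 2
```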
Principal Component Analysis:
• A mapping from the d-dimensional space to a new k-dimensional space (k < d), with minimum loss of information.
• As the dimensionality of the data increases, the difficulty of visualizing it and performing computations on it also increases.
• There are two main strategies to reduce the dimensions of the data:
o Remove the redundant dimensions
o Only keep the most important dimensions
Concepts used in PCA
Variance: a measure of variability; it simply measures how spread out the data set is.
Covariance: a measure of the extent to which corresponding elements from two sets of ordered data move in the same direction.
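A quick numeric illustration with made-up values:

```python
# Variance and covariance of two small made-up variables.
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 5.0, 9.0])

print(np.var(x))     # variance: spread of x around its mean
print(np.cov(x, y))  # 2x2 covariance matrix; a positive off-diagonal
                     # means x and y tend to move in the same direction
```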
Concepts used in PCA
• PCA finds a new set of dimensions such that all the dimensions are orthogonal (and hence linearly independent) and ranked according to the variance of the data along them.
• This means the more important principal axis occurs first (more important = more variance / more spread-out data).
• The principal direction in which the data varies is shown by the U axis, and the second most important direction is the V axis, orthogonal to it.
• If each (X, Y) instance is transformed into its corresponding (U, V) coordinates, the data is de-correlated, meaning that the covariance between the U and V variables is zero.
• The directions U and V are called the principal components.

[Figure: PCA for data representation; PCA for dimension reduction]
Working of PCA (a compact code sketch of these steps follows)
1. The first step is to gather reliable raw data from a sample, e.g. based on a questionnaire.
2. The second step is to calculate the correlations between the variables.
3. In principal component analysis, principal components are extracted and presented as a table with the components in columns and the variables in rows.
4. The principal components analysis table is truncated. Components are reported in order by eigenvalue and by the proportion of total variance explained. Frequently, these components are easily interpreted.
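A compact sketch of PCA via the eigendecomposition of the covariance matrix (random illustrative data; scikit-learn's PCA would give an equivalent projection):

```python
# PCA sketch: center the data, eigendecompose the covariance matrix,
# and project onto the top-k principal components. Data are random.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # 100 samples, 5 features

Xc = X - X.mean(axis=0)                # 1. center the data
C = np.cov(Xc, rowvar=False)           # 2. covariance between variables
eigvals, eigvecs = np.linalg.eigh(C)   # 3. extract components
order = np.argsort(eigvals)[::-1]      # 4. rank by eigenvalue (variance)
k = 2
W = eigvecs[:, order[:k]]              # top-k principal directions
Z = Xc @ W                             # data in the reduced k-D space
print(eigvals[order] / eigvals.sum())  # proportion of total variance
```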
Objectives of PCA

• PCA helps to extract the most important information from the data table and compress the size of the data set by keeping only the important information.
• PCA also simplifies the description of the data set and helps analyze the structure of the observations and the variables.
Dataset Validation Techniques – Hold-out, k-fold Cross-Validation, Leave-One-Out Cross-Validation (LOOCV)
Lesson Goals:
• Understand the basic concepts of resampling methods for machine learning model selection and assessment.
• Understand the similarities and differences between the validation set approach, leave-one-out cross-validation, and K-fold cross-validation.
• Understand the basic theoretical concepts of the bootstrap method for assessing statistical accuracy.
• Recognize that the performance of machine learning models depends on their prediction capability on independent test data.
Resampling Methods

Resampling involves repeatedly drawing samples from a training set and refitting a model of interest on each sample, in order to obtain more information about the fitted model.
Example: We can estimate the variability of a linear regression fit by repeatedly drawing different samples from the training data, fitting an OLS regression to each new sample, and then examining the extent to which the resulting fits differ.
Model Assessment: having chosen a final model, estimating its prediction error on new data.
Model Selection: estimating the performance of different models in order to choose the best one.
Resampling Methods (cont.)

Cross-Validation
Used to estimate the test-set prediction error rate associated with a given machine learning method, either to evaluate its performance or to select the appropriate level of model flexibility.

Bootstrap
Used most commonly to provide a measure of accuracy of a parameter estimate or of a given machine learning method.
Model Assessment

The generalization performance of a machine learning method relates to its prediction capability on independent test sets.
Assessment of this performance is extremely important in practice, since it guides the choice of the machine learning method or model.
Further, this gives us a measure of the quality of the ultimately chosen model.
Model Assessment (cont.)

Test Error
The average error that results from using a machine learning method to predict the response on a new observation, i.e. the prediction error over an independent test sample.
Training Error
The average loss over the training sample:
    err_train = (1/n) Σ_{i=1}^{n} L(y_i, f̂(x_i))
Note: The training error rate can dramatically underestimate the test error rate.
Model Assessment (cont.)

• As the model becomes more and more complex, it uses the training data more and is able to adapt to more complicated underlying structures.
• Hence, there is a decrease in bias but an increase in variance.
• However, training error is not a good estimate of the test error.
• Training error consistently decreases with model complexity.
• A model with zero training error is overfit to the training data and will typically generalize poorly.
Model Assessment (cont.)

• If we are in a data-rich situation, the best approach for both model selection and model assessment is to randomly divide the dataset into three parts: a training set, a validation set, and a test set.
• The training set is used to fit the models. The validation set is used to estimate prediction error for model selection. The test set is used for assessment of the prediction error of the final chosen model.
• A typical split might be 50% for training, and 25% each for validation and testing.
Model Assessment (cont.)
Best solution: use a large designated test set, which is often not available. For the methods presented here, there is insufficient data to split it into three parts.

There are some methods to make mathematical adjustments to the training error rate in order to estimate the test error rate:
• Cp statistic, AIC, BIC (we will discuss these next week)

Here, we consider cross-validation (CV) methods that estimate the test error by holding out a subset of the training observations from the fitting process, and then applying the machine learning method to those held-out observations.
Overview: Model Selection
By far, the most important use of validation is for model selection, which we will discuss in greater detail next week.
This could be the choice between a linear model and a nonlinear model, the choice of the order of polynomial in a model, the choice of a regularization parameter, or any other choice that affects the learning process.
In almost every learning situation, there are some choices to be made, and we need a principled way of making these choices.
The leap is to realize that validation can be used to estimate the out-of-sample error for more than one model.
Overview: Model Selection (cont.)

Suppose we have M models; validation can be used to select one of these models.
We use the training data to fit each model, and we evaluate each model on the validation set to obtain the validation errors.
It is now a simple matter to select the model with the lowest validation error.
Validation Set Approach
Suppose that we would like to find the set of variables that gives the lowest validation error rate (an estimate of the test error rate).

If we have a large data set, we can achieve this goal by randomly splitting the data into separate training and validation data sets.

Then, we use the training data set to build each possible model and select the model that gives the lowest error rate when applied to the validation data set.

[Diagram: the data are split into a Training Data set and a Validation Data set]
Validation Set Approach: Example

Example: we want to predict mpg from horsepower.
Two models:
mpg ~ horsepower
mpg ~ horsepower + horsepower^2

Which model gives a better fit?
We randomly split (50/50) the 392 observations into training and validation data sets, and we fit both models using the training data. Next, we evaluate both models using the validation data set.
Winner = the model with the lowest validation (testing) MSE. A sketch of this procedure appears below.
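A minimal sketch of this comparison, assuming a CSV with numeric mpg and horsepower columns (the file name "Auto.csv" is a hypothetical path; any such source works):

```python
# Validation set approach sketch: compare a linear and a quadratic fit
# of mpg on horsepower. Assumes a CSV with numeric mpg/horsepower columns.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

auto = pd.read_csv("Auto.csv")  # hypothetical file path
X = auto[["horsepower"]].to_numpy(dtype=float)
y = auto["mpg"].to_numpy(dtype=float)

# Random 50/50 split into training and validation sets
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.5, random_state=1)

for degree in (1, 2):  # mpg ~ horsepower vs. mpg ~ horsepower + horsepower^2
    feats_tr = np.hstack([X_tr ** p for p in range(1, degree + 1)])
    feats_va = np.hstack([X_va ** p for p in range(1, degree + 1)])
    model = LinearRegression().fit(feats_tr, y_tr)
    print(degree, mean_squared_error(y_va, model.predict(feats_va)))
```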
Validation Set Approach: Example Results

• Left Panel: validation error estimates for a single split into training and validation data sets.
• Right Panel: validation error estimates for multiple splits; this shows that the test error rate estimate is highly variable.
Validation Set Approach: Review

Advantages:
• Conceptually simple and easy to implement.

Drawbacks:
• The validation set error rate (MSE) can be highly variable.
• Only a subset of the observations (those in the training set) are used to fit the model.
• Machine learning methods tend to perform worse when trained on fewer observations.
• Thus, the validation set error rate may tend to overestimate the test error rate for the model fit on the entire data set.
Leave-One-Out Cross-Validation
Instead of creating two subsets of comparable size, a single observation is used for the validation set and the remaining observations (n − 1) make up the training set.
LOOCV Algorithm:
– Split the entire data set of size n into:
• Blue = training data set (n − 1 observations)
• Beige = validation data set (1 observation)
– Fit the model using the training data set.
– Evaluate the model using the validation set and compute the corresponding MSE.
– Repeat this process n times, producing n squared errors. The average of these n squared errors estimates the test MSE:

    CV(n) = (1/n) Σ_{i=1}^{n} MSE_i
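A minimal LOOCV sketch using scikit-learn's LeaveOneOut splitter (the data are toy values for illustration):

```python
# LOOCV sketch: fit the model n times, each time holding out one observation.
# The toy data are illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1.8, 4.3, 5.9, 8.2, 9.7])

errors = []
for train_idx, test_idx in LeaveOneOut().split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    errors.append((y[test_idx][0] - pred[0]) ** 2)  # MSE_i for the held-out point

print(np.mean(errors))  # CV(n): average of the n squared errors
```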
Validation Set Approach vs. LOOCV

LOOCV has far less bias and, therefore, tends not to overestimate the test error rate.

Performing LOOCV multiple times always yields the same results, because there is no randomness in the training/validation set splits.

LOOCV is computationally intensive because the model has to be fit n times. However, there is a shortcut with OLS linear or polynomial regression (where h_i is the leverage of observation i):

    CV(n) = (1/n) Σ_{i=1}^{n} [(y_i − ŷ_i) / (1 − h_i)]^2
K-Fold Cross-Validation

Probably the simplest and most widely used method for estimating prediction
error.

This method directly estimates the average prediction error when the
machine learning method is applied to an independent test sample.

Ideally, if we had enough data, we would set aside a validation set (as
previously described) and use it to assess the performance of our prediction
model.

To finesse the problem, K-fold cross-validation uses part of the available data
to fit the model, and a different part to test it.
K-Fold Cross-Validation (cont.)
We use this method because LOOCV is computationally intensive.
We randomly divide the data set into K folds (typically K = 5 or 10).

• The first fold is treated as a validation set, and the method is fit on the remaining K − 1 folds. The MSE is computed on the observations in the held-out fold. The process is repeated K times, taking out a different part each time.
• By averaging the K estimates of the test error, we get an estimated validation (test) error rate for new observations.
K-Fold Cross-Validation (cont.)
Let the K folds be C_1, ..., C_K, where C_k denotes the indices of the observations in fold k. There are n_k observations in fold k: if n is a multiple of K, then n_k = n / K.

Compute:
    CV(K) = Σ_{k=1}^{K} (n_k / n) MSE_k,  where  MSE_k = (1/n_k) Σ_{i ∈ C_k} (y_i − ŷ_i)^2

and ŷ_i is the fitted value for observation i, obtained from the data with fold k removed.
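A minimal K-fold sketch with scikit-learn's cross_val_score (toy data; scikit-learn reports MSE negated by convention, so we flip the sign):

```python
# K-fold CV sketch (K = 5). cross_val_score returns one score per fold;
# sklearn reports MSE negated, so we flip the sign. Toy data below.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=1.0, size=100)

fold_mse = -cross_val_score(LinearRegression(), X, y,
                            cv=5, scoring="neg_mean_squared_error")
print(fold_mse)         # MSE_k for each of the K folds
print(fold_mse.mean())  # CV(K): average over the folds
```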
K-Fold Cross-Validation vs. LOOCV

• Left Panel: LOOCV error curve.
• Right Panel: 10-fold CV run nine separate times, each with a different random split of the data into ten parts.
Note: LOOCV is a special case of K-fold, where K = n.
Bias-Variance Trade-off for K-Fold Cross-Validation

Which is better, LOOCV or K-fold CV?
• LOOCV is more computationally intensive than K-fold CV.
• From the perspective of bias reduction, LOOCV is preferred to K-fold CV (when K < n).
• However, LOOCV has higher variance than K-fold CV (when K < n).
• Thus, we see the bias-variance trade-off between the two resampling methods.

We tend to use K-fold CV with K = 5 or K = 10, as these values have been shown empirically to yield test error rate estimates that suffer neither from excessively high bias nor from very high variance.
Cross-Validation on Classification Problems

We will cover classification problems in more detail later in the course, but we briefly show how CV can be used when Y is qualitative (categorical) as opposed to quantitative. Here, rather than use MSE to quantify test error, we instead use the number of misclassified observations.
LOOCV Error Rate:
    CV(n) = (1/n) Σ_{i=1}^{n} I(y_i ≠ ŷ_i)
We use CV as follows:
• Divide the data into K folds; hold out one part and fit using the remaining data (compute the error rate on the hold-out data); repeat K times.
• CV Error Rate: the average over the K error rates we have computed.
Cross-Validation: Wrong Way
Cross-Validation: Right Way
Summary

• Resampling methods for machine learning model selection and assessment.
• The validation set approach, leave-one-out cross-validation, and K-fold cross-validation.
