Machine Learning
TE IT
What is Machine Learning?
“Learning is any process by which a system improves
  performance from experience.”
                 - Herbert Simon
Definition by Tom Mitchell (1998):
Machine Learning is the study of algorithms that
• improve their performance P
• at some task T
• with experience E.
A well-defined learning task is given by <P, T, E>.
         What is Machine Learning?
Consider the example of a spam filter: an email program that watches which emails you mark as spam or not, and learns to filter spam accordingly.
T: To decide whether an email is spam or not
E: Observing the labels (spam / not spam) of the available email data
P: The number (or fraction) of emails correctly classified as spam/not spam
Samuel’s Checkers-Player
“Machine Learning: Field of study that gives computers the ability to learn without being explicitly programmed.” - Arthur Samuel (1959)
Defining the Learning Task
Improve on task T, with respect to performance metric P, based on experience E.

T: Playing checkers
P: Percentage of games won against an arbitrary opponent
E: Playing practice games against itself

T: Recognizing hand-written words
P: Percentage of words correctly classified
E: Database of human-labeled images of handwritten words

T: Driving on four-lane highways using vision sensors
P: Average distance traveled before a human-judged error
E: A sequence of images and steering commands recorded while observing a human driver
Slide credit: Ray Mooney
Traditional Programming
Data + Program → Computer → Output

Machine Learning
Data + Output → Computer → Program
Machine Learning process
    When Do We Use Machine Learning?
ML is used when:
•   Human expertise does not exist (navigating on Mars)
•   Humans can’t explain their expertise (speech recognition)
•   Models must be customized (personalized medicine)
•   Models are based on huge amounts of data (genomics)
Learning isn’t always useful:
• There is no need to “learn” to calculate payroll
A classic example of a task that requires machine learning: it is very hard to say what makes a “2”.
   Some more examples of tasks that are best
      solved by using a learning algorithm
• Recognizing patterns:
   – Facial identities or facial expressions
   – Handwritten or spoken words
   – Medical images
• Generating patterns:
   – Generating images or motion sequences
• Recognizing anomalies:
   – Unusual credit card transactions
   – Unusual patterns of sensor readings in a nuclear power plant
• Prediction:
   – Future stock prices or currency exchange rates
               Sample Applications
•   Web search
•   Computational biology
•   Finance
•   E-commerce
•   Space exploration
•   Robotics
•   Information extraction
•   Social networks
•   Debugging software
Career opportunities
Search engine companies: Google, Yahoo, Bing (Microsoft)
Data science vendors: Palantir, Teradata, Pixar, SAS, Alpine Labs, Pivotal, Tableau, and more; over 8,000 companies are hiring data scientists.
Social networks: Facebook, Twitter, LinkedIn, Instagram, Tumblr
Engineering related: Intel, IBM, HCL Technologies, Wipro Technologies, Verizon, Visa, Boeing, SAP, Oracle, the oil industry
Financial related: Amazon, Apple, eBay, EMC, Bank of America, Capital One, PayPal, GE Capital
Data and types: Scales of Measurement
Quantitative: interval and ratio scales; the data may be discrete or continuous.
Qualitative: nominal and ordinal scales.
Features and Patterns:
Pattern recognition is the process of finding regularities and similarities in data using machine learning.
In machine learning and pattern recognition, a feature is an individual measurable property or characteristic of a phenomenon. Choosing informative, discriminating, and independent features is a crucial element of effective algorithms in pattern recognition, classification, and regression.
Attributes are the items of data that are used in machine learning. Attributes are also referred to as variables, fields, or predictors.
Learning Tasks: Descriptive and Predictive Tasks
                               Types of Learning
     • Supervised (inductive) learning
           – Given: training data + desired outputs (labels)
     • Unsupervised learning
           – Given: training data (without desired outputs)
     • Semi-supervised learning
           – Given: training data + a few desired outputs
     • Reinforcement learning
           – Rewards from sequence of actions
Based on slide by Pedro Domingos
Learning Paradigms
Supervised (inductive) learning
Supervised Learning: Regression
• Given (x1, y1), (x2, y2), ..., (xn, yn)
• Learn a function f(x) to predict y given x
  – y is real-valued == regression

[Figure: September Arctic Sea Ice Extent (1,000,000 sq km) vs. Year, 1970–2020]
Data from G. Witt. Journal of Statistics Education, Volume 21,
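To make the idea concrete, here is a minimal regression sketch, assuming scikit-learn is available; the (year, extent) values below are invented for illustration, not the actual sea-ice data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy (x, y) pairs: year vs. a real-valued quantity to predict.
X = np.array([[1980], [1990], [2000], [2010]])  # feature: year
y = np.array([7.8, 6.2, 6.3, 4.9])              # target: real-valued

model = LinearRegression().fit(X, y)            # learn f(x)
print(model.predict([[2020]]))                  # predict y for a new x
```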
Supervised Learning: Classification
• Given (x1, y1), (x2, y2), ..., (xn, yn)
• Learn a function f(x) to predict y given x
  – y is categorical == classification

[Figure: Cancer (Malignant / Benign) vs. Tumor Size; y = 1 (malignant) or 0 (benign)]
Based on example by Andrew Ng
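A minimal classification sketch in the spirit of the tumor-size example; the sizes and labels are invented for illustration, and scikit-learn is assumed.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1.0], [1.5], [2.0], [3.0], [3.5], [4.0]])  # tumor size
y = np.array([0, 0, 0, 1, 1, 1])  # 0 = benign, 1 = malignant

clf = LogisticRegression().fit(X, y)   # learn f(x) with categorical y
print(clf.predict([[2.8]]))            # predicted class for a new tumor
```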
Unsupervised Learning
• Given x1, x2, ..., xn (without labels)
• Output hidden structure behind the x’s
  – E.g., clustering (see the sketch below)
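A minimal unsupervised-learning sketch: k-means clustering on unlabeled points. The data is invented for illustration and scikit-learn is assumed.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [1.2, 0.8], [0.9, 1.1],   # one blob
              [5, 5], [5.1, 4.9], [4.8, 5.2]])  # another blob

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # hidden structure: cluster assignment per point
```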
Unsupervised Learning
Genomics application: group individuals by genetic similarity.
[Figure: heat map of genes (rows) × individuals (columns)]
[Source: Daphne Koller]
Unsupervised Learning
Further applications: organizing computing clusters, social network analysis, market segmentation, astronomical data analysis.
Image credit: NASA/JPL-Caltech/E. Churchwell (Univ. of Wisconsin, Madison)
Slide credit: Andrew Ng
Unsupervised Learning
• Independent component analysis – separate a combined signal into its original sources
Image credit: statsoft.com. Audio from http://www.ism.ac.jp/~shiro/research/blindsep.html
Reinforcement Learning
• Given a sequence of states and actions with (delayed) rewards, output a policy
  – A policy is a mapping from states to actions that tells you what to do in a given state
• Examples:
  – Credit assignment problem
  – Game playing
  – Robot in a maze
  – Balancing a pole on your hand
The Agent-Environment Interface
Agent and environment interact at discrete time steps t = 0, 1, 2, ...
  The agent observes the state at step t:   s_t ∈ S
  produces an action at step t:             a_t ∈ A(s_t)
  gets the resulting reward:                r_{t+1}
  and the resulting next state:             s_{t+1}

... s_t --a_t--> r_{t+1}, s_{t+1} --a_{t+1}--> r_{t+2}, s_{t+2} --a_{t+2}--> r_{t+3}, s_{t+3} --a_{t+3}--> ...
Slide credit: Sutton & Barto
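As a concrete instance of this loop, here is a hedged sketch of tabular Q-learning on a toy 5-state chain. The environment, rewards, and hyperparameters are all invented for illustration; the slides do not prescribe a particular algorithm.

```python
import numpy as np

n_states, n_actions = 5, 2     # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.9, 0.1
rng = np.random.default_rng(0)

for episode in range(500):
    s = 0
    while s != n_states - 1:                      # terminal state at the end
        # epsilon-greedy action selection
        a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == n_states - 1 else 0.0   # delayed reward at the goal
        # Update toward r_{t+1} + gamma * max_a Q(s_{t+1}, a)
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

policy = Q.argmax(axis=1)   # policy: mapping from states to actions
print(policy)               # should prefer action 1 (right) everywhere
```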
  Reinforcement Learning
https://www.youtube.com/watch?v=4cgWya-wjgY
Framing a Learning Problem
    Designing a Learning System
• Choose the training experience
• Choose exactly what is to be learned
   – i.e. the target function
• Choose how to represent the target function
• Choose a learning algorithm to infer the target
  function from the experience
[Diagram: Environment/Experience supplies training data to a Learner, which produces Knowledge used by a Performance Element; testing data evaluates the Performance Element]
Training vs. Test Distribution
• We generally assume that the training and test examples are independently drawn from the same overall distribution of data
  – We call this “i.i.d.”, which stands for “independent and identically distributed”
• If the examples are not independent, collective classification is required
• If the test distribution is different, transfer learning is required
                ML in a Nutshell
• Tens of thousands of machine learning
  algorithms
  – Hundreds new every year
• Every ML algorithm has three components:
  – Representation
  – Optimization
  – Evaluation
Various Function Representations
• Numerical functions
   – Linear regression
   – Neural networks
   – Support vector machines
• Symbolic functions
   – Decision trees
   – Rules in propositional logic
   – Rules in first-order predicate logic
• Instance-based functions
   – Nearest-neighbor
   – Case-based
• Probabilistic Graphical Models
   – Naïve Bayes
   – Bayesian networks
   – Hidden-Markov Models (HMMs)
   – Probabilistic Context Free Grammars (PCFGs)
   – Markov networks
Various Search/Optimization Algorithms
  • Gradient descent
     – Perceptron
     – Backpropagation
  • Dynamic Programming
     – HMM Learning
     – PCFG Learning
  • Divide and Conquer
     – Decision tree induction
     – Rule learning
  • Evolutionary Computation
     – Genetic Algorithms (GAs)
     – Genetic Programming (GP)
     – Neuro-evolution
                     Evaluation
•   Accuracy
•   Precision and recall
•   Squared error
•   Likelihood
•   Posterior probability
•   Cost / Utility
•   Margin
•   Entropy
•   K-L divergence
•   etc.
ML in Practice
• Understand domain, prior knowledge, and goals
• Loop:
  – Data integration, selection, cleaning, pre-processing, etc.
  – Learn models
  – Interpret results
• Consolidate and deploy discovered knowledge
Based on a slide by Pedro Domingos
           Lessons Learned about Learning
      • Learning can be viewed as using direct or indirect
        experience to approximate a chosen target function.
      • Function approximation can be viewed as a search
        through a space of hypotheses (representations of
        functions) for one that best fits a set of training data.
      • Different learning methods assume different
        hypothesis spaces (representation languages) and/or
        employ different search techniques.
Slide credit: Ray Mooney
Features
• Mathematically, features are functions that map from the instance space to some set of feature values called the domain of the feature.
• Features vary in nature, e.g.:
  o A count over the set of integers, such as the number of occurrences of a particular word
  o A Boolean, such as true/false for whether an email is spam or ham
  o An arbitrary finite set, such as a set of colors or shapes
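To make this concrete, here is a small sketch of mapping one instance (an email) to feature values of these kinds; the email fields and feature names are invented for illustration.

```python
# A raw instance and its hand-crafted features (all names hypothetical).
email = {"text": "win a free prize now", "sender_known": False}

features = {
    "count_free": email["text"].split().count("free"),  # integer feature
    "sender_known": email["sender_known"],              # Boolean feature
    "length": len(email["text"].split()),               # integer feature
}
print(features)  # {'count_free': 1, 'sender_known': False, 'length': 5}
```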
Data and Dimensionality
Increasing the number of features will not always improve classification accuracy. In practice, the inclusion of more features might actually lead to worse performance.
The number of training examples required increases exponentially with the dimensionality d (i.e., as k^d, where k is the number of bins per feature): with k = 3, one feature needs 3 bins, two features need 3² = 9 bins, and three features need 3³ = 27 bins.
Dimensionality reduction
• The final classification or regression may be based on too many factors; these factors are variables called features
• The higher the number of features, the harder it gets to visualize the training set
• Many of these features are correlated, and hence redundant
• Dimensionality reduction is a set of techniques in machine learning and statistics to reduce the number of random variables under consideration: feature selection and feature extraction
 Some features (dimensions) bear little or no useful information (e.g., color of hair for car selection)
  – We can drop some features
  – We have to estimate which features can be dropped from the data
 Several features can be combined together without loss, or even with gain, of information (e.g., income of all family members for a loan application)
  – Some features can be combined together
  – We have to estimate which features to combine from the data
          Why Dimensionality Reduction
    • What is the objective?
      Choose an optimum set of features of lower dimensionality to
      improve classification accuracy.
    • Different methods can be used to reduce dimensionality:
       • Feature extraction
       • Feature selection
•    Reduces Time Complexity: Less computation
•    Reduces Space Complexity: Less parameters
•    Saves the cost of observing the feature
•    Simpler models are more robust on small datasets
•    More interpretable; simpler explanation
•    Data visualization (structure, groups, outliers, etc) if plotted in 2 or 3
     dimensions
Dimensionality Reduction (cont’d)
Feature selection: chooses a subset of the original features.

  x = (x₁, x₂, ..., x_N)ᵀ  →  (x_{i1}, x_{i2}, ..., x_{iK})ᵀ,  K << N

• Find a smaller subset of a many-dimensional data set to create a data model
• Find the k features of the d dimensions that give us the most information and discard the other (d − k) dimensions
• Subset selection is one of the most widely used methods
Feature Extraction
Feature extraction: finds a set of new features (i.e., through some mapping f()) from the existing features.
From a mathematical point of view, finding an optimum mapping y = f(x) is equivalent to optimizing an objective criterion.
Different methods use different objective criteria, e.g.:
 Minimize Information Loss: represent the data as accurately as possible in the lower-dimensional space.
 Maximize Discriminatory Information: enhance the class-discriminatory information in the lower-dimensional space.
              Feature Extraction
Popular linear feature extraction methods:
   Principal Components Analysis (PCA): Seeks a projection that preserves as
   much information in the data as possible.
   Linear Discriminant Analysis (LDA): Seeks a projection that best discriminates
   the data.
Many other methods:
   Making features as independent as possible (Independent Component
   Analysis or ICA).
   Retaining interesting directions (Projection Pursuit).
   Embedding to lower dimensional manifolds (Isomap, Locally Linear
   Embedding or LLE).
Subset-selection
• The goal is to find the best subset of the set of features
• The best subset contains the least number of dimensions that most contribute to accuracy
• Used in both regression and classification problems
• There are 2^d possible subsets of d variables
• It is not possible to test all of them unless d is small
• Instead, heuristics are used to get a reasonable (but not optimal) solution in reasonable (polynomial) time
Subset-selection
Forward search (see the sketch below):
  – Start from an empty set of features
  – Try each of the remaining features
  – Estimate the classification/regression error for adding each specific feature
  – Select the feature that gives the maximum improvement in validation error
  – Stop when there is no significant improvement
Backward search:
  – Start with the original set of size d
  – Drop the feature with the smallest impact on the error
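A hedged sketch of greedy forward search, assuming scikit-learn; the synthetic data, model choice, and stopping rule are illustrative, not prescribed by the slides.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))                          # 6 candidate features
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(size=100)   # only 2 matter

selected, remaining = [], list(range(X.shape[1]))
best_score = -np.inf
while remaining:
    # Try adding each remaining feature; keep the one with the best CV score.
    scores = [(cross_val_score(LinearRegression(),
                               X[:, selected + [j]], y).mean(), j)
              for j in remaining]
    score, j = max(scores)
    if score <= best_score:          # stop when no improvement
        break
    best_score = score
    selected.append(j)
    remaining.remove(j)

print(selected)  # expected to pick features 0 and 3 first
```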
Principal Component Analysis:
• A mapping from the d-dimensional space to a new k-dimensional space (k < d), with minimum loss of information.
• As the dimensionality of the data increases, the difficulty of visualizing it and performing computations on it also increases.
• There are two main strategies to reduce the dimensions of the data:
  o Remove the redundant dimensions
  o Keep only the most important dimensions
Concepts used in PCA
Variance: a measure of the variability of the data; it simply measures how spread out the data set is.
Concepts used in PCA
Covariance: a measure of the extent to which corresponding elements from two sets of ordered data move in the same direction.
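For reference, the usual sample formulas behind these two definitions (a sketch; the slides do not specify whether the 1/n or the 1/(n − 1) normalization is intended):

```latex
\mathrm{Var}(X) = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2,
\qquad
\mathrm{Cov}(X, Y) = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right)
```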
Concepts used in PCA
• PCA finds a new set of dimensions such that all the dimensions are orthogonal (and hence linearly independent) and ranked according to the variance of the data along them.
• This means the more important principal axis comes first (more important = more variance / more spread-out data).
• The principal direction in which the data varies is shown by the U axis, and the second most important direction is the V axis, orthogonal to it.
• If each (X, Y) instance is transformed into its corresponding (U, V) coordinates, the data is de-correlated, meaning that the covariance between the U and V variables is zero.
Concepts used in PCA
• The directions U and V are called the principal components.
[Figures: PCA for Data Representation; PCA for Dimension Reduction]
Working of PCA
1. The first step is to gather reliable raw data from a sample (e.g., based on a questionnaire).
2. The second step is to calculate the correlations between the variables.
3. In principal component analysis, principal components are extracted and presented as a table with the components in columns and the variables in rows.
4. The principal components table is truncated. Components are reported in order by eigenvalue and by the proportion of total variance explained. Frequently, these components are easily interpreted.
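A minimal PCA sketch via the eigendecomposition of the covariance matrix, mirroring steps 2–4 above in spirit; the 2-D synthetic data is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2-D data (synthetic).
X = rng.normal(size=(200, 2)) @ np.array([[2.0, 0.8], [0.8, 1.0]])

Xc = X - X.mean(axis=0)               # center the data
C = np.cov(Xc, rowvar=False)          # covariance matrix of the variables
eigvals, eigvecs = np.linalg.eigh(C)  # eigenvectors = principal directions

order = np.argsort(eigvals)[::-1]     # rank components by variance
components = eigvecs[:, order]        # U axis first, then V axis
explained = eigvals[order] / eigvals.sum()

Z = Xc @ components[:, :1]            # project onto the first component
print(explained)                      # proportion of variance per component
```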
Objectives of PCA
• PCA helps to extract the most important information from the data table and compress the size of the data set by keeping only the important information.
• PCA also simplifies the description of the data set and helps analyze the structure of the observations and the variables.
Dataset Validation Techniques – Hold-out, k-fold Cross-Validation, Leave-One-Out Cross-Validation (LOOCV)
Lesson Goals:
 Understand the basic concepts of resampling methods for machine learning model selection and assessment.
 Understand the similarities and differences between the validation set approach, leave-one-out cross-validation, and K-fold cross-validation.
 Understand the basic theoretical concepts of the bootstrap method for assessing statistical accuracy.
 Recognize that the performance of machine learning models depends on their prediction capability on independent test data.
             Resampling Methods
Involves repeatedly drawing samples from a training set and refitting a
model of interest on each sample in order to obtain more information
about the fitted model.
Example: We can estimate the variability of a linear regression fit by repeatedly drawing different samples from the training data, fitting an OLS regression to each new sample, and then examining the extent to which the resulting fits differ.
Model Assessment: having chosen a final model, estimating its
prediction error on new data.
Model Selection: estimating the performance of different models in
order to choose the best one.
            Resampling Methods (cont.)
Cross-Validation
   Used to estimate test set prediction error rates associated with a given
   machine learning method to evaluate its performance, or to select the
   appropriate level of model flexibility.
Bootstrap
   Used most commonly to provide a measure of accuracy of a parameter
   estimate or of a given machine learning method.
             Model Assessment
The generalization performance of a machine learning method
relates to its prediction capability on independent test sets.
Assessment of this performance is extremely important in
practice, since it guides the choice of the machine learning
method or model.
Further, this gives us a measure of the quality of the ultimately
chosen model.
Model Assessment (cont.)
Test Error
  The average error that results from using a machine learning method to predict the response on a new observation.
  The prediction error over an independent test sample.
Training Error
  The average loss over the training sample: $\overline{\mathrm{err}} = \frac{1}{n}\sum_{i=1}^{n} L(y_i, \hat{f}(x_i))$
Note: The training error rate can dramatically underestimate the test error rate.
Model Assessment (cont.)
 As the model becomes more and more complex, it uses the training data more and is able to adapt to more complicated underlying structures.
 Hence, there is a decrease in bias but an increase in variance.
 However, training error is not a good estimate of the test error.
• Training error consistently decreases with model complexity.
• A model with zero training error is overfit to the training data and will typically generalize poorly.
Model Assessment (cont.)
 If we are in a data-rich situation, the best approach for both model selection and model assessment is to randomly divide the dataset into three parts: a training set, a validation set, and a test set.
 The training set is used to fit the models. The validation set is used to estimate prediction error for model selection. The test set is used for assessment of the prediction error of the final chosen model.
 A typical split might be 50% for training, and 25% each for validation and testing.
Model Assessment (cont.)
Best solution: use a large designated test set, which is often not available. For the methods presented here, there is insufficient data to split it into three parts.
There are some methods to make mathematical adjustments to the
training error rate in order to estimate the test error rate:
   Cp statistic, AIC, BIC (we will discuss these next week)
Here, we consider cross-validation (CV) methods that estimate the test
error by holding out a subset of the training observations from the
fitting process, and then applying the machine learning method to
those held out observations.
Overview: Model Selection
By far, the most important use of validation is for model selection,
which we will discuss in greater detail next week.
This could be the choice between a linear model and a nonlinear
model, the choice of the order of polynomial in a model, the choice of
a regularization parameter, or any other choice that affects the
learning process.
In almost every learning situation, there are some choices to be made
and we need a principled way of making these choices.
The leap is to realize that validation can be used to estimate the out-
of-sample error for more than one model.
                Overview: Model Selection (cont.)
Suppose we have M models; validation can
be used to select one of these models.
We use the training data to fit the model,
and we evaluate each model on the
validation set to obtain the validation
errors.
It is now a simple matter to select the
model with the lowest validation error.
Validation Set Approach
Suppose that we would like to find a set of variables that give the
lowest validation error rate (an estimate of the test error rate).
If we have a large data set, we can achieve this goal by randomly
splitting the data into separate training and validation data sets.
Then, we use the training data set to build each possible model and
select the model that gives the lowest error rate when applied to the
validation data set.
[Diagram: the data set split into Training Data and Validation Data]
Validation Set Approach: Example
Example: we want to predict mpg from horsepower.
Two models:
  mpg ~ horsepower
  mpg ~ horsepower + horsepower²
Which model gives a better fit?
  We randomly split (50/50) the 392 observations into training and validation data sets, and we fit both models using the training data.
  Next, we evaluate both models using the validation data set.
  Winner = the model with the lowest validation (testing) MSE. A sketch follows below.
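A sketch of the validation-set approach for choosing between the linear and quadratic models, assuming scikit-learn; synthetic data stands in for the actual mpg/horsepower data set.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
hp = rng.uniform(50, 200, size=392)                     # invented predictor
mpg = 50 - 0.3 * hp + 0.0007 * hp**2 + rng.normal(0, 2, size=392)

hp_tr, hp_va, y_tr, y_va = train_test_split(hp, mpg, test_size=0.5,
                                            random_state=1)

for degree in (1, 2):                # mpg ~ hp  vs.  mpg ~ hp + hp^2
    X_tr = np.vander(hp_tr, degree + 1)   # polynomial design matrix
    X_va = np.vander(hp_va, degree + 1)
    fit = LinearRegression().fit(X_tr, y_tr)
    mse = mean_squared_error(y_va, fit.predict(X_va))
    print(degree, round(mse, 2))     # winner = lowest validation MSE
```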
              Validation Set Approach: Example Results
Left Panel: Validation error estimates for a single split into
training and validation data sets.
Right Panel: Validation error estimates for multiple splits;
shows the test error rate is highly variable.
              Validation Set Approach: Review
Advantages:
  Conceptually simple and easy implementation.
Drawbacks:
  The validation set error rate (MSE) can be highly variable.
  Only a subset of the observations (those in the training set) are used
  to fit the model.
  Machine learning methods tend to perform worse when trained on
  fewer observations.
  Thus, the validation set error rate may tend to overestimate the test
  error rate for the model fit on the entire data set.
Leave-One-Out Cross-Validation
Instead of creating two subsets of comparable size, a single observation is used for the validation set, and the remaining (n − 1) observations make up the training set.

LOOCV Algorithm:
  – Split the entire data set of size n into:
    • Blue = training data set
    • Beige = validation data set
  – Fit the model using the training data set
  – Evaluate the model using the validation set and compute the corresponding MSE
  – Repeat this process n times, producing n squared errors; their average estimates the test MSE:

  $CV_{(n)} = \frac{1}{n}\sum_{i=1}^{n} \mathrm{MSE}_i$
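A minimal LOOCV sketch matching the algorithm above, assuming scikit-learn; the data is invented for illustration.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 1))
y = 2 * X[:, 0] + rng.normal(size=30)

mses = []
for train_idx, test_idx in LeaveOneOut().split(X):   # n splits of size 1
    fit = LinearRegression().fit(X[train_idx], y[train_idx])
    err = y[test_idx] - fit.predict(X[test_idx])
    mses.append(float(err[0] ** 2))                  # one squared error
print(np.mean(mses))   # CV(n): the average of the n squared errors
```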
              Validation Set Approach vs. LOOCV
LOOCV has far less bias and, therefore, tends not to overestimate the
test error rate.
Performing LOOCV multiple times always yields the same results
because there is no randomness in the training/validation set splits.
LOOCV is computationally intensive because the model has to be fit n times. However, there is a shortcut with OLS linear or polynomial regression (where hᵢ is the leverage):

  $CV_{(n)} = \frac{1}{n}\sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{1 - h_i} \right)^2$
                K-Fold Cross-Validation
Probably the simplest and most widely used method for estimating prediction
error.
This method directly estimates the average prediction error when the
machine learning method is applied to an independent test sample.
Ideally, if we had enough data, we would set aside a validation set (as
previously described) and use it to assess the performance of our prediction
model.
To finesse the problem, K-fold cross-validation uses part of the available data
to fit the model, and a different part to test it.
K-Fold Cross-Validation (cont.)
We use this method because LOOCV is computationally intensive. We randomly divide the data set into K folds (typically K = 5 or 10).
 The first fold is treated as a validation set, and the method is fit on the remaining K − 1 folds. The MSE is computed on the observations in the held-out fold. The process is repeated K times, taking out a different part each time.
 By averaging the K estimates of the test error, we get an estimated validation (test) error rate for new observations.
K-Fold Cross-Validation (cont.)
Let the K folds be C1, ..., CK, where Ck denotes the indices of the observations in fold k. There are nk observations in fold k: if n is a multiple of K, then nk = n / K.
Compute:

  $CV_{(K)} = \sum_{k=1}^{K} \frac{n_k}{n} \mathrm{MSE}_k$, where $\mathrm{MSE}_k = \frac{1}{n_k} \sum_{i \in C_k} (y_i - \hat{y}_i)^2$

and $\hat{y}_i$ is the fitted value for observation i, obtained from the data with fold k removed.
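A minimal K-fold CV sketch implementing the formula above with K = 5, assuming scikit-learn; the synthetic data is illustrative.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 2 * X[:, 0] + rng.normal(size=100)

n, K = len(y), 5
cv_err = 0.0
for tr, te in KFold(n_splits=K, shuffle=True, random_state=1).split(X):
    fit = LinearRegression().fit(X[tr], y[tr])
    mse_k = np.mean((y[te] - fit.predict(X[te])) ** 2)
    cv_err += (len(te) / n) * mse_k          # weight each fold by n_k / n
print(cv_err)   # CV(K) estimate of the test error
```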
             K-Fold Cross-Validation vs. LOOCV
Left Panel: LOOCV Error Curve
Right Panel: 10-fold CV run nine separate times, each with a different
random split of the data into ten parts.
Note: LOOCV is a special case of K-fold, where K = n
Bias-Variance Trade-off for K-Fold Cross-Validation
Which is better, LOOCV or K-fold CV?
    LOOCV is more computationally intensive than K-fold CV
    From the perspective of bias reduction, LOOCV is preferred to K-fold CV (when K < n)
    However, LOOCV has higher variance than K-fold CV (when K < n)
    Thus, we see the bias-variance trade-off between the two resampling methods
We tend to use K-fold CV with K = 5 or K = 10, as these values have been shown
empirically to yield test error rate estimates that suffer neither from excessively
high bias nor from very high variance
Cross-Validation on Classification Problems
We will cover classification problems in more detail later in the course, but we briefly show how CV can be used when Y is qualitative (categorical) as opposed to quantitative. Here, rather than use MSE to quantify test error, we instead use the number of misclassified observations.
LOOCV Error Rate: $CV_{(n)} = \frac{1}{n}\sum_{i=1}^{n} I(y_i \neq \hat{y}_i)$
We use CV as follows:
 Divide the data into K folds; hold out one part and fit using the remaining data (compute the error rate on the held-out data); repeat K times.
 CV Error Rate: the average over the K error rates we have computed.
Cross-Validation: Wrong Way
Cross-Validation: Right Way
          Summary
Resampling methods for machine learning model selection
and assessment.
The validation set approach, leave-one-out cross-validation,
and K-fold cross-validation.