Machine Learning for Data Analysts
Machine Learning for Data Analysts
Module-2
Topics Covered (Unit 2)
 REGRESSION: Linear Regression and Logistic
  Regression.
 BAYESIAN LEARNING - Bayes theorem, Concept
  learning, Bayes Optimal Classifier, Naïve Bayes classifier,
  Bayesian belief networks, EM algorithm.
 SUPPORT VECTOR MACHINE: Introduction, Types
  of support vector kernel – (Linear Kernel, Polynomial
  Kernel, and Gaussian Kernel), Hyperplane – (Decision
  surface), Properties of SVM, and Issues in SVM.
Overview
            •Dynamic Programming
            •Monte Carlo methods
    • ANN
Applications of these approaches
Classification Vs Regression
Regression
 Regression is a statistical method used in finance,
  investing, and other disciplines that attempts to determine
  the strength and character of the relationship between one
  dependent variable (usually denoted by Y) and a series of
  other variables (known as independent variables).
Applications
 Predicting a stock market.
 Forecasting amount of precipitation in a region.
 Projecting total sale of a company etc.
Regression: Types (Basic)
 The two basic types of regression are simple linear
 regression and multiple linear regression, although there
 are non-linear regression methods for more complicated
 data and analysis. Simple linear regression uses one
 independent variable to explain or predict the outcome
 of the dependent variable Y, while multiple linear
 regression uses two or more independent variables to
 predict the outcome.
Equations/General Form of Regression
 The general form of each type of regression is:
• Simple linear regression: Y = a + bX + u
• Multiple linear regression: Y = a + b1X1 + b2X2 + b3X3 + ... +
  btXt + u
Where:
• Y = the variable that you are trying to predict (dependent
  variable).
• X = the variable that you are using to predict Y
  (independent variable).
• a = the intercept.
• b = the slope.
• u = the regression residual.
Important Point :Regression
 Regression takes a group of random variables, thought to
  be predicting Y, and tries to find a mathematical
  relationship between them. This relationship is typically
  in the form of a straight line (linear regression) that best
  approximates all the individual data points. In multiple
  regression, the separate variables are differentiated by
  using subscripts.
 The core idea is to obtain a line that best fits the data.
  The best fit line is the one for which total prediction error
  (all data points) are as small as possible. Error is the
  distance between the point to the regression line.
Preliminaries
 Let D be a data set that contains N observations:
                     f(x)=1/(1+e-x)
Logistic regression
 Scenario:
 – A multidimensional feature space (features can be
  categorical or continuous).
 – Outcome is discrete, not continuous.
      We’ll focus on case of two classes/categories.
 – It seems plausible that a linear decision boundary
  (hyperplane) will give good predictive accuracy.
Using a logistic regression model
So why not always use probability?
 The problem is that probability and odds have different
  properties that give odds some advantages in statistics.
  For example, in logistic regression the odds ratio
  represents the constant effect of a predictor X, on the
  likelihood that one outcome will occur.
 The key phrase here is constant effect. In regression
  models, we often want a measure of the unique effect of
  each X on Y. If we try to express the effect of X on the
  likelihood of a categorical Y having a specific value
  through probability, the effect is not constant.
So why not always use probability?
 What that means is there is no way to express in one
  number how X affects Y in terms of probability. The
  effect of X on the probability of Y has different values
  depending on the value of X.
 So while we would love to use probabilities because
  they’re intuitive, you’re just not going to be able to
  describe that effect in a single number. So if you need
  to communicate that effect to a research audience,
  you’re going to have to wrap your head around odds
  ratios.
What about Probabilities
 What you can do, and many people do, is to use the logistic
  regression model to calculate predicted probabilities at specific
  values of a key predictor, usually when holding all other predictors
  constant.
 This is a great approach to use together with odds ratios. The
  odds ratio is a single summary score of the effect, and the
  probabilities are more intuitive.
 Presenting probabilities without the corresponding odds ratios
  can be problematic, though.
 First,when X, the predictor, is categorical, the effect of X can be
  effectively communicated through a difference or ratio of
  probabilities. The probability a person has a relapse in an
  intervention condition compared to the control condition makes a
  lot of sense.
What about Probabilities
 But the p-value for that effect is not the p-value for the
  differences in probabilities.
 If you present a table of probabilities at different values of X,
  most research audiences will, at least in their minds, make
  those difference comparisons between the
  probabilities. They do this because they’ve been trained to
  do this in linear models.
 These differences in probabilities don’t line up with the p-
  values in logistic regression models, though. And this can get
  quite confusing.
 Second, when X, the predictor is continuous, the odds ratio is
  constant across values of X. But probabilities aren’t.
Using a logistic regression model
 Can interpret prediction from a logistic regression model
  as:
 A probability of class membership
 A class assignment, by applying threshold to probability
      *threshold represents decision boundary in feature
  space
Training a logistic regression model
 Need to optimize β so the model gives the best possible
  reproduction of training set labels
 Usually done by numerical approximation of maximum
  likelihood
 On really large datasets, may use stochastic gradient
  descent
Logistic regression in one
dimension
Pros and Cons
Advantages:
 – Makes no assumptions about distributions of classes in feature
  space
 Easily extended to multiple classes (multinomial regression)
 Natural probabilistic view of class predictions
 Quick to train
 Very fast at classifying unknown records
 Good accuracy for many simple data sets
 Resistant to overfitting
 Can interpret model coefficients as indicators of feature importance
Disadvantages:
 Linear decision boundary
Difference between Linear & Logistic Regression
 Difference between Linear & Logistic Regression
              P ( A, B ) = P ( A | B ) P ( B ) = P ( B | A) P ( A)
•Sum Rule: probability of a disjunction of two events A and B:
P ( A + B ) = P ( A) + P ( B ) − P ( AB )
                                                                                    n
•Theorem of Total Probability : if events A1, …., An are mutually exclusive with    P( A ) = 1
                                                                                   i =1
                                                                                          i
                                       n
                          P ( B ) =  P ( B | Ai ) P ( Ai )
                                      i =1
Some Notations
•D: the training data
•h: a hypothesis h ϵ H
•P(h): the prior probability of h: the initial probability that hypothesis h holds,
          before we have observed the training data
•P(D | h): the probability of observing data D given some world in which h holds
•P(x | y): the probability of x occurring given that y has been observed
•P(h | D): the posterior probability of h given D: the probability that h holds given
         the observed training data D
                      2. Bayes’ Rule
                                                   Understanding Bayes' rule
                                                   d = data
                                                   h = hypothesis
                                                   Proof. Just rearrange :
               P ( D | h) P ( h)
    P(h | D) =                                     p ( h | d ) P ( d ) = P ( d | h) P ( h)
                    P( D)                          P ( d , h) = P ( d , h)
                                                   the same joint probability
                                                   on both sides
•   If we know P(X,Y), then the so-called marginal probability P(X) can be computed
    as            P( X ) =  P( X , Y )
                              Y
•    Probabilities sum to 1. Conditional probabilities sum to 1 provided that their
    conditions are the same.
                 Choosing Hypotheses
•   Maximum Likelihood (ML)
    hypothesis:                                    hML = arg max P (d | h)
                                                                hH
MAP estimation algorithms do not compute the posterior probability of each hypothesis
to decide which is the most probable hypothesis. Assuming that our hypothesis space is
continuous (i.e. fairness of the coin encoded as probability of observing heads,
coefficient of a regression model, etc.), where endless possible hypotheses are present
even in the smallest range that the human mind can think of, or for even a discrete
hypothesis space with a large number of possible outcomes for an event, we do not need
to find the posterior of each hypothesis in order to decide which is the most probable
hypothesis. Therefore, the practical implementation of MAP estimation algorithms uses
approximation techniques, which are capable of finding the most probable hypothesis
without computing posteriors or only by computing some of them.
Using the Bayesian theorem, we can now
incorporate our belief as the prior probability,
which was not possible when we
used frequentist statistics. However, we still
have the problem of deciding a sufficiently
large number of trials or attaching a
confidence to the concluded hypothesis. This is
because the above example was solely
designed to introduce the Bayesian theorem
and each of its terms. Let us now gain a better
understanding of Bayesian learning to learn
about the full potential of Bayes' theorem.
    An example: Does patient have cancer
                 or not?
•    A patient takes a lab test and the result comes back positive. It is known that the
     test returns a correct positive result in only 98% of the cases and a correct
     negative result in only 97% of the cases. Furthermore, only 0.008 of the entire
     population has this disease.
 Suppose we now observe a new patient for whom the lab test returns a positive
 result. Should we diagnose the patient as having cancer or not? The maximum a
 posteriori hypothesis can be found using hMAP
while the posterior probability of cancer is significantly higher than its prior probability, the
most probable hypothesis is still that the patient does not have cancer i.e.
3.Baye’s Theorem and Concept Learning
The set of items/objects over which the concept is defined is called the set of
instances and denoted by X. The concept or function to be learned is called the
target concept and denoted by c. It can be seen as a boolean valued function defined
over X and can be represented as c: X -> {0, 1}.
If we have a set of training examples with specific features of target concept C, the
problem faced by the learner is to estimate C that can be defined on training data.
H is used to denote the set of all possible hypotheses that the learner may consider
regarding the identity of the target concept. The goal of a learner is to find a
hypothesis H that can identify all the objects in X so that h(x) = c(x) for all x in X.
An algorithm that supports concept learning requires:
1.Training data (past experiences to train our models)
2.Target concept (hypothesis to identify data objects)
3.Actual data objects (for testing the models)
Inductive Learning
Hypothesis
As we discussed earlier, the ultimate goal of concept learning is to identify a
hypothesis H identical to target concept C over data set X with the only available
information about C being its value over X. Our algorithm can guarantee that it best
fits the training data. In other words:
 With the training data, we have with two data objects as positive samples and
 one as negative:
 1.x1: <true, true, false, false> : +ve
 2.x2: <true, false, false, true> : +ve
 3.x3:<true, false, false, true> : -ve
Hypothesis Notations
Each of the data objects represents a concept and hypotheses. Considering a
hypothesis <true, true, false, false> is more specific because it can cover
only one sample. Generally, we can add some notations into this hypothesis.
We have the following notations:
1.ⵁ (represents a hypothesis that rejects all)
2.< ? , ? , ? , ? > (accepts all)
3.<true, false, ? , ? > (accepts some)
The hypothesis ⵁ will reject all the data samples. The hypothesis <? , ? , ? , ?
> will accept all the data samples. The ? notation indicates that the values of
this specific feature do not affect the result.
The total number of the possible hypothesis is (3 * 3 * 3 * 3) + 1 — 3 because
one feature can have either true, false, or ? and one hypothesis for rejects all
(ⵁ).
General to Specific
Many machine learning algorithms rely on the concept of general-to-
specific ordering of hypothesis.
1.h1 = < true, true, ?, ? >
2.h2 = < true, ? , ? , ? >
Any instance classified by h1 will also be classified by h2. We can say
that h2 is more general than h1. Using this concept, we can find a
general hypothesis that can be defined over the entire dataset X.
To find a single hypothesis defined on X, we can use the concept of being
more general than partial ordering. One way to do this is start with the
most specific hypothesis from H and generalize this hypothesis each
time it fails to classify and observe positive training data object as
positive.
General to Specific
1.The first step in the Find-S algorithm is to start with the most specific hypothesis,
which can be denoted by h <- <ⵁ, ⵁ, ⵁ, ⵁ>.
2.This step involves picking up next training sample and applying Step 3 on the
sample.
3.The next step involves observing the data sample. If the sample is negative, the
hypothesis remains unchanged and we pick the next training sample by processing
Step 2 again. Otherwise, we process Step 4.
4.If the sample is positive and we find that our initial hypothesis is too specific
because it does not cover the current training sample, then we need to update our
current hypothesis. This can be done by the pairwise conjunction
(logical and operation) of the current hypothesis and training sample.
5.If the next training sample is <true, true, false, false> and the current hypothesis
is <ⵁ, ⵁ, ⵁ, ⵁ>, then we can directly replace our existing hypothesis with the new
one.
General to Specific
If the next positive training sample is <true, true, false, true> and current
hypothesis is <true, true, false, false>, then we can perform a pairwise
conjunctive. With the current hypothesis and next training sample, we can find
a new hypothesis by putting ? in the place where the result of conjunction is
false:
<true, true, false, true> ⴷ <true, true, false, false> = <true, true, false, ?>
Now, we can replace our existing hypothesis with the new one: h <-<true,
true, false, ?>
5. This step involves repetition of Step 2 until we have more training samples.
6. Once there are no training samples, the current hypothesis is the one we
wanted to find. We can use the final hypothesis to classify the real objects.
Limitations of the Find-S
Algorithm
The Find-S algorithm for concept learning is one of the most basic algorithms
of machine learning, though it has some limitation and disadvantages like:
1.There's no way to determine if the only final hypothesis (found by Find-S) is
consistent with the data or there are more hypotheses that are consistent
with data.
2.Inconsistent sets of training examples can mislead the Find-S algorithm, as
it ignores negative data samples. An algorithm that can detect inconsistency
of training data is better.
3.A good concept learning algorithm should be able to backtrack the choice of
hypothesis found so that the resulting hypothesis can be improved over time.
Unfortunately, Find-S provides no such method.
More Terminology
Brute-Force Bayes Concept Learning
       therefore:
Evolution of posterior probabilities
Furthermore,
What we want:
Gradient ascent:
6. Minimum Description Length
Principle
Terminology used:
Minimum Description Length Principle:
Naïve Bayes Classifier
Review:Prior and Posterior Probabilities
      P(A) and P(B) are called prior probabilities            X          Y
      P(A|B), P(B|A) are called posterior probabilities
                                                               𝑥1         A
Example 8.6:Prior versus Posterior Probabilities               𝑥2         A
 This table shows that the event Y has two outcomes
   namely A and B, which is dependent on another event X       𝑥3         B
   with various outcomes like 𝑥1 , 𝑥2 and 𝑥3 .                 𝑥3         A
 Case1: Suppose, we don’t have any information of the
   event A. Then, from the given sample space, we can          𝑥2         B
                        5
   calculate P(Y = A) = 10 = 0.5                               𝑥1         A
                                                               𝑥1         B
 Case2:     Now, suppose, we want to calculate P(X =
               2
   𝑥2 |Y =A) = 5= 0.4 .                                        𝑥3         B
                                                               𝑥2         B
The later is the conditional or posterior probability, where
as the former is the prior probability.                        𝑥2         A
                                                                    116
Naïve Bayesian Classifier
  Suppose, Y is a class variable and X = 𝑋1, 𝑋2 , … . . , 𝑋𝑛 is a set of attributes,
   with instance of Y.
                                                                                   117
Naïve Bayesian Classifier
  Naïve Bayesian classifier calculate this posterior probability using Bayes’ theorem, which is
     as follows.
  From Bayes’ theorem on conditional probability, we have
                                               𝑃(𝑋|𝑌) ∙ 𝑃(𝑌)
                                      𝑃 𝑌𝑋 =
                                                    𝑃(𝑋)
                                            𝑃(𝑋|𝑌) ∙ 𝑃(𝑌)
                   =
                       𝑃 𝑋 𝑌 = 𝑦1   ∙ 𝑃 𝑌 = 𝑦1 + ⋯ + 𝑃 𝑋 𝑌 = 𝑦𝑘 ∙ 𝑃 𝑌 = 𝑦𝑘
     where,
                               𝑃 𝑋 = σ𝑘𝑖=1 𝑃(𝑋|𝑌 = 𝑦𝑖 ) ∙ 𝑃(Y = 𝑦𝑖 )
 Note:
 ▪ 𝑃 𝑋 is called the evidence (also the total probability) and it is a constant.
 ▪    The probability P(Y|X) (also called class conditional probability) is therefore
      proportional to P(X|Y)∙ 𝑃(𝑌).
                                                                                         118
Naïve Bayesian Classifier
  Suppose, for a given instance of X (say x = (𝑋1 = 𝑥1 ) and ….. (𝑋𝑛 = 𝑥𝑛 )).
  There are any two class conditional probabilities namely P(Y= 𝑦𝑖 |X=x) and
    P(Y= 𝑦𝑗 | X=x).
  If P(Y= 𝑦𝑖 | X=x) >P(Y= 𝑦𝑗 | X=x), then we say that 𝑦𝑖 is more stronger than
    𝑦𝑗 for the instance X = x.
                                                                                 119
Naïve Bayesian Classifier
 Example: With reference to the Air Traffic Dataset mentioned earlier, let us
  tabulate all the posterior and prior probabilities as shown below.
                                                  Class
            Attribute     On Time        Late             Very Late    Cancelled
              Weekday    9/14 = 0.64    ½ = 0.5            3/3 = 1      0/1 = 0
              Saturday   2/14 = 0.14    ½ = 0.5            0/3 = 0      1/1 = 1
   Day
                                                   Class
         Attribute      On Time          Late              Very Late     Cancelled
             None      5/14 = 0.36      0/2 = 0              0/3 = 0       0/1 = 0
  Fog
                                                                                       121
Naïve Bayesian Classifier
Instance:
Case3: Class = Very Late : 0.15 × 1.0 × 0.67 × 0.33 × 0.67 = 0.0222
                                                                           122
Naïve Bayesian Classifier
 Algorithm: Naïve Bayesian Classification
                                             123
Example. ‘Play Tennis’ data
 Day     Outlook    Temperature   Humidity   Wind      Play
                                                      Tennis
  = arg max P (h) P (Outlook = sunny | h) P (Temp = cool | h) P ( Humidity = high | h) P (Wind = strong | h)
    h[ yes , no ]
• Working:
    P ( PlayTennis = yes) = 9 / 14 = 0.64
          P ( PlayTennis = no) = 5 / 14 = 0.36
          P (Wind = strong | PlayTennis = yes) = 3 / 9 = 0.33
          P (Wind = strong | PlayTennis = no) = 3 / 5 = 0.60
          etc.
          P ( yes) P ( sunny | yes) P (cool | yes) P (high | yes) P ( strong | yes) = 0.0053
          P (no) P ( sunny | no) P (cool | no) P (high | no) P ( strong | no) = 0.0206
           answer : PlayTennis ( x) = no
Naïve Bayesian Classifier
Pros and Cons
 The Naïve Bayes’ approach is a very popular one, which often works well.
                                                                         126
Naïve Bayesian Classifier
Approach to overcome the limitations in Naïve Bayesian Classification
 Estimating the posterior probabilities for continuous attributes
        In real life situation, all attributes are not necessarily be categorical, In fact, there is a mix of
         both categorical and continuous attributes.
        In the following, we discuss the schemes to deal with continuous attributes in Bayesian
         classifier.
    1.     We can discretize each continuous attributes and then replace the continuous values
           with its corresponding discrete intervals.
    2.     We can assume a certain form of probability distribution for the continuous variable and
           estimate the parameters of the distribution using the training data. A Gaussian distribution is
           usually chosen to represent the posterior probabilities for continuous attributes. A general
           form of Gaussian distribution will look like
                                                                        2
                                                      1          x−μ
                                   P   x: μ, σ2   =         e−
                                                      2πσ         2σ2
               2
where, μ and σ denote mean and variance, respectively.
                                                                                                       127
Naïve Bayesian Classifier
    For each class Ci, the posterior probabilities for attribute Aj(it is the numeric
    attribute) can be calculated following Gaussian normal distribution as follows.
                                    1          aj − μij   2
               P Aj = aj|Ci =             e−
                                  2πσij          2σij2
    Here, the parameter μijcan be calculated based on the sample mean of attribute
    value of Aj for the training records that belong to the class Ci.
    Similarly, σij2 can be estimated from the calculation of variance of such training
    records.
                                                                                  128
  Naïve Bayesian Classifier
M-estimate of Conditional Probability
 The M-estimation is to deal with the potential problem of Naïve Bayesian Classifier
   when training data size is too poor.
      If the posterior probability for one of the attribute is zero, then the overall class-
       conditional probability for the class vanishes.
      In other words, if training data do not cover many of the attribute values, then we may
       not be able to classify some of the test records.
                                                                                       129
  M-estimate Approach
 M-estimate approach can be stated as follows
                                                 𝑛𝑐𝑖 + 𝑚𝑝
                                  P Aj = aj|Ci =
                                                  𝑛+𝑚
   where, n = total number of instances from class C𝑖
𝑛𝑐𝑖 = number of training examples from class C𝑖 that take the value Aj =aj
m = it is a parameter known as the equivalent sample size, and
p = is a user specified parameter.
Note:
If n = 0, that is, if there is no training set available, then 𝑃 ai|C𝑖 = p,
   so, this is a different value, in absence of sample value.
                                                                              130
A Practice Example
                           age    income studentcredit_rating
                                                       buys_computer
   Example 1:            <=30    high       no fair           no
                         <=30    high       no excellent      no
Class:                   31…40   high       no fair           yes
C1:buys_computer = ‘yes’
                         >40     medium no fair               yes
C2:buys_computer = ‘no’
                         >40     low       yes fair           yes
                         >40     low       yes excellent      no
Data instance
                         31…40   low       yes excellent yes
X = (age <=30,
                         <=30    medium no fair               no
Income = medium,
                         <=30    low       yes fair           yes
Student = yes
                         >40     medium yes fair              yes
Credit_rating = fair)
                         <=30    medium yes excellent yes
                         31…40   medium no excellent yes
                         31…40   high      yes fair           yes
                         >40     medium no excellent          no
                                                              131
A Practice Example
 P(Ci):      P(buys_computer = “yes”) = 9/14 = 0.643
             P(buys_computer = “no”) = 5/14= 0.357
   Example:
   P(Campfire=True|Storm=True,BusTourGroup=True)=0.4
Bayesian belief network
  Inference
      We wish to infer the probability distribution for some
       variable given observed values for (a subset of) the other
       variables
      Exact (and sometimes approximate) inference of
       probabilities for an arbitrary BN is NP-hard
      There are numerous methods for probabilistic inference in
       BN (for instance, Monte Carlo), which have been shown
       to be useful in many cases
Bayesian belief network
   Learning Bayesian Belief Networks
   Task: Devising effective algorithms for learning BBN
          from training data
    Focus of much current research interest
    For given network structure, gradient ascent can be
     used to learn the entries of conditional probability tables
    Learning the structure of BBN is much more difficult,
     although there are successful approaches for some
     particular problems
EM Algorithm
“A widely used approach to learning in the presence of unobserved
variables. The EM algorithm can be used even for variables whose value is
never directly observed, provided the general form of the probability
distribution governing these variables is known.
The EM algorithm has been used to train Bayesian belief networks as well
as radial basis function networks.
                                                                            139
EM Algorithm
Estimating Means of k Gaussians
The easiest way to introduce the EM algorithm is via an example. Consider a problem in
which the data D is a set of instances generated by a probability distribution that is a
mixture of k distinct Normal distributions. This problem setting is illustrated in Figure 6.4
for the case where k = 2 and where the instances are the points shown along the x axis.
Each instance is generated using a two-step process. First, one of the k Normal
distributions is selected at random. Second, a single random instance xi is generated
according to this selected distribution.
This process is repeated to generate a set of data points as shown in the figure.
EM Algorithm
We would like to find a maximum likelihood hypothesis for these means; that is, a
hypothesis h that maximizes p(D | h).
In this case:
EM Algorithm
Let X = {xl , . . . , xm} denote the observed data in a set of m independently drawn
instances, let Z = {zl , . . . , zm} denote the unobserved data in these same instances,
and let Y = X U Z denote the full data.
We use h to denote the current hypothesized values of the parameters ϴ, and h'
to denote the revised hypothesis that is estimated on each iteration of the EM
algorithm.
The EM algorithm searches for the maximum likelihood hypothesis h' by seeking
the h' that maximizes E[ln P(Y |h')].
The EM algorithm uses its current hypothesis h in place of the actual
parameters ϴ to estimate the distribution governing Y. Let us define a function
Q(h' |h) that gives E[ln P(Y |h')] as a function of h', under the
assumption that ϴ = h and given the observed portion X of the full data Y.
EM Algorithm
In its general form, the EM algorithm repeats the following two steps until convergence:
2023/11/1               146
Introduction
 What is a classication problem?
 How can it be thought as a prediction problem?
 Support Vector Machine (SVM) as classication technique,
    Received considerable attention
    SVM has its roots in Statistical learning theory,
    Shown promising results in many practical applications.
    For examples:
        Handwritten digit recognition,
        Text categorization,
        Regression
Introduction
 SVM works very well with
    High-dimensional data,
    Avoids the curse of dimensionality problem.
 Another unique aspect of this approach is that
    it represents the decision boundary using a subset of the
     training examples, known as the support vectors.
 Goal of the SVM
    To find the optimal separating hyperplane
        which maximizes the margin of training data.
Hyper plane
or
Decision boundary
Decided by Maximum Margin
Maximum Margin Hyperplanes
Maximum Margin
             denotes +1
             denotes -1
Support Vectors
are those
datapoints that the
margin pushes up
against
 2023/11/1                155
Rationale for Maximum Margin
Linear SVM: Separable Case
                  x – Vector (x1-x2)
                  w (or wT)– Normal
                    Vector to hyper
                    plane
                  b – Scale Value
2023/11/1   158
 Estimate the Margin
             denotes +1
              denotes -1         x
                                                        wx +b = 0
                                                                  X – Vector
        W
                                                                  W – Normal Vector
                                                                  b – Scale Value
Class 2
                               m
             Class 1
 2023/11/1              160
Learning a Linear SVM Model
2023/11/1   161
Learning a Linear SVM Model: Example
Linear SVM: Nonseparable Case
 When it is not possible to separate the training data
  linearly.
 SVM to construct a linear decision boundary even in
  situations,
    Where the classes are not linearly separable.
Non-Linear SVM’s
Non-linear transformation is to make a dataset higher-dimensional space (Mapping a
higher dimension). And it is also the fundamental of a non-linear system. The below
graph reveals a non-linear dataset and how it can not be used Linear kernel rather than
the Gaussian kernel.
In geometry, a hyperplane is a subspace whose dimension is one less than that of
its ambient space. If space is 3-dimensional then its hyperplanes are the 2-dimensional
planes, while if space is 2-dimensional, its hyperplanes are the 1-dimensional lines. This
notion can be used in any general space in which the concept of the dimension of a
subspace is defined.
x2
                            0                x
Extension to Non-linear Decision
Boundary
  So far, we have only considered large-margin classifier with
   a linear decision boundary
  How to generalize it to become nonlinear?
  Key idea: transform xi to a higher dimensional space to
     “make life easier”
        Input space: the space the point xi are located
        Feature space: the space of f(xi) after transformation
2023/11/1             176
Non-linear SVMs: Feature spaces
◼   General idea: the original input space can always be mapped
    to some higher-dimensional feature space where the training
    set is separable:
Φ: x → φ(x)
2023/11/1          178
Types of Kernels
 SVM algorithms use a set of mathematical functions
  that are defined as the kernel. The function of kernel
  is to take data as input and transform it into the
  required form. Different SVM algorithms use
  different types of kernel functions. These functions
  can be different types.
 For example linear, nonlinear, polynomial, radial basis
  function (RBF), and sigmoid.
 Introduce Kernel functions for sequence data,
 graphs, text, images, as well as vectors. The most
 used type of kernel function is RBF. Because it has
 localized and finite response along the entire x-axis.
 The kernel functions return the inner product
 between two points in a suitable feature space. Thus
 by defining a notion of similarity, with little
 computational cost even in very high-dimensional
 spaces.
Polynomial kernel
 K ( x a , x b ) = f ( x a ) .f ( x b )
                                                              f (xa )
Letting the          doing the scalar         f ( xb )
kernel do            product in the
the work             obvious way
An Example for f(.) and K(.,.)
  Suppose f(.) is given as follows
2023/11/1                   193
◼ Mercer’s theorem:
Every semi-positive definite symmetric function is a kernel
◼ Semi-positive definite symmetric functions correspond
  to a semi-positive definite symmetric Gram matrix:
2023/11/1              195
Example
  Suppose we have 5 one-dimensional data points
     x1=1, x2=2, x3=4, x4=5, x5=6, with 1, 2, 6 as class 1 and 4, 5 as
      class 2  y1=1, y2=1, y3=-1, y4=-1, y5=1
  We use the polynomial kernel of degree 2
     K(x,y) = (xy+1)2
     C is set to 100
  We first find ai (i=1, …, 5) by
2023/11/1           196
 Example
 By using a QP solver, we get
    a1=0, a2=2.5, a3=0, a4=7.333, a5=4.833
    Note that the constraints are indeed satisfied
    The support vectors are {x2=2, x4=5, x5=6}
 The discriminant function is
 2023/11/1             197
Characteristics of SVM
SVM has many desirable qualities that make it one of the most widely
  usedclassication algorithms. Following is a summary of the general
  characteristics of SVM: