Question Paper LH - MLT
LECTURE HANDOUTS L - 2
This is an example of a classification problem with two classes: low-risk and high-risk customers. The information about a customer makes up the input to the classifier, whose task is to assign the input to one of the two classes.
Classification Rule:
IF income > θ1 AND savings > θ2 THEN low-risk ELSE high-risk
Discriminant:
For suitable values of θ1 and θ2 (see figure 1.1), this rule fits the data. It is an example of a discriminant: a function that separates the examples of different classes.
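As a small illustration (not part of the original handout), the rule above can be coded directly as a discriminant function; the threshold values theta1 and theta2 below are made-up numbers:

```python
# Illustrative sketch: the IF-THEN rule above as a Python function.
theta1, theta2 = 30000, 10000   # hypothetical income and savings thresholds

def classify(income, savings):
    """Return 'low-risk' or 'high-risk' using the discriminant rule."""
    if income > theta1 and savings > theta2:
        return "low-risk"
    return "high-risk"

print(classify(income=45000, savings=15000))  # low-risk
print(classify(income=25000, savings=5000))   # high-risk
```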
Knowledge Extraction:
Learning a rule from data also allows knowledge extraction. The rule is a simple model that
explains the data, and looking at this model we have an explanation about the process
underlying the data.
LECTURE HANDOUTS L - 3
Introduction (Maximum 5 sentences): This is learning a class from its positive and negative examples. We then generalize to the case of multiple classes and to regression, where the outputs are continuous.
X = {x^t, r^t}, t = 1, ..., N
* Class C
(p1 ≤ price ≤ p2) AND (e1 ≤ engine power ≤ e2)
* Hypothesis class H
h(x) = 1 if h classifies x as a positive example
h(x) = 0 if h classifies x as a negative example
If c is the actual class and h is our induced hypothesis, a point where c is 1 but h is 0 is a false negative, and a point where c is 0 and h is 1 is a false positive. The other points, namely true positives and true negatives, are correctly classified.
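The four cases can be counted directly from the two label sequences; the following sketch uses assumed 0/1 labels for c and h:

```python
# Sketch: counting true/false positives/negatives for a hypothesis h against
# the actual class c, using hypothetical 0/1 label lists.
c = [1, 1, 0, 0, 1, 0]   # actual class labels (assumed example data)
h = [1, 0, 0, 1, 1, 0]   # labels predicted by the induced hypothesis

tp = sum(1 for ci, hi in zip(c, h) if ci == 1 and hi == 1)  # true positives
tn = sum(1 for ci, hi in zip(c, h) if ci == 0 and hi == 0)  # true negatives
fn = sum(1 for ci, hi in zip(c, h) if ci == 1 and hi == 0)  # false negatives
fp = sum(1 for ci, hi in zip(c, h) if ci == 0 and hi == 1)  # false positives
print(tp, tn, fn, fp)  # -> 2 2 1 1
```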
LECTURE HANDOUTS L - 4
* Regression
LECTURE HANDOUTS L - 5
* Triple Trade-Off
There is a trade-off between three factors (Dietterich, 2003):
1. the complexity of the hypothesis class H, c(H), that is, its capacity;
2. the amount of training data, N; and
3. the generalization error, E, on new examples.
* Cross-Validation
To estimate generalization error, we need data unseen during training. We split the data into two parts: a training set (e.g., 50%) used to fit candidate models and a validation set used to test their generalization ability.
* Test Set
If we need to report the error to give an idea about the expected error of our best model,
we should not use the validation error.
We have used the validation set to choose the best model, and it has effectively become a part of the training set.
We need a third set, a test set, sometimes also called the publication set, containing examples
not used in training or validation.
An analogy from our lives is when we are taking a course:
the example problems that the instructor solves in class while teaching a subject
form the training set;
exam questions are the validation set;
the problems we solve in our later, professional life are the
test set.
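A minimal sketch of such a split, assuming the 50/25/25 proportions are just one reasonable choice:

```python
# Sketch of the three-way split described above.
import random

data = list(range(100))          # stand-in for N labelled examples
random.seed(0)
random.shuffle(data)

n = len(data)
train = data[: n // 2]                    # 50% - fit candidate models
valid = data[n // 2 : 3 * n // 4]         # 25% - pick the best model
test  = data[3 * n // 4 :]                # 25% - report final (publication) error
print(len(train), len(valid), len(test))  # 50 25 25
```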
Video Content / Details of website for further learning (if any):
https://lecturenotes.in/notes/24274-note-for-machine-learning-ml-by-new-swaroop
https://www.youtube.com/watch?v=WpxK__SK2a0
LECTURE HANDOUTS L - 6
Bayes’ rule is used to calculate the probabilities of the classes. We generalize to discuss
how we can make rational decisions among multiple actions to minimize expected risk.
We also discuss learning association rules from data.
Bayes Theorem
P(h|D) = P(D|h) P(h) / P(D)
P (h) = prior probability of hypothesis h
P (D) = prior probability of training data D
P (h|D) = probability of h given D
P (D|h) = probability of D given h
Choosing Hypotheses
P(h|D) = P(D|h) P(h) / P(D)
Generally want the most probable hypothesis given the training data
Maximum a posteriori hypothesis hMAP :
hMAP = arg max_{h ∈ H} P(D|h) P(h)
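A small numerical sketch of the MAP computation, with made-up priors P(h) and likelihoods P(D|h) for three hypothetical hypotheses:

```python
# Sketch of the MAP computation over a tiny hypothesis space.
prior = {"h1": 0.5, "h2": 0.3, "h3": 0.2}          # P(h)
likelihood = {"h1": 0.02, "h2": 0.10, "h3": 0.05}  # P(D|h)

# P(D) is the normalizer, summed over all hypotheses.
p_D = sum(likelihood[h] * prior[h] for h in prior)
posterior = {h: likelihood[h] * prior[h] / p_D for h in prior}

h_map = max(posterior, key=posterior.get)  # argmax_h P(D|h) P(h)
print(posterior)                           # h2 has the largest posterior
print("MAP hypothesis:", h_map)
```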
LECTURE HANDOUTS L - 7
* Bayes rule:
The problem then is to calculate P(C|x). Using Bayes' rule, it can be written as
P(C|x) = P(C) p(x|C) / p(x)
* Prior Probability:
P(C = 1) is called the prior probability that C takes the value 1, which in our example corresponds to the probability that a customer is high-risk, regardless of the value of x.
It is called the prior probability because it is the knowledge we have as to the value of C
before looking at the observables x, satisfying
P(C = 0) + P(C =1) = 1
Class Likelihood:
P (x|C) is called the class likelihood and is the conditional probability that an event
belonging to C has the associated observation value x.
In our case, p(x1, x2|C = 1) is the probability that a high-risk customer has X1 = x1 and X2 = x2. It is what the data tells us regarding the class.
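The same calculation in code, using assumed numbers for the prior and the class likelihoods of the two classes:

```python
# Sketch of the posterior computation P(C|x) = P(C) p(x|C) / p(x) for the
# two-class (high-risk / low-risk) example; all numbers are assumed values.
p_C1 = 0.4                    # prior P(C = 1): customer is high-risk
p_C0 = 1.0 - p_C1             # priors must satisfy P(C=0) + P(C=1) = 1

p_x_given_C1 = 0.2            # class likelihood p(x | C = 1) for the observed x
p_x_given_C0 = 0.05           # class likelihood p(x | C = 0)

# Evidence p(x) via the law of total probability.
p_x = p_x_given_C1 * p_C1 + p_x_given_C0 * p_C0
p_C1_given_x = p_x_given_C1 * p_C1 / p_x
print(round(p_C1_given_x, 3))  # posterior probability of high-risk, ~0.727
```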
LECTURE HANDOUTS L - 8
R(αi|x) = Σ_{k=1}^{K} λik P(Ck|x)
where, for the 0/1 loss, λik = 0 if i = k and λik = 1 if i ≠ k,
and we choose the action with minimum risk:
choose αi if
R(αi|x) = min_k R(αk|x)
Reject:
In such a case, we define an additional action of reject or doubt, αK+1, with αi, i = 1, ..., K, being the usual actions of deciding on classes Ci, i = 1, ..., K (Duda, Hart, and Stork 2001).
A possible loss function is
λik = 0 if i = k
λik = λ if i = K + 1
λik = 1 otherwise
where 0 < λ < 1 is the loss incurred for choosing the (K + 1)st action of reject. Then the risk of reject is
R(αK+1|x) = Σ_{k=1}^{K} λ P(Ck|x) = λ
The optimal decision rule is to
choose Ci if
R(αi|x) < R(αk|x) for all k ≠ i and
R(αi|x) < R(αK+1|x)
reject if
R(αK+1|x) < R(αi |x), i = 1, . . . ,K
Given the loss function above, this simplifies to
choose Ci if
P(Ci |x) > P(Ck|x) for all k ≠ i and
P(Ci |x) > 1 – λ
reject otherwise .
This whole approach is meaningful if 0 < λ < 1:
If λ = 0, we always reject; a reject is as good as a correct classification.
If λ ≥ 1, we never reject; a reject is as costly as, or costlier than, an error.
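A sketch of the resulting decision rule with a reject option; the posteriors and the value of λ are assumed for illustration:

```python
# Risk-based classification with a reject option, following the loss above.
posteriors = [0.55, 0.30, 0.15]   # P(C_k | x) for K = 3 classes (assumed)
lam = 0.3                         # reject loss, 0 < lambda < 1 (assumed)

# With 0/1 loss, R(alpha_i | x) = 1 - P(C_i | x); the risk of reject is lambda.
risks = [1.0 - p for p in posteriors]
best = min(range(len(risks)), key=lambda i: risks[i])

if risks[best] < lam:             # equivalently P(C_i|x) > 1 - lambda
    print("choose class", best + 1)
else:
    print("reject")               # here 0.55 < 1 - 0.3, so we reject
```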
Video Content / Details of website for further learning (if any):
https://lecturenotes.in/notes/24274-note-for-machine-learning-ml-by-new-swaroop
https://www.youtube.com/watch?v=WpxK__SK2a0
LECTURE HANDOUTS L - 9
and the maximum discriminant function corresponds to minimum conditional risk. When we use
the 0/1 loss function, we have
gi(x)= P(Ci|x)
This divides the feature space into K decision regions R1, ..., RK, where Ri = {x | gi(x) = max_k gk(x)}. The regions are separated by decision boundaries, surfaces in feature space where ties occur among the largest discriminant functions (see figure).
When there are two classes, we can define a single discriminant g(x) = g1(x) − g2(x), and we
choose C1 if g(x) > 0, and C2 otherwise.
An example is a two-class learning problem where the positive examples can be taken as C1 and the negative examples as C2. When K = 2, the classification system is a dichotomizer, and for K ≥ 3 it is a polychotomizer.
An association rule is an implication of the form X → Y, where X is the antecedent and Y is the consequent of the rule. One example of association rules is basket analysis, where we want to find the dependency between two items X and Y. The typical application is in retail, where X and Y are items sold. In learning association rules, there are three measures that are frequently calculated: support, confidence, and lift.
Apriori algorithm:
1.To find frequent itemsets quickly (without complete enumeration of all subsets of items), the
Apriori algorithm uses the fact that for{X,Y,Z} to be frequent (have enough support), all its
subsets {X,Y}, {X,Z}, and{Y,Z} should be frequent as well—adding another item can never increase
support.
* That is, we only need to check for three-item sets all of whose two-item subsets are frequent; in other words, if a two-item set is known not to be frequent, all its supersets can be pruned and need not be checked.
* We start by finding the frequent one-item sets and at each step, inductively, from frequent k-
item sets, we generate candidate k+1-item sets and then do a pass over the data to check if they
have enough support.
* The Apriori algorithm stores the frequent itemsets in a hash table for easy access.
Note that the number of candidate itemsets will decrease very rapidly as k increases. If the
largest itemset has n items, we need a total of n+1 passes over the data.
2. Once we find the frequent k-item sets, we need to convert them to rules by splitting the k items
into two as antecedent and consequent.
* Just like we do for generating the itemsets, we start by putting a single consequent and k−1
items in the antecedent.
* Then, for all possible single consequents, we check if the rule has enough confidence and
remove it if it does not.
* Here, as in itemset generation, we use the fact that for a rule with two items in the consequent to have enough confidence, each of the two rules with a single consequent by itself should have enough confidence; that is, we go from one-consequent rules to two-consequent rules and need not check all possible two-term consequents.
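A minimal Apriori sketch for the frequent-itemset stage (rule generation is omitted); the transactions and support threshold are made-up examples:

```python
# Minimal Apriori sketch: find all frequent itemsets by growing candidates
# level by level and pruning supersets of infrequent itemsets.
from itertools import combinations

transactions = [
    {"milk", "bread"},
    {"milk", "diapers", "beer"},
    {"bread", "diapers", "beer"},
    {"milk", "bread", "diapers", "beer"},
    {"milk", "bread", "diapers"},
]
min_support = 0.6  # an itemset is "frequent" if it occurs in >= 60% of baskets

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

# Frequent 1-item sets.
items = sorted({i for t in transactions for i in t})
frequent = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
all_frequent = list(frequent)

k = 1
while frequent:
    # Candidate (k+1)-item sets: unions of frequent k-item sets whose
    # every k-subset is frequent (the Apriori pruning step).
    candidates = set()
    for a, b in combinations(frequent, 2):
        union = a | b
        if len(union) == k + 1 and all(
            frozenset(s) in frequent for s in combinations(union, k)
        ):
            candidates.add(union)
    frequent = [c for c in candidates if support(c) >= min_support]
    all_frequent.extend(frequent)
    k += 1

for s in all_frequent:
    print(set(s), round(support(s), 2))
```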
LECTURE HANDOUTS L - 10
Introduction (Maximum 5 sentences): A learning model that summarizes data with a set of parameters of fixed size (independent of the number of training examples) is called a parametric model. The algorithms involve two steps: we first assume a particular form for the density (or model), and then we estimate its parameters from the training data, for example by maximum likelihood.
This is the likelihood-based approach to classification where we use data to estimate the
densities separately, calculate posterior densities using Bayes’ rule, and then get the
discriminant.
In regression, we would like to write the numeric output r, called the dependent variable, as a function of the input x, called the independent variable, plus noise: r = f(x) + ε, where f(x) is the unknown function.
In linear regression, we have a linear model g(x | w1, w0) = w1 x + w0.
LECTURE HANDOUTS L - 11
In machine learning, model complexity often refers to the number of features or terms
included in a given predictive model, as well as whether the chosen model is linear, nonlinear,
and so on. It can also refer to the algorithmic learning complexity or
computational complexity.
When you increase the complexity of your model, it is more likely to overfit, meaning it will adapt to the training data very well but will not capture the general relationships in the data. In such a case, performance on a test set is going to be poor.
Model selection is the process of choosing between different machine learning approaches (e.g., SVM, logistic regression, etc.) or choosing between different hyperparameters or sets of features for the same machine learning approach (e.g., deciding between polynomial degrees/complexities for linear regression).
The choice of the actual machine learning algorithm (e.g. SVM or logistic regression) is less
important than you'd think - there may be a "best" algorithm for a particular problem, but often
its performance is not much better than other well-performing approaches for that problem.
Interpretable - can we see or understand why the model is making the decisions it makes?
Simple - easy to explain and understand
Accurate
Fast (to train and test)
Scalable (it can be applied to a large dataset)
Cross-Validation
In practice, the method we use to find the optimal complexity is cross-validation. We cannot calculate bias and variance for a model, but we can calculate the total error.
Regularization
Another approach that is used frequently is regularization. In this approach, we write an augmented error function
E' = error on data + λ · model complexity.
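One concrete instance of this augmented error is ridge regression, where model complexity is measured by the squared norm of the weights; the data and λ below are assumed:

```python
# Ridge-regression sketch of E' = data error + lambda * model complexity,
# with complexity taken as the squared weight norm ||w||^2.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                  # 50 examples, 3 inputs
w_true = np.array([1.0, -2.0, 0.5])
r = X @ w_true + rng.normal(scale=0.1, size=50)

lam = 0.1                                      # regularization strength
# Minimizing ||r - Xw||^2 + lam * ||w||^2 has the closed-form solution below.
w = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ r)
print(np.round(w, 3))                          # close to w_true, slightly shrunk
```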
LECTURE HANDOUTS L - 12
Introduction (Maximum 5 sentences): Multivariate data is data in which the analysis is based on more than two variables per observation. Usually multivariate data is used for explanatory purposes. In many applications, several measurements are made on each individual or event, generating an observation vector. The sample may be viewed as a data matrix.
Bivariate data supports slightly more complex analysis than univariate data; it is data in which the analysis is based on two variables per observation simultaneously.
The columns correspond to d variables (features or attributes) denoting the result of measurements made on an individual or event; the N rows correspond to independent and identically distributed observations.
mean vector
The mean vector μ is defined such that each of its elements is the mean of one column of X: μi = E[Xi].
The variance of Xi is denoted σi², and the covariance of two variables Xi and Xj is defined as σij ≡ Cov(Xi, Xj) = E[(Xi − μi)(Xj − μj)].
Covariance matrix
In vector-matrix notation, Σ ≡ Cov(X) = E[(X − μ)(X − μ)ᵀ].
* Correlation
The correlation between variables Xi and Xj is a statistic normalized between −1 and +1, defined as Corr(Xi, Xj) ≡ ρij = σij / (σi σj).
* Sample mean: m = (1/N) Σ_t x^t
* Sample covariance: S, with entries s_ij = (1/N) Σ_t (x_i^t − m_i)(x_j^t − m_j)
* Sample correlation: r_ij = s_ij / (s_i s_j)
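A short numpy sketch of these sample statistics on made-up data:

```python
# Sample mean vector, covariance matrix, and correlation matrix of a
# multivariate data matrix X (made-up data).
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))          # N = 100 observations, d = 3 variables

m = X.mean(axis=0)                     # sample mean vector (one mean per column)
S = np.cov(X, rowvar=False)            # sample covariance matrix (d x d)
R = np.corrcoef(X, rowvar=False)       # sample correlation matrix, entries in [-1, 1]
print(m.shape, S.shape, R.shape)       # (3,) (3, 3) (3, 3)
```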
LECTURE HANDOUTS L - 13
In imputation by regression, we try to predict the value of a missing variable from other variables
whose values are known for that case.
LECTURE HANDOUTS L - 14
Course Name with Code :Machine Learning Techniques -19CSC29
Course Teacher :R.Kavishree
Unit : II Date of Lecture:
Given a training sample for K ≥ 2 classes, X = {x^t, r^t}, where r^t_i = 1 if x^t ∈ Ci and 0 otherwise, estimates for the means and covariances are found using maximum likelihood separately for each class.
These are then plugged into the discriminant function to get the estimates for the discriminants.
Ignoring the first constant term, we have
which defines a quadratic discriminant (see figure 5.3) that can also be written as
Where
Another possibility is to pool the data and estimate a common covariance matrix for all classes:
linear discriminant
the quadratic term xᵀS⁻¹x cancels since it is common to all discriminants, and the decision boundaries are linear, leading to a linear discriminant that can be written as
Where
Euclidean distance
Simplifying even further, if we assume all variances to be equal, the Mahalanobis distance reduces to
Euclidean distance
All classes have equal, diagonal covariance matrices, but variances are not equal.
Multivariate Regression
The goal in any data analysis is to extract accurate estimates from raw information. Multivariate regression is one of the simplest machine learning algorithms. It belongs to the class of supervised learning algorithms, that is, we are provided with a labeled training dataset.
In multivariate linear regression, the numeric output r is assumed to be written as a linear
function, that is, a weighted sum, of several input variables, x1,...,xd, and noise. Actually in
statistical literature, this is called multiple regression; statisticians use the term multivariate
when there are multiple outputs. The multivariate linear model is
r = w0 + w1x1 + w2x2 + ··· + wdxd + ε
As in the univariate case, we assume the noise ε to be normal with mean 0 and constant variance, and maximizing the likelihood is equivalent to minimizing the sum of squared errors.
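A minimal least-squares sketch of multivariate linear regression on simulated data (the bias term w0 is handled by augmenting the inputs with a constant 1):

```python
# Multivariate linear regression fitted by least squares, which is the
# maximum-likelihood solution under Gaussian noise; data are simulated.
import numpy as np

rng = np.random.default_rng(0)
N, d = 200, 3
X = rng.normal(size=(N, d))
w_true, w0_true = np.array([2.0, -1.0, 0.5]), 4.0
r = w0_true + X @ w_true + rng.normal(scale=0.2, size=N)

# Augment inputs with a column of ones so w0 is estimated together with w.
X1 = np.column_stack([np.ones(N), X])
w_hat, *_ = np.linalg.lstsq(X1, r, rcond=None)
print(np.round(w_hat, 2))   # approximately [4.0, 2.0, -1.0, 0.5]
```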
LECTURE HANDOUTS L - 15
Clustering is one of the most common exploratory data analysis techniques, used to get an intuition about the structure of the data.
It can be defined as the task of identifying subgroups in the data such that data points in the same
subgroup (cluster) are very similar while data points in different clusters are very different.
Clustering is used in market segmentation, where we try to find customers that are similar to each other in terms of behaviors or attributes; in image segmentation/compression, where we try to group similar regions together; in document clustering based on topics; and so on.
K-means Algorithm
K-means clustering is a very popular unsupervised learning algorithm.
The K-means algorithm is an iterative algorithm that tries to partition the dataset into K pre-defined, distinct, non-overlapping subgroups (clusters) where each data point belongs to only one group.
K-Means Clustering is an algorithm that, given a dataset, will identify which data points belong
to each one of the k clusters. It takes your data and learns how it can be grouped.
Through a series of iterations, the algorithm creates groups of data points — referred to as
clusters — that have similar variance and that minimize a specific cost function: the within-
cluster sum of squares.
K-MEANS CLUSTERING ALGORITHM
Applications
k-means algorithm is very popular and used in a variety of applications such as market segmentation,
document clustering, image segmentation and image compression, etc. The goal usually when we
undergo a cluster analysis is either:
Get a meaningful intuition of the structure of the data we’re dealing with.
Cluster-then-predict, where different models will be built for different subgroups if we believe there is a wide variation in the behaviors of different subgroups. An example of that is clustering patients into different subgroups and building a model for each subgroup to predict the probability of the risk of having a heart attack.
LECTURE HANDOUTS L - 16
Introduction: ( Maximum 5 sentences): In data mining and statistics, hierarchical clustering (also
called hierarchical cluster analysis or HCA) is a method of cluster analysis which seeks to build
a hierarchy of clusters.
Agglomerative: This is a "bottom-up" approach: each observation starts in its own cluster, and
pairs of clusters are merged as one moves up the hierarchy.
Divisive: This is a "top-down" approach: all observations start in one cluster, and splits are
performed recursively as one moves down the hierarchy.
In general, the merges and splits are determined in a greedy manner. The results of hierarchical
clustering are usually presented in a dendrogram.
At each iteration of an agglomerative algorithm, we choose the two closest groups to merge.
single-link clustering
In single-link clustering, this distance is defined as the smallest distance between all possible pairs of elements of the two groups:
complete-link clustering
In complete-link clustering, the distance between two groups is taken as the largest distance between
all possible pairs:
LECTURE HANDOUTS L - 17
Introduction: ( Maximum 5 sentences): In data mining and statistics, hierarchical clustering (also
called hierarchical cluster analysis or HCA) is a method of cluster analysis which seeks to build
a hierarchy of clusters.
Agglomerative: This is a "bottom-up" approach: each observation starts in its own cluster, and
pairs of clusters are merged as one moves up the hierarchy.
Agglomerative Hierarchical Clustering
Clustering starts by computing a distance between every pair of units that you want to cluster.
A distance matrix will be symmetric (because the distance between x and y is the same as the
distance between y and x) and will have zeroes on the diagonal (because every item is distance
zero from itself). The table below is an example of a distance matrix.
The smallest distance is between three and five, so they get linked up or merged first into the cluster '35'.
To obtain the new distance matrix, we need to remove the 3 and 5 entries and replace them by an entry "35". Since we are using complete-linkage clustering, the distance between "35" and every other item is the maximum of the distance between this item and 3 and this item and 5. For example, d(1,3) = 3 and d(1,5) = 11, so D(1,"35") = 11. This gives us the new distance matrix. The items with the smallest distance get clustered next; this will be 2 and 4.
Continuing in this way, after 6 steps, everything is clustered. This is summarized below. On
this plot, the y-axis shows the distance between the objects at the time they were clustered.
This is called the cluster height. Different visualizations use different measures of cluster
height.
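A plain-Python sketch of this complete-link merging procedure; the distance matrix below is assumed (it is consistent with the distances quoted above, e.g. d(1,3) = 3 and d(1,5) = 11), not taken from the handout's table:

```python
# Complete-link agglomerative clustering on a small symmetric distance matrix.
D = {
    ("1", "2"): 9, ("1", "3"): 3, ("1", "4"): 6, ("1", "5"): 11,
    ("2", "3"): 7, ("2", "4"): 5, ("2", "5"): 10,
    ("3", "4"): 9, ("3", "5"): 2,
    ("4", "5"): 8,
}

def dist(a, b):
    return D[(a, b)] if (a, b) in D else D[(b, a)]

clusters = [("1",), ("2",), ("3",), ("4",), ("5",)]
while len(clusters) > 1:
    # Complete linkage: cluster distance = largest pairwise item distance.
    i, j = min(
        ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
        key=lambda p: max(dist(a, b) for a in clusters[p[0]] for b in clusters[p[1]]),
    )
    height = max(dist(a, b) for a in clusters[i] for b in clusters[j])
    merged = clusters[i] + clusters[j]
    print("merge", clusters[i], "+", clusters[j], "at height", height)
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
```

With this matrix, 3 and 5 merge first (height 2), then 2 and 4 (height 5), matching the narrative above.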
LECTURE HANDOUTS L - 18
Introduction (Maximum 5 sentences): Clustering is one of the most common exploratory data analysis techniques, used to get an intuition about the structure of the data. It can be defined as the task of identifying subgroups in the data such that data points in the same subgroup (cluster) are very similar while data points in different clusters are very different.
K Means Clustering
K-means clustering is a very popular unsupervised learning algorithm.
The K-means algorithm is an iterative algorithm that tries to partition the dataset into K pre-defined, distinct, non-overlapping subgroups (clusters) where each data point belongs to only one group.
Step 1: Visualize n data points and decide the number of clusters (k). Choose k random points
on the graph as the centroids of each cluster. For this example, we would like to divide the data
into 4 clusters, so we pick 4 random centroids.
Step 2: Calculate the Euclidean distance between each data point and the chosen clusters' centroids. A point is considered to be in a particular cluster if it is closer to that cluster's centroid than to any other centroid.
Step 3: After assigning all observations to the clusters, calculate the clustering score, by
summing up all the Euclidean distances between each data point and the corresponding
centroid.
Where:
k: the number of clusters
n: the number of points belonging to cluster j
cj: the centroid of cluster j
Step 4: Define the new centroid of each cluster by calculating the mean of all points assigned to
that cluster. Here’s the formula (n is the number of points assigned to that cluster)
Step 5: Repeat from step 2 until the positions of the centroids no longer move and the assignments stay the same.
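A minimal numpy sketch of these steps; the 2-D data, k = 4, and the iteration cap are assumed:

```python
# Minimal k-means sketch: random initial centroids, assign, recompute, repeat.
import numpy as np

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 2))
                  for c in [(0, 0), (5, 5), (0, 5), (5, 0)]])
k = 4

centroids = data[rng.choice(len(data), size=k, replace=False)]       # Step 1
for _ in range(100):
    # Step 2: assign each point to the nearest centroid (Euclidean distance).
    d = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    # Step 4: recompute each centroid as the mean of its assigned points
    # (an empty cluster simply keeps its old centroid in this sketch).
    new_centroids = np.array([
        data[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
        for j in range(k)
    ])
    if np.allclose(new_centroids, centroids):                         # Step 5
        break
    centroids = new_centroids

# Step 3: clustering score = sum of distances to the assigned centroids.
score = sum(np.linalg.norm(data[labels == j] - centroids[j], axis=1).sum()
            for j in range(k))
print(np.round(centroids, 2), round(score, 2))
```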
Neural networks are biologically inspired algorithms that attempt to mimic the functions of neurons in the brain. Each neuron acts as a computational unit, accepting input from the dendrites and outputting signal through the axon terminals.
dendrites and outputting signal through the axon terminals. Actions are triggered when a
specific combination of neurons are activated.
In logistic regression, we composed a linear model z(x) with the logistic function g(z) to form our predictor. This linear model was a combination of feature inputs xi and weights wi:
z(x) = w1x1 + w2x2 + w3x3 + w4x4 = wᵀx + b
Video Content / Details of website for further learning (if any):
https://youtu.be/EYeF2e2IKEo
https://www.codementor.io/@james_aka_yale/a-gentle-introduction-to-neural-networks-for-machine-
learning-hkijvz7lp
An Artificial Neural Network consists of a large number of "neuron"-like processing elements. All these processing elements have a large number of weighted connections between them. The connections between the elements provide a distributed representation of data.
Types of Neural Networks
Introduction (Maximum 5 sentences): A single neuron transforms a given input into some output. Depending on the given input and the weights assigned to each input, it is decided whether the neuron fires or not. Let's assume the neuron has 3 input connections and one output.
each wi is a real-valued constant, or weight, that determines the contribution of input xi to the
perceptron output.
Perceptron – Learning
• Learning a perceptron involves choosing values for weights w0 , …,wn .
• The space H of candidate hypotheses considered in perceptron learning is the set of all possible
real-valued weight vectors
• One way to learn an acceptable weight vector is to begin with random weights, then iteratively apply the perceptron to each training example, modifying the perceptron weights whenever it misclassifies an example.
– If the perceptron classifies the training example correctly, the weights are not updated.
• This process is repeated, iterating through the training examples as many times as needed until
the perceptron classifies all training examples correctly.
– Each pass through all of the training examples is called one epoch
• Weights are modified at each step according to perceptron training rule
wi ← wi + Δwi
Δwi = η (t − o) xi
where t is the target value, o is the perceptron output, and η is a small positive constant called the learning rate.
• If the output is correct (t = o), the weights wi are not changed.
• If the output is incorrect (t ≠ o), the weights wi are changed such that the output of the perceptron for the new weights is closer to t.
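A small sketch of the perceptron training rule learning the AND function; the learning rate η and the number of epochs are assumed:

```python
# Perceptron training rule w_i <- w_i + eta * (t - o) * x_i on the AND function.
examples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]  # (x, target)
w = [0.0, 0.0]     # input weights
b = 0.0            # bias (threshold) weight w0
eta = 0.1          # learning rate

for epoch in range(20):
    for (x1, x2), t in examples:
        o = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0   # perceptron output
        w[0] += eta * (t - o) * x1                       # update only when t != o
        w[1] += eta * (t - o) * x2
        b += eta * (t - o)

print(w, b)
print([1 if w[0]*x1 + w[1]*x2 + b > 0 else 0 for (x1, x2), _ in examples])  # [0, 0, 0, 1]
```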
Gradient Descent and the Delta Rule:
Gradient descent is used in supervised machine learning, where the dataset has true labels. The model predicts labels during training, and its error for each sample is the difference between the true label and the predicted label.
The perceptron rule finds a successful weight vector when the training examples are linearly separable, but it can fail to converge if the examples are not linearly separable.
The delta rule overcomes this difficulty.
If the training examples are not linearly separable, the delta rule converges toward a best-fit
approximation to the target concept.
The key idea behind the delta rule is to use gradient descent to search the hypothesis space of
possible weight vectors to find the weights that best fit the training examples.
The delta rule is important because gradient descent provides the basis for the
Backpropagation Algorithm, which can learn networks with many interconnected units.
Algorithm:
Create a feed-forward network with ni inputs, nhidden hidden units, and nout output units.
Initialize each wi to some small random value (e.g., between -.05 and .05).
Until the termination condition is met, Do
– For each training example, Do
// Propagate the input forward through the network:
1. Input the instance (x1 ,…,xn ) to the network and compute the network outputs ok for
every unit
// Propagate the errors backward through the network:
2. For each output unit k, calculate its error term δk:
δk ← ok (1 − ok) (tk − ok)
Here outputs is the set of output units in the network, tk is the target value of unit k for training
example d, and ok is the output of unit k given training example d. The derivation of the
stochastic gradient descent rule is conceptually straightforward, but requires keeping track of a
number of subscripts and variables.
xji = the ith input to unit j
wji = the weight associated with the ith input to unit j
netj = Σi wji xji (the weighted sum of inputs for unit j)
oj = the output computed by unit j
tj = the target output for unit j
σ = the sigmoid function
outputs = the set of units in the final layer of the network
Downstream(j) = the set of units whose immediate inputs include the output of unit j
Introduction: ( Maximum 5 sentences) : gradient descent can be performed for any function E that
is differentiable with respect to the parameterized hypothesis space.
Recurrent Networks
So far we have considered only network topologies that correspond to acyclic directed graphs. Recurrent networks are artificial neural networks that apply to time-series data and that use outputs of network units at time t as the input to other units at time t + 1. In this way, they support a form of directed
cycles in the network. To illustrate, consider the time series prediction task of predicting the
next day's stock market average y(t + 1) based on the current day's economic indicators x(t).
Given a time series of such data, one obvious approach is to train a feedforward network to
predict y(t + 1) as its output, based on the input values x(t) .
Dynamically Modifying Network Structure
A variety of methods have been proposed to dynamically grow or shrink the number of
network units and interconnections in an attempt to improve generalization accuracy
and training efficiency.
A second idea for dynamically altering network structure is to take the opposite
approach. Instead of beginning with the simplest possible network and adding
complexity, we begin with a complex network and prune it as we find that certain
connections are inessential.
In general, techniques for dynamically modifying network structure have met with
mixed success. It remains to be seen whether they can reliably improve on the
generalization accuracy of backpropagation. However, they have been shown in some
cases to provide significant improvements in training times.
Introduction: ( Maximum 5 sentences) : gradient descent can be performed for any function E that
is differentiable with respect to the parameterized hypothesis space.
Adding a penalty term for weight magnitude. As discussed above, we can add a term to E
that increases with the magnitude of the weight vector. This causes the gradient descent
search to seek weight vectors with small magnitudes, thereby reducing the risk of
overfitting
Adding a term for errors in the slope, or derivative of the target function. In some cases,
training information may be available regarding desired derivatives of the target function, as
well as desired values
Minimizing the cross entropy of the network with respect to the target values. Consider learning a probabilistic function, such as predicting whether a loan applicant will pay back a loan based on attributes such as the applicant's age and bank balance. In such cases we minimize the cross entropy
−Σd [ td log od + (1 − td) log(1 − od) ]
where od is the probability estimate output by the network for training example d, and td is the 1 or 0 target value for training example d. The textbook discusses when and why the most probable network hypothesis is the one that minimizes this cross entropy and derives the corresponding gradient descent weight-tuning rule for sigmoid units; it also describes other conditions under which the most probable hypothesis is the one that minimizes the sum of squared errors.
Introduction: ( Maximum 5 sentences) : gradient descent can be performed for any function E that
is differentiable with respect to the parameterized hypothesis space.
Gradient descent is one of the most general search methods for finding a hypothesis to minimize the error function, but it is not always the most efficient. It is not uncommon for backpropagation to require tens of thousands of iterations through the weight-update loop when training complex networks.
One optimization method, known as line search, involves a different approach to choosing
the distance for the weight update. In particular, once a line is chosen that specifies the
direction of the update, the update distance is chosen by finding the minimum of the error
function along this line.
A second method, that builds on the idea of line search, is called the conjugate gradient
method. Here, a sequence of line searches is performed to search for a minimum in the error
surface.
The most basic instance-based method is the k-nearest neighbor algorithm. The nearest
neighbors of an instance are defined in terms of the standard Euclidean distance.
The K-NN algorithm assumes similarity between the new case and the available cases and puts the new case into the category that is most similar to the available categories.
K-NN stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can be easily classified into a suitable category using the K-NN algorithm.
where ar(x) denotes the value of the rth attribute of instance x. Then the distance between two instances xi and xj is defined to be
d(xi, xj) = sqrt( Σ_{r=1}^{n} (ar(xi) − ar(xj))² )
Example
Step-1: Select the number K of the neighbors
Step-2: Calculate the Euclidean distance of K number of neighbors
Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
Step-4: Among these k neighbors, count the number of the data points in each category.
Step-5: Assign the new data points to that category for which the number of the neighbor is
maximum.
Step-6: Our model is ready.
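A minimal sketch of these steps on made-up 2-D points (k and the data are illustrative):

```python
# k-NN classification: compute distances, take the k nearest, majority vote.
from collections import Counter
import math

train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((3.0, 3.2), "B"),
         ((3.1, 2.9), "B"), ((2.9, 3.0), "B")]
query = (2.8, 3.1)
k = 3

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Steps 2-3: compute distances and take the k nearest neighbours.
neighbours = sorted(train, key=lambda xy: euclidean(xy[0], query))[:k]
# Steps 4-5: count the labels among the neighbours and pick the majority.
votes = Counter(label for _, label in neighbours)
print(votes.most_common(1)[0][0])   # -> "B"
```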
For every training example, the polyhedron indicates the set of query points whose
classification will be completely determined by that training example.
Query points outside the polyhedron are closer to some other training example.
Advantages:
It is simple to implement.
It is robust to noisy training data.
It can be more effective if the training data is large.
Disadvantages:
We always need to determine the value of K, which may be complex at times.
The computation cost is high because of calculating the distance between the new data point and all the training samples.
Remarks on KNN
The distance between instances is calculated based on all attributes of the instance. This lies in contrast to methods such as rule and decision tree learning systems that select only a subset of the instance attributes when forming the hypothesis. To see the effect of this policy, consider applying k-nearest neighbor to a problem in which each instance is described by 20 attributes, but where only 2 of these attributes are relevant to determining the classification for the particular target function.
In this case, instances that have identical values for the 2 relevant attributes may nevertheless be distant from one another in the 20-dimensional instance space. As a result, the similarity metric used by k-nearest neighbor, which depends on all 20 attributes, will be misleading.
The distance between neighbors will be dominated by the large number of irrelevant
attributes. This difficulty, which arises when many irrelevant attributes are present, is
sometimes referred to as the curse of dimensionality.
Nearest-neighbor approaches are especially sensitive to this problem. One interesting approach
to overcoming this problem is to weight each attribute differently when calculating the
distance between two instances
One additional practical issue in applying k-nearest neighbor is efficient memory indexing.
Because this algorithm delays all processing until a new query is received, significant
computation can be required to process each new query.
Where
We can distance-weight the instances for real-valued target functions in a similar fashion,
replacing the final line of the algorithm in this case by
The only disadvantage of considering all examples is that our classifier will run more
slowly.
If all training examples are considered when classifying a new query instance, we call the
algorithm a global method.
If only the nearest training examples are considered, we call it a local method.
Video Content / Details of website for further learning (if any):
https://youtu.be/JCUtq5T0DOY
https://www.geeksforgeeks.org/weighted-k-nn/
Therefore, we derive methods that choose weights to minimize the squared error. Possible criteria are:
1. Minimize the squared error over just the k nearest neighbors of the query point xq.
2. Minimize the squared error over the entire set D of training examples, while weighting
the error of each training example by some decreasing function K of its distance from
xq:
3. Combine 1 and 2:
If we choose criterion three above and rederive the gradient descent rule, we obtain the following
training rule
The literature on locally weighted regression contains a broad range of alternative methods for
distance weighting the training examples
In most cases, the target function is approximated by a constant, linear, or quadratic function. More complex functional forms are not often used because (1) the cost of fitting more complex functions for each query instance is prohibitively high, and (2) these simple approximations model the target function quite well over a sufficiently small subregion of the instance space.
Introduction (Maximum 5 sentences): CBR has been applied to problems such as conceptual design of mechanical devices based on a stored library of previous designs, reasoning about new legal cases based on previous rulings, and solving planning and scheduling problems by reusing and combining portions of previous solutions to similar problems.
Introduction (Maximum 5 sentences): Lazy methods may consider the query instance x when deciding how to generalize beyond the training data D. Eager methods cannot: by the time they observe the query instance x, they have already chosen their (global) approximation to the target function.
To summarize lazy methods have the option of selecting a different hypothesis or local
approximation to the target function for each query instance.
Eager methods using the same hypothesis space are more restricted because they must commit
to a single hypothesis that covers the entire instance space.
Deep learning is a type of machine learning (ML) and artificial intelligence (AI) that imitates
the way humans gain certain types of knowledge.
Deep learning is an important element of data science, which includes statistics and predictive
modeling.
It is extremely beneficial to data scientists who are tasked with collecting, analyzing and
interpreting large amounts of data;
deep learning makes this process faster and easier.
At its simplest, deep learning can be thought of as a way to automate predictive analytics.
While traditional machine learning algorithms are linear, deep learning algorithms are stacked
in a hierarchy of increasing complexity and abstraction.
Deep learning vs. machine learning
Deep learning is a subset of machine learning that differentiates itself through the way it solves
problems. Machine learning requires a domain expert to identify most applied features. On the
other hand, deep learning learns features incrementally, thus eliminating the need for domain
expertise. This makes deep learning algorithms take much longer to train than machine learning
algorithms, which only need a few seconds to a few hours. However, the reverse is true during
testing. Deep learning algorithms take much less time to run tests than machine learning
algorithms, whose test time increases along with the size of the data.
Examples of Deep Learning
Deep learning research is a driving force behind many of the technologies we use every day --
from the voice control functions found in our smart devices to self-driving cars.
Both Netflix and Amazon use deep learning algorithms to suggest products and shows to watch, and intelligent virtual assistants (Alexa, Bixby, Cortana, Google Assistant, and Siri) use deep learning to understand speech and the language humans use when they interact with them.
Other examples of deep learning include colorization of black-and-white images, autonomous vehicles, translators, facial recognition, classification, and medical disease diagnoses.
Graphical Model
Graphical models, also called Bayesian networks, belief networks, or probabilistic
networks, are composed of nodes and arcs between the nodes.
Each node corresponds to a random variable, X, and has a value corresponding to the
probability of the random variable, P(X).
If there is a directed arc from node X to node Y, this indicates that X has a direct influence on
Y. This influence is specified by the conditional probability P(Y|X).
The network is a directed acyclic graph (DAG); namely, there are no cycles.
The nodes and the arcs between the nodes define the structure of the network, and the
conditional probabilities are the parameters given the structure.
Independence
Conditional Independence
We calculate the individual (marginal) probability of wet grass by summing over the possible values of rain: P(W) = P(W|R)P(R) + P(W|∼R)P(∼R).
Knowing that the grass is wet, the probability that it rained can be calculated using Bayes' rule: P(R|W) = P(W|R)P(R) / P(W).
Knowing that the grass is wet increased the probability of rain from 0.4 to 0.75; this is because P(W|R) is high and P(W|∼R) is low.
Ethem Alpaydin, “Introduction to Machine Learning”, Second Edition, MITPress,2013,Page no: 387-389
Three events may be connected serially. We see here that X and Z are independent given Y: knowing Y tells Z everything, and we write
P(Z|Y, X) = P(Z|Y).
We say that Y blocks the path from X to Z, or in other words, it separates them in the sense that if Y is removed, there is no path between X and Z. In this case, the joint is written as
P(X, Y, Z) = P(X) P(Y|X) P(Z|Y)
Writing the joint this way implies independence:
Case 2: Tail-to-tail Connection
Explaining away: Knowing that it has rained decreases the probability that the sprinkler is on.
We can calculate P(C|W)and have a diagnostic inference:
As we have seen earlier, inference is also easier as the joint density is broken down into conditional
densities of smaller groups of variables:
Ethem Alpaydin, “Introduction to Machine Learning”, Second Edition, MITPress,2013,Page no: 389-396
If the inputs are independent, we have the graph shown, which is called the naive Bayes' classifier, because it ignores possible dependencies, namely correlations, among the inputs and reduces a multivariate problem to a group of univariate problems:
Given C, xj are independent:
p(x|C) = p(x1|C) p(x2|C) ... p(xd|C)
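A tiny sketch of this factorization for two classes and two binary inputs; all probabilities are assumed values:

```python
# Naive Bayes sketch: p(x|C) = prod_j p(x_j|C), inputs assumed independent given C.
priors = {"C1": 0.6, "C2": 0.4}
# p(x_j = 1 | C), one entry per input dimension.
likelihood_1 = {"C1": [0.8, 0.3], "C2": [0.2, 0.7]}

def posterior(x):
    scores = {}
    for c in priors:
        p = priors[c]
        for j, xj in enumerate(x):
            pj = likelihood_1[c][j]
            p *= pj if xj == 1 else (1 - pj)   # p(x_j | C) for binary x_j
        scores[c] = p
    z = sum(scores.values())                    # normalize to get P(C | x)
    return {c: s / z for c, s in scores.items()}

print(posterior([1, 0]))   # C1 is much more probable for this input
```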
Ethem Alpaydin, “Introduction to Machine Learning”, Second Edition, MITPress,2013,Page no: 396-402
Introduction (Maximum 5 sentences): A learning algorithm dictates a certain model that comes with a set of assumptions. This inductive bias leads to error if the assumptions do not hold for the data. Learning is an ill-posed problem, and with finite data each algorithm converges to a different solution and fails under different circumstances.
Combination Methods
Static Structures
Ensemble averaging (Sum,Product,Min rule)
Bagging
Boosting
Error Correcting Output Codes
Dynamic structures
Mixture of Experts
Hierarchical Mixture of Experts
Ethem Alpaydin, “Introduction to Machine Learning”, Second Edition, MITPress,2013,Page no: 419-420.
Hard Voting
In hard voting, we predict the final class label as the class label that has been predicted
most frequently by the classification models.
classifier 1 -> class 0
classifier 2 -> class 0
classifier 3 -> class 1
By majority vote, we would classify the sample as class 0.
Soft Voting
In soft voting, we predict the class labels by averaging the class probabilities (only recommended if the classifiers are well calibrated). Assigning unequal weights such as {0.1, 0.1, 0.8} would yield a prediction dominated by the third classifier.
Bagging
Bagging is a voting method whereby base-learners are made different by training them over
slightly different training sets.
Bootstrap Aggregation (or Bagging for short), is a simple and very powerful ensemble method.
Bagging is the application of the Bootstrap procedure to a high-variance machine learning
algorithm, typically decision trees.
Bagging is used when the goal is to reduce the variance of a decision tree classifier.
Bagging Steps:
Suppose there are N observations and M features in training data set. A sample from training
data set is taken randomly with replacement.
A subset of M features are selected randomly and whichever feature gives the best split is used
to split the node iteratively.
The tree is grown to its largest possible size.
Above steps are repeated n times and prediction is given based on the aggregation of
predictions from n number of trees.
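A plain-numpy sketch of these bagging steps using simple threshold "stumps" as base learners (the data, the number of learners, and the stump learner itself are assumptions made for illustration):

```python
# Bagging sketch: train stumps on bootstrap samples, aggregate by majority vote.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = (x > 5).astype(int)
y[rng.choice(200, size=20, replace=False)] ^= 1      # add some label noise

def fit_stump(xs, ys):
    """Pick the threshold (over a coarse grid) with the fewest training errors."""
    thresholds = np.linspace(0, 10, 51)
    errs = [np.mean((xs > t).astype(int) != ys) for t in thresholds]
    return thresholds[int(np.argmin(errs))]

stumps = []
for _ in range(25):                                   # 25 base learners
    idx = rng.choice(len(x), size=len(x), replace=True)   # bootstrap sample
    stumps.append(fit_stump(x[idx], y[idx]))

def predict(x_new):
    votes = np.array([(x_new > t).astype(int) for t in stumps])
    return (votes.mean(axis=0) > 0.5).astype(int)     # majority vote

print(stumps[:5])                       # the individual thresholds vary
print(predict(np.array([2.0, 7.5])))    # -> [0 1]
```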
Advantages:
Reduces the variance (overfitting) of the model.
Disadvantages:
Since the final prediction is based on the mean of the predictions from the subset trees, it won't give precise values for classification and regression models.
Ethem Alpaydin, “Introduction to Machine Learning”, Second Edition, MITPress,2013,Page no: 424-430.
Boosting
Example: AdaBoost
Boosting Steps:
Draw a random subset of training samples d1 without replacement from the training set
D to train a weak learner C1
Draw a second random training subset d2 without replacement from the training set and add 50 percent of the samples that were previously misclassified, to train a weak learner C2
Find the training samples d3 in the training set D on which C1 and C2 disagree to train a
third weak learner C3
Combine all the weak learners via majority voting.
Advantages:
Supports different loss functions.
Disadvantages:
Prone to over-fitting.
Requires careful tuning of different hyper-parameters.
Ethem Alpaydin, “Introduction to Machine Learning”, Second Edition, MITPress,2013,Page no: 424-430.
Stacked generalization
Stacked generalization is a way of combining multiple models that have been learned for a
classification task.
Stacked generalization (or stacking) is a different way of combining multiple models that introduces the concept of a meta-learner. Although an attractive idea, it is less widely used than bagging and boosting. Unlike bagging and boosting, stacking may be (and normally is) used to combine models of different types. The procedure is as follows:
The combiner learns what the correct output is when the base-learners give a certain output
combination.
We cannot train the combiner function on the training data because the base-learners may be
memorizing the training set; the combiner system should actually learn how the base learners
make errors. Stacking is a means of estimating and correcting for the biases of the base-learners
1) to 3) are the same as cross-validation, but instead of using a winner-takes-all approach, we combine
the base learners, possibly nonlinearly.
Ethem Alpaydin, “Introduction to Machine Learning”, Second Edition, MITPress,2013,Page no: 447-450.
Reinforcement learning
The learning decision maker is called the agent. The agent interacts with the environment that
includes everything outside the agent. The agent has sensors to decide on its state in the
environment and takes an action that modifies its state. When the agent takes an action, the
environment provides a reward.
Model-Based Learning
Once we have the optimal value function, the optimal policy is to choose the action that
maximizes the value in the next state:
The agent interacts with an environment. At any state of the environment, the agent takes an
action that changes the state and returns a reward.
Ethem Alpaydin, “Introduction to Machine Learning”, Second Edition, MITPress,2013,Page no: 450-454.
Introduction (Maximum 5 sentences): For the same state and action, we may receive different rewards or move to different next states. What we do is keep a running average. This is known as the Q learning algorithm.
The agent learns a policy π for selecting its next action a based on the current observed state st; that is, π(st) = at.
One obvious approach is to require the policy that produces the greatest possible cumulative reward for the robot over time. To state this requirement more precisely, we define the cumulative value Vπ(st) achieved by following an arbitrary policy π from an arbitrary initial state st as
Vπ(st) ≡ rt + γ rt+1 + γ² rt+2 + ··· = Σ_{i=0}^{∞} γ^i rt+i
where 0 ≤ γ < 1 is the discount factor.
We require that the agent learn a policy π that maximizes Vπ(s) for all states s. We will call such a
policy an optimal policy and denote it by π*.
Q-learning
we may have an imperfect robot which sometimes fails to go in the intended direction and
deviates, or advances shorter or longer than expected.
In such a case, we have
We cannot do a direct assignment in this case because for the same state and action, we may
receive different rewards or move to different next states. What we do is keep a running
average. This is known as the Q learning algorithm:
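A minimal sketch of the Q-learning running average on a toy 5-state chain; the learning rate η, discount γ, and episode count are assumed:

```python
# Q-learning running-average update:
#   Q(s,a) <- Q(s,a) + eta * (r + gamma * max_a' Q(s',a') - Q(s,a))
# on a toy 5-state chain where only reaching the last state gives reward.
import random

n_states, actions = 5, [0, 1]          # action 0 = left, 1 = right
Q = [[0.0, 0.0] for _ in range(n_states)]
eta, gamma = 0.5, 0.9
random.seed(0)

for _ in range(500):
    s = 0
    while s != n_states - 1:           # episode ends at the rightmost state
        a = random.choice(actions)     # explore with a random policy
        s_next = max(0, s - 1) if a == 0 else s + 1
        r = 1.0 if s_next == n_states - 1 else 0.0
        Q[s][a] += eta * (r + gamma * max(Q[s_next]) - Q[s][a])
        s = s_next

print([[round(q, 2) for q in row] for row in Q])
# Q values for the "right" action grow toward the rewarding terminal state.
```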
Ethem Alpaydin, “Introduction to Machine Learning”, Second Edition, MITPress,2013,Page no: 454-461