Unit 1 Machine Learning

Machine Learning (ML) is a sub-field of Artificial Intelligence focused on enabling computers to learn from data and improve their performance without explicit programming. It encompasses various types of learning, including supervised, unsupervised, and reinforcement learning, and has applications in areas such as facial recognition, medical diagnosis, and financial forecasting. The ML life cycle involves data gathering, preprocessing, model development, testing, and deployment, with an emphasis on optimizing algorithms for efficiency and accuracy.

Machine Learning

• Machine Learning (ML) is a sub-field of Artificial Intelligence (AI)
• The goal of machine learning generally is to understand the structure
of data and fit that data into models that can be understood and
utilized by people.
• Hence, ML algorithms enable computers to learn from data and
improve themselves without being explicitly programmed.
• It is a continuously developing field.
Machine Learning
• Machine Learning is programming computers to optimize a
performance criterion using example data or past experience
• We have a model defined up to some parameters
• Learning is the execution of a computer program to optimize the
parameters of the model using training data
• The model may be predictive, to make predictions in the future, or
descriptive, to gain knowledge from data, or both
• Machine Learning uses the theory of statistics in building mathematical
models
Machine Learning
• The role of Computer Science is twofold:
• First, in training: we need efficient algorithms to solve the optimization
problem and to store and process the massive amount of data we usually have
• Second, once the model is learned, its solution needs to be efficient as well
• In some applications, the efficiency of the learning algorithm (its space and
time complexity) is as important as its predictive accuracy
Machine Learning
• Machine Learning applications:
• Facial Recognition
• Optical Character Recognition
• Recommender Engines (what music to listen to, what movie or show to watch, etc.)
• Self-driving cars
• Prediction of customer loan applications (probability of default)
• Image Recognition
• Speech Recognition (translation of spoken words into text)
• Medical diagnosis (diagnosis of diseases) – based on images or data
• Financial industry and trading (fraud transactions etc)
Machine Learning Life Cycle
1. Gathering Data

2. Data Preprocessing

3. Model Development and Training

4. Model Testing

5. Model Deployment
Machine Learning Life Cycle
• Data Gathering:
• Identification of various sources and collection of data
• Data Preprocessing:
• Data is analyzed for missing values, duplicate values, invalid data, etc. using various
analytical techniques
• This step also includes feature extraction, feature analysis and data visualization
• Model Development
• Develop a model using machine learning algorithms and train it using the dataset
• Training is important: it is how the model learns the patterns, classes, rules and features in the data
• Model Testing
• The trained model is tested on test dataset and model accuracy is checked
Machine Learning Life Cycle
• Model deployment:
• Involves integrating a machine learning model into an existing production
environment that takes input and returns output to make business decisions
based on data.
• Various technologies that you can use to deploy machine learning models are:
• Docker, Kubernetes, AWS SageMaker, MLFlow, Azure Machine Learning
Service
• Model Monitoring
• monitoring of machine learning models for factors like errors, crashes, and
latency and most importantly to ensure that your model is maintaining the
desired performance.
Types of Machine Learning
• Machine Learning is classified into three types:
• Supervised Learning
• Classification
• Regression
• Unsupervised Learning
• Clustering
• Association
• Dimensionality Reduction
• Reinforcement Learning
• Policy/Decision Making
Types of Machine Learning
Supervised Learning
• In supervised learning, the system is presented with labeled data, which
means that each data point is tagged with the correct label.
• The goal is to approximate the mapping function so well that when you have new
input data (x), you can predict the output variable (Y) for that data.
• Example: Identifying Spam Emails
Supervised Learning
• Example: Identifying Spam Emails
• Initially some data is taken and marked as ‘Spam’ or ‘Not Spam’.
• This labeled data is used to train the supervised model
• Once it is trained the model can be tested with some test mails (test
data) and see whether the model is able to predict the right output
Sample Dataset
Data Splitting

Data Split examples:


• Training/Testing : 70/30
• Training/Testing : 80/20
• Training/Testing : 90/10
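A split like the ones listed above can be sketched in a few lines of Python. This is only an illustration: the helper name `train_test_split`, the shuffling step and the fixed seed are assumptions made here, not something the slides prescribe.

```python
import random


def train_test_split(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of `rows` and split it into (train, test) lists."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    shuffled = list(rows)                   # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)   # fixed seed, so the split is repeatable
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical example: ten day identifiers split 70/30.
days = [f"D{i}" for i in range(1, 11)]
train, test = train_test_split(days, train_fraction=0.7)
print(len(train), len(test))
```

Shuffling before cutting avoids a split that depends on the order in which the data happened to be recorded.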
Data Splitting
• Training Data
Day   Outlook    Temperature  Humidity  Wind    Play Tennis?
D1    Sunny      Hot          High      Weak    No
D2    Sunny      Hot          High      Strong  No
D3    Overcast   Hot          High      Weak    Yes
D4    Rain       Mild         High      Weak    Yes
D5    Rain       Cool         Normal    Weak    Yes
D6    Rain       Cool         Normal    Strong  No
D7    Overcast   Cool         Normal    Strong  Yes
D8    Sunny      Mild         High      Weak    No
D9    Sunny      Cool         Normal    Weak    Yes
D10   Rain       Mild         Normal    Weak    Yes
• Testing Data (? = label not given)
Day   Outlook    Temperature  Humidity  Wind    Play Tennis?
D11   Sunny      Mild         Normal    Strong  ?
D12   Overcast   Mild         High      Strong  ?
D13   Overcast   Hot          Normal    Weak    ?
D14   Rain       Mild         High      Strong  ?
Types of Supervised learning
• Types of Supervised learning
• Classification: A classification problem is when the output variable is a category, such
as “red” or “blue”, or “disease” or “no disease”.
• Regression: A regression problem is when the output variable is a real value, such as
“dollars” or “weight”.
• Both are supervised learning where input : X, output : Y and task is to learn the mapping
from input to output.
• The approach in machine learning is we assume a model defined up to a set of
parameters:
y = g(x|ϴ)
• Here g(.) is the model and ϴ are its parameters.
• In regression, Y is a number
• In classification, Y is a class code (e.g., 0/1)
• In regression g(.) is – regression function
• In classification g(.) is – discriminant function (separates instances of different classes)
Classification
• Example: A loan given by a bank is an amount of money to be paid back
with interest, usually in installments.
• The bank should be able to predict the risk associated with a loan (the probability
that the customer will default and not pay the amount back)
• Credit scoring: the bank calculates the risk
• Input: the amount of the loan and customer details
• Customer information such as income, savings, profession, age, past history
• From such data on past applications, the aim is to infer a general rule describing
the association between a customer's attributes and risk
• The machine learning system fits a model to the past data so that it can
calculate the risk for a new application and decide to accept or refuse the loan
Classification
• This is an example of classification:
• There are two classes: low risk and high risk
• Input: information about the customer
• Classifier task: assign the input to one of the two classes
• After learning from the past data, the classification rule learned can be:
IF income > ϴ1 AND savings > ϴ2 THEN low-risk ELSE high-risk
• Suitable values of ϴ1 and ϴ2 are learned from training data.
• This rule is an example of a discriminant: a function that separates examples of different classes
• In some cases, instead of making a 0/1 type of decision, there is a need to
calculate a probability:
P(Y|X), where X are the customer attributes and Y is 0 or 1 (low-risk or high-risk)
• Then for a given X=x, if P(Y=1|X=x) = 0.8, the customer has an 80% probability of being
high risk, or a 20% probability of being low risk
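The IF/THEN rule above can be written directly as a small function. This is a minimal sketch: the function name `classify_risk` and the threshold values are hypothetical, chosen only so the example runs; in practice ϴ1 and ϴ2 would be learned from training data.

```python
def classify_risk(income, savings, theta1, theta2):
    """Discriminant: IF income > theta1 AND savings > theta2 THEN low-risk ELSE high-risk."""
    if income > theta1 and savings > theta2:
        return "low-risk"
    return "high-risk"


# Hypothetical thresholds, for illustration only (not learned from any data).
THETA1 = 30_000
THETA2 = 10_000

print(classify_risk(45_000, 20_000, THETA1, THETA2))  # low-risk
print(classify_risk(45_000, 5_000, THETA1, THETA2))   # high-risk
```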
Classification
• Example on training dataset
• Each circle represents one
data instance (customer)
• The signs in circles indicate
classes
• Customer attributes as
input: Savings and income
• Classes (output): low-risk
and high-risk
Classification
• Pattern Recognition – optical character recognition
• Face Recognition
• Medical Diagnosis
• Speech Recognition
• Biometrics
• Outlier Detection/Novelty Detection
Regression
• Ex: Developing a system that can predict the price of a used car
• Inputs: car attributes such as brand, year, engine capacity, mileage, etc.
• Output: car price
• Such problems, where the output is a number, are regression problems
• Let X denote the car attributes and Y the car price
• Past transactions can be collected as training data, and the machine learning
program fits a function to this data to learn Y as a function of X
• Here the fitted function is of the form:
y = wx + w0 (for suitable values of w and w0)
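One way to obtain suitable values of w and w0 is an ordinary least-squares fit. The slide does not say which fitting method is used, so the closed-form solution below is an assumption, and the mileage and price numbers are made up for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = w*x + w0; returns (w, w0)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); intercept makes the line pass
    # through the point of means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w = sxy / sxx
    w0 = mean_y - w * mean_x
    return w, w0


# Made-up data: mileage (thousands of km) and price (arbitrary units).
mileage = [10, 20, 30, 40]
price = [90, 80, 70, 60]
w, w0 = fit_line(mileage, price)
print(w, w0)  # -1.0 100.0
```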
Regression

• Figure: sample training dataset of used cars and the function fitted.
• For simplicity, mileage is the only input attribute and a linear model is used.
Unsupervised Learning
• Supervised Learning: Learns mapping function from input to output
whose correct values are provided by a supervisor.
• Unsupervised Learning: No such supervisor, we only have input data.
• Aim: To find regularities in input. There is certain structure to the
input space such that some patterns occur more often than others.
• Aim is to find such patterns (in statistics this is called density estimation)
• Ex: Clustering: Find clusters or grouping in inputs
• Ex: For a company with data of past customers (demographics & past
transactions), it may want to see the distribution of customers.
Unsupervised Learning
• Clustering here allocates customers of similar attributes to one group.
• Once such groups are found (customer segmentation), company may
decide strategies, services and products, specific to different groups
(customer relationship management).
• Other Application of Clustering: Image Compression:
• Input: Instances of image pixels represented by RGB values.
• Clustering groups pixels with similar colours in the same group
• We can code pixels belonging to same group with one number (say
their average)
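The pixel-grouping idea can be sketched with k-means. The slides do not name a clustering algorithm, so k-means is an assumption here; the pixel values and the starting centers are made up.

```python
def kmeans(points, centers, iterations=10):
    """Plain k-means: assign each point to its nearest center, then move
    each center to the mean of the points assigned to it."""
    centers = [tuple(c) for c in centers]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: index of the closest center (squared Euclidean distance).
        labels = [
            min(range(len(centers)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centers[k])))
            for p in points
        ]
        # Update step: each center becomes the per-channel average of its members.
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = tuple(sum(ch) / len(members) for ch in zip(*members))
    return centers, labels


# Four made-up RGB pixels: two reddish, two bluish.
pixels = [(250, 10, 10), (240, 20, 15), (10, 10, 245), (20, 5, 250)]
centers, labels = kmeans(pixels, centers=[pixels[0], pixels[2]])

# "Compression": every pixel is replaced by the average colour of its group.
compressed = [centers[lab] for lab in labels]
print(labels)
print(compressed)
```

Storing one group index per pixel plus a short table of average colours takes less space than storing the full RGB value of every pixel.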
Example of Clustering
Clustering in Machine Learning
Learning Association
• Ex.: In a supermarket chain, one application of machine learning is Market
Basket Analysis.
• This is finding associations between products bought by customers.
• If people who buy X typically also buy Y, and there is a customer who buys X
but does not buy Y, he or she is a potential customer for Y.
• Once such associations are found, we can target these customers for cross-selling.
• Interest is in learning the conditional probability P(Y|X), where Y is the product
we would like to condition on X (the products the customer has already purchased)
Ex: P(Butter|Bread) = 0.7
• (70% of customers who buy bread also buy butter)
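A conditional probability such as P(Butter|Bread) can be estimated by counting baskets. The sketch below uses a made-up list of transactions and a hypothetical helper name, `conditional_probability`.

```python
def conditional_probability(transactions, x, y):
    """Estimate P(y | x) as (#baskets containing x and y) / (#baskets containing x)."""
    with_x = [t for t in transactions if x in t]
    if not with_x:
        raise ValueError(f"no transaction contains {x!r}")
    return sum(1 for t in with_x if y in t) / len(with_x)


# Made-up baskets, for illustration only.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "jam"},
    {"milk"},
    {"bread", "butter", "jam"},
]
p = conditional_probability(baskets, "bread", "butter")
print(p)  # 0.75: three of the four baskets with bread also contain butter
```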
Association Rules
Reinforcement Learning
• In some applications, output of a system is a sequence of actions.
• In such a case, a single action is not important; what is needed is a policy, that is, a sequence
of correct actions to reach the goal.
• Here, an action is good if it is a part of a good policy.
• In such case, machine learning algorithm should be able to learn from
past good action sequences and generate policy.
• Such methods are: Reinforcement Learning Algorithms
• Ex: Game playing, Robot Navigation.
Hypothesis
• Consider that we want to learn the class, C, of a “family car”.
• We have a set of examples of cars that are labelled as positive examples (family cars)
and negative examples (not family cars).
• Class learning: finding a description that is shared by all positive examples and none
of the negative examples.
• Then we can make a prediction: given a new car (not seen before), we check it against the
description learned to decide whether it is a family car or not.
• We will consider two attributes (features) as input to the classifier: price and engine
power.
• Consider price as the first input, x1, and engine power as the second input, x2
• Each car is represented with two numeric values x = [x1, x2]
• Its label r denotes its type:
r = 1 if x is a positive example
r = 0 if x is a negative example
Hypothesis
• Figure shows training set for the
class “family car”
• Each data point corresponds to one
example car
• Coordinates of the point indicate the
price and engine power of the car
• ‘+’ denotes positive example and –
denotes negative example
Hypothesis
• Each car is represented by such an ordered pair (x, r) and the training set contains
N such examples:
$X = \{x^t, r^t\}_{t=1}^{N}$ (t indexes the different examples of the training set)
• Training data can now be plotted in 2-D space (x1, x2), where each instance t is a
data point at coordinates $(x_1^t, x_2^t)$ and its type, positive vs negative, is given by $r^t$
• We can say that for a car to be a family car, its price and engine power should be in a
certain range:
(p1 <= price <= p2) AND (e1 <= engine power <= e2) (for suitable values of p1, p2, e1, e2)
• This equation thus assumes class C to be a rectangle in the price-engine power
space.
• This equation also fixes H, the hypothesis class from which class C is drawn,
namely, the set of rectangles.
• The learning algorithm then finds the particular hypothesis h ϵ H, specified by a
particular quadruple $(p_1^h, p_2^h, e_1^h, e_2^h)$, to approximate C as closely as possible
Hypothesis

Example of a hypothesis class. The class of family car is a rectangle in the price-engine power space.
Hypothesis
• Hypothesis: Is an idea that is suggested as possible explanation for something but
has not yet been found to be correct or true.
• The aim is to find h ϵ H that is as similar as possible to C.
• To evaluate how well hypothesis h matches C, we find the empirical error.
• It is the proportion of training instances where the predictions of h do not match
the required values given in X.
• The error of hypothesis h given the training set X is:
$E(h \mid X) = \sum_{t=1}^{N} 1(h(x^t) \neq r^t)$
where 1(a ≠ b) is 1 if a ≠ b and 0 otherwise (dividing the sum by N gives the proportion).
• Hypothesis class H is the set of all possible rectangles: each quadruple $(p_1^h, p_2^h, e_1^h, e_2^h)$
defines one hypothesis, h, from H, and we need to choose the best one
• We need to find the values of these 4 parameters, given the training set, so that h
includes all positive examples and none of the negative examples.
• If the parameters are real valued, there are infinitely many h for which E is 0.
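A rectangle hypothesis and its empirical error can be sketched as follows. The training examples and the helper names (`make_rectangle`, `empirical_error`) are made up for illustration; the error is returned as a count, matching the sum in the formula.

```python
def make_rectangle(p1, p2, e1, e2):
    """Return hypothesis h: h(x) = 1 if x = (price, power) lies inside the rectangle, else 0."""
    def h(x):
        price, power = x
        return 1 if (p1 <= price <= p2 and e1 <= power <= e2) else 0
    return h


def empirical_error(h, training_set):
    """Number of training examples (x, r) for which h(x) != r."""
    return sum(1 for x, r in training_set if h(x) != r)


# Made-up training set: ((price, engine power), label).
training_set = [((15, 100), 1), ((18, 120), 1), ((40, 300), 0), ((5, 50), 0)]

h_good = make_rectangle(10, 20, 90, 150)   # contains both positives, no negatives
h_tight = make_rectangle(10, 16, 90, 150)  # misses the positive example at price 18

print(empirical_error(h_good, training_set))   # 0
print(empirical_error(h_tight, training_set))  # 1
```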
Hypothesis
• C is the actual class and
h is our induced
hypothesis.
• False Negative: point
where C is 1 but h is 0
• False Positive: point
where C is 0 but h is 1.
• Other points: True
positive and true
negative are correctly
classified
Hypothesis
• Generalization: Given a future example somewhere close to the boundary between
positive and negative examples, different candidate hypotheses may make different
predictions.
• This is problem of generalization – that is, how well our hypothesis will correctly
classify future examples that are not part of training set.
• Most Specific Hypothesis, S: the tightest rectangle that includes all positive examples
and none of the negative examples. The actual class C may be larger than S but is never
smaller.
• Most General Hypothesis, G: the largest rectangle that includes all positive examples
and none of the negative examples.
• Any h ϵ H between S and G is a valid hypothesis with no error (consistent with the
training set), and such h make up the version space.
• Given another training set, S, G, the version space, the parameters and thus the learned
hypothesis, h, can be different.
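The most specific hypothesis S can be computed directly as the bounding box of the positive examples. This is a sketch under assumptions: the data layout, the example values and the helper name `most_specific_hypothesis` are made up here. It does not check that no negative example falls inside the box, which must hold for S to be consistent.

```python
def most_specific_hypothesis(training_set):
    """Tightest axis-aligned rectangle (p1, p2, e1, e2) around the positive examples."""
    positives = [x for x, r in training_set if r == 1]
    if not positives:
        raise ValueError("need at least one positive example")
    prices = [price for price, _ in positives]
    powers = [power for _, power in positives]
    return min(prices), max(prices), min(powers), max(powers)


# Made-up training set: ((price, engine power), label).
training_set = [((15, 100), 1), ((18, 120), 1), ((40, 300), 0), ((5, 50), 0)]
S = most_specific_hypothesis(training_set)
print(S)  # (15, 18, 100, 120)
```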
Hypothesis

Figure: S is the most specific and G is the most general hypothesis.


Hypothesis
• Given X, we can find S or G or any h from the version space and use it as our hypothesis, h.
• Hypothesis h can be chosen to be halfway between S and G, to increase the margin (the distance between the boundary and the instances closest to it)
• For our error function to have a minimum at h with the maximum margin, we should use an error (loss) function which checks not only whether an instance is on the correct side of the boundary but also how far away it is.
Figure: The shaded instances are those that define (or support) the margin.
Noise
• Noise is any unwanted anomaly in the data and due to noise, the class may be
difficult to learn and zero error may not be possible with simple hypothesis.
• Noise can be:
• Imprecision in recording the input attributes
• Errors in labelling the data points, which may relabel positive instances as negative and vice versa
(teacher noise)
• There may be additional attributes, which we have not taken into account, that affect the label
of an instance (hidden or latent attributes)
• When there is noise, we may need a more complex hypothesis (a hypothesis class with larger
capacity)
• A rectangle can be defined with four numbers (parameters), but to define a more complicated
shape, we need a more complex model with a larger number of parameters.
• With a complex model, we can make a perfect fit to the data and attain zero error.
• Alternatively, we can keep the model simple and allow some error.
• Given comparable empirical error, a simple (but not too simple) model would generalize
better than a complex model. This principle is called Occam’s razor.
Noise - Figure
Model Selection and Generalization – ill posed problem
• Consider we have a dataset containing N points. These N points can be labelled in
$2^N$ ways as positive and negative. Therefore, $2^N$ different learning problems can be
defined by N data points.
• Ex. A Boolean function (where all inputs and outputs are binary)
• There are $2^d$ possible ways to write d binary values, so with d inputs the training
set has at most $2^d$ examples, and therefore there are $2^{2^d}$ possible Boolean
functions of d inputs.
• For a Boolean function, to end up with a single hypothesis we need to see all $2^d$
training examples.
• If the training set we are provided with contains only a small subset of all possible
instances (as it usually does), the solution is not unique.
• After seeing N example cases, there remain $2^{2^d - N}$ possible functions.
• This is an example of ill-posed problem where data by itself is not sufficient to
find a unique solution.
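The counting argument can be checked by brute force for small d. The sketch below enumerates every Boolean function of d inputs as a truth table and keeps those that agree with the examples seen so far; the helper name and the two example cases are made up.

```python
from itertools import product


def consistent_functions(d, examples):
    """All truth tables over d binary inputs that agree with `examples`,
    a dict mapping an input tuple to its observed output."""
    inputs = list(product([0, 1], repeat=d))            # the 2**d possible inputs
    tables = []
    for outputs in product([0, 1], repeat=len(inputs)):  # the 2**(2**d) functions
        table = dict(zip(inputs, outputs))
        if all(table[x] == r for x, r in examples.items()):
            tables.append(table)
    return tables


d = 2
all_functions = consistent_functions(d, {})
seen = {(0, 0): 0, (0, 1): 1}                            # N = 2 examples
remaining = consistent_functions(d, seen)

print(len(all_functions))  # 16 = 2**(2**2)
print(len(remaining))      # 4  = 2**(2**2 - 2)
```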
Model Selection – Boolean Function Example

d = 2 (attributes x1, x2)
Ways of writing the input: $2^d = 2^2 = 4$
Ways of labelling these N = 4 inputs (learning problems): $2^N = 2^4 = 16$
Model Selection and Generalization – Inductive bias
• If learning is ill-posed, and the data is not sufficient to find the solution, we
should make some extra assumptions to have a unique solution with the data.
• The set of assumptions we make to have learning possible is called the inductive
bias of the learning algorithm.
• One way to introduce inductive bias is when we assume a hypothesis class H.
• In learning class of family cars, there are infinitely many ways of separating the positive
examples from the negative ones.
• Assuming the shape of a rectangle is one inductive bias
• Considering rectangle with largest margin is another inductive bias
• In linear regression, assuming a linear function is an inductive bias and among all
lines, choosing the one that minimizes squared error is another inductive bias.
Model Selection - Generalization
• Learning is not possible without inductive bias, and now the question is
how to choose the right bias.
• This is called model selection, which is choosing between possible H.
• The aim of machine learning is rarely to replicate the training data but
the prediction for new cases.
• That is we would like to be able to generate the right output for an input
instance outside the training set, one for which the correct output is not
given in the training set.
• How well a model trained on the training set predicts the right output
for new instances is called generalization.
Model Selection – Underfitting and Overfitting
• For best generalization, we should match the complexity of the hypothesis class H
with the complexity of the function underlying the data.
• If H is less complex than the function, we have underfitting:
• Ex: when trying to fit a line to data sampled from a third-order polynomial.
• In such a case, as we increase the model complexity, the training error decreases.
• But if we have H that is too complex, the data is not enough to constrain it and we
may end up with a bad hypothesis, h ∈ H,
• Ex: when fitting two rectangles to data sampled from one rectangle.
• Or if there is noise, an overcomplex hypothesis may learn not only the underlying function but
also the noise in the data and may make a bad fit,
• Ex: when fitting a sixth-order polynomial to noisy data sampled from third order polynomial
• This is called overfitting (having more training data helps but only up to a certain
point)
• Given a training set and H, we can find h ∈ H that has the minimum training error
but if H is not chosen well, no matter which h ∈ H we pick, we will not have good
generalization.
Underfitting & Overfitting

• Underfitted Model: High Training and Testing Error


• Overfitted Model: Very Low Training Error but High Testing Error
Underfitting & Overfitting
• Bias: Assumptions made by a model to make a function easier to learn.
• Variance: If you train your model on one set of training data and obtain a very low error, but upon
changing the data and training the same model again you obtain a
high error, this is variance.
• Underfitting:
• A statistical model or a machine learning algorithm is said to have underfitting when it
cannot capture the underlying trend of the data (model does not fit the data well).
• It reduces the accuracy of the machine learning model
• It usually happens when data is not enough and we try to build simple model on data taken
from more complex function (ex. Linear model on non-linear data)
• Underfitting – high bias and low variance
• To reduce underfitting: increase model complexity, increase number of features and data,
apply feature selection, remove noise from data, increase epochs or training duration
Underfitting & Overfitting
• Overfitting
• A statistical model is said to be overfitted when it fits the training data too closely and does not generalize beyond it
• Such a model starts learning from the noise and inaccurate data entries in the data set, in addition to the underlying pattern.
• This may lead to poor performance on unseen data (test data).
• Overfitting: high variance and low bias
• To reduce overfitting: increase training data, reduce model complexity, use regularization
(L1, L2), use dropout in neural networks
• A model that is underfit will have high training and high testing error while an
overfit model will have extremely low training error but a high testing error.
Model Selection- Triple Trade-Off
• Triple trade-off
• In all learning algorithms that are trained from example data, there is a trade-off
between three factors:
• the complexity of the hypothesis we fit to data, namely, the capacity of the hypothesis class,
• the amount of training data, and
• the generalization error on new examples.
• We can measure the generalization ability of a hypothesis, if we have access to
data outside training set.
• For this purpose we divide training set into two: Training and Validation sets
Model Selection
• The training set is used for training and the validation set is used to test the
generalization ability.
• Given a set of possible hypothesis classes Hi, for each we fit the best hi ϵ Hi on
training set.
• The hypothesis that is the most accurate on the validation set is the best one.
• Ex. To find right order in polynomial regression, given number of candidate polynomials of
different orders.
• Here, polynomials of different orders correspond to Hi for each order
• We find the coefficients on training set, calculate their errors on validation set
• Take the polynomial with least validation error as the best one.
• Now to report the error of our best model, validation error should not be used,
as validation set is used to choose the best model and hence it becomes part of
training set.
• So, to report error of best selected model, test set is used.
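The select-on-validation, report-on-test procedure can be sketched as below. To keep the example dependency-free, the two candidate hypothesis classes are a constant and a straight line rather than polynomials of several orders; that substitution, the data and all names are assumptions made for illustration.

```python
def fit_constant(xs, ys):
    """Hypothesis class H0: always predict the mean of the training targets."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Hypothesis class H1: least-squares line y = w*x + w0."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    w = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    w0 = mean_y - w * mean_x
    return lambda x: w * x + w0


def mse(h, xs, ys):
    """Mean squared error of hypothesis h on the given data."""
    return sum((h(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def select_model(candidates, train, validation):
    """Fit every candidate on the training set; keep the one with least validation error."""
    fitted = {name: fit(*train) for name, fit in candidates.items()}
    best = min(fitted, key=lambda name: mse(fitted[name], *validation))
    return best, fitted[best]


# Made-up data generated from y = 2x + 1, split into three disjoint parts.
train = ([0, 1, 2, 3], [1, 3, 5, 7])
validation = ([4, 5], [9, 11])
test = ([6, 7], [13, 15])

best_name, best_h = select_model({"constant": fit_constant, "line": fit_line},
                                 train, validation)
test_error = mse(best_h, *test)   # reported on data used neither to fit nor to select
print(best_name, test_error)
```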
Evaluation
• Evaluation of a machine learning model is important as the models
are designed to predict the class of “future” unseen data.
• Typical choices for performance evaluation:
• Error
• Accuracy
• Precision/Recall
• Typical choices for sampling methods:
• Train/Test sets
• K-Fold Cross Validation
Evaluation
• Given hypothesis class H, Training data X, our learning algorithm comes up with
hypothesis h ϵ H.
• It is important to understand how good the hypothesis h is. We do this by
evaluation (i.e., we evaluate the performance of learning algorithm –
experimental evaluation)
• For experimental evaluation we must have a metric like: error, accuracy,
precision/recall etc.
• We evaluate the performance on a sample data.
• If we use training data for evaluation, it may not reflect true error as we come up
with hypothesis with training dataset.
• So, we use test dataset. One way to use train-test dataset properly is cross-
validation.
Evaluation – Evaluating Predictions
• Suppose we want to predict the value for a target feature for input x:
• Say y is the actual (observed) value of the target feature for input x
• ŷ is the predicted value for input x
• We want to make the prediction h(x) (ŷ = h(x))
• If ŷ and y are the same, then there is no error
• If they are different, then there is error. Error can be measured with:
• Absolute error: |h(x) – y| for one training example. For N training examples its average is
taken: $\frac{1}{N} \sum |h(x) - y|$
• Sum of squares error: $\frac{1}{N} \sum_{1}^{N} |h(x) - y|^2$
• Absolute error and sum of squares error are useful for regression
• For classification we should check the number of misclassified examples: $\frac{1}{N} \sum \delta(h(x), y)$
• The function δ returns 1 if there is a misclassification, else 0. We calculate the number of misclassifications
divided by the total number of samples
• Sometimes confusion matrix is also used
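The three error measures above translate into short functions. The function names and the sample predictions are made up for illustration.

```python
def mean_absolute_error(predictions, targets):
    """(1/N) * sum of |h(x) - y|."""
    return sum(abs(p - y) for p, y in zip(predictions, targets)) / len(targets)


def mean_squared_error(predictions, targets):
    """(1/N) * sum of |h(x) - y|**2."""
    return sum((p - y) ** 2 for p, y in zip(predictions, targets)) / len(targets)


def misclassification_rate(predictions, targets):
    """(1/N) * sum of delta(h(x), y), where delta is 1 on a mismatch and 0 otherwise."""
    return sum(1 for p, y in zip(predictions, targets) if p != y) / len(targets)


# Regression example (made-up numbers): absolute errors are 1, 0, 2, 0.
print(mean_absolute_error([2, 4, 6, 8], [1, 4, 8, 8]))   # 0.75
print(mean_squared_error([2, 4, 6, 8], [1, 4, 8, 8]))    # 1.25

# Classification example (made-up labels): 2 of 4 predictions are wrong.
print(misclassification_rate([1, 0, 1, 1], [1, 1, 1, 0]))  # 0.5
```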
Evaluation – Evaluating Predictions
• Confusion Matrix:
                          True (actual) class
                          Positive              Negative
Hypothesis   Positive     True Positive (TP)    False Positive (FP)
class        Negative     False Negative (FN)   True Negative (TN)
• Accuracy = $\frac{TP + TN}{P + N}$ (P + N = all instances)
• Precision: out of the examples marked as positive by our learning
algorithm/model, how many are actually positive = $\frac{TP}{TP + FP}$
• Recall: out of all positive examples, how many are correctly predicted
as positive = $\frac{TP}{\text{all positive examples}} = \frac{TP}{TP + FN}$
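The confusion-matrix counts and the three ratios can be computed as in the sketch below. Labels are assumed to be 1 for positive and 0 for negative, and the example label lists are made up.

```python
def confusion_counts(predicted, actual):
    """Return (TP, FP, FN, TN) for binary labels, with 1 = positive and 0 = negative."""
    tp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
    fp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 0)
    fn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 1)
    tn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 0)
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)


def precision(tp, fp):
    return tp / (tp + fp)


def recall(tp, fn):
    return tp / (tp + fn)


# Made-up labels for eight instances.
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

tp, fp, fn, tn = confusion_counts(predicted, actual)
print(tp, fp, fn, tn)                 # 3 1 1 3
print(accuracy(tp, fp, fn, tn))       # 0.75
print(precision(tp, fp), recall(tp, fn))  # 0.75 0.75
```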
Cross Validation
• Cross-Validation is a technique used to test a model’s ability to predict unseen
data (data not used to train the model).
• It is useful if we have limited data when test set is not large enough.
• Cross Validation splits the training data into k blocks.
• In each iteration, the model trains on k-1 blocks and is validated using the last
block.
• The average error over all the iterations is used to evaluate the model.
• Types: K-Fold Cross Validation, Leave-One-Out Cross Validation, Monte Carlo Cross Validation,
Validation set approach, Stratified K-Fold Cross Validation.
• K-fold cross-validation is the most common technique for model evaluation and
model selection in machine learning.
• The idea behind K-Fold Cross Validation is each sample in the dataset has equal
opportunity of getting tested.
K-Fold Cross Validation
Steps of K-Fold Cross Validation:
1. Split training data into K equal parts
2. Fit the model on k-1 parts (merged as training set) and calculate test error using
the fitted model on the kth part (test set)
3. Repeat k times, using each data subset as the test set once.

Figure: 5-Fold Cross Validation (k = 5)
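The K-fold steps can be sketched as an index generator plus a loop that averages the per-fold errors. The contiguous, unshuffled folds, the mean predictor used as the model, and the data are all assumptions made to keep the example small.

```python
def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for k contiguous folds over n samples.
    Every index appears in exactly one test fold."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n))
        yield train_idx, test_idx
        start += size


def cross_validate(ys, k):
    """Average squared error of a 'predict the training mean' model over k folds."""
    fold_errors = []
    for train_idx, test_idx in k_fold_indices(len(ys), k):
        mean_y = sum(ys[i] for i in train_idx) / len(train_idx)       # fit on k-1 parts
        err = sum((ys[i] - mean_y) ** 2 for i in test_idx) / len(test_idx)  # test on the kth
        fold_errors.append(err)
    return sum(fold_errors) / k


folds = list(k_fold_indices(10, 5))
print(folds[0])                       # first fold: indices 0 and 1 are held out
print(cross_validate([3.0] * 10, 5))  # 0.0, since constant targets are predicted exactly
```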
Linear Regression
• Regression: a supervised learning problem where we are given
examples of instances whose x and y values are known and we have to
learn a function, f, so that given an unknown value x, it predicts the
value of y.
f(x) → y (for regression, y is continuous)
• Different types of functions can be used for regression. The simplest
being “Linear Regression”
• X can have multiple features (attributes).
• Simple Regression – where x has only one feature
• We can plot now values of x and y in feature space.
Linear Regression
• Linear regression considers a simple line as
function for mapping x to y.
• Given x and y values, we need to find out the best
line that represents the data so that given a new
unknown value of x, y can be predicted.
• In simple words – Given an input x, we have to
compute y.
• Ex. Predict cost of flat (y), given number of rooms (x)
• Predict weight of a person (y), given person’s age (x)
• Predict Salary of a person (y), given work experience (x)
Types of regression:
• Simple Regression (1 feature): Linear or Non-Linear
• Multiple Regression (more than one feature): Linear or Non-Linear
• X is called the independent variable
• Y is called the dependent variable
• Simple Regression: one dependent variable, one independent variable (ϴ0 + ϴ1.x)
• Multiple Regression: one dependent variable, two or more independent variables (ϴ0 + ϴ1.x1 + ϴ2.x2 + ϴ3.x3 …)
Linear Regression
• Formula for linear regression is given by:
y = a + bx
• In Machine Learning, Hypothesis function for Linear Regression: y = ϴ0 + ϴ1.x
• Here,
• x: is input data (training Data),
• y: is data label
• ϴ0 : intercept (y-intercept)
• ϴ1: coefficient of x (slope of line)
• When we train the model, it fits the best line to predict the value of y for given value of x.
• This is done by finding the best ϴ0 and ϴ1 values.
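• Cost function (J): The aim is to predict the value of y such that the error between the predicted value and the
true value is minimum.
• So, it is very important to update the θ0 and θ1 values to reach the values that minimize the error
between the predicted y value (pred) and the true y value (y):
minimize $\frac{1}{n} \sum_{i=1}^{n} (pred_i - y_i)^2$
Cost function: $J = \frac{1}{n} \sum_{i=1}^{n} (pred_i - y_i)^2$
(The worked example later in this unit uses the factor 1/(2n) instead of 1/n; the scaling does not change which θ0 and θ1 minimize J.)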
Linear Regression
Gradient Descent:
• To update the θ0 and θ1 values in order to reduce the cost function (minimizing the error
value) and achieve the best-fit line, the model uses Gradient Descent.
• The idea is to start with random θ0 and θ1 values and then iteratively update
the values until the minimum cost (minimum error) is reached.
• How θ0 and θ1 get updated:
$\theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} (h_\theta(x_i) - y_i)\, x_{ij}$ (with $x_{i0} = 1$)
• θj : Weights of the hypothesis
• hθ(xi) : predicted y value for the ith input
• j : Feature index number (can be 0, 1, 2, ......, n)
• α : Learning Rate of Gradient Descent
• m : Number of training samples
Linear Regression
Consider a dataset:
Sample No (m)   Experience (x)   Salary (y) in lakhs
1               2                3
2               6                10
3               5                4
4               7                3
Iteration 1: to start, θ0 and θ1 values are randomly chosen. Let us suppose θ0 = 0 and θ1 = 0
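Cost Function Error
Linear Regression: y = ϴ0 + ϴ1.x
In matrix form:
$\begin{bmatrix} y_1 & y_2 & y_3 & y_4 \end{bmatrix} = \begin{bmatrix} \theta_0 & \theta_1 \end{bmatrix} \begin{bmatrix} 1 & 1 & 1 & 1 \\ x_1 & x_2 & x_3 & x_4 \end{bmatrix}$
y1 = θ0.1 + θ1.x1
y2 = θ0.1 + θ1.x2
y3 = θ0.1 + θ1.x3
y4 = θ0.1 + θ1.x4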
Linear Regression
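• Gradient Descent, update of θ0 (here, j = 0), with learning rate α = 0.001 and m = 4 samples:
θ0 = 0 − (0.001/4) [(0 − 3) + (0 − 10) + (0 − 4) + (0 − 3)] = 0 − (0.001/4)(−20) = 0.005
• Gradient Descent, update of θ1 (here, j = 1):
θ1 = 0 − (0.001/4) [(0 − 3) × 2 + (0 − 10) × 6 + (0 − 4) × 5 + (0 − 3) × 7] = 0 − (0.001/4)(−107) = 0.02675
Iteration 2: θ0 = 0.005 and θ1 = 0.02675 (taken as 0.026 in the calculations)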


Linear Regression

Cost function for iteration 2 (predictions 0.057, 0.161, 0.135, 0.187):
J = (1/(2 × 4)) [(0.057 − 3)² + (0.161 − 10)² + (0.135 − 4)² + (0.187 − 3)²]
= (1/8) (8.66 + 96.80 + 14.93 + 7.91) ≈ 16.04
Linear Regression
• Gradient Descent, update of θ0 (here, j = 0):
θ0 = 0.005 − (0.001/4) [(0.057 − 3) + (0.161 − 10) + (0.135 − 4) + (0.187 − 3)]
= 0.005 − (0.001/4) (−2.943 − 9.839 − 3.865 − 2.813)
= 0.005 − (0.001/4) (−19.46)
= 0.005 − (−0.0048) = 0.0098
• Gradient Descent, update of θ1 (here, j = 1):
θ1 = 0.026 − (0.001/4) [(0.057 − 3) × 2 + (0.161 − 10) × 6 + (0.135 − 4) × 5 + (0.187 − 3) × 7]
= 0.026 − (0.001/4) ((−2.943 × 2) + (−9.839 × 6) + (−3.865 × 5) + (−2.813 × 7))
= 0.026 − (0.001/4) (−5.886 + (−59.034) + (−19.325) + (−19.691))
= 0.026 − (0.001/4) (−103.936) = 0.026 + 0.0259 = 0.0519
Linear Regression
Iteration 3: θ0 = 0.0098 and θ1 = 0.0519
y1 = 0.0098 × 1 + 0.0519 × 2 = 0.1136
y2 = 0.0098 × 1 + 0.0519 × 6 = 0.3212
y3 = 0.0098 × 1 + 0.0519 × 5 = 0.2693
y4 = 0.0098 × 1 + 0.0519 × 7 = 0.3731
Cost function for iteration 3:
J = (1/(2 × 4)) [(0.1136 − 3)² + (0.3212 − 10)² + (0.2693 − 4)² + (0.3731 − 3)²]
= (1/8) (8.33 + 93.68 + 13.92 + 6.90) ≈ 15.35
Linear Regression
• Gradient Descent, update of θ0:
θ0 = 0.0098 − (0.001/4) [(0.1136 − 3) + (0.3212 − 10) + (0.2693 − 4) + (0.3731 − 3)]
= 0.0098 − (0.001/4) (−2.8864 − 9.6788 − 3.7307 − 2.6269)
= 0.0098 − (0.001/4) (−18.9228)
= 0.0098 + 0.0047 = 0.0145
• Gradient Descent, update of θ1:
θ1 = 0.0519 − (0.001/4) [(0.1136 − 3) × 2 + (0.3212 − 10) × 6 + (0.2693 − 4) × 5 + (0.3731 − 3) × 7]
= 0.0519 − (0.001/4) (−5.7728 − 58.0728 − 18.6535 − 18.3883)
= 0.0519 − (0.001/4) (−100.8874)
= 0.0519 + 0.0252 = 0.0771

θ0 = 0.0145 and θ1 = 0.0771


The iterations are continued till the error reduces to minimum and we get good fit (i.e., values of θ0 and θ1 )
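The iterations can also be run in code. The sketch below uses the dataset from the worked example, a learning rate of 0.001 and the 1/(2m) cost; the function names are made up. Because it does not round between steps, its values can differ in the last digits from a hand calculation.

```python
def predict(theta0, theta1, x):
    """Hypothesis h(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x


def cost(theta0, theta1, xs, ys):
    """J = 1/(2m) * sum of (h(x_i) - y_i)**2."""
    m = len(xs)
    return sum((predict(theta0, theta1, x) - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


def gradient_step(theta0, theta1, xs, ys, alpha):
    """One simultaneous gradient-descent update of theta0 and theta1."""
    m = len(xs)
    errors = [predict(theta0, theta1, x) - y for x, y in zip(xs, ys)]
    new_theta0 = theta0 - alpha / m * sum(errors)
    new_theta1 = theta1 - alpha / m * sum(e * x for e, x in zip(errors, xs))
    return new_theta0, new_theta1


xs = [2, 6, 5, 7]    # experience
ys = [3, 10, 4, 3]   # salary in lakhs
alpha = 0.001

theta0, theta1 = 0.0, 0.0
history = []
for iteration in range(1, 4):
    j = cost(theta0, theta1, xs, ys)
    history.append(j)
    print(f"iteration {iteration}: theta0={theta0:.4f} theta1={theta1:.4f} J={j:.2f}")
    theta0, theta1 = gradient_step(theta0, theta1, xs, ys, alpha)
```

With such a small learning rate the cost falls slowly, so many more iterations would be needed before the line fits the data well.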
