UNIT VI
Classification & Regression
Mark: 10
•Linear Regression in Machine Learning:
▪ Linear regression is one of the easiest and most popular Machine Learning
algorithms.
▪ It is a statistical method that is used for predictive analysis. Linear regression
makes predictions for continuous/real or numeric variables such as sales,
salary, age, product price, etc.
▪ The linear regression algorithm models a linear relationship between a dependent
(y) variable and one or more independent (x) variables, hence the name linear
regression. Because the relationship is linear, the model describes how the value of
the dependent variable changes with the value of the independent variable.
• Mathematically, we can represent a linear regression as:
• y = a0 + a1x + ε
Here,
• y = Dependent variable (target variable)
• x = Independent variable (predictor variable)
• a0 = Intercept of the line (gives an additional degree of freedom)
• a1 = Linear regression coefficient (scale factor applied to each input value)
• ε = Random error
• The values of the x and y variables come from the training dataset used to fit the
Linear Regression model.
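• A minimal sketch of how a0 and a1 can be estimated from training data, assuming the ordinary least-squares criterion (the notes do not name a fitting method). The function name and the toy data are hypothetical and chosen only for illustration.

```python
def fit_simple_linear_regression(x, y):
    """Least-squares estimates of a0 (intercept) and a1 (slope) for y = a0 + a1*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope = co-variation of x and y divided by the variation of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    a1 = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y).
    a0 = mean_y - a1 * mean_x
    return a0, a1


# Hypothetical toy data generated from y = 30 + 2x, so the error term is zero.
x = [1, 2, 3, 4, 5]
y = [32, 34, 36, 38, 40]

a0, a1 = fit_simple_linear_regression(x, y)
prediction = a0 + a1 * 6  # predicted y for a new input x = 6
print(a0, a1, prediction)
```

• With real data the points do not lie exactly on a line, and the same formulas return the line that minimizes the sum of squared errors.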
•Types of Linear Regression:
•Simple Linear Regression:
If a single independent variable is used to predict the value of a
numerical dependent variable, then such a Linear Regression
algorithm is called Simple Linear Regression.
•Multiple Linear regression:
If more than one independent variable is used to predict the value of
a numerical dependent variable, then such a Linear Regression
algorithm is called Multiple Linear Regression.
•Mean Absolute Error (MAE)
• MAE is a very simple metric that calculates the average of the absolute differences
between actual and predicted values.
•Mean Squared Error (MSE)
• MSE is one of the most widely used metrics and differs only slightly from mean
absolute error: it is the average of the squared differences
between actual and predicted values.
•Root Mean Squared Error (RMSE)
• As the name suggests, RMSE is simply the square root of the mean
squared error.
•R Squared (R2)
• The R2 score tells you how well your model performs, rather than the loss in
an absolute sense.
• MAE and MSE depend on the context (the scale and units of the target), whereas
the R2 score is independent of context.
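• The four metrics can be sketched in a few lines. This is an illustrative implementation using the standard definitions (R2 = 1 - SS_res / SS_tot); the function name and the sample values are hypothetical.

```python
import math


def regression_metrics(actual, predicted):
    """Return MAE, MSE, RMSE and R2 for two equal-length lists of numbers."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n      # mean of absolute errors
    mse = sum(e ** 2 for e in errors) / n      # mean of squared errors
    rmse = math.sqrt(mse)                      # same units as the target
    mean_actual = sum(actual) / n
    ss_res = sum(e ** 2 for e in errors)                  # unexplained variation
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total variation
    r2 = 1 - ss_res / ss_tot
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}


# Hypothetical actual and predicted values.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 11.0]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

• MAE, MSE and RMSE change if the target is rescaled (for example from rupees to thousands of rupees), while R2 stays the same.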
•Overfitting in Machine Learning:
• A statistical model is said to be overfitted when it fits the training data well but does not make
accurate predictions on testing data.
• When a model is fitted too closely to the training data, it starts learning from the noise
and inaccurate data entries in the data set.
• Testing such a model on test data then shows high variance: the model does
not categorize the data correctly because it has learned too many details and too much noise.
• Non-parametric and non-linear methods are more prone to overfitting
because these types of machine learning algorithms have more freedom in
building the model from the dataset, and can therefore build
unrealistic models.
• Ways to avoid overfitting include using a linear algorithm if we have linear data,
or limiting parameters such as the maximal depth if we are using decision trees.
•Reasons for Overfitting:
• High variance and low bias.
• The model is too complex.
• The training dataset is too small.
•Techniques to Reduce Overfitting
• Increase training data.
• Reduce model complexity.
• Early stopping during the training phase (monitor the validation loss over
the training period and stop training as soon as it begins to increase).
• Ridge Regularization and Lasso Regularization.
• Use dropout for neural networks to tackle overfitting.
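• Early stopping can be sketched as a simple rule over the per-epoch validation loss. This is an illustration under assumptions: the `patience` parameter (how many non-improving epochs to tolerate) and the loss values are hypothetical and not part of the notes.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch with the best validation loss,
    stopping once the loss has failed to improve `patience` times in a row."""
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # loss keeps rising: stop training here
    return best_epoch


# Hypothetical validation losses: they fall, then rise as the model overfits.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
best = early_stopping_epoch(val_losses)
print(best)
```

• In a real training loop the losses arrive one epoch at a time, and the model weights saved at the best epoch are the ones that are kept.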
•Underfitting in Machine Learning:
•A statistical model or a machine learning algorithm is said to underfit
when it is too simple to capture the complexities of the data.
• Underfitting represents the inability of the model to learn the training data effectively,
which results in poor performance on both the training and testing data.
• In simple terms, an underfit model's predictions are inaccurate, especially when applied to
new, unseen examples.
• It mainly happens when we use a very simple model with overly simplified
assumptions.
• To address underfitting, we need to use more complex
models, with enhanced feature representation and less regularization.
• Reasons for Underfitting
• The model is too simple, so it may not be capable of representing the complexities in the data.
• The input features used to train the model are not adequate representations of the
underlying factors influencing the target variable.
• The training dataset is not large enough.
• Excessive regularization is used to prevent overfitting, which constrains the model and keeps it from
capturing the data well.
• Features are not scaled.
• Techniques to Reduce Underfitting
• Increase model complexity.
• Increase the number of features, performing feature engineering.
• Remove noise from the data.
• Increase the number of epochs or increase the duration of training to get better results.
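• A small illustration of underfitting and of fixing it by feature engineering. The data is hypothetical (y = x squared): a straight line in x is too simple and has a high error even on the training data, while the same model fitted on the engineered feature x squared matches it exactly. The helper names are made up for this sketch.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for a single input feature."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mse(actual, predicted):
    """Mean squared error between two equal-length lists."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical training data with a curved relationship: y = x^2.
x = [-2, -1, 0, 1, 2]
y = [xi ** 2 for xi in x]

# Too-simple model: a straight line in x underfits the curve.
b0, b1 = fit_line(x, y)
mse_simple = mse(y, [b0 + b1 * xi for xi in x])

# Feature engineering: use x^2 as the input feature instead.
x_squared = [xi ** 2 for xi in x]
c0, c1 = fit_line(x_squared, y)
mse_engineered = mse(y, [c0 + c1 * f for f in x_squared])

print(mse_simple, mse_engineered)
```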
• Multiple Linear Regression:
• Multiple Linear Regression is an extension of Simple Linear regression as it
takes more than one predictor variable to predict the response variable.
• Multiple Linear Regression is one of the important regression algorithms
which models the linear relationship between a single dependent continuous
variable and more than one independent variable.
• Some key points about MLR:
• For MLR, the dependent or target variable (Y) must be continuous/real, but the predictor
or independent variables may be continuous or categorical.
• Each feature variable must model the linear relationship with the dependent variable.
• MLR tries to fit a regression line through a multidimensional space of data-points.
•MLR equation:
•In Multiple Linear Regression, the target variable (Y) is a linear
combination of multiple predictor variables x1, x2, x3, ..., xn. Since it is
an extension of Simple Linear Regression, the same form applies,
and the equation becomes:
•Y = b0 + b1x1 + b2x2 + b3x3 + ... + bnxn
•Where,
•Y= Output/Response variable
•b0, b1, b2, b3, ..., bn = Coefficients of the model
•x1, x2, x3, ..., xn = Independent/feature variables
• Implementation of Multiple Linear Regression model:
• Problem Description:
• We have a dataset of 50 start-up companies.
• This dataset contains five main pieces of information: R&D Spend, Administration
Spend, Marketing Spend, State, and Profit for a financial year.
• Our goal is to create a model that can easily determine which company has the
maximum profit, and which factor affects the profit of a
company the most.
• Since we need to predict the Profit, it is the dependent variable, and the other
four variables are independent variables. Below are the main steps of
implementing the MLR model:
• Data Pre-processing Steps
• Fitting the MLR model to the training set
• Predicting the result of the test set
• Step-1: Data Pre-processing:
• Importing libraries
• Importing dataset
• Extracting dependent and independent variables
• Encoding dummy variables
• Step-2: Fitting our MLR model to the training set
• Step-3: Prediction of test set results
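• The three steps can be sketched end to end as follows. This is an illustration under assumptions, not the implementation used in the course: the ten rows are hypothetical toy values (not the 50 start-up dataset), the state labels "A" and "B" are made up, and the model is fitted with the normal equations in plain Python so that the block is self-contained. In practice a library such as NumPy or scikit-learn would normally be used.

```python
def one_hot_drop_first(values):
    """Encode a categorical column as dummy variables, dropping the first
    category to avoid the dummy-variable trap."""
    kept = sorted(set(values))[1:]
    return [[1.0 if v == c else 0.0 for c in kept] for v in values], kept


def solve_linear_system(a, b):
    """Solve a * coef = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coef = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * coef[j] for j in range(i + 1, n))
        coef[i] = (m[i][n] - s) / m[i][i]
    return coef


def fit_mlr(features, target):
    """Least-squares coefficients [b0, b1, ..., bn] via the normal equations."""
    x = [[1.0] + row for row in features]  # leading 1 gives the intercept b0
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(x, target)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(coef, features):
    """Apply Y = b0 + b1x1 + ... + bnxn to each row of features."""
    return [coef[0] + sum(b * v for b, v in zip(coef[1:], row)) for row in features]


# Hypothetical rows: (R&D Spend, Administration Spend, Marketing Spend, State, Profit),
# generated from Profit = 10 + 0.8*R&D + 0.1*Admin + 0.3*Marketing + 5*(State == "B").
rows = [
    (10, 20, 30, "A", 29.0), (20, 25, 10, "B", 36.5),
    (30, 15, 40, "A", 47.5), (40, 30, 20, "B", 56.0),
    (50, 10, 50, "A", 66.0), (60, 35, 25, "B", 74.0),
    (70, 22, 60, "A", 86.2), (80, 28, 35, "B", 92.3),
    (55, 18, 45, "B", 74.3), (25, 32, 15, "A", 37.7),
]

# Step 1: extract the variables, encode the State column, split train/test.
numeric = [[float(v) for v in r[:3]] for r in rows]
dummies, kept_states = one_hot_drop_first([r[3] for r in rows])
features = [num + dum for num, dum in zip(numeric, dummies)]
profit = [r[4] for r in rows]
x_train, x_test = features[:8], features[8:]
y_train, y_test = profit[:8], profit[8:]

# Step 2: fit the MLR model to the training set.
coef = fit_mlr(x_train, y_train)

# Step 3: predict the test set results.
y_pred = predict(coef, x_test)
print(coef, y_pred, y_test)
```

• The fitted coefficients show how much the predicted profit changes per unit change in each variable. Comparing them is only meaningful when the features are on comparable scales.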
•Applications of Multiple Linear Regression:
•There are mainly two applications of Multiple Linear
Regression:
•Effectiveness of independent variables on prediction
•Predicting the impact of changes
• Logistic Regression in Machine Learning:
• Logistic regression is a supervised machine learning algorithm mainly used for
classification tasks where the goal is to predict the probability that an instance
belongs to a given class.
• It is a kind of statistical algorithm that analyzes the relationship between a set
of independent variables and a binary dependent variable.
• It is a powerful tool for decision-making, for example deciding whether an email is spam or not.
• Logistic Regression is a significant machine learning algorithm because it can
provide probabilities and classify new data using both continuous and
discrete datasets.
• Logistic Regression can be used to classify observations using different
types of data and can easily determine the most effective variables for
the classification.
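• A minimal sketch of how logistic regression turns inputs into a probability and then a class, using the standard sigmoid (logistic) function. The weights, bias, and the 0.5 threshold are hypothetical values chosen for illustration; in practice the weights are learned from training data.

```python
import math


def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    """Probability that the instance belongs to the positive class."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def classify(features, weights, bias, threshold=0.5):
    """Return 1 (positive class) if the probability reaches the threshold, else 0."""
    return 1 if predict_proba(features, weights, bias) >= threshold else 0


# Hypothetical model with two input features.
weights = [1.5, -2.0]
bias = -0.5

p = predict_proba([1.0, 0.0], weights, bias)     # z = -0.5 + 1.5 = 1.0
label_a = classify([1.0, 0.0], weights, bias)
label_b = classify([0.0, 1.0], weights, bias)    # z = -0.5 - 2.0 = -2.5
print(p, label_a, label_b)
```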
• Classification:
• In machine learning, classification, as the name suggests, sorts data into
different classes or groups. It is used to predict which class the
input data belongs to.
• For example, given a dataset of a cricketer's scores in the past few
matches, along with average, strike rate, not outs, etc., we can classify him as
“in form” or “out of form”.
• Types of Classification
• Binary classification
• Multi-class classification
•Binary Classification
• It is a classification task in which the given data is classified
into two classes. It is basically a prediction about which of two groups
an item belongs to.
• Suppose two emails are sent to you: one by an insurance
company that keeps sending its ads, and the other by your bank
regarding your credit card bill. The email service provider will classify the two
emails: the first one will be sent to the spam folder and the second one will be
kept in the primary one.
• Algorithms commonly used for binary classification:
• Logistic Regression
• k-Nearest Neighbors
• Decision Trees
• Support Vector Machine
• Naive Bayes
• Terms related to binary classification:
• PRECISION:
• Precision in binary classification (Yes/No) is the proportion of observations
predicted as positive that are actually positive.
• RECALL:
• Recall is also known as sensitivity. In binary classification (Yes/No),
recall measures how “sensitive” the classifier is in detecting
positive cases.
•F1 SCORE
•The F1 score is the harmonic mean of precision
and recall, with the best value being 1 and the worst being 0.
Precision and recall make an equal contribution to the F1
score.
•Multiclass Classification
• Multi-class classification is the task of classifying elements into more than two
classes. Unlike binary classification, it is not restricted to two classes.
Examples of multi-class classification are
• classification of news in different categories,
• classifying books according to the subject,
• classifying students according to their streams etc.
• In each of these, the response variable can fall into several different classes,
hence the name multi-class classification.
Parameters | Binary classification | Multi-class classification
No. of classes | It is a classification of two groups, i.e. classifies objects in at most two classes. | There can be any number of classes in it, i.e., classifies the object into more than two classes.
Algorithms used | The most popular algorithms used by the binary classification are: Logistic Regression, k-Nearest Neighbors, Decision Trees, Support Vector Machine, Naive Bayes | Popular algorithms that can be used for multi-class classification include: k-Nearest Neighbors, Decision Trees, Naive Bayes, Random Forest, Gradient Boosting
Examples | Examples of binary classification include: Email spam detection (spam or not), Churn prediction (churn or not), Conversion prediction (buy or not) | Examples of multi-class classification include: Face classification, Plant species classification, Optical character recognition
•Classification Performance:
• To evaluate the performance of a classification model, different metrics are
used, and some of them are as follows:
• Accuracy
• Confusion Matrix
• Precision
• Recall
• F-Score
• AUC(Area Under the Curve)-ROC
•Accuracy:
• The accuracy metric is one of the simplest classification metrics to implement.
It measures how often the classifier predicts correctly, and is defined as the
ratio of the number of correct predictions to the total
number of predictions.
• When a model gives an accuracy rate of 99%, you might think that the model is
performing very well, but this is not always true and can be misleading in
some situations, for example when one class makes up almost all of the data.
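• A short illustration of why a high accuracy can be misleading. The data is hypothetical: 99 negative cases and 1 positive case, scored against a "model" that always predicts the negative class.

```python
def accuracy(actual, predicted):
    """Number of correct predictions divided by the total number of predictions."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


# Hypothetical imbalanced labels: 0 = negative, 1 = positive.
actual = [0] * 99 + [1]
always_negative = [0] * 100   # never detects the positive case

acc = accuracy(actual, always_negative)
positives_found = sum(1 for a, p in zip(actual, always_negative) if a == 1 and p == 1)
print(acc, positives_found)
```

• The accuracy is 99% even though the model finds none of the positive cases, which is why the other metrics in this list are also needed.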
•Confusion Matrix:
•Confusion Matrix is a performance measurement for the machine
learning classification problems where the output can be two or
more classes. It is a table with combinations of predicted and actual
values.
•A confusion matrix is defined as the table that is often used to
describe the performance of a classification model on a set of the
test data for which the true values are known.
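• For the two-class case, the four cells of the confusion matrix can be counted as sketched below. The labels are hypothetical, with 1 taken as the positive class.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1   # predicted positive, actually positive
        elif p == positive:
            fp += 1   # predicted positive, actually negative
        elif a == positive:
            fn += 1   # predicted negative, actually positive
        else:
            tn += 1   # predicted negative, actually negative
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


# Hypothetical true labels and model predictions.
actual = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 1, 0]

counts = confusion_counts(actual, predicted)
print(counts)
```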
•Precision :
•It explains how many of the cases predicted as positive actually turned
out to be positive. Precision is useful in cases where a False
Positive is a higher concern than a False Negative.
•Precision for a label is defined as the number of true positives
divided by the number of predicted positives: Precision = TP / (TP + FP).
•Recall (Sensitivity):
• It explains how many of the actual positive cases we were able to
predict correctly with our model. Recall is a useful metric in cases
where False Negative is of higher concern than False Positive.
•Recall for a label is defined as the number of true positives divided
by the total number of actual positives: Recall = TP / (TP + FN).
•F1 Score :
•It gives a combined idea of the Precision and Recall metrics. For a given
sum of the two, it is maximum when Precision is equal to Recall.
•F1 Score is the harmonic mean of precision and recall: F1 = 2 × Precision × Recall / (Precision + Recall).
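• Precision, recall and F1 score can be computed from the counts of true positives (TP), false positives (FP) and false negatives (FN). A minimal sketch; the counts are hypothetical, and returning 0.0 when a denominator is zero is a convention assumed here.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and their harmonic mean (F1) from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 10 predicted positives (8 correct), 16 actual positives.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print(precision, recall, f1)
```

• Here the precision is 0.8 and the recall is 0.5. The harmonic mean is pulled toward the lower of the two, so the F1 score is lower than their simple average of 0.65.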
•AUC-ROC :
•The Receiver Operating Characteristic (ROC) is a probability curve that
plots the TPR (True Positive Rate) against the FPR (False Positive Rate)
at various threshold values and separates the ‘signal’ from the ‘noise’.
•The Area Under the Curve (AUC) is the measure of the ability of a
classifier to distinguish between classes. On the graph, it is simply
the area enclosed by the curve ABDE and the X and Y axes.
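• A sketch of how the ROC points and the AUC can be computed from predicted scores. The labels and scores are hypothetical, the area is approximated with the trapezoidal rule, and the data is assumed to contain at least one positive and one negative case.

```python
def roc_points(actual, scores):
    """(FPR, TPR) pairs obtained by lowering the threshold through each distinct score."""
    positives = sum(actual)
    negatives = len(actual) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for a, s in zip(actual, scores) if s >= threshold and a == 1)
        fp = sum(1 for a, s in zip(actual, scores) if s >= threshold and a == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical true labels (1 = positive) and predicted scores.
actual = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(actual, scores)
area = auc(points)
print(points, area)
```

• An AUC of 1 means the scores separate the two classes perfectly, while an AUC of 0.5 is what random scoring would give.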