Unit III ML Notes - R22 Regulation
Machine Learning (Jawaharlal Nehru Technological University, Hyderabad)
Decision Tree in Machine Learning
A decision tree is a type of supervised learning algorithm that is commonly used in machine learning to model and predict outcomes based on input data. It is a tree-like structure where each internal node tests an attribute, each branch corresponds to an attribute value, and each leaf node represents the final decision or prediction. Decision trees can be used to solve both regression and classification problems.
Decision Tree Terminologies
There are specialized terms associated with decision trees that denote various components and facets of the tree structure and the decision-making procedure:
       Root Node: The topmost node of a decision tree, representing the original choice or feature from which the tree branches out.
       Internal Nodes (Decision Nodes): Nodes in the tree whose tests are based on the values of particular attributes. These nodes have branches leading to other nodes.
       Leaf Nodes (Terminal Nodes): The ends of branches, where the final decisions or predictions are made. Leaf nodes have no further branches.
       Branches (Edges): Links between nodes that show how
         decisions are made in response to particular
         circumstances.
       Splitting: The process of dividing a node into two or
         more sub-nodes based on a decision criterion. It involves
         selecting a feature and a threshold to create subsets of
         data.
       Parent Node: A node that is split into child nodes. The
         original node from which a split originates.
       Child Node: Nodes created as a result of a split from a
         parent node.
       Decision Criterion: The rule or condition used to
         determine how the data should be split at a decision
         node. It involves comparing feature values against a
         threshold.
       Pruning: The process of removing branches or nodes
         from a decision tree to improve its generalisation and
         prevent overfitting.
Understanding these terminologies is crucial for interpreting and
working with decision trees in machine learning applications.
How is a Decision Tree Formed?
The process of forming a decision tree involves recursively
partitioning the data based on the values of different attributes.
The algorithm selects the best attribute to split the data at each
internal node, based on certain criteria such as information gain
or Gini impurity. This splitting process continues until a stopping
criterion is met, such as reaching a maximum depth or having a
minimum number of instances in a leaf node.
Why Decision Tree?
Decision trees are widely used in machine learning for a number
of reasons:
       Decision trees are interpretable and versatile, which makes them well suited to simulating intricate decision-making processes.
       Their hierarchical structure allows them to portray complex choice scenarios that take into account a variety of causes and outcomes.
       Because they provide comprehensible insights into the decision logic, decision trees are especially helpful for classification and regression tasks.
       They are proficient with both numerical and categorical data, and they can easily adapt to a variety of datasets thanks to their built-in feature selection capability.
       Decision trees also provide simple visualization, which helps in comprehending and explaining the underlying decision process of a model.
Decision Tree Approach
A decision tree uses a tree representation to solve the problem: each leaf node corresponds to a class label, and attributes are tested at the internal nodes of the tree. Any boolean function on discrete attributes can be represented using a decision tree.
Constructing Decision Trees
A decision tree has two kinds of nodes: 1. Decision Node 2. Leaf Node
(In the accompanying figure, D1 to D6 are the data subsets produced by the splits.)
Learning with Trees – Decision Trees – Constructing Decision
Trees
Step 1: Given the data set, choose the target attribute.
Step 2: Calculate the information gain using entropy.
Step 3: With P positive and N negative examples of the target attribute, the information gain (IG) of the collection is
IG(P, N) = -\frac{P}{P+N}\log_2\left(\frac{P}{P+N}\right) - \frac{N}{P+N}\log_2\left(\frac{N}{P+N}\right)
For each attribute A, find the entropy E(A), where
E(A) = \sum_i \frac{P_i + N_i}{P + N}\, IG(P_i, N_i)
i.e. the entropy of each branch (IG) weighted by the probability of that branch.
Step 4: Calculate Gain(A) = IG − E(A); the attribute with the highest gain is chosen for the split.
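To make these steps concrete, here is a minimal Python sketch of the entropy and information-gain calculation (the toy attribute values and labels below are made up for illustration):

# A minimal sketch of the attribute-selection steps described above.
import math
from collections import Counter

def entropy(labels):
    """IG(P, N): entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values() if c > 0)

def information_gain(rows, attr_index, labels):
    """Gain(A) = IG(P, N) - E(A) for the attribute at attr_index."""
    total = len(labels)
    # Group the labels by the attribute's value (one branch per value).
    branches = {}
    for row, label in zip(rows, labels):
        branches.setdefault(row[attr_index], []).append(label)
    expected = sum((len(b) / total) * entropy(b) for b in branches.values())
    return entropy(labels) - expected

# Toy "play tennis"-style data: [outlook, windy]
rows = [["sunny", "no"], ["sunny", "yes"], ["rain", "no"], ["rain", "yes"]]
labels = ["no", "no", "yes", "yes"]
print(information_gain(rows, 0, labels))  # splitting on outlook separates the classes perfectly
print(information_gain(rows, 1, labels))  # splitting on windy gives no information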
CART (Classification And Regression Tree) for Decision Tree
CART is a predictive algorithm used in machine learning; it describes how the target variable's values can be predicted from the other variables. It is a decision tree in which each fork is a split on a predictor variable and each leaf node holds a prediction for the target variable.
The term CART serves as a generic term for the following categories of decision trees:
       Classification Trees: The tree is used to determine which “class” the target variable is most likely to fall into when the target is categorical.
       Regression Trees: These are used to predict a continuous variable’s value.
In the decision tree, nodes are split into sub-nodes based on a
threshold value of an attribute. The root node is taken as the
training set and is split into two by considering the best attribute
and threshold value. Further, the subsets are also split using the
same logic. This continues until the subsets are pure or until the maximum number of leaves allowed in the growing tree is reached.
CART Algorithm
Classification and Regression Trees (CART) is a decision tree
algorithm that is used for both classification and regression
tasks. It is a supervised learning algorithm that learns from
labelled data to predict unseen data.
       Tree structure: CART builds a tree-like structure
         consisting of nodes and branches. The nodes represent
         different decision points, and the branches represent the
         possible outcomes of those decisions. The leaf nodes in
         the tree contain a predicted class label or value for the
         target variable.
       Splitting criteria: CART uses a greedy approach to split
         the data at each node. It evaluates all possible splits and
         selects the one that best reduces the impurity of the
resulting subsets. For classification tasks, CART uses Gini impurity as the splitting criterion; the lower the Gini impurity, the purer the subset. For regression tasks, CART uses reduction in residual (squared) error as the splitting criterion; the greater the reduction in residual error, the better the split fits the data.
       Pruning: To prevent overfitting of the data, pruning is a
         technique used to remove the nodes that contribute little
         to the model accuracy. Cost complexity pruning and
         information gain pruning are two popular pruning
techniques. Cost complexity pruning weighs each subtree's accuracy against its complexity and removes subtrees whose added complexity is not justified by their contribution to accuracy. Information gain pruning involves calculating the information gain of each node and removing nodes with low information gain.
How does the CART algorithm work?
The CART algorithm works via the following process:
       The best-split point of each input is obtained.
         Based on the best-split points of each input in Step 1, the
          new “best” split point is identified.
         Split the chosen input according to the “best” split point.
         Continue splitting until a stopping rule is satisfied or no
          further desirable splitting is available.
The CART algorithm uses Gini impurity to split the dataset into a decision tree. It does so by searching for the split that yields the best homogeneity of the sub-nodes, with the help of the Gini index criterion.
Gini index/Gini impurity
The Gini index is the metric used for classification tasks in CART. It is based on the sum of squared class probabilities: it measures the probability that a specific element would be wrongly classified if it were labelled at random according to the class distribution, and it is a variation of the Gini coefficient. It works on categorical variables, provides outcomes of either “success” or “failure”, and hence conducts binary splitting only.
The degree of the Gini index varies from 0 to 1,
       Where 0 indicates that all the elements belong to a single class, i.e. only one class exists there.
       Gini index close to 1 means a high level of impurity,
         where each class contains a very small fraction of
         elements, and
       A value of 1-1/n occurs when the elements are uniformly
         distributed into n classes and each class has an equal
         probability of 1/n. For example, with two classes, the Gini
         impurity is 1 – 1/2 = 0.5.
Mathematically, we can write Gini Impurity as follows:
Gini = 1 - \sum_{i=1}^{n} p_i^2
where p_i is the probability of an object being classified to a
particular class.
In conclusion, Gini impurity is the probability of misclassification,
assuming independent selection of the element and its class
based on the class probabilities.
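A small Python sketch of the Gini impurity formula above (the label lists are made-up examples):

from collections import Counter

def gini_impurity(labels):
    """Gini = 1 - sum(p_i^2) over the class probabilities p_i."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

print(gini_impurity(["yes", "yes", "no", "no"]))    # 0.5  (maximum impurity for two classes)
print(gini_impurity(["yes", "yes", "yes", "yes"]))  # 0.0  (pure node)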
CART for Classification
A classification tree is an algorithm where the target variable is
categorical. The algorithm is then used to identify the “Class”
within which the target variable is most likely to fall.
Classification trees are used when the dataset needs to be split into classes that belong to the response variable (like yes or no). CART for classification is a decision tree learning algorithm that creates a tree-like structure to predict class labels. The tree consists of nodes, which represent different decision points, and branches, which represent the possible results of those decisions. Predicted class labels are stored at each leaf node of the tree.
How Does CART for Classification Work?
CART for classification works by recursively splitting the training
data into smaller and smaller subsets based on certain criteria.
The goal is to split the data in a way that minimizes the impurity
within each subset. Impurity is a measure of how mixed up the
data is in a particular subset. For classification tasks, CART uses
Gini impurity
       Gini Impurity- Gini impurity measures the probability of
         misclassifying a random instance from a subset labeled
         according to the majority class. Lower Gini impurity
         means more purity of the subset.
       Splitting Criteria- The CART algorithm evaluates all
         potential splits at every node and chooses the one that
         best decreases the Gini impurity of the resultant subsets.
         This process continues until a stopping criterion is
         reached, like a maximum tree depth or a minimum
         number of instances in a leaf node.
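As an illustration, the following sketch uses scikit-learn's DecisionTreeClassifier, which implements CART-style classification with Gini impurity (assuming scikit-learn is installed; the dataset and hyperparameters are arbitrary choices):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# criterion="gini" selects splits that minimize Gini impurity, as described above;
# max_depth acts as a stopping criterion.
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))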
CART for Regression
A Regression tree is an algorithm where the target variable is
continuous and the tree is used to predict its value. Regression
trees are used when the response variable is continuous. For
example, if the response variable is the temperature of the day.
CART for regression is a decision tree learning method that
creates a tree-like structure to predict continuous target
variables. The tree consists of nodes that represent different
decision points and branches that represent the possible
outcomes of those decisions. Predicted values for the target
variable are stored in each leaf node of the tree.
How Does CART Work for Regression?
Regression CART works by splitting the training data recursively into smaller subsets based on specific criteria. The objective is to split the data in a way that minimizes the residual error within each subset.
         Residual Reduction- Residual reduction is a measure of how much the average squared difference between the predicted values and the actual values for the target variable is reduced by splitting the subset. The greater the residual reduction, the better the model fits the data.
        Splitting Criteria- CART evaluates every possible split
         at each node and selects the one that results in the
         greatest reduction of residual error in the resulting
         subsets. This process is repeated until a stopping
         criterion is met, such as reaching the maximum tree
         depth or having too few instances in a leaf node.
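A corresponding sketch for regression with scikit-learn's DecisionTreeRegressor (a minimal example on synthetic data; the criterion name "squared_error" assumes a recent scikit-learn version and may differ in older releases):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy 1-D regression data (illustrative only): y = sin(x) plus noise.
rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(80)

# criterion="squared_error" chooses splits that maximize the reduction in squared residuals.
reg = DecisionTreeRegressor(criterion="squared_error", max_depth=3, random_state=0)
reg.fit(X, y)
print(reg.predict([[2.5]]))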
Pseudo-code of the CART algorithm
d = 0, endtree = 0
Node(0) = 1, Node(1) = 0, Node(2) = 0
while endtree < 1
    if Node(2^d − 1) + Node(2^d) + ... + Node(2^(d+1) − 2) = 2 − 2^(d+1) then
        endtree = 1
    else
        for i = 2^d − 1, 2^d, ..., 2^(d+1) − 2
            if Node(i) > −1 then Split tree
            else Node(2i+1) = −1; Node(2i+2) = −1
            end if
        end for
    end if
    d = d + 1
end while
CART model representation
CART models are formed by picking input variables and
evaluating split points on those variables until an appropriate
tree is produced.
Steps to create a Decision Tree using the CART algorithm:
       Greedy algorithm: The input space is divided using a greedy method known as recursive binary splitting. This is a numerical procedure in which all the values are lined up and several candidate split points are tried and assessed using a cost function.
       Stopping Criterion: As it works its way down the tree
         with the training data, the recursive binary splitting
         method described above must know when to stop
         splitting. The most frequent halting method is to utilize a
         minimum amount of training data allocated to every leaf
node. If the count is smaller than the specified threshold, the split is rejected and the node is taken as the final leaf node.
       Tree pruning: A decision tree’s complexity is defined as the number of splits in the tree. Trees with fewer branches are recommended because they are simpler to grasp and less prone to overfit the data. The quickest and simplest pruning approach is to work through each leaf node in the tree and evaluate the effect of removing it using a hold-out test set.
     Data preparation for the CART: No special data
       preparation is required for the CART algorithm.
POPULAR CART-BASED ALGORITHMS:
     CART (Classification and Regression Trees): The
       original algorithm that uses binary splits to build decision
       trees.
      C4.5 and C5.0: Related decision-tree algorithms (successors of ID3) that allow for multiway splits and handle categorical variables more effectively.
     Random Forests: Ensemble methods that use multiple
       decision trees (often CART) to improve predictive
       performance and reduce overfitting.
     Gradient Boosting Machines (GBM): Boosting
       algorithms that also use decision trees (often CART) as
       base learners, sequentially improving model
       performance.
Advantages of CART
      Results are simple to understand and interpret.
      Classification and regression trees are nonparametric and nonlinear.
      Classification and regression trees implicitly perform feature selection.
      Outliers have little effect on CART.
      It requires minimal supervision and produces easy-to-understand models.
Limitations of CART
      Overfitting.
      High variance.
      Low bias.
      The tree structure may be unstable; small changes in the data can produce a different tree.
Applications of the CART algorithm
     For quick Data insights.
     In Blood Donors Classification.
     For environmental and ecological data.
     In the financial sectors.
What is Ensemble Learning with examples?
      Ensemble learning is a machine learning technique that
       combines the predictions from multiple individual models
       to obtain a better predictive performance than any single
       model. The basic idea behind ensemble learning is to
       leverage the wisdom of the crowd by aggregating the
       predictions of multiple models, each of which may have
       its own strengths and weaknesses. This can lead to
       improved performance and generalization.
       Ensemble learning can be thought of as compensating for the weaknesses of individual learners. Ensembles are computationally more expensive than a single model, but they are often more effective than a single non-ensemble model that has undergone extensive training. Below we give an overview of why ensemble learning matters and how it works, the different types of ensemble classifiers, advanced ensemble learning techniques, some representative algorithms (such as random forest and XGBoost), and finally their practical uses.
      Several individual base models (experts) are fitted to
       learn from the same data and produce an aggregation of
       output based on which a final decision is taken. These
       base models can be machine learning algorithms such as
       decision trees (mostly used), linear models, support
       vector machines (SVM), neural networks, or any other
       model that is capable of making predictions.
      Most commonly used ensembles include techniques such
       as Bagging- used to generate Random Forest algorithms
       and Boosting- to generate algorithms such
       as Adaboost, Xgboost etc.
Ensemble Learning Techniques
      Gradient Boosting Machines (GBM): Gradient
       Boosting is a popular ensemble learning technique that
       sequentially builds a group of decision trees and corrects
       the residual errors made by previous trees, enhancing its
       predictive accuracy. It trains each new weak learner to fit
       the residuals of the previous ensemble’s predictions thus
       making it less sensitive to individual data points or
       outliers in the data.
      Extreme Gradient Boosting (XGBoost): XGBoost
       features tree pruning, regularization, and parallel
       processing, which makes it a preferred choice for data
       scientists seeking robust and accurate predictive models.
       CatBoost: It is designed to handle categorical features natively, which eliminates the need for extensive pre-processing. CatBoost is known for its high predictive accuracy, fast training, and automatic handling of overfitting.
       Stacking: It combines the output of multiple base models by training a combiner (an algorithm that takes the predictions of the base models as input) to generate a more accurate prediction. Stacking allows for more flexibility in combining diverse models, and the combiner can be any machine learning algorithm.
      Random Subspace Method (Random Subspace
        Ensembles): It is an ensemble learning approach that
        improves the predictive accuracy by training base
        models on random subsets of input features. It mitigates
        overfitting and improves the generalization by
        introducing diversity in the model space.
      Random Forest Variants: They introduce variations in
        tree construction, feature selection, or model
        optimization to enhance performance.
Selecting the right advanced ensemble technique depends on
the nature of the data, the specific problem trying to be solved,
and the computational resources available. It often requires
experimentation and changes to achieve the best results.
Algorithm based on Bagging and Boosting
Bagging Algorithm
Bagging is a supervised learning technique that can be used for both regression and classification tasks. Here is an overview of the steps of the Bagging classifier algorithm (a code sketch follows the list):
       Bootstrap Sampling: Creates ‘N’ training subsets by randomly sampling rows from the original training data with replacement, so a subset may contain repeated rows and omit others. This step ensures that the base models are trained on diverse subsets of the data.
      Base Model Training: For each bootstrapped sample,
        train a base model independently on that subset of data.
        These weak models are trained in parallel to increase
        computational efficiency and reduce time consumption.
      Prediction Aggregation: To make a prediction on
        testing data combine the predictions of all base models.
        For classification tasks, it can include majority voting or
        weighted majority while for regression, it involves
        averaging the predictions.
        Out-of-Bag (OOB) Evaluation: Some samples are
         excluded from the training subset of particular base
         models during the bootstrapping method. These “out-of-
         bag” samples can be used to estimate the model’s
         performance without the need for cross-validation.
       Final Prediction: After aggregating the predictions from
         all the base models, Bagging produces a final prediction
         for each instance.
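A minimal scikit-learn sketch of the bagging steps above (illustrative dataset and settings; note that in older scikit-learn versions the estimator parameter is named base_estimator):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 50 decision trees, each trained on a bootstrap sample of the training data.
# oob_score=True evaluates the ensemble on its out-of-bag samples.
bag = BaggingClassifier(estimator=DecisionTreeClassifier(),
                        n_estimators=50, oob_score=True, random_state=0)
bag.fit(X_train, y_train)
print("OOB score:", bag.oob_score_)
print("test accuracy:", bag.score(X_test, y_test))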
Boosting Algorithm
Boosting is an ensemble technique that combines multiple weak learners to create a strong learner. The weak models are trained in series, so that each subsequent model tries to correct the errors of the previous ones until the training data is fit well enough or a set number of learners has been added. One of the most well-known boosting algorithms is AdaBoost (Adaptive Boosting). Here are a few popular boosting algorithm frameworks (a code sketch follows the list):
       AdaBoost (Adaptive Boosting): AdaBoost assigns
         different weights to data points, focusing on challenging
         examples in each iteration. It combines weighted weak
         classifiers to make predictions.
       Gradient Boosting: Gradient Boosting, including
         algorithms like Gradient Boosting Machines (GBM),
         XGBoost, and LightGBM, optimizes a loss function by
         training a sequence of weak learners to minimize the
         residuals between predictions and actual values,
         producing strong predictive models.
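A short scikit-learn sketch of boosting using AdaBoost (illustrative dataset and settings; the default base learners are shallow decision trees):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each stage re-weights the training points so later learners focus on the
# examples the earlier ones misclassified, as described above.
boost = AdaBoostClassifier(n_estimators=100, random_state=0)
boost.fit(X_train, y_train)
print("test accuracy:", boost.score(X_test, y_test))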
Uses of Ensemble Learning
Ensemble learning is a versatile approach that can be applied to
a wide range of machine learning problems such as:-
      Classification and Regression: Ensemble techniques are applied to classification and regression problems in various domains, including finance, healthcare, marketing, and more.
      Anomaly Detection: Ensembles can be used to detect
        anomalies in datasets by combining multiple anomaly
        detection algorithms, thus making it more robust.
      Portfolio Optimization: Ensembles can be employed to
        optimize investment portfolios by collecting predictions
        from various models to make better investment
        decisions.
      Customer Churn Prediction: In business and
        marketing analytics, by combining the results of various
        models capturing different aspects of customer
         behaviour, ensembles can be used to predict customer
         churn.
       Medical Diagnostics: In healthcare, ensembles can be
         used to make more accurate predictions of diseases
         based on various medical data sources and diagnostic
         models.
       Credit Scoring: Ensembles can be used to improve the
         accuracy of credit scoring models by combining the
         outputs of various credit risk assessment models.
       Climate Prediction: Ensembles of climate models help
         in making more accurate and reliable predictions for
         weather forecasting, climate change projections, and
         related environmental studies.
       Time Series Forecasting: Ensemble learning combines
         multiple time series forecasting models to enhance
         accuracy and reliability, adapting to changing temporal
         patterns.
What is Statistics?
Statistics is the science of collecting, organizing, analyzing,
interpreting, and presenting data. It encompasses a wide range
of techniques for summarizing data, making inferences, and
drawing conclusions.
Statistical methods help quantify uncertainty and variability in
data, allowing researchers and analysts to make data-driven
decisions with confidence.
         Types of Statistics
         There are commonly two types of statistics, which are
         discussed below:
   • Descriptive Statistics: "Descriptive Statistics " helps us
      simplify and organize big chunks of data. This makes large
      amounts of data easier to understand.
   • Inferential Statistics: "Inferential Statistics " is a little
      different. It uses smaller data to draw conclusions about a
      larger group. It helps us predict and draw conclusions about
      a population.
   • Mean is calculated by summing all values present in the sample divided by the total number of values present in the sample.
     Mean (μ) = (Sum of Values) / (Number of Values)
   • Median is the middle value of a sample when arranged from lowest to highest or highest to lowest; in order to find the median, the data must be sorted.
     For an odd number of data points: Median = the ((n+1)/2)th value
     For an even number of data points: Median = the average of the (n/2)th value and the next value
    • Mode is the most frequently occurring value in the dataset.
Measures of Dispersion
   •   Range: The difference between the maximum and minimum values.
   •   Variance: The average squared deviation from the mean, measuring how data points differ from the mean (data spread).
   •   Standard Deviation: The square root of the variance, indicating the average distance of the data from the mean.
   •   Interquartile Range: The range between the first and third quartiles, measuring data spread around the median.
Measures of Shape
   •   Skewness: Indicates data asymmetry.
   •   Kurtosis: Measures the peakedness of the data distribution.
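A short Python sketch of these descriptive statistics, assuming NumPy and SciPy are available (the sample values are made up, and the mode call's keyword arguments vary across SciPy versions):

import numpy as np
from scipy import stats

sample = np.array([2, 4, 4, 4, 5, 5, 7, 9])

print("mean:", np.mean(sample))
print("median:", np.median(sample))
print("mode:", stats.mode(sample, keepdims=False).mode)
print("range:", np.ptp(sample))              # max - min
print("variance:", np.var(sample, ddof=1))   # sample variance
print("std dev:", np.std(sample, ddof=1))
print("IQR:", stats.iqr(sample))
print("skewness:", stats.skew(sample))
print("kurtosis:", stats.kurtosis(sample))   # excess kurtosis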
Inferential Statistics
Inferential statistics is used to make predictions by taking any group of data in which you are
interested. It can be defined as a random sample of data taken from a population to describe and
make inferences about the population. Any group of data that includes all the data you are
interested in is known as population. It basically allows you to make predictions by taking a small
sample instead of working on the whole population.
Population and Sample
    •   Population: The entire group being studied.
    •   Sample: A subset of the population used for analysis.
Estimation
    •   Point Estimation: Provides a single value estimate of a population parameter.
    •   Interval Estimation: Offers a range of values (confidence interval) within which the
        parameter likely lies.
    •   Confidence Intervals: Indicate the reliability of an estimate.
Hypothesis Testing
    •   Null and Alternative Hypotheses: The null hypothesis assumes no effect or relationship, while
        the alternative suggests otherwise.
    •   Type I and Type II Errors: A Type I error is rejecting a true null hypothesis, while a Type II error is failing to reject a false null hypothesis.
    •   p-Values: Measure the probability of obtaining the observed results under the null
        hypothesis.
    •   t-Tests and z-Tests: Compare means to assess statistical significance.
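A minimal sketch of a two-sample t-test with SciPy (the two groups are synthetic, made-up data):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=5.0, scale=1.0, size=30)
group_b = rng.normal(loc=5.5, scale=1.0, size=30)

# Null hypothesis: the two groups have equal means.
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print("t =", t_stat, "p =", p_value)
if p_value < 0.05:
    print("Reject the null hypothesis at the 5% level.")
else:
    print("Fail to reject the null hypothesis.")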
ANOVA (Analysis of Variance):
Compares means across multiple groups to determine if they differ significantly.
Chi-Square Tests:
Assess the association between categorical variables.
Correlation and Regression:
Understanding relationships between variables is critical in machine learning.
Correlation
    •   Pearson Correlation Coefficient: Measures linear relationship strength between two
        variables.
    •   Spearman Rank Correlation: Assesses the strength and direction of the monotonic
        relationship between variables.
Regression Analysis
    •   Simple Linear Regression: Models the relationship between two variables.
    •   Multiple Linear Regression: Extends to multiple predictors.
    •   Assumptions of Linear Regression: Linearity, independence, homoscedasticity, normality.
    •   Interpretation of Regression Coefficients: Explains predictor influence on the response
        variable.
    •   Model Evaluation Metrics: R-squared, Adjusted R-squared, RMSE.
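A compact sketch of Pearson correlation and simple linear regression with SciPy (the x and y values are made up):

import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])

r, _ = stats.pearsonr(x, y)          # strength of the linear relationship
result = stats.linregress(x, y)      # simple linear regression y = slope*x + intercept
print("Pearson r:", r)
print("slope:", result.slope, "intercept:", result.intercept)
print("R-squared:", result.rvalue ** 2)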
Bayesian Statistics
Bayesian statistics incorporate prior knowledge with current evidence to update beliefs.
Bayes' Theorem is a fundamental concept in probability theory that relates conditional probabilities.
It is named after the Reverend Thomas Bayes, who first introduced the theorem. Bayes' Theorem is a
mathematical formula that provides a way to update probabilities based on new evidence. The
formula is as follows:
P(A∣B) = \frac{P(B∣A) \cdot P(A)}{P(B)}, where
    •   P(A∣B): The probability of event A given that event B has occurred (posterior probability).
    •   P(B∣A): The probability of event B given that event A has occurred (likelihood).
    •   P(A): The probability of event A occurring (prior probability).
    •   P(B): The probability of event B occurring.
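A tiny worked example of Bayes' Theorem in Python; the disease-testing probabilities below are made-up illustrative numbers:

# P(disease) prior, P(positive | disease) likelihood, P(positive | no disease) false-positive rate
p_disease = 0.01
p_pos_given_disease = 0.95
p_pos_given_healthy = 0.05

# P(positive) by the law of total probability
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Posterior: P(disease | positive) = P(positive | disease) * P(disease) / P(positive)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 4))  # about 0.16, despite the seemingly accurate test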
Applications of Statistics in Machine Learning
Statistics is a key component of machine learning, with broad applicability in various fields.
    •   Feature engineering relies heavily on statistics to convert raw features into meaningful predictors for machine learning algorithms.
    •   In image processing tasks like object recognition and segmentation, statistical descriptors summarize the shape and structure of objects in images.
    •   Anomaly detection and quality control benefit from statistics by identifying deviations from
        norms, aiding in the detection of defects in manufacturing processes.
    •   Environmental observation and geospatial mapping leverage statistical analysis to monitor
        land cover patterns and ecological trends effectively.
Gaussian Mixture Models
Normal or Gaussian Distribution
In real life, many datasets can be modeled by Gaussian
Distribution (Univariate or Multivariate). So it is quite natural and
intuitive to assume that the clusters come from different
Gaussian Distributions. In other words, the model tries to represent the dataset as a mixture of several Gaussian Distributions. This is the core idea of this model.
In one dimension the probability density function of a Gaussian Distribution is given by
f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
where \mu and \sigma^2 are respectively the mean and variance of the distribution. For a multivariate (let us say d-variate) Gaussian Distribution, the probability density function is given by
f(x) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)\right)
Here µ is a d-dimensional vector denoting the mean of the distribution and Σ is the d × d covariance matrix.
In probability theory and statistics, a normal distribution or Gaussian
distribution is a type of continuous probability distribution for a real-valued random
variable. The general form of its probability density function is given above. A random variable with a Gaussian distribution is said to be normally distributed, and is called a normal deviate.
Normal distributions are important in statistics and are often used in
the natural and social sciences to represent real-valued random variables whose
distributions are not known. Their importance is partly due to the central limit
theorem. It states that, under some conditions, the average of many samples
(observations) of a random variable with finite mean and variance is itself a random
variable—whose distribution converges to a normal distribution as the number of
samples increases. Therefore, physical quantities that are expected to be the sum of
many independent processes, such as measurement errors, often have distributions
that are nearly normal
Moreover, Gaussian distributions have some unique properties that are valuable in
analytic studies. For instance, any linear combination of a fixed collection of
independent normal deviates is a normal deviate. Many results and methods, such
as propagation of uncertainty and least squares[5] parameter fitting, can be derived
analytically in explicit form when the relevant variables are normally distributed.
A normal distribution is sometimes informally called a bell curve. However, many
other distributions are bell-shaped (such as the Cauchy, Student's t,
and logistic distributions).
The univariate probability distribution is generalized for vectors in the multivariate
normal distribution and for matrices in the matrix normal distribution.
In real-world machine learning applications, it is common to have
many relevant features, but only a subset of them may be
observable. When dealing with variables that are sometimes
observable and sometimes not, it is indeed possible to utilize the
instances when that variable is visible or observed in order to
learn and make predictions for the instances where it is not
observable. This approach is often referred to as handling
missing data. By using the available instances where the variable
is observable, machine learning algorithms can learn patterns
and relationships from the observed data. These learned
patterns can then be used to predict the values of the variable in
instances where it is missing or not observable.
The expectation-Maximization algorithm can be used to handle
situations where variables are partially observable. When certain
variables are observable, we can use those instances to learn
and estimate their values. Then, we can predict the values of
these variables in instances when it is not observable.
Expectation-Maximization (EM) Algorithm
The Expectation-Maximization (EM) algorithm is an iterative
optimization method that combines
different unsupervised machine learning algorithms to find
maximum likelihood or maximum posterior estimates of
parameters in statistical models that involve unobserved latent
variables. The EM algorithm is commonly used for latent variable
models and can handle missing data. It consists of an estimation
step (E-step) and a maximization step (M-step), forming an
iterative process to improve model fit.
         In the E-step, the algorithm estimates the latent variables, i.e. computes the expectation of the log-likelihood using the current parameter estimates.
        In the M step, the algorithm determines the parameters
         that maximize the expected log-likelihood obtained in the
         E step, and corresponding model parameters are
         updated based on the estimated latent variables.
Key Terms in Expectation-Maximization (EM) Algorithm
Some of the most commonly used key terms in the Expectation-
Maximization (EM) Algorithm are as follows:
     Latent Variables: Latent variables are unobserved
       variables in statistical models that can only be inferred
       indirectly through their effects on observable variables.
       They cannot be directly measured but can be detected
       by their impact on the observable variables.
     Likelihood: It is the probability of observing the given
       data given the parameters of the model. In the EM
       algorithm, the goal is to find the parameters that
       maximize the likelihood.
     Log-Likelihood: It is the logarithm of the likelihood
       function, which measures the goodness of fit between
       the observed data and the model. EM algorithm seeks to
       maximize the log-likelihood.
     Maximum Likelihood Estimation (MLE): MLE is a
       method to estimate the parameters of a statistical model
       by finding the parameter values that maximize the
       likelihood function, which measures how well the model
       explains the observed data.
     Posterior Probability: In the context of Bayesian
       inference, the EM algorithm can be extended to estimate
        the maximum a posteriori (MAP) estimates, where the
        posterior probability of the parameters is calculated
        based on the prior distribution and the likelihood
        function.
      Expectation (E) Step: The E-step of the EM algorithm
        computes the expected value or posterior probability of
        the latent variables given the observed data and current
        parameter estimates. It involves calculating the
        probabilities of each latent variable for each data point.
      Maximization (M) Step: The M-step of the EM
        algorithm updates the parameter estimates by
        maximizing the expected log-likelihood obtained from
        the E-step. It involves finding the parameter values that
        optimize the likelihood function, typically through
        numerical optimization methods.
      Convergence: Convergence refers to the condition
        when the EM algorithm has reached a stable solution. It
        is typically determined by checking if the change in the
        log-likelihood or the parameter estimates falls below a
        predefined threshold.
How Expectation-Maximization (EM) Algorithm Works:
The essence of the Expectation-Maximization algorithm is to use
the available observed data of the dataset to estimate the
missing data and then use that data to update the values of the
parameters. Let us understand the EM algorithm in detail.
                            EM Algorithm Flowchart
1. Initialization:
          Initially, a set of initial values of the parameters
            are considered. A set of incomplete observed
            data is given to the system with the assumption
            that the observed data comes from a specific
            model.
2. E-Step (Expectation Step): In this step, we use the
   observed data in order to estimate or guess the values of
   the missing or incomplete data. It is basically used to
   update the variables.
          Compute the posterior probability or
            responsibility of each latent variable given the
            observed data and current parameter estimates.
          Estimate the missing or incomplete data values
            using the current parameter estimates.
          Compute the log-likelihood of the observed data
            based on the current parameter estimates and
            estimated missing data.
3. M-step (Maximization Step): In this step, we use the
   complete data generated in the preceding “Expectation”
   – step in order to update the values of the parameters. It
   is basically used to update the hypothesis.
              Update the parameters of the model by
               maximizing the expected complete data log-
               likelihood obtained from the E-step.
              This typically involves solving optimization
               problems to find the parameter values that
               maximize the log-likelihood.
              The specific optimization technique used
               depends on the nature of the problem and the
               model being used.
    4. Convergence: In this step, it is checked whether the
       values are converging or not, if yes, then stop otherwise
       repeat step-2 and step-3 i.e. “Expectation” – step and
       “Maximization” – step until the convergence occurs.
              Check for convergence by comparing the change
               in log-likelihood or the parameter values between
               iterations.
              If the change is below a predefined threshold,
               stop and consider the algorithm converged.
              Otherwise, go back to the E-step and repeat the
               process until convergence is achieved.
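To make the E-step/M-step loop concrete, here is a minimal NumPy sketch of EM for a two-component one-dimensional Gaussian mixture (the synthetic data, initial values, and convergence threshold are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
# Synthetic data drawn from two Gaussians (the component label is the unobserved latent variable).
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.5, 200)])

def gaussian_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# 1. Initialization: arbitrary starting weights, means, and variances.
w = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])

prev_ll = -np.inf
for iteration in range(200):
    # 2. E-step: responsibility (posterior probability) of each component for each point.
    dens = np.vstack([w[k] * gaussian_pdf(x, mu[k], var[k]) for k in range(2)])  # shape (2, n)
    resp = dens / dens.sum(axis=0, keepdims=True)

    # 3. M-step: update parameters to maximize the expected log-likelihood.
    nk = resp.sum(axis=1)
    w = nk / len(x)
    mu = (resp * x).sum(axis=1) / nk
    var = (resp * (x - mu[:, None]) ** 2).sum(axis=1) / nk

    # 4. Convergence check on the log-likelihood.
    ll = np.log(dens.sum(axis=0)).sum()
    if abs(ll - prev_ll) < 1e-6:
        break
    prev_ll = ll

print("weights:", w, "means:", mu, "variances:", var)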
Applications of the EM algorithm
     It can be used to fill in the missing data in a sample.
     It can be used as the basis of unsupervised learning of
       clusters.
     It can be used for the purpose of estimating the
       parameters of the Hidden Markov Model (HMM).
     It can be used for discovering the values of latent
       variables.
Advantages of EM algorithm
     It is always guaranteed that likelihood will increase with
       each iteration.
     The E-step and M-step are often pretty easy for many
       problems in terms of implementation.
     Solutions to the M-steps often exist in the closed form.
Disadvantages of EM algorithm
     It has slow convergence.
     It makes convergence to the local optima only.
     It requires both the probabilities, forward and backward
       (numerical optimization requires only forward
       probability).
What is the K-Nearest Neighbors
Algorithm?
KNN is one of the most basic yet essential classification
algorithms in machine learning. It belongs to the supervised
learning domain and finds intense application in pattern
recognition, data mining, and intrusion detection.
It is widely used in real-life scenarios since it is non-parametric, meaning it does not make any underlying assumptions about the distribution of data (as opposed to other algorithms such as GMM, which assume a Gaussian distribution of the given data). We are given some prior data (also called training data), which classifies coordinates into groups identified by an attribute.
     Why do we need a KNN algorithm?
The K-NN algorithm is a versatile and widely used machine learning
algorithm that is primarily used for its simplicity and ease of
implementation. It does not require any assumptions about the
underlying data distribution. It can also handle both numerical
and categorical data, making it a flexible choice for various types
of datasets in classification and regression tasks. It is a non-
parametric method that makes predictions based on the
similarity of data points in a given dataset. K-NN is less sensitive
to outliers compared to other algorithms.
The K-NN algorithm works by finding the K nearest neighbors to
a given data point based on a distance metric, such as Euclidean
distance. The class or value of the data point is then determined
by the majority vote or average of the K neighbors. This
approach allows the algorithm to adapt to different patterns and
make predictions based on the local structure of the data.
Distance Metrics Used in KNN Algorithm
As we know that the KNN algorithm helps us identify the nearest
points or the groups for a query point. But to determine the
closest groups or the nearest points for a query point we need
some metric. For this purpose, we use below distance metrics:
Euclidean Distance
This is nothing but the cartesian distance between the two points
which are in the plane/hyperplane. Euclidean distance can also
be visualized as the length of the straight line that joins the two
points which are into consideration. This metric helps us
calculate the net displacement done between the two states of
an object.
\text{distance}(x, X_i) = \sqrt{\sum_{j=1}^{d} (x_j - X_{ij})^2}
Manhattan Distance
Manhattan Distance metric is generally used when we are
interested in the total distance traveled by the object instead of
the displacement. This metric is calculated by summing the
absolute difference between the coordinates of the points in n-
dimensions.
d(x, y) = \sum_{i=1}^{n} |x_i - y_i|
Minkowski Distance
We can say that the Euclidean, as well as the Manhattan
distance, are special cases of the Minkowski distance .
d(x, y) = \left(\sum_{i=1}^{n} |x_i - y_i|^p\right)^{1/p}
From the formula above we can say that when p = 2 then it is
the same as the formula for the Euclidean distance and when p
= 1 then we obtain the formula for the Manhattan distance.
The above-discussed metrics are most common while dealing
with a Machine Learning problem but there are other distance
metrics as well like Hamming Distance which come in handy
while dealing with problems that require overlapping
comparisons between two vectors whose contents can be
Boolean as well as string values.
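A small Python sketch of these three distance metrics using only the standard library (the two example points are arbitrary):

import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def minkowski(x, y, p):
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

a, b = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
print(euclidean(a, b))       # 5.0
print(manhattan(a, b))       # 7.0
print(minkowski(a, b, 2))    # 5.0  (p = 2 reduces to Euclidean)
print(minkowski(a, b, 1))    # 7.0  (p = 1 reduces to Manhattan)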
How to choose the value of k for KNN
Algorithm?
The value of k is very crucial in the KNN algorithm to define the
number of neighbors in the algorithm. The value of k in the k-
nearest neighbors (k-NN) algorithm should be chosen based on
the input data. If the input data has more outliers or noise, a
higher value of k would be better. It is recommended to choose
an odd value for k to avoid ties in classification. Cross-
validation methods can help in selecting the best k value for the
given dataset.
Algorithm for K-NN
DistanceToNN = sorted distances (and values) from the test example to the first k training examples
for i = 1 to number of training records:
    Dist = distance(test example, i-th example)
    if Dist < the largest distance stored in DistanceToNN:
        remove that farthest example and its value from DistanceToNN
        insert the new example and its value in sorted order
return the average of the stored values (for regression; for classification, the majority class)
A fit using K-NN is more reasonable than 1-NN: K-NN is affected much less by noise if the dataset is large.
In the K-NN algorithm, we can see jumps in the prediction values due to a unit change in the input. The reason for this is the change in neighbors. To handle this situation, we can weight the neighbors in the algorithm: if the distance from a neighbor is high, we want less effect from that neighbor; if the distance is low, that neighbor should be more influential than the others.
Workings of KNN algorithm
Thе K-Nearest Neighbors (KNN) algorithm operates on the
principle of similarity, where it predicts the label or value of a
new data point by considering the labels or values of its K
nearest neighbors in the training dataset.
Step-by-Step explanation of how KNN works is discussed below:
Step 1: Selecting the optimal value of K
      K represents the number of nearest neighbors that needs
         to be considered while making prediction.
Step 2: Calculating distance
      To measure the similarity between target and training
         data points, Euclidean distance is used. Distance is
         calculated between each of the data points in the
         dataset and target point.
Step 3: Finding Nearest Neighbors
      The k data points with the smallest distances to the
         target point are the nearest neighbors.
Step 4: Voting for Classification or Taking Average for
Regression
      In the classification problem, the class labels of K-nearest
         neighbors are determined by performing majority voting.
         The class with the most occurrences among the
         neighbors becomes the predicted class for the target
         data point.
       In the regression problem, the prediction is calculated by taking the average of the target values of the K nearest neighbors. The calculated average value becomes the predicted output for the target data point.
Let X be the training dataset with n data points, where each data point is represented by a d-dimensional feature vector X_i, and let Y be the corresponding labels or values for each data point in X. Given a new data point x, the algorithm calculates the distance between x and each data point X_i in X using a distance metric, such as Euclidean distance:
\text{distance}(x, X_i) = \sqrt{\sum_{j=1}^{d} (x_j - X_{ij})^2}
The algorithm selects the K data points from X that have the
shortest distances to x. For classification tasks, the algorithm
assigns the label y that is most frequent among the K nearest
neighbors to x. For regression tasks, the algorithm calculates the
average or weighted average of the values y of the K nearest
neighbors and assigns it as the predicted value for x.
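A compact from-scratch Python sketch of this procedure for classification, using majority voting over the K nearest Euclidean neighbors (the toy training points are made up):

import math
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # Distance from x to every training point, paired with that point's label.
    dists = [(math.dist(x, xi), yi) for xi, yi in zip(X_train, y_train)]
    # Labels of the k nearest neighbors.
    nearest = [label for _, label in sorted(dists)[:k]]
    # Majority vote among the neighbors.
    return Counter(nearest).most_common(1)[0][0]

X_train = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
y_train = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(X_train, y_train, (2, 2), k=3))  # "A"
print(knn_predict(X_train, y_train, (6, 5), k=3))  # "B"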
Advantages of the KNN Algorithm
            Easy to implement as the complexity of the algorithm
             is not that high.
            Adapts Easily – As per the working of the KNN algorithm
             it stores all the data in memory storage and hence
             whenever a new example or data point is added then the
             algorithm adjusts itself as per that new example and has
             its contribution to the future predictions as well.
             Few Hyperparameters – The only parameters required in the training of a KNN algorithm are the value of k and the choice of the distance metric used for evaluation.
Disadvantages of the KNN Algorithm
        Does not scale – The KNN algorithm is often considered a Lazy Algorithm: it defers all computation to prediction time, so it takes a lot of computing power as well as data storage. This makes the algorithm both time-consuming and resource-exhausting.
        Curse of Dimensionality – Because of the peaking phenomenon, the KNN algorithm is affected by the curse of dimensionality, which means the algorithm has a hard time classifying data points properly when the dimensionality is too high.
        Prone to Overfitting – Since the algorithm is affected by the curse of dimensionality, it is also prone to overfitting. Feature selection and dimensionality reduction techniques are therefore generally applied to deal with this problem.