Prof. Sandip Eknathrao Ingle                                   9960575712
                           Advanced Machine Learning
                    BTAIC602             3L - 0T - 0P           3 Credits
Course Category: PCC8, Teaching Scheme Lecture: 3 hrs./week
Examination Scheme:
Continuous Assessment: 20 Marks | Mid Semester Exam: 20 Marks | End Semester Exam: 60 Marks (Duration 03 hrs.)
Pre-Requisites: Data Analysis, Python Programming Language
Course Objectives: After completion of the course, students will learn:-
   To understand fundamental concepts of unsupervised learning and its various algorithms
   To understand Association Rules Mining and Recommendation Systems
   To apply ML algorithms on given data and interpret the results obtained
   To design appropriate ML solution to solve real world problems in AI domain
Course Outcomes: On completion of the course, students will be able to:
CO1 Develop a good understanding of the fundamentals of unsupervised learning.
CO2 Formulate Association Rule Mining and Recommendation Systems.
CO3 Interpret a model using Reinforcement Learning.
CO4 Evaluate time series data.
CO5 Design and implement concrete solutions using boosting.
                 Unit No 1: Unsupervised Learning [7 Hours]
Unsupervised Learning - 1
Introduction to Unsupervised Learning, Introduction to Clustering, Using K-means for Flat
Clustering, KMeans Algorithm, Using KMeans from Sklearn, Implementing Fit & Predict
Functions, Implementing K-Means Class
Unsupervised Learning - 2
How to choose Optimal K, Silhouette algorithm to choose K, Introduction to K Medoids, K
Medoids Algorithm, Introduction to Hierarchical Clustering, Top down/Divisive Approach,
Bottom up/Agglomerative Approach
Principal Component Analysis PCA
Intuition behind PCA, Applying PCA to 2D data, Applying PCA on 3D data, Math behind PCA,
Finding Optimal Number of Features, Magic behind PCA, Dimensionality reduction PCA -
PCA on Images, PCA on Olivetti Images, Reproducing Images, Eigenfaces, Classification of
LFW Images
Introduction
What Is Machine Learning? Machine learning is programming computers to optimize a
performance criterion using example data or past experience. We have a model defined up to some
parameters, and learning is the execution of a computer program to optimize the parameters of the
model using the training data or past experience.
The model may be predictive to make predictions in the future, or descriptive to gain knowledge
from data, or both.
 The term Machine Learning was coined by Arthur Samuel in 1959, an American pioneer
    in the field of computer gaming and artificial intelligence, who stated that
 “It gives computers the ability to learn without being explicitly programmed”.
 In 1997, Tom Mitchell gave a “well-posed” mathematical definition:
    “A computer program is said to learn from experience E with respect to some
    task T and some performance measure P, if its performance on T, as
    measured by P, improves with experience E.”
 Machine Learning is one of the latest buzzwords in technology.
 It deserves the attention, as it is one of the most interesting subfields of Computer Science.
                                      Fig: Machine learning model
          However, there is no universally accepted definition for machine learning. Different
       authors define the term differently.
Arthur Samuel's Definition (1959):
      "Machine learning is the field of study that gives computers the ability to learn without
       being explicitly programmed."
Tom Mitchell's Definition (1997):
      "A computer program is said to learn from experience E with respect to some class of
       tasks T and performance measure P, if its performance at tasks in T, as measured by P,
       improves with experience E."
Ethem Alpaydın's Definition:
      "Machine learning is programming computers to optimize a performance criterion using
       example data or past experience."
Dieter Fox's Definition:
      "Machine learning is the study of algorithms that improve their performance P at some task
       T with experience E."
Kevin P. Murphy's Definition:
      "Machine learning is a set of methods that can automatically detect patterns in data, and
       then use the uncovered patterns to predict future data or perform other kinds of decision
       making under uncertainty."
Trevor Hastie, Robert Tibshirani, and Jerome Friedman's Definition:
      "Machine learning is a branch of statistics, but with increased emphasis on the use of
       computers to implement the underlying algorithms, and on evaluating their performance
       on large datasets."
Andreas C. Müller and Sarah Guido's Definition:
      "Machine learning is the practice of teaching computers how to learn patterns from data,
       and then making decisions or predictions based on those patterns."
Wikipedia's Definition:
       "Machine learning is a subfield of artificial intelligence (AI) focused on the development
        of algorithms that allow computers to learn from and make predictions on data."
Key Influential Figures
   Alan Turing: Pioneer of AI and theoretical computer science.
   Frank Rosenblatt: Developed the perceptron, an early neural network model.
   Geoffrey Hinton: Key figure in the development of deep learning and neural networks.
   Yann LeCun: Contributed to the development of convolutional neural networks.
   Yoshua Bengio: Worked extensively on deep learning and representation learning.
   Andrew Ng: Prominent advocate for online education in AI and machine learning.
Key aspects of machine learning include:
    1. Learning from Data: Machine learning algorithms use historical data as input to predict
        new output values. The quality and quantity of data significantly influence the performance
        of the learning model.
    2. Pattern Recognition: By identifying patterns and correlations in data, machine learning
        models can make predictions or decisions without being explicitly programmed for the
        task.
    3. Adaptation and Improvement: Machine learning systems continuously improve their
        performance as they are exposed to more data over time, refining their algorithms to
        enhance accuracy and efficiency.
Core Components of Machine Learning
    1. Data: The foundation of any machine learning system. It can be in various forms, such as
        text, images, audio, and structured data like databases.
    2. Features: Individual measurable properties or characteristics of the data used by the
        algorithm to make predictions.
    3. Algorithms: The set of rules or procedures that the machine learning system follows to
        learn from data and make predictions. Common algorithms include decision trees, neural
        networks, and support vector machines.
Simple Machine Learning Process
    1. Data Collection: Gather relevant data for the task, such as customer purchase histories for
        a recommendation system.
    2. Data Preparation: Clean and pre-process the data to ensure quality and consistency. This
        may involve handling missing values, normalizing data, and encoding categorical
        variables.
   3. Model Training: Select an appropriate algorithm and use the prepared data to train the
       model, allowing it to learn the patterns and relationships in the data.
   4. Model Evaluation: Assess the model's performance using evaluation metrics like
       accuracy, precision, recall, and F1 score. This typically involves testing the model on a
       separate set of data not used during training.
   5. Model Tuning: Optimize the model by adjusting hyperparameters and improving the
       feature set to enhance performance.
   6. Deployment: Implement the trained model in a real-world environment where it can make
       predictions or decisions on new data.
   7. Monitoring and Maintenance: Continuously monitor the model's performance and
       update it with new data to maintain its accuracy and relevance.
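A minimal end-to-end sketch of steps 1 to 6 above, using scikit-learn's built-in Iris dataset as a stand-in for collected data (an illustrative sketch only; real projects involve far more preparation, tuning, and monitoring):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1. Data collection: a small built-in dataset stands in for real collected data.
X, y = load_iris(return_X_y=True)

# 2. Data preparation: hold out a test set and scale the features.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# 3. Model training: fit a simple classifier on the prepared data.
model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_train, y_train)

# 4. Model evaluation: measure accuracy on data not used during training.
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))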
Types of Machine Learning
    1. Supervised Learning
          o    Definition: The model is trained on a labelled dataset, which means that each
               training example is paired with an output label.
Supervised learning is a type of machine learning where the algorithm is trained on a labelled
dataset. This means that each input comes with a corresponding output label. The model learns to
map inputs to outputs and is then used to predict the labels for new, unseen data.
                                        Fig: Supervised Learning
Key Components:
      Labelled Data: Data that includes input-output pairs.
      Training Phase: The model learns from the labelled dataset by finding patterns and
       relationships between the inputs and outputs.
      Prediction Phase: The trained model is used to predict the output for new inputs.
Examples of Supervised Learning Algorithms:
      Classification: Identifying email as spam or not spam.
      Regression: Predicting house prices based on historical data.
       Regression:
        o   Linear Regression: Predicting a continuous output, such as housing prices.
        o   Logistic Regression: Predicting binary outcomes, such as spam detection.
       Classification:
        o   Decision Trees: Classifying data into categories based on input features.
        o   Support Vector Machines (SVM): Finding the optimal boundary between classes.
        o   K-Nearest Neighbours (k-NN): Classifying data based on the closest training
            examples.
    •   When an algorithm learns from example data and associated target responses, which can
        consist of numeric values or string labels such as classes or tags, in order to later predict
        the correct response when presented with new examples, it falls under the category of
        supervised learning.
    •   This approach is indeed similar to human learning under the supervision of a teacher.
    •   The teacher provides good examples for the student to memorize, and the student then
        derives general rules from these specific examples.
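As a minimal illustration of the training and prediction phases (a sketch using scikit-learn with made-up numeric features, not a production spam filter):

from sklearn.linear_model import LogisticRegression

# Toy labelled data: each row is [number of words, number of links]; label 1 = spam, 0 = not spam.
X_train = [[120, 0], [300, 5], [150, 1], [800, 9], [90, 0], [400, 7]]
y_train = [0, 1, 0, 1, 0, 1]

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)                  # training phase: learn the input-to-output mapping

print(model.predict([[100, 0], [500, 8]]))   # prediction phase on new, unseen inputs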
Applications:
       Email spam detection.
       Image and speech recognition.
       Credit scoring.
       Medical diagnosis.
2. Unsupervised Learning
            o   Definition: The model is given data without explicit instructions on what to do
                with it. It must find patterns and relationships in the data.
Unsupervised learning involves training a model on data without labelled responses. The algorithm
tries to learn the underlying structure of the data by identifying patterns, relationships, and clusters.
                                    Fig: Unsupervised Learning
Key Components:
       Unlabelled Data: Data without explicit labels or outcomes.
       Clustering: Grouping similar data points together.
       Dimensionality Reduction: Reducing the number of variables under consideration.
Examples of Unsupervised Learning Algorithms:
       Clustering:
            o   K-Means Clustering: Partitioning data into k distinct clusters based on similarity.
            o   Hierarchical Clustering: Creating a tree of clusters based on nested grouping.
       Dimensionality Reduction:
            o   Principal Component Analysis (PCA): Reducing the number of features while
                retaining variance.
            o   t-Distributed Stochastic Neighbour Embedding (t-SNE): Reducing dimensionality
                for visualization.
    •   In contrast, when an algorithm learns from plain examples without any associated response,
        it is left to the algorithm to determine the data patterns on its own.
    •   This type of algorithm tends to restructure the data into something else, such as new features
        that may represent a class or a new series of uncorrelated values.
    •   They are quite useful in providing humans with insights into the meaning of data and new
        useful inputs to supervised machine learning algorithms. As a kind of learning, it resembles
        the methods humans use to figure out that certain objects or events are from the same class,
        such as by observing the degree of similarity between objects.
Examples:
       Clustering: Grouping customers by purchasing behaviour.
       Dimensionality Reduction: Reducing the number of random variables under
        consideration.
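As a brief illustration of the dimensionality-reduction side of unsupervised learning, the following sketch (assuming scikit-learn is available, with made-up values) projects three correlated features onto two principal components:

import numpy as np
from sklearn.decomposition import PCA

# Toy data: 5 samples with 3 correlated features (illustrative values only).
X = np.array([[2.5, 2.4, 1.2],
              [0.5, 0.7, 0.3],
              [2.2, 2.9, 1.5],
              [1.9, 2.2, 1.1],
              [3.1, 3.0, 1.6]])

pca = PCA(n_components=2)              # keep the two directions of largest variance
X_reduced = pca.fit_transform(X)       # project the data onto those directions

print(X_reduced.shape)                 # (5, 2)
print(pca.explained_variance_ratio_)   # fraction of variance retained by each component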
Applications:
       Customer segmentation.
       Anomaly detection.
       Market basket analysis.
       Image compression.
3. Semi-supervised Learning
            o   Definition: The model is trained on a dataset that includes both labelled and
                unlabelled data. This approach is often used when labelling data is expensive or
                time-consuming.
Semi-supervised learning is a type of machine learning that involves training a model on a dataset that
includes both labelled and unlabelled data. This approach leverages the large amount of unlabelled data to
improve the learning process when labelled data is scarce or expensive to obtain.
                                  Fig: Semi-supervised Learning
Key Components:
       Labelled Data: A small amount of data with labels.
       Unlabelled Data: A large amount of data without labels.
       Combination Approach: Using both labelled and unlabelled data to improve model
        accuracy.
Examples of Semi-Supervised Learning Algorithms:
       Self-Training: The model is trained on labelled data, makes predictions on unlabelled data,
        and then retrains using the most confident predictions as additional labelled data.
       Co-Training: Two models are trained on different views of the data and teach each other
        using the most confident predictions.
       Graph-Based Methods: Using graph structures to propagate labels through the dataset.
   •    Here an incomplete training signal is given: a training set with some (often many) of the
        target outputs missing.
   •    There is a special case of this principle known as transduction, where the entire set of
        problem instances is known at learning time, except that part of the targets are missing.
Examples: Improving accuracy of image recognition by using a small amount of labelled images
and a large amount of unlabelled images.
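The self-training idea described above can be sketched with scikit-learn's SelfTrainingClassifier, where unlabelled points are marked with -1 (toy data for illustration only):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

# Toy 2-D points: only one example per class is labelled; -1 marks unlabelled data.
X = np.array([[1.0, 2.0], [8.0, 8.5], [1.2, 1.8], [7.8, 8.2], [1.1, 2.2], [8.1, 7.9]])
y = np.array([0, 1, -1, -1, -1, -1])

# The wrapped classifier is retrained as its most confident predictions are added as pseudo-labels.
model = SelfTrainingClassifier(LogisticRegression())
model.fit(X, y)

print(model.predict(np.array([[1.3, 2.1], [7.9, 8.0]])))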
Applications:
       Image classification with limited labelled images.
       Natural language processing with limited annotated text.
       Fraud detection with limited known fraud cases.
4. Reinforcement Learning
            o   Definition: The model learns by interacting with an environment and receiving
                rewards or penalties for actions taken.
       Reinforcement learning is a type of machine learning where an agent learns to make
       decisions by taking actions in an environment to maximize some notion of cumulative
       reward. The agent receives feedback in the form of rewards or penalties and uses this
       feedback to improve its strategy over time.
                                    Fig: Reinforcement Learning
Key Components:
       Agent: The learner or decision-maker.
       Environment: The external system with which the agent interacts.
       Actions: The set of all possible moves the agent can make.
       Rewards: Feedback from the environment based on the agent's actions.
       Policy: The strategy that the agent follows to decide on actions.
Examples of Reinforcement Learning Algorithms:
       Q-Learning: Learning the value of actions in states to maximize cumulative reward.
       Deep Q-Networks (DQN): Combining Q-learning with deep neural networks for complex
        environments.
       Policy Gradient Methods: Directly optimizing the policy to maximize the expected
        reward.
    •   As in unsupervised learning, you present the algorithm with examples that lack labels.
    •   However, you can accompany an example with positive or negative feedback according
        to the solution the algorithm proposes. This comes under the category of Reinforcement
        Learning, which is connected to applications for which the algorithm must make
        decisions (so the product is prescriptive, not just descriptive, as in unsupervised
        learning), and the decisions bear consequences.
   •    In the human world, it is just like learning by trial and error.
   o    Examples: Training a robot to walk, developing game-playing algorithms like AlphaGo.
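A minimal sketch of the Q-learning update rule mentioned above, on a made-up problem with 3 states and 2 actions (illustrative only, not tied to any specific environment):

import numpy as np

n_states, n_actions = 3, 2
Q = np.zeros((n_states, n_actions))   # estimated value of each action in each state
alpha, gamma = 0.1, 0.9               # learning rate and discount factor

def q_update(state, action, reward, next_state):
    # Q-learning update: move Q[s, a] toward reward + gamma * max over a' of Q[s', a'].
    best_next = np.max(Q[next_state])
    Q[state, action] += alpha * (reward + gamma * best_next - Q[state, action])

# One hypothetical transition: in state 0, taking action 1 gives reward 1.0 and leads to state 2.
q_update(state=0, action=1, reward=1.0, next_state=2)
print(Q)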
Applications:
       Game playing (e.g., AlphaGo).
       Robotics (e.g., training robots to perform tasks).
       Autonomous vehicles.
 Introduction to Clustering
Clustering
Clustering is an unsupervised machine learning technique used to group similar data points into
clusters. It plays a crucial role in data analysis, pattern recognition, and machine learning
applications. Clustering is the process of partitioning a dataset into groups (clusters) based on
similarity among the data points. Unlike classification, clustering does not use predefined labels
but identifies inherent patterns in the data.
Scientific Definitions of Clustering
Clustering is formally defined in various scientific contexts. Here are some widely accepted
definitions:
    1. Statistical Definition
       Clustering is the process of organizing objects into groups, or clusters, where objects within
       the same cluster are more similar to each other than to those in other clusters based on a
       predefined similarity or distance measure.
    2. Machine Learning Definition
       Clustering is an unsupervised learning technique that partitions a dataset into subsets (clusters)
       such that data points in the same cluster share similar characteristics, and those in different
       clusters have distinct properties.
    3. Mathematical Definition
       Given a dataset X with n points, X = {x1, x2, …, xn}, a clustering algorithm finds a mapping
       function f : X → C, where C is a set of clusters {C1, C2, …, Ck} satisfying some similarity or
       distance criterion.
    4. Graph-Theoretic Definition
       Clustering can be defined as a partitioning of a graph G=(V,E) , where vertices (V) represent
       data points, and edges (E) represent relationships, such that intra-cluster edges have higher
       weights (stronger connections) than inter-cluster edges.
History of Clustering and Pioneers
Clustering has a rich history, evolving from early statistical methods to advanced machine learning
techniques.
Karl Pearson (1857-1936)
     Developed the Pearson Correlation Coefficient, a fundamental statistical measure of similarity.
     His work laid the groundwork for measuring relationships between data points, a key concept
      in clustering.
Sir Ronald Fisher (1890-1962)
   Introduced discriminant analysis in 1936, which later influenced clustering techniques.
   Worked on multivariate statistics, forming a base for future machine learning models.
Stuart Lloyd (1957)
   Introduced the Lloyd's Algorithm, which later became known as the K-Means algorithm in
    the 1980s.
   Initially developed for signal processing at Bell Labs.
S.C. Johnson (1967)
   Developed Hierarchical Clustering, which organizes data into nested groups.
   Introduced Agglomerative and Divisive clustering approaches.
Martin Ester et al. (1996) - DBSCAN
   Introduced Density-Based Spatial Clustering of Applications with Noise (DBSCAN).
   This was a breakthrough in handling noisy and non-spherical clusters.
Features of Clustering
Clustering is an essential unsupervised learning technique used to group data points based on their
similarities. Below are the key features that define clustering
1. Unsupervised Learning
       Clustering is an unsupervised learning technique, meaning it does not require labelled data.
       The algorithm identifies patterns and structures in raw data without prior knowledge.
2. Similarity-Based Grouping
       Clustering groups data points based on similarity measures such as Euclidean distance,
        cosine similarity, and correlation.
       Data points within a cluster are more similar to each other than to points in different
        clusters.
3. Automatic Pattern Discovery
       Clustering helps in finding hidden patterns and structures in data.
       It is widely used in market segmentation, anomaly detection, and image segmentation.
4. Scalability
      Some clustering algorithms work well with large datasets (e.g., K-Means, DBSCAN),
       while others struggle with high-dimensional data.
      Scalability is a critical factor in big data applications.
5. Different Cluster Shapes & Structures
Clustering techniques can detect different types of structures, including:
      Spherical Clusters (K-Means)
      Hierarchical Clusters (Agglomerative)
      Density-Based Clusters (DBSCAN)
      Fuzzy Clusters (Fuzzy C-Means)
6. Noise and Outlier Handling
      Some clustering techniques can handle noise and outliers effectively, such as DBSCAN
       and OPTICS.
      Others, like K-Means, are sensitive to outliers because they rely on mean values.
7. High-Dimensional Data Handling
      Some clustering techniques, like Principal Component Analysis (PCA) + Clustering, work
       well in high-dimensional spaces.
      Others may suffer from the curse of dimensionality, requiring dimensionality reduction
       techniques.
8. Cluster Validation
       Evaluating the quality of clusters is important and can be done using the following indices
        (a short scikit-learn example follows this list):
       Silhouette Score (measures how well-separated clusters are)
       Dunn Index (evaluates compactness and separation)
       Davies-Bouldin Index (measures intra-cluster similarity)
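A short scikit-learn sketch of the Silhouette Score mentioned above, on toy data (values closer to 1 indicate compact, well-separated clusters):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(X)
print("Silhouette score:", silhouette_score(X, labels))   # high (roughly 0.7 here), indicating well-separated clusters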
9. Computational Complexity
      Clustering algorithms vary in computational cost:
      K-Means is efficient with O(nkT) complexity.
      Hierarchical Clustering is computationally expensive with O(n² log n).
10. Interpretability & Visual Representation
       Some clustering results can be easily visualized using scatter plots and t-SNE.
      Others require advanced dimensionality reduction methods for better interpretation.
List of Clustering Algorithms in Machine Learning
Clustering algorithms can be categorized based on their working principles. Below is a list of
commonly used clustering algorithms in ML:
1. Partition-Based Clustering
These algorithms divide the dataset into a predefined number of clusters.
      K-Means – Centroid-based, fast, and widely used.
      K-Medoids – Similar to K-Means but uses medoids instead of centroids, more robust to
       outliers.
      CLARANS (Clustering Large Applications based upon Randomized Search) –
       Optimized version of K-Medoids for large datasets.
2. Hierarchical Clustering
These algorithms create a tree-like structure (dendrogram) of clusters.
      Agglomerative Hierarchical Clustering – A bottom-up approach, starts with individual
       points and merges them.
      Divisive Hierarchical Clustering – A top-down approach, starts with a single cluster and
       splits iteratively.
3. Density-Based Clustering
These algorithms group data points based on density regions.
       DBSCAN (Density-Based Spatial Clustering of Applications with Noise) – Can detect
        arbitrary-shaped clusters and handle noise well (a short example follows this list).
      OPTICS (Ordering Points to Identify the Clustering Structure) – Improves DBSCAN
       for varying density clusters.
      Mean-Shift – Uses a density function to move data points toward dense regions.
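A brief example of DBSCAN on toy data (the eps and min_samples values are illustrative choices; points labelled -1 are treated as noise):

import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups plus one isolated point that should be flagged as noise.
X = np.array([[1, 2], [1, 3], [2, 2], [8, 8], [8, 9], [9, 8], [25, 25]])

labels = DBSCAN(eps=2.0, min_samples=2).fit_predict(X)
print(labels)   # two clusters (0 and 1) and the isolated point labelled -1 as noise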
4. Model-Based Clustering
These algorithms assume data is generated from a mixture of probability distributions.
      Gaussian Mixture Model (GMM) – Uses multiple Gaussian distributions to model data
       clusters.
      Expectation-Maximization (EM) – A probabilistic approach for clustering when data
       follows a distribution.
5. Grid-Based Clustering
These algorithms divide the data space into grids and perform clustering on them.
      STING (Statistical Information Grid-based Clustering) – Uses statistical measures to
       form clusters.
      CLIQUE (Clustering In QUEst) – Hybrid of grid-based and density-based clustering.
6. Fuzzy Clustering
These algorithms allow data points to belong to multiple clusters with a degree of membership.
      Fuzzy C-Means (FCM) – Assigns probabilities of belonging to each cluster instead of
       hard assignment.
7. Deep Learning-Based Clustering
Advanced clustering techniques using neural networks.
       Autoencoder-Based Clustering – Uses deep learning models to extract features before
        clustering.
       Deep Embedded Clustering (DEC) – Learns cluster assignments directly from data using
        deep neural networks.
8. Graph-Based Clustering
These algorithms use graph theory to form clusters.
       Spectral Clustering – Uses eigenvalues of similarity matrices to cluster non-linearly
        separable data.
       Affinity Propagation – Message-passing clustering method, no need to specify clusters
        beforehand.
       Markov Clustering Algorithm (MCL) – Uses random walks on a graph to detect clusters.
9. Evolutionary Clustering
Metaheuristic clustering methods inspired by nature.
      Genetic Algorithm-Based Clustering – Uses genetic algorithms for optimization-based
       clustering.
      Particle Swarm Optimization (PSO) Clustering – Swarm intelligence-based clustering
       technique.
                 Comparison of Clustering Algorithms in Machine Learning

| Algorithm | Type | Characteristics | Advantages | Disadvantages | Best Use Cases |
|---|---|---|---|---|---|
| K-Means | Partition-Based | Divides data into k clusters; centroid-based | Fast, simple, efficient for large datasets | Sensitive to outliers, needs predefined k | Customer segmentation, document clustering |
| K-Medoids | Partition-Based | Uses medoids (actual points) instead of centroids | More robust to outliers than K-Means | Slower than K-Means, computationally expensive | Small datasets with noise, fraud detection |
| CLARANS | Partition-Based | Randomized version of K-Medoids for efficiency | Works better on large datasets than K-Medoids | Still slower than K-Means | Large-scale clustering, geospatial analysis |
| Agglomerative Clustering | Hierarchical | Bottom-up approach; forms a hierarchy (dendrogram) | No need to specify clusters beforehand | Computationally expensive (O(n²)) | Taxonomy, gene expression analysis |
| Divisive Clustering | Hierarchical | Top-down approach; splits clusters recursively | Better for small datasets, hierarchical structure | Computationally expensive (O(n²)) | Social network analysis, document organization |
| DBSCAN | Density-Based | Clusters based on density; detects noise | Handles noise, arbitrary-shaped clusters | Struggles with varying densities, parameter-sensitive | Anomaly detection, spatial data analysis |
| OPTICS | Density-Based | Similar to DBSCAN but works with variable densities | Better for varying densities | More complex than DBSCAN | Large-scale clustering, uneven data densities |
| Mean-Shift | Density-Based | Moves points toward dense regions | No need to define k, flexible cluster shapes | High computational cost (O(n²)) | Image segmentation, object tracking |
| Gaussian Mixture Model (GMM) | Model-Based | Uses probabilistic models (Gaussian distributions) | Soft clustering, flexible cluster shapes | Computationally expensive, requires k | Bioinformatics, financial fraud detection |
| Expectation-Maximization (EM) | Model-Based | Finds maximum likelihood cluster assignments | Works well with missing data | Slow convergence, local optima problems | Image recognition, speech processing |
| STING | Grid-Based | Divides data into hierarchical grid cells | Fast and scalable | Loses precision at lower levels | Geographic clustering, large-scale data |
| CLIQUE | Grid-Based | Combines density-based & grid-based methods | Handles high-dimensional data | Computationally complex | High-dimensional data analysis, gene clustering |
| Fuzzy C-Means (FCM) | Fuzzy Clustering | Allows points to belong to multiple clusters | Soft clustering, better for overlapping clusters | Sensitive to noise, requires k | Medical imaging, text document clustering |
| Autoencoder-Based Clustering | Deep Learning-Based | Uses neural networks to reduce dimensions | Works well for high-dimensional data | Requires deep learning expertise, computationally expensive | Deep learning applications, complex patterns |
| Deep Embedded Clustering (DEC) | Deep Learning-Based | Learns cluster structure directly from data | Can extract meaningful representations | Requires large labeled dataset for pretraining | Image & text clustering at large scale |
| Spectral Clustering | Graph-Based | Uses eigenvalues of similarity matrices | Works well for non-convex clusters | Computationally expensive (O(n³)) | Graph clustering, NLP applications |
| Affinity Propagation | Graph-Based | Message-passing between data points | No need to specify k | High memory usage, slow on large datasets | Image segmentation, recommendation systems |
| Markov Clustering (MCL) | Graph-Based | Uses random walks on a graph | Good for network-based clustering | Not widely used outside graph-based applications | Biological networks, social network clustering |
| Genetic Algorithm Clustering | Evolutionary | Uses genetic algorithms for optimization | Works for dynamic clustering problems | Slow, requires tuning evolutionary parameters | Complex datasets, evolutionary data analysis |
| Particle Swarm Optimization (PSO) Clustering | Evolutionary | Uses swarm intelligence for clustering | Works for dynamically changing data | Computationally expensive, convergence issues | Dynamic clustering problems, optimization tasks |
Using K-means for Flat Clustering
       A flat clustering algorithm (like K-Means) takes a set of N data points and a specified
number K, and divides the points into K distinct clusters, where each data point belongs to exactly
one cluster, and all clusters are at the same level with no hierarchy between them.
            Flat Clustering Algorithm (K-Means style)
      Input: A set of N data points (or documents).
      Choose the number of clusters, K, in advance.
Step 1: Initialize K Cluster Centroids
      Randomly select K points from the data as initial centroids
Step 2: Assign Points to the Nearest Centroid
      For each data point:
            o   Calculate the distance (typically Euclidean) to each centroid.
            o   Assign the point to the closest centroid's cluster.
      Initial Result: All N points are divided into K groups based on proximity to centroids.
Step 3: Recalculate Cluster Centroids
      For each of the K clusters:
           o    Compute the new centroid by averaging the data points in that cluster.
           o    This new centroid becomes the updated representative.
Step 4: Repeat Assignment and Update
      Reassign all data points to the nearest updated centroid.
      Recalculate centroids again.
      Repeat Steps 2 and 3 until:
           o    Cluster assignments no longer change (convergence), or
           o    A maximum number of iterations is reached.
Step 5: Output the Final Clusters
      Once stable, the algorithm outputs:
           o    The K clusters, each containing a group of data points.
           o    The final centroids representing each cluster.
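A minimal NumPy sketch of the assignment and update steps described above (an illustration of the algorithm, not a full implementation; Euclidean distance is assumed):

import numpy as np

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]], dtype=float)
centroids = X[[0, 3]].copy()        # Step 1: pick two existing points as the initial centroids

for _ in range(10):                 # Step 4: repeat assignment and update
    # Step 2: distance from every point to every centroid, then nearest-centroid labels.
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Step 3: move each centroid to the mean of the points assigned to it.
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(len(centroids))])
    if np.allclose(new_centroids, centroids):   # convergence: centroids stopped moving
        break
    centroids = new_centroids

print(labels)      # Step 5: final cluster assignments
print(centroids)   # Step 5: final centroids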
       Flat clustering refers to a type of clustering algorithm that organizes a given set of N data
points (or documents) into a predefined number K of distinct groups, called clusters, without
establishing any hierarchical structure among them. The term “flat” signifies that all clusters are
treated equally and exist at the same level—there are no parent-child relationships or nested
groupings as seen in hierarchical clustering. This means that each data point belongs to exactly
one cluster, and clusters do not overlap (unless fuzzy methods are used).
       In flat clustering, the goal is to find a partition—that is, to divide the entire dataset into
non-overlapping subsets. Each subset (or cluster) groups together data points that are more similar
to each other than to those in other clusters. For instance, this method is widely used to cluster text
documents, customer profiles, or images, where N represents the total number of documents, users,
or images, and K is the number of categories or segments we want to identify.
Flat Clustering (K-Means) –Example
Dataset: We have 6 data points representing coordinates in 2D space:
A (1, 2), B (1, 4), C (1, 0), D (10, 2), E (10, 4), F (10, 0)
Assume: K=2
Step 1: Initialize K = 2 Centroids
Let’s randomly choose:
Centroid 1: A (1, 2) Centroid 2: D (10, 2)
Step 2: Assign Each Point to Nearest Centroid
We use the Euclidean Distance.
Euclidean distance is the straight-line distance between two points in Euclidean space.
For two points, a centroid (x₁, y₁) and a data point P₂ = (x₂, y₂), the Euclidean Distance (d) is:
d = √((x₂ − x₁)² + (y₂ − y₁)²)
| Point | Distance to A (1, 2) | Distance to D (10, 2) | Assigned to |
|---|---|---|---|
| A (1, 2) | 0 | 9 | Cluster 1 |
| B (1, 4) | 2 | √85 ≈ 9.22 | Cluster 1 |
| C (1, 0) | 2 | √85 ≈ 9.22 | Cluster 1 |
| D (10, 2) | 9 | 0 | Cluster 2 |
| E (10, 4) | √85 ≈ 9.22 | 2 | Cluster 2 |
| F (10, 0) | √85 ≈ 9.22 | 2 | Cluster 2 |
      Cluster 1 (K1): A (1,2), B (1,4), C (1,0)
      Cluster 2 (K2): D (10,2), E (10,4), F (10,0)
Compute the New Centroids (Means)
For Cluster 1 (K1):
         Points: (1,2), (1,4), (1,0)
         Mean X: (1 + 1 + 1) / 3 = 1
         Mean Y: (2 + 4 + 0) / 3 = 2
         New Centroid 1 = (1, 2)
For Cluster 2 (K2):
         Points: (10,2), (10,4), (10,0)
         Mean X: (10 + 10 + 10) / 3 = 10
         Mean Y: (2 + 4 + 0) / 3 = 2
         New Centroid 2 = (10, 2)
Here, the new centroids are identical to the assumed initial centroids, so K-Means has converged.
The assignments no longer change, and the clusters are stable.
Using KMeans from Sklearn
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
# Step 1: Define the dataset
data = np.array([[1, 2],    # A
                 [1, 4],    # B
                 [1, 0],    # C
                 [10, 2],   # D
                 [10, 4],   # E
                 [10, 0]])  # F
# Step 2: Run KMeans
kmeans = KMeans(n_clusters=2, random_state=0, n_init='auto')
kmeans.fit(data)
# Step 3: Get labels and centroids
labels = kmeans.labels_
centroids = kmeans.cluster_centers_
# Step 4: Plotting
plt.figure(figsize=(8, 6))
# Plot each cluster with a different color
for i in range(2): # for K=2
   cluster_points = data[labels == i]
   plt.scatter(cluster_points[:, 0], cluster_points[:, 1], label=f'Cluster {i + 1}')
# Plot centroids
plt.scatter(centroids[:, 0], centroids[:, 1], c='black', s=200, marker='X',
label='Centroids')
# Annotate original points
point_labels = ['A', 'B', 'C', 'D', 'E', 'F']
for i, txt in enumerate(point_labels):
   plt.annotate(txt, (data[i][0] + 0.2, data[i][1] + 0.2))
# Styling
plt.title("K-Means Clustering (K=2)")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.legend()
plt.grid(True)
plt.tight_layout()
# Show plot
plt.show()
                                            Fig: Output
Implementing Fit & Predict Functions
       Fit() function
       The fit() function in scikit-learn's KMeans class is the core method used to train the K-
means clustering model on your data. It takes your data as input and learns the optimal cluster
centers (centroids) that minimize the within-cluster sum of squares (WCSS).
Here's a detailed explanation of how the fit() function in KMeans works:
Core Goal:
       The primary goal of the fit() function is to find the best possible positions for k centroids
in your data space, such that each data point is assigned to the cluster whose centroid is closest to
it. "Best possible" is defined by minimizing the WCSS, which is the sum of the squared Euclidean
distances between each data point and its assigned centroid.
Input to fit():
The fit() method typically takes one required argument:
kmeans.fit(X)
       X: This is the training data. It should be a NumPy array or a sparse matrix of shape
        (n_samples, n_features), where:
                    o    n_samples is the number of data points.
                    o    n_features is the number of dimensions (features) of each data point.
Steps Involved in fit():
The fit() function implements the K-means algorithm, which generally involves the following
iterative steps:
    1. Initialization of Centroids:
    o   The algorithm starts by initializing k centroid points. The method of initialization can be
        controlled by the init parameter of the KMeans class. Common initialization methods
        include:
               'k-means++' (default): This is a smart initialization technique that aims to select initial
                centroids that are far from each other, leading to faster convergence and potentially
                better results (a simplified code sketch appears after these steps). It works by:
                       Randomly selecting the first centroid from the data points.
                       For each subsequent centroid, it calculates the squared distance between each data
                        point and the nearest already chosen centroid.
                       It then selects the next centroid with a probability proportional to the square of this
                        distance. This increases the likelihood of choosing points that are far away from
                        existing centroids.
                   'random': This method simply selects k data points randomly from the input data as
            initial centroids. This is simpler but can be more sensitive to the initial choice and may
            lead to slower convergence or suboptimal solutions.
                   A NumPy array: You can also provide your own initial centroids as a NumPy array
                    of shape (n_clusters, n_features).
    2. Assignment Step (E-step - Expectation):
    o   Once the initial centroids are chosen, each data point in the dataset X is assigned to the
        cluster whose centroid is closest to it.
    o   The distance is typically calculated using the Euclidean distance (the default). You can
        potentially customize the distance metric using a custom callable via the metric parameter
        in some related clustering algorithms, though KMeans primarily uses Euclidean distance.
    o   For each data point, the algorithm finds the centroid with the minimum distance and assigns
        the data point to that cluster.
    3. Update Step (M-step - Maximization):
    o   After all data points have been assigned to clusters, the algorithm recalculates the position
        of each centroid.
   o   The new position of each centroid is the mean (average) of all the data points that are
       currently assigned to that cluster.
   o   This step essentially moves the centroids towards the "center" of their respective clusters.
   4. Iteration and Convergence:
   o   Steps 2 (assignment) and 3 (update) are repeated iteratively.
   o   The algorithm continues to iterate until a stopping criterion is met. Common stopping
       criteria include:
      Convergence: The centroids no longer move significantly between iterations. This is often
       measured by checking if the change in centroid positions falls below a certain tolerance
       (tol parameter).
      Maximum number of iterations: The algorithm reaches a predefined maximum number
       of iterations (max_iter parameter). This is a safeguard to prevent the algorithm from
       running indefinitely if it doesn't converge quickly.
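The 'k-means++' seeding described in Step 1 can be sketched as follows (a simplified illustration, not scikit-learn's internal code):

import numpy as np

def kmeans_plus_plus_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # First centroid: a data point chosen uniformly at random.
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from every point to its nearest already-chosen centroid.
        d2 = np.min(((X[:, None, :] - np.array(centroids)[None, :, :]) ** 2).sum(axis=2), axis=1)
        # Pick the next centroid with probability proportional to that squared distance.
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]], dtype=float)
print(kmeans_plus_plus_init(X, k=2))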
Output of fit():
After the fit() method completes, the KMeans object will have learned the cluster structure from
the data. It updates several attributes of the object, including:
      cluster_centers_: A NumPy array of shape (n_clusters, n_features) containing the
       coordinates of the final cluster centroids.
      labels_: A NumPy array of shape (n_samples,) containing the cluster label (0 to n_clusters
       - 1) assigned to each data point in the training data X.
      inertia_: The sum of squared distances of samples to their closest cluster center. This is
       the value of the objective function (WCSS) that the K-means algorithm tries to minimize.
      n_iter_: The number of iterations run to achieve convergence.
How to Use fit():
from sklearn.cluster import KMeans
import numpy as np

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
kmeans = KMeans(n_clusters=2, random_state=42)
kmeans.fit(X)
print("Centroids:", kmeans.cluster_centers_)
print("Labels:", kmeans.labels_)
print("Inertia:", kmeans.inertia_)

Output:
Centroids: [[ 1.  2.]
            [10.  2.]]
Labels: [0 0 0 1 1 1]
Inertia: 16.0
Predict () function
The predict() function in scikit-learn's KMeans class is used after the model has been trained using
the fit() function. Its purpose is to assign new, unseen data points to the clusters learned during
the training phase.
Here's a detailed explanation of how the predict() function works:
Core Goal:
The primary goal of the predict() function is to take new data points as input and, based on the
cluster centroids learned during the fit() stage, determine which of the existing clusters each new
data point most likely belongs to.
Input to predict():
The predict() method takes one required argument:
      X: This is the new data for which you want to predict the cluster assignments. It should be
       a NumPy array or a sparse matrix of shape (n_samples, n_features), where:
           o   n_samples is the number of new data points.
           o   n_features must be the same as the number of features used to train the KMeans
               model in the fit() step.
Steps Involved in predict():
The predict() function performs the following steps for each new data point in the input X:
   1. Distance Calculation: For each new data point, the function calculates the distance
       between that data point and each of the cluster centroids that were learned during the fit()
       phase (stored in kmeans.cluster_centers_). The distance metric used is typically the same
       as the one implicitly used during training, which is usually the Euclidean distance in the
       standard KMeans implementation.
   2. Assignment to the Closest Centroid: After calculating the distances to all the cluster
       centroids, the predict() function assigns the new data point to the cluster whose centroid is
       the closest to it (i.e., the centroid with the minimum distance).
   3. Label Assignment: The function then returns the index (label) of the closest cluster
       (ranging from 0 to n_clusters - 1) as the predicted cluster for that new data point.
Output of predict():
The predict() method returns a NumPy array of shape (n_samples,), where n_samples is the
number of new data points in the input X. Each element in this array represents the predicted
cluster label (an integer) for the corresponding new data point.
How to Use predict():
from sklearn.cluster import KMeans
import numpy as np

# Original training data
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

# Fit the model
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
kmeans.fit(X)

# New data points to classify
new_points = np.array([[0, 0], [12, 3]])

# Predict the clusters for new data
predictions = kmeans.predict(new_points)
print("Predicted clusters:", predictions)

Output:
Predicted clusters: [0 1]
                                            Key Differences

| Feature | fit(X) | predict(X_new) |
|---|---|---|
| Purpose | Trains the K-means model; learns the cluster structure from the training data. | Assigns new, unseen data points to the learned clusters; uses the existing model to classify new data. |
| Input Data | Training data X (used to learn clusters). | New data X_new (to be assigned to existing clusters). |
| Process | 1. Initializes centroids. 2. Iteratively assigns data points to the closest centroid. 3. Updates centroid positions based on the mean of assigned points. 4. Repeats until convergence or max iterations. | 1. Calculates the distance between each new data point and all the learned cluster centroids. 2. Assigns each new data point to the cluster with the nearest centroid. |
| Output | Modifies the KMeans object by learning: cluster_centers_ (centroid coordinates), labels_ (cluster assignments for the training data), inertia_ (WCSS), n_iter_ (number of iterations). | Returns a NumPy array of cluster labels (integers) for each data point in X_new, indicating which of the learned clusters they belong to. |
| Model State | Changes the state of the KMeans object by learning the cluster centers. | Does not change the state of the KMeans object; it only uses the already learned cluster_centers_. |
| When to Use | Used once to train the K-means model on your initial dataset. | Used after the model has been trained with fit() to classify new, unlabeled data; can be called multiple times with different sets of new data. |
| Key Outcome | Discovers the underlying cluster structure in the training data. | Determines the cluster membership of new data points based on the learned structure. |
         fit() is like the learning phase. You show the algorithm some examples (your training
          data), and it figures out the best way to group them into clusters by finding the centers of
          those groups.
         predict() is like the application phase. Once the algorithm has learned the groups (found
          the centers), you give it new, unlabeled items, and it tells you which of the learned groups
          each new item most likely belongs to based on its proximity to the centers.
Implementing K-Means Class
          K-Means is a fundamental unsupervised learning algorithm aimed at partitioning a dataset
into K distinct clusters. The core idea is to iteratively assign data points to the nearest cluster
centroid and then update these centroids based on the mean of the assigned points. This process
continues until the centroids stabilize or a predefined stopping criterion is met.
      K-Means is a popular unsupervised learning algorithm used for clustering data into K groups
based on feature similarity. When implementing it from scratch, we gain a full understanding of:
         Distance computation
         Cluster assignment
         Centroid updating
         Convergence condition
Implementing K-Means as a class offers several advantages:
         Encapsulation: It bundles the data (attributes like centroids and labels) and the operations
          (methods like fit and predict) related to the K-Means algorithm into a single unit,
          promoting code organization and reusability.
         Modularity: Different aspects of the algorithm, such as initialization, assignment, and
          update steps, are implemented as separate methods, making the code easier to understand,
          test, and modify.
             State Management: The class instance maintains the state of the trained model (e.g., the
              final centroids), allowing for prediction on new data after the model has been fitted.
             Extensibility: It provides a framework for adding more advanced features or variations of
              K-Means in the future.
Structure of the KMeans Class:
import numpy as np

class KMeans:
    def __init__(self, n_clusters=2, max_iters=100, tol=1e-4):
        ...
    def fit(self, X):
        ...
    def predict(self, X):
        ...
    def _compute_distances(self, X):
        ...
Detailed Explanation of Each Method:
1. def __init__(self, n_clusters=2, max_iters=100, tol=1e-4):
             Purpose: This is the constructor of the K-Means class. It's the first method that gets called
              when you create an instance (object) of the KMeans class. Its primary role is to initialize
              the attributes (data and settings) of the K-Means object.
             Parameters:
              o   self: This is a convention in Python object-oriented programming. It refers to the
                  instance of the class being created. You always include self as the first parameter in
                  instance methods.
              o   n_clusters (int, default=2): This parameter specifies the number of clusters that the
                  K-Means algorithm should try to find in the data. It has a default value of 2, meaning
                  if you don't explicitly provide a value when creating a KMeans object, it will attempt
                  to form 2 clusters.
              o   max_iters (int, default=100): This parameter defines the maximum number of
                  iterations the K-Means algorithm will run for a single execution of the fit method.
                  This acts as a safeguard to prevent the algorithm from running indefinitely if it doesn't
                  converge quickly.
              o   tol (float, default=1e-4): This parameter sets the tolerance for convergence. It's a
                  small positive value. The algorithm stops iterating when the change in the cluster
            centroids between consecutive iterations is less than this tolerance. This indicates that
            the centroids have stabilized and the clustering process has (likely) converged.
   2. def fit(self, X):
      Purpose: This method is the core of the K-Means algorithm. It takes the training data X
       as input and performs the iterative process of assigning data points to clusters and updating
       the cluster centroids until a stopping criterion (either reaching max_iters or the change in
       centroids being less than tol) is met. This process "fits" the K-Means model to the data,
       learning the cluster structure.
      Parameters:
           o    self: Refers to the instance of the KMeans class.
           o    X (NumPy array or similar): This is the training data. It's typically a 2D array
                where each row represents a data point, and each column represents a feature.
      Inside the fit method (indicated by ...):
           o    The fit method would typically involve the following steps:
                   1. Initialization of Centroids: Randomly select n_clusters data points from
                       X as initial centroids. More sophisticated initialization techniques like K-
                       Means++ might also be implemented here.
                   2. Iterative Process: Repeat the following until convergence or max_iters is
                       reached:
                           Assignment Step: Assign each data point in X to the cluster whose
                            centroid is closest (using a distance metric like Euclidean distance).
                           Update Step: Recalculate the centroids of each cluster by taking the
                            mean of all the data points assigned to that cluster
   3. def predict(self, X):
      Purpose: Once the K-Means model has been fitted to the training data using the fit method,
       the predict method can be used to assign new, unseen data points to the learned clusters.
       It does this by finding the closest centroid (learned during the fit phase) for each new data
       point.
      Parameters:
           o    self: Refers to the instance of the KMeans class.
           o    X (NumPy array or similar): This is the new data that you want to assign to clusters.
                It should have the same number of features as the training data used in fit.
      Inside the predict method (indicated by ...):
           o   The predict method would typically:
            1. Check if the model is fitted: Ensure that the fit method has been called previously
               and that self.centroids is not None.
            2. Calculate distances: For each data point in the input X, calculate the distance to
               each of the learned centroids stored in self.centroids.
            3. Assign labels: Assign each data point to the cluster corresponding to the nearest
               centroid. The index of the nearest centroid becomes the cluster label for that data
               point.
            4. Return the labels: Return a NumPy array containing the cluster labels for each
               data point in the input X.
   4. def _compute_distances(self, X):
      Purpose: This is a helper method (indicated by the leading underscore _). Helper methods
       are usually intended for internal use within the class and are not meant to be called directly
       by the user. The _compute_distances method is responsible for calculating the distances
       between the data points in X and the cluster centroids.
      Parameters:
           o   self: Refers to the instance of the KMeans class.
           o   X (NumPy array or similar): This is the data for which distances to the centroids
               need to be calculated. It could be the training data in the fit method or new data in
               the predict method.
      Inside the _compute_distances method (indicated by ...):
           o   This method would implement the chosen distance metric, most commonly the
               Euclidean distance.
           o   It would take the input data X and the current cluster centroids (likely stored in
               self.centroids) and calculate the distance between each data point in X and each
               centroid.
           o   The output would typically be a 2D array where each row corresponds to a data
               point in X, and each column corresponds to a centroid. The value at [i, j] would be
               the distance between the i-th data point and the j-th centroid.
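As a quick illustration of how such a helper can be written with NumPy broadcasting (a sketch; the shapes in the comments assume the six-point 2-D dataset below and two centroids):

import numpy as np

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])   # shape (6, 2)
centroids = np.array([[1.0, 2.0], [10.0, 2.0]])                     # shape (2, 2)

# X[:, np.newaxis] has shape (6, 1, 2); subtracting centroids broadcasts to (6, 2, 2).
# Taking the norm over the last axis gives a (6, 2) matrix of point-to-centroid distances.
distances = np.linalg.norm(X[:, np.newaxis] - centroids, axis=2)
print(distances.shape)             # (6, 2)
print(distances.argmin(axis=1))    # index of the nearest centroid for each point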
Python Program – Custom KMeans Class
import numpy as np

class MyKMeans:
    def __init__(self, n_clusters=2, max_iters=100, tol=1e-4):
        self.n_clusters = n_clusters
        self.max_iters = max_iters
        self.tol = tol
        self.centroids = None
        self.labels_ = None

    def fit(self, X):
        np.random.seed(42)  # for reproducibility
        # Initialize centroids by picking n_clusters random data points
        random_indices = np.random.choice(len(X), self.n_clusters, replace=False)
        self.centroids = X[random_indices]
        for _ in range(self.max_iters):
            # Assignment step: label each point with the index of its nearest centroid
            distances = self._compute_distances(X)
            labels = np.argmin(distances, axis=1)
            # Update step: recompute each centroid as the mean of its assigned points
            new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(self.n_clusters)])
            # Convergence check: stop when no centroid moved more than tol
            if np.all(np.linalg.norm(self.centroids - new_centroids, axis=1) < self.tol):
                break
            self.centroids = new_centroids
        self.labels_ = labels

    def predict(self, X):
        distances = self._compute_distances(X)
        return np.argmin(distances, axis=1)

    def _compute_distances(self, X):
        # Shape (n_samples, n_clusters): Euclidean distance of each point to each centroid
        return np.linalg.norm(X[:, np.newaxis] - self.centroids, axis=2)

# === Data ===
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

# === Using the Custom KMeans ===
model = MyKMeans(n_clusters=2)
model.fit(X)
print("Final Centroids:")
print(model.centroids)
print("\nCluster Labels:")
print(model.labels_)

# Predict new points
new_points = np.array([[0, 0], [12, 3]])
predictions = model.predict(new_points)
print("\nPredicted Clusters for new points:")
print(predictions)
Output:
Final Centroids:
[[ 1. 2.] [10. 2.]]
Cluster Labels:
[0 0 0 1 1 1]
Predicted Clusters for new points:
[0 1]
Unsupervised Learning - 2
How to choose Optimal K
What Is K in K-Means?
K-Means is an unsupervised learning algorithm used to group data into K distinct clusters based
on feature similarity.
 It minimizes the distance between data points and their assigned cluster center (centroid).
 It's called "K-means" because we define the number of clusters as K.
In K-Means clustering, K is the number of clusters the algorithm divides the data into.
Choosing the right K is essential to capture the natural structure in your data.
 If K is too small → underfitting
 If K is too large → overfitting or meaningless clusters
Objective:
Minimize WCSS (Within-Cluster Sum of Squares) — i.e., make clusters as compact as possible.
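As a small sketch of this objective, WCSS can be computed directly from a set of labels and centroids (the labels and centroids below are assumed to come from a finished K-Means run on the six-point dataset used in this unit):

import numpy as np

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
labels = np.array([0, 0, 0, 1, 1, 1])
centroids = np.array([[1.0, 2.0], [10.0, 2.0]])

# WCSS = sum of squared distances of each point to its assigned centroid
wcss = np.sum((X - centroids[labels]) ** 2)
print(wcss)   # 16.0 for this assignment (8 per cluster)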
                                       K-Means Algorithm
1. Initialization
       Choose number of clusters K.
       Randomly select K initial centroids (or use K-Means++ for better starting points).
2. Assignment Step
       For each data point:
             o   Compute distance to each centroid (usually Euclidean distance).
            o   Assign the point to the nearest centroid’s cluster.
3. Update Step
       For each cluster:
            o   Recalculate the centroid as the mean of all assigned points.
4. Repeat (Iterate)
       Repeat steps 2 and 3 until one of the following:
            o   Centroids stop moving (i.e., convergence).
            o   No change in cluster assignments.
            o   Max iterations reached.
 Key Metric:
       WCSS (Within-Cluster Sum of Squares) (Inertia):
                       WCSS = ∑ (distance of each point to its centroid)²
        Lower WCSS = better clustering.
Determine the optimal K
Selecting the appropriate number of clusters (K) is crucial for the effectiveness of the K-Means
algorithm. Several methods can be used to help determine the optimal K, including:
       Elbow Method: Plotting WCSS against the number of clusters and looking for an
        "elbow" point where the rate of decrease in WCSS sharply changes.
       Silhouette Analysis: Measuring how well each data point fits into its assigned cluster
        compared to other clusters.
       Gap Statistic: Comparing the within-cluster dispersion of the actual data to that of
        random data.
       Domain Knowledge: Utilizing prior understanding of the data to guide the choice of K.
Advantages of K-Means:
       Simple to understand and implement.
       Relatively efficient, especially for large datasets (with caveats).
       Often produces good results when clusters are well-separated and roughly spherical.
Disadvantages of K-Means:
       Sensitive to the initial placement of centroids.
       Assumes clusters are spherical and equally sized.
       Requires specifying the number of clusters (K) beforehand.
       Sensitive to outliers.
       Not guaranteed to find the global optimum.
Elbow Method
       The Elbow Method is a heuristic technique used in cluster analysis, particularly with the
K-Means algorithm, to determine the optimal number of clusters (K) for a given dataset.
       The Elbow Method is a graphical approach to finding the most appropriate number of
clusters (K) in K-Means clustering by evaluating the model’s performance for different K values.
       It's a visual method that helps in making an informed decision about the value of K by
observing the change in Within-Cluster Sum of Squares (WCSS) as K increases.
1. Understanding Within-Cluster Sum of Squares (WCSS):
      Definition: WCSS measures the sum of the squared Euclidean distances between each data
       point within a cluster and the centroid (mean) of that cluster.
      Interpretation:
           o   A lower WCSS indicates that the data points within each cluster are closer to their
               respective centroids, suggesting that the clusters are more compact and internally
               cohesive.
           o   A higher WCSS implies that the data points are more spread out within the clusters.
      Goal of K-Means: The K-Means algorithm aims to minimize the total WCSS across all
       clusters.
Key Concept
When you increase the number of clusters:
      Inertia (WCSS) decreases — because points are closer to their centroids.
      But after a point, adding more clusters gives diminishing returns.
As you increase K:
      The inertia (within-cluster sum of squares) decreases.
      But the rate of decrease slows down after a certain point.
      The point where the rate changes sharply (like an elbow) is the optimal K.
The Elbow Method helps us find this point where adding another cluster doesn’t significantly
improve the model — it forms an “elbow” in the graph of K vs. WCSS.
Terminology
      K = Number of clusters.
      WCSS (Within-Cluster Sum of Squares) = Sum of squared distances between each point
       and its assigned centroid.
      Inertia = another name for WCSS (used in sklearn).
        Inertia (WCSS) Formula
               WCSS = ∑ (i = 1 to K) ∑ (x ∈ Ci) ‖x − μi‖²
Where:
        K = number of clusters
        Ci = set of points in cluster i
        μi = centroid of cluster i
        ‖x − μi‖² = squared Euclidean distance
When Does the Elbow Occur?
        The elbow occurs where adding more clusters doesn’t significantly improve the WCSS.
        It's subjective — sometimes the elbow isn't clearly visible.
2. The Intuition behind the Elbow Method:
        High WCSS at K=1: All data in one cluster leads to high within-cluster variance.
        WCSS Decreases with K: More clusters mean data points are closer to centroids → WCSS
         drops significantly.
        Diminishing Returns: After a certain K, adding more clusters results in only small WCSS
         reductions — indicating possible overfitting.
        Elbow Point: The bend in the WCSS vs. K plot marks the optimal number of clusters —
         where improvement slows down.
                                       Algorithm Elbow Method:
   1. Choose a Range of K Values: Decide on a reasonable range of K to test. This typically
         starts from 1 and goes up to a certain number (e.g., 10 or more), depending on the size and
         expected complexity of your dataset.
   2. Run K-Means for Each K: For each value of K in the chosen range:
             o   Initialize and run the K-Means algorithm on your data. Ensure the algorithm is
                 allowed to converge (reach a stable state or a maximum number of iterations).
              o   After the algorithm converges, calculate and record the WCSS value. Most K-Means
                  implementations (like scikit-learn's) expose the WCSS as the inertia_ attribute.
   3. Store the WCSS Values: Keep a record (e.g., a list or array) of the WCSS values
         corresponding to each K value tested.
   4. Plot the Results: Create a line plot where:
             o   The x-axis represents the number of clusters (K).
             o   The y-axis represents the corresponding WCSS value.
   5. Identify the Elbow Point: Visually inspect the plot. Look for the point where the rate of
         decrease in the WCSS starts to slow down significantly, forming an "elbow" shape.
    6. Determine the Optimal K: The K value at the "elbow point" is often considered a good
        estimate for the optimal number of clusters for your data.
Interpreting the Elbow Plot:
       A steep decline in WCSS indicates that adding more clusters significantly improves the
        clustering by reducing the within-cluster variance.
       The "elbow" is the point where this decrease starts to become less dramatic.
       After the elbow point, adding more clusters results in only marginal reductions in WCSS,
        suggesting that the additional complexity might not be justified by the improvement in
        cluster cohesion.
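A minimal sketch of this procedure with scikit-learn and matplotlib (both assumed to be available); the WCSS for each K is read from the inertia_ attribute and then plotted:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

wcss = []
k_values = range(1, 6)
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)          # WCSS for this value of K

plt.plot(list(k_values), wcss, marker='o')
plt.xlabel('Number of clusters K')
plt.ylabel('WCSS (inertia)')
plt.title('Elbow Method')
plt.show()

For this dataset the curve drops sharply from K = 1 to K = 2 and only slightly afterwards, which is exactly the elbow identified in the worked example below.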
Eg: Solve [1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0] using elbow method
The given dataset:
Data points: x1=[1,2] x2=[1,4] x3=[1,0] x4=[10,2] x5=[10,4] x6=[10,0]
Calculate the Within-Cluster Sum of Squares (WCSS) for different values of K (number of
clusters).
    Step 1: K = 1 There is only one cluster containing all the data points.
    The centroid of this cluster is the mean of all points:
               Centroid = ((1+1+1+10+10+10)/6, (2+4+0+2+4+0)/6) = (5.5, 2.0)
WCSS (sum of squared distances from each point to the centroid) is computed with the Euclidean
distance formula:
               d(p, q) = √((xp − xq)² + (yp − yq)²)
Distances to Centroid (5.5, 2.0):
 Point       Calculation                                   Distance      Squared
 A (1,2)     √((1−5.5)² + (2−2)²) = √20.25                 4.5           20.25
 B (1,4)     √((1−5.5)² + (4−2)²) = √24.25                 4.924         24.25
 C (1,0)     √((1−5.5)² + (0−2)²) = √24.25                 4.924         24.25
 D (10,2)    √((10−5.5)² + (2−2)²) = √20.25                4.5           20.25
 E (10,4)    √((10−5.5)² + (4−2)²) = √24.25                4.924         24.25
 F (10,0)    √((10−5.5)² + (0−2)²) = √24.25                4.924         24.25
WCSS for K=1 is the sum of these squared distances:
WCSS1 = 20.25 + 24.25 + 24.25 + 20.25 + 24.25 + 24.25 = 137.5
Step 2: K = 2
Clusters:
   Let's assume the initial centroids are C2a= [1, 2] and C2b= [10, 2] (we could choose randomly,
    but this might lead to different initial clusters).
   Assignment Step: Assign each point to the nearest centroid.
         Cluster 1: A, B, C → Centroid = C2a (1, 2)
         Cluster 2: D, E, F → Centroid = C2b (10, 2)
Distances to C2a (1, 2):
 Point       Calculation                             Distance   Squared   Cluster
 A (1,2)     √((1−1)² + (2−2)²) = √0                 0          0         C2a
 B (1,4)     √((1−1)² + (4−2)²) = √4                 2          4         C2a
 C (1,0)     √((1−1)² + (0−2)²) = √4                 2          4         C2a
 D (10,2)    √((10−1)² + (2−2)²) = √81               9          81        C2b
 E (10,4)    √((10−1)² + (4−2)²) = √85               9.21       85        C2b
 F (10,0)    √((10−1)² + (0−2)²) = √85               9.21       85        C2b
                  WCSS for Cluster 1 (points A, B, C assigned to C2a): WCSS1 = 0 + 4 + 4 = 8
Distances to C2b (10, 2):
 Point       Calculation                             Distance   Squared   Cluster
 A (1,2)     √((1−10)² + (2−2)²) = √81               9          81        C2a
 B (1,4)     √((1−10)² + (4−2)²) = √85               9.21       85        C2a
 C (1,0)     √((1−10)² + (0−2)²) = √85               9.21       85        C2a
 D (10,2)    √((10−10)² + (2−2)²) = √0               0          0         C2b
 E (10,4)    √((10−10)² + (4−2)²) = √4               2          4         C2b
 F (10,0)    √((10−10)² + (0−2)²) = √4               2          4         C2b
                  WCSS for Cluster 2 (points D, E, F assigned to C2b): WCSS2 = 0 + 4 + 4 = 8
                        Calculate WCSS for K=2 Total WCSS = 8 + 8 = 16
Update Step: Calculate new centroids.
        C1′=[(1+1+1)/3,(2+4+0)/3]=[1,2]
        C2′=[(10+10+10)/3,(2+4+0)/3]=[10,2]
The centroids did not change in the first iteration, so the clustering is stable.
Step 3: K = 3
Let's assume initial centroids C3a=[1,2], C3b=[10,2], C3c=[1,4].
 Point       Squared distances to the centroids                            Distance   Squared   Cluster
 A (1,2)     Nearest to C3a (distance 0) -> Cluster 1                      0          0         C3a
 B (1,4)     Nearest to C3c (distance 0) -> Cluster 3                      0          0         C3c
 C (1,0)     d(C,C3a)²=4, d(C,C3b)²=85, d(C,C3c)²=16 -> Cluster 1          2          4         C3a
 D (10,2)    Nearest to C3b (distance 0) -> Cluster 2                      0          0         C3b
 E (10,4)    d(E,C3a)²=85, d(E,C3b)²=4, d(E,C3c)²=81 -> Cluster 2          2          4         C3b
 F (10,0)    d(F,C3a)²=85, d(F,C3b)²=4, d(F,C3c)²=97 -> Cluster 2          2          4         C3b
Update Step: Calculate new centroids.
        C1′=[(1+1)/2,(2+0)/2]=[1,1]
        C2′=[(10+10+10)/3,(2+4+0)/3]=[10,2]
        C3′=[1,4]
The centroids changed. We would need to iterate again until convergence. However, for the sake
of the elbow method, let's calculate the WCSS based on this first assignment and update.
                      Calculate WCSS for K=3 Total WCSS = 4 + 4 + 4 = 12
Plotting the WCSS values:
                                              Elbow Method
        K      WCSS
        1      137.5
        2      16
        3      12
        [Fig: Elbow plot of WCSS vs. K]
       If we were to plot these values, we would see a significant drop in WCSS from K=1 to
K=2. The drop from K=2 to K=3 is also noticeable but smaller. As K increases further, the WCSS
will continue to decrease, reaching 0 at K=6.
Identifying the Elbow:
       The "elbow" in this scenario is likely at K = 2. The most significant reduction in WCSS
occurs when moving from 1 to 2 clusters. The subsequent reduction when moving to 3 clusters is
less dramatic compared to the initial drop. After K=2, the benefit of adding more clusters in terms
of reducing WCSS diminishes.
Final Clusters: K = 2
Clusters:
   Let's consider centroids are C2a= [1, 2] and C2b= [10, 2]
       Centroid = C2a (1 , 2) → Cluster 1: A, B, C
       Centroid = C2b (10, 2) → Cluster 2: D, E, F
Silhouette algorithm to choose K
        The silhouette method is a powerful validation technique for unsupervised learning. It is
used in cluster analysis to interpret and validate the consistency within clusters of data. It provides
a graphical representation of how well each object lies within its cluster. The method was
introduced by the Belgian statistician Peter Rousseeuw in 1987. The Silhouette Algorithm is a
technique used to:
           Measure the quality of clustering.
           Evaluate how well each data point fits in its assigned cluster.
           Determine how similar a data point is to its own cluster vs other clusters.
It uses the Silhouette Score (or coefficient), which quantifies how well a point fits within its own
cluster compared to others.
                                     The Silhouette Algorithm
Step 1: Cluster the data
        Use any clustering algorithm (e.g., KMeans) to assign data points to clusters.
Step 2: For each data point i:
(a) Compute intra-cluster distance a(i):
        This is the average distance from point i to all other points in the same cluster.
(b) Compute nearest-cluster distance b(i):
        For each of the other clusters, compute the average distance from point i to all points in
         that cluster.
        b(i) is the minimum of those average distances.
(c) Compute the Silhouette Coefficient for point i:
               s(i) = (b(i) − a(i)) / max(a(i), b(i))
Step 3: Repeat Step 2 for all data points
        You will get one silhouette score per data point.
 Step 4: Compute the Average Silhouette Score
               s̄ = (1/n) ∑ s(i)
Where:
        s(i) is the silhouette score of the i-th point.
        n is the total number of data points.
        s̄ is the mean (average) silhouette score for the entire dataset.
    This gives the overall performance of the clustering.
    Higher s̄ means better clustering.
Solve clustering problem using the Silhouette Method to determine the best number of clusters
K for the dataset Points: [1,2],[1,4],[1,0],[10,2],[10,4],[10,0]
    1. Determine the Range of K to Test:
        The minimum number of clusters is K=2.
        The maximum number of clusters can be up to N (number of data points), but that's
         usually not meaningful (each point is its own cluster). A reasonable upper limit is N-1, or
         even less. Here, N=6. Let's test K=2, K=3, K=4, K=5. K=6 would give a Silhouette score
         of 0 or be undefined depending on the implementation, as there are no "other clusters".
    2. Choose a Clustering Algorithm: The Silhouette score evaluates clustering results, but it
         doesn't perform the clustering itself. We need a clustering algorithm. K-Means is the
         most common choice.
    3. Choose a Distance Metric: Euclidean distance is standard for K-Means and Silhouette
        analysis with this type of data.
       Distance formula: d(p, q) = √((xp − xq)² + (yp − yq)²)
   4. Perform Calculations for each K:
      Case K=2:
           o   Clustering (K-Means): By visual inspection or simple calculation, the data
               clearly splits into two groups: {P1, P2, P3} (around x=1)
           o   and {P4, P5, P6} (around x=10).
           o   Let's assume K-Means finds these clusters:
                      Cluster 1 (C1): {[1, 2], [1, 4], [1, 0]}
                      Cluster 2 (C2): {[10, 2], [10, 4], [10, 0]}
           o   Calculate Silhouette Scores for each point:
Intra-cluster distance a(P1)
                       Point P1 = [1, 2]:
      Compute the distance from P1 to the other points in its own cluster (Cluster A):
      d(P1, P2) = √((1−1)² + (2−4)²) = 2
      d(P1, P3) = √((1−1)² + (2−0)²) = 2
        a(P1) = average distance from P1 to all other points in the same cluster
               a(P1) = (2 + 2)/2 = 2.0
Nearest-cluster distance b(P1)
Compute the distance from P1 to each point in the other cluster (Cluster B):
      d(P1, P4) = √((1−10)² + (2−2)²) = 9
      d(P1, P5) = √((1−10)² + (2−4)²) = √85 ≈ 9.22
      d(P1, P6) = √((1−10)² + (2−0)²) = √85 ≈ 9.22
               b(P1) = (9 + 9.22 + 9.22)/3 ≈ 9.15
Silhouette Score s(P1)
               s(P1) = (b(P1) − a(P1)) / max(a(P1), b(P1)) = (9.15 − 2.0)/9.15 ≈ 0.78
Repeating the same calculation for every point gives:
 Point        Cluster     a(i)     b(i)      s(i)
 P1 [1,2]     A           2.0      9.15      0.78
 P2 [1,4]     A           3.0      9.36      0.68
 P3 [1,0]     A           3.0      9.36      0.68
 P4 [10,2]    B           2.0      9.15      0.78
 P5 [10,4]    B           3.0      9.36      0.68
 P6 [10,0]    B           3.0      9.36      0.68
Average Silhouette Score:             (0.78 + 0.68 + 0.68 + 0.78 + 0.68 + 0.68 ) / 6 = 4.28 / 6
                                  = 0.71
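The hand calculation can be cross-checked with scikit-learn's silhouette_score (a sketch; K = 6 is omitted because the silhouette is not defined when every point is its own cluster):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 2))
# For K = 2 the score is about 0.71, matching the manual result above.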
    K       Average Silhouette Score
    2       0.71
    3       0.55
    4       0.4
    5       0.32
    6       0
Plotting these Silhouette Scores for K = 2 to K = 6, the highest score occurs at K = 2, confirming it
as the optimal number of clusters for this dataset.
          For K = 2, the average silhouette score is ~0.71, which is very good.
          It confirms that the two clusters are well separated and compact.
Introduction to K Medoids
           In clustering, a medoid is a data point within a cluster that minimizes the average
dissimilarity (or distance) to all other points in the same cluster. It's essentially the most central
object in a cluster, acting as a representative point for the cluster. Unlike K-means, which uses the
mean (average) of all points in a cluster as the cluster center, K-medoids uses the actual medoid.
This makes K-medoids more robust to outliers, as it's not influenced by extreme values.
          The K-Medoids algorithm (also called Partitioning Around Medoids, PAM) was proposed in 1987
by Kaufman and Rousseeuw. A medoid can be defined as a point in the cluster whose dissimilarity
with all the other points in the cluster is minimum.
       Definition:
       A medoid is the point within a cluster for which the sum of distances to all other points in
the cluster is minimum.
        The dissimilarity between the medoid (Ci) and an object (Pi) is calculated as E = |Pi − Ci|.
        The total cost in the K-Medoids algorithm is the sum of the dissimilarities of every point to
the medoid of its cluster:
               Cost = ∑ (over all clusters) ∑ (Pi in the cluster) |Pi − Ci|
Types of K-Medoids
    1. PAM (Partitioning Around Medoids) – original algorithm, good for small datasets.
       o Working:
          Select k initial medoids.
          Iteratively swap medoids with non-medoids to minimize cost.
          Stops when no improvement is possible.
    2. CLARA (Clustering LARge Applications) – sampling-based, faster for large datasets.
       o Working:
          Repeatedly take a random sample from the dataset.
          Apply PAM on each sample.
          Choose the clustering with the lowest total cost over the entire dataset.
    3. CLARANS (Clustering Large Applications based upon RANdomized Search) –
        random search in neighbour space.
       o Working:
          Randomly explore the neighbouring medoid configurations (not all).
          Accept new medoid if it reduces cost.
          Repeat for multiple iterations to avoid local optima.
    Distance Metrics: Can use Euclidean, Manhattan, or custom distance functions.
    Medoids: Must be actual data points, unlike centroids in K-Means.
    Convergence: Typically reached when medoids no longer change.
K Medoids Algorithm
Algorithm:
    1. Initialize: select k random points out of the n data points as the medoids.
    2. Associate each data point to the closest medoid by using any common distance metric
        methods.
    3. While the cost decreases: for each medoid m and for each non-medoid data point o:
                Swap m and o, associate each data point to the closest medoid, and recompute the
                 cost.
                If the total cost is more than that in the previous step, undo the swap.
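The steps above can be sketched as a simplified PAM-style implementation in NumPy (illustrative only, not an optimized library routine; Manhattan distance is used so that it matches the worked example that follows):

import numpy as np

def manhattan(a, b):
    # Manhattan (L1) distance between two points
    return np.abs(a - b).sum()

def total_cost(X, medoid_idx):
    # Distance of every point to every medoid, shape (n_points, k)
    dists = np.array([[manhattan(x, X[m]) for m in medoid_idx] for x in X])
    # Cost = sum of each point's distance to its nearest medoid
    return dists.min(axis=1).sum(), dists.argmin(axis=1)

def k_medoids(X, k=2, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(X), k, replace=False))    # step 1: random initial medoids
    cost, labels = total_cost(X, medoids)                    # step 2: initial assignment
    for _ in range(max_iter):                                # step 3: swap while cost decreases
        improved = False
        for i in range(k):
            for o in range(len(X)):
                if o in medoids:
                    continue
                candidate = medoids.copy()
                candidate[i] = o                             # try swapping medoid i with point o
                new_cost, new_labels = total_cost(X, candidate)
                if new_cost < cost:                          # keep the swap only if it lowers the cost
                    medoids, cost, labels = candidate, new_cost, new_labels
                    improved = True
        if not improved:
            break
    return X[medoids], labels, cost

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
medoids, labels, cost = k_medoids(X, k=2)
print(medoids)          # expected: the points [1, 2] and [10, 2] (in some order)
print(labels, cost)     # total cost 8 for this dataset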
Find the medoids using the K-Medoids algorithm for the dataset of points [1,2], [1,4], [1,0], [10,2],
[10,4], [10,0], using the Manhattan distance.
We'll solve for K = 2, using the Manhattan distance:
Manhattan Distance = |x1 − x2| + |y1 − y2|,   where | · | denotes the absolute value.
    1. Choose Initial Medoids
       Let's pick: Medoid A = P1 = [1,2]            Medoid B = P4 = [10,2]
    2. Assign Each Point to Nearest Medoid (Using Manhattan Distance)
       Points P1, P2, P3 are closest to medoid A = [1,2]; points P4, P5, P6 are closest to medoid B = [10,2].
    3. Check whether any point in Cluster A would make a better medoid, by computing the total
       Manhattan distance within the cluster for each candidate:

                 Point        d(·, P1=[1,2])            d(·, P2=[1,4])            d(·, P3=[1,0])
                 P1 [1,2]     0                         |1−1| + |2−4| = 2         |1−1| + |2−0| = 2
                 P2 [1,4]     |1−1| + |4−2| = 2         0                         |1−1| + |4−0| = 4
                 P3 [1,0]     |1−1| + |0−2| = 2         |1−1| + |0−4| = 4         0
                 Total (sum)  2 + 2 = 4                 2 + 4 = 6                 2 + 4 = 6

       The smallest total (4) belongs to P1, so the medoid for Cluster A remains P1 = [1,2].
New Clusters
      Cluster A: P1, P2, P3
       Similarly, for Cluster B, compute the total Manhattan distance for each candidate medoid:

                 Point        d(·, P4=[10,2])             d(·, P5=[10,4])             d(·, P6=[10,0])
                 P4 [10,2]    0                           |10−10| + |2−4| = 2         |10−10| + |2−0| = 2
                 P5 [10,4]    |10−10| + |4−2| = 2         0                           |10−10| + |4−0| = 4
                 P6 [10,0]    |10−10| + |0−2| = 2         |10−10| + |0−4| = 4         0
                 Total (sum)  2 + 2 = 4                   2 + 4 = 6                   2 + 4 = 6
The new medoid for Cluster B is P4 = [10,2]
New Clusters
          Cluster B: P4, P5, P6
Since the medoids didn’t change, the algorithm converged.
Final Clustering (K = 2, Manhattan Distance)
 Cluster     Points                        Medoid
 A           [1,2], [1,4], [1,0]           [1,2]
 B           [10,2], [10,4], [10,0]        [10,2]
Introduction to Hierarchical Clustering
           Hierarchical clustering is an unsupervised machine learning algorithm that groups data into
a tree of nested clusters. The main types include agglomerative and divisive. Hierarchical cluster
analysis helps find patterns and connections in datasets. Results are presented in a dendrogram
diagram showing the distance relationships between clusters.
           Clustering is an unsupervised machine learning technique used in data analysis to detect
and group similar objects. Hierarchical cluster analysis (HCA), or hierarchical clustering, groups
objects into a cluster hierarchy without enforcing a linear order within them. Many disciplines,
such as biology, image analysis and the social sciences, use hierarchical clustering methods to
explore and recognize patterns in datasets. Use cases include categorizing populations in clinical
research, customer segmentation and detecting communities of nodes in network models.
There are two types of hierarchical clustering:
1.       Agglomerative or bottom-up approach that repeatedly merges clusters into larger ones
     until a single cluster emerges.
2. Divisive or top-down approach that starts with all data in a single cluster and continues to
     split out successive clusters until all clusters are singletons.
    Hierarchical clustering analysis has high computational costs. While using a heap can reduce
computation time, memory requirements are increased. Both the divisive and agglomerative types
of clustering are "greedy," meaning that the algorithm decides which clusters to merge or split by
making the locally optimal choice at each stage of the process. It is also possible to apply a stop
criterion, where the algorithm stops agglomeration or splitting clusters when it reaches a
predetermined number of clusters.
     In hierarchical clustering, the aim is to produce a hierarchical series of nested clusters. A
diagram called a dendrogram (a tree-like diagram that records the sequence of merges or splits)
graphically represents this hierarchy; it is an inverted tree that is often used to visualize the
hierarchy of clusters. It displays the order in which clusters have been merged or divided and
shows the similarity or distance between data points. A dendrogram can also be understood as a
nested list of lists with attributes.
                                            Fig: Dendrogram
                                  Fig: Hierarchical Clustering approach
Distance (Similarity) Metrics in Hierarchical Clustering
       In hierarchical clustering, we need to measure how similar or dissimilar two data points (or
clusters) are. This is done using distance metrics, which play a crucial role in determining which
clusters should be merged (or split) during the algorithm.
Here are some commonly used metrics:
a) Euclidean Distance (Most Common)
      It is the straight-line distance between two points in Euclidean space.
      Formula between two points A = (x1, y1) and B = (x2, y2):
               d(A, B) = √((x2 − x1)² + (y2 − y1)²)
b) Manhattan Distance (Taxicab or L1 Distance)
      It is the sum of absolute differences between the coordinates of the points.
      Formula:
               d(A, B) = |x2 − x1| + |y2 − y1|
c) Cosine Similarity / Cosine Distance
      Measures the angle between two vectors (not the distance).
      Often used in text clustering and high-dimensional spaces.
      Cosine similarity is:
               cos(θ) = (A · B) / (‖A‖ ‖B‖),   and cosine distance = 1 − cos(θ)
Linkage Criteria in Hierarchical Clustering
        Once we decide how to measure the distance between individual data points, we also need to
decide how to measure the distance between entire clusters during the clustering process.
This is where linkage criteria come in: they define how to compute the distance between two
clusters based on the distances between the points inside them.
Linkage Types
a) Single Linkage (Minimum Linkage)
      Distance between two closest points (one from each cluster).
      Formula:
       D(A,B)=min{d(a,b) ∣ a∈A,b∈B}
      Tends to form long, "chain-like" clusters.
      Can lead to “chaining effect”, which may not be ideal for compact clusters.
b) Complete Linkage (Maximum Linkage)
      Distance between two farthest points (one from each cluster).
      Formula:
       D(A,B)=max{d(a,b) ∣ a∈A,b∈B}
      Produces more compact, spherical clusters.
      Less sensitive to outliers than single linkage.
c) Average Linkage (UPGMA – Unweighted Pair Group Method with Arithmetic Mean)
      Takes the average of all pairwise distances between points from the two clusters.
      Formula:
        D(A,B) = (1 / (|A|·|B|)) ∑ a∈A ∑ b∈B d(a,b)
      Balanced approach between single and complete linkage.
      Tends to produce moderate-sized clusters.
d) Centroid Linkage
      Computes the distance between the centroids (mean vectors) of the clusters.
      Formula:
       D(A,B)=d(μA,μB)
       where μA and μB are centroids of clusters A and B.
      May cause inversions in the dendrogram (not always monotonic).
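These linkage criteria can be tried directly with SciPy's hierarchical clustering routines (a minimal sketch on the same six points; SciPy and matplotlib are assumed to be available, and 'single', 'complete', 'average' and 'centroid' correspond to the linkage types described above):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

# Agglomerative clustering with a chosen linkage criterion (Euclidean distance by default)
Z = linkage(X, method='average')    # try 'single', 'complete' or 'centroid' as well
dendrogram(Z, labels=['P1', 'P2', 'P3', 'P4', 'P5', 'P6'])
plt.title('Dendrogram (average linkage)')
plt.ylabel('Distance')
plt.show()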
Bottom up/Agglomerative Approach
    Initially, consider every data point as an individual cluster and, at every step, merge the nearest
pair of clusters (it is a bottom-up method). At first, every data point is considered an individual
entity or cluster. At every iteration, the clusters merge with other clusters until one cluster is
formed.
    The algorithm for Agglomerative Hierarchical Clustering is:
   1.   Calculate the similarity of one cluster with all the other clusters
                (Calculate proximity matrix)
   2.   Consider every data point as an individual cluster
   3.   Merge the clusters which are highly similar or close to each other.
   4.   Recalculate the proximity matrix for each cluster
   5.   Repeat Steps 3 and 4 until only a single cluster remains.
Top down/Divisive Approach
    Divisive clustering is also known as the top-down approach. This algorithm also does not
require pre-specifying the number of clusters. Divisive clustering works just the opposite of
agglomerative clustering. It starts by considering all the data points into a big single cluster and
later on splitting them into smaller heterogeneous clusters continuously until all data points are in
their own cluster. Thus, they are good at identifying large clusters. It follows a top-down approach
and is more efficient than agglomerative clustering. But, due to its complexity in implementation,
it doesn’t have any predefined implementation in any of the major machine learning frameworks.
    The algorithm for Divisive Hierarchical Clustering is (a code sketch follows below):
           1. Consider all the data points as a single cluster.
           2. Split a cluster into two using any flat-clustering method, say k-means.
           3. Choose the best cluster among the current clusters to split further; choose the one that
              has the largest Sum of Squared Error (SSE).
           4. Repeat steps 2 and 3 until every data point is in its own cluster, or until the desired
              number of clusters is reached.
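Here is a rough sketch of that idea, using scikit-learn's KMeans as the flat splitter and the SSE of each cluster to decide which one to split next (an illustrative bisecting scheme under these assumptions, not a full divisive implementation):

import numpy as np
from sklearn.cluster import KMeans

def sse(cluster):
    # Sum of squared errors of a cluster around its own mean
    return ((cluster - cluster.mean(axis=0)) ** 2).sum()

def divisive(X, n_clusters=3):
    clusters = [X]                                    # start with all points in one big cluster
    while len(clusters) < n_clusters:
        # choose the cluster with the largest SSE to split further
        idx = int(np.argmax([sse(c) for c in clusters]))
        target = clusters.pop(idx)
        labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(target)
        clusters.append(target[labels == 0])
        clusters.append(target[labels == 1])
    return clusters

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
for i, c in enumerate(divisive(X, n_clusters=3)):
    print(f"Cluster {i}: {c.tolist()}")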
Principal Component Analysis PCA
        Principal component analysis, or PCA, reduces the number of dimensions in large datasets
to principal components that retain most of the original information. It does this by transforming
potentially correlated variables into a smaller set of variables, called principal components.
        Karl Pearson is credited with the development of PCA in 1901
        PCA is commonly used for data pre-processing for use with machine learning algorithms.
It can extract the most informative features from large datasets while preserving the most relevant
information from the initial dataset. This reduces model complexity as the addition of each new
feature negatively impacts model performance, which is also commonly referred to as the “curse
of dimensionality.”
        PCA creates new variables, called principal components, that are linear combinations of
the original variables. PCA takes a dataset with multiple variables as input and projects it into a
lower-dimensional subspace, that is, it produces a reduced dataset with fewer variables.
       K-means is a clustering algorithm that assigns data points to clusters based on their distance
from the cluster centers. It takes a dataset with one or more variables as input, and it produces a
set of clusters with similar data points.
Intuition behind PCA
       Principal component analysis (PCA) is a dimensionality reduction and machine learning
method used to simplify a large data set into a smaller set while still maintaining significant
patterns and trends.
       Principal component analysis can be broken down into five steps.
       1. Standardize the range of continuous initial variables
       2. Compute the covariance matrix to identify correlations
       3. Compute the eigenvectors and eigenvalues of the covariance matrix to identify the
           principal components
       4. Create a feature vector to decide which principal components to keep
       5. Recast the data along the principal components axes
How Principal Component Analysis (PCA) Works
       Principal Component Analysis (PCA) is a statistical technique used to reduce the
dimensionality of large datasets by transforming them into a smaller set of uncorrelated variables
called principal components. These components are linear combinations of the original variables
and are chosen to maximize variance, capturing the most significant patterns in the data.
       PCA relies on concepts from linear algebra and matrix operations. It transforms the original
data into a new coordinate system defined by the principal components. This transformation is
based on the eigenvectors and eigenvalues of the data's covariance matrix. The eigenvectors
indicate the directions of maximum variance (i.e., the new axes), while the eigenvalues reflect the
magnitude of variance in those directions—essentially measuring how important each direction is.
       To visualize this, imagine a multi-dimensional scatterplot. In this plot, the eigenvectors
show the directions in which the data varies the most, and the eigenvalues tell us how much
variance exists along those directions. A larger eigenvalue means its associated eigenvector (or
principal component) captures more information. Thus, principal components are the eigenvectors
of the covariance matrix that correspond to the largest eigenvalues.
In PCA, the first two principal components are especially important:
First Principal Component (PC1)
        PC1 is the direction in the data space along which the variance is greatest. It represents the
line that best captures the overall spread or structure of the dataset. Since it contains the most
variance, PC1 retains the maximum amount of information from the original data. No other
component can capture more variance than PC1.
Second Principal Component (PC2)
        PC2 is calculated similarly to PC1 but with a critical constraint: it must be uncorrelated
with PC1. In mathematical terms, PC2 must be orthogonal (perpendicular) to PC1. It captures the
second-highest amount of variance in the dataset after PC1. The lack of correlation between PC1
and PC2 ensures that each component contributes unique information.
When PCA is applied, a scatterplot of PC1 versus PC2 is often used to visualize the data in two
dimensions. Since PC1 and PC2 are orthogonal, the axes in the plot are at right angles, helping to
reveal structure, patterns, or clusters in the original high-dimensional data.
Applying PCA to 2D data
        PCA (Principal Component Analysis) is a dimensionality reduction technique that can be
applied to data in order to find a lower-dimensional representation that captures most of the
variability in the data.
        When working with 2D data, PCA can be applied to identify the directions of maximum
variance in the data and to project the data onto these directions. In this way PCA can convert 3D
data into a 2D representation, or even 2D data into a one-dimensional representation. A common
goal of principal component analysis is to visualize data in a lower-dimensional space and to
identify patterns or clusters in the data.
Here are the steps to apply PCA to 2D data:
1. Collect the 2D data that you want to analyze. For example, this data could represent the heights
and weights of a group of people, or the scores on two different tests for a group of students.
2. Calculate the mean of each dimension (column) in the data and subtract it from the data. This
centers the data, which is required for the covariance computation in the next step.
3. Calculate the covariance matrix of the data. The covariance matrix will give you information
about the relationships between the two dimensions. The covariance matrix is calculated as
follows:
                           covariance_matrix = 1/n X.T @ X
        where X is the mean-centered data matrix (with n rows and 2 columns), and @ represents
matrix multiplication.
4. Calculate the eigenvectors and eigenvalues of the covariance matrix. The eigenvectors represent
the directions of maximum variance in the data, and the corresponding eigenvalues represent the
amount of variance that is explained by each eigenvector.
5. Choose the top k eigenvectors that explain the most variance in the data. This will depend on
your specific data and analysis goals. For example, you might choose the top 1 or 2 eigenvectors
to create a 1D or 2D representation of the data.
6. Project the data onto the selected eigenvectors. This can be done by multiplying the data matrix
by the matrix of selected eigenvectors:
                             projected_data = X @ selected_eigenvectors
        where X is the original data matrix (with n rows and 2 columns), and selected_eigenvectors
is the matrix of selected eigenvectors (with 2 rows and k columns).
7. Visualize the projected data in the lower-dimensional space. This can be done using a scatter
plot or other visualization technique.
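The seven steps above can be sketched in a few lines of NumPy (illustrative only; the small 2-D dataset is made up for the example):

import numpy as np

# 1. A small 2-D dataset (e.g. two correlated measurements)
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
              [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1]])

# 2. Center the data by subtracting the mean of each column
X_centered = X - X.mean(axis=0)

# 3. Covariance matrix of the centered data
n = X_centered.shape[0]
cov = (X_centered.T @ X_centered) / n

# 4. Eigenvectors and eigenvalues of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)       # eigh: suitable for symmetric matrices

# 5. Sort by decreasing eigenvalue and keep the top k eigenvectors
order = np.argsort(eigvals)[::-1]
k = 1
selected = eigvecs[:, order[:k]]             # shape (2, k)

# 6. Project the centered data onto the selected eigenvectors
projected = X_centered @ selected            # shape (n, k)

# 7. Inspect the result: variance explained by each component and the 1-D projection
print(eigvals[order] / eigvals.sum())
print(projected.ravel())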
Applying PCA on 3D data
        PCA (Principal Component Analysis) is a dimensionality reduction technique that can be
applied to data in order to find a lower-dimensional representation that captures most of the
variability in the data.
        When working with 3D data, PCA can be applied to identify the directions of maximum
variance in the data and project the data onto these directions. This can be useful for visualizing
the data in a lower-dimensional space or for identifying patterns or clusters in the data.
        Here are the Steps to Apply PCA to 3D Data:
1. Collect the 3D data that you want to analyze. For example, this data could represent the heights,
weights, and ages of a group of people, or the coordinates of points in a 3D space.
2. Calculate the mean of each dimension (column) in the data and subtract it from the data. This
centers the data, which is required for the covariance computation in the next step.
3. Calculate the covariance matrix of the data. The covariance matrix will give you information
about the relationships between the three dimensions. The covariance matrix is calculated as
follows:
                  covariance_matrix = 1/n * X.T @ X
                  where X is the mean-centered data matrix (with n rows and 3 columns), and @ represents
        matrix multiplication.
4. Calculate the eigenvectors and eigenvalues of the covariance matrix. The eigenvectors represent
the directions of maximum variance in the data, and the corresponding eigenvalues represent the
amount of variance that is explained by each eigenvector.
5. Choose the top k eigenvectors that explain the most variance in the data. This will depend on
your specific data and analysis goals. For example, you might choose the top 1 or 2 eigenvectors
to create a 1D or 2D representation of the data.
6. Project the data onto the selected eigenvectors. This can be done by multiplying the data matrix
by the matrix of selected eigenvectors:
               projected_data = X @ selected_eigenvectors
where X is the original data matrix (with n rows and 3 columns), and selected_eigenvectors is the
matrix of selected eigenvectors (with 3 rows and k columns).
7. Visualize the projected data in the lower-dimensional space. This can be done using a scatter
plot or other visualization technique.
       Note that PCA assumes that the data is normally distributed and that the variables are
linearly related. If these assumptions do not hold, PCA may not be an appropriate technique to
use. Additionally, when working with 3D data, it can be more difficult to visualize the results of
PCA, especially when projecting the data onto more than one dimension.
Math behind PCA
        PCA can be thought of as an unsupervised learning problem. The whole process of
obtaining principal components from a raw dataset can be simplified into six parts:
1. Take the whole dataset consisting of d+1 dimensions and ignore the labels such that our new
dataset becomes d dimensional.
2. Compute the mean for every dimension of the whole dataset.
3. Compute the covariance matrix of the whole dataset.
4. Compute eigenvectors and the corresponding eigenvalues.
5. Sort the eigenvectors by decreasing eigenvalues and choose k eigenvectors with the largest
eigenvalues to form a d x k dimensional matrix W.
6. Use this d x k eigenvector matrix to transform the samples onto the new subspace.
Finding Optimal Number of Features
       Finding the optimal number of features in a dataset is an important task in data analysis
and machine learning. The optimal number of features is the number that provides the best balance
between model complexity and predictive accuracy.
       One popular method for finding the optimal number of features is to use PCA (Principal
Component Analysis). PCA can be used to reduce the dimensionality of the data by finding the
directions of maximum variance, called principal components, and projecting the data onto these
components.
       Here are the steps for finding the optimal number of features using PCA:
1. Standardize the Data: Standardize the data by subtracting the mean and dividing by the standard
deviation for each variable. This step is necessary to ensure that each variable has the same scale
and importance in the analysis.
2. Compute the Covariance Matrix: Compute the covariance matrix of the standardized data using
the formula: Covariance matrix = 1/n * X.T @ X
       where X is the standardized data matrix with n observations and p variables.
3. Calculate the Eigenvectors and Eigenvalues: Calculate the eigenvectors and eigenvalues of the
covariance matrix using the eigen decomposition method.
4. Sort the Eigenvalues: Sort the eigenvalues in descending order and calculate the cumulative sum
of the eigenvalues.
5. Determine the Number of Principal Components: Choose the smallest number of principal
components that explains at least 90% of the total variance in the data, by examining the
cumulative sum of the eigenvalues.
6. Project the Data onto the Selected Principal Components: Project the data onto the selected
principal components using the formula:
       Projected data = X @ selected_eigenvectors
       where X is the standardized data matrix and selected_eigenvectors is the matrix of selected
eigenvectors.
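As a rough illustration, the sketch below applies these steps with scikit-learn on an assumed
synthetic dataset and picks the smallest number of components whose cumulative explained variance
reaches 90%.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                   # hypothetical data: 200 observations, 10 variables
X_std = StandardScaler().fit_transform(X)        # step 1: standardize each variable

pca = PCA().fit(X_std)                           # steps 2-4: covariance, eigenvectors, sorted eigenvalues
cum_var = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.argmax(cum_var >= 0.90)) + 1   # step 5: smallest k explaining >= 90% variance
print(n_components, cum_var[n_components - 1])

projected_data = PCA(n_components=n_components).fit_transform(X_std)   # step 6: project the data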
Magic behind PCA
       PCA (Principal Component Analysis) is a powerful technique used in data analysis,
machine learning, and computer vision. The magic behind PCA lies in its ability to reduce the
dimensionality of a dataset while retaining the most important information in the data.
       PCA achieves this by identifying the directions of maximum variance, called principal
components, and projecting the data onto these components. The principal components represent
linear combinations of the original variables in the data and are orthogonal to each other. The first
principal component explains the largest amount of variance in the data, the second principal
component explains the second largest amount of variance, and so on.
PCA is useful for Several Reasons:
       Dimensionality Reduction: PCA can reduce the number of variables in a dataset without
losing important information. This can help simplify the data and make it easier to analyze.
Feature Extraction: PCA can be used to extract the most important features from a dataset. This
can be useful for tasks such as image recognition, where the most important features may be
difficult to identify by hand.
Noise Reduction: PCA can help reduce the impact of noise in a dataset. By focusing on the
directions of maximum variance, PCA can filter out noise that is present in other dimensions.
Visualization: PCA can be used to visualize high-dimensional data in two or three dimensions.
This can help identify patterns and relationships in the data that may not be apparent in higher
dimensions.
       PCA is based on linear algebra and uses matrix operations to calculate the principal
components of a dataset. It is a powerful tool that can be used for a wide range of data analysis
and modelling tasks.
Dimensionality reduction
       Dimensionality reduction is the process of reducing the number of variables or features in
a dataset. The goal of dimensionality reduction is to simplify the data while retaining the most
important information, which can help improve analysis and modelling performance.
There are Two Main Types of Dimensionality Reduction:
1. Feature Selection: Feature selection involves selecting a subset of the original features or
variables to use in the analysis. This can be done based on various criteria, such as correlation with
the target variable, importance in a model, or domain knowledge. Feature selection is a
straightforward approach but may not be effective if the selected features do not capture the most
important information in the data.
2. Feature Extraction: Feature extraction involves transforming the original features or variables
into a new set of features that capture the most important information in the data. Principal
Component Analysis (PCA) is a common feature extraction technique that identifies the directions
of maximum variance in the data and projects the data onto these directions. Other feature
extraction techniques include Linear Discriminant Analysis (LDA), t-Distributed Stochastic
Neighbor Embedding (t-SNE), and Non-negative Matrix Factorization (NMF).
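The short sketch below contrasts the two approaches on an assumed toy dataset: SelectKBest keeps a
subset of the original columns (feature selection), while PCA builds new combined features (feature
extraction).
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

X_selected = SelectKBest(f_classif, k=3).fit_transform(X, y)    # feature selection: keep 3 original columns
X_extracted = PCA(n_components=3).fit_transform(X)              # feature extraction: 3 new combined features
print(X_selected.shape, X_extracted.shape)                      # (200, 3) (200, 3)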
The Advantages of Dimensionality Reduction Include:
Improved Analysis: By reducing the number of variables, dimensionality reduction can make it
easier to analyze and understand the data. This can help identify patterns and relationships that
may not be apparent in high-dimensional data.
Improved Model Performance: High-dimensional data can lead to overfitting in models, which
can lead to poor generalization performance. By reducing the number of variables, dimensionality
reduction can help improve the performance of models by reducing overfitting.
Reduced Computational Complexity: High-dimensional data can be computationally expensive
to analyze and model. By reducing the number of variables, dimensionality reduction can help
reduce computational complexity and speed up analysis and modeling.
However, it is important to note that dimensionality reduction can also have some drawbacks, such
as loss of information and reduced interpretability. Therefore, it is important to carefully consider
the benefits and drawbacks of dimensionality reduction for each specific dataset and analysis task.
PCA - PCA on Images
        PCA (Principal Component Analysis) on images involves reducing the dimensionality of
image data while preserving important features. This is achieved by transforming the data into a
smaller set of principal components that capture the most significant variations in the image.
PCA Intuition on Images
Each image is treated as a vector in high-dimensional space.
Example: A 64x64 image → flattened into a 4096-element vector.
PCA finds new basis vectors (called eigenfaces in face recognition):
The first principal component (PC1) captures the most variation.
The second principal component (PC2) captures the second-most variation, orthogonal to PC1.
By projecting images onto top K principal components, we get a compressed version.
Steps to Apply PCA on Images
Step 1: Load Images
       Each image → convert to grayscale (optional but common).
       Flatten image matrix (2D) into a vector (1D).
Step 2: Stack into Data Matrix
       Stack all image vectors into a matrix X of shape (n_samples, n_features).
            o   n_samples    = number of images.
            o   n_features   = number of pixels per image.
Step 3: Standardize Data
       Subtract mean and (optionally) divide by standard deviation.
            o   Centering is essential.
            o   Formula: X_centered = X − mean(X)
Step 4: Compute Covariance Matrix
Step 5: Compute Eigenvectors and Eigenvalues
       Eigenvectors = Principal Components (PCs).
         Eigenvalues = how much variance each PC explains.
Step 6: Sort and Select Top K Components
         Sort eigenvectors based on eigenvalues (descending order).
         Select top K components to reduce dimensions.
Step 7: Project Images
         Multiply data by selected principal components to get compressed representation:
Projected Data = X_centered × W, where W contains the top K eigenvectors.
Applications
 Application          PCA Role
 Face Recognition     Eigenfaces represent faces compactly
 Image Compression    Store fewer components, reconstruct image
 Noise Filtering      Discard components with tiny variance
 Object Detection     Extract features before classification
Python Code
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import fetch_olivetti_faces

# Load dataset
data = fetch_olivetti_faces()
images = data.images
X = data.data  # flattened images

# Apply PCA
pca = PCA(n_components=50)
X_pca = pca.fit_transform(X)

# Reconstruct image
X_reconstructed = pca.inverse_transform(X_pca)

# Show original vs reconstructed
plt.subplot(1, 2, 1)
plt.imshow(images[0], cmap='gray')
plt.title('Original')
plt.subplot(1, 2, 2)
plt.imshow(X_reconstructed[0].reshape(64, 64), cmap='gray')
plt.title('Reconstructed')
plt.show()
PCA on Olivetti Images
          The Olivetti Faces Dataset, also known as the AT&T Face database, is a classic dataset
often used to demonstrate and evaluate face recognition and dimensionality reduction techniques
like PCA. Here's a breakdown of applying PCA to this dataset:
1. Understanding the Olivetti Faces Dataset
Source: Originally collected by the Olivetti Research Laboratory between April 1992 and April
1994. Now hosted by AT&T Laboratories Cambridge.
Content: Contains 400 grayscale images of 40 distinct subjects.
Image Characteristics:
Size: 64x64 pixels.
Gray levels: 256.
Pose: Subjects are generally in an upright, frontal position with some tolerance for slight
movements.
Variations: For some subjects, images include variations in lighting, facial expressions
(smiling/not smiling, open/closed eyes), and the presence or absence of glasses.
Background: Dark and homogeneous.
Organization: The dataset is typically organized into 40 directories, one for each subject, with 10
images per subject.
Purpose: Ideal for experimenting with face recognition algorithms due to its manageable size and
controlled variations. It's also well-suited for demonstrating unsupervised learning techniques like
PCA.
2. Applying PCA to the Olivetti Faces Dataset
The process of applying PCA to the Olivetti Faces dataset follows the general steps outlined in the
previous teaching notes, with specific considerations for image data:
Data Preparation: Loading the Dataset: Many machine learning libraries (like scikit-learn in
Python) provide convenient functions to download and load the Olivetti Faces dataset.
Flattening Images: Each 64x64 image is flattened into a 1D vector of 4096 pixel values.
Creating the Data Matrix: The dataset is represented as a matrix of size 400 (number of images)
x 4096 (number of features/pixels).
Mean Centering: Calculate the average face across all 400 images (pixel-wise mean). This results
in a 64x64 average image. Subtract this average face vector from each individual image vector in
the data matrix. This centers the data around the origin.
Calculating the Covariance Matrix (or using SVD): Directly calculating the 4096x4096
covariance matrix can be computationally expensive. A more efficient approach, especially for
high-dimensional data, is to use Singular Value Decomposition (SVD). SVD achieves the same
goal of finding principal components without explicitly calculating the covariance matrix.
Computing Eigenvectors (Eigenfaces) and Eigenvalues: The eigenvectors obtained from the
covariance matrix (or SVD) are 4096-dimensional vectors. When reshaped back into 64x64
images, these eigenvectors are called eigenfaces. Eigenvalues represent the amount of variance
captured by each corresponding eigenface. The eigenfaces associated with larger eigenvalues
capture more significant variations in the dataset. Intuition of Eigenfaces: The first few eigenfaces
often represent general facial features like overall shape, lighting direction, etc. Subsequent
eigenfaces capture more subtle variations like specific expressions or details.
Selecting Principal Components: Sort the eigenfaces by their corresponding eigenvalues in
descending order. Determine the number of principal components (k) to keep based on the desired
level of variance explained. For example, you might aim to retain 90% or 95% of the total variance.
Alternatively, you can visually inspect a scree plot of the eigenvalues to find an "elbow" point.
Projecting the Images: Create a projection matrix using the top k eigenfaces (reshaped back into
vectors) as columns. Multiply the mean-centered image data matrix by this projection matrix to
obtain a lower-dimensional representation of each face (a vector of k principal component scores).
3. Benefits of Applying PCA to Olivetti Images
Dimensionality Reduction for Face Recognition:
Instead of working with 4096 pixel values, each face can be represented by a much smaller
number of principal component scores (e.g., 100-200), significantly reducing the dimensionality
while retaining most of the identity-related information. This speeds up training and classification
in face recognition tasks.
Feature Extraction (Eigenfaces as Features):
The eigenfaces themselves can be considered as a set of basis features that capture the fundamental
variations in human faces present in the dataset.
Visualization:
The top few eigenfaces can be visualized to understand the most significant sources of variation
in the dataset (e.g., lighting, pose). Projecting the 400 images onto the first two or three principal
components allows for a 2D or 3D visualization of the face space, potentially revealing clusters
based on identity or other factors.
Noise Reduction:
By discarding lower-variance principal components, PCA can help to filter out some of the noise
or less important variations in the images, potentially leading to more robust face recognition
models.
Why Apply PCA on Olivetti Faces?
 Purpose              Benefit
 Compression          Store images using fewer dimensions (e.g., 50 instead of 4096)
 Noise Reduction      Ignore components with low variance (likely noise)
 Feature Extraction   Use principal components (eigenfaces) as features for ML models
 Face Recognition     Classify faces by comparing in reduced PCA space
Python Code Example: PCA on Olivetti Faces
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import fetch_olivetti_faces

# Step 1: Load dataset
data = fetch_olivetti_faces()
images = data.images  # Shape: (400, 64, 64)
X = data.data         # Shape: (400, 4096), flattened

# Step 2: Apply PCA
pca = PCA(n_components=50)  # Reduce from 4096 to 50 dimensions
X_pca = pca.fit_transform(X)

# Step 3: Reconstruct the images
X_reconstructed = pca.inverse_transform(X_pca)

# Step 4: Show original and reconstructed image
plt.figure(figsize=(8, 4))

# Original
plt.subplot(1, 2, 1)
plt.imshow(images[0], cmap='gray')
plt.title('Original')
plt.axis('off')

# Reconstructed
plt.subplot(1, 2, 2)
plt.imshow(X_reconstructed[0].reshape(64, 64), cmap='gray')
plt.title('Reconstructed (50 PCs)')
plt.axis('off')
plt.show()
        Applying PCA to the Olivetti Faces dataset is a common and insightful exercise in understanding
dimensionality reduction and feature extraction for image data. It demonstrates how a high-dimensional
representation of faces can be compressed into a lower-dimensional space using eigenfaces, while still
retaining significant information for tasks like face recognition. The Olivetti dataset provides a well-
controlled environment to observe the effects of PCA on real-world image data.
Reproducing Images
        Principal Component Analysis (PCA) is not only a dimensionality reduction technique but
also a powerful method for image compression and reconstruction. When PCA is applied to image
data, it identifies the most important patterns (principal components) that capture the largest
variance in pixel intensities across images. Reproducing images with PCA involves two main
steps: compression and reconstruction.
        In the compression phase, the original high-dimensional image data (for example, a 64×64
image with 4096 pixels) is projected onto a smaller set of dimensions defined by the selected
principal components. This reduced representation retains only the most significant features of the
image while discarding less important variations or noise. The number of components chosen
directly affects how much detail is preserved; fewer components mean greater compression but
more loss of detail.
        Reconstruction, or reproduction, is the reverse process where the compressed data is
transformed back to the original space. This is done by multiplying the compressed features with
the transpose of the principal components and then adding back the mean image that was
subtracted during the PCA process. The result is an approximation of the original image. If only a
small number of principal components are used, the reconstructed image appears blurry or
simplified but still captures the general shape and structure. As more components are included, the
reproduced image becomes sharper and more accurate, closely resembling the original.
        An important aspect of PCA-based reproduction is the trade-off between compression and
quality. Using very few components, such as 10 or 20, can significantly reduce the storage size of
an image but at the cost of detail and clarity. On the other hand, using a larger number like 100 or
200 components results in a higher-quality reproduction with minimal visible loss. In fact, when
all components are used, PCA perfectly reconstructs the original image without any loss.
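A brief sketch of this trade-off, reusing the Olivetti faces from the earlier examples, is given below;
the component counts 10, 50 and 200 are chosen only for illustration.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import fetch_olivetti_faces

X = fetch_olivetti_faces().data                  # (400, 4096) flattened 64x64 face images

plt.figure(figsize=(9, 3))
for i, k in enumerate([10, 50, 200]):            # fewer components -> more compression, blurrier result
    pca = PCA(n_components=k).fit(X)
    X_rec = pca.inverse_transform(pca.transform(X))
    plt.subplot(1, 3, i + 1)
    plt.imshow(X_rec[0].reshape(64, 64), cmap='gray')
    plt.title(f'{k} components')
    plt.axis('off')
plt.show()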
        This process of reproducing images using PCA is especially useful in face recognition
tasks, where "eigenfaces" (the principal components of face images) are used to represent and
reconstruct faces efficiently. The technique not only demonstrates the power of dimensionality
reduction but also provides visual intuition about how much of an image’s information is captured
by its main components.
Eigenfaces
        In the context of face recognition, each face image is represented as a high-dimensional
vector. For instance, a 64x64 pixel image can be viewed as a 4096-dimensional vector. The goal
is to reduce this high-dimensional data to a lower-dimensional space that captures the most
significant features of the faces.
What are Eigenfaces?
       Eigenfaces are a set of eigenvectors derived from the covariance matrix of facial images, primarily
        used in the field of computer vision for human face recognition. The concept was introduced by
        Sirovich and Kirby in 1987 and later expanded by Turk and Pentland in 1991.
       Principal Components of Face Images: Eigenfaces are essentially the principal
        components (the eigenvectors of the covariance matrix) of a set of face images.
       "Ghostly" Faces: When these eigenvectors, which are initially long vectors of pixel
        values, are reshaped back into a 2D image format, they often appear as ghostly, grayscale
        patterns. These patterns highlight the directions of the most significant variance across the
        training set of faces.
      Basis Set: Eigenfaces form a basis set that can be used to represent any face from the
       original training set (and even new faces to some extent) as a weighted combination of
       these eigenfaces.
How are Eigenfaces Obtained?
The process of obtaining eigenfaces involves applying PCA to a dataset of face images:
   1. Prepare the Training Data:
           o   Collect a set of face images.
           o   Ensure the images are aligned (e.g., eyes at roughly the same position) and resized
               to a consistent size.
           o   Convert the images to grayscale.
           o   Flatten each image into a long 1D vector by concatenating the rows (or columns)
               of pixel intensities.
           o   Arrange these image vectors as columns (or rows) in a data matrix.
   2. Calculate the Mean Face:
           o   Compute the average of all the face images in the training set, resulting in a mean
               face image (also a vector).
   3. Subtract the Mean:
           o   Subtract the mean face vector from each individual face image vector. This centers
               the data.
   4. Calculate the Covariance Matrix:
           o   Compute the covariance matrix of the mean-centered image vectors. For a large
               number of high-dimensional images, this matrix can be very large.
   5. Compute Eigenvectors and Eigenvalues:
           o   Find the eigenvectors and eigenvalues of the covariance matrix.
           o   Each eigenvector has the same dimensionality as the flattened face images and can
               be reshaped into an "eigenface."
           o   Eigenvalues indicate the amount of variance in the dataset captured by each
               corresponding eigenvector (eigenface).
   6. Select Principal Components (Eigenfaces):
           o   Sort the eigenvectors (eigenfaces) based on their corresponding eigenvalues in
               descending order.
           o   Choose the top k eigenfaces that capture a significant portion of the total variance
               in the face dataset. The number k is typically much smaller than the original
               dimensionality of the images.
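A hedged sketch of these steps using scikit-learn on the Olivetti faces is shown below; the library
performs the mean-centering and eigen decomposition internally, and its components_ attribute holds
the eigenvectors, reshaped here into 64x64 eigenface images.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import fetch_olivetti_faces

X = fetch_olivetti_faces().data                  # step 1: 400 aligned, flattened 64x64 face vectors

pca = PCA(n_components=16).fit(X)                # steps 2-6: mean face, covariance/SVD, top-k eigenvectors
mean_face = pca.mean_.reshape(64, 64)            # the mean face (step 2)
eigenfaces = pca.components_.reshape(-1, 64, 64) # eigenvectors reshaped into eigenface images

plt.figure(figsize=(8, 2))
plt.subplot(1, 4, 1)
plt.imshow(mean_face, cmap='gray')
plt.title('Mean face')
plt.axis('off')
for i in range(3):                               # the first few "ghostly" eigenfaces
    plt.subplot(1, 4, i + 2)
    plt.imshow(eigenfaces[i], cmap='gray')
    plt.title(f'Eigenface {i + 1}')
    plt.axis('off')
plt.show()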
       By applying Principal Component Analysis (PCA) to a set of face images, we can identify
the directions (principal components) that capture the most variance in the data. These principal
components, when reshaped back into image form, are termed Eigenfaces. Each Eigenface
represents a pattern of variation among the faces in the dataset.
       Given a set of centered face images (mean subtracted), PCA seeks to find a set of
orthonormal vectors (eigenvectors) that capture the directions of maximum variance.
Mathematically, the covariance matrix of the M mean-centered face images is:
       C = (1/M) * Σ_i (x_i − x̄)(x_i − x̄)^T
       where x_i is the vectorized form of the i-th image, and x̄ is the mean face vector.
       The eigenvectors of this covariance matrix correspond to the Eigenfaces. Each face in the
dataset can then be approximated as a linear combination of these Eigenfaces.
       Eigenfaces often appear as ghostly images highlighting specific facial features. The first
few Eigenfaces capture the most significant variations, such as lighting and general facial
structure, while subsequent Eigenfaces capture finer details.
       For example, the first Eigenface might represent the average lighting condition across all
faces, the second might capture variations in facial width, and so on. By combining these
Eigenfaces with appropriate weights, we can reconstruct original face images with varying degrees
of accuracy, depending on the number of Eigenfaces used.
       In face recognition systems, Eigenfaces are used to project high-dimensional face images
into a lower-dimensional space (face space). Recognition is performed by comparing the projected
test image with the projections of known faces and identifying the closest match.
       This method significantly reduces computational complexity and storage requirements,
making it suitable for real-time face recognition applications.
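The rough sketch below illustrates this matching idea on the Olivetti faces, holding out one image as
an assumed test face and finding its nearest neighbour in a 50-dimensional face space.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.datasets import fetch_olivetti_faces

faces = fetch_olivetti_faces()
X, y = faces.data, faces.target                  # 400 flattened faces and their person labels

pca = PCA(n_components=50).fit(X[:-1])           # build the face space from all but the last image
known_proj = pca.transform(X[:-1])               # projections of the known faces
test_proj = pca.transform(X[-1:])                # projection of the held-out test face

distances = np.linalg.norm(known_proj - test_proj, axis=1)
closest = int(np.argmin(distances))              # nearest known face in the reduced space
print('Predicted person:', y[:-1][closest], '| Actual person:', y[-1])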
Classification of LFW Images
       The Labelled Faces in the Wild (LFW) dataset is a popular dataset used for face recognition
tasks. PCA can be used as a pre-processing step to reduce the dimensionality of the image data
and improve the accuracy of face recognition classifiers.
To classify LFW images using PCA, the following steps can be followed:
1. Convert the images to grayscale and resize them to a fixed size, such as 64x64 pixels, to reduce
the dimensionality of the data.
2. Compute the mean image of the dataset and subtract it from each image to center the data
around zero.
3. Compute the covariance matrix of the centered images.
4. Compute the eigenvectors and eigenvalues of the covariance matrix.
5. Select a subset of the eigenvectors with the largest eigenvalues to represent the images. These
eigenvectors correspond to the most important features of the images.
6. Project the centered images onto the selected eigenvectors to obtain a reduced representation of
the images.
7. Train a classifier, such as a neural network or a support vector machine (SVM), using the
reduced representation of the images as input features.
8. Test the classifier on a separate set of LFW images and evaluate its accuracy (see the code
sketch at the end of this section).
       Using PCA as a pre-processing step can improve the accuracy of face recognition
classifiers by reducing the dimensionality of the image data and removing noise and irrelevant
information. However, it is important to carefully balance the level of dimensionality reduction
and the accuracy of the classifier based on the specific requirements of the application.
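A sketch of this pipeline is given below, assuming scikit-learn's fetch_lfw_people can download the
dataset; the parameter choices (150 components, an RBF SVM) are illustrative rather than prescribed
by these notes.
from sklearn.datasets import fetch_lfw_people
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

lfw = fetch_lfw_people(min_faces_per_person=70)  # keep only people with enough images
X, y = lfw.data, lfw.target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pca = PCA(n_components=150, whiten=True).fit(X_train)      # fit PCA on the training images only
clf = SVC(kernel='rbf', C=10).fit(pca.transform(X_train), y_train)

y_pred = clf.predict(pca.transform(X_test))
print('Test accuracy:', accuracy_score(y_test, y_pred))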
Question Bank
 1. What is unsupervised learning, and how does it differ from supervised learning?
 2. What is clustering? Explain the features of clustering and why it is useful in unsupervised
 learning.
 3. Explain the k-means algorithm and how it is used for flat clustering.
 4. How does the scikit-learn library implement k-means clustering?
 5. Describe the fit and predict functions in k-means clustering.
  6. How do you choose the optimal value of k in k-means clustering?
 7. What is the silhouette algorithm, and how does it help in selecting the optimal k value?
 8. What is k-medoids clustering, and how does it differ from k-means clustering?
 9. Explain the k-medoids algorithm for clustering.
 10. What is hierarchical clustering, and what are the two approaches for it?
 11. What is the intuition behind PCA, and why is it useful in unsupervised learning?
 12. How can PCA be applied to 2D data, and what is the effect of the transformation?
 13. How does PCA work for 3D data, and what are the benefits of using PCA?
 14. Explain the mathematical basis of PCA, including covariance matrices and eigenvectors.
 15. How do you find the optimal number of features in PCA, and why is this important?
 16. How can PCA be used for image data, and what are some applications of this technique?
 17. How does PCA work for Olivetti images, and how can the results be used for data
 compression?
 18. What is image reproduction, and how can PCA be used to achieve this?
 19. What are eigenfaces, and how are they used in facial recognition systems?
 20. Explain the classification of LFW images using PCA, including the steps involved and the
 results obtained.
21. List and explain clustering algorithms.
22. What is K-Means clustering? Explain the K-Means algorithm.
23. Solve the following example of K-Means clustering:
24. Dataset: We have 6 data points representing coordinates in 2D space:
               A (2, 1) , B (4, 1) , C (1, 1) , D (10, 3) , E (10, 5) , F (10, 2)
25. In unsupervised learning, how do you choose the optimal value of K? Explain with an example.
26. Explain the elbow method with the following example. Dataset: We have 6 data points
    representing coordinates in 2D space: A (2, 1), B (4, 1), C (1, 1), D (10, 3), E (10, 5), F (10, 2)
27. Differentiate between Supervised, Unsupervised, and Semi-supervised learning with
    examples.
28. What is K-Means Clustering? Explain the steps of the K-Means Algorithm.
29. What are the advantages and disadvantages of K-Means Clustering?
30. In K-Means, what is the role of Euclidean distance? Explain with example.
31. Apply 2 iterations of K-Means on the following data with K = 3:
    Data points: (1,2), (1,4), (1,0), (10,2), (10,4), (10,0)
    Initial centroids: (1,2), (10,2), (5,5)
32. Explain the Elbow Method. Use the following dataset to illustrate the process:
    A (2, 1), B (4, 1), C (1, 1), D (10, 3), E (10, 5), F (10, 2)
33. In Unsupervised Learning, how to choose the optimal number of clusters K? Explain with
    examples.