Prof. Sandip Eknathrao Ingle                                   9960575712
                           Advanced Machine Learning
                    BTAIC602             3L - 0T - 0P           3 Credits
Course Category: PCC8, Teaching Scheme Lecture: 3 hrs./week
Examination Scheme:
Continuous Assessment: 20 Marks | Mid Semester Exam: 20 Marks | End Semester Exam: 60 Marks (Duration 03 hrs.)
Pre-Requisites: Data Analysis, Python Programming Language
Course Objectives: After completion of the course, students will learn:-
   To understand fundamental concepts of unsupervised learning and its various algorithms
   To understand Association Rules Mining and Recommendation Systems
   To apply ML algorithms on given data and interpret the results obtained
   To design appropriate ML solution to solve real world problems in AI domain
Course Outcomes: On completion of the course, students will be able to:
CO1 Develop a good understanding of the fundamentals of unsupervised learning.
CO2 Formulate Association Rule Mining and Recommendation Systems.
CO3 Interpret a model using Reinforcement Learning.
CO4 Evaluate time series data.
CO5 Design and implement concrete solutions using boosting.
                 Unit No 1: Unsupervised Learning [7 Hours]
Unsupervised Learning - 1
Introduction to Unsupervised Learning, Introduction to Clustering, Using K-means for Flat
Clustering, KMeans Algorithm, Using KMeans from Sklearn, Implementing Fit & Predict
Functions, Implementing K-Means Class
Unsupervised Learning - 2
How to choose Optimal K, Silhouette algorithm to choose K, Introduction to K Medoids, K
Medoids Algorithm, Introduction to Hierarchical Clustering, Top down/Divisive Approach,
Bottom up/Agglomerative Approach
Principal Component Analysis PCA
Intuition behind PCA, Applying PCA to 2D data, Applying PCA on 3D data, Math behind PCA,
Finding Optimal Number of Features, Magic behind PCA, Dimensionality reduction PCA -
PCA on Images, PCA on Olivetti Images, Reproducing Images, Eigenfaces, Classification of
LFW Images
Introduction
What Is Machine Learning? Machine learning is programming computers to optimize a
performance criterion using example data or past experience. We have a model defined up to some
parameters, and learning is the execution of a computer program to optimize the parameters of the
model using the training data or past experience.
The model may be predictive to make predictions in the future, or descriptive to gain knowledge
from data, or both.
 The term Machine Learning was coined by Arthur Samuel in 1959, an American pioneer
    in the field of computer gaming and artificial intelligence, who stated that
 “It gives computers the ability to learn without being explicitly programmed”.
 In 1997, Tom Mitchell gave a “well-posed” mathematical definition:
    “A computer program is said to learn from experience E with respect to some
    task T and some performance measure P, if its performance on T, as
    measured by P, improves with experience E.”
 Machine Learning is one of the latest buzzwords in technology.
 It deserves the attention, as it is one of the most interesting subfields of Computer Science.
                                      Fig: Machine learning model
          However, there is no universally accepted definition for machine learning. Different
       authors define the term differently.
Arthur Samuel's Definition (1959):
      "Machine learning is the field of study that gives computers the ability to learn without
       being explicitly programmed."
Tom Mitchell's Definition (1997):
      "A computer program is said to learn from experience E with respect to some class of
       tasks T and performance measure P, if its performance at tasks in T, as measured by P,
       improves with experience E."
Ethem Alpaydın's Definition:
      "Machine learning is programming computers to optimize a performance criterion using
       example data or past experience."
Dieter Fox's Definition:
      "Machine learning is the study of algorithms that improve their performance P at some task
       T with experience E."
Kevin P. Murphy's Definition:
      "Machine learning is a set of methods that can automatically detect patterns in data, and
       then use the uncovered patterns to predict future data or perform other kinds of decision
       making under uncertainty."
Trevor Hastie, Robert Tibshirani, and Jerome Friedman's Definition:
      "Machine learning is a branch of statistics, but with increased emphasis on the use of
       computers to implement the underlying algorithms, and on evaluating their performance
       on large datasets."
Andreas C. Müller and Sarah Guido's Definition:
      "Machine learning is the practice of teaching computers how to learn patterns from data,
       and then making decisions or predictions based on those patterns."
Wikipedia's Definition:
       "Machine learning is a subfield of artificial intelligence (AI) focused on the development
        of algorithms that allow computers to learn from and make predictions on data."
Key Influential Figures
   Alan Turing: Pioneer of AI and theoretical computer science.
   Frank Rosenblatt: Developed the perceptron, an early neural network model.
   Geoffrey Hinton: Key figure in the development of deep learning and neural networks.
   Yann LeCun: Contributed to the development of convolutional neural networks.
   Yoshua Bengio: Worked extensively on deep learning and representation learning.
   Andrew Ng: Prominent advocate for online education in AI and machine learning.
Key aspects of machine learning include:
    1. Learning from Data: Machine learning algorithms use historical data as input to predict
        new output values. The quality and quantity of data significantly influence the performance
        of the learning model.
    2. Pattern Recognition: By identifying patterns and correlations in data, machine learning
        models can make predictions or decisions without being explicitly programmed for the
        task.
    3. Adaptation and Improvement: Machine learning systems continuously improve their
        performance as they are exposed to more data over time, refining their algorithms to
        enhance accuracy and efficiency.
Core Components of Machine Learning
    1. Data: The foundation of any machine learning system. It can be in various forms, such as
        text, images, audio, and structured data like databases.
    2. Features: Individual measurable properties or characteristics of the data used by the
        algorithm to make predictions.
    3. Algorithms: The set of rules or procedures that the machine learning system follows to
        learn from data and make predictions. Common algorithms include decision trees, neural
        networks, and support vector machines.
Simple Machine Learning Process
    1. Data Collection: Gather relevant data for the task, such as customer purchase histories for
        a recommendation system.
    2. Data Preparation: Clean and pre-process the data to ensure quality and consistency. This
        may involve handling missing values, normalizing data, and encoding categorical
        variables.
   3. Model Training: Select an appropriate algorithm and use the prepared data to train the
       model, allowing it to learn the patterns and relationships in the data.
   4. Model Evaluation: Assess the model's performance using evaluation metrics like
       accuracy, precision, recall, and F1 score. This typically involves testing the model on a
       separate set of data not used during training.
   5. Model Tuning: Optimize the model by adjusting hyperparameters and improving the
       feature set to enhance performance.
   6. Deployment: Implement the trained model in a real-world environment where it can make
       predictions or decisions on new data.
   7. Monitoring and Maintenance: Continuously monitor the model's performance and
       update it with new data to maintain its accuracy and relevance.
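A minimal end-to-end sketch of steps 1 to 6 above, using scikit-learn's built-in Iris dataset as a stand-in for collected data (an illustrative sketch only; real projects involve far more preparation, tuning, and monitoring):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1. Data collection: a small built-in dataset stands in for real collected data.
X, y = load_iris(return_X_y=True)

# 2. Data preparation: hold out a test set and scale the features.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# 3. Model training: fit a simple classifier on the prepared data.
model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_train, y_train)

# 4. Model evaluation: measure accuracy on data not used during training.
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))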
Types of Machine Learning
    1. Supervised Learning
          o    Definition: The model is trained on a labelled dataset, which means that each
               training example is paired with an output label.
Supervised learning is a type of machine learning where the algorithm is trained on a labelled
dataset. This means that each input comes with a corresponding output label. The model learns to
map inputs to outputs and is then used to predict the labels for new, unseen data.
                                        Fig: Supervised Learning
Key Components:
      Labelled Data: Data that includes input-output pairs.
      Training Phase: The model learns from the labelled dataset by finding patterns and
       relationships between the inputs and outputs.
      Prediction Phase: The trained model is used to predict the output for new inputs.
Examples of Supervised Learning Algorithms:
      Classification: Identifying email as spam or not spam.
      Regression: Predicting house prices based on historical data.
       Regression:
        o   Linear Regression: Predicting a continuous output, such as housing prices.
        o   Logistic Regression: Predicting binary outcomes, such as spam detection.
       Classification:
        o   Decision Trees: Classifying data into categories based on input features.
        o   Support Vector Machines (SVM): Finding the optimal boundary between classes.
        o   K-Nearest Neighbours (k-NN): Classifying data based on the closest training
            examples.
    •   When an algorithm learns from example data and associated target responses, which can
        consist of numeric values or string labels such as classes or tags, in order to later predict
        the correct response when presented with new examples, it falls under the category of
        supervised learning.
    •   This approach is indeed similar to human learning under the supervision of a teacher.
    •   The teacher provides good examples for the student to memorize, and the student then
        derives general rules from these specific examples.
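As a minimal illustration of the training and prediction phases (a sketch using scikit-learn with made-up numeric features, not a production spam filter):

from sklearn.linear_model import LogisticRegression

# Toy labelled data: each row is [number of words, number of links]; label 1 = spam, 0 = not spam.
X_train = [[120, 0], [300, 5], [150, 1], [800, 9], [90, 0], [400, 7]]
y_train = [0, 1, 0, 1, 0, 1]

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)                  # training phase: learn the input-to-output mapping

print(model.predict([[100, 0], [500, 8]]))   # prediction phase on new, unseen inputs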
Applications:
       Email spam detection.
       Image and speech recognition.
       Credit scoring.
       Medical diagnosis.
2. Unsupervised Learning
            o   Definition: The model is given data without explicit instructions on what to do
                with it. It must find patterns and relationships in the data.
Unsupervised learning involves training a model on data without labelled responses. The algorithm
tries to learn the underlying structure of the data by identifying patterns, relationships, and clusters.
                                    Fig: Unsupervised Learning
Key Components:
       Unlabelled Data: Data without explicit labels or outcomes.
       Clustering: Grouping similar data points together.
       Dimensionality Reduction: Reducing the number of variables under consideration.
Examples of Unsupervised Learning Algorithms:
       Clustering:
            o   K-Means Clustering: Partitioning data into k distinct clusters based on similarity.
            o   Hierarchical Clustering: Creating a tree of clusters based on nested grouping.
       Dimensionality Reduction:
            o   Principal Component Analysis (PCA): Reducing the number of features while
                retaining variance.
            o   t-Distributed Stochastic Neighbour Embedding (t-SNE): Reducing dimensionality
                for visualization.
    •   In contrast, when an algorithm learns from plain examples without any associated response,
        it is left to the algorithm to determine the data patterns on its own.
    •   This type of algorithm tends to restructure the data into something else, such as new features
        that may represent a class or a new series of uncorrelated values.
    •   They are quite useful in providing humans with insights into the meaning of data and new
        useful inputs to supervised machine learning algorithms. As a kind of learning, it resembles
        the methods humans use to figure out that certain objects or events are from the same class,
        such as by observing the degree of similarity between objects.
Examples:
       Clustering: Grouping customers by purchasing behaviour.
       Dimensionality Reduction: Reducing the number of random variables under
        consideration.
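As a brief illustration of the dimensionality-reduction side of unsupervised learning, the following sketch (assuming scikit-learn is available, with made-up values) projects three correlated features onto two principal components:

import numpy as np
from sklearn.decomposition import PCA

# Toy data: 5 samples with 3 correlated features (illustrative values only).
X = np.array([[2.5, 2.4, 1.2],
              [0.5, 0.7, 0.3],
              [2.2, 2.9, 1.5],
              [1.9, 2.2, 1.1],
              [3.1, 3.0, 1.6]])

pca = PCA(n_components=2)              # keep the two directions of largest variance
X_reduced = pca.fit_transform(X)       # project the data onto those directions

print(X_reduced.shape)                 # (5, 2)
print(pca.explained_variance_ratio_)   # fraction of variance retained by each component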
Applications:
       Customer segmentation.
       Anomaly detection.
       Market basket analysis.
       Image compression.
3. Semi-supervised Learning
            o   Definition: The model is trained on a dataset that includes both labelled and
                unlabelled data. This approach is often used when labelling data is expensive or
                time-consuming.
Semi-supervised learning is a type of machine learning that involves training a model on a dataset that
includes both labelled and unlabelled data. This approach leverages the large amount of unlabelled data to
improve the learning process when labelled data is scarce or expensive to obtain.
                                  Fig: Semi-supervised Learning
Key Components:
       Labelled Data: A small amount of data with labels.
       Unlabelled Data: A large amount of data without labels.
       Combination Approach: Using both labelled and unlabelled data to improve model
        accuracy.
Examples of Semi-Supervised Learning Algorithms:
       Self-Training: The model is trained on labelled data, makes predictions on unlabelled data,
        and then retrains using the most confident predictions as additional labelled data.
       Co-Training: Two models are trained on different views of the data and teach each other
        using the most confident predictions.
       Graph-Based Methods: Using graph structures to propagate labels through the dataset.
   •    Here an incomplete training signal is given: a training set with some (often many) of the
        target outputs missing.
   •    There is a special case of this principle known as transduction, where the entire set of
        problem instances is known at learning time, except that part of the targets are missing.
Examples: Improving accuracy of image recognition by using a small amount of labelled images
and a large amount of unlabelled images.
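The self-training idea described above can be sketched with scikit-learn's SelfTrainingClassifier, where unlabelled points are marked with -1 (toy data for illustration only):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

# Toy 2-D points: only one example per class is labelled; -1 marks unlabelled data.
X = np.array([[1.0, 2.0], [8.0, 8.5], [1.2, 1.8], [7.8, 8.2], [1.1, 2.2], [8.1, 7.9]])
y = np.array([0, 1, -1, -1, -1, -1])

# The wrapped classifier is retrained as its most confident predictions are added as pseudo-labels.
model = SelfTrainingClassifier(LogisticRegression())
model.fit(X, y)

print(model.predict(np.array([[1.3, 2.1], [7.9, 8.0]])))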
Applications:
       Image classification with limited labelled images.
       Natural language processing with limited annotated text.
       Fraud detection with limited known fraud cases.
4. Reinforcement Learning
            o   Definition: The model learns by interacting with an environment and receiving
                rewards or penalties for actions taken.
       Reinforcement learning is a type of machine learning where an agent learns to make
       decisions by taking actions in an environment to maximize some notion of cumulative
       reward. The agent receives feedback in the form of rewards or penalties and uses this
       feedback to improve its strategy over time.
                                    Fig: Reinforcement Learning
Key Components:
       Agent: The learner or decision-maker.
       Environment: The external system with which the agent interacts.
       Actions: The set of all possible moves the agent can make.
       Rewards: Feedback from the environment based on the agent's actions.
       Policy: The strategy that the agent follows to decide on actions.
Examples of Reinforcement Learning Algorithms:
       Q-Learning: Learning the value of actions in states to maximize cumulative reward.
       Deep Q-Networks (DQN): Combining Q-learning with deep neural networks for complex
        environments.
       Policy Gradient Methods: Directly optimizing the policy to maximize the expected
        reward.
    •   As in unsupervised learning, you present the algorithm with examples that lack labels.
    •   However, you can accompany an example with positive or negative feedback according
        to the solution the algorithm proposes. This comes under the category of Reinforcement
        Learning, which is connected to applications for which the algorithm must make
        decisions (so the product is prescriptive, not just descriptive, as in unsupervised
        learning), and the decisions bear consequences.
   •    In the human world, it is just like learning by trial and error.
   o    Examples: Training a robot to walk, developing game-playing algorithms like AlphaGo.
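A minimal sketch of the Q-learning update rule mentioned above, on a made-up problem with 3 states and 2 actions (illustrative only, not tied to any specific environment):

import numpy as np

n_states, n_actions = 3, 2
Q = np.zeros((n_states, n_actions))   # estimated value of each action in each state
alpha, gamma = 0.1, 0.9               # learning rate and discount factor

def q_update(state, action, reward, next_state):
    # Q-learning update: move Q[s, a] toward reward + gamma * max over a' of Q[s', a'].
    best_next = np.max(Q[next_state])
    Q[state, action] += alpha * (reward + gamma * best_next - Q[state, action])

# One hypothetical transition: in state 0, taking action 1 gives reward 1.0 and leads to state 2.
q_update(state=0, action=1, reward=1.0, next_state=2)
print(Q)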
Applications:
       Game playing (e.g., AlphaGo).
       Robotics (e.g., training robots to perform tasks).
       Autonomous vehicles.
 Introduction to Clustering
Clustering
Clustering is an unsupervised machine learning technique used to group similar data points into
clusters. It plays a crucial role in data analysis, pattern recognition, and machine learning
applications. Clustering is the process of partitioning a dataset into groups (clusters) based on
similarity among the data points. Unlike classification, clustering does not use predefined labels
but identifies inherent patterns in the data.
Scientific Definitions of Clustering
Clustering is formally defined in various scientific contexts. Here are some widely accepted
definitions:
    1. Statistical Definition
       Clustering is the process of organizing objects into groups, or clusters, where objects within
       the same cluster are more similar to each other than to those in other clusters based on a
       predefined similarity or distance measure.
    2. Machine Learning Definition
       Clustering is an unsupervised learning technique that partitions a dataset into subsets (clusters)
       such that data points in the same cluster share similar characteristics, and those in different
       clusters have distinct properties.
    3. Mathematical Definition
       Given a dataset X with n points, X = {x1, x2, …, xn}, a clustering algorithm finds a mapping
       function f : X → C, where C is a set of clusters {C1, C2, …, Ck} satisfying some similarity or
       distance criterion.
    4. Graph-Theoretic Definition
       Clustering can be defined as a partitioning of a graph G=(V,E) , where vertices (V) represent
       data points, and edges (E) represent relationships, such that intra-cluster edges have higher
       weights (stronger connections) than inter-cluster edges.
History of Clustering and Pioneers
Clustering has a rich history, evolving from early statistical methods to advanced machine learning
techniques.
Karl Pearson (1857-1936)
     Developed the Pearson Correlation Coefficient, a fundamental statistical measure of similarity.
     His work laid the groundwork for measuring relationships between data points, a key concept
      in clustering.
Sir Ronald Fisher (1890-1962)
   Introduced discriminant analysis in 1936, which later influenced clustering techniques.
   Worked on multivariate statistics, forming a base for future machine learning models.
Stuart Lloyd (1957)
   Introduced the Lloyd's Algorithm, which later became known as the K-Means algorithm in
    the 1980s.
   Initially developed for signal processing at Bell Labs.
S.C. Johnson (1967)
   Developed Hierarchical Clustering, which organizes data into nested groups.
   Introduced Agglomerative and Divisive clustering approaches.
Martin Ester et al. (1996) - DBSCAN
   Introduced Density-Based Spatial Clustering of Applications with Noise (DBSCAN).
   This was a breakthrough in handling noisy and non-spherical clusters.
Features of Clustering
Clustering is an essential unsupervised learning technique used to group data points based on their
similarities. Below are the key features that define clustering
1. Unsupervised Learning
       Clustering is an unsupervised learning technique, meaning it does not require labelled data.
       The algorithm identifies patterns and structures in raw data without prior knowledge.
2. Similarity-Based Grouping
       Clustering groups data points based on similarity measures such as Euclidean distance,
        cosine similarity, and correlation.
       Data points within a cluster are more similar to each other than to points in different
        clusters.
3. Automatic Pattern Discovery
       Clustering helps in finding hidden patterns and structures in data.
       It is widely used in market segmentation, anomaly detection, and image segmentation.
4. Scalability
      Some clustering algorithms work well with large datasets (e.g., K-Means, DBSCAN),
       while others struggle with high-dimensional data.
      Scalability is a critical factor in big data applications.
5. Different Cluster Shapes & Structures
Clustering techniques can detect different types of structures, including:
      Spherical Clusters (K-Means)
      Hierarchical Clusters (Agglomerative)
      Density-Based Clusters (DBSCAN)
      Fuzzy Clusters (Fuzzy C-Means)
6. Noise and Outlier Handling
      Some clustering techniques can handle noise and outliers effectively, such as DBSCAN
       and OPTICS.
      Others, like K-Means, are sensitive to outliers because they rely on mean values.
7. High-Dimensional Data Handling
      Some clustering techniques, like Principal Component Analysis (PCA) + Clustering, work
       well in high-dimensional spaces.
      Others may suffer from the curse of dimensionality, requiring dimensionality reduction
       techniques.
8. Cluster Validation
       Evaluating the quality of clusters is important and can be done using the following indices
        (a short scikit-learn example follows this list):
       Silhouette Score (measures how well-separated clusters are)
       Dunn Index (evaluates compactness and separation)
       Davies-Bouldin Index (measures intra-cluster similarity)
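A short scikit-learn sketch of the Silhouette Score mentioned above, on toy data (values closer to 1 indicate compact, well-separated clusters):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(X)
print("Silhouette score:", silhouette_score(X, labels))   # high (roughly 0.7 here), indicating well-separated clusters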
9. Computational Complexity
      Clustering algorithms vary in computational cost:
      K-Means is efficient with O(nkT) complexity.
      Hierarchical Clustering is computationally expensive with O(n² log n).
10. Interpretability & Visual Representation
       Some clustering results can be easily visualized using scatter plots and t-SNE.
      Others require advanced dimensionality reduction methods for better interpretation.
List of Clustering Algorithms in Machine Learning
Clustering algorithms can be categorized based on their working principles. Below is a list of
commonly used clustering algorithms in ML:
1. Partition-Based Clustering
These algorithms divide the dataset into a predefined number of clusters.
      K-Means – Centroid-based, fast, and widely used.
      K-Medoids – Similar to K-Means but uses medoids instead of centroids, more robust to
       outliers.
      CLARANS (Clustering Large Applications based upon Randomized Search) –
       Optimized version of K-Medoids for large datasets.
2. Hierarchical Clustering
These algorithms create a tree-like structure (dendrogram) of clusters.
      Agglomerative Hierarchical Clustering – A bottom-up approach, starts with individual
       points and merges them.
      Divisive Hierarchical Clustering – A top-down approach, starts with a single cluster and
       splits iteratively.
3. Density-Based Clustering
These algorithms group data points based on density regions.
       DBSCAN (Density-Based Spatial Clustering of Applications with Noise) – Can detect
        arbitrary-shaped clusters and handle noise well (a short example follows this list).
      OPTICS (Ordering Points to Identify the Clustering Structure) – Improves DBSCAN
       for varying density clusters.
      Mean-Shift – Uses a density function to move data points toward dense regions.
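A brief example of DBSCAN on toy data (the eps and min_samples values are illustrative choices; points labelled -1 are treated as noise):

import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups plus one isolated point that should be flagged as noise.
X = np.array([[1, 2], [1, 3], [2, 2], [8, 8], [8, 9], [9, 8], [25, 25]])

labels = DBSCAN(eps=2.0, min_samples=2).fit_predict(X)
print(labels)   # two clusters (0 and 1) and the isolated point labelled -1 as noise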
4. Model-Based Clustering
These algorithms assume data is generated from a mixture of probability distributions.
      Gaussian Mixture Model (GMM) – Uses multiple Gaussian distributions to model data
       clusters.
      Expectation-Maximization (EM) – A probabilistic approach for clustering when data
       follows a distribution.
5. Grid-Based Clustering
These algorithms divide the data space into grids and perform clustering on them.
      STING (Statistical Information Grid-based Clustering) – Uses statistical measures to
       form clusters.
      CLIQUE (Clustering In QUEst) – Hybrid of grid-based and density-based clustering.
6. Fuzzy Clustering
These algorithms allow data points to belong to multiple clusters with a degree of membership.
      Fuzzy C-Means (FCM) – Assigns probabilities of belonging to each cluster instead of
       hard assignment.
7. Deep Learning-Based Clustering
Advanced clustering techniques using neural networks.
       Autoencoder-Based Clustering – Uses deep learning models to extract features before
        clustering.
       Deep Embedded Clustering (DEC) – Learns cluster assignments directly from data using
        deep neural networks.
8. Graph-Based Clustering
These algorithms use graph theory to form clusters.
       Spectral Clustering – Uses eigenvalues of similarity matrices to cluster non-linearly
        separable data.
       Affinity Propagation – Message-passing clustering method, no need to specify clusters
        beforehand.
       Markov Clustering Algorithm (MCL) – Uses random walks on a graph to detect clusters.
9. Evolutionary Clustering
Metaheuristic clustering methods inspired by nature.
      Genetic Algorithm-Based Clustering – Uses genetic algorithms for optimization-based
       clustering.
      Particle Swarm Optimization (PSO) Clustering – Swarm intelligence-based clustering
       technique.
                 Comparison of Clustering Algorithms in Machine Learning

| Algorithm | Type | Characteristics | Advantages | Disadvantages | Best Use Cases |
|---|---|---|---|---|---|
| K-Means | Partition-Based | Divides data into k clusters; centroid-based | Fast, simple, efficient for large datasets | Sensitive to outliers, needs predefined k | Customer segmentation, document clustering |
| K-Medoids | Partition-Based | Uses medoids (actual points) instead of centroids | More robust to outliers than K-Means | Slower than K-Means, computationally expensive | Small datasets with noise, fraud detection |
| CLARANS | Partition-Based | Randomized version of K-Medoids for efficiency | Works better on large datasets than K-Medoids | Still slower than K-Means | Large-scale clustering, geospatial analysis |
| Agglomerative Clustering | Hierarchical | Bottom-up approach; forms a hierarchy (dendrogram) | No need to specify clusters beforehand | Computationally expensive (O(n²)) | Taxonomy, gene expression analysis |
| Divisive Clustering | Hierarchical | Top-down approach; splits clusters recursively | Better for small datasets, hierarchical structure | Computationally expensive (O(n²)) | Social network analysis, document organization |
| DBSCAN | Density-Based | Clusters based on density; detects noise | Handles noise, arbitrary-shaped clusters | Struggles with varying densities, parameter-sensitive | Anomaly detection, spatial data analysis |
| OPTICS | Density-Based | Similar to DBSCAN but works with variable densities | Better for varying densities | More complex than DBSCAN | Large-scale clustering, uneven data densities |
| Mean-Shift | Density-Based | Moves points toward dense regions | No need to define k, flexible cluster shapes | High computational cost (O(n²)) | Image segmentation, object tracking |
| Gaussian Mixture Model (GMM) | Model-Based | Uses probabilistic models (Gaussian distributions) | Soft clustering, flexible cluster shapes | Computationally expensive, requires k | Bioinformatics, financial fraud detection |
| Expectation-Maximization (EM) | Model-Based | Finds maximum likelihood cluster assignments | Works well with missing data | Slow convergence, local optima problems | Image recognition, speech processing |
| STING | Grid-Based | Divides data into hierarchical grid cells | Fast and scalable | Loses precision at lower levels | Geographic clustering, large-scale data |
| CLIQUE | Grid-Based | Combines density-based & grid-based methods | Handles high-dimensional data | Computationally complex | High-dimensional data analysis, gene clustering |
| Fuzzy C-Means (FCM) | Fuzzy Clustering | Allows points to belong to multiple clusters | Soft clustering, better for overlapping clusters | Sensitive to noise, requires k | Medical imaging, text document clustering |
| Autoencoder-Based Clustering | Deep Learning-Based | Uses neural networks to reduce dimensions | Works well for high-dimensional data | Requires deep learning expertise, computationally expensive | Deep learning applications, complex patterns |
| Deep Embedded Clustering (DEC) | Deep Learning-Based | Learns cluster structure directly from data | Can extract meaningful representations | Requires large labeled dataset for pretraining | Image & text clustering at large scale |
| Spectral Clustering | Graph-Based | Uses eigenvalues of similarity matrices | Works well for non-convex clusters | Computationally expensive (O(n³)) | Graph clustering, NLP applications |
| Affinity Propagation | Graph-Based | Message-passing between data points | No need to specify k | High memory usage, slow on large datasets | Image segmentation, recommendation systems |
| Markov Clustering (MCL) | Graph-Based | Uses random walks on a graph | Good for network-based clustering | Not widely used outside graph-based applications | Biological networks, social network clustering |
| Genetic Algorithm Clustering | Evolutionary | Uses genetic algorithms for optimization | Works for dynamic clustering problems | Slow, requires tuning evolutionary parameters | Complex datasets, evolutionary data analysis |
| Particle Swarm Optimization (PSO) Clustering | Evolutionary | Uses swarm intelligence for clustering | Works for dynamically changing data | Computationally expensive, convergence issues | Dynamic clustering problems, optimization tasks |
Using K-means for Flat Clustering
       A flat clustering algorithm (like K-Means) takes a set of N data points and a specified
number K, and divides the points into K distinct clusters, where each data point belongs to exactly
one cluster, and all clusters are at the same level with no hierarchy between them.
            Flat Clustering Algorithm (K-Means style)
      Input: A set of N data points (or documents).
      Choose the number of clusters, K, in advance.
Step 1: Initialize K Cluster Centroids
      Randomly select K points from the data as initial centroids
Step 2: Assign Points to the Nearest Centroid
      For each data point:
            o   Calculate the distance (typically Euclidean) to each centroid.
            o   Assign the point to the closest centroid's cluster.
      Initial Result: All N points are divided into K groups based on proximity to centroids.
Step 3: Recalculate Cluster Centroids
      For each of the K clusters:
           o    Compute the new centroid by averaging the data points in that cluster.
           o    This new centroid becomes the updated representative.
Step 4: Repeat Assignment and Update
      Reassign all data points to the nearest updated centroid.
      Recalculate centroids again.
      Repeat Steps 2 and 3 until:
           o    Cluster assignments no longer change (convergence), or
           o    A maximum number of iterations is reached.
Step 5: Output the Final Clusters
      Once stable, the algorithm outputs:
           o    The K clusters, each containing a group of data points.
           o    The final centroids representing each cluster.
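A minimal NumPy sketch of the assignment and update steps described above (an illustration of the algorithm, not a full implementation; Euclidean distance is assumed):

import numpy as np

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]], dtype=float)
centroids = X[[0, 3]].copy()        # Step 1: pick two existing points as the initial centroids

for _ in range(10):                 # Step 4: repeat assignment and update
    # Step 2: distance from every point to every centroid, then nearest-centroid labels.
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Step 3: move each centroid to the mean of the points assigned to it.
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(len(centroids))])
    if np.allclose(new_centroids, centroids):   # convergence: centroids stopped moving
        break
    centroids = new_centroids

print(labels)      # Step 5: final cluster assignments
print(centroids)   # Step 5: final centroids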
       Flat clustering refers to a type of clustering algorithm that organizes a given set of N data
points (or documents) into a predefined number K of distinct groups, called clusters, without
establishing any hierarchical structure among them. The term “flat” signifies that all clusters are
treated equally and exist at the same level—there are no parent-child relationships or nested
groupings as seen in hierarchical clustering. This means that each data point belongs to exactly
one cluster, and clusters do not overlap (unless fuzzy methods are used).
       In flat clustering, the goal is to find a partition—that is, to divide the entire dataset into
non-overlapping subsets. Each subset (or cluster) groups together data points that are more similar
to each other than to those in other clusters. For instance, this method is widely used to cluster text
documents, customer profiles, or images, where N represents the total number of documents, users,
or images, and K is the number of categories or segments we want to identify.
Flat Clustering (K-Means) –Example
Dataset: We have 6 data points representing coordinates in 2D space:
A (1, 2), B (1, 4), C (1, 0), D (10, 2), E (10, 4), F (10, 0)
Assume: K=2
Step 1: Initialize K = 2 Centroids
Let’s randomly choose:
Centroid 1: A (1, 2) Centroid 2: D (10, 2)
Step 2: Assign Each Point to Nearest Centroid
We use the Euclidean Distance.
Euclidean distance is the straight-line distance between two points in Euclidean space.
For two points, a centroid (x₁, y₁) and a data point P₂ = (x₂, y₂), the Euclidean Distance (d) is:
d = √((x₂ − x₁)² + (y₂ − y₁)²)
| Point | Distance to A (1, 2) | Distance to D (10, 2) | Assigned to |
|---|---|---|---|
| A (1, 2) | 0 | 9 | Cluster 1 |
| B (1, 4) | 2 | √85 ≈ 9.22 | Cluster 1 |
| C (1, 0) | 2 | √85 ≈ 9.22 | Cluster 1 |
| D (10, 2) | 9 | 0 | Cluster 2 |
| E (10, 4) | √85 ≈ 9.22 | 2 | Cluster 2 |
| F (10, 0) | √85 ≈ 9.22 | 2 | Cluster 2 |
      Cluster 1 (K1): A (1,2), B (1,4), C (1,0)
      Cluster 2 (K2): D (10,2), E (10,4), F (10,0)
Compute the New Centroids (Means)
For Cluster 1 (K1):
         Points: (1,2), (1,4), (1,0)
         Mean X: (1 + 1 + 1) / 3 = 1
         Mean Y: (2 + 4 + 0) / 3 = 2
         New Centroid 1 = (1, 2)
For Cluster 2 (K2):
         Points: (10,2), (10,4), (10,0)
         Mean X: (10 + 10 + 10) / 3 = 10
         Mean Y: (2 + 4 + 0) / 3 = 2
         New Centroid 2 = (10, 2)
Here, the new centroids are identical to the assumed initial centroids, so K-Means has converged.
The assignments no longer change, and the clusters are stable.
Using KMeans from Sklearn
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
# Step 1: Define the dataset
data = np.array([[1, 2],    # A
                 [1, 4],    # B
                 [1, 0],    # C
                 [10, 2],   # D
                 [10, 4],   # E
                 [10, 0]])  # F
# Step 2: Run KMeans
kmeans = KMeans(n_clusters=2, random_state=0, n_init='auto')
kmeans.fit(data)
# Step 3: Get labels and centroids
labels = kmeans.labels_
centroids = kmeans.cluster_centers_
# Step 4: Plotting
plt.figure(figsize=(8, 6))
# Plot each cluster with a different color
for i in range(2): # for K=2
   cluster_points = data[labels == i]
   plt.scatter(cluster_points[:, 0], cluster_points[:, 1], label=f'Cluster {i + 1}')
# Plot centroids
plt.scatter(centroids[:, 0], centroids[:, 1], c='black', s=200, marker='X',
label='Centroids')
# Annotate original points
point_labels = ['A', 'B', 'C', 'D', 'E', 'F']
for i, txt in enumerate(point_labels):
   plt.annotate(txt, (data[i][0] + 0.2, data[i][1] + 0.2))
# Styling
plt.title("K-Means Clustering (K=2)")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.legend()
plt.grid(True)
plt.tight_layout()
# Show plot
plt.show()
                                            Fig: Output
Implementing Fit & Predict Functions
       Fit() function
       The fit() function in scikit-learn's KMeans class is the core method used to train the K-
means clustering model on your data. It takes your data as input and learns the optimal cluster
centers (centroids) that minimize the within-cluster sum of squares (WCSS).
Here's a detailed explanation of how the fit() function in KMeans works:
Core Goal:
       The primary goal of the fit() function is to find the best possible positions for k centroids
in your data space, such that each data point is assigned to the cluster whose centroid is closest to
it. "Best possible" is defined by minimizing the WCSS, which is the sum of the squared Euclidean
distances between each data point and its assigned centroid.
Input to fit():
The fit() method typically takes one required argument:
kmeans.fit(X)
       X: This is the training data. It should be a NumPy array or a sparse matrix of shape
        (n_samples, n_features), where:
                    o    n_samples is the number of data points.
                    o    n_features is the number of dimensions (features) of each data point.
Steps Involved in fit():
The fit() function implements the K-means algorithm, which generally involves the following
iterative steps:
    1. Initialization of Centroids:
    o   The algorithm starts by initializing k centroid points. The method of initialization can be
        controlled by the init parameter of the KMeans class. Common initialization methods
        include:
               'k-means++' (default): This is a smart initialization technique that aims to select initial
                centroids that are far from each other, leading to faster convergence and potentially
                better results (a simplified code sketch appears after these steps). It works by:
                       Randomly selecting the first centroid from the data points.
                       For each subsequent centroid, it calculates the squared distance between each data
                        point and the nearest already chosen centroid.
                       It then selects the next centroid with a probability proportional to the square of this
                        distance. This increases the likelihood of choosing points that are far away from
                        existing centroids.
                   'random': This method simply selects k data points randomly from the input data as
            initial centroids. This is simpler but can be more sensitive to the initial choice and may
            lead to slower convergence or suboptimal solutions.
                   A NumPy array: You can also provide your own initial centroids as a NumPy array
                    of shape (n_clusters, n_features).
    2. Assignment Step (E-step - Expectation):
    o   Once the initial centroids are chosen, each data point in the dataset X is assigned to the
        cluster whose centroid is closest to it.
    o   The distance is typically calculated using the Euclidean distance (the default). You can
        potentially customize the distance metric using a custom callable via the metric parameter
        in some related clustering algorithms, though KMeans primarily uses Euclidean distance.
    o   For each data point, the algorithm finds the centroid with the minimum distance and assigns
        the data point to that cluster.
    3. Update Step (M-step - Maximization):
    o   After all data points have been assigned to clusters, the algorithm recalculates the position
        of each centroid.
   o   The new position of each centroid is the mean (average) of all the data points that are
       currently assigned to that cluster.
   o   This step essentially moves the centroids towards the "center" of their respective clusters.
   4. Iteration and Convergence:
   o   Steps 2 (assignment) and 3 (update) are repeated iteratively.
   o   The algorithm continues to iterate until a stopping criterion is met. Common stopping
       criteria include:
      Convergence: The centroids no longer move significantly between iterations. This is often
       measured by checking if the change in centroid positions falls below a certain tolerance
       (tol parameter).
      Maximum number of iterations: The algorithm reaches a predefined maximum number
       of iterations (max_iter parameter). This is a safeguard to prevent the algorithm from
       running indefinitely if it doesn't converge quickly.
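The 'k-means++' seeding described in Step 1 can be sketched as follows (a simplified illustration, not scikit-learn's internal code):

import numpy as np

def kmeans_plus_plus_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # First centroid: a data point chosen uniformly at random.
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from every point to its nearest already-chosen centroid.
        d2 = np.min(((X[:, None, :] - np.array(centroids)[None, :, :]) ** 2).sum(axis=2), axis=1)
        # Pick the next centroid with probability proportional to that squared distance.
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]], dtype=float)
print(kmeans_plus_plus_init(X, k=2))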
Output of fit():
After the fit() method completes, the KMeans object will have learned the cluster structure from
the data. It updates several attributes of the object, including:
      cluster_centers_: A NumPy array of shape (n_clusters, n_features) containing the
       coordinates of the final cluster centroids.
      labels_: A NumPy array of shape (n_samples,) containing the cluster label (0 to n_clusters
       - 1) assigned to each data point in the training data X.
      inertia_: The sum of squared distances of samples to their closest cluster center. This is
       the value of the objective function (WCSS) that the K-means algorithm tries to minimize.
      n_iter_: The number of iterations run to achieve convergence.
How to Use fit():
from sklearn.cluster import KMeans
import numpy as np

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
kmeans = KMeans(n_clusters=2, random_state=42)
kmeans.fit(X)
print("Centroids:", kmeans.cluster_centers_)
print("Labels:", kmeans.labels_)
print("Inertia:", kmeans.inertia_)

Output:
Centroids: [[ 1.  2.]
            [10.  2.]]
Labels: [0 0 0 1 1 1]
Inertia: 16.0
Predict () function
The predict() function in scikit-learn's KMeans class is used after the model has been trained using
the fit() function. Its purpose is to assign new, unseen data points to the clusters learned during
the training phase.
Here's a detailed explanation of how the predict() function works:
Core Goal:
The primary goal of the predict() function is to take new data points as input and, based on the
cluster centroids learned during the fit() stage, determine which of the existing clusters each new
data point most likely belongs to.
Input to predict():
The predict() method takes one required argument:
      X: This is the new data for which you want to predict the cluster assignments. It should be
       a NumPy array or a sparse matrix of shape (n_samples, n_features), where:
           o   n_samples is the number of new data points.
           o   n_features must be the same as the number of features used to train the KMeans
               model in the fit() step.
Steps Involved in predict():
The predict() function performs the following steps for each new data point in the input X:
   1. Distance Calculation: For each new data point, the function calculates the distance
       between that data point and each of the cluster centroids that were learned during the fit()
       phase (stored in kmeans.cluster_centers_). The distance metric used is typically the same
       as the one implicitly used during training, which is usually the Euclidean distance in the
       standard KMeans implementation.
   2. Assignment to the Closest Centroid: After calculating the distances to all the cluster
       centroids, the predict() function assigns the new data point to the cluster whose centroid is
       the closest to it (i.e., the centroid with the minimum distance).
   3. Label Assignment: The function then returns the index (label) of the closest cluster
       (ranging from 0 to n_clusters - 1) as the predicted cluster for that new data point.
Output of predict():
The predict() method returns a NumPy array of shape (n_samples,), where n_samples is the
number of new data points in the input X. Each element in this array represents the predicted
cluster label (an integer) for the corresponding new data point.
How to Use predict():
from sklearn.cluster import KMeans
import numpy as np

# Original training data
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

# Fit the model
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
kmeans.fit(X)

# New data points to classify
new_points = np.array([[0, 0], [12, 3]])

# Predict the clusters for new data
predictions = kmeans.predict(new_points)
print("Predicted clusters:", predictions)

Output:
Predicted clusters: [0 1]
                                            Key Differences

| Feature | fit(X) | predict(X_new) |
|---|---|---|
| Purpose | Trains the K-means model; learns the cluster structure from the training data. | Assigns new, unseen data points to the learned clusters; uses the existing model to classify new data. |
| Input Data | Training data X (used to learn clusters). | New data X_new (to be assigned to existing clusters). |
| Process | 1. Initializes centroids. 2. Iteratively assigns data points to the closest centroid. 3. Updates centroid positions based on the mean of assigned points. 4. Repeats until convergence or max iterations. | 1. Calculates the distance between each new data point and all the learned cluster centroids. 2. Assigns each new data point to the cluster with the nearest centroid. |
| Output | Modifies the KMeans object by learning: cluster_centers_ (centroid coordinates), labels_ (cluster assignments for the training data), inertia_ (WCSS), n_iter_ (number of iterations). | Returns a NumPy array of cluster labels (integers) for each data point in X_new, indicating which of the learned clusters they belong to. |
| Model State | Changes the state of the KMeans object by learning the cluster centers. | Does not change the state of the KMeans object; it only uses the already learned cluster_centers_. |
| When to Use | Used once to train the K-means model on your initial dataset. | Used after the model has been trained with fit() to classify new, unlabeled data; can be called multiple times with different sets of new data. |
| Key Outcome | Discovers the underlying cluster structure in the training data. | Determines the cluster membership of new data points based on the learned structure. |
         fit() is like the learning phase. You show the algorithm some examples (your training
          data), and it figures out the best way to group them into clusters by finding the centers of
          those groups.
         predict() is like the application phase. Once the algorithm has learned the groups (found
          the centers), you give it new, unlabeled items, and it tells you which of the learned groups
          each new item most likely belongs to based on its proximity to the centers.
Implementing K-Means Class
          K-Means is a fundamental unsupervised learning algorithm aimed at partitioning a dataset
into K distinct clusters. The core idea is to iteratively assign data points to the nearest cluster
centroid and then update these centroids based on the mean of the assigned points. This process
continues until the centroids stabilize or a predefined stopping criterion is met.
      K-Means is a popular unsupervised learning algorithm used for clustering data into K groups
based on feature similarity. When implementing it from scratch, we gain a full understanding of:
         Distance computation
         Cluster assignment
         Centroid updating
         Convergence condition
Implementing K-Means as a class offers several advantages:
         Encapsulation: It bundles the data (attributes like centroids and labels) and the operations
          (methods like fit and predict) related to the K-Means algorithm into a single unit,
          promoting code organization and reusability.
         Modularity: Different aspects of the algorithm, such as initialization, assignment, and
          update steps, are implemented as separate methods, making the code easier to understand,
          test, and modify.
             State Management: The class instance maintains the state of the trained model (e.g., the
              final centroids), allowing for prediction on new data after the model has been fitted.
             Extensibility: It provides a framework for adding more advanced features or variations of
              K-Means in the future.
Structure of the KMeans Class:
import numpy as np

class KMeans:
    def __init__(self, n_clusters=2, max_iters=100, tol=1e-4):
        ...
    def fit(self, X):
        ...
    def predict(self, X):
        ...
    def _compute_distances(self, X):
        ...
Detailed Explanation of Each Method:
1. def __init__(self, n_clusters=2, max_iters=100, tol=1e-4):
             Purpose: This is the constructor of the K-Means class. It's the first method that gets called
              when you create an instance (object) of the KMeans class. Its primary role is to initialize
              the attributes (data and settings) of the K-Means object.
             Parameters:
              o   self: This is a convention in Python object-oriented programming. It refers to the
                  instance of the class being created. You always include self as the first parameter in
                  instance methods.
              o   n_clusters (int, default=2): This parameter specifies the number of clusters that the
                  K-Means algorithm should try to find in the data. It has a default value of 2, meaning
                  if you don't explicitly provide a value when creating a KMeans object, it will attempt
                  to form 2 clusters.
              o   max_iters (int, default=100): This parameter defines the maximum number of
                  iterations the K-Means algorithm will run for a single execution of the fit method.
                  This acts as a safeguard to prevent the algorithm from running indefinitely if it doesn't
                  converge quickly.
              o   tol (float, default=1e-4): This parameter sets the tolerance for convergence. It's a
                  small positive value. The algorithm stops iterating when the change in the cluster
            centroids between consecutive iterations is less than this tolerance. This indicates that
            the centroids have stabilized and the clustering process has (likely) converged.
   2. def fit(self, X):
      Purpose: This method is the core of the K-Means algorithm. It takes the training data X
       as input and performs the iterative process of assigning data points to clusters and updating
       the cluster centroids until a stopping criterion (either reaching max_iters or the change in
       centroids being less than tol) is met. This process "fits" the K-Means model to the data,
       learning the cluster structure.
      Parameters:
           o    self: Refers to the instance of the KMeans class.
           o    X (NumPy array or similar): This is the training data. It's typically a 2D array
                where each row represents a data point, and each column represents a feature.
      Inside the fit method (indicated by ...):
           o    The fit method would typically involve the following steps:
                   1. Initialization of Centroids: Randomly select n_clusters data points from
                       X as initial centroids. More sophisticated initialization techniques like K-
                       Means++ might also be implemented here.
                   2. Iterative Process: Repeat the following until convergence or max_iters is
                       reached:
                           Assignment Step: Assign each data point in X to the cluster whose
                            centroid is closest (using a distance metric like Euclidean distance).
                           Update Step: Recalculate the centroids of each cluster by taking the
                            mean of all the data points assigned to that cluster
   3. def predict(self, X):
      Purpose: Once the K-Means model has been fitted to the training data using the fit method,
       the predict method can be used to assign new, unseen data points to the learned clusters.
       It does this by finding the closest centroid (learned during the fit phase) for each new data
       point.
      Parameters:
           o    self: Refers to the instance of the KMeans class.
           o    X (NumPy array or similar): This is the new data that you want to assign to clusters.
                It should have the same number of features as the training data used in fit.
      Inside the predict method (indicated by ...):
           o   The predict method would typically:
            1. Check if the model is fitted: Ensure that the fit method has been called previously
               and that self.centroids is not None.
            2. Calculate distances: For each data point in the input X, calculate the distance to
               each of the learned centroids stored in self.centroids.
            3. Assign labels: Assign each data point to the cluster corresponding to the nearest
               centroid. The index of the nearest centroid becomes the cluster label for that data
               point.
            4. Return the labels: Return a NumPy array containing the cluster labels for each
               data point in the input X.
   4. def _compute_distances(self, X):
      Purpose: This is a helper method (indicated by the leading underscore _). Helper methods
       are usually intended for internal use within the class and are not meant to be called directly
       by the user. The _compute_distances method is responsible for calculating the distances
       between the data points in X and the cluster centroids.
      Parameters:
           o   self: Refers to the instance of the KMeans class.
           o   X (NumPy array or similar): This is the data for which distances to the centroids
               need to be calculated. It could be the training data in the fit method or new data in
               the predict method.
      Inside the _compute_distances method (indicated by ...):
           o   This method would implement the chosen distance metric, most commonly the
               Euclidean distance.
           o   It would take the input data X and the current cluster centroids (likely stored in
               self.centroids) and calculate the distance between each data point in X and each
               centroid.
           o   The output would typically be a 2D array where each row corresponds to a data
               point in X, and each column corresponds to a centroid. The value at [i, j] would be
               the distance between the i-th data point and the j-th centroid.
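As a quick illustration of how such a helper can be written with NumPy broadcasting (a sketch; the shapes in the comments assume the six-point 2-D dataset below and two centroids):

import numpy as np

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])   # shape (6, 2)
centroids = np.array([[1.0, 2.0], [10.0, 2.0]])                     # shape (2, 2)

# X[:, np.newaxis] has shape (6, 1, 2); subtracting centroids broadcasts to (6, 2, 2).
# Taking the norm over the last axis gives a (6, 2) matrix of point-to-centroid distances.
distances = np.linalg.norm(X[:, np.newaxis] - centroids, axis=2)
print(distances.shape)             # (6, 2)
print(distances.argmin(axis=1))    # index of the nearest centroid for each point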
Python Program – Custom KMeans Class
import numpy as np

class MyKMeans:
    def __init__(self, n_clusters=2, max_iters=100, tol=1e-4):
        self.n_clusters = n_clusters
        self.max_iters = max_iters
        self.tol = tol
        self.centroids = None
        self.labels_ = None

    def fit(self, X):
        np.random.seed(42)  # for reproducibility
        # Initialize centroids by picking n_clusters random data points
        random_indices = np.random.choice(len(X), self.n_clusters, replace=False)
        self.centroids = X[random_indices]
        for _ in range(self.max_iters):
            # Assignment step: label each point with the index of its nearest centroid
            distances = self._compute_distances(X)
            labels = np.argmin(distances, axis=1)
            # Update step: recompute each centroid as the mean of its assigned points
            new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(self.n_clusters)])
            # Convergence check: stop when no centroid moved more than tol
            if np.all(np.linalg.norm(self.centroids - new_centroids, axis=1) < self.tol):
                break
            self.centroids = new_centroids
        self.labels_ = labels

    def predict(self, X):
        distances = self._compute_distances(X)
        return np.argmin(distances, axis=1)

    def _compute_distances(self, X):
        # Shape (n_samples, n_clusters): Euclidean distance of each point to each centroid
        return np.linalg.norm(X[:, np.newaxis] - self.centroids, axis=2)

# === Data ===
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

# === Using the Custom KMeans ===
model = MyKMeans(n_clusters=2)
model.fit(X)
print("Final Centroids:")
print(model.centroids)
print("\nCluster Labels:")
print(model.labels_)

# Predict new points
new_points = np.array([[0, 0], [12, 3]])
predictions = model.predict(new_points)
print("\nPredicted Clusters for new points:")
print(predictions)
Output:
Final Centroids:
[[ 1. 2.] [10. 2.]]
Cluster Labels:
[0 0 0 1 1 1]
Predicted Clusters for new points:
[0 1]
Unsupervised Learning - 2
How to choose Optimal K
What Is K in K-Means?
K-Means is an unsupervised learning algorithm used to group data into K distinct clusters based
on feature similarity.
 It minimizes the distance between data points and their assigned cluster center (centroid).
 It's called "K-means" because we define the number of clusters as K.
In K-Means clustering, K is the number of clusters the algorithm divides the data into.
Choosing the right K is essential to capture the natural structure in your data.
 If K is too small → underfitting
 If K is too large → overfitting or meaningless clusters
Objective:
Minimize WCSS (Within-Cluster Sum of Squares) — i.e., make clusters as compact as possible.
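As a small sketch of this objective, WCSS can be computed directly from a set of labels and centroids (the labels and centroids below are assumed to come from a finished K-Means run on the six-point dataset used in this unit):

import numpy as np

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
labels = np.array([0, 0, 0, 1, 1, 1])
centroids = np.array([[1.0, 2.0], [10.0, 2.0]])

# WCSS = sum of squared distances of each point to its assigned centroid
wcss = np.sum((X - centroids[labels]) ** 2)
print(wcss)   # 16.0 for this assignment (8 per cluster)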
                                       K-Means Algorithm
1. Initialization
       Choose number of clusters K.
       Randomly select K initial centroids (or use K-Means++ for better starting points).
2. Assignment Step
       For each data point:
             o   Compute distance to each centroid (usually Euclidean distance).
            o   Assign the point to the nearest centroid’s cluster.
3. Update Step
       For each cluster:
            o   Recalculate the centroid as the mean of all assigned points.
4. Repeat (Iterate)
       Repeat steps 2 and 3 until one of the following:
            o   Centroids stop moving (i.e., convergence).
            o   No change in cluster assignments.
            o   Max iterations reached.
 Key Metric:
       WCSS (Within-Cluster Sum of Squares) (Inertia):
                       WCSS = ∑ (distance of each point to its centroid)²
        Lower WCSS = better clustering.
Determine the optimal K
Selecting the appropriate number of clusters (K) is crucial for the effectiveness of the K-Means
algorithm. Several methods can be used to help determine the optimal K, including:
       Elbow Method: Plotting WCSS against the number of clusters and looking for an
        "elbow" point where the rate of decrease in WCSS sharply changes.
       Silhouette Analysis: Measuring how well each data point fits into its assigned cluster
        compared to other clusters.
       Gap Statistic: Comparing the within-cluster dispersion of the actual data to that of
        random data.
       Domain Knowledge: Utilizing prior understanding of the data to guide the choice of K.
Advantages of K-Means:
       Simple to understand and implement.
       Relatively efficient, especially for large datasets (with caveats).
       Often produces good results when clusters are well-separated and roughly spherical.
Disadvantages of K-Means:
       Sensitive to the initial placement of centroids.
       Assumes clusters are spherical and equally sized.
       Requires specifying the number of clusters (K) beforehand.
       Sensitive to outliers.
       Not guaranteed to find the global optimum.
Elbow Method
       The Elbow Method is a heuristic technique used in cluster analysis, particularly with the
K-Means algorithm, to determine the optimal number of clusters (K) for a given dataset.
       The Elbow Method is a graphical approach to finding the most appropriate number of
clusters (K) in K-Means clustering by evaluating the model’s performance for different K values.
       It's a visual method that helps in making an informed decision about the value of K by
observing the change in Within-Cluster Sum of Squares (WCSS) as K increases.
1. Understanding Within-Cluster Sum of Squares (WCSS):
      Definition: WCSS measures the sum of the squared Euclidean distances between each data
       point within a cluster and the centroid (mean) of that cluster.
      Interpretation:
           o   A lower WCSS indicates that the data points within each cluster are closer to their
               respective centroids, suggesting that the clusters are more compact and internally
               cohesive.
           o   A higher WCSS implies that the data points are more spread out within the clusters.
      Goal of K-Means: The K-Means algorithm aims to minimize the total WCSS across all
       clusters.
Key Concept
When you increase the number of clusters:
      Inertia (WCSS) decreases — because points are closer to their centroids.
      But after a point, adding more clusters gives diminishing returns.
As you increase K:
      The inertia (within-cluster sum of squares) decreases.
      But the rate of decrease slows down after a certain point.
      The point where the rate changes sharply (like an elbow) is the optimal K.
The Elbow Method helps us find this point where adding another cluster doesn’t significantly
improve the model — it forms an “elbow” in the graph of K vs. WCSS.
Terminology
      K = Number of clusters.
      WCSS (Within-Cluster Sum of Squares) = Sum of squared distances between each point
       and its assigned centroid.
      Inertia = another name for WCSS (used in sklearn).
        Inertia (WCSS) Formula
               WCSS = ∑ (i = 1 to K) ∑ (x ∈ Ci) ‖x − μi‖²
Where:
        K = number of clusters
        Ci = set of points in cluster i
        μi = centroid of cluster i
        ‖x − μi‖² = squared Euclidean distance
When Does the Elbow Occur?
        The elbow occurs where adding more clusters doesn’t significantly improve the WCSS.
        It's subjective — sometimes the elbow isn't clearly visible.
2. The Intuition behind the Elbow Method:
        High WCSS at K=1: All data in one cluster leads to high within-cluster variance.
        WCSS Decreases with K: More clusters mean data points are closer to centroids → WCSS
         drops significantly.
        Diminishing Returns: After a certain K, adding more clusters results in only small WCSS
         reductions — indicating possible overfitting.
        Elbow Point: The bend in the WCSS vs. K plot marks the optimal number of clusters —
         where improvement slows down.
                                       Algorithm Elbow Method:
   1. Choose a Range of K Values: Decide on a reasonable range of K to test. This typically
         starts from 1 and goes up to a certain number (e.g., 10 or more), depending on the size and
         expected complexity of your dataset.
   2. Run K-Means for Each K: For each value of K in the chosen range:
             o   Initialize and run the K-Means algorithm on your data. Ensure the algorithm is
                 allowed to converge (reach a stable state or a maximum number of iterations).
              o   After the algorithm converges, calculate and record the WCSS value. Most K-Means
                  implementations (like scikit-learn's) expose the WCSS as the inertia_ attribute.
   3. Store the WCSS Values: Keep a record (e.g., a list or array) of the WCSS values
         corresponding to each K value tested.
   4. Plot the Results: Create a line plot where:
             o   The x-axis represents the number of clusters (K).
             o   The y-axis represents the corresponding WCSS value.
   5. Identify the Elbow Point: Visually inspect the plot. Look for the point where the rate of
         decrease in the WCSS starts to slow down significantly, forming an "elbow" shape.
    6. Determine the Optimal K: The K value at the "elbow point" is often considered a good
        estimate for the optimal number of clusters for your data.
Interpreting the Elbow Plot:
       A steep decline in WCSS indicates that adding more clusters significantly improves the
        clustering by reducing the within-cluster variance.
       The "elbow" is the point where this decrease starts to become less dramatic.
       After the elbow point, adding more clusters results in only marginal reductions in WCSS,
        suggesting that the additional complexity might not be justified by the improvement in
        cluster cohesion.
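A minimal sketch of this procedure with scikit-learn and matplotlib (both assumed to be available); the WCSS for each K is read from the inertia_ attribute and then plotted:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

wcss = []
k_values = range(1, 6)
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)          # WCSS for this value of K

plt.plot(list(k_values), wcss, marker='o')
plt.xlabel('Number of clusters K')
plt.ylabel('WCSS (inertia)')
plt.title('Elbow Method')
plt.show()

For this dataset the curve drops sharply from K = 1 to K = 2 and only slightly afterwards, which is exactly the elbow identified in the worked example below.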
Eg: Solve [1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0] using elbow method
The given dataset:
Data points: x1=[1,2] x2=[1,4] x3=[1,0] x4=[10,2] x5=[10,4] x6=[10,0]
Calculate the Within-Cluster Sum of Squares (WCSS) for different values of K (number of
clusters).
    Step 1: K = 1 There is only one cluster containing all the data points.
    The centroid of this cluster is the mean of all points:
               Centroid = ((1+1+1+10+10+10)/6, (2+4+0+2+4+0)/6) = (5.5, 2.0)
WCSS (sum of squared distances from each point to the centroid) is computed with the Euclidean
distance formula:
               d(p, q) = √((xp − xq)² + (yp − yq)²)
Distances to Centroid (5.5, 2.0):
 Point       Calculation                                   Distance      Squared
 A (1,2)     √((1−5.5)² + (2−2)²) = √20.25                 4.5           20.25
 B (1,4)     √((1−5.5)² + (4−2)²) = √24.25                 4.924         24.25
 C (1,0)     √((1−5.5)² + (0−2)²) = √24.25                 4.924         24.25
 D (10,2)    √((10−5.5)² + (2−2)²) = √20.25                4.5           20.25
 E (10,4)    √((10−5.5)² + (4−2)²) = √24.25                4.924         24.25
 F (10,0)    √((10−5.5)² + (0−2)²) = √24.25                4.924         24.25
WCSS for K=1 is the sum of these squared distances:
WCSS1 = 20.25 + 24.25 + 24.25 + 20.25 + 24.25 + 24.25 = 137.5
Step 2: K = 2
Clusters:
   Let's assume the initial centroids are C2a= [1, 2] and C2b= [10, 2] (we could choose randomly,
    but this might lead to different initial clusters).
   Assignment Step: Assign each point to the nearest centroid.
         Cluster 1: A, B, C → Centroid = C2a (1, 2)
         Cluster 2: D, E, F → Centroid = C2b (10, 2)
Distances to C2a (1, 2):
 Point       Calculation                             Distance   Squared   Cluster
 A (1,2)     √((1−1)² + (2−2)²) = √0                 0          0         C2a
 B (1,4)     √((1−1)² + (4−2)²) = √4                 2          4         C2a
 C (1,0)     √((1−1)² + (0−2)²) = √4                 2          4         C2a
 D (10,2)    √((10−1)² + (2−2)²) = √81               9          81        C2b
 E (10,4)    √((10−1)² + (4−2)²) = √85               9.21       85        C2b
 F (10,0)    √((10−1)² + (0−2)²) = √85               9.21       85        C2b
                  WCSS for Cluster 1 (points A, B, C assigned to C2a): WCSS1 = 0 + 4 + 4 = 8
Distances to C2b (10, 2):
 Point       Calculation                             Distance   Squared   Cluster
 A (1,2)     √((1−10)² + (2−2)²) = √81               9          81        C2a
 B (1,4)     √((1−10)² + (4−2)²) = √85               9.21       85        C2a
 C (1,0)     √((1−10)² + (0−2)²) = √85               9.21       85        C2a
 D (10,2)    √((10−10)² + (2−2)²) = √0               0          0         C2b
 E (10,4)    √((10−10)² + (4−2)²) = √4               2          4         C2b
 F (10,0)    √((10−10)² + (0−2)²) = √4               2          4         C2b
                  WCSS for Cluster 2 (points D, E, F assigned to C2b): WCSS2 = 0 + 4 + 4 = 8
                        Calculate WCSS for K=2 Total WCSS = 8 + 8 = 16
Update Step: Calculate new centroids.
        C1′=[(1+1+1)/3,(2+4+0)/3]=[1,2]
        C2′=[(10+10+10)/3,(2+4+0)/3]=[10,2]
The centroids did not change in the first iteration, so the clustering is stable.
Step 3: K = 3
Let's assume initial centroids C3a=[1,2], C3b=[10,2], C3c=[1,4].
 Point       Squared distances to the centroids                            Distance   Squared   Cluster
 A (1,2)     Nearest to C3a (distance 0) -> Cluster 1                      0          0         C3a
 B (1,4)     Nearest to C3c (distance 0) -> Cluster 3                      0          0         C3c
 C (1,0)     d(C,C3a)²=4, d(C,C3b)²=85, d(C,C3c)²=16 -> Cluster 1          2          4         C3a
 D (10,2)    Nearest to C3b (distance 0) -> Cluster 2                      0          0         C3b
 E (10,4)    d(E,C3a)²=85, d(E,C3b)²=4, d(E,C3c)²=81 -> Cluster 2          2          4         C3b
 F (10,0)    d(F,C3a)²=85, d(F,C3b)²=4, d(F,C3c)²=97 -> Cluster 2          2          4         C3b
Update Step: Calculate new centroids.
        C1′=[(1+1)/2,(2+0)/2]=[1,1]
        C2′=[(10+10+10)/3,(2+4+0)/3]=[10,2]
        C3′=[1,4]
The centroids changed. We would need to iterate again until convergence. However, for the sake
of the elbow method, let's calculate the WCSS based on this first assignment and update.
                      Calculate WCSS for K=3 Total WCSS = 4 + 4 + 4 = 12
Plotting the WCSS values:
                                              Elbow Method
        K      WCSS
        1      137.5
        2      16
        3      12
        [Fig: Elbow plot of WCSS vs. K]
       If we were to plot these values, we would see a significant drop in WCSS from K=1 to
K=2. The drop from K=2 to K=3 is also noticeable but smaller. As K increases further, the WCSS
will continue to decrease, reaching 0 at K=6.
Identifying the Elbow:
       The "elbow" in this scenario is likely at K = 2. The most significant reduction in WCSS
occurs when moving from 1 to 2 clusters. The subsequent reduction when moving to 3 clusters is
less dramatic compared to the initial drop. After K=2, the benefit of adding more clusters in terms
of reducing WCSS diminishes.
Final Clusters: K = 2
Clusters:
   Let's consider centroids are C2a= [1, 2] and C2b= [10, 2]
       Centroid = C2a (1 , 2) → Cluster 1: A, B, C
       Centroid = C2b (10, 2) → Cluster 2: D, E, F
Silhouette algorithm to choose K
        The silhouette method is a powerful validation technique for unsupervised learning. It is
used in cluster analysis to interpret and validate the consistency within clusters of data. It provides
a graphical representation of how well each object lies within its cluster. The method was
introduced by the Belgian statistician Peter Rousseeuw in 1987. The Silhouette Algorithm is a
technique used to:
           Measure the quality of clustering.
           Evaluate how well each data point fits in its assigned cluster.
           Determine how similar a data point is to its own cluster vs other clusters.
It uses the Silhouette Score (or coefficient), which quantifies how well a point fits within its own
cluster compared to others.
                                     The Silhouette Algorithm
Step 1: Cluster the data
        Use any clustering algorithm (e.g., KMeans) to assign data points to clusters.
Step 2: For each data point i:
(a) Compute intra-cluster distance a(i):
        This is the average distance from point i to all other points in the same cluster.
(b) Compute nearest-cluster distance b(i):
        For each of the other clusters, compute the average distance from point i to all points in
         that cluster.
        b(i) is the minimum of those average distances.
(c) Compute the Silhouette Coefficient for point i:
               s(i) = (b(i) − a(i)) / max(a(i), b(i))
Step 3: Repeat Step 2 for all data points
        You will get one silhouette score per data point.
 Step 4: Compute the Average Silhouette Score
               s̄ = (1/n) ∑ s(i)
Where:
        s(i) is the silhouette score of the i-th point.
        n is the total number of data points.
        s̄ is the mean (average) silhouette score for the entire dataset.
    This gives the overall performance of the clustering.
    Higher s̄ means better clustering.
Solve clustering problem using the Silhouette Method to determine the best number of clusters
K for the dataset Points: [1,2],[1,4],[1,0],[10,2],[10,4],[10,0]
    1. Determine the Range of K to Test:
        The minimum number of clusters is K=2.
        The maximum number of clusters can be up to N (number of data points), but that's
         usually not meaningful (each point is its own cluster). A reasonable upper limit is N-1, or
         even less. Here, N=6. Let's test K=2, K=3, K=4, K=5. K=6 would give a Silhouette score
         of 0 or be undefined depending on the implementation, as there are no "other clusters".
    2. Choose a Clustering Algorithm: The Silhouette score evaluates clustering results, but it
         doesn't perform the clustering itself. We need a clustering algorithm. K-Means is the
         most common choice.
    3. Choose a Distance Metric: Euclidean distance is standard for K-Means and Silhouette
        analysis with this type of data.
       Distance formula: d(p, q) = √((xp − xq)² + (yp − yq)²)
   4. Perform Calculations for each K:
      Case K=2:
           o   Clustering (K-Means): By visual inspection or simple calculation, the data
               clearly splits into two groups: {P1, P2, P3} (around x=1)
           o   and {P4, P5, P6} (around x=10).
           o   Let's assume K-Means finds these clusters:
                      Cluster 1 (C1): {[1, 2], [1, 4], [1, 0]}
                      Cluster 2 (C2): {[10, 2], [10, 4], [10, 0]}
           o   Calculate Silhouette Scores for each point:
Intra-cluster distance a(P1)
                       Point P1 = [1, 2]:
      Compute the distance from P1 to the other points in its own cluster (Cluster A):
      d(P1, P2) = √((1−1)² + (2−4)²) = 2
      d(P1, P3) = √((1−1)² + (2−0)²) = 2
        a(P1) = average distance from P1 to all other points in the same cluster
               a(P1) = (2 + 2)/2 = 2.0
Nearest-cluster distance b(P1)
Compute the distance from P1 to each point in the other cluster (Cluster B):
      d(P1, P4) = √((1−10)² + (2−2)²) = 9
      d(P1, P5) = √((1−10)² + (2−4)²) = √85 ≈ 9.22
      d(P1, P6) = √((1−10)² + (2−0)²) = √85 ≈ 9.22
               b(P1) = (9 + 9.22 + 9.22)/3 ≈ 9.15
Silhouette Score s(P1)
               s(P1) = (b(P1) − a(P1)) / max(a(P1), b(P1)) = (9.15 − 2.0)/9.15 ≈ 0.78
Repeating the same calculation for every point gives:
 Point        Cluster     a(i)     b(i)      s(i)
 P1 [1,2]     A           2.0      9.15      0.78
 P2 [1,4]     A           3.0      9.36      0.68
 P3 [1,0]     A           3.0      9.36      0.68
 P4 [10,2]    B           2.0      9.15      0.78
 P5 [10,4]    B           3.0      9.36      0.68
 P6 [10,0]    B           3.0      9.36      0.68
Average Silhouette Score:             (0.78 + 0.68 + 0.68 + 0.78 + 0.68 + 0.68 ) / 6 = 4.28 / 6
                                  = 0.71
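The hand calculation can be cross-checked with scikit-learn's silhouette_score (a sketch; K = 6 is omitted because the silhouette is not defined when every point is its own cluster):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 2))
# For K = 2 the score is about 0.71, matching the manual result above.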
    K       Average Silhouette Score
    2       0.71
    3       0.55
    4       0.4
    5       0.32
    6       0
Plotting these Silhouette Scores for K = 2 to K = 6, the highest score occurs at K = 2, confirming it
as the optimal number of clusters for this dataset.
          For K = 2, the average silhouette score is ~0.71, which is very good.
          It confirms that the two clusters are well separated and compact.
Introduction to K Medoids
           In clustering, a medoid is a data point within a cluster that minimizes the average
dissimilarity (or distance) to all other points in the same cluster. It's essentially the most central
object in a cluster, acting as a representative point for the cluster. Unlike K-means, which uses the
mean (average) of all points in a cluster as the cluster center, K-medoids uses the actual medoid.
This makes K-medoids more robust to outliers, as it's not influenced by extreme values.
          The K-Medoids algorithm (also called Partitioning Around Medoids, PAM) was proposed in 1987
by Kaufman and Rousseeuw. A medoid can be defined as a point in the cluster whose dissimilarity
with all the other points in the cluster is minimum.
       Definition:
       A medoid is the point within a cluster for which the sum of distances to all other points in
the cluster is minimum.
        The dissimilarity between the medoid (Ci) and an object (Pi) is calculated as E = |Pi − Ci|.
        The total cost in the K-Medoids algorithm is the sum of the dissimilarities of every point to
the medoid of its cluster:
               Cost = ∑ (over all clusters) ∑ (Pi in the cluster) |Pi − Ci|
Types of K-Medoids
    1. PAM (Partitioning Around Medoids) – original algorithm, good for small datasets.
       o Working:
          Select k initial medoids.
          Iteratively swap medoids with non-medoids to minimize cost.
          Stops when no improvement is possible.
    2. CLARA (Clustering LARge Applications) – sampling-based, faster for large datasets.
       o Working:
          Repeatedly take a random sample from the dataset.
          Apply PAM on each sample.
          Choose the clustering with the lowest total cost over the entire dataset.
    3. CLARANS (Clustering Large Applications based upon RANdomized Search) –
        random search in neighbour space.
       o Working:
          Randomly explore the neighbouring medoid configurations (not all).
          Accept new medoid if it reduces cost.
          Repeat for multiple iterations to avoid local optima.
    Distance Metrics: Can use Euclidean, Manhattan, or custom distance functions.
    Medoids: Must be actual data points, unlike centroids in K-Means.
    Convergence: Typically reached when medoids no longer change.
K Medoids Algorithm
Algorithm:
    1. Initialize: select k random points out of the n data points as the medoids.
    2. Associate each data point to the closest medoid by using any common distance metric
        methods.
    3. While the cost decreases: for each medoid m and for each non-medoid data point o:
                Swap m and o, associate each data point to the closest medoid, and recompute the
                 cost.
                If the total cost is more than that in the previous step, undo the swap.
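The steps above can be sketched as a simplified PAM-style implementation in NumPy (illustrative only, not an optimized library routine; Manhattan distance is used so that it matches the worked example that follows):

import numpy as np

def manhattan(a, b):
    # Manhattan (L1) distance between two points
    return np.abs(a - b).sum()

def total_cost(X, medoid_idx):
    # Distance of every point to every medoid, shape (n_points, k)
    dists = np.array([[manhattan(x, X[m]) for m in medoid_idx] for x in X])
    # Cost = sum of each point's distance to its nearest medoid
    return dists.min(axis=1).sum(), dists.argmin(axis=1)

def k_medoids(X, k=2, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(X), k, replace=False))    # step 1: random initial medoids
    cost, labels = total_cost(X, medoids)                    # step 2: initial assignment
    for _ in range(max_iter):                                # step 3: swap while cost decreases
        improved = False
        for i in range(k):
            for o in range(len(X)):
                if o in medoids:
                    continue
                candidate = medoids.copy()
                candidate[i] = o                             # try swapping medoid i with point o
                new_cost, new_labels = total_cost(X, candidate)
                if new_cost < cost:                          # keep the swap only if it lowers the cost
                    medoids, cost, labels = candidate, new_cost, new_labels
                    improved = True
        if not improved:
            break
    return X[medoids], labels, cost

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
medoids, labels, cost = k_medoids(X, k=2)
print(medoids)          # expected: the points [1, 2] and [10, 2] (in some order)
print(labels, cost)     # total cost 8 for this dataset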
Find the medoids using the K-Medoids algorithm for the dataset of points [1,2], [1,4], [1,0], [10,2],
[10,4], [10,0], using the Manhattan distance.
We'll solve for K = 2, using the Manhattan distance:
Manhattan Distance = |x1 − x2| + |y1 − y2|,   where | · | denotes the absolute value.
    1. Choose Initial Medoids
       Let's pick: Medoid A = P1 = [1,2]            Medoid B = P4 = [10,2]
    2. Assign Each Point to Nearest Medoid (Using Manhattan Distance)
       Points P1, P2, P3 are closest to medoid A = [1,2]; points P4, P5, P6 are closest to medoid B = [10,2].
    3. Check whether any point in Cluster A would make a better medoid, by computing the total
       Manhattan distance within the cluster for each candidate:

                 Point        d(·, P1=[1,2])            d(·, P2=[1,4])            d(·, P3=[1,0])
                 P1 [1,2]     0                         |1−1| + |2−4| = 2         |1−1| + |2−0| = 2
                 P2 [1,4]     |1−1| + |4−2| = 2         0                         |1−1| + |4−0| = 4
                 P3 [1,0]     |1−1| + |0−2| = 2         |1−1| + |0−4| = 4         0
                 Total (sum)  2 + 2 = 4                 2 + 4 = 6                 2 + 4 = 6

       The smallest total (4) belongs to P1, so the medoid for Cluster A remains P1 = [1,2].
New Clusters
      Cluster A: P1, P2, P3
       Similarly, for Cluster B, compute the total Manhattan distance for each candidate medoid:

                 Point        d(·, P4=[10,2])             d(·, P5=[10,4])             d(·, P6=[10,0])
                 P4 [10,2]    0                           |10−10| + |2−4| = 2         |10−10| + |2−0| = 2
                 P5 [10,4]    |10−10| + |4−2| = 2         0                           |10−10| + |4−0| = 4
                 P6 [10,0]    |10−10| + |0−2| = 2         |10−10| + |0−4| = 4         0
                 Total (sum)  2 + 2 = 4                   2 + 4 = 6                   2 + 4 = 6
The new medoid for Cluster B is P4 = [10,2]
New Clusters
          Cluster B: P4, P5, P6
Since the medoids didn’t change, the algorithm converged.
Final Clustering (K = 2, Manhattan Distance)
 Cluster     Points                        Medoid
 A           [1,2], [1,4], [1,0]           [1,2]
 B           [10,2], [10,4], [10,0]        [10,2]
Introduction to Hierarchical Clustering
           Hierarchical clustering is an unsupervised machine learning algorithm that groups data into
a tree of nested clusters. The main types include agglomerative and divisive. Hierarchical cluster
analysis helps find patterns and connections in datasets. Results are presented in a dendrogram
diagram showing the distance relationships between clusters.
           Clustering is an unsupervised machine learning technique used in data analysis to detect
and group similar objects. Hierarchical cluster analysis (HCA), or hierarchical clustering, groups
objects into a cluster hierarchy without enforcing a linear order within them. Many disciplines,
such as biology, image analysis and the social sciences, use hierarchical clustering methods to
explore and recognize patterns in datasets. Use cases include categorizing populations in clinical
research, customer segmentation and detecting communities of nodes in network models.
There are two types of hierarchical clustering:
1.       Agglomerative or bottom-up approach that repeatedly merges clusters into larger ones
     until a single cluster emerges.
2. Divisive or top-down approach that starts with all data in a single cluster and continues to
     split out successive clusters until all clusters are singletons.
    Hierarchical clustering analysis has high computational costs. While using a heap can reduce
computation time, memory requirements are increased. Both the divisive and agglomerative types
of clustering are "greedy," meaning that the algorithm decides which clusters to merge or split by
making the locally optimal choice at each stage of the process. It is also possible to apply a stop
criterion, where the algorithm stops agglomeration or splitting clusters when it reaches a
predetermined number of clusters.
     In hierarchical clustering, the aim is to produce a hierarchical series of nested clusters. A
diagram called a dendrogram (a tree-like diagram that records the sequence of merges or splits)
graphically represents this hierarchy; it is an inverted tree that is often used to visualize the
hierarchy of clusters. It displays the order in which clusters have been merged or divided and
shows the similarity or distance between data points. A dendrogram can also be understood as a
nested list of lists with attributes.
                                            Fig: Dendrogram
                                  Fig: Hierarchical Clustering approach
Distance (Similarity) Metrics in Hierarchical Clustering
       In hierarchical clustering, we need to measure how similar or dissimilar two data points (or
clusters) are. This is done using distance metrics, which play a crucial role in determining which
clusters should be merged (or split) during the algorithm.
Here are some commonly used metrics:
a) Euclidean Distance (Most Common)
      It is the straight-line distance between two points in Euclidean space.
      Formula between two points A = (x1, y1) and B = (x2, y2):
               d(A, B) = √((x2 − x1)² + (y2 − y1)²)
b) Manhattan Distance (Taxicab or L1 Distance)
      It is the sum of absolute differences between the coordinates of the points.
      Formula:
               d(A, B) = |x2 − x1| + |y2 − y1|
c) Cosine Similarity / Cosine Distance
      Measures the angle between two vectors (not the distance).
      Often used in text clustering and high-dimensional spaces.
      Cosine similarity is:
               cos(θ) = (A · B) / (‖A‖ ‖B‖),   and cosine distance = 1 − cos(θ)
Linkage Criteria in Hierarchical Clustering
        Once we decide how to measure the distance between individual data points, we also need to
decide how to measure the distance between entire clusters during the clustering process.
This is where linkage criteria come in: they define how to compute the distance between two
clusters based on the distances between the points inside them.
Linkage Types
a) Single Linkage (Minimum Linkage)
      Distance between two closest points (one from each cluster).
      Formula:
       D(A,B)=min{d(a,b) ∣ a∈A,b∈B}
      Tends to form long, "chain-like" clusters.
      Can lead to “chaining effect”, which may not be ideal for compact clusters.
b) Complete Linkage (Maximum Linkage)
      Distance between two farthest points (one from each cluster).
      Formula:
       D(A,B)=max{d(a,b) ∣ a∈A,b∈B}
      Produces more compact, spherical clusters.
      Less sensitive to outliers than single linkage.
c) Average Linkage (UPGMA – Unweighted Pair Group Method with Arithmetic Mean)
      Takes the average of all pairwise distances between points from the two clusters.
      Formula:
        D(A,B) = (1 / (|A|·|B|)) ∑ a∈A ∑ b∈B d(a,b)
      Balanced approach between single and complete linkage.
      Tends to produce moderate-sized clusters.
d) Centroid Linkage
      Computes the distance between the centroids (mean vectors) of the clusters.
      Formula:
       D(A,B)=d(μA,μB)
       where μA and μB are centroids of clusters A and B.
      May cause inversions in the dendrogram (not always monotonic).
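These linkage criteria can be tried directly with SciPy's hierarchical clustering routines (a minimal sketch on the same six points; SciPy and matplotlib are assumed to be available, and 'single', 'complete', 'average' and 'centroid' correspond to the linkage types described above):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

# Agglomerative clustering with a chosen linkage criterion (Euclidean distance by default)
Z = linkage(X, method='average')    # try 'single', 'complete' or 'centroid' as well
dendrogram(Z, labels=['P1', 'P2', 'P3', 'P4', 'P5', 'P6'])
plt.title('Dendrogram (average linkage)')
plt.ylabel('Distance')
plt.show()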
Bottom up/Agglomerative Approach
    Initially, consider every data point as an individual cluster and, at every step, merge the nearest
pair of clusters (it is a bottom-up method). At first, every data point is considered an individual
entity or cluster. At every iteration, the clusters merge with other clusters until one cluster is
formed.
    The algorithm for Agglomerative Hierarchical Clustering is:
   1.   Calculate the similarity of one cluster with all the other clusters
                (Calculate proximity matrix)
   2.   Consider every data point as an individual cluster
   3.   Merge the clusters which are highly similar or close to each other.
   4.   Recalculate the proximity matrix for each cluster
   5.   Repeat Steps 3 and 4 until only a single cluster remains.
Top down/Divisive Approach
    Divisive clustering is also known as the top-down approach. This algorithm also does not
require pre-specifying the number of clusters. Divisive clustering works just the opposite of
agglomerative clustering. It starts by considering all the data points into a big single cluster and
later on splitting them into smaller heterogeneous clusters continuously until all data points are in
their own cluster. Thus, they are good at identifying large clusters. It follows a top-down approach
and is more efficient than agglomerative clustering. But, due to its complexity in implementation,
it doesn’t have any predefined implementation in any of the major machine learning frameworks.
    The algorithm for Divisive Hierarchical Clustering is (a code sketch follows below):
           1. Consider all the data points as a single cluster.
           2. Split a cluster into two using any flat-clustering method, say k-means.
           3. Choose the best cluster among the current clusters to split further; choose the one that
              has the largest Sum of Squared Error (SSE).
           4. Repeat steps 2 and 3 until every data point is in its own cluster, or until the desired
              number of clusters is reached.
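Here is a rough sketch of that idea, using scikit-learn's KMeans as the flat splitter and the SSE of each cluster to decide which one to split next (an illustrative bisecting scheme under these assumptions, not a full divisive implementation):

import numpy as np
from sklearn.cluster import KMeans

def sse(cluster):
    # Sum of squared errors of a cluster around its own mean
    return ((cluster - cluster.mean(axis=0)) ** 2).sum()

def divisive(X, n_clusters=3):
    clusters = [X]                                    # start with all points in one big cluster
    while len(clusters) < n_clusters:
        # choose the cluster with the largest SSE to split further
        idx = int(np.argmax([sse(c) for c in clusters]))
        target = clusters.pop(idx)
        labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(target)
        clusters.append(target[labels == 0])
        clusters.append(target[labels == 1])
    return clusters

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
for i, c in enumerate(divisive(X, n_clusters=3)):
    print(f"Cluster {i}: {c.tolist()}")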
Principal Component Analysis PCA
        Principal component analysis, or PCA, reduces the number of dimensions in large datasets
to principal components that retain most of the original information. It does this by transforming
potentially correlated variables into a smaller set of variables, called principal components.
        Karl Pearson is credited with the development of PCA in 1901
        PCA is commonly used for data pre-processing for use with machine learning algorithms.
It can extract the most informative features from large datasets while preserving the most relevant
information from the initial dataset. This reduces model complexity as the addition of each new
feature negatively impacts model performance, which is also commonly referred to as the “curse
of dimensionality.”
        PCA creates new variables, called principal components, that are linear combinations of
the original variables. PCA takes a dataset with multiple variables as input and projects it into a
lower-dimensional subspace, that is, it produces a reduced dataset with fewer variables.
       K-means is a clustering algorithm that assigns data points to clusters based on their distance
from the cluster centers. It takes a dataset with one or more variables as input, and it produces a
set of clusters with similar data points.
Intuition behind PCA
       Principal component analysis (PCA) is a dimensionality reduction and machine learning
method used to simplify a large data set into a smaller set while still maintaining significant
patterns and trends.
       Principal component analysis can be broken down into five steps.
       1. Standardize the range of continuous initial variables
       2. Compute the covariance matrix to identify correlations
       3. Compute the eigenvectors and eigenvalues of the covariance matrix to identify the
           principal components
       4. Create a feature vector to decide which principal components to keep
       5. Recast the data along the principal components axes
How Principal Component Analysis (PCA) Works
       Principal Component Analysis (PCA) is a statistical technique used to reduce the
dimensionality of large datasets by transforming them into a smaller set of uncorrelated variables
called principal components. These components are linear combinations of the original variables
and are chosen to maximize variance, capturing the most significant patterns in the data.
       PCA relies on concepts from linear algebra and matrix operations. It transforms the original
data into a new coordinate system defined by the principal components. This transformation is
based on the eigenvectors and eigenvalues of the data's covariance matrix. The eigenvectors
indicate the directions of maximum variance (i.e., the new axes), while the eigenvalues reflect the
magnitude of variance in those directions—essentially measuring how important each direction is.
       To visualize this, imagine a multi-dimensional scatterplot. In this plot, the eigenvectors
show the directions in which the data varies the most, and the eigenvalues tell us how much
variance exists along those directions. A larger eigenvalue means its associated eigenvector (or
principal component) captures more information. Thus, principal components are the eigenvectors
of the covariance matrix that correspond to the largest eigenvalues.
In PCA, the first two principal components are especially important:
First Principal Component (PC1)
        PC1 is the direction in the data space along which the variance is greatest. It represents the
line that best captures the overall spread or structure of the dataset. Since it contains the most
variance, PC1 retains the maximum amount of information from the original data. No other
component can capture more variance than PC1.
Second Principal Component (PC2)
        PC2 is calculated similarly to PC1 but with a critical constraint: it must be uncorrelated
with PC1. In mathematical terms, PC2 must be orthogonal (perpendicular) to PC1. It captures the
second-highest amount of variance in the dataset after PC1. The lack of correlation between PC1
and PC2 ensures that each component contributes unique information.
When PCA is applied, a scatterplot of PC1 versus PC2 is often used to visualize the data in two
dimensions. Since PC1 and PC2 are orthogonal, the axes in the plot are at right angles, helping to
reveal structure, patterns, or clusters in the original high-dimensional data.
Applying PCA to 2D data
        PCA (Principal Component Analysis) is a dimensionality reduction technique that can be
applied to data in order to find a lower-dimensional representation that captures most of the
variability in the data.
        When working with 2D data, PCA can be applied to identify the directions of maximum
variance in the data and to project the data onto these directions. In this way PCA can convert 3D
data into a 2D representation, or even 2D data into a one-dimensional representation. A common
goal of principal component analysis is to visualize data in a lower-dimensional space and to
identify patterns or clusters in the data.
Here are the steps to apply PCA to 2D data:
1. Collect the 2D data that you want to analyze. For example, this data could represent the heights
and weights of a group of people, or the scores on two different tests for a group of students.
2. Calculate the mean of each dimension (column) in the data and subtract it from the data. This
centers the data, which is required for the covariance computation in the next step.
3. Calculate the covariance matrix of the data. The covariance matrix will give you information
about the relationships between the two dimensions. The covariance matrix is calculated as
follows:
                           covariance_matrix = 1/n X.T @ X
        where X is the mean-centered data matrix (with n rows and 2 columns), and @ represents
matrix multiplication.
4. Calculate the eigenvectors and eigenvalues of the covariance matrix. The eigenvectors represent
the directions of maximum variance in the data, and the corresponding eigenvalues represent the
amount of variance that is explained by each eigenvector.
5. Choose the top k eigenvectors that explain the most variance in the data. This will depend on
your specific data and analysis goals. For example, you might choose the top 1 or 2 eigenvectors
to create a 1D or 2D representation of the data.
6. Project the data onto the selected eigenvectors. This can be done by multiplying the data matrix
by the matrix of selected eigenvectors:
                             projected_data = X @ selected_eigenvectors
        where X is the original data matrix (with n rows and 2 columns), and selected_eigenvectors
is the matrix of selected eigenvectors (with 2 rows and k columns).
7. Visualize the projected data in the lower-dimensional space. This can be done using a scatter
plot or other visualization technique.
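The seven steps above can be sketched in a few lines of NumPy (illustrative only; the small 2-D dataset is made up for the example):

import numpy as np

# 1. A small 2-D dataset (e.g. two correlated measurements)
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
              [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1]])

# 2. Center the data by subtracting the mean of each column
X_centered = X - X.mean(axis=0)

# 3. Covariance matrix of the centered data
n = X_centered.shape[0]
cov = (X_centered.T @ X_centered) / n

# 4. Eigenvectors and eigenvalues of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)       # eigh: suitable for symmetric matrices

# 5. Sort by decreasing eigenvalue and keep the top k eigenvectors
order = np.argsort(eigvals)[::-1]
k = 1
selected = eigvecs[:, order[:k]]             # shape (2, k)

# 6. Project the centered data onto the selected eigenvectors
projected = X_centered @ selected            # shape (n, k)

# 7. Inspect the result: variance explained by each component and the 1-D projection
print(eigvals[order] / eigvals.sum())
print(projected.ravel())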
Applying PCA on 3D data
        PCA (Principal Component Analysis) is a dimensionality reduction technique that can be
applied to data in order to find a lower-dimensional representation that captures most of the
variability in the data.
        When working with 3D data, PCA can be applied to identify the directions of maximum
variance in the data and project the data onto these directions. This can be useful for visualizing
the data in a lower-dimensional space or for identifying patterns or clusters in the data.
        Here are the Steps to Apply PCA to 3D Data:
1. Collect the 3D data that you want to analyze. For example, this data could represent the heights,
weights, and ages of a group of people, or the coordinates of points in a 3D space.
2. Calculate the mean of each dimension (column) in the data and subtract it from the data. This
centers the data, which is required for the covariance computation in the next step.
3. Calculate the covariance matrix of the data. The covariance matrix will give you information
about the relationships between the three dimensions. The covariance matrix is calculated as
follows:
                  covariance_matrix = 1/n * X.T @ X
                  where X is the mean-centered data matrix (with n rows and 3 columns), and @ represents
        matrix multiplication.
4. Calculate the eigenvectors and eigenvalues of the covariance matrix. The eigenvectors represent
the directions of maximum variance in the data, and the corresponding eigenvalues represent the
amount of variance that is explained by each eigenvector.
5. Choose the top k eigenvectors that explain the most variance in the data. This will depend on
your specific data and analysis goals. For example, you might choose the top 1 or 2 eigenvectors
to create a 1D or 2D representation of the data.
6. Project the data onto the selected eigenvectors. This can be done by multiplying the data matrix
by the matrix of selected eigenvectors:
               projected_data = X @ selected_eigenvectors
where X is the original data matrix (with n rows and 3 columns), and selected_eigenvectors is the
matrix of selected eigenvectors (with 3 rows and k columns).
7. Visualize the projected data in the lower-dimensional space. This can be done using a scatter
plot or other visualization technique.
       Note that PCA assumes that the data is normally distributed and that the variables are
linearly related. If these assumptions do not hold, PCA may not be an appropriate technique to
use. Additionally, when working with 3D data, it can be more difficult to visualize the results of
PCA, especially when projecting the data onto more than one dimension.
Math behind PCA
        PCA can be thought of as an unsupervised learning problem. The whole process of
obtaining principal components from a raw dataset can be simplified into six parts:
1. Take the whole dataset consisting of d+1 dimensions and ignore the labels such that our new
dataset becomes d dimensional.
2. Compute the mean for every dimension of the whole dataset.
3. Compute the covariance matrix of the whole dataset.
4. Compute eigenvectors and the corresponding eigenvalues.
5. Sort the eigenvectors by decreasing eigenvalues and choose k eigenvectors with the largest
eigenvalues to form a d x k dimensional matrix W.
6. Use this d x k eigenvector matrix to transform the samples onto the new subspace.
Finding Optimal Number of Features
       Finding the optimal number of features in a dataset is an important task in data analysis
and machine learning. The optimal number of features is the number that provides the best balance
between model complexity and predictive accuracy.
       One popular method for finding the optimal number of features is to use PCA (Principal
Component Analysis). PCA can be used to reduce the dimensionality of the data by finding the
directions of maximum variance, called principal components, and projecting the data onto these
components.
       Here are the steps for finding the optimal number of features using PCA:
1. Standardize the Data: Standardize the data by subtracting the mean and dividing by the standard
deviation for each variable. This step is necessary to ensure that each variable has the same scale
and importance in the analysis.
2. Compute the Covariance Matrix: Compute the covariance matrix of the standardized data using
the formula: Covariance matrix = 1/n * X.T @ X
       where X is the standardized data matrix with n observations and p variables.
3. Calculate the Eigenvectors and Eigenvalues: Calculate the eigenvectors and eigenvalues of the
covariance matrix using the eigen decomposition method.
4. Sort the Eigenvalues: Sort the eigenvalues in descending order and calculate the cumulative sum
of the eigenvalues.
5. Determine the Number of Principal Components: Choose the smallest number of principal
components that explains at least 90% of the total variance in the data, by examining the
cumulative sum of the eigenvalues.
6. Project the Data onto the Selected Principal Components: Project the data onto the selected
principal components using the formula:
       Projected data = X @ selected_eigenvectors
       where X is the standardized data matrix and selected_eigenvectors is the matrix of selected
eigenvectors.
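As a rough illustration, the sketch below applies these steps with scikit-learn on an assumed
synthetic dataset and picks the smallest number of components whose cumulative explained variance
reaches 90%.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                   # hypothetical data: 200 observations, 10 variables
X_std = StandardScaler().fit_transform(X)        # step 1: standardize each variable

pca = PCA().fit(X_std)                           # steps 2-4: covariance, eigenvectors, sorted eigenvalues
cum_var = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.argmax(cum_var >= 0.90)) + 1   # step 5: smallest k explaining >= 90% variance
print(n_components, cum_var[n_components - 1])

projected_data = PCA(n_components=n_components).fit_transform(X_std)   # step 6: project the data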
Magic behind PCA
       PCA (Principal Component Analysis) is a powerful technique used in data analysis,
machine learning, and computer vision. The magic behind PCA lies in its ability to reduce the
dimensionality of a dataset while retaining the most important information in the data.
       PCA achieves this by identifying the directions of maximum variance, called principal
components, and projecting the data onto these components. The principal components represent
linear combinations of the original variables in the data and are orthogonal to each other. The first
principal component explains the largest amount of variance in the data, the second principal
component explains the second largest amount of variance, and so on.
PCA is useful for Several Reasons:
       Dimensionality Reduction: PCA can reduce the number of variables in a dataset without
losing important information. This can help simplify the data and make it easier to analyze.
Feature Extraction: PCA can be used to extract the most important features from a dataset. This
can be useful for tasks such as image recognition, where the most important features may be
difficult to identify by hand.
Noise Reduction: PCA can help reduce the impact of noise in a dataset. By focusing on the
directions of maximum variance, PCA can filter out noise that is present in other dimensions.
Visualization: PCA can be used to visualize high-dimensional data in two or three dimensions.
This can help identify patterns and relationships in the data that may not be apparent in higher
dimensions.
       PCA is based on linear algebra and uses matrix operations to calculate the principal
components of a dataset. It is a powerful tool that can be used for a wide range of data analysis
and modelling tasks.
Dimensionality reduction
       Dimensionality reduction is the process of reducing the number of variables or features in
a dataset. The goal of dimensionality reduction is to simplify the data while retaining the most
important information, which can help improve analysis and modelling performance.
There are Two Main Types of Dimensionality Reduction:
1. Feature Selection: Feature selection involves selecting a subset of the original features or
variables to use in the analysis. This can be done based on various criteria, such as correlation with
the target variable, importance in a model, or domain knowledge. Feature selection is a
straightforward approach but may not be effective if the selected features do not capture the most
important information in the data.
2. Feature Extraction: Feature extraction involves transforming the original features or variables
into a new set of features that capture the most important information in the data. Principal
Component Analysis (PCA) is a common feature extraction technique that identifies the directions
of maximum variance in the data and projects the data onto these directions. Other feature
extraction techniques include Linear Discriminant Analysis (LDA), t-Distributed Stochastic
Neighbor Embedding (t-SNE), and Non-negative Matrix Factorization (NMF).
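The short sketch below contrasts the two approaches on an assumed toy dataset: SelectKBest keeps a
subset of the original columns (feature selection), while PCA builds new combined features (feature
extraction).
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

X_selected = SelectKBest(f_classif, k=3).fit_transform(X, y)    # feature selection: keep 3 original columns
X_extracted = PCA(n_components=3).fit_transform(X)              # feature extraction: 3 new combined features
print(X_selected.shape, X_extracted.shape)                      # (200, 3) (200, 3)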
The Advantages of Dimensionality Reduction Include:
Improved Analysis: By reducing the number of variables, dimensionality reduction can make it
easier to analyze and understand the data. This can help identify patterns and relationships that
may not be apparent in high-dimensional data.
Improved Model Performance: High-dimensional data can lead to overfitting in models, which
can lead to poor generalization performance. By reducing the number of variables, dimensionality
reduction can help improve the performance of models by reducing overfitting.
Reduced Computational Complexity: High-dimensional data can be computationally expensive
to analyze and model. By reducing the number of variables, dimensionality reduction can help
reduce computational complexity and speed up analysis and modeling.
However, it is important to note that dimensionality reduction can also have some drawbacks, such
as loss of information and reduced interpretability. Therefore, it is important to carefully consider
the benefits and drawbacks of dimensionality reduction for each specific dataset and analysis task.
PCA - PCA on Images
        PCA (Principal Component Analysis) on images involves reducing the dimensionality of
image data while preserving important features. This is achieved by transforming the data into a
smaller set of principal components that capture the most significant variations in the image.
PCA Intuition on Images
Each image is treated as a vector in high-dimensional space.
Example: A 64x64 image → flattened into a 4096-element vector.
PCA finds new basis vectors (called eigenfaces in face recognition):
The first principal component (PC1) captures the most variation.
The second principal component (PC2) captures the second-most variation, orthogonal to PC1.
By projecting images onto top K principal components, we get a compressed version.
Steps to Apply PCA on Images
Step 1: Load Images
       Each image → convert to grayscale (optional but common).
       Flatten image matrix (2D) into a vector (1D).
Step 2: Stack into Data Matrix
       Stack all image vectors into a matrix X of shape (n_samples, n_features).
            o   n_samples    = number of images.
            o   n_features   = number of pixels per image.
Step 3: Standardize Data
       Subtract mean and (optionally) divide by standard deviation.
            o   Centering is essential.
            o   Formula: X_centered = X − mean(X)
Step 4: Compute Covariance Matrix
Step 5: Compute Eigenvectors and Eigenvalues
       Eigenvectors = Principal Components (PCs).
         Eigenvalues = how much variance each PC explains.
Step 6: Sort and Select Top K Components
         Sort eigenvectors based on eigenvalues (descending order).
         Select top K components to reduce dimensions.
Step 7: Project Images
         Multiply data by selected principal components to get compressed representation:
Projected Data = X_centered × W, where W contains the top K eigenvectors.
Applications
 Application          PCA Role
 Face Recognition     Eigenfaces represent faces compactly
 Image Compression    Store fewer components, reconstruct image
 Noise Filtering      Discard components with tiny variance
 Object Detection     Extract features before classification
Python Code
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import fetch_olivetti_faces

# Load dataset
data = fetch_olivetti_faces()
images = data.images
X = data.data  # flattened images

# Apply PCA
pca = PCA(n_components=50)
X_pca = pca.fit_transform(X)

# Reconstruct image
X_reconstructed = pca.inverse_transform(X_pca)

# Show original vs reconstructed
plt.subplot(1, 2, 1)
plt.imshow(images[0], cmap='gray')
plt.title('Original')
plt.subplot(1, 2, 2)
plt.imshow(X_reconstructed[0].reshape(64, 64), cmap='gray')
plt.title('Reconstructed')
plt.show()
PCA on Olivetti Images
          The Olivetti Faces Dataset, also known as the AT&T Face database, is a classic dataset
often used to demonstrate and evaluate face recognition and dimensionality reduction techniques
like PCA. Here's a breakdown of applying PCA to this dataset:
1. Understanding the Olivetti Faces Dataset
Source: Originally collected by the Olivetti Research Laboratory between April 1992 and April
1994. Now hosted by AT&T Laboratories Cambridge.
Content: Contains 400 grayscale images of 40 distinct subjects.
Image Characteristics:
Size: 64x64 pixels.
Gray levels: 256.
Pose: Subjects are generally in an upright, frontal position with some tolerance for slight
movements.
Variations: For some subjects, images include variations in lighting, facial expressions
(smiling/not smiling, open/closed eyes), and the presence or absence of glasses.
Background: Dark and homogeneous.
Organization: The dataset is typically organized into 40 directories, one for each subject, with 10
images per subject.
Purpose: Ideal for experimenting with face recognition algorithms due to its manageable size and
controlled variations. It's also well-suited for demonstrating unsupervised learning techniques like
PCA.
2. Applying PCA to the Olivetti Faces Dataset
The process of applying PCA to the Olivetti Faces dataset follows the general steps outlined in the
previous teaching notes, with specific considerations for image data:
Data Preparation: Loading the Dataset: Many machine learning libraries (like scikit-learn in
Python) provide convenient functions to download and load the Olivetti Faces dataset.
Flattening Images: Each 64x64 image is flattened into a 1D vector of 4096 pixel values.
Creating the Data Matrix: The dataset is represented as a matrix of size 400 (number of images)
x 4096 (number of features/pixels).
Mean Centering: Calculate the average face across all 400 images (pixel-wise mean). This results
in a 64x64 average image. Subtract this average face vector from each individual image vector in
the data matrix. This centers the data around the origin.
Calculating the Covariance Matrix (or using SVD): Directly calculating the 4096x4096
covariance matrix can be computationally expensive. A more efficient approach, especially for
high-dimensional data, is to use Singular Value Decomposition (SVD). SVD achieves the same
goal of finding principal components without explicitly calculating the covariance matrix.
Computing Eigenvectors (Eigenfaces) and Eigenvalues: The eigenvectors obtained from the
covariance matrix (or SVD) are 4096-dimensional vectors. When reshaped back into 64x64
images, these eigenvectors are called eigenfaces. Eigenvalues represent the amount of variance
captured by each corresponding eigenface. The eigenfaces associated with larger eigenvalues
capture more significant variations in the dataset. Intuition of Eigenfaces: The first few eigenfaces
often represent general facial features like overall shape, lighting direction, etc. Subsequent
eigenfaces capture more subtle variations like specific expressions or details.
Selecting Principal Components: Sort the eigenfaces by their corresponding eigenvalues in
descending order. Determine the number of principal components (k) to keep based on the desired
level of variance explained. For example, you might aim to retain 90% or 95% of the total variance.
Alternatively, you can visually inspect a scree plot of the eigenvalues to find an "elbow" point.
Projecting the Images: Create a projection matrix using the top k eigenfaces (reshaped back into
vectors) as columns. Multiply the mean-centered image data matrix by this projection matrix to
obtain a lower-dimensional representation of each face (a vector of k principal component scores).
3. Benefits of Applying PCA to Olivetti Images
Dimensionality Reduction for Face Recognition:
Instead of working with 4096 pixel values, each face can be represented by a much smaller
number of principal component scores (e.g., 100-200), significantly reducing the dimensionality
while retaining most of the identity-related information. This speeds up training and classification
in face recognition tasks.
Feature Extraction (Eigenfaces as Features):
The eigenfaces themselves can be considered as a set of basis features that capture the fundamental
variations in human faces present in the dataset.
Visualization:
The top few eigenfaces can be visualized to understand the most significant sources of variation
in the dataset (e.g., lighting, pose). Projecting the 400 images onto the first two or three principal
components allows for a 2D or 3D visualization of the face space, potentially revealing clusters
based on identity or other factors.
Noise Reduction:
By discarding lower-variance principal components, PCA can help to filter out some of the noise
or less important variations in the images, potentially leading to more robust face recognition
models.
Why Apply PCA on Olivetti Faces?
 Purpose              Benefit
 Compression          Store images using fewer dimensions (e.g., 50 instead of 4096)
 Noise Reduction      Ignore components with low variance (likely noise)
 Feature Extraction   Use principal components (eigenfaces) as features for ML models
 Face Recognition     Classify faces by comparing in reduced PCA space
Python Code Example: PCA on Olivetti Faces
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import fetch_olivetti_faces

# Step 1: Load dataset
data = fetch_olivetti_faces()
images = data.images  # Shape: (400, 64, 64)
X = data.data         # Shape: (400, 4096), flattened

# Step 2: Apply PCA
pca = PCA(n_components=50)  # Reduce from 4096 to 50 dimensions
X_pca = pca.fit_transform(X)

# Step 3: Reconstruct the images
X_reconstructed = pca.inverse_transform(X_pca)

# Step 4: Show original and reconstructed image
plt.figure(figsize=(8, 4))

# Original
plt.subplot(1, 2, 1)
plt.imshow(images[0], cmap='gray')
plt.title('Original')
plt.axis('off')

# Reconstructed
plt.subplot(1, 2, 2)
plt.imshow(X_reconstructed[0].reshape(64, 64), cmap='gray')
plt.title('Reconstructed (50 PCs)')
plt.axis('off')
plt.show()
        Applying PCA to the Olivetti Faces dataset is a common and insightful exercise in understanding
dimensionality reduction and feature extraction for image data. It demonstrates how a high-dimensional
representation of faces can be compressed into a lower-dimensional space using eigenfaces, while still
retaining significant information for tasks like face recognition. The Olivetti dataset provides a well-
controlled environment to observe the effects of PCA on real-world image data.
Reproducing Images
        Principal Component Analysis (PCA) is not only a dimensionality reduction technique but
also a powerful method for image compression and reconstruction. When PCA is applied to image
data, it identifies the most important patterns (principal components) that capture the largest
variance in pixel intensities across images. Reproducing images with PCA involves two main
steps: compression and reconstruction.
        In the compression phase, the original high-dimensional image data (for example, a 64×64
image with 4096 pixels) is projected onto a smaller set of dimensions defined by the selected
principal components. This reduced representation retains only the most significant features of the
image while discarding less important variations or noise. The number of components chosen
directly affects how much detail is preserved; fewer components mean greater compression but
more loss of detail.
        Reconstruction, or reproduction, is the reverse process where the compressed data is
transformed back to the original space. This is done by multiplying the compressed features with
the transpose of the principal components and then adding back the mean image that was
subtracted during the PCA process. The result is an approximation of the original image. If only a
small number of principal components are used, the reconstructed image appears blurry or
simplified but still captures the general shape and structure. As more components are included, the
reproduced image becomes sharper and more accurate, closely resembling the original.
        An important aspect of PCA-based reproduction is the trade-off between compression and
quality. Using very few components, such as 10 or 20, can significantly reduce the storage size of
an image but at the cost of detail and clarity. On the other hand, using a larger number like 100 or
200 components results in a higher-quality reproduction with minimal visible loss. In fact, when
all components are used, PCA perfectly reconstructs the original image without any loss.
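A brief sketch of this trade-off, reusing the Olivetti faces from the earlier examples, is given below;
the component counts 10, 50 and 200 are chosen only for illustration.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import fetch_olivetti_faces

X = fetch_olivetti_faces().data                  # (400, 4096) flattened 64x64 face images

plt.figure(figsize=(9, 3))
for i, k in enumerate([10, 50, 200]):            # fewer components -> more compression, blurrier result
    pca = PCA(n_components=k).fit(X)
    X_rec = pca.inverse_transform(pca.transform(X))
    plt.subplot(1, 3, i + 1)
    plt.imshow(X_rec[0].reshape(64, 64), cmap='gray')
    plt.title(f'{k} components')
    plt.axis('off')
plt.show()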
        This process of reproducing images using PCA is especially useful in face recognition
tasks, where "eigenfaces" (the principal components of face images) are used to represent and
reconstruct faces efficiently. The technique not only demonstrates the power of dimensionality
reduction but also provides visual intuition about how much of an image’s information is captured
by its main components.
Eigenfaces
        In the context of face recognition, each face image is represented as a high-dimensional
vector. For instance, a 64x64 pixel image can be viewed as a 4096-dimensional vector. The goal
is to reduce this high-dimensional data to a lower-dimensional space that captures the most
significant features of the faces.
What are Eigenfaces?
       Eigenfaces are a set of eigenvectors derived from the covariance matrix of facial images, primarily
        used in the field of computer vision for human face recognition. The concept was introduced by
        Sirovich and Kirby in 1987 and later expanded by Turk and Pentland in 1991.
       Principal Components of Face Images: Eigenfaces are essentially the principal
        components (the eigenvectors of the covariance matrix) of a set of face images.
       "Ghostly" Faces: When these eigenvectors, which are initially long vectors of pixel
        values, are reshaped back into a 2D image format, they often appear as ghostly, grayscale
        patterns. These patterns highlight the directions of the most significant variance across the
        training set of faces.
      Basis Set: Eigenfaces form a basis set that can be used to represent any face from the
       original training set (and even new faces to some extent) as a weighted combination of
       these eigenfaces.
How are Eigenfaces Obtained?
The process of obtaining eigenfaces involves applying PCA to a dataset of face images:
   1. Prepare the Training Data:
           o   Collect a set of face images.
           o   Ensure the images are aligned (e.g., eyes at roughly the same position) and resized
               to a consistent size.
           o   Convert the images to grayscale.
           o   Flatten each image into a long 1D vector by concatenating the rows (or columns)
               of pixel intensities.
           o   Arrange these image vectors as columns (or rows) in a data matrix.
   2. Calculate the Mean Face:
           o   Compute the average of all the face images in the training set, resulting in a mean
               face image (also a vector).
   3. Subtract the Mean:
           o   Subtract the mean face vector from each individual face image vector. This centers
               the data.
   4. Calculate the Covariance Matrix:
           o   Compute the covariance matrix of the mean-centered image vectors. For a large
               number of high-dimensional images, this matrix can be very large.
   5. Compute Eigenvectors and Eigenvalues:
           o   Find the eigenvectors and eigenvalues of the covariance matrix.
           o   Each eigenvector has the same dimensionality as the flattened face images and can
               be reshaped into an "eigenface."
           o   Eigenvalues indicate the amount of variance in the dataset captured by each
               corresponding eigenvector (eigenface).
   6. Select Principal Components (Eigenfaces):
           o   Sort the eigenvectors (eigenfaces) based on their corresponding eigenvalues in
               descending order.
           o   Choose the top k eigenfaces that capture a significant portion of the total variance
               in the face dataset. The number k is typically much smaller than the original
               dimensionality of the images.
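A hedged sketch of these steps using scikit-learn on the Olivetti faces is shown below; the library
performs the mean-centering and eigen decomposition internally, and its components_ attribute holds
the eigenvectors, reshaped here into 64x64 eigenface images.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import fetch_olivetti_faces

X = fetch_olivetti_faces().data                  # step 1: 400 aligned, flattened 64x64 face vectors

pca = PCA(n_components=16).fit(X)                # steps 2-6: mean face, covariance/SVD, top-k eigenvectors
mean_face = pca.mean_.reshape(64, 64)            # the mean face (step 2)
eigenfaces = pca.components_.reshape(-1, 64, 64) # eigenvectors reshaped into eigenface images

plt.figure(figsize=(8, 2))
plt.subplot(1, 4, 1)
plt.imshow(mean_face, cmap='gray')
plt.title('Mean face')
plt.axis('off')
for i in range(3):                               # the first few "ghostly" eigenfaces
    plt.subplot(1, 4, i + 2)
    plt.imshow(eigenfaces[i], cmap='gray')
    plt.title(f'Eigenface {i + 1}')
    plt.axis('off')
plt.show()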
       By applying Principal Component Analysis (PCA) to a set of face images, we can identify
the directions (principal components) that capture the most variance in the data. These principal
components, when reshaped back into image form, are termed Eigenfaces. Each Eigenface
represents a pattern of variation among the faces in the dataset.
       Given a set of centered face images (mean subtracted), PCA seeks to find a set of
orthonormal vectors (eigenvectors) that capture the directions of maximum variance.
Mathematically, the covariance matrix of the M mean-centered face images is:
       C = (1/M) * Σ_i (x_i − x̄)(x_i − x̄)^T
       where x_i is the vectorized form of the i-th image, and x̄ is the mean face vector.
       The eigenvectors of this covariance matrix correspond to the Eigenfaces. Each face in the
dataset can then be approximated as a linear combination of these Eigenfaces.
       Eigenfaces often appear as ghostly images highlighting specific facial features. The first
few Eigenfaces capture the most significant variations, such as lighting and general facial
structure, while subsequent Eigenfaces capture finer details.
       For example, the first Eigenface might represent the average lighting condition across all
faces, the second might capture variations in facial width, and so on. By combining these
Eigenfaces with appropriate weights, we can reconstruct original face images with varying degrees
of accuracy, depending on the number of Eigenfaces used.
       In face recognition systems, Eigenfaces are used to project high-dimensional face images
into a lower-dimensional space (face space). Recognition is performed by comparing the projected
test image with the projections of known faces and identifying the closest match.
       This method significantly reduces computational complexity and storage requirements,
making it suitable for real-time face recognition applications.
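The rough sketch below illustrates this matching idea on the Olivetti faces, holding out one image as
an assumed test face and finding its nearest neighbour in a 50-dimensional face space.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.datasets import fetch_olivetti_faces

faces = fetch_olivetti_faces()
X, y = faces.data, faces.target                  # 400 flattened faces and their person labels

pca = PCA(n_components=50).fit(X[:-1])           # build the face space from all but the last image
known_proj = pca.transform(X[:-1])               # projections of the known faces
test_proj = pca.transform(X[-1:])                # projection of the held-out test face

distances = np.linalg.norm(known_proj - test_proj, axis=1)
closest = int(np.argmin(distances))              # nearest known face in the reduced space
print('Predicted person:', y[:-1][closest], '| Actual person:', y[-1])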
Classification of LFW Images
       The Labelled Faces in the Wild (LFW) dataset is a popular dataset used for face recognition
tasks. PCA can be used as a pre-processing step to reduce the dimensionality of the image data
and improve the accuracy of face recognition classifiers.
To classify LFW images using PCA, the following steps can be followed:
1. Convert the images to grayscale and resize them to a fixed size, such as 64x64 pixels, to reduce
the dimensionality of the data.
2. Compute the mean image of the dataset and subtract it from each image to center the data
around zero.
3. Compute the covariance matrix of the centered images.
4. Compute the eigenvectors and eigenvalues of the covariance matrix.
5. Select a subset of the eigenvectors with the largest eigenvalues to represent the images. These
eigenvectors correspond to the most important features of the images.
6. Project the centered images onto the selected eigenvectors to obtain a reduced representation of
the images.
7. Train a classifier, such as a neural network or a support vector machine (SVM), using the
reduced representation of the images as input features.
8. Test the classifier on a separate set of LFW images and evaluate its accuracy (see the code
sketch at the end of this section).
       Using PCA as a pre-processing step can improve the accuracy of face recognition
classifiers by reducing the dimensionality of the image data and removing noise and irrelevant
information. However, it is important to carefully balance the level of dimensionality reduction
and the accuracy of the classifier based on the specific requirements of the application.
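A sketch of this pipeline is given below, assuming scikit-learn's fetch_lfw_people can download the
dataset; the parameter choices (150 components, an RBF SVM) are illustrative rather than prescribed
by these notes.
from sklearn.datasets import fetch_lfw_people
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

lfw = fetch_lfw_people(min_faces_per_person=70)  # keep only people with enough images
X, y = lfw.data, lfw.target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pca = PCA(n_components=150, whiten=True).fit(X_train)      # fit PCA on the training images only
clf = SVC(kernel='rbf', C=10).fit(pca.transform(X_train), y_train)

y_pred = clf.predict(pca.transform(X_test))
print('Test accuracy:', accuracy_score(y_test, y_pred))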
Question Bank
 1. What is unsupervised learning, and how does it differ from supervised learning?
 2. What is clustering? Explain the features of clustering and why it is useful in unsupervised
 learning.
 3. Explain the k-means algorithm and how it is used for flat clustering.
 4. How does the scikit-learn library implement k-means clustering?
 5. Describe the fit and predict functions in k-means clustering.
  6. How do you choose the optimal value of k in k-means clustering?
 7. What is the silhouette algorithm, and how does it help in selecting the optimal k value?
 8. What is k-medoids clustering, and how does it differ from k-means clustering?
 9. Explain the k-medoids algorithm for clustering.
 10. What is hierarchical clustering, and what are the two approaches for it?
 11. What is the intuition behind PCA, and why is it useful in unsupervised learning?
 12. How can PCA be applied to 2D data, and what is the effect of the transformation?
 13. How does PCA work for 3D data, and what are the benefits of using PCA?
 14. Explain the mathematical basis of PCA, including covariance matrices and eigenvectors.
 15. How do you find the optimal number of features in PCA, and why is this important?
 16. How can PCA be used for image data, and what are some applications of this technique?
 17. How does PCA work for Olivetti images, and how can the results be used for data
 compression?
 18. What is image reproduction, and how can PCA be used to achieve this?
 19. What are eigenfaces, and how are they used in facial recognition systems?
 20. Explain the classification of LFW images using PCA, including the steps involved and the
 results obtained.
21. List and explain clustering algorithms.
22. What is K-Means clustering? Explain the K-Means algorithm.
23. Solve the following example of K-Means clustering:
24. Dataset: We have 6 data points representing coordinates in 2D space:
               A (2, 1) , B (4, 1) , C (1, 1) , D (10, 3) , E (10, 5) , F (10, 2)
25. In unsupervised learning, how do you choose the optimal value of K? Explain with an example.
26. Explain the elbow method with the following example. Dataset: We have 6 data points
    representing coordinates in 2D space: A (2, 1), B (4, 1), C (1, 1), D (10, 3), E (10, 5), F (10, 2)
27. Differentiate between Supervised, Unsupervised, and Semi-supervised learning with
    examples.
28. What is K-Means Clustering? Explain the steps of the K-Means Algorithm.
29. What are the advantages and disadvantages of K-Means Clustering?
30. In K-Means, what is the role of Euclidean distance? Explain with example.
31. Apply 2 iterations of K-Means on the following data with K = 3:
    Data points: (1,2), (1,4), (1,0), (10,2), (10,4), (10,0)
    Initial centroids: (1,2), (10,2), (5,5)
32. Explain the Elbow Method. Use the following dataset to illustrate the process:
    A (2, 1), B (4, 1), C (1, 1), D (10, 3), E (10, 5), F (10, 2)
33. In Unsupervised Learning, how to choose the optimal number of clusters K? Explain with
    examples.