Unit 5

Unsupervised learning involves analyzing input data without labeled outputs to discover hidden patterns, with techniques including clustering, dimensionality reduction, density estimation, and association rules. Key challenges include evaluation difficulties, the curse of dimensionality, and potential overfitting. Market basket analysis, a common application, uses association rules to identify purchasing patterns, aiding in decision-making for businesses.


Unsupervised Learning (Learning without a Teacher)

 In unsupervised learning, there is no teacher to give correct answers.

 You are given only the input data X without any output labels Y.

 The goal is to find hidden patterns or structures in the data.

Example:

 Imagine you have customer data, but no information about their behavior.

 Using unsupervised learning, you can group similar customers together (clustering)
to create marketing strategies.

Types of Unsupervised Learning

There are several types of unsupervised learning techniques. Here are some of the most
common ones:

1. Clustering

 Clustering groups similar data points together.

 It tries to find clusters or groups where data points are close to each other.

Example:

 Grouping customers based on their purchase behavior.

 Grouping similar documents in a large set of text files.

Popular Algorithms:

 K-Means
 Hierarchical Clustering

 DBSCAN

2. Dimensionality Reduction

 Dimensionality reduction is about simplifying data by reducing the number of features (variables).

 It’s useful when data has too many variables (high-dimensional data).

Example:

 Visualizing a dataset with hundreds of features in a 2D or 3D plot.

Popular Algorithms:

 Principal Component Analysis (PCA)

 t-SNE (t-Distributed Stochastic Neighbor Embedding)

 Autoencoders (using Neural Networks)
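
As a quick illustration, here is a minimal PCA sketch in Python (the 50-feature dataset is synthetic and scikit-learn is used only as one convenient tool):

```python
# A minimal sketch: project synthetic high-dimensional data down to 2
# components so it can be scatter-plotted.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))        # 200 samples, 50 features (synthetic)

pca = PCA(n_components=2)             # keep the 2 directions of largest variance
X_2d = pca.fit_transform(X)           # shape (200, 2), ready for plotting

print(pca.explained_variance_ratio_)  # fraction of variance kept per component
```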

3. Density Estimation

 This technique estimates the probability distribution of data.

 It helps in finding regions where data is more concentrated.

Example:

 Detecting anomalies by identifying areas with low-density data.

Popular Algorithms:

 Kernel Density Estimation (KDE)

 Gaussian Mixture Models (GMM)
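
As an illustration of the anomaly-detection use case, here is a minimal KDE sketch (synthetic data; scikit-learn's KernelDensity is just one possible implementation):

```python
# A minimal sketch: score points with a kernel density estimate and flag
# the lowest-density points as candidate anomalies.
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(300, 2)),    # dense "normal" region
               rng.uniform(-6, 6, size=(10, 2))])  # a few scattered outliers

kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(X)
log_density = kde.score_samples(X)                 # log-density of every point

threshold = np.quantile(log_density, 0.03)         # bottom 3% of densities
anomalies = X[log_density < threshold]
print(len(anomalies), "candidate anomalies")
```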

4. Association Rules

 Association rules identify relationships or patterns in data.

 It’s commonly used in market basket analysis to find product purchase patterns.

Example:

 “Customers who buy bread are likely to buy butter.”


Popular Algorithms:

 Apriori

 FP-Growth

Challenges in Unsupervised Learning

 No Clear Evaluation: Unlike supervised learning, there’s no definite answer to check against. Evaluating results can be subjective.

 Curse of Dimensionality: In high-dimensional data, distances between points become less meaningful, making clustering or density estimation harder.

 Overfitting: Models may detect patterns that are not meaningful.

14.2 Association Rules

Association rule analysis is a technique used to identify patterns and relationships within
large datasets. It is especially useful in business and retail industries to uncover connections
between items that are frequently purchased together. This type of analysis is commonly
referred to as market basket analysis.

Market Basket Analysis

Market basket analysis is a technique used to find patterns or relationships between items
that are often bought together in a store.

 Imagine a supermarket where thousands of transactions happen daily.

 Each transaction consists of a set of items purchased by a customer.

 Association rule analysis helps to determine which items are frequently bought
together.

 In this analysis, we typically deal with binary data.

 Binary data means that each variable (item) has only two possible values:

o 1 → The item was purchased.

o 0 → The item was not purchased.

 Each row in the dataset represents a transaction.

 Each column represents a specific item in the store.

 The value in each cell indicates whether the item was purchased (1) or not (0).

The goal is to find patterns or associations that frequently occur in the dataset. Specifically, we are looking for joint occurrences of items.

Key Objectives:

1. Identify frequent item sets → Groups of items often bought together.

2. Generate association rules → Relationships like:

o "If a customer buys Bread, they are likely to buy Butter."

3. Support decision-making → Helps businesses with:

o Store layout design

o Cross-marketing promotions

o Personalized recommendations

Probability in Association Rules:

For an item set K (a group of items), we ask how likely it is that all of its items appear together in a single transaction, i.e. Pr(X_j = 1 for every item j in K). This probability is estimated by the support of K:

support(K) = (number of transactions containing every item in K) / (total number of transactions)
Note:

 Market Basket Analysis helps businesses understand purchasing patterns.


 By converting data into dummy variables, we simplify the analysis.
 Support measures how frequently an item set appears in transactions.
 We filter results using a support threshold to find the most meaningful
associations.
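
To make the support idea concrete, here is a minimal sketch that computes support directly from a 0/1 transaction matrix (the item names and numbers are made up):

```python
# A minimal sketch: support of an item set = fraction of transactions that
# contain every item in the set.
import pandas as pd

transactions = pd.DataFrame(
    {"Bread":  [1, 1, 0, 1, 1],
     "Butter": [1, 1, 0, 0, 1],
     "Jelly":  [0, 1, 1, 0, 0]})

def support(df, items):
    """Fraction of transactions containing every item in `items`."""
    return df[list(items)].all(axis=1).mean()

print(support(transactions, ["Bread"]))            # single-item support
print(support(transactions, ["Bread", "Butter"]))  # joint support of a pair

# Keep only item sets that clear a user-chosen support threshold
threshold = 0.4
candidates = [("Bread", "Butter"), ("Bread", "Jelly"), ("Butter", "Jelly")]
frequent = [pair for pair in candidates if support(transactions, pair) >= threshold]
print(frequent)
```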
The Apriori Algorithm

The Apriori Algorithm is a widely used method in data mining to find association rules between items in large datasets. It is especially useful in market basket analysis, where we analyze shopping data to find patterns of products that are frequently bought together.

How the Apriori Algorithm Works:

The Apriori algorithm works in multiple passes over the data, filtering out item sets that
don't meet the support threshold (a minimum percentage set by the user).

Step 1: Calculate Support for Single Items

 The algorithm checks the support for each individual item.

 Items with support below the threshold are discarded.

Step 2: Generate Item sets of Size 2

 Form pairs of items using only the items that passed Step 1.
 Calculate support for all possible 2-item combinations.

 Discard pairs with low support.

Step 3: Generate Larger Item sets

 Combine the surviving item sets to form larger item sets (e.g., sets of 3 items).

 Continue this process until no further item sets meet the support threshold.

Step 4: Generate Association Rules

 For each itemset, generate all possible association rules of the form:

A→B

Where A and B are non-empty subsets of the itemset.

 Calculate confidence and lift for each rule.

 Keep only the rules that meet both the confidence threshold and the support
threshold.
For example, suppose the rule {Peanut Butter, Jelly} → {Bread} has a confidence of 0.5. This means that 50% of the time, when customers bought Peanut Butter and Jelly, they also bought Bread.
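
The following is a minimal sketch of these passes on a toy set of transactions (the items and thresholds are made up, and real projects usually rely on a library implementation):

```python
# A minimal sketch of the Apriori passes: grow frequent item sets level by
# level, then derive rules A -> B that clear a confidence threshold.
from itertools import combinations

transactions = [
    {"Bread", "Butter"},
    {"Bread", "Butter", "Jelly"},
    {"Jelly"},
    {"Bread"},
    {"Bread", "Butter"},
]
min_support = 0.4
min_confidence = 0.6

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

# Pass 1: single items above the support threshold
items = {i for t in transactions for i in t}
frequent = [{frozenset([i]) for i in items if support(frozenset([i])) >= min_support}]

# Later passes: grow surviving item sets by one item at a time
k = 1
while frequent[-1]:
    candidates = {a | b for a in frequent[-1] for b in frequent[0] if len(a | b) == k + 1}
    frequent.append({c for c in candidates if support(c) >= min_support})
    k += 1

# Rules A -> B from each frequent item set, filtered by confidence
for level in frequent[1:]:
    for itemset in level:
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                confidence = support(itemset) / support(antecedent)
                if confidence >= min_confidence:
                    print(set(antecedent), "->", set(itemset - antecedent),
                          "confidence:", round(confidence, 2))
```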
14.2.4 Unsupervised as Supervised Learning:

We can turn an unsupervised learning problem into a supervised learning problem. This is useful because supervised learning algorithms (like logistic regression or neural networks) are often more mature and easier to apply. The usual trick is to draw a reference sample from a simple reference distribution (for example, uniform over the range of the data), label the real observations as class 1 and the reference observations as class 0, and then train a classifier to separate them. Regions where the classifier strongly predicts class 1 are regions where the real data is denser than the reference.
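
A minimal sketch of this idea, assuming synthetic data and logistic regression as the illustrative classifier:

```python
# A minimal sketch: label the real data 1, draw a reference sample labelled 0,
# and train a classifier to tell them apart. Where the classifier predicts
# class 1 with high probability, the real data is denser than the reference.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_real = rng.normal(loc=2.0, scale=1.0, size=(500, 2))        # observed data
X_ref = rng.uniform(X_real.min(axis=0), X_real.max(axis=0),   # reference sample
                    size=X_real.shape)

X = np.vstack([X_real, X_ref])
y = np.concatenate([np.ones(len(X_real)), np.zeros(len(X_ref))])

clf = LogisticRegression().fit(X, y)
# Predicted probability of being "real" at two query points
print(clf.predict_proba([[2.0, 2.0], [6.0, -2.0]])[:, 1])
```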
Example on Association rule mining:
14.3 Cluster Analysis:

What is Cluster Analysis?

Cluster analysis, also known as data segmentation, is a technique used to group a set of
objects into subsets called clusters. The main idea is that objects in the same cluster should
be more similar to each other than to objects in different clusters.

Goals of Cluster Analysis

1. Grouping Similar Objects: Objects with similar characteristics are grouped together.

2. Finding Hierarchical Structures: Some clustering methods create a hierarchy of clusters.

3. Identifying Patterns: It helps in understanding the structure of data by identifying subgroups with distinct properties.

Key Concept: Similarity and Dissimilarity

The success of clustering depends on how we define similarity (or dissimilarity) between
objects. This is often measured using a distance metric like:

 Euclidean Distance: Measures straight-line distance between points.

 Manhattan Distance: Measures distance along grid-like paths.

 Cosine Similarity: Measures the angle between two vectors.
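
The sketch below computes each of these measures for two example vectors (NumPy/SciPy used purely for convenience):

```python
# A minimal sketch: the three dissimilarity measures listed above.
import numpy as np
from scipy.spatial.distance import euclidean, cityblock, cosine

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 0.0, 3.0])

print(euclidean(a, b))   # straight-line (Euclidean) distance
print(cityblock(a, b))   # Manhattan (grid) distance
print(1 - cosine(a, b))  # cosine similarity (SciPy's `cosine` is a distance)
```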

Clustering Techniques

There are two main types of clustering methods:

1. Partitioning (Top-Down) Methods: These divide data into a fixed number of clusters.

o Example: K-Means Clustering

 Assigns data points to clusters based on their distance to cluster centers.

 Iteratively refines cluster centers to minimize overall variance.

2. Hierarchical (Bottom-Up) Methods: These build a hierarchy of clusters.

o Example: Agglomerative Clustering

 Starts with each data point as its own cluster.

 Merges the closest clusters step by step until one cluster remains.

K-Means Clustering (Example)

1. Initialize: Start with random cluster centers.


2. Assign Points: Assign each data point to the closest cluster center.

3. Update Centers: Compute the average position of all points in a cluster.

4. Repeat: Continue until cluster centers stop changing (convergence).
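
A minimal NumPy sketch of these four steps, run on synthetic 2-D data with K = 3 chosen arbitrarily:

```python
# A minimal sketch of K-means: initialize, assign, update, repeat.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2)) + rng.choice([-4, 0, 4], size=(300, 1))  # 3 blobs
K = 3

centers = X[rng.choice(len(X), size=K, replace=False)]   # 1. random initialization
for _ in range(100):
    # 2. assign each point to its closest center
    labels = np.argmin(np.linalg.norm(X[:, None] - centers[None], axis=2), axis=1)
    # 3. move each center to the mean of its assigned points
    new_centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                            else centers[k] for k in range(K)])
    # 4. stop when the centers no longer move
    if np.allclose(new_centers, centers):
        break
    centers = new_centers

print(centers)
```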

Choosing the Number of Clusters

Selecting the right number of clusters is important. Common techniques include:

 Elbow Method: Plot the clustering cost vs. number of clusters and look for an
“elbow” point.

 Silhouette Score: Measures how well-separated clusters are.
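
A minimal sketch of both checks on synthetic data (scikit-learn used only for convenience):

```python
# A minimal sketch: compare several values of K using the clustering cost
# (inertia, for the elbow method) and the silhouette score.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in (-3, 0, 3)])

for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1), round(silhouette_score(X, km.labels_), 3))
```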

Applications of Clustering

 Customer Segmentation: Grouping customers based on buying behaviour.

 Image Segmentation: Identifying different objects in an image.

 Anomaly Detection: Identifying unusual patterns in data.

 Bioinformatics: Grouping genes with similar properties.


14.3.3 Object Dissimilarity:

In data analysis, especially in clustering and pattern recognition, we often need to measure
how "different" two objects (data points) are. This difference is called dissimilarity.

How Do We Measure Dissimilarity?

Imagine you have two observations (objects) in a dataset, each described by several features
(also called attributes or variables). The goal is to compute a single value that tells us how
different these two observations are.

Let’s say we have two people, and we are comparing them based on height, weight, and
age:
 Person A: Height = 5.8 ft, Weight = 160 lbs, Age = 25

 Person B: Height = 5.5 ft, Weight = 140 lbs, Age = 30

We need a way to combine these differences into a single dissimilarity measure.
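
One simple (illustrative) way to do this is to put each attribute on a comparable scale and then average the scaled differences. The typical ranges below are assumptions, not values from the text:

```python
# A minimal sketch: scale each attribute by an assumed typical range, then
# average the absolute differences into one dissimilarity value.
import numpy as np

person_a = np.array([5.8, 160.0, 25.0])   # height (ft), weight (lbs), age
person_b = np.array([5.5, 140.0, 30.0])
ranges   = np.array([2.0, 100.0, 50.0])   # assumed spread of each attribute

scaled_diff = np.abs(person_a - person_b) / ranges
dissimilarity = scaled_diff.mean()
print(round(dissimilarity, 3))
```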


Handling Missing Data

What if some values are missing? There are two common approaches:

1. Omitting Pairs with Missing Data

o If an attribute is missing for either object, we exclude that attribute when computing dissimilarity.

o Problem: If too many values are missing, we might lose important observations.

2. Imputing Missing Values

o Replace missing values with the mean or median of that attribute.

o For categorical variables, treat "missing" as a new category.

o Problem: This assumes the missing values follow a predictable pattern, which may not be true.
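
A minimal sketch of both options (the numbers are made up; NaN marks a missing value):

```python
# Option 1: ignore attributes missing in either object.
# Option 2: impute missing values with the column mean first.
import numpy as np

X = np.array([[5.8, 160.0, np.nan],
              [5.5, np.nan, 30.0],
              [6.0, 150.0, 40.0]])

def dissimilarity_omit(a, b):
    """Average absolute difference over attributes present in both objects."""
    mask = ~np.isnan(a) & ~np.isnan(b)
    return np.abs(a[mask] - b[mask]).mean()

print(dissimilarity_omit(X[0], X[1]))             # only the first attribute is usable

col_means = np.nanmean(X, axis=0)                 # mean of the observed values
X_imputed = np.where(np.isnan(X), col_means, X)
print(np.abs(X_imputed[0] - X_imputed[1]).mean())
```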

14.3.4 Clustering Algorithms

Types of Clustering Algorithms

Clustering algorithms can be grouped into three main types:

1. Combinatorial Algorithms

 Do not assume any probability model behind the data.


 They work directly on the raw data.
 Try to find clusters by minimizing distances or dissimilarities.
 Example: K-Means Clustering
o Groups data into k clusters.
o Tries to minimize the distance from each point to the centre of its cluster.

2. Mixture Models

 Assumes the data comes from a mixture of probability distributions.


 Each cluster is represented by a component of the mixture.
 Example: Gaussian Mixture Models (GMM)
o Data is assumed to come from several Gaussian distributions.
o We estimate the parameters using maximum likelihood or Bayesian
methods.

3. Mode Seeking (a.k.a. Bump Hunting)

 Non-parametric approach (no fixed distribution assumed).


 Focuses on finding peaks (modes) in the data's density.
 A cluster is a group of points around a mode (high-density area).
 Example: PRIM Algorithm
o Searches for regions in data space with high concentration of data points.

14.3.5 Combinatorial Algorithms:

 These are clustering methods that do not assume any underlying probability model.
 Instead, they work directly on data by grouping observations (data points) into K
distinct clusters.
 Each point is assigned to exactly one cluster.
 The main goal is to group similar data points together, minimizing the distance
between points in the same cluster.

Goal of Combinatorial Clustering

We want to find the best cluster assignments that minimize the within-cluster dissimilarities.
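
In symbols (following the usual textbook formulation, with C(i) denoting the cluster assigned to observation i and d the chosen dissimilarity), the quantity to minimize is the within-cluster scatter:

\[
W(C) = \frac{1}{2} \sum_{k=1}^{K} \; \sum_{C(i)=k} \; \sum_{C(i')=k} d(x_i, x_{i'})
\]

Searching over all possible assignments is combinatorially infeasible, so practical algorithms such as K-means use iterative greedy updates instead.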
14.3.6 K-means

K-means is a clustering algorithm used to group data points into K clusters based on
similarity. It is one of the most popular and simple unsupervised learning methods.
Note:

 K-means always converges, but it may find a local minimum (not the best possible
solution).
 It's sensitive to the initial positions of the centroids.
 To improve results, run the algorithm multiple times with different starting points
and choose the best one.
 After clustering, the data space is split into regions where each region contains points
closest to a specific centroid.
 These regions form a Voronoi diagram.
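
A minimal sketch of the multiple-restart advice using scikit-learn's n_init option (synthetic data; the library choice is illustrative):

```python
# A minimal sketch: one random start vs. keeping the best of 10 starts.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(100, 2)) for c in (-5, 0, 5)])

single_start = KMeans(n_clusters=3, n_init=1, random_state=0).fit(X)
multi_start  = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# The multi-start run keeps whichever of its attempts had the lowest
# within-cluster cost (inertia).
print(single_start.inertia_, multi_start.inertia_)
```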
14.3.7 Gaussian Mixtures as Soft K-means Clustering

📌 E-Step in EM (Expectation Step):

 Instead of assigning a point directly to one cluster (like K-means), EM calculates how
likely the point is to belong to each cluster.

 It gives a probability (called “responsibility”) to each cluster.

 For example, point X might have:

o 80% chance to be in Cluster A

o 20% chance to be in Cluster B

📍 Analogy: It’s like asking, “How much does this point belong to each group?”

📌 M-Step in EM (Maximization Step):

 Using these probabilities, EM updates the cluster centers and spreads (variances) to
better fit the data.

 It’s like recalculating the average position of the cluster, but considering partial
membership of points.
EM is a “soft” version of K-means because:

 In K-means, each point is entirely assigned to one cluster.

 In EM, each point is partially assigned to all clusters with certain probabilities.
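
A minimal sketch of these soft assignments, assuming synthetic 1-D data and scikit-learn's GaussianMixture as the EM implementation:

```python
# A minimal sketch: fit a 2-component Gaussian mixture and inspect the
# responsibilities (soft cluster memberships) produced by the E-step.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2, 1, size=(200, 1)),
                    rng.normal(3, 1, size=(200, 1))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

responsibilities = gmm.predict_proba(X)        # one probability per point per cluster
print(responsibilities[:3].round(2))           # e.g. [[0.98 0.02] ...]

hard_labels = responsibilities.argmax(axis=1)  # collapsing back to K-means-style labels
```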

14.3.9 Vector Quantization:

Vector Quantization is a technique for compressing data like images or audio. It's based on
the idea of finding patterns in the data and replacing repeating patterns with simpler codes
— a bit like turning sentences into short keywords.

VQ is often powered by the K-means clustering algorithm, which helps group similar data
chunks together and represent them with a few representative "codes."
🧪 Why Does VQ Work?

 In real images, many small blocks look very similar (e.g., white sky, dark clothes).

 Instead of storing all blocks exactly, we just store a small set of typical blocks (the
codebook).

 Then, use pointers to reuse these blocks wherever similar patterns appear.
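
A minimal sketch of the idea on a synthetic "image" (a random 64x64 array stands in for real pixel data, and K-means builds the codebook):

```python
# A minimal sketch of VQ: cut the image into 2x2 blocks, learn a small
# codebook with K-means, and replace every block by its nearest codeword.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
image = rng.random((64, 64))                        # stand-in for a real image

# Non-overlapping 2x2 blocks -> one 4-dimensional vector per block
blocks = image.reshape(32, 2, 32, 2).swapaxes(1, 2).reshape(-1, 4)

codebook_size = 16                                  # number of "typical" blocks
km = KMeans(n_clusters=codebook_size, n_init=10, random_state=0).fit(blocks)

codes = km.labels_                                  # the pointers stored per block
decoded_blocks = km.cluster_centers_[codes]         # decode: look up each codeword
reconstructed = (decoded_blocks.reshape(32, 32, 2, 2)
                 .swapaxes(1, 2).reshape(64, 64))

print(np.abs(image - reconstructed).mean())         # average reconstruction error
```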
Tree-Structured VQ (Optional Advanced Concept)

 Instead of flat clustering, we build a tree of centroids.

 Top-down clustering like repeated 2-means.

 Helps with progressive refinement — improving quality step-by-step.

14.3.10 K-medoids

K-medoids is a clustering algorithm like K-means, but it’s designed to be more robust to
outliers and more flexible with different types of data.

 In K-means, each cluster is represented by the mean (average) of the points in the
cluster.

 In K-medoids, each cluster is represented by a medoid: an actual data point that is the most central to the others in its group.

Advantages:
 Robust to outliers: Outliers can pull the mean far away, but medoids stay close to
the main data.

 Works with any distance metric, not just Euclidean (can even use Manhattan
distance, cosine similarity, or proximity matrices).

 Better for categorical or mixed data, not just numeric.

Disadvantage:

 More computationally expensive than K-means (slower for large datasets).

Note:

 Instead of minimizing squared distances as in K-means, K-medoids minimizes the sum of distances.
 Medoid of a cluster = the point that has the least total distance to all others in that
cluster.
 This avoids outliers pulling the cluster center too far.
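
A minimal sketch of the medoid idea (synthetic data with one outlier; SciPy's cdist supplies the distance matrix, and any metric could be swapped in):

```python
# A minimal sketch: the medoid of a cluster is the member with the smallest
# total distance to all other members. Full K-medoids (PAM) alternates this
# update with reassignment, much like K-means.
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
cluster = np.vstack([rng.normal(size=(20, 2)), [[10.0, 10.0]]])   # one outlier

distances = cdist(cluster, cluster, metric="cityblock")           # any metric works
medoid_index = distances.sum(axis=1).argmin()

print("mean:  ", cluster.mean(axis=0))    # dragged towards the outlier
print("medoid:", cluster[medoid_index])   # an actual, central data point
```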
14.3.11 Practical Issues

When Using K-Means or K-Medoids

You need:

 A number of clusters K.

 An initialization (starting points for the clusters).

There are different ways to initialize:

 Randomly select K data points as the initial centers.

 Step-by-step method: Choose one center at a time to reduce the clustering error.

How do you choose the number of clusters (K)?

 It depends on your goal: sometimes K is fixed by the application itself (for example, the number of customer segments a marketing team can act on). When it is not, data-driven guides such as the elbow method or the silhouette score (described earlier) can suggest a reasonable value.


14.3.12 Hierarchical Clustering

Hierarchical clustering is a method used to group similar data points together into clusters.
Unlike other methods like K-means or K-medoids (which require you to choose the number
of clusters ahead of time), hierarchical clustering doesn’t need you to specify that. Instead, it
builds a hierarchy of clusters step by step.

There are two main ways to do this:

 Agglomerative (Bottom-Up): Start with each data point as its own cluster. Then, step
by step, merge the two closest clusters until all points are combined into one big
cluster.

 Divisive (Top-Down): Start with all data points in one big cluster and split it repeatedly into smaller groups based on how different they are from each other.

This process forms a tree-like diagram called a dendrogram. At the bottom of the dendrogram, each point is by itself. As you move up, similar points are merged into clusters. At the very top, you have one cluster containing all data points.

You can “cut” the dendrogram at a certain height to decide how many clusters you want.
The height represents how different the merged groups are. A large height means the
merged clusters are quite different, so the earlier merges (at shorter heights) are more
natural groupings.

However, hierarchical clustering always forces the data into a hierarchy — even if the real
structure isn’t hierarchical. So, it may not always reflect the true pattern in the data. To
check how well the dendrogram reflects the actual data, we can use something called the
cophenetic correlation coefficient. This compares the original distances between data
points with the distances shown in the dendrogram. If they match well, the dendrogram is a
good summary of the data.

Still, because small changes in the data can lead to very different dendrograms, and different
methods of hierarchical clustering produce different trees, you should be cautious in
interpreting them. Think of the dendrogram as a useful visualization tool, but not as an
absolute truth about your data.
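
A minimal sketch of this check with SciPy (synthetic data; "average" linkage chosen arbitrarily):

```python
# A minimal sketch: compare the original pairwise distances with the
# dendrogram ("cophenetic") distances; a coefficient near 1 means the
# dendrogram is a faithful summary.
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(30, 2)) for c in (-4, 0, 4)])

original_distances = pdist(X)
Z = linkage(X, method="average")
c, cophenetic_distances = cophenet(Z, original_distances)
print(round(c, 3))
```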

Agglomerative Clustering:

Agglomerative clustering is a way to group similar things (like people, animals, or data
points) into clusters. It's called "agglomerative" because it starts with each item in its own
group and gradually merges the closest groups step by step.

We use a dissimilarity or distance measure — this tells us how different two groups are.
There are three popular ways to calculate this:

1. Single Linkage (Nearest Neighbor)

 We look at the closest pair of points between the two groups.

 The distance between the two groups is the smallest distance between any two
points — one from each group.

 It links clusters based on the closest connection.

 Problem: It can lead to "chaining", where clusters grow in long, loose chains that
don’t feel compact or tight.

2. Complete Linkage (Furthest Neighbor)

 We look at the furthest pair of points between the two groups.


 The distance between the groups is the largest distance between any two points
from different groups.

 This makes clusters that are tight and compact.

 Problem: Sometimes it puts points together that aren’t very close, just to keep the
group tight.

3. Group Average (Average Linkage)

 We look at all the distances between points from both groups and take the average.

 It gives a balanced result, combining the strengths of the first two methods.

 It tries to form clusters that are compact and well separated.

 Limitation: The result can change if the distances are transformed (e.g., if we double
all distances), unlike single or complete linkage.

Choosing the Right Method

 If your data has natural chains, single linkage might link them well — but it might
also make messy, spread-out clusters.

 If you want tight groups, complete linkage is better — but it might split up similar
items just to keep the group compact.

 If you want a balanced result, group average is usually a good middle ground.

Dendrograms

 You can draw this whole merging process as a tree diagram, where similar items join
together low on the tree. You can then "cut" the tree at a certain level to form your
final clusters.
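
A minimal sketch that builds the merge tree with each of the three linkage rules and then cuts it into a chosen number of clusters (synthetic data; SciPy used for illustration):

```python
# A minimal sketch: agglomerative clustering under single, complete and
# average linkage, each cut into 3 clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(30, 2)) for c in (-4, 0, 4)])

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)                    # the full merging history
    labels = fcluster(Z, t=3, criterion="maxclust")  # "cut" the tree into 3 clusters
    print(method, np.bincount(labels)[1:])           # cluster sizes under each rule

# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree itself.
```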
Divisive Clustering

Divisive clustering is a "top-down" approach to grouping similar items. You start with
everything in one big group, and then you split it apart step by step, creating smaller and
smaller groups.

1. Start with all your data points in one big cluster.

2. Pick a cluster to split.

3. Divide it into two smaller clusters.

4. Keep repeating this splitting process until:

o Each point is in its own group, or

o The points in a group are too similar to split any further.

This is the opposite of agglomerative clustering (which builds clusters from the bottom up by
merging small ones).

 It’s helpful when you only need a few large clusters.

 It gives a broad overview first, then adds detail.

 It's useful in areas like data compression, where you start general and refine.

However, divisive clustering is less studied and less used than agglomerative clustering, especially in basic data science work.

A Special Divisive Method (Macnaughton-Smith Algorithm)

This method works like this:

1. Put all your data points in one cluster (G).

2. Find the point that’s most different from the others.

3. Move it to a new cluster (H).

4. Then, one by one, move more points from G to H if they’re more similar to H than to
the rest of G.

5. Stop when no more points clearly “belong” with H.

Now you have your first two clusters: the split is complete.

Keep Splitting

You repeat the same splitting process on one of the current clusters. How do you pick which
cluster to split next? Here are two common strategies:

 Choose the cluster that’s spread out the most (has the largest diameter).

 Or choose the one with the highest average difference between its points.

Keep doing this until:

 Each cluster has only one point, or

 Points in a cluster are so similar that splitting them makes no sense.
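
A minimal sketch of this splinter-group procedure for a single split (the two-blob data is synthetic):

```python
# A minimal sketch of the Macnaughton-Smith split: peel points off the main
# cluster G into a splinter cluster H while they sit closer, on average, to H.
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-3, 1, size=(20, 2)), rng.normal(3, 1, size=(20, 2))])
D = cdist(X, X)

G = list(range(len(X)))    # everything starts in one cluster
H = []                     # the new splinter cluster

# Move the point that is, on average, most dissimilar to the rest of G
start = max(G, key=lambda i: D[i, G].sum() / (len(G) - 1))
G.remove(start); H.append(start)

# Keep moving points that are closer (on average) to H than to the rest of G
while len(G) > 1:
    gains = {i: D[i, [j for j in G if j != i]].mean() - D[i, H].mean() for i in G}
    best = max(gains, key=gains.get)
    if gains[best] <= 0:   # no remaining point clearly "belongs" with H
        break
    G.remove(best); H.append(best)

print(sorted(G))
print(sorted(H))
```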
