➢ WEEK 2
Classification and Clustering Techniques: Simplified Overview
1. Introduction to Data Science and Machine Learning
• Data Science is an interdisciplinary field involving computer science, statistics, and
machine learning.
• It extracts meaningful information from large amounts of data (structured and
unstructured).
• With advancements in technology, the amount of data has grown, making problems
more complex.
• Machine learning, a key part of data science, helps analyze data, identify patterns,
and solve real-world problems (fraud detection, weather forecasting, etc.).
• Classification and clustering are two important techniques used to categorize data.
2. Classification Techniques: Supervised vs. Unsupervised Learning
• Classification techniques fall under two categories: Supervised and Unsupervised.
o Supervised Learning: The model is trained using labeled data, meaning the
correct outcome is already known.
o Unsupervised Learning (like clustering): The model works with unlabeled
data and tries to group similar items.
Supervised Classification Algorithms:
2.1 Bayesian Classification (Supervised Learning)
• Bayesian classification is based on probability theory, specifically Bayes' decision
theory.
• The goal is to classify data into different groups (classes) based on their features.
• A feature vector (a list of characteristics) is used to determine the probability of the
data belonging to a particular class.
• Bayes' Rule is used to calculate the probability that a given data point belongs to a
certain class.
o Key Concepts:
▪ Posterior Probability: The probability that a data point belongs to a
class, given the data.
▪ Prior Probability: The overall probability of a class, before looking at
the data.
▪ Conditional Probability: The likelihood of observing the features, given that
the data point belongs to a certain class.
▪ Decision Rule: Choose the class with the highest posterior probability.
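To make these probabilities concrete, here is a minimal R sketch of Bayes' rule with made-up numbers (the 30%, 80%, and 40% figures are invented purely for illustration):

# Minimal sketch of Bayes' rule with invented numbers (illustration only).
# Suppose 30% of applicants are admitted (prior), and a high GRE score is
# seen in 80% of admitted and 40% of rejected applicants (conditionals).
prior      <- c(admit = 0.3, reject = 0.7)
likelihood <- c(admit = 0.8, reject = 0.4)   # P(high GRE | class)
evidence   <- sum(prior * likelihood)        # P(high GRE)
posterior  <- prior * likelihood / evidence  # P(class | high GRE)
round(posterior, 3)   # decision rule: pick the class with the highest posterior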
2.2 Support Vector Machine (SVM)
• SVM is used for classification by finding the optimal boundary (called a hyperplane)
that best separates data into different classes.
• It tries to keep the data points of each class as far as possible from this boundary;
in other words, it maximizes the margin around the hyperplane.
a) Linear SVM
o In a linear SVM, data can be separated using a straight line (or plane in higher
dimensions).
o The model finds the hyperplane that maximizes the margin, i.e. the distance from
the hyperplane to the closest data points of each class.
o SVM works by solving a mathematical optimization problem that minimizes
error and finds the best separation between the classes.
b) Non-Linear SVM
o If data is not linearly separable (cannot be separated by a straight line), SVM
uses a technique called the kernel trick.
o This technique transforms the data into a higher-dimensional space where it
becomes linearly separable.
o Common kernel functions include:
▪ Polynomial Kernel: Transforms the data using a polynomial equation.
▪ Radial Basis Function (RBF) Kernel: Maps the data to a new space
based on the distance between data points, making it easier to
classify.
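As a rough illustration, the sketch below fits a linear and an RBF-kernel SVM on R's built-in iris data using the e1071 package (one common choice; the course's own case-study code may differ):

# Sketch: linear vs. RBF-kernel SVM with the e1071 package on the iris data.
library(e1071)

fit_linear <- svm(Species ~ ., data = iris, kernel = "linear")
fit_rbf    <- svm(Species ~ ., data = iris, kernel = "radial")   # RBF kernel

# Compare how well each decision boundary fits the training data
mean(predict(fit_linear, iris) == iris$Species)
mean(predict(fit_rbf,    iris) == iris$Species)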
2.3 Decision Tree Algorithm (in Simple Terms)
1. What is a Decision Tree?
o A decision tree is a model that helps in making decisions by splitting the data
into branches based on certain conditions. It works like a flowchart, where
each internal node represents a "test" on an attribute (for example, "Is the
person's income greater than $30,000?"), and the outcome of the test leads
to further splits until a final decision is made.
2. Why Use Decision Trees?
o They are simple to understand, easy to explain, and can be visualized. For
example, you can explain the result of a decision (why someone gets
approved for a loan or not) using the tree.
o Decision trees are popular because they are faster to build compared to other
complex models like neural networks.
3. Key Terms:
o Root Node: The first decision point (e.g., "What is the person's age?")
o Splitting: Dividing the data into two or more parts based on a test.
o Decision Node: A point where another decision is made.
o Leaf/Terminal Node: The final result (e.g., "Approve loan" or "Deny loan").
o Pruning: Removing unnecessary branches to simplify the tree.
4. How Does It Work?
o The tree starts with a root node. For example, if you're trying to decide
whether to offer a credit card, the first decision might be based on the
person's employment status.
o The person’s details are checked against the tree's tests (like age or GPA); each
answer decides which branch to follow, until a leaf gives the final decision.
5. Case Study Example:
o Imagine you're deciding if Andrew (22, student, $2000 income, GPA of 3.39)
should be invited to apply for a credit card.
o The decision tree might first ask if Andrew is a student (yes, so follow the
student branch).
o Then it asks if he’s over 21 (yes, continue).
o Finally, it checks if his GPA is 3.0 or higher (yes, so the tree decides to invite
him).
6. Building a Decision Tree:
o To create a decision tree, you start with past data, called "training examples."
This data has cases with attributes (like age, income) and a known outcome
(e.g., whether the person was approved for a credit card or not).
o You divide the data into branches by asking questions that split the data
based on similarities. This process continues until all the branches lead to a
final decision (leaf).
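A minimal sketch of building such a tree with the rpart package is shown below; the applicant data frame and its column names are invented purely to mirror the credit-card example above:

# Sketch: a small classification tree with rpart on made-up applicant data.
library(rpart)

applicants <- data.frame(
  student = factor(c("yes", "yes", "no", "no", "yes", "no")),
  age     = c(22, 19, 35, 45, 24, 17),
  gpa     = c(3.4, 2.1, 3.8, 2.5, 3.1, 3.9),
  invite  = factor(c("yes", "yes", "no", "no", "yes", "no"))
)

# rpart picks the test that best splits the outcomes at each node;
# minsplit is lowered only because this toy data set is tiny.
tree <- rpart(invite ~ student + age + gpa, data = applicants,
              method = "class", control = rpart.control(minsplit = 2))

andrew <- data.frame(student = factor("yes", levels = levels(applicants$student)),
                     age = 22, gpa = 3.39)
predict(tree, andrew, type = "class")   # follow the branches down to a leaf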
2.4 k-Nearest Neighbor (k-NN) Algorithm (in Simple Terms)
1. What is k-NN?
o k-Nearest Neighbor (k-NN) is a simple algorithm that classifies an object
based on the votes of its closest neighbors. It's like asking your friends for
advice: you make a decision based on what most of them say.
2. How Does It Work?
o Imagine you're trying to identify whether an animal is a duck. You might
compare the animal to its nearest neighbors (other animals you've seen)
based on their characteristics (like walking or quacking).
o In k-NN, when you're given a new point of data (say, a new animal), you look
at the closest neighbors from your training data (past examples), and assign
the new point the same label as the majority of its neighbors.
3. Key Steps:
o Training Set: You have a collection of known examples, with their labels (like
"duck" or "not a duck").
o Distance Metric: To compare how close one point is to another, you measure
the "distance" between them using methods like Euclidean distance (like
measuring the straight-line distance between two points on a map).
o Choosing k: You choose a number (k), which is how many neighbors to look
at. If k=3, you look at the 3 closest neighbors and classify based on what the
majority of them say.
4. Example:
o If you have a new animal and you're using k-NN with k=3, you might compare
it to the 3 nearest known animals. If 2 of them are ducks and 1 is not, you'd
classify the new animal as a duck (since the majority say so).
5. Distance Formulas:
o Euclidean Distance: Measures the straight-line distance between two points
(like the shortest path between two cities).
o Manhattan Distance: Measures distance in a grid-like pattern, like walking the
streets of a city where you can’t go through buildings.
o Hamming Distance: Used when comparing categorical data, it counts how
many positions in two strings differ (e.g., comparing "cat" and "bat" results in
a distance of 1 because the first letter is different).
6. Case Studies:
o k-NN has been used in a variety of fields:
▪ Handwritten Character Recognition: Recognizing numbers written by
hand.
▪ Image Retrieval: Finding similar images in a database.
▪ Intrusion Detection: Identifying whether a computer program's
behavior is normal or suspicious.
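The sketch below first computes the three distance measures by hand and then runs k-NN with k = 3 using the class package on R's built-in iris data (the package and the data set are assumptions for illustration, not the course's exact case study):

# Distance measures computed by hand on tiny examples
a <- c(1, 2); b <- c(4, 6)
sqrt(sum((a - b)^2))    # Euclidean distance: 5 (straight line)
sum(abs(a - b))         # Manhattan distance: 7 (city-block)
sum(strsplit("cat", "")[[1]] != strsplit("bat", "")[[1]])   # Hamming distance: 1

# k-NN with k = 3: each test flower gets the majority label of its 3 nearest
# training flowers
library(class)
set.seed(1)
idx  <- sample(nrow(iris), 100)            # 100 rows for training, rest for testing
pred <- knn(train = iris[idx, 1:4], test = iris[-idx, 1:4],
            cl = iris$Species[idx], k = 3)
mean(pred == iris$Species[-idx])           # share of test flowers labeled correctly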
3. Unsupervised Classification / Clustering Techniques (Simplified)
Sometimes in real-life problems, we don't know how to classify data because the data comes
without labels or categories. Since there’s no one to tell us which category the data belongs
to, we use unsupervised learning techniques like clustering to group similar data points
together based on their patterns.
Clustering is a way to divide data into groups (clusters) where similar things are grouped
together. Here, we'll talk about a popular clustering method: hierarchical clustering.
3.1 Hierarchical Clustering
Hierarchical clustering is a method of grouping data that builds a hierarchy (a tree-like
structure) of clusters. It works by either merging smaller clusters into larger ones (bottom-
up) or breaking larger clusters into smaller ones (top-down). The goal is to find hidden
patterns or outliers (unusual data points).
Steps of Hierarchical Clustering:
1. Create a pattern matrix: an n × p table holding the 'n' data points, each described
by 'p' attributes (characteristics).
2. Proximity Matrix: an n × n table recording how similar or different each pair of data
points is, using distance measures like Euclidean distance.
There are two main types of hierarchical clustering:
1. Agglomerative Clustering (Bottom-Up)
• Start with each data point as its own cluster.
• Merge the two clusters that are most similar (smallest distance) into one.
• Repeat this process until all data points are combined into one big cluster.
2. Divisive Clustering (Top-Down)
• Start with all data points in one cluster.
• Split a cluster into two smaller clusters, separating the points that are farthest apart
(largest distance).
• Repeat this process until each data point is in its own cluster.
Both methods produce a tree diagram called a dendrogram, which shows how clusters are
merged or split at each step.
Distance Between Clusters:
When merging clusters, you can calculate the distance between clusters in different ways:
• Complete Linkage: Use the largest distance between points in different clusters.
• Single Linkage: Use the smallest distance between points in different clusters.
• Average Linkage: Use the average distance between points in different clusters.
• Centroid Linkage: Use the distance between the centroids (center points) of the
clusters.
Different linkage methods affect the shape of the clusters. For example:
• Complete Linkage favors compact (round) clusters.
• Single Linkage can create long, stretched-out clusters.
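A minimal sketch of agglomerative clustering with base R's hclust, comparing linkage methods on a small made-up data set:

# Two loose groups of made-up 2-D points
set.seed(1)
x <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
           matrix(rnorm(20, mean = 4), ncol = 2))

d <- dist(x)                                  # proximity matrix (Euclidean)
hc_complete <- hclust(d, method = "complete")
hc_single   <- hclust(d, method = "single")
hc_average  <- hclust(d, method = "average")

plot(hc_complete)            # dendrogram showing the merge history
cutree(hc_complete, k = 2)   # cut the tree into 2 clusters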
3.2 K-Means Clustering
K-Means is an unsupervised algorithm for clustering data into k groups. It works by
minimizing the differences between points in each group (cluster) and their group’s center
(centroid). Here’s a simpler breakdown of how K-Means works:
1. Initialize: Start by randomly choosing k points (called centroids).
2. Assign Data: Each data point is assigned to the nearest centroid based on Euclidean
distance.
3. Update Centroids: The centroid of each group is recalculated as the average of all
data points in the group.
4. Repeat: Steps 2 and 3 are repeated until the centroids no longer change significantly
(or a maximum number of iterations is reached).
Since the starting points are chosen randomly, different initializations might lead to different
results. So, it's common to run the algorithm several times with different starting points and
keep the best solution (for example, the one with the smallest total within-cluster distance).
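A minimal sketch with base R's kmeans, using nstart to run several random initializations and keep the best result:

set.seed(1)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),    # made-up 2-D data
           matrix(rnorm(40, mean = 3), ncol = 2))

km <- kmeans(x, centers = 2, nstart = 25)   # keep the best of 25 random starts
km$cluster        # which cluster each point was assigned to
km$centers        # final centroids
km$tot.withinss   # total within-cluster sum of squares (smaller is better)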
3.3 Fuzzy C-Means
In Fuzzy C-Means (FCM), instead of assigning each point strictly to one cluster, it allows
points to belong to multiple clusters with varying degrees of membership. This is useful
when clusters overlap. A point’s membership in a cluster is expressed as a number between
0 and 1, where:
• A value of 1 means full membership in the cluster.
• A value close to 0 means little or no membership.
The algorithm works similarly to K-Means but instead of hard assignments, it assigns degrees
of belonging to each cluster and updates these memberships iteratively.
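One way to try this in R is the cmeans function from the e1071 package (assumed here; other FCM implementations exist):

# Sketch: fuzzy c-means on made-up 2-D data
library(e1071)

set.seed(1)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 3), ncol = 2))

fcm <- cmeans(x, centers = 2, m = 2)   # m > 1 controls how fuzzy the memberships are
round(head(fcm$membership), 2)         # each row sums to 1 across the clusters
fcm$cluster                            # hard label: cluster with the largest membership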
K-Fold Cross-Validation & Performance Evaluation (4)
K-Fold Cross-Validation is a technique to assess how well a model performs by splitting the
data into k parts (folds):
1. Use (k-1) parts for training.
2. Use 1 part for testing.
3. Repeat this process k times, each time using a different fold for testing.
This ensures that every data point is used for both training and testing, providing a more
accurate measure of model performance.
Performance is often evaluated using a confusion matrix, which helps to visualize true
positive (correct positive predictions), true negative, false positive, and false negative
predictions. From this, we can calculate measures like accuracy, sensitivity (true positive
rate), and specificity (true negative rate).
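A minimal sketch of 5-fold cross-validation written out by hand, using a naive Bayes classifier from e1071 on the built-in iris data and ending with a confusion matrix (the model and data set are placeholders, not the course's exact example):

library(e1071)

set.seed(1)
k     <- 5
folds <- sample(rep(1:k, length.out = nrow(iris)))   # assign each row to a fold
pred  <- character(nrow(iris))

for (i in 1:k) {
  test_idx <- which(folds == i)
  fit <- naiveBayes(Species ~ ., data = iris[-test_idx, ])        # train on k-1 folds
  pred[test_idx] <- as.character(predict(fit, iris[test_idx, ]))  # test on the held-out fold
}

conf <- table(Predicted = pred, Actual = iris$Species)   # confusion matrix
conf
sum(diag(conf)) / sum(conf)                              # overall accuracy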
5. Case Studies in R
• Bayesian Classification for Student Admission: This uses a probabilistic model to
predict whether a student will be admitted based on features like GRE scores, GPA,
and rank.
• SVM for Iris Data Classification: The Support Vector Machine (SVM) is used to
classify the famous Iris dataset, which includes different types of iris flowers.
• Decision Tree for Reading Skill Classification: A decision tree is used to classify
reading skills based on factors like age, shoe size, and reading test scores.
• Hierarchical Clustering for US Arrests Data: This method groups US states based on
crime statistics (murder, assault, rape) into clusters of states with similar crime
patterns. The result is shown in a dendrogram (tree-like diagram).
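For the last case study, a rough sketch with R's built-in USArrests data might look like this (the exact course code may differ):

d  <- dist(scale(USArrests))        # standardize the crime statistics, then compute distances
hc <- hclust(d, method = "complete")
plot(hc, cex = 0.6)                 # dendrogram of the 50 states
cutree(hc, k = 4)                   # group the states into 4 clusters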