➢ WEEK 2
Classification and Clustering Techniques: Simplified Overview
1. Introduction to Data Science and Machine Learning
• Data Science is an interdisciplinary field involving computer science, statistics, and
machine learning.
• It extracts meaningful information from large amounts of data (structured and
unstructured).
• With advancements in technology, the amount of data has grown, making problems
more complex.
• Machine learning, a key part of data science, helps analyze data, identify patterns,
and solve real-world problems (fraud detection, weather forecasting, etc.).
• Classification and clustering are two important techniques used to categorize data.
2. Classification Techniques: Supervised vs. Unsupervised Learning
• Classification techniques fall under two categories: Supervised and Unsupervised.
o Supervised Learning: The model is trained using labeled data, meaning the
correct outcome is already known.
o Unsupervised Learning (like clustering): The model works with unlabeled
data and tries to group similar items.
Supervised Classification Algorithms:
2.1 Bayesian Classification (Supervised Learning)
• Bayesian classification is based on probability theory, specifically Bayes' decision
theory.
• The goal is to classify data into different groups (classes) based on their features.
• A feature vector (a list of characteristics) is used to determine the probability of the
data belonging to a particular class.
• Bayes' Rule is used to calculate the probability that a given data point belongs to a
certain class.
o Key Concepts:
▪ Posterior Probability: The probability that a data point belongs to a
class, given the data.
▪ Prior Probability: The overall probability of a class, before looking at
the data.
▪ Conditional Probability: The likelihood of observing the features, given that
the data point belongs to a certain class.
▪ Decision Rule: Choose the class with the highest posterior probability.
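To make these probabilities concrete, here is a minimal R sketch of Bayes' rule with made-up numbers (the 30%, 80%, and 40% figures are invented purely for illustration):

# Minimal sketch of Bayes' rule with invented numbers (illustration only).
# Suppose 30% of applicants are admitted (prior), and a high GRE score is
# seen in 80% of admitted and 40% of rejected applicants (conditionals).
prior      <- c(admit = 0.3, reject = 0.7)
likelihood <- c(admit = 0.8, reject = 0.4)   # P(high GRE | class)
evidence   <- sum(prior * likelihood)        # P(high GRE)
posterior  <- prior * likelihood / evidence  # P(class | high GRE)
round(posterior, 3)   # decision rule: pick the class with the highest posterior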
2.2 Support Vector Machine (SVM)
• SVM is used for classification by finding the optimal boundary (called a hyperplane)
that best separates data into different classes.
• It tries to keep the data points of each class as far as possible from this boundary;
in other words, it maximizes the margin around the hyperplane.
a) Linear SVM
o In a linear SVM, data can be separated using a straight line (or plane in higher
dimensions).
o The model finds the hyperplane that maximizes the margin, i.e. the distance from
the hyperplane to the closest data points of each class.
o SVM works by solving a mathematical optimization problem that minimizes
error and finds the best separation between the classes.
b) Non-Linear SVM
o If data is not linearly separable (cannot be separated by a straight line), SVM
uses a technique called the kernel trick.
o This technique transforms the data into a higher-dimensional space where it
becomes linearly separable.
o Common kernel functions include:
▪ Polynomial Kernel: Transforms the data using a polynomial equation.
▪ Radial Basis Function (RBF) Kernel: Maps the data to a new space
based on the distance between data points, making it easier to
classify.
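As a rough illustration, the sketch below fits a linear and an RBF-kernel SVM on R's built-in iris data using the e1071 package (one common choice; the course's own case-study code may differ):

# Sketch: linear vs. RBF-kernel SVM with the e1071 package on the iris data.
library(e1071)

fit_linear <- svm(Species ~ ., data = iris, kernel = "linear")
fit_rbf    <- svm(Species ~ ., data = iris, kernel = "radial")   # RBF kernel

# Compare how well each decision boundary fits the training data
mean(predict(fit_linear, iris) == iris$Species)
mean(predict(fit_rbf,    iris) == iris$Species)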
2.3 Decision Tree Algorithm (in Simple Terms)
1. What is a Decision Tree?
o A decision tree is a model that helps in making decisions by splitting the data
into branches based on certain conditions. It works like a flowchart, where
each internal node represents a "test" on an attribute (for example, "Is the
person's income greater than $30,000?"), and the outcome of the test leads
to further splits until a final decision is made.
2. Why Use Decision Trees?
o They are simple to understand, easy to explain, and can be visualized. For
example, you can explain the result of a decision (why someone gets
approved for a loan or not) using the tree.
o Decision trees are popular because they are faster to build compared to other
complex models like neural networks.
3. Key Terms:
o Root Node: The first decision point (e.g., "What is the person's age?")
o Splitting: Dividing the data into two or more parts based on a test.
o Decision Node: A point where another decision is made.
o Leaf/Terminal Node: The final result (e.g., "Approve loan" or "Deny loan").
o Pruning: Removing unnecessary branches to simplify the tree.
4. How Does It Work?
o The tree starts with a root node. For example, if you're trying to decide
whether to offer a credit card, the first decision might be based on the
person's employment status.
o The person’s details are checked against the tree's tests (like age or GPA); each
answer decides which branch to follow, until a leaf gives the final decision.
5. Case Study Example:
o Imagine you're deciding if Andrew (22, student, $2000 income, GPA of 3.39)
should be invited to apply for a credit card.
o The decision tree might first ask if Andrew is a student (yes, so follow the
student branch).
o Then it asks if he’s over 21 (yes, continue).
o Finally, it checks if his GPA is 3.0 or higher (yes, so the tree decides to invite
him).
6. Building a Decision Tree:
o To create a decision tree, you start with past data, called "training examples."
This data has cases with attributes (like age, income) and a known outcome
(e.g., whether the person was approved for a credit card or not).
o You divide the data into branches by asking questions that split the data
based on similarities. This process continues until all the branches lead to a
final decision (leaf).
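A minimal sketch of building such a tree with the rpart package is shown below; the applicant data frame and its column names are invented purely to mirror the credit-card example above:

# Sketch: a small classification tree with rpart on made-up applicant data.
library(rpart)

applicants <- data.frame(
  student = factor(c("yes", "yes", "no", "no", "yes", "no")),
  age     = c(22, 19, 35, 45, 24, 17),
  gpa     = c(3.4, 2.1, 3.8, 2.5, 3.1, 3.9),
  invite  = factor(c("yes", "yes", "no", "no", "yes", "no"))
)

# rpart picks the test that best splits the outcomes at each node;
# minsplit is lowered only because this toy data set is tiny.
tree <- rpart(invite ~ student + age + gpa, data = applicants,
              method = "class", control = rpart.control(minsplit = 2))

andrew <- data.frame(student = factor("yes", levels = levels(applicants$student)),
                     age = 22, gpa = 3.39)
predict(tree, andrew, type = "class")   # follow the branches down to a leaf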
2.4 k-Nearest Neighbor (k-NN) Algorithm (in Simple Terms)
1. What is k-NN?
o k-Nearest Neighbor (k-NN) is a simple algorithm that classifies an object
based on the votes of its closest neighbors. It's like asking your friends for
advice: you make a decision based on what most of them say.
2. How Does It Work?
o Imagine you're trying to identify whether an animal is a duck. You might
compare the animal to its nearest neighbors (other animals you've seen)
based on their characteristics (like walking or quacking).
o In k-NN, when you're given a new point of data (say, a new animal), you look
at the closest neighbors from your training data (past examples), and assign
the new point the same label as the majority of its neighbors.
3. Key Steps:
o Training Set: You have a collection of known examples, with their labels (like
"duck" or "not a duck").
o Distance Metric: To compare how close one point is to another, you measure
the "distance" between them using methods like Euclidean distance (like
measuring the straight-line distance between two points on a map).
o Choosing k: You choose a number (k), which is how many neighbors to look
at. If k=3, you look at the 3 closest neighbors and classify based on what the
majority of them say.
4. Example:
o If you have a new animal and you're using k-NN with k=3, you might compare
it to the 3 nearest known animals. If 2 of them are ducks and 1 is not, you'd
classify the new animal as a duck (since the majority say so).
5. Distance Formulas:
o Euclidean Distance: Measures the straight-line distance between two points
(like the shortest path between two cities).
o Manhattan Distance: Measures distance in a grid-like pattern, like walking the
streets of a city where you can’t go through buildings.
o Hamming Distance: Used when comparing categorical data, it counts how
many positions in two strings differ (e.g., comparing "cat" and "bat" results in
a distance of 1 because the first letter is different).
6. Case Studies:
o k-NN has been used in a variety of fields:
▪ Handwritten Character Recognition: Recognizing numbers written by
hand.
▪ Image Retrieval: Finding similar images in a database.
▪ Intrusion Detection: Identifying whether a computer program's
behavior is normal or suspicious.
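The sketch below first computes the three distance measures by hand and then runs k-NN with k = 3 using the class package on R's built-in iris data (the package and the data set are assumptions for illustration, not the course's exact case study):

# Distance measures computed by hand on tiny examples
a <- c(1, 2); b <- c(4, 6)
sqrt(sum((a - b)^2))    # Euclidean distance: 5 (straight line)
sum(abs(a - b))         # Manhattan distance: 7 (city-block)
sum(strsplit("cat", "")[[1]] != strsplit("bat", "")[[1]])   # Hamming distance: 1

# k-NN with k = 3: each test flower gets the majority label of its 3 nearest
# training flowers
library(class)
set.seed(1)
idx  <- sample(nrow(iris), 100)            # 100 rows for training, rest for testing
pred <- knn(train = iris[idx, 1:4], test = iris[-idx, 1:4],
            cl = iris$Species[idx], k = 3)
mean(pred == iris$Species[-idx])           # share of test flowers labeled correctly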
3. Unsupervised Classification / Clustering Techniques (Simplified)
Sometimes in real-life problems, we don't know how to classify data because the data comes
without labels or categories. Since there’s no one to tell us which category the data belongs
to, we use unsupervised learning techniques like clustering to group similar data points
together based on their patterns.
Clustering is a way to divide data into groups (clusters) where similar things are grouped
together. Here, we'll talk about a popular clustering method: hierarchical clustering.
3.1 Hierarchical Clustering
Hierarchical clustering is a method of grouping data that builds a hierarchy (a tree-like
structure) of clusters. It works by either merging smaller clusters into larger ones (bottom-
up) or breaking larger clusters into smaller ones (top-down). The goal is to find hidden
patterns or outliers (unusual data points).
Steps of Hierarchical Clustering:
1. Create a pattern matrix: an n × p table holding the 'n' data points, each described
by 'p' attributes (characteristics).
2. Proximity Matrix: an n × n table recording how similar or different each pair of data
points is, using distance measures like Euclidean distance.
There are two main types of hierarchical clustering:
1. Agglomerative Clustering (Bottom-Up)
• Start with each data point as its own cluster.
• Merge the two clusters that are most similar (smallest distance) into one.
• Repeat this process until all data points are combined into one big cluster.
2. Divisive Clustering (Top-Down)
• Start with all data points in one cluster.
• Split a cluster into two smaller clusters, separating the points that are farthest apart
(largest distance).
• Repeat this process until each data point is in its own cluster.
Both methods produce a tree diagram called a dendrogram, which shows how clusters are
merged or split at each step.
Distance Between Clusters:
When merging clusters, you can calculate the distance between clusters in different ways:
• Complete Linkage: Use the largest distance between points in different clusters.
• Single Linkage: Use the smallest distance between points in different clusters.
• Average Linkage: Use the average distance between points in different clusters.
• Centroid Linkage: Use the distance between the centroids (center points) of the
clusters.
Different linkage methods affect the shape of the clusters. For example:
• Complete Linkage favors compact (round) clusters.
• Single Linkage can create long, stretched-out clusters.
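A minimal sketch of agglomerative clustering with base R's hclust, comparing linkage methods on a small made-up data set:

# Two loose groups of made-up 2-D points
set.seed(1)
x <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
           matrix(rnorm(20, mean = 4), ncol = 2))

d <- dist(x)                                  # proximity matrix (Euclidean)
hc_complete <- hclust(d, method = "complete")
hc_single   <- hclust(d, method = "single")
hc_average  <- hclust(d, method = "average")

plot(hc_complete)            # dendrogram showing the merge history
cutree(hc_complete, k = 2)   # cut the tree into 2 clusters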
3.2 K-Means Clustering
K-Means is an unsupervised algorithm for clustering data into k groups. It works by
minimizing the differences between points in each group (cluster) and their group’s center
(centroid). Here’s a simpler breakdown of how K-Means works:
1. Initialize: Start by randomly choosing k points (called centroids).
2. Assign Data: Each data point is assigned to the nearest centroid based on Euclidean
distance.
3. Update Centroids: The centroid of each group is recalculated as the average of all
data points in the group.
4. Repeat: Steps 2 and 3 are repeated until the centroids no longer change significantly
(or a maximum number of iterations is reached).
Since the starting points are chosen randomly, different initializations might lead to different
results. So, it's common to run the algorithm several times with different starting points and
keep the best solution (for example, the one with the smallest total within-cluster distance).
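A minimal sketch with base R's kmeans, using nstart to run several random initializations and keep the best result:

set.seed(1)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),    # made-up 2-D data
           matrix(rnorm(40, mean = 3), ncol = 2))

km <- kmeans(x, centers = 2, nstart = 25)   # keep the best of 25 random starts
km$cluster        # which cluster each point was assigned to
km$centers        # final centroids
km$tot.withinss   # total within-cluster sum of squares (smaller is better)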
3.3 Fuzzy C-Means
In Fuzzy C-Means (FCM), instead of assigning each point strictly to one cluster, it allows
points to belong to multiple clusters with varying degrees of membership. This is useful
when clusters overlap. A point’s membership in a cluster is expressed as a number between
0 and 1, where:
• A value of 1 means full membership in the cluster.
• A value close to 0 means little or no membership.
The algorithm works similarly to K-Means but instead of hard assignments, it assigns degrees
of belonging to each cluster and updates these memberships iteratively.
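One way to try this in R is the cmeans function from the e1071 package (assumed here; other FCM implementations exist):

# Sketch: fuzzy c-means on made-up 2-D data
library(e1071)

set.seed(1)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 3), ncol = 2))

fcm <- cmeans(x, centers = 2, m = 2)   # m > 1 controls how fuzzy the memberships are
round(head(fcm$membership), 2)         # each row sums to 1 across the clusters
fcm$cluster                            # hard label: cluster with the largest membership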
K-Fold Cross-Validation & Performance Evaluation (4)
K-Fold Cross-Validation is a technique to assess how well a model performs by splitting the
data into k parts (folds):
1. Use (k-1) parts for training.
2. Use 1 part for testing.
3. Repeat this process k times, each time using a different fold for testing.
This ensures that every data point is used for both training and testing, providing a more
accurate measure of model performance.
Performance is often evaluated using a confusion matrix, which helps to visualize true
positive (correct positive predictions), true negative, false positive, and false negative
predictions. From this, we can calculate measures like accuracy, sensitivity (true positive
rate), and specificity (true negative rate).
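A minimal sketch of 5-fold cross-validation written out by hand, using a naive Bayes classifier from e1071 on the built-in iris data and ending with a confusion matrix (the model and data set are placeholders, not the course's exact example):

library(e1071)

set.seed(1)
k     <- 5
folds <- sample(rep(1:k, length.out = nrow(iris)))   # assign each row to a fold
pred  <- character(nrow(iris))

for (i in 1:k) {
  test_idx <- which(folds == i)
  fit <- naiveBayes(Species ~ ., data = iris[-test_idx, ])        # train on k-1 folds
  pred[test_idx] <- as.character(predict(fit, iris[test_idx, ]))  # test on the held-out fold
}

conf <- table(Predicted = pred, Actual = iris$Species)   # confusion matrix
conf
sum(diag(conf)) / sum(conf)                              # overall accuracy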
5. Case Studies in R
• Bayesian Classification for Student Admission: This uses a probabilistic model to
predict whether a student will be admitted based on features like GRE scores, GPA,
and rank.
• SVM for Iris Data Classification: The Support Vector Machine (SVM) is used to
classify the famous Iris dataset, which includes different types of iris flowers.
• Decision Tree for Reading Skill Classification: A decision tree is used to classify
reading skills based on factors like age, shoe size, and reading test scores.
• Hierarchical Clustering for US Arrests Data: This method groups US states based on
crime statistics (murder, assault, rape) into clusters of states with similar crime
patterns. The result is shown in a dendrogram (tree-like diagram).
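For the last case study, a rough sketch with R's built-in USArrests data might look like this (the exact course code may differ):

d  <- dist(scale(USArrests))        # standardize the crime statistics, then compute distances
hc <- hclust(d, method = "complete")
plot(hc, cex = 0.6)                 # dendrogram of the 50 states
cutree(hc, k = 4)                   # group the states into 4 clusters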