Clustering
Let’s learn something!
Python and Spark
● We’ve seen how to deal with labeled
data, but what about unlabeled data?
● Often you’ll find yourself trying to create
groups from data, instead of trying to
predict classes or values.
Python and Spark
● This sort of problem is known as clustering; you can think of it as an attempt to create labels.
● You input some unlabeled data, and the unsupervised learning algorithm returns possible clusters of the data.
Python and Spark
● This means you have data that only
contains features and you want to see if
there are patterns in the data that would
allow you to create groups or clusters.
Python and Spark
● This is a key distinction from our previous
supervised learning tasks, where we
had historical labeled data.
● Now we will have unlabeled data, and attempt to “discover” possible labels through clustering.
Python and Spark
● By the nature of this problem, it can be
difficult to evaluate the groups or
clusters for “correctness”.
● A large part of being able to interpret the
clusters assigned comes down to
domain knowledge!
Python and Spark
● Maybe you have some customer data, and you cluster the customers into distinct groups.
● It will be up to you to decide what the
groups actually represent.
● Sometimes this is easy, sometimes it’s
really hard!
Python and Spark
● For example, you could cluster tumors into two groups, hoping to separate benign from malignant.
● But there is no guarantee that the clusters will fall along those lines; the algorithm will simply split the data into the two most separable groups.
Python and Spark
● Also depending on the clustering
algorithm, it may be up to you to decide
beforehand how many clusters you expect
to create!
Python and Spark
● A lot of clustering problems have no 100% correct approach or answer; that is the nature of unsupervised learning!
● Let’s continue by discussing K-means
clustering.
Reading Assignment
Chapter 10 of
Introduction to Statistical Learning
By Gareth James, et al.
K-Means Clustering
K-Means Clustering is an unsupervised learning algorithm that attempts to group similar data points into clusters.
So what does a typical clustering problem look like?
● Cluster Similar Documents
● Cluster Customers based on Features
● Market Segmentation
● Identify similar physical groups
K-Means Clustering
● The overall goal is to divide data into distinct
groups such that observations within each group
are similar
K-Means Clustering
The K-Means Algorithm
● Choose a number of Clusters “K”
● Randomly assign each point to a cluster
● Until clusters stop changing, repeat the following:
○ For each cluster, compute the cluster centroid
by taking the mean vector of points in the
cluster
○ Assign each data point to the cluster for which
the centroid is the closest
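To make those steps concrete, here is a minimal NumPy sketch of the loop just described (a plain-Python illustration for intuition, not Spark’s implementation; it ignores edge cases such as a cluster becoming empty):

```python
import numpy as np

def kmeans(points, k, seed=42):
    rng = np.random.default_rng(seed)
    # Steps 1-2: choose k and randomly assign each point to a cluster
    labels = rng.integers(0, k, size=len(points))
    while True:
        # Compute each cluster's centroid: the mean vector of its points
        centroids = np.array([points[labels == i].mean(axis=0)
                              for i in range(k)])
        # Reassign every point to the cluster with the closest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :],
                               axis=2)
        new_labels = dists.argmin(axis=1)
        # Stop once the cluster assignments no longer change
        if np.array_equal(new_labels, labels):
            return labels, centroids
        labels = new_labels
```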
K-Means Clustering
Choosing a K Value
Choosing a K Value
● There is no easy answer for choosing a “best” K
value
● One way is the elbow method
First of all, compute the sum of squared errors (SSE) for some values of k (for example 2, 4, 6, 8, etc.).
The SSE is defined as the sum of the squared distances between each member of a cluster and that cluster's centroid.
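In symbols (a standard way to write it, with \(\mu_i\) denoting the centroid of cluster \(C_i\)):

```latex
\mathrm{SSE} = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2
```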
Choosing a K Value
If you plot k against the SSE, you will see that the error decreases as k gets larger; this is because as the number of clusters increases, the clusters get smaller, so the distortion within each cluster is also smaller.
The idea of the elbow method is to choose the k at which the SSE stops decreasing sharply. This produces an "elbow" in the plot of SSE versus k.
Choosing a K Value
● PySpark itself doesn’t include a plotting mechanism, but you can collect() results to the driver and plot them with matplotlib or other visualization libraries, as in the sketch below.
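Here is one hedged sketch of that workflow, assuming a DataFrame final_data with a 'features' vector column built earlier with VectorAssembler (newer Spark versions expose the cost as model.summary.trainingCost; Spark 2.x used model.computeCost() instead):

```python
from pyspark.ml.clustering import KMeans
import matplotlib.pyplot as plt

ks = list(range(2, 10))
sse = []
for k in ks:
    model = KMeans(featuresCol='features', k=k, seed=1).fit(final_data)
    # Within-cluster sum of squared errors for this k
    # (on Spark 2.x: model.computeCost(final_data))
    sse.append(model.summary.trainingCost)

# The costs are plain Python floats, so they can be plotted locally
plt.plot(ks, sse, marker='o')
plt.xlabel('k')
plt.ylabel('SSE')
plt.show()
```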
Choosing a K Value
● But don’t take this as a strict rule when choosing a K value!
● A lot depends on the context of the exact situation (domain knowledge).
● We’ll try our best to get a feel for this with the examples and consulting projects!
K-Means Clustering
Documentation
Example
Let’s learn something!
Python and Spark
● Let’s work through the documentation
example for clustering.
● Pay close attention to how we don’t need the label column (which makes sense, since clustering is unsupervised).
Python and Spark
● The documentation’s example is a bit peculiar in its choice of data set, but we’ll explain it along the way.
● Hopefully our own custom code-along will clarify things further!
● Let’s get started!
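For reference, the documentation example runs roughly along these lines (the sample file ships with Spark in libsvm format, and the label it contains is simply ignored by KMeans):

```python
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName('cluster_docs').getOrCreate()

# Load the sample data bundled with Spark; KMeans only reads the
# 'features' column, so the file's label column goes unused
dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

kmeans = KMeans().setK(2).setSeed(1)
model = kmeans.fit(dataset)

# Inspect the fitted cluster centers and the per-row assignments
for center in model.clusterCenters():
    print(center)
model.transform(dataset).show()
```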
K-Means Clustering
Code Along
Python and Spark
● We’ll work through a real data set
containing some data on three distinct
seed types.
● Notebook: Clustering Code Along.ipynb
Python and Spark
● For certain Machine Learning algorithms, it is a good idea to scale your data.
● Model performance can drop when features are on very different scales or the data is high-dimensional, so we’ll practice scaling features using PySpark, as in the sketch below!
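A rough sketch of that scaling step (seeds_df and the input column names here are hypothetical stand-ins for the seed data’s actual columns):

```python
from pyspark.ml.feature import VectorAssembler, StandardScaler

# Combine the numeric columns into a single 'features' vector
assembler = VectorAssembler(
    inputCols=['area', 'perimeter', 'compactness'],  # hypothetical names
    outputCol='features')
assembled = assembler.transform(seeds_df)

# Rescale each feature to unit standard deviation
# (mean-centering is optional and off by default)
scaler = StandardScaler(inputCol='features', outputCol='scaledFeatures',
                        withStd=True, withMean=False)
final_data = scaler.fit(assembled).transform(assembled)
```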
Python and Spark
● Remember, there won’t be any
confusion matrix or classification test
results.
● This is unsupervised learning!
● Meaning we don’t have the original
labels to actually perform some sort of
test against!
Python and Spark
● This is a common point of confusion for beginners: you can’t easily check how well your clustering algorithm performed. This is the difficulty of all unsupervised tasks!
● Let’s get started!
K-Means Clustering
Consulting Project
Python and Spark
● You’re becoming world famous due to
your machine learning skills!
● A technology start-up in California needs
your help!
Python and Spark
● It’s time for you to go to San Francisco to help out a tech startup!
Python and Spark
● They’ve recently been hacked and need your help finding out about the hackers!
Python and Spark
● Luckily, their forensic engineers have grabbed valuable data about the hacks, including information like session time, locations, WPM typing speed, etc.
Python and Spark
● The forensic engineer relates what she has been able to figure out so far: she was able to grab metadata from each session the hackers used to connect to their servers.
● These are the features of the data...
Python and Spark
● 'Session_Connection_Time': How long the session lasted, in minutes
● 'Bytes Transferred': Number of MB transferred during the session
● 'Kali_Trace_Used': Indicates whether the hacker was using Kali Linux
● 'Servers_Corrupted': Number of servers corrupted during the attack
● 'Pages_Corrupted': Number of pages illegally accessed
● 'Location': Location the attack came from (probably useless, because the hackers used VPNs)
● 'WPM_Typing_Speed': Their estimated typing speed based on session logs
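Assembling those columns into a feature vector might look like this sketch (hack_df is a hypothetical name for the loaded DataFrame):

```python
from pyspark.ml.feature import VectorAssembler

# 'Location' is left out: the VPNs likely make it uninformative
feat_cols = ['Session_Connection_Time', 'Bytes Transferred',
             'Kali_Trace_Used', 'Servers_Corrupted',
             'Pages_Corrupted', 'WPM_Typing_Speed']

assembler = VectorAssembler(inputCols=feat_cols, outputCol='features')
final_data = assembler.transform(hack_df)
```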
Python and Spark
● The technology firm has 3 potential hackers who may have perpetrated the attack.
● They are certain about the first two hackers, but they aren’t sure whether the third hacker was involved or not.
● They have requested your help!
Python and Spark
● Can you help figure out whether or not the third suspect had anything to do with the attacks, or whether it was just two hackers?
● It's probably not possible to know for
sure, but maybe what you've just learned
about Clustering can help!
Python and Spark
● One last key fact: the forensic engineer knows that the hackers trade off attacks.
● Meaning they should each have roughly the same number of attacks.
Python and Spark
● For example, if there were 100 total attacks, then in a two-hacker situation each should have about 50 attacks, while in a three-hacker situation each would have about 33 attacks.
Python and Spark
● The engineer believes this is the key element to solving the problem, but doesn’t know how to separate this unlabeled data into groups of hackers.
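One way to act on that hint, sketched below: fit K-Means with k = 2 and k = 3 and compare the cluster sizes (this assumes the assembled, and ideally scaled, final_data from earlier):

```python
from pyspark.ml.clustering import KMeans

for k in (2, 3):
    model = KMeans(featuresCol='features', k=k, seed=1).fit(final_data)
    print(f'k = {k}:')
    # If the hackers really trade off attacks evenly, the correct k
    # should produce clusters of roughly equal size
    model.transform(final_data).groupBy('prediction').count().show()
```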
Python and Spark
● Best of luck with this project, it should be
a fun one!
● If you get stuck, feel free to go straight to
the solution lecture.
● Enjoy!
K-Means Clustering
Consulting Project
Solutions