A216 - DWM - Lab No. 9

Experiment No. 9

Name: Kratik Paliwal
Roll No: A216

Part A

Aim: Implement a program for the K-Means clustering algorithm using the WEKA tool.

Prerequisite: Database

Outcome: To impart knowledge of data mining and its techniques.

Theory:

Data Mining Definition


Data mining not only finds but also analyses the hidden patterns of data in a data warehouse. It aims at extracting knowledge from data warehouses, organizing data in a manner that derives its inherent meaning and contributes it to the knowledge base. The data from a database or a data warehouse is first sorted to prepare the target data and then analysed to find the structure, correlations and meaning it contains. Data mining has immense applications in the finance, healthcare and intelligence industries.

K-means Algorithm:
Part B

Code:
Output:(Paste screen shot of Output)
Conclusion:

Clustering is a powerful technique for discovering patterns and similarities in data. By grouping similar data points together, clustering can help in various tasks such as customer segmentation, anomaly detection, and pattern recognition. However, it is important to choose the right clustering algorithm, similarity measure, and number of clusters to ensure that the results are meaningful and useful.

Questions:
Q1. Explain the K-Medoids algorithm.

1. Choose k number of random points from the data and assign these k points to
k number of clusters. These are the initial medoids.

2. For all the remaining data points, calculate the distance from each medoid and
assign it to the cluster with the nearest medoid.

3. Calculate the total cost (Sum of all the distances from all the data points to the
medoids)

4. Select a random point as the new medoid and swap it with the previous
medoid. Repeat 2 and 3 steps.

5. If the total cost of the new medoid is less than that of the previous medoid,
make the new medoid permanent and repeat step 4.

6. If the total cost of the new medoid is greater than the cost of the previous
medoid, undo the swap and repeat step 4.

7. Continue the repetitions until no change in the medoids is encountered; the final medoids then classify the data points.
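The steps above can be sketched as a minimal PAM-style implementation in plain Python. This is an illustrative sketch, not the WEKA implementation: 1-D points and absolute difference as the distance measure are assumptions made for brevity.

```python
import random

def total_cost(points, medoids):
    # Step 3: sum of distances from every point to its nearest medoid
    return sum(min(abs(p - m) for m in medoids) for p in points)

def k_medoids(points, k, seed=0):
    rng = random.Random(seed)
    medoids = rng.sample(points, k)      # step 1: k random initial medoids
    cost = total_cost(points, medoids)
    improved = True
    while improved:                      # step 7: repeat until no change
        improved = False
        for i in range(k):
            for p in points:
                if p in medoids:
                    continue
                # step 4: try swapping medoid i with a non-medoid point p
                candidate = medoids[:i] + [p] + medoids[i + 1:]
                new_cost = total_cost(points, candidate)
                if new_cost < cost:      # step 5: keep the cheaper medoid set
                    medoids, cost = candidate, new_cost
                    improved = True
                # step 6: otherwise the swap is simply not kept
    # step 2: final assignment of each point to its nearest medoid
    clusters = {m: [] for m in medoids}
    for p in points:
        nearest = min(medoids, key=lambda m: abs(p - m))
        clusters[nearest].append(p)
    return medoids, clusters
```

Because swaps are only kept when they lower the total cost, the loop is guaranteed to terminate: the cost strictly decreases with each accepted swap and there are finitely many medoid sets.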

In the K-Means algorithm, given the value of k and unlabelled data:

1. Choose k random points (data points from the data set or some other points). These points are also called "centroids" or "means".

2. Assign every data point in the data set to the closest centroid by applying a distance formula such as Euclidean distance or Manhattan distance.

3. Choose new centroids by calculating the mean of all the data points in each cluster, and go to step 2.

4. Continue step 3 until no data point changes its cluster between two iterations.
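As a rough illustration of these steps, here is a hedged sketch using only the Python standard library (again assuming 1-D points for simplicity; this is not WEKA's SimpleKMeans):

```python
import random

def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)       # step 1: k random initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # step 2: assign each point to its nearest centroid
        # (Euclidean distance in 1-D reduces to absolute difference)
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # step 3: recompute each centroid as the mean of its cluster
        new_centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:      # step 4: stop when nothing changes
            break
        centroids = new_centroids
    return centroids, clusters
```

For example, `k_means([1, 2, 3, 10, 11, 12], 2)` converges to centroids 2.0 and 11.0, splitting the data into the two obvious groups.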

A problem with the K-Means algorithm is its handling of outlier data. An outlier is a point that lies far from the rest of the points. Because every point must be assigned to some cluster, an outlier pulls the mean of its cluster toward itself, distorting the cluster centre. Hence, K-Means clustering is highly affected by outlier data.

Q2. Discuss Association Rule Mining. Also define Support and Confidence.
Association Rule Mining is a data mining technique used to identify relationships
between items in a dataset. It is often used in market basket analysis to discover
patterns in consumer purchasing behavior. The goal is to find sets of items that
frequently co-occur in transactions, indicating that they are likely to be bought
together.

Support and confidence are two important metrics used in association rule mining:

1. Support: The support of an itemset is the proportion of transactions in the dataset that contain that itemset. It measures the frequency of occurrence of an itemset in the dataset. Mathematically, it is defined as:

Support(A) = (Transactions containing A) / (Total transactions)

For example, if there are 100 transactions in a dataset and 50 of them contain item A, then the support of A is 50/100 = 0.5.

2. Confidence: The confidence of a rule A → B is the conditional probability that B occurs in a transaction given that A occurs in that transaction. It measures the strength of the association between A and B. Mathematically, it is defined as:

Confidence(A → B) = Support(A ∪ B) / Support(A)

For example, if the support of {A, B} is 0.3 and the support of A is 0.5, then the confidence of the rule A → B is 0.3 / 0.5 = 0.6.

Association rule mining typically involves finding rules that have both high support
and high confidence. High support indicates that the rule is applicable to a large
number of transactions, while high confidence indicates that the rule is reliable.
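The two metrics can be computed directly from a list of transactions. A small sketch (representing each transaction as a set of item names is an assumption made here for illustration):

```python
def support(transactions, itemset):
    # Support(A) = (transactions containing A) / (total transactions)
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    # Confidence(A -> B) = Support(A ∪ B) / Support(A)
    combined = set(antecedent) | set(consequent)
    return support(transactions, combined) / support(transactions, antecedent)
```

With 10 transactions of which 5 contain A and 3 contain both A and B, this reproduces the worked example above: Support(A) = 0.5, Support({A, B}) = 0.3, and Confidence(A → B) = 0.6.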
