ML PR 4

Uploaded by Ghanshyam Dhomse

Department of Computer Engineering Course: Laboratory Practice-III

Coding Efficiency | Timely Completion | Answer | Viva | Total | Dated Sign of Subject Teacher
        5         |         5         |   5    |  5   |  20   |

Expected Date of Completion:...................... Actual Date of Completion:......................

---------------------------------------------------------------------------------------

Group B

Assignment No:4
---------------------------------------------------------------------------------------

Title of the Assignment: Implement the K-Nearest Neighbors algorithm on the diabetes.csv dataset. Compute the confusion matrix, accuracy, error rate, precision, and recall on the given dataset.

Dataset Description: This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective is to predict, based on diagnostic measurements, whether a patient has diabetes.

Link for Dataset: https://www.kaggle.com/datasets/abdallamahgoub/diabetes

Objective of the Assignment:

Students should be able to understand the concept of K-Nearest Neighbours and the Confusion Matrix.

Prerequisite:
1. Basic knowledge of Python
2. Concept of Confusion Matrix
3. Concept of K-Nearest Neighbour

Contents of the Theory:

1. K-Nearest Neighbour
2. Confusion Matrix
3. Scikit-learn

SNJB’s Late Sau. K.B. Jain College of Engineering Chandwad


1

K-Nearest Neighbour:

● K-Nearest Neighbour is one of the simplest Machine Learning algorithms, based on the Supervised Learning technique.

● The K-NN algorithm assumes the similarity between the new case/data and the available cases, and puts the new case into the category that is most similar to the available categories.

● The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can be easily classified into a well-suited category using the K-NN algorithm.

● The K-NN algorithm can be used for Regression as well as for Classification, but it is mostly used for Classification problems.

● K-NN is a non-parametric algorithm, which means it does not make any assumptions about the underlying data.

● It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and performs an action on it at the time of classification.

● At the training phase, the KNN algorithm just stores the dataset, and when it gets new data, it classifies that data into the category most similar to the new data.

● Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, and we want to know whether it is a cat or a dog. For this identification, we can use the KNN algorithm, as it works on a similarity measure. Our KNN model will find the features of the new image most similar to the cat and dog images and, based on the most similar features, will put it in either the cat or the dog category.


Why do we need a K-NN Algorithm?

Suppose there are two categories, Category A and Category B, and we have a new data point x1. In which of these categories will this data point lie? To solve this type of problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point.


How does K-NN work?

The working of K-NN can be explained on the basis of the below algorithm:

● Step-1: Select the number K of neighbors.

● Step-2: Calculate the Euclidean distance from the new data point to every point in the dataset.

● Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.

● Step-4: Among these K neighbors, count the number of data points in each category.

● Step-5: Assign the new data point to the category for which the number of neighbors is maximum.

● Step-6: Our model is ready.

Suppose we have a new data point and we need to put it in the required category.

● Firstly, we will choose the number of neighbors, so we will choose k=5.


● Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the distance between two points, which we have already studied in geometry. For two points (x1, y1) and (x2, y2), it can be calculated as:

d = sqrt((x2 - x1)^2 + (y2 - y1)^2)

● By calculating the Euclidean distance, we get the nearest neighbors: three nearest neighbors in Category A and two nearest neighbors in Category B. Since the majority of the 5 neighbors belong to Category A, the new data point is assigned to Category A.
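The six steps above can be sketched in plain Python. This is a minimal illustration with made-up toy points; the function and data names are ours, not part of the assignment code:

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, new_point, k=5):
    """Classify new_point by majority vote among its k nearest neighbors."""
    # Step 2: Euclidean distance from the new point to every training point
    distances = [
        (math.dist(p, new_point), label)
        for p, label in zip(train_points, train_labels)
    ]
    # Step 3: take the k nearest neighbors
    nearest = sorted(distances, key=lambda d: d[0])[:k]
    # Steps 4-5: count the categories among them and pick the majority
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data: three points of Category "A" near the new point, two of "B" far away
points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 7)]
labels = ["A", "A", "A", "B", "B"]
print(knn_predict(points, labels, (2, 2), k=5))  # prints "A"
```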


Confusion Matrix:

The confusion matrix is a matrix used to determine the performance of classification models for a given set of test data. It can only be determined if the true values of the test data are known. The matrix itself is easy to understand, but the related terminology may be confusing. Since it shows the errors in the model's performance in the form of a matrix, it is also known as an error matrix. Some features of the confusion matrix are given below:

● For a classifier with 2 prediction classes, the matrix is a 2×2 table; for 3 classes, it is a 3×3 table, and so on.

● The matrix is divided into two dimensions, which are the predicted values and the actual values, along with the total number of predictions.

● Predicted values are those values, which are predicted by the model, and
actual values are the true values for the given observations.
● It looks like the below table:

                     Actual: Yes        Actual: No
  Predicted: Yes     True Positive      False Positive
  Predicted: No      False Negative     True Negative

The above table has the following cases:


● True Negative: The model predicted No, and the real or actual value was also No.

● True Positive: The model predicted Yes, and the actual value was also Yes.

● False Negative: The model predicted No, but the actual value was Yes. It is also called a Type-II error.

● False Positive: The model predicted Yes, but the actual value was No. It is also called a Type-I error.
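As a quick illustration, scikit-learn's confusion_matrix can produce these four counts directly. The label lists below are made up for the example (1 = Yes, 0 = No):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical actual and predicted labels (1 = Yes, 0 = No)
y_actual    = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_predicted = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1]

# For binary labels (0, 1), sklearn lays the matrix out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_actual, y_predicted).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")  # TN=4 FP=1 FN=1 TP=4
```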

Need for the Confusion Matrix in Machine Learning

● It evaluates the performance of classification models when they make predictions on test data, and tells how good our classification model is.

● It tells not only the errors made by the classifier but also the type of error, i.e., whether it is a Type-I or a Type-II error.

● With the help of the confusion matrix, we can calculate the different parameters
for the model, such as accuracy, precision, etc.

Example: We can understand the confusion matrix using an example.

Suppose we are trying to create a model that can predict whether or not a person has a disease. The confusion matrix for this is given as (the cell counts are consistent with the totals listed below):

                     Actual: Yes   Actual: No
  Predicted: Yes          24            8
  Predicted: No            3           65

From the above example, we can conclude that:

● The table is given for a two-class classifier, which has two predictions, "Yes" and "No." Here, "Yes" means that the patient has the disease, and "No" means that the patient does not have the disease.


● The classifier has made a total of 100 predictions. Out of 100 predictions, 89 are correct predictions, and 11 are incorrect predictions.

● The model predicted "Yes" 32 times and "No" 68 times, whereas the actual "Yes" occurred 27 times and the actual "No" 73 times.
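As a sanity check, the four cells of this matrix can be recovered from the totals above with a little arithmetic (a sketch; the variable names are ours):

```python
# Totals stated in the example
total, correct = 100, 89
predicted_yes, actual_yes = 32, 27

# The cells satisfy: TP + FP = predicted_yes, TP + FN = actual_yes, TP + TN = correct
incorrect = total - correct                          # FP + FN = 11
fp = (predicted_yes - actual_yes + incorrect) // 2   # since (TP+FP) - (TP+FN) = FP - FN
fn = incorrect - fp
tp = predicted_yes - fp
tn = correct - tp
print(tp, tn, fp, fn)  # 24 65 8 3
```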

Calculations using Confusion Matrix:

We can perform various calculations for the model, such as the model's accuracy, using
this matrix. These calculations are given below:

● Classification Accuracy: It is one of the important parameters for classification problems. It defines how often the model predicts the correct output. It can be calculated as the ratio of the number of correct predictions made by the classifier to the total number of predictions:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

● Misclassification rate: Also termed the Error rate, it defines how often the model gives wrong predictions. It can be calculated as the ratio of the number of incorrect predictions to the total number of predictions:

Error rate = (FP + FN) / (TP + TN + FP + FN)

● Precision: Out of all the instances the model predicted as positive, precision measures how many were actually positive. It can be calculated using the below formula:

Precision = TP / (TP + FP)

● Recall: Out of all the actual positive instances, recall measures how many our model predicted correctly. The recall should be as high as possible:

Recall = TP / (TP + FN)


● F-measure: If one model has low precision and high recall, or vice versa, it is difficult to compare it against another. For this purpose, we can use the F-score, which helps us evaluate recall and precision at the same time. The F-score is maximum when the recall is equal to the precision. It can be calculated using the below formula:

F-measure = (2 × Recall × Precision) / (Recall + Precision)
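All of the formulas above can be computed from the four cell counts. A small sketch (the helper function is ours), using the counts from the disease example:

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the ratios described above from the four confusion-matrix cells."""
    total = tp + tn + fp + fn
    accuracy  = (tp + tn) / total
    error     = (fp + fn) / total
    precision = tp / (tp + fp)
    recall    = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, error, precision, recall, f_measure

# Counts from the disease example: TP=24, TN=65, FP=8, FN=3
acc, err, prec, rec, f1 = classification_metrics(24, 65, 8, 3)
print(f"accuracy={acc:.2f} error={err:.2f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
# accuracy=0.89 error=0.11 precision=0.75 recall=0.89 f1=0.81
```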

Other important terms used with the Confusion Matrix:

● Null Error rate: It defines how often our model would be incorrect if it always
predicted the majority class. As per the accuracy paradox, it is said that "the
best classifier has a higher error rate than the null error rate."

● ROC Curve: The ROC is a graph displaying a classifier's performance for all possible thresholds. The graph is plotted with the True Positive rate on the Y-axis and the False Positive rate on the X-axis.

Scikit Learn:

The scikit-learn project first began as scikits.learn, a Google Summer of Code project by French research scientist David Cournapeau. Its name refers to the idea that it is a "SciKit" (SciPy Toolkit): an independently developed and distributed extension to SciPy. Later, other programmers rewrote the core codebase.

In 2010, the French Institute for Research in Computer Science and Automation (INRIA) at Rocquencourt, France, led the work under the direction of Alexandre Gramfort, Gael Varoquaux, Vincent Michel, and Fabian Pedregosa, and the institution issued the project's first public release on February 1st of that year. In November 2012, scikit-learn and scikit-image were cited as examples of scikits that were "well-maintained and popular". Today, scikit-learn is one of the most widely used Python machine learning packages on GitHub.

Steps to Build a Model in Sklearn

Let us now learn the modelling process.

Step 1: Loading a Dataset


Simply put, a dataset is a collection of sample data points. A dataset typically consists of two primary parts:

Features: Features are essentially the variables in our dataset, often called predictors, data inputs, or attributes. Since there may be many of them, they are represented by a feature matrix, frequently denoted by the letter "X". The term "feature names" refers to the list of names of all the features.

Response: (sometimes referred to as the target, label, or output) This is the variable predicted from the feature variables. In most cases, we have a single response column, represented by a response vector (the letter "y" is frequently used to denote it). "Target names" refers to all the possible values a response vector can take.
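A minimal sketch of this step, assuming the Kaggle diabetes.csv with an "Outcome" label column. A tiny inline sample stands in for the file here so the snippet is self-contained; in the real assignment you would call pd.read_csv("diabetes.csv") directly:

```python
import io
import pandas as pd

# Tiny inline sample standing in for diabetes.csv (same column style as the dataset)
sample = io.StringIO(
    "Glucose,BMI,Age,Outcome\n"
    "148,33.6,50,1\n"
    "85,26.6,31,0\n"
    "183,23.3,32,1\n"
)
df = pd.read_csv(sample)

X = df.drop(columns=["Outcome"])   # feature matrix, conventionally "X"
y = df["Outcome"]                  # response vector, conventionally "y"
print(X.shape, y.shape)  # (3, 3) (3,)
```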

Step 2: Splitting the Dataset

The correctness of each machine learning model is a crucial consideration. One may train a model with part of the provided dataset and then use it to predict the target values for the remaining part, in order to ascertain the model's correctness.

To sum it up:

● Split the given dataset into a training dataset and a testing dataset.
● Train the model on the training set.
● Test the model using the testing dataset and assess its performance.
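scikit-learn's train_test_split performs this split. A minimal sketch with toy lists standing in for the diabetes features and labels:

```python
from sklearn.model_selection import train_test_split

# Toy feature/label lists standing in for the diabetes data
X = [[i] for i in range(10)]
y = [0, 1] * 5

# Hold out 20% of the rows as the testing dataset; the rest trains the model
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))  # 8 2
```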

Step 3: Training the Model

It is time to use the training dataset to train the model, which will then make predictions. Scikit-learn offers a variety of machine learning techniques with an easy-to-use interface for fitting, predicting, and so on.

Our classifier must now be tested using the testing dataset. For this, we can use the .predict() method of the model class, which returns the predicted values.

By comparing the actual values of the testing dataset with the predicted values, we can assess the model's performance using sklearn methods. The accuracy_score function of the metrics module is used for this.
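Putting the evaluation together, the sketch below trains a small classifier and computes the quantities the assignment asks for (toy data stand in for the diabetes split):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import (
    accuracy_score, confusion_matrix, precision_score, recall_score,
)

# Toy stand-ins for the diabetes train/test split
X_train = [[0], [1], [2], [8], [9], [10]]
y_train = [0, 0, 0, 1, 1, 1]
X_test, y_test = [[1], [9], [2], [8]], [0, 1, 0, 1]

knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Compare predicted values against the actual test labels
print("confusion matrix:\n", confusion_matrix(y_test, y_pred))
print("accuracy :", accuracy_score(y_test, y_pred))
print("error    :", 1 - accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
```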

Conclusion:

In this way, we have successfully implemented the K-Nearest Neighbors algorithm on the diabetes dataset and evaluated it using the confusion matrix, accuracy, error rate, precision, and recall.

Assignment Questions:

1) Explain the Confusion Matrix.
2) Explain Scikit-learn.
3) Explain the need for a Confusion Matrix.
4) What ratios can be calculated from the Confusion Matrix?
