
Assignment No: 10

Data Analytics III

1. Implement the Simple Naïve Bayes classification algorithm using Python/R on iris.csv.

2. Compute the confusion matrix to find TP, FP, TN, FN, Accuracy, Error rate, Precision, and Recall on the given dataset.

About the Dataset


The Iris dataset consists of 150 samples of iris flowers, each belonging to one of
three species: Setosa, Versicolor, and Virginica. For each sample, four features
were measured:

Sepal length in centimeters.
Sepal width in centimeters.
Petal length in centimeters.
Petal width in centimeters.

When loaded with scikit-learn's load_iris(), the returned object exposes the following attributes:

data : A 2D array containing the features of the dataset (150 samples x 4 features).
target : A 1D array containing the target variable (the species of each sample
encoded as integers: 0 for Setosa, 1 for Versicolor, and 2 for Virginica).
target_names : An array containing the names of the target classes (Setosa,
Versicolor, Virginica).
feature_names : An array containing the names of the features (Sepal Length,
Sepal Width, Petal Length, Petal Width).
DESCR : A description of the dataset.
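
A minimal sketch (assuming scikit-learn is installed) showing how these attributes can be accessed on the object returned by load_iris():

from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)      # (150, 4) -- feature matrix
print(iris.target.shape)    # (150,)   -- integer labels 0/1/2
print(iris.target_names)    # ['setosa' 'versicolor' 'virginica']
print(iris.feature_names)   # ['sepal length (cm)', ...]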

Step 1: Import all Necessary Libraries

In [40]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

Step 2: Load the dataset & Start Preprocessing

In [57]:
iris = load_iris()
data = pd.DataFrame(iris.data)
data['class'] = iris.target

# rename columns
data.columns = ['sepal_len','sepal_wid','petal_len','petal_wid','class']
data.head()

Out[57]:    sepal_len  sepal_wid  petal_len  petal_wid  class

0 5.1 3.5 1.4 0.2 0

1 4.9 3.0 1.4 0.2 0

2 4.7 3.2 1.3 0.2 0

3 4.6 3.1 1.5 0.2 0

4 5.0 3.6 1.4 0.2 0

In [58]:
data.describe()

Out[58]: sepal_len sepal_wid petal_len petal_wid class

count 150.000000 150.000000 150.000000 150.000000 150.000000

mean 5.843333 3.057333 3.758000 1.199333 1.000000

std 0.828066 0.435866 1.765298 0.762238 0.819232

min 4.300000 2.000000 1.000000 0.100000 0.000000

25% 5.100000 2.800000 1.600000 0.300000 0.000000

50% 5.800000 3.000000 4.350000 1.300000 1.000000

75% 6.400000 3.300000 5.100000 1.800000 2.000000

max 7.900000 4.400000 6.900000 2.500000 2.000000

In [59]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 sepal_len 150 non-null float64
1 sepal_wid 150 non-null float64
2 petal_len 150 non-null float64
3 petal_wid 150 non-null float64
4 class 150 non-null int64
dtypes: float64(4), int64(1)
memory usage: 6.0 KB

In [60]:
#check null
data.isnull().sum()

Out[60]: sepal_len 0
sepal_wid 0
petal_len 0
petal_wid 0
class 0
dtype: int64

Model Training

Step 3: Split the dataset into features and target variable

In [62]:
X = data[['sepal_len','sepal_wid','petal_len','petal_wid']]  # feature columns
Y = data['class']  # target: species label

Step 4: Split the data into training and testing sets

In [46]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=42)

train_test_split() is a function from scikit-learn that is commonly used to split a dataset into two subsets: one for training a machine learning model and one for testing its performance.

X, Y : These are the input features (X) and target variable (Y) that you want to split into training and testing sets.
test_size=0.3 : The proportion of the dataset to include in the test split. Here it is set to 0.3, meaning 30% of the data is used for testing and 70% for training.
random_state=42 : This parameter sets the random seed for reproducibility. Setting a specific seed ensures that the data is split the same way each time the code is run, which is useful for obtaining consistent results. In this case, the seed is set to 42.
X_train, X_test, y_train, y_test : These are the resulting subsets of the data. X_train and y_train contain the input features and target variable for the training set, while X_test and y_test contain them for the testing set.
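
As a quick sanity check on the completed call above (a sketch assuming test_size=0.3 on the 150-sample dataset), the subset sizes can be verified:

# 150 samples split 70/30 -> 105 training rows and 45 test rows
print(X_train.shape, X_test.shape)   # (105, 4) (45, 4)
print(y_train.shape, y_test.shape)   # (105,) (45,)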

Step 5: Train the Naïve Bayes classifier

In [63]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

classifier = GaussianNB()

classifier.fit(X_train, y_train)

Out[63]: GaussianNB()
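
For intuition: GaussianNB fits a per-feature mean and variance for each class and scores new samples with Bayes' rule under a feature-independence assumption. The sketch below illustrates that idea for class 0 only; it is an illustration using a hypothetical helper gaussian_log_likelihood, not scikit-learn's exact implementation (sklearn additionally applies variance smoothing).

import numpy as np

def gaussian_log_likelihood(x, mean, var):
    # log of the univariate normal density, summed over independent features
    return np.sum(-0.5 * np.log(2 * np.pi * var) - (x - mean) ** 2 / (2 * var))

X0 = X_train[y_train == 0]                       # training rows of class 0
mean0, var0 = X0.mean().values, X0.var().values  # per-feature estimates
prior0 = len(X0) / len(X_train)                  # class prior P(class 0)

x = X_test.iloc[0].values
score0 = np.log(prior0) + gaussian_log_likelihood(x, mean0, var0)
# Computing the analogous score for each class and taking the largest
# reproduces the classifier's predicted label for x.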

In [65]:
from sklearn import metrics

y_pred = classifier.predict(X_test)

print("Accuracy Score: ", metrics.accuracy_score(y_test, y_pred)*100)


Accuracy Score: 97.77777777777777

In [70]:
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

cm = confusion_matrix(y_test, y_pred)

# Get number of classes
num_classes = cm.shape[0]

precision = []

# Calculate precision for all classes
for i in range(num_classes):
correct_precision = cm[i][i]
total_predicted_positives = sum(cm[:, i])
precision.append(correct_precision / total_predicted_positives)

print("Precision for each class: ", precision)

# Calculate recall for all classes
recall = []

for i in range(num_classes):
correct_predictions = cm[i][i]
total_actual_positive = sum(cm[i, :])
recall.append(correct_predictions / total_actual_positive)

print("Recall for each class: ", recall)

print("\nOverall")
conf_matrix = confusion_matrix(y_test, y_pred)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='macro')
recall = recall_score(y_test, y_pred, average='macro')
error_rate = 1 - accuracy
print("Confusion Matrix:\n", conf_matrix)
print("Accuracy:", accuracy)
print("Error Rate:", error_rate)
print("Precision:", precision)
print("Recall:", recall)

Precision for each class: [1.0, 1.0, 0.9285714285714286]
Recall for each class: [1.0, 0.9230769230769231, 1.0]

Overall
Confusion Matrix:
[[19 0 0]
[ 0 12 1]
[ 0 0 13]]
Accuracy: 0.9777777777777777
Error Rate: 0.022222222222222254
Precision: 0.9761904761904763
Recall: 0.9743589743589745
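
Part 2 of the assignment asks for TP, FP, TN, and FN explicitly. For a multi-class confusion matrix these are defined per class. A minimal sketch deriving them from the cm array computed above:

# Per-class TP/FP/FN/TN from the 3x3 confusion matrix cm
for i in range(num_classes):
    TP = cm[i, i]                    # class i correctly predicted as i
    FP = cm[:, i].sum() - TP         # other classes predicted as i
    FN = cm[i, :].sum() - TP         # class i predicted as another class
    TN = cm.sum() - TP - FP - FN     # everything not involving class i
    print(f"Class {i}: TP={TP}, FP={FP}, FN={FN}, TN={TN}")

For class 1, for example, this gives TP=12, FP=0, FN=1, TN=32, consistent with the matrix above.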

In [71]:
# f1 score
from sklearn.metrics import f1_score
f1 = f1_score(y_test, y_pred, average='macro')
print("F1 Score: ", f1)

F1 Score: 0.974320987654321
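
The macro F1 above is the unweighted mean of the per-class F1 scores, where each class's F1 is the harmonic mean of its precision and recall: F1 = 2PR/(P+R). A quick cross-check, recomputed from the confusion matrix (a sketch reusing cm and num_classes from above):

# Macro F1 = mean of per-class F1, where F1_i = 2*P_i*R_i / (P_i + R_i)
per_class_f1 = []
for i in range(num_classes):
    p = cm[i, i] / cm[:, i].sum()    # per-class precision
    r = cm[i, i] / cm[i, :].sum()    # per-class recall
    per_class_f1.append(2 * p * r / (p + r))

print(sum(per_class_f1) / num_classes)   # ~0.9743, matches the macro F1 above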

In [72]:
# Classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       1.00      0.92      0.96        13
           2       0.93      1.00      0.96        13

    accuracy                           0.98        45
   macro avg       0.98      0.97      0.97        45
weighted avg       0.98      0.98      0.98        45
