Assignment No: 10
Data Analytics III
1. Implement the Simple Naïve Bayes classification algorithm using Python/R on
iris.csv
2. Compute the confusion matrix to find TP, FP, TN, FN, Accuracy, Error rate,
Precision, and Recall on the given dataset.
About the Dataset
The Iris dataset consists of 150 samples of iris flowers, each belonging to one of
three species: Setosa, Versicolor, and Virginica. For each sample, four features
were measured:
Sepal length in centimeters.
Sepal width in centimeters.
Petal length in centimeters.
Petal width in centimeters.
When loaded with scikit-learn's load_iris(), the dataset is returned as a Bunch object with the following attributes:
data : A 2D array containing the features of the dataset (150 samples x 4
features).
target : A 1D array containing the target variable (the species of each sample
encoded as integers: 0 for Setosa, 1 for Versicolor, and 2 for Virginica).
target_names : An array containing the names of the target classes (Setosa,
Versicolor, Virginica).
feature_names : An array containing the names of the features (Sepal Length,
Sepal Width, Petal Length, Petal Width).
DESCR : A description of the dataset.
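As a quick illustration (a minimal sketch, not part of the original notebook), the attributes listed above can be inspected directly after loading the dataset with scikit-learn:

from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)       # (150, 4) feature matrix
print(iris.target.shape)     # (150,) species labels encoded as 0, 1, 2
print(iris.target_names)     # ['setosa' 'versicolor' 'virginica']
print(iris.feature_names)    # ['sepal length (cm)', 'sepal width (cm)', ...]
print(iris.DESCR[:200])      # first part of the dataset description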
Step 1: Import all Necessary Libraries
In [40]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
Step 2: Load the dataset & Start Preprocessing
In [57]:
iris = load_iris()
data = pd.DataFrame(iris.data)
data['class'] = iris.target
# rename columns
data.columns = ['sepal_len', 'sepal_wid', 'petal_len', 'petal_wid', 'class']
data.head()
Out[57]:
   sepal_len  sepal_wid  petal_len  petal_wid  class
0        5.1        3.5        1.4        0.2      0
1        4.9        3.0        1.4        0.2      0
2        4.7        3.2        1.3        0.2      0
3        4.6        3.1        1.5        0.2      0
4        5.0        3.6        1.4        0.2      0
In [58]:
data.describe()
Out[58]: sepal_len sepal_wid petal_len petal_wid class
count 150.000000 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.057333 3.758000 1.199333 1.000000
std 0.828066 0.435866 1.765298 0.762238 0.819232
min 4.300000 2.000000 1.000000 0.100000 0.000000
25% 5.100000 2.800000 1.600000 0.300000 0.000000
50% 5.800000 3.000000 4.350000 1.300000 1.000000
75% 6.400000 3.300000 5.100000 1.800000 2.000000
max 7.900000 4.400000 6.900000 2.500000 2.000000
In [59]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 sepal_len 150 non-null float64
1 sepal_wid 150 non-null float64
2 petal_len 150 non-null float64
3 petal_wid 150 non-null float64
4 class 150 non-null int64
dtypes: float64(4), int64(1)
memory usage: 6.0 KB
In [60]:
#check null
data.isnull().sum()
Out[60]: sepal_len 0
sepal_wid 0
petal_len 0
petal_wid 0
class 0
dtype: int64
Model Training
Step 3: Split the dataset into features and target variable
In [62]:
X = data[['sepal_len','sepal_wid','petal_len','petal_wid']] # select the feature columns
Y = data['class'] # select the target variable (species)
Step 4: Split the data into training and testing sets
In [46]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=42)
train_test_split() is a function from scikit-learn that is commonly used to
split a dataset into two subsets:
one for training a machine learning model
the other for testing its performance
X, Y : These are the input features (X) and target variable (Y) that you want to
split into training and testing sets.
test_size=0.3 : The proportion of the dataset to include in the test split. Here it is set to 0.3,
meaning 30% of the samples are used for testing and 70% for training.
random_state=42 : This parameter is used to set the random seed for
reproducibility. Setting a specific random seed ensures that the data is split in the
same way each time the code is run, which is useful for obtaining consistent
results. In this case, the random seed is set to 42.
X_train, X_test, y_train, y_test :These are the resulting subsets of the
data. X_train and y_train contain the input features and target variable for the
training set, while X_test and y_test contain the input features and target variable
for the testing set.
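As a small sanity check (a sketch reusing X and Y from the cells above, not part of the original notebook), the sizes of the resulting splits can be verified; passing stratify=Y is an optional extra that keeps the species proportions equal in both subsets, though it would change the exact split shown here:

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=42)
print(X_train.shape, X_test.shape)   # (105, 4) (45, 4)
print(y_train.shape, y_test.shape)   # (105,) (45,)
print(y_test.value_counts())         # how many samples of each species land in the test set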
Step 5: Train the Naïve Bayes classifier
In [63]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
classifier = GaussianNB()
classifier.fit(X_train, y_train)
Out[63]: GaussianNB()
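GaussianNB models each feature within each class as a normal distribution. After fitting, the per-class means and variances it estimated can be inspected (a minimal sketch; theta_ and var_ are the attribute names in scikit-learn 1.0+, older versions expose sigma_ instead of var_):

print(classifier.classes_)       # array([0, 1, 2])
print(classifier.class_prior_)   # estimated prior probability of each class
print(classifier.theta_)         # shape (3, 4): mean of each feature per class
print(classifier.var_)           # shape (3, 4): variance of each feature per class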
In [65]:
from sklearn import metrics
y_pred = classifier.predict(X_test)
print("Accuracy Score: ", metrics.accuracy_score(y_test, y_pred)*100)
Accuracy Score: 97.77777777777777
In [70]:
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score
cm = confusion_matrix(y_test, y_pred)
# Get Number of classes
num_classes = cm.shape[0]
precision = []
# Calculate precision for all classes
for i in range(num_classes):
    correct_precision = cm[i][i]
    total_predicted_positives = sum(cm[:, i])
    precision.append(correct_precision / total_predicted_positives)
print("Precision for each class: ", precision)
# Calculate recall for all classes
recall = []
for i in range(num_classes):
    correct_predictions = cm[i][i]
    total_actual_positive = sum(cm[i, :])
    recall.append(correct_predictions / total_actual_positive)
print("Recall for each class: ", recall)
print("\nOverall")
conf_matrix = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='macro')
recall = recall_score(y_test, y_pred, average='macro')
error_rate = 1 - accuracy
print("Confusion Matrix:\n", conf_matrix)
print("Accuracy:", accuracy)
print("Error Rate:", error_rate)
print("Precision:", precision)
print("Recall:", recall)
Precision for each class: [1.0, 1.0, 0.9285714285714286]
Recall for each class: [1.0, 0.9230769230769231, 1.0]
Overall
Confusion Matrix:
[[19 0 0]
[ 0 12 1]
[ 0 0 13]]
Accuracy: 0.9777777777777777
Error Rate: 0.022222222222222254
Precision: 0.9761904761904763
Recall: 0.9743589743589745
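The assignment also asks for TP, FP, TN, and FN explicitly. In a multi-class problem these are defined per class (one-vs-rest); a short sketch deriving them from the confusion matrix cm computed above (not part of the original notebook):

for i in range(num_classes):
    TP = cm[i, i]                    # samples of class i correctly predicted as class i
    FP = cm[:, i].sum() - TP         # samples of other classes predicted as class i
    FN = cm[i, :].sum() - TP         # samples of class i predicted as some other class
    TN = cm.sum() - TP - FP - FN     # everything else
    print(f"Class {i}: TP={TP}, FP={FP}, FN={FN}, TN={TN}")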
In [71]:
# f1 score
from sklearn.metrics import f1_score
f1 = f1_score(y_test, y_pred, average='macro')
print("F1 Score: ", f1)
F1 Score: 0.974320987654321
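The macro F1 score is the unweighted mean of the per-class F1 scores, where each class's F1 is the harmonic mean of its precision and recall. A quick cross-check (a sketch reusing y_test and y_pred from above):

from sklearn.metrics import precision_score, recall_score

p = precision_score(y_test, y_pred, average=None)   # per-class precision
r = recall_score(y_test, y_pred, average=None)      # per-class recall
f1_per_class = 2 * p * r / (p + r)                  # harmonic mean for each class
print(f1_per_class.mean())                          # should match the macro F1 above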
In [72]:
# Classification report
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       1.00      0.92      0.96        13
           2       0.93      1.00      0.96        13

    accuracy                           0.98        45
   macro avg       0.98      0.97      0.97        45
weighted avg       0.98      0.98      0.98        45