
Practical_3

AIM: Write a program to implement the naïve Bayesian classifier for a sample training data set stored as a .CSV file. Compute the accuracy of the classifier, considering a few test data sets.

What is Naïve Bayes?

Naïve Bayes is a probabilistic machine learning algorithm based on Bayes' Theorem. It is called
"naïve" because it assumes that the features (or predictors) are conditionally independent of one
another given the class, which is often not true in real-world data. Despite this simplification,
Naïve Bayes performs well for certain tasks like text classification, spam filtering, sentiment
analysis, and recommendation systems.

Bayes' Theorem

The foundation of Naïve Bayes is Bayes' Theorem, which calculates the probability of a class C
given a set of features X:

P(C ∣ X) = [P(X ∣ C) · P(C)] / P(X)

where P(C ∣ X) is the posterior probability of the class given the features, P(X ∣ C) is the
likelihood, P(C) is the prior probability of the class, and P(X) is the evidence (the overall
probability of the features).

How Naïve Bayes Works

1. Calculate Prior Probabilities P(C)


The prior probability is simply the fraction of instances of class C in the dataset.

2. Calculate Likelihood P(X ∣ C)

The likelihood of a feature X given the class C is calculated differently for continuous
and categorical features:
○ For Categorical Features: Count the number of times the feature value X appears in class C
and divide it by the total number of instances in C.
○ For Continuous Features: Assume that the feature follows a Gaussian (normal)
distribution and calculate the likelihood using the following formula:

P(x ∣ C) = (1 / √(2π σ_C²)) · exp( −(x − μ_C)² / (2σ_C²) )

where μ_C and σ_C² are the mean and variance of the feature over the instances of class C.

3. Apply Bayes' Theorem

Use Bayes' theorem to compute the posterior probability for each class C.
Since P(X) is common to all classes, it can be ignored in classification.

4. Class Prediction
Predict the class for a given feature vector X = (x₁, …, xₙ) by selecting the class with the highest
posterior probability:

Ĉ = argmax over C of  P(C) · P(x₁ ∣ C) · … · P(xₙ ∣ C)

Types of Naïve Bayes Classifiers

1. Gaussian Naïve Bayes


○ Used when the features are continuous (like height, weight, or age).
○ Assumes the features follow a Gaussian (normal) distribution.
2. Multinomial Naïve Bayes
○ Used for discrete counts (like text data, bag-of-words representations).
○ Commonly used for text classification and spam detection.
3. Bernoulli Naïve Bayes
○ Used when features are binary (0 or 1, like True/False).
○ Also useful in text classification, where the presence or absence of a word in a
document is considered.
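
As a quick reference (assuming scikit-learn is installed), the three variants above map directly onto classes in sklearn.naive_bayes; this is only a sketch of how each is instantiated:

from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

gnb = GaussianNB()     # continuous features (height, weight, age, ...)
mnb = MultinomialNB()  # discrete counts (e.g. bag-of-words word counts)
bnb = BernoulliNB()    # binary features (e.g. word present / absent)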

Example Walkthrough

Problem: Classify whether an email is "Spam" or "Not Spam" based on features like "Has
Offer?", "Has Clickable Link?", "Has Known Sender?", etc.

Has Offer   Clickable Link   Known Sender   Spam?
Yes         Yes              No             Yes
No          Yes              Yes            No
Yes         No               No             Yes
No          No               Yes            No

Step 1: Calculate Prior Probabilities P(C)

From the table: P(Spam) = 2/4 = 0.5 and P(Not Spam) = 2/4 = 0.5

Step 2: Calculate Likelihood P(X∣C)


We calculate probabilities for each feature under each class.

Step 3: Apply Bayes' Theorem


Suppose we have a new email:
Has Offer = Yes, Clickable Link = No, Known Sender = No

1. Compute P(Spam ∣ X):
P(Spam ∣ X) ∝ P(Spam) · P(Offer = Yes ∣ Spam) · P(Link = No ∣ Spam) · P(Sender = No ∣ Spam)
= 0.5 · (2/2) · (1/2) · (2/2) = 0.25
2. Compute P(Not Spam ∣ X):
P(Not Spam ∣ X) ∝ 0.5 · (0/2) · (1/2) · (0/2) = 0

Result:
Since P(Spam ∣ X) > P(Not Spam ∣ X), we classify this email as Spam.
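
The same calculation can be reproduced in a few lines of Python. This is only an illustrative sketch of the toy table above (not part of the practical's required code), using plain counting with no smoothing:

rows = [
    # (Has Offer, Clickable Link, Known Sender, Spam?)
    ("Yes", "Yes", "No", "Yes"),
    ("No", "Yes", "Yes", "No"),
    ("Yes", "No", "No", "Yes"),
    ("No", "No", "Yes", "No"),
]
new_email = ("Yes", "No", "No")  # Has Offer, Clickable Link, Known Sender

scores = {}
for label in ("Yes", "No"):  # Spam? = Yes / No
    subset = [r for r in rows if r[3] == label]
    prior = len(subset) / len(rows)
    likelihood = 1.0
    for i, value in enumerate(new_email):
        likelihood *= sum(1 for r in subset if r[i] == value) / len(subset)
    scores[label] = prior * likelihood

print(scores)  # {'Yes': 0.25, 'No': 0.0}
print("Spam" if scores["Yes"] > scores["No"] else "Not Spam")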

Advantages of Naïve Bayes

● Fast and simple: Works well for large datasets.


● Robust in practice: Often performs reasonably well even when the independence assumption is violated.
● Effective for text data: Used for spam detection, sentiment analysis, and text
classification.

Limitations of Naïve Bayes

● Feature independence assumption: Assumes that features are independent, which is rarely
true.
● Zero probability problem: If a feature value doesn’t appear in the training set, it has zero
likelihood, making P(X∣C)= 0. To handle this, Laplace smoothing is used.
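
As a note on the fix: with Laplace (add-one) smoothing, a categorical likelihood is estimated as

P(x ∣ C) = (count(x, C) + 1) / (N_C + k)

where N_C is the number of training instances of class C and k is the number of distinct values the feature can take, so no probability is ever exactly zero. In scikit-learn, the discrete variants (MultinomialNB, BernoulliNB, CategoricalNB) expose this through the alpha parameter, with alpha=1.0 corresponding to standard Laplace smoothing.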

Code:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Step 1: Load the dataset
file_path = 'data.csv' # Replace with the actual path of your CSV file

data = pd.read_csv(file_path)
print("First 5 rows of the dataset:\n", data.head())

# Step 2: Preprocess the data (if needed)


# Handle missing values, convert categorical data to numerical, etc.
# Here, we assume the last column is the target (class) and the others are features
X = data.iloc[:, :-1] # Features (all columns except the last)
y = data.iloc[:, -1] # Target (the last column)

# If any categorical data exists, convert it using encoding


# Example: X = pd.get_dummies(X) # Uncomment this line if categorical variables exist

# Step 3: Split the data into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 4: Train the Naive Bayes classifier


classifier = GaussianNB() # Assuming continuous features
classifier.fit(X_train, y_train)

# Step 5: Make predictions on the test set


y_pred = classifier.predict(X_test)

# Step 6: Evaluate the model


accuracy = accuracy_score(y_test, y_pred)
confusion = confusion_matrix(y_test, y_pred)
report = classification_report(y_test, y_pred)
print("\nAccuracy of the Naive Bayes classifier: {:.2f}%".format(accuracy * 100))
print("\nConfusion Matrix:\n", confusion)
print("\nClassification Report:\n", report)

# Optional: Test on a new data point


new_data = np.array([[value1, value2, value3, ...]]) # Replace with actual feature values
new_prediction = classifier.predict(new_data)
print("\nPrediction for new data point:", new_prediction)

1️⃣ Importing Libraries

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

● pandas: Used to load and manipulate the dataset.


● numpy: Used for numerical computations, especially for dealing with arrays.
● train_test_split: Splits the dataset into training and testing subsets.
● GaussianNB: Implements the Naïve Bayes classifier for continuous features using a
Gaussian distribution.
● accuracy_score, classification_report, confusion_matrix: Used to evaluate the
classifier's performance.

2️⃣ Loading the Dataset


file_path = 'data.csv'  # Replace with the actual path of your CSV file
data = pd.read_csv(file_path)
print("First 5 rows of the dataset:\n", data.head())

● The dataset is loaded from a CSV file using pd.read_csv(file_path).


● The first 5 rows are printed to give you a preview of the dataset's structure.

3️⃣ Data Preprocessing

X = data.iloc[:, :-1] # Features (all columns except the last)


y = data.iloc[:, -1] # Target (the last column)

● X: All columns except the last one are used as features.


● y: The last column is considered the target variable (class label) we want to predict.

If the dataset has categorical columns (like 'Male', 'Female', 'Yes', 'No'), you may need to convert
them to numerical values. You can do this using one-hot encoding or label encoding as follows:

X = pd.get_dummies(X)  # Converts categorical features into dummy/indicator variables
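
A tiny illustration of what this does (the column name here is hypothetical, not from the dataset):

import pandas as pd

toy = pd.DataFrame({"Gender": ["Male", "Female", "Male"]})
print(pd.get_dummies(toy))
# Produces indicator columns Gender_Female and Gender_Male, with one
# 0/1 (or True/False, depending on the pandas version) value per row.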

4️⃣ Splitting the Data


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

● The dataset is split into training (80%) and testing (20%) subsets.
● The random_state=42 ensures reproducibility, meaning every time you run the code,
you get the same split.
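
An optional variant worth knowing (not required by the practical): for classification data, passing stratify=y keeps the class proportions roughly the same in the training and testing subsets.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)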

5️⃣ Training the Naïve Bayes Classifier

classifier = GaussianNB()
classifier.fit(X_train, y_train)

● We create a Gaussian Naïve Bayes classifier using GaussianNB().


● The fit() method trains the model using the training data (X_train, y_train).

6️⃣ Making Predictions

y_pred = classifier.predict(X_test)

● The classifier predicts the labels for the test set (X_test).
● The predicted labels are stored in y_pred.

7️⃣ Evaluating the Model


accuracy = accuracy_score(y_test, y_pred)
confusion = confusion_matrix(y_test, y_pred)
report = classification_report(y_test, y_pred)

print("\nAccuracy of the Naive Bayes classifier:


{:.2f}%".format(accuracy * 100))
print("\nConfusion Matrix:\n", confusion)
print("\nClassification Report:\n", report)

● Accuracy: Measures the percentage of correct predictions.


● Confusion Matrix: Shows how many true positives, true negatives, false positives, and
false negatives the classifier has.
● Classification Report: Provides precision, recall, and F1-score for each class.

Example of output:

Accuracy of the Naive Bayes classifier: 88.00%

Confusion Matrix:
[[50 5]
[ 7 38]]

Classification Report:
              precision    recall  f1-score   support

           0       0.88      0.91      0.89        55
           1       0.88      0.84      0.86        45

    accuracy                           0.88       100
   macro avg       0.88      0.88      0.88       100
weighted avg       0.88      0.88      0.88       100

Explanation of Metrics:

● Precision: Out of all predicted positive results, how many were actually positive.
● Recall: Out of all actual positive results, how many were predicted correctly.
● F1-Score: Harmonic mean of precision and recall.
● Support: Number of actual occurrences for each class.
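
As a quick check against the example output above, these values follow directly from the confusion matrix [[50 5], [7 38]]:

Precision (class 0) = 50 / (50 + 7) ≈ 0.88
Recall (class 0)    = 50 / (50 + 5) ≈ 0.91
F1-score (class 0)  = 2 × (0.88 × 0.91) / (0.88 + 0.91) ≈ 0.89
Accuracy            = (50 + 38) / 100 = 0.88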

8️⃣ Predicting for New Data

new_data = np.array([[value1, value2, value3, ...]])  # Replace with actual feature values
new_prediction = classifier.predict(new_data)
print("\nPrediction for new data point:", new_prediction)

● Here, you can enter a new data point in the form of an array (example:
np.array([[2.1, 3.4, 1.2, 0.5]])).
● The classifier will predict the class label for this new data point.
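
A small optional refinement (the numbers below are placeholders, not from the dataset): if X was loaded as a DataFrame, passing the new point as a one-row DataFrame with the same column names avoids scikit-learn's feature-name warning.

new_point = pd.DataFrame([[2.1, 3.4, 1.2, 0.5]], columns=X.columns)  # one value per feature column, in the same order as X
print("Prediction for new data point:", classifier.predict(new_point))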
