PART A
(PART A: TO BE REFERRED BY STUDENTS)
Experiment No. 3
A.1 Aim:
To implement a Support Vector Machine.
A.2 Prerequisite:
Python Basic Concepts
A.3 Outcome:
Students will be able to implement a Support Vector Machine.
A.4 Theory:
Machine Learning, a subset of Artificial Intelligence (AI), plays a dominant role in our
daily lives. Data science engineers and developers working in various domains widely use
machine learning algorithms to make their tasks simpler and life easier.
The objective of the support vector machine algorithm is to find a hyperplane in an
N-dimensional space (where N is the number of features) that distinctly classifies the data
points. To separate the two classes of data points, there are many possible hyperplanes that
could be chosen. Our objective is to find the plane that has the maximum margin, i.e., the
maximum distance between data points of both classes. Maximizing the margin distance provides
some reinforcement so that future data points can be classified with more confidence.
Hyperplanes are decision boundaries that help classify the data points. Data points falling
on either side of the hyperplane can be attributed to different classes. Also, the dimension
of the hyperplane depends upon the number of features. If the number of input features is
2, then the hyperplane is just a line. If the number of input features is 3, then the
hyperplane becomes a two-dimensional plane. It becomes difficult to imagine when the
number of features exceeds 3.
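As an illustration, the following minimal sketch (not part of the lab code; it assumes
scikit-learn is available and uses a synthetic two-feature dataset) shows how the learned
hyperplane of a linear SVM can be inspected through its weight vector w and bias b:
from sklearn.datasets import make_blobs
from sklearn.svm import SVC
# Two well-separated clusters with 2 features, so the hyperplane is a line
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=0)
clf = SVC(kernel='linear')
clf.fit(X, y)
# Decision boundary: w1*x1 + w2*x2 + b = 0
print("w =", clf.coef_[0])
print("b =", clf.intercept_[0])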
Types of SVM
SVM can be of two types:
o Linear SVM: Linear SVM is used for linearly separable data, which means that if a
dataset can be classified into two classes using a single straight line, then such
data is termed linearly separable data, and the classifier used is called a Linear
SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which
means that if a dataset cannot be classified using a straight line, then such data is
termed non-linear data, and the classifier used is called a Non-linear SVM
classifier.
The SVM algorithm is implemented with a kernel that transforms the input data space into
the required form. SVM uses a technique called the kernel trick, in which the kernel takes a
low-dimensional input space and transforms it into a higher-dimensional space. In simple
words, the kernel converts a non-separable problem into a separable problem by adding more
dimensions to it. This makes SVM more powerful, flexible and accurate. The following are
some of the types of kernels used by SVM, contrasted in the short sketch below.
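The sketch below (assuming scikit-learn is installed; make_moons is used only as a stand-in
non-linearly separable dataset) contrasts a plain linear kernel with the RBF kernel trick on
data that a straight line cannot separate well:
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
# A dataset that no single straight line can separate well
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
for kernel in ('linear', 'rbf'):
    clf = SVC(kernel=kernel)
    clf.fit(X_train, y_train)
    # The RBF kernel is expected to score noticeably higher here
    print(kernel, "accuracy:", clf.score(X_test, y_test))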
Linear Kernel
It is simply the dot product between any two observations. The formula of the linear
kernel is as below:
K(x, xi) = sum(x * xi)
From the above formula, we can see that the kernel value for two vectors x and xi is
the sum of the products of each pair of input values.
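As a quick check, a minimal sketch in plain NumPy (the vector values are arbitrary) evaluates
the linear kernel and confirms it equals the ordinary dot product:
import numpy as np
x = np.array([1.0, 2.0, 3.0])
xi = np.array([4.0, 5.0, 6.0])
k_linear = np.sum(x * xi)   # element-wise product, then sum
print(k_linear)             # 32.0
print(np.dot(x, xi))        # the same value via the dot product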
Polynomial Kernel
It is a more generalized form of the linear kernel and can distinguish curved or non-linear
input space. Following is the formula for the polynomial kernel:
K(X, Xi) = (1 + sum(X * Xi))^d
Here d is the degree of the polynomial, which we need to specify manually in the learning
algorithm.
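The following minimal sketch (plain NumPy; the vectors and d = 3 are arbitrary choices for
illustration) evaluates the polynomial kernel directly:
import numpy as np
X = np.array([1.0, 2.0])
Xi = np.array([0.5, -1.0])
d = 3  # degree of the polynomial
k_poly = (1 + np.sum(X * Xi)) ** d
print(k_poly)
In scikit-learn the corresponding kernel family is selected with SVC(kernel='poly', degree=3);
note that scikit-learn's version also exposes gamma and coef0 parameters.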
Pros and Cons of SVM Classifiers
Pros of SVM classifiers
SVM classifiers offer great accuracy and work well in high-dimensional spaces. SVM
classifiers use only a subset of the training points (the support vectors) in the decision
function, so they are also memory efficient.
Cons of SVM classifiers
They have a high training time and are therefore not suitable in practice for large datasets.
Another disadvantage is that SVM classifiers do not work well with overlapping classes.
PART B
(PART B : TO BE COMPLETED BY STUDENTS)
(Students must submit the soft copy as per the following segments within two hours of the practical. The
soft copy must be uploaded on Blackboard or emailed to the concerned lab in-charge faculty at
the end of the practical in case there is no Blackboard access available.)
Roll. No. B24 Name:Sakshi Bhaskar Tupsundar
Class: BE-Comps Batch:B2
Date of Experiment:10-10-2023 Date of Submission:12-10-2023
Grade:
B.1 Software Code written by student:
import numpy as np
from google.colab import drive
import csv
import pandas as pd
import seaborn as sns
df = pd.read_csv('/content/survey lung cancer.csv')
df.shape
df.isnull().sum()
df.head()
from sklearn import preprocessing
# label_encoder object knows
# how to understand word labels.
label_encoder = preprocessing.LabelEncoder()
# Encode labels in the 'GENDER' and 'LUNG_CANCER' columns.
df['GENDER']= label_encoder.fit_transform(df['GENDER'])
df['GENDER'].unique()
df['LUNG_CANCER']= label_encoder.fit_transform(df['LUNG_CANCER'])
df['LUNG_CANCER'].unique()
df.head()
import matplotlib.pyplot as plt
plt.figure(figsize=(14, 8))
plt.suptitle("Lung Disease Prediction")
ax = plt.gca()
df.boxplot()
# Removing outliers using the IQR rule
import pandas as pd
# Use the encoded dataframe for the remaining steps
data = df.copy()
columns_to_check = ['LUNG_CANCER']
# Step 1: Calculate the first quartile (Q1), third quartile (Q3),
# and IQR for each column
Q1 = data[columns_to_check].quantile(0.25)
Q3 = data[columns_to_check].quantile(0.75)
IQR = Q3 - Q1
# Step 2: Define the outlier boundaries
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Step 3: Identify outliers for each column
outliers = {}
for column_name in columns_to_check:
    outliers[column_name] = data[(data[column_name] < lower_bound[column_name]) |
                                 (data[column_name] > upper_bound[column_name])]
# Step 4: Remove the outliers
data_cleaned = data.copy()
for column_name in columns_to_check:
    data_cleaned = data_cleaned[
        (data_cleaned[column_name] >= lower_bound[column_name]) &
        (data_cleaned[column_name] <= upper_bound[column_name])]
Applying SVM model before outlier removal
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
X = data.drop('LUNG_CANCER', axis=1)
y = data['LUNG_CANCER']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Standardize the features before fitting the SVM
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
X_train = st_x.fit_transform(X_train)
X_test = st_x.transform(X_test)
# Fit an SVM classifier on the data before outlier removal
svm_clf = SVC()
svm_clf.fit(X_train, y_train)
y_pred = svm_clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
Applying SVM model after outlier removal
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
# Split the cleaned data (after outlier removal) into features and labels
X = data_cleaned.drop('LUNG_CANCER', axis=1)  # Adjust as needed
y = data_cleaned['LUNG_CANCER']
# Initialize an empty list to store selected features
selected_features = []
best_accuracy = 0.0
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Forward feature selection: repeat until all features are selected
while len(selected_features) < X.shape[1]:
    # Find the feature that improves the model the most
    best_feature = None
    best_feature_accuracy = 0.0
    for feature in X.columns:
        if feature not in selected_features:
            # Create a new feature set by adding the current feature
            current_features = selected_features + [feature]
            # Train an SVM classifier on the current feature set
            svm = SVC()
            svm.fit(X_train[current_features], y_train)
            # Make predictions on the test set
            y_pred = svm.predict(X_test[current_features])
            # Calculate accuracy
            accuracy = accuracy_score(y_test, y_pred)
            # Check if this feature improves accuracy
            if accuracy > best_feature_accuracy:
                best_feature_accuracy = accuracy
                best_feature = feature
    # Add the best feature to the selected features list
    selected_features.append(best_feature)
    best_accuracy = best_feature_accuracy
    # Print the selected feature and its accuracy
    print(f"Selected Feature: {best_feature}, Accuracy: {best_accuracy:.4f}")
print("Forward selection complete.")
print("Selected Features:", selected_features)
Applying SVM model after feature selection process
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
# Restrict the earlier train/test split to the selected features
X_train_after = X_train[selected_features]
X_test_after = X_test[selected_features]
y_train_after, y_test_after = y_train, y_test
# Create an SVM model with the 'rbf' kernel
clf = SVC(kernel='rbf')
# Fit the SVM model to the training data
clf.fit(X_train_after, y_train_after)
# Make predictions on the test data
y_pred = clf.predict(X_test_after)
# Calculate accuracy on the test set
accuracy = accuracy_score(y_test_after, y_pred)
print("Testing Accuracy:", accuracy)
# Perform cross-validation and print the cross-validation scores
cv_scores = cross_val_score(clf, X_train_after, y_train_after, cv=5)  # Change the number of folds (cv) as needed
print("Cross-Validation Scores:", cv_scores)
print("Mean CV Score:", cv_scores.mean())
print(classification_report(y_test_after, y_pred))
Hyperparameter tuning for SVM
import pandas as pd
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from scipy.stats import loguniform
# Features and target from the cleaned data; the target column is 'LUNG_CANCER'
X = data_cleaned.drop('LUNG_CANCER', axis=1)
y = data_cleaned['LUNG_CANCER']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Define the model
svm_model = SVC()
# Define the hyperparameter distributions to sample from
param_dist = {
    'C': loguniform(1e-2, 1e2),
    'gamma': loguniform(1e-4, 1e0),
    'kernel': ['linear', 'rbf', 'poly']
}
# Perform randomized search with cross-validation
random_search = RandomizedSearchCV(estimator=svm_model,
    param_distributions=param_dist, n_iter=10, cv=5, scoring='accuracy',
    random_state=42)
random_search.fit(X_train, y_train)
# Get the best hyperparameters
best_params = random_search.best_params_
print("Best Hyperparameters:", best_params)
# Evaluate the model on the test set using the best hyperparameters
best_model = random_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy on Test Set:", accuracy)
B.2 Input and Output:
SVM Model Scores
Training Accuracy score 0.89475
Testing Accuracy score 0.86
ROC_AUC score 0.951477
CV score 0.756
SVM Model (Feature Selection) Scores
Training Accuracy score 0.84963
Testing Accuracy score 0.9575
ROC_AUC score 0.9153
CV score 0.9425
Hyperparameter Tuning for SVM Model Scores
Accuracy score 0.91935483870
ROC_AUC score 0.55
CV score 0.95967741
B.3 Observations and learning:
The SVM classifier with an RBF kernel demonstrated strong predictive capabilities, achieving a
high accuracy rate and effectively classifying data points into their respective classes.
Support Vector Machines are powerful classifiers that can be applied to a wide range of
classification problems.
Evaluating the performance of an SVM model through metrics like accuracy, precision, recall,
and the confusion matrix helps in understanding its strengths and weaknesses.
SVMs with RBF kernels are suitable for complex datasets with non-linear relationships, but
hyperparameter tuning and feature selection are crucial for optimizing their performance.
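As a minimal sketch of these metrics (assuming the held-out labels y_test_after and
predictions y_pred from B.1 are still in scope):
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, confusion_matrix)
print("Accuracy :", accuracy_score(y_test_after, y_pred))
print("Precision:", precision_score(y_test_after, y_pred))
print("Recall   :", recall_score(y_test_after, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_test_after, y_pred))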
B.4 Conclusion:
In this experiment, we successfully implemented a Support Vector Machine (SVM) classifier with an
RBF kernel on a given dataset.
B.5 Question of Curiosity
Q1. What is a support vector machine (SVM)?
Ans: A support vector machine (SVM) is a type of supervised learning algorithm used in
machine learning to solve classification and regression tasks; SVMs are particularly good at
solving binary classification problems, which require classifying the elements of a data set into
two groups.
The aim of a support vector machine algorithm is to find the best possible line, or decision
boundary, that separates the data points of different data classes. This boundary is called a
hyperplane when working in high-dimensional feature spaces. The idea is to maximize the
margin, which is the distance between the hyperplane and the closest data points of each
category, thus making it easy to distinguish data classes.
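A minimal sketch (synthetic data; scikit-learn assumed available) makes this concrete: for a
linear SVM the margin width equals 2 / ||w||, and the closest points are exposed as support
vectors:
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC
X, y = make_blobs(n_samples=100, centers=2, random_state=0)
clf = SVC(kernel='linear', C=1000)   # a large C approximates a hard margin
clf.fit(X, y)
w = clf.coef_[0]
print("Margin width:", 2 / np.linalg.norm(w))
print("Support vectors:\n", clf.support_vectors_)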