Machine Learning Lab: Titanic PCA & ID3 Decision Tree

The document describes experiments on machine learning algorithms applied to the Titanic dataset. Experiment 1 performs data preprocessing, cleaning and dimensionality reduction using PCA. Experiment 2 implements a decision tree classifier using the ID3 algorithm.

Uploaded by

NANDINI AGGARWAL 211131

Jaypee University of Information Technology

Department of Computer Science and Engineering

Course Code: 18B1WCI674

Course Name: MACHINE LEARNING LAB

Submitted by: Kshitiz Tayal (211173), Batch: CS - 63
Submitted to: Mr. Praveen Modi
S.no Experiments Date Remarks
EXPERIMENT-1
AIM:

Data preprocessing, data cleaning, and dimensionality reduction using PCA on the Titanic dataset.

CODE:
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, LabelEncoder
import matplotlib.pyplot as plt

# Load the Titanic dataset

url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
titanic_df = pd.read_csv(url)

print(titanic_df.head().to_string(index=False))

print(titanic_df.info())

# Check for missing values

print(titanic_df.isnull().sum().to_string())

# Data Preprocessing and Cleaning

titanic_df = titanic_df.drop(columns=['Cabin', 'Ticket', 'Name'])

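# Fill missing values in 'Age' with the mean
# (plain assignment replaces the chained inplace fillna, which newer pandas versions warn about)

titanic_df['Age'] = titanic_df['Age'].fillna(titanic_df['Age'].mean())

# Fill missing values in 'Embarked' with mode

titanic_df['Embarked'] = titanic_df['Embarked'].fillna(titanic_df['Embarked'].mode()[0])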

# Encode categorical variables: Sex and Embarked

label_encoder = LabelEncoder()
titanic_df['Sex'] = label_encoder.fit_transform(titanic_df['Sex'])
titanic_df['Embarked'] = label_encoder.fit_transform(titanic_df['Embarked'])
# Scale features: Age, Fare
scaler = StandardScaler()
titanic_df[['Age', 'Fare']] = scaler.fit_transform(titanic_df[['Age',
'Fare']])
print(titanic_df.var().to_string())

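# PCA

# Separate features (X) and target (Survived)
# Note: PassengerId stays in X and is not scaled, so its variance
# (about 66231 in the output) dominates the first principal component.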

X = titanic_df.drop(columns=['Survived'])
y = titanic_df['Survived']

# Apply PCA with 2 components

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Check explained variance ratio

print(pca.explained_variance_ratio_)

# Dimension before PCA

print("Dimension before PCA:", X.shape)

# Dimension after PCA

print("Dimension after PCA:", X_pca.shape)

# Scatter plot of the reduced dataset after PCA

plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis', alpha=0.5)
plt.title('Scatter plot of Titanic dataset after PCA')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.colorbar(label='Survived')
plt.grid(True)
plt.show()
OUTPUT:
PassengerId  Survived  Pclass  Name                                                 Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
1            0         3       Braund, Mr. Owen Harris                              male    22.0  1      0      A/5 21171         7.2500   NaN    S
2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Thayer)  female  38.0  1      0      PC 17599          71.2833  C85    C
3            1         3       Heikkinen, Miss. Laina                               female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)         female  35.0  1      0      113803            53.1000  C123   S
5            0         3       Allen, Mr. William Henry                             male    35.0  0      0      373450            8.0500   NaN    S
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
None
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
PassengerId 66231.000000
Survived 0.236772
Pclass 0.699015
Sex 0.228475
Age 1.001124
SibSp 1.216043
Parch 0.649728
Fare 1.001124
Embarked 0.626477
[9.99918243e-01 2.43786002e-05]
Dimension before PCA: (891, 8)
Dimension after PCA: (891, 2)
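
The explained-variance output above shows the first component carrying almost all of the variance. That is what one would expect when an unscaled column such as PassengerId (variance about 66231) sits next to standardized columns. The sketch below is not part of the original experiment: it uses a small synthetic frame (the values are made up for illustration, and `demo_df` is a hypothetical name) to show how dropping the identifier and standardizing every feature before PCA changes the explained-variance split.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Titanic features (hypothetical values, not real data)
rng = np.random.default_rng(0)
n = 891
demo_df = pd.DataFrame({
    "PassengerId": np.arange(1, n + 1),   # large-scale identifier column
    "Age": rng.normal(30, 12, n),
    "Fare": rng.exponential(30, n),
    "Pclass": rng.integers(1, 4, n),
})

# PCA on the raw columns: the identifier dominates because of its scale
raw_ratio = PCA(n_components=2).fit(demo_df).explained_variance_ratio_

# Drop the identifier and standardize every remaining feature before PCA
features = demo_df.drop(columns=["PassengerId"])
scaled = StandardScaler().fit_transform(features)
pca = PCA(n_components=2)
reduced = pca.fit_transform(scaled)
scaled_ratio = pca.explained_variance_ratio_

print("Raw explained variance ratio:   ", raw_ratio)
print("Scaled explained variance ratio:", scaled_ratio)
print("Reduced shape:", reduced.shape)
```

With standardized inputs, each component reflects correlation structure rather than the measurement scale of a single column, so the scatter plot of the two components is usually more informative.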
EXPERIMENT-2
AIM:
Implementation of a decision tree using the ID3 algorithm.

CODE:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Titanic dataset

url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
titanic_df = pd.read_csv(url)

# Calculate the entropy of the whole dataset

def calc_total_entropy(train_data, label, class_list):
    total_row = train_data.shape[0]
    total_entr = 0
    for c in class_list:  # for each class in the label
        total_class_count = train_data[train_data[label] == c].shape[0]
        if total_class_count != 0:  # skip absent classes to avoid log2(0)
            total_class_entr = - (total_class_count / total_row) * np.log2(total_class_count / total_row)
            total_entr += total_class_entr
    return total_entr

# Calculate the entropy of a filtered dataset

def calc_entropy(feature_value_data, label, class_list):
    class_count = feature_value_data.shape[0]
    entropy = 0
    for c in class_list:
        label_class_count = feature_value_data[feature_value_data[label] == c].shape[0]
        entropy_class = 0
        if label_class_count != 0:
            probability_class = label_class_count / class_count
            entropy_class = - probability_class * np.log2(probability_class)  # entropy
        entropy += entropy_class
    return entropy

# Calculate information gain for a feature

def calc_info_gain(feature_name, train_data, label, class_list):
    feature_value_list = train_data[feature_name].unique()
    total_row = train_data.shape[0]
    feature_info = 0.0
    for feature_value in feature_value_list:
        feature_value_data = train_data[train_data[feature_name] == feature_value]
        feature_value_count = feature_value_data.shape[0]
        feature_value_entropy = calc_entropy(feature_value_data, label, class_list)
        feature_value_probability = feature_value_count / total_row
        feature_info += feature_value_probability * feature_value_entropy
    return calc_total_entropy(train_data, label, class_list) - feature_info

# Find the most informative feature
# Returns None when no feature has a positive information gain

def find_most_informative_feature(train_data, label, class_list):
    feature_list = train_data.columns.drop(label)
    max_info_gain = 0
    max_info_feature = None
    for feature in feature_list:
        feature_info_gain = calc_info_gain(feature, train_data, label, class_list)
        if max_info_gain < feature_info_gain:
            max_info_gain = feature_info_gain
            max_info_feature = feature
    return max_info_feature

# Generate a subtree
# Pure feature values become leaves; mixed ones are marked "?" for expansion

def generate_sub_tree(feature_name, train_data, label, class_list):
    feature_value_count_dict = train_data[feature_name].value_counts(sort=False)
    tree = {}
    for feature_value, count in feature_value_count_dict.items():
        feature_value_data = train_data[train_data[feature_name] == feature_value]
        assigned_to_node = False
        for c in class_list:
            class_count = feature_value_data[feature_value_data[label] == c].shape[0]
            if class_count == count:
                tree[feature_value] = c
                train_data = train_data[train_data[feature_name] != feature_value]
                assigned_to_node = True
        if not assigned_to_node:
            tree[feature_value] = "?"
    return tree, train_data

# Recursively create the decision tree

def make_tree(root, prev_feature_value, train_data, label, class_list):
    if train_data.shape[0] != 0:
        max_info_feature = find_most_informative_feature(train_data, label, class_list)
        if max_info_feature is None:
            # No feature separates the classes any further: store the majority
            # class as a leaf. This guards against endless recursion on rows
            # that share all feature values but differ in label.
            if prev_feature_value is not None:
                root[prev_feature_value] = train_data[label].mode()[0]
            return
        tree, train_data = generate_sub_tree(max_info_feature, train_data, label, class_list)
        next_root = None
        if prev_feature_value is not None:
            root[prev_feature_value] = dict()
            root[prev_feature_value][max_info_feature] = tree
            next_root = root[prev_feature_value][max_info_feature]
        else:
            root[max_info_feature] = tree
            next_root = root[max_info_feature]
        for node, branch in list(next_root.items()):
            if branch == "?":
                feature_value_data = train_data[train_data[max_info_feature] == node]
                make_tree(next_root, node, feature_value_data, label, class_list)

# ID3 Algorithm

def id3(train_data_m, label):
    train_data = train_data_m.copy()
    tree = {}
    class_list = train_data[label].unique()
    make_tree(tree, None, train_data, label, class_list)
    return tree

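# Build the decision tree
# Note: from here on the listing fits scikit-learn's DecisionTreeClassifier
# (Gini criterion by default); the id3() function defined above is not called.

# Imports needed by the code below (missing from the original listing)
from sklearn.tree import DecisionTreeClassifier, export_graphviz
import graphviz

# NOTE: X_encoded and y are not defined anywhere in the original listing.
# The four lines below are an assumed reconstruction that mirrors the
# preprocessing used in Experiment 3.
features_df = titanic_df.drop(columns=['Name', 'Ticket', 'Cabin', 'PassengerId', 'Survived'])
features_df['Sex'] = features_df['Sex'].map({'male': 0, 'female': 1})
X_encoded = pd.get_dummies(features_df, columns=['Embarked'], dtype=int)
y = titanic_df['Survived']

# Impute missing values with the mean

imputer = SimpleImputer(strategy='mean')
X_imputed = pd.DataFrame(imputer.fit_transform(X_encoded), columns=X_encoded.columns)

# Now fit the decision tree classifier

tree_clf = DecisionTreeClassifier(max_depth=3)  # Adjust max_depth as needed
tree_clf.fit(X_imputed, y)

# Export and visualize the decision tree

export_graphviz(
    tree_clf,
    out_file="titanic_tree.dot",
    feature_names=list(X_imputed.columns),
    class_names=['Not Survived', 'Survived'],
    rounded=True,
    filled=True
)

with open("titanic_tree.dot") as f:
    dot_graph = f.read()

graphviz.Source(dot_graph)

# Split the dataset into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X_imputed, y, test_size=0.2, random_state=42)

# Fit the decision tree classifier on the training data

tree_clf = DecisionTreeClassifier(max_depth=3)  # Adjust max_depth as needed
tree_clf.fit(X_train, y_train)

# Make predictions on the testing data

y_pred = tree_clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)

print("Accuracy:", accuracy)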
OUTPUT:

Accuracy: 0.7988826815642458
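
The ID3 functions in the listing build the tree as a nested dictionary of the form {feature: {value: subtree_or_class}}, but the listing never uses that tree for prediction. The sketch below shows one way such a structure could be traversed. The `predict` helper, its `default` fallback, and the hand-written `toy_tree` are assumptions for illustration and are not part of the original code.

```python
def predict(tree, instance, default=None):
    """Walk a nested-dict tree of the form {feature: {value: subtree_or_class}}."""
    if not isinstance(tree, dict):
        return tree                       # reached a leaf: the stored class label
    if not tree:
        return default                    # empty tree: nothing to decide on
    feature = next(iter(tree))            # each dict level holds a single feature
    branches = tree[feature]
    value = instance.get(feature)
    if value not in branches:
        return default                    # feature value never seen during training
    return predict(branches[value], instance, default)


# Hand-written toy tree (hypothetical, not derived from the Titanic data)
toy_tree = {
    "Sex": {
        "female": 1,
        "male": {"Pclass": {1: 1, 2: 0, 3: 0}},
    }
}

print(predict(toy_tree, {"Sex": "female", "Pclass": 3}))
print(predict(toy_tree, {"Sex": "male", "Pclass": 1}))
print(predict(toy_tree, {"Sex": "male", "Pclass": 4}, default=0))
```

The `default` argument matters because ID3 creates one branch per observed category, so a test row can carry a value that has no branch at all.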
EXPERIMENT-3
AIM:

Implementation of Decision tree using Random Forest Algorithm.

CODE:

import pandas as pd
import numpy as np
import random
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier, export_graphviz
import graphviz

# Load the Titanic dataset

url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
titanic_df = pd.read_csv(url)

# Data preprocessing
titanic_df.drop(['Name', 'Ticket', 'Cabin', 'PassengerId'], axis=1, inplace=True)
# Plain assignment replaces the chained inplace fillna, which newer pandas versions warn about
titanic_df['Age'] = titanic_df['Age'].fillna(titanic_df['Age'].median())
titanic_df['Embarked'] = titanic_df['Embarked'].fillna(titanic_df['Embarked'].mode()[0])
titanic_df['Sex'] = titanic_df['Sex'].map({'male': 0, 'female': 1})
titanic_df = pd.get_dummies(titanic_df, columns=['Embarked'])

# Split data into features and target variable

X = titanic_df.drop('Survived', axis=1).values
y = titanic_df['Survived'].values

# Split data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2, random_state=42)

class DecisionTree:
    def __init__(self, max_depth=None):
        self.max_depth = max_depth

    def fit(self, X, y):
        self.tree = self._grow_tree(X, y)

    def _grow_tree(self, X, y, depth=0):
        num_samples, num_features = X.shape
        num_labels = len(np.unique(y))

        # Stopping criteria
        if depth == self.max_depth or num_labels == 1 or num_samples < 2:
            return {'prediction': np.argmax(np.bincount(y))}

        # Select random subset of features
        feature_indices = random.sample(range(num_features), int(np.sqrt(num_features)))
        best_feature, best_threshold = self._best_criteria(X, y, feature_indices)

        # Handle case where no suitable split is found
        if best_feature is None or best_threshold is None:
            return {'prediction': np.argmax(np.bincount(y))}

        # Split data
        left_indices = np.where(X[:, best_feature] <= best_threshold)[0]
        right_indices = np.where(X[:, best_feature] > best_threshold)[0]

        # Create sub-trees
        left_tree = self._grow_tree(X[left_indices], y[left_indices], depth + 1)
        right_tree = self._grow_tree(X[right_indices], y[right_indices], depth + 1)

        return {'feature': best_feature,
                'threshold': best_threshold,
                'left': left_tree,
                'right': right_tree}

    def _best_criteria(self, X, y, feature_indices):
        best_gain = -1
        best_feature = None
        best_threshold = None

        for feature_index in feature_indices:
            thresholds = np.unique(X[:, feature_index])
            for threshold in thresholds:
                left_indices = np.where(X[:, feature_index] <= threshold)[0]
                right_indices = np.where(X[:, feature_index] > threshold)[0]

                if len(left_indices) == 0 or len(right_indices) == 0:
                    continue

                gain = self._information_gain(y, y[left_indices], y[right_indices])
                if gain > best_gain:
                    best_gain = gain
                    best_feature = feature_index
                    best_threshold = threshold

        return best_feature, best_threshold

    def _information_gain(self, parent, left_child, right_child):
        p = len(left_child) / len(parent)
        entropy_parent = self._entropy(parent)
        entropy_children = p * self._entropy(left_child) + (1 - p) * self._entropy(right_child)
        return entropy_parent - entropy_children

    def _entropy(self, y):
        _, counts = np.unique(y, return_counts=True)
        probabilities = counts / len(y)
        return -np.sum(probabilities * np.log2(probabilities + 1e-10))

    def predict(self, X):
        return np.array([self._predict_tree(x, self.tree) for x in X])

    def _predict_tree(self, x, tree):
        if 'prediction' in tree:
            return tree['prediction']
        else:
            feature_value = x[tree['feature']]
            if feature_value <= tree['threshold']:
                return self._predict_tree(x, tree['left'])
            else:
                return self._predict_tree(x, tree['right'])


class RandomForest:
    def __init__(self, n_estimators=100, max_depth=None):
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.trees = []

    def fit(self, X, y):
        for _ in range(self.n_estimators):
            tree = DecisionTree(max_depth=self.max_depth)
            # Bootstrap sample: draw len(X) rows with replacement
            indices = np.random.choice(len(X), len(X), replace=True)
            tree.fit(X[indices], y[indices])
            self.trees.append(tree)

    def predict(self, X):
        predictions = np.array([tree.predict(X) for tree in self.trees])
        # Majority vote over the trees. The original listing cast the mean vote
        # with .astype(int), which truncates toward zero and returns 1 only when
        # every tree votes 1; the accuracy shown under OUTPUT was reported for
        # that listing.
        return (np.mean(predictions, axis=0) >= 0.5).astype(int)

# Instantiate and train the Random Forest model

rf = RandomForest(n_estimators=100, max_depth=5)
rf.fit(X_train, y_train)

# Make predictions
predictions = rf.predict(X_test)

# Evaluate the model

accuracy = accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)

# Encode categorical variables

label_encoder = LabelEncoder()
X_encoded = pd.DataFrame(X, columns=titanic_df.drop('Survived', axis=1).columns)
for col_idx in range(X_encoded.shape[1]):
    if X_encoded.iloc[:, col_idx].dtype == 'object':
        X_encoded.iloc[:, col_idx] = label_encoder.fit_transform(X_encoded.iloc[:, col_idx])

# Now fit the decision tree classifier

tree_clf = DecisionTreeClassifier(max_depth=3)  # Adjust max_depth as needed
tree_clf.fit(X_encoded, y)

# Export and visualize the decision tree

export_graphviz(
tree_clf,
out_file="titanic_tree.dot",
feature_names=X_encoded.columns,
class_names=['Not Survived', 'Survived'],
rounded=True,
filled=True
)

# Read the DOT file and visualize the decision tree

with open("titanic_tree.dot") as f:
    dot_graph = f.read()

graph = graphviz.Source(dot_graph)
graph.render("titanic_decision_tree", format='png', cleanup=True)  # Save tree as PNG
OUTPUT:

Accuracy: 0.5921787709497207
titanic_decision_tree.png
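
The reported accuracy of about 0.59 is near the majority-class rate for this dataset (roughly 62% of the 891 passengers did not survive), which may indicate that the forest predicts "not survived" for almost every row. One plausible cause is the aggregation step: when the mean of the 0/1 tree votes is cast with astype(int), the value is truncated toward zero, so class 1 is returned only when every tree votes 1. The sketch below compares truncation with a simple majority threshold on made-up votes (the `votes` array is hypothetical).

```python
import numpy as np

# Hypothetical votes from 5 trees (rows) for 4 samples (columns)
votes = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
])

mean_vote = votes.mean(axis=0)               # fraction of trees voting 1

truncated = mean_vote.astype(int)            # truncation toward zero
majority = (mean_vote >= 0.5).astype(int)    # simple majority threshold

print("Mean vote:", mean_vote)
print("Truncated:", truncated)
print("Majority: ", majority)
```

The second sample receives four of five votes for class 1, yet truncation maps its mean of 0.8 to 0, while the threshold maps it to 1.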
