ML PROJECT REPORT
Koushik Tumati
PROBLEM:
Context: The sinking of the Titanic is one of the most infamous shipwrecks in history.
Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of
1502 out of 2224 passengers and crew.
Objective: To build a predictive model that answers the question "What sorts of people were
more likely to survive?" using passenger data (e.g., name, age, gender, socio-economic class).
Tools and libraries used:
Language: Python
Libraries: Numpy, Pandas, Matplotlib, Seaborn, Sklearn
DATASETS:
The dataset was obtained from Kaggle. It contains 12 columns (features) and 891 rows.
 Variable      Definition                                   Key
 Survived      Survival                                     0 = No, 1 = Yes
 PassengerId   Serial ID number
 Pclass        Ticket class                                 1 = Upper, 2 = Middle, 3 = Lower
 Name          Passenger name
 Sex           Sex
 Age           Age in years
 SibSp         # of siblings / spouses aboard the Titanic
 Parch         # of parents / children aboard the Titanic
 Ticket        Ticket number
 Fare          Passenger fare
 Cabin         Cabin number
 Embarked      Port of Embarkation                          C = Cherbourg, Q = Queenstown, S = Southampton
APPROACH
   1) DATA CLEANING:
       1) Familiarising with the data:
       First, the libraries mentioned above are imported and the dataset is loaded with pandas'
       read_csv function. To get familiar with the dataset, the column types, basic descriptive
       statistics and general information are obtained through methods such as .head(),
       .describe() and .info() (a short loading and inspection sketch follows the list below).
       After performing these operations, the following conclusions are drawn:
           - "Survived" is the dependent variable that we predict from the other, independent
             variables.
           - PassengerId and Ticket are essentially arbitrary identifiers and cannot contribute
             to the model.
           - Sex and Embarked are nominal variables and need to be converted to dummy variables.
           - Age and Fare are continuous variables. (Note that Age is not a discrete number but a
             continuous variable, because infants below 1 year and some estimated adult ages have
             fractional values, e.g. 25.6.)
           - SibSp and Parch are discrete numeric variables.
           - Cabin is nominal and contains a large number of null values, which makes it
             uninformative, so it is dropped from the dataset.
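       A minimal sketch of this loading and inspection step, assuming the Kaggle training file
       is saved as train.csv:

           import pandas as pd

           # Load the Kaggle training file (the file name is an assumption).
           df = pd.read_csv("train.csv")

           print(df.head())          # first few rows
           df.info()                 # column dtypes and non-null counts (prints directly)
           print(df.describe())      # basic descriptive statistics for numeric columns
           print(df.isnull().sum())  # number of missing values per column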
      2) Checking for Outliers and Missing Data:
         After examination, no significant outliers are found, but missing values are present in
         the Age, Cabin and Embarked columns. Null values in Age are filled with the median and
         nulls in Embarked with the mode, matching their quantitative and qualitative types
         respectively. Cabin is dropped because of its huge number of null values. Note that if
         missing records cannot be reasonably filled and are few in number, those rows can simply
         be dropped to keep the model accurate.
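         A sketch of the missing-value handling described above, assuming the DataFrame df from
         the loading step and the standard Kaggle column names:

             df["Age"] = df["Age"].fillna(df["Age"].median())                  # quantitative: median
             df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])  # qualitative: mode
             df = df.drop(columns=["Cabin"])                                   # too many nulls to impute usefully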
      3) Handling Datatypes and Feature Engineering:
          Feature engineering means creating new, useful features from existing columns. In this
          dataset we can extract the title (Mr, Dr, etc.) from the Name column into a new column,
          and we can create a family_size variable from the SibSp and Parch variables.
          Columns also need to be checked for datatypes that require special handling, such as
          datetime and currency. Luckily, this dataset has no such complex datatypes. However,
          the categorical variables must be converted to numerical dummy variables for the
          model's calculations. There are several ways to do this, such as one-hot encoding; I
          used some sklearn and pandas functions for this purpose. The data frame is split and
          altered a few times in this step to obtain a model-compatible dataset with dummy
          variables.
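          A sketch of the feature engineering and encoding described above; the title regular
          expression, the new column names and the exact set of dropped columns are illustrative
          assumptions:

              import pandas as pd

              # Extract the title (Mr, Mrs, Dr, ...) from the Name column.
              df["Title"] = df["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)
              # Family size = the passenger plus all siblings/spouses/parents/children aboard.
              df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

              # Drop columns that cannot help the model (random identifiers, raw text).
              df = df.drop(columns=["PassengerId", "Ticket", "Name"])

              # One-hot encode the nominal variables into numeric dummy columns.
              df = pd.get_dummies(df, columns=["Sex", "Embarked", "Title"], drop_first=True)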
      4) Train and Test Data Split:
          To detect overfitting and evaluate the model fairly, the dataset is divided into an
          80:20 train/test split and the model is trained on the training set. The model is then
          tested on the test set to check its accuracy. If the accuracy is not satisfactory, we
          make the necessary changes to the model or choose a different algorithm until we obtain
          an optimised model.
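          A sketch of the 80:20 split, assuming the encoded DataFrame df with the target column
          Survived (the random seed and stratification are assumptions):

              from sklearn.model_selection import train_test_split

              X = df.drop(columns=["Survived"])   # independent variables
              y = df["Survived"]                  # dependent variable

              X_train, X_test, y_train, y_test = train_test_split(
                  X, y, test_size=0.2, random_state=42, stratify=y
              )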
2) EXPLORATORY DATA ANALYSIS (EDA):
    It is important to note that EDA is not always done only after data cleaning is complete. The
    two can alternate iteratively: we extract relationships between features, then continue
    cleaning, and repeat until a satisfactory dataset and set of insights are obtained. Given the
    simplicity of this particular dataset, however, that is not needed here. In this step, we
    summarise the variables and their relationships with the target variable using visualisation
    packages such as matplotlib and seaborn. In this dataset we find observations such as: the
    survival rate of women is much higher than that of men; holders of higher-fare tickets had a
    higher chance of survival; and children had a higher chance of survival, while people over 65
    had very little chance. In this small dataset most observations are intuitive, but in large
    datasets we can find intriguing, counterintuitive observations; in that case we have to
    validate the dataset and find the underlying reason for the counterintuitive result.
    Sample plots are attached below.
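    A sketch of the kind of plots summarised above (not the report's exact figures), drawn from
    the raw Kaggle columns before dummy encoding:

        import matplotlib.pyplot as plt
        import seaborn as sns

        fig, axes = plt.subplots(1, 3, figsize=(15, 4))

        sns.barplot(x="Sex", y="Survived", data=df, ax=axes[0])     # survival rate by sex
        sns.barplot(x="Pclass", y="Survived", data=df, ax=axes[1])  # survival rate by ticket class
        sns.histplot(data=df, x="Age", hue="Survived",
                     multiple="stack", ax=axes[2])                  # age distribution by survival

        plt.tight_layout()
        plt.show()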
   3) BUILDING MODELS:
      Now comes the simple yet crucial step of selecting an algorithm for the predictive model.
      Although the model itself is practically a few lines of code, the underlying statistics are
      vast and complex. Due to my limited knowledge of these algorithms, I used only simple
      Logistic Regression and Decision Tree models.
      We fit the models on the training dataset and generate predictions for the test data. Then
      we compare the predictions with the test labels and gauge the model using the confusion
      matrix, the F1 score and several other possible metrics; I used only these two.
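      A sketch of the two models and the two metrics mentioned above, assuming the train/test
      split from the previous step (the specific model settings are assumptions):

          from sklearn.linear_model import LogisticRegression
          from sklearn.tree import DecisionTreeClassifier
          from sklearn.metrics import confusion_matrix, f1_score

          models = {
              "Logistic Regression": LogisticRegression(max_iter=1000),
              "Decision Tree": DecisionTreeClassifier(random_state=42),
          }

          for name, model in models.items():
              model.fit(X_train, y_train)                  # fit on the training set
              preds = model.predict(X_test)                # predict on the held-out test set
              print(name)
              print(confusion_matrix(y_test, preds))       # rows: actual, columns: predicted
              print("F1 score:", f1_score(y_test, preds))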
   4) Improving the Model and Its Accuracy:
      I stopped my model at the above step because of my lack of knowledge of other algorithms at
      the time. We could apply different algorithms and keep the best among them. We can also
      improve the model's accuracy by tuning its parameters and hyperparameters. After such
      improvements, we obtain the finalised, optimised model.
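      As one example of the tuning referred to above, a cross-validated grid search over decision
      tree hyperparameters might look like this (the parameter grid and scoring choice are
      assumptions, not the report's actual search):

          from sklearn.model_selection import GridSearchCV
          from sklearn.tree import DecisionTreeClassifier

          # Illustrative parameter grid.
          param_grid = {
              "max_depth": [3, 5, 7, None],
              "min_samples_leaf": [1, 5, 10],
          }

          search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                                param_grid, cv=5, scoring="f1")
          search.fit(X_train, y_train)

          print("Best parameters:", search.best_params_)
          print("Best cross-validated F1:", search.best_score_)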
Other Key Points:
      - At every point we should be careful to avoid overfitting the model. For some algorithms,
        hyperparameter search iterates many times to find the best parameters, and this can
        overfit the model to the test data. In that case we split the dataset into three sets:
        train, validation and test (see the sketch after this list).
      - Giving too much data to the test and validation sets leaves insufficient training data to
        build an accurate model, while giving too little data to the test set makes the accuracy
        check unreliable. Splits of (70, 15, 15) or (80, 10, 10) are the usual standards.
      - Sometimes more features (columns) are needed and sometimes more rows are needed for a
        better model. We identify which case applies by observing suitable metrics.
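      A sketch of the three-way split mentioned in the first point, using two successive calls to
      train_test_split to obtain roughly a 70/15/15 division (the seed and stratification are
      assumptions):

          from sklearn.model_selection import train_test_split

          # First hold out 30% of the data, then split that hold-out half-and-half
          # into validation and test sets.
          X_train, X_temp, y_train, y_temp = train_test_split(
              X, y, test_size=0.30, random_state=42, stratify=y
          )
          X_val, X_test, y_val, y_test = train_test_split(
              X_temp, y_temp, test_size=0.50, random_state=42, stratify=y_temp
          )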