A Project Report on
“Build a Machine Learning model that predicts the type of people
  who survived the Titanic Shipwreck using Passenger Data (i.e.,
         name, age, gender, socio-economic class, etc)”
                         Submitted
                             By
                Shivam Sandeep Khole [13B]
              Aditya Anil Rokade [17B]
                Sujeet Singh
                Tanvi Chaudhari
                Sakshi ahe
        In partial fulfilment of the requirements for
                The award of the degree of
                         Bachelor
                             in
             COMPUTER ENGINEERING
            For Academic Year 2022 – 2023
   DEPARTMENT OF COMPUTER ENGINEERING
 MET’s Institute of Engineering Bhujbal Knowledge City
                Adgaon, Nashik – 422003.
                               Certificate
                           This is to Certify That
                        Tanvi Devidas Ch
 Has completed the necessary Mini Project Work and Prepared the Report
                                     on
        “Build a Machine Learning model that predicts the type of people
           who survived the Titanic Shipwreck using Passenger Data (i.e.,
                 name, age, gender, socio-economic class, etc)”
 in a satisfactory manner in fulfilment of the requirements for the award of the
   degree of Bachelor in Computer Engineering in the Academic Year
                                2022 – 2023
 Project Guide
Prof. Vijay More
Course Objectives:
• To understand the need for Machine learning
• To explore various data pre-processing methods.
• To study and understand classification methods
• To understand the need for multi-class classifiers.
• To learn the working of clustering algorithms
• To learn fundamental neural network algorithms.
Course Outcomes:
On completion of the course, student will be able to–
CO1: Identify the needs and challenges of machine learning for real time applications.
CO2: Apply various data pre-processing techniques to simplify and speed up machine learning
algorithms.
CO3: Select and apply appropriately supervised machine learning algorithms for real time
applications.
CO4: Implement variants of multi-class classifier and measure its performance.
CO5: Compare and contrast different clustering algorithms.
CO6: Design a neural network for solving engineering problems.
                             Acknowledgement
We take this opportunity to express our deepest gratitude and sincere thanks to
those who have helped us in completing this task. We express our sincere thanks
to our guide, Prof. Vijay More, who gave us valuable suggestions, excellent
guidance, and continuous encouragement, and took an interest in the completion
of this work. His kind help and constant inspiration will continue to help us
in the future. We thank Dr. M. U. Kharat, Head of the Computer Engineering
Department, for his co-operation and encouragement in collecting the
information and preparing the data. Credit goes to our colleagues, the staff
members of the Computer Engineering Department, and the Institute’s Library for
their help and timely assistance.
                              Contents
Sr. No.   TITLE
1.        Abstract
2.        Objectives
3.        Problem Statement
4.        Motivation
5.        Introduction
6.        Theory
7.        Conclusion
8.        References
                                          Abstract
This project is based on the Titanic dataset provided on Kaggle. The sinking of the Titanic is
one of the most infamous shipwrecks in history. On April 15, 1912, the Titanic, widely
considered “unsinkable”, sank after colliding with an iceberg. Unfortunately, there weren’t
enough lifeboats for everyone on board, resulting in the deaths of a large number of passengers
and crew. In this project, we show how machine-learning techniques can be used to predict the
survivors of the Titanic. With a training dataset of 891 individuals containing features like
sex, age, and class, we attempt to predict the survivors in a smaller test group of 418. We use
a logistic regression model for this task.
   Title of Mini-Project:
           Build a Machine Learning model that predicts the type of people who survived the
           Titanic Shipwreck using Passenger Data (i.e., name, age, gender, socio-economic
           class, etc)
   Objective:
         Goal: Build a predictive model that answers the question: “what sorts of
   people were more likely to survive?” using passenger data like age, gender, class,
   etc.
Problem Statement
Build a machine learning model that predicts the type of people who survived
the Titanic shipwreck using passenger data (i.e., name, age, gender, socio-economic class, etc.).
Dataset Link: https://www.kaggle.com/competitions/titanic/data
Motivation
The main motivation for this mini project is to study what type of people survived the Titanic
shipwreck using passenger data, and to build a model that predicts survival.
Introduction:
Machine learning is the application of computer-enabled algorithms to a data set in order to
find patterns in the data. This encompasses essentially all types of data science algorithms:
supervised, unsupervised, segmentation, classification, and regression. A few important areas
where machine learning can be applied are handwriting recognition, language translation,
speech recognition, image classification, and autonomous driving. The features of a machine
learning algorithm are the observations that are used to form predictions: for image
classification, the pixels are the features; for voice recognition, the pitch and volume of
the sound samples are the features; and for autonomous cars, the features are the data from
the cameras, range sensors, and GPS.
Using data provided by www.kaggle.com, our goal is to apply machine-learning techniques to
successfully predict which passengers survived the sinking of the Titanic. Features like ticket
price, age, sex, and class will be used to make the predictions. Using Logistic Regression
methods, we try to predict the survival of passengers using different combinations of features.
The challenge boils down to a classification problem given a set of features.
Theory
Data Set:
The data we used for our project was provided on the Kaggle website. We were given 891
passenger samples for our training set, together with labels indicating whether each passenger
survived. For each passenger, we were given the passenger class, name, sex, age, number of
siblings/spouses aboard, number of parents/children aboard, ticket number, fare, cabin, and
port of embarkation.
For the test data, we had 418 samples in the same format. The dataset is not complete, meaning
that for several samples, one or more fields were not available and were left empty (especially
in the latter fields: age, fare, cabin, and port). However, all sample points contained at
least information about gender and passenger class.
To handle the missing data, we replace missing values with the mean of the remaining values in
that field, or with other suitable values.
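As a minimal sketch of the mean-replacement step described above, the following Python snippet fills missing entries in a single numeric field. The function name `impute_with_mean` and the sample ages are assumptions made for illustration; they are not taken from the project's code or from the Kaggle data.

```python
from statistics import mean


def impute_with_mean(values):
    """Return a copy of values where None entries are replaced by the
    mean of the observed (non-None) entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: the field has no observed values")
    fill_value = mean(observed)
    return [fill_value if v is None else v for v in values]


# Invented ages for illustration only; None marks a missing entry.
ages = [22.0, None, 26.0, 36.0, None]
filled_ages = impute_with_mean(ages)
print(filled_ages)
```

The mean is a reasonable default for a numeric field such as age. For a categorical field such as the port of embarkation, the most frequent value would be a more natural replacement.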
                          Understanding the Titanic Dataset
 We first describe the Titanic dataset. This is a dataset of the passengers of the Titanic, in
 which:
• Each row represents the data of one passenger.
• Columns represent the features. The dataset has 10 variables, including the target:
1. Survival: This variable shows whether the person survived or not. This is our target
    variable, whose value we must predict. It is a binary variable: 0 means not survived and
    1 means survived.
2. pclass: The ticket class of the passenger: 1st (upper class), 2nd (middle), or 3rd (lower).
3. Sex: Gender of the passenger.
4. Age: Age (in years) of the passenger.
5. sibsp: The number of siblings/spouses of the passenger who were on the ship.
6. parch: The number of parents/children of the passenger who were on the ship.
7. ticket: Ticket number.
8. fare: Passenger fare (for example, a 1st class fare is generally higher than a 2nd or 3rd
    class fare).
9. cabin: Cabin number.
10. embarked: Port of embarkation, that is, where the passenger boarded the ship (C =
    Cherbourg, Q = Queenstown, S = Southampton).
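To illustrate how rows with these variables might be read into typed values, here is a hedged sketch using Python's standard `csv` module. The header follows the column layout of the Kaggle `train.csv` file; the two data rows and the helper `parse_row` are invented for illustration and are not real passenger records.

```python
import csv
import io

# Header follows the Kaggle train.csv layout.
# The two rows below are invented examples, not real passenger records.
SAMPLE_CSV = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Doe, Mr. John",male,30,1,0,T 0001,8.05,,S
2,1,1,"Roe, Mrs. Jane",female,,0,1,T 0002,80.0,B20,C
"""


def parse_row(row):
    """Convert the string fields of one CSV row into typed values.
    Empty strings (missing data) become None."""
    return {
        "survived": int(row["Survived"]),
        "pclass": int(row["Pclass"]),
        "sex": row["Sex"],
        "age": float(row["Age"]) if row["Age"] else None,
        "sibsp": int(row["SibSp"]),
        "parch": int(row["Parch"]),
        "ticket": row["Ticket"],
        "fare": float(row["Fare"]) if row["Fare"] else None,
        "cabin": row["Cabin"] or None,
        "embarked": row["Embarked"] or None,
    }


passengers = [parse_row(r) for r in csv.DictReader(io.StringIO(SAMPLE_CSV))]
for p in passengers:
    print(p)
```

Representing missing fields as `None` keeps them distinguishable from genuine values such as an age or fare of 0.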
       Logistic Regression:
A simple yet crisp description of logistic regression would be, “it is a supervised
learning classification algorithm used to predict the probability of a target variable. The
nature of target or dependent variable is dichotomous, which means there would be
only two possible classes,” as stated in the Tutorials Point article.
[Figure: graph of the logistic regression curve]
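The curve of logistic regression is the logistic (sigmoid) function, sigmoid(z) = 1 / (1 + e^(-z)), which maps any real number to a probability between 0 and 1. The sketch below shows one way to compute it and to turn a weighted sum of features into a survival probability. The helper names `sigmoid` and `predict_proba`, and the example weights, are assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real z to a value between 0 and 1.
    The two branches avoid overflow in math.exp for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(weights, bias, features):
    """Probability of the positive class (here: survived = 1)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical weights for two features; the values are illustrative only.
print(sigmoid(0.0))
print(predict_proba([2.0, 0.5], -1.0, [1, 0]))
```

A probability of 0.5 or more is usually mapped to class 1 and anything lower to class 0, which gives the two possible classes mentioned above.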
What is Training Dataset?
The training data is the largest subset of the original dataset and is used to train or fit
the machine learning model. The training data is fed to the ML algorithm first, which lets it
learn how to make predictions for the given task.
       What is Test Dataset?
Once the model has been trained on the training dataset, it is tested on the test dataset.
This dataset is used to evaluate the performance of the model and to check that the model
generalizes well to new or unseen data. The test dataset is another subset of the original
data that is independent of the training dataset. However, it has similar types of features
and a similar class probability distribution, so it serves as a benchmark for model evaluation
once training is completed. Test data is a well-organized dataset that contains data for each
type of scenario the model would face when used in the real world.
Usually, the test dataset is approximately 20-25% of the total original data for an ML project.
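The split into training and test subsets can be sketched as follows. This is an illustrative helper written with the standard library only. The name `train_test_split`, the 20% test fraction, and the seed are assumptions; the Kaggle competition itself already provides separate training and test files.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists.
    A fixed seed makes the split reproducible."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in records: integers 0..99 play the role of passenger rows.
records = list(range(100))
train_rows, test_rows = train_test_split(records, test_fraction=0.2, seed=42)
print(len(train_rows), len(test_rows))
```

Shuffling before splitting helps both subsets follow a similar class distribution, and keeping the two lists disjoint keeps the test data independent of the training data.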
Accuracy:
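The accuracy of the model is computed from the confusion matrix using the formula:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives,
and false negatives.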
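As a small worked example, accuracy is the number of correct predictions (true positives plus true negatives) divided by the total number of predictions. The sketch below counts the four confusion-matrix cells for a binary problem and computes accuracy from them. The label lists are invented for illustration and the function names are assumptions.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, true negatives, false positives and
    false negatives for binary labels (1 = survived, 0 = not survived)."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 0 and predicted == 0:
            tn += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Accuracy = correct predictions / all predictions."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("accuracy is undefined for zero predictions")
    return (tp + tn) / total


# Invented labels for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn, accuracy(tp, tn, fp, fn))  # 3 3 1 1 0.75
```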
Workflow
CODE & RESULT:
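As a hedged illustration of the approach described in this report (not the project's actual code or results), the sketch below fits a logistic regression model with plain-Python batch gradient descent. The toy data is invented and is not the Kaggle data: each group lists two binary features, the number of passengers, and how many of them survived. The function names, learning rate, and number of epochs are assumptions.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_logistic_regression(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(bias + sum(w * x for w, x in zip(weights, xi)))
            error = p - yi  # gradient of the log loss w.r.t. the logit
            for j, xj in enumerate(xi):
                grad_w[j] += error * xj
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights, bias, xi, threshold=0.5):
    """Return 1 (survived) if the predicted probability reaches threshold."""
    z = bias + sum(w * x for w, x in zip(weights, xi))
    return 1 if sigmoid(z) >= threshold else 0


# Invented toy data: (is_female, is_first_class, passengers, survivors).
groups = [(0, 0, 5, 1), (0, 1, 3, 1), (1, 0, 3, 2), (1, 1, 5, 4)]
X, y = [], []
for is_female, is_first_class, count, survivors in groups:
    for i in range(count):
        X.append([is_female, is_first_class])
        y.append(1 if i < survivors else 0)

weights, bias = train_logistic_regression(X, y)
predictions = [predict(weights, bias, xi) for xi in X]
train_accuracy = sum(p == t for p, t in zip(predictions, y)) / len(y)
print(weights, bias, train_accuracy)
```

In practice a library implementation would normally be used instead of a hand-written training loop. The loop is spelled out here only to make the steps of fitting and predicting visible.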
Conclusion:
The analysis revealed interesting patterns across individual-level features. Factors such as
socioeconomic status, social norms, and family composition appeared to have an impact on the
likelihood of survival. These conclusions, however, were derived from findings in the given
data set.
We observed that the female survival rate is very high (approx. 74%), while the male survival
rate is very low. Logistic regression was the primary technique used to make predictions in
this classification problem.
It would be interesting to explore the dataset further and to introduce more attributes, which
might lead to better results. Other machine learning techniques, such as Naive Bayes and K-NN
classification, could also be used to solve the problem.
References:
[1] Kaggle, Titanic: Machine Learning from Disaster [Online]. Available:
http://www.kaggle.com/
[2] Tryambak Chatterjee, "Prediction of Survivors in Titanic Dataset: A Comparative Study
using Machine Learning Algorithms", IJERMT, 2017.
[3] Eric Lam, Chongxuan Tang, "Titanic Machine Learning from Disaster", LamTang-Titanic
Machine Learning From Disaster, 2012.
[4] "Analyzing Titanic disaster using machine learning algorithms", Computing, Communication
and Automation (ICCCA), 2017 International Conference on, 21 December 2017, IEEE.
[5] https://towardsdatascience.com/predicting-thesurvival-of-titanic-passengers-Niklas
Donges
[6] https://www.analyticsvidhya.com/machine-learning
[7] Wikipedia. Logistic Regression [Online]. Available:
https://en.wikipedia.org/wiki/Logistic_regression