ML PROJECT REPORT
Koushik Tumati
PROBLEM:
Context: The sinking of the Titanic is one of the most infamous shipwrecks in history.
Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of
1502 out of 2224 passengers and crew.
Objective: To build a predictive model that answers the question "What sorts of people were
more likely to survive?" using passenger data (e.g., name, age, gender, socio-economic class).
Tools and libraries used:
Language: Python
Libraries: Numpy, Pandas, Matplotlib, Seaborn, Sklearn
DATASETS:
The dataset was obtained from Kaggle. It contains 12 columns (features) and 891 rows.
 Variable      Definition                                   Key
 Survived      Survival                                     0 = No, 1 = Yes
 PassengerId   Serial ID number
 Pclass        Ticket class                                 1 = Upper, 2 = Middle, 3 = Lower
 Name          Passenger name
 Sex           Sex
 Age           Age in years
 SibSp         # of siblings / spouses aboard the Titanic
 Parch         # of parents / children aboard the Titanic
 Ticket        Ticket number
 Fare          Passenger fare
 Cabin         Cabin number
 Embarked      Port of Embarkation                          C = Cherbourg, Q = Queenstown, S = Southampton
APPROACH
   1) DATA CLEANING:
       1) Familiarising with the data:
       First, the libraries mentioned above are imported and the dataset is loaded with pandas'
       read_csv function. To get familiar with the dataset, the column types, basic descriptive
       statistics and general information are obtained through methods such as .head(),
       .describe() and .info() (a short loading and inspection sketch follows the list below).
       After performing these operations, the following conclusions are drawn:
           - "Survived" is the dependent variable that we predict from the other, independent
             variables.
           - PassengerId and Ticket are essentially arbitrary identifiers and cannot contribute
             to the model.
           - Sex and Embarked are nominal variables and need to be converted to dummy variables.
           - Age and Fare are continuous variables. (Note that Age is not a discrete number but a
             continuous variable, because infants below 1 year and some estimated adult ages have
             fractional values, e.g. 25.6.)
           - SibSp and Parch are discrete numeric variables.
           - Cabin is nominal and contains a large number of null values, which makes it
             uninformative, so it is dropped from the dataset.
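       A minimal sketch of this loading and inspection step, assuming the Kaggle training file
       is saved as train.csv:

           import pandas as pd

           # Load the Kaggle training file (the file name is an assumption).
           df = pd.read_csv("train.csv")

           print(df.head())          # first few rows
           df.info()                 # column dtypes and non-null counts (prints directly)
           print(df.describe())      # basic descriptive statistics for numeric columns
           print(df.isnull().sum())  # number of missing values per column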
      2) Checking for Outliers and Missing Data:
         After examination, no significant outliers are found, but missing values are present in
         the Age, Cabin and Embarked columns. Null values in Age are filled with the median and
         nulls in Embarked with the mode, matching their quantitative and qualitative types
         respectively. Cabin is dropped because of its huge number of null values. Note that if
         missing records cannot be reasonably filled and are few in number, those rows can simply
         be dropped to keep the model accurate.
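         A sketch of the missing-value handling described above, assuming the DataFrame df from
         the loading step and the standard Kaggle column names:

             df["Age"] = df["Age"].fillna(df["Age"].median())                  # quantitative: median
             df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])  # qualitative: mode
             df = df.drop(columns=["Cabin"])                                   # too many nulls to impute usefully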
      3) Handling Datatypes and Feature Engineering:
          Feature engineering means creating new, useful features from existing columns. In this
          dataset we can extract the title (Mr, Dr, etc.) from the Name column into a new column,
          and we can create a family_size variable from the SibSp and Parch variables.
          Columns also need to be checked for datatypes that require special handling, such as
          datetime and currency. Luckily, this dataset has no such complex datatypes. However,
          the categorical variables must be converted to numerical dummy variables for the
          model's calculations. There are several ways to do this, such as one-hot encoding; I
          used some sklearn and pandas functions for this purpose. The data frame is split and
          altered a few times in this step to obtain a model-compatible dataset with dummy
          variables.
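          A sketch of the feature engineering and encoding described above; the title regular
          expression, the new column names and the exact set of dropped columns are illustrative
          assumptions:

              import pandas as pd

              # Extract the title (Mr, Mrs, Dr, ...) from the Name column.
              df["Title"] = df["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)
              # Family size = the passenger plus all siblings/spouses/parents/children aboard.
              df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

              # Drop columns that cannot help the model (random identifiers, raw text).
              df = df.drop(columns=["PassengerId", "Ticket", "Name"])

              # One-hot encode the nominal variables into numeric dummy columns.
              df = pd.get_dummies(df, columns=["Sex", "Embarked", "Title"], drop_first=True)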
      4) Train and Test Data Split:
          To detect overfitting and evaluate the model fairly, the dataset is divided into an
          80:20 train/test split and the model is trained on the training set. The model is then
          tested on the test set to check its accuracy. If the accuracy is not satisfactory, we
          make the necessary changes to the model or choose a different algorithm until we obtain
          an optimised model.
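          A sketch of the 80:20 split, assuming the encoded DataFrame df with the target column
          Survived (the random seed and stratification are assumptions):

              from sklearn.model_selection import train_test_split

              X = df.drop(columns=["Survived"])   # independent variables
              y = df["Survived"]                  # dependent variable

              X_train, X_test, y_train, y_test = train_test_split(
                  X, y, test_size=0.2, random_state=42, stratify=y
              )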
2) EXPLORATORY DATA ANALYSIS (EDA):
    It is important to note that EDA is not always done only after data cleaning is complete. The
    two can alternate iteratively: we extract relationships between features, then continue
    cleaning, and repeat until a satisfactory dataset and set of insights are obtained. Given the
    simplicity of this particular dataset, however, that is not needed here. In this step, we
    summarise the variables and their relationships with the target variable using visualisation
    packages such as matplotlib and seaborn. In this dataset we find observations such as: the
    survival rate of women is much higher than that of men; holders of higher-fare tickets had a
    higher chance of survival; and children had a higher chance of survival, while people over 65
    had very little chance. In this small dataset most observations are intuitive, but in large
    datasets we can find intriguing, counterintuitive observations; in that case we have to
    validate the dataset and find the underlying reason for the counterintuitive result.
    Sample plots are attached below.
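    A sketch of the kind of plots summarised above (not the report's exact figures), drawn from
    the raw Kaggle columns before dummy encoding:

        import matplotlib.pyplot as plt
        import seaborn as sns

        fig, axes = plt.subplots(1, 3, figsize=(15, 4))

        sns.barplot(x="Sex", y="Survived", data=df, ax=axes[0])     # survival rate by sex
        sns.barplot(x="Pclass", y="Survived", data=df, ax=axes[1])  # survival rate by ticket class
        sns.histplot(data=df, x="Age", hue="Survived",
                     multiple="stack", ax=axes[2])                  # age distribution by survival

        plt.tight_layout()
        plt.show()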
   3) BUILDING MODELS:
      Now comes the simple yet crucial step of selecting an algorithm for the predictive model.
      Although the model itself is practically a few lines of code, the underlying statistics are
      vast and complex. Due to my limited knowledge of these algorithms, I used only simple
      Logistic Regression and Decision Tree models.
      We fit the models on the training dataset and generate predictions for the test data. Then
      we compare the predictions with the test labels and gauge the model using the confusion
      matrix, the F1 score and several other possible metrics; I used only these two.
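      A sketch of the two models and the two metrics mentioned above, assuming the train/test
      split from the previous step (the specific model settings are assumptions):

          from sklearn.linear_model import LogisticRegression
          from sklearn.tree import DecisionTreeClassifier
          from sklearn.metrics import confusion_matrix, f1_score

          models = {
              "Logistic Regression": LogisticRegression(max_iter=1000),
              "Decision Tree": DecisionTreeClassifier(random_state=42),
          }

          for name, model in models.items():
              model.fit(X_train, y_train)                  # fit on the training set
              preds = model.predict(X_test)                # predict on the held-out test set
              print(name)
              print(confusion_matrix(y_test, preds))       # rows: actual, columns: predicted
              print("F1 score:", f1_score(y_test, preds))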
   4) Improving the Model and Its Accuracy:
      I stopped my model at the above step because of my lack of knowledge of other algorithms at
      the time. We could apply different algorithms and keep the best among them. We can also
      improve the model's accuracy by tuning its parameters and hyperparameters. After such
      improvements, we obtain the finalised, optimised model.
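      As one example of the tuning referred to above, a cross-validated grid search over decision
      tree hyperparameters might look like this (the parameter grid and scoring choice are
      assumptions, not the report's actual search):

          from sklearn.model_selection import GridSearchCV
          from sklearn.tree import DecisionTreeClassifier

          # Illustrative parameter grid.
          param_grid = {
              "max_depth": [3, 5, 7, None],
              "min_samples_leaf": [1, 5, 10],
          }

          search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                                param_grid, cv=5, scoring="f1")
          search.fit(X_train, y_train)

          print("Best parameters:", search.best_params_)
          print("Best cross-validated F1:", search.best_score_)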
Other Key Points:
      - At every point we should be careful to avoid overfitting the model. For some algorithms,
        hyperparameter search iterates many times to find the best parameters, and this can
        overfit the model to the test data. In that case we split the dataset into three sets:
        train, validation and test (see the sketch after this list).
      - Giving too much data to the test and validation sets leaves insufficient training data to
        build an accurate model, while giving too little data to the test set makes the accuracy
        check unreliable. Splits of (70, 15, 15) or (80, 10, 10) are the usual standards.
      - Sometimes more features (columns) are needed and sometimes more rows are needed for a
        better model. We identify which case applies by observing suitable metrics.
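      A sketch of the three-way split mentioned in the first point, using two successive calls to
      train_test_split to obtain roughly a 70/15/15 division (the seed and stratification are
      assumptions):

          from sklearn.model_selection import train_test_split

          # First hold out 30% of the data, then split that hold-out half-and-half
          # into validation and test sets.
          X_train, X_temp, y_train, y_temp = train_test_split(
              X, y, test_size=0.30, random_state=42, stratify=y
          )
          X_val, X_test, y_val, y_test = train_test_split(
              X_temp, y_temp, test_size=0.50, random_state=42, stratify=y_temp
          )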