Practical Session 1: Exploratory Data Analysis
Exercise 1
We shall explore the Ames House Data [1]. The dataset contained 113 variables describing 3970 property sales that
had occurred in Ames, Iowa between 2006 and 2010. The variables were a mix of nominal, ordinal, continuous, and
discrete variables used in the calculation of assessed values and included physical property measurements in addition
to computation variables used in the city's assessment process. The aim of this practical session is to explore this dataset.
Importation and preparation of the data
  1. Import the data and store it in a dataframe df. Are there any missing values?
  2. Some features will not be relevant to our exploratory analysis because they have too many missing
     values (such as Alley and PoolQC). Remove Id and the features with 30% or more NaN values.
  3. Plot the histogram of the variable SalePrice. What do you deduce?
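
The three steps above can be sketched with pandas as follows. This is only one possible approach, not a reference solution: the file name ames.csv is hypothetical, and the small frame below contains made-up rows so that the sketch runs without the real file.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# In the session: df = pd.read_csv("ames.csv")   # hypothetical file name
# Made-up rows stand in for the real file so that this sketch runs on its own.
df = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5, 6],
    "LotArea": [8450, 9600, 11250, 9550, 14260, 14115],
    "Alley": [np.nan, np.nan, "Grvl", np.nan, np.nan, np.nan],
    "PoolQC": [np.nan] * 6,
    "GarageType": ["Attchd", "Attchd", np.nan, "Detchd", "Attchd", "Attchd"],
    "SalePrice": [208500, 181500, 223500, 140000, 250000, 143000],
})

# 1. Missing values per column (only the columns that have some).
n_missing = df.isna().sum()
print(n_missing[n_missing > 0])

# 2. Drop Id and every feature with 30% or more NaN values.
nan_share = df.isna().mean()  # fraction of NaN per column
to_drop = nan_share[nan_share >= 0.30].index.tolist()
df = df.drop(columns=["Id"] + to_drop)

# 3. Histogram of SalePrice.
fig, ax = plt.subplots()
ax.hist(df["SalePrice"], bins=5, edgecolor="black")
ax.set_xlabel("SalePrice")
ax.set_ylabel("Number of sales")
fig.savefig("saleprice_hist.png")
```

On the real data, more bins (for example bins=50) give a finer picture; the shape of the histogram, in particular any asymmetry or isolated extreme values, is what the question asks you to comment on.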
Numerical features analysis
  1. List the data types present in the dataset and keep only the numerical features.
  2. Plot the histograms of all these features.
  3. Select the features for which the correlation with SalePrice is greater than 0.5. Give the correlation matrix
     between these numerical features. Comment!
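
A minimal sketch of this numerical analysis is given below. It assumes the data are already in a dataframe df; here df is filled with randomly generated values (the column names GrLivArea, OverallQual, YrSold and Neighborhood are used purely for illustration), so the resulting numbers say nothing about the real dataset.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the prepared Ames dataframe (values are random).
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "GrLivArea": rng.normal(1500, 400, n),
    "OverallQual": rng.integers(1, 11, n),
    "YrSold": rng.integers(2006, 2011, n),
    "Neighborhood": rng.choice(["NAmes", "CollgCr"], n),
})
df["SalePrice"] = (100 * df["GrLivArea"] + 15000 * df["OverallQual"]
                   + rng.normal(0, 10000, n))

# 1. Data types present, then keep the numerical columns only.
print(df.dtypes.value_counts())
df_num = df.select_dtypes(include="number")

# 2. One histogram per numerical feature.
df_num.hist(bins=20, figsize=(8, 6))
plt.tight_layout()
plt.savefig("numeric_hists.png")

# 3. Features whose correlation with SalePrice exceeds 0.5,
#    then the correlation matrix restricted to them (plus SalePrice).
corr_with_price = df_num.corr()["SalePrice"].drop("SalePrice")
strong = corr_with_price[corr_with_price > 0.5].index.tolist()
corr_matrix = df_num[strong + ["SalePrice"]].corr()
print(corr_matrix.round(2))
```

DataFrame.corr computes Pearson correlations by default. When commenting on the matrix, it is worth looking at the correlations among the selected features themselves, not only with SalePrice.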
Include categorical features in the pipeline
We want to visualise the impact of categorical features on sale prices.
  1. Select the categorical features.
  2. Display the box plot of the variable SalePrice as a function of the levels of the variable BsmtExposure.
     Comment.
  3. Display the box plot of the variable SalePrice as a function of the levels of the variable SaleCondition.
     Comment.
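
One way to carry out the three steps above is sketched here with pandas and matplotlib. The rows are made up for illustration, so the medians printed at the end do not reflect the real data.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows standing in for the prepared Ames dataframe.
df = pd.DataFrame({
    "BsmtExposure": ["No", "Gd", "Mn", "Av", "No", "Gd", "Av", "No"],
    "SaleCondition": ["Normal", "Abnorml", "Normal", "Partial",
                      "Normal", "Partial", "Abnorml", "Normal"],
    "SalePrice": [140000, 250000, 160000, 200000,
                  150000, 270000, 210000, 130000],
})

# 1. Categorical features: every column that is not numerical.
cat_cols = df.select_dtypes(exclude="number").columns.tolist()
print(cat_cols)

# 2. and 3. Box plot of SalePrice for each level of the two features.
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for ax, col in zip(axes, ["BsmtExposure", "SaleCondition"]):
    df.boxplot(column="SalePrice", by=col, ax=ax)
    ax.set_title(f"SalePrice by {col}")
fig.suptitle("")  # remove the automatic "Boxplot grouped by ..." title
fig.savefig("saleprice_boxplots.png")

# Numerical counterpart of the box plots: median price per level.
medians = df.groupby("BsmtExposure")["SalePrice"].median()
print(medians)
```

The per-level medians are a useful complement to the plots when the boxes overlap and the ordering of levels is hard to read by eye.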
  [1] http://jse.amstat.org/v19n3/decock.pdf
Exercise 2
The RMS Titanic was a British passenger liner that sank in the North Atlantic Ocean in the early morning hours
of 15 April 1912, after it collided with an iceberg during its maiden voyage from Southampton to New York City.
There were an estimated 2,224 passengers and crew aboard the ship, and more than 1,500 died, making it one of
the deadliest commercial peacetime maritime disasters in modern history.
   Women and children first? The aim is to understand how the survivors of the Titanic were selected...
Importation of the data and description of the dataset
  1. In this first practical session, we shall work on the dataset titanic.csv on the survival of the passengers of
     the Titanic. Load this dataset into a data frame.
  2. Describe the dataset titanic: features, nature of the features, number of observations.
  3. Compute basic statistics: mean of each variable, quartiles.
  4. Compute the percentage of missing values for each column and sort the result in descending order.
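
The description steps above might look as follows in pandas. The sketch assumes the file is read with pd.read_csv("titanic.csv"); so that it runs on its own, a few invented rows are used instead, and the column names (Survived, Pclass, Sex, Age, Cabin, Embarked) are assumptions about the file that should be checked against df.columns.

```python
import numpy as np
import pandas as pd

# In the session: df = pd.read_csv("titanic.csv")
# Invented rows with assumed column names, for illustration only.
df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Survived": [0, 1, 1, 0, 1],
    "Pclass": [3, 1, 3, 2, 1],
    "Sex": ["male", "female", "female", "male", "female"],
    "Age": [22.0, 38.0, np.nan, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan, np.nan],
    "Embarked": ["S", "C", "S", "S", "Q"],
})

# 2. Features, their types, and the number of observations.
n_obs, n_features = df.shape
print(n_obs, "observations,", n_features, "features")
print(df.dtypes)

# 3. Mean, quartiles and other summary statistics of the numerical columns.
stats = df.describe()
print(stats.loc[["mean", "25%", "50%", "75%"]])

# 4. Percentage of missing values per column, largest first.
missing_pct = (df.isna().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```

By default describe only summarises numerical columns; df.describe(include="all") also reports counts and most frequent values for the categorical ones.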
Basic graphic analysis
We want to understand which features could contribute to a high survival rate. It would make sense for every feature
except ’PassengerId’, ’Ticket’ and ’Name’ to be correlated with the survival rate.
  1. Get rid of the features ’PassengerId’, ’Ticket’ and ’Name’, which seem irrelevant to the analysis.
  2. We focus on the features ’Age’ and ’Sex’.
      (i) Separate the dataset into men and women
      (ii) Display the age distributions of survivors and non-survivors for each sex. Comment.
  3. At first glance, is there a link between ’Embarked’ and ’Survival’?
  4. At first glance, is there a link between ’Pclass’ and ’Survival’?
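
A possible sketch of this graphic analysis is shown below. The survival column is assumed to be named Survived and coded 0/1 (the sheet calls it ’Survival’, so adapt TARGET to the actual file), and the rows are invented, so the rates computed here are not results about the real passengers.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

TARGET = "Survived"  # assumed column name, coded 0 (died) / 1 (survived)

# Invented rows standing in for titanic.csv.
df = pd.DataFrame({
    "PassengerId": range(1, 9),
    TARGET: [0, 1, 0, 1, 0, 0, 1, 1],
    "Pclass": [3, 1, 3, 1, 3, 2, 2, 3],
    "Name": [f"Passenger {i}" for i in range(1, 9)],
    "Sex": ["male", "female", "female", "female",
            "male", "male", "female", "male"],
    "Age": [22, 38, 26, 35, np.nan, 54, 14, 2],
    "Ticket": [f"T{i}" for i in range(1, 9)],
    "Embarked": ["S", "C", "S", "S", "Q", "S", "C", "S"],
})

# 1. Drop the features judged irrelevant.
df = df.drop(columns=["PassengerId", "Ticket", "Name"])

# 2(i). Separate men and women.
men = df[df["Sex"] == "male"]
women = df[df["Sex"] == "female"]

# 2(ii). Age histograms of survivors and non-survivors, one panel per sex.
bins = range(0, 90, 10)
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for ax, (title, part) in zip(axes, [("Men", men), ("Women", women)]):
    for value, label in [(1, "survived"), (0, "did not survive")]:
        ages = part.loc[part[TARGET] == value, "Age"].dropna()
        ax.hist(ages, bins=bins, alpha=0.5, label=label)
    ax.set_title(title)
    ax.set_xlabel("Age")
    ax.legend()
fig.savefig("age_by_sex.png")

# 3. and 4. Survival rate per port of embarkation and per class.
rate_by_port = df.groupby("Embarked")[TARGET].mean()
rate_by_class = df.groupby("Pclass")[TARGET].mean()
print(rate_by_port)
print(rate_by_class)
```

Because the target is coded 0/1, the group mean is directly the survival rate of each group; a bar plot of rate_by_port or rate_by_class (for example rate_by_class.plot.bar()) gives the same information graphically.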