
Practical Session 1: Exploratory Data Analysis: Exercise 1

The document outlines two practical sessions focused on exploratory data analysis using the Ames House Data and Titanic dataset. The first session involves data importation, cleaning, and analysis of property sales in Ames, Iowa, including visualizations of SalePrice and correlations with numerical and categorical features. The second session examines the Titanic dataset to analyze survival rates based on various features, including gender and class, while also emphasizing data cleaning and basic statistical analysis.

Uploaded by

Husam hr
Practical Session 1 : Exploratory Data Analysis

Exercise 1
We shall explore the Ames House Data¹. The dataset contained 113 variables describing 3970 property sales that
had occurred in Ames, Iowa between 2006 and 2010. The variables were a mix of nominal, ordinal, continuous, and
discrete variables used in the calculation of assessed values and included physical property measurements in addition
to computation variables used in the city. The aim of this practical session is to explore this dataset.

Importation and preparation of the data


1. Import the data and store it in a dataframe df. Are there any missing values?
2. We can see that some features will not be relevant in our exploratory analysis as they have too many missing
values (such as Alley and PoolQC). Remove Id and the features with 30% or more NaN values.

3. Plot the histogram of the variable SalePrice. What do you deduce?
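
As a possible starting point for the three questions above, here is a minimal sketch with pandas. The file path, the helper names drop_sparse_columns and plot_saleprice_hist, and the toy values are assumptions made for illustration; the toy frame only stands in for the real Ames file.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df, threshold=0.30):
    """Drop 'Id' (if present) and every column whose NaN share is >= threshold."""
    nan_share = df.isna().mean()  # fraction of missing values per column
    sparse = nan_share[nan_share >= threshold].index.tolist()
    return df.drop(columns=["Id"] + sparse, errors="ignore")


def plot_saleprice_hist(df, bins=50):
    """Histogram of SalePrice (matplotlib is imported lazily and must be installed)."""
    import matplotlib.pyplot as plt
    ax = df["SalePrice"].plot.hist(bins=bins)
    ax.set_xlabel("SalePrice")
    plt.show()


# Toy stand-in for the real data, e.g. df = pd.read_csv("ames.csv") (hypothetical path).
toy = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "Alley": [np.nan, np.nan, "Pave", np.nan],   # 75% missing, so it is dropped
    "LotArea": [8450, 9600, np.nan, 9550],       # 25% missing, so it is kept
    "SalePrice": [208500, 181500, 223500, 140000],
})

missing_counts = toy.isna().sum()   # question 1: which columns have missing values?
cleaned = drop_sparse_columns(toy)  # question 2: remove Id and sparse features
print(missing_counts[missing_counts > 0])
print(cleaned.columns.tolist())
```

Calling plot_saleprice_hist(cleaned) on the real data would then draw the histogram asked for in question 3.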

Numerical features analysis


1. List all the data types in the dataset and keep only the numerical features.

2. Plot the histograms of all these features.


3. Select the features for which the correlation with SalePrice is greater than 0.5. Give the correlation matrix
between these numerical features. Comment!
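
One way to approach the numerical analysis is sketched below. The helper name strong_numeric_features and the toy values are assumptions; GrLivArea, YrSold and Street are used here only as example column names.

```python
import pandas as pd


def strong_numeric_features(df, target="SalePrice", min_corr=0.5):
    """Names of numeric columns whose Pearson correlation with target exceeds min_corr."""
    numeric = df.select_dtypes(include="number")
    corr_with_target = numeric.corr()[target].drop(target)
    return corr_with_target[corr_with_target > min_corr].index.tolist()


# Toy stand-in for the cleaned Ames dataframe (values are made up).
toy = pd.DataFrame({
    "GrLivArea": [1000, 1500, 2000, 2500, 3000],
    "YrSold": [2008, 2006, 2010, 2007, 2009],
    "Street": ["Pave"] * 5,
    "SalePrice": [100000, 160000, 190000, 260000, 300000],
})

print(toy.dtypes.value_counts())            # question 1: data types present
numeric = toy.select_dtypes(include="number")
# Question 2: numeric.hist(bins=50) draws one histogram per column (needs matplotlib).

selected = strong_numeric_features(toy)     # question 3: correlation > 0.5
corr_matrix = toy[selected + ["SalePrice"]].corr()
print(corr_matrix)
```

Note that the threshold is applied to the signed correlation, as in the question; using the absolute value instead would also keep strongly negative correlations.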

Include categorical features in the pipeline


We want to visualise the impact of categorical features on sale prices.
1. Select the categorical features.
2. Display the box plot of the variable SalePrice as a function of the different levels of the variable BsmtExposure.
Comment.
3. Display the box plot of the variable SalePrice as a function of the different levels of the variable SaleCondition.
Comment.
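
The categorical questions above could be approached as in the following sketch. The helper name boxplot_by_level and the toy values are assumptions; the per-level medians are only a numerical companion to the box plots, not a replacement for them.

```python
import pandas as pd


def boxplot_by_level(df, feature, target="SalePrice"):
    """Box plot of target for each level of feature (matplotlib must be installed)."""
    import matplotlib.pyplot as plt
    ax = df.boxplot(column=target, by=feature)
    ax.set_ylabel(target)
    plt.show()


# Toy stand-in for the cleaned Ames dataframe (values are made up).
toy = pd.DataFrame({
    "BsmtExposure": ["No", "Gd", "No", "Gd", "Av", "Av"],
    "SaleCondition": ["Normal", "Partial", "Abnorml", "Partial", "Normal", "Normal"],
    "SalePrice": [120000, 300000, 140000, 340000, 180000, 200000],
})

# Question 1: everything that is not numeric is treated as categorical here.
categorical = toy.select_dtypes(exclude="number").columns.tolist()

# Median SalePrice per level, the central line of each box in the box plot.
medians = toy.groupby("BsmtExposure")["SalePrice"].median().sort_values()
print(categorical)
print(medians)
```

On the real data, boxplot_by_level(df, "BsmtExposure") and boxplot_by_level(df, "SaleCondition") would produce the plots for questions 2 and 3.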

1 http://jse.amstat.org/v19n3/decock.pdf

Exercise 2
The RMS Titanic was a British passenger liner that sank in the North Atlantic Ocean in the early morning hours
of 15 April 1912, after it collided with an iceberg during its maiden voyage from Southampton to New York City.
There were an estimated 2,224 passengers and crew aboard the ship, and more than 1,500 died, making it one of
the deadliest commercial peacetime maritime disasters in modern history.

Women and children first? The aim is to understand how the survivors of the Titanic were selected...

Importation of the data and description of the dataset


1. In this first practical session, we shall work on the dataset titanic.csv on the survival of the passengers of
the Titanic. Load this dataset into a data frame.

2. Describe the titanic dataset: features, nature of the features, number of observations.
3. Compute basic statistics: mean of each variable, quartiles.
4. Compute the percentage of missing values for each column, sorted in descending order.
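
A minimal sketch of these four steps with pandas follows. The helper name missing_percentage and the toy values are assumptions, as are the column names Survived and Cabin, which follow the commonly distributed version of titanic.csv and may differ in the file used here.

```python
import numpy as np
import pandas as pd


def missing_percentage(df):
    """Percentage of missing values per column, sorted in descending order."""
    return (df.isna().mean() * 100).sort_values(ascending=False)


# Toy stand-in for the real file, e.g. titanic = pd.read_csv("titanic.csv").
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
})

n_obs, n_features = toy.shape   # number of observations and of features
print(toy.dtypes)               # nature of each feature
stats = toy.describe()          # count, mean, std, min, quartiles, max (numeric columns)
pct = missing_percentage(toy)
print(stats)
print(pct)
```

By default describe() only summarises numeric columns; passing include="all" would also report counts and most frequent values for the other columns.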

Basic graphic analysis


We want to understand which features could contribute to a high survival rate. It would make sense if everything
except ’PassengerId’, ’Ticket’ and ’Name’ were correlated with a high survival rate.
1. Get rid of the features ’PassengerId’, ’Ticket’ and ’Name’, which seem irrelevant for analysing the data.

2. We focus on the features ’Age’ and ’Sex’.


(i) Separate the dataset into men and women
(ii) Display the age distribution of survivors and non-survivors for each sex. Comment.
3. At first glance, is there a link between ’Embarked’ and ’Survival’?

4. At first glance, is there a link between ’Pclass’ and ’Survival’?
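
The questions above can be explored along the lines of the sketch below. The survival column is assumed to be named Survived and coded 0/1; the helper name plot_age_by_survival and all toy values are made up for illustration.

```python
import pandas as pd


def plot_age_by_survival(group, title):
    """Overlaid age histograms of survivors and non-survivors (needs matplotlib)."""
    import matplotlib.pyplot as plt
    for survived, ages in group.groupby("Survived")["Age"]:
        ages.plot.hist(alpha=0.5, bins=20, label=f"Survived={survived}")
    plt.title(title)
    plt.xlabel("Age")
    plt.legend()
    plt.show()


# Toy stand-in for the Titanic data frame.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5, 6],
    "Name": ["A", "B", "C", "D", "E", "F"],
    "Ticket": ["t1", "t2", "t3", "t4", "t5", "t6"],
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Age": [22, 38, 26, 35, 54, 4],
    "Pclass": [3, 1, 3, 1, 3, 2],
    "Embarked": ["S", "C", "S", "S", "Q", "S"],
    "Survived": [0, 1, 1, 0, 0, 1],
})

df = toy.drop(columns=["PassengerId", "Ticket", "Name"])  # question 1
men = df[df["Sex"] == "male"]                             # question 2 (i)
women = df[df["Sex"] == "female"]

# Questions 3 and 4: the mean of a 0/1 column is the survival rate of each group.
rate_by_port = df.groupby("Embarked")["Survived"].mean()
rate_by_class = df.groupby("Pclass")["Survived"].mean()
print(rate_by_port)
print(rate_by_class)
```

For question 2 (ii), plot_age_by_survival(men, "Men") and plot_age_by_survival(women, "Women") would draw the two age distributions on the real data.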
