Week 6-7
Exploratory Data Analysis and Descriptive Statistics
(Graduate-Level Detail)
1. Introduction to Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) is the initial process of investigating datasets to
summarize their key characteristics, often through visualizations and summary
statistics. It helps data analysts and scientists to detect patterns, spot anomalies,
test hypotheses, and check assumptions. EDA is typically one of the first steps
taken in any data analysis project to better understand the data before applying
advanced models. The insights gained from EDA directly inform data cleaning,
feature engineering, and model selection.
Objectives of EDA:
Understand the structure and quality of data.
Identify trends, relationships, and anomalies.
Prepare data for modeling through cleaning, transformation, and feature
selection.
Discover important variables, relationships, and hidden patterns.
2. Exploring Basic Statistical Analysis Tools in the Python Pandas Library
The Pandas library in Python is one of the most popular tools for data analysis due
to its versatility and powerful functions. It provides a range of functions that allow
data exploration with ease and flexibility. Below are some essential tools and steps
used for EDA in Pandas:
Data Overview:
Loading the Data: Use read_csv() , read_excel() , etc., to load data into a
Pandas DataFrame.
Basic Inspection:
Head and Tail: The head() and tail() functions allow you to preview
the first or last few records of your dataset, helping in the initial
understanding of its structure.
Information Summary: info() provides data types, non-null counts,
and memory usage, which is useful for understanding the data
completeness and types of columns.
Summary Statistics: describe() gives statistical summaries of numeric
columns, including mean, standard deviation, min, max, and quartiles.
import pandas as pd

df = pd.read_csv('data.csv')  # Load the dataset into a DataFrame
print(df.head())              # Preview the first five rows
df.info()                     # Column types, non-null counts, memory usage (prints directly)
print(df.describe())          # Summary statistics for numeric columns
Data Types: The dtypes attribute can be used to verify the data types of
columns, helping identify whether transformations are needed.
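As a quick sketch (the column name here is hypothetical), a type check followed by a conversion might look like:

print(df.dtypes)  # dtype of every column

# Convert a column that was read as text into a numeric type;
# errors='coerce' turns unparseable entries into NaN
df['column'] = pd.to_numeric(df['column'], errors='coerce')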
Handling Missing Values:
Missing values are common in real-world datasets and can introduce
biases or inaccuracies in modeling if not handled correctly.
Identify Missing Values: Use isnull() and sum() to get an overview of
where missing data exists.
Strategies to Handle Missing Data:
Removal: If the missing values are minimal and randomly distributed,
rows or columns with missing values can be dropped using dropna() .
This approach is effective when the impact on data quality is
negligible.
Imputation: Fill missing values using central tendency metrics ( mean() ,
median() , mode() ) or predictive models.
Forward/Backward Filling: Methods like ffill() or bfill() can be
used in time-series data to fill missing values based on neighboring
data points (see the sketch after the code block below).
# Count missing values in each column
print(df.isnull().sum())
# Fill missing values with mean
df['column'] = df['column'].fillna(df['column'].mean())
# Drop rows with missing values
df.dropna(inplace=True)
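For time-series data, the forward/backward filling mentioned above might look like the following sketch (assuming rows are already sorted in time order):

# Propagate the last valid observation forward through gaps
df['column'] = df['column'].ffill()

# Fill any remaining leading gaps from the next valid observation
df['column'] = df['column'].bfill()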
Advanced Techniques: Use machine learning models for imputation (e.g.,
KNNImputer from Scikit-Learn) for more sophisticated handling of missing
data.
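A minimal sketch of KNN-based imputation with Scikit-Learn; the restriction to numeric columns and the choice of n_neighbors=5 are illustrative assumptions:

from sklearn.impute import KNNImputer

# Impute each missing value from the 5 most similar rows (numeric columns only)
numeric_cols = df.select_dtypes(include='number').columns
imputer = KNNImputer(n_neighbors=5)
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])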
Summary Statistics:
Central Tendency: Calculations such as Mean, Median, and Mode give
insights into the typical values of a dataset.
Mean provides the average of all values.
Median is less affected by outliers and provides a central point of the
distribution.
Mode is used particularly for categorical data to determine the most
frequent category.
Spread: Quantitative measures like Range, Variance, and Standard
Deviation help understand the dispersion of the data.
Range: Indicates the difference between maximum and minimum
values.
Variance and Standard Deviation: Variance represents the spread of
data points around the mean, while standard deviation is its square
root, making it more interpretable.
Skewness and Kurtosis:
Skewness tells us about the asymmetry in data distribution.
Kurtosis provides insight into the heaviness of the distribution's
tails and the presence of outliers.
print('Mean:', df['column'].mean())
print('Median:', df['column'].median())
print('Standard Deviation:', df['column'].std())
print('Skewness:', df['column'].skew())
print('Kurtosis:', df['column'].kurt())
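One addition to the snippet above: mode() returns a Series because ties are possible, so the first mode is usually selected explicitly:

print('Mode:', df['column'].mode().iloc[0])  # first of possibly several modes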
Application: High skewness and kurtosis can affect the performance of
statistical models, and transformations like log or square root may be
applied to normalize the data.
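As an illustration of such a transformation (assuming the column is non-negative; the new column name is hypothetical):

import numpy as np

# log1p(x) = log(1 + x) compresses large values and tolerates zeros
df['column_log'] = np.log1p(df['column'])
print('Skewness after log transform:', df['column_log'].skew())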
3. Correlation and Methods to Establish Causation
Correlation:
Definition: Correlation measures the strength and direction of a
relationship between two variables. The correlation coefficient (r) ranges
from -1 to 1.
Positive Correlation: As one variable increases, the other also
increases.
Negative Correlation: As one variable increases, the other decreases.
No Correlation: No apparent relationship between the variables.
Types of Correlation:
Pearson Correlation: Measures linear relationships and is sensitive to
outliers.
Spearman Rank Correlation: Measures monotonic relationships using
rank, less sensitive to outliers.
Kendall Tau: Used for ordinal data and helps understand relationships
between ranked variables (computed in the example after the code block below).
# Calculate Pearson correlation matrix (numeric_only=True restricts the
# calculation to numeric columns, avoiding errors in recent Pandas versions)
correlation_matrix = df.corr(method='pearson', numeric_only=True)
print(correlation_matrix)
# Calculate Spearman rank correlation
spearman_corr = df.corr(method='spearman', numeric_only=True)
print(spearman_corr)
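Kendall's Tau, mentioned in the list above, follows the same pattern:

# Calculate Kendall Tau correlation for ranked/ordinal relationships
kendall_corr = df.corr(method='kendall', numeric_only=True)
print(kendall_corr)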
Correlation Coefficient Formula:
Pearson Correlation Coefficient (r):
\( r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}} \)
Causation:
Definition: Causation implies that one event is the result of the occurrence
of the other event. Unlike correlation, causation requires evidence that
changing one variable will produce a change in another.
Proving Causation: Methods like Randomized Controlled Trials (RCTs) or
advanced statistical tests (e.g., Granger causality, A/B testing) are required
to demonstrate causation.
Important Consideration: Correlation does not imply causation. A high
correlation between two variables may be coincidental or influenced by an
unseen third variable (confounder).
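As a toy sketch of the A/B-testing idea (the 'group' and 'outcome' columns are entirely hypothetical), SciPy's two-sample t-test can compare a control and a treatment group:

from scipy import stats

# Hypothetical outcomes for control (A) and treatment (B) groups
group_a = df.loc[df['group'] == 'A', 'outcome']
group_b = df.loc[df['group'] == 'B', 'outcome']

# Welch's t-test: is the difference in mean outcomes statistically significant?
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
print('t =', t_stat, 'p =', p_value)

Keep in mind that it is the random assignment of units to groups, not the statistical test itself, that licenses a causal conclusion.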
4. Descriptive Statistics: Types and Formulas
Descriptive statistics summarize and describe the features of a dataset through
numerical and graphical summaries. These measures are key to understanding
the nature and distribution of the data.
Measures of Central Tendency:
Mean (Average): The sum of all values divided by the number of
observations.
Formula:
\( \text{Mean} = \frac{\sum_{i=1}^{n} x_i}{n} \)
Median: The middle value when all observations are sorted in ascending
or descending order. Median is particularly useful in skewed distributions
where the mean can be misleading.
Formula:
If \(n\) is odd, Median = value at position \((n + 1)/2\).
If \(n\) is even, Median = average of the values at positions \(n/2\) and \(n/2 + 1\).
Mode: The value that appears most frequently in the data, particularly
helpful for categorical data analysis.
Formula: No specific formula; determined based on frequency count.
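A quick sanity check of these formulas against Pandas, using a small hypothetical sample:

import pandas as pd

sample = pd.Series([3, 1, 4, 1, 5])  # hypothetical values, n = 5 (odd)

manual_mean = sample.sum() / len(sample)                           # sum of values / n
manual_median = sample.sort_values().iloc[(len(sample) - 1) // 2]  # position (n + 1) / 2

print(manual_mean, sample.mean())      # both 2.8
print(manual_median, sample.median())  # both 3.0
print(sample.mode().tolist())          # [1], the most frequent value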
Measures of Dispersion:
Range: The difference between the maximum and minimum values,
showing the spread of the data.
Formula:
Range = Max(x) − Min(x)
Variance: The average squared deviation from the mean; the sample formula
below divides by \(n - 1\) (Bessel's correction), which is also the Pandas
default ( var() uses ddof=1 ). Variance is a crucial measure of how spread
out the data is.
Formula:
\( \text{Variance}\ (\sigma^2) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1} \)
Standard Deviation: The square root of variance, providing a measure of
the average distance of each data point from the mean.
Formula:
\( \text{Standard Deviation}\ (\sigma) = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}} \)
Interquartile Range (IQR): The range between the first quartile (Q1) and
third quartile (Q3). It measures the spread of the middle 50% of the data
and helps identify outliers.
Formula:
IQR = Q3 − Q1
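A common convention (the 1.5 × IQR fence, a rule of thumb rather than a strict test) flags values outside \([Q_1 - 1.5 \cdot IQR,\ Q_3 + 1.5 \cdot IQR]\) as potential outliers; a sketch with a hypothetical column:

Q1 = df['column'].quantile(0.25)
Q3 = df['column'].quantile(0.75)
IQR = Q3 - Q1

# Flag observations outside the 1.5 * IQR fences as potential outliers
outliers = df[(df['column'] < Q1 - 1.5 * IQR) | (df['column'] > Q3 + 1.5 * IQR)]
print(len(outliers), 'potential outliers')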
Measures of Distribution Shape:
Skewness: Measures the asymmetry of the data distribution.
Formula:
\( \text{Skewness} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^3}{(n - 1)\,\sigma^3} \)
A positive skewness value indicates a right-skewed distribution, while
a negative value indicates a left-skewed distribution. Highly skewed
data may require transformations for modeling.
Kurtosis: Measures the "tailedness" of the distribution, which helps
identify the presence of outliers.
Formula:
\( \text{Kurtosis} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^4}{(n - 1)\,\sigma^4} \)
High kurtosis indicates a distribution with heavy tails, while low
kurtosis indicates light tails. Excess kurtosis (kurtosis − 3) compares
tail weight to a normal distribution; note that Pandas' kurt() already
reports excess kurtosis (Fisher's definition), so a normal distribution
scores approximately 0.
5. Additional Tools in Python for Descriptive Analysis
Quantiles and Percentiles:
Use quantile() to find different quantiles and percentiles of a dataset,
which are useful for understanding data spread and identifying potential
outliers.
# Calculate 25th, 50th, and 75th percentiles
Q1 = df['column'].quantile(0.25)
Q2 = df['column'].quantile(0.50)
Q3 = df['column'].quantile(0.75)
print('Q1:', Q1)
print('Median (Q2):', Q2)
print('Q3:', Q3)
Application: Quantiles are particularly helpful for creating boxplots,
detecting outliers, and understanding the data's distribution.
Data Visualization:
Visualizing descriptive statistics is crucial in understanding data properties
intuitively:
Boxplots: Useful for visualizing the spread, median, and potential
outliers of a dataset. A boxplot is built from the quartiles and
highlights the IQR, giving a compact view of the data distribution.
Histograms: Help visualize the frequency distribution of data and give
a sense of skewness and kurtosis.
Density Plots: Provide a smooth distribution of data values, helping to
understand data concentration and shape.
import seaborn as sns
import matplotlib.pyplot as plt
# Boxplot to visualize distribution
sns.boxplot(x=df['column'])
plt.title('Boxplot of Column')
plt.show()
# Histogram for frequency distribution
df['column'].hist(bins=30)
plt.title('Histogram of Column')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.show()
# Density plot for distribution
sns.kdeplot(df['column'], fill=True)  # fill= replaces the deprecated shade= argument
plt.title('Density Plot of Column')
plt.show()