Week 6-7
Exploratory Data Analysis and Descriptive Statistics
(Graduate-Level Detail)
1. Introduction to Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) is the initial process of investigating datasets to
summarize their key characteristics, often through visualizations and summary
statistics. It helps data analysts and scientists to detect patterns, spot anomalies,
test hypotheses, and check assumptions. EDA is typically one of the first steps
taken in any data analysis project to better understand the data before applying
advanced models. The insights gained from EDA directly inform data cleaning,
feature engineering, and model selection.
Objectives of EDA:
Understand the structure and quality of data.
Identify trends, relationships, and anomalies.
Prepare data for modeling through cleaning, transformation, and feature
selection.
Discover important variables, relationships, and hidden patterns.
2. Exploring Basic Statistical Analysis Tools in the Python Pandas Library
The Pandas library in Python is one of the most popular tools for data analysis due
to its versatility and powerful functions. It provides a range of functions that allow
data exploration with ease and flexibility. Below are some essential tools and steps
used for EDA in Pandas:
Data Overview:
Loading the Data: Use read_csv() , read_excel() , etc., to load data into a
Pandas DataFrame.
Basic Inspection:
Head and Tail: The head() and tail() functions allow you to preview
the first or last few records of your dataset, helping in the initial
understanding of its structure.
Information Summary: info() provides data types, non-null counts,
and memory usage, which is useful for understanding the data
completeness and types of columns.
Summary Statistics: describe() gives statistical summaries of numeric
columns, including mean, standard deviation, min, max, and quartiles.
import pandas as pd

df = pd.read_csv('data.csv')  # Load the dataset into a DataFrame
print(df.head())              # Preview the first five rows
df.info()                     # Column types, non-null counts, memory usage (prints directly)
print(df.describe())          # Summary statistics for numeric columns
Data Types: The dtypes attribute can be used to verify the data types of
columns, helping identify whether transformations are needed.
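As a quick sketch (the column name here is hypothetical), a type check followed by a conversion might look like:

print(df.dtypes)  # dtype of every column

# Convert a column that was read as text into a numeric type;
# errors='coerce' turns unparseable entries into NaN
df['column'] = pd.to_numeric(df['column'], errors='coerce')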
Handling Missing Values:
Missing values are common in real-world datasets and can introduce
biases or inaccuracies in modeling if not handled correctly.
Identify Missing Values: Use isnull() and sum() to get an overview of
where missing data exists.
Strategies to Handle Missing Data:
Removal: If the missing values are minimal and randomly distributed,
rows or columns with missing values can be dropped using dropna() .
This approach is effective when the impact on data quality is
negligible.
Imputation: Fill missing values using central tendency metrics ( mean() ,
median() , mode() ) or predictive models.
Forward/Backward Filling: Methods like ffill() or bfill() can be
used in time-series data to fill missing values based on neighboring
data points (see the sketch after the code block below).
# Count missing values in each column
print(df.isnull().sum())
# Fill missing values with mean
df['column'] = df['column'].fillna(df['column'].mean())
# Drop rows with missing values
df.dropna(inplace=True)
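For time-series data, the forward/backward filling mentioned above might look like the following sketch (assuming rows are already sorted in time order):

# Propagate the last valid observation forward through gaps
df['column'] = df['column'].ffill()

# Fill any remaining leading gaps from the next valid observation
df['column'] = df['column'].bfill()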
Advanced Techniques: Use machine learning models for imputation (e.g.,
KNNImputer from Scikit-Learn) for more sophisticated handling of missing
data.
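A minimal sketch of KNN-based imputation with Scikit-Learn; the restriction to numeric columns and the choice of n_neighbors=5 are illustrative assumptions:

from sklearn.impute import KNNImputer

# Impute each missing value from the 5 most similar rows (numeric columns only)
numeric_cols = df.select_dtypes(include='number').columns
imputer = KNNImputer(n_neighbors=5)
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])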
Summary Statistics:
Central Tendency: Calculations such as Mean, Median, and Mode give
insights into the typical values of a dataset.
Mean provides the average of all values.
Median is less affected by outliers and provides a central point of the
distribution.
Mode is used particularly for categorical data to determine the most
frequent category.
Spread: Quantitative measures like Range, Variance, and Standard
Deviation help understand the dispersion of the data.
Range: Indicates the difference between maximum and minimum
values.
Variance and Standard Deviation: Variance represents the spread of
data points around the mean, while standard deviation is its square
root, making it more interpretable.
Skewness and Kurtosis:
Skewness tells us about the asymmetry in data distribution.
Kurtosis provides insight into the heaviness of the distribution's
tails and the presence of outliers.
print('Mean:', df['column'].mean())
print('Median:', df['column'].median())
print('Standard Deviation:', df['column'].std())
print('Skewness:', df['column'].skew())
print('Kurtosis:', df['column'].kurt())
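One addition to the snippet above: mode() returns a Series because ties are possible, so the first mode is usually selected explicitly:

print('Mode:', df['column'].mode().iloc[0])  # first of possibly several modes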
Application: High skewness and kurtosis can affect the performance of
statistical models, and transformations like log or square root may be
applied to normalize the data.
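As an illustration of such a transformation (assuming the column is non-negative; the new column name is hypothetical):

import numpy as np

# log1p(x) = log(1 + x) compresses large values and tolerates zeros
df['column_log'] = np.log1p(df['column'])
print('Skewness after log transform:', df['column_log'].skew())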
3. Correlation and Methods to Establish Causation
Correlation:
Definition: Correlation measures the strength and direction of a
relationship between two variables. The correlation coefficient (r) ranges
from -1 to 1.
Positive Correlation: As one variable increases, the other also
increases.
Negative Correlation: As one variable increases, the other decreases.
No Correlation: No apparent relationship between the variables.
Types of Correlation:
Pearson Correlation: Measures linear relationships and is sensitive to
outliers.
Spearman Rank Correlation: Measures monotonic relationships using
rank, less sensitive to outliers.
Kendall Tau: Used for ordinal data and helps understand relationships
between ranked variables (computed in the example after the code block below).
# Calculate Pearson correlation matrix (numeric_only=True restricts the
# calculation to numeric columns, avoiding errors in recent Pandas versions)
correlation_matrix = df.corr(method='pearson', numeric_only=True)
print(correlation_matrix)
# Calculate Spearman rank correlation
spearman_corr = df.corr(method='spearman', numeric_only=True)
print(spearman_corr)
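Kendall's Tau, mentioned in the list above, follows the same pattern:

# Calculate Kendall Tau correlation for ranked/ordinal relationships
kendall_corr = df.corr(method='kendall', numeric_only=True)
print(kendall_corr)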
Correlation Coefficient Formula:
Pearson Correlation Coefficient (r):
\( r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}} \)
Causation:
Definition: Causation implies that one event is the result of the occurrence
of the other event. Unlike correlation, causation requires evidence that
changing one variable will produce a change in another.
Proving Causation: Methods like Randomized Controlled Trials (RCTs) or
advanced statistical tests (e.g., Granger causality, A/B testing) are required
to demonstrate causation.
Important Consideration: Correlation does not imply causation. A high
correlation between two variables may be coincidental or influenced by an
unseen third variable (confounder).
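As a toy sketch of the A/B-testing idea (the 'group' and 'outcome' columns are entirely hypothetical), SciPy's two-sample t-test can compare a control and a treatment group:

from scipy import stats

# Hypothetical outcomes for control (A) and treatment (B) groups
group_a = df.loc[df['group'] == 'A', 'outcome']
group_b = df.loc[df['group'] == 'B', 'outcome']

# Welch's t-test: is the difference in mean outcomes statistically significant?
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
print('t =', t_stat, 'p =', p_value)

Keep in mind that it is the random assignment of units to groups, not the statistical test itself, that licenses a causal conclusion.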
4. Descriptive Statistics: Types and Formulas
Descriptive statistics summarize and describe the features of a dataset through
numerical and graphical summaries. These measures are key to understanding
the nature and distribution of the data.
Measures of Central Tendency:
Mean (Average): The sum of all values divided by the number of
observations.
Formula:
\( \text{Mean} = \frac{\sum_{i=1}^{n} x_i}{n} \)
Median: The middle value when all observations are sorted in ascending
or descending order. Median is particularly useful in skewed distributions
where the mean can be misleading.
Formula:
If \(n\) is odd, Median = value at position \((n + 1)/2\).
If \(n\) is even, Median = average of the values at positions \(n/2\) and \(n/2 + 1\).
Mode: The value that appears most frequently in the data, particularly
helpful for categorical data analysis.
Formula: No specific formula; determined based on frequency count.
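A quick sanity check of these formulas against Pandas, using a small hypothetical sample:

import pandas as pd

sample = pd.Series([3, 1, 4, 1, 5])  # hypothetical values, n = 5 (odd)

manual_mean = sample.sum() / len(sample)                           # sum of values / n
manual_median = sample.sort_values().iloc[(len(sample) - 1) // 2]  # position (n + 1) / 2

print(manual_mean, sample.mean())      # both 2.8
print(manual_median, sample.median())  # both 3.0
print(sample.mode().tolist())          # [1], the most frequent value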
Measures of Dispersion:
Range: The difference between the maximum and minimum values,
showing the spread of the data.
Formula:
Range = Max(x) − Min(x)
Variance: The average squared deviation from the mean; the sample formula
below divides by \(n - 1\) (Bessel's correction), which is also the Pandas
default ( var() uses ddof=1 ). Variance is a crucial measure of how spread
out the data is.
Formula:
\( \text{Variance}\ (\sigma^2) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1} \)
Standard Deviation: The square root of variance, providing a measure of
the average distance of each data point from the mean.
Formula:
\( \text{Standard Deviation}\ (\sigma) = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}} \)
Interquartile Range (IQR): The range between the first quartile (Q1) and
third quartile (Q3). It measures the spread of the middle 50% of the data
and helps identify outliers.
Formula:
IQR = Q3 − Q1
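A common convention (the 1.5 × IQR fence, a rule of thumb rather than a strict test) flags values outside \([Q_1 - 1.5 \cdot IQR,\ Q_3 + 1.5 \cdot IQR]\) as potential outliers; a sketch with a hypothetical column:

Q1 = df['column'].quantile(0.25)
Q3 = df['column'].quantile(0.75)
IQR = Q3 - Q1

# Flag observations outside the 1.5 * IQR fences as potential outliers
outliers = df[(df['column'] < Q1 - 1.5 * IQR) | (df['column'] > Q3 + 1.5 * IQR)]
print(len(outliers), 'potential outliers')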
Measures of Distribution Shape:
Skewness: Measures the asymmetry of the data distribution.
Formula:
\( \text{Skewness} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^3}{(n - 1)\,\sigma^3} \)
A positive skewness value indicates a right-skewed distribution, while
a negative value indicates a left-skewed distribution. Highly skewed
data may require transformations for modeling.
Kurtosis: Measures the "tailedness" of the distribution, which helps
identify the presence of outliers.
Formula:
\( \text{Kurtosis} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^4}{(n - 1)\,\sigma^4} \)
High kurtosis indicates a distribution with heavy tails, while low
kurtosis indicates light tails. Excess kurtosis (kurtosis − 3) compares
tail weight to a normal distribution; note that Pandas' kurt() already
reports excess kurtosis (Fisher's definition), so a normal distribution
scores approximately 0.
5. Additional Tools in Python for Descriptive Analysis
Quantiles and Percentiles:
Use quantile() to find different quantiles and percentiles of a dataset,
which are useful for understanding data spread and identifying potential
outliers.
# Calculate 25th, 50th, and 75th percentiles
Q1 = df['column'].quantile(0.25)
Q2 = df['column'].quantile(0.50)
Q3 = df['column'].quantile(0.75)
print('Q1:', Q1)
print('Median (Q2):', Q2)
print('Q3:', Q3)
Application: Quantiles are particularly helpful for creating boxplots,
detecting outliers, and understanding the data's distribution.
Data Visualization:
Visualizing descriptive statistics is crucial in understanding data properties
intuitively:
Boxplots: Useful for visualizing the spread, median, and potential
outliers of a dataset. A boxplot is built from the quartiles and
highlights the IQR, giving a compact view of the data distribution.
Histograms: Help visualize the frequency distribution of data and give
a sense of skewness and kurtosis.
Density Plots: Provide a smooth distribution of data values, helping to
understand data concentration and shape.
import seaborn as sns
import matplotlib.pyplot as plt
# Boxplot to visualize distribution
sns.boxplot(x=df['column'])
plt.title('Boxplot of Column')
plt.show()
# Histogram for frequency distribution
df['column'].hist(bins=30)
plt.title('Histogram of Column')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.show()
# Density plot for distribution
sns.kdeplot(df['column'], fill=True)  # fill= replaces the deprecated shade= argument
plt.title('Density Plot of Column')
plt.show()