EDA Unit V
Third Variable - Partial Correlation and Mediation Analysis - Causal Explanations - Causal Inference and
Structural Equation Modeling - Three-Variable Contingency Tables and Beyond - Multi-way Tables - Time
Series Analysis (TSA) - Characteristics of Time Series Data - Stationarity, Trend, Seasonality - Time Series
Aggregation - Rolling Mean, Moving Averages - Time Series Forecasting - ARIMA
1. Initial Scatterplot Analysis
o Examine the relationship between Study Hours and Exam Scores with a scatterplot.
python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Sample data
data = pd.DataFrame({
    'Study Hours': [10, 15, 20, 25, 30, 35, 40],
    'Exam Scores': [50, 55, 60, 65, 70, 75, 80],
    'Sleep Hours': [6, 7, 6, 8, 7, 6, 7]
})

# Scatterplot for Study Hours vs. Exam Scores
sns.scatterplot(x='Study Hours', y='Exam Scores', data=data)
plt.title('Study Hours vs. Exam Scores')
plt.show()
2. Introducing the Third Variable (Sleep Hours)
o Use color-coding to incorporate Sleep Hours into the scatterplot.
python
# Scatterplot with Sleep Hours as hue
sns.scatterplot(x='Study Hours', y='Exam Scores', hue='Sleep Hours', palette='viridis', data=data)
plt.title('Study Hours vs. Exam Scores (Color-coded by Sleep Hours)')
plt.show()
3. Stratification by Sleep Hours
o Create separate scatterplots for different levels of Sleep Hours to see if the relationship
between Study Hours and Exam Scores changes.
python
# Facet grid for stratified analysis
g = sns.FacetGrid(data, col='Sleep Hours', col_wrap=3, height=4)
g.map(sns.scatterplot, 'Study Hours', 'Exam Scores')
g.add_legend()
plt.show()
4. Statistical Analysis
o Use multiple regression to assess the impact of both Study Hours and Sleep Hours on Exam
Scores.
import statsmodels.api as sm
# Prepare the data for regression
X = data[['Study Hours', 'Sleep Hours']]
X = sm.add_constant(X)  # Adds a constant (intercept) term to the predictors
y = data['Exam Scores']

# Fit the regression model
model = sm.OLS(y, X).fit()

# Print the regression results
print(model.summary())
Interpretation
Scatterplots: The color-coded scatterplot and facet grid help visualize how Sleep Hours might influence the relationship between Study Hours and Exam Scores.
Previous Academic Performance: A measure of students' performance in prior exams.
Step-by-Step Analysis
1. Initial Correlation Analysis
o Examine the correlation between Study Hours and Exam Scores.
python
import pandas as pd

# Sample data
data = pd.DataFrame({
'Study Hours': [10, 15, 20, 25, 30, 35, 40],
'Exam Scores': [50, 55, 60, 65, 70, 75, 80],
'Previous Academic Performance': [45, 50, 55, 60, 65, 70, 75]
})
# Calculate correlation
correlation = data['Study Hours'].corr(data['Exam Scores'])
print(f"Correlation between Study Hours and Exam Scores: {correlation}")
2. Controlling for Confounding Variables
o Use multiple regression to control for Previous Academic Performance.
import statsmodels.api as sm
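The rest of the snippet does not appear here; a minimal sketch of the intended regression follows, reusing the data DataFrame defined above. Note that in this toy data Study Hours and Previous Academic Performance are perfectly correlated, so the fitted coefficients are degenerate and for illustration only.
python
# (statsmodels.api imported above as sm)
# Predictors: Study Hours plus the confounder we want to control for
X = data[['Study Hours', 'Previous Academic Performance']]
X = sm.add_constant(X)  # add the intercept term
y = data['Exam Scores']

# Fit OLS; the Study Hours coefficient is now its effect
# holding Previous Academic Performance fixed
model = sm.OLS(y, X).fit()
print(model.summary())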
Interpretation
Correlation Analysis: The initial correlation provides a preliminary indication of a relationship
between Study Hours and Exam Scores.
Multiple Regression: Controlling for Previous Academic Performance helps isolate the effect of Study
Hours on Exam Scores.
Instrumental Variables: Using an IV like Parental Encouragement helps address endogeneity
issues, providing a more robust causal inference.
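As an illustration only, here is a sketch of two-stage least squares with the linearmodels package; the Parental Encouragement values below are made up, and the choice of library is an assumption (the text does not name one).
python
import pandas as pd
from linearmodels.iv import IV2SLS

# Hypothetical data: Parental Encouragement serves as the instrument for Study Hours
df = pd.DataFrame({
    'Exam Scores': [50, 55, 60, 65, 70, 75, 80],
    'Study Hours': [10, 15, 20, 25, 30, 35, 40],
    'Parental Encouragement': [1, 2, 2, 3, 4, 4, 5],
})
df['const'] = 1

# Stage 1 regresses Study Hours on the instrument; stage 2 uses the
# fitted values to estimate the effect of Study Hours on Exam Scores
iv = IV2SLS(dependent=df['Exam Scores'],
            exog=df['const'],
            endog=df['Study Hours'],
            instruments=df['Parental Encouragement']).fit()
print(iv.summary)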
Causal explanations are a crucial component of EDA, enabling researchers and practitioners to move beyond mere
associations and uncover the true cause-and-effect relationships in their data. By employing a combination of
experimental designs, statistical controls, and advanced inference techniques, we can gain deeper insights and make
more informed decisions based on data.
Three-Variable Contingency Tables and Beyond
Three-variable contingency tables, also known as three-way tables, extend the concept of two-way tables to analyze
the relationship between three categorical variables simultaneously. These tables allow researchers to explore more
complex interactions and dependencies among variables, providing a deeper understanding of the data.
Importance of Three-Variable Contingency Tables
1. Exploring Interactions: Understanding how the relationship between two variables changes when
considering a third variable.
2. Identifying Conditional Dependencies: Determining if the association between two variables
depends on the level of a third variable.
3. Controlling for Confounders: Adjusting for a third variable to see if the association between the
primary variables is genuine or confounded.
Constructing a Three-Variable Contingency Table
To construct a three-variable contingency table, we categorize data based on three variables. Each cell in the table
represents a combination of these variables and contains the count or frequency of observations.
Example: Analyzing the Relationship Between Gender, Smoking Status, and Exercise Frequency
Consider a dataset with the following variables:
Gender: Male, Female
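A minimal sketch of constructing such a table with pandas' crosstab; the observations and category labels below are made up for illustration.
python
import pandas as pd

# Made-up observations for the three categorical variables
df = pd.DataFrame({
    'Gender':   ['Male', 'Female', 'Female', 'Male', 'Female', 'Male'],
    'Smoking':  ['Smoker', 'Non-smoker', 'Smoker', 'Non-smoker', 'Non-smoker', 'Smoker'],
    'Exercise': ['Low', 'High', 'Low', 'High', 'Low', 'High'],
})

# Three-way table: Gender x Smoking Status, stratified by Exercise Frequency
table = pd.crosstab(index=[df['Exercise'], df['Gender']], columns=df['Smoking'])
print(table)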
Logistic Regression: Offers a more nuanced understanding of the relationships, especially when dealing
with more than three variables. The model coefficients and p-values indicate the significance and strength
of these relationships.
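A sketch of how such a model might be fit with statsmodels' formula API, on made-up data for the variables above (all values are illustrative assumptions, not from the text).
python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical observations; smoker: 1 = smoker, 0 = non-smoker
df = pd.DataFrame({
    'Gender':   ['Male', 'Female'] * 4,
    'Exercise': ['Low', 'Low', 'High', 'High'] * 2,
    'smoker':   [1, 0, 0, 0, 1, 1, 0, 1],
})

# Logistic regression of smoking status on gender and exercise frequency;
# the coefficients and p-values quantify each association
model = smf.logit('smoker ~ C(Gender) + C(Exercise)', data=df).fit()
print(model.summary())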
Three-variable contingency tables are a powerful tool in EDA for exploring complex interactions and dependencies
among categorical variables. While they provide valuable insights, higher-dimensional data often require advanced
multivariate analysis and visualization techniques to fully understand the relationships. Utilizing these methods
allows for a deeper and more comprehensive exploration of the data, leading to more accurate and actionable
insights.
Objective:
● To understand the concept of time series analysis
● To know the different characteristics of time series data
● To identify different methods of time series analysis
● To have an overview of univariate and multivariate time series
What Is Time Series Analysis?
Time-series analysis is a method of analyzing a collection of data points over a period of time. Instead of recording
data points intermittently or randomly, time series analysts record data points at consistent intervals over a set period
of time.
While time-series data is information gathered over time, several related types of data describe how and when information is collected. For example:
Time series data: It is a collection of observations on the values that a variable takes at various points in
time.
Cross-sectional data: Data from one or more variables that were collected simultaneously.
Pooled data: It is a combination of cross-sectional and time-series data.
The variable varies according to a probability distribution that shows which values Y can take and with what probability those values occur. A simple model for such a series is
Yt = μt + εt
where each observation Yt is the sum of a signal μt and a noise term εt.
Why Do We Need Time-Series Analysis?
Time series analysis has a range of applications in statistics, sales, economics, and many more areas. The common
point is the technique used to model the data over a given period of time.
The reasons for doing time series analysis are as follows:
Features: Time series analysis can be used to track features like trend, seasonality, and variability.
Forecasting: Time series analysis can aid in the prediction of stock prices. It is used when you want to know whether the price will rise or fall, and by how much.
Inferences: You can predict the value and draw inferences from data using Time series analysis.
What is forecasting?
Forecasting is a technique for making predictions about the future by using historical data as inputs and analyzing trends. However, forecasting does not tell the future definitively; it only estimates probabilities, so you should always double-check the results before making a decision.
What is a time series analysis and what are the benefits?
A time series analysis focuses on a series of data points ordered in time. This is one of the most widely used data
science analyses and is applied in a variety of industries.
ARIMA
ARIMA is an acronym for Autoregressive Integrated Moving Average. It is also known as the Box-Jenkins method.
Now you will explore the ARIMA parameters in detail:
Autoregressive Component: AR stands for autoregressive and is denoted by p, the number of lagged observations included in the model. When p is 0, the model uses no autoregressive terms; when p is 1, the model regresses on the value one lag back, capturing autocorrelation up to one lag.
Moving Average: The moving-average order is denoted by q, the number of lagged forecast errors included in the model. When q = 1, the model includes one lagged error term.
Integration: The order of differencing is denoted by d. When d is 0, the series is already stationary. When d is 1, the series is not stationary, and you can make it stationary by taking the first difference.
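A minimal sketch of fitting an ARIMA model with statsmodels on synthetic data; the order (1, 1, 1) is an arbitrary illustration, and in practice p, d, and q are chosen from ACF/PACF plots or information criteria.
python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Synthetic non-stationary series: a random walk with drift
np.random.seed(0)
idx = pd.date_range('2020-01-01', periods=200, freq='D')
series = pd.Series(np.cumsum(np.random.normal(0.5, 1.0, 200)), index=idx)

# p=1 autoregressive lag, d=1 difference to remove the trend, q=1 lagged error
model = ARIMA(series, order=(1, 1, 1)).fit()
print(model.summary())

# Forecast the next 10 days
print(model.forecast(steps=10))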
Time series analysis is a statistical technique that deals with time-ordered data. The primary goal is to understand the
underlying structure and function that produced the observations and to forecast future values based on historical
data. Time series data are ubiquitous in various fields such as finance, economics, environmental science, and many
others.
Characteristics of Time Series Data
1. Trend: Long-term movement in the data. It represents the general direction in which the data is moving
over time.
2. Seasonality: Regular, repeating patterns or cycles in data at fixed intervals.
3. Cyclic Patterns: Irregular fluctuations that are not of fixed period but occur over long time frames.
4. Noise: Random variations that are unexplained by the model.
Steps in Time Series Analysis
1. Data Collection: Gather the data in a time-ordered sequence.
2. Data Cleaning: Handle missing values, outliers, and anomalies.
3. Data Visualization: Plot the data to identify patterns such as trends and seasonality.
4. Decomposition: Separate the time series into trend, seasonal, and residual components.
5. Modeling: Fit statistical models to the data for understanding and forecasting.
6. Validation: Evaluate the model's performance using validation techniques.
7. Forecasting: Predict future values using the developed model.
Decomposition of Time Series
Decomposition involves breaking down a time series into its constituent components:
Trend Component (T): Represents the long-term progression of the series.
Seasonal Component (S): Captures the repeating short-term cycle.
Residual Component (R): The remaining part after removing trend and seasonality.
In an additive model these combine as Yt = Tt + St + Rt; in a multiplicative model, Yt = Tt × St × Rt.
Example: Decomposition using Python
python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# Generate sample data
np.random.seed(0)
date_rng = pd.date_range(start='1/1/2020', end='1/1/2022', freq='D')
df = pd.DataFrame(date_rng, columns=['date'])
df['data'] = np.sin(np.linspace(-2 * np.pi, 2 * np.pi, len(df))) + np.random.normal(0, 0.1, len(df))
df.set_index('date', inplace=True)

# Decompose the time series
result = seasonal_decompose(df['data'], model='additive')

# Plot the decomposition
result.plot()
plt.show()
print(df)
Visualization Techniques for Time Series Data
1. Line Plots
Description: Line plots display data points in time order, connected by lines, making the overall movement of the series easy to see.
Usage: Useful as a first look at the data for spotting trends, level shifts, and obvious anomalies.
Example:
import pandas as pd
import matplotlib.pyplot as plt

# Sample data
date_range = pd.date_range(start='2020-01-01', periods=100, freq='D')
data = pd.Series(range(100), index=date_range)

# Plotting
data.plot()
plt.title('Line Plot of Time Series Data')
plt.xlabel('Date')
plt.ylabel('Value')
plt.show()
2. Seasonal Plots
Description: Seasonal plots help to visualize seasonal patterns by plotting data for each period (e.g.,
month, year) on the same graph.
Usage: Useful for identifying and comparing seasonal patterns across different periods.
Example:
import seaborn as sns

# Build a DataFrame with a 'date' column and a 'value' column
data = pd.DataFrame({'date': date_range, 'value': range(100)})
data['month'] = data['date'].dt.month

# Seasonal plot
sns.lineplot(data=data, x='month', y='value')
plt.title('Seasonal Plot of Time Series Data')
plt.xlabel('Month')
plt.ylabel('Value')
plt.show()
3. Autocorrelation Plots (ACF)
Description: Autocorrelation plots display the correlation of the time series with its own past values
(lags).
Usage: Useful for identifying the degree of correlation between time steps and for checking the
randomness of data.
Example:
python
from statsmodels.graphics.tsaplots import plot_acf

# Plotting ACF
plot_acf(data['value'])
plt.title('Autocorrelation Plot')
plt.xlabel('Lags')
plt.ylabel('Autocorrelation')
plt.show()
4. Heatmaps
Description: Heatmaps represent data in a matrix form, where values are represented by different colors.
Time series heatmaps can show patterns across two dimensions, such as time of day and day of week.
Usage: Useful for visualizing seasonal and cyclical patterns across multiple dimensions.
Example:
python
data['day_of_week'] = data['date'].dt.dayofweek
data['hour'] = data['date'].dt.hour  # all zeros here, since the sample data is daily; real intra-day timestamps give a meaningful grid

# Creating a pivot table
heatmap_data = data.pivot_table(index='day_of_week', columns='hour', values='value')

# Plotting heatmap
sns.heatmap(heatmap_data, cmap='coolwarm')
plt.title('Heatmap of Time Series Data')
plt.xlabel('Hour')
plt.ylabel('Day of Week')
plt.show()
5. Box Plots
Description: Box plots (or whisker plots) display the distribution of data based on a five-number summary:
minimum, first quartile (Q1), median, third quartile (Q3), and maximum.
Usage: Useful for comparing distributions across different time periods.
Example:
sns.boxplot(x='month', y='value', data=data)
plt.title('Box Plot of Time Series Data')
plt.xlabel('Month')
plt.ylabel('Value')
plt.show()
6. Decomposition Plots
Description: Decomposition plots break down a time series into its component parts: trend, seasonal,
and residual components.
Usage: Useful for understanding the underlying structure of the time series data.
Example:
python
from statsmodels.tsa.seasonal import seasonal_decompose

# Decompose the time series
decomposition = seasonal_decompose(data['value'], model='additive', period=12)

# Plotting decomposition
decomposition.plot()
plt.show()
Importance of Visualization in Time Series Analysis
Trend Identification: Visualizations help identify long-term trends in the data.
Seasonal Patterns: They reveal recurring patterns and seasonality.
Anomaly Detection: Visualizing data can highlight outliers or anomalies.
Data Cleaning: Helps in identifying missing values and incorrect data points.
Model Selection: Assists in choosing appropriate models for forecasting and analysis based on observed
patterns.
Effective visualization is essential for gaining insights and making informed decisions based on time series data. By
utilizing these visualization techniques, analysts can better understand and interpret their data.
Grouping in Time Series Analysis (TSA)
Grouping in Time Series Analysis is a powerful technique used to aggregate data based on specific time intervals or
other criteria. This method helps in summarizing the data, identifying patterns, and gaining insights over different
periods.
Key Concepts of Grouping in TSA
1. Time-based Grouping:
o Grouping data based on time intervals such as hourly, daily, weekly, monthly, or yearly.
o Useful for identifying trends and seasonal patterns over specific periods.
2. Custom Grouping:
o Grouping based on custom criteria such as weekdays vs. weekends, business hours vs. non-business
hours, etc.
o Helps in understanding the impact of external factors on the time series data; both kinds of grouping are sketched below.
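A minimal sketch of both kinds of grouping with pandas, on synthetic hourly data.
python
import numpy as np
import pandas as pd

# Synthetic hourly data over two weeks
np.random.seed(0)
idx = pd.date_range('2020-01-01', periods=14 * 24, freq='H')
df = pd.DataFrame({'value': np.random.rand(len(idx))}, index=idx)

# Time-based grouping: daily means via resample
print(df.resample('D').mean().head())

# Custom grouping: weekdays vs. weekends
df['is_weekend'] = df.index.dayofweek >= 5
print(df.groupby('is_weekend')['value'].mean())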
Upsampling a series to a higher frequency introduces NaN values. Filling them with the forward-fill method (ffill) propagates the last valid observation forward.
Example
Here is the complete example:
python
import pandas as pd
import numpy as np

# Create a date range
date_rng = pd.date_range(start='2020-01-01', end='2020-01-10', freq='D')

# Create a DataFrame with this date range
df = pd.DataFrame(date_rng, columns=['date'])

# Add a random value column
df['data'] = np.random.randint(0, 100, size=(len(date_rng)))

# Set the date column as the index
df.set_index('date', inplace=True)
print("Original DataFrame:")
print(df)

# Downsampling to 3-day frequency, calculating the mean
df_downsampled = df.resample('3D').mean()
print("\nDownsampled DataFrame (3-day frequency, mean):")
print(df_downsampled)

# Upsampling to hourly frequency, filling NaN values using forward fill
df_upsampled = df.resample('H').ffill()
print("\nUpsampled DataFrame (hourly frequency, forward fill):")
print(df_upsampled)
Output
Original DataFrame:
            data
date
2020-01-01    81
2020-01-02    43
2020-01-03    23
2020-01-04    76
2020-01-05    21
2020-01-06    34
2020-01-07    84
2020-01-08    45
2020-01-09    15
2020-01-10    57

Downsampled DataFrame (3-day frequency, mean):
                 data
date
2020-01-01  49.000000
2020-01-04  43.666667
2020-01-07  48.000000
2020-01-10  57.000000

Upsampled DataFrame (hourly frequency, forward fill):
                     data
date
2020-01-01 00:00:00    81
2020-01-01 01:00:00    81
2020-01-01 02:00:00    81
2020-01-01 03:00:00    81
2020-01-01 04:00:00    81
...                   ...
2020-01-09 20:00:00    15
2020-01-09 21:00:00    15
2020-01-09 22:00:00    15
2020-01-09 23:00:00    15
2020-01-10 00:00:00    57

[217 rows x 1 columns]
These examples show how you can resample your time series data to different frequencies using pandas, providing
flexibility in your data analysis.
Unit V Completed