
Rajalakshmi Institute of Technology

(An Autonomous Institution), Affiliated to Anna University, Chennai


Department of Computer Science and Engineering
CCS346-Exploratory Data Analysis

UNIT V MULTIVARIATE AND TIME SERIES ANALYSIS

Third Variable - Partial Correlation and Mediation Analysis - Causal Explanations – Causal Inference and
Structural Equation Modeling - Three-Variable Contingency Tables and Beyond - Multi-way Tables -Time
Series Analysis (TSA) - Characteristics of Time Series Data - Stationarity, Trend, Seasonality Time Series
Aggregation - Rolling Mean, Moving Averages - Time Series Forecasting - ARIMA

Introducing a Third Variable in Exploratory Data Analysis (EDA)
Introduction


In Exploratory Data Analysis (EDA), examining relationships between two variables can provide valuable insights.
However, introducing a third variable can often reveal deeper, more nuanced patterns and relationships that are not
immediately apparent. This can help in identifying potential interactions, confounding factors, or mediators in the
data.
Importance of Introducing a Third Variable
1. Understanding Interactions: A third variable can help identify interactions between the primary
variables being analyzed. For example, the relationship between exercise and weight loss might be
influenced by diet.
2. Controlling for Confounders: Introducing a third variable can help control for confounding factors
that may distort the observed relationship between two variables.
3. Identifying Mediators: A third variable can act as a mediator, explaining the mechanism through which
one variable affects another.
Methods for Introducing a Third Variable
1. Stratification: Dividing the data into subgroups based on the third variable and analyzing the primary
relationship within each subgroup.
2. Multivariate Plots: Using multivariate visualizations such as 3D plots, color-coded scatterplots, or
facet grids to incorporate the third variable into the analysis.
3. Statistical Models: Employing statistical models like multiple regression, ANOVA, or logistic
regression to include the third variable and assess its impact on the primary relationship.
Example: Analyzing the Relationship Between Study Hours, Exam Scores, and Sleep Hours
Data
Consider a dataset with the following variables:
 Study Hours: Number of hours a student studies per week.
 Exam Scores: Scores obtained by students in an exam.
 Sleep Hours: Average number of hours a student sleeps per night.
Step-by-Step Analysis

1. Initial Bivariate Analysis


o Examine the relationship between Study Hours and Exam Scores using a scatterplot.
python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Sample data
data = pd.DataFrame({
    'Study Hours': [10, 15, 20, 25, 30, 35, 40],
    'Exam Scores': [50, 55, 60, 65, 70, 75, 80],
    'Sleep Hours': [6, 7, 6, 8, 7, 6, 7]
})

# Scatterplot for Study Hours vs. Exam Scores
sns.scatterplot(x='Study Hours', y='Exam Scores', data=data)
plt.title('Study Hours vs. Exam Scores')
plt.show()
2. Introducing the Third Variable (Sleep Hours)
o Use color-coding to incorporate Sleep Hours into the scatterplot.
python
# Scatterplot with Sleep Hours as hue
sns.scatterplot(x='Study Hours', y='Exam Scores', hue='Sleep Hours', palette='viridis', data=data)
plt.title('Study Hours vs. Exam Scores (Color-coded by Sleep Hours)')
plt.show()
3. Stratification by Sleep Hours
o Create separate scatterplots for different levels of Sleep Hours to see if the relationship
between Study Hours and Exam Scores changes.
python
# Facet grid for stratified analysis
g = sns.FacetGrid(data, col='Sleep Hours', col_wrap=3, height=4)
g.map(sns.scatterplot, 'Study Hours', 'Exam Scores')
g.add_legend()
plt.show()
4. Statistical Analysis
o Use multiple regression to assess the impact of both Study Hours and Sleep Hours on Exam
Scores.
import statsmodels.api as sm
# Prepare the data for regression
X = data[['Study Hours', 'Sleep Hours']]
X = sm.add_constant(X)  # Adds a constant term to the predictor
y = data['Exam Scores']
# Fit the regression model and print the results
model = sm.OLS(y, X).fit()
print(model.summary())
Interpretation
 Scatterplots: The color-coded scatterplot and facet grid help visualize how Sleep Hours might influence

the relationship between Study Hours and Exam Scores.


 Regression Analysis: The multiple regression model provides quantitative insights into the effect of both Study Hours and Sleep Hours on Exam Scores. The coefficients and p-values indicate the significance and strength of these relationships.
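Mediation, mentioned earlier as a reason for introducing a third variable, can also be examined directly. Below is a minimal two-step sketch (in the spirit of the Baron-Kenny approach), under the purely illustrative assumption that Sleep Hours acts as a mediator between Study Hours and Exam Scores; it reuses the data DataFrame and the statsmodels import from the steps above.
python
# Illustrative mediation check (assumption: Sleep Hours is treated as the mediator)
# Step 1: does the predictor (Study Hours) explain the mediator (Sleep Hours)?
mediator_model = sm.OLS(data['Sleep Hours'], sm.add_constant(data['Study Hours'])).fit()
# Step 2: does Study Hours still predict Exam Scores once the mediator is included?
full_model = sm.OLS(data['Exam Scores'], sm.add_constant(data[['Study Hours', 'Sleep Hours']])).fit()
print(mediator_model.params)  # effect of Study Hours on the mediator
print(full_model.params)      # direct effect of Study Hours after adjusting for Sleep Hours
If the Study Hours coefficient shrinks noticeably in the second model compared with a simple regression of Exam Scores on Study Hours alone, part of its effect may operate through the mediator.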
Conclusion
Introducing a third variable in EDA enriches the analysis by revealing hidden patterns, controlling for confounders,
and understanding interactions. Utilizing multivariate visualizations and statistical models allows for a
comprehensive examination of complex relationships, leading to more informed and accurate conclusions.
Causal Explanations
Causal explanations in EDA aim to identify and understand the cause-and-effect relationships between variables.
Unlike simple associations or correlations, causal explanations provide insights into how and why changes in one
variable lead to changes in another. Establishing causality is crucial for making informed decisions and
implementing effective interventions in various fields, including healthcare, economics, social sciences, and more.
Importance of Causal Explanations
1. Understanding Mechanisms: Identifying the underlying mechanisms through which one variable
affects another.
2. Effective Interventions: Designing and implementing strategies or policies that can effectively
address the root causes of issues.
3. Predictive Accuracy: Improving the accuracy of predictive models by incorporating causal
relationships.
4. Policy Making: Providing evidence-based support for policy decisions and interventions.

Methods for Establishing Causality


1. Randomized Controlled Trials (RCTs): The gold standard for establishing causality, involving
random assignment of subjects to treatment and control groups.
2. Natural Experiments: Using naturally occurring events or circumstances to study causal effects.
3. Quasi-Experimental Designs: Approaches like difference-in-differences, instrumental variables, and
regression discontinuity that attempt to infer causality when randomization is not possible.
4. Longitudinal Studies: Tracking the same subjects over time to observe how changes in one variable
lead to changes in another.
5. Causal Inference Methods: Statistical techniques like propensity score matching, Granger causality
tests, and structural equation modeling.
Example: The Impact of Study Hours on Exam Scores
Data
Consider a dataset with the following variables:
 Study Hours: Number of hours a student studies per week.
 Exam Scores: Scores obtained by students in an exam.

 Previous Academic Performance: A measure of students' performance in prior exams.
Step-by-Step Analysis
1. Initial Correlation Analysis
o Examine the correlation between Study Hours and Exam Scores.
python
import pandas as pd
# Sample data
data = pd.DataFrame({
    'Study Hours': [10, 15, 20, 25, 30, 35, 40],
    'Exam Scores': [50, 55, 60, 65, 70, 75, 80],
    'Previous Academic Performance': [45, 50, 55, 60, 65, 70, 75]
})
# Calculate correlation
correlation = data['Study Hours'].corr(data['Exam Scores'])
print(f"Correlation between Study Hours and Exam Scores: {correlation}")
2. Controlling for Confounding Variables
o Use multiple regression to control for Previous Academic Performance.

import statsmodels.api as sm
# Prepare the data for regression
X = data[['Study Hours', 'Previous Academic Performance']]
X = sm.add_constant(X)  # Adds a constant term to the predictor
y = data['Exam Scores']
# Fit the regression model and print the results
model = sm.OLS(y, X).fit()
print(model.summary())
3. Assessing Causality
o If Study Hours remain a significant predictor of Exam Scores after controlling for Previous
Academic Performance, this strengthens the causal argument.
python
# Interpretation of regression results
if model.pvalues['Study Hours'] < 0.05:
    print("Study Hours have a significant impact on Exam Scores, controlling for Previous Academic Performance.")
else:
    print("Study Hours do not have a significant impact on Exam Scores after controlling for Previous Academic Performance.")
4. Using Instrumental Variables
o Identify an instrumental variable (IV) that affects Study Hours but not directly Exam Scores,
to further validate causality.
python
# Example: Assuming 'Parental Encouragement' as an instrumental variable
data['Parental Encouragement'] = [3, 4, 2, 5, 3, 4, 5]
# First stage regression: Study Hours on Parental Encouragement
first_stage = sm.OLS(data['Study Hours'], sm.add_constant(data['Parental Encouragement'])).fit()
# Predicted Study Hours from the first stage
data['Predicted Study Hours'] = first_stage.predict(sm.add_constant(data['Parental Encouragement']))
# Second stage regression: Exam Scores on Predicted Study Hours
iv_model = sm.OLS(data['Exam Scores'], sm.add_constant(data['Predicted Study Hours'])).fit()
# Print the IV regression results
print(iv_model.summary())

Interpretation
 Correlation Analysis: The initial correlation provides a preliminary indication of a relationship
between Study Hours and Exam Scores.
 Multiple Regression: Controlling for Previous Academic Performance helps isolate the effect of Study
Hours on Exam Scores.
 Instrumental Variables: Using an IV like Parental Encouragement helps address endogeneity
issues, providing a more robust causal inference.
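Granger causality tests, listed among the causal inference methods above, apply to time-ordered data rather than the cross-sectional example used here. The sketch below uses synthetic daily series (hypothetical values, not part of the dataset above) only to illustrate the statsmodels call, which tests whether past values of the second column help predict the first.
python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import grangercausalitytests
# Synthetic series: today's score is driven by the previous day's study time
np.random.seed(1)
study = np.random.normal(2, 0.5, 100)
score = 50 + 5 * np.roll(study, 1) + np.random.normal(0, 1, 100)
ts = pd.DataFrame({'score': score, 'study': study})
# Test whether 'study' (second column) Granger-causes 'score' (first column)
results = grangercausalitytests(ts[['score', 'study']], maxlag=2)
print(results[1][0]['ssr_ftest'])  # F statistic and p-value at lag 1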
Causal explanations are a crucial component of EDA, enabling researchers and practitioners to move beyond mere
associations and uncover the true cause-and-effect relationships in their data. By employing a combination of
experimental designs, statistical controls, and advanced inference techniques, we can gain deeper insights and make
more informed decisions based on data.
Three-Variable Contingency Tables and Beyond

Three-variable contingency tables, also known as three-way tables, extend the concept of two-way tables to analyze
the relationship between three categorical variables simultaneously. These tables allow researchers to explore more
complex interactions and dependencies among variables, providing a deeper understanding of the data.
Importance of Three-Variable Contingency Tables
1. Exploring Interactions: Understanding how the relationship between two variables changes when
considering a third variable.
2. Identifying Conditional Dependencies: Determining if the association between two variables
depends on the level of a third variable.
3. Controlling for Confounders: Adjusting for a third variable to see if the association between the
primary variables is genuine or confounded.
Constructing a Three-Variable Contingency Table
To construct a three-variable contingency table, we categorize data based on three variables. Each cell in the table
represents a combination of these variables and contains the count or frequency of observations.
Example: Analyzing the Relationship Between Gender, Smoking Status, and Exercise Frequency
Consider a dataset with the following variables:
 Gender: Male, Female

 Smoking Status: Smoker, Non-Smoker


 Exercise Frequency: Regular, Occasional, Rarely
Step-by-Step Construction
1. Create the Data
o Prepare the dataset with the three variables.
python
import pandas as pd
# Sample data
data = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female'],
    'Smoking Status': ['Smoker', 'Non-Smoker', 'Smoker', 'Non-Smoker', 'Non-Smoker', 'Smoker', 'Non-Smoker', 'Smoker'],
    'Exercise Frequency': ['Regular', 'Occasional', 'Rarely', 'Regular', 'Occasional', 'Rarely', 'Regular', 'Occasional']
})
2. Create the Contingency Table
o Use the pd.crosstab function to create the three-variable contingency table.
python
# Create three-variable contingency table
contingency_table = pd.crosstab(index=[data['Gender'], data['Smoking Status']], columns=data['Exercise Frequency'])
print(contingency_table)
3. Interpreting the Table
o Each cell represents the count of observations for a specific combination of Gender, Smoking
Status, and Exercise Frequency.
Beyond Three-Variable Contingency Tables
While three-variable contingency tables provide valuable insights, they can become complex and difficult to interpret
as the number of variables increases. Therefore, additional techniques and tools are often used to analyze higher-
dimensional data.
1. Higher-Dimensional Tables: Extending to four or more variables, though interpretation becomes
increasingly challenging.
2. Multivariate Analysis: Using statistical techniques like logistic regression, factor analysis, or
cluster analysis to handle multiple variables simultaneously.
3. Data Visualization: Employing advanced visualization techniques such as heatmaps, mosaic
plots, or parallel coordinates to represent high-dimensional data.
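As a quick illustration of one of the visualization options above, a mosaic plot can be drawn directly from two of the categorical columns. This minimal sketch uses the mosaic helper from statsmodels and assumes the data DataFrame created in the example above.
python
from statsmodels.graphics.mosaicplot import mosaic
import matplotlib.pyplot as plt
# Mosaic plot of Gender vs. Smoking Status; tile areas are proportional to cell counts
mosaic(data, ['Gender', 'Smoking Status'])
plt.title('Mosaic Plot: Gender vs. Smoking Status')
plt.show()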
Example: Higher-Dimensional Analysis
Suppose we add a fourth variable, Age Group (Youth, Adult, Senior), to the previous example. We can create a
four-variable contingency table or use logistic regression to analyze the data.

Step-by-Step Higher-Dimensional Analysis



1. Create the Data


o Add the fourth variable to the dataset.
python
data['Age Group'] = ['Youth', 'Adult', 'Senior', 'Youth', 'Adult', 'Senior', 'Youth', 'Adult']
2. Higher-Dimensional Contingency Table
o Create a four-variable contingency table using pd.crosstab.
python
# Create four-variable contingency table
contingency_table_4d = pd.crosstab(index=[data['Gender'], data['Smoking Status'], data['Age Group']], columns=data['Exercise Frequency'])
print(contingency_table_4d)
3. Multivariate Analysis with Logistic Regression
o Use logistic regression to analyze the relationship between Exercise Frequency and the other
variables.
python
import statsmodels.api as sm
# Encode categorical variables (drop_first avoids redundant dummy columns)
data_encoded = pd.get_dummies(data, drop_first=True).astype(float)
# Define the target (rarely exercising) and predictors
y = data_encoded['Exercise Frequency_Rarely']
X = data_encoded.drop(columns=['Exercise Frequency_Rarely', 'Exercise Frequency_Regular'])
# Add constant term for the intercept
X = sm.add_constant(X)
# Fit the logistic regression model (such a tiny sample is for illustration only)
logit_model = sm.Logit(y, X).fit()
# Print the model summary
print(logit_model.summary())
Interpretation
 Contingency Tables: Provide a clear count of observations for each combination of variables, useful
for preliminary analysis.

 Logistic Regression: Offers a more nuanced understanding of the relationships, especially when dealing with more than three variables. The model coefficients and p-values indicate the significance and strength of these relationships.
Three-variable contingency tables are a powerful tool in EDA for exploring complex interactions and dependencies
among categorical variables. While they provide valuable insights, higher-dimensional data often require advanced
multivariate analysis and visualization techniques to fully understand the relationships. Utilizing these methods
allows for a deeper and more comprehensive exploration of the data, leading to more accurate and actionable
insights.

Fundamentals of Time Series Analysis:

Objective:
● To understand the concept of time series analysis
● To know the different characteristics of time series data
● To identify different methods of time series analysis
● To have an overview of univariate and multivariate time series
What Is Time Series Analysis?
Time-series analysis is a method of analyzing a collection of data points over a period of time. Instead of recording
data points intermittently or randomly, time series analysts record data points at consistent intervals over a set period
of time.
While time-series data is information gathered over time, various types of information describe how and when that
information was gathered. For example:
 Time series data: It is a collection of observations on the values that a variable takes at various points in
time.
 Cross-sectional data: Data from one or more variables that were collected simultaneously.
 Pooled data: It is a combination of cross-sectional and time-series data.
The variable varies according to a probability distribution, which shows the values Y can take and with what probability those values are taken.

Yt = μt + εt

Each observation Yt is the sum of the signal μt and the noise term εt.
Why Do We Need Time-Series Analysis?
Time series analysis has a range of applications in statistics, sales, economics, and many more areas. The common
point is the technique used to model the data over a given period of time.
The reasons for doing time series analysis are as follows:
 Features: Time series analysis can be used to track features like trend, seasonality, and variability.
 Forecasting: Time series analysis can aid in the prediction of stock prices. It is used if you would like to
know if the price will rise or fall and how much it will rise or fall.
 Inferences: You can predict the value and draw inferences from data using Time series analysis.
What is forecasting?
Forecasting is a technique for making predictions about the future by using historical data as inputs and analyzing trends. However, forecasting does not tell the future definitively; it only shows probabilities, so you should always double-check the results before making a decision.
What is a time series analysis and what are the benefits?
A time series analysis focuses on a series of data points ordered in time. This is one of the most widely used data
science analyses and is applied in a variety of industries.

Time Series Analysis Example


Non-stationary data—that is, data that is constantly fluctuating over time or is affected by time—is analyzed using
time series analysis. Because currency and sales are always changing, industries like finance, retail, and e-commerce
frequently use time series analysis. Stock market analysis, especially when combined with automated trading
algorithms, is an excellent example of time series analysis in action.
Time series analysis can be used in -
 Rainfall measurements
 Automated stock trading
 Industry forecast
 Temperature readings
 Sales forecasting
Consider an example of railway passenger data over a period of time.
On the X-axis, we have years, and on the Y-axis, we have the number of passengers.

The following observations can be derived from the given data.


1. Trend: Over time, an increasing or decreasing pattern has been observed. The total number of passengers has
risen over time.
2. Seasonality: Cyclic patterns are the ones that repeat after a certain interval of time. In the case of the railway
passenger, you can see a cyclic pattern with a high and low point that is visible throughout the interval.
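Since the passenger plot itself is not reproduced here, the following minimal sketch generates a synthetic monthly passenger-style series with an upward trend and a yearly cycle and plots it, so the trend and seasonality described above can be seen; the numbers are illustrative and not real railway data.
python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Synthetic monthly "passenger" counts: upward trend + yearly seasonality + noise
months = pd.date_range(start='2010-01-01', periods=120, freq='M')
trend = np.linspace(100, 400, 120)
seasonal = 30 * np.sin(2 * np.pi * np.arange(120) / 12)
passengers = trend + seasonal + np.random.normal(0, 10, 120)
pd.Series(passengers, index=months).plot()
plt.title('Synthetic Passenger Counts: Trend and Seasonality')
plt.xlabel('Year')
plt.ylabel('Passengers')
plt.show()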
Time Series Analysis Types
Some of the models of time series analysis include -
 Classification: It identifies and assigns categories to the data.
 Curve Fitting: It plots data on a curve to investigate the relationships between variables in the data.
 Descriptive Analysis: Patterns in time-series data, such as trends, cycles, and seasonal variation, are
identified.
 Explanative analysis: It attempts to understand the data and the cause-and-effect relationships within it.
 Segmentation: It splits the data into segments to reveal the source data's underlying properties.

ARIMA
ARIMA is an acronym for Autoregressive Integrated Moving Average. The Box-Jenkins method is another name for
this method.
Now you will explore the ARIMA parameters in detail:
 Autoregressive Component: AR stands for autoregressive, and its order is denoted by p. When p is 0, the model uses no autoregressive terms; when p is 1, the auto-correlation is considered up to one lag.
 Moving Average: The moving average order is denoted by q. When q is 1, the model includes one lagged forecast error term.
 Integration: The degree of differencing is denoted by d. When d is 0, the series is already stationary. When d is 1, the series is not stationary, and you can make it stationary by taking the first difference (a stationarity check is sketched below).
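Because the differencing order d depends on stationarity, it is common to test stationarity before choosing d. Below is a minimal sketch using the augmented Dickey-Fuller test from statsmodels, assuming a pandas Series named series holds the time series values.
python
from statsmodels.tsa.stattools import adfuller
# Augmented Dickey-Fuller test: a small p-value suggests the series is stationary
adf_stat, p_value = adfuller(series)[:2]
print(f"ADF statistic: {adf_stat:.3f}, p-value: {p_value:.3f}")
# If the series is not stationary, take the first difference (this corresponds to d = 1)
if p_value > 0.05:
    series_diff = series.diff().dropna()
    print("Differenced once; re-run the ADF test on series_diff.")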
Time series analysis is a statistical technique that deals with time-ordered data. The primary goal is to understand the
underlying structure and function that produced the observations and to forecast future values based on historical
data. Time series data are ubiquitous in various fields such as finance, economics, environmental science, and many
others.
Characteristics of Time Series Data
1. Trend: Long-term movement in the data. It represents the general direction in which the data is moving
over time.
2. Seasonality: Regular, repeating patterns or cycles in data at fixed intervals.
3. Cyclic Patterns: Irregular fluctuations that are not of fixed period but occur over long time frames.
4. Noise: Random variations that are unexplained by the model.
Steps in Time Series Analysis
1. Data Collection: Gather the data in a time-ordered sequence.
2. Data Cleaning: Handle missing values, outliers, and anomalies.
3. Data Visualization: Plot the data to identify patterns such as trends and seasonality.
4. Decomposition: Separate the time series into trend, seasonal, and residual components.
5. Modeling: Fit statistical models to the data for understanding and forecasting.
6. Validation: Evaluate the model's performance using validation techniques.
7. Forecasting: Predict future values using the developed model.
Decomposition of Time Series
Decomposition involves breaking down a time series into its constituent components:
 Trend Component (T): Represents the long-term progression of the series.
 Seasonal Component (S): Captures the repeating short-term cycle.
 Residual Component (R): The remaining part after removing trend and seasonality.
Example: Decomposition using Python
python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
# Generate sample data
np.random.seed(0)
date_rng = pd.date_range(start='1/1/2020', end='1/1/2022', freq='D')
df = pd.DataFrame(date_rng, columns=['date'])
df['data'] = np.sin(np.linspace(-2 * np.pi, 2 * np.pi, len(df))) + np.random.normal(0, 0.1, len(df))
df.set_index('date', inplace=True)
# Decompose the time series
result = seasonal_decompose(df['data'], model='additive')
# Plot the decomposition
result.plot()
plt.show()

Common Time Series Models


1. Autoregressive (AR) Model: Predicts future values based on past values.
2. Moving Average (MA) Model: Predicts future values based on past errors.
3. ARIMA Model: Combines AR and MA models with differencing to make the series stationary.
4. Seasonal ARIMA (SARIMA) Model: Extends ARIMA to capture seasonality.
5. Exponential Smoothing: Applies weighted averages of past observations, giving more weight to
recent observations.
6. Prophet: Developed by Facebook, suitable for forecasting time series with strong seasonal
effects and multiple seasonality.
Example: ARIMA Model using Python
python
from statsmodels.tsa.arima.model import ARIMA
# Fit an ARIMA model and print its summary
model_fit = ARIMA(df['data'], order=(1, 1, 1)).fit()
print(model_fit.summary())
# Forecast future values
forecast = model_fit.forecast(steps=10)
print(forecast)
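Exponential smoothing, listed among the common models above, can be fitted in a similar way. The following is a minimal sketch using the Holt-Winters implementation in statsmodels on the same df['data'] series; the additive trend is an illustrative choice.
python
from statsmodels.tsa.holtwinters import ExponentialSmoothing
# Fit an exponential smoothing model with an additive trend
es_model = ExponentialSmoothing(df['data'], trend='add').fit()
# Forecast the next 10 values
print(es_model.forecast(10))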
Evaluating Time Series Models
1. Mean Absolute Error (MAE): Average of absolute errors.
2. Mean Squared Error (MSE): Average of squared errors.
3. Root Mean Squared Error (RMSE): Square root of MSE.
4. Mean Absolute Percentage Error (MAPE): Average of absolute percentage errors.
Example: Evaluating Model using Python
python
from sklearn.metrics import mean_absolute_error, mean_squared_error
# Hold out the last 10 points, refit on the rest, and forecast the held-out period
actual = df['data'].iloc[-10:]
model_fit = ARIMA(df['data'].iloc[:-10], order=(1, 1, 1)).fit()
predicted = model_fit.forecast(steps=10)
# Calculate MAE and RMSE
mae = mean_absolute_error(actual, predicted)
rmse = np.sqrt(mean_squared_error(actual, predicted))
print(f"MAE: {mae}, RMSE: {rmse}")
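MAPE, the fourth metric listed above, can be computed from the same arrays; note that it becomes unstable when the actual values are close to zero, as they can be for this sine-based sample series.
python
# Mean Absolute Percentage Error (in percent)
mape = np.mean(np.abs((actual - predicted) / actual)) * 100
print(f"MAPE: {mape:.2f}%")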
Time series analysis is a powerful tool for understanding and forecasting data that varies over time. By decomposing the data, fitting appropriate models, and evaluating their performance, we can make accurate predictions and uncover insights that inform decision-making in various domains.
Data Cleaning
Data cleaning is a crucial step in time series analysis to ensure the accuracy and reliability of the results. This involves:
1. Handling Missing Values: Imputing or interpolating missing data points to maintain continuity.
o Imputation: Filling missing values with a specific value like the mean, median, or mode.
o Interpolation: Estimating missing values based on surrounding data points.
2. Outlier Detection and Removal: Identifying and treating anomalies or unusual data points that can skew the analysis.
3. Smoothing: Reducing noise to highlight underlying patterns using methods like moving averages (a rolling mean sketch follows the example below).
Example: Data Cleaning using Python
python
import pandas as pd
import numpy as np
# Sample time series data with missing values and outliers
date_rng = pd.date_range(start='2020-01-01', end='2020-01-10', freq='D')
data = pd.Series([1, np.nan, 3, 4, 5, np.nan, 7, 100, 9, 10], index=date_rng)
# Handling missing values
data_filled = data.interpolate()
# Removing outliers
data_filled[data_filled > 20] = np.nan  # Assume values greater than 20 are outliers
data_cleaned = data_filled.interpolate()
print(data_cleaned)
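For the smoothing step, a rolling mean (moving average) is the usual choice. A minimal sketch applied to the cleaned series from the example above; the 3-day window is an illustrative choice.
python
# 3-day rolling mean to smooth out short-term noise
rolling_mean = data_cleaned.rolling(window=3).mean()
print(rolling_mean)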
Time-Based Indexing
Time-based indexing involves setting the index of a DataFrame or Series to a datetime object, allowing for more
convenient and efficient time series operations.
Example: Time-Based Indexing using Python
python
# Create DataFrame with time-based indexing
df = pd.DataFrame({
    'value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
}, index=pd.date_range(start='2020-01-01', end='2020-01-10'))

print(df)

Visualizing in Time Series Analysis


Visualization is a crucial aspect of time series analysis as it helps to understand the data's structure, trends, patterns,
and anomalies over time. Effective visualizations can reveal insights that might not be apparent through statistical
analysis alone.
Here are some common visualization techniques used in time series analysis:
1. Line Plots
 Description: Line plots are the most basic and commonly used method for visualizing time series data.
They plot data points sequentially over time with lines connecting the points.
 Usage: Ideal for displaying trends and patterns over time.

 Example:
import pandas as pd
import matplotlib.pyplot as plt
# Sample data
date_range = pd.date_range(start='2020-01-01', periods=100, freq='D')
data = pd.Series(range(100), index=date_range)
# Plotting
data.plot()
plt.title('Line Plot of Time Series Data')
plt.xlabel('Date')
plt.ylabel('Value')
plt.show()
2. Seasonal Plots
 Description: Seasonal plots help to visualize seasonal patterns by plotting data for each period (e.g.,
month, year) on the same graph.
 Usage: Useful for identifying and comparing seasonal patterns across different periods.
 Example:
import seaborn as sns
# Build a DataFrame with a 'date' column and a 'value' column
data = pd.DataFrame({'date': date_range, 'value': range(100)})
data['month'] = data['date'].dt.month
# Seasonal plot
sns.lineplot(data=data, x='month', y='value')
plt.title('Seasonal Plot of Time Series Data')
plt.xlabel('Month')
plt.ylabel('Value')
plt.show()
3. Autocorrelation Plots (ACF)
 Description: Autocorrelation plots display the correlation of the time series with its own past values
(lags).
 Usage: Useful for identifying the degree of correlation between time steps and for checking the
randomness of data.
 Example:
python
from statsmodels.graphics.tsaplots import plot_acf
# Plotting ACF
plot_acf(data['value'])
plt.title('Autocorrelation Plot')
plt.xlabel('Lags')
plt.ylabel('Autocorrelation')
plt.show()
4. Heatmaps
 Description: Heatmaps represent data in a matrix form, where values are represented by different colors.
Time series heatmaps can show patterns across two dimensions, such as time of day and day of week.
 Usage: Useful for visualizing seasonal and cyclical patterns across multiple dimensions.
 Example:
python
data['day_of_week'] = data['date'].dt.dayofweek
data['hour'] = data['date'].dt.hour
# Creating a pivot table
heatmap_data = data.pivot_table(index='day_of_week', columns='hour', values='value')
# Plotting heatmap
sns.heatmap(heatmap_data, cmap='coolwarm')
plt.title('Heatmap of Time Series Data')
plt.xlabel('Hour')
plt.ylabel('Day of Week')
plt.show()
5. Box Plots
 Description: Box plots (or whisker plots) display the distribution of data based on a five-number summary: minimum, first quartile (Q1), median, third quartile (Q3), and maximum.
 Usage: Useful for comparing distributions across different time periods.

Example:
sns.boxplot(x='month', y='value', data=data)
plt.title('Box Plot of Time Series Data')
plt.xlabel('Month')
plt.ylabel('Value')
plt.show()
6. Decomposition Plots
 Description: Decomposition plots break down a time series into its component parts: trend, seasonal,
and residual components.
 Usage: Useful for understanding the underlying structure of the time series data.

 Example:
python
from statsmodels.tsa.seasonal import seasonal_decompose
# Decompose the time series
decomposition = seasonal_decompose(data['value'], model='additive', period=12)
# Plotting decomposition
decomposition.plot()
plt.show()
Importance of Visualization in Time Series Analysis
 Trend Identification: Visualizations help identify long-term trends in the data.
 Seasonal Patterns: They reveal recurring patterns and seasonality.
 Anomaly Detection: Visualizing data can highlight outliers or anomalies.
 Data Cleaning: Helps in identifying missing values and incorrect data points.
 Model Selection: Assists in choosing appropriate models for forecasting and analysis based on observed
patterns.
Effective visualization is essential for gaining insights and making informed decisions based on time series data. By
utilizing these visualization techniques, analysts can better understand and interpret their data.
Grouping
Grouping in Time Series Analysis (TSA)
Grouping in Time Series Analysis is a powerful technique used to aggregate data based on specific time intervals or
other criteria. This method helps in summarizing the data, identifying patterns, and gaining insights over different
periods.
Key Concepts of Grouping in TSA
1. Time-based Grouping:
o Grouping data based on time intervals such as hourly, daily, weekly, monthly, or yearly.
o Useful for identifying trends and seasonal patterns over specific periods.
2. Custom Grouping:
o Grouping based on custom criteria such as weekdays vs. weekends, business hours vs. non-business
hours, etc.
o Helps in understanding the impact of external factors on the time series data.

Grouping with Pandas


Pandas is a powerful library for handling and analyzing time series data. The groupby method in Pandas is commonly
used for grouping data. Below are examples illustrating various grouping techniques.
Example 1: Time-based Grouping
Hourly Grouping
python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Creating a sample time series data
date_range = pd.date_range(start='2023-01-01', periods=1000, freq='H')
data = pd.Series(np.random.randn(1000), index=date_range)
# Grouping by hour
hourly_group = data.groupby(data.index.hour).mean()
# Plotting the grouped data
hourly_group.plot(kind='bar')
plt.title('Average Value by Hour')
plt.xlabel('Hour of the Day')
plt.ylabel('Average Value')
plt.show()
Monthly Grouping
# Grouping by month
monthly_group = data.groupby(data.index.month).sum()
# Plotting the grouped data
monthly_group.plot(kind='bar')
plt.title('Total Value by Month')
plt.xlabel('Month')
plt.ylabel('Total Value')
plt.show()


Example 2: Custom Grouping
Weekday vs. Weekend
python
# Adding a column to indicate if the day is a weekday or weekend
data_frame = data.to_frame(name='value')
data_frame['is_weekend'] = data_frame.index.weekday >= 5
# Grouping by weekday vs. weekend
weekend_group = data_frame.groupby('is_weekend')['value'].mean()
# Plotting the grouped data
weekend_group.plot(kind='bar')
plt.title('Average Value: Weekday vs. Weekend')
plt.xlabel('Is Weekend')
plt.ylabel('Average Value')
plt.xticks([0, 1], ['Weekday', 'Weekend'])
plt.show()
Business Hours vs. Non-Business Hours
python
# Adding a column to indicate if the time is within business hours (9am-5pm)
data_frame['is_business_hours'] = data_frame.index.hour.isin(range(9, 17))
# Grouping by business hours vs. non-business hours
business_hours_group = data_frame.groupby('is_business_hours')['value'].mean()
# Plotting the grouped data
business_hours_group.plot(kind='bar')
plt.title('Average Value: Business Hours vs. Non-Business Hours')
plt.xlabel('Is Business Hours')
plt.ylabel('Average Value')
plt.xticks([0, 1], ['Non-Business Hours', 'Business Hours'])
plt.show()

Benefits of Grouping in TSA


1. Data Summarization:
o Grouping helps in summarizing large datasets into meaningful statistics such as mean, sum,
count, etc.
o It simplifies the data, making it easier to analyze and interpret.
2. Pattern Identification:
o By grouping data over different time intervals, one can identify trends and seasonal
patterns.
o It helps in understanding the behavior of the time series data.
3. Anomaly Detection:
o Grouping data can reveal anomalies or outliers that deviate from the expected pattern.
o This is crucial for identifying unusual events or behaviors in the data.
Grouping is an essential technique in time series analysis that enables better data management and analysis. Whether it's time-based grouping or custom grouping, using tools like Pandas makes it straightforward to implement and visualize the results. This approach provides valuable insights into the data, aiding in more informed decision-making.
Resampling
Resampling is a powerful technique in time series analysis that involves changing the frequency of your time series
data. This can include:
 Downsampling: Reducing the frequency of the data (e.g., converting daily data to monthly data).
 Upsampling: Increasing the frequency of the data (e.g., converting monthly data to daily data).
Here's an example of how to perform resampling in pandas:
Example Dataset

Let's create a sample time series dataset:


import pandas as pd
import numpy as np
# Create a date range
date_rng = pd.date_range(start='2020-01-01', end='2020-01-10', freq='D')
# Create a DataFrame with this date range and a random value column
df = pd.DataFrame(date_rng, columns=['date'])
df['data'] = np.random.randint(0, 100, size=(len(date_rng)))
# Set the date column as the index
df.set_index('date', inplace=True)
print(df)
Downsampling
To downsample the data from daily to a different frequency (e.g., monthly), use the resample method:
# Downsampling to 3-day frequency, calculating the mean
df_downsampled = df.resample('3D').mean()
print(df_downsampled)
This will output the mean value of the data for every 3 days.
Upsampling
To upsample the data from daily to a higher frequency (e.g., hourly), use the resample method and then fill the
resulting NaN values:
# Upsampling to hourly frequency, filling NaN values using forward fill method
df_upsampled = df.resample('H').ffill()

print(df_upsampled)
This will fill the NaN values by propagating the last valid observation forward.
Example
Here is the complete example:
python
import pandas as pd
import numpy as np
# Create a date range
date_rng = pd.date_range(start='2020-01-01', end='2020-01-10', freq='D')
# Create a DataFrame with this date range and a random value column
df = pd.DataFrame(date_rng, columns=['date'])
df['data'] = np.random.randint(0, 100, size=(len(date_rng)))
# Set the date column as the index
df.set_index('date', inplace=True)
print("Original DataFrame:")
print(df)
# Downsampling to 3-day frequency, calculating the mean
df_downsampled = df.resample('3D').mean()
print("\nDownsampled DataFrame (3-day frequency, mean):")
print(df_downsampled)
# Upsampling to hourly frequency, filling NaN values using forward fill method
df_upsampled = df.resample('H').ffill()
print("\nUpsampled DataFrame (hourly frequency, forward fill):")
print(df_upsampled)
Output
Original DataFrame:

2020-01-01 81
2020-01-02 43
2020-01-03 23
2020-01-04 76
2020-01-05 21
2020-01-06 34
2020-01-07 84
2020-01-08 45
2020-01-09 15
2020-01-10 57
Downsampled DataFrame (3-day frequency, mean):

2020-01-01 49.000000
2020-01-04 43.666667
2020-01-07 48.000000
2020-01-10 57.000000
Upsampled DataFrame (hourly frequency, forward fill):

2020-01-01 00:00:00 81
2020-01-01 01:00:00 81
2020-01-01 02:00:00 81
2020-01-01 03:00:00 81
2020-01-01 04:00:00 81
...
2020-01-09 20:00:00 15
2020-01-09 21:00:00 15
2020-01-09 22:00:00 15
2020-01-09 23:00:00 15
2020-01-10 00:00:00 57
[217 rows x 1 columns]
These examples show how you can resample your time series data to different frequencies using pandas, providing
flexibility in your data analysis.

Unit V Completed
