AD3301 DATA EXPLORATION AND VISUALIZATION
UNIT 5
MULTIVARIATE AND TIME
SERIES ANALYSIS
Presented By
Dr R Murugadoss
Professor
Artificial Intelligence & Data Science
Introducing a Third Variable
– Causal Explanations
– Three-Variable Contingency Tables and Beyond
– Longitudinal Data
– Fundamentals of TSA
– Characteristics of time series data
– Data Cleaning
– Time-based indexing
– Visualizing
– Grouping
– Resampling
What is a causal explanation in data exploration?
Causal inference techniques developed for experimental data require additional
assumptions to produce reasonable inferences from observational data. The difficulty
of causal inference under such circumstances is often summed up as
"correlation does not imply causation".
What is an example of a causal explanation?
Population increase causes technological innovation. A free press causes a low
incidence of famine. The fiscal system of the ancien régime caused the collapse
of the French monarchy.
What is a three-variable contingency table?
Three-way contingency tables involve three binary or categorical variables. I will
stick mostly to the binary case to keep things simple, but a three-way table can
have any number of categories for each variable. Typical examples (a small pandas
sketch follows this list):
• Smoking × Breathing × Age
• Group × Response × Z (hypothetical)
• Boy Scouts × Delinquent × SES (hypothetical)
• Cal graduate admissions × Gender × Department
• Supervisor job satisfaction × Worker job satisfaction × Management quality
• Race × Questions regarding media × Year
• Employment status × Residence × Months after Hurricane Katrina
• Cure (C) × Gender (G) × Therapy (T)
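As a minimal sketch (with hypothetical records), the last example, Cure × Gender × Therapy, can be tabulated in pandas by passing two of the variables as the row index of pd.crosstab:

```python
import pandas as pd

# Hypothetical records for Cure (C), Gender (G) and Therapy (T)
df = pd.DataFrame({
    "Therapy": ["A", "A", "B", "B", "A", "B", "A", "B"],
    "Gender":  ["M", "F", "M", "F", "M", "F", "F", "M"],
    "Cure":    ["Yes", "No", "Yes", "Yes", "No", "No", "Yes", "No"],
})

# Three-way contingency table: rows = Therapy x Gender, columns = Cure
table = pd.crosstab([df["Therapy"], df["Gender"]], df["Cure"])
print(table)
```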
Longitudinal Data
A dataset is longitudinal if it tracks the same type of information on the same
subjects at multiple points in time. For example, part of a longitudinal dataset
could contain specific students and their standardized test scores in six
successive years. The table below is another example: it tracks prices (in US
dollars) for several sizes (in gigabytes) over five successive years, 2003–2007.
Size (GB)   2003   2004   2005   2006   2007
80          $155   $115    $85    $80    $70
120         $245   $150   $115   $115    $80
160         $500   $200   $145   $120    $85
200         $749   $260   $159   $140   $100
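A minimal pandas sketch of holding this longitudinal table, and of reshaping it into one row per (size, year) observation; the values are taken from the table above:

```python
import pandas as pd

# Prices (USD) by size (GB), tracked over five successive years (table above)
prices = pd.DataFrame(
    {
        2003: [155, 245, 500, 749],
        2004: [115, 150, 200, 260],
        2005: [85, 115, 145, 159],
        2006: [80, 115, 120, 140],
        2007: [70, 80, 85, 100],
    },
    index=pd.Index([80, 120, 160, 200], name="Size (GB)"),
)

# Reshape to "long" format: one row per (size, year) observation
long_form = prices.reset_index().melt(
    id_vars="Size (GB)", var_name="Year", value_name="Price (USD)"
)
print(long_form.head())
```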
Fundamentals of TSA
Time series analysis is a specific way of analyzing a sequence of data
points collected over an interval of time. In time series analysis,
analysts record data points at consistent intervals over a set period of
time rather than just recording the data points intermittently or
randomly.
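A minimal sketch of such a regularly spaced series in pandas, using a made-up monthly sales figure as the measured quantity:

```python
import numpy as np
import pandas as pd

# Data points recorded at a consistent (monthly) interval over a set period
idx = pd.date_range(start="2022-01-01", periods=12, freq="MS")
sales = pd.Series(np.random.default_rng(0).integers(100, 200, size=12),
                  index=idx, name="monthly_sales")
print(sales.head())
```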
Characteristics of time series
Studying the past behavior of a series will help you identify patterns and make better forecasts. When
plotted, many time series exhibit one or more of the following features (a synthetic sketch follows this list):
• Trends
A trend is a gradual upward or downward shift in the level of the series, or the tendency of the series values to increase or decrease over time.
• Seasonal cycles
A seasonal cycle is a repetitive, predictable pattern in the series values.
• Nonseasonal cycles
A nonseasonal cycle is a repetitive, possibly unpredictable, pattern in the series values.
• Pulses and steps
Many series experience abrupt changes in level. These generally come in two types:
  - A sudden, temporary shift, or pulse, in the series level
  - A sudden, permanent shift, or step, in the series level
• Outliers
Shifts in the level of a time series that cannot be explained are referred to as outliers.
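The sketch below generates a synthetic series (made-up numbers) containing a trend, a 12-month seasonal cycle, a pulse, and a step, and plots it so these features are visible:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

idx = pd.date_range("2020-01-01", periods=48, freq="MS")
t = np.arange(48)

trend = 0.5 * t                              # gradual upward shift in level
seasonal = 5 * np.sin(2 * np.pi * t / 12)    # repeating 12-month cycle
noise = np.random.default_rng(1).normal(0, 1, 48)

series = pd.Series(trend + seasonal + noise, index=idx)
series.iloc[30] += 15        # pulse: sudden, temporary shift in level
series.iloc[40:] += 10       # step: sudden, permanent shift in level

series.plot(title="Trend + seasonal cycle + pulse + step")
plt.show()
```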
Data Cleaning
Data cleaning is the process of fixing or removing incorrect,
corrupted, incorrectly formatted, duplicate, or incomplete data
within a dataset. When combining multiple data sources, there are
many opportunities for data to be duplicated or mislabeled.
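A minimal pandas sketch of typical cleaning steps on a small, hypothetical messy dataset (the column names and values are invented for illustration):

```python
import pandas as pd

# Hypothetical messy records: duplicates, bad formats, missing values
df = pd.DataFrame({
    "date":   ["2023-01-05", "2023-01-05", "not a date", "2023-02-10"],
    "amount": ["100", "100", "250", None],
    "region": [" north", " north", "South ", "south"],
})

df = df.drop_duplicates()                                    # remove duplicate rows
df["date"] = pd.to_datetime(df["date"], errors="coerce")     # fix date formatting
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")  # fix numeric formatting
df["region"] = df["region"].str.strip().str.title()          # fix inconsistent labels
df = df.dropna(subset=["date", "amount"])                    # drop incomplete rows
print(df)
```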
Time-based Indexing
A pandas DataFrame or Series with a time-based index is treated as a time
series. The values stored in the series can be anything that fits inside the
container; the date or time values are merely used to retrieve them.
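A minimal sketch, assuming a Series with a daily DatetimeIndex: rows can then be retrieved by a partial date string or a date-range slice:

```python
import pandas as pd

# Build a Series with a time-based (Datetime) index
idx = pd.date_range("2023-01-01", periods=90, freq="D")
ts = pd.Series(range(90), index=idx)

print(ts.loc["2023-02"])                   # all of February 2023
print(ts.loc["2023-01-10":"2023-01-20"])   # a date-range slice
```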
Visualizing
What is Exploratory Data Analysis?
Exploratory Data Analysis (EDA) is a process of describing the data by means of statistical and visualization
techniques in order to bring important aspects of that data into focus for further analysis.
Data visualization is an important component of Exploratory Data
Analysis (EDA) because it allows a data analyst to “look at” their data
and get to know the variables and relationships between them.
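A minimal sketch of "looking at" a series during EDA, using a made-up monthly series and two common views (a line plot of the level over time and a histogram of the values):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

idx = pd.date_range("2021-01-01", periods=36, freq="MS")
ts = pd.Series(np.random.default_rng(2).normal(100, 10, 36).cumsum(), index=idx)

fig, axes = plt.subplots(2, 1, figsize=(8, 6))
ts.plot(ax=axes[0], title="Line plot: level over time")
ts.plot(kind="hist", ax=axes[1], title="Histogram: distribution of values")
plt.tight_layout()
plt.show()
```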
Grouping Datasets
During data analysis, it is often essential to cluster or group data together
based on certain criteria. For example, an e-commerce store might want to
group all the sales that were done during the Christmas period or the
orders that were received on Black Friday. These grouping concepts occur in
several parts of data analysis. In this chapter, we will cover the
fundamentals of grouping techniques and how doing this can improve data
analysis. We will discuss different groupby() mechanics that will accumulate
our dataset into various classes that we can perform aggregation on. We
will also figure out how to dissect this categorical data with visualization by
utilizing pivot tables and cross-tabulations.
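A minimal sketch of these groupby(), pivot-table, and cross-tabulation mechanics on a small, hypothetical orders dataset:

```python
import pandas as pd

orders = pd.DataFrame({
    "region":  ["North", "South", "North", "South", "North"],
    "product": ["A", "A", "B", "B", "A"],
    "sales":   [120, 90, 200, 150, 80],
})

# groupby(): accumulate rows into classes, then aggregate each class
print(orders.groupby("region")["sales"].agg(["sum", "mean"]))

# Pivot table: region x product summary of sales
print(pd.pivot_table(orders, values="sales", index="region",
                     columns="product", aggfunc="sum"))

# Cross-tabulation: counts of orders per region/product combination
print(pd.crosstab(orders["region"], orders["product"]))
```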
Resampling
Resampling is a method that involves repeatedly drawing samples from the
training dataset. These samples are then used to refit a specific model in
order to retrieve more information about the fitted model. The aim is to
gather more information about the sample, improve the accuracy of the
estimate, and quantify its uncertainty.
Two resampling methods are frequently used in data science:
1. The Bootstrap Method
2. Cross-Validation
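A minimal sketch of the bootstrap method on a small made-up sample: draw many samples with replacement, recompute the statistic each time, and use the spread of those recomputed values to estimate uncertainty:

```python
import numpy as np

rng = np.random.default_rng(0)
sample = np.array([4.2, 5.1, 6.3, 5.8, 4.9, 6.0, 5.5])  # made-up observations

# Repeatedly draw bootstrap samples (with replacement) and recompute the mean
boot_means = [rng.choice(sample, size=sample.size, replace=True).mean()
              for _ in range(1000)]

print("estimate:", sample.mean())
print("bootstrap std. error:", np.std(boot_means))
```

Cross-validation, the second method, is usually performed with a library routine such as scikit-learn's cross_val_score, which refits the model on repeated train/test splits.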