EDA Unit 5
Inferential Statistics
Inferential statistics is a branch of statistics that makes use of various analytical tools to
draw inferences about a population from sample data. Descriptive and inferential statistics
serve different purposes: descriptive statistics helps to describe and organize known data
using charts, bar graphs, etc., while inferential statistics aims at making inferences and
generalizations about the population.
Descriptive statistics allow you to describe a data set, while inferential statistics allow you
to make inferences based on a data set. The samples chosen in inferential statistics need to
be representative of the entire population.
There are two main types of inferential statistics - hypothesis testing and regression
analysis.
   • Hypothesis Testing - This technique involves the use of hypothesis tests such as the
       z-test, F-test, t-test, etc. to make inferences about the population data. It requires
       setting up the null hypothesis and the alternative hypothesis, and defining the
       decision criteria.
   • Regression Analysis - Such a technique is used to check the relationship between
       dependent and independent variables. The most commonly used type of regression
       is linear regression.
T-test
    • Used when the sample size is less than 30 and the data set follows a t-distribution.
    • The population variance is not known to the researcher.
         o Null Hypothesis: H0: μ = μ0
         o Alternate Hypothesis: H1: μ > μ0
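A minimal Python sketch of such a one-sample t-test; the sample values and μ0 are made up for illustration, and the alternative='greater' argument of scipy.stats.ttest_1samp requires SciPy 1.6 or newer:
import numpy as np
from scipy import stats

# Hypothetical small sample (n < 30) and hypothesized population mean
sample = np.array([12.1, 11.8, 12.6, 12.3, 11.9, 12.4, 12.0, 12.2])
mu_0 = 11.8

# One-sample t-test with H1: mu > mu_0
t_stat, p_value = stats.ttest_1samp(sample, popmean=mu_0, alternative='greater')
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")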
F-test
    • Checks whether a difference exists between the variances of two samples or
       populations. The test statistic is the ratio of the two sample variances.
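A hedged sketch of an F-test comparing two sample variances; the data are made up, and the statistic and p-value are computed directly from the F distribution in SciPy rather than through a single built-in function:
import numpy as np
from scipy import stats

# Two hypothetical samples
x = np.array([21.0, 23.5, 22.1, 20.8, 24.2, 22.9])
y = np.array([19.5, 25.3, 18.2, 26.1, 20.4, 24.8])

# F statistic: ratio of the two sample variances
f_stat = np.var(x, ddof=1) / np.var(y, ddof=1)
dfn, dfd = len(x) - 1, len(y) - 1

# Two-sided p-value from the F distribution
p_one_sided = stats.f.sf(f_stat, dfn, dfd) if f_stat > 1 else stats.f.cdf(f_stat, dfn, dfd)
p_value = 2 * p_one_sided
print(f"F = {f_stat:.3f}, p = {p_value:.3f}")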
Multivariate Analysis:
Multivariate analysis refers to statistical techniques used to analyze and understand
relationships between multiple variables simultaneously. It involves exploring patterns,
dependencies, and associations among variables in a dataset. Some commonly used
multivariate analysis techniques include:
    • Multivariate Regression Analysis: Extends simple linear regression to analyze the
       relationship between multiple independent variables and a dependent variable.
    • Principal Component Analysis (PCA): Reduces the dimensionality of a dataset by
       transforming variables into a smaller set of uncorrelated variables called principal
       components.
    • Factor Analysis: Examines the underlying factors or latent variables that explain the
       correlations among a set of observed variables.
    • Cluster Analysis: Identifies groups or clusters of similar observations based on the
       similarity of their attributes.
    • Discriminant Analysis: Differentiates between two or more predefined groups based
       on a set of predictor variables.
    • Canonical Correlation Analysis: Analyzes the relationship between two sets of
       variables to identify the underlying dimensions that are shared between them.
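As an illustration of one technique from this list, the following is a minimal PCA sketch; the dataset and the choice of scikit-learn are illustrative assumptions:
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical dataset: 6 observations of 3 correlated variables
X = np.array([[2.5, 2.4, 1.2],
              [0.5, 0.7, 0.3],
              [2.2, 2.9, 1.5],
              [1.9, 2.2, 1.0],
              [3.1, 3.0, 1.7],
              [2.3, 2.7, 1.4]])

# Reduce the three variables to two uncorrelated principal components
pca = PCA(n_components=2)
components = pca.fit_transform(X)
print(components)
print("Explained variance ratio:", pca.explained_variance_ratio_)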
Approach (for a dataset containing age, income, and education level):
Descriptive Analysis:
   • Calculate descriptive statistics for each variable, including measures like mean,
       median, standard deviation, and frequency distributions.
   • Examine the distributions of age and income using histograms or boxplots to
       identify any outliers or unusual patterns.
   • Create cross-tabulations or contingency tables to explore the distribution of
       education level across different age groups or income brackets.
Correlation Analysis:
   • Calculate the correlation coefficients (e.g., Pearson correlation, Spearman
       correlation) between age, income, and education level.
   • Interpret the correlation coefficients to determine the strength and direction of the
       relationships between variables.
   • Visualize the relationships using a correlation matrix or a heatmap to identify any
       significant associations.
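A brief pandas sketch of this correlation step; the column names and survey values are hypothetical:
import pandas as pd

# Hypothetical survey data (education coded in years of schooling)
df = pd.DataFrame({
    'age': [25, 32, 47, 51, 62, 29, 41, 36],
    'income': [32000, 41000, 68000, 72000, 65000, 38000, 56000, 47000],
    'education_years': [12, 16, 18, 16, 14, 16, 18, 12],
})

# Pearson correlation matrix between the three variables
print(df.corr(method='pearson'))

# Spearman correlation is also available
print(df.corr(method='spearman'))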
Regression Analysis:
   • Perform multivariate regression analysis to assess the impact of age and education
       level on income.
   • Set income as the dependent variable and age and education level as independent
       variables.
   • Interpret the regression coefficients to understand how each independent variable
       influences the dependent variable.
   • Assess the overall model fit and statistical significance of the regression model.
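A minimal sketch of this regression step using statsmodels (any OLS implementation would do), with the same hypothetical columns; income is the dependent variable, age and education the predictors:
import pandas as pd
import statsmodels.api as sm

# Hypothetical data (same assumed columns as in the correlation sketch above)
df = pd.DataFrame({
    'age': [25, 32, 47, 51, 62, 29, 41, 36],
    'income': [32000, 41000, 68000, 72000, 65000, 38000, 56000, 47000],
    'education_years': [12, 16, 18, 16, 14, 16, 18, 12],
})

# Income as the dependent variable; age and education as independent variables
X = sm.add_constant(df[['age', 'education_years']])
y = df['income']

model = sm.OLS(y, X).fit()
print(model.summary())  # coefficients, R-squared, p-values, etc.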
Multivariate Visualization:
   • Create scatter plots or bubble plots to visualize the relationship between age,
       income, and education level.
   • Use different colors or symbols to represent different education levels and examine
       if there are distinct patterns or trends.
Further Analysis:
   • Consider additional multivariate techniques such as factor analysis or cluster
       analysis to explore underlying dimensions or groups within the data.
   • Conduct subgroup analyses or interaction analyses to investigate if the relationships
       differ across different demographic groups or educational backgrounds.
Causal explanations
       Causal explanations aim to understand the cause-and-effect relationships between
variables and explain why certain outcomes occur. They involve identifying the factors or
conditions that influence a particular outcome and determining the mechanisms through
which they operate.
       Causal explanations are important in various fields, including social sciences,
economics, psychology, and epidemiology, among others. They help researchers
understand the fundamental drivers of phenomena and develop interventions or policies to
bring about desired outcomes.
Some key aspects and approaches to consider when seeking causal explanations:
Association vs. Causation:
It's crucial to differentiate between mere associations or correlations between variables
and actual causal relationships. Correlation does not imply causation, and establishing
causality requires rigorous evidence, such as experimental designs or well-designed
observational studies that account for potential confounding factors.
Establishing Causality:
Several criteria need to be considered when establishing causality, such as temporal
precedence (the cause precedes the effect in time), covariation (the cause and effect vary
together), and ruling out alternative explanations.
Simple Scenario:
We want to investigate whether exercise has a causal effect on weight loss. We hypothesize
that regular exercise leads to a reduction in weight.
Explanation:
To establish a causal explanation, we would need to conduct a study that meets the criteria
for establishing causality, such as a randomized controlled trial (RCT). In this hypothetical
RCT, we randomly assign participants to two groups:
    • Experimental Group: Participants in this group are instructed to engage in a
       structured exercise program, such as 30 minutes of moderate-intensity aerobic
       exercise five times a week.
    • Control Group: Participants in this group do not receive any specific exercise
       instructions and maintain their usual daily activities.
The study is conducted over a period of three months, during which the weight of each
participant is measured at the beginning and end of the study. The data collected are as
follows:
(The individual weight measurements recorded for the Experimental and Control groups are not reproduced here.)
Analysis:
        We compare the average weight loss between the experimental and control groups.
The results show that the experimental group had an average weight loss of 4 kg, while the
control group had an average weight loss of only 1 kg. The difference in average weight loss
between the groups suggests that regular exercise has a causal effect on weight loss.
        Additionally, we can use statistical tests, such as t-tests or analysis of variance
(ANOVA), to determine if the observed difference in weight loss between the groups is
statistically significant. If the p-value is below a predetermined significance level (e.g., p <
0.05), we can conclude that the difference is unlikely due to chance alone and provides
further evidence for a causal relationship.
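A sketch of such a comparison in Python; the individual weight-loss values below are invented purely for illustration, chosen only so that the group means are roughly 4 kg and 1 kg as in the scenario above:
from scipy import stats

# Hypothetical weight loss (kg) per participant after three months
experimental = [4.5, 3.8, 4.2, 5.0, 3.6, 4.1, 4.4, 3.9]
control = [1.2, 0.8, 1.5, 0.9, 1.1, 0.7, 1.3, 1.0]

# Independent two-sample t-test comparing the group means
t_stat, p_value = stats.ttest_ind(experimental, control)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")

# If p < 0.05, the difference in mean weight loss is unlikely to be due to chance alone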
Example
Let's consider the variables "Gender" (Male/Female), "Education Level" (High
school/College/Graduate), and "Income Level" (Low/Medium/High). We want to explore if
there is an association between gender, education level, and income level.
A three-variable contingency table for this example might look like:
                                     Income Level
   Education Level          Low       Medium       High
   High School               20         40          30
   College                   30         50          40
   Graduate                  10         20          30
From this contingency table, we can analyze the relationship between these variables. For
example:
   • Conditional Relationships: We can examine the relationship between gender and
      income level, conditional on education level. This can be done by comparing the
      income level distribution for males and females within each education level
      category.
   • Marginal Relationships: We can examine the relationship between gender and
      education level, and between education level and income level separately by looking
      at the marginal distributions of the variables.
   • Assessing Dependency: We can perform statistical tests, such as the chi-square test,
      to determine if there is a statistically significant association between the variables.
      This helps assess the dependency and provides insights into potential causal
      explanations.
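A short SciPy sketch of the chi-square test of independence applied to the education-by-income counts shown above:
import numpy as np
from scipy.stats import chi2_contingency

# Contingency table: rows = education level, columns = income level (Low, Medium, High)
observed = np.array([[20, 40, 30],   # High School
                     [30, 50, 40],   # College
                     [10, 20, 30]])  # Graduate

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi-square = {chi2:.2f}, dof = {dof}, p = {p_value:.4f}")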
Crosstabs is just another name for contingency tables, which summarize the relationship
between different categorical variables. Crosstabs in SPSS can help you visualize the
proportion of cases in subgroups.
    • To describe a single categorical variable, we use frequency tables.
    • To describe the relationship between two categorical variables, we use a special
       type of table called a cross-tabulation (or "crosstab")
           o Categories of one variable determine the rows of the table
           o Categories of the other variable determine the columns
           o The cells of the table contain the number of times that a particular
              combination of categories occurred.
A "square" crosstab is one in which the row and column variables have the same number of
categories. Tables of dimensions 2x2, 3x3, 4x4, etc. are all square crosstabs.
Example 1
Example 2
Example 3
   •   A → Row(s): One or more variables to use in the rows of the crosstab(s). You must
       enter at least one Row variable.
   •   B →Column(s): One or more variables to use in the columns of the crosstab(s). You
       must enter at least one Column variable.
   •   C → Layer: An optional "stratification" variable. When a layer variable is specified,
       the crosstab between the Row and Column variable(s) will be created at each level
       of the layer variable. You can have multiple layers of variables by specifying the first
       layer variable and then clicking Next to specify the second layer variable.
   •   D → Statistics: Opens the Crosstabs: Statistics window, which contains fifteen
       different inferential statistics for comparing categorical variables.
   •   E → Cells: Opens the Crosstabs: Cell Display window, which controls which output is
       displayed in each cell of the crosstab.
   •   F → Format: Opens the Crosstabs: Table Format window, which specifies how the
       rows of the table are sorted.
Syntax
CROSSTABS
 /TABLES=RankUpperUnder BY LiveOnCampus BY State_Residency
 /FORMAT=AVALUE TABLES
 /CELLS=COUNT
 /COUNT ROUND CELL.
Output
Again, the Crosstabs output includes the boxes Case Processing Summary and the
crosstabulation itself.
Notice that after including the layer variable State Residency, the number of valid cases we
have to work with has dropped from 388 to 367. This is because the crosstab requires
nonmissing values for all three variables: row, column, and layer.
The layered crosstab shows the individual Rank by Campus tables within each level of State
Residency. Some observations we can draw from this table include:
   • A slightly higher proportion of out-of-state underclassmen live on campus (30/43)
       than do in-state underclassmen (110/168).
   • There were about equal numbers of out-of-state upper and underclassmen; for in-
       state students, the underclassmen outnumbered the upperclassmen.
   • Of the nine upperclassmen living on-campus, only two were from out of state.
Temporal Dependencies:
   • Time series data often exhibits temporal dependencies, where each observation is
     influenced by previous observations.
   • Understanding these dependencies is crucial for analyzing and forecasting time
     series data accurately.
   •   Seasonality: The recurring patterns or cycles that occur at fixed time intervals.
          o Example: Monthly Sales of Ice Cream
          o Sales of ice cream are higher during the summer months compared to the
             rest of the year, showing a seasonal pattern.
Data cleaning
Data cleaning is the process of identifying and correcting inaccurate records from a dataset
along with recognizing unreliable or irrelevant parts of the data.
Handling Missing Values:
   • Identify missing values in the time series data.
   • Decide on an appropriate method to handle missing values, such as interpolation,
       forward filling, or backward filling.
   • Use pandas or other libraries to fill or interpolate missing values.
Outlier Detection and Treatment:
   • Identify outliers in the time series data that may be caused by measurement errors
       or anomalies.
   • Use statistical techniques, such as z-score or modified z-score, to detect outliers.
   • Decide on the treatment of outliers, such as removing them, imputing them with a
       reasonable value, or replacing them using smoothing techniques.
Handling Duplicates:
   • Check for duplicate entries in the time series data.
   • Remove or handle duplicate values appropriately based on the specific
       requirements of the analysis.
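A minimal pandas sketch, assuming the sample series shown in the output below, that fills the missing entries by linear interpolation:
import pandas as pd
import numpy as np

# Sample daily series with missing values (matching the output below)
index = pd.date_range('2021-01-01', periods=5, freq='D', name='Date')
df = pd.DataFrame({'Value': [10, np.nan, 12, 18, np.nan]}, index=index)
print("Original Data:")
print(df)

# Interpolate missing values (the trailing NaN is filled with the last valid value)
df_interpolated = df.interpolate()
print("\nInterpolated Data:")
print(df_interpolated)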
Output
Original Data:
             Value
Date
2021-01-01   10.0
2021-01-02     NaN
2021-01-03   12.0
2021-01-04   18.0
2021-01-05     NaN
Interpolated Data:
             Value
Date
2021-01-01   10.0
2021-01-02   11.0
2021-01-03   12.0
2021-01-04   18.0
2021-01-05   18.0
import pandas as pd
import matplotlib.pyplot as plt

# Sample data containing an outlier (values taken from the output shown below)
index = pd.date_range('2021-01-01', periods=5, freq='D', name='Date')
df = pd.DataFrame({'Value': [10, 15, 12, 100, 20]}, index=index)

# Flag values above a simple cutoff as outliers (a z-score rule could be used instead)
threshold = 50
outliers = df[df['Value'] > threshold]

# Remove outliers
df_cleaned = df[df['Value'] <= threshold]

# Plot the original and cleaned time series data with outliers highlighted
plt.figure(figsize=(8, 4))
plt.plot(df.index, df['Value'], label='Original', color='blue')
plt.scatter(outliers.index, outliers['Value'], color='red', label='Outliers')
plt.plot(df_cleaned.index, df_cleaned['Value'], label='Cleaned', color='green')
plt.xlabel('Date')
plt.ylabel('Value')
plt.title('Outlier Detection and Treatment')
plt.legend()
plt.show()
Output
Original Data:
             Value
Date
2021-01-01        10
2021-01-02        15
2021-01-03        12
2021-01-04       100
2021-01-05        20
Cleaned Data:
             Value
Date
2021-01-01     10
2021-01-02     15
2021-01-03     12
2021-01-05     20
Time-based indexing
       Time-based indexing refers to the process of organizing and accessing data based on
timestamps or time intervals. It involves assigning timestamps to data records or events
and utilizing these timestamps to efficiently retrieve and manipulate the data.
       In time-based indexing, each data record or event is associated with a timestamp
indicating when it occurred. The timestamps can be precise points in time or time intervals,
depending on the granularity required for the application. The data is then organized and
indexed based on these timestamps, enabling quick and efficient access to specific time
ranges or individual timestamps.
       Time-based indexing is commonly used in various domains that involve time-series
data or events, such as financial markets, scientific research, IoT (Internet of Things)
applications, system monitoring, and social media analysis.
In the context of TSA (Time Series Analysis), time-based indexing refers to the practice of
organizing and accessing time-series data based on the timestamps associated with each
observation. TSA involves analyzing and modeling data that is collected over time, and
time-based indexing plays a crucial role in effectively working with such data.
Time-based indexing allows for efficient retrieval and manipulation of time-series data,
enabling various operations such as subsetting, filtering, and aggregation based on specific
time periods or intervals.
   •   Pandas: Pandas provides the DateTimeIndex object, which allows for indexing and
       manipulation of time-series data. It offers a wide range of time-based operations,
       such as slicing by specific time periods, resampling at different frequencies, and
       handling missing or irregular timestamps.
Time-based indexing in TSA is essential for conducting exploratory data analysis, fitting
time-series models, forecasting future values, and evaluating model performance.
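The short examples that follow (slicing, resampling, shifting, rolling windows, grouping) assume a small hypothetical DataFrame with a DateTimeIndex, built as sketched below; the dates and values are made up for illustration:
import pandas as pd

# Hypothetical daily time series (assumed values for illustration)
data = {'timestamp': pd.to_datetime(['2023-01-01', '2023-01-02',
                                     '2023-01-03', '2023-01-04']),
        'value': [10, 15, 12, 18]}
df = pd.DataFrame(data)
df = df.set_index('timestamp')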
Slicing: Slicing involves retrieving a subset of data within a specific time range. With time-
based indexing, you can easily slice the time-series data based on specific dates, times, or
time intervals.
Example:
           # Retrieve data between two specific dates
           subset = df['2023-01-01':'2023-03-31']
           print(subset)
Output
                value
timestamp
2023-01-02          15
2023-01-03          12
Resampling: Resampling involves changing the frequency of the time-series data. You can
upsample (increase frequency) or downsample (decrease frequency) the data to different
time intervals, such as aggregating hourly data to daily data or converting daily data to
monthly data.
Example:
        # Resample data to monthly frequency
        monthly_data = df.resample('M').mean()
        print(monthly_data)
Output
                value
timestamp
2023-01-31      13.75
Shifting: Shifting involves moving the timestamps of the data forwards or backwards by a
specified number of time units. This operation is useful for calculating time differences or
creating lagged variables.
Example:
     # Shift the data one day forward
     shifted_data = df.shift(1, freq='D')
     print(shifted_data)
Output
                value
timestamp
2023-01-02           10
2023-01-03           15
2023-01-04           12
2023-01-05           18
Rolling Windows: Rolling windows involve calculating statistics over a moving window of
data. It allows for analyzing trends or patterns in a time-series by considering a fixed-size
window of observations.
Example:
        # Calculate the rolling average over a 2-day window
        rolling_avg = df['value'].rolling(window=2).mean()
        print(rolling_avg)
Output
timestamp
2023-01-01     NaN
2023-01-02    12.5
2023-01-03    13.5
2023-01-04    15.0
Name: value, dtype: float64
Grouping and Aggregation: Grouping and aggregation operations involve grouping the
time-series data based on specific time periods (e.g., days, weeks, months) and performing
calculations on each group, such as calculating the sum, mean, or maximum value.
Example:
     # Calculate the sum of values for each month
     monthly_sum = df.groupby(pd.Grouper(freq='M')).sum()
     print(monthly_sum)
Output
                value
timestamp
2023-01-31          55
Bar Charts
Bar charts, also known as bar graphs or column charts, are a type of graph that uses
rectangular bars to represent data. They are widely used for visualizing categorical or
discrete data, where each category is represented by a separate bar. Bar charts are effective
in displaying comparisons between different categories or showing the distribution of a
single variable across different groups. The length of each bar is proportional to the value
of the variable for the category it represents.
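A minimal matplotlib sketch of a bar chart over made-up categories and values:
import matplotlib.pyplot as plt

# Hypothetical categories and values (assumed for illustration)
categories = ['A', 'B', 'C', 'D']
values = [23, 45, 12, 30]

plt.bar(categories, values, color='steelblue')
plt.xlabel('Category')
plt.ylabel('Value')
plt.title('Simple Bar Chart')
plt.show()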
Gantt chart
A Gantt chart is a type of bar chart that is commonly used in project management to
visually represent project schedules and tasks over time. It provides a graphical
representation of the project timeline, showing the start and end dates of tasks, as well as
their duration and dependencies.
The key features of a Gantt chart are as follows:
    • Task Bars: Each task is represented by a horizontal bar on the chart. The length of
       the bar indicates the duration of the task, and its position on the chart indicates the
       start and end dates.
    • Timeline: The horizontal axis of the chart represents the project timeline, typically
       displayed in increments of days, weeks, or months. It allows for easy visualization of
       the project duration and scheduling.
    • Dependencies: Gantt charts often include arrows or lines between tasks to represent
       dependencies or relationships between them. This helps to visualize the order in
       which tasks need to be completed and identify any critical paths or potential
       bottlenecks.
    • Milestones: Milestones are significant events or achievements within a project. They
       are typically represented by diamond-shaped markers on the chart to indicate
       important deadlines or deliverables.
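A simple Gantt-style chart can be sketched with matplotlib's barh; the tasks, start dates, and durations below are hypothetical:
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from datetime import datetime

# Hypothetical tasks with start dates and durations (in days)
tasks = ['Planning', 'Design', 'Development', 'Testing']
starts = [datetime(2023, 1, 1), datetime(2023, 1, 8),
          datetime(2023, 1, 15), datetime(2023, 2, 5)]
durations = [7, 10, 21, 10]

fig, ax = plt.subplots(figsize=(8, 3))
# Draw one horizontal bar per task, positioned at its start date
ax.barh(tasks, durations, left=mdates.date2num(starts), color='skyblue')
ax.xaxis_date()  # interpret the x-axis values as dates
ax.set_xlabel('Date')
ax.set_title('Simple Gantt Chart')
plt.tight_layout()
plt.show()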
Output
Stream graph
A stream graph is a variation of a stacked area chart that displays changes in data over time
of different categories through the use of flowing, organic shapes that create an aesthetic
river/stream appearance. Unlike the stacked area chart, which plots data over a fixed,
straight axis, the stream plot has values displaced around a varying central baseline.
Each individual stream shape in the stream graph is proportional to the values of its
category. Color can be used either to distinguish each category or to visualize each
category's additional quantitative values by varying the color shade.
Making a Stream Graph with Python
For this example we will use Altair, a declarative statistical visualization library for Python
based on Vega and Vega-Lite. The source code is available on GitHub.
To begin creating our stream graph, we will need to first install Altair and vega_datasets.
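Both packages can be installed from the command line with pip:
pip install altair vega_datasets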
Now, let’s use altair and the vega datasets to create an interactive stream graph looking at
unemployment data across a series of 10 years of time across multiple industries.
import altair as alt
from vega_datasets import data

source = data.unemployment_across_industries.url

alt.Chart(source).mark_area().encode(
    alt.X('yearmonth(date):T',
        axis=alt.Axis(format='%Y', domain=False, tickSize=0)
    ),
    alt.Y('sum(count):Q', stack='center', axis=None),
    alt.Color('series:N',
        scale=alt.Scale(scheme='category20b')
    )
).interactive()
Output
Heat map
       A heat map is a graphical representation of data where individual values are
represented as colors. It is typically used to visualize the density or intensity of a particular
phenomenon over a geographic area or a grid of cells.
       In a heat map, each data point is assigned a color based on its value or frequency.
Typically, a gradient of colors is used, ranging from cooler colors (such as blue or green) to
warmer colors (such as yellow or red). The colors indicate the magnitude of the data, with
darker or more intense colors representing higher values and lighter or less intense colors
representing lower values.
       Heat maps are commonly used in various fields, including data analysis, statistics,
finance, marketing, and geographic information systems (GIS). They can provide insights
into patterns, trends, or anomalies in the data by visually highlighting areas of higher or
lower concentration.
import numpy as np
import matplotlib.pyplot as plt
# Generate random data
data = np.random.rand(10, 10)
# Create heatmap
plt.imshow(data, cmap='hot', interpolation='nearest')
# Add color bar
plt.colorbar()
# Show the plot
plt.show()
Output
Grouping
Grouping time series data involves dividing it into distinct groups based on certain criteria.
This grouping can be useful for performing calculations, aggregations, or analyses on
specific subsets of the data. In Python, you can use the pandas library to perform grouping
operations on time series data. Here's an example of how to group time series data using
pandas:
Simple Python program to explain Grouping
import pandas as pd
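# A minimal sketch (assumed data) that reproduces the output shown below:
# values 0-99 with a datetime index and alternating categories 'A' and 'B'
dates = pd.date_range('2022-01-01', periods=100, freq='D')
df = pd.DataFrame({'value': range(100), 'category': ['A', 'B'] * 50}, index=dates)

# Group the rows by category and sum the values within each group
grouped_df = df.groupby('category').sum()
print("Grouped DataFrame (by category):")
print(grouped_df)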
Output
Grouped DataFrame (by category):
          value
category
A          2450
B          2500
In this example, we first create a sample DataFrame df with a datetime index. The
DataFrame contains a 'value' column ranging from 0 to 99 and a 'category' column with
two distinct categories 'A' and 'B'.
Resampling
      Resampling time series data involves grouping the data into different time intervals
and aggregating or summarizing the values within each interval. This process is useful
when you want to change the frequency or granularity of the data or when you need to
perform calculations over specific time intervals. There are two common methods for
resampling time series data: upsampling and downsampling.
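Downsampling: Downsampling involves decreasing the frequency of the data by aggregating values over larger time intervals, for example converting daily observations to monthly averages. A minimal sketch using the resample() function in pandas; the 100 daily values from 0 to 99 are assumed, matching the sample data described at the end of this section:
import pandas as pd

# Hypothetical daily data: 100 values from 0 to 99 (assumed for illustration)
dates = pd.date_range('2022-01-01', periods=100, freq='D', name='date')
df = pd.DataFrame({'value': range(100)}, index=dates)

# Downsample to monthly frequency, taking the mean within each month
downsampled_df = df.resample('M').mean()
print("Downsampled DataFrame:")
print(downsampled_df.head())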
In this example, the data is being downsampled to monthly frequency, and the mean value
within each month is calculated.
Upsampling: Upsampling involves increasing the frequency of the data by grouping it into
smaller time intervals. This may require filling in missing values or interpolating to
estimate values within the new intervals. Some common upsampling methods include:
    • Forward/Backward Filling: Propagate the last known value forward or backward to
       fill missing values within each interval.
    • Interpolation: Use interpolation methods like linear, polynomial, or spline
       interpolation to estimate values within each interval.
    • Resample Method: Utilize specialized resampling methods to estimate values within
       each interval.
Here's an example of upsampling time series data using the resample() function in pandas:
import pandas as pd
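# A minimal sketch (assumed): reuse the daily DataFrame from the downsampling
# example above and upsample it to hourly frequency, interpolating the new points
upsampled_df = df.resample('H').interpolate()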
print("\nUpsampled DataFrame:")
print(upsampled_df.head())
In this example, the data is being upsampled to hourly frequency, and the missing values are
interpolated using the interpolate() function.
Output
Downsampled DataFrame:
            value
date
2022-01-31       15.0
2022-02-28       44.5
2022-03-31       74.0
2022-04-30       94.5
Upsampled DataFrame:
                                value
date
2022-01-01 00:00:00      0.000000
2022-01-01 01:00:00      0.041667
2022-01-01 02:00:00      0.083333
2022-01-01 03:00:00      0.125000
2022-01-01 04:00:00      0.166667
In this example, we first create a sample DataFrame df with a datetime index. The
DataFrame contains a 'value' column ranging from 0 to 99, with a daily frequency for the
'date' index.