Introduction to Statsmodels
NumPy and SciPy: These libraries provide basic statistical functions and
tools for numerical computing. However, Statsmodels offers more
advanced tools for estimating and interpreting statistical models.
Patsy: Patsy is primarily used for describing statistical models and building
design matrices, but it lacks the modeling and testing capabilities of
Statsmodels.
• Example: Predicting house prices based on features like size and location.
Time Series Models: Designed for data collected over time, such as stock
prices or weather data. These models account for temporal dependencies.
Key Concepts
Parameters: These are the coefficients in a statistical model that define the
relationship between the variables. For example, in a linear regression
model ( y = \beta_0 + \beta_1 x + \epsilon ), ( \beta_0 ) (intercept) and ( \beta_1 )
(slope) are parameters.
Estimates: These are the values of the parameters calculated from the
data. They are used to make predictions or inferences about the population
from which the data was drawn.
Residuals: Residuals are the differences between the observed values and
the values predicted by the model. They are crucial for assessing the fit of
the model. A good model will have residuals that are randomly distributed
with no clear pattern.
Example: Linear Regression with Statsmodels
Code Snippet
import statsmodels.api as sm
import numpy as np
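The body of the snippet appears truncated here; below is a minimal sketch that completes it on simulated data (the true coefficients 2 and 3 are illustrative):

np.random.seed(0)

# Simulate data from y = 2 + 3x + noise
x = np.random.rand(100)
y = 2 + 3 * x + np.random.normal(0, 0.5, 100)

# Add an intercept column and fit ordinary least squares
X = sm.add_constant(x)
model = sm.OLS(y, X).fit()

# Estimates, standard errors, t-statistics, and p-values
print(model.summary())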
Explanation
In this example, we simulate data from a known linear relationship, add an
intercept with sm.add_constant, and estimate the parameters with sm.OLS. The
summary table reports the estimates, confidence intervals, and p-values.
This example demonstrates how Statsmodels can be used to estimate a model,
obtain parameter estimates, and perform hypothesis tests, all essential steps
in statistical analysis.
• Hint: Think about the true value versus the value calculated from data.
3. Using the linear regression example provided, interpret the p-value for the
slope coefficient.
Introduction
Data preparation and exploration are foundational steps in statistical analysis
and modeling. Preparing data involves loading it from various sources, cleaning
it by addressing missing values and inconsistencies, and transforming it to suit
analytical needs. Exploratory Data Analysis (EDA) allows us to summarize and
visualize the data, revealing its patterns, distributions, and potential issues like
outliers. These steps ensure the data is reliable and well-understood before
applying statistical models, such as those in Statsmodels.
In this module, we’ll explore techniques for loading and manipulating data,
followed by methods for conducting EDA, using Python libraries like Pandas,
Statsmodels, Matplotlib, and Seaborn. We’ll use the Iris dataset from
Statsmodels for consistent examples.
import pandas as pd

# Load data from a CSV file
data = pd.read_csv('path/to/your/file.csv')

# Load data from an Excel file (requires an engine such as openpyxl)
data = pd.read_excel('path/to/your/file.xlsx')
For our examples, we’ll use the iris dataset from Statsmodels:
import statsmodels.api as sm

# Fetch the classic iris data from the Rdatasets repository
iris = sm.datasets.get_rdataset('iris').data
The Iris dataset includes sepal and petal measurements (columns Sepal.Length,
Sepal.Width, Petal.Length, Petal.Width, and Species) for three iris flower
species, providing a rich dataset for demonstration.
Identifying Missing Values: Check for missing data with isnull().
# Count missing values in each column
missing_values = iris.isnull().sum()
print(missing_values)
• Dropping Missing Values: Remove rows with missing data if they’re minimal.
iris_clean = iris.dropna()
• Imputing Missing Values: Replace missing numerical values with the mean,
median, or mode.
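A minimal sketch of mean imputation (illustrative, since the Iris data has no missing values):

# Replace missing values in numeric columns with the column mean
numeric_cols = iris.select_dtypes(include='number').columns
iris[numeric_cols] = iris[numeric_cols].fillna(iris[numeric_cols].mean())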
# Remove duplicate rows
iris = iris.drop_duplicates()

# Convert the species column to a categorical type
iris['Species'] = iris['Species'].astype('category')

# Log-transform a skewed numeric feature
import numpy as np
iris['log_sepal_length'] = np.log(iris['Sepal.Length'])
• Generating Polynomial Features: Add polynomial terms for non-linear
relationships.
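For example, a squared term can capture curvature (the derived column name below is just illustrative):

# Add a quadratic term for sepal length
iris['sepal_length_sq'] = iris['Sepal.Length'] ** 2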
These techniques prepare the data for more accurate statistical modeling.
Summary Statistics
Summary statistics provide insights into data’s central tendencies and spread:
summary = iris.describe()
print(summary)
This outputs count, mean, standard deviation, min, max, and quartiles.
Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Histogram of sepal length with a kernel density overlay
sns.histplot(iris['Sepal.Length'], kde=True)
plt.title('Distribution of Sepal Length')
plt.show()
# Correlation matrix of the numeric columns
correlation_matrix = iris.corr(numeric_only=True)
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
Visual Methods: Use plots to spot unusual values.
sns.boxplot(x=iris['Sepal.Width'])
plt.title('Box Plot of Sepal Width')
plt.show()
# Flag points more than 1.5 IQRs outside the quartiles
Q1 = iris['Sepal.Width'].quantile(0.25)
Q3 = iris['Sepal.Width'].quantile(0.75)
IQR = Q3 - Q1
outliers = iris[(iris['Sepal.Width'] < (Q1 - 1.5 * IQR)) | (iris['Sepal.Width'] > (Q3 + 1.5 * IQR))]
print(outliers)
Decide whether to remove or adjust outliers based on their impact and context.
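If removal is warranted, a one-line sketch reusing the outliers flagged above:

# Drop the flagged rows (only when justified by the analysis context)
iris_no_outliers = iris.drop(outliers.index)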
Summary
This module covered critical steps in data preparation and exploration. In
Lecture 2.1, we learned to load data from CSV, Excel, and DataFrames, clean it
by handling missing values and duplicates, and transform it through feature
engineering. In Lecture 2.2, we explored EDA with summary statistics,
visualizations, distribution analysis, and outlier detection. Using the Iris dataset,
we demonstrated these concepts with practical Python code, leveraging
Statsmodels, Pandas, Matplotlib, and Seaborn.
Introduction
Linear regression is a foundational statistical technique used to model the
relationship between a dependent variable and one or more independent
variables. It is widely applied in prediction, hypothesis testing, and
understanding variable relationships. This module explores two key types of
linear regression—Simple Linear Regression and Multiple Linear Regression—
using the statsmodels library in Python.
We will use the Boston Housing dataset, which includes variables such as
median home value (medv), crime rate (crim), and average number of rooms
(rm), to illustrate the concepts.
The simple linear regression model is ( y = \beta_0 + \beta_1 x + \epsilon ), where:
( y ): Dependent variable
( x ): Independent variable
( \beta_0 ), ( \beta_1 ): Intercept and slope parameters
( \epsilon ): Random error term
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Load the Boston Housing dataset from the MASS R package
boston = sm.datasets.get_rdataset('Boston', 'MASS').data

# Specify and fit the model: medv ~ crim
model = smf.ols('medv ~ crim', data=boston).fit()

# View results
print(model.summary())
Residual Plots
# Predicted values and residuals
predictions = model.predict(boston)
residuals = boston['medv'] - predictions

# Plot residuals against fitted values; a patternless cloud suggests a good fit
import matplotlib.pyplot as plt
plt.scatter(predictions, residuals)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Predicted medv')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.show()
Multiple linear regression extends this model to
( y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \epsilon ),
where ( x_1, x_2, \dots, x_k ) are independent variables.
Handling Categorical Variables and Interaction Terms
Categorical Variables
Categorical variables are included as dummy variables. In statsmodels, use C()
in the formula. For example, chas (1 if tract bounds the Charles River, 0
otherwise) is included as C(chas).
Interaction Terms
Interaction terms model how the effect of one variable depends on another. For
example, crim:chas tests if the effect of crim on medv varies by chas.
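A brief sketch of the formula syntax (assuming boston is loaded as in the earlier examples):

# Interaction between crim and the Charles River dummy; C() marks chas as categorical
interaction_model = smf.ols('medv ~ crim + C(chas) + crim:C(chas)', data=boston).fit()
print(interaction_model.summary())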
Example Model
Let’s model medv with crim, rm, and chas:
# Specify and fit the model; C(chas) enters chas as a categorical dummy
multi_model = smf.ols('medv ~ crim + rm + C(chas)', data=boston).fit()

# View results
print(multi_model.summary())
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Prepare data
X = boston[['crim', 'rm', 'chas']]
X = sm.add_constant(X)

# Calculate VIF for each column (values above ~10 suggest strong multicollinearity)
vif = pd.DataFrame()
vif['Variable'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif)
# Data in array form (an alternative to the formula interface)
X = sm.add_constant(boston[['crim', 'rm', 'chas']])
y = boston['medv']
array_model = sm.OLS(y, X).fit()
print(array_model.summary())
Summary
This module covered:
Lecture 3.1: Simple linear regression, model fitting with smf.ols, and residual diagnostics.
Lecture 3.2: Multiple linear regression, categorical variables and interaction terms, and multicollinearity checks with VIF.
Introduction
Statistical inference enables us to draw conclusions about a population from
sample data. In this module, we explore hypothesis testing and confidence
intervals to evaluate relationships in linear regression, and model comparison
techniques to select the best model. We’ll use the Boston Housing dataset
(available in statsmodels), which includes variables like median home value
(medv), crime rate (crim), and average number of rooms (rm), to illustrate these
concepts.
Understanding Hypothesis Testing and Confidence Intervals in the Context of Linear Regression
In linear regression, hypothesis testing assesses whether an independent
variable significantly affects the dependent variable. For each coefficient
( \beta_i ), we test the null hypothesis ( H_0: \beta_i = 0 ) against the
alternative ( H_1: \beta_i \neq 0 ).
The p-value indicates the probability of observing the data if ( H_0 ) is true. A p-
value < 0.05 typically leads to rejecting ( H_0 ), suggesting a significant
relationship.
Confidence intervals (CIs) provide a range within which the true coefficient
likely lies, with a specified confidence level (e.g., 95%). If the CI excludes zero,
the coefficient is significant.
Using `statsmodels` to Perform Hypothesis Tests and Construct Confidence Intervals
Let’s fit a simple linear regression model using medv as the dependent variable
and crim as the predictor:
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Load the data and fit the simple regression
boston = sm.datasets.get_rdataset('Boston', 'MASS').data
model = smf.ols('medv ~ crim', data=boston).fit()
print(model.summary())
In the output:
The 95% CI for crim (under [0.025 0.975]) shows the range of plausible
values for ( \beta_{crim} ).
# 95% confidence intervals for all coefficients
conf_int = model.conf_int(alpha=0.05)
print(conf_int)
Hypothesis Testing: If the p-value for crim is 0.001, we reject ( H_0 ) and
conclude that crim has a statistically significant effect on medv.
To choose the best regression model, we compare candidates using key metrics:
R-squared and Adjusted R-squared: Measure the share of variance explained; the adjusted version penalizes additional predictors.
Akaike Information Criterion (AIC): Balances fit and complexity. Lower AIC suggests a better model.
Bayesian Information Criterion (BIC): Like AIC, but with a stronger complexity penalty. Lower BIC suggests a better model.
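A quick sketch comparing two candidate models by AIC (assuming boston is loaded as above):

# The model with the lower AIC is preferred, all else equal
model1 = smf.ols('medv ~ crim', data=boston).fit()
model2 = smf.ols('medv ~ crim + rm', data=boston).fit()
print(f"Model 1 AIC: {model1.aic:.1f}")
print(f"Model 2 AIC: {model2.aic:.1f}")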
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Data
X = boston[['crim', 'rm', 'chas']]
y = boston['medv']

# Fit Model 1 with 5-fold cross-validation
lr = LinearRegression()
mse = cross_val_score(lr, X, y, cv=5, scoring='neg_mean_squared_error')
print(f"Mean MSE: {-mse.mean():.2f}")
A lower mean squared error (MSE) indicates better predictive accuracy.
Avoiding Overfitting and Underfitting
Overfitting: The model is too complex, fitting noise instead of the true
pattern. It performs well on training data but poorly on new data.
Underfitting: The model is too simple to capture the underlying pattern, so it
performs poorly on both training and new data.
To balance these, compare candidate models with AIC/BIC, validate performance
with cross-validation, and prefer the simplest model that explains the data well.
Summary
This module covered:
Lecture 4.1: Hypothesis testing and confidence intervals for regression
coefficients using statsmodels.
Lecture 4.2: Comparing models with metrics (R-squared, AIC, BIC) and
selecting the best one using techniques like stepwise regression and cross-
validation, while avoiding overfitting and underfitting.
With these skills, you can confidently analyze regression models using
statsmodels and make data-driven decisions!
Introduction
Time series analysis is essential for understanding and forecasting data
collected over time, such as stock prices, weather patterns, or sales figures.
This module introduces the core concepts of time series data and
demonstrates how to model and forecast it using statsmodels. We’ll use the
AirPassengers dataset, which records monthly airline passenger numbers
from 1949 to 1960, to illustrate key techniques.
Lecture 5.1: Introduction to Time Series Analysis
Understanding Time Series Data and Its Characteristics
A time series is a sequence of data points recorded at regular time intervals.
Time series data often exhibits:
Trend: a long-run increase or decrease in the level of the series.
Seasonality: patterns that repeat over a fixed period, such as every 12 months.
Noise: irregular, random fluctuations.
The AirPassengers dataset, for example, shows both an upward trend and
seasonal fluctuations.
Basic Concepts: Stationarity, Autocorrelation, and Partial Autocorrelation
A series is stationary when its mean, variance, and autocorrelation structure
do not change over time; models such as ARIMA assume the series is
stationary. Autocorrelation measures how a series correlates with lagged
copies of itself, and partial autocorrelation measures that correlation after
removing the effect of shorter lags.
Time Series Plot: Displays the data over time to reveal trends and
seasonality.
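A minimal sketch that loads the dataset and draws this plot:

import statsmodels.api as sm
import matplotlib.pyplot as plt

# Load the AirPassengers dataset (monthly totals, 1949-1960)
air_passengers = sm.datasets.get_rdataset('AirPassengers').data

# Plot the series over time; the trend and seasonal swings are visible
air_passengers.plot(x='time', y='value', legend=False)
plt.title('Monthly Airline Passengers')
plt.ylabel('Passengers (thousands)')
plt.show()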
import statsmodels.api as sm
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# PACF plot: helps choose the autoregressive (AR) order
plot_pacf(air_passengers['value'], lags=40)
plt.title('Partial Autocorrelation Function (PACF)')
plt.show()
These plots help identify the appropriate model and its parameters.
ARIMA (AutoRegressive Integrated Moving Average): Suitable for non-
seasonal time series. It has three components:
p: the autoregressive (AR) order,
d: the degree of differencing, and
q: the moving average (MA) order.
Use the ADF test to check if the series is stationary. If not, apply differencing.
from statsmodels.tsa.stattools import adfuller

# ADF test: the null hypothesis is that the series is non-stationary
result = adfuller(air_passengers['value'])
print(f'p-value: {result[1]}')

# First-order differencing to remove the trend (the first value will be NaN)
air_passengers['diff_value'] = air_passengers['value'].diff()
For seasonal data, look for patterns at seasonal lags (e.g., every 12
months).
For ARIMA, specify the orders (p, d, q). For SARIMAX, include seasonal orders
(P, D, Q, s).
from statsmodels.tsa.arima.model import ARIMA

# Example: ARIMA(1,1,1)
arima_model = ARIMA(air_passengers['value'], order=(1, 1, 1)).fit()
print(arima_model.summary())
For seasonal data like AirPassengers, use SARIMAX with seasonal parameters:
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Example: SARIMAX(1,1,1)(1,1,1,12)
sarimax_model = SARIMAX(air_passengers['value'], order=(1, 1, 1),
                        seasonal_order=(1, 1, 1, 12)).fit()
print(sarimax_model.summary())
# Residuals plot: residuals of a well-specified model should resemble white noise
residuals = sarimax_model.resid
plot_acf(residuals, lags=40)
plt.title('ACF of Residuals')
plt.show()
Mean Absolute Error (MAE): Average absolute difference between
forecasts and actual values.
# Forecast the next 12 months
forecast = sarimax_model.forecast(steps=12)
print(forecast)
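A hedged sketch of computing MAE on a 12-month holdout (refitting the SARIMAX specification on the training portion):

import numpy as np

# Hold out the final year for evaluation
train = air_passengers['value'][:-12]
test = air_passengers['value'][-12:]

# Refit on the training data and forecast the holdout period
eval_model = SARIMAX(train, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12)).fit()
holdout_forecast = eval_model.forecast(steps=12)

mae = np.mean(np.abs(holdout_forecast.values - test.values))
print(f"MAE: {mae:.2f}")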
Summary
This module covered:
Lecture 5.1: Time series characteristics (trend, seasonality), stationarity,
and ACF/PACF diagnostics.
Lecture 5.2: How to specify, estimate, and diagnose ARIMA and SARIMAX
models in statsmodels, and evaluate their performance using MAE and
MSE.
With these skills, you can analyze and forecast time series data effectively
using statsmodels!
Module 6: Advanced Topics and Case Studies
Introduction
This module dives into advanced linear regression techniques and practical
applications of statsmodels. In Lecture 6.1, we cover methods to handle
complex data scenarios like unequal variances, correlated errors, and outliers.
In Lecture 6.2, we explore case studies from finance, economics, and social
sciences to illustrate how statsmodels solves real-world problems. The Boston
Housing dataset is used for the advanced techniques, while diverse datasets
highlight the case studies.
Example: Using the Boston Housing dataset, suppose variance increases with
crime rate (crim). We weight observations inversely to crim.
import statsmodels.api as sm
import statsmodels.formula.api as smf

boston = sm.datasets.get_rdataset('Boston', 'MASS').data

# Weight observations inversely to the crime rate (the assumed variance
# structure); the model specification itself is illustrative
weights = 1.0 / boston['crim']
wls_model = smf.wls('medv ~ crim + rm', data=boston, weights=weights).fit()
print(wls_model.summary())
Example: For the Boston dataset, assume correlated errors among nearby
towns (simplified here with an identity covariance matrix).
import numpy as np

# Illustrative predictors; the covariance matrix is simplified to the identity
X = sm.add_constant(boston[['crim', 'rm']])
y = boston['medv']
cov_matrix = np.eye(len(boston))

gls_model = sm.GLS(y, X, sigma=cov_matrix).fit()
print(gls_model.summary())
Note: Real applications require estimating the covariance matrix based on
data structure (e.g., autocorrelation).
X = sm.add_constant(boston[['crim', 'rm']])
y = boston['medv']

# Robust regression with Huber's T norm, which downweights outlying observations
robust_model = sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()
print(robust_model.summary())
Outlier Detection:
Outlier detection identifies anomalous points that may distort models.
Techniques include residual analysis and influence measures like Cook’s
distance.
Example: Detect outliers in the Boston dataset using Cook’s distance.
# Fit OLS and compute Cook's distance for each observation
ols_model = sm.OLS(y, X).fit()
influence = ols_model.get_influence()
cooks_d = influence.cooks_distance[0]

# A common heuristic flags points with Cook's distance above 4/n
print(boston[cooks_d > 4 / len(boston)])
Dataset: Hypothetical stock data (e.g., replace with Yahoo Finance data).
Example (a minimal sketch on simulated prices, since the dataset here is hypothetical):

import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Simulated daily closing prices as a placeholder for real market data
np.random.seed(0)
prices = pd.Series(100 + np.cumsum(np.random.normal(0, 1, 250)))

arima_model = ARIMA(prices, order=(1, 1, 1)).fit()
print(arima_model.summary())
Dataset: Macroeconomic data (e.g., from public sources).
Example (a sketch using statsmodels' built-in macrodata; the model choice is illustrative):

# Regress real consumption on real GDP
macro = sm.datasets.macrodata.load_pandas().data
econ_model = smf.ols('realcons ~ realgdp', data=macro).fit()
print(econ_model.summary())
Example (a sketch on hypothetical survey data, using logistic regression):

import numpy as np
import pandas as pd

# Hypothetical survey: whether respondents voted, by age
np.random.seed(1)
survey = pd.DataFrame({
    'voted': np.random.binomial(1, 0.6, 200),
    'age': np.random.randint(18, 80, 200),
})
survey_model = smf.logit('voted ~ age', data=survey).fit()
print(survey_model.summary())
Model Selection: Validate assumptions (e.g., normality, homoscedasticity).
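For instance, homoscedasticity can be checked with a Breusch-Pagan test on a fitted OLS model (a sketch, reusing ols_model from the outlier example above):

from statsmodels.stats.diagnostic import het_breuschpagan

# Breusch-Pagan test: a small p-value suggests heteroscedastic errors
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(ols_model.resid, ols_model.model.exog)
print(f"Breusch-Pagan p-value: {lm_pvalue:.3f}")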
Common Pitfalls
Watch for issues highlighted throughout the course: multicollinearity among
predictors, overfitting overly complex models, applying stationary models to
non-stationary series, and letting influential outliers drive the fit.
Summary
Lecture 6.1: Covered WLS, GLS, robust regression, and outlier detection
using statsmodels, with examples from the Boston Housing dataset.
Lecture 6.2: Applied statsmodels in case studies drawn from finance,
economics, and the social sciences.
This module equips you with advanced tools and practical knowledge to apply
statsmodels effectively in diverse, real-world scenarios.
The guided project is your chance to apply what you’ve learned to a dataset
and problem of your choosing. Using statsmodels, you’ll perform statistical
analysis and modeling, following these steps:
1. Choose a Dataset
Select a dataset that interests you—either from previous modules (e.g.,
Boston Housing) or a public source like Kaggle. Ensure it aligns with the
problem you want to solve.
Example: Imagine using the Boston Housing dataset to predict median home
values. You’d explore variables like crime rate and room count, build a linear
regression model, evaluate its fit, and interpret how each factor affects prices.
Applying Concepts and Techniques Learned Throughout the Course
Data Exploration: Techniques from Module 2 for cleaning and visualizing
data.
Modeling: Linear regression (Module 3), inference (Module 4), and time
series (Module 5).
File Structure: Organize files into folders (e.g., data/, scripts/, docs/).
Results: Share findings with visuals (e.g., plots) and stats (e.g., model
summaries).
Receiving Feedback:
Use feedback to refine your project.
Data Prep & Exploration: Cleaning and analyzing data sets the stage for
modeling.
These skills equip you to solve real-world problems with statistical rigor.
Resources for Further Learning and Professional Development
Continue your growth with these resources:
Books:
Online Courses:
Communities:
Conclusion
Module 7 ties together your learning journey. In Lecture 7.1, you’ll create a
project that demonstrates your skills with statsmodels, organized and
documented to a high standard. In Lecture 7.2, you’ll present your work, reflect
on the course, and gain resources to keep advancing. Congratulations on
reaching this point—you’re ready to apply statistical modeling to new
challenges!