INTRODUCTION TO PREDICTIVE ANALYTICS
UNIT-1
Introduction: Overview of Predictive Analytics, Setting Up the Problem- Predictive Analytics
Processing Steps: CRISP-DM, Defining Data for Predictive Modeling, Defining the Target
Variable, Defining Measures of Success for Predictive Models, Doing Predictive Modeling Out of
Order
1.1 Introduction to Predictive Analytics
Predictive Analytics is a branch of advanced analytics that uses historical data, statistical
algorithms, and machine learning techniques to estimate the likelihood of future outcomes.
Key Objectives
Forecast future trends or behaviors
Inform decision-making
Optimize operations
Identify risks and opportunities
Examples
Finance: Credit scoring, fraud detection
Marketing: Customer churn prediction, targeted advertising
Healthcare: Disease outbreak prediction, patient readmission risk
Supply Chain: Demand forecasting, inventory optimization
Core Components
1. Data Collection – Acquiring structured or unstructured historical data
2. Data Preparation – Cleaning, transforming, and feature engineering
3. Model Selection – Choosing appropriate algorithms (e.g., regression, decision trees,
neural networks)
4. Model Training – Fitting the model to historical data
5. Evaluation – Measuring performance using metrics like RMSE, accuracy, AUC
6. Deployment – Integrating the model into decision-making systems
7. Monitoring – Tracking model performance over time
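A minimal sketch in Python with scikit-learn showing how these components fit together end to end; the file customers.csv and the churned column are hypothetical placeholders, not a prescribed dataset:

    # Minimal predictive-analytics pipeline sketch (hypothetical file and columns)
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    df = pd.read_csv("customers.csv")           # 1-2. collect and prepare data
    X = df.drop(columns=["churned"])            # input features
    y = df["churned"]                           # target variable

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)   # hold out data for evaluation

    model = LogisticRegression(max_iter=1000)   # 3. model selection
    model.fit(X_train, y_train)                 # 4. model training

    pred = model.predict(X_test)                # 5. evaluation
    print("Accuracy:", accuracy_score(y_test, pred))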
1.2 Setting Up the Predictive Analytics Problem
1. Define the Objective
Clarify the business or research goal:
"What are we trying to predict?"
Example: Predict customer churn within 3 months.
2. Identify the Target Variable (Label)
This is the output the model will predict.
Continuous → Regression (e.g., sales forecast)
Categorical → Classification (e.g., churn yes/no)
3. Gather and Explore the Data
Collect relevant historical data
Use exploratory data analysis (EDA) to detect:
o Trends and patterns
o Outliers or anomalies
o Missing values
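A short pandas sketch of these EDA checks, assuming a hypothetical customers.csv with a numeric monthly_spend column:

    import pandas as pd

    df = pd.read_csv("customers.csv")   # hypothetical file

    print(df.describe())                # trends: central tendency and spread
    print(df.isna().sum())              # missing values per column

    # Flag outliers in a numeric column with the 1.5 * IQR rule
    q1, q3 = df["monthly_spend"].quantile([0.25, 0.75])
    iqr = q3 - q1
    outliers = df[(df["monthly_spend"] < q1 - 1.5 * iqr) |
                  (df["monthly_spend"] > q3 + 1.5 * iqr)]
    print(len(outliers), "potential outliers")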
4. Feature Selection and Engineering
Select or create input variables (features) that best explain or relate to the target.
Techniques include:
o One-hot encoding
o Binning
o Normalization
o Interaction terms
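A brief sketch of these four techniques with pandas and scikit-learn; the columns are invented for illustration:

    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler

    df = pd.DataFrame({
        "plan": ["basic", "premium", "basic"],
        "age": [23, 54, 37],
        "usage": [10.0, 2.5, 7.0],
    })

    df = pd.get_dummies(df, columns=["plan"])                 # one-hot encoding
    df["age_band"] = pd.cut(df["age"], bins=[0, 30, 50, 120],
                            labels=["young", "mid", "senior"])  # binning
    df[["usage"]] = MinMaxScaler().fit_transform(df[["usage"]])  # normalization
    df["age_x_usage"] = df["age"] * df["usage"]               # interaction term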
5. Choose a Modeling Approach
Match the problem type with an appropriate model:
Regression: Linear Regression, Random Forest Regressor
Classification: Logistic Regression, SVM, Decision Trees
Time Series: ARIMA, Prophet, LSTM
6. Define Success Criteria
Establish evaluation metrics to assess performance:
Classification: Accuracy, Precision, Recall, F1-score, ROC-AUC
Regression: RMSE, MAE, R²
Time Series: MAPE, RMSE, SMAPE
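Most of these metrics are available directly in scikit-learn; a small sketch with toy labels and values:

    from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                                 f1_score, mean_squared_error,
                                 mean_absolute_error, r2_score)

    # Classification metrics on toy labels
    y_true, y_pred = [1, 0, 1, 1], [1, 0, 0, 1]
    print(accuracy_score(y_true, y_pred), precision_score(y_true, y_pred),
          recall_score(y_true, y_pred), f1_score(y_true, y_pred))

    # Regression metrics on toy values (RMSE = square root of MSE)
    y_t, y_p = [3.0, 5.0, 2.0], [2.5, 5.5, 2.0]
    print(mean_squared_error(y_t, y_p) ** 0.5,   # RMSE
          mean_absolute_error(y_t, y_p),         # MAE
          r2_score(y_t, y_p))                    # R²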
7. Validation Strategy
Split data to test generalizability:
Train/Test Split
K-Fold Cross-Validation
Time-based Validation (for time series)
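For example, 5-fold cross-validation with scikit-learn, using its built-in breast-cancer demo dataset so the sketch runs as-is:

    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = load_breast_cancer(return_X_y=True)   # built-in demo dataset
    model = LogisticRegression(max_iter=5000)

    # 5-fold CV: five train/test rotations, one accuracy score per fold
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(scores.mean(), scores.std())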
Summary
Predictive Analytics: Uses data to predict future outcomes
Main Steps: Data → Features → Model → Evaluate → Deploy
Problem Setup: Define goal → Identify target → Choose features/model → Evaluate
Metrics: Vary by task (accuracy for classification, RMSE for regression)
1.3 Predictive Analytics Processing Steps
Flowchart
[Start]
↓
[Define the Problem]
↓
[Collect Data]
↓
[Prepare Data (Cleaning & Preprocessing)]
↓
[Exploratory Data Analysis (EDA)]
↓
[Feature Selection & Engineering]
↓
[Split the Data (Train/Test/Validation)]
↓
[Select Model]
↓
[Train the Model]
↓
[Evaluate the Model]
↓
If performance is poor → return to [Feature Selection & Engineering]
↓
[Deploy the Model]
↓
[Monitor & Maintain the Model]
↓
[End / Repeat if Needed]
1. Define the Problem
Goal: Translate a business need into a predictive task
What do we want to predict?
How will the prediction be used?
Examples:
Predict loan defaults (classification)
Forecast monthly sales (regression)
Next: You need data that relates to the outcome you're predicting.
2. Collect Data
Goal: Gather historical data containing relevant features and the target variable
Sources may include:
Databases (SQL, NoSQL)
APIs
Spreadsheets
Web scraping
Important: Ensure data quality, coverage, and relevance.
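A brief sketch of two common acquisition paths with pandas; the file names, database, and table here are hypothetical:

    import pandas as pd
    import sqlite3

    # From a spreadsheet/CSV export (hypothetical path)
    df_csv = pd.read_csv("sales_history.csv")

    # From a relational database (hypothetical SQLite file and table)
    conn = sqlite3.connect("warehouse.db")
    df_sql = pd.read_sql("SELECT * FROM orders", conn)
    conn.close()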
Next: Raw data must be cleaned and prepared for analysis.
3. Prepare Data (Preprocessing)
Goal: Clean and format the data for modeling
Tasks include:
Handle missing values
Remove duplicates
Correct inconsistent formats
Convert dates to standard formats
Outcome: Structured, clean dataset ready for exploration.
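A pandas sketch of these cleaning tasks, assuming a hypothetical raw_orders.csv with amount, region, and order_date columns:

    import pandas as pd

    df = pd.read_csv("raw_orders.csv")                         # hypothetical file

    df = df.drop_duplicates()                                  # remove duplicates
    df["amount"] = df["amount"].fillna(df["amount"].median())  # impute missing values
    df["region"] = df["region"].str.strip().str.lower()        # fix inconsistent formats
    df["order_date"] = pd.to_datetime(df["order_date"])        # standardize dates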
Next: Explore patterns, trends, and relationships in the data.
4. Exploratory Data Analysis (EDA)
Goal: Understand the data’s structure, trends, and relationships
Techniques:
Summary statistics (mean, std, median)
Correlation analysis
Visualizations: histograms, boxplots, scatterplots
Use EDA to:
Spot outliers
Detect multicollinearity
Guide feature engineering
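A short sketch of these techniques with pandas and matplotlib, assuming a hypothetical cleaned dataset:

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("orders_clean.csv")   # hypothetical cleaned file

    print(df.describe())                   # summary statistics
    print(df.corr(numeric_only=True))      # pairwise correlations
    # (correlations near ±1 between features hint at multicollinearity)

    df.hist(figsize=(10, 6))               # distribution of each numeric column
    plt.show()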
Next: Choose/create variables to train your model.
5. Feature Selection & Engineering
Goal: Build powerful input variables for the model
Techniques include:
Feature selection (filter, wrapper, embedded)
Encoding (One-Hot, Label Encoding)
Normalization/standardization
Creating interaction or aggregated features
Outcome: A refined feature set that improves model performance.
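As an illustration of filter-style selection, a scikit-learn sketch on its demo dataset that keeps only the features most associated with the target:

    from sklearn.datasets import load_breast_cancer
    from sklearn.feature_selection import SelectKBest, f_classif

    X, y = load_breast_cancer(return_X_y=True)

    # Filter method: score each feature against the target, keep the top 10
    selector = SelectKBest(score_func=f_classif, k=10)
    X_selected = selector.fit_transform(X, y)
    print(X.shape, "->", X_selected.shape)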
Next: Split the dataset to train and test the model properly.
6. Split the Data
Goal: Divide data to avoid overfitting and estimate generalization
Common splits:
Training set: to train the model
Validation set: to tune hyperparameters
Test set: to evaluate final model
Alternative: K-Fold Cross-Validation for smaller datasets
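A sketch of a three-way split done with two calls to scikit-learn's train_test_split; toy arrays stand in for real data:

    import numpy as np
    from sklearn.model_selection import train_test_split

    X, y = np.arange(200).reshape(100, 2), np.arange(100)   # toy data

    # First carve off 20% as the final test set...
    X_tmp, X_test, y_tmp, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)
    # ...then split the remainder into train (75%) and validation (25%)
    X_train, X_val, y_train, y_val = train_test_split(
        X_tmp, y_tmp, test_size=0.25, random_state=42)
    print(len(X_train), len(X_val), len(X_test))   # 60 / 20 / 20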
Next: Choose a model suited to your problem and data.
7. Select Model
Goal: Choose the best algorithm for the task
Examples:
Classification: Logistic Regression, Decision Tree, SVM
Regression: Linear Regression, Random Forest Regressor
Time Series: ARIMA, Prophet, LSTM
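One common way to choose between candidates is cross-validated scoring; a scikit-learn sketch comparing two classifiers on its demo dataset:

    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import cross_val_score

    X, y = load_breast_cancer(return_X_y=True)

    # Score each candidate with 5-fold CV and compare mean accuracy
    for name, model in [("logreg", LogisticRegression(max_iter=5000)),
                        ("tree", DecisionTreeClassifier(max_depth=5))]:
        score = cross_val_score(model, X, y, cv=5).mean()
        print(name, round(score, 3))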
Next: Train the model on the training data.
8. Train the Model
Goal: Fit the selected model to the training data
May involve:
Parameter tuning
Feature scaling
Handling imbalance (e.g., SMOTE)
Result: A trained model ready for evaluation.
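A sketch of training with oversampling for class imbalance; note that SMOTE comes from the third-party imbalanced-learn package, and the dataset here is synthetic:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from imblearn.over_sampling import SMOTE   # requires imbalanced-learn

    # Toy imbalanced dataset: roughly 90% class 0, 10% class 1
    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                               random_state=42)

    # Synthesize minority-class samples, then fit on the balanced data
    X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
    model = LogisticRegression(max_iter=1000).fit(X_res, y_res)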
Next: Evaluate performance using test/validation data.
9. Evaluate the Model
Goal: Measure how well the model performs
Metrics vary by task:
Classification: Accuracy, F1-score, ROC-AUC
Regression: RMSE, MAE, R²
Time Series: MAPE, SMAPE
Outcome: Decide if the model meets the required performance.
If good: move to deployment
If poor: revisit steps 5–8 (feature engineering, model selection, and tuning)
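A scikit-learn sketch that produces several of these classification metrics at once on held-out data:

    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report, roc_auc_score

    X, y = load_breast_cancer(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                              random_state=42)

    model = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
    # Accuracy, precision, recall, and F1 per class in one report
    print(classification_report(y_te, model.predict(X_te)))
    # ROC-AUC needs predicted probabilities, not hard labels
    print("ROC-AUC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))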
10. Deploy the Model
Goal: Use the model in real-world applications
Options include:
API integration
Dashboard interfaces
Embedded in software systems
Result: The model starts making live or batch predictions.
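One lightweight deployment pattern is to persist the trained model and reload it inside the serving application; a sketch using joblib (the model file name is arbitrary):

    import joblib
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression

    X, y = load_breast_cancer(return_X_y=True)
    model = LogisticRegression(max_iter=5000).fit(X, y)

    joblib.dump(model, "churn_model.joblib")   # persist the trained model

    # Inside the serving application (API handler, batch job, etc.):
    loaded = joblib.load("churn_model.joblib")
    print(loaded.predict(X[:5]))               # batch prediction on new rows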
Next: Monitor performance in production.
11. Monitor & Maintain the Model
Goal: Ensure long-term model reliability
Tasks:
Track performance metrics over time
Detect data drift or concept drift
Retrain with new data as needed
Reason: Models degrade over time due to changing patterns.
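A minimal sketch of one possible data-drift check, comparing a feature's training-time distribution with live data via a two-sample Kolmogorov-Smirnov test; the data here is synthetic for illustration:

    import numpy as np
    from scipy.stats import ks_2samp   # two-sample Kolmogorov-Smirnov test

    rng = np.random.default_rng(0)
    train_feature = rng.normal(0.0, 1.0, 5000)   # distribution seen at training
    live_feature = rng.normal(0.4, 1.0, 5000)    # distribution in production

    stat, p_value = ks_2samp(train_feature, live_feature)
    if p_value < 0.01:
        print("Possible data drift detected - consider retraining")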
1.4 CRISP-DM (Cross-Industry Standard Process for Data Mining)
CRISP-DM is a structured, six-phase framework for organizing and executing data science and
predictive analytics projects. It’s industry-neutral, iterative, and designed to support the full
lifecycle of a data mining project.
PROCESS
[1. Business Understanding]
↓
[2. Data Understanding]
↓
[3. Data Preparation]
↓
[4. Modeling]
↓
[5. Evaluation]
↓
[6. Deployment]
↓
[→ Iterate if necessary or update over time]
1. Business Understanding
Understand the project objectives and requirements from a business perspective.
Define business goals (e.g., reduce churn, increase sales)
Assess the current situation
Translate business needs into a data science problem
2. Data Understanding
Collect initial data and begin to get familiar with it.
Identify data sources
Describe and explore data
Detect data quality issues
Perform initial EDA (exploratory data analysis)
3. Data Preparation
Build the dataset to feed into modeling tools.
Clean data (handle missing values, outliers)
Format data (types, encodings)
Create features (feature engineering)
Merge data from different sources
4. Modeling
Apply machine learning or statistical modeling techniques.
Select modeling techniques (e.g., classification, regression)
Train models on prepared data
Tune model parameters
Evaluate training performance
5. Evaluation
Assess the model to ensure it meets business objectives.
Evaluate model performance using metrics
Review whether the model satisfies the original business goals
Decide on next steps (refine model, go to deployment, or revisit earlier steps)
6. Deployment
Implement the model into a real-world environment.
Deploy to production systems (e.g., via API, dashboard)
Create documentation
Set up monitoring and maintenance plan
Deliver final report to stakeholders
Iterative Loops: You may go back and forth between steps. For example:
Poor evaluation results → return to Modeling or Data Prep
New business goals → return to Business Understanding
Summary
Phase: Description → Output
Business Understanding: Understand the business problem → Project goals, data mining goals
Data Understanding: Collect & explore the data → Initial insights, data quality report
Data Preparation: Clean, select, and engineer features → Modeling-ready dataset
Modeling: Build predictive models → Trained models, tuned parameters
Evaluation: Assess models against business goals → Model performance report
Deployment: Implement the solution → Live model, documentation, feedback loop