
Machine Learning Problem-Solving Steps
In order to solve machine learning problems, we follow a structured approach that
helps ensure accuracy, clarity, and effectiveness. Here are the main steps involved:

1. Look at the Big Picture

Understand the overall problem you’re solving. Define your objective clearly —
what does success look like?

2. Get the Data

Collect relevant and quality data from reliable sources. Without data, there’s no
machine learning.

3. Explore and Visualize the Data

Analyze and visualize data to uncover patterns, trends, and anomalies. This step
helps you understand what you’re working with.

4. Prepare the Data

Clean, transform, and format the data. Handle missing values, normalize features,
and split the data into training and testing sets.

5. Select a Model and Train It

Choose a suitable machine learning algorithm and train it using your data. This is
where your model learns from patterns.
6. Fine-Tune Your Model

Optimize hyperparameters, try different techniques, and improve performance through iteration.

7. Present Your Solution

Explain your model’s results using visuals, metrics, and clear language so
stakeholders can understand and make decisions.

8. Launch, Monitor, and Maintain

Deploy the model in the real world, monitor its performance, and update it
regularly as new data arrives.
Datasets for Machine Learning
Machine Learning requires quality datasets for training and testing models. Some
popular sources include:

• OpenML.org – A collaborative platform offering a wide range of datasets with metadata and tools for benchmarking.
• UCI Machine Learning Repository – One of the oldest and most widely used
sources for machine learning datasets.
• Kaggle – A data science community offering large-scale, real-world datasets
and competitions.
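
For example, scikit-learn can download OpenML datasets directly. A minimal sketch, assuming an internet connection and a recent scikit-learn version (the dataset name here is purely illustrative):

from sklearn.datasets import fetch_openml

# Download a dataset from OpenML by name (illustrative example)
titanic = fetch_openml(name="titanic", version=1, as_frame=True)
print(titanic.frame.head())  # the data as a pandas DataFrame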
Quick Training with Scikit-learn (CSV Data)
This guide walks you through training a model on your own CSV dataset using
scikit-learn.

1. Import Libraries

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

2. Load Your CSV File

data = pd.read_csv('data.csv') # Replace with your actual CSV file path

3. Separate Features and Label

X = data.iloc[:, :-1]  # All columns except the last as features
y = data.iloc[:, -1]   # Last column as label

4. Train the Model

model = RandomForestClassifier()
model.fit(X, y)

The model is now trained on your complete dataset.


5. Inference

predictions = model.predict(X)  # Predict on the same/new data (for demonstration)
print(predictions)              # Display predictions
Evaluating Performance of ML Models
When we build a machine learning model, especially a classification model (which
predicts categories like “spam” or “not spam”, “dog” or “cat”), it’s important to
measure how well the model is performing.

One of the most basic ways to evaluate a classification model is accuracy.

What is Accuracy?

In simple terms, accuracy tells us how often the model was right.

If you gave your model 100 questions to answer, and it got 90 of them correct, its
accuracy would be:

Accuracy = Correct Predictions / Total Predictions = 90 / 100 = 90%

So, accuracy is just the fraction of predictions the model got right.

Evaluating Accuracy
Accuracy = (Correct Predictions) / (Total Predictions)

This is the simplest and most intuitive way to check if the model is doing a good
job.
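
As a quick illustration, scikit-learn's accuracy_score computes this fraction for you. A minimal sketch with made-up labels (in practice, compare predictions from the CSV workflow above against held-out true labels):

from sklearn.metrics import accuracy_score

y_true = [1, 0, 1, 1, 0]   # actual labels (illustrative)
y_pred = [1, 0, 1, 0, 0]   # model predictions (illustrative)
print(accuracy_score(y_true, y_pred))  # 4 correct out of 5 -> 0.8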
Gurgaon House Price Prediction Model

Why Are We Building This?

Gurgaon, a rapidly growing city in India, has seen a sharp rise in real estate
development over the past decade. With its proximity to Delhi, booming IT hubs,
and modern infrastructure, Gurgaon has become a major attraction for both
homebuyers and investors. However, the real estate market here is highly dynamic
and often difficult to assess without proper data-driven tools.

We’re building a Gurgaon house price prediction model to help:

• Understand how factors like location, size, number of rooms, and amenities
affect property prices in Gurgaon.
• Assist buyers in identifying fair prices based on historical trends.
• Help sellers estimate an appropriate asking price.
• Empower real estate agents and platforms to improve recommendations and
negotiations.

How Will We Build It?

While we don’t have access to a large, clean dataset of house prices in Gurgaon
right now, we will use a well-known and cleaned dataset—the California housing
dataset—as a proxy. This will allow us to build, test, and evaluate a working model
with real-world variables like:

• Median income of the area
• Proximity to the ocean
• Number of rooms
• Latitude and longitude
• Population
We’ll treat this as a simulation: suppose the California data is Gurgaon data, and suppose we are building this model for a neighborhood near where we live or work.

Once the model is developed and understood, we can later adapt the same
approach to real Gurgaon data when available, using the same techniques and
logic.
Revisiting Steps to solve this problem

Understanding the Problem

Before we build the model, we need to understand what kind of machine learning
problem we are solving.

First, we’ll check if this is a supervised or unsupervised learning task. Since we have historical data with house prices (our target), and we want to predict the price of a house based on input features, this is a supervised learning problem.

Next, we observe that the model is predicting one continuous label — the price of
a house — based on several input features. This makes it a univariate regression
problem.
Measuring Errors (RMSE & MAE)
After training our regression model, we need to evaluate how good its predictions
are. Two common metrics used for this are MAE and RMSE.

1. Mean Absolute Error (MAE)


MAE stands for Mean Absolute Error. It calculates the average of the absolute
differences between the predicted and actual values.

Formula: MAE = (1/n) × sum of |actual - predicted|

• It treats all errors equally, no matter their size.
• MAE is based on the Manhattan norm (also called L1 norm), which measures distance by summing absolute values.

2. Root Mean Squared Error (RMSE)


RMSE stands for Root Mean Squared Error. It calculates the square root of the
average of squared differences between predicted and actual values.

Formula: RMSE = sqrt((1/n) × sum of (actual - predicted)²)

• RMSE gives more weight to larger errors because it squares them.
• RMSE is based on the Euclidean norm (also called L2 norm), which measures straight-line distance.

Summary
• Use MAE when all errors should be treated equally.
• Use RMSE when larger errors should be penalized more.
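
As a quick illustration, both metrics are available in scikit-learn. A minimal sketch using illustrative values (RMSE is computed here as the square root of the mean squared error):

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

actual = [200000, 350000, 150000]      # illustrative house prices
predicted = [210000, 330000, 180000]

mae = mean_absolute_error(actual, predicted)
rmse = np.sqrt(mean_squared_error(actual, predicted))
print("MAE:", mae)    # average absolute error
print("RMSE:", rmse)  # penalizes the single larger 30,000 error more heavily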
Analyzing the Data (EDA)
Exploratory Data Analysis (EDA) is the process of examining a dataset to
summarize its main characteristics, often using visual methods or quick commands.
The goal is to understand the structure of the data, detect patterns, spot
anomalies, and get a feel for what kind of preprocessing or modeling might be
needed.

For our project, we’ll perform EDA on the California housing dataset (which we are
treating as if it represents Gurgaon data). Here are some key commands we’ll use:

1. df.head()

• Displays the first 5 rows of the dataset.
• Useful for getting a quick overview of what the data looks like — column names, data types, and sample values.

2. df.info()

• Gives a summary of the dataset.
• Shows the number of entries, column names, data types, and how many non-null values each column has.
• Helps us identify missing values or incorrect data types.

3. df.describe()

• Provides statistical summaries for numeric columns.

• Shows:

• Count: Total number of non-null entries
• Mean: Average value
• Std: Standard deviation
• Min: The smallest value (the 0th percentile)
• 25%: The 1st quartile (Q1) — 25% of the data is below this value
• 50%: The median or 2nd quartile (Q2) — half of the data is below this value
• 75%: The 3rd quartile (Q3) — 75% of the data is below this value
• Max: The largest value (the 100th percentile)

Percentiles divide the data into 100 equal parts. Quartiles divide the data into 4 equal parts (Q1 = 25th percentile, Q2 = 50th, Q3 = 75th). So:

• Min is the 0th percentile
• Max is the 100th percentile

This helps us understand how the values are spread out and if there are outliers.

4. df['column_name'].value_counts()

• Shows the count of each unique value in a specific column.
• Useful for categorical columns to see how values are distributed.
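
Putting these together, a minimal sketch of the EDA commands on the housing data (assuming the dataset is stored as housing.csv, as in later sections):

import pandas as pd

df = pd.read_csv("housing.csv")              # load the dataset

print(df.head())                             # first 5 rows
df.info()                                    # column types and non-null counts
print(df.describe())                         # statistical summary of numeric columns
print(df["ocean_proximity"].value_counts())  # distribution of a categorical column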
Creating a Test Set
When building machine learning models, one of the most important steps is
splitting your dataset into training and test sets. This ensures your model is
evaluated on data it has never seen before, which is critical for assessing its ability
to generalize.

The Problem of Data Snooping Bias

Data snooping bias occurs when information from the test set leaks into the
training process. This can lead to overly optimistic performance metrics and
models that don’t perform well in real-world scenarios.

To avoid this, the test set must be isolated before any data exploration, feature
selection, or model training begins.

Random Sampling: A Basic Approach

A simple method to split the data is to randomly shuffle it and then divide it:

import numpy as np

def shuffle_and_split_data(data, test_ratio):
    np.random.seed(42)  # Set the seed for reproducibility
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]

Setting the random seed (e.g., with np.random.seed(42)) ensures consistency across runs — this is crucial for debugging and comparing models fairly. However, pure random sampling might not always be reliable, especially if the dataset contains important patterns that are not evenly distributed.
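
In practice, you rarely need to write this split yourself. Scikit-learn's train_test_split performs the same shuffled split; a minimal sketch, assuming data is the housing DataFrame:

from sklearn.model_selection import train_test_split

# 80/20 shuffled split with a fixed seed for reproducibility
train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)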

Stratified Sampling

To ensure that important characteristics of the population are well represented in both the training and test sets, we use stratified sampling.

What is a Stratum?
A stratum (plural: strata) is a subgroup of the data defined by a specific attribute. Stratified sampling ensures that each of these subgroups is proportionally represented.

For example, in the California housing dataset, median income is a strong predictor of house prices. Instead of randomly sampling, we can create strata based on income levels (e.g., binning median income into categories) and ensure the test set maintains the same distribution of income levels as the full dataset.

Creating Income Categories

import pandas as pd
import numpy as np

# Load the dataset
data = pd.read_csv("housing.csv")

# Create income categories
data["income_cat"] = pd.cut(data["median_income"],
                            bins=[0, 1.5, 3.0, 4.5, 6.0, np.inf],
                            labels=[1, 2, 3, 4, 5])

This code creates a new column income_cat that categorizes the median_income
into five bins. Each bin represents a range of income levels, allowing us to stratify
our sampling based on these categories.

We can plot these income categories to visualize the distribution:

import matplotlib.pyplot as plt

data["income_cat"].value_counts().sort_index().plot.bar(rot=0, grid=True)
plt.title("Income Categories Distribution")
plt.xlabel("Income Category")
plt.ylabel("Number of Instances")
plt.show()

Stratified Shuffle Split in Scikit-Learn

Scikit-learn provides a built-in way to perform stratified sampling using StratifiedShuffleSplit.

Here’s how you can use it:

from sklearn.model_selection import StratifiedShuffleSplit

# Assume income_cat is a column in the dataset created from median_income
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

for train_index, test_index in split.split(data, data["income_cat"]):
    strat_train_set = data.loc[train_index]
    strat_test_set = data.loc[test_index]

This ensures that the income distribution in both sets is similar to that of the full
dataset, reducing sampling bias and making your model evaluation more reliable.
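
A shorter equivalent, if you only need a single stratified split, is to pass the stratify argument to train_test_split. A minimal sketch under the same assumptions:

from sklearn.model_selection import train_test_split

strat_train_set, strat_test_set = train_test_split(
    data, test_size=0.2, stratify=data["income_cat"], random_state=42)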
Data Visualization
Before handling missing values or training models, it’s important to visualize the
data to uncover patterns, relationships, and potential issues.

Geographical Scatter Plot

Visualize the geographical distribution of the data:

df.plot(kind="scatter", x="longitude", y="latitude", grid=True, alpha=0.2)


plt.show()

• alpha=0.2 makes overlapping points more visible.
• This helps reveal data clusters and high-density areas like coastal regions.

Correlation Matrix

To understand relationships between numerical features, compute the correlation matrix:

corr_matrix = df.corr(numeric_only=True)  # numeric_only skips text columns such as ocean_proximity

Check how strongly each attribute correlates with the target:

corr_matrix["median_house_value"].sort_values(ascending=False)

This helps identify useful predictors. For example, median_income usually shows a
strong positive correlation with house prices.
Scatter Matrix

Plot selected features to see pairwise relationships:

from pandas.plotting import scatter_matrix

attributes = ["median_house_value", "median_income", "total_rooms", "housing_median_age"]
scatter_matrix(df[attributes], figsize=(12, 8))
plt.show()

This gives an overview of which features are linearly related and may be good
predictors.

Focused Income vs Price Plot

Plot median_income vs median_house_value directly:

df.plot(kind="scatter", x="median_income", y="median_house_value", alpha=0.1,


grid=True)
Further Preprocessing & Handling
Missing Data
Before feeding your data into a machine learning algorithm, you need to clean and
prepare it.

Prepare Data for Training

It’s best to write transformation functions instead of applying them manually. This
ensures:

• Reproducibility on any dataset
• Reusability across projects
• Compatibility with live systems
• Easier experimentation

Start by creating a clean copy and separating the predictors and labels:

housing = strat_train_set.drop("median_house_value", axis=1)

housing_labels = strat_train_set["median_house_value"].copy()

Handling Missing Data

Some features, like total_bedrooms , contain missing values. You can:

1. Drop rows with missing values
2. Drop the entire column
3. Impute missing values (recommended)

We’ll use option 3 using SimpleImputer from Scikit-Learn, which allows consistent
handling across all datasets (train, test, new data):
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")

housing_num = housing.select_dtypes(include=[np.number])
imputer.fit(housing_num)

This computes the median for each numerical column and stores it in
imputer.statistics_ :

>>> imputer.statistics_
array([-118.51 , 34.26 , 29. , 2125. , 434. , 1167. , 408. , 3.5385])

Now apply the learned medians to transform the data:

X = imputer.transform(housing_num)

Other available strategies:

• "mean" – replaces with mean value


• "most_frequent" – for the most common value (can handle categorical)
• "constant" – fill with a fixed value using fill_value=...
Scikit-Learn Design Principles
Scikit-Learn has a simple and consistent API that makes it easy to use and
understand. Below are the key design principles behind it:

1. Consistency

All objects follow a standard interface, which makes learning and using different
tools in Scikit-Learn easier.

2. Estimators

Any object that learns from data is called an estimator.

• Use the .fit() method to train an estimator.
• In supervised learning, pass both X (features) and y (labels) to .fit(X, y).
• Hyperparameters (like strategy='mean' in SimpleImputer) are set when creating the object.

Example:

imputer = SimpleImputer(strategy="median")
imputer.fit(data)
3. Transformers

Some estimators can also transform data. These are called transformers.

• Use .transform() to apply the transformation after fitting.
• Use .fit_transform() to do both in one step.

Example:

X_transformed = imputer.fit_transform(data)

4. Predictors

Models that can make predictions are predictors.

• Use .predict() to make predictions on new data.
• Use .score() to evaluate performance (e.g., accuracy or R²).

Example:

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
score = model.score(X_test, y_test)

5. Inspection

• Hyperparameters can be accessed directly: model.param_name
• Learned parameters are stored with an underscore: model.coef_, imputer.statistics_
6. No Extra Classes

• Inputs and outputs are basic structures like NumPy arrays or Pandas
DataFrames.
• No need to learn custom data types.

7. Composition

You can combine steps into a Pipeline, chaining transformers and a final predictor.

Example:

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("model", LinearRegression())
])
pipeline.fit(X, y)

8. Sensible Defaults

Most tools in Scikit-Learn work well with default settings, so you can get started
quickly.

Note on DataFrames

Even if you input a Pandas DataFrame, the output of transformers like transform() will be a NumPy array. You can convert it back like this:

X = imputer.transform(housing_num)
housing_tr = pd.DataFrame(X, columns=housing_num.columns, index=housing_num.index)
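
As a side note, recent scikit-learn versions (1.2+) can be asked to return DataFrames directly via the set_output API. A minimal sketch, assuming that version is available:

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median").set_output(transform="pandas")
housing_tr = imputer.fit_transform(housing_num)  # already a DataFrame with column names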
Handling Categorical and Text Attributes in Scikit-Learn
Most machine learning algorithms work best with numerical data. But real-world
datasets often contain categorical or text attributes. Let’s understand how to
handle these in Scikit-Learn using the ocean_proximity column from the
California housing dataset as an example.

1. Categorical Attributes

Text columns like "ocean_proximity" are not free-form text but limited to a fixed
set of values (e.g., "NEAR BAY" , "INLAND" ). These are known as categorical
attributes.

Example:

housing_cat = housing[["ocean_proximity"]]
housing_cat.head()

2. Ordinal Encoding

Scikit-Learn’s OrdinalEncoder can convert categories to numbers:

from sklearn.preprocessing import OrdinalEncoder

ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)

This will output a 2D NumPy array with numerical category codes.


To see the mapping:

ordinal_encoder.categories_
# Output: [array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'], dtype=object)]

⚠️ Caution: Ordinal encoding implies an order between categories, which may not
be true here. For example, it treats INLAND (1) as closer to <1H OCEAN (0) than
NEAR OCEAN (4) , which might not make sense.

3. One-Hot Encoding

For unordered categories, one-hot encoding is a better choice. It creates one binary column per category.

from sklearn.preprocessing import OneHotEncoder

cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)

This gives a sparse matrix (efficient storage for mostly zeros).

To convert it to a regular NumPy array:

housing_cat_1hot.toarray()

Or directly get a dense array:

cat_encoder = OneHotEncoder(sparse_output=False)  # on scikit-learn versions before 1.2, use sparse=False
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)

To check category order:

cat_encoder.categories_
4. Summary

Method         | Use When                  | Output Type
OrdinalEncoder | Categories have an order  | 2D NumPy array
OneHotEncoder  | Categories are unordered  | Sparse or dense array

Using the right encoding ensures your model learns correctly from categorical
features.
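
To see which binary column corresponds to which category after one-hot encoding, you can ask the fitted encoder for its output feature names. A small illustrative check, assuming scikit-learn 1.0+ and the cat_encoder fitted above:

print(cat_encoder.get_feature_names_out())
# roughly: ['ocean_proximity_<1H OCEAN' 'ocean_proximity_INLAND' ... ]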
Feature Scaling and Transformation
Feature scaling is a crucial preprocessing step. Most machine learning algorithms
perform poorly when input features have vastly different scales.

In the California housing dataset, for example:

• total_rooms ranges from 6 to over 39,000
• median_income ranges from 0 to 15

If you don’t scale these features, models will give more importance to
total_rooms simply because it has larger values.

Why Scaling Is Needed

• Many models (like Linear Regression, KNN, SVMs, Gradient Descent-based algorithms) assume features are on a similar scale.
• Without scaling, features with larger ranges can dominate model behavior.
• Scaling makes training more stable and faster.

Min-Max Scaling (Normalization)

This method rescales the data to a specific range, usually [0, 1] or [-1, 1] .

Formula:

scaled_value = (x - min) / (max - min)

Use Scikit-Learn’s MinMaxScaler :

from sklearn.preprocessing import MinMaxScaler


min_max_scaler = MinMaxScaler(feature_range=(-1, 1))
housing_num_min_max_scaled = min_max_scaler.fit_transform(housing_num)

• Use feature_range=(-1, 1) for models like neural networks.
• Sensitive to outliers — extreme values can distort the scale.

Standardization (Z-score Scaling)

This method centers the data around 0 and scales it based on standard deviation.

Formula:

standardized_value = (x - mean) / std

Use Scikit-Learn’s StandardScaler :

from sklearn.preprocessing import StandardScaler

std_scaler = StandardScaler()
housing_num_std_scaled = std_scaler.fit_transform(housing_num)

• Resulting features have zero mean and unit variance
• More robust to outliers than min-max scaling
• Recommended for most ML algorithms, especially when using gradient descent
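
One important detail: a scaler should be fitted on the training data only, then reused to transform validation, test, or production data, so no information from the test set leaks into training. A minimal sketch, using hypothetical train_num and test_num numerical DataFrames:

from sklearn.preprocessing import StandardScaler

std_scaler = StandardScaler()
train_scaled = std_scaler.fit_transform(train_num)  # learn mean/std from training data only
test_scaled = std_scaler.transform(test_num)        # reuse the training statistics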
Transformation Pipelines
As datasets grow more complex, data preprocessing often involves multiple steps
such as imputing missing values, scaling features, encoding categorical variables,
etc. These steps must be applied in the correct order and consistently across
training, validation, test, and future production data.

To streamline this process, Scikit-Learn provides the Pipeline class — a powerful utility for chaining data transformations.

Building a Numerical Pipeline

A typical pipeline for numerical attributes might include:

1. Imputation of missing values (e.g., with median).
2. Feature scaling (e.g., with standardization).

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("standardize", StandardScaler()),
])

How It Works
• The pipeline takes a list of steps as (name, transformer) pairs.
• Names must be unique and should not contain double underscores __ .
• All intermediate steps must be transformers (i.e., must implement
fit_transform() ).
• The final step can be either a transformer or a predictor.

Using make_pipeline

If you don’t want to name the steps manually, you can use make_pipeline() :

from sklearn.pipeline import make_pipeline

num_pipeline = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

• This automatically names the steps using the class names in lowercase.
• If the same class appears multiple times, a number is appended (e.g.,
standardscaler-1 ).

Applying the Pipeline

Call fit_transform() to apply all transformations in sequence:

housing_num_prepared = num_pipeline.fit_transform(housing_num)
print(housing_num_prepared[:2].round(2))

Example output:

array([[-1.42, 1.01, 1.86, 0.31, 1.37, 0.14, 1.39, -0.94],

[ 0.60, -0.70, 0.91, -0.31, -0.44, -0.69, -0.37, 1.17]])

• Each row corresponds to a transformed sample.
• Each column corresponds to a scaled feature.

Retrieving Feature Names


To turn the result back into a DataFrame with feature names:
df_housing_num_prepared = pd.DataFrame(
    housing_num_prepared,
    columns=num_pipeline.get_feature_names_out(),
    index=housing_num.index
)

Pipeline as a Transformer or Predictor

• If the last step is a transformer, the pipeline behaves like a transformer (fit_transform(), transform()).
• If the last step is a predictor (e.g., a model), the pipeline behaves like an estimator (fit(), predict()).

This flexibility makes Pipeline the standard way to handle data preprocessing
and modeling in Scikit-Learn projects.
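
For example, appending a model as the last step turns the pipeline into a single object you can fit and predict with. A minimal sketch, assuming housing_num and housing_labels from earlier and LinearRegression as an illustrative model:

from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

model_pipeline = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LinearRegression()
)
model_pipeline.fit(housing_num, housing_labels)    # runs all transforms, then fits the model
predictions = model_pipeline.predict(housing_num)  # transforms are re-applied automatically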
Data Preprocessing - Final Pipeline
In this section, we will consolidate everything we’ve done so far into one final
script using Scikit-Learn pipelines. This includes:

1. Creating a stratified test set
2. Handling missing values
3. Encoding categorical variables
4. Scaling numerical features
5. Combining everything using Pipeline and ColumnTransformer

This will ensure clean, modular, and reproducible code — perfect for production
and education.

Final Preprocessing Code using Scikit-Learn Pipelines

import pandas as pd
import numpy as np

from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
# from sklearn.preprocessing import OrdinalEncoder  # Uncomment if you prefer ordinal

# 1. Load the data
housing = pd.read_csv("housing.csv")

# 2. Create a stratified test set based on income category
housing["income_cat"] = pd.cut(
    housing["median_income"],
    bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
    labels=[1, 2, 3, 4, 5]
)

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index].drop("income_cat", axis=1)
    strat_test_set = housing.loc[test_index].drop("income_cat", axis=1)

# Work on a copy of training data
housing = strat_train_set.copy()

# 3. Separate predictors and labels
housing_labels = housing["median_house_value"].copy()
housing = housing.drop("median_house_value", axis=1)

# 4. Separate numerical and categorical columns
num_attribs = housing.drop("ocean_proximity", axis=1).columns.tolist()
cat_attribs = ["ocean_proximity"]

# 5. Pipelines
# Numerical pipeline
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Categorical pipeline
cat_pipeline = Pipeline([
    # ("ordinal", OrdinalEncoder()),  # Use this if you prefer ordinal encoding
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

# Full pipeline
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", cat_pipeline, cat_attribs),
])

# 6. Transform the data
housing_prepared = full_pipeline.fit_transform(housing)
# housing_prepared is now a NumPy array ready for training

print(housing_prepared.shape)
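
If you want the transformed features back as a DataFrame with readable column names, the fitted ColumnTransformer can provide them. A small optional sketch (if the output happens to be sparse, convert it with .toarray() first):

feature_names = full_pipeline.get_feature_names_out()
housing_prepared_df = pd.DataFrame(housing_prepared, columns=feature_names, index=housing.index)
print(housing_prepared_df.head())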
Training and Evaluating ML Models
Now that our data is preprocessed, let’s move on to training machine learning
models and evaluating their performance. We’ll start with:

• Linear Regression
• Decision Tree Regressor
• Random Forest Regressor

We’ll first test them on the training data and then use cross-validation to get a
better estimate of their true performance.

1. Train and Test Models on the Training Set

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Linear Regression
lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)

# Decision Tree
tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(housing_prepared, housing_labels)

# Random Forest
forest_reg = RandomForestRegressor(random_state=42)
forest_reg.fit(housing_prepared, housing_labels)

# Predict using training data
lin_preds = lin_reg.predict(housing_prepared)
tree_preds = tree_reg.predict(housing_prepared)
forest_preds = forest_reg.predict(housing_prepared)

# Calculate RMSE on the training set
lin_rmse = mean_squared_error(housing_labels, lin_preds, squared=False)
tree_rmse = mean_squared_error(housing_labels, tree_preds, squared=False)
forest_rmse = mean_squared_error(housing_labels, forest_preds, squared=False)
# Note: newer scikit-learn versions also provide root_mean_squared_error,
# which replaces squared=False.

print("Linear Regression RMSE:", lin_rmse)


print("Decision Tree RMSE:", tree_rmse)
print("Random Forest RMSE:", forest_rmse)

A Warning About Training RMSE

Training RMSE only shows how well the model fits the training data. It does not
tell us how well it will perform on unseen data. In fact, the Decision Tree and
Random Forest may overfit, leading to very low training error but poor
generalization.

2. Cross-Validation: A Better Evaluation Strategy

Cross-validation helps us evaluate how a model generalizes to new data without needing to touch the test set.

What is Cross-Validation?
Instead of training the model once and evaluating on a holdout set, k-fold cross-
validation splits the training data into k folds (typically 10), trains the model on k-1
folds, and validates it on the remaining fold. This process repeats k times.

We’ll use cross_val_score from sklearn.model_selection .


Cross-Validation on Decision Tree

from sklearn.model_selection import cross_val_score
import pandas as pd

# Evaluate Decision Tree with cross-validation
tree_rmses = -cross_val_score(
    tree_reg,
    housing_prepared,
    housing_labels,
    scoring="neg_root_mean_squared_error",
    cv=10
)

# WARNING: Scikit-Learn's scoring uses utility functions (higher is better),
# so RMSE is returned as negative. We use minus (-) to convert it back to positive RMSE.
print("Decision Tree CV RMSEs:", tree_rmses)
print("\nCross-Validation Performance (Decision Tree):")
print(pd.Series(tree_rmses).describe())
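
The same call works for the other models. For example, a sketch of cross-validating the Random Forest (this can take a few minutes, since it trains ten forests):

forest_rmses = -cross_val_score(
    forest_reg,
    housing_prepared,
    housing_labels,
    scoring="neg_root_mean_squared_error",
    cv=10
)
print("Random Forest CV RMSEs:", forest_rmses)
print(pd.Series(forest_rmses).describe())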
Model Persistence and Inference with Joblib in a Random Forest Pipeline
Let's now summarize how to train a Random Forest model on California housing data, save the model and preprocessing pipeline using joblib, and reuse the model later for inference on new data (input.csv). This approach avoids retraining the model every time, improving performance and enabling reproducibility.

Why These Steps?

1. Why Train Once and Save?


• Training models repeatedly is time-consuming and computationally
expensive.
• Saving the model ( model.pkl ) and preprocessing pipeline ( pipeline.pkl )
ensures you can quickly load and run inference anytime in the future.

2. Why Use a Preprocessing Pipeline?


• Raw data needs to be cleaned, scaled, and encoded before model training.
• A Pipeline automates this transformation and ensures identical
preprocessing during inference.

3. Why Use Joblib?


• joblib efficiently serializes large NumPy arrays (like in sklearn models).
• Faster and more suitable than pickle for scikit-learn objects.
4. Why the If-Else Logic?

• The program checks if a saved model exists.
• If not, it trains and saves the model.
• If it does, it skips training and only runs inference, saving time.

Full Code

import os
import pandas as pd
import numpy as np
import joblib

from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestRegressor

MODEL_FILE = "model.pkl"
PIPELINE_FILE = "pipeline.pkl"

def build_pipeline(num_attribs, cat_attribs):
    num_pipeline = Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler())
    ])
    cat_pipeline = Pipeline([
        ("onehot", OneHotEncoder(handle_unknown="ignore"))
    ])
    full_pipeline = ColumnTransformer([
        ("num", num_pipeline, num_attribs),
        ("cat", cat_pipeline, cat_attribs)
    ])
    return full_pipeline
if not os.path.exists(MODEL_FILE):
    # TRAINING PHASE
    housing = pd.read_csv("housing.csv")
    housing["income_cat"] = pd.cut(housing["median_income"],
                                   bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
                                   labels=[1, 2, 3, 4, 5])
    split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
    for train_index, _ in split.split(housing, housing["income_cat"]):
        housing = housing.loc[train_index].drop("income_cat", axis=1)

    housing_labels = housing["median_house_value"].copy()
    housing_features = housing.drop("median_house_value", axis=1)

    num_attribs = housing_features.drop("ocean_proximity", axis=1).columns.tolist()
    cat_attribs = ["ocean_proximity"]

    pipeline = build_pipeline(num_attribs, cat_attribs)
    housing_prepared = pipeline.fit_transform(housing_features)

    model = RandomForestRegressor(random_state=42)
    model.fit(housing_prepared, housing_labels)

    # Save model and pipeline
    joblib.dump(model, MODEL_FILE)
    joblib.dump(pipeline, PIPELINE_FILE)

    print("Model trained and saved.")

else:
    # INFERENCE PHASE
    model = joblib.load(MODEL_FILE)
    pipeline = joblib.load(PIPELINE_FILE)

    input_data = pd.read_csv("input.csv")
    transformed_input = pipeline.transform(input_data)
    predictions = model.predict(transformed_input)
    input_data["median_house_value"] = predictions
    input_data.to_csv("output.csv", index=False)
    print("Inference complete. Results saved to output.csv")

Summary

With this setup, our ML pipeline is:

• Efficient – No retraining needed if the model exists.
• Reproducible – Same preprocessing logic every time.
• Production-ready – Can be deployed or reused across multiple systems.
Conclusion: California Housing Price Prediction Project
In this project, we built a complete machine learning pipeline to predict California
housing prices using various regression algorithms. We started by:

• Loading and preprocessing the dataset (housing.csv) with careful treatment of missing values, scaling, and encoding using a custom pipeline.
• Stratified splitting was used to maintain income category distribution between train and test sets.
• We trained and evaluated multiple algorithms including:
  • Linear Regression
  • Decision Tree Regressor
  • Random Forest Regressor
• Through cross-validation, we found that Random Forest performed the best, offering the lowest RMSE and most stable results.

Finally, we built a script that:

• Trains the Random Forest model and saves it using joblib.
• Uses if-else logic to skip retraining if the model exists.
• Applies the trained model to new data (input.csv) to predict median_house_value, storing results in output.csv.

This pipeline ensures that predictions are accurate, efficient, and ready for
production deployment.
