Machine Learning Problem-Solving
Steps
In order to solve machine learning problems, we follow a structured approach that
helps ensure accuracy, clarity, and effectiveness. Here are the main steps involved:
1. Look at the Big Picture
Understand the overall problem you’re solving. Define your objective clearly —
what does success look like?
2. Get the Data
Collect relevant and quality data from reliable sources. Without data, there’s no
machine learning.
3. Explore and Visualize the Data
Analyze and visualize data to uncover patterns, trends, and anomalies. This step
helps you understand what you’re working with.
4. Prepare the Data
Clean, transform, and format the data. Handle missing values, normalize features,
and split the data into training and testing sets.
5. Select a Model and Train It
Choose a suitable machine learning algorithm and train it using your data. This is
where your model learns from patterns.
6. Fine-Tune Your Model
Optimize hyperparameters, try different techniques, and improve performance
through iteration.
7. Present Your Solution
Explain your model’s results using visuals, metrics, and clear language so
stakeholders can understand and make decisions.
8. Launch, Monitor, and Maintain
Deploy the model in the real world, monitor its performance, and update it
regularly as new data arrives.
Datasets for Machine Learning
Machine Learning requires quality datasets for training and testing models. Some
popular sources include:
• OpenML.org – A collaborative platform offering a wide range of datasets with
metadata and tools for benchmarking.
• UCI Machine Learning Repository – One of the oldest and most widely used
sources for machine learning datasets.
• Kaggle – A data science community offering large-scale, real-world datasets
and competitions.
Quick Training with Scikit-learn (CSV Data)
This guide walks you through training a model on your own CSV dataset using
scikit-learn.
1. Import Libraries
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
2. Load Your CSV File
data = pd.read_csv('data.csv') # Replace with your actual CSV file path
3. Separate Features and Label
X = data.iloc[:, :-1] # All columns except the last as features
y = data.iloc[:, -1] # Last column as label
4. Train the Model
model = RandomForestClassifier()
model.fit(X, y)
The model is now trained on your complete dataset.
5. Inference
predictions = model.predict(X) # Predict on the same/new data (for demonstration)
print(predictions) # Display predictions
Evaluating Performance of ML Models
When we build a machine learning model, especially a classification model (which
predicts categories like “spam” or “not spam”, “dog” or “cat”), it’s important to
measure how well the model is performing.
One of the most basic ways to evaluate a classification model is accuracy.
What is Accuracy?
In simple terms, accuracy tells us how often the model was right.
If you gave your model 100 questions to answer, and it got 90 of them correct, its
accuracy would be:
Accuracy = Correct Predictions / Total Predictions = 90 / 100 = 90%
So, accuracy is just the fraction of predictions the model got right.
Evaluating Accuracy
Accuracy = (Correct Predictions) / (Total Predictions)
This is the simplest and most intuitive way to check if the model is doing a good
job.
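As a minimal sketch (assuming the feature matrix X and labels y from the quick-training example above), accuracy can be measured on a held-out test set with scikit-learn's accuracy_score:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
# Hold out 20% of the data so the model is scored on examples it has not seen
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestClassifier()
model.fit(X_train, y_train)
# Accuracy = correct predictions / total predictions
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))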
Gurgaon House Price Prediction Model
Why Are We Building This?
Gurgaon, a rapidly growing city in India, has seen a sharp rise in real estate
development over the past decade. With its proximity to Delhi, booming IT hubs,
and modern infrastructure, Gurgaon has become a major attraction for both
homebuyers and investors. However, the real estate market here is highly dynamic
and often difficult to assess without proper data-driven tools.
We’re building a Gurgaon house price prediction model to help:
• Understand how factors like location, size, number of rooms, and amenities
affect property prices in Gurgaon.
• Assist buyers in identifying fair prices based on historical trends.
• Help sellers estimate an appropriate asking price.
• Empower real estate agents and platforms to improve recommendations and
negotiations.
How Will We Build It?
While we don’t have access to a large, clean dataset of house prices in Gurgaon
right now, we will use a well-known and cleaned dataset—the California housing
dataset—as a proxy. This will allow us to build, test, and evaluate a working model
with real-world variables like:
• Median income of the area
• Proximity to the city center
• Number of rooms
• Latitude and longitude
• Population density
We’ll treat this as a simulation: suppose the California data is Gurgaon data, and
suppose we are building this model for a neighborhood where both you and I live
or work nearby.
Once the model is developed and understood, we can later adapt the same
approach to real Gurgaon data when available, using the same techniques and
logic.
Revisiting the Steps to Solve This Problem
Understanding the Problem
Before we build the model, we need to understand what kind of machine learning
problem we are solving.
First, we’ll check if this is a supervised or unsupervised learning task. Since we
have historical data with house prices (our target), and we want to predict the price
of a house based on input features, this is a supervised learning problem.
Next, we observe that the model predicts one continuous label — the price of
a house — based on several input features. Because there is a single output value, this
is a univariate regression problem (and, since it uses multiple input features, it is also
a multiple regression problem).
Measuring Errors (RMSE & MAE)
After training our regression model, we need to evaluate how good its predictions
are. Two common metrics used for this are MAE and RMSE.
1. Mean Absolute Error (MAE)
MAE stands for Mean Absolute Error. It calculates the average of the absolute
differences between the predicted and actual values.
Formula: MAE = (1/n) × sum of |actual - predicted|
• It treats all errors equally, no matter their size.
• MAE is based on the Manhattan norm (also called L1 norm), which measures
distance by summing absolute values.
2. Root Mean Squared Error (RMSE)
RMSE stands for Root Mean Squared Error. It calculates the square root of the
average of squared differences between predicted and actual values.
Formula: RMSE = sqrt((1/n) × sum of (actual - predicted)²)
• RMSE gives more weight to larger errors because it squares them.
• RMSE is based on the Euclidean norm (also called L2 norm), which measures
straight-line distance.
Summary
• Use MAE when all errors should be treated equally.
• Use RMSE when larger errors should be penalized more.
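As a small illustration of both formulas (the numbers here are made up), MAE and RMSE can be computed directly with NumPy, and MAE cross-checked with scikit-learn:
import numpy as np
from sklearn.metrics import mean_absolute_error
# Hypothetical actual prices and model predictions
actual = np.array([200000, 350000, 150000, 400000])
predicted = np.array([210000, 330000, 180000, 390000])
mae = np.mean(np.abs(actual - predicted))           # L1 / Manhattan norm
rmse = np.sqrt(np.mean((actual - predicted) ** 2))  # L2 / Euclidean norm
print("MAE:", mae, "==", mean_absolute_error(actual, predicted))
print("RMSE:", rmse)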
Analyzing the Data (EDA)
Exploratory Data Analysis (EDA) is the process of examining a dataset to
summarize its main characteristics, often using visual methods or quick commands.
The goal is to understand the structure of the data, detect patterns, spot
anomalies, and get a feel for what kind of preprocessing or modeling might be
needed.
For our project, we’ll perform EDA on the California housing dataset (which we are
treating as if it represents Gurgaon data). Here are some key commands we’ll use:
1. df.head()
• Displays the first 5 rows of the dataset.
• Useful for getting a quick overview of what the data looks like — column
names, data types, and sample values.
2. df.info()
• Gives a summary of the dataset.
• Shows the number of entries, column names, data types, and how many
non-null values each column has.
• Helps us identify missing values or incorrect data types.
3. df.describe()
• Provides statistical summaries for numeric columns.
• Shows:
• Count: Total number of non-null entries
• Mean: Average value
• Std: Standard deviation
• Min: The smallest value (the 0th percentile)
• 25%: The 1st quartile (Q1) — 25% of the data is below this value
• 50%: The median or 2nd quartile (Q2) — half of the data is below this
value
• 75%: The 3rd quartile (Q3) — 75% of the data is below this value
• Max: The largest value (the 100th percentile)
Percentiles divide the data into 100 equal parts. Quartiles divide the data into
4 equal parts (Q1 = 25th percentile, Q2 = 50th, Q3 = 75th). So:
• Min is the 0th percentile
• Max is the 100th percentile
This helps us understand how the values are spread out and if there are outliers.
4. df['column_name'].value_counts()
• Shows the count of each unique value in a specific column.
• Useful for categorical columns to see how values are distributed.
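Putting these together, a typical first EDA pass on our housing data might look like the sketch below (assuming the file is housing.csv and it has an ocean_proximity column):
import pandas as pd
df = pd.read_csv("housing.csv")
print(df.head())                             # first 5 rows
df.info()                                    # entries, dtypes, non-null counts
print(df.describe())                         # count, mean, std, quartiles
print(df["ocean_proximity"].value_counts())  # distribution of a categorical column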
Creating a Test Set
When building machine learning models, one of the most important steps is
splitting your dataset into training and test sets. This ensures your model is
evaluated on data it has never seen before, which is critical for assessing its ability
to generalize.
The Problem of Data Snooping Bias
Data snooping bias occurs when information from the test set leaks into the
training process. This can lead to overly optimistic performance metrics and
models that don’t perform well in real-world scenarios.
To avoid this, the test set must be isolated before any data exploration, feature
selection, or model training begins.
Random Sampling: A Basic Approach
A simple method to split the data is to randomly shuffle it and then divide it:
import numpy as np
def shuffle_and_split_data(data, test_ratio):
    np.random.seed(42)  # Set the seed for reproducibility
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]
Setting the random seed (e.g., with np.random.seed(42) ) ensures consistency
across runs — this is crucial for debugging and comparing models fairly.
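For example (assuming housing is the DataFrame loaded from housing.csv), you could hold out 20% of the rows like this:
train_set, test_set = shuffle_and_split_data(housing, 0.2)
print(len(train_set), "train +", len(test_set), "test")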
However, pure random sampling might not always be reliable, especially if the
dataset contains important patterns that are not evenly distributed.
Stratified Sampling
To ensure that important characteristics of the population are well represented in
both the training and test sets, we use stratified sampling.
What is a Stratum?
A stratum (plural: strata) is a subgroup of the data defined by a specific attribute.
Stratified sampling ensures that each of these subgroups is proportionally represented.
For example, in the California housing dataset, median income is a strong
predictor of house prices. Instead of randomly sampling, we can create strata
based on income levels (e.g., binning median income into categories) and ensure
the test set maintains the same distribution of income levels as the full dataset.
Creating Income Categories
import pandas as pd
import numpy as np
# Load the dataset
data = pd.read_csv("housing.csv")
# Create income categories
data["income_cat"] = pd.cut(data["median_income"],
                            bins=[0, 1.5, 3.0, 4.5, 6.0, np.inf],
                            labels=[1, 2, 3, 4, 5])
This code creates a new column income_cat that categorizes the median_income
into five bins. Each bin represents a range of income levels, allowing us to stratify
our sampling based on these categories.
We can plot these income categories to visualize the distribution:
import matplotlib.pyplot as plt
data["income_cat"].value_counts().sort_index().plot.bar(rot=0, grid=True)
plt.title("Income Categories Distribution")
plt.xlabel("Income Category")
plt.ylabel("Number of Instances")
plt.show()
Stratified Shuffle Split in Scikit-Learn
Scikit-learn provides a built-in way to perform stratified sampling using
StratifiedShuffleSplit .
Here’s how you can use it:
from sklearn.model_selection import StratifiedShuffleSplit
# Assume income_cat is a column in the dataset created from median_income
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(data, data["income_cat"]):
    strat_train_set = data.loc[train_index]
    strat_test_set = data.loc[test_index]
This ensures that the income distribution in both sets is similar to that of the full
dataset, reducing sampling bias and making your model evaluation more reliable.
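A more compact alternative (a sketch using scikit-learn's train_test_split with its stratify parameter) produces the same kind of stratified split:
from sklearn.model_selection import train_test_split
strat_train_set, strat_test_set = train_test_split(
    data, test_size=0.2, stratify=data["income_cat"], random_state=42)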
Data Visualization
Before handling missing values or training models, it’s important to visualize the
data to uncover patterns, relationships, and potential issues.
Geographical Scatter Plot
Visualize the geographical distribution of the data:
df.plot(kind="scatter", x="longitude", y="latitude", grid=True, alpha=0.2)
plt.show()
• alpha=0.2 makes overlapping points more visible.
• This helps reveal data clusters and high-density areas like coastal regions.
Correlation Matrix
To understand relationships between numerical features, compute the correlation
matrix:
corr_matrix = df.corr(numeric_only=True)  # numeric_only skips text columns like ocean_proximity
Check how strongly each attribute correlates with the target:
corr_matrix["median_house_value"].sort_values(ascending=False)
This helps identify useful predictors. For example, median_income usually shows a
strong positive correlation with house prices.
Scatter Matrix
Plot selected features to see pairwise relationships:
from pandas.plotting import scatter_matrix
attributes = ["median_house_value", "median_income", "total_rooms", "housing_median_age"]
scatter_matrix(df[attributes], figsize=(12, 8))
plt.show()
This gives an overview of which features are linearly related and may be good
predictors.
Focused Income vs Price Plot
Plot median_income vs median_house_value directly:
df.plot(kind="scatter", x="median_income", y="median_house_value", alpha=0.1,
grid=True)
Further Preprocessing & Handling Missing Data
Before feeding your data into a machine learning algorithm, you need to clean and
prepare it.
Prepare Data for Training
It’s best to write transformation functions instead of applying them manually. This
ensures:
• Reproducibility on any dataset
• Reusability across projects
• Compatibility with live systems
• Easier experimentation
Start by creating a clean copy and separating the predictors and labels:
housing = strat_train_set.drop("median_house_value", axis=1)
housing_labels = strat_train_set["median_house_value"].copy()
Handling Missing Data
Some features, like total_bedrooms , contain missing values. You can:
1. Drop rows with missing values
2. Drop the entire column
3. Impute missing values (recommended)
We’ll use option 3 using SimpleImputer from Scikit-Learn, which allows consistent
handling across all datasets (train, test, new data):
import numpy as np
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median")
housing_num = housing.select_dtypes(include=[np.number])
imputer.fit(housing_num)
This computes the median for each numerical column and stores it in
imputer.statistics_ :
>>> imputer.statistics_
array([-118.51 , 34.26 , 29. , 2125. , 434. , 1167. , 408. , 3.5385])
Now apply the learned medians to transform the data:
X = imputer.transform(housing_num)
Other available strategies:
• "mean" – replaces with mean value
• "most_frequent" – for the most common value (can handle categorical)
• "constant" – fill with a fixed value using fill_value=...
Scikit-Learn Design Principles
Scikit-Learn has a simple and consistent API that makes it easy to use and
understand. Below are the key design principles behind it:
1. Consistency
All objects follow a standard interface, which makes learning and using different
tools in Scikit-Learn easier.
2. Estimators
Any object that learns from data is called an estimator.
• Use the .fit() method to train an estimator.
• In supervised learning, pass both X (features) and y (labels) to .fit(X, y) .
• Hyperparameters (like strategy='mean' in SimpleImputer ) are set when
creating the object.
Example:
imputer = SimpleImputer(strategy="median")
imputer.fit(data)
3. Transformers
Some estimators can also transform data. These are called transformers.
• Use .transform() to apply the transformation after fitting.
• Use .fit_transform() to do both in one step.
Example:
X_transformed = imputer.fit_transform(data)
4. Predictors
Models that can make predictions are predictors.
• Use .predict() to make predictions on new data.
• Use .score() to evaluate performance (e.g., accuracy or R²).
Example:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
score = model.score(X_test, y_test)
5. Inspection
• Hyperparameters can be accessed directly: model.param_name
• Learned parameters are stored with an underscore: model.coef_ ,
imputer.statistics_
6. No Extra Classes
• Inputs and outputs are basic structures like NumPy arrays or Pandas
DataFrames.
• No need to learn custom data types.
7. Composition
You can combine steps into a Pipeline, chaining transformers and a final predictor.
Example:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("model", LinearRegression())
])
pipeline.fit(X, y)
8. Sensible Defaults
Most tools in Scikit-Learn work well with default settings, so you can get started
quickly.
Note on DataFrames
Even if you input a Pandas DataFrame, the output of transformers like
transform() will be a NumPy array. You can convert it back like this:
X = imputer.transform(housing_num)
housing_tr = pd.DataFrame(X, columns=housing_num.columns, index=housing_num.index)
Handling Categorical and Text Attributes in Scikit-Learn
Most machine learning algorithms work best with numerical data. But real-world
datasets often contain categorical or text attributes. Let’s understand how to
handle these in Scikit-Learn using the ocean_proximity column from the
California housing dataset as an example.
1. Categorical Attributes
Text columns like "ocean_proximity" are not free-form text but limited to a fixed
set of values (e.g., "NEAR BAY" , "INLAND" ). These are known as categorical
attributes.
Example:
housing_cat = housing[["ocean_proximity"]]
housing_cat.head()
2. Ordinal Encoding
Scikit-Learn’s OrdinalEncoder can convert categories to numbers:
from sklearn.preprocessing import OrdinalEncoder
ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)
This will output a 2D NumPy array with numerical category codes.
To see the mapping:
ordinal_encoder.categories_
# Output: array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'])
⚠️ Caution: Ordinal encoding implies an order between categories, which may not
be true here. For example, it treats INLAND (1) as closer to <1H OCEAN (0) than
NEAR OCEAN (4) , which might not make sense.
3. One-Hot Encoding
For unordered categories, one-hot encoding is a better choice. It creates one
binary column per category.
from sklearn.preprocessing import OneHotEncoder
cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
This gives a sparse matrix (efficient storage for mostly zeros).
To convert it to a regular NumPy array:
housing_cat_1hot.toarray()
Or directly get a dense array:
cat_encoder = OneHotEncoder(sparse_output=False)  # on scikit-learn older than 1.2, use sparse=False
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
To check category order:
cat_encoder.categories_
4. Summary
Method           Use When                    Output Type
OrdinalEncoder   Categories have an order    2D NumPy array
OneHotEncoder    Categories are unordered    Sparse or dense matrix
Using the right encoding ensures your model learns correctly from categorical
features.
Feature Scaling and Transformation
Feature scaling is a crucial preprocessing step. Most machine learning algorithms
perform poorly when input features have vastly different scales.
In the California housing dataset, for example:
• total_rooms ranges from 6 to over 39,000
• median_income ranges from 0 to 15
If you don’t scale these features, models will give more importance to
total_rooms simply because it has larger values.
Why Scaling Is Needed
• Many models (like Linear Regression, KNN, SVMs, Gradient Descent-based
algorithms) assume features are on a similar scale.
• Without scaling, features with larger ranges can dominate model behavior.
• Scaling makes training more stable and faster.
Min-Max Scaling (Normalization)
This method rescales the data to a specific range, usually [0, 1] or [-1, 1] .
Formula:
scaled_value = (x - min) / (max - min)
Use Scikit-Learn’s MinMaxScaler :
from sklearn.preprocessing import MinMaxScaler
min_max_scaler = MinMaxScaler(feature_range=(-1, 1))
housing_num_min_max_scaled = min_max_scaler.fit_transform(housing_num)
• Use feature_range=(-1, 1) for models like neural networks.
• Sensitive to outliers — extreme values can distort the scale.
Standardization (Z-score Scaling)
This method centers the data around 0 and scales it based on standard deviation.
Formula:
standardized_value = (x - mean) / std
Use Scikit-Learn’s StandardScaler :
from sklearn.preprocessing import StandardScaler
std_scaler = StandardScaler()
housing_num_std_scaled = std_scaler.fit_transform(housing_num)
• Resulting features have zero mean and unit variance
• Less affected by outliers than min-max scaling
• Recommended for most ML algorithms, especially when using gradient
descent
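As a quick sketch on a tiny made-up array, the two scalers behave quite differently when an outlier is present:
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler
values = np.array([[1.0], [2.0], [3.0], [100.0]])  # note the outlier
print(MinMaxScaler().fit_transform(values).ravel())    # squeezed into [0, 1], dominated by the outlier
print(StandardScaler().fit_transform(values).ravel())  # zero mean, unit variance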
Transformation Pipelines
As datasets grow more complex, data preprocessing often involves multiple steps
such as imputing missing values, scaling features, encoding categorical variables,
etc. These steps must be applied in the correct order and consistently across
training, validation, test, and future production data.
To streamline this process, Scikit-Learn provides the Pipeline class — a powerful
utility for chaining data transformations.
Building a Numerical Pipeline
A typical pipeline for numerical attributes might include:
1. Imputation of missing values (e.g., with median).
2. Feature scaling (e.g., with standardization).
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
num_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("standardize", StandardScaler()),
])
How It Works
• The pipeline takes a list of steps as (name, transformer) pairs.
• Names must be unique and should not contain double underscores __ .
• All intermediate steps must be transformers (i.e., must implement
fit_transform() ).
• The final step can be either a transformer or a predictor.
Using make_pipeline
If you don’t want to name the steps manually, you can use make_pipeline() :
from sklearn.pipeline import make_pipeline
num_pipeline = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
• This automatically names the steps using the class names in lowercase.
• If the same class appears multiple times, a number is appended (e.g.,
standardscaler-1 ).
Applying the Pipeline
Call fit_transform() to apply all transformations in sequence:
housing_num_prepared = num_pipeline.fit_transform(housing_num)
print(housing_num_prepared[:2].round(2))
Example output:
array([[-1.42, 1.01, 1.86, 0.31, 1.37, 0.14, 1.39, -0.94],
[ 0.60, -0.70, 0.91, -0.31, -0.44, -0.69, -0.37, 1.17]])
• Each row corresponds to a transformed sample.
• Each column corresponds to a scaled feature.
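Because each step was given a name, you can also pull out an individual fitted step for inspection (a small sketch using the num_pipeline defined above):
# The "impute" step is the SimpleImputer fitted inside the pipeline
imputer_step = num_pipeline.named_steps["impute"]
print(imputer_step.statistics_)  # medians learned during fit_transform()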
Retrieving Feature Names
To turn the result back into a DataFrame with feature names:
df_housing_num_prepared = pd.DataFrame(
    housing_num_prepared,
    columns=num_pipeline.get_feature_names_out(),
    index=housing_num.index
)
Pipeline as a Transformer or Predictor
• If the last step is a transformer, the pipeline behaves like a transformer
( fit_transform() , transform() ).
• If the last step is a predictor (e.g., a model), the pipeline behaves like an
estimator ( fit() , predict() ).
This flexibility makes Pipeline the standard way to handle data preprocessing
and modeling in Scikit-Learn projects.
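As a sketch of the second case (assuming housing_num and housing_labels from earlier), a pipeline whose last step is a model can be fitted and used for prediction directly:
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
model_pipeline = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LinearRegression()
)
model_pipeline.fit(housing_num, housing_labels)        # runs all transformers, then fits the model
predictions = model_pipeline.predict(housing_num[:5])  # the same preprocessing is applied automatically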
Data Preprocessing - Final Pipeline
In this section, we will consolidate everything we’ve done so far into one final
script using Scikit-Learn pipelines. This includes:
1. Creating a stratified test set
2. Handling missing values
3. Encoding categorical variables
4. Scaling numerical features
5. Combining everything using Pipeline and ColumnTransformer
This will ensure clean, modular, and reproducible code — perfect for production
and education.
Final Preprocessing Code using Scikit-Learn Pipelines
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
# from sklearn.preprocessing import OrdinalEncoder  # Uncomment if you prefer ordinal encoding
# 1. Load the data
housing = pd.read_csv("housing.csv")
# 2. Create a stratified test set based on income category
housing["income_cat"] = pd.cut(
housing["median_income"],
bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
labels=[1, 2, 3, 4, 5]
)
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index].drop("income_cat", axis=1)
    strat_test_set = housing.loc[test_index].drop("income_cat", axis=1)
# Work on a copy of training data
housing = strat_train_set.copy()
# 3. Separate predictors and labels
housing_labels = housing["median_house_value"].copy()
housing = housing.drop("median_house_value", axis=1)
# 4. Separate numerical and categorical columns
num_attribs = housing.drop("ocean_proximity", axis=1).columns.tolist()
cat_attribs = ["ocean_proximity"]
# 5. Pipelines
# Numerical pipeline
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
# Categorical pipeline
cat_pipeline = Pipeline([
    # ("ordinal", OrdinalEncoder()),  # Use this if you prefer ordinal encoding
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
# Full pipeline
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", cat_pipeline, cat_attribs),
])
# 6. Transform the data
housing_prepared = full_pipeline.fit_transform(housing)
# housing_prepared is now a NumPy array ready for training
print(housing_prepared.shape)
Training and Evaluating ML Models
Now that our data is preprocessed, let’s move on to training machine learning
models and evaluating their performance. We’ll start with:
• Linear Regression
• Decision Tree Regressor
• Random Forest Regressor
We’ll first test them on the training data and then use cross-validation to get a
better estimate of their true performance.
1. Train and Test Models on the Training Set
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
# Linear Regression
lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)
# Decision Tree
tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(housing_prepared, housing_labels)
# Random Forest
forest_reg = RandomForestRegressor(random_state=42)
forest_reg.fit(housing_prepared, housing_labels)
# Predict using training data
lin_preds = lin_reg.predict(housing_prepared)
tree_preds = tree_reg.predict(housing_prepared)
forest_preds = forest_reg.predict(housing_prepared)
# Calculate RMSE
# Note: on scikit-learn 1.4+ you can also use sklearn.metrics.root_mean_squared_error
lin_rmse = mean_squared_error(housing_labels, lin_preds, squared=False)
tree_rmse = mean_squared_error(housing_labels, tree_preds, squared=False)
forest_rmse = mean_squared_error(housing_labels, forest_preds, squared=False)
print("Linear Regression RMSE:", lin_rmse)
print("Decision Tree RMSE:", tree_rmse)
print("Random Forest RMSE:", forest_rmse)
A Warning About Training RMSE
Training RMSE only shows how well the model fits the training data. It does not
tell us how well it will perform on unseen data. In fact, the Decision Tree and
Random Forest may overfit, leading to very low training error but poor
generalization.
2. Cross-Validation: A Better Evaluation Strategy
Cross-validation helps us evaluate how a model generalizes to new data without
needing to touch the test set.
What is Cross-Validation?
Instead of training the model once and evaluating on a holdout set, k-fold cross-
validation splits the training data into k folds (typically 10), trains the model on k-1
folds, and validates it on the remaining fold. This process repeats k times.
We’ll use cross_val_score from sklearn.model_selection .
Cross-Validation on Decision Tree
from sklearn.model_selection import cross_val_score
import pandas as pd
# Evaluate Decision Tree with cross-validation
tree_rmses = -cross_val_score(
    tree_reg,
    housing_prepared,
    housing_labels,
    scoring="neg_root_mean_squared_error",
    cv=10
)
# WARNING: Scikit-Learn's scoring uses utility functions (higher is better),
# so RMSE is returned as negative. We use minus (-) to convert it back to
# positive RMSE.
print("Decision Tree CV RMSEs:", tree_rmses)
print("\nCross-Validation Performance (Decision Tree):")
print(pd.Series(tree_rmses).describe())
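The same check can be run for the other models. For instance, a sketch of cross-validating the Random Forest (slower, since it trains 10 forests):
forest_rmses = -cross_val_score(
    forest_reg,
    housing_prepared,
    housing_labels,
    scoring="neg_root_mean_squared_error",
    cv=10
)
print("Random Forest CV RMSEs:", forest_rmses)
print(pd.Series(forest_rmses).describe())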
Model Persistence and Inference with Joblib in a Random Forest Pipeline
Let's now summarize how to train a Random Forest model on California housing
data, save the model and preprocessing pipeline using joblib , and reuse the
model later for inference on new data ( input.csv ). This approach helps avoid
retraining the model every time, improving performance and enabling
reproducibility.
Why These Steps?
1. Why Train Once and Save?
• Training models repeatedly is time-consuming and computationally
expensive.
• Saving the model ( model.pkl ) and preprocessing pipeline ( pipeline.pkl )
ensures you can quickly load and run inference anytime in the future.
2. Why Use a Preprocessing Pipeline?
• Raw data needs to be cleaned, scaled, and encoded before model training.
• A Pipeline automates this transformation and ensures identical
preprocessing during inference.
3. Why Use Joblib?
• joblib efficiently serializes large NumPy arrays (like in sklearn models).
• Faster and more suitable than pickle for scikit-learn objects.
4. Why the If-Else Logic?
• The program checks if a saved model exists.
• If not, it trains and saves the model.
• If it does, it skips training and only runs inference, saving time.
Full Code
import os
import pandas as pd
import numpy as np
import joblib
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestRegressor
MODEL_FILE = "model.pkl"
PIPELINE_FILE = "pipeline.pkl"
def build_pipeline(num_attribs, cat_attribs):
    num_pipeline = Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler())
    ])
    cat_pipeline = Pipeline([
        ("onehot", OneHotEncoder(handle_unknown="ignore"))
    ])
    full_pipeline = ColumnTransformer([
        ("num", num_pipeline, num_attribs),
        ("cat", cat_pipeline, cat_attribs)
    ])
    return full_pipeline

if not os.path.exists(MODEL_FILE):
    # TRAINING PHASE
    housing = pd.read_csv("housing.csv")
    housing["income_cat"] = pd.cut(housing["median_income"],
                                   bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
                                   labels=[1, 2, 3, 4, 5])
    split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
    for train_index, _ in split.split(housing, housing["income_cat"]):
        housing = housing.loc[train_index].drop("income_cat", axis=1)

    housing_labels = housing["median_house_value"].copy()
    housing_features = housing.drop("median_house_value", axis=1)

    num_attribs = housing_features.drop("ocean_proximity", axis=1).columns.tolist()
    cat_attribs = ["ocean_proximity"]

    pipeline = build_pipeline(num_attribs, cat_attribs)
    housing_prepared = pipeline.fit_transform(housing_features)

    model = RandomForestRegressor(random_state=42)
    model.fit(housing_prepared, housing_labels)

    # Save model and pipeline
    joblib.dump(model, MODEL_FILE)
    joblib.dump(pipeline, PIPELINE_FILE)
    print("Model trained and saved.")
else:
    # INFERENCE PHASE
    model = joblib.load(MODEL_FILE)
    pipeline = joblib.load(PIPELINE_FILE)

    input_data = pd.read_csv("input.csv")
    transformed_input = pipeline.transform(input_data)
    predictions = model.predict(transformed_input)

    input_data["median_house_value"] = predictions
    input_data.to_csv("output.csv", index=False)
    print("Inference complete. Results saved to output.csv")
Summary
With this setup, our ML pipeline is:
• Efficient – No retraining needed if the model exists.
• Reproducible – Same preprocessing logic every time.
• Production-ready – Can be deployed or reused across multiple systems.
Conclusion: California Housing Price Prediction Project
In this project, we built a complete machine learning pipeline to predict California
housing prices using several regression algorithms. Along the way, we:
• Loaded and preprocessed the dataset ( housing.csv ), with careful treatment
of missing values, scaling, and encoding using a custom pipeline.
• Used stratified splitting to maintain the income category distribution
between train and test sets.
• Trained and evaluated multiple algorithms, including:
• Linear Regression
• Decision Tree Regressor
• Random Forest Regressor
• Found through cross-validation that Random Forest performed the best,
offering the lowest RMSE and most stable results.
Finally, we built a script that:
• Trains the Random Forest model and saves it using joblib .
• Uses an if-else logic to skip retraining if the model exists.
• Applies the trained model to new data ( input.csv ) to predict
median_house_value , storing results in output.csv .
This pipeline ensures that predictions are accurate, efficient, and ready for
production deployment.