1. Objectives and Introduction
Objective:
The objective of this project is to develop a sales prediction system
that uses historical sales data to forecast future sales trends. By
applying machine learning models like linear regression and random
forest regression, the goal is to generate predictions that can aid in
business decision-making.
Introduction:
This project involves analyzing daily sales data to predict future
sales for a specific year. Various machine learning techniques are
utilized, including data preprocessing, feature scaling, model
training, and evaluation. The project aims to implement an
automated sales forecasting system using Python libraries, with an
easy-to-use interface for the end user.
2. Analytical Solution
Step-by-Step Solution:
o Data Preprocessing: The sales data is loaded, missing values
are handled, and the records are grouped by date to obtain the
total sales per day.
o Feature Scaling: StandardScaler standardizes the date-derived
features (zero mean, unit variance) before they are passed to
the models.
o Model Training: Linear regression and random forest
regression models are trained on a time series dataset that
contains the sales and date information.
o Evaluation Metrics: Mean Squared Error (MSE), Mean
Absolute Error (MAE), and R-squared are used to evaluate the
performance of the models on the held-out test set.
o Prediction: The trained models forecast sales for a future
year by building the same date-related features for each day of
that year, scaling them with the scaler fitted on the training
data, and generating predictions.
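The preprocessing and feature-construction steps above can be sketched on a few made-up order lines. This is a minimal illustration, not the project's data: the helper names daily_totals() and date_features() are hypothetical, and the column names ORDERDATE and SALES are taken from the code listing in Section 8.

```python
import pandas as pd


def daily_totals(orders: pd.DataFrame) -> pd.DataFrame:
    """Aggregate individual order lines into one sales total per day."""
    orders = orders.copy()
    orders["ORDERDATE"] = pd.to_datetime(orders["ORDERDATE"])
    return orders.groupby("ORDERDATE")["SALES"].sum().to_frame()


def date_features(index: pd.DatetimeIndex) -> list:
    """Turn each date into the feature row [year, month, day, ordinal]."""
    return [[d.year, d.month, d.day, d.toordinal()] for d in index]


# Made-up order lines, for illustration only
orders = pd.DataFrame({
    "ORDERDATE": ["2003-01-06", "2003-01-06", "2003-01-09"],
    "SALES": [100.0, 50.0, 75.5],
})

daily = daily_totals(orders)          # two rows: one per distinct date
features = date_features(daily.index)  # one feature row per day

print(daily)
print(features)
```

The two orders placed on the same date collapse into a single daily total, and each remaining date becomes one numeric feature row that a regression model can consume.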
Assumptions and Values:
o The dataset consists of daily sales data over a specific period.
o The regression models assume that the relationship between
the date (or time) and sales follows a linear or non-linear trend
that can be learned from the historical data.
o The random forest model utilizes 100 trees for better
generalization.
3. Explanation of Commands, Functions, and Toolboxes Used (15 Marks)
Libraries and Functions:
o pandas: Used for data manipulation, loading CSV files, and
grouping sales data.
o numpy: Used for numerical operations and generating
features.
o matplotlib: Used for data visualization, including time series
plotting and sales distribution.
o sklearn.model_selection.train_test_split: Used to split the data
into training and testing sets.
o sklearn.linear_model.LinearRegression: Used to train the linear
regression model.
o sklearn.preprocessing.StandardScaler: Used to standardize the
features before training the models.
o sklearn.metrics: Used to compute performance metrics such
as MSE, MAE, and R-squared.
o sklearn.ensemble.RandomForestRegressor: Used for training a
random forest regression model.
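To make the three evaluation metrics concrete, the sketch below computes them by hand with numpy and checks the results against the sklearn.metrics functions listed above. The helper name regression_metrics() and the sample values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_metrics(y_true, y_pred):
    """Return (MSE, MAE, R-squared) computed directly from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = np.mean(residuals ** 2)                     # average squared error
    mae = np.mean(np.abs(residuals))                  # average absolute error
    ss_res = np.sum(residuals ** 2)                   # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total variation
    r2 = 1.0 - ss_res / ss_tot
    return mse, mae, r2


# Made-up values, for illustration only
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 8.0, 9.5]

mse, mae, r2 = regression_metrics(y_true, y_pred)
print(mse, mae, r2)

# The hand-written formulas agree with scikit-learn's implementations
assert np.isclose(mse, mean_squared_error(y_true, y_pred))
assert np.isclose(mae, mean_absolute_error(y_true, y_pred))
assert np.isclose(r2, r2_score(y_true, y_pred))
```

Lower MSE and MAE indicate smaller errors, while an R-squared closer to 1 indicates that the model explains more of the variation in the target.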
Commands:
o The train_model() function prepares the data, splits it into
training and testing sets, scales it, and trains the model.
o The visualize_predictions() function plots the predicted sales
for a specified year.
o The load_data() function reads the sales data from a CSV file
and prepares it for analysis.
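The split, scale, and train sequence described for train_model() can be sketched as follows. The data here is synthetic (a noisy straight line), so the numbers are not the project's results. The sketch also shows one assumption-based way to convert the coefficient learned on the scaled feature back to a slope per original time step, by dividing by the scaler's scale_ value.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic series for illustration: slope 3 per time step plus noise
rng = np.random.default_rng(0)
X = np.arange(200, dtype=float).reshape(-1, 1)
y = 3.0 * X.ravel() + 10.0 + rng.normal(0.0, 1.0, size=200)

# 1. Split first, so the test rows never influence the scaler
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 2. Fit the scaler on the training rows only, then reuse it for the test rows
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 3. Train on the scaled features
model = LinearRegression().fit(X_train_scaled, y_train)

# The coefficient is "change per one standard deviation of the feature";
# dividing by the scaler's scale recovers the change per original time step
slope_per_step = model.coef_[0] / scaler.scale_[0]
print(f"Slope per time step: {slope_per_step:.3f}")
```

Fitting the scaler on the training portion only keeps information from the test set out of the preprocessing step.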
4. Results and Discussion (20 Marks)
Results:
The models generated predictions with varying accuracy. The linear
regression model provided a basic understanding of the sales trend,
but the random forest regressor offered better performance in terms
of lower MSE and higher R-squared scores.
o Linear Regression Results:
o Random Forest Regression Results:
Discussion:
The random forest model outperformed the linear regression model
in terms of accuracy, with a lower MSE and a higher R-squared
score. This suggests that a more flexible model such as a random
forest can capture non-linear patterns in time-series data that a
straight-line fit misses. However, the linear regression model
remains useful when quicker predictions with less computational
overhead are preferred.
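One way to compare the two model types is to fit both on the same split of the same data. The sketch below does this on a synthetic series with a trend and a yearly cycle. The helper name compare_models() and the data are assumptions made for illustration, and the scores it prints describe the synthetic series only, not the results reported above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def compare_models(X, y, seed=42):
    """Fit both models on one shared split and return their test scores."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    candidates = {
        "linear_regression": make_pipeline(StandardScaler(), LinearRegression()),
        "random_forest": make_pipeline(
            StandardScaler(),
            RandomForestRegressor(n_estimators=100, random_state=seed),
        ),
    }
    scores = {}
    for name, model in candidates.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        scores[name] = {
            "mse": mean_squared_error(y_test, pred),
            "r2": r2_score(y_test, pred),
        }
    return scores


# Synthetic daily series (illustration only): trend + yearly cycle + noise
rng = np.random.default_rng(0)
t = np.arange(730)
X = t.reshape(-1, 1).astype(float)
y = 1000 + 0.5 * t + 200 * np.sin(2 * np.pi * t / 365) + rng.normal(0, 20, size=730)

scores = compare_models(X, y)
for name, s in scores.items():
    print(f"{name}: MSE={s['mse']:.2f}, R2={s['r2']:.3f}")
```

Using a single shared split means any difference in the scores comes from the models rather than from differences in the data they were evaluated on.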
5. Flowchart
6. Conclusions
Conclusion:
This project successfully developed a sales prediction system using
machine learning models. The random forest regressor provided the
best results for predicting future sales based on historical data. The
system can be further enhanced by incorporating more features and
exploring different model architectures.
o Key Points:
   o Linear regression and random forest were tested for
   sales prediction.
   o Random forest showed better performance.
   o The system can predict sales for any future year based
   on past trends.
7. Contribution (5 Marks)
Team Contributions:
o M Abdullah (24I-3050): Responsible for data preprocessing,
feature engineering, and model evaluation.
o M Jibran (24I-3134): Worked on training the machine
learning models and performance analysis.
o M Umer (24I-3132): Implemented the prediction
functionality and visualized the results.
Difficulties and Solutions:
o Difficulty: Handling missing values and ensuring that the
data was in a clean format.
o Solution: Used dropna() and ensured proper date conversion
and grouping of data.
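The cleaning approach described above can be sketched as follows. This is an illustration under assumptions: the helper name clean_sales() and the sample rows are made up, and the use of errors="coerce" to turn unparseable dates into missing values is one possible choice rather than something the report specifies.

```python
import pandas as pd


def clean_sales(raw: pd.DataFrame) -> pd.DataFrame:
    """Drop unusable rows, then return total sales per day."""
    cleaned = raw.copy()
    # Unparseable dates become NaT so they can be dropped with the other gaps
    cleaned["ORDERDATE"] = pd.to_datetime(cleaned["ORDERDATE"], errors="coerce")
    # Remove rows missing either the date or the sales amount
    cleaned = cleaned.dropna(subset=["ORDERDATE", "SALES"])
    return cleaned.groupby("ORDERDATE")["SALES"].sum().reset_index()


# Made-up rows: one bad date and one missing sales value
raw = pd.DataFrame({
    "ORDERDATE": ["2003-01-06", "not a date", "2003-01-06", "2003-01-09"],
    "SALES": [100.0, 20.0, None, 75.5],
})

clean = clean_sales(raw)
print(clean)
```

Only the rows with both a valid date and a sales value survive, so the grouped totals are not distorted by incomplete records.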
8. Python Code (Ensure that the code is well-commented as requested in the report requirements)
Random forest regressor trained on the ORDERDATE and SALES columns
of the sample sales dataset:
https://www.kaggle.com/datasets/kyanyoga/sample-sales-data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
def load_data(file_path):
    try:
        # Read with robust encoding
        data = pd.read_csv(file_path, encoding='latin-1')
        # Convert ORDERDATE to datetime
        data['ORDERDATE'] = pd.to_datetime(data['ORDERDATE'])
        # Group by date and sum SALES
        daily_sales = data.groupby('ORDERDATE')['SALES'].sum().reset_index()
        daily_sales.set_index('ORDERDATE', inplace=True)
        return daily_sales
    except Exception as e:
        print(f"Error loading data: {e}")
        return None
def train_model(data):
    # Prepare features (one [year, month, day, ordinal] row per date) and target
    X = [[date.year, date.month, date.day, date.toordinal()]
         for date in data.index]
    y = data['SALES'].values
    # Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)
    # Create pipeline with scaling and Random Forest
    pipeline = make_pipeline(
        StandardScaler(),
        RandomForestRegressor(n_estimators=100, random_state=42)
    )
    # Train model
    pipeline.fit(X_train, y_train)
    # Predictions and evaluation
    predictions = pipeline.predict(X_test)
    # Detailed model performance metrics
    mse = mean_squared_error(y_test, predictions)
    mae = mean_absolute_error(y_test, predictions)
    r2 = r2_score(y_test, predictions)
    print(f"Mean Squared Error: {mse}")
    print(f"Mean Absolute Error: {mae}")
    print(f"R-squared Score: {r2}")
    return pipeline
def visualize_predictions(model, data, year):
    # Create date range for prediction
    start_date = pd.to_datetime(f"{year}-01-01")
    end_date = pd.to_datetime(f"{year}-12-31")
    date_range = pd.date_range(start_date, end_date)
    # Prepare features for prediction (same layout as in training)
    X_future = [[date.year, date.month, date.day, date.toordinal()]
                for date in date_range]
    # Predict
    predictions = model.predict(X_future)
    # Plotting
    plt.figure(figsize=(12, 6))
    plt.plot(date_range, predictions, label=f"Predicted Sales ({year})",
             color='red')
    plt.title(f"Predicted Sales for the Year {year}")
    plt.xlabel('Date')
    plt.ylabel('Sales')
    plt.xticks(rotation=45)
    plt.legend()
    plt.tight_layout()
    plt.show()
def main():
    # Load dataset
    data = load_data('sales_data_sample.csv')
    if data is None:
        return
    # Train model
    model = train_model(data)
    # Predict future sales until the user chooses to stop
    while True:
        try:
            year = int(input("Enter a year for prediction (e.g., 2025): "))
            visualize_predictions(model, data, year)
            cont = input("Do you want to predict sales for another year? (yes/no): ")
            if cont.lower() != 'yes':
                break
        except ValueError:
            print("Invalid year. Please enter a valid year.")

if __name__ == "__main__":
    main()
Linear regression trained on a date and temperature dataset sourced
from GitHub (the code renames the temperature column to Sales):
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.preprocessing import StandardScaler
import os
def display_team_info():
    print("Welcome to the Sales Prediction System\n")
    print("Software House: Predictify Solutions")
    print("Team Members: [ M Abdullah, 24I-3050]")
    print("Team Members: [ M Jibran, 24I-3134]")
    print("Team Members: [ M Umer, 24I-3132]")
    print("--------------------------------------------\n")
def load_data(file_url):
    try:
        # Read the two-column CSV (date, value) from the given URL or path
        data = pd.read_csv(file_url)
        data.columns = ['Date', 'Sales']  # Rename columns for context
        # Parse the dates and use them as the index
        data['Date'] = pd.to_datetime(data['Date'])
        data.set_index('Date', inplace=True)
        return data
    except Exception as e:
        print(f"Unexpected error loading data: {e}")
        return None
def visualize_data(data):
    print("Visualizing sales data...\n")
    plt.figure(figsize=(12, 6))
    plt.plot(data.index, data['Sales'], label='Sales', color='blue')
    plt.title('Sales Over Time')
    plt.xlabel('Date')
    plt.ylabel('Sales')
    plt.xticks(rotation=45)
    plt.legend()
    plt.tight_layout()
    plt.show()
def train_model(data):
    print("Training the linear regression model...\n")
    data['Time'] = np.arange(len(data))  # Create a time index
    X = data[['Time']]
    y = data['Sales']
    # Hold out 20% of the rows for testing
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)
    # Fit the scaler on the training rows only, then apply it to the test rows
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    # Train the model and evaluate it on the held-out rows
    model = LinearRegression()
    model.fit(X_train_scaled, y_train)
    predictions = model.predict(X_test_scaled)
    mse = mean_squared_error(y_test, predictions)
    mae = mean_absolute_error(y_test, predictions)
    r2 = r2_score(y_test, predictions)
    print(f"Mean Squared Error: {mse:.2f}")
    print(f"Mean Absolute Error: {mae:.2f}")
    print(f"R-squared Value: {r2:.2f}\n")
    return model, scaler
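def derivative_analysis(model):
    print("Performing derivative analysis...\n")
    # The coefficient is the slope with respect to the standardized time
    # index, i.e. the change in sales per one standard deviation of time
    rate_of_change = model.coef_[0]
    print(f"Rate of Change (Derivative): {rate_of_change:.2f} sales per scaled day\n")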
def predict_future_sales(model, scaler, data, year):
    print(f"Predicting sales for the year {year}...\n")
    start_date = pd.to_datetime(f"{year}-01-01")
    end_date = pd.to_datetime(f"{year}-12-31")
    date_range = pd.date_range(start_date, end_date)
    # Continue the training time index (one step per day) from the last
    # observed date, so that the requested year affects the forecast.
    # Assumes the data is sorted by date.
    days_ahead = (date_range - data.index[-1]).days
    time_index = pd.DataFrame({'Time': len(data) - 1 + days_ahead})
    time_index_scaled = scaler.transform(time_index)
    predictions = model.predict(time_index_scaled)
    plt.figure(figsize=(12, 6))
    plt.plot(date_range, predictions, label=f"Predicted Sales ({year})",
             color='red')
    plt.title(f"Predicted Sales for {year}")
    plt.xlabel('Date')
    plt.ylabel('Sales')
    plt.xticks(rotation=45)
    plt.legend()
    plt.tight_layout()
    plt.show()
def main():
    display_team_info()
    file_url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/daily-min-temperatures.csv"
    data = load_data(file_url)
    if data is None:
        return
    visualize_data(data)
    model, scaler = train_model(data)
    derivative_analysis(model)
    # Predict future sales until the user chooses to stop
    while True:
        try:
            year = int(input("Enter a year for prediction (e.g., 2025): "))
            predict_future_sales(model, scaler, data, year)
            cont = input("Do you want to predict sales for another year? (yes/no): ")
            if cont.lower() != 'yes':
                print("Thank you for using the Sales Prediction System!")
                break
        except ValueError:
            print("Invalid year. Please enter a valid year.")

if __name__ == "__main__":
    main()