
Report

The document outlines a project aimed at developing a sales prediction system using historical sales data and machine learning models, specifically linear regression and random forest regression. The project includes data preprocessing, model training, evaluation, and visualization of predictions, with the random forest model demonstrating superior performance. Contributions from team members and challenges faced during the project are also discussed.

Uploaded by

faizanpervaz74

1. Objectives and Introduction

• Objective:
The objective of this project is to develop a sales prediction system
that uses historical sales data to forecast future sales trends. By
applying machine learning models such as linear regression and random
forest regression, the goal is to generate predictions that can aid
business decision-making.

• Introduction:
This project analyzes daily sales data to predict sales for a
user-specified year. The workflow covers data preprocessing, feature
scaling, model training, and evaluation. The forecasting system is
implemented with Python libraries and exposes a simple command-line
prompt in which the end user enters the year to forecast.

2. Analytical Solution

• Step-by-Step Solution:

o Data Preprocessing: The first step involves loading the sales data,
handling missing values, and grouping the data by date to get the
total sales per day.

o Model Training: We use linear regression and random forest
regression models for sales prediction. The models are trained on a
time series dataset that includes the sales and date information.

o Feature Scaling: StandardScaler is used to standardize the features
(zero mean, unit variance) before they are passed to the models.

o Evaluation Metrics: Mean Squared Error (MSE), Mean Absolute Error
(MAE), and R-squared are used to evaluate the performance of the
models on the held-out test set.

o Prediction: To forecast a future year, the same date-derived
features are built for each day of that year, scaled with the fitted
scaler, and passed to the trained model.

• Assumptions and Values:

o The dataset consists of daily sales data over a specific period.

o The regression models assume that the relationship between the date
(or time) and sales follows a linear or non-linear trend that can be
learned from the historical data.

o The random forest model uses 100 trees (n_estimators=100) for better
generalization.
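The preprocessing and feature steps above can be sketched on a small in-memory table. This is only an illustration: the column names ORDERDATE and SALES follow the Kaggle dataset used in Section 8, while the rows and the helper name date_features are made up for the example.

```python
import pandas as pd

# Hypothetical rows; only the column names come from the dataset in Section 8
raw = pd.DataFrame({
    "ORDERDATE": ["2003-01-06", "2003-01-06", "2003-01-09"],
    "SALES": [100.0, 50.0, 75.0],
})
raw["ORDERDATE"] = pd.to_datetime(raw["ORDERDATE"])

# Total sales per day, indexed by date
daily_sales = raw.groupby("ORDERDATE")["SALES"].sum().to_frame()


def date_features(dates):
    """Return [year, month, day, ordinal] for each date (illustrative helper)."""
    return [[d.year, d.month, d.day, d.toordinal()] for d in dates]


X = date_features(daily_sales.index)   # one feature row per day
y = daily_sales["SALES"].tolist()      # total sales per day
print(X[0][:3], y)
```

The two orders placed on the same day collapse into a single daily total, which is the form the models are trained on.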

3. Explanation of Commands, Functions, and Toolboxes Used (15 Marks)

• Libraries and Functions:

o pandas: Used for data manipulation, loading CSV files, and grouping
sales data.

o numpy: Used for numerical operations and generating features.

o matplotlib: Used for data visualization, including time series
plotting and sales distribution.

o sklearn.model_selection.train_test_split: Used to split the data
into training and testing sets.

o sklearn.linear_model.LinearRegression: Used to train the linear
regression model.

o sklearn.preprocessing.StandardScaler: Used to standardize the
features before training the models.

o sklearn.metrics: Used to compute performance metrics such as MSE,
MAE, and R-squared.

o sklearn.ensemble.RandomForestRegressor: Used for training a random
forest regression model.

• Commands:

o The train_model() function prepares the data, splits it into
training and testing sets, scales it, trains the model, and prints
the evaluation metrics.

o The visualize_predictions() function plots the predicted sales for a
specified year.

o The load_data() function reads the sales data from a CSV file and
prepares it for analysis.
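As a small illustration of the sklearn.metrics functions listed above, the following sketch computes MSE, MAE, and R-squared on three made-up values. The numbers are invented for the example and are not results from this project.

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual and predicted sales, for illustration only
y_true = [100.0, 200.0, 300.0]
y_pred = [110.0, 190.0, 330.0]

mse = mean_squared_error(y_true, y_pred)   # mean of squared errors
mae = mean_absolute_error(y_true, y_pred)  # mean of absolute errors
r2 = r2_score(y_true, y_pred)              # 1 - SS_res / SS_tot

print(f"MSE: {mse:.2f}, MAE: {mae:.2f}, R-squared: {r2:.3f}")
```

MSE penalizes the single large error (30) more heavily than MAE does, which is why both are usually reported together.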

4. Results and Discussion (20 Marks)

• Results:
The models generated predictions with varying accuracy. The linear
regression model provided a basic understanding of the sales trend,
but the random forest regressor offered better performance in terms
of lower MSE and higher R-squared scores.

o Linear Regression Results:


o Random Forest Regression Results:

• Discussion:
The random forest model outperformed the linear regression model,
with lower MSE and higher R-squared. A random forest can capture
non-linear patterns in the date features, whereas linear regression
on a single time index can only fit a straight-line trend. Note that
in the code in Section 8 the two models are trained on different
datasets, so their metrics are not directly comparable. The linear
regression model remains useful for quicker predictions with less
computational overhead.
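One way to make such a comparison like-for-like is to fit both models on the same data with the same split. The sketch below does this on a synthetic series (a trend plus a yearly cycle plus noise) that is entirely an assumption made for illustration; it is not the project data and its scores say nothing about the reported results.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic daily "sales": linear trend + yearly cycle + noise (assumed data)
rng = np.random.default_rng(0)
t = np.arange(730)
sales = 100 + 0.05 * t + 20 * np.sin(2 * np.pi * t / 365) + rng.normal(0, 2, t.size)

X = t.reshape(-1, 1)  # single feature: time index
X_train, X_test, y_train, y_test = train_test_split(
    X, sales, test_size=0.2, random_state=42)

models = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

# Fit both models on the same split and score them on the same test set
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))

print(scores)
```

With a shuffled split, the forest can interpolate between neighbouring days, so it follows the cycle that a straight line cannot. Forecasting beyond the training range is a harder test that this sketch does not cover.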
5. Flowchart

6. Conclusions

• Conclusion:
This project successfully developed a sales prediction system using
machine learning models. The random forest regressor provided the
best results for predicting future sales based on historical data. The
system can be further enhanced by incorporating more features and
exploring different model architectures.

o Key Points:

▪ Linear regression and random forest were tested for sales
prediction.

▪ Random forest showed better performance.

▪ The system can predict sales for any future year based on past
trends.
7. Contribution (5 Marks)

• Team Contributions:

o M Abdullah (24I-3050): Responsible for data preprocessing, feature
engineering, and model evaluation.

o M Jibran (24I-3134): Worked on training the machine learning models
and performance analysis.

o M Umer (24I-3132): Implemented the prediction functionality and
visualized the results.

• Difficulties and Solutions:

o Difficulty: Handling missing values and ensuring that the data was
in a clean format.

o Solution: Used dropna() and ensured proper date conversion and
grouping of data.
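A minimal sketch of the cleaning approach described above, assuming the ORDERDATE and SALES column names from Section 8. The rows are invented, and the use of errors="coerce" is an assumption about how unparseable dates could be handled, not something the report states.

```python
import pandas as pd

# Invented rows with one bad date and one missing sales value
raw = pd.DataFrame({
    "ORDERDATE": ["2003-01-06", "not a date", "2003-01-09", "2003-01-09"],
    "SALES": [100.0, 50.0, None, 75.0],
})

# Unparseable dates become NaT instead of raising an error
raw["ORDERDATE"] = pd.to_datetime(raw["ORDERDATE"], errors="coerce")

# Drop any row that is missing a date or a sales value
clean = raw.dropna(subset=["ORDERDATE", "SALES"])

print(clean)
```

Only the first and last rows survive, so the later grouping by date works on complete records.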

8. Python Code (Ensure that the code is well-commented as requested in the report requirements)

Random Forest Regressor trained on the ORDERDATE and SALES columns of the Kaggle sample sales dataset:

https://www.kaggle.com/datasets/kyanyoga/sample-sales-data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline


def load_data(file_path):
    """Read the sales CSV and return total sales per day, indexed by date."""
    try:
        # Read with robust encoding
        data = pd.read_csv(file_path, encoding='latin-1')
        # Convert ORDERDATE to datetime
        data['ORDERDATE'] = pd.to_datetime(data['ORDERDATE'])
        # Group by date and sum SALES
        daily_sales = data.groupby('ORDERDATE')['SALES'].sum().reset_index()
        daily_sales.set_index('ORDERDATE', inplace=True)
        return daily_sales
    except Exception as e:
        print(f"Error loading data: {e}")
        return None

def train_model(data):
    """Train a scaled random forest on date features and print test metrics."""
    # Prepare features (year, month, day, ordinal date) and target
    X = [[d.year, d.month, d.day, d.toordinal()] for d in data.index]
    y = data['SALES'].values
    # Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)
    # Create pipeline with scaling and Random Forest
    pipeline = make_pipeline(
        StandardScaler(),
        RandomForestRegressor(n_estimators=100, random_state=42)
    )
    # Train model
    pipeline.fit(X_train, y_train)
    # Predictions and evaluation
    predictions = pipeline.predict(X_test)
    # Detailed model performance metrics
    mse = mean_squared_error(y_test, predictions)
    mae = mean_absolute_error(y_test, predictions)
    r2 = r2_score(y_test, predictions)
    print(f"Mean Squared Error: {mse}")
    print(f"Mean Absolute Error: {mae}")
    print(f"R-squared Score: {r2}")
    return pipeline

def visualize_predictions(model, data, year):
    """Plot the model's predicted daily sales for the given year."""
    # Create date range for prediction
    start_date = pd.to_datetime(f"{year}-01-01")
    end_date = pd.to_datetime(f"{year}-12-31")
    date_range = pd.date_range(start_date, end_date)
    # Prepare features for prediction (same layout as in train_model)
    X_future = [[d.year, d.month, d.day, d.toordinal()] for d in date_range]
    # Predict
    predictions = model.predict(X_future)
    # Plotting
    plt.figure(figsize=(12, 6))
    plt.plot(date_range, predictions, label=f"Predicted Sales ({year})",
             color='red')
    plt.title(f"Predicted Sales for the Year {year}")
    plt.xlabel('Date')
    plt.ylabel('Sales')
    plt.xticks(rotation=45)
    plt.legend()
    plt.tight_layout()
    plt.show()

def main():
    # Load dataset
    data = load_data('sales_data_sample.csv')
    if data is None:
        return
    # Train model
    model = train_model(data)
    # Predict future sales until the user chooses to stop
    while True:
        try:
            year = int(input("Enter a year for prediction (e.g., 2025): "))
            visualize_predictions(model, data, year)
            cont = input("Do you want to predict sales for another year? (yes/no): ")
            if cont.lower() != 'yes':
                break
        except ValueError:
            print("Invalid year. Please enter a valid year.")


if __name__ == "__main__":
    main()

Linear Regression on the Date and Temperature dataset sourced from GitHub:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.preprocessing import StandardScaler
import os

def display_team_info():
    """Print the welcome banner and team details."""
    print("Welcome to the Sales Prediction System\n")
    print("Software House: Predictify Solutions")
    print("Team Members: [ M Abdullah, 24I-3050]")
    print("Team Members: [ M Jibran, 24I-3134]")
    print("Team Members: [ M Umer, 24I-3132]")
    print("--------------------------------------------\n")

def load_data(file_url):
    """Load the two-column CSV and return it indexed by date."""
    try:
        data = pd.read_csv(file_url)
        data.columns = ['Date', 'Sales']  # Rename columns for context
        data['Date'] = pd.to_datetime(data['Date'])
        data.set_index('Date', inplace=True)
        return data
    except Exception as e:
        print(f"Unexpected error loading data: {e}")
        return None

def visualize_data(data):
    """Plot the historical series over time."""
    print("Visualizing sales data...\n")
    plt.figure(figsize=(12, 6))
    plt.plot(data.index, data['Sales'], label='Sales', color='blue')
    plt.title('Sales Over Time')
    plt.xlabel('Date')
    plt.ylabel('Sales')
    plt.xticks(rotation=45)
    plt.legend()
    plt.tight_layout()
    plt.show()

def train_model(data):
    """Fit a linear regression on a scaled time index and print test metrics."""
    print("Training the linear regression model...\n")
    data['Time'] = np.arange(len(data))  # Create a time index
    X = data[['Time']]
    y = data['Sales']
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)
    # Fit the scaler on the training set only, then apply it to the test set
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    model = LinearRegression()
    model.fit(X_train_scaled, y_train)
    predictions = model.predict(X_test_scaled)
    mse = mean_squared_error(y_test, predictions)
    mae = mean_absolute_error(y_test, predictions)
    r2 = r2_score(y_test, predictions)
    print(f"Mean Squared Error: {mse:.2f}")
    print(f"Mean Absolute Error: {mae:.2f}")
    print(f"R-squared Value: {r2:.2f}\n")
    return model, scaler

def derivative_analysis(model):
    """Report the fitted slope, i.e. the change in sales per scaled time unit."""
    print("Performing derivative analysis...\n")
    rate_of_change = model.coef_[0]
    print(f"Rate of Change (Derivative): {rate_of_change:.2f} sales per scaled day\n")

def predict_future_sales(model, scaler, data, year):
    """Predict and plot daily sales for the given year."""
    print(f"Predicting sales for the year {year}...\n")
    start_date = pd.to_datetime(f"{year}-01-01")
    end_date = pd.to_datetime(f"{year}-12-31")
    date_range = pd.date_range(start_date, end_date)
    # Time index = days since the first observation, so that the requested
    # year actually shifts the forecast (assumes about one row per day in data)
    days_since_start = (date_range - data.index[0]).days
    future_time = pd.DataFrame({'Time': days_since_start})
    future_time_scaled = scaler.transform(future_time)
    predictions = model.predict(future_time_scaled)
    plt.figure(figsize=(12, 6))
    plt.plot(date_range, predictions, label=f"Predicted Sales ({year})",
             color='red')
    plt.title(f"Predicted Sales for {year}")
    plt.xlabel('Date')
    plt.ylabel('Sales')
    plt.xticks(rotation=45)
    plt.legend()
    plt.tight_layout()
    plt.show()

def main():
    display_team_info()
    file_url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/daily-min-temperatures.csv"
    data = load_data(file_url)
    if data is None:
        return
    visualize_data(data)
    model, scaler = train_model(data)
    derivative_analysis(model)
    # Predict future sales until the user chooses to stop
    while True:
        try:
            year = int(input("Enter a year for prediction (e.g., 2025): "))
            predict_future_sales(model, scaler, data, year)
            cont = input("Do you want to predict sales for another year? (yes/no): ")
            if cont.lower() != 'yes':
                print("Thank you for using the Sales Prediction System!")
                break
        except ValueError:
            print("Invalid year. Please enter a valid year.")


if __name__ == "__main__":
    main()
