CONTENTS
CHAPTER NO.  TITLE
             BONAFIDE CERTIFICATE
             ACKNOWLEDGEMENT
             ABSTRACT
             CONTENTS
1            INTRODUCTION
2            LITERATURE SURVEY
3            SYSTEM DESIGN
4            SOFTWARE AND HARDWARE REQUIREMENTS
5            SYSTEM ANALYSIS
6            MODULES
7            SYSTEM TESTING
8            ALGORITHM
9            SAMPLE CODE
10           SAMPLE OUTPUT
11           CONCLUSION & FUTURE WORK
             REFERENCES
ABSTRACT
Product sales forecasting is a major aspect of purchasing management. Forecasts are crucial in determining inventory stock levels, and accurately estimating future demand for goods has been an ongoing challenge, especially in the supermarket and grocery store industry. If goods are not readily available, or if availability exceeds demand, overall profit can be compromised. As a result, sales forecasting for goods is significant in ensuring that losses are minimized. The problem becomes more complex as retailers add new locations with unique needs, new products, ever-shifting seasonal tastes, and unpredictable product marketing. In this analysis, a forecasting model is developed using machine learning algorithms to improve the accuracy of product sales forecasts. The proposed model is targeted at supporting future purchasing decisions with more accurate sales forecasts; it is not intended to replace current subjective forecasting methods. A model based on a real grocery store's data is developed in order to validate the use of the various machine learning algorithms. In the case study, multiple regression methods are compared, and their impact on forecasting in-store product availability is assessed, to ensure stores have just enough products at the right time.
CHAPTER 1
INTRODUCTION
In this project, we are trying to forecast product sales based on the items, stores, transactions, and other dependent variables like holidays and oil prices.
This is a Kaggle competition called "Corporación Favorita Grocery Sales Forecasting", where the task is to predict the stocking of products to better ensure grocery stores please customers by having just enough of the right products at the right time.
For this particular problem, we have analyzed the data as a supervised learning problem.
In order to forecast the sales, we have compared different regression models such as Linear Regression, Decision Tree, ExtraTreeRegressor, Gradient Boosting, Random Forest, and XGBoost. To further optimize the results, we have used a multilayer perceptron (MLP: a class of feed-forward artificial neural network) and LightGBM (a gradient boosting framework that uses tree-based learning algorithms).
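As a minimal sketch of this comparison (assuming a preprocessed feature matrix X and target vector y, which are hypothetical names here), the candidate regressors can be scored side by side on a held-out split:

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor, ExtraTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

# X, y are assumed to be an already-preprocessed feature matrix and the
# unit_sales target from the training data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    'Linear Regression': LinearRegression(),
    'Decision Tree': DecisionTreeRegressor(random_state=42),
    'Extra Tree': ExtraTreeRegressor(random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(random_state=42),
    'Random Forest': RandomForestRegressor(random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
    print(f'{name}: RMSE = {rmse:.4f}')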
Sales Forecasting is the process of using the company’s sales records of the past years to
predict the short-term or long-term performance in the future. This is one of the pillars of
proper financial planning. As with any prediction-related process, risk and uncertainty are
unavoidable in Sales Forecasting too. Hence, it’s considered good practice for forecasting
teams to mention the degree of uncertainty in their forecasts.
Accurately forecasting sales and building a sales plan can help to avoid unforeseen cash
flow problems and manage production, staff and financing needs more effectively.
Brick-and-mortar grocery stores are always closely tied to purchasing and sales forecasts.
An incorrect prediction will cause over-purchasing, which will lead to overstock and
spoilage. On the other hand, an insufficient purchase will cause a shortage of
merchandise available to customers. Therefore, it is very important for grocery stores to
accurately predict the purchase volume of goods. Sales prediction is an important part of
modern business intelligence [1]. Accurate forecasts can bring huge benefits to a
businessman or a business. In the last decade, machine learning has been used for various
business predictions, such as in the financial industry, stock forecasting [2], etc.
To accurately predict sales, it is critical to take into account a wide range of factors.
Corporación Favorita gives important data to make relevant predictions for this model
training. The purpose of this study was to address the inventory problems that exist in
most grocery stores, such as overstocking and not having enough items for customers to
purchase. In this paper, based on the above problem, a related prediction model is
studied. The model was able to predict the sales of different items in each store. Stores
can purchase goods at different times according to the forecast, which can reduce
unreasonable purchases in the store and increase the turnover of the grocery store.
CHAPTER 2
LITERATURE SURVEY
Akshay Krishna et al. proposed the normal regression technique, the boosting technique, and the Root Mean Square Error (RMSE) value for evaluating accuracy. The boosting algorithm gave better results than the regular regression algorithm. Learning objective: RMSE is calculated from variance, a fundamental concept, so it cannot be scaled up to larger levels; this is one of the significant factors affecting how accuracy is computed at a high rate. Without proper hyperparameter tuning, the AdaBoost algorithm will not perform as expected, and its performance deteriorates.
Gopalakrishnan T et al. proposed a linear regression algorithm. Accuracy is evaluated based on the precision value. Here the precision value specifies the number of correct recommendations, i.e. the proportion of relevant retrievals to the total population. Besides, there are plenty of machine learning algorithms, yet this system only uses a linear regression algorithm. Learning objective: a linear regression algorithm is used when someone wants to predict a variable's value based on the value of another variable. Since this algorithm depends on other variables for prediction, it will not be very efficient, so without a comparison against another algorithm one cannot be sure that it is the best algorithm.
Sunitha Cheriyan et al. proposed a generalized linear model, a decision tree, and a gradient boosted tree. Accuracy is calculated using empirical evaluation, in which results are derived by observation or experiment instead of theory. The results are summarized in terms of the most effective technique's reliability and accuracy. During the analysis phase, some of the documents were discarded, and the data used in this analysis were insufficient for further analysis. Learning objective: using a generalized linear model, decision tree, and gradient boosted tree, the execution time will be huge, and managing a large set of records will be complicated. So it is not easy to make predictions for massive datasets.
Mohit Gurnani et al. proposed various machine learning models, namely the Autoregressive Integrated Moving Average (ARIMA), the Auto-Regressive Neural Network (ARNN), XGBoost, SVM, hybrid models such as hybrid ARIMA-ARNN, hybrid ARIMA-XGBoost (extreme gradient boosting), and hybrid ARIMA-SVM, and STL decomposition (using ARIMA, Snaive, and XGBoost). These models' accuracy is measured by metrics such as Mean Absolute Error (MAE) and Root Mean Square Error (RMSE). Seasonal and Trend decomposition using Loess (STL) gave better results than the individual and hybrid models. Learning objective: STL is a decomposition technique in which each component is analyzed separately and forecast using various machine learning algorithms. Linear models such as ARIMA cannot capture nonlinear patterns precisely, and the approach can fail when nonlinear models outperform the hybrid model.
F. M. Thiesing et al. proposed feed-forward multilayer perceptron networks. One batch and two online training algorithms were implemented on parallel systems (PARIX, Parallel Virtual Machine (PVM)). Increasing the number of input neurons increases training time, and the prediction error rate is high. Learning objective: with these feed-forward multilayer perceptron networks, as the count of input neurons increases, so does the training time, so predictions are not bound by a fixed time budget; the time varies according to the selected inputs, and it is hard to predict within the expected period. F. M. Thiesing et al. also proposed prediction techniques such as neural and conventional (naive, statistical) ones. The error is measured by the Root Mean Squared Error (RMSE), and accuracy is measured by RMSE and Theil's U. The neural network outperforms the naive and statistical approaches. The program runs as a prototype and handles only a small subset of the supermarket's inventory. Learning objective: the neural and conventional techniques are suitable only for a small subset of the supermarket's stock, hence they cannot be scaled up to larger levels.
CHAPTER 3
SYSTEM DESIGN
Sales forecasting remains one of the important requirements of any grocery store. There are many statistical models used for the task of predicting sales, such as ARIMA. Those models are mainly time-based and therefore produce univariate time-series predictions. Corporación Favorita (CF) sales are affected by many other factors, which motivates the application of a more complex model.
Here, predicting the sales of each product is considered the main problem to be solved. Generally, grocery stores use traditional models for prediction with little data taken into consideration. It is important to evaluate which of two different kinds of models would be better for forecasting CF sales: one is a time-series model, the other is a regression model based on causality. Here, the comparison uses LSTM for time-series forecasting and Random Forest for causal forecasting. This will allow us to understand which is better for sales forecasting.
Dataset size reduction: The training data is quite large, consisting of 125,497,040 observations, which take more than 50 GB when loaded in memory. So the first step is to reduce the dataset using data transformation or sampling.
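One common way to shrink the in-memory footprint, sketched below under the assumption that the Kaggle CSV files sit in ../input/ as in the sample code, is to downcast column dtypes while reading and, optionally, to keep only a recent slice of dates:

import pandas as pd

dtypes = {'id': 'int64', 'item_nbr': 'int32', 'store_nbr': 'int8',
          'unit_sales': 'float32', 'onpromotion': str}
train = pd.read_csv('../input/train.csv', dtype=dtypes, parse_dates=['date'])

# Optionally sample a recent date range to cut memory further.
train = train[train['date'] >= '2017-01-01']
print(train.memory_usage(deep=True).sum() / 1e9, 'GB')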
Data analysis: This includes studying the effect of each factor on sales, as explored in the sketch after this list:
• The kind of change in sales due to promotions
• The effect of oil price changes
• Sales volume at each store
• The city with the highest consumption
• Treatment of null, negative, or NaN values in each column
• unit_sales vs. time
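A minimal exploratory sketch of these checks, assuming the train, stores, and oil DataFrames are loaded as in the sample code of Chapter 9 (hypothetical variable names):

import matplotlib.pyplot as plt

# Effect of promotion on sales.
print(train.groupby('onpromotion')['unit_sales'].mean())

# Sales volume per store, and the city with the highest consumption.
per_store = train.groupby('store_nbr')['unit_sales'].sum().rename('total')
by_city = per_store.reset_index().merge(stores, on='store_nbr')
print(by_city.groupby('city')['total'].sum().sort_values(ascending=False).head(1))

# Null / negative handling: negative unit_sales are returns, so clip them to
# zero, and forward-fill missing oil prices.
train['unit_sales'] = train['unit_sales'].clip(lower=0)
oil['dcoilwtico'] = oil['dcoilwtico'].ffill()

# unit_sales vs. time.
train.groupby('date')['unit_sales'].sum().plot(title='unit_sales vs time')
plt.show()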
MLPNN is one of the most significant models in artificial neural networks [3]. The MLPNN consists of one input layer, one or more hidden layers, and one output layer. In an MLPNN, the input nodes pass values to the first hidden layer, the nodes of the first hidden layer pass values to the second, and so on, until the output layer produces the outputs.
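As a hedged illustration of this layered structure, a small Keras MLP in the same style as the LSTM code later in this report might look as follows; n_features is an assumed input dimensionality:

from keras.models import Sequential
from keras.layers import Dense

n_features = 10  # assumed number of input variables, for illustration only

model = Sequential()
model.add(Dense(64, activation='relu', input_shape=(n_features,)))  # first hidden layer
model.add(Dense(32, activation='relu'))                             # second hidden layer
model.add(Dense(1, activation='linear'))                            # output layer
model.compile(loss='mse', optimizer='rmsprop')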
CHAPTER 4
SOFTWARE AND HARDWARE REQUIREMENTS
Hardware Requirements:
• System : Dual Core
• Hard Disk : 500 GB
• Monitor : LED Monitor
• Mouse : Optical Mouse
• RAM : 4 GB
Software Requirements:
• Operating system : Windows 10
• Coding language : Python 3.7
• IDE : PyCharm
• Database : MS Access
CHAPTER 5
SYSTEM ANALYSIS
LGBM (LightGBM) aims to make gradient boosting on decision trees faster. The idea is that instead of checking all of the possible splits when creating new leaves, only some of them are checked: the model first sorts all of the attribute values and buckets the observations into discrete bins. When a leaf in the tree needs to be split, instead of iterating over all of the attribute values, it simply iterates over all of the buckets. The authors call this the histogram implementation.
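A hedged sketch of fitting such a histogram-based booster with the LightGBM package, assuming the X_train/y_train/X_test arrays used elsewhere in this report:

import lightgbm as lgb

params = {
    'objective': 'regression',
    'metric': 'rmse',
    'max_bin': 255,         # number of histogram buckets per feature
    'num_leaves': 31,
    'learning_rate': 0.1,
}
dtrain = lgb.Dataset(X_train, label=y_train)
booster = lgb.train(params, dtrain, num_boost_round=100)
preds = booster.predict(X_test)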
Data Flow Diagram
The BP (backpropagation) algorithm has served as a useful methodology to train multilayer perceptrons for a wide range of applications [4]. The BP network calculates the difference between real and predicted values, which is propagated from the output nodes backwards to the nodes in the previous layers. The BP learning algorithm can be divided into two phases: propagation and weight update [4].
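As a toy illustration of these two phases (purely illustrative, not the project's training code), gradient descent on a single linear neuron can be written in a few lines of numpy:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))           # 100 samples, 3 inputs
y = X @ np.array([1.5, -2.0, 0.5])      # ground-truth weights
w = np.zeros(3)
lr = 0.1

for _ in range(200):
    # Phase 1: propagation -- compute predictions and the error signal.
    error = X @ w - y
    # Phase 2: weight update -- step against the mean-squared-error gradient.
    w -= lr * (X.T @ error) / len(y)

print(w)    # approaches [1.5, -2.0, 0.5]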
CHAPTER 6
MODULES
Random forest
Random Forest is one of the more complex models employed here, used with its default parameter configuration. Since it is being compared with LSTM, parameter tuning is not applied. The fitted model is detailed below.
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
                      oob_score=False, random_state=None, verbose=0, warm_start=False)
LSTM
LSTM is a variant of RNN. Here we have used a single LSTM layer with one dropout layer and one fully connected dense layer. Experiments were conducted with stacking multiple LSTM layers, but this did not improve results, so to keep the model simple and efficient only one layer is kept. The added activation layer is linear, while the LSTM's internal gates are based on tanh and sigmoid. The optimizer is RMSprop with a learning rate of 0.001, and MSE is used as the loss function. The LSTM was tried with different batch sizes, among which 10 was selected for memory efficiency. The total number of epochs is 15, as no improvement was found after that. The network architecture is as follows.
# Assumed imports for the network below.
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout, Activation

model = Sequential()
model.add(LSTM(50, input_shape=(X_train_values.shape[1], X_train_values.shape[2])))
model.add(Dropout(0.3))           # regularization between the LSTM and output layers
model.add(Dense(1))               # single-value sales prediction
model.add(Activation('linear'))
model.compile(loss='mse', optimizer='rmsprop')
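The text above fixes the batch size at 10 and the epochs at 15; a hedged sketch of how the training series might be shaped into the (samples, timesteps, features) tensor the LSTM expects, assuming a 1-D numpy array series of daily unit_sales (a hypothetical name) and the lag of 7 observations mentioned in the Future Work chapter:

import numpy as np

lag = 7
X_list, y_list = [], []
for i in range(lag, len(series)):
    X_list.append(series[i - lag:i])   # previous 7 observations as input
    y_list.append(series[i])           # next observation as target
X_train_values = np.array(X_list).reshape(-1, lag, 1)   # (samples, 7, 1)
y_train_values = np.array(y_list)

model.fit(X_train_values, y_train_values, epochs=15, batch_size=10)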
CHAPTER 7
SYSTEM TESTING
The decision tree is a type of supervised learning algorithm that is mostly used for classification problems. Surprisingly, it works for both categorical and continuous dependent variables. In this algorithm, we split the population into two or more homogeneous sets. This is done based on the most significant attributes/independent variables, to form groups that are as distinct as possible. A tree has many analogies in the real world, and it turns out that it has influenced a wide area of machine learning, covering both classification and regression. In decision analysis, a decision tree is used to visually and explicitly represent decisions and decision making. As the name goes, it uses a tree-like model of decisions. Though a commonly used tool in data mining for deriving a strategy to reach a particular goal, it is also widely used in machine learning. Once we have completed modelling the decision tree classifier, we can use the trained model to predict, for example, whether a balance scale tips to the right, tips to the left, or stays balanced.
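As a small hedged example of the attribute-based splitting described above, a shallow scikit-learn tree can be fitted on toy data and its learned rules printed:

import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = np.where(X[:, 0] > 5, 100.0, 20.0) + rng.normal(0, 1, 200)

tree = DecisionTreeRegressor(max_depth=2, random_state=0)
tree.fit(X, y)
# The printed rules show the splits on the most significant attribute.
print(export_text(tree, feature_names=['feature_0', 'feature_1']))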
Random Forest is a great algorithm to train early in the model development process, to see how it performs, and it is hard to build a "bad" Random Forest because of its simplicity. This algorithm is also an excellent choice if you need to develop a model in a short amount of time. On top of that, it provides a fairly good indicator of the importance it assigns to your features. Random Forests are very hard to beat in terms of performance, and they can handle many different feature types, such as binary, categorical, and numerical. Overall, Random Forest is a (mostly) fast, simple, and flexible tool, though it has its limitations. Random forests are an ensemble learning method for classification, regression, and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Random decision forests correct for decision trees' habit of overfitting to their training set.
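Since the feature-importance indicator is highlighted above, a brief hedged sketch of reading it off a fitted forest (toy data, illustrative only):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = np.where(X[:, 0] > 5, 100.0, 20.0) + rng.normal(0, 1, 200)

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X, y)
# feature_0 defines the split in the toy data, so it should dominate.
for name, imp in zip(['feature_0', 'feature_1'], forest.feature_importances_):
    print(f'{name}: {imp:.3f}')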
CHAPTER 8
ALGORITHM
Here, the two algorithms used are Random Forest and LSTM (Long Short-Term Memory). We use the most basic versions of both so that they can be compared.
The Random Forest and LSTM configurations are the same as those detailed in Chapter 6 (Modules): the RandomForestRegressor with its default parameters, and the single-layer LSTM network with one dropout layer and one dense output layer.
CHAPTER 9
SAMPLE CODE
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load.
# Any results you write to the current directory are saved as output.
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import gc; gc.enable()
from sklearn import preprocessing, linear_model, metrics
import matplotlib.pyplot as plt
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list
# the files in the input directory.
from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

dtypes = {'id':'int64', 'item_nbr':'int32', 'store_nbr':'int8', 'onpromotion':str}
data = {
    'tra': pd.read_csv('../input/train.csv', dtype=dtypes, parse_dates=['date']),
    'tes': pd.read_csv('../input/test.csv', dtype=dtypes, parse_dates=['date']),
    'ite': pd.read_csv('../input/items.csv'),
    'sto': pd.read_csv('../input/stores.csv'),
    'trn': pd.read_csv('../input/transactions.csv', parse_dates=['date']),
    'hol': pd.read_csv('../input/holidays_events.csv', dtype={'transferred':str}, parse_dates=['date']),
    'oil': pd.read_csv('../input/oil.csv', parse_dates=['date']),
}
train = data['tra']  # optionally filter, e.g. (data['tra']['date'].dt.month == 8) & (data['tra']['date'].dt.day > 15)
test = data['tes']   # optionally filter, e.g. (data['tes']['date'].dt.month == 8) & (data['tes']['date'].dt.day > 15)
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
# sklearn.cross_validation was removed in newer scikit-learn versions;
# its helpers now live in sklearn.model_selection.
from sklearn.model_selection import train_test_split

rf = RandomForestRegressor(max_features="auto", min_samples_leaf=50,
                           n_estimators=100, random_state=50, oob_score=True)
# X_train/X_test, y_train/y_test, and the per-item weights W_train/W_test are
# assumed to be prepared during preprocessing (perishable items weighted higher).
rf.fit(X_train, y_train)
print('RF accuracy: TRAINING', rf.score(X_train, y_train, W_train))
print('RF accuracy: TESTING', rf.score(X_test, y_test, W_test))
print("feature Importance", rf.feature_importances_)
yhat1 = rf.predict(X_test)
print('NWRMSLE RF', NWRMSLE(y_test, yhat1, W_test.values))
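The NWRMSLE helper used above is the competition's Normalized Weighted Root Mean Squared Logarithmic Error; it is assumed to have been defined earlier in the original notebook. A sketch following the published formula:

import numpy as np

def NWRMSLE(y_true, y_pred, weights):
    # Normalized Weighted RMSLE: weights emphasise perishable items.
    y_true = np.clip(np.asarray(y_true, dtype=float), 0, None)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), 0, None)
    sq_log_err = (np.log1p(y_pred) - np.log1p(y_true)) ** 2
    return np.sqrt(np.sum(weights * sq_log_err) / np.sum(weights))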
"kernelspec": {
"name": "python3",
"display_name": "Python 3 (ipykernel)",
"language": "python"
},
"language_info": {
"name": "python",
"version": "3.9.12",
"mimetype": "text/x-python",
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"pygments_lexer": "ipython3",
"nbconvert_exporter": "python",
"file_extension": ".py"
LSTM
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import gc; gc.enable()
from sklearn import preprocessing, linear_model, metrics
import matplotlib.pyplot as plt

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list
# the files in the input directory.
from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

dtypes = {'id':'int64', 'item_nbr':'int32', 'store_nbr':'int8', 'onpromotion':str}
data = {
    'tra': pd.read_csv('../input/train.csv', dtype=dtypes, parse_dates=['date']),
    #'tes': pd.read_csv('../input/test.csv', dtype=dtypes, parse_dates=['date']),
    'ite': pd.read_csv('../input/items.csv'),
    #'sto': pd.read_csv('../input/stores.csv'),
    #'trn': pd.read_csv('../input/transactions.csv', parse_dates=['date']),
    #'hol': pd.read_csv('../input/holidays_events.csv', dtype={'transferred':str}, parse_dates=['date']),
    #'oil': pd.read_csv('../input/oil.csv', parse_dates=['date']),
}
CHAPTER 10
SAMPLE OUTPUT
Sales over time: It is useful to understand how sales are distributed over the year, month, and day in order to understand the effect of time on sales.
Sales per year: the per-year chart shows that sales increase every year, while the maximum sales occur in the month of July.
The chart of sales by day of the month indicates that fewer sales happen on the first and last days of each month.
The day-of-week chart indicates that the weekend does not have much of an effect on sales.
CHAPTER 11
CONCLUSION
Sales forecasting plays a vital role in every field of the business sector. With the help of sales forecasts, sales revenue analysis provides the details needed to estimate both revenue and income. Different types of machine learning techniques, such as Support Vector Regression, Gradient Boosting Regression, Simple Linear Regression, and Random Forest Regression, have been evaluated on food sales data to find the critical factors that influence sales and to provide a solution for forecasting them. After evaluating metrics such as accuracy, mean absolute error, and max error, Random Forest Regression is found to be the most appropriate algorithm for the collected data, thus fulfilling the aim of this project. The visualizations of actual vs. predicted values show that Random Forest fits the testing data better than LSTM does.
This project was taken from a Kaggle competition when I was a novice at such competitions, but I pursued it out of interest and with the sole objective of building Kaggle projects. I went through several phases in this project, described below, and overall I found it very difficult for a novice to compete.
Based on the analysis, I came to know that this data has two aspects: one is causality and the other is time dependency. Based on that, I decided to propose which would be the better forecasting model, selecting Random Forest for causality-based forecasting and LSTM for time-series forecasting.
I decided to reduce the data and use only a portion of it, and experimented a lot with many combinations of data that looked meaningful. Finally, I selected one best-selling product to predict over time and based on features.
Apart from the experiments with data, the hardest part of this project was training the LSTM. I tried many ways to reduce overfitting and underfitting, which took up most of the experiments in this project.
Finally, I ended up with reasonably good models with both LSTM and Random Forest, and found that this data has a lot of causality, which is why time-series forecasting underperforms.
FUTURE WORK
This project is about understanding the application of two different methods for predicting sales. One is Random Forest, a state-of-the-art ensemble (bagging) algorithm based on decision trees, used here to check causality-based prediction. In these experiments, I have used it for one single product, and it performs quite well, better than LSTM. Random Forest could be improved further with k-fold cross-validation for better parameter tuning, as sketched below.
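A hedged sketch of that k-fold tuning with scikit-learn's GridSearchCV; the parameter grid is illustrative, not the project's actual search space, and X_train/y_train are assumed to be prepared as before:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [50, 100, 200],
    'min_samples_leaf': [10, 50, 100],
    'max_features': ['sqrt', 1.0],
}
search = GridSearchCV(RandomForestRegressor(random_state=50), param_grid,
                      cv=5, scoring='neg_mean_squared_error')
search.fit(X_train, y_train)          # 5-fold cross-validated grid search
print(search.best_params_, search.best_score_)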
LSTM is used with a lag of 7 observations, which again needs improvement. With the availability of higher-end resources, we could build a multivariate time series in LSTM. During the experiments I observed that LSTM improves with more data and is quite sensitive to batch size, but tuning both of these needs a huge amount of time and processing power. I tried many combinations of LSTM layers and nodes without seeing improvement, but experimenting with a CNN here might be helpful in learning the causality of such sales data.
In both experiments one can also improve on the amount of data, as a total of 5 GB is available to make the model learn better. I have not tried Facebook Prophet here, but it is also considered one of the good candidates for time-series forecasting, as in the sketch below.
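A minimal Prophet sketch, assuming a daily_sales DataFrame with date and unit_sales columns (hypothetical names); Prophet expects the columns to be renamed 'ds' and 'y':

from prophet import Prophet

df = daily_sales.rename(columns={'date': 'ds', 'unit_sales': 'y'})
m = Prophet()
m.fit(df)
future = m.make_future_dataframe(periods=30)   # forecast 30 days ahead
forecast = m.predict(future)
print(forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail())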
REFERENCES
1. https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average
2. https://www.kaggle.com/c/favorita-grocery-sales-forecasting#evaluation
3. https://en.wikipedia.org/wiki/Random_forest
4. https://github.com/llSourcell/LSTM_Networks/blob/master/LSTM%20Demo.ipynb
5. H. Yu, O. G. Garrod, and P. G. Schyns, "Perception-driven facial expression synthesis," Computers & Graphics, vol. 36, no. 3, pp. 152–162, 2012.
6. S. Oh, J. Bailenson, N. Krämer, and B. Li, "Let the avatar brighten your smile: Effects of enhancing facial expressions in virtual environments," PLoS ONE, vol. 11, no. 9, p. e0161794, 2016.
7. J. R. Williamson, MIT Lincoln Laboratory, "Detecting Depression using Vocal, Facial and Semantic Communication Cues," AVEC'16, October 16, 2016, Amsterdam, Netherlands, ACM, ISBN 978-1-4503-4516-3/16/10.
8. V. Surakka and J. K. Hietanen, "Facial and emotional reactions to Duchenne and non-Duchenne smiles," International Journal of Psychophysiology, vol. 29, no. 1, pp. 23–33, 1998.
9. C. L. Lisetti and D. J. Schiano, "Automatic Facial Expression Interpretation: Where Human-Computer Interaction, Artificial Intelligence and Cognitive Science Intersect," Pragmatics and Cognition (Special Issue on Facial Information Processing: A Multidisciplinary Perspective), vol. 8, no. 1, pp. 185–235, 2000.
10. "The simulation of smiles (SIMS) model: Embodied simulation and the meaning of facial expression," Behavioral and Brain Sciences, vol. 33, no. 6, pp. 417–433, 2010.
11. Zhang, G. P., "Business forecasting with artificial neural networks: An overview," Neural Networks in Business Forecasting, 2004, pp. 1–22.
12. Shen, S., Jiang, H., and Zhang, T., "Stock market forecasting using machine learning algorithms," Department of Electrical Engineering, Stanford University, Stanford, CA, 2012, pp. 1–5.