Sales Prediction

CONTENTS

CHAPTER NO.  TITLE

             BONAFIDE CERTIFICATE
             ACKNOWLEDGEMENT
             ABSTRACT
             CONTENTS
1            INTRODUCTION
2            LITERATURE SURVEY
3            SYSTEM DESIGN
4            SOFTWARE AND HARDWARE REQUIREMENTS
5            SYSTEM ANALYSIS
6            MODULES
7            SYSTEM TESTING
8            ALGORITHM
9            SAMPLE CODE
10           SAMPLE OUTPUT
11           CONCLUSION & FUTURE WORK
             REFERENCES
ABSTRACT

Product sales forecasting is a major aspect of purchasing management. Forecasts are crucial in determining inventory stock levels, and accurately estimating future demand for goods has been an ongoing challenge, especially in the supermarkets and grocery stores industry. If goods are not readily available, or if availability exceeds demand, overall profit can be compromised. As a result, sales forecasting for goods can be significant in ensuring that loss is minimized. Additionally, the problem becomes more complex as retailers add new locations with unique needs, new products, ever-shifting seasonal tastes, and unpredictable product marketing. In this analysis, a forecasting model is developed using machine learning algorithms to improve the accuracy of product sales forecasts. The proposed model is especially targeted at supporting future purchasing decisions with more accurate sales forecasts; it is not intended to replace current subjective forecasting methods. A model based on a real grocery store's data is developed in order to validate the use of the various machine learning algorithms. In the case study, multiple regression methods are compared by their impact on forecast product availability, so that stores have just enough product at the right time.
CHAPTER 1

INTRODUCTION

In this project, we try to forecast product sales based on the items, stores, transactions, and other dependent variables such as holidays and oil prices.

This is a Kaggle competition called "Corporación Favorita Grocery Sales Forecasting", where the task is to predict the stocking of products so that grocery stores can better please customers by having just enough of the right products at the right time.

For this particular problem, we have analyzed the data as a supervised learning problem. In order to forecast the sales we have compared different regression models: Linear Regression, Decision Tree, ExtraTreeRegressor, Gradient Boosting, Random Forest, and XGBoost. To further optimize the results we have used a multilayer perceptron (MLP, a class of feed-forward artificial neural network) and LightGBM (a gradient boosting framework that uses tree-based learning algorithms).
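As a rough illustration of this comparison, the following sketch cross-validates the listed regressors; the feature matrix X and unit-sales target y are assumed to come from the preprocessing described later, and the 3-fold split with default parameters is an illustrative choice, not the project's exact setup.

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor, ExtraTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

models = {
    'Linear Regression': LinearRegression(),
    'Decision Tree': DecisionTreeRegressor(),
    'ExtraTreeRegressor': ExtraTreeRegressor(),
    'Gradient Boosting': GradientBoostingRegressor(),
    'Random Forest': RandomForestRegressor(),
    'XGBoost': XGBRegressor(),
}
for name, model in models.items():
    # cross_val_score returns negated MSE; take the root of its mean for RMSE
    mse = -cross_val_score(model, X, y, cv=3, scoring='neg_mean_squared_error').mean()
    print(name, mse ** 0.5)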

Sales forecasting is the process of using a company's past sales records to predict its short-term or long-term performance in the future. This is one of the pillars of proper financial planning. As with any prediction-related process, risk and uncertainty are unavoidable in sales forecasting too. Hence, it is considered good practice for forecasting teams to state the degree of uncertainty in their forecasts.

Accurately forecasting sales and building a sales plan can help to avoid unforeseen cash flow problems and manage production, staffing, and financing needs more effectively.
Brick-and-mortar grocery stores are always closely tied to purchasing and sales forecasts.

An incorrect prediction will cause over-purchasing, which will lead to overstock and

spoilage. On the other hand, an insufficient purchase will cause a shortage of

merchandise available to customers. Therefore, it is very important for grocery stores to

accurately predict the purchase volume of goods. Sales prediction is an important part of

modern business intelligence [1]. Accurate forecasts can bring huge benefits to a

businessman or a business. In the last decade, machine learning has been used for various

business predictions, such as in the financial industry, stock forecasting [2], etc.

To accurately predict sales, it is critical to take a wide range of factors into account. Corporación Favorita provides the data needed to train a relevant prediction model. The purpose of this study is to address the inventory problems that exist in most grocery stores, such as overstocking and not having enough items for customers to purchase. In this paper, a prediction model for the above problem is studied. The model is able to predict the sales of different items in each store. Stores can purchase goods at different times according to the forecast, which can reduce unreasonable purchases and increase the turnover of the grocery store.
CHAPTER 2

LITERATURE SURVEY

Akshay Krishna et al. proposed the normal regression technique and the boosting technique, with the Root Mean Square Error (RMSE) used for evaluating accuracy. The boosting algorithms gave better results than the regular regression algorithms. Learning objective: RMSE is calculated using the variance, a fundamental concept, so it cannot be scaled up to a larger level; this is one of the significant factors that affects the accuracy of the evaluation. Without proper hyperparameter tuning, the AdaBoost algorithm will not perform as expected, and its performance deteriorates.

Gopalakrishnan T et al. proposed a linear regression algorithm. Accuracy is evaluated based on the precision value, which specifies the number of correct recommendations, i.e. the proportion of relevant retrievals to the total population. Although there are plenty of machine learning algorithms, this system only uses a linear regression algorithm. Learning objective: a linear regression algorithm is used when someone wants to predict a variable's value based on the value of another variable. Since this algorithm depends on the other variables for prediction, it will not be very efficient, and without comparison against another algorithm one cannot be assured that it is the best algorithm.

Sunitha Cheriyan et al. proposed a generalized linear model, a decision tree, and a gradient boosted tree. Accuracy is calculated using empirical evaluation, in which results are derived by observation or experiment instead of theory. The results are summarized in terms of the most efficient technique's reliability and accuracy. During the analysis phase, some of the documents were discarded, and the data used in this analysis were insufficient for further analysis. Learning objective: using a generalized linear model, decision tree, and gradient boosted tree, the execution time will be huge and managing a large set of records will be complicated, so it is not easy to make predictions for massive datasets.

Mohit Gurnani et al. proposed various machine learning models, namely the Autoregressive Integrated Moving Average (ARIMA), the Auto-Regressive Neural Network (ARNN), XGBoost, SVM, hybrid models such as hybrid ARIMA-ARNN, hybrid ARIMA-XGBoost (extreme gradient boosting), and hybrid ARIMA-SVM, and STL decomposition (using ARIMA, Snaive, and XGBoost). These models' accuracy is measured by metrics such as Mean Absolute Error (MAE) and Root Mean Square Error (RMSE). STL decomposition (Seasonal-Trend decomposition using Loess) gave better results than the individual and hybrid models. Learning objective: STL is one of the decomposition techniques, in which each component is analyzed separately and forecasted using various machine learning algorithms. Linear models such as ARIMA cannot capture nonlinear patterns precisely, and the approach can fail when nonlinear models outperform the hybrid model.

F. M. Thiesing et al. proposed feed-forward multilayer perceptron networks, with one batch and two online training algorithms implemented on parallel systems (PARIX, Parallel Virtual Machine (PVM)). Increasing the number of input neurons increases the training time, and the prediction error rate is high. Learning objective: in these feed-forward multilayer perceptron networks, the training time grows with the count of input neurons, so the predictions are not free of time constraints; the time varies with the selected inputs, and it is hard to predict within the expected period. F. M. Thiesing et al. also proposed prediction techniques such as neural and conventional (naive, statistical) approaches. The error is measured by the root mean squared error (RMSE), and accuracy is measured by RMSE and Theil's U. The neural network outperforms the naive and statistical approaches. The program runs as a prototype and handles only a small subset of the supermarket's inventory. Learning objective: the neural and conventional techniques are suitable only for a small subset of the supermarket's stock, and hence cannot be scaled up to a larger level.


CHAPTER 3

SYSTEM DESIGN

Sales forecasting remains one of the important requirements of any grocery store. There are many statistical models used for the task of predicting sales, such as ARIMA. Those models are mainly based on time, and so are univariate time-series predictions. Corporación Favorita (CF) sales are affected by many other factors, which motivates the application of a more complex model.

Here, predicting the sales of each product is considered the main problem to be solved. Generally, grocery stores use traditional models for prediction, with little data taken into consideration. It is important to evaluate which of two different kinds of models would be better for forecasting CF sales: one is a time-series based model, and the other is regression based on causality. Here, the comparison uses LSTM for time-series forecasting and Random Forest for causal forecasting. This will allow us to understand which is better for forecasting sales.

Dataset size reduction: The training data is quite large, consisting of 125,497,040 observations, which take more than 50 GB when loaded into memory. So, the first step is to reduce the dataset using data transformation or sampling, as sketched below.
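A minimal sketch of this step, assuming the competition's train.csv layout; the dtype choices mirror the sample code later in this report, and the 1% sampling fraction is an illustrative assumption.

import pandas as pd

dtypes = {'id': 'int64', 'item_nbr': 'int32', 'store_nbr': 'int8',
          'unit_sales': 'float32', 'onpromotion': str}
# Downcast column dtypes at read time to shrink the in-memory footprint.
train = pd.read_csv('../input/train.csv', dtype=dtypes, parse_dates=['date'])
# Keep a 1% random sample for exploratory analysis.
sample = train.sample(frac=0.01, random_state=42)
print(sample.info(memory_usage='deep'))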

Data analysis: This includes studying the effect of each factor on sales, such as:

- the kind of change in sales due to promotions
- the effect of oil price changes
- the sales volume at each store
- the city with the highest consumption
- the treatment of null, negative, or NaN values in each column
- unit_sales vs. time

The MLPNN is one of the most significant models in artificial neural networks. The MLPNN consists of one input layer, one or more hidden layers, and one output layer. In the MLPNN, the input nodes pass values to the first hidden layer, the nodes of the first hidden layer pass values to the second, and so on until the output layer produces the outputs, as shown in the figure.
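As an illustration only, such a network can be written in Keras as follows; the layer sizes and the n_features placeholder are assumptions, not values taken from the project.

from keras.models import Sequential
from keras.layers import Dense

mlp = Sequential()
mlp.add(Dense(64, activation='relu', input_dim=n_features))  # first hidden layer
mlp.add(Dense(32, activation='relu'))                        # second hidden layer
mlp.add(Dense(1))                                            # output node: predicted sales
mlp.compile(loss='mse', optimizer='adam')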
CHAPTER 4

SOFTWARE AND HARDWARE REQUIREMENTS

Hardware Requirements:

• System : Dual Core
• Hard Disk : 500 GB
• Monitor : LED Monitor
• Mouse : Optical Mouse
• RAM : 4 GB

Software Requirements:

• Operating System : Windows 10
• Coding Language : Python 3.7
• IDE : PyCharm
• Database : Access


CHAPTER 5

SYSTEM ANALYSIS

LightGBM (LGBM) aims to make gradient boosting on decision trees faster. The idea is that instead of checking all of the possible splits when creating new leaves, only some of them are checked: the model first sorts all of the attribute values and buckets the observations by creating discrete bins. When a leaf of the tree needs to be split, instead of iterating over all of the attribute values, it simply iterates over all of the buckets. Its authors call this the histogram implementation.
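A hedged usage sketch of this idea; X_train and y_train are assumed to come from the earlier data preparation, and max_bin is the parameter that controls the histogram buckets described above.

import lightgbm as lgb

lgb_train = lgb.Dataset(X_train, y_train)
params = {
    'objective': 'regression',
    'metric': 'rmse',
    'max_bin': 255,        # number of discrete bins per feature (the histogram)
    'num_leaves': 31,
    'learning_rate': 0.1,
}
gbm = lgb.train(params, lgb_train, num_boost_round=100)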


Data Flow Diagram

The BP (backpropagation) algorithm has served as a useful methodology for training multilayer perceptrons in a wide range of applications. The BP network calculates the difference between real and predicted values, which is propagated from the output nodes backwards to the nodes in the previous layer. The BP learning algorithm can be divided into two phases: propagation and weight update.
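For reference, the weight update in the second phase follows the standard gradient-descent form, where $E$ is the error between real and predicted values and $\eta$ is the learning rate:

\Delta w_{ij} = -\eta \, \frac{\partial E}{\partial w_{ij}}, \qquad w_{ij} \leftarrow w_{ij} + \Delta w_{ij}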


CHAPTER 6

MODULES

Random Forest

Random Forest is one of the more complex models employed here, used with its default parameter configuration; as it is being compared with LSTM, parameter tuning is not applied. The following is the detail of the fitted model:

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
                      oob_score=False, random_state=None, verbose=0,
                      warm_start=False)

LSTM

LSTM is a variant of the RNN. Here we have used a single LSTM layer with one dropout layer and one fully connected dense layer. Experiments were conducted with stacking multiple LSTM layers, but results did not improve, so to keep the model simple and efficient only one layer is kept. The added activation layer is linear, while the LSTM's internal gates are based on tanh and sigmoid. The optimizer is RMSprop with a learning rate of 0.001, and MSE is used as the loss function. The LSTM was tried with different batch sizes; among them, 10 was selected for memory efficiency. The total number of epochs is 15, as no improvement was found beyond that. The following is the network architecture:

from keras.models import Sequential
from keras.layers import LSTM, Dropout, Dense, Activation

model = Sequential()
model.add(LSTM(50, input_shape=(X_train_values.shape[1], X_train_values.shape[2])))
model.add(Dropout(0.3))
model.add(Dense(1))
model.add(Activation('linear'))
model.compile(loss='mse', optimizer='rmsprop')
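The corresponding training call, matching the batch size of 10 and 15 epochs described above, might look as follows; the validation arrays are assumed to come from the data preparation and are not shown in the original listing.

model.fit(X_train_values, y_train,
          epochs=15, batch_size=10,
          validation_data=(X_test_values, y_test),
          verbose=2, shuffle=False)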
CHAPTER 7

SYSTEM TESTING

The decision tree is a type of supervised learning algorithm that is mostly used for classification problems; surprisingly, it works for both categorical and continuous dependent variables. In this algorithm, we split the population into two or more homogeneous sets, based on the most significant attributes/independent variables, so as to form groups that are as distinct as possible. A tree has many analogies in the real world, and it turns out that it has influenced a wide area of machine learning, covering both classification and regression. In decision analysis, a decision tree is used to visually and explicitly represent decisions and decision making. As the name goes, it uses a tree-like model of decisions. Though a commonly used tool in data mining for deriving a strategy to reach a particular goal, it is also widely used in machine learning. Once we have completed modelling the Decision Tree classifier, we can use the trained model to predict whether the balance scale tips to the right, tips to the left, or stays balanced, as sketched below.
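A brief sketch of that balance-scale example, assuming a local copy of the UCI balance-scale dataset; the file name, split ratio, and tree depth are illustrative assumptions.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv('balance-scale.data', header=None)  # class label in column 0
X, y = df.iloc[:, 1:], df.iloc[:, 0]                 # y takes values 'L', 'R', 'B'
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=100)

clf = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=100)
clf.fit(X_train, y_train)
print('Accuracy:', clf.score(X_test, y_test))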

Random Forest is a great algorithm to train early in the model development process to see how it performs, and it is hard to build a "bad" Random Forest because of its simplicity. This algorithm is also an excellent choice if you need to develop a model in a short amount of time. On top of that, it provides a fairly good indicator of the importance it assigns to your features. Random Forests are very hard to beat in terms of performance, and they can handle many different feature types, such as binary, categorical, and numerical. Overall, Random Forest is a (mostly) fast, simple, and flexible tool, though it has its limitations. Random forests are an ensemble learning method for classification, regression, and other tasks that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Random forests correct for decision trees' habit of overfitting to their training set.


CHAPTER 8

ALGORITHM

Here, the two different algorithms to be used are Random Forest and LSTM (Long Short-Term Memory). We use the most basic versions of both so that they can be compared; their configurations are the same as those detailed in Chapter 6 (Modules).

CHAPTER 9

SAMPLE CODE

# This Python 3 environment comes with many helpful analytics libraries installed.
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load in.
# Any results you write to the current directory are saved as output.

import numpy as np   # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import gc; gc.enable()
from sklearn import preprocessing, linear_model, metrics
import matplotlib.pyplot as plt

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list
# the files in the input directory.
from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

dtypes = {'id': 'int64', 'item_nbr': 'int32', 'store_nbr': 'int8', 'onpromotion': str}
data = {
    'tra': pd.read_csv('../input/train.csv', dtype=dtypes, parse_dates=['date']),
    'tes': pd.read_csv('../input/test.csv', dtype=dtypes, parse_dates=['date']),
    'ite': pd.read_csv('../input/items.csv'),
    'sto': pd.read_csv('../input/stores.csv'),
    'trn': pd.read_csv('../input/transactions.csv', parse_dates=['date']),
    'hol': pd.read_csv('../input/holidays_events.csv', dtype={'transferred': str},
                       parse_dates=['date']),
    'oil': pd.read_csv('../input/oil.csv', parse_dates=['date']),
}

train = data['tra']  # [(data['tra']['date'].dt.month == 8) & (data['tra']['date'].dt.day > 15)]
test = data['tes']   # [(data['tes']['date'].dt.month == 8) & (data['tes']['date'].dt.day > 15)]

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
# sklearn.cross_validation was removed in scikit-learn 0.20;
# sklearn.model_selection provides the same train/test utilities.
from sklearn import model_selection

rf = RandomForestRegressor(max_features="auto", min_samples_leaf=50,
                           n_estimators=100, random_state=50, oob_score=True)
rf.fit(X_train, y_train)
print('RF accuracy: TRAINING', rf.score(X_train, y_train, W_train))
print('RF accuracy: TESTING', rf.score(X_test, y_test, W_test))
print("feature importance", rf.feature_importances_)

yhat1 = rf.predict(X_test)
print('NWRMSLE RF', NWRMSLE(y_test, yhat1, W_test.values))
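The listing above calls NWRMSLE without defining it (likewise, the feature matrices X_train/X_test and the weight vectors W_train/W_test come from preparation steps that are not shown). The following is a sketch of the helper based on the competition's stated evaluation metric, the Normalized Weighted Root Mean Squared Logarithmic Error:

import numpy as np

def NWRMSLE(y_true, y_pred, w):
    # Normalized Weighted Root Mean Squared Logarithmic Error, per the
    # competition's evaluation page; w holds the per-item weights
    # (perishable items weigh 1.25, all others 1.0).
    y_true = np.clip(np.asarray(y_true, dtype=float), 0, None)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), 0, None)
    sq_log_err = (np.log1p(y_pred) - np.log1p(y_true)) ** 2
    return np.sqrt(np.sum(w * sq_log_err) / np.sum(w))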



LSTM

import numpy as np   # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import gc; gc.enable()
from sklearn import preprocessing, linear_model, metrics
import matplotlib.pyplot as plt

# Input data files are available in the "../input/" directory.
from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

dtypes = {'id': 'int64', 'item_nbr': 'int32', 'store_nbr': 'int8', 'onpromotion': str}
data = {
    'tra': pd.read_csv('../input/train.csv', dtype=dtypes, parse_dates=['date']),
    # 'tes': pd.read_csv('../input/test.csv', dtype=dtypes, parse_dates=['date']),
    'ite': pd.read_csv('../input/items.csv'),
    # 'sto': pd.read_csv('../input/stores.csv'),
    # 'trn': pd.read_csv('../input/transactions.csv', parse_dates=['date']),
    # 'hol': pd.read_csv('../input/holidays_events.csv', dtype={'transferred': str},
    #                    parse_dates=['date']),
    # 'oil': pd.read_csv('../input/oil.csv', parse_dates=['date']),
}
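A hedged sketch of how the lag-7 windows for the LSTM can be built from this data; the aggregation to a single daily series is an illustrative assumption (the project actually selects one top-selling product), and the helper name is hypothetical.

import numpy as np

def make_windows(series, lag=7):
    # Turn a 1-D series into (samples, timesteps, features) windows plus targets.
    X, y = [], []
    for i in range(lag, len(series)):
        X.append(series[i - lag:i])
        y.append(series[i])
    return np.array(X).reshape(-1, lag, 1), np.array(y)

daily = data['tra'].groupby('date')['unit_sales'].sum().values
X_train_values, y_train = make_windows(daily, lag=7)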

CHAPTER 10

SAMPLE OUTPUT

Sales over time: It is useful to understand how sales are distributed over the year, month, and day in order to understand the effect of time on sales.

Sales in each year

The chart of sales in each year shows that sales increase every year; on the other hand, the maximum sales occur in the month of July.

The chart of sales by day of the month indicates that fewer sales happen on the first and last days of each month.

The chart by day of the week indicates that the weekend has no large effect on sales.

CHAPTER 11
CONCLUSION

Sales forecasting plays a vital role in every field of the business sector. With the help of sales forecasts, sales revenue analysis provides the details needed to estimate both the revenue and the income. Different types of machine learning techniques, such as Support Vector Regression, Gradient Boosting Regression, Simple Linear Regression, and Random Forest Regression, have been evaluated on food sales data to find the critical factors that influence sales and to provide a solution for forecasting them. After computing metrics such as accuracy, mean absolute error, and max error, Random Forest Regression is found to be the appropriate algorithm for the collected data, thus fulfilling the aim of this project. The following visualizations, plotting actual values against predictions, show that Random Forest fits the testing data better than the LSTM.


This project was taken from a Kaggle competition when I was a novice to such competitions, with the sole objective of building Kaggle projects. I went through several phases in this project, described below, and overall I found it very difficult for a novice to compete.

Based on the analysis, I came to know that this data has two aspects, causality and time dependency, and based on that I decided on a proposal for which would be the better forecasting model: I selected Random Forest for causality-based forecasting and LSTM for time-series forecasting.

I decided to reduce the data and use only a portion of it, and experimented a lot with many combinations of data that looked meaningful. Finally, I selected the most-sold product to predict over time and based on features.

Apart from the experiments with data, the hardest part of this project was training the LSTM. I tried many ways to reduce overfitting and underfitting, which took up most of the experiments in this project.

Finally, I ended up with reasonably good models with both LSTM and Random Forest, and found that this data has a lot of causality, which is why time-series forecasting underperforms.

FUTURE WORK

This project is about understanding the application of two different methods for predicting sales. One is Random Forest, a state-of-the-art ensemble algorithm that bags decision trees; it is used here to check causality-based prediction. In these experiments, I have used it for one single product, and it performs quite well, better than the LSTM. Random Forest could further be combined with k-fold cross-validation for better parameter tuning, as sketched below.
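One possible realization of that tuning step; the parameter grid and scoring choice are illustrative assumptions, not values from these experiments.

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(RandomForestRegressor(random_state=50),
                    param_grid={'n_estimators': [50, 100, 200],
                                'min_samples_leaf': [10, 50]},
                    cv=5, scoring='neg_mean_squared_error')
grid.fit(X_train, y_train)
print(grid.best_params_)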

LSTM is used with a lag of 7 observations, which again needs improvement. With the availability of better, high-end resources, we could create a multivariate time series in the LSTM. During the experiments I observed that the LSTM improves with more data and is quite sensitive to batch size, but tuning both of these parameters needs a huge amount of time and processing power. I tried many combinations of LSTM layers and nodes but could not see improvement; however, combining a CNN here might be helpful in learning the causality of such sales data.

In both of the experiments one can also use more of the data, as a total of 5 GB of data is available to make the model learn better. I have not tried Facebook Prophet here, but it is also considered one of the good candidates for time-series forecasting.


REFERENCES

1. https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average
2. https://www.kaggle.com/c/favorita-grocery-sales-forecasting#evaluation
3. https://en.wikipedia.org/wiki/Random_forest
4. https://github.com/llSourcell/LSTM_Networks/blob/master/LSTM%20Demo.ipynb
5. H. Yu, O. G. Garrod, and P. G. Schyns, "Perception-driven facial expression synthesis," Computers & Graphics, vol. 36, no. 3, pp. 152–162, 2012.
6. S. Oh, J. Bailenson, N. Krämer, and B. Li, "Let the avatar brighten your smile: Effects of enhancing facial expressions in virtual environments," PLoS ONE, vol. 11, no. 9, p. e0161794, 2016.
7. J. R. Williamson, MIT Lincoln Laboratory, "Detecting Depression using Vocal, Facial and Semantic Communication Cues," AVEC '16, October 16, 2016, Amsterdam, Netherlands. ACM, ISBN 978-1-4503-4516-3/16/10.
8. V. Surakka and J. K. Hietanen, "Facial and emotional reactions to Duchenne and non-Duchenne smiles," International Journal of Psychophysiology, vol. 29, no. 1, pp. 23–33, 1998.
9. C. L. Lisetti and D. J. Schiano, "Automatic Facial Expression Interpretation: Where Human-Computer Interaction, Artificial Intelligence and Cognitive Science Intersect," Pragmatics and Cognition (Special Issue on Facial Information Processing: A Multidisciplinary Perspective), vol. 8, no. 1, pp. 185–235, 2000.
10. "The simulation of smiles (SIMS) model: Embodied simulation and the meaning of facial expression," Behavioral and Brain Sciences, vol. 33, no. 6, pp. 417–433, 2010.
11. G. P. Zhang, "Business forecasting with artificial neural networks: An overview," Neural Networks in Business Forecasting, 2004, pp. 1–22.
12. S. Shen, H. Jiang, and T. Zhang, "Stock market forecasting using machine learning algorithms," Department of Electrical Engineering, Stanford University, Stanford, CA, 2012, pp. 1–5.
