CONTENTS
CHAPTER NO.  TITLE
             BONAFIDE CERTIFICATE
             ACKNOWLEDGEMENT
             ABSTRACT
             CONTENTS
1            INTRODUCTION
2            LITERATURE SURVEY
3            SYSTEM DESIGN
4            SOFTWARE AND HARDWARE REQUIREMENTS
5            SYSTEM ANALYSIS
6            MODULES
7            SYSTEM TESTING
8            ALGORITHM
9            SAMPLE CODE
10           SAMPLE OUTPUT
11           CONCLUSION & FUTURE WORK
             REFERENCES
ABSTRACT
Product sales forecasting is a major aspect of purchasing management. Forecasts are crucial in determining inventory stock levels, and accurately estimating future demand for goods has been an ongoing challenge, especially in the supermarket and grocery store industry. If goods are not readily available, or if availability exceeds demand, overall profit can be compromised. As a result, sales forecasting for goods is significant in ensuring that losses are minimized. The problem becomes more complex as retailers add new locations with unique needs, new products, ever-shifting seasonal tastes, and unpredictable product marketing. In this analysis, a forecasting model is developed using machine learning algorithms to improve the accuracy of product sales forecasts. The proposed model is targeted at supporting future purchasing decisions with more accurate sales forecasts; it is not intended to replace current subjective forecasting methods. A model based on a real grocery store's data is developed in order to validate the use of the various machine learning algorithms. In the case study, multiple regression methods are compared, and their impact on forecasting in-store product availability is assessed, to ensure stores have just enough products at the right time.
CHAPTER 1
INTRODUCTION
In this project, we are trying to forecast product sales based on the items, stores, transactions, and other dependent variables like holidays and oil prices.
This is a Kaggle competition called "Corporación Favorita Grocery Sales Forecasting", where the task is to predict the stocking of products to better ensure grocery stores please customers by having just enough of the right products at the right time.
For this particular problem, we have analyzed the data as a supervised learning problem.
In order to forecast the sales, we have compared different regression models such as Linear Regression, Decision Tree, ExtraTreeRegressor, Gradient Boosting, Random Forest, and XGBoost. To further optimize the results, we have used a multilayer perceptron (MLP: a class of feed-forward artificial neural network) and LightGBM (a gradient boosting framework that uses tree-based learning algorithms).
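As a minimal sketch of this comparison (assuming a preprocessed feature matrix X and target vector y, which are hypothetical names here), the candidate regressors can be scored side by side on a held-out split:

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor, ExtraTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

# X, y are assumed to be an already-preprocessed feature matrix and the
# unit_sales target from the training data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    'Linear Regression': LinearRegression(),
    'Decision Tree': DecisionTreeRegressor(random_state=42),
    'Extra Tree': ExtraTreeRegressor(random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(random_state=42),
    'Random Forest': RandomForestRegressor(random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
    print(f'{name}: RMSE = {rmse:.4f}')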
Sales Forecasting is the process of using the company’s sales records of the past years to
predict the short-term or long-term performance in the future. This is one of the pillars of
proper financial planning. As with any prediction-related process, risk and uncertainty are
unavoidable in Sales Forecasting too. Hence, it’s considered good practice for forecasting
teams to mention the degree of uncertainty in their forecasts.
Accurately forecasting sales and building a sales plan can help to avoid unforeseen cash
flow problems and manage production, staff and financing needs more effectively.
Brick-and-mortar grocery stores are always closely tied to purchasing and sales forecasts.
An incorrect prediction will cause over-purchasing, which will lead to overstock and
spoilage. On the other hand, an insufficient purchase will cause a shortage of
merchandise available to customers. Therefore, it is very important for grocery stores to
accurately predict the purchase volume of goods. Sales prediction is an important part of
modern business intelligence [1]. Accurate forecasts can bring huge benefits to a
businessman or a business. In the last decade, machine learning has been used for various
business predictions, such as in the financial industry, stock forecasting [2], etc.
To accurately predict sales, it is critical to take into account a wide range of factors.
Corporación Favorita gives important data to make relevant predictions for this model
training. The purpose of this study was to address the inventory problems that exist in
most grocery stores, such as overstocking and not having enough items for customers to
purchase. In this paper, based on the above problem, a related prediction model is
studied. The model was able to predict the sales of different items in each store. Stores
can purchase goods at different times according to the forecast, which can reduce
unreasonable purchases in the store and increase the turnover of the grocery store.
CHAPTER 2
LITERATURE SURVEY
Akshay Krishna et al. proposed the normal regression technique, the boosting technique, and the Root Mean Square Error (RMSE) value for evaluating accuracy. The boosting algorithm gave better results than the regular regression algorithm. Learning objective: RMSE is calculated from variance, a fundamental concept, so it cannot be scaled up to larger levels; this is one of the significant factors affecting how accuracy is computed at a high rate. Without proper hyperparameter tuning, the AdaBoost algorithm will not perform as expected, and its performance deteriorates.
Gopalakrishnan T et al. proposed a linear regression algorithm. Accuracy is evaluated based on the precision value. Here the precision value specifies the number of correct recommendations, i.e. the proportion of relevant retrievals to the total population. Besides, there are plenty of machine learning algorithms, yet this system only uses a linear regression algorithm. Learning objective: a linear regression algorithm is used when someone wants to predict a variable's value based on the value of another variable. Since this algorithm depends on other variables for prediction, it will not be very efficient, so without a comparison against another algorithm one cannot be sure that it is the best algorithm.
Sunitha Cheriyan et al. proposed a generalized linear model, a decision tree, and a gradient boosted tree. Accuracy is calculated using empirical evaluation, in which results are derived by observation or experiment instead of theory. The results are summarized in terms of the most effective technique's reliability and accuracy. During the analysis phase, some of the documents were discarded, and the data used in this analysis were insufficient for further analysis. Learning objective: using a generalized linear model, decision tree, and gradient boosted tree, the execution time will be huge, and managing a large set of records will be complicated. So it is not easy to make predictions for massive datasets.
Mohit Gurnani et al. proposed various machine learning models, namely the Autoregressive Integrated Moving Average (ARIMA), the Auto-Regressive Neural Network (ARNN), XGBoost, SVM, hybrid models such as hybrid ARIMA-ARNN, hybrid ARIMA-XGBoost (extreme gradient boosting), and hybrid ARIMA-SVM, and STL decomposition (using ARIMA, Snaive, and XGBoost). These models' accuracy is measured by metrics such as Mean Absolute Error (MAE) and Root Mean Square Error (RMSE). Seasonal and Trend decomposition using Loess (STL) gave better results than the individual and hybrid models. Learning objective: STL is a decomposition technique in which each component is analyzed separately and forecast using various machine learning algorithms. Linear models such as ARIMA cannot capture nonlinear patterns precisely, and the approach can fail when nonlinear models outperform the hybrid model.
F. M. Thiesing et al. proposed feed-forward multilayer perceptron networks. One batch and two online training algorithms were implemented on parallel systems (PARIX, Parallel Virtual Machine (PVM)). Increasing the number of input neurons increases training time, and the prediction error rate is high. Learning objective: with these feed-forward multilayer perceptron networks, as the count of input neurons increases, so does the training time, so predictions are not bound by a fixed time budget; the time varies according to the selected inputs, and it is hard to predict within the expected period. F. M. Thiesing et al. also proposed prediction techniques such as neural and conventional (naive, statistical) ones. The error is measured by the Root Mean Squared Error (RMSE), and accuracy is measured by RMSE and Theil's U. The neural network outperforms the naive and statistical approaches. The program runs as a prototype and handles only a small subset of the supermarket's inventory. Learning objective: the neural and conventional techniques are suitable only for a small subset of the supermarket's stock, hence they cannot be scaled up to larger levels.
CHAPTER 3
SYSTEM DESIGN
Sales forecasting remains one of the important requirements of any grocery store. There are many statistical models used for the task of predicting sales, such as ARIMA. Those models are mainly time-based and therefore produce univariate time-series predictions. Corporación Favorita (CF) sales are affected by many other factors, which motivates the application of a more complex model.
Here, predicting the sales of each product is considered the main problem to be solved. Generally, grocery stores use traditional models for prediction with little data taken into consideration. It is important to evaluate which of two different kinds of models would be better for forecasting CF sales: one is a time-series model, the other is a regression model based on causality. Here, the comparison uses LSTM for time-series forecasting and Random Forest for causal forecasting. This will allow us to understand which is better for sales forecasting.
Dataset size reduction: The training data is quite large, consisting of 125,497,040 observations, which take more than 50 GB when loaded in memory. So the first step is to reduce the dataset using data transformation or sampling.
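One common way to shrink the in-memory footprint, sketched below under the assumption that the Kaggle CSV files sit in ../input/ as in the sample code, is to downcast column dtypes while reading and, optionally, to keep only a recent slice of dates:

import pandas as pd

dtypes = {'id': 'int64', 'item_nbr': 'int32', 'store_nbr': 'int8',
          'unit_sales': 'float32', 'onpromotion': str}
train = pd.read_csv('../input/train.csv', dtype=dtypes, parse_dates=['date'])

# Optionally sample a recent date range to cut memory further.
train = train[train['date'] >= '2017-01-01']
print(train.memory_usage(deep=True).sum() / 1e9, 'GB')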
Data analysis: This includes studying the effect of each factor on sales, as explored in the sketch after this list:
• The kind of change in sales due to promotions
• The effect of oil price changes
• Sales volume at each store
• The city with the highest consumption
• Treatment of null, negative, or NaN values in each column
• unit_sales vs. time
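A minimal exploratory sketch of these checks, assuming the train, stores, and oil DataFrames are loaded as in the sample code of Chapter 9 (hypothetical variable names):

import matplotlib.pyplot as plt

# Effect of promotion on sales.
print(train.groupby('onpromotion')['unit_sales'].mean())

# Sales volume per store, and the city with the highest consumption.
per_store = train.groupby('store_nbr')['unit_sales'].sum().rename('total')
by_city = per_store.reset_index().merge(stores, on='store_nbr')
print(by_city.groupby('city')['total'].sum().sort_values(ascending=False).head(1))

# Null / negative handling: negative unit_sales are returns, so clip them to
# zero, and forward-fill missing oil prices.
train['unit_sales'] = train['unit_sales'].clip(lower=0)
oil['dcoilwtico'] = oil['dcoilwtico'].ffill()

# unit_sales vs. time.
train.groupby('date')['unit_sales'].sum().plot(title='unit_sales vs time')
plt.show()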
MLPNN is one of the most significant models in artificial neural networks [3]. The MLPNN consists of one input layer, one or more hidden layers, and one output layer. In an MLPNN, the input nodes pass values to the first hidden layer, the nodes of the first hidden layer pass values to the second, and so on, until the output layer produces the outputs.
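As a hedged illustration of this layered structure, a small Keras MLP in the same style as the LSTM code later in this report might look as follows; n_features is an assumed input dimensionality:

from keras.models import Sequential
from keras.layers import Dense

n_features = 10  # assumed number of input variables, for illustration only

model = Sequential()
model.add(Dense(64, activation='relu', input_shape=(n_features,)))  # first hidden layer
model.add(Dense(32, activation='relu'))                             # second hidden layer
model.add(Dense(1, activation='linear'))                            # output layer
model.compile(loss='mse', optimizer='rmsprop')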
CHAPTER 4
SOFTWARE AND HARDWARE REQUIREMENTS
Hardware Requirements:
• System : Dual Core
• Hard Disk : 500 GB
• Monitor : LED Monitor
• Mouse : Optical Mouse
• RAM : 4 GB
Software Requirements:
• Operating system : Windows 10
• Coding language : Python 3.7
• IDE : PyCharm
• Database : MS Access
CHAPTER 5
SYSTEM ANALYSIS
LGBM (LightGBM) aims to make gradient boosting on decision trees faster. The idea is that instead of checking all of the possible splits when creating new leaves, only some of them are checked: the model first sorts all of the attribute values and buckets the observations into discrete bins. When a leaf in the tree needs to be split, instead of iterating over all of the attribute values, it simply iterates over all of the buckets. The authors call this the histogram implementation.
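A hedged sketch of fitting such a histogram-based booster with the LightGBM package, assuming the X_train/y_train/X_test arrays used elsewhere in this report:

import lightgbm as lgb

params = {
    'objective': 'regression',
    'metric': 'rmse',
    'max_bin': 255,         # number of histogram buckets per feature
    'num_leaves': 31,
    'learning_rate': 0.1,
}
dtrain = lgb.Dataset(X_train, label=y_train)
booster = lgb.train(params, dtrain, num_boost_round=100)
preds = booster.predict(X_test)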
Data Flow Diagram
The BP (backpropagation) algorithm has served as a useful methodology to train multilayer perceptrons for a wide range of applications [4]. The BP network calculates the difference between real and predicted values, which is propagated from the output nodes backwards to the nodes in the previous layers. The BP learning algorithm can be divided into two phases: propagation and weight update [4].
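As a toy illustration of these two phases (purely illustrative, not the project's training code), gradient descent on a single linear neuron can be written in a few lines of numpy:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))           # 100 samples, 3 inputs
y = X @ np.array([1.5, -2.0, 0.5])      # ground-truth weights
w = np.zeros(3)
lr = 0.1

for _ in range(200):
    # Phase 1: propagation -- compute predictions and the error signal.
    error = X @ w - y
    # Phase 2: weight update -- step against the mean-squared-error gradient.
    w -= lr * (X.T @ error) / len(y)

print(w)    # approaches [1.5, -2.0, 0.5]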
CHAPTER 6
MODULES
Random forest
Random Forest is one of the more complex models employed here, used with its default parameter configuration. Since it is being compared with LSTM, parameter tuning is not applied. The fitted model is detailed below.
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
                      oob_score=False, random_state=None, verbose=0, warm_start=False)
LSTM
LSTM is a variant of RNN. Here we have used a single LSTM layer with one dropout layer and one fully connected dense layer. Experiments were conducted with stacking multiple LSTM layers, but this did not improve results, so to keep the model simple and efficient only one layer is kept. The added activation layer is linear, while the LSTM's internal gates are based on tanh and sigmoid. The optimizer is RMSprop with a learning rate of 0.001, and MSE is used as the loss function. The LSTM was tried with different batch sizes, among which 10 was selected for memory efficiency. The total number of epochs is 15, as no improvement was found after that. The network architecture is as follows.
# Assumed imports for the network below.
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout, Activation

model = Sequential()
model.add(LSTM(50, input_shape=(X_train_values.shape[1], X_train_values.shape[2])))
model.add(Dropout(0.3))           # regularization between the LSTM and output layers
model.add(Dense(1))               # single-value sales prediction
model.add(Activation('linear'))
model.compile(loss='mse', optimizer='rmsprop')
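The text above fixes the batch size at 10 and the epochs at 15; a hedged sketch of how the training series might be shaped into the (samples, timesteps, features) tensor the LSTM expects, assuming a 1-D numpy array series of daily unit_sales (a hypothetical name) and the lag of 7 observations mentioned in the Future Work chapter:

import numpy as np

lag = 7
X_list, y_list = [], []
for i in range(lag, len(series)):
    X_list.append(series[i - lag:i])   # previous 7 observations as input
    y_list.append(series[i])           # next observation as target
X_train_values = np.array(X_list).reshape(-1, lag, 1)   # (samples, 7, 1)
y_train_values = np.array(y_list)

model.fit(X_train_values, y_train_values, epochs=15, batch_size=10)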
CHAPTER 7
SYSTEM TESTING
The decision tree is a type of supervised learning algorithm that is mostly used for classification problems. Surprisingly, it works for both categorical and continuous dependent variables. In this algorithm, we split the population into two or more homogeneous sets. This is done based on the most significant attributes/independent variables, to form groups that are as distinct as possible. A tree has many analogies in the real world, and it turns out that it has influenced a wide area of machine learning, covering both classification and regression. In decision analysis, a decision tree is used to visually and explicitly represent decisions and decision making. As the name goes, it uses a tree-like model of decisions. Though a commonly used tool in data mining for deriving a strategy to reach a particular goal, it is also widely used in machine learning. Once we have completed modelling the decision tree classifier, we can use the trained model to predict, for example, whether a balance scale tips to the right, tips to the left, or stays balanced.
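As a small hedged example of the attribute-based splitting described above, a shallow scikit-learn tree can be fitted on toy data and its learned rules printed:

import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = np.where(X[:, 0] > 5, 100.0, 20.0) + rng.normal(0, 1, 200)

tree = DecisionTreeRegressor(max_depth=2, random_state=0)
tree.fit(X, y)
# The printed rules show the splits on the most significant attribute.
print(export_text(tree, feature_names=['feature_0', 'feature_1']))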
Random Forest is a great algorithm to train early in the model development process, to see how it performs, and it is hard to build a "bad" Random Forest because of its simplicity. This algorithm is also an excellent choice if you need to develop a model in a short amount of time. On top of that, it provides a fairly good indicator of the importance it assigns to your features. Random Forests are very hard to beat in terms of performance, and they can handle many different feature types, such as binary, categorical, and numerical. Overall, Random Forest is a (mostly) fast, simple, and flexible tool, though it has its limitations. Random forests are an ensemble learning method for classification, regression, and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Random decision forests correct for decision trees' habit of overfitting to their training set.
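Since the feature-importance indicator is highlighted above, a brief hedged sketch of reading it off a fitted forest (toy data, illustrative only):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = np.where(X[:, 0] > 5, 100.0, 20.0) + rng.normal(0, 1, 200)

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X, y)
# feature_0 defines the split in the toy data, so it should dominate.
for name, imp in zip(['feature_0', 'feature_1'], forest.feature_importances_):
    print(f'{name}: {imp:.3f}')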
CHAPTER 8
ALGORITHM
Here, the two algorithms used are Random Forest and LSTM (Long Short-Term Memory). We use the most basic versions of both so that they can be compared.
The Random Forest and LSTM configurations are the same as those detailed in Chapter 6 (Modules): the RandomForestRegressor with its default parameters, and the single-layer LSTM network with one dropout layer and one dense output layer.
CHAPTER 9
SAMPLE CODE
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load.
# Any results you write to the current directory are saved as output.
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import gc; gc.enable()
from sklearn import preprocessing, linear_model, metrics
import matplotlib.pyplot as plt
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list
# the files in the input directory.
from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

dtypes = {'id':'int64', 'item_nbr':'int32', 'store_nbr':'int8', 'onpromotion':str}
data = {
    'tra': pd.read_csv('../input/train.csv', dtype=dtypes, parse_dates=['date']),
    'tes': pd.read_csv('../input/test.csv', dtype=dtypes, parse_dates=['date']),
    'ite': pd.read_csv('../input/items.csv'),
    'sto': pd.read_csv('../input/stores.csv'),
    'trn': pd.read_csv('../input/transactions.csv', parse_dates=['date']),
    'hol': pd.read_csv('../input/holidays_events.csv', dtype={'transferred':str}, parse_dates=['date']),
    'oil': pd.read_csv('../input/oil.csv', parse_dates=['date']),
}
train = data['tra']  # optionally filter, e.g. (data['tra']['date'].dt.month == 8) & (data['tra']['date'].dt.day > 15)
test = data['tes']   # optionally filter, e.g. (data['tes']['date'].dt.month == 8) & (data['tes']['date'].dt.day > 15)
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
# sklearn.cross_validation was removed in newer scikit-learn versions;
# its helpers now live in sklearn.model_selection.
from sklearn.model_selection import train_test_split

rf = RandomForestRegressor(max_features="auto", min_samples_leaf=50,
                           n_estimators=100, random_state=50, oob_score=True)
# X_train/X_test, y_train/y_test, and the per-item weights W_train/W_test are
# assumed to be prepared during preprocessing (perishable items weighted higher).
rf.fit(X_train, y_train)
print('RF accuracy: TRAINING', rf.score(X_train, y_train, W_train))
print('RF accuracy: TESTING', rf.score(X_test, y_test, W_test))
print("feature Importance", rf.feature_importances_)
yhat1 = rf.predict(X_test)
print('NWRMSLE RF', NWRMSLE(y_test, yhat1, W_test.values))
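The NWRMSLE helper used above is the competition's Normalized Weighted Root Mean Squared Logarithmic Error; it is assumed to have been defined earlier in the original notebook. A sketch following the published formula:

import numpy as np

def NWRMSLE(y_true, y_pred, weights):
    # Normalized Weighted RMSLE: weights emphasise perishable items.
    y_true = np.clip(np.asarray(y_true, dtype=float), 0, None)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), 0, None)
    sq_log_err = (np.log1p(y_pred) - np.log1p(y_true)) ** 2
    return np.sqrt(np.sum(weights * sq_log_err) / np.sum(weights))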
"kernelspec": {
"name": "python3",
"display_name": "Python 3 (ipykernel)",
"language": "python"
},
"language_info": {
"name": "python",
"version": "3.9.12",
"mimetype": "text/x-python",
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"pygments_lexer": "ipython3",
"nbconvert_exporter": "python",
"file_extension": ".py"
LSTM
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import gc; gc.enable()
from sklearn import preprocessing, linear_model, metrics
import matplotlib.pyplot as plt

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list
# the files in the input directory.
from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

dtypes = {'id':'int64', 'item_nbr':'int32', 'store_nbr':'int8', 'onpromotion':str}
data = {
    'tra': pd.read_csv('../input/train.csv', dtype=dtypes, parse_dates=['date']),
    #'tes': pd.read_csv('../input/test.csv', dtype=dtypes, parse_dates=['date']),
    'ite': pd.read_csv('../input/items.csv'),
    #'sto': pd.read_csv('../input/stores.csv'),
    #'trn': pd.read_csv('../input/transactions.csv', parse_dates=['date']),
    #'hol': pd.read_csv('../input/holidays_events.csv', dtype={'transferred':str}, parse_dates=['date']),
    #'oil': pd.read_csv('../input/oil.csv', parse_dates=['date']),
}
CHAPTER 10
SAMPLE OUTPUT
Sales over time: It is useful to understand how sales are distributed over the year, month, and day in order to understand the effect of time on sales.
Sales per year: the per-year chart shows that sales increase every year, while the maximum sales occur in the month of July.
The chart of sales by day of the month indicates that fewer sales happen on the first and last days of each month.
The day-of-week chart indicates that the weekend does not have much of an effect on sales.
CHAPTER 11
CONCLUSION
Sales forecasting plays a vital role in every field of the business sector. With the help of sales forecasts, sales revenue analysis provides the details needed to estimate both revenue and income. Different types of machine learning techniques, such as Support Vector Regression, Gradient Boosting Regression, Simple Linear Regression, and Random Forest Regression, have been evaluated on food sales data to find the critical factors that influence sales and to provide a solution for forecasting them. After evaluating metrics such as accuracy, mean absolute error, and max error, Random Forest Regression is found to be the most appropriate algorithm for the collected data, thus fulfilling the aim of this project. The visualizations of actual vs. predicted values show that Random Forest fits the testing data better than LSTM does.
This project was taken from a Kaggle competition when I was a novice at such competitions, but I pursued it out of interest and with the sole objective of building Kaggle projects. I went through several phases in this project, described below, and overall I found it very difficult for a novice to compete.
Based on the analysis, I came to know that this data has two aspects: one is causality and the other is time dependency. Based on that, I decided to propose which would be the better forecasting model, selecting Random Forest for causality-based forecasting and LSTM for time-series forecasting.
I decided to reduce the data and use only a portion of it, and experimented a lot with many combinations of data that looked meaningful. Finally, I selected one best-selling product to predict over time and based on features.
Apart from the experiments with data, the hardest part of this project was training the LSTM. I tried many ways to reduce overfitting and underfitting, which took up most of the experiments in this project.
Finally, I ended up with reasonably good models with both LSTM and Random Forest, and found that this data has a lot of causality, which is why time-series forecasting underperforms.
FUTURE WORK
This project is about understanding the application of two different methods for predicting sales. One is Random Forest, a state-of-the-art ensemble (bagging) algorithm based on decision trees, used here to check causality-based prediction. In these experiments, I have used it for one single product, and it performs quite well, better than LSTM. Random Forest could be improved further with k-fold cross-validation for better parameter tuning, as sketched below.
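A hedged sketch of that k-fold tuning with scikit-learn's GridSearchCV; the parameter grid is illustrative, not the project's actual search space, and X_train/y_train are assumed to be prepared as before:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [50, 100, 200],
    'min_samples_leaf': [10, 50, 100],
    'max_features': ['sqrt', 1.0],
}
search = GridSearchCV(RandomForestRegressor(random_state=50), param_grid,
                      cv=5, scoring='neg_mean_squared_error')
search.fit(X_train, y_train)          # 5-fold cross-validated grid search
print(search.best_params_, search.best_score_)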
LSTM is used with a lag of 7 observations, which again needs improvement. With the availability of higher-end resources, we could build a multivariate time series in LSTM. During the experiments I observed that LSTM improves with more data and is quite sensitive to batch size, but tuning both of these needs a huge amount of time and processing power. I tried many combinations of LSTM layers and nodes without seeing improvement, but experimenting with a CNN here might be helpful in learning the causality of such sales data.
In both experiments one can also improve on the amount of data, as a total of 5 GB is available to make the model learn better. I have not tried Facebook Prophet here, but it is also considered one of the good candidates for time-series forecasting, as in the sketch below.
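A minimal Prophet sketch, assuming a daily_sales DataFrame with date and unit_sales columns (hypothetical names); Prophet expects the columns to be renamed 'ds' and 'y':

from prophet import Prophet

df = daily_sales.rename(columns={'date': 'ds', 'unit_sales': 'y'})
m = Prophet()
m.fit(df)
future = m.make_future_dataframe(periods=30)   # forecast 30 days ahead
forecast = m.predict(future)
print(forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail())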
REFERENCES
1. https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average
2. https://www.kaggle.com/c/favorita-grocery-sales-forecasting#evaluation
3. https://en.wikipedia.org/wiki/Random_forest
4. https://github.com/llSourcell/LSTM_Networks/blob/master/LSTM%20Demo.ipynb
5. H. Yu, O. G. Garrod, and P. G. Schyns, "Perception-driven facial expression synthesis," Computers & Graphics, vol. 36, no. 3, pp. 152–162, 2012.
6. S. Oh, J. Bailenson, N. Krämer, and B. Li, "Let the avatar brighten your smile: Effects of enhancing facial expressions in virtual environments," PLoS ONE, vol. 11, no. 9, p. e0161794, 2016.
7. J. R. Williamson, MIT Lincoln Laboratory, "Detecting Depression using Vocal, Facial and Semantic Communication Cues," AVEC'16, October 16, 2016, Amsterdam, Netherlands, ACM, ISBN 978-1-4503-4516-3/16/10.
8. V. Surakka and J. K. Hietanen, "Facial and emotional reactions to Duchenne and non-Duchenne smiles," International Journal of Psychophysiology, vol. 29, no. 1, pp. 23–33, 1998.
9. C. L. Lisetti and D. J. Schiano, "Automatic Facial Expression Interpretation: Where Human-Computer Interaction, Artificial Intelligence and Cognitive Science Intersect," Pragmatics and Cognition (Special Issue on Facial Information Processing: A Multidisciplinary Perspective), vol. 8, no. 1, pp. 185–235, 2000.
10. "The simulation of smiles (SIMS) model: Embodied simulation and the meaning of facial expression," Behavioral and Brain Sciences, vol. 33, no. 6, pp. 417–433, 2010.
11. Zhang, G. P., "Business forecasting with artificial neural networks: An overview," Neural Networks in Business Forecasting, 2004, pp. 1–22.
12. Shen, S., Jiang, H., and Zhang, T., "Stock market forecasting using machine learning algorithms," Department of Electrical Engineering, Stanford University, Stanford, CA, 2012, pp. 1–5.