Booze'd Trees—Beer Sales Forecasting
Dan Zylberglejd
dzylber@cs.stanford.edu
We predict sales volumes of beer brands by unit size and region in Mexico during 2016 for the brewery conglomerate Anheuser-Busch InBev. Augmenting the data with environmental features and surveying suitable models leads us to an ensemble method that combines lasso and bagging with ABI's in-house predictions, achieving a 9.2% improvement in test RMSE over ABI's model. We additionally present an optimal reallocation strategy for when forecasts change after production has finished.
The Challenge
Anheuser-Busch InBev is the largest beer company in the world. It recently announced one of the largest deals in history, its acquisition of SABMiller, and it controls some of the main breweries in the world, such as Interbrew in Belgium and Anheuser-Busch in the US. Our project focuses on Grupo Modelo, its Mexican branch, which produces, among others, beers such as Corona, Modelo and Victoria.

Figure 1: Anheuser-Busch InBev logo.

Mexico is a substantial beer consumer, and a major challenge posed by ABI's new CEO is to improve the accuracy of its two-month-ahead sales forecasts. There is not much historical data on past forecasts, but an internal analysis indicates that the consolidated monthly percentage error for Mexico is around 15–20%. To reduce unnecessary waste and ensure customer demand is met, Modelo needs to decrease this number, and we implement a Machine Learning approach to tackle this challenge. We believe it can generate a considerable impact, not only from a profit perspective but also as a step towards pushing a data-driven culture into the business.
The Data
For each trio of UEN (region), unit size (bottle, can, ...) and brand, ABI provides us with monthly sales in hectolitres since January 2013, as well as discounts and other pricing measures. For each example, we use past monthly sales going back up to one year.

Figure 2: Monthly change of sales volume per UEN-Brand-Size trio on a log scale.
We also scrape the web in order to augment our dataset with temperature, holidays, soccer calendars and economic inputs (inflation, GDP). We carefully align the data with the information we would have had "at the moment of training": since we are predicting 2 months ahead, some features need to carry a 2-month delay (such as past sales), while others do not, since they are known in advance (holidays, soccer calendar).

Figure: Chronological setup. Training on 2014–2015 (19424 samples) with a 4:1 train/validation split; testing on January–August 2016 (6933 samples).
Finally, we perform feature engineering on the main features. We incorporate the difference between consecutive months' sales, and we also aggregate past sales by UEN, unit size and brand, as well as by all their pairwise combinations (6 groupings in total).
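To make the alignment and feature-engineering steps concrete, here is a minimal pandas sketch. The column names (uen, brand, size, date, volume) and the long data format are illustrative assumptions, not ABI's actual schema.

    import pandas as pd

    # Hypothetical long-format frame: one row per (uen, brand, size, month),
    # with the monthly sales volume in hectolitres in column "volume".
    df = df.sort_values(["uen", "brand", "size", "date"])
    by_trio = df.groupby(["uen", "brand", "size"])["volume"]

    # Past sales enter with a 2-month delay (we forecast 2 months ahead),
    # going back up to one year.
    for k in range(2, 14):
        df[f"volume_lag_{k}"] = by_trio.shift(k)

    # Difference between consecutive (already delayed) monthly sales.
    df["volume_diff"] = df["volume_lag_2"] - df["volume_lag_3"]

    # Aggregate delayed sales by UEN, unit size, brand and their pairwise
    # combinations (6 groupings in total).
    groupings = [["uen"], ["size"], ["brand"],
                 ["uen", "size"], ["uen", "brand"], ["size", "brand"]]
    for keys in groupings:
        df["sales_by_" + "_".join(keys)] = (
            df.groupby(keys + ["date"])["volume_lag_2"].transform("sum"))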
The final dataset then consists of 25694 examples with 149 features and 1 numerical output. We split it into training, validation and test sets following chronological order. We run all of our experiments on the training set, followed by occasional checks on the validation set. The test set was only used for the final model assessment.
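A sketch of the chronological split, assuming df carries a datetime column named date (again a hypothetical name):

    # Chronological split: 2014-2015 for training/validation, 2016 for testing.
    train_val = df[df["date"] < "2016-01-01"].sort_values("date")
    test = df[df["date"] >= "2016-01-01"]

    # 4:1 train/validation split that also respects time order (no shuffling).
    cut = int(0.8 * len(train_val))
    train, val = train_val.iloc[:cut], train_val.iloc[cut:]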
The Metric
The industrial engineering and supply-chain management literature describes the so-called bullwhip effect². This phenomenon occurs as inaccurate forecasts propagate backwards through a supply chain, leading to inefficiencies in inventory management that grow superlinearly. To account for this effect, we agreed with ABI to use a metric that penalizes errors more than linearly when evaluating model performance. We pick Root Mean Squared Error (RMSE).

Figure 3: Illustration of the bullwhip effect and the outsized distortions as we move up the supply chain.

² Dean C. Chatfield, Jeon G. Kim, Terry P. Harrison, and Jack C. Hayya. The bullwhip effect–impact of stochastic lead time, information quality, and information sharing: a simulation study. Production and Operations Management, 13(4):340–353, 2004.

We also report other metrics which ABI executives are more used to. However, it is important to stress that, from a profit-oriented perspective, RMSE is the most important evaluation criterion. In particular, their use of Mean Absolute Percentage Error (MAPE) could be misleading: it grows only linearly in the numerator, and the denominator inflates errors for small sales volumes, which goes in the opposite direction of maximizing profit.
\[
\text{RMSE:}\ \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{Y}_i - Y_i\right)^2}
\qquad
\text{MAPE:}\ \frac{1}{n}\sum_{i=1}^{n}\left|\frac{\hat{Y}_i - Y_i}{Y_i}\right|
\]
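For concreteness, the two metrics as plain NumPy functions, a direct transcription of the formulas above:

    import numpy as np

    def rmse(y_true, y_pred):
        # Quadratic penalty on errors, in line with the bullwhip argument.
        y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
        return np.sqrt(np.mean((y_pred - y_true) ** 2))

    def mape(y_true, y_pred):
        # Linear in the error; the denominator inflates errors on small sales.
        y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
        return np.mean(np.abs((y_pred - y_true) / y_true))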
The Model
We start our modelling with a baseline, Constant YOY, which predicts for each trio (UEN-unit size-brand) the same volume as the previous year.

We then move to linear models, applying linear regression and then adding an L1 penalty to its coefficients to obtain a lasso model. It is worth mentioning that our linear regression had high variance due to multicollinearity (especially among economic inputs that changed only 3 times per year and were highly correlated: GDP, % change in GDP, touristic GDP). Lasso took care of it and produced good results with low variance, achieving similar training and validation errors. However, there was a large bias, since the training error was still high.

Figure 4: Convergence of training and test error as the share of the training set used grows to the full data set (linear regression).
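The report does not name the toolkit used; as an illustrative sketch only, the lasso step could look like the following in scikit-learn, where X_train and y_train stand for the engineered feature matrix and the sales volumes (hypothetical names):

    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LassoCV

    # L1-penalized linear regression; standardizing matters because the penalty
    # is scale-sensitive and our features (sales, GDP, dummies) differ widely.
    lasso = make_pipeline(StandardScaler(), LassoCV(cv=5))
    lasso.fit(X_train, y_train)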
To decrease our bias, we explore more flexible models such as tree-based ones, given their success in Kaggle competitions³ and other applications. For boosting, we tuned the max depth of each tree and the number of trees using 5-fold cross-validation (we set the step size to 0.01). Surprisingly for us, it does not perform well on the validation set. We applied random forests with the main hyperparameter (the number of variables considered at each split) set to its default, with some success. We then tuned it using the out-of-bag error (faster than cross-validation) and, surprisingly, the optimal number of allowed variables is the largest possible.

³ Kaggle. Forecast sales using store, promotion, and competitor data, September 2015. URL https://www.kaggle.com/c/rossmann-store-sales.

This means we are not introducing any randomness at each split and are thus implementing bagging, which led to our best results so far. We believe the common reason why boosting and random forests did not perform well is the high-order interaction effects in our data. Since each trio behaves differently from the others, we believe we have a strong interaction effect of order at least 3. For high-order interaction effects, boosting is not well suited when using simple base learners of low depth, and the restriction on which variables can be picked at each split in a random forest does not allow it to fully capture those interactions either.

Figure 5: LASSO coefficient paths. The largest coefficients are associated with brands and months.
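As an illustrative sketch (the original toolkit is again not specified), tuning the number of candidate variables per split via the out-of-bag error could look as follows; allowing all 149 features turns the random forest into plain bagging. The candidate grid and variable names are assumptions.

    from sklearn.ensemble import RandomForestRegressor

    best_m, best_oob = None, -float("inf")
    for m in (12, 40, 80, 120, 149):          # 149 = all features -> plain bagging
        rf = RandomForestRegressor(n_estimators=500, max_features=m,
                                   oob_score=True, n_jobs=-1, random_state=0)
        rf.fit(X_train, y_train)
        if rf.oob_score_ > best_oob:          # oob_score_ is OOB R^2 (higher is better)
            best_m, best_oob = m, rf.oob_score_

    # Refit with the chosen number of variables per split.
    bagging = RandomForestRegressor(n_estimators=500, max_features=best_m,
                                    n_jobs=-1, random_state=0).fit(X_train, y_train)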
Given that we obtained good performance with both lasso and bagging, we decide to combine the two. We opt to ensemble them in a naive way, by simply averaging their predictions (which acts like a uniform prior on the model weights, leading to a regularization effect). Since those functions define very different surfaces, we expect a good ensemble effect, with errors cancelling each other out (decreasing our overall bias) and less sensitivity to any particular model (decreasing our variance). We report the validation error for all the mentioned models and pick the ensemble of lasso with bagging as our final one. We are now ready to apply it to the test set.

Figure 6: Predicted over actual hectolitres on the validation set, on a log-log plot, for Constant YOY, linear regression, LASSO, random forest, boosting, bagging and the two ensembles (RF+LASSO and Bagging+LASSO), with predictions plotted against the ground truth. A perfect prediction would display as a 45° line. Note that we penalize with the square of the error, so good predictors should be sharp in the top right. LASSO performs well even though its predictions spread out in the bottom left corner, as those values are the least relevant for the resulting RMSE.
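The naive ensemble is then just an unweighted average of the two predictions; a one-line sketch, assuming the lasso and bagging estimators from the sketches above and a validation feature matrix X_val:

    # Equal-weight average of the two models' validation predictions.
    pred_ensemble = (lasso.predict(X_val) + bagging.predict(X_val)) / 2.0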
Figure 7: Performance of the different models on the validation set (bar length proportional to MSE; numbers are RMSE in hectolitres): Baseline (YOY) 1877, Linear Regression 1601, Boosting 1564, Random Forest 1512, LASSO 1492, Bagging 1476, RF + LASSO 1455, Bagging + LASSO 1437.
The Results
We run a complete backtest on the test set (January–August 2016). More specifically, for each month that we are forecasting, we use all data available up to 2 months before it, retrain our model and make the predictions. We do this on a rolling-window basis, so by the end of the backtest we have predictions for every month in the test set, exactly as if we had plugged our model into ABI's routine and made those forecasts at the time. This allows us to fully trust our results and generate reliable estimates of future test errors (possibly even pessimistic ones, since we will naturally have more data in the future).
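The backtest loop, sketched below with a hypothetical fit_ensemble helper that refits the lasso+bagging ensemble on whatever data is available at each cutoff; feature_cols stands for the list of predictors (both names are assumptions):

    import pandas as pd

    predictions = []
    for month in pd.date_range("2016-01-01", "2016-08-01", freq="MS"):
        cutoff = month - pd.DateOffset(months=2)      # only data known 2 months ahead
        history = df[df["date"] <= cutoff]
        current = df[df["date"] == month]
        model = fit_ensemble(history)                  # hypothetical helper
        predictions.append(model.predict(current[feature_cols]))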
Table 1 summarizes our results quantitatively.

Metric          ABI      Ours     Both
RMSE            1883     1743     1709
MAPE            3.8%     4.8%     4.2%
MAPE by UEN     8.4%     8.3%     7.6%
MAPE by Brand   12.2%    13.0%    11.7%
MAPE by Size    8.0%     9.8%     8.3%

Table 1: Performance on the test set for ABI's model, the model we developed (ensemble of bagging and LASSO based on the features described above), and a model that simply averages our predictions with ABI's. We report Root Mean Squared Error as well as different aggregations of Mean Absolute Percentage Error.

Some important conclusions can be drawn:

1. The model that ABI is currently developing achieves an RMSE of 1883, which is believed to be already better than their current system. Our model achieves 1743, a substantial improvement over it. When we combine our model with ABI's through simple averaging, the resulting model achieves 1709, which corresponds to a 9.2% overall improvement. We therefore consider our approach successful and recommend implementing it.

2. When analyzing other metrics, our model appears slightly inferior to ABI's. It outperforms ABI on MAPE aggregated by UEN, and underperforms it when aggregating by brand or unit size. We restate that, even though it is more interpretable, this measure is not representative of potential profit and is also distorted by aggregations with low volume.
3. Even though we do not recommend MAPE as the optimal metric, we still achieve good results on the average monthly MAPE over the test set: our model achieves less than 5% error, ABI's achieves less than 4%, and the combined one around 4%.

4. Other important advantages of the model are worth mentioning: it deals well with missing data (the bagging part of the model uses surrogate splits for this purpose); each monthly model can be trained in around 1 hour on a standard local CPU; and it is flexible enough to incorporate new types of data, such as weather/humidity (not used yet) or even analysts' forecasts.

Figure 8: Actual sales (gray) and 2-month leading predictions (red) on the validation and test sets (separated by the dashed line), for the two most consumed brands in Mexico: Corona Extra and Victoria.

The Improvement

From an Artificial Intelligence perspective, we extend the model to
adjust itself dynamically. Predictions made 1 month in advance tend to be more accurate than those made 2 months ahead. Since the production plan starts 2 months in advance, ABI cannot change the amount of beer and bottles produced; however, they can reallocate it differently across the UENs. This can enhance logistics and even sales/marketing planning. So each month we train a model to predict sales one month in advance, and adjust our previous predictions under the constraint that the sum of the hectolitres for each pair of brand and unit size remains constant. We first analyzed it from a CSP perspective, but it turns out this optimization can be solved analytically through Lagrange multipliers (see the appendix).
Running a backtest in the same format as described previously, we obtain an improvement from an RMSE of 1743 to 1701 for our model. Curiously, ABI's model gets worse, its RMSE increasing from 1883 to 1966. The combination of both models also improves, going from 1709 to 1692. Thus, again, we recommend implementing the dynamic regional reallocation.
The Next Steps
This project explored an application of Machine Learning to an industrial field, food & beverages. It shows that a data-driven approach allows for enhancement of business performance, and that there is still room for other applications and improvements. Some possible extensions of our model, which came out of meetings with senior management at Grupo Modelo, are:
1. Forecast exports and international markets.
2. Make longer term predictions, such as six months or a full year.
3. Predict sales for new brands or unit sizes that have not yet been released.
References
Dean C. Chatfield, Jeon G. Kim, Terry P. Harrison, and Jack C. Hayya. The bullwhip effect–impact of stochastic lead time, information quality, and information sharing: a simulation study. Production and Operations Management, 13(4):340–353, 2004.

Da Veiga et al. A comparison between the Holt-Winters and ARIMA models. WSEAS Transactions, 11:608–614, 2014.

Kaggle. Store sales forecasting, February 2014. URL https://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting/data.

Kaggle. Forecast sales using store, promotion, and competitor data, September 2015. URL https://www.kaggle.com/c/rossmann-store-sales.
Appendix
Call $q_{\text{old}}^{(1)}, q_{\text{old}}^{(2)}, \ldots, q_{\text{old}}^{(n)}$ the old estimates of the quantity of beer to be sold in each UEN for a given pair (brand, unit size). Call $q_{\text{new}}^{(1)}, q_{\text{new}}^{(2)}, \ldots, q_{\text{new}}^{(n)}$ the new estimates for the same pair (brand, unit size), and call $r^{(1)}, \ldots, r^{(n)}$ our final estimates. Our goal is to solve the following optimization problem:

\[
\begin{aligned}
\underset{r^{(1)}, \ldots, r^{(n)}}{\text{minimize}} \quad & f(r^{(1)}, \ldots, r^{(n)}) = \sum_{i=1}^{n} \left(r^{(i)} - q_{\text{new}}^{(i)}\right)^2 \\
\text{subject to} \quad & g(r^{(1)}, \ldots, r^{(n)}) = \sum_{i=1}^{n} r^{(i)} - \sum_{i=1}^{n} q_{\text{old}}^{(i)} = 0
\end{aligned}
\]
Calling $K_{\text{new}} = \sum_{i=1}^{n} q_{\text{new}}^{(i)}$ the total hectolitres in the new estimates and $K_{\text{old}} = \sum_{i=1}^{n} q_{\text{old}}^{(i)}$ the total previously predicted by the 2-month-ahead model (and therefore already produced), the constraint requires $\sum_{i=1}^{n} r^{(i)} - K_{\text{old}} = 0$. When we first designed this task, we thought about using Constraint Satisfaction Problem tools. However, it turns out that this optimization problem is convex and can be solved analytically using Lagrange multipliers. Define:

\[
\mathcal{L}(r^{(1)}, \ldots, r^{(n)}, \lambda) = f(r^{(1)}, \ldots, r^{(n)}) - \lambda\, g(r^{(1)}, \ldots, r^{(n)})
\]
From the method of Lagrange multipliers, we set the gradient of the Lagrangian to zero (absorbing the constant factor of 2 into $\lambda$):

\[
\nabla_{r^{(1)}, \ldots, r^{(n)}, \lambda}\, \mathcal{L}(r^{(1)}, \ldots, r^{(n)}, \lambda) = 0 \;\Rightarrow\;
\begin{cases}
r^{(i)} - q_{\text{new}}^{(i)} = \lambda, & i = 1, \ldots, n \\[4pt]
\displaystyle\sum_{i=1}^{n} r^{(i)} = K_{\text{old}}
\end{cases}
\tag{1}
\]

From the last equation in (1) we get $K_{\text{new}} + n\lambda = K_{\text{old}} \Rightarrow \lambda = \frac{K_{\text{old}} - K_{\text{new}}}{n}$. Solving for an arbitrary $r^{(i)}$, we get the updated estimates:

\[
r^{(i)} = q_{\text{new}}^{(i)} + \frac{K_{\text{old}} - K_{\text{new}}}{n}
\]
This makes intuitive sense: the solution reallocates the estimated excess or shortage of beer/bottles equally over all the UENs while staying as close as possible to the new, updated predictions.
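A direct transcription of this closed-form update, where q_new is the vector of 1-month-ahead estimates for one (brand, unit size) pair and k_old is the total already scheduled for production (illustrative names):

    import numpy as np

    def reallocate(q_new, k_old):
        # Shift every UEN's new estimate by the same amount so that the total
        # matches the volume already produced for this (brand, unit size) pair.
        q_new = np.asarray(q_new, dtype=float)
        return q_new + (k_old - q_new.sum()) / len(q_new)

    # Example: 250 hl newly forecast, 260 hl already produced -> +10/3 hl per UEN.
    print(reallocate([120.0, 80.0, 50.0], k_old=260.0))   # [123.33 83.33 53.33]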