SC Report
Objectives
Conduct a clustering analysis of SKUs to identify distinct groups based on sales patterns and other relevant features.
Determine the key factors influencing each SKU cluster.
In [ ]: import warnings
import folium
import geopandas as gpd
import matplotlib.pyplot as plt
import pandas as pd
import squarify
from matplotlib.ticker import FuncFormatter, PercentFormatter
from shapely.geometry import Point
warnings.filterwarnings("ignore")
pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)
In [ ]: df = pd.read_csv(
    filepath_or_buffer="data/DataCoSupplyChainDataset.csv", encoding="latin-1"
)
In [ ]: df.shape
In [ ]: df.head()
Out[ ]: [df.head() preview — first rows of the raw data. Example row: Type = CASH, Days for shipping (real) = 4,
        Days for shipment (scheduled) = 4, Benefit per order = -247.779999, Sales per customer = 309.720001,
        Delivery Status = Shipping on time, Late_delivery_risk = 0, Category Id = 73, Category Name = Sporting Goods,
        Customer City = San Jose, Customer Country = EE. UU., ...]
In [ ]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 180519 entries, 0 to 180518
Data columns (total 53 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Type 180519 non-null object
1 Days for shipping (real) 180519 non-null int64
2 Days for shipment (scheduled) 180519 non-null int64
3 Benefit per order 180519 non-null float64
4 Sales per customer 180519 non-null float64
5 Delivery Status 180519 non-null object
6 Late_delivery_risk 180519 non-null int64
7 Category Id 180519 non-null int64
8 Category Name 180519 non-null object
9 Customer City 180519 non-null object
10 Customer Country 180519 non-null object
11 Customer Email 180519 non-null object
12 Customer Fname 180519 non-null object
13 Customer Id 180519 non-null int64
14 Customer Lname 180511 non-null object
15 Customer Password 180519 non-null object
16 Customer Segment 180519 non-null object
17 Customer State 180519 non-null object
18 Customer Street 180519 non-null object
19 Customer Zipcode 180516 non-null float64
20 Department Id 180519 non-null int64
21 Department Name 180519 non-null object
22 Latitude 180519 non-null float64
23 Longitude 180519 non-null float64
24 Market 180519 non-null object
25 Order City 180519 non-null object
26 Order Country 180519 non-null object
27 Order Customer Id 180519 non-null int64
28 order date (DateOrders) 180519 non-null object
29 Order Id 180519 non-null int64
30 Order Item Cardprod Id 180519 non-null int64
31 Order Item Discount 180519 non-null float64
32 Order Item Discount Rate 180519 non-null float64
33 Order Item Id 180519 non-null int64
34 Order Item Product Price 180519 non-null float64
35 Order Item Profit Ratio 180519 non-null float64
36 Order Item Quantity 180519 non-null int64
37 Sales 180519 non-null float64
38 Order Item Total 180519 non-null float64
39 Order Profit Per Order 180519 non-null float64
40 Order Region 180519 non-null object
41 Order State 180519 non-null object
42 Order Status 180519 non-null object
43 Order Zipcode 24840 non-null float64
44 Product Card Id 180519 non-null int64
45 Product Category Id 180519 non-null int64
46 Product Description 0 non-null float64
47 Product Image 180519 non-null object
48 Product Name 180519 non-null object
49 Product Price 180519 non-null float64
50 Product Status 180519 non-null int64
51 shipping date (DateOrders) 180519 non-null object
52 Shipping Mode 180519 non-null object
dtypes: float64(15), int64(14), object(24)
memory usage: 73.0+ MB
In [ ]: df.duplicated().any()
Out[ ]: False
In [ ]: market_sales = df.groupby("Market")["Sales"].sum().sort_values(ascending=False)
department_sales = (
df.groupby("Department Name", observed=False)["Sales"]
.sum()
.sort_values(ascending=True)
)
plt.subplots_adjust(wspace=0.5)
plt.show()
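The plotting code for this cell is not fully visible in the extract; a minimal sketch of the two side-by-side bar charts it appears to produce (figure size and colors are assumptions) is:
# Sketch (assumptions noted above)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

market_sales.plot(kind="bar", ax=ax1, color="tab:blue")
ax1.set_title("Total Sales by Market")
ax1.set_ylabel("Sales")

department_sales.plot(kind="barh", ax=ax2, color="tab:green")
ax2.set_title("Total Sales by Department")
ax2.set_xlabel("Sales")

plt.subplots_adjust(wspace=0.5)
plt.show()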
Europe and Latin America (LATAM) seem to be larger markets compared to all other regions combined. Let's look at it on an annual basis.
To maintain consistency, we will consider only 2017 for the inventory management analysis.
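The cell that creates df_ (the 2017 subset used from here on) is not visible in the extract; a minimal sketch of what it likely does, assuming the Year column is derived from the order date, is:
# Sketch (assumption): restrict the analysis to orders placed in 2017
df["order date (DateOrders)"] = pd.to_datetime(df["order date (DateOrders)"])
df["Year"] = df["order date (DateOrders)"].dt.year
df_ = df[df["Year"] == 2017].copy()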
In [ ]: market_sales_2017 = df_.groupby("Market")["Sales"].sum().sort_values(ascending=False)
department_sales_2017 = (
df_.groupby("Department Name", observed=False)["Sales"]
.sum()
.sort_values(ascending=True)
)
top_markets_2017 = dict(
sorted(market_sales_2017.items(), key=lambda item: item[1], reverse=True)[:2]
)
top_department_2017 = "Fan Shop"
plt.subplots_adjust(wspace=0.5)
plt.show()
LATAM and Europe appear to be the largest markets in terms of sales, followed by Pacific Asia, Africa, and USCA.
In [ ]: market_sales_idx = (
df_.groupby("Market")["Sales"].sum().sort_values(ascending=False).index
)
total_sales_by_market_department = (
df_.groupby(["Market", "Department Name"])["Sales"].sum().reset_index()
)
top5_departments = (
total_sales_by_market_department.groupby("Market")
.apply(lambda x: x.nlargest(10, "Sales"))
.reset_index(drop=True)
)
ax.set_title(market)
ax.yaxis.set_major_formatter(FuncFormatter(currency_formatter))
ax.spines[["top", "right"]].set_visible(False)
top5_departments = (
total_sales_by_market_category.groupby("Market")
.apply(lambda x: x.nlargest(10, "Sales"))
.reset_index(drop=True)
)
ax.set_title(market)
ax.yaxis.set_major_formatter(FuncFormatter(currency_formatter))
ax.spines[["top", "right"]].set_visible(False)
In [ ]: market_order_idx = (
df_.groupby("Market")["Order Item Quantity"]
.sum()
.sort_values(ascending=False)
.index
)
total_order_by_market_product = (
df_.groupby(["Market", "Product Name"])["Order Item Quantity"].sum().reset_index()
)
top10_product = (
total_order_by_market_product.groupby("Market")
.apply(lambda x: x.nlargest(10, "Order Item Quantity"))
.reset_index(drop=True)
)
ax.set_title(market)
ax.yaxis.set_major_formatter(FuncFormatter(quantity_formatter))
ax.spines[["top", "right"]].set_visible(False)
The 'Perfect Fitness Perfect Rip Deck' is the top-selling product in every market, and the LATAM market records the highest order quantities overall.
In [ ]: df_.columns
Out[ ]: Index(['Type', 'Days for shipping (real)', 'Days for shipment (scheduled)',
'Benefit per order', 'Sales per customer', 'Delivery Status',
'Late_delivery_risk', 'Category Name', 'Customer City',
'Customer Country', 'Customer Segment', 'Customer State',
'Customer Street', 'Department Name', 'Latitude', 'Longitude', 'Market',
'Order City', 'Order Country', 'order date (DateOrders)',
'Order Item Discount', 'Order Item Discount Rate',
'Order Item Product Price', 'Order Item Profit Ratio',
'Order Item Quantity', 'Sales', 'Order Item Total',
'Order Profit Per Order', 'Order Region', 'Order State', 'Order Status',
'Product Name', 'Product Price', 'Product Status',
'shipping date (DateOrders)', 'Shipping Mode', 'Year'],
dtype='object')
In [ ]: all_countries_sales.shape
Out[ ]: (129, 2)
In [ ]: world = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))
In [ ]: world.query("name == 'Brazil'")
Out[ ]:     pop_est      continent      name    iso_a3  gdp_md_est  geometry
29          211049527.0  South America  Brazil  BRA     1839758     POLYGON ((-53.37366 -33.76838, -53.65054 -33.2...
The order-country names in the dataset are in Spanish, so they do not match the English names used in the world GeoDataFrame. Let's see which countries don't match.
In [ ]: unmatched_countries = set(all_countries_sales["Order Country"]).difference(
set(world["name"])
)
In [ ]: unmatched_countries
Out[ ]: {'Afganistán',
'Alemania',
'Arabia Saudí',
'Argelia',
'Azerbaiyán',
'Bangladés',
'Barbados',
'Belice',
'Benín',
'Bielorrusia',
'Bosnia y Herzegovina',
'Brasil',
'Bután',
'Bélgica',
'Camboya',
'Camerún',
'Chipre',
'Corea del Sur',
'Costa de Marfil',
'Croacia',
'Dinamarca',
'Egipto',
'Eslovaquia',
'España',
'Filipinas',
'Finlandia',
'Francia',
'Gabón',
'Grecia',
'Guadalupe',
'Guayana Francesa',
'Haití',
'Hong Kong',
'Hungría',
'Irak',
'Irlanda',
'Irán',
'Italia',
'Japón',
'Jordania',
'Kazajistán',
'Kenia',
'Kirguistán',
'Lituania',
'Luxemburgo',
'Líbano',
'Macedonia',
'Malasia',
'Marruecos',
'Martinica',
'Moldavia',
'Myanmar (Birmania)',
'México',
'Noruega',
'Níger',
'Pakistán',
'Panamá',
'Papúa Nueva Guinea',
'Países Bajos',
'Perú',
'Polonia',
'Reino Unido',
'República Checa',
'República Democrática del Congo',
'República Dominicana',
'Ruanda',
'Rumania',
'Rusia',
'Singapur',
'SudAfrica',
'Suecia',
'Suiza',
'Surinam',
'Tailandia',
'Taiwán',
'Trinidad y Tobago',
'Turkmenistán',
'Turquía',
'Ucrania',
'Uzbekistán',
'Yibuti',
'Zimbabue'}
In [ ]: mapping_dict = {
"Afganistán": "Afghanistan",
"Alemania": "Germany",
"Arabia Saudí": "Saudi Arabia",
"Argelia": "Algeria",
"Azerbaiyán": "Azerbaijan",
"Bangladés": "Bangladesh",
"Barbados": "Barbados",
"Baréin": "Bahrain",
"Belice": "Belize",
"Benín": "Benin",
"Bielorrusia": "Belarus",
"Bosnia y Herzegovina": "Bosnia and Herzegovina",
"Botsuana": "Botswana",
"Brasil": "Brazil",
"Bután": "Bhutan",
"Bélgica": "Belgium",
"Camboya": "Cambodia",
"Camerún": "Cameroon",
"Chipre": "Cyprus",
"Corea del Sur": "South Korea",
"Costa de Marfil": "Ivory Coast",
"Croacia": "Croatia",
"Dinamarca": "Denmark",
"Egipto": "Egypt",
"Emiratos Árabes Unidos": "United Arab Emirates",
"Eslovaquia": "Slovakia",
"Eslovenia": "Slovenia",
"España": "Spain",
"Estados Unidos": "United States of America",
"Etiopía": "Ethiopia",
"Filipinas": "Philippines",
"Finlandia": "Finland",
"Francia": "France",
"Gabón": "Gabon",
"Grecia": "Greece",
"Guadalupe": "Guadeloupe",
"Guayana Francesa": "French Guiana",
"Guinea Ecuatorial": "Equatorial Guinea",
"Haití": "Haiti",
"Hong Kong": "Hong Kong",
"Hungría": "Hungary",
"Irak": "Iraq",
"Irlanda": "Ireland",
"Irán": "Iran",
"Italia": "Italy",
"Japón": "Japan",
"Jordania": "Jordan",
"Kazajistán": "Kazakhstan",
"Kenia": "Kenya",
"Kirguistán": "Kyrgyzstan",
"Lesoto": "Lesotho",
"Libia": "Libya",
"Lituania": "Lithuania",
"Luxemburgo": "Luxembourg",
"Líbano": "Lebanon",
"Macedonia": "North Macedonia",
"Malasia": "Malaysia",
"Marruecos": "Morocco",
"Martinica": "Martinique",
"Moldavia": "Moldova",
"Myanmar (Birmania)": "Myanmar",
"México": "Mexico",
"Noruega": "Norway",
"Nueva Zelanda": "New Zealand",
"Níger": "Niger",
"Omán": "Oman",
"Pakistán": "Pakistan",
"Panamá": "Panama",
"Papúa Nueva Guinea": "Papua New Guinea",
"Países Bajos": "Netherlands",
"Perú": "Peru",
"Polonia": "Poland",
"Reino Unido": "United Kingdom",
"República Centroafricana": "Central African Republic",
"República Checa": "Czech Republic",
"República Democrática del Congo": "Democratic Republic of the Congo",
"República Dominicana": "Dominican Republic",
"República de Gambia": "The Gambia",
"República del Congo": "Republic of the Congo",
"Ruanda": "Rwanda",
"Rumania": "Romania",
"Rusia": "Russia",
"Sierra Leona": "Sierra Leone",
"Singapur": "Singapore",
"Siria": "Syria",
"Suazilandia": "Eswatini",
"SudAfrica": "South Africa",
"Sudán": "Sudan",
"Sudán del Sur": "South Sudan",
"Suecia": "Sweden",
"Suiza": "Switzerland",
"Surinam": "Suriname",
"Sáhara Occidental": "Western Sahara",
"Tailandia": "Thailand",
"Taiwán": "Taiwan",
"Tayikistán": "Tajikistan",
"Trinidad y Tobago": "Trinidad and Tobago",
"Turkmenistán": "Turkmenistan",
"Turquía": "Turkey",
"Túnez": "Tunisia",
"Ucrania": "Ukraine",
"Uzbekistán": "Uzbekistan",
"Yibuti": "Djibouti",
"Zimbabue": "Zimbabwe",
}
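The construction of world_sales is not shown in the extract; a minimal sketch, assuming the Spanish country names are mapped to English and the sales are then joined onto the naturalearth GeoDataFrame, is:
# Sketch (assumption): translate country names, then join sales onto the world geometries
all_countries_sales["name"] = all_countries_sales["Order Country"].replace(mapping_dict)
world_sales = world.merge(all_countries_sales, on="name", how="left")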
In [ ]: world_sales.head()
Out[ ]: pop_est continent name iso_a3 gdp_md_est geometry Order Country Sales
# folium.Choropleth(
# geo_data=world_sales,
# name="choropleth",
# data=world_sales,
# columns=["name", "Sales"],
# key_on="feature.properties.name",
# fill_color="YlGn",
# fill_opacity=0.7,
# line_opacity=0.2,
# nan_fill_color="white",
# legend_name="Sales",
# highlight=True,
# ).add_to(m)
# folium.GeoJson(
# data=world_sales,
# style_function=lambda feature: {
# "color":"",
# "weight": 0.2,
# },
# tooltip=folium.GeoJsonTooltip(
# fields=["name", "Sales"],
# aliases=["Country:", "Sales: "],
# labels=True,
# sticky=True,
# style="background-color: white;"
# )
# ).add_to(m)
# m
In [ ]: world_sales.groupby("Order Country")["Sales"].sum().reset_index(
name="Sales"
).sort_values(by="Sales", ascending=False)[:10]
Out[ ]:      Order Country     Sales
    34       France            1.466954e+06
    65       Mexico            1.371471e+06
    37       Germany           1.064684e+06
    110      United Kingdom    8.170683e+05
    15       Brazil            8.019948e+05
    51       Italy             5.422939e+05
    96       Spain             4.334722e+05
    43       Honduras          3.942301e+05
    31       El Salvador       3.758007e+05
    25       Cuba              3.386557e+05
In [ ]: # Group the data by longitude and latitude to count the number of stores at each location
stores_count = (
    df_.groupby(["Longitude", "Latitude"]).size().reset_index(name="Store_Counts")
)
# Build Point geometries from the longitude/latitude pairs
geometry = [
    Point(xy) for xy in zip(stores_count["Longitude"], stores_count["Latitude"])
]
# Add the Point geometries to the stores_count DataFrame as a new column named "point_geometry"
stores_count["point_geometry"] = geometry
In [ ]: # Create a GeoDataFrame from the stores_count DataFrame with Point geometries and EPSG:4326 CRS
stores_gdf = gpd.GeoDataFrame(stores_count, geometry="point_geometry", crs="EPSG:4326")
In [ ]: # Perform a spatial join between the GeoDataFrame of stores within the USA and the GeoDataFrame of USA states
gdf_with_states = gpd.sjoin(stores_gdf_usa, us_states, op="within")
merged_states_counts.plot(
column="Store_Counts",
cmap="YlGn",
linewidth=0.8,
edgecolor="0.8",
legend=True,
ax=ax,
).set_axis_off()
cbar = ax.get_figure().colorbar(ax.collections[0], shrink=0.3)
old_cbar = ax.get_figure().get_axes()[1]
old_cbar.remove()
# Set the title and show the plot
plt.title(
"Distribution of Stores in the United States by State in 2017",
fontdict={"fontsize": "14", "fontweight": "1"},
)
plt.show()
It seems that all the customers are located within the US; however, the orders are shipped worldwide. This indicates that the company
might be an online shop, where stores across the US sell their goods online to customers all over the world.
# folium.Choropleth(
# geo_data=merged_states_counts,
# name="choropleth",
# data=merged_states_counts,
# columns=["STATE_NAME", "Store_Counts"],
# key_on="feature.properties.STATE_NAME",
# fill_color="YlGn",
# fill_opacity=0.7,
# line_opacity=0.2,
# nan_fill_color="white",
# legend_name="Store_Counts",
# highlight=True,
# ).add_to(f)
# f
In [ ]: df_.columns
Out[ ]: Index(['Type', 'Days for shipping (real)', 'Days for shipment (scheduled)',
'Benefit per order', 'Sales per customer', 'Delivery Status',
'Late_delivery_risk', 'Category Name', 'Customer City',
'Customer Country', 'Customer Segment', 'Customer State',
'Customer Street', 'Department Name', 'Latitude', 'Longitude', 'Market',
'Order City', 'Order Country', 'order date (DateOrders)',
'Order Item Discount', 'Order Item Discount Rate',
'Order Item Product Price', 'Order Item Profit Ratio',
'Order Item Quantity', 'Sales', 'Order Item Total',
'Order Profit Per Order', 'Order Region', 'Order State', 'Order Status',
'Product Name', 'Product Price', 'Product Status',
'shipping date (DateOrders)', 'Shipping Mode', 'Year'],
dtype='object')
First, to cluster the SKUs, the data needs to be aggregated so that only one row per SKU is obtained.
What is SKU?
"SKU" stands for Stock Keeping Unit. It's a term used as a stock tracking unit, commonly in retail sales. It serves as a unique identifier for
a product. SKUs can include information about product specifications, supplier details, prices, and they are often readable via barcodes.
Businesses such as stores and online sales platforms use SKUs for inventory management and sales tracking purposes.
There are different approaches and methods for segmentation. We will use the inventory management approach. Clustering groups similar
SKUs, while ABC Classification emphasizes important SKUs for inventory management strategies.
In [ ]: clustering_features = [
"Product Name",
"Order Item Discount",
"Order Item Discount Rate",
"Order Item Product Price",
"Order Item Profit Ratio",
"Order Item Quantity",
"Sales",
"Order Item Total",
"Order Profit Per Order",
"Product Price",
"order date (DateOrders)",
]
df_clustering = df_[clustering_features]
df_clustering_agg = (
df_clustering.groupby("Product Name")
.agg(
{
"Order Item Quantity": "sum",
"Order Item Discount Rate": "mean",
"Sales": "sum",
"Order Item Profit Ratio": "mean",
}
)
.reset_index()
)
In [ ]: df_clustering_agg.head()
Out[ ]: Product Name Order Item Quantity Order Item Discount Rate Sales Order Item Profit Ratio
In [ ]: df_clustering_agg.shape
Out[ ]: (118, 5)
In [ ]: df_cluster_sorted_order = df_clustering_agg.sort_values(
by="Order Item Quantity", ascending=False
).reset_index(drop=True)
In [ ]: df_cluster_sorted_order["cum_quantity_perc"] = (
df_cluster_sorted_order["Order Item Quantity"].cumsum()
/ df_cluster_sorted_order["Order Item Quantity"].sum()
)
fig, ax = plt.subplots(figsize=(15, 8))
ax.bar(
    df_cluster_sorted_order["Product Name"],
    df_cluster_sorted_order["Order Item Quantity"],
)
ax.set_xticklabels(
df_cluster_sorted_order["Product Name"], rotation=90, fontsize=6
)
ax.set_title("Order Quantity Distribution: Pareto Chart")
ax2 = ax.twinx()
ax2.plot(
df_cluster_sorted_order["Product Name"],
df_cluster_sorted_order["cum_quantity_perc"],
color="tab:red",
)
ax2.yaxis.set_major_formatter(PercentFormatter(xmax=1))  # cum_quantity_perc is a 0-1 fraction
ax2.grid(axis="y")
ax2.set_axisbelow(True)
ax.tick_params(axis="y", colors="tab:blue")
ax2.tick_params(axis="y", colors="tab:red")
plt.show()
As seen in the graph above, only nine out of 118 SKUs accounted for almost 85% of the total order quantity.
In [ ]: # Total order quantity of 9 out of 118 SKUs
df_cluster_sorted_order.head(9)
Out[ ]:    Product Name                               Order Item Quantity   Order Item Discount Rate   Sales          Order Item Profit Ratio   cum_quantity_perc
    0      Perfect Fitness Perfect Rip Deck           19807                 0.101641                   1.188222e+06   0.135466                  0.186640
    1      Nike Men's Dri-FIT Victory Golf Polo       17257                 0.101407                   8.628500e+05   0.128004                  0.349252
    2      O'Brien Men's Neoprene Life Vest           15530                 0.101722                   7.761894e+05   0.127727                  0.495590
    3      Nike Men's Free 5.0+ Running Shoe          10122                 0.101597                   1.012099e+06   0.126268                  0.590969
    5      Nike Men's CJ Elite 2 TD Football Cleat    6126                  0.101714                   7.963188e+05   0.122739                  0.731041
In [ ]: df_cluster_sorted_sales = df_clustering_agg.sort_values(
by="Sales", ascending=False
).reset_index(drop=True)
In [ ]: df_cluster_sorted_sales["cum_sales_perc"] = (
df_cluster_sorted_sales["Sales"].cumsum() / df_cluster_sorted_sales["Sales"].sum()
)
fig, ax = plt.subplots(figsize=(15, 8))
ax2 = ax.twinx()
ax2.plot(
df_cluster_sorted_sales["Product Name"],
df_cluster_sorted_sales["cum_sales_perc"],
color="tab:red",
)
ax2.yaxis.set_major_formatter(PercentFormatter(xmax=1))  # cum_sales_perc is a 0-1 fraction
ax2.grid(axis="y")
ax2.set_axisbelow(True)
ax.tick_params(axis="y", colors="tab:blue")
ax2.tick_params(axis="y", colors="tab:red")
plt.show()
The same interpretation can be made for sales value. Only nine SKUs contribute around 80% of the total sales value over the observed
period.
Out[ ]:    Product Name                               Order Item Quantity   Order Item Discount Rate   Sales          Order Item Profit Ratio   cum_sales_perc
    1      Perfect Fitness Perfect Rip Deck           19807                 0.101641                   1.188222e+06   0.135466                  0.261553
    3      Nike Men's Free 5.0+ Running Shoe          10122                 0.101597                   1.012099e+06   0.126268                  0.445499
    4      Nike Men's Dri-FIT Victory Golf Polo       17257                 0.101407                   8.628500e+05   0.128004                  0.518570
    6      Nike Men's CJ Elite 2 TD Football Cleat    6126                  0.101714                   7.963188e+05   0.122739                  0.658510
    7      O'Brien Men's Neoprene Life Vest           15530                 0.101722                   7.761894e+05   0.127727                  0.724242
Combination of ABCXYZ-Classification
ABC Classification:
ABC classification is a method based on the Pareto Principle. It categorizes items in inventory into three categories: A, B, and C,
based on their importance in terms of value or usage.
Categories:
A: High-value or high-usage items requiring close monitoring.
B: Moderate-value or moderate-usage items managed with standard control.
C: Low-value or low-usage items managed with minimal attention.
XYZ Classification:
XYZ classification groups SKUs by the variability of their demand over time (typically measured with the coefficient of variation): X items have stable demand, Y items moderately variable demand, and Z items highly variable or seasonal demand.
Combining ABCXYZ-Classification:
This approach helps businesses focus their resources on the most valuable items (A) while spending less time on less critical ones (C),
saving time and money. It also helps reduce inventory holding costs and prevent stockouts by considering demand variability.
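The cell that builds df_abc is not fully visible in this extract. A minimal sketch, assuming cum_per is each SKU's cumulative percentage share of total sales (highest revenue first) and per is the individual share recovered by differencing, is:
# Sketch (assumptions noted above)
df_abc = df_clustering_agg.sort_values(by="Sales", ascending=False).reset_index(drop=True)
df_abc["cum_per"] = df_abc["Sales"].cumsum() / df_abc["Sales"].sum() * 100
df_abc["per"] = df_abc["cum_per"].diff()  # first row is NaN; it is filled in the next cell
# the "abc" label itself is assigned by applying abc_classification, defined below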
# The first value of "per" is missing; set it equal to the first cumulative percentage
df_abc.loc[0, "per"] = df_abc["cum_per"][0]
In [ ]: # Define a function to classify the SKUs based on their cumulative percentage of revenue
def abc_classification(data):
if data["cum_per"] <= 70:
return "A"
elif data["cum_per"] > 70 and data["cum_per"] <= 95:
return "B"
elif data["cum_per"] > 95:
return "C"
abc_summary = (
df_abc[["abc", "Product Name", "Sales"]]
.groupby("abc")
.agg(Revenue=("Sales", "sum"), count=("Product Name", "count"))
)
# Adjust layout
plt.tight_layout()
plt.show()
Only 7 products contribute around 67% of our total sales, while 86 contribute only 5%. These products are vital to our store,
demanding a high level of service. Maintaining safety stock is imperative to ensure we never run out of them. It is important to have
various suppliers for A-class products to prevent revenue loss and customer migration to competitors due to product shortages.
The 25 products in category B contribute 28% of total sales. This category may represent products of medium importance.
Procurement processes and inventory management may need to be planned more carefully.
The 86 products in category C contribute around 5% of total sales. This category may represent lower value and more common
products. Supply chain managers must continuously make improvements to optimize stock levels of these products and increase
efficiency.
Priority should be given to class A products when allocating capital and stocking space, followed by class B products. Class C
products, which generate the least revenue, should have minimal inventory. With 86 products in Class C, excess inventory not only
takes up significant stocking space but also ties up capital unnecessarily.
Let's analyze segment A products by country to prioritize inventory based on regional customer preferences.
In [ ]: # Filter the data to the top five countries and group by country and product name to calculate total revenue
country_sales = (
df_[df_["Order Country"].isin(top5_countries)]
.groupby(["Order Country", "Product Name"], as_index=False)
.agg(Revenue=("Sales", "sum"))
)
In [ ]: city_product_revenue_sorted = country_sales.sort_values(
by=["Order Country", "Revenue"], ascending=False
)
In [ ]: # Calculating the total revenue for each order country again for later use
city_product_revenue_sorted["Total_Country_Revenue"] = (
city_product_revenue_sorted.groupby("Order Country")["Revenue"].transform("sum")
)
In [ ]: # Calculating the cumulative percentage of total revenue per country per product
city_product_revenue_sorted["cum_per"] = (
city_product_revenue_sorted.groupby("Order Country")["Revenue"].cumsum()
/ city_product_revenue_sorted["Total_Country_Revenue"]
* 100
)
In [ ]: # Applying the ABC segmentation function to determine the segment for each product in each country
city_product_revenue_sorted["Segment"] = city_product_revenue_sorted.apply(
abc_classification, axis=1
)
In [ ]: segment_A_data = city_product_revenue_sorted[
city_product_revenue_sorted["Segment"] == "A"
]
# Set the zorder of the bars to a higher value than the grid lines
for bar in bars:
bar.set_zorder(2)
# Add grid and set the zorder of the grid lines to a lower value than the bars
ax.grid(axis="x", linestyle="--", alpha=0.7, fillstyle="left")
for line in ax.get_xgridlines():
line.set_zorder(1)
XYZ Classification
In [ ]: df_clustering_cp = df_clustering.copy()
In [ ]: df_clustering_cp["Year"] = df_clustering_cp["order date (DateOrders)"].dt.year
df_clustering_cp["Month"] = df_clustering_cp["order date (DateOrders)"].dt.month
In [ ]: df_cp_agg = (
df_clustering_cp.groupby(["Product Name", "Year", "Month"])["Order Item Quantity"]
.sum()
.reset_index()
)
In [ ]: df_cp_agg["Month"] = df_cp_agg["Month"].map("{:02}".format)
df_cp_agg["Year_Month"] = (
df_cp_agg["Year"].astype(str) + "-" + df_cp_agg["Month"].astype(str)
)
In [ ]: df_cp_agg.head()
In [ ]: df_cp_wide = (
df_cp_agg.pivot(
index="Product Name", columns="Year_Month", values="Order Item Quantity"
)
.reset_index()
.fillna(0)
)
In [ ]: df_cp_wide.head()
Out[ ]: [df_cp_wide preview — one row per product with monthly order quantities in columns 2017-01 … 2017-12, e.g.
        Adult dog supplies: 0.0 in all months except Nov (127.0) and Dec (119.0);
        Baby sweater: 0.0 except Oct (82.0) and Dec (125.0);
        Bowflex SelectTech 1090 Dumbbells: 0.0 except Sep (7.0) and Oct (3.0)]
The minimum coefficient of variation (CoV) of 0.59 suggests that some products have relatively stable demand patterns.
The overall mean CoV of 1.82 implies a moderate level of demand variability across the dataset.
The maximum CoV of 3.46 indicates significant variability in demand for certain products, potentially due to factors like
seasonality or fluctuations in customer preferences.
This means the dataset includes many products with fluctuating or seasonal demand, which makes demand planning considerably harder for procurement staff.
In [ ]: def xyz_classification(cov):
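Neither the construction of the cov column nor the body of xyz_classification is visible here. A minimal sketch, assuming cov is the coefficient of variation of monthly demand (standard deviation divided by mean across the 2017-01 … 2017-12 columns) and using placeholder cut-offs X_MAX and Y_MAX (hypothetical values; the notebook's actual thresholds are not shown), is:
# Sketch (assumptions noted above)
month_cols = [c for c in df_cp_wide.columns if str(c).startswith("2017-")]

df_cp_wide["total_demand"] = df_cp_wide[month_cols].sum(axis=1)
df_cp_wide["avg_demand"] = df_cp_wide[month_cols].mean(axis=1)
df_cp_wide["cov"] = df_cp_wide[month_cols].std(axis=1) / df_cp_wide["avg_demand"]

X_MAX, Y_MAX = 0.8, 1.2  # hypothetical cut-offs, not the notebook's actual values

def xyz_classification(cov):
    # X: stable demand, Y: moderately variable, Z: highly variable / seasonal
    if cov <= X_MAX:
        return "X"
    elif cov <= Y_MAX:
        return "Y"
    return "Z"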
In [ ]: df_cp_wide["xyz"] = df_cp_wide["cov"].apply(xyz_classification)
Out[ ]:        total demand    number of SKUs
        xyz
        X      81741.0         8
        Y       8739.0         1
        Z      15644.0         109
In [ ]: df_cp_wide.head()
Out[ ]: [df_cp_wide preview — monthly columns 2017-01 … 2017-12 plus total_demand and avg_demand, e.g.
        Adult dog supplies: Nov 127.0, Dec 119.0, otherwise 0.0; total_demand 246.0, avg_demand 20.500000
        Baby sweater: Oct 82.0, Dec 125.0, otherwise 0.0; total_demand 207.0, avg_demand 17.250000
        Bag Boy Beverage Holder: Jan 11.0, Feb 30.0, Mar 33.0, Apr 24.0, otherwise 0.0; total_demand 98.0, avg_demand 8.166667
        Bag Boy M330 Push Cart: Apr 4.0, May 70.0, Jun 33.0, Jul 25.0, Aug 52.0, Sep 24.0, otherwise 0.0; total_demand 208.0, avg_demand 17.333333
        Bowflex SelectTech 1090 Dumbbells: Sep 7.0, Oct 3.0, otherwise 0.0; total_demand 10.0, avg_demand 0.833333]
In [ ]: total_demand_by_month_xyz = df_cp_wide.groupby("xyz").agg(
Jan=("2017-01", "sum"),
Feb=("2017-02", "sum"),
Mar=("2017-03", "sum"),
Apr=("2017-04", "sum"),
May=("2017-05", "sum"),
Jun=("2017-06", "sum"),
Jul=("2017-07", "sum"),
Aug=("2017-08", "sum"),
Sep=("2017-09", "sum"),
Oct=("2017-10", "sum"),
Nov=("2017-11", "sum"),
Dec=("2017-12", "sum"),
)
In [ ]: total_demand_by_month_xyz
Out[ ]: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
xyz
X 9247.0 9013.0 9208.0 8946.0 9146.0 8444.0 9210.0 9225.0 8883.0 419.0 0.0 0.0
Y 1053.0 771.0 992.0 1014.0 1015.0 906.0 1054.0 917.0 992.0 25.0 0.0 0.0
Z 1305.0 1286.0 1476.0 1229.0 872.0 844.0 827.0 953.0 627.0 2046.0 2055.0 2124.0
In [ ]: plt.figure(figsize=(12, 6))
bar_width = 0.25
index = total_demand_by_month_xyz.columns
x = range(len(index))
# Set x-ticks
plt.xticks([i + bar_width for i in x], index)
# Add grid
plt.grid(axis="y", linestyle="--", alpha=0.7, fillstyle="left")
# Set the zorder of the bars to a higher value than the grid lines
for container in plt.gca().containers:
for bar in container:
bar.set_zorder(2)
# Access the grid lines through the current axes and set their zorder
for line in plt.gca().get_ygridlines():
line.set_zorder(1)
# Add legend
plt.legend()
# Adjust layout
plt.tight_layout()
# Show the plot
plt.show()
Based on the data in the table:
There is a noticeable seasonal variation among the XYZ categories. Particularly, category Z's sales show a significant increase in
October, November, and December compared to other months. This indicates that category Z is more seasonal and experiences higher
demand during these months.
Category X's sales are fairly evenly distributed from January through September, but demand drops sharply in October and no demand is recorded in November and December.
Similarly, category Y's sales are generally evenly distributed, but they fall to almost nothing in October and to zero in November and December.
In conclusion, based on the data in the table, there is seasonal variability among the XYZ categories, with category Z showing higher
demand in October, November, and December.
plt.figure(figsize=(10, 6))
bars = plt.barh(df_x["Product Name"], df_x["total_demand"], color="tab:blue")
# Set the zorder of the bars to a higher value than the grid lines
for bar in bars:
bar.set_zorder(2)
plt.gca().spines["top"].set_visible(False)
plt.gca().spines["right"].set_visible(False)
plt.xlabel("Total Demand")
plt.title("Top Products by Total Demand for X Category")
plt.gca().invert_yaxis()
# Add the grid and set the zorder of the grid lines to a lower value than the bars
plt.grid(axis="x", linestyle="--", alpha=0.7, fillstyle="left")
# Access the grid lines through the current axes and set their zorder
for line in plt.gca().get_xgridlines():
line.set_zorder(1)
plt.tight_layout()
plt.show()
High-Demand SKUs: The "Perfect Fitness Perfect Rip Deck" and "Nike Men's Dri-FIT Victory Golf Polo" show significantly higher total
demands. These products are key drivers of revenue and require robust inventory management to ensure consistent availability.
Moderate-Demand SKUs: Products like the "O'Brien Men's Neoprene Life Vest" and "Nike Men's Free 5.0+ Running Shoe" have
moderate demand. They benefit from automatic replenishment but may not need as high a buffer as high-demand items.
Lower-Demand SKUs: Items such as the "Diamondback Women's Serene Classic Comfort Bike" have lower demand, suggesting a
need for careful management to avoid overstocking while ensuring availability when needed.
In [ ]: df_abc_xyz_summary = (
df_abc_xyz.groupby("abc_xyz")
.agg(
total_skus=("Product Name", "nunique"),
total_demand=("total_demand", sum),
avg_demand=("avg_demand", "mean"),
total_revenue=("Sales", sum),
)
.reset_index()
)
df_abc_xyz_summary.sort_values(by="total_revenue", ascending=False)
Understanding these classes' revenue contributions and demand patterns allows supply chain analysts to optimize resources, improve
forecasting accuracy, and enhance overall supply chain performance.
The Association of International Certified Professional Accountants provides practical guidance on applying ABC XYZ classifications for
procurement managers to maximize revenue and profit without excessive capital investment in stock. They recommend the following
approaches:
Implement semi-automatic replenishment processes with manual adjustments for seasonal demand changes.
Manage stock with a carefully adjusted seasonal buffer to optimize stock levels.
6. CY Class (Low Revenue, Variable Demand):
Use automatic replenishment systems with a higher buffer inventory to meet variable demand.
Conduct periodic inspections to ensure stock levels align with demand patterns.
Implementing these management approaches based on ABC XYZ classifications can help optimize inventory management, reduce
stockouts, and improve overall supply chain efficiency.
Demand Forecasting
In this section, we explore time series demand forecasting, a critical part of supply chain and inventory management.
Objectives
Forecasting demand for the upcoming month (4 weeks ahead)
Evaluation Metrics
Symmetric Mean Absolute Percentage Error
Mean Absolute Error
Methods
We will focus on the Nixtla forecasting ecosystem:
A statistical model (MSTL, via StatsForecast)
TimeGPT
In [ ]: # Import libraries
import os
import time
In [ ]: demand_df = pd.read_csv("data/ts_demand_forecasting_train.csv")
In [ ]: demand_df.shape
In [ ]: demand_df.head()
Out[ ]:    STORE_SKU                DATE        UNITS  UNITS_MIN  UNITS_MAX  UNITS_MEAN  UNITS_STD  TRANSACTIONS_SUM  PROMO_MAX
    0      store_130_SKU_120931082  2019-05-06  388.0  44.0       69.0       55.428571   8.182443   243.0             1.0
    1      store_130_SKU_120931082  2019-05-13  318.0  37.0       62.0       45.428571   8.079958   210.0             1.0
    2      store_130_SKU_120931082  2019-05-20  126.0  13.0       23.0       18.000000   3.915780   118.0             0.0
    3      store_130_SKU_120931082  2019-05-27  285.0  23.0       65.0       40.714286   14.067863  197.0             1.0
    4      store_130_SKU_120931082  2019-06-03  93.0   10.0       20.0       13.285714   3.352327   87.0              0.0
In [ ]: demand_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8589 entries, 0 to 8588
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 STORE_SKU 8589 non-null object
1 DATE 8589 non-null object
2 UNITS 8589 non-null float64
3 UNITS_MIN 8589 non-null float64
4 UNITS_MAX 8589 non-null float64
5 UNITS_MEAN 8589 non-null float64
6 UNITS_STD 8589 non-null float64
7 TRANSACTIONS_SUM 8589 non-null float64
8 PROMO_MAX 8589 non-null float64
9 PRICE_MEAN 8589 non-null float64
10 STORE 8589 non-null object
11 SKU 8589 non-null object
12 SKU_CATEGORY 8589 non-null object
dtypes: float64(8), object(5)
memory usage: 872.4+ KB
In [ ]: date_col = "DATE"
series_id = "STORE_SKU"
target = "UNITS"
The target variable shows a right-skewed distribution. Most of the data points are clustered between 0 and 200 units.
We might consider transforming the target variable (using a logarithmic transformation) to normalize the distribution.
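As a quick illustration (not something applied later in this notebook), a log transform of the target and its inverse could be sketched as:
import numpy as np

# log1p handles the zero-unit weeks; expm1 maps forecasts back to the original scale
demand_df["UNITS_LOG"] = np.log1p(demand_df["UNITS"])
# ... fit a model on UNITS_LOG, then invert predictions with np.expm1(pred_log)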
The mean values represent the average sales volume for each SKU-store combination, which is essential for establishing baseline
demand levels.
High standard deviations indicate fluctuating demand, which suggests the need for advanced forecasting techniques to account for
variability.
Low standard deviations indicate more consistent demand, suitable for simpler forecasting models.
Minimum values of zero are present in several series, indicating periods with no sales, which could be due to stockouts, low demand,
or product lifecycle events.
Maximum values vary widely, reflecting peak sales periods likely influenced by promotions, seasonality, or other factors.
Spikes:
Several series, such as store_198_SKU_209939182, store_175_SKU_120969012, and store_175_SKU_9888909, show notable spikes
in sales. These spikes may correspond to specific events like promotions, seasonal demand, or market interventions.
The spikes are particularly noticeable around early 2020 and late 2021, suggesting potential external influences or business activities
that led to increased sales during these periods.
Dips:
Dips in sales are also evident in the series, notably around mid-2020 and early 2021. These dips might indicate periods of low
demand, stockouts, or changes in market conditions.
In [ ]: # Plot rolling mean and standard deviation for high variability series
window = 4
plt.figure(figsize=(14, 7))
for series in top_var_series:
series_data = demand_df[demand_df[series_id] == series].set_index(date_col)["UNITS"]
rolling_mean = series_data.rolling(window=window).mean()
rolling_std = series_data.rolling(window=window).std()
plt.subplot(2, 1, 1)
plt.plot(rolling_mean, label=f"{series} Rolling Mean", linewidth=2)
plt.title(f"{window}-Week Rolling Mean")
plt.xlabel("Date")
plt.ylabel("Units Sold")
plt.subplot(2, 1, 2)
plt.plot(rolling_std, label=f"{series} Rolling Std", linewidth=2)
plt.title(f"{window}-Week Rolling Standard Deviation")
plt.xlabel("Date")
plt.ylabel("Units Sold")
plt.legend(fontsize="x-small")
plt.tight_layout()
plt.show()
Rolling Mean
The plot (top) shows the rolling mean of the sales data for the top variability series, providing insight into the overall trends over time.
Series like store_175_SKU_120969012 show a generally decreasing trend over time, suggesting a decline in sales.
Conversely, some series like store_198_SKU_209939182 exhibit an increasing trend towards the end of the period, indicating growing
sales.
Check if any series start after the minimum datetime and if any series end before the maximum datetime
Parameters:
df (pd.DataFrame): The DataFrame containing demand data.
date_col (str): The name of the column containing date information.
series_id (str): The name of the column that uniquely identifies the series.
Returns:
tuple: A tuple containing the overall minimum date, overall maximum date,
DataFrame of series that start after the overall minimum date,
and DataFrame of series that end before the overall maximum date.
"""
# Calculate overall min and max dates
min_date = df[date_col].min()
max_date = df[date_col].max()
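# --- Sketch: the remainder of this helper is not visible in the extract. A plausible
# --- continuation (an assumption) compares each series' first and last dates against
# --- the overall range:
series_range = df.groupby(series_id)[date_col].agg(start="min", end="max").reset_index()
series_late_start = series_range[series_range["start"] > min_date]
series_early_end = series_range[series_range["end"] < max_date]
return min_date, max_date, series_late_start, series_early_end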
The dataset includes 10 series that lack data from the start of the entire time range (2019-05-06). Instead, they start at various points
later:
2019-08-12 has 1 series
2020-02-03 has 5 series
2020-02-17 has 2 series
2021-06-14 has 2 series
There are no series that end before the overall maximum date.
I will drop the series that starts on 2019-08-12 and the two series that start on 2021-06-14, and use data from 2020-02-03 onward. I will use interpolation to fill the gaps in the remaining series (a sketch of this step follows below).
filtered_df_ = demand_df[~demand_df["STORE_SKU"].isin(to_drop)]
In [ ]: filtered_df.shape
There are no series that start after the overall minimum date.
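The gap-filling mentioned above is not shown in the extract. A minimal sketch, assuming the data is restricted to dates from 2020-02-03 onward, pivoted to a week-by-series matrix, and the few missing leading weeks filled by interpolation and backfill (reassembling the full filtered_df with its exogenous columns is omitted here), is:
# Sketch (assumptions noted above)
subset = filtered_df_.copy()
subset["DATE"] = pd.to_datetime(subset["DATE"])
subset = subset[subset["DATE"] >= "2020-02-03"]

units_wide = subset.pivot(index="DATE", columns="STORE_SKU", values="UNITS")
units_wide = units_wide.interpolate(method="time").bfill()  # interior gaps, then leading weeks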
Parameters:
df (pd.DataFrame): The filtered DataFrame to process.
date_col (str): The name of the column containing date information.
series_id (str): The name of the column that uniquely identifies the series.
Returns:
pd.DataFrame: The prepared DataFrame for forecasting.
"""
# Rename columns
df_forecast.rename(
columns={date_col: "ds", "UNITS": "y", series_id: "unique_id"}, inplace=True
)
return df_forecast
In [ ]: df_forecast = prepare_forecast_data(
df=filtered_df, date_col=date_col, series_id=series_id
)
return results
diff_order = 0
if diff:
# First differencing
diff_series = series.diff().dropna()
result = augmented_dickey_fuller_test(diff_series, significance_level)
diff_order = 1
else:
result = augmented_dickey_fuller_test(series, significance_level)
results_adf.append({
"unique_id": uid,
"Test Statistic": result["Test Statistic"],
"p-value": result["p-value"],
"No Lags Used": result["No Lags Used"],
"Number of Observations Used": result["Number of Observations Used"],
"Stationarity": result["Stationarity"],
"Differencing Order": diff_order
})
return pd.DataFrame(results_adf)
In [ ]: def get_non_stationary_stores(results_adf):
nonst_df = results_adf[results_adf["Stationarity"] == False]
nonst_df_ = nonst_df[["unique_id", "p-value", "Stationarity"]]
print(f"Total number of non-stationary store_sku {nonst_df_.shape[0]}")
return nonst_df_
In [ ]: nonst_df = get_non_stationary_stores(results_adf)
nonst_df
In [ ]: nonst_df_diff = get_non_stationary_stores(results_adf_)
nonst_df_diff
When using machine learning models for time series forecasting, it is not strictly necessary to make the data stationary. Unlike traditional
time series methods (e.g., ARIMA), which rely on stationarity to make accurate predictions, machine learning models can often handle non-
stationary data well. However, it is possible to make the time series stationary by using methods such as differencing, STL, and log
transformations.
The MSTL (Multiple Seasonal-Trend decomposition using LOESS) model can provide more accurate forecasts by decomposing the time
series into seasonal, trend, and residual components. This makes it suitable for handling non-stationary data, which is why we will use this
model.
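The cell that builds the StatsForecast object (sf / sf_model) is not visible in this extract. A minimal sketch, assuming an MSTL model with a seasonal period of 4 (matching S_LENGTH below), weekly W-MON frequency, and AutoARIMA as the trend forecaster (the trend forecaster choice is an assumption), is:
# Sketch (assumptions noted above)
from statsforecast import StatsForecast
from statsforecast.models import MSTL, AutoARIMA

sf_model = StatsForecast(
    models=[MSTL(season_length=4, trend_forecaster=AutoARIMA())],
    freq="W-MON",
    n_jobs=-1,
)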
train_list = []
test_list = []
grouped = df_forecast.groupby("unique_id")
for name, group in grouped:
train, test = split_by_unique_id(group)
train_list.append(train)
test_list.append(test)
train_data = pd.concat(train_list)
test_data = pd.concat(test_list)
In [ ]: train_data.shape,test_data.shape
In [ ]: train_df_u = train_data.drop(
columns=[
"UNITS_MIN",
"UNITS_MAX",
"UNITS_MEAN",
"UNITS_STD",
"TRANSACTIONS_SUM",
"PROMO_MAX",
"PRICE_MEAN",
]
)
exog_df = test_data.drop(columns=["y"])
test_y = test_data[["unique_id", "ds", "y"]]
exog_df["ds"] = pd.to_datetime(exog_df["ds"])
test_y["ds"] = pd.to_datetime(test_y["ds"])
In [ ]: HORIZON = 4 # forecast horizon: 4 weeks ahead
LEVEL = [90] # means that the range of values should include the actual future value with probability 90%.
S_LENGTH = [4] # length of the seasonal period of the time series
N_WINDOWS = 8 # number of windows used for cross-validation, i.e. how many forecasting runs are simulated in the past
STEP_SIZE = 2 # step size between consecutive windows, i.e. how often the forecasting process is re-run
In [ ]: forecasts_df.reset_index(inplace=True)
In [ ]: mae_exg = mae(res["y"],res["MSTL"])
smape_exg_mean = smape(res["y"],res["MSTL"])
An MAE of 6.97 means that, on average, the model's predictions deviate from the actual values by approximately 6.97 units.
A SMAPE of 7.84% indicates that, on average, the forecast error is 7.84% of the average magnitude of the actual and forecasted values.
In [ ]: # univariate model
fcst_u = sf_model.forecast(df=train_df_u, h=HORIZON)
res_u = test_y.merge(fcst_u, how="left", on=["unique_id", "ds"])
mae_u = mae(res_u["y"],res_u["MSTL"])
# Calculate SMAPE
smape_mean_u = smape(res_u["y"],res_u["MSTL"])
In [ ]: sf.plot(train_df_u,forecasts_df,engine="plotly",level=LEVEL)
MSTL Cross-Validation
In [ ]: crossvalidation_df_mstl = sf_model.cross_validation(
df=df_forecast, h=HORIZON, n_windows=N_WINDOWS,step_size=STEP_SIZE,
)
In [ ]: crossvalidation_df_mstl.head()
Out[ ]:    unique_id                 ds          cutoff      y     MSTL
           store_130_SKU_120931082   2022-06-27  2022-06-20  96.0  94.234886
evals.append(eval_)
evals = pd.concat(evals, axis=1)
evals = evals.groupby(['unique_id']).mean(numeric_only=True)
return evals
In [ ]: evaluation_mstl_cv_mae = evaluate_cross_validation(crossvalidation_df_mstl,mae)
In [ ]: mae_mstl_cv_std = evaluation_mstl_cv_mae.std()[0]
mae_mstl_cv = evaluation_mstl_cv_mae.mean()[0]
MAE: 6.49
MAE: std: 7.13
The mean MAE value for the Cross Validation is 6.49, with a standard deviation of 7.13.
In [ ]: evaluation_mstl_cv_smape = evaluate_cross_validation(crossvalidation_df_mstl,smape)
In [ ]: smape_mstl_cv_std = evaluation_mstl_cv_smape.std()[0]
smape_mstl_cv = evaluation_mstl_cv_smape.mean()[0]
SMAPE: 8.33
SMAPE std: 9.01
The mean SMAPE value for the Cross Validation is 8.33, with a standard deviation of 9.01.
In [ ]: sf.plot(
df_forecast,
crossvalidation_df_mstl[["ds","MSTL"]],
engine="plotly",
)
Although there isn't much intermittent demand in this dataset, the following metrics can still be helpful:
Error Metrics for Intermittent Demand (CFE, PIS, MSR)
Intermittent demand is present in many retail settings and when forecasting stock requirements, businesses must strike a
balance between the cost of goods and loss of sales due to a lack of stock. When training a forecasting model, using
common accuracy measures, such as the MAE or MSE, do not always translate to ideal real-world outcomes. Due to the
intermittence, forecasts at or near zero may reduce the error but would result in shelves being empty. We will explore a new
range of error metrics targeted towards intermittent demand, such as the Cumulative Forecasting Error (CFE), Periods in
Stock (PIS) and Mean Squared Rate (MSR).
Source
# Calculate period-in-stock
crossvalidation_df_mstl["in_stock"] = crossvalidation_df_mstl["MSTL"] >= crossvalidation_df_mstl["y"]
crossvalidation_df_mstl["PIS"] = crossvalidation_df_mstl["in_stock"].cumsum() / range(1, len(crossvalidation_df_mstl
metrics_df.reset_index(inplace=True)
# CFE
metrics_df.plot(kind="bar", x="unique_id", y="CFE", ax=axes[0], legend=False, color="skyblue")
axes[0].set_title("Cumulative Forecast Error (CFE)")
axes[0].set_xlabel("Unique ID")
axes[0].set_ylabel("CFE")
# NOS
metrics_df.plot(kind="bar", x="unique_id", y="NOS", ax=axes[1], legend=False, color="orchid")
axes[1].set_title("Number of Stockouts (NOS)")
axes[1].set_ylabel("NOS")
axes[1].set_xlabel("Unique ID")
# PIS
metrics_df.plot(kind="bar", x="unique_id", y="PIS", ax=axes[2], legend=False, color="lightgreen")
axes[2].set_title("Period-In-Stock (PIS)")
axes[2].set_xlabel("Unique ID")
axes[2].set_ylabel("PIS")
# MAR
metrics_df.plot(kind="bar", x="unique_id", y="MAR", ax=axes[3], legend=False, color="salmon")
axes[3].set_title("Mean Absolute Rate (MAR)")
axes[3].set_xlabel("Unique ID")
axes[3].set_ylabel("MAR")
for ax in axes:
# Remove top and right spines
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
# Adjust layout
plt.tight_layout()
plt.show()
Cumulative Forecast Error (CFE)
The CFE is the sum of the difference between a forecast and the observed demand.
Positive CFE: Indicates over-forecasting (the model predicts higher demand than actual).
Negative CFE: Indicates under-forecasting (the model predicts lower demand than actual).
The CFE plot shows fluctuations, suggesting varying forecasting accuracy across different unique_ids .
Period-In-Stock (PIS)
Here PIS is computed as the running share of periods in which the forecast was at least as high as the observed demand, so it shows how reliably the forecast would have kept stock available throughout the period.
Values close to 1 indicate good performance, where the stock is frequently in supply.
Lower values indicate more frequent stockouts.
Most unique_ids have a relatively high PIS, indicating that the forecasted stock was sufficient to meet demand for a significant portion
of the time.
Some unique_ids show slightly lower PIS values, suggesting more frequent stockouts or periods where the forecast did not meet actual
demand.
The MAR values show significant variation across unique IDs, with some having very high MAR values, indicating large deviations
between forecasted and actual demand rates.
A few unique IDs stand out with exceptionally high MAR values, suggesting these forecasts are less reliable and deviate substantially
from the actual demand.
The MAR metric highlights the overall accuracy and reliability of the forecast in capturing demand trends.
The MAR plot highlights significant variability, with some unique_ids showing high error, suggesting areas where the model needs
improvement.
The CFE, NOS, and PIS metrics help identify areas where the forecast tends to overestimate or underestimate demand, leading to
stockouts or overstocks.
In [ ]: test_df = df_forecast.groupby('unique_id').tail(4)
input_df = df_forecast.drop(test_df.index).reset_index(drop=True)
exog_df = test_df.drop(columns=["y"])
In [ ]: # Load environment variables from a .env file and retrieve the NIXTLA API key
from dotenv import load_dotenv

load_dotenv(".env")
api_key = os.getenv("NIXTLA_API_KEY")
Initialize NixtlaClient
Create and validate the Nixtla client using the API key.
In [ ]: from nixtla import NixtlaClient

nixtla_client = NixtlaClient(api_key=api_key)
In [ ]: nixtla_client.validate_api_key()
INFO:nixtla.nixtla_client:Happy Forecasting! :), If you have questions or need support, please email ops@nixtla.io s
haring this response and ID: VE5FH9GRYJ
Out[ ]: True
horizon: 4
n_windows: 8
step_size: 2
level: [90]
In [ ]: start = time.time()
fcst_df = nixtla_client.forecast(
df=input_df,
h=HORIZON,
X_df=exog_df,
level=LEVEL,
finetune_steps=10,
finetune_loss="mae",
time_col="ds",
target_col="y",
id_col="unique_id",
)
end = time.time()
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Preprocessing dataframes...
INFO:nixtla.nixtla_client:Inferred freq: W-MON
INFO:nixtla.nixtla_client:Using the following exogenous variables: UNITS_MIN, UNITS_MAX, UNITS_MEAN, UNITS_STD, TRAN
SACTIONS_SUM, PROMO_MAX, PRICE_MEAN
INFO:nixtla.nixtla_client:Calling Forecast Endpoint...
Time (TimeGPT): 4.50141453742981
In [ ]: nixtla_client.plot(
test_df,
fcst_df,
models=["TimeGPT"],
level=LEVEL,
time_col="ds",
target_col="y",
engine="plotly",
)
evaluation_gpt = evaluate(
res_gpt,
metrics=[mae,smape],
models=["TimeGPT"],
target_col="y",
id_col="unique_id",
In [ ]: evaluation_gpt.groupby("metric")["TimeGPT"].agg(["mean", "std"])
Perform Cross-Validation
Conduct cross-validation on the entire dataset to ensure robustness of the model.
In [ ]: timegpt_cv_df = nixtla_client.cross_validation(
df_forecast,
h=HORIZON,
n_windows=N_WINDOWS,
time_col="ds",
target_col="y",
freq="W-MON",
)
INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Using the following exogenous variables: UNITS_MIN, UNITS_MAX, UNITS_MEAN, UNITS_STD, TRAN
SACTIONS_SUM, PROMO_MAX, PRICE_MEAN
INFO:nixtla.nixtla_client:Calling Cross Validation Endpoint...
In [ ]: nixtla_client.plot(
df_forecast,
timegpt_cv_df[["unique_id","ds","TimeGPT"]],
engine="plotly"
)
In [ ]: evaluation_gpt_cv = evaluate(
timegpt_cv_df,
metrics=[mae,smape],
models=["TimeGPT"],
target_col="y",
id_col="unique_id",
)
Cross-Validation Results
In [ ]: evaluation_gpt_cv.groupby("metric")["TimeGPT"].agg(["mean", "std"])
Parameters:
- evaluation_gpt_cv (pd.DataFrame): Evaluation DataFrame for TimeGPT containing metrics.
- evaluation_mstl_cv_mae (pd.DataFrame): Evaluation DataFrame for MSTL containing MAE metrics.
- evaluation_mstl_cv_smape (pd.DataFrame): Evaluation DataFrame for MSTL containing SMAPE metrics.
Returns:
- pd.DataFrame: Merged DataFrame with unique_id, TimeGPT MAE and SMAPE, and MSTL MAE and SMAPE.
"""
eval_timegpt_cv_mae = evaluation_gpt_cv[evaluation_gpt_cv["metric"] == "mae"][["unique_id", "TimeGPT"]]
eval_timegpt_cv_smape = evaluation_gpt_cv[evaluation_gpt_cv["metric"] == "smape"][["unique_id", "TimeGPT"]]
eval_smape_mae_mstl = pd.merge(evaluation_mstl_cv_mae, evaluation_mstl_cv_smape, on="unique_id")
eval_smape_mae_mstl.reset_index(inplace=True)
Parameters:
df (pd.DataFrame): The DataFrame containing the actual and predicted values.
actual_col (str): The name of the column containing the actual values.
predicted_col (str): The name of the column containing the predicted values.
Returns:
pd.DataFrame: The DataFrame with the new error column added.
"""
In [ ]: width = 0.35
axs.bar(
eval_timegpt_cv_mae["unique_id"],
eval_timegpt_cv_mae["TimeGPT"],
width,
label="TimeGPT",
color="b",
alpha=0.6,
)
axs.bar(
eval_smape_mae_mstl["unique_id"],
eval_smape_mae_mstl["MSTL_mae"],
width,
label="MSTL",
color="r",
alpha=0.6,
bottom=eval_timegpt_cv_mae["TimeGPT"],
)
axs.spines["top"].set_visible(False)
axs.spines["right"].set_visible(False)
# Show the combined plot
plt.tight_layout()
plt.show()
The MSTL model consistently achieves lower Mean Absolute Error (MAE) values compared to the TimeGPT model across most unique
IDs.
This indicates that the MSTL model generally provides more accurate predictions for the majority of store and SKU combinations.
The MSTL model outperforms the TimeGPT model in terms of overall accuracy, as evidenced by the lower MAE values.
Error Analysis
This error analysis helps in understanding the error distribution and identifying areas where the models can be improved.
Out[ ]:            count   mean        std         min        50%        70%         80%         90%         95%         99%         max
        y          1536.0  116.420860  117.729195  7.000000   73.000000  114.000000  168.000000  259.500000  398.750000  505.000000  1263.00000
        MSTL       1536.0  116.534538  117.434235  -7.863361  73.506863  114.574203  166.720703  259.155624  402.694397  507.229431  1239.82226
        error_abs  1536.0  6.494394    10.606213   0.001400   2.940297   6.127014    8.932001    16.441643   24.655142   51.241245   125.05862

Out[ ]:            count   mean        std         min         50%        70%         80%         90%         95%         99%         max
        y          1536.0  115.186478  115.505403  2.000000    72.000000  108.805000  163.000000  254.500000  405.250000  524.600000  1263.0000
        TimeGPT    1536.0  118.203973  119.598659  -65.911036  71.726481  107.203879  167.513115  274.879559  412.104381  539.418692  783.1445
        error_abs  1536.0  16.844512   32.871438   0.006061    5.906795   13.450920   20.378603   40.325256   73.504694   159.686003  479.8554
The MSTL model generally performs better than the TimeGPT model in terms of both average performance and consistency. MSTL has
lower mean and median errors, a smaller standard deviation of errors, and significantly lower worst-case errors. These factors suggest that
MSTL is a more reliable model for this dataset, with more accurate and consistent predictions.
In [ ]: fig, axs = plt.subplots(1, 2, figsize=(24, 8))
# Plotting the histograms
axs[0].hist(
timegpt_cv_df["error_"],
bins=30,
edgecolor="k",
alpha=0.6,
label="TimeGPT",
width=20,
)
axs[1].scatter(
crossvalidation_df_mstl["MSTL"],
crossvalidation_df_mstl["error_"],
alpha=0.5,
c="orangered",
)
axs[1].scatter(
timegpt_cv_df["TimeGPT"], timegpt_cv_df["error_"], alpha=0.6, c="tab:blue"
)
Histogram of Residuals
Both models have a high concentration of residuals around zero, indicating that the majority of the predictions are close to the actual
values.
The MSTL model (orange bars) shows a slightly tighter distribution around zero compared to the TimeGPT model (blue bars),
suggesting that MSTL predictions are generally more accurate.
There are fewer extreme residuals (both positive and negative) for the MSTL model compared to the TimeGPT model. This indicates
that the MSTL model has fewer large errors and is more robust.
The residuals are mostly centered around the zero line, indicating no obvious bias in the predictions.
There is a slight increase in the spread of residuals as the predicted values increase, suggesting that both models may have higher
variability in their errors for larger predictions.
Both models have a similar pattern, but the MSTL model (orange dots) seems to have fewer large residuals compared to the TimeGPT
model (blue dots).
The horizontal line at zero helps to visualize that most residuals are clustered around this line, further indicating that the predictions are
generally accurate.
Conclusion
The MSTL model generally provides more accurate and robust predictions compared to the TimeGPT model, as evidenced by the
tighter distribution of residuals around zero and fewer extreme residuals.
Both models do not show obvious bias in their predictions, as most residuals are centered around zero.
Both models exhibit higher variability in their errors for larger predicted values, which suggests that further model tuning or additional
features may be needed to improve performance for these cases.
MLflow
In [ ]: import mlflow
import numpy as np
from statsforecast import StatsForecast
from sklearn.metrics import mean_absolute_error
import mlflavors
Parameters:
y_true (numpy array or list): Array of actual values
y_pred (numpy array or list): Array of predicted values
Returns:
float: SMAPE value
"""
y_true, y_pred = np.array(y_true), np.array(y_pred)
return np.mean(2 * np.abs(y_pred - y_true) / (np.abs(y_true) + np.abs(y_pred)))
In [ ]: ARTIFACT_PATH = "model"
DATA_PATH = "./data"
# Define HORIZON and LEVEL
HORIZON = 4
LEVEL = [90]
X_test = test_data.drop(columns=["y"])
y_test = test_data[["y"]]
sf.fit()
# Evaluate model
y_pred = sf.predict(h=HORIZON, X_df=X_test, level=LEVEL)["MSTL"]
metrics = {
"mae": mean_absolute_error(y_test, y_pred),
"smape": custom_smape(y_test, y_pred),
}
print(f"Metrics: \n{metrics}")
# Log metrics
mlflow.log_metrics(metrics)
# Log parameters
mlflow.log_param("horizon", HORIZON)
mlflow.log_param("level", LEVEL)
Metrics:
{'mae': 6.972873224218687, 'smape': 0.7521288414425913}
2024/06/12 14:45:31 WARNING mlflow.utils.environment: Encountered an unexpected error while inferring pip requiremen
ts (model URI: C:\Users\emirh\AppData\Local\Temp\tmp_jf1fmqo\model\model.pkl, flavor: statsforecast). Fall back to r
eturn ['statsforecast==1.7.4']. Set logging level to DEBUG to see the full traceback.
MLflow run id:
d862726a798e4befb5fa8f9b66fd069d