A REPORT ON
“Identifying Data Mining Tasks and Performing them on
the given dataset”
SUBMITTED BY
Karan Parge SeatNo:B1908004222
Niraj Mahanta SeatNo:B1908004235
Rajat Potgantiwar SeatNo:B1908004248
Mayuresh Sontankke SeatNo:B1908004260
DEPARTMENT OF COMPUTER ENGINEERING
TSSM’s BHIVARABAI SAWANT COLLEGE OF
ENGINEERING AND RESEARCH, NARHE, PUNE-41
SAVITRIBAI PHULE PUNE UNIVERSITY
2024-2025
CERTIFICATE
This is to certify that the project report entitled
“Identifying Data Mining Tasks and Performing them on the given
dataset”
Submitted by
Karan Parge SeatNo:B1908004222
Niraj Mahanta SeatNo:B1908004235
Rajat Potgantiwar SeatNo:B1908004248
Mayuresh Sontankke SeatNo:B1908004260
This report has been approved as it satisfies the academic requirements in respect of the mini-project work prescribed for the course Laboratory Practice IV (Computer Engineering).
Prof. S. S. Bhagat (Guide)
Dr. A. D. Gujar (Head of Department)
ACKNOWLEDGEMENT
It gives us great pleasure to present this preliminary mini-project report on “Identifying Data Mining Tasks and Performing them on the given dataset”.
We would like to express our sincere gratitude to our guide, Prof. S. S. Bhagat, whose invaluable guidance, support, and contributions have been instrumental in the successful completion of this project.
We are particularly grateful to Dr. A. D. Gujar, Head of the Department of Computer Engineering, for providing data, expertise, and resources. Their encouragement and mentorship have been invaluable throughout this journey, and we are truly indebted to them for their support.
We would also like to acknowledge everyone who, although not directly involved in the project, provided indirect support or inspiration.
ABSTRACT
Data is one of the most essential commodities for any organization in the 21st century.
Harnessing data and utilizing it to create effective marketing strategies and make informed
decisions is crucial for organizations. For a conglomerate as large as Walmart, organizing and
analyzing the vast volumes of data generated is necessary to understand existing performance
and identify growth potential. The primary objective of this project is to analyze how various
factors influence sales for Walmart and leverage these insights to develop more efficient plans
and strategies aimed at increasing revenue.
This paper investigates the performance of a subset of Walmart stores and forecasts future
weekly sales using several models, including linear regression, lasso regression, random forest,
and gradient boosting. An exploratory data analysis was conducted on the dataset to assess the
impact of factors such as holidays, fuel prices, and temperature on Walmart's weekly sales.
Additionally, a Power BI dashboard was created to visualize predicted sales information for
each store and department, providing an overview of overall predicted sales trends.
The analysis revealed that the gradient boosting model yielded the most accurate sales
predictions, and notable relationships were observed between factors such as store size,
holidays, unemployment rate, and weekly sales. Implementation of interaction effects within
linear models highlighted relationships between combinations of variables like temperature,
Consumer Price Index (CPI), and unemployment, directly impacting sales for Walmart stores.
TABLE OF CONTENTS
1. INTRODUCTION
1.1 TOOLS AND TECHNOLOGIES APPLIED
2. PROBLEM STATEMENT
3. METHODOLOGY
4. ABOUT THE DATASET
4.1 EXPLORATORY DATA ANALYSIS
4.2 CORRELATION MATRIX
5. DATA CLEANING AND PREPROCESSING
6. MODEL SELECTION AND IMPLEMENTATION
7. BUILDING BI DASHBOARD
8. CONCLUSION
9. REFERENCES
CHAPTER 1
INTRODUCTION
The 21st century has witnessed an explosion of data generated from the widespread adoption
of advancing technologies. Retail giants like Walmart view this data as their most valuable
asset, enabling them to predict future sales, understand customer behavior, and formulate
strategies to drive profits and maintain competitiveness. Walmart, an American multinational
retail corporation with nearly 11,000 stores across 27 countries and over 2.2 million associates,
relies on extensive data analytics to support its "everyday low prices" promise and annual
revenue of nearly $500 billion.
Walmart's diverse product range spans groceries, home furnishings, personal care items,
electronics, clothing, and more, generating substantial consumer data that fuels predictive
analytics for customer buying patterns, sales forecasts, promotional planning, and innovative
in-store technologies. Embracing modern technological approaches is vital for Walmart's
success in today's dynamic global market, enabling the company to develop distinctive
products and services that set it apart from competitors.
This research focuses on predicting Walmart's sales based on historical data and investigating
whether factors such as temperature, unemployment, fuel prices, and holidays impact the
weekly sales of specific stores under study. By understanding variations in sales during
holidays like Christmas and Thanksgiving compared to regular days, Walmart can tailor
promotional offers to drive sales and increase revenue.
Walmart strategically schedules promotional markdown sales after major U.S. holidays,
underscoring the importance of assessing their impact on weekly sales to guide resource
allocation toward key initiatives. Understanding user preferences and buying patterns is critical
for enhancing customer retention and demand, ultimately driving profitability. Insights from
this study will inform Walmart's resource allocation based on regional demand and profitability
throughout the year.
Furthermore, leveraging big data analytics enables efficient analysis of historical data to
identify at-risk stores, predict future sales, assess organizational performance, and ensure
strategic alignment.
This study utilizes SQL, R, Python, and Power BI to analyze the dataset provided by Walmart
Recruiting on Kaggle ("Walmart Recruiting - Store Sales Forecasting," 2014). The research
involves modeling and exploratory data analysis in R and Python, aggregation and querying
using SQL, and the creation of a final dashboard in Power BI.
1.1 TOOLS AND TECHNOLOGIES APPLIED
The analysis for this study was conducted using key tools such as R, Python, and Power BI, with specific tasks performed in development environments like RStudio and PyCharm. Various packages were utilized to facilitate the initial Exploratory Data Analysis (EDA) and finalize the outcomes. For the initial EDA, a combination of R and Python libraries was employed, including inspectdf, ggplot2, plotly, caret, matplotlib, and seaborn. Packages like numpy, pandas, and the tidyverse were used for data wrangling and manipulation.
For model creation, several packages such as scikit-learn, xgboost, and others were applied to
develop and evaluate predictive models based on the analyzed data.
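As an illustrative sketch only, the kind of first-pass data inspection these packages support can be done in a few lines of pandas; the miniature DataFrame below is a hypothetical stand-in mirroring the Kaggle train.csv schema, not the real file.

```python
import pandas as pd

# Stand-in for a few rows of the Kaggle train.csv (Store, Dept, Date,
# Weekly_Sales, IsHoliday); the real file has 421,570 rows.
train = pd.DataFrame({
    "Store": [1, 1, 2],
    "Dept": [1, 1, 1],
    "Date": ["2010-02-05", "2010-02-12", "2010-02-05"],
    "Weekly_Sales": [24924.50, 46039.49, 35034.06],
    "IsHoliday": [False, True, False],
})
train["Date"] = pd.to_datetime(train["Date"])

# The same structural checks that R's glimpse/inspectdf calls perform:
# dimensions, column types, and missing-value counts.
print(train.shape)
print(train.dtypes)
print(train.isna().sum())
```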
CHAPTER 2
PROBLEM STATEMENT
The objective of this study is to forecast the weekly sales for Walmart based on historical data
collected between 2010 and 2013 from 45 stores situated across various regions in the country.
Each store encompasses multiple departments, and the primary deliverable is to predict the
weekly sales for all departments.
The dataset, sourced from Kaggle, includes weekly sales data for 45 Walmart stores, store size
and type information, departmental details, weekly sales figures, and holiday indicators.
Additional data on various influencing factors such as Consumer Price Index (CPI),
temperature, fuel prices, promotional markdowns, and unemployment rates for each week were
also collected to investigate potential correlations with weekly sales.
This study incorporates correlation testing to assess relationships between individual factors
and weekly sales, aiming to identify impactful variables on Walmart's sales performance.
Extensive exploratory data analysis has been conducted on the Walmart dataset, focusing on:
• Identifying store and department-wide sales trends.
• Analyzing sales variations based on store size and type.
• Assessing sales patterns during holiday periods.
• Examining correlations among different factors affecting sales.
• Calculating average yearly sales.
• Analyzing weekly sales in relation to regional temperature, CPI, fuel prices, and
unemployment rates.
A Linear Regression model is employed to explore whether specific combinations of factors
directly influence Walmart's weekly sales. Various algorithms are used for predicting future
sales and analyzing correlations within the retail store dataset.
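A minimal sketch of the correlation testing described above, using pandas; the weekly observations below are invented for illustration and are not drawn from the real dataset.

```python
import pandas as pd

# Hypothetical weekly observations for one store; the column names
# mirror the merged Kaggle features.
df = pd.DataFrame({
    "Weekly_Sales": [15000, 16200, 14800, 17500, 16900, 15400],
    "Temperature":  [42.3, 38.5, 45.1, 30.2, 33.8, 40.0],
    "Fuel_Price":   [2.57, 2.55, 2.60, 2.62, 2.67, 2.72],
    "Unemployment": [8.1, 8.1, 8.0, 8.0, 7.9, 7.9],
})

# Pairwise Pearson correlations of each factor with the target,
# as used in the correlation testing described above.
print(df.corr(method="pearson")["Weekly_Sales"].drop("Weekly_Sales"))
```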
4. ABOUT THE DATASET
The dataset used in this study was obtained from a previous Kaggle competition hosted by
Walmart, accessible at https://www.kaggle.com/c/walmart-recruiting-store-sales-
forecasting/data. It includes historical weekly sales information for 45 Walmart stores across
different regions, along with department-level details.
The 'test.csv' file from this dataset is utilized solely for predicting values using the model with
the lowest Weighted Mean Absolute Error (WMAE) score. Since this dataset lacks the target
variable 'Weekly Sales', it cannot be used for testing purposes in this analysis. Instead, the
training dataset ('train.csv') is split into training and validation datasets for model development.
The primary objective of this study is to predict department-level weekly sales for each store
using the provided dataset.
The training dataset covers weekly sales data from February 5, 2010, to November 1, 2012,
and includes information about stores, departments, and holiday dates. The testing dataset is
identical to the training dataset except for the absence of weekly sales information. The training
dataset comprises 421,570 rows, while the testing dataset contains 115,064 rows (Figure 1).
Fig. 1: A summary of the training dataset
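The split of train.csv into training and validation sets might be sketched as follows; the tiny DataFrame is a stand-in for the real 421,570-row file, and the 25% holdout fraction is an illustrative choice, not necessarily the one used in the study.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for train.csv; the real file spans 2010-02-05 to 2012-11-01.
train = pd.DataFrame({
    "Store": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4],
    "Dept": [1] * 10,
    "Weekly_Sales": [100.0, 110, 95, 200, 210, 190, 150, 160, 155, 300],
    "IsHoliday": [False, True] * 5,
})

features = train.drop(columns="Weekly_Sales")
target = train["Weekly_Sales"]

# Hold out 25% of the labelled rows for validation, since test.csv
# has no Weekly_Sales column to score against.
X_train, X_val, y_train, y_val = train_test_split(
    features, target, test_size=0.25, random_state=42
)
print(len(X_train), len(X_val))
```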
There is another dataset called ‘stores.csv’ that contains more detailed information about the type and size of the 45 stores used in this study. Another major aspect of this study is to determine whether weekly store sales increase because of changes in temperature, fuel prices, holidays, markdowns, unemployment rates, and fluctuations in the Consumer Price Index. The file ‘features.csv’ contains all the necessary information about these factors and is used in the analysis to study their impact on sales performance.
The holiday weeks flagged in the study are the Super Bowl, Labor Day, Thanksgiving, and Christmas.
A summary of the features dataset is displayed in the image below (Figure 2).
Fig. 2: A summary of the features dataset
The final file called ‘sampleSubmission.csv’ contains two main columns: dates for each of the
weeks in the study as well as a blank column that should be utilized to record predicted sales
for that week based on the different models and techniques applied.
The results of the most accurate and efficient model have been recorded in this file, and the final Power BI dashboard has been created based on these predicted values, combined with the ‘stores’ and ‘features’ datasets.
4.1 EXPLORATORY DATA ANALYSIS
It is essential to thoroughly understand the dataset used in this analysis to identify the most
accurate prediction models. Often, underlying patterns or trends in the data are not readily
apparent, highlighting the necessity of conducting comprehensive exploratory data analysis
(EDA). This in-depth examination is critical for grasping the dataset's underlying structure and
drawing meaningful insights to validate our analysis.
The study commences with a preliminary analysis of the dataset to grasp its main characteristics
and relevant components for the research. EDA plays a pivotal role, given the dataset's
numerous attributes essential for drawing insights and making predictions. As part of EDA,
various visualizations have been crafted to clarify the study objectives and highlight attributes
contributing to improved results.
EDA serves as an initial investigation, focusing on exploring relationships and understanding column characteristics. Utilizing tools like the 'inspectdf' package (Ellis, 2019) and the 'glimpse' function (Sullivan, 2019) in R aids in answering questions about dataset dimensions, missing values, variable distributions, correlation coefficients, and more.
Several other packages such as 'ggplot2', 'matplotlib', 'seaborn', and 'plotly' have been employed
to generate visualizations depicting weekly sales by store and department, sales comparisons
on holidays versus normal days, regional and store-specific sales trends based on store type and
size, annual sales averages, and sales variations due to factors like CPI, fuel prices, temperature,
and unemployment. These visualizations, including heatmaps, correlation matrices (Kedia et
al., 2013), histograms, scatterplots, among others, are accompanied by concise descriptions to
elucidate findings and outline potential modeling avenues for the project's subsequent stages.
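As one concrete example of the EDA questions above, whether holiday weeks sell more on average can be checked with a simple grouped aggregate; the sales figures below are invented for illustration.

```python
import pandas as pd

# Tiny stand-in sample; the real EDA runs over the full 421,570-row
# training set.
sales = pd.DataFrame({
    "Weekly_Sales": [15000, 22000, 14500, 25000, 16000, 15500],
    "IsHoliday":    [False, True, False, True, False, False],
})

# Mean weekly sales for holiday weeks versus normal weeks.
by_holiday = sales.groupby("IsHoliday")["Weekly_Sales"].mean()
print(by_holiday)
```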
4.2 CORRELATION MATRIX
A correlation matrix describes the correlation between the various variables of a dataset. Each variable is correlated with each of the other variables, which helps in understanding which variables are most closely related to each other (Glen, 2016).
With the numerous variables available in this dataset, it became imperative to study correlations between some of them. By default, the matrix calculates correlation using Pearson’s correlation coefficient (Johnson, 2021), which measures the linear relationship between two variables on a scale from −1 to +1. The closer the correlation is to |1|, the stronger the linear relationship between the variables, and vice versa.
The heatmap/correlation matrix in Figure 22, created using the seaborn library in Python
(Szabo, 2020) gives the following information:
• There is a slight correlation between weekly sales and store size, type, and department.
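A sketch of how such a correlation matrix is computed; seaborn's heatmap (as in Figure 22) is then drawn directly over a matrix of exactly this form. The column values below are hypothetical.

```python
import pandas as pd

# Hypothetical merged columns standing in for the real dataset.
df = pd.DataFrame({
    "Weekly_Sales": [12000, 18000, 9000, 30000, 25000, 11000],
    "Size":         [80000, 120000, 60000, 200000, 180000, 70000],
    "Temperature":  [45.0, 55.2, 60.1, 38.4, 42.7, 58.3],
    "CPI":          [211.1, 211.2, 211.4, 211.5, 211.7, 211.8],
})

# Pairwise Pearson correlations; every entry lies in [-1, 1] and the
# diagonal is 1 by definition.
corr = df.corr(method="pearson")
print(corr.round(2))
```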
5. DATA CLEANING AND PREPROCESSING
The data contains 421,570 rows, with some store-specific departments missing sales data for many weeks. As observed in Figure 4, some columns in the features dataset contain missing values; however, after the features dataset is merged with the training dataset, the only remaining missing values are in the markdown columns (as shown in Figure 23).
After the extensive EDA, it was determined that these five markdown columns, with their missing values, have barely any correlation with the weekly sales for Walmart; hence, these five columns have been eliminated from the subsequent training and testing datasets.
Since the source already provides training and testing datasets, there is no need to create them for this study. Because the main focus of this study is to accurately predict weekly sales for different Walmart stores, the previously derived ‘Date’, ‘Month’, ‘Quarter’, and ‘Day’ columns have been dropped, and only the ‘Week of Year’ column has been used in the subsequent models.
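The merge-then-drop step described above might look like this in pandas; the miniature DataFrames are stand-ins for train.csv and features.csv, with invented values.

```python
import pandas as pd

# Miniature stand-ins for train.csv and features.csv.
train = pd.DataFrame({
    "Store": [1, 1, 2],
    "Date": ["2010-02-05", "2010-02-12", "2010-02-05"],
    "Weekly_Sales": [24924.50, 46039.49, 35034.06],
})
features = pd.DataFrame({
    "Store": [1, 1, 2],
    "Date": ["2010-02-05", "2010-02-12", "2010-02-05"],
    "Fuel_Price": [2.572, 2.548, 2.572],
    "MarkDown1": [None, None, 3000.0],
})

# Attach the external factors to each (Store, Date) sales row.
merged = train.merge(features, on=["Store", "Date"], how="left")

# Drop the sparsely populated markdown columns, which the EDA found
# to be weakly correlated with weekly sales.
markdown_cols = [c for c in merged.columns if c.startswith("MarkDown")]
merged = merged.drop(columns=markdown_cols)
print(merged.columns.tolist())
```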
The data has been checked for inaccuracies and for missing or out-of-range values using the ‘inspectdf’ package in R as part of the initial EDA. Columns with missing values have been dropped. The dataset contains information about weekly sales, which was initially broken down to obtain monthly and quarterly sales figures for our analysis; however, that information is not utilized during the modeling process. The boolean ‘IsHoliday’ column in the dataset indicates whether a given week was a holiday week. As observed in the EDA above, sales have been higher during the holiday season than during the non-holiday season, hence the ‘IsHoliday’ column has been used for further analysis.
Furthermore, as part of this data preprocessing step, we have also created input and target data frames along with the training and validation datasets that help accurately measure the performance of the applied models. In addition, feature scaling (Vashisht, 2021) has been applied to normalize the different data attributes. This has primarily been done to standardize the independent variables in the training and testing datasets so that these variables are centered around the same range (0, 1), which improves accuracy.
Also referred to as normalization, this method uses a simple min-max scaling technique, implemented in Python using the Scikit-learn (sklearn) library. The Weighted Mean Absolute Error (WMAE) is one of the most common metrics used to measure accuracy for continuous variables (JJ, 2016).
A WMAE function has been created that provides a measure of success for the different models applied. It is the weighted average of the absolute errors between predictions and actual observations. In short, the smaller the WMAE, the more accurate the model.
6. MODEL SELECTION AND IMPLEMENTATION
Finding and implementing the most effective model is the biggest challenge of this study. Model selection depends on the kind of data available and the analysis to be performed on it (UNSW, 2020).
Several models, selected based on different aspects of the dataset, have been evaluated in this study. Since the main purpose is to predict the weekly sales for different Walmart stores and departments, the following four machine learning models have been used:
i) Linear Regression
ii) Lasso Regression
iii) Gradient Boosting Machine
iv) Random Forest
Each of these methods is discussed briefly in the remainder of this report. For each model, the reasons it was chosen, its implementation, and its success rate (measured through WMAE) are included.
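A hedged sketch of fitting the four listed models and comparing them on a validation set; the synthetic data and hyperparameters below are illustrative rather than those used in the study, and plain MAE stands in for the WMAE scoring described earlier.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Synthetic regression data standing in for the preprocessed features.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 3))
y = 3 * X[:, 0] + X[:, 1] ** 2 + rng.normal(0, 0.05, 200)

X_train, X_val = X[:150], X[150:]
y_train, y_val = y[:150], y[150:]

# The four model families used in the study (hyperparameters illustrative).
models = {
    "linear": LinearRegression(),
    "lasso": Lasso(alpha=0.01),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "gbm": GradientBoostingRegressor(random_state=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    err = mean_absolute_error(y_val, model.predict(X_val))
    print(f"{name}: MAE = {err:.4f}")
```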
7. BUILDING BI DASHBOARD
This Power BI dashboard serves as the final product of this research. The dashboard contains detailed information about the original data for the 45 Walmart stores and displays their respective predicted weekly sales. Most of the explorations performed as part of the EDA are included in this dashboard in the form of a story, and users can filter the data based on their requirements. After the final predicted weekly sales are exported to the ‘sampleSubmissionFinal’ file, the Id column is split to separate the store, department, and date information into different columns through Power BI data transformations (as shown in the figures below).
This file is then merged with the ‘stores’ file, which contains information about the type and size of each store as well as holiday information. All these columns are used to create several visualizations that track weekly predicted sales for various stores and departments, sales based on store size and type, and so on. The dashboard also provides detailed information about the stores and departments that generate the highest revenue and their respective store types. The accompanying PDF file contains brief information about all the visualizations created in the dashboard.
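The same Id split can be reproduced in pandas, assuming the Kaggle sample-submission Id format of Store_Dept_Date (the rows below are invented); this mirrors the Power BI "Split Column" transformation described above.

```python
import pandas as pd

# Invented submission rows; the Id concatenates store, department, and
# week date, e.g. "1_1_2012-11-02".
sub = pd.DataFrame({
    "Id": ["1_1_2012-11-02", "1_2_2012-11-02", "2_1_2012-11-09"],
    "Weekly_Sales": [36000.0, 18000.0, 22000.0],
})

# Split on "_" into three new columns; the date keeps its hyphens.
sub[["Store", "Dept", "Date"]] = sub["Id"].str.split("_", expand=True)
print(sub[["Store", "Dept", "Date"]])
```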
The dashboard can be found in the final submitted folder. If a user does not have access to Power BI, a PDF export of the entire dashboard is included along with the .pbix file that contains all of the created visualizations and reports. Some views of the dashboard are included below.
8. CONCLUSION
The main purpose of this study was to predict Walmart’s sales based on the available historical data and to identify whether factors like temperature, unemployment, and fuel prices affect the weekly sales of the particular stores under study. This study also aimed to understand whether sales are relatively higher during holidays like Christmas and Thanksgiving than on normal days, so that stores can create promotional offers that increase sales and generate higher revenue.
As observed through the exploratory data analysis, store size and holidays have a direct relationship with high Walmart sales. It was also observed that, of all the store types, Type A stores gathered the most sales for Walmart. Additionally, departments 92, 95, 38, and 72 accumulate the most sales across all three store types; for all 45 stores, the presence of these departments ensures higher sales. Pertaining to the specific factors provided in the study (temperature, unemployment, CPI, and fuel price), it was observed that sales do tend to rise slightly during favorable climate conditions as well as when fuel prices are moderate. However, it is difficult to make a strong claim about this considering the limited scope of the training dataset provided for this study. From the observations in the exploratory data analysis, sales also tend to be relatively higher when the unemployment level is lower. Additionally, with the dataset provided for this study, there does not appear to be a relationship between sales and the CPI. Again, it is hard to make a substantial claim about these findings without a larger training dataset with additional information.
Interaction effects were studied as part of the linear regression model to identify whether a combination of different factors could influence the weekly sales for Walmart. This was necessary because of the high number of predictor variables in the dataset. While interaction effects were tested on several combinations of significant variables, a statistically significant relationship was only observed between the independent variables of temperature, CPI, and unemployment, and weekly sales (the response variable). However, this is not definitive because of the limitations of the training data.
9. REFERENCES
1. Bakshi, C. (2020). Random forest regression. https://levelup.gitconnected.com/random-forest-regression-209c0f354c84
2. Bari, A., Chaouchi, M., & Jung, T. (n.d.). How to utilize linear regressions in predictive analytics. https://www.dummies.com/programming/big-data/data-science/how-to-utilize-linear-regressions-in-predictive-analytics/
3. Baum, D. (2011). How higher gas prices affect consumer behavior. https://www.sciencedaily.com/releases/2011/05/110512132426.htm
4. Brownlee, J. (2016). Feature importance and feature selection with XGBoost in Python. https://machinelearningmastery.com/feature-importance-and-feature-selection-with-xgboost-in-python/
5. Chouksey, P., & Chauhan, A. S. (2017). A review of weather data analytics using big data. International Journal of Advanced Research in Computer and Communication Engineering. https://ijarcce.com/upload/2017/january17/IJARCCE%2072.pdf
6. Crown, M. (2016). Weekly sales forecasts using non-seasonal ARIMA models. http://mxcrown.com/walmart-sales-forecasting/
7. Editor, M. B. (2013). Regression analysis: How do I interpret R-squared and assess the goodness-of-fit? https://blog.minitab.com/en/adventures-in-statistics-2/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit
8. Ellis, L. (2019). Simple EDA in R with inspectdf. https://www.r-bloggers.com/2019/05/part-2-simple-eda-in-r-with-inspectdf/
9. Frost, J. (2021). Regression coefficients. Statistics By Jim. https://statisticsbyjim.com/glossary/regression-coefficient/
10. Glen, S. (2016). Elementary statistics for the rest of us. https://www.statisticshowto.com/correlation-matrix/