Wholesale Customer

The document provides an analysis of customer spending data across various product categories, regions, and channels. It finds that:
• The region with the highest overall spending is "Other" and the region with the lowest is "Oporto". The channel with the highest spending is "Hotel" and the lowest is "Retail".
• Fresh items have the highest average spending in the "Other" region for both retail and hotel channels. Milk and grocery items also tend to be highest in the "Other" region. Spending on frozen, detergent, and deli items varies more across regions and channels.
• The fresh, milk, and grocery categories show the most consistent purchasing behavior across customers, while

Uploaded by

Ankita Mishra

BUSINESS REPORT

CONTENTS

Problem 1
1.1 Use methods of descriptive statistics to summarize data. Which Region and which
Channel spent the most? Which Region and which Channel spent the least?
1.2 There are 6 different varieties of items that are considered. Describe and
comment/explain all the varieties across Region and Channel. Provide a detailed
justification for your answer.
1.3 On the basis of a descriptive measure of variability, which item shows the most
inconsistent behaviour? Which items show the least inconsistent behaviour?
1.4 Are there any outliers in the data? Back up your answer with a suitable plot/technique
with the help of detailed comments.

Problem 2
1. Perform Exploratory Data Analysis [Univariate, Bivariate, and Multivariate analysis to be
performed]. What insight do you draw from the EDA?

PROBLEM 1
1.1 Use methods of descriptive statistics to summarize data. Which Region and which
Channel spent the most? Which Region and which Channel spent the least?

Overview of data

Detailed overview of dataset

• Records in the dataset = 440 rows
• Columns in the dataset = 9 columns
1. FRESH: annual spending (m.u.) on fresh products (Continuous)
2. MILK: annual spending (m.u.) on milk products (Continuous)
3. GROCERY: annual spending (m.u.) on grocery products (Continuous)
4. FROZEN: annual spending (m.u.) on frozen products (Continuous)
5. DETERGENTS_PAPER: annual spending (m.u.) on detergents and paper products (Continuous)
6. DELICATESSEN: annual spending (m.u.) on delicatessen products (Continuous)
7. CHANNEL: sales channel (Hotel or Retail)
8. REGION: three regions (Lisbon, Oporto, Other)
9. BUYER/SPENDER: type of customer (buyer or spender)

The dataset contains numerical values for each of the product categories (FRESH, MILK, GROCERY,
FROZEN, DETERGENTS_PAPER, DELICATESSEN), representing the annual spending on each category.
The sales channel (CHANNEL) and region (REGION) attributes are categorical variables, while the
BUYER/SPENDER attribute is binary, indicating whether the customer is a buyer or a spender.

The dataset was collected to explore the purchasing patterns of wholesale customers and to support
analyses such as customer segmentation or product category analysis.

• Buyer/Spender: binary attribute indicating whether the customer is a buyer (1) or a spender
(0).
• Channel: categorical attribute indicating the sales channel through which the customer
made purchases (Retail or Hotel).
• Region: categorical attribute indicating the region where the customer is located (Other,
Lisbon or Oporto).
• Fresh: continuous attribute representing the annual spending (in monetary units) on fresh
products.
• Milk: continuous attribute representing the annual spending (in monetary units) on milk
products.
• Grocery: continuous attribute representing the annual spending (in monetary units) on
grocery products.
• Frozen: continuous attribute representing the annual spending (in monetary units) on frozen
products.
• Detergents_Paper: continuous attribute representing the annual spending (in monetary
units) on detergents and paper products.
• Delicatessen: continuous attribute representing the annual spending (in monetary units) on
delicatessen products.

The dataset['Region'].value_counts() call counts the number of occurrences of each unique value
in the 'Region' column of the dataset and returns a pandas Series with the counts.

Assuming that the dataset is loaded into a pandas DataFrame called dataset, running
dataset['Region'].value_counts() shows that there are 316 observations in the "Other" region,
77 observations in the "Lisbon" region, and 47 observations in the "Oporto" region.

dataset['Channel'].value_counts()

In the same way, dataset['Channel'].value_counts() counts the number of occurrences of each
unique value in the "Channel" column. The output is a pandas Series with the counts of each
unique value in descending order: 298 for "Hotel" and 142 for "Retail".
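As a minimal sketch of the two value_counts() calls, the snippet below rebuilds only the Region and Channel columns from the counts reported in this report instead of loading the real file. The variable name dataset follows the text; how Region and Channel values are paired within a row is an assumption and does not affect the per-column counts.

```python
import pandas as pd

# Rebuild the two categorical columns from the counts reported in the text.
# The pairing of Region and Channel within a row is arbitrary (an assumption);
# only the per-column counts come from the report.
regions = ["Other"] * 316 + ["Lisbon"] * 77 + ["Oporto"] * 47
channels = ["Hotel"] * 298 + ["Retail"] * 142
dataset = pd.DataFrame({"Region": regions, "Channel": channels})

# value_counts() returns a Series indexed by the unique values,
# sorted by count in descending order.
region_counts = dataset["Region"].value_counts()
channel_counts = dataset["Channel"].value_counts()

print(region_counts)
print(channel_counts)
```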

Descriptive Statistics of our Data:

This shows the count, mean, standard deviation, minimum, 25th percentile (Q1), median (50th
percentile or Q2), 75th percentile (Q3), and maximum values for each numerical column in the
dataset.

Descriptive Statistics of our Data including Channel & Region:

This shows the count, unique values, most frequent value (top), frequency of the most frequent
value (freq), mean, standard deviation, minimum, 25th percentile (Q1), median (50th percentile or
Q2), 75th percentile (Q3), and maximum values for each column in the dataset. Note that for
categorical columns such as "Channel" and "Region", the mean, standard deviation, and quartile
values are not applicable and are displayed as "NaN".
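The two summaries described above can be sketched as follows. The four-row DataFrame is hypothetical and only illustrates the shape of the output; the real dataset has 440 rows and six item columns.

```python
import pandas as pd

# Hypothetical four-row sample; values are illustrative, not the real data.
dataset = pd.DataFrame({
    "Channel": ["Hotel", "Retail", "Hotel", "Hotel"],
    "Region": ["Other", "Other", "Lisbon", "Oporto"],
    "Fresh": [12669, 7057, 6353, 13265],
    "Milk": [9656, 9810, 8808, 1196],
})

# Numeric columns only: count, mean, std, min, quartiles, max.
numeric_summary = dataset.describe()

# include="all" adds unique/top/freq for the categorical columns;
# statistics that do not apply to a column show up as NaN.
full_summary = dataset.describe(include="all")

print(numeric_summary)
print(full_summary)
```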

From the two describe() outputs above, we can infer the following:

Channel has two unique values, with "Hotel" the most frequent at 298 out of 440 records, i.e.
67.7% of the customers buy through the "Hotel" channel.

Region has three unique values, with "Other" the most frequent at 316 out of 440 records,
i.e. 71.8% of the customers are in the "Other" region.

Fresh (440 records) has a mean of 12000.3 and a standard deviation of 12647.3, with a minimum of 3
and a maximum of 112151.
Quartiles: Q1 (25%) = 3127.75, Q2 (50%, median) = 8504, Q3 (75%) = 16933.8.
Range = max - min = 112151 - 3 = 112148 and IQR = Q3 - Q1 = 16933.8 - 3127.75 = 13806.05 (the IQR
is used to compute the 1.5 × IQR lower and upper outlier limits).

Milk (440 records) has a mean of 5796.27 and a standard deviation of 7380.38, with a minimum of 55
and a maximum of 73498.
Quartiles: Q1 (25%) = 1533, Q2 (50%, median) = 3627, Q3 (75%) = 7190.25.
Range = max - min = 73498 - 55 = 73443 and IQR = Q3 - Q1 = 7190.25 - 1533 = 5657.25.

Grocery (440 records) has a mean of 7951.28 and a standard deviation of 9503.16, with a minimum of
3 and a maximum of 92780.
Quartiles: Q1 (25%) = 2153, Q2 (50%, median) = 4755.5, Q3 (75%) = 10655.8.
Range = max - min = 92780 - 3 = 92777 and IQR = Q3 - Q1 = 10655.8 - 2153 = 8502.8.

Frozen (440 records) has a mean of 3071.93 and a standard deviation of 4854.67, with a minimum of 25
and a maximum of 60869.
Quartiles: Q1 (25%) = 742.25, Q2 (50%, median) = 1526, Q3 (75%) = 3554.25.
Range = max - min = 60869 - 25 = 60844 and IQR = Q3 - Q1 = 3554.25 - 742.25 = 2812.

Detergents Paper (440 records) has a mean of 2881.49 and a standard deviation of 4767.85, with a
minimum of 3 and a maximum of 40827.
Quartiles: Q1 (25%) = 256.75, Q2 (50%, median) = 816.5, Q3 (75%) = 3922.
Range = max - min = 40827 - 3 = 40824 and IQR = Q3 - Q1 = 3922 - 256.75 = 3665.25.
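The range, IQR and 1.5 × IQR limits mentioned above are simple arithmetic on the quoted figures. The sketch below uses only the minimum, Q1, Q3 and maximum values stated in this section; the function name spread and the dictionary layout are illustrative choices, not part of the original analysis.

```python
# Summary statistics copied from the describe() figures quoted above:
# (minimum, Q1, Q3, maximum) per item.
summary = {
    "Fresh": (3, 3127.75, 16933.8, 112151),
    "Milk": (55, 1533, 7190.25, 73498),
    "Grocery": (3, 2153, 10655.8, 92780),
    "Frozen": (25, 742.25, 3554.25, 60869),
    "Detergents_Paper": (3, 256.75, 3922, 40827),
}


def spread(minimum, q1, q3, maximum):
    """Return range, IQR and the 1.5 * IQR lower/upper outlier limits."""
    iqr = q3 - q1
    return {
        "range": maximum - minimum,
        "iqr": iqr,
        "lower_limit": q1 - 1.5 * iqr,
        "upper_limit": q3 + 1.5 * iqr,
    }


limits = {item: spread(*stats) for item, stats in summary.items()}

for item, values in limits.items():
    print(item, values)
```

For each of these five items the maximum lies above the upper limit, so each has at least one outlier under the 1.5 × IQR rule.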

Which Region and which Channel spent the most, and which spent the least?

The highest spend by Region is from "Other" and the lowest is from "Oporto".

The highest spend by Channel is from "Hotel" and the lowest is from "Retail".
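One way to arrive at such a ranking is to sum all item columns per customer and then total that figure by Region and by Channel. The sketch below assumes this approach; the six-row sample and its values are hypothetical and include only three of the six item columns.

```python
import pandas as pd

# Hypothetical six-row sample with three of the six item columns;
# the real dataset has 440 rows and all six items.
dataset = pd.DataFrame({
    "Channel": ["Hotel", "Retail", "Hotel", "Hotel", "Retail", "Hotel"],
    "Region": ["Other", "Other", "Lisbon", "Oporto", "Lisbon", "Other"],
    "Fresh": [12000, 7000, 6000, 3000, 5000, 9000],
    "Milk": [4000, 9000, 2000, 1000, 6000, 3000],
    "Grocery": [5000, 11000, 3000, 2000, 8000, 4000],
})

items = ["Fresh", "Milk", "Grocery"]

# Total spending per customer, then summed within each group.
dataset["Total"] = dataset[items].sum(axis=1)
by_region = dataset.groupby("Region")["Total"].sum().sort_values(ascending=False)
by_channel = dataset.groupby("Channel")["Total"].sum().sort_values(ascending=False)

print(by_region)
print(by_channel)
print("Highest region:", by_region.idxmax(), "| lowest region:", by_region.idxmin())
```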

1.2 There are 6 different varieties of items that are considered. Describe and comment/explain all
the varieties across Region and Channel? Provide a detailed justification for your answer?

This bar plot shows the average "Fresh" item spending for each region across both the Retail and
Hotel channels.

This bar plot shows the average "Fresh" item spending for each region.

Based on the plots, the Fresh item sells more in the Hotel channel.

This bar plot shows the average "Milk" item spending for each region across both the Retail and
Hotel channels.

This bar chart shows the "Milk" column grouped by the "Channel" column, with the title
"Item - Milk". There is a separate bar for each value of "Channel", and the height of each bar
represents the average value of "Milk" within that channel, so the plot gives insight into how
average "Milk" spending differs between the channels.

This plot shows the "Milk" column with a separate bar for each value of the "Region" column.
The height of each bar represents the average value of "Milk" within that region, so the plot
gives insight into how average "Milk" spending differs between the regions.
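Since each bar height is a group average, the numbers behind the three "Milk" plots can be reproduced with a groupby. This is a sketch on a hypothetical six-row sample; the values are illustrative only.

```python
import pandas as pd

# Hypothetical sample; values are illustrative only.
dataset = pd.DataFrame({
    "Channel": ["Hotel", "Hotel", "Retail", "Retail", "Hotel", "Retail"],
    "Region": ["Other", "Lisbon", "Other", "Lisbon", "Oporto", "Oporto"],
    "Milk": [3000, 4000, 10000, 12000, 2000, 8000],
})

# Height of each bar in the per-channel and per-region plots.
milk_by_channel = dataset.groupby("Channel")["Milk"].mean()
milk_by_region = dataset.groupby("Region")["Milk"].mean()

# Height of each bar when the plot is grouped by Channel and coloured by Region.
milk_by_both = dataset.groupby(["Channel", "Region"])["Milk"].mean().unstack("Region")

print(milk_by_channel)
print(milk_by_region)
print(milk_by_both)
```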

Categorical plot of the "Grocery" item, grouped by "Channel" and coloured by "Region", with a title of
"Item - Grocery".

Categorical plot (catplot) of the "Grocery" item, grouped by "Channel". The "kind" parameter is set
to "bar" to create a bar chart, and the "ci" parameter is set to None to disable error bars. Finally, the
code sets the title of the plot to "Item - Grocery".

Categorical plot of the "Grocery" item, grouped by "Region", with a title of "Item - Grocery".

Categorical plot (catplot) of the "Frozen" item, grouped by "Channel" and coloured by "Region". The
"kind" parameter is set to "bar" to create a bar chart, and the "ci" parameter is set to None to
disable error bars. Finally, the code sets the title of the plot to "Item - Frozen".

The plot will have the "Channel" variable on the x-axis, the "Frozen" variable on the y-axis, and bars
showing the average of the "Frozen" variable for each category of the "Channel" variable. The title of
the plot will be "Item - Frozen".

The plot will have the "Region" variable on the x-axis, the "Frozen" variable on the y-axis, and bars
showing the average of the "Frozen" variable for each category of the "Region" variable. The title
of the plot will be "Item - Frozen".

Bar plot using the catplot function in seaborn. The "x" argument specifies the variable to be plotted
on the x-axis, and "y" specifies the variable to be plotted on the y-axis.

Bar plot of the "Detergents Paper" variable in the dataset, grouped by "Channel" and coloured by
"Region", with a title of "Item - Detergents Paper".

Bar plot of the "Detergents Paper" variable in the dataset, grouped by "Channel", with a title of
"Item - Detergents Paper".

Bar plot of the "Detergents Paper" variable in the dataset, grouped by "Region", with a title of "Item
- Detergents Paper".

Bar plot of the "Delicatessen" variable in the dataset, grouped by "Channel" and coloured by
"Region", with a title of "Delicatessen".

Bar plot of the "Delicatessen" variable in the dataset, grouped by "Channel", with a title of "Item -
Delicatessen".

Bar plot of the "Delicatessen" variable in the dataset, grouped by "Region", with a title of "Item -
Delicatessen".

1.3 On the basis of a descriptive measure of variability, which item shows the
most inconsistent behaviour? Which items show the least inconsistent
behaviour?

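By standard deviation, the Fresh item has the highest value, so it looks the most inconsistent,
and the Delicatessen item has the smallest value, so it looks the most consistent.

Standard deviation depends on the scale of each item, however, and the items have very different
means. The coefficient of variation (standard deviation divided by mean) removes that scale effect
and is the fairer measure for comparing items:

The "Fresh" item has the lowest coefficient of variation, so it is the most consistent.
The "Delicatessen" item has the highest coefficient of variation, so it is the most inconsistent.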
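The coefficient of variation can be checked by hand from the means and standard deviations quoted in section 1.1. The sketch below uses only those quoted figures; Delicatessen is left out because its mean and standard deviation are not quoted in the text, so the comparison here covers five of the six items.

```python
# Mean and standard deviation per item, copied from the describe() figures
# quoted in section 1.1. Delicatessen is omitted because its figures are
# not quoted in the text.
stats = {
    "Fresh": (12000.3, 12647.3),
    "Milk": (5796.27, 7380.38),
    "Grocery": (7951.28, 9503.16),
    "Frozen": (3071.93, 4854.67),
    "Detergents_Paper": (2881.49, 4767.85),
}

# Coefficient of variation = standard deviation / mean (unit-free).
cv = {item: std / mean for item, (mean, std) in stats.items()}

# Print from least to most variable.
for item, value in sorted(cv.items(), key=lambda pair: pair[1]):
    print(f"{item}: {value:.3f}")

least_variable = min(cv, key=cv.get)
print("Least variable of the five:", least_variable)
```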

1.4 Are there any outliers in the data? Back up your answer with a suitable
plot/technique with the help of detailed comments

The box plot covers the entire dataset, is sized 15 inches wide by 8 inches tall, and uses the "Set2"
colour palette. It displays the distribution of each numerical variable in the dataset, together
with any outliers or extreme values.

The black points in the box plot are the outliers.

Yes, there are outliers in all the items across the product range (Fresh, Milk,
Grocery, Frozen, Detergents Paper and Delicatessen).
Outliers are detected but not necessarily removed; it depends on the
situation. Here I assume that the wholesale distributor provided a
dataset with correct data, so I keep them as they are.
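The points a box plot draws as outliers are the values beyond 1.5 × IQR from the quartiles, so the same check can be done numerically. The sketch below is an illustration on a hypothetical list of spending values, not the report's own code; the helper name iqr_outliers is made up for this example.

```python
from statistics import quantiles


def iqr_outliers(values):
    """Return the values lying beyond 1.5 * IQR from the quartiles."""
    # method="inclusive" interpolates linearly between data points,
    # treating the sample minimum and maximum as the 0% and 100% points.
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical spending values for one item; not taken from the real data.
fresh_sample = [3, 55, 1200, 2500, 3100, 4800, 5200, 7400, 112151]
outliers = iqr_outliers(fresh_sample)
print(outliers)
```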

PROBLEM 2
The given dataset describes various universities and colleges. For each institution it records the
number of applications received, accepted and enrolled; the percentage of new students from the top
10% of their higher secondary class; the percentage of new students from the top 25% of their higher
secondary class; the number of full-time undergraduates; the number of part-time undergraduates;
the out-of-state tuition; the cost of room and board; the estimated book costs for a student; the
estimated personal spending for a student; the percentage of faculty with a PhD; the percentage of
faculty with a terminal degree; the student/faculty ratio; the percentage of alumni who donate; the
instructional expenditure per student; and the graduation rate.

INFERENCE OF THE DATASET


The dataset has 777 rows and 18 columns.

All columns except Names hold integer or float values; Names is the only categorical column.

There are no duplicate rows in the dataset.

The dataset has no missing or null values:

Names 0
Apps 0
Accept 0
Enroll 0
Top10perc 0
Top25perc 0
F.Undergrad 0
P.Undergrad 0
Outstate 0
Room.Board 0
Books 0
Personal 0
PhD 0
Terminal 0
S.F.Ratio 0
perc.alumni 0
Expend 0
Grad.Rate 0
dtype: int64
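Checks of this kind are usually one-liners in pandas. The sketch below runs them on a hypothetical three-row frame that borrows a few of the 18 column names; the frame name college, the institution names and the values are all illustrative assumptions.

```python
import pandas as pd

# Hypothetical three-row sample using a few of the 18 column names.
college = pd.DataFrame({
    "Names": ["College A", "College B", "College C"],
    "Apps": [1660, 2186, 1428],
    "Accept": [1232, 1924, 1097],
    "Grad.Rate": [60, 56, 54],
})

print(college.shape)               # (rows, columns)
print(college.dtypes)              # Names holds text; the rest are numeric
print(college.duplicated().sum())  # number of fully duplicated rows

# Missing values per column, in the same layout as the listing above.
missing_per_column = college.isnull().sum()
print(missing_per_column)
```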

Perform Exploratory Data Analysis


UNIVARIATE ANALYSIS

Univariate analysis helps us understand the distribution of the data in the dataset. With it
we can find patterns and summarize the data for each variable.

APPS

The box plot of the Apps variable shows outliers, and the distribution of the data is skewed. A
college or university typically receives in the range of 3000 to 5000 applications, and the
maximum is around 50,000. For the univariate analysis of Apps we use a box plot and a dist plot
to find information or patterns in the data, and the box plot clearly shows that the variable
has outliers.

ACCEPT

The Accept variable shows outliers. The dist plot shows that the majority of universities
accepted in the range of 70 to 1500 applications. The Accept variable is positively skewed.
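"Positively skewed" can be confirmed numerically as well as from the plot: a long right tail gives a skewness above zero. The sketch below computes the population skewness by hand on a hypothetical list of acceptance counts; the values and the helper name skewness are illustrative only.

```python
from statistics import mean, pstdev


def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mu = mean(values)
    sigma = pstdev(values)
    return mean(((v - mu) / sigma) ** 3 for v in values)


# Hypothetical "Accept" counts with a long right tail; not real data.
accept_sample = [70, 300, 450, 600, 900, 1500, 26000]
skew = skewness(accept_sample)

print(f"skewness = {skew:.2f}")
print("positively skewed" if skew > 0 else "negatively skewed or symmetric")
```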

ENROLL

The box plot of the Enroll variable also has outliers. The distribution of the data is positively skewed.
From the dist plot we can see that the majority of the colleges have enrolled in the range of
200 to 500 students.

TOP 10 PERC
The box plot of the students from the top 10 percent of the higher secondary class shows
outliers. The distribution is positively skewed. A good share of the intake, about 30 to
50 percent, comes from the top 10 percent of the higher secondary class.

TOP 25 PERC

The box plot for the top 25% has no outliers. The distribution is almost normal.
The majority of the students are from the top 25% of the higher secondary class.

FULL TIME UNDERGRADUATE

The box plot of full-time undergraduates has outliers. The distribution of the data is positively skewed.

The universities have in the range of about 3000 to 5000 full-time undergraduates.

PART TIME UNDERGRADUATE

The box plot of part-time undergraduates has outliers. The distribution of the data is positively
skewed. The universities have in the range of about 1000 to 3000 part-time undergraduates.

OUTSTATE

The box plot of outstate has only one outlier. The distribution is almost normally distributed.

ROOM BOARD

The Room.Board variable has a few outliers. The distribution is normal.

BOOKS

The box plot of books has outliers. The distribution seems to be bimodal. The cost of books per
student seems to be in the range of 500 to 100.

PERSONAL

The box plot of personal expense has outliers. Some students' personal expenses are much higher
than those of the rest. The distribution is positively skewed.

PHD

The box plot of PHD has outliers. The distribution seems to be negatively skewed.

TERMINAL

The box plot of terminal seems to have outliers in the dataset. The distribution for the terminal also
seems to be negatively skewed.

SF RATIO

The SF ratio variable also has outliers in the dataset. The distribution is almost normal.
The student/faculty ratio is almost the same across all the universities and colleges.

PERC ALUMNI

The box plot of the percentage of alumni shows outliers in the dataset. The distribution is almost
normal.

EXPENDITURE

The expenditure variable also has outliers in the dataset. The distribution of the expenditure is
positively skewed.

GRAD RATE

The graduation rate among students is above 60% across the universities. The box plot of the
graduation rate has outliers in the dataset. The distribution is normal.

MULTIVARIATE ANALYSIS

The pair plot helps us understand the relationship between all the numerical variables in the
dataset. By comparing all the variables with each other we can see the patterns or trends
in the dataset.

HEATMAP

This heatmap shows the correlation between each pair of numerical variables.

The applications variable is highly positively correlated with applications accepted, students
enrolled and full-time undergraduates. This suggests that colleges receiving more applications
also accept more applicants, enroll more students and have more full-time undergraduates.

There is a negative correlation between applications and the percentage of alumni. This suggests
that colleges receiving more applications tend to have a lower percentage of alumni who donate.

Applications are positively correlated with the top 10 and top 25 percent of the higher secondary
class, outstate, room board, books, personal, PhD, terminal, S.F ratio, expenditure and graduation rate.
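The figures a correlation heatmap colours come from a pairwise correlation matrix. The sketch below computes one with pandas on a hypothetical five-college sample that borrows four of the column names; the values are illustrative and chosen only so that the signs resemble those described above.

```python
import pandas as pd

# Hypothetical five-college sample; values are illustrative only.
college = pd.DataFrame({
    "Apps": [1000, 2000, 3000, 5000, 8000],
    "Accept": [800, 1500, 2200, 3600, 5500],
    "Enroll": [300, 500, 800, 1200, 2000],
    "perc.alumni": [40, 30, 25, 15, 10],
})

# Pairwise Pearson correlation; this matrix is what a heatmap colours.
corr = college.corr()
print(corr.round(2))
```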
