
1.1 Univariate Analysis: 1.1.1 Categorical Data

DATA ANALYTICS AND DATA SCIENCE

Uploaded by

Eugene Berna I
Copyright
© © All Rights Reserved

22AMH32 – DATA ANALYTICS AND DATA SCIENCE

UNIT I & DATA VISUALIZATION

BASIC TOOLS (PLOTS, GRAPHS AND SUMMARY STATISTICS) OF EDA

1. DATA VISUALIZATION BASIC TOOLS (PLOTS, GRAPHS AND SUMMARY STATISTICS) OF EDA
Data visualization presents text or numerical data in a visual format, which makes the
information in the data easier to grasp. People tend to remember pictures more easily than
text, and Python provides several libraries for data visualization, such as Matplotlib,
seaborn, and Plotly. In this tutorial, we use Matplotlib and seaborn to explore data with
a variety of plots.
1.1 Univariate Analysis
Univariate analysis is the simplest form of analysis: we explore a single variable at a
time in order to describe it. Numerical and categorical variables are analyzed
differently because each calls for different plots.

1.1.1 Categorical Data


A variable that takes one of a limited set of labels, often stored as text, is called a
categorical variable. Let's look at the plots we can use to visualize categorical data.

1) CountPlot
A countplot is a frequency plot drawn as a bar graph: it plots the count of each category
as a separate bar. It is the visual counterpart of the pandas value_counts function
applied to a column. In our data the target variable, Survived, is categorical, so let us
plot a countplot of it.
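
The source describes the countplot without showing code, so here is a minimal sketch. It assumes a small hypothetical DataFrame in place of the real passenger data; only the column name Survived comes from the text.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the passenger data; only the column name matches the text.
data = pd.DataFrame({"Survived": [0, 1, 1, 0, 0, 1, 0, 0]})

# value_counts() returns the numbers that the countplot draws as bars.
counts = data["Survived"].value_counts()
print(counts)

# One bar per category, bar height = number of rows in that category.
ax = sns.countplot(x="Survived", data=data)
plt.show()
```

Printing the counts next to the plot makes it easy to confirm that the bars match the table.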

2) Pie Chart
A pie chart shows the same information as the countplot, but expresses it as the
percentage share of each category, that is, how much weight each category carries in the
data. Let us check the Sex column to see what percentage of the travelers are male and
female.

import matplotlib.pyplot as plt
# data is assumed to be the passenger DataFrame, already loaded with pandas
data['Sex'].value_counts().plot(kind="pie", autopct="%.2f")
plt.show()
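
The percentages printed on the pie slices can also be computed directly. The sketch below assumes a hypothetical five-row Sex column and uses the normalize option of value_counts.

```python
import pandas as pd

# Hypothetical stand-in for the Sex column of the passenger data.
data = pd.DataFrame({"Sex": ["male", "male", "female", "male", "female"]})

# normalize=True returns fractions; multiply by 100 to get the pie-slice percentages.
percentages = data["Sex"].value_counts(normalize=True) * 100
print(percentages.round(2))
```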

1.1.2 Numerical Data


Analyzing numerical data is important because understanding how each variable is
distributed guides further processing. Numerical columns often contain inconsistencies
such as missing values and outliers, so explore them carefully.

1) Histogram
A histogram shows how the values of a numerical column are distributed. It divides the
range of values into bins and plots how many values fall into each bin, so we can see
where most values lie: toward the low end, toward the high end, or around the center
(mean). Let's have a look at the Age column.

# dropna() guards against missing ages, which would make plt.hist fail
plt.hist(data['Age'].dropna(), bins=5)
plt.show()
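
To see what bins=5 does, the binning can be reproduced without drawing anything. This is a sketch on a hypothetical list of ages; np.histogram applies equal-width bins to the range of the data, which is also what plt.hist does when given an integer bin count.

```python
import numpy as np

# Hypothetical ages, not taken from the real dataset.
ages = np.array([2, 9, 18, 22, 25, 29, 31, 35, 40, 47, 58, 80])

# bins=5 splits the interval [min, max] into five equal-width bins.
counts, edges = np.histogram(ages, bins=5)
print(edges)   # six bin boundaries
print(counts)  # number of ages falling into each of the five bins
```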
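2) Distplot
A distplot is a slight improvement on the histogram: it overlays a KDE (Kernel Density
Estimate) curve on the histogram. The KDE is an estimate of the PDF (Probability Density
Function), which describes how densely values are concentrated around each point; the
probability of a value falling in a range is the area under the curve over that range.
If you have studied statistics before, the PDF will be familiar. Note that sns.distplot
has been deprecated since seaborn 0.11; sns.histplot with kde=True draws the same kind of
plot.

# histplot(..., kde=True) replaces the deprecated sns.distplot
sns.histplot(data['Age'], kde=True, stat="density")
plt.show()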
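
The idea behind the KDE curve can be sketched by hand. The example assumes hypothetical ages and a bandwidth picked by hand (seaborn chooses one automatically), and the helper name gaussian_kde_curve is only illustrative. It shows that the curve has total area 1 and that probabilities are areas under it, not heights.

```python
import numpy as np

# Hypothetical ages, not taken from the real dataset.
ages = np.array([2, 9, 18, 22, 25, 29, 31, 35, 40, 47, 58, 80], dtype=float)

def gaussian_kde_curve(samples, grid, bandwidth):
    """Average of Gaussian bumps, one centred on each sample."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

grid = np.linspace(-40, 120, 801)
density = gaussian_kde_curve(ages, grid, bandwidth=8.0)  # bandwidth is an assumption

# A density integrates to (approximately) 1 over the whole grid.
step = grid[1] - grid[0]
total_area = density.sum() * step

# The estimated probability of an age between 20 and 40 is the area over that range.
mask = (grid >= 20) & (grid <= 40)
prob_20_to_40 = density[mask].sum() * step
print(round(total_area, 3), round(prob_20_to_40, 3))
```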

3) Boxplot
A boxplot draws the five-number summary of a variable. To read it, we need a few terms.

• Median – the middle value of the series after sorting.
• Percentile – the value below which a given percentage of the data falls. For example,
if the 25th percentile is 50, then 25% of the values lie below 50.
• Minimum and Maximum – in a boxplot these are not necessarily the smallest and largest
values in the data. They are the lower and upper boundaries beyond which points are
treated as outliers, calculated from the interquartile range (IQR):
IQR = Q3 - Q1
Lower_boundary = Q1 - 1.5 * IQR
Upper_boundary = Q3 + 1.5 * IQR
Here Q1 and Q3 are the 1st quartile (25th percentile) and the 3rd quartile (75th percentile).
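
The boundary formulas above can be worked through on a small example. The fares below are hypothetical, and np.percentile uses linear interpolation between data points by default, so other tools may report slightly different quartiles.

```python
import numpy as np

# Hypothetical fares, not taken from the real dataset.
fares = np.array([7, 8, 8, 10, 13, 15, 26, 30, 52, 80, 263], dtype=float)

# Quartiles and median (linear interpolation is the NumPy default).
q1, median, q3 = np.percentile(fares, [25, 50, 75])

iqr = q3 - q1
lower_boundary = q1 - 1.5 * iqr
upper_boundary = q3 + 1.5 * iqr

# Points outside the boundaries are the ones a boxplot draws as individual outliers.
outliers = fares[(fares < lower_boundary) | (fares > upper_boundary)]
print(q1, median, q3, iqr, lower_boundary, upper_boundary, outliers)
```

On this data only the largest fare falls outside the boundaries, so a boxplot would draw it as a separate point.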
Bivariate/Multivariate Analysis
So far we have studied plots for exploring a single categorical or numerical variable.
Bivariate analysis explores the relationship between two variables; this matters because
understanding how variables relate to each other is central to building a good model.
When we analyze more than two variables together, it is called multivariate analysis. We
will work through plots for both.

1.1.3 Numerical and Numerical


First, let's explore the plots used when both variables are numerical.

1) Scatter Plot
A scatter plot is the simplest way to show the relationship between two numerical
variables. Let us look at the relationship between the total bill and the tip using a
scatter plot.

# recent seaborn versions require x and y as keyword arguments
sns.scatterplot(x=tips["total_bill"], y=tips["tip"])

Multivariate analysis with scatter plot


We can also show relationships among three or four variables in a scatter plot. Suppose
we want to see how total bill and tip differ between male and female customers; the hue
argument colors each point by gender.

sns.scatterplot(x=tips["total_bill"], y=tips["tip"], hue=tips["sex"])
plt.show()
The style argument adds a fourth variable by varying the marker shape. Suppose that,
along with gender, we also want to know whether the customer was a smoker.

sns.scatterplot(x=tips["total_bill"], y=tips["tip"], hue=tips["sex"], style=tips['smoker'])
plt.show()
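
The same plot can be written in seaborn's long-form style, passing the DataFrame once and naming columns as strings. The sketch below assumes a few hypothetical rows shaped like the tips data, so that it runs without downloading anything.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical rows with the same column names as the tips data used in the text.
tips = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 25.29, 8.77, 26.88],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61, 4.71, 2.00, 3.12],
    "sex": ["Female", "Male", "Male", "Male", "Female", "Male", "Male", "Male"],
    "smoker": ["No", "No", "Yes", "No", "Yes", "No", "Yes", "No"],
})

# hue colors the points by gender, style changes the marker by smoker status.
ax = sns.scatterplot(data=tips, x="total_bill", y="tip", hue="sex", style="smoker")
ax.set_title("Tip vs. total bill")
plt.show()
```

Naming columns as strings also gives the axes and legend readable labels automatically.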

1.1.4 Numerical and Categorical


If one variable is numerical and the other is categorical, there are several plots we can
use for bivariate and multivariate analysis.

1) Bar Plot
A bar plot puts the categorical variable on the x-axis and the numerical variable on the
y-axis; by default seaborn draws the mean of the numerical variable for each category.
The black line on top of each bar shows the confidence interval of that mean. Let us
explore Pclass against Age.
sns.barplot(x=data['Pclass'], y=data['Age'])
plt.show()
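
The bar heights that seaborn draws by default are per-category means, which can be checked with a pandas groupby. The rows below are hypothetical; only the column names come from the text.

```python
import pandas as pd

# Hypothetical passengers; only the column names match the text.
data = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3],
    "Age": [38.0, 54.0, 27.0, 35.0, 22.0, 26.0, 2.0],
})

# Mean age per passenger class: the default height of each bar in sns.barplot.
mean_age = data.groupby("Pclass")["Age"].mean()
print(mean_age)
```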

Multivariate analysis using Bar plot


The hue argument is very useful for analyzing more than two variables. Here we plot fare
by passenger class and split each bar by gender.

sns.barplot(x=data['Pclass'], y=data['Fare'], hue=data["Sex"])
plt.show()

2) Boxplot
We have already studied boxplots in the univariate analysis above. With a categorical
variable on the x-axis, seaborn draws a separate box for each category. Let us explore
age by gender using a boxplot.

sns.boxplot(x=data['Sex'], y=data["Age"])
plt.show()
Multivariate analysis with boxplot
Along with age and gender, let's see who survived and who did not.

# the third variable is passed as hue
sns.boxplot(x=data['Sex'], y=data["Age"], hue=data["Survived"])
plt.show()
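
The center line of each box in such a grouped boxplot is the median of one group, which can be computed with a groupby. The rows below are hypothetical and the result is only meant to show the mechanics.

```python
import pandas as pd

# Hypothetical passengers; only the column names match the text.
data = pd.DataFrame({
    "Sex": ["female", "female", "female", "female", "male", "male", "male", "male"],
    "Age": [4.0, 30.0, 40.0, 60.0, 3.0, 25.0, 35.0, 70.0],
    "Survived": [1, 1, 0, 1, 1, 0, 0, 0],
})

# One median per (Sex, Survived) pair: one box per pair in the grouped boxplot.
median_age = data.groupby(["Sex", "Survived"])["Age"].median()
print(median_age)
```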

3) Distplot
A distplot with hist=False draws only the kernel density estimate of the PDF. Distplot
has no hue parameter, but we can get the same effect by drawing one curve per group.
Suppose we want to compare the age distribution of passengers who survived with that of
passengers who died, and find the age ranges where survivors are relatively more common.

# kdeplot replaces the deprecated sns.distplot(..., hist=False)
sns.kdeplot(x=data[data['Survived'] == 0]['Age'], color="blue")
sns.kdeplot(x=data[data['Survived'] == 1]['Age'], color="orange")
plt.show()
The graph is quite informative. The blue curve shows the age density of passengers who
died, and the orange curve shows the age density of those who survived. For children the
orange curve lies above the blue one, so children are relatively more common among
survivors, while the opposite holds for older passengers. A small analysis like this can
reveal important patterns and helps when preparing data stories.

1.1.5 Categorical and Categorical


Now we will work with two categorical columns.

1) Heatmap
If you have used the pandas crosstab function, a heatmap is the visual representation of
that table. It shows how often each category of one variable occurs together with each
category of the other. Let us look at the crosstab first and then at the heatmap.

pd.crosstab(data['Pclass'], data['Survived'])
Now the heatmap shows, as color intensity, how many people in each class survived and died.

sns.heatmap(pd.crosstab(data['Pclass'], data['Survived']))
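
A heatmap is easier to read when each cell also shows its count. The sketch below assumes ten hypothetical passengers and uses the annot and fmt options of sns.heatmap; the color map is an arbitrary choice.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical passengers; only the column names match the text.
data = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 0, 1, 0],
})

# Rows are classes, columns are survival outcomes, cells are passenger counts.
table = pd.crosstab(data["Pclass"], data["Survived"])
print(table)

# annot=True writes the count into each cell; fmt="d" formats it as an integer.
ax = sns.heatmap(table, annot=True, fmt="d", cmap="Blues")
plt.show()
```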

2) Cluster map
We can also use a cluster map to understand the relationship between two categorical
variables. A cluster map draws the heatmap together with dendrograms that group rows and
columns with similar behavior.

sns.clustermap(pd.crosstab(data['Parch'], data['Survived']))
plt.show()
DISCUSSION QUESTIONS:

1. How do different types of plots and graphs (e.g., histograms, scatter plots, box plots)
contribute to understanding the distribution and relationships within datasets?
2. What role do summary statistics (e.g., mean, median, standard deviation) play in
summarizing key characteristics of data, and how can visualizations enhance their
interpretation?
3. In what ways can interactive visualization tools (e.g., Plotly, Tableau) enhance the
exploration and communication of insights derived from EDA?
