1.1 Univariate Analysis: 1.1.1 Categorical Data

DATA ANALYTICS AND DATA SCIENCE

Uploaded by

Eugene Berna I

22AMH32 – DATA ANALYTICS AND DATA SCIENCE

UNIT I & DATA VISUALIZATION BASIC TOOLS (PLOTS, GRAPHS AND SUMMARY STATISTICS) OF EDA

1. DATA VISUALIZATION BASIC TOOLS (PLOTS, GRAPHS AND SUMMARY STATISTICS) OF EDA
Data visualization represents text or numerical data in a visual format, which makes the
information the data expresses easier to grasp. People remember pictures more easily than
text, and Python provides several libraries for data visualization, such as Matplotlib,
seaborn, and Plotly. In this tutorial, we will use Matplotlib and seaborn to explore data
with a variety of plots.
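
The snippets in these notes refer to two DataFrames, data and tips, without showing how they are created. The column names suggest Titanic passenger data and seaborn's example tips data, but that is an assumption and not something the notes state. A minimal setup sketch, using a tiny hand-made stand-in for data so that it runs without any files:

```python
import pandas as pd

# Hypothetical stand-in for the Titanic-style `data` frame used in these notes.
# The values are made up for illustration; real data would typically be read
# from a CSV file with pd.read_csv.
data = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0, 1, 0, 1],
    "Pclass":   [3, 1, 3, 1, 3, 2, 3, 2],
    "Sex":      ["male", "female", "female", "male",
                 "male", "female", "male", "female"],
    "Age":      [22.0, 38.0, 26.0, 54.0, 2.0, 14.0, 35.0, 27.0],
    "Fare":     [7.25, 71.28, 7.92, 51.86, 21.08, 30.07, 8.05, 13.0],
    "Parch":    [0, 0, 0, 0, 1, 0, 0, 2],
})

# A first look: column types and basic summary statistics.
print(data.dtypes)
print(data.describe())
```

If tips is indeed seaborn's example dataset, it can be fetched with sns.load_dataset("tips"), which needs an internet connection the first time it is called.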
1.1 Univariate Analysis
Univariate analysis is the simplest form of analysis: we explore a single variable at a
time in order to describe it. We analyze numerical and categorical variables differently
because each calls for different plots.

1.1.1 Categorical Data

A variable that holds text-based information (labels or groups) is referred to as a
categorical variable. Let's look at the plots we can use to visualize categorical data.

1) CountPlot
A countplot is a frequency plot in the form of a bar graph: it draws one bar per category,
with the bar height equal to the number of rows in that category. It is the visual form of
the pandas value_counts function applied to a column. In our data the target variable is
Survived, and it is categorical, so let us draw a countplot of it.
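
The notes do not include the code for this plot. A minimal sketch of what it could look like, assuming seaborn's countplot function and a small made-up Survived column in place of the real data (the name toy and its values are hypothetical):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data standing in for the 'Survived' column.
toy = pd.DataFrame({"Survived": [0, 1, 0, 0, 1, 0, 1, 0]})

# The numbers behind the plot: one count per category.
counts = toy["Survived"].value_counts().sort_index()
print(counts.to_dict())

# The same information drawn as one bar per category.
ax = sns.countplot(x="Survived", data=toy)
plt.savefig("countplot.png")  # in an interactive session, plt.show() works too
plt.close()
```

Printing value_counts next to the plot is a quick way to check that the bar heights match the table.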

2) Pie Chart
A pie chart shows the same information as a countplot, but as each category's percentage
share of the data. Let us check the Sex column to see what percentage of the passengers
were male and what percentage were female.

data['Sex'].value_counts().plot(kind="pie", autopct="%.2f")
plt.show()

1.1.2 Numerical Data

Analyzing numerical data is important because understanding the distribution of each
variable guides how the data is processed further. Numerical columns often contain
inconsistencies, so explore them carefully.

1) Histogram
A histogram shows how the values of a numerical column are distributed. It divides the
range of values into bins and plots how many values fall into each bin. From the shape we
can see whether the distribution is positively skewed, negatively skewed, or concentrated
around the center (mean). Let's have a look at the Age column.

plt.hist(data['Age'], bins=5)
plt.show()
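
To see what bins=5 does, the bin edges and counts can be computed directly. The sketch below uses NumPy's histogram function, which plt.hist relies on for its binning; the ages array is made up for illustration.

```python
import numpy as np

# Hypothetical ages, made up for illustration.
ages = np.array([2, 14, 22, 26, 27, 35, 38, 54, 61, 80])

# Five equal-width bins between the smallest and largest value,
# mirroring bins=5 in plt.hist.
counts, edges = np.histogram(ages, bins=5)

# Print each bin's range and how many values fall into it.
for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f}: {n}")
```

Each bin is 15.6 years wide here because the range 2 to 80 is split into five equal parts. Changing bins changes the shape of the histogram, so it is worth trying a few values.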
2) Distplot
Distplot is sometimes described as a second histogram because it is a slightly improved
version of one. It overlays a KDE (kernel density estimate) on the histogram. The KDE is a
smooth estimate of the PDF (probability density function), which describes how likely
values in each range of the column are. If you have studied statistics before, you will
recognize the PDF. Note that sns.distplot has been deprecated since seaborn 0.11 in favor
of histplot and displot, so recent versions print a warning.

sns.distplot(data['Age'])
plt.show()

3) Boxplot
A boxplot is a plot of the five-number summary: minimum, first quartile (Q1), median,
third quartile (Q3), and maximum. To read it we need a few terms.

• Median – The middle value of the series after sorting.

• Percentile – The value below which a given percentage of the observations fall. For
example, if 50 values lie below the 25th percentile, then those 50 values make up
25% of the data.
• Minimum and Maximum – In a boxplot these are not necessarily the smallest and largest
values in the data. They are the lower and upper boundaries for the whiskers, calculated
from the interquartile range (IQR). The whiskers extend to the most extreme data points
inside these boundaries, and points outside them are drawn individually as outliers.
IQR = Q3 - Q1
Lower_boundary = Q1 - 1.5 * IQR
Upper_boundary = Q3 + 1.5 * IQR
Here Q1 and Q3 are the 1st quartile (25th percentile) and the 3rd quartile (75th percentile).
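
The formulas above can be checked by hand on a small sample. The sketch below uses NumPy's percentile function with its default linear interpolation; the values array is made up and deliberately contains one extreme point.

```python
import numpy as np

# Hypothetical sample with one extreme value, made up for illustration.
values = np.array([2, 14, 22, 26, 27, 35, 38, 54, 61, 150])

# Quartiles (NumPy's default linear interpolation between data points).
q1, median, q3 = np.percentile(values, [25, 50, 75])

iqr = q3 - q1
lower_boundary = q1 - 1.5 * iqr
upper_boundary = q3 + 1.5 * iqr

# Points outside the boundaries are the ones a boxplot draws as outliers.
outliers = values[(values < lower_boundary) | (values > upper_boundary)]

# The whiskers end at the most extreme points that are still inside.
inside = values[(values >= lower_boundary) & (values <= upper_boundary)]
print(q1, median, q3, iqr)
print(lower_boundary, upper_boundary)
print(inside.min(), inside.max(), outliers)
```

Here the largest value, 150, lies above the upper boundary of 90.5, so the upper whisker stops at 61 and 150 is flagged as an outlier. Matplotlib's plt.boxplot applies the same 1.5 * IQR rule by default (its whis parameter).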

Bivariate/ Multivariate Analysis
We have studied plots for exploring a single categorical or numerical variable.
Bivariate analysis explores the relationship between 2 different variables. We do this
because, in the end, understanding how variables relate to each other is what lets us
build a powerful model. When we analyze more than 2 variables together, it is known as
multivariate analysis. We will work with different plots for bivariate as well as
multivariate analysis.

1.1.3 Numerical and Numerical

First, let's explore the plots we can use when both variables are numerical.

1) Scatter Plot
A scatter plot is the simplest way to plot the relationship between two numerical
variables. Let us see the relationship between the total bill and the tip using a scatter
plot. The x and y arguments are passed by keyword because recent seaborn versions no
longer accept them positionally.

sns.scatterplot(x=tips["total_bill"], y=tips["tip"])

Multivariate analysis with scatter plot

We can also plot relationships among 3 or 4 variables with a scatter plot. Suppose we want
to see how the total bill and the tip relate separately for male and female customers.

sns.scatterplot(x=tips["total_bill"], y=tips["tip"], hue=tips["sex"])

plt.show()

We can also analyze 4 variables with a scatter plot by using the style argument. Suppose
that, along with gender, we also want to know whether the customer was a smoker.

sns.scatterplot(x=tips["total_bill"], y=tips["tip"], hue=tips["sex"], style=tips["smoker"])

plt.show()

1.1.4 Numerical and Categorical

If one variable is numerical and the other is categorical, there are several plots we can
use for bivariate and multivariate analysis.

1) Bar Plot
A bar plot places the categorical variable on the x-axis and the numerical variable on the
y-axis so that we can explore the relationship between the two. By default the height of
each bar is the mean of the numerical variable in that category, and the black line on top
of each bar shows the confidence interval of that mean. Let us explore Pclass with Age.
sns.barplot(x=data['Pclass'], y=data['Age'])
plt.show()

Multivariate analysis using Bar plot

The hue argument is very useful for analyzing more than 2 variables. Here we plot Fare for
each Pclass and split each bar by gender.

sns.barplot(x=data['Pclass'], y=data['Fare'], hue=data["Sex"])

plt.show()
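
Since a bar plot shows a mean per category by default, the bar heights can be reproduced with a pandas groupby. The sketch below does this on a made-up table (the name toy and all of its values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data, made up for illustration.
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3, 3],
    "Sex": ["male", "female", "male", "female",
            "male", "female", "male", "female"],
    "Fare": [50.0, 80.0, 20.0, 30.0, 8.0, 10.0, 7.0, 9.0],
})

# Bar heights for a plot of Fare by Pclass: one mean per class.
mean_by_class = toy.groupby("Pclass")["Fare"].mean()
print(mean_by_class)

# Bar heights when hue="Sex" is added: one mean per (class, sex) pair,
# arranged with classes as rows and sexes as columns.
mean_by_class_sex = toy.groupby(["Pclass", "Sex"])["Fare"].mean().unstack()
print(mean_by_class_sex)
```

The confidence interval lines have no counterpart in this table. seaborn estimates them by bootstrapping, so they can differ slightly from run to run.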

2) Boxplot
We have already studied boxplots in the univariate analysis above. We can draw a separate
boxplot of the numerical variable for each category. Let us explore gender with age using
a boxplot.

sns.boxplot(x=data['Sex'], y=data["Age"])

Multivariate analysis with boxplot
Along with age and gender, let's see who survived and who did not.

sns.boxplot(x=data['Sex'], y=data["Age"], hue=data["Survived"])

plt.show()

3) Distplot
Distplot draws an estimate of the PDF using kernel density estimation. Distplot does not
have a hue parameter, but we can get the same effect by drawing one curve per group on the
same axes. Suppose we want to compare the age distribution of passengers who survived with
that of passengers who died, and find the age ranges where survivors are relatively more
common.

sns.distplot(data[data['Survived'] == 0]['Age'], hist=False, color="blue")

sns.distplot(data[data['Survived'] == 1]['Age'], hist=False, color="orange")
plt.show()

The resulting graph is interesting. The blue curve shows the age distribution of
passengers who died, and the orange curve shows the age distribution of those who
survived. If we observe it, we can see that for children the survival curve is higher than
the death curve, and the opposite holds for older people. A small analysis like this can
reveal important things about the data, and it helps when preparing data stories.

1.1.5 Categorical and Categorical

Now we will work with two categorical columns.

1) Heatmap
If you have ever used the pandas crosstab function, a heatmap is a visual representation
of the same table. It shows how often each category of one variable occurs together with
each category of the other. Let me show it first with crosstab and then with a heatmap.

pd.crosstab(data['Pclass'], data['Survived'])

Now with a heatmap, we can see how many people in each class survived and died.

sns.heatmap(pd.crosstab(data['Pclass'], data['Survived']))
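
To make the link between the table and the colors explicit, the counts can be printed and also written into the cells of the heatmap. A minimal sketch on made-up data (toy and its values are hypothetical); annot, fmt, and cmap are existing parameters of seaborn's heatmap function:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data, made up for illustration.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# Rows are passenger classes, columns are survival outcomes,
# and each cell is the number of rows with that combination.
table = pd.crosstab(toy["Pclass"], toy["Survived"])
print(table)

# annot=True writes each count into its cell; fmt="d" formats it as an integer.
ax = sns.heatmap(table, annot=True, fmt="d", cmap="Blues")
plt.savefig("pclass_survived_heatmap.png")  # or plt.show() interactively
plt.close()
```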

2) Cluster map
We can also use a cluster map to understand the relationship between two categorical
variables. A cluster map draws a heatmap together with dendrograms that group rows and
columns with similar behavior.

sns.clustermap(pd.crosstab(data['Parch'], data['Survived']))
plt.show()
DISCUSSION QUESTIONS:

1. How do different types of plots and graphs (e.g., histograms, scatter plots, box plots)
contribute to understanding the distribution and relationships within datasets?
2. What role do summary statistics (e.g., mean, median, standard deviation) play in
summarizing key characteristics of data, and how can visualizations enhance their
interpretation?
3. In what ways can interactive visualization tools (e.g., Plotly, Tableau) enhance the
exploration and communication of insights derived from EDA?
