UNIT 3 EDA - notes
Exploratory Data Analysis (Anna University)
Studocu is not sponsored or endorsed by any college or university
Downloaded by Mohideen Abdul (mohideengamer@gmail.com)
Univariate analysis is a statistical technique that examines a single variable to understand its
distribution, central tendency, and variability. It's the simplest form of data analysis and
serves as a foundation for more complex analyses. Here’s an introduction to key concepts:
Key Concepts
1. Definition: Univariate analysis focuses on one variable at a time, providing insights
into its characteristics without considering relationships with other variables.
2. Types of Data:
o Categorical Data: Represents categories or groups (e.g., gender, colors).
o Numerical Data: Represents quantities that can be measured (e.g., age,
income).
3. Descriptive Statistics:
o Measures of Central Tendency:
Mean: The average value.
Median: The middle value when data is sorted.
Mode: The most frequently occurring value.
o Measures of Dispersion:
Range: Difference between the maximum and minimum values.
Variance: Measure of how much values deviate from the mean.
Standard Deviation: The square root of variance, indicating
dispersion in the same units as the data.
4. Data Visualization:
o Histograms: Graphical representation of frequency distribution.
o Box Plots: Visual summary showing median, quartiles, and outliers.
o Bar Charts: Used for categorical data to show frequencies or proportions.
5. Distribution Analysis:
o Understanding the shape of the data distribution (normal, skewed, bimodal,
etc.) can inform further analysis and model selection.
Applications
Market Research: Analyzing customer preferences or demographic data.
Quality Control: Assessing product measurements against specifications.
Social Sciences: Exploring characteristics of populations, such as income levels or
education attainment.
Importance
Univariate analysis provides essential insights that can guide decision-making, identify
patterns, and inform further research or multivariate analyses. It’s crucial for initial data
exploration and understanding the data’s characteristics before delving deeper into
relationships between multiple variables.
Conclusion
Mastering univariate analysis is a foundational skill in statistics, equipping analysts with the
tools to summarize and interpret data effectively.
EXAMPLES FOR UNIVARIATE ANALYSIS
Here are some practical examples of univariate analysis across different contexts:
1. Sales Data Analysis
Variable: Monthly Sales Revenue
o Descriptive Statistics:
Mean: Calculate the average sales revenue over several months.
Median: Identify the middle value of the monthly sales revenues once they are sorted.
Mode: Determine the most common sales revenue figure.
o Visualization:
Histogram: Show the frequency distribution of monthly sales.
Box Plot: Visualize the spread and identify any outliers in sales
revenue.
2. Health Survey
Variable: Age of Respondents
o Descriptive Statistics:
Mean Age: Average age of survey participants.
Range: Calculate the difference between the oldest and youngest
respondents.
Standard Deviation: Assess the variation in ages.
o Visualization:
Bar Chart: Display the number of respondents in different age groups
(e.g., 18-24, 25-34).
3. Customer Feedback
Variable: Customer Satisfaction Ratings (on a scale from 1 to 5)
o Descriptive Statistics:
Mode: Identify the most frequently given rating.
Percentage of each rating: Show the proportion of customers who rated
each score.
o Visualization:
Pie Chart: Represent the distribution of satisfaction ratings.
4. Educational Performance
Variable: Test Scores
o Descriptive Statistics:
Mean Score: Calculate the average score of all students.
Variance: Measure how test scores differ from the mean.
o Visualization:
Histogram: Show the distribution of test scores to identify trends, such
as a normal distribution or skewness.
5. Website Traffic
Variable: Daily Page Views
o Descriptive Statistics:
Median: Determine the median daily page views to understand typical
performance.
Standard Deviation: Assess the variability in daily page views.
o Visualization:
Time Series Plot: Display page views over time to observe trends or
patterns.
6. Employee Data
Variable: Years of Experience
o Descriptive Statistics:
Mean: Calculate the average years of experience among employees.
Range: Identify the experience span from the least to the most
experienced employee.
o Visualization:
Box Plot: Illustrate the distribution of years of experience and
highlight any outliers.
These examples illustrate how univariate analysis can provide valuable insights into
individual variables across various fields, helping to summarize and understand the data
effectively.
Categorical Data Examples
Example 1: Survey on Favorite Colors
Data: Responses from 100 individuals regarding their favorite colors (e.g., Red, Blue, Green,
Yellow).
Problems:
1. Frequency Count: How many respondents chose each color?
o Analysis: Create a frequency table to summarize the number of respondents for
each color.
2. Percentage Distribution: What percentage of respondents prefers each color?
o Analysis: Calculate the percentage for each color based on the total number of
respondents.
3. Visualization: Create a bar chart or pie chart to visually represent the distribution of
favorite colors.
Example 2: Employee Job Titles
Data: Job titles of 50 employees (e.g., Manager, Engineer, Sales, HR).
Problems:
1. Mode: What is the most common job title among the employees?
o Analysis: Identify the mode of the job titles.
2. Group Comparison: How many employees fall into each job category?
o Analysis: Use a frequency table to show counts for each job title.
3. Visualization: Create a bar chart to compare the number of employees in each job
title.
Numerical Data Examples
Example 1: Heights of Students
Data: Heights (in cm) of 30 students.
Problems:
1. Mean Height: What is the average height of the students?
o Analysis: Calculate the mean of the height data.
2. Standard Deviation: How much do the heights vary from the mean?
o Analysis: Compute the standard deviation to understand the dispersion.
3. Visualization: Create a histogram to show the distribution of students' heights.
Example 2: Monthly Expenses
Data: Monthly expenses (in $) for a group of 12 households.
Problems:
1. Median Expense: What is the median monthly expense?
o Analysis: Sort the data and find the middle value.
2. Range: What is the range of monthly expenses in this group?
o Analysis: Subtract the minimum expense from the maximum expense.
3. Visualization: Create a box plot to illustrate the distribution of expenses and identify
any outliers.
Here are some example problems related to the measures of central tendency—mean, median,
and mode—along with their solutions:
Problem 1: Mean
Problem 2: Median
Problem 3: Mode
Data: The colors of cars in a parking lot are: Red, Blue, Red, Green, Blue, Blue, Yellow,
Red.
Question: What is the mode of the car colors?
Solution:
1. Count the Frequency of Each Color:
o Red: 3
o Blue: 3
o Green: 1
o Yellow: 1
2. Identify the Most Frequent: Both Red and Blue occur 3 times. Answer: The data is
bimodal, with modes Red and Blue.
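A minimal Python sketch of this frequency count, using only the standard library (the variable names are illustrative and not part of the notes). `statistics.multimode` returns every value tied for the highest count, which is what a bimodal data set needs.

```python
from collections import Counter
from statistics import multimode

car_colors = ["Red", "Blue", "Red", "Green", "Blue", "Blue", "Yellow", "Red"]

# Frequency table: how many times each color appears
color_counts = Counter(car_colors)

# multimode returns all values that share the highest frequency,
# in the order they are first encountered in the data
modes = multimode(car_colors)

print(dict(color_counts))  # {'Red': 3, 'Blue': 3, 'Green': 1, 'Yellow': 1}
print(modes)               # ['Red', 'Blue']
```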
Problem 4: Mixed Measures
Data: The following weights (in kg) of 8 packages are recorded: 5, 10, 10, 15, 20, 20, 20, 25.
Questions:
1. What is the mean weight?
2. What is the median weight?
3. What is the mode weight?
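As a sketch, the three questions can be answered with Python's `statistics` module (variable names are illustrative). Because the data set has an even number of values, the median is the average of the two middle values, 15 and 20.

```python
from statistics import mean, median, mode

weights_kg = [5, 10, 10, 15, 20, 20, 20, 25]

mean_weight = mean(weights_kg)      # 125 / 8
median_weight = median(weights_kg)  # average of the 4th and 5th sorted values
mode_weight = mode(weights_kg)      # most frequent value

print(mean_weight, median_weight, mode_weight)  # 15.625 17.5 20
```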
Here are some example problems related to measures of dispersion—range, variance, and
standard deviation—along with their solutions:
Problem 1: Range
Data: The daily temperatures (in °C) over a week are: 15, 20, 18, 22, 19, 25, 17.
Question: What is the range of the temperatures?
Solution: Range = Maximum value – Minimum value = 25 – 15 = 10.
Answer: The range of the temperatures is 10°C.
Problem 2: Variance
Problem 3: Standard Deviation
Data: The weights (in kg) of 6 bags are: 10, 12, 15, 14, 11, 13.
Question: What is the standard deviation of the weights?
Solution:
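1. Mean = (10 + 12 + 15 + 14 + 11 + 13) / 6 = 75 / 6 = 12.5 kg.
2. Sum of squared deviations from the mean = 6.25 + 0.25 + 6.25 + 2.25 + 2.25 + 0.25 = 17.5.
3. Population variance = 17.5 / 6 ≈ 2.92, so the standard deviation = √2.92 ≈ 1.71 kg.
Answer: The standard deviation of the weights is approximately 1.71 kg (population formula,
dividing by n = 6). The sample formula, dividing by n – 1 = 5, gives approximately 1.87 kg.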
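The calculation can be checked with a short Python sketch (variable names are illustrative). It computes both conventions for the same six weights: `pstdev` divides the sum of squared deviations by n, while `stdev` divides by n - 1.

```python
from statistics import pstdev, stdev

bag_weights_kg = [10, 12, 15, 14, 11, 13]

population_sd = pstdev(bag_weights_kg)  # sqrt(17.5 / 6)
sample_sd = stdev(bag_weights_kg)       # sqrt(17.5 / 5)

print(round(population_sd, 2), round(sample_sd, 2))  # 1.71 1.87
```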
Problem 4: Mixed Measures of Dispersion
Data: The ages of a group of 7 individuals are: 18, 22, 24, 30, 26, 20, 28.
Questions:
1. What is the range?
2. What is the variance?
3. What is the standard deviation?
Solutions:
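Answer:
Range: 30 – 18 = 12 years
Variance: 16 (mean = 168 / 7 = 24; sum of squared deviations = 112; population variance = 112 / 7)
Standard Deviation: √16 = 4 years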
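A small Python sketch of the same three measures (variable names are illustrative). It assumes the population formulas, which divide by n = 7; the sample formulas would divide by n - 1 and give larger values.

```python
from statistics import pstdev, pvariance

ages = [18, 22, 24, 30, 26, 20, 28]

age_range = max(ages) - min(ages)  # 30 - 18
age_variance = pvariance(ages)     # mean of squared deviations from the mean (24)
age_sd = pstdev(ages)              # square root of the population variance

print(age_range, age_variance, age_sd)
```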
Here are some example problems related to histograms, box plots, and bar charts, along with
their solutions:
Problem 1: Histogram
Data: The number of books read by a group of 20 students in a month is as follows: 1, 2, 2, 3,
4, 5, 3, 1, 2, 4, 5, 6, 3, 2, 4, 5, 6, 7, 5, 4.
Question: Create a histogram to represent the frequency distribution of the number of books
read.
Solution
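One way to build the histogram is to count how often each value occurs and draw one bar per value. The sketch below does this in plain Python and prints a text histogram; the one-bin-per-value choice and the variable names are assumptions made for illustration.

```python
from collections import Counter

books_read = [1, 2, 2, 3, 4, 5, 3, 1, 2, 4, 5, 6, 3, 2, 4, 5, 6, 7, 5, 4]

# Frequency of each distinct value (one bin per number of books)
frequency = Counter(books_read)

# Text histogram: one '#' per student in the bin
for books in sorted(frequency):
    print(f"{books} | {'#' * frequency[books]} ({frequency[books]})")
```

The bar heights are 2, 4, 3, 4, 4, 2 and 1 for 1 to 7 books, which sum to the 20 students.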
Problem 2: Box Plot
Data: The weights (in kg) of 12 packages are: 15, 18, 22, 25, 30, 35, 40, 18, 20, 22, 28, 24.
Question: Create a box plot for the weights of the packages.
Solution:
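A box plot is drawn from the quartiles of the data, so the sketch below computes the numbers the plot needs. Quartile conventions differ between textbooks and tools; this example assumes linear interpolation (`method="inclusive"` in Python's `statistics.quantiles`) and the common 1.5 × IQR rule for outliers, neither of which is specified in the notes.

```python
from statistics import quantiles

package_weights_kg = [15, 18, 22, 25, 30, 35, 40, 18, 20, 22, 28, 24]

# Quartiles by linear interpolation between the sorted data points
q1, q2, q3 = quantiles(package_weights_kg, n=4, method="inclusive")
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are drawn as individual outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [w for w in package_weights_kg if w < lower_fence or w > upper_fence]

# Whiskers extend to the most extreme points that are not outliers
whisker_low = min(w for w in package_weights_kg if w >= lower_fence)
whisker_high = max(w for w in package_weights_kg if w <= upper_fence)

print(q1, q2, q3, iqr)                       # 19.5 23.0 28.5 9.0
print(whisker_low, whisker_high, outliers)   # 15 40 []
```

Under these assumptions the box spans 19.5 to 28.5 with the median line at 23, the whiskers reach 15 and 40, and no point is flagged as an outlier.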
Problem 3: Bar Chart
Data: The favorite fruits of 15 people are: Apple, Banana, Orange, Apple, Banana, Grape,
Orange, Apple, Grape, Banana, Apple, Orange, Grape, Apple, Banana.
Question: Create a bar chart to represent the frequencies of each fruit.
Solution:
1. Count the Frequencies:
o Apple: 5
o Banana: 4
o Orange: 3
o Grape: 3
2. Create the Bar Chart:
o X-axis: Fruits (Apple, Banana, Orange, Grape)
o Y-axis: Frequency
o Draw bars for each fruit based on the frequency count.
Answer: The bar chart would visually show the number of people who selected each fruit,
with each bar representing the frequency of that fruit.
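The counting step and a rough text version of the bar chart can be sketched in Python as follows (variable names are illustrative; a plotting library could replace the `print` loop).

```python
from collections import Counter

fruits = [
    "Apple", "Banana", "Orange", "Apple", "Banana", "Grape", "Orange", "Apple",
    "Grape", "Banana", "Apple", "Orange", "Grape", "Apple", "Banana",
]

fruit_counts = Counter(fruits)

# One row per fruit, longest bar first
for fruit, count in fruit_counts.most_common():
    print(f"{fruit:<7} {'#' * count} {count}")
```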
In univariate analysis, understanding distributions and variables is essential for interpreting
data accurately. Here’s an overview of key concepts related to distributions and variables in
this context.
1. Types of Variables
Categorical Variables
Definition: Represent categories or groups. They can be nominal (no inherent order, e.g.,
colors, types of animals) or ordinal (with a meaningful order, e.g., satisfaction ratings).
Examples:
o Nominal: Gender, eye color, car brands.
o Ordinal: Education level (e.g., high school, bachelor's, master's).
Numerical Variables
Definition: Represent measurable quantities and can be further classified into discrete and
continuous variables.
Types:
o Discrete: Can take only specific, countable values (e.g., number of students in a class).
o Continuous: Can take any value within a range (e.g., height, weight).
2. Understanding Distributions
A distribution describes how the values of a variable are spread or arranged. Key aspects
include:
1. Shape of Distribution
Normal Distribution: Bell-shaped curve where most values cluster around the mean.
Characterized by symmetry.
Skewed Distribution: Asymmetrical distribution:
o Right-Skewed (positively skewed): Tail extends to the right (e.g., income
distribution).
o Left-Skewed (negatively skewed): Tail extends to the left.
Bimodal Distribution: Two distinct peaks, indicating two prevalent values or groups.
2. Descriptive Statistics
Central Tendency: Measures like mean, median, and mode describe the center of the
distribution.
Dispersion: Measures like range, variance, and standard deviation indicate how spread out
the values are.
3. Analyzing Distributions
Steps in Univariate Analysis
1. Data Collection: Gather data relevant to the variable of interest.
2. Data Cleaning: Remove or correct any errors or inconsistencies in the data.
3. Descriptive Analysis:
o Calculate measures of central tendency and dispersion.
o Create visual representations (histograms, box plots) to explore the distribution
shape.
4. Interpretation:
o Understand the implications of the distribution shape and measures.
o Identify any outliers or anomalies that may affect analysis.
4. Example Scenarios
Example 1: Analyzing Test Scores
Variable: Test scores of 30 students.
Data: 45, 67, 78, 90, 56, 88, 76, 95, 83, 71, 64, 89, 92, 70, 60, 75, 80, 85, 62, 57, 72, 68, 94,
81, 66, 73, 79, 86, 77, 91.
Analysis:
o Calculate mean, median, and mode.
o Create a histogram to visualize the distribution.
o Determine if the distribution is normal, skewed, or bimodal.
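The analysis steps for this example can be sketched in Python as below. The bin width of 10 for the text histogram is an assumption chosen for illustration, as are the variable names.

```python
from statistics import mean, median, multimode

scores = [45, 67, 78, 90, 56, 88, 76, 95, 83, 71, 64, 89, 92, 70, 60,
          75, 80, 85, 62, 57, 72, 68, 94, 81, 66, 73, 79, 86, 77, 91]

mean_score = mean(scores)
median_score = median(scores)
modes = multimode(scores)  # returns every score when no score repeats

# Group scores into intervals of width 10 (40-49, 50-59, ...)
bins = {}
for s in scores:
    low = (s // 10) * 10
    bins[low] = bins.get(low, 0) + 1

for low in sorted(bins):
    print(f"{low}-{low + 9}: {'#' * bins[low]}")

print(round(mean_score, 2), median_score, len(modes))
```

For this data the mean is about 75.67 and the median is 76.5, and every score occurs exactly once, so there is no single mode. The mean lying slightly below the median, together with a single peak in the 70s, suggests a roughly unimodal distribution with a mild left skew.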
Example 2: Survey on Favorite Ice Cream Flavors
Variable: Favorite ice cream flavors (categorical).
Data: Chocolate, Vanilla, Strawberry, Chocolate, Mint, Vanilla, Strawberry, Chocolate,
Vanilla, Mint.
Analysis:
o Count the frequency of each flavor.
o Create a bar chart to represent the proportions of each flavor.
o Identify the mode (most preferred flavor). In this data Chocolate and Vanilla tie with 3
responses each, so there are two modes.
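A short Python sketch of this categorical analysis (variable names are illustrative). It computes the counts, the proportions a bar chart would display, and all values tied for the highest count.

```python
from collections import Counter
from statistics import multimode

flavors = ["Chocolate", "Vanilla", "Strawberry", "Chocolate", "Mint",
           "Vanilla", "Strawberry", "Chocolate", "Vanilla", "Mint"]

counts = Counter(flavors)

# Proportion of responses for each flavor
proportions = {flavor: n / len(flavors) for flavor, n in counts.items()}

# All flavors sharing the highest count
top_flavors = multimode(flavors)

print(proportions)
print(top_flavors)
```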
Types of Skewness
The following figure describes the classification of skewness:
[Figure: Types of Skewness]
1. Symmetric Skewness: A perfectly symmetric distribution is one in which the
frequencies are distributed identically on both sides of the center point of the
frequency curve. In this case, Mean = Median = Mode. There is no skewness in a
perfectly symmetrical distribution.
[Figure: Symmetric Skewness]
2. Asymmetric Skewness: An asymmetrical or skewed distribution is one in
which the frequencies are spread differently on the two sides of the center
point, so the frequency curve is stretched further towards one side. The Mean,
Median and Mode fall at different points.
Positive Skewness: In this, the frequencies are concentrated towards the
lower values of the variable, and the right tail is longer than the left tail.
The skewness of the distribution is on the right; hence, the mean value is greater
than the median and moves towards the right, and the mode occurs at the highest
frequency of the distribution.
Negative Skewness: In this, the frequencies are concentrated towards the
higher values of the variable, and the left tail is longer than the right tail.
The skewness of the distribution is on the left; hence, the mean value is less
than the median and moves towards the left, and the mode occurs at the highest
frequency of the distribution.
[Figure: Positive Skewness and Negative Skewness]
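The direction of skewness can also be measured numerically. The sketch below assumes the moment coefficient of skewness (third central moment divided by the 1.5 power of the second central moment), a common choice that these notes do not name explicitly. The three small data sets are hypothetical and chosen only to show the sign of the result.

```python
from statistics import mean, median

def moment_skewness(values):
    """Population moment coefficient of skewness: m3 / m2**1.5."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((x - mu) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mu) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical data sets
right_skewed = [1, 2, 2, 3, 3, 3, 4, 10]  # long right tail
left_skewed = [1, 7, 8, 8, 8, 9, 9, 10]   # long left tail
symmetric = [1, 2, 3, 4, 5]

for name, data in [("right", right_skewed), ("left", left_skewed), ("symmetric", symmetric)]:
    print(name, round(moment_skewness(data), 3), mean(data), median(data))
```

A positive coefficient goes with mean > median (3.5 versus 3 in the first set), a negative coefficient with mean < median (7.5 versus 8 in the second), and the symmetric set gives zero.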
Example of a bimodal distribution: The distribution of exam scores in a class where some
students excel while others perform poorly, creating two distinct peaks.
What is Kurtosis?
Kurtosis is another characteristic of a frequency distribution that describes
its shape. It measures the extent to which a frequency distribution is peaked in
comparison with a normal curve, that is, the degree of peakedness of the
distribution.
Types of Kurtosis
The following figure describes the classification of kurtosis:
[Figure: Types of Kurtosis]
1. Leptokurtic: A leptokurtic curve has a higher peak than the normal curve.
In this curve, items are heavily concentrated near the central value.
2. Mesokurtic: A mesokurtic curve has the same peak as the normal curve.
In this curve, items are distributed around the central value as in a normal
distribution.
3. Platykurtic: A platykurtic curve has a lower peak than the normal curve.
In this curve, there is less concentration of items around the central value.
[Figure: Types of Kurtosis]
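As a rough numerical counterpart to the three curve types, the sketch below assumes the moment-based measure of excess kurtosis (fourth central moment divided by the squared second central moment, minus 3), which is 0 for a normal curve. This formula is a common convention rather than something stated in these notes, and the two data sets are hypothetical.

```python
def excess_kurtosis(values):
    """Population excess kurtosis: m4 / m2**2 - 3 (0 for a normal curve)."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((x - mu) ** 2 for x in values) / n  # second central moment
    m4 = sum((x - mu) ** 4 for x in values) / n  # fourth central moment
    return m4 / m2 ** 2 - 3

# Hypothetical data sets
peaked = [0, 0, 0, 0, 0, 0, 0, 0, -10, 10]  # most items at the center
flat = list(range(1, 11))                   # items spread evenly

print(excess_kurtosis(peaked))  # positive: leptokurtic
print(excess_kurtosis(flat))    # negative: platykurtic
```

Under this convention a positive value indicates a leptokurtic shape, a value near zero a mesokurtic shape, and a negative value a platykurtic shape.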
Difference Between Skewness and Kurtosis
Below is the difference between Skewness and Kurtosis.
Sr. No. | Skewness | Kurtosis
1. | It indicates the shape and size of variation on either side of the central value. | It indicates the frequencies of distribution at the central value.
2. | The measures of skewness tell us about the magnitude and direction of the asymmetry of a distribution. | It indicates the concentration of items at the central part of a distribution.
3. | It indicates how far the distribution differs from the normal distribution. | It studies the divergence of the given distribution from the normal distribution.
4. | The measure of skewness studies the extent to which deviations cluster above or below the average. | It indicates the concentration of items.
5. | In an asymmetrical distribution, the deviation below or above an average is not equal. | No such distribution takes place.
NUMERICAL SUMMARIES OF LEVEL AND SPREAD
1. MEASURES OF CENTRAL TENDENCY-LEVEL
2. MEASURES OF DISPERSION-SPREAD
1. MEASURES OF CENTRAL TENDENCY (SUMMARIES OF LEVEL)
MEAN
MEDIAN
MODE
PERCENTILE
QUARTILES (5 NUMBER SUMMARY)
PROPORTION
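The last three summaries in this list can be sketched in Python as follows. The example reuses the ages from the mixed dispersion problem earlier in these notes; the interpolation method (`method="inclusive"`), the 90th percentile, and the "25 or older" condition are assumptions chosen only for illustration.

```python
from statistics import quantiles

ages = [18, 22, 24, 30, 26, 20, 28]

# Quartiles: the three cut points that split the sorted data into four parts
q1, q2, q3 = quantiles(ages, n=4, method="inclusive")
five_number_summary = (min(ages), q1, q2, q3, max(ages))

# Percentiles: 99 cut points; index 89 holds the 90th percentile
p90 = quantiles(ages, n=100, method="inclusive")[89]

# Proportion: share of observations that satisfy a condition (here, age >= 25)
proportion_25_plus = sum(1 for a in ages if a >= 25) / len(ages)

print(five_number_summary)  # (18, 21.0, 24.0, 27.0, 30)
print(p90, round(proportion_25_plus, 3))
```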
2. MEASURES OF DISPERSION (SUMMARIES OF SPREAD)
RANGE
VARIANCE
STANDARD DEVIATION
INTERQUARTILE RANGE (IQR)
COEFFICIENT OF VARIATION
CORRELATION
Measures of Dispersion are used to represent the scattering of data. They are
numbers that summarize how widely the values of a data set are spread.
Dispersion in Statistics
Dispersion in statistics is a way to describe how spread out or scattered the data is
around an average value. It helps to understand if the data points are close
together or far apart.
Dispersion shows the variability or consistency in a set of data. There are
different measures of dispersion like range, variance, and standard deviation.
Measure of Dispersion in Statistics
Measures of Dispersion quantify the scattering of the data: they tell us how the
values are distributed in the data set. Each measure is a single number that
captures the variation between the different values of the data.
Types of Measures of Dispersion
Measures of dispersion can be classified into the following two types :
Absolute Measure of Dispersion
Relative Measure of Dispersion
Each type contains several measures. Absolute measures are expressed in the same
unit as the data, while relative measures are unit-free ratios.
Absolute Measure of Dispersion
The measures of dispersion that are measured and expressed in the units of data
themselves are called Absolute Measure of Dispersion. For example – Meters,
Dollars, Kg, etc.
Some absolute measures of dispersion are:
Range: It is defined as the difference between the largest and the smallest value
in the distribution.
Mean Deviation: It is the arithmetic mean of the absolute differences between the
values and their mean.
Standard Deviation: It is the square root of the arithmetic average of the square
of the deviations measured from the mean.
Variance: It is defined as the average of the square deviation from the mean of
the given data set.
Quartile Deviation: It is defined as half of the difference between the third
quartile and the first quartile in a given data set.
Interquartile Range: The difference between the upper quartile (Q3) and the lower
quartile (Q1) is called the Interquartile Range. Its formula is Q3 – Q1.
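The absolute measures defined above can be computed together in a short Python sketch. It reuses the bag weights from the standard deviation problem earlier in these notes and assumes the population formulas (dividing by n) and linear interpolation for the quartiles; both are assumptions, since the notes do not fix a convention.

```python
from statistics import mean, pstdev, pvariance, quantiles

bag_weights_kg = [10, 12, 15, 14, 11, 13]
mu = mean(bag_weights_kg)  # 12.5

data_range = max(bag_weights_kg) - min(bag_weights_kg)
mean_deviation = mean(abs(x - mu) for x in bag_weights_kg)  # mean absolute deviation
variance = pvariance(bag_weights_kg)                        # average squared deviation
std_dev = pstdev(bag_weights_kg)                            # square root of the variance

q1, _, q3 = quantiles(bag_weights_kg, n=4, method="inclusive")
iqr = q3 - q1                # interquartile range
quartile_deviation = iqr / 2

print(data_range, mean_deviation, round(variance, 2), round(std_dev, 2))
print(iqr, quartile_deviation)
```

Every result is in kilograms except the variance, which is in squared kilograms.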
Relative Measure of Dispersion
Relative measures of dispersion are unit-free ratios. We use them to compare the
scattering of two data sets that are measured in different units.
Here are some of the relative measures of dispersion:
Coefficient of Range: It is defined as the ratio of the difference between the
highest and lowest value in a data set to the sum of the highest and lowest value.
Coefficient of Variation: It is defined as the ratio of the standard deviation to the
mean of the data set. We use percentages to express the coefficient of variation.
Coefficient of Mean Deviation: It is defined as the ratio of the mean deviation to
the value of the central point of the data set (the mean or the median from which the
deviations are measured).
Coefficient of Quartile Deviation: It is defined as the ratio of the difference
between the third quartile and the first quartile to the sum of the third and first
quartiles.
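The four relative measures can be sketched in Python as below, again using the bag weights from the standard deviation problem earlier in these notes. The sketch assumes the population standard deviation, the mean as the central point, and linearly interpolated quartiles; these choices are illustrative and not prescribed by the notes.

```python
from statistics import mean, pstdev, quantiles

bag_weights_kg = [10, 12, 15, 14, 11, 13]
mu = mean(bag_weights_kg)
highest, lowest = max(bag_weights_kg), min(bag_weights_kg)
q1, _, q3 = quantiles(bag_weights_kg, n=4, method="inclusive")
mean_deviation = mean(abs(x - mu) for x in bag_weights_kg)

coefficient_of_range = (highest - lowest) / (highest + lowest)
coefficient_of_variation = pstdev(bag_weights_kg) / mu * 100  # expressed in percent
coefficient_of_mean_deviation = mean_deviation / mu           # central point = mean
coefficient_of_quartile_deviation = (q3 - q1) / (q3 + q1)

print(coefficient_of_range, round(coefficient_of_variation, 2))
print(coefficient_of_mean_deviation, coefficient_of_quartile_deviation)
```

None of these ratios carries a unit, which is what allows data sets measured in different units to be compared.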