Unit 3 Eda Notes

Uploaded by

savi ezhilarasan
Copyright
© © All Rights Reserved

UNIT 3 EDA - notes

Exploratory Data Analysis (Anna University)

Downloaded by Mohideen Abdul (mohideengamer@gmail.com)

Univariate analysis is a statistical technique that examines a single variable to understand its
distribution, central tendency, and variability. It's the simplest form of data analysis and
serves as a foundation for more complex analyses. Here’s an introduction to key concepts:

Key Concepts

1. Definition: Univariate analysis focuses on one variable at a time, providing insights
into its characteristics without considering relationships with other variables.
2. Types of Data:
o Categorical Data: Represents categories or groups (e.g., gender, colors).
o Numerical Data: Represents quantities that can be measured (e.g., age,
income).
3. Descriptive Statistics:
o Measures of Central Tendency:
▪ Mean: The average value.
▪ Median: The middle value when data is sorted.
▪ Mode: The most frequently occurring value.
o Measures of Dispersion:
▪ Range: Difference between the maximum and minimum values.
▪ Variance: The average of the squared deviations from the mean.
▪ Standard Deviation: The square root of variance, indicating
dispersion in the same units as the data.
4. Data Visualization:
o Histograms: Graphical representation of frequency distribution.
o Box Plots: Visual summary showing median, quartiles, and outliers.
o Bar Charts: Used for categorical data to show frequencies or proportions.
5. Distribution Analysis:
o Understanding the shape of the data distribution (normal, skewed, bimodal,
etc.) can inform further analysis and model selection.
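
The descriptive statistics listed above can be computed in a few lines. The following is a minimal sketch using Python's standard `statistics` module; the `ages` list is hypothetical data invented for illustration, and the population formulas (dividing by n) are assumed for variance and standard deviation.

```python
import statistics

# Hypothetical sample: ages of nine survey respondents (not from the notes).
ages = [22, 25, 25, 28, 30, 32, 35, 40, 42]

# Measures of central tendency
mean_age = statistics.mean(ages)        # sum of values / number of values
median_age = statistics.median(ages)    # middle value of the sorted data
mode_age = statistics.mode(ages)        # most frequently occurring value

# Measures of dispersion
value_range = max(ages) - min(ages)     # maximum minus minimum
variance = statistics.pvariance(ages)   # average squared deviation from the mean
std_dev = statistics.pstdev(ages)       # square root of the variance

print(mean_age, median_age, mode_age, value_range)
print(round(variance, 2), round(std_dev, 2))
```

If the data are a sample drawn from a larger population, `statistics.variance` and `statistics.stdev` (which divide by n - 1) would be the usual substitutes.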

Applications

• Market Research: Analyzing customer preferences or demographic data.
• Quality Control: Assessing product measurements against specifications.
• Social Sciences: Exploring characteristics of populations, such as income levels or
educational attainment.

Importance

Univariate analysis provides essential insights that can guide decision-making, identify
patterns, and inform further research or multivariate analyses. It’s crucial for initial data
exploration and understanding the data’s characteristics before delving deeper into
relationships between multiple variables.

Conclusion

Mastering univariate analysis is a foundational skill in statistics, equipping analysts with the
tools to summarize and interpret data effectively.

EXAMPLES FOR UNIVARIATE ANALYSIS

Here are some practical examples of univariate analysis across different contexts:

1. Sales Data Analysis

• Variable: Monthly Sales Revenue
o Descriptive Statistics:
▪ Mean: Calculate the average sales revenue over several months.
▪ Median: Identify the middle value of the monthly revenues once they are sorted.
▪ Mode: Determine the most common sales revenue figure.
o Visualization:
▪ Histogram: Show the frequency distribution of monthly sales.
▪ Box Plot: Visualize the spread and identify any outliers in sales
revenue.

2. Health Survey

• Variable: Age of Respondents
o Descriptive Statistics:
▪ Mean Age: Average age of survey participants.
▪ Range: Calculate the difference between the oldest and youngest
respondents.
▪ Standard Deviation: Assess the variation in ages.
o Visualization:
▪ Bar Chart: Display the number of respondents in different age groups
(e.g., 18-24, 25-34).

3. Customer Feedback

• Variable: Customer Satisfaction Ratings (on a scale from 1 to 5)
o Descriptive Statistics:
▪ Mode: Identify the most frequently given rating.
▪ Percentage of each rating: Show the proportion of customers who gave
each score.
o Visualization:
▪ Pie Chart: Represent the distribution of satisfaction ratings.

4. Educational Performance

• Variable: Test Scores
o Descriptive Statistics:
▪ Mean Score: Calculate the average score of all students.
▪ Variance: Measure how widely test scores spread around the mean.
o Visualization:
▪ Histogram: Show the distribution of test scores to identify its shape, such
as a normal distribution or skewness.

5. Website Traffic

• Variable: Daily Page Views
o Descriptive Statistics:
▪ Median: Determine the median daily page views to understand typical
performance.
▪ Standard Deviation: Assess the variability in daily page views.
o Visualization:
▪ Time Series Plot: Display page views over time to observe trends or
patterns.

6. Employee Data

• Variable: Years of Experience
o Descriptive Statistics:
▪ Mean: Calculate the average years of experience among employees.
▪ Range: Identify the experience span from the least to the most
experienced employee.
o Visualization:
▪ Box Plot: Illustrate the distribution of years of experience and
highlight any outliers.

These examples illustrate how univariate analysis can provide valuable insights into
individual variables across various fields, helping to summarize and understand the data
effectively.

Categorical Data Examples

Example 1: Survey on Favorite Colors

• Data: Responses from 100 individuals regarding their favorite colors (e.g., Red, Blue, Green,
Yellow).

Problems:

1. Frequency Count: How many respondents chose each color?
o Analysis: Create a frequency table to summarize the number of respondents for
each color.
2. Percentage Distribution: What percentage of respondents prefers each color?
o Analysis: Calculate the percentage for each color based on the total number of
respondents.
3. Visualization: Create a bar chart or pie chart to visually represent the distribution of
favorite colors.
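
A frequency table and percentage distribution of this kind can be sketched with `collections.Counter`. The notes do not list the 100 actual responses, so the `responses` list below is hypothetical and only illustrates the mechanics.

```python
from collections import Counter

# Hypothetical responses; the notes do not list the 100 actual answers.
responses = ["Red"] * 35 + ["Blue"] * 30 + ["Green"] * 20 + ["Yellow"] * 15

counts = Counter(responses)              # frequency table: color -> count
total = sum(counts.values())
percentages = {color: 100 * n / total for color, n in counts.items()}

# Print the table, most popular color first.
for color, n in counts.most_common():
    print(f"{color:<7}{n:>4}{percentages[color]:>7.1f}%")
```

The same `counts` dictionary is what a bar chart or pie chart would be drawn from.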

Example 2: Employee Job Titles

• Data: Job titles of 50 employees (e.g., Manager, Engineer, Sales, HR).

Problems:

1. Mode: What is the most common job title among the employees?
o Analysis: Identify the mode of the job titles.
2. Group Comparison: How many employees fall into each job category?
o Analysis: Use a frequency table to show counts for each job title.
3. Visualization: Create a bar chart to compare the number of employees in each job
title.

Numerical Data Examples

Example 1: Heights of Students

• Data: Heights (in cm) of 30 students.

Problems:

1. Mean Height: What is the average height of the students?
o Analysis: Calculate the mean of the height data.
2. Standard Deviation: How much do the heights vary from the mean?
o Analysis: Compute the standard deviation to understand the dispersion.
3. Visualization: Create a histogram to show the distribution of students' heights.

Example 2: Monthly Expenses

• Data: Monthly expenses (in $) for a group of 12 households.

Problems:

1. Median Expense: What is the median monthly expense?
o Analysis: Sort the data and find the middle value (with 12 values, the average of the 6th and 7th).
2. Range: What is the range of monthly expenses in this group?
o Analysis: Subtract the minimum expense from the maximum expense.
3. Visualization: Create a box plot to illustrate the distribution of expenses and identify
any outliers.

Here are some example problems related to the measures of central tendency—mean, median,
and mode—along with their solutions:

Problem 1: Mean

Problem 2: Median

Problem 3: Mode

Data: The colors of cars in a parking lot are: Red, Blue, Red, Green, Blue, Blue, Yellow,
Red.

Question: What is the mode of the car colors?

Solution:

1. Count the Frequency of Each Color:
o Red: 3
o Blue: 3
o Green: 1
o Yellow: 1
2. Identify the Most Frequent: Both Red and Blue occur 3 times.

Answer: The data is bimodal, with modes Red and Blue.

Problem 4: Mixed Measures

Data: The following weights (in kg) of 8 packages are recorded: 5, 10, 10, 15, 20, 20, 20, 25.

Questions:

1. What is the mean weight?
2. What is the median weight?
3. What is the mode weight?
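
No worked answers accompany these three questions in the notes. A minimal sketch using Python's standard `statistics` module, applied to the eight weights given in the problem, would be:

```python
import statistics

weights = [5, 10, 10, 15, 20, 20, 20, 25]  # kg, from the problem statement

mean_weight = statistics.mean(weights)      # 125 / 8
median_weight = statistics.median(weights)  # average of the 4th and 5th sorted values
mode_weight = statistics.mode(weights)      # most frequent value

print(mean_weight, median_weight, mode_weight)
```

This gives a mean of 15.625 kg, a median of 17.5 kg (the average of 15 and 20), and a mode of 20 kg (it occurs three times).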

Here are some example problems related to measures of dispersion—range, variance, and
standard deviation—along with their solutions:

Problem 1: Range

Data: The daily temperatures (in °C) over a week are: 15, 20, 18, 22, 19, 25, 17.

Question: What is the range of the temperatures?

Answer: The range of the temperatures is 10°C.

Problem 2: Variance

Problem 3: Standard Deviation

Data: The weights (in kg) of 6 bags are: 10, 12, 15, 14, 11, 13.

Question: What is the standard deviation of the weights?

Solution:

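Answer: The standard deviation of the weights is approximately 1.71 kg using the population formula (dividing by n = 6), or approximately 1.87 kg using the sample formula (dividing by n - 1 = 5).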
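
The worked solution did not survive in these notes, so the calculation can be retraced step by step. The sketch below computes the standard deviation of the six bag weights both ways: dividing the sum of squared deviations by n (population) and by n - 1 (sample). Which convention the original solution intended is not stated, so both are shown.

```python
import math

weights = [10, 12, 15, 14, 11, 13]  # kg, from the problem statement

n = len(weights)
mean = sum(weights) / n                              # 75 / 6 = 12.5
squared_devs = [(w - mean) ** 2 for w in weights]    # squared deviations, summing to 17.5

population_sd = math.sqrt(sum(squared_devs) / n)        # divide by n
sample_sd = math.sqrt(sum(squared_devs) / (n - 1))      # divide by n - 1

print(round(population_sd, 2), round(sample_sd, 2))
```

The population formula gives about 1.71 kg and the sample formula about 1.87 kg.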

Problem 4: Mixed Measures of Dispersion

Data: The ages of a group of 7 individuals are: 18, 22, 24, 30, 26, 20, 28.

Questions:

1. What is the range?


2. What is the variance?
3. What is the standard deviation?

Solutions:

Answer:

• Range: 12 years
• Variance: 16 (in squared years, using the population formula)
• Standard Deviation: 4 years
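
These three values can be checked with a short sketch. It assumes the population formulas (`statistics.pvariance` and `statistics.pstdev`), which are the ones that reproduce the figures above.

```python
import statistics

ages = [18, 22, 24, 30, 26, 20, 28]  # years, from the problem statement

age_range = max(ages) - min(ages)        # 30 - 18
variance = statistics.pvariance(ages)    # mean is 24; squared deviations sum to 112
std_dev = statistics.pstdev(ages)        # square root of the population variance

print(age_range, variance, std_dev)
```

With the sample formula (dividing by n - 1 = 6) the variance would instead be about 18.67.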

Here are some example problems related to histograms, box plots, and bar charts, along with
their solutions:

Problem 1: Histogram

Data: The number of books read by a group of 20 students in a month is as follows: 1, 2, 2, 3,
4, 5, 3, 1, 2, 4, 5, 6, 3, 2, 4, 5, 6, 7, 5, 4.

Question: Create a histogram to represent the frequency distribution of the number of books
read.

Solution
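
The histogram itself is not reproduced in these notes. As a rough sketch of how it could be built, the code below tallies how many students read each number of books and prints a text histogram; a plotting library would draw the same counts as bars.

```python
from collections import Counter

books = [1, 2, 2, 3, 4, 5, 3, 1, 2, 4, 5, 6, 3, 2, 4, 5, 6, 7, 5, 4]

frequency = Counter(books)  # number of students for each book count

# Text histogram: one row per value, one '#' per student.
for value in range(min(books), max(books) + 1):
    print(f"{value} books | {'#' * frequency[value]} ({frequency[value]})")
```

The frequencies are 2, 4, 3, 4, 4, 2 and 1 for 1 through 7 books respectively, so most students read between 2 and 5 books.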

Problem 2: Box Plot

Data: The weights (in kg) of 12 packages are: 15, 18, 22, 25, 30, 35, 40, 18, 20, 22, 28, 24.

Question: Create a box plot for the weights of the packages.

Solution:
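
The plot is not reproduced in these notes, but a box plot is drawn from the five-number summary (minimum, Q1, median, Q3, maximum). The sketch below computes it with the standard library. Quartile conventions differ between textbooks and software; this example assumes linear interpolation between sorted values (`method="inclusive"`), and the 1.5 × IQR outlier rule is the common Tukey convention rather than something stated in the notes.

```python
import statistics

weights = [15, 18, 22, 25, 30, 35, 40, 18, 20, 22, 28, 24]  # kg

# Quartiles by linear interpolation between sorted values ("inclusive" method).
q1, median, q3 = statistics.quantiles(weights, n=4, method="inclusive")
iqr = q3 - q1

# Tukey's rule: points more than 1.5 * IQR beyond the box are flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [w for w in weights if w < lower_fence or w > upper_fence]

print(min(weights), q1, median, q3, max(weights), outliers)
```

Under these assumptions the summary is minimum 15, Q1 19.5, median 23, Q3 28.5, maximum 40, with no outliers. The "median of each half" convention would give Q1 = 19 and Q3 = 29 instead.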

Problem 3: Bar Chart

Data: The favorite fruits of 15 people are: Apple, Banana, Orange, Apple, Banana, Grape,
Orange, Apple, Grape, Banana, Apple, Orange, Grape, Apple, Banana.

Question: Create a bar chart to represent the frequencies of each fruit.

Solution:

1. Count the Frequencies:
o Apple: 5
o Banana: 4
o Orange: 3
o Grape: 3
2. Create the Bar Chart:
o X-axis: Fruits (Apple, Banana, Orange, Grape)
o Y-axis: Frequency
o Draw bars for each fruit based on the frequency count.

Answer: The bar chart would visually show the number of people who selected each fruit,
with each bar representing the frequency of that fruit.
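
The counting step and a rough text rendering of the bars can be sketched as follows; a plotting library would draw the same counts graphically.

```python
from collections import Counter

fruits = ["Apple", "Banana", "Orange", "Apple", "Banana", "Grape", "Orange", "Apple",
          "Grape", "Banana", "Apple", "Orange", "Grape", "Apple", "Banana"]

counts = Counter(fruits)  # frequency of each fruit

# One bar per category, longest first; each '#' is one person.
for fruit, n in counts.most_common():
    print(f"{fruit:<7}| {'#' * n} {n}")
```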

In univariate analysis, understanding distributions and variables is essential for interpreting
data accurately. Here’s an overview of key concepts related to distributions and variables in
this context.

1. Types of Variables

Categorical Variables

• Definition: Represent categories or groups. They can be nominal (no inherent order, e.g.,
colors, types of animals) or ordinal (with a meaningful order, e.g., satisfaction ratings).
• Examples:
o Nominal: Gender, eye color, car brands.
o Ordinal: Education level (e.g., high school, bachelor's, master's).

Numerical Variables

• Definition: Represent measurable quantities and can be further classified into discrete and
continuous variables.
• Types:
o Discrete: Can take only specific, countable values (e.g., number of students in a class).
o Continuous: Can take any value within a range (e.g., height, weight).

2. Understanding Distributions

A distribution describes how the values of a variable are spread or arranged. Key aspects
include:

1. Shape of Distribution

• Normal Distribution: Bell-shaped curve where most values cluster around the mean.
Characterized by symmetry.
• Skewed Distribution: Asymmetrical distribution:
o Right-Skewed (positively skewed): Tail extends to the right (e.g., income
distribution).
o Left-Skewed (negatively skewed): Tail extends to the left.
• Bimodal Distribution: Two distinct peaks, indicating two prevalent values or groups.

2. Descriptive Statistics

• Central Tendency: Measures like mean, median, and mode describe the center of the
distribution.
• Dispersion: Measures like range, variance, and standard deviation indicate how spread out
the values are.

3. Analyzing Distributions

Steps in Univariate Analysis

1. Data Collection: Gather data relevant to the variable of interest.
2. Data Cleaning: Remove or correct any errors or inconsistencies in the data.
3. Descriptive Analysis:
o Calculate measures of central tendency and dispersion.
o Create visual representations (histograms, box plots) to explore the distribution
shape.
4. Interpretation:
o Understand the implications of the distribution shape and measures.
o Identify any outliers or anomalies that may affect analysis.

4. Example Scenarios

Example 1: Analyzing Test Scores

• Variable: Test scores of 30 students.
• Data: 45, 67, 78, 90, 56, 88, 76, 95, 83, 71, 64, 89, 92, 70, 60, 75, 80, 85, 62, 57, 72, 68, 94,
81, 66, 73, 79, 86, 77, 91.
• Analysis:
o Calculate mean, median, and mode.
o Create a histogram to visualize the distribution.
o Determine if the distribution is normal, skewed, or bimodal.
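
The three analysis steps can be sketched on the 30 scores above. The class width of 10 for the histogram is an assumption made for illustration; the notes do not specify the bins.

```python
import statistics

scores = [45, 67, 78, 90, 56, 88, 76, 95, 83, 71, 64, 89, 92, 70, 60,
          75, 80, 85, 62, 57, 72, 68, 94, 81, 66, 73, 79, 86, 77, 91]

mean_score = statistics.mean(scores)
median_score = statistics.median(scores)
modes = statistics.multimode(scores)   # every value that ties for most frequent

# Histogram counts with class width 10: 40-49, 50-59, ..., 90-99.
bins = {}
for s in scores:
    lower = (s // 10) * 10
    bins[lower] = bins.get(lower, 0) + 1

for lower in sorted(bins):
    print(f"{lower}-{lower + 9}: {'#' * bins[lower]}")
```

Here the mean is about 75.7 and the median is 76.5. Every score occurs exactly once, so `multimode` returns all 30 values and there is no meaningful mode. The bins hold 1, 2, 6, 9, 7 and 5 scores, a single peak in the 70s with a slightly longer tail towards low scores, which is consistent with the mean sitting a little below the median.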

Example 2: Survey on Favorite Ice Cream Flavors

• Variable: Favorite ice cream flavors (categorical).
• Data: Chocolate, Vanilla, Strawberry, Chocolate, Mint, Vanilla, Strawberry, Chocolate,
Vanilla, Mint.
• Analysis:
o Count the frequency of each flavor.
o Create a bar chart to represent the proportions of each flavor.
o Identify the mode (most preferred flavor). Here Chocolate and Vanilla tie with 3 responses each, so the data has two modes.

Types of Skewness
The following figure describes the classification of skewness:

Types of Skewness

1. Symmetric Skewness: A perfectly symmetric distribution is one in which the
frequency distribution is the same on both sides of the center point of the
frequency curve. In this case, Mean = Median = Mode. There is no skewness in a
perfectly symmetrical distribution.

Symmetric Skewness

2. Asymmetric Skewness: An asymmetrical or skewed distribution is one in
which the spread of the frequencies differs on the two sides of the center
point, so that the frequency curve is more stretched towards one side. The Mean,
Median and Mode fall at different points.
▪ Positive Skewness: The frequencies are concentrated towards the lower
values of the variable, and the right tail is longer than the left tail.
The mean is pulled towards the right and is greater than the median, and the
mode occurs at the highest frequency of the distribution.
▪ Negative Skewness: The frequencies are concentrated towards the higher
values of the variable, and the left tail is longer than the right tail.
The mean is pulled towards the left and is less than the median, and the mode
occurs at the highest frequency of the distribution.
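
The direction of skewness can also be measured numerically. The notes do not give a formula, so the sketch below assumes one common choice, the moment coefficient of skewness (third central moment divided by the 1.5 power of the second). The function name `skewness` and the sample lists are hypothetical.

```python
def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5, using population moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n   # second central moment (variance)
    m3 = sum((x - mean) ** 3 for x in values) / n   # third central moment
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]            # hypothetical data with a long right tail
left_skewed = [-x for x in right_skewed]      # mirror image: long left tail

print(skewness(symmetric), skewness(right_skewed), skewness(left_skewed))
```

The coefficient is zero for the symmetric list, positive when the right tail is longer, and negative when the left tail is longer.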

Positive Skewness and Negative Skewness

Example of a bimodal distribution: the distribution of exam scores in a class where some
students excel while others perform poorly, creating two distinct peaks.

What is Kurtosis?
Kurtosis is another characteristic of a frequency distribution that gives an idea about
its shape. The measure of kurtosis is the extent to which a frequency distribution is
peaked in comparison with a normal curve; in other words, it is the degree of
peakedness of a distribution.
Types of Kurtosis
The following figure describes the classification of kurtosis:

Types of Kurtosis

1. Leptokurtic: A leptokurtic curve has a higher peak than the normal
curve. In this curve, there is a heavy concentration of items near the
central value.
2. Mesokurtic: A mesokurtic curve has the same peak as the normal
curve. In this curve, items are distributed evenly around the central
value.
3. Platykurtic: A platykurtic curve has a lower peak than the normal curve.
In this curve, there is less concentration of items around the
central value.
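
The three types can be told apart numerically. The notes give no formula, so this sketch assumes the usual moment-based measure, the fourth central moment divided by the squared variance, minus 3 so that a normal curve scores zero. Positive values then correspond to leptokurtic, values near zero to mesokurtic, and negative values to platykurtic. The function name and both data sets are hypothetical.

```python
def excess_kurtosis(values):
    """Moment measure of kurtosis minus 3, so that a normal curve scores 0."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n   # second central moment
    m4 = sum((x - mean) ** 4 for x in values) / n   # fourth central moment
    return m4 / m2 ** 2 - 3

flat = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]       # evenly spread values, no peak
peaked = [5, 5, 5, 5, 5, 5, 5, 5, 0, 10]     # most values at the center, two far out

print(excess_kurtosis(flat), excess_kurtosis(peaked))
```

The flat list scores below zero (platykurtic) and the peaked list scores 2 (leptokurtic).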

Types of Kurtosis

Difference Between Skewness and Kurtosis

Below is the difference between Skewness and Kurtosis.

Sr. No. | Skewness | Kurtosis
1. | It indicates the shape and size of variation on either side of the central value. | It indicates the frequencies of distribution at the central value.
2. | Measures of skewness tell us about the magnitude and direction of the asymmetry of a distribution. | It indicates the concentration of items at the central part of a distribution.
3. | It indicates how far the distribution differs from the normal distribution. | It studies the divergence of the given distribution from the normal distribution.
4. | The measure of skewness studies the extent to which deviations cluster above or below the average. | It indicates the concentration of items.
5. | In an asymmetrical distribution, the deviation below or above an average is not equal. | No such distribution takes place.

NUMERICAL SUMMARIES OF LEVEL AND SPREAD

1. MEASURES OF CENTRAL TENDENCY-LEVEL

2. MEASURES OF DISPERSION-SPREAD

1. MEASURES OF CENTRAL TENDENCY (SUMMARIES OF LEVEL)

MEAN

MEDIAN

MODE

PERCENTILE

QUARTILES (5 NUMBER SUMMARY)

PROPORTION

2. MEASURES OF DISPERSION (SUMMARIES OF SPREAD)

RANGE

VARIANCE

STANDARD DEVIATION

INTERQUARTILE RANGE (IQR)

COEFFICIENT OF VARIATION

CORRELATION

Measures of dispersion are used to represent the scattering of data. Each is a
number that summarizes one aspect of how widely the data are spread.

Dispersion in Statistics
Dispersion in statistics is a way to describe how spread out or scattered the data is
around an average value. It helps to understand if the data points are close
together or far apart.
Dispersion shows the variability or consistency in a set of data. There are
different measures of dispersion like range, variance, and standard deviation.
Measure of Dispersion in Statistics
Measures of dispersion quantify the scattering of the data: they tell us how the
values are distributed in the data set. In statistics, a measure of dispersion is
any quantity used to describe how much the values of the data differ from one
another.
Types of Measures of Dispersion
Measures of dispersion can be classified into the following two types:
• Absolute Measure of Dispersion
• Relative Measure of Dispersion
Each type can be further divided into several measures. Absolute measures are
expressed in the same unit as the data, while relative measures are unit-free ratios.

Absolute Measure of Dispersion
Measures of dispersion that are expressed in the units of the data themselves
(for example, meters, dollars, kg) are called absolute measures of dispersion.
Some absolute measures of dispersion are:
Range: It is defined as the difference between the largest and the smallest value
in the distribution.

Mean Deviation: It is the arithmetic mean of the absolute differences between the
values and their mean.

Standard Deviation: It is the square root of the arithmetic average of the square
of the deviations measured from the mean.

Variance: It is defined as the average of the square deviation from the mean of
the given data set.

Quartile Deviation: It is defined as half of the difference between the third
quartile and the first quartile in a given data set.

Interquartile Range: The difference between the upper (Q3) and lower (Q1) quartiles
is called the Interquartile Range. Its formula is given as Q3 - Q1.
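
The absolute measures defined above can be computed together. This is a minimal sketch on the six bag weights from an earlier problem; it assumes population formulas for variance and standard deviation and linear-interpolation quartiles (`method="inclusive"`), since the notes do not fix either convention.

```python
import statistics

data = [10, 12, 15, 14, 11, 13]  # kg, the bag weights used in an earlier problem

mean = statistics.mean(data)
value_range = max(data) - min(data)                            # largest minus smallest
mean_deviation = sum(abs(x - mean) for x in data) / len(data)  # mean absolute deviation
variance = statistics.pvariance(data)                          # average squared deviation
std_dev = statistics.pstdev(data)                              # square root of the variance

q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1                     # interquartile range, Q3 - Q1
quartile_deviation = iqr / 2      # half the interquartile range

print(value_range, mean_deviation, round(variance, 2), round(std_dev, 2))
print(q1, q3, iqr, quartile_deviation)
```

For this data the range is 5 kg, the mean deviation 1.5 kg, the variance about 2.92, the standard deviation about 1.71 kg, the interquartile range 2.5 kg and the quartile deviation 1.25 kg.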

Relative Measure of Dispersion
We use relative measures of dispersion to compare the spread of two quantities that
have different units, which gives a better idea of the scattering of the data.
Here are some of the relative measures of dispersion:
Coefficient of Range: It is defined as the ratio of the difference between the
highest and lowest value in a data set to the sum of the highest and lowest value.
Coefficient of Variation: It is defined as the ratio of the standard deviation to the
mean of the data set. We use percentages to express the coefficient of variation.

Coefficient of Mean Deviation: It is defined as the ratio of the mean deviation to
the value of the central point of the data set.

Coefficient of Quartile Deviation: It is defined as the ratio of the difference
between the third quartile and the first quartile to the sum of the third and first
quartiles.
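
The four relative measures can be sketched as plain ratios. The example reuses the six bag weights from an earlier problem and assumes the mean as the central point, the population standard deviation, and linear-interpolation quartiles; none of these conventions is fixed by the notes.

```python
import statistics

data = [10, 12, 15, 14, 11, 13]  # kg, the bag weights from an earlier problem

mean = statistics.mean(data)
q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
mean_deviation = sum(abs(x - mean) for x in data) / len(data)

coefficient_of_range = (max(data) - min(data)) / (max(data) + min(data))
coefficient_of_variation = statistics.pstdev(data) / mean * 100   # expressed in percent
coefficient_of_mean_deviation = mean_deviation / mean
coefficient_of_quartile_deviation = (q3 - q1) / (q3 + q1)

print(coefficient_of_range, round(coefficient_of_variation, 2),
      coefficient_of_mean_deviation, coefficient_of_quartile_deviation)
```

Because each result is a ratio of two quantities in the same unit, the units cancel, which is what allows the spread of data sets measured in different units to be compared.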
