Unit 1

Exploratory Data Analysis (EDA) is a process used to examine datasets to uncover patterns, anomalies, and test hypotheses through statistical measures. The stages of EDA include data requirement, collection, preprocessing, cleaning, analysis, and modeling, culminating in effective communication of results through visualization. EDA is essential for making informed decisions and generating insights from complex datasets across various fields.


EXPLORATORY DATA ANALYSIS
INTRODUCTION
• Data encompasses a collection of discrete objects, numbers, words, events,
facts, measurements, observations, or even descriptions of things.
• Such data is collected and stored by every event or process occurring in several
disciplines, including biology, economics, engineering, marketing, and others.
• Processing such data elicits useful information, and processing such information generates useful knowledge.
• But an important question is: how can we generate meaningful and useful information from such data? An answer to this question is EDA.
• EDA is a process of examining the available dataset to discover patterns, spot
anomalies, test hypotheses, and check assumptions using statistical measures.
STAGES OF EDA
• Data Requirement
• Data Collection
• Data Preprocessing
• Data Cleaning
• EDA
• Modeling and Algorithm
• Data Product
• Communication
STAGES OF EDA
• Let's understand in brief what these stages are:
 Data requirements:
 There can be various sources of data for an organization.
 It is important to comprehend what type of data the organization requires to be collected, curated, and stored.
 For example, an application tracking the sleeping patterns of patients suffering from dementia requires several types of sensor data to be stored, such as sleep data, the patient's heart rate, electro-dermal activity, and user activity patterns.
 All of these data points are required to correctly diagnose the mental state of the person; hence, they are mandatory requirements for the application. In addition to this, it is necessary to categorize the data as numerical or categorical and to define the format of its storage and dissemination.
Data collection:
 Data collected from several sources must be stored in the correct format and transferred to the right
information technology personnel within a company.
 As mentioned previously, data can be collected from several objects on several events using different types
of sensors and storage tools.
STAGES OF EDA
Data preprocessing:
 Preprocessing involves pre-curating the dataset before the actual analysis.
 Common tasks include correctly exporting the dataset, placing it under the right tables, structuring it, and exporting it in the correct format.
Data cleaning:
 Preprocessed data is still not ready for detailed analysis. It must be correctly transformed for an
incompleteness check, duplicates check, error check, and missing value check.
 These tasks are performed in the data cleaning stage, which involves responsibilities such as matching the
correct record, finding inaccuracies in the dataset, understanding the overall data quality, removing
duplicate items, and filling in the missing values.
 However, how could we identify these anomalies in any dataset? Finding such data issues requires us to
perform some analytical techniques. We will be learning several such analytical techniques in Data
Transformation.
 To understand briefly, data cleaning is dependent on the types of data under study. Hence, it is most
essential for data scientists or EDA experts to comprehend different types of datasets. An example of data
cleaning would be using outlier detection methods for quantitative data cleaning.
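For instance, the outlier detection mentioned above can be sketched with the interquartile range (IQR) rule in pandas; the weight column and its values below are hypothetical, not part of the material:

import pandas as pd

# Hypothetical patient weights in kg; the last value is an obvious data-entry error.
df = pd.DataFrame({"weight": [51, 55, 58, 61, 64, 250]})

# IQR rule: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged as potential outliers.
q1, q3 = df["weight"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = df[(df["weight"] < lower) | (df["weight"] > upper)]
cleaned = df[(df["weight"] >= lower) & (df["weight"] <= upper)]
print(outliers)  # the 250 kg record is flagged for manual review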
STAGES OF EDA
EDA:
 Exploratory data analysis, as mentioned before, is the stage where we actually start to understand the message
contained in the data.
 It should be noted that several types of data transformation techniques might be required during the process of
exploration.
Modeling and algorithm:
 From a data science perspective, generalized models or mathematical formulas can represent or exhibit relationships
among different variables, such as correlation or causation.
 These models or equations involve one or more variables that depend on other variables to cause an event. For example, when buying, say, pens, the total price of pens (Total) = the price of one pen (UnitPrice) * the number of pens bought (Quantity).
 Hence, our model would be
 Total = UnitPrice * Quantity. Here, the total price is dependent on the unit price.
 Hence, the total price is referred to as the dependent variable and the unit price is referred to as an independent
variable.
 In general, a model always describes the relationship between independent and dependent variables. Inferential
statistics deals with quantifying relationships between particular variables. The Judd model for describing the
relationship between data, model, and error still holds true: Data = Model + Error.
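To make the idea Data = Model + Error concrete, here is a tiny sketch in Python; the observed value and the implied small fee are invented purely for illustration:

# Model: Total = UnitPrice * Quantity, so Total is the dependent variable.
def predict_total(unit_price, quantity):
    return unit_price * quantity

observed_total = 101.0                 # hypothetical recorded value
model_total = predict_total(10.0, 10)  # the model predicts 100.0
error = observed_total - model_total   # Data = Model + Error, so Error = 1.0
print(model_total, error)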
STAGES OF EDA
• Data Product:
• Any computer software that uses data as inputs, produces outputs, and
provides feedback based on the output to control the environment is referred to
as a data product.
• Example: A recommendation model, which uses user purchase history as input
and suggests related items that the user is likely to buy.
• Communication:
• This stage deals with disseminating the results to end stakeholders to use the
result for business intelligence.
• One of the most notable steps in this stage is data visualization.
• Visualization deals with information relay techniques such as tables, charts,
summary diagrams, and bar charts to show the analyzed result.
THE SIGNIFICANCE OF EDA
• Different fields of science, economics, engineering, and marketing accumulate and store data primarily in
electronic databases.
• Appropriate and well-established decisions should be made using the data collected.
• It is practically impossible to make sense of datasets containing more than a handful of data points without
the help of computer programs.
• Data mining is performed to extract insights and make decisions, involving various analysis processes.
• Exploratory data analysis is key, and usually the first exercise in data mining.
• It allows us to visualize data to understand it as well as to create hypotheses for further analysis.
• The exploratory analysis centers around creating a synopsis of data or insights for the next steps in a data
mining project.
• EDA actually reveals the ground truth about the data without making any underlying assumptions.
• This is why data scientists use this process to understand what types of models and hypotheses can be created.
• Key components of exploratory data analysis include summarizing data, statistical analysis, and visualization of
data.
Steps in EDA
• Problem Definition
• Data Preparation
• Data Analysis
• Development and Representation of the results
STEPS IN EDA
• Problem definition:
• Before trying to extract useful insights from the data, it is essential to define the business problem to be solved.
• The problem definition works as the driving force for a data analysis
plan execution.
• The main tasks involved in problem definition are
• Defining the main objective of the analysis,
• Defining the main deliverables,
• Outlining the main roles and responsibilities,
• Obtaining the current status of the data,
• Defining the timetable, and performing cost/benefit analysis.
Based on such a problem definition, an execution plan can be created.
STEPS IN EDA
• Data preparation:
• This step involves methods for preparing the dataset before actual
analysis.
• Define the sources of data,
• Define data schemas and tables,
• Understand the main characteristics of the data, clean the dataset,
• Delete non-relevant datasets,
• Transform the data,
• Divide the data into required chunks for analysis.
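A minimal sketch of a few of these preparation tasks with pandas is shown below; the file name, column names, and chunk size are assumptions rather than part of the material:

import pandas as pd

# Define the source of data (a hypothetical CSV file) and transform dates on load.
df = pd.read_csv("patients.csv", parse_dates=["dob"])

# Delete non-relevant columns and remove obviously bad rows.
df = df.drop(columns=["internal_notes"], errors="ignore")
df = df.drop_duplicates().dropna(subset=["patient_id"])

# Divide the data into required chunks for analysis (blocks of 1,000 rows here).
chunks = [df.iloc[i:i + 1000] for i in range(0, len(df), 1000)]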
STEPS IN EDA
• Data analysis:
• This is one of the most crucial steps; it deals with descriptive statistics and analysis of the data.
• The main tasks involve
• Summarizing the data,
• Finding the hidden correlations and relationships among the data,
• Developing predictive models,
• Evaluating the models,
• Calculating the accuracies.
• Some of the techniques used for data summarization are (a short sketch follows this list)
• Summary tables,
• Graphs,
• Descriptive statistics,
• Inferential statistics,
• Correlation statistics,
• Searching,
• Grouping,
• Mathematical models.
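For instance, a few of these summarization techniques (descriptive statistics, correlation statistics, and grouping) might be sketched with pandas as follows; the columns and values are hypothetical:

import pandas as pd

df = pd.DataFrame({
    "age":    [34, 45, 29, 61, 52],   # hypothetical values
    "weight": [60, 72, 55, 80, 77],
})

print(df.describe())                       # summary table of descriptive statistics
print(df.corr(method="pearson"))           # correlation statistics between variables
print(df.groupby(df["age"] > 40).mean())   # grouping as a simple summarization step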
STEPS IN EDA
• Development and representation of the results:
• This step involves presenting the dataset to the target audience in the form of
graphs, summary tables, maps, and diagrams.
• This is also an essential step as the result analyzed from the dataset should be
interpretable by the business stakeholders, which is one of the major goals of EDA.
• Most of the graphical analysis techniques include
• Scatter plots
• Character plots
• Histograms
• Box plots
• Residual plots
• Mean plots
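A few of these plots can be produced with matplotlib; a minimal sketch using invented sample values:

import matplotlib.pyplot as plt

weights = [60, 72, 55, 80, 77, 65, 58]        # hypothetical sample values
heights = [160, 175, 158, 182, 178, 170, 162]

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].scatter(heights, weights)   # scatter plot
axes[0].set_title("Scatter plot")
axes[1].hist(weights, bins=5)       # histogram
axes[1].set_title("Histogram")
axes[2].boxplot(weights)            # box plot
axes[2].set_title("Box plot")
plt.tight_layout()
plt.show()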
MAKING SENSE OF DATA
• It is crucial to identify the type of data under analysis.
• Different disciplines store different kinds of data for different purposes.
• For example,
• medical researchers store patients' data,
• universities store students' and teachers' data,
• real estate industries store house and building datasets.
• A dataset contains many observations about a particular object.
• For instance, a dataset about patients in a hospital can contain many observations.
• A patient can be described by
• a patient identifier (ID),
• name,
• address,
• weight,
• date of birth,
• email,
• gender.
• Each of these features that describes a patient is a variable.
MAKING SENSE OF DATA
• Each observation can have a specific value for each of these
variables. For example, a patient can have the following:
[Table (not shown): example values for a single patient record]
• These datasets are stored in hospitals and are presented for analysis.
Most of this data is stored in some sort of database management
system in tables/schema. An example of a table for storing patient
information is shown here:
[Table (not shown): five example patient records with columns PatientID, name, address, dob, email, gender, and weight]
MAKING SENSE OF DATA
• To summarize the preceding table, there are five observations (001, 002, 003, 004, 005). Each observation is described by the variables (PatientID, name, address, dob, email, gender, and weight).
• Most datasets broadly fall into two groups:
• Numerical data
 Discrete data
 Continuous data
• Categorical data
MAKING SENSE OF DATA
• Numerical data
• This data has a sense of measurement involved in it;
• for example, a person's age, height, weight, blood pressure, heart rate,
temperature, number of teeth, number of bones, and the number of
family members.
• This data is often referred to as quantitative data in statistics.
• A numerical dataset can be of either the discrete or the continuous type.
• Discrete data
• Data that is countable and whose values can be listed out.
• Data that can take on only specific, distinct values.
• For example, if we flip a coin, the number of heads in 200 coin flips can take any value from 0 to 200, which is a finite set of cases.
• A variable that represents a discrete dataset is referred to as a discrete
variable.
• The discrete variable takes a fixed number of distinct values.
• For example, the Country variable can have values such as Nepal, India,
Norway, and Japan. It is fixed. The Rank variable of a student in a
classroom can take values from 1, 2, 3, 4, 5, and so on.
• Continuous data
• A variable that can have an infinite number of numerical values within a
specific range is classified as continuous data.
• A variable describing continuous data is a continuous variable.
• Data that can take any value within a given range.
• For example, what is the temperature of your city today?
• Can the possible values be listed as a finite set? No; temperature can take any value within a range.
• Similarly, the weight variable in the previous section is a continuous variable.
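As a rough heuristic during EDA (not a rule from the slides), looking at a column's dtype and number of distinct values can hint at whether it is discrete or continuous; the columns below are made-up examples:

import pandas as pd

df = pd.DataFrame({
    "num_heads":   [3, 5, 4, 2, 5],                  # discrete: countable integer values
    "temperature": [21.3, 22.8, 19.95, 23.1, 20.4],  # continuous: any value within a range
})

for col in df.columns:
    print(col, df[col].dtype, df[col].nunique(), "distinct values")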
MAKING SENSE OF DATA
• Categorical data
• This type of data represents the characteristics of an object;
• for example, gender, marital status, type of address, or categories of movies. This data is often referred to as qualitative data in statistics.
• To understand clearly, here are some of the most common types of categorical data you can find in data:
• Gender (Male, Female, Other, or Unknown)
• Marital Status (Annulled, Divorced, Interlocutory, Legally Separated, Married, Polygamous, Never Married, Domestic Partner, Unmarried, Widowed, or Unknown)
• Movie genres (Action, Adventure, Comedy, Crime, Drama, Fantasy, Historical, Horror, Mystery, Philosophical, Political, Romance, Saga, Satire, Science Fiction, Social, Thriller, Urban, or Western)
• Blood type (A, B, AB, or O)
• Types of drugs (Stimulants, Depressants, Hallucinogens, Dissociatives, Opioids, Inhalants, or Cannabis)
• A variable describing categorical data is referred to as a categorical variable. These types of variables can have one of a limited number of values.
• It is easier for computer science students to understand categorical values as enumerated types or enumerations of variables; a short sketch follows.
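For instance, such a variable can be stored as a pandas categorical dtype with a fixed set of allowed values; the blood-type categories come from the list above, everything else is illustrative:

import pandas as pd

# Blood type stored as a categorical variable with a fixed set of allowed values.
blood_type = pd.Series(
    ["A", "O", "AB", "O", "B"],
    dtype=pd.CategoricalDtype(categories=["A", "B", "AB", "O"]),
)

print(blood_type.dtype)           # category
print(blood_type.cat.categories)  # the limited set of values the variable may take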
Types
There are different types of categorical variables:
• A binary categorical variable can take exactly two values and is also referred
to as a dichotomous variable.
• For example, when you create an experiment, the result is either success or
failure. Hence, results can be understood as a binary categorical variable.
• Polytomous variables are categorical variables that can take more than two
possible values.
• For example, marital status can have several values, such as annulled, divorced, interlocutory, legally separated, married, polygamous, never married, domestic partner, unmarried, widowed, and unknown. Since marital status can take more than two possible values, it is a
MEASUREMENT SCALES
• NOMINAL
• ORDINAL
• INTERVAL
• RATIO
NOMINAL
• Categorizes data without any order or ranking.
• It labels items without any quantitative value; such labels simply name categories.
• Example:
• What is your gender?
• Male
• Female
• Third Gender
• I prefer not to answer
• Other
• Other examples
• Languages spoken in a particular country
• Biological Species
• Parts of speech in Grammar
• Taxonomic ranks in biology
• Types of fruit
Nominal
• Frequency
• The number of times (or the rate at which) a label occurs within the dataset.
• Proportion
• It can be calculated by dividing the frequency by the total number of events.
• Percentage
• Proportion expressed as a percentage.
• Visualize
• Tools: Bar Chart, Pie Chart
• Purpose: Helps in understanding the distribution of categories in nominal
data
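A minimal sketch of these three summaries for a nominal variable, together with a bar chart, might look like this (the gender values are invented):

import pandas as pd
import matplotlib.pyplot as plt

gender = pd.Series(["Male", "Female", "Female", "Third Gender", "Female", "Male"])

frequency = gender.value_counts()          # how often each label occurs
proportion = frequency / frequency.sum()   # frequency divided by the total number of events
percentage = proportion * 100              # proportion expressed as a percentage

print(pd.DataFrame({"frequency": frequency,
                    "proportion": proportion,
                    "percentage": percentage}))

frequency.plot(kind="bar", title="Distribution of a nominal variable")
plt.show()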
Ordinal
• Categorizes data with a meaningful order, but the intervals between
ranks are not equal.
• Customer satisfaction ratings (Satisfied, Neutral, Dissatisfied)
• Education level
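For the customer satisfaction example above, an ordered pandas categorical records the ranking while making no claim that the gaps between levels are equal; a minimal sketch with invented responses:

import pandas as pd

satisfaction = pd.Series(
    ["Satisfied", "Neutral", "Dissatisfied", "Satisfied", "Neutral"],
    dtype=pd.CategoricalDtype(
        categories=["Dissatisfied", "Neutral", "Satisfied"], ordered=True),
)

print(satisfaction.min(), "<", satisfaction.max())  # the order of levels is meaningful
print(satisfaction.value_counts(sort=False))        # counts listed in rank order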
COMPARING EDA WITH CLASSICAL
AND BAYESIAN ANALYSIS
• There are several approaches to data analysis. The most popular ones that are relevant
to this book are the following:
• Classical data analysis: For the classical data analysis approach, the problem definition and data
collection step are followed by model development, which is followed by analysis and result
communication.
• Exploratory data analysis approach: For the EDA approach, it follows the same approach as
classical data analysis except the model imposition and the data analysis steps are swapped. The
main focus is on the data, its structure, outliers, models, and visualizations. Generally, in EDA, we
do not impose any deterministic or probabilistic models on the data.
• Bayesian data analysis approach: The Bayesian approach incorporates prior probability distribution knowledge into the analysis steps, as shown in the following diagram. Simply put, the prior probability distribution of any quantity expresses the belief about that particular quantity before considering some evidence. Are you still lost with the term prior probability distribution? Andrew Gelman has a very descriptive paper about prior probability distributions. The following diagram shows the three different approaches to data analysis, illustrating the difference in their execution steps:
[Diagram (not shown): execution steps of the classical, exploratory, and Bayesian data analysis approaches]
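As a small numerical illustration of a prior being updated by evidence (this example is not from the slides; it assumes a Beta prior with a binomial likelihood):

from scipy.stats import beta

# Prior belief about a coin's probability of heads: Beta(2, 2), centred on 0.5.
prior_a, prior_b = 2, 2

# Evidence: 200 flips with 120 heads (hypothetical numbers).
heads, tails = 120, 80

# A Beta prior with a binomial likelihood gives a Beta posterior (conjugacy).
post_a, post_b = prior_a + heads, prior_b + tails

print("prior mean:", beta(prior_a, prior_b).mean())    # 0.5
print("posterior mean:", beta(post_a, post_b).mean())  # about 0.598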