Unit 1

Exploratory Data Analysis (EDA) is a process used to examine datasets to uncover patterns, anomalies, and test hypotheses through statistical measures. The stages of EDA include data requirement, collection, preprocessing, cleaning, analysis, and modeling, culminating in effective communication of results through visualization. EDA is essential for making informed decisions and generating insights from complex datasets across various fields.


EXPLORATORY DATA ANALYSIS
INTRODUCTION
• Data encompasses a collection of discrete objects, numbers, words, events,
facts, measurements, observations, or even descriptions of things.
• Such data is collected and stored by every event or process occurring in several
disciplines, including biology, economics, engineering, marketing, and others.
• Processing such data elicits useful information, and processing such information generates useful knowledge.
• But an important question is: how can we generate meaningful and useful information from such data? An answer to this question is EDA.
• EDA is a process of examining the available dataset to discover patterns, spot
anomalies, test hypotheses, and check assumptions using statistical measures.
STAGES OF EDA
• Data Requirement
• Data Collection
• Data Preprocessing
• Data Cleaning
• EDA
• Modeling and Algorithm
• Data Product
• Communication
STAGES OF EDA
• Let's understand in brief what these stages are:
 Data requirements:
 There can be various sources of data for an organization.
 It is important to comprehend what type of data the organization requires to be collected, curated, and stored.
 For example, an application tracking the sleeping patterns of patients suffering from dementia requires several types of sensor data to be stored, such as sleep data, the patient's heart rate, electro-dermal activity, and user activity patterns.
 All of these data points are required to correctly diagnose the mental state of the person; hence, they are mandatory requirements for the application. In addition to this, it is necessary to categorize the data as numerical or categorical and to define the format of its storage and dissemination.
Data collection:
 Data collected from several sources must be stored in the correct format and transferred to the right
information technology personnel within a company.
 As mentioned previously, data can be collected from several objects on several events using different types
of sensors and storage tools.
STAGES OF EDA
Data preprocessing:
 Preprocessing involves pre-curating the dataset before the actual analysis.
 Common tasks include correctly exporting the dataset, placing it under the right tables, structuring it, and exporting it in the correct format.
Data cleaning:
 Preprocessed data is still not ready for detailed analysis. It must be correctly transformed for an
incompleteness check, duplicates check, error check, and missing value check.
 These tasks are performed in the data cleaning stage, which involves responsibilities such as matching the
correct record, finding inaccuracies in the dataset, understanding the overall data quality, removing
duplicate items, and filling in the missing values.
 However, how could we identify these anomalies in any dataset? Finding such data issues requires us to
perform some analytical techniques. We will be learning several such analytical techniques in Data
Transformation.
 To understand briefly, data cleaning is dependent on the types of data under study. Hence, it is most
essential for data scientists or EDA experts to comprehend different types of datasets. An example of data
cleaning would be using outlier detection methods for quantitative data cleaning.
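For instance, the outlier detection mentioned above can be sketched with the interquartile range (IQR) rule in pandas; the weight column and its values below are hypothetical, not part of the material:

import pandas as pd

# Hypothetical patient weights in kg; the last value is an obvious data-entry error.
df = pd.DataFrame({"weight": [51, 55, 58, 61, 64, 250]})

# IQR rule: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged as potential outliers.
q1, q3 = df["weight"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = df[(df["weight"] < lower) | (df["weight"] > upper)]
cleaned = df[(df["weight"] >= lower) & (df["weight"] <= upper)]
print(outliers)  # the 250 kg record is flagged for manual review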
STAGES OF EDA
EDA:
 Exploratory data analysis, as mentioned before, is the stage where we actually start to understand the message
contained in the data.
 It should be noted that several types of data transformation techniques might be required during the process of
exploration.
Modeling and algorithm:
 From a data science perspective, generalized models or mathematical formulas can represent or exhibit relationships
among different variables, such as correlation or causation.
 These models or equations involve one or more variables that depend on other variables to cause an event. For example, when buying, say, pens, the total price of pens (Total) = the price of one pen (UnitPrice) * the number of pens bought (Quantity).
 Hence, our model would be
 Total = UnitPrice * Quantity. Here, the total price is dependent on the unit price.
 Hence, the total price is referred to as the dependent variable and the unit price is referred to as an independent
variable.
 In general, a model always describes the relationship between independent and dependent variables. Inferential
statistics deals with quantifying relationships between particular variables. The Judd model for describing the
relationship between data, model, and error still holds true: Data = Model + Error.
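To make the idea Data = Model + Error concrete, here is a tiny sketch in Python; the observed value and the implied small fee are invented purely for illustration:

# Model: Total = UnitPrice * Quantity, so Total is the dependent variable.
def predict_total(unit_price, quantity):
    return unit_price * quantity

observed_total = 101.0                 # hypothetical recorded value
model_total = predict_total(10.0, 10)  # the model predicts 100.0
error = observed_total - model_total   # Data = Model + Error, so Error = 1.0
print(model_total, error)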
STAGES OF EDA
• Data Product:
• Any computer software that uses data as inputs, produces outputs, and
provides feedback based on the output to control the environment is referred to
as a data product.
• Example: A recommendation model, which uses user purchase history as input
and suggests related items that the user is likely to buy.
• Communication:
• This stage deals with disseminating the results to end stakeholders to use the
result for business intelligence.
• One of the most notable steps in this stage is data visualization.
• Visualization deals with information relay techniques such as tables, charts,
summary diagrams, and bar charts to show the analyzed result.
THE SIGNIFICANCE OF EDA
• Different fields of science, economics, engineering, and marketing accumulate and store data primarily in
electronic databases.
• Appropriate and well-established decisions should be made using the data collected.
• It is practically impossible to make sense of datasets containing more than a handful of data points without
the help of computer programs.
• Data mining is performed to extract insights and make decisions, involving various analysis processes.
• Exploratory data analysis is key, and usually the first exercise in data mining.
• It allows us to visualize data to understand it as well as to create hypotheses for further analysis.
• The exploratory analysis centers around creating a synopsis of data or insights for the next steps in a data
mining project.
• EDA actually reveals the ground truth about the data without making any underlying assumptions.
• This is why data scientists use this process to understand what types of models and hypotheses can be created.
• Key components of exploratory data analysis include summarizing data, statistical analysis, and visualization of
data.
Steps in EDA
• Problem Definition
• Data Preparation
• Data Analysis
• Development and Representation of the results
STEPS IN EDA
• Problem definition:
• Before trying to extract useful insights from the data, it is essential to define the business problem to be solved.
• The problem definition works as the driving force for a data analysis
plan execution.
• The main tasks involved in problem definition are
• Defining the main objective of the analysis,
• Defining the main deliverables,
• Outlining the main roles and responsibilities,
• Obtaining the current status of the data,
• Defining the timetable, and performing cost/benefit analysis.
Based on such a problem definition, an execution plan can be created.
STEPS IN EDA
• Data preparation:
• This step involves methods for preparing the dataset before actual
analysis.
• Define the sources of data,
• Define data schemas and tables,
• Understand the main characteristics of the data, clean the dataset,
• Delete non-relevant datasets,
• Transform the data,
• Divide the data into required chunks for analysis.
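A minimal sketch of a few of these preparation tasks with pandas is shown below; the file name, column names, and chunk size are assumptions rather than part of the material:

import pandas as pd

# Define the source of data (a hypothetical CSV file) and transform dates on load.
df = pd.read_csv("patients.csv", parse_dates=["dob"])

# Delete non-relevant columns and remove obviously bad rows.
df = df.drop(columns=["internal_notes"], errors="ignore")
df = df.drop_duplicates().dropna(subset=["patient_id"])

# Divide the data into required chunks for analysis (blocks of 1,000 rows here).
chunks = [df.iloc[i:i + 1000] for i in range(0, len(df), 1000)]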
STEPS IN EDA
• Data analysis:
• This is one of the most crucial steps; it deals with descriptive statistics and analysis of the data.
• The main tasks involve
• Summarizing the data,
• Finding the hidden correlations and relationships among the data,
• Developing predictive models,
• Evaluating the models,
• Calculating the accuracies.
• Some of the techniques used for data summarization are (a short sketch follows this list)
• Summary tables,
• Graphs,
• Descriptive statistics,
• Inferential statistics,
• Correlation statistics,
• Searching,
• Grouping,
• Mathematical models.
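For instance, a few of these summarization techniques (descriptive statistics, correlation statistics, and grouping) might be sketched with pandas as follows; the columns and values are hypothetical:

import pandas as pd

df = pd.DataFrame({
    "age":    [34, 45, 29, 61, 52],   # hypothetical values
    "weight": [60, 72, 55, 80, 77],
})

print(df.describe())                       # summary table of descriptive statistics
print(df.corr(method="pearson"))           # correlation statistics between variables
print(df.groupby(df["age"] > 40).mean())   # grouping as a simple summarization step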
STEPS IN EDA
• Development and representation of the results:
• This step involves presenting the dataset to the target audience in the form of
graphs, summary tables, maps, and diagrams.
• This is also an essential step as the result analyzed from the dataset should be
interpretable by the business stakeholders, which is one of the major goals of EDA.
• Most of the graphical analysis techniques include
• Scatter plots
• Character plots
• Histograms
• Box plots
• Residual plots
• Mean plots
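A few of these plots can be produced with matplotlib; a minimal sketch using invented sample values:

import matplotlib.pyplot as plt

weights = [60, 72, 55, 80, 77, 65, 58]        # hypothetical sample values
heights = [160, 175, 158, 182, 178, 170, 162]

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].scatter(heights, weights)   # scatter plot
axes[0].set_title("Scatter plot")
axes[1].hist(weights, bins=5)       # histogram
axes[1].set_title("Histogram")
axes[2].boxplot(weights)            # box plot
axes[2].set_title("Box plot")
plt.tight_layout()
plt.show()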
MAKING SENSE OF DATA
• It is crucial to identify the type of data under analysis.
• Different disciplines store different kinds of data for different purposes.
• For example,
• medical researchers store patients' data,
• universities store students' and teachers' data,
• real estate industries store house and building datasets.
• A dataset contains many observations about a particular object.
• For instance, a dataset about patients in a hospital can contain many observations.
• A patient can be described by
• a patient identifier (ID),
• name,
• address,
• weight,
• date of birth,
• email,
• gender.
• Each of these features that describes a patient is a variable.
MAKING SENSE OF DATA
• Each observation can have a specific value for each of these
variables. For example, a patient can have the following:
[Table (not shown): example values for a single patient record]
• These datasets are stored in hospitals and are presented for analysis.
Most of this data is stored in some sort of database management
system in tables/schema. An example of a table for storing patient
information is shown here:
[Table (not shown): five example patient records with columns PatientID, name, address, dob, email, gender, and weight]
MAKING SENSE OF DATA
• To summarize the preceding table, there are five observations (001, 002, 003, 004, 005). Each observation is described by the variables (PatientID, name, address, dob, email, gender, and weight).
• Most datasets broadly fall into two groups:
• Numerical data
 Discrete data
 Continuous data
• Categorical data
MAKING SENSE OF DATA
• Numerical data
• This data has a sense of measurement involved in it;
• for example, a person's age, height, weight, blood pressure, heart rate,
temperature, number of teeth, number of bones, and the number of
family members.
• This data is often referred to as quantitative data in statistics.
• A numerical dataset can be of either the discrete or the continuous type.
• Discrete data
• Data that is countable and whose values can be listed out.
• Data that can take on only specific, distinct values.
• For example, if we flip a coin, the number of heads in 200 coin flips can take any value from 0 to 200, which is a finite set of cases.
• A variable that represents a discrete dataset is referred to as a discrete
variable.
• The discrete variable takes a fixed number of distinct values.
• For example, the Country variable can have values such as Nepal, India,
Norway, and Japan. It is fixed. The Rank variable of a student in a
classroom can take values from 1, 2, 3, 4, 5, and so on.
• Continuous data
• A variable that can have an infinite number of numerical values within a
specific range is classified as continuous data.
• A variable describing continuous data is a continuous variable.
• Data that can take any value within a given range.
• For example, what is the temperature of your city today?
• Can the possible values be listed as a finite set? No; temperature can take any value within a range.
• Similarly, the weight variable in the previous section is a continuous variable.
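As a rough heuristic during EDA (not a rule from the slides), looking at a column's dtype and number of distinct values can hint at whether it is discrete or continuous; the columns below are made-up examples:

import pandas as pd

df = pd.DataFrame({
    "num_heads":   [3, 5, 4, 2, 5],                  # discrete: countable integer values
    "temperature": [21.3, 22.8, 19.95, 23.1, 20.4],  # continuous: any value within a range
})

for col in df.columns:
    print(col, df[col].dtype, df[col].nunique(), "distinct values")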
MAKING SENSE OF DATA
• Categorical data
• This type of data represents the characteristics of an object;
• for example, gender, marital status, type of address, or categories of movies. This data is often referred to as qualitative data in statistics.
• To understand clearly, here are some of the most common types of categorical data you can find in data:
• Gender (Male, Female, Other, or Unknown)
• Marital Status (Annulled, Divorced, Interlocutory, Legally Separated, Married, Polygamous, Never Married, Domestic Partner, Unmarried, Widowed, or Unknown)
• Movie genres (Action, Adventure, Comedy, Crime, Drama, Fantasy, Historical, Horror, Mystery, Philosophical, Political, Romance, Saga, Satire, Science Fiction, Social, Thriller, Urban, or Western)
• Blood type (A, B, AB, or O)
• Types of drugs (Stimulants, Depressants, Hallucinogens, Dissociatives, Opioids, Inhalants, or Cannabis)
• A variable describing categorical data is referred to as a categorical variable. These types of variables can have one of a limited number of values.
• It is easier for computer science students to understand categorical values as enumerated types or enumerations of variables; a short sketch follows.
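For instance, such a variable can be stored as a pandas categorical dtype with a fixed set of allowed values; the blood-type categories come from the list above, everything else is illustrative:

import pandas as pd

# Blood type stored as a categorical variable with a fixed set of allowed values.
blood_type = pd.Series(
    ["A", "O", "AB", "O", "B"],
    dtype=pd.CategoricalDtype(categories=["A", "B", "AB", "O"]),
)

print(blood_type.dtype)           # category
print(blood_type.cat.categories)  # the limited set of values the variable may take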
Types
There are different types of categorical variables:
• A binary categorical variable can take exactly two values and is also referred
to as a dichotomous variable.
• For example, when you create an experiment, the result is either success or
failure. Hence, results can be understood as a binary categorical variable.
• Polytomous variables are categorical variables that can take more than two
possible values.
• For example, marital status can have several values, such as annulled, divorced, interlocutory, legally separated, married, polygamous, never married, domestic partner, unmarried, widowed, and unknown. Since marital status can take more than two possible values, it is a
MEASUREMENT SCALES
• NOMINAL
• ORDINAL
• INTERVAL
• RATIO
NOMINAL
• Categorizes data without any order or ranking.
• It labels items without any quantitative value; such labels simply name categories.
• Example:
• What is your gender?
• Male
• Female
• Third Gender
• I prefer not to answer
• Other
• Other examples
• Languages spoken in a particular country
• Biological Species
• Parts of speech in Grammar
• Taxonomic ranks in biology
• Types of fruit
Nominal
• Frequency
• The number of times (or the rate at which) a label occurs within the dataset.
• Proportion
• It can be calculated by dividing the frequency by the total number of events.
• Percentage
• Proportion expressed as a percentage.
• Visualize
• Tools: Bar Chart, Pie Chart
• Purpose: Helps in understanding the distribution of categories in nominal
data
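A minimal sketch of these three summaries for a nominal variable, together with a bar chart, might look like this (the gender values are invented):

import pandas as pd
import matplotlib.pyplot as plt

gender = pd.Series(["Male", "Female", "Female", "Third Gender", "Female", "Male"])

frequency = gender.value_counts()          # how often each label occurs
proportion = frequency / frequency.sum()   # frequency divided by the total number of events
percentage = proportion * 100              # proportion expressed as a percentage

print(pd.DataFrame({"frequency": frequency,
                    "proportion": proportion,
                    "percentage": percentage}))

frequency.plot(kind="bar", title="Distribution of a nominal variable")
plt.show()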
Ordinal
• Categorizes data with a meaningful order, but the intervals between
ranks are not equal.
• Customer satisfaction ratings (Satisfied, Neutral, Dissatisfied)
• Education level
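For the customer satisfaction example above, an ordered pandas categorical records the ranking while making no claim that the gaps between levels are equal; a minimal sketch with invented responses:

import pandas as pd

satisfaction = pd.Series(
    ["Satisfied", "Neutral", "Dissatisfied", "Satisfied", "Neutral"],
    dtype=pd.CategoricalDtype(
        categories=["Dissatisfied", "Neutral", "Satisfied"], ordered=True),
)

print(satisfaction.min(), "<", satisfaction.max())  # the order of levels is meaningful
print(satisfaction.value_counts(sort=False))        # counts listed in rank order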
COMPARING EDA WITH CLASSICAL
AND BAYESIAN ANALYSIS
• There are several approaches to data analysis. The most popular ones that are relevant
to this book are the following:
• Classical data analysis: For the classical data analysis approach, the problem definition and data
collection step are followed by model development, which is followed by analysis and result
communication.
• Exploratory data analysis approach: For the EDA approach, it follows the same approach as
classical data analysis except the model imposition and the data analysis steps are swapped. The
main focus is on the data, its structure, outliers, models, and visualizations. Generally, in EDA, we
do not impose any deterministic or probabilistic models on the data.
• Bayesian data analysis approach: The Bayesian approach incorporates prior probability distribution knowledge into the analysis steps, as shown in the following diagram. Simply put, the prior probability distribution of any quantity expresses the belief about that particular quantity before considering some evidence. Are you still lost with the term prior probability distribution? Andrew Gelman has a very descriptive paper about prior probability distributions. The following diagram shows the three different approaches to data analysis, illustrating the difference in their execution steps:
[Diagram (not shown): execution steps of the classical, exploratory, and Bayesian data analysis approaches]
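As a small numerical illustration of a prior being updated by evidence (this example is not from the slides; it assumes a Beta prior with a binomial likelihood):

from scipy.stats import beta

# Prior belief about a coin's probability of heads: Beta(2, 2), centred on 0.5.
prior_a, prior_b = 2, 2

# Evidence: 200 flips with 120 heads (hypothetical numbers).
heads, tails = 120, 80

# A Beta prior with a binomial likelihood gives a Beta posterior (conjugacy).
post_a, post_b = prior_a + heads, prior_b + tails

print("prior mean:", beta(prior_a, prior_b).mean())    # 0.5
print("posterior mean:", beta(post_a, post_b).mean())  # about 0.598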