EDA Unit I
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Engineering
AL23V11-Exploratory Data Analysis
o Visualizing the data to get a better perspective.
o Understanding the data to make better decisions and find the final result.
The main phases of the data science life cycle are given below:
1. Discovery: The first phase is discovery, which involves asking the right questions. When you
start any data science project, you need to determine the basic requirements, priorities, and
project budget. In this phase, we need to determine all the requirements of the project, such as
the number of people, technology, time, data, and the end goal, and then we can frame the business
problem at a first-hypothesis level.
2. Data preparation: Data preparation is also known as data munging. In this phase, we need to
perform the following tasks:
o Data cleaning
o Data reduction
o Data integration
o Data transformation
After performing all the above tasks, we can easily use this data for our further processes.
3. Model Planning: In this phase, we need to determine the various methods and techniques to
establish the relationship between input variables. We apply exploratory data analysis (EDA),
using various statistical formulas and visualization tools, to understand the relations between
variables and to see what the data can tell us. Common tools used for model planning are:
o SQL Analysis Services
o R
o SAS
o Python
4. Model-building: In this phase, the process of model building starts. We will create datasets for
training and testing purpose. We will apply different techniques such as association, classification,
and clustering, to build the model.
Following are some common Model building tools:
o SAS Enterprise Miner
o WEKA
o SPSS Modeler
o MATLAB
5. Operationalize: In this phase, we deliver the final reports of the project, along with
briefings, code, and technical documents. This phase provides a clear overview of the complete
project performance and other components on a small scale before the full deployment.
6. Communicate results: In this phase, we check whether we have reached the goal set in the
initial phase. We communicate the findings and the final result to the business team.
Data Science Techniques
Here are some of the technical concepts you should know about before starting to learn data
science.
Machine Learning: Machine learning is the backbone of data science. Data Scientists need to have
a solid grasp of ML in addition to basic knowledge of statistics.
Modeling: Mathematical models enable you to make quick calculations and predictions based on
what you already know about the data. Modeling is also a part of Machine Learning and involves
identifying which algorithm is the most suitable to solve a given problem and how to train these models.
Statistics: Statistics is at the core of data science. A sturdy handle on statistics can help you extract more
intelligence and obtain more meaningful results. Key topics include:
o Hypothesis testing
o Central limit theorem
o z-test and t-test
o Correlation coefficients
o Sampling techniques
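As a hedged illustration of two of these topics, the sketch below runs a one-sample t-test and computes a
Pearson correlation coefficient with scipy; the sample values are made up purely for demonstration.

import numpy as np
from scipy import stats

# Hypothetical sample of measurements (illustrative values only)
sample = np.array([12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7])

# One-sample t-test: is the sample mean significantly different from 12.0?
t_stat, p_value = stats.ttest_1samp(sample, popmean=12.0)
print("t =", t_stat, "p =", p_value)

# Pearson correlation between two made-up variables
x = np.array([1, 2, 3, 4, 5, 6, 7, 8])
r, p = stats.pearsonr(x, sample)
print("r =", r, "p =", p)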
Programming: Some level of programming is required to execute a successful data science project.
The most common programming languages are Python and R. Python is especially popular because it’s
easy to learn, and it supports multiple libraries for data science and ML.
Database: A capable data scientist needs to understand how databases work, how to manage them,
and how to extract data from them.
Data Science Components:
1. Statistics: Statistics is one of the most important components of data science. Statistics is a way
to collect and analyze numerical data in large amounts and to find meaningful insights from it.
2. Domain Expertise: Domain expertise is what binds data science together. Domain
expertise means specialized knowledge or skills in a particular area. In data science, there are
various areas for which we need domain experts.
3. Data engineering: Data engineering is a part of data science that involves acquiring, storing,
retrieving, and transforming the data. Data engineering also includes adding metadata (data about data)
to the data.
4. Visualization: Data visualization means representing data in a visual context so that people
can easily understand its significance. Data visualization makes it easy to grasp huge
amounts of data through visuals.
5. Advanced computing: Advanced computing does the heavy lifting of data science. Advanced
computing involves designing, writing, debugging, and maintaining the source code of computer
programs.
6. Mathematics: Mathematics is a critical part of data science. Mathematics involves the study
of quantity, structure, space, and change. For a data scientist, good knowledge of mathematics is
essential.
7. Machine learning: Machine learning is the backbone of data science. Machine learning is all about
training a machine so that it can act like a human brain. In data science, we use various
machine learning algorithms to solve problems.
1. Problem Formulation
The product managers or the stakeholders need to understand the problems associated with a particular
operation. It is one of the most crucial aspects of a Data Science pipeline. To frame a use case as a
Data Science problem, the subject matter experts must first understand the current work stream and
the nitty-gritty associated with it. A Data Science problem needs strong domain input, without which
coming up with a viable success criterion becomes challenging.
2. Data Sources
Once the problem is clearly defined, the product managers, along with the Data Scientist, need to work
together to figure out the data required and the various sources from which it may be acquired. The
source of data could be IoT sensors, cloud platforms like GCP, AWS, Azure, or even web-scraped
data from social media.
3. Exploratory Data Analysis
The next process in the pipeline is EDA, where the gathered data is explored and analyzed for any
descriptive pattern in the data. Often the common exploratory data analysis steps involve finding
missing values, checking for correlation among the variables, performing univariate, bi-variate, and
multivariate analysis.
4. Feature Engineering
The process of EDA is followed by fetching key features from the raw data or creating additional
features based on the results of EDA and some domain experience. The process of feature engineering
could be both model agnostic such as finding correlation, forward selection, backward elimination,
etc., and model dependent such as getting feature importance from tree-based algorithms.
5. Modelling
The approach largely depends on whether the scope of the project calls for predictive, diagnostic, or
prescriptive modeling. In this step, a Data Scientist tries out multiple experiments using various
Machine Learning or Deep Learning algorithms. The trained models are validated against the test data
to check their performance. Once you have performed a thorough analysis of the data and decided on a
suitable model/algorithm, it is time to develop the real model and test it. Before building the model,
you need to divide the data into training and testing data sets. In normal circumstances, the training
data set constitutes 80% of the data, whereas the testing data set consists of the remaining 20%.
First, the training data is used to build the model, and then the testing data is used to evaluate
whether the developed model works correctly.
There are various packaged libraries in different programming languages (R, Python, MATLAB),
which you can use to build the model just by inputting the labeled training data.
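As a minimal sketch of the 80/20 split described above, the snippet below uses scikit-learn's
train_test_split; the feature matrix X and labels y here are made-up placeholders, not data from the text.

import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix (100 rows, 4 features) and binary labels
X = np.random.rand(100, 4)
y = np.random.randint(0, 2, size=100)

# 80% of the data for training, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)   # (80, 4) (20, 4)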
6. Deployment
The models developed need to be hosted on an on-premises or cloud server for the end users to consume.
Highly optimized and scalable code must be written to put models into production.
7. Monitoring
After the models are deployed, it is necessary to set up a monitoring pipeline. Often the deployed
models suffer from various data drift challenges in real time which need to be monitored and dealt
with accordingly.
8. User Acceptance
The data science project life cycle is only completed once the end-user has given a sign-off. The
deployed models are kept under observation for some time to validate their success against various
business metrics. Once that’s validated over a period, the users often give a sign-off for the closure of
the project.
EDA Fundamentals
Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) refers to the method of studying and exploring record sets to
apprehend their predominant traits, discover patterns, locate outliers, and identify relationships between
variables. EDA is normally carried out as a preliminary step before undertaking extra formal statistical
analyses or modelling.
Visualizations such as histograms, box plots, scatter plots, line plots, heatmaps, and bar charts assist
in identifying patterns, trends, and relationships within the data.
4. Feature Engineering: EDA allows for the exploration of different variables and their
transformations to create new features or derive meaningful insights. Feature engineering can involve
scaling, normalization, binning, encoding categorical variables, and creating interaction or derived variables.
5. Correlation and Relationships: EDA helps discover relationships and dependencies
between variables. Techniques such as correlation analysis, scatter plots, and cross-tabulations offer
insights into the strength and direction of relationships between variables.
6. Data Segmentation: EDA can involve dividing the data into meaningful segments based
on certain criteria or characteristics. This segmentation helps gain insights into specific subgroups
within the data and can lead to more focused analysis.
7. Hypothesis Generation: EDA aids in generating hypotheses or research questions based
on the initial exploration of the data. It forms the foundation for further analysis
and model building.
8. Data Quality Assessment: EDA allows for assessing the quality and reliability of the
data. It involves checking for data integrity, consistency, and accuracy to make sure the
data is suitable for analysis.
Types of EDA
Depending on the number of columns being analyzed, EDA can be broadly divided into univariate and
multivariate analysis. More generally, EDA refers to the process of analyzing and examining data sets
to uncover patterns, identify relationships, and gain insights. There are various types of EDA techniques
that can be employed depending on the nature of the data and the goals of the analysis. Here are some
common types of EDA:
1. Univariate Analysis: This type of analysis focuses on analyzing individual variables in the data set.
It involves summarizing and visualizing a single variable at a time to understand its distribution,
central tendency, spread, and other relevant statistics. Techniques like histograms, box plots, bar
charts, and summary statistics are commonly used in univariate analysis.
2. Bivariate Analysis: Bivariate analysis involves exploring the relationship between two variables.
It helps find associations, correlations, and dependencies between pairs of variables. Scatter plots, line
plots, correlation matrices, and cross-tabulation are commonly used techniques in bivariate analysis.
3. Multivariate Analysis: Multivariate analysis extends bivariate analysis to encompass
more than two variables. It aims to understand the complex interactions and dependencies among
multiple variables in a data set. Techniques such as heatmaps, parallel coordinates, factor
analysis, and principal component analysis (PCA) are used for multivariate analysis.
4. Time Series Analysis: This type of analysis is mainly applied to data sets that have a
temporal component. Time series analysis involves examining and modeling patterns, trends, and
seasonality in the data over time. Techniques like line plots, autocorrelation analysis,
moving averages, and ARIMA (AutoRegressive Integrated Moving Average) models are commonly
used in time series analysis.
5. Missing Data Analysis: Missing data is a common issue in datasets, and it may
impact the reliability and validity of the analysis. Missing data analysis includes identifying
missing values, understanding the patterns of missingness, and using suitable techniques to deal with missing
data. Techniques such as missing data patterns, imputation strategies, and sensitivity analysis are
employed in missing data analysis.
6. Outlier Analysis: Outliers are data points that significantly deviate from the general pattern
of the data. Outlier analysis includes identifying and understanding the presence of outliers, their potential
causes, and their impact on the analysis. Techniques such as box plots, scatter plots, z-scores, and
clustering algorithms are used for outlier analysis.
7. Data Visualization: Data visualization is a critical component of EDA that involves creating visual
representations of the data to facilitate understanding and exploration. Various visualization
techniques, including bar charts, histograms, scatter plots, line plots, heatmaps, and interactive
dashboards, are used to represent different kinds of data.
These are just a few examples of the types of EDA techniques that can be employed during
data analysis. The choice of techniques depends on the data characteristics, research questions,
and the insights sought from the analysis.
Understanding data science or Stages of EDA
➢ Data science involves cross-disciplinary knowledge from computer science, data, statistics, and
mathematics.
➢ There are several phases of data analysis, including
1. Data requirements
2. Data collection
3. Data processing
4. Data cleaning
5. Exploratory data analysis
6. Modeling and algorithms
7. Data product and communication
➢ These phases are similar to the Cross-Industry Standard Process for data mining (CRISP) framework
in data mining.
1. Data requirements
• There can be various sources of data for an organization.
• It is important to comprehend what type of data is required by the organization to be collected, curated, and stored.
• For example, an application tracking the sleeping pattern of patients suffering from dementia requires several
types of sensor data storage, such as sleep data, heart rate from the patient, electro-dermal activities, and user
activity patterns.
• All of these data points are required to correctly diagnose the mental state of the person. Hence, these are
mandatory requirements for the application.
• It is also required to categorize the data as numerical or categorical, and to define the format of storage and
dissemination.
2. Data collection
• Data collected from several sources must be stored in the correct format and transferred to the right information
technology personnel within a company.
• Data can be collected from several objects during several events using different types of sensors and storage tools.
3. Data processing
• Preprocessing involves the process of pre-curating (selecting and organizing) the dataset before actual analysis.
• Common tasks involve correctly exporting the dataset, placing it under the right tables, structuring it, and
exporting it in the correct format.
4. Data cleaning
• Preprocessed data is still not ready for detailed analysis.
• It must be correctly transformed for an incompleteness check, duplicates check, error check, and missing value check.
• This stage involves responsibilities such as matching the correct records, finding inaccuracies in the
dataset, understanding the overall data quality, removing duplicate items, and filling in the missing values.
• Data cleaning is dependent on the types of data under study. Hence, it is essential for data scientists or EDA
experts to comprehend different types of datasets.
• An example of data cleaning is using outlier detection methods for quantitative data cleaning.
5. EDA
• Exploratory data analysis is the stage where the message contained in the data is actually understood.
• Several types of data transformation techniques might be required during the process of exploration.
6. Modeling and algorithm
• Generalized models or mathematical formulas represent or exhibit relationships among different variables, such
as correlation or causation.
• These models or equations involve one or more variables that depend on other variables to cause an event.
• For example, when buying pens, the total price of pens (Total) = price for one pen (UnitPrice) * the number of
pens bought (Quantity). Hence, our model would be Total = UnitPrice * Quantity. Here, the total price is
dependent on the unit price. Hence, the total price is referred to as the dependent variable and the unit price is
referred to as an independent variable.
• In general, a model always describes the relationship between independent and dependent variables.
• Inferential statistics deals with quantifying relationships between particular variables.
• The model for describing the relationship between data, model, and the error still holds true:
Data = Model + Error
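As a small sketch of this idea, the snippet below uses the pen example with a hypothetical unit price and
hypothetical observed totals; the error term is simply the gap between the observed data and the model's prediction.

import numpy as np

unit_price = 20.0                                            # price of one pen (assumed)
quantity = np.array([1, 2, 3, 4, 5])                         # number of pens bought
observed_total = np.array([21.5, 39.0, 61.0, 79.5, 101.0])   # hypothetical observations

model_total = unit_price * quantity      # Model: Total = UnitPrice * Quantity
error = observed_total - model_total     # Data = Model + Error
print(error)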
7. Data Product
• Any computer software that uses data as inputs, produces outputs, and provides feedback based on the output to
control the environment is referred to as a data product.
• A data product is generally based on a model developed during data analysis
• Example: a recommendation model that inputs user purchase history and recommends a related item that the user
is highly likely to buy.
8. Communication
• This stage deals with disseminating the results to end stakeholders so that they can use the results for
business intelligence.
• One of the most notable steps in this stage is data visualization. Visualization deals with information
relay techniques such as tables, charts, summary diagrams, and bar charts to show the analyzed results.
The significance of EDA
Different fields of science, economics, engineering, and marketing accumulate and
store data primarily in electronic databases. Appropriate and well-established decisions should be
made using the data collected. It is practically impossible to make sense of datasets containing more
than a handful of data points without the help of computer programs. To be certain of the insights
that the collected data provides and to make further decisions, data mining is performed where we
go through distinctive analysis processes. Exploratory data analysis is key, and usually the first
exercise in data mining. It allows us to visualize data to understand it as well as to create hypotheses
for further analysis. The exploratory analysis centers around creating a synopsis of data or insights
for the next steps in a data mining project.
EDA actually reveals the ground truth about the data without making any underlying assumptions.
This is why data scientists use this process to understand what type of modeling and
hypotheses can be created. Key components of exploratory data analysis include summarizing data,
statistical analysis, and visualization of data.
➢ Python provides expert tools for exploratory analysis
• pandas for summarizing
• scipy, along with others, for statistical analysis
• matplotlib and plotly for visualizations
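As a minimal sketch of these three roles on a made-up dataset: pandas for summarizing, scipy for a
statistical measure, and matplotlib (through pandas plotting) for visualization. The column names and
values below are illustrative assumptions only.

import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt

# Hypothetical dataset
df = pd.DataFrame({"age": [23, 31, 45, 27, 39, 52, 36],
                   "income": [30, 42, 58, 35, 51, 66, 48]})

print(df.describe())                             # pandas: summary statistics
print(stats.pearsonr(df["age"], df["income"]))   # scipy: correlation and p-value

df.plot.scatter(x="age", y="income")             # matplotlib (via pandas): visualization
plt.show()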
Steps in EDA
The four different steps involved in exploratory data analysis are,
1. Problem Definition
2. Data Preparation
3. Data Analysis
4. Development and Representation of the Results
Problem definition: Before trying to extract useful insight from the data, it is essential to define the business
problem to be solved. The problem definition works as the driving force for a data analysis plan execution.
The main tasks involved in problem definition are defining the main objective of the analysis, defining the
main deliverables, outlining the main roles and responsibilities, obtaining the current status of the data,
defining the timetable, and performing cost/benefit analysis. Based on such a problem definition, an execution
plan can be created.
Data preparation: This step involves methods for preparing the dataset before actual analysis. In this step, we
define the sources of data, define data schemas and tables, understand the main characteristics of the data,
clean the dataset, delete non-relevant datasets, transform the data, and divide the data into required chunks for
analysis.
Data analysis: This is one of the most crucial steps that deals with descriptive statistics and analysis of the
data. The main tasks involve summarizing the data, finding the hidden correlation and relationships among the
data, developing predictive models, evaluating the models, and calculating the accuracies. Some of the
techniques used for data summarization are summary tables, graphs, descriptive statistics, inferential statistics,
correlation statistics, searching, grouping, and mathematical models.
Development and representation of the results: This step involves presenting the dataset to the target audience
in the form of graphs, summary tables, maps, and diagrams. This is also an essential step as the result analyzed
from the dataset should be interpretable by the business stakeholders, which is one of the major goals of EDA.
Most of the graphical analysis techniques include scatter plots, character plots, histograms, box plots,
residual plots, mean plots, and others.
Making Sense of Data
It is crucial to identify the type of data under analysis. In this section, we are going to learn
about different types of data that you can encounter during analysis. Different disciplines store different kinds
of data for different purposes. For example, medical researchers store patients' data, universities store students'
and teachers' data, and real estate industries storehouse and building datasets. A dataset contains many
observations about a particular object. For instance, a dataset about patients in a hospital can contain many
observations. A patient can be described by a patient identifier (ID), name, address, weight, date of birth,
address, email, and gender. Each of these features that describes a patient is a variable. Each observation can
have a specific value for each of these variables. For example, a patient can have the following:
PATIENT_ID = 1001
Name = Yoshmi Mukhiya
Address = Mannsverk 61, 5094, Bergen, Norway
Date of birth = 10th July 2018
Email = yoshmimukhiya@gmail.com
Weight = 10
Gender = Female
These datasets are stored in hospitals and are presented for analysis. Most of this data is stored in some sort of
database management system in tables/schema. An example of a table for storing patient information is shown
here:
To summarize the preceding table, there are five observations (001, 002, 003, 004, 005). Each observation
describes the variables (PatientID, name, address, dob, email, gender, and weight). Most datasets broadly
fall into two groups—numerical data and categorical data.
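As a sketch, the single patient observation listed above can be represented as one row of a pandas
DataFrame, with each variable becoming a column:

import pandas as pd

patients = pd.DataFrame([{
    "PATIENT_ID": 1001,
    "Name": "Yoshmi Mukhiya",
    "Address": "Mannsverk 61, 5094, Bergen, Norway",
    "Date of birth": "10th July 2018",
    "Email": "yoshmimukhiya@gmail.com",
    "Weight": 10,
    "Gender": "Female",
}])
print(patients)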
Types of datasets
➢ Most datasets broadly fall into two groups—numerical data and categorical data.
Numerical data
This data has a sense of measurement involved in it; for example, a person's age, height, weight, blood
pressure, heart rate, temperature, number of teeth, number of bones, and the number of family members. This
data is often referred to as quantitative data in statistics. The numerical dataset can be either discrete or
continuous types.
Discrete data
This is data that is countable and its values can be listed out. For example, if we flip a coin, the number of
heads in 200 coin flips can take values from 0 to 200 (finite) cases. A variable that represents a discrete dataset
is referred to as a discrete variable. The discrete variable takes a fixed number of distinct values. For example,
the Country variable can have values such as Nepal, India, Norway, and Japan. It is fixed. The Rank variable
of a student in a classroom can take values from 1, 2, 3, 4, 5, and so on.
Continuous data
A variable that can have an infinite number of numerical values within a specific range is classified as
continuous data. A variable describing continuous data is a continuous variable. For example, what is the
temperature of your city today? Can the set of possible values be finite? Similarly, the weight variable in the
previous section is a continuous variable.
A section of the table is shown in the following table:
Check the preceding table and determine which of the variables are discrete and which of the
variables are continuous. Can you justify your claim? Continuous data can follow an interval measure of scale
or ratio measure of scale.
Categorical data
This type of data represents the characteristics of an object; for example, gender, marital status, type of
address, or categories of the movies. This data is often referred to as qualitative datasets in statistics. To
understand clearly, here are some of the most common types of categorical data you can find in data:
Gender (Male, Female, Other, or Unknown)
Marital Status (Annulled, Divorced, Interlocutory, Legally Separated, Married, Polygamous,
Never Married, Domestic Partner, Unmarried, Widowed, or Unknown)
Movie genres (Action, Adventure, Comedy, Crime, Drama, Fantasy, Historical, Horror,
Mystery, Philosophical, Political, Romance, Saga, Satire, Science Fiction, Social, Thriller, Urban, or Western)
Blood type (A, B, AB, or O)
Types of drugs (Stimulants, Depressants, Hallucinogens, Dissociatives, Opioids, Inhalants, or
Cannabis)
A variable describing categorical data is referred to as a categorical variable. These types of variables can
have one of a limited number of values. It is easier for computer science students to understand categorical
values as enumerated types or enumerations of variables. There are different types of categorical variables:
A binary categorical variable can take exactly two values and is also referred to as
a dichotomous variable. For example, when you create an experiment, the result is either success or failure.
Hence, results can be understood as a binary categorical variable.
Polytomous variables are categorical variables that can take more than two possible values.
For example, marital status can have several values, such as annulled, divorced, interlocutory, legally
separated, married, polygamous, never married, domestic partner, unmarried, widowed, and unknown. Since
marital status can take more than two possible values, it is a polytomous variable.
Most of the categorical dataset follows either nominal or ordinal measurement scales.
Measurement scales
There are four different types of measurement scales described in statistics: nominal, ordinal,
interval, and ratio. These scales are used more in academic and research settings. Let's understand each of
them with some examples.
Nominal
These are practiced for labeling variables without any quantitative value. The scales are generally referred to
as labels. And these scales are mutually exclusive and do not carry any numerical importance. Let's see some
examples:
What is your gender?
Male
Female
Third gender/Non-binary
I prefer not to answer
Other
Other examples include the following:
The languages that are spoken in a particular country
Biological species
Parts of speech in grammar (noun, pronoun, adjective, and so on)
Taxonomic ranks in biology (Archea, Bacteria, and Eukarya)
Nominal scales are considered qualitative scales, and the measurements that are taken using qualitative scales
are considered qualitative data. However, advancements in qualitative research have created some confusion
about what should definitely be considered qualitative. If, for example, someone uses numbers as labels in the
nominal measurement sense, those numbers have no concrete numerical value or meaning. No form of arithmetic
calculation can be made on nominal measures.
Well, for example, in the case of a nominal dataset, you can certainly know the following:
Frequency is the rate at which a label occurs over a period of time within the dataset.
Proportion can be calculated by dividing the frequency by the total number of events.
Then, you could compute the percentage of each proportion.
And to visualize the nominal dataset, you can use either a pie chart or a bar chart.
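A minimal pandas sketch of these nominal-data summaries, using a made-up gender column (the labels and
counts are illustrative assumptions):

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical nominal data
gender = pd.Series(["Male", "Female", "Female", "Other", "Male", "Female"])

freq = gender.value_counts()      # frequency of each label
prop = freq / freq.sum()          # proportion
pct = prop * 100                  # percentage
print(freq, prop, pct, sep="\n")

freq.plot.bar()                   # a bar chart (or freq.plot.pie() for a pie chart)
plt.show()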
Ordinal
The main difference between the ordinal and nominal scales is the order. In ordinal scales, the order of the values is a
significant factor. An easy tip to remember the ordinal scale is that it sounds like "order". Have you heard
about the Likert scale, which uses a variation of an ordinal scale? Let's check an example of ordinal scale
using the Likert scale: WordPress is making content managers' lives easier. How do you feel about this
statement? The following diagram shows the Likert scale:
As depicted in the preceding diagram, the answer to the question of WordPress is making content managers'
lives easier is scaled down to five different ordinal values, Strongly Agree, Agree, Neutral, Disagree,
and Strongly Disagree. Scales like these are referred to as the Likert scale. Similarly, the following diagram
shows more examples of the Likert scale:
To make it easier, consider ordinal scales as an order of ranking (1st, 2nd, 3rd, 4th, and so on).
The median item is allowed as the measure of central tendency; however, the average is not permitted.
Interval
In interval scales, both the order and exact differences between the values are significant. Interval scales are
widely used in statistics, for example, in the measure of central tendencies—mean, median, mode, and
standard deviations. Examples include location in Cartesian coordinates and direction measured in degrees
from magnetic north. The mean, median, and mode are allowed on interval data.
Ratio
Ratio scales contain order, exact values, and an absolute zero, which makes it possible to use them in descriptive
and inferential statistics. These scales provide numerous possibilities for statistical analysis. Mathematical
operations, the measure of central tendencies, and the measure of dispersion and coefficient of variation can
also be computed from such scales.
Examples include measures of energy, mass, length, duration, electrical energy, plane angle, and volume. The
following table gives a summary of the data types and scale measures:
The Bayesian approach, in contrast, combines prior probability distributions with the observed evidence. Are
you still lost with the term prior probability distribution? Andrew Gelman has a very descriptive paper about
prior probability distributions. The following diagram shows three different approaches to data analysis,
illustrating the difference in their execution steps:
Data analysts and data scientists freely mix steps mentioned in the preceding approaches to get meaningful
insights from the data. In addition to that, it is essentially difficult to judge or estimate which model is best for
data analysis. All of them have their paradigms and are suitable for different types of data analysis.
Software tools available for EDA
Exploratory Data Analysis (EDA) involves examining datasets to summarize their main
characteristics, often with visual methods. There are several software tools available for EDA, each offering a
variety of features for data visualization, statistical analysis, and data manipulation. Here are some of the most
popular tools:
1. Python Libraries
Pandas: A powerful data manipulation and analysis library. It offers data structures like DataFrames
and functions for cleaning, aggregating, and analyzing data.
Matplotlib: A plotting library that produces static, animated, and interactive visualizations in Python.
Seaborn: Built on top of Matplotlib, it provides a high-level interface for drawing attractive statistical
graphics.
Plotly: An interactive graphing library that makes it easy to create interactive plots.
Scipy: A library used for scientific and technical computing.
Statsmodels: A library for estimating and testing statistical models.
Sweetviz: An open-source library that generates beautiful, high-density visualizations for EDA.
Pandas Profiling: A library that creates HTML profiling reports from pandas DataFrames.
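As a sketch of the last item, a one-line profiling report can be generated roughly as follows. The package
is published as ydata-profiling in recent versions (formerly pandas-profiling), and the CSV path is a
placeholder for any dataset you have on disk.

import pandas as pd
from ydata_profiling import ProfileReport   # formerly: from pandas_profiling import ProfileReport

df = pd.read_csv("titanic.csv")              # placeholder path to any dataset
ProfileReport(df, title="EDA Report").to_file("eda_report.html")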
2. R Libraries
ggplot2: A data visualization package for the statistical programming language R.
dplyr: A data manipulation library that provides a grammar for data manipulation.
DataExplorer: An R package that simplifies EDA by providing functions to visualize data distributions,
missing values, and correlations.
Other tools and platforms include Databricks, an enterprise software company founded by the creators of
Apache Spark that provides a unified analytics platform.
Python tools and packages
Bar chart: Bar plots are used to display the distribution of a categorical variable or to compare
values across different categories. They can be used for both single and grouped categorical data.
Scatter plot: Scatter plots are used to visualize the relationship between two numerical variables.
Each data point is plotted as a point on the graph, allowing you to observe patterns, trends, and correlations.
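Hedged sketches of both chart types, assuming a pandas DataFrame named data is already loaded (for
example, the Titanic dataset with columns such as Pclass, Age, and Fare, as in the later snippets):

import seaborn as sns
import matplotlib.pyplot as plt

# `data` is assumed to be an already-loaded DataFrame (e.g., the Titanic dataset)
sns.countplot(x="Pclass", data=data)            # bar chart: counts per category
plt.show()

sns.scatterplot(x="Age", y="Fare", data=data)   # scatter plot: two numerical variables
plt.show()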
Pie chart: Pie charts are used to show the proportion of different categories within a whole.
However, they are best used when you have a small number of categories and clear proportions.
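A possible sketch, again assuming the DataFrame data with a categorical Pclass column:

import matplotlib.pyplot as plt

# Proportion of rows in each category (assumes `data` is already loaded)
data["Pclass"].value_counts().plot.pie(autopct="%1.1f%%")
plt.ylabel("")          # hide the default y-axis label
plt.show()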
Table chart: A table chart helps visually represent data that is arranged in rows and columns.
Throughout all forms of communication and research, tables are used extensively to store, analyze, compare,
and present data.
Lollipop chart
A lollipop chart can be a sweet alternative to a regular bar chart if you are dealing with a lot of categories and
want to make optimal use of space. It shows the relationship between a numeric and a categorical variable.
This type of chart consists of a line, which represents the magnitude, and ends with a dot, or a circle, which
highlights the data value.
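One way to sketch a lollipop chart with matplotlib, assuming the same DataFrame data as before:

import numpy as np
import matplotlib.pyplot as plt

# Category counts (assumes `data` is already loaded)
counts = data["Pclass"].value_counts().sort_index()
positions = np.arange(len(counts))

plt.vlines(positions, 0, counts.values)        # the "stick": magnitude of each value
plt.plot(positions, counts.values, "o")        # the "candy": a dot highlighting the value
plt.xticks(positions, counts.index.astype(str))
plt.xlabel("Pclass")
plt.ylabel("Count")
plt.show()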
Violin Plot
Violin plots combine box plots with kernel density estimation, offering a detailed view of data distribution.
import seaborn as sns
import matplotlib.pyplot as plt

# Distribution of Age within each passenger class (assumes a DataFrame `data`, e.g. the Titanic dataset, is loaded)
sns.violinplot(x="Pclass", y="Age", data=data)
plt.show()
Box plot
Box plots represent the distribution and spread of data, useful for detecting outliers and understanding central
tendencies.
import seaborn as sns
import matplotlib.pyplot as plt

# Spread, central tendency, and outliers of Age per passenger class (assumes `data` is loaded)
sns.boxplot(x="Pclass", y="Age", data=data)
plt.show()
Heatmap
Heatmaps visualize the correlation between numerical features, helping to uncover dependencies in the data.
import seaborn as sns
import matplotlib.pyplot as plt

# Correlation matrix of the numeric columns (assumes `data` is loaded)
correlation_matrix = data.corr(numeric_only=True)
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm")
plt.show()
Tree Maps With Heatmaps
Tree maps visualize hierarchical data by nesting rectangles within larger rectangles, with each rectangle
representing a category or element. The addition of heatmaps to tree maps provides a way to encode additional
information within each rectangle's color.
import plotly.express as px

# Built-in sample dataset shipped with Plotly
data = px.data.tips()
fig = px.treemap(
    data, path=['day', 'time', 'sex'], values='total_bill',
    color='tip', hover_data=['tip'], color_continuous_scale='Viridis'
)
fig.update_layout(title_text="Tree Map with Heatmap Example")
fig.show()
Histograms: Histograms display the distribution of a single numerical variable by dividing it into bins
and showing the frequency or count of data points in each bin. They help you understand the central
tendency, spread, and shape of the data.
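A short sketch, assuming the same DataFrame data with a numerical Age column:

import seaborn as sns
import matplotlib.pyplot as plt

# Distribution of a single numerical variable (assumes `data` is loaded)
sns.histplot(data["Age"].dropna(), bins=20)
plt.show()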
Box Plots (Box-and-Whisker Plots): Box plots provide a visual summary of the distribution of a
numerical variable. They show the median, quartiles, and potential outliers in the data.
Pair Plots: A pair plot (from Seaborn) displays pairwise relationships in a dataset, showing scatter
plots for pairs of numerical variables and histograms along the diagonal.
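A possible pair plot sketch, assuming the DataFrame data has a few numerical columns:

import seaborn as sns
import matplotlib.pyplot as plt

# Pairwise scatter plots with histograms on the diagonal (assumes `data` is loaded)
sns.pairplot(data[["Age", "Fare", "Pclass"]].dropna())
plt.show()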
The four types of EDA techniques are:
1. Univariate Non-graphical
2. Multivariate Non-graphical
3. Univariate graphical
4. Multivariate graphical
Univariate Non-graphical: This is the simplest form of data analysis, as we use just
one variable to explore the data. The standard goal of univariate non-graphical EDA is to understand
the underlying sample distribution of the data and make observations about the population. Outlier
detection is also part of the analysis. The characteristics of a population distribution
include:
Central tendency: The central tendency or location of a distribution has to do with typical
or middle values. The commonly useful measures of central tendency are the mean,
median, and sometimes the mode, of which the most common is the mean. For skewed
distributions, or when there is concern about outliers, the median may be preferred.
Spread: Spread is an indicator of how far from the center we are likely to find the data
values. The standard deviation and variance are two useful measures of
spread. The variance is the mean of the squared individual deviations, and
the standard deviation is the square root of the variance.
Skewness and kurtosis: Two more useful univariate descriptors are the skewness and
kurtosis of the distribution. Skewness is a measure of asymmetry, and kurtosis is
a more subtle measure of peakedness compared to a normal distribution.
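A compact sketch of these univariate non-graphical measures with pandas and scipy, using a made-up
numeric sample:

import pandas as pd
from scipy import stats

values = pd.Series([2, 3, 3, 4, 5, 5, 5, 6, 7, 21])   # hypothetical data with an outlier

print("mean:", values.mean(), "median:", values.median(), "mode:", values.mode().iloc[0])
print("variance:", values.var(), "std dev:", values.std())
print("skewness:", stats.skew(values), "kurtosis:", stats.kurtosis(values))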
Multivariate Non-graphical: Multivariate non-graphical EDA techniques are generally used
to show the relationship between two or more variables in the form of either cross-
tabulation or statistics.
For categorical data, an extension of tabulation called cross-tabulation is extremely useful.
For two variables, cross-tabulation is performed by making a two-way table with column
headings that match the levels of one variable and row headings that match the levels
of the other variable, then filling in the counts of all subjects that share the
same pair of levels.
For one categorical variable and one quantitative variable, we compute statistics for the
quantitative variable separately for each level of the categorical variable, and then compare the
statistics across the levels of the categorical variable.
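A minimal pandas sketch of both ideas, assuming the DataFrame data has categorical Pclass and Sex
columns and a quantitative Age column:

import pandas as pd

# Cross-tabulation of two categorical variables (assumes `data` is loaded)
print(pd.crosstab(data["Pclass"], data["Sex"]))

# Statistics for a quantitative variable at each level of a categorical variable
print(data.groupby("Pclass")["Age"].describe())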
Univariate graphical: Non-graphical methods are quantitative and objective, but they are
not able to give a complete picture of the data; therefore, graphical methods, which
involve a degree of subjective analysis, are also required. Common types of
univariate graphics are:
Histogram: The most basic graph is the histogram, which is a barplot in
which each bar represents the frequency (count) or proportion (count/total count) of cases
for a range of values. Histograms are one of the simplest ways to quickly learn a lot
about your data, including central tendency, spread, modality, shape, and outliers.
Stem-and-leaf plots: A simple substitute for a histogram is the stem-and-leaf plot. It
shows all data values and the shape of the distribution.
Boxplots: Another very useful univariate graphical technique is the boxplot.
Boxplots are excellent at presenting information about central tendency and show robust
measures of location and spread, as well as providing information about symmetry and
outliers, although they can be misleading about aspects like multimodality. One of
the best uses of boxplots is in the form of side-by-side boxplots.
Quantile-normal plots: The final univariate graphical EDA technique is the
most intricate. It is called the quantile-normal or QN plot or, more generally, the
quantile-quantile or QQ plot. It is used to see how well a particular sample follows a particular
theoretical distribution. It allows detection of non-normality and diagnosis of skewness
and kurtosis.
Multivariate graphical: Multivariate graphical EDA uses graphics to display
relationships between two or more sets of data. The only one used commonly is
a grouped barplot, with each group representing one level of one of the variables and
each bar within a group representing the levels of the other variable.
Other common types of multivariate graphics are:
Scatterplot: For two quantitative variables, the essential graphical EDA technique
is the scatterplot, which has one variable on the x-axis, one on the y-axis, and
a point for every case in your dataset.
Run chart: A line graph of data plotted over time.
Heat map: A graphical representation of data where values are depicted by
color.
Multivariate chart: A graphical representation of the relationships
between factors and a response.
Bubble chart: A data visualization that displays multiple circles (bubbles) in a two-dimensional
plot.
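Possible sketches of a grouped barplot and a QQ plot, assuming the same DataFrame data as before:

import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

# Grouped barplot: counts of one categorical variable split by another (assumes `data` is loaded)
sns.countplot(x="Pclass", hue="Sex", data=data)
plt.show()

# QQ (quantile-quantile) plot: how well Age follows a normal distribution
stats.probplot(data["Age"].dropna(), dist="norm", plot=plt)
plt.show()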
Technical requirements
You can find the code for this chapter on GitHub: https://github.com/PacktPublishing/hands-on-
exploratory-data-analysis-with-python. In order to get the best out of this chapter, ensure the following:
Make sure you have Python 3.x installed on your computer. It is recommended to use a Python
notebook environment such as Jupyter Notebook, which ships with the Anaconda distribution.
You must have Python libraries such as pandas, seaborn, and matplotlib installed.
Data transformation:
Data transformation is the mutation of data characteristics to improve access or
storage. Transformation may occur on the format, structure, or values of data. With regard to
data analytics, transformation usually occurs after data is extracted or loaded (ETL/ELT).
Data transformation increases the efficiency of analytic processes and enables
data-driven decisions. Raw data is often difficult to analyze and too vast in quantity to derive
meaningful insight, hence the need for clean, usable data.
During the transformation process, an analyst or engineer will determine the data structure.
The most common types of data transformation are:
Constructive: The data transformation process adds, copies, or replicates data.
Destructive: The system deletes fields or records.
Aesthetic: The transformation standardizes the data to meet requirements or parameters.
Structural: The database is reorganized by renaming, moving, or combining columns.
Data transformation techniques:
There are 6 basic data transformation techniques that you can use in your analysis project or
data pipeline:
• Data Smoothing
• Attribution Construction
• Data Generalization
• Data Aggregation
• Data Discretization
• Data Normalization
Data Smoothing
Data smoothing is a process that is used to remove noise from the dataset using some algorithms. It
allows for highlighting important features present in the dataset. It helps in predicting the patterns.
When collecting data, it can be manipulated to eliminate or reduce any variance or any other noise
form.
The concept behind data smoothing is that it will be able to identify simple changes to help predict
different trends and patterns. This serves as a help to analysts or traders who need to look at a lot of data
which can often be difficult to digest for finding patterns that they wouldn't see otherwise.
We have seen how noise is removed from the data using techniques such as binning, regression,
and clustering.
o Binning: This method splits the sorted data into the number of bins and smoothens the data
values in each bin considering the neighborhood values around it.
o Regression: This method identifies the relation between two attributes so that if we
have one attribute, it can be used to predict the other attribute.
o Clustering: This method groups similar data values to form clusters. The values that lie
outside a cluster are known as outliers.
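A small sketch of smoothing by bin means with pandas; the values below are made up for illustration:

import pandas as pd

prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])   # hypothetical sorted data

bins = pd.cut(prices, bins=3)                       # split into 3 equal-width bins
smoothed = prices.groupby(bins).transform("mean")   # replace each value with its bin mean
print(pd.DataFrame({"original": prices, "bin": bins, "smoothed": smoothed}))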
Attribution Construction
Attribution construction is one of the most common techniques in data transformation
pipelines. Attribution construction or feature construction is the process of creating new
features from a set of the existing features/attributes in the dataset.
Data Generalization
Data generalization refers to the process of transforming low-level attributes into high-level ones by
using the concept of hierarchy. Data generalization is applied to categorical data where they have a
finite but large number of distinct values. It converts low-level data attributes to high-level data
attributes using concept hierarchy. This conversion from a lower level to a higher conceptual level is
useful to get a clearer picture of the data. Data generalization can be divided into two approaches:
o Data cube process (OLAP) approach.
o Attribute-oriented induction (AOI) approach.
For example, age data can be in the form of (20, 30) in a dataset. It is transformed into a higher
conceptual level into a categorical value (young, old).
Data Aggregation
Data aggregation (or data collection) is the method of storing and presenting data in a summary
format. The data may be obtained from multiple data sources to integrate these data sources into a data
analysis description. This is a crucial step, since the accuracy of data analysis insights is highly dependent on
the quantity and quality of the data used.
Gathering accurate data of high quality and a large enough quantity is necessary to produce relevant results.
The collection of data is useful for everything from decisions concerning financing or business strategy of the
product, pricing, operations, and marketing strategies.
For example, we have a data set of sales reports of an enterprise that has quarterly sales of each year. We can
aggregate the data to get the enterprise's annual sales report.
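A brief pandas sketch of the quarterly-to-annual example; the sales figures are hypothetical:

import pandas as pd

sales = pd.DataFrame({
    "year":    [2022, 2022, 2022, 2022, 2023, 2023, 2023, 2023],
    "quarter": [1, 2, 3, 4, 1, 2, 3, 4],
    "sales":   [120, 135, 150, 170, 160, 155, 180, 200],
})

annual_sales = sales.groupby("year")["sales"].sum()   # aggregate quarters into annual totals
print(annual_sales)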
Data Discretization
Data discretization refers to the process of transforming continuous data into a set of data
intervals. This is an especially useful technique that can help you make the data easier to study
and analyze and improve the efficiency of any applied algorithm.
Data Normalization
Data normalization is the process of scaling the data to a much smaller range, without losing
information, in order to help minimize or exclude duplicated data and improve algorithm
efficiency and data extraction performance.
There are three methods to normalize an attribute:
Min-max normalization: This method implements a linear transformation on the original data. Let us consider
that we have minA and maxA as the minimum and maximum values observed for attribute A, and Vi is the value for
attribute A that has to be normalized.
The min-max normalization maps Vi to V'i in a new, smaller range [new_minA, new_maxA]. The
formula for min-max normalization is given below:
V'i = ((Vi − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA
For example, suppose we have $12,000 and $98,000 as the minimum and maximum values for the attribute income,
and [0.0, 1.0] is the range into which we have to map the value $73,600.
The value $73,600 would be transformed using min-max normalization as follows:
V' = ((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0.0) + 0.0 ≈ 0.716
Z-score normalization: This method normalizes the value for attribute A using the mean and standard
deviation. The following formula is used for Z-score normalization:
V'i = (Vi − Ā) / σA
Here, Ā and σA are the mean and standard deviation for attribute A, respectively.
For example, suppose the mean and standard deviation for attribute A are $54,000 and $16,000, and we have
to normalize the value $73,600 using z-score normalization:
V' = (73,600 − 54,000) / 16,000 = 1.225
Decimal Scaling: This method normalizes the value of attribute A by moving the decimal point in the
value. This movement of the decimal point depends on the maximum absolute value of A. The formula for
decimal scaling is given below:
V'i = Vi / 10^j, where j is the smallest integer such that max(|V'i|) < 1.
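A compact sketch implementing the three normalization methods above on a hypothetical income column
(which includes the values used in the examples):

import numpy as np
import pandas as pd

income = pd.Series([12000, 54000, 73600, 98000])

# Min-max normalization to [0, 1]
min_max = (income - income.min()) / (income.max() - income.min())

# Z-score normalization (using the sample mean and standard deviation)
z_score = (income - income.mean()) / income.std()

# Decimal scaling: divide by 10^j so the largest absolute value falls below 1
j = int(np.floor(np.log10(income.abs().max()))) + 1
decimal_scaled = income / (10 ** j)

print(pd.DataFrame({"income": income, "min_max": min_max,
                    "z_score": z_score, "decimal": decimal_scaled}))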
Data Manipulation
Data manipulation refers to the process of making your data more readable and organized.
This can be achieved by changing or altering your raw datasets.
Data manipulation tools can help you identify patterns in your data and apply any data
transformation technique (e.g. attribution creation, normalization, or aggregation) in an
efficient and easy way.
The Data Transformation Process
In a cloud data warehouse, the data transformation process most typically takes the form
of ELT (Extract Load Transform) or ETL (Extract Transform Load). With cloud storage costs
becoming cheaper by the year, many teams opt for ELT— the difference being that all data is
loaded in cloud storage, then transformed and added to a warehouse.
The transformation process generally follows 6 stages/steps:
1. Data Discovery: During the first stage, analysts work to understand and identify data in its source format. To
do this, they will use data profiling tools. This step helps analysts decide what they need to do to get data into
its desired format.
2. Data Mapping: During this phase, analysts perform data mapping to determine how individual fields are
modified, mapped, filtered, joined, and aggregated. Data mapping is essential to many data processes, and
one misstep can lead to incorrect analysis and ripple through your entire organization.
3. Data Extraction: During this phase, analysts extract the data from its original source. These may include
structured sources such as databases or streaming sources such as customer log files from web applications.
4. Code Generation and Execution: Once the data has been extracted, analysts need to write code to
complete the transformation. Often, analysts generate this code with the help of data transformation platforms or
tools.
5. Review: After transforming the data, analysts need to check it to ensure everything has been formatted
correctly.
6. Sending: The final step involves sending the data to its target destination. The target might be a data
warehouse or a database that handles both structured and unstructured data.
Database:
A database is an organized collection of data, so that it can be easily accessed and managed.
We can organize data into tables, rows, columns, and index it to make it easier to find
relevant information.
Database handlers create a database in such a way that a single set of software
programs provides access to the data for all users.
The main purpose of the database is to operate a large amount of information by
storing, retrieving, and managing data.
There are many dynamic websites on the World Wide Web nowadays which are
handled through databases. For example, a model that checks the availability of rooms
in a hotel. It is an example of a dynamic website that uses a database.
There are many databases available like MySQL, Sybase, Oracle, MongoDB,
Informix, PostgreSQL, SQL Server, etc.
Modern databases are managed by the database management system (DBMS).
SQL or Structured Query Language is used to operate on the data stored in a database.
SQL depends on relational algebra and tuple relational calculus.
Merging Data
We will often come across multiple independent or dependent datasets (e.g., when working
with different groups or multiple kinds of data). However, it might be necessary to get all the
data from the different datasets into one data frame.
At first we need to find a meaningful way in which we can merge the datasets "zoo1",
"zoo2", "zoo3" and "class" into one data frame. Take a look at the data again. Do you see
similarities and differences? In order to properly match two (or more) datasets, we have to
make sure that the structure and datatypes match.
Preparation
Before merging the zoo datasets, you should match the data type and structure of all the zoo
datasets. For example, you can use a loop to match the data types so that you don't have to
convert every variable manually. You can convert data types in R with as.numeric(),
as.character(), as.logical(), and so on.
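The functions above describe the R approach; as an analogous pandas sketch (the column names, values, and target types here are assumptions for illustration only), a loop can align the data types across the zoo data frames:
```python
import pandas as pd

# hypothetical zoo data frames whose columns have mismatched data types
zoo1 = pd.DataFrame({"animal": ["lion", "zebra"], "legs": ["4", "4"]})
zoo2 = pd.DataFrame({"animal": ["eagle", "frog"], "legs": [2, 4]})

# loop over the shared columns so we don't convert each variable manually
target_types = {"animal": "string", "legs": "int64"}
for frame in (zoo1, zoo2):
    for col, dtype in target_types.items():
        frame[col] = frame[col].astype(dtype)
```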
All of the datasets can then be merged in two steps.
1) Merging zoo datasets
In order to merge all of the zoo datasets, we recommend using either cbind() or rbind(). You
can find what both commands do in the R documentation. Choose the one that matches the
structure of your data.
2) Merging zoo and class datasets
When looking at the class dataset, you will see that it looks fundamentally different from the
zoo datasets. For this, dplyr's join functions are helpful: left_join(), right_join(), full_join(), or inner_join(). You
should merge the datasets in such a way that the class dataset adds data to your merged zoo dataset.
In pandas, the same merges can be carried out with merge(), concat(), and join().
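A minimal pandas sketch of both merging steps (the small zoo and class data frames below are made-up stand-ins for the real datasets):
```python
import pandas as pd

# made-up stand-ins for the zoo and class datasets
zoo1 = pd.DataFrame({"animal": ["lion", "zebra"], "class_type": [1, 1]})
zoo2 = pd.DataFrame({"animal": ["eagle", "frog"], "class_type": [2, 5]})
cls = pd.DataFrame({"class_type": [1, 2, 5],
                    "class_name": ["Mammal", "Bird", "Amphibian"]})

# step 1: stack the zoo datasets row-wise (the pandas analogue of rbind())
zoo = pd.concat([zoo1, zoo2], ignore_index=True)

# step 2: add the class information to the merged zoo data (a left join)
merged = zoo.merge(cls, on="class_type", how="left")
print(merged)
```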
Reshaping and Pivoting
Reshaping and Pivoting are powerful operations in Pandas that allow us to transform and
restructure our data in various ways. This can be particularly useful when dealing with messy or
unstructured datasets. In this explanation, we will cover both reshaping and pivoting in detail
with examples, sample code, and output.
1. Reshaping:
Reshaping in Pandas is the process of converting data from a “wide” format to a “long”
format or vice versa. This can be achieved using the `melt()` and `stack()` functions.
a) Melt:
The `melt()` function is used to transform a DataFrame from a wide format to a long format. It
essentially unpivots the data and brings it into a tabular form.
Let’s consider an example where we have a DataFrame with three variables: Name, Math score,
and English score. We can use the `melt()` function to bring the Math and English scores into a
single column:
```python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Math Score': [85, 90, 75],
        'English Score': [95, 80, 70]}
df = pd.DataFrame(data)

melted_df = pd.melt(df, id_vars=['Name'], var_name='Subject', value_name='Score')
```
Output:
```
      Name        Subject  Score
0    Alice     Math Score     85
1      Bob     Math Score     90
2  Charlie     Math Score     75
3    Alice  English Score     95
4      Bob  English Score     80
5  Charlie  English Score     70
```
In the melted DataFrame, the ‘Math Score’ and ‘English Score’ columns have been merged
into a single ‘Subject’ column, and the respective scores are placed in a new ‘Score’ column.
The ‘Name’ column remains unchanged.
b) Stack:
The `stack()` function is used to pivot a level of column labels into a level of row labels,
resulting in a long format DataFrame.
Consider a DataFrame where we have multiple variables for each person. We can use the
`stack()` function to stack these variables into a single column:
```python
import pandas as pd

data = {'Name': ['Alice', 'Bob'],
        'Variable1': [5, 10],
        'Variable2': [8, 12]}
df = pd.DataFrame(data)

stacked_df = df.set_index('Name').stack().reset_index()
stacked_df.columns = ['Name', 'Variable', 'Value']
```
Output:
```
    Name   Variable  Value
0  Alice  Variable1      5
1  Alice  Variable2      8
2    Bob  Variable1     10
3    Bob  Variable2     12
```
The `stack()` function stacks the ‘Variable1’ and ‘Variable2’ columns into a single ‘Variable’
column. The respective values are stored in the ‘Value’ column, while the ‘Name’ column
remains unchanged.
2. Pivoting:
Pivoting is the process of converting a long format DataFrame into a wide format, where one or
more columns become new columns in the new DataFrame. This can be done using the
`pivot()` and `pivot_table()` functions.
c) Pivot:
The `pivot()` function is used to convert a DataFrame from long format to wide format based
on unique values in a column. It requires specifying both an index and a column.
Consider the following melted DataFrame from earlier:
```python
import pandas as pd

melted_df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Charlie'],
    'Subject': ['Math Score', 'Math Score', 'Math Score',
                'English Score', 'English Score', 'English Score'],
    'Score': [85, 90, 75, 95, 80, 70]
})

pivoted_df = melted_df.pivot(index='Name', columns='Subject', values='Score')
```
Output:
```
Subject  English Score  Math Score
Name
Alice               95          85
Bob                 80          90
Charlie             70          75
```
In the pivoted DataFrame, the unique values in the ‘Subject’ column (‘Math Score’, ‘English
Score’) become the column names, and the corresponding scores are filled into the respective
cells.
d) Pivot_table:
The `pivot_table()` function is similar to `pivot()` but allows us to aggregate values that have
the same indices. This is useful when we have multiple values for the same index-column
combinations.
Consider the following DataFrame:
```python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Alice', 'Bob'],
        'Subject': ['Math', 'Math', 'English', 'English'],
        'Score': [85, 90, 95, 80]}
df = pd.DataFrame(data)

pivoted_table = pd.pivot_table(df, index='Name', columns='Subject',
                               values='Score', aggfunc='mean')
```
Output:
```
Subject  English  Math
Name
Alice         95    85
Bob           80    90
```
The `pivot_table()` function aggregates the scores based on the provided aggregation
function, which is the mean (`aggfunc='mean'`). We obtain the average scores for each
subject and each person in the pivoted DataFrame.
These are some examples of reshaping and pivoting on Pandas DataFrames using Python.
The ability to reshape and pivot data provides flexibility in manipulating and analyzing
datasets efficiently.
GROUPING DATASET and DATA AGGREGATION
Grouping and aggregating help us carry out data analysis easily using various functions.
These methods help us group and summarize our data and make complex analyses
comparatively easy.
Creating a sample dataset of marks of various subjects (the values below are assumed for illustration).
# import module
import pandas as pd

# create a sample marks dataset (values assumed for illustration)
df = pd.DataFrame({'Maths': [90, 85, 90, 70, 95],
                   'Science': [88, 85, 92, 70, 74],
                   'English': [93, 80, 75, 88, 93]})
Output:
Aggregation in Pandas
Aggregation in pandas provides various functions that perform a mathematical or logical
operation on our dataset and return a summary of the result. Aggregation can be used to
get a summary of the columns in our dataset, such as the sum, minimum, or maximum of a
particular column. The function used for aggregation is agg(); its parameter is the function
(or list of functions) we want to apply.
Some functions used in the aggregation are:
Function Description:
• sum() :Compute sum of column values
• min() :Compute min of column values
• max() :Compute max of column values
• mean() :Compute mean of column
• size() :Compute column sizes
• describe() :Generates descriptive statistics
• first() :Compute first of group values
• last() :Compute last of group values
• count() :Compute count of column values
• std() :Standard deviation of column
• var() :Compute variance of column
• sem() :Standard error of the mean of column
Examples:
• The sum() function is used to calculate the column-wise sum of the values.
• Python
• df.sum()
Output:
• We use the agg() function to calculate the sum, min, and max of each column in our dataset.
• Python
• df.agg(['sum', 'min', 'max'])
Output:
Grouping in Pandas
Grouping is used to group data using some criteria from our dataset. It follows the
split-apply-combine strategy:
• Splitting the data into groups based on some criteria.
• Applying a function to each group independently.
• Combining the results into a data structure.
Examples:
We use the groupby() function to group the data on the "Maths" column. It returns a DataFrameGroupBy object as the result.
Python
df.groupby(by=['Maths'])
Output:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000012581821388>
Applying the groupby() function to group the data on the "Maths" column. To view the result of the
formed groups, use the first() function.
• Python
• a = df.groupby('Maths')
• a.first()
Output:
First we group based on "Maths"; then, within each group, we group based on "Science", as sketched below.
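A minimal sketch, assuming the sample marks DataFrame above with its 'Maths' and 'Science' columns:
• Python
• df.groupby(['Maths', 'Science']).first()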
Implementation on a Dataset
Here we are using a dataset of diamond information.
# import modules
import numpy as np
import pandas as pd

# load the diamonds data (the file name is assumed here for illustration)
dataset = pd.read_csv('diamonds.csv')
• Here we are grouping using cut and color and getting the minimum value of all the other columns.
• dataset.groupby(['cut', 'color']).agg('min')
Output:
Here we are grouping using color and getting aggregate values like sum, mean,
min, etc. for the price column.
# dictionary having the column name 'price' as key and the list of
# aggregation functions we want to perform on it as value
agg_functions = {
    'price': ['sum', 'mean', 'median', 'min', 'max', 'prod']
}
dataset.groupby(['color']).agg(agg_functions)
Output:
Pivot Tables: A pivot table is a table of statistics that summarizes the data of a more extensive table (such as
from a database, spreadsheet, or business intelligence program). This summary might include sums, averages,
or other statistics, which the pivot table groups together in a meaningful way.
Steps Needed
• Import Library (Pandas)
• Import / Load / Create data.
• Use the pandas.pivot_table() method with different variants.
# import packages
import pandas as pd

# create data
df = pd.DataFrame({'ID': {0: 23, 1: 43, 2: 12,
                          3: 13, 4: 67, 5: 89,
                          6: 90, 7: 56, 8: 34},
                   'Name': {0: 'Ram', 1: 'Deep', 2: 'Yash',
                            3: 'Aman', 4: 'Arjun', 5: 'Aditya',
                            6: 'Akash', 7: 'Chalsea', 8: 'Divya'},
                   'Marks': {0: 89, 1: 97, 2: 45,
                             3: 78, 4: 56, 5: 76,
                             # the last three marks are placeholders assumed for illustration
                             6: 72, 7: 88, 8: 91}})

# simplest variant: pivot_table() with the default mean aggregation
table = pd.pivot_table(df, index=['Name'], values=['Marks'])
print(table)
Output:
CROSS TABULATIONS:
Cross tabulation (or crosstab) is an important tool for analyzing two
categorical variables in a dataset. It provides a tabular summary of the frequency
distribution of two variables, allowing us to see the relationship between them and
identify any patterns or trends.
Also known as contingency tables or cross tabs, cross tabulation groups
variables to understand how they correlate with one another. It also shows how the
correlations change from one variable grouping to another. It is usually used in
statistical analysis to find patterns, trends, and probabilities within raw data.
The pandas crosstab function builds a cross-tabulation table that can show the
frequency with which certain groups of data appear.
This method is used to compute a simple cross-tabulation of two (or more) factors.
By default, it computes a frequency table of the factors unless an array of values and
an aggregation function are passed. An example is shown after the argument list below.
Syntax: pandas.crosstab(index, columns, values=None, rownames=None,
colnames=None, aggfunc=None, margins=False, margins_name='All',
dropna=True, normalize=False)
Arguments :
• index : array-like, Series, or list of arrays/Series, Values to group by in the rows.
• columns : array-like, Series, or list of arrays/Series, Values to
group by in the columns.
• values : array-like, optional, array of values to aggregate
according to the factors. Requires `aggfunc` be specified.
• rownames : sequence, default None, If passed, must match
number of row arrays passed.
• colnames : sequence, default None, If passed, must match
number of column arrays passed.
• aggfunc : function, optional, If specified, requires `values` be specified as well.
• margins : bool, default False, Add row/column margins (subtotals).
• margins_name : str, default 'All', Name of the row/column that
will contain the totals when margins is True.
• dropna : bool, default True, Do not include columns whose entries are all NaN.
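A small example of crosstab in use (the data below is made up purely for illustration):
```python
import pandas as pd

# illustrative data: one row per student
df = pd.DataFrame({
    "Gender": ["M", "F", "F", "M", "F", "M"],
    "Result": ["Pass", "Pass", "Fail", "Pass", "Pass", "Fail"],
})

# frequency table of Gender vs. Result, with row/column totals (margins)
table = pd.crosstab(df["Gender"], df["Result"], margins=True)
print(table)
```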
Unit 1 Completed