
Foundations of Data Science

Applied Statistics and Probability with Python

Lecture Note

Dr. Md Rezaul Karim
Professor
Department of Statistics and Data Science
Jahangirnagar University, Savar, Bangladesh

‘Are those who know equal to those who do not know?’ Only they will
remember [who are] people of understanding (Surah Al-Zumar (39:9),
Al-Quran).

Copyright © 2025 Dr. Md Rezaul Karim


Preface

In today’s data-driven world, the ability to analyze and interpret data has
become a crucial skill across various disciplines. As we navigate through an
abundance of information, the principles of statistics and probability serve as
the bedrock upon which data science stands. This book, Foundations of Data
Science: Applied Statistics and Probability with Python, aims to bridge the gap
between theory and practice, providing readers with the tools they need to
harness the power of data effectively.

The journey of data science can be both exhilarating and overwhelming.


This book is designed for aspiring data scientists, students, and professionals
who wish to develop a robust understanding of statistical methods and their
application in Python. We will explore key concepts such as descriptive statis-
tics, inferential statistics, hypothesis testing, and probability theory, all while
emphasizing practical applications.

Each chapter is structured to introduce fundamental concepts, followed by


Python code examples and exercises that reinforce learning. My goal is to
make complex ideas accessible and engaging, allowing readers to build their
confidence as they apply these techniques to real-world datasets.

I would like to extend my gratitude to the many educators, practitioners,


and learners who inspired this work. Your passion for data science fuels my
own. I hope this book serves as a valuable resource on your journey, encourag-
ing curiosity and fostering a deeper understanding of the fascinating world of
data.

Whether you are a beginner or someone looking to refine your skills, I invite
you to dive in and explore the foundations that will empower you in your data
science endeavors.

Happy learning!

Prof. Dr. Md Rezaul Karim


May 21, 2025

Table of Contents

1 Introduction to Data Science 1
1.1 Welcome to Data Science . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Key Components of Data Science . . . . . . . . . . . . . . . . . . 2
1.3 Concepts of Statistics . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3.1 Population and Sample . . . . . . . . . . . . . . . . . . . 6
1.3.2 Census and Sample Survey . . . . . . . . . . . . . . . . 7
1.3.3 Parameters and Statistic . . . . . . . . . . . . . . . . . 8
1.3.4 Types of Statistics . . . . . . . . . . . . . . . . . . . . . 9
1.4 What is Data? . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4.1 Levels of Measurement . . . . . . . . . . . . . . . . . . . . 12
1.4.2 Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.5 Scope of Applied Statistics . . . . . . . . . . . . . . . . . . . . . 17
1.6 Statistical Methods in Data Science . . . . . . . . . . . . . . . . 17
1.7 Overview of Data Science Workflow . . . . . . . . . . . . . . . . 19
1.8 Popular Statistical Analysis Tools . . . . . . . . . . . . . . . . . . 20
1.9 Why Choose Python for This Book? . . . . . . . . . . . . . . . . 20
1.10 Getting Started with Python . . . . . . . . . . . . . . . . . . . . 22

1.11 Structure of the Book . . . . . . . . . . . . . . . . . . . . . . . . 22


1.12 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . 23
1.13 Chapter Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . 23

2 Data Exploration: Tabular and Graphical Displays 25


2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2 Tabular Displays . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.3 Graphical Displays . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4 Summarizing Qualitative Data . . . . . . . . . . . . . . . . . . . 27
2.4.1 Bar Chart . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.4.2 Python Code: Bar Chart . . . . . . . . . . . . . . . . . . 28
2.4.3 Pie Chart . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.4.4 Python Code: Pie Chart . . . . . . . . . . . . . . . . . . . 30
2.5 Summarizing Quantitative Data . . . . . . . . . . . . . . . . . . . 32
2.5.1 Constructing a Frequency Distribution Table . . . . . . . 33
2.5.2 Python Code: Frequency Distribution . . . . . . . . . . . 35


2.5.3 Frequency Polygon . . . . . . . . . . . . . . . . . . . . . . 36


2.5.4 Python Code: Frequency Polygon . . . . . . . . . . . . . 37
2.5.5 Ogive Curve . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.5.6 Python Code: Ogive . . . . . . . . . . . . . . . . . . . . . 39
2.5.7 Histogram . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.5.8 Python Code: Histogram . . . . . . . . . . . . . . . . . . 42
2.5.9 Stem-and-Leaf Plot . . . . . . . . . . . . . . . . . . . . . 42
2.5.10 Python Code: Stem-and-leaf . . . . . . . . . . . . . . . . 44
2.6 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.7 Chapter Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . 46

3 Data Exploration: Numerical Measures 49

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.2 Measures of Central Tendency . . . . . . . . . . . . . . . . . . . . 49
3.2.1 Arithmetic Mean . . . . . . . . . . . . . . . . . . . . . . . 50
3.2.2 Advantages and Disadvantages of Arithmetic Mean . . . . 50
3.2.3 Harmonic Mean . . . . . . . . . . . . . . . . . . . . . . . 52
Advantages and Disadvantages of Harmonic Mean . . . . 52
3.2.4 Geometric Mean . . . . . . . . . . . . . . . . . . . . . . 53
Advantages and Disadvantages of Geometric Mean . . . . 54
3.2.5 Relationships Between Arithmetic Mean, Geometric Mean,
and Harmonic Mean . . . . . . . . . . . . . . . . . . . . 54


3.2.6 Median . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.2.7 Advantages and Disadvantages of Median . . . . . . . . . 59
3.2.8 Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.2.9 Advantages and Disadvantages of Mode . . . . . . . . . . 60
3.2.10 Choosing the Ideal Measure of Central Tendency . . . . . 61
3.2.11 Weighted Mean . . . . . . . . . . . . . . . . . . . . . . . . 61
3.2.12 Measures of Central Tendency for Grouped Data . . . . . 64

3.2.13 Python Code: Mean, Median and Mode . . . . . . . . . . 66


3.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.4 Measures of Dispersion or Variability . . . . . . . . . . . . . . . . 69
3.4.1 Range . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.4.2 Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.4.3 Standard Deviation . . . . . . . . . . . . . . . . . . . . . 71
3.4.4 Measures of Variability for Grouped Data . . . . . . . . . 72
3.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.6 Measures of Distribution Shape . . . . . . . . . . . . . . . . . . . 75
3.6.1 Skewness . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
3.6.2 Kurtosis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.6.3 Coefficient of Variation . . . . . . . . . . . . . . . . . . . 80
3.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
3.8 Quartiles, Percentiles, Deciles and Outlier Detection . . . . . . . 82
3.8.1 Quartiles . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
3.8.2 Percentiles . . . . . . . . . . . . . . . . . . . . . . . . . . 84


3.8.3 Deciles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
3.8.4 Interquartile Range (IQR) . . . . . . . . . . . . . . . . . . 86
3.8.5 Outlier Detection . . . . . . . . . . . . . . . . . . . . . . . 86
3.8.6 Python Code: Dispersion Measures . . . . . . . . . . . . . 88
3.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
3.10 Five-Number Summary and Boxplot . . . . . . . . . . . . . . . . 91
3.10.1 Five-Number Summary . . . . . . . . . . . . . . . . . . . 91
3.10.2 Boxplot . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
3.10.3 Importance of Boxplots . . . . . . . . . . . . . . . . . . . 93
3.10.4 Python Code: Boxplot . . . . . . . . . . . . . . . . . . . . 97
3.11 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

3.12 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . 99
3.13 Chapter Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . 100

4 Introduction to Probability 105


4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
4.2 Basic Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
4.2.1 Experiment . . . . . . . . . . . . . . . . . . . . . . . . . 106
4.2.2 Random Experiment . . . . . . . . . . . . . . . . . . . . 106
4.2.3 Sample Space and Events . . . . . . . . . . . . . . . . . 107
4.3 Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
4.3.1 Union of Events . . . . . . . . . . . . . . . . . . . . . . . 111
4.3.2 Intersection of Events . . . . . . . . . . . . . . . . . . . . 111
4.3.3 Complementary Event . . . . . . . . . . . . . . . . . . . . 112
4.3.4 Equally Likely Events . . . . . . . . . . . . . . . . . . . . 113
4.3.5 Mutually Exclusive Events . . . . . . . . . . . . . . . . . 114
4.3.6 Probability Axioms . . . . . . . . . . . . . . . . . . . . . . 114
4.4 Types of Probability . . . . . . . . . . . . . . . . . . . . . . . . . 116
4.4.1 Classical (Theoretical) Probability . . . . . . . . . . . . . 116

4.4.2 Experimental (Empirical) Probability . . . . . . . . . . . 117


4.4.3 Subjective Approach . . . . . . . . . . . . . . . . . . . . . 118
4.5 Joint and Marginal Probabilities . . . . . . . . . . . . . . . . . . 119
4.6 Conditional Probability . . . . . . . . . . . . . . . . . . . . . . . 120
4.6.1 Probabilities Computation from Contingency Table . . . . 122
4.6.2 Independent Events . . . . . . . . . . . . . . . . . . . . . 125
4.7 Posterior Probabilities . . . . . . . . . . . . . . . . . . . . . . . . 129
4.7.1 Law of Total Probability . . . . . . . . . . . . . . . . . . . 129
4.7.2 Total Probability with Multiple Conditions . . . . . . . . 132
4.7.3 Bayes’ Theorem . . . . . . . . . . . . . . . . . . . . . . . 135
4.8 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . 139
4.9 Chapter Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . 139


5 Random Variable and Its Properties 145


5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
5.2 Random Variable . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
5.3 Discrete Random Variables . . . . . . . . . . . . . . . . . . . . . 147
5.3.1 Probability Mass Function (pmf) . . . . . . . . . . . . . . 148
5.3.2 Cumulative Distribution Function (cdf) . . . . . . . . . . 149
5.3.3 Properties of the Cumulative Distribution Function . . . 150
5.3.4 Exercise . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
5.4 Continuous Random Variables . . . . . . . . . . . . . . . . . . . . 157
5.4.1 Probability Density Function (pdf) . . . . . . . . . . . . . 158
5.4.2 Cumulative Distribution Function (cdf) . . . . . . . . . . 162

5.4.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
5.5 The Expectation of a Random Variable . . . . . . . . . . . . . . 168
5.5.1 Example: Testing Electronic Components . . . . . . . . . 169
5.5.2 Example: Metal Cylinder Production . . . . . . . . . . . 171
5.5.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
5.6 The Variance of a Random Variable . . . . . . . . . . . . . . . . 172
5.6.1 Example: Metal Cylinder Production . . . . . . . . . . . 175
5.6.2 Chebyshev’s Inequality . . . . . . . . . . . . . . . . . . . 176
5.6.3 Example: Blood Pressure Measurement . . . . . . . . . . 176
5.6.4 Example: Employee Salaries . . . . . . . . . . . . . . . . 177
5.6.5 Quantiles of Random Variables . . . . . . . . . . . . . . . 177
5.6.6 Example: Metal Cylinder Production . . . . . . . . . . . 178
5.6.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
5.7 Essential Generating Functions . . . . . . . . . . . . . . . . . . . 181
5.7.1 Moment Generating Function . . . . . . . . . . . . . . . . 182
5.7.2 Key Properties of MGF . . . . . . . . . . . . . . . . . . . 182
5.7.3 Probability Generating Function (PGF) . . . . . . . . . . 185
5.7.4 Characteristic Function (CF) . . . . . . . . . . . . . . . . 189

5.7.5 Key Properties of Characteristic Functions . . . . . . . . 189


5.7.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
5.8 Jointly Distributed Random Variables . . . . . . . . . . . . . . . 196
5.8.1 Joint Probability Mass Function (pmf) . . . . . . . . . . . 196
5.8.2 Example: Computer Maintenance . . . . . . . . . . . . . 197
5.8.3 Joint Probability Density Function (pdf) . . . . . . . . . 198
5.8.4 Example: Mineral Deposits . . . . . . . . . . . . . . . . . 198
5.8.5 Marginal Distributions . . . . . . . . . . . . . . . . . . . . 199
5.8.6 Example: Computer Maintenance . . . . . . . . . . . . . 200
5.8.7 Example: Mineral Deposits . . . . . . . . . . . . . . . . . 201
5.8.8 Conditional Distributions . . . . . . . . . . . . . . . . . . 203
5.8.9 Example: Computer Maintenance . . . . . . . . . . . . . 203
5.8.10 Example: Mineral Deposits . . . . . . . . . . . . . . . . . 204
5.8.11 Independence and Covariance . . . . . . . . . . . . . . . . 204
5.8.12 Covariance and Correlation . . . . . . . . . . . . . . . . . 207
5.8.13 Linear Functions of a Random Variable . . . . . . . . . . 212


5.8.14 Linear Combinations of Random Variables . . . . . . . . . 216


5.8.15 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
5.9 Python Functions for Statistical Distributions . . . . . . . . . . . 221
5.10 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . 222
5.11 Chapter Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . 223

6 Some Discrete Probability Distributions 225


6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
6.2 Bernoulli Distribution . . . . . . . . . . . . . . . . . . . . . . . . 225
6.2.1 Expected Value (Mean) . . . . . . . . . . . . . . . . . . . 226
6.2.2 Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
6.2.3 Moment Generating Function (MGF) . . . . . . . . . . . 230

6.2.4 Characteristic Function . . . . . . . . . . . . . . . . . . . 230
6.2.5 Probability Generating Function . . . . . . . . . . . . . . 230
6.2.6 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
6.2.7 Applications . . . . . . . . . . . . . . . . . . . . . . . . . 232
6.2.8 Python Code for Bernoulli Distribution . . . . . . . . . . 233
6.2.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . 234
6.3 Binomial Distribution . . . . . . . . . . . . . . . . . . . . . . . 236
6.3.1 Expected Value . . . . . . . . . . . . . . . . . . . . . . . 238
6.3.2 Variance and Standard Deviation . . . . . . . . . . . . . 239
6.3.3 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
6.3.4 Python Code for Binomial Distribution . . . . . . . . . . 243
6.3.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
6.4 Poisson Distribution . . . . . . . . . . . . . . . . . . . . . . . . . 246
6.4.1 Expected Value . . . . . . . . . . . . . . . . . . . . . . . . 251
6.4.2 Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
6.4.3 Moment Generating Function . . . . . . . . . . . . . . . . 253
6.4.4 Characteristic Function . . . . . . . . . . . . . . . . . . . 253

6.4.5 Approximation of Binomial Distribution Using Poisson


Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 253
6.4.6 Python Code for Poisson Distribution . . . . . . . . . . . 256
6.4.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
6.4.8 Discrete Uniform Distribution . . . . . . . . . . . . . . . . 260
6.4.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 264
6.5 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . 265
6.6 Chapter Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . 265

7 Some Continuous Probability Distributions 267


7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
7.2 Continuous Uniform Distribution . . . . . . . . . . . . . . . . . . 268
7.2.1 Distributional Properties . . . . . . . . . . . . . . . . . . 269
7.2.2 Python Code for Uniform Distribution Characteristics . . 272
7.2.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
7.3 Exponential Distribution . . . . . . . . . . . . . . . . . . . . . . . 274


7.3.1 Properties of the Exponential Distribution . . . . . . . . . 280


7.3.2 Memoryless Property . . . . . . . . . . . . . . . . . . . . 282
Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
7.3.3 Python Code for Exponential Distribution Characteristics 285
7.3.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 288
7.4 Normal Distribution . . . . . . . . . . . . . . . . . . . . . . . . . 290
7.4.1 Definition of the Normal Distribution . . . . . . . . . . . 290
7.4.2 Properties of the Normal Distribution . . . . . . . . . . . 293
7.4.3 Standard Normal Distribution . . . . . . . . . . . . . . . 296
7.4.4 Finding the Probability P (a ≤ X ≤ b) . . . . . . . . . . 298
7.4.5 Central Limit Theorem . . . . . . . . . . . . . . . . . . . 304

7.4.6 Python Code for Normal Distribution Characteristics . . 307
7.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
7.6 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . 311
7.7 Chapter Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . 312

8 Confidence Interval Estimation 315


8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
8.2 Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . . . 315
8.3 Confidence Intervals for the Population Mean . . . . . . . . . . 316
8.4 Confidence Intervals for Variances and Standard Deviations . . 329
8.4.1 Confidence Interval for Variance . . . . . . . . . . . . . . 330
8.4.2 Confidence Interval for Standard Deviation . . . . . . . . 330
8.5 Confidence Intervals for Population Proportions . . . . . . . . . . 334
8.6 Sample Size Estimation . . . . . . . . . . . . . . . . . . . . . . . 338
8.6.1 Sample Size for Estimating a Population Mean . . . . . . 338
8.6.2 Sample Size for Estimating a Population Proportion . . . 342
8.6.3 Sample Size Estimation for Finite Populations . . . . . . 343
8.7 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . 344

8.8 Chapter Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . 345

9 Hypothesis Testing for Decision Making 348


9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348
9.2 Concepts of Hypothesis Testing . . . . . . . . . . . . . . . . . . . 349
9.3 Steps for Hypothesis Testing . . . . . . . . . . . . . . . . . . . . 351
9.3.1 Formulating Hypotheses . . . . . . . . . . . . . . . . . . . 351
9.3.2 Level of Significance . . . . . . . . . . . . . . . . . . . . . 353
9.3.3 Test Statistics . . . . . . . . . . . . . . . . . . . . . . . . . 354
9.3.4 Acceptance and Rejection Regions . . . . . . . . . . . . . 354
9.3.5 Decision Rules . . . . . . . . . . . . . . . . . . . . . . . . 357
9.4 The p-value . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357
9.5 Why is hypothesis testing so important? . . . . . . . . . . . . . . 358
9.6 Hypothesis Testing for Means . . . . . . . . . . . . . . . . . . . . 359
9.6.1 One-Sample Test of Means . . . . . . . . . . . . . . . . . 359
9.6.2 Testing Equality of Two Means . . . . . . . . . . . . . . . 369


9.6.3 Independent Samples T -Test . . . . . . . . . . . . . . . . 369


1. Equal Variances (Pooled T -Test) . . . . . . . . . . . . 370
2. Unequal Variances (Welch’s T -Test) . . . . . . . . . . . 371
9.6.4 Paired T -Test . . . . . . . . . . . . . . . . . . . . . . . . . 373
9.7 Testing Equality of Several Means . . . . . . . . . . . . . . . . . 376
9.7.1 Analysis of Variance (ANOVA) . . . . . . . . . . . . . . . 376
9.8 Power of the Test . . . . . . . . . . . . . . . . . . . . . . . . . . . 383
9.9 Sample Size Estimation for the Mean Test . . . . . . . . . . . . . 386
9.9.1 When Testing for the Mean of a Normal Distribution
(One-Sided Alternative) . . . . . . . . . . . . . . . . . . . 386
9.9.2 Sample Size Estimation When Testing for the Mean of a

Normal Distribution (Two-Sided Alternative) . . . . . . . 387
9.10 Single Proportion Test . . . . . . . . . . . . . . . . . . . . . . . . 388
9.10.1 Sample Size Estimation for Proportion Test . . . . . . . . 390
9.11 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . 391
9.12 Chapter Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . 392
10 Correlation and Regression Analysis 396

10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 396
10.2 Scatter Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . 397
10.3 Python Code: Scatter diagram . . . . . . . . . . . . . . . . . . 399
10.4 Covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 400
10.5 Correlation Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 404
10.6 Pearson’s Correlation Coefficient . . . . . . . . . . . . . . . . . . 404
10.6.1 Interpretation of the value of Correlation Coefficient . . . 405
10.6.2 Properties of the Correlation Coefficient . . . . . . . . . . 407
10.6.3 Testing the Significance of the Correlation Coefficient . . 410
10.6.4 Python Code: Correlation Matrix . . . . . . . . . . . . . 411
10.7 Rank Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . 412

10.7.1 Key Types of Rank Correlation . . . . . . . . . . . . . . . 412


10.7.2 Applications of Rank Correlation . . . . . . . . . . . . . . 415
10.7.3 Python Code: Rank Correlation . . . . . . . . . . . . . . 415
10.7.4 Kendall Tau Correlation Coefficient . . . . . . . . . . . . 416
10.7.5 Advantages and Disadvantages . . . . . . . . . . . . . . . 418
10.7.6 Python Code: Kendall Tau . . . . . . . . . . . . . . . . . 419
10.7.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 419
10.8 Regression Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 423
10.8.1 Types of regression analysis . . . . . . . . . . . . . . . . . 425
10.8.2 Simple Regression Model . . . . . . . . . . . . . . . . . . 426
10.8.3 Assumptions of the CSLR Model (10.6) . . . . . . . . . . 428
10.8.4 Ordinary Least Squares (OLS) Estimation . . . . . . . . . 429
10.8.5 Interpretation of Regression Coefficients . . . . . . . . . . 431
10.8.6 The Estimated Error Variance or Standard Error . . . . . 433
10.8.7 Coefficient of Determination . . . . . . . . . . . . . . . . . 438
10.8.8 Relationship between R2 and rxy . . . . . . . . . . . . . 439


10.8.9 Advantages and Disadvantages of R2 . . . . . . . . . . . . 440


10.8.10 Adjusted R2 . . . . . . . . . . . . . . . . . . . . . . . . . 440
10.8.11 Python Code: Simple Regression Analysis . . . . . . . . . 441
10.8.12 Interval Estimation and Hypothesis Testing . . . . . . . . 443
Confidence Interval for β0 . . . . . . . . . . . . . . . . . . 443
Confidence Interval for β1 . . . . . . . . . . . . . . . . . . 443
10.8.13 The F -tests in Simple Linear Regression Model . . . . . . 444
Decision Rule for ANOVA F -test . . . . . . . . . . . . . . 445
10.8.14 The t-tests in Simple Linear Regression Model . . . . . . 445
Decision Rule for t-test . . . . . . . . . . . . . . . . . . . 446

Confidence Interval for E(Y |X = x) . . . . . . . . . . . . 447
10.8.15 Python Code: Linear Regression Model . . . . . . . . . . 449
10.8.16 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 450
10.9 Multiple Linear Regression . . . . . . . . . . . . . . . . . . . . . 454
10.9.1 Model Assumptions . . . . . . . . . . . . . . . . . . . . . 455
10.9.2 Estimation Procedure . . . . . . . . . . . . . . . . . . . . 455
10.9.3 Estimation Procedure of Error Variance . . . . . . . . . 456
10.9.4 Mean of the OLS Estimator . . . . . . . . . . . . . . . . 457
10.9.5 Variance of the OLS Estimator . . . . . . . . . . . . . . 457
10.9.6 Coefficient of Determination . . . . . . . . . . . . . . . . 458
10.9.7 Adjusted R2 . . . . . . . . . . . . . . . . . . . . . . . . . 458
10.9.8 Example Dataset and Regression Calculations . . . . . . . 459
Goodness-of-fit R2 and Adjusted R2 . . . . . . . . . . . . 462
10.9.9 F -test in Multiple Regression . . . . . . . . . . . . . . . . 463
10.9.10 ANOVA Table in Regression Analysis . . . . . . . . . . . 463
10.9.11 The t-tests in Multiple Regression . . . . . . . . . . . . . 464
10.9.12 Python Code: Linear Regression Model . . . . . . . . . . 471
10.9.13 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 473

10.10Regression Model Diagnostics . . . . . . . . . . . . . . . . . . . . 474


10.10.1 Assumptions of Linear Regression . . . . . . . . . . . . . 475
10.10.2 Residual Plots . . . . . . . . . . . . . . . . . . . . . . . . 475
10.10.3 Formal Tests . . . . . . . . . . . . . . . . . . . . . . . . . 480
10.10.4 Test for Autocorrelation . . . . . . . . . . . . . . . . . . . 483
10.10.5 Tests for Non-Constancy of Variance . . . . . . . . . . . . 484
10.10.6 Influential Observations, Outliers, and Cook’s Distance . 490
10.10.7 Multicollinearity . . . . . . . . . . . . . . . . . . . . . . . 492
10.10.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 497
10.11Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . 498
10.12Chapter Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . 498

Chapter 1

Introduction to Data Science

1.1 Welcome to Data Science
Welcome to the exciting world of data science, where numbers tell hidden stories
and reveal valuable insights! In today’s digital era, data is incredibly powerful,
and those who can understand and use it have the key to endless opportu-
nities. Data science is all about extracting meaningful information from vast
amounts of data to make informed decisions, solve complex problems, and drive
innovation.

Data Science: An interdisciplinary field that employs scientific methods,
processes, algorithms, and systems to extract knowledge and insights from
both structured and unstructured data. It combines various disciplines,
including statistics, probability, machine learning, data engineering, and
domain-specific expertise, to understand and analyze data.

In the era of big data and advanced analytics, data science has become a
crucial field in both academia and industry. At the heart of data science is the
ability to extract meaningful insights from data, a process that heavily relies on
statistical methods. To succeed in this field, a strong foundation in statistics is
essential. Statistics provides the tools and techniques needed to analyze data,
identify patterns, and draw reliable conclusions. Without this knowledge, it’s
challenging to make sense of the data and leverage its full potential.

In this chapter, we will introduce the foundational concepts of applied statistics
that are essential for data scientists. We will cover basic statistical termi-
nology, explore different types of data, and discuss various statistical methods
commonly used in data science. But data science is more than just numbers
and statistics. It involves programming, data engineering, machine learning,


and data visualization. These skills allow data scientists to collect, clean, and
process data, build predictive models, and present their findings in a clear and
compelling way.

As we embark on this journey, we will dive deep into the statistical founda-
tions that every aspiring data scientist needs to master. We will also explore
how these principles are applied in real-world scenarios, from predicting cus-
tomer behavior to identifying trends in healthcare. So, buckle up and get ready
to unlock the secrets of data science!

1.2 Key Components of Data Science

Data Science is made up of several interrelated components, each playing a
crucial role in transforming raw data into actionable insights. Below are some
of the most important components:

Statistics and Probability


Statistics and probability form the mathematical foundation of Data Science,
providing essential tools for analyzing data, making inferences, and building pre-
dictive models. Statistics helps summarize and describe data through measures
such as mean, median, variance, and standard deviation, while also enabling hy-
pothesis testing and confidence interval estimation. Probability theory is used
to model uncertainty and assess the likelihood of different outcomes, which is es-
pecially crucial in decision-making and risk analysis. Together, these disciplines
support key Data Science tasks such as data interpretation, model evaluation,
and the development of algorithms in machine learning. Tools like R, Python
(with libraries such as NumPy, SciPy, and Statsmodels), SPSS and SAS are
commonly used for statistical analysis.
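
To give a concrete flavor of these tools, here is a minimal sketch (with made-up
values, not data from this book) that computes some of the summary measures
mentioned above using NumPy:

# A minimal sketch: summary measures with NumPy (illustrative values)
import numpy as np

data = np.array([7, 8, 6, 9, 7, 5, 8, 7, 6, 9])

print("Mean:", np.mean(data))
print("Median:", np.median(data))
print("Sample variance:", np.var(data, ddof=1))        # ddof=1 for sample variance
print("Sample standard deviation:", np.std(data, ddof=1))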

Machine Learning
Machine learning, a subset of artificial intelligence, plays a central role in Data
Science by enabling computers to learn from data and make predictions or
decisions without explicit programming. It involves training models on histor-
ical data to detect patterns and generate accurate forecasts or classifications.
Machine learning is widely applied in areas such as recommendation systems,
fraud detection, image recognition, and natural language processing. Common
approaches include supervised learning, unsupervised learning, and reinforce-
ment learning. Popular tools and libraries used to build and deploy models
include Scikit-learn, TensorFlow, Keras, and PyTorch.
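
As a minimal sketch of the supervised-learning workflow just described (using
Scikit-learn's bundled iris dataset purely for illustration), a model is trained on
historical data and then evaluated on unseen data:

# Train a simple classifier and evaluate it on held-out data
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)                  # learn patterns from training data
print("Test accuracy:", model.score(X_test, y_test))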


Deep learning
Deep learning is an advanced subset of machine learning that uses artificial
neural networks with multiple layers to model and understand complex pat-
terns in large volumes of data. It excels at tasks where traditional algorithms
struggle, such as image and speech recognition, natural language processing,
and autonomous systems. Deep learning models, such as convolutional neural
networks (CNNs) and recurrent neural networks (RNNs), learn hierarchical fea-
tures directly from raw data without the need for manual feature extraction.
These models require significant computational power and large datasets to per-
form effectively. Popular frameworks for developing deep learning applications
include TensorFlow, Keras, and PyTorch.

Data Engineering
Data Engineering involves the design, development, and maintenance of systems
and architectures that enable the collection, storage, and processing of large
datasets. Data engineers build data pipelines that move data from various
sources to storage and analytics platforms, ensuring it is clean, reliable, and
accessible for analysis. They work with tools and technologies like SQL, Apache
Spark, Hadoop, Airflow, and cloud services (e.g., AWS, Azure, GCP) to handle
big data efficiently. Data Engineering lays the foundation for data analysis and
machine learning by making high-quality data available to data scientists and
analysts.

Data Visualization
Data visualization is a critical aspect of Data Science that involves represent-
ing data and analytical results through visual formats such as charts, graphs,
maps, and dashboards. It helps transform complex datasets into easily under-
standable insights, enabling quicker interpretation and more informed decision-
making. Effective data visualization allows analysts and stakeholders to iden-
tify patterns, trends, and outliers that might not be immediately evident in raw
data. It is widely used in business intelligence, reporting, and exploratory data
analysis. Common tools and libraries include Tableau, Power BI, Matplotlib,
Seaborn, and Plotly, each offering powerful capabilities for creating both static
and interactive visualizations.
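
As a small taste of what Chapter 2 covers in detail, the following sketch (with
hypothetical survey counts) draws a simple bar chart using Matplotlib:

# A minimal sketch: a bar chart with Matplotlib (hypothetical counts)
import matplotlib.pyplot as plt

categories = ["Pizza", "Pasta", "Salad"]
counts = [45, 30, 25]

plt.bar(categories, counts)
plt.title("Favorite Food (Hypothetical Survey)")
plt.xlabel("Food")
plt.ylabel("Number of respondents")
plt.show()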

Big Data Analytics


Big Data Analytics is a key component of Data Science that focuses on process-
ing and analyzing vast volumes of complex and fast-moving data. It enables
data scientists to uncover meaningful patterns, trends, and insights from large
datasets that traditional tools cannot handle efficiently. By leveraging technolo-
gies like Hadoop, Spark, and NoSQL databases, Big Data Analytics supports
real-time decision-making and predictive modeling. This component is essential


in industries where data is generated at high speed and scale, helping organi-
zations drive innovation, improve operations, and gain a competitive edge.

In this book, we focus on statistics and probability, emphasizing applied


statistical tools and applied probability with examples. The following chapters
will provide detailed explanations and practical applications.

1.3 Concepts of Statistics


Understanding the concepts of statistics is crucial for any data scientist. Statis-
tics provides the foundation for data analysis, enabling us to make sense of
data and draw meaningful conclusions. As a data scientist, you will frequently
encounter questions such as:

• How do we summarize large datasets?

• What can we infer from sample data about a larger population?


• How can we validate the accuracy of our models?

• What techniques can we use to identify patterns or anomalies in data?

By mastering statistical concepts, you will be better equipped to answer


these questions and make informed decisions based on data. Statistics plays a
crucial role in data science by providing tools and methods for:

• Summarizing and exploring data

• Designing experiments and surveys



• Making inferences and predictions

• Evaluating and validating models

In this section, we will cover some fundamental statistical concepts essential


for data science.

Statistics: The science of collecting data, organizing, summarizing, classifying,
comparing, and drawing inferences about a population.

Statistics encompasses both theoretical and applied aspects. The theoretical
side focuses on developing new statistical methods and theories. Applied
statistics involves the practical application of these methods to real-world prob-
lems and data, utilizing established statistical techniques to analyze data and
draw conclusions in various fields such as healthcare, business, engineering, and
social sciences.


Applied statistics forms the foundation of data science by collecting, organizing,
and analyzing data to understand larger populations. By applying
these methods, experts can predict trends and outcomes using models based
on real-world observations. This process enhances data quality through ex-
periments and sampling techniques and supports evidence-based conclusions.
Applied statistics is vital for fields like natural sciences, social sciences, public
policy, healthcare, engineering, and business, enabling informed decisions and
systematic approaches to complex challenges.

Example: Customer Satisfaction Survey


Imagine a company wants to understand the average customer satisfaction level
with its new product. Surveying every customer is impractical, so the company
surveys a sample of 500 customers out of its 50,000 customers.

Collecting Data
The company distributes a satisfaction survey to 500 randomly selected
customers, asking them to rate their satisfaction on a scale from 1 to 10.

Analyzing Data
Once the responses are collected, the company calculates the average satisfac-
tion score from the sample. Suppose the average satisfaction score from the 500
customers is 7.2.

Interpreting Data
Using statistical methods, the company estimates the average satisfaction score
for the entire population of 50,000 customers based on the sample. This involves
calculating confidence intervals to understand the range within which the true
average satisfaction score likely falls.

Presenting Data
The company creates a report with graphs and charts to visualize the distri-
bution of satisfaction scores, the average satisfaction score, and the confidence
interval.

Organizing Data
The company stores the survey data in a database, ensuring it is organized for
future analysis or reference.


Making Inferences
By analyzing the sample data, the company infers that the average satisfaction
score of all its customers is approximately 7.2, with a certain level of confidence
(e.g., 95% confidence interval).

Quantifying Uncertainty
To quantify the uncertainty of their estimate, the company calculates a confi-
dence interval. For instance, they might determine that they are 95% confident
that the true average satisfaction score of all customers is between 6.9 and 7.5.
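
The calculation above can be sketched in Python as follows. Since the actual
survey responses are not given here, the sketch simulates 500 ratings and then
computes the sample mean and a 95% t-based confidence interval with SciPy:

# A minimal sketch: sample mean and 95% confidence interval
# (simulated ratings stand in for the real survey data)
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
ratings = rng.integers(1, 11, size=500)    # 500 ratings on a 1-10 scale

mean = ratings.mean()
sem = stats.sem(ratings)                   # standard error of the mean
low, high = stats.t.interval(0.95, len(ratings) - 1, loc=mean, scale=sem)

print(f"Sample mean: {mean:.2f}")
print(f"95% confidence interval: ({low:.2f}, {high:.2f})")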

Through collecting, analyzing, interpreting, presenting, and organizing data
from a sample, the company uses statistics to make informed decisions about
customer satisfaction for the entire customer base. This approach helps the
company understand and quantify the uncertainty associated with their esti-
mate, enabling them to make more accurate and reliable business decisions.
1.3.1 Population and Sample
In statistics, it is often impractical or impossible to study an entire population.
Instead, we rely on samples. Understanding the difference between a population
and a sample is fundamental to conducting statistical analyses.

Population
The population refers to the entire group of individuals or instances about whom
we want to draw conclusions. It includes all possible observations or outcomes
that are of interest in a particular study or analysis.

Population: The entire set of individuals, items, or data points of interest
in a study or analysis.

Examples:
• If we are studying the prevalence of diabetes in adults aged 40-60 in
a country, the population would include all adults aged 40-60 in that
country. This would encompass every individual in that age range,
regardless of their health status, socioeconomic background, or other
characteristics.

• If we are studying the average height of adult men in a country, the


population would include all adult men in that country.


Sample
A sample is a subset of the population that is selected for the actual study. The
goal is to choose a sample that is representative of the population so that the
findings can be used to make inferences or generalizations about the popula-
tion.

Sample: A subset of the population selected for analysis to make inferences
about the entire population.

Examples:

• Continuing with the diabetes study, it might be impractical to test
every adult aged 40-60 in the country. Instead, researchers might select
a sample of 1,000 adults from this age group and measure their blood
sugar levels. The results from this sample can then be used to estimate
the prevalence of diabetes in the entire population of adults aged 40-60.

• If we cannot measure the height of all adult men in a country, we
might select a sample of 1000 adult men and measure their heights.
The results from this sample can then be used to estimate the average
height of the entire population.

By studying a sample, we can make inferences about the population. However,
it is crucial to use proper sampling techniques to ensure that the sample
accurately represents the population.
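
To make the population/sample distinction concrete, the following sketch (with
simulated heights, not real measurements) draws a random sample from a
simulated population and compares the two means:

# A minimal sketch: population mean vs. sample mean (simulated data)
import numpy as np

rng = np.random.default_rng(42)
population = rng.normal(loc=170, scale=7, size=50_000)   # heights in cm

sample = rng.choice(population, size=1_000, replace=False)

print(f"Population mean: {population.mean():.2f} cm")
print(f"Sample mean:     {sample.mean():.2f} cm")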

1.3.2 Census and Sample Survey


Census

The census is a comprehensive and periodic data collection process aimed


at gathering detailed demographic information about the entire population
of a country. In the context of Bangladesh, the census is conducted by the
Bangladesh Bureau of Statistics (BBS) and serves as a crucial tool for policy
planning, resource allocation, and development programs. The first population
census in Bangladesh was conducted in 1974, following the country’s indepen-
dence. Since then, the census has been carried out every ten years, with the
most recent one being the 2022 Population and Housing Census. According
to the 2022 census, Bangladesh has a population of 165,158,616 people, of whom
81,712,824 are male and 83,347,206 are female. Of these, 113,063,587 live in
rural areas and 52,009,072 live in urban areas.

Importance of Census
The census data is essential for:
• Informing government policies and development strategies.


• Allocating resources and planning public services at both national and


local levels.

• Providing a basis for the estimation of various demographic trends and


indicators.

Sample Surveys
Sample surveys are a method of collecting data from a subset of the population
to infer information about the entire population. In Bangladesh, sample surveys
are conducted for various purposes, including economic research, social policy
evaluation, and program assessment.

Types of Sample Surveys
In Bangladesh, different types of sample surveys are carried out by the Bangladesh
Bureau of Statistics (BBS) and other organizations. Some key surveys include:

• Labor Force Survey (LFS): Assesses employment, unemployment,
and labor market conditions.

• Multiple Indicator Cluster Survey (MICS): Provides data on
child health, education, and protection.

• Household Income and Expenditure Survey (HIES): Measures


household income, consumption patterns, and poverty levels.

• Demographic and Health Survey (DHS): Collects data on population
health, fertility, and mortality.

1.3.3 Parameters and Statistic



Parameter
A parameter is a measurable characteristic of the population which is a nu-
merical value that summarizes a characteristic of a population. It is a fixed
value, although its exact value is often unknown. Parameters describe the en-
tire population.

Parameter: A numerical characteristic or measure that describes a specific
aspect of a population, such as its mean, variance, or proportion.

Examples:
• Population Mean (µ): The average height of all adult men in a
country.

• Population Proportion (P ): The proportion of voters in a country


who support a particular candidate.


Statistic
A statistic is a measurable characteristic of the sample which is a numerical
value that summarizes a characteristic of a sample. It is used to estimate
the corresponding population parameter. Statistics can vary from sample to
sample.

Statistic: A measurable characteristic or quantity calculated from a sample,
which is used to estimate or infer the corresponding population parameter.

Examples:

• Sample Mean (x̄): The average height of 100 randomly selected adult
men from the population.

• Sample Proportion (p): The proportion of voters in a sample of


1,000 who support a particular candidate.
Parameters describe the entire population and are fixed values, whereas
statistics describe samples and can vary from sample to sample. Understanding
the distinction between these concepts is crucial in the field of statistics, as we
often use sample statistics to make inferences about population parameters.

1.3.4 Types of Statistics


When using statistics to derive information from data for decision-making, we
employ either descriptive statistics or inferential statistics. The choice between
these methods depends on the questions we aim to answer and the nature of
the data at hand.

(i). Descriptive Statistics


Descriptive statistics is all about summarizing data in a way that is easy
to understand. It helps us describe the data we have and make it more un-
derstandable. Instead of looking at a large amount of raw data, descriptive
statistics helps us look at it in a more organized and simple way. It focuses on
things like:

• Averages: To find a typical value, like the average score in a class.

• Spreads: To understand how different the data points are from each
other, such as how spread out the heights of people in a group are.

• Visual Tools: Charts, graphs, and tables that make it easier to see
patterns and trends in data.


Descriptive statistics: Methods of organizing, summarizing, and presenting
data in an informative way.
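
As a brief sketch (with made-up exam scores), pandas can produce many of
these descriptive summaries in a single call:

# A minimal sketch: descriptive statistics with pandas (illustrative scores)
import pandas as pd

scores = pd.Series([55, 62, 71, 68, 90, 84, 77, 63, 95, 70])

print(scores.describe())    # count, mean, std, min, quartiles, max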

(ii). Inferential Statistics


Inferential statistics allows us to make predictions or generalizations about
a larger group based on a small sample of data. In many situations, it’s not
practical or possible to collect data from everyone (for example, asking every
person in a country about their opinion). Instead, we collect data from a small
group (a sample) and use that to make guesses about the entire group (called
a population).

For example, suppose you wanted to know how much time students spend
studying for exams in your school. Instead of asking every student, you could
ask a small group of students (a sample) and then use their answers to make
an estimate about the entire student population.
Inferential statistics: Techniques for making predictions or inferences
about a population based on sample data.

Inferential statistics helps us make decisions or draw conclusions about


a larger group, even though we only have data from a smaller group. This is
helpful when we need to make choices or predictions without needing to survey
everyone.

1.4 What is Data?


Data refers to raw facts, observations, or units of information, including
numbers, words, measurements, observations, images, videos, audio, or descriptions,


that can be collected, stored, analyzed, and used to inform decisions. It can be
structured (like databases) or unstructured (like text or images). It is the ba-
sis for generating useful insights and making informed decisions in various fields.

Data: Raw facts and figures or units of information that can be collected
for analysis, which can be quantitative or qualitative.

Types of Data
Data can be categorized into several types, each with its own characteristics
and uses. These categories help us understand different ways to collect and
analyze information. The primary data types are as follows:

(i). Quantitative Data or Numeric Data: This type of data involves


numerical values that can be measured or counted, answering questions


like “how much?” or “how many?” Quantitative data provides us with


specific amounts or counts, and it can be broken down further into two
subcategories:
• Discrete Data: This refers to numerical data that can only take
specific, separate values, often representing counts. For example,
the number of students in a class or the number of products sold.
• Continuous Data: In contrast, continuous data can take any
value within a given range, including fractions. Examples include
measurements like height, weight, or temperature.
Some examples of quantitative data are as follows:

• How old you are: 15, 16, 17, etc.
• How many students are in a class: 25, 30, 35, etc.
• The temperature in a city: 20°C, 25°C, etc.
(ii). Qualitative Data or Non-numeric Data: Unlike quantitative data,
qualitative data or non-numeric data does not deal with numbers. In-
stead, it describes categories, qualities, or characteristics. This type of
data is often used to answer questions like “what kind?” or “which one?”
Sometimes, it is called categorical data. Qualitative data includes var-
ious forms of descriptive information, and examples include:
• Your favorite color: Red, Blue, Green, etc.
• The type of pets people have: Dog, Cat, Fish, etc.
• The kind of food people like: Pizza, Pasta, Salad, etc.

Other Types of Data


• Binary Data: Contains only two possible values (e.g., true/false,
yes/no).

• Text Data: Unstructured information in textual form (e.g., articles,


tweets).

• Time Series Data: Recorded observations over time intervals (e.g.,


stock prices, weather data).

• Spatial Data: Information about the physical location and shape of


objects (e.g., GPS coordinates, maps).

• Image and Video Data: Visual information captured as images or


videos.

• Audio Data: Sound or speech information, including recordings or


live streams.


1.4.1 Levels of Measurement


The levels of measurement or scale of measurement refer to the nature of
the data and determine the types of statistical analyses that can be performed.
There are four main levels of measurement:
(i). Nominal

(ii). Ordinal

(iii). Interval

(iv). Ratio

Nominal Level
The nominal level of measurement is the most basic type of data categoriza-
tion. It classifies data into distinct categories that do not have a natural order
or ranking. These categories are mutually exclusive and collectively exhaustive,
meaning each observation fits into one and only one category, and all categories
together include all possible observations.

Characteristics of Nominal Data:


• Categorical: Data are grouped into categories based on some quali-
tative property.

• No Order: There is no inherent order or ranking among the categories.

• Labels Only: Categories are typically labeled with names, numbers,


or symbols for identification.

Examples:
• Gender (male, female)

• Marital status (single, married, divorced)

• Types of cuisine (Italian, Chinese, Mexican)

• Types of Engineering Degrees: Civil, Mechanical, Electrical, Chemical,


Computer.

• Different types of engines used in vehicles and machinery: Diesel, Elec-


tric, Gasoline, Hybrid.

• Various materials used in building structures: Wood, Steel, Concrete,


Brick.

• Software Versions: v1.0, v2.0, v3.0.


• Blood Types: A, B, AB, O.

• Disease Types: Flu, Cold, Allergies, Asthma.

• Medical Specialties: Cardiology, Neurology, Oncology, Pediatrics.

• Vaccination Status: Vaccinated, Not Vaccinated.

• Hospital Departments: Emergency, Radiology, Surgery, Pediatrics.

Ordinal Level
The ordinal level of measurement classifies data into categories that have a
meaningful order or ranking among them. Unlike nominal data, ordinal data
allow for comparisons of the relative position of items, but the intervals between
the categories are not necessarily equal or known. Ordinal data are widely used
in surveys, questionnaires, and educational assessments to gauge attitudes, per-
ceptions, and performance levels.
Characteristics of Ordinal Data:
• Categorical with Order: Data are grouped into categories that have
a logical sequence or ranking.

• Relative Position: Categories indicate relative positions but not the


magnitude of difference between them.

• Rankings or Ratings: Often represented by rankings or ratings.

Examples:

• Education level (high school, bachelor’s, master’s, PhD)

• Customer satisfaction (very unsatisfied, unsatisfied, neutral, satisfied,


very satisfied)

• Engineering Design Maturity: Concept, Prototype, Final Design.

• Performance Ratings: Poor, Fair, Good, Excellent.

• Pain Levels: No Pain, Mild Pain, Moderate Pain, Severe Pain.

• Stage of Disease: Stage I, Stage II, Stage III, Stage IV.

• Quality of Life Scores: Poor, Fair, Good, Excellent.

• Severity of Symptoms: Mild, Moderate, Severe.

• Patient Satisfaction Levels: Very Unsatisfied, Unsatisfied, Neutral,
Satisfied, Very Satisfied.


Interval Level
The interval level of measurement involves data that are not only ordered but
also have equal intervals between values. Unlike ordinal data, the differences
between values are meaningful. However, interval data do not have a true zero
point, which means that ratios are not meaningful. Despite lacking a true zero
point, interval data provide a higher level of detail compared to nominal and
ordinal data, allowing for more sophisticated analytical techniques.

Characteristics of Interval Data:


• Equal Intervals: The difference between values is consistent and
meaningful. For example, consider temperature measured in degrees
Celsius. The difference between 10°C and 20°C is 10 degrees, and the
difference between 20°C and 30°C is also 10 degrees. These equal in-
tervals allow us to perform meaningful addition and subtraction.

• No True Zero Point: Zero does not indicate the absence of the
quantity being measured, so ratios are not meaningful.

• Ordered: Data have a meaningful order.

Examples:
• Temperature Measurements in Celsius or Fahrenheit: 10°C, 20°C, 30°C.

• IQ Scores: Differences between scores are meaningful, but there is no


true zero point.

• Calendar Years: The difference between years is consistent, but there


is no true zero year.

• Body Temperature in Celsius: 36.5°C, 37.0°C, 37.5°C.

• SAT Scores: There is no true zero score that indicates an absence of


knowledge or ability; the score is relative to the testing scale.

• GPA (Grade Point Average): GPA values represent equal intervals


(e.g., the difference between 3.0 and 4.0 is the same as between 2.0 and
3.0). However, there is no true zero point on the GPA scale; it is just
a measure of academic performance.

Ratio Level
The ratio level of measurement is the highest level of measurement and includes
all the properties of the interval level, with the addition of a true zero point.
This allows for meaningful comparisons and calculation of ratios. The pres-
ence of a true zero point allows for a full range of mathematical and statistical


operations, making ratio data the most informative and versatile level of mea-
surement.

Characteristics of Ratio Data:


• Equal Intervals: The difference between values is consistent and
meaningful. For example, the difference between 10 kilograms and
20 kilograms is the same as the difference between 30 kilograms and
40 kilograms. In both cases, the difference is 10 kilograms. This con-
sistency allows for accurate and meaningful addition and subtraction
of values.

• True Zero Point: Zero indicates the absence of the quantity being
measured, making ratios meaningful.

• Ordered: Data have a meaningful order.

• Meaningful Ratio: This characteristic enables comparisons using
ratios (e.g., twice as much, half as much).

Examples:
• Length of Engineering Components: 5 m, 10 m, 15 m.

• Weight of Materials: 2 kg, 5 kg, 10 kg.

• Cost of Engineering Projects: $1000, $2000, $3000.

• Mechanical Stress Measurements: 10 MPa, 20 MPa, 30 MPa.

• Production Quantity: 100 units, 200 units, 300 units.



• Height of Patients: 150 cm, 160 cm, 170 cm.

• Weight of Patients: 50 kg, 60 kg, 70 kg.

• Number of Hospital Visits: 1 visit, 2 visits, 3 visits.

• Dosage of Medication: 10 mg, 20 mg, 30 mg.

• Length of Hospital Stay: 1 day, 2 days, 3 days.
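
These levels can also be reflected in code. The sketch below (with illustrative
labels) uses pandas categoricals: an unordered categorical for nominal data and
an ordered one for ordinal data, where order comparisons become meaningful:

# A minimal sketch: nominal vs. ordinal data with pandas (illustrative labels)
import pandas as pd

# Nominal: categories with no inherent order
blood_type = pd.Categorical(["A", "O", "B", "AB", "O"],
                            categories=["A", "B", "AB", "O"])

# Ordinal: categories with a meaningful order
pain = pd.Categorical(["Mild", "Severe", "Moderate", "Mild"],
                      categories=["No Pain", "Mild", "Moderate", "Severe"],
                      ordered=True)

print(blood_type.categories)
print(pain < "Severe")      # order comparisons are valid only for ordinal data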

By grasping these basic statistical concepts, data scientists can better analyze
and interpret data, leading to more accurate and meaningful insights. As we
delve deeper into data science, these foundational principles will serve as the
building blocks for more advanced topics and applications.


1.4.2 Variables
In statistics and data science, a variable is any characteristic or property that
can take on different values. Variables are essential for research, as they rep-
resent the different factors or elements that can change or vary across different
individuals, conditions, or time periods. They can take on different values de-
pending on the nature of the data being collected.

Variable: A characteristic or quantity that can vary or take on different
values in a dataset or experiment.

Understanding the types of variables is crucial because it determines the kind
of statistical analysis that can be performed. Variables are broadly categorized
into qualitative and quantitative variables, each with further subtypes.

Qualitative Variables
Qualitative variables, also known as categorical variables, describe categories
or groups. These variables represent characteristics that cannot be measured
numerically but can be classified into distinct groups.

• Nominal Variables: These are variables with categories that have no natural order or ranking. Examples include gender (male, female), marital status (single, married, divorced), and hair color (blonde, brunette, redhead).

• Ordinal Variables: These are variables with categories that have a meaningful order or ranking, but the intervals between the categories are not necessarily equal. Examples include education level (high school, bachelor’s, master’s, doctorate) and customer satisfaction ratings (satisfied, neutral, dissatisfied).

Quantitative Variables
Quantitative variables, also known as numerical variables, represent measurable
quantities and can be expressed numerically. They can be further divided into
discrete and continuous variables.

• Discrete Variables: These variables represent countable values. They take on a countable number of distinct values and are often integers. Examples include the number of students in a class, the number of cars in a parking lot, and the number of books on a shelf.

• Continuous Variables: These variables represent measurable quantities that can take on any value within a given range. They are often associated with measurements. Examples include height, weight, temperature, and time.


Understanding the type of variable is essential for choosing the appropriate statistical methods and analyses. Each type of variable provides different insights and requires specific techniques for accurate interpretation and decision-making.
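
As a quick, hypothetical illustration, the snippet below builds a small pandas DataFrame containing each kind of variable and inspects the resulting data types; the column names and values are invented for this example.

import pandas as pd

# A hypothetical dataset mixing the four variable types discussed above
df = pd.DataFrame({
    "gender": ["male", "female", "female", "male"],      # qualitative, nominal
    "education": pd.Categorical(
        ["high school", "bachelor's", "master's", "doctorate"],
        categories=["high school", "bachelor's", "master's", "doctorate"],
        ordered=True),                                   # qualitative, ordinal
    "num_books": [3, 12, 7, 5],                          # quantitative, discrete
    "height_cm": [170.2, 158.4, 165.0, 181.3],           # quantitative, continuous
})

print(df.dtypes)  # object, category, int64, and float64 columns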

1.5 Scope of Applied Statistics


Applied statistics encompasses a wide range of activities and applications across
various fields. Here are some key areas of its scope:

• Data Collection and Management: Designing surveys, experiments, and data collection methods.

• Data Analysis: Employing statistical techniques to analyze data, including hypothesis testing, regression analysis, and ANOVA.

• Decision Making: Providing statistical insights for decision-making in business, healthcare, policy-making, and more.

• Predictive Modeling: Developing models to forecast future trends and outcomes based on historical data.

• Quality Control: Implementing statistical methods for quality assurance and improvement in manufacturing and services.

• Market Research: Analyzing consumer behavior, market trends, and product performance.

• Epidemiology: Studying the distribution and determinants of health-related events in populations.

• Financial Analysis: Applying statistics to risk assessment, investment strategies, and economic forecasting.

• Educational Assessment: Evaluating educational programs, student performance, and teaching effectiveness.

1.6 Statistical Methods in Data Science


Statistical methods play a central role in data science by providing the tools and
techniques needed to analyze, interpret, and draw conclusions from data. These
methods help data scientists understand underlying patterns, make predictions,
and inform decision-making. Below are some key statistical methods commonly
used in data science:


• Descriptive Statistics: Descriptive statistics involve summarizing and describing the main features of a dataset; a short Python sketch illustrating several of the methods in this section appears after this list. This can include measures such as:
■ Mean: The sum of all values divided by the number of values.
■ Median: The middle value when the data is sorted.
■ Mode: The most frequently occurring value.
■ Standard Deviation: A measure of how spread out the values are from the mean.
■ Variance: The square of the standard deviation, representing the spread of the data.

• Inferential Statistics: Inferential statistics allow us to make predictions or inferences about a population based on a sample of data. Key techniques include:
AF ■ Hypothesis Testing: Used to test assumptions or claims about
a population based on sample data, using tests like t-tests or
chi-squared tests.
■ Confidence Intervals: Estimating the range within which a pop-
ulation parameter is likely to fall, based on sample data.
■ ANOVA (Analysis of Variance): A method for comparing the
means of multiple groups to determine if there are any statisti-
cally significant differences between them.

• Regression Analysis: Regression techniques are used to model relationships between variables and make predictions. Common methods include:
■ Linear Regression: Modeling the relationship between a depen-
dent variable and one or more independent variables.
■ Logistic Regression: Used for binary classification problems where
the outcome is categorical (e.g., yes/no, success/failure).

• Probability Theory: Probability theory helps quantify uncertainty and is fundamental to many data science algorithms. Common tools include:
■ Bayesian Statistics: A method of statistical inference that up-
dates probabilities based on new evidence.
■ Random Variables: Variables whose values are outcomes of ran-
dom phenomena, often described by probability distributions
(e.g., normal distribution, binomial distribution).


• Machine Learning Algorithms: Many machine learning algorithms, especially supervised learning algorithms, are built upon statistical principles. These include:
■ Classification: Methods like decision trees, support vector ma-
chines (SVM), and neural networks that categorize data into
distinct classes.
■ Clustering: Techniques like k-means or hierarchical clustering
that group data points based on similarity.
■ Dimensionality Reduction: Methods such as Principal Component Analysis (PCA) that reduce the number of variables in a dataset while preserving as much information as possible.

• Time Series Analysis: Time series methods are used for analyzing data collected over time. Techniques include:
■ ARIMA (AutoRegressive Integrated Moving Average): A model used for forecasting time series data.
■ Seasonal Decomposition: Identifying and removing seasonal pat-
terns from time series data to better understand underlying
trends.

• Statistical Sampling: Sampling techniques are used to select subsets of data from a larger population for analysis. Methods include:
■ Random Sampling: Selecting data randomly to ensure unbiased representation of the population.
■ Stratified Sampling: Dividing the population into subgroups and sampling from each group to ensure representation from all segments.
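
As promised above, here is a minimal sketch of a few of these methods using NumPy and SciPy; the two small samples of patient recovery times are invented for illustration.

import numpy as np
from scipy import stats

# Hypothetical samples: recovery times (in days) for two groups of patients
group_a = np.array([4, 5, 6, 7, 8, 6, 5])
group_b = np.array([6, 7, 8, 9, 7, 8, 6])

# Descriptive statistics for group A
print("Mean:", np.mean(group_a))
print("Median:", np.median(group_a))
print("Standard deviation:", np.std(group_a, ddof=1))  # sample standard deviation
print("Variance:", np.var(group_a, ddof=1))            # sample variance

# Inferential statistics: a two-sample t-test for a difference in means
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")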

1.7 Overview of Data Science Workflow


The data science workflow typically involves several key steps:

1. Problem Definition: Understanding the problem and defining objectives.

2. Data Collection: Gathering relevant data from various sources.

3. Data Cleaning: Handling missing values, outliers, and ensuring data quality.

4. Data Exploration: Summarizing and visualizing data to understand its structure and patterns.


5. Modeling: Applying statistical and machine learning techniques to build predictive models.

6. Evaluation: Assessing the performance of the models using appropriate metrics.

7. Deployment: Implementing the model in a production environment.

8. Monitoring: Continuously monitoring and refining the model to ensure its accuracy and relevance.
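
The sketch below compresses steps 2 through 6 into a few lines with scikit-learn; it uses the library’s built-in iris dataset purely as a stand-in for real project data.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Data collection: load a ready-made dataset
X, y = load_iris(return_X_y=True)

# Data cleaning and exploration would normally happen here (iris is already clean)

# Modeling: fit a simple classifier on a training split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
model = LogisticRegression(max_iter=200).fit(X_train, y_train)

# Evaluation: assess performance on held-out data
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))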

1.8 Popular Statistical Analysis Tools

There are several tools and software commonly used for statistical analysis in
data science, including:

• Python: A versatile programming language with powerful libraries such as NumPy, pandas, SciPy, and scikit-learn for statistical analysis and machine learning.

• R: A programming language and software environment designed for statistical computing and graphics.

• SAS: A software suite used for advanced analytics, multivariate analysis, business intelligence, and data management.

• STATA: A powerful software package that provides comprehensive tools for data manipulation, statistical analysis, and graphical representation.

• SQL: A domain-specific language used for managing and manipulating relational databases.

• Excel: A spreadsheet software that provides basic statistical functions and data visualization capabilities.

1.9 Why Choose Python for This Book?


Python has become the language of choice for many data scientists and statisti-
cians. Here are several reasons why Python is particularly suited for statistical
analysis and data science:

1. Ease of Learning and Use


Python’s syntax is straightforward and readable, making it accessible for be-
ginners. Its simplicity allows users to focus on solving problems rather than
getting bogged down by the complexities of the language itself.


2. Comprehensive Libraries
Python boasts a rich ecosystem of libraries that are essential for statistical
analysis and data science. Some of the most popular libraries include:
• NumPy: Provides support for large multidimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays.

• Pandas: Offers data structures and functions needed to manipulate structured data seamlessly, making it easy to handle and analyze data.

• SciPy: Contains modules for optimization, linear algebra, integration, and other advanced mathematical functions.

• Matplotlib and Seaborn: Provide powerful tools for data visualization, allowing users to create a wide range of static, animated, and interactive plots.

• Statsmodels: Enables statistical modeling, hypothesis testing, and data exploration.

• Scikit-learn: A robust library for machine learning that includes simple and efficient tools for data mining and data analysis.
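
As a small, hypothetical taste of how these libraries work together, the snippet below simulates data with NumPy, summarizes it with pandas, and plots it with Matplotlib.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)                   # NumPy: random number generation
df = pd.DataFrame({"x": rng.normal(size=100)})   # pandas: tabular data structure

print(df["x"].describe())                        # quick summary statistics

df["x"].hist(bins=15)                            # histogram drawn via Matplotlib
plt.title("Histogram of simulated data")
plt.show()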

3. Community and Support


Python has a large, active community of users and developers. This means
extensive documentation, a wealth of tutorials, and a plethora of forums where
users can ask questions and share knowledge. This community support accel-
erates learning and problem-solving.

4. Integration Capabilities
Python integrates well with other languages and technologies. It can be used
alongside other data tools like SQL for database queries, or even integrated
with languages such as R or C++ for specialized tasks. This flexibility makes
Python a versatile tool in a data scientist’s toolkit.

5. Open Source and Free


Python is open source and freely available, which lowers the barrier to entry for
individuals and organizations. This democratization of technology allows more
people to engage with data science and statistical analysis.


6. Versatility
Python is not only used for data analysis but also for web development, au-
tomation, scripting, and even artificial intelligence and machine learning. This
versatility means that once you learn Python, you can apply your skills to a
wide range of problems and projects.

7. Real-World Applications
Many industry leaders and tech giants like Google, Facebook, and NASA use
Python for data analysis and machine learning. This real-world application un-
derscores Python’s reliability and effectiveness in handling complex data tasks.

8. Continuous Development
Python and its libraries are continuously being developed and improved by the
community, ensuring that users have access to the latest tools and techniques
in data science and statistics.
In summary, Python’s combination of simplicity, powerful libraries, commu-
nity support, integration capabilities, and versatility makes it an ideal choice
for data scientists and statisticians. Throughout this book, we will leverage
Python to demonstrate various statistical tools and methodologies, ensuring
that you can apply what you learn to real-world data challenges effectively.

1.10 Getting Started with Python


Setting Up Python

Instructions for installing Python can be found at https://www.python.org/downloads/. We recommend using Jupyter Notebook or an Integrated Development Environment (IDE) like PyCharm.

Basic Python Concepts


If you are new to Python, start with the basics: variables, data types, con-
trol structures, and functions. Jupyter Notebooks are particularly useful for
interactive data analysis.
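
For readers who want a first taste, here is a tiny, self-contained example touching each of those basics; the values are arbitrary.

# Variables and data types
scores = [80, 92, 75, 88]      # a list of integers
passing_mark = 85              # an integer

# A function definition
def mean(values):
    return sum(values) / len(values)   # arithmetic mean of a list

# Control structures: a loop and a conditional
for s in scores:
    if s >= passing_mark:
        print(s, "is a high score")

print("Mean score:", mean(scores))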

1.11 Structure of the Book


Overview of Chapters
This book is structured to gradually build your knowledge in data science,
starting from basic concepts to advanced applications. Each chapter includes
theoretical explanations, practical examples, and Python code snippets.

Learning Path
To get the most out of this book, follow the chapters sequentially. Practice the
examples and exercises provided to reinforce your understanding.

1.12 Concluding Remarks


Applied statistics is a fundamental component of data science that provides the
tools necessary for data analysis and interpretation. By understanding basic statistical concepts and methods, data scientists can perform effective analyses
and make data-driven decisions. In the following chapters, we will delve deeper
into specific statistical methods and explore how they can be applied to real-
world data science problems.

By the end of this book, you will have a solid foundation in applied statistics
and probability using Python. You will be equipped with the skills to tackle
real-world data science problems.

1.13 Chapter Exercises


1. Define data science. How does it differ from traditional data analysis?
2. List and describe the various stages of a data science project lifecycle.
3. Explain the role of a data scientist. What skills are essential for a data
scientist to be effective in their role?

4. Discuss the importance of interdisciplinary knowledge in data science.


Provide examples of how data science can be applied in different fields.
5. Define statistics. Provide examples of how statistics are applied in the
fields of healthcare and engineering.
6. Explain the difference between descriptive and inferential statistics. Give
an example of each.
7. Discuss the importance of data visualization in statistical analysis. Men-
tion two Python libraries that are useful for creating data visualizations.
8. Define population and sample. Why is sampling important in statistical
analysis?
9. Define parameter and statistic with an example.
10. Why is Python considered a suitable programming language for data sci-
ence and statistical analysis?


11. List and describe three Python libraries commonly used in data science.
What functionalities do they provide?
12. Explain the importance of community and support in choosing Python
for data science.

13. How does Python’s integration capability benefit a data scientist working
on complex projects?
14. Identify the type of data (quantitative or qualitative) for each of the
following:
(i). The colors of cars in a parking lot.

(ii). The heights of students in a class.
(iii). The brands of smartphones owned by a group of people.
(iv). The number of books read by a group of students in a year.
AF (v). The types of cuisine served at different restaurants.
15. For each of the following variables, identify the level of measurement
(nominal, ordinal, interval, or ratio):
(i). The ranking of movies from a film festival.
(ii). The temperatures in degrees Celsius recorded over a week.
(iii). The number of steps taken by an individual in a day.
(iv). The blood types of patients in a hospital.
(v). The ages of participants in a survey.

16. Categorize the following data sets as either nominal, ordinal, interval, or
ratio:
(i). Survey responses (Strongly Agree, Agree, Neutral, Disagree, Strongly
Disagree).
(ii). Birth years of employees in a company.
(iii). Types of pets owned (dog, cat, bird, etc.).
(iv). Test scores out of 100.
(v). Customer satisfaction ratings on a scale of 1 to 10.

Chapter 2

Data Exploration: Tabular

and Graphical Displays
2.1 Introduction
In data science, data exploration is a crucial phase in the data analysis work-
flow, preceding more complex statistical modeling and hypothesis testing. It
provides a comprehensive overview of the dataset, allowing analysts to under-
stand the underlying structure and characteristics of the data. Through this
process, one can identify data quality issues, such as missing values or outliers,
and gain insights that inform the choice of appropriate analytical methods.

This chapter covers basic methods for exploring data using tables and charts.
First, we will look at tables, which are a simple and effective way to organize
DR

and summarize data. Then, we will explore charts and graphs, which help
us see patterns and trends more easily. By looking at both qualitative and
quantitative data through these methods, we aim to build a solid foundation
for more advanced analysis and ensure a strong approach to exploring data.

2.2 Tabular Displays


Tabular displays present data in a structured format, typically organized into
rows and columns. Tables are valuable for displaying raw data, summarizing
information, and comparing different datasets. Basic tabular methods include
frequency tables, which show the distribution of data across different categories
or intervals, and cross-tabulations, which explore relationships between two
categorical variables. Tables are particularly useful for presenting precise data
and performing detailed comparisons, but they can be limited in their ability
to convey trends and patterns visually.


Key Components
• Frequency Tables: Display the count (or frequency) of each distinct value, category, or interval in a dataset, helping to summarize and understand the distribution of the data (a short pandas sketch follows this list).

• Contingency Tables: Show the joint frequency distribution of two or more categorical variables, making it possible to examine the relationship between them.

• Summary Tables: Provide descriptive statistics such as mean, median, mode, standard deviation, and quartiles.
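
As a quick illustration, pandas can produce simple frequency and contingency tables directly; the tiny survey dataset below is made up for this purpose.

import pandas as pd

# Hypothetical survey responses
df = pd.DataFrame({
    "gender": ["M", "F", "F", "M", "F", "M"],
    "preference": ["tea", "coffee", "tea", "tea", "coffee", "coffee"],
})

# Frequency table for one categorical variable
print(df["preference"].value_counts())

# Contingency table (cross-tabulation) of two categorical variables
print(pd.crosstab(df["gender"], df["preference"]))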

Example
A frequency table showing the distribution of test scores for a class of students.

Score Range    Frequency
0-49           2
50-59          3
60-69          7
70-79          8
80-89          10
90-100         5
80-89 10
90-100 5

2.3 Graphical Displays


Graphical displays, on the other hand, offer a visual representation of data that
can reveal trends, relationships, and distributions more intuitively than tables.
Common graphical methods include histograms, bar charts, pie charts, stem-
and-leaf plots, and box plots. Each type of graphical display serves a specific
purpose:

• Histograms show the frequency distribution of numerical data across intervals, helping to identify patterns and outliers.

• Bar Charts compare different categories by showing the frequency or count of each category.

• Pie Charts illustrate the proportion of each category relative to the whole dataset.

• Stem-and-Leaf Plots provide a way to display data that retains the original data values and reveals distribution shapes.

• Box Plots summarize the distribution of data through quartiles and highlight potential outliers.

Graphical methods are indispensable for exploring the data visually and
communicating findings to a broader audience. They help to simplify complex
datasets and highlight patterns that might not be immediately apparent from
tabular data alone.

2.4 Summarizing Qualitative Data


Summarizing Qualitative Data involves organizing and presenting categor-

T
ical data in a way that highlights the frequency and distribution of the different
categories. Here are common methods:

2.4.1 Bar Chart


A bar chart (or bar graph) is a type of data visualization used to represent
AF
and compare different categories of data through rectangular bars. Each bar’s
length or height is proportional to the value or frequency of the category it
represents.

Bar Chart: A chart that uses rectangular bars to represent the frequency
or proportion of categories in a dataset, with the length of each bar corre-
sponding to its value.

The bar chart is a powerful tool for visualizing categorical data. It allows
for easy comparison between different categories. This makes it straightforward
to identify patterns, trends, and outliers within the data.
DR

Problem 2.1. Suppose we conducted a survey asking students about their fa-
vorite movie genre from a list of options: Comedy, Action, Romance, Drama,
and Science Fiction. We gathered responses from a total of 20 students, and
their preferences are given in Table 2.1 and as follows:

Table 2.1: List of favorite movies.

Comedy Science Fiction Comedy Comedy


Action Science Fiction Action Romance
Action Romance Science Fiction Romance
Romance Romance Action Drama
Comedy Romance Action Science Fiction

Make a frequency distribution table and draw a bar chart.


To draw the bar chart, the frequency distribution of the students’ preferences is given in Table 2.2. The bar chart is presented in Figure 2.1.

Table 2.2: Frequency distribution of movie preferences

Category frequency (fi ) Percentage

Comedy 4 20%
Action 5 25%
Romance 6 30%
Drama 1 5%
Science Fiction 4 20%

Total              ∑i fi = 20

[Bar chart: y-axis Number of Students (0 to 6); bars: Comedy 4, Action 5, Romance 6, Drama 1, Science Fiction 4.]

Figure 2.1: Bar Chart of Movie Preferences

2.4.2 Python Code: Bar Chart


To create a bar chart for the data in Problem 2.1 using Python, you can use the following code to generate the chart.
# Import the necessary libraries
import matplotlib.pyplot as plt
from collections import Counter

# Data from the table
movies = [
    "Comedy", "Science Fiction", "Comedy", "Comedy",
    "Action", "Science Fiction", "Action", "Romance",
    "Action", "Romance", "Science Fiction", "Romance",
    "Romance", "Romance", "Action", "Drama",
    "Comedy", "Romance", "Action", "Science Fiction"
]

# Count the frequency of each movie genre
movie_counts = Counter(movies)

# Extract the genres and their corresponding counts
genres = list(movie_counts.keys())
counts = list(movie_counts.values())

# Plot the bar chart
plt.figure(figsize=(10, 6))
plt.bar(genres, counts, color='skyblue')
plt.xlabel('Movie Genre')
plt.ylabel('Frequency')
plt.title('Favorite Movie Genres')
plt.show()


2.4.3 Pie Chart


A pie chart is a circular statistical graphic divided into slices to illustrate nu-
merical proportions. Each slice of the pie represents a category’s contribution
to the total, with the entire pie representing 100% of the data.
DR

Pie Chart: A circular chart that represents the proportion of each cate-
gory in a dataset as slices, allowing easy comparison of relative frequencies.

Pie charts are a straightforward way to show how different parts contribute
to a whole, making them a popular choice for visualizing proportions in data
analysis.

Problem 2.2. Refer to Table 2.1 for the dataset needed to create a pie chart.
Analyze the pie chart and interpret the key findings from the data visualized.

Solution
To draw the pie chart, the frequency distribution of the students’ preferences is given in Table 2.3. The pie chart is presented in Figure 2.2.

Table 2.3: Frequency Distribution

Category    Frequency (fi)    Percentage    Angle = (fi / ∑i fi) × 360°

Comedy 4 20% 72
Action 5 25% 90
Romance 6 30% 108
Drama 1 5% 18
Science Fiction 4 20% 72
Total    ∑i fi = 20

[Pie chart: Comedy 20%, Action 25%, Romance 30%, Drama 5%, Science Fiction 20%.]

Figure 2.2: Pie Chart


The frequency distribution table and pie chart reveal the preferences for
movie genres among 20 respondents. Romance emerges as the most preferred
genre, accounting for 30% of the choices, as indicated by the largest pie chart
segment. Action is also quite popular, chosen by 25% of respondents. Both
Comedy and Science Fiction have equal popularity, each being preferred by
20% of the respondents. Drama is the least favored genre, with only 5% of the
respondents selecting it. These results suggest that respondents have a diverse
range of genre preferences, with Romance standing out as the most popular
choice.

2.4.4 Python Code: Pie Chart


To create a pie chart for the data in Problem 2.2 using Python, you can use the following code to generate the chart.
# Install the required libraries first (if not already installed):
#   pip install pandas matplotlib
import matplotlib.pyplot as plt
from collections import Counter

# Data from the table
data = [
    'Comedy', 'Science Fiction', 'Comedy', 'Comedy',
    'Action', 'Science Fiction', 'Action', 'Romance',
    'Action', 'Romance', 'Science Fiction', 'Romance',
    'Romance', 'Romance', 'Action', 'Drama',
    'Comedy', 'Romance', 'Action', 'Science Fiction'
]

# Count occurrences of each genre
genre_counts = Counter(data)

# Prepare data for the pie chart
genres = list(genre_counts.keys())
counts = list(genre_counts.values())

# Create the pie chart
plt.figure(figsize=(10, 7))
plt.pie(counts, labels=genres, autopct='%1.1f%%', startangle=140)
plt.title('Favorite Movies Pie Chart')
plt.show()

Problem 2.3. Imagine you conducted a survey to find out the types of physical
activities performed by patients in a rehabilitation program. You collected data
from 100 patients, and the results are as follows:

• Walking: 25 patients
R
• Cycling: 20 patients

• Swimming: 15 patients

• Yoga: 15 patients


D

Strength Training: 10 patients

• Pilates: 8 patients

• Dancing: 7 patients

Answer the following questions:

(a). What is the name of the variable under study? Is this a qualitative vari-
able? If the answer is no, why?

31
CHAPTER 2. DATA EXPLORATION: TABULAR AND GRAPHICAL
DISPLAYS

(b). Make a pie chart to represent the distribution of physical activities among
the patients. Write a summary based on this. Be sure to label the chart
accurately and show the percentage of patients for each activity.

Solution
(a). The variable under study is the type of physical activity performed by
patients in the rehabilitation program. Yes, this is a qualitative variable
because it describes categories or types of activities rather than numerical
values.
(b). Pie Chart:

[Pie chart: Walking 25%, Cycling 20%, Swimming 15%, Yoga 15%, Strength Training 10%, Pilates 8%, Dancing 7%.]

Summary:
The pie chart shows the distribution of physical activities among the pa-
tients in the rehabilitation program. Walking is the most common activity,
with 25% of the patients participating in it. This is followed by cycling
(20%), swimming (15%), and yoga (15%). Strength training is performed
by 10% of the patients, while pilates and dancing are the least common
activities, with 8% and 7% participation, respectively.

2.5 Summarizing Quantitative Data


Summarizing quantitative data involves organizing and presenting numerical
data in a way that reveals patterns, trends, and important characteristics of
the dataset. Some common methods are explained in the following.

32
CHAPTER 2. DATA EXPLORATION: TABULAR AND GRAPHICAL
DISPLAYS

2.5.1 Constructing a Frequency Distribution Table


Let’s take an example to understand how to construct a frequency distribution. Suppose we have the weekly expenditures of 30 students. To construct a frequency distribution table, we follow these steps:

• Step 1: Sort the data in ascending order.

• Step 2: Find minimum and maximum observation of data.

T
• Step 3: Decide on the number of classes (k) in the frequency distri-
bution.
k = 1 + 3.322 log10 (n).
Alternatively, we can also choose k such that

2k ≥ n.
AF• Step 4: Determine the class interval (h) size.

Maximum observation - Minimum observation


h≥
Number of class (k)

• Step 5: Decide the starting point: the lower class limit or class bound-
ary should cover the smallest value in the raw data.

• Step 6: Tally and count the observations under each interval.


DR

Let’s take the following example to understand how to construct a frequency


distribution.

Problem 2.4. Suppose we have a weekly expenditure of 30 students. Given the


following numbers of observations:

423 369 387 411 393 394


371 377 389 409 392 407
431 401 363 391 405 382
400 381 399 415 428 422
395 371 410 419 386 390

construct a frequency distribution table using an appropriate number of classes


and the class interval.

33
CHAPTER 2. DATA EXPLORATION: TABULAR AND GRAPHICAL
DISPLAYS

Solution
1. Step 1: sort data in ascending order

363 369 371 371 377 381 382 386 387 389 390

391 392 393 394 395 399 400 401 405 407 409

410 411 415 419 422 423 428 431

2. Step 2: The minimum observation is 363 and the maximum observation is 431.
3. Step 3: Number of classes
k = 1 + 3.322 log10 (30) = 5.907 ≈ 6

4. Step 4: h ≥ (Maximum observation − Minimum observation) / Number of classes = (431 − 363) / 6 = 11.33 ≈ 12
5. Step 5: Decide the starting point: 360.
Using all steps, the frequency distribution table is presented in Table 2.4.

Table 2.4: Distribution of weekly expenditure of 30 students.


DR

Class Interval Tally Frequency Relative Frequency


360 - 372 4 0.1333
372 - 384 3 0.1000
384 - 396 9 0.3000
396 - 408 5 0.1667
408 - 420 5 0.1667
420 - 432 4 0.1333
Total 30 1.0000

Constructing a frequency distribution table helps in organizing data into


class intervals, providing a clear overview of the distribution of data points.
This is a crucial step in data exploration, as it allows for the identification of
patterns and trends within the dataset.

34
CHAPTER 2. DATA EXPLORATION: TABULAR AND GRAPHICAL
DISPLAYS

2.5.2 Python Code: Frequency Distribution


To create a frequency distribution table for the data in Problem 2.4 using
Python, you can use the following code to generate the Frequency Distribution
table.
1 import pandas as pd
2 import matplotlib . pyplot as plt
3

4 # Data : weekly expenditure of 30 students


5 data = [
6 423 , 369 , 387 , 411 , 393 , 394 ,
7 371 , 377 , 389 , 409 , 392 , 407 ,

T
8 431 , 401 , 363 , 391 , 405 , 382 ,
9 400 , 381 , 399 , 415 , 428 , 422 ,
10 395 , 371 , 410 , 419 , 386 , 390
11 ]
12

13 # Convert the data into a pandas DataFrame


14

15

16

17
AF df = pd . DataFrame ( data , columns =[ ’ Expenditure ’ ])

# Define the class intervals ( bins )


bins = range (360 , 440 , 12) # Create bins from 360 to 440
with an interval of 12
18 labels = [ f ’{ bins [ i ]} -{ bins [ i +1] -1} ’ for i in range ( len ( bins
) -1) ] # Labels for each bin
19

20 # Bin the data and calculate frequency distribution


21 df [ ’ Bins ’] = pd . cut ( df [ ’ Expenditure ’] , bins = bins , labels =
labels , right = False )
22 f r e q u e n c y _d is tri bu ti on = df [ ’ Bins ’ ]. value_counts () .
sort_index ()
DR

23

24 # Print frequency distribution


25 print ( " Frequency Distribution : " )
26 print ( f r eq uen cy _d ist ri bu ti on )
27

28 # Plot frequency distribution as a bar chart


29 plt . figure ( figsize =(12 , 6) )
30 f r e q u e n c y _d is tri bu ti on . plot ( kind = ’ bar ’ , color = ’ skyblue ’)
31 plt . xlabel ( ’ Expenditure Range ’)
32 plt . ylabel ( ’ Frequency ’)
33 plt . title ( ’ Frequency Distribution of Weekly Expenditure with
Fixed Class Intervals ’)
34 plt . xticks ( rotation =45)
35 plt . grid ( axis = ’y ’ , linestyle = ’ -- ’ , alpha =0.7)
36 plt . tight_layout ()
37 plt . show ()
38

39

35
CHAPTER 2. DATA EXPLORATION: TABULAR AND GRAPHICAL
DISPLAYS

2.5.3 Frequency Polygon


A frequency polygon is a line graph that displays the frequencies of different
class intervals. It is created by plotting the frequencies of each class interval at
the midpoints of those intervals and connecting these points with straight lines.

To draw we have to calculate the midpoint for each class interval. The
midpoint is the average of the lower and upper bounds of the class interval
(See Table 2.5). Then plot the midpoints on the x-axis and the corresponding
frequencies on the y-axis. The frequency polygon is depicted in Figure 2.3

Table 2.5: Distribution of Weekly Expenditure of 30 Students

T
Class Tally Frequency Relative Midpoint
Interval Frequency
360 - 372 4 0.1333 366
AF 372 - 384
384 - 396
396 - 408
3
9
5
0.1000
0.3000
0.1667
378
390
402
408 - 420 5 0.1667 414
420 - 432 4 0.1333 426
Total 30 1.0000

Figure 2.3: Frequency Polygon of Weekly Expenditure of 30 Students


[Frequency polygon: frequency (0 to 10) plotted against the class midpoints; x-axis: Expenditure, 360 to 432.]

36
CHAPTER 2. DATA EXPLORATION: TABULAR AND GRAPHICAL
DISPLAYS

2.5.4 Python Code: Frequency Polygon


To create a frequency polygon for the data in Example 2.4 using Python, you
can use the following code to generate the Frequency Polygon.
1 import pandas as pd
2 import matplotlib . pyplot as plt
3

4 # Data : weekly expenditure of 30 students


5 data = [
6 423 , 369 , 387 , 411 , 393 , 394 ,
7 371 , 377 , 389 , 409 , 392 , 407 ,
8 431 , 401 , 363 , 391 , 405 , 382 ,

T
9 400 , 381 , 399 , 415 , 428 , 422 ,
10 395 , 371 , 410 , 419 , 386 , 390
11 ]
12

13 # Convert the data into a pandas DataFrame


14 df = pd . DataFrame ( data , columns =[ ’ Expenditure ’ ])
15

16

17
AF # Define the class intervals ( bins )
bins = range (360 , 440 , 12) # Create bins from 360 to 440
with an interval of 12
18 labels = [ f ’{ bins [ i ]} -{ bins [ i +1] -1} ’ for i in range ( len ( bins
) -1) ] # Labels for each bin
19

20 # Bin the data and calculate frequency distribution


21 df [ ’ Bins ’] = pd . cut ( df [ ’ Expenditure ’] , bins = bins , labels =
labels , right = False )
22 f r e q u e n c y _d is tri bu ti on = df [ ’ Bins ’ ]. value_counts () .
sort_index ()
23
DR

24 # Calculate bin midpoints


25 bin_midpoints = [( bins [ i ] + bins [ i +1]) / 2 for i in range (
len ( bins ) -1) ]
26

27 # Plot frequency polygon


28 plt . figure ( figsize =(12 , 6) )
29 plt . plot ( bin_midpoints , fre qu en cy_ di st ri but io n . sort_index () ,
marker = ’o ’ , linestyle = ’ - ’ , color = ’b ’)
30 plt . xlabel ( ’ Expenditure Range ’)
31 plt . ylabel ( ’ Frequency ’)
32 plt . title ( ’ Frequency Polygon of Weekly Expenditure ’)
33 plt . xticks ( bin_midpoints , labels = labels , rotation =45)
34 plt . grid ( True , linestyle = ’ -- ’ , alpha =0.7)
35 plt . tight_layout ()
36 plt . show ()
37

38

37
CHAPTER 2. DATA EXPLORATION: TABULAR AND GRAPHICAL
DISPLAYS

2.5.5 Ogive Curve


An ogive curve (also called Cumulative Frequency Polygons) is a graphical rep-
resentation used in statistics to show the cumulative frequency distribution of
a dataset. It is a type of cumulative frequency graph that visualizes how the
cumulative frequency accumulates over a range of values.

To draw an ogive curve, we need to calculate cumulative frequency which is


calculated by adding the frequency of the current class interval to the cumulative
frequency of the previous class interval. The frequency distribution table with
cumulative frequency is presented in Table 2.6.

T
Table 2.6: Distribution of Weekly Expenditure of 30 Students

Class Tally Frequency Relative Midpoint Cumulative


Interval Frequency Frequency
360 - 372 4 0.1333 366 4
AF
372 - 384
384 - 396
3
9
0.1000
0.3000
378
390
7
16
396 - 408 5 0.1667 402 21
408 - 420 5 0.1667 414 26
420 - 432 4 0.1333 426 30
Total 30 1.0000

The ogive curve is presented in Figure 2.4 for weekly expenditures. In


DR

the frequency polygon, the peak at the midpoint of 390 indicates that most
students’ expenditures fall around this value. The shape of the polygon, with
a rise to the peak and a gradual decline, shows that while expenditures are
somewhat concentrated in the 384-396 range, there is a moderate spread across
other ranges. This visualization helps quickly grasp the central tendency and
variability of expenditures among students.

38
CHAPTER 2. DATA EXPLORATION: TABULAR AND GRAPHICAL
DISPLAYS
Figure 2.4: Ogive Curve of Weekly Expenditure of 30 Students

[Ogive: cumulative frequency (0 to 30) plotted against class boundaries; x-axis: Expenditure, 360 to 432.]

In the example of weekly expenditures for 30 students, the ogive curve il-
AF
lustrates that as expenditure increases, the cumulative number of students also
rises. The curve starts at the cumulative frequency of 4 for the interval 360-
372 and gradually increases to 30 for the interval 420-432, reflecting that 30
students’ expenditures are up to 432. The steepness of the curve indicates in-
tervals with higher frequencies, while flatter sections show lower frequencies.
Key features like the median can be identified where the curve reaches 50%
of the total cumulative frequency (15 students), and the quartiles reveal the
spread of expenditures across different percentiles. Overall, the ogive provides
insights into data distribution, helping to visualize the proportion of students
DR
spending up to various amounts.

2.5.6 Python Code: Ogive


To create a Ogive for the data in Example 2.4 using Python, you can use the
following code to generate the Ogive.
1 import pandas as pd
2 import matplotlib . pyplot as plt
3

4 # Data : weekly expenditure of 30 students


5 data = [
6 423 , 369 , 387 , 411 , 393 , 394 ,
7 371 , 377 , 389 , 409 , 392 , 407 ,
8 431 , 401 , 363 , 391 , 405 , 382 ,
9 400 , 381 , 399 , 415 , 428 , 422 ,
10 395 , 371 , 410 , 419 , 386 , 390
11 ]
12

39
CHAPTER 2. DATA EXPLORATION: TABULAR AND GRAPHICAL
DISPLAYS

13 # Convert the data into a pandas DataFrame


14 df = pd . DataFrame ( data , columns =[ ’ Expenditure ’ ])
15

16 # Define the class intervals ( bins ) with an interval of 12


17 bin_start = 360
18 bin_end = 440
19 bin_interval = 12
20 bins = list ( range ( bin_start , bin_end + bin_interval ,
bin_interval ) )
21

22 # Generate labels for the bins


23 labels = [ f ’{ bins [ i ]} -{ bins [ i +1] -1} ’ for i in range ( len ( bins
) -1) ]
24

25 # Bin the data and calculate frequency distribution


26 df [ ’ Bins ’] = pd . cut ( df [ ’ Expenditure ’] , bins = bins , labels =

T
labels , right = False )
27 f r e q u e n c y _d is tri bu ti on = df [ ’ Bins ’ ]. value_counts () .
sort_index ()
28

# Calculate cumulative frequency


29

30
AF
c u m u l a t i ve_frequency = f re qu enc y_ di str ib ut ion . cumsum ()
31

32 # Calculate bin edges for plotting


33 bin_edges = bins # Use only the bin edges
34

35 # Calculate cumulative frequencies including the starting


point
36 c u m u l a t i v e _ f r e q u e n c y _ w i t h _ s t a r t = [0] + list (
c u m u lative_frequency )
DR
37

38 # Ensure the length of bin_edges matches


cumulative_frequency_with_start
39 if len ( bin_edges ) != len ( c u m u l a t i v e _ f r e q u e n c y _ w i t h _ s t a r t ) :
40 # Extend bin_edges to match the length of
cumulative_frequency_with_start
41 bin_edges = bins + [ bins [ -1] + bin_interval ]
42

43 # Plot the ogive


44 plt . figure ( figsize =(12 , 6) )
45 plt . plot ( bin_edges , cumulative_frequency_with_start , marker =
’o ’ , linestyle = ’ - ’ , color = ’b ’)
46 plt . xlabel ( ’ Expenditure Range ’)
47 plt . ylabel ( ’ Cumulative Frequency ’)
48 plt . title ( ’ Ogive of Weekly Expenditure with Interval 12 ’)
49 plt . xticks ( bin_edges , labels =[ f ’{ edges } -{ edges +12 -1} ’ for
edges in bin_edges [: -1]] + [ ’ ’ ])
50 plt . grid ( True , linestyle = ’ -- ’ , alpha =0.7)
51 plt . tight_layout ()

40
CHAPTER 2. DATA EXPLORATION: TABULAR AND GRAPHICAL
DISPLAYS

52 plt . show ()

2.5.7 Histogram
A histogram is a graphical representation of the distribution of numerical data.
It consists of a series of adjacent rectangles, or bars, where each bar’s height
corresponds to the frequency or count of data points falling within a specific
range or bin. It provides a visual summary of data distribution, helping to
identify patterns such as trends, peaks, and the spread of data.

The following histogram (see Figure 2.5) displays the distribution of

weekly expenditures for 30 students. The x-axis represents the expenditure
ranges (bins), and the y-axis represents the number of students in each range.

[Histogram: frequency of weekly expenditures for the classes 360-372 through 420-432.]

Figure 2.5: Histogram

The histogram of weekly expenditures for 30 students illustrates that most


students’ spending falls within the $384 to $396 range, which is the highest
frequency interval in the distribution. The data is predominantly centered
around this mid-range, indicating that it is the most common expenditure range
among the students. The frequency of expenditures decreases as you move
towards the lower ($360 to $372) and higher ($420 to $432) intervals, showing
that fewer students have expenditures at the extremes. Overall, the histogram
reveals a central tendency in the middle expenditure ranges, suggesting that
most students have similar spending patterns with moderate expenditures being
more prevalent compared to the extremes.

41
CHAPTER 2. DATA EXPLORATION: TABULAR AND GRAPHICAL
DISPLAYS

2.5.8 Python Code: Histogram


To create a histogram for the data in Example 2.4 using Python, you can use
the following code to generate the histogram.
1 import pandas as pd
2 import matplotlib . pyplot as plt
3

4 # Data : weekly expenditure of 30 students


5 data = [
6 423 , 369 , 387 , 411 , 393 , 394 ,
7 371 , 377 , 389 , 409 , 392 , 407 ,
8 431 , 401 , 363 , 391 , 405 , 382 ,

T
9 400 , 381 , 399 , 415 , 428 , 422 ,
10 395 , 371 , 410 , 419 , 386 , 390
11 ]
12

13 # Convert the data into a pandas DataFrame


14 df = pd . DataFrame ( data , columns =[ ’ Expenditure ’ ])
15

16

17

18
AF # Define the bin intervals with an interval of 12
bin_start = 360
bin_end = 440
19 bin_interval = 12
20 bins = list ( range ( bin_start , bin_end + bin_interval ,
bin_interval ) )
21

22 # Plot the histogram


23 plt . figure ( figsize =(12 , 6) )
24 plt . hist ( df [ ’ Expenditure ’] , bins = bins , edgecolor = ’ black ’ ,
alpha =0.7)
25 plt . xlabel ( ’ Expenditure Range ’)
DR

26 plt . ylabel ( ’ Frequency ’)


27 plt . title ( ’ Histogram of Weekly Expenditure with Interval 12 ’
)
28 plt . xticks ( bins , labels =[ f ’{ edges } -{ edges +12 -1} ’ for edges
in bins [: -1]] + [ f ’{ bins [ -1]}+ ’ ])
29 plt . grid ( True , linestyle = ’ -- ’ , alpha =0.7)
30 plt . tight_layout ()
31 plt . show ()

2.5.9 Stem-and-Leaf Plot


A stem-and-leaf plot is a data visualization tool used to display the distri-
bution of a dataset while preserving the original data values. It separates each
data point into two parts: the “stem,” which represents the leading digits of
the data, and the “leaf,” which represents the trailing digits. This plot helps
to organize data, reveal patterns, and identify the shape of the distribution. It

42
CHAPTER 2. DATA EXPLORATION: TABULAR AND GRAPHICAL
DISPLAYS

is particularly useful for small to moderate-sized datasets.

Components of a Stem-and-Leaf Plot:


• Stem: Represents the leading digits of the data values.

• Leaf: Represents the last digit of the data values.

• Plot: Lists stems in ascending order with leaves corresponding to each


stem, showing the distribution of data.
Problem 2.5. Consider the following dataset representing the blood pressures

FT
(in mmHg) of 15 patients.

120, 122, 124, 126, 128, 130, 132, 134, 136, 138, 140, 142, 144, 146, 148

Construct a stem-and-leaf plot and comment on it.

Solution
The stem-and-leaf plot is presented in Table 2.7. The “stem” represents the
tens and the “leaf” represents the ones digit. For example, for 120, the stem is
12 and the leaf is 0.

Table 2.7: Stem-and-Leaf Plot of Blood Pressure Measurements


A
Stem Leaf
12 0, 2, 4, 6, 8
13 0, 2, 4, 6, 8
14 0, 2, 4, 6, 8
R
The stem-and-leaf plot organizes the data clearly:
• The stems (12, 13, 14) represent the tens digits of the blood pressures.

• The leaves show the units digits for each stem, indicating the exact
D

values of the blood pressures.

Observations from the plot:


The stem-and-leaf plot shows that the blood pressures are fairly evenly dis-
tributed between 120 and 148. There are no extreme outliers, as the values
are consistently spaced out. The plot illustrates a consistent increase in blood
pressure from 120 mmHg to 148 mmHg, indicating a relatively uniform range
of values. Each stem (12, 13, and 14) represents a set of five patients’ blood
pressures, making the distribution easy to interpret.

43
CHAPTER 2. DATA EXPLORATION: TABULAR AND GRAPHICAL
DISPLAYS

Problem 2.6. Construct a stem-and-leaf plot to display the distribution of the


following temperatures recorded in Celsius degrees:

25.3, -3.8, 12, 0.5, -10,


18.9, -7.2, 6, 21.6, -15.4

Create a stem-and-leaf plot appropriately.

Solution
The temperatures recorded in Celsius are:

25.3, −3.8, 12, 0.5, −10, 18.9, −7.2, 6, 21.6, −15.4

We can construct a stem-and-leaf plot as follows:

T
Stem Leaf
−15 4
−10 0
AF −7 2
−3 8
0 5
6 0
12 0
18 9
DR
21 6
25 3
In this plot:

• The ‘stem’ represents the integer part (before the decimal point) of
each temperature. For instance, the temperature 25.3 can be broken
down into a stem of 25 (representing the integer part) and a leaf of 3
(representing the decimal part).

• The ‘leaf’ represents the decimal part (after the decimal point) of each
temperature.

2.5.10 Python Code: Stem-and-leaf


To create a stem-and-leaf for the given data in Example 2.6 in Python, we need
the following Python code that will generate a stem-and-leaf.

44
CHAPTER 2. DATA EXPLORATION: TABULAR AND GRAPHICAL
DISPLAYS

1 from collections import defaultdict


2

3 # Data
4 data = [25 , -4 , 12 , 1 , -10 , 19 , -7 , 6 , 22 , -15]
5

6 def st em _and_leaf_plot ( data ) :


7 # Separate positive and negative values
8 pos_data = [ x for x in data if x >= 0]
9 neg_data = [ - x for x in data if x < 0]
10

11 # Determine the stems and leaves


12 def create_stem_leaf ( data ) :

T
13 stem_leaf = defaultdict ( list )
14 for value in data :
15 stem = value // 10
16 leaf = value % 10
17 stem_leaf [ stem ]. append ( leaf )
18 return stem_leaf
19

20

21

22

23
AF pos_stem_leaf = create_stem_leaf ( pos_data )
neg_stem_leaf = create_stem_leaf ( neg_data )

# Print positive values


24 print ( " Stem - and - Leaf Plot : " )
25 print ( " Positive values : " )
26 for stem in sorted ( pos_stem_leaf . keys () ) :
27 leaves = sorted ( pos_stem_leaf [ stem ])
28 print ( f " { stem } | { ’ ’. join ( map ( str , leaves ) ) } " )
29

30 # Print negative values


31 print ( " Negative values : " )
DR

32 for stem in sorted ( neg_stem_leaf . keys () ) :


33 leaves = sorted ( neg_stem_leaf [ stem ])
34 print ( f " -{ stem } | { ’ ’. join ( map ( str , leaves ) ) } " )
35

36 # Generate the stem - and - leaf plot


37 st em _a nd _leaf_plot ( data )

2.6 Concluding Remarks


As we conclude our exploration of tabular and graphical displays, it is evident
that these foundational techniques are indispensable for effective data analysis.
The ability to present and interpret data through well-structured tables and
informative graphics not only enhances our understanding but also aids in un-
covering patterns and trends that might otherwise remain obscured. The skills
and Python code examples provided in this chapter equip you with practical
tools for summarizing and visualizing data, setting the stage for deeper statis-

45
CHAPTER 2. DATA EXPLORATION: TABULAR AND GRAPHICAL
DISPLAYS

tical analysis and more complex data science endeavors. As you move forward,
the principles outlined here will serve as a cornerstone for more advanced topics,
reinforcing the importance of clear, accurate, and insightful data presentation
in the broader field of data science.

2.7 Chapter Exercises


1. You are given the following data from a survey on the incidence of differ-
ent types of diseases among a sample of patients:

Flu, Cold, Flu, Allergies, Cold, Flu, Cold, Flu, Allergies, Flu, Flu, Aller-

T
gies, Cold, Flu, Cold, Flu, Cold, Allergies, Cold, Flu

Construct a frequency distribution table for the diseases. Draw a bar chart
and a pie chart based on the frequency distribution. Interpret the results
and comment on the most and least common diseases in the sample.
AF
2. A local community center conducted a survey to find out the preferred
recreational activities of its members. The results of the survey are sum-
marized below:

Activity Number of Members


Basketball 45
Swimming 30
Yoga 20
Tennis 15
DR

Running 10

(a). Calculate the percentage of members who prefer each activity.


(b). Draw a pie chart to represent the distribution of preferred recre-
ational activities among the members. Write a summary based on
this.

3. The following table given in Table 2.8, shows the distribution of sales (in
thousands of units) of five different products in a company during the first
quarter of the year.

46
CHAPTER 2. DATA EXPLORATION: TABULAR AND GRAPHICAL
DISPLAYS
Table 2.8: Sales Distribution of Products

Product Sales (thousands of units)


Laptops 30
Smartphones 20
Tablets 15
Desktops 25
Accessories 10

T
(a). Calculate the percentage share of each product in the total sales.
(b). Make a pie chart. Write a summary based on this.

4. Imagine you conducted a survey to find out how people spend their leisure
time on a typical weekend. You collected data from 100 respondents, and
the results are as follows:
AF • Watching TV: 30 respondents
• Reading: 20 respondents
• Playing Sports: 15 respondents
• Socializing with Friends: 10 respondents
• Playing Video Games: 10 respondents
• Hiking and Outdoor Activities: 8 respondents
DR

• Cooking and Baking: 7 respondents

Answer the following questions:


(a). What is the name of the variable under study? Is this a qualitative
variable? If the answer is no, why?
(b). Make a pie chart to represent the distribution of leisure activities
among the respondents. Write a summary based on this. Be sure to
label the chart accurately and show the percentage of respondents
for each activity.
5. You are given the following set of test scores from a group of students in
an engineering course:

45, 67, 53, 52, 61, 59, 68, 72, 56, 54,
63, 75, 49, 62, 60, 58, 66, 64, 55, 70

47
CHAPTER 2. DATA EXPLORATION: TABULAR AND GRAPHICAL
DISPLAYS

Construct a frequency distribution table using an appropriate number of


classes and appropriate class intervals for the given test scores.
6. The following data represents the ages of employees in a company working
on a new engineering project:

102, 98, 105, 110, 95, 107, 101, 99, 103, 106,
104, 100, 108, 97, 96, 109, 111, 94, 93, 92
Create a frequency distribution table using an appropriate number of
classes and appropriate class intervals for the given ages.

T
7. Here are the weekly sales figures (in units) for a pharmaceutical company
over a period of 20 weeks:

23, 27, 31, 35, 29, 33, 22, 28, 26, 34,
32, 25, 30, 24, 36, 21, 37, 38, 39, 20
AFConstruct a frequency distribution table using an appropriate number of
classes and appropriate class intervals for the given sales figures.
8. The following data represents the weights (in kilograms) of a sample of
fruits used in a nutritional study:

5.2, 6.3, 7.1, 8.4, 5.5, 6.8, 7.6, 8.0, 5.9, 6.1,
7.3, 8.7, 5.0, 6.5, 7.8, 8.1, 5.7, 6.9, 7.4, 8.5
Make a frequency distribution table using an appropriate number of classes
and appropriate class intervals for the given weights.
DR

9. You are given the following set of scores from a recent medical examina-
tion:

82, 91, 85, 87, 89, 95, 88, 92, 84, 90,
93, 83, 86, 96, 94, 81, 97, 98, 99, 80
Create a frequency distribution table using an appropriate number of
classes and appropriate class intervals for the given exam scores.
10. Consider the following data set representing the heights (in cm) of 10
plants:

Data: 150, 155, 160, 162, 165, 168, 170, 175, 180, 185
(a). Construct a stem-and-leaf plot for the data set.
(b). What is the maximum number of heights of the data?

48
Chapter 3

Data Exploration:

Numerical Measures
3.1 Introduction
In this chapter, we delve into the fundamental concepts of data exploration
with a focus on numerical measures. Understanding these measures is crucial
for analyzing and interpreting data effectively. We begin by examining vari-
ous measures of central tendency, which provide insights into the typical value
within a dataset. These include the arithmetic mean, harmonic mean, geometric
mean, and median, each with its unique properties, advantages, and limitations.

Following the exploration of central tendency, we will address measures of


dispersion or variability. These metrics, such as range, variance, and standard
DR

deviation, help us understand the spread and variability of data points around
the central value. The chapter will also cover measures of distribution shape,
including skewness and kurtosis, which describe the asymmetry and peakedness
of the data distribution.

Moreover, we will discuss quartiles, percentiles, and deciles, which are instru-
mental in dividing the data into meaningful segments, and outline methods for
detecting outliers. The chapter concludes with an overview of the five-number
summary and boxplots, essential tools for summarizing and visualizing data
distribution. Python code will be provided throughout to illustrate practical
applications of these concepts.

3.2 Measures of Central Tendency


Measures of central tendency are statistical measures that describe the center
or representative value of a dataset. The most common measures of central


tendency are the following:


(i). Mean
■ Arithmetic mean
■ Harmonic mean
■ Geometric mean

(ii). Median

(iii). Mode

These measures provide a single value that represents the middle or center
of the data distribution and are essential for summarizing large datasets.

3.2.1 Arithmetic Mean


The arithmetic mean, often referred to as the average, is the most common
AF
measure of central tendency. It is a useful measure when the data is symmet-
rically distributed without extreme outliers. It is calculated by summing all
values and dividing by the number of values. The arithmetic mean of a set of
numbers x1 , x2 , . . . , xn is typically denoted by x̄ of n and is defined by

n
!
x1 + x2 + · · · + xn 1 X
x̄ = = xi
n n i=1

Consider a study measuring the time in days for patients to recover from a
specific illness, with recovery times being 4, 5, 6, 7, 8 days. We can use the
arithmetic mean to find the average recovery time.
DR

The arithmetic mean is calculated as follows:


x̄ = (4 + 5 + 6 + 7 + 8) / 5 = 30 / 5 = 6
So, the arithmetic mean of the recovery times is 6 days, which represents
the average recovery time.
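
The same computation takes one line in Python; here is a quick sketch using NumPy with the recovery times above.

import numpy as np

recovery_days = np.array([4, 5, 6, 7, 8])          # recovery times from the example
print("Arithmetic mean:", np.mean(recovery_days))  # prints 6.0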

3.2.2 Advantages and Disadvantages of Arithmetic Mean


The arithmetic mean (or simply the mean) is one of the most commonly used
measures of central tendency. It has both advantages and disadvantages de-
pending on the context in which it is used. Understanding its advantages and
disadvantages is crucial for selecting the appropriate measure of central ten-
dency for different types of data and analyses.

50
CHAPTER 3. DATA EXPLORATION: NUMERICAL MEASURES

Advantages
(i) Rigidity and Simplicity: It is rigidly defined, simple, easy to under-
stand, and easy to calculate.
(ii) Uses all data points: It is based upon all the observations in the data
set.
(iii) Uniqueness: Its value being unique allows for comparisons between dif-
ferent sets of data.
(iv) Mathematical Properties: The arithmetic mean has useful mathemat-
ical properties. For instance, it can be used in further statistical analysis,

T
like calculating variance and standard deviation.
(v) Best for Symmetric Distributions: The arithmetic mean is a reliable
measure of central tendency when the data follows a symmetric distribu-
tion (like a normal distribution), where the mean is the most representa-
tive value.
AF
(vi) Stability: It is least affected by sampling fluctuations compared to other
measures of central tendency.

Disadvantages
(i) Sensitivity to Outliers: The mean is highly affected by extreme values
or outliers. A few very high or low numbers can skew the mean, making
it not represent the ”typical” value of the data set.
(ii) Not Suitable for Skewed Distributions: In data sets that are heavily
skewed, the mean may not reflect the central location accurately, as it can
be pulled toward the tail of the distribution.


(iii) Not Ideal for Non-Numeric Data: The arithmetic mean cannot be
applied to nominal or ordinal data, as it requires numerical values to make
sense.

(iv) Requirement of Complete Data: It cannot be obtained if a single


observation is missing.
(v) Inapplicability with Open Classes: It cannot be calculated if the
extreme class is open (e.g., below 10 or above 90).

Problem 3.1. Suppose we have the following data on the systolic blood pressure
(in mmHg) of 10 patients:

120, 130, 125, 140, 135, 128, 132, 138, 124, 126

What is the average systolic blood pressure (in mmHg) of 10 patients?


Solution
To calculate the mean systolic blood pressure:
120 + 130 + 125 + 140 + 135 + 128 + 132 + 138 + 124 + 126
x̄ =
10
1298
= = 129.8 mmHg
10

3.2.3 Harmonic Mean


The harmonic mean is a type of average used for rates and ratios. It is defined
as the reciprocal of the arithmetic mean of the reciprocals of a set of values.

The formula for the harmonic mean x̄HM of n values x1 , x2 , . . . , xn is:

x̄HM = n / ( Σ_{i=1}^{n} (1/xi) )
Using the recovery times 4, 5, 6, 7, 8 days, we calculate the harmonic mean to
find the average rate of recovery.

The harmonic mean is calculated as follows:

x̄HM = 5 / (1/4 + 1/5 + 1/6 + 1/7 + 1/8) = 5 / (0.25 + 0.20 + 0.1667 + 0.1429 + 0.125) ≈ 5.65

So, the harmonic mean of the recovery times is approximately 5.65 days,
which represents an average recovery time weighted by the rates of recovery.

Advantages and Disadvantages of Harmonic Mean


Advantages
(i) Appropriate for Rates and Ratios: The harmonic mean is particu-
larly useful for averaging rates or ratios, such as speeds, densities, or other
quantities where the reciprocal of the average is meaningful.
(ii) Minimizes the Impact of Large Values: It tends to minimize the
impact of large values compared to the arithmetic mean. This is beneficial
when large values could distort the overall average.
(iii) Emphasizes Small Values: It gives more weight to smaller values in a
data set, which can be useful in situations where lower values are more
significant or indicative.


(iv) Mathematically Robust: It is less affected by extreme values than the


arithmetic mean in datasets where the values are rates or ratios.

Disadvantages
(i) Sensitivity to Zero Values: The harmonic mean cannot be computed
if any value in the dataset is zero, as it involves division by the values.

(ii) Less Intuitive: It is less intuitive than the arithmetic mean and is not as
commonly used, which can make interpretation and communication more
challenging.

(iii) Not Suitable for All Data Types: It is not suitable for data that does
not represent rates or ratios. It is not typically used for general numerical
data where other means are more appropriate.
(iv) Potential for Misleading Results: In cases where there is significant
variability in the data, particularly if large values are present, the har-
monic mean can provide misleading results.
(v) Complex Calculation: The calculation of the harmonic mean is more
complex compared to the arithmetic mean, which can be a drawback in
some practical applications.

3.2.4 Geometric Mean


The geometric mean is a measure of central tendency that is useful for
datasets with exponential growth or multiplicative effects. It is defined as the
n-th root of the product of n values. The formula for the geometric mean x̄GM
of n values x1 , x2 , . . . , xn is:

x̄GM = (x1 × x2 × · · · × xn)^{1/n} = ( ∏_{i=1}^{n} xi )^{1/n}

Using the recovery times 4, 5, 6, 7, 8 days, we calculate the geometric mean to


find the average growth rate of recovery.

The geometric mean is calculated as follows:



x̄GM = (4 × 5 × 6 × 7 × 8)^{1/5} = 6720^{1/5} ≈ 5.83

So, the geometric mean of the recovery times is approximately 5.83 days,
reflecting the average multiplicative rate of recovery over time.


Advantages and Disadvantages of Geometric Mean


Advantages
(i) Appropriate for Multiplicative Processes: The geometric mean is
ideal for data that involves multiplicative processes, such as rates of
growth, financial returns, or other situations where values are multiplied.
(ii) Mitigates the Effect of Extreme Values: It reduces the impact of
extremely high or low values in the dataset, which can provide a more
balanced measure when dealing with skewed data.

(iii) Handles Proportional Relationships: It is useful for datasets where

values are in proportional or percentage terms, such as in economic or
financial analysis.
(iv) Stability in Long-Term Growth Rates: In contexts like investment
returns, the geometric mean offers a more accurate measure of average
growth rates over time compared to the arithmetic mean.
Disadvantages
(i) Cannot Handle Zero or Negative Values: The geometric mean is
undefined for datasets containing zero or negative values, as it involves
taking the nth root of the product of values.
(ii) Less Intuitive: It is less intuitive and harder to understand compared
to the arithmetic mean, making it less accessible for some audiences.
(iii) Requires Logarithmic Transformation: Calculating the geometric
mean involves the logarithm of values, which adds complexity compared
to simpler averages.

(iv) Sensitive to Variability: While it reduces the impact of extreme values,


it may still be affected by large variability in the dataset, especially if
values differ greatly.
(v) Not Suitable for All Data Types: It is not appropriate for all types
of data, particularly where the data do not naturally fit a multiplicative
model or where additive relationships are more relevant.

3.2.5 Relationships Between Arithmetic Mean, Geomet-


ric Mean, and Harmonic Mean
Let x1 , x2 , . . . , xn be positive real numbers. Theorem 3.1 states that the
means of a set of positive numbers satisfy the following inequality


x̄ ≥ x̄GM ≥ x̄HM ,

where x̄, x̄GM , and x̄HM represent the arithmetic mean, geometric mean, and
harmonic mean, respectively. This inequality reflects a fundamental property
of these measures of central tendency.

Specifically, the arithmetic mean is always greater than or equal to the


geometric mean, which in turn is greater than or equal to the harmonic mean.
Importantly, equality holds in these inequalities if and only if all the elements
in the dataset are equal. In other words, x̄ = x̄GM = x̄HM if and only if
x1 = x2 = · · · = xn .

Theorem 3.1. Let x1 , x2 , . . . , xn be positive real numbers. Let x̄AM denote
the arithmetic mean, x̄GM denote the geometric mean, and x̄HM denote the
harmonic mean of these numbers. Then,

x̄AM ≥ x̄GM ≥ x̄HM


Equality holds in both inequalities if and only if x1 = x2 = · · · = xn .
Proof. We will prove the inequality in two parts.

Part 1: Proof of x̄AM ≥ x̄GM (AM-GM Inequality)


We will use the property that for a convex function f , Jensen’s inequality
holds:

f( (1/n) Σ_{i=1}^{n} xi ) ≤ (1/n) Σ_{i=1}^{n} f(xi)

Consider the function f (x) = − ln(x) for x > 0. The first derivative is
f′(x) = −1/x, and the second derivative is f″(x) = 1/x². Since f″(x) > 0 for all
x > 0, the function f(x) = −ln(x) is strictly convex on the interval (0, ∞).

Applying Jensen’s inequality to f (x) = − ln(x) and the positive numbers


x1 , x2 , . . . , xn :

− ln( (1/n) Σ_{i=1}^{n} xi ) ≤ (1/n) Σ_{i=1}^{n} ( − ln(xi) )

− ln( x̄AM ) ≤ −(1/n) Σ_{i=1}^{n} ln(xi)

Using the property of logarithms Σ_{i=1}^{n} ln(xi) = ln( ∏_{i=1}^{n} xi ):

− ln( x̄AM ) ≤ −(1/n) ln( ∏_{i=1}^{n} xi )

− ln( x̄AM ) ≤ − ln( ( ∏_{i=1}^{n} xi )^{1/n} )

− ln( x̄AM ) ≤ − ln( x̄GM )
Since the natural logarithm function ln(x) is strictly increasing, the function
− ln(x) is strictly decreasing. Therefore, multiplying by −1 and reversing the
inequality sign gives:
ln (x̄AM ) ≥ ln (x̄GM ) .
Again, due to the strictly increasing nature of ln(x), we conclude:

x̄AM ≥ x̄GM

Equality in Jensen’s inequality for a strictly convex function holds if and


only if all the arguments are equal, i.e., x1 = x2 = · · · = xn . Therefore, equal-
ity in x̄AM ≥ x̄GM holds if and only if x1 = x2 = · · · = xn .

Part 2: Proof of x̄GM ≥ x̄HM (GM-HM Inequality)
Consider the reciprocals of the positive numbers x1 , x2 , . . . , xn , which are
1/x1 , 1/x2 , . . . , 1/xn . Applying the AM-GM inequality (proven in Part 1) to these
positive numbers, we have:
( 1/x1 + 1/x2 + · · · + 1/xn ) / n ≥ ( (1/x1) · (1/x2) · · · · · (1/xn) )^{1/n}

( Σ_{i=1}^{n} 1/xi ) / n ≥ ( 1 / ∏_{i=1}^{n} xi )^{1/n}

( Σ_{i=1}^{n} 1/xi ) / n ≥ 1 / ( ∏_{i=1}^{n} xi )^{1/n}

( Σ_{i=1}^{n} 1/xi ) / n ≥ 1 / x̄GM
Now take the reciprocal of both sides of the inequality. Since both sides are
positive, the inequality sign reverses:

n / ( Σ_{i=1}^{n} 1/xi ) ≤ x̄GM

By definition, the left side is the harmonic mean x̄HM :

x̄HM ≤ x̄GM

Which is equivalent to:


x̄GM ≥ x̄HM


The equality in the AM-GM inequality applied to 1/x1 , 1/x2 , . . . , 1/xn holds if and
only if 1/x1 = 1/x2 = · · · = 1/xn . This condition is equivalent to x1 = x2 = · · · = xn .
Therefore, equality in x̄GM ≥ x̄HM holds if and only if x1 = x2 = · · · = xn .

Combining Part 1 and Part 2, we have proven the AM-GM-HM inequality:

x̄AM ≥ x̄GM ≥ x̄HM

with equality holding throughout if and only if x1 = x2 = · · · = xn .
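As a quick numerical check of Theorem 3.1, the three means of the recovery-time data from the earlier examples can be computed with NumPy and SciPy. This is a minimal sketch using the gmean and hmean functions (also used later in Section 3.2.13):

import numpy as np
from scipy.stats import gmean, hmean

# Recovery times (in days) from the running example
times = [4, 5, 6, 7, 8]

am = np.mean(times)   # arithmetic mean: 6.0
gm = gmean(times)     # geometric mean: about 5.83
hm = hmean(times)     # harmonic mean: about 5.65

# The AM-GM-HM inequality of Theorem 3.1 holds
print(am >= gm >= hm)  # True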

3.2.6 Median

The median is a measure of central tendency that divides a dataset into two
equal halves. Let x1 , x2 , . . . , xn be a set of n numerical observations. To find
the median, first arrange the data in ascending (or descending) order. Let the
ordered data set be denoted by x(1) , x(2) , . . . , x(n) , where x(i) is the i-th value
in the ordered set.
AF
The median is then defined as follows:
Let m = (n + 1)/2. Then

Median = x(m), if m is an integer (i.e., n is odd),
Median = ( x(⌊m⌋) + x(⌈m⌉) )/2, if m is not an integer (i.e., n is even).

Here, ⌊m⌋ is the floor function (the greatest integer less than or equal
to m), and ⌈m⌉ is the ceiling function (the smallest integer greater than
or equal to m).
DR

When there is an odd number of observations, the median is simply the


middle value. For example, consider a dataset of seven blood pressure readings:

110, 115, 120, 125, 130, 135, 140.

Since there are seven observations (an odd number), the median is the fourth
value. In this case, m = (n + 1)/2 = (7 + 1)/2 = 4, which is an integer and hence, the
median is

Median = x(4) = 125.

Conversely, when there is an even number of observations, there is no single


middle value, so the median is calculated as the average of the two middle values.
For instance, in a dataset of eight cholesterol levels:

200, 210, 220, 225, 230, 240, 250, 260,

57
CHAPTER 3. DATA EXPLORATION: NUMERICAL MEASURES

the median is found by taking the average of the fourth and fifth values. In this
case, m = (n + 1)/2 = (8 + 1)/2 = 4.5 is not an integer. Hence, the median is

Median = ( x(⌊4.5⌋) + x(⌈4.5⌉) )/2 = ( x(4) + x(5) )/2 = (225 + 230)/2 = 227.5
The median is particularly useful in data science for identifying a typical value
that is not distorted by extreme observations.
Problem 3.2. Consider the following data on the number of hours of sleep per
night for a group of 9 adults:

7, 8, 5, 6, 9, 7, 6, 10, 8
What is the median hours of sleep per night?

Solution
First, arrange the data in ascending order:
5, 6, 6, 7, 7, 8, 8, 9, 10
In this case, n = 9, so m = (n + 1)/2 = (9 + 1)/2 = 5, which is an integer. Hence,

Median = x(5) = 7 hours

Problem 3.3. Consider the following data on the test scores of 6 students:
85, 92, 78, 95, 88, 80

What is the median test score?

Solution
First, arrange the data in ascending order:
78, 80, 85, 88, 92, 95
In this case, n = 6, so m = (n + 1)/2 = (6 + 1)/2 = 3.5, which is not an integer.
Hence, we use the second case of the median formula,

Median = ( x(⌊3.5⌋) + x(⌈3.5⌉) )/2 = ( x(3) + x(4) )/2

From the ordered data, x(3) = 85 and x(4) = 88. Therefore,

Median = (85 + 88)/2 = 173/2 = 86.5
The median test score is 86.5.


3.2.7 Advantages and Disadvantages of Median


Advantages of Median
(i) Robust to Outliers: The median is not affected by extreme values or
outliers, providing a better central measure for skewed distributions.
(ii) Simple to Compute: It is easy to calculate, especially for small datasets,
by finding the middle value when data is ordered.
(iii) Represents the 50th Percentile: It divides the dataset into two equal
halves, making it a useful measure for understanding the data’s center.

(iv) Applicable to Ordinal Data: Can be used with ordinal data, where
data values are ranked but not necessarily numeric.

Disadvantages of Median
(i) Ignores Data Values: Does not take into account the magnitude of all
AF data values, only their order.

(ii) Less Informative for Symmetric Distributions: Provides less infor-


mation about the distribution of data than the mean, especially if the
data is symmetric.
(iii) Not Suitable for Further Mathematical Operations: Unlike the
mean, the median cannot be easily used in further mathematical or sta-
tistical calculations.
(iv) Not Unique for Small Datasets: In small datasets with an even num-
ber of values, there may be two middle values, which can complicate
interpretation.

3.2.8 Mode
The mode is the value that appears most frequently in a dataset. A dataset can
have more than one mode if multiple values have the same highest frequency.
The mode is useful for categorical data or when identifying the most common
value.
Problem 3.4. Consider a manufacturing process where engineers are measur-
ing the diameter of a set of machine components to ensure they meet quality
specifications. The following is a list of diameters (in millimeters) of 20 com-
ponents that were measured:

50, 52, 51, 50, 53, 52, 54, 50, 51, 52, 55, 50, 53, 50, 51, 52, 55, 50, 52, 51

Find the mode of this dataset.


Solution
To find the mode of this dataset:

50, 52, 51, 50, 53, 52, 54, 50, 51, 52, 55, 50, 53, 50, 51, 52, 55, 50, 52, 51

Count the frequency of each diameter:


• Diameter 50 occurs 6 times.

• Diameter 52 occurs 5 times.

• Diameter 51 occurs 4 times.

• Diameters 53 and 55 occur 2 times each.

• Diameter 54 occurs 1 time.


We see the diameter 50 appears most frequently (6 times). Thus, the mode
of this dataset is 50 mm.
Problem 3.5. Consider the following data on the blood types of 15 individuals:

A, O, B, AB, O, A, B, O, A, A, B, O, O, A, B

What is the mode of the blood types?

Solution
The frequency distribution of blood types is:
• A: 5

• B: 4

• O: 5

• AB: 1
Since blood types A and O both have the highest frequency, the dataset is
bimodal:
Mode = A and O
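For categorical data such as blood types, Python's standard library offers statistics.multimode (available from Python 3.8), which returns every value tied for the highest frequency. A short sketch for the blood-type data above:

from statistics import multimode

blood_types = ["A", "O", "B", "AB", "O", "A", "B", "O",
               "A", "A", "B", "O", "O", "A", "B"]

# multimode returns all values with the highest frequency,
# in the order in which they first appear
print(multimode(blood_types))  # ['A', 'O']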

3.2.9 Advantages and Disadvantages of Mode


Advantages of Mode
(i) Easy to Identify: The mode is straightforward to determine as it is
simply the most frequently occurring value in a dataset.
(ii) Applicable to All Data Types: Can be used with nominal, ordinal,
and some quantitative data, making it versatile.


(iii) Reflects Commonality: Represents the value or values that occur most
often, which can be useful for understanding common trends or prefer-
ences.
(iv) Handles Categorical Data: Ideal for categorical data where numerical
calculations are not applicable.

Disadvantages of Mode
(i) May Not Be Unique: A dataset can have more than one mode or no
mode at all, which can complicate interpretation.

(ii) Not Useful for Continuous Data: Less useful for continuous data
with many unique values, as identifying the most frequent value can be
challenging.
(iii) Does Not Reflect Data Distribution: Does not provide information
about the spread or shape of the data distribution.
(iv) Insensitive to Changes: The mode does not account for changes in
data that do not affect frequency, potentially overlooking variations.

3.2.10 Choosing the Ideal Measure of Central Tendency


The choice of the ideal measure of central tendency depends on the nature of
the data and the analysis objectives. The primary measures are the arithmetic
mean, median, and mode. Here is a guide to selecting the most appropriate
measure:

• Arithmetic Mean: Best for symmetric distributions and numerical


data.

• Median: Best for skewed distributions and when robustness to outliers


is needed.

• Mode: Best for categorical data and identifying the most frequent
values.

3.2.11 Weighted Mean


The weighted mean, or weighted average, is a measure of central tendency where
each data point contributes to the average based on its assigned weight. Unlike
the arithmetic mean, which treats all data points equally, the weighted mean
takes into account the relative importance of each data point. Mathematically,
the weighted mean is defined as:


x̄WM = ( Σ_{i=1}^{n} wi xi ) / ( Σ_{i=1}^{n} wi )

where:
• xi represents the i-th data point,

• wi is the weight associated with the i-th data point,

• n is the number of data points.
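In Python, the weighted mean can be computed directly with NumPy's np.average, which accepts a weights argument. A minimal sketch, using the blood-pressure data of Problem 3.6 below:

import numpy as np

values = [120, 130, 140]   # group means (mmHg)
weights = [30, 25, 20]     # group sizes used as weights

# np.average computes sum(w*x) / sum(w)
print(np.average(values, weights=weights))  # about 128.67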

Problem 3.6. A biostatistician is analyzing the average blood pressure readings
of three different age groups in a study. The average blood pressure for each age
group and the number of individuals in each group are as follows:
• Age Group 1: Average blood pressure = 120 mmHg, Number of individuals = 30

• Age Group 2: Average blood pressure = 130 mmHg, Number of individuals = 25

• Age Group 3: Average blood pressure = 140 mmHg, Number of individuals = 20
Calculate the weighted mean of the average blood pressure across all age groups.

Solution:
The weighted mean is

x̄WM = ((120 × 30) + (130 × 25) + (140 × 20)) / (30 + 25 + 20)
= (3600 + 3250 + 2800)/75 = 9650/75 ≈ 128.67 mmHg
Therefore, the weighted mean blood pressure is approximately 128.67 mmHg.
Problem 3.7. A clinical trial evaluates the efficacy of a new drug. The efficacy
rates are measured as follows:
• Trial 1: Efficacy = 85%, Weight = 10

• Trial 2: Efficacy = 90%, Weight = 20

• Trial 3: Efficacy = 80%, Weight = 15


Find the weighted mean efficacy rate of the drug.


Solution:
The weighted mean is
x̄WM = ((85 × 10) + (90 × 20) + (80 × 15)) / (10 + 20 + 15)
= (850 + 1800 + 1200)/45 = 3850/45 ≈ 85.56%
Hence, the weighted mean efficacy rate is approximately 85.56%.
Problem 3.8. A public health researcher records the incidence rates of a disease
in three different regions. The incidence rates and the population sizes for each

region are:
• Region A: Incidence rate = 0.02 cases per person, Population = 50, 000

• Region B: Incidence rate = 0.03 cases per person, Population = 75, 000

• Region C: Incidence rate = 0.04 cases per person, Population = 25, 000
Calculate the weighted mean incidence rate across the three regions.
Solution:
The weighted mean is
x̄WM = ((0.02 × 50,000) + (0.03 × 75,000) + (0.04 × 25,000)) / (50,000 + 75,000 + 25,000)
= (1000 + 2250 + 1000)/150,000 = 4250/150,000 ≈ 0.02833 cases per person
Hence, the weighted mean incidence rate is approximately 0.02833 cases per person.
Problem 3.9. An academic advisor evaluates the performance of students

based on their grades in three courses, with different credit hours for each course:
• Course 1: Grade = 85, Credits = 3

• Course 2: Grade = 90, Credits = 4

• Course 3: Grade = 80, Credits = 2


Compute the weighted mean grade.

Solution:
The weighted mean is
x̄WM = ((85 × 3) + (90 × 4) + (80 × 2)) / (3 + 4 + 2)
= (255 + 360 + 160)/9 = 775/9 ≈ 86.11

Hence, the weighted mean grade is approximately 86.11.


3.2.12 Measures of Central Tendency for Grouped Data


Refer to Example 2.4. Suppose we have the weekly expenditure of 30 students,
and the frequency distribution of this weekly expenditure is presented in Table
3.1. The mean, median, and mode can be calculated in the following ways:

Table 3.1: Distribution of Weekly Expenditure of 30 Students

Class Interval | Frequency (fi) | Cumulative Frequency (cfi) | Midpoint (mi)
360 - 372 | 4 | 4 | 366
372 - 384 | 3 | 7 | 378
384 - 396 | 9 | 16 | 390
396 - 408 | 5 | 21 | 402
408 - 420 | 5 | 26 | 414
420 - 432 | 4 | 30 | 426
Total | 30 | |

Mean Estimation
The mean for grouped data is given by the formula:
x̄ = Σ fi mi / Σ fi

where fi is the frequency and mi is the midpoint of each class interval.



m1 = (360 + 372)/2 = 366
m2 = (372 + 384)/2 = 378
m3 = (384 + 396)/2 = 390
m4 = (396 + 408)/2 = 402
m5 = (408 + 420)/2 = 414
m6 = (420 + 432)/2 = 426

x̄ = ((4 × 366) + (3 × 378) + (9 × 390) + (5 × 402) + (5 × 414) + (4 × 426)) / 30


x̄ = (1464 + 1134 + 3510 + 2010 + 2070 + 1704)/30 = 11892/30 = 396.4

Median Estimation
The median for grouped data is given by the formula:
Median = l + ( (n/2 − cf)/f ) × h
where l is the lower boundary of the median class, n is the total frequency,
cf is the cumulative frequency before the median class, f is the frequency of
the median class, and h is the class width.

Median class: 384 − 396


l = 384, n = 30, cf = 7, f = 9, h = 12
Median = 384 + ((30/2 − 7)/9) × 12 = 384 + ((15 − 7)/9) × 12
= 384 + (8/9) × 12 = 384 + 10.67 = 394.67

Mode Estimation
The mode for grouped data is given by the formula:
 
Mode = l + ( (f1 − f0) / ((f1 − f0) + (f1 − f2)) ) × h

where l is the lower boundary of the modal class, f1 is the frequency of the
modal class, f0 is the frequency of the class before the modal class, f2 is the
frequency of the class after the modal class, and h is the class width.

Modal class: 384 − 396


l = 384, f1 = 9, f0 = 3, f2 = 5, h = 12

Mode = 384 + ( (9 − 3)/((9 − 3) + (9 − 5)) ) × 12 = 384 + (6/10) × 12 = 384 + 7.2 = 391.2
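These grouped-data estimates can be reproduced with a short NumPy sketch. The class boundaries and frequencies below are taken from Table 3.1; the interpolation assumes the modal class is neither the first nor the last class:

import numpy as np

lower = np.array([360, 372, 384, 396, 408, 420])  # lower class boundaries
freq = np.array([4, 3, 9, 5, 5, 4])               # class frequencies
h = 12                                            # class width
mid = lower + h / 2                               # class midpoints

n = freq.sum()

# Mean: frequency-weighted average of the midpoints
mean = (freq * mid).sum() / n

# Median: interpolate within the class containing the (n/2)-th value
cum = np.cumsum(freq)
k = np.searchsorted(cum, n / 2)          # index of the median class
cf = cum[k - 1] if k > 0 else 0          # cumulative frequency before it
median = lower[k] + (n / 2 - cf) / freq[k] * h

# Mode: interpolate within the class with the largest frequency
m = freq.argmax()
mode = lower[m] + (freq[m] - freq[m - 1]) / (
    (freq[m] - freq[m - 1]) + (freq[m] - freq[m + 1])) * h

print(mean, median, mode)  # 396.4  394.67 (approx)  391.2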


3.2.13 Python Code: Mean, Median and Mode


In this section, we will demonstrate how to compute the arithmetic mean, ge-
ometric mean, harmonic mean, median, and mode using Python. Consider the
following dataset:
{10, 15, 15, 20, 25, 30}
import numpy as np
from scipy.stats import gmean, hmean, mode

# Sample data
data = [10, 15, 15, 20, 25, 30]

# Arithmetic Mean
arithmetic_mean = np.mean(data)
print(f"Arithmetic Mean: {arithmetic_mean}")

# Geometric Mean
geometric_mean = gmean(data)
print(f"Geometric Mean: {geometric_mean}")

# Harmonic Mean
harmonic_mean = hmean(data)
print(f"Harmonic Mean: {harmonic_mean}")

# Median
median = np.median(data)
print(f"Median: {median}")

# Mode (scipy.stats.mode returns the value and its count)
mode_value, count = mode(data)
print(f"Mode: {mode_value}")


The output of the code will be:
• Arithmetic Mean: 19.1667

• Geometric Mean: 17.9768

• Harmonic Mean: 16.8224

• Median: 17.5

• Mode: 15

3.3 Exercises
1. Suppose we have the following dataset representing the scores of students
in a test:
85, 90, 78, 92, 88, 76, 95, 89, 84, 91


(a) Calculate the arithmetic mean of the test scores.


(b) Determine the median of the test scores.
2. Consider the following dataset representing the monthly salaries (in thou-
sands of dollars) of 12 employees at a company:

45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100

(a) Calculate the arithmetic mean of the salaries.


(b) Determine the median salary.
3. A company measures the speed of data transmission across four networks

as follows (in Mbps):
10, 20, 30, 40
(a) Calculate the arithmetic mean of the data transmission speeds.
(b) Compute the harmonic mean of the data transmission speeds.
(c) Calculate the median of the data transmission speeds.

4. The following dataset represents the time (in hours) taken by a worker to
complete four tasks:
4, 6, 8, 12
(a) Calculate the arithmetic mean of the task completion times.
(b) Compute the harmonic mean of the task completion times.
5. A company tracks the monthly returns (in percentage) on three different
investments over a year:
4%, 8%, 12%

(a) Calculate the geometric mean of the monthly returns.


(b) Compute the harmonic mean of the monthly returns.
6. A survey reports the annual growth rates of four investments as follows
(in percentage):
5%, 10%, 15%, 20%

(a) Calculate the geometric mean of the annual growth rates.


(b) Compute the harmonic mean of the annual growth rates.
7. Consider the following dataset of the number of books read by a group of
15 students:
2, 3, 2, 5, 7, 2, 4, 5, 5, 3, 6, 7, 8, 2, 5

(a) Identify the mode(s) of the number of books read.


(b) Calculate the mean of the number of books read.


8. A class of 20 students recorded their scores on a recent quiz as follows:

82, 85, 88, 82, 90, 85, 92, 82, 88, 85, 87, 90, 82, 85, 92, 90, 87, 88, 85, 82

(a) Compute the mean of the quiz scores.


(b) Identify the mode of the quiz scores.
9. The following dataset represents the hours spent on three different activ-
ities by a student each week, with respective weights:
• Activity Hours:
10, 15, 20

• Weights:
2, 3, 1
Compute the weighted mean of the hours spent on activities.
10. A company reports the sales (in units) of three different products with
the following weights:
• Sales:
200, 300, 400
• Weights:
1, 3, 2
Calculate the weighted mean of the sales data.
11. Define the weighted mean. A student’s overall grade in a course is de-
termined by three components: homework, exams, and projects. The
weights assigned to each component are as follows: homework (30%), ex-
ams (50%), and projects (20%). The student scored 85 in homework, 90


in exams, and 80 in projects. Calculate the student’s overall weighted
mean grade.
12. A dataset is grouped into the following intervals with corresponding fre-
quencies:
• Class Intervals:

[10 − 20, 20 − 30, 30 − 40, 40 − 50]

• Frequencies:
[8, 15, 20, 12]

(a) Calculate the mean and median of the grouped data.


(b) Calculate the mode of the grouped data.

13. A student receives grade points in three subjects with the following weights:


• Grade points:
3.5, 3.7, 4
• Credit:
3, 4, 2
Compute an average of the grade points.
14. A healthcare provider tracks the number of patients visiting a clinic over
a week (in patients per day) as follows:

12, 15, 14, 17, 15, 18, 16

(a) Find the median number of patients per day.
(b) Determine the mode of the number of patients.

3.4 Measures of Dispersion or Variability


Measures of dispersion or variability describe the spread or distribution of data
points in a dataset. They describe the extent to which data values differ from
the central tendency, such as the mean or median. These measures help to un-
derstand how much the data varies around the central tendency (mean, median,
or mode).
They provide insights into the distribution’s spread and consistency. They
are categorized into absolute and relative measures. Here’s an overview of
both:
1. Absolute Measures of Dispersion

• Range: Maximum value - Minimum value



• Interquartile range (IQR): Q3 − Q1


• Semi-interquartile range : (Q3 − Q1 )/2
• Variance (or standard deviation)
• Mean Absolute Deviation (MAD)

2. Relative Measures of Dispersion

• Relative Range: Relative range = Range/Mean


• Quartile Coefficient of Dispersion: (Q3 − Q1 )/(Q3 + Q1 )
• Coefficient of variation
• Skewness
• Kurtosis


• Absolute Measures give the dispersion in the same units as the


data and include range, standard deviation (and variance), and
mean absolute deviation.

• Relative Measures provide dispersion relative to the mean or


other central values and include the coefficient of variation, rela-
tive range, and quartile coefficient of dispersion.

We will explain some important measures of dispersion in detail in the following


sections.

3.4.1 Range
The range is the simplest measure of variability and is calculated as the differ-
ence between the maximum and minimum values in a dataset. Mathematically,
the range is defined as:
Range = xmax − xmin
In a study measuring the blood glucose levels of 10 patients, the highest reading
is 120 mg/dL and the lowest reading is 85 mg/dL. So, the range is

Range = 120 − 85 = 35 mg/dL

3.4.2 Variance
Variance measures the average squared deviation of each data point from the
mean. It reflects how data points spread out around the mean. Mathematically,
the sample variance is denoted by s2 and is defined as:

s² = (1/(n − 1)) Σ_{i=1}^{n} (xi − x̄)² = (1/(n − 1)) ( Σ_{i=1}^{n} xi² − n x̄² )

where x̄ is the arithmetic mean, xi represents the data points, and n is the
number of data points.

Consider the weights of 5 patients: 65, 70, 75, 80, 85 kg. We have,
x̄ = (65 + 70 + 75 + 80 + 85)/5 = 75

s² = (1/(5 − 1)) [ (65 − 75)² + (70 − 75)² + (75 − 75)² + (80 − 75)² + (85 − 75)² ]
= (1/4) [100 + 25 + 0 + 25 + 100] = 250/4 = 62.5 kg²


3.4.3 Standard Deviation


The standard deviation is the positive square root of the variance. It provides
a measure of dispersion in the same units as the data. Mathematically, the
sample standard deviation is denoted by s and is defined as:
s = √( (1/(n − 1)) Σ_{i=1}^{n} (xi − x̄)² ) = √( (1/(n − 1)) ( Σ_{i=1}^{n} xi² − n x̄² ) )

Using the variance from the previous example, the sample standard deviation
is

s = √62.5 ≈ 7.91 kg
Problem 3.10. A physician collected an initial measurement of hemoglobin
(g/L) after the admission of 10 inpatients to a hospital’s department of cardiol-
ogy. The hemoglobin measurements were 139, 158, 120, 112, 122, 132, 97, 104,
159, and 129 g/L. Calculate the variance and standard deviation of hemoglobin
level.

Solution
The variance is calculated in the following way:

xi | xi²
139 | 19321
158 | 24964
120 | 14400
112 | 12544
122 | 14884
132 | 17424
97 | 9409
104 | 10816
159 | 25281
129 | 16641
Σ xi = 1272 | Σ xi² = 165684


The sample mean and variance:

x̄ = 1272/10 = 127.2 g/L

s² = (1/(n − 1)) ( Σ_{i=1}^{n} xi² − n x̄² ) = (1/(10 − 1)) (165684 − 10 × 127.2²) ≈ 431.7 (g/L)²

The sample standard deviation is s = √431.7 ≈ 20.8 g/L. Thus, the standard
deviation of hemoglobin for these 10 patients was 20.8 g/L.
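The same results follow in one line each with NumPy, using ddof=1 to get the sample (n − 1) versions; a quick sketch:

import numpy as np

hb = [139, 158, 120, 112, 122, 132, 97, 104, 159, 129]

print(np.mean(hb))          # 127.2
print(np.var(hb, ddof=1))   # about 431.7 (sample variance)
print(np.std(hb, ddof=1))   # about 20.8 (sample standard deviation)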

3.4.4 Measures of Variability for Grouped Data


Variance for Grouped Data
The variance for grouped data is given by the formula:

s² = Σ fi (mi − x̄)² / ( Σ fi − 1 )

To compute the sample variance for the grouped data, we define di = (mi − x̄) and hence

s² = Σ fi di² / ( Σ fi − 1 ).
Now, we can easily compute s2 . The detailed calculations are given in Table
3.2.
Table 3.2: Variance calculation for grouped data

fi | di = (mi − x̄) | di² | fi di²
4 | -30.4 | 924.16 | 3696.64
3 | -18.4 | 338.56 | 1015.68
9 | -6.4 | 40.96 | 368.64
5 | 5.6 | 31.36 | 156.80
5 | 17.6 | 309.76 | 1548.80
4 | 29.6 | 876.16 | 3504.64
Total = 30 | | | 10291.20


where
x̄ = 396.4
Sample variance:
s² = 10291.20/(30 − 1) = 10291.20/29 ≈ 354.87

Standard deviation for Grouped Data


Standard deviation is simply the positive square root of variance. Hence, the
standard deviation is defined as

s = √( Σ fi (mi − x̄)² / ( Σ fi − 1 ) )

Standard deviation:

s = √354.87 ≈ 18.84
3.5 Exercises
1. Given the following dataset of temperatures (in degrees Celsius) recorded
over a week:
22, 25, 19, 23, 24, 20, 21
(a) Calculate the range of the temperatures.
2. Consider the following dataset representing the scores of 8 students in a
test:
75, 85, 95, 80, 90, 70, 88, 92

(a) Compute the variance of the test scores.


(b) Determine the standard deviation of the test scores.

3. The following dataset represents the frequency distribution of exam scores:


• Class Intervals:

[50 − 60, 60 − 70, 70 − 80, 80 − 90, 90 − 100]

• Frequencies:
[6, 10, 15, 12, 7]

(a) Calculate the mean of the grouped data.


(b) Compute the variance and standard deviation of the grouped data.


4. A dataset of the number of hours spent by patients in a hospital (per day)


is as follows:
1.5, 2.0, 2.5, 3.0, 1.0, 2.2, 2.8
(a) Find the range of the hours spent in the hospital.
(b) Compute the variance of the dataset.

5. A company tracks the number of units sold each month for the last 6
months:
150, 170, 160, 180, 175, 165
(a) Calculate the standard deviation of the units sold.

6. A class of students took two different tests with the following scores:
• Test 1 Scores:
78, 82, 85, 88, 90
• Test 2 Scores:
72, 80, 78, 85, 90

(a) Compute the mean and median for both tests.


(b) Compute the variance and standard deviation for both tests.

7. Consider a dataset of monthly income grouped into intervals with corre-


sponding frequencies:
• Income Intervals:

[2000 − 3000, 3000 − 4000, 4000 − 5000, 5000 − 6000]



• Frequencies:
[5, 8, 12, 6]

(a) Calculate the mean income.


(b) Compute the variance and standard deviation of the grouped income
data.

8. A company records the monthly sales (in thousands of dollars) over the
last year:
45, 52, 48, 55, 50, 47, 53, 60, 49, 51, 54, 57
(a) Find the range of the monthly sales data.
(b) Find the standard deviation of the monthly sales data.


3.6 Measures of Distribution Shape


Skewness and kurtosis are important descriptive statistics that measure the
shape characteristics of a data distribution. Skewness quantifies the asymmetry
of the distribution around its mean. Kurtosis, on the other hand, measures the
“tailedness” of the distribution. Both measures provide critical insights into
the distribution’s shape, helping to understand the underlying characteristics
of the data.

3.6.1 Skewness
Skewness is a statistical measure that characterizes the degree of asymme-

try of a distribution around its mean. It indicates whether the data are
concentrated more on one side of the mean compared to the other.

Coefficient of Skewness
The coefficient of skewness is a standardized measure of skewness that allows
for comparison of the degree of asymmetry between different distributions. It
provides insight into the direction and extent of the skew of a data distribution.

There are several coefficients of skewness used to calculate skewness, each


providing a different perspective on the skewness of a distribution. Here are
two commonly used Coefficients:
(i). Fisher-Pearson Coefficient of Skewness

(ii). Pearson Median Coefficient of Skewness

Fisher-Pearson Coefficient of Skewness



The Fisher-Pearson coefficient measures the asymmetry of the distribution


around the mean using the formula:
Skewness (Sk) = ( n / ((n − 1)(n − 2)) ) Σ_{i=1}^{n} ((xi − x̄)/s)³        (3.1)
where:

• n is the number of observations,

• xi represents each data point,

• x̄ is the sample mean,

• s is the sample standard deviation.

The Fisher-Pearson coefficient standardizes the third central moment of the


distribution. A skewness of 0 indicates a perfectly symmetrical distribution.


Pearson Median Coefficient of Skewness


The Pearson median coefficient calculates skewness based on the mean and the
median:
 
Skewness (Sk) = 3 (x̄ − Median)/s        (3.2)
where:
• Median is the median of the data,

This method uses the difference between the mean and the median to gauge

skewness.

Interpretation
• Sk = 0: The distribution is symmetric.


• Sk > 0: Positive skewness (right-skewed distribution).

• Sk < 0: Negative skewness (left-skewed distribution).

Types of Skewness
Skewness can be categorized into three main types based on the direction of
the asymmetry:

Positive Skewness (Right Skewed)


A distribution with positive skewness has a tail that extends more towards the
higher values on the right side. Most of the data points are concentrated on
the left side of the mean.

• Characteristics: Mean > Median

• Visual Description: The tail on the right side of the distribution is


longer or fatter.

• Example: Income distribution where a few individuals earn much


more than the majority.

Zero Skewness (Symmetric Distribution)


A distribution with zero skewness is perfectly symmetrical around the mean.
The data points are evenly distributed on both sides of the mean.

• Characteristics: Mean = Median = Mode


• Visual Description: A bell-shaped curve typical of a normal distri-


bution.

• Example: Heights of a large, diverse group of people.

Negative Skewness (Left Skewed)


A distribution with negative skewness has a tail that extends more towards the
lower values on the left side. Most of the data points are concentrated on the
right side of the mean.

• Characteristics: Mean < Median

• Visual Description: The tail on the left side of the distribution is
longer or fatter.

• Example: Exam scores where most students score well, but a few
score significantly lower.
Problem 3.11. Consider the following dataset of the number of hours spent
studying per day by a group of students: 2, 4, 4, 4, 5, 6, 8, and 10 hours.
Calculate the skewness.

Solution
Fisher-Pearson Coefficient of Skewness: The mean is
x̄ = ( Σ_{i=1}^{n} xi ) / n = (2 + 4 + 4 + 4 + 5 + 6 + 8 + 10)/8 = 43/8 = 5.375 hours

Table 3.3: Calculations for Skewness

xi | x̄ | xi − x̄ | (xi − x̄)² | (xi − x̄)/s | ((xi − x̄)/s)³
2.0000 | 5.3750 | -3.3750 | 11.3906 | -1.3184 | -2.2914
4.0000 | 5.3750 | -1.3750 | 1.8906 | -0.5371 | -0.1549
4.0000 | 5.3750 | -1.3750 | 1.8906 | -0.5371 | -0.1549
4.0000 | 5.3750 | -1.3750 | 1.8906 | -0.5371 | -0.1549
5.0000 | 5.3750 | -0.3750 | 0.1406 | -0.1465 | -0.0031
6.0000 | 5.3750 | 0.6250 | 0.3906 | 0.2441 | 0.0146
8.0000 | 5.3750 | 2.6250 | 6.8906 | 1.0254 | 1.0781
10.0000 | 5.3750 | 4.6250 | 21.3906 | 1.8066 | 5.8968
Total: 43 | | | 45.875 | | 4.230


Using the above table, the variance (s²) is

s² = Σ_{i=1}^{n} (xi − x̄)² / (n − 1) = 45.875/(8 − 1) = 6.5536

and hence

s = √6.5536 = 2.56 hours

and

Σ_{i=1}^{n} ((xi − x̄)/s)³ = 4.230

Skewness formula:

Skewness = ( n / ((n − 1)(n − 2)) ) Σ_{i=1}^{n} ((xi − x̄)/s)³.

For n = 8:

Skewness = (8/((8 − 1)(8 − 2))) × 4.230 = (8/42) × 4.230 = 0.8057

The skewness of the dataset {2, 4, 4, 4, 5, 6, 8, 10} is approximately 0.8057.


This positive value indicates that the distribution is right-skewed, with a longer
tail on the right side.

Pearson Median Coefficient of Skewness: Since the dataset has an even


number of observations:
Median = (4 + 5)/2 = 4.5

Hence, the skewness is

Skewness (Sk) = 3 (x̄ − Median)/s = 3 × (5.375 − 4.5)/2.56 = 1.0254
The skewness of the dataset is approximately 1.0254. This positive value
indicates that the distribution is right-skewed, meaning it has a longer tail on
the right side.
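Both coefficients are easy to verify in Python. scipy.stats.skew with bias=False applies the same sample-size correction as Equation (3.1), and the Pearson median version of Equation (3.2) can be computed directly; a minimal sketch:

import numpy as np
from scipy.stats import skew

hours = [2, 4, 4, 4, 5, 6, 8, 10]

# Fisher-Pearson coefficient with the correction of Eq. (3.1)
print(skew(hours, bias=False))  # about 0.8057

# Pearson median coefficient of Eq. (3.2)
x_bar = np.mean(hours)
med = np.median(hours)
s = np.std(hours, ddof=1)
print(3 * (x_bar - med) / s)    # about 1.0254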

3.6.2 Kurtosis
Even with knowledge of central tendency, dispersion, and skewness, we still
don’t have a full understanding of a distribution. To gain a complete perspec-
tive on the shape of the distribution, we also need to consider kurtosis. Kurtosis
is a statistical measure that describes the shape, or peakedness, of the probabil-
ity distribution of a real-valued random variable. It indicates whether the data


are heavy-tailed or light-tailed relative to a normal distribution. A distribution


with positive kurtosis has a sharp peak and heavy tails, whereas a distribution
with negative kurtosis has a flatter peak and lighter tails compared to the nor-
mal distribution.

The sample version of Karl Pearson’s Measures of Kurtosis is

K = [ (1/n) Σ_{i=1}^{n} (xi − x̄)⁴ ] / [ (1/n) Σ_{i=1}^{n} (xi − x̄)² ]² − 3
Alternatively, many software programs (such as Excel’s KURT function, which
uses a bias-corrected formula) use the following formula to measure kurtosis.

K = ( n(n + 1) / ((n − 1)(n − 2)(n − 3)) ) Σ_{i=1}^{n} ((xi − x̄)/s)⁴ − 3(n − 1)² / ((n − 2)(n − 3))

Both formulas aim to measure the kurtosis of a dataset, but the second
formula includes additional terms to correct for bias, making it more accurate
for small sample sizes. The term −3 in the simpler formula adjusts for ex-
cess kurtosis, normalizing the kurtosis value to compare it against a normal
distribution.

Interpretation
• K = 0: The distribution has the same kurtosis as a normal distribution
(mesokurtic).

• K > 0: Leptokurtic distribution (more outliers, heavier tails).

• K < 0: Platykurtic distribution (fewer outliers, lighter tails).

Problem 3.12. Consider Problem 3.11. Calculate the kurtosis of the following
dataset, which represents the number of hours spent studying per day by a group
of students: 2, 4, 4, 4, 5, 6, 8, and 10 hours.

Solution
In the solution to Problem 3.11, the mean is x̄ = 5.375 hours and the standard
deviation is s = 2.56 hours. The calculation for the kurtosis is provided in Table
3.4.


Table 3.4: Calculations for Kurtosis

xi | x̄ | xi − x̄ | (xi − x̄)² | (xi − x̄)/s | ((xi − x̄)/s)⁴
2.0000 | 5.3750 | -3.3750 | 11.3906 | -1.3184 | 3.0209
4.0000 | 5.3750 | -1.3750 | 1.8906 | -0.5371 | 0.0832
4.0000 | 5.3750 | -1.3750 | 1.8906 | -0.5371 | 0.0832
4.0000 | 5.3750 | -1.3750 | 1.8906 | -0.5371 | 0.0832
5.0000 | 5.3750 | -0.3750 | 0.1406 | -0.1465 | 0.0005
6.0000 | 5.3750 | 0.6250 | 0.3906 | 0.2441 | 0.0036
8.0000 | 5.3750 | 2.6250 | 6.8906 | 1.0254 | 1.1055
10.0000 | 5.3750 | 4.6250 | 21.3906 | 1.8066 | 10.6535
Total: 43 | | | 45.8750 | | 15.0336
Using the above table, we have

Σ_{i=1}^{n} ((xi − x̄)/s)⁴ = 15.0336

Bias-Correction Factor = n(n + 1) / ((n − 1)(n − 2)(n − 3)) = (8 × 9)/(7 × 6 × 5) = 0.343

Correction Term = 3(n − 1)² / ((n − 2)(n − 3)) = (3 × 7²)/(6 × 5) = 147/30 = 4.9

So,

K = 0.343 × 15.0336 − 4.9 ≈ 0.2543
The excess kurtosis of the dataset is approximately 0.2543. The positive value
suggests that the distribution is leptokurtic. This means the distribution has
heavier tails and is more peaked compared to a normal distribution, indicating
a higher probability of extreme values or outliers.
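The bias-corrected formula above is what scipy.stats.kurtosis reports when called with bias=False (fisher=True is the default, so the result is excess kurtosis); a quick check:

from scipy.stats import kurtosis

hours = [2, 4, 4, 4, 5, 6, 8, 10]

# Bias-corrected excess kurtosis, matching the hand calculation
print(kurtosis(hours, fisher=True, bias=False))  # about 0.2543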

3.6.3 Coefficient of Variation


The coefficient of variation (CV) is a normalized measure of dispersion that
expresses the standard deviation as a percentage of the mean. Mathematically,
the CV is defined as:

CV = (s/x̄) × 100%


If the mean blood pressure is x̄ = 128.67 mmHg and the standard deviation
is s = 7.91 mmHg, the CV is:
CV = (7.91/128.67) × 100% ≈ 6.15%

3.7 Exercises
1. Given the following dataset representing the monthly number of new cus-
tomer sign-ups for a company:

12, 15, 14, 16, 21, 25, 30, 35, 40, 50

Calculate the skewness of the dataset. Use a statistical software or formula
for skewness calculation.
2. The following dataset represents the heights (in cm) of 10 students:

150, 155, 160, 165, 170, 175, 180, 185, 190, 195
Determine the kurtosis of the height dataset. Use a statistical software or
formula for kurtosis calculation.
3. Consider the dataset representing the weekly earnings (in dollars) of 5
freelancers:
400, 420, 450, 480, 500
(a) Calculate the mean and standard deviation of the earnings.
(b) Compute the coefficient of variation (CV) for the earnings dataset.
4. A company tracks monthly sales figures (in thousands of dollars) for a
year:

40, 42, 44, 45, 47, 50, 52, 55, 60, 65, 70, 75
(a) Calculate the skewness of the sales data.
(b) Determine the kurtosis of the sales data.
5. A dataset of exam scores for a class is given as follows:

88, 76, 92, 85, 79, 90, 82

(a) Find the mean and standard deviation of the exam scores.
(b) Compute the coefficient of variation (CV) for the exam scores.
6. The daily maximum temperatures (in degrees Celsius) for a week are:

18, 20, 22, 24, 26, 28, 30

(a) Compute the skewness of the temperature data.


(b) Determine the interquartile range (IQR) of the temperature data.


3.8 Quartiles, Percentiles, Deciles and Outlier


Detection
Quartiles and percentiles are used to summarize data distributions and identify
specific points within the data set.

3.8.1 Quartiles
Quartiles provide a concise summary of a data set, highlighting its central ten-
dency and variability without requiring a full description of the data. They help
identify the spread of the data, allowing you to see how values are distributed.

Quartiles divide a data set into four equal parts, each representing 25% of the
data. The range between the first quartile (Q1 ) and the third quartile (Q3 ) is
known as the interquartile range (IQR), which indicates where the middle 50%
of the data lies.

Suppose we have a dataset x1 , x2 , . . . , xn of size n, and let x(1) , x(2) , . . . , x(n)


denote the ordered version of this dataset. The kth quartile can be defined
as follows:

Qk = x(⌊m⌋) + f · ( x(⌈m⌉) − x(⌊m⌋) ),  k = 1, 2, 3

where

m = ( (n + 1)/4 ) × k
is the position in the ordered dataset, and:

• f = m − ⌊m⌋
• x(⌊m⌋) is the value at the integer part of the mth position

• x(⌈m⌉) is the value at the next position

If m is an integer, then f = 0, and in this case:

Qk = x(m)

is the value at the mth position of the ordered data set.

• Q1 (First Quartile): The value below which 25% of the data falls.

• Q2 (Second Quartile or Median): The value below which 50% of


the data falls.

• Q3 (Third Quartile): The value below which 75% of the data falls.


Remark 3.8.1. Ceiling Function: The ceiling function, denoted as ⌈m⌉, is


defined as:

⌈m⌉ = the smallest integer greater than or equal to m

Floor Function: The floor function, denoted as ⌊m⌋, is defined as:

⌊m⌋ = the largest integer less than or equal to m

For example, for the floor function: ⌊2.4⌋ = 2, and for the ceiling function:
⌈2.4⌉ = 3.
Problem 3.13. Suppose that a college placement office sent a questionnaire

to a sample of business school graduates requesting information on monthly
starting salaries. Table 3.5 shows the collected data.

Table 3.5: Monthly Starting Salaries for a Sample of 12 Business School Grad-
uates
AF Graduate
1
Monthly Starting Salary ($)
5850
2 5950
3 6050
4 5880
5 5755
6 5710
7 5890
DR

8 6130
9 5940
10 6325
11 5920
12 5880

Compute the quartiles of the monthly starting salary for the sample of 12
business college graduates.

Solution
To compute the quartiles, we first sort the data in ascending order:

5710, 5755, 5850, 5880, 5880, 5890, 5920, 5940, 5950, 6050, 6130, 6325


Quartile Computations
• Minimum: 5710

• Maximum: 6325

• Median (50th Percentile, Q2 ):


The median is the average of the 6th and 7th values:
Q2 = (5890 + 5920)/2 = 11810/2 = 5905

• First Quartile (25th Percentile, Q1 ):

For n = 12, the value of m = ((12 + 1)/4) × 1 = 3.25, i.e., the 3.25th position:

Q1 = x(3) + 0.25 · ( x(4) − x(3) )
= 5850 + 0.25 × (5880 − 5850)
= 5850 + 0.25 × 30
= 5850 + 7.5
= 5857.5

• Third Quartile (75th Percentile, Q3 ):


For n = 12, the value of m = ((12 + 1)/4) × 3 = 9.75, i.e., the 9.75th position:

Q3 = x(9) + 0.75 · ( x(10) − x(9) )
= 5950 + 0.75 × (6050 − 5950)
= 5950 + 0.75 × 100 = 5950 + 75
= 6025

The quartiles for the monthly starting salaries are:

• Minimum: 5710

• 1st Quartile (Q1 ): 5857.5

• Median (Q2 ): 5905

• 3rd Quartile (Q3 ): 6025

• Maximum: 6325
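NumPy reproduces these values when told to use the m = k(n + 1)/4 positioning rule from Section 3.8.1, which corresponds to method='weibull' (the method argument requires NumPy 1.22 or newer; the default method='linear' positions by (n − 1) and gives slightly different quartiles). A sketch:

import numpy as np

salaries = [5850, 5950, 6050, 5880, 5755, 5710,
            5890, 6130, 5940, 6325, 5920, 5880]

# method='weibull' implements the (n + 1)-based positions used above
q1, q2, q3 = np.percentile(salaries, [25, 50, 75], method="weibull")
print(q1, q2, q3)  # 5857.5 5905.0 6025.0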

3.8.2 Percentiles
Percentiles divide a data set into 100 equal parts, providing a more detailed
breakdown of the data distribution. A kth percentile can be defined as



Pk = x(⌊m⌋) + f · ( x(⌈m⌉) − x(⌊m⌋) ),  k = 1, 2, . . . , 99

where

m = ( (n + 1)/100 ) × k

is the position in the ordered data set.

• P10 (10th Percentile): The value below which 10% of the data falls.

• P25 (25th Percentile): The value below which 25% of the data falls.
(Q1)

• P50 (50th Percentile): The value below which 50% of the data falls.
(Median)

• P75 (75th Percentile): The value below which 75% of the data falls. (Q3)

• P90 (90th Percentile): The value below which 90% of the data falls.

3.8.3 Deciles
Deciles divide a data set into ten equal parts, representing specific percentiles.
A kth decile can be defined as

Dk = x(⌊m⌋) + f · ( x(⌈m⌉) − x(⌊m⌋) ),  k = 1, 2, . . . , 9

where

m = ( (n + 1)/10 ) × k

is the position in the ordered data set.

• D1 (1st Decile): The value below which 10% of the data falls.

• D2 (2nd Decile): The value below which 20% of the data falls.

• D3 (3rd Decile): The value below which 30% of the data falls.

• D4 (4th Decile): The value below which 40% of the data falls.

• D5 (5th Decile): The value below which 50% of the data falls. (Me-
dian)

• D6 (6th Decile): The value below which 60% of the data falls.

• D7 (7th Decile): The value below which 70% of the data falls.

• D8 (8th Decile): The value below which 80% of the data falls.


• D9 (9th Decile): The value below which 90% of the data falls.

• D10 (10th Decile): The value below which 100% of the data falls.
(Maximum)

3.8.4 Interquartile Range (IQR)


The IQR measures the range within which the middle 50% of the data falls,
from the first quartile (Q1 ) to the third quartile (Q3 ). Hence, the interquartile
range is defined as
IQR = Q3 − Q1

In a dataset of test scores: 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, the first quar-
tile (Q1 ) is 62.5 and the third quartile (Q3 ) is 87.5. Hence, the interquartile
range is
IQR = 87.5 − 62.5 = 25

3.8.5 Outlier Detection


In statistics, an outlier is a data point that is significantly different from other
observations. We can detect outliers using the interquartile range (IQR). Given
a dataset x1 , x2 , . . . , xn , the outlier detection can be described as follows:

Define Mild and Extreme Outliers:

• Mild Outlier: A data point xi is considered a mild outlier if

xi < Q1 − 1.5 × IQR or xi > Q3 + 1.5 × IQR

• Extreme Outlier: A data point xi is considered an extreme outlier



if
xi < Q1 − 3 × IQR or xi > Q3 + 3 × IQR

Problem 3.14. Consider the following systolic blood pressure (SBP) readings
(in mmHg):

165, 50, 110, 120, 125, 130, 135, 140, 145, 150, 155,

160, 175, 180, 185, 190, 195, 200, 115, 220, 170
Calculate the first and third quartiles and then determine mild and extreme
outliers.

Solution
We will calculate the quartiles, IQR, and determine mild and extreme outliers
by following the steps. The sorted data is:


50, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155,
160, 165, 170, 175, 180, 185, 190, 195, 200, 220
The number of data points is n = 21.

Finding Q1
m = ((n + 1)/4) × 1 = ((21 + 1)/4) × 1 = 5.5
Using the formula:

Q1 = x⌊5.5⌋ + f · (x⌈5.5⌉ − x⌊5.5⌋ )

where: ⌊5.5⌋ = 5, ⌈5.5⌉ = 6, f = 5.5 − 5 = 0.5. Substituting values:

Q1 = 125 + 0.5 · (130 − 125) = 125 + 0.5 · 5 = 125 + 2.5 = 127.5


Finding Q3

m = ((n + 1)/4) × 3 = ((21 + 1)/4) × 3 = (22/4) × 3 = 16.5

Using the formula:

Q3 = x⌊16.5⌋ + f · (x⌈16.5⌉ − x⌊16.5⌋ )

where ⌊16.5⌋ = 16, ⌈16.5⌉ = 17, f = 16.5 − 16 = 0.5. Substituting values:

Q3 = 180 + 0.5 · (185 − 180) = 180 + 0.5 · 5 = 180 + 2.5 = 182.5



Compute IQR:

IQR = Q3 − Q1 = 182.5 − 127.5 = 55


Determine Outlier Thresholds:
• Mild Outliers:

Lower Mild Threshold = Q1 − 1.5 × IQR = 127.5 − 1.5 × 55


= 127.5 − 82.5 = 45

Upper Mild Threshold = Q3 + 1.5 × IQR = 182.5 + 1.5 × 55


= 182.5 + 82.5 = 265

87
CHAPTER 3. DATA EXPLORATION: NUMERICAL MEASURES

• Extreme Outliers:

Lower Extreme Threshold = Q1 − 3 × IQR = 127.5 − 3 × 55


= 127.5 − 165 = −37.5

Upper Extreme Threshold = Q3 + 3 × IQR = 182.5 + 3 × 55


= 182.5 + 165 = 347.5

Identify Outliers:
• Mild Outlier: Mild outliers are values below 45 or above 265. The
readings do not have any mild outliers.

• Extreme Outlier: Extreme outliers are values below -37.5 or above

347.5. Again, there are no extreme outliers.
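The same fences can be computed programmatically. This is a small sketch of an IQR-based detector; it uses method='weibull' so that the quartiles match the (n + 1) rule used above:

import numpy as np

def iqr_outliers(data, factor=1.5):
    # factor=1.5 flags mild outliers, factor=3 extreme outliers
    q1, q3 = np.percentile(data, [25, 75], method="weibull")
    iqr = q3 - q1
    lower, upper = q1 - factor * iqr, q3 + factor * iqr
    return [x for x in data if x < lower or x > upper]

sbp = [165, 50, 110, 120, 125, 130, 135, 140, 145, 150, 155,
       160, 175, 180, 185, 190, 195, 200, 115, 220, 170]

print(iqr_outliers(sbp, factor=1.5))  # [] (no mild outliers)
print(iqr_outliers(sbp, factor=3))    # [] (no extreme outliers)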

3.8.6 Python Code: Dispersion Measures


In this section, we will demonstrate how to compute the range, variance, stan-
dard deviation, coefficient of variation, quartiles, interquartile range (IQR),
skewness, and kurtosis using Python. Consider the following dataset:

{10, 15, 15, 20, 25, 30}

import numpy as np
from scipy.stats import iqr, skew, kurtosis

# Sample data
data = [10, 15, 15, 20, 25, 30]

# Range
data_range = np.ptp(data)
print(f"Range: {data_range}")

# Variance (sample, ddof=1)
variance = np.var(data, ddof=1)
print(f"Variance: {variance}")

# Standard Deviation (sample, ddof=1)
std_deviation = np.std(data, ddof=1)
print(f"Standard Deviation: {std_deviation}")

# Coefficient of Variation
coef_of_variation = std_deviation / np.mean(data)
print(f"Coefficient of Variation: {coef_of_variation}")

# Quartiles
quartiles = np.percentile(data, [25, 50, 75])
print(f"Quartiles (25th, 50th, 75th): {quartiles}")

# Interquartile Range (IQR)
interquartile_range = iqr(data)
print(f"Interquartile Range (IQR): {interquartile_range}")

# Skewness
data_skewness = skew(data)
print(f"Skewness: {data_skewness}")

# Kurtosis (excess kurtosis)
data_kurtosis = kurtosis(data)
print(f"Kurtosis: {data_kurtosis}")
The output of the code will be:
• Range: 20

• Variance: 54.17

• Standard Deviation: 7.36

• Coefficient of Variation: 0.38

• Quartiles (25th, 50th, 75th): [15. 17.5 23.75]

• Interquartile Range (IQR): 8.75

• Skewness: 0.306

• Kurtosis: -1.15

3.9 Exercises
1. Given the following dataset representing the monthly expenses (in dollars)
of 12 households:
450, 600, 550, 620, 700, 480, 510, 540, 580, 660, 710, 690
Compute the first quartile, the second quartile ( or median), and the third
quartile of the dataset.
2. Given the following dataset representing the ages of 12 participants in a
study:
22, 25, 28, 30, 32, 35, 37, 40, 42, 45, 48, 50
Compute first quartile, third quartile and the interquartile range (IQR)
of the ages.


3. The dataset represents the heights (in cm) of 20 plants measured over a
period:

40, 42, 45, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63

Compute the 1st decile (D1), 5th decile (D5, which is the median), and
the 9th decile (D9) of the plant heights.
4. Consider the following dataset representing the number of books read by
10 students in a year:

12, 15, 18, 20, 22, 25, 28, 30, 35, 100

(a) Compute Q1 and Q3 .
(b) Identify any outliers in the dataset using the Interquartile Range
(IQR) method.
5. Given the following dataset of monthly rainfall (in mm) for 8 cities:
100, 110, 120, 150, 170, 190, 200, 220

(a) Compute the quartiles (Q1, Q2, Q3) for the dataset.
(b) Detect any potential outliers using the IQR method.
6. The following dataset represents the scores of 25 students in an exam:

45, 47, 48, 50, 51, 53, 55, 57, 58, 60, 62, 63, 65,

67, 68, 70, 72, 74, 76, 78, 80, 82, 84, 86, 88
(a) Calculate the 10th percentile (P10) and the 80th percentile (P80) of
the exam scores.

(b) Detect any potential outliers using the IQR method.


7. A dataset of annual salaries (in thousands of dollars) is as follows:

30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 110, 120, 130, 140, 150

(a) Find the 3rd decile (D3) and 7th decile (D7) of the salary data.
(b) Use the IQR method to detect any outliers in the salary data.
8. The following dataset represents the number of hours spent on the internet
per week by a sample of 20 people:

5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24

(a) Compute the quartiles (Q1, Q2, Q3) for the dataset.
(b) Find the 25th percentile (P25) and the 75th percentile (P75) of the
dataset.


3.10 Five-Number Summary and Boxplot


3.10.1 Five-Number Summary
The five-number summary is a set of descriptive statistics that provides a com-
prehensive overview of a dataset. It consists of the
(i). smallest (or minimum) value

(ii). first quartile (Q1 )

(iii). median (also called second quartile (Q2 ))

(iv). third quartile (Q3 )

(v). largest (or maximum) value


These five statistics help describe the spread and center of the data.
Problem 3.15. Consider the following data on the cholesterol levels (in mg/dL)
of 15 patients:
180, 195, 170, 200, 210, 175, 205, 190, 195, 220, 185, 215, 200, 190, 225
Find the five-number summary.

Solution
To find the five-number summary, we first arrange the data in ascending order:
170, 175, 180, 185, 190, 190, 195, 195, 200, 200, 205, 210, 215, 220, 225
• Minimum: The smallest value in the dataset.

Minimum = 170 mg/dL

• First Quartile (Q1 ): The median of the lower half of the dataset (not
including the median if the number of observations is odd).
Q1 = 185 mg/dL

• Median: The middle value of the dataset.


Median = 195 mg/dL

• Third Quartile (Q3 ): The median of the upper half of the dataset
(not including the median if the number of observations is odd).
Q3 = 210 mg/dL

• Maximum: The largest value in the dataset.


Maximum = 225 mg/dL
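A minimal Python sketch for the five-number summary follows; method='weibull' matches the quartiles found above, while NumPy's default linear method would give Q1 = 187.5 and Q3 = 207.5 for this dataset:

import numpy as np

chol = [180, 195, 170, 200, 210, 175, 205, 190,
        195, 220, 185, 215, 200, 190, 225]

q1, med, q3 = np.percentile(chol, [25, 50, 75], method="weibull")
print(min(chol), q1, med, q3, max(chol))  # 170 185.0 195.0 210.0 225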


3.10.2 Boxplot
A boxplot (or box-and-whisker plot) is a graphical representation of the five-
number summary. It displays the median, quartiles, and potential outliers.

Components of the Boxplot


• Minimum (Min): The lowest data point within the whiskers.

• First Quartile (Q1 ): The 25th percentile of the data.

• Median: The 50th percentile, dividing the dataset into two equal
halves.

• Third Quartile (Q3 ): The 75th percentile of the data.

• Maximum (Max): The highest data point within the whiskers.

• Outliers: Data points that fall outside the range of 1.5 times the IQR
from Q1 and Q3 .

Figure 3.1 below illustrates these components.

Figure 3.1: Detailed boxplot illustrating the distribution of a sample dataset with components: LL (Lower whisker), Q1, Median, Q3, UL (Upper whisker), and Outliers, where LL = Q1 − 1.5 × IQR and UL = Q3 + 1.5 × IQR

When interpreting a boxplot, several key aspects should be focused on to


understand the distribution and variability of the data. The boxplot provides
a visual summary of the dataset’s central tendency, dispersion, and skewness.
First, examine the central box, which represents the interquartile range (IQR)
where the middle 50% of the data falls, with the line inside the box showing


the median. The position of the median within the box indicates the data’s
skewness: if it is centered, the data is symmetrical; if skewed towards one
end, it shows skewness. Second, look at the length of the whiskers extending
from the box, which represent the range of the data within 1.5 times the IQR
from the quartiles; data points beyond this range are considered outliers, which
are plotted as individual points. Third, check for outliers and extreme values,
which are represented as dots outside the whiskers. A higher number of outliers
might suggest variability or anomalies in the data. Fourth, observe the width
of the box and the lengths of the whiskers to assess data spread and identify
potential data dispersion. Finally, compare multiple boxplots side-by-side to
analyze differences between groups, noting shifts in the median, variations in

the IQR, and the presence of outliers. By focusing on these elements, you can
gain insights into the data’s distribution characteristics, identify patterns, and
make informed conclusions about the underlying data trends.

Remark 3.10.1. The lines extending from either end of the box are called
whiskers. The term whisker plot is often used interchangeably with “box-
plot”, focusing on the whiskers. It highlights the range of data within 1.5 times
the IQR from the quartiles but might not always show the box or median. Both
terms generally refer to the same plot, but “boxplot” is the more comprehensive
term, including the full depiction of the quartiles and median along with the
whiskers.
3.10.3 Importance of Boxplots
Boxplots are essential tools for:
• Visualizing Data Distribution: They show the range, quartiles,
and outliers of the data.
• Comparing Distributions: They allow for comparisons between dif-
ferent groups or datasets.

• Detecting Outliers: They help identify unusual data points that may
need further investigation.

• Understanding Variability: They show the spread and central ten-


dency of the data.
Problem 3.16. Refer to Problem 3.13. The monthly starting salaries, sorted
in ascending order, are:

5710, 5755, 5850, 5880, 5880, 5890, 5920, 5940, 5950, 6050, 6130, 6325

The five-number summary for the monthly starting salaries is as follows:


• Minimum: 5710

• 1st Quartile (Q1 ): 5857.5

• Median (Q2 ): 5905

• 3rd Quartile (Q3 ): 6025

• Maximum: 6325

Draw the boxplot to represent this data and comment on your findings.

Solution
From the solution of Problem 3.13, the five-number summary for the monthly starting
salaries is as follows:

• Minimum: 5710

• 1st Quartile (Q1 ): 5857.5

• Median (Q2 ): 5905

• 3rd Quartile (Q3 ): 6025

• Maximum: 6325

To find the lower and upper whiskers (bounds), we first calculate the interquar-
tile range (IQR):

IQR = Q3 − Q1 = 6025 − 5857.5 = 167.5



Using 1.5 times the IQR, we can calculate the lower and upper whiskers as
follows:

Lower whisker = Q1 − 1.5 × IQR = 5857.5 − 1.5 × 167.5 = 5606.25

Upper whisker = Q3 + 1.5 × IQR = 6025 + 1.5 × 167.5 = 6276.25


Checking for outliers
• Lower outlier: Any value below 5606.25 would be a lower outlier. The
smallest value is 5710, so there are no lower outliers.

• Upper outlier: Any value above 6276.25. The value 6325 is greater
than 6276.25, so 6325 is an upper outlier.
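
These fences and the outlier check can be verified with a short Python sketch (again using NumPy's 'weibull' method, which reproduces the (n + 1)-based quartiles computed above):

import numpy as np

# Monthly starting salaries from Problem 3.16 (sorted)
salaries = [5710, 5755, 5850, 5880, 5880, 5890,
            5920, 5940, 5950, 6050, 6130, 6325]

q1 = np.percentile(salaries, 25, method='weibull')  # 5857.5
q3 = np.percentile(salaries, 75, method='weibull')  # 6025.0
iqr = q3 - q1                                       # 167.5

lower = q1 - 1.5 * iqr   # 5606.25
upper = q3 + 1.5 * iqr   # 6276.25

outliers = [x for x in salaries if x < lower or x > upper]
print(outliers)          # [6325]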


Figure 3.2: Boxplot of monthly starting salaries

Comment on the Boxplot: The boxplot of the monthly starting salaries re-
veals a positively skewed distribution. The median (Q2 = 5905) is positioned
slightly closer to the lower quartile, and the right whisker is longer than the left
whisker, which indicates that there are some higher salaries pulling the data to
the right.
The lower whisker extends to 5710, while the upper whisker reaches
6130. The interquartile range (IQR), between Q1 = 5857.5 and Q3 = 6025, is
fairly narrow, indicating that the middle 50% of the data are clustered together.
However, the longer upper whisker and the presence of an outlier at 6325 show
that there are some higher salaries that deviate from the general pattern.
The outlier (6325) is a clear indication of the positive skewness in the
data, suggesting that while most starting salaries are within a consistent range,

a few are considerably higher.


Problem 3.17. We consider the following data on the cholesterol levels (in
mg/dL) of 15 patients:

180, 195, 170, 200, 210, 175, 205, 190, 195, 220, 185, 215, 200, 190, 275

Calculate the five-number summary and draw a boxplot to represent this data.

Solution
Given the ordered cholesterol levels:

170, 175, 180, 185, 190, 190, 195, 195, 200, 200, 205, 210, 215, 220, 275

• Minimum = 170

• Maximum = 275


• Median (Q2 ): Since there are 15 data points (odd), the median is the
middle value:

Q2 = x((n+1)/2) = x((15+1)/2) = x(8) = 195

• First Quartile (Q1 ): The quartile position is

m = (n + 1)/4 = (15 + 1)/4 = 4

The 4th value is:

Q1 = x(4) = 185

• Third Quartile (Q3 ): The position is

m = 3 × (n + 1)/4 = 3 × (15 + 1)/4 = 12

Therefore,

Q3 = x(12) = 210
• Calculate the IQR:
IQR = Q3 − Q1 = 210 − 185 = 25

• Calculate the Lower whisker:


Lower whisker = Q1 − 1.5 × IQR = 185 − 1.5 × 25 = 185 − 37.5 = 147.5

• Calculate the Upper whisker:


Upper whisker = Q3 + 1.5 × IQR = 210 + 1.5 × 25 = 210 + 37.5 = 247.5
The value 275 is considered an outlier since it is greater than the upper
whisker of 247.5.

Figure 3.3: Boxplot of Cholesterol Levels


The boxplot of cholesterol levels given in Figure 3.3 provides a clear visual
representation of the data distribution. The spread of the data is captured
by the range from the minimum value (170 mg/dL) to the maximum value
(275 mg/dL), as well as by the interquartile range (IQR) of 25 mg/dL, which
measures the spread of the middle 50% of the data between the first quartile
(185 mg/dL) and the third quartile (210 mg/dL). The value 275 mg/dL lies
above the upper whisker (247.5 mg/dL) and is therefore an outlier. The median
(195 mg/dL) lies closer to the first quartile, suggesting a slight right skewness
in the data, as the upper quartile range is wider than the lower quartile range.
Overall, the data appears relatively symmetric but with a mild skew to the right,
indicating that higher cholesterol values are slightly more spread out than the
lower values.

3.10.4 Python Code: Boxplot
To create a boxplot for the given data in Problem 3.13 in Python, you can use
the matplotlib library. Here’s the Python code that will generate a boxplot
for the monthly starting salaries provided in the table:
import matplotlib.pyplot as plt

# Monthly starting salaries
salaries = [5850, 5950, 6050, 5880, 5755, 5710, 5890, 6130,
            5940, 6325, 5920, 5880]

# Create a horizontal boxplot
plt.figure(figsize=(10, 6))
plt.boxplot(salaries, vert=False, patch_artist=True,
            boxprops=dict(facecolor='lightblue', color='blue'),
            whiskerprops=dict(color='blue'),
            capprops=dict(color='blue'),
            medianprops=dict(color='red'))

# Add titles and labels
plt.title('Boxplot of Monthly Starting Salaries for Business School Graduates')
plt.xlabel('Monthly Starting Salary ($)')
plt.grid(True)

# Show plot
plt.show()


3.11 Exercises
1. Consider the following dataset representing the daily temperatures (in °C)
recorded over 15 days:

18, 21, 20, 22, 24, 19, 23, 25, 27, 26, 28, 30, 29, 31, 32

(a) Compute the five-number summary for this dataset, including the
minimum, first quartile (Q1), median, third quartile (Q3), and max-
imum.
(b) Draw a boxplot and comment on your findings.

2. The following dataset represents housing prices (in thousands of dollars)
in a neighborhood:

220, 230, 250, 270, 290, 310, 330, 350, 370, 400

(a) Compute the five-number summary.


(b) Calculate the interquartile range (IQR) of the housing prices.
(c) Draw a boxplot and comment on your findings.

3. The following dataset represents the test scores of 15 students:

55, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78, 80, 82, 84, 86

(a) Compute the five-number summary.


(b) Find the 25th percentile and the 90th percentile of the test scores.
(c) Draw a boxplot and comment on your findings.

4. Given the following dataset of monthly sales (in thousands of dollars) for
a retail store over 12 months:

25, 30, 28, 35, 33, 32, 31, 29, 37, 40, 42, 38

(a) Create a boxplot for this dataset.


(b) Interpret the boxplot, identifying any potential outliers and describ-
ing the distribution of the data.
5. The dataset below represents the number of goals scored by a soccer team
over 10 games:
1, 2, 3, 3, 4, 5, 6, 6, 7, 8
(a) Calculate the five-number summary.
(b) Draw a boxplot based on the five-number summary.
(c) Label the components of the boxplot, including the minimum, Q1,
median, Q3, maximum, and any potential outliers.


6. Consider the dataset representing the heights (in cm) of 20 individuals:


160, 165, 170, 175, 180, 185, 190, 195, 200, 205,
210, 215, 220, 225, 230, 235, 240, 245, 250, 255
(a) Compute the five-number summary for the dataset.
(b) Draw a boxplot and discuss the importance of the boxplot in visu-
alizing the spread and identifying outliers.
7. Analyze the following dataset representing the number of hours spent on
homework per week by 15 students:

5, 6, 7, 8, 8, 9, 10, 10, 11, 12, 12, 13, 14, 15, 20
(a) Create a boxplot for this dataset.
(b) Identify and describe the distribution of the data, including any po-
tential outliers.
8. The following dataset represents the weights (in kg) of 18 animals:
2, 3, 3, 4, 5, 5, 6, 6, 7, 8, 8, 9, 10, 10, 11, 12, 13, 15
(a) Use Python to compute the five-number summary and create a box-
plot for this dataset.
(b) Write a brief explanation of how Python can be used to visualize
data distributions using boxplots.

3.12 Concluding Remarks


In conclusion, this chapter has provided a comprehensive overview of key nu-

merical measures used in data exploration. By understanding and applying


measures of central tendency, dispersion, and distribution shape, we equip our-
selves with the tools needed to analyze and interpret data effectively. Each
measure offers unique insights and, when used in combination, provides a ro-
bust understanding of the dataset’s characteristics.

The exploration of quartiles, percentiles, deciles, and outlier detection fur-


ther enhances our ability to segment and scrutinize data. The five-number
summary and boxplots serve as powerful visual tools for summarizing and un-
derstanding data distributions. Mastery of these concepts, along with the ac-
companying Python code examples, will significantly contribute to your data
science skills and enhance your ability to perform thorough data analysis.

We encourage you to apply these techniques to various datasets and practice


interpreting the results to gain a deeper understanding of their implications.
The exercises at the end of this chapter are designed to reinforce your learning
and provide hands-on experience with these important data exploration tools.


3.13 Chapter Exercises


1. A physician collected an initial measurement of hemoglobin (g/L) after the
admission of 10 inpatients to a hospital’s department of cardiology. The
hemoglobin measurements were 139, 158, 120, 112, 122, 132, 97, 104, 159, 129.

(a) Calculate the mean and median hemoglobin levels.


(b) Find the range of the hemoglobin measurements.
(c) Compute the variance and standard deviation of the hemoglobin
levels.

2. The average number of patients visiting a hospital each day over a week
is recorded as 150, 160, 170, 140, 155, 165, 160.

(a) What is the mean number of patients per day?


(b) Calculate the variance and standard deviation of the number of pa-
tients.
(c) Find the range of the daily patient counts.

3. The average score of a group of students on a test is 78. If 4 new students


with scores of 85, 90, 75, 80 are added, what will be the new average score
for the group?

(a) Compute the new average score for the group.


(b) Find the variance and standard deviation of the test scores after
adding the new students.

4. Compare the arithmetic means of the following two datasets:



• Dataset A: 55, 60, 65, 70, 75


• Dataset B: 50, 60, 70, 80, 90

(a) Compare the means of Dataset A and Dataset B.


(b) Compute the range, variance, and standard deviation for Dataset A.
(c) Compute the range, variance, and standard deviation for Dataset B.

5. If a class has two sections with 25 and 30 students, and the average scores
for the sections are 85 and 90, respectively, find the weighted average
score for the entire class.

(a) Find the weighted average score for the entire class.
(b) Calculate the variance and standard deviation of the scores, assum-
ing that the scores in each section are the same.


6. A researcher collects the following data on the number of new cases of a


disease per month: 20, 25, 30, 35, 40, 45, 50.

(a) What is the median number of new cases?


(b) Compute the mean number of new cases.
(c) Find the range, variance, and standard deviation of the number of
new cases.

7. For the dataset of daily temperatures: 22, 24, 21, 25, 23, 26, 24, 27,

(a) Calculate the median temperature.

(b) Find the mean, range, variance, and standard deviation of the daily
temperatures.

8. Calculate the medians of the following datasets:

• Dataset A: 5, 8, 7, 6, 10
• Dataset B: 3, 6, 5, 7, 8, 9

(a) Find the median of Dataset A and Dataset B.


(b) For Dataset A and Dataset B, calculate the range, variance, and
standard deviation.

9. In a study on the number of hours spent on exercise per week by a group


of individuals, the hours are recorded as 4, 6, 5, 7, 9, 5, 6, 7.

(a) Find the median number of hours spent exercising per week.
(b) Compute the mean, range, variance, and standard deviation of the

hours spent on exercise.

10. A car travels at speeds of 50 km/h for the first part of the trip and
70 km/h for the second part. If the distances traveled are the same,
what is the harmonic mean of the two speeds?

(a) Calculate the harmonic mean of the two speeds.


(b) Discuss how the harmonic mean is used to find the average rate in
this context.

11. A biostatistics researcher records the growth rates of a certain plant


species over three months as 1.2, 1.5, 1.8.

(a) What is the geometric mean growth rate?


(b) Find the variance and standard deviation of the growth rates.


12. The annual returns of two investment portfolios over the past 3 years are
10%, 15%, 5% and 7%, 12%, 8%.

(a) Compute the geometric mean return for each portfolio.


(b) Compare the geometric means of the two portfolios and discuss their
implications.

13. A pharmaceutical company wants to calculate the average effectiveness


of a new drug across three different studies with effectiveness rates of
0.85, 0.90, 0.80.

(a) What is the geometric mean effectiveness?

(b) Compute the variance and standard deviation of the effectiveness
rates.

14. A nutritionist evaluates the average calorie intake of patients based on


three different dietary plans. The average calorie intake (in kcal) per day
for each plan and the number of patients on each plan are as follows:

• Plan A: 2000 kcal/day with 15 patients


• Plan B: 2500 kcal/day with 25 patients
• Plan C: 2200 kcal/day with 10 patients

Calculate the weighted mean calorie intake per day for all patients.
15. A professor is calculating the overall grade for a student based on the
grades from three different assessments with the following weights:


• Exam 1: 85 (weight 40%)


• Exam 2: 90 (weight 30%)
• Final Exam: 88 (weight 30%)

Find the weighted mean grade for the student.

16. This exercise focuses on comparing error rates for two software projects.
You will analyze the central tendency and variability of the error rates.

• Project Alpha: 5, 7, 6, 8, 5, 9, 6, 7, 8, 5, 7, 6
• Project Beta: 8, 10, 9, 7, 11, 9, 8, 10, 11, 9, 12, 10

Using the above data, answer the following questions.

(a) Draw a boxplot for the error rates for Project Alpha and Project
Beta.


(b) Which project has the lower median number of errors?


(c) Compare the spread of errors in each project. Which project shows
greater variability?
(d) Look for outliers and discuss what they might indicate about the
quality of the code in each project.
(e) Which project appears to have better performance based on the box-
plots?
17. This exercise involves analyzing BMI measurements from the start of a
study and after six months. Compare the central tendency and variability
of BMI over time.

• Start of Study: 22.5, 23.1, 22.8, 21.9, 23.5, 22.7, 23.0, 21.5,
22.3, 22.9, 21.8, 23.2
• After 6 Months: 21.0, 21.5, 21.8, 20.8, 22.0, 21.3, 21.6, 20.7,
21.1, 21.9, 20.9, 21.4
Using the above data, answer the following questions.

(a) Draw a boxplot for BMI at the start of the study and after six
months.
(b) How did the median BMI change from the start to after six months?
(c) Did the diet intervention lead to a reduction in the variability of
BMI?
(d) Are there any outliers in the BMI data at the start or after six
months? What can be inferred from them?
(e) Based on the boxplots, evaluate the effectiveness of the diet inter-

vention.

18. This exercise involves analyzing cholesterol levels across three different
groups. You will interpret the boxplots to compare the cholesterol levels
between these groups.

• Group A: 190, 200, 195, 210, 205, 215, 202, 198, 220, 210, 195,
200
• Group B: 180, 185, 190, 175, 195, 190, 180, 185, 175, 190, 185,
180
• Group C: 210, 220, 215, 230, 225, 240, 220, 235, 225, 230, 240,
215

Using the above data, answer the following questions.


(a) Draw a boxplot for the cholesterol levels for Groups A, B, and C.


(b) Which group has the highest median cholesterol level?


(c) Which group has the smallest interquartile range?
(d) Are there any outliers in each group? If so, which group has the
most outliers?
(e) Compare the variability of cholesterol levels between the three groups.

Chapter 4

Introduction to Probability

4.1 Introduction
In the realm of data science, understanding probability is essential for making
informed decisions based on uncertain and incomplete information. Probability
theory provides the mathematical foundation for analyzing data, modeling un-
certainty, and deriving insights from complex datasets. As data scientists, we
frequently encounter situations where outcomes are not deterministic but rather
subject to variability and chance. Probability offers tools and frameworks to
quantify this uncertainty and to make predictions that guide decision-making.

At its core, probability is concerned with measuring the likelihood of vari-


ous outcomes in uncertain situations. Whether it’s predicting customer behav-
ior, assessing risk, or evaluating the effectiveness of an algorithm, probability
helps us to model and interpret the inherent randomness in data. By applying

probability theory, we can develop robust statistical models, conduct rigorous


hypothesis testing, and perform meaningful data analysis.

In this chapter, we will lay the groundwork for understanding probability


within the context of data science. We will start with the basic concepts that
form the foundation of probability theory, including experiments, sample spaces,
and events. We will then explore different methods of assigning probabilities,
such as classical, empirical, and subjective approaches, and examine how these
methods apply to real-world data problems.

As we delve deeper, we will cover key topics such as joint and marginal
probabilities, conditional probability, and posterior probabilities. Each of these
concepts is crucial for analyzing relationships between variables, updating be-
liefs based on new data, and making predictions about future events.

By the end of this chapter, you will gain a solid understanding of proba-


bility and its applications in data science. This knowledge will equip you with
the tools needed to tackle complex data challenges and to make data-driven
decisions with confidence.

4.2 Basic Concepts


4.2.1 Experiment
In the context of probability, an experiment refers to any process or action
that generates a set of outcomes. The experiment is conducted under specified
conditions, and the outcomes of interest are observed and recorded. Each out-

come of an experiment is uncertain, but over repeated trials, patterns emerge
that allow us to assign probabilities to different outcomes.

Example of an Experiment
To test the fairness of a coin used in a cricket match to decide whether a
team bats or bowls first, we can design an experiment to assess whether the
coin has an equal probability of landing on heads or tails. The goal is to
determine if the coin is unbiased, meaning both outcomes, heads and tails, are
equally likely. In this experiment, the possible outcomes are heads or tails.
The procedure involves flipping the coin a significant number of times, say 100
flips, and recording the result of each flip. The next step is to analyze the data
by calculating the relative frequencies of heads and tails. These frequencies
are then compared to the expected probability of 0.5 for each outcome. If the
proportion of heads deviates significantly from 0.5, it may suggest the coin is
biased. Conversely, if the proportions are close to 0.5, there is no evidence to
suggest that the coin is unfair.
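
A simple simulation of this experiment can be written in Python. This is a minimal sketch of the procedure described above, using only the standard random module:

import random

# Flip a fair coin 100 times and record the results
random.seed(42)   # fixed seed for a reproducible run
flips = [random.choice(['H', 'T']) for _ in range(100)]

# Relative frequencies, to be compared with the expected 0.5
p_heads = flips.count('H') / len(flips)
p_tails = flips.count('T') / len(flips)
print(p_heads, p_tails)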

4.2.2 Random Experiment


A random experiment is a specific type of experiment where the outcome is
subject to chance and cannot be predicted with certainty. The outcomes are
uncertain and vary each time the experiment is performed.

Examples
• Rolling a Die: A random experiment with outcomes {1, 2, 3, 4, 5,
6}, where each outcome is unpredictable.

• Flipping a Coin: A random experiment with two possible outcomes:


heads or tails.

• Drawing a Card: A random experiment from a deck of 52 cards,


where each card is equally likely to be drawn.


4.2.3 Sample Space and Events


In data science, the concept of a sample space is essential for understanding
the possible outcomes of a random experiment or a data-generating process.
The sample space is the set of all possible outcomes or values that a random
variable can take.

Sample Space: The set of all possible outcomes of a random experiment


is denoted as the sample space, typically represented by S or Ω.

Consider the experiment of rolling a fair six-sided die. The possible outcomes

of this experiment are the numbers that appear on the top face of the die after
a roll. Then the sample space S for this experiment is the set of all possible
outcomes. That is,
S = {1, 2, 3, 4, 5, 6}
Here are some examples of sample spaces in different contexts within data
science:
• The sample space for tossing a fair coin is

S = {Heads, Tails}.

• Sample Space for tossing two coins:

S = {(Heads, Heads), (Heads, Tails), (Tails, Heads), (Tails, Tails)}.

• The sample space S for rolling two six-sided dice consists of all possible
ordered pairs (x1 , x2 ), where x1 represents the outcome of the first die
and x2 represents the outcome of the second die. Since each die has 6

faces, the sample space contains 36 possible outcomes:

S = {(1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6),
     (2, 1), (2, 2), (2, 3), (2, 4), (2, 5), (2, 6),
     (3, 1), (3, 2), (3, 3), (3, 4), (3, 5), (3, 6),
     (4, 1), (4, 2), (4, 3), (4, 4), (4, 5), (4, 6),
     (5, 1), (5, 2), (5, 3), (5, 4), (5, 5), (5, 6),
     (6, 1), (6, 2), (6, 3), (6, 4), (6, 5), (6, 6)}

This sample space shows all the possible outcomes when two dice are
rolled simultaneously.

Events are subsets of the sample space, representing specific outcomes or


combinations of outcomes.


Event: A subset of the sample space.

For example, the event A of rolling an even number is:

A = {2, 4, 6}
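
Sample spaces and events can be represented naturally as Python sets. The following minimal sketch computes the probability of event A when all outcomes are equally likely:

# Sample space for one roll of a fair die, and the event "even number"
S = {1, 2, 3, 4, 5, 6}
A = {x for x in S if x % 2 == 0}   # A = {2, 4, 6}

# With equally likely outcomes, P(A) = |A| / |S|
p_A = len(A) / len(S)
print(A, p_A)   # {2, 4, 6} 0.5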

4.3 Probability
Probability is a branch of mathematics that deals with the likelihood of differ-
ent outcomes in uncertain situations. It quantifies the chance of an event occur-
ring, providing a way to model and analyze randomness. In essence, probability

helps us understand and predict the behavior of systems in which outcomes are
not deterministic, but rather subject to chance.

Probability: A numerical measure of the likelihood that a particular


event will occur. It quantifies uncertainty by assigning a value between 0
and 1, where 0 means the event will not happen (impossible) and 1 means
the event will definitely happen (certain).

For example, when rolling a fair six-sided die, the probability of getting a 3 is:
P (3) = 1/6

because there is only one “3” out of six possible outcomes.

Properties of the Probability


A set of probability values for an experiment with a sample space

S = {A1 , A2 , . . . , An }

consists of some probabilities P (A1 ), P (A2 ), . . . , P (An ) that must satisfy

0 ≤ P (A1 ) ≤ 1, 0 ≤ P (A2 ) ≤ 1, ..., 0 ≤ P (An ) ≤ 1


and

P (A1 ) + P (A2 ) + · · · + P (An ) = 1.

Properties of the Probability:


(i). The probability of an event is always a number between 0 and 1.

(ii). The sum of the probabilities of all mutually exclusive events is


always 1.


For example, when tossing a fair coin, the probability of getting heads is 0.5,
and the probability of getting tails is also 0.5, which are both between 0 and 1.
The sum of these probabilities is

0.5 + 0.5 = 1

showing that the total probability of all possible outcomes (heads or tails) is
always 1.
Problem 4.1. An experiment has five outcomes, I, II, III, IV, and V. If P (I) =
0.08, P (II) = 0.20, and P (III) = 0.33, (a) what are the possible values for the
probability of outcome V? (b) If outcomes IV and V are equally likely, what are

their probability values?

Solution
(a).
An experiment has five outcomes: I, II, III, IV, and V. Given the probabilities
for outcomes I, II, and III are:

P (I) = 0.08, P (II) = 0.20, P (III) = 0.33


We need to determine the possible values for the probability of outcome V.
First, we find the sum of the given probabilities:

P (I) + P (II) + P (III) = 0.08 + 0.20 + 0.33 = 0.61

Since the sum of the probabilities of all outcomes must equal 1, the sum of the
probabilities of outcomes IV and V is:

P (IV) + P (V) = 1 − 0.61 = 0.39



Thus, the possible values for the probability of outcome V, denoted as P (V),
depend on the probability of outcome IV, denoted as P (IV):

P (V) = 0.39 − P (IV)


Since 0 ≤ P (IV) ≤ 0.39, the possible values for the probability of outcome
V are
0 ≤ P (V) ≤ 0.39

(b).

If outcomes IV and V are equally likely, then:

P (IV) = P (V)


Let P (IV) = P (V) = x. Then:


2x = 0.39 ⇒ x = 0.39/2 = 0.195
Therefore, the probabilities for outcomes IV and V are:

P (IV) = P (V) = 0.195


Problem 4.2. An experiment has three outcomes, I, II, and III. If outcome
I is twice as likely as outcome II, and outcome II is three times as likely as
outcome III, what are the probability values of the three outcomes?

Solution
An experiment has three outcomes: I, II, and III. Let the probabilities of these
outcomes be P (I), P (II), and P (III), respectively.

Let P (III) = x. Then:


• P (II) = 3x (since outcome II is three times as likely as outcome III).

• P (I) = 2 · P (II) = 2 · 3x = 6x (since outcome I is twice as likely as
outcome II).
Since the sum of the probabilities of all outcomes must equal 1, we have:

P (I) + P (II) + P (III) = 1

Substituting the values, we get:

6x + 3x + x = 1

or,
10x = 1

∴ x = 1/10 = 0.1
Thus, the probabilities of the three outcomes are:

P (III) = x = 0.1
P (II) = 3x = 3 × 0.1 = 0.3
P (I) = 6x = 6 × 0.1 = 0.6

Therefore, the probability values of the three outcomes are:

P (I) = 0.6, P (II) = 0.3, P (III) = 0.1


4.3.1 Union of Events


The union of two events A and B, denoted A ∪ B, represents the event that
either A, B, or both occur. Formally:

A ∪ B = {ω | ω ∈ A or ω ∈ B}
Example: Consider rolling a standard six-sided die. Let:

• A be the event “rolling an even number” (i.e., A = {2, 4, 6})

• B be the event “rolling a number greater than 3” (i.e., B = {4, 5, 6})

The union A ∪ B represents rolling a number that is either even or greater
than 3 (or both). The possible outcomes for

A ∪ B = {2, 4, 5, 6}.

The probability of A ∪ B is given by:


P (A ∪ B) = Number of favorable outcomes / Total number of outcomes
          = 4/6 = 2/3

4.3.2 Intersection of Events


The intersection of two events A and B, denoted A ∩ B, represents the event
that both A and B occur simultaneously. Formally:

A ∩ B = {ω | ω ∈ A and ω ∈ B}
Example: Using the same die roll, let:

• A be the event “rolling an even number” (i.e., A = {2, 4, 6})

• B be the event “rolling a number greater than 3” (i.e., B = {4, 5, 6})

The intersection A ∩ B represents rolling a number that is both even and


greater than 3.
The possible outcomes for

A ∩ B = {4, 6}.

The probability of A ∩ B is given by:


P (A ∩ B) = Number of favorable outcomes / Total number of outcomes
          = 2/6 = 1/3
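
Both die-roll examples above can be checked with Python set operations; this is a small sketch, with | for union and & for intersection:

S = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}   # even number
B = {4, 5, 6}   # greater than 3

p_union = len(A | B) / len(S)       # P(A ∪ B) = 4/6 ≈ 0.667
p_intersect = len(A & B) / len(S)   # P(A ∩ B) = 2/6 ≈ 0.333
print(p_union, p_intersect)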


4.3.3 Complementary Event


A complementary event of an event A, denoted by Ac or A or A′ , consists
of all outcomes in the sample space S that are not in A.
Example: Consider a six-sided die.

• Event A: Rolling an even number:

A = {2, 4, 6}

• Complementary Event Ac : Rolling an odd number:

Ac = {1, 3, 5}

The probability of the complementary event Ac is given by:

P (Ac ) = 1 − P (A)

Example: If A is the event of getting a head when flipping a coin, then the
complementary event Ac is the event of getting a tail. If P (A) = 0.5, then
P (Ac ) = 1 − 0.5 = 0.5.

Odds: The odds in favor of an event A are defined as the ratio of the
probability that the event A occurs to the probability that the event A
does not occur (i.e., the complement of A). Mathematically, the odds in
favor of A are given by:

Odds in favor of A = P (A)/P (Ac )

where P (Ac ) is the probability of the complement of A.



Problem 4.3. Suppose p is the probability of success.


(a) If the odds is 1, what is p?
(b) If the odds is 2, what is p?
(c) If p = 0.25, what is the odds?

Solution
(a). The odds in favor of success are given by:
Odds = p/(1 − p)

If the odds are 1, then:

1 = p/(1 − p)


Solving for p:
1 − p = p ⇒ 1 = 2p ⇒ p = 1/2
So, p = 0.5.

(b). If the odds are 2, then:


2 = p/(1 − p)

Solving for p:

2(1 − p) = p ⇒ 2 − 2p = p ⇒ 2 = 3p ⇒ p = 2/3

So, p = 2/3.

(c). If p = 0.25, then the odds are:


Odds = p/(1 − p) = 0.25/(1 − 0.25) = 0.25/0.75 = 1/3

So, the odds are 1/3.
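
These conversions follow two one-line formulas, sketched below as illustrative helper functions (odds_from_prob and prob_from_odds are hypothetical names, not a library API):

def odds_from_prob(p):
    # odds = p / (1 - p)
    return p / (1 - p)

def prob_from_odds(odds):
    # inverting the relation: p = odds / (1 + odds)
    return odds / (1 + odds)

print(prob_from_odds(1))     # 0.5, as in part (a)
print(prob_from_odds(2))     # 2/3, as in part (b)
print(odds_from_prob(0.25))  # 1/3, as in part (c)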

4.3.4 Equally Likely Events


In probability theory, equally likely events are events that have the same prob-
ability of occurring. If all outcomes in the sample space are equally likely, then
the probability of any specific event can be calculated by dividing the number
of favorable outcomes by the total number of possible outcomes.

Equally Likely Events: Events that have the same probability of occur-
ring.

Example
Consider the experiment of rolling a fair six-sided die. The sample space is:

S = {1, 2, 3, 4, 5, 6}
Since the die is fair, each of the six outcomes is equally likely. The proba-
bility of each outcome is:

P ({1}) = P ({2}) = P ({3}) = P ({4}) = P ({5}) = P ({6}) = 1/6.


4.3.5 Mutually Exclusive Events


In probability theory, mutually exclusive events are events that cannot happen
at the same time. In other words, if one event occurs, the other cannot occur
at the same time.

Mutually Exclusive Events: Two events A and B are said to be mutu-


ally exclusive (or disjoint) if they cannot occur at the same time. Formally,
A and B are mutually exclusive if:

A∩B =∅
where ∩ denotes the intersection of events, and ∅ represents the empty

set, indicating that there are no outcomes common to both A and B.

Example:
Consider rolling a standard six-sided die. Let:

• A be the event “rolling a 2”


• B be the event “rolling a 5”

The events A and B are mutually exclusive because you cannot roll a 2 and
a 5 at the same time.

Additivity for Mutually Exclusive Events


If A and B are mutually exclusive events, then the probability of their union is
the sum of their individual probabilities:

P (A ∪ B) = P (A) + P (B)

Generalization
For any finite or countable collection of mutually exclusive events A1 , A2 , . . . , An :
P (A1 ∪ A2 ∪ · · · ∪ An ) = P (A1 ) + P (A2 ) + · · · + P (An )

4.3.6 Probability Axioms


The probability of an event Ai ; i = 1, 2, . . . , n, denoted as P (Ai ), satisfies the
following axioms:

(i). 0 ≤ P (Ai ) ≤ 1,
(ii). P (A1 ) + P (A2 ) + · · · + P (An ) = 1,


(iii). For any sequence of mutually exclusive events {Ai }, we have


P (A1 ∪ A2 ∪ · · · ) = P (A1 ) + P (A2 ) + · · · .

For mutually exclusive events A, B ∈ S, we have

Pr(A ∪ B) = Pr(A) + Pr(B)

This is the addition rule for two mutually exclusive events.

If A and B are not mutually exclusive, then the addition rule is

Pr(A ∪ B) = Pr(A) + Pr(B) − Pr(A ∩ B).

If two events are mutually exclusive, then the probability of both occurring
is denoted as P (A ∩ B) and
P (A and B) = P (A ∩ B) = 0.

Problem 4.4. A single 6-sided die is rolled. What is the probability of rolling
a 2 or a 5?

Solution
• Pr(2) = 1/6 and Pr(5) = 1/6

• Therefore,
Pr(2 or 5) = Pr(2 ∪ 5) = Pr(2) + Pr(5)
           = 1/6 + 1/6
           = 2/6
           = 1/3

Problem 4.5. In a Math class of 30 students, 17 are boys and 13 are girls.
On a unit test, 4 boys and 5 girls made an A grade. If a student is chosen at
random from the class, what is the probability of choosing a girl or an A-grade
student?

Solution
• Pr(girl) = 13/30, Pr(A-grade student) = 9/30, and
Pr(girl ∩ A-grade student) = 5/30


• Therefore,

Pr(girl or A-grade student) = Pr(girl) + Pr(A-grade student)
                              − Pr(girl ∩ A-grade student)
                            = 13/30 + 9/30 − 5/30
                            = 17/30

4.4 Types of Probability

Probability can be categorized into different types based on how it is determined
or calculated. Below are the key types:

1. Classical (Theoretical) Probability


2. Experimental (Empirical) Probability
3. Subjective Probability

4.4.1 Classical (Theoretical) Probability


Classical or theoretical probability is based on the assumption that all outcomes
of a random experiment are equally likely. This is often used when the out-
comes of an experiment are known and finite. If an experiment has n equally
likely outcomes and an event A consists of m of these outcomes, the probability
of event A is given by:
P (A) = Number of favorable outcomes / Total number of possible outcomes = m/n.

Example 1: For a fair six-sided die, the total number of possible outcomes is
6 (since the die has six faces). If we are interested in the probability of rolling a
3, there is only one favorable outcome (rolling a 3). Therefore, the probability
is:
P (rolling a 3) = 1/6
Example 2: For a standard deck of 52 playing cards, the total number of
possible outcomes is 52. If we want to know the probability of drawing an Ace,
there are 4 Aces in the deck (one in each suit). Therefore, the probability is:
P (drawing an Ace) = 4/52 = 1/13
Classical probability is particularly useful for situations where the outcomes
are well-defined, and each outcome is equally likely, such as rolling dice, drawing
cards, or selecting outcomes from a set of equally likely possibilities.


4.4.2 Experimental (Empirical) Probability


The empirical approach, also known as the frequentist approach, assigns prob-
abilities based on observed frequencies from experimental data. This approach
is often used in practical situations where we have observed data. If an experi-
ment is repeated N times and event A occurs nA times, the probability of event
A is estimated by the relative frequency:

P (A) = Number of times the event A occurred / Total number of observations = nA /N.

This method is particularly useful when it is difficult or impractical to calcu-

late theoretical probabilities, or when we want to verify theoretical predictions
by comparing them with actual outcomes. The empirical approach to probabil-
ity relies on the principle known as the law of large numbers. This principle
suggests that as the number of observations increases, the estimate of the prob-
ability becomes more accurate. Therefore, by gathering more data, one can
obtain a more precise estimation of the probability.
Law of large numbers: As the number of trials or observations increases,
the empirical probability of an event will get closer to its actual probability.

Example 1: For example, if we flip a coin 100 times and get 52 heads, the
estimated probability of getting a head would be:
P (Heads) = Number of heads / Total number of flips = 52/100 = 0.52
Based on the empirical data, the estimated probability of getting heads on a
coin flip is 0.52, or 52%.
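
The law of large numbers is easy to see in a simulation. The sketch below estimates P (Heads) from increasingly many simulated flips of a fair coin:

import random

random.seed(1)   # fixed seed for a reproducible run
for n in [10, 100, 1000, 10000, 100000]:
    heads = sum(random.random() < 0.5 for _ in range(n))
    print(n, heads / n)   # the estimates approach 0.5 as n grows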

Example 2: In Bangladesh, the route between Dhaka and Chattogram


is one of the busiest and most important for both business and leisure travel.
Given the high volume of flights, airlines strive to maintain punctuality to en-
hance customer satisfaction and operational efficiency. Monitoring these flights
provides valuable data on performance and reliability.

For this example, 100 flights from Dhaka to Chattogram were monitored.

• Successful flights (on time): 95

• Unsuccessful flights (delayed or canceled): 5

The empirical probability of a successful flight is:


P (Success) = 95/100 = 0.95
The empirical probability of an unsuccessful flight is:


P (Failure) = 5/100 = 0.05
In this example, the empirical probability of a successful flight from Dhaka
to Chattogram is 0.95, while the probability of an unsuccessful flight is 0.05,
based on the actual outcomes of the monitored flights.

4.4.3 Subjective Approach


The subjective approach assigns probabilities based on personal judgment
or belief about the likelihood of an event. This method does not rely
on mathematical calculations or empirical data but rather on an individual’s

intuition or experience. For an event A, the subjective probability is denoted
as:
P (A) = Subjective belief about A
In this approach, probabilities are not necessarily based on frequency or
equal likelihood but on personal estimation.
Example 1: Consider an entrepreneur deciding whether to launch a new
product. Since there is no historical data or empirical studies available for this
specific product, the entrepreneur uses their expertise and market knowledge
to estimate the likelihood of success.

The entrepreneur assesses various factors:


• Knowledge of current market trends.

• Feedback from potential customers.



• Expertise of the team involved.

• Analysis of the competitive landscape.


Based on these factors, the entrepreneur might estimate the probability of
the product’s success to be 70%. This subjective estimate is derived from their
personal judgment and experience rather than from data analysis.

P (Success) = 0.70
This subjective probability is based on the entrepreneur uses their expertise
and market knowledge, rather than on empirical data or mathematical models.


Example 2: Consider a football game between Team A and Team B. Based


on your personal judgment and knowledge of the teams, you estimate the prob-
ability of Team A winning the game.
Let’s denote:
• P (A) as the probability of Team A winning.

• P (B) as the probability of Team B winning.


Based on your assessment, you estimate that:

P (A) = 0.70

This means you believe there is a 70% chance that Team A will win the
game. This probability is derived from your subjective evaluation of the teams’
recent performances, player conditions, and other relevant factors.

4.5 Joint and Marginal Probabilities


Joint probability is the probability of two (or more) events happening simul-
taneously. For two events A and B, the joint probability is denoted by P (A∩B).

Marginal probability is the probability of the occurrence of a single event.


It is obtained by summing (or integrating) the joint probabilities over all pos-
sible values of the other variable(s).

Example: Consider a study on student performance with the following events:

• A: The event that a student has a good study habit.



• B: The event that a student passed the exam.


The probability table for these events is as follows:

                  Good Study Habit (A)   Poor Study Habit (Ac )   Total
Passed (B)        0.80                   0.05                     0.85
Not Passed (Bc )  0.02                   0.13                     0.15
Total             0.82                   0.18                     1.0

The joint probability is the probability that a student has a good study
habit and passed the exam:

P (A ∩ B) = 0.80


Marginal Probability of Studying (A) that a student studied for the exam,
regardless of whether they passed or not, is obtained by summing the joint
probabilities involving A:

P (A) = P (A ∩ B) + P (A ∩ B c ) = 0.8 + 0.02 = 0.82


Marginal Probability of Passing (B): The probability that a student passed
the exam, regardless of whether they studied or not, is obtained by summing
the joint probabilities involving B:

P (B) = P (A ∩ B) + P (Ac ∩ B) = 0.80 + 0.05 = 0.85

Problem 4.6. Suppose two dice are thrown together. What is the probability
that at least one 6 is obtained on the two dice?

Solution
Since each die has 6 faces, the sample space contains 6 × 6 = 36 possible
outcomes:

S = {(1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6),
     (2, 1), (2, 2), (2, 3), (2, 4), (2, 5), (2, 6),
     (3, 1), (3, 2), (3, 3), (3, 4), (3, 5), (3, 6),
     (4, 1), (4, 2), (4, 3), (4, 4), (4, 5), (4, 6),
     (5, 1), (5, 2), (5, 3), (5, 4), (5, 5), (5, 6),
     (6, 1), (6, 2), (6, 3), (6, 4), (6, 5), (6, 6)}

In the sample space S, we see the possible outcomes with at least one 6 are
(6, 1), (6, 2), (6, 3), (6, 4), (6, 5), (6, 6), (1, 6), (2, 6), (3, 6), (4, 6), (5, 6).

Therefore, the number of outcomes with at least one 6 is 11.

P (at least one 6) = 11/36.

4.6 Conditional Probability


Conditional probability is essential in data science because it helps model and
understand how the probability of an event changes based on the occurrence of
another event. It underpins Bayesian inference, supports feature engineering,
enhances risk assessment, informs decision-making, aids in anomaly detection,
and is pivotal in natural language processing tasks. This ability to adjust
probabilities with new information is crucial for accurate predictions and data-
driven insights.


Conditional Probability: The conditional probability of an event A


given that event B has occurred is denoted by P (A|B) and is defined as:

P (A|B) = P (A ∩ B)/P (B), provided P (B) > 0

One simple example of conditional probability concerns the situation in


which two events A and B are mutually exclusive. Since mutually exclusive
events have no common outcomes, the occurrence of event B makes the occur-
rence of event A impossible. Thus, intuitively, the probability of event A given
that B has occurred should be zero. This is confirmed by the formula:

P (A | B) = P (A ∩ B)/P (B) = 0/P (B) = 0.
Another example involves a scenario where event B is a subset of event A,
denoted B ⊆ A. In this case, if event B occurs, event A must also occur. Thus,
the probability of event A given that B has occurred should be one. This is
supported by the formula:

P (A | B) = P (A ∩ B)/P (B) = P (B)/P (B) = 1.

Problem 4.7. If somebody rolls a fair die without showing you but announces
that the result is even, then what is the probability of scoring a 6?

Solution
The sample space for a fair die roll is S = {1, 2, 3, 4, 5, 6}. The event (B) that
the result is even is B = {2, 4, 6}.
P (B) = 3/6 = 1/2

The event (A) of scoring a 6 given that the result is even is A = {6}.

P (A|B) = P (A ∩ B)/P (B) = (1/6)/(1/2) = 1/3

Problem 4.8. Suppose somebody rolls a red die and a blue die together without
showing you, but announces that at least one 6 has been shown. What is the
probability that the red die showed a 6?

Solution
In the sample space S mentioned in Problem 4.6, we see the possible outcomes
with at least one 6 are

B = {(6, 1), (6, 2), (6, 3), (6, 4), (6, 5), (6, 6), (1, 6), (2, 6), (3, 6), (4, 6), (5, 6)}


Therefore, the number of outcomes with at least one 6 is 11. The number of
outcomes where the red die scores a 6 is 6. That is,

A = {(6, 1), (6, 2), (6, 3), (6, 4), (6, 5), (6, 6)}.

Hence, the conditional probability is

P (A|B) = (6/36)/(11/36) = 6/11.
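
This result can be checked by enumerating the 36 equally likely outcomes in Python; a minimal sketch:

from itertools import product

# All (red, blue) outcomes for two dice
S = list(product(range(1, 7), repeat=2))

B = [(r, b) for r, b in S if r == 6 or b == 6]   # at least one 6
A_and_B = [(r, b) for r, b in B if r == 6]       # red die shows a 6 as well

# With equally likely outcomes, P(A | B) = |A ∩ B| / |B|
print(len(A_and_B), '/', len(B))   # 6 / 11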

Problem 4.9. Suppose somebody rolls a red die and a blue die together. What
is the probability that the red die scores a 6 given that exactly one 6 of the two

outcomes has been scored?

Solution
In the sample space S mentioned in Problem 4.6, we see the possible outcomes
with exactly one 6 are

B = {(6, 1), (6, 2), (6, 3), (6, 4), (6, 5), (1, 6), (2, 6), (3, 6), (4, 6), (5, 6)}.

Therefore, the number of outcomes with exactly one 6 is 10 (i.e., excluding
(6, 6)). The number of outcomes where the red die scores a 6 and the blue die
does not is 5 (i.e., A = {(6, 1), (6, 2), (6, 3), (6, 4), (6, 5)}).

P (A|B) = (5/36)/(10/36) = 1/2

4.6.1 Probabilities Computation from Contingency Table


A contingency table displays the frequency distribution of two categorical

variables. Each cell in the table represents the count of observations where
the two variables take specific values. We use contingency tables to compute
marginal and joint probabilities.
Consider two categorical variables: Variable A with categories A1 and
A2 and Variable B with categories B1 and B2 . The contingency table 4.1 is
structured as follows:

Table 4.1: Contingency Table

A1 A2 Total
B1 a b a+b
B2 c d c+d
Total a+c b+d n=a+b+c+d


The joint probability table shows the probability of each combination of


categories occurring. It is obtained by dividing each cell count in the contin-
gency table by the overall total number of observations n.
To compute the joint probabilities:
Joint Probability = Count in cell / n
The joint probability table is:

        A1           A2           Total
B1      a/n          b/n          (a + b)/n
B2      c/n          d/n          (c + d)/n
Total   (a + c)/n    (b + d)/n    1

Table 4.2: Joint Probability Table


Problem 4.10. Consider the situation of the promotion status of male and
female officers of a major metropolitan police force in the eastern United States.
The force consists of 1200 officers, 960 men and 240 women. Over the past two
years 324 officers on the public force received promotions. After reviewing the
promotion record, a committee of female officers raised a discrimination case
on the basis that 288 male officers had received promotions, but only 36 female
officers had received promotions.

               Men   Women   Total
Promoted       288   36      324
Not Promoted   672   204     876
Total          960   240     1200

(i). Develop a joint probability table for these data. What are the marginal
probabilities? Suppose a male officer is selected randomly, what is the
chance that the officer will be promoted?

(ii). Suppose a female officer is selected randomly, what is the chance that
the officer will not be promoted? Suppose an officer is selected randomly
who got promotion, what is the chance that the officer will be male?

(iii). Suppose an officer is selected randomly who did not get promotion,
what is the chance that the officer will be female?


Solution
(i) Joint Probability Table and Marginal Probabilities
To develop the joint probability table, we divide each cell count by the total
number of officers, which is 1200.

Joint Probability Table:

               Men        Women      Total
Promoted       288/1200   36/1200    324/1200
Not Promoted   672/1200   204/1200   876/1200
Total          960/1200   240/1200   1

Simplifying the fractions, we get:

               Men    Women   Total
Promoted       0.24   0.03    0.27
Not Promoted   0.56   0.17    0.73
Total          0.80   0.20    1
Marginal Probabilities:

• Probability of promotion: 324/1200 = 0.27

• Probability of no promotion: 876/1200 = 0.73

• Probability of being male: 960/1200 = 0.80

• Probability of being female: 240/1200 = 0.20
Probability that a randomly selected male officer is promoted:
P (Promoted | Male) = 288/960 = 0.30

(ii) Probabilities for Female Officers and Promotion


Probability that a randomly selected female officer is not promoted:

204/1200
P (Not Promoted | Female) = = 0.85
240/1200
Probability that a randomly selected officer who got promoted is
male:

P (Male | Promoted) = (288/1200)/(324/1200) ≈ 0.889


(iii) Probability for Officers Not Promoted


Probability that a randomly selected officer who did not get promoted
is female:

P (Female | Not Promoted) = (204/1200)/(876/1200) ≈ 0.233
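
The whole solution can be reproduced from the raw counts with NumPy; a minimal sketch:

import numpy as np

# Counts from the table: rows = promoted / not promoted, cols = men / women
counts = np.array([[288, 36],
                   [672, 204]])
n = counts.sum()   # 1200

joint = counts / n               # joint probability table
p_promoted = joint[0].sum()      # 0.27
p_male = joint[:, 0].sum()       # 0.80

print(joint[0, 0] / p_male)              # P(Promoted | Male) = 0.30
print(joint[0, 0] / p_promoted)          # P(Male | Promoted) ≈ 0.889
print(joint[1, 1] / joint[:, 1].sum())   # P(Not Promoted | Female) = 0.85
print(joint[1, 1] / joint[1].sum())      # P(Female | Not Promoted) ≈ 0.233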

4.6.2 Independent Events


Two events A and B are said to be independent if the occurrence of one event
does not affect the probability of the other event occurring. In mathematical

terms, this is expressed as:

P (A ∩ B) = P (A) · P (B).

For independent events, the following also holds true:


P (A | B) = P (A ∩ B)/P (B) = [P (A) · P (B)]/P (B) = P (A)

and

P (B | A) = P (A ∩ B)/P (A) = [P (A) · P (B)]/P (A) = P (B).
These equations indicate that knowing the occurrence of one event does not
change the probability of the other event.

Independent Events: Two events A and B are said to be independent


if:

P (A ∩ B) = P (A) · P (B).

This equation implies

P (A | B) = P (A), P (B | A) = P (B).

Any one of these three conditions implies the other two.

Example 1: Rolling a Die and Flipping a Coin


Consider rolling a fair six-sided die and flipping a fair coin. Let:
• A be the event that the die shows a 3.

• B be the event that the coin lands on heads.


The die has six possible outcomes: 1, 2, 3, 4, 5, or 6. Therefore, the sample
space for the die roll is:
SA = {1, 2, 3, 4, 5, 6}


The coin has two possible outcomes: heads (H) or tails (T). Therefore, the
sample space for the coin flip is:

SB = {H, T }

The outcome of rolling the die does not affect the outcome of flipping the
coin, and vice versa. Therefore, events A and B are independent. We can verify
this as follows:
P (A) = 1/6, P (B) = 1/2
The combined sample space consists of 12 outcomes.

S = {(1, H), (2, H), (3, H), (4, H), (5, H), (6, H),
     (1, T), (2, T), (3, T), (4, T), (5, T), (6, T)}

P (A ∩ B) = P (die shows 3 and coin lands on heads) = 1/12

P (A) · P (B) = 1/6 · 1/2 = 1/12 = P (A ∩ B)
Thus, the events are independent.

Example 2: Drawing Cards from a Deck Without Replacement
Consider drawing two cards from a standard deck of 52 cards without replace-
ment. Let:
• A be the event that the first card drawn is a heart.

• B be the event that the second card drawn is a heart.


In this case, the events are not independent because the outcome of the first
draw affects the probability of the second draw. If the first card is a heart,
there are now only 12 hearts left in a deck of 51 cards, so:
P (A) = 13/52, P (B | A) = 12/51

The probability of B if A occurs is different from P (B) without conditioning


on A, which is:
P (B) = 13/52
Thus, A and B are not independent.
Problem 4.11. A system has four computers. Computer 1 works with a prob-
ability of 0.88; computer 2 works with a probability of 0.78; computer 3 works
with a probability of 0.92; computer 4 works with a probability of 0.85. Suppose
that the operations of the computers are independent of each other.


(a). Suppose that the system works only when all four computers are work-
ing. What is the probability that the system works?

(b). Suppose that the system works only if at least one computer is working.
What is the probability that the system works?

(c). Suppose that the system works only if at least three computers are work-
ing. What is the probability that the system works?

Solution
(a). To find the probability in this scenario, we multiply the probabilities

of all four computers working, as they are independent.

P (system works) = 0.88 × 0.78 × 0.92 × 0.85 = 0.537

(b). To find the probability in this scenario, we find the complement of


the probability that none of the computers are working. Then, the
probability that at least one computer is working is the complement of
the probability that none of the computers are working.

P (system works) = 1 − P (no computers working)


= 1 − ((1 − 0.88) × (1 − 0.78) × (1 − 0.92) × (1 − 0.85))
= 0.9997

(c).

P (system works) = P (all computers working)


+ P (computers 1,2,3 working, computer 4 not working)

+ P (computers 1,2,4 working, computer 3 not working)


+ P (computers 1,3,4 working, computer 2 not working)
+ P (computers 2,3,4 working, computer 1 not working)
= 0.537 + (0.88 × 0.78 × 0.92 × (1 − 0.85))
+ (0.88 × 0.78 × (1 − 0.92) × 0.85)
+ (0.88 × (1 − 0.78) × 0.92 × 0.85)
+ ((1 − 0.88) × 0.78 × 0.92 × 0.85)
= 0.903
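
Because the computers operate independently, all three answers can also be obtained by enumerating the 2^4 working/failed configurations; a minimal sketch:

from itertools import product

p = [0.88, 0.78, 0.92, 0.85]   # P(computer i works)

def prob_at_least(k):
    """Probability that at least k of the independent computers work."""
    total = 0.0
    for state in product([0, 1], repeat=4):   # 1 = works, 0 = fails
        if sum(state) >= k:
            prob = 1.0
            for works, pi in zip(state, p):
                prob *= pi if works else (1 - pi)
            total += prob
    return total

print(prob_at_least(4))   # (a) all four work: ≈ 0.537
print(prob_at_least(1))   # (b) at least one works: ≈ 0.9997
print(prob_at_least(3))   # (c) at least three work: ≈ 0.903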

Problem 4.12. Suppose that somebody secretly rolls two fair six-sided dice.
What is the probability that the face-up value of the first one is 2, given the
information that their sum is no greater than 5?


Solution
To find the probability that the face-up value of the first die is 2 given that
the sum of the two dice is no greater than 5, we use the concept of conditional
probability.
Let A be the event that the face-up value of the first die is 2, and B be
the event that the sum of the two dice is no greater than 5. We want to find
P (A | B).
The conditional probability P (A | B) is given by:

P (A ∩ B)
P (A | B) =
P (B)

T
First, we determine P (B). The possible outcomes for the sum of the two
dice being no greater than 5 are:

(1, 1), (1, 2), (1, 3), (1, 4),


(2, 1), (2, 2), (2, 3),
AF (3, 1), (3, 2),
(4, 1)
There are 10 such outcomes, and since there are 36 possible outcomes when
rolling two dice, the probability P (B) is:
P (B) = 10/36 = 5/18
Next, we determine P (A ∩ B), which is the probability that the first die is
2 and the sum of the dice is no greater than 5. The possible outcomes for this
are:

(2, 1), (2, 2), (2, 3)


There are 3 such outcomes, so the probability P (A ∩ B) is:
P (A ∩ B) = 3/36 = 1/12
Now we can calculate P (A | B):
P (A | B) = P (A ∩ B)/P (B) = (1/12)/(5/18) = (1/12) × (18/5) = 18/60 = 3/10

Thus, the probability that the face-up value of the first die is 2 given that
their sum is no greater than 5 is 3/10.


4.7 Posterior Probabilities


4.7.1 Law of Total Probability
Consider a sample space S partitioned into mutually exclusive events A1 , A2 , . . . , An .
This means:

S = A1 ∪ A2 ∪ · · · ∪ An
Let B be another event in the sample space given in Figure 4.1. The initial
question of interest is how to use the probabilities P (Ai ) and P (B | Ai ) to
calculate P (B), the probability of the event B. This can be achieved by noting

that

B = (A1 ∩ B) ∪ (A2 ∩ B) ∪ · · · ∪ (An ∩ B)


where the events Ai ∩ B are mutually exclusive, so that

P (B) = P (A1 ∩ B) + P (A2 ∩ B) + · · · + P (An ∩ B)


AF A1
A2
An−1 An
Ai

S
DR

Figure 4.1: A partition A1 , . . . , An and an event B.

Using the definition of conditional probability, this becomes

P (B) = P (A1 )P (B | A1 ) + P (A2 )P (B | A2 ) + · · · + P (An )P (B | An )

This result, known as the Law of Total Probability, has the interpretation
that if it is known that one and only one of a series of events Ai can occur, then
the probability of another event B can be obtained as the weighted average of
the conditional probabilities P (B | Ai ), with weights equal to the probabilities
P (Ai ).

Law of Total Probability: If A1 , . . . , An is a partition of a sample space,


then the probability of an event B can be obtained from the probabilities


P (Ai ) and P (B | Ai ) using the formula

P (B) = P (A1 )P (B | A1 ) + P (A2 )P (B | A2 ) + · · · + P (An )P (B | An )

The law of total probability states that if you have a partition of the sample
space into mutually exclusive events, the probability of an event can be found by
summing the probabilities of the event occurring within each partition, weighted
by the probability of each partition.

Example

Suppose we have a sample space divided into three mutually exclusive events
A1 , A2 , and A3 with the following probabilities and conditional probabilities:

P (A1 ) = 0.2, P (A2 ) = 0.5, P (A3 ) = 0.3

P (B | A1 ) = 0.4, P (B | A2 ) = 0.6, P (B | A3 ) = 0.3


To find P (B), use the Law of Total Probability:

P (B) = P (A1 ) · P (B | A1 ) + P (A2 ) · P (B | A2 ) + P (A3 ) · P (B | A3 )


P (B) = 0.2 · 0.4 + 0.5 · 0.6 + 0.3 · 0.3

P (B) = 0.08 + 0.30 + 0.09 = 0.47
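The arithmetic is a one-liner in Python. A small sketch using the numbers
from this example:

    priors = [0.2, 0.5, 0.3]        # P(A1), P(A2), P(A3)
    likelihoods = [0.4, 0.6, 0.3]   # P(B | Ai)

    p_b = sum(p * l for p, l in zip(priors, likelihoods))
    print(round(p_b, 2))  # 0.47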


Problem 4.13. A company sells a certain type of car that it assembles in one
of four possible locations. The probabilities of a car being assembled at each
plant are as follows:

• Plant I: 20% (P (Plant I) = 0.20)

• Plant II: 24% (P (Plant II) = 0.24)

• Plant III: 25% (P (Plant III) = 0.25)



• Plant IV: 31% (P (Plant IV) = 0.31)

Each new car sold carries a one-year bumper-to-bumper warranty. The com-
pany has collected data showing the following conditional probabilities of making
a warranty claim:

• P (claim | Plant I) = 0.05

• P (claim | Plant II) = 0.11


• P (claim | Plant III) = 0.03

• P (claim | Plant IV) = 0.08

The probability of interest is the probability that a claim on the warranty of the
car will be required. If B is the event that a claim is made, we want to find
P (B).

Solution
We can use the Law of Total Probability to find P (B). According to the Law
of Total Probability:

P (B) = P (B | Plant I) · P (Plant I) + P (B | Plant II) · P (Plant II)
+ P (B | Plant III) · P (Plant III) + P (B | Plant IV) · P (Plant IV)

[Tree diagram: each plant branches into Claim / No Claim, with
P (Claim) = 0.05, 0.11, 0.03, 0.08 and P (No Claim) = 0.95, 0.89, 0.97, 0.92
for Plants I–IV respectively.]
Substitute the given values:


P (B) = (0.05 · 0.20) + (0.11 · 0.24) + (0.03 · 0.25) + (0.08 · 0.31)


= 0.01 + 0.0264 + 0.0075 + 0.0248
= 0.0687

Thus, the probability that a claim on the warranty will be required is 0.0687.
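The same weighted average can be computed directly in Python. A minimal
sketch with the plant shares and claim rates given above:

    p_plant = [0.20, 0.24, 0.25, 0.31]   # P(Plant I..IV)
    p_claim = [0.05, 0.11, 0.03, 0.08]   # P(claim | Plant I..IV)

    p_b = sum(pp * pc for pp, pc in zip(p_plant, p_claim))
    print(round(p_b, 4))  # 0.0687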

4.7.2 Total Probability with Multiple Conditions


In situations where an event depends on multiple factors, the Law of Total

Probability allows us to compute the overall probability by summing the con-
tributions from all mutually exclusive combinations of those factors.

When an event B is influenced by two or more factors, such as age group


and smoking status, the total probability of B is calculated by conditioning on
all possible combinations of these factors. If we partition the sample space by
factors Ai (e.g., age groups) and Sj (e.g., smoking status), the total probability
of event B is given by:
P (B) = ∑i ∑j P (B | Ai , Sj ) · P (Sj | Ai ) · P (Ai )
Where:
• P (B | Ai , Sj ) is the conditional probability of B given that the indi-
vidual belongs to age group Ai and has smoking status Sj .

• P (Sj | Ai ) is the conditional probability of having smoking status Sj ,


given that the individual is in age group Ai .
• P (Ai ) is the marginal probability of being in age group Ai .
Thus, the total probability P (B) accounts for all the different ways that the
event B can occur, considering the two factors involved.
Problem 4.14. In a clinical study, participants are classified by age and smok-

ing status. The probability of being in the young age group (less than 40 years) is
P (Young) = 0.60, and the probability of being in the old age group (40 years or
older) is P (Old) = 0.40. For the young age group, the conditional probabilities
of having high blood pressure (BP) are P (High BP | Young, Smoker) = 0.10 for
smokers and P (High BP | Young, Non-smoker) = 0.05 for non-smokers. For
the old age group, the conditional probabilities of having high BP are P (High BP |
Old, Smoker) = 0.40 for smokers and P (High BP | Old, Non-smoker) = 0.25
for non-smokers. The probability of being a smoker in the young age group
is P (Smoker | Young) = 0.30, and the probability of being a non-smoker is
P (Non-smoker | Young) = 0.70. In the old age group, the probability of being


a smoker is P (Smoker | Old) = 0.40, and the probability of being a non-smoker


is P (Non-smoker | Old) = 0.60. Calculate the overall probability of having high
blood pressure, P (High BP).

Solution
To compute the overall probability of having high blood pressure, P (High BP),
we use the Law of Total Probability. The total probability is given by:
P (High BP) = ∑i ∑j P (High BP | Ai , Sj ) · P (Sj | Ai ) · P (Ai )

Where
• P (High BP | Ai , Sj ) is the conditional probability of having high blood
pressure given that the individual belongs to age group Ai and has
smoking status Sj ,

• P (Ai ) is the marginal probability of being in age group Ai ,


• P (Sj | Ai ) is the conditional probability of being a smoker (or non-
smoker) given the age group Ai .

[Tree diagram: Age splits into Young and Old; each age group splits into
Smoker and Non Smoker; each smoking status splits into High BP and Not
High BP.]
Therefore, we have

P (High BP) = P (High BP | Young, Smoker) · P (Smoker | Young) · P (Young)


+ P (High BP | Young, Non-smoker) · P (Non-smoker | Young) · P (Young)
+ P (High BP | Old, Smoker) · P (Smoker | Old) · P (Old)
+ P (High BP | Old, Non-smoker) · P (Non-smoker | Old) · P (Old)
= (0.10 · 0.30 · 0.60) + (0.05 · 0.70 · 0.60) + (0.40 · 0.40 · 0.40)
+ (0.25 · 0.60 · 0.40)
= 0.018 + 0.021 + 0.064 + 0.060
= 0.163
Thus, the probability of developing high blood pressure is P (High BP) =
0.163 or 16.3%.

Interpretation: The overall probability of having high blood pressure, con-


sidering all age groups and smoking statuses, is 16.3%. This result helps un-
derstand the prevalence of high blood pressure in the study population, taking


into account various risk factors.
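The sum over all age-by-smoking cells can be written compactly in Python.
A small sketch with the study's probabilities:

    p_age = {"Young": 0.60, "Old": 0.40}
    p_smoke = {("Young", "Smoker"): 0.30, ("Young", "Non-smoker"): 0.70,
               ("Old", "Smoker"): 0.40, ("Old", "Non-smoker"): 0.60}
    p_bp = {("Young", "Smoker"): 0.10, ("Young", "Non-smoker"): 0.05,
            ("Old", "Smoker"): 0.40, ("Old", "Non-smoker"): 0.25}

    p_high_bp = sum(p_bp[key] * p_smoke[key] * p_age[key[0]] for key in p_bp)
    print(round(p_high_bp, 3))  # 0.163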

4.7.3 Bayes’ Theorem


Bayes’ Theorem relates the conditional and marginal probabilities of random
events. It is used to update the probability of a hypothesis based on observed
evidence. The theorem can be stated mathematically as follows:

P (A | B) = [P (B | A) · P (A)] / P (B)
where:
• P (A | B) is the posterior probability of event A given that event B has
occurred.

• P (B | A) is the likelihood of event B given that event A has occurred.


• P (A) is the prior probability of event A before observing event B.
If A1 , A2 , . . . , An is a partition of a sample space, then the marginal probability
P (B) is
P (B) = ∑i P (B | Ai ) · P (Ai ).

In this case, the posterior probability P (Ai | B) is

P (Ai | B) = [P (B | Ai ) · P (Ai )] / [∑(j=1 to n) P (B | Aj ) · P (Aj )]

which is known as Bayes’ theorem.

Bayes’ Theorem for Posterior Probabilities


If A1 , A2 , . . . , An is a partition of a sample space, then the posterior
probabilities of the events Ai conditional on an event B can be obtained
from the prior probabilities P (Ai ) and the conditional probabilities P (B |
Ai ) using the formula:

P (Ai | B) = [P (Ai ) · P (B | Ai )] / [∑(j=1 to n) P (B | Aj ) · P (Aj )]

where:

• P (Ai ) is the prior probability of event Ai ,

• P (B | Ai ) is the conditional probability of event B given Ai ,

• The denominator is the total probability of B, computed by summing
  over all events Aj in the partition.

Bayes’ Theorem is particularly useful in scenarios where the probability of


an event is updated as more evidence becomes available. It plays a crucial role
in fields such as machine learning, data analysis, and decision making under
uncertainty.
Problem 4.15. When a customer buys a car, the prior probabilities of it having
been assembled in a particular plant are P (Plant I) = 0.20, P (Plant II) = 0.24,
P (Plant III) = 0.25, and P (Plant IV) = 0.31. Each new car sold carries a
one-year bumper-to-bumper warranty. The company has collected data showing

the following conditional probabilities of making a warranty claim:

• P (claim | Plant I) = 0.05

• P (claim | Plant II) = 0.11


• P (claim | Plant III) = 0.03

• P (claim | Plant IV) = 0.08

If a claim is made on the warranty of the car, how does this change these
probabilities?

Solution
From Bayes’ theorem, the posterior probabilities are calculated as follows:

P (Plant I | Claim) = P (Plant I) · P (Claim | Plant I) / P (Claim)
                    = (0.20 × 0.05) / 0.0687 = 0.146

P (Plant II | Claim) = P (Plant II) · P (Claim | Plant II) / P (Claim)
                     = (0.24 × 0.11) / 0.0687 = 0.384

P (Plant III | Claim) = P (Plant III) · P (Claim | Plant III) / P (Claim)
                      = (0.25 × 0.03) / 0.0687 = 0.109

P (Plant IV | Claim) = P (Plant IV) · P (Claim | Plant IV) / P (Claim)
                     = (0.31 × 0.08) / 0.0687 = 0.361

Comments on the results: The posterior probabilities are as follows:

• P (Plant I | Claim) = 0.146

• P (Plant II | Claim) = 0.384

• P (Plant III | Claim) = 0.109


• P (Plant IV | Claim) = 0.361

Notice that Plant II has the largest claim rate (0.11), and its posterior
probability (0.384) is much larger than its prior probability (0.24). This is
expected since the fact that a claim is made increases the likelihood that the
car has been assembled in a plant with a high claim rate. Similarly, Plant III
has the smallest claim rate (0.03), and its posterior probability (0.109) is much
smaller than its prior probability (0.25), as expected.
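The posterior update is easy to script. A minimal Python sketch reusing the
numbers above:

    p_plant = [0.20, 0.24, 0.25, 0.31]   # prior probabilities
    p_claim = [0.05, 0.11, 0.03, 0.08]   # likelihoods P(claim | plant)

    p_b = sum(pp * pc for pp, pc in zip(p_plant, p_claim))   # P(claim)
    posterior = [pp * pc / p_b for pp, pc in zip(p_plant, p_claim)]
    print([round(p, 3) for p in posterior])  # [0.146, 0.384, 0.109, 0.361]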
Problem 4.16. Suppose it is known that 1% of the population suffers from
a particular disease. A blood test has a 97% chance of identifying the disease
for diseased individuals, but also has a 6% chance of falsely indicating that a

healthy person has the disease.

(a) What is the probability that a person will have a positive blood test?

(b) If your blood test is positive, what is the chance that you have the disease?
(c) If your blood test is negative, what is the chance that you do not have the
disease?

Solution
(a) Probability of a Positive Blood Test
Let D be the event that a person has the disease, and Dc be the event that a
person does not have the disease. Let T + be the event of a positive test result,
and T − be the event of a negative test result.


P (D) = 0.01
P (Dc ) = 0.99
P (T + |D) = 0.97
P (T + |Dc ) = 0.06

The total probability of a positive test result is given by:

P (T + ) = P (T + |D)P (D) + P (T + |Dc )P (Dc )

= (0.97 × 0.01) + (0.06 × 0.99)
= 0.0097 + 0.0594
= 0.0691

So, the probability that a person will have a positive blood test is 0.0691.
(b) Probability of Having the Disease Given a Positive Test
We use Bayes’ theorem:

P (D|T + ) = P (T + |D)P (D) / P (T + )
           = (0.97 × 0.01) / 0.0691
           = 0.0097 / 0.0691
           ≈ 0.1403

So, if your blood test is positive, the chance that you have the disease is
approximately 0.1403 or 14.03%.

(c) Probability of Not Having the Disease Given a Negative Test


We first find the probability of a negative test:

P (T − ) = P (T − |D)P (D) + P (T − |Dc )P (Dc )
         = (1 − P (T + |D))P (D) + (1 − P (T + |Dc ))P (Dc )
         = (1 − 0.97) × 0.01 + (1 − 0.06) × 0.99
         = 0.03 × 0.01 + 0.94 × 0.99
         = 0.0003 + 0.9306
         = 0.9309

Equivalently, P (T − ) = 1 − P (T + ) = 1 − 0.0691 = 0.9309.


Now, using Bayes’ theorem for P (Dc |T − ):

P (Dc |T − ) = P (T − |Dc )P (Dc ) / P (T − )
            = (0.94 × 0.99) / 0.9309
            = 0.9306 / 0.9309
            ≈ 0.9997

So, if your blood test is negative, the chance that you do not have the disease

is approximately 0.9997 or 99.97%.
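All three answers can be reproduced with a few lines of Python. A minimal
sketch; the three input probabilities come from the problem statement:

    p_d = 0.01       # P(disease)
    p_pos_d = 0.97   # P(T+ | D), true-positive rate
    p_pos_dc = 0.06  # P(T+ | no disease), false-positive rate

    p_pos = p_pos_d * p_d + p_pos_dc * (1 - p_d)         # (a) 0.0691
    p_d_pos = p_pos_d * p_d / p_pos                      # (b) about 0.1403
    p_dc_neg = (1 - p_pos_dc) * (1 - p_d) / (1 - p_pos)  # (c) about 0.9997
    print(p_pos, p_d_pos, p_dc_neg)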

4.8 Concluding Remarks


In this chapter, we covered the essential principles of probability that form the
backbone of data science. We discussed experiments, sample spaces, joint and
marginal probabilities, and conditional probabilities, providing a solid foun-
dation for analyzing uncertainty and making data-driven decisions. We also
explored various methods for assigning probabilities, including classical, empir-
ical, and subjective approaches. The insights gained from understanding joint
probabilities, marginal probabilities, and Bayes’ Theorem will be invaluable for
refining models and interpreting data.

As we move forward, the next chapter will delve into random variables and
their properties. Random variables are crucial for quantifying and modeling
uncertainty in a more structured way. We will explore different types of ran-
dom variables, their distributions, and key properties, further building on the

probability concepts introduced here. Mastering these topics will enhance your
ability to handle complex data challenges and apply statistical techniques effec-
tively. Understanding random variables is essential for advanced data analysis
and developing predictive models.

4.9 Chapter Exercises


1. Consider an experiment where a fair six-sided die is rolled. Define the
following events:

• A: The event that the outcome is an even number.


• B: The event that the outcome is greater than 4.
Calculate the following probabilities:
(a) P (A)


(b) P (B)
(c) P (A ∩ B)
(d) P (A ∪ B)
(e) P (Ac )

2. In a bag of 10 balls, 4 are red and 6 are blue. Two balls are drawn at
random without replacement. Define the following events:
• A: Drawing a red ball on the first draw.
• B: Drawing a red ball on the second draw.

Calculate the following probabilities:

(a) P (A)
(b) P (B | A)
(c) P (A ∩ B)
AF(d) P (A ∪ B)

3. You are given a deck of 52 playing cards. Define the following events:
• A: Drawing a card that is a heart.
• B: Drawing a card that is a queen.
Calculate the following probabilities:
(a) P (A)
(b) P (B)
(c) P (A ∩ B)

(d) P (A ∪ B)
(e) P (Ac )

4. Suppose that the probability of a young person liking Facebook is 0.7,


the probability of liking YouTube is 0.6, and the probability of liking
both platforms is 0.5. Using the relevant probability theorems, determine
the following:

(a) What is the probability that a young person likes exactly one of
the two social media platforms?
(b) What is the probability that a young person likes at least one of
the two platforms?
(c) What is the probability that a young person likes only Facebook
and not YouTube?


5. A quality control team in a small factory inspects a batch of 60 parts.


It was observed that 10 parts were defective in appearance, 12 parts had
functional defects, and 4 parts were both defective in appearance and
function. If a part is selected randomly, what is the probability that it is
defective in appearance or has a functional defect?
6. A survey finds that 60% of people prefer coffee over tea, and 30% prefer
both coffee and tea. What is the probability that a randomly chosen
person prefers at least one of the two drinks? Define the following events:

• A: Preferring coffee.

• B: Preferring tea.

Given:
• P (A) = 0.6
• P (A ∩ B) = 0.3
AFCalculate: P (A ∪ B)
7. In a class of 30 students, 18 like mathematics, 12 like science, and 8 like
both. If a student is chosen at random, calculate:

(a) The probability that the student likes science.


(b) The probability that the student likes mathematics given they like
science.
(c) The probability that the student likes either mathematics or science.

8. Consider the employment status of male and female employees in a tech-



nology company. The company employs 1500 individuals, of whom 1050


are men and 450 are women. Over the past year, 375 employees were
promoted. After analyzing the promotion records, a committee of female
employees raised concerns about potential gender bias, noting that 315
male employees had received promotions, while only 60 female employees
were promoted.

(i) Construct a joint probability table based on these data. Calcu-


late the marginal probabilities. If a male employee is selected
randomly, what is the probability that he was promoted?
(ii) If a female employee is selected randomly, what is the probability
that she was not promoted? Also, if a randomly selected em-
ployee was promoted, what is the probability that the employee
is male?
(iii) If a randomly selected employee was not promoted, what is the
probability that the employee is female?


9. In a clinical study, researchers are interested in the probability of a patient


developing a particular health condition based on the type of treatment
received. There are three types of treatments: A, B, and C. The proba-
bilities of receiving each treatment are as follows:

• Treatment A: 30% (P (A) = 0.30)


• Treatment B: 50% (P (B) = 0.50)
• Treatment C: 20% (P (C) = 0.20)

The probability of developing the health condition given the type of treat-
ment is known to be:

• P (Condition | A) = 0.10
• P (Condition | B) = 0.25
• P (Condition | C) = 0.15
AF Find the overall probability of a patient developing the health condition,
denoted as P (Condition).
Using Bayes’ theorem, calculate the following probabilities:

(a) P (A | Condition)
(b) P (B | Condition)
(c) P (C | Condition).

10. Suppose that somebody secretly rolls two fair six-sided dice, and what
is the probability that the face-up value of the first one is 3, given the
information that their sum is no greater than 5?

11. An electrical system consists of four components as illustrated in the


following figure.


The system works if components A and B work and either of the com-
ponents C or D works. The reliability (probability of working) of each
component is also shown in the above figure. Find the probability that
(a) the entire system works.
(b) the component C does not work, given that the entire system works.
Assume that the four components work independently.
(c) the component D does not work, given that the entire system works.
12. An agricultural research establishment grows vegetables and grades each
one as either good or bad for its taste, good or bad for its size, and good

or bad for its appearance. Overall 78% of the vegetables have a good
taste. However, only 69% of the vegetables have both a good taste and a
good size. Also, 5% of the vegetable have both a good taste and a good
appearance, but a bad size. Finally, 84% of the vegetables have either a
good size or a good appearance.
(a). If a vegetable has a good taste, what is the probability that it
     also has a good size?
(b). If a vegetable has a bad size and a bad appearance, what is the
     probability that it has a good taste?
13. A company produces electronic components, and it has two types of ma-
chines, A and B, that manufacture these components. Machine A pro-
duces 60% of the components, while Machine B produces 40%. Historical
data shows that 2% of the components produced by Machine A are defec-
tive, while 5% of the components produced by Machine B are defective.

A component is selected at random and found to be defective. What is


the probability that this defective component was produced by Machine
A?
14. A certain rare disease affects 2% of a population. A diagnostic test for
this disease has the following characteristics:
• It correctly identifies the disease (true positive) 90% of the time
for those who have it.
• It incorrectly indicates the disease (false positive) 5% of the time
for those who do not have it.
Answer the following questions.
(a) What is the probability that a person will receive a positive test
result?


(b) If a person tests positive, what is the probability that they actually
have the disease?
(c) If a person tests negative, what is the probability that they do not
have the disease?
15. As a mining company evaluates the likelihood of discovering a gold de-
posit in a specific region, they have gathered data on the probabilities
associated with geological features. Given that the probability of finding
a gold deposit is P (G) = 0.3, the likelihood of observing specific geolog-
ical features if a deposit is present is P (E|G) = 0.8, and the chance of
observing those features if no deposit exists is P (E|Gc ) = 0.1, answer the

following:

(i) Calculate the probability of observing the geological features in this


area.
(ii) What is the probability that there is indeed a gold deposit given the
observed geological features?
AF (iii) How would you interpret these results in terms of their implications
for the mining company’s decision-making process regarding further
exploration in this region?

16. A factory produces 80% of products with Machine A and 20% with Ma-
chine B. If 2% of A’s products and 5% of B’s products are defective, what
is the probability that a defective product came from Machine A?
17. Suppose you are on a game show with three doors: one has a car, the
other two have goats. You choose Door 1. The host, who knows what’s
behind the doors, opens Door 3 to show a goat and asks if you want to
switch to Door 2.

(a) What’s the chance of winning the car if you switch to Door 2?
(b) What’s the chance of winning the car if you stay with Door 1?

Chapter 5

Random Variable and Its Properties

5.1 Introduction
In the realm of data science, understanding and manipulating uncertainty is
a fundamental skill. At the core of this capability lies the concept of a ran-
dom variable. A random variable is a quantitative variable whose values are
determined by the outcome of a random phenomenon. It serves as a bridge
connecting the abstract world of probability theory to the concrete domain of
data analysis.

Random variables can be classified into two main types: discrete and contin-
uous. Discrete random variables take on a countable number of distinct values,

often representing things like the number of occurrences of an event. Contin-


uous random variables, on the other hand, can take on an infinite number of
possible values within a given range, making them essential for representing
measurements and other quantities that vary smoothly.

This chapter delves into the foundational aspects of random variables, ex-
ploring their properties and the critical role they play in statistical modeling
and data analysis. We will discuss probability distributions, expected values,
variances, and other key summary measures. By the end of this chapter, readers will gain a robust un-
derstanding of how random variables function and how they can be applied to
solve real-world problems in data science.

5.2 Random Variable


A random variable is a mathematical concept used in probability theory and
statistics, representing a variable whose possible values depend on the outcomes


of a random experiment. It serves as a fundamental tool for defining proba-


bility distributions and calculating probabilities associated with events arising
from uncertain or stochastic processes. In data science, random variables are
fundamental because they allow us to model and reason about uncertainty and
variability in data.

A random variable is a numerical outcome of a random phenomenon. It is


a function that assigns a real number to each outcome in a sample space of a
random experiment. Formally, a random variable X is defined as a function:
X:S→R
where S is the sample space of the experiment, and R is the set of real numbers.

Random Variable: A variable whose possible values are determined by
outcomes of a random experiment or process, with each value associated
with a probability.

Example: Testing Electronic Components


Consider a random experiment where three electronic components are tested
for defects. The sample space, giving a detailed description of each possible
outcome, can be written as follows:

S = {NNN, NDN, NND, DNN, NDD, DND, DDN, DDD}


where,
• N stands for a non-defective component.

• D stands for a defective component.


Let X be the random variable representing the number of defective com-

ponents in the sample. The possible values of X are denoted by x, and their
corresponding outcomes are listed in Table 5.1. The random variable X can
take on the following values:
• x = 0: No defective components (Outcome: NNN)

• x = 1: One defective component (Outcomes: NDN, NND, DNN)

• x = 2: Two defective components (Outcomes: NDD, DND, DDN)

• x = 3: Three defective components (Outcome: DDD).

Table 5.1: Possible Outcomes When Testing Three Electronic Components

Outcome NNN NDN NND DNN NDD DND DDN DDD

x 0 1 1 1 2 2 2 3


In this example, X is a discrete random variable because it can take on


a countable number of distinct values. Each value of X corresponds to the
number of defective components in the tested sample. There are two main
types of random variables: discrete and continuous.

5.3 Discrete Random Variables


A discrete random variable can take on a countable number of possible
values. Here are some examples of discrete random variables:

1. Number of Heads in a Series of Coin Tosses: When flipping a

fair coin multiple times, the number of heads observed in the series is a
discrete random variable. For example, if you flip a coin 10 times, the
number of heads (0 to 10) is a discrete outcome.

2. Number of Defective Items in a Batch: In quality control, the num-


ber of defective items in a batch of products is a discrete random variable.
For instance, if a factory produces 100 items in a day, the number of de-
fective items could be any integer from 0 to 100.

3. Number of Customers in a Queue: The number of customers waiting


in line at a service center or a bank is a discrete random variable. At any
given time, this number could be 0, 1, 2, and so on.
4. Roll of a Die: When rolling a standard six-sided die, the outcome is a
discrete random variable with possible values of 1, 2, 3, 4, 5, or 6.

5. Number of Emails Received in a Day: The number of emails a


person receives in a day is a discrete random variable. It can take on any
non-negative integer value (0, 1, 2, . . . ).

6. Number of Accidents at an Intersection: The number of traffic


accidents occurring at a particular intersection in a month is a discrete
random variable. This count could be 0, 1, 2, and so on.
7. Number of Children in a Family: The number of children in a family
is a discrete random variable, with possible values of 0, 1, 2, and so forth.
8. Number of Sales Transactions in a Day: The number of sales trans-
actions processed by a retail store in a single day is a discrete random
variable, representing the count of individual sales.

These examples illustrate various contexts in which discrete random vari-


ables are used to model and analyze real-world phenomena.


5.3.1 Probability Mass Function (pmf )


The probability distribution of a random variable describes how probabilities
are distributed over the values of the random variable. For a discrete random
variable, the probability distribution is described by the probability mass
function (pmf ), which gives the probability that the random variable takes
on a specific value.

Probability Mass Function: A pmf of a discrete random variable X


is a function p(x) that gives the probability that X takes the value x. It
satisfies:

(i). Non-negativity: For every possible value x that X can take, the
probability p(x) is non-negative:

p(x) ≥ 0

(ii). Normalization: The sum of the probabilities over all possible values
of X equals 1:

∑(x ∈ Range of X) p(x) = 1

(iii). Probability Assignment: For any specific value x, p(x) gives the
probability that the random variable X takes the value x:

p(x) = P (X = x)

Example: Testing Electronic Components


Consider the example of Testing Electronic Components described in the

previous section, where X is the random variable representing the number of


defective components in the three tested electronic components. The prob-
ability mass function for X is shown below, and the graphical representation is
presented in Figure 5.1.

Table 5.2: Probability mass function

Outcome              x    P (X = x)
{NNN}                0    1/8
{NDN, NND, DNN}      1    3/8
{NDD, DND, DDN}      2    3/8
{DDD}                3    1/8

148
CHAPTER 5. RANDOM VARIABLE AND ITS PROPERTIES

Figure 5.1: Probability Mass Function of X
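A figure like this can be reproduced in a few lines of Python. A minimal
sketch assuming matplotlib is installed:

    import matplotlib.pyplot as plt

    x = [0, 1, 2, 3]
    pmf = [1/8, 3/8, 3/8, 1/8]

    plt.stem(x, pmf)
    plt.xlabel("X (Number of Defective Components)")
    plt.ylabel("Probability P(X = x)")
    plt.show()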

5.3.2 Cumulative Distribution Function (cdf )


The cumulative distribution function (cdf) of a random variable X is a
function that gives the probability that X will take a value less than or equal
to x. For both discrete and continuous random variables,

F (x) = P (X ≤ x).

The cdf is a non-decreasing function that ranges from 0 to 1.



For the discrete random variable X, the cumulative distribution function


can then be calculated from the expression:
F (x) = ∑(y ≤ x) P (X = y).

Example: Testing Electronic Components


Consider the example of Testing Electronic Components described in the
previous section, where X is the random variable representing the number
of defective components in the three tested electronic components. The
cumulative distribution function for X is shown below, and the graphical rep-
resentation is presented in Figure 5.2.


Table 5.3: The cdf for the Number of Defective Components

x    P (X = x)    F (x)
0    1/8          0.125
1    3/8          0.500
2    3/8          0.875
3    1/8          1.000


Figure 5.2: Cumulative Distribution Function of X
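Table 5.3 is just a running (cumulative) sum of the pmf, which numpy
computes directly. A short sketch assuming numpy is available:

    import numpy as np

    pmf = np.array([1/8, 3/8, 3/8, 1/8])
    cdf = np.cumsum(pmf)
    print(cdf)  # [0.125 0.5 0.875 1.0]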


DR

5.3.3 Properties of the Cumulative Distribution Function


The cdf of a random variable has several important properties:

1. Non-decreasing: The cdf F (x) is a non-decreasing function. This means


that if x1 ≤ x2 , then F (x1 ) ≤ F (x2 ). The probability that the random
variable takes a value less than or equal to x does not decrease as x
increases.
2. Limits:
• limx→−∞ F (x) = 0; that is the minimum value of the cdf is 0.
• limx→+∞ F (x) = 1; that is the maximum value of the cdf is 1.
3. Right-Continuous: The cdf F (x) is right-continuous. This means that
for any value x, the limit of F (x) as t approaches x from the right (t → x+ )


is equal to F (x). Mathematically, this can be written as limt→x+ F (t) = F (x).
4. Range: The cdf F (x) takes values in the interval [0, 1]. For any real
number x, 0 ≤ F (x) ≤ 1. This reflects the fact that probabilities range
from 0 to 1.

5. Step Function: For discrete random variables, the cdf F (x) is a step
function, with the value increasing at each point where the random vari-
able takes a value.

Problem 5.1. An office has four copying machines, and the random variable

X measures how many of them are in use at a particular moment in time.
Suppose that P (X = 0) = 0.08, P (X = 1) = 0.11, P (X = 2) = 0.27, and
P (X = 3) = 0.33.

(a) What is P (X = 4)?


(b) Draw a line graph of the probability mass function.
AF
(c) Construct and plot the cumulative distribution function.

Solution
(a) Since the sum of all probabilities must be 1, we have:

P (X = 4) = 1 − (P (X = 0) + P (X = 1) + P (X = 2) + P (X = 3))
= 1 − (0.08 + 0.11 + 0.27 + 0.33) = 1 − 0.79
= 0.21
DR

(b) The graphical presentation of the probability mass function is the follow-
ing:


Figure 5.3: Probability Mass Function

(c) We know that the cumulative distribution function F (x) is defined as:

F (x) = P (X ≤ x)

The cumulative distribution function F (x) with probability mass function


is provided in Table 5.4.

x      0     1     2     3     4
p(x)   0.08  0.11  0.27  0.33  0.21
F (x)  0.08  0.19  0.46  0.79  1.00

Table 5.4: Cumulative Distribution Function of X


where,
F (0) = P (X = 0) = 0.08
F (1) = P (X ≤ 1) = P (X = 0) + P (X = 1) = 0.08 + 0.11 = 0.19
F (2) = P (X ≤ 2) = P (X = 0) + P (X = 1) + P (X = 2)
= 0.08 + 0.11 + 0.27 = 0.46
F (3) = P (X ≤ 3) = P (X = 0) + P (X = 1) + P (X = 2) + P (X = 3)
= 0.08 + 0.11 + 0.27 + 0.33 = 0.79
F (4) = P (X ≤ 4)
= P (X = 0) + P (X = 1) + P (X = 2) + P (X = 3) + P (X = 4)

= 0.08 + 0.11 + 0.27 + 0.33 + 0.21 = 1.00

The graphical presentation of F (x) is the following:

[Step plot of the cdf F (x) at x = 0, 1, 2, 3, 4, rising from 0.08 to 1.]
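A short Python sketch for this problem, using only the probabilities given:

    import numpy as np

    p = {0: 0.08, 1: 0.11, 2: 0.27, 3: 0.33}
    p[4] = 1 - sum(p.values())                  # (a) remaining probability
    cdf = np.cumsum([p[x] for x in sorted(p)])  # (c) cumulative distribution

    print(round(p[4], 2))  # 0.21
    print(cdf)             # approx. [0.08 0.19 0.46 0.79 1.0]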
DR

Problem 5.2. Let the number of phone calls received by a switchboard during
a 5-minute interval be a random variable X with probability function
p(x) = (e^(−2) · 2^x) / x!, for x = 0, 1, 2, . . .
(a) Determine the probability that x equals 0, 1, 2, 3, 4, 5, and 6.
(b) Graph the probability mass function for these values of x.
(c) Determine the cumulative distribution function for these values of X.

Solution
(a) Probabilities
The probability function is given by
p(x) = (e^(−2) · 2^x) / x!


The probabilities for X = 0, 1, 2, 3, 4, 5, 6 are:

P (X = 0) = e^(−2) · 2^0 / 0! = e^(−2) ≈ 0.1353
P (X = 1) = e^(−2) · 2^1 / 1! = 2e^(−2) ≈ 0.2707
P (X = 2) = e^(−2) · 2^2 / 2! = 2e^(−2) ≈ 0.2707
P (X = 3) = e^(−2) · 2^3 / 3! = (4/3)e^(−2) ≈ 0.1804
P (X = 4) = e^(−2) · 2^4 / 4! = (2/3)e^(−2) ≈ 0.0902
P (X = 5) = e^(−2) · 2^5 / 5! = (4/15)e^(−2) ≈ 0.0361
P (X = 6) = e^(−2) · 2^6 / 6! = (4/45)e^(−2) ≈ 0.0120
AF
(b) Graph of the Probability Mass Function


Figure 5.4: Probability Mass Function of X

(c) Cumulative Distribution Function


The cumulative distribution function F (x) = P (X ≤ x) for x = 0, 1, 2, 3, 4, 5, 6, . . .
is:


F (0) = P (X ≤ 0) = P (X = 0) = 0.1353
F (1) = P (X ≤ 1) = P (X = 0) + P (X = 1) = 0.1353 + 0.2707 = 0.4060
F (2) = P (X ≤ 2) = P (X = 0) + P (X = 1) + P (X = 2)
= 0.4060 + 0.2707 = 0.6767
F (3) = P (X ≤ 3) = P (X = 0) + P (X = 1) + P (X = 2) + P (X = 3)
= 0.6767 + 0.1804 = 0.8571
F (4) = P (X ≤ 4) = P (X = 0) + P (X = 1) + P (X = 2) + P (X = 3) + P (X = 4)
= 0.8571 + 0.0902 = 0.9473

F (5) = P (X ≤ 5) = P (X = 0) + P (X = 1) + P (X = 2) + P (X = 3)
+ P (X = 4) + P (X = 5)
= 0.9473 + 0.0361 = 0.9834
F (6) = P (X ≤ 6) = P (X = 0) + P (X = 1) + P (X = 2) + P (X = 3)
+ P (X = 4) + P (X = 5) + P (X = 6)
= 0.9834 + 0.0120 = 0.9954
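The pmf in this problem is that of a Poisson distribution with mean 2, so
parts (a) and (c) can be checked with scipy. A minimal sketch assuming
scipy is installed:

    from scipy.stats import poisson

    x = range(7)
    pmf = poisson.pmf(x, mu=2)   # e^(-2) * 2**x / x!
    cdf = poisson.cdf(x, mu=2)
    for k, p, F in zip(x, pmf, cdf):
        print(k, round(p, 4), round(F, 4))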

5.3.4 Exercise
1. An office has four copying machines, and the random variable X denotes
how many of them are in use at a particular time. Suppose the probability
mass function of X is given below:

x 0 1 2 3 4
Pr(X = x) k 0.02 0.05 0.4 (k + 0.3)

(a) What is the value of k and draw the line graph of the probability
mass function Pr(X = x).
(b) Find the Value of Pr(X ≤ 2).
(c) Find the probability that at least two copying machines are in
use.
(d) Find the cumulative Function F (x) and draw the F (x).

2. An office has five printers and the random variable Y measures how many
of them are currently being used. Suppose that P (Y = 0) = 0.05, P (Y =
1) = 0.10, P (Y = 2) = 0.20, P (Y = 3) = 0.30, and P (Y = 4) = 0.25.

(a) What is P (Y = 5)?


(b) Draw a line graph of the probability mass function.
(c) Construct and plot the cumulative distribution function.


3. A hospital has six emergency rooms and the random variable Z measures
how many of them are occupied at a given time. Suppose that P (Z =
0) = 0.04, P (Z = 1) = 0.10, P (Z = 2) = 0.20, P (Z = 3) = 0.25,
P (Z = 4) = 0.20, and P (Z = 5) = 0.15.

(a) What is P (Z = 6)?


(b) Draw a line graph of the probability mass function.
(c) Construct and plot the cumulative distribution function.

4. A hospital has three emergency rooms, and the random variable W de-
notes how many of them are occupied at a particular time. Suppose the

probability mass function W is given below:

w 0 1 2 3
Pr(W = w) q 0.15 0.25 (q + 0.05)
(a) What is the value of q and draw the line graph of the probability
mass function Pr(W = w).
(b) Find the value of Pr(W ≤ 2). Also, find the probability that at least
one emergency room is occupied.
(c) Find the cumulative distribution function F (w) and draw the graph
of F (w).

5. A clinic has three doctors and the random variable W measures how
many of them are available at a particular moment in time. Suppose that
P (W = 0) = 0.15, P (W = 1) = 0.20, and P (W = 2) = 0.30.

(a) What is P (W = 3)?


(b) Draw a line graph of the probability mass function.
(c) Construct and plot the cumulative distribution function.

6. A warehouse has seven forklifts and the random variable V measures how
many of them are currently in operation. Suppose that P (V = 0) = 0.02,
P (V = 1) = 0.08, P (V = 2) = 0.18, P (V = 3) = 0.25, P (V = 4) = 0.20,
and P (V = 5) = 0.15.

(a) What is P (V = 6)?


(b) Draw a line graph of the probability mass function.
(c) Construct and plot the cumulative distribution function.

7. A manufacturing plant has four assembly lines and the random variable
U measures how many of them are operating at a given time. Suppose
that P (U = 0) = 0.10, P (U = 1) = 0.20, and P (U = 2) = 0.35.


(a) What is P (U = 3)?


(b) Draw a line graph of the probability mass function.
(c) Construct and plot the cumulative distribution function.

5.4 Continuous Random Variables


A continuous random variable can take on an uncountable number of pos-
sible values. Here are some examples of continuous random variables:

1. Height of Individuals: The height of a person is a continuous random

variable because it can take any value within a given range. For example,
the height could be 170.2 cm, 175.5 cm, etc.
2. Time Taken to Complete a Task: The time required to finish a task,
such as running a marathon, is a continuous random variable. It can be
measured in hours, minutes, seconds, and fractions of a second.
3. Temperature: The temperature at a specific location and time is a
continuous random variable. It can take any value within the possible
range of temperatures, such as 23.45°C, 37.8°C, etc.

4. Weight of an Object: The weight of an object is a continuous random


variable. For example, a bag of flour might weigh 1.25 kg, 1.30 kg, etc.
5. Amount of Rainfall: The amount of rainfall in a day is a continuous
random variable. It can be measured in millimeters or inches, and it can
take any value within a range.
6. Price of a Stock: The price of a stock at any given moment is a con-

tinuous random variable. It can vary continuously and take on any value
within the range of possible stock prices.
7. Age of an Individual: The age of a person can be considered a contin-
uous random variable if measured precisely. For instance, someone could
be 25.3 years old, 45.7 years old, etc.
8. Voltage in an Electrical Circuit: The voltage at a point in an electrical
circuit is a continuous random variable. It can take any value within the
possible voltage range.

These examples illustrate various contexts in which continuous random vari-


ables are used to model and analyze real-world phenomena.


5.4.1 Probability Density Function (pdf )


For a continuous random variable X, the probability distribution is described
by the probability density function (pdf), denoted as f (x). This function
specifies the probability density at each point in the random variable’s range.
The pdf f (x) has the following properties:

1. Non-negativity: f (x) ≥ 0 for all x.

2. Normalization: The total area under the pdf curve over the entire range
of X is equal to 1:

∫_(−∞)^(∞) f (x) dx = 1.

Note that the pdf f (x) does not provide the probability of X taking any spe-
cific value (which is always zero for continuous random variables). Instead, it
indicates the density of probability at each point. To find the probability that
X falls within a specific interval [a, b] given in Figure 5.5, you integrate the pdf
over that interval:

P (a ≤ X ≤ b) = ∫_a^b f (x) dx.


Figure 5.5: The area under the probability density function f (x) between a and
b.

Sometimes, the density of X is denoted by fX (x) to explicitly indicate that


the function f corresponds to the random variable X.


Probability Density Function: The probability density function (pdf)


of a continuous random variable X with support S is an integrable function
f (x) satisfying the following conditions:
(i). f (x) is positive everywhere in the support S, that is, f (x) ≥ 0 for
all x ∈ S. The area under the curve f (x) in the support S is 1,
that is:

∫_S f (x) dx = 1.

(ii). If f (x) is the pdf of X, then the probability that X belongs to


an interval [a, b] is given by the integral of f (x) over that interval,

that is:

P (a ≤ X ≤ b) = ∫_a^b f (x) dx.

It is useful to notice that the probability that a continuous random variable X


takes any specific value a is always 0! Technically, this can be seen by noting
that

P (X = a) = ∫_a^a f (x) dx = 0.

Problem 5.3 (Metal Cylinder Production). Suppose the diameter X of a metal


cylinder has the following probability density function (pdf ):
f (x) = 1.5 − 6(x − 50.0)^2   for 49.5 ≤ x ≤ 50.5
        0                     otherwise.

(i). Prove that f (x) is a valid probability density function.



(ii). Find the probability that the diameter of the metal cylinder lies between
49.8 mm and 50.1 mm, i.e., calculate P (49.8 ≤ X ≤ 50.1).

Solution
(i). To determine if f (x) = 1.5−6(x−50.0)2 for 49.5 ≤ x ≤ 50.5 and f (x) = 0
elsewhere is a valid probability pdf, we need to check two conditions:
1. Non-negativity: f (x) ≥ 0 for all x.
2. Normalization: The total integral of f (x) over all possible values must
equal 1.

Non-negativity Check
We need to ensure that f (x) ≥ 0 for 49.5 ≤ x ≤ 50.5:

f (x) = 1.5 − 6(x − 50.0)2


For 49.5 ≤ x ≤ 50.5, let’s calculate the minimum value of the quadratic
function: (x − 50.0)2 is minimized at x = 50.0, and (x − 50.0)2 ranges from 0
to (0.5)2 = 0.25.

f (x) = 1.5 − 6(x − 50.0)2 ≥ 1.5 − 6 · 0.25 = 1.5 − 1.5 = 0.


Thus, f (x) ≥ 0 for all x in the given interval.

Normalization Check
We need to integrate f (x) over the interval 49.5 ≤ x ≤ 50.5 and check if the
integral equals 1:

T
Z 50.5 Z 50.5
1.5 − 6(x − 50.0)2 dx

f (x) dx =
49.5 49.5
Z 50.5 Z 50.5
= 1.5 dx − 6(x − 50.0)2 dx
49.5 49.5
AF = 1.5 × (50.5 − 49.5) −
Z 50.5
Z 50.5

49.5
6(x − 50.0)2 dx

= 1.5 − 6(x − 50.0)2 dx. (5.1)


49.5
(5.2)

The integral

∫_(49.5)^(50.5) 6(x − 50.0)^2 dx

can be simplified by substitution. Let u = x − 50.0. Then du = dx, and the
limits of integration change accordingly: when x = 49.5, u = −0.5, and when
x = 50.5, u = 0.5. So,

∫_(49.5)^(50.5) 6(x − 50.0)^2 dx = 6 ∫_(−0.5)^(0.5) u^2 du

The integral of u^2 is:

∫_(−0.5)^(0.5) u^2 du = [u^3/3] from −0.5 to 0.5
                      = (0.5)^3/3 − (−0.5)^3/3
                      = 0.125/3 + 0.125/3
                      = 1/12


and

6 ∫_(−0.5)^(0.5) u^2 du = 6 × 1/12 = 6/12 = 0.5

Therefore, from Equation (5.2), we have

∫_(49.5)^(50.5) f (x) dx = 1.5 − 0.5 = 1

Since both conditions are satisfied, f (x) is indeed a valid probability density
function.

The graphical presentation of the density f (x) is presented in Figure 5.6.


Figure 5.6: Density Plot of the pdf f (x)

(ii). We can find the probability that a metal cylinder has a diameter between
49.8 and 50.1 mm which is


P (49.8 ≤ X ≤ 50.1) = ∫_(49.8)^(50.1) f (x) dx
                    = ∫_(49.8)^(50.1) [1.5 − 6(x − 50.0)^2] dx
                    = ∫_(49.8)^(50.1) 1.5 dx − ∫_(49.8)^(50.1) 6(x − 50.0)^2 dx
                    = 1.5 [x] from 49.8 to 50.1 − 6 ∫_(−0.2)^(0.1) u^2 du   [let u = x − 50.0]
                    = 1.5(50.1 − 49.8) − 6 [(0.1)^3/3 − (−0.2)^3/3]
                    = 0.45 − 0.018 = 0.432
AF
Thus, the probability that a metal cylinder has a diameter between 49.8 and
50.1 mm is 0.432 or 43.2%.
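Both integrals can also be verified numerically with scipy. A small sketch
assuming scipy is installed:

    from scipy.integrate import quad

    f = lambda x: 1.5 - 6 * (x - 50.0) ** 2

    total, _ = quad(f, 49.5, 50.5)   # normalization check, part (i)
    prob, _ = quad(f, 49.8, 50.1)    # P(49.8 <= X <= 50.1), part (ii)
    print(round(total, 4), round(prob, 4))  # 1.0 0.432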

5.4.2 Cumulative Distribution Function (cdf )


The cumulative distribution function (cdf) of a continuous random variable
X is a function that gives the probability that X will take a value less than or
equal to x. The cumulative distribution function can be calculated from the
expression:

F (x) = P (X ≤ x) = ∫_(−∞)^x f (y) dy.

In practical applications, the lower integration limit of −∞ can be replaced by


the lower boundary of the state space, since the probability density function
(pdf) is zero outside this region. The pdf can be obtained by differentiating the
cumulative distribution function (cdf), which is given by:

f (x) = dF (x)/dx.
This relationship connects the probability density function to the cumulative
probability.

Cumulative Distribution Function: The cumulative distribution


function F (x) of a random variable X is defined as
F (x) = P (X ≤ x) = ∫_(−∞)^x f (y) dy.


For a continuous random variable X, the three properties mentioned in


Section 5.3.3 are satisfied. In addition, the following property must hold: if the
cdf F (·) is continuous at any a ≤ x ≤ b, then

P (a ≤ X ≤ b) = F (b) − F (a).

Note that F (x) is a non-decreasing function, meaning that if a < b, then


F (a) ≤ F (b). This reflects that as you move to the right along the x-axis, the
cumulative probability does not decrease. Another important property is as
follows:

• limx→−∞ F (x) = 0: As x approaches negative infinity, the cdf ap-

proaches 0, indicating that the probability of the random variable being
less than any finite value is 0.

• limx→∞ F (x) = 1: As x approaches positive infinity, the cdf approaches


1, indicating that the probability of the random variable being less than
any sufficiently large value is 1.
Problem 5.4. Consider Problem 5.3, where the probability density function f (x)
of the random variable X is

f (x) = 1.5 − 6(x − 50.0)^2   for 49.5 ≤ x ≤ 50.5,
        0                     otherwise.
Find the cumulative distribution function of X and give the graphical presenta-
tion of this function.

Solution
The cumulative distribution function of X, for 49.5 ≤ x ≤ 50.5, is defined as

F (x) = ∫_(49.5)^x [1.5 − 6(t − 50.0)^2] dt
      = ∫_(49.5)^x 1.5 dt − ∫_(49.5)^x 6(t − 50.0)^2 dt
      = 1.5(x − 49.5) − 6 ∫_(−0.5)^(x−50) u^2 du     [let u = t − 50.0]
      = 1.5(x − 49.5) − 6 [u^3/3] from −0.5 to x − 50
      = 1.5(x − 49.5) − 2[(x − 50)^3 + 0.125]
      = 1.5(x − 49.5) − 2(x − 50)^3 − 0.25

Therefore, the cdf F (x) is:

F (x) = 0                                      for x < 49.5
        1.5(x − 49.5) − 2(x − 50)^3 − 0.25     for 49.5 ≤ x ≤ 50.5
        1                                      for x > 50.5

The graphical presentation of F (x) is depicted in Figure 5.7.


Figure 5.7: The cumulative distribution function of f (x).
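The closed-form cdf can be evaluated and plotted in Python. A minimal
sketch assuming numpy and matplotlib are installed:

    import numpy as np
    import matplotlib.pyplot as plt

    def F(x):
        x = np.asarray(x, dtype=float)
        mid = 1.5 * (x - 49.5) - 2 * (x - 50) ** 3 - 0.25
        return np.where(x < 49.5, 0.0, np.where(x > 50.5, 1.0, mid))

    xs = np.linspace(49, 51, 400)
    plt.plot(xs, F(xs))
    plt.xlabel("x")
    plt.ylabel("F(x)")
    plt.show()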

Problem 5.5. Let X be a continuous random variable with the probability


density function (pdf ):
f (x) = 2x   for 0 ≤ x ≤ 1
        0    otherwise

(i). Verify that the function f (x) is a valid probability density function by
showing that the total area under the curve is equal to 1.

(ii). Calculate the probability P (0.5 < X < 0.8).

(iii). Derive the cumulative distribution function F (x).

Solution
(i). To verify that the function
f (x) = 2x   for 0 ≤ x ≤ 1
        0    otherwise


is a valid probability density function (pdf), we calculate the integral over its
range:
∫_(−∞)^(∞) f (x) dx = ∫_0^1 2x dx

Calculating the integral:

∫_0^1 2x dx = [x^2] from 0 to 1 = 1^2 − 0^2 = 1

Since ∫_(−∞)^(∞) f (x) dx = 1, we conclude that f (x) is a valid probability
density function.

(ii). We compute
P (0.5 < X < 0.8) = ∫_(0.5)^(0.8) 2x dx = [x^2] from 0.5 to 0.8
                  = (0.8)^2 − (0.5)^2 = 0.39

Thus, P (0.5 < X < 0.8) = 0.39.

(iii). The cumulative distribution function F (x) is given by:


F (x) = ∫_(−∞)^x f (t) dt

For 0 ≤ x ≤ 1:

F (x) = ∫_0^x 2t dt = [t^2] from 0 to x = x^2

So, the cdf F (x) can be summarized as:

F (x) = 0     for x < 0
        x^2   for 0 ≤ x ≤ 1
        1     for x > 1
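All three parts can be checked symbolically with sympy. A minimal sketch
assuming sympy is installed:

    import sympy as sp

    t, x = sp.symbols("t x")
    f = 2 * t

    total = sp.integrate(f, (t, 0, 1))    # (i) equals 1, so f is a valid pdf
    prob = sp.integrate(f, (t, sp.Rational(1, 2), sp.Rational(4, 5)))
    F = sp.integrate(f, (t, 0, x))        # (iii) cdf for 0 <= x <= 1
    print(total, prob, F)  # 1 39/100 x**2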

Problem 5.6. Given the cumulative distribution function (CDF):



0
 for x < 0
3
F (x) = x for 0 ≤ x ≤ 1

1 for x > 1

(i). Find the probability density function (pdf) f (x). Verify that f (x)
is a valid pdf.

(ii). Calculate the probability P (0.2 < X < 0.5).


Solution
(i). Differentiating the cdf gives the pdf: f (x) = dF (x)/dx = 3x^2 for
0 ≤ x ≤ 1, and f (x) = 0 otherwise. It is noted that f (x) ≥ 0 for all x and

∫_(−∞)^(∞) f (x) dx = ∫_0^1 3x^2 dx = 3 [x^3/3] from 0 to 1 = 1.

Since f (x) ≥ 0 and ∫_(−∞)^(∞) f (x) dx = 1, f (x) is a valid probability density
function.

(ii). To find the probability P (0.2 < X < 0.5), we can use the pdf:
P (0.2 < X < 0.5) = ∫_(0.2)^(0.5) f (x) dx = ∫_(0.2)^(0.5) 3x^2 dx
                  = [x^3] from 0.2 to 0.5 = (0.5)^3 − (0.2)^3
                  = 0.125 − 0.008 = 0.117

Alternatively,

P (0.2 < X < 0.5) = F (0.5) − F (0.2) = 0.5^3 − 0.2^3 = 0.117

Hence, the probability P (0.2 < X < 0.5) is 0.117.
Problem 5.7. Let X be a continuous random variable with the pdf:
f (x) = (1/2) e^(−|x|)   for −∞ < x < ∞
Compute F (x).

Solution
For −∞ < x < ∞, the cumulative distribution function is

F (x) = ∫_(−∞)^x (1/2) e^(−|t|) dt

where e^(−|t|) can be split into two parts depending on the sign of t.

For x < 0:

F (x) = ∫_(−∞)^x (1/2) e^t dt = (1/2) [e^t] from −∞ to x = (1/2)(e^x − 0) = (1/2) e^x

For x ≥ 0:

F (x) = ∫_(−∞)^0 (1/2) e^t dt + ∫_0^x (1/2) e^(−t) dt
      = (1/2) [e^t] from −∞ to 0 + (1/2) [−e^(−t)] from 0 to x
      = (1/2)(1 − 0) + (1/2)(1 − e^(−x))
      = 1 − (1/2) e^(−x)


So, the cdf F (x) is:


F (x) = (1/2) e^x          for x < 0
        1 − (1/2) e^(−x)   for x ≥ 0.

5.4.3 Exercises
1. Consider a random variable measuring the following quantities. In each
case, state with reasons whether you think it is more appropriate to define
the random variable as discrete or continuous.

(a) The number of books in a library
(b) The duration of a phone call
(c) The number of steps a person takes in a day
(d) The amount of rainfall in a month
AF (e)
(f)
The
The
number of languages a person speaks
speed of a car on a highway

2. A random variable X takes values between 4 and 6 with a probability


density function
f (x) = k / (x ln(1.5))   for 4 ≤ x ≤ 6.
(a) What is the value of k?
(b) Make a sketch of the probability density function.
(c) What is P (4.5 ≤ X ≤ 5.5)?

(d) Construct and sketch the cumulative distribution function.

3. A random variable Y takes values between 1 and 3 with a probability


density function
g(y) = k / (y + 1)^2   for 1 ≤ y ≤ 3.
(a) Find the value of k and then make a sketch of the probability density
function.
(b) What is P (1.5 ≤ Y ≤ 2.5)?
(c) Construct and sketch the cumulative distribution function.

4. A random variable Z takes values between 2 and 5 with a probability


density function
h(z) = k / (z + 1)^3   for 2 ≤ z ≤ 5.


(a) Find the value of k and then make a sketch of the probability density
function.
(b) What is P (2.5 ≤ Z ≤ 4)?
(c) Construct and sketch the cumulative distribution function.

5. A random variable X takes values between 0 and 4 with a cumulative


distribution function
F (x) = x^2 / 16   for 0 ≤ x ≤ 4.
(a) Sketch the cumulative distribution function.

(b) What is P (X ≤ 2)?
(c) What is P (1 ≤ X ≤ 3)?
(d) Construct and sketch the probability density function.

6. The resistance X of an electrical component has a probability density


function

f (x) = Ax(130 − x^2)   for resistance values in the range 10 ≤ x ≤ 11.

(a) Calculate the value of the constant A.


(b) Calculate the cumulative distribution function.
(c) What is the probability that the electrical component has a resistance
between 10.25 and 10.5?

5.5 The Expectation of a Random Variable



While the probability mass function or the probability density function provides
complete information about the probabilistic properties of a random variable,
it is often useful to use some summary measures of these properties. One of
the most fundamental summary measures is the expectation or mean of a ran-
dom variable, denoted by E(X), which represents the “average” value of the
random variable. Two random variables with the same expected value can be
considered to have the same average value, even though their probability mass
functions or probability density functions may differ significantly.

The expected value (or mean) of a random variable is a measure of its


central tendency.

Expected Value of a Random Variable: For a discrete random vari-


able X,

E[X] = ∑x x · P (X = x)

For a continuous random variable X,

E[X] = ∫_(−∞)^(∞) x · f (x) dx.

5.5.1 Example: Testing Electronic Components


To find the expectation (expected value) E[X] for the random variable X rep-
resenting the number of defective components in the three tested electronic
components, we use the definition of the expectation for a discrete random
variable:

E[X] = ∑x x · P (X = x)

Given the probability mass function (pmf) for X in Table 5.2, the calcula-
tions for the E[X] are shown in the following table.
x        P (X = x)    x · P (X = x)
0        1/8          0
1        3/8          3/8
2        3/8          6/8
3        1/8          3/8
Total                 12/8

Using the output in the above table


E[X] = ∑x x · P (X = x) = 12/8 = 1.5
Therefore, the expected number of defective components E[X] is 1.5.
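In Python this is a one-line weighted sum. A minimal sketch:

    x = [0, 1, 2, 3]
    pmf = [1/8, 3/8, 3/8, 1/8]

    ex = sum(xi * pi for xi, pi in zip(x, pmf))
    print(ex)  # 1.5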
Problem 5.8.
An office has four copying machines, and the random variable X denotes how
many of them are in use at a particular time. Suppose the probability mass
function of X is given below:

x 0 1 2 3 4
Pr(X = x) k 0.02 0.05 0.4 (k + 0.2)

(a) What is the value of k.

(b) Find the expectation of X.


Solution
(a) To find the value of k, we set up the equation based on the property that
the sum of probabilities must equal 1:

k + 0.02 + 0.05 + 0.4 + (k + 0.2) = 1


Solving for k:

or, 2k + 0.67 = 1

or, 2k = 1 − 0.67

or, 2k = 0.33

∴ k = 0.33/2 = 0.165
(b) To find the expectation E(X) of the random variable X, we use the
formula for the expected value:
E(X) = ∑(x=0 to 4) x · Pr(X = x)

Given the probability mass function:

Pr(X = 0) = k = 0.165,
Pr(X = 1) = 0.02,
Pr(X = 2) = 0.05,
DR

Pr(X = 3) = 0.4,
Pr(X = 4) = k + 0.2 = 0.165 + 0.2 = 0.365.

Now we can calculate E(X):

E(X) = 0·Pr(X = 0)+1·Pr(X = 1)+2·Pr(X = 2)+3·Pr(X = 3)+4·Pr(X = 4)

Substituting the values:

E(X) = 0 · 0.165 + 1 · 0.02 + 2 · 0.05 + 3 · 0.4 + 4 · 0.365


= 0 + 0.02 + 0.1 + 1.2 + 1.46
= 0.02 + 0.1 + 1.2 + 1.46
= 2.78.

Thus, the expectation of X is 2.78.


5.5.2 Example: Metal Cylinder Production


The probability density function of the diameter of a metal cylinder (X) is

f (x) = 1.5 − 6(x − 50.0)2 for 49.5 ≤ x ≤ 50.5

The expectation E(X) is calculated as follows:


E(X) = ∫_(49.5)^(50.5) x f (x) dx = ∫_(49.5)^(50.5) x [1.5 − 6(x − 50.0)^2] dx
     = ∫_(49.5)^(50.5) 1.5x dx − ∫_(49.5)^(50.5) 6x(x − 50)^2 dx
     = 1.5 [x^2/2] from 49.5 to 50.5 − 6 ∫_(−0.5)^(0.5) (u + 50) u^2 du   [let u = x − 50]
     = 1.5 × (50.5^2 − 49.5^2)/2 − 6 ∫_(−0.5)^(0.5) (u^3 + 50u^2) du
     = 1.5 × (2550.25 − 2450.25)/2 − 6 [u^4/4 + 50u^3/3] from −0.5 to 0.5
     = 75 − 6 × 50 × ((0.5)^3/3 − (−0.5)^3/3)
     = 75 − 300 × (1/12) = 75 − 25
     = 50

Hence the expectation of the diameter of a metal cylinder is 50 mm.
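The integral can be confirmed numerically. A small sketch assuming scipy
is installed:

    from scipy.integrate import quad

    f = lambda x: 1.5 - 6 * (x - 50.0) ** 2
    ex, _ = quad(lambda x: x * f(x), 49.5, 50.5)
    print(round(ex, 4))  # 50.0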

5.5.3 Exercises

1. Suppose the Laptop repair costs are $50, $200, and $350 with respective
probability values of 0.3, 0.2, and 0.5. What is the expected Laptop repair
cost?
2. Suppose the daily sales of a small shop are $100, $150, and $250 with
respective probability values of 0.4, 0.3, and 0.3. What is the expected
daily sales?
3. A game offers prizes of $10, $50, and $100 with respective probability
values of 0.6, 0.3, and 0.1. What is the expected prize amount?
4. Consider the waiting times (in minutes) at a bus stop: 5, 10, and 15 with
respective probability values of 0.5, 0.3, and 0.2. What is the expected
waiting time?
5. The lifetime (in years) of a certain type of light bulb is either 1, 3, or
5 with respective probability values of 0.2, 0.5, and 0.3. What is the
expected lifetime of the light bulb?


6. The number of daily website visits for a company is either 200, 500, or
800 with respective probability values of 0.25, 0.5, and 0.25. What is the
expected number of daily visits?
7. Let the temperature X in degrees Fahrenheit of a particular chemical
reaction with density
f(x) = (x − 190)/3600,  220 ≤ x ≤ 280.
Find the expectation of the temperature.

5.6 The Variance of a Random Variable
Another key summary measure of the distribution of a random variable is the
variance, which quantifies the spread or variability in the values that the random
variable can take. While the mean or expectation captures the central or average
value of the random variable, the variance measures the dispersion or deviation
of the random variable around its mean value. Specifically, the variance of a
random variable is defined as

Var(X) = E((X − E(X))2 ).


This means that the variance is the expected value of the squared deviations
of the random variable values from the expected value E(X). The variance is
always positive, and larger values indicate a greater spread in the distribution
of the random variable around the mean. An alternative and often simpler
expression for calculating the variance is

Var(X) = E((X − E(X))2 )



= E(X 2 − 2XE(X) + (E(X))2 )


= E(X 2 ) − 2E(X)E(X) + (E(X))2
= E(X 2 ) − (E(X))2 .

Variance: The variance of a random variable X is defined as

Var(X) = E((X − E(X))2 )

or equivalently
Var(X) = E(X 2 ) − (E(X))2 .

The variance is a positive measure that indicates the spread of the distri-
bution of the random variable around its mean value. Larger values of the
variance suggest that the distribution is more spread out.


It is typical to use the symbol µ to represent the mean or expectation of a


random variable, and the symbol σ 2 to represent the variance. The standard
deviation, denoted by σ, is the square root of the variance and is often used
instead of the variance to describe the spread of the distribution.

Standard Deviation: The standard deviation of a random variable X


is defined as the positive square root of the variance. The variance of a
random variable is commonly denoted by σ 2 , so σ represents the standard
deviation.

The concept of variance can be illustrated graphically. Figure 5.8 shows two
probability density functions with different mean values but identical variances.

The variances are the same because the shape or spread of the density func-
tions around their mean values is the same. In contrast, Figure 5.9 shows two
probability density functions with the same mean values but different variances.
The density function that is flatter and more spread out has the larger variance.
AF 0.4 µ = 0, σ 2 = 1
µ = 14, σ 2 = 1

0.3
f (x)

0.2

0.1
DR

−10 0 10 20 30 40
x

Figure 5.8: Two normal distributions with different means but identical vari-
ances.


Figure 5.9: Two normal distributions with identical means (µ = 0) but different variances (σ² = 1 and σ² = 4).

It is important to note that the standard deviation has the same units as the
random variable X, while the variance has units that are squared. For instance,
if the random variable X is measured in seconds, then the standard deviation
will also be in seconds, but the variance will be measured in seconds squared (seconds²).

Example: Testing Electronic Components


We already found that E[X] = 1.5. To find the variance of the random variable X, we need to compute E[X²]:

E[X²] = ∑_x x² · P(X = x)
Given the probability mass function (pmf) for X:

Table 5.5: Calculating E[X] and E[X²]

x       P(X = x)    x · P(X = x)    x² · P(X = x)
0       1/8         0               0
1       3/8         3/8             3/8
2       3/8         6/8             12/8
3       1/8         3/8             9/8
Total               12/8 = 1.5      24/8 = 3


Now, we can find the variance:

Var(X) = E[X 2 ] − (E[X])2

Var(X) = 3 − (1.5)2

Var(X) = 3 − 2.25

Var(X) = 0.75
Therefore, the variance of X is 0.75.
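The variance computation for this pmf can be verified with a few lines of Python; a minimal NumPy sketch:

    import numpy as np

    x = np.array([0, 1, 2, 3])
    p = np.array([1/8, 3/8, 3/8, 1/8])

    mean = np.sum(x * p)               # E[X] = 1.5
    second_moment = np.sum(x**2 * p)   # E[X^2] = 3.0
    variance = second_moment - mean**2
    print(variance)  # 0.75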

5.6.1 Example: Metal Cylinder Production
The probability density function of the diameter of a metal cylinder (X) is
f (x) = 1.5 − 6(x − 50.0)2 for 49.5 ≤ x ≤ 50.5
and E(X) = 50. To find the variance V(X), we need E(X²):

E(X²) = ∫_{49.5}^{50.5} x² f(x) dx = ∫_{49.5}^{50.5} x² (1.5 − 6(x − 50.0)²) dx

      = ∫_{49.5}^{50.5} 1.5x² dx − ∫_{49.5}^{50.5} 6x²(x − 50)² dx

      = 1.5 [x³/3]_{49.5}^{50.5} − 6 ∫_{−0.5}^{0.5} (u + 50)² u² du        [let u = x − 50]

      = 1.5 ((50.5³ − 49.5³)/3) − 6 ∫_{−0.5}^{0.5} (u⁴ + 100u³ + 2500u²) du

      = 3750.125 − 1250.075

      = 2500.05

Therefore,

V(X) = E(X²) − (E(X))² = 2500.05 − 2500 = 0.05

Thus, the variance V(X) = 0.05 and the standard deviation sd(X) = √0.05 = 0.2236.
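Again, SciPy can confirm these integrals numerically; a sketch reusing the pdf from the expectation example:

    from scipy.integrate import quad

    f = lambda x: 1.5 - 6 * (x - 50.0) ** 2
    ex, _ = quad(lambda x: x * f(x), 49.5, 50.5)      # E[X]
    ex2, _ = quad(lambda x: x**2 * f(x), 49.5, 50.5)  # E[X^2]
    var = ex2 - ex**2
    print(round(var, 4), round(var**0.5, 4))  # 0.05 0.2236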
Problem 5.9. Consider a random variable X representing the number of heads
in three tosses of a fair coin. The possible values of X are 0, 1, 2, and 3. The
pmf of X is given by:

P(X = x) = C(3, x) · (1/2)³,  x = 0, 1, 2, 3,

where C(3, x) is the binomial coefficient.
Find the expected value and variance of X.


5.6.2 Chebyshev’s Inequality


Chebyshev’s Inequality is a powerful tool for understanding the spread of data
in scenarios where the distribution is unknown. It provides a way to make
probabilistic statements about deviations from the mean, which is particularly
useful in data science applications such as quality control and salary analysis.

Chebyshev's Inequality: Let X be a random variable with mean µ and
variance σ². Chebyshev's Inequality states that for any k > 0,

P(|X − µ| ≥ kσ) ≤ 1/k²

or equivalently,

P(µ − kσ ≤ X ≤ µ + kσ) ≥ 1 − 1/k².

This inequality indicates that the probability of a random variable deviating
from its mean by more than k standard deviations is at most 1/k².

5.6.3 Example: Blood Pressure Measurement


Consider a study on blood pressure measurements where the systolic blood
pressure X of patients is known to have a mean of 120 mmHg and a standard
deviation of 15 mmHg. We want to determine the range within which at least
90% of the measurements fall, according to Chebyshev’s Inequality:
P(120 − k × 15 ≤ X ≤ 120 + k × 15) ≥ 1 − 1/k²,

where k is the number of standard deviations from the mean. We want:

1 − 1/k² ≥ 0.90

Solving for k:

1/k² ≤ 0.10

or, k² ≥ 1/0.10 = 10

∴ k ≥ √10 ≈ 3.16

So, at least 90% of systolic blood pressure measurements should fall within:

120 ± 3.16 × 15 = 120 ± 47.4

In other words, the blood pressure measurements are expected to be within
the range of 72.6 mmHg to 167.4 mmHg.
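The bound is straightforward to compute in Python; a small sketch with the standard library (variable names are illustrative):

    import math

    mu, sigma, coverage = 120, 15, 0.90
    k = math.sqrt(1 / (1 - coverage))   # from 1 - 1/k^2 >= coverage
    low, high = mu - k * sigma, mu + k * sigma
    print(round(k, 2), round(low, 1), round(high, 1))  # 3.16 72.6 167.4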


5.6.4 Example: Employee Salaries


Consider a company where the average salary is $60,000 with a standard de-
viation of $5,000. If the company wants to guarantee that at least 80% of
employees’ salaries are within a certain range of the mean salary, we can use
Chebyshev’s Inequality to estimate this range.
P(60000 − k × 5000 ≤ X ≤ 60000 + k × 5000) ≥ 1 − 1/k².

To ensure at least 80% of salaries are within this range, we want:

1 − 1/k² ≥ 0.80

or, 1/k² ≤ 0.20  =⇒  k² ≥ 1/0.20 = 5

∴ k ≥ √5 ≈ 2.24

Thus, at least 80% of salaries should fall within:

60,000 ± 2.24 × 5,000  or,  60,000 ± 11,200

5.6.5 Quantiles of Random Variables


Quantiles are useful summary measures that provide insight into the spread or
variability of a random variable’s distribution. The p-th quantile of a random
variable X, which has a cumulative distribution function F (x), is the value x
that satisfies

F (x) = p

meaning that there is a probability p that the random variable is less than
the p-th quantile. The probability p is often expressed as a percentage, and
the corresponding quantiles are known as percentiles. For instance, the 70th
percentile is the value x for which F (x) = 0.70. It is important to note that
the 50th percentile of a distribution is also known as the median.

Quantiles: The p-th quantile of a random variable X with a cumulative


distribution function F (x) is the value x such that

F (x) = p
This is also known as the p × 100-th percentile of the random variable.
The probability p signifies the chance that the random variable takes on a
value less than the p-th quantile.

177
CHAPTER 5. RANDOM VARIABLE AND ITS PROPERTIES

To understand the spread of a distribution, one can compute its quartiles.


The upper quartile is the 75th percentile, and the lower quartile is the 25th
percentile. Together with the median, these quartiles divide the range of the
random variable into four equal parts, each with a probability of 0.25.

The interquartile range, which is the distance between the upper and lower
quartiles as depicted in Figure 2.55, serves as an indicator of distribution spread
similar to variance. A larger interquartile range suggests that the distribution
of the random variable is more spread out.

Quartiles and Interquartile Range: The upper quartile of a distribu-


tion is the 75th percentile, and the lower quartile is the 25th percentile.

The interquartile range, defined as the distance between these two quar-
tiles, provides a measure of distribution spread analogous to variance.

5.6.6 Example: Metal Cylinder Production


The cumulative distribution function (cdf) for the diameters of the metal cylinders is given by

F(x) = 1.5x − 2(x − 50.0)³ − 74.5

for 49.5 ≤ x ≤ 50.5.

The upper quartile (Q₃) is found at the value of x where

F(x) = 0.75.

That is,

1.5x − 2(x − 50.0)³ − 74.5 = 0.75

This equation can be solved numerically to find the precise value of x, which corresponds to Q₃ = 50.17 mm.


Figure 5.10: Interquartile range for metal cylinder diameters (the area under f(x) between Q₁ and Q₃ is 0.5).


The lower quartile (Q1 ) is the value where

F (x) = 0.25
resulting in Q1 = 49.83 mm. Consequently, the interquartile range is calcu-
lated as

50.17 − 49.83 = 0.34 mm


indicating that half of the cylinders will have diameters between 49.83 mm
and 50.17 mm, as illustrated in Figure 5.10.
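The quartiles can also be found numerically in Python by root finding on F(x) − p; a sketch using SciPy's brentq (the bracketing interval is the support of X):

    from scipy.optimize import brentq

    # cdf of the cylinder diameters on [49.5, 50.5]
    F = lambda x: 1.5 * x - 2 * (x - 50.0) ** 3 - 74.5

    q1 = brentq(lambda x: F(x) - 0.25, 49.5, 50.5)
    q3 = brentq(lambda x: F(x) - 0.75, 49.5, 50.5)
    print(round(q1, 2), round(q3, 2), round(q3 - q1, 2))  # 49.83 50.17 0.34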

5.6.7 Exercises
1. Consider the Laptop repair costs discussed in question 1 of Exercises 5.5.3, and calculate the variance and standard deviation of the repair cost.
2. In the machine breakdown problem, suppose that electrical failures generally cost $400 to repair, mechanical failures have a repair cost of $550, and operator misuse failures have a repair cost of only $100. These repair costs generate a random variable cost, as illustrated in the following Table.

x_i            100     400       550
Pr(X = x_i)    0.25    2(k + 1)  0.4

(a) What is the value of k?


(b) Find the average of the repair costs.


(c) Find the variance of the repair costs.

3. A random variable X takes values between 4 and 6 with a probability


density function

f(x) = 1/(x ln(1.5))  for 4 ≤ x ≤ 6.

(a) What is the variance of this random variable?


(b) What is the standard deviation of this random variable?

(c) Find the upper and lower quartiles of this random variable.
(d) What is the interquartile range?

4. A random variable X represents the time (in hours) until failure of a


certain machine part, which is uniformly distributed between 100 and
200 hours.
f(x) = 1/100  for 100 ≤ x ≤ 200.

(a) What is the variance of this random variable?


(b) What is the standard deviation of this random variable?
(c) Find the upper and lower quartiles of this random variable.
(d) What is the interquartile range?
5. Consider a random variable Y representing the strength of a material,
which follows a normal distribution with mean 500 MPa and standard
deviation 50 MPa.

(a) What is the probability that the strength is between 450 MPa and
550 MPa?
(b) What is the 95th percentile of the strength?
(c) Calculate the variance of the strength.
(d) What proportion of material samples have a strength greater than
600 MPa?
6. A random variable Z represents the systolic blood pressure (in mmHg) of
a population, which is uniformly distributed between 90 and 140 mmHg.
f(z) = 1/50  for 90 ≤ z ≤ 140.
(a) What is the variance of this random variable?
(b) What is the standard deviation of this random variable?
(c) Find the upper and lower quartiles of this random variable.


(d) What is the interquartile range?


7. A researcher is studying the cholesterol levels of a population of adults.
The cholesterol levels are known to have a mean of 200 mg/dL and a
standard deviation of 25 mg/dL.

(a) Using Chebyshev’s Inequality, determine the minimum percentage of


adults whose cholesterol levels are within 50 mg/dL of the mean.
(b) To guarantee that at least 85% of the population has cholesterol
levels within a certain number of standard deviations from the mean,
how many standard deviations from the mean are required?

(c) If a randomly selected adult has a cholesterol level of 250 mg/dL,
what is the maximum probability that this level deviates from the
mean by at least 50 mg/dL according to Chebyshev’s Inequality?

8. The systolic blood pressure of a certain population is normally distributed


with a mean of 120 mmHg and a standard deviation of 15 mmHg.
AF (a) What is the probability that a randomly selected person has a sys-
tolic blood pressure less than 110 mmHg?
(b) What is the probability that a randomly selected person has a sys-
tolic blood pressure between 110 mmHg and 130 mmHg?
(c) Find the 95th percentile of the systolic blood pressure distribution.
9. Suppose that the battery failure time, measured in hours, has a probability density function given by

f(x) = 2/(x + 1)³  for x ≥ 0.
(a) Find the expected battery failure time.
(b) What is the probability that the battery fails within the first 4 hours.
(c) Find the cumulative distribution function of the battery failure times.
(d) Find the median of the battery failure times.

5.7 Essential Generating Functions


In probability theory and statistics, generating functions are a powerful tool
used to analyze and manipulate probability distributions. There are three main
types of generating functions:
• Moment Generating Function (MGF): Used for both discrete and con-
tinuous random variables.


• Probability Generating Function (PGF): Used for discrete random vari-


ables.

• Characteristic Function (CF): It complements other generating func-


tions, such as the MGF and PGF, and is closely related to the Fourier
transform.

5.7.1 Moment Generating Function


The Moment Generating Function (MGF) is a powerful tool in probability the-
ory and statistics used to summarize the properties of a probability distribution.

It can help us find moments (for example, mean, variance), combine variables,
and understand the distribution better. The Moment Generating Function of
a random variable X is defined as:

M (t) = E[etX ].

5.7.2 Key Properties of MGF


• Higher Order Moments:
■ The first moment (mean) is found by taking the first derivative
of the MGF and evaluating it at t = 0:
µ = E[X] = M ′ (0)

■ The second moment is found by taking the second derivative of


the MGF and evaluating it at t = 0:

E[X 2 ] = M ′′ (0)
■ The rth moment is found by taking the rth derivative of the
MGF and evaluating it at t = 0:

E[X^r] = d^r M(t)/dt^r |_{t=0}

Evaluate the Derivative at t = 0 After differentiating the MGF


the required number of times, substitute t = 0 to obtain the
moment.
■ The variance can be found by taking the second derivative of
the MGF, evaluating at t = 0, and then using it with the mean:

Var(X) = M ′′ (0) − [M ′ (0)]2

• Combining Variables:


■ If X and Y are independent random variables, the MGF of their


sum X + Y is the product of their individual MGFs:

MX+Y (t) = MX (t) · MY (t)

Problem 5.10 (Discrete Random Variable). Consider a discrete random variable X that takes the value 1 with probability p and the value 0 with probability 1 − p. Find the moment generating function of X, and hence find the mean and variance.

Solution

The MGF of a random variable X is given by:

M(t) = E[e^{tX}] = ∑_x e^{tx} P(X = x)

We have,
P (X = 1) = p and P (X = 0) = 1 − p
Substituting into the MGF formula:

M (t) = et·1 P (X = 1) + et·0 P (X = 0)


= et · p + e0 · (1 − p)
= pet + (1 − p)

Mean of X: To find the mean E[X], we differentiate the MGF with respect
to t and evaluate at t = 0:
M′(t) = d/dt [pe^t + (1 − p)] = pe^t
Evaluating at t = 0:

M ′ (0) = pe0 = p
Thus, the mean of X is:

E[X] = p
Variance of X:
To find the variance, we first calculate the second moment E[X 2 ], which is the
second derivative of the MGF evaluated at t = 0:
M″(t) = d/dt [pe^t] = pe^t
Evaluating at t = 0

M ′′ (0) = pe0 = p


Thus, the second moment is

E[X 2 ] = p
The variance is given by

Var(X) = E[X 2 ] − (E[X])2


= p − p2 = p(1 − p)

Thus, the variance of X is

Var(X) = p(1 − p)
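Symbolic differentiation makes this routine to check in Python; a sketch with SymPy:

    import sympy as sp

    t, p = sp.symbols('t p')
    M = p * sp.exp(t) + (1 - p)            # MGF of this two-point variable

    mean = sp.diff(M, t).subs(t, 0)        # M'(0) = p
    second = sp.diff(M, t, 2).subs(t, 0)   # M''(0) = p
    variance = sp.simplify(second - mean**2)
    print(mean, variance)                  # p and p - p**2, i.e. p(1 - p)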

Problem 5.11 (Continuous Random Variable). The pdf of X is given by:

f(x) = 1 if 0 ≤ x ≤ 1, and 0 otherwise.

This is the density of the uniform distribution on [0, 1]. Find the moment generating function of X, and hence find the mean and variance.

Solution
For this density the moment generating function is

M(t) = E[e^{tX}] = ∫₀¹ e^{tx} · 1 dx = [e^{tx}/t]₀¹ = (e^t − 1)/t,  for t ≠ 0.

For t = 0, M(0) = 1 (since M(t) is always 1 at t = 0 for any distribution).


The first derivative of M(t) with respect to t is:

M′(t) = d/dt [(e^t − 1)/t].

Applying the quotient rule:

M′(t) = (t · e^t − (e^t − 1))/t² = (te^t − e^t + 1)/t² = (e^t(t − 1) + 1)/t².
To evaluate the mean, we need to find the limit of M′(t) as t → 0. We have the indeterminate form 0/0, so we apply L'Hôpital's rule. To do this, we need to differentiate the numerator and denominator separately. So, we find

d/dt [(t − 1)e^t + 1] = e^t(t − 1) + e^t = te^t,

and

d/dt (t²) = 2t.


So, applying L'Hôpital's rule, we find

M′(0) = lim_{t→0} ((t − 1)e^t + 1)/t² = lim_{t→0} te^t/(2t) = lim_{t→0} e^t/2 = 1/2.

So, E[X] = 1/2.

The second derivative of M(t) with respect to t is:

M″(t) = d/dt [(e^t(t − 1) + 1)/t²].

After some calculation (similar to the first derivative), we find:

M″(t) = (e^t(t² − 2t + 2) − 2)/t³.

Evaluating the limit at t = 0 (again by expanding or applying L'Hôpital's rule):

M″(0) = lim_{t→0} (e^t(t² − 2t + 2) − 2)/t³ = 1/3.

So, E[X²] = 1/3. Thus, the first two moments for a uniform random variable X on [0, 1] are:

E[X] = 1/2 and E[X²] = 1/3,

and hence Var(X) = E[X²] − (E[X])² = 1/3 − 1/4 = 1/12.
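Because M(t) has a removable singularity at t = 0, a symbolic check should use limits rather than direct substitution; a SymPy sketch:

    import sympy as sp

    t = sp.symbols('t')
    M = (sp.exp(t) - 1) / t                    # MGF of the Uniform(0, 1) density

    mean = sp.limit(sp.diff(M, t), t, 0)       # E[X] = 1/2
    second = sp.limit(sp.diff(M, t, 2), t, 0)  # E[X^2] = 1/3
    print(mean, second, second - mean**2)      # 1/2 1/3 1/12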

5.7.3 Probability Generating Function (PGF)


A Probability Generating Function (PGF) is a related concept to the Moment
Generating Function (MGF), specifically designed for discrete random variables.
The PGF provides a way to encode the probability distribution of a discrete
random variable into a generating function. It is particularly useful for random

variables that take non-negative integer values, such as the number of successes
in a binomial distribution, or the number of events in a Poisson distribution.

Let X be a discrete random variable with probability mass function p(x) =


P (X = x) for x = 0, 1, 2, . . .. The PGF of X, denoted by G(s), is defined as:

G(s) = E[s^X] = ∑_{x=0}^{∞} s^x p(x),
where s is a real or complex number for which the series converges.

Properties of the PGF


Let X be a discrete random variable taking non-negative integer values, and let
G(s) be its probability generating function, defined as:

G(s) = E[s^X] = ∑_{x=0}^{∞} s^x P(x)


1. Normalization Property
The PGF at s = 1 is always equal to 1:

G(1) = ∑_{x=0}^{∞} 1^x P(x) = ∑_{x=0}^{∞} P(x) = 1

This holds for any probability distribution.

2. Probability Recovery
The probability that the random variable X takes the value x can be recovered

by differentiating the PGF:

P(X = x) = (1/x!) · d^x G(s)/ds^x |_{s=0}

This formula allows for the extraction of individual probabilities from the PGF.
AF
3. Expected Value (Mean)
The expected value E[X] of the random variable X can be obtained by differ-
entiating the PGF and evaluating at s = 1:

E[X] = dG(s)/ds |_{s=1}

This gives the mean of the distribution directly from the PGF.

4. Variance
The variance Var(X) can be derived using the PGF. First, compute the first and second derivatives of the PGF. The second derivative gives the second factorial moment:

E[X(X − 1)] = d²G(s)/ds² |_{s=1}

Then, the variance is:

Var(X) = G″(1) + G′(1) − (G′(1))²

5. Sum of Independent Random Variables


If X1 and X2 are independent random variables, the PGF of their sum is the
product of their individual PGFs:

GX1 +X2 (s) = GX1 (s) · GX2 (s)

This property is useful for dealing with sums of independent random variables.


6. PGF of a Constant
If X is a constant random variable, i.e., P (X = c) = 1 for some constant c, the
PGF is:
G(s) = s^c
This reflects that the random variable always takes the value c, so the PGF has
only one non-zero term at x = c.

7. Derivative Relations
The r-th factorial moment can be derived from the PGF by differentiating it r times and evaluating at s = 1:

E[X(X − 1) · · · (X − r + 1)] = d^r G(s)/ds^r |_{s=1}

This formula is useful for calculating higher-order moments of the distribution.
Problem 5.12. Consider a random variable X with parameter λ. The probability mass function of X is:

p(x) = λ^x e^{−λ}/x!,  x = 0, 1, 2, . . .

Find the PGF, and hence the mean and variance.

Solution
The PGF G(s) is defined as:

G(s) = E[s^X] = ∑_{x=0}^{∞} p(x) s^x.

Substituting the pmf p(x):

G(s) = ∑_{x=0}^{∞} (λ^x e^{−λ}/x!) s^x.

We can factor out e^{−λ} as it is constant with respect to the sum:

G(s) = e^{−λ} ∑_{x=0}^{∞} (λs)^x/x!.

Recognize that the sum is the Taylor series expansion of e^{λs}:

∑_{x=0}^{∞} (λs)^x/x! = e^{λs}.

Therefore, the PGF is:

G(s) = e^{−λ} · e^{λs} = e^{λ(s−1)}.


Finding the Mean


The mean E[X] can be obtained by differentiating the PGF and evaluating at
s = 1:

E[X] = G′ (1).
First, compute the derivative of G(s):

G(s) = eλ(s−1) .

G′ (s) = λeλ(s−1) .

Evaluating at s = 1:

G′ (1) = λeλ(1−1) = λe0 = λ.


Thus, the mean E[X] is λ.
Finding the Variance
The variance can be found using:

Var(X) = G′′ (1) + G′ (1) − (G′ (1))2 .


Compute the second derivative of G(s):

G′′ (s) = λ2 eλ(s−1) .


Evaluating at s = 1:

G′′ (1) = λ2 eλ(1−1) = λ2 e0 = λ2 .



Now, calculate the variance:

Var(X) = G′′ (1) + G′ (1) − (G′ (1))2 .


Substitute the values:

Var(X) = λ2 + λ − λ2 = λ.
Thus, the variance Var(X) is λ.
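The same derivative computations can be checked symbolically; a short SymPy sketch (lam stands for λ):

    import sympy as sp

    s, lam = sp.symbols('s lam', positive=True)
    G = sp.exp(lam * (s - 1))            # PGF derived above

    G1 = sp.diff(G, s).subs(s, 1)        # G'(1), the mean
    G2 = sp.diff(G, s, 2).subs(s, 1)     # G''(1) = E[X(X-1)]
    variance = sp.simplify(G2 + G1 - G1**2)
    print(G1, variance)                  # lam lam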


Applications
PGFs are used in various fields including:

• Queueing Theory: To analyze the number of customers in a queue.

• Reliability Engineering: To model system lifetimes.

• Genetics: To study inheritance patterns.

The PGF is a compact and powerful tool for handling problems involving
sums of random variables and their distributions.

5.7.4 Characteristic Function (CF)
The characteristic function (CF) of a random variable X is a fundamental tool
in probability theory, and it is closely related to the moment generating func-
tion. The characteristic function provides an alternative way to describe the
distribution of X, and it is particularly useful in the study of sums of indepen-
AF
dent random variables. It is important in data science, particularly in areas
related to probability theory, statistical inference, and stochastic processes.

The characteristic function φ(t) of a random variable X is defined as:

φ(t) = E[e^{itX}],

where i is the imaginary unit (i² = −1) and t is a real number.

5.7.5 Key Properties of Characteristic Functions



Let X be a random variable with characteristic function ϕ(t).

1. Existence: The characteristic function always exists and is well-defined


for all real t.
2. Normalization:
ϕ(0) = E[ei·0·X ] = E[1] = 1.

3. Uniqueness: The characteristic function uniquely determines the dis-


tribution of X. If two random variables have the same characteristic
function, they have the same distribution.
4. Addition of Independent Random Variables: If X and Y are inde-
pendent, then:
ϕ_{X+Y}(t) = ϕ_X(t) · ϕ_Y(t).


5. Moment Generating Function Relationship: The moment generat-


ing function M (t) is related to the characteristic function by:

M (t) = E[etX ] = ϕ(−it).

6. Inverse Relationship: The probability density function f(x) can be recovered from the characteristic function using the inverse Fourier transform:

f(x) = (1/2π) ∫_{−∞}^{∞} ϕ(t) e^{−itx} dt.

7. Derivatives and Moments: The n-th moment of X, if it exists, is given
by:
E[X^n] = i^{−n} · d^n ϕ(t)/dt^n |_{t=0}.

The Characteristic Function (CF) and the Moment Generating Function


(MGF) are both tools used in probability theory to describe the distribution of random variables. While they share some similarities, they also have key differences. A detailed comparison is presented in Table 5.6.

Table 5.6: Comparison between MGF and CF

Feature        MGF M(t) = E[e^{tX}]                 CF φ(t) = E[e^{itX}]
Existence      Not guaranteed (finite only if       Always exists (since |e^{itX}| = 1)
               E[e^{tX}] exists for all t)
Range of t     Real t                               Real t, but involves imaginary unit i
Moments        Differentiation gives moments        Differentiation gives moments with
               directly                             i adjustment
Uniqueness     Uniquely determines distribution     Uniquely determines distribution
               if it exists
Fourier        No direct connection                 Essentially the Fourier transform
Transform                                           of the pdf
Application    Used for finding moments and         Used in distributional analysis,
               cumulants, proving Central Limit     sums of random variables, proving
               Theorem                              Central Limit Theorem

Problem 5.13 (Discrete Random Variable). Consider a discrete random variable X with pmf

• P(X = 0) = p

• P(X = 1) = 1 − p

where 0 ≤ p ≤ 1. Find the characteristic function of X, and hence find the mean and variance.

Solution

1. Characteristic Function
The characteristic function φ(t) of a discrete random variable X is defined as:

φ(t) = E[e^{itX}]

where E denotes the expectation and i is the imaginary unit.

For the given pmf:

φ(t) = E[e^{itX}] = ∑_x e^{itx} P(X = x)

Substituting the values for X:

φ(t) = e^{it·0} · P(X = 0) + e^{it·1} · P(X = 1)
     = e^0 · p + e^{it} · (1 − p)
     = p + (1 − p)e^{it}

2. Mean of X
The mean E[X] can be derived from the characteristic function as follows:

E[X] = −i · dφ(t)/dt |_{t=0}

First, compute the derivative of φ(t):

dφ(t)/dt = d/dt [p + (1 − p)e^{it}] = (1 − p) · ie^{it}

Evaluate at t = 0:

dφ(t)/dt |_{t=0} = (1 − p) · ie^{i·0} = (1 − p) · i

E[X] = −i · (1 − p) · i = 1 − p


3. Variance of X
To find the variance, we first need the second moment E[X²], which can be derived from the characteristic function as follows:

E[X²] = − d²φ(t)/dt² |_{t=0}

Compute the second derivative of φ(t):

d²φ(t)/dt² = d/dt [(1 − p) · ie^{it}] = (1 − p) · i²e^{it} = −(1 − p)e^{it}

Evaluate at t = 0:

d²φ(t)/dt² |_{t=0} = −(1 − p)e^{i·0} = −(1 − p)

E[X²] = −(−(1 − p)) = 1 − p

The variance Var(X) is given by:

Var(X) = E[X²] − (E[X])²
       = (1 − p) − (1 − p)²
       = (1 − p)(1 − (1 − p))
       = p(1 − p)

Hence,
• Characteristic function: φ(t) = p + (1 − p)eit

• Mean: E[X] = 1 − p

• Variance: Var(X) = p(1 − p)


Problem 5.14 (Continuous Random Variable). Consider the Problem 5.11, where the pdf of X is given by:

f(x) = 1 if 0 ≤ x ≤ 1, and 0 otherwise.

Find the characteristic function of X, and hence find the mean and variance.


Solution
Let’s find the characteristic function of a uniform random variable X on the
interval [0, 1].
The characteristic function is:

φ(t) = E[e^{itX}] = ∫_{−∞}^{∞} e^{itx} f(x) dx,

where f(x) is the pdf of X. For the uniform distribution on [0, 1], the pdf is f(x) = 1 for 0 ≤ x ≤ 1 and 0 otherwise.

Thus, the characteristic function becomes:

φ(t) = ∫₀¹ e^{itx} dx.

The integral is straightforward to evaluate:

φ(t) = [e^{itx}/(it)]₀¹ = (e^{it} − 1)/(it).
This can also be written as:

φ(t) = sin(t)/t + i(1 − cos(t))/t,  for t ≠ 0.

For t = 0, φ(0) = 1, which is consistent since the characteristic function always equals 1 at t = 0.

Mean
The mean E[X] is:

E[X] = −i · dφ(t)/dt |_{t=0}

Rather than differentiating the quotient directly, it is easier to expand φ(t) as a power series. Since e^{it} = ∑_{n=0}^{∞} (it)^n/n!, we have

φ(t) = (e^{it} − 1)/(it) = 1 + (it)/2! + (it)²/3! + (it)³/4! + · · ·

Differentiating term by term and evaluating at t = 0:

dφ(t)/dt |_{t=0} = i/2

so that

E[X] = −i · (i/2) = 1/2

Variance
To find the variance, we first need the second moment E[X²]:

E[X²] = − d²φ(t)/dt² |_{t=0}

From the series expansion,

d²φ(t)/dt² |_{t=0} = 2i²/3! = −1/3

E[X²] = −(−1/3) = 1/3

The variance is:

Var(X) = E[X²] − (E[X])² = 1/3 − (1/2)² = 1/12

Hence,

• Characteristic function: φ(t) = (e^{it} − 1)/(it)

• Mean: E[X] = 1/2

• Variance: Var(X) = 1/12

Note that these agree with the mean and variance obtained from the moment generating function in Problem 5.11.


5.7.6 Exercises
1. Consider a discrete random variable Y with the following probability mass function (pmf):

P(Y = k) = 1/2 for k = 1, and P(Y = k) = 1/2 for k = 2.

(a) Find the moment generating function MY (t) of Y .


(b) Using the moment generating function, find the mean and variance
of Y .

2. Let Z be a continuous random variable with the probability density function (pdf):

f_Z(z) = 2z/θ² if 0 ≤ z ≤ θ, and 0 otherwise,

where θ > 0 is a parameter.
(a) Find the moment generating function M_Z(t) of Z.
(b) Find the characteristic function φZ (t) of Z.
(c) Using the moment generating function, determine the mean and vari-
ance of Z.

3. Let X be a binomial random variable with parameters n and p, where n


is the number of trials and p is the probability of success in each trial.
The probability mass function of X is:
 
P(X = k) = C(n, k) p^k (1 − p)^{n−k},  k = 0, 1, . . . , n,

where C(n, k) is the binomial coefficient.

(a) Find the moment generating function M (t) of X.


(b) Using the moment generating function, find the mean and variance
of X.

4. Consider a discrete random variable X that takes non-negative integer


values with the following probability mass function:

P(X = k) = 3^{−k} / ∑_{i=0}^{∞} 3^{−i}  for k = 0, 1, 2, . . .

(a) Find the probability generating function (PGF) G(s) of the random
variable X.
(b) Use the PGF to determine E[X] and Var(X).
(c) Verify your results using the properties of the PGF.


5. Consider a continuous random variable W uniformly distributed over the


interval [0, a]. The probability density function is:
f_W(w) = 1/a if 0 ≤ w ≤ a, and 0 otherwise.

(a) Find the characteristic function φW (t) of W .


(b) Find the moment generating function MW (t) of W .
(c) Use the moment generating function to find the mean and variance
of W .

6. Let V be an exponential random variable with rate parameter λ. The probability density function is:

f_V(v) = λe^{−λv} if v ≥ 0, and 0 otherwise.
(a) Find the moment generating function M_V(t) of V.
(b) Find the characteristic function φV (t) of V .
(c) Using the moment generating function, determine the mean and vari-
ance of V .

5.8 Jointly Distributed Random Variables


Jointly distributed random variables are crucial in data science as they allow for
the modeling and analysis of relationships between multiple variables simulta-
neously. Understanding these relationships is essential for predicting outcomes,

identifying correlations, and constructing probabilistic models that capture real-


world complexities. Joint distributions enable informed decisions based on the
combined behavior of multiple variables, which is vital for developing accurate
and robust predictive models.

In probability theory, two or more random variables are jointly distributed if


there is a joint probability distribution describing their behavior. For two ran-
dom variables X and Y , the joint probability distribution provides the probabil-
ity that X takes a specific value x and Y takes a specific value y simultaneously.

5.8.1 Joint Probability Mass Function (pmf )


For discrete random variables X and Y , the joint probability mass function
pX,Y (x, y) is defined as:

pX,Y (x, y) = P (X = x, Y = y).


The joint probability mass function must satisfy the condition:

∑_x ∑_y p_{X,Y}(x, y) = 1.

The joint cumulative distribution function is defined as:

F(x, y) = P(X ≤ x, Y ≤ y).

For discrete random variables:

F(x, y) = ∑_{X≤x} ∑_{Y≤y} p_{X,Y}(x, y).

If p_{ij} = P(X = i, Y = j) for i = 1, 2, . . . , m and j = 1, 2, . . . , n, then

F(x, y) = P(X ≤ x, Y ≤ y) = ∑_{i=1}^{x} ∑_{j=1}^{y} p_{ij}
AF
5.8.2 Example: Computer Maintenance
A company managing maintenance services for computer servers is interested
in optimizing the scheduling of its technicians. Specifically, the company needs
to understand how long a technician spends on-site, which primarily depends
on the number of servers requiring maintenance.

Let the random variable X denote the maintenance time in hours at a lo-
cation, taking values 1, 2, 3, and 4. Let the random variable Y represent the
number of servers at the location, taking values 1, 2, and 3. These two random
variables are considered jointly distributed.
DR

The joint probability mass function pij for these variables is given in the
table below:

                     Number of Servers (Y)
                      1      2      3
Maintenance    1     0.12   0.08   0.01
Time (X)       2     0.08   0.15   0.01
               3     0.07   0.21   0.02
               4     0.05   0.13   0.07

For example, the table shows that there is a 0.12 probability that X = 1
and Y = 1, meaning a randomly selected location has one server that takes one
hour to maintain. Similarly, the probability is 0.07 that a location with three servers requires four hours of maintenance. This is a valid probability mass function, as

∑_x ∑_y p_{X,Y}(x, y) = ∑_i ∑_j p_{ij} = 0.12 + 0.08 + · · · + 0.07 = 1.00

The joint cumulative distribution function is defined as:

F(x, y) = P(X ≤ x, Y ≤ y) = ∑_{i=1}^{x} ∑_{j=1}^{y} p_{ij}

For instance, the probability that a location has no more than two servers and that the maintenance time does not exceed two hours is:

F(2, 2) = p₁₁ + p₁₂ + p₂₁ + p₂₂ = 0.12 + 0.08 + 0.08 + 0.15 = 0.43
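Storing the joint pmf as an array makes such probabilities one-line computations; a NumPy sketch with rows indexed by X = 1, ..., 4 and columns by Y = 1, 2, 3:

    import numpy as np

    p = np.array([[0.12, 0.08, 0.01],
                  [0.08, 0.15, 0.01],
                  [0.07, 0.21, 0.02],
                  [0.05, 0.13, 0.07]])

    print(round(p.sum(), 2))          # 1.0, a valid joint pmf
    print(round(p[:2, :2].sum(), 2))  # 0.43 = F(2, 2)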

5.8.3 Joint Probability Density Function (pdf )


AF
For continuous random variables X and Y , the joint probability density function
fX,Y (x, y) is defined as:

∂2
fX,Y (x, y) = P (X ≤ x, Y ≤ y).
∂x∂y
The joint probability density function must satisfy the condition:
ZZ
f (x, y) dx dy = 1.
state space

The probability that a ≤ X ≤ b and c ≤ Y ≤ d is obtained from the joint probability density function as:

P(a ≤ X ≤ b, c ≤ Y ≤ d) = ∫_{x=a}^{b} ∫_{y=c}^{d} f(x, y) dy dx

For continuous random variables:

F(x, y) = ∫_{w=−∞}^{x} ∫_{z=−∞}^{y} f(w, z) dz dw

5.8.4 Example: Mineral Deposits


To evaluate the economic feasibility of mining in a specific region, a mining
company collects ore samples from the site and measures their zinc and iron
content. Let the random variable X represent the zinc content, ranging from
0.5 to 1.5, and the random variable Y represent the iron content, ranging from
20.0 to 35.0. Suppose the joint probability density function of X and Y is given
by


f(x, y) = 39/400 − 17(x − 1)²/50 − (y − 25)²/10,000
for 0.5 ≤ x ≤ 1.5 and 20.0 ≤ y ≤ 35.0.

To verify the validity of this joint probability density function, we need to


ensure that f (x, y) ≥ 0 within the defined state space and that
∫_{0.5}^{1.5} ∫_{20.0}^{35.0} f(x, y) dy dx = 1.
This joint probability density function provides comprehensive information

about the joint probabilistic behavior of the random variables X and Y . For
instance, the probability that a randomly selected ore sample has a zinc content
between 0.8 and 1.0 and an iron content between 25 and 30 is given by
∫_{0.8}^{1.0} ∫_{25.0}^{30.0} f(x, y) dy dx,

which evaluates to 0.092. Thus, only about 9% of the ore at this location
has mineral levels within these specified ranges.

5.8.5 Marginal Distributions


The marginal distributions of X and Y can be obtained from the joint distri-
bution.
For discrete variables:

p_X(x) = ∑_y p_{X,Y}(x, y)

p_Y(y) = ∑_x p_{X,Y}(x, y)

For continuous variables:

f_X(x) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dy

f_Y(y) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dx

Using these marginal distributions, we can easily find the mean of X and Y .


5.8.6 Example: Computer Maintenance


To find the marginal distributions of X and Y for the example Computer
Maintenance, as discussed in Section 5.8.2, we need to sum the probabilities
across rows for X and across columns for Y .

                     Number of Servers (Y)
                      1      2      3      P_X(x)
Maintenance    1     0.12   0.08   0.01   0.21
Time (X)       2     0.08   0.15   0.01   0.24
               3     0.07   0.21   0.02   0.30
               4     0.05   0.13   0.07   0.25
P_Y(y)               0.32   0.57   0.11   1.00

Table 5.7: Joint probability mass function for server maintenance with marginal distributions.

Mean of X

µ_X = ∑_x x · P_X(x) = 1 · 0.21 + 2 · 0.24 + 3 · 0.30 + 4 · 0.25 = 2.59

Expected Value of X²

E(X²) = ∑_x x² · P_X(x) = 1² · 0.21 + 2² · 0.24 + 3² · 0.30 + 4² · 0.25 = 7.87

Variance of X

Var(X) = E(X²) − (µ_X)² = 7.87 − (2.59)² = 1.1619

Standard Deviation of X

σ_X = √Var(X) = √1.1619 ≈ 1.0779


Mean of Y

µ_Y = ∑_y y · P_Y(y) = 1 · 0.32 + 2 · 0.57 + 3 · 0.11 = 1.79

Expected Value of Y²

E(Y²) = ∑_y y² · P_Y(y) = 1² · 0.32 + 2² · 0.57 + 3² · 0.11 = 3.59

Variance of Y

Var(Y) = E(Y²) − (µ_Y)² = 3.59 − (1.79)² = 0.3859

Standard Deviation of Y

σ_Y = √Var(Y) = √0.3859 ≈ 0.6212
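All of these marginal summaries follow from row and column sums of the joint pmf; a NumPy sketch using the layout of Table 5.7:

    import numpy as np

    p = np.array([[0.12, 0.08, 0.01],
                  [0.08, 0.15, 0.01],
                  [0.07, 0.21, 0.02],
                  [0.05, 0.13, 0.07]])
    x = np.array([1, 2, 3, 4])   # maintenance time
    y = np.array([1, 2, 3])      # number of servers

    px, py = p.sum(axis=1), p.sum(axis=0)    # marginal pmfs
    mx, my = np.sum(x * px), np.sum(y * py)  # 2.59 and 1.79
    vx = np.sum(x**2 * px) - mx**2           # 1.1619
    vy = np.sum(y**2 * py) - my**2           # 0.3859
    print(mx, my, round(vx, 4), round(vy, 4))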
5.8.7 Example: Mineral Deposits


We consider the Mineral Deposits example, as explained in Section 5.8.4. The marginal probability density function of X, representing the zinc content of the ore, is given by:

f_X(x) = ∫_{20.0}^{35.0} f(x, y) dy

       = ∫_{20.0}^{35.0} (39/400 − 17(x − 1)²/50 − (y − 25)²/10,000) dy

       = [39y/400 − 17y(x − 1)²/50 − (y − 25)³/30,000]_{20.0}^{35.0}

       = 57/40 − 51(x − 1)²/10  for 0.5 ≤ x ≤ 1.5.
So, the expected zinc content E(X) is:

E(X) = ∫_{0.5}^{1.5} x f_X(x) dx

     = ∫_{0.5}^{1.5} x (57/40 − 51(x − 1)²/10) dx

     = (57/40) ∫_{0.5}^{1.5} x dx − (51/10) ∫_{0.5}^{1.5} x(x − 1)² dx

     = 1.


Similarly, we can find E(X²), which is

E(X²) = ∫_{0.5}^{1.5} x² f_X(x) dx = 1.055

Therefore, the variance V(X) is

V(X) = E(X²) − (E(X))² = 1.055 − (1.00)² = 0.055

and the standard deviation is

σ_X = √Var(X) = √0.055 ≈ 0.2345.
The probability that a sample of ore has a zinc content between 0.8 and 1.0 can be determined using the marginal probability density function. This probability is given by:

P(0.8 ≤ X ≤ 1.0) = ∫_{0.8}^{1.0} f_X(x) dx

                 = ∫_{0.8}^{1.0} (57/40 − 51(x − 1)²/10) dx

                 = [57x/40 − 17(x − 1)³/10]_{0.8}^{1.0}

                 = 1.425 − 1.1536

                 = 0.2714
Therefore, approximately 27% of the ore has a zinc content within these
limits.
The marginal probability density function of Y, the iron content of the ore, is given by:

f_Y(y) = ∫_{0.5}^{1.5} f(x, y) dx

       = ∫_{0.5}^{1.5} (39/400 − 17(x − 1)²/50 − (y − 25)²/10,000) dx

       = [39x/400 − 17(x − 1)³/150 − x(y − 25)²/10,000]_{0.5}^{1.5}

       = 83/1200 − (y − 25)²/10,000  for 20.0 ≤ y ≤ 35.0.

The expected iron content and the standard deviation of the iron content are E(Y) = 27.36 and σ_Y = 4.27, respectively.


5.8.8 Conditional Distributions


Conditional distribution refers to the probability distribution of a random vari-
able given the occurrence of another event or condition. It provides insights
into how one variable behaves when another variable has a specific value or
falls within a certain range. This concept is crucial in various fields such as eco-
nomics, biology, and machine learning, where relationships between variables
are studied under specific conditions or contexts.

Conditional Distributions: The conditional distribution of Y given


X = x for discrete variables:

p_{Y|X}(y|x) = p_{X,Y}(x, y) / p_X(x)

For continuous variables:

f_{Y|X}(y|x) = f_{X,Y}(x, y) / f_X(x)

Using these conditional distributions, we can easily find the mean of X given
Y and Y given X. Conditional distributions are often used to make predictions,
assess risks, and uncover underlying patterns in data that may not be apparent
from marginal distributions alone.

5.8.9 Example: Computer Maintenance


To find the conditional mean and variance of X given Y and Y given X, we use
the definitions of conditional expectations and variances. Below, we derive these
values based on the joint probability mass function provided in the Example
DR

5.8.2.

Conditional Mean and Variance of X given Y

Table 5.8: The conditional distribution of X given Y = 1

x               1      2      3      4
p_{X|Y}(x|1)    0.375  0.250  0.219  0.156

For Y = 1:

E(X|Y = 1) = 1 · 0.375 + 2 · 0.250 + 3 · 0.219 + 4 · 0.156 = 2.1563

E(X²|Y = 1) = 1² · 0.375 + 2² · 0.250 + 3² · 0.219 + 4² · 0.156 = 5.8438


Var(X|Y = 1) = 5.8438 − (2.1563)² ≈ 1.1943

Hence, the standard deviation of X given Y = 1 is √1.1943 ≈ 1.0928.

Similarly, we can easily find the conditional distribution with its mean,
variance, and standard deviation of X given Y = 2 and Y = 3. We can also
find the conditional distribution with its mean, variance, and standard deviation
of Y given different values of X.

5.8.10 Example: Mineral Deposits


Given a sample of ore with a zinc content of X = 0.55, what can be inferred about its iron content? The information regarding the iron content Y is encapsulated in the conditional probability density function, which is expressed as:

f_{Y|X=0.55}(y) = f(0.55, y) / f_X(0.55)

Here, the denominator represents the marginal distribution of the zinc content X evaluated at 0.55. Evaluating f_X(0.55):

f_X(0.55) = 57/40 − 51(0.55 − 1.00)²/10 = 0.39225

Thus, the conditional probability density function becomes:

f_{Y|X=0.55}(y) = (39/400 − 17(0.55 − 1.00)²/50 − (y − 25)²/10,000) / 0.39225

Simplifying, we get:

f_{Y|X=0.55}(y) = 0.073 − (y − 25)²/3922.5

for 20.0 ≤ y ≤ 35.0. One can easily find the conditional expectation of the iron content, which is calculated to be 27.14, and the conditional standard deviation, which is 4.14.

5.8.11 Independence and Covariance


Just as two events A and B are considered independent if they are “unrelated”
to each other, two random variables X and Y are deemed independent if the
value taken by one random variable is “unrelated” to the value taken by the
other. Specifically, in the context of data science, random variables are inde-
pendent if the distribution of one random variable does not depend on the value
taken by the other random variable.


Independent Random Variables


• For discrete random variables, independence means that the joint
probability mass function (pmf) can be expressed as the product
of their individual pmf’s:

pX,Y (x, y) = pX (x) · pY (y).

• For continuous random variables, independence means that the


joint probability density function (pdf) can be expressed as the
product of their individual pdf’s:

fX,Y (x, y) = fX (x) · fY (y).

Example:
• Let X be the result of rolling a fair six-sided die, and Y be the result
of flipping a fair coin, where X can take values 1 through 6, and Y can
take values 0 (tails) and 1 (heads). The events are independent, so:

P(X = 3 and Y = 1) = P(X = 3) · P(Y = 1) = (1/6) · (1/2) = 1/12

Problem 5.15. It is known that the ratio of gallium to arsenide does not affect
the functioning of gallium-arsenide wafers, which are the main components of
microchips. Let X denote the ratio of gallium to arsenide and Y denote the
functional wafers retrieved during a 1-hour period. X and Y are independent
random variables with the joint density function
f(x, y) = x(1 + 3y²)/4 for 0 < x < 2, 0 < y < 1, and 0 elsewhere.
Show that X and Y are independent random variables.

Solution
To show that X and Y are independent random variables, we need to ver-
ify that the joint density function f (x, y) can be factored into the product of
the marginal density functions fX (x) and fY (y). Specifically, X and Y are
independent if and only if the joint density function f (x, y) can be written as:

f (x, y) = fX (x) · fY (y).

Marginal density function fX (x):


To find fX (x), integrate the joint density function f (x, y) over the possible
values of y:


f_X(x) = ∫₀¹ f(x, y) dy.

Given the joint density function f(x, y) = x(1 + 3y²)/4 for 0 < x < 2 and 0 < y < 1, compute:

f_X(x) = ∫₀¹ x(1 + 3y²)/4 dy

       = (x/4) ∫₀¹ (1 + 3y²) dy

       = (x/4) (∫₀¹ 1 dy + ∫₀¹ 3y² dy)

       = (x/4)(1 + 1)

       = x/2.

So, the marginal density function for X is:

f_X(x) = x/2,  0 < x < 2.

Marginal density function fY (y):


To find fY (y), integrate the joint density function f (x, y) over the possible
values of x:

f_Y(y) = ∫₀² f(x, y) dx

       = ∫₀² x(1 + 3y²)/4 dx

       = ((1 + 3y²)/4) ∫₀² x dx

       = ((1 + 3y²)/4) · 2 = (1 + 3y²)/2.

Therefore, the marginal density function for Y is:

f_Y(y) = (1 + 3y²)/2,  0 < y < 1.

Verify independence
Check if f (x, y) can be written as fX (x) · fY (y):


f_X(x) · f_Y(y) = (x/2) · ((1 + 3y²)/2) = x(1 + 3y²)/4.
This matches the given joint density function f (x, y).
Since f (x, y) = fX (x) · fY (y), the random variables X and Y are indepen-
dent.

5.8.12 Covariance and Correlation


Covariance

Covariance is essential for understanding and quantifying relationships between
variables, which is a cornerstone of many data science techniques and analyses.
It measures the joint variability of two random variables.

Covariance: For two random variables X and Y , the covariance is defined


as:
Cov(X, Y ) = E[(X − E[X])(Y − E[Y ])]
or equivalently
Cov(X, Y ) = E[XY ] − E(X)E(Y ).
If X and Y are independent, then Cov(X, Y ) = 0. However, a covariance of
zero does not necessarily imply independence.

Correlation
Correlation is a normalized form of covariance that measures the strength and
direction of the linear relationship between two random variables. This normal-
ization makes correlation a more interpretable metric, useful for understanding
the strength and direction of the relationship between variables. It’s widely
used in statistical analysis, machine learning, and data visualization to reveal
and quantify relationships that might otherwise be obscured by differences in
scale or units.

Correlation: For two random variables X and Y, the correlation is defined as:

Corr(X, Y) = Cov(X, Y) / (σ_X σ_Y)

where σ_X and σ_Y are the standard deviations of X and Y, respectively.

The correlation coefficient ρX,Y ranges from −1 to +1. A value of +1


implies a perfect positive linear relationship, −1 implies a perfect negative linear
relationship, and 0 implies no linear relationship.


Problem 5.16. Consider the Computer Maintenance example, where the random variable X denotes the maintenance time in hours at a location, taking values 1, 2, 3, and 4, and the random variable Y represents the number of servers at the location, taking values 1, 2, and 3. The joint probability mass function p_{ij} for these variables is given in the table below:

                     Number of Servers (Y)
                      1      2      3
Maintenance    1     0.12   0.08   0.01
Time (X)       2     0.08   0.15   0.01
               3     0.07   0.21   0.02
               4     0.05   0.13   0.07

(a) Find the covariance of X and Y.

(b) Find the correlation of X and Y.

Solution

(a) Covariance of X and Y:

The covariance of X and Y is defined as

Cov(X, Y) = E[XY] − E(X)E(Y).

We already have E(X) = µ_X = 2.59 and E(Y) = µ_Y = 1.79. Now we need to compute E(XY). Therefore, we need to sum the products of x, y, and their corresponding joint probabilities:

E(XY) = ∑_x ∑_y x · y · P(X = x, Y = y)

Calculating each term:

E(XY) = 1 · 1 · 0.12 + 1 · 2 · 0.08 + 1 · 3 · 0.01
      + 2 · 1 · 0.08 + 2 · 2 · 0.15 + 2 · 3 · 0.01
      + 3 · 1 · 0.07 + 3 · 2 · 0.21 + 3 · 3 · 0.02
      + 4 · 1 · 0.05 + 4 · 2 · 0.13 + 4 · 3 · 0.07
      = 4.86

Therefore, the expected value E(XY) is 4.86.

Cov(X, Y) = E[XY] − E(X)E(Y) = 4.86 − (2.59 × 1.79) = 0.2239.


The positive value of 0.2239 indicates that there is a positive relationship between X and Y. As X increases, Y tends to increase, and vice versa. The magnitude of the covariance gives an idea of the strength of the relationship. However, because covariance is not standardized, it is difficult to assess the strength of the relationship without additional context such as the variances of X and Y.

(b) Correlation of X and Y:

To get a more standardized measure of the relationship between X and Y, we can compute the correlation coefficient:

ρ_{X,Y} = Cov(X, Y) / (σ_X σ_Y)

Given the standard deviations σ_X = 1.0779 and σ_Y = 0.6212, we can find the correlation coefficient ρ_{X,Y} using the covariance Cov(X, Y) = 0.2239. Substitute the given values:

ρ_{X,Y} = 0.2239 / (1.0779 × 0.6212) ≈ 0.3344

The correlation of 0.3344 suggests that there is a tendency for more servers to need maintenance as the maintenance time increases.
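The whole covariance/correlation calculation can be written compactly in Python; a NumPy sketch over the same joint pmf:

    import numpy as np

    p = np.array([[0.12, 0.08, 0.01],
                  [0.08, 0.15, 0.01],
                  [0.07, 0.21, 0.02],
                  [0.05, 0.13, 0.07]])
    x = np.array([1, 2, 3, 4])
    y = np.array([1, 2, 3])

    ex = np.sum(x * p.sum(axis=1))
    ey = np.sum(y * p.sum(axis=0))
    exy = np.sum(np.outer(x, y) * p)   # E[XY] over the joint pmf
    cov = exy - ex * ey                # 0.2239
    sx = np.sqrt(np.sum(x**2 * p.sum(axis=1)) - ex**2)
    sy = np.sqrt(np.sum(y**2 * p.sum(axis=0)) - ey**2)
    print(round(cov, 4), round(cov / (sx * sy), 4))  # 0.2239 0.3344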

Problem 5.17. Consider two continuous random variables X and Y with the
following joint probability density function:
f_{X,Y}(x, y) = 4xy if 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1, and 0 otherwise.

(a) Are X and Y independent?

(b) Find the covariance Cov(X, Y ).


(c) Find the correlation Corr(X, Y ).

Solution
(a). Are X and Y independent?
To check if X and Y are independent, we need to verify if the joint PDF
factorizes into the product of the marginal PDFs of X and Y . That is, we need
to check if:

fX,Y (x, y) = fX (x)fY (y).


Marginal PDF of X:

The marginal PDF of X is obtained by integrating the joint PDF over all possible values of y:

f_X(x) = ∫₀¹ f_{X,Y}(x, y) dy.

For 0 ≤ x ≤ 1, we compute:

f_X(x) = ∫₀¹ 4xy dy = 4x ∫₀¹ y dy = 4x [y²/2]₀¹ = 4x · (1/2) = 2x.

Thus, the marginal PDF of X is:

f_X(x) = 2x if 0 ≤ x ≤ 1, and 0 otherwise.

Marginal PDF of Y:

Similarly, the marginal PDF of Y is obtained by integrating the joint PDF over all possible values of x:

f_Y(y) = ∫₀¹ f_{X,Y}(x, y) dx.

For 0 ≤ y ≤ 1, we compute:

f_Y(y) = ∫₀¹ 4xy dx = 4y ∫₀¹ x dx = 4y [x²/2]₀¹ = 4y · (1/2) = 2y.

Thus, the marginal PDF of Y is:

f_Y(y) = 2y if 0 ≤ y ≤ 1, and 0 otherwise.

Check for Independence:


Now we check if the joint PDF factorizes as the product of the marginal PDFs.
We compute:

fX (x)fY (y) = (2x)(2y) = 4xy.


Since fX,Y (x, y) = 4xy (which matches fX (x)fY (y) for 0 ≤ x ≤ 1 and
0 ≤ y ≤ 1), we conclude that X and Y are independent.


(b). Find the Covariance Cov(X, Y ):


The covariance is defined as:

Cov(X, Y ) = E[XY ] − E[X]E[Y ].

Compute E[X] and E[Y]:

First, we calculate the expected values E[X] and E[Y].

For X:

E[X] = ∫₀¹ x f_X(x) dx = ∫₀¹ x(2x) dx = 2 ∫₀¹ x² dx = 2 [x³/3]₀¹ = 2/3.

For Y:

E[Y] = ∫₀¹ y f_Y(y) dy = ∫₀¹ y(2y) dy = 2 ∫₀¹ y² dy = 2 [y³/3]₀¹ = 2/3.
Compute E[XY]:

Now we compute E[XY] using the joint PDF:

E[XY] = ∫₀¹ ∫₀¹ xy f_{X,Y}(x, y) dx dy

      = ∫₀¹ ∫₀¹ xy(4xy) dx dy

      = 4 ∫₀¹ ∫₀¹ x²y² dx dy

      = 4 (∫₀¹ x² dx) (∫₀¹ y² dy)

      = 4 · (1/3) · (1/3)

      = 4/9.

Compute the Covariance:

Now, we compute the covariance:

Cov(X, Y) = E[XY] − E[X]E[Y] = 4/9 − (2/3) · (2/3) = 4/9 − 4/9 = 0.

Thus, the covariance is:

Cov(X, Y) = 0.


(c). Find the Correlation Corr(X, Y ):


The correlation is given by:

Corr(X, Y) = Cov(X, Y) / (σ_X σ_Y).
Since Cov(X, Y ) = 0, the correlation is:

Corr(X, Y ) = 0.
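The factorization argument and the covariance integral can both be verified symbolically; a SymPy sketch for this joint density:

    import sympy as sp

    x, y = sp.symbols('x y', nonnegative=True)
    f = 4 * x * y                        # joint pdf on the unit square

    fx = sp.integrate(f, (y, 0, 1))      # marginal of X -> 2x
    fy = sp.integrate(f, (x, 0, 1))      # marginal of Y -> 2y
    print(sp.simplify(f - fx * fy))      # 0, so the joint pdf factorizes

    exy = sp.integrate(x * y * f, (x, 0, 1), (y, 0, 1))  # 4/9
    ex = sp.integrate(x * fx, (x, 0, 1))                 # 2/3
    ey = sp.integrate(y * fy, (y, 0, 1))                 # 2/3
    print(exy - ex * ey)                                 # 0 = Cov(X, Y)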

5.8.13 Linear Functions of a Random Variable
We will now explore some properties that will simplify calculating the means
and variances of random variables discussed in later chapters. These properties
allow us to express expectations using other parameters that are either known
or easily computed. The results presented are applicable to both discrete and
AF
continuous random variables, though proofs are provided only for the continuous
case. We start with a theorem and two corollaries that should be intuitively
understandable to the reader.
Theorem 5.1. If a and b are constants, then

E(aX + b) = aE(X) + b

and the variance is


Var(aX + b) = a2 Var(X).

Proof. By the definition of expected value,

E(aX + b) = ∫_{−∞}^{∞} (ax + b) f(x) dx.

This can be rewritten as

E(aX + b) = a ∫_{−∞}^{∞} x f(x) dx + b ∫_{−∞}^{∞} f(x) dx.

The first integral on the right is E(X) and the second integral equals 1. Therefore, we have

E(aX + b) = aE(X) + b.


Compute Var(aX + b):

Var(aX + b) = E[(aX + b − E(aX + b))²]
            = E[(aX + b − (aE(X) + b))²]
            = E[(aX − aE(X))²]
            = E[a²(X − E(X))²]
            = a² E[(X − E(X))²]
            = a² Var(X).

Thus, we have shown that:

Var(aX + b) = a² Var(X)

Problem 5.18. Applying Theorem 5.1 to the continuous random variable

Y = 1.1X − 4.5,

rework Example 5.5.2.

For Examples 5.5.2 and 5.6.1, it is obtained that E(X) = 50 and V(X) = 0.05. We may use Theorem 5.1 to write

E[Y] = 1.1E[X] − 4.5 = 1.1 × 50 − 4.5 = 50.5

and

Var(Y) = (1.1)² Var(X) = (1.1)² × 0.05 = 0.0605

Problem 5.19. Suppose that a temperature has a mean of 110°F and a standard deviation of 2.2°F. The conversion formula from Fahrenheit to Centigrade is given by:

F = 9C/5 + 32

where F is the temperature in Fahrenheit and C is the temperature in Centigrade. What are the mean and the standard deviation in degrees Centigrade?

Solution
To find the mean temperature in Centigrade, we use:

C_mean = (5/9)(F_mean − 32)

Substitute F_mean = 110:

C_mean = (5/9)(110 − 32) = (5/9) × 78 ≈ 43.3°C

To find the standard deviation in Centigrade, we use:

σ_C = (5/9) σ_F

Substitute σ_F = 2.2:

σ_C = (5/9) × 2.2 = 11/9 ≈ 1.22°C

Thus, the mean temperature is approximately 43.3°C and the standard deviation is approximately 1.22°C.
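Theorem 5.1 reduces this to two lines of arithmetic, as a quick Python sketch shows:

    f_mean, f_sd = 110, 2.2
    c_mean = 5 / 9 * (f_mean - 32)   # the shift affects only the mean
    c_sd = 5 / 9 * f_sd              # only the scale factor affects the sd
    print(round(c_mean, 1), round(c_sd, 2))  # 43.3 1.22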
Theorem 5.2. The expected value of the sum or difference of two or more
functions of a random variable X is the sum or difference of the expected values
of the functions. That is,
AF E[g(X) ± h(X)] = E[g(X)] ± E[h(X)].

Problem 5.20. Let X be a random variable with probability distribution as


follows:
x       0     1     2     3
f(x)    1/3   1/2   0     1/6

Find the expected value of Y = (X − 1)² and the variance of Y.

Solution
Applying Theorem 5.2 to the function Y = (X − 1)², we can write

E[(X − 1)²] = E(X² − 2X + 1) = E(X²) − 2E(X) + E(1).

From Theorem 5.1, E(1) = 1, and by direct computation,

E(X) = (0)(1/3) + (1)(1/2) + (2)(0) + (3)(1/6) = 1,

and

E(X²) = (0)(1/3) + (1)(1/2) + (4)(0) + (9)(1/6) = 2.

Hence,

E[(X − 1)²] = 2 − (2)(1) + 1 = 1.

Now, to calculate the variance of Y, we need E[Y²]. Since Y = (X − 1)²,

Y² = (X − 1)⁴.


We need to compute E[(X − 1)⁴]:

E[(X − 1)⁴] = ∑_x (x − 1)⁴ f(x)

            = (0 − 1)⁴ · (1/3) + (1 − 1)⁴ · (1/2) + (2 − 1)⁴ · 0 + (3 − 1)⁴ · (1/6)

            = 1/3 + 0 + 0 + 8/3

            = 3.

Therefore,

Var(Y) = E[Y²] − (E[Y])² = 3 − 1² = 2.

Problem 5.21. The weekly demand for a particular drink, measured in thousands of liters, at a chain of convenience stores is a continuous random variable g(X) = X² + X − 2, where X has the following density function:

f(x) = 2(x − 1) for 1 < x < 2, and 0 elsewhere.

Find the expected value of the weekly demand.

Solution

To find the expected value of the weekly demand for the drink, we use Theorem 5.2:

E(X² + X − 2) = E(X²) + E(X) − E(2).

From Theorem 5.1, E(2) = 2. By direct integration, we find:

E(X) = ∫₁² 2x(x − 1) dx = 5/3,

and

E(X²) = ∫₁² 2x²(x − 1) dx = 17/6.

Thus,

E(X² + X − 2) = 17/6 + 5/3 − 2 = 5/2.

Therefore, the average weekly demand for the drink at this chain of convenience stores is 2500 liters.
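The integral E[g(X)] can be confirmed symbolically; a SymPy sketch for this density:

    import sympy as sp

    x = sp.symbols('x')
    f = 2 * (x - 1)          # density on (1, 2)
    g = x**2 + x - 2         # weekly demand

    demand = sp.integrate(g * f, (x, 1, 2))
    print(demand)            # 5/2, i.e. 2500 liters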

Example: Test Score Standardization


Suppose that the raw scores X from a particular testing procedure are dis-
tributed between −5 and 20 with an expected value of 10 and a variance of 7.
In order to standardize the scores so that they lie between 0 and 100, the linear
transformation
Y = 4X + 20


is applied to the scores. This means, for example, that a raw score of x = 12
corresponds to a standardized score of y = (4 × 12) + 20 = 68.
The expected value of the standardized scores is then known to be

E(Y ) = 4E(X) + 20 = (4 × 10) + 20 = 60

with a variance of

Var(Y ) = 42 Var(X) = 42 × 7 = 112



The standard deviation of the standardized scores is σY = √112 ≈ 10.58, which
equals 4 × σX = 4 × √7.

5.8.14 Linear Combinations of Random Variables
When dealing with two random variables, X1 and X2 , it is often beneficial to
analyze the random variable formed by their sum. A general principle states
that:
AF E(X1 + X2 ) = E(X1 ) + E(X2 )
This means the expected value of the sum of two random variables is equal to
the sum of their individual expected values.

In addition:

Var(X1 + X2 ) = Var(X1 ) + Var(X2 ) + 2Cov(X1 , X2 )

Note that if the two random variables are independent, their covariance is
zero, simplifying the variance of their sum to the sum of their variances:

Var(X1 + X2 ) = Var(X1 ) + Var(X2 )



Thus, the variance of the sum of two independent random variables is equal to
the sum of their individual variances.

These results are straightforward, but it’s crucial to remember that while
the expected value of the sum of two random variables always equals the sum
of their expected values, the variance of the sum only equals the sum of their
variances if the random variables are independent.

Sums of Random Variables: If X1 and X2 are two random variables,


then:
E(X1 + X2 ) = E(X1 ) + E(X2 )
and
Var(X1 + X2 ) = Var(X1 ) + Var(X2 ) + 2Cov(X1 , X2 )
If X1 and X2 are independent random variables such that Cov(X1 , X2 ) = 0,


then:
Var(X1 + X2 ) = Var(X1 ) + Var(X2 )
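These identities can be checked empirically. The sketch below simulates correlated pairs (X1, X2) with numpy; the means and covariance chosen are arbitrary illustrative values:

import numpy as np

rng = np.random.default_rng(42)
mean = [2.0, -1.0]
cov = [[1.0, 0.6],
       [0.6, 2.0]]  # Var(X1) = 1.0, Var(X2) = 2.0, Cov(X1, X2) = 0.6

x1, x2 = rng.multivariate_normal(mean, cov, size=500_000).T
s = x1 + x2

print(s.mean())  # close to E(X1) + E(X2) = 1.0
print(s.var())   # close to Var(X1) + Var(X2) + 2 Cov(X1, X2) = 4.2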

Now, consider a sequence of random variables X1 , . . . , Xn along with constants


a1 , . . . , an and b. Define a new random variable Y as the linear combination:

Y = a1 X1 + · · · + an Xn + b

Linear combinations of random variables are important in various contexts,


and deriving general results for them is useful. The expectation of the linear
combination is:
E(Y ) = a1 E(X1 ) + · · · + an E(Xn ) + b

which is simply the linear combination of the expectations of the random vari-
ables Xi . Additionally, if the random variables X1 , . . . , Xn are independent,
then:
Var(Y) = a1^2 Var(X1) + · · · + an^2 Var(Xn)
Note that the constant b does not affect the variance of Y , and the coefficients
ai are squared in this expression.
Theorem 5.3. If X1 , . . . , Xn is a sequence of random variables and a1 , . . . , an
and b are constants, then

E(a1 X1 + · · · + an Xn + b) = a1 E(X1 ) + · · · + an E(Xn ) + b.

If, in addition, the random variables are independent, then

Var(a1 X1 + · · · + an Xn + b) = a1^2 Var(X1) + · · · + an^2 Var(Xn).

Problem 5.22. Suppose that X1 , . . . , Xn is a sequence of independent random


variables each with an expectation µ and a variance σ^2. Consider the sample
mean X̄ defined as:

X̄ = (1/n) Σ_{i=1}^n X_i

Find the mean and variance of the sample mean X̄.

Solution
Mean of the Sample Mean: Using the linearity of expectation:
E(X̄) = E( (1/n) Σ_{i=1}^n X_i ) = (1/n) Σ_{i=1}^n E(X_i) = (1/n) Σ_{i=1}^n µ = nµ/n = µ


Variance of the Sample Mean: Since the Xi are independent and each
has a variance σ^2:

Var(X̄) = Var( (1/n) Σ_{i=1}^n X_i )
        = (1/n^2) Σ_{i=1}^n Var(X_i)
        = (1/n^2) Σ_{i=1}^n σ^2
        = nσ^2 / n^2
        = σ^2 / n.
Therefore, the mean and variance of the sample mean X̄ are:
E(X̄) = µ   and   Var(X̄) = σ^2 / n
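A short simulation illustrates this σ^2/n behaviour; here the Xi are exponential draws, chosen purely as an example distribution with known mean and variance:

import numpy as np

rng = np.random.default_rng(0)
n = 25  # sample size; Exponential(scale=1) has mu = 1 and sigma^2 = 1

# 100,000 replications of the sample mean of n observations
xbar = rng.exponential(scale=1.0, size=(100_000, n)).mean(axis=1)

print(xbar.mean())  # close to mu = 1.0
print(xbar.var())   # close to sigma^2 / n = 1/25 = 0.04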
Problem 5.23. Let X1 and X2 represent the scores on two tests, with the
following information:
E(X1 ) = 18, Var(X1 ) = 24, E(X2 ) = 30, Var(X2 ) = 60.
The scores are standardized as:
Y1 = (10/3) X1,   Y2 = (5/3) X2 + 50/3.

The final score is:

Z = (2/3) Y1 + (1/3) Y2.
(a). Calculate the expected value of the final score E(Z).

(b). Calculate Var(Z) and the standard deviation σZ , assuming X1 and X2


are independent.

Solution


(a) Calculate E(Z):


E(Z) = (2/3) E(Y1) + (1/3) E(Y2).

E(Y1) = (10/3) × 18 = 60,   E(Y2) = (5/3) × 30 + 50/3 = 66.67.

E(Z) = (2/3) × 60 + (1/3) × 66.67 = 62.22.

(b) Calculate Var(Z) and σZ :


Var(Z) = (2/3)^2 Var(Y1) + (1/3)^2 Var(Y2).

Var(Y1) = (10/3)^2 × 24 = 266.67,   Var(Y2) = (5/3)^2 × 60 = 166.67.

Var(Z) = (4/9) × 266.67 + (1/9) × 166.67 = 137.04.

σZ = √137.04 = 11.71.

Hence, E(Z) = 62.22, Var(Z) = 137.04, and σZ = 11.71.
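The same arithmetic can be scripted directly from Theorem 5.3 (a minimal sketch whose variable names mirror the problem):

from math import sqrt

E_X1, V_X1 = 18, 24
E_X2, V_X2 = 30, 60

# Standardized scores: Y1 = (10/3) X1 and Y2 = (5/3) X2 + 50/3
E_Y1, V_Y1 = (10/3) * E_X1, (10/3)**2 * V_X1
E_Y2, V_Y2 = (5/3) * E_X2 + 50/3, (5/3)**2 * V_X2

# Final score Z = (2/3) Y1 + (1/3) Y2, with X1 and X2 independent
E_Z = (2/3) * E_Y1 + (1/3) * E_Y2
V_Z = (2/3)**2 * V_Y1 + (1/3)**2 * V_Y2

print(round(E_Z, 2), round(V_Z, 2), round(sqrt(V_Z), 2))  # 62.22 137.04 11.71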

5.8.15 Exercises
1. Suppose X, taking the values 1, 2, 3, and 4, is the service time in hours
taken at Bashundhara residential area, and Y, taking the values 1, 2,
and 3, is the number of air conditioner (AC) units at the same location.
The joint probabilities of X and Y are presented in Table 5.9.

Table 5.9: Joint probability table

                           X = service time (hrs)
p(x, y)                  1       2       3       4
Y = number     1       0.12    0.08    0.07    0.05
of AC units    2       0.08     k      0.21    0.13
               3       0.01    0.01    0.02    0.07

(a) What is the value of k?


(b) Find the marginal distribution of X and Y .
(c) Find the conditional distribution P(X | Y = 2) and compute its
mean.


(d) Compute E(XY ).

2. Suppose that the random variables X, Y , and Z are independent with


E(X) = 3, Var(X) = 4, E(Y ) = −4, Var(Y ) = 2, E(Z) = 7, and
Var(Z) = 7. Calculate the expectation and variance of the following
random variables.
(a) 3X + 7
(b) 5X − 9
(c) 2X + 6Y
(d) 4X − 3Y

(e) 5X − 9Z + 8
(f) −3Y − Z − 5
(g) X + 2Y + 3Z
(h) 6X + 2Y − Z + 16
3. Suppose that items from a manufacturing process are subject to three
separate evaluations, and that the results of the first evaluation X1 have
a mean value of 59 with a standard deviation of 10, the results of the
second evaluation X2 have a mean value of 67 with a standard deviation
of 13, and the results of the third evaluation X3 have a mean value of 72
with a standard deviation of 4. In addition, suppose that the results of
the three evaluations can be taken to be independent of each other.

(a) If a final evaluation score is obtained as the average of the three


evaluations X̄ = (X1 + X2 + X3)/3, what are the mean and the standard
deviation of the final evaluation score?

(b) If a final evaluation score is obtained as the weighted average of the


three evaluations X = 0.4X1 + 0.4X2 + 0.2X3 , what are the mean
and the standard deviation of the final evaluation score?

4. A machine part is assembled by fastening two components of type A


and one component of type B end to end. Suppose that the lengths of
components of type A have an expectation of 37.0 mm and a standard
deviation of 0.7 mm, whereas the lengths of components of type B have
an expectation of 24.0 mm and a standard deviation of 0.3 mm. What
are the expectation and variance of the length of the machine part?

5. A product is assembled by linking four components of type C and one


component of type D sequentially. The lengths of components of type C
have an average of 50.0 mm and a standard deviation of 0.8 mm, while
the lengths of components of type D have an average of 20.0 mm and a
standard deviation of 0.4 mm. Determine the average and variance of the
length of the product.


6. A system is constructed by connecting five components of type G and two


components of type H end to end. Assume that the lengths of components
of type G have an expected value of 40.0 mm and a standard deviation
of 1.5 mm, and the lengths of components of type H have an expected
value of 22.0 mm and a standard deviation of 0.7 mm. What are the
expectation and variance of the total length of the system?
7. A person’s cholesterol level C can be measured by three different tests.
Test-α returns a value Xα with a mean C and a standard deviation of 1.2,
test-β returns a value Xβ with a mean C and a standard deviation of 2.4,
and test-γ returns a value Xγ with a mean C and a standard deviation
of 3.1. Suppose that the three test results are independent. If a doctor

decides to use the weighted average 0.5Xα + 0.3Xβ + 0.2Xγ , what is the
standard deviation of the cholesterol level obtained by the doctor?
8. Suppose that the impurity levels of water samples taken from a particular
source are independent with a mean value of 3.87 and a standard deviation
of 0.18.
AF (a) What are the mean and the standard deviation of the sum of the
impurity levels from two water samples?
(b) What are the mean and the standard deviation of the sum of the
impurity levels from three water samples?
(c) What are the mean and the standard deviation of the average of the
impurity levels from four water samples?
(d) If the impurity levels of two water samples are averaged, and the
result is subtracted from the impurity level of a third sample, what
are the mean and the standard deviation of the resulting value?

5.9 Python Functions for Statistical Distributions
In the analysis of statistical distributions, Python provides a variety of functions
to work with different types of distributions. These functions can be used to
perform tasks such as generating random variates, computing probability mass
functions (pmf), cumulative distribution functions (cdf), and more. The following
Table 5.10 summarizes some of the key functions available for discrete and
continuous distributions in Python.
These functions are typically part of the ‘scipy.stats’ module, which
includes a wide range of probability distributions and statistical functions. The
table below lists each function along with a brief explanation of its purpose:


Python function                                   Function explanation
rvs(p, loc=0, size=1)                             Random variates.
pmf(x, p, loc=0)                                  Probability mass function.
logpmf(x, p, loc=0)                               Log of the probability mass function.
cdf(x, p, loc=0)                                  Cumulative distribution function.
logcdf(x, p, loc=0)                               Log of the cumulative distribution function.
sf(x, p, loc=0)                                   Survival function (1 − cdf; sometimes more accurate).
logsf(x, p, loc=0)                                Log of the survival function.
ppf(q, p, loc=0)                                  Percent point function (inverse of cdf; percentiles).
isf(q, p, loc=0)                                  Inverse survival function (inverse of sf).
stats(p, loc=0, moments='mv')                     Mean ('m'), variance ('v'), skew ('s'), and/or kurtosis ('k').
entropy(p, loc=0)                                 (Differential) entropy of the RV.
expect(func, p, loc=0, lb=None, ub=None,          Expected value of a function (of one argument)
  conditional=False)                              with respect to the distribution.
median(p, loc=0)                                  Median of the distribution.
mean(p, loc=0)                                    Mean of the distribution.
var(p, loc=0)                                     Variance of the distribution.
std(p, loc=0)                                     Standard deviation of the distribution.
interval(alpha, p, loc=0)                         Endpoints of the range that contains alpha percent of the distribution.

Table 5.10: Summary of Python functions for statistical distributions.
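As a brief illustration of this interface, the sketch below freezes a Binomial(10, 0.3) distribution and calls several of the methods from Table 5.10 (the choice of distribution and parameters is arbitrary):

from scipy.stats import binom

dist = binom(10, 0.3)  # a "frozen" distribution object

print(dist.pmf(3))               # P(X = 3)
print(dist.cdf(3))               # P(X <= 3)
print(dist.sf(3))                # P(X > 3), i.e., 1 - cdf
print(dist.ppf(0.5))             # percent point function (median)
print(dist.stats(moments="mv"))  # mean and variance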

5.10 Concluding Remarks


In this chapter, we have established a comprehensive foundation for understand-
ing random variables and their fundamental properties. We began by defining
random variables and distinguishing between discrete and continuous types, ex-
amining their respective probability functions and cumulative distribution func-
tions. We then delved into the crucial concepts of expectation and variance,
illustrating their applications through various examples. The discussion on


Chebyshev’s Inequality highlighted its utility in providing probabilistic bounds


without assuming a specific distribution. Additionally, we explored jointly dis-
tributed random variables, emphasizing the importance of understanding inde-
pendence, covariance, and correlation. With these essential concepts and tools
in place, we are now well-equipped to explore specific discrete probability distri-
butions in the following chapter, where we will extend our knowledge to model
and analyze discrete data more effectively.

5.11 Chapter Exercises


1. A study records the number of adverse reactions to a new drug among a

group of patients. The probability distribution of the number of adverse
reactions is given by:

P(X = x) = 0.5   if x = 0,
           0.3   if x = 1,
           0.2   if x = 2.

(a) Find the expected number of adverse reactions.


(b) Calculate the variance and standard deviation of the number of ad-
verse reactions.
2. Let the lifetime T in hours of a certain type of electronic device have the
probability density function
f_T(t) = (1/100) e^(−t/100)   for t ≥ 0,
         0                    elsewhere.
DR

Find the expectation and variance of the lifetime.


3. Let the height H in centimeters of a particular species of plant have the
probability density function
f_H(h) = (3/64)(h − 120)^2   for 120 ≤ h ≤ 124,
         0                   elsewhere.

Calculate the cumulative distribution function FH (h) and find the prob-
ability that a plant’s height is between 121 and 123 cm.

(a) Write down the probability mass function P (Y = y).


(b) Calculate the expectation and variance of Y .
(c) Find the probability that there are exactly 2 defective items in a
batch.


4. Let the length L in meters of a certain type of fish have the probability
density function

f_L(l) = 0.2l   for 0 ≤ l ≤ 2,
         0      elsewhere.
Find the expectation, variance, and cumulative distribution function FL (l)
of the length.

5. Let the temperature X in degrees Fahrenheit of a particular chemical


reaction have the density

f_X(x) = (x − 190)/3600,   220 ≤ x ≤ 280.

(a) Find the cumulative distribution function F(x) of the temperature and its median.
(b) Find the expectation and standard deviation of the temperature.
(c) Find the expectation and variance of Y, where Y = (5/9)X − 160/9.
9 .
6. The random variable X measures the concentration of ethanol in a chem-
ical solution, and the random variable Y measures the acidity of the so-
lution. They have a joint probability density function

f (x, y) = A(20 − x − 2y)

for 0 ≤ x ≤ 5 and 0 ≤ y ≤ 5 and f (x, y) = 0 elsewhere.


(a) What is the value of A? What is P (1 ≤ X ≤ 2, 2 ≤ Y ≤ 3)?
(b) Construct the marginal probability density function fX (x).
(c) What are the expectation and the variance of the ethanol concentra-

tion?

Chapter 6

Some Discrete Probability Distributions
6.1 Introduction
In the field of data science, understanding discrete probability distributions is
crucial for analyzing and modeling data that can be categorized into distinct
outcomes. These distributions help data scientists interpret and predict the
likelihood of various events based on historical data, which can be essential for
making informed decisions and developing predictive models.

This chapter focuses on three fundamental discrete probability distributions:


the Bernoulli distribution, the Binomial distribution, and the Poisson distribu-
tion. Each of these distributions plays a vital role in data science applications,

ranging from binary classification problems to event counting and rate model-
ing.

Throughout this chapter, we will delve into each distribution’s mathematical


properties, including expected value, variance, moment generating function, and
characteristic function. We will also present practical examples and exercises
to illustrate how these distributions can be applied to real-world data science
problems.

6.2 Bernoulli Distribution


Consider a simple experiment where we flip a fair coin. The outcome of this
experiment can be either “Heads” or “Tails.” We can assign a value of 1 to
“Heads” and 0 to “Tails.” This experiment is an example of a Bernoulli trial,
which is a random experiment with exactly two possible outcomes. It is named


after Jacob Bernoulli, a Swiss mathematician.

Suppose we are interested in modeling the probability of getting “Heads” in


a single coin flip. If the coin is fair, the probability of getting “Heads” (success)
is p = 0.5 and the probability of getting “Tails” (failure) is 1−p = 0.5. However,
in general, the probability of success in a Bernoulli trial can be any value p such
that 0 ≤ p ≤ 1.

Definition: A random variable X is said to have a Bernoulli distribution


with parameter p if it takes the value 1 with probability p and the value 0
with probability 1 − p. The probability mass function (pmf) of X is given
by:

P(X = x) = p       if x = 1,
           1 − p   if x = 0,

or more compactly, if X ∼ Bernoulli(p), then the pmf is

P(X = x) = p^x (1 − p)^(1−x)   for x ∈ {0, 1}.


6.2.1 Expected Value (Mean)
The mean or expected value E(X) of a Bernoulli distributed random variable
X can be calculated as follows:
E(X) = Σ_x x · P(X = x)

For a Bernoulli random variable X:

E(X) = 1 · P (X = 1) + 0 · P (X = 0)

Since P (X = 1) = p and P (X = 0) = 1 − p, we have:

E(X) = 1 · p + 0 · (1 − p) = p
So, the mean of a Bernoulli distribution is:

E(X) = p

6.2.2 Variance
The variance Var(X) of a Bernoulli distributed random variable X is defined
as:

Var(X) = E[(X − E(X))^2]

First, we calculate E(X^2):


E(X^2) = Σ_x x^2 · P(X = x)

For a Bernoulli random variable X:

E(X^2) = 1^2 · P(X = 1) + 0^2 · P(X = 0) = p
Now, using the formula for variance:

Var(X) = E(X^2) − [E(X)]^2

Substitute the values we have calculated:

Var(X) = p − p^2 = p(1 − p)
So, the variance of a Bernoulli distribution is:

Var(X) = p(1 − p)
Properties

• Mean: The expected value (mean) of a Bernoulli random variable X is given by:

  E(X) = p

• Variance: The variance of a Bernoulli random variable X is given by:

  Var(X) = p(1 − p)

• Standard Deviation: The standard deviation of a Bernoulli random variable X is:

  σ = √(p(1 − p))

Problem 6.1. A factory produces light bulbs, and each bulb has a 95% chance
of passing the quality control test. Define a random variable X such that X = 1
if a light bulb passes the quality control test (success) and X = 0 if it fails
(failure).

(a). What is the probability that a randomly selected light bulb passes the quality
control test?
(b). What is the expected value (mean) of X?

(c). What is the variance of X?


Solution
Let’s define the random variable X as follows:
X = 1   with probability p = 0.95,
    0   with probability 1 − p = 0.05.

(a). Probability of Passing the Quality Control Test


The probability that a randomly selected light bulb passes the quality control
test is given by P (X = 1).

P (X = 1) = p = 0.95
So, the probability that a light bulb passes the quality control test is 0.95,
or 95%.

(b). Expected Value (Mean) of X


The expected value E(X) of a Bernoulli distributed random variable X is given
by:

E(X) = p
Substituting the value of p:

E(X) = 0.95
So, the expected value of X is 0.95.

(c). Variance of X
The variance Var(X) of a Bernoulli distributed random variable X is given by:

Var(X) = p(1 − p)
Substituting the value of p:

Var(X) = 0.95 × (1 − 0.95) = 0.95 × 0.05 = 0.0475


So, the variance of X is 0.0475.

Problem 6.2. A new vaccine is being tested for its effectiveness. In clinical
trials, it was found that the vaccine successfully immunizes 90% of the par-
ticipants. Define a random variable X such that X = 1 if a participant is
successfully immunized (success) and X = 0 if not (failure).

(i). What is the probability that a randomly selected participant is successfully


immunized?


(ii). What is the expected value (mean) of X?


(iii). What is the variance of X?
(iv). In a group of 10 participants, what is the expected number of participants
that will be successfully immunized?

Solution
Let’s define the random variable X as follows:
X = 1   with probability p = 0.90,
    0   with probability 1 − p = 0.10.

(i) Probability of Successful Immunization


The probability that a randomly selected participant is successfully immunized
is given by P (X = 1).
P(X = 1) = p = 0.90
So, the probability that a participant is successfully immunized is 0.90, or
90%.

(ii) Expected Value (Mean) of X


The expected value E(X) of a Bernoulli distributed random variable X is given
by:

E(X) = p = 0.90

So, the expected value of X is 0.90.

(iii) Variance of X
The variance Var(X) of a Bernoulli distributed random variable X is given by:

Var(X) = p(1 − p) = 0.90 × (1 − 0.90) = 0.90 × 0.10 = 0.09


So, the variance of X is 0.09.

(iv) Expected Number of Successful Immunizations in a Group of 10


Participants
Let Y be the total number of participants successfully immunized in a group
of 10. Y follows a Binomial distribution with parameters n = 10 and p = 0.90.
The expected value E(Y ) of a Binomial random variable is given by:

E(Y ) = np


Substituting the values of n and p:

E(Y ) = 10 × 0.90 = 9
So, the expected number of participants successfully immunized in a group
of 10 is 9.

6.2.3 Moment Generating Function (MGF)


The moment generating function (MGF) MX (t) of a random variable X is
defined as:

M_X(t) = E(e^(tX))
For a Bernoulli distributed random variable X with probability p of success
(i.e., X = 1 with probability p and X = 0 with probability 1 − p):

M_X(t) = E(e^(tX)) = e^(t·0) · P(X = 0) + e^(t·1) · P(X = 1)
       = e^0 · (1 − p) + e^t · p
       = 1 · (1 − p) + e^t · p
       = (1 − p) + p e^t

So, the moment generating function of a Bernoulli distributed random vari-


able X is:

M_X(t) = 1 − p + p e^t

6.2.4 Characteristic Function



The characteristic function φX (t) of a Bernoulli distributed random variable X


is defined as:

φ_X(t) = E(e^(itX)) = e^(it·0) · P(X = 0) + e^(it·1) · P(X = 1)
       = e^(i·0) · (1 − p) + e^(it) · p
       = 1 · (1 − p) + e^(it) · p
       = (1 − p) + p e^(it).

6.2.5 Probability Generating Function


For a Bernoulli random variable X with parameter p, the probability generating
function (PGF) is given by:

G_X(s) = E[s^X]
Since X can take values 0 and 1, we have:


G_X(s) = E[s^X] = Σ_x P(X = x) · s^x
       = (1 − p) · s^0 + p · s^1
       = (1 − p) + p · s

Therefore, the PGF of a Bernoulli random variable X with parameter p is:

G_X(s) = 1 − p + p · s

6.2.6 Example
Let’s consider a biased coin where the probability of getting “Heads” is p = 0.7.
The random variable X representing the outcome of a single coin flip follows a
Bernoulli distribution with parameter p = 0.7.
The pmf of X is:
P(X = x) = 0.7   if x = 1,
           0.3   if x = 0.
The mean and variance of X are:

E(X) = 0.7

Var(X) = 0.7 × (1 − 0.7) = 0.21


Thus, we can model and analyze the outcomes of a single coin flip using the
Bernoulli distribution.

Problem 6.3. A factory produces light bulbs, and each light bulb is tested
for quality. The probability that a light bulb is defective is p = 0.1. Let X
be a random variable that represents whether a randomly selected light bulb is
defective (1 if defective, 0 if not defective).

1. What is the probability that a randomly selected light bulb is defective?


2. What is the probability that a randomly selected light bulb is not defective?
3. Compute the expected value and variance of X.

Solution
Let X follow a Bernoulli distribution with parameter p = 0.1, i.e., X ∼
Bernoulli(0.1).


1. The probability that a randomly selected light bulb is defective is given


by P (X = 1):
P (X = 1) = p = 0.1

2. The probability that a randomly selected light bulb is not defective is


given by P (X = 0):

P (X = 0) = 1 − p = 1 − 0.1 = 0.9

3. For a Bernoulli random variable X with parameter p:


• The expected value E[X] is:

E[X] = p = 0.1

• The variance Var(X) is:

Var(X) = p(1 − p) = 0.1 · (1 − 0.1) = 0.1 · 0.9 = 0.09


6.2.7 Applications
The Bernoulli distribution is used to model binary outcomes in various scenar-
ios, such as:
• Quality Control in Manufacturing
In manufacturing, the Bernoulli distribution is used to model the prob-
ability of a defect in a production process. For example, a factory pro-
ducing electronic components might use the Bernoulli distribution to

determine the likelihood that a randomly selected component is defec-


tive.

• Clinical Trials in Medicine


The Bernoulli distribution can model the outcome of a clinical trial
for a new drug, where X = 1 represents a successful treatment (e.g.,
patient recovery) and X = 0 represents an unsuccessful treatment.
This helps in estimating the effectiveness of the drug.

• A/B Testing in Marketing


In digital marketing, A/B testing is used to compare two versions of
a webpage or advertisement. The Bernoulli distribution models the
probability of a user clicking on an ad or making a purchase, where
X = 1 indicates a click or purchase and X = 0 indicates no click or
purchase.


• Sports Performance Analysis


The Bernoulli distribution can be applied to model the probability of a
successful outcome in sports, such as a basketball player making a free
throw or a soccer player scoring a penalty kick. Here, X = 1 represents
a successful attempt, and X = 0 represents a failure.

• Insurance Risk Assessment


In insurance, the Bernoulli distribution is used to model the occurrence
of certain events, such as accidents or claims. For instance, X = 1 could
represent a policyholder filing a claim within a year, and X = 0 could
represent no claim filed.

• Genetics
The Bernoulli distribution is used in genetics to model the inheritance

of a particular gene. For example, X = 1 might represent the presence
of a specific gene in an offspring, and X = 0 represents its absence,
assuming a certain probability of inheritance.

6.2.8 Python Code for Bernoulli Distribution
In Python, we can calculate various characteristics of the Bernoulli distribu-
tion using the ‘scipy.stats‘ module. Below, we detail how to compute the
Probability Mass Function, Cumulative Distribution Function, Mean (Expected
Value), Variance, and Probability Generating Function, etc.

Python Code
Here’s how we can compute these characteristics using Python:
DR
1 import numpy as np
2 from scipy . stats import bernoulli
3

4 # Define the parameter p


5 p = 0.25
6

7 # Bernoulli distribution
8 dist = bernoulli ( p )
9

10 # 1. Probability Mass Function ( PMF )


11 x_values = [0 , 1] # Possible values for a Bernoulli
random variable
12 pmf_values = dist . pmf ( x_values )
13 print ( " PMF values for x = 0 and x = 1: " , pmf_values )
14

15 # 2. Cumulative Distribution Function ( CDF )


16 cdf_values = dist . cdf ( x_values )
17 print ( " CDF values for x = 0 and x = 1: " , cdf_values )

233
CHAPTER 6. SOME DISCRETE PROBABILITY DISTRIBUTIONS

18

19 # 3. Mean ( Expected Value )


20 mean = dist . mean ()
21 print ( " Mean ( Expected Value ) : " , mean )
22

23 # 4. Variance
24 variance = dist . var ()
25 print ( " Variance : " , variance )
26

27 # 5. Probability Generating Function ( PGF )


28 def pgf (t , p ) :
29 return (1 - p ) + p * t

T
30

31 # PGF values at t = 0 and t = 1


32 pgf_values = [ pgf (t , p ) for t in [0 , 1]]
33 print ( " PGF values for t = 0 and t = 1: " , pgf_values )

Explanations
• Probability Mass Function (pmf):
  ■ The pmf gives the probability of each outcome (0 or 1). Use
    dist.pmf(x_values) to compute these probabilities.

• Cumulative Distribution Function (CDF):


■ The CDF gives the cumulative probability up to each outcome.
Use dist.cdf(x_values) to compute these values.

• Mean (Expected Value):



■ The mean is simply the parameter p of the Bernoulli distribution.


Use dist.mean() to get this value.

• Variance:
■ The variance of a Bernoulli distribution is p · (1 − p). Use
dist.var() to compute this.

• Probability Generating Function (PGF):


■ The PGF is calculated using the formula GX (t) = 1 − p + p · t.
Define a function pgf(t, p) to compute PGF values for specific
t values (e.g., 0 and 1).

6.2.9 Exercises
1. A medical test for a certain disease has a 98% chance of correctly iden-
tifying a diseased person (true positive) and a 2% chance of incorrectly


identifying a healthy person as diseased (false positive). Define a ran-


dom variable X such that X = 1 if the test result is positive (either true
positive or false positive) and X = 0 if the test result is negative.

(a) What is the probability that a randomly selected test result is posi-
tive?
(b) What is the expected value (mean) of X?
(c) What is the variance of X?
(d) In a group of 50 people who took the test, what is the expected
number of positive test results?

2. A genetic trait is passed on to the next generation with a probability of
25%. Define a random variable X such that X = 1 if the trait is passed
on (success) and X = 0 if it is not (failure).

(a) What is the probability that a randomly selected offspring inherits
the trait?
AF(b) What is the expected value (mean) of X?
(c) What is the variance of X?
(d) In a group of 40 offspring, what is the expected number of offspring
that will inherit the trait?
3. In a clinical trial, a new drug is found to be effective in 85% of the pa-
tients. Define a random variable X such that X = 1 if a patient responds
positively to the drug (success) and X = 0 if not (failure).

(a) What is the probability that a randomly selected patient responds


positively to the drug?

(b) What is the expected value (mean) of X?


(c) What is the variance of X?
(d) In a sample of 20 patients, what is the expected number of patients
who will respond positively to the drug?

4. A diagnostic test has a 92% chance of correctly detecting a condition


when it is present (true positive rate) and an 8% chance of detecting the
condition when it is not present (false positive rate). Define a random
variable X such that X = 1 if the test result is positive (either true
positive or false positive) and X = 0 if the test result is negative.

(a) What is the probability that a randomly selected test result is posi-
tive?
(b) What is the expected value (mean) of X?
(c) What is the variance of X?


(d) In a group of 100 people who took the test, what is the expected
number of positive test results?

5. In a study of a certain disease, it is found that 70% of the subjects have a


particular gene variant that increases susceptibility to the disease. Define
a random variable X such that X = 1 if a subject has the gene variant
(success) and X = 0 if not (failure).

(a) What is the probability that a randomly selected subject has the
gene variant?
(b) What is the expected value (mean) of X?

(c) What is the variance of X?
(d) In a sample of 30 subjects, what is the expected number of subjects
who have the gene variant?

6.3 Binomial Distribution


Consider the simple experiment of flipping a fair coin, where the outcome can be
either “Heads” (success) or “Tails” (failure). We assign a value of 1 to “Heads”
and 0 to “Tails,” making this a Bernoulli trial. Suppose we flip the coin n times.
We are interested in modeling the probability of obtaining exactly k “Heads”
(successes) in n flips.

This problem is an example of a Binomial experiment, which consists of n


independent Bernoulli trials, each with the same probability of success p. To
calculate the probability of getting exactly k successes in n trials, consider a
specific sequence of trials where exactly k trials are successes and n − k trials
are failures. The probability of such a specific sequence is given by
DR

p^k × (1 − p)^(n−k)

where p is the probability of success and 1 − p is the probability of failure.

We need to account for all possible sequences of n trials that result in exactly
k successes. The number of ways to choose k positions for successes out of n
positions is given by the Binomial coefficient
 
C(n, k) = n! / (k!(n − k)!)

where C(n, k), read "n choose k", represents the number of combinations of n items taken k at a time.

The total probability of having exactly k successes in n trials is the product


of the probability of any specific sequence and the number of such sequences.
Thus, the probability mass function of the Binomial distribution is


 
P(X = k) = C(n, k) p^k (1 − p)^(n−k)   for k = 0, 1, 2, . . . , n,
where X is the random variable representing the number of successes.
The Binomial distribution is a discrete probability distribution. This distri-
bution has the following conditions:

1. Fixed Number of Trials: The experiment is repeated a fixed number


of times, denoted as n.
2. Two Possible Outcomes: Each trial results in one of two outcomes,
often referred to as “success” and “failure.”
3. Constant Probability: The probability of success in each trial is con-
stant and denoted by p. Consequently, the probability of failure is 1 − p.

4. Independence: The outcome of one trial is independent of the outcomes
of other trials.

If these conditions are met, the number of successes in n trials follows a


Binomial distribution with parameters n and p.

Definition: A random variable X that represents the number of successes


in n independent Bernoulli trials, each with a probability of success p, is
said to follow a Binomial distribution if its probability mass function (pmf)
is given by:
 
P(X = k) = C(n, k) p^k (1 − p)^(n−k)   for k = 0, 1, 2, . . . , n,

where C(n, k) is the Binomial coefficient. We write this as

X ∼ B(n, p).

Figure 6.1: Graphical Presentation of pmf of Binomial Distribution.


The graphical presentation of the Binomial distribution for n = 10 and


p = 0.25, p = 0.5, and p = 0.75 is depicted in Figure 6.1. The Python code
used to generate these figures is provided below.
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import binom

# Parameters
n = 10
p = 0.5

# Values
k = np.arange(0, n + 1)
pmf = binom.pmf(k, n, p)

# Plot
plt.bar(k, pmf, color='blue', edgecolor='black')
plt.xlabel('Number of Successes')
plt.ylabel('Probability')
plt.title('Binomial PMF (n=10, p=0.5)')
plt.xticks(k)
plt.grid(True)
plt.show()

Symmetric Binomial Distributions A B(n, 0.5) distribution is a symmet-


ric probability distribution for any value of the parameter n. The distribution
is symmetric about the expected value n/2.

6.3.1 Expected Value



The expected value E[X] is


E[X] = Σ_{k=0}^n k · P(X = k)
     = Σ_{k=0}^n k · C(n, k) p^k (1 − p)^(n−k)
     = Σ_{k=1}^n n · C(n−1, k−1) p^k (1 − p)^(n−k)     [since k · C(n, k) = n · C(n−1, k−1)]
     = n p Σ_{k=1}^n C(n−1, k−1) p^(k−1) (1 − p)^(n−k)


Using the binomial expansion of (p + (1 − p))^(n−1), we have

Σ_{k=1}^n C(n−1, k−1) p^(k−1) (1 − p)^(n−k) = 1

Thus,

E(X) = n · p · 1 = np
Therefore, the expected value of X is

E(X) = np

Alternatively:
By recognizing that X can be written as the sum of n independent Bernoulli
trials Xi
X = X1 + X2 + · · · + Xn
Each Xi ∼ Bernoulli(p) with
AF E[Xi ] = p
Using the linearity of expectation, we have
E[X] = E[X1 + X2 + · · · + Xn ] = E[X1 ] + E[X2 ] + · · · + E[Xn ] = n · p
Thus, the expected value E[X] is
E[X] = np

6.3.2 Variance and Standard Deviation


Since X1, X2, . . . , Xn are independent random variables, we can use Theorem
5.3. Thus, the variance Var(X) can be computed as follows

σ^2 = Var(X) = Var(X1 + X2 + · · · + Xn)
             = Var(X1) + Var(X2) + · · · + Var(Xn)
             = p(1 − p) + p(1 − p) + · · · + p(1 − p)
             = np(1 − p).

The standard deviation σ is

σ = √Var(X) = √(np(1 − p))

Properties

• Mean: The expected value (mean) of a Binomial random variable X is given by

  E(X) = np

• Variance: The variance of a Binomial random variable X is given by

  Var(X) = np(1 − p)

• Standard Deviation: The standard deviation of a Binomial random variable X is

  σ = √(np(1 − p))

6.3.3 Example
Let us consider a biased coin where the probability of getting “Heads” is p = 0.7.
Suppose we flip this coin n = 10 times. The random variable X representing the

number of “Heads” in 10 flips follows a Binomial distribution with parameters
n = 10 and p = 0.7.
The pmf of X is
 
P(X = k) = C(10, k) (0.7)^k (0.3)^(10−k)   for k = 0, 1, 2, . . . , 10.
The mean and variance of X are

E(X) = 10 × 0.7 = 7

Var(X) = 10 × 0.7 × (1 − 0.7) = 2.1


Thus, we can model and analyze the number of “Heads” in 10 coin flips
using the Binomial distribution.
Problem 6.4. A factory produces light bulbs, and 5% of them are defective.
Suppose a quality control inspector randomly selects 8 bulbs for testing.

(a). What is the probability that exactly 2 of the 8 bulbs are defective?
(b). What is the probability that at most 2 bulbs are defective?
(c). What is the probability that more than 1 bulb is defective?
(d). What is the probability that at least 1 bulb is defective?
(e). What is the expected number of defective bulbs in the sample of 8?
(f ). What is the variance and standard deviation of the number of defective
bulbs in the sample?

Solution
Let X be the number of defective bulbs in the sample. Here, X follows a
binomial distribution with parameters n = 8 and p = 0.05. The pmf of X is
 
P(X = k) = C(8, k) (0.05)^k (0.95)^(8−k)   for k = 0, 1, 2, . . . , 8.


(a). Probability of Exactly 2 Defective Bulbs The probability of exactly


2 defective bulbs is
 
P(X = 2) = C(8, 2) (0.05)^2 (0.95)^6 = 28 × (0.05)^2 × (0.95)^6 ≈ 0.0515

Thus, the probability that exactly 2 bulbs are defective is approximately 0.0515.

(b). Probability of At Most 2 Defective Bulbs

P(X ≤ 2) = P(X = 0) + P(X = 1) + P(X = 2)

Compute each term:

P(X = 0) = C(8, 0) (0.05)^0 (0.95)^8 ≈ 0.6634
P(X = 1) = C(8, 1) (0.05)^1 (0.95)^7 ≈ 0.2793
P(X = 2) = C(8, 2) (0.05)^2 (0.95)^6 ≈ 0.0515   (from part (a))

Hence,

P(X ≤ 2) ≈ 0.6634 + 0.2793 + 0.0515 = 0.9942

Thus, the probability that at most 2 bulbs are defective is approximately 0.9942.
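The remaining parts of Problem 6.4 follow the same pattern, and all six answers can be checked with scipy.stats.binom, as in the sketch below:

from scipy.stats import binom

dist = binom(8, 0.05)  # n = 8 bulbs, defect probability p = 0.05

print(dist.pmf(2))             # (a) P(X = 2)  ~ 0.0515
print(dist.cdf(2))             # (b) P(X <= 2) ~ 0.9942
print(dist.sf(1))              # (c) P(X > 1)  = 1 - P(X <= 1)
print(1 - dist.pmf(0))         # (d) P(X >= 1) = 1 - P(X = 0)
print(dist.mean())             # (e) E(X) = np = 0.4
print(dist.var(), dist.std())  # (f) variance 0.38 and standard deviation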

Problem 6.5. A pharmaceutical company is testing a new drug to see if it


improves recovery rates. The probability that a patient responds positively to
the drug is p = 0.25. Suppose the company tests the drug on 10 patients. Let
X be the number of patients who respond positively to the drug.

(a). What is the probability that exactly 3 patients respond positively?


(b). What is the probability that at most 3 patients respond positively?
(c). Calculate the expected number of patients who respond positively and the
variance of X.
(d). If Y represents the number of patients who do not respond positively, find
the probability distribution of Y and its mean and variance.


Solution
Let X be the number of patients who respond positively out of 10 patients, with
each patient responding positively with probability p = 0.25. Then X follows
a binomial distribution whose pmf is

P(X = k) = C(10, k) (0.25)^k (0.75)^(10−k)   for k = 0, 1, 2, . . . , 10.

(a). The probability of exactly 3 patients responding positively is given by


the binomial probability mass function
 

P(X = 3) = C(10, 3) (0.25)^3 (0.75)^7
         = 120 × (0.25)^3 × (0.75)^7 ≈ 120 × 0.015625 × 0.1335
         ≈ 0.2503

Therefore, the probability that exactly 3 patients respond positively is approx-


AF
imately 0.2503.

(b). The probability that at most 3 patients respond positively is the cu-
mulative probability P (X ≤ 3), which is the sum of the probabilities for
X = 0, 1, 2, 3. Therefore,

P(X ≤ 3) = P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3)
         = C(10, 0)(0.25)^0(0.75)^10 + C(10, 1)(0.25)^1(0.75)^9 + C(10, 2)(0.25)^2(0.75)^8
           + 0.2503   (from part (a))
         ≈ 0.0563 + 0.1877 + 0.2816 + 0.2503 = 0.7760

Therefore, the probability that at most 3 patients respond positively is approx-


imately 0.7760.

(c). For a binomial distribution X ∼ B(n, p), the expected value and variance
are given by

E(X) = np and Var(X) = np(1 − p)


Substituting n = 10 and p = 0.25,

E(X) = 10 × 0.25 = 2.5


Var(X) = 10 × 0.25 × 0.75 = 1.875
Therefore, the expected number of patients who respond positively is 2.5
and the variance is 1.875.


(d). Since Y represents the number of patients who do not respond positively,
we can express Y = 10 − X. The number of patients who do not respond
positively follows a binomial distribution Y ∼ B(n = 10, p = 0.75).
The probability mass function of Y is
 
P(Y = k) = C(10, k) (0.75)^k (0.25)^(10−k),   k = 0, 1, 2, . . . , 10
The expected value and variance of Y are

E(Y ) = np = 10 × 0.75 = 7.5


Var(Y ) = np(1 − p) = 10 × 0.75 × 0.25 = 1.875
Therefore, the probability distribution of Y is binomial with parameters n = 10
and p = 0.75, and the mean and variance are E(Y ) = 7.5 and Var(Y ) = 1.875.

6.3.4 Python Code for Binomial Distribution
In Python, you can compute various characteristics of the Binomial distribution
using the ‘scipy.stats‘ module. Below is a demonstration of how to compute
these characteristics.

Python Code
Here’s how you can calculate various characteristics of a Binomial distribution:

import numpy as np
from scipy.stats import binom

# Define parameters
n = 10    # number of trials
p = 0.25  # probability of success

# Binomial distribution
dist = binom(n, p)

# 1. Probability Mass Function (PMF)
x_values = np.arange(0, n + 1)  # possible values for the random variable
pmf_values = dist.pmf(x_values)
print("PMF values for x = 0 to 10:", pmf_values)

# 2. Cumulative Distribution Function (CDF)
cdf_values = dist.cdf(x_values)
print("CDF values for x = 0 to 10:", cdf_values)

# 3. Mean (Expected Value)
mean = dist.mean()
print("Mean (Expected Value):", mean)

# 4. Variance
variance = dist.var()
print("Variance:", variance)

# 5. Probability Generating Function (PGF)
def pgf(t, n, p):
    return (1 - p + p * t) ** n

# PGF values at t = 0 and t = 1
pgf_values = [pgf(t, n, p) for t in [0, 1]]
print("PGF values for t = 0 and t = 1:", pgf_values)

Explanations
• Probability Mass Function (pmf ):
AF ■ The pmf provides the probability of each number of successes.
Compute these probabilities using dist.pmf(x_values).

• Cumulative Distribution Function (CDF):


■ The CDF provides the cumulative probability up to a certain
number of successes. Compute these values using dist.cdf(x_values).

• Mean (Expected Value):


■ The mean is given by n · p. Compute this using dist.mean().


• Variance:
■ The variance is given by n · p · (1 − p). Compute this using
dist.var().

• Probability Generating Function (PGF):


■ The PGF is given by GX (t) = (1 − p + p · t)n . Define a function
pgf(t, n, p) to compute PGF values for specific t values.

6.3.5 Exercises
1. Define the Binomial distribution. Include the conditions that must be
met for a random variable to follow a Binomial distribution.
2. Prove that the expected value (mean) of a Binomially distributed random
variable X with parameters n and p is E(X) = np. Also, prove that the
variance of X is given by Var(X) = np(1 − p).


3. Prove that the sum of the probabilities of all possible outcomes of a Bi-
nomial random variable X with parameters n and p is equal to 1. That
is, show that
Σ_{k=0}^n C(n, k) p^k (1 − p)^(n−k) = 1.

4. A fair coin is flipped 5 times. What is the probability of getting exactly


3 heads?
5. In a class, the probability that a student passes an exam is 0.8. If 15
students are randomly selected, calculate the expected number of students
who pass the exam.

T
6. Show that C(n, k) = C(n, n − k), and use this property to demonstrate
that the Binomial coefficients for a given n are symmetric around k = n/2.
AF
7. Find the moment generating function (MGF) of the Binomial distribution
and use it to compute the first two moments (mean and variance).
8. A factory produces widgets, with a 1% defect rate. If a sample of 50 wid-
gets is tested, what is the probability that at least 3 widgets are defective?
9. A medical test has a 95% sensitivity and a 90% specificity. If 10 patients
are tested who all have the disease, calculate the probability that exactly
9 patients test positive.
10. A soccer player has a 60% chance of scoring a goal in a penalty kick. If the
player takes 12 penalty kicks, find the probability that the player scores
more than 8 goals.
DR

11. In a marketing campaign, the probability of converting a lead into a cus-


tomer is 10%. If 25 leads are contacted, find the probability of converting
at most 5 leads.
12. In a clinical trial, the number of successes (patients who show improve-
ment) follows a Binomial distribution with n = 10 and p = 0.7.
(a) What is the probability of exactly 7 successes?
(b) What is the probability of at least 8 successes?
(c) Find the expected number of successes.
13. Prove the following Binomial coefficient identity
     
C(n, k) + C(n, k−1) = C(n+1, k).
Use a combinatorial argument or algebraic manipulation to justify this
identity.


6.4 Poisson Distribution


Imagine you are managing a small coffee shop. You notice that, on average, 4
customers come into the shop every hour. You are interested in understanding
how likely it is to see a specific number of customers in the shop during a given
hour. For example, what is the probability of having exactly 2 customers or
exactly 6 customers in an hour? In this scenario, the number of customers X
arriving at the coffee shop in an hour follows a Poisson distribution with
parameter λ = 4, where λ represents the average rate of customers per hour.
The pmf for a Poisson distribution is given by

P(X = k) = λ^k e^(−λ) / k!,   k = 0, 1, 2, . . .
where
• k is the number of events (the number of customers),

• λ is the average rate (4 customers per hour),


• e is the base of the natural logarithm (approximately equal to 2.71828).
In this example, we consider λ = 4.
• For exactly 2 customers (k = 2),
P(X = 2) = 4^2 e^(−4) / 2! = 8e^(−4) ≈ 0.1465
Hence, the probability of having exactly 2 customers in an hour is
approximately 0.1465.

• For exactly 6 customers (k = 6),


P(X = 6) = 4^6 e^(−4) / 6! = 4096 e^(−4) / 720 ≈ 0.1042
That is, the probability of having exactly 6 customers in an hour is
approximately 0.1042.
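Both values can be reproduced in a couple of lines with scipy.stats.poisson (a quick check of the hand calculations above):

from scipy.stats import poisson

lam = 4  # average number of customers per hour

print(poisson.pmf(2, mu=lam))  # ~0.1465, exactly 2 customers
print(poisson.pmf(6, mu=lam))  # ~0.1042, exactly 6 customers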
The Poisson distribution is a discrete probability distribution that models
the number of events occurring within a fixed interval of time or space,
given the following conditions:
(i). Each event happens independently of the others. (For example, if
we’re modeling the number of customers arriving at a coffee shop, the
number of customers arriving in one hour does not affect the number
of customers arriving in the next hour.)

(ii). The average rate (mean number of events) λ is constant over time or
space. This means that the expected number of events in any interval
of the same length is the same.


(iii). The events are relatively rare in the given interval. More specifically,
the probability of more than one event occurring in a very short interval
is negligible.

(iv). The number of events can only be whole numbers (0, 1, 2, ...). You
can’t have fractional events.

(v). The number of events is counted over a fixed interval of time or space.
The intervals are non-overlapping, meaning events in different intervals
do not influence each other.
When these conditions are met, the number of events occurring in a fixed in-

terval follows a Poisson distribution. Observe that the series expansion of
e^λ guarantees that the total probability sums to 1, as shown below:

Σ_{k=0}^∞ P(X = k) = e^(−λ) Σ_{k=0}^∞ λ^k / k! = e^(−λ) (λ^0/0! + λ^1/1! + λ^2/2! + λ^3/3! + · · ·)
                   = e^(−λ) · e^λ = 1
Additionally, for a random variable X that follows a Poisson distribution
with parameter λ (denoted X ∼ P(λ)), it holds that

E(X) = Var(X) = λ

Definition: A random variable X is said to follow a Poisson distribution


with parameter λ, written as X ∼ P(λ), if its probability mass function is
given by
P(X = k) = e^(−λ) λ^k / k!   for k = 0, 1, 2, 3, . . . .
The Poisson distribution is particularly useful for modeling the number of
occurrences of a certain event within a specified unit of time, distance, or
volume, with both its mean and variance equal to λ.

Problem 6.6. In a small coffee shop, customers arrive according to a Poisson


distribution. The average number of customers arriving in an hour is 2. You
are tasked with analyzing the pmf of this distribution, as well as comparing it
to another Poisson distribution with an average of 5 customers per hour.

(i). Calculate the pmf of a Poisson random variable with λ = 2 for integer
values of k from 0 to 10.
(ii). Generate a plot to illustrate the pmf for λ = 2.
(iii). Discuss how the pmf and cumulative distribution functions for both λ = 2
and λ = 5 compare, particularly in terms of expected value and variance.


Solution
(i). The pmf of a Poisson random variable with parameter λ = 2 is given by

P(X = k) = e^(−2) · 2^k / k!,   k = 0, 1, 2, . . .
(ii). We plot this pmf for integer values of k from 0 to 10 in Figure 6.2.


Figure 6.2: Probability Mass Function of a Poisson Distribution with λ = 2.

(iii). Figures 6.2 and 6.3 compare the probability mass functions and cumula-
tive distribution functions of Poisson distributions with parameters λ = 2 and
λ = 5. These figures demonstrate that, given that the mean and variance of
a Poisson distribution both equal the parameter value, a distribution with a
higher parameter value will have a greater expected value and exhibit a wider
spread.

The Poisson distribution is used to model the number of rare events oc-
curring within a fixed interval of time or space, under the assumption that these
events occur independently and at a constant average rate. This distribution
helps answer questions about the likelihood of observing a specific number of
events given an average rate, such as predicting the number of genetic mutations
in bacterial cultures or call arrivals in a call center.
Problem 6.7. A quality inspector at a glass manufacturing company checks
each glass sheet for imperfections. Suppose the number of flaws in each sheet



Figure 6.3: Probability Mass Function of a Poisson Distribution with λ = 5.


follows a Poisson distribution with a parameter λ = 0.5, which indicates that
the expected number of flaws per sheet is 0.5.

(a). Determine the probability that a glass sheet has no flaws.


(b). Sheets with two or more flaws are scrapped by the company. Estimate the
percentage of glass sheets that need to be scrapped and recycled.

Solution
(a). Probability of No Flaws: The probability that a glass sheet has no
flaws (X = 0) is given by

P(X = 0) = e^(−0.5) · 0.5^0 / 0! = e^(−0.5) ≈ 0.607
Thus, approximately 61% of the glass sheets are in “perfect” condition.

(b). Probability of Two or More Flaws: The probability of having two


or more flaws (X ≥ 2) can be computed as

P (X ≥ 2) = 1 − P (X = 0) − P (X = 1)

where

P(X = 1) = e^(−0.5) · 0.5^1 / 1! = e^(−0.5) · 0.5 ≈ 0.305
Therefore,

P (X ≥ 2) = 1 − e−0.5 − (e−0.5 · 0.5) ≈ 1 − 0.607 − 0.305 = 0.090


Hence, about 9% of the glass sheets have two or more flaws and need to be
scrapped.
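A quick check of both answers with scipy.stats.poisson:

from scipy.stats import poisson

lam = 0.5  # expected number of flaws per sheet

print(poisson.pmf(0, lam))  # (a) P(X = 0) ~ 0.607
print(poisson.sf(1, lam))   # (b) P(X >= 2) = P(X > 1) ~ 0.090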

Problem 6.8. A researcher is studying the occurrence of a rare genetic muta-


tion in a population of individuals. The number of individuals with the mutation
in a randomly selected sample of 100 individuals follows a Poisson distribution
with a parameter λ = 2.

(a). Calculate the probability that exactly one individual in the sample has the
mutation.

(b). Determine the probability that at least three individuals in the sample have
the mutation.
(c). Estimate the percentage of samples in which two or more individuals are
expected to have the mutation.
Solution
(a). Probability of Exactly One Individual with the Mutation
The probability that exactly one individual has the mutation is given by

P(X = 1) = e^(−2) · 2^1 / 1! = 2e^(−2) ≈ 0.2707

(b). Probability of At Least Three Individuals with the Mutation


The probability of at least three individuals having the mutation is

P (X ≥ 3) = 1 − P (X < 3) = 1 − [P (X = 0) + P (X = 1) + P (X = 2)]
DR

where

P(X = 0) = e^(−2) ≈ 0.1353
P(X = 1) = 2e^(−2) ≈ 0.2707
P(X = 2) = 2^2 · e^(−2) / 2! ≈ 0.2707
Therefore,

P (X ≥ 3) = 1 − (0.1353 + 0.2707 + 0.2707) = 1 − 0.6767 = 0.3233

(c). Percentage of Samples with Two or More Individuals Having the Mutation
To estimate the percentage of samples where two or more individuals have
the mutation:

P (X ≥ 2) = 1 − P (X < 2) = 1 − [P (X = 0) + P (X = 1)]


Thus,

P (X ≥ 2) = 1 − (0.1353 + 0.2707) = 1 − 0.4060 = 0.5940

The percentage is
0.5940 × 100% = 59.40%

6.4.1 Expected Value


The expected value E(X) of a discrete random variable X is defined as

E(X) = Σ_{x=0}^∞ x · P(X = x)
     = Σ_{x=0}^∞ x · λ^x e^(−λ) / x!
     = 0 · λ^0 e^(−λ)/0! + Σ_{x=1}^∞ x · λ^x e^(−λ)/x!
     = Σ_{x=1}^∞ λ^x e^(−λ) / (x − 1)!.

To adjust the index, let x′ = x − 1. Then, when x starts from 1, x′ starts
from 0. Rewriting the sum in terms of x′,

E(X) = e^(−λ) Σ_{x′=0}^∞ λ^(x′+1) / x′!
     = λ e^(−λ) Σ_{x′=0}^∞ λ^(x′) / x′!.

Recognize that the sum is the Taylor series expansion of e^λ,

Σ_{x′=0}^∞ λ^(x′) / x′! = e^λ

Thus,

E(X) = λ e^(−λ) · e^λ = λ


6.4.2 Variance
To find the variance, we first need to calculate E(X^2). We use

E(X^2) = Σ_{x=0}^∞ x^2 · P(X = x)
       = Σ_{x=0}^∞ x^2 · λ^x e^(−λ) / x!
       = e^(−λ) Σ_{x=0}^∞ x · λ^x / (x − 1)!
       = Σ_{x=1}^∞ x λ^x e^(−λ) / (x − 1)!.

Change the index of summation. Let x′ = x − 1. Therefore, x = x′ + 1, and
when x starts from 1, x′ starts from 0. Rewriting the sum,

E(X^2) = e^(−λ) Σ_{x′=0}^∞ (x′ + 1) λ^(x′+1) / x′!
       = e^(−λ) ( Σ_{x′=0}^∞ x′ λ^(x′+1) / x′! + Σ_{x′=0}^∞ λ^(x′+1) / x′! )
       = e^(−λ) ( λ Σ_{x′=0}^∞ x′ λ^(x′) / x′! + λ Σ_{x′=0}^∞ λ^(x′) / x′! )
       = e^(−λ) (λ · λe^λ + λe^λ)
       = λ^2 + λ.

Now, using the definition of variance,

V(X) = E(X^2) − (E(X))^2
     = (λ^2 + λ) − λ^2
     = λ.


6.4.3 Moment Generating Function


The moment generating function (MGF) M_X(t) of X is defined as

M_X(t) = E[e^(tX)] = Σ_{x=0}^∞ e^(tx) · P(X = x)
       = e^(−λ) Σ_{x=0}^∞ (e^t λ)^x / x!
       = e^(−λ) · e^(e^t λ)
       = e^(λ(e^t − 1)).

Thus, the moment generating function M_X(t) of a Poisson random variable
X with parameter λ is

M_X(t) = e^(λ(e^t − 1)).

6.4.4 Characteristic Function


The characteristic function ϕ_X(t) of X is defined as

ϕ_X(t) = E[e^(itX)] = Σ_{x=0}^∞ e^(itx) · P(X = x)
       = e^(−λ) Σ_{x=0}^∞ (e^(it) λ)^x / x!
       = e^(−λ) · e^(e^(it) λ)
       = e^(λ(e^(it) − 1)).

Thus, the characteristic function ϕ_X(t) of a Poisson random variable X with
parameter λ is

ϕ_X(t) = e^(λ(e^(it) − 1)).
ϕX (t) = eλ(e −1) .

6.4.5 Approximation of Binomial Distribution Using Poisson Distribution

In statistical theory, the Poisson distribution can be used as an approximation


to the Binomial distribution under certain conditions. Specifically, when dealing
with scenarios where the number of trials n is very large and the probability
of success p is very small, while the product np (which represents the mean of
the Binomial distribution) remains constant, the Binomial distribution can be
approximated by the Poisson distribution.
Theorem 6.1. Approximation of Binomial Distribution by Poisson
Distribution: Let X be a Binomial random variable with probability distri-
bution B(x; n, p). When n → ∞, p → 0, and np → λ remains constant, the

253
CHAPTER 6. SOME DISCRETE PROBABILITY DISTRIBUTIONS

Binomial distribution B(x; n, p) converges to the Poisson distribution P(x; λ) as
n → ∞. That is,

B(x; n, p) → P(x; λ)
n→∞
B(x; n, p) −−−−→ P (x; λ)
where,
λ = np
Problem 6.9. A box contains 500 electrical switches, each with a probability of
0.005 of being defective. What is the probability that the box contains no more
than 3 defective switches?
(i) Use the Binomial Distribution.

(ii) Use the Poisson Distribution to make an approximation.

(iii) Compare the results.

Solution:
AF
Given, total number of switches n = 500, and probability of a switch being
defective p = 0.005. Let a random variable X represents the number of defective
switches. We are required to find P (X ≤ 3), i.e., the probability that the
number of defective switches is no more than 3.

(i). Using Binomial Distribution: The probability mass function for the
Binomial Distribution is given by

\[
P(X = k) = \binom{500}{k} (0.005)^k (1 - 0.005)^{500-k} \quad \text{for } k = 0, 1, 2, \ldots, 500.
\]

We need to calculate P(X ≤ 3).

\[
P(X \le 3) = P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3)
\]

We have

\[
P(X = 0) = \binom{500}{0} (0.005)^0 (0.995)^{500} \approx 0.08157,
\qquad
P(X = 1) = \binom{500}{1} (0.005)^1 (0.995)^{499} \approx 0.20495,
\]
\[
P(X = 2) = \binom{500}{2} (0.005)^2 (0.995)^{498} \approx 0.25697,
\qquad
P(X = 3) = \binom{500}{3} (0.005)^3 (0.995)^{497} \approx 0.21435.
\]
Hence,


P (X ≤ 3) ≈ 0.08157 + 0.20495 + 0.25697 + 0.21435 = 0.75784


Thus, the probability that the box contains no more than 3 defective switches
using the binomial distribution is approximately 0.7578, or 75.78%.

(ii). Using Poisson Distribution (Approximation) For n = 500 and


p = 0.005, the parameter λ for the Poisson distribution is
λ = np = 500 × 0.005 = 2.5
The probability that the number of defective switches X is at most 3 is

\[
P(X \le 3) = P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3),
\]

where X follows a Poisson distribution with parameter λ = 2.5. The probability
mass function of the Poisson distribution is

\[
P(X = k) = \frac{e^{-2.5}\, 2.5^k}{k!} \quad \text{for } k = 0, 1, 2, \ldots
\]

Therefore,

\[
P(X = 0) = \frac{e^{-2.5} \cdot 2.5^0}{0!} = e^{-2.5} \approx 0.08208,
\qquad
P(X = 1) = \frac{e^{-2.5} \cdot 2.5^1}{1!} = 2.5 e^{-2.5} \approx 0.20521,
\]
\[
P(X = 2) = \frac{e^{-2.5} \cdot 2.5^2}{2!} \approx 0.25652,
\qquad
P(X = 3) = \frac{e^{-2.5} \cdot 2.5^3}{3!} \approx 0.21376.
\]

Hence,

\[
P(X \le 3) = 0.08208 + 0.20521 + 0.25652 + 0.21376 \approx 0.7576.
\]
The probability that the box contains no more than 3 defective switches is
approximately 0.7576 or 75.76%.

(iii). Comparison: From Part (i) and Part (ii), we have

• Binomial result: 75.78%

• Poisson result: 75.76%

The Poisson approximation is extremely close to the exact binomial result,
differing by only 0.02 percentage points, while requiring far less computation.
This demonstrates its effectiveness for modeling rare events when n is large
and p is small.
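
The comparison can be reproduced in a few lines with scipy.stats, using the cumulative distribution functions of both models:

from scipy.stats import binom, poisson

n, p = 500, 0.005
lam = n * p  # 2.5

print(binom.cdf(3, n, p))    # exact binomial P(X <= 3), approximately 0.7578
print(poisson.cdf(3, lam))   # Poisson approximation, approximately 0.7576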


6.4.6 Python Code for Poisson Distribution


The Poisson distribution models the number of occurrences of an event in a
fixed interval of time or space, given a constant mean rate of occurrence. In
Python, you can calculate various characteristics of the Poisson distribution
using the scipy.stats module. Below is a demonstration of how to compute
these characteristics.

Python Code
Here's how you can calculate various characteristics of the Poisson distribution:

import numpy as np
from scipy.stats import poisson

# Define the parameter lambda (mean rate of occurrence)
lambda_ = 4

# Poisson distribution
dist = poisson(mu=lambda_)

# 1. Probability Mass Function (PMF)
x_values = np.arange(0, 11)  # Possible values for the random variable
pmf_values = dist.pmf(x_values)
print("PMF values for x = 0 to 10:", pmf_values)

# 2. Cumulative Distribution Function (CDF)
cdf_values = dist.cdf(x_values)
print("CDF values for x = 0 to 10:", cdf_values)

# 3. Mean (Expected Value)
mean = dist.mean()
print("Mean (Expected Value):", mean)

# 4. Variance
variance = dist.var()
print("Variance:", variance)

# 5. Probability Generating Function (PGF)
# For a Poisson random variable, G_X(t) = E[t^X] = exp(lambda * (t - 1)).
def pgf(t, lambda_):
    return np.exp(lambda_ * (t - 1))

# PGF values at t = 0 and t = 1
# (sanity checks: pgf(0) = P(X = 0) = e^{-lambda}, and pgf(1) = 1)
pgf_values = [pgf(t, lambda_) for t in [0, 1]]
print("PGF values for t = 0 and t = 1:", pgf_values)


Explanations

• Probability Mass Function (PMF):

  ■ The PMF provides the probability of observing a certain number
    of events. Compute these probabilities using dist.pmf(x_values).

• Cumulative Distribution Function (CDF):

  ■ The CDF provides the cumulative probability up to each number
    of events. Compute these cumulative probabilities using
    dist.cdf(x_values).

• Mean (Expected Value):

  ■ The mean of a Poisson distribution is λ. Compute this using
    dist.mean().

• Variance:

  ■ The variance of a Poisson distribution is also λ. Compute this
    using dist.var().

• Probability Generating Function (PGF):

  ■ The PGF of a Poisson random variable is G_X(t) = E[t^X] =
    exp(λ(t − 1)). Define a function pgf(t, lambda_) to compute
    PGF values for specific t values (e.g., 0 and 1).

6.4.7 Exercises
1. The number of patients arriving at a clinic follows a Poisson distribution

with a mean of 3 patients per hour.


(a) What is the probability that exactly 5 patients will arrive in an hour?
(b) What is the probability that at most 2 patients will arrive in an
hour?
(c) What is the expected number of patients arriving in 3 hours?
2. A call center receives an average of 10 calls per hour. Assume the number
of calls follows a Poisson distribution.
(a) What is the probability that the call center receives exactly 12 calls
in an hour?
(b) Calculate the probability of receiving fewer than 5 calls in an hour.
(c) Determine the probability of receiving more than 15 calls in an hour.
3. On a particular stretch of highway, the number of traffic accidents follows
a Poisson distribution with an average rate of 3 accidents per month.


(a) What is the probability of exactly 2 accidents occurring in a month?


(b) Find the probability that there will be no accidents in a given month.
(c) Calculate the probability of having 4 or more accidents in a month.
4. A rare disease affects an average of 0.2 patients per 1000 individuals in a
population. Assume the number of affected individuals follows a Poisson
distribution.
(a) What is the probability of finding exactly 1 patient with the disease
in a sample of 1000 individuals?
(b) Determine the probability of finding no patients with the disease in

a sample of 1000 individuals.
(c) Find the probability of discovering 2 or more patients with the dis-
ease in a sample of 1000 individuals.
5. An employee receives an average of 8 emails per day. Assume the number
of emails follows a Poisson distribution.
AF (a) What is the probability of receiving exactly 10 emails in a day?
(b) Calculate the probability of receiving fewer than 6 emails in a day.
(c) Determine the probability of receiving more than 12 emails in a day.
6. A retail store has an average of 20 customers arriving per hour. The
number of customer arrivals follows a Poisson distribution.
(a) What is the probability that exactly 25 customers will arrive in an
hour?
(b) Find the probability that fewer than 15 customers arrive in an hour.

(c) Calculate the probability of having 30 or more customers in an hour.

7. A factory produces an average of 2 defective items per 1000 items pro-


duced. Assume the number of defective items follows a Poisson distribu-
tion.
(a) What is the probability of finding exactly 3 defective items in a batch
of 1000 items?
(b) Determine the probability of finding no defective items in a batch of
1000 items.
(c) Calculate the probability of finding 1 or more defective items in a
batch of 1000 items.

8. In a laboratory experiment, rare events occur at an average rate of 0.4


events per hour. Assume these events follow a Poisson distribution.
(a) What is the probability of observing exactly 2 events in an hour?


(b) Find the probability of observing no events in an hour.


(c) Determine the probability of observing more than 1 event in an hour.
9. A software program encounters an average of 5 errors per week. Assume
the number of errors follows a Poisson distribution.

(a) What is the probability of encountering exactly 7 errors in a week?


(b) Calculate the probability of encountering fewer than 4 errors in a
week.
(c) Find the probability of encountering 8 or more errors in a week.

10. A library has an average of 12 book checkouts per day. The number of
checkouts follows a Poisson distribution.
(a) What is the probability of exactly 10 book checkouts in a day?
(b) Determine the probability of having 15 or more book checkouts in a
day.
AF (c) Calculate the probability of having fewer than 8 book checkouts in
a day.
11. In a small town, a rare disease has an average occurrence rate of 1 case
per month. Assume the number of cases follows a Poisson distribution.
(a) What is the probability of having exactly 2 cases of the disease in a
month?
(b) Find the probability of having no cases of the disease in a month.
(c) Calculate the probability of having at least 1 case of the disease in
a month.

12. In a certain industrial facility, accidents occur infrequently. It is known


that the probability of an accident on any given day is 0.005 and accidents
are independent of each other.

(a) What is the probability that in any given period of 400 days there
will be an accident on exactly one day?
(b) What is the probability that there are at most three days with an
accident?

13. In a manufacturing process where glass products are made, defects or


bubbles occur, occasionally rendering the piece undesirable for marketing.
It is known that, on average, 1 in every 1000 of these items produced has
one or more bubbles. What is the probability that a random sample of
8000 will yield fewer than 7 items possessing bubbles?


6.4.8 Discrete Uniform Distribution


There are two types of uniform distributions: discrete uniform distribution and
continuous uniform distribution. The discrete uniform distribution plays a
crucial role in various domains, particularly in statistics and data science. At
its core, the discrete uniform distribution describes a situation where a finite
number of outcomes are equally likely to occur. This property makes it an
essential model for understanding fundamental concepts in probability theory.

In the realm of simulations, particularly Monte Carlo simulations, the


discrete uniform distribution is employed to generate random inputs. This is

vital for modeling complex systems and assessing the behavior of different sce-
narios. Similarly, in A/B testing, where subjects are randomly assigned to dif-
ferent groups, the assumption of a uniform distribution ensures that the groups
are comparable, thereby enhancing the validity of the test results.

In a discrete uniform distribution, a finite number of outcomes are possible,


and each outcome has an equal probability of occurring. For example, if you
roll a fair six-sided die, the probability of each face showing up (1 through 6)
is the same:

\[
P(X = x) = \frac{1}{6}; \quad x = 1, 2, \ldots, 6.
\]
Definition: A random variable X is said to follow a discrete uniform
distribution on the set of integers {x_1, x_2, . . . , x_n}, denoted
X ∼ U(x_1, x_n), if its pmf is given by

\[
P(X = x) = \frac{1}{n}; \quad x \in \{x_1, x_2, \ldots, x_n\},
\]

where n is the total number of outcomes.

If the lower bound of X is zero and the upper bound is n, then the pmf
is given by

\[
P(X = x) = \frac{1}{n+1}; \quad x = 0, 1, 2, \ldots, n.
\]

The discrete uniform distribution finds applications in games of chance. Un-


derstanding the probabilities associated with games like lotteries or card games
helps players develop strategies and assess risks. Beyond these practical appli-
cations, the discrete uniform distribution is foundational for other probability
distributions. Many advanced statistical methods and techniques build upon
the principles established by the uniform distribution, making it a key compo-
nent of statistical inference.


Theorem 6.2. Let X be a random variable that follows a discrete uniform
distribution on the set {0, 1, 2, . . . , n}. Then the mean µ and variance σ² of X
are given by

\[
\mu = \frac{n}{2} \quad \text{and} \quad \sigma^2 = \frac{n(n+2)}{12}.
\]

Proof. Mean: The mean µ of X is calculated as follows:

\[
\mu = E[X] = \sum_{x=0}^{n} x \cdot P(X = x) = \frac{1}{n+1} \sum_{x=0}^{n} x.
\]

Using the formula for the sum of the first n integers, we have

\[
\sum_{x=0}^{n} x = \frac{n(n+1)}{2}.
\]

Substituting this back into the equation for µ,

\[
\mu = \frac{1}{n+1} \cdot \frac{n(n+1)}{2} = \frac{n}{2}.
\]

Variance: The variance σ² is calculated using the formula

\[
\sigma^2 = E[X^2] - (E[X])^2.
\]

First, we compute E[X²]:

\[
E[X^2] = \sum_{x=0}^{n} x^2 \cdot P(X = x) = \frac{1}{n+1} \sum_{x=0}^{n} x^2.
\]

Using the formula for the sum of the squares of the first n integers, we have

\[
\sum_{x=0}^{n} x^2 = \frac{n(n+1)(2n+1)}{6}.
\]

Substituting this into the equation for E[X²],

\[
E[X^2] = \frac{1}{n+1} \cdot \frac{n(n+1)(2n+1)}{6} = \frac{n(2n+1)}{6}.
\]

Now, we can compute the variance:

\[
\sigma^2 = \frac{n(2n+1)}{6} - \left(\frac{n}{2}\right)^2
         = \frac{2n(2n+1) - 3n^2}{12} = \frac{4n^2 + 2n - 3n^2}{12}
         = \frac{n^2 + 2n}{12} = \frac{n(n+2)}{12}.
\]

Equivalently, with N = n + 1 equally likely values, σ² = (N² − 1)/12.
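
A short numeric check of Theorem 6.2, computing the mean and variance directly from the pmf (n = 10 is an arbitrary choice):

import numpy as np

n = 10                             # arbitrary upper bound of the support {0, 1, ..., n}
x = np.arange(0, n + 1)
pmf = np.full(n + 1, 1 / (n + 1))  # each outcome has probability 1/(n+1)

mean = np.sum(x * pmf)
var = np.sum(x**2 * pmf) - mean**2
print(mean, n / 2)                 # both print 5.0
print(var, n * (n + 2) / 12)       # both print 10.0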

Problem 6.10. A company has a database of 100 customer IDs ranging from
1 to 100. The marketing team wants to select a random sample of 10 customer
IDs to send a promotional offer. Each customer ID should have an equal chance
of being selected.

(a). What is the probability of selecting any specific customer ID?

(b). If the marketing team selects 10 IDs, what is the expected number of times
a specific customer ID (e.g., ID 5) will be included in the sample?

Solution
(a). Probability of Selecting Any Specific Customer ID
Since the customer IDs are uniformly distributed, the probability mass function
(pmf) is

\[
P(X = x) = \frac{1}{n}; \quad x = 1, 2, \ldots, 100.
\]

Here, n = 100. Thus, the probability of selecting any specific customer ID
(say ID 5) is

\[
P(X = 5) = \frac{1}{100} = 0.01.
\]
(b). Expected Number of Times a Specific Customer ID Will Be
Included
When selecting 10 customer IDs from the total of 100, the probability of
selecting a specific customer ID (like ID 5) in one draw is 1/100.

To find the expected number of times a specific customer ID (ID 5) will be
included in the sample of 10 IDs, we can use the formula for the expected value

\[
E[X] = n \cdot p,
\]

where n is the number of trials (in this case, the number of IDs selected) and p
is the probability of success (selecting ID 5). Here, n = 10 and p = 1/100, so

\[
E[X] = 10 \cdot \frac{1}{100} = 0.1.
\]


Problem 6.11. A school is organizing a raffle with 50 tickets numbered from


1 to 50. Each ticket has an equal chance of being drawn.

(a). What is the probability of drawing any specific ticket number?


(b). If the raffle will draw 5 tickets, what is the expected number of times a
specific ticket (e.g., ticket number 10) will be drawn?

Solution
(a). Probability of Drawing Any Specific Ticket Number
Since the tickets are uniformly distributed, the probability mass function (pmf)
is given by

\[
P(X = x) = \frac{1}{n}; \quad x = 1, 2, \ldots, 50.
\]

Here, n = 50. Thus, the probability of drawing any specific ticket number (e.g.,
ticket number 10) is

\[
P(X = 10) = \frac{1}{50} = 0.02.
\]
(b). Expected Number of Times a Specific Ticket Will Be Drawn
When drawing 5 tickets from a total of 50, the probability of drawing a specific
ticket (like ticket number 10) in one draw is 1/50.

To find the expected number of times a specific ticket (ticket number 10)
will be drawn in the sample of 5 tickets, we can use the expected value formula

\[
E[X] = n \cdot p,
\]

where n is the number of draws (in this case, the number of tickets drawn)
and p is the probability of success (drawing ticket number 10). Here, n = 5
and p = 1/50, so

\[
E[X] = 5 \cdot \frac{1}{50} = 0.1.
\]

Discrete Uniform Distribution

import numpy as np

# Define the Discrete Uniform parameters
a_discrete = 1
b_discrete = 10

def pmf_discrete(x, a, b):
    if a <= x <= b:
        return 1 / (b - a + 1)
    else:
        return 0

def cdf_discrete(x, a, b):
    if x < a:
        return 0
    elif a <= x <= b:
        return (x - a + 1) / (b - a + 1)
    else:
        return 1

# PMF values
x_discrete_values = np.arange(a_discrete, b_discrete + 1)
pmf_values_discrete = [pmf_discrete(x, a_discrete, b_discrete) for x in x_discrete_values]
print("PMF values for x from {} to {}:".format(a_discrete, b_discrete), pmf_values_discrete)

# CDF values
cdf_values_discrete = [cdf_discrete(x, a_discrete, b_discrete) for x in x_discrete_values]
print("CDF values for x from {} to {}:".format(a_discrete, b_discrete), cdf_values_discrete)

# Mean (Expected Value)
mean_discrete = (a_discrete + b_discrete) / 2
print("Mean (Expected Value):", mean_discrete)

# Variance
variance_discrete = ((b_discrete - a_discrete + 1) ** 2 - 1) / 12
print("Variance:", variance_discrete)

# Standard Deviation
std_dev_discrete = np.sqrt(variance_discrete)
print("Standard Deviation:", std_dev_discrete)

6.4.9 Exercises
1. A box contains 20 different colored balls, numbered from 1 to 20. If a ball
is drawn at random, what is the probability of drawing ball number 15?
2. A survey is conducted with 30 participants, each assigned a unique ID
from 1 to 30. If 5 IDs are randomly selected for a follow-up interview,
what is the expected number of times a specific ID (e.g., ID 12) will be
selected?


3. A raffle has 100 tickets numbered from 1 to 100. If the raffle draws 10
tickets, what is the probability that ticket number 25 will be drawn at
least once?
4. In a game, a player rolls a fair six-sided die. What is the probability of
rolling a specific number, say 4? Additionally, if the player rolls the die 10
times, what is the expected number of times the number 4 will appear?
5. A classroom has 15 students, each assigned a number from 1 to 15. If the
teacher randomly selects 3 students for a project, what is the probability
that student number 7 is selected?
6. A bag contains 50 different coupons numbered from 1 to 50. If 5 coupons

are drawn randomly, what is the expected number of times coupon number
30 will be drawn?
7. A committee consists of 12 members, each assigned a number from 1 to
12. If 4 members are randomly chosen to form a subcommittee, what is
the probability that member number 6 is included in the selection?
8. A local lottery involves selecting 6 numbers from a set of 1 to 49. What
is the probability that the number 7 is chosen in a single drawing? If you
play the lottery 10 times, what is the expected number of times you will
have number 7 in your selected numbers?
9. In a game show, contestants choose a number from 1 to 100. If a contes-
tant has chosen number 45, what is the probability that this number is
drawn if the game draws 5 numbers randomly without replacement?

6.5 Concluding Remarks



In this chapter, we have examined key discrete probability distributions that


are integral to data science: the Bernoulli, Binomial, and Poisson distributions.
Understanding these distributions equips data scientists with powerful tools for
analyzing categorical data and modeling discrete events.

Mastering these discrete distributions enhances our ability to model and


interpret data effectively, leading to more accurate predictions and insights. As
we transition to continuous probability distributions in the next chapter, we
will expand our analytical toolkit to handle a broader range of data types and
modeling scenarios.

6.6 Chapter Exercises


1. A coin is flipped once. Define the random variable X as the outcome of
the coin flip (1 for heads, 0 for tails). Calculate the mean and variance of
X.


2. Suppose a factory produces light bulbs, and each bulb has a 90% prob-
ability of being functional. If you randomly select one bulb, what is the
probability that it is functional?
3. A basketball player has a free throw success rate of 75%. If she takes 10
free throws, what is the probability that she makes exactly 8 of them?
(Use the binomial probability formula.)
4. In a survey, it is found that 60% of people prefer coffee over tea. If you
randomly sample 15 people, what is the probability that exactly 9 prefer
coffee?

5. A call center receives an average of 4 calls per hour. What is the proba-
bility that they receive exactly 6 calls in the next hour?
6. In a certain city, the average number of accidents at a particular inter-
section is 2 per month. What is the probability that there will be no
accidents in the next month?
7. A six-sided die is rolled. Define the random variable Y as the outcome of
the roll. Calculate the mean and variance of Y .
8. A spinner is divided into 8 equal sections numbered from 1 to 8. What
is the probability of landing on an even number when the spinner is spun
once?

9. In a quality control process, the probability of finding a defective product


is 0.02. If a batch contains 100 products, calculate the probability of
finding exactly 3 defective products.
10. An average of 3 earthquakes occur in a year in a certain region. Using

the Poisson distribution, calculate the probability of experiencing at least


one earthquake in the next year.
11. A game involves rolling a die and flipping a coin. If the die shows a 3 or a
4, you win a prize with a probability of 0.5 (coin shows heads). If the die
shows any other number, you don’t win. What is the overall probability
of winning a prize?

Chapter 7

Some Continuous Probability Distributions

7.1 Introduction
In probability theory and statistics, continuous probability distributions play
a fundamental role in modeling and analyzing real-world phenomena. Unlike
discrete distributions, which are defined for countable outcomes, continuous
distributions are used to describe outcomes that can take on any value within
a given range. This chapter delves into some of the most widely used continu-
ous probability distributions, including the Uniform, Exponential, and Normal
distributions.

Continuous probability distributions are integral to various fields, such as



engineering, economics, and the natural sciences, due to their ability to repre-
sent diverse processes and events accurately. Understanding these distributions
enables us to calculate probabilities and make inferences about populations
based on sample data.

We begin with the Uniform distribution, which serves as a simple model for
random variables that have equally likely outcomes over a specific interval. Fol-
lowing this, we explore the Exponential distribution, commonly used to model
the time between events in a Poisson process. We then delve into the Normal
distribution, arguably the most important distribution in statistics, due to the
Central Limit Theorem’s implication that it approximates many natural phe-
nomena.

Each section will provide a detailed definition of the distribution, its proper-
ties, and practical examples to illustrate its application. Additionally, exercises
are included to reinforce the concepts and allow for hands-on practice in calcu-


lating probabilities and understanding the distribution’s behavior.

7.2 Continuous Uniform Distribution


The continuous uniform distribution is a probability distribution that describes
an outcome where all values within a specified range are equally likely to occur.
This distribution is defined over an interval [a, b], where a is the minimum value
and b is the maximum value. In contrast to discrete distributions, which deal
with distinct outcomes, the continuous uniform distribution addresses scenarios
where the outcomes can take on any value within the range.

The continuous uniform distribution is commonly used in simulations, ran-
dom sampling, and scenarios where a uniform distribution of outcomes is as-
sumed, such as in generating random numbers or modeling processes where
each outcome within a specified range is equally probable. Its simplicity and
intuitive nature make it a fundamental concept in statistics and probability
theory.
Definition: A random variable X is said to follow a continuous uniform
distribution on the interval [a, b], denoted X ∼ U(a, b), if its probability
density function (pdf) is

\[
f(x) = \begin{cases} \dfrac{1}{b-a} & \text{for } a \le x \le b, \\[4pt] 0 & \text{otherwise.} \end{cases}
\]

The plot of X ∼ U(a, b), where X is uniformly distributed between a and b,
is presented in Figure 7.1.

[Figure: constant density f(x) = 1/(b − a) over the interval [a, b].]

Figure 7.1: The plot of X ∼ U(a, b).


7.2.1 Distributional Properties


Expected Value (Mean)
The expected value E(X) is given by

\[
E(X) = \int_{-\infty}^{\infty} x f_X(x)\,dx = \int_a^b x \cdot \frac{1}{b-a}\,dx
     = \frac{1}{b-a}\left[\frac{x^2}{2}\right]_a^b
     = \frac{(b-a)(b+a)}{2(b-a)} = \frac{b+a}{2}.
\]

Hence,

\[
\text{Mean} = E(X) = \frac{a+b}{2}.
\]

Variance and Standard Deviation


Expected Value: We already have the expected value E(X) for a uniform
random variable X over [a, b]:

\[
E(X) = \frac{a+b}{2}.
\]

Expected Value of X²: To find the variance, we first need E(X²). This is
given by

\[
E(X^2) = \int_a^b x^2 f_X(x)\,dx = \frac{1}{b-a}\int_a^b x^2\,dx
       = \frac{1}{b-a}\left[\frac{x^3}{3}\right]_a^b
       = \frac{b^3 - a^3}{3(b-a)}
       = \frac{b^2 + ab + a^2}{3}.
\]

Hence, the variance is

\[
\operatorname{Var}(X) = \frac{b^2 + ab + a^2}{3} - \left(\frac{a+b}{2}\right)^2
                      = \frac{4(b^2 + ab + a^2) - 3(a^2 + 2ab + b^2)}{12}
                      = \frac{(b-a)^2}{12},
\]

and the standard deviation is

\[
\text{Standard Deviation} = \frac{b-a}{\sqrt{12}}.
\]
Problem 7.1. Suppose the time (in minutes) it takes for a customer to be
served at a coffee shop follows a continuous uniform distribution between 5 and
15 minutes.

(a). What is the probability that a customer is served within 10 minutes?


(b). What is the expected time for a customer to be served?

FT
(c). What is the variance of the time for a customer to be served?

Solution
(a). Probability that a customer is served within 10 minutes:
Let X be the time taken to be served, and X ∼ Uniform(5, 15). To find
P(X ≤ 10), note that the pdf of X is given by

\[
f(x) = \frac{1}{15-5} = \frac{1}{10} \quad \text{for } 5 \le x \le 15.
\]

The probability is given by the integral of the pdf from 5 to 10:

\[
P(X \le 10) = \int_5^{10} \frac{1}{10}\,dx = \frac{10-5}{10} = 0.5.
\]

(b). Expected time:


The expected value for a continuous uniform distribution X ∼ Uniform(a, b) is

\[
E(X) = \frac{a+b}{2}.
\]

Here, a = 5 and b = 15, so

\[
E(X) = \frac{5 + 15}{2} = 10 \text{ minutes.}
\]

(c). Variance of the time:


The variance for a continuous uniform distribution X ∼ Uniform(a, b) is

\[
\operatorname{Var}(X) = \frac{(b-a)^2}{12}.
\]

Here, a = 5 and b = 15, so the variance is

\[
\operatorname{Var}(X) = \frac{(15-5)^2}{12} = \frac{100}{12} \approx 8.33.
\]
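
The three answers can be verified with scipy.stats.uniform; note that scipy parametrizes the distribution by loc (the lower bound) and scale (the width of the interval):

from scipy.stats import uniform

dist = uniform(loc=5, scale=10)  # service time ~ U(5, 15)

print(dist.cdf(10))  # P(X <= 10) = 0.5
print(dist.mean())   # 10.0 minutes
print(dist.var())    # 100/12, approximately 8.33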


Problem 7.2. Suppose you are designing a random number generator that
outputs a number between 1 and 100. Each number in this range is equally
likely to be selected.
(a). What is the probability that the generator outputs a number between 20
and 50?

(b). Determine the mean and variance of the numbers generated by this
random number generator.

(c). If you generate 10,000 numbers, what is the expected number of times

a number between 20 and 50 is generated?

Solution
(a). Probability Calculation
Since the numbers generated are uniformly distributed between 1 and
100, we can model this using a continuous uniform distribution U(a, b)
with a = 1 and b = 100.

The probability density function (pdf) is

\[
f(x) = \begin{cases} \dfrac{1}{b-a} & \text{for } a \le x \le b, \\[4pt] 0 & \text{otherwise.} \end{cases}
\]

Here, f(x) = 1/(100 − 1) = 1/99 for 1 ≤ x ≤ 100.

To find the probability that the generator outputs a number between
20 and 50, we calculate

\[
P(20 \le X \le 50) = \int_{20}^{50} f(x)\,dx = \frac{1}{99}\int_{20}^{50} dx = \frac{50-20}{99} \approx 0.303.
\]

(b). Mean and Variance
For a uniform distribution U(a, b),

\[
\text{Mean} = \mu = \frac{a+b}{2} = \frac{1 + 100}{2} = 50.5,
\]
\[
\text{Variance} = \sigma^2 = \frac{(b-a)^2}{12} = \frac{(100-1)^2}{12} = \frac{99^2}{12} = \frac{9801}{12} \approx 816.75.
\]

(c). Expected Number of Times a Number Between 20 and 50 is
Generated
The expected number of times a number between 20 and 50 is generated
out of 10,000 numbers can be found by multiplying the probability by
the total number of trials:

\[
E(\text{number of times}) = P(20 \le X \le 50) \times 10000 = 0.303 \times 10000 = 3030.
\]

Thus, we expect the number generator to output a number between 20


and 50 approximately 3030 times out of 10,000 trials.

7.2.2 Python Code for Uniform Distribution Characteristics
The Uniform distribution models scenarios where all outcomes are equally likely
within a given interval. Below is Python code demonstrating how to compute
various characteristics for both Continuous and Discrete Uniform distributions.

Continuous Uniform Distribution

import numpy as np
from scipy.stats import uniform

# Define the parameters
a = 0   # Lower bound
b = 10  # Upper bound

# Define the Continuous Uniform distribution
dist_continuous = uniform(loc=a, scale=b - a)

# 1. Probability Density Function (PDF)
x_values = np.linspace(a, b, 100)
pdf_values = dist_continuous.pdf(x_values)
print("PDF values for x from {} to {}:".format(a, b), pdf_values)

# 2. Cumulative Distribution Function (CDF)
cdf_values = dist_continuous.cdf(x_values)
print("CDF values for x from {} to {}:".format(a, b), cdf_values)

# 3. Mean (Expected Value)
mean_continuous = dist_continuous.mean()
print("Mean (Expected Value):", mean_continuous)

# 4. Variance
variance_continuous = dist_continuous.var()
print("Variance:", variance_continuous)

# 5. Standard Deviation
std_dev_continuous = dist_continuous.std()
print("Standard Deviation:", std_dev_continuous)

# 6. Quantiles
quantiles_continuous = dist_continuous.ppf([0.25, 0.5, 0.75])  # 25th, 50th (median), and 75th percentiles
print("Quantiles at 0.25, 0.5, and 0.75:", quantiles_continuous)

# 7. Percentiles
percentiles_continuous = dist_continuous.ppf([0.1, 0.9])  # 10th and 90th percentiles
print("Percentiles at 0.1 and 0.9:", percentiles_continuous)

7.2.3 Exercises
1. Consider a discrete uniform distribution where X takes values {1, 2, 3, . . . , n}.
(i) Show that the expected value E(X) is (n + 1)/2, and (ii) find the variance
of X.

2. Find the cumulative distribution function (cdf) of a continuous uniform



random variable X ∼ U (a, b).


3. The cholesterol level of adults in a certain region follows a uniform distri-
bution between 150 mg/dL and 250 mg/dL.
4. If X ∼ U (0, 1), find the distribution of Y = a + (b − a)X.

5. Suppose the waiting time for a bus is uniformly distributed between 0 and
30 minutes. What is the probability that a person will wait more than 20
minutes?
6. A factory produces items with weights that are uniformly distributed
between 50 grams and 150 grams. What is the probability that a randomly
chosen item weighs between 80 grams and 120 grams?
7. The cholesterol level of adults in a certain region follows a uniform distri-
bution between 150 mg/dL and 250 mg/dL.


(a) Write the probability density function f (x) of the cholesterol level.
(b) What is the probability that a randomly selected adult has a choles-
terol level between 180 mg/dL and 220 mg/dL?
(c) Find the mean and variance of the cholesterol level.

8. A machine produces metal rods that are uniformly distributed in length


between 98 cm and 102 cm. What is the probability that a randomly
selected rod is between 99 cm and 101 cm in length?
9. In a quality control process, the time to inspect a product is uniformly
distributed between 1 and 5 minutes. Find the probability that the in-

spection time for a randomly chosen product is more than 4 minutes.
10. The download speed of a certain internet connection is uniformly dis-
tributed between 10 Mbps and 100 Mbps. What is the probability that
the download speed at any given time is less than 50 Mbps?
11. The delivery time for a package from a warehouse to a customer is uni-
formly distributed between 2 and 7 days. What is the probability that a
package will be delivered in less than 4 days?
12. The time a wildlife photographer waits to see a particular bird is uniformly
distributed between 30 minutes and 3 hours. Find the probability that
the wait time is more than 2 hours.

13. The fuel efficiency of a car is uniformly distributed between 15 and 25


miles per gallon. What is the probability that the car’s fuel efficiency is
between 18 and 22 miles per gallon?

7.3 Exponential Distribution



Imagine you are at a busy city street waiting for a taxi. Taxis arrive at the
street randomly but at an average rate of 5 taxis per hour. You’re curious about
how long you might have to wait for the next taxi. Is there a way to predict or
understand the waiting time better?

You notice that sometimes the wait is short, and other times it can be quite
long. This variability and randomness in waiting times suggest a need for a
mathematical model to describe it.

The exponential distribution is a probability distribution used to model


the time between successive events in a process where events occur continuously
and independently at a constant average rate. It is characterized by its rate
parameter λ, which is the reciprocal of the average time between events. It is
widely applied in various fields, including data science, to analyze the duration


or time between events. In our taxi example, it helps us understand and pre-
dict the waiting time until the next taxi arrives. The exponential distribution
provides a framework to calculate probabilities and make informed decisions
based on the average rate of arrivals.

Definition: The continuous random variable X is said to have an
exponential distribution if it has the following probability density function:

\[
f(x \mid \lambda) = \begin{cases} \lambda e^{-\lambda x} & \text{for } x \ge 0, \\ 0 & \text{for } x < 0, \end{cases}
\tag{7.1}
\]

where λ > 0 is the rate parameter.

The density plot of X is presented in Figure 7.2 for the value λ = 0.4.

[Figure: the exponential density f(x) = 0.4e^{-0.4x}, decreasing from 0.4 at x = 0.]

Figure 7.2: The probability density function f(x) = 0.4e^{-0.4x}.

For example, the probability P(1 ≤ X ≤ 2) is the area under the probability
density function between the points a = 1 and b = 2, as illustrated in Figure 7.3.

[Figure: shaded region P(a ≤ X ≤ b) = ∫_a^b f(x) dx under the density curve between a and b.]

Figure 7.3: The area under the probability density function f(x) between a and b.
The cdf is obtained by integrating the pdf:

\[
F(x) = \int_0^x \lambda e^{-\lambda t}\,dt
     = \left[-e^{-\lambda t}\right]_0^x
     = -e^{-\lambda x} + e^0
     = 1 - e^{-\lambda x}.
\]

Thus, the cdf F(x) for the exponential distribution is

\[
F(x) = \begin{cases} 0 & x < 0, \\ 1 - e^{-\lambda x} & x \ge 0. \end{cases}
\]

The CDF of the exponential distribution for λ = 0.4 is F(x) = 1 − e^{−0.4x},
and its graph is presented in Figure 7.4.

[Figure: the curve F(x) = 1 − e^{−0.4x}, rising from 0 toward 1 for x in [0, 10].]

Figure 7.4: CDF of the Exponential Distribution with λ = 0.4.


Theorem 7.1. Let X be a random variable that follows an exponential
distribution with parameter λ. The rth moment, defined as E[X^r], is given by

\[
E[X^r] = \frac{r!}{\lambda^r}.
\]

Proof. The probability density function (PDF) of the exponential distribution
is given by

\[
f(x) = \lambda e^{-\lambda x}, \quad x \ge 0.
\]

To find E[X^r], we compute

\[
E[X^r] = \int_0^{\infty} x^r f(x)\,dx = \int_0^{\infty} x^r \lambda e^{-\lambda x}\,dx.
\]

Using the substitution u = λx (hence x = u/λ and dx = du/λ), we have

\[
E[X^r] = \int_0^{\infty} \left(\frac{u}{\lambda}\right)^r \lambda e^{-u}\,\frac{du}{\lambda}
       = \frac{\lambda}{\lambda^{r+1}} \int_0^{\infty} u^r e^{-u}\,du.
\]

The integral \(\int_0^{\infty} u^r e^{-u}\,du\) is the definition of the gamma function
Γ(r + 1), which equals r!. Hence

\[
E[X^r] = \frac{\lambda}{\lambda^{r+1}} \cdot r! = \frac{r!}{\lambda^r}.
\]

This completes the proof for the rth moment of the exponential distribution.
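
The result of Theorem 7.1 can be checked numerically by integrating x^r f(x) with scipy; λ = 0.5 below is an arbitrary choice:

import math
from scipy import integrate

lam = 0.5  # arbitrary rate parameter
for r in (1, 2, 3):
    # E[X^r] = integral of x^r * lam * exp(-lam * x) over [0, infinity)
    value, _ = integrate.quad(lambda x, r=r: x**r * lam * math.exp(-lam * x), 0, math.inf)
    print(r, value, math.factorial(r) / lam**r)  # numeric and closed-form values should match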


Mean and Variance of the Exponential Distribution


Using Theorem 7.1, we can easily find the mean and variance of the exponential
distribution.

Mean
For the exponential distribution (specifically when r = 1),

\[
E[X] = E[X^1] = \frac{1!}{\lambda^1} = \frac{1}{\lambda}.
\]

Variance
The variance Var(X) is calculated as follows:

\[
\operatorname{Var}(X) = E[X^2] - (E[X])^2.
\]

Calculating E[X²] (using r = 2), we have

\[
E[X^2] = \frac{2!}{\lambda^2} = \frac{2}{\lambda^2}.
\]

Now, substituting into the variance formula,

\[
\operatorname{Var}(X) = \frac{2}{\lambda^2} - \left(\frac{1}{\lambda}\right)^2 = \frac{2}{\lambda^2} - \frac{1}{\lambda^2} = \frac{1}{\lambda^2}.
\]
Problem 7.3. A data center experiences server failures where the time between
failures follows an exponential distribution with a mean of 10 days.
(i). Calculate the rate parameter λ.
(ii). What is the probability that a server will fail within the next 3 days?
(iii). What is the probability that the server will last longer than 12 days without

failing?

(i). Calculate the Rate Parameter λ


The mean time between failures is given as 10 days. The rate parameter λ is
calculated using the formula
\[
\lambda = \frac{1}{\text{Mean}} = \frac{1}{10} = 0.1.
\]

(ii). Probability of Failure Within the Next 3 Days


To find the probability that a server will fail within the next 3 days, we use the
cumulative distribution function (CDF) of the exponential distribution
P (X ≤ 3) = 1 − e−λx = 1 − e−0.1×3 = 1 − e−0.3 = 1 − 0.7408 = 0.2592.
So, the probability that a server will fail within the next 3 days is approxi-
mately 0.2592 (or 25.92%).

278
CHAPTER 7. SOME CONTINUOUS PROBABILITY DISTRIBUTIONS

(iii). Probability of Lasting Longer Than 12 Days


To find the probability that the server will last longer than 12 days, we can use
the survival function

P(X > 12) = 1 − P(X ≤ 12) = e^{−0.1×12} = e^{−1.2} ≈ 0.3012.

So, the probability that the server will last longer than 12 days without failing
is approximately 0.3012 (or 30.12%).
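
Both probabilities can be computed with scipy.stats.expon, which is parametrized by scale = 1/λ:

from scipy.stats import expon

dist = expon(scale=10)  # mean time between failures is 10 days, so scale = 1/lambda = 10

print(dist.cdf(3))   # P(X <= 3), approximately 0.2592
print(dist.sf(12))   # P(X > 12), approximately 0.3012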
Problem 7.4. In a hospital, the time between arrivals of patients in the emer-
gency room follows an exponential distribution with an average time of 15 min-
utes.

(a) What is the probability that the time between two successive arrivals is
more than 20 minutes?
(b) What is the probability that the time between two successive arrivals is
less than 10 minutes?
(c) Calculate the expected time between two successive arrivals and its stan-
dard deviation.

Solution
(a). Let X be the time between arrivals, which follows an exponential
distribution with parameter λ. The rate parameter λ is the reciprocal of
the mean, so λ = 1/15 per minute. The probability that the time between
two successive arrivals is more than 20 minutes is calculated as follows:

\[
P(X > 20) = 1 - P(X \le 20) = 1 - F(20) = 1 - \left(1 - e^{-20\lambda}\right) = e^{-20\lambda}.
\]

Substituting λ = 1/15,

\[
P(X > 20) = e^{-20/15} = e^{-4/3} \approx 0.2636.
\]

Thus, the probability that the time between two successive arrivals is
more than 20 minutes is approximately 0.2636.
(b). The probability that the time between two successive arrivals is less than
10 minutes is calculated as follows:

\[
P(X < 10) = F(10) = 1 - e^{-10\lambda}.
\]

Substituting λ = 1/15,

\[
P(X < 10) = 1 - e^{-10/15} = 1 - e^{-2/3} \approx 0.4866.
\]

Thus, the probability that the time between two successive arrivals is less
than 10 minutes is approximately 0.4866.


(c). The expected time between two successive arrivals (the mean of the
exponential distribution) is given by

\[
E(X) = \frac{1}{\lambda} = 15 \text{ minutes.}
\]

The standard deviation of the time between two successive arrivals is the
same as the mean for an exponential distribution, so

\[
\text{Standard deviation} = \frac{1}{\lambda} = 15 \text{ minutes.}
\]

Therefore, the expected time between two successive arrivals is 15 min-
utes, and the standard deviation is also 15 minutes.

7.3.1 Properties of the Exponential Distribution

Let X be an exponential random variable with rate parameter λ. The following
are the key properties of the exponential distribution:

• Cumulative Distribution Function (CDF):

  \[
  F(x) = \begin{cases} 1 - e^{-\lambda x} & x \ge 0, \\ 0 & x < 0. \end{cases}
  \]

• Mean (Expected Value):

  \[
  E[X] = \int_0^{\infty} x \cdot \lambda e^{-\lambda x}\,dx = \frac{1}{\lambda}.
  \]

• Variance:

  \[
  \operatorname{Var}(X) = E(X^2) - [E(X)]^2 = \frac{1}{\lambda^2}.
  \]

• Standard Deviation:

  \[
  \sigma_X = \frac{1}{\lambda}.
  \]

• Memoryless Property: The exponential distribution has the memoryless
  property, which states that

  \[
  P(X > s + t \mid X > s) = P(X > t) \quad \text{for all } s, t \ge 0.
  \]

• Moment Generating Function (MGF):

  \[
  M_X(t) = E[e^{tX}] = \frac{\lambda}{\lambda - t}, \quad \text{for } t < \lambda.
  \]

• Characteristic Function:

  \[
  \varphi_X(t) = E[e^{itX}] = \frac{\lambda}{\lambda - it}, \quad \text{for } t \in \mathbb{R}.
  \]

• Quantile Function: The quantile function (inverse of the CDF) for
  0 < p < 1 is given by (see the sampling sketch after this list)

  \[
  Q(p) = F^{-1}(p) = -\frac{1}{\lambda}\ln(1 - p).
  \]

• Relationship with the Poisson Process: If X ∼ Exponential(λ), it
  can be interpreted as the waiting time between events in a Poisson process
  with rate λ.

• Sum of Independent Exponential Variables: The sum of n independent
  exponential random variables with the same rate parameter λ follows a
  Gamma distribution:

  \[
  Y = \sum_{i=1}^{n} X_i \sim \operatorname{Gamma}(n, \lambda).
  \]

  For n = 1, the Gamma distribution reduces to the exponential distribution.

• Relationship with Other Distributions: If X ∼ Exponential(λ), then
  X ∼ Gamma(1, λ), and the scaled variable Y = βX ∼ Exponential(λ/β)
  for β > 0.
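
The quantile function above gives a standard way to draw exponential samples by inverse-transform sampling: if U ∼ U(0, 1), then Q(U) = −ln(1 − U)/λ follows an Exponential(λ) distribution. A minimal sketch, with λ = 2 chosen arbitrarily:

import numpy as np

rng = np.random.default_rng(1)
lam = 2.0  # arbitrary rate parameter

u = rng.uniform(size=100_000)
samples = -np.log(1 - u) / lam   # inverse-transform sampling via Q(p)

print(samples.mean(), 1 / lam)    # sample mean vs. theoretical mean 1/lambda
print(samples.var(), 1 / lam**2)  # sample variance vs. theoretical variance 1/lambda^2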

Theorem 7.2. The characteristic function of an exponential random variable
X with rate parameter λ is given by

\[
\varphi_X(t) = \frac{\lambda}{\lambda - it}, \quad \text{for } t \in \mathbb{R}.
\]

Proof. To find the characteristic function ϕ_X(t), we need to calculate the
expected value E[e^{itX}]:

\[
\varphi_X(t) = E[e^{itX}] = \int_0^{\infty} e^{itx}\,\lambda e^{-\lambda x}\,dx
             = \lambda \int_0^{\infty} e^{-(\lambda - it)x}\,dx.
\]

The integral is of the form \(\int_0^{\infty} e^{-ax}\,dx = \frac{1}{a}\) for ℜ(a) > 0, so

\[
\varphi_X(t) = \lambda\left(\frac{1}{\lambda - it}\right) = \frac{\lambda}{\lambda - it}.
\]

Thus, the characteristic function of X is

\[
\varphi_X(t) = \frac{\lambda}{\lambda - it}.
\]


Theorem 7.3. The moment generating function (MGF) of an exponential
random variable X with rate parameter λ is given by

\[
M_X(t) = \frac{\lambda}{\lambda - t}, \quad \text{for } t < \lambda.
\]

Proof. To find the moment generating function M_X(t), we need to calculate
the expected value E[e^{tX}]:

\[
M_X(t) = E[e^{tX}] = \int_0^{\infty} e^{tx}\,\lambda e^{-\lambda x}\,dx
       = \lambda \int_0^{\infty} e^{-(\lambda - t)x}\,dx
       = \lambda\left(\frac{1}{\lambda - t}\right).
\]

Thus, the moment generating function of X is

\[
M_X(t) = \frac{\lambda}{\lambda - t}, \quad \text{for } t < \lambda.
\]

7.3.2 Memoryless Property


In probability and statistics, the concept of memorylessness refers to a char-
acteristic of specific probability distributions. A random variable X is said to
have the memoryless property if the probability of an event occurring in the
future is independent of any past events. In other words, it “forgets” what has
happened before. Imagine you’re waiting for a bus that comes randomly every
few minutes. If you’ve been waiting for 10 minutes and the bus hasn’t arrived
yet, the probability of it arriving in the next 5 minutes is the same as if you
had just started waiting.

Only two types of distributions exhibit the memoryless property: geometric



and exponential probability distributions. The exponential distribution has the


memoryless property, which states that the probability of an event occurring in
the next t units of time is independent of how much time has already elapsed.
Mathematically, this can be expressed in the following theorem.
Theorem 7.4 (Memoryless Property). Let X be an exponentially distributed
random variable with rate parameter λ > 0. Then, for all s, t ≥ 0,

P (X > s + t | X > s) = P (X > t).

Proof. The proof of the memoryless property relies on the definition of condi-
tional probability and the exponential distribution’s probability density func-
tion.
By definition of conditional probability, we have

\[
P(X > s + t \mid X > s) = \frac{P(X > s + t \text{ and } X > s)}{P(X > s)} = \frac{P(X > s + t)}{P(X > s)}.
\]


Since X is exponentially distributed with rate parameter λ, the cumulative


distribution function (CDF) is

F (x) = 1 − e−λx for x ≥ 0,

and the survival function (which gives the probability that X is greater than a
certain value) is
P (X > x) = e−λx .
Using the survival function, we can rewrite the conditional probability:

\[
P(X > s + t \mid X > s) = \frac{P(X > s + t)}{P(X > s)} = \frac{e^{-\lambda(s+t)}}{e^{-\lambda s}} = e^{-\lambda t} = P(X > t).
\]

Thus, the memoryless property is proved.
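
The memoryless property is also easy to see in simulation: among samples that exceed s, the fraction exceeding s + t should match the unconditional probability of exceeding t. A minimal sketch with arbitrary values λ = 0.5, s = 2, t = 3:

import numpy as np

rng = np.random.default_rng(2)
lam, s, t = 0.5, 2.0, 3.0
x = rng.exponential(scale=1 / lam, size=1_000_000)

cond = np.mean(x[x > s] > s + t)  # estimate of P(X > s + t | X > s)
uncond = np.mean(x > t)           # estimate of P(X > t)
print(cond, uncond)               # both are close to exp(-1.5), approximately 0.223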
Problem 7.5. Suppose that the waiting time for a bus at a certain bus stop is
exponentially distributed with a mean waiting time of 10 minutes. Let X denote
the waiting time. Given that a person has already waited for 5 minutes, what
is the probability that they will have to wait at least an additional 10 minutes?
Solution
To solve this problem, we use the properties of the exponential distribution,
specifically its memoryless property. For an exponential distribution, the mean
is given by

\[
\text{Mean} = \frac{1}{\lambda}.
\]

Therefore,

\[
\lambda = \frac{1}{\text{Mean}} = \frac{1}{10}.
\]

We need to find the probability that the waiting time exceeds 15 minutes,
given that the person has already waited for 5 minutes. Using the memoryless
property,

\[
P(X > 15 \mid X > 5) = P(X > 10).
\]

The probability that X exceeds a certain time x is given by

\[
P(X > x) = e^{-\lambda x}.
\]

Substituting λ = 1/10 and x = 10,

\[
P(X > 10) = e^{-10/10} = e^{-1} \approx 0.3679.
\]

Problem 7.6. Assume that the lifetime of a light bulb follows an exponential
distribution with a mean lifetime of 1000 hours. Let X be the lifetime of the light
bulb. If a light bulb has already been used for 800 hours, what is the probability
that it will last at least an additional 500 hours?


Solution
To solve this problem, we use the memoryless property of the exponential dis-
tribution. Here’s a step-by-step solution.

The mean lifetime of the light bulb is 1000 hours. For an exponential
distribution, the mean is given by

\[
\text{Mean} = \frac{1}{\lambda}.
\]

Therefore,

\[
\lambda = \frac{1}{\text{Mean}} = \frac{1}{1000}.
\]

We need to find the probability that the light bulb will last for at least
500 more hours, given that it has already been used for 800 hours. Using the
memoryless property,

\[
P(X > 1300 \mid X > 800) = P(X > 500).
\]

The probability that X exceeds a certain time x is given by

\[
P(X > x) = e^{-\lambda x}.
\]

Substituting λ = 1/1000 and x = 500,

\[
P(X > 500) = e^{-500/1000} = e^{-0.5} \approx 0.6065.
\]

Problem 7.7. Consider a system with a component whose time to failure is



exponentially distributed with a rate parameter λ = 0.01 failures per hour. Let
X represent the time to failure of the component. If the component has been
operational for 100 hours without failure, what is the probability that it will
operate for at least another 50 hours?

Solution
To solve this problem, we use the properties of the exponential distribution,
specifically its memoryless property. Here’s a step-by-step solution: The rate
parameter for the exponential distribution is given as λ = 0.01 failures per hour.

We need to find the probability that the component will operate for at least
50 more hours, given that it has already operated for 100 hours. Using the
memoryless property

P (X > 150 | X > 100) = P (X > 50)


The probability that X exceeds a certain time x is given by

P (X > x) = e−λx

Substituting λ = 0.01 and x = 50,

\[
P(X > 50) = e^{-0.01 \times 50} = e^{-0.5} \approx 0.6065.
\]

Applications in Data Science

The important applications of exponential distribution are mentioned in the
following:

1. Survival Analysis: In medical research and clinical trials, the exponen-


tial distribution is used to model the time until an event occurs, such as
patient survival times or time to recovery from a treatment.
2. Queuing Theory: Exponential distributions model the time between
arrivals in queuing systems, like customer service lines, call centers, and
computer networks. This helps in analyzing and optimizing service pro-
cesses and reducing wait times.
3. Reliability Engineering: It helps in estimating the lifespan of products
and systems, providing insights into maintenance schedules and warranty
analysis.

4. Markov Processes: As part of continuous-time Markov chains, expo-


nential distributions model the time spent in each state before transition-
ing to another state.
5. Customer Behavior Analysis: Exponential models can help analyze
customer behavior, such as the time between purchases or interactions
with a service, allowing businesses to tailor marketing strategies effec-
tively.

7.3.3 Python Code for Exponential Distribution Characteristics


The Exponential distribution models the time between events in a Poisson pro-
cess. Below is Python code demonstrating how to compute various character-
istics of the Exponential distribution.


Python Code

import numpy as np
from scipy.stats import expon

# Define the parameter lambda (rate of occurrence)
lambda_ = 1          # Rate parameter
scale = 1 / lambda_  # Scale parameter for scipy's expon

# Exponential distribution
dist = expon(scale=scale)

# 1. Probability Density Function (PDF)
x_values = np.linspace(0, 10, 100)  # Values for the random variable
pdf_values = dist.pdf(x_values)
print("PDF values for x from 0 to 10:", pdf_values)

# 2. Cumulative Distribution Function (CDF)
cdf_values = dist.cdf(x_values)
print("CDF values for x from 0 to 10:", cdf_values)

# 3. Mean (Expected Value)
mean = dist.mean()
print("Mean (Expected Value):", mean)

# 4. Variance
variance = dist.var()
print("Variance:", variance)

# 5. Standard Deviation
std_dev = dist.std()
print("Standard Deviation:", std_dev)

# 6. Quantiles
quantiles = dist.ppf([0.25, 0.5, 0.75])  # 25th, 50th (median), and 75th percentiles
print("Quantiles at 0.25, 0.5, and 0.75:", quantiles)

# 7. Percentiles
percentiles = dist.ppf([0.1, 0.9])  # 10th and 90th percentiles
print("Percentiles at 0.1 and 0.9:", percentiles)

# 8. Moment Generating Function (MGF)
def mgf(t, lambda_):
    return 1 / (1 - t / lambda_)  # MGF of the Exponential distribution for t < lambda_

# MGF values at t = 0.1 and t = 0.5
mgf_values = [mgf(t, lambda_) for t in [0.1, 0.5]]
print("MGF values for t = 0.1 and t = 0.5:", mgf_values)

# 9. Probability Generating Function (PGF)
# Note: The PGF is not typically used for continuous
# distributions like the Exponential distribution.
# Here, the MGF serves the analogous purpose.

Explanations

• Probability Density Function (PDF):

  ■ The PDF provides the likelihood of each value. Compute these
    values using dist.pdf(x_values).

• Cumulative Distribution Function (CDF):

  ■ The CDF gives the probability that the random variable is less
    than or equal to a given value. Compute these values using
    dist.cdf(x_values).

• Mean (Expected Value):

  ■ The mean (expected value) is 1/λ. Compute this using dist.mean().

• Variance:

  ■ The variance is 1/λ². Compute this using dist.var().

• Standard Deviation:

  ■ The standard deviation is 1/λ. Compute this using dist.std().

• Quantiles:

  ■ Quantiles are values below which a given proportion of data falls.
    Compute these using dist.ppf() for the desired quantile probabilities.

• Percentiles:

  ■ Percentiles are specific quantiles, such as the 10th and 90th percentiles.
    Compute these using dist.ppf() for the desired percentiles.

• Moment Generating Function (MGF):

  ■ The MGF for the Exponential distribution is M_X(t) = 1/(1 − t/λ)
    for t < λ. Define a function mgf(t, lambda_) to compute MGF values.

• Probability Generating Function (PGF):

  ■ The PGF is generally used for discrete distributions. For continuous
    distributions like the Exponential, the MGF serves a similar purpose.

7.3.4 Exercises

T
1. The lifetime of a particular brand of lightbulb is exponentially distributed
with a mean of 1000 hours.
(a) What is the probability that a lightbulb lasts more than 1200 hours?
(b) What is the probability that a lightbulb lasts between 800 and 1200
hours?
AF
2. A machine in a factory breaks down on average once every 500 hours.
The time between breakdowns is exponentially distributed.
(a) What is the probability that the machine will operate for at least
1000 hours without a breakdown?
(b) Determine the probability that the machine will break down within
the next 200 hours.
3. The waiting time for a specific genetic test result from a laboratory follows
an exponential distribution with a mean of 2 days.
(a) Find the probability that the test result will be available in less than
DR

1 day.
(b) Find the probability that the test result will take more than 3 days.
4. The time until failure of a critical component in a medical device follows
an exponential distribution with a mean of 5 years.
(a) Calculate the probability that the component will fail within the first
3 years.
(b) Determine the probability that the component will last more than 7
years.
5. In a call center, the time between consecutive calls follows an exponential
distribution with a mean time of 4 minutes.
(a) What is the probability that the next call comes within 2 minutes?
(b) What is the probability that the next call will not come for at least
6 minutes?


6. The time between arrivals of buses at a particular bus stop follows an


exponential distribution with an average time of 20 minutes.
(a) What is the probability that a bus will arrive in the next 5 minutes?
(b) Calculate the probability that you will have to wait more than 25
minutes for the next bus.
7. The lifespan of a certain species of bacteria follows an exponential distri-
bution with a mean of 24 hours.
(a) Find the probability that a bacterium lives longer than 30 hours.
(b) Determine the probability that a bacterium lives between 10 and 20

T
hours.
8. The response time of a server to a network request is exponentially dis-
tributed with an average time of 0.5 seconds.
(a) What is the probability that the server responds in less than 0.3
seconds?
AF (b) Calculate the probability that the server takes more than 1 second
to respond.
9. The time it takes for a chemical reaction to complete in a lab experiment
follows an exponential distribution with a mean of 45 minutes.
(a) What is the probability that the reaction will complete in less than
30 minutes?
(b) What is the probability that the reaction will take more than 60
minutes to complete?
10. The lifespan of a certain species of laboratory mice follows an exponential
DR

distribution with a mean lifespan of 2.5 years.


(a) Determine the probability density function (pdf) of the lifespan.
(b) What is the median lifespan of the mice?
(c) Calculate the variance and standard deviation of the lifespan.
(d) What is the probability that a randomly selected mouse lives more
than 3 years?
11. The time (in hours) until recovery from a certain disease follows an expo-
nential distribution with a mean of 2 hours.
(a) What is the probability that a patient will recover in less than 1
hour?
(b) What is the probability that a patient will recover between 1 and 3
hours?
(c) Find the median recovery time.


7.4 Normal Distribution


Imagine you are working for a clothing company that wants to optimize its
garment sizes. To do this effectively, you need to understand the distribution
of body measurements, such as heights, within your target customer base. If
you were to collect height data from a large number of individuals, what kind
of pattern would you expect to see? Intuitively, you would likely find that
most people are of average height, with fewer individuals being exceptionally
tall or short. This common phenomenon, where measurements cluster around
a central value with a symmetrical spread on either side, is a hallmark of the
normal distribution, also known as the Gaussian distribution.

In the field of data science, the normal distribution plays a pivotal role due
to its ubiquitous nature and the mathematical properties that simplify analysis.
It is characterized by its symmetric, bell-shaped curve and is instrumental
in various statistical methods, including hypothesis testing, regression analysis,
and many machine learning algorithms.
Theorem, which states that the sum of a large number of independent, iden-
tically distributed variables tends toward a normal distribution, regardless of
the original distribution of the variables. This makes the normal distribution
a powerful tool for modeling real-world phenomena and for making inferences
about populations based on sample data.

The normal distribution is characterized by its bell-shaped curve, which is


symmetric about its mean. The parameters of the normal distribution are the
mean (µ) and the standard deviation (σ). The mean represents the central
value of the distribution, while the standard deviation measures the spread of
the distribution.

7.4.1 Definition of the Normal Distribution


Normal Distribution: A random variable X is said to follow a normal
distribution if its probability density function (pdf) is given by:
\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}, \qquad -\infty < x < \infty \]
where µ is the mean and σ is the standard deviation of the distribution.
The graphical presentation of a normal distribution is given in Figure 7.5.
The probability density function is a bell-shaped curve that is symmetric
about µ. The notation
X ∼ N (µ, σ 2 )


denotes that the random variable X has a normal distribution with mean
µ and variance σ 2 .

Figure 7.5: The graphical presentation of a normal distribution.

The mean and the variance of the distribution are E(X) = µ and Var(X) = σ², respectively.
The probability density function of a normal random variable is symmetric
around the mean value µ and exhibits a “bell-shaped” curve. Figure 7.6 displays
the probability density functions of normal distributions with µ = 5, σ = 2
and µ = 10, σ = 2. It illustrates that while altering the mean value µ shifts
the location of the density function, it does not affect its shape. In contrast,
Figure 7.7 presents the probability density functions of normal distributions
with µ = 5, σ = 2 and µ = 5, σ = 0.5. Here, the central position of the
density function remains the same, but its shape changes. Larger values of the
variance σ 2 lead to wider, flatter bell-shaped curves, whereas smaller values of
the variance σ 2 produce narrower, sharper bell-shaped curves.


Figure 7.6: The effect of changing the mean of a normal distribution (densities of N(5, 4), N(10, 4), and N(15, 4)).

Figure 7.7: The effect of changing the variance of a normal distribution.


7.4.2 Properties of the Normal Distribution


The normal distribution, also known as the Gaussian distribution, is one of the most important probability distributions in statistics. It has several important properties:

1. Symmetry: The normal distribution is symmetric about its mean µ.


That is f (µ − x) = f (µ + x). This means that the left half of the distri-
bution is a mirror image of the right half.

2. Bell-Shaped Curve: The pdf of the normal distribution forms a bell-shaped curve that is unimodal, meaning it has a single peak at the mean µ.
3. Mean, Median, and Mode: For a normal distribution, the mean, me-
dian, and mode are all equal and located at µ.
4. Total Area: The total area under the curve and above the horizontal
axis is equal to 1.
5. Inflection Points: The points at which the curve changes concavity are
located at µ − σ and µ + σ.

6. Empirical Rule (68-95-99.7 Rule): Approximately 68% of the data


lies within one standard deviation of the mean (µ ± σ), about 95% within
two standard deviations (µ ± 2σ), and about 99.7% within three standard
deviations (µ ± 3σ). See Figure 7.8.
Figure 7.8: The Empirical Rule for the Normal Distribution.


7. Asymptotic Behavior: The tails of the normal distribution approach the horizontal axis but never touch it. This implies that the density is positive for every value of x, no matter how extreme.
8. Linear Transformations: If X is normally distributed with mean µ
and standard deviation σ, then a linear transformation of X given by
Y = aX + b is also normally distributed with mean aµ + b and standard
deviation |a|σ.
9. Additivity: The sum of two independent normal random variables is also
normally distributed. Specifically, if X1 ∼ N (µ1 , σ12 ) and X2 ∼ N (µ2 , σ22 ),
then X1 + X2 ∼ N (µ1 + µ2 , σ12 + σ22 ).

10. Moment Generating Function: The moment generating function M_X(t) of a normal random variable X ∼ N(µ, σ²) is given by:
\[ M_X(t) = \exp\left(\mu t + \frac{1}{2}\sigma^2 t^2\right). \]
11. Characteristic Function: The characteristic function φ_X(t) of a normal random variable X ∼ N(µ, σ²) is given by:
\[ \varphi_X(t) = \exp\left(i\mu t - \frac{1}{2}\sigma^2 t^2\right). \]

12. Cumulative Distribution Function (cdf): The cumulative distribution function (CDF) of the normal distribution with mean µ and standard deviation σ is given by:
\[ F(x) = \Pr(X \le x) = \frac{1}{2}\left[1 + \operatorname{erf}\left(\frac{x-\mu}{\sigma\sqrt{2}}\right)\right] \]
where erf(z) is the error function, defined as:
\[ \operatorname{erf}(z) = \frac{2}{\sqrt{\pi}} \int_0^z e^{-t^2}\, dt. \]

Theorem 7.5. If X ∼ N(µ, σ²), then the mean and variance are µ and σ², respectively.
Proof. To evaluate the mean, we first calculate
\[ E(X - \mu) = \int_{-\infty}^{\infty} (x - \mu)\, \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}\, dx. \]
Setting z = (x − µ)/σ and dx = σ dz, we obtain
\[ E(X - \mu) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} z\, e^{-\frac{1}{2}z^2}\, dz = 0, \]
since the integrand above is an odd function of z. Using Theorem 4.5 on page 128, we conclude that
\[ E(X) = \mu. \]
The variance of the normal distribution is given by
\[ E[(X - \mu)^2] = \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^{\infty} (x - \mu)^2\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}\, dx. \]
Again setting z = (x − µ)/σ and dx = σ dz, we obtain
\[ E[(X - \mu)^2] = \sigma^2\, \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} z^2\, e^{-\frac{1}{2}z^2}\, dz. \]
Integrating by parts with u = z and dv = z e^{-z^2/2} dz, so that du = dz and v = -e^{-z^2/2}, we find that
\[ E[(X - \mu)^2] = \sigma^2\, \frac{1}{\sqrt{2\pi}} \left( \left[ -z e^{-z^2/2} \right]_{-\infty}^{\infty} + \int_{-\infty}^{\infty} e^{-z^2/2}\, dz \right) = \sigma^2 (0 + 1) = \sigma^2.

Theorem 7.6. For a normal distribution with mean µ and standard deviation
σ, the inflection points occur at x = µ − σ and x = µ + σ.
Proof. To find the inflection points, we need to determine where the second
derivative of the density function changes sign.

First Derivative
The first derivative of f(x) with respect to x is:
\[ f'(x) = \frac{d}{dx}\left( \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \right) \]
Using the chain rule:
\[ f'(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \cdot \left( -\frac{x-\mu}{\sigma^2} \right) \]
Simplifying:
\[ f'(x) = -\frac{x-\mu}{\sigma^3\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \]
Second Derivative
The second derivative of f(x) with respect to x is:
\[ f''(x) = \frac{d}{dx}\left( -\frac{x-\mu}{\sigma^3\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \right) \]


Again, using the product rule and chain rule:
\[ f''(x) = -\frac{1}{\sigma^3\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \cdot \left( 1 - \frac{(x-\mu)^2}{\sigma^2} \right) \]
Simplifying:
\[ f''(x) = \frac{1}{\sigma^5\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \left[ (x-\mu)^2 - \sigma^2 \right] \]
Setting the Second Derivative to Zero
To find the inflection points, we set f''(x) = 0:
\[ (x-\mu)^2 - \sigma^2 = 0 \]
Solving for x:
\[ (x-\mu)^2 = \sigma^2 \quad\Rightarrow\quad x - \mu = \pm\sigma \quad\Rightarrow\quad x = \mu \pm \sigma \]
Thus, the inflection points of the normal distribution are at x = µ − σ and x = µ + σ.

7.4.3 Standard Normal Distribution


The standard normal distribution is a special case of the normal distribution
with mean µ = 0 and standard deviation σ = 1. It is denoted by Z and its
properties are as follows:

• Probability Density Function (pdf): The pdf of the standard normal distribution is given by:
\[ f_Z(z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{z^2}{2}}, \]
as illustrated in Figure 7.9.

• Cumulative Distribution Function (cdf): The cdf of the standard normal distribution is denoted by Φ(z) and is defined as:
\[ \Phi(z) = \Pr(Z \le z) = \frac{1}{2}\left[1 + \operatorname{erf}\left(\frac{z}{\sqrt{2}}\right)\right] \]
where erf(z) is the error function:
\[ \operatorname{erf}(z) = \frac{2}{\sqrt{\pi}} \int_0^z e^{-t^2}\, dt \]


Below is a plot of the standard normal distribution’s pdf.

Figure 7.9: The standard normal distribution with mean µ = 0 and standard deviation σ = 1.
AF
The symmetry of the standard normal distribution about 0 implies that if
the random variable Z has a standard normal distribution, then
1 − Φ(z) = Pr(Z ≥ z) = Pr(Z ≤ −z) = Φ(−z),
as illustrated in Figure 7.9. This equation can be rearranged to provide the
easily remembered relationship
Φ(z) + Φ(−z) = 1.
The plot presented in Figure 7.10 illustrates the cumulative distribution func-
tions Φ(z) and Φ(−z) of the standard normal distribution. The symmetry of
the standard normal distribution is evident from these plots.
Figure 7.10: Standard normal distribution with shaded areas for Φ(−2) and Φ(2).
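This symmetry is easy to confirm numerically; a minimal sketch with scipy.stats.norm:

from scipy.stats import norm

z = 2.0
print(norm.cdf(-z))                 # Phi(-2), about 0.0228
print(1 - norm.cdf(z))              # 1 - Phi(2), identical by symmetry
print(norm.cdf(z) + norm.cdf(-z))   # Phi(2) + Phi(-2) = 1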


7.4.4 Finding the Probability P(a ≤ X ≤ b)


The probability P (a ≤ X ≤ b) for a continuous random variable X with pdf
f (x) is given by the definite integral of the pdf over the interval [a, b]:
\[ P(a \le X \le b) = \int_a^b f(x)\, dx \]

as illustrated in Figure 7.11. The probability P (a ≤ X ≤ b) can also be


expressed in terms of the cdf F (x) of X as follows

P (a ≤ X ≤ b) = F (b) − F (a).

Direct computation of F (·) for a general normal distribution can be challenging.
However, it is readily facilitated by using the cdf of the standard normal distri-
bution of Z. The following steps explain how to compute P (a ≤ X ≤ b) for a
normally distributed random variable X by leveraging the cdf of the standard
normal random variable Z.
Figure 7.11: The area under the probability density function f(x) between a and b.

Steps to Find the Probability


1. Standardize the variable: Convert the normal random variable X ∼
N (µ, σ 2 ) to the standard normal random variable Z ∼ N (0, 1) using the
z-score transformation:
\[ Z = \frac{X - \mu}{\sigma} \]


2. Standardize the limits: Convert the lower limit a and the upper limit
b to their corresponding z-scores:
\[ z_a = \frac{a - \mu}{\sigma} \quad \text{and} \quad z_b = \frac{b - \mu}{\sigma} \]
3. Find the cumulative probabilities: Use the cdf of the standard nor-
mal distribution, denoted by Φ(z) = P (Z ≤ z), to find the cumulative
probabilities at z_a and z_b:
\[ \Phi(z_a) = \Phi\left(\frac{a-\mu}{\sigma}\right) = P\left(Z \le \frac{a-\mu}{\sigma}\right), \qquad \Phi(z_b) = \Phi\left(\frac{b-\mu}{\sigma}\right) = P\left(Z \le \frac{b-\mu}{\sigma}\right). \]
4. Calculate the probability: The probability P (a ≤ X ≤ b) is the dif-
ference between the cumulative probabilities at z_b and z_a:
\[ \begin{aligned} P(a \le X \le b) &= P\left(\frac{a-\mu}{\sigma} \le \frac{X-\mu}{\sigma} \le \frac{b-\mu}{\sigma}\right) \\ &= P(z_a \le Z \le z_b) \\ &= P(Z \le z_b) - P(Z \le z_a) \\ &= \Phi(z_b) - \Phi(z_a) = \Phi\left(\frac{b-\mu}{\sigma}\right) - \Phi\left(\frac{a-\mu}{\sigma}\right) \end{aligned} \]

Finding the Probability P(a ≤ X ≤ b): If X ∼ N(µ, σ²), then the probability P(a ≤ X ≤ b) can be computed as
\[ P(a \le X \le b) = P\left(\frac{a-\mu}{\sigma} \le \frac{X-\mu}{\sigma} \le \frac{b-\mu}{\sigma}\right) = \Phi\left(\frac{b-\mu}{\sigma}\right) - \Phi\left(\frac{a-\mu}{\sigma}\right). \]

Problem 7.8. Suppose X ∼ N (10, 4). Find P (8 ≤ X ≤ 12).


Solution
Since X ∼ N(10, 4), we have σ = 2. Then
\[ P(8 \le X \le 12) = P\left(\frac{8-10}{2} \le Z \le \frac{12-10}{2}\right) = \Phi(1) - \Phi(-1) \approx 0.8413 - 0.1587 = 0.6826 \]
Thus, the probability that X lies between 8 and 12 is approximately 0.6826.
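The same answer can be obtained directly in Python, without standardizing by hand (a sketch; norm.cdf takes the mean and standard deviation as loc and scale):

from scipy.stats import norm

# X ~ N(10, 4), so sigma = 2
p = norm.cdf(12, loc=10, scale=2) - norm.cdf(8, loc=10, scale=2)
print(round(p, 4))  # 0.6827 (the table gives 0.6826 due to rounding)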
Problem 7.9. Suppose that Y ∼ N (10, 16). What is the value of P (|Y − 10| ≥
12)?


Solution
Given Y ∼ N (10, 16), we know that the mean µ = 10 and standard deviation
σ = 4. We need to compute P (|Y − 10| ≥ 12). This is equivalent to the
following:

\[ \begin{aligned} P(|Y - 10| \ge 12) &= P(Y \le -2) + P(Y \ge 22) \\ &= P\left(\frac{Y-\mu}{\sigma} \le \frac{-2-10}{4}\right) + P\left(\frac{Y-\mu}{\sigma} \ge \frac{22-10}{4}\right) \\ &= P(Z \le -3) + P(Z \ge 3) \\ &= \Phi(-3) + [1 - \Phi(3)] \\ &\approx 0.00135 + [1 - 0.99865] = 0.0027 \end{aligned} \]

Problem 7.10. Assume the heights of adult males in a certain population are
normally distributed with a mean of 70 inches and a standard deviation of 3
inches. What is the probability that a randomly selected adult male has a height
between 67 and 73 inches?

Solution
To find the probability that a randomly selected adult male has a height be-
tween 67 and 73 inches, we standardize the values and use the standard normal
distribution. Therefore,
\[ \begin{aligned} P(67 \le X \le 73) &= P\left(\frac{67-70}{3} \le Z \le \frac{73-70}{3}\right) \\ &= P\left(Z \le \frac{73-70}{3}\right) - P\left(Z \le \frac{67-70}{3}\right) \\ &= \Phi(1) - \Phi(-1) \approx 0.8413 - 0.1587 = 0.6826 \end{aligned} \]

Thus, approximately 68.26% of adult males have heights between 67 and 73


inches.

Problem 7.11. What is the value of a for which P (Z ≥ a) = 0.72, where


Z ∼ N (0, 1)?
Solution
We are given that P (Z ≥ a) = 0.72, where Z follows the standard normal
distribution. To find a, we first express the probability as:

P (Z ≥ a) = 0.72
This is equivalent to:


P (Z < a) = 1 − 0.72 = 0.28


Next, we use the standard normal distribution table to find the z-score
corresponding to a cumulative probability of 0.28. From the table, we find

a ≈ −0.58
Thus, the value of a is approximately −0.58.
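In code, inverse problems of this kind use the percent-point function norm.ppf, the inverse of the cdf (a minimal sketch):

from scipy.stats import norm

# P(Z >= a) = 0.72  is equivalent to  P(Z < a) = 0.28
a = norm.ppf(0.28)
print(round(a, 2))  # -0.58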
Problem 7.12. Given a standard normal distribution, find the value of k such
that

(a). P (Z > k) = 0.3015
(b). P (k < Z < −0.18) = 0.4197.
Solution (a). Finding k for P (Z > k) = 0.3015
The area to the right of k is 0.3015. Therefore, the area to the left of k
is:
1 − 0.3015 = 0.6985
Using the standard normal distribution table, we look up the value that
corresponds to an area of 0.6985 to the left. This value is:

k ≈ 0.52

(b). Finding k for P (k < Z < −0.18) = 0.4197

The total area to the left of −0.18 is



P (Z < −0.18) = 0.4286

The area between k and −0.18 is 0.4197. Therefore, the area to the left
of k is
0.4286 − 0.4197 = 0.0089
That is
P (Z < k) = 0.0089
Using the standard normal distribution table, we look up the value that
corresponds to an area of 0.0089 to the left. This value is

k ≈ −2.37


Problem 7.13. Suppose the heights of a large population of adult males in a


certain country follow a normal distribution with a mean height of 175 centime-
ters and a standard deviation of 6 centimeters. Calculate the height below which
the shortest 10% of adult males in this population fall.
Solution
Let X be the height of an adult male, which follows a normal distribution, i.e.,

X ∼ N(175, 6²).
We need to find the height x such that the cumulative probability up to x
is 0.10. In other words, we want to find x for which

P (X ≤ x) = 0.10.
To find the height below which the shortest 10% fall, we find the Z-score
for the 10th percentile, which is approximately z = −1.28. Using the formula
X = Z · σ + µ, we get
X = (−1.28) · 6 + 175 ≈ 167.32 cm
Thus, the height below which the shortest 10% of adult males fall is approx-
imately 167.32 cm.
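The same percentile can be read off directly from the scaled distribution rather than via the z-score (a sketch):

from scipy.stats import norm

# 10th percentile of N(175, 6^2)
h = norm.ppf(0.10, loc=175, scale=6)
print(round(h, 2))  # 167.31 (the rounded z = -1.28 gives 167.32)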

Problem 7.14. Subscribers to The Wall Street Journal Interactive Edition spend an average of 27 hours per week using the computer at work. Assume the normal distribution applies and that the standard deviation is 8 hours.
(a). What is the probability a randomly selected subscriber spends less than 10
hours using the computer at work?

(b). What percentage of the subscribers spends more than 35 hours per week
using the computer at work?
(c). A person is classified as a heavy user if he or she is in the upper 20%
in terms of hours of usage. How many hours must a subscriber use the
computer in order to be classified as a heavy user?
Solution
Let X be the number of hours a randomly selected subscriber spends using the
computer at work per week. Assume X follows a normal distribution:

X ∼ N(27, 8²)
where µ = 27 hours and σ = 8 hours.


(a). To find the probability that a subscriber spends less than 10 hours on
the computer, we need to calculate P (X < 10). First, convert this to
the standard normal variable Z:

\[ Z = \frac{X - \mu}{\sigma} = \frac{10 - 27}{8} = \frac{-17}{8} = -2.125 \]
Using standard normal distribution tables or software, find:

P (Z < −2.125) ≈ 0.0169

Thus, the probability that a randomly selected subscriber spends less

than 10 hours using the computer is approximately 0.0169 or 1.69%.

(b). To find the percentage of subscribers who spend more than 35 hours per
week, we need to calculate P (X > 35). Convert this to the standard
normal variable Z:
\[ Z = \frac{X - \mu}{\sigma} = \frac{35 - 27}{8} = \frac{8}{8} = 1 \]
Using standard normal distribution tables or software, find:

P (Z > 1) = 1 − P (Z ≤ 1) = 1 − Φ(1) ≈ 1 − 0.8413 = 0.1587

Thus, the percentage of subscribers who spend more than 35 hours per
week is approximately 15.87%.

(c). To classify as a heavy user, a subscriber must be in the upper 20% of


usage. This corresponds to the 80th percentile of the normal distribution. Find the z-score for the 80th percentile:

z0.80 ≈ 0.84

Convert this z-score to the corresponding number of hours x:

x = µ + zσ = 27 + 0.84 × 8

x = 27 + 6.72 = 33.72

Therefore, a subscriber must use the computer for at least 33.72 hours
per week to be classified as a heavy user.
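All three parts of this problem can be reproduced in a few lines (a sketch using a frozen scipy.stats.norm distribution; small differences from the table-based answers come from rounding):

from scipy.stats import norm

X = norm(loc=27, scale=8)  # weekly hours of computer use

print(round(X.cdf(10), 4))       # (a) P(X < 10), about 0.0168
print(round(1 - X.cdf(35), 4))   # (b) P(X > 35), about 0.1587
print(round(X.ppf(0.80), 2))     # (c) 80th percentile, about 33.73 hours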


7.4.5 Central Limit Theorem


The Central Limit Theorem (CLT) is a fundamental theorem in probability
theory and statistics. It states that the distribution of the sum (or average) of
a large number of independent, identically distributed (i.i.d.) random variables
approaches a normal distribution, regardless of the original distribution of the
variables. This theorem underpins many statistical methods and justifies the
use of the normal distribution in inferential statistics.
Theorem 7.7 (Central Limit Theorem). Let X1 , X2 , . . . , Xn be a sequence of
i.i.d. random variables with mean µ and variance σ 2 . Let X̄n denote the sample
mean:
\[ \bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i \]
n i=1
Then, as n approaches infinity, the distribution of the standardized sample mean
approaches the standard normal distribution:

X̄n − µ d
AF
or equivalently,
√ −
σ/ n

d
→ N (0, 1)

→ N (µ, σ 2 /n)
X̄n −
d
where −
→ denotes convergence in distribution.

Example: Central Limit Theorem


Consider a fair die. The random variable X representing the outcome of a single roll has mean µ = 3.5 and variance σ² = 35/12. Suppose we roll the die 30 times and compute the sample mean X̄₃₀. According to the CLT, the distribution of X̄₃₀ can be approximated by a normal distribution with mean 3.5 and standard deviation σ/√30:
\[ \bar{X}_{30} \xrightarrow{d} N\left(3.5,\ \frac{35}{12 \times 30}\right). \]
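A quick simulation makes the theorem tangible for this example; the sketch below averages 30 fair-die rolls many times and compares the empirical mean and variance of those averages with the CLT prediction:

import numpy as np

rng = np.random.default_rng(42)
n, reps = 30, 10_000

# Each row is one experiment: the average of 30 die rolls
means = rng.integers(1, 7, size=(reps, n)).mean(axis=1)

print(round(means.mean(), 3))  # close to 3.5
print(round(means.var(), 4))   # close to 35/(12*30), about 0.0972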
Problem 7.15. A researcher is studying the systolic blood pressure levels in
a population of adults. It is known that the systolic blood pressure levels are
normally distributed with a mean (µ) of 120 mmHg and a standard deviation
(σ) of 15 mmHg.

(a). What proportion of the population has systolic blood pressure levels be-
tween 110 mmHg and 130 mmHg?
(b). What is the probability that a randomly selected individual from this pop-
ulation has a systolic blood pressure level above 140 mmHg?


(c). If the researcher takes a random sample of 25 adults, what is the probabil-
ity that the sample mean systolic blood pressure is less than 115 mmHg?

Solution (a). Proportion of the Population between 110 mmHg and 130
mmHg
We need to find P (110 ≤ X ≤ 130) where X is the systolic blood
pressure level.
\[ \begin{aligned} P(110 \le X \le 130) &= P\left(\frac{110-120}{15} \le \frac{X-\mu}{\sigma} \le \frac{130-120}{15}\right) \\ &= P(-0.67 \le Z \le 0.67) \\ &= P(Z \le 0.67) - P(Z \le -0.67) \\ &= \Phi(0.67) - \Phi(-0.67) \\ &= 0.7486 - 0.2514 = 0.4972 \end{aligned} \]
So, approximately 49.72% of the population has systolic blood pressure levels between 110 mmHg and 130 mmHg.

(b). Probability of Blood Pressure Above 140 mmHg
We need to find P(X > 140).
 
\[ P(X > 140) = P\left(\frac{X-\mu}{\sigma} > \frac{140-120}{15}\right) = 1 - P(Z \le 1.33) \approx 1 - 0.9082 = 0.0918 \]
So, the probability that a randomly selected individual has a systolic blood pressure level above 140 mmHg is approximately 0.0918 or 9.18%.

(c). Probability of Sample Mean Less Than 115 mmHg


For a sample of n = 25, the distribution of the sample mean X̄ is normally distributed with mean µ_X̄ = µ and standard deviation σ_X̄ = σ/√n. Given:
\[ \mu_{\bar{X}} = 120, \qquad \sigma_{\bar{X}} = \frac{15}{\sqrt{25}} = 3 \]
We need to find P(X̄ < 115). Standardize the value to the standard normal distribution Z:
\[ Z = \frac{\bar{X} - \mu_{\bar{X}}}{\sigma_{\bar{X}}} = \frac{115 - 120}{3} = \frac{-5}{3} \approx -1.67 \]


Using the standard normal distribution table or a calculator, we find:

P (Z ≤ −1.67) ≈ 0.0475

So, the probability that the sample mean systolic blood pressure of 25
adults is less than 115 mmHg is approximately 0.0475 or 4.75%.
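These three probabilities can be cross-checked in code (a sketch; the exact values differ slightly from the table-based answers because the z-scores above were rounded to two decimals):

import math
from scipy.stats import norm

X = norm(loc=120, scale=15)                      # systolic blood pressure

print(round(X.cdf(130) - X.cdf(110), 4))         # (a) about 0.4950 (table: 0.4972)
print(round(1 - X.cdf(140), 4))                  # (b) about 0.0912 (table: 0.0918)

Xbar = norm(loc=120, scale=15 / math.sqrt(25))   # sampling distribution, n = 25
print(round(Xbar.cdf(115), 4))                   # (c) about 0.0478 (table: 0.0475)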

Problem 7.16. A study is conducted to measure the cholesterol levels in a pop-


ulation of adults. It is known that the cholesterol levels are normally distributed
with a mean (µ) of 200 mg/dL and a standard deviation (σ) of 25 mg/dL.

(a). What percentage of the population has cholesterol levels between 175 mg/dL and 225 mg/dL?
(b). What is the probability that a randomly selected individual has a choles-
terol level below 180 mg/dL?
(c). If a sample of 36 adults is taken, what is the probability that the sample
AF mean cholesterol level is greater than 210 mg/dL?

Solution
(a). Percentage of the Population between 175 mg/dL and 225 mg/dL
Let X be the cholesterol level. Then,
\[ \begin{aligned} P(175 \le X \le 225) &= P\left(\frac{175-200}{25} \le \frac{X-\mu}{\sigma} \le \frac{225-200}{25}\right) \\ &= P(-1 \le Z \le 1) = \Phi(1) - \Phi(-1) \\ &\approx 0.8413 - 0.1587 = 0.6826 \end{aligned} \]

So, approximately 68.26% of the population has cholesterol levels between 175
mg/dL and 225 mg/dL.

(b). Probability of Cholesterol Level Below 180 mg/dL

 
\[ P(X < 180) = P\left(\frac{X-\mu}{\sigma} < \frac{180-200}{25}\right) = P(Z \le -0.8) = \Phi(-0.8) \approx 0.2119 \]

So, the probability that a randomly selected individual has a cholesterol level
below 180 mg/dL is approximately 0.2119 or 21.19%.


(c). Probability of Sample Mean Greater Than 210 mg/dL


For a sample of n = 36, the distribution of the sample mean X̄ is normally distributed with mean µ_X̄ = µ and standard deviation σ_X̄ = σ/√n. Given:
\[ \mu_{\bar{X}} = 200, \qquad \sigma_{\bar{X}} = \frac{25}{\sqrt{36}} = \frac{25}{6} \approx 4.17 \]
We need to find P(X̄ > 210). Thus,
\[ P(\bar{X} > 210) = P\left(\frac{\bar{X} - \mu_{\bar{X}}}{\sigma_{\bar{X}}} > \frac{210 - \mu_{\bar{X}}}{\sigma_{\bar{X}}}\right) = P(Z > 2.40) = 1 - \Phi(2.40) \approx 1 - 0.9918 = 0.0082 \]

So, the probability that the sample mean cholesterol level of 36 adults is greater
than 210 mg/dL is approximately 0.0082 or 0.82%.

7.4.6 Python Code for Normal Distribution Characteristics
AF
The Normal (Gaussian) distribution is fundamental in statistics and is used to
model continuous random variables. Below is Python code that demonstrates
how to compute various characteristics of the Normal distribution.

Python Code
import numpy as np
from scipy.stats import norm

# Define the parameters
mu = 0      # Mean
sigma = 1   # Standard deviation

# Normal distribution
dist = norm(loc=mu, scale=sigma)

# 1. Probability Density Function (PDF)
x_values = np.linspace(-5, 5, 100)  # Values for the random variable
pdf_values = dist.pdf(x_values)
print("PDF values for x from -5 to 5:", pdf_values)

# 2. Cumulative Distribution Function (CDF)
cdf_values = dist.cdf(x_values)
print("CDF values for x from -5 to 5:", cdf_values)

# 3. Mean (Expected Value)
mean = dist.mean()
print("Mean (Expected Value):", mean)

# 4. Variance
variance = dist.var()
print("Variance:", variance)

# 5. Standard Deviation
std_dev = dist.std()
print("Standard Deviation:", std_dev)

# 6. Quantiles
quantiles = dist.ppf([0.25, 0.5, 0.75])  # 25th, 50th (median), and 75th percentiles
print("Quantiles at 0.25, 0.5, and 0.75:", quantiles)

# 7. Percentiles
percentiles = dist.ppf([0.1, 0.9])  # 10th and 90th percentiles
print("Percentiles at 0.1 and 0.9:", percentiles)

# 8. Moment Generating Function (MGF)
def mgf(t, mu, sigma):
    return np.exp(mu * t + 0.5 * (sigma**2) * t**2)

# MGF values at t = 0 and t = 1
mgf_values = [mgf(t, mu, sigma) for t in [0, 1]]
print("MGF values for t = 0 and t = 1:", mgf_values)
Explanations
• Probability Density Function (PDF):
■ The PDF provides the likelihood of each value. Compute these
values using dist.pdf(x_values).

• Cumulative Distribution Function (CDF):


■ The CDF gives the probability that the variable takes on a
value less than or equal to a given value. Compute this using
dist.cdf(x_values).

• Mean (Expected Value):

■ The mean (expected value) is µ. Compute this using dist.mean().


• Variance:
■ The variance is σ 2 . Compute this using dist.var().

• Standard Deviation:

■ The standard deviation is σ. Compute this using dist.std().

• Quantiles:

■ Quantiles are values at specific probabilities. Compute these


using dist.ppf() for the desired quantile probabilities.

• Percentiles:

■ Percentiles are specific types of quantiles. Compute these using


dist.ppf() for the desired percentile probabilities.

• Moment Generating Function (MGF):


■ The MGF for a Normal distribution is given by MX(t) = exp(µt + 0.5 · σ² · t²). Define a function mgf(t, mu, sigma) to compute MGF values.

7.5 Exercises
1. Given a normal distribution X ∼ N (µ, σ 2 ), answer the following ques-
tions:
(a) If µ = 10 and σ = 2, what is the probability that X is less than 8?
(b) Find the probability that X lies between 8 and 12.

(c) Determine the value a such that P (X ≤ a) = 0.975.


2. The standard normal distribution is a special case of the normal distribu-
tion where µ = 0 and σ = 1. Use the standard normal distribution table
(z-table) to answer the following:

(a) Find P (Z ≤ 1.645) for Z ∼ N (0, 1).


(b) Determine P (−1.96 ≤ Z ≤ 1.96).
(c) Calculate the value z such that P (Z ≥ z) = 0.05.
3. Given X ∼ N (20, 25), transform X to the standard normal distribution
Z and solve the following:

(a) Find the probability that X is less than 15.


(b) Determine the probability that X lies between 18 and 22.
(c) Find the value x such that P (X ≤ x) = 0.90.


4. Answer the following application-based questions:


(a) The heights of adult men in a certain population are normally dis-
tributed with a mean of 175 cm and a standard deviation of 10 cm.
What percentage of men are taller than 190 cm?
(b) A factory produces light bulbs with lifetimes that are normally dis-
tributed with a mean of 1200 hours and a standard deviation of 100
hours. What is the probability that a randomly selected light bulb
lasts between 1100 and 1300 hours?
(c) The scores on a standardized test are normally distributed with a
mean of 500 and a standard deviation of 100. What is the minimum score needed to be in the top 5%?
5. Given a population with mean µ = 60 and standard deviation σ = 15:
(a) If a sample of size 50 is taken, what is the expected value and stan-
dard deviation of the sample mean?
AF(b) Using the Central Limit Theorem, find the probability that the sam-
ple mean is less than 58.
(c) Calculate the probability that the sample mean lies between 59 and
62.
6. A study measures the cholesterol levels (in mg/dL) of a group of patients,
which are found to follow a normal distribution with a mean of 200 mg/dL
and a standard deviation of 20 mg/dL.
(a) What is the probability that a randomly selected patient has a choles-
terol level between 180 mg/dL and 220 mg/dL?
(b) What is the 95th percentile of the cholesterol levels?

(c) Calculate the variance and standard deviation of the cholesterol lev-
els.
(d) If a cholesterol level above 240 mg/dL is considered high, what pro-
portion of the patients have high cholesterol levels?
7. Consider a population where the heights of adult women are approxi-
mately normally distributed with a mean of 65 inches and a standard
deviation of 4 inches.

(a) Using Chebyshev’s Inequality, find the minimum percentage of women


whose heights are within 10 inches of the mean height.
(b) Suppose you want to ensure that at least 95% of the women fall
within a certain number of standard deviations from the mean. Using
Chebyshev’s Inequality, determine the minimum number of standard
deviations required for this guarantee.


(c) In the same population, if the height of a randomly selected woman


is 70 inches, what is the probability that her height deviates from
the mean by at least 5 inches, according to Chebyshev’s Inequality?

8. Suppose X and Y are independent random variables representing the


weights (in kg) of two different species of fish in a lake. X follows a
normal distribution with mean 2 kg and variance 0.25 kg², and Y follows a normal distribution with mean 3 kg and variance 0.36 kg².

(a) What is the probability that a fish of the first species weighs more
than 2.5 kg?
(b) What is the probability that a fish of the second species weighs between 2.5 kg and 3.5 kg?
(c) What is the expected value of the difference in weights D = X − Y ?
(d) What is the variance of the difference in weights D?

9. Consider a sample mean X̄ from a normal distribution with population
mean µ and variance σ 2 . Assume the sample size is n = 36.

(a) If µ = 50 and σ = 12, find the probability that the sample mean is
greater than 52.
(b) Calculate the probability that the sample mean lies between 48 and
51.
(c) Determine the value of x̄ such that P (X̄ ≤ x̄) = 0.95.

7.6 Concluding Remarks



In this chapter, we have examined several essential continuous probability dis-


tributions, highlighting their definitions, properties, and applications. The Uni-
form distribution provided a foundation for understanding equal likelihood over
a range of values, while the Exponential distribution illustrated the modeling
of time intervals between events in a Poisson process.

The Normal distribution, with its profound significance in statistical theory


and practice, was explored in detail. We discussed its properties, the concept of
the Standard Normal distribution, methods for finding areas under the normal
curve, and the Central Limit Theorem, which underscores the Normal distribu-
tion’s ubiquitous presence in statistical analysis.

Understanding these continuous distributions equips us with powerful tools


for modeling and analyzing data in numerous disciplines. By mastering these


concepts, we can better interpret real-world phenomena, make informed deci-


sions, and contribute to advancements in various fields.

As we continue our journey through probability and statistics, the knowledge


gained from studying continuous distributions will serve as a crucial foundation
for more complex analyses and applications.

7.7 Chapter Exercises


1. A random variable X follows a continuous uniform distribution between
2 and 5. Calculate the probability that X is less than 3.

2. If the length of a rod is uniformly distributed between 10 and 20 cm, what
is the probability that a randomly selected rod is longer than 15 cm?
3. The time between arrivals of customers at a coffee shop follows an expo-
nential distribution with a mean of 5 minutes. What is the probability
that the next customer will arrive within 3 minutes?
4. A radioactive substance has a half-life of 10 years. What is the probability
that a sample will decay in less than 5 years?
5. A set of test scores is normally distributed with a mean of 70 and a
standard deviation of 10. What is the probability that a randomly selected
score is greater than 85?
6. In a factory, the weight of bags of flour is normally distributed with a
mean of 50 kg and a standard deviation of 2 kg. What percentage of bags
weigh between 48 kg and 52 kg?

7. A company’s delivery times follow a normal distribution with a mean of 30


minutes and a standard deviation of 5 minutes. What is the probability
that a delivery takes longer than 40 minutes?
8. Compare the probabilities of an event occurring within a specified time
frame using both the exponential distribution (with a mean of 6 minutes)
and the continuous uniform distribution (from 0 to 12 minutes).

9. A car rental service finds that the time a customer spends renting a car
follows a normal distribution with a mean of 4 days and a standard devi-
ation of 1.5 days. What is the probability that a customer rents a car for
less than 3 days?


Table 7.1: A.2: Cumulative Distribution Function of the Standard Normal Distribution

z .00 .01 .02 .03 .04 .05 .06 .07 .08 .09

−3.4 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0002
−3.3 0.0005 0.0005 0.0005 0.0004 0.0004 0.0004 0.0004 0.0004 0.0004 0.0003
−3.2 0.0007 0.0007 0.0006 0.0006 0.0006 0.0006 0.0006 0.0005 0.0005 0.0005
−3.1 0.0010 0.0009 0.0009 0.0009 0.0008 0.0008 0.0008 0.0008 0.0007 0.0007
−3.0 0.0013 0.0013 0.0013 0.0012 0.0012 0.0011 0.0011 0.0011 0.0010 0.0010

−2.9 0.0019 0.0018 0.0018 0.0017 0.0016 0.0016 0.0015 0.0015 0.0014 0.0014
−2.8 0.0026 0.0025 0.0024 0.0023 0.0023 0.0022 0.0021 0.0021 0.0020 0.0019
−2.7 0.0035 0.0034 0.0033 0.0032 0.0031 0.0030 0.0029 0.0028 0.0027 0.0026
−2.6 0.0047 0.0045 0.0044 0.0043 0.0041 0.0040 0.0039 0.0038 0.0037 0.0036
−2.5 0.0062 0.0060 0.0059 0.0057 0.0055 0.0054 0.0052 0.0051 0.0049 0.0048
−2.4 0.0082 0.0080 0.0078 0.0075 0.0073 0.0071 0.0069 0.0068 0.0066 0.0064
−2.3 0.0107 0.0104 0.0102 0.0099 0.0096 0.0094 0.0091 0.0089 0.0087 0.0084
−2.2 0.0139 0.0136 0.0132 0.0129 0.0125 0.0122 0.0119 0.0116 0.0113 0.0110
−2.1 0.0179 0.0174 0.0170 0.0166 0.0162 0.0158 0.0154 0.0150 0.0146 0.0143
−2.0 0.0228 0.0222 0.0217 0.0212 0.0207 0.0202 0.0197 0.0192 0.0188 0.0183
−1.9 0.0287 0.0281 0.0274 0.0268 0.0262 0.0256 0.0250 0.0244 0.0239 0.0233
−1.8 0.0359 0.0351 0.0344 0.0336 0.0329 0.0322 0.0314 0.0307 0.0301 0.0294
−1.7 0.0446 0.0436 0.0427 0.0418 0.0409 0.0401 0.0392 0.0384 0.0375 0.0367
−1.6 0.0548 0.0537 0.0526 0.0516 0.0505 0.0495 0.0485 0.0475 0.0465 0.0455

−1.5 0.0668 0.0655 0.0643 0.0630 0.0618 0.0606 0.0594 0.0582 0.0571 0.0559
−1.4 0.0808 0.0793 0.0778 0.0764 0.0749 0.0735 0.0721 0.0708 0.0694 0.0681
−1.3 0.0968 0.0951 0.0934 0.0918 0.0901 0.0885 0.0869 0.0853 0.0838 0.0823
−1.2 0.1151 0.1131 0.1112 0.1093 0.1075 0.1056 0.1038 0.1020 0.1003 0.0985
−1.1 0.1357 0.1335 0.1314 0.1292 0.1271 0.1251 0.1230 0.1210 0.1190 0.1170
−1.0 0.1587 0.1562 0.1539 0.1515 0.1492 0.1469 0.1446 0.1423 0.1401 0.1379
−0.9 0.1841 0.1814 0.1788 0.1762 0.1736 0.1711 0.1685 0.1660 0.1635 0.1611
−0.8 0.2119 0.2090 0.2061 0.2033 0.2005 0.1977 0.1949 0.1922 0.1894 0.1867
−0.7 0.2420 0.2389 0.2358 0.2327 0.2296 0.2266 0.2236 0.2206 0.2177 0.2148
−0.6 0.2743 0.2709 0.2676 0.2643 0.2611 0.2578 0.2546 0.2514 0.2483 0.2451
−0.5 0.3085 0.3050 0.3015 0.2981 0.2946 0.2912 0.2877 0.2843 0.2810 0.2776
−0.4 0.3446 0.3409 0.3372 0.3336 0.3300 0.3264 0.3228 0.3192 0.3156 0.3121
−0.3 0.3821 0.3783 0.3745 0.3707 0.3669 0.3632 0.3594 0.3557 0.3520 0.3483
−0.2 0.4207 0.4168 0.4129 0.4090 0.4052 0.4013 0.3974 0.3936 0.3897 0.3859
−0.1 0.4602 0.4562 0.4522 0.4483 0.4443 0.4404 0.4364 0.4325 0.4286 0.4247
0.0 0.5000 0.4960 0.4920 0.4880 0.4840 0.4801 0.4761 0.4721 0.4681 0.4641

Table 7.2: A.3: Cumulative Distribution Function of the Standard Normal Distribution

z .00 .01 .02 .03 .04 .05 .06 .07 .08 .09

0 0.5000 0.5040 0.5080 0.5120 0.5160 0.5200 0.5240 0.5279 0.5319 0.5359
0.1 0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.5754
0.2 0.5793 0.5832 0.5871 0.5910 0.5948 0.5987 0.6026 0.6064 0.6103 0.6141
0.3 0.6179 0.6217 0.6255 0.6293 0.6331 0.6368 0.6406 0.6443 0.6480 0.6517
0.4 0.6554 0.6591 0.6628 0.6664 0.6700 0.6736 0.6772 0.6808 0.6844 0.6879

0.5 0.6915 0.6950 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157 0.7190 0.7224
0.6 0.7258 0.7291 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486 0.7518 0.7549
0.7 0.7580 0.7612 0.7642 0.7673 0.7704 0.7734 0.7764 0.7794 0.7823 0.7852
0.8 0.7881 0.7910 0.7939 0.7967 0.7996 0.8023 0.8051 0.8079 0.8106 0.8133
0.9 0.8159 0.8186 0.8212 0.8238 0.8264 0.8289 0.8315 0.8340 0.8365 0.8389
1.0 0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8600 0.8621
1.1 0.8643 0.8665 0.8686 0.8708 0.8729 0.8749 0.8770 0.8790 0.8810 0.8830
1.2 0.8849 0.8869 0.8888 0.8907 0.8925 0.8944 0.8962 0.8980 0.8997 0.9015
1.3 0.9032 0.9049 0.9066 0.9082 0.9099 0.9115 0.9131 0.9147 0.9162 0.9177
1.4 0.9192 0.9207 0.9222 0.9236 0.9251 0.9265 0.9279 0.9292 0.9306 0.9319
1.5 0.9332 0.9345 0.9357 0.9370 0.9382 0.9394 0.9406 0.9418 0.9430 0.9441
1.6 0.9452 0.9463 0.9474 0.9485 0.9495 0.9505 0.9515 0.9525 0.9535 0.9545
1.7 0.9554 0.9564 0.9573 0.9582 0.9591 0.9599 0.9608 0.9616 0.9625 0.9633
1.8 0.9641 0.9649 0.9656 0.9664 0.9671 0.9678 0.9686 0.9693 0.9700 0.9706

1.9 0.9713 0.9719 0.9726 0.9732 0.9738 0.9744 0.9750 0.9756 0.9762 0.9767
2.0 0.9773 0.9778 0.9783 0.9788 0.9793 0.9798 0.9803 0.9808 0.9812 0.9817
2.1 0.9821 0.9826 0.9830 0.9834 0.9838 0.9842 0.9846 0.9850 0.9854 0.9857
2.2 0.9861 0.9864 0.9868 0.9871 0.9875 0.9878 0.9881 0.9884 0.9887 0.9890
2.3 0.9893 0.9896 0.9898 0.9901 0.9904 0.9906 0.9909 0.9911 0.9913 0.9916
2.4 0.9918 0.9920 0.9922 0.9924 0.9927 0.9929 0.9931 0.9932 0.9934 0.9936
2.5 0.9938 0.9940 0.9942 0.9943 0.9945 0.9946 0.9948 0.9949 0.9951 0.9952
2.6 0.9953 0.9955 0.9956 0.9957 0.9958 0.9959 0.9960 0.9961 0.9962 0.9963
2.7 0.9964 0.9965 0.9966 0.9967 0.9968 0.9969 0.9970 0.9971 0.9972 0.9973
2.8 0.9974 0.9975 0.9976 0.9977 0.9977 0.9978 0.9979 0.9979 0.9980 0.9981
2.9 0.9981 0.9982 0.9982 0.9983 0.9984 0.9984 0.9985 0.9985 0.9986 0.9986
3.0 0.9987 0.9987 0.9987 0.9988 0.9988 0.9989 0.9989 0.9989 0.9990 0.9990

Chapter 8

Confidence Interval Estimation
8.1 Introduction
Interval estimation is a crucial concept in statistics and data science, providing
a range of values within which a population parameter is expected to lie. Unlike
point estimation, which gives a single value as an estimate of the population
parameter, interval estimation provides an interval, giving a measure of relia-
bility to the estimation.

In many real-world applications, it is not sufficient to estimate a parameter


with a single value due to the inherent variability in data. Interval estima-
tion addresses this issue by incorporating this variability and providing a range

within which the parameter is likely to fall. This chapter will explore various
methods of constructing confidence intervals for different population parame-
ters, including means, proportions, and variances.

8.2 Confidence Intervals


A confidence interval is a statistical tool that provides a range of values de-
rived from a data set, allowing researchers to estimate a population parameter.
Unlike a single point estimate, which may not accurately reflect the true value,
a confidence interval captures the uncertainty inherent in sampling. This inter-
val is constructed around a sample statistic, such as a mean or proportion, and
is designed to include the true population parameter at a specified confidence
level.

The level of confidence, typically expressed as a percentage such as 90%,


95%, or 99%, quantifies how certain we are that the interval contains the true


parameter. For instance, a 95% confidence interval suggests that if we were to


take many random samples and calculate an interval for each, approximately
95% of those intervals would encompass the actual population parameter. This
underscores the importance of understanding variability in data and offers a
more comprehensive view of the estimate’s reliability. Thus, confidence inter-
vals serve as a critical tool in statistics, enabling researchers to communicate
not just point estimates but also the associated uncertainty.

Confidence Interval (CI): A confidence interval is a range of values


derived from sample data that is likely to contain the true population
parameter with a specified level of confidence (e.g., 95%).

Let’s say we want to know the average height of adult women in a city. We
can’t measure every woman, so we take a sample of 100 women. From this sam-
ple, we find that the average height is 64 inches, and the variability in heights
is 3 inches.
AFNow, we’re pretty confident that the average height of all women in the
city is around 64 inches but we can’t be absolutely certain. There’s a chance
the true average is a bit higher or lower. To show this uncertainty, we use a
confidence interval. This is a range of values where we believe the true average
lies, with a certain level of confidence. For example, a 95% confidence interval
might be from 63.41 to 64.59 inches, meaning we’re 95% sure the true average
is between 63.41 and 64.59 inches.

In the next sections, we will explore the math and theory behind confidence
intervals.

8.3 Confidence Intervals for the Population Mean


When the population variance σ 2 is known, the confidence interval for the
population mean µ is given by:

x̄ ± ME ⇒ (x̄ − ME, x̄ + ME) (8.1)

where
\[ \text{ME} = z\left(\frac{\sigma}{\sqrt{n}}\right) \]
is called the margin of error (ME). This margin quantifies the uncertainty
associated with our estimate of the population mean. Here, x̄ represents the
sample mean, z is the critical value from the standard normal distribution
corresponding to P(Z < −z) = α/2 and P(Z > z) = α/2, σ is the population


standard deviation, and n is the sample size. We denote this critical value as
z = z_{α/2}, as illustrated in Figure 8.1. The term σ/√n represents the standard error
(SE) of x̄.
The confidence interval for the population mean, as described in Equation
8.1, can be expressed as:
\[ P\left(\bar{x} - z\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + z\frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha. \]

Figure 8.1: Standard normal distribution with shaded area α/2 to the right of z_{α/2} and α/2 to the left of −z_{α/2}.

Margin of Error (ME): The margin of error is a statistical measure


that indicates the potential difference between survey results and the true
population value. It provides a range around the sample estimate within
which the true value is likely to fall. For example, if a poll shows 60%
DR

support for a candidate with a margin of error of ±4%, the actual support
could be between 56% and 64%. The expression of ME of the confidence
interval for the mean is given in Figure 8.2.

The value of z can be found from the standard normal distribution, as given in Table 8.2. Table 8.1 summarizes important z-values for various confidence levels:

Table 8.1: z-values for common confidence levels

100(1 − α)%   α   α/2   1 − α/2   z = z_{α/2}

90% 0.10 0.05 0.95 1.645


95% 0.05 0.025 0.975 1.960
99% 0.01 0.005 0.995 2.576
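These critical values come directly from the inverse standard normal cdf; a minimal sketch that reproduces Table 8.1:

from scipy.stats import norm

for conf in (0.90, 0.95, 0.99):
    alpha = 1 - conf
    z = norm.ppf(1 - alpha / 2)  # z_{alpha/2}
    print(f"{conf:.0%} confidence: z = {z:.3f}")
# 90%: 1.645, 95%: 1.960, 99%: 2.576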


Table 8.2: The value of α/2 with corresponding values z_{α/2}.

α/2   z_{α/2}   α/2   z_{α/2}   α/2   z_{α/2}   α/2   z_{α/2}   α/2   z_{α/2}
0.001 3.090 0.021 2.034 0.041 1.739 0.061 1.546 0.081 1.398
0.002 2.878 0.022 2.014 0.042 1.728 0.062 1.538 0.082 1.392
0.003 2.748 0.023 1.995 0.043 1.717 0.063 1.530 0.083 1.385
0.004 2.652 0.024 1.977 0.044 1.706 0.064 1.522 0.084 1.379
0.005 2.576 0.025 1.960 0.045 1.695 0.065 1.514 0.085 1.372
0.006 2.512 0.026 1.943 0.046 1.685 0.066 1.506 0.086 1.366

0.007 2.457 0.027 1.927 0.047 1.675 0.067 1.499 0.087 1.359
0.008 2.409 0.028 1.911 0.048 1.665 0.068 1.491 0.088 1.353
0.009 2.366 0.029 1.896 0.049 1.655 0.069 1.483 0.089 1.347
0.010 2.326 0.030 1.881 0.050 1.645 0.070 1.476 0.090 1.341
0.011 2.290 0.031 1.866 0.051 1.635 0.071 1.468 0.091 1.335
0.012 2.257 0.032 1.852 0.052 1.626 0.072 1.461 0.092 1.329
0.013 2.226 0.033 1.838 0.053 1.616 0.073 1.454 0.093 1.323
0.014 2.197 0.034 1.825 0.054 1.607 0.074 1.447 0.094 1.317
0.015 2.170 0.035 1.812 0.055 1.598 0.075 1.440 0.095 1.311
0.016 2.144 0.036 1.799 0.056 1.589 0.076 1.433 0.096 1.305
0.017 2.120 0.037 1.787 0.057 1.580 0.077 1.426 0.097 1.299
0.018 2.097 0.038 1.774 0.058 1.572 0.078 1.419 0.098 1.293
0.019 2.075 0.039 1.762 0.059 1.563 0.079 1.412 0.099 1.287
0.020 2.054 0.040 1.751 0.060 1.555 0.080 1.405 0.100 1.282
When the population variance σ 2 is unknown and the sample size is less
than 30 (i.e., n < 30), the confidence interval for the population mean µ is
given by:
x̄ ± ME ⇒ (x̄ − ME, x̄ + ME)

where
\[ \text{ME} = t\left(\frac{s}{\sqrt{n}}\right), \]
and t = t_{α/2,ν} is the critical value from the t-distribution with ν = n − 1 degrees
of freedom, s is the sample standard deviation, and n is the sample size. The
critical value of t for the desired confidence level can be found in Table 8.3.

Remark 8.3.1. When the sample size is greater than or equal to 30 (i.e.,
n ≥ 30), the central limit theorem suggests that the sampling distribution of
the sample mean is approximately normally distributed, even if the population


Figure 8.2: The margin of error (ME) for the confidence interval for the population mean: if σ² is known, ME = z(σ/√n); if σ² is unknown and n ≥ 30, ME = z(s/√n); if σ² is unknown and n < 30, ME = t(s/√n).
distribution is unknown. In such cases, it is recommended to use the critical
value z = z_{α/2} instead of t = t_{α/2}.

However, if the sample size is smaller than 30 and the population variance
is unknown, it is generally recommended to use the critical value from the t-
distribution, denoted as t = t_{α/2}, which accounts for additional variability due to
the smaller sample size.
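The decision rule summarized in Figure 8.2 translates directly into code. Below is a sketch of a small helper; the function name and interface are illustrative, not part of any standard library:

import math
from scipy.stats import norm, t

def margin_of_error(n, conf, sigma=None, s=None):
    # Follows Figure 8.2: z with known sigma or a large sample, otherwise t
    alpha = 1 - conf
    if sigma is not None:                        # population variance known
        return norm.ppf(1 - alpha / 2) * sigma / math.sqrt(n)
    if n >= 30:                                  # unknown variance, large sample
        return norm.ppf(1 - alpha / 2) * s / math.sqrt(n)
    return t.ppf(1 - alpha / 2, df=n - 1) * s / math.sqrt(n)

print(round(margin_of_error(n=30, conf=0.90, sigma=5), 3))  # 1.502 (Problem 8.1)
print(round(margin_of_error(n=20, conf=0.95, s=15), 2))     # 7.02 (Problem 8.3)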

Problem 8.1. A researcher is studying the time spent on social media by



teenagers. Based on a previous study, the population variance is known to be 25


minutes². A sample of 30 teenagers shows a mean time of 120 minutes. Con-
struct the following confidence intervals for the population mean and compare
their widths and implications:

(i). 90% confidence interval


(ii). 95% confidence interval
(iii). 99% confidence interval

Solution
It is given that
• Sample mean (x̄) = 120 minutes

• Population variance σ² = 25 minutes², so population standard deviation σ = √25 = 5 minutes


Figure 8.3: A.4: Critical points t_{α/2,ν} of the t-distribution with its degrees of freedom (ν)

ν \ (α/2)   0.10   0.05   0.025   0.01   0.005   0.001   0.0005

1 3.078 6.314 12.706 31.821 63.657 318.31 636.62


2 1.886 2.920 4.303 6.965 9.925 22.326 31.598
3 1.638 2.353 3.182 4.541 5.841 10.213 12.924
4 1.533 2.132 2.776 3.747 4.604 7.173 8.610
5 1.476 2.015 2.571 3.365 4.032 5.893 6.869
6 1.440 1.943 2.447 3.143 3.707 5.208 5.959

T
7 1.415 1.895 2.365 2.998 3.499 4.785 5.408
8 1.397 1.860 2.306 2.896 3.355 4.501 5.041
9 1.383 1.833 2.262 2.821 3.250 4.297 4.781
10 1.372 1.812 2.228 2.764 3.169 4.144 4.587
11 1.363 1.796 2.201 2.718 3.106 4.025 4.437
12 1.356 1.782 2.179 2.681 3.055 3.930 4.318
13 1.350 1.771 2.160 2.650 3.012 3.852 4.221
14 1.345 1.761 2.145 2.624 2.977 3.787 4.140
15 1.341 1.753 2.131 2.602 2.947 3.733 4.073
16 1.337 1.746 2.120 2.583 2.921 3.686 4.015
17 1.333 1.740 2.110 2.567 2.898 3.646 3.965
18 1.330 1.734 2.101 2.552 2.878 3.610 3.922
19 1.328 1.729 2.093 2.539 2.861 3.579 3.883
20 1.325 1.725 2.086 2.528 2.845 3.552 3.850
21 1.323 1.721 2.080 2.518 2.831 3.527 3.819
22 1.321 1.717 2.074 2.508 2.819 3.505 3.792
23 1.319 1.714 2.069 2.500 2.807 3.485 3.767
24 1.318 1.711 2.064 2.492 2.797 3.467 3.745
25 1.316 1.708 2.060 2.485 2.787 3.450 3.725

26 1.315 1.706 2.056 2.479 2.779 3.435 3.707


27 1.314 1.703 2.052 2.473 2.771 3.421 3.690
28 1.313 1.701 2.048 2.467 2.763 3.408 3.674
29 1.311 1.699 2.045 2.462 2.756 3.396 3.659
30 1.310 1.697 2.042 2.457 2.750 3.385 3.646
40 1.303 1.684 2.021 2.423 2.704 3.307 3.551
60 1.296 1.671 2.000 2.390 2.660 3.232 3.460
120 1.289 1.658 1.980 2.358 2.617 3.160 3.373
∞ 1.282 1.645 1.960 2.326 2.576 3.090 3.291

• Sample size (n) = 30

(i). 90% Confidence Interval: For a 90% confidence interval, the critical
value z is approximately 1.645.
Calculating the margin of error:


   
\[ \text{Margin of Error} = z\left(\frac{\sigma}{\sqrt{n}}\right) = 1.645\left(\frac{5}{\sqrt{30}}\right) \approx 1.645 \times 0.912 \approx 1.502 \]
Thus, the 90% confidence interval is:

120 ± 1.502 ⇒ (118.498, 121.502)


The width of the 90% confidence interval is 121.502 - 118.498 = 3.004 minutes.

Comment: The 90% confidence interval suggests that we are 90% confident
that the true mean time spent on social media by all teenagers lies between

T
118.498 and 121.502 minutes.

(ii). 95% Confidence Interval: For a 95% confidence interval, the critical
value z is approximately 1.96.
Calculating the margin of error:
\[ \text{Margin of Error} = 1.96\left(\frac{5}{\sqrt{30}}\right) \approx 1.96 \times 0.912 \approx 1.790 \]
Thus, the 95% confidence interval is:
\[ 120 \pm 1.790 \Rightarrow (118.210, 121.790) \]


The width of the 95% confidence interval is 121.790 - 118.210 = 3.580 minutes.

Comment: The 95% confidence interval indicates that we can be 95%


confident that the true mean time spent on social media by all teenagers falls
between 118.210 and 121.790 minutes. There is a 5% chance that the true mean is outside this interval.
DR

(iii). 99% Confidence Interval: For a 99% confidence interval, the critical
value z is approximately 2.576.
Calculating the margin of error:
 
\[ \text{Margin of Error} = 2.576\left(\frac{5}{\sqrt{30}}\right) \approx 2.576 \times 0.912 \approx 2.352 \]
Thus, the 99% confidence interval is:

120 ± 2.352 ⇒ (117.648, 122.352)


The width of the 99% confidence interval is 122.352 - 117.648 = 4.704 minutes.

Comment: The 99% confidence interval suggests that we are 99% confident
that the true mean time spent on social media by all teenagers is between
117.648 and 122.352 minutes. There is only a 1% chance that the true mean is
not in this interval.


Comparison of Confidence Intervals


The widths of the confidence intervals are:
• 90% confidence interval: 3.004 minutes

• 95% confidence interval: 3.580 minutes

• 99% confidence interval: 4.704 minutes


As the confidence level increases, the width of the confidence interval also
increases. This is because a higher confidence level requires a larger critical
z-value, which in turn increases the margin of error (z · σ/√n).

T
Implications: A narrower interval provides a more precise estimate but with
lower confidence, while a wider interval provides higher confidence but with
less precision. The choice of confidence level depends on the desired level of
certainty and the acceptable margin of error for the study. For critical decisions,
a higher confidence level (e.g., 99%) might be preferred to minimize the risk
AF
of error, even if it results in a wider interval. For less critical situations, a
lower confidence level (e.g., 90% or 95%) might be acceptable if a more precise
estimate is desired.
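The three intervals in Problem 8.1 can also be generated in one loop (a sketch; using unrounded z-values shifts the limits by about 0.001 relative to the hand computation):

import math
from scipy.stats import norm

x_bar, sigma, n = 120, 5, 30

for conf in (0.90, 0.95, 0.99):
    z = norm.ppf(1 - (1 - conf) / 2)
    me = z * sigma / math.sqrt(n)
    print(f"{conf:.0%}: ({x_bar - me:.3f}, {x_bar + me:.3f}), width = {2*me:.3f}")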
Problem 8.2. A factory produces light bulbs with a known standard deviation
of 100 hours in their lifespan. A sample of light bulbs has an average lifespan
of 1000 hours. Construct a 99% confidence interval for the mean lifespan of the
light bulbs for the following sample sizes:
(i). n = 20
(ii). n = 30
(iii). n = 50

Solution: It is given that


• Population standard deviation (σ) = 100

• Sample mean (x̄) = 1000


For a 99% confidence level, the critical value z is:
z ≈ 2.576

(i). For n = 20:


\[ \text{ME} = 2.576 \cdot \frac{100}{\sqrt{20}} \approx 57.60 \]
\[ \text{CI} = 1000 \pm 57.60 \Rightarrow (942.40, 1057.60) \]
This interval suggests we are 99% confident that the true mean lifespan of the light bulbs is between 942.40 and 1057.60 hours.


(ii). For n = 30:


\[ \text{ME} = 2.576 \cdot \frac{100}{\sqrt{30}} \approx 47.03 \]
\[ \text{CI} = 1000 \pm 47.03 \Rightarrow (952.97, 1047.03) \]
This interval indicates we are 99% confident that the true mean lifespan of the
light bulbs falls between 952.97 and 1047.03 hours. As the sample size increases,
the margin of error decreases, resulting in a narrower confidence interval.

(iii). For n = 50:


\[ \text{ME} = 2.576 \cdot \frac{100}{\sqrt{50}} \approx 36.43 \]
\[ \text{CI} = 1000 \pm 36.43 \Rightarrow (963.57, 1036.43) \]
This interval shows we are 99% confident that the true mean lifespan of the
light bulbs is between 963.57 and 1036.43 hours. With a larger sample size, the
confidence interval is even narrower, reflecting greater precision in estimating
the population mean.

Comparison of Confidence Intervals For the three sample sizes, the con-
fidence intervals are:
• For n = 20: CI = (942.40, 1057.60)
• For n = 30: CI = (952.97, 1047.03)
• For n = 50: CI = (963.57, 1036.43)


Increasing the sample size results in a more precise estimate of the population
mean, as seen in the narrowing of the confidence intervals.
Factors Influencing the Width and Reliability of a Confidence Interval
The width and reliability of a confidence interval are influenced by several key
factors:
1. Sample Size (n): Larger sample sizes generally lead to narrower confi-
dence intervals because they reduce the standard error of the mean. This

increases the reliability of the estimate of the population parameter.


2. Confidence Level (1 − α): Higher confidence levels (e.g., 99% vs. 90%)
result in wider confidence intervals. This is because a higher confidence
level requires capturing a broader range of values to ensure the true pop-
ulation parameter is included.
3. Population Variability (σ 2 ): The greater the variability or spread in
the population (measured by the population variance), the wider the con-
fidence interval. High variability means there’s more uncertainty about
the population mean.


Increasing sample size, reducing variability, and lowering the confidence level
lead to a narrower confidence interval.
Problem 8.3. Consider a sample of 20 measurements of blood pressure with a
mean of 130 mmHg and a standard deviation of 15 mmHg. Construct a 95%
confidence interval for the population mean blood pressure.

Solution: In this case, we have, s = 15, x̄ = 130, n = 20. Since the sample
size is small (n < 30) and the population standard deviation is unknown, we will
use the t-distribution. The degrees of freedom (ν) is calculated as n − 1 = 19.
From the appropriate t-distribution table (e.g., see “Statistical Tables for the

FT
Student’s t-Distribution given in Table 8.3”), the critical value t for a 95% con-
fidence level with 19 degrees of freedom is approximately 2.093.

The margin of error (ME) is calculated as:


   
\[ \text{ME} = t\left(\frac{s}{\sqrt{n}}\right) = 2.093\left(\frac{15}{\sqrt{20}}\right) \approx 2.093 \times 3.354 \approx 7.02 \]

Thus, the 95% confidence interval is:

CI = x̄ ± ME = 130 ± 7.02
This gives us:
A
CI = (130 − 7.02, 130 + 7.02) ⇒ (122.98, 137.02)

We are 95% confident that the true population mean blood pressure is between
122.98 mmHg and 137.02 mmHg.
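SciPy can construct this t-interval in one call; a sketch using the frozen t distribution with the standard error as its scale:

import math
from scipy import stats

x_bar, s, n = 130, 15, 20
se = s / math.sqrt(n)
ci = stats.t(df=n - 1, loc=x_bar, scale=se).interval(0.95)
print(tuple(round(v, 2) for v in ci))  # (122.98, 137.02)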

Problem 8.4. A study measures the daily caffeine intake of 25 adults. The
R
sample has a mean intake of 200 mg and a standard deviation of 50 mg. Con-
struct a 90% confidence interval for the population mean daily caffeine intake.

Solution: Here, s = 50, x̄ = 200, n = 25. Since the sample size is small
(n < 30), we will use the t-distribution. The degrees of freedom (ν) is n−1 = 24.
From Table 8.3, the critical value t for a 90% confidence level with 24 degrees

of freedom is approximately 1.711.


The margin of error (ME) is calculated as:
   
\[ \text{ME} = t\left(\frac{s}{\sqrt{n}}\right) = 1.711\left(\frac{50}{\sqrt{25}}\right) = 1.711 \times 10 = 17.11 \]
Thus, the 90% confidence interval is:

CI = x̄ ± ME = 200 ± 17.11
This gives us


CI = (200 − 17.11, 200 + 17.11) ⇒ (182.89, 217.11)


We are 90% confident that the true population mean daily caffeine intake is
between 182.89 mg and 217.11 mg.
Problem 8.5. A study measures the effectiveness of a new diet on weight loss.
A sample of 25 participants who followed the diet for one month had an average
weight loss of 4.5 kg with a standard deviation of 1.2 kg. Construct a 90%
confidence interval for the mean weight loss.

Solution: In this case, we have, s = 1.2, x̄ = 4.5, n = 25. Since the sample

size is small (n < 30) and the population standard deviation is unknown, we will
use the t-distribution. The degrees of freedom (ν) is calculated as n − 1 = 24.
From the appropriate t-distribution table (e.g., see Table 8.3), the critical value
t for a 90% confidence level with 24 degrees of freedom is approximately 1.711.

Next, we calculate the margin of error (ME):


\[ \text{ME} = t\left(\frac{s}{\sqrt{n}}\right) = 1.711\left(\frac{1.2}{\sqrt{25}}\right) = 1.711 \times 0.24 = 0.41064 \]

Thus, the 90% confidence interval is:

CI = x̄ ± ME = 4.5 ± 0.41064
This results in:

CI = (4.5 − 0.41064, 4.5 + 0.41064) ⇒ (4.089, 4.911)


Thus, we are 90% confident that the true mean weight loss for participants

on the new diet is between approximately 4.09 kg and 4.91 kg.

Problem 8.6. A research team is evaluating the effect of a new vaccine on


reducing symptoms of a respiratory illness. In a sample of 30 vaccinated indi-
viduals, the average reduction in symptoms score is 6.3 with a standard deviation
of 1.8. Construct a 99% confidence interval for the mean reduction in symptoms
score.

Solution: In this case, we have, s = 1.8, x̄ = 6.3, n = 30. Since the sample
size is large (n ≥ 30), we can use the normal distribution. The critical value z
for a 99% confidence level can be found from the standard normal distribution
table, which is approximately 2.576.
The margin of error (ME) is
   
\[ \text{ME} = z\left(\frac{s}{\sqrt{n}}\right) = 2.576\left(\frac{1.8}{\sqrt{30}}\right) = 2.576 \times 0.328 \approx 0.845 \]


Thus, the 99% confidence interval is:

CI = x̄ ± ME = 6.3 ± 0.845
This results in:

CI = (6.3 − 0.845, 6.3 + 0.845) ⇒ (5.455, 7.145)


We are 99% confident that the true mean reduction in symptoms score for
vaccinated individuals is between approximately 5.46 and 7.15.

Problem 8.7. A biologist studies the effect of a new fertilizer on plant growth.

A sample of 10 plants is measured for growth in centimeters over a month. The
growth measurements are as follows:

12.5, 15.3, 14.0, 16.7, 13.5, 14.8, 15.1, 13.2, 17.0, 12.0

(i). Calculate the sample mean (x̄) and sample standard deviation (s) of the
AF growth measurements.
(ii). Using a 95% confidence level, construct a confidence interval for the mean
growth of the plants.

Solution
(i). Given the growth measurements of 10 plants:

12.5, 15.3, 14.0, 16.7, 13.5, 14.8, 15.1, 13.2, 17.0, 12.0
Using Table 8.3, the sample mean (x̄) is

x̄ = 144.1/10 = 14.41


Table 8.3: Growth Measurements and Their Squares

i   xᵢ   xᵢ²
1 12.5 156.25
2 15.3 234.09
3 14.0 196.00
4 16.7 278.89
5 13.5 182.25
6 14.8 219.04
7 15.1 228.01
8 13.2 174.24

T
9 17.0 289.00
10 12.0 144.00
Total 144.1 2101.77
AF
Using the formula for the sample variance, we have

s² = (1/(n−1)) (Σxᵢ² − n x̄²) = (1/(10−1)) (2101.77 − 10 × 14.41²) ≈ 2.81

Hence, the sample standard deviation (s) is

s = √s² = √2.81 ≈ 1.68
(ii). 95% Confidence Interval: For a 95% confidence level, the critical
value t = 2.262 for degrees of freedom ν = 9. So, the margin of error is
calculated as follows:

ME = t × (s/√n) = 2.262 × (1.68/√10) ≈ 2.262 × 0.531 ≈ 1.20
The confidence interval is given by:

CI = x̄ ± ME
Thus, we have:

CI = 14.41 ± 1.20 ⇒ (13.21, 15.61)


We are 95% confident that the true mean growth of the plants lies between
13.21 cm and 15.61 cm.


Python Code

import numpy as np
import scipy.stats as stats

# Growth measurements in centimeters
growth_measurements = [12.5, 15.3, 14.0, 16.7, 13.5, 14.8,
                       15.1, 13.2, 17.0, 12.0]

# (i) Calculate the sample mean and sample standard deviation
sample_mean = np.mean(growth_measurements)
sample_std_dev = np.std(growth_measurements, ddof=1)  # Sample standard deviation

print(f"Sample Mean: {sample_mean:.2f} cm")
print(f"Sample Standard Deviation (s): {sample_std_dev:.2f} cm")

# (ii) Construct a 95% confidence interval for the mean growth
confidence_level = 0.95
n = len(growth_measurements)  # Sample size

# Critical t-value for the given confidence level
t_critical = stats.t.ppf((1 + confidence_level) / 2, df=n - 1)

# Standard error of the mean
standard_error = sample_std_dev / np.sqrt(n)

# Margin of error
margin_of_error = t_critical * standard_error

# Confidence interval
ci_lower = sample_mean - margin_of_error
ci_upper = sample_mean + margin_of_error

print(f"95% Confidence Interval for the mean growth: ({ci_lower:.2f}, {ci_upper:.2f}) cm")

Listing 8.1: The 95% Confidence Interval for the Population Mean.

Problem 8.8. A study is conducted to determine the average cholesterol level


in a population after a new treatment. A sample of 50 patients shows an average
cholesterol level of 190 mg/dL with a standard deviation of 15 mg/dL. Construct
a 95% confidence interval for the mean cholesterol level.


Solution: Here, s = 15, x̄ = 190, n = 50. Since the sample size is large
(n ≥ 30), we will use the normal distribution. The critical value z for a 95%
confidence level is approximately 1.96.

The margin of error (ME) is calculated as:

ME = z × (s/√n) = 1.96 × (15/√50) ≈ 1.96 × 2.121 ≈ 4.16
Thus, the 95% confidence interval is:

CI = x̄ ± ME = 190 ± 4.16

This gives us:

(190 − 4.16, 190 + 4.16) ⇒ (185.84, 194.16)


We are 95% confident that the true population mean cholesterol level is between
185.84 mg/dL and 194.16 mg/dL.
Problem 8.9. A clinical trial was conducted to test the effectiveness of a new
drug in reducing blood pressure. A sample of 40 patients who received the drug
had an average reduction in blood pressure of 8 mmHg with a sample standard
deviation of 2.5 mmHg. Construct a 95% confidence interval for the mean
reduction in blood pressure.

Solution: Here, s = 2.5, x̄ = 8 and n = 40. Since n ≥ 30, we use the normal
distribution. The critical value for a 95% confidence level is z ≈ 1.96.

Calculate the margin of error (ME):

ME = z × (s/√n) = 1.96 × (2.5/√40) ≈ 1.96 × 0.395 ≈ 0.774
The 95% confidence interval is:

x̄ ± ME = 8 ± 0.774 ⇒ (7.226, 8.774)


We are 95% confident that the true mean reduction in blood pressure is
between approximately 7.23 mmHg and 8.77 mmHg.

8.4 Confidence Intervals for Variances and Standard Deviations
In data science, estimating the variance and standard deviation of a popula-
tion is essential for understanding the variability in data. Confidence intervals
provide a range of plausible values for these parameters based on sample data.


8.4.1 Confidence Interval for Variance


To compute a confidence interval for the population variance (σ²), we utilize
the chi-square distribution. The confidence interval can be expressed as:

( (n − 1)s² / χ²_{α/2, n−1} ,  (n − 1)s² / χ²_{1−α/2, n−1} )        (8.2)

where:

• n = sample size

• s² = sample variance

• χ²_{α/2, n−1} = the upper chi-square critical value, with area α/2 in the
right tail and n − 1 degrees of freedom

• χ²_{1−α/2, n−1} = the lower chi-square critical value, with area 1 − α/2 in
the right tail and n − 1 degrees of freedom
AFThe critical value of χ2 can be calculated from the standard chi-square
distribution. Some values are given in Table 8.4.

8.4.2 Confidence Interval for Standard Deviation


To derive the confidence interval for the population standard deviation (σ), we
simply take the square root of the endpoints of the confidence interval for the
variance:

( √[(n − 1)s² / χ²_{α/2, n−1}] ,  √[(n − 1)s² / χ²_{1−α/2, n−1}] )        (8.3)

Important Notes
• These methods assume that the sample comes from a normally dis-
tributed population.

• The confidence interval will become wider with a decrease in sample


size or an increase in the confidence level.
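
As a quick illustration of equations (8.2) and (8.3), the sketch below computes
both intervals with scipy's chi-square quantile function, using the same inputs
as Problem 8.10 that follows (n = 25, s² = 20):

import numpy as np
import scipy.stats as stats

# Sample size and sample variance (the inputs of Problem 8.10)
n, s2 = 25, 20
alpha = 0.05  # for a 95% confidence level

# Chi-square critical values with n - 1 degrees of freedom
chi2_upper = stats.chi2.ppf(1 - alpha / 2, df=n - 1)  # area alpha/2 in the right tail
chi2_lower = stats.chi2.ppf(alpha / 2, df=n - 1)      # area 1 - alpha/2 in the right tail

# Equation (8.2): confidence interval for the variance
var_lower = (n - 1) * s2 / chi2_upper
var_upper = (n - 1) * s2 / chi2_lower

# Equation (8.3): square roots give the interval for the standard deviation
print(f"Variance CI: ({var_lower:.2f}, {var_upper:.2f})")   # approximately (12.19, 38.71)
print(f"Std. dev. CI: ({np.sqrt(var_lower):.2f}, {np.sqrt(var_upper):.2f})")  # approximately (3.49, 6.22)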

Problem 8.10. A sample of 25 measurements yields a sample variance of 20.


Construct a 95% confidence interval for the population variance and standard
deviation.

330
CHAPTER 8. CONFIDENCE INTERVAL ESTIMATION

Solution: Here, s² = 20, n = 25, and for a 95% confidence level with 24
degrees of freedom, χ²_{α/2, 24} ≈ 39.36 and χ²_{1−α/2, 24} ≈ 12.40.
The confidence interval for the variance is:

( (24 × 20)/39.36 ,  (24 × 20)/12.40 ) = (12.20, 38.71)

Thus, the 95% confidence interval for the population variance is (12.20, 38.71)
and the 95% confidence interval for the standard deviation is approximately:

( √12.20 , √38.71 ) = (3.49, 6.22).
Problem 8.11. A biostatistical study is conducted with a sample of 10 pa-

tients, and the following blood pressure measurements (in mmHg) are recorded:
120, 125, 130, 135, 140, 145, 150, 155, 160, 165. Calculate the variance and stan-
dard deviation of the blood pressure measurements. Also, compute a 95% con-
fidence interval for the variance.

Solution: First, calculate the mean x̄:

x̄ = (120 + 125 + 130 + 135 + 140 + 145 + 150 + 155 + 160 + 165)/10 = 142.5

Next, compute the variance:

s² = (1/(n−1)) (Σxᵢ² − n x̄²) = (1/(10−1)) (205125 − 10 × 142.5²) ≈ 229.17

i   xᵢ   xᵢ²
1 120 14400
2 125 15625
3 130 16900
4 135 18225
5 140 19600
6 145 21025
7 150 22500
8 155 24025
9 160 25600
10 165 27225
Total 1425 205125
For the 95% confidence interval for the variance, use the Chi-square distribution:

CI_{σ²} = [ (n − 1)s² / χ²_{α/2, n−1} ,  (n − 1)s² / χ²_{1−α/2, n−1} ]


where α = 0.05, n = 10, and the degrees of freedom ν = n − 1 = 9.

Using Chi-square critical values:

χ²_{0.025, 9} ≈ 19.023 and χ²_{0.975, 9} ≈ 2.700

CI_{σ²} = ( 9 × 229.17/19.023 ,  9 × 229.17/2.700 ) = [108.42, 763.89]

The 95% confidence interval for the variance of blood pressure measurements,
[108.42, 763.89], indicates that we are 95% confident the true population vari-
ance lies within this range. This suggests that there is significant variability
in blood pressure readings among patients, with potential values for variance
reflecting both low and high levels of dispersion.

Python Code

import numpy as np
import scipy.stats as stats

# Blood pressure measurements in mmHg
blood_pressure = [120, 125, 130, 135, 140, 145, 150, 155, 160, 165]

# Calculate variance and standard deviation
variance = np.var(blood_pressure, ddof=1)  # Sample variance
standard_deviation = np.std(blood_pressure, ddof=1)  # Sample standard deviation

print("Variance: {:.2f} sq. mmHg".format(variance))
print("Standard Deviation: {:.2f} mmHg".format(standard_deviation))

# Compute the 95% confidence interval for the variance
n = len(blood_pressure)  # Sample size
alpha = 0.05  # Significance level

# Chi-squared critical values
chi2_lower = stats.chi2.ppf(alpha / 2, df=n - 1)
chi2_upper = stats.chi2.ppf(1 - alpha / 2, df=n - 1)

# Confidence interval for the variance
ci_lower = (n - 1) * variance / chi2_upper
ci_upper = (n - 1) * variance / chi2_lower

print("95% Confidence Interval for the variance: ({:.2f}, {:.2f}) sq. mmHg".format(ci_lower, ci_upper))
Listing 8.2: The 95% Confidence Interval for the Population Variance.


Problem 8.12. A clinical trial measures the time (in minutes) taken for a
specific treatment to have an effect in 6 patients, yielding the following times:
22, 27, 25, 30, 24, 28. Determine the variance and standard deviation of the time
data. Also, compute a 99% confidence interval for the variance.

Solution: First, calculate the mean x̄:

x̄ = (22 + 27 + 25 + 30 + 24 + 28)/6 = 26
Next, compute the variance:

s² = (1/(n−1)) Σ(xᵢ − x̄)²
   = (1/(6−1)) [(22 − 26)² + (27 − 26)² + (25 − 26)² + (30 − 26)² + (24 − 26)² + (28 − 26)²]
   = (1/5) [16 + 1 + 1 + 16 + 4 + 4]
   = 42/5 = 8.4

The standard deviation is:

s = √s² = √8.4 ≈ 2.90
For the 99% confidence interval for the variance, use the Chi-square distri-
bution:

CI_{σ²} = [ (n − 1)s² / χ²_{α/2, n−1} ,  (n − 1)s² / χ²_{1−α/2, n−1} ]

where α = 0.01, n = 6, and the degrees of freedom ν = n − 1 = 5.


Using Chi-square critical values:

χ²_{0.005, 5} ≈ 16.75 and χ²_{0.995, 5} ≈ 0.4117

Then, the confidence interval is:

CI_{σ²} = ( 5 × 8.4/16.75 ,  5 × 8.4/0.4117 ) = [2.51, 102.02]
The 99% confidence interval for the variance of the treatment effect times,
[2.51, 102.02], indicates that we are 99% confident the true population variance
lies within this range. This suggests that there is considerable variability in the
time taken for the treatment to have an effect among patients.


Python Code

import numpy as np
import scipy.stats as stats

# Given times
data = np.array([22, 27, 25, 30, 24, 28])

# Calculate mean
mean = np.mean(data)

# Calculate variance (using ddof=1 for sample variance)
variance = np.var(data, ddof=1)

# Calculate standard deviation
std_dev = np.std(data, ddof=1)

# Degrees of freedom
n = len(data)
df = n - 1

# Chi-square critical values for 99% confidence interval
alpha = 0.01
chi2_lower = stats.chi2.ppf(alpha / 2, df)
chi2_upper = stats.chi2.ppf(1 - alpha / 2, df)

# Confidence interval for variance
ci_lower = (df * variance) / chi2_upper
ci_upper = (df * variance) / chi2_lower

# Output results
print(f"Mean: {mean:.2f}")
print(f"Variance: {variance:.2f}")
print(f"Standard Deviation: {std_dev:.2f}")
print(f"Critical Values: ({chi2_lower:.2f}, {chi2_upper:.2f})")
print(f"99% Confidence Interval for Variance: ({ci_lower:.2f}, {ci_upper:.2f})")

Listing 8.3: The 99% Confidence Interval for the Population Variance.

8.5 Confidence Intervals for Population Proportions

A confidence interval for a population proportion estimates the true proportion
of a population based on sample data. It is calculated using the sample
proportion p̂ = x/n, where x is the number of successes and n is the sample size.

For a given confidence level (e.g., 95%), the critical value z is determined
from the standard normal distribution. The margin of error (ME) is calculated
as:

ME = z √[p̂(1 − p̂)/n]

where the quantity

SE = √[p̂(1 − p̂)/n]

is called the standard error (SE) of p̂. The confidence interval is then constructed
as:

p̂ ± ME ⇒ (p̂ − ME, p̂ + ME)

This interval provides a range within which we can be confident that the
true population proportion lies, based on the sample data. Validity conditions
include having a sufficiently large sample size (n ≥ 30) and ensuring np ≥ 5
and n(1 − p) ≥ 5.

Confidence intervals are crucial for assessing uncertainty and making in-
formed decisions based on sample estimates.
Problem 8.13. In a survey of 500 voters, 300 indicated they would vote for
candidate A. Construct a 95% confidence interval for the proportion of voters
who support candidate A.

Solution: The sample size is n = 500 and the number of voters for candidate
A is x = 300. The sample proportion is p̂ = 300/500 = 0.6. For a 95% confidence
level, the critical value is z ≈ 1.96.

The margin of error (ME) is calculated as:

ME = z √[p̂(1 − p̂)/n] = 1.96 × √(0.6 × 0.4/500) ≈ 1.96 × 0.0219 ≈ 0.0429

Thus, the 95% confidence interval is:

p̂ ± ME = 0.6 ± 0.0429 ⇒ (0.5571, 0.6429)

We are 95% confident that the true proportion of voters supporting candi-
date A is between 55.71% and 64.29%.
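
This interval can be verified in a few lines of Python; the sketch below uses
only the counts given in the problem:

import numpy as np
import scipy.stats as stats

# Survey counts from the problem
n, x = 500, 300
p_hat = x / n  # 0.6

# 95% critical value and standard error
z = stats.norm.ppf(0.975)              # approximately 1.96
se = np.sqrt(p_hat * (1 - p_hat) / n)  # approximately 0.0219

me = z * se
print(f"95% CI: ({p_hat - me:.4f}, {p_hat + me:.4f})")  # approximately (0.5571, 0.6429)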

Problem 8.14. In a survey of 150 patients, 45 reported that they are satisfied
with their current medication. Estimate the proportion of patients satisfied with
their medication and calculate a 95% confidence interval for this proportion.


Solution: Let p̂ be the sample proportion of satisfied patients. Calculate p̂:

p̂ = 45/150 = 0.30

To compute the 95% confidence interval for the proportion, use the formula:

p̂ ± z √[p̂(1 − p̂)/n]

where n = 150 and z ≈ 1.96 for a 95% confidence level.
Compute the standard error (SE):

SE = √[p̂(1 − p̂)/n] = √(0.30 × 0.70/150) ≈ 0.037

Calculate the confidence interval:

0.30 ± 1.96 × 0.037 = 0.30 ± 0.072 ⇒ [0.228, 0.372]

So, the 95% confidence interval for the population proportion is [0.228, 0.372].
AF
Problem 8.15. A study on the effectiveness of a new drug finds that out of 200
patients, 50 show significant improvement. Estimate the proportion of patients
who show improvement and find a 90% confidence interval for this proportion.

Solution: Let p̂ be the sample proportion of patients showing improvement.
Calculate p̂:

p̂ = 50/200 = 0.25

To compute the 90% confidence interval for the proportion, use the formula:

p̂ ± z √[p̂(1 − p̂)/n]

where n = 200 and z ≈ 1.645 for a 90% confidence level.

Compute the standard error (SE):

SE = √[p̂(1 − p̂)/n] = √(0.25 × 0.75/200) ≈ 0.031

Calculate the confidence interval:

0.25 ± 1.645 × 0.031 = 0.25 ± 0.050 ⇒ [0.200, 0.300]

So, the 90% confidence interval for the proportion is [0.200, 0.300].


Python Code

import numpy as np
import scipy.stats as stats

# Given data
n = 200  # total number of patients
x = 50   # number of patients showing improvement

# Estimate the proportion
p_hat = x / n

# Confidence level
confidence_level = 0.90
alpha = 1 - confidence_level

# Standard error
se = np.sqrt((p_hat * (1 - p_hat)) / n)

# Z-score for the given confidence level
z_score = stats.norm.ppf(1 - alpha / 2)

# Margin of error
margin_of_error = z_score * se

# Confidence interval
lower_bound = p_hat - margin_of_error
upper_bound = p_hat + margin_of_error

# Output results
print(f"Estimated proportion of patients showing improvement: {p_hat:.4f}")
print(f"90% Confidence Interval: [{lower_bound:.4f}, {upper_bound:.4f}]")

Listing 8.4: The 90% Confidence Interval for the Population Proportion.

Problem 8.16. In a clinical trial, out of 80 participants, 28 reported a side


effect from the medication. Estimate the proportion of participants experiencing
side effects and compute a 99% confidence interval for this proportion.

Solution: Let p̂ be the sample proportion of participants experiencing side
effects. Calculate p̂:

p̂ = 28/80 = 0.35

To compute the 99% confidence interval for the proportion, use the formula:

p̂ ± z √[p̂(1 − p̂)/n]
where n = 80 and z ≈ 2.576 for a 99% confidence level.

Compute the standard error (SE):

SE = √[p̂(1 − p̂)/n] = √(0.35 × 0.65/80) ≈ 0.053

Calculate the confidence interval:

0.35 ± 2.576 × 0.053 = 0.35 ± 0.137 ⇒ [0.213, 0.487]

So, the 99% confidence interval for the proportion is [0.213, 0.487].
AF
8.6 Sample Size Estimation
Sample size estimation is crucial for ensuring that studies and experiments
in business are designed to provide reliable and valid results. This section
covers sample size estimation for estimating a population mean and a population
proportion, including business-related examples.

8.6.1 Sample Size for Estimating a Population Mean


When estimating the mean of a population, the required sample size depends
on the desired level of confidence, the acceptable margin of error, and the stan-
dard deviation of the population.

Estimating Sample Size with Known Population Variance


To estimate the sample size n required to achieve a desired margin of error E
with a confidence level of 1 − α, use the following formula:


n = (z × σ / E)²        (8.4)

where:
• z = z_{1−α/2} is the critical value (see Table 8.1) from the standard
normal distribution corresponding to a confidence level of 1 − α,

• σ is the standard deviation of the population,

338
CHAPTER 8. CONFIDENCE INTERVAL ESTIMATION

• E is the desired margin of error.


This formula allows researchers to determine the minimum sample size
needed to achieve a specified level of confidence while ensuring that the es-
timated mean falls within the desired margin of error. It is particularly useful
in cases where prior studies provide reliable estimates of the population stan-
dard deviation.
Problem 8.17. A company wants to estimate the average monthly spending
of its customers on their products. The company aims for a margin of error of
$10 with 95% confidence. If the standard deviation of monthly spending is $50,
determine the required sample size.

Solution: It is given that

σ = 50, E = 10, z = 1.96 (for 95% confidence level)

Calculate the sample size:

n = (1.96 × 50/10)² = (98/10)² = (9.8)² = 96.04
AF
Rounding up, the required sample size is 97.

Python Code

import scipy.stats as stats
import math

def calculate_sample_size(margin_of_error, standard_deviation, confidence_level):
    # Z-score for the given confidence level
    z_score = stats.norm.ppf((1 + confidence_level) / 2)

    # Calculate required sample size
    required_sample_size = (z_score * standard_deviation / margin_of_error) ** 2
    return math.ceil(required_sample_size)  # Round up to the nearest whole number

# Example usage
margin_of_error = 10     # Desired margin of error ($)
standard_deviation = 50  # Population standard deviation ($)
confidence_level = 0.95  # Confidence level

required_size = calculate_sample_size(margin_of_error, standard_deviation, confidence_level)
print("Required Sample Size: {}".format(required_size))

To use equation (8.4), we need a value for the population standard deviation
σ. Even if we do not know σ, we can still use equation (8.4) if we have a preliminary
value or planning value for it. Here are some practical ways to find this value:

1. Take the estimated population standard deviation from previous studies


as the value for σ.
2. Run a pilot study to gather some initial data. The standard deviation
from this sample can be used as the value for σ.

3. Use judgment or a “best guess” for the value of σ. For example, estimate
the largest and smallest values in the population. The difference between
these values gives you a range. A common suggestion is to divide this
range by 4 to get a rough estimate of the standard deviation, which can
then serve as the value for σ.
Problem 8.18. A market research firm wants to estimate the average monthly
revenue generated by small businesses. A pilot study estimated the standard
deviation of revenue to be $15,000. To ensure a margin of error of $2,000 with
a 90% confidence level, determine the required sample size.

Solution: It is given that

s = 15000, E = 2000, z = 1.645 (for 90% confidence level)

Calculate the sample size:

n = (1.645 × 15000/2000)² = (24675/2000)² = (12.3375)² ≈ 152.21

Rounding up, the required sample size is 153.


Problem 8.19. A company wants to estimate the average amount of time
employees spend on training each month. They conduct a pilot study with 15
employees and find that the standard deviation of training time is 8 hours. To
achieve a margin of error of 2 hours with a 95% confidence level, determine the
required sample size.


Solution: It is given that

s = 8 hours, E = 2 hours.

For 95% confidence, z = 1.96. Then

n = (1.96 × 8/2)² = (15.68/2)² = (7.84)² ≈ 61.47

Since the sample size must be a whole number, we round up to the next whole
number. Thus, the required sample size is n = 62.

Problem 8.20. The range for a set of data is estimated to be 36.

(a). What is the planning value for the population standard deviation?

(b). At 95% confidence, how large a sample would provide a margin of error
of 3?
AF
(c). At 95% confidence, how large a sample would provide a margin of error
of 2?

Solution: (a.)
The estimated population standard deviation (σ) can be calculated using
the range:

σ ≈ Range/4

Given that the range is 36:

σ ≈ 36/4 = 9
(b.) The formula for the sample size (n) is:

n = (z × σ / E)²
Where:
• z is the Z-score for 95% confidence (Z ≈ 1.96)

• σ is the estimated population standard deviation

• E is the margin of error


Using σ = 9 and E = 3:

n = (1.96 × 9/3)² = (17.64/3)² = (5.88)² ≈ 34.57
Rounding up, we find:

n ≈ 35
(c.) Using the same formula with E = 2:

n = (1.96 × 9/2)² = (17.64/2)² = (8.82)² ≈ 77.79

Rounding up, we find:

n ≈ 78

8.6.2 Sample Size for Estimating a Population Proportion


When estimating a population proportion, the sample size depends on the de-
sired confidence level, margin of error, and the estimated proportion.

To estimate the sample size n required to achieve a desired margin of error


E with a confidence level 1 − α, use the following formula:

n = z² p(1 − p) / E²
where,

• z = z_{α/2} is the critical value from the standard normal distribution
corresponding to a confidence level of 1 − α,

• p is the estimated proportion of the population,

• E is the desired margin of error.


If the true proportion p is unknown, use p = 0.5 for a conservative estimate.
Problem 8.21. A retailer wants to estimate the proportion of customers who
are satisfied with their new product. The retailer aims for a margin of error of
0.05 with 99% confidence. If the estimated proportion of satisfied customers is
0.6, determine the required sample size.


Solution: Given,

p = 0.6, E = 0.05, z = 2.576 (for 99% confidence level)

Calculate the sample size:

n = 2.576² × 0.6 × (1 − 0.6) / 0.05² = 6.6358 × 0.24 / 0.0025 ≈ 1.5926/0.0025 ≈ 637.03

Rounding up, the required sample size is 638.
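
This calculation is easy to check in Python. A minimal sketch, using the
tabulated critical value z = 2.576 so that it matches the hand computation
above, is:

import math

# Problem 8.21 with the tabulated critical value for 99% confidence
z, p, E = 2.576, 0.6, 0.05

n_raw = z ** 2 * p * (1 - p) / E ** 2
print(f"n = {n_raw:.2f} -> {math.ceil(n_raw)}")  # n = 637.03 -> 638
# Note: the exact quantile (2.5758...) gives 636.95, i.e., n = 637; the
# required n is sensitive to how the critical value is rounded.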

8.6.3 Sample Size Estimation for Finite Populations

When estimating the mean or proportion of a population, it is important to
consider whether the population is finite or small (generally under 5,000). In
this case, the standard sample size formulas may overestimate the needed sam-
ple size. Adjustments are made to account for the limited number of individuals
available, ensuring that the sample more accurately represents the population.
AF
The formula for estimating the sample size n when the population size N is
known is given by:
n = n₀ / (1 + n₀/N)
where,
• n: The final sample size adjusted for the finite population.

• n0 : The initial sample size calculated without considering the popula-


tion size.
• N : The total number of individuals in the population.

Example:
To illustrate the process of estimating the sample size for a finite population,
consider the following scenario: Suppose you are conducting a study to assess
the prevalence of a certain health condition within a population of residents in
a small town. You have determined the following parameters for your study:

• Population Size N : 1000

• Margin of Error E: 0.05 (5%)

• Confidence Level: 95% (corresponding z-value ≈ 1.96)

• Estimated Proportion p: 0.5 (maximum variability)


Step 1: Calculate the Initial Sample Size n0


Using the formula for the initial sample size, we can compute n0 :

n₀ = z² · p(1 − p) / E²

Substituting the values into the formula gives:

n₀ = (1.96)² · 0.5 · (1 − 0.5) / (0.05)² = (3.8416) · (0.25) / 0.0025 = 0.9604/0.0025 ≈ 384.16

Thus, the initial sample size n0 is approximately 384.
Remark 8.6.1. When calculating the initial sample size n0 , we usually round
to the nearest whole number since we cannot survey a fraction of a person. The
method of rounding can depend on the specific context or research guidelines.
Both rounding to the nearest whole number and rounding up are acceptable, as
long as the reasoning behind the choice is clear.
AF
Step 2: Adjust for Finite Population
Next, we apply the finite population correction to determine the adjusted sam-
ple size n:
n = n₀ / (1 + n₀/N)

Substituting n₀ = 384 and N = 1000:

n = 384 / (1 + 384/1000) = 384/1.384 ≈ 277.46

After rounding up, the final sample size n is 278.
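
Both steps of this example can be scripted directly. A minimal sketch, using
the same z, p, E, and N as above, is:

import math

# Inputs from the example
z, p, E, N = 1.96, 0.5, 0.05, 1000

# Step 1: initial sample size, rounded to the nearest whole number (Remark 8.6.1)
n0 = round(z ** 2 * p * (1 - p) / E ** 2)  # 384.16 -> 384

# Step 2: finite population correction
n = n0 / (1 + n0 / N)  # 384 / 1.384
print(f"Adjusted sample size: {n:.2f} -> {math.ceil(n)}")  # 277.46 -> 278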

8.7 Concluding Remarks


This chapter covered the fundamental concepts of interval estimation, including
confidence intervals for means, proportions, and variances. We explored the
interpretation and construction of confidence intervals and provided practical
examples to illustrate these concepts. Interval estimation is a powerful tool in
statistics, offering a range of values for population parameters and providing a
measure of reliability for the estimates.


8.8 Chapter Exercises


1. A nutritionist studies the effect of a new diet on weight loss. A sample
of 25 participants followed the diet for six weeks, resulting in an average
weight loss of 6.2 kg with a known population standard deviation of 1.5
kg. Construct a 95% confidence interval for the mean weight loss of all
participants on this diet.

2. An engineer tests the durability of a new type of material. A sample of


15 test pieces showed an average lifespan of 120 hours with a population
standard deviation of 20 hours. Calculate the 99% confidence interval for
the average lifespan of this material.

3. A teacher measures the effectiveness of a new teaching method. In a sam-
ple of 40 students, the average test score was 78 with a known population
standard deviation of 10. Construct a 90% confidence interval for the
mean test score of all students taught with this method.
4. A pharmaceutical company is evaluating the effectiveness of a new drug.
AFA sample of 50 patients reported an average improvement score of 8.5 on
a health scale, with a population standard deviation of 2. Create a 95%
confidence interval for the mean improvement score of all patients taking
the drug.
5. A sociologist studies the average number of hours people spend on social
media. In a sample of 30 individuals, the average time spent was 3.2 hours
with a known population standard deviation of 1.1 hours. Construct a
99% confidence interval for the mean time spent on social media by the
population.
6. A horticulturist studies the effect of a new watering technique on plant
growth. A sample of 10 plants is measured for growth in centimeters over


a month. The growth measurements are as follows:

11.8, 14.2, 15.0, 13.6, 12.3, 16.1, 14.4, 13.8, 15.7, 12.5

(i). Calculate the sample mean (x̄) and sample standard deviation (s) of
the growth measurements.
(ii). Using a 95% confidence level, construct a confidence interval for the
mean growth of the plants.
(iii). If the horticulturist wants to increase the confidence level to 99%,
how would that affect the width of the confidence interval? Calculate
the new confidence interval.
(iv). If the growth measurements were to decrease by 1.5 cm for each
plant due to an adverse weather condition, how would this change
the sample mean and the confidence interval?


(v). Based on the confidence interval obtained, discuss the effectiveness


of the new watering technique. What conclusions can be drawn?

7. A survey of 200 students revealed that 120 of them prefer online classes to
in-person classes. Construct a 95% confidence interval for the proportion
of all students who prefer online classes.
8. In a clinical trial, 150 patients were given a new treatment, and 90 re-
ported improvement in their condition. Calculate a 99% confidence inter-
val for the proportion of patients who respond positively to the treatment.
9. A marketing firm conducted a survey of 500 customers and found that 275

are satisfied with their services. Using a 90% confidence level, construct
a confidence interval for the proportion of all customers who are satisfied.
10. A poll conducted on 1,000 voters shows that 430 support a certain can-
didate. Determine the 95% confidence interval for the proportion of all
voters who support this candidate.
11. In a study of cholesterol levels in a population, the following measure-
ments (in mg/dL) were recorded: 210, 220, 230, 240, 250, 260. Calculate
the variance and standard deviation of the cholesterol levels. Also, com-
pute a 90% confidence interval for the variance.
12. A researcher wants to estimate the proportion of voters who support a
new policy with a margin of error of 0.05 and a confidence level of 95%.
If a previous study found that 60% of voters support the policy, calculate
the required sample size.
13. A company aims to determine the average time (in minutes) customers
spend on their website. They want to estimate the mean with a margin of

error of 2 minutes and a confidence level of 99%. If the standard deviation


from a pilot study is 10 minutes, find the required sample size.
14. A medical researcher wants to estimate the average blood pressure of
patients in a certain age group within 3 mmHg, with a 95% confidence
level. From previous studies, the standard deviation is known to be 12
mmHg. Calculate the sample size needed.
15. A social scientist wants to estimate the mean income of households in a
city. They want a margin of error of 500 dollars with a confidence level of
90%. If previous research shows that the standard deviation of household
income is approximately 3000 dollars, what sample size is necessary?
16. An education researcher plans to conduct a study on student performance
and wants to estimate the mean score on a standardized test with a margin
of error of 1.5 points. If the standard deviation of test scores is 6 points,
calculate the required sample size for a 95% confidence level.


[Figure: chi-square density curve with the upper-tail critical value χ²_α marked.]

Figure 8.4: Chi-Square Distribution


Table 8.4: Percentage points of the chi-square distribution (χ²_{ν,u})

ν χ².995 χ².99 χ².975 χ².95 χ².90 χ².75 χ².50 χ².25 χ².10 χ².05 χ².025 χ².01 χ².005 χ².001
1 0.00 0.00 0.00 0.00 0.02 0.10 0.45 1.32 2.71 3.84 5.02 6.63 7.88 10.83
2 0.01 0.02 0.05 0.10 0.21 0.58 1.39 2.77 4.61 5.99 7.38 9.21 10.60 13.81

T
3 0.07 0.12 0.22 0.35 0.58 1.21 2.37 4.11 6.25 7.81 9.35 11.34 12.84 16.27
4 0.21 0.30 0.48 0.71 1.06 1.92 3.36 5.39 7.78 9.49 11.14 13.28 14.86 18.47
5 0.41 0.55 0.83 1.15 1.61 2.67 4.35 6.63 9.24 11.07 12.83 15.09 16.75 20.52
6 0.68 0.87 1.24 1.64 2.20 3.45 5.35 7.84 10.64 12.59 14.45 16.81 18.55 22.46
7 0.99 1.24 1.69 2.17 2.83 4.25 6.35 9.04 12.02 14.07 16.01 18.48 20.28 24.32
8 1.34 1.65 2.18 2.73 3.49 5.07 7.34 10.22 13.36 15.51 17.53 20.09 21.95 26.12
AF
9 1.73 2.09 2.70 3.33 4.17 5.90 8.34 11.39 14.68 16.92 19.02 21.67 23.59 27.88
10 2.16 2.56 3.25 3.94 4.87 6.74 9.34 12.55 15.99 18.31 20.48 23.21 25.19 29.59
11 2.60 3.05 3.82 4.57 5.58 7.58 10.34 13.70 17.28 19.68 21.92 24.72 26.76 31.26
12 3.07 3.57 4.40 5.23 6.30 8.44 11.34 14.85 18.55 21.03 23.34 26.22 28.30 32.91
13 3.57 4.11 5.01 5.89 7.04 9.30 12.34 15.98 19.81 22.36 24.74 27.69 29.82 34.53
14 4.07 4.66 5.63 6.57 7.79 10.17 13.34 17.12 21.06 23.68 26.12 29.14 31.32 36.12
15 4.60 5.23 6.27 7.26 8.55 11.04 14.34 18.25 22.31 25.00 27.49 30.58 32.80 37.70
16 5.14 5.81 6.91 7.96 9.31 11.91 15.34 19.37 23.54 26.30 28.85 32.00 34.27 39.25
17 5.70 6.41 7.56 8.67 10.09 12.79 16.34 20.49 24.77 27.59 30.19 33.41 35.72 40.79
18 6.26 7.01 8.23 9.39 10.86 13.68 17.34 21.60 25.99 28.87 31.53 34.81 37.16 42.31
19 6.84 7.63 8.91 10.12 11.65 14.56 18.34 22.72 27.20 30.14 32.85 36.19 38.58 43.82
20 7.43 8.26 9.59 10.85 12.44 15.45 19.34 23.83 28.41 31.41 34.17 37.57 40.00 45.32
DR

21 8.03 8.90 10.28 11.59 13.24 16.34 20.34 24.93 29.62 32.67 35.48 38.93 41.40 46.80
22 8.64 9.54 10.98 12.34 14.04 17.24 21.34 26.04 30.81 33.92 36.78 40.29 42.80 48.27
23 9.26 10.20 11.69 13.09 14.85 18.14 22.34 27.14 32.01 35.17 38.08 41.64 44.18 49.73
24 9.89 10.86 12.40 13.85 15.66 19.04 23.34 28.24 33.20 36.42 39.36 42.98 45.56 51.18
25 10.52 11.52 13.12 14.61 16.47 19.94 24.34 29.34 34.38 37.65 40.65 44.31 46.93 52.62
26 11.16 12.20 13.84 15.38 17.29 20.84 25.34 30.43 35.56 38.89 41.92 45.64 48.29 54.05
27 11.81 12.88 14.57 16.15 18.11 21.75 26.34 31.53 36.74 40.11 43.19 46.96 49.64 55.48
28 12.46 13.56 15.31 16.93 18.94 22.66 27.34 32.62 37.92 41.34 44.46 48.28 50.99 56.89
29 13.12 14.26 16.05 17.71 19.77 23.57 28.34 33.71 39.09 42.56 45.72 49.59 52.34 58.30
30 13.79 14.95 16.79 18.49 20.60 24.48 29.34 34.80 40.26 43.77 46.98 50.89 53.67 59.70
40 20.71 22.16 24.43 26.51 29.05 33.66 39.34 45.62 51.81 55.76 59.34 63.69 66.77 73.40
50 27.99 29.71 32.36 34.76 37.69 42.94 49.33 56.33 63.17 67.50 71.42 76.15 79.49 86.66
60 35.53 37.48 40.48 43.19 46.46 52.29 59.33 66.98 74.40 79.08 83.30 88.38 91.95 99.61
70 43.28 45.44 48.76 51.74 55.33 61.70 69.33 77.58 85.53 90.53 95.02 100.42 104.22 112.32
80 51.17 53.54 57.15 60.39 64.28 71.14 79.33 88.13 96.58 101.88 106.63 112.33 116.32 124.84
90 59.20 61.75 65.65 69.13 73.29 80.62 89.33 98.64 107.56 113.14 118.14 124.12 128.30 137.21
100 67.33 70.06 74.22 77.93 82.36 90.13 99.33 109.14 118.50 124.34 129.56 135.81 140.17 149.45

Chapter 9

Hypothesis Testing for Decision Making
AF
9.1 Introduction
In data science, hypothesis testing is a fundamental aspect of statistical in-
ference, providing a systematic method for making decisions about population
parameters based on sample data. This chapter aims to equip readers with a
thorough understanding of hypothesis testing, its importance, and its applica-
tion in various statistical scenarios.

We begin by introducing the basic concepts of hypothesis testing, including


the formulation of null and alternative hypotheses and the significance level,
which determines the threshold for making decisions. The subsequent sections
DR

outline the step-by-step process of conducting hypothesis tests, emphasizing the


calculation of test statistics and the determination of acceptance and rejection
regions.

A crucial part of this chapter is understanding the role of the p-value in


hypothesis testing, as well as its interpretation and significance. We provide il-
lustrative examples involving normal distributions, addressing cases with both
known and unknown variances. Additionally, the chapter explores the concept
of power in hypothesis testing, demonstrating how to calculate it and why it is
vital for ensuring the reliability of test results.

Finally, we cover the methods for estimating the sample size required for
mean and proportion tests, highlighting the importance of adequate sample
sizes for achieving meaningful and accurate conclusions. The chapter concludes
with a set of exercises designed to reinforce the concepts and techniques dis-
cussed, providing practical experience in applying hypothesis testing methods.


This comprehensive introduction to hypothesis testing serves as a founda-


tion for more advanced statistical analyses, enabling readers to make informed
decisions based on empirical data.

9.2 Concepts of Hypothesis Testing


Consider a scenario where you work for an e-commerce company that has re-
cently redesigned its website to enhance its visual appeal. The marketing team
hypothesizes that this new design will lead to an increase in sales. To evaluate
this hypothesis, you conduct an A/B test, where half of the visitors are shown

the old website (Version A) and the other half are exposed to the new design
(Version B).

After running the test for several weeks, you gather data on the number of
purchases made by visitors from both versions of the website. Your goal is to
determine whether the new design genuinely results in more sales compared to
the old version or if any observed differences are attributable to random chance.

Hypothesis testing is a statistical methodology that allows researchers to


make inferences about population parameters based on sample data. This
method is essential for determining whether observed data can substantiate
a specific claim or hypothesis regarding the population. The process begins by
formulating two competing hypotheses:
• The null hypothesis, denoted as H0

• The alternative hypothesis, denoted as H1 or Ha


DR

The term “null” in “null hypothesis” reflects the concept of a “null effect”
or “no effect.” It serves as a baseline assumption that there is no significant
difference, relationship, or effect between the variables under investigation. Es-
sentially, the null hypothesis is the starting point for statistical testing, positing
that any observed differences or effects are due to random chance rather than
a true underlying effect. If the evidence is robust enough to reject the null
hypothesis, it suggests the presence of a significant effect or relationship that
merits further investigation.
Hypothesis: A hypothesis is an assertion or assumption regarding popu-
lation parameters or characteristics of random variables.
• Null Hypothesis (H0 ): The null hypothesis asserts that there
is no effect or difference between the groups or variables being
studied.

• Alternative Hypothesis (H1 or Ha ): The alternative hypothesis
proposes that there is a significant effect or difference between the
groups or variables being studied.

The alternative hypothesis represents the researcher’s objective, positing the


existence of an effect or difference.
For example, a pharmaceutical company may aim to assess whether a new
therapy provides greater benefits than an existing treatment. In another case,
a health organization might explore differences in physiological indicators be-
tween newborns in rural and urban environments. In such studies, researchers
formulate hypotheses about the population parameters and utilize sample data
to test these hypotheses. For instance, if the average cholesterol level in children
is established at 175 mg/dL, researchers could investigate whether the average

cholesterol level of children whose fathers died from heart disease is significantly
higher than this benchmark. In this case, the hypotheses are following:

• Null Hypothesis (H0 ): The average cholesterol level of children
whose fathers died from heart disease is equal to 175 mg/dL.
(H0 : µ = 175)

• Alternative Hypothesis (H1 or Ha ): The average cholesterol level
of children whose fathers died from heart disease is greater than 175
mg/dL. (H1 : µ > 175)

Importance of Hypothesis Testing


Hypothesis testing plays a crucial role in data analysis for several reasons:

1. Avoiding Misinterpretation of Results: Without hypothesis testing,


one might observe that Version B has slightly higher sales than Version

A and mistakenly conclude that the new design is definitively superior.


Such a conclusion may ignore the possibility that the difference could arise
from random variations in the data.
2. Structured Decision-Making: Hypothesis testing provides a system-
atic framework for evaluating whether the difference in sales is statisti-
cally significant. It enables a clear approach for determining whether the
observed effect is substantial enough to warrant changes to the website
design based on the data collected.
3. Informed Decision-Making: By employing hypothesis testing, you can
make data-driven decisions about implementing the new design site-wide.
If the test indicates that the new design significantly increases sales, you
may opt to adopt it. Conversely, if the results are not statistically sig-
nificant, you might decide to retain the old design or explore further
improvements.

350
CHAPTER 9. HYPOTHESIS TESTING FOR DECISION MAKING

9.3 Steps for Hypothesis Testing


Hypothesis testing involves the following steps:

Step 1: Formulate the null and alternative hypotheses

Step 2: Set the level of significance, usually denoted by α

Step 3: Collect data, define a test statistic, and then compute it

Step 4: Construct acceptance and rejection regions by collecting the tabulated
value from the sampling distribution of the statistic, such as

■ standard normal table for Z-test
■ standard t-table for T -test
■ F -table for F -test
■ χ2 -table for χ2 -test, etc.
for the level of significance α, or compute the p-value, and then define the deci-
sion rule

Step 5: Based on steps 3 and 4, draw a conclusion about H0 .


■ reject H0
▶ if the value of test statistic belongs to the rejection region
at level of significance 100α%.
▶ or the p-value is less than α.
■ otherwise, do not reject H0 at level of significance 100α%.

Details of each step of hypothesis testing for decision-making are explained in


the following subsections.

9.3.1 Formulating Hypotheses


A hypothesis becomes a statistical hypothesis when it is formulated in a
way that allows it to be tested using statistical methods. The first step in
hypothesis testing is formulating statistical hypotheses, which consist of the null
hypothesis (H0 ) and the alternative hypothesis (H1 or Ha ). Formulating the
null and alternative hypotheses involves clearly defining the research question
and the population parameter of interest. The hypotheses must be mutually
exclusive and collectively exhaustive, meaning that one of them must be true,
and both cannot be true simultaneously.


Statistical Hypothesis

Null Hypothesis (denoted by H0 ):

• assumed to be true
• a given value
• assumption, nothing new
• independent
• negation of the research aim
• usually contains an equality (e.g., =, ≥, ≤)

Alternative Hypothesis (denoted by H1 or Ha ):

• any statement without H0
• rejection of a given true value
• rejection of the assumption
• does not contain equality (usually contains >, <, ≠)

Null Hypothesis (H0 )


The null hypothesis represents a statement of no effect, no difference, or the
status quo. It is an assertion that any observed difference or effect in the data
is due to random chance rather than a real underlying effect. For example, if a
pharmaceutical company is testing a new drug, the null hypothesis might state
that the new drug has no effect on patients compared to the existing treatment.
Mathematically, the null hypothesis often includes an equality sign, such as

H0 : θ = θ0

where θ is the population parameter and θ0 is the hypothesized value of the
population parameter.

lation parameter.

Alternative Hypothesis (H1 or Ha )


The alternative hypothesis represents a statement that contradicts the null
hypothesis. It indicates the presence of an effect, a difference, or a relationship
that the researcher aims to detect. The alternative hypothesis is what the
researcher hopes to support through evidence from the data. It can be one-
sided or two-sided, depending on the research question. For example, if the
pharmaceutical company believes the new drug is more effective, the alternative
hypothesis might state that the new drug has a greater effect than the existing
treatment
H1 : θ > θ0


In two-sided tests, the hypothesis might state that there is a difference in either
direction
H1 : θ ̸= θ0

9.3.2 Level of Significance


The second step in hypothesis testing involves establishing the level of sig-
nificance, denoted by α. This threshold is chosen before conducting the test
and represents the probability of rejecting the null hypothesis (H0 ) when it is
actually true, known as a Type I error. Common values for α include 0.05,
0.01, and 0.10, with 0.05 being the most prevalent. For example, if α is set
at 0.05, there is a 5% chance of incorrectly concluding that a significant effect

exists when it does not. This chosen α level reflects the researcher’s tolerance
for making an erroneous decision. A lower α means a stricter criterion for re-
jecting H0 , which reduces the risk of a Type I error but may increase the risk
of a Type II error—failing to reject a false null hypothesis. This threshold thus
sets the stage for evaluating the test results, ensuring that decisions align with
the predetermined risk level.

Level of Significance: The level of significance is the threshold proba-


bility of rejecting the null hypothesis when it is true.

Type I and Type II Errors


Understanding Type I and Type II errors is crucial for interpreting hypothesis
testing results. A Type I error occurs when the null hypothesis is rejected when
it is true, with the probability of this error being denoted by α. For instance,
if α = 0.05, the chance of incorrectly rejecting H0 is 5%. Conversely, a Type
II error occurs when the null hypothesis is not rejected when it is false, with

the probability of this error denoted by β. The complement of β (i.e., 1 − β)


represents the power of the test, which measures the probability of correctly
rejecting a false null hypothesis. See the Table 9.1. The trade-off between α
and β is critical; reducing α to decrease Type I errors may increase β, making
it more likely to miss a true effect. Therefore, choosing appropriate values for
α and β involves balancing the risks of both types of errors in the context of
the study.

Types of error

• Type I error: Reject H0 when it is true.
• Type II error: Do not reject H0 when it is false.


Table 9.1: Level of Significance and Power of the test.

                                In Reality
Decision        H0 is TRUE                       H0 is FALSE
Accept H0       Correct Decision                 Type II error
                (1 − α = Confidence level)       (β = Pr(Type II Error))
Reject H0       Type I Error                     Correct Decision
                (α = Pr(Type I Error))           (1 − β = Power of the test)

9.3.3 Test Statistics
The test statistic provides a standardized value that is calculated from sample
data during a hypothesis test. It quantifies the discrepancy between the ob-
served sample data and the null hypothesis. The test statistic is a random
variable because it is derived from random sample data, and it follows a prob-
ability distribution. In general, the test statistic for the location parameter test
defined in null hypothesis H0 : θ = θ0 is

Test statistic = (θ̂ − θ0) / se(θ̂)

where θ is a location parameter (it could be a mean, median, quantile, etc.),
θ̂ is its estimator, and se(θ̂) is the standard error of θ̂.
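
To make the definition concrete, the following sketch computes this standardized
statistic for a sample mean, using simulated (entirely illustrative) data:

import numpy as np

# Illustrative data: hypothetical cholesterol measurements
rng = np.random.default_rng(42)
sample = rng.normal(loc=176, scale=30, size=40)

theta_0 = 175                                   # hypothesized value under H0
theta_hat = sample.mean()                       # estimator of the location parameter
se = sample.std(ddof=1) / np.sqrt(len(sample))  # standard error of the mean

test_statistic = (theta_hat - theta_0) / se
print(f"Test statistic: {test_statistic:.3f}")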

9.3.4 Acceptance and Rejection Regions


In hypothesis testing, we define acceptance and rejection regions to decide
whether to reject the null hypothesis H0 . The acceptance region is the range of

values for which we fail to reject the null hypothesis H0 . The rejection region
is the range of values for which we reject H0 . The critical value zcri defines the
boundary between these regions. Figure 9.1, Figure 9.2, and Figure 9.3 show
the acceptance and rejection regions for a left-tailed test, a right-tailed test,
and a two-tailed test, respectively, at a significance level of α.


[Figure: standard normal density; the rejection region is the left tail below −zα, with the acceptance region to its right.]

Figure 9.1: Acceptance and Rejection Regions for a Left-Sided Test at α = 0.05.
[Figure: standard normal density; the rejection region is the right tail above zα, with the acceptance region to its left.]

Figure 9.2: Acceptance and Rejection Regions for a Right-Tailed Test at α = 0.05.


[Figure: standard normal density; the rejection regions are the two tails beyond ±zα/2, with the acceptance region between them.]

Figure 9.3: Acceptance and Rejection Regions for a Two-Sided Test at α = 0.05.

The graphs above represent the standard normal distribution (for Z-test),
which is a common distribution used in hypothesis testing. The curve shown is
the probability density function of the standard normal distribution, N (0, 1),
which is symmetric about the mean µ = 0.
• Critical Value (zcri ): Critical values help determine whether to reject
the null hypothesis H0 based on the significance level α and test type.

■ One-Tailed Test: For α, the critical value is zα (upper tail) or


−zα (lower tail). E.g., for α = 0.05, zα = 1.645. Reject H0 if

the test statistic exceeds the critical value.


■ Two-Tailed Test: For α, the critical values are ±zα/2 . E.g., for
α = 0.05, zα/2 ≈ 1.96. Reject H0 if the test statistic is outside
±zα/2 .

• Acceptance Region: Area to the left of zα where H0 is not rejected.

• Rejection Region: Area to the right of zα where H0 is rejected.

Interpreting the Regions


• If the test statistic is in the acceptance region, do not reject H0 (the
data is consistent with H0 ).

• If the test statistic is in the rejection region, reject H0 (the data is


unlikely under H0 ).


9.3.5 Decision Rules


In hypothesis testing, there are two common methods for making decisions:
the critical-value method and the p-value method. The critical-value method
involves comparing the test statistic to a critical value. The decision rules are:

Critical-Value Method:

• One-tailed test (right-tailed): Reject H0 if the test statistic is


greater than the critical value; otherwise, fail to reject H0 .

• One-tailed test (left-tailed): Reject H0 if the test statistic is less

T
than the critical value; otherwise, fail to reject H0 .

• Two-tailed test: Reject H0 if the test statistic is outside the critical


values; otherwise, fail to reject H0 .

The p-value Method:


• Reject H0 if the p-value is less than or equal to α; otherwise, fail to
reject H0 .
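
In practice, the critical values used in these rules can be obtained from scipy
rather than printed tables. A minimal sketch for the Z-test at α = 0.05 is:

import scipy.stats as stats

alpha = 0.05

# Critical values from the standard normal distribution
z_right = stats.norm.ppf(1 - alpha)    # right-tailed: about 1.645
z_left = stats.norm.ppf(alpha)         # left-tailed: about -1.645
z_two = stats.norm.ppf(1 - alpha / 2)  # two-tailed: about 1.960

print(f"Right-tailed: reject H0 if z_cal > {z_right:.3f}")
print(f"Left-tailed:  reject H0 if z_cal < {z_left:.3f}")
print(f"Two-tailed:   reject H0 if |z_cal| > {z_two:.3f}")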

9.4 The p-value


In hypothesis testing, the p-value helps us determine the strength of the evi-
dence against the null hypothesis, H0 . It is represented by the shaded area in
Figure 9.4. The p-value is the area under the curve to
the right of the test statistic in a right-tailed test (as shown in the graph). It
quantifies the evidence against the null hypothesis: a smaller p-value indicates

stronger evidence against H0 .

The p-value: The p-value represents the probability of observing the test
statistic as extreme as, or more extreme than, the value observed if the
null hypothesis is true. A small p-value (less than α) indicates that the
observed data is unlikely under the null hypothesis, leading us to reject
H0 .


[Figure: density curve with the tail area beyond the observed test statistic shaded.]

Figure 9.4: The p-value is shown as the shaded area.

Interpreting the p-value


• If the p-value is less than or equal to the significance level α, we re-
ject the null hypothesis H0 . This indicates that the observed data is
unlikely under H0 .

• If the p-value is greater than α, we fail to reject the null hypothesis H0 .


This means there is insufficient evidence to conclude that the observed
data is inconsistent with H0 .

Guidelines for Judging the Significance of a p-Value


• If 0.01 ≤ p-value < 0.05, then the results are significant.

• If 0.001 ≤ p-value < 0.01, then the results are highly significant.

• If p-value < 0.001, then the results are very highly significant.

• If p-value > 0.05, then the results are considered not statistically sig-
nificant (sometimes denoted by NS).
However, if 0.05 < p-value < 0.10, then a trend toward statistical significance
is sometimes noted.
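
These guidelines are easy to encode as a small convenience helper (the labels
below simply mirror the list above):

def significance_label(p_value):
    """Label a p-value according to the guidelines above."""
    if p_value < 0.001:
        return "very highly significant"
    elif p_value < 0.01:
        return "highly significant"
    elif p_value < 0.05:
        return "significant"
    elif p_value < 0.10:
        return "trend toward statistical significance"
    return "not statistically significant (NS)"

for p in (0.0004, 0.004, 0.03, 0.07, 0.20):
    print(p, "->", significance_label(p))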

9.5 Why is hypothesis testing so important?


Hypothesis testing is crucial because it provides an objective framework for
making decisions based on data, ensuring consistency and reproducibility in
research. By quantifying the strength of evidence against the null hypothesis,
it allows researchers to make informed decisions while controlling for Type I


and Type II errors. This statistical method is essential in scientific research,


policy-making, and business decisions, enhancing the credibility and reliability
of findings. Through hypothesis testing, researchers can test theories, validate
models, and make predictions, driving innovation and scientific progress by
challenging existing knowledge and exploring new ideas.

9.6 Hypothesis Testing for Means


This section focuses on hypothesis tests specifically for means, which are essen-
tial for making inferences about the average values of populations or comparing
means across different groups. We will explore three key types of tests:

• One-Sample Test of Means: This test assesses whether the mean of
a single sample differs from a known or hypothesized population mean.
It is useful when we want to determine if the sample mean is consistent
with a particular value.
AF• Testing Equality of Two Means: Here, we compare the means
of two independent samples to evaluate whether there is a significant
difference between them. This is crucial when examining the effects of
different treatments or conditions on two separate groups.

• Testing Equality of Several Means: This test extends the compari-


son to more than two groups, determining whether there are significant
differences among the means of multiple samples. It is commonly used
in experiments with multiple treatment levels or groups.

Each of these tests plays a critical role in statistical inference, helping re-
searchers and analysts make data-driven decisions and understand underlying

patterns. We will delve into the methodology, assumptions, and applications of


these tests in the following subsections.

9.6.1 One-Sample Test of Means


A one-sample test of means is a statistical method used to determine whether
the average (mean) of a single sample is significantly different from a known
population mean. This test is particularly useful for assessing whether sample
data aligns with a specific standard or expectation.

Formulating Hypotheses
Before conducting the test, researchers establish two hypotheses:

• Null Hypothesis (H0 ): The sample mean is equal to the hypothe-


sized population mean.
H0 : µ = µ0


• Alternative Hypothesis (Ha ): The sample mean is not equal to the


hypothesized population mean.

Ha : µ ̸= µ0

This is a two-tailed test because we are interested in deviations in both


directions (greater or less).

Assumptions
Before performing the test, several key assumptions must be considered:

1. Normality: The sample data should be approximately normally dis-
tributed, particularly if the sample size is small (less than 30).
2. Independence: Each observation within the sample must be indepen-
dent of others.

3. Sample Size: Larger samples enhance the approximation of a normal


distribution.

Choosing the Test


Researchers have two primary options for the statistical test:
• T-test: This test is appropriate for small samples (usually, n < 30)
when the population variance is unknown. For the T-test, the statistic is

T = (x̄ − µ0) / (s/√n)

where

s = √[ (1/(n−1)) Σ(xᵢ − x̄)² ]

is the sample standard deviation.

• Z-test: This test is suitable for larger samples (usually, n ≥ 30) or
when the population variance is known. For the Z-test, the test statistic is

Z = (x̄ − µ0) / (σ/√n)

In these formulas, x̄ represents the sample mean, µ0 is the population mean


being compared against, s is the sample standard deviation (for the T-test),
and σ is the population standard deviation (for the Z-test).


Critical-Value Method
Next, researchers determine a critical value using statistical tables based on the
chosen significance level (commonly set at 0.05). This value aids in deciding
whether to reject the null hypothesis. Alternatively, the p-value is calculated to
indicate the probability of observing the results if the null hypothesis is true.

Decision Rule for Z-test


Under H0, the sampling distribution of Z follows a standard normal distribution. Therefore, for a two-sided alternative hypothesis, we select the critical value zα/2 from the standard normal distribution. For a one-tailed test with significance level α, the critical value is denoted as zα (for the upper tail) or −zα (for the lower tail). For example:

    zα = 1.645 for α = 0.05.

Suppose the realized value of the test statistic (Z) is zcal. Then,

• for a two-sided test, reject H0 if and only if zcal ≥ zα/2 or zcal ≤ −zα/2 (equivalently, |zcal| ≥ zα/2);

• for a one-sided test, reject H0 if and only if

  ■ zcal > zα for an upper-tail test, or
  ■ zcal < −zα for a lower-tail test.

• If the p-value is less than or equal to the significance level (α), reject the null hypothesis H0.

Decision Rule for T-test

Under H0, the sampling distribution of the test statistic T follows a Student's t-distribution with ν = n − 1 degrees of freedom, where n is the sample size. Therefore, for a two-sided alternative hypothesis, we select the critical value tα/2,(n−1) from the Student's t-distribution with the appropriate degrees of freedom (n − 1). For a one-sided alternative hypothesis, the critical value is denoted by tα,(n−1). Suppose the realized value of the test statistic (T) is tcal. Then,

• for a two-sided test, reject H0 if and only if tcal ≤ −tα/2,(n−1) or tcal ≥ tα/2,(n−1) (equivalently, |tcal| ≥ tα/2,(n−1));

• for a one-sided test, reject H0 if and only if

  ■ tcal > tα,(n−1) for an upper-tail test, or
  ■ tcal < −tα,(n−1) for a lower-tail test.


The p-Value Method: Decision Rule


The p-value is calculated to indicate the probability of observing the results if
the null hypothesis is true. If the p-value is less than or equal to the significance
level (α), reject the null hypothesis H0 . To compute the p-value for a T -test,
we can use the following formula based on the t-distribution.

• For a two-tailed test:

p-value = P (T ≤ −tcal ) + P (T ≥ tcal ) = 2 × P (T > |tcal |)

• For a one-tailed test (greater than):

p-value = P (T > tcal )

• For a one-tailed test (less than):

p-value = P (T < tcal )

where T follows a t-distribution with degrees of freedom (df):


AF
df = n − 1

from scipy import stats

# Example inputs (assumed for illustration); replace with your own values
t_cal = 1.79   # realized value of the test statistic
df = 19        # degrees of freedom, n - 1

# Calculate p-value for a two-tailed test
p_value_two_tailed = 2 * (1 - stats.t.cdf(abs(t_cal), df))
print("Two-Tailed p-value:", p_value_two_tailed)

# Calculate p-value for a one-tailed test (greater than)
p_value_one_tailed_greater = 1 - stats.t.cdf(t_cal, df)
print("One-Tailed p-value (greater than):", p_value_one_tailed_greater)

# Calculate p-value for a one-tailed test (less than)
p_value_one_tailed_less = stats.t.cdf(t_cal, df)
print("One-Tailed p-value (less than):", p_value_one_tailed_less)

To compute the p-value for Z-test, we can use the following formulas:
• For a two-tailed test:

p-value = 2 × P (Z > |zcal |)

• For a one-tailed test (greater than):

p-value = P (Z > zcal )


• For a one-tailed test (less than):

p-value = P (Z < zcal )

Where Z follows a standard normal distribution.


from scipy import stats

# Example input (assumed for illustration); replace with your own value
z_cal = 0.9237   # realized value of the test statistic

# Calculate p-value for a two-tailed test
p_value_two_tailed = 2 * (1 - stats.norm.cdf(abs(z_cal)))
print("Two-Tailed p-value:", p_value_two_tailed)

# Calculate p-value for a one-tailed test (greater than)
p_value_one_tailed_greater = 1 - stats.norm.cdf(z_cal)
print("One-Tailed p-value (greater than):", p_value_one_tailed_greater)

# Calculate p-value for a one-tailed test (less than)
p_value_one_tailed_less = stats.norm.cdf(z_cal)
print("One-Tailed p-value (less than):", p_value_one_tailed_less)

Remark 9.6.1. If σ 2 is known, we always use the Z-test. If σ 2 is unknown,


then we have to check if the sample size n is large or not. In this case, we
replace the population standard deviation with the sample standard deviation s,
and:
(i). if n < 30, we use the T -test,

(ii). if n ≥ 30, we can use the Z-test.
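This decision rule is easy to express in Python. Below is a minimal illustrative sketch (the helper name and example values are ours, not from any library); it returns the statistic used, its realized value, and a two-sided p-value:

from math import sqrt
from scipy import stats

def one_sample_mean_test(x_bar, mu0, n, sigma=None, s=None):
    """Two-sided one-sample mean test following the rule in Remark 9.6.1."""
    if sigma is not None:                       # sigma^2 known: Z-test
        z = (x_bar - mu0) / (sigma / sqrt(n))
        return "Z", z, 2 * stats.norm.sf(abs(z))
    if n >= 30:                                 # sigma^2 unknown, large n: Z-test with s
        z = (x_bar - mu0) / (s / sqrt(n))
        return "Z", z, 2 * stats.norm.sf(abs(z))
    t = (x_bar - mu0) / (s / sqrt(n))           # sigma^2 unknown, small n: T-test
    return "T", t, 2 * stats.t.sf(abs(t), n - 1)

# Example (assumed values): x̄ = 92.2, µ0 = 89, n = 12, known σ = 12
print(one_sample_mean_test(92.2, 89, 12, sigma=12.0))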


Problem 9.1. From long-term experience, a factory owner knows that a worker

can produce a product in an average time of 89 minutes. However, on Sunday


morning, it seems that it takes longer. Assuming the population variance is
144, the owner collects a sample of size n = 12 and finds that the sample mean
is x̄ = 92.2. To determine whether this impression is correct, how should this
be justified?

Solution
To test whether the production time on Sunday morning is significantly longer,
a sample of size n = 12 was taken, yielding a sample mean x̄ = 92.2. It
is assumed that the production time is normally distributed with a known
variance σ 2 = 144.
We start by setting up the hypothesis test:

H0 : µ = 89 vs. H1 : µ > 89.

Next, we select the significance level α = 0.05.


The test statistic is calculated as follows:

    zcal = (x̄ − µ0)/(σ/√n) = (92.2 − 89)/√(144/12) = 0.9237.
For a significance level of α = 0.05, the critical value from the z-distribution
is zα = z0.05 = 1.645.
According to the decision rule, we will reject H0 if zcal ≥ zα . Since zcal =
0.9237 is less than 1.645, we do not reject the null hypothesis H0 : µ = 89 at
the 5% significance level. Thus, there is insufficient evidence to conclude that
it takes longer to produce on Sunday morning.

Alternatively, we can also use the p-value method. The p-value is calcu-

lated as:
p-value = Pr(Z > 0.9237) = 0.1778.
Since the p-value is greater than 0.05, we do not reject the null hypothesis
H0 at the 5% significance level. This confirms that there is not enough
evidence to suggest that production time is longer on Sunday morning.
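The calculation above can be verified quickly with scipy:

from math import sqrt
from scipy import stats

z_cal = (92.2 - 89) / sqrt(144 / 12)       # realized test statistic
p_value = stats.norm.sf(z_cal)             # upper-tail p-value, Pr(Z > z_cal)
print(round(z_cal, 4), round(p_value, 4))  # 0.9237 0.1778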
Problem 9.2. (Cardiovascular Disease) We want to compare fasting serum-
cholesterol levels between recent Asian immigrants and the general U.S. popula-
tion. Cholesterol levels for U.S. women aged 21-40 are normally distributed with
a mean of 190 mg/dL. For recent female Asian immigrants, with levels assumed
to be normally distributed but with an unknown mean µ, we test H0 : µ = 190
against H1 : µ ̸= 190. Blood tests from 200 female Asian immigrants show a
mean of 181.52 mg/dL and a standard deviation of 40. What conclusions can
we draw from this data?

Solution

To test whether the mean cholesterol level differs from 190, we set up the
following hypotheses:
H0 : µ = 190 vs. H1 : µ ̸= 190.
Assuming that X follows a normal distribution, i.e., X ∼ N (µ, σ 2 ), with a
sample mean x̄ = 181.52, sample standard deviation s = 40, and sample size
n = 200, we proceed with the following steps.
First, we set the significance level to α = 0.05. The test statistic is calculated
as:
    zcal = (x̄ − µ0)/(s/√n) = (181.52 − 190)/(40/√200) = −3.00.
For a significance level of α = 0.05, the critical value from the z-distribution
is zα/2 = z0.025 = 1.96. Since |zcal | > 1.96, we reject the null hypothesis
H0 : µ = 190 at the 5% significance level. Thus, we conclude that the mean
cholesterol level of recent Asian immigrants is significantly different from that
of the general U.S. population.


Alternatively, we compute the p-value for the test statistic. The p-value
is given by:

p-value = 2 Pr(Z ≥ |zcal | | H0 ) = 2 Pr(Z ≥ |−3.00|) = 2×0.00137 = 0.0027

Since the p-value is less than α = 0.05, we reject the null hypothesis
H0 : µ = 190 at the 5% level of significance. Thus, we conclude that the
mean cholesterol level of recent Asian immigrants is significantly different
from that of the general U.S. population.

Problem 9.3. (Obstetrics Problem) We want to test if mothers with low


socioeconomic status (SES) have babies with lower birthweights than the na-

tional average. From 100 consecutive full-term births in a low-SES area, the
average birthweight is 115 oz with a standard deviation of 24 oz. Nationally,
the mean birthweight is 120 oz. Can we conclude that the mean birthweight in
this hospital is lower than the national average?
Solution
To determine whether the average birthweight is significantly less than 120, we
set up the following hypotheses:

H0 : µ = 120 vs. H1 : µ < 120.

Assuming that X follows a normal distribution, i.e., X ∼ N (µ, σ 2 ), with a sam-


ple mean x̄ = 115, sample standard deviation s = 24, and sample size n = 100,
we proceed as follows.

First, we set the significance level to α = 0.05. Since the sample size in this
case is large (n = 100), we could use either a T -test or a Z-test. However, we

typically use the Z-test in this situation. Therefore, the realized value of the test statistic is calculated as:

    zcal = (x̄ − µ0)/(s/√n) = (115 − 120)/(24/√100) = −2.08.

For a one-tailed test at α = 0.05, the right-sided critical value zcri = 1.645.
Since |zcal | = 2.08 is greater than 1.645, we reject the null hypothesis H0 : µ =
120 at the 5% level of significance. Therefore, we conclude that the average
birthweight in this hospital is lower than the national average.

Problem 9.4. From long-term experience, a factory owner knows that a worker can produce a product in an average time of 89 min. However, on Monday morning, there is the impression that it takes longer. To test whether this impression is correct or not, he collects a sample of size n = 12, and it is found that x̄ = 92.2 and s = 10.75. Test his claim.


Solution
We assume that the production time follows a normal distribution. To verify
whether the impression that it takes longer on Monday morning is correct, we
conduct a hypothesis test at a significance level of 5%.

We test the following hypotheses:

H0 : µ = 89

H1 : µ > 89
Assuming X follows a normal distribution with an unknown variance σ 2 ,

the test statistic is given by

    T = (x̄ − µ0)/(s/√n) ∼ t(n−1) under H0.
We use the significance level α to find the critical value tα,(n−1) . From
the Student’s t-distribution table (Table A6 in the Appendix), we find that
tα,(n−1) = t0.05,11 = 1.795.
The decision rule is to reject H0 if |tcal | > tα,(n−1) = 1.795. Given that
n = 12, x̄ = 92.2, and s = 10.75, the calculated test statistic is

    tcal = (92.2 − 89)/(10.75/√12) = 1.0312.
Since 1.0312 < 1.795, we cannot reject H0 at the 5% significance level.
Therefore, there is insufficient evidence to conclude that it takes longer to pro-
duce on Monday morning.

Alternatively, we find the same conclusion based on the p-value:

    p-value = Pr(T > tcal) = Pr(T > 1.0312) = 0.1623

Since the p-value is larger than 0.05, we do not reject H0 at the 5% significance level.
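A quick scipy check of this t-test:

from math import sqrt
from scipy import stats

t_cal = (92.2 - 89) / (10.75 / sqrt(12))   # realized test statistic
p_value = stats.t.sf(t_cal, df=11)         # upper-tail p-value, Pr(T > t_cal)
print(round(t_cal, 4), round(p_value, 4))  # 1.0312 and about 0.162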

Problem 9.5. (Cardiology) A recent clinical focus is on whether drugs can


reduce infarct size in patients who have experienced a myocardial infarction
within the last 24 hours. It is known that the average infarct size in untreated
patients is 25 (CK-g-EQ/m²). In contrast, among 8 patients treated with
the drug, the mean infarct size is 16 with a standard deviation of 10. Is there
evidence that the drug is effective in reducing infarct size?

Solution
We set up the hypothesis test with

H0 : µ = 25 vs. H1 : µ < 25.


Here, X is normally distributed, i.e., X ∼ N(µ, σ²), with sample statistics x̄ = 16, s = 10, and n = 8. Set the significance level to α = 0.05. Calculate the test statistic:

    tcal = (x̄ − µ0)/(s/√n) = (16 − 25)/(10/√8) = −2.55.
From the t-distribution table, find tα,(n−1) = t0.05,7 = 1.895.
Since |tcal | = | − 2.55| = 2.55 is larger than tα,(n−1) = t0.05,7 = 1.895 (see
Table A6 in the Appendix), we reject the null hypothesis H0 : µ = 25 at the
5% level of significance. Therefore, the evidence suggests that the mean infarct
size is lower than 25.
Alternatively, the p-value is calculated as follows:

    p-value = Pr(T ≤ tcal | H0) = Pr(T ≤ −2.55) = Pr(T ≥ 2.55) = 0.019

Since the p-value is less than α = 0.05, we reject the null hypothesis H0 : µ = 25. That is, the evidence suggests that the drug is effective in reducing the mean infarct size below 25.

Problem 9.6. (Cardiovascular Disease) A current area of research interest is the familial aggregation of cardiovascular risk factors in general
and lipid levels in particular. Suppose the “average” cholesterol level in children
is 175 mg/dL. A group of men who have died from heart disease within the past
year are identified, and the cholesterol levels of their offspring are measured.
Two hypotheses are considered:
(i). The average cholesterol level of these children is 175 mg/dL.

(ii). The average cholesterol level of these children is >175 mg/dL.


Suppose the mean cholesterol level of 10 children whose fathers died from heart
disease is 200 mg/dL and the sample standard deviation is 50 mg/dL. Test the
hypothesis that the mean cholesterol level is higher in this group than in the
general population.

Solution
To test whether the mean cholesterol level is significantly different from 175,
we set up the following hypotheses:

H0 : µ = 175 vs. H1 : µ > 175.

Assuming that X follows a normal distribution, i.e., X ∼ N (µ, σ 2 ), with a


sample mean x̄ = 200, sample standard deviation s = 50, and sample size
n = 10, we proceed with the following steps.


First, we set the significance level to α = 0.05. The test statistic is calculated as follows:

    tcal = (x̄ − µ0)/(s/√n) = (200 − 175)/(50/√10) = 1.58.
Next, from the t-distribution table, the critical value is t0.05,9 = 1.833.
Since the calculated test statistic tcal = 1.58 is less than the critical value
t0.05,9 = 1.833, we do not reject the null hypothesis H0 : µ = 175 at the 5%
significance level. Thus, we conclude that the mean cholesterol level of these
children does not significantly differ from that of an average child.

Alternatively, we calculate the p-value as follows:

p-value = Pr(T ≥ |tcal | | H0 ) = Pr(T ≥ 1.58) = 0.074.

(Note: The p-value can be obtained using the function TDIST(1.58, 9, 1)


in Excel, which yields 0.07428.)
Since the p-value is greater than α = 0.05, we do not reject the null
hypothesis H0 : µ = 175. Therefore, we conclude that the mean cholesterol
level of these children does not differ significantly from that of an average
child.
Problem 9.7. Suppose the mean pulse rate in healthy adults is 72 beats per
min. Research was conducted to examine the pulse rate in patients with hyper-
thyroidism. Twenty patients were randomly enrolled with a mean of 80 and a
standard deviation of 20. Assuming that the pulse rate follows a normal distri-
bution, is the mean pulse rate in hyperthyroidism patients different from that in
healthy adults?

Solution

To test whether the mean pulse rate differs from 72, we set up the following
hypotheses:
H0 : µ = 72 vs. H1 : µ ̸= 72.
Assuming that X follows a normal distribution, i.e., X ∼ N (µ, σ 2 ), with a
sample mean x̄ = 80, sample standard deviation s = 20, and sample size
n = 20, we perform the following steps.
First, we set the significance level to α = 0.05. The test statistic is calculated as:

    tcal = (x̄ − µ0)/(s/√n) = (80 − 72)/(20/√20) = 1.79.
For a two-tailed test with α = 0.05 and 19 degrees of freedom, the critical
t-value can be found using a t-distribution table or calculator; it is approximately
2.093. Since |tcal| = 1.79 < 2.093, we do not reject H0 : µ = 72.
This indicates that there is not enough statistical evidence to conclude that the
mean pulse rate in patients with hyperthyroidism is significantly different from
that in healthy adults.


Alternatively, we compute the p-value for the test statistic. The p-value
is given by:

p-value = 2 Pr(T ≥ |tcal | | H0 ) = 2 Pr(T ≥ |1.79|) = 0.0894.

(Note: The p-value can be obtained using the function TDIST(1.79, 19, 2)
in Excel, which gives 0.0894.)
Since the p-value is greater than α = 0.05, we do not reject the null
hypothesis H0 : µ = 72. Thus, there is insufficient evidence to conclude
that the mean pulse rate in hyperthyroidism patients differs from that in
healthy adults.

9.6.2 Testing Equality of Two Means
Testing the equality of two means is a common procedure in hypothesis testing,
used to determine if there is a significant difference between the means of two
populations. This can be done using various methods depending on the nature
of the samples.

9.6.3 Independent Samples T -Test


The two-sample t-test is used to compare the means of two independent groups.
The hypotheses for this test are defined as follows:

• Null Hypothesis (H0 ): The means of the two independent samples


are equal.
H0 : µ1 = µ2

• Alternative Hypothesis (H1 ): The means of the two independent


samples are not equal.
H1 : µ1 ̸= µ2
For a one-tailed test, the alternative hypothesis can be specified as:

■ Right-Tailed Test:
H1 : µ1 > µ2
■ Left-Tailed Test:
H1 : µ1 < µ2

Define the Significance Level: Choose a significance level, denoted by α.


This is the probability of rejecting the null hypothesis when it is actually true.
Common choices for α are 0.05, 0.01, and 0.10.


1. Equal Variances (Pooled T -Test)


When the variances of the two populations are assumed to be equal, the test statistic is calculated as follows:

    T = (x̄1 − x̄2) / √[ s²p (1/n1 + 1/n2) ]

where

    s²p = [ (n1 − 1)s²1 + (n2 − 1)s²2 ] / (n1 + n2 − 2)

Here:

• x̄1 and x̄2 are the sample means,

• s²1 and s²2 are the sample variances,

• n1 and n2 are the sample sizes,

• s²p is the pooled variance.

For pooled T -test, the degrees of freedom is

ν = n1 + n2 − 2.

Problem 9.8. (Hypertension) A sample of eight nonpregnant, premenopausal


oral contraceptive (OC) users aged 35-39 has a mean systolic blood pressure
(SBP) of 132.86 mm Hg with a standard deviation of 15.34 mm Hg. In addi-
tion, a sample of 21 non-OC users from the same age group has a mean SBP of
127.44 mm Hg and a standard deviation of 18.23 mm Hg. Are the mean SBP

values between these two groups equal?

Solution
We begin by setting up the hypotheses. Let µ1 and µ2 represent the mean
systolic blood pressures of OC users and non-OC users, respectively. The hy-
potheses are:

H0 : µ1 = µ2 or µ1 − µ2 = 0
H1 : µ1 ̸= µ2 or µ1 − µ2 ̸= 0

We assume the difference x̄1 − x̄2 follows a normal distribution with mean
0 under H0. The pooled variance estimate s²p is given by:

    s²p = [ (n1 − 1)s²1 + (n2 − 1)s²2 ] / (n1 + n2 − 2)


where n1 and n2 are the sample sizes, and s²1 and s²2 are the sample variances. For our data:

    s²p = [ (8 − 1)(15.34)² + (21 − 1)(18.23)² ] / (8 + 21 − 2) = 8293.9/27 = 307.18
We set the significance level α to 0.05. We consider the test statistic

    T = (x̄1 − x̄2) / √[ s²p (1/n1 + 1/n2) ].

Based on the sample information, the realized value of T is

    tcal = (132.86 − 127.44) / √[ 307.18 × (1/8 + 1/21) ] = 0.74
1

For a two-tailed test at the 0.05 significance level with 27 degrees of freedom, the right-side critical t-value is approximately 2.052. Since |tcal| = 0.74 < 2.052, we do not reject H0. This indicates that there is not enough statistical evidence to conclude that the mean systolic blood pressure between oral contraceptive users and non-users is significantly different.
Alternatively, the p-value is calculated as:

p-value = 2 Pr(T ≥ |tcal | | H0 ) = 2 Pr(T ≥ |0.74|) = 0.46

Since the p-value of 0.46 is greater than the significance level α =


0.05, we do not reject the null hypothesis H0 : µ1 − µ2 = 0. Therefore,
we conclude that there is no significant difference in mean systolic blood
pressure between OC users and non-OC users in the specified age group.
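The same pooled test can be reproduced directly from the summary statistics; a minimal sketch with scipy:

from scipy import stats

# Pooled (equal-variance) two-sample T-test from the Problem 9.8 summaries
res = stats.ttest_ind_from_stats(mean1=132.86, std1=15.34, nobs1=8,
                                 mean2=127.44, std2=18.23, nobs2=21,
                                 equal_var=True)
print(round(res.statistic, 2), round(res.pvalue, 2))   # 0.74 0.46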

2. Unequal Variances (Welch’s T -Test)


When the variances of the two populations are not assumed to be equal, use
Welch’s T -test. The test statistic T is as follows:
x̄1 − x̄2
T =r    
s21 n11 + s22 n12

For the Welch’s T -test, the degrees of freedom is


 2 2
s1 s22
n1 + n2
ν =  2 2  2 2
s1 s2
n1 n2

n1 −1 + n2 −1

Decision Rule: Compare the calculated t-statistic to the critical value:


• Reject H0 if |tcal | > critical value or if the p-value < α.

• Fail to Reject H0 if |tcal | ≤ critical value or if the p-value ≥ α.


Problem 9.9. A data scientist wants to determine if there is a significant dif-
ference in the mean sales figures before and after implementing a new marketing
strategy. The sales data before the strategy (sample size = 25) has a mean of
$5000 with a standard deviation of $800, and the sales data after the strategy
(sample size = 30) has a mean of $5300 with a standard deviation of $850. Test
if the new strategy had an impact at a 10% significance level.

Solution

The hypotheses for the test are as follows:

• Null Hypothesis (H0 ): The means of the two independent samples


are equal.
H0 : µ1 = µ2
• Alternative Hypothesis (H1 ): The means of the two independent
samples are not equal.
H1 : µ1 ̸= µ2

Test Statistic: To determine the test statistic for the two-sample t-test, we
use the following formula:
    T = (x̄1 − x̄2) / √( s²1/n1 + s²2/n2 )

We have,

• Sales before the strategy:



■ Sample size (n1 ) = 25


■ Mean (x̄1 ) = $5000
■ Standard deviation (s1 ) = $800

• Sales after the strategy:


■ Sample size (n2 ) = 30
■ Mean (x̄2 ) = $5300
■ Standard deviation (s2 ) = $850

Substituting the given values, we have:


    tcal = (5000 − 5300) / √( 800²/25 + 850²/30 ) = −300/222.85 = −1.35


Degrees of Freedom: To determine the degrees of freedom for the test, we use the formula:

    ν = (800²/25 + 850²/30)² / [ (800²/25)²/24 + (850²/30)²/29 ] ≈ 52

Decision: Using a t-table with ν ≈ 52 (Welch's approximation), the critical t-value for a two-tailed test at α = 0.10 is approximately ±1.675. Since |tcal| = 1.35 < 1.675, we fail to reject the null hypothesis. There is not enough evidence to suggest that the new marketing strategy had a significant impact on sales.
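The Welch version of the test is obtained with equal_var=False; a quick scipy check of this problem:

from scipy import stats

# Welch's (unequal-variance) two-sample T-test from summary statistics
res = stats.ttest_ind_from_stats(mean1=5000, std1=800, nobs1=25,
                                 mean2=5300, std2=850, nobs2=30,
                                 equal_var=False)
print(round(res.statistic, 2), round(res.pvalue, 3))   # -1.35, p > 0.10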

9.6.4 Paired T -Test


The paired t-test is used to determine whether there is a significant difference
between the means of two related groups. This test is commonly used in exper-
iments where measurements are taken on the same subjects under two different
conditions or at two different times. The hypotheses for the paired T -test are
defined as follows:
• Null Hypothesis (H0 ): The mean difference between paired obser-
vations is zero.
H0 : µd = 0

• Alternative Hypothesis (H1 ): The mean difference between paired


observations is not zero.
H1 : µd ̸= 0
For a one-tailed test, the alternative hypothesis can be specified as:

■ Right-Tailed Test:
H1 : µd > 0
■ Left-Tailed Test:
H1 : µd < 0

Test Statistic
Calculate the test statistic T using the formula:

    T = d̄ / (sd/√n)

where d̄ is the mean difference, sd is the standard deviation of the differences,
and n is the number of pairs.

Critical Value and Decision Rule:


• Determine the critical value from the t-distribution table based on α


and degrees of freedom (df = n − 1).

• For a two-tailed test, reject H0 if |t| is greater than the critical value.
For a one-tailed test, reject H0 if t falls in the direction specified by
Ha .
Problem 9.10. A nutritionist wants to evaluate the effectiveness of a new
dietary supplement in reducing cholesterol levels. A sample of 10 participants
had their cholesterol levels measured before and after taking the supplement for
a month. The cholesterol levels (in mg/dL) before and after the treatment are
provided as follows:

Before    After
220       210
240       230
215       205
230       215
250       240
210       200
225       210
240       220
235       225
245       230

Test if the dietary supplement had a significant effect on cholesterol levels at


the 5% significance level.

Solution
To determine whether the dietary supplement significantly affects cholesterol
levels, we use a paired t-test.
Hypotheses:
• Null Hypothesis (H0 ): There is no difference in cholesterol levels
before and after the treatment, i.e., µd = 0.

• Alternative Hypothesis (H1 ): There is a significant difference in


cholesterol levels, i.e., µd ̸= 0.
where µd is the mean difference in cholesterol levels before and after the
treatment.


First, we calculate the differences between the before and after measure-
ments for each participant:

Differences = [10, 10, 10, 15, 10, 10, 15, 20, 10, 15]

i     Before    After    Difference (di)
1     220       210      10
2     240       230      10
3     215       205      10
4     230       215      15
5     250       240      10
6     210       200      10
7     225       210      15
8     240       220      20
9     235       225      10
10    245       230      15

Next, we compute the mean difference (d̄) and the standard deviation of the differences (sd):

    d̄ = (10 + 10 + 10 + 15 + 10 + 10 + 15 + 20 + 10 + 15)/10 = 12.5

    sd = √[ Σ(di − d̄)²/(n − 1) ] = √[ ((10 − 12.5)² + (10 − 12.5)² + · · · + (15 − 12.5)²)/9 ] = 3.5355

The test statistic is calculated using:

    tcal = d̄ / (sd/√n) = 12.5 / (3.5355/√10) = 12.5/1.1180 ≈ 11.1803
With n − 1 = 9 degrees of freedom, we compare the calculated t-value to
the critical value from the t-distribution table at the 5% significance level. For
a two-tailed test with 9 degrees of freedom, the critical value is approximately
2.262.

Since 11.1803 exceeds 2.262, we reject the null hypothesis. At the 5% signif-
icance level, there is sufficient evidence to conclude that the dietary supplement
has a significant effect on reducing cholesterol levels.
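The paired test can be verified on the raw data with scipy:

from scipy import stats

before = [220, 240, 215, 230, 250, 210, 225, 240, 235, 245]
after = [210, 230, 205, 215, 240, 200, 210, 220, 225, 230]

res = stats.ttest_rel(before, after)          # paired T-test on the ten pairs
print(round(res.statistic, 4), res.pvalue)    # 11.1803, p well below 0.05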


9.7 Testing Equality of Several Means


9.7.1 Analysis of Variance (ANOVA)
Analysis of Variance (ANOVA) is a statistical technique used to compare the
means of multiple groups and assess whether there are statistically significant
differences between them. It is particularly useful when comparing more than
two independent groups or treatments.

For instance, to evaluate the effectiveness of different diabetes medications,


researchers design an experiment to explore the relationship between the type
of medication and the resulting blood sugar levels. The study involves a sam-

ple population, which is divided into several groups, each receiving a specific
medication over a trial period. After the trial, blood sugar levels are measured
for each participant. The mean blood sugar level is then calculated for each
group. ANOVA is used to compare these group means to determine if there are
significant differences, or if the means are statistically similar.
This method is also known as Fisher's analysis of variance, emphasizing
its capacity to examine how a categorical variable with multiple levels affects
a continuous variable. The application of ANOVA is determined by the re-
search design. Typically, ANOVAs are employed in three main forms: one-way
ANOVA, two-way ANOVA, and N-way ANOVA. The layout of data for one-way
ANOVA is shown in the following table.

Group 1    Group 2    ···    Group k
x11        x12        ···    x1k
x21        x22        ···    x2k
···        ···        ···    ···
xn₁1       xn₂2       ···    xnₖk
x̄1         x̄2         ···    x̄k
s²1        s²2        ···    s²k

Step 1: State the Hypotheses


• Null Hypothesis (H0 ): All group means are equal.

H0 : µ1 = µ2 = · · · = µk

• Alternative Hypothesis (H1 ): At least one group mean is different.

H1 : At least one µi differs from the others


Step 2: Collect and Summarize the Data


Organize the data into groups. Compute the following statistics for each group:
• x̄i : Mean of the i-th group

• s²i : Variance of the i-th group

• ni : Sample size of the i-th group


Calculate the overall mean x̄overall:

    x̄overall = ( Σᵢ₌₁ᵏ ni x̄i ) / ( Σᵢ₌₁ᵏ ni )

Step 3: Compute the ANOVA Table

Table 9.2: ANOVA Table

Source of Variation     Sum of Squares (SS)   Degrees of Freedom (df)   Mean Square (MS)     F-Statistic
Between Groups (SSB)    SSB                   k − 1                     MSB = SSB/(k − 1)    F = MSB/MSW
Within Groups (SSW)     SSW                   n − k                     MSW = SSW/(n − k)
Total (SST)             SST                   n − 1

Definitions:

• Sum of Squares Between Groups (SSB):

    SSB = Σᵢ₌₁ᵏ ni (x̄i − x̄overall)²

• Sum of Squares Within Groups (SSW):

    SSW = Σᵢ₌₁ᵏ (ni − 1) s²i

• Total Sum of Squares (SST):

    SST = Σᵢ₌₁ᵏ Σⱼ₌₁ⁿᵢ (xij − x̄overall)² = SSB + SSW

• Degrees of Freedom:


■ Between Groups: dfbetween = k − 1


■ Within Groups: dfwithin = n − k, where n is the total number of
observations

• Mean Squares:

  ■ Mean Square Between Groups (MSB):

      MSB = SSB/(k − 1)

  ■ Mean Square Within Groups (MSW):

      MSW = SSW/(n − k)

• F-Statistic:

    F = MSB/MSW

Step 4: Determine the Critical Value or p-Value

Find the critical value from the F-distribution table for dfbetween and dfwithin degrees of freedom at the desired significance level α. The F-distribution table is presented in Table A8 of the Appendix. Alternatively, compute the p-value associated with the F-statistic.

Step 5: Make a Decision

• If using the critical value:

  ■ If F-statistic > Critical value, reject the null hypothesis H0.
  ■ If F-statistic ≤ Critical value, do not reject the null hypothesis H0.

• If using the p-value:

  ■ If p-value ≤ α, reject the null hypothesis H0.
  ■ If p-value > α, do not reject the null hypothesis H0.

Step 6: Interpret the Results


Draw conclusions based on the test result:
• If the null hypothesis is rejected, conclude that there is a significant
difference in means among the groups.


• If the null hypothesis is not rejected, conclude that there is no signifi-


cant difference in means among the groups.
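These steps translate directly into Python. The sketch below (the function name is ours) builds the ANOVA quantities from raw group data and returns the F-statistic, the critical value, and the p-value:

import numpy as np
from scipy import stats

def one_way_anova(groups, alpha=0.05):
    """Compute F, the critical value, and the p-value for a one-way ANOVA."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = np.mean(np.concatenate(groups))
    ssb = sum(len(g) * (np.mean(g) - grand_mean) ** 2 for g in groups)  # SSB
    ssw = sum((len(g) - 1) * np.var(g, ddof=1) for g in groups)         # SSW
    msb, msw = ssb / (k - 1), ssw / (n - k)                             # MSB, MSW
    f_stat = msb / msw
    f_crit = stats.f.ppf(1 - alpha, k - 1, n - k)   # critical value
    p_value = stats.f.sf(f_stat, k - 1, n - k)
    return f_stat, f_crit, p_value

# Toy demo with three small groups
print(one_way_anova([[3, 4, 5], [6, 7, 8], [2, 3, 4]]))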
Problem 9.11. Suppose we want to compare the effectiveness of three different
blood pressure medications: Medication A, Medication B, and Medication C. We
have the following data on the systolic blood pressure of patients who received each
of the three medications.
• Medication A: 140, 145, 148, 142, 144

• Medication B: 130, 133, 135, 138, 132

• Medication C: 125, 128, 124, 127, 126

Test if there are significant differences in the mean blood pressure levels among
these groups at the 5% significance level.

Step 1: State the Hypotheses


• Null Hypothesis (H0 ): The mean systolic blood pressure is the same
across all three groups (Medication A, Medication B, Medication C).
H0 : µA = µB = µC

• Alternative Hypothesis (H1 ): At least one of the group means is


different.
H1 : At least one µi differs from the others.

Step 2: Collect and Summarize the Data


The data collected for the three groups are as follows:

Medication A Medication B Medication C


140 130 125
145 133 128
148 135 124
142 138 127
144 132 126
Mean (x̄i ) 143.8 133.6 126
Variance (s²i ) 9.2 9.3 2.5

We now calculate the overall mean (x̄overall):

    x̄overall = (5 × 143.8 + 5 × 133.6 + 5 × 126) / (5 + 5 + 5) = 134.47


Step 3: Compute the ANOVA Table


Sum of Squares Between Groups (SSB):
k
X
SSB = ni (x̄i − x̄overall )2
i=1
= 5 × (143.8 − 134.47)2 + 5 × (133.6 − 134.47)2 + 5 × (126 − 134.47)2
= 797.73

Sum of Squares Within Groups (SSW):

    SSW = (n1 − 1)s²1 + (n2 − 1)s²2 + (n3 − 1)s²3
        = 4 × 9.2 + 4 × 9.3 + 4 × 2.5
        = 84

Degrees of Freedom:
• Between Groups: dfbetween = k − 1 = 3 − 1 = 2

• Within Groups: dfwithin = n − k = 15 − 3 = 12


Mean Squares:
• Mean Square Between Groups (MSB):

    MSB = SSB/dfbetween = 797.73/2 = 398.87

• Mean Square Within Groups (MSW):

    MSW = SSW/dfwithin = 84/12 = 7

F-Statistic:
    F = MSB/MSW = 398.87/7 = 56.98

Table 9.3: ANOVA Table

Source of Variation   Sum of Squares (SS)   Degrees of Freedom (df)   Mean Square (MS)   F-Statistic
Between Groups        797.73                2                         398.87             56.98
Within Groups         84                    12                        7                  -
Total                 881.73                14                        -                  -


Step 4: Determine the Critical Value or p-Value


Using an F -distribution table, find the critical value for dfbetween = 2 and
dfwithin = 12 at a significance level α = 0.05. The critical value for F (2, 12) at
α = 0.05 is approximately 3.89.
Alternatively, using statistical software, we can calculate the p-value asso-
ciated with the F -statistic of 56.98, which is very small (much less than 0.05).

Step 5: Make a Decision


• Using the critical value: Since F = 56.98 is greater than the critical
value of 3.89, we reject the null hypothesis.

T
• Using the p-value: Since the p-value is less than 0.05, we also reject
the null hypothesis.
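The result can be confirmed with scipy's built-in one-way ANOVA:

from scipy import stats

med_a = [140, 145, 148, 142, 144]
med_b = [130, 133, 135, 138, 132]
med_c = [125, 128, 124, 127, 126]

f_stat, p_value = stats.f_oneway(med_a, med_b, med_c)
print(round(f_stat, 2), p_value)   # 56.98, p far below 0.05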

Problem 9.12. A company wants to determine if there are significant differ-


ences in the average productivity of employees working under three different
types of work environments. The productivity scores (in units produced per
hour) are collected from three groups of employees in different environments:
Environment A, Environment B, and Environment C. The data collected are:

Environment A Environment B Environment C


30 45 55
32 50 53
29 47 52
31 46 58
33 48 60

Test if there are significant differences in the mean productivity across the
three environments at the 5% significance level.

Solution
Step 1: Calculate Means and Overall Mean

381
CHAPTER 9. HYPOTHESIS TESTING FOR DECISION MAKING

Table 9.4: Means of Productivity Scores

Environment      Mean
Environment A    x̄A = (30 + 32 + 29 + 31 + 33)/5 = 31.00
Environment B    x̄B = (45 + 50 + 47 + 46 + 48)/5 = 47.20
Environment C    x̄C = (55 + 53 + 52 + 58 + 60)/5 = 55.60
Overall Mean     x̄overall = (31 + 47.20 + 55.60)/3 = 44.60

Environment A Environment B Environment C
30 45 55
32 50 53
29 47 52
AF 31
33
46
48
58
60
Mean (x̄i ) 31 47.2 55.6
Variance (s²i ) 2.5 3.7 11.3

Step 2: Sum of Squares Between Groups (SSB)

    SSB = n [ (x̄A − x̄overall)² + (x̄B − x̄overall)² + (x̄C − x̄overall)² ]
        = 5 [ (31 − 44.6)² + (47.20 − 44.6)² + (55.60 − 44.6)² ]
        = 1563.6
Step 3: Sum of Squares Within Groups (SSW)
Using the sample variances of the three groups from the table above:

    SSW = (nA − 1)s²A + (nB − 1)s²B + (nC − 1)s²C = 4 × 2.5 + 4 × 3.7 + 4 × 11.3 = 70


Step 4: Mean Squares

    MSB = SSB/(k − 1) = 1563.6/(3 − 1) = 781.8

    MSW = SSW/(n − k) = 70/(15 − 3) = 5.83

Step 5: F-Statistic

    F = MSB/MSW = 781.8/5.83 ≈ 134.02


Table 9.5: ANOVA Table

Source of Variation   Sum of Squares (SS)   Degrees of Freedom (df)   Mean Square (MS)   F-Statistic
Between Groups        1563.6                2                         781.8              134.02
Within Groups         70                    12                        5.83               -
Total                 1633.6                14                        -                  -

Step 6: Conclusion
Compare the calculated F-value to the critical value from the F-distribution table with k − 1 = 2 and n − k = 12 degrees of freedom at α = 0.05; the critical value is approximately 3.89.

• Since the calculated F-value (134.02) is greater than the critical value (3.89), we reject the null hypothesis.

• There is a significant difference in the mean productivity among the three environments.

9.8 Power of the Test


The power of a statistical test is a key measure of its effectiveness in detecting
a true effect when it exists. It is denoted by 1 − β, where β is the probability
of committing a Type II error; the power reflects the likelihood of correctly
rejecting a false null hypothesis. For example, if a test has a power of 0.80, it
means there is an 80% chance of detecting an effect if it is present.

Power of the test: The power of a test is the probability of correctly


rejecting the null hypothesis when it is false. That is,

    Power = Pr(reject H0 | H0 false) = 1 − β

A higher power indicates a greater likelihood of identifying a true effect.

Increasing the sample size or effect size, or decreasing variability, generally


enhances the test’s power. However, lowering α to reduce Type I errors can
also reduce power, highlighting the need for careful consideration in test design.
Ensuring adequate power is essential for reliably detecting meaningful effects
and making informed conclusions.


• The power of the two-tailed test H0 : µ = µ0 vs. H1 : µ ̸= µ0 for the specific alternative µ = µ1, where the underlying distribution is normal and the population variance σ² is assumed known, is given exactly by

    Power = Φ(−zα/2 + (µ0 − µ1)/(σ/√n)) + Φ(−zα/2 + (µ1 − µ0)/(σ/√n))    (9.1)

and approximately by

    Power ≈ Φ(−zα/2 + |µ1 − µ0|/(σ/√n))    (9.2)

where Φ is the cumulative distribution function of the standard normal distribution and zα is the critical value from the standard normal distribution corresponding to the significance level α.

• The power of the left-tail test H0 : µ = µ0 vs. H1 : µ < µ0:

    Power = Φ(−zα + (µ0 − µ1)/(σ/√n))    (9.3)

• The power of the right-tail test H0 : µ = µ0 vs. H1 : µ > µ0:

    Power = Φ(−zα + (µ1 − µ0)/(σ/√n))    (9.4)
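These formulas are straightforward to evaluate in Python; a minimal sketch (the function names are ours):

from math import sqrt
from scipy.stats import norm

def power_two_sided(mu0, mu1, sigma, n, alpha=0.05):
    """Exact power of the two-sided Z-test, Eq. (9.1)."""
    z = norm.ppf(1 - alpha / 2)
    shift = (mu1 - mu0) / (sigma / sqrt(n))
    return norm.cdf(-z + shift) + norm.cdf(-z - shift)

def power_one_sided(mu0, mu1, sigma, n, alpha=0.05):
    """Power of a one-sided Z-test, Eqs. (9.3)-(9.4)."""
    z = norm.ppf(1 - alpha)
    return norm.cdf(-z + abs(mu1 - mu0) / (sigma / sqrt(n)))

# One-sided example with the birthweight numbers used in Problem 9.13 below
print(round(power_one_sided(120, 115, 24, 100), 3))   # 0.669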

Problem 9.13. (Obstetrics) Compute the power of the test for the birthweight
R
data in Problem 9.3 with an alternative mean of 115 ounces (oz) and α = 0.05,
assuming the true standard deviation = 24 oz.

Solution
We have µ0 = 120 oz, µ1 = 115 oz, α = 0.05, σ = 24, n = 100. Thus,

    Power = Φ[−z0.05 + (120 − 115)√100/24]
          = Φ[−1.645 + 5 × 10/24]
          = Φ(0.438) = 0.669

Therefore, there is about a 67% chance of detecting a significant difference using


a 5% significance level with this sample size.
Problem 9.14. (Cardiovascular Disease, Pediatrics) Using a 5% level
of significance and a sample of size 10, compute the power of the test for the


cholesterol data in Problem 9.6, with an alternative mean of 190 mg/dL, a null
mean of 175 mg/dL, and a standard deviation of 50 mg/dL.

Solution
We have µ0 = 175 mg/dL, µ1 = 190 mg/dL, α = 0.05, σ = 50, n = 10. Thus,

    Power = Φ[−z0.05 + (190 − 175)√10/50]
          = Φ[−1.645 + 15 × √10/50]
          = Φ(−0.696) = 0.243

Therefore, the chance of finding a significant difference in this case is only 24.3%.
Problem 9.15. (Cardiovascular Disease, Pediatrics) Compute the power
of the test for the cholesterol data in Problem 9.6 with a significance level of
α = 0.01 vs. an alternative mean of 190 mg/dL.

Solution
We have µ0 = 175 mg/dL, µ1 = 190 mg/dL, α = 0.01, σ = 50, n = 10. Thus,

    Power = Φ[−z0.01 + (190 − 175)√10/50]
          = Φ[−2.326 + 15 × √10/50]
          = Φ(−1.377) = 0.08

which is lower than the power of 24.3% for α = 0.05, computed in Problem
9.14. What does this mean? It means that if the α level is lowered from 0.05
to 0.01, the β error will be higher or, equivalently, the power, which decreases
from 0.243 to 0.08, will be lower.

Factors Affecting the Power


The power of a test is influenced by several factors, including

(1). Significance level (α): If the significance level is made smaller (α


decreases), zα increases and hence the power decreases.

(2). Effect size (d): If the alternative mean is shifted farther away from
the null mean (d = |µ0 − µ1 | increases), then the power increases.

(3). Data variability: If the standard deviation of the distribution of in-


dividual observations increases (σ increases), then the power decreases.

(4). Sample size (n): If the sample size increases (n increases), then the
power increases.


9.9 Sample Size Estimation for the Mean Test


Estimating the appropriate sample size for a mean test is crucial to ensure that
the test has sufficient power to detect a significant difference. The sample size
needed depends on the desired level of statistical significance, the power of the
test, the population standard deviation, and the expected difference in means.
To estimate the required sample size, we need the quantity of the following:

• α: Significance level (commonly 0.05).

• β: Type II error rate (commonly 0.2, which gives a power of 0.8).

• σ: Population standard deviation.

• d: Minimum detectable difference in means (effect size).

• zα/2 : Critical value from the standard normal distribution for a two-
tailed test.
• zβ : Critical value from the standard normal distribution corresponding
to the power (1 − β) of the test.

9.9.1 When Testing for the Mean of a Normal Distribution (One-Sided Alternative)
Suppose we wish to test
    H0 : µ = µ0 vs. H1 : µ = µ1

where the data are normally distributed with mean µ and known variance σ 2 .
The sample size needed to conduct a one-sided test with significance level α
and probability of detecting a significant difference with power 100(1 − β)% is

    n = (zβ + zα)² σ² / (µ0 − µ1)²

where d = µ0 − µ1.
Problem 9.16. (Obstetrics) Consider the birthweight data in Problem 9.3.
Suppose that µ0 = 120 oz, µ1 = 115 oz, σ = 24, α = .05, 1 − β = 0.80, and we
use a one-sided test. Compute the appropriate sample size needed to conduct
the test.


Solution
Since the power 1 − β = 0.80, then β = 0.20. Therefore, the sample size is

    n = (zβ + zα)² σ² / (µ0 − µ1)² = (z0.20 + z0.05)² × 24² / (120 − 115)² = (0.84 + 1.645)² × 24² / 5² = 142.3 ≈ 143

The sample size is always rounded up, so we can be sure to achieve at least the
required level of power (in this case, 80%). Thus, a sample size of 143 is needed
to have an 80% chance of detecting a significant difference at the 5% level if

the alternative mean is 115 oz and a one-sided test is used.
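A minimal Python sketch of this sample-size formula (the function name is ours):

import math
from scipy.stats import norm

def sample_size_one_sided(mu0, mu1, sigma, alpha=0.05, power=0.80):
    z_alpha = norm.ppf(1 - alpha)
    z_beta = norm.ppf(power)        # z_beta, since beta = 1 - power
    n = (z_alpha + z_beta) ** 2 * sigma ** 2 / (mu0 - mu1) ** 2
    return math.ceil(n)             # always round up

print(sample_size_one_sided(120, 115, 24))   # 143 (Problem 9.16)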
Problem 9.17. (Cardiovascular Disease, Pediatrics) Consider the choles-
terol data in Problem 9.6. Suppose the null mean is 175 mg/dL, the alternative
mean is 190 mg/dL, the standard deviation is 50, and we wish to conduct a one-
sided significance test at the 5% level with a power of 90%. How large should
the sample size be?

Solution
Since the power 1 − β = 0.90, then β = 0.10. Therefore, the sample size is

    n = (zβ + zα)² σ² / (µ0 − µ1)² = (z0.10 + z0.05)² × 50² / (190 − 175)² = (1.28 + 1.645)² × 50² / 15² = 95.1 ≈ 96

Thus, 96 people are needed to achieve a power of 90% using a 5% significance


level.
9.9.2 Sample Size Estimation When Testing for the Mean
of a Normal Distribution (Two-Sided Alternative)
Suppose we wish to test
    H0 : µ = µ0 vs. H1 : µ = µ1

where the data are normally distributed with mean µ and known variance σ 2 .
The sample size needed to conduct a two-sided test with significance level α
and probability of detecting a significant difference with power 100(1 − β)% is

    n = (zβ + zα/2)² σ² / (µ0 − µ1)²

Note that this sample size is always larger than the corresponding sample size for a one-sided test, because zα/2 is larger than zα.


Problem 9.18. (Cardiology) Consider a study of the effect of a calcium-


channel-blocking agent on heart rate for patients with unstable angina. Suppose
we want at least 80% power for detecting a significant difference if the effect
of the drug is to change mean heart rate by 5 beats per minute over 48 hours
in either direction and σ = 10 beats per minute. How many patients should be
enrolled in such a study?

Solution
We assume α = 0.05 and σ = 10 beats per minute. We intend to use a two-sided test because we are not sure in what direction the heart rate will change after using the drug. Therefore, the sample size is estimated using the two-sided formulation. We have

    n = (zβ + zα/2)² σ² / (µ0 − µ1)² = (z0.20 + z0.025)² × 10² / 5² = (0.84 + 1.96)² × 100 / 25 = 31.36 ≈ 32
= 31.36 ≈ 32
Thus, 32 patients must be studied to have at least an 80% chance of finding
a significant difference using a two-sided test with α = 0.05 if the true mean
change in heart rate from using the drug is 5 beats per minute.
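The same calculation in Python, with zα replaced by zα/2 for the two-sided test:

import math
from scipy.stats import norm

# Problem 9.18: sigma = 10, d = 5, alpha = 0.05 (two-sided), power = 0.80
z_beta, z_half_alpha = norm.ppf(0.80), norm.ppf(1 - 0.05 / 2)
n = (z_beta + z_half_alpha) ** 2 * 10 ** 2 / 5 ** 2
print(math.ceil(n))   # 32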

9.10 Single Proportion Test


A single proportion test is used to determine whether a sample proportion is
significantly different from a hypothesized population proportion. This type of
test is commonly used in situations where we are interested in the proportion
of a certain characteristic in a population.

Steps for Single Proportion Test


1. Define the null and alternative hypotheses.
• H0 : p = p0 (the population proportion is p0 )
• H1 : p ̸= p0 (the population proportion is different from p0 )
2. Fix the level of significance α.
3. Calculate the sample proportion and test statistic.
• Let p̂ be the sample proportion:

      p̂ = x/n

  where x is the number of successes in the sample and n is the sample size.


Calculate the standard error of the proportion.


• The standard error SE is given by:

      SE = √( p0(1 − p0)/n )
Calculate the test statistic.
• The test statistic Z is given by:

      zcal = (p̂ − p0)/SE

4. Determine the critical value or p-value.
• For a two-tailed test at significance level α, the critical values are
−zα/2 and zα/2 .
• Alternatively, calculate the p-value corresponding to the test statistic Z.
5. Make a decision.
• If |zcal | > zα/2 , reject the null hypothesis H0 .
• If the p-value is less than α, reject the null hypothesis H0 .

Problem 9.19. Suppose we want to test if the proportion of defective items in


a production process is 5%. We take a sample of 200 items and find that 12 are
defective.

Solution

• Define the null and alternative hypotheses:

H0 : p = 0.05 and H1 : p ̸= 0.05

• Calculate the sample proportion:


      p̂ = 12/200 = 0.06

• Calculate the standard error:


      SE = √( 0.05 × 0.95 / 200 ) ≈ 0.0154

• Calculate the test statistic:


      zcal = (0.06 − 0.05)/0.0154 ≈ 0.649


• Determine the critical value for α = 0.05:


z0.025 ≈ 1.96

• Make a decision:
■ Since |zcal| = 0.649 < 1.96, we fail to reject the null hypothesis.
■ The p-value is greater than α = 0.05, so we fail to reject the null
hypothesis.
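The whole test can be reproduced in a few lines of Python:

import math
from scipy.stats import norm

x, n, p0 = 12, 200, 0.05             # data from Problem 9.19

p_hat = x / n
se = math.sqrt(p0 * (1 - p0) / n)    # standard error under H0
z_cal = (p_hat - p0) / se
p_value = 2 * norm.sf(abs(z_cal))    # two-tailed p-value
print(round(z_cal, 3), round(p_value, 3))   # 0.649 0.516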

9.10.1 Sample Size Estimation for Proportion Test

Estimating the appropriate sample size for a proportion test is crucial to ensure
that the test has sufficient power to detect a significant difference. The sample
size needed depends on the desired level of statistical significance, the power of
the test, and the expected proportion.

Steps for Sample Size Estimation


1. Define the parameters.
• α: Significance level (commonly 0.05).
• β: Type II error rate (commonly 0.2, which gives a power of 0.8).
• p0 : Hypothesized proportion under the null hypothesis.
• p1 : Expected proportion under the alternative hypothesis.
• zα/2 : Critical value from the standard normal distribution for a
two-tailed test.

• zβ : Critical value from the standard normal distribution corresponding to the type II error of the test.
2. Calculate the required sample size.
• The formula for the required sample size n is given by:

      n = [ ( zα/2 √(2p(1 − p)) + zβ √( p0(1 − p0) + p1(1 − p1) ) ) / (p1 − p0) ]²

  where p is the pooled proportion:

      p = (p0 + p1)/2
Problem 9.20. Suppose we want to estimate the sample size needed to test
whether the proportion of defective items is different from 5% (0.05) at a 5%
significance level with 80% power, and we expect the true proportion to be 8%
(0.08).


Solution
• Define the parameters:

α = 0.05, β = 0.2, p0 = 0.05, p1 = 0.08

• Find the critical values:

zα/2 = 1.96 (for a two-tailed test at 5% significance level)

and
zβ = 0.84 (for 80% power)

• Calculate the pooled proportion:
      p = (p0 + p1)/2 = (0.05 + 0.08)/2 = 0.065

• Calculate the required sample size:

      n = [ ( 1.96 √(2 × 0.065 × (1 − 0.065)) + 0.84 √(0.05 × 0.95 + 0.08 × 0.92) ) / (0.08 − 0.05) ]²
        = [ ( 1.96 × 0.3486 + 0.84 × 0.3480 ) / 0.03 ]²
        = [ ( 0.6833 + 0.2923 ) / 0.03 ]²
        ≈ 1058

Therefore, the required sample size is approximately 1058.
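A Python check of this formula (with unrounded critical values the answer is 1059, one more than the hand calculation with rounded z-values above):

import math
from scipy.stats import norm

alpha, power, p0, p1 = 0.05, 0.80, 0.05, 0.08
z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)
p_bar = (p0 + p1) / 2

num = (z_a * math.sqrt(2 * p_bar * (1 - p_bar))
       + z_b * math.sqrt(p0 * (1 - p0) + p1 * (1 - p1)))
n = (num / (p1 - p0)) ** 2
print(math.ceil(n))   # 1059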

9.11 Concluding Remarks


In this chapter, we explored the fundamental concepts of hypothesis testing, a
critical tool for decision-making in statistics. We began with the essential steps
of formulating hypotheses, establishing significance levels, and understanding
test statistics. The introduction of p-values and their interpretation highlighted
the nuances of drawing conclusions from statistical data.

We delved into various testing methodologies, including tests for means and
proportions, each with specific applications and considerations. The discus-
sions on power analysis and sample size estimation underscore the importance


of planning in hypothesis testing to ensure robust and reliable results.

As we conclude, it is clear that hypothesis testing not only enhances our


ability to make informed decisions based on data but also provides a structured
framework for assessing the validity of claims in diverse fields. Mastery of these
concepts will empower you to apply statistical reasoning effectively in your
future endeavors.

9.12 Chapter Exercises


1. A company claims that their new battery lasts an average of 500 hours.

To test this claim, a sample of 20 batteries is tested, and the sample mean
is found to be 490 hours with a sample standard deviation of 15 hours.
Test the company’s claim at a 5% significance level.
2. A researcher wants to test if the average height of a population of adult
males is different from 70 inches. A random sample of 30 men is selected,
and the sample mean height is found to be 68.5 inches with a standard
deviation of 3 inches. Perform a hypothesis test at the 5% significance
level.

(a) State the null and alternative hypotheses.


(b) Calculate the test statistic.
(c) Determine the critical value or p-value.
(d) Make a decision to reject or fail to reject the null hypothesis.

3. In a survey of 200 people, 120 indicated that they prefer coffee over tea.
Test whether the proportion of people who prefer coffee is different from

50% at the 1% significance level.

(a) State the null and alternative hypotheses.


(b) Calculate the sample proportion.
(c) Calculate the test statistic.
(d) Determine the critical value or p-value.
(e) Make a decision to reject or fail to reject the null hypothesis.

4. A pharmaceutical company is testing a new drug that they believe will


lower blood pressure by an average of 5 mmHg. The standard deviation
of blood pressure in the population is known to be 10 mmHg. Calculate
the power of the test if the sample size is 50 and the significance level is
0.05.


5. A study is being designed to estimate the mean weight of newborns in a


hospital. The standard deviation of birth weights is known to be 1.5 kg.
How many newborns need to be included in the sample to estimate the
mean weight within 0.2 kg with 95% confidence?

(a) State the desired confidence level and margin of error.


(b) Determine the critical value for the confidence level.
(c) Calculate the required sample size.

6. A political pollster wants to estimate the proportion of voters who support


a particular candidate. How large a sample is needed to estimate the true

proportion within 3% with 95% confidence?

(a) State the desired confidence level and margin of error.


(b) Determine the critical value for the confidence level.
(c) Use an estimated proportion (e.g., 0.5 if no prior estimate is avail-
AF able) to calculate the required sample size.

7. Benzene is a potential carcinogen, and a chemical company wants to de-


termine if the concentration of benzene in the air is greater than 1 ppm.
The following sample data represents the concentration of benzene (in
ppm) in the air at various locations within the company:

0.21 1.44 2.54 2.97 0.00


3.91 2.24 2.41 4.50 0.15
0.30 0.36 4.50 5.03 0.00
2.89 4.71 0.85 2.60 1.26

You are tasked with testing whether the concentration of benzene is


greater than 1 ppm at a 5% level of significance.
8. Suppose we want to estimate the sample size needed to test whether
the mean of a population is different from a specified value. We set
the significance level at 5% (α = 0.05), the power at 80% (β = 0.2),
the population standard deviation at 10, and the minimum detectable
difference in means at 3.
9. A manufacturer claims that their light bulbs last an average of 1,200
hours. A consumer group tests a sample of 30 bulbs and finds a mean
lifetime of 1,150 hours with a standard deviation of 100 hours. Test the
manufacturer’s claim at the 0.05 significance level.


10. A school claims that the average score of its students on a standardized
test is 75. A random sample of 50 students has a mean score of 72
with a standard deviation of 10. Conduct a hypothesis test at the 0.01
significance level to determine if the school’s claim is valid.
11. A researcher wants to compare the effectiveness of two different teaching
methods. Method A has a sample mean of 82 with a standard deviation
of 5 from 25 students. Method B has a sample mean of 78 with a standard
deviation of 6 from 30 students. Test the hypothesis that the two methods
are equally effective at a significance level of 0.05.
12. A study compares the average daily water consumption of two cities. City

X has a mean consumption of 150 liters with a standard deviation of 20
liters based on a sample of 40 households. City Y has a mean consumption
of 160 liters with a standard deviation of 25 liters based on a sample of
35 households. Conduct a hypothesis test at the 0.05 level.
13. A dietician wants to test the effectiveness of a new diet plan. She measures
the weight of 10 participants before and after the diet. The weights (in
kg) are as follows:
• Before: 85, 78, 90, 82, 88, 80, 76, 85, 87, 90
• After: 83, 76, 87, 80, 84, 78, 75, 82, 86, 89
Test whether the diet plan has significantly reduced weight at the 0.05
significance level.
14. A medical researcher is testing the effect of a new medication on blood
pressure. He measures the blood pressure of 12 patients before and after
treatment. The readings (in mmHg) are as follows:

• Before: 130, 135, 140, 128, 132, 138, 136, 134, 129, 137, 141, 133
• After: 125, 130, 135, 127, 130, 132, 128, 129, 126, 134, 138, 131
Conduct a hypothesis test at the 0.01 significance level.
15. Explain the concepts of Type I and Type II errors in the context of hy-
pothesis testing. Provide examples of each based on the exercises above.
16. A new software is claimed to improve productivity. If you decide to reject
the null hypothesis that it does not improve productivity (Type I error)
when in fact it does not, what are the consequences of such an error?
17. A researcher is conducting a study to evaluate whether a new drug is
effective in lowering blood pressure. The null hypothesis is that the drug
has no effect on blood pressure (i.e., the mean change in blood pressure is
zero), while the alternative hypothesis is that the drug does lower blood
pressure. The researcher expects that the drug will lower blood pressure


by an average of 5 mmHg, based on previous studies. The population


standard deviation for the change in blood pressure is known to be 12
mmHg. The sample size for the study is 40 patients, and the significance
level for the hypothesis test is set to 0.05. Given the information above,
calculate the power of the hypothesis test to detect a true mean difference
of 5 mmHg.
18. A company wants to test whether a new production process increases the
average output of widgets. The company conducts a hypothesis test using
a one-sample t-test with a significance level of 0.05. The null hypothesis is
that the mean output is 100 widgets per hour, and the alternative hypoth-
esis is that the mean output is greater than 100 widgets per hour. The

population standard deviation is known to be 15 widgets. The company
expects that the new process will increase the mean output by 5 widgets
per hour, i.e., the true population mean is 105 widgets per hour. Given
that the sample size is 25, calculate the power of the test to detect a true
mean of 105 widgets per hour.
Chapter 10

Correlation and Regression Analysis
10.1 Introduction
In the realm of data science, understanding the relationships between variables
is crucial for deriving actionable insights from data. This chapter provides a
comprehensive overview of correlation and regression analysis, fundamen-
tal techniques in statistical modeling that are essential for data-driven decision-
making. Correlation analysis allows data scientists to quantify the strength
and direction of (linear) relationships between two variables, offering a pre-
liminary understanding of their interdependencies.

We begin by exploring scatter diagrams, which offer a visual representation
of relationships between variables, followed by discussions on covariance and
correlation coefficients that formalize these relationships into quantifiable met-
rics. These foundational concepts pave the way for more advanced regression
techniques.

Regression analysis is a powerful tool used to model and predict the behavior
of a dependent variable based on one or more independent variables. This
chapter covers simple linear regression as well as multiple linear regression, pro-
viding a detailed examination of the assumptions, estimation procedures, and
interpretation of regression coefficients. We also delve into evaluating model
performance through metrics such as R2 and adjusted R2 , which are critical for
assessing the accuracy and reliability of the model.

To bridge theory and practice, Python code examples are integrated through-
out the chapter, demonstrating how to implement these techniques in real-world
data science applications. These practical illustrations enhance the understand-
ing of both theoretical concepts and their computational applications, equip-
ping data scientists with the tools needed to effectively analyze and interpret
complex datasets.

10.2 Scatter Diagram


A scatter diagram, also known as a scatter plot or scatter graph, is a type
of plot using Cartesian coordinates to illustrate the relationship between two
variables. In this diagram, one variable (usually the dependent variable, which is
the variable being estimated or predicted) is placed on the Y-axis, and the other
variable (usually the independent variable, which is used as the predictor) is scaled
on the X-axis. By plotting data points for pairs of observations, the scatter
diagram visually represents how the two variables relate to each other.

Let’s say we have data on the height (in cm) and weight (in kg) of a group
of individuals. The height and weight pairs might look like this:
Table 10.1: Height and Weight of Individuals

Height (cm)   Weight (kg)
150           55
160           62
170           64
180           71
190           73
185           81

The scatter diagram of the height (in cm) and weight (in kg) of a group of
individuals, as given in Table 10.1, is presented in Figure 10.1.

[Figure: scatter plot of Height (cm) on the X-axis against Weight (kg) on the Y-axis]

Figure 10.1: Scatter diagram of Height and Weight.

Here’s how to interpret a scatter plot:

1. Direction of the Relationship


• Positive Correlation: If the points trend upwards from left to right,
it indicates a positive correlation, meaning that as one variable in-
creases, the other tends to increase as well.

• Negative Correlation: If the points trend downwards from left to
  right, it indicates a negative correlation, meaning that as one variable
  increases, the other tends to decrease.

• No Correlation: If the points are scattered randomly with no clear
  pattern, there is no correlation between the variables.

2. Strength of the Relationship


• Strong Correlation: Points closely clustered around a line suggest a
strong relationship between the variables.

• Weak Correlation: Points more widely spread out but still following
a general trend indicate a weak relationship.

• No Correlation: Widely scattered points without a discernible pat-


tern suggest no relationship between the variables.


3. Form of the Relationship


• Linear Relationship: If the points form a straight line, the relation-
ship between the variables is linear, indicating that a change in one
variable is associated with a proportional change in the other.

• Non-Linear Relationship: If the points form a curved pattern, the


relationship is non-linear, which may indicate a quadratic, exponential,
or other more complex relationship.

4. Outliers
• Outliers: Points that are far from the general pattern of the data
are called outliers. They may indicate special cases or errors in data
collection and can significantly affect the interpretation of the scatter
plot.

Interpretation of Figure 10.1
Consider a scatter plot where height (cm) is plotted against weight (kg):

• Direction: If the points generally trend upwards, there is a positive
  correlation between height and weight.

• Strength: If the points are closely packed around a line, the correla-
tion is strong; if they are more spread out, the correlation is weaker.

• Form: If the points lie on or near a straight line, the relationship is
  linear.

• Outliers: If a point is far away from the others (e.g., a height of 190 cm
  but a weight of 55 kg), it may be an outlier, which requires further
  investigation.

10.3 Python Code: Scatter Diagram


import matplotlib.pyplot as plt

# Data from Table 10.1
heights = [150, 160, 170, 180, 190, 185]  # Heights in cm
weights = [55, 62, 64, 71, 73, 81]        # Weights in kg

# Create a scatter plot
plt.scatter(heights, weights, color='blue', marker='o')

# Add title and labels
plt.title('Height vs Weight')
plt.xlabel('Height (cm)')
plt.ylabel('Weight (kg)')

# Show grid
plt.grid(True)

# Display the plot
plt.show()

T
10.4 Covariance
A scatter diagram is a powerful tool for visualizing the relationship between
two variables. However, it has some limitations:
• No Quantitative Measure: While scatter plots can visually show
  trends, they do not provide a quantitative measure of the strength or
  direction of the relationship between the variables. For this, statistical
  measures like covariance or correlation are needed.

• Subjectivity in Interpretation: The interpretation of scatter plots


can sometimes be subjective, especially in cases of weak or non-linear
relationships where the trend is not immediately obvious.

• Sensitivity to Outliers: Scatter diagrams can be highly sensitive


to outliers, which can distort the perceived relationship between the
variables.
Some of these limitations can be minimized by another popular statistical
technique, called covariance.

Covariance: Covariance is a statistical measure that quantifies the degree


to which two random variables change together. It assesses whether an
increase in one variable corresponds to an increase or decrease in another
variable. The sample covariance, denoted by sxy , can be calculated using
the formula:
sxy = ( Σᵢ xi yi − n x̄ ȳ ) / (n − 1)
where:

• xi and yi are the individual data points of the variables x and y,

• x̄ and ȳ are the means of x and y, respectively,


• n is the number of data points.


Since

Σᵢ (xi − x̄)(yi − ȳ) = Σᵢ xi yi − n x̄ ȳ,

the covariance can also be calculated using the formula

sxy = Σᵢ (xi − x̄)(yi − ȳ) / (n − 1).

Interpretation of the value of Covariance:
• A positive covariance indicates that the variables tend to move in the
same direction, meaning that when one variable increases, the other
tends to increase as well.

• Conversely, a negative covariance suggests that the variables move in


opposite directions—when one variable increases, the other tends to
decrease.

• A covariance of zero (or close to zero) indicates that there is no (or only
  a weak) linear relationship between the two variables.
Problem 10.1. Consider a dataset containing the number of commercials aired
and the corresponding sales volume for a product over ten weeks:

Table 10.2: Sample Data for the San Francisco Electronics Store

Week   Number of commercials (X)   Sales Volume ($100s)
1      2                           50
2      5                           57
3      1                           41
4      3                           54
5      4                           54
6      1                           38
7      5                           63
8      3                           48
9      4                           59
10     2                           46

To further analyze this data, follow these steps:

401
CHAPTER 10. CORRELATION AND REGRESSION ANALYSIS

(i). Draw a scatter diagram of the data points representing the relationship
between the number of commercial advertisements and the sales volume.
(ii). Based on the scatter diagram, describe the observed trend or pattern in
the data.

(iii). Calculate the sample covariance between the number of commercial ad-
vertisements and the sales volume to quantitatively assess the degree of
association between these two variables.
(iv). Interpret the sample covariance value in terms of the strength and direc-
tion of the relationship between the number of commercial advertisements
and the sales volume.

Solution
(i). The scatter plot of Number of commercials (X) and Sales Volume ($100s)
is presented in Figure 10.2.
[Figure: scatter plot of Number of Commercials (X) on the X-axis against Sales Volume ($100s) on the Y-axis]

Figure 10.2: Scatter Plot of Number of Commercials vs. Sales Volume

(ii). The scatter plot in Figure 10.2 illustrates the relationship between the
number of commercials aired and the sales volume:

• Positive Relationship: The plot shows that as the number of com-


mercials increases, the sales volume generally increases as well, indi-
cating a positive relationship between the two variables.


• Data Points Distribution: The data points form an upward slope,


suggesting that higher numbers of commercials are associated with
higher sales volumes. The points are not perfectly aligned, indicating
some variability in the relationship.

• Strength of Relationship: Although the trend is positive, the re-


lationship is not perfectly linear, suggesting that other factors might
influence the sales volume or that the relationship may not be strictly
linear.

• Outliers: There are no significant outliers, indicating a relatively sta-


ble relationship between the two variables across the observed data.

(iii). Let’s calculate the sample covariance for the given data in Table 10.3:

Table 10.3: Data Table

i       xi    yi     xi·yi
1       2     50     100
2       5     57     285
3       1     41     41
4       3     54     162
5       4     54     216
6       1     38     38
7       5     63     315
8       3     48     144
9       4     59     236
10      2     46     92
Total   30    510    1629

Mean of X: x̄ = 30/10 = 3
Mean of Y: ȳ = 510/10 = 51


Therefore,

sxy = (1629 − 10 × 3 × 51) / (10 − 1) = 99/9 = 11
(iv). The positive covariance of 11 suggests a positive relationship between
the number of commercials aired and the sales volume, indicating that as the
number of commercials increases, the sales volume tends to increase as well.
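The computation above is easy to verify in Python. A minimal sketch using
numpy (np.cov with its default ddof=1 applies the same n − 1 denominator as
the sample covariance formula):

import numpy as np

# Weekly data from Table 10.2
commercials = np.array([2, 5, 1, 3, 4, 1, 5, 3, 4, 2])
sales = np.array([50, 57, 41, 54, 54, 38, 63, 48, 59, 46])

# np.cov returns the 2x2 covariance matrix; the off-diagonal
# entry is the sample covariance s_xy
cov_matrix = np.cov(commercials, sales)
print(cov_matrix[0, 1])  # expected: 11.0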

10.5 Correlation Analysis

Correlation analysis is a group of techniques to measure the strength and
direction of the relationship between two variables. There are different
types of correlation coefficients that can be used depending on the nature of
the variables being analyzed and the assumptions of the data. Some of the
common types of correlation coefficients include:
1. Pearson correlation coefficient: Measures the linear relationship
   between two continuous variables.
2. Spearman rank correlation coefficient: Measures association be-
tween ranked variables.
3. Kendall tau correlation coefficient: Measures similarity of orderings
of data pairs.
4. Point-biserial correlation coefficient: Measures association between
continuous and binary variables.
5. Phi coefficient: Measures association between two binary variables.

6. Cramér’s V: Measures association between two nominal variables.

10.6 Pearson’s Correlation Coefficient


Pearson’s Correlation Coefficient is a statistical measure that indicates the
extent to which two variables fluctuate together. In many contexts, especially
when dealing with linear relationships between continuous variables, Pearson’s
Correlation Coefficient is used and often referred to simply as the correlation
coefficient.

Pearson’s Correlation Coefficient: Pearson’s Correlation Coefficient
is a measure of the strength of the linear relationship between two con-
tinuous variables. It is denoted by r or rxy and defined as

r = ( Σᵢ xi yi − n x̄ ȳ ) / ( √(Σᵢ xi² − n x̄²) · √(Σᵢ yi² − n ȳ²) ).    (10.1)

It can be easily shown that

Σᵢ (xi − x̄)(yi − ȳ) = Σᵢ xi yi − n x̄ ȳ;

Σᵢ (xi − x̄)² = Σᵢ xi² − n x̄²;

and

Σᵢ (yi − ȳ)² = Σᵢ yi² − n ȳ².

Therefore, the formula given in Equation (10.1) can be written as

r = Σᵢ (xi − x̄)(yi − ȳ) / ( √(Σᵢ (xi − x̄)²) · √(Σᵢ (yi − ȳ)²) )
  = Σᵢ (xi − x̄)(yi − ȳ) / ( (n − 1) sx sy )
  = sxy / (sx sy)

where sx = √( Σᵢ (xi − x̄)² / (n − 1) ) is the sample standard deviation of x,
and analogously for sy.

10.6.1 Interpretation of the Value of the Correlation Coefficient
The following scale summarizes the strength and direction of the correlation
coefficient:

−1.00            perfect negative correlation
−1.00 to −0.50   strong negative correlation
about −0.50      moderate negative correlation
−0.50 to 0       weak negative correlation
0                no correlation
0 to +0.50       weak positive correlation
about +0.50      moderate positive correlation
+0.50 to +1.00   strong positive correlation
+1.00            perfect positive correlation

Problem 10.2. Consider Problem 10.1. Calculate the correlation coefficient
and interpret the results.

Solution
From the solution of Problem 10.1, we have the sample covariance sxy = 11.
The calculations for the variances are presented in Table 10.4.

Table 10.4: Data Table

i       xi    yi     xi²    yi²     xi·yi
1       2     50     4      2500    100
2       5     57     25     3249    285
3       1     41     1      1681    41
4       3     54     9      2916    162
5       4     54     16     2916    216
6       1     38     1      1444    38
7       5     63     25     3969    315
8       3     48     9      2304    144
9       4     59     16     3481    236
10      2     46     4      2116    92
Total   30    510    110    26576   1629

We can compute the sample standard deviations for the two variables:

sx = √( (Σᵢ xi² − n x̄²) / (n − 1) ) = √( (110 − 10 × 3²) / 9 ) = 1.49

sy = √( (Σᵢ yi² − n ȳ²) / (n − 1) ) = √( (26576 − 10 × 51²) / 9 ) = 7.93

Hence, the sample correlation coefficient equals

r = sxy / (sx sy) = 11 / (1.49 × 7.93) = 0.93,

which indicates a strong positive linear relationship between the number of
commercials and the sales volume.
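As a quick check, the same value is obtained in Python with scipy’s pearsonr;
a minimal sketch:

from scipy.stats import pearsonr

# Weekly data from Table 10.2
commercials = [2, 5, 1, 3, 4, 1, 5, 3, 4, 2]
sales = [50, 57, 41, 54, 54, 38, 63, 48, 59, 46]

r, p_value = pearsonr(commercials, sales)
print(f"r = {r:.2f}")  # expected: about 0.93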

[Figure: two panels of Y against X; in Panel A the points lie exactly on an upward line (r = +1), in Panel B exactly on a downward line (r = −1)]

Figure 10.3: Perfect Positive (r = +1) and Perfect Negative (r = −1) Linear
Relationship.

[Figure: Y against X with points scattered with no discernible pattern]

Figure 10.4: No or Weak Relationship (r → 0)

10.6.2 Properties of the Correlation Coefficient


The properties of Pearson’s correlation coefficient include:


1. Range: The correlation coefficient r lies in the range −1 ≤ r ≤ 1.
   • r = 1: Indicates a perfect positive linear correlation between
     variables. See Panel A in Figure 10.3.
   • r = −1: Indicates a perfect negative linear correlation between
     variables. See Panel B in Figure 10.3.
   • r = 0: Indicates no linear correlation between variables. See
     Figure 10.4.
   • |r| close to 1 indicates a strong relationship.
   • |r| close to 0 indicates a weak relationship.
2. Symmetry: The correlation between X and Y is the same as between
Y and X:
r(X, Y ) = r(Y, X)

3. Unit-Free: The correlation coefficient is a dimensionless number, mean-


ing it does not depend on the units of the variables.
4. Unaffected by Change of Origin and Scale: The correlation coeffi-
cient remains unchanged if we add a constant to one or both variables, or
if we multiply one or both variables by a positive constant:
A
r(aX + b, cY + d) = r(X, Y )

where a, b, c, and d are constants, with a > 0 and c > 0.


5. Linear Relationship: The correlation coefficient measures the strength
   and direction of a linear relationship between two variables. It does not
   capture non-linear relationships.
6. Sensitivity to Outliers: The correlation coefficient can be sensitive to
   outliers, which can distort the value of r and provide a misleading
   interpretation of the relationship between the variables.
7. Dependence on the Type of Data: The correlation coefficient is ap-
   propriate for continuous data but may not be suitable for ordinal or cat-
   egorical data without modification.
8. Pairwise Comparisons: The correlation coefficient is computed for
pairs of variables and does not extend naturally to more than two vari-
ables.

Problem 10.3. The following sample of observations was randomly selected.

x 4 5 3 6 10
y 4 6 5 7 7


(a). Draw a scatter diagram. Comment on your output.


(b). Determine the correlation coefficient and interpret the strength and
direction of the relationship between x and y.

Solution
(a)
The scatter plot given in Figure 10.5 visualizes the relationship between two
variables, X and Y , for the given sample of observations. Each point on the
scatter plot represents a pair of x and y values.

[Figure: scatter plot of y against x for the five sample observations]

Figure 10.5: The scatter plot between x and y.

The scatter plot indicates that there is a positive correlation between X and
Y , as the points generally trend upwards. This indicates that higher values of
X tend to be associated with higher values of Y . However, to draw definitive
conclusions about the strength and nature of this relationship, further statis-
tical analysis, such as calculating the Pearson correlation coefficient, would be
necessary.
(b)


xi     yi     xi²     yi²     xi·yi
4      4      16      16      16
5      6      25      36      30
3      5      9       25      15
6      7      36      49      42
10     7      100     49      70

Σxi = 28, Σyi = 29, Σxi² = 186, Σyi² = 175, Σxi·yi = 173
x̄ = 5.6, ȳ = 5.8

r = ( Σᵢ xi yi − n x̄ ȳ ) / ( √(Σᵢ xi² − n x̄²) · √(Σᵢ yi² − n ȳ²) )
  = (173 − 5 × 5.6 × 5.8) / ( √(186 − 5 × 5.6²) × √(175 − 5 × 5.8²) )
  = 0.7522

The value r = 0.7522 indicates a strong positive linear relationship between


the two variables X and Y . In general, values of r between 0.7 and 1.0 (or
-0.7 and -1.0 for negative relationships) are considered to represent a strong
relationship.

10.6.3 Testing the Significance of the Correlation Coefficient
Testing the significance of the correlation coefficient involves determining whether
the observed correlation between two variables is statistically significant, mean-
ing it is unlikely to have occurred by chance. Here’s a step-by-step guide on
how to perform this test:
1. Hypotheses:

   H0: ρ = 0 (There is no linear relationship between the two variables.)
   H1: ρ ≠ 0 (There is a linear relationship between the two variables.)

2. Level of significance: α

3. Test statistic:

   T = r √(n − 2) / √(1 − r²) ∼ t distribution with n − 2 degrees of freedom


4. Decision Rule
   Reject H0 if
   T > tα/2,n−2 or T < −tα/2,n−2,
   or equivalently, reject H0 if and only if |T| > tα/2,n−2,
   at the α level of significance.

Correlation Test: In general, if the null hypothesis is

H0 : ρ = 0

and if the null hypothesis is true, the test statistic T follows the Student’s-t
distribution with (n − 2) degrees of freedom, i.e., T ∼ t(n − 2).

T
Table 10.5: Decision rule for the test of hypothesis H0 : ρ = 0

Alternative hypothesis Reject H0 if


H1 : ρ < 0 T < −tα,n−2
H1 : ρ > 0 T > tα,n−2
H1 : ρ ̸= 0 T > tα/2,n−2 or T < −tα/2,n−2

For the example given in Problem 10.3, let tcal denote the realized value of T.
Therefore,

tcal = 0.7522 × √(5 − 2) / √(1 − 0.7522²) = 1.9772

The critical value is t0.025,3 = 3.182. Since |tcal| = 1.9772 < 3.182, we do
not reject the null hypothesis that ρ = 0 and conclude that there is not enough
evidence of a linear relationship between x and y in the population.
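The whole test can be reproduced in Python; the sketch below recomputes the
test statistic and critical value for the data of Problem 10.3 (scipy’s pearsonr
also reports the two-sided p-value of this same t test):

import numpy as np
from scipy.stats import pearsonr, t

x = np.array([4, 5, 3, 6, 10])
y = np.array([4, 6, 5, 7, 7])
n = len(x)

r, p_value = pearsonr(x, y)

# Test statistic T = r * sqrt(n - 2) / sqrt(1 - r^2)
t_cal = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
t_crit = t.ppf(1 - 0.05 / 2, df=n - 2)  # two-sided critical value

print(f"r = {r:.4f}, t_cal = {t_cal:.4f}, critical value = {t_crit:.3f}")
print(f"two-sided p-value = {p_value:.4f}")
# |t_cal| = 1.9772 < 3.182, so H0 is not rejected at the 0.05 level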

10.6.4 Python Code: Correlation Matrix


# Example 1: using the scipy library
from scipy.stats import pearsonr

# Data from Table 10.1
heights = [150, 160, 170, 180, 190, 185]  # Heights in cm
weights = [55, 62, 64, 71, 73, 81]        # Weights in kg

# Calculate Pearson correlation coefficient
correlation, p_value = pearsonr(heights, weights)

print(f"Pearson correlation coefficient: {correlation}")
print(f"P-value: {p_value}")

# Example 2: for a pandas dataset
import pandas as pd

# Example data (you should replace this with your actual data)
data = pd.DataFrame({
    'x1': [1, 2, 3, 4, 5],
    'x2': [2, 3, 4, 5, 6],
    'x3': [3, 4, 5, 6, 7]
})

# Compute correlation matrix
correlation_matrix = data.corr()
print(correlation_matrix)

# Compute correlation matrix rounded to 2 decimal places
correlation_matrix = data.corr().round(2)

# View the correlation matrix
print(correlation_matrix)

10.7 Rank Correlation


Rank correlation is a statistical measure used to evaluate the strength and
direction of the association between two ranked variables. Unlike Pearson’s
correlation coefficient, which assesses linear relationships between two contin-
uous variables, rank correlation methods focus on the order or ranking of the

data rather than the actual numerical values.

10.7.1 Key Types of Rank Correlation


Spearman’s Rank Correlation Coefficient (Spearman’s rho, ρ)
Spearman’s rank correlation coefficient is a non-parametric measure of rank
correlation. It measures the relationship between rankings of different ordi-
nal variables or different rankings of the same variable, where a “ranking” is
the assignment of the ordering labels “first”, “second”, “third”, etc. to differ-
ent observations of a particular variable. It assesses how well the relationship
between two variables can be described using a monotonic function.

If ri and si are the ranks of the i-th member according to the x and y variables
respectively, then the rank correlation coefficient is

rR = 1 − 6 Σᵢ di² / ( n(n² − 1) )

where n is the number of data points of the two variables and di = ri − si is
the difference between the ranks of the i-th element.

The Spearman correlation coefficient, rR, can take values from +1 to −1.

• rR = 1: Perfect positive correlation.

• rR = −1: Perfect negative correlation.

• rR = 0: No correlation.
Problem 10.4. Based on the following data, find the rank correlation between
marks of English and Mathematics courses.
English   56   75   45   71   62   64   58   80   76   61
Maths     66   70   40   60   65   56   59   77   67   63

Solution
The procedure for ranking these scores is as follows:

English (mark)   Maths (mark)   Rank (English)   Rank (Maths)   di    di²
56               66             9                4              5     25
75               70             3                2              1     1
45               40             10               10             0     0
71               60             4                7              −3    9
62               65             6                5              1     1
64               56             5                9              −4    16
58               59             8                8              0     0
80               77             1                1              0     0
76               67             2                3              −1    1
61               63             7                6              1     1

The realized value of the Spearman rank correlation is

rR = 1 − 6 Σᵢ di² / ( n(n² − 1) ) = 1 − (6 × 54) / ( 10(10² − 1) ) = 0.6727


This indicates a strong positive relationship between the ranks individuals ob-
tained in the Maths and English exam. That is, the higher you ranked in maths,
the higher you ranked in English also, and vice versa.

Equal Ranks or Tie in Ranks


• For tied observations, the rank correlation can be computed by adding
  m(m² − 1)/12 to the value of Σᵢ di², where m stands for the number of
  items whose ranks are equal.

• If there is more than one such group of items with common rank, this
  value is added as many times as the number of such groups.

• Then the formula for the rank correlation is

  rR = 1 − 6{ Σᵢ di² + m₁(m₁² − 1)/12 + m₂(m₂² − 1)/12 + · · · } / ( n(n² − 1) )
AF
Problem 10.5. Based on the following data, find the rank correlation between
marks of English and Mathematics courses.

English 56 75 45 71 61 64 58 80 76 61
Maths 70 70 40 60 65 56 59 70 67 80

Solution
The procedure for ranking these scores is as follows:
English (mark)   Maths (mark)   Rank (English)   Rank (Maths)   di     di²
56               70             9                3              6      36
75               70             3                3              0      0
45               40             10               10             0      0
71               60             4                7              −3     9
61               65             6.5              6              0.5    0.25
64               56             5                9              −4     16
58               59             8                8              0      0
80               70             1                3              −2     4
76               67             2                5              −3     9
61               80             6.5              1              5.5    30.25


• The mark 61 is repeated 2 times in series X (English), and hence m₁ = 2.

• In series Y (Maths), the mark 70 occurs 3 times, and hence m₂ = 3.

So the rank correlation is

rR = 1 − 6{ 104.5 + 2(2² − 1)/12 + 3(3² − 1)/12 } / ( 10(10² − 1) )
   = 1 − (6 × 107)/990
   = 0.3515

10.7.2 Applications of Rank Correlation
• Non-Linear Relationships: Rank correlation is useful when the re-
lationship between variables is not linear but still monotonic (e.g., one
variable consistently increases as the other does, but not necessarily in
a straight line).
• Ordinal Data: It is appropriate for ordinal data, where the values
represent rankings or ordered categories (e.g., customer satisfaction
ratings).

• Handling Outliers: Since rank correlation relies on the order of val-


ues rather than their magnitude, it is less sensitive to outliers than
Pearson’s correlation.

10.7.3 Python Code: Rank Correlation


In Python, you can find rank correlation using the scipy.stats module, which
provides a function called spearmanr() to calculate the Spearman rank correla-
tion coefficient. Here’s how you can do it:
import scipy.stats as stats

# Given data
english_marks = [56, 75, 45, 71, 61, 64, 58, 80, 76, 61]
maths_marks = [70, 70, 40, 60, 65, 56, 59, 70, 67, 80]

# Calculate the Spearman rank correlation coefficient
spearman_corr, p_value = stats.spearmanr(english_marks, maths_marks)

# Output the results
print(f"Spearman Rank Correlation Coefficient: {spearman_corr:.4f}")
print(f"P-value: {p_value:.4f}")


If you have two columns of data in a pandas DataFrame, you can calculate
the rank correlation directly from the DataFrame. Here’s an example:
# From a DataFrame
import pandas as pd

# Example DataFrame (you should replace this with your actual data)
data = pd.DataFrame({
    'english_marks': [56, 75, 45, 71, 61, 64, 58, 80, 76, 61],
    'maths_marks': [70, 70, 40, 60, 65, 56, 59, 70, 67, 80]
})

# Calculate the Spearman rank correlation coefficient;
# data.corr() returns a correlation matrix, so take the off-diagonal
# entry. Note that this approach does not provide a p-value.
rho = data.corr(method='spearman').iloc[0, 1]

print("Spearman rank correlation coefficient:", rho)
13 print ( "p - value : " , p_value )
14

15
AF
10.7.4 Kendall Tau Correlation Coefficient
Kendall’s tau is another non-parametric measure of rank correlation that eval-
uates the ordinal association between two variables.

Calculation of Kendall Tau


To calculate the Kendall Tau correlation coefficient:

1. Compare each pair of observations in terms of their ranks.

2. Determine whether they have the same order (concordant) or opposite
   order (discordant) for both variables.

3. Count the total number of concordant pairs (nc) and discordant pairs
   (nd).

4. Calculate the Kendall Tau coefficient using the formula:

   τ = (nc − nd) / ( ½ n(n − 1) )

5. It ranges from −1 to 1.

Interpretation:
• τ = 1: Perfect agreement between the rankings.


• τ = −1: Perfect disagreement between the rankings.

• τ = 0: No association.

• Concordant Pairs:
In the context of correlation coefficients such as Kendall Tau, con-
cordant pairs refer to pairs of observations where the ranks for
both variables follow the same order. In other words, if (Xi , Yi )
and (Xj , Yj ) are two pairs of observations, they are considered
concordant if both Xi < Xj and Yi < Yj or if both Xi > Xj and
Yi > Yj .

• Discordant Pairs: Discordant pairs, on the other hand, refer
to pairs of observations where the ranks for the variables have
opposite orders. In other words, if (Xi , Yi ) and (Xj , Yj ) are two
pairs of observations, they are considered discordant if Xi < Xj
and Yi > Yj or if Xi > Xj and Yi < Yj .
In the context of calculating correlation coefficients like Kendall Tau, un-
derstanding concordant and discordant pairs is crucial as they form the basis
for determining the strength and direction of association between two variables
based on their ranks.
Problem 10.6. Suppose we have the following data on two variables, X and
Y , with their corresponding ranks:

Observation   X     Y
1             10    15
2             15    10
3             20    20
4             25    25
5             30    40
Calculate the Kendall Tau correlation coefficient.

Concordant and Discordant Pairs


To find the number of concordant and discordant pairs in the given data, we
need to compare each pair of observations in terms of their ranks and determine
whether they are concordant or discordant. Let’s analyze each pair:

1. (10, 15) and (15, 10): Discordant


2. (10, 15) and (20, 20): Concordant


3. (10, 15) and (25, 25): Concordant
4. (10, 15) and (30, 40): Concordant
5. (15, 10) and (20, 20): Concordant
6. (15, 10) and (25, 25): Concordant
7. (15, 10) and (30, 40): Concordant
8. (20, 20) and (25, 25): Concordant

9. (20, 20) and (30, 40): Concordant
10. (25, 25) and (30, 40): Concordant
So, out of the 10 pairs of observations, there are 9 concordant pairs and 1
discordant pair.
• Number of concordant pairs (nc): 9

• Number of discordant pairs (nd): 1

• Total number of pairs: 5(5 − 1)/2 = 10

• Kendall Tau coefficient (τ):

  τ = (nc − nd) / ( ½ n(n − 1) ) = (9 − 1) / ( ½ × 5 × (5 − 1) ) = 8/10 = 0.8

The Kendall Tau correlation coefficient for the given data is τ = 0.8.
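The pair-by-pair comparison above can be automated. A small sketch that
counts concordant and discordant pairs directly and applies the formula (the
scipy-based code in Section 10.7.6 gives the same answer for this tie-free data):

from itertools import combinations

X = [10, 15, 20, 25, 30]
Y = [15, 10, 20, 25, 40]

nc = nd = 0
for i, j in combinations(range(len(X)), 2):
    # The product is positive for concordant pairs (same ordering in
    # X and Y) and negative for discordant pairs (opposite ordering)
    sign = (X[i] - X[j]) * (Y[i] - Y[j])
    if sign > 0:
        nc += 1
    elif sign < 0:
        nd += 1

n = len(X)
tau = (nc - nd) / (n * (n - 1) / 2)
print(nc, nd, tau)  # expected: 9 1 0.8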
DR

10.7.5 Advantages and Disadvantages


Rank correlation methods, such as Spearman’s rank correlation coefficient and
Kendall’s tau are especially useful when dealing with ordinal data or when the
assumptions of parametric methods (like Pearson’s correlation) are not met.
Here are some key advantages and disadvantages of rank correlation:
• Advantages:
■ Robust to Outliers: Rank correlation is not affected by ex-
treme values.
■ Non-parametric: No assumptions about the distribution of
the data.
■ Suitable for Ordinal Data: Can be used when data are ranked
or ordinal.

418
CHAPTER 10. CORRELATION AND REGRESSION ANALYSIS

• Disadvantages:
■ Less Sensitive: Might not capture the strength of the relation-
ship as well as Pearson’s correlation in the presence of linear
relationships.
■ Reduced Power: May have less statistical power compared to
parametric tests when assumptions for those tests are met.

10.7.6 Python Code: Kendall Tau


import scipy.stats as stats

# Given data
X = [10, 15, 20, 25, 30]
Y = [15, 10, 20, 25, 40]

# Calculate the Kendall Tau correlation coefficient
tau, p_value = stats.kendalltau(X, Y)

# Output the results
print(f"Kendall Tau correlation coefficient: {tau:.4f}")
print(f"P-value: {p_value:.4f}")

10.7.7 Exercises
1. Define a scatter diagram. What is its primary purpose in data analy-
sis, and how can it help in understanding the relationship between two
variables?

2. Describe how you would interpret a scatter diagram that shows a perfect
positive linear relationship between two variables. What characteristics
would you expect to see in the plot?
3. Explain how a scatter diagram can be used to detect non-linear relation-
ships between variables. Provide examples of different types of non-linear
relationships that might be observed.

4. What are some limitations of using a scatter diagram for analyzing rela-
tionships between variables? How can these limitations affect the inter-
pretation of the data?
5. Define covariance. What does it measure, and how is it different from
correlation?

6. What is meant by correlation analysis? How do you interpret the value
   of the correlation coefficient?


7. Define Pearson’s correlation coefficient. What does it measure, and how


is it calculated?
8. Describe the range of values that Pearson’s correlation coefficient can
take and what each value signifies about the relationship between two
variables.

9. Explain the concept of linearity in relation to Pearson’s correlation coef-


ficient. Why is Pearson’s correlation coefficient not appropriate for non-
linear relationships?
10. Discuss the assumptions underlying Pearson’s correlation coefficient. What
    assumptions need to be met for Pearson’s correlation coefficient to provide
    a valid measure of the strength and direction of the relationship?
11. How does Pearson’s correlation coefficient handle outliers in the data?
What impact can outliers have on the correlation coefficient, and how
might this affect data analysis?
12. What is the purpose of using rank correlation methods, such as Spear-
man’s and Kendall’s Tau, instead of Pearson’s correlation coefficient?
13. Explain the differences between Spearman rank correlation and Kendall
Tau correlation. In what situations might one method be preferred over
the other?

14. Discuss the advantages and limitations of using rank-based methods for
measuring the association between two variables.
15. Define Spearman’s rank correlation coefficient. How is it computed, and
what does it measure?

16. Describe how tied ranks are handled in the computation of Spearman’s
rank correlation coefficient. What impact do ties have on the correlation
value?
17. Define Kendall’s Tau correlation coefficient. How does it differ from
Spearman’s rank correlation in terms of interpretation and calculation?

18. Explain the concepts of concordant and discordant pairs in the context of
Kendall’s Tau. How are they used to calculate the correlation coefficient?
19. Kendall’s Tau is often considered more robust than Spearman’s rank cor-
relation in the presence of tied ranks. Discuss why this is the case and
how Kendall’s Tau adjusts for ties.

20. Given the following pairs of data representing the number of hours studied
(X) and the scores obtained (Y) by 10 students in an exam:


Student Hours Studied (X) Exam Score (Y)


1 2 50
2 3 55
3 5 60
4 6 65
5 8 70
6 9 75

T
7 10 80
8 12 85
9 14 90
10 15 95
AF (i). Plot the scatter diagram for this data. Describe the relationship
between the hours studied and the exam scores based on the
scatter plot.
(ii). Calculate the covariance between the number of hours studied
(X) and the exam scores (Y).
(iii). Interpret the sign and magnitude of the covariance. What does
it tell you about the relationship between the two variables?
(iv). Calculate the Pearson correlation coefficient between the hours
studied and exam scores.

(v). Interpret the value of the correlation coefficient. What does it


indicate about the strength and direction of the relationship be-
tween the two variables?
21. The owner of Maumee Ford-Volvo wants to study the relationship between
the age of a car and its selling price. Listed below is a random sample of
12 used cars sold at the dealership during the last year.


Car    Age (years)   Selling Price ($000)
1      9             8.1
2      7             6.0
3      11            3.6
4      12            4.0
5      8             5.0
6      7             10.0
7      8             7.6
8      11            8.0
9      10            8.0
10     12            6.0
11     6             8.6
12     6             8.0

(i). Draw a scatter diagram. Interpret the plot.


(ii). Determine the correlation coefficient.
(iii). Interpret the correlation coefficient. Does it surprise you that the
correlation coefficient is negative?
(iv). Test whether this linear relationship is significant or not.

22. Consider the following data on the ranks of two variables, X and Y :

Observation Rank of X Rank of Y


1 2 3
2 1 2
3 4 1
4 3 4

(a) Calculate the Spearman rank correlation coefficient for the above
data.
(b) Interpret the result in the context of the relationship between X and
Y.


23. Suppose we have the following ranks for two variables A and B:

Observation Rank of A Rank of B


1 1 2
2 2 1
3 3 3
4 4 5
5 5 4

(a) Calculate the Kendall Tau correlation coefficient for the data pro-
vided.
(b) Discuss the strength and direction of the relationship between A and
B based on your result.
24. You are provided with the following data on two variables, P and Q.
Compute both the Spearman rank correlation and Kendall Tau correlation
coefficients:
Observation P Q
1 85 92
2 78 85
3 92 88
4 70 76
5 88 90

(a) Compare the results from both rank correlation methods. Explain
any similarities or differences observed.

10.8 Regression Analysis


Imagine you are a data scientist working for a real estate company. Your task
is to predict the price of a house based on various features such as the number
of bedrooms, square footage, and location. You have a dataset with historical
sales data, and you want to use this data to build a model that can predict the
price of a house given its features. This is where regression analysis comes into
play.

The drawback of correlation analysis is that it only measures the strength


and direction of the (linear) relationship between two variables. It does not
provide information about causality or the nature of the relationship be-
yond linearity. Additionally, correlation coefficients can be affected by outliers


and may not capture complex relationships that exist between variables.

The motivation for regression analysis stems from the need to understand
and model the relationship between variables more comprehensively. Regression
analysis allows us to not only measure the direction and significant effect of
the relationship but also to make predictions and infer causality, provided
certain assumptions are met. By fitting a regression model, we can examine
how changes in one variable are associated with changes in another variable
while controlling for potential confounding factors.
• Dependent Variable: The outcome or response variable that is
  being predicted or explained in an analysis.

• Independent Variable: The predictor or explanatory variable
  that is used to predict or explain changes in the dependent vari-
  able.
Regression analysis enables deeper insights into the underlying mechanisms
driving the relationship between variables.
Regression analysis: Regression analysis is a set of statistical methods
used to estimate relationships between a dependent variable (also known
as the response or target variable) and one or more independent variables
(also known as predictors, features, or explanatory variables), allowing
for predictions and the determination of the strength and nature of these
relationships. This relationship can be expressed as

y = g(x1 , x2 , . . . , xp ) + e (10.2)

where y represents the dependent variable, x1 , x2 , . . . , xp are the indepen-


dent variables, and e denotes the error term.

In model (10.2), the function g(x1 , x2 , . . . , xp ) represents the systematic part


of the model. It describes the relationship between the predictor variables
x1 , x2 , . . . , xp and the response variable y. The specific form of g depends on
the type of regression model. For example, in linear regression, g is typically
a linear combination of the predictors:
g(x1 , x2 , . . . , xp ) = β0 + β1 x1 + β2 x2 + · · · + βp xp .
For instance, banks might use regression analysis to evaluate the risk asso-
ciated with home-loan applicants. In this context, independent variables such
as the applicant’s age, income, expenses, occupation, number of dependents,
and total credit are used to predict the likelihood of loan repayment, thereby
aiding in risk assessment and decision-making processes.

The main goal is to understand how changes in the independent variables


affect the dependent variable, and to use this understanding to make predic-
tions.


10.8.1 Types of Regression Analysis


In a broad sense, there are two types of regression models: (i) parametric regres-
sion models and (ii) nonparametric regression models. Parametric regression
involves assuming a specific functional form for the relationship between the de-
pendent variable and the independent variables. The model is defined by a finite
number of parameters. The most common example is linear regression, where
the relationship between the variables is assumed to be linear. Non-parametric
regression does not assume a specific functional form for the relationship be-
tween the dependent and independent variables. Instead, the nonparametric
regression model is more flexible and can adapt to the data’s structure. We
will not discuss nonparametric regression models; instead, we will focus solely

T
on parametric regression models.

There are various types of regression models depending on the nature of the
response variable. Some of them are mentioned below:

• for continuous response variable


■ Simple Linear Regression
■ Multiple Regression
■ Polynomial Regression
■ Multivariate Regression, etc.

• for categorical response variable


■ Logistic Regression
▶ binomial (also called ordinary) logistic regression

▶ multinomial logistic regression


▶ ordinal logistic regression
▶ alternating logistic regressions, etc.

• for discrete response variable

■ Poisson Regression
■ Negative Binomial Regression, etc.

• for survival-time (time-to-event) outcomes

■ Cox Regression (or proportional hazards regression)


■ Accelerated Failure Time Model (AFT model), etc.


10.8.2 Simple Regression Model


To explore the relationship between the average value of the dependent variable
Y given each value of the independent variable X, we consider the following
population regression function:

E(Y | X = x) = β0 + β1 x (10.3)

where
• E(Y | X = x) represents the expected (or average) value of Y when X
is equal to x.

T
• β0 and β1 are the parameters of the regression function, with β0 being
the intercept and β1 the slope of the regression line.

[Figure: three panels plotting E(Y) against x. Panel A: Positive Linear Relationship (regression line with intercept β0 and positive slope β1). Panel B: Negative Linear Relationship (intercept β0 and negative slope β1). Panel C: No Relationship (intercept β0, slope β1 is 0).]


Suppose xi is a given value of X for the i-th observation. Then the popula-
tion regression function (PRF) can be written as

E(Y | X = xi ) = β0 + β1 xi (10.4)

where E(Y | X = xi ) represents the expected value of Y given that X takes the
value xi . For a given value x = xi , the observed value yi for Y can be expressed
as:

yi = E(Y | X = xi ) + ei
= β0 + β1 xi + ei (10.5)

where ei is the error term for the i-th observation, representing the deviation
of the observed value yi from the expected value E(Y | X = xi ). The model
given by Equation (10.5) is known as the simple linear regression model.
This model is used to estimate the population regression function described in
Equation (10.4).
The classical simple linear regression (CSLR) model for i-th obser-
vation is

yi = β0 + β1 xi + ei ;   i = 1, 2, . . . , n        (10.6)

where:

• yi is the dependent variable (response) for observation i.

• xi is the independent variable (predictor) for observation i.

• β0 is the intercept of the regression line.



• β1 is the slope of the regression line.

• ei is the error term (residual) for observation i, representing the


difference between the observed and predicted values.

Note that the model (10.6) is a special case of (10.2) when g is a linear
function and p = 1.

A nonintercept model in regression analysis is a model where the regres-


sion line or hyperplane is forced to pass through the origin (the point where all
variables are zero). This means the intercept term is explicitly set to zero and
is not estimated from the data. The model equation simplifies to:

y = β1 x1 + β2 x2 + · · · + βp xp + ϵ

where y is the dependent variable, xi are the independent variables, βi are the
coefficients, and ϵ is the error term. There is no constant term (β0 ).


For instance, in a manufacturing process, if the number of products made


depends directly on the amount of raw material used, and no products can be
produced without any raw material, a non-intercept model is appropriate. The
model would be
Products = β1 × Raw Material + e
where β1 estimates the production rate per unit of raw material. This model
is suitable when there’s a clear theoretical reason to exclude the intercept, but
caution is needed if a baseline effect exists.
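Such a through-the-origin fit is straightforward with numpy; a minimal sketch
with hypothetical raw-material data (the variable names and values here are
illustrative only):

import numpy as np

# Hypothetical data: raw material used (kg) and products made
raw_material = np.array([10.0, 20.0, 30.0, 40.0])
products = np.array([52.0, 99.0, 153.0, 198.0])

# Least squares through the origin: supply only the x column,
# with no constant column, so no intercept is estimated
X = raw_material.reshape(-1, 1)
beta1, residuals, rank, sv = np.linalg.lstsq(X, products, rcond=None)
print(beta1[0])  # estimated production rate per unit of raw material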

10.8.3 Assumptions of the CSLR Model (10.6)

The assumptions of the CSLR model given in Equation (10.6) are as follows:
1. Linearity: The relationship between X and Y must be linear.

2. Independence of Errors: The errors should be independent of each


other. More precisely, there should be no correlation between the errors
and the independent variable X.
3. Zero Mean: For any fixed value of xi , the mean of the errors (residuals)
is zero. That is, E(ei | xi ) = 0.
4. Homoscedasticity: The variance of the errors (residuals) is constant
across all levels of the independent variable X. That is, for any fixed
value of xi , Var(ei | xi ) = σ 2 .
5. Normality: The errors are normally distributed for any fixed value of
xi .

Figure 10.6: Conditional distribution of the disturbances ei


Under these assumptions, we can write ei ∼ iid N(0, σ²), and hence, yi | xi ∼
N(β0 + β1 xi, σ²). Therefore, this model is also known as a normal linear
regression model. The graphical representation of the simple linear regression
model given in Equation (10.6) under these assumptions is presented in Figure
10.6.
Remark 10.8.1. Normality of the errors (or residuals) is not strictly required
for estimation. However, the normality assumption in Equation (10.6) is
necessary to perform hypothesis tests concerning the regression parameters,
as discussed next.
The estimated model (or estimated line) for Equation (10.6) can be
written as

ŷi = β̂0 + β̂1 xi

where

• ŷi is the estimator of E(Y | X = xi) based on the sample data.
• β̂0 is the estimator of β0.
• β̂1 is the estimator of β1.

Therefore, in terms of the sample regression function (SRF), the observed
value yi can be expressed as

yi = ŷi + êi

where êi is the residual, representing the deviation of the observed value yi
from the estimated value ŷi.

10.8.4 Ordinary Least Squares (OLS) Estimation


Ordinary Least Squares (OLS) is a method for estimating the parameters in
a linear regression model. The basic idea behind OLS is to find the best line
(or hyperplane in higher dimensions) that minimizes the sum of the squared
differences (error) between the observed values y and the predicted values of y
from the linear model.

Under the assumptions given in Section 10.8.3 for the simple linear regression
model provided in Equation (10.6), the OLS estimators for the parameters β0
and β1 are found by minimizing the sum of squared errors. Thus, the objective
function for OLS estimation is:
Minimize Q(β0, β1) = Σᵢ ei² = Σᵢ (yi − (β0 + β1 xi))²

where the objective function Q(β0, β1) = Σᵢ ei² is the sum of squared errors.


OLS Estimators
The OLS estimators for β0 and β1 can be derived by taking the partial deriva-
tives of Q(β0 , β1 ) with respect to β0 and β1 , setting them to zero, and solving
for the coefficients. The OLS estimators are:
β̂1 = ( Σᵢ xi yi − n x̄ ȳ ) / ( Σᵢ xi² − n x̄² )
β̂0 = ȳ − β̂1 x̄
where:

• β̂1 is the estimated slope of the regression line.
• β̂0 is the estimated intercept of the regression line.

• x̄ and ȳ are the sample means of the independent variable X and the
dependent variable Y , respectively.
Procedure for Calculating the OLS Estimators in a Simple
Linear Regression Model
For the simple linear regression model given in Equation (10.6), the OLS esti-
mators can be found using the following steps:
1. Objective Function:
Q(β0, β1) = Σᵢ ei² = Σᵢ (yi − β0 − β1 xi)²

2. Find the Estimators: The estimators for β0 and β1 are obtained by
   solving the following normal equations:

   ∂Q/∂β0 = −2 Σᵢ (yi − β0 − β1 xi) = 0        (10.7)

   ∂Q/∂β1 = −2 Σᵢ (yi − β0 − β1 xi) xi = 0     (10.8)

   From Equation (10.7), we get:

   Σᵢ (yi − β0 − β1 xi) = 0
   ⇒ Σᵢ yi − n β0 − β1 Σᵢ xi = 0
   ⇒ β0 = ȳ − β1 x̄


From Equation (10.8), we get:

Σᵢ (yi − β0 − β1 xi) xi = 0
⇒ Σᵢ xi yi − β0 Σᵢ xi − β1 Σᵢ xi² = 0        (10.9)

Substituting β0 = ȳ − β1 x̄ into Equation (10.9):

Σᵢ xi yi − (ȳ − β1 x̄) Σᵢ xi − β1 Σᵢ xi² = 0
⇒ Σᵢ xi yi − ȳ Σᵢ xi + β1 x̄ Σᵢ xi − β1 Σᵢ xi² = 0
⇒ ( Σᵢ xi yi − n x̄ ȳ ) − β1 ( Σᵢ xi² − n x̄² ) = 0
⇒ β1 = ( Σᵢ xi yi − n x̄ ȳ ) / ( Σᵢ xi² − n x̄² )

Hence, the Ordinary Least Squares (OLS) estimators are:

β̂1 = ( Σᵢ xi yi − n x̄ ȳ ) / ( Σᵢ xi² − n x̄² )
β̂0 = ȳ − β̂1 x̄

The estimator β̂1 can also be expressed as:

β̂1 = Σᵢ (xi − x̄)(yi − ȳ) / Σᵢ (xi − x̄)² = rxy (sy / sx)

where rxy is the correlation coefficient between X and Y, and sx and sy are the
sample standard deviations of X and Y, respectively.
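These closed-form formulas translate directly into a few lines of Python; a
minimal sketch, illustrated with the data of Problem 10.3:

import numpy as np

def ols_simple(x, y):
    """Closed-form OLS estimates for the model y = b0 + b1*x + e."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    b1 = (np.sum(x * y) - n * x.mean() * y.mean()) / \
         (np.sum(x**2) - n * x.mean()**2)
    b0 = y.mean() - b1 * x.mean()
    return b0, b1

b0, b1 = ols_simple([4, 5, 3, 6, 10], [4, 6, 5, 7, 7])
print(b0, b1)  # intercept about 3.767, slope about 0.363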

10.8.5 Interpretation of Regression Coefficients


The estimated regression model (or equation) is

ŷi = β̂0 + β̂1 xi

where β̂0 and β̂1 are, respectively, the estimators of β0 and β1.


• Intercept (β̂0): The intercept represents the expected value of the
  dependent variable Y when the realized value of the independent
  variable X is zero. In other words, it is the point where the regression
  line crosses the y-axis.


• Slope of the line (β̂1):

  ■ it shows the amount of change in ŷ for a change of one unit in X
  ■ a positive value for β̂1 indicates a direct relationship between the
    two variables and a negative value indicates an inverse relationship
  ■ the sign of β̂1 and the sign of rxy, the correlation coefficient, are
    always the same
Important properties related to correlation and regression coefficients are
stated in Theorem 10.1.

Theorem 10.1. Let β̂xy be the slope of the regression of y on x and β̂yx be the
slope of the regression of x on y. The estimated regression lines are:

• y = β̂xy x + bxy (regression of y on x)
• x = β̂yx y + byx (regression of x on y)

Then the geometric mean of β̂xy and β̂yx is equal to the absolute value of the
Pearson correlation coefficient r.
Proof. The slope β̂xy of the regression of y on x is given by:

β̂xy = r · (sy / sx)

where r is the Pearson correlation coefficient, sy is the standard deviation of y,
and sx is the standard deviation of x.

The slope β̂yx of the regression of x on y is given by:

β̂yx = r · (sx / sy)

To find the geometric mean of β̂xy and β̂yx, we compute:

Geometric Mean = √( β̂xy · β̂yx )

Substituting the expressions for β̂xy and β̂yx:

Geometric Mean = √( (r · sy/sx) · (r · sx/sy) ) = √(r²) = |r|

Thus, the geometric mean of the regression coefficients β̂xy and β̂yx is the
absolute value of the Pearson correlation coefficient r.


10.8.6 The Estimated Error Variance or Standard Error


The error variance or standard error of estimate measures the scatter, or dis-
persion, of the observed values around the line of regression. The formulas used
to compute the error variance and the standard error are:

σ̂² = Σᵢ (yi − ŷi)² / (n − 2) = ( Σᵢ yi² − β̂0 Σᵢ yi − β̂1 Σᵢ xi yi ) / (n − 2)

where ŷi = β̂0 + β̂1 xi. Hence, the standard error of estimate is

σ̂ = √( Σᵢ (yi − ŷi)² / (n − 2) ) = √( ( Σᵢ yi² − β̂0 Σᵢ yi − β̂1 Σᵢ xi yi ) / (n − 2) )

Problem 10.7 (Shorshe Ilish Restaurant Sales Dataset). Suppose data were
collected from a sample of 10 semesters for a restaurant located near the
university campus. For the i-th semester in the sample, xi is the size of the
student population (in thousands) and yi is the quarterly sales (in thousands of
dollars).

Table 10.6: Student Population and Quarterly Sales Data for Different
Semesters

Semester i   Student Population (1000s) xi   Quarterly Sales ($1000s) yi
1            2                               58
2            6                               105
3            8                               88
4            8                               118
5            12                              117
6            16                              137
7            20                              157
8            20                              169
9            22                              149
10           26                              202


(i). Show the relationship between the size of student population and the
quarterly sales. Make a comment on the diagram.

(ii). Write down the regression model for this example, and mention the
assumptions of the model.

(iii). Find the least square estimates and write the estimated regression model.
Interpret the results.

(iv). Draw the regression line on the scatter diagram.

(v). Predict quarterly sales for a restaurant to be located near a campus


with 16,000 students.

(vi). Find the value of the standard error of the estimates.

Solution
(i). The scatter plot in Figure 10.7 shows the relationship between student
population (in thousands) and quarterly sales (in thousands of dollars) based
on data from ten semesters.

[Figure: scatter plot of Student Population (1000s) on the X-axis against Quarterly Sales ($1000s) on the Y-axis]

Figure 10.7: Scatter Plot of Student Population vs. Quarterly Sales

Observations:
• There appears to be a positive correlation between student population
and quarterly sales. As the student population increases, the quarterly
sales also tend to increase.


• The data points are somewhat clustered along a line, suggesting a linear
relationship.

• Some points, such as the one where the student population is 8, show
variability in sales, indicating that factors other than student popula-
tion might also influence sales.

(ii). Simple regression model:

yi = β0 + β1 xi + ei ; i = 1, 2, . . . , 10 (10.10)

where

• yi = Quarterly Sales ($1000s)

• xi = Student Population (1000s)

• β0 is the intercept

• β1 is the slope coefficient

• ei is the error term, and we assume ei is normally distributed with
  mean 0 and constant variance (say, σ²).
Assumptions:
1. Linearity: The relationship between the dependent and independent
variables is linear.
2. Independence: The observations are independent of each other.
3. Homoscedasticity: The variance of the error terms is constant across
all levels of the independent variable.

4. Zero Mean: For any fixed value of xi , the mean of the errors (residuals)
is zero.
5. Normality of Errors: The error terms are normally distributed (impor-
tant for inference, but not for estimation).

(iii). The estimated regression model (or equation) of (10.10) is

ŷi = β̂0 + β̂1 xi

where β̂0 and β̂1 are, respectively, the estimators of β0 and β1. The ordinary
least squares (OLS) estimators of β0 and β1 are

β̂1 = ( Σᵢ xi yi − n x̄ ȳ ) / ( Σᵢ xi² − n x̄² );   β̂0 = ȳ − β̂1 x̄.


The Least Squares Estimates. The realized values of the OLS estimators
are

β̂1 = ( Σᵢ xi yi − n x̄ ȳ ) / ( Σᵢ xi² − n x̄² ) = 2840/568 = 5

β̂0 = ȳ − β̂1 x̄ = 130 − 5(14) = 60

based on the following computations:

xi      yi      xi²     xi·yi
2       58      4       116
6       105     36      630
8       88      64      704
8       118     64      944
12      117     144     1404
16      137     256     2192
20      157     400     3140
20      169     400     3380
22      149     484     3278
26      202     676     5252
Total   140     1300    2528    21040

Thus, the estimated regression equation is


ŷi = 60 + 5xi

The slope of the estimated regression equation (β̂1 = 5) is positive, implying


that as student population increases, sales increase. In fact, we can conclude
that an increase in the student population of 1000 is associated with an increase
of $5000 in expected sales; that is, quarterly sales are expected to increase by
$5 per student.

(iv). The regression line with the scatter plot is depicted in Figure 10.8.


[Figure: the same scatter plot with the fitted regression line ŷi = 60 + 5xi drawn through the points]

Figure 10.8: Scatter Plot of Student Population vs. Quarterly Sales with Re-
gression Line

(v). To predict quarterly sales for a restaurant to be located near a campus


with 16,000 students, we would compute

ŷ = 60 + 5 × 16 = 140

Hence, we would predict quarterly sales of $140,000 for this restaurant.



xi     yi     ŷi = 60 + 5xi   (yi − ŷi)²
2      58     70              144
6      105    90              225
8      88     100             144
8      118    100             324
12     117    120             9
16     137    140             9
20     157    160             9
20     169    160             81
22     149    170             441
26     202    190             144
140    1300   1300            1530
AF
(vi). The Standard Error of the Estimates

σ̂² = Σᵢ (yi − ŷi)² / (n − 2) = 1530 / (10 − 2) = 191.25  and  σ̂ = √191.25 = 13.83

Hence the standard error of the estimates is σ̂ = 13.83.
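The whole analysis can be reproduced with the statsmodels library; a sketch
for the data of Problem 10.7:

import numpy as np
import statsmodels.api as sm

population = np.array([2, 6, 8, 8, 12, 16, 20, 20, 22, 26])
sales = np.array([58, 105, 88, 118, 117, 137, 157, 169, 149, 202])

X = sm.add_constant(population)   # adds the intercept column
model = sm.OLS(sales, X).fit()

print(model.params)               # expected: [60.  5.]
print(model.predict([[1, 16]]))   # expected: [140.]
print(np.sqrt(model.mse_resid))   # standard error of estimate, about 13.83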

10.8.7 Coefficient of Determination

The coefficient of determination is the proportion of the total variation in the dependent variable Y that is explained by the independent variable(s) in a regression model. It is denoted by R² and defined by

R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}

where

• the total sum of squares (proportional to the variance of the data): SST = \sum_i (y_i - \bar{y})^2

• the sum of squares of residuals, also called the Residual or Error Sum of Squares (SSE): SSE = \sum_i (y_i - \hat{y}_i)^2 = \sum_i \hat{e}_i^2

For the Shorshe Ilish restaurant Sales Dataset given in Problem 10.7,
we have


xi      yi      (yi − ȳ)²   ŷi = 60 + 5xi   (yi − ŷi)²
2       58      5184        70              144
6       105     625         90              225
8       88      1764        100             144
8       118     144         100             324
12      117     169         120             9
16      137     49          140             9
20      157     729         160             9
20      169     1521        160             81
22      149     361         170             441
26      202     5184        190             144
140     1300    15730       1300            1530
hence, R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} = 1 - \frac{1530}{15730} = 0.9027

The R² = 0.9027 implies that 90.27% of the variability of the dependent variable is explained by the regression model, and the remaining 9.73% of the variability is still unexplained.
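
The same arithmetic can be verified in a few lines of NumPy; this is a minimal sketch using the fitted equation ŷi = 60 + 5xi from above.

import numpy as np

x = np.array([2, 6, 8, 8, 12, 16, 20, 20, 22, 26])
y = np.array([58, 105, 88, 118, 117, 137, 157, 169, 149, 202])

y_hat = 60 + 5 * x                    # fitted values from the estimated equation
sse = np.sum((y - y_hat) ** 2)        # residual (error) sum of squares
sst = np.sum((y - y.mean()) ** 2)     # total sum of squares
r2 = 1 - sse / sst
print(sse, sst, round(r2, 4))         # expected: 1530, 15730, 0.9027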

10.8.8 Relationship between R² and rxy

The relationship between R² and rxy is that the square of the coefficient of correlation (rxy) is equal to the coefficient of determination (R²) for the simple regression model. Mathematically,

R^2 = (r_{xy})^2

Hence, for the previous Problem 10.3,

R^2 = (r_{xy})^2 = (0.7522)^2 = 0.566

The estimated regression equation for the simple linear regression model is

ŷi = β̂0 + β̂1 xi

The sample correlation coefficient is

r_{xy} = (\text{sign of } \hat{\beta}_1)\,\sqrt{\text{coefficient of determination}} = (\text{sign of } \hat{\beta}_1)\,\sqrt{R^2}    (10.11)

where R² is the coefficient of determination for the simple regression model yi = β0 + β1 xi + ei ; i = 1, 2, . . . , n.

Remarks: Note that the relationship R² = (rxy)² only holds for the simple linear regression model.

10.8.9 Advantages and Disadvantages of R2
Here are some advantages and disadvantages of using R2 :
• R2 is a statistic that will give some information about the goodness-
of-fit of a model.
AF• In regression, the R2 coefficient of determination is a statistical measure
of how well the regression predictions approximate the real data points.

• An R² of 1 indicates that the regression predictions perfectly fit the data.

• R² increases as we increase the number of variables in the model (R² is monotone increasing with the number of variables included, i.e., it will never decrease).

10.8.10 Adjusted R²

An adjusted R² is a modification of R² that adjusts for the number of independent variables in a model (p) relative to the number of data points (n). The adjusted R² (denoted by R̄²) is defined as

\bar{R}^2 = 1 - \left(1 - R^2\right)\frac{n-1}{n-p-1}

where p is the total number of independent variables in the model (not including the constant term), and n is the sample size. It can also be written as:

\bar{R}^2 = 1 - \frac{SSE/df_e}{SST/df_t}

where dft is the degrees of freedom n − 1 of the estimate of the population variance of the dependent variable, and dfe is the degrees of freedom n − p − 1 of the estimate of the underlying population error variance.

440
CHAPTER 10. CORRELATION AND REGRESSION ANALYSIS

The explanation of this statistic is almost the same as R², but it penalizes the statistic as extra variables are included in the model. The term (n−1)/(n−p−1) is called the penalty for using more covariates in a model. When the number of covariates p increases, (1 − R²) will decrease, but (n−1)/(n−p−1) will increase. Whether more covariates improve the explanatory power of a model depends on the trade-off between R² and the penalty.

For the previous Problem 10.3, p = 1 and R² = 0.566 and hence,

\bar{R}^2 = 1 - (1 - 0.566)\,\frac{5-1}{5-1-1} = 0.4212
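
As a small illustration, the formula can be wrapped in a helper function (adjusted_r2 below is our own name, not a library routine); the result agrees with the hand calculation up to rounding.

def adjusted_r2(r2, n, p):
    """Adjusted R-squared for a model with p predictors and n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Values from Problem 10.3 as quoted in the text (n = 5, one predictor)
print(round(adjusted_r2(0.566, n=5, p=1), 4))   # approximately 0.421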

10.8.11 Python Code: Simple Regression Analysis

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Data: Student Population and Quarterly Sales
data = {
    'Restaurant': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Student Population (1000s)': [2, 6, 8, 8, 12, 16, 20, 20, 22, 26],
    'Quarterly Sales ($1000s)': [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]
}

# Convert the data to a DataFrame
df = pd.DataFrame(data)

# (i) Show the relationship between the size of the student population
# and the quarterly sales
plt.figure(figsize=(8, 6))
plt.scatter(df['Student Population (1000s)'], df['Quarterly Sales ($1000s)'], color='blue')
plt.title('Relationship between Student Population and Quarterly Sales')
plt.xlabel('Student Population (1000s)')
plt.ylabel('Quarterly Sales ($1000s)')
plt.grid(True)
plt.show()

# (iii) Write the estimated regression equation and find the least squares estimates
X = df['Student Population (1000s)']
y = df['Quarterly Sales ($1000s)']

# Add a constant to the independent variable matrix
X = sm.add_constant(X)

# Fit the regression model
model = sm.OLS(y, X).fit()

# Print the regression results
print(model.summary())

# The estimated regression equation:
# y = model.params[0] + model.params[1] * x
# Interpret the results based on the model summary.

# (iv) Draw the regression line
plt.figure(figsize=(8, 6))
plt.scatter(df['Student Population (1000s)'], df['Quarterly Sales ($1000s)'],
            color='blue', label='Data Points')
plt.plot(df['Student Population (1000s)'], model.predict(X),
         color='red', label='Regression Line')
plt.title('Regression Line: Quarterly Sales vs. Student Population')
plt.xlabel('Student Population (1000s)')
plt.ylabel('Quarterly Sales ($1000s)')
plt.legend()
plt.grid(True)
plt.show()

# (v) Predict quarterly sales for a restaurant near a campus with 16,000 students
predicted_sales = model.predict([1, 16])[0]  # [1, 16]: the 1 is for the constant term
print(f'Predicted Quarterly Sales for 16,000 students: ${predicted_sales:.2f} (in $1000s)')

# (vi) Find the value of the standard error of the estimate
standard_error = np.sqrt(model.mse_resid)
print(f'Standard Error of the Estimate: {standard_error:.2f}')
62


Figure 10.9: Python output for the practice example.

10.8.12 Interval Estimation and Hypothesis Testing

Confidence Interval for β0

The confidence interval for β0 can be computed using the standard formula for linear regression parameter estimates. The formula for the confidence interval for β0 is:

\hat{\beta}_0 \pm t_{\alpha/2,(n-2)} \cdot se(\hat{\beta}_0)

where:

• t_{α/2,(n−2)} is the critical value of the t-distribution with n − 2 degrees of freedom at a significance level of α/2 (where α is typically 0.05 for a 95% confidence interval),

• the standard error of the estimator β̂0 is

se(\hat{\beta}_0) = \sqrt{\hat{\sigma}^2\left(\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right)}.

Confidence Interval for β1

The confidence interval for β1 can be computed using the standard formula for linear regression parameter estimates. The formula for the confidence interval for β1 is:

\hat{\beta}_1 \pm t_{\alpha/2,(n-2)} \cdot se(\hat{\beta}_1)

where:

• t_{α/2,(n−2)} is the critical value of the t-distribution with n − 2 degrees of freedom at a significance level of α/2 (where α is typically 0.05 for a 95% confidence interval),

• the standard error of the estimator β̂1 is

se(\hat{\beta}_1) = \frac{\hat{\sigma}}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}}

where

\hat{\sigma} = \sqrt{\frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n-2}}

is the estimated standard error of the residuals (or the square root of the mean squared error, often obtained from the regression output). Once you have these values, you can compute the confidence interval.
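
In practice these intervals need not be computed by hand. Assuming the statsmodels fit from Section 10.8.11, conf_int() returns the intervals for β̂0 (first row) and β̂1 (second row) at once; a minimal sketch:

import numpy as np
import statsmodels.api as sm

x = np.array([2, 6, 8, 8, 12, 16, 20, 20, 22, 26])
y = np.array([58, 105, 88, 118, 117, 137, 157, 169, 149, 202])

model = sm.OLS(y, sm.add_constant(x)).fit()

# 95% confidence intervals, computed internally as estimate ± t_{0.025, n-2} * se
print(model.conf_int(alpha=0.05))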

10.8.13 The F-tests in Simple Linear Regression Model

In a simple linear regression model, the F-test is used to assess the overall significance of the regression model. The hypotheses are

H0 : β1 = 0
H1 : β1 ≠ 0

The null hypothesis H0 implies that the independent variable does not have any effect on the dependent variable; the alternative hypothesis H1 indicates that it does. The formula for the F-statistic in a simple linear regression model is:

F = \frac{SSR/1}{SSE/(n-2)}

where:

• SSR = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 is the sum of squares due to regression (explained),

• SSE is the sum of squared error (residual) terms,

• Under the null hypothesis, the F-statistic follows an F-distribution with p and n − p − 1 degrees of freedom, where p is the number of covariates (excluding the intercept) and n is the number of observations.

• In a simple linear regression model, p = 1 because there is only one independent variable (excluding the intercept).


• So, the degrees of freedom for the F-distribution in a simple linear regression model are 1 and n − 2.

• Once you compute the F-statistic, you can compare it to the critical value from the F-distribution at a chosen significance level (e.g., α = 0.05) to determine whether to reject the null hypothesis.

• If the F-statistic is greater than the critical value, you reject the null hypothesis and conclude that the model is significant. Otherwise, you fail to reject the null hypothesis.

Decision Rule for ANOVA F-test

Classical Approach

• Set the significance level α.

• Calculate the critical value Fcritical = Fα(1, n − 2) from the F-distribution with appropriate degrees of freedom.

• Decision Rule:
  ■ If calculated F > Fcritical, reject H0 and conclude that the regression model is statistically significant.
  ■ If calculated F ≤ Fcritical, fail to reject H0 and conclude that the regression model is not statistically significant.

p-value Approach

• Calculate the p-value associated with the calculated F-statistic.

• Decision Rule:
  ■ If p-value < α, reject H0 and conclude that the regression model is statistically significant.
  ■ If p-value ≥ α, fail to reject H0 and conclude that the regression model is not statistically significant.
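
A short sketch of both decision rules with scipy.stats, using the sums of squares computed earlier for the example data (SST = 15730, SSE = 1530):

from scipy import stats

n = 10                      # sample size in the example
sst, sse = 15730, 1530      # sums of squares from the worked example
ssr = sst - sse             # regression (explained) sum of squares

F = (ssr / 1) / (sse / (n - 2))            # F-statistic with (1, n-2) df
F_critical = stats.f.ppf(0.95, 1, n - 2)   # classical approach, alpha = 0.05
p_value = stats.f.sf(F, 1, n - 2)          # p-value approach
print(F, F_critical, p_value)              # F is about 74.25 > F_critical, so reject H0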

10.8.14 The t-tests in Simple Linear Regression Model

Consider the simple linear regression model:

yi = β0 + β1 xi + ei ;  i = 1, 2, . . . , n

To test the individual significance of the independent variable, the null hypothesis is

H0 : β1 = 0

Under H0, the test statistic is given by:

t = \frac{\hat{\beta}_1 - 0}{se(\hat{\beta}_1)}

where se(\hat{\beta}_1) = \frac{\hat{\sigma}}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}} and \hat{\sigma} = \sqrt{\frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n-2}}.

We reject H0 if |t| exceeds the critical value tcritical = t_{α/2,(n−2)} from the t-distribution with n − 2 degrees of freedom, where n is the sample size.

Decision Rule for t-test

Classical Approach

• Set the significance level α.

• Calculate the critical value tcritical from the t-distribution with appropriate degrees of freedom.

• Decision Rule for β̂1:
  ■ If |t| > tcritical = t_{α/2,(n−2)}, reject H0 and conclude that the corresponding coefficient β̂1 is statistically significant.
  ■ If |t| ≤ tcritical, fail to reject H0 and conclude that the corresponding coefficient β̂1 is not statistically significant.

p-value Approach

• Calculate the p-value associated with each calculated t-statistic.

• Decision Rule for β̂1:
  ■ If p-value < α, reject H0 and conclude that the corresponding coefficient β̂1 is statistically significant.
  ■ If p-value ≥ α, fail to reject H0 and conclude that the corresponding coefficient β̂1 is not statistically significant.


Note that the p-value can be obtained by using the formula:

p-value = 2 × min(P (T < −|t|), P (T > |t|))
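
For example, the two-sided p-value can be evaluated with scipy.stats. The numbers below come from the worked example (β̂1 = 5, and se(β̂1) = 13.829/√568 ≈ 0.5803), so |t| ≈ 8.62 and t² reproduces the F-statistic above up to rounding:

from scipy import stats

n = 10
t = 5 / 0.5803                               # beta1_hat / se(beta1_hat)
p_value = 2 * stats.t.sf(abs(t), df=n - 2)   # two-sided p-value
print(t, p_value)                            # p-value is far below 0.05, so reject H0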

Confidence Interval for E(Y|X = x*)

• The Confidence Interval for E(Y|X = x*) provides a range of values where we expect the mean response Y to lie for a given value of x* with a certain level of confidence.

• It is computed as:

\hat{y}^* \pm t_{\alpha/2,(n-2)} \cdot se(\hat{y}^*)

where ŷ* is the predicted value of Y for a given x*, t_{α/2,(n−2)} is the critical value of the t-distribution, n is the number of observations, and

se(\hat{y}^*) = \hat{\sigma}\sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}

is the standard error of the predicted value.

• x* is the specific value of the independent variable for which you're predicting the response.

• The confidence level is typically chosen to be 95% (α = 0.05).

For the Shorshe Ilish restaurant Sales Dataset given in Problem 10.7, we have σ̂ = 13.829. With x* = 10, x̄ = 14, and Σ(xi − x̄)² = 568, we have

se(\hat{y}^*) = 13.829\sqrt{\frac{1}{10} + \frac{(10-14)^2}{568}} = 13.829\sqrt{0.1282} = 4.95

With ŷ* = 60 + 5(10) = 110 and a margin of error of t_{0.025,8} × se(ŷ*) = 2.306 × 4.95 = 11.4147, the 95% confidence interval for the average quarterly sales of a Shorshe Ilish restaurant located near campus at fixed x* = 10 is

110 ± 11.4147

where t_{α/2,(n−2)} = t_{0.025,8} = 2.306.

Thus, the 95% confidence interval for the mean quarterly sales, given a student population of 10,000, ranges from $98,585 to $121,415. Observe that the confidence interval for the mean value of Y widens as x* deviates further from x̄. This behavior is illustrated graphically in Figure 10.10.

Figure 10.10: Confidence and prediction intervals for sales Y at given values of student population X

Prediction Interval for an Individual Value of Y

• The Prediction Interval for an Individual Value of Y provides a range of values where we expect a new observation of Y to lie with a certain level of confidence.

• It is wider than the Confidence Interval for the mean response E(Y|X = x*) because it accounts for the variability of individual observations around the regression line.

• It is computed as:

\hat{y}^* \pm t_{\alpha/2,(n-2)} \cdot s_{pred}

where

s_{pred}^2 = \hat{\sigma}^2 + se(\hat{y}^*)^2 \quad \text{and hence} \quad s_{pred} = \hat{\sigma}\sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}

is the standard error of the prediction and ŷ* is the predicted value of Y for a given x*.

• The confidence level is typically chosen to be 95% (α = 0.05).

For the Shorshe Ilish restaurant Sales Dataset given in Problem 10.7, the estimated standard deviation corresponding to the prediction of quarterly sales for a new restaurant located near a campus with 10,000 students is computed as follows:

s_{pred} = 13.829\sqrt{1 + \frac{1}{10} + \frac{(10-14)^2}{568}} = 13.829\sqrt{1.1282} = 14.69

The 95% prediction interval for quarterly sales for the Shorshe Ilish restaurant located near campus uses t_{α/2,(n−2)} = t_{0.025,8} = 2.306. Thus, with ŷ* = 110 and a margin of error of t_{0.025,8} × s_pred = 2.306 × 14.69 = 33.875, the 95% prediction interval is

110 ± 33.875

In dollars, this prediction interval is $76,125 to $143,875.
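
Both intervals can be reproduced with statsmodels' get_prediction; this is a minimal sketch for x* = 10, and the printed bounds should match the hand calculations above up to rounding.

import numpy as np
import statsmodels.api as sm

x = np.array([2, 6, 8, 8, 12, 16, 20, 20, 22, 26])
y = np.array([58, 105, 88, 118, 117, 137, 157, 169, 149, 202])
model = sm.OLS(y, sm.add_constant(x)).fit()

pred = model.get_prediction(np.array([[1.0, 10.0]]))   # row is [constant, x*]
print(pred.conf_int(alpha=0.05))             # CI for E(Y|X=10): about (98.6, 121.4)
print(pred.conf_int(obs=True, alpha=0.05))   # prediction interval: about (76.1, 143.9)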

Confidence intervals and prediction intervals become more precise as the value of the independent variable x* approaches x̄. The typical shapes of confidence intervals and the broader prediction intervals are illustrated together in Figure 10.11.

Figure 10.11: Confidence and prediction intervals for sales Y at given values of student population X

10.8.15 Python Code: Linear Regression Model


import numpy as np
import statsmodels.api as sm

# Define the data
x = np.array([2, 6, 8, 8, 12, 16, 20, 20, 22, 26])
y = np.array([58, 105, 88, 118, 117, 137, 157, 169, 149, 202])

# Add constant for intercept
x_with_const = sm.add_constant(x)

# Fit the model
model = sm.OLS(y, x_with_const).fit()

# Print summary of the regression model
print(model.summary())

10.8.16 Exercises
1. What is meant by regression analysis? Distinguish between correlation
and regression analysis.
2. Define and describe the main types of regression analysis (e.g., simple
linear regression, multiple linear regression). Provide a real-world example
for each type and explain why that particular type of regression would be
used.
3. Match the following scenarios with the appropriate type of regression
analysis:

(a) Predicting a student’s final grade based on hours studied.


(b) Modeling the relationship between house prices and multiple features
(e.g., square footage, number of bedrooms).

(c) Estimating the effect of different levels of advertising spend on sales


with a non-linear relationship.

4. The following sample of observations were randomly selected.

No. of TV Commercials (x)   2   5   1   3   4   1   5   3   4   2
Total Sales (y)             50  57  41  54  54  38  63  48  59  46

(i). Find the linear relationship between the number of TV commercials and total sales.
(ii). Fit a model y on x.
(iii). Find the coefficient of determination. Interpret your findings.
(iv). Find the adjusted R2 . Interpret your findings.


(v). Predict (or forecast) total sales when x = 5.

5. Consider the following sample of production volumes and total cost data
for a manufacturing operation.

Production Volume (units)   Total Cost ($)
400                         4000
450                         5000
550                         5400
600                         5900
700                         6400
750                         7000
550                         5500
615                         6000
(i). Use these data to develop an estimated regression equation that could be used to predict the total cost for a given production volume. Interpret the values of the regression intercept and slope coefficients. The company's production schedule shows 500 units must be produced next month. Predict the total cost for this operation.
(ii). Compute the coefficient of determination. What percentage of the variation in total cost can be explained by production volume?

6. Given a dataset with the following statistics:


• Pearson correlation coefficient r = 0.8
• Standard deviation of x, sx = 4
• Standard deviation of y, sy = 5

Compute the following:


(a) The regression coefficient βbxy for predicting y from x.
(b) The regression coefficient βbyx for predicting x from y.
7. Consider two regression coefficients:
• βbxy = 1.5

• βbyx = 0.7


Calculate the geometric mean of these regression coefficients and verify if it equals the absolute value of the Pearson correlation coefficient r.
8. You are given:
• βbxy = 2.0

• βbyx = 0.5
• Standard deviation of x, sx = 3
• Standard deviation of y, sy = 6
Determine the Pearson correlation coefficient r using the given regression coefficients and standard deviations.
9. If βbxy = 1.2 and βbyx = 0.9, find the value of r2 and compare it with the
product of the regression coefficients.
10. For a simple linear regression model, you have the following sums of
squares:
• Total sum of squares (SST) = 300
• Regression sum of squares (SSR) = 180
Calculate the residual sum of squares (SSres ) and discuss its relation to
the total and regression sum of squares.

11. Suppose the Pearson correlation coefficient between two variables is −0.6.
If the standard deviations are sx = 5 and sy = 10, calculate:

(a) βbxy

(b) βbyx
Discuss how a negative correlation affects the regression coefficients com-
pared to a positive correlation.

12. You have the regression coefficients:


• βbxy = −0.4

• βbyx = −1.1
Calculate the geometric mean of these coefficients and confirm if it matches
the absolute value of the Pearson correlation coefficient.
13. Given the following dataset, perform a simple linear regression analysis:


Hours Studied   Exam Score
1               55
2               60
3               65
4               70
5               75

(a) Compute the regression line equation Ŷ = β0 + β1 X.


(b) Plot the data and the regression line.

14. Discuss the key assumptions of the Classical Simple Linear Regression
(CSLR) model. For each assumption, provide an example of a potential
violation and explain how it could affect the results of the regression
analysis.
15. Using the dataset from Exercise 13, perform the following:

(a) Calculate the OLS estimates for β0 and β1 .


(b) Derive the formula for the OLS estimator and apply it to find the
estimates.

16. Given the regression equation Ŷ = 2 + 3X:

(a) Interpret the meaning of the intercept (β0 ) and slope (β1 ).
(b) Explain what the coefficient values tell you about the relationship
between X and Y .

17. Calculate the estimated error variance and the standard error of the es-
timate for the dataset used in Exercise 13. Show all steps and formulas
used in your calculations.
18. For the regression analysis in Exercise 13, compute the coefficient of de-
termination (R2 ). Interpret the value of R2 in the context of the given
data.
19. Explain the relationship between the coefficient of determination R2 and
the correlation coefficient rxy . If rxy is 0.8, what is R2 and what does it
signify about the regression model?
20. List and discuss the advantages and disadvantages of using R2 as a mea-
sure of goodness-of-fit in regression analysis.
21. Given a multiple regression model with 3 predictors, calculate the adjusted
R2 if the R2 is 0.85, the sample size is 50, and the number of predictors
is 3. Interpret the result.


22. Write a Python script using pandas and statsmodels to perform a simple
linear regression analysis on a dataset of your choice. The script should:

(a) Load the dataset.


(b) Perform the regression analysis.
(c) Output the regression coefficients, standard errors, and R2 value.

23. Given the regression output where Ŷ = 1+2X, construct a 95% confidence
interval for β1 (slope) if the standard error of β1 is 0.5. Also, perform a
hypothesis test to determine if β1 is significantly different from 0.

24. For the regression analysis provided in Exercise 13, perform an ANOVA
F-test to determine if the overall regression model is significant. Explain
your decision rule and interpret the result.
25. Perform a hypothesis test to determine whether β1 is significantly different
from 0 in a regression model where the estimated β1 is 2 and the standard
error of β1 is 0.4. Use a significance level of 0.05.
26. Calculate and interpret the confidence interval for the expected value
E(Y |X = x) using the dataset from Exercise 13.

10.9 Multiple Linear Regression


In a multiple linear regression model, we extend the simple linear regression
model (10.6) to include multiple independent variables. The general form of a
multiple linear regression model is:

yi = β0 + β1 x1i + β2 x2i + . . . + βp xpi + ei (10.12)



where,
• yi is the dependent variable (response variable) for the ith observation.

• x1i , x2i , . . . , xpi are the independent variables for the ith observation.

• β0 is the intercept term.

• β1 , β2 , . . . , βp are the coefficients corresponding to the independent


variables x1i , x2i , . . . , xpi .

• ei is the error term or residual for the ith observation.


10.9.1 Model Assumptions

Under the multiple regression model (10.12), the following assumptions are made.

Assumptions
1. Linearity: The relationship between the dependent variable yi and the independent variables x1i, x2i, . . . , xpi is linear.
2. Independence of Errors: The error terms ei are independent of each other.
3. Homoscedasticity: The variance of the error terms ei is constant for all values of the independent variables.
4. Normality of Errors: The error terms ei are normally distributed with mean zero.
5. No Perfect Multicollinearity: There is no perfect linear relationship among the independent variables.
6. No Autocorrelation: The errors (ei) are not correlated with each other over time or across observations.

These assumptions are essential for valid estimation and interpretation of the classical regression model.

10.9.2 Estimation Procedure

To estimate the parameters β0, β1, . . . , βp in the multiple regression model

yi = β0 + β1 x1i + β2 x2i + . . . + βp xpi + ei

using ordinary least squares (OLS) in matrix notation, we define

\vec{Y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad
X = \begin{pmatrix} 1 & x_{11} & x_{21} & \cdots & x_{p1} \\ 1 & x_{12} & x_{22} & \cdots & x_{p2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{1n} & x_{2n} & \cdots & x_{pn} \end{pmatrix}, \quad
\vec{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}, \quad
\vec{e} = \begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{pmatrix}.

In this case, the model can be written as

\vec{Y} = X\vec{\beta} + \vec{e}

To find the estimated coefficients β̂ of β using ordinary least squares (OLS), we minimize the sum of squared errors. The vector of errors is defined as

\vec{e} = \vec{Y} - X\vec{\beta}

Calculation of Estimated Coefficients

The sum of squared residuals is given by:

SSE = \vec{e}^{\,T}\vec{e} = \left(\vec{Y} - X\vec{\beta}\right)^T \left(\vec{Y} - X\vec{\beta}\right)

To minimize SSE, we take the derivative with respect to β and set it to zero:

\frac{\partial\, SSE}{\partial \vec{\beta}} = -2X^T\left(\vec{Y} - X\vec{\beta}\right) = 0

Solving for β gives the OLS estimator of β:

\hat{\vec{\beta}} = (X^T X)^{-1} X^T \vec{Y}

This equation provides the estimated coefficients β̂ that minimize the sum of squared residuals.
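
A minimal NumPy sketch of this matrix formula; the small design matrix below is made up purely for illustration (it is not data from the text), and np.linalg.solve is used instead of forming the inverse explicitly:

import numpy as np

# Toy design matrix: intercept column plus two predictors
X = np.array([[1.0, 9.0, 16.0],
              [1.0, 13.0, 14.0],
              [1.0, 11.0, 10.0],
              [1.0, 11.0, 8.0],
              [1.0, 14.0, 11.0]])
Y = np.array([10.0, 12.0, 14.0, 16.0, 18.0])

# beta_hat solves (X'X) beta = X'Y, i.e. beta_hat = (X'X)^{-1} X'Y
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
residuals = Y - X @ beta_hat
print(beta_hat)
print(residuals)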

10.9.3 Estimation Procedure of Error Variance

To estimate the error variance σ², we use the following formula:

\hat{\sigma}^2 = \frac{1}{n-p-1}\left(\vec{Y} - X\hat{\vec{\beta}}\right)^T \left(\vec{Y} - X\hat{\vec{\beta}}\right)

where:

• β̂ is the vector of estimated coefficients obtained using ordinary least squares (OLS).

• n is the number of observations.

• p is the number of independent variables (excluding the intercept).

The term (\vec{Y} - X\hat{\vec{\beta}})^T(\vec{Y} - X\hat{\vec{\beta}}) represents the sum of squared residuals, which measures the unexplained variability in the dependent variable after accounting for the effects of the independent variables. Dividing by n − p − 1, the degrees of freedom for the error term, provides an unbiased estimate of the error variance σ².


10.9.4 Mean of the OLS Estimator

To find the mean of β̂:

\hat{\vec{\beta}} = (X^T X)^{-1} X^T \vec{Y}
= (X^T X)^{-1} X^T (X\vec{\beta} + \vec{e})
= (X^T X)^{-1} X^T X \vec{\beta} + (X^T X)^{-1} X^T \vec{e}
= \vec{\beta} + (X^T X)^{-1} X^T \vec{e}

Taking the expectation:

E[\hat{\vec{\beta}}] = \vec{\beta} + (X^T X)^{-1} X^T E[\vec{e}] = \vec{\beta} + (X^T X)^{-1} X^T \vec{0} = \vec{\beta}

So, the mean of the OLS estimator is:

E[\hat{\vec{\beta}}] = \vec{\beta}

10.9.5 Variance of the OLS Estimator

To find the variance of β̂, recall that

\hat{\vec{\beta}} = \vec{\beta} + (X^T X)^{-1} X^T \vec{e}

The covariance matrix of β̂ is given by:

Var(\hat{\vec{\beta}}) = Var\left(\vec{\beta} + (X^T X)^{-1} X^T \vec{e}\right) = Var\left((X^T X)^{-1} X^T \vec{e}\right)

Since \vec{e} \sim N(\vec{0}, \sigma^2 I), we have Var(\vec{e}) = \sigma^2 I. Using the properties of variance:

Var\left((X^T X)^{-1} X^T \vec{e}\right) = (X^T X)^{-1} X^T Var(\vec{e}) X (X^T X)^{-1}
= (X^T X)^{-1} X^T (\sigma^2 I) X (X^T X)^{-1}
= \sigma^2 (X^T X)^{-1} X^T X (X^T X)^{-1}
= \sigma^2 (X^T X)^{-1}

In summary, the mean of the OLS estimator is E[β̂] = β, and its variance is:

Var(\hat{\vec{\beta}}) = \sigma^2 (X^T X)^{-1}

10.9.6 Coefficient of Determination

The coefficient of determination (R²) measures the proportion of the variance in the dependent variable (Y) that is explained by the independent variables (x1, x2, . . . , xp) in the regression model.

R^2 = 1 - \frac{SSE}{SST}

where:

• SSE = (\vec{Y} - X\hat{\vec{\beta}})^T(\vec{Y} - X\hat{\vec{\beta}}) is the sum of squared errors (residuals), representing the unexplained variability in the dependent variable.

• SST = (\vec{Y} - \bar{Y})^T(\vec{Y} - \bar{Y}) is the total sum of squares, representing the total variability in the dependent variable.

Interpretation:

• R² ranges from 0 to 1, where a higher value indicates a better fit of the regression model to the data.

• R² represents the proportion of the variance in the dependent variable that is explained by the independent variables.

• For example, if R² = 0.75, it means that 75% of the variance in the dependent variable is explained by the independent variables.

10.9.7 Adjusted R²

• The adjusted R² is denoted by R̄² and is defined as

\bar{R}^2 = 1 - (1 - R^2)\,\frac{n-1}{n-p-1}

where p is the total number of independent variables in the model (not including the constant term), and n is the sample size.

• It can also be written as:

\bar{R}^2 = 1 - \frac{SSE/df_e}{SST/df_t}

where dft is the degrees of freedom n − 1 of the estimate of the population variance of the dependent variable, and dfe is the degrees of freedom n − p − 1 of the estimate of the underlying population error variance. Hence,

\bar{R}^2 = 1 - \frac{SSE/(n-p-1)}{SST/(n-1)}

• The explanation of this statistic is almost the same as R², but it penalizes the statistic as extra variables are included in the model.

• The term (n−1)/(n−p−1) is called the penalty for using more covariates in a model.

• When the number of covariates p increases, (1 − R²) will decrease, but (n−1)/(n−p−1) will increase.

• Whether more covariates improve the explanatory power of a model depends on the trade-off between R² and the penalty.

Interpretation:

• R̄² penalizes the addition of unnecessary predictors to the model, unlike R².

• It is always less than or equal to R², and it increases only if the new term improves the model more than would be expected by chance.

• Therefore, R̄² is often preferred for comparing the goodness of fit of models with different numbers of predictors.

10.9.8 Example Dataset and Regression Calculations

Consider a dataset with 10 observations and two independent variables (x1 and x2).

i     x1i   x2i   yi
1     9     16    10
2     13    14    12
3     11    10    14
4     11    8     16
5     14    11    18
6     15    17    20
7     16    9     22
8     20    16    24
9     15    12    26
10    15    12    28

To fit a multiple regression model, we use the least squares method to estimate the coefficients β0, β1, and β2. The model equation is:

yi = β0 + β1 x1i + β2 x2i + ei

We calculate the estimated coefficients β̂ using the formula:

\hat{\vec{\beta}} = (X^T X)^{-1} X^T \vec{Y}

where X is the design matrix of independent variables, Y is the vector of the dependent variable, and β̂ is the vector of estimated coefficients.
Calculation of β
DR

 
1 9 16
 
1 13 14
 
 
1 11 10
   
10
 
1 11 8 
   

1 14 11
 12
For the given dataset, we have: X =   ; Y ⃗ =  . .

1 15 17
   .. 
   
1 16 9 
  28
 
1 20 16
 
 
1 15 12
 
1 15 12
⃗b
⃗ , and β.
Now, let’s calculate X T X, (X T X)−1 , X T Y
T
Calculation of X X

460
CHAPTER 10. CORRELATION AND REGRESSION ANALYSIS

 
1 9 16
  
1 1 1 ... 1 1
 13 14 
XT X = 
  
9 13 11 ...  1
15 11 10


. .. .. 
16 14 10 ... 12  .
. . .
1 15 12

 
10 139 125
 
=
139 2019 1757

T

125 1757 1651
⃗b
Calculation of β
Using matrix inversion, we find:
AF
(X T X)−1 =
1
T
det(X X)
adj(X T X) = 

−0.135

3.369 −0.135 −0.112
0.012

−0.003

−0.112 −0.003 0.012

Now, we calculate
 
2.821
⃗b
= (X T X)−1 X T ⃗y = 
 
β  1.591 

−0.475
DR

Calculation of Error Variance Estimation

i       x1i   x2i   yi    ŷi       (yi − ŷi)²
1       9     16    10    9.542    0.209764
2       13    14    12    16.856   23.58074
3       11    10    14    15.573   2.474329
4       11    8     16    16.523   0.273529
5       14    11    18    19.871   3.500641
6       15    17    20    18.613   1.923769
7       16    9     22    24.003   4.012009
8       20    16    24    27.043   9.259849
9       15    12    26    20.988   25.12014
10      15    12    28    20.988   49.16814
Total   139   125   190   190      119.5229
• The error variance estimate, σ̂², is calculated as:

\hat{\sigma}^2 = \frac{1}{n-p-1}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \frac{119.5229}{10-2-1} = 17.0747

where n is the number of observations, p is the number of predictors, yi is the observed value, and ŷi is the predicted value.

• Coefficient of determination:

R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{119.522}{330} = 0.6378

\bar{R}^2 = 1 - \frac{(1 - R^2)(n-1)}{n-p-1} = 1 - \frac{(1 - 0.6378)(10-1)}{10-2-1} = 0.5343

Goodness-of-fit R² and Adjusted R²

• The coefficient of determination R² = 0.6378 suggests that approximately 63.78% of the variance in the dependent variable (y) can be explained by the independent variables x1 and x2 included in the model.

• The Adjusted R² value takes into account the number of predictors and the sample size, providing a more conservative estimate of the model's goodness-of-fit. In this case, R̄² = 0.5343, indicating that approximately 53.43% of the variance in the dependent variable (y) is explained by the independent variables x1 and x2 after adjusting for the number of predictors and the sample size.
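
The hand calculations above can be verified with NumPy; a minimal sketch using the data from Section 10.9.8:

import numpy as np

x1 = np.array([9, 13, 11, 11, 14, 15, 16, 20, 15, 15])
x2 = np.array([16, 14, 10, 8, 11, 17, 9, 16, 12, 12])
y = np.array([10, 12, 14, 16, 18, 20, 22, 24, 26, 28])

X = np.column_stack([np.ones_like(x1), x1, x2]).astype(float)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # approx (2.821, 1.591, -0.475)

resid = y - X @ beta_hat
n, p = len(y), 2
sigma2_hat = resid @ resid / (n - p - 1)       # approx 17.07

sst = np.sum((y - y.mean()) ** 2)              # 330
r2 = 1 - (resid @ resid) / sst                 # approx 0.6378
r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)  # approx 0.5343
print(beta_hat, sigma2_hat, r2, r2_adj)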


10.9.9 F-test in Multiple Regression

• The F-test in multiple regression assesses the overall significance of the regression model.

• It tests whether at least one of the independent variables has a non-zero coefficient.

• The null hypothesis H0 for the F-test is:

H0 : β1 = β2 = · · · = βp = 0

where β1, β2, . . . , βp are the coefficients of the independent variables.
10.9.10 ANOVA Table in Regression Analysis

• The ANOVA table in multiple regression assesses the overall significance of the regression model.

• It partitions the total variance in the dependent variable into explained variance and unexplained variance.

• The table includes sums of squares (SS), degrees of freedom (df), mean squares (MS), and the F-test statistic.

Source of Variation   SS    df          MS                      F
Regression            SSR   p           MSR = SSR/p             F = MSR/MSE
Residual (Error)      SSE   n − p − 1   MSE = SSE/(n − p − 1)
Total                 SST   n − 1

• Reject the null hypothesis H0 if the calculated F-statistic is greater than the critical value from the F-distribution.

Decision Rule for ANOVA F-test

Classical Approach

• Set the significance level α.

• Calculate the critical value Fcritical from the F-distribution with appropriate degrees of freedom.

• Decision Rule:
  ■ If calculated F > Fcritical, reject H0 and conclude that the regression model is statistically significant.
  ■ If calculated F ≤ Fcritical, fail to reject H0 and conclude that the regression model is not statistically significant.

p-value Approach

• Calculate the p-value associated with the calculated F-statistic.

• Decision Rule:
  ■ If p-value < α, reject H0 and conclude that the regression model is statistically significant.
  ■ If p-value ≥ α, fail to reject H0 and conclude that the regression model is not statistically significant.

Example: ANOVA Table

Source       df   SS         MS         F        Significance F
Regression   2    210.4651   105.2326   6.1625   0.0286
Residual     7    119.5349   17.0764
Total        9    330
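
Assuming the same dataset from Section 10.9.8, the entries of this ANOVA table can be read directly off a statsmodels fit; a minimal sketch:

import pandas as pd
import statsmodels.api as sm

X = pd.DataFrame({'x1': [9, 13, 11, 11, 14, 15, 16, 20, 15, 15],
                  'x2': [16, 14, 10, 8, 11, 17, 9, 16, 12, 12]})
y = [10, 12, 14, 16, 18, 20, 22, 24, 26, 28]

model = sm.OLS(y, sm.add_constant(X)).fit()

# Sums of squares and the overall F-test, as reported by statsmodels
print(model.ess, model.ssr)          # explained SS (~210.47) and residual SS (~119.53)
print(model.fvalue, model.f_pvalue)  # F is about 6.16 with p about 0.029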

10.9.11 The t-tests in Multiple Regression

• The t-tests in multiple regression assess the significance of individual coefficients (parameters) in the model.

• Each t-test tests the null hypothesis that the corresponding coefficient is zero.

• The t-test statistic for the coefficient β̂i is calculated as:

t = \frac{\hat{\beta}_i}{se(\hat{\beta}_i)}

where β̂i is the estimated coefficient, and se(β̂i) is its standard error.

Decision Rule for t-tests in Multiple Regression

Classical Approach

• Set the significance level α.

• Calculate the critical value tcritical from the t-distribution with appropriate degrees of freedom.

• Decision Rule for each β̂i:
  ■ If |t| > tcritical, reject H0 and conclude that the corresponding coefficient β̂i is statistically significant.
  ■ If |t| ≤ tcritical, fail to reject H0 and conclude that the corresponding coefficient β̂i is not statistically significant.

p-value Approach

• Calculate the p-value associated with each calculated t-statistic.

• Decision Rule for each β̂i:
  ■ If p-value < α, reject H0 and conclude that the corresponding coefficient β̂i is statistically significant.
  ■ If p-value ≥ α, fail to reject H0 and conclude that the corresponding coefficient β̂i is not statistically significant.

Problem 10.8. Consider a dataset containing information about the quarterly sales of a restaurant chain, the student population near the restaurant, and the advertising budget. The following data has been collected:

Restaurant   Student Population (1000s)   Advertising Budget ($1000s)   Quarterly Sales ($1000s) (y)
1            2                            5                             58
2            6                            12                            105
3            8                            10                            88
4            8                            15                            118
5            12                           20                            117
6            16                           22                            137
7            20                           25                            157
8            20                           26                            169
9            22                           24                            149
10           26                           30                            202

(i) Using the above data, formulate a multiple linear regression model where
the quarterly sales (y) is the dependent variable, and both the student
population (x1 ) and the advertising budget (x2 ) are independent variables.
(ii) Write down the assumptions of the multiple linear regression model.


(iii) Estimate the regression coefficients for the model formulated in part (i) using the least squares method. Interpret the coefficients for both the student population and advertising budget.
(iv) Based on the regression model obtained in part (iii), what would be the estimated quarterly sales for a restaurant located near a campus with 16,000 students and an advertising budget of $20,000?
(v) Calculate the standard error of the estimate for the model obtained in part (iii).
(vi) How would you evaluate the goodness-of-fit for the regression model? What statistical metrics would you consider, and why?
(vii) Create scatter plots to show the relationship between:
• Student population and quarterly sales.
• Advertising budget and quarterly sales.
Overlay the regression lines on these plots and comment on the observed relationships.
(viii) If a new restaurant is to be established near a campus with a student population of 18,000 and an advertising budget of $25,000, use the regression model from part (iii) to predict the quarterly sales.
(ix) Discuss the limitations of using the linear regression model in this context.
What are some factors not considered by the model that could affect the
accuracy of your predictions?
(x) Test the significance of each regression coefficient at the 5% significance level. Clearly state the null and alternative hypotheses, the test statistic, and your conclusion.
(xi) Construct a 95% confidence interval for the coefficients of the student
population and advertising budget. Interpret the intervals.

Solution
(i).
The multiple linear regression model can be written as:

y = β0 + β1 x 1 + β2 x 2 + e

where:
• y is the quarterly sales in $1000s.

• x1 is the student population in 1000s.


• x2 is the advertising budget in $1000s.

• β0 , β1 , and β2 are the coefficients to be estimated.

• e is the error term.

(ii).
The assumptions of the multiple linear regression model are:
• Linearity: The relationship between the independent variables and the
dependent variable is linear.

• Independence: The observations are independent of each other.

• Homoscedasticity: The variance of the error terms is constant.

• Normality: The error terms are normally distributed.

• No multicollinearity: The independent variables are not highly corre-


AF
(iii).
lated with each other.

After fitting the regression model to the data, the estimated coefficients are:
ŷ = β̂0 + β̂1 x1 + β̂2 x2 = 38.62 + 0.97 x1 + 4.11 x2
Interpretation:

• The intercept β̂0 = 38.62 suggests that when both the student population and advertising budget are zero, the estimated quarterly sales would be $38,620. However, this may not have a practical interpretation.

• The coefficient β̂1 = 0.97 suggests that for each additional 1000 students, quarterly sales increase by $970, holding the advertising budget constant.

• The coefficient β̂2 = 4.11 suggests that for each additional $1000 spent on advertising, quarterly sales increase by $4110, holding the student population constant.

(iv).
Substituting x1 = 16 and x2 = 20 into the estimated regression equation:

ŷ = 38.62 + 0.97(16) + 4.11(20) = 136.34

Therefore, the estimated quarterly sales would be approximately $136,340.


(v).
The standard error of the estimate is calculated as:

\text{Standard Error} = \sqrt{\frac{\sum_i (y_i - \hat{y}_i)^2}{n-p}}

where n is the number of observations and p is the number of estimated parameters, including the intercept. From the regression model, the standard error of the estimate is:

Standard Error = 12.44

T
(vi).
The goodness-of-fit of the regression model can be evaluated using the following
metrics:
• R2 = 0.944: Represents the proportion of variance in the dependent
variable that is explained by the independent variables. A higher R2
AF •
indicates a better fit.

Adjusted R2 = 0.928: Adjusts the R2 value based on the number of


predictors, providing a more accurate measure in models with multiple
predictors.

• F -statistic: F = 59.36, with a p-value of 0.00004, tests the overall


significance of the regression model. A significant F -statistic indicates
that the model explains a significant portion of the variance in the
dependent variable.

• Residual plots: Used to check for homoscedasticity and the normality


DR

of the residuals.


(vii).
[Scatter plots of student population vs. quarterly sales and of advertising budget vs. quarterly sales, with the fitted regression lines overlaid.]

• The scatter plot of student population vs. quarterly sales shows a positive linear relationship, indicating that as the student population increases, quarterly sales also increase.

• The scatter plot of advertising budget vs. quarterly sales also shows a positive linear relationship, indicating that an increase in the advertising budget leads to higher quarterly sales.

(viii).
Substituting x1 = 18 and x2 = 25 into the estimated regression equation:

ŷ = 38.62 + 0.97(18) + 4.11(25) = 158.83

The predicted quarterly sales would be approximately $158,830.

(ix).
Limitations of the linear regression model in this context include:
• The model assumes a linear relationship between the variables, which
may not always hold true.

• It does not account for potential interactions between the student pop-
ulation and advertising budget.

• The model does not consider external factors like economic conditions,
competition, or seasonal variations that could impact sales.

• The model assumes that the relationships are constant over time, which
might not be the case in reality.


(x).
For each coefficient:

• Null Hypothesis (H0): The coefficient is equal to zero (βi = 0).

• Alternative Hypothesis (H1): The coefficient is not equal to zero (βi ≠ 0).

Test Statistics (from the regression output):

• For the intercept (β0): t = 3.229, p = 0.014.

• For student population (β1): t = 0.534, p = 0.610.

• For advertising budget (β2): t = 2.287, p = 0.056.

Conclusion:

• The intercept is statistically significant (p < 0.05).

• The coefficient for student population is not statistically significant (p > 0.05), so we fail to reject the null hypothesis.

• The coefficient for the advertising budget is marginally significant at the 5% level (p ≈ 0.056), close to the threshold, so we might consider it as not significant or marginally significant depending on the context.

(xi).
Using the regression output, the 95% confidence intervals are:

• For student population (β1): −3.337 ≤ β1 ≤ 5.283

• For advertising budget (β2): −0.140 ≤ β2 ≤ 8.369

Interpretation:

• For student population: We are 95% confident that the true coefficient lies within the interval [−3.337, 5.283]. Since this interval includes zero, it suggests that the effect of student population on sales may not be significant.

• For advertising budget: We are 95% confident that the true coefficient lies within the interval [−0.140, 8.369]. Since this interval also includes zero, it suggests that the effect of advertising budget on sales may not be significant.


10.9.12 Python Code: Multiple Linear Regression Model

Consider the Problem 10.8. Suppose we have data on quarterly sales (y), student population (x1), and advertising budget (x2). The Python code to solve this problem is the following.

import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Example Data: Student Population, Advertising Budget, and Quarterly Sales
data = {
    'Student Population (1000s)': [2, 6, 8, 8, 12, 16, 20, 20, 22, 26],
    'Advertising Budget ($1000s)': [5, 12, 10, 15, 20, 22, 25, 26, 24, 30],
    'Quarterly Sales ($1000s)': [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]
}

# Convert the data to a DataFrame
df = pd.DataFrame(data)

# Define the independent variables (X) and dependent variable (y)
X = df[['Student Population (1000s)', 'Advertising Budget ($1000s)']]
y = df['Quarterly Sales ($1000s)']

# Add a constant to the independent variables matrix
X = sm.add_constant(X)

# Fit the multiple linear regression model
model = sm.OLS(y, X).fit()

# Print the summary of the regression model
print(model.summary())

# (Optional) Scatter plots to visualize the relationship between each
# independent variable and the dependent variable
plt.figure(figsize=(12, 6))

# Scatter plot for Student Population vs Quarterly Sales
plt.subplot(1, 2, 1)
plt.scatter(df['Student Population (1000s)'], y, color='blue')
plt.plot(df['Student Population (1000s)'],
         model.params[0]
         + model.params[1] * df['Student Population (1000s)']
         + model.params[2] * df['Advertising Budget ($1000s)'].mean(),
         color='red')
plt.title('Quarterly Sales vs. Student Population')
plt.xlabel('Student Population (1000s)')
plt.ylabel('Quarterly Sales ($1000s)')
plt.grid(True)

# Scatter plot for Advertising Budget vs Quarterly Sales
plt.subplot(1, 2, 2)
plt.scatter(df['Advertising Budget ($1000s)'], y, color='green')
plt.plot(df['Advertising Budget ($1000s)'],
         model.params[0]
         + model.params[1] * df['Student Population (1000s)'].mean()
         + model.params[2] * df['Advertising Budget ($1000s)'],
         color='red')
plt.title('Quarterly Sales vs. Advertising Budget')
plt.xlabel('Advertising Budget ($1000s)')
plt.ylabel('Quarterly Sales ($1000s)')
plt.grid(True)

plt.tight_layout()
plt.show()

# Prediction example: Predict quarterly sales for a restaurant with a student
# population of 16,000 and an advertising budget of $20,000
new_data = pd.DataFrame({'const': [1],
                         'Student Population (1000s)': [16],
                         'Advertising Budget ($1000s)': [20]})
predicted_sales = model.predict(new_data)[0]
print(f'Predicted Quarterly Sales: ${predicted_sales:.2f} (in $1000s)')

# Standard Error of the Estimate
standard_error = np.sqrt(model.mse_resid)
print(f'Standard Error of the Estimate: {standard_error:.2f}')


Figure 10.12: Python output for the practice example.

10.9.13 Exercises
1. Explain the difference between simple linear regression and multiple linear
regression. How does the addition of more predictors in multiple linear
regression improve the model?
2. What assumptions must be met for multiple linear regression analysis to
be valid? Explain why each assumption is important.
3. What is R2 in the context of multiple linear regression? How is it different

from adjusted R2 , and why might adjusted R2 be preferred in some cases?

4. The following summary results are for a multiple regression model:

Table 10.7: Model Summary Statistics

Statistic Value
R-squared 0.76
Adjusted R-squared 0.74
F -statistic 38.42
p-value (F -statistic) 0.0001

473
CHAPTER 10. CORRELATION AND REGRESSION ANALYSIS

Table 10.8: Regression Coefficients

Variable Estimate Standard Error p-value


Intercept 2.50 0.65 0.001
X1 0.75 0.10 0.000
X2 -0.55 0.15 0.102
X2 1.20 0.20 0.000

(i). What is the regression model corresponding to the summary results given in Table 10.8? Describe the assumptions of multiple regression analysis. What diagnostic plots can be used to check these assumptions?
(ii). Explain the R-squared value given in Table 10.7, which indicates
the model’s performance. How does the adjusted R-squared value
improve on this interpretation?
(iii). Given the model summary in Table 10.8, write the estimated model and
(a) Interpret the coefficient of X1 .
(b) What does the intercept term represent in this context?
(iv). Based on the p-values of the coefficients in Table 10.8, identify which
predictors are statistically significant at the 5% significance level and
explain why.
(v). Calculate the predicted value of Y when X1 = 4, X2 = 2, and
X3 = 3. Show all steps in your calculation.

(vi). The F -statistic and its corresponding p-value are given in the model
summary. Explain what these values indicate about the overall
model.

5. Describe how hypothesis testing is used to determine the significance of the regression coefficients in a multiple linear regression model. What is the null hypothesis for each coefficient, and how is it tested?
6. What is the purpose of the F -test in multiple linear regression? How does
it differ from t-tests for individual coefficients?

10.10 Regression Model Diagnostics


Regression model diagnostics are critical for ensuring that the assumptions
underlying a regression model are met. These diagnostics help identify issues
such as non-linearity, heteroscedasticity, outliers, and multicollinearity, which
can affect the validity and interpretability of the model.


10.10.1 Assumptions of Linear Regression

The linear regression model is based on several key assumptions that must be checked to ensure the model is appropriate for the data:

• Linearity: The relationship between the independent variables and the dependent variable should be linear.

• Independence: Observations should be independent of each other.

• Homoscedasticity: The variance of the residuals should be constant across all levels of the independent variables.

• Normality of Residuals: The residuals (errors) of the model should be approximately normally distributed.

• No Multicollinearity: Independent variables should not be highly correlated with each other.
AF
10.10.2 Residual Plots

Residuals of the model given in Equation (10.6) are defined as

\hat{e}_i = y_i - \hat{y}_i \quad \text{for } i = 1, 2, \ldots, n.

These residuals can be regarded as the 'observed' errors. They are not the same as the unknown true errors

e_i = y_i - E(y_i) \quad \text{for } i = 1, 2, \ldots, n.

If the model is appropriate for the data at hand, the residuals should reflect the properties assumed for ei (i.e., independence, normality, zero mean, and constant variance). For diagnostic purposes, we sometimes use the semistudentized residual, which is defined as

\hat{e}_i^* = \frac{\hat{e}_i - \bar{\hat{e}}}{\sqrt{MSE}} = \frac{\hat{e}_i}{\sqrt{MSE}},    (10.13)

where

MSE = \sum_{i=1}^{n} \frac{(\hat{e}_i - \bar{\hat{e}})^2}{n-k} = \sum_{i=1}^{n} \frac{\hat{e}_i^2}{n-k} = s^2

and k = p + 1.

If √MSE were an estimate of the standard deviation of êi, then êi* would be the studentized residual. However, note that the standard deviation of êi is not equal to √MSE; it varies for each êi. The estimator √MSE is only an approximation of the standard deviation of êi; hence, we refer to êi* as a semistudentized residual.

Residual plots are a crucial tool for diagnosing issues in a regression model. A residual plot is a scatterplot of the residuals (errors) on the y-axis and the predicted values or one of the independent variables on the x-axis. The following plots of residuals (or semistudentized residuals) will be utilized here for this purpose:
1. Plot of residuals against the predictor variable.
2. Plot of absolute or squared residuals against the predictor variable.
3. Plot of residuals against fitted values.
4. Plot of residuals against time or other ordered sequence.
5. Plots of residuals against omitted predictor variables.
6. Box plot of residuals.
7. Normal probability plot of residuals.
A few questions to consider when you analyze the plots:

1. Do the residuals follow any pattern indicating nonlinearity?
2. Are there any outliers?
3. Does the assumption of constant variance look correct?
4. Label any qualitative variables on the plot. Any patterns?

• Detecting Non-linearity: If the residuals show a systematic pattern (e.g., curvature), this indicates that the relationship between the independent and dependent variables may not be linear.

• Identifying Heteroscedasticity: If the residuals display a funnel shape (widening or narrowing), this suggests heteroscedasticity, meaning the variance of the errors is not constant.
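
A minimal plotting sketch for three of these diagnostics (residuals versus fitted values, semistudentized residuals versus x, and a normal Q-Q plot), using the Shorshe Ilish data from Problem 10.7 purely for illustration:

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

x = np.array([2, 6, 8, 8, 12, 16, 20, 20, 22, 26])
y = np.array([58, 105, 88, 118, 117, 137, 157, 169, 149, 202])
model = sm.OLS(y, sm.add_constant(x)).fit()

resid = model.resid
semistud = resid / np.sqrt(model.mse_resid)   # semistudentized residuals, Eq. (10.13)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

axes[0].scatter(model.fittedvalues, resid)    # residuals vs fitted values
axes[0].axhline(0, color='red')
axes[0].set_title('Residuals vs. Fitted')

axes[1].scatter(x, semistud)                  # semistudentized residuals vs x
axes[1].axhline(0, color='red')
axes[1].set_title('Semistudentized Residuals vs. x')

sm.qqplot(resid, line='s', ax=axes[2])        # normal Q-Q plot of the residuals
axes[2].set_title('Normal Q-Q Plot')

plt.tight_layout()
plt.show()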

Univariate Plots of x and y

Univariate plots (see Figure 10.13(a)) of x and y are essential tools for exploring the individual distributions of the predictor variable x and the response variable y in a regression analysis. These plots, such as histograms, box plots, or density plots, allow us to visualize the central tendency, spread, and shape of each variable's distribution. By examining these plots, we can identify characteristics such as skewness, kurtosis, outliers, or gaps in the data, which may influence the relationship between x and y. Understanding the univariate distributions of x and y helps in choosing appropriate transformations or modeling approaches, and provides a foundation for interpreting the results of more complex analyses, such as regression or correlation.
Figure 10.13: Scatter Plot and Residual Plot for Understanding a Nonlinear Regression Function
Plot of the Residuals versus x

A plot of the residuals versus x (see Figure 10.13(b)) is more effective in detecting nonlinearity than a scatter plot (e.g., nonlinearity may be more prominent on a residual plot than on a scatter plot). It can also indicate other forms of model departure besides nonlinearity (e.g., non-constancy of variance); see Figure 10.14. Any observable pattern in the plot of the residuals versus x indicates a problem with the model assumptions!
Figure 10.14: Prototype Residual Plots.
Plot of the Residuals versus ŷ

For simple linear regression (with a single predictor), the residuals versus ŷ plot contains the same information as the plot of residuals versus x, but on a different scale. For multiple linear regression, this plot allows us to examine patterns in the residuals as ŷ increases. Ideally, there should be no systematic patterns.
Plot of the Residuals versus Time

The plot of residuals versus time (see Figure 10.15), also known as a sequence plot, is only meaningful when data are collected in a time sequence or some other type of ordered sequence. In this plot, any discernible pattern suggests a lack of independence among the residuals, indicating potential issues with the model.

Figure 10.15: Residual Time Sequence Plots Illustrating Nonindependence of Error Terms.

Plot of the Semistudentized Residuals versus X

Semistudentized residuals are defined in Equation (10.13). The plot shown in Figure 10.16 is particularly useful for identifying outliers in the data. By using the semistudentized version of the residual, it becomes easier to detect potential outliers, as the residuals are scaled relative to the variability of the data. Cases where êi* falls outside the range (−3, 3) can be considered outliers, indicating observations that deviate significantly from the fitted model.

Figure 10.16: Residual Plot with Outlier.

Plot of the Residuals in a Normal Quantile Plot

A Q-Q plot, also known as a quantile-quantile plot (see Figure 10.17), is used to assess the normality of the residuals in a regression model. In this plot, the ordered residuals are plotted against their corresponding quantiles from a normal distribution. If the residuals are normally distributed, the points should lie approximately along a straight line. Deviations from this line, especially in the tails of the distribution, indicate non-normality in the residuals. This can signal potential issues with the model, such as the presence of outliers, skewness, or other departures from the assumptions of normality.

Figure 10.17: Normal Probability Plots when Error Term Distribution Is Not Normal.
AF
Added-Variable plot
An Added-Variable plot, also known as a partial regression plot, is a diagnos-
tic tool used in multiple linear regression to assess the relationship between a
specific predictor variable and the response variable, while accounting for the
influence of other predictors in the model. The Added-Variable plot is created
by plotting the residuals from regressing the response variable on the other
predictors against the residuals from regressing the specific predictor on the
same set of predictors. This helps in identifying the unique contribution of the
predictor of interest after removing the effects of other variables in the model.

In some cases, an Added-Variable plot may also involve plotting residuals against potential predictor variables that are not included in the model but
could have important effects on the response. Any distinctive pattern(s) in such
a plot may indicate that the model could be improved by adding the omitted
predictor variable(s). If the plot shows a clear linear trend, it suggests that
the predictor has a significant linear relationship with the response variable.
However, deviations from linearity or a lack of pattern may indicate that the
predictor does not contribute significantly to the model, or that a more complex,
nonlinear relationship exists.
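statsmodels can draw added-variable (partial regression) plots via plot_partregress_grid; the sketch below applies it to a hypothetical two-predictor dataset:

import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
x1 = rng.normal(size=60)
x2 = 0.5 * x1 + rng.normal(size=60)       # correlated with x1
y = 1 + 2 * x1 - x2 + rng.normal(size=60)

X = sm.add_constant(np.column_stack([x1, x2]))
results = sm.OLS(y, X).fit()

# One panel per predictor: residuals of y given the other predictors
# versus residuals of that predictor given the other predictors.
fig = sm.graphics.plot_partregress_grid(results, exog_idx=[1, 2])
fig.tight_layout()
plt.show()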

10.10.3 Formal Tests


Graphical analysis can often be subjective, especially when patterns are not
very distinctive. In such cases, formal tests can be considered to assess potential
violations of the model assumptions.
• Tests for Normality

• Test for Autocorrelation

• Tests for Non-Constancy of Variance

  ■ Breusch-Pagan Test

• Outlier Identification

• Lack-of-Fit Test

Tests for Normality

The following goodness-of-fit tests can be used to examine the normality of the error terms:

• Chi-Square test

• Shapiro-Wilk test

• Kolmogorov-Smirnov test

• Lilliefors test

Shapiro-Wilk test for Normality


The Shapiro-Wilk test is a statistical test used to assess the normality of a
dataset. Specifically, it tests the null hypothesis that a sample comes from a
normally distributed population. The test calculates a W statistic, which mea-
sures how well the data conform to a normal distribution. A W value close to 1
indicates that the data are approximately normally distributed, while a W value
significantly less than 1 suggests a departure from normality. The associated p-value determines whether the null hypothesis can be rejected; a small p-value
(typically less than 0.05) indicates that the data significantly deviate from nor-
mality. The Shapiro-Wilk test is particularly powerful for detecting departures
from normality in small to moderately sized samples, making it a widely used
tool in statistical analysis. Using the Toluca Company dataset introduced next, the Python code for the Shapiro-Wilk test is presented below.

Example: Toluca Company dataset


The Toluca Company, which manufactures refrigeration equipment and replace-
ment parts, aimed to determine the optimal lot size for a specific replacement
part during a cost improvement program. Production involves setup, machin-
ing, and assembly, with labor hours varying by lot size. To establish the rela-
tionship between lot size and labor hours, data from 25 production runs were
collected over a stable six-month period. This data, shown in Table 10.9, records
lot sizes and corresponding work hours. Lot sizes are multiples of 10, as per


company policy. The conditions during this period were expected to remain
consistent for the next three years.

Table 10.9: Toluca Company Example.

Lot Size Work Hrs Lot Size Work Hrs


80 399 30 121
50 221 90 376
70 361 60 224
120 546 80 352
100 353 50 157
40 160 70 252

90 389 20 113
110 435 100 420
30 212 50 268
AF 90 377 110 421
30 273 90 468
40 244 80 342
70 323

We perform the Shapiro-Wilk test for normality using Python as follows.
Python Code: Shapiro-Wilk test for Normality

# Shapiro-Wilk test for Normality
import numpy as np
import statsmodels.api as sm
from scipy.stats import shapiro

# Define the data
LotSize = np.array([80, 30, 50, 90, 70, 60, 120, 80, 100, 50, 40, 70, 90,
                    20, 110, 100, 30, 50, 90, 110, 30, 90, 40, 80, 70])
WorkHrs = np.array([399, 121, 221, 376, 361, 224, 546, 352, 353, 157, 160,
                    252, 389, 113, 435, 420, 212, 268, 377, 421, 273, 468,
                    244, 342, 323])

# Fit the regression model
X = sm.add_constant(LotSize)  # add a constant term to the independent variable
model = sm.OLS(WorkHrs, X).fit()

# Obtain residuals
residuals = model.resid

# Perform Shapiro-Wilk test for normality
statistic, p_value = shapiro(residuals)

# Print the test results
print("Shapiro-Wilk Test Statistic:", statistic)
print("p-value:", p_value)

# Interpret the results
alpha = 0.05
if p_value > alpha:
    print("Residuals look Gaussian (fail to reject H0)")
else:
    print("Residuals do not look Gaussian (reject H0)")

10.10.4 Test for Autocorrelation


Tests for autocorrelation are usually meaningful only for data collected in a time series or in some other ordered sequence. In cross-sectional studies, data collection should be designed so that observations are independent. A runs test is frequently used to test for lack of randomness in the residuals arranged in time order. Another test, specifically designed for detecting lack of randomness in least squares residuals, is the Durbin-Watson test.

Durbin-Watson Test
The model is:
$$y_i = \beta_0 + \beta_1 x_i + e_i$$
where
$$e_i = \rho e_{i-1} + u_i, \qquad |\rho| < 1,$$
and the $u_i \sim N(0, \sigma^2)$ are independent.

Hypotheses:
$$H_0: \rho = 0 \quad \text{versus} \quad H_A: \rho > 0$$

Statistic:
$$D = \frac{\sum_{i=2}^{n} (e_i - e_{i-1})^2}{\sum_{i=1}^{n} e_i^2}$$

Decision Rule:
$$D > d_U \;\Rightarrow\; \text{do not reject } H_0$$
$$D < d_L \;\Rightarrow\; \text{reject } H_0$$
$$d_L \le D \le d_U \;\Rightarrow\; \text{inconclusive}$$

Values for $d_L$ and $d_U$ can be found in Table B.7 in Kutner et al.


Python Code: Durbin-Watson Test


To conduct the Durbin-Watson test in Python, we consider the Toluca Company
example dataset given in Section 10.10.3. The Python code is provided in the
following.
# Durbin-Watson Test
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

# Define the data
LotSize = np.array([80, 30, 50, 90, 70, 60, 120, 80, 100, 50, 40, 70, 90,
                    20, 110, 100, 30, 50, 90, 110, 30, 90, 40, 80, 70])
WorkHrs = np.array([399, 121, 221, 376, 361, 224, 546, 352, 353, 157, 160,
                    252, 389, 113, 435, 420, 212, 268, 377, 421, 273, 468,
                    244, 342, 323])

# Fit the regression model (with an intercept, as in the earlier listings)
X = sm.add_constant(LotSize)
model = sm.OLS(WorkHrs, X).fit()
print(model.summary())

# Perform the Durbin-Watson test
durbin_watson_statistic = durbin_watson(model.resid)
print("Durbin-Watson statistic:", durbin_watson_statistic)
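As a check, D can also be computed directly from its definition; this short sketch reuses the model object fitted in the listing above.

import numpy as np

e = np.asarray(model.resid)
# D = sum_{i=2}^{n} (e_i - e_{i-1})^2 / sum_{i=1}^{n} e_i^2
D = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
print("D computed from the definition:", D)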

10.10.5 Tests for Non-Constancy of Variance


Tests for non-constancy of variance, also known as tests for heteroscedasticity, assess whether the variance of residuals from a regression model is constant across levels of the predictor variables. One common test is the Breusch-Pagan test. The procedure for this test is as follows:

1. Fit the regression model and obtain the residuals ebi .


2. Regress the squared residuals eb2i on the predictor variables x.
3. Obtain the test statistic from the resulting regression. For the Breusch-
Pagan test, this statistic follows a chi-squared distribution with degrees
of freedom equal to the number of predictor variables.
4. Compare the test statistic to the chi-squared distribution to determine
the p-value. A small p-value indicates that the null hypothesis of constant
variance is rejected, suggesting the presence of heteroscedasticity.

Breusch-Pagan Test
• Requires ei to be independent and normally distributed.


• Requires large sample size.

• Can detect relationships such as

$$\log \sigma_i^2 = \gamma_0 + \gamma_1 x_i$$

Regress the squared residuals, $e_i^2$, against $x_i$ and obtain $SSR^*$ from this regression.

Hypotheses:
$$H_0: \gamma_1 = 0 \quad \text{versus} \quad H_A: \gamma_1 > 0$$

Statistic:
$$\chi^2_{BP} = \frac{SSR^*/2}{\left(SSE/n\right)^2} \qquad (10.14)$$

where $SSR^*$ is the regression sum of squares when regressing $\hat{e}^2$ on $x$, and $SSE$ is the error sum of squares when regressing $Y$ on $x$. Under $H_0$, the test statistic follows a $\chi^2$-distribution with degrees of freedom equal to the number of predictor variables.

Alternatively, the White test can be used, which is robust to various forms
of heteroscedasticity and follows a similar procedure but does not assume a
specific functional form of the variance.

Example: Toluca Company dataset


To conduct the Breusch-Pagan test for the Toluca Company example given in Section 10.10.3, we regress the squared residuals in Table 10.10, column 5, against $x$ and obtain $SSR^* = 7{,}896{,}128$. From the fitted regression we have $SSE = 54{,}825$. Hence, test statistic (10.14) is:
$$\chi^2_{BP} = \frac{7{,}896{,}128/2}{\left(54{,}825/25\right)^2} = 0.821$$


Table 10.10: Data on Lot Size and Work Hours

Run | Lot Size (x_i) | Work Hours (y_i) | Mean Response (ŷ_i) | Residual (e_i) | Squared Residual (e_i²)

1 80 399 347.98 51.02 2,603.0
2 30 121 169.47 -48.47 2,349.3
3 50 221 240.88 -19.88 395.2
... ... ... ... ... ...
23 40 244 205.17 38.83 1,507.8
24 80 342 347.98 -5.98 35.8
25 70 323 312.28 10.72 114.9

Total 1,750 7,807 7,807 0 54,825

To control the $\alpha$ risk at 0.05, we require $\chi^2(0.95; 1) = 3.84$. Since $\chi^2_{BP} = 0.821 < 3.84$, we conclude $H_0$, that the error variance is constant. The p-value of this test is 0.36, so the data are quite consistent with constancy of the error variance.
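The calculation above is easy to reproduce programmatically. The following sketch computes the statistic of Equation (10.14) directly from the two regressions, assuming the LotSize and WorkHrs arrays defined in the earlier listings; it should return approximately 0.821. Note that the library function het_breuschpagan used in the next listing reports the Lagrange-multiplier form of the test, so its value need not coincide numerically with Equation (10.14).

import numpy as np
import statsmodels.api as sm

X = sm.add_constant(LotSize)
fit = sm.OLS(WorkHrs, X).fit()

aux = sm.OLS(fit.resid ** 2, X).fit()  # regress squared residuals on x
SSR_star = aux.ess                     # regression sum of squares, SSR*
SSE = fit.ssr                          # error sum of squares of the original fit
n = len(WorkHrs)

chi2_bp = (SSR_star / 2) / (SSE / n) ** 2
print("chi2_BP =", round(chi2_bp, 3))  # approximately 0.821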

Python Code: Breusch-Pagan Test


To conduct the Breusch-Pagan test in Python, we consider the Toluca Company example dataset given in Section 10.10.3. The Python code is provided in the following.
# Breusch-Pagan Test
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Define the data
LotSize = np.array([80, 30, 50, 90, 70, 60, 120, 80, 100, 50, 40, 70, 90,
                    20, 110, 100, 30, 50, 90, 110, 30, 90, 40, 80, 70])
WorkHrs = np.array([399, 121, 221, 376, 361, 224, 546, 352, 353, 157, 160,
                    252, 389, 113, 435, 420, 212, 268, 377, 421, 273, 468,
                    244, 342, 323])

# Fit the regression model
X = sm.add_constant(LotSize)  # add a constant term to the independent variable
model = sm.OLS(WorkHrs, X).fit()
print(model.summary())

# Perform the Breusch-Pagan test
test_results = het_breuschpagan(model.resid, model.model.exog)

# Extract results
bp_test_statistic = test_results[0]
bp_test_p_value = test_results[1]

print(f'Breusch-Pagan test statistic: {bp_test_statistic}')
print(f'Breusch-Pagan p-value: {bp_test_p_value}')

Brown-Forsythe (Modified Levene) Test


The Brown-Forsythe test, also known as the Modified Levene’s test, is used
to assess the equality of variances across groups. It is a robust alternative to
Levene's test and is particularly useful when the data may not be normally
distributed.

The Brown-Forsythe test is used to test the null hypothesis that the vari-
ances of different groups are equal. It is a modification of Levene’s test, where
the median is used instead of the mean to make it more robust to deviations
from normality.

Procedure:
1. Arrange the residuals by increasing values of x.

2. Split the sample into two (more or less equal) groups.


• Group 1: $n_1$ observations with $x \le \tilde{x}$
• Group 2: $n_2$ observations with $x > \tilde{x}$
where $\tilde{x}$ denotes the median of $x$.
3. Compute di1 = |ei1 − ẽ1 | and di2 = |ei2 − ẽ2 |, where ẽ1 and ẽ2 denote the
medians of the residuals in the two groups.
Hypotheses:
$$H_0: \sigma^2_{\text{grp1}} = \sigma^2_{\text{grp2}} \quad \text{versus} \quad H_1: \sigma^2_{\text{grp1}} \neq \sigma^2_{\text{grp2}}$$
Statistic (a two-sample t-test):
$$t_{BF} = \frac{\bar{d}_1 - \bar{d}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \qquad \text{where} \qquad s_p^2 = \frac{\sum_{i=1}^{n_1} (d_{i1} - \bar{d}_1)^2 + \sum_{i=1}^{n_2} (d_{i2} - \bar{d}_2)^2}{n - 2}$$


Under H0 , the test statistic follows a t-distribution with n − 2 degrees of


freedom.
Table 10.11: Calculations for Brown-Forsythe Test for Constancy of Error Variance, Toluca Company Example.

Group 1
i | Run | Lot Size | Residual (e_i1) | d_i1 | (d_i1 − d̄_1)²
1 14 20 −20.77 0.89 1,929.41
2 2 30 −48.47 28.59 263.25
... ... ... ... ... ...
12 12 70 −60.28 40.40 19.49
13 25 70 10.72 30.60 202.07
Total — — — 582.60 12,566.6

ẽ_1 = −19.88, d̄_1 = 44.815

Group 2
i | Run | Lot Size | Residual (e_i2) | d_i2 | (d_i2 − d̄_2)²
1 1 80 51.02 53.70 637.56
2 8 80 4.02 6.70 473.06
... ... ... ... ... ...
11 20 110 −34.09 31.41 8.76
12 7 120 55.21 57.89 866.10
Total — — — 341.40 9,610.2

ẽ_2 = −2.68, d̄_2 = 28.45

We are now ready to calculate the test statistic:
$$s_p^2 = \frac{12{,}566.6 + 9{,}610.2}{25 - 2} = 964.21$$
Hence,
$$s_p = 31.05$$
Therefore,
$$t_{BF} = \frac{\bar{d}_1 - \bar{d}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} = \frac{44.815 - 28.450}{31.05 \sqrt{\frac{1}{13} + \frac{1}{12}}} = 1.32$$
To control the $\alpha$ risk at 0.05, we require $t(0.975; 23) = 2.069$. The decision rule therefore is:

• If $|t_{BF}| \le 2.069$, conclude the error variance is constant.

• If $|t_{BF}| > 2.069$, conclude the error variance is not constant.

Since $|t_{BF}| = 1.32 \le 2.069$, we conclude that the error variance is constant and does not vary with the level of $x$. The two-sided p-value of this test is 0.20.

Python Code: Brown-Forsythe (Modified Levene) Test


To conduct the Brown-Forsythe (Modified Levene) test in Python, we consider the Toluca Company example dataset given in Section 10.10.3. The listing below applies a binned, multi-group variant of the test to the work-hours values; a direct implementation of the two-group residual procedure described above is sketched after it.
import numpy as np
import pandas as pd
import scipy.stats as stats

# Define the data
LotSize = np.array([80, 30, 50, 90, 70, 60, 120, 80, 100, 50, 40, 70, 90,
                    20, 110, 100, 30, 50, 90, 110, 30, 90, 40, 80, 70])
WorkHrs = np.array([399, 121, 221, 376, 361, 224, 546, 352, 353, 157, 160,
                    252, 389, 113, 435, 420, 212, 268, 377, 421, 273, 468,
                    244, 342, 323])

# Convert to DataFrame
df = pd.DataFrame({'LotSize': LotSize, 'WorkHrs': WorkHrs})

# Define bins for grouping (adjust as needed)
bins = [0, 50, 70, 90, 110, np.inf]  # example bins
labels = ['0-50', '51-70', '71-90', '91-110', '111+']
df['Group'] = pd.cut(df['LotSize'], bins=bins, labels=labels)

# Compute medians for each group
group_medians = df.groupby('Group')['WorkHrs'].median()

# Compute absolute deviations from the group medians
df['Deviation'] = df.apply(
    lambda row: abs(row['WorkHrs'] - group_medians[row['Group']]), axis=1)

# Perform ANOVA on the deviations
anova_result = stats.f_oneway(
    df[df['Group'] == '0-50']['Deviation'],
    df[df['Group'] == '51-70']['Deviation'],
    df[df['Group'] == '71-90']['Deviation'],
    df[df['Group'] == '91-110']['Deviation'],
    df[df['Group'] == '111+']['Deviation']
)

print("Brown-Forsythe Test ANOVA Result:")
print(f"F-statistic: {anova_result.statistic:.4f}")
print(f"P-value: {anova_result.pvalue:.4f}")
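For comparison, the following sketch implements the two-group procedure described above directly: a median split on x, absolute deviations of the residuals from the group medians, and the pooled two-sample t statistic. It assumes the LotSize and WorkHrs arrays from Section 10.10.3 and should reproduce t_BF ≈ 1.32 with a two-sided p-value of about 0.20.

import numpy as np
import statsmodels.api as sm
from scipy import stats

X = sm.add_constant(LotSize)
e = sm.OLS(WorkHrs, X).fit().resid

med_x = np.median(LotSize)
e1 = e[LotSize <= med_x]                  # group 1: x <= median of x
e2 = e[LotSize > med_x]                   # group 2: x > median of x

d1 = np.abs(e1 - np.median(e1))           # deviations from group medians
d2 = np.abs(e2 - np.median(e2))

n1, n2 = len(d1), len(d2)
sp2 = (np.sum((d1 - d1.mean()) ** 2) +
       np.sum((d2 - d2.mean()) ** 2)) / (n1 + n2 - 2)
t_bf = (d1.mean() - d2.mean()) / np.sqrt(sp2 * (1 / n1 + 1 / n2))
p_value = 2 * stats.t.sf(abs(t_bf), n1 + n2 - 2)
print(f"t_BF = {t_bf:.2f}, two-sided p-value = {p_value:.2f}")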

10.10.6 Influential Observations, Outliers, and Cook's Distance

Cook's Distance
Cook's distance measures the influence of each observation on the fitted values of the model. It combines the effects of leverage and residual size to determine the influence of each data point.
$$D_i = \frac{\sum_{j=1}^{n} (\hat{y}_j - \hat{y}_{j(i)})^2}{p \cdot \text{MSE}}$$
where $\hat{y}_j$ is the $j$th fitted value, $\hat{y}_{j(i)}$ is the $j$th fitted value with the $i$th observation removed, $p$ is the number of regression parameters in the model, and MSE is the mean squared error of the model.

Influential Observations
Influential observations are data points that have a large impact on the esti-
mated coefficients of the regression model. They can significantly alter the fit
of the model if removed. Influential observations are identified using Cook’s
distance, where observations with Cook’s distance greater than 4/n (where n
is the number of observations) are considered influential.

Outliers
Outliers are data points that deviate significantly from the rest of the data.
They can affect the regression model’s accuracy and should be investigated
to determine if they are genuine data points or errors. Diagnostic tools such
as leverage plots and Cook’s distance can help identify outliers. Outliers are
identified by selecting observations with Cook’s distance greater than a certain
threshold (here, 4/n).

Python Code: Influential Observations, Outliers, and Cook's Distance

To compute Cook's distance and identify influential observations and outliers in Python, we consider the Toluca Company example dataset given in Section 10.10.3. The Python code is provided in the following.


import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Define the data
LotSize = np.array([80, 30, 50, 90, 70, 60, 120, 80, 100, 50, 40, 70, 90,
                    20, 110, 100, 30, 50, 90, 110, 30, 90, 40, 80, 70])
WorkHrs = np.array([399, 121, 221, 376, 361, 224, 546, 352, 353, 157, 160,
                    252, 389, 113, 435, 420, 212, 268, 377, 421, 273, 468,
                    244, 342, 323])

# Create DataFrame
df = pd.DataFrame({'LotSize': LotSize, 'WorkHrs': WorkHrs})

# Add a constant column for the intercept
X = sm.add_constant(df['LotSize'])
y = df['WorkHrs']

# Fit the linear regression model
model = sm.OLS(y, X).fit()

# Get influence measures
influence = model.get_influence()
cooks_d = influence.cooks_distance[0]

# Create a DataFrame for Cook's Distance
cooks_df = pd.DataFrame({
    'LotSize': df['LotSize'],
    'WorkHrs': df['WorkHrs'],
    "Cook's Distance": cooks_d
})

# Print the Cook's Distance values
print("Cook's Distance:")
print(cooks_df)

# Plot Cook's Distance
plt.figure(figsize=(10, 6))
plt.stem(np.arange(len(cooks_d)), cooks_d, basefmt=" ")
plt.xlabel('Observation Index')
plt.ylabel("Cook's Distance")
plt.title("Cook's Distance for Each Observation")
plt.show()

# Identify influential observations (threshold typically 4/n)
threshold = 4 / len(cooks_d)
influential = cooks_df[cooks_df["Cook's Distance"] > threshold]

print("Influential Observations:")
print(influential)

10.10.7 Multicollinearity
Multicollinearity is a common issue in regression analysis. It occurs when predictor variables in a regression model are highly correlated. Multicollinearity can lead to inaccurate estimates of the regression coefficients and their standard errors.

Multicollinearity: Multicollinearity is a statistical phenomenon where


two or more independent variables in a regression model are highly corre-
lated, leading to unreliable estimates of regression coefficients.
AF Causes of Multicollinearity
Multicollinearity can be caused by:
• High Correlation Among Predictors: When independent vari-
ables are highly correlated with each other.

• Redundant Variables: Including variables that are redundant or linear combinations of other predictors.

• Inclusion of Polynomial or Interaction Terms: When polynomial terms or interaction terms are added without centering the variables.

• Data Collection Issues: Poorly designed experiments or data collection methods that capture similar information across multiple variables.

What is the problem if one feature is a linear combination of other features? If one feature is a linear combination of other features, it indicates
multicollinearity. Let X be the design matrix of dimension n × p, where n is
the number of observations and p is the number of predictors. Each row of
X corresponds to an observation, and each column corresponds to a predictor
variable.
The multiple linear regression model can be represented as:
$$\vec{Y} = X\vec{\beta} + \vec{e}$$
where:

• $\vec{Y}$ is the $n \times 1$ vector of response variable values,

• $\vec{\beta}$ is the $p \times 1$ vector of regression coefficients,

• $\vec{e}$ is the $n \times 1$ vector of errors.

The estimator of $\vec{\beta}$ is
$$\hat{\vec{\beta}} = (X^T X)^{-1} X^T \vec{Y}.$$
If one feature is a linear combination of other features, then the rank of the design matrix $X$ is less than the number of predictors. That is,
$$\text{rank}(X) < p.$$

If the rank of $X$ is less than the number of predictors $p$, then the determinant of $X^T X$ is close to zero. If one feature is an exact linear combination of other features, then the determinant of $X^T X$ is exactly zero. Consequently, $(X^T X)^{-1}$ does not exist, which implies that the Ordinary Least Squares (OLS) estimator $\hat{\vec{\beta}} = (X^T X)^{-1} X^T \vec{Y}$ cannot be computed in such cases, and $\hat{\vec{\beta}}$ and $\text{Var}(\hat{\vec{\beta}}) = (X^T X)^{-1} \sigma^2$ are undefined.

In scenarios where multicollinearity is present, leading to a nearly singular $X^T X$, the elements of the OLS estimator $\hat{\vec{\beta}}$ can become very large and provide unreliable coefficient estimates. Additionally, the variance of $\hat{\vec{\beta}}$ becomes very large. This large variance results in small t-ratios, which undermines the validity of hypothesis tests. Furthermore, the large variance leads to wide confidence intervals for the true parameters, affecting the accuracy of inference about the parameters.

Therefore, when one feature is a linear combination of other features, it introduces significant problems for statistical inference and makes it challenging to interpret the effects of individual predictors on the response variable.
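A small numerical illustration with a hypothetical design matrix, in which one column is an exact multiple of another, makes the rank deficiency visible:

import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = 2 * x1                               # exact linear dependence
X = np.column_stack([np.ones(5), x1, x2])

XtX = X.T @ X
print("rank(X) =", np.linalg.matrix_rank(X))   # 2, not 3
print("det(X^T X) =", np.linalg.det(XtX))      # (numerically) zero

try:
    np.linalg.inv(XtX)                    # fails or yields meaningless values
except np.linalg.LinAlgError as err:
    print("Inversion failed:", err)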

Practical Consequences of Multicollinearity


Some practical consequences of multicollinearity include:

1. Large Variances and Covariances: Although Ordinary Least Squares (OLS) estimators are Best Linear Unbiased Estimators (BLUE), multicollinearity leads to large variances and covariances of the estimators. This makes precise estimation of coefficients difficult.
2. Wider Confidence Intervals: Due to the large variances of the estima-
tors, the confidence intervals for the coefficients tend to be much wider.
This often leads to the acceptance of the null hypothesis that the true
population coefficient is zero, even when it may not be.


3. Insignificant t-Ratios: The large variances result in t-ratios that tend


to be statistically insignificant. This makes it challenging to determine
whether individual coefficients are significantly different from zero.
4. High R2 with Insignificant Coefficients: Despite the t-ratios of one
or more coefficients being statistically insignificant, the overall measure
of goodness of fit, R2 , can still be very high. This can create a misleading
impression of model fit.
5. Sensitivity to Data Changes: OLS estimators and their standard er-
rors can be highly sensitive to small changes in the data. This instability
further complicates the interpretation and reliability of the regression re-
sults.

Variance Inflation Factor (VIF)

The Variance Inflation Factor (VIF) is a measure used to quantify how much the variance of an estimated regression coefficient increases due to multicollinearity. For a given predictor $X_j$ in a multiple regression model, the VIF, denoted $\text{VIF}_j$ for the $j$th predictor, is calculated as:
$$\text{VIF}_j = \frac{1}{1 - R_j^2}$$
where $R_j^2$ is the coefficient of determination from regressing the $j$th predictor variable on all the other predictor variables.

To check for multicollinearity in the model:


1. Calculate the Variance Inflation Factor (VIF) for each predictor variable,
x1i , x2i , . . . , xpi .
2. A commonly used rule of thumb is that if the VIF for any predictor variable exceeds 10, multicollinearity may be a concern. If a predictor is uncorrelated with the remaining predictors (for example, x2 and x3 in a three-variable regression model), its VIF equals 1.
3. Repeat steps 1-2 for each predictor variable in the model.

Python Code: Variance Inflation Factor (VIF)


To calculate VIF in Python using the statsmodels library, we can use the
following code for hypothetical data:
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Example DataFrame (note: X2 = 2*X1 and X3 = 6 - X1 exactly, so the
# predictors are perfectly collinear and the VIFs will be infinite)
df = pd.DataFrame({
    'X1': [1, 2, 3, 4, 5],
    'X2': [2, 4, 6, 8, 10],
    'X3': [5, 4, 3, 2, 1]
})

# Add a constant column for the intercept
X = sm.add_constant(df)

# Calculate VIF for each feature
vif = pd.DataFrame()
vif['Feature'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

print(vif)
AF
Tolerance of Limit (TOL)
The Tolerance of Limit (TOL) is a measure used to quantify how much of the variance of a predictor is not explained by the other predictors in a regression model. It is defined as:
$$\text{TOL}_j = \frac{1}{\text{VIF}_j} = 1 - R_j^2$$
When $R_j^2 = 1$ (i.e., perfect collinearity), $\text{TOL}_j = 0$, and when $R_j^2 = 0$ (i.e., no collinearity whatsoever), $\text{TOL}_j = 1$. Because of the intimate connection between VIF and TOL, one can use them interchangeably.

Python Code: Tolerance of Limit (TOL)


To calculate tolerance in Python, we can use the relationship with VIF, as VIF
is the reciprocal of tolerance. Here’s how we can calculate it using the same
example from before:
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Example DataFrame
df = pd.DataFrame({
    'X1': [1, 2, 3, 4, 5],
    'X2': [2, 4, 6, 8, 10],
    'X3': [5, 4, 3, 2, 1]
})

# Add a constant column for the intercept
X = sm.add_constant(df)

# Calculate VIF and tolerance for each feature
vif = pd.DataFrame()
vif['Feature'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['Tolerance'] = 1 / vif['VIF']

print(vif)

Remedial Measures for Multicollinearity


To address multicollinearity, several remedial measures can be employed:
AF 1. Remove Highly Correlated Predictors:
• Identify and remove one or more predictors that are highly cor-
related with others.
• Use correlation matrices, Variance Inflation Factor (VIF), or other
diagnostics to detect high multicollinearity.
2. Combine Predictors:
• Combine correlated predictors into a single composite variable.
• Use techniques such as Principal Component Analysis (PCA) to
create a new set of uncorrelated variables.

3. Apply Regularization Techniques:


• Regularization methods can help reduce multicollinearity by adding
a penalty to the size of the coefficients.
• Techniques include (illustrated in the sketch after this list):
■ Ridge Regression (L2 Regularization): Adds a penalty
proportional to the square of the magnitude of coefficients.
■ Lasso Regression (L1 Regularization): Adds a penalty
proportional to the absolute value of coefficients, which
can also perform feature selection by shrinking some coef-
ficients to zero.
4. Use Principal Component Analysis (PCA):
• PCA transforms the predictors into a set of orthogonal compo-
nents.


• Replace the original predictors with the principal components to


address multicollinearity.
5. Increase Sample Size:
• A larger sample size can reduce the variance of the coefficient
estimates and mitigate the effects of multicollinearity.
• Collect more data if feasible to improve the stability of the re-
gression model.
6. Center the Variables:
• Centering involves subtracting the mean of each predictor from the predictor values.
• This is particularly useful when dealing with polynomial or interaction terms.
7. Drop or Transform Variables:
AF • Drop variables that do not contribute significantly to the model
or transform variables to reduce multicollinearity.
• Conduct feature selection or use transformations to address is-
sues.
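As a brief illustration of remedial measure 3, the following sketch fits a ridge regression with scikit-learn to a hypothetical, nearly collinear dataset; the penalty strength alpha = 1.0 is an arbitrary choice for demonstration.

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(4)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.01, size=100)  # nearly collinear with x1
X = np.column_stack([x1, x2])
y = 3 * x1 + 2 * x2 + rng.normal(size=100)

ridge = Ridge(alpha=1.0)   # L2 penalty shrinks the coefficients
ridge.fit(X, y)
print("Ridge coefficients:", ridge.coef_)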

10.10.8 Exercises
1. Define multicollinearity. How can it affect the results of a multiple linear
regression model, and how can it be detected?

2. Explain how you would use a residual plot to assess the fit of a multiple
linear regression model. What patterns in the residual plot might suggest
problems with the model?
3. List the key assumptions of linear regression. For each assumption, pro-
vide a brief explanation of why it is important for the validity of the
regression model.
4. Given a dataset with a linear regression model fitted, describe how you
would check each of the assumptions (linearity, independence, homoscedas-
ticity, normality of errors).

5. Explain the Durbin-Watson test and how it is used to test for autocorre-
lation.
6. Explain how to perform the Breusch-Pagan test and the White test for
heteroscedasticity. Interpret the results of these tests and discuss how
they affect the validity of the regression model.


7. How would you handle outliers and influential observations? Discuss


methods such as robust regression or data transformation.
8. Explain how to detect multicollinearity in a regression model using Vari-
ance Inflation Factor (VIF) or condition indices.

9. If multicollinearity is present, what strategies can you use to address it?


Discuss methods like variable selection or principal component analysis.
10. How would you address an influential observation identified through Cook’s
Distance?

10.11 Concluding Remarks
This chapter has equipped you with essential techniques in correlation and re-
gression analysis, crucial for the practice of data science. By examining scatter
diagrams, covariance, and correlation coefficients, we laid the groundwork for
understanding the relationships between variables in a dataset.
The exploration of regression analysis, including both simple and multiple
linear regression models, has highlighted key aspects such as model assumptions,
coefficient interpretation, and model fit evaluation. These tools are fundamen-
tal for building predictive models, making data-driven decisions, and deriving
actionable insights from complex datasets.

The inclusion of Python code examples throughout the chapter bridges the
gap between theoretical concepts and practical implementation, demonstrating
how to apply these techniques in real-world data science scenarios. Mastering
these methods will enhance your ability to analyze data, validate findings, and contribute to evidence-based decision-making across various domains.

In summary, a solid grasp of correlation and regression analysis not only


improves our analytical skills but also enables us to leverage data effectively to
drive strategic decisions and solve complex problems in the field of data science.

10.12 Chapter Exercises


1. A study was conducted to investigate the effect of a new drug on blood
pressure. The dataset contains information on the age of patients, their
baseline blood pressure, and their blood pressure after 6 weeks of treat-
ment with the drug. The following data has been collected:


Patient | Age (years) | Baseline Blood Pressure (mm Hg) | Blood Pressure after 6 weeks (mm Hg) (y)

1 45 140 130
2 50 150 135
3 60 160 140
4 55 155 145
5 65 170 150
6 70 175 155
7 75 180 160
8 80 185 165
9 85 190 170
10 90 195 175
(i) Formulate a linear regression model to study the relationship be-
tween age (x1 ) and baseline blood pressure (x2 ) as independent vari-
ables and the blood pressure after 6 weeks (y) as the dependent
variable.
(ii) State the assumptions of the linear regression model in the context
of this study.
(iii) Estimate the regression coefficients using the least squares method.
Interpret the coefficients for age and baseline blood pressure.
(iv) Using the model obtained in part (iii), predict the blood pressure
after 6 weeks for a 65-year-old patient with a baseline blood pressure of 165 mm Hg.
(v) Discuss the potential impact of multicollinearity between age and
baseline blood pressure on the model’s estimates.
(vi) Calculate the standard error of the estimates and discuss the preci-
sion of the regression coefficients.
(vii) Perform a hypothesis test to determine whether age is a significant
predictor of blood pressure after 6 weeks. Use a 5% significance level.
(viii) Construct a 95% confidence interval for the coefficient of baseline
blood pressure. Interpret the interval.
(ix) Create a residual plot and comment on the model’s assumptions
regarding homoscedasticity and normality of errors.
(x) Discuss the limitations of using this linear regression model for pre-
dicting blood pressure after 6 weeks and suggest possible improve-
ments.


2. An engineer is studying the tensile strength of a new composite material.


The dataset contains information on the fiber content of the composite,
the curing temperature, and the measured tensile strength. The following
data has been collected:

Sample | Fiber Content (%) | Curing Temperature (°C) | Tensile Strength (MPa) (y)

1 30 150 450
2 35 160 470
3 40 170 480
4 45 180 490
5 50 190 510
6 55 200 530
7 60 210 540
8 65 220 550
9 70 230 560
10 75 240 570

(i) Develop a multiple linear regression model where tensile strength (y)
is the dependent variable, and fiber content (x1 ) and curing temper-
ature (x2 ) are independent variables.
(ii) Explain the assumptions of the multiple linear regression model in
the context of this engineering problem.
(iii) Estimate the regression coefficients using the least squares method. Interpret the coefficients for fiber content and curing temperature.


(iv) Using the model obtained in part (iii), estimate the tensile strength
of a composite with 50% fiber content cured at 200◦ C.
(v) Calculate the coefficient of determination (R2 ) and interpret its value
in terms of the model’s goodness-of-fit.
(vi) Conduct a hypothesis test to assess whether fiber content has a sta-
tistically significant effect on tensile strength. Use a 5% significance
level.
(vii) Construct a 95% confidence interval for the coefficient of curing tem-
perature. What does this interval indicate about the relationship
between curing temperature and tensile strength?
(viii) Plot the predicted tensile strength against the actual tensile strength
and comment on the model’s predictive accuracy.


(ix) Discuss how changes in fiber content and curing temperature might
interact to affect tensile strength. Could an interaction term be
included in the model?
(x) Analyze the residuals from the model to check for any violations of
the regression assumptions, such as non-linearity or heteroscedastic-
ity.

3. Suppose age (days), birthweight (oz), and SBP are measured for 16 infants
and the data are as shown in Table 10.12. What is the relationship be-
tween infant systolic blood pressure (SBP) and their age and birthweight?
Can we predict SBP based on these factors?

Table 10.12: Sample data for infant blood pressure, age, and birthweight for 16
infants

i | Age (days) (x1) | Birthweight (oz) (x2) | SBP (mm Hg) (y)

1 3 135 89
2 4 120 90
3 3 100 83
4 2 105 77
5 4 130 92
6 5 125 98
7 2 125 82
8 3 105 85
9 5 120 96
10 4 90 95
11 2 120 80
12 3 95 79
13 3 120 86
14 4 150 97
15 3 160 92
16 3 125 88

Based on the data provided in Example 3, answer the following questions:

(a) Descriptive Statistics:


i. Calculate the mean and standard deviation for the variables:


• Age (days) (x1 )


• Birthweight (oz) (x2 )
• SBP (mm Hg) (y)
ii. What do the mean and standard deviation tell you about the
distribution of each variable?
(b) Scatter Plots:
i. Create a scatter plot of Age (days) (x1 ) vs. SBP (mm Hg) (y).
ii. Create a scatter plot of Birthweight (oz) (x2 ) vs. SBP (mm Hg)
(y).

iii. Visually assess the relationship between SBP and each predictor.
Is the relationship linear or nonlinear?
(c) Correlation Analysis:
i. Calculate the Pearson correlation coefficient between Age (x1 )
and SBP (y).
ii. Calculate the Pearson correlation coefficient between Birthweight
(x2 ) and SBP (y).
iii. Interpret the correlation coefficients. Which variable is more
strongly correlated with SBP?
(d) Simple Linear Regression:
i. Perform a simple linear regression to predict SBP (y) based on
Age (x1 ). Write down the regression equation.
ii. Perform a simple linear regression to predict SBP (y) based on
Birthweight (x2 ). Write down the regression equation.
iii. Interpret the slope of each regression line. What does the slope
tell you about the relationship between each predictor and SBP?

(e) Multiple Linear Regression:


i. Perform a multiple linear regression using Age (x1 ) and Birth-
weight (x2 ) as predictors of SBP (y).
ii. Write down the multiple regression equation.
iii. Interpret the coefficients of Age and Birthweight in the context
of predicting SBP.
(f) Model Comparison:
i. Calculate the R2 value for the simple linear regression models
from Question 3d.
ii. Calculate the R2 value for the multiple linear regression model
from Question 3e.
iii. Compare the R2 values. Which model explains more variance in
SBP?
(g) Prediction:


i. Using the multiple regression model from Question 3e, predict


the SBP for an infant who is 4 days old with a birthweight of
115 oz.
ii. Compare this prediction to the SBP predicted by each of the sim-
ple linear regression models (age-based and birthweight-based).
(h) Model Assumptions:
i. Discuss the assumptions underlying the multiple linear regres-
sion model (e.g., linearity, independence, homoscedasticity, nor-
mality).
ii. How would you check these assumptions with the given data?
iii. Check for multicollinearity between Age and Birthweight. What are the potential implications of multicollinearity on your regression model?
Summarize the relationship between infant systolic blood pressure
(SBP) and the predictors (Age and Birthweight). Based on your
analysis, do you think Age and Birthweight are sufficient to predict
SBP? Suggest other factors that might be important to include in
the model.

4. Consider a dataset containing information about the annual revenue of a


tech company, the number of employees, and the research and develop-
ment (R&D) expenditure. The following data has been collected:

Company | Number of Employees (1000s) | R&D Expenditure (Millions of Dollars) | Annual Revenue (Millions of Dollars) (y)

1 5 8 120
2 12 15 180
3 9 10 150
4 14 20 200
5 18 25 250
6 22 28 280
7 25 30 320
8 30 35 350
9 28 32 340
10 35 40 400

(i) Using the above data, formulate a multiple linear regression model
where the annual revenue (y) is the dependent variable, and both
the number of employees (x1 ) and the R&D expenditure (x2 ) are
independent variables.
(ii) Write down the assumptions of the multiple linear regression model.
(iii) Estimate the regression coefficients for the model formulated in Ques-
tion 1 using the least squares method. Interpret the coefficients for
both the number of employees and R&D expenditure.


(iv) Based on the regression model obtained in part (iii), what would be
the estimated annual revenue for a company with 20,000 employees
and an R&D expenditure of $30 million?
(v) Calculate the standard error of the estimates for the model obtained
in Question (iii).
(vi) How would you evaluate the goodness-of-fit for the regression model?
What statistical metrics would you consider, and why?
(vii) Create scatter plots to show the relationship between:
• Number of employees and annual revenue.
• R&D expenditure and annual revenue.

Overlay the regression lines on these plots and comment on the ob-
served relationships.
(viii) If a new company is planning to hire 25,000 employees and spend
$35 million on R&D, use the regression model from Question (iii) to
predict the annual revenue.
(ix) Discuss the limitations of using the linear regression model in this
context. What are some factors not considered by the model that
could affect the accuracy of your predictions?
(x) Test the significance of each regression coefficient at the 5% signif-
icance level. Clearly state the null and alternative hypotheses, the
test statistic, and your conclusion.
(xi) Construct a 95% confidence interval for the coefficients of the number
of employees and R&D expenditure. Interpret the intervals.

5. Grocery Retailer: A large, national grocery retailer tracks productivity


and costs of its facilities closely. Data given in Table 10.13 were obtained

from a single distribution center for a one-year period. Each data point
for each variable represents one week of activity. The variables included
are:
• The number of cases shipped (X1 )
• The indirect costs of the total labor hours as a percentage (X2 )
• A qualitative predictor called holiday that is coded 1 if the week
has a holiday and 0 otherwise (X3 )
• The total labor hours (Y )

(i). Obtain the scatter plot matrix and the correlation matrix. What
information do these diagnostic aids provide here?
(ii). Write a multiple regression model to the data for three predictor
variables. State the estimate regression function.


(iii). Obtain the residuals and prepare a box plot of the residuals. What
information does this plot provide?
(iv). Plot the residuals against Y , X1 , X2 , X3 , and X1 X2 on separate
graphs. Also prepare a normal probability plot. Interpret the plots
and summarize your findings.
(v). Prepare a time plot of the residuals. Is there any indication that the
error terms are correlated? Discuss.
(vi). Conduct the Brown-Forsythe test for constancy of the error variance,
using α = 0.01. State the decision rule and conclusion.
(vii). Test whether there is a regression relation, using a level of significance of 0.05. State the alternatives, decision rule, and conclusion. What
does your test result imply about β1 , β2 , and β3 ? What is the P-
value of the test?
(viii). Calculate the coefficient of multiple determination R2 . How is this
measure interpreted here?
(ix). Four separate shipments with the following characteristics must be processed next month:

X1 X2 X3
230,000 7.50 0
250,000 7.30 0
280,000 7.10 0
340,000 6.90 0

Management desires predictions of the handling times for these ship-



ments so that the actual handling times can be compared with the
predicted times to determine whether any are out of line. Develop
the needed predictions, using the most efficient approach and a fam-
ily confidence coefficient of 95%.
(x). Three new shipments are to be received, each with X1 = 282, 000,
X2 = 7.10, and X3 = 0.
(a). Obtain a 95% prediction interval for the mean handling time
for these shipments.
(b). Convert the interval obtained in part (a) into a 95% pre-
diction interval for the total labor hours for the three ship-
ments.


Table 10.13: Grocery Retailer

y x1 x2 x3

4264 305657 7.17 0


4496 328476 6.2 0
4317 317164 4.61 0
4292 366745 7.02 0
4945 265518 8.61 1

4325 301995 6.88 0
4110 269334 7.23 0
4111 267631 6.27 0
4161 296350 6.49 0
4560 277223 6.37 0
4401 269189 7.05 0
4251 277133 6.34 0
4222 282892 6.94 0
4063 306639 8.56 0
4343 328405 6.71 0
4833 321773 5.82 1
4453 272319 6.82 0
4195 293880 8.38 0

4394 300867 7.72 0


4099 296872 7.67 0
4816 245674 7.72 1
4867 211944 6.45 1
4114 227996 7.22 0
4314 248328 8.5 0
4289 249894 8.08 0
4269 302660 7.26 0
4347 273848 7.39 0
4178 245743 8.12 0


4333 267673 6.75 0


4226 256506 7.79 0
4121 271854 7.89 0
3998 293225 9.01 0
4475 269121 8.01 0
4545 322812 7.21 0

4016 252225 7.85 0
4207 261365 6.14 0
4148 287645 6.76 0
4562 289666 7.92 0
4146 270051 8.19 0
4555 265239 7.55 0
4365 352466 6.94 0
4471 426908 7.25 0
5045 369989 9.65 1
4469 472476 8.2 0
4408 414102 8.02 0
4219 302507 6.72 0

4211 382686 7.23 0


4993 442782 7.61 1
4309 322303 7.39 0
4499 290455 7.99 0
4186 411750 7.83 0
4342 292087 7.77 0


Appendix

A.1: Table of 1000 random digits

01 32924 22324 18125 09077 26 96772 16443 39877 04653


02 54632 90374 94143 49295 27 52167 21038 14338 01395
03 88720 43035 97081 83373 28 69644 37198 00028 98195
04 21727 11904 41513 31653 29 71011 62004 81712 87536
05 80985 70799 57975 69282 30 31217 75877 85366 55500
06 40412 58826 94868 52632 31 64990 98735 02999 35521
07 43918 56807 75218 46077 32 48417 23569 59307 46550
08 26513 47480 77410 47741 33 07900 65059 48592 44087
09 18164 35784 44255 30124 34 74526 32601 24482 16981
10 39446 01375 75264 51173 35 51056 04402 58353 37332
11 16638 04680 98617 90298 36 39005 93458 63143 21817

12 16872 94749 44012 48884 37 67883 76343 78155 67733


13 65419 87092 78596 91512 38 06014 60999 87226 36071
14 05207 36702 56804 10498 39 93147 88766 04148 42471
15 78807 79243 13729 81222 40 01099 95731 47622 13294
16 69341 79028 64253 80447 41 89252 01201 58138 13809
17 41871 17566 61200 15994 42 41766 57239 50251 64675
18 25758 04625 43226 32986 43 92736 77800 81996 45646
19 06604 94486 40174 10742 44 45118 36600 68977 68831
20 82259 56512 48945 18183 45 73457 01579 00378 70197
21 07895 37090 50627 71320 46 49465 85251 42914 17277
22 59836 71148 42320 67816 47 15745 37285 23768 39302
23 57133 76610 89104 30481 48 28760 81331 78265 60690
24 76964 57126 87174 61025 49 82193 32787 70451 91141
25 27694 17145 32439 68245 50 89664 50242 12382 39379

A.2: Cumulative Distribution Function of the Standard Normal Distribution

z .00 .01 .02 .03 .04 .05 .06 .07 .08 .09

−3.4 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0002
−3.3 0.0005 0.0005 0.0005 0.0004 0.0004 0.0004 0.0004 0.0004 0.0004 0.0003
−3.2 0.0007 0.0007 0.0006 0.0006 0.0006 0.0006 0.0006 0.0005 0.0005 0.0005
−3.1 0.0010 0.0009 0.0009 0.0009 0.0008 0.0008 0.0008 0.0008 0.0007 0.0007
−3.0 0.0013 0.0013 0.0013 0.0012 0.0012 0.0011 0.0011 0.0011 0.0010 0.0010
−2.9 0.0019 0.0018 0.0018 0.0017 0.0016 0.0016 0.0015 0.0015 0.0014 0.0014

−2.8 0.0026 0.0025 0.0024 0.0023 0.0023 0.0022 0.0021 0.0021 0.0020 0.0019
−2.7 0.0035 0.0034 0.0033 0.0032 0.0031 0.0030 0.0029 0.0028 0.0027 0.0026
−2.6 0.0047 0.0045 0.0044 0.0043 0.0041 0.0040 0.0039 0.0038 0.0037 0.0036
−2.5 0.0062 0.0060 0.0059 0.0057 0.0055 0.0054 0.0052 0.0051 0.0049 0.0048
−2.4 0.0082 0.0080 0.0078 0.0075 0.0073 0.0071 0.0069 0.0068 0.0066 0.0064
−2.3 0.0107 0.0104 0.0102 0.0099 0.0096 0.0094 0.0091 0.0089 0.0087 0.0084
−2.2 0.0139 0.0136 0.0132 0.0129 0.0125 0.0122 0.0119 0.0116 0.0113 0.0110
−2.1 0.0179 0.0174 0.0170 0.0166 0.0162 0.0158 0.0154 0.0150 0.0146 0.0143
−2.0 0.0228 0.0222 0.0217 0.0212 0.0207 0.0202 0.0197 0.0192 0.0188 0.0183
−1.9 0.0287 0.0281 0.0274 0.0268 0.0262 0.0256 0.0250 0.0244 0.0239 0.0233
−1.8 0.0359 0.0351 0.0344 0.0336 0.0329 0.0322 0.0314 0.0307 0.0301 0.0294
−1.7 0.0446 0.0436 0.0427 0.0418 0.0409 0.0401 0.0392 0.0384 0.0375 0.0367
−1.6 0.0548 0.0537 0.0526 0.0516 0.0505 0.0495 0.0485 0.0475 0.0465 0.0455
−1.5 0.0668 0.0655 0.0643 0.0630 0.0618 0.0606 0.0594 0.0582 0.0571 0.0559

−1.4 0.0808 0.0793 0.0778 0.0764 0.0749 0.0735 0.0721 0.0708 0.0694 0.0681
−1.3 0.0968 0.0951 0.0934 0.0918 0.0901 0.0885 0.0869 0.0853 0.0838 0.0823
−1.2 0.1151 0.1131 0.1112 0.1093 0.1075 0.1056 0.1038 0.1020 0.1003 0.0985
−1.1 0.1357 0.1335 0.1314 0.1292 0.1271 0.1251 0.1230 0.1210 0.1190 0.1170
−1.0 0.1587 0.1562 0.1539 0.1515 0.1492 0.1469 0.1446 0.1423 0.1401 0.1379
−0.9 0.1841 0.1814 0.1788 0.1762 0.1736 0.1711 0.1685 0.1660 0.1635 0.1611
−0.8 0.2119 0.2090 0.2061 0.2033 0.2005 0.1977 0.1949 0.1922 0.1894 0.1867
−0.7 0.2420 0.2389 0.2358 0.2327 0.2296 0.2266 0.2236 0.2206 0.2177 0.2148
−0.6 0.2743 0.2709 0.2676 0.2643 0.2611 0.2578 0.2546 0.2514 0.2483 0.2451
−0.5 0.3085 0.3050 0.3015 0.2981 0.2946 0.2912 0.2877 0.2843 0.2810 0.2776
−0.4 0.3446 0.3409 0.3372 0.3336 0.3300 0.3264 0.3228 0.3192 0.3156 0.3121
−0.3 0.3821 0.3783 0.3745 0.3707 0.3669 0.3632 0.3594 0.3557 0.3520 0.3483
−0.2 0.4207 0.4168 0.4129 0.4090 0.4052 0.4013 0.3974 0.3936 0.3897 0.3859
−0.1 0.4602 0.4562 0.4522 0.4483 0.4443 0.4404 0.4364 0.4325 0.4286 0.4247
0.0 0.5000 0.4960 0.4920 0.4880 0.4840 0.4801 0.4761 0.4721 0.4681 0.4641

A.3: Cumulative Distribution Function of the Standard Normal Distribution

z .00 .01 .02 .03 .04 .05 .06 .07 .08 .09

0 0.5000 0.5040 0.5080 0.5120 0.5160 0.5200 0.5240 0.5279 0.5319 0.5359
0.1 0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.5754
0.2 0.5793 0.5832 0.5871 0.5910 0.5948 0.5987 0.6026 0.6064 0.6103 0.6141
0.3 0.6179 0.6217 0.6255 0.6293 0.6331 0.6368 0.6406 0.6443 0.6480 0.6517
0.4 0.6554 0.6591 0.6628 0.6664 0.6700 0.6736 0.6772 0.6808 0.6844 0.6879
0.5 0.6915 0.6950 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157 0.7190 0.7224

0.6 0.7258 0.7291 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486 0.7518 0.7549
0.7 0.7580 0.7612 0.7642 0.7673 0.7704 0.7734 0.7764 0.7794 0.7823 0.7852
0.8 0.7881 0.7910 0.7939 0.7967 0.7996 0.8023 0.8051 0.8079 0.8106 0.8133
0.9 0.8159 0.8186 0.8212 0.8238 0.8264 0.8289 0.8315 0.8340 0.8365 0.8389
1.0 0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8600 0.8621
1.1 0.8643 0.8665 0.8686 0.8708 0.8729 0.8749 0.8770 0.8790 0.8810 0.8830
1.2 0.8849 0.8869 0.8888 0.8907 0.8925 0.8944 0.8962 0.8980 0.8997 0.9015
1.3 0.9032 0.9049 0.9066 0.9082 0.9099 0.9115 0.9131 0.9147 0.9162 0.9177
1.4 0.9192 0.9207 0.9222 0.9236 0.9251 0.9265 0.9279 0.9292 0.9306 0.9319
1.5 0.9332 0.9345 0.9357 0.9370 0.9382 0.9394 0.9406 0.9418 0.9430 0.9441
1.6 0.9452 0.9463 0.9474 0.9485 0.9495 0.9505 0.9515 0.9525 0.9535 0.9545
1.7 0.9554 0.9564 0.9573 0.9582 0.9591 0.9599 0.9608 0.9616 0.9625 0.9633
1.8 0.9641 0.9649 0.9656 0.9664 0.9671 0.9678 0.9686 0.9693 0.9700 0.9706
1.9 0.9713 0.9719 0.9726 0.9732 0.9738 0.9744 0.9750 0.9756 0.9762 0.9767

2.0 0.9773 0.9778 0.9783 0.9788 0.9793 0.9798 0.9803 0.9808 0.9812 0.9817
2.1 0.9821 0.9826 0.9830 0.9834 0.9838 0.9842 0.9846 0.9850 0.9854 0.9857
2.2 0.9861 0.9864 0.9868 0.9871 0.9875 0.9878 0.9881 0.9884 0.9887 0.9890
2.3 0.9893 0.9896 0.9898 0.9901 0.9904 0.9906 0.9909 0.9911 0.9913 0.9916
2.4 0.9918 0.9920 0.9922 0.9924 0.9927 0.9929 0.9931 0.9932 0.9934 0.9936
2.5 0.9938 0.9940 0.9942 0.9943 0.9945 0.9946 0.9948 0.9949 0.9951 0.9952
2.6 0.9953 0.9955 0.9956 0.9957 0.9958 0.9959 0.9960 0.9961 0.9962 0.9963
2.7 0.9964 0.9965 0.9966 0.9967 0.9968 0.9969 0.9970 0.9971 0.9972 0.9973
2.8 0.9974 0.9975 0.9976 0.9977 0.9978 0.9979 0.9980 0.9981 0.9982 0.9983
2.9 0.9984 0.9985 0.9986 0.9987 0.9988 0.9989 0.9990 0.9991 0.9992 0.9993
3.0 0.9994 0.9995 0.9996 0.9997 0.9998 0.9999 1.0000 1.0000 1.0000 1.0000

511


Figure 10.18: Standard Normal Distribution


A.4: The left-sided critical points, −za, of the standard normal distribution for the probability a.

a 0 0.001 0.002 0.003 0.004 0.005 0.006 0.007 0.008 0.009

0.01 -2.326 -2.290 -2.257 -2.226 -2.197 -2.170 -2.144 -2.120 -2.097 -2.075
0.02 -2.054 -2.034 -2.014 -1.995 -1.977 -1.960 -1.943 -1.927 -1.911 -1.896
0.03 -1.881 -1.866 -1.852 -1.838 -1.825 -1.812 -1.799 -1.787 -1.774 -1.762
0.04 -1.751 -1.739 -1.728 -1.717 -1.706 -1.695 -1.685 -1.675 -1.665 -1.655
0.05 -1.645 -1.635 -1.626 -1.616 -1.607 -1.598 -1.589 -1.580 -1.572 -1.563
0.06 -1.555 -1.546 -1.538 -1.530 -1.522 -1.514 -1.506 -1.499 -1.491 -1.483
0.07 -1.476 -1.468 -1.461 -1.454 -1.447 -1.440 -1.433 -1.426 -1.419 -1.412
0.08 -1.405 -1.398 -1.392 -1.385 -1.379 -1.372 -1.366 -1.359 -1.353 -1.347
0.09 -1.341 -1.335 -1.329 -1.323 -1.317 -1.311 -1.305 -1.299 -1.293 -1.287
0.10 -1.282 -1.276 -1.270 -1.265 -1.259 -1.254 -1.248 -1.243 -1.237 -1.232

0.11 -1.227 -1.221 -1.216 -1.211 -1.206 -1.200 -1.195 -1.190 -1.185 -1.180
0.12 -1.175 -1.170 -1.165 -1.160 -1.155 -1.150 -1.146 -1.141 -1.136 -1.131
0.13 -1.126 -1.122 -1.117 -1.112 -1.108 -1.103 -1.098 -1.094 -1.089 -1.085
0.14 -1.080 -1.076 -1.071 -1.067 -1.063 -1.058 -1.054 -1.049 -1.045 -1.041

0.15 -1.036 -1.032 -1.028 -1.024 -1.019 -1.015 -1.011 -1.007 -1.003 -0.999

0.16 -0.994 -0.990 -0.986 -0.982 -0.978 -0.974 -0.970 -0.966 -0.962 -0.958
0.17 -0.954 -0.950 -0.946 -0.942 -0.938 -0.935 -0.931 -0.927 -0.923 -0.919
0.18 -0.915 -0.912 -0.908 -0.904 -0.900 -0.896 -0.893 -0.889 -0.885 -0.882
0.19 -0.878 -0.874 -0.871 -0.867 -0.863 -0.860 -0.856 -0.852 -0.849 -0.845
0.2 -0.842 -0.838 -0.834 -0.831 -0.827 -0.824 -0.820 -0.817 -0.813 -0.810



Figure 10.19: Standard Normal Distribution


A.5: The right-sided critical points, za, of the standard normal distribution for the probability a.

a 0 0.001 0.002 0.003 0.004 0.005 0.006 0.007 0.008 0.009


0.01 2.326 2.290 2.257 2.226 2.197 2.170 2.144 2.120 2.097 2.075
0.02 2.054 2.034 2.014 1.995 1.977 1.960 1.943 1.927 1.911 1.896
0.03 1.881 1.866 1.852 1.838 1.825 1.812 1.799 1.787 1.774 1.762
0.04 1.751 1.739 1.728 1.717 1.706 1.695 1.685 1.675 1.665 1.655
0.05 1.645 1.635 1.626 1.616 1.607 1.598 1.589 1.580 1.572 1.563
0.06 1.555 1.546 1.538 1.530 1.522 1.514 1.506 1.499 1.491 1.483
0.07 1.476 1.468 1.461 1.454 1.447 1.440 1.433 1.426 1.419 1.412
0.08 1.405 1.398 1.392 1.385 1.379 1.372 1.366 1.359 1.353 1.347
0.09 1.341 1.335 1.329 1.323 1.317 1.311 1.305 1.299 1.293 1.287
0.1 1.282 1.276 1.270 1.265 1.259 1.254 1.248 1.243 1.237 1.232
0.11 1.227 1.221 1.216 1.211 1.206 1.200 1.195 1.190 1.185 1.180
0.12 1.175 1.170 1.165 1.160 1.155 1.150 1.146 1.141 1.136 1.131
0.13 1.126 1.122 1.117 1.112 1.108 1.103 1.098 1.094 1.089 1.085
0.14 1.080 1.076 1.071 1.067 1.063 1.058 1.054 1.049 1.045 1.041
0.15 1.036 1.032 1.028 1.024 1.019 1.015 1.011 1.007 1.003 0.999

0.16 0.994 0.990 0.986 0.982 0.978 0.974 0.970 0.966 0.962 0.958
0.17 0.954 0.950 0.946 0.942 0.938 0.935 0.931 0.927 0.923 0.919
0.18 0.915 0.912 0.908 0.904 0.900 0.896 0.893 0.889 0.885 0.882
0.19 0.878 0.874 0.871 0.867 0.863 0.860 0.856 0.852 0.849 0.845
0.2 0.842 0.838 0.834 0.831 0.827 0.824 0.820 0.817 0.813 0.810



Figure 10.20: Student’s-t Distribution


A.6: Critical points of the t-distribution with its degrees of freedom (ν)

ν \ a    0.10    0.05    0.025    0.01    0.005    0.001    0.0005

1 3.078 6.314 12.706 31.821 63.657 318.31 636.62


2 1.886 2.920 4.303 6.965 9.925 22.326 31.598
3 1.638 2.353 3.182 4.541 5.841 10.213 12.924
4 1.533 2.132 2.776 3.747 4.604 7.173 8.610
5 1.476 2.015 2.571 3.365 4.032 5.893 6.869
AF 6
7
8
9
1.440
1.415
1.397
1.383
1.943
1.895
1.860
1.833
2.447
2.365
2.306
2.262
3.143
2.998
2.896
2.821
3.707
3.499
3.355
3.250
5.208
4.785
4.501
4.297
5.959
5.408
5.041
4.781
10 1.372 1.812 2.228 2.764 3.169 4.144 4.587
11 1.363 1.796 2.201 2.718 3.106 4.025 4.437
12 1.356 1.782 2.179 2.681 3.055 3.930 4.318
13 1.350 1.771 2.160 2.650 3.012 3.852 4.221
14 1.345 1.761 2.145 2.624 2.977 3.787 4.140
15 1.341 1.753 2.131 2.602 2.947 3.733 4.073
16 1.337 1.746 2.120 2.583 2.921 3.686 4.015
17 1.333 1.740 2.110 2.567 2.898 3.646 3.965

18 1.330 1.734 2.101 2.552 2.878 3.610 3.922


19 1.328 1.729 2.093 2.539 2.861 3.579 3.883
20 1.325 1.725 2.086 2.528 2.845 3.552 3.850
21 1.323 1.721 2.080 2.518 2.831 3.527 3.819
22 1.321 1.717 2.074 2.508 2.819 3.505 3.792
23 1.319 1.714 2.069 2.500 2.807 3.485 3.767
24 1.318 1.711 2.064 2.492 2.797 3.467 3.745
25 1.316 1.708 2.060 2.485 2.787 3.450 3.725
26 1.315 1.706 2.056 2.479 2.779 3.435 3.707
27 1.314 1.703 2.052 2.473 2.771 3.421 3.690
28 1.313 1.701 2.048 2.467 2.763 3.408 3.674
29 1.311 1.699 2.045 2.462 2.756 3.396 3.659
30 1.310 1.697 2.042 2.457 2.750 3.385 3.646
40 1.303 1.684 2.021 2.423 2.704 3.307 3.551
60 1.296 1.671 2.000 2.390 2.660 3.232 3.460
120 1.289 1.658 1.980 2.358 2.617 3.160 3.373
∞ 1.282 1.645 1.960 2.326 2.576 3.090 3.291



Figure 10.21: Chi-Square Distribution


A.7: Percentage points of the chi-square distribution (χ²a,ν)

ν χ2.995 χ2.99 χ2.975 χ2.95 χ2.90 χ2.75 χ2.50 χ2.25 χ2.10 χ2.05 χ2.025 χ2.01 χ2.005 χ2.001
1 0.00 0.00 0.00 0.00 0.02 0.10 0.45 1.32 2.71 3.84 5.02 6.63 7.88 10.83
2 0.01 0.02 0.05 0.10 0.21 0.58 1.39 2.77 4.61 5.99 7.38 9.21 10.60 13.81

3 0.07 0.12 0.22 0.35 0.58 1.21 2.37 4.11 6.25 7.81 9.35 11.34 12.84 16.27
4 0.21 0.30 0.48 0.71 1.06 1.92 3.36 5.39 7.78 9.49 11.14 13.28 14.86 18.47
5 0.41 0.55 0.83 1.15 1.61 2.67 4.35 6.63 9.24 11.07 12.83 15.09 16.75 20.52
6 0.68 0.87 1.24 1.64 2.20 3.45 5.35 7.84 10.64 12.59 14.45 16.81 18.55 22.46
7 0.99 1.24 1.69 2.17 2.83 4.25 6.35 9.04 12.02 14.07 16.01 18.48 20.28 24.32
8 1.34 1.65 2.18 2.73 3.49 5.07 7.34 10.22 13.36 15.51 17.53 20.09 21.95 26.12
9 1.73 2.09 2.70 3.33 4.17 5.90 8.34 11.39 14.68 16.92 19.02 21.67 23.59 27.88
10 2.16 2.56 3.25 3.94 4.87 6.74 9.34 12.55 15.99 18.31 20.48 23.21 25.19 29.59
11 2.60 3.05 3.82 4.57 5.58 7.58 10.34 13.70 17.28 19.68 21.92 24.72 26.76 31.26
12 3.07 3.57 4.40 5.23 6.30 8.44 11.34 14.85 18.55 21.03 23.34 26.22 28.30 32.91
13 3.57 4.11 5.01 5.89 7.04 9.30 12.34 15.98 19.81 22.36 24.74 27.69 29.82 34.53
14 4.07 4.66 5.63 6.57 7.79 10.17 13.34 17.12 21.06 23.68 26.12 29.14 31.32 36.12
15 4.60 5.23 6.27 7.26 8.55 11.04 14.34 18.25 22.31 25.00 27.49 30.58 32.80 37.70
16 5.14 5.81 6.91 7.96 9.31 11.91 15.34 19.37 23.54 26.30 28.85 32.00 34.27 39.25
17 5.70 6.41 7.56 8.67 10.09 12.79 16.34 20.49 24.77 27.59 30.19 33.41 35.72 40.79
18 6.26 7.01 8.23 9.39 10.86 13.68 17.34 21.60 25.99 28.87 31.53 34.81 37.16 42.31
19 6.84 7.63 8.91 10.12 11.65 14.56 18.34 22.72 27.20 30.14 32.85 36.19 38.58 43.82
20 7.43 8.26 9.59 10.85 12.44 15.45 19.34 23.83 28.41 31.41 34.17 37.57 40.00 45.32

21 8.03 8.90 10.28 11.59 13.24 16.34 20.34 24.93 29.62 32.67 35.48 38.93 41.40 46.80
22 8.64 9.54 10.98 12.34 14.04 17.24 21.34 26.04 30.81 33.92 36.78 40.29 42.80 48.27
23 9.26 10.20 11.69 13.09 14.85 18.14 22.34 27.14 32.01 35.17 38.08 41.64 44.18 49.73
24 9.89 10.86 12.40 13.85 15.66 19.04 23.34 28.24 33.20 36.42 39.36 42.98 45.56 51.18
25 10.52 11.52 13.12 14.61 16.47 19.94 24.34 29.34 34.38 37.65 40.65 44.31 46.93 52.62
26 11.16 12.20 13.84 15.38 17.29 20.84 25.34 30.43 35.56 38.89 41.92 45.64 48.29 54.05
27 11.81 12.88 14.57 16.15 18.11 21.75 26.34 31.53 36.74 40.11 43.19 46.96 49.64 55.48
28 12.46 13.56 15.31 16.93 18.94 22.66 27.34 32.62 37.92 41.34 44.46 48.28 50.99 56.89
29 13.12 14.26 16.05 17.71 19.77 23.57 28.34 33.71 39.09 42.56 45.72 49.59 52.34 58.30
30 13.79 14.95 16.79 18.49 20.60 24.48 29.34 34.80 40.26 43.77 46.98 50.89 53.67 59.70
40 20.71 22.16 24.43 26.51 29.05 33.66 39.34 45.62 51.81 55.76 59.34 63.69 66.77 73.40
50 27.99 29.71 32.36 34.76 37.69 42.94 49.33 56.33 63.17 67.50 71.42 76.15 79.49 86.66
60 35.53 37.48 40.48 43.19 46.46 52.29 59.33 66.98 74.40 79.08 83.30 88.38 91.95 99.61
70 43.28 45.44 48.76 51.74 55.33 61.70 69.33 77.58 85.53 90.53 95.02 100.42 104.22 112.32
80 51.17 53.54 57.15 60.39 64.28 71.14 79.33 88.13 96.58 101.88 106.63 112.33 116.32 124.84
90 59.20 61.75 65.65 69.13 73.29 80.62 89.33 98.64 107.56 113.14 118.14 124.12 128.30 137.21
100 67.33 70.06 74.22 77.93 82.36 90.13 99.33 109.14 118.50 124.34 129.56 135.81 140.17 149.45
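
Entries of Table A.7 follow the same convention: the subscript a on χ2 is the right-tail probability, so χ2_{a,ν} is the chi-square quantile at probability 1 - a. A minimal sketch, assuming SciPy:

# Percentage point of the chi-square distribution with nu degrees of freedom
from scipy.stats import chi2

a, nu = 0.05, 10
x_crit = chi2.ppf(1 - a, df=nu)   # inverse CDF at 1 - a
print(round(x_crit, 2))           # 18.31, matching row nu = 10, column chi2_.05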


Table 10.15: A.8. Critical points for α = 0.05 of the F-distribution with
numerator degrees of freedom df1 and denominator degrees of freedom df2.

df1
df2 1 2 3 4 5 6 7 8 9 10 12 15 20
1 161.4 199.5 215.7 224.6 230.2 234.0 236.8 238.9 240.5 241.9 243.9 245.9 248.0
2 18.51 19.00 19.16 19.25 19.30 19.33 19.35 19.37 19.38 19.40 19.41 19.43 19.45
3 10.13 9.55 9.28 9.12 9.01 8.94 8.89 8.85 8.81 8.79 8.74 8.70 8.66
4 7.71 6.94 6.59 6.39 6.26 6.16 6.09 6.04 6.00 5.96 5.91 5.86 5.80
5 6.61 5.79 5.41 5.19 5.05 4.95 4.88 4.82 4.77 4.74 4.68 4.62 4.56

6 5.99 5.14 4.76 4.53 4.39 4.28 4.21 4.15 4.10 4.06 4.00 3.94 3.87
7 5.59 4.74 4.35 4.12 3.97 3.87 3.79 3.73 3.68 3.64 3.57 3.51 3.44
8 5.32 4.46 4.07 3.84 3.69 3.58 3.50 3.44 3.39 3.35 3.28 3.22 3.15
9 5.12 4.26 3.86 3.63 3.48 3.37 3.29 3.23 3.18 3.14 3.07 3.01 2.94
10 4.96 4.10 3.71 3.48 3.33 3.22 3.14 3.07 3.02 2.98 2.91 2.85 2.77
11 4.84 3.98 3.59 3.36 3.20 3.09 3.01 2.95 2.90 2.85 2.79 2.72 2.65
12 4.75 3.89 3.49 3.26 3.11 3.00 2.91 2.85 2.80 2.75 2.69 2.62 2.54
13 4.67 3.81 3.41 3.18 3.03 2.92 2.83 2.77 2.71 2.67 2.60 2.53 2.46
14 4.60 3.74 3.34 3.11 2.96 2.85 2.76 2.70 2.65 2.60 2.53 2.46 2.39
15 4.54 3.68 3.29 3.06 2.90 2.79 2.71 2.64 2.59 2.54 2.48 2.40 2.33
16 4.49 3.63 3.24 3.01 2.85 2.74 2.66 2.59 2.54 2.49 2.42 2.35 2.28
17 4.45 3.59 3.20 2.96 2.81 2.70 2.61 2.55 2.49 2.45 2.38 2.31 2.23
18 4.41 3.55 3.16 2.93 2.77 2.66 2.58 2.51 2.46 2.41 2.34 2.27 2.19

19 4.38 3.52 3.13 2.90 2.74 2.63 2.54 2.48 2.42 2.38 2.31 2.23 2.16
20 4.35 3.49 3.10 2.87 2.71 2.60 2.51 2.45 2.39 2.35 2.28 2.20 2.12
21 4.32 3.47 3.07 2.84 2.68 2.57 2.49 2.42 2.37 2.32 2.25 2.18 2.10
22 4.30 3.44 3.05 2.82 2.66 2.55 2.46 2.40 2.34 2.30 2.23 2.15 2.07
23 4.28 3.42 3.03 2.80 2.64 2.53 2.44 2.37 2.32 2.27 2.20 2.13 2.05
24 4.26 3.40 3.01 2.78 2.62 2.51 2.42 2.36 2.30 2.25 2.18 2.11 2.03
25 4.24 3.39 2.99 2.76 2.60 2.49 2.40 2.34 2.28 2.24 2.16 2.09 2.01
26 4.23 3.37 2.98 2.74 2.59 2.47 2.39 2.32 2.27 2.22 2.15 2.07 1.99
27 4.21 3.35 2.96 2.73 2.57 2.46 2.37 2.31 2.25 2.20 2.13 2.06 1.97
28 4.20 3.34 2.95 2.71 2.56 2.45 2.36 2.29 2.24 2.19 2.12 2.04 1.96
29 4.18 3.33 2.93 2.70 2.55 2.43 2.35 2.28 2.22 2.18 2.10 2.03 1.94
30 4.17 3.32 2.92 2.69 2.53 2.42 2.33 2.27 2.21 2.16 2.09 2.01 1.93
40 4.08 3.23 2.84 2.61 2.45 2.34 2.25 2.18 2.12 2.08 2.00 1.92 1.84
60 4.00 3.15 2.76 2.53 2.37 2.25 2.17 2.10 2.04 1.99 1.92 1.84 1.75
120 3.92 3.07 2.68 2.45 2.29 2.18 2.09 2.02 1.96 1.91 1.83 1.75 1.66
1000 3.85 3.00 2.61 2.38 2.22 2.11 2.02 1.95 1.89 1.84 1.76 1.68 1.58
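
Table A.8 can likewise be reproduced programmatically, which is convenient when the required degrees of freedom are not listed. A minimal sketch, assuming SciPy:

# Upper 5% critical point of the F-distribution with (df1, df2) degrees of freedom
from scipy.stats import f

alpha, df1, df2 = 0.05, 7, 60
f_crit = f.ppf(1 - alpha, dfn=df1, dfd=df2)   # inverse CDF at 1 - alpha
print(round(f_crit, 2))                       # 2.17, matching row df2 = 60, column df1 = 7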
