
Foundations of Data Science

Applied Statistics and Probability with Python

Lecture Note

Dr. Md Rezaul Karim
Professor
Department of Statistics and Data Science
Jahangirnagar University, Savar, Bangladesh

‘Are those who know equal to those who do not know?’ Only they will
remember [who are] people of understanding (Surah Al-Zumar (39:9),
Al-Quran).

Copyright © 2025 Dr. Md Rezaul Karim


Preface

In today’s data-driven world, the ability to analyze and interpret data has
become a crucial skill across various disciplines. As we navigate through an
abundance of information, the principles of statistics and probability serve as
the bedrock upon which data science stands. This book, Foundations of Data
Science: Applied Statistics and Probability with Python, aims to bridge the gap
between theory and practice, providing readers with the tools they need to
harness the power of data effectively.

The journey of data science can be both exhilarating and overwhelming.


This book is designed for aspiring data scientists, students, and professionals
who wish to develop a robust understanding of statistical methods and their
application in Python. We will explore key concepts such as descriptive statis-
tics, inferential statistics, hypothesis testing, and probability theory, all while
emphasizing practical applications.

Each chapter is structured to introduce fundamental concepts, followed by


Python code examples and exercises that reinforce learning. My goal is to
make complex ideas accessible and engaging, allowing readers to build their
confidence as they apply these techniques to real-world datasets.

I would like to extend my gratitude to the many educators, practitioners,


and learners who inspired this work. Your passion for data science fuels my
own. I hope this book serves as a valuable resource on your journey, encourag-
ing curiosity and fostering a deeper understanding of the fascinating world of
data.

Whether you are a beginner or someone looking to refine your skills, I invite
you to dive in and explore the foundations that will empower you in your data
science endeavors.

Happy learning!

Prof. Dr. Md Rezaul Karim


May 21, 2025

Table of Contents

1 Introduction to Data Science 1
1.1 Welcome to Data Science . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Key Components of Data Science . . . . . . . . . . . . . . . . . . 2
1.3 Concepts of Statistics . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3.1 Population and Sample . . . . . . . . . . . . . . . . . . . 6
1.3.2 Census and Sample Survey . . . . . . . . . . . . . . . . 7
1.3.3 Parameters and Statistic . . . . . . . . . . . . . . . . . 8
1.3.4 Types of Statistics . . . . . . . . . . . . . . . . . . . . . 9
1.4 What is Data? . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4.1 Levels of Measurement . . . . . . . . . . . . . . . . . . . . 12
1.4.2 Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.5 Scope of Applied Statistics . . . . . . . . . . . . . . . . . . . . . 17
1.6 Statistical Methods in Data Science . . . . . . . . . . . . . . . . 17
1.7 Overview of Data Science Workflow . . . . . . . . . . . . . . . . 19
1.8 Popular Statistical Analysis Tools . . . . . . . . . . . . . . . . . . 20
1.9 Why Choose Python for This Book? . . . . . . . . . . . . . . . . 20
1.10 Getting Started with Python . . . . . . . . . . . . . . . . . . . . 22

1.11 Structure of the Book . . . . . . . . . . . . . . . . . . . . . . . . 22


1.12 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . 23
1.13 Chapter Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . 23

2 Data Exploration: Tabular and Graphical Displays 25


2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2 Tabular Displays . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.3 Graphical Displays . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4 Summarizing Qualitative Data . . . . . . . . . . . . . . . . . . . 27
2.4.1 Bar Chart . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.4.2 Python Code: Bar Chart . . . . . . . . . . . . . . . . . . 28
2.4.3 Pie Chart . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.4.4 Python Code: Pie Chart . . . . . . . . . . . . . . . . . . . 30
2.5 Summarizing Quantitative Data . . . . . . . . . . . . . . . . . . . 32
2.5.1 Constructing a Frequency Distribution Table . . . . . . . 33
2.5.2 Python Code: Frequency Distribution . . . . . . . . . . . 35


2.5.3 Frequency Polygon . . . . . . . . . . . . . . . . . . . . . . 36


2.5.4 Python Code: Frequency Polygon . . . . . . . . . . . . . 37
2.5.5 Ogive Curve . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.5.6 Python Code: Ogive . . . . . . . . . . . . . . . . . . . . . 39
2.5.7 Histogram . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.5.8 Python Code: Histogram . . . . . . . . . . . . . . . . . . 42
2.5.9 Stem-and-Leaf Plot . . . . . . . . . . . . . . . . . . . . . 42
2.5.10 Python Code: Stem-and-leaf . . . . . . . . . . . . . . . . 44
2.6 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.7 Chapter Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . 46

3 Data Exploration: Numerical Measures 49

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.2 Measures of Central Tendency . . . . . . . . . . . . . . . . . . . . 49
3.2.1 Arithmetic Mean . . . . . . . . . . . . . . . . . . . . . . . 50
3.2.2 Advantages and Disadvantages of Arithmetic Mean . . . . 50
3.2.3 Harmonic Mean . . . . . . . . . . . . . . . . . . . . . . . 52
Advantages and Disadvantages of Harmonic Mean . . . . 52
3.2.4 Geometric Mean . . . . . . . . . . . . . . . . . . . . . . 53
Advantages and Disadvantages of Geometric Mean . . . . 54
3.2.5 Relationships Between Arithmetic Mean, Geometric Mean,
and Harmonic Mean . . . . . . . . . . . . . . . . . . . . 54


3.2.6 Median . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.2.7 Advantages and Disadvantages of Median . . . . . . . . . 59
3.2.8 Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.2.9 Advantages and Disadvantages of Mode . . . . . . . . . . 60
3.2.10 Choosing the Ideal Measure of Central Tendency . . . . . 61
3.2.11 Weighted Mean . . . . . . . . . . . . . . . . . . . . . . . . 61
3.2.12 Measures of Central Tendency for Grouped Data . . . . . 64

3.2.13 Python Code: Mean, Median and Mode . . . . . . . . . . 66


3.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.4 Measures of Dispersion or Variability . . . . . . . . . . . . . . . . 69
3.4.1 Range . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.4.2 Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.4.3 Standard Deviation . . . . . . . . . . . . . . . . . . . . . 71
3.4.4 Measures of Variability for Grouped Data . . . . . . . . . 72
3.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.6 Measures of Distribution Shape . . . . . . . . . . . . . . . . . . . 75
3.6.1 Skewness . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
3.6.2 Kurtosis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.6.3 Coefficient of Variation . . . . . . . . . . . . . . . . . . . 80
3.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
3.8 Quartiles, Percentiles, Deciles and Outlier Detection . . . . . . . 82
3.8.1 Quartiles . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
3.8.2 Percentiles . . . . . . . . . . . . . . . . . . . . . . . . . . 84


3.8.3 Deciles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
3.8.4 Interquartile Range (IQR) . . . . . . . . . . . . . . . . . . 86
3.8.5 Outlier Detection . . . . . . . . . . . . . . . . . . . . . . . 86
3.8.6 Python Code: Dispersion Measures . . . . . . . . . . . . . 88
3.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
3.10 Five-Number Summary and Boxplot . . . . . . . . . . . . . . . . 91
3.10.1 Five-Number Summary . . . . . . . . . . . . . . . . . . . 91
3.10.2 Boxplot . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
3.10.3 Importance of Boxplots . . . . . . . . . . . . . . . . . . . 93
3.10.4 Python Code: Boxplot . . . . . . . . . . . . . . . . . . . . 97
3.11 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

3.12 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . 99
3.13 Chapter Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . 100

4 Introduction to Probability 105


4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
4.2 Basic Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
4.2.1 Experiment . . . . . . . . . . . . . . . . . . . . . . . . . 106
4.2.2 Random Experiment . . . . . . . . . . . . . . . . . . . . 106
4.2.3 Sample Space and Events . . . . . . . . . . . . . . . . . 107
4.3 Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
4.3.1 Union of Events . . . . . . . . . . . . . . . . . . . . . . . 111
4.3.2 Intersection of Events . . . . . . . . . . . . . . . . . . . . 111
4.3.3 Complementary Event . . . . . . . . . . . . . . . . . . . . 112
4.3.4 Equally Likely Events . . . . . . . . . . . . . . . . . . . . 113
4.3.5 Mutually Exclusive Events . . . . . . . . . . . . . . . . . 114
4.3.6 Probability Axioms . . . . . . . . . . . . . . . . . . . . . . 114
4.4 Types of Probability . . . . . . . . . . . . . . . . . . . . . . . . . 116
4.4.1 Classical (Theoretical) Probability . . . . . . . . . . . . . 116

4.4.2 Experimental (Empirical) Probability . . . . . . . . . . . 117


4.4.3 Subjective Approach . . . . . . . . . . . . . . . . . . . . . 118
4.5 Joint and Marginal Probabilities . . . . . . . . . . . . . . . . . . 119
4.6 Conditional Probability . . . . . . . . . . . . . . . . . . . . . . . 120
4.6.1 Probabilities Computation from Contingency Table . . . . 122
4.6.2 Independent Events . . . . . . . . . . . . . . . . . . . . . 125
4.7 Posterior Probabilities . . . . . . . . . . . . . . . . . . . . . . . . 129
4.7.1 Law of Total Probability . . . . . . . . . . . . . . . . . . . 129
4.7.2 Total Probability with Multiple Conditions . . . . . . . . 132
4.7.3 Bayes’ Theorem . . . . . . . . . . . . . . . . . . . . . . . 135
4.8 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . 139
4.9 Chapter Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . 139


5 Random Variable and Its Properties 145


5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
5.2 Random Variable . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
5.3 Discrete Random Variables . . . . . . . . . . . . . . . . . . . . . 147
5.3.1 Probability Mass Function (pmf) . . . . . . . . . . . . . . 148
5.3.2 Cumulative Distribution Function (cdf) . . . . . . . . . . 149
5.3.3 Properties of the Cumulative Distribution Function . . . 150
5.3.4 Exercise . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
5.4 Continuous Random Variables . . . . . . . . . . . . . . . . . . . . 157
5.4.1 Probability Density Function (pdf) . . . . . . . . . . . . . 158
5.4.2 Cumulative Distribution Function (cdf) . . . . . . . . . . 162

5.4.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
5.5 The Expectation of a Random Variable . . . . . . . . . . . . . . 168
5.5.1 Example: Testing Electronic Components . . . . . . . . . 169
5.5.2 Example: Metal Cylinder Production . . . . . . . . . . . 171
5.5.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
5.6 The Variance of a Random Variable . . . . . . . . . . . . . . . . 172
5.6.1 Example: Metal Cylinder Production . . . . . . . . . . . 175
5.6.2 Chebyshev’s Inequality . . . . . . . . . . . . . . . . . . . 176
5.6.3 Example: Blood Pressure Measurement . . . . . . . . . . 176
5.6.4 Example: Employee Salaries . . . . . . . . . . . . . . . . 177
5.6.5 Quantiles of Random Variables . . . . . . . . . . . . . . . 177
5.6.6 Example: Metal Cylinder Production . . . . . . . . . . . 178
5.6.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
5.7 Essential Generating Functions . . . . . . . . . . . . . . . . . . . 181
5.7.1 Moment Generating Function . . . . . . . . . . . . . . . . 182
5.7.2 Key Properties of MGF . . . . . . . . . . . . . . . . . . . 182
5.7.3 Probability Generating Function (PGF) . . . . . . . . . . 185
5.7.4 Characteristic Function (CF) . . . . . . . . . . . . . . . . 189

5.7.5 Key Properties of Characteristic Functions . . . . . . . . 189


5.7.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
5.8 Jointly Distributed Random Variables . . . . . . . . . . . . . . . 196
5.8.1 Joint Probability Mass Function (pmf) . . . . . . . . . . . 196
5.8.2 Example: Computer Maintenance . . . . . . . . . . . . . 197
5.8.3 Joint Probability Density Function (pdf) . . . . . . . . . 198
5.8.4 Example: Mineral Deposits . . . . . . . . . . . . . . . . . 198
5.8.5 Marginal Distributions . . . . . . . . . . . . . . . . . . . . 199
5.8.6 Example: Computer Maintenance . . . . . . . . . . . . . 200
5.8.7 Example: Mineral Deposits . . . . . . . . . . . . . . . . . 201
5.8.8 Conditional Distributions . . . . . . . . . . . . . . . . . . 203
5.8.9 Example: Computer Maintenance . . . . . . . . . . . . . 203
5.8.10 Example: Mineral Deposits . . . . . . . . . . . . . . . . . 204
5.8.11 Independence and Covariance . . . . . . . . . . . . . . . . 204
5.8.12 Covariance and Correlation . . . . . . . . . . . . . . . . . 207
5.8.13 Linear Functions of a Random Variable . . . . . . . . . . 212


5.8.14 Linear Combinations of Random Variables . . . . . . . . . 216


5.8.15 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
5.9 Python Functions for Statistical Distributions . . . . . . . . . . . 221
5.10 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . 222
5.11 Chapter Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . 223

6 Some Discrete Probability Distributions 225


6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
6.2 Bernoulli Distribution . . . . . . . . . . . . . . . . . . . . . . . . 225
6.2.1 Expected Value (Mean) . . . . . . . . . . . . . . . . . . . 226
6.2.2 Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
6.2.3 Moment Generating Function (MGF) . . . . . . . . . . . 230

6.2.4 Characteristic Function . . . . . . . . . . . . . . . . . . . 230
6.2.5 Probability Generating Function . . . . . . . . . . . . . . 230
6.2.6 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
6.2.7 Applications . . . . . . . . . . . . . . . . . . . . . . . . . 232
6.2.8 Python Code for Bernoulli Distribution . . . . . . . . . . 233
6.2.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . 234
6.3 Binomial Distribution . . . . . . . . . . . . . . . . . . . . . . . 236
6.3.1 Expected Value . . . . . . . . . . . . . . . . . . . . . . . 238
6.3.2 Variance and Standard Deviation . . . . . . . . . . . . . 239
6.3.3 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
6.3.4 Python Code for Binomial Distribution . . . . . . . . . . 243
6.3.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
6.4 Poisson Distribution . . . . . . . . . . . . . . . . . . . . . . . . . 246
6.4.1 Expected Value . . . . . . . . . . . . . . . . . . . . . . . . 251
6.4.2 Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
6.4.3 Moment Generating Function . . . . . . . . . . . . . . . . 253
6.4.4 Characteristic Function . . . . . . . . . . . . . . . . . . . 253

6.4.5 Approximation of Binomial Distribution Using Poisson


Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 253
6.4.6 Python Code for Poisson Distribution . . . . . . . . . . . 256
6.4.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
6.4.8 Discrete Uniform Distribution . . . . . . . . . . . . . . . . 260
6.4.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 264
6.5 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . 265
6.6 Chapter Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . 265

7 Some Continuous Probability Distributions 267


7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
7.2 Continuous Uniform Distribution . . . . . . . . . . . . . . . . . . 268
7.2.1 Distributional Properties . . . . . . . . . . . . . . . . . . 269
7.2.2 Python Code for Uniform Distribution Characteristics . . 272
7.2.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
7.3 Exponential Distribution . . . . . . . . . . . . . . . . . . . . . . . 274


7.3.1 Properties of the Exponential Distribution . . . . . . . . . 280


7.3.2 Memoryless Property . . . . . . . . . . . . . . . . . . . . 282
Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
7.3.3 Python Code for Exponential Distribution Characteristics 285
7.3.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 288
7.4 Normal Distribution . . . . . . . . . . . . . . . . . . . . . . . . . 290
7.4.1 Definition of the Normal Distribution . . . . . . . . . . . 290
7.4.2 Properties of the Normal Distribution . . . . . . . . . . . 293
7.4.3 Standard Normal Distribution . . . . . . . . . . . . . . . 296
7.4.4 Finding the Probability P (a ≤ X ≤ b) . . . . . . . . . . 298
7.4.5 Central Limit Theorem . . . . . . . . . . . . . . . . . . . 304

7.4.6 Python Code for Normal Distribution Characteristics . . 307
7.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
7.6 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . 311
7.7 Chapter Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . 312

8 Confidence Interval Estimation 315


8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
8.2 Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . . . 315
8.3 Confidence Intervals for the Population Mean . . . . . . . . . . 316
8.4 Confidence Intervals for Variances and Standard Deviations . . 329
8.4.1 Confidence Interval for Variance . . . . . . . . . . . . . . 330
8.4.2 Confidence Interval for Standard Deviation . . . . . . . . 330
8.5 Confidence Intervals for Population Proportions . . . . . . . . . . 334
8.6 Sample Size Estimation . . . . . . . . . . . . . . . . . . . . . . . 338
8.6.1 Sample Size for Estimating a Population Mean . . . . . . 338
8.6.2 Sample Size for Estimating a Population Proportion . . . 342
8.6.3 Sample Size Estimation for Finite Populations . . . . . . 343
8.7 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . 344

8.8 Chapter Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . 345

9 Hypothesis Testing for Decision Making 348


9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348
9.2 Concepts of Hypothesis Testing . . . . . . . . . . . . . . . . . . . 349
9.3 Steps for Hypothesis Testing . . . . . . . . . . . . . . . . . . . . 351
9.3.1 Formulating Hypotheses . . . . . . . . . . . . . . . . . . . 351
9.3.2 Level of Significance . . . . . . . . . . . . . . . . . . . . . 353
9.3.3 Test Statistics . . . . . . . . . . . . . . . . . . . . . . . . . 354
9.3.4 Acceptance and Rejection Regions . . . . . . . . . . . . . 354
9.3.5 Decision Rules . . . . . . . . . . . . . . . . . . . . . . . . 357
9.4 The p-value . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357
9.5 Why is hypothesis testing so important? . . . . . . . . . . . . . . 358
9.6 Hypothesis Testing for Means . . . . . . . . . . . . . . . . . . . . 359
9.6.1 One-Sample Test of Means . . . . . . . . . . . . . . . . . 359
9.6.2 Testing Equality of Two Means . . . . . . . . . . . . . . . 369


9.6.3 Independent Samples T -Test . . . . . . . . . . . . . . . . 369


1. Equal Variances (Pooled T -Test) . . . . . . . . . . . . 370
2. Unequal Variances (Welch’s T -Test) . . . . . . . . . . . 371
9.6.4 Paired T -Test . . . . . . . . . . . . . . . . . . . . . . . . . 373
9.7 Testing Equality of Several Means . . . . . . . . . . . . . . . . . 376
9.7.1 Analysis of Variance (ANOVA) . . . . . . . . . . . . . . . 376
9.8 Power of the Test . . . . . . . . . . . . . . . . . . . . . . . . . . . 383
9.9 Sample Size Estimation for the Mean Test . . . . . . . . . . . . . 386
9.9.1 When Testing for the Mean of a Normal Distribution
(One-Sided Alternative) . . . . . . . . . . . . . . . . . . . 386
9.9.2 Sample Size Estimation When Testing for the Mean of a

Normal Distribution (Two-Sided Alternative) . . . . . . . 387
9.10 Single Proportion Test . . . . . . . . . . . . . . . . . . . . . . . . 388
9.10.1 Sample Size Estimation for Proportion Test . . . . . . . . 390
9.11 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . 391
9.12 Chapter Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . 392
10 Correlation and Regression Analysis 396

10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 396
10.2 Scatter Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . 397
10.3 Python Code: Scatter diagram . . . . . . . . . . . . . . . . . . 399
10.4 Covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 400
10.5 Correlation Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 404
10.6 Pearson’s Correlation Coefficient . . . . . . . . . . . . . . . . . . 404
10.6.1 Interpretation of the value of Correlation Coefficient . . . 405
10.6.2 Properties of the Correlation Coefficient . . . . . . . . . . 407
10.6.3 Testing the Significance of the Correlation Coefficient . . 410
10.6.4 Python Code: Correlation Matrix . . . . . . . . . . . . . 411
10.7 Rank Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . 412

10.7.1 Key Types of Rank Correlation . . . . . . . . . . . . . . . 412


10.7.2 Applications of Rank Correlation . . . . . . . . . . . . . . 415
10.7.3 Python Code: Rank Correlation . . . . . . . . . . . . . . 415
10.7.4 Kendall Tau Correlation Coefficient . . . . . . . . . . . . 416
10.7.5 Advantages and Disadvantages . . . . . . . . . . . . . . . 418
10.7.6 Python Code: Kendall Tau . . . . . . . . . . . . . . . . . 419
10.7.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 419
10.8 Regression Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 423
10.8.1 Types of regression analysis . . . . . . . . . . . . . . . . . 425
10.8.2 Simple Regression Model . . . . . . . . . . . . . . . . . . 426
10.8.3 Assumptions of the CSLR Model (10.6) . . . . . . . . . . 428
10.8.4 Ordinary Least Squares (OLS) Estimation . . . . . . . . . 429
10.8.5 Interpretation of Regression Coefficients . . . . . . . . . . 431
10.8.6 The Estimated Error Variance or Standard Error . . . . . 433
10.8.7 Coefficient of Determination . . . . . . . . . . . . . . . . . 438
10.8.8 Relationship between R2 and rxy . . . . . . . . . . . . . 439


10.8.9 Advantages and Disadvantages of R2 . . . . . . . . . . . . 440


10.8.10 Adjusted R2 . . . . . . . . . . . . . . . . . . . . . . . . . 440
10.8.11 Python Code: Simple Regression Analysis . . . . . . . . . 441
10.8.12 Interval Estimation and Hypothesis Testing . . . . . . . . 443
Confidence Interval for β0 . . . . . . . . . . . . . . . . . . 443
Confidence Interval for β1 . . . . . . . . . . . . . . . . . . 443
10.8.13 The F -tests in Simple Linear Regression Model . . . . . . 444
Decision Rule for ANOVA F -test . . . . . . . . . . . . . . 445
10.8.14 The t-tests in Simple Linear Regression Model . . . . . . 445
Decision Rule for t-test . . . . . . . . . . . . . . . . . . . 446

Confidence Interval for E(Y |X = x) . . . . . . . . . . . . 447
10.8.15 Python Code: Linear Regression Model . . . . . . . . . . 449
10.8.16 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 450
10.9 Multiple Linear Regression . . . . . . . . . . . . . . . . . . . . . 454
10.9.1 Model Assumptions . . . . . . . . . . . . . . . . . . . . . 455
10.9.2 Estimation Procedure . . . . . . . . . . . . . . . . . . . . 455
10.9.3 Estimation Procedure of Error Variance . . . . . . . . . 456
10.9.4 Mean of the OLS Estimator . . . . . . . . . . . . . . . . 457
10.9.5 Variance of the OLS Estimator . . . . . . . . . . . . . . 457
10.9.6 Coefficient of Determination . . . . . . . . . . . . . . . . 458
10.9.7 Adjusted R2 . . . . . . . . . . . . . . . . . . . . . . . . . 458
10.9.8 Example Dataset and Regression Calculations . . . . . . . 459
Goodness-of-fit R2 and Adjusted R2 . . . . . . . . . . . . 462
10.9.9 F -test in Multiple Regression . . . . . . . . . . . . . . . . 463
10.9.10 ANOVA Table in Regression Analysis . . . . . . . . . . . 463
10.9.11 The t-tests in Multiple Regression . . . . . . . . . . . . . 464
10.9.12 Python Code: Linear Regression Model . . . . . . . . . . 471
10.9.13 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 473

10.10Regression Model Diagnostics . . . . . . . . . . . . . . . . . . . . 474


10.10.1 Assumptions of Linear Regression . . . . . . . . . . . . . 475
10.10.2 Residual Plots . . . . . . . . . . . . . . . . . . . . . . . . 475
10.10.3 Formal Tests . . . . . . . . . . . . . . . . . . . . . . . . . 480
10.10.4 Test for Autocorrelation . . . . . . . . . . . . . . . . . . . 483
10.10.5 Tests for Non-Constancy of Variance . . . . . . . . . . . . 484
10.10.6 Influential Observations, Outliers, and Cook’s Distance . 490
10.10.7 Multicollinearity . . . . . . . . . . . . . . . . . . . . . . . 492
10.10.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 497
10.11Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . 498
10.12Chapter Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . 498

Chapter 1

Introduction to Data Science

1.1 Welcome to Data Science
Welcome to the exciting world of data science, where numbers tell hidden stories
and reveal valuable insights! In today’s digital era, data is incredibly powerful,
and those who can understand and use it have the key to endless opportu-
nities. Data science is all about extracting meaningful information from vast
amounts of data to make informed decisions, solve complex problems, and drive
innovation.

Data Science: An interdisciplinary field that employs scientific methods,
processes, algorithms, and systems to extract knowledge and insights from
both structured and unstructured data. It combines various disciplines,
including statistics, probability, machine learning, data engineering, and
domain-specific expertise, to understand and analyze data.

In the era of big data and advanced analytics, data science has become a
crucial field in both academia and industry. At the heart of data science is the
ability to extract meaningful insights from data, a process that heavily relies on
statistical methods. To succeed in this field, a strong foundation in statistics is
essential. Statistics provides the tools and techniques needed to analyze data,
identify patterns, and draw reliable conclusions. Without this knowledge, it’s
challenging to make sense of the data and leverage its full potential.

In this chapter, we will introduce the foundational concepts of applied statistics
that are essential for data scientists. We will cover basic statistical termi-
nology, explore different types of data, and discuss various statistical methods
commonly used in data science. But data science is more than just numbers
and statistics. It involves programming, data engineering, machine learning,


and data visualization. These skills allow data scientists to collect, clean, and
process data, build predictive models, and present their findings in a clear and
compelling way.

As we embark on this journey, we will dive deep into the statistical founda-
tions that every aspiring data scientist needs to master. We will also explore
how these principles are applied in real-world scenarios, from predicting cus-
tomer behavior to identifying trends in healthcare. So, buckle up and get ready
to unlock the secrets of data science!

1.2 Key Components of Data Science

Data Science is made up of several interrelated components, each playing a
crucial role in transforming raw data into actionable insights. Below are some
of the most important components:

Statistics and Probability


Statistics and probability form the mathematical foundation of Data Science,
providing essential tools for analyzing data, making inferences, and building pre-
dictive models. Statistics helps summarize and describe data through measures
such as mean, median, variance, and standard deviation, while also enabling hy-
pothesis testing and confidence interval estimation. Probability theory is used
to model uncertainty and assess the likelihood of different outcomes, which is es-
pecially crucial in decision-making and risk analysis. Together, these disciplines
support key Data Science tasks such as data interpretation, model evaluation,
and the development of algorithms in machine learning. Tools like R, Python
(with libraries such as NumPy, SciPy, and Statsmodels), SPSS and SAS are
commonly used for statistical analysis.
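
To give a concrete flavor of these tools, here is a minimal sketch (with made-up
values, not data from this book) that computes some of the summary measures
mentioned above using NumPy:

# A minimal sketch: summary measures with NumPy (illustrative values)
import numpy as np

data = np.array([7, 8, 6, 9, 7, 5, 8, 7, 6, 9])

print("Mean:", np.mean(data))
print("Median:", np.median(data))
print("Sample variance:", np.var(data, ddof=1))        # ddof=1 for sample variance
print("Sample standard deviation:", np.std(data, ddof=1))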

Machine Learning
Machine learning, a subset of artificial intelligence, plays a central role in Data
Science by enabling computers to learn from data and make predictions or
decisions without explicit programming. It involves training models on histor-
ical data to detect patterns and generate accurate forecasts or classifications.
Machine learning is widely applied in areas such as recommendation systems,
fraud detection, image recognition, and natural language processing. Common
approaches include supervised learning, unsupervised learning, and reinforce-
ment learning. Popular tools and libraries used to build and deploy models
include Scikit-learn, TensorFlow, Keras, and PyTorch.
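
As a minimal sketch of the supervised-learning workflow just described (using
Scikit-learn's bundled iris dataset purely for illustration), a model is trained on
historical data and then evaluated on unseen data:

# Train a simple classifier and evaluate it on held-out data
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)                  # learn patterns from training data
print("Test accuracy:", model.score(X_test, y_test))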


Deep learning
Deep learning is an advanced subset of machine learning that uses artificial
neural networks with multiple layers to model and understand complex pat-
terns in large volumes of data. It excels at tasks where traditional algorithms
struggle, such as image and speech recognition, natural language processing,
and autonomous systems. Deep learning models, such as convolutional neural
networks (CNNs) and recurrent neural networks (RNNs), learn hierarchical fea-
tures directly from raw data without the need for manual feature extraction.
These models require significant computational power and large datasets to per-
form effectively. Popular frameworks for developing deep learning applications
include TensorFlow, Keras, and PyTorch.

Data Engineering
Data Engineering involves the design, development, and maintenance of systems
and architectures that enable the collection, storage, and processing of large
datasets. Data engineers build data pipelines that move data from various
sources to storage and analytics platforms, ensuring it is clean, reliable, and
accessible for analysis. They work with tools and technologies like SQL, Apache
Spark, Hadoop, Airflow, and cloud services (e.g., AWS, Azure, GCP) to handle
big data efficiently. Data Engineering lays the foundation for data analysis and
machine learning by making high-quality data available to data scientists and
analysts.

Data Visualization
Data visualization is a critical aspect of Data Science that involves represent-
ing data and analytical results through visual formats such as charts, graphs,
maps, and dashboards. It helps transform complex datasets into easily under-
standable insights, enabling quicker interpretation and more informed decision-
making. Effective data visualization allows analysts and stakeholders to iden-
tify patterns, trends, and outliers that might not be immediately evident in raw
data. It is widely used in business intelligence, reporting, and exploratory data
analysis. Common tools and libraries include Tableau, Power BI, Matplotlib,
Seaborn, and Plotly, each offering powerful capabilities for creating both static
and interactive visualizations.
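
As a small taste of what Chapter 2 covers in detail, the following sketch (with
hypothetical survey counts) draws a simple bar chart using Matplotlib:

# A minimal sketch: a bar chart with Matplotlib (hypothetical counts)
import matplotlib.pyplot as plt

categories = ["Pizza", "Pasta", "Salad"]
counts = [45, 30, 25]

plt.bar(categories, counts)
plt.title("Favorite Food (Hypothetical Survey)")
plt.xlabel("Food")
plt.ylabel("Number of respondents")
plt.show()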

Big Data Analytics


Big Data Analytics is a key component of Data Science that focuses on process-
ing and analyzing vast volumes of complex and fast-moving data. It enables
data scientists to uncover meaningful patterns, trends, and insights from large
datasets that traditional tools cannot handle efficiently. By leveraging technolo-
gies like Hadoop, Spark, and NoSQL databases, Big Data Analytics supports
real-time decision-making and predictive modeling. This component is essential


in industries where data is generated at high speed and scale, helping organi-
zations drive innovation, improve operations, and gain a competitive edge.

In this book, we focus on statistics and probability, emphasizing applied


statistical tools and applied probability with examples. The following chapters
will provide detailed explanations and practical applications.

1.3 Concepts of Statistics


Understanding the concepts of statistics is crucial for any data scientist. Statis-
tics provides the foundation for data analysis, enabling us to make sense of
data and draw meaningful conclusions. As a data scientist, you will frequently
encounter questions such as:

• How do we summarize large datasets?

• What can we infer from sample data about a larger population?


• How can we validate the accuracy of our models?

• What techniques can we use to identify patterns or anomalies in data?

By mastering statistical concepts, you will be better equipped to answer


these questions and make informed decisions based on data. Statistics plays a
crucial role in data science by providing tools and methods for:

• Summarizing and exploring data

• Designing experiments and surveys



• Making inferences and predictions

• Evaluating and validating models

In this section, we will cover some fundamental statistical concepts essential


for data science.

Statistics: The science of collecting data, organizing, summarizing, classifying,
comparing, and drawing inferences about a population.

Statistics encompasses both theoretical and applied aspects. The theoretical
side focuses on developing new statistical methods and theories. Applied
statistics involves the practical application of these methods to real-world prob-
lems and data, utilizing established statistical techniques to analyze data and
draw conclusions in various fields such as healthcare, business, engineering, and
social sciences.


Applied statistics forms the foundation of data science by collecting, organizing,
and analyzing data to understand larger populations. By applying
these methods, experts can predict trends and outcomes using models based
on real-world observations. This process enhances data quality through ex-
periments and sampling techniques and supports evidence-based conclusions.
Applied statistics is vital for fields like natural sciences, social sciences, public
policy, healthcare, engineering, and business, enabling informed decisions and
systematic approaches to complex challenges.

Example: Customer Satisfaction Survey


Imagine a company wants to understand the average customer satisfaction level
with its new product. Surveying every customer is impractical, so the company
surveys a sample of 500 customers out of its 50,000 customers.

Collecting Data
The company distributes a satisfaction survey to 500 randomly selected
customers, asking them to rate their satisfaction on a scale from 1 to 10.

Analyzing Data
Once the responses are collected, the company calculates the average satisfac-
tion score from the sample. Suppose the average satisfaction score from the 500
customers is 7.2.

Interpreting Data
Using statistical methods, the company estimates the average satisfaction score
for the entire population of 50,000 customers based on the sample. This involves
calculating confidence intervals to understand the range within which the true
average satisfaction score likely falls.

Presenting Data
The company creates a report with graphs and charts to visualize the distri-
bution of satisfaction scores, the average satisfaction score, and the confidence
interval.

Organizing Data
The company stores the survey data in a database, ensuring it is organized for
future analysis or reference.


Making Inferences
By analyzing the sample data, the company infers that the average satisfaction
score of all its customers is approximately 7.2, with a certain level of confidence
(e.g., 95% confidence interval).

Quantifying Uncertainty
To quantify the uncertainty of their estimate, the company calculates a confi-
dence interval. For instance, they might determine that they are 95% confident
that the true average satisfaction score of all customers is between 6.9 and 7.5.
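
The calculation above can be sketched in Python as follows. Since the actual
survey responses are not given here, the sketch simulates 500 ratings and then
computes the sample mean and a 95% t-based confidence interval with SciPy:

# A minimal sketch: sample mean and 95% confidence interval
# (simulated ratings stand in for the real survey data)
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
ratings = rng.integers(1, 11, size=500)    # 500 ratings on a 1-10 scale

mean = ratings.mean()
sem = stats.sem(ratings)                   # standard error of the mean
low, high = stats.t.interval(0.95, len(ratings) - 1, loc=mean, scale=sem)

print(f"Sample mean: {mean:.2f}")
print(f"95% confidence interval: ({low:.2f}, {high:.2f})")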

Through collecting, analyzing, interpreting, presenting, and organizing data
from a sample, the company uses statistics to make informed decisions about
customer satisfaction for the entire customer base. This approach helps the
company understand and quantify the uncertainty associated with their esti-
mate, enabling them to make more accurate and reliable business decisions.
1.3.1 Population and Sample
In statistics, it is often impractical or impossible to study an entire population.
Instead, we rely on samples. Understanding the difference between a population
and a sample is fundamental to conducting statistical analyses.

Population
The population refers to the entire group of individuals or instances about whom
we want to draw conclusions. It includes all possible observations or outcomes
that are of interest in a particular study or analysis.

Population: The entire set of individuals, items, or data points of interest
in a study or analysis.

Examples:
• If we are studying the prevalence of diabetes in adults aged 40-60 in
a country, the population would include all adults aged 40-60 in that
country. This would encompass every individual in that age range,
regardless of their health status, socioeconomic background, or other
characteristics.

• If we are studying the average height of adult men in a country, the


population would include all adult men in that country.


Sample
A sample is a subset of the population that is selected for the actual study. The
goal is to choose a sample that is representative of the population so that the
findings can be used to make inferences or generalizations about the popula-
tion.

Sample: A subset of the population selected for analysis to make inferences
about the entire population.

Examples:

• Continuing with the diabetes study, it might be impractical to test
every adult aged 40-60 in the country. Instead, researchers might select
a sample of 1,000 adults from this age group and measure their blood
sugar levels. The results from this sample can then be used to estimate
the prevalence of diabetes in the entire population of adults aged 40-60.

• If we cannot measure the height of all adult men in a country, we
might select a sample of 1000 adult men and measure their heights.
The results from this sample can then be used to estimate the average
height of the entire population.

By studying a sample, we can make inferences about the population. However,
it is crucial to use proper sampling techniques to ensure that the sample
accurately represents the population.
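
To make the population/sample distinction concrete, the following sketch (with
simulated heights, not real measurements) draws a random sample from a
simulated population and compares the two means:

# A minimal sketch: population mean vs. sample mean (simulated data)
import numpy as np

rng = np.random.default_rng(42)
population = rng.normal(loc=170, scale=7, size=50_000)   # heights in cm

sample = rng.choice(population, size=1_000, replace=False)

print(f"Population mean: {population.mean():.2f} cm")
print(f"Sample mean:     {sample.mean():.2f} cm")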

1.3.2 Census and Sample Survey


Census

The census is a comprehensive and periodic data collection process aimed


at gathering detailed demographic information about the entire population
of a country. In the context of Bangladesh, the census is conducted by the
Bangladesh Bureau of Statistics (BBS) and serves as a crucial tool for policy
planning, resource allocation, and development programs. The first population
census in Bangladesh was conducted in 1974, following the country’s indepen-
dence. Since then, the census has been carried out every ten years, with the
most recent one being the 2022 Population and Housing Census. According
to the 2022 census, Bangladesh has a population of 165,158,616 people, of whom
81,712,824 are male and 83,347,206 are female. Of these, 113,063,587 live in
rural areas and 52,009,072 live in urban areas.

Importance of Census
The census data is essential for:
• Informing government policies and development strategies.


• Allocating resources and planning public services at both national and


local levels.

• Providing a basis for the estimation of various demographic trends and


indicators.

Sample Surveys
Sample surveys are a method of collecting data from a subset of the population
to infer information about the entire population. In Bangladesh, sample surveys
are conducted for various purposes, including economic research, social policy
evaluation, and program assessment.

Types of Sample Surveys
In Bangladesh, different types of sample surveys are carried out by the Bangladesh
Bureau of Statistics (BBS) and other organizations. Some key surveys include:

• Labor Force Survey (LFS): Assesses employment, unemployment,
and labor market conditions.

• Multiple Indicator Cluster Survey (MICS): Provides data on
child health, education, and protection.

• Household Income and Expenditure Survey (HIES): Measures


household income, consumption patterns, and poverty levels.

• Demographic and Health Survey (DHS): Collects data on population
health, fertility, and mortality.

1.3.3 Parameters and Statistic



Parameter
A parameter is a measurable characteristic of the population which is a nu-
merical value that summarizes a characteristic of a population. It is a fixed
value, although its exact value is often unknown. Parameters describe the en-
tire population.

Parameter: A numerical characteristic or measure that describes a specific
aspect of a population, such as its mean, variance, or proportion.

Examples:
• Population Mean (µ): The average height of all adult men in a
country.

• Population Proportion (P ): The proportion of voters in a country


who support a particular candidate.


Statistic
A statistic is a measurable characteristic of the sample which is a numerical
value that summarizes a characteristic of a sample. It is used to estimate
the corresponding population parameter. Statistics can vary from sample to
sample.

Statistic: A measurable characteristic or quantity calculated from a sample,
which is used to estimate or infer the corresponding population parameter.

Examples:

• Sample Mean (x̄): The average height of 100 randomly selected adult
men from the population.

• Sample Proportion (p): The proportion of voters in a sample of


1,000 who support a particular candidate.
Parameters describe the entire population and are fixed values, whereas
statistics describe samples and can vary from sample to sample. Understanding
the distinction between these concepts is crucial in the field of statistics, as we
often use sample statistics to make inferences about population parameters.

1.3.4 Types of Statistics


When using statistics to derive information from data for decision-making, we
employ either descriptive statistics or inferential statistics. The choice between
these methods depends on the questions we aim to answer and the nature of
the data at hand.

(i). Descriptive Statistics


Descriptive statistics is all about summarizing data in a way that is easy
to understand. It helps us describe the data we have and make it more un-
derstandable. Instead of looking at a large amount of raw data, descriptive
statistics helps us look at it in a more organized and simple way. It focuses on
things like:

• Averages: To find a typical value, like the average score in a class.

• Spreads: To understand how different the data points are from each
other, such as how spread out the heights of people in a group are.

• Visual Tools: Charts, graphs, and tables that make it easier to see
patterns and trends in data.


Descriptive statistics: Methods of organizing, summarizing, and presenting
data in an informative way.
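
As a brief sketch (with made-up exam scores), pandas can produce many of
these descriptive summaries in a single call:

# A minimal sketch: descriptive statistics with pandas (illustrative scores)
import pandas as pd

scores = pd.Series([55, 62, 71, 68, 90, 84, 77, 63, 95, 70])

print(scores.describe())    # count, mean, std, min, quartiles, max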

(ii). Inferential Statistics


Inferential statistics allows us to make predictions or generalizations about
a larger group based on a small sample of data. In many situations, it’s not
practical or possible to collect data from everyone (for example, asking every
person in a country about their opinion). Instead, we collect data from a small
group (a sample) and use that to make guesses about the entire group (called
a population).

For example, suppose you wanted to know how much time students spend
studying for exams in your school. Instead of asking every student, you could
ask a small group of students (a sample) and then use their answers to make
an estimate about the entire student population.
Inferential statistics: Techniques for making predictions or inferences
about a population based on sample data.

Inferential statistics helps us make decisions or draw conclusions about


a larger group, even though we only have data from a smaller group. This is
helpful when we need to make choices or predictions without needing to survey
everyone.

1.4 What is Data?


Data refers to raw facts, observations, or units of information, including
numbers, words, measurements, observations, images, videos, audio, or descriptions,


that can be collected, stored, analyzed, and used to inform decisions. It can be
structured (like databases) or unstructured (like text or images). It is the ba-
sis for generating useful insights and making informed decisions in various fields.

Data: Raw facts and figures or units of information that can be collected
for analysis, which can be quantitative or qualitative.

Types of Data
Data can be categorized into several types, each with its own characteristics
and uses. These categories help us understand different ways to collect and
analyze information. The primary data types are as follows:

(i). Quantitative Data or Numeric Data: This type of data involves


numerical values that can be measured or counted, answering questions


like “how much?” or “how many?” Quantitative data provides us with


specific amounts or counts, and it can be broken down further into two
subcategories:
• Discrete Data: This refers to numerical data that can only take
specific, separate values, often representing counts. For example,
the number of students in a class or the number of products sold.
• Continuous Data: In contrast, continuous data can take any
value within a given range, including fractions. Examples include
measurements like height, weight, or temperature.
Some examples of quantitative data are as follows:

• How old you are: 15, 16, 17, etc.
• How many students are in a class: 25, 30, 35, etc.
• The temperature in a city: 20°C, 25°C, etc.
(ii). Qualitative Data or Non-numeric Data: Unlike quantitative data,
qualitative data or non-numeric data does not deal with numbers. In-
stead, it describes categories, qualities, or characteristics. This type of
data is often used to answer questions like “what kind?” or “which one?”
Sometimes, it is called categorical data. Qualitative data includes var-
ious forms of descriptive information, and examples include:
• Your favorite color: Red, Blue, Green, etc.
• The type of pets people have: Dog, Cat, Fish, etc.
• The kind of food people like: Pizza, Pasta, Salad, etc.

Other Types of Data


• Binary Data: Contains only two possible values (e.g., true/false,
yes/no).

• Text Data: Unstructured information in textual form (e.g., articles,


tweets).

• Time Series Data: Recorded observations over time intervals (e.g.,


stock prices, weather data).

• Spatial Data: Information about the physical location and shape of


objects (e.g., GPS coordinates, maps).

• Image and Video Data: Visual information captured as images or


videos.

• Audio Data: Sound or speech information, including recordings or


live streams.


1.4.1 Levels of Measurement


The levels of measurement or scale of measurement refer to the nature of
the data and determine the types of statistical analyses that can be performed.
There are four main levels of measurement:
(i). Nominal

(ii). Ordinal

(iii). Interval

(iv). Ratio

Nominal Level
The nominal level of measurement is the most basic type of data categoriza-
tion. It classifies data into distinct categories that do not have a natural order
or ranking. These categories are mutually exclusive and collectively exhaustive,
meaning each observation fits into one and only one category, and all categories
together include all possible observations.

Characteristics of Nominal Data:


• Categorical: Data are grouped into categories based on some quali-
tative property.

• No Order: There is no inherent order or ranking among the categories.

• Labels Only: Categories are typically labeled with names, numbers,


or symbols for identification.

Examples:
• Gender (male, female)

• Marital status (single, married, divorced)

• Types of cuisine (Italian, Chinese, Mexican)

• Types of Engineering Degrees: Civil, Mechanical, Electrical, Chemical,


Computer.

• Different types of engines used in vehicles and machinery: Diesel, Elec-


tric, Gasoline, Hybrid.

• Various materials used in building structures: Wood, Steel, Concrete,


Brick.

• Software Versions: v1.0, v2.0, v3.0.


• Blood Types: A, B, AB, O.

• Disease Types: Flu, Cold, Allergies, Asthma.

• Medical Specialties: Cardiology, Neurology, Oncology, Pediatrics.

• Vaccination Status: Vaccinated, Not Vaccinated.

• Hospital Departments: Emergency, Radiology, Surgery, Pediatrics.

Ordinal Level
The ordinal level of measurement classifies data into categories that have a
meaningful order or ranking among them. Unlike nominal data, ordinal data
allow for comparisons of the relative position of items, but the intervals between
the categories are not necessarily equal or known. Ordinal data are widely used
in surveys, questionnaires, and educational assessments to gauge attitudes, per-
ceptions, and performance levels.
Characteristics of Ordinal Data:
• Categorical with Order: Data are grouped into categories that have
a logical sequence or ranking.

• Relative Position: Categories indicate relative positions but not the


magnitude of difference between them.

• Rankings or Ratings: Often represented by rankings or ratings.

Examples:

• Education level (high school, bachelor’s, master’s, PhD)

• Customer satisfaction (very unsatisfied, unsatisfied, neutral, satisfied,


very satisfied)

• Engineering Design Maturity: Concept, Prototype, Final Design.

• Performance Ratings: Poor, Fair, Good, Excellent.

• Pain Levels: No Pain, Mild Pain, Moderate Pain, Severe Pain.

• Stage of Disease: Stage I, Stage II, Stage III, Stage IV.

• Quality of Life Scores: Poor, Fair, Good, Excellent.

• Severity of Symptoms: Mild, Moderate, Severe.

• Patient Satisfaction Levels: Very Unsatisfied, Unsatisfied, Neutral,
Satisfied, Very Satisfied.


Interval Level
The interval level of measurement involves data that are not only ordered but
also have equal intervals between values. Unlike ordinal data, the differences
between values are meaningful. However, interval data do not have a true zero
point, which means that ratios are not meaningful. Despite lacking a true zero
point, interval data provide a higher level of detail compared to nominal and
ordinal data, allowing for more sophisticated analytical techniques.

Characteristics of Interval Data:


• Equal Intervals: The difference between values is consistent and
meaningful. For example, consider temperature measured in degrees
Celsius. The difference between 10°C and 20°C is 10 degrees, and the
difference between 20°C and 30°C is also 10 degrees. These equal in-
tervals allow us to perform meaningful addition and subtraction.

• No True Zero Point: Zero does not indicate the absence of the
quantity being measured, so ratios are not meaningful.

• Ordered: Data have a meaningful order.

Examples:
• Temperature Measurements in Celsius or Fahrenheit: 10°C, 20°C, 30°C.

• IQ Scores: Differences between scores are meaningful, but there is no


true zero point.

• Calendar Years: The difference between years is consistent, but there


is no true zero year.

• Body Temperature in Celsius: 36.5°C, 37.0°C, 37.5°C.

• SAT Scores: There is no true zero score that indicates an absence of


knowledge or ability; the score is relative to the testing scale.

• GPA (Grade Point Average): GPA values represent equal intervals


(e.g., the difference between 3.0 and 4.0 is the same as between 2.0 and
3.0). However, there is no true zero point on the GPA scale; it is just
a measure of academic performance.

Ratio Level
The ratio level of measurement is the highest level of measurement and includes
all the properties of the interval level, with the addition of a true zero point.
This allows for meaningful comparisons and calculation of ratios. The pres-
ence of a true zero point allows for a full range of mathematical and statistical


operations, making ratio data the most informative and versatile level of mea-
surement.

Characteristics of Ratio Data:


• Equal Intervals: The difference between values is consistent and
meaningful. For example, the difference between 10 kilograms and
20 kilograms is the same as the difference between 30 kilograms and
40 kilograms. In both cases, the difference is 10 kilograms. This con-
sistency allows for accurate and meaningful addition and subtraction
of values.

• True Zero Point: Zero indicates the absence of the quantity being
measured, making ratios meaningful.

• Ordered: Data have a meaningful order.

• Meaningful Ratio: This characteristic enables comparisons using
ratios (e.g., twice as much, half as much).

Examples:
• Length of Engineering Components: 5 m, 10 m, 15 m.

• Weight of Materials: 2 kg, 5 kg, 10 kg.

• Cost of Engineering Projects: $1000, $2000, $3000.

• Mechanical Stress Measurements: 10 MPa, 20 MPa, 30 MPa.

• Production Quantity: 100 units, 200 units, 300 units.



• Height of Patients: 150 cm, 160 cm, 170 cm.

• Weight of Patients: 50 kg, 60 kg, 70 kg.

• Number of Hospital Visits: 1 visit, 2 visits, 3 visits.

• Dosage of Medication: 10 mg, 20 mg, 30 mg.

• Length of Hospital Stay: 1 day, 2 days, 3 days.
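
These levels can also be reflected in code. The sketch below (with illustrative
labels) uses pandas categoricals: an unordered categorical for nominal data and
an ordered one for ordinal data, where order comparisons become meaningful:

# A minimal sketch: nominal vs. ordinal data with pandas (illustrative labels)
import pandas as pd

# Nominal: categories with no inherent order
blood_type = pd.Categorical(["A", "O", "B", "AB", "O"],
                            categories=["A", "B", "AB", "O"])

# Ordinal: categories with a meaningful order
pain = pd.Categorical(["Mild", "Severe", "Moderate", "Mild"],
                      categories=["No Pain", "Mild", "Moderate", "Severe"],
                      ordered=True)

print(blood_type.categories)
print(pain < "Severe")      # order comparisons are valid only for ordinal data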

By grasping these basic statistical concepts, data scientists can better analyze
and interpret data, leading to more accurate and meaningful insights. As we
delve deeper into data science, these foundational principles will serve as the
building blocks for more advanced topics and applications.


1.4.2 Variables
In statistics and data science, a variable is any characteristic or property that
can take on different values. Variables are essential for research, as they rep-
resent the different factors or elements that can change or vary across different
individuals, conditions, or time periods. They can take on different values de-
pending on the nature of the data being collected.

Variable: A characteristic or quantity that can vary or take on different
values in a dataset or experiment.

Understanding the types of variables is crucial because it determines the kind
of statistical analysis that can be performed. Variables are broadly categorized
into qualitative and quantitative variables, each with further subtypes.

Qualitative Variables
Qualitative variables, also known as categorical variables, describe categories
or groups. These variables represent characteristics that cannot be measured
numerically but can be classified into distinct groups.

• Nominal Variables: These are variables with categories that have no natural order or ranking. Examples include gender (male, female), marital status (single, married, divorced), and hair color (blonde, brunette, redhead).

• Ordinal Variables: These are variables with categories that have a meaningful order or ranking, but the intervals between the categories are not necessarily equal. Examples include education level (high school, bachelor’s, master’s, doctorate) and customer satisfaction ratings (satisfied, neutral, dissatisfied).

Quantitative Variables
Quantitative variables, also known as numerical variables, represent measurable
quantities and can be expressed numerically. They can be further divided into
discrete and continuous variables.

• Discrete Variables: These variables represent countable values. They take on a countable number of distinct values and are often integers. Examples include the number of students in a class, the number of cars in a parking lot, and the number of books on a shelf.

• Continuous Variables: These variables represent measurable quantities that can take on any value within a given range. They are often associated with measurements. Examples include height, weight, temperature, and time.


Understanding the type of variable is essential for choosing the appropriate statistical methods and analyses. Each type of variable provides different insights and requires specific techniques for accurate interpretation and decision-making.
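
As a quick, hypothetical illustration, the snippet below builds a small pandas DataFrame containing each kind of variable and inspects the resulting data types; the column names and values are invented for this example.

import pandas as pd

# A hypothetical dataset mixing the four variable types discussed above
df = pd.DataFrame({
    "gender": ["male", "female", "female", "male"],      # qualitative, nominal
    "education": pd.Categorical(
        ["high school", "bachelor's", "master's", "doctorate"],
        categories=["high school", "bachelor's", "master's", "doctorate"],
        ordered=True),                                   # qualitative, ordinal
    "num_books": [3, 12, 7, 5],                          # quantitative, discrete
    "height_cm": [170.2, 158.4, 165.0, 181.3],           # quantitative, continuous
})

print(df.dtypes)  # object, category, int64, and float64 columns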

1.5 Scope of Applied Statistics


Applied statistics encompasses a wide range of activities and applications across
various fields. Here are some key areas of its scope:

• Data Collection and Management: Designing surveys, experiments, and data collection methods.

• Data Analysis: Employing statistical techniques to analyze data, including hypothesis testing, regression analysis, and ANOVA.

• Decision Making: Providing statistical insights for decision-making in business, healthcare, policy-making, and more.

• Predictive Modeling: Developing models to forecast future trends and outcomes based on historical data.

• Quality Control: Implementing statistical methods for quality assurance and improvement in manufacturing and services.

• Market Research: Analyzing consumer behavior, market trends, and product performance.

• Epidemiology: Studying the distribution and determinants of health-related events in populations.

• Financial Analysis: Applying statistics to risk assessment, investment strategies, and economic forecasting.

• Educational Assessment: Evaluating educational programs, student performance, and teaching effectiveness.

1.6 Statistical Methods in Data Science


Statistical methods play a central role in data science by providing the tools and
techniques needed to analyze, interpret, and draw conclusions from data. These
methods help data scientists understand underlying patterns, make predictions,
and inform decision-making. Below are some key statistical methods commonly
used in data science:


• Descriptive Statistics: Descriptive statistics involve summarizing and describing the main features of a dataset; a short Python sketch illustrating several of the methods in this section appears after this list. This can include measures such as:
■ Mean: The sum of all values divided by the number of values.
■ Median: The middle value when the data is sorted.
■ Mode: The most frequently occurring value.
■ Standard Deviation: A measure of how spread out the values are from the mean.
■ Variance: The square of the standard deviation, representing the spread of the data.

• Inferential Statistics: Inferential statistics allow us to make predictions or inferences about a population based on a sample of data. Key techniques include:
AF ■ Hypothesis Testing: Used to test assumptions or claims about
a population based on sample data, using tests like t-tests or
chi-squared tests.
■ Confidence Intervals: Estimating the range within which a pop-
ulation parameter is likely to fall, based on sample data.
■ ANOVA (Analysis of Variance): A method for comparing the
means of multiple groups to determine if there are any statisti-
cally significant differences between them.

• Regression Analysis: Regression techniques are used to model relationships between variables and make predictions. Common methods include:
■ Linear Regression: Modeling the relationship between a depen-
dent variable and one or more independent variables.
■ Logistic Regression: Used for binary classification problems where
the outcome is categorical (e.g., yes/no, success/failure).

• Probability Theory: Probability theory helps quantify uncertainty and is fundamental to many data science algorithms. Common tools include:
■ Bayesian Statistics: A method of statistical inference that up-
dates probabilities based on new evidence.
■ Random Variables: Variables whose values are outcomes of ran-
dom phenomena, often described by probability distributions
(e.g., normal distribution, binomial distribution).


• Machine Learning Algorithms: Many machine learning algorithms, especially supervised learning algorithms, are built upon statistical principles. These include:
■ Classification: Methods like decision trees, support vector ma-
chines (SVM), and neural networks that categorize data into
distinct classes.
■ Clustering: Techniques like k-means or hierarchical clustering
that group data points based on similarity.
■ Dimensionality Reduction: Methods such as Principal Component Analysis (PCA) that reduce the number of variables in a dataset while preserving as much information as possible.

• Time Series Analysis: Time series methods are used for analyzing data collected over time. Techniques include:
■ ARIMA (AutoRegressive Integrated Moving Average): A model used for forecasting time series data.
■ Seasonal Decomposition: Identifying and removing seasonal pat-
terns from time series data to better understand underlying
trends.

• Statistical Sampling: Sampling techniques are used to select subsets of data from a larger population for analysis. Methods include:
■ Random Sampling: Selecting data randomly to ensure unbiased representation of the population.
■ Stratified Sampling: Dividing the population into subgroups and sampling from each group to ensure representation from all segments.
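
As promised above, here is a minimal sketch of a few of these methods using NumPy and SciPy; the two small samples of patient recovery times are invented for illustration.

import numpy as np
from scipy import stats

# Hypothetical samples: recovery times (in days) for two groups of patients
group_a = np.array([4, 5, 6, 7, 8, 6, 5])
group_b = np.array([6, 7, 8, 9, 7, 8, 6])

# Descriptive statistics for group A
print("Mean:", np.mean(group_a))
print("Median:", np.median(group_a))
print("Standard deviation:", np.std(group_a, ddof=1))  # sample standard deviation
print("Variance:", np.var(group_a, ddof=1))            # sample variance

# Inferential statistics: a two-sample t-test for a difference in means
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")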

1.7 Overview of Data Science Workflow


The data science workflow typically involves several key steps:

1. Problem Definition: Understanding the problem and defining objectives.

2. Data Collection: Gathering relevant data from various sources.

3. Data Cleaning: Handling missing values, outliers, and ensuring data quality.

4. Data Exploration: Summarizing and visualizing data to understand its structure and patterns.


5. Modeling: Applying statistical and machine learning techniques to build predictive models.

6. Evaluation: Assessing the performance of the models using appropriate metrics.

7. Deployment: Implementing the model in a production environment.

8. Monitoring: Continuously monitoring and refining the model to ensure its accuracy and relevance.
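
The sketch below compresses steps 2 through 6 into a few lines with scikit-learn; it uses the library’s built-in iris dataset purely as a stand-in for real project data.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Data collection: load a ready-made dataset
X, y = load_iris(return_X_y=True)

# Data cleaning and exploration would normally happen here (iris is already clean)

# Modeling: fit a simple classifier on a training split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
model = LogisticRegression(max_iter=200).fit(X_train, y_train)

# Evaluation: assess performance on held-out data
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))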

1.8 Popular Statistical Analysis Tools

There are several tools and software commonly used for statistical analysis in
data science, including:

• Python: A versatile programming language with powerful libraries such as NumPy, pandas, SciPy, and scikit-learn for statistical analysis and machine learning.

• R: A programming language and software environment designed for statistical computing and graphics.

• SAS: A software suite used for advanced analytics, multivariate analysis, business intelligence, and data management.

• STATA: A powerful software package that provides comprehensive tools for data manipulation, statistical analysis, and graphical representation.

• SQL: A domain-specific language used for managing and manipulating relational databases.

• Excel: A spreadsheet software that provides basic statistical functions and data visualization capabilities.

1.9 Why Choose Python for This Book?


Python has become the language of choice for many data scientists and statisti-
cians. Here are several reasons why Python is particularly suited for statistical
analysis and data science:

1. Ease of Learning and Use


Python’s syntax is straightforward and readable, making it accessible for be-
ginners. Its simplicity allows users to focus on solving problems rather than
getting bogged down by the complexities of the language itself.


2. Comprehensive Libraries
Python boasts a rich ecosystem of libraries that are essential for statistical
analysis and data science. Some of the most popular libraries include:
• NumPy: Provides support for large multidimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays.

• Pandas: Offers data structures and functions needed to manipulate structured data seamlessly, making it easy to handle and analyze data.

• SciPy: Contains modules for optimization, linear algebra, integration, and other advanced mathematical functions.

• Matplotlib and Seaborn: Provide powerful tools for data visualization, allowing users to create a wide range of static, animated, and interactive plots.

• Statsmodels: Enables statistical modeling, hypothesis testing, and data exploration.

• Scikit-learn: A robust library for machine learning that includes simple and efficient tools for data mining and data analysis.
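
As a small, hypothetical taste of how these libraries work together, the snippet below simulates data with NumPy, summarizes it with pandas, and plots it with Matplotlib.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)                   # NumPy: random number generation
df = pd.DataFrame({"x": rng.normal(size=100)})   # pandas: tabular data structure

print(df["x"].describe())                        # quick summary statistics

df["x"].hist(bins=15)                            # histogram drawn via Matplotlib
plt.title("Histogram of simulated data")
plt.show()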

3. Community and Support


Python has a large, active community of users and developers. This means
extensive documentation, a wealth of tutorials, and a plethora of forums where
users can ask questions and share knowledge. This community support accel-
erates learning and problem-solving.

4. Integration Capabilities
Python integrates well with other languages and technologies. It can be used
alongside other data tools like SQL for database queries, or even integrated
with languages such as R or C++ for specialized tasks. This flexibility makes
Python a versatile tool in a data scientist’s toolkit.

5. Open Source and Free


Python is open source and freely available, which lowers the barrier to entry for
individuals and organizations. This democratization of technology allows more
people to engage with data science and statistical analysis.


6. Versatility
Python is not only used for data analysis but also for web development, au-
tomation, scripting, and even artificial intelligence and machine learning. This
versatility means that once you learn Python, you can apply your skills to a
wide range of problems and projects.

7. Real-World Applications
Many industry leaders and tech giants like Google, Facebook, and NASA use
Python for data analysis and machine learning. This real-world application un-
derscores Python’s reliability and effectiveness in handling complex data tasks.

8. Continuous Development
Python and its libraries are continuously being developed and improved by the
community, ensuring that users have access to the latest tools and techniques
in data science and statistics.
In summary, Python’s combination of simplicity, powerful libraries, commu-
nity support, integration capabilities, and versatility makes it an ideal choice
for data scientists and statisticians. Throughout this book, we will leverage
Python to demonstrate various statistical tools and methodologies, ensuring
that you can apply what you learn to real-world data challenges effectively.

1.10 Getting Started with Python


Setting Up Python

Instructions for installing Python can be found at https://www.python.org/downloads/. We recommend using Jupyter Notebook or an Integrated Development Environment (IDE) like PyCharm.

Basic Python Concepts


If you are new to Python, start with the basics: variables, data types, con-
trol structures, and functions. Jupyter Notebooks are particularly useful for
interactive data analysis.
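
For readers who want a first taste, here is a tiny, self-contained example touching each of those basics; the values are arbitrary.

# Variables and data types
scores = [80, 92, 75, 88]      # a list of integers
passing_mark = 85              # an integer

# A function definition
def mean(values):
    return sum(values) / len(values)   # arithmetic mean of a list

# Control structures: a loop and a conditional
for s in scores:
    if s >= passing_mark:
        print(s, "is a high score")

print("Mean score:", mean(scores))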

1.11 Structure of the Book


Overview of Chapters
This book is structured to gradually build your knowledge in data science,
starting from basic concepts to advanced applications. Each chapter includes
theoretical explanations, practical examples, and Python code snippets.

Learning Path
To get the most out of this book, follow the chapters sequentially. Practice the
examples and exercises provided to reinforce your understanding.

1.12 Concluding Remarks


Applied statistics is a fundamental component of data science that provides the
tools necessary for data analysis and interpretation. By understanding basic statistical concepts and methods, data scientists can perform effective analyses
and make data-driven decisions. In the following chapters, we will delve deeper
into specific statistical methods and explore how they can be applied to real-
world data science problems.

By the end of this book, you will have a solid foundation in applied statistics
and probability using Python. You will be equipped with the skills to tackle
real-world data science problems.

1.13 Chapter Exercises


1. Define data science. How does it differ from traditional data analysis?
2. List and describe the various stages of a data science project lifecycle.
3. Explain the role of a data scientist. What skills are essential for a data
scientist to be effective in their role?

4. Discuss the importance of interdisciplinary knowledge in data science.


Provide examples of how data science can be applied in different fields.
5. Define statistics. Provide examples of how statistics are applied in the
fields of healthcare and engineering.
6. Explain the difference between descriptive and inferential statistics. Give
an example of each.
7. Discuss the importance of data visualization in statistical analysis. Men-
tion two Python libraries that are useful for creating data visualizations.
8. Define population and sample. Why is sampling important in statistical
analysis?
9. Define parameter and statistic with an example.
10. Why is Python considered a suitable programming language for data sci-
ence and statistical analysis?


11. List and describe three Python libraries commonly used in data science.
What functionalities do they provide?
12. Explain the importance of community and support in choosing Python
for data science.

13. How does Python’s integration capability benefit a data scientist working
on complex projects?
14. Identify the type of data (quantitative or qualitative) for each of the
following:
(i). The colors of cars in a parking lot.

(ii). The heights of students in a class.
(iii). The brands of smartphones owned by a group of people.
(iv). The number of books read by a group of students in a year.
AF (v). The types of cuisine served at different restaurants.
15. For each of the following variables, identify the level of measurement
(nominal, ordinal, interval, or ratio):
(i). The ranking of movies from a film festival.
(ii). The temperatures in degrees Celsius recorded over a week.
(iii). The number of steps taken by an individual in a day.
(iv). The blood types of patients in a hospital.
(v). The ages of participants in a survey.

16. Categorize the following data sets as either nominal, ordinal, interval, or
ratio:
(i). Survey responses (Strongly Agree, Agree, Neutral, Disagree, Strongly
Disagree).
(ii). Birth years of employees in a company.
(iii). Types of pets owned (dog, cat, bird, etc.).
(iv). Test scores out of 100.
(v). Customer satisfaction ratings on a scale of 1 to 10.

Chapter 2

Data Exploration: Tabular

and Graphical Displays
2.1 Introduction
In data science, data exploration is a crucial phase in the data analysis work-
flow, preceding more complex statistical modeling and hypothesis testing. It
provides a comprehensive overview of the dataset, allowing analysts to under-
stand the underlying structure and characteristics of the data. Through this
process, one can identify data quality issues, such as missing values or outliers,
and gain insights that inform the choice of appropriate analytical methods.

This chapter covers basic methods for exploring data using tables and charts.
First, we will look at tables, which are a simple and effective way to organize
DR

and summarize data. Then, we will explore charts and graphs, which help
us see patterns and trends more easily. By looking at both qualitative and
quantitative data through these methods, we aim to build a solid foundation
for more advanced analysis and ensure a strong approach to exploring data.

2.2 Tabular Displays


Tabular displays present data in a structured format, typically organized into
rows and columns. Tables are valuable for displaying raw data, summarizing
information, and comparing different datasets. Basic tabular methods include
frequency tables, which show the distribution of data across different categories
or intervals, and cross-tabulations, which explore relationships between two
categorical variables. Tables are particularly useful for presenting precise data
and performing detailed comparisons, but they can be limited in their ability
to convey trends and patterns visually.


Key Components
• Frequency Tables: Display the count (or frequency) of each distinct value, category, or interval in a dataset, helping to summarize and understand the distribution of the data (a short pandas sketch follows this list).

• Contingency Tables: Show the joint frequency distribution of two or more categorical variables, making it possible to examine the relationship between them.

• Summary Tables: Provide descriptive statistics such as mean, median, mode, standard deviation, and quartiles.
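
As a quick illustration, pandas can produce simple frequency and contingency tables directly; the tiny survey dataset below is made up for this purpose.

import pandas as pd

# Hypothetical survey responses
df = pd.DataFrame({
    "gender": ["M", "F", "F", "M", "F", "M"],
    "preference": ["tea", "coffee", "tea", "tea", "coffee", "coffee"],
})

# Frequency table for one categorical variable
print(df["preference"].value_counts())

# Contingency table (cross-tabulation) of two categorical variables
print(pd.crosstab(df["gender"], df["preference"]))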

Example
A frequency table showing the distribution of test scores for a class of students.

Score Range    Frequency
0-49           2
50-59          3
60-69          7
70-79          8
80-89          10
90-100         5
80-89 10
90-100 5

2.3 Graphical Displays


Graphical displays, on the other hand, offer a visual representation of data that
can reveal trends, relationships, and distributions more intuitively than tables.
Common graphical methods include histograms, bar charts, pie charts, stem-
and-leaf plots, and box plots. Each type of graphical display serves a specific
purpose:

• Histograms show the frequency distribution of numerical data across intervals, helping to identify patterns and outliers.

• Bar Charts compare different categories by showing the frequency or count of each category.

• Pie Charts illustrate the proportion of each category relative to the whole dataset.

• Stem-and-Leaf Plots provide a way to display data that retains the original data values and reveals distribution shapes.

• Box Plots summarize the distribution of data through quartiles and highlight potential outliers.

Graphical methods are indispensable for exploring the data visually and
communicating findings to a broader audience. They help to simplify complex
datasets and highlight patterns that might not be immediately apparent from
tabular data alone.

2.4 Summarizing Qualitative Data


Summarizing Qualitative Data involves organizing and presenting categor-

T
ical data in a way that highlights the frequency and distribution of the different
categories. Here are common methods:

2.4.1 Bar Chart


A bar chart (or bar graph) is a type of data visualization used to represent
AF
and compare different categories of data through rectangular bars. Each bar’s
length or height is proportional to the value or frequency of the category it
represents.

Bar Chart: A chart that uses rectangular bars to represent the frequency
or proportion of categories in a dataset, with the length of each bar corre-
sponding to its value.

The bar chart is a powerful tool for visualizing categorical data. It allows
for easy comparison between different categories. This makes it straightforward
to identify patterns, trends, and outliers within the data.
DR

Problem 2.1. Suppose we conducted a survey asking students about their fa-
vorite movie genre from a list of options: Comedy, Action, Romance, Drama,
and Science Fiction. We gathered responses from a total of 20 students, and
their preferences are given in Table 2.1 and as follows:

Table 2.1: List of favorite movies.

Comedy Science Fiction Comedy Comedy


Action Science Fiction Action Romance
Action Romance Science Fiction Romance
Romance Romance Action Drama
Comedy Romance Action Science Fiction

Make a frequency distribution table and draw a bar chart.


To draw the bar chart, the frequency distribution of the students’ preferences is given in Table 2.2. The bar chart is presented in Figure 2.1.

Table 2.2: Frequency distribution of movie preferences

Category frequency (fi ) Percentage

Comedy 4 20%
Action 5 25%
Romance 6 30%
Drama 1 5%
Science Fiction 4 20%

Total              ∑i fi = 20

[Bar chart: y-axis Number of Students (0 to 6); bars: Comedy 4, Action 5, Romance 6, Drama 1, Science Fiction 4.]

Figure 2.1: Bar Chart of Movie Preferences

2.4.2 Python Code: Bar Chart


To create a bar chart for the data in Problem 2.1 using Python, you can use the following code to generate the chart.
# Import the necessary libraries
import matplotlib.pyplot as plt
from collections import Counter

# Data from the table
movies = [
    "Comedy", "Science Fiction", "Comedy", "Comedy",
    "Action", "Science Fiction", "Action", "Romance",
    "Action", "Romance", "Science Fiction", "Romance",
    "Romance", "Romance", "Action", "Drama",
    "Comedy", "Romance", "Action", "Science Fiction"
]

# Count the frequency of each movie genre
movie_counts = Counter(movies)

# Extract the genres and their corresponding counts
genres = list(movie_counts.keys())
counts = list(movie_counts.values())

# Plot the bar chart
plt.figure(figsize=(10, 6))
plt.bar(genres, counts, color='skyblue')
plt.xlabel('Movie Genre')
plt.ylabel('Frequency')
plt.title('Favorite Movie Genres')
plt.show()


2.4.3 Pie Chart


A pie chart is a circular statistical graphic divided into slices to illustrate nu-
merical proportions. Each slice of the pie represents a category’s contribution
to the total, with the entire pie representing 100% of the data.
DR

Pie Chart: A circular chart that represents the proportion of each cate-
gory in a dataset as slices, allowing easy comparison of relative frequencies.

Pie charts are a straightforward way to show how different parts contribute
to a whole, making them a popular choice for visualizing proportions in data
analysis.

Problem 2.2. Refer to Table 2.1 for the dataset needed to create a pie chart.
Analyze the pie chart and interpret the key findings from the data visualized.

Solution
To draw the pie chart, the frequency distribution of the students’ preferences is given in Table 2.3. The pie chart is presented in Figure 2.2.

Table 2.3: Frequency Distribution

Category    Frequency (fi)    Percentage    Angle = (fi / ∑i fi) × 360°

Comedy 4 20% 72
Action 5 25% 90
Romance 6 30% 108
Drama 1 5% 18
Science Fiction 4 20% 72
Total    ∑i fi = 20

[Pie chart: Comedy 20%, Action 25%, Romance 30%, Drama 5%, Science Fiction 20%.]

Figure 2.2: Pie Chart


The frequency distribution table and pie chart reveal the preferences for
movie genres among 20 respondents. Romance emerges as the most preferred
genre, accounting for 30% of the choices, as indicated by the largest pie chart
segment. Action is also quite popular, chosen by 25% of respondents. Both
Comedy and Science Fiction have equal popularity, each being preferred by
20% of the respondents. Drama is the least favored genre, with only 5% of the
respondents selecting it. These results suggest that respondents have a diverse
range of genre preferences, with Romance standing out as the most popular
choice.

2.4.4 Python Code: Pie Chart


To create a pie chart for the data in Problem 2.2 using Python, you can use the following code to generate the chart.
# Install the required libraries first (if not already installed):
#   pip install pandas matplotlib
import matplotlib.pyplot as plt
from collections import Counter

# Data from the table
data = [
    'Comedy', 'Science Fiction', 'Comedy', 'Comedy',
    'Action', 'Science Fiction', 'Action', 'Romance',
    'Action', 'Romance', 'Science Fiction', 'Romance',
    'Romance', 'Romance', 'Action', 'Drama',
    'Comedy', 'Romance', 'Action', 'Science Fiction'
]

# Count occurrences of each genre
genre_counts = Counter(data)

# Prepare data for the pie chart
genres = list(genre_counts.keys())
counts = list(genre_counts.values())

# Create the pie chart
plt.figure(figsize=(10, 7))
plt.pie(counts, labels=genres, autopct='%1.1f%%', startangle=140)
plt.title('Favorite Movies Pie Chart')
plt.show()

Problem 2.3. Imagine you conducted a survey to find out the types of physical
activities performed by patients in a rehabilitation program. You collected data
from 100 patients, and the results are as follows:

• Walking: 25 patients
R
• Cycling: 20 patients

• Swimming: 15 patients

• Yoga: 15 patients


D

Strength Training: 10 patients

• Pilates: 8 patients

• Dancing: 7 patients

Answer the following questions:

(a). What is the name of the variable under study? Is this a qualitative vari-
able? If the answer is no, why?

31
CHAPTER 2. DATA EXPLORATION: TABULAR AND GRAPHICAL
DISPLAYS

(b). Make a pie chart to represent the distribution of physical activities among
the patients. Write a summary based on this. Be sure to label the chart
accurately and show the percentage of patients for each activity.

Solution
(a). The variable under study is the type of physical activity performed by
patients in the rehabilitation program. Yes, this is a qualitative variable
because it describes categories or types of activities rather than numerical
values.
(b). Pie Chart:

[Pie chart: Walking 25%, Cycling 20%, Swimming 15%, Yoga 15%, Strength Training 10%, Pilates 8%, Dancing 7%.]

Summary:
The pie chart shows the distribution of physical activities among the pa-
tients in the rehabilitation program. Walking is the most common activity,
with 25% of the patients participating in it. This is followed by cycling
(20%), swimming (15%), and yoga (15%). Strength training is performed
by 10% of the patients, while pilates and dancing are the least common
activities, with 8% and 7% participation, respectively.

2.5 Summarizing Quantitative Data


Summarizing quantitative data involves organizing and presenting numerical
data in a way that reveals patterns, trends, and important characteristics of
the dataset. Some common methods are explained in the following.

32
CHAPTER 2. DATA EXPLORATION: TABULAR AND GRAPHICAL
DISPLAYS

2.5.1 Constructing a Frequency Distribution Table


Let’s take an example to understand how to construct a frequency distribution. Suppose we have the weekly expenditures of 30 students. To construct a frequency distribution table, we follow these steps:

• Step 1: Sort the data in ascending order.

• Step 2: Find minimum and maximum observation of data.

T
• Step 3: Decide on the number of classes (k) in the frequency distri-
bution.
k = 1 + 3.322 log10 (n).
Alternatively, we can also choose k such that

2k ≥ n.
AF• Step 4: Determine the class interval (h) size.

Maximum observation - Minimum observation


h≥
Number of class (k)

• Step 5: Decide the starting point: the lower class limit or class bound-
ary should cover the smallest value in the raw data.

• Step 6: Tally and count the observations under each interval.


DR

Let’s take the following example to understand how to construct a frequency


distribution.

Problem 2.4. Suppose we have a weekly expenditure of 30 students. Given the


following numbers of observations:

423 369 387 411 393 394


371 377 389 409 392 407
431 401 363 391 405 382
400 381 399 415 428 422
395 371 410 419 386 390

construct a frequency distribution table using an appropriate number of classes


and the class interval.

33
CHAPTER 2. DATA EXPLORATION: TABULAR AND GRAPHICAL
DISPLAYS

Solution
1. Step 1: sort data in ascending order

363 369 371 371 377 381 382 386 387 389 390

391 392 393 394 395 399 400 401 405 407 409

410 411 415 419 422 423 428 431

2. Step 2: The minimum observation is 363 and the maximum observation is 431.
3. Step 3: Number of classes
k = 1 + 3.322 log10 (30) = 5.907 ≈ 6

4. Step 4: h ≥ (Maximum observation − Minimum observation) / Number of classes = (431 − 363) / 6 = 11.33 ≈ 12
5. Step 5: Decide the starting point: 360.
Using all steps, the frequency distribution table is presented in Table 2.4.

Table 2.4: Distribution of weekly expenditure of 30 students.


DR

Class Interval Tally Frequency Relative Frequency


360 - 372 4 0.1333
372 - 384 3 0.1000
384 - 396 9 0.3000
396 - 408 5 0.1667
408 - 420 5 0.1667
420 - 432 4 0.1333
Total 30 1.0000

Constructing a frequency distribution table helps in organizing data into


class intervals, providing a clear overview of the distribution of data points.
This is a crucial step in data exploration, as it allows for the identification of
patterns and trends within the dataset.

34
CHAPTER 2. DATA EXPLORATION: TABULAR AND GRAPHICAL
DISPLAYS

2.5.2 Python Code: Frequency Distribution


To create a frequency distribution table for the data in Problem 2.4 using
Python, you can use the following code to generate the Frequency Distribution
table.
1 import pandas as pd
2 import matplotlib . pyplot as plt
3

4 # Data : weekly expenditure of 30 students


5 data = [
6 423 , 369 , 387 , 411 , 393 , 394 ,
7 371 , 377 , 389 , 409 , 392 , 407 ,

T
8 431 , 401 , 363 , 391 , 405 , 382 ,
9 400 , 381 , 399 , 415 , 428 , 422 ,
10 395 , 371 , 410 , 419 , 386 , 390
11 ]
12

13 # Convert the data into a pandas DataFrame


14

15

16

17
AF df = pd . DataFrame ( data , columns =[ ’ Expenditure ’ ])

# Define the class intervals ( bins )


bins = range (360 , 440 , 12) # Create bins from 360 to 440
with an interval of 12
18 labels = [ f ’{ bins [ i ]} -{ bins [ i +1] -1} ’ for i in range ( len ( bins
) -1) ] # Labels for each bin
19

20 # Bin the data and calculate frequency distribution


21 df [ ’ Bins ’] = pd . cut ( df [ ’ Expenditure ’] , bins = bins , labels =
labels , right = False )
22 f r e q u e n c y _d is tri bu ti on = df [ ’ Bins ’ ]. value_counts () .
sort_index ()
DR

23

24 # Print frequency distribution


25 print ( " Frequency Distribution : " )
26 print ( f r eq uen cy _d ist ri bu ti on )
27

28 # Plot frequency distribution as a bar chart


29 plt . figure ( figsize =(12 , 6) )
30 f r e q u e n c y _d is tri bu ti on . plot ( kind = ’ bar ’ , color = ’ skyblue ’)
31 plt . xlabel ( ’ Expenditure Range ’)
32 plt . ylabel ( ’ Frequency ’)
33 plt . title ( ’ Frequency Distribution of Weekly Expenditure with
Fixed Class Intervals ’)
34 plt . xticks ( rotation =45)
35 plt . grid ( axis = ’y ’ , linestyle = ’ -- ’ , alpha =0.7)
36 plt . tight_layout ()
37 plt . show ()
38

39

35
CHAPTER 2. DATA EXPLORATION: TABULAR AND GRAPHICAL
DISPLAYS

2.5.3 Frequency Polygon


A frequency polygon is a line graph that displays the frequencies of different
class intervals. It is created by plotting the frequencies of each class interval at
the midpoints of those intervals and connecting these points with straight lines.

To draw we have to calculate the midpoint for each class interval. The
midpoint is the average of the lower and upper bounds of the class interval
(See Table 2.5). Then plot the midpoints on the x-axis and the corresponding
frequencies on the y-axis. The frequency polygon is depicted in Figure 2.3

Table 2.5: Distribution of Weekly Expenditure of 30 Students

T
Class Tally Frequency Relative Midpoint
Interval Frequency
360 - 372 4 0.1333 366
AF 372 - 384
384 - 396
396 - 408
3
9
5
0.1000
0.3000
0.1667
378
390
402
408 - 420 5 0.1667 414
420 - 432 4 0.1333 426
Total 30 1.0000

Figure 2.3: Frequency Polygon of Weekly Expenditure of 30 Students


[Frequency polygon: frequency (0 to 10) plotted against the class midpoints; x-axis: Expenditure, 360 to 432.]

36
CHAPTER 2. DATA EXPLORATION: TABULAR AND GRAPHICAL
DISPLAYS

2.5.4 Python Code: Frequency Polygon


To create a frequency polygon for the data in Example 2.4 using Python, you
can use the following code to generate the Frequency Polygon.
1 import pandas as pd
2 import matplotlib . pyplot as plt
3

4 # Data : weekly expenditure of 30 students


5 data = [
6 423 , 369 , 387 , 411 , 393 , 394 ,
7 371 , 377 , 389 , 409 , 392 , 407 ,
8 431 , 401 , 363 , 391 , 405 , 382 ,

T
9 400 , 381 , 399 , 415 , 428 , 422 ,
10 395 , 371 , 410 , 419 , 386 , 390
11 ]
12

13 # Convert the data into a pandas DataFrame


14 df = pd . DataFrame ( data , columns =[ ’ Expenditure ’ ])
15

16

17
AF # Define the class intervals ( bins )
bins = range (360 , 440 , 12) # Create bins from 360 to 440
with an interval of 12
18 labels = [ f ’{ bins [ i ]} -{ bins [ i +1] -1} ’ for i in range ( len ( bins
) -1) ] # Labels for each bin
19

20 # Bin the data and calculate frequency distribution


21 df [ ’ Bins ’] = pd . cut ( df [ ’ Expenditure ’] , bins = bins , labels =
labels , right = False )
22 f r e q u e n c y _d is tri bu ti on = df [ ’ Bins ’ ]. value_counts () .
sort_index ()
23
DR

24 # Calculate bin midpoints


25 bin_midpoints = [( bins [ i ] + bins [ i +1]) / 2 for i in range (
len ( bins ) -1) ]
26

27 # Plot frequency polygon


28 plt . figure ( figsize =(12 , 6) )
29 plt . plot ( bin_midpoints , fre qu en cy_ di st ri but io n . sort_index () ,
marker = ’o ’ , linestyle = ’ - ’ , color = ’b ’)
30 plt . xlabel ( ’ Expenditure Range ’)
31 plt . ylabel ( ’ Frequency ’)
32 plt . title ( ’ Frequency Polygon of Weekly Expenditure ’)
33 plt . xticks ( bin_midpoints , labels = labels , rotation =45)
34 plt . grid ( True , linestyle = ’ -- ’ , alpha =0.7)
35 plt . tight_layout ()
36 plt . show ()
37

38

37
CHAPTER 2. DATA EXPLORATION: TABULAR AND GRAPHICAL
DISPLAYS

2.5.5 Ogive Curve


An ogive curve (also called Cumulative Frequency Polygons) is a graphical rep-
resentation used in statistics to show the cumulative frequency distribution of
a dataset. It is a type of cumulative frequency graph that visualizes how the
cumulative frequency accumulates over a range of values.

To draw an ogive curve, we need to calculate cumulative frequency which is


calculated by adding the frequency of the current class interval to the cumulative
frequency of the previous class interval. The frequency distribution table with
cumulative frequency is presented in Table 2.6.

T
Table 2.6: Distribution of Weekly Expenditure of 30 Students

Class Tally Frequency Relative Midpoint Cumulative


Interval Frequency Frequency
360 - 372 4 0.1333 366 4
AF
372 - 384
384 - 396
3
9
0.1000
0.3000
378
390
7
16
396 - 408 5 0.1667 402 21
408 - 420 5 0.1667 414 26
420 - 432 4 0.1333 426 30
Total 30 1.0000

The ogive curve is presented in Figure 2.4 for weekly expenditures. In


DR

the frequency polygon, the peak at the midpoint of 390 indicates that most
students’ expenditures fall around this value. The shape of the polygon, with
a rise to the peak and a gradual decline, shows that while expenditures are
somewhat concentrated in the 384-396 range, there is a moderate spread across
other ranges. This visualization helps quickly grasp the central tendency and
variability of expenditures among students.

38
CHAPTER 2. DATA EXPLORATION: TABULAR AND GRAPHICAL
DISPLAYS
Figure 2.4: Ogive Curve of Weekly Expenditure of 30 Students

[Ogive: cumulative frequency (0 to 30) plotted against class boundaries; x-axis: Expenditure, 360 to 432.]

In the example of weekly expenditures for 30 students, the ogive curve il-
AF
lustrates that as expenditure increases, the cumulative number of students also
rises. The curve starts at the cumulative frequency of 4 for the interval 360-
372 and gradually increases to 30 for the interval 420-432, reflecting that 30
students’ expenditures are up to 432. The steepness of the curve indicates in-
tervals with higher frequencies, while flatter sections show lower frequencies.
Key features like the median can be identified where the curve reaches 50%
of the total cumulative frequency (15 students), and the quartiles reveal the
spread of expenditures across different percentiles. Overall, the ogive provides
insights into data distribution, helping to visualize the proportion of students
DR
spending up to various amounts.

2.5.6 Python Code: Ogive


To create a Ogive for the data in Example 2.4 using Python, you can use the
following code to generate the Ogive.
1 import pandas as pd
2 import matplotlib . pyplot as plt
3

4 # Data : weekly expenditure of 30 students


5 data = [
6 423 , 369 , 387 , 411 , 393 , 394 ,
7 371 , 377 , 389 , 409 , 392 , 407 ,
8 431 , 401 , 363 , 391 , 405 , 382 ,
9 400 , 381 , 399 , 415 , 428 , 422 ,
10 395 , 371 , 410 , 419 , 386 , 390
11 ]
12

39
CHAPTER 2. DATA EXPLORATION: TABULAR AND GRAPHICAL
DISPLAYS

13 # Convert the data into a pandas DataFrame


14 df = pd . DataFrame ( data , columns =[ ’ Expenditure ’ ])
15

16 # Define the class intervals ( bins ) with an interval of 12


17 bin_start = 360
18 bin_end = 440
19 bin_interval = 12
20 bins = list ( range ( bin_start , bin_end + bin_interval ,
bin_interval ) )
21

22 # Generate labels for the bins


23 labels = [ f ’{ bins [ i ]} -{ bins [ i +1] -1} ’ for i in range ( len ( bins
) -1) ]
24

25 # Bin the data and calculate frequency distribution


26 df [ ’ Bins ’] = pd . cut ( df [ ’ Expenditure ’] , bins = bins , labels =

T
labels , right = False )
27 f r e q u e n c y _d is tri bu ti on = df [ ’ Bins ’ ]. value_counts () .
sort_index ()
28

# Calculate cumulative frequency


29

30
AF
c u m u l a t i ve_frequency = f re qu enc y_ di str ib ut ion . cumsum ()
31

32 # Calculate bin edges for plotting


33 bin_edges = bins # Use only the bin edges
34

35 # Calculate cumulative frequencies including the starting


point
36 c u m u l a t i v e _ f r e q u e n c y _ w i t h _ s t a r t = [0] + list (
c u m u lative_frequency )
DR
37

38 # Ensure the length of bin_edges matches


cumulative_frequency_with_start
39 if len ( bin_edges ) != len ( c u m u l a t i v e _ f r e q u e n c y _ w i t h _ s t a r t ) :
40 # Extend bin_edges to match the length of
cumulative_frequency_with_start
41 bin_edges = bins + [ bins [ -1] + bin_interval ]
42

43 # Plot the ogive


44 plt . figure ( figsize =(12 , 6) )
45 plt . plot ( bin_edges , cumulative_frequency_with_start , marker =
’o ’ , linestyle = ’ - ’ , color = ’b ’)
46 plt . xlabel ( ’ Expenditure Range ’)
47 plt . ylabel ( ’ Cumulative Frequency ’)
48 plt . title ( ’ Ogive of Weekly Expenditure with Interval 12 ’)
49 plt . xticks ( bin_edges , labels =[ f ’{ edges } -{ edges +12 -1} ’ for
edges in bin_edges [: -1]] + [ ’ ’ ])
50 plt . grid ( True , linestyle = ’ -- ’ , alpha =0.7)
51 plt . tight_layout ()

40
CHAPTER 2. DATA EXPLORATION: TABULAR AND GRAPHICAL
DISPLAYS

52 plt . show ()

2.5.7 Histogram
A histogram is a graphical representation of the distribution of numerical data.
It consists of a series of adjacent rectangles, or bars, where each bar’s height
corresponds to the frequency or count of data points falling within a specific
range or bin. It provides a visual summary of data distribution, helping to
identify patterns such as trends, peaks, and the spread of data.

The following histogram (see Figure 2.5) displays the distribution of

weekly expenditures for 30 students. The x-axis represents the expenditure
ranges (bins), and the y-axis represents the number of students in each range.

[Histogram: frequency of weekly expenditures for the classes 360-372 through 420-432.]

Figure 2.5: Histogram

The histogram of weekly expenditures for 30 students illustrates that most


students’ spending falls within the $384 to $396 range, which is the highest
frequency interval in the distribution. The data is predominantly centered
around this mid-range, indicating that it is the most common expenditure range
among the students. The frequency of expenditures decreases as you move
towards the lower ($360 to $372) and higher ($420 to $432) intervals, showing
that fewer students have expenditures at the extremes. Overall, the histogram
reveals a central tendency in the middle expenditure ranges, suggesting that
most students have similar spending patterns with moderate expenditures being
more prevalent compared to the extremes.

41
CHAPTER 2. DATA EXPLORATION: TABULAR AND GRAPHICAL
DISPLAYS

2.5.8 Python Code: Histogram


To create a histogram for the data in Example 2.4 using Python, you can use
the following code to generate the histogram.
1 import pandas as pd
2 import matplotlib . pyplot as plt
3

4 # Data : weekly expenditure of 30 students


5 data = [
6 423 , 369 , 387 , 411 , 393 , 394 ,
7 371 , 377 , 389 , 409 , 392 , 407 ,
8 431 , 401 , 363 , 391 , 405 , 382 ,

T
9 400 , 381 , 399 , 415 , 428 , 422 ,
10 395 , 371 , 410 , 419 , 386 , 390
11 ]
12

13 # Convert the data into a pandas DataFrame


14 df = pd . DataFrame ( data , columns =[ ’ Expenditure ’ ])
15

16

17

18
AF # Define the bin intervals with an interval of 12
bin_start = 360
bin_end = 440
19 bin_interval = 12
20 bins = list ( range ( bin_start , bin_end + bin_interval ,
bin_interval ) )
21

22 # Plot the histogram


23 plt . figure ( figsize =(12 , 6) )
24 plt . hist ( df [ ’ Expenditure ’] , bins = bins , edgecolor = ’ black ’ ,
alpha =0.7)
25 plt . xlabel ( ’ Expenditure Range ’)
DR

26 plt . ylabel ( ’ Frequency ’)


27 plt . title ( ’ Histogram of Weekly Expenditure with Interval 12 ’
)
28 plt . xticks ( bins , labels =[ f ’{ edges } -{ edges +12 -1} ’ for edges
in bins [: -1]] + [ f ’{ bins [ -1]}+ ’ ])
29 plt . grid ( True , linestyle = ’ -- ’ , alpha =0.7)
30 plt . tight_layout ()
31 plt . show ()

2.5.9 Stem-and-Leaf Plot


A stem-and-leaf plot is a data visualization tool used to display the distri-
bution of a dataset while preserving the original data values. It separates each
data point into two parts: the “stem,” which represents the leading digits of
the data, and the “leaf,” which represents the trailing digits. This plot helps
to organize data, reveal patterns, and identify the shape of the distribution. It

42
CHAPTER 2. DATA EXPLORATION: TABULAR AND GRAPHICAL
DISPLAYS

is particularly useful for small to moderate-sized datasets.

Components of a Stem-and-Leaf Plot:


• Stem: Represents the leading digits of the data values.

• Leaf: Represents the last digit of the data values.

• Plot: Lists stems in ascending order with leaves corresponding to each


stem, showing the distribution of data.
Problem 2.5. Consider the following dataset representing the blood pressures

FT
(in mmHg) of 15 patients.

120, 122, 124, 126, 128, 130, 132, 134, 136, 138, 140, 142, 144, 146, 148

Construct a stem-and-leaf plot and comment on it.

Solution
The stem-and-leaf plot is presented in Table 2.7. The “stem” represents the
tens and the “leaf” represents the ones digit. For example, for 120, the stem is
12 and the leaf is 0.

Table 2.7: Stem-and-Leaf Plot of Blood Pressure Measurements


A
Stem Leaf
12 0, 2, 4, 6, 8
13 0, 2, 4, 6, 8
14 0, 2, 4, 6, 8
R
The stem-and-leaf plot organizes the data clearly:
• The stems (12, 13, 14) represent the tens digits of the blood pressures.

• The leaves show the units digits for each stem, indicating the exact
D

values of the blood pressures.

Observations from the plot:


The stem-and-leaf plot shows that the blood pressures are fairly evenly dis-
tributed between 120 and 148. There are no extreme outliers, as the values
are consistently spaced out. The plot illustrates a consistent increase in blood
pressure from 120 mmHg to 148 mmHg, indicating a relatively uniform range
of values. Each stem (12, 13, and 14) represents a set of five patients’ blood
pressures, making the distribution easy to interpret.

43
CHAPTER 2. DATA EXPLORATION: TABULAR AND GRAPHICAL
DISPLAYS

Problem 2.6. Construct a stem-and-leaf plot to display the distribution of the


following temperatures recorded in Celsius degrees:

25.3, -3.8, 12, 0.5, -10,


18.9, -7.2, 6, 21.6, -15.4

Create a stem-and-leaf plot appropriately.

Solution
The temperatures recorded in Celsius are:

25.3, −3.8, 12, 0.5, −10, 18.9, −7.2, 6, 21.6, −15.4

We can construct a stem-and-leaf plot as follows:

T
Stem Leaf
−15 4
−10 0
AF −7 2
−3 8
0 5
6 0
12 0
18 9
DR
21 6
25 3
In this plot:

• The ‘stem’ represents the integer part (before the decimal point) of
each temperature. For instance, the temperature 25.3 can be broken
down into a stem of 25 (representing the integer part) and a leaf of 3
(representing the decimal part).

• The ‘leaf’ represents the decimal part (after the decimal point) of each
temperature.

2.5.10 Python Code: Stem-and-leaf


To create a stem-and-leaf for the given data in Example 2.6 in Python, we need
the following Python code that will generate a stem-and-leaf.

44
CHAPTER 2. DATA EXPLORATION: TABULAR AND GRAPHICAL
DISPLAYS

1 from collections import defaultdict


2

3 # Data
4 data = [25 , -4 , 12 , 1 , -10 , 19 , -7 , 6 , 22 , -15]
5

6 def st em _and_leaf_plot ( data ) :


7 # Separate positive and negative values
8 pos_data = [ x for x in data if x >= 0]
9 neg_data = [ - x for x in data if x < 0]
10

11 # Determine the stems and leaves


12 def create_stem_leaf ( data ) :

T
13 stem_leaf = defaultdict ( list )
14 for value in data :
15 stem = value // 10
16 leaf = value % 10
17 stem_leaf [ stem ]. append ( leaf )
18 return stem_leaf
19

20

21

22

23
AF pos_stem_leaf = create_stem_leaf ( pos_data )
neg_stem_leaf = create_stem_leaf ( neg_data )

# Print positive values


24 print ( " Stem - and - Leaf Plot : " )
25 print ( " Positive values : " )
26 for stem in sorted ( pos_stem_leaf . keys () ) :
27 leaves = sorted ( pos_stem_leaf [ stem ])
28 print ( f " { stem } | { ’ ’. join ( map ( str , leaves ) ) } " )
29

30 # Print negative values


31 print ( " Negative values : " )
DR

32 for stem in sorted ( neg_stem_leaf . keys () ) :


33 leaves = sorted ( neg_stem_leaf [ stem ])
34 print ( f " -{ stem } | { ’ ’. join ( map ( str , leaves ) ) } " )
35

36 # Generate the stem - and - leaf plot


37 st em _a nd _leaf_plot ( data )

2.6 Concluding Remarks


As we conclude our exploration of tabular and graphical displays, it is evident
that these foundational techniques are indispensable for effective data analysis.
The ability to present and interpret data through well-structured tables and
informative graphics not only enhances our understanding but also aids in un-
covering patterns and trends that might otherwise remain obscured. The skills
and Python code examples provided in this chapter equip you with practical
tools for summarizing and visualizing data, setting the stage for deeper statis-

45
CHAPTER 2. DATA EXPLORATION: TABULAR AND GRAPHICAL
DISPLAYS

tical analysis and more complex data science endeavors. As you move forward,
the principles outlined here will serve as a cornerstone for more advanced topics,
reinforcing the importance of clear, accurate, and insightful data presentation
in the broader field of data science.

2.7 Chapter Exercises


1. You are given the following data from a survey on the incidence of differ-
ent types of diseases among a sample of patients:

Flu, Cold, Flu, Allergies, Cold, Flu, Cold, Flu, Allergies, Flu, Flu, Aller-

T
gies, Cold, Flu, Cold, Flu, Cold, Allergies, Cold, Flu

Construct a frequency distribution table for the diseases. Draw a bar chart
and a pie chart based on the frequency distribution. Interpret the results
and comment on the most and least common diseases in the sample.
AF
2. A local community center conducted a survey to find out the preferred
recreational activities of its members. The results of the survey are sum-
marized below:

Activity Number of Members


Basketball 45
Swimming 30
Yoga 20
Tennis 15
DR

Running 10

(a). Calculate the percentage of members who prefer each activity.


(b). Draw a pie chart to represent the distribution of preferred recre-
ational activities among the members. Write a summary based on
this.

3. The following table given in Table 2.8, shows the distribution of sales (in
thousands of units) of five different products in a company during the first
quarter of the year.

46
CHAPTER 2. DATA EXPLORATION: TABULAR AND GRAPHICAL
DISPLAYS
Table 2.8: Sales Distribution of Products

Product Sales (thousands of units)


Laptops 30
Smartphones 20
Tablets 15
Desktops 25
Accessories 10

T
(a). Calculate the percentage share of each product in the total sales.
(b). Make a pie chart. Write a summary based on this.

4. Imagine you conducted a survey to find out how people spend their leisure
time on a typical weekend. You collected data from 100 respondents, and
the results are as follows:
AF • Watching TV: 30 respondents
• Reading: 20 respondents
• Playing Sports: 15 respondents
• Socializing with Friends: 10 respondents
• Playing Video Games: 10 respondents
• Hiking and Outdoor Activities: 8 respondents
DR

• Cooking and Baking: 7 respondents

Answer the following questions:


(a). What is the name of the variable under study? Is this a qualitative
variable? If the answer is no, why?
(b). Make a pie chart to represent the distribution of leisure activities
among the respondents. Write a summary based on this. Be sure to
label the chart accurately and show the percentage of respondents
for each activity.
5. You are given the following set of test scores from a group of students in
an engineering course:

45, 67, 53, 52, 61, 59, 68, 72, 56, 54,
63, 75, 49, 62, 60, 58, 66, 64, 55, 70

47
CHAPTER 2. DATA EXPLORATION: TABULAR AND GRAPHICAL
DISPLAYS

Construct a frequency distribution table using an appropriate number of


classes and appropriate class intervals for the given test scores.
6. The following data represents the ages of employees in a company working
on a new engineering project:

102, 98, 105, 110, 95, 107, 101, 99, 103, 106,
104, 100, 108, 97, 96, 109, 111, 94, 93, 92
Create a frequency distribution table using an appropriate number of
classes and appropriate class intervals for the given ages.

T
7. Here are the weekly sales figures (in units) for a pharmaceutical company
over a period of 20 weeks:

23, 27, 31, 35, 29, 33, 22, 28, 26, 34,
32, 25, 30, 24, 36, 21, 37, 38, 39, 20
AFConstruct a frequency distribution table using an appropriate number of
classes and appropriate class intervals for the given sales figures.
8. The following data represents the weights (in kilograms) of a sample of
fruits used in a nutritional study:

5.2, 6.3, 7.1, 8.4, 5.5, 6.8, 7.6, 8.0, 5.9, 6.1,
7.3, 8.7, 5.0, 6.5, 7.8, 8.1, 5.7, 6.9, 7.4, 8.5
Make a frequency distribution table using an appropriate number of classes
and appropriate class intervals for the given weights.
DR

9. You are given the following set of scores from a recent medical examina-
tion:

82, 91, 85, 87, 89, 95, 88, 92, 84, 90,
93, 83, 86, 96, 94, 81, 97, 98, 99, 80
Create a frequency distribution table using an appropriate number of
classes and appropriate class intervals for the given exam scores.
10. Consider the following data set representing the heights (in cm) of 10
plants:

Data: 150, 155, 160, 162, 165, 168, 170, 175, 180, 185
(a). Construct a stem-and-leaf plot for the data set.
(b). What is the maximum number of heights of the data?

48
Chapter 3

Data Exploration:

Numerical Measures
3.1 Introduction
In this chapter, we delve into the fundamental concepts of data exploration
with a focus on numerical measures. Understanding these measures is crucial
for analyzing and interpreting data effectively. We begin by examining vari-
ous measures of central tendency, which provide insights into the typical value
within a dataset. These include the arithmetic mean, harmonic mean, geometric
mean, and median, each with its unique properties, advantages, and limitations.

Following the exploration of central tendency, we will address measures of


dispersion or variability. These metrics, such as range, variance, and standard
DR

deviation, help us understand the spread and variability of data points around
the central value. The chapter will also cover measures of distribution shape,
including skewness and kurtosis, which describe the asymmetry and peakedness
of the data distribution.

Moreover, we will discuss quartiles, percentiles, and deciles, which are instru-
mental in dividing the data into meaningful segments, and outline methods for
detecting outliers. The chapter concludes with an overview of the five-number
summary and boxplots, essential tools for summarizing and visualizing data
distribution. Python code will be provided throughout to illustrate practical
applications of these concepts.

3.2 Measures of Central Tendency


Measures of central tendency are statistical measures that describe the center
or representative value of a dataset. The most common measures of central


tendency are the following:


(i). Mean
■ Arithmetic mean
■ Harmonic mean
■ Geometric mean

(ii). Median

(iii). Mode

These measures provide a single value that represents the middle or center
of the data distribution and are essential for summarizing large datasets.

3.2.1 Arithmetic Mean


The arithmetic mean, often referred to as the average, is the most common
AF
measure of central tendency. It is a useful measure when the data is symmet-
rically distributed without extreme outliers. It is calculated by summing all
values and dividing by the number of values. The arithmetic mean of a set of
numbers x1 , x2 , . . . , xn is typically denoted by x̄ of n and is defined by

n
!
x1 + x2 + · · · + xn 1 X
x̄ = = xi
n n i=1

Consider a study measuring the time in days for patients to recover from a
specific illness, with recovery times being 4, 5, 6, 7, 8 days. We can use the
arithmetic mean to find the average recovery time.
DR

The arithmetic mean is calculated as follows:


x̄ = (4 + 5 + 6 + 7 + 8) / 5 = 30 / 5 = 6
So, the arithmetic mean of the recovery times is 6 days, which represents
the average recovery time.
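
The same computation takes one line in Python; here is a quick sketch using NumPy with the recovery times above.

import numpy as np

recovery_days = np.array([4, 5, 6, 7, 8])          # recovery times from the example
print("Arithmetic mean:", np.mean(recovery_days))  # prints 6.0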

3.2.2 Advantages and Disadvantages of Arithmetic Mean


The arithmetic mean (or simply the mean) is one of the most commonly used
measures of central tendency. It has both advantages and disadvantages de-
pending on the context in which it is used. Understanding its advantages and
disadvantages is crucial for selecting the appropriate measure of central ten-
dency for different types of data and analyses.

50
CHAPTER 3. DATA EXPLORATION: NUMERICAL MEASURES

Advantages
(i) Rigidity and Simplicity: It is rigidly defined, simple, easy to under-
stand, and easy to calculate.
(ii) Uses all data points: It is based upon all the observations in the data
set.
(iii) Uniqueness: Its value being unique allows for comparisons between dif-
ferent sets of data.
(iv) Mathematical Properties: The arithmetic mean has useful mathemat-
ical properties. For instance, it can be used in further statistical analysis,

T
like calculating variance and standard deviation.
(v) Best for Symmetric Distributions: The arithmetic mean is a reliable
measure of central tendency when the data follows a symmetric distribu-
tion (like a normal distribution), where the mean is the most representa-
tive value.
AF
(vi) Stability: It is least affected by sampling fluctuations compared to other
measures of central tendency.

Disadvantages
(i) Sensitivity to Outliers: The mean is highly affected by extreme values
or outliers. A few very high or low numbers can skew the mean, making
it not represent the ”typical” value of the data set.
(ii) Not Suitable for Skewed Distributions: In data sets that are heavily
skewed, the mean may not reflect the central location accurately, as it can
be pulled toward the tail of the distribution.


(iii) Not Ideal for Non-Numeric Data: The arithmetic mean cannot be
applied to nominal or ordinal data, as it requires numerical values to make
sense.

(iv) Requirement of Complete Data: It cannot be obtained if a single


observation is missing.
(v) Inapplicability with Open Classes: It cannot be calculated if the
extreme class is open (e.g., below 10 or above 90).

Problem 3.1. Suppose we have the following data on the systolic blood pressure
(in mmHg) of 10 patients:

120, 130, 125, 140, 135, 128, 132, 138, 124, 126

What is the average systolic blood pressure (in mmHg) of 10 patients?


Solution
To calculate the mean systolic blood pressure:
120 + 130 + 125 + 140 + 135 + 128 + 132 + 138 + 124 + 126
x̄ =
10
1298
= = 129.8 mmHg
10

3.2.3 Harmonic Mean


The harmonic mean is a type of average used for rates and ratios. It is defined
as the reciprocal of the arithmetic mean of the reciprocals of a set of values.

The formula for the harmonic mean x̄HM of n values x1 , x2 , . . . , xn is:

x̄HM = n / ( Σ_{i=1}^{n} (1/xi) )
Using the recovery times 4, 5, 6, 7, 8 days, we calculate the harmonic mean to
find the average rate of recovery.

The harmonic mean is calculated as follows:

x̄HM = 5 / (1/4 + 1/5 + 1/6 + 1/7 + 1/8) = 5 / (0.25 + 0.20 + 0.1667 + 0.1429 + 0.125) ≈ 5.65

So, the harmonic mean of the recovery times is approximately 5.65 days,
which represents an average recovery time weighted by the rates of recovery.

Advantages and Disadvantages of Harmonic Mean


Advantages
(i) Appropriate for Rates and Ratios: The harmonic mean is particu-
larly useful for averaging rates or ratios, such as speeds, densities, or other
quantities where the reciprocal of the average is meaningful.
(ii) Minimizes the Impact of Large Values: It tends to minimize the
impact of large values compared to the arithmetic mean. This is beneficial
when large values could distort the overall average.
(iii) Emphasizes Small Values: It gives more weight to smaller values in a
data set, which can be useful in situations where lower values are more
significant or indicative.


(iv) Mathematically Robust: It is less affected by extreme values than the


arithmetic mean in datasets where the values are rates or ratios.

Disadvantages
(i) Sensitivity to Zero Values: The harmonic mean cannot be computed
if any value in the dataset is zero, as it involves division by the values.

(ii) Less Intuitive: It is less intuitive than the arithmetic mean and is not as
commonly used, which can make interpretation and communication more
challenging.

(iii) Not Suitable for All Data Types: It is not suitable for data that does
not represent rates or ratios. It is not typically used for general numerical
data where other means are more appropriate.
(iv) Potential for Misleading Results: In cases where there is significant
variability in the data, particularly if large values are present, the har-
monic mean can provide misleading results.
(v) Complex Calculation: The calculation of the harmonic mean is more
complex compared to the arithmetic mean, which can be a drawback in
some practical applications.

3.2.4 Geometric Mean


The geometric mean is a measure of central tendency that is useful for
datasets with exponential growth or multiplicative effects. It is defined as the
n-th root of the product of n values. The formula for the geometric mean x̄GM
of n values x1 , x2 , . . . , xn is:

x̄GM = (x1 × x2 × · · · × xn)^{1/n} = ( ∏_{i=1}^{n} xi )^{1/n}

Using the recovery times 4, 5, 6, 7, 8 days, we calculate the geometric mean to


find the average growth rate of recovery.

The geometric mean is calculated as follows:



x̄GM = (4 × 5 × 6 × 7 × 8)^{1/5} = 6720^{1/5} ≈ 5.83

So, the geometric mean of the recovery times is approximately 5.83 days,
reflecting the average multiplicative rate of recovery over time.


Advantages and Disadvantages of Geometric Mean


Advantages
(i) Appropriate for Multiplicative Processes: The geometric mean is
ideal for data that involves multiplicative processes, such as rates of
growth, financial returns, or other situations where values are multiplied.
(ii) Mitigates the Effect of Extreme Values: It reduces the impact of
extremely high or low values in the dataset, which can provide a more
balanced measure when dealing with skewed data.

(iii) Handles Proportional Relationships: It is useful for datasets where

values are in proportional or percentage terms, such as in economic or
financial analysis.
(iv) Stability in Long-Term Growth Rates: In contexts like investment
returns, the geometric mean offers a more accurate measure of average
growth rates over time compared to the arithmetic mean.
Disadvantages
(i) Cannot Handle Zero or Negative Values: The geometric mean is
undefined for datasets containing zero or negative values, as it involves
taking the nth root of the product of values.
(ii) Less Intuitive: It is less intuitive and harder to understand compared
to the arithmetic mean, making it less accessible for some audiences.
(iii) Requires Logarithmic Transformation: Calculating the geometric
mean involves the logarithm of values, which adds complexity compared
to simpler averages.

(iv) Sensitive to Variability: While it reduces the impact of extreme values,


it may still be affected by large variability in the dataset, especially if
values differ greatly.
(v) Not Suitable for All Data Types: It is not appropriate for all types
of data, particularly where the data do not naturally fit a multiplicative
model or where additive relationships are more relevant.

3.2.5 Relationships Between Arithmetic Mean, Geomet-


ric Mean, and Harmonic Mean
Let x1 , x2 , . . . , xn be positive real numbers. Theorem 3.1 states that the
means of a set of positive numbers satisfy the following inequality


x̄ ≥ x̄GM ≥ x̄HM ,

where x̄, x̄GM , and x̄HM represent the arithmetic mean, geometric mean, and
harmonic mean, respectively. This inequality reflects a fundamental property
of these measures of central tendency.

Specifically, the arithmetic mean is always greater than or equal to the


geometric mean, which in turn is greater than or equal to the harmonic mean.
Importantly, equality holds in these inequalities if and only if all the elements
in the dataset are equal. In other words, x̄ = x̄GM = x̄HM if and only if
x1 = x2 = · · · = xn .

Theorem 3.1. Let x1 , x2 , . . . , xn be positive real numbers. Let x̄AM denote
the arithmetic mean, x̄GM denote the geometric mean, and x̄HM denote the
harmonic mean of these numbers. Then,

x̄AM ≥ x̄GM ≥ x̄HM


Equality holds in both inequalities if and only if x1 = x2 = · · · = xn .
Proof. We will prove the inequality in two parts.

Part 1: Proof of x̄AM ≥ x̄GM (AM-GM Inequality)


We will use the property that for a convex function f , Jensen’s inequality
holds:

f( (1/n) Σ_{i=1}^{n} xi ) ≤ (1/n) Σ_{i=1}^{n} f(xi)

Consider the function f (x) = − ln(x) for x > 0. The first derivative is
f′(x) = −1/x, and the second derivative is f″(x) = 1/x². Since f″(x) > 0 for all
x > 0, the function f(x) = −ln(x) is strictly convex on the interval (0, ∞).

Applying Jensen’s inequality to f (x) = − ln(x) and the positive numbers


x1 , x2 , . . . , xn :

− ln( (1/n) Σ_{i=1}^{n} xi ) ≤ (1/n) Σ_{i=1}^{n} ( − ln(xi) )

− ln( x̄AM ) ≤ −(1/n) Σ_{i=1}^{n} ln(xi)

Using the property of logarithms Σ_{i=1}^{n} ln(xi) = ln( ∏_{i=1}^{n} xi ):

− ln( x̄AM ) ≤ −(1/n) ln( ∏_{i=1}^{n} xi )

− ln( x̄AM ) ≤ − ln( ( ∏_{i=1}^{n} xi )^{1/n} )

− ln( x̄AM ) ≤ − ln( x̄GM )
Since the natural logarithm function ln(x) is strictly increasing, the function
− ln(x) is strictly decreasing. Therefore, multiplying by −1 and reversing the
inequality sign gives:
ln (x̄AM ) ≥ ln (x̄GM ) .
Again, due to the strictly increasing nature of ln(x), we conclude:

x̄AM ≥ x̄GM

Equality in Jensen’s inequality for a strictly convex function holds if and


only if all the arguments are equal, i.e., x1 = x2 = · · · = xn . Therefore, equal-
ity in x̄AM ≥ x̄GM holds if and only if x1 = x2 = · · · = xn .

Part 2: Proof of x̄GM ≥ x̄HM (GM-HM Inequality)
Consider the reciprocals of the positive numbers x1 , x2 , . . . , xn , which are
1/x1 , 1/x2 , . . . , 1/xn . Applying the AM-GM inequality (proven in Part 1) to these
positive numbers, we have:
( 1/x1 + 1/x2 + · · · + 1/xn ) / n ≥ ( (1/x1) · (1/x2) · · · · · (1/xn) )^{1/n}

( Σ_{i=1}^{n} 1/xi ) / n ≥ ( 1 / ∏_{i=1}^{n} xi )^{1/n}

( Σ_{i=1}^{n} 1/xi ) / n ≥ 1 / ( ∏_{i=1}^{n} xi )^{1/n}

( Σ_{i=1}^{n} 1/xi ) / n ≥ 1 / x̄GM
Now take the reciprocal of both sides of the inequality. Since both sides are
positive, the inequality sign reverses:

n / ( Σ_{i=1}^{n} 1/xi ) ≤ x̄GM

By definition, the left side is the harmonic mean x̄HM :

x̄HM ≤ x̄GM

Which is equivalent to:


x̄GM ≥ x̄HM


The equality in the AM-GM inequality applied to 1/x1 , 1/x2 , . . . , 1/xn holds if and
only if 1/x1 = 1/x2 = · · · = 1/xn . This condition is equivalent to x1 = x2 = · · · = xn .
Therefore, equality in x̄GM ≥ x̄HM holds if and only if x1 = x2 = · · · = xn .

Combining Part 1 and Part 2, we have proven the AM-GM-HM inequality:

x̄AM ≥ x̄GM ≥ x̄HM

with equality holding throughout if and only if x1 = x2 = · · · = xn .
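As a quick numerical check of Theorem 3.1, the three means of the recovery-time data from the earlier examples can be computed with NumPy and SciPy. This is a minimal sketch using the gmean and hmean functions (also used later in Section 3.2.13):

import numpy as np
from scipy.stats import gmean, hmean

# Recovery times (in days) from the running example
times = [4, 5, 6, 7, 8]

am = np.mean(times)   # arithmetic mean: 6.0
gm = gmean(times)     # geometric mean: about 5.83
hm = hmean(times)     # harmonic mean: about 5.65

# The AM-GM-HM inequality of Theorem 3.1 holds
print(am >= gm >= hm)  # True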

3.2.6 Median

The median is a measure of central tendency that divides a dataset into two
equal halves. Let x1 , x2 , . . . , xn be a set of n numerical observations. To find
the median, first arrange the data in ascending (or descending) order. Let the
ordered data set be denoted by x(1) , x(2) , . . . , x(n) , where x(i) is the i-th value
in the ordered set.
AF
The median is then defined as follows:
Let m = (n + 1)/2. Then

Median = x(m), if m is an integer (i.e., n is odd),
Median = ( x(⌊m⌋) + x(⌈m⌉) )/2, if m is not an integer (i.e., n is even).

Here, ⌊m⌋ is the floor function (the greatest integer less than or equal
to m), and ⌈m⌉ is the ceiling function (the smallest integer greater than
or equal to m).
DR

When there is an odd number of observations, the median is simply the


middle value. For example, consider a dataset of seven blood pressure readings:

110, 115, 120, 125, 130, 135, 140.

Since there are seven observations (an odd number), the median is the fourth
value. In this case, m = (n + 1)/2 = (7 + 1)/2 = 4, which is an integer and hence, the
median is

Median = x(4) = 125.

Conversely, when there is an even number of observations, there is no single


middle value, so the median is calculated as the average of the two middle values.
For instance, in a dataset of eight cholesterol levels:

200, 210, 220, 225, 230, 240, 250, 260,

57
CHAPTER 3. DATA EXPLORATION: NUMERICAL MEASURES

the median is found by taking the average of the fourth and fifth values. In this
case, m = (n + 1)/2 = (8 + 1)/2 = 4.5 is not an integer. Hence, the median is

Median = ( x(⌊4.5⌋) + x(⌈4.5⌉) )/2 = ( x(4) + x(5) )/2 = (225 + 230)/2 = 227.5
The median is particularly useful in data science for identifying a typical value
that is not distorted by extreme observations.
Problem 3.2. Consider the following data on the number of hours of sleep per
night for a group of 9 adults:

7, 8, 5, 6, 9, 7, 6, 10, 8
What is the median hours of sleep per night?

Solution
First, arrange the data in ascending order:
5, 6, 6, 7, 7, 8, 8, 9, 10
In this case, n = 9, so m = (n + 1)/2 = (9 + 1)/2 = 5, which is an integer. Hence,

Median = x(5) = 7 hours

Problem 3.3. Consider the following data on the test scores of 6 students:
85, 92, 78, 95, 88, 80

What is the median test score?

Solution
First, arrange the data in ascending order:
78, 80, 85, 88, 92, 95
In this case, n = 6, so m = (n + 1)/2 = (6 + 1)/2 = 3.5, which is not an integer.
Hence, we use the second case of the median formula,

Median = ( x(⌊3.5⌋) + x(⌈3.5⌉) )/2 = ( x(3) + x(4) )/2

From the ordered data, x(3) = 85 and x(4) = 88. Therefore,

Median = (85 + 88)/2 = 173/2 = 86.5
The median test score is 86.5.


3.2.7 Advantages and Disadvantages of Median


Advantages of Median
(i) Robust to Outliers: The median is not affected by extreme values or
outliers, providing a better central measure for skewed distributions.
(ii) Simple to Compute: It is easy to calculate, especially for small datasets,
by finding the middle value when data is ordered.
(iii) Represents the 50th Percentile: It divides the dataset into two equal
halves, making it a useful measure for understanding the data’s center.

(iv) Applicable to Ordinal Data: Can be used with ordinal data, where
data values are ranked but not necessarily numeric.

Disadvantages of Median
(i) Ignores Data Values: Does not take into account the magnitude of all
AF data values, only their order.

(ii) Less Informative for Symmetric Distributions: Provides less infor-


mation about the distribution of data than the mean, especially if the
data is symmetric.
(iii) Not Suitable for Further Mathematical Operations: Unlike the
mean, the median cannot be easily used in further mathematical or sta-
tistical calculations.
(iv) Not Unique for Small Datasets: In small datasets with an even num-
ber of values, there may be two middle values, which can complicate
interpretation.

3.2.8 Mode
The mode is the value that appears most frequently in a dataset. A dataset can
have more than one mode if multiple values have the same highest frequency.
The mode is useful for categorical data or when identifying the most common
value.
Problem 3.4. Consider a manufacturing process where engineers are measur-
ing the diameter of a set of machine components to ensure they meet quality
specifications. The following is a list of diameters (in millimeters) of 20 com-
ponents that were measured:

50, 52, 51, 50, 53, 52, 54, 50, 51, 52, 55, 50, 53, 50, 51, 52, 55, 50, 52, 51

Find the mode of this dataset.


Solution
To find the mode of this dataset:

50, 52, 51, 50, 53, 52, 54, 50, 51, 52, 55, 50, 53, 50, 51, 52, 55, 50, 52, 51

Count the frequency of each diameter:


• Diameter 50 occurs 6 times.

• Diameter 52 occurs 5 times.

• Diameter 51 occurs 4 times.

• Diameters 53 and 55 occur 2 times each.

• Diameter 54 occurs 1 time.


We see the diameter 50 appears most frequently (6 times). Thus, the mode
of this dataset is 50 mm.
Problem 3.5. Consider the following data on the blood types of 15 individuals:

A, O, B, AB, O, A, B, O, A, A, B, O, O, A, B

What is the mode of the blood types?

Solution
The frequency distribution of blood types is:
• A: 5

• B: 4

• O: 5

• AB: 1
Since blood types A and O both have the highest frequency, the dataset is
bimodal:
Mode = A and O
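For categorical data such as blood types, Python's standard library offers statistics.multimode (available from Python 3.8), which returns every value tied for the highest frequency. A short sketch for the blood-type data above:

from statistics import multimode

blood_types = ["A", "O", "B", "AB", "O", "A", "B", "O",
               "A", "A", "B", "O", "O", "A", "B"]

# multimode returns all values with the highest frequency,
# in the order in which they first appear
print(multimode(blood_types))  # ['A', 'O']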

3.2.9 Advantages and Disadvantages of Mode


Advantages of Mode
(i) Easy to Identify: The mode is straightforward to determine as it is
simply the most frequently occurring value in a dataset.
(ii) Applicable to All Data Types: Can be used with nominal, ordinal,
and some quantitative data, making it versatile.


(iii) Reflects Commonality: Represents the value or values that occur most
often, which can be useful for understanding common trends or prefer-
ences.
(iv) Handles Categorical Data: Ideal for categorical data where numerical
calculations are not applicable.

Disadvantages of Mode
(i) May Not Be Unique: A dataset can have more than one mode or no
mode at all, which can complicate interpretation.

(ii) Not Useful for Continuous Data: Less useful for continuous data
with many unique values, as identifying the most frequent value can be
challenging.
(iii) Does Not Reflect Data Distribution: Does not provide information
about the spread or shape of the data distribution.
(iv) Insensitive to Changes: The mode does not account for changes in
data that do not affect frequency, potentially overlooking variations.

3.2.10 Choosing the Ideal Measure of Central Tendency


The choice of the ideal measure of central tendency depends on the nature of
the data and the analysis objectives. The primary measures are the arithmetic
mean, median, and mode. Here is a guide to selecting the most appropriate
measure:

• Arithmetic Mean: Best for symmetric distributions and numerical


data.

• Median: Best for skewed distributions and when robustness to outliers


is needed.

• Mode: Best for categorical data and identifying the most frequent
values.

3.2.11 Weighted Mean


The weighted mean, or weighted average, is a measure of central tendency where
each data point contributes to the average based on its assigned weight. Unlike
the arithmetic mean, which treats all data points equally, the weighted mean
takes into account the relative importance of each data point. Mathematically,
the weighted mean is defined as:


x̄WM = ( Σ_{i=1}^{n} wi xi ) / ( Σ_{i=1}^{n} wi )

where:
• xi represents the i-th data point,

• wi is the weight associated with the i-th data point,

• n is the number of data points.
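In Python, the weighted mean can be computed directly with NumPy's np.average, which accepts a weights argument. A minimal sketch, using the blood-pressure data of Problem 3.6 below:

import numpy as np

values = [120, 130, 140]   # group means (mmHg)
weights = [30, 25, 20]     # group sizes used as weights

# np.average computes sum(w*x) / sum(w)
print(np.average(values, weights=weights))  # about 128.67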

Problem 3.6. A biostatistician is analyzing the average blood pressure readings
of three different age groups in a study. The average blood pressure for each age
group and the number of individuals in each group are as follows:
• Age Group 1: Average blood pressure = 120 mmHg, Number of individuals = 30

• Age Group 2: Average blood pressure = 130 mmHg, Number of individuals = 25

• Age Group 3: Average blood pressure = 140 mmHg, Number of individuals = 20
Calculate the weighted mean of the average blood pressure across all age groups.

Solution:
The weighted mean is

x̄WM = ((120 × 30) + (130 × 25) + (140 × 20)) / (30 + 25 + 20)
= (3600 + 3250 + 2800)/75 = 9650/75 ≈ 128.67 mmHg
Therefore, the weighted mean blood pressure is approximately 128.67 mmHg.
Problem 3.7. A clinical trial evaluates the efficacy of a new drug. The efficacy
rates are measured as follows:
• Trial 1: Efficacy = 85%, Weight = 10

• Trial 2: Efficacy = 90%, Weight = 20

• Trial 3: Efficacy = 80%, Weight = 15


Find the weighted mean efficacy rate of the drug.


Solution:
The weighted mean is
x̄WM = ((85 × 10) + (90 × 20) + (80 × 15)) / (10 + 20 + 15)
= (850 + 1800 + 1200)/45 = 3850/45 ≈ 85.56%
Hence, the weighted mean efficacy rate is approximately 85.56%.
Problem 3.8. A public health researcher records the incidence rates of a disease
in three different regions. The incidence rates and the population sizes for each

region are:
• Region A: Incidence rate = 0.02 cases per person, Population = 50, 000

• Region B: Incidence rate = 0.03 cases per person, Population = 75, 000

• Region C: Incidence rate = 0.04 cases per person, Population = 25, 000
Calculate the weighted mean incidence rate across the three regions.
Solution:
The weighted mean is
x̄WM = ((0.02 × 50,000) + (0.03 × 75,000) + (0.04 × 25,000)) / (50,000 + 75,000 + 25,000)
= (1000 + 2250 + 1000)/150,000 = 4250/150,000 ≈ 0.02833 cases per person
Hence, the weighted mean incidence rate is approximately 0.02833 cases per person.
Problem 3.9. An academic advisor evaluates the performance of students

based on their grades in three courses, with different credit hours for each course:
• Course 1: Grade = 85, Credits = 3

• Course 2: Grade = 90, Credits = 4

• Course 3: Grade = 80, Credits = 2


Compute the weighted mean grade.

Solution:
The weighted mean is
x̄WM = ((85 × 3) + (90 × 4) + (80 × 2)) / (3 + 4 + 2)
= (255 + 360 + 160)/9 = 775/9 ≈ 86.11

Hence, the weighted mean grade is approximately 86.11.


3.2.12 Measures of Central Tendency for Grouped Data


Refer to Example 2.4. Suppose we have the weekly expenditure of 30 students,
and the frequency distribution of this weekly expenditure is presented in Table
3.1. The mean, median, and mode can be calculated in the following ways:

Table 3.1: Distribution of Weekly Expenditure of 30 Students

Class Interval | Frequency (fi) | Cumulative Frequency (cfi) | Midpoint (mi)
360 - 372 | 4 | 4 | 366
372 - 384 | 3 | 7 | 378
384 - 396 | 9 | 16 | 390
396 - 408 | 5 | 21 | 402
408 - 420 | 5 | 26 | 414
420 - 432 | 4 | 30 | 426
Total | 30 | |

Mean Estimation
The mean for grouped data is given by the formula:
x̄ = Σ fi mi / Σ fi

where fi is the frequency and mi is the midpoint of each class interval.



m1 = (360 + 372)/2 = 366
m2 = (372 + 384)/2 = 378
m3 = (384 + 396)/2 = 390
m4 = (396 + 408)/2 = 402
m5 = (408 + 420)/2 = 414
m6 = (420 + 432)/2 = 426

x̄ = ((4 × 366) + (3 × 378) + (9 × 390) + (5 × 402) + (5 × 414) + (4 × 426)) / 30


x̄ = (1464 + 1134 + 3510 + 2010 + 2070 + 1704)/30 = 11892/30 = 396.4

Median Estimation
The median for grouped data is given by the formula:
Median = l + ( (n/2 − cf)/f ) × h
where l is the lower boundary of the median class, n is the total frequency,
cf is the cumulative frequency before the median class, f is the frequency of
the median class, and h is the class width.

Median class: 384 − 396


l = 384, n = 30, cf = 7, f = 9, h = 12
Median = 384 + ((30/2 − 7)/9) × 12 = 384 + ((15 − 7)/9) × 12
= 384 + (8/9) × 12 = 384 + 10.67 = 394.67

Mode Estimation
The mode for grouped data is given by the formula:
 
Mode = l + ( (f1 − f0) / ((f1 − f0) + (f1 − f2)) ) × h

where l is the lower boundary of the modal class, f1 is the frequency of the
modal class, f0 is the frequency of the class before the modal class, f2 is the
frequency of the class after the modal class, and h is the class width.

Modal class: 384 − 396


l = 384, f1 = 9, f0 = 3, f2 = 5, h = 12

Mode = 384 + ( (9 − 3)/((9 − 3) + (9 − 5)) ) × 12 = 384 + (6/10) × 12 = 384 + 7.2 = 391.2
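These grouped-data estimates can be reproduced with a short NumPy sketch. The class boundaries and frequencies below are taken from Table 3.1; the interpolation assumes the modal class is neither the first nor the last class:

import numpy as np

lower = np.array([360, 372, 384, 396, 408, 420])  # lower class boundaries
freq = np.array([4, 3, 9, 5, 5, 4])               # class frequencies
h = 12                                            # class width
mid = lower + h / 2                               # class midpoints

n = freq.sum()

# Mean: frequency-weighted average of the midpoints
mean = (freq * mid).sum() / n

# Median: interpolate within the class containing the (n/2)-th value
cum = np.cumsum(freq)
k = np.searchsorted(cum, n / 2)          # index of the median class
cf = cum[k - 1] if k > 0 else 0          # cumulative frequency before it
median = lower[k] + (n / 2 - cf) / freq[k] * h

# Mode: interpolate within the class with the largest frequency
m = freq.argmax()
mode = lower[m] + (freq[m] - freq[m - 1]) / (
    (freq[m] - freq[m - 1]) + (freq[m] - freq[m + 1])) * h

print(mean, median, mode)  # 396.4  394.67 (approx)  391.2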


3.2.13 Python Code: Mean, Median and Mode


In this section, we will demonstrate how to compute the arithmetic mean, ge-
ometric mean, harmonic mean, median, and mode using Python. Consider the
following dataset:
{10, 15, 15, 20, 25, 30}
import numpy as np
from scipy.stats import gmean, hmean, mode

# Sample data
data = [10, 15, 15, 20, 25, 30]

# Arithmetic Mean
arithmetic_mean = np.mean(data)
print(f"Arithmetic Mean: {arithmetic_mean}")

# Geometric Mean
geometric_mean = gmean(data)
print(f"Geometric Mean: {geometric_mean}")

# Harmonic Mean
harmonic_mean = hmean(data)
print(f"Harmonic Mean: {harmonic_mean}")

# Median
median = np.median(data)
print(f"Median: {median}")

# Mode (scipy.stats.mode returns the value and its count)
mode_value, count = mode(data)
print(f"Mode: {mode_value}")


The output of the code will be:
• Arithmetic Mean: 19.1667

• Geometric Mean: 17.9768

• Harmonic Mean: 16.8224

• Median: 17.5

• Mode: 15

3.3 Exercises
1. Suppose we have the following dataset representing the scores of students
in a test:
85, 90, 78, 92, 88, 76, 95, 89, 84, 91


(a) Calculate the arithmetic mean of the test scores.


(b) Determine the median of the test scores.
2. Consider the following dataset representing the monthly salaries (in thou-
sands of dollars) of 12 employees at a company:

45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100

(a) Calculate the arithmetic mean of the salaries.


(b) Determine the median salary.
3. A company measures the speed of data transmission across four networks

as follows (in Mbps):
10, 20, 30, 40
(a) Calculate the arithmetic mean of the data transmission speeds.
(b) Compute the harmonic mean of the data transmission speeds.
(c) Calculate the median of the data transmission speeds.

4. The following dataset represents the time (in hours) taken by a worker to
complete four tasks:
4, 6, 8, 12
(a) Calculate the arithmetic mean of the task completion times.
(b) Compute the harmonic mean of the task completion times.
5. A company tracks the monthly returns (in percentage) on three different
investments over a year:
4%, 8%, 12%

(a) Calculate the geometric mean of the monthly returns.


(b) Compute the harmonic mean of the monthly returns.
6. A survey reports the annual growth rates of four investments as follows
(in percentage):
5%, 10%, 15%, 20%

(a) Calculate the geometric mean of the annual growth rates.


(b) Compute the harmonic mean of the annual growth rates.
7. Consider the following dataset of the number of books read by a group of
15 students:
2, 3, 2, 5, 7, 2, 4, 5, 5, 3, 6, 7, 8, 2, 5

(a) Identify the mode(s) of the number of books read.


(b) Calculate the mean of the number of books read.


8. A class of 20 students recorded their scores on a recent quiz as follows:

82, 85, 88, 82, 90, 85, 92, 82, 88, 85, 87, 90, 82, 85, 92, 90, 87, 88, 85, 82

(a) Compute the mean of the quiz scores.


(b) Identify the mode of the quiz scores.
9. The following dataset represents the hours spent on three different activ-
ities by a student each week, with respective weights:
• Activity Hours:
10, 15, 20

• Weights:
2, 3, 1
Compute the weighted mean of the hours spent on activities.
10. A company reports the sales (in units) of three different products with
the following weights:
• Sales:
200, 300, 400
• Weights:
1, 3, 2
Calculate the weighted mean of the sales data.
11. Define the weighted mean. A student’s overall grade in a course is de-
termined by three components: homework, exams, and projects. The
weights assigned to each component are as follows: homework (30%), ex-
ams (50%), and projects (20%). The student scored 85 in homework, 90


in exams, and 80 in projects. Calculate the student’s overall weighted
mean grade.
12. A dataset is grouped into the following intervals with corresponding fre-
quencies:
• Class Intervals:

[10 − 20, 20 − 30, 30 − 40, 40 − 50]

• Frequencies:
[8, 15, 20, 12]

(a) Calculate the mean and median of the grouped data.


(b) Calculate the mode of the grouped data.

13. A student receives grade points in three subjects with the following weights:


• Grade points:
3.5, 3.7, 4
• Credit:
3, 4, 2
Compute an average of the grade points.
14. A healthcare provider tracks the number of patients visiting a clinic over
a week (in patients per day) as follows:

12, 15, 14, 17, 15, 18, 16

(a) Find the median number of patients per day.
(b) Determine the mode of the number of patients.

3.4 Measures of Dispersion or Variability


Measures of dispersion or variability describe the spread or distribution of data
points in a dataset. They describe the extent to which data values differ from
the central tendency, such as the mean or median. These measures help to un-
derstand how much the data varies around the central tendency (mean, median,
or mode).
They provide insights into the distribution’s spread and consistency. They
are categorized into absolute and relative measures. Here’s an overview of
both:
1. Absolute Measures of Dispersion

• Range: Maximum value - Minimum value



• Interquartile range (IQR): Q3 − Q1


• Semi-interquartile range : (Q3 − Q1 )/2
• Variance (or standard deviation)
• Mean Absolute Deviation (MAD)

2. Relative Measures of Dispersion

• Relative Range: Relative range = Range/Mean


• Quartile Coefficient of Dispersion: (Q3 − Q1 )/(Q3 + Q1 )
• Coefficient of variation
• Skewness
• Kurtosis


• Absolute Measures give the dispersion in the same units as the


data and include range, standard deviation (and variance), and
mean absolute deviation.

• Relative Measures provide dispersion relative to the mean or


other central values and include the coefficient of variation, rela-
tive range, and quartile coefficient of dispersion.

We will explain some important measures of dispersion in detail in the following


sections.

3.4.1 Range
The range is the simplest measure of variability and is calculated as the differ-
ence between the maximum and minimum values in a dataset. Mathematically,
the range is defined as:
Range = xmax − xmin
In a study measuring the blood glucose levels of 10 patients, the highest reading
is 120 mg/dL and the lowest reading is 85 mg/dL. So, the range is

Range = 120 − 85 = 35 mg/dL

3.4.2 Variance
Variance measures the average squared deviation of each data point from the
mean. It reflects how data points spread out around the mean. Mathematically,
the sample variance is denoted by s2 and is defined as:

s² = (1/(n − 1)) Σ_{i=1}^{n} (xi − x̄)² = (1/(n − 1)) ( Σ_{i=1}^{n} xi² − n x̄² )

where x̄ is the arithmetic mean, xi represents the data points, and n is the
number of data points.

Consider the weights of 5 patients: 65, 70, 75, 80, 85 kg. We have,
x̄ = (65 + 70 + 75 + 80 + 85)/5 = 75

s² = (1/(5 − 1)) [ (65 − 75)² + (70 − 75)² + (75 − 75)² + (80 − 75)² + (85 − 75)² ]
= (1/4) [100 + 25 + 0 + 25 + 100] = 250/4 = 62.5 kg²


3.4.3 Standard Deviation


The standard deviation is the positive square root of the variance. It provides
a measure of dispersion in the same units as the data. Mathematically, the
sample standard deviation is denoted by s and is defined as:
s = √( (1/(n − 1)) Σ_{i=1}^{n} (xi − x̄)² ) = √( (1/(n − 1)) ( Σ_{i=1}^{n} xi² − n x̄² ) )

Using the variance from the previous example, the sample standard deviation
is

s = √62.5 ≈ 7.91 kg
Problem 3.10. A physician collected an initial measurement of hemoglobin
(g/L) after the admission of 10 inpatients to a hospital’s department of cardiol-
ogy. The hemoglobin measurements were 139, 158, 120, 112, 122, 132, 97, 104,
159, and 129 g/L. Calculate the variance and standard deviation of hemoglobin
level.

Solution
The variance is calculated in the following way:

xi | xi²
139 | 19321
158 | 24964
120 | 14400
112 | 12544
122 | 14884
132 | 17424
97 | 9409
104 | 10816
159 | 25281
129 | 16641
Σ xi = 1272 | Σ xi² = 165684


The sample mean and variance:

x̄ = 1272/10 = 127.2 g/L

s² = (1/(n − 1)) ( Σ_{i=1}^{n} xi² − n x̄² ) = (1/(10 − 1)) (165684 − 10 × 127.2²) ≈ 431.7 (g/L)²

The sample standard deviation is s = √431.7 ≈ 20.8 g/L. Thus, the standard
deviation of hemoglobin for these 10 patients was 20.8 g/L.
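The same results follow in one line each with NumPy, using ddof=1 to get the sample (n − 1) versions; a quick sketch:

import numpy as np

hb = [139, 158, 120, 112, 122, 132, 97, 104, 159, 129]

print(np.mean(hb))          # 127.2
print(np.var(hb, ddof=1))   # about 431.7 (sample variance)
print(np.std(hb, ddof=1))   # about 20.8 (sample standard deviation)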

3.4.4 Measures of Variability for Grouped Data


Variance for Grouped Data
The variance for grouped data is given by the formula:

s² = Σ fi (mi − x̄)² / ( Σ fi − 1 )

To compute the sample variance for the grouped data, we define di = (mi − x̄) and hence

s² = Σ fi di² / ( Σ fi − 1 ).
Now, we can easily compute s2 . The detailed calculations are given in Table
3.2.
Table 3.2: Variance calculation for grouped data

fi | di = (mi − x̄) | di² | fi di²
4 | -30.4 | 924.16 | 3696.64
3 | -18.4 | 338.56 | 1015.68
9 | -6.4 | 40.96 | 368.64
5 | 5.6 | 31.36 | 156.80
5 | 17.6 | 309.76 | 1548.80
4 | 29.6 | 876.16 | 3504.64
Total = 30 | | | 10291.20


where
x̄ = 396.4
Sample variance:
s² = 10291.20/(30 − 1) = 10291.20/29 ≈ 354.87

Standard deviation for Grouped Data


Standard deviation is simply the positive square root of variance. Hence, the
standard deviation is defined as

s = √( Σ fi (mi − x̄)² / ( Σ fi − 1 ) )

Standard deviation:

s = √354.87 ≈ 18.84
3.5 Exercises
1. Given the following dataset of temperatures (in degrees Celsius) recorded
over a week:
22, 25, 19, 23, 24, 20, 21
(a) Calculate the range of the temperatures.
2. Consider the following dataset representing the scores of 8 students in a
test:
75, 85, 95, 80, 90, 70, 88, 92

(a) Compute the variance of the test scores.


(b) Determine the standard deviation of the test scores.

3. The following dataset represents the frequency distribution of exam scores:


• Class Intervals:

[50 − 60, 60 − 70, 70 − 80, 80 − 90, 90 − 100]

• Frequencies:
[6, 10, 15, 12, 7]

(a) Calculate the mean of the grouped data.


(b) Compute the variance and standard deviation of the grouped data.


4. A dataset of the number of hours spent by patients in a hospital (per day)


is as follows:
1.5, 2.0, 2.5, 3.0, 1.0, 2.2, 2.8
(a) Find the range of the hours spent in the hospital.
(b) Compute the variance of the dataset.

5. A company tracks the number of units sold each month for the last 6
months:
150, 170, 160, 180, 175, 165
(a) Calculate the standard deviation of the units sold.

6. A class of students took two different tests with the following scores:
• Test 1 Scores:
78, 82, 85, 88, 90
• Test 2 Scores:
72, 80, 78, 85, 90

(a) Compute the mean and median for both tests.


(b) Compute the variance and standard deviation for both tests.

7. Consider a dataset of monthly income grouped into intervals with corre-


sponding frequencies:
• Income Intervals:

[2000 − 3000, 3000 − 4000, 4000 − 5000, 5000 − 6000]



• Frequencies:
[5, 8, 12, 6]

(a) Calculate the mean income.


(b) Compute the variance and standard deviation of the grouped income
data.

8. A company records the monthly sales (in thousands of dollars) over the
last year:
45, 52, 48, 55, 50, 47, 53, 60, 49, 51, 54, 57
(a) Find the range of the monthly sales data.
(b) Find the standard deviation of the monthly sales data.


3.6 Measures of Distribution Shape


Skewness and kurtosis are important descriptive statistics that measure the
shape characteristics of a data distribution. Skewness quantifies the asymmetry
of the distribution around its mean. Kurtosis, on the other hand, measures the
“tailedness” of the distribution. Both measures provide critical insights into
the distribution’s shape, helping to understand the underlying characteristics
of the data.

3.6.1 Skewness
Skewness is a statistical measure that characterizes the degree of asymme-

try of a distribution around its mean. It indicates whether the data are
concentrated more on one side of the mean compared to the other.

Coefficient of Skewness
The coefficient of skewness is a standardized measure of skewness that allows
for comparison of the degree of asymmetry between different distributions. It
provides insight into the direction and extent of the skew of a data distribution.

There are several coefficients of skewness used to calculate skewness, each


providing a different perspective on the skewness of a distribution. Here are
two commonly used Coefficients:
(i). Fisher-Pearson Coefficient of Skewness

(ii). Pearson Median Coefficient of Skewness

Fisher-Pearson Coefficient of Skewness



The Fisher-Pearson coefficient measures the asymmetry of the distribution


around the mean using the formula:
Skewness (Sk) = ( n / ((n − 1)(n − 2)) ) Σ_{i=1}^{n} ((xi − x̄)/s)³        (3.1)
where:

• n is the number of observations,

• xi represents each data point,

• x̄ is the sample mean,

• s is the sample standard deviation.

The Fisher-Pearson coefficient standardizes the third central moment of the


distribution. A skewness of 0 indicates a perfectly symmetrical distribution.


Pearson Median Coefficient of Skewness


The Pearson median coefficient calculates skewness based on the mean and the
median:
 
Skewness (Sk) = 3 (x̄ − Median)/s        (3.2)
where:
• Median is the median of the data,

This method uses the difference between the mean and the median to gauge

skewness.

Interpretation
• Sk = 0: The distribution is symmetric.


• Sk > 0: Positive skewness (right-skewed distribution).

• Sk < 0: Negative skewness (left-skewed distribution).

Types of Skewness
Skewness can be categorized into three main types based on the direction of
the asymmetry:

Positive Skewness (Right Skewed)


A distribution with positive skewness has a tail that extends more towards the
higher values on the right side. Most of the data points are concentrated on
the left side of the mean.

• Characteristics: Mean > Median

• Visual Description: The tail on the right side of the distribution is


longer or fatter.

• Example: Income distribution where a few individuals earn much


more than the majority.

Zero Skewness (Symmetric Distribution)


A distribution with zero skewness is perfectly symmetrical around the mean.
The data points are evenly distributed on both sides of the mean.

• Characteristics: Mean = Median = Mode


• Visual Description: A bell-shaped curve typical of a normal distri-


bution.

• Example: Heights of a large, diverse group of people.

Negative Skewness (Left Skewed)


A distribution with negative skewness has a tail that extends more towards the
lower values on the left side. Most of the data points are concentrated on the
right side of the mean.

• Characteristics: Mean < Median

• Visual Description: The tail on the left side of the distribution is
longer or fatter.

• Example: Exam scores where most students score well, but a few
score significantly lower.
Problem 3.11. Consider the following dataset of the number of hours spent
studying per day by a group of students: 2, 4, 4, 4, 5, 6, 8, and 10 hours.
Calculate the skewness.

Solution
Fisher-Pearson Coefficient of Skewness: The mean is
x̄ = ( Σ_{i=1}^{n} xi ) / n = (2 + 4 + 4 + 4 + 5 + 6 + 8 + 10)/8 = 43/8 = 5.375 hours

Table 3.3: Calculations for Skewness

xi | x̄ | xi − x̄ | (xi − x̄)² | (xi − x̄)/s | ((xi − x̄)/s)³
2.0000 | 5.3750 | -3.3750 | 11.3906 | -1.3184 | -2.2914
4.0000 | 5.3750 | -1.3750 | 1.8906 | -0.5371 | -0.1549
4.0000 | 5.3750 | -1.3750 | 1.8906 | -0.5371 | -0.1549
4.0000 | 5.3750 | -1.3750 | 1.8906 | -0.5371 | -0.1549
5.0000 | 5.3750 | -0.3750 | 0.1406 | -0.1465 | -0.0031
6.0000 | 5.3750 | 0.6250 | 0.3906 | 0.2441 | 0.0146
8.0000 | 5.3750 | 2.6250 | 6.8906 | 1.0254 | 1.0781
10.0000 | 5.3750 | 4.6250 | 21.3906 | 1.8066 | 5.8968
Total: 43 | | | 45.875 | | 4.230


Using the above table, the variance (s²) is

s² = Σ_{i=1}^{n} (xi − x̄)² / (n − 1) = 45.875/(8 − 1) = 6.5536

and hence

s = √6.5536 = 2.56 hours

and

Σ_{i=1}^{n} ((xi − x̄)/s)³ = 4.230

Skewness formula:

Skewness = ( n / ((n − 1)(n − 2)) ) Σ_{i=1}^{n} ((xi − x̄)/s)³.

For n = 8:

Skewness = (8/((8 − 1)(8 − 2))) × 4.230 = (8/42) × 4.230 = 0.8057

The skewness of the dataset {2, 4, 4, 4, 5, 6, 8, 10} is approximately 0.8057.


This positive value indicates that the distribution is right-skewed, with a longer
tail on the right side.

Pearson Median Coefficient of Skewness: Since the dataset has an even


number of observations:
Median = (4 + 5)/2 = 4.5

Hence, the skewness is

Skewness (Sk) = 3 (x̄ − Median)/s = 3 × (5.375 − 4.5)/2.56 = 1.0254
The skewness of the dataset is approximately 1.0254. This positive value
indicates that the distribution is right-skewed, meaning it has a longer tail on
the right side.
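Both coefficients are easy to verify in Python. scipy.stats.skew with bias=False applies the same sample-size correction as Equation (3.1), and the Pearson median version of Equation (3.2) can be computed directly; a minimal sketch:

import numpy as np
from scipy.stats import skew

hours = [2, 4, 4, 4, 5, 6, 8, 10]

# Fisher-Pearson coefficient with the correction of Eq. (3.1)
print(skew(hours, bias=False))  # about 0.8057

# Pearson median coefficient of Eq. (3.2)
x_bar = np.mean(hours)
med = np.median(hours)
s = np.std(hours, ddof=1)
print(3 * (x_bar - med) / s)    # about 1.0254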

3.6.2 Kurtosis
Even with knowledge of central tendency, dispersion, and skewness, we still
don’t have a full understanding of a distribution. To gain a complete perspec-
tive on the shape of the distribution, we also need to consider kurtosis. Kurtosis
is a statistical measure that describes the shape, or peakedness, of the probabil-
ity distribution of a real-valued random variable. It indicates whether the data


are heavy-tailed or light-tailed relative to a normal distribution. A distribution


with positive kurtosis has a sharp peak and heavy tails, whereas a distribution
with negative kurtosis has a flatter peak and lighter tails compared to the nor-
mal distribution.

The sample version of Karl Pearson’s Measures of Kurtosis is

K = [ (1/n) Σ_{i=1}^{n} (xi − x̄)⁴ ] / [ (1/n) Σ_{i=1}^{n} (xi − x̄)² ]² − 3
Alternatively, many software programs (such as Excel’s KURT function, which
uses a bias-corrected formula) use the following formula to measure kurtosis.

K = ( n(n + 1) / ((n − 1)(n − 2)(n − 3)) ) Σ_{i=1}^{n} ((xi − x̄)/s)⁴ − 3(n − 1)² / ((n − 2)(n − 3))

Both formulas aim to measure the kurtosis of a dataset, but the second
formula includes additional terms to correct for bias, making it more accurate
for small sample sizes. The term −3 in the simpler formula adjusts for ex-
cess kurtosis, normalizing the kurtosis value to compare it against a normal
distribution.

Interpretation
• K = 0: The distribution has the same kurtosis as a normal distribution
(mesokurtic).

• K > 0: Leptokurtic distribution (more outliers, heavier tails).

• K < 0: Platykurtic distribution (fewer outliers, lighter tails).

Problem 3.12. Consider Problem 3.11. Calculate the kurtosis of the following
dataset, which represents the number of hours spent studying per day by a group
of students: 2, 4, 4, 4, 5, 6, 8, and 10 hours.

Solution
In the solution to Problem 3.11, the mean is x̄ = 5.375 hours and the standard
deviation is s = 2.56 hours. The calculation for the kurtosis is provided in Table
3.4.


Table 3.4: Calculations for Kurtosis

xi | x̄ | xi − x̄ | (xi − x̄)² | (xi − x̄)/s | ((xi − x̄)/s)⁴
2.0000 | 5.3750 | -3.3750 | 11.3906 | -1.3184 | 3.0209
4.0000 | 5.3750 | -1.3750 | 1.8906 | -0.5371 | 0.0832
4.0000 | 5.3750 | -1.3750 | 1.8906 | -0.5371 | 0.0832
4.0000 | 5.3750 | -1.3750 | 1.8906 | -0.5371 | 0.0832
5.0000 | 5.3750 | -0.3750 | 0.1406 | -0.1465 | 0.0005
6.0000 | 5.3750 | 0.6250 | 0.3906 | 0.2441 | 0.0036
8.0000 | 5.3750 | 2.6250 | 6.8906 | 1.0254 | 1.1055
10.0000 | 5.3750 | 4.6250 | 21.3906 | 1.8066 | 10.6535
Total: 43 | | | 45.8750 | | 15.0336
Using the above table, we have

Σ_{i=1}^{n} ((xi − x̄)/s)⁴ = 15.0336

Bias-Correction Factor = n(n + 1) / ((n − 1)(n − 2)(n − 3)) = (8 × 9)/(7 × 6 × 5) = 0.343

Correction Term = 3(n − 1)² / ((n − 2)(n − 3)) = (3 × 7²)/(6 × 5) = 147/30 = 4.9

So,

K = 0.343 × 15.0336 − 4.9 ≈ 0.2543
The excess kurtosis of the dataset is approximately 0.2543. The positive value
suggests that the distribution is leptokurtic. This means the distribution has
heavier tails and is more peaked compared to a normal distribution, indicating
a higher probability of extreme values or outliers.
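The bias-corrected formula above is what scipy.stats.kurtosis reports when called with bias=False (fisher=True is the default, so the result is excess kurtosis); a quick check:

from scipy.stats import kurtosis

hours = [2, 4, 4, 4, 5, 6, 8, 10]

# Bias-corrected excess kurtosis, matching the hand calculation
print(kurtosis(hours, fisher=True, bias=False))  # about 0.2543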

3.6.3 Coefficient of Variation


The coefficient of variation (CV) is a normalized measure of dispersion that
expresses the standard deviation as a percentage of the mean. Mathematically,
the CV is defined as:

CV = (s/x̄) × 100%


If the mean blood pressure is x̄ = 128.67 mmHg and the standard deviation
is s = 7.91 mmHg, the CV is:
CV = (7.91/128.67) × 100% ≈ 6.15%

3.7 Exercises
1. Given the following dataset representing the monthly number of new cus-
tomer sign-ups for a company:

12, 15, 14, 16, 21, 25, 30, 35, 40, 50

Calculate the skewness of the dataset. Use a statistical software or formula
for skewness calculation.
2. The following dataset represents the heights (in cm) of 10 students:

150, 155, 160, 165, 170, 175, 180, 185, 190, 195
Determine the kurtosis of the height dataset. Use a statistical software or
formula for kurtosis calculation.
3. Consider the dataset representing the weekly earnings (in dollars) of 5
freelancers:
400, 420, 450, 480, 500
(a) Calculate the mean and standard deviation of the earnings.
(b) Compute the coefficient of variation (CV) for the earnings dataset.
4. A company tracks monthly sales figures (in thousands of dollars) for a
year:

40, 42, 44, 45, 47, 50, 52, 55, 60, 65, 70, 75
(a) Calculate the skewness of the sales data.
(b) Determine the kurtosis of the sales data.
5. A dataset of exam scores for a class is given as follows:

88, 76, 92, 85, 79, 90, 82

(a) Find the mean and standard deviation of the exam scores.
(b) Compute the coefficient of variation (CV) for the exam scores.
6. The daily maximum temperatures (in degrees Celsius) for a week are:

18, 20, 22, 24, 26, 28, 30

(a) Compute the skewness of the temperature data.


(b) Determine the interquartile range (IQR) of the temperature data.


3.8 Quartiles, Percentiles, Deciles and Outlier


Detection
Quartiles and percentiles are used to summarize data distributions and identify
specific points within the data set.

3.8.1 Quartiles
Quartiles provide a concise summary of a data set, highlighting its central ten-
dency and variability without requiring a full description of the data. They help
identify the spread of the data, allowing you to see how values are distributed.

Quartiles divide a data set into four equal parts, each representing 25% of the
data. The range between the first quartile (Q1 ) and the third quartile (Q3 ) is
known as the interquartile range (IQR), which indicates where the middle 50%
of the data lies.

Suppose we have a dataset x1 , x2 , . . . , xn of size n, and let x(1) , x(2) , . . . , x(n)


denote the ordered version of this dataset. The kth quartile can be defined
as follows:

Qk = x(⌊m⌋) + f · ( x(⌈m⌉) − x(⌊m⌋) ),  k = 1, 2, 3

where

m = ( (n + 1)/4 ) × k
is the position in the ordered dataset, and:

• f = m − ⌊m⌋
• x(⌊m⌋) is the value at the integer part of the mth position

• x(⌈m⌉) is the value at the next position

If m is an integer, then f = 0, and in this case:

Qk = x(m)

is the value at the mth position of the ordered data set.

• Q1 (First Quartile): The value below which 25% of the data falls.

• Q2 (Second Quartile or Median): The value below which 50% of


the data falls.

• Q3 (Third Quartile): The value below which 75% of the data falls.


Remark 3.8.1. Ceiling Function: The ceiling function, denoted as ⌈m⌉, is


defined as:

⌈m⌉ = the smallest integer greater than or equal to m

Floor Function: The floor function, denoted as ⌊m⌋, is defined as:

⌊m⌋ = the largest integer less than or equal to m

For example, for the floor function: ⌊2.4⌋ = 2, and for the ceiling function:
⌈2.4⌉ = 3.
Problem 3.13. Suppose that a college placement office sent a questionnaire

to a sample of business school graduates requesting information on monthly
starting salaries. Table 3.5 shows the collected data.

Table 3.5: Monthly Starting Salaries for a Sample of 12 Business School Grad-
uates
AF Graduate
1
Monthly Starting Salary ($)
5850
2 5950
3 6050
4 5880
5 5755
6 5710
7 5890
DR

8 6130
9 5940
10 6325
11 5920
12 5880

Compute the quartiles of the monthly starting salary for the sample of 12
business college graduates.

Solution
To compute the quartiles, we first sort the data in ascending order:

5710, 5755, 5850, 5880, 5880, 5890, 5920, 5940, 5950, 6050, 6130, 6325


Quartile Computations
• Minimum: 5710

• Maximum: 6325

• Median (50th Percentile, Q2 ):


The median is the average of the 6th and 7th values:
Q2 = (5890 + 5920)/2 = 11810/2 = 5905

• First Quartile (25th Percentile, Q1 ):

For n = 12, the value of m = ((12 + 1)/4) × 1 = 3.25, i.e., the 3.25th position:

Q1 = x(3) + 0.25 · ( x(4) − x(3) )
= 5850 + 0.25 × (5880 − 5850)
= 5850 + 0.25 × 30
= 5850 + 7.5
= 5857.5

• Third Quartile (75th Percentile, Q3 ):


For n = 12, the value of m = ((12 + 1)/4) × 3 = 9.75, i.e., the 9.75th position:

Q3 = x(9) + 0.75 · ( x(10) − x(9) )
= 5950 + 0.75 × (6050 − 5950)
= 5950 + 0.75 × 100 = 5950 + 75
= 6025

The quartiles for the monthly starting salaries are:

• Minimum: 5710

• 1st Quartile (Q1 ): 5857.5

• Median (Q2 ): 5905

• 3rd Quartile (Q3 ): 6025

• Maximum: 6325
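NumPy reproduces these values when told to use the m = k(n + 1)/4 positioning rule from Section 3.8.1, which corresponds to method='weibull' (the method argument requires NumPy 1.22 or newer; the default method='linear' positions by (n − 1) and gives slightly different quartiles). A sketch:

import numpy as np

salaries = [5850, 5950, 6050, 5880, 5755, 5710,
            5890, 6130, 5940, 6325, 5920, 5880]

# method='weibull' implements the (n + 1)-based positions used above
q1, q2, q3 = np.percentile(salaries, [25, 50, 75], method="weibull")
print(q1, q2, q3)  # 5857.5 5905.0 6025.0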

3.8.2 Percentiles
Percentiles divide a data set into 100 equal parts, providing a more detailed
breakdown of the data distribution. A kth percentile can be defined as



Pk = x(⌊m⌋) + f · ( x(⌈m⌉) − x(⌊m⌋) ),  k = 1, 2, . . . , 99

where

m = ( (n + 1)/100 ) × k

is the position in the ordered data set.

• P10 (10th Percentile): The value below which 10% of the data falls.

• P25 (25th Percentile): The value below which 25% of the data falls.
(Q1)

• P50 (50th Percentile): The value below which 50% of the data falls.
(Median)

• P75 (75th Percentile): The value below which 75% of the data falls. (Q3)

• P90 (90th Percentile): The value below which 90% of the data falls.

3.8.3 Deciles
Deciles divide a data set into ten equal parts, representing specific percentiles.
A kth decile can be defined as

Dk = x(⌊m⌋) + f · ( x(⌈m⌉) − x(⌊m⌋) ),  k = 1, 2, . . . , 9

where

m = ( (n + 1)/10 ) × k

is the position in the ordered data set.

• D1 (1st Decile): The value below which 10% of the data falls.

• D2 (2nd Decile): The value below which 20% of the data falls.

• D3 (3rd Decile): The value below which 30% of the data falls.

• D4 (4th Decile): The value below which 40% of the data falls.

• D5 (5th Decile): The value below which 50% of the data falls. (Me-
dian)

• D6 (6th Decile): The value below which 60% of the data falls.

• D7 (7th Decile): The value below which 70% of the data falls.

• D8 (8th Decile): The value below which 80% of the data falls.


• D9 (9th Decile): The value below which 90% of the data falls.

• D10 (10th Decile): The value below which 100% of the data falls.
(Maximum)

3.8.4 Interquartile Range (IQR)


The IQR measures the range within which the middle 50% of the data falls,
from the first quartile (Q1 ) to the third quartile (Q3 ). Hence, the interquartile
range is defined as
IQR = Q3 − Q1

In a dataset of test scores: 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, the first quar-
tile (Q1 ) is 62.5 and the third quartile (Q3 ) is 87.5. Hence, the interquartile
range is
IQR = 87.5 − 62.5 = 25

3.8.5 Outlier Detection


In statistics, an outlier is a data point that is significantly different from other
observations. We can detect outliers using the interquartile range (IQR). Given
a dataset x1 , x2 , . . . , xn , the outlier detection can be described as follows:

Define Mild and Extreme Outliers:

• Mild Outlier: A data point xi is considered a mild outlier if

xi < Q1 − 1.5 × IQR or xi > Q3 + 1.5 × IQR

• Extreme Outlier: A data point xi is considered an extreme outlier



if
xi < Q1 − 3 × IQR or xi > Q3 + 3 × IQR

Problem 3.14. Consider the following systolic blood pressure (SBP) readings
(in mmHg):

165, 50, 110, 120, 125, 130, 135, 140, 145, 150, 155,

160, 175, 180, 185, 190, 195, 200, 115, 220, 170
Calculate the first and third quartiles and then determine mild and extreme
outliers.

Solution
We will calculate the quartiles, IQR, and determine mild and extreme outliers
by following the steps. The sorted data is:


50, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155,
160, 165, 170, 175, 180, 185, 190, 195, 200, 220
The number of data points is n = 21.

Finding Q1
m = ((n + 1)/4) × 1 = ((21 + 1)/4) × 1 = 5.5
Using the formula:

Q1 = x⌊5.5⌋ + f · (x⌈5.5⌉ − x⌊5.5⌋ )

where: ⌊5.5⌋ = 5, ⌈5.5⌉ = 6, f = 5.5 − 5 = 0.5. Substituting values:

Q1 = 125 + 0.5 · (130 − 125) = 125 + 0.5 · 5 = 125 + 2.5 = 127.5


Finding Q3

m = ((n + 1)/4) × 3 = ((21 + 1)/4) × 3 = (22/4) × 3 = 16.5

Using the formula:

Q3 = x⌊16.5⌋ + f · (x⌈16.5⌉ − x⌊16.5⌋ )

where ⌊16.5⌋ = 16, ⌈16.5⌉ = 17, f = 16.5 − 16 = 0.5. Substituting values:

Q3 = 180 + 0.5 · (185 − 180) = 180 + 0.5 · 5 = 180 + 2.5 = 182.5



Compute IQR:

IQR = Q3 − Q1 = 182.5 − 127.5 = 55


Determine Outlier Thresholds:
• Mild Outliers:

Lower Mild Threshold = Q1 − 1.5 × IQR = 127.5 − 1.5 × 55


= 127.5 − 82.5 = 45

Upper Mild Threshold = Q3 + 1.5 × IQR = 182.5 + 1.5 × 55


= 182.5 + 82.5 = 265

87
CHAPTER 3. DATA EXPLORATION: NUMERICAL MEASURES

• Extreme Outliers:

Lower Extreme Threshold = Q1 − 3 × IQR = 127.5 − 3 × 55


= 127.5 − 165 = −37.5

Upper Extreme Threshold = Q3 + 3 × IQR = 182.5 + 3 × 55


= 182.5 + 165 = 347.5

Identify Outliers:
• Mild Outlier: Mild outliers are values below 45 or above 265. The
readings do not have any mild outliers.

• Extreme Outlier: Extreme outliers are values below -37.5 or above

347.5. Again, there are no extreme outliers.
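The same fences can be computed programmatically. This is a small sketch of an IQR-based detector; it uses method='weibull' so that the quartiles match the (n + 1) rule used above:

import numpy as np

def iqr_outliers(data, factor=1.5):
    # factor=1.5 flags mild outliers, factor=3 extreme outliers
    q1, q3 = np.percentile(data, [25, 75], method="weibull")
    iqr = q3 - q1
    lower, upper = q1 - factor * iqr, q3 + factor * iqr
    return [x for x in data if x < lower or x > upper]

sbp = [165, 50, 110, 120, 125, 130, 135, 140, 145, 150, 155,
       160, 175, 180, 185, 190, 195, 200, 115, 220, 170]

print(iqr_outliers(sbp, factor=1.5))  # [] (no mild outliers)
print(iqr_outliers(sbp, factor=3))    # [] (no extreme outliers)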

3.8.6 Python Code: Dispersion Measures


In this section, we will demonstrate how to compute the range, variance, stan-
dard deviation, coefficient of variation, quartiles, interquartile range (IQR),
skewness, and kurtosis using Python. Consider the following dataset:

{10, 15, 15, 20, 25, 30}

import numpy as np
from scipy.stats import iqr, skew, kurtosis

# Sample data
data = [10, 15, 15, 20, 25, 30]

# Range
data_range = np.ptp(data)
print(f"Range: {data_range}")

# Variance (sample, ddof=1)
variance = np.var(data, ddof=1)
print(f"Variance: {variance}")

# Standard Deviation (sample, ddof=1)
std_deviation = np.std(data, ddof=1)
print(f"Standard Deviation: {std_deviation}")

# Coefficient of Variation
coef_of_variation = std_deviation / np.mean(data)
print(f"Coefficient of Variation: {coef_of_variation}")

# Quartiles
quartiles = np.percentile(data, [25, 50, 75])
print(f"Quartiles (25th, 50th, 75th): {quartiles}")

# Interquartile Range (IQR)
interquartile_range = iqr(data)
print(f"Interquartile Range (IQR): {interquartile_range}")

# Skewness
data_skewness = skew(data)
print(f"Skewness: {data_skewness}")

# Kurtosis (excess kurtosis)
data_kurtosis = kurtosis(data)
print(f"Kurtosis: {data_kurtosis}")
The output of the code will be:
• Range: 20

• Variance: 54.17

• Standard Deviation: 7.36

• Coefficient of Variation: 0.38

• Quartiles (25th, 50th, 75th): [15. 17.5 23.75]

• Interquartile Range (IQR): 8.75

• Skewness: 0.306

• Kurtosis: -1.15

3.9 Exercises
1. Given the following dataset representing the monthly expenses (in dollars)
of 12 households:
450, 600, 550, 620, 700, 480, 510, 540, 580, 660, 710, 690
Compute the first quartile, the second quartile ( or median), and the third
quartile of the dataset.
2. Given the following dataset representing the ages of 12 participants in a
study:
22, 25, 28, 30, 32, 35, 37, 40, 42, 45, 48, 50
Compute first quartile, third quartile and the interquartile range (IQR)
of the ages.


3. The dataset represents the heights (in cm) of 20 plants measured over a
period:

40, 42, 45, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63

Compute the 1st decile (D1), 5th decile (D5, which is the median), and
the 9th decile (D9) of the plant heights.
4. Consider the following dataset representing the number of books read by
10 students in a year:

12, 15, 18, 20, 22, 25, 28, 30, 35, 100

(a) Compute Q1 and Q3 .
(b) Identify any outliers in the dataset using the Interquartile Range
(IQR) method.
5. Given the following dataset of monthly rainfall (in mm) for 8 cities:
100, 110, 120, 150, 170, 190, 200, 220

(a) Compute the quartiles (Q1, Q2, Q3) for the dataset.
(b) Detect any potential outliers using the IQR method.
6. The following dataset represents the scores of 25 students in an exam:

45, 47, 48, 50, 51, 53, 55, 57, 58, 60, 62, 63, 65,

67, 68, 70, 72, 74, 76, 78, 80, 82, 84, 86, 88
(a) Calculate the 10th percentile (P10) and the 80th percentile (P80) of
the exam scores.

(b) Detect any potential outliers using the IQR method.


7. A dataset of annual salaries (in thousands of dollars) is as follows:

30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 110, 120, 130, 140, 150

(a) Find the 3rd decile (D3) and 7th decile (D7) of the salary data.
(b) Use the IQR method to detect any outliers in the salary data.
8. The following dataset represents the number of hours spent on the internet
per week by a sample of 20 people:

5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24

(a) Compute the quartiles (Q1, Q2, Q3) for the dataset.
(b) Find the 25th percentile (P25) and the 75th percentile (P75) of the
dataset.


3.10 Five-Number Summary and Boxplot


3.10.1 Five-Number Summary
The five-number summary is a set of descriptive statistics that provides a com-
prehensive overview of a dataset. It consists of the
(i). smallest (or minimum) value

(ii). first quartile (Q1 )

(iii). median (also called second quartile (Q2 ))

(iv). third quartile (Q3 )

(v). largest (or maximum) value


These five statistics help describe the spread and center of the data.
Problem 3.15. Consider the following data on the cholesterol levels (in mg/dL)
of 15 patients:
180, 195, 170, 200, 210, 175, 205, 190, 195, 220, 185, 215, 200, 190, 225
Find the five-number summary.

Solution
To find the five-number summary, we first arrange the data in ascending order:
170, 175, 180, 185, 190, 190, 195, 195, 200, 200, 205, 210, 215, 220, 225
• Minimum: The smallest value in the dataset.

Minimum = 170 mg/dL

• First Quartile (Q1 ): The median of the lower half of the dataset (not
including the median if the number of observations is odd).
Q1 = 185 mg/dL

• Median: The middle value of the dataset.


Median = 195 mg/dL

• Third Quartile (Q3 ): The median of the upper half of the dataset
(not including the median if the number of observations is odd).
Q3 = 210 mg/dL

• Maximum: The largest value in the dataset.


Maximum = 225 mg/dL
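A minimal Python sketch for the five-number summary follows; method='weibull' matches the quartiles found above, while NumPy's default linear method would give Q1 = 187.5 and Q3 = 207.5 for this dataset:

import numpy as np

chol = [180, 195, 170, 200, 210, 175, 205, 190,
        195, 220, 185, 215, 200, 190, 225]

q1, med, q3 = np.percentile(chol, [25, 50, 75], method="weibull")
print(min(chol), q1, med, q3, max(chol))  # 170 185.0 195.0 210.0 225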


3.10.2 Boxplot
A boxplot (or box-and-whisker plot) is a graphical representation of the five-
number summary. It displays the median, quartiles, and potential outliers.

Components of the Boxplot


• Minimum (Min): The lowest data point within the whiskers.

• First Quartile (Q1 ): The 25th percentile of the data.

• Median: The 50th percentile, dividing the dataset into two equal
halves.

• Third Quartile (Q3 ): The 75th percentile of the data.

• Maximum (Max): The highest data point within the whiskers.

• Outliers: Data points that fall outside the range of 1.5 times the IQR
from Q1 and Q3 .

Figure 3.1 below illustrates these components.

Figure 3.1: Detailed boxplot illustrating the distribution of a sample dataset with components: LL (Lower whisker), Q1, Median, Q3, UL (Upper whisker), and Outliers, where LL = Q1 − 1.5 × IQR and UL = Q3 + 1.5 × IQR

When interpreting a boxplot, several key aspects should be focused on to


understand the distribution and variability of the data. The boxplot provides
a visual summary of the dataset’s central tendency, dispersion, and skewness.
First, examine the central box, which represents the interquartile range (IQR)
where the middle 50% of the data falls, with the line inside the box showing


the median. The position of the median within the box indicates the data’s
skewness: if it is centered, the data is symmetrical; if skewed towards one
end, it shows skewness. Second, look at the length of the whiskers extending
from the box, which represent the range of the data within 1.5 times the IQR
from the quartiles; data points beyond this range are considered outliers, which
are plotted as individual points. Third, check for outliers and extreme values,
which are represented as dots outside the whiskers. A higher number of outliers
might suggest variability or anomalies in the data. Fourth, observe the width
of the box and the lengths of the whiskers to assess data spread and identify
potential data dispersion. Finally, compare multiple boxplots side-by-side to
analyze differences between groups, noting shifts in the median, variations in

the IQR, and the presence of outliers. By focusing on these elements, you can
gain insights into the data’s distribution characteristics, identify patterns, and
make informed conclusions about the underlying data trends.

Remark 3.10.1. The lines extending from either end of the box are called
whiskers. The term whisker plot is often used interchangeably with “box-
plot”, focusing on the whiskers. It highlights the range of data within 1.5 times
the IQR from the quartiles but might not always show the box or median. Both
terms generally refer to the same plot, but “boxplot” is the more comprehensive
term, including the full depiction of the quartiles and median along with the
whiskers.
3.10.3 Importance of Boxplots
Boxplots are essential tools for:
• Visualizing Data Distribution: They show the range, quartiles,
and outliers of the data.
• Comparing Distributions: They allow for comparisons between dif-
ferent groups or datasets.

• Detecting Outliers: They help identify unusual data points that may
need further investigation.

• Understanding Variability: They show the spread and central ten-


dency of the data.
Problem 3.16. Refer to Problem 3.13. The monthly starting salaries, sorted
in ascending order, are:

5710, 5755, 5850, 5880, 5880, 5890, 5920, 5940, 5950, 6050, 6130, 6325

The five-number summary for the monthly starting salaries is as follows:


• Minimum: 5710

• 1st Quartile (Q1 ): 5857.5

• Median (Q2 ): 5905

• 3rd Quartile (Q3 ): 6025

• Maximum: 6325

Draw the boxplot to represent this data and comment on your findings.

Solution
From the solution of Problem 3.13, the five-number summary for the monthly starting
salaries is as follows:

• Minimum: 5710

• 1st Quartile (Q1 ): 5857.5

• Median (Q2 ): 5905

• 3rd Quartile (Q3 ): 6025

• Maximum: 6325

To find the lower and upper whiskers (bounds), we first calculate the interquar-
tile range (IQR):

IQR = Q3 − Q1 = 6025 − 5857.5 = 167.5



Using 1.5 times the IQR, we can calculate the lower and upper whiskers as
follows:

Lower whisker = Q1 − 1.5 × IQR = 5857.5 − 1.5 × 167.5 = 5606.25

Upper whisker = Q3 + 1.5 × IQR = 6025 + 1.5 × 167.5 = 6276.25


Checking for outliers
• Lower outlier: Any value below 5606.25 would be a lower outlier. The
smallest value is 5710, so there are no lower outliers.

• Upper outlier: Any value above 6276.25. The value 6325 is greater
than 6276.25, so 6325 is an upper outlier.
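
These fences and the outlier check can be verified with a short Python sketch (again using NumPy's 'weibull' method, which reproduces the (n + 1)-based quartiles computed above):

import numpy as np

# Monthly starting salaries from Problem 3.16 (sorted)
salaries = [5710, 5755, 5850, 5880, 5880, 5890,
            5920, 5940, 5950, 6050, 6130, 6325]

q1 = np.percentile(salaries, 25, method='weibull')  # 5857.5
q3 = np.percentile(salaries, 75, method='weibull')  # 6025.0
iqr = q3 - q1                                       # 167.5

lower = q1 - 1.5 * iqr   # 5606.25
upper = q3 + 1.5 * iqr   # 6276.25

outliers = [x for x in salaries if x < lower or x > upper]
print(outliers)          # [6325]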


Figure 3.2: Boxplot of monthly starting salaries

Comment on the Boxplot: The boxplot of the monthly starting salaries re-
veals a positively skewed distribution. The median (Q2 = 5905) is positioned
slightly closer to the lower quartile, and the right whisker is longer than the left
whisker, which indicates that there are some higher salaries pulling the data to
the right.
The lower whisker extends to 5710, while the upper whisker reaches
6130. The interquartile range (IQR), between Q1 = 5857.5 and Q3 = 6025, is
fairly narrow, indicating that the middle 50% of the data are clustered together.
However, the longer upper whisker and the presence of an outlier at 6325 show
that there are some higher salaries that deviate from the general pattern.
The outlier (6325) is a clear indication of the positive skewness in the
data, suggesting that while most starting salaries are within a consistent range,

a few are considerably higher.


Problem 3.17. We consider the following data on the cholesterol levels (in
mg/dL) of 15 patients:

180, 195, 170, 200, 210, 175, 205, 190, 195, 220, 185, 215, 200, 190, 275

Calculate the five-number summary and draw a boxplot to represent this data.

Solution
Given the ordered cholesterol levels:

170, 175, 180, 185, 190, 190, 195, 195, 200, 200, 205, 210, 215, 220, 275

• Minimum = 170

• Maximum = 275


• Median (Q2 ): Since there are 15 data points (odd), the median is the
middle value:

Q2 = x((n+1)/2) = x((15+1)/2) = x(8) = 195

• First Quartile (Q1 ): The quartile position is

m = (n + 1)/4 = (15 + 1)/4 = 4

The 4th value is:

Q1 = x(4) = 185

• Third Quartile (Q3 ): The position is

m = 3 × (n + 1)/4 = 3 × (15 + 1)/4 = 12

Therefore,

Q3 = x(12) = 210
• Calculate the IQR:
IQR = Q3 − Q1 = 210 − 185 = 25

• Calculate the Lower whisker:


Lower whisker = Q1 − 1.5 × IQR = 185 − 1.5 × 25 = 185 − 37.5 = 147.5

• Calculate the Upper whisker:


Upper whisker = Q3 + 1.5 × IQR = 210 + 1.5 × 25 = 210 + 37.5 = 247.5
The value 275 is considered an outlier since it is greater than the upper
whisker of 247.5.

Figure 3.3: Boxplot of Cholesterol Levels


The boxplot of cholesterol levels given in Figure 3.3 provides a clear visual
representation of the data distribution. The spread of the data is captured
by the range from the minimum value (170 mg/dL) to the maximum value
(275 mg/dL), as well as by the interquartile range (IQR) of 25 mg/dL, which
measures the spread of the middle 50% of the data between the first quartile
(185 mg/dL) and the third quartile (210 mg/dL). The value 275 mg/dL lies
above the upper whisker (247.5 mg/dL) and is therefore an outlier. The median
(195 mg/dL) lies closer to the first quartile, suggesting a slight right skewness
in the data, as the upper quartile range is wider than the lower quartile range.
Overall, the data appears relatively symmetric but with a mild skew to the right,
indicating that higher cholesterol values are slightly more spread out than the
lower values.

3.10.4 Python Code: Boxplot
To create a boxplot for the given data in Problem 3.13 in Python, you can use
the matplotlib library. Here’s the Python code that will generate a boxplot
for the monthly starting salaries provided in the table:
import matplotlib.pyplot as plt

# Monthly starting salaries
salaries = [5850, 5950, 6050, 5880, 5755, 5710, 5890, 6130,
            5940, 6325, 5920, 5880]

# Create a horizontal boxplot
plt.figure(figsize=(10, 6))
plt.boxplot(salaries, vert=False, patch_artist=True,
            boxprops=dict(facecolor='lightblue', color='blue'),
            whiskerprops=dict(color='blue'),
            capprops=dict(color='blue'),
            medianprops=dict(color='red'))

# Add titles and labels
plt.title('Boxplot of Monthly Starting Salaries for Business School Graduates')
plt.xlabel('Monthly Starting Salary ($)')
plt.grid(True)

# Show plot
plt.show()


3.11 Exercises
1. Consider the following dataset representing the daily temperatures (in °C)
recorded over 15 days:

18, 21, 20, 22, 24, 19, 23, 25, 27, 26, 28, 30, 29, 31, 32

(a) Compute the five-number summary for this dataset, including the
minimum, first quartile (Q1), median, third quartile (Q3), and max-
imum.
(b) Draw a boxplot and comment on your findings.

2. The following dataset represents housing prices (in thousands of dollars)
in a neighborhood:

220, 230, 250, 270, 290, 310, 330, 350, 370, 400

(a) Compute the five-number summary.


(b) Calculate the interquartile range (IQR) of the housing prices.
(c) Draw a boxplot and comment on your findings.

3. The following dataset represents the test scores of 15 students:

55, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78, 80, 82, 84, 86

(a) Compute the five-number summary.


(b) Find the 25th percentile and the 90th percentile of the test scores.
(c) Draw a boxplot and comment on your findings.

4. Given the following dataset of monthly sales (in thousands of dollars) for
a retail store over 12 months:

25, 30, 28, 35, 33, 32, 31, 29, 37, 40, 42, 38

(a) Create a boxplot for this dataset.


(b) Interpret the boxplot, identifying any potential outliers and describ-
ing the distribution of the data.
5. The dataset below represents the number of goals scored by a soccer team
over 10 games:
1, 2, 3, 3, 4, 5, 6, 6, 7, 8
(a) Calculate the five-number summary.
(b) Draw a boxplot based on the five-number summary.
(c) Label the components of the boxplot, including the minimum, Q1,
median, Q3, maximum, and any potential outliers.


6. Consider the dataset representing the heights (in cm) of 20 individuals:


160, 165, 170, 175, 180, 185, 190, 195, 200, 205,
210, 215, 220, 225, 230, 235, 240, 245, 250, 255
(a) Compute the five-number summary for the dataset.
(b) Draw a boxplot and discuss the importance of the boxplot in visu-
alizing the spread and identifying outliers.
7. Analyze the following dataset representing the number of hours spent on
homework per week by 15 students:

5, 6, 7, 8, 8, 9, 10, 10, 11, 12, 12, 13, 14, 15, 20
(a) Create a boxplot for this dataset.
(b) Identify and describe the distribution of the data, including any po-
tential outliers.
8. The following dataset represents the weights (in kg) of 18 animals:
2, 3, 3, 4, 5, 5, 6, 6, 7, 8, 8, 9, 10, 10, 11, 12, 13, 15
(a) Use Python to compute the five-number summary and create a box-
plot for this dataset.
(b) Write a brief explanation of how Python can be used to visualize
data distributions using boxplots.

3.12 Concluding Remarks


In conclusion, this chapter has provided a comprehensive overview of key nu-

merical measures used in data exploration. By understanding and applying


measures of central tendency, dispersion, and distribution shape, we equip our-
selves with the tools needed to analyze and interpret data effectively. Each
measure offers unique insights and, when used in combination, provides a ro-
bust understanding of the dataset’s characteristics.

The exploration of quartiles, percentiles, deciles, and outlier detection fur-


ther enhances our ability to segment and scrutinize data. The five-number
summary and boxplots serve as powerful visual tools for summarizing and un-
derstanding data distributions. Mastery of these concepts, along with the ac-
companying Python code examples, will significantly contribute to your data
science skills and enhance your ability to perform thorough data analysis.

We encourage you to apply these techniques to various datasets and practice


interpreting the results to gain a deeper understanding of their implications.
The exercises at the end of this chapter are designed to reinforce your learning
and provide hands-on experience with these important data exploration tools.


3.13 Chapter Exercises


1. A physician collected an initial measurement of hemoglobin (g/L) after the
admission of 10 inpatients to a hospital’s department of cardiology. The
hemoglobin measurements were 139, 158, 120, 112, 122, 132, 97, 104, 159, 129.

(a) Calculate the mean and median hemoglobin levels.


(b) Find the range of the hemoglobin measurements.
(c) Compute the variance and standard deviation of the hemoglobin
levels.

2. The average number of patients visiting a hospital each day over a week
is recorded as 150, 160, 170, 140, 155, 165, 160.

(a) What is the mean number of patients per day?


(b) Calculate the variance and standard deviation of the number of pa-
tients.
(c) Find the range of the daily patient counts.

3. The average score of a group of students on a test is 78. If 4 new students


with scores of 85, 90, 75, 80 are added, what will be the new average score
for the group?

(a) Compute the new average score for the group.


(b) Find the variance and standard deviation of the test scores after
adding the new students.

4. Compare the arithmetic means of the following two datasets:



• Dataset A: 55, 60, 65, 70, 75


• Dataset B: 50, 60, 70, 80, 90

(a) Compare the means of Dataset A and Dataset B.


(b) Compute the range, variance, and standard deviation for Dataset A.
(c) Compute the range, variance, and standard deviation for Dataset B.

5. If a class has two sections with 25 and 30 students, and the average scores
for the sections are 85 and 90, respectively, find the weighted average
score for the entire class.

(a) Find the weighted average score for the entire class.
(b) Calculate the variance and standard deviation of the scores, assum-
ing that the scores in each section are the same.


6. A researcher collects the following data on the number of new cases of a


disease per month: 20, 25, 30, 35, 40, 45, 50.

(a) What is the median number of new cases?


(b) Compute the mean number of new cases.
(c) Find the range, variance, and standard deviation of the number of
new cases.

7. For the dataset of daily temperatures: 22, 24, 21, 25, 23, 26, 24, 27,

(a) Calculate the median temperature.

(b) Find the mean, range, variance, and standard deviation of the daily
temperatures.

8. Calculate the medians of the following datasets:

• Dataset A: 5, 8, 7, 6, 10
• Dataset B: 3, 6, 5, 7, 8, 9

(a) Find the median of Dataset A and Dataset B.


(b) For Dataset A and Dataset B, calculate the range, variance, and
standard deviation.

9. In a study on the number of hours spent on exercise per week by a group


of individuals, the hours are recorded as 4, 6, 5, 7, 9, 5, 6, 7.

(a) Find the median number of hours spent exercising per week.
(b) Compute the mean, range, variance, and standard deviation of the

hours spent on exercise.

10. A car travels at speeds of 50 km/h for the first part of the trip and
70 km/h for the second part. If the distances traveled are the same,
what is the harmonic mean of the two speeds?

(a) Calculate the harmonic mean of the two speeds.


(b) Discuss how the harmonic mean is used to find the average rate in
this context.

11. A biostatistics researcher records the growth rates of a certain plant


species over three months as 1.2, 1.5, 1.8.

(a) What is the geometric mean growth rate?


(b) Find the variance and standard deviation of the growth rates.


12. The annual returns of two investment portfolios over the past 3 years are
10%, 15%, 5% and 7%, 12%, 8%.

(a) Compute the geometric mean return for each portfolio.


(b) Compare the geometric means of the two portfolios and discuss their
implications.

13. A pharmaceutical company wants to calculate the average effectiveness


of a new drug across three different studies with effectiveness rates of
0.85, 0.90, 0.80.

(a) What is the geometric mean effectiveness?

(b) Compute the variance and standard deviation of the effectiveness
rates.

14. A nutritionist evaluates the average calorie intake of patients based on


three different dietary plans. The average calorie intake (in kcal) per day
for each plan and the number of patients on each plan are as follows:

• Plan A: 2000 kcal/day with 15 patients


• Plan B: 2500 kcal/day with 25 patients
• Plan C: 2200 kcal/day with 10 patients

Calculate the weighted mean calorie intake per day for all patients.
15. A professor is calculating the overall grade for a student based on the
grades from three different assessments with the following weights:


• Exam 1: 85 (weight 40%)


• Exam 2: 90 (weight 30%)
• Final Exam: 88 (weight 30%)

Find the weighted mean grade for the student.

16. This exercise focuses on comparing error rates for two software projects.
You will analyze the central tendency and variability of the error rates.

• Project Alpha: 5, 7, 6, 8, 5, 9, 6, 7, 8, 5, 7, 6
• Project Beta: 8, 10, 9, 7, 11, 9, 8, 10, 11, 9, 12, 10

Using the above data, answer the following questions.

(a) Draw a boxplot for the error rates for Project Alpha and Project
Beta.


(b) Which project has the lower median number of errors?


(c) Compare the spread of errors in each project. Which project shows
greater variability?
(d) Look for outliers and discuss what they might indicate about the
quality of the code in each project.
(e) Which project appears to have better performance based on the box-
plots?
17. This exercise involves analyzing BMI measurements from the start of a
study and after six months. Compare the central tendency and variability
of BMI over time.

• Start of Study: 22.5, 23.1, 22.8, 21.9, 23.5, 22.7, 23.0, 21.5,
22.3, 22.9, 21.8, 23.2
• After 6 Months: 21.0, 21.5, 21.8, 20.8, 22.0, 21.3, 21.6, 20.7,
21.1, 21.9, 20.9, 21.4
Using the above data, answer the following questions.

(a) Draw a boxplot for BMI at the start of the study and after six
months.
(b) How did the median BMI change from the start to after six months?
(c) Did the diet intervention lead to a reduction in the variability of
BMI?
(d) Are there any outliers in the BMI data at the start or after six
months? What can be inferred from them?
(e) Based on the boxplots, evaluate the effectiveness of the diet inter-

vention.

18. This exercise involves analyzing cholesterol levels across three different
groups. You will interpret the boxplots to compare the cholesterol levels
between these groups.

• Group A: 190, 200, 195, 210, 205, 215, 202, 198, 220, 210, 195,
200
• Group B: 180, 185, 190, 175, 195, 190, 180, 185, 175, 190, 185,
180
• Group C: 210, 220, 215, 230, 225, 240, 220, 235, 225, 230, 240,
215

Using the above data, answer the following questions.


(a) Draw a boxplot for the cholesterol levels for Groups A, B, and C.


(b) Which group has the highest median cholesterol level?


(c) Which group has the smallest interquartile range?
(d) Are there any outliers in each group? If so, which group has the
most outliers?
(e) Compare the variability of cholesterol levels between the three groups.

Chapter 4

Introduction to Probability

4.1 Introduction
In the realm of data science, understanding probability is essential for making
informed decisions based on uncertain and incomplete information. Probability
theory provides the mathematical foundation for analyzing data, modeling un-
certainty, and deriving insights from complex datasets. As data scientists, we
frequently encounter situations where outcomes are not deterministic but rather
subject to variability and chance. Probability offers tools and frameworks to
quantify this uncertainty and to make predictions that guide decision-making.

At its core, probability is concerned with measuring the likelihood of vari-


ous outcomes in uncertain situations. Whether it’s predicting customer behav-
ior, assessing risk, or evaluating the effectiveness of an algorithm, probability
helps us to model and interpret the inherent randomness in data. By applying

probability theory, we can develop robust statistical models, conduct rigorous


hypothesis testing, and perform meaningful data analysis.

In this chapter, we will lay the groundwork for understanding probability


within the context of data science. We will start with the basic concepts that
form the foundation of probability theory, including experiments, sample spaces,
and events. We will then explore different methods of assigning probabilities,
such as classical, empirical, and subjective approaches, and examine how these
methods apply to real-world data problems.

As we delve deeper, we will cover key topics such as joint and marginal
probabilities, conditional probability, and posterior probabilities. Each of these
concepts is crucial for analyzing relationships between variables, updating be-
liefs based on new data, and making predictions about future events.

By the end of this chapter, you will gain a solid understanding of proba-


bility and its applications in data science. This knowledge will equip you with
the tools needed to tackle complex data challenges and to make data-driven
decisions with confidence.

4.2 Basic Concepts


4.2.1 Experiment
In the context of probability, an experiment refers to any process or action
that generates a set of outcomes. The experiment is conducted under specified
conditions, and the outcomes of interest are observed and recorded. Each out-

come of an experiment is uncertain, but over repeated trials, patterns emerge
that allow us to assign probabilities to different outcomes.

Example of an Experiment
To test the fairness of a coin used in a cricket match to decide whether a
team bats or bowls first, we can design an experiment to assess whether the
coin has an equal probability of landing on heads or tails. The goal is to
determine if the coin is unbiased, meaning both outcomes, heads and tails, are
equally likely. In this experiment, the possible outcomes are heads or tails.
The procedure involves flipping the coin a significant number of times, say 100
flips, and recording the result of each flip. The next step is to analyze the data
by calculating the relative frequencies of heads and tails. These frequencies
are then compared to the expected probability of 0.5 for each outcome. If the
proportion of heads deviates significantly from 0.5, it may suggest the coin is
biased. Conversely, if the proportions are close to 0.5, there is no evidence to
suggest that the coin is unfair.
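
A simple simulation of this experiment can be written in Python. This is a minimal sketch of the procedure described above, using only the standard random module:

import random

# Flip a fair coin 100 times and record the results
random.seed(42)   # fixed seed for a reproducible run
flips = [random.choice(['H', 'T']) for _ in range(100)]

# Relative frequencies, to be compared with the expected 0.5
p_heads = flips.count('H') / len(flips)
p_tails = flips.count('T') / len(flips)
print(p_heads, p_tails)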

4.2.2 Random Experiment


A random experiment is a specific type of experiment where the outcome is
subject to chance and cannot be predicted with certainty. The outcomes are
uncertain and vary each time the experiment is performed.

Examples
• Rolling a Die: A random experiment with outcomes {1, 2, 3, 4, 5,
6}, where each outcome is unpredictable.

• Flipping a Coin: A random experiment with two possible outcomes:


heads or tails.

• Drawing a Card: A random experiment from a deck of 52 cards,


where each card is equally likely to be drawn.


4.2.3 Sample Space and Events


In data science, the concept of a sample space is essential for understanding
the possible outcomes of a random experiment or a data-generating process.
The sample space is the set of all possible outcomes or values that a random
variable can take.

Sample Space: The set of all possible outcomes of a random experiment


is denoted as the sample space, typically represented by S or Ω.

Consider the experiment of rolling a fair six-sided die. The possible outcomes

of this experiment are the numbers that appear on the top face of the die after
a roll. Then the sample space S for this experiment is the set of all possible
outcomes. That is,
S = {1, 2, 3, 4, 5, 6}
Here are some examples of sample spaces in different contexts within data
science:
• The sample space for tossing a fair coin is

S = {Heads, Tails}.

• Sample Space for tossing two coins:

S = {(Heads, Heads), (Heads, Tails), (Tails, Heads), (Tails, Tails)}.

• The sample space S for rolling two six-sided dice consists of all possible
ordered pairs (x1 , x2 ), where x1 represents the outcome of the first die
and x2 represents the outcome of the second die. Since each die has 6

faces, the sample space contains 36 possible outcomes:

S = {(1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6),
     (2, 1), (2, 2), (2, 3), (2, 4), (2, 5), (2, 6),
     (3, 1), (3, 2), (3, 3), (3, 4), (3, 5), (3, 6),
     (4, 1), (4, 2), (4, 3), (4, 4), (4, 5), (4, 6),
     (5, 1), (5, 2), (5, 3), (5, 4), (5, 5), (5, 6),
     (6, 1), (6, 2), (6, 3), (6, 4), (6, 5), (6, 6)}

This sample space shows all the possible outcomes when two dice are
rolled simultaneously.

Events are subsets of the sample space, representing specific outcomes or


combinations of outcomes.


Event: A subset of the sample space.

For example, the event A of rolling an even number is:

A = {2, 4, 6}
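
Sample spaces and events can be represented naturally as Python sets. The following minimal sketch computes the probability of event A when all outcomes are equally likely:

# Sample space for one roll of a fair die, and the event "even number"
S = {1, 2, 3, 4, 5, 6}
A = {x for x in S if x % 2 == 0}   # A = {2, 4, 6}

# With equally likely outcomes, P(A) = |A| / |S|
p_A = len(A) / len(S)
print(A, p_A)   # {2, 4, 6} 0.5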

4.3 Probability
Probability is a branch of mathematics that deals with the likelihood of differ-
ent outcomes in uncertain situations. It quantifies the chance of an event occur-
ring, providing a way to model and analyze randomness. In essence, probability

helps us understand and predict the behavior of systems in which outcomes are
not deterministic, but rather subject to chance.

Probability: A numerical measure of the likelihood that a particular


event will occur. It quantifies uncertainty by assigning a value between 0
and 1, where 0 means the event will not happen (impossible) and 1 means
the event will definitely happen (certain).

For example, when rolling a fair six-sided die, the probability of getting a 3 is:
P (3) = 1/6

because there is only one “3” out of six possible outcomes.

Properties of the Probability


A set of probability values for an experiment with a sample space

S = {A1 , A2 , . . . , An }

consists of some probabilities P (A1 ), P (A2 ), . . . , P (An ) that must satisfy

0 ≤ P (A1 ) ≤ 1, 0 ≤ P (A2 ) ≤ 1, ..., 0 ≤ P (An ) ≤ 1


and

P (A1 ) + P (A2 ) + · · · + P (An ) = 1.

Properties of the Probability:


(i). The probability of an event is always a number between 0 and 1.

(ii). The sum of the probabilities of all mutually exclusive events is


always 1.


For example, when tossing a fair coin, the probability of getting heads is 0.5,
and the probability of getting tails is also 0.5, which are both between 0 and 1.
The sum of these probabilities is

0.5 + 0.5 = 1

showing that the total probability of all possible outcomes (heads or tails) is
always 1.
Problem 4.1. An experiment has five outcomes, I, II, III, IV, and V. If P (I) =
0.08, P (II) = 0.20, and P (III) = 0.33, (a) what are the possible values for the
probability of outcome V? (b) If outcomes IV and V are equally likely, what are

their probability values?

Solution
(a).
An experiment has five outcomes: I, II, III, IV, and V. Given the probabilities
for outcomes I, II, and III are:

P (I) = 0.08, P (II) = 0.20, P (III) = 0.33


We need to determine the possible values for the probability of outcome V.
First, we find the sum of the given probabilities:

P (I) + P (II) + P (III) = 0.08 + 0.20 + 0.33 = 0.61

Since the sum of the probabilities of all outcomes must equal 1, the sum of the
probabilities of outcomes IV and V is:

P (IV) + P (V) = 1 − 0.61 = 0.39



Thus, the possible values for the probability of outcome V, denoted as P (V),
depend on the probability of outcome IV, denoted as P (IV):

P (V) = 0.39 − P (IV)


Since 0 ≤ P (IV) ≤ 0.39, the possible values for the probability of outcome
V are
0 ≤ P (V) ≤ 0.39

(b).

If outcomes IV and V are equally likely, then:

P (IV) = P (V)


Let P (IV) = P (V) = x. Then:


2x = 0.39 ⇒ x = 0.39/2 = 0.195
Therefore, the probabilities for outcomes IV and V are:

P (IV) = P (V) = 0.195


Problem 4.2. An experiment has three outcomes, I, II, and III. If outcome
I is twice as likely as outcome II, and outcome II is three times as likely as
outcome III, what are the probability values of the three outcomes?

Solution
An experiment has three outcomes: I, II, and III. Let the probabilities of these
outcomes be P (I), P (II), and P (III), respectively.

Let P (III) = x. Then:


• P (II) = 3x (since outcome II is three times as likely as outcome III).

• P (I) = 2 · P (II) = 2 · 3x = 6x (since outcome I is twice as likely as
outcome II).
Since the sum of the probabilities of all outcomes must equal 1, we have:

P (I) + P (II) + P (III) = 1

Substituting the values, we get:

6x + 3x + x = 1

or,
10x = 1

∴ x = 1/10 = 0.1
Thus, the probabilities of the three outcomes are:

P (III) = x = 0.1
P (II) = 3x = 3 × 0.1 = 0.3
P (I) = 6x = 6 × 0.1 = 0.6

Therefore, the probability values of the three outcomes are:

P (I) = 0.6, P (II) = 0.3, P (III) = 0.1


4.3.1 Union of Events


The union of two events A and B, denoted A ∪ B, represents the event that
either A, B, or both occur. Formally:

A ∪ B = {ω | ω ∈ A or ω ∈ B}
Example: Consider rolling a standard six-sided die. Let:

• A be the event “rolling an even number” (i.e., A = {2, 4, 6})

• B be the event “rolling a number greater than 3” (i.e., B = {4, 5, 6})

The union A ∪ B represents rolling a number that is either even or greater
than 3 (or both). The possible outcomes for

A ∪ B = {2, 4, 5, 6}.

The probability of A ∪ B is given by:


P (A ∪ B) = Number of favorable outcomes / Total number of outcomes
          = 4/6 = 2/3

4.3.2 Intersection of Events


The intersection of two events A and B, denoted A ∩ B, represents the event
that both A and B occur simultaneously. Formally:

A ∩ B = {ω | ω ∈ A and ω ∈ B}
Example: Using the same die roll, let:

• A be the event “rolling an even number” (i.e., A = {2, 4, 6})

• B be the event “rolling a number greater than 3” (i.e., B = {4, 5, 6})

The intersection A ∩ B represents rolling a number that is both even and


greater than 3.
The possible outcomes for

A ∩ B = {4, 6}.

The probability of A ∩ B is given by:


P (A ∩ B) = Number of favorable outcomes / Total number of outcomes
          = 2/6 = 1/3
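
Both die-roll examples above can be checked with Python set operations; this is a small sketch, with | for union and & for intersection:

S = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}   # even number
B = {4, 5, 6}   # greater than 3

p_union = len(A | B) / len(S)       # P(A ∪ B) = 4/6 ≈ 0.667
p_intersect = len(A & B) / len(S)   # P(A ∩ B) = 2/6 ≈ 0.333
print(p_union, p_intersect)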


4.3.3 Complementary Event


A complementary event of an event A, denoted by Ac or A or A′ , consists
of all outcomes in the sample space S that are not in A.
Example: Consider a six-sided die.

• Event A: Rolling an even number:

A = {2, 4, 6}

• Complementary Event Ac : Rolling an odd number:

Ac = {1, 3, 5}

The probability of the complementary event Ac is given by:

P (Ac ) = 1 − P (A)

Example: If A is the event of getting a head when flipping a coin, then the
complementary event Ac is the event of getting a tail. If P (A) = 0.5, then
P (Ac ) = 1 − 0.5 = 0.5.

Odds: The odds in favor of an event A are defined as the ratio of the
probability that the event A occurs to the probability that the event A
does not occur (i.e., the complement of A). Mathematically, the odds in
favor of A are given by:

Odds in favor of A = P (A)/P (Ac )

where P (Ac ) is the probability of the complement of A.



Problem 4.3. Suppose p is the probability of success.


(a) If the odds is 1, what is p?
(b) If the odds is 2, what is p?
(c) If p = 0.25, what is the odds?

Solution
(a). The odds in favor of success are given by:
Odds = p/(1 − p)

If the odds are 1, then:

1 = p/(1 − p)


Solving for p:
1 − p = p ⇒ 1 = 2p ⇒ p = 1/2
So, p = 0.5.

(b). If the odds are 2, then:


2 = p/(1 − p)

Solving for p:

2(1 − p) = p ⇒ 2 − 2p = p ⇒ 2 = 3p ⇒ p = 2/3

So, p = 2/3.

(c). If p = 0.25, then the odds are:


Odds = p/(1 − p) = 0.25/(1 − 0.25) = 0.25/0.75 = 1/3

So, the odds are 1/3.
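
These conversions follow two one-line formulas, sketched below as illustrative helper functions (odds_from_prob and prob_from_odds are hypothetical names, not a library API):

def odds_from_prob(p):
    # odds = p / (1 - p)
    return p / (1 - p)

def prob_from_odds(odds):
    # inverting the relation: p = odds / (1 + odds)
    return odds / (1 + odds)

print(prob_from_odds(1))     # 0.5, as in part (a)
print(prob_from_odds(2))     # 2/3, as in part (b)
print(odds_from_prob(0.25))  # 1/3, as in part (c)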

4.3.4 Equally Likely Events


In probability theory, equally likely events are events that have the same prob-
ability of occurring. If all outcomes in the sample space are equally likely, then
the probability of any specific event can be calculated by dividing the number
of favorable outcomes by the total number of possible outcomes.

Equally Likely Events: Events that have the same probability of occur-
ring.

Example
Consider the experiment of rolling a fair six-sided die. The sample space is:

S = {1, 2, 3, 4, 5, 6}
Since the die is fair, each of the six outcomes is equally likely. The proba-
bility of each outcome is:

P ({1}) = P ({2}) = P ({3}) = P ({4}) = P ({5}) = P ({6}) = 1/6.


4.3.5 Mutually Exclusive Events


In probability theory, mutually exclusive events are events that cannot happen
at the same time. In other words, if one event occurs, the other cannot occur
at the same time.

Mutually Exclusive Events: Two events A and B are said to be mutu-


ally exclusive (or disjoint) if they cannot occur at the same time. Formally,
A and B are mutually exclusive if:

A∩B =∅
where ∩ denotes the intersection of events, and ∅ represents the empty

set, indicating that there are no outcomes common to both A and B.

Example:
Consider rolling a standard six-sided die. Let:

• A be the event “rolling a 2”


• B be the event “rolling a 5”

The events A and B are mutually exclusive because you cannot roll a 2 and
a 5 at the same time.

Additivity for Mutually Exclusive Events


If A and B are mutually exclusive events, then the probability of their union is
the sum of their individual probabilities:

P (A ∪ B) = P (A) + P (B)

Generalization
For any finite or countable collection of mutually exclusive events A1 , A2 , . . . , An :
P (A1 ∪ A2 ∪ · · · ∪ An ) = P (A1 ) + P (A2 ) + · · · + P (An )

4.3.6 Probability Axioms


The probability of an event Ai ; i = 1, 2, . . . , n, denoted as P (Ai ), satisfies the
following axioms:

(i). 0 ≤ P (Ai ) ≤ 1,
(ii). P (A1 ) + P (A2 ) + · · · + P (An ) = 1,


(iii). For any sequence of mutually exclusive events {Ai }, we have


P (A1 ∪ A2 ∪ · · · ) = P (A1 ) + P (A2 ) + · · · .

For mutually exclusive events A, B ∈ S, we have

Pr(A ∪ B) = Pr(A) + Pr(B)

This is the addition rule for two mutually exclusive events.

If A and B are not mutually exclusive, then the addition rule is

Pr(A ∪ B) = Pr(A) + Pr(B) − Pr(A ∩ B).

If two events are mutually exclusive, then the probability of both occurring
is denoted as P (A ∩ B) and
P (A and B) = P (A ∩ B) = 0.

Problem 4.4. A single 6-sided die is rolled. What is the probability of rolling
a 2 or a 5?

Solution
• Pr(2) = 1/6 and Pr(5) = 1/6

• Therefore,
Pr(2 or 5) = Pr(2 ∪ 5) = Pr(2) + Pr(5)
           = 1/6 + 1/6
           = 2/6
           = 1/3

Problem 4.5. In a Math class of 30 students, 17 are boys and 13 are girls.
On a unit test, 4 boys and 5 girls made an A grade. If a student is chosen at
random from the class, what is the probability of choosing a girl or an A-grade
student?

Solution
• Pr(girl) = 13/30, Pr(A-grade student) = 9/30, and
Pr(girl ∩ A-grade student) = 5/30


• Therefore,

Pr(girl or A-grade student) = Pr(girl) + Pr(A-grade student)
                              − Pr(girl ∩ A-grade student)
                            = 13/30 + 9/30 − 5/30
                            = 17/30

4.4 Types of Probability

Probability can be categorized into different types based on how it is determined
or calculated. Below are the key types:

1. Classical (Theoretical) Probability


2. Experimental (Empirical) Probability
3. Subjective Probability

4.4.1 Classical (Theoretical) Probability


Classical or theoretical probability is based on the assumption that all outcomes
of a random experiment are equally likely. This is often used when the out-
comes of an experiment are known and finite. If an experiment has n equally
likely outcomes and an event A consists of m of these outcomes, the probability
of event A is given by:
P (A) = Number of favorable outcomes / Total number of possible outcomes = m/n.

Example 1: For a fair six-sided die, the total number of possible outcomes is
6 (since the die has six faces). If we are interested in the probability of rolling a
3, there is only one favorable outcome (rolling a 3). Therefore, the probability
is:
P (rolling a 3) = 1/6
Example 2: For a standard deck of 52 playing cards, the total number of
possible outcomes is 52. If we want to know the probability of drawing an Ace,
there are 4 Aces in the deck (one in each suit). Therefore, the probability is:
P (drawing an Ace) = 4/52 = 1/13
Classical probability is particularly useful for situations where the outcomes
are well-defined, and each outcome is equally likely, such as rolling dice, drawing
cards, or selecting outcomes from a set of equally likely possibilities.


4.4.2 Experimental (Empirical) Probability


The empirical approach, also known as the frequentist approach, assigns prob-
abilities based on observed frequencies from experimental data. This approach
is often used in practical situations where we have observed data. If an experi-
ment is repeated N times and event A occurs nA times, the probability of event
A is estimated by the relative frequency:

P (A) = Number of times the event A occurred / Total number of observations = nA /N.

This method is particularly useful when it is difficult or impractical to calcu-

late theoretical probabilities, or when we want to verify theoretical predictions
by comparing them with actual outcomes. The empirical approach to probabil-
ity relies on the principle known as the law of large numbers. This principle
suggests that as the number of observations increases, the estimate of the prob-
ability becomes more accurate. Therefore, by gathering more data, one can
obtain a more precise estimation of the probability.
Law of large numbers: As the number of trials or observations increases,
the empirical probability of an event will get closer to its actual probability.

Example 1: For example, if we flip a coin 100 times and get 52 heads, the
estimated probability of getting a head would be:
P (Heads) = Number of heads / Total number of flips = 52/100 = 0.52
Based on the empirical data, the estimated probability of getting heads on a
coin flip is 0.52, or 52%.
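
The law of large numbers is easy to see in a simulation. The sketch below estimates P (Heads) from increasingly many simulated flips of a fair coin:

import random

random.seed(1)   # fixed seed for a reproducible run
for n in [10, 100, 1000, 10000, 100000]:
    heads = sum(random.random() < 0.5 for _ in range(n))
    print(n, heads / n)   # the estimates approach 0.5 as n grows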

Example 2: In Bangladesh, the route between Dhaka and Chattogram


is one of the busiest and most important for both business and leisure travel.
Given the high volume of flights, airlines strive to maintain punctuality to en-
hance customer satisfaction and operational efficiency. Monitoring these flights
provides valuable data on performance and reliability.

For this example, 100 flights from Dhaka to Chattogram were monitored.

• Successful flights (on time): 95

• Unsuccessful flights (delayed or canceled): 5

The empirical probability of a successful flight is:


P (Success) = 95/100 = 0.95
The empirical probability of an unsuccessful flight is:


P (Failure) = 5/100 = 0.05
In this example, the empirical probability of a successful flight from Dhaka
to Chattogram is 0.95, while the probability of an unsuccessful flight is 0.05,
based on the actual outcomes of the monitored flights.

4.4.3 Subjective Approach


The subjective approach assigns probabilities based on personal judgment
or belief about the likelihood of an event. This method does not rely
on mathematical calculations or empirical data but rather on an individual’s

intuition or experience. For an event A, the subjective probability is denoted
as:
P (A) = Subjective belief about A
In this approach, probabilities are not necessarily based on frequency or
equal likelihood but on personal estimation.
Example 1: Consider an entrepreneur deciding whether to launch a new
product. Since there is no historical data or empirical studies available for this
specific product, the entrepreneur uses their expertise and market knowledge
to estimate the likelihood of success.

The entrepreneur assesses various factors:


• Knowledge of current market trends.

• Feedback from potential customers.



• Expertise of the team involved.

• Analysis of the competitive landscape.


Based on these factors, the entrepreneur might estimate the probability of
the product’s success to be 70%. This subjective estimate is derived from their
personal judgment and experience rather than from data analysis.

P (Success) = 0.70
This subjective probability is based on the entrepreneur uses their expertise
and market knowledge, rather than on empirical data or mathematical models.


Example 2: Consider a football game between Team A and Team B. Based


on your personal judgment and knowledge of the teams, you estimate the prob-
ability of Team A winning the game.
Let’s denote:
• P (A) as the probability of Team A winning.

• P (B) as the probability of Team B winning.


Based on your assessment, you estimate that:

P (A) = 0.70

This means you believe there is a 70% chance that Team A will win the
game. This probability is derived from your subjective evaluation of the teams’
recent performances, player conditions, and other relevant factors.

4.5 Joint and Marginal Probabilities


Joint probability is the probability of two (or more) events happening simul-
taneously. For two events A and B, the joint probability is denoted by P (A∩B).

Marginal probability is the probability of the occurrence of a single event.


It is obtained by summing (or integrating) the joint probabilities over all pos-
sible values of the other variable(s).

Example: Consider a study on student performance with the following events:

• A: The event that a student has a good study habit.



• B: The event that a student passed the exam.


The probability table for these events is as follows:

                  Good Study Habit (A)   Poor Study Habit (Ac )   Total
Passed (B)        0.80                   0.05                     0.85
Not Passed (Bc )  0.02                   0.13                     0.15
Total             0.82                   0.18                     1.0

The joint probability is the probability that a student has a good study
habit and passed the exam:

P (A ∩ B) = 0.80


Marginal Probability of Studying (A) that a student studied for the exam,
regardless of whether they passed or not, is obtained by summing the joint
probabilities involving A:

P (A) = P (A ∩ B) + P (A ∩ B c ) = 0.8 + 0.02 = 0.82


Marginal Probability of Passing (B): The probability that a student passed
the exam, regardless of whether they studied or not, is obtained by summing
the joint probabilities involving B:

P (B) = P (A ∩ B) + P (Ac ∩ B) = 0.80 + 0.05 = 0.85

Problem 4.6. Suppose two dice are thrown together. What is the probability
that at least one 6 is obtained on the two dice?

Solution
Since each die has 6 faces, the sample space contains 6 × 6 = 36 possible
outcomes:

S = {(1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6),
     (2, 1), (2, 2), (2, 3), (2, 4), (2, 5), (2, 6),
     (3, 1), (3, 2), (3, 3), (3, 4), (3, 5), (3, 6),
     (4, 1), (4, 2), (4, 3), (4, 4), (4, 5), (4, 6),
     (5, 1), (5, 2), (5, 3), (5, 4), (5, 5), (5, 6),
     (6, 1), (6, 2), (6, 3), (6, 4), (6, 5), (6, 6)}

In the sample space S, we see the possible outcomes with at least one 6 are
(6, 1), (6, 2), (6, 3), (6, 4), (6, 5), (6, 6), (1, 6), (2, 6), (3, 6), (4, 6), (5, 6).

Therefore, the number of outcomes with at least one 6 is 11.

P (at least one 6) = 11/36.

4.6 Conditional Probability


Conditional probability is essential in data science because it helps model and
understand how the probability of an event changes based on the occurrence of
another event. It underpins Bayesian inference, supports feature engineering,
enhances risk assessment, informs decision-making, aids in anomaly detection,
and is pivotal in natural language processing tasks. This ability to adjust
probabilities with new information is crucial for accurate predictions and data-
driven insights.


Conditional Probability: The conditional probability of an event A


given that event B has occurred is denoted by P (A|B) and is defined as:

P (A|B) = P (A ∩ B)/P (B), provided P (B) > 0

One simple example of conditional probability concerns the situation in


which two events A and B are mutually exclusive. Since mutually exclusive
events have no common outcomes, the occurrence of event B makes the occur-
rence of event A impossible. Thus, intuitively, the probability of event A given
that B has occurred should be zero. This is confirmed by the formula:

P (A | B) = P (A ∩ B)/P (B) = 0/P (B) = 0.
Another example involves a scenario where event B is a subset of event A,
denoted B ⊆ A. In this case, if event B occurs, event A must also occur. Thus,
the probability of event A given that B has occurred should be one. This is
supported by the formula:

P (A | B) = P (A ∩ B)/P (B) = P (B)/P (B) = 1.

Problem 4.7. If somebody rolls a fair die without showing you but announces
that the result is even, then what is the probability of scoring a 6?

Solution
The sample space for a fair die roll is S = {1, 2, 3, 4, 5, 6}. The event (B) that
the result is even is B = {2, 4, 6}.
P (B) = 3/6 = 1/2

The event (A) of scoring a 6 given that the result is even is A = {6}.

P (A|B) = P (A ∩ B)/P (B) = (1/6)/(1/2) = 1/3

Problem 4.8. Suppose somebody rolls a red die and a blue die together without
showing you, but announces that at least one 6 has been shown. What is the
probability that the red die showed a 6?

Solution
In the sample space S mentioned in Problem 4.6, we see the possible outcomes
with at least one 6 are

B = {(6, 1), (6, 2), (6, 3), (6, 4), (6, 5), (6, 6), (1, 6), (2, 6), (3, 6), (4, 6), (5, 6)}


Therefore, the number of outcomes with at least one 6 is 11. The number of
outcomes where the red die scores a 6 is 6. That is,

A = {(6, 1), (6, 2), (6, 3), (6, 4), (6, 5), (6, 6)}.

Hence, the conditional probability is

P (A|B) = (6/36)/(11/36) = 6/11.
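
This result can be checked by enumerating the 36 equally likely outcomes in Python; a minimal sketch:

from itertools import product

# All (red, blue) outcomes for two dice
S = list(product(range(1, 7), repeat=2))

B = [(r, b) for r, b in S if r == 6 or b == 6]   # at least one 6
A_and_B = [(r, b) for r, b in B if r == 6]       # red die shows a 6 as well

# With equally likely outcomes, P(A | B) = |A ∩ B| / |B|
print(len(A_and_B), '/', len(B))   # 6 / 11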

Problem 4.9. Suppose somebody rolls a red die and a blue die together. What
is the probability that the red die scores a 6 given that exactly one 6 of the two

outcomes has been scored?

Solution
In the sample space S mentioned in Problem 4.6, we see the possible outcomes
with exactly one 6 are

B = {(6, 1), (6, 2), (6, 3), (6, 4), (6, 5), (1, 6), (2, 6), (3, 6), (4, 6), (5, 6)}.

Therefore, the number of outcomes with exactly one 6 is 10 (i.e., excluding
(6, 6)). The number of outcomes where the red die scores a 6 and the blue die
does not is 5 (i.e., A = {(6, 1), (6, 2), (6, 3), (6, 4), (6, 5)}).

P (A|B) = (5/36)/(10/36) = 1/2

4.6.1 Probabilities Computation from Contingency Table


A contingency table displays the frequency distribution of two categorical

variables. Each cell in the table represents the count of observations where
the two variables take specific values. We use contingency tables to compute
marginal and joint probabilities.
Consider two categorical variables: Variable A with categories A1 and
A2 and Variable B with categories B1 and B2 . The contingency table 4.1 is
structured as follows:

Table 4.1: Contingency Table

A1 A2 Total
B1 a b a+b
B2 c d c+d
Total a+c b+d n=a+b+c+d


The joint probability table shows the probability of each combination of


categories occurring. It is obtained by dividing each cell count in the contin-
gency table by the overall total number of observations n.
To compute the joint probabilities:
Joint Probability = Count in cell / n
The joint probability table is:

        A1           A2           Total
B1      a/n          b/n          (a + b)/n
B2      c/n          d/n          (c + d)/n
Total   (a + c)/n    (b + d)/n    1

Table 4.2: Joint Probability Table


Problem 4.10. Consider the situation of the promotion status of male and
female officers of a major metropolitan police force in the eastern United States.
The force consists of 1200 officers, 960 men and 240 women. Over the past two
years 324 officers on the public force received promotions. After reviewing the
promotion record, a committee of female officers raised a discrimination case
on the basis that 288 male officers had received promotions, but only 36 female
officers had received promotions.

               Men   Women   Total
Promoted       288   36      324
Not Promoted   672   204     876
Total          960   240     1200

(i). Develop a joint probability table for these data. What are the marginal
probabilities? Suppose a male officer is selected randomly, what is the
chance that the officer will be promoted?

(ii). Suppose a female officer is selected randomly, what is the chance that
the officer will not be promoted? Suppose an officer is selected randomly
who got promotion, what is the chance that the officer will be male?

(iii). Suppose an officer is selected randomly who did not get promotion,
what is the chance that the officer will be female?


Solution
(i) Joint Probability Table and Marginal Probabilities
To develop the joint probability table, we divide each cell count by the total
number of officers, which is 1200.

Joint Probability Table:

               Men        Women      Total
Promoted       288/1200   36/1200    324/1200
Not Promoted   672/1200   204/1200   876/1200
Total          960/1200   240/1200   1

Simplifying the fractions, we get:

               Men    Women   Total
Promoted       0.24   0.03    0.27
Not Promoted   0.56   0.17    0.73
Total          0.80   0.20    1
Marginal Probabilities:

• Probability of promotion: 324/1200 = 0.27

• Probability of no promotion: 876/1200 = 0.73

• Probability of being male: 960/1200 = 0.80

• Probability of being female: 240/1200 = 0.20
Probability that a randomly selected male officer is promoted:
P (Promoted | Male) = 288/960 = 0.30

(ii) Probabilities for Female Officers and Promotion


Probability that a randomly selected female officer is not promoted:

204/1200
P (Not Promoted | Female) = = 0.85
240/1200
Probability that a randomly selected officer who got promoted is
male:

P (Male | Promoted) = (288/1200)/(324/1200) ≈ 0.889


(iii) Probability for Officers Not Promoted


Probability that a randomly selected officer who did not get promoted
is female:

P (Female | Not Promoted) = (204/1200)/(876/1200) ≈ 0.233
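
The whole solution can be reproduced from the raw counts with NumPy; a minimal sketch:

import numpy as np

# Counts from the table: rows = promoted / not promoted, cols = men / women
counts = np.array([[288, 36],
                   [672, 204]])
n = counts.sum()   # 1200

joint = counts / n               # joint probability table
p_promoted = joint[0].sum()      # 0.27
p_male = joint[:, 0].sum()       # 0.80

print(joint[0, 0] / p_male)              # P(Promoted | Male) = 0.30
print(joint[0, 0] / p_promoted)          # P(Male | Promoted) ≈ 0.889
print(joint[1, 1] / joint[:, 1].sum())   # P(Not Promoted | Female) = 0.85
print(joint[1, 1] / joint[1].sum())      # P(Female | Not Promoted) ≈ 0.233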

4.6.2 Independent Events


Two events A and B are said to be independent if the occurrence of one event
does not affect the probability of the other event occurring. In mathematical

terms, this is expressed as:

P (A ∩ B) = P (A) · P (B).

For independent events, the following also holds true:


P (A | B) = P (A ∩ B)/P (B) = [P (A) · P (B)]/P (B) = P (A)

and

P (B | A) = P (A ∩ B)/P (A) = [P (A) · P (B)]/P (A) = P (B).
These equations indicate that knowing the occurrence of one event does not
change the probability of the other event.

Independent Events: Two events A and B are said to be independent


if:

P (A ∩ B) = P (A) · P (B).

This equation implies

P (A | B) = P (A), P (B | A) = P (B).

Any one of these three conditions implies the other two.

Example 1: Rolling a Die and Flipping a Coin


Consider rolling a fair six-sided die and flipping a fair coin. Let:
• A be the event that the die shows a 3.

• B be the event that the coin lands on heads.


The die has six possible outcomes: 1, 2, 3, 4, 5, or 6. Therefore, the sample
space for the die roll is:
SA = {1, 2, 3, 4, 5, 6}


The coin has two possible outcomes: heads (H) or tails (T). Therefore, the
sample space for the coin flip is:

SB = {H, T }

The outcome of rolling the die does not affect the outcome of flipping the
coin, and vice versa. Therefore, events A and B are independent. We can verify
this as follows:
P (A) = 1/6, P (B) = 1/2
The combined sample space consists of 12 outcomes.

S = {(1, H), (2, H), (3, H), (4, H), (5, H), (6, H),
     (1, T), (2, T), (3, T), (4, T), (5, T), (6, T)}

P (A ∩ B) = P (die shows 3 and coin lands on heads) = 1/12

P (A) · P (B) = 1/6 · 1/2 = 1/12 = P (A ∩ B)
Thus, the events are independent.

Example 2: Drawing Cards from a Deck Without Replacement
Consider drawing two cards from a standard deck of 52 cards without replace-
ment. Let:
• A be the event that the first card drawn is a heart.

• B be the event that the second card drawn is a heart.


In this case, the events are not independent because the outcome of the first
draw affects the probability of the second draw. If the first card is a heart,
there are now only 12 hearts left in a deck of 51 cards, so:
P (A) = 13/52, P (B | A) = 12/51

The probability of B if A occurs is different from P (B) without conditioning


on A, which is:
P (B) = 13/52
Thus, A and B are not independent.
Problem 4.11. A system has four computers. Computer 1 works with a prob-
ability of 0.88; computer 2 works with a probability of 0.78; computer 3 works
with a probability of 0.92; computer 4 works with a probability of 0.85. Suppose
that the operations of the computers are independent of each other.


(a). Suppose that the system works only when all four computers are work-
ing. What is the probability that the system works?

(b). Suppose that the system works only if at least one computer is working.
What is the probability that the system works?

(c). Suppose that the system works only if at least three computers are work-
ing. What is the probability that the system works?

Solution
(a). To find the probability in this scenario, we multiply the probabilities

of all four computers working, as they are independent.

P (system works) = 0.88 × 0.78 × 0.92 × 0.85 = 0.537

(b). To find the probability in this scenario, we find the complement of


the probability that none of the computers are working. Then, the
probability that at least one computer is working is the complement of
the probability that none of the computers are working.

P (system works) = 1 − P (no computers working)


= 1 − ((1 − 0.88) × (1 − 0.78) × (1 − 0.92) × (1 − 0.85))
= 0.9997

(c).

P (system works) = P (all computers working)


+ P (computers 1,2,3 working, computer 4 not working)

+ P (computers 1,2,4 working, computer 3 not working)


+ P (computers 1,3,4 working, computer 2 not working)
+ P (computers 2,3,4 working, computer 1 not working)
= 0.537 + (0.88 × 0.78 × 0.92 × (1 − 0.85))
+ (0.88 × 0.78 × (1 − 0.92) × 0.85)
+ (0.88 × (1 − 0.78) × 0.92 × 0.85)
+ ((1 − 0.88) × 0.78 × 0.92 × 0.85)
= 0.903
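
Because the computers operate independently, all three answers can also be obtained by enumerating the 2^4 working/failed configurations; a minimal sketch:

from itertools import product

p = [0.88, 0.78, 0.92, 0.85]   # P(computer i works)

def prob_at_least(k):
    """Probability that at least k of the independent computers work."""
    total = 0.0
    for state in product([0, 1], repeat=4):   # 1 = works, 0 = fails
        if sum(state) >= k:
            prob = 1.0
            for works, pi in zip(state, p):
                prob *= pi if works else (1 - pi)
            total += prob
    return total

print(prob_at_least(4))   # (a) all four work: ≈ 0.537
print(prob_at_least(1))   # (b) at least one works: ≈ 0.9997
print(prob_at_least(3))   # (c) at least three work: ≈ 0.903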

Problem 4.12. Suppose that somebody secretly rolls two fair six-sided dice.
What is the probability that the face-up value of the first one is 2, given the
information that their sum is no greater than 5?


Solution
To find the probability that the face-up value of the first die is 2 given that
the sum of the two dice is no greater than 5, we use the concept of conditional
probability.
Let A be the event that the face-up value of the first die is 2, and B be
the event that the sum of the two dice is no greater than 5. We want to find
P (A | B).
The conditional probability P (A | B) is given by:

P (A ∩ B)
P (A | B) =
P (B)

T
First, we determine P (B). The possible outcomes for the sum of the two
dice being no greater than 5 are:

(1, 1), (1, 2), (1, 3), (1, 4),


(2, 1), (2, 2), (2, 3),
AF (3, 1), (3, 2),
(4, 1)
There are 10 such outcomes, and since there are 36 possible outcomes when
rolling two dice, the probability P (B) is:
P (B) = 10/36 = 5/18
Next, we determine P (A ∩ B), which is the probability that the first die is
2 and the sum of the dice is no greater than 5. The possible outcomes for this
are:

(2, 1), (2, 2), (2, 3)


There are 3 such outcomes, so the probability P (A ∩ B) is:
P (A ∩ B) = 3/36 = 1/12
Now we can calculate P (A | B):
P (A | B) = P (A ∩ B)/P (B) = (1/12)/(5/18) = (1/12) × (18/5) = 18/60 = 3/10

Thus, the probability that the face-up value of the first die is 2 given that
their sum is no greater than 5 is 3/10.


4.7 Posterior Probabilities


4.7.1 Law of Total Probability
Consider a sample space S partitioned into mutually exclusive events A1 , A2 , . . . , An .
This means:

S = A1 ∪ A2 ∪ · · · ∪ An
Let B be another event in the sample space given in Figure 4.1. The initial
question of interest is how to use the probabilities P (Ai ) and P (B | Ai ) to
calculate P (B), the probability of the event B. This can be achieved by noting

that

B = (A1 ∩ B) ∪ (A2 ∩ B) ∪ · · · ∪ (An ∩ B)


where the events Ai ∩ B are mutually exclusive, so that

P (B) = P (A1 ∩ B) + P (A2 ∩ B) + · · · + P (An ∩ B)


AF A1
A2
An−1 An
Ai

S
DR

Figure 4.1: A partition A1 , . . . , An and an event B.

Using the definition of conditional probability, this becomes

P (B) = P (A1 )P (B | A1 ) + P (A2 )P (B | A2 ) + · · · + P (An )P (B | An )

This result, known as the Law of Total Probability, has the interpretation
that if it is known that one and only one of a series of events Ai can occur, then
the probability of another event B can be obtained as the weighted average of
the conditional probabilities P (B | Ai ), with weights equal to the probabilities
P (Ai ).

Law of Total Probability: If A1 , . . . , An is a partition of a sample space,


then the probability of an event B can be obtained from the probabilities


P (Ai ) and P (B | Ai ) using the formula

P (B) = P (A1 )P (B | A1 ) + P (A2 )P (B | A2 ) + · · · + P (An )P (B | An )

The law of total probability states that if you have a partition of the sample
space into mutually exclusive events, the probability of an event can be found by
summing the probabilities of the event occurring within each partition, weighted
by the probability of each partition.

Example

Suppose we have a sample space divided into three mutually exclusive events
A1 , A2 , and A3 with the following probabilities and conditional probabilities:

P (A1 ) = 0.2, P (A2 ) = 0.5, P (A3 ) = 0.3

P (B | A1 ) = 0.4, P (B | A2 ) = 0.6, P (B | A3 ) = 0.3


To find P (B), use the Law of Total Probability:

P (B) = P (A1 ) · P (B | A1 ) + P (A2 ) · P (B | A2 ) + P (A3 ) · P (B | A3 )


P (B) = 0.2 · 0.4 + 0.5 · 0.6 + 0.3 · 0.3

P (B) = 0.08 + 0.30 + 0.09 = 0.47
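The arithmetic is a one-liner in Python. A small sketch using the numbers
from this example:

    priors = [0.2, 0.5, 0.3]        # P(A1), P(A2), P(A3)
    likelihoods = [0.4, 0.6, 0.3]   # P(B | Ai)

    p_b = sum(p * l for p, l in zip(priors, likelihoods))
    print(round(p_b, 2))  # 0.47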


Problem 4.13. A company sells a certain type of car that it assembles in one
of four possible locations. The probabilities of a car being assembled at each
plant are as follows:

• Plant I: 20% (P (Plant I) = 0.20)

• Plant II: 24% (P (Plant II) = 0.24)

• Plant III: 25% (P (Plant III) = 0.25)



• Plant IV: 31% (P (Plant IV) = 0.31)

Each new car sold carries a one-year bumper-to-bumper warranty. The com-
pany has collected data showing the following conditional probabilities of making
a warranty claim:

• P (claim | Plant I) = 0.05

• P (claim | Plant II) = 0.11


• P (claim | Plant III) = 0.03

• P (claim | Plant IV) = 0.08

The probability of interest is the probability that a claim on the warranty of the
car will be required. If B is the event that a claim is made, we want to find
P (B).

Solution
We can use the Law of Total Probability to find P (B). According to the Law
of Total Probability:

P (B) = P (B | Plant I) · P (Plant I) + P (B | Plant II) · P (Plant II)
+ P (B | Plant III) · P (Plant III) + P (B | Plant IV) · P (Plant IV)

[Tree diagram: each plant branches into Claim / No Claim, with
P (Claim) = 0.05, 0.11, 0.03, 0.08 and P (No Claim) = 0.95, 0.89, 0.97, 0.92
for Plants I–IV respectively.]
Substitute the given values:


P (B) = (0.05 · 0.20) + (0.11 · 0.24) + (0.03 · 0.25) + (0.08 · 0.31)


= 0.01 + 0.0264 + 0.0075 + 0.0248
= 0.0687

Thus, the probability that a claim on the warranty will be required is 0.0687.
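The same weighted average can be computed directly in Python. A minimal
sketch with the plant shares and claim rates given above:

    p_plant = [0.20, 0.24, 0.25, 0.31]   # P(Plant I..IV)
    p_claim = [0.05, 0.11, 0.03, 0.08]   # P(claim | Plant I..IV)

    p_b = sum(pp * pc for pp, pc in zip(p_plant, p_claim))
    print(round(p_b, 4))  # 0.0687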

4.7.2 Total Probability with Multiple Conditions


In situations where an event depends on multiple factors, the Law of Total

Probability allows us to compute the overall probability by summing the con-
tributions from all mutually exclusive combinations of those factors.

When an event B is influenced by two or more factors, such as age group


and smoking status, the total probability of B is calculated by conditioning on
all possible combinations of these factors. If we partition the sample space by
factors Ai (e.g., age groups) and Sj (e.g., smoking status), the total probability
of event B is given by:
P (B) = ∑i ∑j P (B | Ai , Sj ) · P (Sj | Ai ) · P (Ai )
Where:
• P (B | Ai , Sj ) is the conditional probability of B given that the indi-
vidual belongs to age group Ai and has smoking status Sj .

• P (Sj | Ai ) is the conditional probability of having smoking status Sj ,


given that the individual is in age group Ai .
• P (Ai ) is the marginal probability of being in age group Ai .
Thus, the total probability P (B) accounts for all the different ways that the
event B can occur, considering the two factors involved.
Problem 4.14. In a clinical study, participants are classified by age and smok-

ing status. The probability of being in the young age group (less than 40 years) is
P (Young) = 0.60, and the probability of being in the old age group (40 years or
older) is P (Old) = 0.40. For the young age group, the conditional probabilities
of having high blood pressure (BP) are P (High BP | Young, Smoker) = 0.10 for
smokers and P (High BP | Young, Non-smoker) = 0.05 for non-smokers. For
the old age group, the conditional probabilities of having high BP are P (High BP |
Old, Smoker) = 0.40 for smokers and P (High BP | Old, Non-smoker) = 0.25
for non-smokers. The probability of being a smoker in the young age group
is P (Smoker | Young) = 0.30, and the probability of being a non-smoker is
P (Non-smoker | Young) = 0.70. In the old age group, the probability of being


a smoker is P (Smoker | Old) = 0.40, and the probability of being a non-smoker


is P (Non-smoker | Old) = 0.60. Calculate the overall probability of having high
blood pressure, P (High BP).

Solution
To compute the overall probability of having high blood pressure, P (High BP),
we use the Law of Total Probability. The total probability is given by:
P (High BP) = ∑i ∑j P (High BP | Ai , Sj ) · P (Sj | Ai ) · P (Ai )

Where
• P (High BP | Ai , Sj ) is the conditional probability of having high blood
pressure given that the individual belongs to age group Ai and has
smoking status Sj ,

• P (Ai ) is the marginal probability of being in age group Ai ,


• P (Sj | Ai ) is the conditional probability of being a smoker (or non-
smoker) given the age group Ai .

[Tree diagram: Age splits into Young and Old; each age group splits into
Smoker and Non Smoker; each smoking status splits into High BP and Not
High BP.]
Therefore, we have

P (High BP) = P (High BP | Young, Smoker) · P (Smoker | Young) · P (Young)


+ P (High BP | Young, Non-smoker) · P (Non-smoker | Young) · P (Young)
+ P (High BP | Old, Smoker) · P (Smoker | Old) · P (Old)
+ P (High BP | Old, Non-smoker) · P (Non-smoker | Old) · P (Old)
= (0.10 · 0.30 · 0.60) + (0.05 · 0.70 · 0.60) + (0.40 · 0.40 · 0.40)
+ (0.25 · 0.60 · 0.40)
= 0.018 + 0.021 + 0.064 + 0.060
= 0.163
Thus, the probability of developing high blood pressure is P (High BP) =
0.163 or 16.3%.

Interpretation: The overall probability of having high blood pressure, con-


sidering all age groups and smoking statuses, is 16.3%. This result helps un-
derstand the prevalence of high blood pressure in the study population, taking


into account various risk factors.
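The sum over all age-by-smoking cells can be written compactly in Python.
A small sketch with the study's probabilities:

    p_age = {"Young": 0.60, "Old": 0.40}
    p_smoke = {("Young", "Smoker"): 0.30, ("Young", "Non-smoker"): 0.70,
               ("Old", "Smoker"): 0.40, ("Old", "Non-smoker"): 0.60}
    p_bp = {("Young", "Smoker"): 0.10, ("Young", "Non-smoker"): 0.05,
            ("Old", "Smoker"): 0.40, ("Old", "Non-smoker"): 0.25}

    p_high_bp = sum(p_bp[key] * p_smoke[key] * p_age[key[0]] for key in p_bp)
    print(round(p_high_bp, 3))  # 0.163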

4.7.3 Bayes’ Theorem


Bayes’ Theorem relates the conditional and marginal probabilities of random
events. It is used to update the probability of a hypothesis based on observed
evidence. The theorem can be stated mathematically as follows:

P (A | B) = [P (B | A) · P (A)] / P (B)
where:
• P (A | B) is the posterior probability of event A given that event B has
occurred.

• P (B | A) is the likelihood of event B given that event A has occurred.


• P (A) is the prior probability of event A before observing event B.
If A1 , A2 , . . . , An is a partition of a sample space, then the marginal probability
P (B) is
P (B) = ∑i P (B | Ai ) · P (Ai ).

In this case, the posterior probability P (Ai | B) is

P (Ai | B) = [P (B | Ai ) · P (Ai )] / [∑(j=1 to n) P (B | Aj ) · P (Aj )]

which is known as Bayes’ theorem.

Bayes’ Theorem for Posterior Probabilities


If A1 , A2 , . . . , An is a partition of a sample space, then the posterior
probabilities of the events Ai conditional on an event B can be obtained
from the prior probabilities P (Ai ) and the conditional probabilities P (B |
Ai ) using the formula:

P (Ai | B) = [P (Ai ) · P (B | Ai )] / [∑(j=1 to n) P (B | Aj ) · P (Aj )]

where:

• P (Ai ) is the prior probability of event Ai ,

• P (B | Ai ) is the conditional probability of event B given Ai ,

• The denominator is the total probability of B, computed by summing
  over all events Aj in the partition.

Bayes’ Theorem is particularly useful in scenarios where the probability of


an event is updated as more evidence becomes available. It plays a crucial role
in fields such as machine learning, data analysis, and decision making under
uncertainty.
Problem 4.15. When a customer buys a car, the prior probabilities of it having
been assembled in a particular plant are P (Plant I) = 0.20, P (Plant II) = 0.24,
P (Plant III) = 0.25, and P (Plant IV) = 0.31. Each new car sold carries a
one-year bumper-to-bumper warranty. The company has collected data showing

the following conditional probabilities of making a warranty claim:

• P (claim | Plant I) = 0.05

• P (claim | Plant II) = 0.11


• P (claim | Plant III) = 0.03

• P (claim | Plant IV) = 0.08

If a claim is made on the warranty of the car, how does this change these
probabilities?

Solution
From Bayes’ theorem, the posterior probabilities are calculated as follows:

P (Plant I | Claim) = P (Plant I) · P (Claim | Plant I) / P (Claim)
                    = (0.20 × 0.05) / 0.0687 = 0.146

P (Plant II | Claim) = P (Plant II) · P (Claim | Plant II) / P (Claim)
                     = (0.24 × 0.11) / 0.0687 = 0.384

P (Plant III | Claim) = P (Plant III) · P (Claim | Plant III) / P (Claim)
                      = (0.25 × 0.03) / 0.0687 = 0.109

P (Plant IV | Claim) = P (Plant IV) · P (Claim | Plant IV) / P (Claim)
                     = (0.31 × 0.08) / 0.0687 = 0.361

Comments on the results: The posterior probabilities are as follows:

• P (Plant I | Claim) = 0.146

• P (Plant II | Claim) = 0.384

• P (Plant III | Claim) = 0.109


• P (Plant IV | Claim) = 0.361

Notice that Plant II has the largest claim rate (0.11), and its posterior
probability (0.384) is much larger than its prior probability (0.24). This is
expected since the fact that a claim is made increases the likelihood that the
car has been assembled in a plant with a high claim rate. Similarly, Plant III
has the smallest claim rate (0.03), and its posterior probability (0.109) is much
smaller than its prior probability (0.25), as expected.
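The posterior update is easy to script. A minimal Python sketch reusing the
numbers above:

    p_plant = [0.20, 0.24, 0.25, 0.31]   # prior probabilities
    p_claim = [0.05, 0.11, 0.03, 0.08]   # likelihoods P(claim | plant)

    p_b = sum(pp * pc for pp, pc in zip(p_plant, p_claim))   # P(claim)
    posterior = [pp * pc / p_b for pp, pc in zip(p_plant, p_claim)]
    print([round(p, 3) for p in posterior])  # [0.146, 0.384, 0.109, 0.361]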
Problem 4.16. Suppose it is known that 1% of the population suffers from
a particular disease. A blood test has a 97% chance of identifying the disease
for diseased individuals, but also has a 6% chance of falsely indicating that a

healthy person has the disease.

(a) What is the probability that a person will have a positive blood test?

(b) If your blood test is positive, what is the chance that you have the disease?
(c) If your blood test is negative, what is the chance that you do not have the
disease?

Solution
(a) Probability of a Positive Blood Test
Let D be the event that a person has the disease, and Dc be the event that a
person does not have the disease. Let T + be the event of a positive test result,
and T − be the event of a negative test result.


P (D) = 0.01
P (Dc ) = 0.99
P (T + |D) = 0.97
P (T + |Dc ) = 0.06

The total probability of a positive test result is given by:

P (T + ) = P (T + |D)P (D) + P (T + |Dc )P (Dc )

= (0.97 × 0.01) + (0.06 × 0.99)
= 0.0097 + 0.0594
= 0.0691

So, the probability that a person will have a positive blood test is 0.0691.
(b) Probability of Having the Disease Given a Positive Test
We use Bayes’ theorem:

P (D|T + ) = P (T + |D)P (D) / P (T + )
           = (0.97 × 0.01) / 0.0691
           = 0.0097 / 0.0691
           ≈ 0.1403

So, if your blood test is positive, the chance that you have the disease is
approximately 0.1403 or 14.03%.

(c) Probability of Not Having the Disease Given a Negative Test


We first find the probability of a negative test:

P (T − ) = P (T − |D)P (D) + P (T − |Dc )P (Dc )
         = (1 − P (T + |D))P (D) + (1 − P (T + |Dc ))P (Dc )
         = (1 − 0.97) × 0.01 + (1 − 0.06) × 0.99
         = 0.03 × 0.01 + 0.94 × 0.99
         = 0.0003 + 0.9306
         = 0.9309

Equivalently, P (T − ) = 1 − P (T + ) = 1 − 0.0691 = 0.9309.


Now, using Bayes’ theorem for P (Dc |T − ):

P (Dc |T − ) = P (T − |Dc )P (Dc ) / P (T − )
            = (0.94 × 0.99) / 0.9309
            = 0.9306 / 0.9309
            ≈ 0.9997

So, if your blood test is negative, the chance that you do not have the disease

is approximately 0.9997 or 99.97%.
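All three answers can be reproduced with a few lines of Python. A minimal
sketch; the three input probabilities come from the problem statement:

    p_d = 0.01       # P(disease)
    p_pos_d = 0.97   # P(T+ | D), true-positive rate
    p_pos_dc = 0.06  # P(T+ | no disease), false-positive rate

    p_pos = p_pos_d * p_d + p_pos_dc * (1 - p_d)         # (a) 0.0691
    p_d_pos = p_pos_d * p_d / p_pos                      # (b) about 0.1403
    p_dc_neg = (1 - p_pos_dc) * (1 - p_d) / (1 - p_pos)  # (c) about 0.9997
    print(p_pos, p_d_pos, p_dc_neg)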

4.8 Concluding Remarks


In this chapter, we covered the essential principles of probability that form the
backbone of data science. We discussed experiments, sample spaces, joint and
marginal probabilities, and conditional probabilities, providing a solid foun-
dation for analyzing uncertainty and making data-driven decisions. We also
explored various methods for assigning probabilities, including classical, empir-
ical, and subjective approaches. The insights gained from understanding joint
probabilities, marginal probabilities, and Bayes’ Theorem will be invaluable for
refining models and interpreting data.

As we move forward, the next chapter will delve into random variables and
their properties. Random variables are crucial for quantifying and modeling
uncertainty in a more structured way. We will explore different types of ran-
dom variables, their distributions, and key properties, further building on the

probability concepts introduced here. Mastering these topics will enhance your
ability to handle complex data challenges and apply statistical techniques effec-
tively. Understanding random variables is essential for advanced data analysis
and developing predictive models.

4.9 Chapter Exercises


1. Consider an experiment where a fair six-sided die is rolled. Define the
following events:

• A: The event that the outcome is an even number.


• B: The event that the outcome is greater than 4.
Calculate the following probabilities:
(a) P (A)


(b) P (B)
(c) P (A ∩ B)
(d) P (A ∪ B)
(e) P (Ac )

2. In a bag of 10 balls, 4 are red and 6 are blue. Two balls are drawn at
random without replacement. Define the following events:
• A: Drawing a red ball on the first draw.
• B: Drawing a red ball on the second draw.

Calculate the following probabilities:

(a) P (A)
(b) P (B | A)
(c) P (A ∩ B)
AF(d) P (A ∪ B)

3. You are given a deck of 52 playing cards. Define the following events:
• A: Drawing a card that is a heart.
• B: Drawing a card that is a queen.
Calculate the following probabilities:
(a) P (A)
(b) P (B)
(c) P (A ∩ B)

(d) P (A ∪ B)
(e) P (Ac )

4. Suppose that the probability of a young person liking Facebook is 0.7,


the probability of liking YouTube is 0.6, and the probability of liking
both platforms is 0.5. Using the relevant probability theorems, determine
the following:

(a) What is the probability that a young person likes exactly one of
the two social media platforms?
(b) What is the probability that a young person likes at least one of
the two platforms?
(c) What is the probability that a young person likes only Facebook
and not YouTube?


5. A quality control team in a small factory inspects a batch of 60 parts.


It was observed that 10 parts were defective in appearance, 12 parts had
functional defects, and 4 parts were both defective in appearance and
function. If a part is selected randomly, what is the probability that it is
defective in appearance or has a functional defect?
6. A survey finds that 60% of people prefer coffee over tea, and 30% prefer
both coffee and tea. What is the probability that a randomly chosen
person prefers at least one of the two drinks? Define the following events:

• A: Preferring coffee.

• B: Preferring tea.

Given:
• P (A) = 0.6
• P (A ∩ B) = 0.3
AFCalculate: P (A ∪ B)
7. In a class of 30 students, 18 like mathematics, 12 like science, and 8 like
both. If a student is chosen at random, calculate:

(a) The probability that the student likes science.


(b) The probability that the student likes mathematics given they like
science.
(c) The probability that the student likes either mathematics or science.

8. Consider the employment status of male and female employees in a tech-



nology company. The company employs 1500 individuals, of whom 1050


are men and 450 are women. Over the past year, 375 employees were
promoted. After analyzing the promotion records, a committee of female
employees raised concerns about potential gender bias, noting that 315
male employees had received promotions, while only 60 female employees
were promoted.

(i) Construct a joint probability table based on these data. Calcu-


late the marginal probabilities. If a male employee is selected
randomly, what is the probability that he was promoted?
(ii) If a female employee is selected randomly, what is the probability
that she was not promoted? Also, if a randomly selected em-
ployee was promoted, what is the probability that the employee
is male?
(iii) If a randomly selected employee was not promoted, what is the
probability that the employee is female?


9. In a clinical study, researchers are interested in the probability of a patient


developing a particular health condition based on the type of treatment
received. There are three types of treatments: A, B, and C. The proba-
bilities of receiving each treatment are as follows:

• Treatment A: 30% (P (A) = 0.30)


• Treatment B: 50% (P (B) = 0.50)
• Treatment C: 20% (P (C) = 0.20)

The probability of developing the health condition given the type of treat-
ment is known to be:

• P (Condition | A) = 0.10
• P (Condition | B) = 0.25
• P (Condition | C) = 0.15
AF Find the overall probability of a patient developing the health condition,
denoted as P (Condition).
Using Bayes’ theorem, calculate the following probabilities:

(a) P (A | Condition)
(b) P (B | Condition)
(c) P (C | Condition).

10. Suppose that somebody secretly rolls two fair six-sided dice, and what
is the probability that the face-up value of the first one is 3, given the
information that their sum is no greater than 5?

11. An electrical system consists of four components as illustrated in the


following figure.


The system works if components A and B work and either of the com-
ponents C or D works. The reliability (probability of working) of each
component is also shown in the above figure. Find the probability that
(a) the entire system works.
(b) the component C does not work, given that the entire system works.
Assume that the four components work independently.
(c) the component D does not work, given that the entire system works.
12. An agricultural research establishment grows vegetables and grades each
one as either good or bad for its taste, good or bad for its size, and good

or bad for its appearance. Overall 78% of the vegetables have a good
taste. However, only 69% of the vegetables have both a good taste and a
good size. Also, 5% of the vegetable have both a good taste and a good
appearance, but a bad size. Finally, 84% of the vegetables have either a
good size or a good appearance.
(a). If a vegetable has a good taste, what is the probability that it
     also has a good size?
(b). If a vegetable has a bad size and a bad appearance, what is the
     probability that it has a good taste?
13. A company produces electronic components, and it has two types of ma-
chines, A and B, that manufacture these components. Machine A pro-
duces 60% of the components, while Machine B produces 40%. Historical
data shows that 2% of the components produced by Machine A are defec-
tive, while 5% of the components produced by Machine B are defective.

A component is selected at random and found to be defective. What is


the probability that this defective component was produced by Machine
A?
14. A certain rare disease affects 2% of a population. A diagnostic test for
this disease has the following characteristics:
• It correctly identifies the disease (true positive) 90% of the time
for those who have it.
• It incorrectly indicates the disease (false positive) 5% of the time
for those who do not have it.
Answer the following questions.
(a) What is the probability that a person will receive a positive test
result?


(b) If a person tests positive, what is the probability that they actually
have the disease?
(c) If a person tests negative, what is the probability that they do not
have the disease?
15. As a mining company evaluates the likelihood of discovering a gold de-
posit in a specific region, they have gathered data on the probabilities
associated with geological features. Given that the probability of finding
a gold deposit is P (G) = 0.3, the likelihood of observing specific geolog-
ical features if a deposit is present is P (E|G) = 0.8, and the chance of
observing those features if no deposit exists is P (E|Gc ) = 0.1, answer the

following:

(i) Calculate the probability of observing the geological features in this


area.
(ii) What is the probability that there is indeed a gold deposit given the
observed geological features?
AF (iii) How would you interpret these results in terms of their implications
for the mining company’s decision-making process regarding further
exploration in this region?

16. A factory produces 80% of products with Machine A and 20% with Ma-
chine B. If 2% of A’s products and 5% of B’s products are defective, what
is the probability that a defective product came from Machine A?
17. Suppose you are on a game show with three doors: one has a car, the
other two have goats. You choose Door 1. The host, who knows what’s
behind the doors, opens Door 3 to show a goat and asks if you want to
switch to Door 2.

(a) What’s the chance of winning the car if you switch to Door 2?
(b) What’s the chance of winning the car if you stay with Door 1?

Chapter 5

Random Variable and Its Properties

5.1 Introduction
In the realm of data science, understanding and manipulating uncertainty is
a fundamental skill. At the core of this capability lies the concept of a ran-
dom variable. A random variable is a quantitative variable whose values are
determined by the outcome of a random phenomenon. It serves as a bridge
connecting the abstract world of probability theory to the concrete domain of
data analysis.

Random variables can be classified into two main types: discrete and contin-
uous. Discrete random variables take on a countable number of distinct values,

often representing things like the number of occurrences of an event. Contin-


uous random variables, on the other hand, can take on an infinite number of
possible values within a given range, making them essential for representing
measurements and other quantities that vary smoothly.

This chapter delves into the foundational aspects of random variables, ex-
ploring their properties and the critical role they play in statistical modeling
and data analysis. We will discuss probability distributions, expected values,
variances, and other key summary measures. By the end of this chapter, readers will gain a robust un-
derstanding of how random variables function and how they can be applied to
solve real-world problems in data science.

5.2 Random Variable


A random variable is a mathematical concept used in probability theory and
statistics, representing a variable whose possible values depend on the outcomes


of a random experiment. It serves as a fundamental tool for defining proba-


bility distributions and calculating probabilities associated with events arising
from uncertain or stochastic processes. In data science, random variables are
fundamental because they allow us to model and reason about uncertainty and
variability in data.

A random variable is a numerical outcome of a random phenomenon. It is


a function that assigns a real number to each outcome in a sample space of a
random experiment. Formally, a random variable X is defined as a function:
X:S→R
where S is the sample space of the experiment, and R is the set of real numbers.

Random Variable: A variable whose possible values are determined by
outcomes of a random experiment or process, with each value associated
with a probability.

Example: Testing Electronic Components


Consider a random experiment where three electronic components are tested
for defects. The sample space, giving a detailed description of each possible
outcome, can be written as follows:

S = {NNN, NDN, NND, DNN, NDD, DND, DDN, DDD}


where,
• N stands for a non-defective component.

• D stands for a defective component.


Let X be the random variable representing the number of defective com-

ponents in the sample. The possible values of X are denoted by x, and their
corresponding outcomes are listed in Table 5.1. The random variable X can
take on the following values:
• x = 0: No defective components (Outcome: NNN)

• x = 1: One defective component (Outcomes: NDN, NND, DNN)

• x = 2: Two defective components (Outcomes: NDD, DND, DDN)

• x = 3: Three defective components (Outcome: DDD).

Table 5.1: Possible Outcomes When Testing Three Electronic Components

Outcome NNN NDN NND DNN NDD DND DDN DDD

x 0 1 1 1 2 2 2 3


In this example, X is a discrete random variable because it can take on


a countable number of distinct values. Each value of X corresponds to the
number of defective components in the tested sample. There are two main
types of random variables: discrete and continuous.

5.3 Discrete Random Variables


A discrete random variable can take on a countable number of possible
values. Here are some examples of discrete random variables:

1. Number of Heads in a Series of Coin Tosses: When flipping a

fair coin multiple times, the number of heads observed in the series is a
discrete random variable. For example, if you flip a coin 10 times, the
number of heads (0 to 10) is a discrete outcome.

2. Number of Defective Items in a Batch: In quality control, the num-


ber of defective items in a batch of products is a discrete random variable.
For instance, if a factory produces 100 items in a day, the number of de-
fective items could be any integer from 0 to 100.

3. Number of Customers in a Queue: The number of customers waiting


in line at a service center or a bank is a discrete random variable. At any
given time, this number could be 0, 1, 2, and so on.
4. Roll of a Die: When rolling a standard six-sided die, the outcome is a
discrete random variable with possible values of 1, 2, 3, 4, 5, or 6.

5. Number of Emails Received in a Day: The number of emails a


person receives in a day is a discrete random variable. It can take on any
non-negative integer value (0, 1, 2, . . . ).

6. Number of Accidents at an Intersection: The number of traffic


accidents occurring at a particular intersection in a month is a discrete
random variable. This count could be 0, 1, 2, and so on.
7. Number of Children in a Family: The number of children in a family
is a discrete random variable, with possible values of 0, 1, 2, and so forth.
8. Number of Sales Transactions in a Day: The number of sales trans-
actions processed by a retail store in a single day is a discrete random
variable, representing the count of individual sales.

These examples illustrate various contexts in which discrete random vari-


ables are used to model and analyze real-world phenomena.


5.3.1 Probability Mass Function (pmf )


The probability distribution of a random variable describes how probabilities
are distributed over the values of the random variable. For a discrete random
variable, the probability distribution is described by the probability mass
function (pmf ), which gives the probability that the random variable takes
on a specific value.

Probability Mass Function: A pmf of a discrete random variable X


is a function p(x) that gives the probability that X takes the value x. It
satisfies:

(i). Non-negativity: For every possible value x that X can take, the
probability p(x) is non-negative:

p(x) ≥ 0

(ii). Normalization: The sum of the probabilities over all possible values
of X equals 1:

∑(x ∈ Range of X) p(x) = 1

(iii). Probability Assignment: For any specific value x, p(x) gives the
probability that the random variable X takes the value x:

p(x) = P (X = x)

Example: Testing Electronic Components


Consider the example of Testing Electronic Components described in the

previous section, where X is the random variable representing the number of


defective components in the three tested electronic components. The prob-
ability mass function for X is shown below, and the graphical representation is
presented in Figure 5.1.

Table 5.2: Probability mass function

Outcome              x    P (X = x)
{NNN}                0    1/8
{NDN, NND, DNN}      1    3/8
{NDD, DND, DDN}      2    3/8
{DDD}                3    1/8

148
CHAPTER 5. RANDOM VARIABLE AND ITS PROPERTIES

Figure 5.1: Probability Mass Function of X
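A figure like this can be reproduced in a few lines of Python. A minimal
sketch assuming matplotlib is installed:

    import matplotlib.pyplot as plt

    x = [0, 1, 2, 3]
    pmf = [1/8, 3/8, 3/8, 1/8]

    plt.stem(x, pmf)
    plt.xlabel("X (Number of Defective Components)")
    plt.ylabel("Probability P(X = x)")
    plt.show()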

5.3.2 Cumulative Distribution Function (cdf )


The cumulative distribution function (cdf) of a random variable X is a
function that gives the probability that X will take a value less than or equal
to x. For both discrete and continuous random variables,

F (x) = P (X ≤ x).

The cdf is a non-decreasing function that ranges from 0 to 1.



For the discrete random variable X, the cumulative distribution function


can then be calculated from the expression:
F (x) = ∑(y ≤ x) P (X = y).

Example: Testing Electronic Components


Consider the example of Testing Electronic Components described in the
previous section, where X is the random variable representing the number
of defective components in the three tested electronic components. The
cumulative distribution function for X is shown below, and the graphical rep-
resentation is presented in Figure 5.2.


Table 5.3: The cdf for the Number of Defective Components

x    P (X = x)    F (x)
0    1/8          0.125
1    3/8          0.500
2    3/8          0.875
3    1/8          1.000


Figure 5.2: Cumulative Distribution Function of X
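Table 5.3 is just a running (cumulative) sum of the pmf, which numpy
computes directly. A short sketch assuming numpy is available:

    import numpy as np

    pmf = np.array([1/8, 3/8, 3/8, 1/8])
    cdf = np.cumsum(pmf)
    print(cdf)  # [0.125 0.5 0.875 1.0]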


DR

5.3.3 Properties of the Cumulative Distribution Function


The cdf of a random variable has several important properties:

1. Non-decreasing: The cdf F (x) is a non-decreasing function. This means


that if x1 ≤ x2 , then F (x1 ) ≤ F (x2 ). The probability that the random
variable takes a value less than or equal to x does not decrease as x
increases.
2. Limits:
• limx→−∞ F (x) = 0; that is the minimum value of the cdf is 0.
• limx→+∞ F (x) = 1; that is the maximum value of the cdf is 1.
3. Right-Continuous: The cdf F (x) is right-continuous. This means that
for any value x, the limit of F (x) as t approaches x from the right (t → x+ )


is equal to F (x). Mathematically, this can be written as limt→x+ F (t) = F (x).
4. Range: The cdf F (x) takes values in the interval [0, 1]. For any real
number x, 0 ≤ F (x) ≤ 1. This reflects the fact that probabilities range
from 0 to 1.

5. Step Function: For discrete random variables, the cdf F (x) is a step
function, with the value increasing at each point where the random vari-
able takes a value.

Problem 5.1. An office has four copying machines, and the random variable

X measures how many of them are in use at a particular moment in time.
Suppose that P (X = 0) = 0.08, P (X = 1) = 0.11, P (X = 2) = 0.27, and
P (X = 3) = 0.33.

(a) What is P (X = 4)?


(b) Draw a line graph of the probability mass function.
AF
(c) Construct and plot the cumulative distribution function.

Solution
(a) Since the sum of all probabilities must be 1, we have:

P (X = 4) = 1 − (P (X = 0) + P (X = 1) + P (X = 2) + P (X = 3))
= 1 − (0.08 + 0.11 + 0.27 + 0.33) = 1 − 0.79
= 0.21
DR

(b) The graphical presentation of the probability mass function is the follow-
ing:


Figure 5.3: Probability Mass Function

(c) We know that the cumulative distribution function F (x) is defined as:

F (x) = P (X ≤ x)

The cumulative distribution function F (x) with probability mass function


is provided in Table 5.4.

x      0     1     2     3     4
p(x)   0.08  0.11  0.27  0.33  0.21
F (x)  0.08  0.19  0.46  0.79  1.00

Table 5.4: Cumulative Distribution Function of X


where,
F (0) = P (X = 0) = 0.08
F (1) = P (X ≤ 1) = P (X = 0) + P (X = 1) = 0.08 + 0.11 = 0.19
F (2) = P (X ≤ 2) = P (X = 0) + P (X = 1) + P (X = 2)
= 0.08 + 0.11 + 0.27 = 0.46
F (3) = P (X ≤ 3) = P (X = 0) + P (X = 1) + P (X = 2) + P (X = 3)
= 0.08 + 0.11 + 0.27 + 0.33 = 0.79
F (4) = P (X ≤ 4)
= P (X = 0) + P (X = 1) + P (X = 2) + P (X = 3) + P (X = 4)

= 0.08 + 0.11 + 0.27 + 0.33 + 0.21 = 1.00

The graphical presentation of F (x) is the following:

[Step plot of the cdf F (x) at x = 0, 1, 2, 3, 4, rising from 0.08 to 1.]
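A short Python sketch for this problem, using only the probabilities given:

    import numpy as np

    p = {0: 0.08, 1: 0.11, 2: 0.27, 3: 0.33}
    p[4] = 1 - sum(p.values())                  # (a) remaining probability
    cdf = np.cumsum([p[x] for x in sorted(p)])  # (c) cumulative distribution

    print(round(p[4], 2))  # 0.21
    print(cdf)             # approx. [0.08 0.19 0.46 0.79 1.0]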
DR

Problem 5.2. Let the number of phone calls received by a switchboard during
a 5-minute interval be a random variable X with probability function
p(x) = (e^(−2) · 2^x) / x!, for x = 0, 1, 2, . . .
(a) Determine the probability that x equals 0, 1, 2, 3, 4, 5, and 6.
(b) Graph the probability mass function for these values of x.
(c) Determine the cumulative distribution function for these values of X.

Solution
(a) Probabilities
The probability function is given by
p(x) = (e^(−2) · 2^x) / x!


The probabilities for X = 0, 1, 2, 3, 4, 5, 6 are:

P (X = 0) = e^(−2) · 2^0 / 0! = e^(−2) ≈ 0.1353
P (X = 1) = e^(−2) · 2^1 / 1! = 2e^(−2) ≈ 0.2707
P (X = 2) = e^(−2) · 2^2 / 2! = 2e^(−2) ≈ 0.2707
P (X = 3) = e^(−2) · 2^3 / 3! = (4/3)e^(−2) ≈ 0.1804
P (X = 4) = e^(−2) · 2^4 / 4! = (2/3)e^(−2) ≈ 0.0902
P (X = 5) = e^(−2) · 2^5 / 5! = (4/15)e^(−2) ≈ 0.0361
P (X = 6) = e^(−2) · 2^6 / 6! = (4/45)e^(−2) ≈ 0.0120
AF
(b) Graph of the Probability Mass Function


Figure 5.4: Probability Mass Function of X

(c) Cumulative Distribution Function


The cumulative distribution function F (x) = P (X ≤ x) for x = 0, 1, 2, 3, 4, 5, 6, . . .
is:


F (0) = P (X ≤ 0) = P (X = 0) = 0.1353
F (1) = P (X ≤ 1) = P (X = 0) + P (X = 1) = 0.1353 + 0.2707 = 0.4060
F (2) = P (X ≤ 2) = P (X = 0) + P (X = 1) + P (X = 2)
= 0.4060 + 0.2707 = 0.6767
F (3) = P (X ≤ 3) = P (X = 0) + P (X = 1) + P (X = 2) + P (X = 3)
= 0.6767 + 0.1804 = 0.8571
F (4) = P (X ≤ 4) = P (X = 0) + P (X = 1) + P (X = 2) + P (X = 3) + P (X = 4)
= 0.8571 + 0.0902 = 0.9473

F (5) = P (X ≤ 5) = P (X = 0) + P (X = 1) + P (X = 2) + P (X = 3)
+ P (X = 4) + P (X = 5)
= 0.9473 + 0.0361 = 0.9834
F (6) = P (X ≤ 6) = P (X = 0) + P (X = 1) + P (X = 2) + P (X = 3)
+ P (X = 4) + P (X = 5) + P (X = 6)
= 0.9834 + 0.0120 = 0.9954
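The pmf in this problem is that of a Poisson distribution with mean 2, so
parts (a) and (c) can be checked with scipy. A minimal sketch assuming
scipy is installed:

    from scipy.stats import poisson

    x = range(7)
    pmf = poisson.pmf(x, mu=2)   # e^(-2) * 2**x / x!
    cdf = poisson.cdf(x, mu=2)
    for k, p, F in zip(x, pmf, cdf):
        print(k, round(p, 4), round(F, 4))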

5.3.4 Exercise
1. An office has four copying machines, and the random variable X denotes
how many of them are in use at a particular time. Suppose the probability
mass function of X is given below:

x 0 1 2 3 4
Pr(X = x) k 0.02 0.05 0.4 (k + 0.3)

(a) What is the value of k and draw the line graph of the probability
mass function Pr(X = x).
(b) Find the Value of Pr(X ≤ 2).
(c) Find the probability that at least two copying machines are in
use.
(d) Find the cumulative Function F (x) and draw the F (x).

2. An office has five printers and the random variable Y measures how many
of them are currently being used. Suppose that P (Y = 0) = 0.05, P (Y =
1) = 0.10, P (Y = 2) = 0.20, P (Y = 3) = 0.30, and P (Y = 4) = 0.25.

(a) What is P (Y = 5)?


(b) Draw a line graph of the probability mass function.
(c) Construct and plot the cumulative distribution function.


3. A hospital has six emergency rooms and the random variable Z measures
how many of them are occupied at a given time. Suppose that P (Z =
0) = 0.04, P (Z = 1) = 0.10, P (Z = 2) = 0.20, P (Z = 3) = 0.25,
P (Z = 4) = 0.20, and P (Z = 5) = 0.15.

(a) What is P (Z = 6)?


(b) Draw a line graph of the probability mass function.
(c) Construct and plot the cumulative distribution function.

4. A hospital has three emergency rooms, and the random variable W de-
notes how many of them are occupied at a particular time. Suppose the

probability mass function W is given below:

w 0 1 2 3
Pr(W = w) q 0.15 0.25 (q + 0.05)
(a) What is the value of q and draw the line graph of the probability
mass function Pr(W = w).
(b) Find the value of Pr(W ≤ 2). Also, find the probability that at least
one emergency room is occupied.
(c) Find the cumulative distribution function F (w) and draw the graph
of F (w).

5. A clinic has three doctors and the random variable W measures how
many of them are available at a particular moment in time. Suppose that
P (W = 0) = 0.15, P (W = 1) = 0.20, and P (W = 2) = 0.30.

(a) What is P (W = 3)?


(b) Draw a line graph of the probability mass function.
(c) Construct and plot the cumulative distribution function.

6. A warehouse has seven forklifts and the random variable V measures how
many of them are currently in operation. Suppose that P (V = 0) = 0.02,
P (V = 1) = 0.08, P (V = 2) = 0.18, P (V = 3) = 0.25, P (V = 4) = 0.20,
and P (V = 5) = 0.15.

(a) What is P (V = 6)?


(b) Draw a line graph of the probability mass function.
(c) Construct and plot the cumulative distribution function.

7. A manufacturing plant has four assembly lines and the random variable
U measures how many of them are operating at a given time. Suppose
that P (U = 0) = 0.10, P (U = 1) = 0.20, and P (U = 2) = 0.35.


(a) What is P (U = 3)?


(b) Draw a line graph of the probability mass function.
(c) Construct and plot the cumulative distribution function.

5.4 Continuous Random Variables


A continuous random variable can take on an uncountable number of pos-
sible values. Here are some examples of continuous random variables:

1. Height of Individuals: The height of a person is a continuous random

variable because it can take any value within a given range. For example,
the height could be 170.2 cm, 175.5 cm, etc.
2. Time Taken to Complete a Task: The time required to finish a task,
such as running a marathon, is a continuous random variable. It can be
measured in hours, minutes, seconds, and fractions of a second.
3. Temperature: The temperature at a specific location and time is a
continuous random variable. It can take any value within the possible
range of temperatures, such as 23.45°C, 37.8°C, etc.

4. Weight of an Object: The weight of an object is a continuous random


variable. For example, a bag of flour might weigh 1.25 kg, 1.30 kg, etc.
5. Amount of Rainfall: The amount of rainfall in a day is a continuous
random variable. It can be measured in millimeters or inches, and it can
take any value within a range.
6. Price of a Stock: The price of a stock at any given moment is a con-

tinuous random variable. It can vary continuously and take on any value
within the range of possible stock prices.
7. Age of an Individual: The age of a person can be considered a contin-
uous random variable if measured precisely. For instance, someone could
be 25.3 years old, 45.7 years old, etc.
8. Voltage in an Electrical Circuit: The voltage at a point in an electrical
circuit is a continuous random variable. It can take any value within the
possible voltage range.

These examples illustrate various contexts in which continuous random vari-


ables are used to model and analyze real-world phenomena.


5.4.1 Probability Density Function (pdf )


For a continuous random variable X, the probability distribution is described
by the probability density function (pdf), denoted as f (x). This function
specifies the probability density at each point in the random variable’s range.
The pdf f (x) has the following properties:

1. Non-negativity: f (x) ≥ 0 for all x.

2. Normalization: The total area under the pdf curve over the entire range
of X is equal to 1:

∫_(−∞)^(∞) f (x) dx = 1.

Note that the pdf f (x) does not provide the probability of X taking any spe-
cific value (which is always zero for continuous random variables). Instead, it
indicates the density of probability at each point. To find the probability that
X falls within a specific interval [a, b] given in Figure 5.5, you integrate the pdf
over that interval:

P (a ≤ X ≤ b) = ∫_a^b f (x) dx.


Figure 5.5: The area under the probability density function f (x) between a and
b.

Sometimes, the density of X is denoted by fX (x) to explicitly indicate that


the function f corresponds to the random variable X.


Probability Density Function: The probability density function (pdf)


of a continuous random variable X with support S is an integrable function
f (x) satisfying the following conditions:
(i). f (x) is positive everywhere in the support S, that is, f (x) ≥ 0 for
all x ∈ S. The area under the curve f (x) in the support S is 1,
that is:

∫_S f (x) dx = 1.

(ii). If f (x) is the pdf of X, then the probability that X belongs to


an interval [a, b] is given by the integral of f (x) over that interval,

that is:

P (a ≤ X ≤ b) = ∫_a^b f (x) dx.

It is useful to notice that the probability that a continuous random variable X


takes any specific value a is always 0! Technically, this can be seen by noting
that

P (X = a) = ∫_a^a f (x) dx = 0.

Problem 5.3 (Metal Cylinder Production). Suppose the diameter X of a metal


cylinder has the following probability density function (pdf ):
f (x) = 1.5 − 6(x − 50.0)^2   for 49.5 ≤ x ≤ 50.5
        0                     otherwise.

(i). Prove that f (x) is a valid probability density function.



(ii). Find the probability that the diameter of the metal cylinder lies between
49.8 mm and 50.1 mm, i.e., calculate P (49.8 ≤ X ≤ 50.1).

Solution
(i). To determine if f (x) = 1.5−6(x−50.0)2 for 49.5 ≤ x ≤ 50.5 and f (x) = 0
elsewhere is a valid probability pdf, we need to check two conditions:
1. Non-negativity: f (x) ≥ 0 for all x.
2. Normalization: The total integral of f (x) over all possible values must
equal 1.

Non-negativity Check
We need to ensure that f (x) ≥ 0 for 49.5 ≤ x ≤ 50.5:

f (x) = 1.5 − 6(x − 50.0)2


For 49.5 ≤ x ≤ 50.5, let’s calculate the minimum value of the quadratic
function: (x − 50.0)2 is minimized at x = 50.0, and (x − 50.0)2 ranges from 0
to (0.5)2 = 0.25.

f (x) = 1.5 − 6(x − 50.0)2 ≥ 1.5 − 6 · 0.25 = 1.5 − 1.5 = 0.


Thus, f (x) ≥ 0 for all x in the given interval.

Normalization Check
We need to integrate f (x) over the interval 49.5 ≤ x ≤ 50.5 and check if the
integral equals 1:

T
Z 50.5 Z 50.5
1.5 − 6(x − 50.0)2 dx

f (x) dx =
49.5 49.5
Z 50.5 Z 50.5
= 1.5 dx − 6(x − 50.0)2 dx
49.5 49.5
AF = 1.5 × (50.5 − 49.5) −
Z 50.5
Z 50.5

49.5
6(x − 50.0)2 dx

= 1.5 − 6(x − 50.0)2 dx. (5.1)


49.5
(5.2)

The integral

∫_(49.5)^(50.5) 6(x − 50.0)^2 dx

can be simplified by substitution. Let u = x − 50.0. Then du = dx, and the
limits of integration change accordingly: when x = 49.5, u = −0.5, and when
x = 50.5, u = 0.5. So,

∫_(49.5)^(50.5) 6(x − 50.0)^2 dx = 6 ∫_(−0.5)^(0.5) u^2 du

The integral of u^2 is:

∫_(−0.5)^(0.5) u^2 du = [u^3/3] from −0.5 to 0.5
                      = (0.5)^3/3 − (−0.5)^3/3
                      = 0.125/3 + 0.125/3
                      = 1/12


and

6 ∫_(−0.5)^(0.5) u^2 du = 6 × 1/12 = 6/12 = 0.5

Therefore, from Equation (5.2), we have

∫_(49.5)^(50.5) f (x) dx = 1.5 − 0.5 = 1

Since both conditions are satisfied, f (x) is indeed a valid probability density
function.

The graphical presentation of the density f (x) is presented in Figure 5.6.


Figure 5.6: Density Plot of the pdf f (x)

(ii). We can find the probability that a metal cylinder has a diameter between
49.8 and 50.1 mm which is


P (49.8 ≤ X ≤ 50.1) = ∫_(49.8)^(50.1) f (x) dx
                    = ∫_(49.8)^(50.1) [1.5 − 6(x − 50.0)^2] dx
                    = ∫_(49.8)^(50.1) 1.5 dx − ∫_(49.8)^(50.1) 6(x − 50.0)^2 dx
                    = 1.5 [x] from 49.8 to 50.1 − 6 ∫_(−0.2)^(0.1) u^2 du   [let u = x − 50.0]
                    = 1.5(50.1 − 49.8) − 6 [(0.1)^3/3 − (−0.2)^3/3]
                    = 0.45 − 0.018 = 0.432
AF
Thus, the probability that a metal cylinder has a diameter between 49.8 and
50.1 mm is 0.432 or 43.2%.
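Both integrals can also be verified numerically with scipy. A small sketch
assuming scipy is installed:

    from scipy.integrate import quad

    f = lambda x: 1.5 - 6 * (x - 50.0) ** 2

    total, _ = quad(f, 49.5, 50.5)   # normalization check, part (i)
    prob, _ = quad(f, 49.8, 50.1)    # P(49.8 <= X <= 50.1), part (ii)
    print(round(total, 4), round(prob, 4))  # 1.0 0.432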

5.4.2 Cumulative Distribution Function (cdf )


The cumulative distribution function (cdf) of a continuous random variable
X is a function that gives the probability that X will take a value less than or
equal to x. The cumulative distribution function can be calculated from the
expression:

F (x) = P (X ≤ x) = ∫_(−∞)^x f (y) dy.

In practical applications, the lower integration limit of −∞ can be replaced by


the lower boundary of the state space, since the probability density function
(pdf) is zero outside this region. The pdf can be obtained by differentiating the
cumulative distribution function (cdf), which is given by:

f (x) = dF (x)/dx.
This relationship connects the probability density function to the cumulative
probability.

Cumulative Distribution Function: The cumulative distribution


function F (x) of a random variable X is defined as
F (x) = P (X ≤ x) = ∫_(−∞)^x f (y) dy.


For a continuous random variable X, the three properties mentioned in


Section 5.3.3 are satisfied. In addition, the following property must hold: if the
cdf F (·) is continuous at any a ≤ x ≤ b, then

P (a ≤ X ≤ b) = F (b) − F (a).

Note that F (x) is a non-decreasing function, meaning that if a < b, then


F (a) ≤ F (b). This reflects that as you move to the right along the x-axis, the
cumulative probability does not decrease. Another important property is as
follows:

• limx→−∞ F (x) = 0: As x approaches negative infinity, the cdf ap-

proaches 0, indicating that the probability of the random variable being
less than any finite value is 0.

• limx→∞ F (x) = 1: As x approaches positive infinity, the cdf approaches


1, indicating that the probability of the random variable being less than
any sufficiently large value is 1.
Problem 5.4. Consider Problem 5.3, where the probability density function f (x)
of the random variable X is

f (x) = 1.5 − 6(x − 50.0)^2   for 49.5 ≤ x ≤ 50.5,
        0                     otherwise.
Find the cumulative distribution function of X and give the graphical presenta-
tion of this function.

Solution
The cumulative distribution function of X, for 49.5 ≤ x ≤ 50.5, is defined as

F (x) = ∫_(49.5)^x [1.5 − 6(t − 50.0)^2] dt
      = ∫_(49.5)^x 1.5 dt − ∫_(49.5)^x 6(t − 50.0)^2 dt
      = 1.5(x − 49.5) − 6 ∫_(−0.5)^(x−50) u^2 du     [let u = t − 50.0]
      = 1.5(x − 49.5) − 6 [u^3/3] from −0.5 to x − 50
      = 1.5(x − 49.5) − 2[(x − 50)^3 + 0.125]
      = 1.5(x − 49.5) − 2(x − 50)^3 − 0.25

Therefore, the cdf F (x) is:

F (x) = 0                                      for x < 49.5
        1.5(x − 49.5) − 2(x − 50)^3 − 0.25     for 49.5 ≤ x ≤ 50.5
        1                                      for x > 50.5

The graphical presentation of F (x) is depicted in Figure 5.7.


Figure 5.7: The cumulative distribution function of f (x).
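The closed-form cdf can be evaluated and plotted in Python. A minimal
sketch assuming numpy and matplotlib are installed:

    import numpy as np
    import matplotlib.pyplot as plt

    def F(x):
        x = np.asarray(x, dtype=float)
        mid = 1.5 * (x - 49.5) - 2 * (x - 50) ** 3 - 0.25
        return np.where(x < 49.5, 0.0, np.where(x > 50.5, 1.0, mid))

    xs = np.linspace(49, 51, 400)
    plt.plot(xs, F(xs))
    plt.xlabel("x")
    plt.ylabel("F(x)")
    plt.show()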

Problem 5.5. Let X be a continuous random variable with the probability


density function (pdf ):
f (x) = 2x   for 0 ≤ x ≤ 1
        0    otherwise

(i). Verify that the function f (x) is a valid probability density function by
showing that the total area under the curve is equal to 1.

(ii). Calculate the probability P (0.5 < X < 0.8).

(iii). Derive the cumulative distribution function F (x).

Solution
(i). To verify that the function
f (x) = 2x   for 0 ≤ x ≤ 1
        0    otherwise


is a valid probability density function (pdf), we calculate the integral over its
range:
∫_(−∞)^(∞) f (x) dx = ∫_0^1 2x dx

Calculating the integral:

∫_0^1 2x dx = [x^2] from 0 to 1 = 1^2 − 0^2 = 1

Since ∫_(−∞)^(∞) f (x) dx = 1, we conclude that f (x) is a valid probability
density function.

(ii). We compute
P (0.5 < X < 0.8) = ∫_(0.5)^(0.8) 2x dx = [x^2] from 0.5 to 0.8
                  = (0.8)^2 − (0.5)^2 = 0.39

Thus, P (0.5 < X < 0.8) = 0.39.

(iii). The cumulative distribution function F (x) is given by:


F (x) = ∫_(−∞)^x f (t) dt

For 0 ≤ x ≤ 1:

F (x) = ∫_0^x 2t dt = [t^2] from 0 to x = x^2

So, the cdf F (x) can be summarized as:

F (x) = 0     for x < 0
        x^2   for 0 ≤ x ≤ 1
        1     for x > 1
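All three parts can be checked symbolically with sympy. A minimal sketch
assuming sympy is installed:

    import sympy as sp

    t, x = sp.symbols("t x")
    f = 2 * t

    total = sp.integrate(f, (t, 0, 1))    # (i) equals 1, so f is a valid pdf
    prob = sp.integrate(f, (t, sp.Rational(1, 2), sp.Rational(4, 5)))
    F = sp.integrate(f, (t, 0, x))        # (iii) cdf for 0 <= x <= 1
    print(total, prob, F)  # 1 39/100 x**2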

Problem 5.6. Given the cumulative distribution function (CDF):



0
 for x < 0
3
F (x) = x for 0 ≤ x ≤ 1

1 for x > 1

(i). Find the probability density function (pdf) f (x). Verify that f (x)
is a valid pdf.

(ii). Calculate the probability P (0.2 < X < 0.5).


Solution
(i). Differentiating the cdf gives the pdf: f (x) = dF (x)/dx = 3x^2 for
0 ≤ x ≤ 1, and f (x) = 0 otherwise. It is noted that f (x) ≥ 0 for all x and

∫_(−∞)^(∞) f (x) dx = ∫_0^1 3x^2 dx = 3 [x^3/3] from 0 to 1 = 1.

Since f (x) ≥ 0 and ∫_(−∞)^(∞) f (x) dx = 1, f (x) is a valid probability density
function.

(ii). To find the probability P (0.2 < X < 0.5), we can use the pdf:
P (0.2 < X < 0.5) = ∫_(0.2)^(0.5) f (x) dx = ∫_(0.2)^(0.5) 3x^2 dx
                  = [x^3] from 0.2 to 0.5 = (0.5)^3 − (0.2)^3
                  = 0.125 − 0.008 = 0.117

Alternatively,

P (0.2 < X < 0.5) = F (0.5) − F (0.2) = 0.5^3 − 0.2^3 = 0.117

Hence, the probability P (0.2 < X < 0.5) is 0.117.
Problem 5.7. Let X be a continuous random variable with the pdf:
f (x) = (1/2) e^(−|x|)   for −∞ < x < ∞
Compute F (x).

Solution
For −∞ < x < ∞, the cumulative distribution function is

F (x) = ∫_(−∞)^x (1/2) e^(−|t|) dt

where e^(−|t|) can be split into two parts depending on the sign of t.

For x < 0:

F (x) = ∫_(−∞)^x (1/2) e^t dt = (1/2) [e^t] from −∞ to x = (1/2)(e^x − 0) = (1/2) e^x

For x ≥ 0:

F (x) = ∫_(−∞)^0 (1/2) e^t dt + ∫_0^x (1/2) e^(−t) dt
      = (1/2) [e^t] from −∞ to 0 + (1/2) [−e^(−t)] from 0 to x
      = (1/2)(1 − 0) + (1/2)(1 − e^(−x))
      = 1 − (1/2) e^(−x)


So, the cdf F (x) is:


F (x) = (1/2) e^x          for x < 0
        1 − (1/2) e^(−x)   for x ≥ 0.

5.4.3 Exercises
1. Consider a random variable measuring the following quantities. In each
case, state with reasons whether you think it is more appropriate to define
the random variable as discrete or continuous.

(a) The number of books in a library
(b) The duration of a phone call
(c) The number of steps a person takes in a day
(d) The amount of rainfall in a month
AF (e)
(f)
The
The
number of languages a person speaks
speed of a car on a highway

2. A random variable X takes values between 4 and 6 with a probability


density function
f (x) = k / (x ln(1.5))   for 4 ≤ x ≤ 6.
(a) What is the value of k?
(b) Make a sketch of the probability density function.
(c) What is P (4.5 ≤ X ≤ 5.5)?

(d) Construct and sketch the cumulative distribution function.

3. A random variable Y takes values between 1 and 3 with a probability


density function
g(y) = k / (y + 1)^2   for 1 ≤ y ≤ 3.
(a) Find the value of k and then make a sketch of the probability density
function.
(b) What is P (1.5 ≤ Y ≤ 2.5)?
(c) Construct and sketch the cumulative distribution function.

4. A random variable Z takes values between 2 and 5 with a probability


density function
h(z) = k / (z + 1)^3   for 2 ≤ z ≤ 5.


(a) Find the value of k and then make a sketch of the probability density
function.
(b) What is P (2.5 ≤ Z ≤ 4)?
(c) Construct and sketch the cumulative distribution function.

5. A random variable X takes values between 0 and 4 with a cumulative


distribution function
F (x) = x^2 / 16   for 0 ≤ x ≤ 4.
(a) Sketch the cumulative distribution function.

(b) What is P (X ≤ 2)?
(c) What is P (1 ≤ X ≤ 3)?
(d) Construct and sketch the probability density function.

6. The resistance X of an electrical component has a probability density


function

f (x) = Ax(130 − x^2)   for resistance values in the range 10 ≤ x ≤ 11.

(a) Calculate the value of the constant A.


(b) Calculate the cumulative distribution function.
(c) What is the probability that the electrical component has a resistance
between 10.25 and 10.5?

5.5 The Expectation of a Random Variable



While the probability mass function or the probability density function provides
complete information about the probabilistic properties of a random variable,
it is often useful to use some summary measures of these properties. One of
the most fundamental summary measures is the expectation or mean of a ran-
dom variable, denoted by E(X), which represents the “average” value of the
random variable. Two random variables with the same expected value can be
considered to have the same average value, even though their probability mass
functions or probability density functions may differ significantly.

The expected value (or mean) of a random variable is a measure of its


central tendency.

Expected Value of a Random Variable: For a discrete random vari-


able X,

E[X] = ∑x x · P (X = x)

For a continuous random variable X,

E[X] = ∫_(−∞)^(∞) x · f (x) dx.

5.5.1 Example: Testing Electronic Components


To find the expectation (expected value) E[X] for the random variable X rep-
resenting the number of defective components in the three tested electronic
components, we use the definition of the expectation for a discrete random
variable:

E[X] = ∑x x · P (X = x)

Given the probability mass function (pmf) for X in Table 5.2, the calcula-
tions for the E[X] are shown in the following table.
x        P (X = x)    x · P (X = x)
0        1/8          0
1        3/8          3/8
2        3/8          6/8
3        1/8          3/8
Total                 12/8

Using the output in the above table


E[X] = ∑x x · P (X = x) = 12/8 = 1.5
Therefore, the expected number of defective components E[X] is 1.5.
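In Python this is a one-line weighted sum. A minimal sketch:

    x = [0, 1, 2, 3]
    pmf = [1/8, 3/8, 3/8, 1/8]

    ex = sum(xi * pi for xi, pi in zip(x, pmf))
    print(ex)  # 1.5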
Problem 5.8.
An office has four copying machines, and the random variable X denotes how
many of them are in use at a particular time. Suppose the probability mass
function of X is given below:

x 0 1 2 3 4
Pr(X = x) k 0.02 0.05 0.4 (k + 0.2)

(a) What is the value of k.

(b) Find the expectation of X.


Solution
(a) To find the value of k, we set up the equation based on the property that
the sum of probabilities must equal 1:

k + 0.02 + 0.05 + 0.4 + (k + 0.2) = 1


Solving for k:

or, 2k + 0.67 = 1

or, 2k = 1 − 0.67

or, 2k = 0.33

∴ k = 0.33/2 = 0.165
(b) To find the expectation E(X) of the random variable X, we use the
formula for the expected value:
E(X) = ∑(x=0 to 4) x · Pr(X = x)

Given the probability mass function:

Pr(X = 0) = k = 0.165,
Pr(X = 1) = 0.02,
Pr(X = 2) = 0.05,
DR

Pr(X = 3) = 0.4,
Pr(X = 4) = k + 0.2 = 0.165 + 0.2 = 0.365.

Now we can calculate E(X):

E(X) = 0·Pr(X = 0)+1·Pr(X = 1)+2·Pr(X = 2)+3·Pr(X = 3)+4·Pr(X = 4)

Substituting the values:

E(X) = 0 · 0.165 + 1 · 0.02 + 2 · 0.05 + 3 · 0.4 + 4 · 0.365


= 0 + 0.02 + 0.1 + 1.2 + 1.46
= 0.02 + 0.1 + 1.2 + 1.46
= 2.78.

Thus, the expectation of X is 2.78.


5.5.2 Example: Metal Cylinder Production


The probability density function of the diameter of a metal cylinder (X) is

f (x) = 1.5 − 6(x − 50.0)2 for 49.5 ≤ x ≤ 50.5

The expectation E(X) is calculated as follows:


E(X) = ∫_(49.5)^(50.5) x f (x) dx = ∫_(49.5)^(50.5) x [1.5 − 6(x − 50.0)^2] dx
     = ∫_(49.5)^(50.5) 1.5x dx − ∫_(49.5)^(50.5) 6x(x − 50)^2 dx
     = 1.5 [x^2/2] from 49.5 to 50.5 − 6 ∫_(−0.5)^(0.5) (u + 50) u^2 du   [let u = x − 50]
     = 1.5 × (50.5^2 − 49.5^2)/2 − 6 ∫_(−0.5)^(0.5) (u^3 + 50u^2) du
     = 1.5 × (2550.25 − 2450.25)/2 − 6 [u^4/4 + 50u^3/3] from −0.5 to 0.5
     = 75 − 6 × 50 × ((0.5)^3/3 − (−0.5)^3/3)
     = 75 − 300 × (1/12) = 75 − 25
     = 50

Hence the expectation of the diameter of a metal cylinder is 50 mm.
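The integral can be confirmed numerically. A small sketch assuming scipy
is installed:

    from scipy.integrate import quad

    f = lambda x: 1.5 - 6 * (x - 50.0) ** 2
    ex, _ = quad(lambda x: x * f(x), 49.5, 50.5)
    print(round(ex, 4))  # 50.0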

5.5.3 Exercises

1. Suppose the Laptop repair costs are $50, $200, and $350 with respective
probability values of 0.3, 0.2, and 0.5. What is the expected Laptop repair
cost?
2. Suppose the daily sales of a small shop are $100, $150, and $250 with
respective probability values of 0.4, 0.3, and 0.3. What is the expected
daily sales?
3. A game offers prizes of $10, $50, and $100 with respective probability
values of 0.6, 0.3, and 0.1. What is the expected prize amount?
4. Consider the waiting times (in minutes) at a bus stop: 5, 10, and 15 with
respective probability values of 0.5, 0.3, and 0.2. What is the expected
waiting time?
5. The lifetime (in years) of a certain type of light bulb is either 1, 3, or
5 with respective probability values of 0.2, 0.5, and 0.3. What is the
expected lifetime of the light bulb?


6. The number of daily website visits for a company is either 200, 500, or
800 with respective probability values of 0.25, 0.5, and 0.25. What is the
expected number of daily visits?
7. Let the temperature X in degrees Fahrenheit of a particular chemical
reaction with density
f(x) = (x − 190)/3600,  220 ≤ x ≤ 280.
Find the expectation of the temperature.

5.6 The Variance of a Random Variable
Another key summary measure of the distribution of a random variable is the
variance, which quantifies the spread or variability in the values that the random
variable can take. While the mean or expectation captures the central or average
value of the random variable, the variance measures the dispersion or deviation
of the random variable around its mean value. Specifically, the variance of a
random variable is defined as

Var(X) = E((X − E(X))2 ).


This means that the variance is the expected value of the squared deviations
of the random variable values from the expected value E(X). The variance is
always positive, and larger values indicate a greater spread in the distribution
of the random variable around the mean. An alternative and often simpler
expression for calculating the variance is

Var(X) = E((X − E(X))2 )



= E(X 2 − 2XE(X) + (E(X))2 )


= E(X 2 ) − 2E(X)E(X) + (E(X))2
= E(X 2 ) − (E(X))2 .

Variance: The variance of a random variable X is defined as

Var(X) = E((X − E(X))2 )

or equivalently
Var(X) = E(X 2 ) − (E(X))2 .

The variance is a positive measure that indicates the spread of the distri-
bution of the random variable around its mean value. Larger values of the
variance suggest that the distribution is more spread out.


It is typical to use the symbol µ to represent the mean or expectation of a


random variable, and the symbol σ 2 to represent the variance. The standard
deviation, denoted by σ, is the square root of the variance and is often used
instead of the variance to describe the spread of the distribution.

Standard Deviation: The standard deviation of a random variable X


is defined as the positive square root of the variance. The variance of a
random variable is commonly denoted by σ 2 , so σ represents the standard
deviation.

The concept of variance can be illustrated graphically. Figure 5.8 shows two
probability density functions with different mean values but identical variances.

The variances are the same because the shape or spread of the density func-
tions around their mean values is the same. In contrast, Figure 5.9 shows two
probability density functions with the same mean values but different variances.
The density function that is flatter and more spread out has the larger variance.
AF 0.4 µ = 0, σ 2 = 1
µ = 14, σ 2 = 1

0.3
f (x)

0.2

0.1
DR

−10 0 10 20 30 40
x

Figure 5.8: Two normal distributions with different means but identical vari-
ances.


Figure 5.9: Two normal distributions with identical means (µ = 0) but different variances (σ² = 1 and σ² = 4).

It is important to note that the standard deviation has the same units as the
random variable X, while the variance has units that are squared. For instance,
if the random variable X is measured in seconds, then the standard deviation
will also be in seconds, but the variance will be measured in seconds squared (seconds²).

Example: Testing Electronic Components


We already found that E[X] = 1.5. To find the variance of the random variable X, we need to compute E[X²]:

E[X²] = ∑_x x² · P(X = x)
Given the probability mass function (pmf) for X:

Table 5.5: Calculating E[X] and E[X²]

x       P(X = x)    x · P(X = x)    x² · P(X = x)
0       1/8         0               0
1       3/8         3/8             3/8
2       3/8         6/8             12/8
3       1/8         3/8             9/8
Total               12/8 = 1.5      24/8 = 3


Now, we can find the variance:

Var(X) = E[X 2 ] − (E[X])2

Var(X) = 3 − (1.5)2

Var(X) = 3 − 2.25

Var(X) = 0.75
Therefore, the variance of X is 0.75.
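The variance computation for this pmf can be verified with a few lines of Python; a minimal NumPy sketch:

    import numpy as np

    x = np.array([0, 1, 2, 3])
    p = np.array([1/8, 3/8, 3/8, 1/8])

    mean = np.sum(x * p)               # E[X] = 1.5
    second_moment = np.sum(x**2 * p)   # E[X^2] = 3.0
    variance = second_moment - mean**2
    print(variance)  # 0.75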

5.6.1 Example: Metal Cylinder Production
The probability density function of the diameter of a metal cylinder (X) is
f (x) = 1.5 − 6(x − 50.0)2 for 49.5 ≤ x ≤ 50.5
and E(X) = 50. To find the variance V(X), we need E(X²):

E(X²) = ∫_{49.5}^{50.5} x² f(x) dx = ∫_{49.5}^{50.5} x² (1.5 − 6(x − 50.0)²) dx

      = ∫_{49.5}^{50.5} 1.5x² dx − ∫_{49.5}^{50.5} 6x²(x − 50)² dx

      = 1.5 [x³/3]_{49.5}^{50.5} − 6 ∫_{−0.5}^{0.5} (u + 50)² u² du        [let u = x − 50]

      = 1.5 ((50.5³ − 49.5³)/3) − 6 ∫_{−0.5}^{0.5} (u⁴ + 100u³ + 2500u²) du

      = 3750.125 − 1250.075

      = 2500.05

Therefore,

V(X) = E(X²) − (E(X))² = 2500.05 − 2500 = 0.05

Thus, the variance V(X) = 0.05 and the standard deviation sd(X) = √0.05 = 0.2236.
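Again, SciPy can confirm these integrals numerically; a sketch reusing the pdf from the expectation example:

    from scipy.integrate import quad

    f = lambda x: 1.5 - 6 * (x - 50.0) ** 2
    ex, _ = quad(lambda x: x * f(x), 49.5, 50.5)      # E[X]
    ex2, _ = quad(lambda x: x**2 * f(x), 49.5, 50.5)  # E[X^2]
    var = ex2 - ex**2
    print(round(var, 4), round(var**0.5, 4))  # 0.05 0.2236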
Problem 5.9. Consider a random variable X representing the number of heads
in three tosses of a fair coin. The possible values of X are 0, 1, 2, and 3. The
pmf of X is given by:

P(X = x) = C(3, x) · (1/2)³,  x = 0, 1, 2, 3,

where C(3, x) is the binomial coefficient.
Find the expected value and variance of X.


5.6.2 Chebyshev’s Inequality


Chebyshev’s Inequality is a powerful tool for understanding the spread of data
in scenarios where the distribution is unknown. It provides a way to make
probabilistic statements about deviations from the mean, which is particularly
useful in data science applications such as quality control and salary analysis.

Chebyshev's Inequality: Let X be a random variable with mean µ and
variance σ². Chebyshev's Inequality states that for any k > 0,

P(|X − µ| ≥ kσ) ≤ 1/k²

or equivalently,

P(µ − kσ ≤ X ≤ µ + kσ) ≥ 1 − 1/k².

This inequality indicates that the probability of a random variable deviating
from its mean by more than k standard deviations is at most 1/k².

5.6.3 Example: Blood Pressure Measurement


Consider a study on blood pressure measurements where the systolic blood
pressure X of patients is known to have a mean of 120 mmHg and a standard
deviation of 15 mmHg. We want to determine the range within which at least
90% of the measurements fall, according to Chebyshev’s Inequality:
P(120 − k × 15 ≤ X ≤ 120 + k × 15) ≥ 1 − 1/k²,

where k is the number of standard deviations from the mean. We want:

1 − 1/k² ≥ 0.90

Solving for k:

1/k² ≤ 0.10

or, k² ≥ 1/0.10 = 10

∴ k ≥ √10 ≈ 3.16

So, at least 90% of systolic blood pressure measurements should fall within:

120 ± 3.16 × 15 = 120 ± 47.4

In other words, the blood pressure measurements are expected to be within
the range of 72.6 mmHg to 167.4 mmHg.
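The bound is straightforward to compute in Python; a small sketch with the standard library (variable names are illustrative):

    import math

    mu, sigma, coverage = 120, 15, 0.90
    k = math.sqrt(1 / (1 - coverage))   # from 1 - 1/k^2 >= coverage
    low, high = mu - k * sigma, mu + k * sigma
    print(round(k, 2), round(low, 1), round(high, 1))  # 3.16 72.6 167.4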


5.6.4 Example: Employee Salaries


Consider a company where the average salary is $60,000 with a standard de-
viation of $5,000. If the company wants to guarantee that at least 80% of
employees’ salaries are within a certain range of the mean salary, we can use
Chebyshev’s Inequality to estimate this range.
P(60000 − k × 5000 ≤ X ≤ 60000 + k × 5000) ≥ 1 − 1/k².

To ensure at least 80% of salaries are within this range, we want:

1 − 1/k² ≥ 0.80

or, 1/k² ≤ 0.20  =⇒  k² ≥ 1/0.20 = 5

∴ k ≥ √5 ≈ 2.24

Thus, at least 80% of salaries should fall within:

60,000 ± 2.24 × 5,000  or,  60,000 ± 11,200

5.6.5 Quantiles of Random Variables


Quantiles are useful summary measures that provide insight into the spread or
variability of a random variable’s distribution. The p-th quantile of a random
variable X, which has a cumulative distribution function F (x), is the value x
that satisfies

F (x) = p

meaning that there is a probability p that the random variable is less than
the p-th quantile. The probability p is often expressed as a percentage, and
the corresponding quantiles are known as percentiles. For instance, the 70th
percentile is the value x for which F (x) = 0.70. It is important to note that
the 50th percentile of a distribution is also known as the median.

Quantiles: The p-th quantile of a random variable X with a cumulative


distribution function F (x) is the value x such that

F (x) = p
This is also known as the p × 100-th percentile of the random variable.
The probability p signifies the chance that the random variable takes on a
value less than the p-th quantile.

177
CHAPTER 5. RANDOM VARIABLE AND ITS PROPERTIES

To understand the spread of a distribution, one can compute its quartiles.


The upper quartile is the 75th percentile, and the lower quartile is the 25th
percentile. Together with the median, these quartiles divide the range of the
random variable into four equal parts, each with a probability of 0.25.

The interquartile range, which is the distance between the upper and lower
quartiles as depicted in Figure 2.55, serves as an indicator of distribution spread
similar to variance. A larger interquartile range suggests that the distribution
of the random variable is more spread out.

Quartiles and Interquartile Range: The upper quartile of a distribu-


tion is the 75th percentile, and the lower quartile is the 25th percentile.

The interquartile range, defined as the distance between these two quar-
tiles, provides a measure of distribution spread analogous to variance.

5.6.6 Example: Metal Cylinder Production


The cumulative distribution function (cdf) for the diameters of the metal cylinders is given by

F(x) = 1.5x − 2(x − 50.0)³ − 74.5

for 49.5 ≤ x ≤ 50.5.

The upper quartile (Q₃) is found at the value of x where

F(x) = 0.75.

That is,

1.5x − 2(x − 50.0)³ − 74.5 = 0.75

This equation can be solved numerically to find the precise value of x, which corresponds to Q₃ = 50.17 mm.


Figure 5.10: Interquartile range for metal cylinder diameters (the area under f(x) between Q₁ and Q₃ is 0.5).


The lower quartile (Q1 ) is the value where

F (x) = 0.25
resulting in Q1 = 49.83 mm. Consequently, the interquartile range is calcu-
lated as

50.17 − 49.83 = 0.34 mm


indicating that half of the cylinders will have diameters between 49.83 mm
and 50.17 mm, as illustrated in Figure 5.10.
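The quartiles can also be found numerically in Python by root finding on F(x) − p; a sketch using SciPy's brentq (the bracketing interval is the support of X):

    from scipy.optimize import brentq

    # cdf of the cylinder diameters on [49.5, 50.5]
    F = lambda x: 1.5 * x - 2 * (x - 50.0) ** 3 - 74.5

    q1 = brentq(lambda x: F(x) - 0.25, 49.5, 50.5)
    q3 = brentq(lambda x: F(x) - 0.75, 49.5, 50.5)
    print(round(q1, 2), round(q3, 2), round(q3 - q1, 2))  # 49.83 50.17 0.34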

5.6.7 Exercises
1. Consider the Laptop repair costs discussed in question 1 of Exercises 5.5.3, and calculate the variance and standard deviation of the repair cost.
2. In the machine breakdown problem, suppose that electrical failures generally cost $400 to repair, mechanical failures have a repair cost of $550, and operator misuse failures have a repair cost of only $100. These repair costs generate a random variable cost, as illustrated in the following Table.

x_i            100     400       550
Pr(X = x_i)    0.25    2(k + 1)  0.4

(a) What is the value of k?


(b) Find the average of the repair costs.


(c) Find the variance of the repair costs.

3. A random variable X takes values between 4 and 6 with a probability


density function

f(x) = 1/(x ln(1.5))  for 4 ≤ x ≤ 6.

(a) What is the variance of this random variable?


(b) What is the standard deviation of this random variable?

(c) Find the upper and lower quartiles of this random variable.
(d) What is the interquartile range?

4. A random variable X represents the time (in hours) until failure of a


certain machine part, which is uniformly distributed between 100 and
200 hours.
f(x) = 1/100  for 100 ≤ x ≤ 200.

(a) What is the variance of this random variable?


(b) What is the standard deviation of this random variable?
(c) Find the upper and lower quartiles of this random variable.
(d) What is the interquartile range?
5. Consider a random variable Y representing the strength of a material,
which follows a normal distribution with mean 500 MPa and standard
deviation 50 MPa.

(a) What is the probability that the strength is between 450 MPa and
550 MPa?
(b) What is the 95th percentile of the strength?
(c) Calculate the variance of the strength.
(d) What proportion of material samples have a strength greater than
600 MPa?
6. A random variable Z represents the systolic blood pressure (in mmHg) of
a population, which is uniformly distributed between 90 and 140 mmHg.
f(z) = 1/50  for 90 ≤ z ≤ 140.
(a) What is the variance of this random variable?
(b) What is the standard deviation of this random variable?
(c) Find the upper and lower quartiles of this random variable.


(d) What is the interquartile range?


7. A researcher is studying the cholesterol levels of a population of adults.
The cholesterol levels are known to have a mean of 200 mg/dL and a
standard deviation of 25 mg/dL.

(a) Using Chebyshev’s Inequality, determine the minimum percentage of


adults whose cholesterol levels are within 50 mg/dL of the mean.
(b) To guarantee that at least 85% of the population has cholesterol
levels within a certain number of standard deviations from the mean,
how many standard deviations from the mean are required?

(c) If a randomly selected adult has a cholesterol level of 250 mg/dL,
what is the maximum probability that this level deviates from the
mean by at least 50 mg/dL according to Chebyshev’s Inequality?

8. The systolic blood pressure of a certain population is normally distributed


with a mean of 120 mmHg and a standard deviation of 15 mmHg.
AF (a) What is the probability that a randomly selected person has a sys-
tolic blood pressure less than 110 mmHg?
(b) What is the probability that a randomly selected person has a sys-
tolic blood pressure between 110 mmHg and 130 mmHg?
(c) Find the 95th percentile of the systolic blood pressure distribution.
9. Suppose that the battery failure time, measured in hours, has a probability density function given by

f(x) = 2/(x + 1)³  for x ≥ 0.
(a) Find the expected battery failure time.
(b) What is the probability that the battery fails within the first 4 hours.
(c) Find the cumulative distribution function of the battery failure times.
(d) Find the median of the battery failure times.

5.7 Essential Generating Functions


In probability theory and statistics, generating functions are a powerful tool
used to analyze and manipulate probability distributions. There are three main
types of generating functions:
• Moment Generating Function (MGF): Used for both discrete and con-
tinuous random variables.


• Probability Generating Function (PGF): Used for discrete random vari-


ables.

• Characteristic Function (CF): It complements other generating func-


tions, such as the MGF and PGF, and is closely related to the Fourier
transform.

5.7.1 Moment Generating Function


The Moment Generating Function (MGF) is a powerful tool in probability the-
ory and statistics used to summarize the properties of a probability distribution.

It can help us find moments (for example, mean, variance), combine variables,
and understand the distribution better. The Moment Generating Function of
a random variable X is defined as:

M (t) = E[etX ].

5.7.2 Key Properties of MGF


• Higher Order Moments:
■ The first moment (mean) is found by taking the first derivative
of the MGF and evaluating it at t = 0:
µ = E[X] = M ′ (0)

■ The second moment is found by taking the second derivative of


the MGF and evaluating it at t = 0:

E[X 2 ] = M ′′ (0)
■ The rth moment is found by taking the rth derivative of the
MGF and evaluating it at t = 0:

E[X^r] = d^r M(t)/dt^r |_{t=0}

Evaluate the Derivative at t = 0 After differentiating the MGF


the required number of times, substitute t = 0 to obtain the
moment.
■ The variance can be found by taking the second derivative of
the MGF, evaluating at t = 0, and then using it with the mean:

Var(X) = M ′′ (0) − [M ′ (0)]2

• Combining Variables:


■ If X and Y are independent random variables, the MGF of their


sum X + Y is the product of their individual MGFs:

MX+Y (t) = MX (t) · MY (t)

Problem 5.10 (Discrete Random Variable). Consider a discrete random variable X that takes the value 1 with probability p and the value 0 with probability 1 − p. Find the moment generating function of X, and hence find the mean and variance.

Solution

The MGF of a random variable X is given by:

M(t) = E[e^{tX}] = ∑_x e^{tx} P(X = x)

We have,
P (X = 1) = p and P (X = 0) = 1 − p
Substituting into the MGF formula:

M (t) = et·1 P (X = 1) + et·0 P (X = 0)


= et · p + e0 · (1 − p)
= pet + (1 − p)

Mean of X: To find the mean E[X], we differentiate the MGF with respect
to t and evaluate at t = 0:
M′(t) = d/dt [pe^t + (1 − p)] = pe^t
Evaluating at t = 0:

M ′ (0) = pe0 = p
Thus, the mean of X is:

E[X] = p
Variance of X:
To find the variance, we first calculate the second moment E[X 2 ], which is the
second derivative of the MGF evaluated at t = 0:
M″(t) = d/dt [pe^t] = pe^t
Evaluating at t = 0

M ′′ (0) = pe0 = p


Thus, the second moment is

E[X 2 ] = p
The variance is given by

Var(X) = E[X 2 ] − (E[X])2


= p − p2 = p(1 − p)

Thus, the variance of X is

Var(X) = p(1 − p)
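Symbolic differentiation makes this routine to check in Python; a sketch with SymPy:

    import sympy as sp

    t, p = sp.symbols('t p')
    M = p * sp.exp(t) + (1 - p)            # MGF of this two-point variable

    mean = sp.diff(M, t).subs(t, 0)        # M'(0) = p
    second = sp.diff(M, t, 2).subs(t, 0)   # M''(0) = p
    variance = sp.simplify(second - mean**2)
    print(mean, variance)                  # p and p - p**2, i.e. p(1 - p)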

Problem 5.11 (Continuous Random Variable). The pdf of X is given by:

f(x) = 1 if 0 ≤ x ≤ 1, and 0 otherwise.

This is the density of the uniform distribution on [0, 1]. Find the moment generating function of X, and hence find the mean and variance.

Solution
For this density the moment generating function is

M(t) = E[e^{tX}] = ∫₀¹ e^{tx} · 1 dx = [e^{tx}/t]₀¹ = (e^t − 1)/t,  for t ≠ 0.

For t = 0, M(0) = 1 (since M(t) is always 1 at t = 0 for any distribution).


The first derivative of M(t) with respect to t is:

M′(t) = d/dt [(e^t − 1)/t].

Applying the quotient rule:

M′(t) = (t · e^t − (e^t − 1))/t² = (te^t − e^t + 1)/t² = (e^t(t − 1) + 1)/t².
To evaluate the mean, we need to find the limit of M′(t) as t → 0. We have the indeterminate form 0/0, so we apply L'Hôpital's rule. To do this, we need to differentiate the numerator and denominator separately. So, we find

d/dt [(t − 1)e^t + 1] = e^t(t − 1) + e^t = te^t,

and

d/dt (t²) = 2t.


So, applying L'Hôpital's rule, we find

M′(0) = lim_{t→0} ((t − 1)e^t + 1)/t² = lim_{t→0} te^t/(2t) = lim_{t→0} e^t/2 = 1/2.

So, E[X] = 1/2.

The second derivative of M(t) with respect to t is:

M″(t) = d/dt [(e^t(t − 1) + 1)/t²].

After some calculation (similar to the first derivative), we find:

M″(t) = (e^t(t² − 2t + 2) − 2)/t³.

Evaluating the limit at t = 0 (again by expanding or applying L'Hôpital's rule):

M″(0) = lim_{t→0} (e^t(t² − 2t + 2) − 2)/t³ = 1/3.

So, E[X²] = 1/3. Thus, the first two moments for a uniform random variable X on [0, 1] are:

E[X] = 1/2 and E[X²] = 1/3,

and hence Var(X) = E[X²] − (E[X])² = 1/3 − 1/4 = 1/12.
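Because M(t) has a removable singularity at t = 0, a symbolic check should use limits rather than direct substitution; a SymPy sketch:

    import sympy as sp

    t = sp.symbols('t')
    M = (sp.exp(t) - 1) / t                    # MGF of the Uniform(0, 1) density

    mean = sp.limit(sp.diff(M, t), t, 0)       # E[X] = 1/2
    second = sp.limit(sp.diff(M, t, 2), t, 0)  # E[X^2] = 1/3
    print(mean, second, second - mean**2)      # 1/2 1/3 1/12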

5.7.3 Probability Generating Function (PGF)


A Probability Generating Function (PGF) is a related concept to the Moment
Generating Function (MGF), specifically designed for discrete random variables.
The PGF provides a way to encode the probability distribution of a discrete
random variable into a generating function. It is particularly useful for random

variables that take non-negative integer values, such as the number of successes
in a binomial distribution, or the number of events in a Poisson distribution.

Let X be a discrete random variable with probability mass function p(x) =


P (X = x) for x = 0, 1, 2, . . .. The PGF of X, denoted by G(s), is defined as:

G(s) = E[s^X] = ∑_{x=0}^{∞} s^x p(x),
where s is a real or complex number for which the series converges.

Properties of the PGF


Let X be a discrete random variable taking non-negative integer values, and let
G(s) be its probability generating function, defined as:

G(s) = E[s^X] = ∑_{x=0}^{∞} s^x P(x)


1. Normalization Property
The PGF at s = 1 is always equal to 1:

G(1) = ∑_{x=0}^{∞} 1^x P(x) = ∑_{x=0}^{∞} P(x) = 1

This holds for any probability distribution.

2. Probability Recovery
The probability that the random variable X takes the value x can be recovered

by differentiating the PGF:

P(X = x) = (1/x!) · d^x G(s)/ds^x |_{s=0}

This formula allows for the extraction of individual probabilities from the PGF.
AF
3. Expected Value (Mean)
The expected value E[X] of the random variable X can be obtained by differ-
entiating the PGF and evaluating at s = 1:

E[X] = dG(s)/ds |_{s=1}

This gives the mean of the distribution directly from the PGF.

4. Variance
The variance Var(X) can be derived using the PGF. First, compute the first and second derivatives of the PGF. The second derivative gives the second factorial moment:

E[X(X − 1)] = d²G(s)/ds² |_{s=1}

Then, the variance is:

Var(X) = G″(1) + G′(1) − (G′(1))²

5. Sum of Independent Random Variables


If X1 and X2 are independent random variables, the PGF of their sum is the
product of their individual PGFs:

GX1 +X2 (s) = GX1 (s) · GX2 (s)

This property is useful for dealing with sums of independent random variables.


6. PGF of a Constant
If X is a constant random variable, i.e., P (X = c) = 1 for some constant c, the
PGF is:
G(s) = s^c
This reflects that the random variable always takes the value c, so the PGF has
only one non-zero term at x = c.

7. Derivative Relations
The r-th factorial moment can be derived from the PGF by differentiating it r times and evaluating at s = 1:

E[X(X − 1) · · · (X − r + 1)] = d^r G(s)/ds^r |_{s=1}

This formula is useful for calculating higher-order moments of the distribution.
Problem 5.12. Consider a random variable X with parameter λ. The probability mass function of X is:

p(x) = λ^x e^{−λ}/x!,  x = 0, 1, 2, . . .

Find the PGF, and hence the mean and variance.

Solution
The PGF G(s) is defined as:

G(s) = E[s^X] = ∑_{x=0}^{∞} p(x) s^x.

Substituting the pmf p(x):

G(s) = ∑_{x=0}^{∞} (λ^x e^{−λ}/x!) s^x.

We can factor out e^{−λ} as it is constant with respect to the sum:

G(s) = e^{−λ} ∑_{x=0}^{∞} (λs)^x/x!.

Recognize that the sum is the Taylor series expansion of e^{λs}:

∑_{x=0}^{∞} (λs)^x/x! = e^{λs}.

Therefore, the PGF is:

G(s) = e^{−λ} · e^{λs} = e^{λ(s−1)}.


Finding the Mean


The mean E[X] can be obtained by differentiating the PGF and evaluating at
s = 1:

E[X] = G′ (1).
First, compute the derivative of G(s):

G(s) = eλ(s−1) .

G′ (s) = λeλ(s−1) .

Evaluating at s = 1:

G′ (1) = λeλ(1−1) = λe0 = λ.


Thus, the mean E[X] is λ.
Finding the Variance
The variance can be found using:

Var(X) = G′′ (1) + G′ (1) − (G′ (1))2 .


Compute the second derivative of G(s):

G′′ (s) = λ2 eλ(s−1) .


Evaluating at s = 1:

G′′ (1) = λ2 eλ(1−1) = λ2 e0 = λ2 .



Now, calculate the variance:

Var(X) = G′′ (1) + G′ (1) − (G′ (1))2 .


Substitute the values:

Var(X) = λ2 + λ − λ2 = λ.
Thus, the variance Var(X) is λ.
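The same derivative computations can be checked symbolically; a short SymPy sketch (lam stands for λ):

    import sympy as sp

    s, lam = sp.symbols('s lam', positive=True)
    G = sp.exp(lam * (s - 1))            # PGF derived above

    G1 = sp.diff(G, s).subs(s, 1)        # G'(1), the mean
    G2 = sp.diff(G, s, 2).subs(s, 1)     # G''(1) = E[X(X-1)]
    variance = sp.simplify(G2 + G1 - G1**2)
    print(G1, variance)                  # lam lam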


Applications
PGFs are used in various fields including:

• Queueing Theory: To analyze the number of customers in a queue.

• Reliability Engineering: To model system lifetimes.

• Genetics: To study inheritance patterns.

The PGF is a compact and powerful tool for handling problems involving
sums of random variables and their distributions.

5.7.4 Characteristic Function (CF)
The characteristic function (CF) of a random variable X is a fundamental tool
in probability theory, and it is closely related to the moment generating func-
tion. The characteristic function provides an alternative way to describe the
distribution of X, and it is particularly useful in the study of sums of indepen-
AF
dent random variables. It is important in data science, particularly in areas
related to probability theory, statistical inference, and stochastic processes.

The characteristic function φ(t) of a random variable X is defined as:

φ(t) = E[e^{itX}],

where i is the imaginary unit (i² = −1) and t is a real number.

5.7.5 Key Properties of Characteristic Functions



Let X be a random variable with characteristic function ϕ(t).

1. Existence: The characteristic function always exists and is well-defined


for all real t.
2. Normalization:
ϕ(0) = E[ei·0·X ] = E[1] = 1.

3. Uniqueness: The characteristic function uniquely determines the dis-


tribution of X. If two random variables have the same characteristic
function, they have the same distribution.
4. Addition of Independent Random Variables: If X and Y are inde-
pendent, then:
ϕ_{X+Y}(t) = ϕ_X(t) · ϕ_Y(t).


5. Moment Generating Function Relationship: The moment generat-


ing function M (t) is related to the characteristic function by:

M (t) = E[etX ] = ϕ(−it).

6. Inverse Relationship: The probability density function f(x) can be recovered from the characteristic function using the inverse Fourier transform:

f(x) = (1/2π) ∫_{−∞}^{∞} ϕ(t) e^{−itx} dt.

7. Derivatives and Moments: The n-th moment of X, if it exists, is given
by:
E[X^n] = i^{−n} · d^n ϕ(t)/dt^n |_{t=0}.

The Characteristic Function (CF) and the Moment Generating Function


(MGF) are both tools used in probability theory to describe the distribution of random variables. While they share some similarities, they also have key differences. A detailed comparison is presented in Table 5.6.

Table 5.6: Comparison between MGF and CF

Feature        MGF M(t) = E[e^{tX}]                 CF φ(t) = E[e^{itX}]
Existence      Not guaranteed (finite only if       Always exists (since |e^{itX}| = 1)
               E[e^{tX}] exists for all t)
Range of t     Real t                               Real t, but involves imaginary unit i
Moments        Differentiation gives moments        Differentiation gives moments with
               directly                             i adjustment
Uniqueness     Uniquely determines distribution     Uniquely determines distribution
               if it exists
Fourier        No direct connection                 Essentially the Fourier transform
Transform                                           of the pdf
Application    Used for finding moments and         Used in distributional analysis,
               cumulants, proving Central Limit     sums of random variables, proving
               Theorem                              Central Limit Theorem

Problem 5.13 (Discrete Random Variable). Consider a discrete random variable X with pmf

• P(X = 0) = p

• P(X = 1) = 1 − p

where 0 ≤ p ≤ 1. Find the characteristic function of X, and hence find the mean and variance.

Solution

1. Characteristic Function
The characteristic function φ(t) of a discrete random variable X is defined as:

φ(t) = E[e^{itX}]

where E denotes the expectation and i is the imaginary unit.

For the given pmf:

φ(t) = E[e^{itX}] = ∑_x e^{itx} P(X = x)

Substituting the values for X:

φ(t) = e^{it·0} · P(X = 0) + e^{it·1} · P(X = 1)
     = e^0 · p + e^{it} · (1 − p)
     = p + (1 − p)e^{it}

2. Mean of X
The mean E[X] can be derived from the characteristic function as follows:

E[X] = −i · dφ(t)/dt |_{t=0}

First, compute the derivative of φ(t):

dφ(t)/dt = d/dt [p + (1 − p)e^{it}] = (1 − p) · ie^{it}

Evaluate at t = 0:

dφ(t)/dt |_{t=0} = (1 − p) · ie^{i·0} = (1 − p) · i

E[X] = −i · (1 − p) · i = 1 − p


3. Variance of X
To find the variance, we first need the second moment E[X²], which can be derived from the characteristic function as follows:

E[X²] = − d²φ(t)/dt² |_{t=0}

Compute the second derivative of φ(t):

d²φ(t)/dt² = d/dt [(1 − p) · ie^{it}] = (1 − p) · i²e^{it} = −(1 − p)e^{it}

Evaluate at t = 0:

d²φ(t)/dt² |_{t=0} = −(1 − p)e^{i·0} = −(1 − p)

E[X²] = −(−(1 − p)) = 1 − p

The variance Var(X) is given by:

Var(X) = E[X²] − (E[X])²
       = (1 − p) − (1 − p)²
       = (1 − p)(1 − (1 − p))
       = p(1 − p)

Hence,
• Characteristic function: φ(t) = p + (1 − p)eit

• Mean: E[X] = 1 − p

• Variance: Var(X) = p(1 − p)


Problem 5.14 (Continuous Random Variable). Consider the Problem 5.11, where the pdf of X is given by:

f(x) = 1 if 0 ≤ x ≤ 1, and 0 otherwise.

Find the characteristic function of X, and hence find the mean and variance.


Solution
Let’s find the characteristic function of a uniform random variable X on the
interval [0, 1].
The characteristic function is:

φ(t) = E[e^{itX}] = ∫_{−∞}^{∞} e^{itx} f(x) dx,

where f(x) is the pdf of X. For the uniform distribution on [0, 1], the pdf is f(x) = 1 for 0 ≤ x ≤ 1 and 0 otherwise.

Thus, the characteristic function becomes:

φ(t) = ∫₀¹ e^{itx} dx.

The integral is straightforward to evaluate:

φ(t) = [e^{itx}/(it)]₀¹ = (e^{it} − 1)/(it).
This can also be written as:

φ(t) = sin(t)/t + i(1 − cos(t))/t,  for t ≠ 0.

For t = 0, φ(0) = 1, which is consistent since the characteristic function always equals 1 at t = 0.

Mean
The mean E[X] is:

E[X] = −i · dφ(t)/dt |_{t=0}

Rather than differentiating the quotient directly, it is easier to expand φ(t) as a power series. Since e^{it} = ∑_{n=0}^{∞} (it)^n/n!, we have

φ(t) = (e^{it} − 1)/(it) = 1 + (it)/2! + (it)²/3! + (it)³/4! + · · ·

Differentiating term by term and evaluating at t = 0:

dφ(t)/dt |_{t=0} = i/2

so that

E[X] = −i · (i/2) = 1/2

Variance
To find the variance, we first need the second moment E[X²]:

E[X²] = − d²φ(t)/dt² |_{t=0}

From the series expansion,

d²φ(t)/dt² |_{t=0} = 2i²/3! = −1/3

E[X²] = −(−1/3) = 1/3

The variance is:

Var(X) = E[X²] − (E[X])² = 1/3 − (1/2)² = 1/12

Hence,

• Characteristic function: φ(t) = (e^{it} − 1)/(it)

• Mean: E[X] = 1/2

• Variance: Var(X) = 1/12

Note that these agree with the mean and variance obtained from the moment generating function in Problem 5.11.


5.7.6 Exercises
1. Consider a discrete random variable Y with the following probability mass function (pmf):

P(Y = k) = 1/2 for k = 1, and P(Y = k) = 1/2 for k = 2.

(a) Find the moment generating function MY (t) of Y .


(b) Using the moment generating function, find the mean and variance
of Y .

2. Let Z be a continuous random variable with the probability density function (pdf):

f_Z(z) = 2z/θ² if 0 ≤ z ≤ θ, and 0 otherwise,

where θ > 0 is a parameter.
(a) Find the moment generating function M_Z(t) of Z.
(b) Find the characteristic function φZ (t) of Z.
(c) Using the moment generating function, determine the mean and vari-
ance of Z.

3. Let X be a binomial random variable with parameters n and p, where n


is the number of trials and p is the probability of success in each trial.
The probability mass function of X is:
 
P(X = k) = C(n, k) p^k (1 − p)^{n−k},  k = 0, 1, . . . , n,

where C(n, k) is the binomial coefficient.

(a) Find the moment generating function M (t) of X.


(b) Using the moment generating function, find the mean and variance
of X.

4. Consider a discrete random variable X that takes non-negative integer


values with the following probability mass function:

P(X = k) = 3^{−k} / ∑_{i=0}^{∞} 3^{−i}  for k = 0, 1, 2, . . .

(a) Find the probability generating function (PGF) G(s) of the random
variable X.
(b) Use the PGF to determine E[X] and Var(X).
(c) Verify your results using the properties of the PGF.


5. Consider a continuous random variable W uniformly distributed over the


interval [0, a]. The probability density function is:
f_W(w) = 1/a if 0 ≤ w ≤ a, and 0 otherwise.

(a) Find the characteristic function φW (t) of W .


(b) Find the moment generating function MW (t) of W .
(c) Use the moment generating function to find the mean and variance
of W .

6. Let V be an exponential random variable with rate parameter λ. The probability density function is:

f_V(v) = λe^{−λv} if v ≥ 0, and 0 otherwise.
(a) Find the moment generating function M_V(t) of V.
(b) Find the characteristic function φV (t) of V .
(c) Using the moment generating function, determine the mean and vari-
ance of V .

5.8 Jointly Distributed Random Variables


Jointly distributed random variables are crucial in data science as they allow for
the modeling and analysis of relationships between multiple variables simulta-
neously. Understanding these relationships is essential for predicting outcomes,

identifying correlations, and constructing probabilistic models that capture real-


world complexities. Joint distributions enable informed decisions based on the
combined behavior of multiple variables, which is vital for developing accurate
and robust predictive models.

In probability theory, two or more random variables are jointly distributed if


there is a joint probability distribution describing their behavior. For two ran-
dom variables X and Y , the joint probability distribution provides the probabil-
ity that X takes a specific value x and Y takes a specific value y simultaneously.

5.8.1 Joint Probability Mass Function (pmf )


For discrete random variables X and Y , the joint probability mass function
pX,Y (x, y) is defined as:

pX,Y (x, y) = P (X = x, Y = y).


The joint probability mass function must satisfy the condition:

∑_x ∑_y p_{X,Y}(x, y) = 1.

The joint cumulative distribution function is defined as:

F(x, y) = P(X ≤ x, Y ≤ y).

For discrete random variables:

F(x, y) = ∑_{X≤x} ∑_{Y≤y} p_{X,Y}(x, y).

If p_{ij} = P(X = i, Y = j) for i = 1, 2, . . . , m and j = 1, 2, . . . , n, then

F(x, y) = P(X ≤ x, Y ≤ y) = ∑_{i=1}^{x} ∑_{j=1}^{y} p_{ij}
AF
5.8.2 Example: Computer Maintenance
A company managing maintenance services for computer servers is interested
in optimizing the scheduling of its technicians. Specifically, the company needs
to understand how long a technician spends on-site, which primarily depends
on the number of servers requiring maintenance.

Let the random variable X denote the maintenance time in hours at a lo-
cation, taking values 1, 2, 3, and 4. Let the random variable Y represent the
number of servers at the location, taking values 1, 2, and 3. These two random
variables are considered jointly distributed.
DR

The joint probability mass function pij for these variables is given in the
table below:

                     Number of Servers (Y)
                      1      2      3
Maintenance    1     0.12   0.08   0.01
Time (X)       2     0.08   0.15   0.01
               3     0.07   0.21   0.02
               4     0.05   0.13   0.07

For example, the table shows that there is a 0.12 probability that X = 1
and Y = 1, meaning a randomly selected location has one server that takes one
hour to maintain. Similarly, the probability is 0.07 that a location with three servers requires four hours of maintenance. This is a valid probability mass function, as

∑_x ∑_y p_{X,Y}(x, y) = ∑_i ∑_j p_{ij} = 0.12 + 0.08 + · · · + 0.07 = 1.00

The joint cumulative distribution function is defined as:

F(x, y) = P(X ≤ x, Y ≤ y) = ∑_{i=1}^{x} ∑_{j=1}^{y} p_{ij}

For instance, the probability that a location has no more than two servers and that the maintenance time does not exceed two hours is:

F(2, 2) = p₁₁ + p₁₂ + p₂₁ + p₂₂ = 0.12 + 0.08 + 0.08 + 0.15 = 0.43
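Storing the joint pmf as an array makes such probabilities one-line computations; a NumPy sketch with rows indexed by X = 1, ..., 4 and columns by Y = 1, 2, 3:

    import numpy as np

    p = np.array([[0.12, 0.08, 0.01],
                  [0.08, 0.15, 0.01],
                  [0.07, 0.21, 0.02],
                  [0.05, 0.13, 0.07]])

    print(round(p.sum(), 2))          # 1.0, a valid joint pmf
    print(round(p[:2, :2].sum(), 2))  # 0.43 = F(2, 2)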

5.8.3 Joint Probability Density Function (pdf )


AF
For continuous random variables X and Y , the joint probability density function
fX,Y (x, y) is defined as:

∂2
fX,Y (x, y) = P (X ≤ x, Y ≤ y).
∂x∂y
The joint probability density function must satisfy the condition:
ZZ
f (x, y) dx dy = 1.
state space

The probability that a ≤ X ≤ b and c ≤ Y ≤ d is obtained from the joint probability density function as:

P(a ≤ X ≤ b, c ≤ Y ≤ d) = ∫_{x=a}^{b} ∫_{y=c}^{d} f(x, y) dy dx

For continuous random variables:

F(x, y) = ∫_{w=−∞}^{x} ∫_{z=−∞}^{y} f(w, z) dz dw

5.8.4 Example: Mineral Deposits


To evaluate the economic feasibility of mining in a specific region, a mining
company collects ore samples from the site and measures their zinc and iron
content. Let the random variable X represent the zinc content, ranging from
0.5 to 1.5, and the random variable Y represent the iron content, ranging from
20.0 to 35.0. Suppose the joint probability density function of X and Y is given
by


f(x, y) = 39/400 − 17(x − 1)²/50 − (y − 25)²/10,000
for 0.5 ≤ x ≤ 1.5 and 20.0 ≤ y ≤ 35.0.

To verify the validity of this joint probability density function, we need to


ensure that f (x, y) ≥ 0 within the defined state space and that
∫_{0.5}^{1.5} ∫_{20.0}^{35.0} f(x, y) dy dx = 1.
This joint probability density function provides comprehensive information

about the joint probabilistic behavior of the random variables X and Y . For
instance, the probability that a randomly selected ore sample has a zinc content
between 0.8 and 1.0 and an iron content between 25 and 30 is given by
∫_{0.8}^{1.0} ∫_{25.0}^{30.0} f(x, y) dy dx,

which evaluates to 0.092. Thus, only about 9% of the ore at this location
has mineral levels within these specified ranges.

5.8.5 Marginal Distributions


The marginal distributions of X and Y can be obtained from the joint distri-
bution.
For discrete variables:

p_X(x) = ∑_y p_{X,Y}(x, y)

p_Y(y) = ∑_x p_{X,Y}(x, y)

For continuous variables:

f_X(x) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dy

f_Y(y) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dx

Using these marginal distributions, we can easily find the mean of X and Y .


5.8.6 Example: Computer Maintenance


To find the marginal distributions of X and Y for the example Computer
Maintenance, as discussed in Section 5.8.2, we need to sum the probabilities
across rows for X and across columns for Y .

                     Number of Servers (Y)
                      1      2      3      P_X(x)
Maintenance    1     0.12   0.08   0.01   0.21
Time (X)       2     0.08   0.15   0.01   0.24
               3     0.07   0.21   0.02   0.30
               4     0.05   0.13   0.07   0.25
P_Y(y)               0.32   0.57   0.11   1.00

Table 5.7: Joint probability mass function for server maintenance with marginal distributions.

Mean of X

µ_X = ∑_x x · P_X(x) = 1 · 0.21 + 2 · 0.24 + 3 · 0.30 + 4 · 0.25 = 2.59

Expected Value of X²

E(X²) = ∑_x x² · P_X(x) = 1² · 0.21 + 2² · 0.24 + 3² · 0.30 + 4² · 0.25 = 7.87

Variance of X

Var(X) = E(X²) − (µ_X)² = 7.87 − (2.59)² = 1.1619

Standard Deviation of X

σ_X = √Var(X) = √1.1619 ≈ 1.0779


Mean of Y

µ_Y = ∑_y y · P_Y(y) = 1 · 0.32 + 2 · 0.57 + 3 · 0.11 = 1.79

Expected Value of Y²

E(Y²) = ∑_y y² · P_Y(y) = 1² · 0.32 + 2² · 0.57 + 3² · 0.11 = 3.59

Variance of Y

Var(Y) = E(Y²) − (µ_Y)² = 3.59 − (1.79)² = 0.3859

Standard Deviation of Y

σ_Y = √Var(Y) = √0.3859 ≈ 0.6212
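All of these marginal summaries follow from row and column sums of the joint pmf; a NumPy sketch using the layout of Table 5.7:

    import numpy as np

    p = np.array([[0.12, 0.08, 0.01],
                  [0.08, 0.15, 0.01],
                  [0.07, 0.21, 0.02],
                  [0.05, 0.13, 0.07]])
    x = np.array([1, 2, 3, 4])   # maintenance time
    y = np.array([1, 2, 3])      # number of servers

    px, py = p.sum(axis=1), p.sum(axis=0)    # marginal pmfs
    mx, my = np.sum(x * px), np.sum(y * py)  # 2.59 and 1.79
    vx = np.sum(x**2 * px) - mx**2           # 1.1619
    vy = np.sum(y**2 * py) - my**2           # 0.3859
    print(mx, my, round(vx, 4), round(vy, 4))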
5.8.7 Example: Mineral Deposits


We consider the Mineral Deposits example, as explained in Section 5.8.4. The marginal probability density function of X, representing the zinc content of the ore, is given by:

f_X(x) = ∫_{20.0}^{35.0} f(x, y) dy

       = ∫_{20.0}^{35.0} (39/400 − 17(x − 1)²/50 − (y − 25)²/10,000) dy

       = [39y/400 − 17y(x − 1)²/50 − (y − 25)³/30,000]_{20.0}^{35.0}

       = 57/40 − 51(x − 1)²/10  for 0.5 ≤ x ≤ 1.5.
So, the expected zinc content E(X) is:

E(X) = ∫_{0.5}^{1.5} x f_X(x) dx

     = ∫_{0.5}^{1.5} x (57/40 − 51(x − 1)²/10) dx

     = (57/40) ∫_{0.5}^{1.5} x dx − (51/10) ∫_{0.5}^{1.5} x(x − 1)² dx

     = 1.


Similarly, we can find E(X²), which is

E(X²) = ∫_{0.5}^{1.5} x² f_X(x) dx = 1.055

Therefore, the variance V(X) is

V(X) = E(X²) − (E(X))² = 1.055 − (1.00)² = 0.055

and the standard deviation is

σ_X = √Var(X) = √0.055 ≈ 0.2345.
The probability that a sample of ore has a zinc content between 0.8 and 1.0 can be determined using the marginal probability density function. This probability is given by:

P(0.8 ≤ X ≤ 1.0) = ∫_{0.8}^{1.0} f_X(x) dx

                 = ∫_{0.8}^{1.0} (57/40 − 51(x − 1)²/10) dx

                 = [57x/40 − 17(x − 1)³/10]_{0.8}^{1.0}

                 = 1.425 − 1.1536

                 = 0.2714
Therefore, approximately 27% of the ore has a zinc content within these
limits.
The marginal probability density function of Y, the iron content of the ore, is given by:

f_Y(y) = ∫_{0.5}^{1.5} f(x, y) dx

       = ∫_{0.5}^{1.5} (39/400 − 17(x − 1)²/50 − (y − 25)²/10,000) dx

       = [39x/400 − 17(x − 1)³/150 − x(y − 25)²/10,000]_{0.5}^{1.5}

       = 83/1200 − (y − 25)²/10,000  for 20.0 ≤ y ≤ 35.0.

The expected iron content and the standard deviation of the iron content are E(Y) = 27.36 and σ_Y = 4.27, respectively.


5.8.8 Conditional Distributions


Conditional distribution refers to the probability distribution of a random vari-
able given the occurrence of another event or condition. It provides insights
into how one variable behaves when another variable has a specific value or
falls within a certain range. This concept is crucial in various fields such as eco-
nomics, biology, and machine learning, where relationships between variables
are studied under specific conditions or contexts.

Conditional Distributions: The conditional distribution of Y given


X = x for discrete variables:

p_{Y|X}(y|x) = p_{X,Y}(x, y) / p_X(x)

For continuous variables:

f_{Y|X}(y|x) = f_{X,Y}(x, y) / f_X(x)

Using these conditional distributions, we can easily find the mean of X given
Y and Y given X. Conditional distributions are often used to make predictions,
assess risks, and uncover underlying patterns in data that may not be apparent
from marginal distributions alone.

5.8.9 Example: Computer Maintenance


To find the conditional mean and variance of X given Y and Y given X, we use
the definitions of conditional expectations and variances. Below, we derive these
values based on the joint probability mass function provided in the Example
DR

5.8.2.

Conditional Mean and Variance of X given Y

Table 5.8: The conditional distribution of X given Y = 1

x               1      2      3      4
p_{X|Y}(x|1)    0.375  0.250  0.219  0.156

For Y = 1:

E(X|Y = 1) = 1 · 0.375 + 2 · 0.250 + 3 · 0.219 + 4 · 0.156 = 2.1563

E(X²|Y = 1) = 1² · 0.375 + 2² · 0.250 + 3² · 0.219 + 4² · 0.156 = 5.8438


Var(X|Y = 1) = 5.8438 − (2.1563)² ≈ 1.1943

Hence, the standard deviation of X given Y = 1 is √1.1943 ≈ 1.0928.

Similarly, we can easily find the conditional distribution with its mean,
variance, and standard deviation of X given Y = 2 and Y = 3. We can also
find the conditional distribution with its mean, variance, and standard deviation
of Y given different values of X.

5.8.10 Example: Mineral Deposits


Given a sample of ore with a zinc content of X = 0.55, what can be inferred about its iron content? The information regarding the iron content Y is encapsulated in the conditional probability density function, which is expressed as:

f_{Y|X=0.55}(y) = f(0.55, y) / f_X(0.55)

Here, the denominator represents the marginal distribution of the zinc content X evaluated at 0.55. Evaluating f_X(0.55):

f_X(0.55) = 57/40 − 51(0.55 − 1.00)²/10 = 0.39225

Thus, the conditional probability density function becomes:

f_{Y|X=0.55}(y) = (39/400 − 17(0.55 − 1.00)²/50 − (y − 25)²/10,000) / 0.39225

Simplifying, we get:

f_{Y|X=0.55}(y) = 0.073 − (y − 25)²/3922.5

for 20.0 ≤ y ≤ 35.0. One can easily find the conditional expectation of the iron content, which is calculated to be 27.14, and the conditional standard deviation, which is 4.14.

5.8.11 Independence and Covariance


Just as two events A and B are considered independent if they are “unrelated”
to each other, two random variables X and Y are deemed independent if the
value taken by one random variable is “unrelated” to the value taken by the
other. Specifically, in the context of data science, random variables are inde-
pendent if the distribution of one random variable does not depend on the value
taken by the other random variable.


Independent Random Variables


• For discrete random variables, independence means that the joint
probability mass function (pmf) can be expressed as the product
of their individual pmf’s:

pX,Y (x, y) = pX (x) · pY (y).

• For continuous random variables, independence means that the


joint probability density function (pdf) can be expressed as the
product of their individual pdf’s:

fX,Y (x, y) = fX (x) · fY (y).

Example:
• Let X be the result of rolling a fair six-sided die, and Y be the result
of flipping a fair coin, where X can take values 1 through 6, and Y can
take values 0 (tails) and 1 (heads). The events are independent, so:

P(X = 3 and Y = 1) = P(X = 3) · P(Y = 1) = (1/6) · (1/2) = 1/12

Problem 5.15. It is known that the ratio of gallium to arsenide does not affect
the functioning of gallium-arsenide wafers, which are the main components of
microchips. Let X denote the ratio of gallium to arsenide and Y denote the
functional wafers retrieved during a 1-hour period. X and Y are independent
random variables with the joint density function
f(x, y) = x(1 + 3y²)/4 for 0 < x < 2, 0 < y < 1, and 0 elsewhere.
Show that X and Y are independent random variables.

Solution
To show that X and Y are independent random variables, we need to ver-
ify that the joint density function f (x, y) can be factored into the product of
the marginal density functions fX (x) and fY (y). Specifically, X and Y are
independent if and only if the joint density function f (x, y) can be written as:

f (x, y) = fX (x) · fY (y).

Marginal density function fX (x):


To find fX (x), integrate the joint density function f (x, y) over the possible
values of y:


f_X(x) = ∫₀¹ f(x, y) dy.

Given the joint density function f(x, y) = x(1 + 3y²)/4 for 0 < x < 2 and 0 < y < 1, compute:

f_X(x) = ∫₀¹ x(1 + 3y²)/4 dy

       = (x/4) ∫₀¹ (1 + 3y²) dy

       = (x/4) (∫₀¹ 1 dy + ∫₀¹ 3y² dy)

       = (x/4)(1 + 1)

       = x/2.

So, the marginal density function for X is:

f_X(x) = x/2,  0 < x < 2.

Marginal density function fY (y):


To find fY (y), integrate the joint density function f (x, y) over the possible
values of x:

f_Y(y) = ∫₀² f(x, y) dx

       = ∫₀² x(1 + 3y²)/4 dx

       = ((1 + 3y²)/4) ∫₀² x dx

       = ((1 + 3y²)/4) · 2 = (1 + 3y²)/2.

Therefore, the marginal density function for Y is:

f_Y(y) = (1 + 3y²)/2,  0 < y < 1.

Verify independence
Check if f (x, y) can be written as fX (x) · fY (y):


f_X(x) · f_Y(y) = (x/2) · ((1 + 3y²)/2) = x(1 + 3y²)/4.
This matches the given joint density function f (x, y).
Since f (x, y) = fX (x) · fY (y), the random variables X and Y are indepen-
dent.

5.8.12 Covariance and Correlation


Covariance

Covariance is essential for understanding and quantifying relationships between
variables, which is a cornerstone of many data science techniques and analyses.
It measures the joint variability of two random variables.

Covariance: For two random variables X and Y , the covariance is defined


as:
Cov(X, Y ) = E[(X − E[X])(Y − E[Y ])]
or equivalently
Cov(X, Y ) = E[XY ] − E(X)E(Y ).
If X and Y are independent, then Cov(X, Y ) = 0. However, a covariance of
zero does not necessarily imply independence.

Correlation
Correlation is a normalized form of covariance that measures the strength and
direction of the linear relationship between two random variables. This normal-
ization makes correlation a more interpretable metric, useful for understanding
the strength and direction of the relationship between variables. It’s widely
used in statistical analysis, machine learning, and data visualization to reveal
and quantify relationships that might otherwise be obscured by differences in
scale or units.

Correlation: For two random variables X and Y, the correlation is defined as:

Corr(X, Y) = Cov(X, Y) / (σ_X σ_Y)

where σ_X and σ_Y are the standard deviations of X and Y, respectively.

The correlation coefficient ρX,Y ranges from −1 to +1. A value of +1


implies a perfect positive linear relationship, −1 implies a perfect negative linear
relationship, and 0 implies no linear relationship.


Problem 5.16. Consider the Computer Maintenance example, where the random variable X denotes the maintenance time in hours at a location, taking values 1, 2, 3, and 4, and the random variable Y represents the number of servers at the location, taking values 1, 2, and 3. The joint probability mass function p_{ij} for these variables is given in the table below:

                     Number of Servers (Y)
                      1      2      3
Maintenance    1     0.12   0.08   0.01
Time (X)       2     0.08   0.15   0.01
               3     0.07   0.21   0.02
               4     0.05   0.13   0.07

(a) Find the covariance of X and Y.

(b) Find the correlation of X and Y.

Solution

(a) Covariance of X and Y:

The covariance of X and Y is defined as

Cov(X, Y) = E[XY] − E(X)E(Y).

We already have E(X) = µ_X = 2.59 and E(Y) = µ_Y = 1.79. Now we need to compute E(XY). Therefore, we need to sum the products of x, y, and their corresponding joint probabilities:

E(XY) = ∑_x ∑_y x · y · P(X = x, Y = y)

Calculating each term:

E(XY) = 1 · 1 · 0.12 + 1 · 2 · 0.08 + 1 · 3 · 0.01
      + 2 · 1 · 0.08 + 2 · 2 · 0.15 + 2 · 3 · 0.01
      + 3 · 1 · 0.07 + 3 · 2 · 0.21 + 3 · 3 · 0.02
      + 4 · 1 · 0.05 + 4 · 2 · 0.13 + 4 · 3 · 0.07
      = 4.86

Therefore, the expected value E(XY) is 4.86.

Cov(X, Y) = E[XY] − E(X)E(Y) = 4.86 − (2.59 × 1.79) = 0.2239.


The positive value of 0.2239 indicates that there is a positive relationship between X and Y. As X increases, Y tends to increase, and vice versa. The magnitude of the covariance gives an idea of the strength of the relationship. However, because covariance is not standardized, it is difficult to assess the strength of the relationship without additional context such as the variances of X and Y.

(b) Correlation of X and Y:

To get a more standardized measure of the relationship between X and Y, we can compute the correlation coefficient:

ρ_{X,Y} = Cov(X, Y) / (σ_X σ_Y)

Given the standard deviations σ_X = 1.0779 and σ_Y = 0.6212, we can find the correlation coefficient ρ_{X,Y} using the covariance Cov(X, Y) = 0.2239. Substitute the given values:

ρ_{X,Y} = 0.2239 / (1.0779 × 0.6212) ≈ 0.3344

The correlation of 0.3344 suggests that there is a tendency for more servers to need maintenance as the maintenance time increases.
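The whole covariance/correlation calculation can be written compactly in Python; a NumPy sketch over the same joint pmf:

    import numpy as np

    p = np.array([[0.12, 0.08, 0.01],
                  [0.08, 0.15, 0.01],
                  [0.07, 0.21, 0.02],
                  [0.05, 0.13, 0.07]])
    x = np.array([1, 2, 3, 4])
    y = np.array([1, 2, 3])

    ex = np.sum(x * p.sum(axis=1))
    ey = np.sum(y * p.sum(axis=0))
    exy = np.sum(np.outer(x, y) * p)   # E[XY] over the joint pmf
    cov = exy - ex * ey                # 0.2239
    sx = np.sqrt(np.sum(x**2 * p.sum(axis=1)) - ex**2)
    sy = np.sqrt(np.sum(y**2 * p.sum(axis=0)) - ey**2)
    print(round(cov, 4), round(cov / (sx * sy), 4))  # 0.2239 0.3344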

Problem 5.17. Consider two continuous random variables X and Y with the
following joint probability density function:
f_{X,Y}(x, y) = 4xy if 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1, and 0 otherwise.

(a) Are X and Y independent?

(b) Find the covariance Cov(X, Y ).


(c) Find the correlation Corr(X, Y ).

Solution
(a). Are X and Y independent?
To check if X and Y are independent, we need to verify if the joint PDF
factorizes into the product of the marginal PDFs of X and Y . That is, we need
to check if:

fX,Y (x, y) = fX (x)fY (y).


Marginal PDF of X:

The marginal PDF of X is obtained by integrating the joint PDF over all possible values of y:

f_X(x) = ∫₀¹ f_{X,Y}(x, y) dy.

For 0 ≤ x ≤ 1, we compute:

f_X(x) = ∫₀¹ 4xy dy = 4x ∫₀¹ y dy = 4x [y²/2]₀¹ = 4x · (1/2) = 2x.

Thus, the marginal PDF of X is:

f_X(x) = 2x if 0 ≤ x ≤ 1, and 0 otherwise.

Marginal PDF of Y:

Similarly, the marginal PDF of Y is obtained by integrating the joint PDF over all possible values of x:

f_Y(y) = ∫₀¹ f_{X,Y}(x, y) dx.

For 0 ≤ y ≤ 1, we compute:

f_Y(y) = ∫₀¹ 4xy dx = 4y ∫₀¹ x dx = 4y [x²/2]₀¹ = 4y · (1/2) = 2y.

Thus, the marginal PDF of Y is:

f_Y(y) = 2y if 0 ≤ y ≤ 1, and 0 otherwise.

Check for Independence:


Now we check if the joint PDF factorizes as the product of the marginal PDFs.
We compute:

fX (x)fY (y) = (2x)(2y) = 4xy.


Since fX,Y (x, y) = 4xy (which matches fX (x)fY (y) for 0 ≤ x ≤ 1 and
0 ≤ y ≤ 1), we conclude that X and Y are independent.


(b). Find the Covariance Cov(X, Y ):


The covariance is defined as:

Cov(X, Y ) = E[XY ] − E[X]E[Y ].

Compute E[X] and E[Y]:

First, we calculate the expected values E[X] and E[Y].

For X:

E[X] = ∫₀¹ x f_X(x) dx = ∫₀¹ x(2x) dx = 2 ∫₀¹ x² dx = 2 [x³/3]₀¹ = 2/3.

For Y:

E[Y] = ∫₀¹ y f_Y(y) dy = ∫₀¹ y(2y) dy = 2 ∫₀¹ y² dy = 2 [y³/3]₀¹ = 2/3.
Compute E[XY]:

Now we compute E[XY] using the joint PDF:

E[XY] = ∫₀¹ ∫₀¹ xy f_{X,Y}(x, y) dx dy

      = ∫₀¹ ∫₀¹ xy(4xy) dx dy

      = 4 ∫₀¹ ∫₀¹ x²y² dx dy

      = 4 (∫₀¹ x² dx) (∫₀¹ y² dy)

      = 4 · (1/3) · (1/3)

      = 4/9.

Compute the Covariance:

Now, we compute the covariance:

Cov(X, Y) = E[XY] − E[X]E[Y] = 4/9 − (2/3) · (2/3) = 4/9 − 4/9 = 0.

Thus, the covariance is:

Cov(X, Y) = 0.


(c). Find the Correlation Corr(X, Y ):


The correlation is given by:

Corr(X, Y) = Cov(X, Y) / (σ_X σ_Y).
Since Cov(X, Y ) = 0, the correlation is:

Corr(X, Y ) = 0.
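The factorization argument and the covariance integral can both be verified symbolically; a SymPy sketch for this joint density:

    import sympy as sp

    x, y = sp.symbols('x y', nonnegative=True)
    f = 4 * x * y                        # joint pdf on the unit square

    fx = sp.integrate(f, (y, 0, 1))      # marginal of X -> 2x
    fy = sp.integrate(f, (x, 0, 1))      # marginal of Y -> 2y
    print(sp.simplify(f - fx * fy))      # 0, so the joint pdf factorizes

    exy = sp.integrate(x * y * f, (x, 0, 1), (y, 0, 1))  # 4/9
    ex = sp.integrate(x * fx, (x, 0, 1))                 # 2/3
    ey = sp.integrate(y * fy, (y, 0, 1))                 # 2/3
    print(exy - ex * ey)                                 # 0 = Cov(X, Y)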

5.8.13 Linear Functions of a Random Variable
We will now explore some properties that will simplify calculating the means
and variances of random variables discussed in later chapters. These properties
allow us to express expectations using other parameters that are either known
or easily computed. The results presented are applicable to both discrete and
AF
continuous random variables, though proofs are provided only for the continuous
case. We start with a theorem and two corollaries that should be intuitively
understandable to the reader.
Theorem 5.1. If a and b are constants, then

E(aX + b) = aE(X) + b

and the variance is


Var(aX + b) = a2 Var(X).

Proof. By the definition of expected value,

E(aX + b) = ∫_{−∞}^{∞} (ax + b) f(x) dx.

This can be rewritten as

E(aX + b) = a ∫_{−∞}^{∞} x f(x) dx + b ∫_{−∞}^{∞} f(x) dx.

The first integral on the right is E(X) and the second integral equals 1. Therefore, we have

E(aX + b) = aE(X) + b.


Compute Var(aX + b):

Var(aX + b) = E[(aX + b − E(aX + b))²]
            = E[(aX + b − (aE(X) + b))²]
            = E[(aX − aE(X))²]
            = E[a²(X − E(X))²]
            = a² E[(X − E(X))²]
            = a² Var(X).

Thus, we have shown that:

Var(aX + b) = a² Var(X)

Problem 5.18. Applying Theorem 5.1 to the continuous random variable

Y = 1.1X − 4.5,

rework Example 5.5.2.

For Examples 5.5.2 and 5.6.1, it is obtained that E(X) = 50 and V(X) = 0.05. We may use Theorem 5.1 to write

E[Y] = 1.1E[X] − 4.5 = 1.1 × 50 − 4.5 = 50.5

and

Var(Y) = (1.1)² Var(X) = (1.1)² × 0.05 = 0.0605

Problem 5.19. Suppose that a temperature has a mean of 110°F and a standard deviation of 2.2°F. The conversion formula from Fahrenheit to Centigrade is given by:

F = 9C/5 + 32

where F is the temperature in Fahrenheit and C is the temperature in Centigrade. What are the mean and the standard deviation in degrees Centigrade?

Solution
To find the mean temperature in Centigrade, we use:

C_mean = (5/9)(F_mean − 32)

Substitute F_mean = 110:

C_mean = (5/9)(110 − 32) = (5/9) × 78 ≈ 43.3°C

To find the standard deviation in Centigrade, we use:

σ_C = (5/9) σ_F

Substitute σ_F = 2.2:

σ_C = (5/9) × 2.2 = 11/9 ≈ 1.22°C

Thus, the mean temperature is approximately 43.3°C and the standard deviation is approximately 1.22°C.
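Theorem 5.1 reduces this to two lines of arithmetic, as a quick Python sketch shows:

    f_mean, f_sd = 110, 2.2
    c_mean = 5 / 9 * (f_mean - 32)   # the shift affects only the mean
    c_sd = 5 / 9 * f_sd              # only the scale factor affects the sd
    print(round(c_mean, 1), round(c_sd, 2))  # 43.3 1.22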
Theorem 5.2. The expected value of the sum or difference of two or more
functions of a random variable X is the sum or difference of the expected values
of the functions. That is,
AF E[g(X) ± h(X)] = E[g(X)] ± E[h(X)].

Problem 5.20. Let X be a random variable with probability distribution as


follows:
x       0     1     2     3
f(x)    1/3   1/2   0     1/6

Find the expected value of Y = (X − 1)² and the variance of Y.

Solution
Applying Theorem 5.2 to the function Y = (X − 1)², we can write

E[(X − 1)²] = E(X² − 2X + 1) = E(X²) − 2E(X) + E(1).

From Theorem 5.1, E(1) = 1, and by direct computation,

E(X) = (0)(1/3) + (1)(1/2) + (2)(0) + (3)(1/6) = 1,

and

E(X²) = (0)(1/3) + (1)(1/2) + (4)(0) + (9)(1/6) = 2.

Hence,

E[(X − 1)²] = 2 − (2)(1) + 1 = 1.

Now, to calculate the variance of Y, we need E[Y²]. Since Y = (X − 1)²,

Y² = (X − 1)⁴.


We need to compute E[(X − 1)⁴]:

E[(X − 1)⁴] = ∑_x (x − 1)⁴ f(x)

            = (0 − 1)⁴ · (1/3) + (1 − 1)⁴ · (1/2) + (2 − 1)⁴ · 0 + (3 − 1)⁴ · (1/6)

            = 1/3 + 0 + 0 + 8/3

            = 3.

Therefore,

Var(Y) = E[Y²] − (E[Y])² = 3 − 1² = 2.

Problem 5.21. The weekly demand for a particular drink, measured in thousands of liters, at a chain of convenience stores is a continuous random variable g(X) = X² + X − 2, where X has the following density function:

f(x) = 2(x − 1) for 1 < x < 2, and 0 elsewhere.

Find the expected value of the weekly demand.

Solution

To find the expected value of the weekly demand for the drink, we use Theorem 5.2:

E(X² + X − 2) = E(X²) + E(X) − E(2).

From Theorem 5.1, E(2) = 2. By direct integration, we find:

E(X) = ∫₁² 2x(x − 1) dx = 5/3,

and

E(X²) = ∫₁² 2x²(x − 1) dx = 17/6.

Thus,

E(X² + X − 2) = 17/6 + 5/3 − 2 = 5/2.

Therefore, the average weekly demand for the drink at this chain of convenience stores is 2500 liters.
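The integral E[g(X)] can be confirmed symbolically; a SymPy sketch for this density:

    import sympy as sp

    x = sp.symbols('x')
    f = 2 * (x - 1)          # density on (1, 2)
    g = x**2 + x - 2         # weekly demand

    demand = sp.integrate(g * f, (x, 1, 2))
    print(demand)            # 5/2, i.e. 2500 liters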

Example: Test Score Standardization


Suppose that the raw scores X from a particular testing procedure are dis-
tributed between −5 and 20 with an expected value of 10 and a variance of 7.
In order to standardize the scores so that they lie between 0 and 100, the linear
transformation
Y = 4X + 20


is applied to the scores. This means, for example, that a raw score of x = 12
corresponds to a standardized score of y = (4 × 12) + 20 = 68.
The expected value of the standardized scores is then known to be

E(Y ) = 4E(X) + 20 = (4 × 10) + 20 = 60

with a variance of

Var(Y ) = 42 Var(X) = 42 × 7 = 112



The standard deviation of the standardized scores is σY = √112 ≈ 10.58, which
equals 4 × σX = 4 × √7.

5.8.14 Linear Combinations of Random Variables
When dealing with two random variables, X1 and X2 , it is often beneficial to
analyze the random variable formed by their sum. A general principle states
that:
AF E(X1 + X2 ) = E(X1 ) + E(X2 )
This means the expected value of the sum of two random variables is equal to
the sum of their individual expected values.

In addition:

Var(X1 + X2 ) = Var(X1 ) + Var(X2 ) + 2Cov(X1 , X2 )

Note that if the two random variables are independent, their covariance is
zero, simplifying the variance of their sum to the sum of their variances:

Var(X1 + X2 ) = Var(X1 ) + Var(X2 )



Thus, the variance of the sum of two independent random variables is equal to
the sum of their individual variances.

These results are straightforward, but it’s crucial to remember that while
the expected value of the sum of two random variables always equals the sum
of their expected values, the variance of the sum only equals the sum of their
variances if the random variables are independent.

Sums of Random Variables: If X1 and X2 are two random variables,


then:
E(X1 + X2 ) = E(X1 ) + E(X2 )
and
Var(X1 + X2 ) = Var(X1 ) + Var(X2 ) + 2Cov(X1 , X2 )
If X1 and X2 are independent random variables such that Cov(X1 , X2 ) = 0,


then:
Var(X1 + X2 ) = Var(X1 ) + Var(X2 )
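These identities can be checked empirically. The sketch below simulates correlated pairs (X1, X2) with numpy; the means and covariance chosen are arbitrary illustrative values:

import numpy as np

rng = np.random.default_rng(42)
mean = [2.0, -1.0]
cov = [[1.0, 0.6],
       [0.6, 2.0]]  # Var(X1) = 1.0, Var(X2) = 2.0, Cov(X1, X2) = 0.6

x1, x2 = rng.multivariate_normal(mean, cov, size=500_000).T
s = x1 + x2

print(s.mean())  # close to E(X1) + E(X2) = 1.0
print(s.var())   # close to Var(X1) + Var(X2) + 2 Cov(X1, X2) = 4.2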

Now, consider a sequence of random variables X1 , . . . , Xn along with constants


a1 , . . . , an and b. Define a new random variable Y as the linear combination:

Y = a1 X1 + · · · + an Xn + b

Linear combinations of random variables are important in various contexts,


and deriving general results for them is useful. The expectation of the linear
combination is:
E(Y ) = a1 E(X1 ) + · · · + an E(Xn ) + b

which is simply the linear combination of the expectations of the random vari-
ables Xi . Additionally, if the random variables X1 , . . . , Xn are independent,
then:
Var(Y) = a1^2 Var(X1) + · · · + an^2 Var(Xn)
Note that the constant b does not affect the variance of Y , and the coefficients
ai are squared in this expression.
Theorem 5.3. If X1 , . . . , Xn is a sequence of random variables and a1 , . . . , an
and b are constants, then

E(a1 X1 + · · · + an Xn + b) = a1 E(X1 ) + · · · + an E(Xn ) + b.

If, in addition, the random variables are independent, then

Var(a1 X1 + · · · + an Xn + b) = a1^2 Var(X1) + · · · + an^2 Var(Xn).

Problem 5.22. Suppose that X1 , . . . , Xn is a sequence of independent random


variables each with an expectation µ and a variance σ^2. Consider the sample
mean X̄ defined as:

X̄ = (1/n) Σ_{i=1}^n X_i

Find the mean and variance of the sample mean X̄.

Solution
Mean of the Sample Mean: Using the linearity of expectation:
E(X̄) = E( (1/n) Σ_{i=1}^n X_i ) = (1/n) Σ_{i=1}^n E(X_i) = (1/n) Σ_{i=1}^n µ = nµ/n = µ


Variance of the Sample Mean: Since the Xi are independent and each
has a variance σ^2:

Var(X̄) = Var( (1/n) Σ_{i=1}^n X_i )
        = (1/n^2) Σ_{i=1}^n Var(X_i)
        = (1/n^2) Σ_{i=1}^n σ^2
        = nσ^2 / n^2
        = σ^2 / n.
Therefore, the mean and variance of the sample mean X̄ are:
E(X̄) = µ   and   Var(X̄) = σ^2 / n
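A short simulation illustrates this σ^2/n behaviour; here the Xi are exponential draws, chosen purely as an example distribution with known mean and variance:

import numpy as np

rng = np.random.default_rng(0)
n = 25  # sample size; Exponential(scale=1) has mu = 1 and sigma^2 = 1

# 100,000 replications of the sample mean of n observations
xbar = rng.exponential(scale=1.0, size=(100_000, n)).mean(axis=1)

print(xbar.mean())  # close to mu = 1.0
print(xbar.var())   # close to sigma^2 / n = 1/25 = 0.04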
Problem 5.23. Let X1 and X2 represent the scores on two tests, with the
following information:
E(X1 ) = 18, Var(X1 ) = 24, E(X2 ) = 30, Var(X2 ) = 60.
The scores are standardized as:
Y1 = (10/3) X1,   Y2 = (5/3) X2 + 50/3.

The final score is:

Z = (2/3) Y1 + (1/3) Y2.
(a). Calculate the expected value of the final score E(Z).

(b). Calculate Var(Z) and the standard deviation σZ , assuming X1 and X2


are independent.

Solution


(a) Calculate E(Z):


E(Z) = (2/3) E(Y1) + (1/3) E(Y2).

E(Y1) = (10/3) × 18 = 60,   E(Y2) = (5/3) × 30 + 50/3 = 66.67.

E(Z) = (2/3) × 60 + (1/3) × 66.67 = 62.22.

(b) Calculate Var(Z) and σZ :


Var(Z) = (2/3)^2 Var(Y1) + (1/3)^2 Var(Y2).

Var(Y1) = (10/3)^2 × 24 = 266.67,   Var(Y2) = (5/3)^2 × 60 = 166.67.

Var(Z) = (4/9) × 266.67 + (1/9) × 166.67 = 137.04.

σZ = √137.04 = 11.71.

Hence, E(Z) = 62.22, Var(Z) = 137.04, and σZ = 11.71.
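The same arithmetic can be scripted directly from Theorem 5.3 (a minimal sketch whose variable names mirror the problem):

from math import sqrt

E_X1, V_X1 = 18, 24
E_X2, V_X2 = 30, 60

# Standardized scores: Y1 = (10/3) X1 and Y2 = (5/3) X2 + 50/3
E_Y1, V_Y1 = (10/3) * E_X1, (10/3)**2 * V_X1
E_Y2, V_Y2 = (5/3) * E_X2 + 50/3, (5/3)**2 * V_X2

# Final score Z = (2/3) Y1 + (1/3) Y2, with X1 and X2 independent
E_Z = (2/3) * E_Y1 + (1/3) * E_Y2
V_Z = (2/3)**2 * V_Y1 + (1/3)**2 * V_Y2

print(round(E_Z, 2), round(V_Z, 2), round(sqrt(V_Z), 2))  # 62.22 137.04 11.71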

5.8.15 Exercises
1. Suppose X, taking the values 1, 2, 3, and 4, is the service time in hours
taken at Bashundhara residential area, and Y, taking the values 1, 2,
and 3, is the number of air conditioner (AC) units at the same location.
The joint probabilities of X and Y are presented in Table 5.9.

Table 5.9: Joint probability table

                           X = service time (hrs)
p(x, y)                  1       2       3       4
Y = number     1       0.12    0.08    0.07    0.05
of AC units    2       0.08     k      0.21    0.13
               3       0.01    0.01    0.02    0.07

(a) What is the value of k?


(b) Find the marginal distribution of X and Y .
(c) Find the conditional distribution P(X | Y = 2) and compute its
mean.


(d) Compute E(XY ).

2. Suppose that the random variables X, Y , and Z are independent with


E(X) = 3, Var(X) = 4, E(Y ) = −4, Var(Y ) = 2, E(Z) = 7, and
Var(Z) = 7. Calculate the expectation and variance of the following
random variables.
(a) 3X + 7
(b) 5X − 9
(c) 2X + 6Y
(d) 4X − 3Y

(e) 5X − 9Z + 8
(f) −3Y − Z − 5
(g) X + 2Y + 3Z
(h) 6X + 2Y − Z + 16
3. Suppose that items from a manufacturing process are subject to three
separate evaluations, and that the results of the first evaluation X1 have
a mean value of 59 with a standard deviation of 10, the results of the
second evaluation X2 have a mean value of 67 with a standard deviation
of 13, and the results of the third evaluation X3 have a mean value of 72
with a standard deviation of 4. In addition, suppose that the results of
the three evaluations can be taken to be independent of each other.

(a) If a final evaluation score is obtained as the average of the three


evaluations X̄ = (X1 + X2 + X3)/3, what are the mean and the standard
deviation of the final evaluation score?

(b) If a final evaluation score is obtained as the weighted average of the


three evaluations X = 0.4X1 + 0.4X2 + 0.2X3 , what are the mean
and the standard deviation of the final evaluation score?

4. A machine part is assembled by fastening two components of type A


and one component of type B end to end. Suppose that the lengths of
components of type A have an expectation of 37.0 mm and a standard
deviation of 0.7 mm, whereas the lengths of components of type B have
an expectation of 24.0 mm and a standard deviation of 0.3 mm. What
are the expectation and variance of the length of the machine part?

5. A product is assembled by linking four components of type C and one


component of type D sequentially. The lengths of components of type C
have an average of 50.0 mm and a standard deviation of 0.8 mm, while
the lengths of components of type D have an average of 20.0 mm and a
standard deviation of 0.4 mm. Determine the average and variance of the
length of the product.


6. A system is constructed by connecting five components of type G and two


components of type H end to end. Assume that the lengths of components
of type G have an expected value of 40.0 mm and a standard deviation
of 1.5 mm, and the lengths of components of type H have an expected
value of 22.0 mm and a standard deviation of 0.7 mm. What are the
expectation and variance of the total length of the system?
7. A person’s cholesterol level C can be measured by three different tests.
Test-α returns a value Xα with a mean C and a standard deviation of 1.2,
test-β returns a value Xβ with a mean C and a standard deviation of 2.4,
and test-γ returns a value Xγ with a mean C and a standard deviation
of 3.1. Suppose that the three test results are independent. If a doctor

decides to use the weighted average 0.5Xα + 0.3Xβ + 0.2Xγ , what is the
standard deviation of the cholesterol level obtained by the doctor?
8. Suppose that the impurity levels of water samples taken from a particular
source are independent with a mean value of 3.87 and a standard deviation
of 0.18.
AF (a) What are the mean and the standard deviation of the sum of the
impurity levels from two water samples?
(b) What are the mean and the standard deviation of the sum of the
impurity levels from three water samples?
(c) What are the mean and the standard deviation of the average of the
impurity levels from four water samples?
(d) If the impurity levels of two water samples are averaged, and the
result is subtracted from the impurity level of a third sample, what
are the mean and the standard deviation of the resulting value?

5.9 Python Functions for Statistical Distributions
In the analysis of statistical distributions, Python provides a variety of functions
to work with different types of distributions. These functions can be used to
perform tasks such as generating random variates, computing probability mass
functions (pmf), cumulative distribution functions (cdf), and more. The following
Table 5.10 summarizes some of the key functions available for discrete and
continuous distributions in Python.
These functions are typically part of the ‘scipy.stats’ module, which
includes a wide range of probability distributions and statistical functions. The
table below lists each function along with a brief explanation of its purpose:


Python function                                   Function explanation
rvs(p, loc=0, size=1)                             Random variates.
pmf(x, p, loc=0)                                  Probability mass function.
logpmf(x, p, loc=0)                               Log of the probability mass function.
cdf(x, p, loc=0)                                  Cumulative distribution function.
logcdf(x, p, loc=0)                               Log of the cumulative distribution function.
sf(x, p, loc=0)                                   Survival function (1 − cdf; sometimes more accurate).
logsf(x, p, loc=0)                                Log of the survival function.
ppf(q, p, loc=0)                                  Percent point function (inverse of cdf; percentiles).
isf(q, p, loc=0)                                  Inverse survival function (inverse of sf).
stats(p, loc=0, moments='mv')                     Mean ('m'), variance ('v'), skew ('s'), and/or kurtosis ('k').
entropy(p, loc=0)                                 (Differential) entropy of the RV.
expect(func, p, loc=0, lb=None, ub=None,          Expected value of a function (of one argument)
  conditional=False)                              with respect to the distribution.
median(p, loc=0)                                  Median of the distribution.
mean(p, loc=0)                                    Mean of the distribution.
var(p, loc=0)                                     Variance of the distribution.
std(p, loc=0)                                     Standard deviation of the distribution.
interval(alpha, p, loc=0)                         Endpoints of the range that contains alpha percent of the distribution.

Table 5.10: Summary of Python functions for statistical distributions.
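As a brief illustration of this interface, the sketch below freezes a Binomial(10, 0.3) distribution and calls several of the methods from Table 5.10 (the choice of distribution and parameters is arbitrary):

from scipy.stats import binom

dist = binom(10, 0.3)  # a "frozen" distribution object

print(dist.pmf(3))               # P(X = 3)
print(dist.cdf(3))               # P(X <= 3)
print(dist.sf(3))                # P(X > 3), i.e., 1 - cdf
print(dist.ppf(0.5))             # percent point function (median)
print(dist.stats(moments="mv"))  # mean and variance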

5.10 Concluding Remarks


In this chapter, we have established a comprehensive foundation for understand-
ing random variables and their fundamental properties. We began by defining
random variables and distinguishing between discrete and continuous types, ex-
amining their respective probability functions and cumulative distribution func-
tions. We then delved into the crucial concepts of expectation and variance,
illustrating their applications through various examples. The discussion on


Chebyshev’s Inequality highlighted its utility in providing probabilistic bounds


without assuming a specific distribution. Additionally, we explored jointly dis-
tributed random variables, emphasizing the importance of understanding inde-
pendence, covariance, and correlation. With these essential concepts and tools
in place, we are now well-equipped to explore specific discrete probability distri-
butions in the following chapter, where we will extend our knowledge to model
and analyze discrete data more effectively.

5.11 Chapter Exercises


1. A study records the number of adverse reactions to a new drug among a

group of patients. The probability distribution of the number of adverse
reactions is given by:

P(X = x) = 0.5   if x = 0,
           0.3   if x = 1,
           0.2   if x = 2.

(a) Find the expected number of adverse reactions.


(b) Calculate the variance and standard deviation of the number of ad-
verse reactions.
2. Let the lifetime T in hours of a certain type of electronic device have the
probability density function
f_T(t) = (1/100) e^(−t/100)   for t ≥ 0,
         0                    elsewhere.
DR

Find the expectation and variance of the lifetime.


3. Let the height H in centimeters of a particular species of plant have the
probability density function
f_H(h) = (3/64)(h − 120)^2   for 120 ≤ h ≤ 124,
         0                   elsewhere.

Calculate the cumulative distribution function FH (h) and find the prob-
ability that a plant’s height is between 121 and 123 cm.

(a) Write down the probability mass function P (Y = y).


(b) Calculate the expectation and variance of Y .
(c) Find the probability that there are exactly 2 defective items in a
batch.


4. Let the length L in meters of a certain type of fish have the probability
density function

f_L(l) = 0.2l   for 0 ≤ l ≤ 2,
         0      elsewhere.
Find the expectation, variance, and cumulative distribution function FL (l)
of the length.

5. Let the temperature X in degrees Fahrenheit of a particular chemical


reaction have the density

f_X(x) = (x − 190)/3600,   220 ≤ x ≤ 280.

(a) Find the cumulative distribution function F(x) of the temperature and its median.
(b) Find the expectation and standard deviation of the temperature.
(c) Find the expectation and variance of Y, where Y = (5/9)X − 160/9.
9 .
6. The random variable X measures the concentration of ethanol in a chem-
ical solution, and the random variable Y measures the acidity of the so-
lution. They have a joint probability density function

f (x, y) = A(20 − x − 2y)

for 0 ≤ x ≤ 5 and 0 ≤ y ≤ 5 and f (x, y) = 0 elsewhere.


(a) What is the value of A? What is P (1 ≤ X ≤ 2, 2 ≤ Y ≤ 3)?
(b) Construct the marginal probability density function fX (x).
(c) What are the expectation and the variance of the ethanol concentra-

tion?

Chapter 6

Some Discrete Probability Distributions
6.1 Introduction
In the field of data science, understanding discrete probability distributions is
crucial for analyzing and modeling data that can be categorized into distinct
outcomes. These distributions help data scientists interpret and predict the
likelihood of various events based on historical data, which can be essential for
making informed decisions and developing predictive models.

This chapter focuses on three fundamental discrete probability distributions:


the Bernoulli distribution, the Binomial distribution, and the Poisson distribu-
tion. Each of these distributions plays a vital role in data science applications,

ranging from binary classification problems to event counting and rate model-
ing.

Throughout this chapter, we will delve into each distribution’s mathematical


properties, including expected value, variance, moment generating function, and
characteristic function. We will also present practical examples and exercises
to illustrate how these distributions can be applied to real-world data science
problems.

6.2 Bernoulli Distribution


Consider a simple experiment where we flip a fair coin. The outcome of this
experiment can be either “Heads” or “Tails.” We can assign a value of 1 to
“Heads” and 0 to “Tails.” This experiment is an example of a Bernoulli trial,
which is a random experiment with exactly two possible outcomes. It is named


after Jacob Bernoulli, a Swiss mathematician.

Suppose we are interested in modeling the probability of getting “Heads” in


a single coin flip. If the coin is fair, the probability of getting “Heads” (success)
is p = 0.5 and the probability of getting “Tails” (failure) is 1−p = 0.5. However,
in general, the probability of success in a Bernoulli trial can be any value p such
that 0 ≤ p ≤ 1.

Definition: A random variable X is said to have a Bernoulli distribution


with parameter p if it takes the value 1 with probability p and the value 0
with probability 1 − p. The probability mass function (pmf) of X is given
by:

P(X = x) = p       if x = 1,
           1 − p   if x = 0,

or more compactly, if X ∼ Bernoulli(p), then the pmf is

P(X = x) = p^x (1 − p)^(1−x)   for x ∈ {0, 1}.


6.2.1 Expected Value (Mean)
The mean or expected value E(X) of a Bernoulli distributed random variable
X can be calculated as follows:
E(X) = Σ_x x · P(X = x)

For a Bernoulli random variable X:

E(X) = 1 · P (X = 1) + 0 · P (X = 0)

Since P (X = 1) = p and P (X = 0) = 1 − p, we have:

E(X) = 1 · p + 0 · (1 − p) = p
So, the mean of a Bernoulli distribution is:

E(X) = p

6.2.2 Variance
The variance Var(X) of a Bernoulli distributed random variable X is defined
as:

Var(X) = E[(X − E(X))^2]

First, we calculate E(X^2):


E(X^2) = Σ_x x^2 · P(X = x)

For a Bernoulli random variable X:

E(X^2) = 1^2 · P(X = 1) + 0^2 · P(X = 0) = p
Now, using the formula for variance:

Var(X) = E(X^2) − [E(X)]^2

Substitute the values we have calculated:

Var(X) = p − p^2 = p(1 − p)
So, the variance of a Bernoulli distribution is:

Var(X) = p(1 − p)
Properties

• Mean: The expected value (mean) of a Bernoulli random variable X is given by:

  E(X) = p

• Variance: The variance of a Bernoulli random variable X is given by:

  Var(X) = p(1 − p)

• Standard Deviation: The standard deviation of a Bernoulli random variable X is:

  σ = √(p(1 − p))

Problem 6.1. A factory produces light bulbs, and each bulb has a 95% chance
of passing the quality control test. Define a random variable X such that X = 1
if a light bulb passes the quality control test (success) and X = 0 if it fails
(failure).

(a). What is the probability that a randomly selected light bulb passes the quality
control test?
(b). What is the expected value (mean) of X?

(c). What is the variance of X?


Solution
Let’s define the random variable X as follows:
X = 1   with probability p = 0.95,
    0   with probability 1 − p = 0.05.

(a). Probability of Passing the Quality Control Test


The probability that a randomly selected light bulb passes the quality control
test is given by P (X = 1).

P (X = 1) = p = 0.95
So, the probability that a light bulb passes the quality control test is 0.95,
or 95%.

(b). Expected Value (Mean) of X


The expected value E(X) of a Bernoulli distributed random variable X is given
by:

E(X) = p
Substituting the value of p:

E(X) = 0.95
So, the expected value of X is 0.95.

(c). Variance of X
The variance Var(X) of a Bernoulli distributed random variable X is given by:

Var(X) = p(1 − p)
Substituting the value of p:

Var(X) = 0.95 × (1 − 0.95) = 0.95 × 0.05 = 0.0475


So, the variance of X is 0.0475.

Problem 6.2. A new vaccine is being tested for its effectiveness. In clinical
trials, it was found that the vaccine successfully immunizes 90% of the par-
ticipants. Define a random variable X such that X = 1 if a participant is
successfully immunized (success) and X = 0 if not (failure).

(i). What is the probability that a randomly selected participant is successfully


immunized?


(ii). What is the expected value (mean) of X?


(iii). What is the variance of X?
(iv). In a group of 10 participants, what is the expected number of participants
that will be successfully immunized?

Solution
Let’s define the random variable X as follows:
X = 1   with probability p = 0.90,
    0   with probability 1 − p = 0.10.

(i) Probability of Successful Immunization


The probability that a randomly selected participant is successfully immunized
is given by P (X = 1).
P(X = 1) = p = 0.90
So, the probability that a participant is successfully immunized is 0.90, or
90%.

(ii) Expected Value (Mean) of X


The expected value E(X) of a Bernoulli distributed random variable X is given
by:

E(X) = p = 0.90

So, the expected value of X is 0.90.

(iii) Variance of X
The variance Var(X) of a Bernoulli distributed random variable X is given by:

Var(X) = p(1 − p) = 0.90 × (1 − 0.90) = 0.90 × 0.10 = 0.09


So, the variance of X is 0.09.

(iv) Expected Number of Successful Immunizations in a Group of 10


Participants
Let Y be the total number of participants successfully immunized in a group
of 10. Y follows a Binomial distribution with parameters n = 10 and p = 0.90.
The expected value E(Y ) of a Binomial random variable is given by:

E(Y ) = np


Substituting the values of n and p:

E(Y ) = 10 × 0.90 = 9
So, the expected number of participants successfully immunized in a group
of 10 is 9.

6.2.3 Moment Generating Function (MGF)


The moment generating function (MGF) MX (t) of a random variable X is
defined as:

M_X(t) = E(e^(tX))
For a Bernoulli distributed random variable X with probability p of success
(i.e., X = 1 with probability p and X = 0 with probability 1 − p):

M_X(t) = E(e^(tX)) = e^(t·0) · P(X = 0) + e^(t·1) · P(X = 1)
       = e^0 · (1 − p) + e^t · p
       = 1 · (1 − p) + e^t · p
       = (1 − p) + p e^t

So, the moment generating function of a Bernoulli distributed random vari-


able X is:

M_X(t) = 1 − p + p e^t

6.2.4 Characteristic Function



The characteristic function φX (t) of a Bernoulli distributed random variable X


is defined as:

φ_X(t) = E(e^(itX)) = e^(it·0) · P(X = 0) + e^(it·1) · P(X = 1)
       = e^(i·0) · (1 − p) + e^(it) · p
       = 1 · (1 − p) + e^(it) · p
       = (1 − p) + p e^(it).

6.2.5 Probability Generating Function


For a Bernoulli random variable X with parameter p, the probability generating
function (PGF) is given by:

G_X(s) = E[s^X]
Since X can take values 0 and 1, we have:


G_X(s) = E[s^X] = Σ_x P(X = x) · s^x
       = (1 − p) · s^0 + p · s^1
       = (1 − p) + p · s

Therefore, the PGF of a Bernoulli random variable X with parameter p is:

G_X(s) = 1 − p + p · s

6.2.6 Example
Let’s consider a biased coin where the probability of getting “Heads” is p = 0.7.
The random variable X representing the outcome of a single coin flip follows a
Bernoulli distribution with parameter p = 0.7.
The pmf of X is:
P(X = x) = 0.7   if x = 1,
           0.3   if x = 0.
The mean and variance of X are:

E(X) = 0.7

Var(X) = 0.7 × (1 − 0.7) = 0.21


Thus, we can model and analyze the outcomes of a single coin flip using the
Bernoulli distribution.

Problem 6.3. A factory produces light bulbs, and each light bulb is tested
for quality. The probability that a light bulb is defective is p = 0.1. Let X
be a random variable that represents whether a randomly selected light bulb is
defective (1 if defective, 0 if not defective).

1. What is the probability that a randomly selected light bulb is defective?


2. What is the probability that a randomly selected light bulb is not defective?
3. Compute the expected value and variance of X.

Solution
Let X follow a Bernoulli distribution with parameter p = 0.1, i.e., X ∼
Bernoulli(0.1).


1. The probability that a randomly selected light bulb is defective is given


by P (X = 1):
P (X = 1) = p = 0.1

2. The probability that a randomly selected light bulb is not defective is


given by P (X = 0):

P (X = 0) = 1 − p = 1 − 0.1 = 0.9

3. For a Bernoulli random variable X with parameter p:


• The expected value E[X] is:

E[X] = p = 0.1

• The variance Var(X) is:

Var(X) = p(1 − p) = 0.1 · (1 − 0.1) = 0.1 · 0.9 = 0.09


6.2.7 Applications
The Bernoulli distribution is used to model binary outcomes in various scenar-
ios, such as:
• Quality Control in Manufacturing
In manufacturing, the Bernoulli distribution is used to model the prob-
ability of a defect in a production process. For example, a factory pro-
ducing electronic components might use the Bernoulli distribution to

determine the likelihood that a randomly selected component is defec-


tive.

• Clinical Trials in Medicine


The Bernoulli distribution can model the outcome of a clinical trial
for a new drug, where X = 1 represents a successful treatment (e.g.,
patient recovery) and X = 0 represents an unsuccessful treatment.
This helps in estimating the effectiveness of the drug.

• A/B Testing in Marketing


In digital marketing, A/B testing is used to compare two versions of
a webpage or advertisement. The Bernoulli distribution models the
probability of a user clicking on an ad or making a purchase, where
X = 1 indicates a click or purchase and X = 0 indicates no click or
purchase.


• Sports Performance Analysis


The Bernoulli distribution can be applied to model the probability of a
successful outcome in sports, such as a basketball player making a free
throw or a soccer player scoring a penalty kick. Here, X = 1 represents
a successful attempt, and X = 0 represents a failure.

• Insurance Risk Assessment


In insurance, the Bernoulli distribution is used to model the occurrence
of certain events, such as accidents or claims. For instance, X = 1 could
represent a policyholder filing a claim within a year, and X = 0 could
represent no claim filed.

• Genetics
The Bernoulli distribution is used in genetics to model the inheritance

of a particular gene. For example, X = 1 might represent the presence
of a specific gene in an offspring, and X = 0 represents its absence,
assuming a certain probability of inheritance.

6.2.8 Python Code for Bernoulli Distribution
In Python, we can calculate various characteristics of the Bernoulli distribu-
tion using the ‘scipy.stats‘ module. Below, we detail how to compute the
Probability Mass Function, Cumulative Distribution Function, Mean (Expected
Value), Variance, and Probability Generating Function, etc.

Python Code
Here’s how we can compute these characteristics using Python:
DR
1 import numpy as np
2 from scipy . stats import bernoulli
3

4 # Define the parameter p


5 p = 0.25
6

7 # Bernoulli distribution
8 dist = bernoulli ( p )
9

10 # 1. Probability Mass Function ( PMF )


11 x_values = [0 , 1] # Possible values for a Bernoulli
random variable
12 pmf_values = dist . pmf ( x_values )
13 print ( " PMF values for x = 0 and x = 1: " , pmf_values )
14

15 # 2. Cumulative Distribution Function ( CDF )


16 cdf_values = dist . cdf ( x_values )
17 print ( " CDF values for x = 0 and x = 1: " , cdf_values )

233
CHAPTER 6. SOME DISCRETE PROBABILITY DISTRIBUTIONS

18

19 # 3. Mean ( Expected Value )


20 mean = dist . mean ()
21 print ( " Mean ( Expected Value ) : " , mean )
22

23 # 4. Variance
24 variance = dist . var ()
25 print ( " Variance : " , variance )
26

27 # 5. Probability Generating Function ( PGF )


28 def pgf (t , p ) :
29 return (1 - p ) + p * t

T
30

31 # PGF values at t = 0 and t = 1


32 pgf_values = [ pgf (t , p ) for t in [0 , 1]]
33 print ( " PGF values for t = 0 and t = 1: " , pgf_values )

Explanations
• Probability Mass Function (pmf):
  ■ The pmf gives the probability of each outcome (0 or 1). Use
    dist.pmf(x_values) to compute these probabilities.

• Cumulative Distribution Function (CDF):


■ The CDF gives the cumulative probability up to each outcome.
Use dist.cdf(x_values) to compute these values.

• Mean (Expected Value):



■ The mean is simply the parameter p of the Bernoulli distribution.


Use dist.mean() to get this value.

• Variance:
■ The variance of a Bernoulli distribution is p · (1 − p). Use
dist.var() to compute this.

• Probability Generating Function (PGF):


■ The PGF is calculated using the formula GX (t) = 1 − p + p · t.
Define a function pgf(t, p) to compute PGF values for specific
t values (e.g., 0 and 1).

6.2.9 Exercises
1. A medical test for a certain disease has a 98% chance of correctly iden-
tifying a diseased person (true positive) and a 2% chance of incorrectly


identifying a healthy person as diseased (false positive). Define a ran-


dom variable X such that X = 1 if the test result is positive (either true
positive or false positive) and X = 0 if the test result is negative.

(a) What is the probability that a randomly selected test result is posi-
tive?
(b) What is the expected value (mean) of X?
(c) What is the variance of X?
(d) In a group of 50 people who took the test, what is the expected
number of positive test results?

2. A genetic trait is passed on to the next generation with a probability of
25%. Define a random variable X such that X = 1 if the trait is passed
on (success) and X = 0 if it is not (failure).

(a) What is the probability that a randomly selected offspring inherits
the trait?
AF(b) What is the expected value (mean) of X?
(c) What is the variance of X?
(d) In a group of 40 offspring, what is the expected number of offspring
that will inherit the trait?
3. In a clinical trial, a new drug is found to be effective in 85% of the pa-
tients. Define a random variable X such that X = 1 if a patient responds
positively to the drug (success) and X = 0 if not (failure).

(a) What is the probability that a randomly selected patient responds


positively to the drug?

(b) What is the expected value (mean) of X?


(c) What is the variance of X?
(d) In a sample of 20 patients, what is the expected number of patients
who will respond positively to the drug?

4. A diagnostic test has a 92% chance of correctly detecting a condition


when it is present (true positive rate) and an 8% chance of detecting the
condition when it is not present (false positive rate). Define a random
variable X such that X = 1 if the test result is positive (either true
positive or false positive) and X = 0 if the test result is negative.

(a) What is the probability that a randomly selected test result is posi-
tive?
(b) What is the expected value (mean) of X?
(c) What is the variance of X?


(d) In a group of 100 people who took the test, what is the expected
number of positive test results?

5. In a study of a certain disease, it is found that 70% of the subjects have a


particular gene variant that increases susceptibility to the disease. Define
a random variable X such that X = 1 if a subject has the gene variant
(success) and X = 0 if not (failure).

(a) What is the probability that a randomly selected subject has the
gene variant?
(b) What is the expected value (mean) of X?

(c) What is the variance of X?
(d) In a sample of 30 subjects, what is the expected number of subjects
who have the gene variant?

6.3 Binomial Distribution


Consider the simple experiment of flipping a fair coin, where the outcome can be
either “Heads” (success) or “Tails” (failure). We assign a value of 1 to “Heads”
and 0 to “Tails,” making this a Bernoulli trial. Suppose we flip the coin n times.
We are interested in modeling the probability of obtaining exactly k “Heads”
(successes) in n flips.

This problem is an example of a Binomial experiment, which consists of n


independent Bernoulli trials, each with the same probability of success p. To
calculate the probability of getting exactly k successes in n trials, consider a
specific sequence of trials where exactly k trials are successes and n − k trials
are failures. The probability of such a specific sequence is given by
DR

p^k × (1 − p)^(n−k)

where p is the probability of success and 1 − p is the probability of failure.

We need to account for all possible sequences of n trials that result in exactly
k successes. The number of ways to choose k positions for successes out of n
positions is given by the Binomial coefficient
 
C(n, k) = n! / (k!(n − k)!)

where C(n, k), read "n choose k", represents the number of combinations of n items taken k at a time.

The total probability of having exactly k successes in n trials is the product


of the probability of any specific sequence and the number of such sequences.
Thus, the probability mass function of the Binomial distribution is


 
P(X = k) = C(n, k) p^k (1 − p)^(n−k)   for k = 0, 1, 2, . . . , n,
where X is the random variable representing the number of successes.
The Binomial distribution is a discrete probability distribution. This distri-
bution has the following conditions:

1. Fixed Number of Trials: The experiment is repeated a fixed number


of times, denoted as n.
2. Two Possible Outcomes: Each trial results in one of two outcomes,
often referred to as “success” and “failure.”
3. Constant Probability: The probability of success in each trial is con-
stant and denoted by p. Consequently, the probability of failure is 1 − p.

4. Independence: The outcome of one trial is independent of the outcomes
of other trials.

If these conditions are met, the number of successes in n trials follows a


Binomial distribution with parameters n and p.

Definition: A random variable X that represents the number of successes


in n independent Bernoulli trials, each with a probability of success p, is
said to follow a Binomial distribution if its probability mass function (pmf)
is given by:
 
P(X = k) = C(n, k) p^k (1 − p)^(n−k)   for k = 0, 1, 2, . . . , n,

where C(n, k) is the Binomial coefficient. We write this as

X ∼ B(n, p).

Figure 6.1: Graphical Presentation of pmf of Binomial Distribution.


The graphical presentation of the Binomial distribution for n = 10 and


p = 0.25, p = 0.5, and p = 0.75 is depicted in Figure 6.1. The Python code
used to generate these figures is provided below.
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import binom

# Parameters
n = 10
p = 0.5

# Values
k = np.arange(0, n + 1)
pmf = binom.pmf(k, n, p)

# Plot
plt.bar(k, pmf, color='blue', edgecolor='black')
plt.xlabel('Number of Successes')
plt.ylabel('Probability')
plt.title('Binomial PMF (n=10, p=0.5)')
plt.xticks(k)
plt.grid(True)
plt.show()

Symmetric Binomial Distributions A B(n, 0.5) distribution is a symmet-


ric probability distribution for any value of the parameter n. The distribution
is symmetric about the expected value n/2.

6.3.1 Expected Value



The expected value E[X] is


E[X] = Σ_{k=0}^n k · P(X = k)
     = Σ_{k=0}^n k · C(n, k) p^k (1 − p)^(n−k)
     = Σ_{k=1}^n n · C(n−1, k−1) p^k (1 − p)^(n−k)     [since k · C(n, k) = n · C(n−1, k−1)]
     = n p Σ_{k=1}^n C(n−1, k−1) p^(k−1) (1 − p)^(n−k)


Using the binomial expansion of (p + (1 − p))^(n−1), we have

Σ_{k=1}^n C(n−1, k−1) p^(k−1) (1 − p)^(n−k) = 1

Thus,

E(X) = n · p · 1 = np
Therefore, the expected value of X is

E(X) = np

Alternatively:
By recognizing that X can be written as the sum of n independent Bernoulli
trials Xi
X = X1 + X2 + · · · + Xn
Each Xi ∼ Bernoulli(p) with
AF E[Xi ] = p
Using the linearity of expectation, we have
E[X] = E[X1 + X2 + · · · + Xn ] = E[X1 ] + E[X2 ] + · · · + E[Xn ] = n · p
Thus, the expected value E[X] is
E[X] = np

6.3.2 Variance and Standard Deviation


Since X1, X2, . . . , Xn are independent random variables, we can use Theorem
5.3. Thus, the variance Var(X) can be computed as follows

σ^2 = Var(X) = Var(X1 + X2 + · · · + Xn)
             = Var(X1) + Var(X2) + · · · + Var(Xn)
             = p(1 − p) + p(1 − p) + · · · + p(1 − p)
             = np(1 − p).

The standard deviation σ is

σ = √Var(X) = √(np(1 − p))

Properties

• Mean: The expected value (mean) of a Binomial random variable X is given by

  E(X) = np

• Variance: The variance of a Binomial random variable X is given by

  Var(X) = np(1 − p)

• Standard Deviation: The standard deviation of a Binomial random variable X is

  σ = √(np(1 − p))

6.3.3 Example
Let us consider a biased coin where the probability of getting “Heads” is p = 0.7.
Suppose we flip this coin n = 10 times. The random variable X representing the

number of “Heads” in 10 flips follows a Binomial distribution with parameters
n = 10 and p = 0.7.
The pmf of X is
 
P(X = k) = C(10, k) (0.7)^k (0.3)^(10−k)   for k = 0, 1, 2, . . . , 10.
The mean and variance of X are

E(X) = 10 × 0.7 = 7

Var(X) = 10 × 0.7 × (1 − 0.7) = 2.1


Thus, we can model and analyze the number of “Heads” in 10 coin flips
using the Binomial distribution.
Problem 6.4. A factory produces light bulbs, and 5% of them are defective.
Suppose a quality control inspector randomly selects 8 bulbs for testing.

(a). What is the probability that exactly 2 of the 8 bulbs are defective?
(b). What is the probability that at most 2 bulbs are defective?
(c). What is the probability that more than 1 bulb is defective?
(d). What is the probability that at least 1 bulb is defective?
(e). What is the expected number of defective bulbs in the sample of 8?
(f ). What is the variance and standard deviation of the number of defective
bulbs in the sample?

Solution
Let X be the number of defective bulbs in the sample. Here, X follows a
binomial distribution with parameters n = 8 and p = 0.05. The pmf of X is
 
P(X = k) = C(8, k) (0.05)^k (0.95)^(8−k)   for k = 0, 1, 2, . . . , 8.


(a). Probability of Exactly 2 Defective Bulbs The probability of exactly


2 defective bulbs is
 
P(X = 2) = C(8, 2) (0.05)^2 (0.95)^6 = 28 × (0.05)^2 × (0.95)^6 ≈ 0.0515

Thus, the probability that exactly 2 bulbs are defective is approximately 0.0515.

(b). Probability of At Most 2 Defective Bulbs

P(X ≤ 2) = P(X = 0) + P(X = 1) + P(X = 2)

Compute each term:

P(X = 0) = C(8, 0) (0.05)^0 (0.95)^8 ≈ 0.6634
P(X = 1) = C(8, 1) (0.05)^1 (0.95)^7 ≈ 0.2793
P(X = 2) = C(8, 2) (0.05)^2 (0.95)^6 ≈ 0.0515   (from part (a))

Hence,

P(X ≤ 2) ≈ 0.6634 + 0.2793 + 0.0515 = 0.9942

Thus, the probability that at most 2 bulbs are defective is approximately 0.9942.
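The remaining parts of Problem 6.4 follow the same pattern, and all six answers can be checked with scipy.stats.binom, as in the sketch below:

from scipy.stats import binom

dist = binom(8, 0.05)  # n = 8 bulbs, defect probability p = 0.05

print(dist.pmf(2))             # (a) P(X = 2)  ~ 0.0515
print(dist.cdf(2))             # (b) P(X <= 2) ~ 0.9942
print(dist.sf(1))              # (c) P(X > 1)  = 1 - P(X <= 1)
print(1 - dist.pmf(0))         # (d) P(X >= 1) = 1 - P(X = 0)
print(dist.mean())             # (e) E(X) = np = 0.4
print(dist.var(), dist.std())  # (f) variance 0.38 and standard deviation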

Problem 6.5. A pharmaceutical company is testing a new drug to see if it


improves recovery rates. The probability that a patient responds positively to
the drug is p = 0.25. Suppose the company tests the drug on 10 patients. Let
X be the number of patients who respond positively to the drug.

(a). What is the probability that exactly 3 patients respond positively?


(b). What is the probability that at most 3 patients respond positively?
(c). Calculate the expected number of patients who respond positively and the
variance of X.
(d). If Y represents the number of patients who do not respond positively, find
the probability distribution of Y and its mean and variance.


Solution
Let X be the number of patients who respond positively out of 10 patients, with
each patient responding positively with probability p = 0.25. Then X follows
a binomial distribution whose pmf is

P(X = k) = C(10, k) (0.25)^k (0.75)^(10−k)   for k = 0, 1, 2, . . . , 10.

(a). The probability of exactly 3 patients responding positively is given by


the binomial probability mass function
 

P(X = 3) = C(10, 3) (0.25)^3 (0.75)^7
         = 120 × (0.25)^3 × (0.75)^7 ≈ 120 × 0.015625 × 0.1335
         ≈ 0.2503

Therefore, the probability that exactly 3 patients respond positively is approx-


AF
imately 0.2503.

(b). The probability that at most 3 patients respond positively is the cu-
mulative probability P (X ≤ 3), which is the sum of the probabilities for
X = 0, 1, 2, 3. Therefore,

P(X ≤ 3) = P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3)
         = C(10, 0)(0.25)^0(0.75)^10 + C(10, 1)(0.25)^1(0.75)^9 + C(10, 2)(0.25)^2(0.75)^8
           + 0.2503   (from part (a))
         ≈ 0.0563 + 0.1877 + 0.2816 + 0.2503 = 0.7760

Therefore, the probability that at most 3 patients respond positively is approx-


imately 0.7760.

(c). For a binomial distribution X ∼ B(n, p), the expected value and variance
are given by

E(X) = np and Var(X) = np(1 − p)


Substituting n = 10 and p = 0.25,

E(X) = 10 × 0.25 = 2.5


Var(X) = 10 × 0.25 × 0.75 = 1.875
Therefore, the expected number of patients who respond positively is 2.5
and the variance is 1.875.


(d). Since Y represents the number of patients who do not respond positively,
we can express Y = 10 − X. The number of patients who do not respond
positively follows a binomial distribution Y ∼ B(n = 10, p = 0.75).
The probability mass function of Y is
 
P(Y = k) = C(10, k) (0.75)^k (0.25)^(10−k),   k = 0, 1, 2, . . . , 10
The expected value and variance of Y are

E(Y ) = np = 10 × 0.75 = 7.5


Var(Y ) = np(1 − p) = 10 × 0.75 × 0.25 = 1.875
Therefore, the probability distribution of Y is binomial with parameters n = 10
and p = 0.75, and the mean and variance are E(Y ) = 7.5 and Var(Y ) = 1.875.

6.3.4 Python Code for Binomial Distribution
In Python, you can compute various characteristics of the Binomial distribution
using the ‘scipy.stats‘ module. Below is a demonstration of how to compute
these characteristics.

Python Code
Here’s how you can calculate various characteristics of a Binomial distribution:

import numpy as np
from scipy.stats import binom

# Define parameters
n = 10    # number of trials
p = 0.25  # probability of success

# Binomial distribution
dist = binom(n, p)

# 1. Probability Mass Function (PMF)
x_values = np.arange(0, n + 1)  # possible values for the random variable
pmf_values = dist.pmf(x_values)
print("PMF values for x = 0 to 10:", pmf_values)

# 2. Cumulative Distribution Function (CDF)
cdf_values = dist.cdf(x_values)
print("CDF values for x = 0 to 10:", cdf_values)

# 3. Mean (Expected Value)
mean = dist.mean()
print("Mean (Expected Value):", mean)

# 4. Variance
variance = dist.var()
print("Variance:", variance)

# 5. Probability Generating Function (PGF)
def pgf(t, n, p):
    return (1 - p + p * t) ** n

# PGF values at t = 0 and t = 1
pgf_values = [pgf(t, n, p) for t in [0, 1]]
print("PGF values for t = 0 and t = 1:", pgf_values)

Explanations
• Probability Mass Function (pmf ):
AF ■ The pmf provides the probability of each number of successes.
Compute these probabilities using dist.pmf(x_values).

• Cumulative Distribution Function (CDF):


■ The CDF provides the cumulative probability up to a certain
number of successes. Compute these values using dist.cdf(x_values).

• Mean (Expected Value):


■ The mean is given by n · p. Compute this using dist.mean().


• Variance:
■ The variance is given by n · p · (1 − p). Compute this using
dist.var().

• Probability Generating Function (PGF):


■ The PGF is given by GX (t) = (1 − p + p · t)n . Define a function
pgf(t, n, p) to compute PGF values for specific t values.

6.3.5 Exercises
1. Define the Binomial distribution. Include the conditions that must be
met for a random variable to follow a Binomial distribution.
2. Prove that the expected value (mean) of a Binomially distributed random
variable X with parameters n and p is E(X) = np. Also, prove that the
variance of X is given by Var(X) = np(1 − p).


3. Prove that the sum of the probabilities of all possible outcomes of a Bi-
nomial random variable X with parameters n and p is equal to 1. That
is, show that
Σ_{k=0}^n C(n, k) p^k (1 − p)^(n−k) = 1.

4. A fair coin is flipped 5 times. What is the probability of getting exactly


3 heads?
5. In a class, the probability that a student passes an exam is 0.8. If 15
students are randomly selected, calculate the expected number of students
who pass the exam.

T
6. Show that C(n, k) = C(n, n − k), and use this property to demonstrate
that the Binomial coefficients for a given n are symmetric around k = n/2.
AF
7. Find the moment generating function (MGF) of the Binomial distribution
and use it to compute the first two moments (mean and variance).
8. A factory produces widgets, with a 1% defect rate. If a sample of 50 wid-
gets is tested, what is the probability that at least 3 widgets are defective?
9. A medical test has a 95% sensitivity and a 90% specificity. If 10 patients
are tested who all have the disease, calculate the probability that exactly
9 patients test positive.
10. A soccer player has a 60% chance of scoring a goal in a penalty kick. If the
player takes 12 penalty kicks, find the probability that the player scores
more than 8 goals.
DR

11. In a marketing campaign, the probability of converting a lead into a cus-


tomer is 10%. If 25 leads are contacted, find the probability of converting
at most 5 leads.
12. In a clinical trial, the number of successes (patients who show improve-
ment) follows a Binomial distribution with n = 10 and p = 0.7.
(a) What is the probability of exactly 7 successes?
(b) What is the probability of at least 8 successes?
(c) Find the expected number of successes.
13. Prove the following Binomial coefficient identity
     
C(n, k) + C(n, k−1) = C(n+1, k).
Use a combinatorial argument or algebraic manipulation to justify this
identity.


6.4 Poisson Distribution


Imagine you are managing a small coffee shop. You notice that, on average, 4
customers come into the shop every hour. You are interested in understanding
how likely it is to see a specific number of customers in the shop during a given
hour. For example, what is the probability of having exactly 2 customers or
exactly 6 customers in an hour? In this scenario, the number of customers X
arriving at the coffee shop in an hour follows a Poisson distribution with
parameter λ = 4, where λ represents the average rate of customers per hour.
The pmf for a Poisson distribution is given by

P(X = k) = λ^k e^(−λ) / k!,   k = 0, 1, 2, . . .
where
• k is the number of events (the number of customers),

• λ is the average rate (4 customers per hour),


• e is the base of the natural logarithm (approximately equal to 2.71828).
In this example, we consider λ = 4.
• For exactly 2 customers (k = 2),
P(X = 2) = 4^2 e^(−4) / 2! = 8e^(−4) ≈ 0.1465
Hence, the probability of having exactly 2 customers in an hour is
approximately 0.1465.

• For exactly 6 customers (k = 6),


P(X = 6) = 4^6 e^(−4) / 6! = 4096 e^(−4) / 720 ≈ 0.1042
That is, the probability of having exactly 6 customers in an hour is
approximately 0.1042.
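Both values can be reproduced in a couple of lines with scipy.stats.poisson (a quick check of the hand calculations above):

from scipy.stats import poisson

lam = 4  # average number of customers per hour

print(poisson.pmf(2, mu=lam))  # ~0.1465, exactly 2 customers
print(poisson.pmf(6, mu=lam))  # ~0.1042, exactly 6 customers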
The Poisson distribution is a discrete probability distribution that models
the number of events occurring within a fixed interval of time or space,
given the following conditions:
(i). Each event happens independently of the others. (For example, if
we’re modeling the number of customers arriving at a coffee shop, the
number of customers arriving in one hour does not affect the number
of customers arriving in the next hour.)

(ii). The average rate (mean number of events) λ is constant over time or
space. This means that the expected number of events in any interval
of the same length is the same.


(iii). The events are relatively rare in the given interval. More specifically,
the probability of more than one event occurring in a very short interval
is negligible.

(iv). The number of events can only be whole numbers (0, 1, 2, ...). You
can’t have fractional events.

(v). The number of events is counted over a fixed interval of time or space.
The intervals are non-overlapping, meaning events in different intervals
do not influence each other.
When these conditions are met, the number of events occurring in a fixed in-

terval follows a Poisson distribution. Observe that the series expansion of
e^λ guarantees that the total probability sums to 1, as shown below:

Σ_{k=0}^∞ P(X = k) = e^(−λ) Σ_{k=0}^∞ λ^k / k! = e^(−λ) (λ^0/0! + λ^1/1! + λ^2/2! + λ^3/3! + · · ·)
                   = e^(−λ) · e^λ = 1
Additionally, for a random variable X that follows a Poisson distribution
with parameter λ (denoted X ∼ P(λ)), it holds that

E(X) = Var(X) = λ

Definition: A random variable X is said to follow a Poisson distribution


with parameter λ, written as X ∼ P(λ), if its probability mass function is
given by
P(X = k) = e^(−λ) λ^k / k!   for k = 0, 1, 2, 3, . . . .
The Poisson distribution is particularly useful for modeling the number of
occurrences of a certain event within a specified unit of time, distance, or
volume, with both its mean and variance equal to λ.

Problem 6.6. In a small coffee shop, customers arrive according to a Poisson


distribution. The average number of customers arriving in an hour is 2. You
are tasked with analyzing the pmf of this distribution, as well as comparing it
to another Poisson distribution with an average of 5 customers per hour.

(i). Calculate the pmf of a Poisson random variable with λ = 2 for integer
values of k from 0 to 10.
(ii). Generate a plot to illustrate the pmf for λ = 2.
(iii). Discuss how the pmf and cumulative distribution functions for both λ = 2
and λ = 5 compare, particularly in terms of expected value and variance.


Solution
(i). The pmf of a Poisson random variable with parameter λ = 2 is given by

P(X = k) = e^(−2) · 2^k / k!,   k = 0, 1, 2, . . .
(ii). We plot this pmf for integer values of k from 0 to 10 in Figure 6.2.


Figure 6.2: Probability Mass Function of a Poisson Distribution with λ = 2.

(iii). Figures 6.2 and 6.3 compare the probability mass functions and cumula-
tive distribution functions of Poisson distributions with parameters λ = 2 and
λ = 5. These figures demonstrate that, given that the mean and variance of
a Poisson distribution both equal the parameter value, a distribution with a
higher parameter value will have a greater expected value and exhibit a wider
spread.

The Poisson distribution is used to model the number of rare events oc-
curring within a fixed interval of time or space, under the assumption that these
events occur independently and at a constant average rate. This distribution
helps answer questions about the likelihood of observing a specific number of
events given an average rate, such as predicting the number of genetic mutations
in bacterial cultures or call arrivals in a call center.
Problem 6.7. A quality inspector at a glass manufacturing company checks
each glass sheet for imperfections. Suppose the number of flaws in each sheet



Figure 6.3: Probability Mass Function of a Poisson Distribution with λ = 5.


follows a Poisson distribution with a parameter λ = 0.5, which indicates that
the expected number of flaws per sheet is 0.5.

(a). Determine the probability that a glass sheet has no flaws.


(b). Sheets with two or more flaws are scrapped by the company. Estimate the
percentage of glass sheets that need to be scrapped and recycled.

Solution
(a). Probability of No Flaws: The probability that a glass sheet has no
flaws (X = 0) is given by

P(X = 0) = e^(−0.5) · 0.5^0 / 0! = e^(−0.5) ≈ 0.607
Thus, approximately 61% of the glass sheets are in “perfect” condition.

(b). Probability of Two or More Flaws: The probability of having two


or more flaws (X ≥ 2) can be computed as

P (X ≥ 2) = 1 − P (X = 0) − P (X = 1)

where

P(X = 1) = e^(−0.5) · 0.5^1 / 1! = e^(−0.5) · 0.5 ≈ 0.305
Therefore,

P (X ≥ 2) = 1 − e−0.5 − (e−0.5 · 0.5) ≈ 1 − 0.607 − 0.305 = 0.090


Hence, about 9% of the glass sheets have two or more flaws and need to be
scrapped.
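A quick check of both answers with scipy.stats.poisson:

from scipy.stats import poisson

lam = 0.5  # expected number of flaws per sheet

print(poisson.pmf(0, lam))  # (a) P(X = 0) ~ 0.607
print(poisson.sf(1, lam))   # (b) P(X >= 2) = P(X > 1) ~ 0.090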

Problem 6.8. A researcher is studying the occurrence of a rare genetic muta-


tion in a population of individuals. The number of individuals with the mutation
in a randomly selected sample of 100 individuals follows a Poisson distribution
with a parameter λ = 2.

(a). Calculate the probability that exactly one individual in the sample has the
mutation.

(b). Determine the probability that at least three individuals in the sample have
the mutation.
(c). Estimate the percentage of samples in which two or more individuals are
expected to have the mutation.
Solution
(a). Probability of Exactly One Individual with the Mutation
The probability that exactly one individual has the mutation is given by

P(X = 1) = e^(−2) · 2^1 / 1! = 2e^(−2) ≈ 0.2707

(b). Probability of At Least Three Individuals with the Mutation


The probability of at least three individuals having the mutation is

P (X ≥ 3) = 1 − P (X < 3) = 1 − [P (X = 0) + P (X = 1) + P (X = 2)]
DR

where

P(X = 0) = e^(−2) ≈ 0.1353
P(X = 1) = 2e^(−2) ≈ 0.2707
P(X = 2) = 2^2 · e^(−2) / 2! ≈ 0.2707
Therefore,

P (X ≥ 3) = 1 − (0.1353 + 0.2707 + 0.2707) = 1 − 0.6767 = 0.3233

(c). Percentage of Samples with Two or More Individuals Having the Mutation
To estimate the percentage of samples where two or more individuals have
the mutation:

P (X ≥ 2) = 1 − P (X < 2) = 1 − [P (X = 0) + P (X = 1)]


Thus,

P (X ≥ 2) = 1 − (0.1353 + 0.2707) = 1 − 0.4060 = 0.5940

The percentage is
0.5940 × 100% = 59.40%

6.4.1 Expected Value


The expected value E(X) of a discrete random variable X is defined as

E(X) = Σ_{x=0}^∞ x · P(X = x)
     = Σ_{x=0}^∞ x · λ^x e^(−λ) / x!
     = 0 · λ^0 e^(−λ)/0! + Σ_{x=1}^∞ x · λ^x e^(−λ)/x!
     = Σ_{x=1}^∞ λ^x e^(−λ) / (x − 1)!.

To adjust the index, let x′ = x − 1. Then, when x starts from 1, x′ starts
from 0. Rewriting the sum in terms of x′,

E(X) = e^(−λ) Σ_{x′=0}^∞ λ^(x′+1) / x′!
     = λ e^(−λ) Σ_{x′=0}^∞ λ^(x′) / x′!.

Recognize that the sum is the Taylor series expansion of e^λ,

Σ_{x′=0}^∞ λ^(x′) / x′! = e^λ

Thus,

E(X) = λ e^(−λ) · e^λ = λ


6.4.2 Variance
To find the variance, we first need to calculate E(X^2). We use

E(X^2) = Σ_{x=0}^∞ x^2 · P(X = x)
       = Σ_{x=0}^∞ x^2 · λ^x e^(−λ) / x!
       = e^(−λ) Σ_{x=0}^∞ x · λ^x / (x − 1)!
       = Σ_{x=1}^∞ x λ^x e^(−λ) / (x − 1)!.

Change the index of summation. Let x′ = x − 1. Therefore, x = x′ + 1, and
when x starts from 1, x′ starts from 0. Rewriting the sum,

E(X^2) = e^(−λ) Σ_{x′=0}^∞ (x′ + 1) λ^(x′+1) / x′!
       = e^(−λ) ( Σ_{x′=0}^∞ x′ λ^(x′+1) / x′! + Σ_{x′=0}^∞ λ^(x′+1) / x′! )
       = e^(−λ) ( λ Σ_{x′=0}^∞ x′ λ^(x′) / x′! + λ Σ_{x′=0}^∞ λ^(x′) / x′! )
       = e^(−λ) (λ · λe^λ + λe^λ)
       = λ^2 + λ.

Now, using the definition of variance,

V(X) = E(X^2) − (E(X))^2
     = (λ^2 + λ) − λ^2
     = λ.


6.4.3 Moment Generating Function


The moment generating function (MGF) M_X(t) of X is defined as

M_X(t) = E[e^(tX)] = Σ_{x=0}^∞ e^(tx) · P(X = x)
       = e^(−λ) Σ_{x=0}^∞ (e^t λ)^x / x!
       = e^(−λ) · e^(e^t λ)
       = e^(λ(e^t − 1)).

Thus, the moment generating function M_X(t) of a Poisson random variable
X with parameter λ is

M_X(t) = e^(λ(e^t − 1)).

6.4.4 Characteristic Function


The characteristic function ϕ_X(t) of X is defined as

ϕ_X(t) = E[e^(itX)] = Σ_{x=0}^∞ e^(itx) · P(X = x)
       = e^(−λ) Σ_{x=0}^∞ (e^(it) λ)^x / x!
       = e^(−λ) · e^(e^(it) λ)
       = e^(λ(e^(it) − 1)).

Thus, the characteristic function ϕ_X(t) of a Poisson random variable X with
parameter λ is

ϕ_X(t) = e^(λ(e^(it) − 1)).
ϕX (t) = eλ(e −1) .

6.4.5 Approximation of Binomial Distribution Using Poisson Distribution

In statistical theory, the Poisson distribution can be used as an approximation


to the Binomial distribution under certain conditions. Specifically, when dealing
with scenarios where the number of trials n is very large and the probability
of success p is very small, while the product np (which represents the mean of
the Binomial distribution) remains constant, the Binomial distribution can be
approximated by the Poisson distribution.
Theorem 6.1. Approximation of Binomial Distribution by Poisson
Distribution: Let X be a Binomial random variable with probability distri-
bution B(x; n, p). When n → ∞, p → 0, and np → λ remains constant, the

253
CHAPTER 6. SOME DISCRETE PROBABILITY DISTRIBUTIONS

Binomial distribution B(x; n, p) converges to the Poisson distribution P(x; λ) as
n → ∞. That is,

B(x; n, p) → P(x; λ)
n→∞
B(x; n, p) −−−−→ P (x; λ)
where,
λ = np
Problem 6.9. A box contains 500 electrical switches, each with a probability of
0.005 of being defective. What is the probability that the box contains no more
than 3 defective switches?
(i) Use the Binomial Distribution.

(ii) Use the Poisson Distribution to make an approximation.

(iii) Compare the results.

Solution:
AF
Given, total number of switches n = 500, and probability of a switch being
defective p = 0.005. Let a random variable X represents the number of defective
switches. We are required to find P (X ≤ 3), i.e., the probability that the
number of defective switches is no more than 3.

(i). Using Binomial Distribution: The probability mass function for the
Binomial Distribution is given by

\[
P(X = k) = \binom{500}{k} (0.005)^k (1 - 0.005)^{500-k} \quad \text{for } k = 0, 1, 2, \ldots, 500.
\]

We need to calculate P(X ≤ 3).

\[
P(X \le 3) = P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3)
\]

We have

\[
P(X = 0) = \binom{500}{0} (0.005)^0 (0.995)^{500} \approx 0.08157,
\qquad
P(X = 1) = \binom{500}{1} (0.005)^1 (0.995)^{499} \approx 0.20495,
\]
\[
P(X = 2) = \binom{500}{2} (0.005)^2 (0.995)^{498} \approx 0.25697,
\qquad
P(X = 3) = \binom{500}{3} (0.005)^3 (0.995)^{497} \approx 0.21435.
\]
Hence,


P (X ≤ 3) ≈ 0.08157 + 0.20495 + 0.25697 + 0.21435 = 0.75784


Thus, the probability that the box contains no more than 3 defective switches
using the binomial distribution is approximately 0.7578, or 75.78%.

(ii). Using Poisson Distribution (Approximation) For n = 500 and


p = 0.005, the parameter λ for the Poisson distribution is
λ = np = 500 × 0.005 = 2.5
The probability that the number of defective switches X is at most 3 is

\[
P(X \le 3) = P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3),
\]

where X follows a Poisson distribution with parameter λ = 2.5. The probability
mass function of the Poisson distribution is

\[
P(X = k) = \frac{e^{-2.5}\, 2.5^k}{k!} \quad \text{for } k = 0, 1, 2, \ldots
\]

Therefore,

\[
P(X = 0) = \frac{e^{-2.5} \cdot 2.5^0}{0!} = e^{-2.5} \approx 0.08208,
\qquad
P(X = 1) = \frac{e^{-2.5} \cdot 2.5^1}{1!} = 2.5 e^{-2.5} \approx 0.20521,
\]
\[
P(X = 2) = \frac{e^{-2.5} \cdot 2.5^2}{2!} \approx 0.25652,
\qquad
P(X = 3) = \frac{e^{-2.5} \cdot 2.5^3}{3!} \approx 0.21376.
\]

Hence,

\[
P(X \le 3) = 0.08208 + 0.20521 + 0.25652 + 0.21376 \approx 0.7576.
\]
The probability that the box contains no more than 3 defective switches is
approximately 0.7576 or 75.76%.

(iii). Comparison: From Part (i) and Part (ii), we have

• Binomial result: 75.78%

• Poisson result: 75.76%

The Poisson approximation is extremely close to the exact binomial result,
differing by only 0.02 percentage points, while requiring far less computation.
This demonstrates its effectiveness for modeling rare events when n is large
and p is small.
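
The comparison can be reproduced in a few lines with scipy.stats, using the cumulative distribution functions of both models:

from scipy.stats import binom, poisson

n, p = 500, 0.005
lam = n * p  # 2.5

print(binom.cdf(3, n, p))    # exact binomial P(X <= 3), approximately 0.7578
print(poisson.cdf(3, lam))   # Poisson approximation, approximately 0.7576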


6.4.6 Python Code for Poisson Distribution


The Poisson distribution models the number of occurrences of an event in a
fixed interval of time or space, given a constant mean rate of occurrence. In
Python, you can calculate various characteristics of the Poisson distribution
using the scipy.stats module. Below is a demonstration of how to compute
these characteristics.

Python Code
Here's how you can calculate various characteristics of the Poisson distribution:

import numpy as np
from scipy.stats import poisson

# Define the parameter lambda (mean rate of occurrence)
lambda_ = 4

# Poisson distribution
dist = poisson(mu=lambda_)

# 1. Probability Mass Function (PMF)
x_values = np.arange(0, 11)  # Possible values for the random variable
pmf_values = dist.pmf(x_values)
print("PMF values for x = 0 to 10:", pmf_values)

# 2. Cumulative Distribution Function (CDF)
cdf_values = dist.cdf(x_values)
print("CDF values for x = 0 to 10:", cdf_values)

# 3. Mean (Expected Value)
mean = dist.mean()
print("Mean (Expected Value):", mean)

# 4. Variance
variance = dist.var()
print("Variance:", variance)

# 5. Probability Generating Function (PGF)
# For a Poisson random variable, G_X(t) = E[t^X] = exp(lambda * (t - 1)).
def pgf(t, lambda_):
    return np.exp(lambda_ * (t - 1))

# PGF values at t = 0 and t = 1
# (sanity checks: pgf(0) = P(X = 0) = e^{-lambda}, and pgf(1) = 1)
pgf_values = [pgf(t, lambda_) for t in [0, 1]]
print("PGF values for t = 0 and t = 1:", pgf_values)


Explanations

• Probability Mass Function (PMF):

  ■ The PMF provides the probability of observing a certain number
    of events. Compute these probabilities using dist.pmf(x_values).

• Cumulative Distribution Function (CDF):

  ■ The CDF provides the cumulative probability up to each number
    of events. Compute these cumulative probabilities using
    dist.cdf(x_values).

• Mean (Expected Value):

  ■ The mean of a Poisson distribution is λ. Compute this using
    dist.mean().

• Variance:

  ■ The variance of a Poisson distribution is also λ. Compute this
    using dist.var().

• Probability Generating Function (PGF):

  ■ The PGF of a Poisson random variable is G_X(t) = E[t^X] =
    exp(λ(t − 1)). Define a function pgf(t, lambda_) to compute
    PGF values for specific t values (e.g., 0 and 1).

6.4.7 Exercises
1. The number of patients arriving at a clinic follows a Poisson distribution

with a mean of 3 patients per hour.


(a) What is the probability that exactly 5 patients will arrive in an hour?
(b) What is the probability that at most 2 patients will arrive in an
hour?
(c) What is the expected number of patients arriving in 3 hours?
2. A call center receives an average of 10 calls per hour. Assume the number
of calls follows a Poisson distribution.
(a) What is the probability that the call center receives exactly 12 calls
in an hour?
(b) Calculate the probability of receiving fewer than 5 calls in an hour.
(c) Determine the probability of receiving more than 15 calls in an hour.
3. On a particular stretch of highway, the number of traffic accidents follows
a Poisson distribution with an average rate of 3 accidents per month.


(a) What is the probability of exactly 2 accidents occurring in a month?


(b) Find the probability that there will be no accidents in a given month.
(c) Calculate the probability of having 4 or more accidents in a month.
4. A rare disease affects an average of 0.2 patients per 1000 individuals in a
population. Assume the number of affected individuals follows a Poisson
distribution.
(a) What is the probability of finding exactly 1 patient with the disease
in a sample of 1000 individuals?
(b) Determine the probability of finding no patients with the disease in

a sample of 1000 individuals.
(c) Find the probability of discovering 2 or more patients with the dis-
ease in a sample of 1000 individuals.
5. An employee receives an average of 8 emails per day. Assume the number
of emails follows a Poisson distribution.
AF (a) What is the probability of receiving exactly 10 emails in a day?
(b) Calculate the probability of receiving fewer than 6 emails in a day.
(c) Determine the probability of receiving more than 12 emails in a day.
6. A retail store has an average of 20 customers arriving per hour. The
number of customer arrivals follows a Poisson distribution.
(a) What is the probability that exactly 25 customers will arrive in an
hour?
(b) Find the probability that fewer than 15 customers arrive in an hour.

(c) Calculate the probability of having 30 or more customers in an hour.

7. A factory produces an average of 2 defective items per 1000 items pro-


duced. Assume the number of defective items follows a Poisson distribu-
tion.
(a) What is the probability of finding exactly 3 defective items in a batch
of 1000 items?
(b) Determine the probability of finding no defective items in a batch of
1000 items.
(c) Calculate the probability of finding 1 or more defective items in a
batch of 1000 items.

8. In a laboratory experiment, rare events occur at an average rate of 0.4


events per hour. Assume these events follow a Poisson distribution.
(a) What is the probability of observing exactly 2 events in an hour?


(b) Find the probability of observing no events in an hour.


(c) Determine the probability of observing more than 1 event in an hour.
9. A software program encounters an average of 5 errors per week. Assume
the number of errors follows a Poisson distribution.

(a) What is the probability of encountering exactly 7 errors in a week?


(b) Calculate the probability of encountering fewer than 4 errors in a
week.
(c) Find the probability of encountering 8 or more errors in a week.

10. A library has an average of 12 book checkouts per day. The number of
checkouts follows a Poisson distribution.
(a) What is the probability of exactly 10 book checkouts in a day?
(b) Determine the probability of having 15 or more book checkouts in a
day.
AF (c) Calculate the probability of having fewer than 8 book checkouts in
a day.
11. In a small town, a rare disease has an average occurrence rate of 1 case
per month. Assume the number of cases follows a Poisson distribution.
(a) What is the probability of having exactly 2 cases of the disease in a
month?
(b) Find the probability of having no cases of the disease in a month.
(c) Calculate the probability of having at least 1 case of the disease in
a month.

12. In a certain industrial facility, accidents occur infrequently. It is known


that the probability of an accident on any given day is 0.005 and accidents
are independent of each other.

(a) What is the probability that in any given period of 400 days there
will be an accident on exactly one day?
(b) What is the probability that there are at most three days with an
accident?

13. In a manufacturing process where glass products are made, defects or


bubbles occur, occasionally rendering the piece undesirable for marketing.
It is known that, on average, 1 in every 1000 of these items produced has
one or more bubbles. What is the probability that a random sample of
8000 will yield fewer than 7 items possessing bubbles?


6.4.8 Discrete Uniform Distribution


There are two types of uniform distributions: discrete uniform distribution and
continuous uniform distribution. The discrete uniform distribution plays a
crucial role in various domains, particularly in statistics and data science. At
its core, the discrete uniform distribution describes a situation where a finite
number of outcomes are equally likely to occur. This property makes it an
essential model for understanding fundamental concepts in probability theory.

In the realm of simulations, particularly Monte Carlo simulations, the


discrete uniform distribution is employed to generate random inputs. This is

vital for modeling complex systems and assessing the behavior of different sce-
narios. Similarly, in A/B testing, where subjects are randomly assigned to dif-
ferent groups, the assumption of a uniform distribution ensures that the groups
are comparable, thereby enhancing the validity of the test results.

In a discrete uniform distribution, a finite number of outcomes are possible,


and each outcome has an equal probability of occurring. For example, if you
roll a fair six-sided die, the probability of each face showing up (1 through 6)
is the same:

\[
P(X = x) = \frac{1}{6}; \quad x = 1, 2, \ldots, 6.
\]
Definition: A random variable X is said to follow a discrete uniform
distribution on the set of integers {x_1, x_2, . . . , x_n}, denoted
X ∼ U(x_1, x_n), if its pmf is given by

\[
P(X = x) = \frac{1}{n}; \quad x \in \{x_1, x_2, \ldots, x_n\},
\]

where n is the total number of outcomes.

If the lower bound of X is zero and the upper bound is n, then the pmf
is given by

\[
P(X = x) = \frac{1}{n+1}; \quad x = 0, 1, 2, \ldots, n.
\]

The discrete uniform distribution finds applications in games of chance. Un-


derstanding the probabilities associated with games like lotteries or card games
helps players develop strategies and assess risks. Beyond these practical appli-
cations, the discrete uniform distribution is foundational for other probability
distributions. Many advanced statistical methods and techniques build upon
the principles established by the uniform distribution, making it a key compo-
nent of statistical inference.


Theorem 6.2. Let X be a random variable that follows a discrete uniform
distribution on the set {0, 1, 2, . . . , n}. Then the mean µ and variance σ² of X
are given by

\[
\mu = \frac{n}{2} \quad \text{and} \quad \sigma^2 = \frac{n(n+2)}{12}.
\]

Proof. Mean: The mean µ of X is calculated as follows:

\[
\mu = E[X] = \sum_{x=0}^{n} x \cdot P(X = x) = \frac{1}{n+1} \sum_{x=0}^{n} x.
\]

Using the formula for the sum of the first n integers, we have

\[
\sum_{x=0}^{n} x = \frac{n(n+1)}{2}.
\]

Substituting this back into the equation for µ,

\[
\mu = \frac{1}{n+1} \cdot \frac{n(n+1)}{2} = \frac{n}{2}.
\]

Variance: The variance σ² is calculated using the formula

\[
\sigma^2 = E[X^2] - (E[X])^2.
\]

First, we compute E[X²]:

\[
E[X^2] = \sum_{x=0}^{n} x^2 \cdot P(X = x) = \frac{1}{n+1} \sum_{x=0}^{n} x^2.
\]

Using the formula for the sum of the squares of the first n integers, we have

\[
\sum_{x=0}^{n} x^2 = \frac{n(n+1)(2n+1)}{6}.
\]

Substituting this into the equation for E[X²],

\[
E[X^2] = \frac{1}{n+1} \cdot \frac{n(n+1)(2n+1)}{6} = \frac{n(2n+1)}{6}.
\]

Now, we can compute the variance:

\[
\sigma^2 = \frac{n(2n+1)}{6} - \left(\frac{n}{2}\right)^2
         = \frac{2n(2n+1) - 3n^2}{12} = \frac{4n^2 + 2n - 3n^2}{12}
         = \frac{n^2 + 2n}{12} = \frac{n(n+2)}{12}.
\]

Equivalently, with N = n + 1 equally likely values, σ² = (N² − 1)/12.
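
A short numeric check of Theorem 6.2, computing the mean and variance directly from the pmf (n = 10 is an arbitrary choice):

import numpy as np

n = 10                             # arbitrary upper bound of the support {0, 1, ..., n}
x = np.arange(0, n + 1)
pmf = np.full(n + 1, 1 / (n + 1))  # each outcome has probability 1/(n+1)

mean = np.sum(x * pmf)
var = np.sum(x**2 * pmf) - mean**2
print(mean, n / 2)                 # both print 5.0
print(var, n * (n + 2) / 12)       # both print 10.0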

Problem 6.10. A company has a database of 100 customer IDs ranging from
1 to 100. The marketing team wants to select a random sample of 10 customer
IDs to send a promotional offer. Each customer ID should have an equal chance
of being selected.

(a). What is the probability of selecting any specific customer ID?

(b). If the marketing team selects 10 IDs, what is the expected number of times
a specific customer ID (e.g., ID 5) will be included in the sample?

Solution
(a). Probability of Selecting Any Specific Customer ID
Since the customer IDs are uniformly distributed, the probability mass function
(pmf) is

\[
P(X = x) = \frac{1}{n}; \quad x = 1, 2, \ldots, 100.
\]

Here, n = 100. Thus, the probability of selecting any specific customer ID
(say ID 5) is

\[
P(X = 5) = \frac{1}{100} = 0.01.
\]
(b). Expected Number of Times a Specific Customer ID Will Be
Included
When selecting 10 customer IDs from the total of 100, the probability of
selecting a specific customer ID (like ID 5) in one draw is 1/100.

To find the expected number of times a specific customer ID (ID 5) will be
included in the sample of 10 IDs, we can use the formula for the expected value

\[
E[X] = n \cdot p,
\]

where n is the number of trials (in this case, the number of IDs selected) and p
is the probability of success (selecting ID 5). Here, n = 10 and p = 1/100, so

\[
E[X] = 10 \cdot \frac{1}{100} = 0.1.
\]


Problem 6.11. A school is organizing a raffle with 50 tickets numbered from


1 to 50. Each ticket has an equal chance of being drawn.

(a). What is the probability of drawing any specific ticket number?


(b). If the raffle will draw 5 tickets, what is the expected number of times a
specific ticket (e.g., ticket number 10) will be drawn?

Solution
(a). Probability of Drawing Any Specific Ticket Number
Since the tickets are uniformly distributed, the probability mass function (pmf)
is given by

\[
P(X = x) = \frac{1}{n}; \quad x = 1, 2, \ldots, 50.
\]

Here, n = 50. Thus, the probability of drawing any specific ticket number (e.g.,
ticket number 10) is

\[
P(X = 10) = \frac{1}{50} = 0.02.
\]
(b). Expected Number of Times a Specific Ticket Will Be Drawn
When drawing 5 tickets from a total of 50, the probability of drawing a specific
ticket (like ticket number 10) in one draw is 1/50.

To find the expected number of times a specific ticket (ticket number 10)
will be drawn in the sample of 5 tickets, we can use the expected value formula

\[
E[X] = n \cdot p,
\]

where n is the number of draws (in this case, the number of tickets drawn)
and p is the probability of success (drawing ticket number 10). Here, n = 5
and p = 1/50, so

\[
E[X] = 5 \cdot \frac{1}{50} = 0.1.
\]

Discrete Uniform Distribution

import numpy as np

# Define the Discrete Uniform parameters
a_discrete = 1
b_discrete = 10

def pmf_discrete(x, a, b):
    if a <= x <= b:
        return 1 / (b - a + 1)
    else:
        return 0

def cdf_discrete(x, a, b):
    if x < a:
        return 0
    elif a <= x <= b:
        return (x - a + 1) / (b - a + 1)
    else:
        return 1

# PMF values
x_discrete_values = np.arange(a_discrete, b_discrete + 1)
pmf_values_discrete = [pmf_discrete(x, a_discrete, b_discrete) for x in x_discrete_values]
print("PMF values for x from {} to {}:".format(a_discrete, b_discrete), pmf_values_discrete)

# CDF values
cdf_values_discrete = [cdf_discrete(x, a_discrete, b_discrete) for x in x_discrete_values]
print("CDF values for x from {} to {}:".format(a_discrete, b_discrete), cdf_values_discrete)

# Mean (Expected Value)
mean_discrete = (a_discrete + b_discrete) / 2
print("Mean (Expected Value):", mean_discrete)

# Variance
variance_discrete = ((b_discrete - a_discrete + 1) ** 2 - 1) / 12
print("Variance:", variance_discrete)

# Standard Deviation
std_dev_discrete = np.sqrt(variance_discrete)
print("Standard Deviation:", std_dev_discrete)

6.4.9 Exercises
1. A box contains 20 different colored balls, numbered from 1 to 20. If a ball
is drawn at random, what is the probability of drawing ball number 15?
2. A survey is conducted with 30 participants, each assigned a unique ID
from 1 to 30. If 5 IDs are randomly selected for a follow-up interview,
what is the expected number of times a specific ID (e.g., ID 12) will be
selected?


3. A raffle has 100 tickets numbered from 1 to 100. If the raffle draws 10
tickets, what is the probability that ticket number 25 will be drawn at
least once?
4. In a game, a player rolls a fair six-sided die. What is the probability of
rolling a specific number, say 4? Additionally, if the player rolls the die 10
times, what is the expected number of times the number 4 will appear?
5. A classroom has 15 students, each assigned a number from 1 to 15. If the
teacher randomly selects 3 students for a project, what is the probability
that student number 7 is selected?
6. A bag contains 50 different coupons numbered from 1 to 50. If 5 coupons

are drawn randomly, what is the expected number of times coupon number
30 will be drawn?
7. A committee consists of 12 members, each assigned a number from 1 to
12. If 4 members are randomly chosen to form a subcommittee, what is
the probability that member number 6 is included in the selection?
8. A local lottery involves selecting 6 numbers from a set of 1 to 49. What
is the probability that the number 7 is chosen in a single drawing? If you
play the lottery 10 times, what is the expected number of times you will
have number 7 in your selected numbers?
9. In a game show, contestants choose a number from 1 to 100. If a contes-
tant has chosen number 45, what is the probability that this number is
drawn if the game draws 5 numbers randomly without replacement?

6.5 Concluding Remarks



In this chapter, we have examined key discrete probability distributions that


are integral to data science: the Bernoulli, Binomial, and Poisson distributions.
Understanding these distributions equips data scientists with powerful tools for
analyzing categorical data and modeling discrete events.

Mastering these discrete distributions enhances our ability to model and


interpret data effectively, leading to more accurate predictions and insights. As
we transition to continuous probability distributions in the next chapter, we
will expand our analytical toolkit to handle a broader range of data types and
modeling scenarios.

6.6 Chapter Exercises


1. A coin is flipped once. Define the random variable X as the outcome of
the coin flip (1 for heads, 0 for tails). Calculate the mean and variance of
X.


2. Suppose a factory produces light bulbs, and each bulb has a 90% prob-
ability of being functional. If you randomly select one bulb, what is the
probability that it is functional?
3. A basketball player has a free throw success rate of 75%. If she takes 10
free throws, what is the probability that she makes exactly 8 of them?
(Use the binomial probability formula.)
4. In a survey, it is found that 60% of people prefer coffee over tea. If you
randomly sample 15 people, what is the probability that exactly 9 prefer
coffee?

5. A call center receives an average of 4 calls per hour. What is the proba-
bility that they receive exactly 6 calls in the next hour?
6. In a certain city, the average number of accidents at a particular inter-
section is 2 per month. What is the probability that there will be no
accidents in the next month?
7. A six-sided die is rolled. Define the random variable Y as the outcome of
the roll. Calculate the mean and variance of Y .
8. A spinner is divided into 8 equal sections numbered from 1 to 8. What
is the probability of landing on an even number when the spinner is spun
once?

9. In a quality control process, the probability of finding a defective product


is 0.02. If a batch contains 100 products, calculate the probability of
finding exactly 3 defective products.
10. An average of 3 earthquakes occur in a year in a certain region. Using

the Poisson distribution, calculate the probability of experiencing at least


one earthquake in the next year.
11. A game involves rolling a die and flipping a coin. If the die shows a 3 or a
4, you win a prize with a probability of 0.5 (coin shows heads). If the die
shows any other number, you don’t win. What is the overall probability
of winning a prize?

Chapter 7

Some Continuous Probability Distributions

7.1 Introduction
In probability theory and statistics, continuous probability distributions play
a fundamental role in modeling and analyzing real-world phenomena. Unlike
discrete distributions, which are defined for countable outcomes, continuous
distributions are used to describe outcomes that can take on any value within
a given range. This chapter delves into some of the most widely used continu-
ous probability distributions, including the Uniform, Exponential, and Normal
distributions.

Continuous probability distributions are integral to various fields, such as



engineering, economics, and the natural sciences, due to their ability to repre-
sent diverse processes and events accurately. Understanding these distributions
enables us to calculate probabilities and make inferences about populations
based on sample data.

We begin with the Uniform distribution, which serves as a simple model for
random variables that have equally likely outcomes over a specific interval. Fol-
lowing this, we explore the Exponential distribution, commonly used to model
the time between events in a Poisson process. We then delve into the Normal
distribution, arguably the most important distribution in statistics, due to the
Central Limit Theorem’s implication that it approximates many natural phe-
nomena.

Each section will provide a detailed definition of the distribution, its proper-
ties, and practical examples to illustrate its application. Additionally, exercises
are included to reinforce the concepts and allow for hands-on practice in calcu-


lating probabilities and understanding the distribution’s behavior.

7.2 Continuous Uniform Distribution


The continuous uniform distribution is a probability distribution that describes
an outcome where all values within a specified range are equally likely to occur.
This distribution is defined over an interval [a, b], where a is the minimum value
and b is the maximum value. In contrast to discrete distributions, which deal
with distinct outcomes, the continuous uniform distribution addresses scenarios
where the outcomes can take on any value within the range.

The continuous uniform distribution is commonly used in simulations, ran-
dom sampling, and scenarios where a uniform distribution of outcomes is as-
sumed, such as in generating random numbers or modeling processes where
each outcome within a specified range is equally probable. Its simplicity and
intuitive nature make it a fundamental concept in statistics and probability
theory.
Definition: A random variable X is said to follow a continuous uniform
distribution on the interval [a, b], denoted X ∼ U(a, b), if its probability
density function (pdf) is

\[
f(x) = \begin{cases} \dfrac{1}{b-a} & \text{for } a \le x \le b, \\[4pt] 0 & \text{otherwise.} \end{cases}
\]

The plot of X ∼ U(a, b), where X is uniformly distributed between a and b,
is presented in Figure 7.1.

[Figure: constant density f(x) = 1/(b − a) over the interval [a, b].]

Figure 7.1: The plot of X ∼ U(a, b).


7.2.1 Distributional Properties


Expected Value (Mean)
The expected value E(X) is given by

\[
E(X) = \int_{-\infty}^{\infty} x f_X(x)\,dx = \int_a^b x \cdot \frac{1}{b-a}\,dx
     = \frac{1}{b-a}\left[\frac{x^2}{2}\right]_a^b
     = \frac{(b-a)(b+a)}{2(b-a)} = \frac{b+a}{2}.
\]

Hence,

\[
\text{Mean} = E(X) = \frac{a+b}{2}.
\]

Variance and Standard Deviation


Expected Value: We already have the expected value E(X) for a uniform
random variable X over [a, b]:

\[
E(X) = \frac{a+b}{2}.
\]

Expected Value of X²: To find the variance, we first need E(X²). This is
given by

\[
E(X^2) = \int_a^b x^2 f_X(x)\,dx = \frac{1}{b-a}\int_a^b x^2\,dx
       = \frac{1}{b-a}\left[\frac{x^3}{3}\right]_a^b
       = \frac{b^3 - a^3}{3(b-a)}
       = \frac{b^2 + ab + a^2}{3}.
\]

Hence, the variance is

\[
\operatorname{Var}(X) = \frac{b^2 + ab + a^2}{3} - \left(\frac{a+b}{2}\right)^2
                      = \frac{4(b^2 + ab + a^2) - 3(a^2 + 2ab + b^2)}{12}
                      = \frac{(b-a)^2}{12},
\]

and the standard deviation is

\[
\text{Standard Deviation} = \frac{b-a}{\sqrt{12}}.
\]
Problem 7.1. Suppose the time (in minutes) it takes for a customer to be
served at a coffee shop follows a continuous uniform distribution between 5 and
15 minutes.

(a). What is the probability that a customer is served within 10 minutes?


(b). What is the expected time for a customer to be served?

FT
(c). What is the variance of the time for a customer to be served?

Solution
(a). Probability that a customer is served within 10 minutes:
Let X be the time taken to be served, and X ∼ Uniform(5, 15). To find
P(X ≤ 10), note that the pdf of X is given by

\[
f(x) = \frac{1}{15-5} = \frac{1}{10} \quad \text{for } 5 \le x \le 15.
\]

The probability is given by the integral of the pdf from 5 to 10:

\[
P(X \le 10) = \int_5^{10} \frac{1}{10}\,dx = \frac{10-5}{10} = 0.5.
\]

(b). Expected time:


The expected value for a continuous uniform distribution X ∼ Uniform(a, b) is

\[
E(X) = \frac{a+b}{2}.
\]

Here, a = 5 and b = 15, so

\[
E(X) = \frac{5 + 15}{2} = 10 \text{ minutes.}
\]

(c). Variance of the time:


The variance for a continuous uniform distribution X ∼ Uniform(a, b) is

\[
\operatorname{Var}(X) = \frac{(b-a)^2}{12}.
\]

Here, a = 5 and b = 15, so the variance is

\[
\operatorname{Var}(X) = \frac{(15-5)^2}{12} = \frac{100}{12} \approx 8.33.
\]
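
The three answers can be verified with scipy.stats.uniform; note that scipy parametrizes the distribution by loc (the lower bound) and scale (the width of the interval):

from scipy.stats import uniform

dist = uniform(loc=5, scale=10)  # service time ~ U(5, 15)

print(dist.cdf(10))  # P(X <= 10) = 0.5
print(dist.mean())   # 10.0 minutes
print(dist.var())    # 100/12, approximately 8.33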


Problem 7.2. Suppose you are designing a random number generator that
outputs a number between 1 and 100. Each number in this range is equally
likely to be selected.
(a). What is the probability that the generator outputs a number between 20
and 50?

(b). Determine the mean and variance of the numbers generated by this
random number generator.

(c). If you generate 10,000 numbers, what is the expected number of times

a number between 20 and 50 is generated?

Solution
(a). Probability Calculation
Since the numbers generated are uniformly distributed between 1 and
100, we can model this using a continuous uniform distribution U(a, b)
with a = 1 and b = 100.

The probability density function (pdf) is

\[
f(x) = \begin{cases} \dfrac{1}{b-a} & \text{for } a \le x \le b, \\[4pt] 0 & \text{otherwise.} \end{cases}
\]

Here, f(x) = 1/(100 − 1) = 1/99 for 1 ≤ x ≤ 100.

To find the probability that the generator outputs a number between
20 and 50, we calculate

\[
P(20 \le X \le 50) = \int_{20}^{50} f(x)\,dx = \frac{1}{99}\int_{20}^{50} dx = \frac{50-20}{99} \approx 0.303.
\]

(b). Mean and Variance
For a uniform distribution U(a, b),

\[
\text{Mean} = \mu = \frac{a+b}{2} = \frac{1 + 100}{2} = 50.5,
\]
\[
\text{Variance} = \sigma^2 = \frac{(b-a)^2}{12} = \frac{(100-1)^2}{12} = \frac{99^2}{12} = \frac{9801}{12} \approx 816.75.
\]

(c). Expected Number of Times a Number Between 20 and 50 is
Generated
The expected number of times a number between 20 and 50 is generated
out of 10,000 numbers can be found by multiplying the probability by
the total number of trials:

\[
E(\text{number of times}) = P(20 \le X \le 50) \times 10000 = 0.303 \times 10000 = 3030.
\]

Thus, we expect the number generator to output a number between 20


and 50 approximately 3030 times out of 10,000 trials.

7.2.2 Python Code for Uniform Distribution Characteristics
The Uniform distribution models scenarios where all outcomes are equally likely
within a given interval. Below is Python code demonstrating how to compute
various characteristics for both Continuous and Discrete Uniform distributions.

Continuous Uniform Distribution

import numpy as np
from scipy.stats import uniform

# Define the parameters
a = 0   # Lower bound
b = 10  # Upper bound

# Define the Continuous Uniform distribution
dist_continuous = uniform(loc=a, scale=b - a)

# 1. Probability Density Function (PDF)
x_values = np.linspace(a, b, 100)
pdf_values = dist_continuous.pdf(x_values)
print("PDF values for x from {} to {}:".format(a, b), pdf_values)

# 2. Cumulative Distribution Function (CDF)
cdf_values = dist_continuous.cdf(x_values)
print("CDF values for x from {} to {}:".format(a, b), cdf_values)

# 3. Mean (Expected Value)
mean_continuous = dist_continuous.mean()
print("Mean (Expected Value):", mean_continuous)

# 4. Variance
variance_continuous = dist_continuous.var()
print("Variance:", variance_continuous)

# 5. Standard Deviation
std_dev_continuous = dist_continuous.std()
print("Standard Deviation:", std_dev_continuous)

# 6. Quantiles
quantiles_continuous = dist_continuous.ppf([0.25, 0.5, 0.75])  # 25th, 50th (median), and 75th percentiles
print("Quantiles at 0.25, 0.5, and 0.75:", quantiles_continuous)

# 7. Percentiles
percentiles_continuous = dist_continuous.ppf([0.1, 0.9])  # 10th and 90th percentiles
print("Percentiles at 0.1 and 0.9:", percentiles_continuous)

7.2.3 Exercises
1. Consider a discrete uniform distribution where X takes values {1, 2, 3, . . . , n}.
(i) Show that the expected value E(X) is (n + 1)/2, and (ii) find the variance
of X.

2. Find the cumulative distribution function (cdf) of a continuous uniform



random variable X ∼ U (a, b).


3. The cholesterol level of adults in a certain region follows a uniform distri-
bution between 150 mg/dL and 250 mg/dL.
4. If X ∼ U (0, 1), find the distribution of Y = a + (b − a)X.

5. Suppose the waiting time for a bus is uniformly distributed between 0 and
30 minutes. What is the probability that a person will wait more than 20
minutes?
6. A factory produces items with weights that are uniformly distributed
between 50 grams and 150 grams. What is the probability that a randomly
chosen item weighs between 80 grams and 120 grams?
7. The cholesterol level of adults in a certain region follows a uniform distri-
bution between 150 mg/dL and 250 mg/dL.


(a) Write the probability density function f (x) of the cholesterol level.
(b) What is the probability that a randomly selected adult has a choles-
terol level between 180 mg/dL and 220 mg/dL?
(c) Find the mean and variance of the cholesterol level.

8. A machine produces metal rods that are uniformly distributed in length


between 98 cm and 102 cm. What is the probability that a randomly
selected rod is between 99 cm and 101 cm in length?
9. In a quality control process, the time to inspect a product is uniformly
distributed between 1 and 5 minutes. Find the probability that the in-

spection time for a randomly chosen product is more than 4 minutes.
10. The download speed of a certain internet connection is uniformly dis-
tributed between 10 Mbps and 100 Mbps. What is the probability that
the download speed at any given time is less than 50 Mbps?
11. The delivery time for a package from a warehouse to a customer is uni-
formly distributed between 2 and 7 days. What is the probability that a
package will be delivered in less than 4 days?
12. The time a wildlife photographer waits to see a particular bird is uniformly
distributed between 30 minutes and 3 hours. Find the probability that
the wait time is more than 2 hours.

13. The fuel efficiency of a car is uniformly distributed between 15 and 25


miles per gallon. What is the probability that the car’s fuel efficiency is
between 18 and 22 miles per gallon?

7.3 Exponential Distribution



Imagine you are at a busy city street waiting for a taxi. Taxis arrive at the
street randomly but at an average rate of 5 taxis per hour. You’re curious about
how long you might have to wait for the next taxi. Is there a way to predict or
understand the waiting time better?

You notice that sometimes the wait is short, and other times it can be quite
long. This variability and randomness in waiting times suggest a need for a
mathematical model to describe it.

The exponential distribution is a probability distribution used to model


the time between successive events in a process where events occur continuously
and independently at a constant average rate. It is characterized by its rate
parameter λ, which is the reciprocal of the average time between events. It is
widely applied in various fields, including data science, to analyze the duration


or time between events. In our taxi example, it helps us understand and pre-
dict the waiting time until the next taxi arrives. The exponential distribution
provides a framework to calculate probabilities and make informed decisions
based on the average rate of arrivals.

Definition: The continuous random variable X is said to have an
exponential distribution if it has the following probability density function:

\[
f(x \mid \lambda) = \begin{cases} \lambda e^{-\lambda x} & \text{for } x \ge 0, \\ 0 & \text{for } x < 0, \end{cases}
\tag{7.1}
\]

where λ > 0 is the rate parameter.

The density plot of X is presented in Figure 7.2 for the value λ = 0.4.

[Figure: the exponential density f(x) = 0.4e^{-0.4x}, decreasing from 0.4 at x = 0.]

Figure 7.2: The probability density function f(x) = 0.4e^{-0.4x}.

For example, the probability P(1 ≤ X ≤ 2) is the area under the probability
density function between the points a = 1 and b = 2, as illustrated in Figure 7.3.

[Figure: shaded region P(a ≤ X ≤ b) = ∫_a^b f(x) dx under the density curve between a and b.]

Figure 7.3: The area under the probability density function f(x) between a and b.
The cdf is obtained by integrating the pdf:

\[
F(x) = \int_0^x \lambda e^{-\lambda t}\,dt
     = \left[-e^{-\lambda t}\right]_0^x
     = -e^{-\lambda x} + e^0
     = 1 - e^{-\lambda x}.
\]

Thus, the cdf F(x) for the exponential distribution is

\[
F(x) = \begin{cases} 0 & x < 0, \\ 1 - e^{-\lambda x} & x \ge 0. \end{cases}
\]

The CDF of the exponential distribution for λ = 0.4 is F(x) = 1 − e^{−0.4x},
and its graph is presented in Figure 7.4.

[Figure: the curve F(x) = 1 − e^{−0.4x}, rising from 0 toward 1 for x in [0, 10].]

Figure 7.4: CDF of the Exponential Distribution with λ = 0.4.


Theorem 7.1. Let X be a random variable that follows an exponential
distribution with parameter λ. The rth moment, defined as E[X^r], is given by

\[
E[X^r] = \frac{r!}{\lambda^r}.
\]

Proof. The probability density function (PDF) of the exponential distribution
is given by

\[
f(x) = \lambda e^{-\lambda x}, \quad x \ge 0.
\]

To find E[X^r], we compute

\[
E[X^r] = \int_0^{\infty} x^r f(x)\,dx = \int_0^{\infty} x^r \lambda e^{-\lambda x}\,dx.
\]

Using the substitution u = λx (hence x = u/λ and dx = du/λ), we have

\[
E[X^r] = \int_0^{\infty} \left(\frac{u}{\lambda}\right)^r \lambda e^{-u}\,\frac{du}{\lambda}
       = \frac{\lambda}{\lambda^{r+1}} \int_0^{\infty} u^r e^{-u}\,du.
\]

The integral \(\int_0^{\infty} u^r e^{-u}\,du\) is the definition of the gamma function
Γ(r + 1), which equals r!. Hence

\[
E[X^r] = \frac{\lambda}{\lambda^{r+1}} \cdot r! = \frac{r!}{\lambda^r}.
\]

This completes the proof for the rth moment of the exponential distribution.
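
The result of Theorem 7.1 can be checked numerically by integrating x^r f(x) with scipy; λ = 0.5 below is an arbitrary choice:

import math
from scipy import integrate

lam = 0.5  # arbitrary rate parameter
for r in (1, 2, 3):
    # E[X^r] = integral of x^r * lam * exp(-lam * x) over [0, infinity)
    value, _ = integrate.quad(lambda x, r=r: x**r * lam * math.exp(-lam * x), 0, math.inf)
    print(r, value, math.factorial(r) / lam**r)  # numeric and closed-form values should match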


Mean and Variance of the Exponential Distribution


Using Theorem 7.1, we can easily find the mean and variance of the exponential
distribution.

Mean
For the exponential distribution (specifically when r = 1),

\[
E[X] = E[X^1] = \frac{1!}{\lambda^1} = \frac{1}{\lambda}.
\]

Variance
The variance Var(X) is calculated as follows:

\[
\operatorname{Var}(X) = E[X^2] - (E[X])^2.
\]

Calculating E[X²] (using r = 2), we have

\[
E[X^2] = \frac{2!}{\lambda^2} = \frac{2}{\lambda^2}.
\]

Now, substituting into the variance formula,

\[
\operatorname{Var}(X) = \frac{2}{\lambda^2} - \left(\frac{1}{\lambda}\right)^2 = \frac{2}{\lambda^2} - \frac{1}{\lambda^2} = \frac{1}{\lambda^2}.
\]
Problem 7.3. A data center experiences server failures where the time between
failures follows an exponential distribution with a mean of 10 days.
(i). Calculate the rate parameter λ.
(ii). What is the probability that a server will fail within the next 3 days?
(iii). What is the probability that the server will last longer than 12 days without

failing?

(i). Calculate the Rate Parameter λ


The mean time between failures is given as 10 days. The rate parameter λ is
calculated using the formula
\[
\lambda = \frac{1}{\text{Mean}} = \frac{1}{10} = 0.1.
\]

(ii). Probability of Failure Within the Next 3 Days


To find the probability that a server will fail within the next 3 days, we use the
cumulative distribution function (CDF) of the exponential distribution
P (X ≤ 3) = 1 − e−λx = 1 − e−0.1×3 = 1 − e−0.3 = 1 − 0.7408 = 0.2592.
So, the probability that a server will fail within the next 3 days is approxi-
mately 0.2592 (or 25.92%).

278
CHAPTER 7. SOME CONTINUOUS PROBABILITY DISTRIBUTIONS

(iii). Probability of Lasting Longer Than 12 Days


To find the probability that the server will last longer than 12 days, we can use
the survival function

P(X > 12) = 1 − P(X ≤ 12) = e^{−0.1×12} = e^{−1.2} ≈ 0.3012.

So, the probability that the server will last longer than 12 days without failing
is approximately 0.3012 (or 30.12%).
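
Both probabilities can be computed with scipy.stats.expon, which is parametrized by scale = 1/λ:

from scipy.stats import expon

dist = expon(scale=10)  # mean time between failures is 10 days, so scale = 1/lambda = 10

print(dist.cdf(3))   # P(X <= 3), approximately 0.2592
print(dist.sf(12))   # P(X > 12), approximately 0.3012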
Problem 7.4. In a hospital, the time between arrivals of patients in the emer-
gency room follows an exponential distribution with an average time of 15 min-
utes.

(a) What is the probability that the time between two successive arrivals is
more than 20 minutes?
(b) What is the probability that the time between two successive arrivals is
less than 10 minutes?
(c) Calculate the expected time between two successive arrivals and its stan-
dard deviation.

Solution
(a). Let X be the time between arrivals, which follows an exponential
distribution with parameter λ. The rate parameter λ is the reciprocal of
the mean, so λ = 1/15 per minute. The probability that the time between
two successive arrivals is more than 20 minutes is calculated as follows:

\[
P(X > 20) = 1 - P(X \le 20) = 1 - F(20) = 1 - \left(1 - e^{-20\lambda}\right) = e^{-20\lambda}.
\]

Substituting λ = 1/15,

\[
P(X > 20) = e^{-20/15} = e^{-4/3} \approx 0.2636.
\]

Thus, the probability that the time between two successive arrivals is
more than 20 minutes is approximately 0.2636.
(b). The probability that the time between two successive arrivals is less than
10 minutes is calculated as follows:

\[
P(X < 10) = F(10) = 1 - e^{-10\lambda}.
\]

Substituting λ = 1/15,

\[
P(X < 10) = 1 - e^{-10/15} = 1 - e^{-2/3} \approx 0.4866.
\]

Thus, the probability that the time between two successive arrivals is less
than 10 minutes is approximately 0.4866.


(c). The expected time between two successive arrivals (the mean of the
exponential distribution) is given by

\[
E(X) = \frac{1}{\lambda} = 15 \text{ minutes.}
\]

The standard deviation of the time between two successive arrivals is the
same as the mean for an exponential distribution, so

\[
\text{Standard deviation} = \frac{1}{\lambda} = 15 \text{ minutes.}
\]

Therefore, the expected time between two successive arrivals is 15 min-
utes, and the standard deviation is also 15 minutes.

7.3.1 Properties of the Exponential Distribution

Let X be an exponential random variable with rate parameter λ. The following
are the key properties of the exponential distribution:

• Cumulative Distribution Function (CDF):

  \[
  F(x) = \begin{cases} 1 - e^{-\lambda x} & x \ge 0, \\ 0 & x < 0. \end{cases}
  \]

• Mean (Expected Value):

  \[
  E[X] = \int_0^{\infty} x \cdot \lambda e^{-\lambda x}\,dx = \frac{1}{\lambda}.
  \]

• Variance:

  \[
  \operatorname{Var}(X) = E(X^2) - [E(X)]^2 = \frac{1}{\lambda^2}.
  \]

• Standard Deviation:

  \[
  \sigma_X = \frac{1}{\lambda}.
  \]

• Memoryless Property: The exponential distribution has the memoryless
  property, which states that

  \[
  P(X > s + t \mid X > s) = P(X > t) \quad \text{for all } s, t \ge 0.
  \]

• Moment Generating Function (MGF):

  \[
  M_X(t) = E[e^{tX}] = \frac{\lambda}{\lambda - t}, \quad \text{for } t < \lambda.
  \]

• Characteristic Function:

  \[
  \varphi_X(t) = E[e^{itX}] = \frac{\lambda}{\lambda - it}, \quad \text{for } t \in \mathbb{R}.
  \]

• Quantile Function: The quantile function (inverse of the CDF) for
  0 < p < 1 is given by (see the sampling sketch after this list)

  \[
  Q(p) = F^{-1}(p) = -\frac{1}{\lambda}\ln(1 - p).
  \]

• Relationship with the Poisson Process: If X ∼ Exponential(λ), it
  can be interpreted as the waiting time between events in a Poisson process
  with rate λ.

• Sum of Independent Exponential Variables: The sum of n independent
  exponential random variables with the same rate parameter λ follows a
  Gamma distribution:

  \[
  Y = \sum_{i=1}^{n} X_i \sim \operatorname{Gamma}(n, \lambda).
  \]

  For n = 1, the Gamma distribution reduces to the exponential distribution.

• Relationship with Other Distributions: If X ∼ Exponential(λ), then
  X ∼ Gamma(1, λ), and the scaled variable Y = βX ∼ Exponential(λ/β)
  for β > 0.
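
The quantile function above gives a standard way to draw exponential samples by inverse-transform sampling: if U ∼ U(0, 1), then Q(U) = −ln(1 − U)/λ follows an Exponential(λ) distribution. A minimal sketch, with λ = 2 chosen arbitrarily:

import numpy as np

rng = np.random.default_rng(1)
lam = 2.0  # arbitrary rate parameter

u = rng.uniform(size=100_000)
samples = -np.log(1 - u) / lam   # inverse-transform sampling via Q(p)

print(samples.mean(), 1 / lam)    # sample mean vs. theoretical mean 1/lambda
print(samples.var(), 1 / lam**2)  # sample variance vs. theoretical variance 1/lambda^2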

Theorem 7.2. The characteristic function of an exponential random variable
X with rate parameter λ is given by

\[
\varphi_X(t) = \frac{\lambda}{\lambda - it}, \quad \text{for } t \in \mathbb{R}.
\]

Proof. To find the characteristic function ϕ_X(t), we need to calculate the
expected value E[e^{itX}]:

\[
\varphi_X(t) = E[e^{itX}] = \int_0^{\infty} e^{itx}\,\lambda e^{-\lambda x}\,dx
             = \lambda \int_0^{\infty} e^{-(\lambda - it)x}\,dx.
\]

The integral is of the form \(\int_0^{\infty} e^{-ax}\,dx = \frac{1}{a}\) for ℜ(a) > 0, so

\[
\varphi_X(t) = \lambda\left(\frac{1}{\lambda - it}\right) = \frac{\lambda}{\lambda - it}.
\]

Thus, the characteristic function of X is

\[
\varphi_X(t) = \frac{\lambda}{\lambda - it}.
\]


Theorem 7.3. The moment generating function (MGF) of an exponential
random variable X with rate parameter λ is given by

\[
M_X(t) = \frac{\lambda}{\lambda - t}, \quad \text{for } t < \lambda.
\]

Proof. To find the moment generating function M_X(t), we need to calculate
the expected value E[e^{tX}]:

\[
M_X(t) = E[e^{tX}] = \int_0^{\infty} e^{tx}\,\lambda e^{-\lambda x}\,dx
       = \lambda \int_0^{\infty} e^{-(\lambda - t)x}\,dx
       = \lambda\left(\frac{1}{\lambda - t}\right).
\]

Thus, the moment generating function of X is

\[
M_X(t) = \frac{\lambda}{\lambda - t}, \quad \text{for } t < \lambda.
\]

7.3.2 Memoryless Property


In probability and statistics, the concept of memorylessness refers to a char-
acteristic of specific probability distributions. A random variable X is said to
have the memoryless property if the probability of an event occurring in the
future is independent of any past events. In other words, it “forgets” what has
happened before. Imagine you’re waiting for a bus that comes randomly every
few minutes. If you’ve been waiting for 10 minutes and the bus hasn’t arrived
yet, the probability of it arriving in the next 5 minutes is the same as if you
had just started waiting.

Only two types of distributions exhibit the memoryless property: geometric



and exponential probability distributions. The exponential distribution has the


memoryless property, which states that the probability of an event occurring in
the next t units of time is independent of how much time has already elapsed.
Mathematically, this can be expressed in the following theorem.
Theorem 7.4 (Memoryless Property). Let X be an exponentially distributed
random variable with rate parameter λ > 0. Then, for all s, t ≥ 0,

P (X > s + t | X > s) = P (X > t).

Proof. The proof of the memoryless property relies on the definition of condi-
tional probability and the exponential distribution’s probability density func-
tion.
By definition of conditional probability, we have

\[
P(X > s + t \mid X > s) = \frac{P(X > s + t \text{ and } X > s)}{P(X > s)} = \frac{P(X > s + t)}{P(X > s)}.
\]


Since X is exponentially distributed with rate parameter λ, the cumulative


distribution function (CDF) is

F (x) = 1 − e−λx for x ≥ 0,

and the survival function (which gives the probability that X is greater than a
certain value) is
P (X > x) = e−λx .
Using the survival function, we can rewrite the conditional probability:

\[
P(X > s + t \mid X > s) = \frac{P(X > s + t)}{P(X > s)} = \frac{e^{-\lambda(s+t)}}{e^{-\lambda s}} = e^{-\lambda t} = P(X > t).
\]

Thus, the memoryless property is proved.
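
The memoryless property is also easy to see in simulation: among samples that exceed s, the fraction exceeding s + t should match the unconditional probability of exceeding t. A minimal sketch with arbitrary values λ = 0.5, s = 2, t = 3:

import numpy as np

rng = np.random.default_rng(2)
lam, s, t = 0.5, 2.0, 3.0
x = rng.exponential(scale=1 / lam, size=1_000_000)

cond = np.mean(x[x > s] > s + t)  # estimate of P(X > s + t | X > s)
uncond = np.mean(x > t)           # estimate of P(X > t)
print(cond, uncond)               # both are close to exp(-1.5), approximately 0.223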
Problem 7.5. Suppose that the waiting time for a bus at a certain bus stop is
exponentially distributed with a mean waiting time of 10 minutes. Let X denote
the waiting time. Given that a person has already waited for 5 minutes, what
is the probability that they will have to wait at least an additional 10 minutes?
Solution
To solve this problem, we use the properties of the exponential distribution,
specifically its memoryless property. For an exponential distribution, the mean
is given by

\[
\text{Mean} = \frac{1}{\lambda}.
\]

Therefore,

\[
\lambda = \frac{1}{\text{Mean}} = \frac{1}{10}.
\]

We need to find the probability that the waiting time exceeds 15 minutes,
given that the person has already waited for 5 minutes. Using the memoryless
property,

\[
P(X > 15 \mid X > 5) = P(X > 10).
\]

The probability that X exceeds a certain time x is given by

\[
P(X > x) = e^{-\lambda x}.
\]

Substituting λ = 1/10 and x = 10,

\[
P(X > 10) = e^{-10/10} = e^{-1} \approx 0.3679.
\]

Problem 7.6. Assume that the lifetime of a light bulb follows an exponential
distribution with a mean lifetime of 1000 hours. Let X be the lifetime of the light
bulb. If a light bulb has already been used for 800 hours, what is the probability
that it will last at least an additional 500 hours?


Solution
To solve this problem, we use the memoryless property of the exponential dis-
tribution. Here’s a step-by-step solution.

The mean lifetime of the light bulb is 1000 hours. For an exponential
distribution, the mean is given by

\[
\text{Mean} = \frac{1}{\lambda}.
\]

Therefore,

\[
\lambda = \frac{1}{\text{Mean}} = \frac{1}{1000}.
\]

We need to find the probability that the light bulb will last for at least
500 more hours, given that it has already been used for 800 hours. Using the
memoryless property,

\[
P(X > 1300 \mid X > 800) = P(X > 500).
\]

The probability that X exceeds a certain time x is given by

\[
P(X > x) = e^{-\lambda x}.
\]

Substituting λ = 1/1000 and x = 500,

\[
P(X > 500) = e^{-500/1000} = e^{-0.5} \approx 0.6065.
\]

Problem 7.7. Consider a system with a component whose time to failure is



exponentially distributed with a rate parameter λ = 0.01 failures per hour. Let
X represent the time to failure of the component. If the component has been
operational for 100 hours without failure, what is the probability that it will
operate for at least another 50 hours?

Solution
To solve this problem, we use the properties of the exponential distribution,
specifically its memoryless property. Here’s a step-by-step solution: The rate
parameter for the exponential distribution is given as λ = 0.01 failures per hour.

We need to find the probability that the component will operate for at least
50 more hours, given that it has already operated for 100 hours. Using the
memoryless property

P (X > 150 | X > 100) = P (X > 50)


The probability that X exceeds a certain time x is given by

P (X > x) = e−λx

Substituting λ = 0.01 and x = 50,

\[
P(X > 50) = e^{-0.01 \times 50} = e^{-0.5} \approx 0.6065.
\]

Applications in Data Science

The important applications of exponential distribution are mentioned in the
following:

1. Survival Analysis: In medical research and clinical trials, the exponen-


tial distribution is used to model the time until an event occurs, such as
patient survival times or time to recovery from a treatment.
2. Queuing Theory: Exponential distributions model the time between
arrivals in queuing systems, like customer service lines, call centers, and
computer networks. This helps in analyzing and optimizing service pro-
cesses and reducing wait times.
3. Reliability Engineering: It helps in estimating the lifespan of products
and systems, providing insights into maintenance schedules and warranty
analysis.

4. Markov Processes: As part of continuous-time Markov chains, expo-


nential distributions model the time spent in each state before transition-
ing to another state.
5. Customer Behavior Analysis: Exponential models can help analyze
customer behavior, such as the time between purchases or interactions
with a service, allowing businesses to tailor marketing strategies effec-
tively.

7.3.3 Python Code for Exponential Distribution Characteristics


The Exponential distribution models the time between events in a Poisson pro-
cess. Below is Python code demonstrating how to compute various character-
istics of the Exponential distribution.


Python Code

import numpy as np
from scipy.stats import expon

# Define the parameter lambda (rate of occurrence)
lambda_ = 1          # Rate parameter
scale = 1 / lambda_  # Scale parameter for scipy's expon

# Exponential distribution
dist = expon(scale=scale)

# 1. Probability Density Function (PDF)
x_values = np.linspace(0, 10, 100)  # Values for the random variable
pdf_values = dist.pdf(x_values)
print("PDF values for x from 0 to 10:", pdf_values)

# 2. Cumulative Distribution Function (CDF)
cdf_values = dist.cdf(x_values)
print("CDF values for x from 0 to 10:", cdf_values)

# 3. Mean (Expected Value)
mean = dist.mean()
print("Mean (Expected Value):", mean)

# 4. Variance
variance = dist.var()
print("Variance:", variance)

# 5. Standard Deviation
std_dev = dist.std()
print("Standard Deviation:", std_dev)

# 6. Quantiles
quantiles = dist.ppf([0.25, 0.5, 0.75])  # 25th, 50th (median), and 75th percentiles
print("Quantiles at 0.25, 0.5, and 0.75:", quantiles)

# 7. Percentiles
percentiles = dist.ppf([0.1, 0.9])  # 10th and 90th percentiles
print("Percentiles at 0.1 and 0.9:", percentiles)

# 8. Moment Generating Function (MGF)
def mgf(t, lambda_):
    return 1 / (1 - t / lambda_)  # MGF of the Exponential distribution for t < lambda_

# MGF values at t = 0.1 and t = 0.5
mgf_values = [mgf(t, lambda_) for t in [0.1, 0.5]]
print("MGF values for t = 0.1 and t = 0.5:", mgf_values)

# 9. Probability Generating Function (PGF)
# Note: The PGF is not typically used for continuous
# distributions like the Exponential distribution.
# Here, the MGF serves the analogous purpose.

Explanations

• Probability Density Function (PDF):

  ■ The PDF provides the likelihood of each value. Compute these
    values using dist.pdf(x_values).

• Cumulative Distribution Function (CDF):

  ■ The CDF gives the probability that the random variable is less
    than or equal to a given value. Compute these values using
    dist.cdf(x_values).

• Mean (Expected Value):

  ■ The mean (expected value) is 1/λ. Compute this using dist.mean().

• Variance:

  ■ The variance is 1/λ². Compute this using dist.var().

• Standard Deviation:

  ■ The standard deviation is 1/λ. Compute this using dist.std().

• Quantiles:

  ■ Quantiles are values below which a given proportion of data falls.
    Compute these using dist.ppf() for the desired quantile probabilities.

• Percentiles:

  ■ Percentiles are specific quantiles, such as the 10th and 90th percentiles.
    Compute these using dist.ppf() for the desired percentiles.

• Moment Generating Function (MGF):

  ■ The MGF for the Exponential distribution is M_X(t) = 1/(1 − t/λ)
    for t < λ. Define a function mgf(t, lambda_) to compute MGF values.

• Probability Generating Function (PGF):

  ■ The PGF is generally used for discrete distributions. For continuous
    distributions like the Exponential, the MGF serves a similar purpose.

7.3.4 Exercises

T
1. The lifetime of a particular brand of lightbulb is exponentially distributed
with a mean of 1000 hours.
(a) What is the probability that a lightbulb lasts more than 1200 hours?
(b) What is the probability that a lightbulb lasts between 800 and 1200
hours?
AF
2. A machine in a factory breaks down on average once every 500 hours.
The time between breakdowns is exponentially distributed.
(a) What is the probability that the machine will operate for at least
1000 hours without a breakdown?
(b) Determine the probability that the machine will break down within
the next 200 hours.
3. The waiting time for a specific genetic test result from a laboratory follows
an exponential distribution with a mean of 2 days.
(a) Find the probability that the test result will be available in less than
DR

1 day.
(b) Find the probability that the test result will take more than 3 days.
4. The time until failure of a critical component in a medical device follows
an exponential distribution with a mean of 5 years.
(a) Calculate the probability that the component will fail within the first
3 years.
(b) Determine the probability that the component will last more than 7
years.
5. In a call center, the time between consecutive calls follows an exponential
distribution with a mean time of 4 minutes.
(a) What is the probability that the next call comes within 2 minutes?
(b) What is the probability that the next call will not come for at least
6 minutes?


6. The time between arrivals of buses at a particular bus stop follows an


exponential distribution with an average time of 20 minutes.
(a) What is the probability that a bus will arrive in the next 5 minutes?
(b) Calculate the probability that you will have to wait more than 25
minutes for the next bus.
7. The lifespan of a certain species of bacteria follows an exponential distri-
bution with a mean of 24 hours.
(a) Find the probability that a bacterium lives longer than 30 hours.
(b) Determine the probability that a bacterium lives between 10 and 20

T
hours.
8. The response time of a server to a network request is exponentially dis-
tributed with an average time of 0.5 seconds.
(a) What is the probability that the server responds in less than 0.3
seconds?
AF (b) Calculate the probability that the server takes more than 1 second
to respond.
9. The time it takes for a chemical reaction to complete in a lab experiment
follows an exponential distribution with a mean of 45 minutes.
(a) What is the probability that the reaction will complete in less than
30 minutes?
(b) What is the probability that the reaction will take more than 60
minutes to complete?
10. The lifespan of a certain species of laboratory mice follows an exponential
DR

distribution with a mean lifespan of 2.5 years.


(a) Determine the probability density function (pdf) of the lifespan.
(b) What is the median lifespan of the mice?
(c) Calculate the variance and standard deviation of the lifespan.
(d) What is the probability that a randomly selected mouse lives more
than 3 years?
11. The time (in hours) until recovery from a certain disease follows an expo-
nential distribution with a mean of 2 hours.
(a) What is the probability that a patient will recover in less than 1
hour?
(b) What is the probability that a patient will recover between 1 and 3
hours?
(c) Find the median recovery time.


7.4 Normal Distribution


Imagine you are working for a clothing company that wants to optimize its
garment sizes. To do this effectively, you need to understand the distribution
of body measurements, such as heights, within your target customer base. If
you were to collect height data from a large number of individuals, what kind
of pattern would you expect to see? Intuitively, you would likely find that
most people are of average height, with fewer individuals being exceptionally
tall or short. This common phenomenon, where measurements cluster around
a central value with a symmetrical spread on either side, is a hallmark of the
normal distribution, also known as the Gaussian distribution.

In the field of data science, the normal distribution plays a pivotal role due
to its ubiquitous nature and the mathematical properties that simplify analysis.
It is characterized by its symmetric, bell-shaped curve and is instrumental
in various statistical methods, including hypothesis testing, regression analysis,
and many machine learning algorithms.
Theorem, which states that the sum of a large number of independent, iden-
tically distributed variables tends toward a normal distribution, regardless of
the original distribution of the variables. This makes the normal distribution
a powerful tool for modeling real-world phenomena and for making inferences
about populations based on sample data.

The normal distribution is characterized by its bell-shaped curve, which is


symmetric about its mean. The parameters of the normal distribution are the
mean (µ) and the standard deviation (σ). The mean represents the central
value of the distribution, while the standard deviation measures the spread of
the distribution.

7.4.1 Definition of the Normal Distribution


Normal Distribution: A random variable X is said to follow a normal
distribution if its probability density function (pdf) is given by:
\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}, \qquad -\infty < x < \infty \]
where µ is the mean and σ is the standard deviation of the distribution.
The graphical presentation of a normal distribution is given in Figure 7.5.
The probability density function is a bell-shaped curve that is symmetric
about µ. The notation
X ∼ N (µ, σ 2 )


denotes that the random variable X has a normal distribution with mean
µ and variance σ 2 .

Figure 7.5: The graphical presentation of a normal distribution.

The mean and the variance of the distribution are E(X) = µ and Var(X) = σ², respectively.
The probability density function of a normal random variable is symmetric
around the mean value µ and exhibits a “bell-shaped” curve. Figure 7.6 displays
the probability density functions of normal distributions with µ = 5, σ = 2
and µ = 10, σ = 2. It illustrates that while altering the mean value µ shifts
the location of the density function, it does not affect its shape. In contrast,
Figure 7.7 presents the probability density functions of normal distributions
with µ = 5, σ = 2 and µ = 5, σ = 0.5. Here, the central position of the
density function remains the same, but its shape changes. Larger values of the
variance σ 2 lead to wider, flatter bell-shaped curves, whereas smaller values of
the variance σ 2 produce narrower, sharper bell-shaped curves.


Figure 7.6: The effect of changing the mean of a normal distribution (densities of N(5, 4), N(10, 4), and N(15, 4)).

Figure 7.7: The effect of changing the variance of a normal distribution.


7.4.2 Properties of the Normal Distribution


The normal distribution, also known as the Gaussian distribution, is one of the most important probability distributions in statistics. It has several important properties:

1. Symmetry: The normal distribution is symmetric about its mean µ.


That is f (µ − x) = f (µ + x). This means that the left half of the distri-
bution is a mirror image of the right half.

2. Bell-Shaped Curve: The pdf of the normal distribution forms a bell-shaped curve that is unimodal, meaning it has a single peak at the mean µ.
3. Mean, Median, and Mode: For a normal distribution, the mean, me-
dian, and mode are all equal and located at µ.
4. Total Area: The total area under the curve and above the horizontal
axis is equal to 1.
5. Inflection Points: The points at which the curve changes concavity are
located at µ − σ and µ + σ.

6. Empirical Rule (68-95-99.7 Rule): Approximately 68% of the data


lies within one standard deviation of the mean (µ ± σ), about 95% within
two standard deviations (µ ± 2σ), and about 99.7% within three standard
deviations (µ ± 3σ). See Figure 7.8.
Figure 7.8: The Empirical Rule for the Normal Distribution.


7. Asymptotic Behavior: The tails of the normal distribution approach the horizontal axis but never touch it. This implies that the density is positive for every value of x, no matter how extreme.
8. Linear Transformations: If X is normally distributed with mean µ
and standard deviation σ, then a linear transformation of X given by
Y = aX + b is also normally distributed with mean aµ + b and standard
deviation |a|σ.
9. Additivity: The sum of two independent normal random variables is also
normally distributed. Specifically, if X1 ∼ N (µ1 , σ12 ) and X2 ∼ N (µ2 , σ22 ),
then X1 + X2 ∼ N (µ1 + µ2 , σ12 + σ22 ).

10. Moment Generating Function: The moment generating function M_X(t) of a normal random variable X ∼ N(µ, σ²) is given by:
\[ M_X(t) = \exp\left(\mu t + \frac{1}{2}\sigma^2 t^2\right). \]
11. Characteristic Function: The characteristic function φ_X(t) of a normal random variable X ∼ N(µ, σ²) is given by:
\[ \varphi_X(t) = \exp\left(i\mu t - \frac{1}{2}\sigma^2 t^2\right). \]

12. Cumulative Distribution Function (cdf): The cumulative distribution function (CDF) of the normal distribution with mean µ and standard deviation σ is given by:
\[ F(x) = \Pr(X \le x) = \frac{1}{2}\left[1 + \operatorname{erf}\left(\frac{x-\mu}{\sigma\sqrt{2}}\right)\right] \]
where erf(z) is the error function, defined as:
\[ \operatorname{erf}(z) = \frac{2}{\sqrt{\pi}} \int_0^z e^{-t^2}\, dt. \]

Theorem 7.5. If X ∼ N(µ, σ²), then the mean and variance are µ and σ², respectively.
Proof. To evaluate the mean, we first calculate
\[ E(X - \mu) = \int_{-\infty}^{\infty} (x - \mu)\, \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}\, dx. \]
Setting z = (x − µ)/σ and dx = σ dz, we obtain
\[ E(X - \mu) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} z\, e^{-\frac{1}{2}z^2}\, dz = 0, \]
since the integrand above is an odd function of z. Using Theorem 4.5 on page 128, we conclude that
\[ E(X) = \mu. \]
The variance of the normal distribution is given by
\[ E[(X - \mu)^2] = \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^{\infty} (x - \mu)^2\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}\, dx. \]
Again setting z = (x − µ)/σ and dx = σ dz, we obtain
\[ E[(X - \mu)^2] = \sigma^2\, \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} z^2\, e^{-\frac{1}{2}z^2}\, dz. \]
Integrating by parts with u = z and dv = z e^{-z^2/2} dz, so that du = dz and v = -e^{-z^2/2}, we find that
\[ E[(X - \mu)^2] = \sigma^2\, \frac{1}{\sqrt{2\pi}} \left( \left[ -z e^{-z^2/2} \right]_{-\infty}^{\infty} + \int_{-\infty}^{\infty} e^{-z^2/2}\, dz \right) = \sigma^2 (0 + 1) = \sigma^2.

Theorem 7.6. For a normal distribution with mean µ and standard deviation
σ, the inflection points occur at x = µ − σ and x = µ + σ.
Proof. To find the inflection points, we need to determine where the second
derivative of the density function changes sign.

First Derivative
The first derivative of f(x) with respect to x is:
\[ f'(x) = \frac{d}{dx}\left( \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \right) \]
Using the chain rule:
\[ f'(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \cdot \left( -\frac{x-\mu}{\sigma^2} \right) \]
Simplifying:
\[ f'(x) = -\frac{x-\mu}{\sigma^3\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \]
Second Derivative
The second derivative of f(x) with respect to x is:
\[ f''(x) = \frac{d}{dx}\left( -\frac{x-\mu}{\sigma^3\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \right) \]


Again, using the product rule and chain rule:
\[ f''(x) = -\frac{1}{\sigma^3\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \cdot \left( 1 - \frac{(x-\mu)^2}{\sigma^2} \right) \]
Simplifying:
\[ f''(x) = \frac{1}{\sigma^5\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \left[ (x-\mu)^2 - \sigma^2 \right] \]
Setting the Second Derivative to Zero
To find the inflection points, we set f''(x) = 0:
\[ (x-\mu)^2 - \sigma^2 = 0 \]
Solving for x:
\[ (x-\mu)^2 = \sigma^2 \quad\Rightarrow\quad x - \mu = \pm\sigma \quad\Rightarrow\quad x = \mu \pm \sigma \]
Thus, the inflection points of the normal distribution are at x = µ − σ and x = µ + σ.

7.4.3 Standard Normal Distribution


The standard normal distribution is a special case of the normal distribution
with mean µ = 0 and standard deviation σ = 1. It is denoted by Z and its
properties are as follows:

• Probability Density Function (pdf): The pdf of the standard normal distribution is given by:
\[ f_Z(z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{z^2}{2}}, \]
as illustrated in Figure 7.9.

• Cumulative Distribution Function (cdf): The cdf of the standard normal distribution is denoted by Φ(z) and is defined as:
\[ \Phi(z) = \Pr(Z \le z) = \frac{1}{2}\left[1 + \operatorname{erf}\left(\frac{z}{\sqrt{2}}\right)\right] \]
where erf(z) is the error function:
\[ \operatorname{erf}(z) = \frac{2}{\sqrt{\pi}} \int_0^z e^{-t^2}\, dt \]


Below is a plot of the standard normal distribution’s pdf.

Figure 7.9: The standard normal distribution with mean µ = 0 and standard deviation σ = 1.
AF
The symmetry of the standard normal distribution about 0 implies that if
the random variable Z has a standard normal distribution, then
1 − Φ(z) = Pr(Z ≥ z) = Pr(Z ≤ −z) = Φ(−z),
as illustrated in Figure 7.9. This equation can be rearranged to provide the
easily remembered relationship
Φ(z) + Φ(−z) = 1.
The plot presented in Figure 7.10 illustrates the cumulative distribution func-
tions Φ(z) and Φ(−z) of the standard normal distribution. The symmetry of
the standard normal distribution is evident from these plots.
Figure 7.10: Standard normal distribution with shaded areas for Φ(−2) and Φ(2).
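This symmetry is easy to confirm numerically; a minimal sketch with scipy.stats.norm:

from scipy.stats import norm

z = 2.0
print(norm.cdf(-z))                 # Phi(-2), about 0.0228
print(1 - norm.cdf(z))              # 1 - Phi(2), identical by symmetry
print(norm.cdf(z) + norm.cdf(-z))   # Phi(2) + Phi(-2) = 1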


7.4.4 Finding the Probability P(a ≤ X ≤ b)


The probability P (a ≤ X ≤ b) for a continuous random variable X with pdf
f (x) is given by the definite integral of the pdf over the interval [a, b]:
\[ P(a \le X \le b) = \int_a^b f(x)\, dx \]

as illustrated in Figure 7.11. The probability P (a ≤ X ≤ b) can also be


expressed in terms of the cdf F (x) of X as follows

P (a ≤ X ≤ b) = F (b) − F (a).

Direct computation of F (·) for a general normal distribution can be challenging.
However, it is readily facilitated by using the cdf of the standard normal distri-
bution of Z. The following steps explain how to compute P (a ≤ X ≤ b) for a
normally distributed random variable X by leveraging the cdf of the standard
normal random variable Z.
Figure 7.11: The area under the probability density function f(x) between a and b.

Steps to Find the Probability


1. Standardize the variable: Convert the normal random variable X ∼
N (µ, σ 2 ) to the standard normal random variable Z ∼ N (0, 1) using the
z-score transformation:
\[ Z = \frac{X - \mu}{\sigma} \]


2. Standardize the limits: Convert the lower limit a and the upper limit
b to their corresponding z-scores:
\[ z_a = \frac{a - \mu}{\sigma} \quad \text{and} \quad z_b = \frac{b - \mu}{\sigma} \]
3. Find the cumulative probabilities: Use the cdf of the standard nor-
mal distribution, denoted by Φ(z) = P (Z ≤ z), to find the cumulative
probabilities at z_a and z_b:
\[ \Phi(z_a) = \Phi\left(\frac{a-\mu}{\sigma}\right) = P\left(Z \le \frac{a-\mu}{\sigma}\right), \qquad \Phi(z_b) = \Phi\left(\frac{b-\mu}{\sigma}\right) = P\left(Z \le \frac{b-\mu}{\sigma}\right). \]
4. Calculate the probability: The probability P (a ≤ X ≤ b) is the dif-
ference between the cumulative probabilities at z_b and z_a:
\[ \begin{aligned} P(a \le X \le b) &= P\left(\frac{a-\mu}{\sigma} \le \frac{X-\mu}{\sigma} \le \frac{b-\mu}{\sigma}\right) \\ &= P(z_a \le Z \le z_b) \\ &= P(Z \le z_b) - P(Z \le z_a) \\ &= \Phi(z_b) - \Phi(z_a) = \Phi\left(\frac{b-\mu}{\sigma}\right) - \Phi\left(\frac{a-\mu}{\sigma}\right) \end{aligned} \]

Finding the Probability P(a ≤ X ≤ b): If X ∼ N(µ, σ²), then the probability P(a ≤ X ≤ b) can be computed as
\[ P(a \le X \le b) = P\left(\frac{a-\mu}{\sigma} \le \frac{X-\mu}{\sigma} \le \frac{b-\mu}{\sigma}\right) = \Phi\left(\frac{b-\mu}{\sigma}\right) - \Phi\left(\frac{a-\mu}{\sigma}\right). \]

Problem 7.8. Suppose X ∼ N (10, 4). Find P (8 ≤ X ≤ 12).


Solution
Since X ∼ N(10, 4), we have σ = 2. Then
\[ P(8 \le X \le 12) = P\left(\frac{8-10}{2} \le Z \le \frac{12-10}{2}\right) = \Phi(1) - \Phi(-1) \approx 0.8413 - 0.1587 = 0.6826 \]
Thus, the probability that X lies between 8 and 12 is approximately 0.6826.
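The same answer can be obtained directly in Python, without standardizing by hand (a sketch; norm.cdf takes the mean and standard deviation as loc and scale):

from scipy.stats import norm

# X ~ N(10, 4), so sigma = 2
p = norm.cdf(12, loc=10, scale=2) - norm.cdf(8, loc=10, scale=2)
print(round(p, 4))  # 0.6827 (the table gives 0.6826 due to rounding)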
Problem 7.9. Suppose that Y ∼ N (10, 16). What is the value of P (|Y − 10| ≥
12)?


Solution
Given Y ∼ N (10, 16), we know that the mean µ = 10 and standard deviation
σ = 4. We need to compute P (|Y − 10| ≥ 12). This is equivalent to the
following:

\[ \begin{aligned} P(|Y - 10| \ge 12) &= P(Y \le -2) + P(Y \ge 22) \\ &= P\left(\frac{Y-\mu}{\sigma} \le \frac{-2-10}{4}\right) + P\left(\frac{Y-\mu}{\sigma} \ge \frac{22-10}{4}\right) \\ &= P(Z \le -3) + P(Z \ge 3) \\ &= \Phi(-3) + [1 - \Phi(3)] \\ &\approx 0.00135 + [1 - 0.99865] = 0.0027 \end{aligned} \]

Problem 7.10. Assume the heights of adult males in a certain population are
normally distributed with a mean of 70 inches and a standard deviation of 3
inches. What is the probability that a randomly selected adult male has a height
between 67 and 73 inches?

Solution
To find the probability that a randomly selected adult male has a height be-
tween 67 and 73 inches, we standardize the values and use the standard normal
distribution. Therefore,
\[ \begin{aligned} P(67 \le X \le 73) &= P\left(\frac{67-70}{3} \le Z \le \frac{73-70}{3}\right) \\ &= P\left(Z \le \frac{73-70}{3}\right) - P\left(Z \le \frac{67-70}{3}\right) \\ &= \Phi(1) - \Phi(-1) \approx 0.8413 - 0.1587 = 0.6826 \end{aligned} \]

Thus, approximately 68.26% of adult males have heights between 67 and 73


inches.

Problem 7.11. What is the value of a for which P (Z ≥ a) = 0.72, where


Z ∼ N (0, 1)?
Solution
We are given that P (Z ≥ a) = 0.72, where Z follows the standard normal
distribution. To find a, we first express the probability as:

P (Z ≥ a) = 0.72
This is equivalent to:


P (Z < a) = 1 − 0.72 = 0.28


Next, we use the standard normal distribution table to find the z-score
corresponding to a cumulative probability of 0.28. From the table, we find

a ≈ −0.58
Thus, the value of a is approximately −0.58.
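In code, inverse problems of this kind use the percent-point function norm.ppf, the inverse of the cdf (a minimal sketch):

from scipy.stats import norm

# P(Z >= a) = 0.72  is equivalent to  P(Z < a) = 0.28
a = norm.ppf(0.28)
print(round(a, 2))  # -0.58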
Problem 7.12. Given a standard normal distribution, find the value of k such
that

(a). P (Z > k) = 0.3015
(b). P (k < Z < −0.18) = 0.4197.
Solution (a). Finding k for P (Z > k) = 0.3015
The area to the right of k is 0.3015. Therefore, the area to the left of k
is:
1 − 0.3015 = 0.6985
Using the standard normal distribution table, we look up the value that
corresponds to an area of 0.6985 to the left. This value is:

k ≈ 0.52

(b). Finding k for P (k < Z < −0.18) = 0.4197

The total area to the left of −0.18 is



P (Z < −0.18) = 0.4286

The area between k and −0.18 is 0.4197. Therefore, the area to the left
of k is
0.4286 − 0.4197 = 0.0089
That is
P (Z < k) = 0.0089
Using the standard normal distribution table, we look up the value that
corresponds to an area of 0.0089 to the left. This value is

k ≈ −2.37


Problem 7.13. Suppose the heights of a large population of adult males in a


certain country follow a normal distribution with a mean height of 175 centime-
ters and a standard deviation of 6 centimeters. Calculate the height below which
the shortest 10% of adult males in this population fall.
Solution
Let X be the height of an adult male, which follows a normal distribution, i.e.,

X ∼ N(175, 6²).
We need to find the height x such that the cumulative probability up to x
is 0.10. In other words, we want to find x for which

P (X ≤ x) = 0.10.
To find the height below which the shortest 10% fall, we find the Z-score
for the 10th percentile, which is approximately z = −1.28. Using the formula
X = Z · σ + µ, we get
X = (−1.28) · 6 + 175 ≈ 167.32 cm
Thus, the height below which the shortest 10% of adult males fall is approx-
imately 167.32 cm.
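The same percentile can be read off directly from the scaled distribution rather than via the z-score (a sketch):

from scipy.stats import norm

# 10th percentile of N(175, 6^2)
h = norm.ppf(0.10, loc=175, scale=6)
print(round(h, 2))  # 167.31 (the rounded z = -1.28 gives 167.32)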

Problem 7.14. Subscribers to The Wall Street Journal Interactive Edition spend an average of 27 hours per week using the computer at work. Assume the normal distribution applies and that the standard deviation is 8 hours.
(a). What is the probability a randomly selected subscriber spends less than 10
hours using the computer at work?

(b). What percentage of the subscribers spends more than 35 hours per week
using the computer at work?
(c). A person is classified as a heavy user if he or she is in the upper 20%
in terms of hours of usage. How many hours must a subscriber use the
computer in order to be classified as a heavy user?
Solution
Let X be the number of hours a randomly selected subscriber spends using the
computer at work per week. Assume X follows a normal distribution:

X ∼ N(27, 8²)
where µ = 27 hours and σ = 8 hours.


(a). To find the probability that a subscriber spends less than 10 hours on
the computer, we need to calculate P (X < 10). First, convert this to
the standard normal variable Z:

\[ Z = \frac{X - \mu}{\sigma} = \frac{10 - 27}{8} = \frac{-17}{8} = -2.125 \]
Using standard normal distribution tables or software, find:

P (Z < −2.125) ≈ 0.0169

Thus, the probability that a randomly selected subscriber spends less

than 10 hours using the computer is approximately 0.0169 or 1.69%.

(b). To find the percentage of subscribers who spend more than 35 hours per
week, we need to calculate P (X > 35). Convert this to the standard
normal variable Z:
\[ Z = \frac{X - \mu}{\sigma} = \frac{35 - 27}{8} = \frac{8}{8} = 1 \]
Using standard normal distribution tables or software, find:

P (Z > 1) = 1 − P (Z ≤ 1) = 1 − Φ(1) ≈ 1 − 0.8413 = 0.1587

Thus, the percentage of subscribers who spend more than 35 hours per
week is approximately 15.87%.

(c). To classify as a heavy user, a subscriber must be in the upper 20% of


usage. This corresponds to the 80th percentile of the normal distribution. Find the z-score for the 80th percentile:

z0.80 ≈ 0.84

Convert this z-score to the corresponding number of hours x:

x = µ + zσ = 27 + 0.84 × 8

x = 27 + 6.72 = 33.72

Therefore, a subscriber must use the computer for at least 33.72 hours
per week to be classified as a heavy user.
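All three parts of this problem can be reproduced in a few lines (a sketch using a frozen scipy.stats.norm distribution; small differences from the table-based answers come from rounding):

from scipy.stats import norm

X = norm(loc=27, scale=8)  # weekly hours of computer use

print(round(X.cdf(10), 4))       # (a) P(X < 10), about 0.0168
print(round(1 - X.cdf(35), 4))   # (b) P(X > 35), about 0.1587
print(round(X.ppf(0.80), 2))     # (c) 80th percentile, about 33.73 hours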


7.4.5 Central Limit Theorem


The Central Limit Theorem (CLT) is a fundamental theorem in probability
theory and statistics. It states that the distribution of the sum (or average) of
a large number of independent, identically distributed (i.i.d.) random variables
approaches a normal distribution, regardless of the original distribution of the
variables. This theorem underpins many statistical methods and justifies the
use of the normal distribution in inferential statistics.
Theorem 7.7 (Central Limit Theorem). Let X1 , X2 , . . . , Xn be a sequence of
i.i.d. random variables with mean µ and variance σ 2 . Let X̄n denote the sample
mean:
\[ \bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i \]
n i=1
Then, as n approaches infinity, the distribution of the standardized sample mean
approaches the standard normal distribution:

X̄n − µ d
AF
or equivalently,
√ −
σ/ n

d
→ N (0, 1)

→ N (µ, σ 2 /n)
X̄n −
d
where −
→ denotes convergence in distribution.

Example: Central Limit Theorem


Consider a fair die. The random variable X representing the outcome of a single roll has mean µ = 3.5 and variance σ² = 35/12. Suppose we roll the die 30 times and compute the sample mean X̄₃₀. According to the CLT, the distribution of X̄₃₀ can be approximated by a normal distribution with mean 3.5 and standard deviation σ/√30:
\[ \bar{X}_{30} \xrightarrow{d} N\left(3.5,\ \frac{35}{12 \times 30}\right). \]
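A quick simulation makes the theorem tangible for this example; the sketch below averages 30 fair-die rolls many times and compares the empirical mean and variance of those averages with the CLT prediction:

import numpy as np

rng = np.random.default_rng(42)
n, reps = 30, 10_000

# Each row is one experiment: the average of 30 die rolls
means = rng.integers(1, 7, size=(reps, n)).mean(axis=1)

print(round(means.mean(), 3))  # close to 3.5
print(round(means.var(), 4))   # close to 35/(12*30), about 0.0972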
Problem 7.15. A researcher is studying the systolic blood pressure levels in
a population of adults. It is known that the systolic blood pressure levels are
normally distributed with a mean (µ) of 120 mmHg and a standard deviation
(σ) of 15 mmHg.

(a). What proportion of the population has systolic blood pressure levels be-
tween 110 mmHg and 130 mmHg?
(b). What is the probability that a randomly selected individual from this pop-
ulation has a systolic blood pressure level above 140 mmHg?


(c). If the researcher takes a random sample of 25 adults, what is the probabil-
ity that the sample mean systolic blood pressure is less than 115 mmHg?

Solution (a). Proportion of the Population between 110 mmHg and 130
mmHg
We need to find P (110 ≤ X ≤ 130) where X is the systolic blood
pressure level.
\[ \begin{aligned} P(110 \le X \le 130) &= P\left(\frac{110-120}{15} \le \frac{X-\mu}{\sigma} \le \frac{130-120}{15}\right) \\ &= P(-0.67 \le Z \le 0.67) \\ &= P(Z \le 0.67) - P(Z \le -0.67) \\ &= \Phi(0.67) - \Phi(-0.67) \\ &= 0.7486 - 0.2514 = 0.4972 \end{aligned} \]
So, approximately 49.72% of the population has systolic blood pressure levels between 110 mmHg and 130 mmHg.

(b). Probability of Blood Pressure Above 140 mmHg
We need to find P(X > 140).
 
\[ P(X > 140) = P\left(\frac{X-\mu}{\sigma} > \frac{140-120}{15}\right) = 1 - P(Z \le 1.33) \approx 1 - 0.9082 = 0.0918 \]
So, the probability that a randomly selected individual has a systolic blood pressure level above 140 mmHg is approximately 0.0918 or 9.18%.

(c). Probability of Sample Mean Less Than 115 mmHg


For a sample of n = 25, the distribution of the sample mean X̄ is normally distributed with mean µ_X̄ = µ and standard deviation σ_X̄ = σ/√n. Given:
\[ \mu_{\bar{X}} = 120, \qquad \sigma_{\bar{X}} = \frac{15}{\sqrt{25}} = 3 \]
We need to find P(X̄ < 115). Standardize the value to the standard normal distribution Z:
\[ Z = \frac{\bar{X} - \mu_{\bar{X}}}{\sigma_{\bar{X}}} = \frac{115 - 120}{3} = \frac{-5}{3} \approx -1.67 \]


Using the standard normal distribution table or a calculator, we find:

P (Z ≤ −1.67) ≈ 0.0475

So, the probability that the sample mean systolic blood pressure of 25
adults is less than 115 mmHg is approximately 0.0475 or 4.75%.
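These three probabilities can be cross-checked in code (a sketch; the exact values differ slightly from the table-based answers because the z-scores above were rounded to two decimals):

import math
from scipy.stats import norm

X = norm(loc=120, scale=15)                      # systolic blood pressure

print(round(X.cdf(130) - X.cdf(110), 4))         # (a) about 0.4950 (table: 0.4972)
print(round(1 - X.cdf(140), 4))                  # (b) about 0.0912 (table: 0.0918)

Xbar = norm(loc=120, scale=15 / math.sqrt(25))   # sampling distribution, n = 25
print(round(Xbar.cdf(115), 4))                   # (c) about 0.0478 (table: 0.0475)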

Problem 7.16. A study is conducted to measure the cholesterol levels in a pop-


ulation of adults. It is known that the cholesterol levels are normally distributed
with a mean (µ) of 200 mg/dL and a standard deviation (σ) of 25 mg/dL.

(a). What percentage of the population has cholesterol levels between 175 mg/dL and 225 mg/dL?
(b). What is the probability that a randomly selected individual has a choles-
terol level below 180 mg/dL?
(c). If a sample of 36 adults is taken, what is the probability that the sample
AF mean cholesterol level is greater than 210 mg/dL?

Solution
(a). Percentage of the Population between 175 mg/dL and 225 mg/dL
Let X be the cholesterol level. Then,
\[ \begin{aligned} P(175 \le X \le 225) &= P\left(\frac{175-200}{25} \le \frac{X-\mu}{\sigma} \le \frac{225-200}{25}\right) \\ &= P(-1 \le Z \le 1) = \Phi(1) - \Phi(-1) \\ &\approx 0.8413 - 0.1587 = 0.6826 \end{aligned} \]

So, approximately 68.26% of the population has cholesterol levels between 175
mg/dL and 225 mg/dL.

(b). Probability of Cholesterol Level Below 180 mg/dL

 
\[ P(X < 180) = P\left(\frac{X-\mu}{\sigma} < \frac{180-200}{25}\right) = P(Z \le -0.8) = \Phi(-0.8) \approx 0.2119 \]

So, the probability that a randomly selected individual has a cholesterol level
below 180 mg/dL is approximately 0.2119 or 21.19%.


(c). Probability of Sample Mean Greater Than 210 mg/dL


For a sample of n = 36, the distribution of the sample mean X̄ is normally distributed with mean µ_X̄ = µ and standard deviation σ_X̄ = σ/√n. Given:
\[ \mu_{\bar{X}} = 200, \qquad \sigma_{\bar{X}} = \frac{25}{\sqrt{36}} = \frac{25}{6} \approx 4.17 \]
We need to find P(X̄ > 210). Thus,
\[ P(\bar{X} > 210) = P\left(\frac{\bar{X} - \mu_{\bar{X}}}{\sigma_{\bar{X}}} > \frac{210 - \mu_{\bar{X}}}{\sigma_{\bar{X}}}\right) = P(Z > 2.40) = 1 - \Phi(2.40) \approx 1 - 0.9918 = 0.0082 \]

So, the probability that the sample mean cholesterol level of 36 adults is greater
than 210 mg/dL is approximately 0.0082 or 0.82%.

7.4.6 Python Code for Normal Distribution Characteristics
AF
The Normal (Gaussian) distribution is fundamental in statistics and is used to
model continuous random variables. Below is Python code that demonstrates
how to compute various characteristics of the Normal distribution.

Python Code
import numpy as np
from scipy.stats import norm

# Define the parameters
mu = 0      # Mean
sigma = 1   # Standard deviation

# Normal distribution
dist = norm(loc=mu, scale=sigma)

# 1. Probability Density Function (PDF)
x_values = np.linspace(-5, 5, 100)  # Values for the random variable
pdf_values = dist.pdf(x_values)
print("PDF values for x from -5 to 5:", pdf_values)

# 2. Cumulative Distribution Function (CDF)
cdf_values = dist.cdf(x_values)
print("CDF values for x from -5 to 5:", cdf_values)

# 3. Mean (Expected Value)
mean = dist.mean()
print("Mean (Expected Value):", mean)

# 4. Variance
variance = dist.var()
print("Variance:", variance)

# 5. Standard Deviation
std_dev = dist.std()
print("Standard Deviation:", std_dev)

# 6. Quantiles
quantiles = dist.ppf([0.25, 0.5, 0.75])  # 25th, 50th (median), and 75th percentiles
print("Quantiles at 0.25, 0.5, and 0.75:", quantiles)

# 7. Percentiles
percentiles = dist.ppf([0.1, 0.9])  # 10th and 90th percentiles
print("Percentiles at 0.1 and 0.9:", percentiles)

# 8. Moment Generating Function (MGF)
def mgf(t, mu, sigma):
    return np.exp(mu * t + 0.5 * (sigma**2) * t**2)

# MGF values at t = 0 and t = 1
mgf_values = [mgf(t, mu, sigma) for t in [0, 1]]
print("MGF values for t = 0 and t = 1:", mgf_values)
Explanations
• Probability Density Function (PDF):
■ The PDF provides the likelihood of each value. Compute these
values using dist.pdf(x_values).

• Cumulative Distribution Function (CDF):


■ The CDF gives the probability that the variable takes on a
value less than or equal to a given value. Compute this using
dist.cdf(x_values).

• Mean (Expected Value):

■ The mean (expected value) is µ. Compute this using dist.mean().


• Variance:
■ The variance is σ 2 . Compute this using dist.var().

• Standard Deviation:

■ The standard deviation is σ. Compute this using dist.std().

• Quantiles:

■ Quantiles are values at specific probabilities. Compute these


using dist.ppf() for the desired quantile probabilities.

• Percentiles:

■ Percentiles are specific types of quantiles. Compute these using


dist.ppf() for the desired percentile probabilities.

• Moment Generating Function (MGF):


■ The MGF for a Normal distribution is given by MX(t) = exp(µt + 0.5 · σ² · t²). Define a function mgf(t, mu, sigma) to compute MGF values.

7.5 Exercises
1. Given a normal distribution X ∼ N (µ, σ 2 ), answer the following ques-
tions:
(a) If µ = 10 and σ = 2, what is the probability that X is less than 8?
(b) Find the probability that X lies between 8 and 12.

(c) Determine the value a such that P (X ≤ a) = 0.975.


2. The standard normal distribution is a special case of the normal distribu-
tion where µ = 0 and σ = 1. Use the standard normal distribution table
(z-table) to answer the following:

(a) Find P (Z ≤ 1.645) for Z ∼ N (0, 1).


(b) Determine P (−1.96 ≤ Z ≤ 1.96).
(c) Calculate the value z such that P (Z ≥ z) = 0.05.
3. Given X ∼ N (20, 25), transform X to the standard normal distribution
Z and solve the following:

(a) Find the probability that X is less than 15.


(b) Determine the probability that X lies between 18 and 22.
(c) Find the value x such that P (X ≤ x) = 0.90.


4. Answer the following application-based questions:


(a) The heights of adult men in a certain population are normally dis-
tributed with a mean of 175 cm and a standard deviation of 10 cm.
What percentage of men are taller than 190 cm?
(b) A factory produces light bulbs with lifetimes that are normally dis-
tributed with a mean of 1200 hours and a standard deviation of 100
hours. What is the probability that a randomly selected light bulb
lasts between 1100 and 1300 hours?
(c) The scores on a standardized test are normally distributed with a
mean of 500 and a standard deviation of 100. What is the minimum score needed to be in the top 5%?
5. Given a population with mean µ = 60 and standard deviation σ = 15:
(a) If a sample of size 50 is taken, what is the expected value and stan-
dard deviation of the sample mean?
AF(b) Using the Central Limit Theorem, find the probability that the sam-
ple mean is less than 58.
(c) Calculate the probability that the sample mean lies between 59 and
62.
6. A study measures the cholesterol levels (in mg/dL) of a group of patients,
which are found to follow a normal distribution with a mean of 200 mg/dL
and a standard deviation of 20 mg/dL.
(a) What is the probability that a randomly selected patient has a choles-
terol level between 180 mg/dL and 220 mg/dL?
(b) What is the 95th percentile of the cholesterol levels?

(c) Calculate the variance and standard deviation of the cholesterol lev-
els.
(d) If a cholesterol level above 240 mg/dL is considered high, what pro-
portion of the patients have high cholesterol levels?
7. Consider a population where the heights of adult women are approxi-
mately normally distributed with a mean of 65 inches and a standard
deviation of 4 inches.

(a) Using Chebyshev’s Inequality, find the minimum percentage of women


whose heights are within 10 inches of the mean height.
(b) Suppose you want to ensure that at least 95% of the women fall
within a certain number of standard deviations from the mean. Using
Chebyshev’s Inequality, determine the minimum number of standard
deviations required for this guarantee.


(c) In the same population, if the height of a randomly selected woman


is 70 inches, what is the probability that her height deviates from
the mean by at least 5 inches, according to Chebyshev’s Inequality?

8. Suppose X and Y are independent random variables representing the


weights (in kg) of two different species of fish in a lake. X follows a
normal distribution with mean 2 kg and variance 0.25 kg², and Y follows a normal distribution with mean 3 kg and variance 0.36 kg².

(a) What is the probability that a fish of the first species weighs more
than 2.5 kg?
(b) What is the probability that a fish of the second species weighs between 2.5 kg and 3.5 kg?
(c) What is the expected value of the difference in weights D = X − Y ?
(d) What is the variance of the difference in weights D?

9. Consider a sample mean X̄ from a normal distribution with population
mean µ and variance σ 2 . Assume the sample size is n = 36.

(a) If µ = 50 and σ = 12, find the probability that the sample mean is
greater than 52.
(b) Calculate the probability that the sample mean lies between 48 and
51.
(c) Determine the value of x̄ such that P (X̄ ≤ x̄) = 0.95.

7.6 Concluding Remarks



In this chapter, we have examined several essential continuous probability dis-


tributions, highlighting their definitions, properties, and applications. The Uni-
form distribution provided a foundation for understanding equal likelihood over
a range of values, while the Exponential distribution illustrated the modeling
of time intervals between events in a Poisson process.

The Normal distribution, with its profound significance in statistical theory


and practice, was explored in detail. We discussed its properties, the concept of
the Standard Normal distribution, methods for finding areas under the normal
curve, and the Central Limit Theorem, which underscores the Normal distribu-
tion’s ubiquitous presence in statistical analysis.

Understanding these continuous distributions equips us with powerful tools


for modeling and analyzing data in numerous disciplines. By mastering these


concepts, we can better interpret real-world phenomena, make informed deci-


sions, and contribute to advancements in various fields.

As we continue our journey through probability and statistics, the knowledge


gained from studying continuous distributions will serve as a crucial foundation
for more complex analyses and applications.

7.7 Chapter Exercises


1. A random variable X follows a continuous uniform distribution between
2 and 5. Calculate the probability that X is less than 3.

2. If the length of a rod is uniformly distributed between 10 and 20 cm, what
is the probability that a randomly selected rod is longer than 15 cm?
3. The time between arrivals of customers at a coffee shop follows an expo-
nential distribution with a mean of 5 minutes. What is the probability
that the next customer will arrive within 3 minutes?
4. A radioactive substance has a half-life of 10 years. What is the probability
that a sample will decay in less than 5 years?
5. A set of test scores is normally distributed with a mean of 70 and a
standard deviation of 10. What is the probability that a randomly selected
score is greater than 85?
6. In a factory, the weight of bags of flour is normally distributed with a
mean of 50 kg and a standard deviation of 2 kg. What percentage of bags
weigh between 48 kg and 52 kg?

7. A company’s delivery times follow a normal distribution with a mean of 30


minutes and a standard deviation of 5 minutes. What is the probability
that a delivery takes longer than 40 minutes?
8. Compare the probabilities of an event occurring within a specified time
frame using both the exponential distribution (with a mean of 6 minutes)
and the continuous uniform distribution (from 0 to 12 minutes).

9. A car rental service finds that the time a customer spends renting a car
follows a normal distribution with a mean of 4 days and a standard devi-
ation of 1.5 days. What is the probability that a customer rents a car for
less than 3 days?


Table 7.1: A.2: Cumulative Distribution Function of the Standard Normal Distribution

z .00 .01 .02 .03 .04 .05 .06 .07 .08 .09

−3.4 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0002
−3.3 0.0005 0.0005 0.0005 0.0004 0.0004 0.0004 0.0004 0.0004 0.0004 0.0003
−3.2 0.0007 0.0007 0.0006 0.0006 0.0006 0.0006 0.0006 0.0005 0.0005 0.0005
−3.1 0.0010 0.0009 0.0009 0.0009 0.0008 0.0008 0.0008 0.0008 0.0007 0.0007
−3.0 0.0013 0.0013 0.0013 0.0012 0.0012 0.0011 0.0011 0.0011 0.0010 0.0010

−2.9 0.0019 0.0018 0.0018 0.0017 0.0016 0.0016 0.0015 0.0015 0.0014 0.0014
−2.8 0.0026 0.0025 0.0024 0.0023 0.0023 0.0022 0.0021 0.0021 0.0020 0.0019
−2.7 0.0035 0.0034 0.0033 0.0032 0.0031 0.0030 0.0029 0.0028 0.0027 0.0026
−2.6 0.0047 0.0045 0.0044 0.0043 0.0041 0.0040 0.0039 0.0038 0.0037 0.0036
−2.5 0.0062 0.0060 0.0059 0.0057 0.0055 0.0054 0.0052 0.0051 0.0049 0.0048
−2.4 0.0082 0.0080 0.0078 0.0075 0.0073 0.0071 0.0069 0.0068 0.0066 0.0064
−2.3 0.0107 0.0104 0.0102 0.0099 0.0096 0.0094 0.0091 0.0089 0.0087 0.0084
−2.2 0.0139 0.0136 0.0132 0.0129 0.0125 0.0122 0.0119 0.0116 0.0113 0.0110
−2.1 0.0179 0.0174 0.0170 0.0166 0.0162 0.0158 0.0154 0.0150 0.0146 0.0143
−2.0 0.0228 0.0222 0.0217 0.0212 0.0207 0.0202 0.0197 0.0192 0.0188 0.0183
−1.9 0.0287 0.0281 0.0274 0.0268 0.0262 0.0256 0.0250 0.0244 0.0239 0.0233
−1.8 0.0359 0.0351 0.0344 0.0336 0.0329 0.0322 0.0314 0.0307 0.0301 0.0294
−1.7 0.0446 0.0436 0.0427 0.0418 0.0409 0.0401 0.0392 0.0384 0.0375 0.0367
−1.6 0.0548 0.0537 0.0526 0.0516 0.0505 0.0495 0.0485 0.0475 0.0465 0.0455

−1.5 0.0668 0.0655 0.0643 0.0630 0.0618 0.0606 0.0594 0.0582 0.0571 0.0559
−1.4 0.0808 0.0793 0.0778 0.0764 0.0749 0.0735 0.0721 0.0708 0.0694 0.0681
−1.3 0.0968 0.0951 0.0934 0.0918 0.0901 0.0885 0.0869 0.0853 0.0838 0.0823
−1.2 0.1151 0.1131 0.1112 0.1093 0.1075 0.1056 0.1038 0.1020 0.1003 0.0985
−1.1 0.1357 0.1335 0.1314 0.1292 0.1271 0.1251 0.1230 0.1210 0.1190 0.1170
−1.0 0.1587 0.1562 0.1539 0.1515 0.1492 0.1469 0.1446 0.1423 0.1401 0.1379
−0.9 0.1841 0.1814 0.1788 0.1762 0.1736 0.1711 0.1685 0.1660 0.1635 0.1611
−0.8 0.2119 0.2090 0.2061 0.2033 0.2005 0.1977 0.1949 0.1922 0.1894 0.1867
−0.7 0.2420 0.2389 0.2358 0.2327 0.2296 0.2266 0.2236 0.2206 0.2177 0.2148
−0.6 0.2743 0.2709 0.2676 0.2643 0.2611 0.2578 0.2546 0.2514 0.2483 0.2451
−0.5 0.3085 0.3050 0.3015 0.2981 0.2946 0.2912 0.2877 0.2843 0.2810 0.2776
−0.4 0.3446 0.3409 0.3372 0.3336 0.3300 0.3264 0.3228 0.3192 0.3156 0.3121
−0.3 0.3821 0.3783 0.3745 0.3707 0.3669 0.3632 0.3594 0.3557 0.3520 0.3483
−0.2 0.4207 0.4168 0.4129 0.4090 0.4052 0.4013 0.3974 0.3936 0.3897 0.3859
−0.1 0.4602 0.4562 0.4522 0.4483 0.4443 0.4404 0.4364 0.4325 0.4286 0.4247
0.0 0.5000 0.4960 0.4920 0.4880 0.4840 0.4801 0.4761 0.4721 0.4681 0.4641

Table 7.2: A.3: Cumulative Distribution Function of the Standard Normal Distribution

z .00 .01 .02 .03 .04 .05 .06 .07 .08 .09

0 0.5000 0.5040 0.5080 0.5120 0.5160 0.5200 0.5240 0.5279 0.5319 0.5359
0.1 0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.5754
0.2 0.5793 0.5832 0.5871 0.5910 0.5948 0.5987 0.6026 0.6064 0.6103 0.6141
0.3 0.6179 0.6217 0.6255 0.6293 0.6331 0.6368 0.6406 0.6443 0.6480 0.6517
0.4 0.6554 0.6591 0.6628 0.6664 0.6700 0.6736 0.6772 0.6808 0.6844 0.6879

0.5 0.6915 0.6950 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157 0.7190 0.7224
0.6 0.7258 0.7291 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486 0.7518 0.7549
0.7 0.7580 0.7612 0.7642 0.7673 0.7704 0.7734 0.7764 0.7794 0.7823 0.7852
0.8 0.7881 0.7910 0.7939 0.7967 0.7996 0.8023 0.8051 0.8079 0.8106 0.8133
0.9 0.8159 0.8186 0.8212 0.8238 0.8264 0.8289 0.8315 0.8340 0.8365 0.8389
1.0 0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8600 0.8621
1.1 0.8643 0.8665 0.8686 0.8708 0.8729 0.8749 0.8770 0.8790 0.8810 0.8830
1.2 0.8849 0.8869 0.8888 0.8907 0.8925 0.8944 0.8962 0.8980 0.8997 0.9015
1.3 0.9032 0.9049 0.9066 0.9082 0.9099 0.9115 0.9131 0.9147 0.9162 0.9177
1.4 0.9192 0.9207 0.9222 0.9236 0.9251 0.9265 0.9279 0.9292 0.9306 0.9319
1.5 0.9332 0.9345 0.9357 0.9370 0.9382 0.9394 0.9406 0.9418 0.9430 0.9441
1.6 0.9452 0.9463 0.9474 0.9485 0.9495 0.9505 0.9515 0.9525 0.9535 0.9545
1.7 0.9554 0.9564 0.9573 0.9582 0.9591 0.9599 0.9608 0.9616 0.9625 0.9633
1.8 0.9641 0.9649 0.9656 0.9664 0.9671 0.9678 0.9686 0.9693 0.9700 0.9706

1.9 0.9713 0.9719 0.9726 0.9732 0.9738 0.9744 0.9750 0.9756 0.9762 0.9767
2.0 0.9773 0.9778 0.9783 0.9788 0.9793 0.9798 0.9803 0.9808 0.9812 0.9817
2.1 0.9821 0.9826 0.9830 0.9834 0.9838 0.9842 0.9846 0.9850 0.9854 0.9857
2.2 0.9861 0.9864 0.9868 0.9871 0.9875 0.9878 0.9881 0.9884 0.9887 0.9890
2.3 0.9893 0.9896 0.9898 0.9901 0.9904 0.9906 0.9909 0.9911 0.9913 0.9916
2.4 0.9918 0.9920 0.9922 0.9924 0.9927 0.9929 0.9931 0.9932 0.9934 0.9936
2.5 0.9938 0.9940 0.9942 0.9943 0.9945 0.9946 0.9948 0.9949 0.9951 0.9952
2.6 0.9953 0.9955 0.9956 0.9957 0.9958 0.9959 0.9960 0.9961 0.9962 0.9963
2.7 0.9964 0.9965 0.9966 0.9967 0.9968 0.9969 0.9970 0.9971 0.9972 0.9973
2.8 0.9974 0.9975 0.9976 0.9977 0.9977 0.9978 0.9979 0.9979 0.9980 0.9981
2.9 0.9981 0.9982 0.9982 0.9983 0.9984 0.9984 0.9985 0.9985 0.9986 0.9986
3.0 0.9987 0.9987 0.9987 0.9988 0.9988 0.9989 0.9989 0.9989 0.9990 0.9990

Chapter 8

Confidence Interval Estimation
8.1 Introduction
Interval estimation is a crucial concept in statistics and data science, providing
a range of values within which a population parameter is expected to lie. Unlike
point estimation, which gives a single value as an estimate of the population
parameter, interval estimation provides an interval, giving a measure of relia-
bility to the estimation.

In many real-world applications, it is not sufficient to estimate a parameter


with a single value due to the inherent variability in data. Interval estima-
tion addresses this issue by incorporating this variability and providing a range

within which the parameter is likely to fall. This chapter will explore various
methods of constructing confidence intervals for different population parame-
ters, including means, proportions, and variances.

8.2 Confidence Intervals


A confidence interval is a statistical tool that provides a range of values de-
rived from a data set, allowing researchers to estimate a population parameter.
Unlike a single point estimate, which may not accurately reflect the true value,
a confidence interval captures the uncertainty inherent in sampling. This inter-
val is constructed around a sample statistic, such as a mean or proportion, and
is designed to include the true population parameter at a specified confidence
level.

The level of confidence, typically expressed as a percentage such as 90%,


95%, or 99%, quantifies how certain we are that the interval contains the true


parameter. For instance, a 95% confidence interval suggests that if we were to


take many random samples and calculate an interval for each, approximately
95% of those intervals would encompass the actual population parameter. This
underscores the importance of understanding variability in data and offers a
more comprehensive view of the estimate’s reliability. Thus, confidence inter-
vals serve as a critical tool in statistics, enabling researchers to communicate
not just point estimates but also the associated uncertainty.

Confidence Interval (CI): A confidence interval is a range of values


derived from sample data that is likely to contain the true population
parameter with a specified level of confidence (e.g., 95%).

Let’s say we want to know the average height of adult women in a city. We
can’t measure every woman, so we take a sample of 100 women. From this sam-
ple, we find that the average height is 64 inches, and the variability in heights
is 3 inches.
AFNow, we’re pretty confident that the average height of all women in the
city is around 64 inches but we can’t be absolutely certain. There’s a chance
the true average is a bit higher or lower. To show this uncertainty, we use a
confidence interval. This is a range of values where we believe the true average
lies, with a certain level of confidence. For example, a 95% confidence interval
might be from 63.41 to 64.59 inches, meaning we’re 95% sure the true average
is between 63.41 and 64.59 inches.

In the next sections, we will explore the math and theory behind confidence
intervals.

8.3 Confidence Intervals for the Population Mean


When the population variance σ 2 is known, the confidence interval for the
population mean µ is given by:

x̄ ± ME ⇒ (x̄ − ME, x̄ + ME) (8.1)

where
\[ \text{ME} = z\left(\frac{\sigma}{\sqrt{n}}\right) \]
is called the margin of error (ME). This margin quantifies the uncertainty
associated with our estimate of the population mean. Here, x̄ represents the
sample mean, z is the critical value from the standard normal distribution
corresponding to P(Z < −z) = α/2 and P(Z > z) = α/2, σ is the population


standard deviation, and n is the sample size. We denote this critical value as
z = z_{α/2}, as illustrated in Figure 8.1. The term σ/√n represents the standard error
(SE) of x̄.
The confidence interval for the population mean, as described in Equation
8.1, can be expressed as:
\[ P\left(\bar{x} - z\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + z\frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha. \]

Figure 8.1: Standard normal distribution with shaded area α/2 to the right of z_{α/2} and α/2 to the left of −z_{α/2}.

Margin of Error (ME): The margin of error is a statistical measure


that indicates the potential difference between survey results and the true
population value. It provides a range around the sample estimate within
which the true value is likely to fall. For example, if a poll shows 60%
DR

support for a candidate with a margin of error of ±4%, the actual support
could be between 56% and 64%. The expression of ME of the confidence
interval for the mean is given in Figure 8.2.

The value of z can be found from the standard normal distribution, as given in Table 8.2. Table 8.1 summarizes important z-values for various confidence levels:

Table 8.1: z-values for common confidence levels

100(1 − α)%   α   α/2   1 − α/2   z = z_{α/2}

90% 0.10 0.05 0.95 1.645


95% 0.05 0.025 0.975 1.960
99% 0.01 0.005 0.995 2.576
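These critical values come directly from the inverse standard normal cdf; a minimal sketch that reproduces Table 8.1:

from scipy.stats import norm

for conf in (0.90, 0.95, 0.99):
    alpha = 1 - conf
    z = norm.ppf(1 - alpha / 2)  # z_{alpha/2}
    print(f"{conf:.0%} confidence: z = {z:.3f}")
# 90%: 1.645, 95%: 1.960, 99%: 2.576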


Table 8.2: The value of α/2 with corresponding values z_{α/2}.

α/2   z_{α/2}   α/2   z_{α/2}   α/2   z_{α/2}   α/2   z_{α/2}   α/2   z_{α/2}
0.001 3.090 0.021 2.034 0.041 1.739 0.061 1.546 0.081 1.398
0.002 2.878 0.022 2.014 0.042 1.728 0.062 1.538 0.082 1.392
0.003 2.748 0.023 1.995 0.043 1.717 0.063 1.530 0.083 1.385
0.004 2.652 0.024 1.977 0.044 1.706 0.064 1.522 0.084 1.379
0.005 2.576 0.025 1.960 0.045 1.695 0.065 1.514 0.085 1.372
0.006 2.512 0.026 1.943 0.046 1.685 0.066 1.506 0.086 1.366

0.007 2.457 0.027 1.927 0.047 1.675 0.067 1.499 0.087 1.359
0.008 2.409 0.028 1.911 0.048 1.665 0.068 1.491 0.088 1.353
0.009 2.366 0.029 1.896 0.049 1.655 0.069 1.483 0.089 1.347
0.010 2.326 0.030 1.881 0.050 1.645 0.070 1.476 0.090 1.341
0.011 2.290 0.031 1.866 0.051 1.635 0.071 1.468 0.091 1.335
0.012 2.257 0.032 1.852 0.052 1.626 0.072 1.461 0.092 1.329
0.013 2.226 0.033 1.838 0.053 1.616 0.073 1.454 0.093 1.323
0.014 2.197 0.034 1.825 0.054 1.607 0.074 1.447 0.094 1.317
0.015 2.170 0.035 1.812 0.055 1.598 0.075 1.440 0.095 1.311
0.016 2.144 0.036 1.799 0.056 1.589 0.076 1.433 0.096 1.305
0.017 2.120 0.037 1.787 0.057 1.580 0.077 1.426 0.097 1.299
0.018 2.097 0.038 1.774 0.058 1.572 0.078 1.419 0.098 1.293
0.019 2.075 0.039 1.762 0.059 1.563 0.079 1.412 0.099 1.287
0.020 2.054 0.040 1.751 0.060 1.555 0.080 1.405 0.100 1.282
When the population variance σ 2 is unknown and the sample size is less
than 30 (i.e., n < 30), the confidence interval for the population mean µ is
given by:
x̄ ± ME ⇒ (x̄ − ME, x̄ + ME)

where
\[ \text{ME} = t\left(\frac{s}{\sqrt{n}}\right), \]
and t = t_{α/2,ν} is the critical value from the t-distribution with ν = n − 1 degrees
of freedom, s is the sample standard deviation, and n is the sample size. The
critical value of t for the desired confidence level can be found in Table 8.3.

Remark 8.3.1. When the sample size is greater than or equal to 30 (i.e.,
n ≥ 30), the central limit theorem suggests that the sampling distribution of
the sample mean is approximately normally distributed, even if the population


Figure 8.2: The margin of error (ME) for the confidence interval for the population mean: if σ² is known, ME = z(σ/√n); if σ² is unknown and n ≥ 30, ME = z(s/√n); if σ² is unknown and n < 30, ME = t(s/√n).
distribution is unknown. In such cases, it is recommended to use the critical
value z = z_{α/2} instead of t = t_{α/2}.

However, if the sample size is smaller than 30 and the population variance
is unknown, it is generally recommended to use the critical value from the t-
distribution, denoted as t = t_{α/2}, which accounts for additional variability due to
the smaller sample size.
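The decision rule summarized in Figure 8.2 translates directly into code. Below is a sketch of a small helper; the function name and interface are illustrative, not part of any standard library:

import math
from scipy.stats import norm, t

def margin_of_error(n, conf, sigma=None, s=None):
    # Follows Figure 8.2: z with known sigma or a large sample, otherwise t
    alpha = 1 - conf
    if sigma is not None:                        # population variance known
        return norm.ppf(1 - alpha / 2) * sigma / math.sqrt(n)
    if n >= 30:                                  # unknown variance, large sample
        return norm.ppf(1 - alpha / 2) * s / math.sqrt(n)
    return t.ppf(1 - alpha / 2, df=n - 1) * s / math.sqrt(n)

print(round(margin_of_error(n=30, conf=0.90, sigma=5), 3))  # 1.502 (Problem 8.1)
print(round(margin_of_error(n=20, conf=0.95, s=15), 2))     # 7.02 (Problem 8.3)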

Problem 8.1. A researcher is studying the time spent on social media by



teenagers. Based on a previous study, the population variance is known to be 25


minutes². A sample of 30 teenagers shows a mean time of 120 minutes. Con-
struct the following confidence intervals for the population mean and compare
their widths and implications:

(i). 90% confidence interval


(ii). 95% confidence interval
(iii). 99% confidence interval

Solution
It is given that
• Sample mean (x̄) = 120 minutes

• Population variance σ² = 25 minutes², so population standard deviation σ = √25 = 5 minutes


Figure 8.3: A.4: Critical points t_{α/2,ν} of the t-distribution with its degrees of freedom (ν)

ν \ (α/2)   0.10   0.05   0.025   0.01   0.005   0.001   0.0005

1 3.078 6.314 12.706 31.821 63.657 318.31 636.62


2 1.886 2.920 4.303 6.965 9.925 22.326 31.598
3 1.638 2.353 3.182 4.541 5.841 10.213 12.924
4 1.533 2.132 2.776 3.747 4.604 7.173 8.610
5 1.476 2.015 2.571 3.365 4.032 5.893 6.869
6 1.440 1.943 2.447 3.143 3.707 5.208 5.959

T
7 1.415 1.895 2.365 2.998 3.499 4.785 5.408
8 1.397 1.860 2.306 2.896 3.355 4.501 5.041
9 1.383 1.833 2.262 2.821 3.250 4.297 4.781
10 1.372 1.812 2.228 2.764 3.169 4.144 4.587
11 1.363 1.796 2.201 2.718 3.106 4.025 4.437
12 1.356 1.782 2.179 2.681 3.055 3.930 4.318
13 1.350 1.771 2.160 2.650 3.012 3.852 4.221
14 1.345 1.761 2.145 2.624 2.977 3.787 4.140
15 1.341 1.753 2.131 2.602 2.947 3.733 4.073
16 1.337 1.746 2.120 2.583 2.921 3.686 4.015
17 1.333 1.740 2.110 2.567 2.898 3.646 3.965
18 1.330 1.734 2.101 2.552 2.878 3.610 3.922
19 1.328 1.729 2.093 2.539 2.861 3.579 3.883
20 1.325 1.725 2.086 2.528 2.845 3.552 3.850
21 1.323 1.721 2.080 2.518 2.831 3.527 3.819
22 1.321 1.717 2.074 2.508 2.819 3.505 3.792
23 1.319 1.714 2.069 2.500 2.807 3.485 3.767
24 1.318 1.711 2.064 2.492 2.797 3.467 3.745
25 1.316 1.708 2.060 2.485 2.787 3.450 3.725

26 1.315 1.706 2.056 2.479 2.779 3.435 3.707


27 1.314 1.703 2.052 2.473 2.771 3.421 3.690
28 1.313 1.701 2.048 2.467 2.763 3.408 3.674
29 1.311 1.699 2.045 2.462 2.756 3.396 3.659
30 1.310 1.697 2.042 2.457 2.750 3.385 3.646
40 1.303 1.684 2.021 2.423 2.704 3.307 3.551
60 1.296 1.671 2.000 2.390 2.660 3.232 3.460
120 1.289 1.658 1.980 2.358 2.617 3.160 3.373
∞ 1.282 1.645 1.960 2.326 2.576 3.090 3.291

• Sample size (n) = 30

(i). 90% Confidence Interval: For a 90% confidence interval, the critical
value z is approximately 1.645.
Calculating the margin of error:


   
\[ \text{Margin of Error} = z\left(\frac{\sigma}{\sqrt{n}}\right) = 1.645\left(\frac{5}{\sqrt{30}}\right) \approx 1.645 \times 0.912 \approx 1.502 \]
Thus, the 90% confidence interval is:

120 ± 1.502 ⇒ (118.498, 121.502)


The width of the 90% confidence interval is 121.502 - 118.498 = 3.004 minutes.

Comment: The 90% confidence interval suggests that we are 90% confident
that the true mean time spent on social media by all teenagers lies between

T
118.498 and 121.502 minutes.

(ii). 95% Confidence Interval: For a 95% confidence interval, the critical
value z is approximately 1.96.
Calculating the margin of error:
\[ \text{Margin of Error} = 1.96\left(\frac{5}{\sqrt{30}}\right) \approx 1.96 \times 0.912 \approx 1.790 \]
Thus, the 95% confidence interval is:
\[ 120 \pm 1.790 \Rightarrow (118.210, 121.790) \]


The width of the 95% confidence interval is 121.790 - 118.210 = 3.580 minutes.

Comment: The 95% confidence interval indicates that we can be 95%


confident that the true mean time spent on social media by all teenagers falls
between 118.210 and 121.790 minutes. There is a 5% chance that the true mean is outside this interval.
DR

(iii). 99% Confidence Interval: For a 99% confidence interval, the critical
value z is approximately 2.576.
Calculating the margin of error:
 
\[ \text{Margin of Error} = 2.576\left(\frac{5}{\sqrt{30}}\right) \approx 2.576 \times 0.912 \approx 2.352 \]
Thus, the 99% confidence interval is:

120 ± 2.352 ⇒ (117.648, 122.352)


The width of the 99% confidence interval is 122.352 - 117.648 = 4.704 minutes.

Comment: The 99% confidence interval suggests that we are 99% confident
that the true mean time spent on social media by all teenagers is between
117.648 and 122.352 minutes. There is only a 1% chance that the true mean is
not in this interval.


Comparison of Confidence Intervals


The widths of the confidence intervals are:
• 90% confidence interval: 3.004 minutes

• 95% confidence interval: 3.580 minutes

• 99% confidence interval: 4.704 minutes


As the confidence level increases, the width of the confidence interval also
increases. This is because a higher confidence level requires a larger critical
z-value, which in turn increases the margin of error (z · σ/√n).

T
Implications: A narrower interval provides a more precise estimate but with
lower confidence, while a wider interval provides higher confidence but with
less precision. The choice of confidence level depends on the desired level of
certainty and the acceptable margin of error for the study. For critical decisions,
a higher confidence level (e.g., 99%) might be preferred to minimize the risk
AF
of error, even if it results in a wider interval. For less critical situations, a
lower confidence level (e.g., 90% or 95%) might be acceptable if a more precise
estimate is desired.
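The three intervals in Problem 8.1 can also be generated in one loop (a sketch; using unrounded z-values shifts the limits by about 0.001 relative to the hand computation):

import math
from scipy.stats import norm

x_bar, sigma, n = 120, 5, 30

for conf in (0.90, 0.95, 0.99):
    z = norm.ppf(1 - (1 - conf) / 2)
    me = z * sigma / math.sqrt(n)
    print(f"{conf:.0%}: ({x_bar - me:.3f}, {x_bar + me:.3f}), width = {2*me:.3f}")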
Problem 8.2. A factory produces light bulbs with a known standard deviation
of 100 hours in their lifespan. A sample of light bulbs has an average lifespan
of 1000 hours. Construct a 99% confidence interval for the mean lifespan of the
light bulbs for the following sample sizes:
(i). n = 20
(ii). n = 30
(iii). n = 50

Solution: It is given that


• Population standard deviation (σ) = 100

• Sample mean (x̄) = 1000


For a 99% confidence level, the critical value z is:
z ≈ 2.576

(i). For n = 20:


\[ \text{ME} = 2.576 \cdot \frac{100}{\sqrt{20}} \approx 57.60 \]
\[ \text{CI} = 1000 \pm 57.60 \Rightarrow (942.40, 1057.60) \]
This interval suggests we are 99% confident that the true mean lifespan of the light bulbs is between 942.40 and 1057.60 hours.


(ii). For n = 30:


\[ \text{ME} = 2.576 \cdot \frac{100}{\sqrt{30}} \approx 47.03 \]
\[ \text{CI} = 1000 \pm 47.03 \Rightarrow (952.97, 1047.03) \]
This interval indicates we are 99% confident that the true mean lifespan of the
light bulbs falls between 952.97 and 1047.03 hours. As the sample size increases,
the margin of error decreases, resulting in a narrower confidence interval.

(iii). For n = 50:


\[ \text{ME} = 2.576 \cdot \frac{100}{\sqrt{50}} \approx 36.43 \]
\[ \text{CI} = 1000 \pm 36.43 \Rightarrow (963.57, 1036.43) \]
This interval shows we are 99% confident that the true mean lifespan of the
light bulbs is between 963.57 and 1036.43 hours. With a larger sample size, the
confidence interval is even narrower, reflecting greater precision in estimating
the population mean.

Comparison of Confidence Intervals For the three sample sizes, the con-
fidence intervals are:
• For n = 20: CI = (942.40, 1057.60)
• For n = 30: CI = (952.97, 1047.03)
• For n = 50: CI = (963.57, 1036.43)


Increasing the sample size results in a more precise estimate of the population
mean, as seen in the narrowing of the confidence intervals.
Factors Influencing the Width and Reliability of a Confidence Interval
The width and reliability of a confidence interval are influenced by several key
factors:
1. Sample Size (n): Larger sample sizes generally lead to narrower confi-
dence intervals because they reduce the standard error of the mean. This

increases the reliability of the estimate of the population parameter.


2. Confidence Level (1 − α): Higher confidence levels (e.g., 99% vs. 90%)
result in wider confidence intervals. This is because a higher confidence
level requires capturing a broader range of values to ensure the true pop-
ulation parameter is included.
3. Population Variability (σ 2 ): The greater the variability or spread in
the population (measured by the population variance), the wider the con-
fidence interval. High variability means there’s more uncertainty about
the population mean.


Increasing sample size, reducing variability, and lowering the confidence level
lead to a narrower confidence interval.
Problem 8.3. Consider a sample of 20 measurements of blood pressure with a
mean of 130 mmHg and a standard deviation of 15 mmHg. Construct a 95%
confidence interval for the population mean blood pressure.

Solution: In this case, we have, s = 15, x̄ = 130, n = 20. Since the sample
size is small (n < 30) and the population standard deviation is unknown, we will
use the t-distribution. The degrees of freedom (ν) is calculated as n − 1 = 19.
From the appropriate t-distribution table (e.g., see “Statistical Tables for the

FT
Student’s t-Distribution given in Table 8.3”), the critical value t for a 95% con-
fidence level with 19 degrees of freedom is approximately 2.093.

The margin of error (ME) is calculated as:


   
\[ \text{ME} = t\left(\frac{s}{\sqrt{n}}\right) = 2.093\left(\frac{15}{\sqrt{20}}\right) \approx 2.093 \times 3.354 \approx 7.02 \]

Thus, the 95% confidence interval is:

CI = x̄ ± ME = 130 ± 7.02
This gives us:
A
CI = (130 − 7.02, 130 + 7.02) ⇒ (122.98, 137.02)

We are 95% confident that the true population mean blood pressure is between
122.98 mmHg and 137.02 mmHg.
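SciPy can construct this t-interval in one call; a sketch using the frozen t distribution with the standard error as its scale:

import math
from scipy import stats

x_bar, s, n = 130, 15, 20
se = s / math.sqrt(n)
ci = stats.t(df=n - 1, loc=x_bar, scale=se).interval(0.95)
print(tuple(round(v, 2) for v in ci))  # (122.98, 137.02)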

Problem 8.4. A study measures the daily caffeine intake of 25 adults. The
R
sample has a mean intake of 200 mg and a standard deviation of 50 mg. Con-
struct a 90% confidence interval for the population mean daily caffeine intake.

Solution: Here, s = 50, x̄ = 200, n = 25. Since the sample size is small
(n < 30), we will use the t-distribution. The degrees of freedom (ν) is n−1 = 24.
From Table 8.3, the critical value t for a 90% confidence level with 24 degrees

of freedom is approximately 1.711.


The margin of error (ME) is calculated as:
   
\[ \text{ME} = t\left(\frac{s}{\sqrt{n}}\right) = 1.711\left(\frac{50}{\sqrt{25}}\right) = 1.711 \times 10 = 17.11 \]
Thus, the 90% confidence interval is:

CI = x̄ ± ME = 200 ± 17.11
This gives us


CI = (200 − 17.11, 200 + 17.11) ⇒ (182.89, 217.11)


We are 90% confident that the true population mean daily caffeine intake is
between 182.89 mg and 217.11 mg.
Problem 8.5. A study measures the effectiveness of a new diet on weight loss.
A sample of 25 participants who followed the diet for one month had an average
weight loss of 4.5 kg with a standard deviation of 1.2 kg. Construct a 90%
confidence interval for the mean weight loss.

Solution: In this case, we have, s = 1.2, x̄ = 4.5, n = 25. Since the sample

size is small (n < 30) and the population standard deviation is unknown, we will
use the t-distribution. The degrees of freedom (ν) is calculated as n − 1 = 24.
From the appropriate t-distribution table (e.g., see Table 8.3), the critical value
t for a 90% confidence level with 24 degrees of freedom is approximately 1.711.

Next, we calculate the margin of error (ME):


\[ \text{ME} = t\left(\frac{s}{\sqrt{n}}\right) = 1.711\left(\frac{1.2}{\sqrt{25}}\right) = 1.711 \times 0.24 = 0.41064 \]

Thus, the 90% confidence interval is:

CI = x̄ ± ME = 4.5 ± 0.41064
This results in:

CI = (4.5 − 0.41064, 4.5 + 0.41064) ⇒ (4.089, 4.911)


Thus, we are 90% confident that the true mean weight loss for participants

on the new diet is between approximately 4.09 kg and 4.91 kg.

Problem 8.6. A research team is evaluating the effect of a new vaccine on


reducing symptoms of a respiratory illness. In a sample of 30 vaccinated indi-
viduals, the average reduction in symptoms score is 6.3 with a standard deviation
of 1.8. Construct a 99% confidence interval for the mean reduction in symptoms
score.

Solution: In this case, we have, s = 1.8, x̄ = 6.3, n = 30. Since the sample
size is large (n ≥ 30), we can use the normal distribution. The critical value z
for a 99% confidence level can be found from the standard normal distribution
table, which is approximately 2.576.
The margin of error (ME) is
   
\[ \text{ME} = z\left(\frac{s}{\sqrt{n}}\right) = 2.576\left(\frac{1.8}{\sqrt{30}}\right) = 2.576 \times 0.328 \approx 0.845 \]


Thus, the 99% confidence interval is:

CI = x̄ ± ME = 6.3 ± 0.845
This results in:

CI = (6.3 − 0.845, 6.3 + 0.845) ⇒ (5.455, 7.145)


We are 99% confident that the true mean reduction in symptoms score for
vaccinated individuals is between approximately 5.46 and 7.15.

Problem 8.7. A biologist studies the effect of a new fertilizer on plant growth.

A sample of 10 plants is measured for growth in centimeters over a month. The
growth measurements are as follows:

12.5, 15.3, 14.0, 16.7, 13.5, 14.8, 15.1, 13.2, 17.0, 12.0

(i). Calculate the sample mean (x̄) and sample standard deviation (s) of the
AF growth measurements.
(ii). Using a 95% confidence level, construct a confidence interval for the mean
growth of the plants.

Solution
(i). Given the growth measurements of 10 plants:

12.5, 15.3, 14.0, 16.7, 13.5, 14.8, 15.1, 13.2, 17.0, 12.0
Using Table 8.3, the sample mean (x̄) is

x̄ = 144.1/10 = 14.41


Table 8.3: Growth Measurements and Their Squares

i   xᵢ   xᵢ²
1 12.5 156.25
2 15.3 234.09
3 14.0 196.00
4 16.7 278.89
5 13.5 182.25
6 14.8 219.04
7 15.1 228.01
8 13.2 174.24

T
9 17.0 289.00
10 12.0 144.00
Total 144.1 2101.77
AF
Using the formula for the sample variance, we have

s² = (1/(n−1)) (Σxᵢ² − n x̄²) = (1/(10−1)) (2101.77 − 10 × 14.41²) ≈ 2.81

Hence, the sample standard deviation (s) is

s = √s² = √2.81 ≈ 1.68
(ii). 95% Confidence Interval: For a 95% confidence level, the critical
value t = 2.262 for degrees of freedom ν = 9. So, the margin of error is
calculated as follows:

ME = t × (s/√n) = 2.262 × (1.68/√10) ≈ 2.262 × 0.531 ≈ 1.20
The confidence interval is given by:

CI = x̄ ± ME
Thus, we have:

CI = 14.41 ± 1.20 ⇒ (13.21, 15.61)


We are 95% confident that the true mean growth of the plants lies between
13.21 cm and 15.61 cm.


Python Code

import numpy as np
import scipy.stats as stats

# Growth measurements in centimeters
growth_measurements = [12.5, 15.3, 14.0, 16.7, 13.5, 14.8,
                       15.1, 13.2, 17.0, 12.0]

# (i) Calculate the sample mean and sample standard deviation
sample_mean = np.mean(growth_measurements)
sample_std_dev = np.std(growth_measurements, ddof=1)  # Sample standard deviation

print(f"Sample Mean: {sample_mean:.2f} cm")
print(f"Sample Standard Deviation (s): {sample_std_dev:.2f} cm")

# (ii) Construct a 95% confidence interval for the mean growth
confidence_level = 0.95
n = len(growth_measurements)  # Sample size

# Critical t-value for the given confidence level
t_critical = stats.t.ppf((1 + confidence_level) / 2, df=n - 1)

# Standard error of the mean
standard_error = sample_std_dev / np.sqrt(n)

# Margin of error
margin_of_error = t_critical * standard_error

# Confidence interval
ci_lower = sample_mean - margin_of_error
ci_upper = sample_mean + margin_of_error

print(f"95% Confidence Interval for the mean growth: ({ci_lower:.2f}, {ci_upper:.2f}) cm")

Listing 8.1: The 95% Confidence Interval for the Population Mean.

Problem 8.8. A study is conducted to determine the average cholesterol level


in a population after a new treatment. A sample of 50 patients shows an average
cholesterol level of 190 mg/dL with a standard deviation of 15 mg/dL. Construct
a 95% confidence interval for the mean cholesterol level.


Solution: Here, s = 15, x̄ = 190, n = 50. Since the sample size is large
(n ≥ 30), we will use the normal distribution. The critical value z for a 95%
confidence level is approximately 1.96.

The margin of error (ME) is calculated as:

ME = z × (s/√n) = 1.96 × (15/√50) ≈ 1.96 × 2.121 ≈ 4.16
Thus, the 95% confidence interval is:

CI = x̄ ± ME = 190 ± 4.16

This gives us:

(190 − 4.16, 190 + 4.16) ⇒ (185.84, 194.16)


We are 95% confident that the true population mean cholesterol level is between
185.84 mg/dL and 194.16 mg/dL.
Problem 8.9. A clinical trial was conducted to test the effectiveness of a new
drug in reducing blood pressure. A sample of 40 patients who received the drug
had an average reduction in blood pressure of 8 mmHg with a sample standard
deviation of 2.5 mmHg. Construct a 95% confidence interval for the mean
reduction in blood pressure.

Solution: Here, s = 2.5, x̄ = 8 and n = 40. Since n ≥ 30, we use the normal
distribution. The critical value for a 95% confidence level is z ≈ 1.96.

Calculate the margin of error (ME):

ME = z × (s/√n) = 1.96 × (2.5/√40) ≈ 1.96 × 0.395 ≈ 0.774
The 95% confidence interval is:

x̄ ± ME = 8 ± 0.774 ⇒ (7.226, 8.774)


We are 95% confident that the true mean reduction in blood pressure is
between approximately 7.23 mmHg and 8.77 mmHg.

8.4 Confidence Intervals for Variances and Standard Deviations
In data science, estimating the variance and standard deviation of a popula-
tion is essential for understanding the variability in data. Confidence intervals
provide a range of plausible values for these parameters based on sample data.


8.4.1 Confidence Interval for Variance


To compute a confidence interval for the population variance (σ²), we utilize
the chi-square distribution. The confidence interval can be expressed as:

( (n − 1)s² / χ²_{α/2, n−1} ,  (n − 1)s² / χ²_{1−α/2, n−1} )        (8.2)

where:

• n = sample size

• s² = sample variance

• χ²_{α/2, n−1} = the upper chi-square critical value, with area α/2 in the
right tail and n − 1 degrees of freedom

• χ²_{1−α/2, n−1} = the lower chi-square critical value, with area 1 − α/2 in
the right tail and n − 1 degrees of freedom
AFThe critical value of χ2 can be calculated from the standard chi-square
distribution. Some values are given in Table 8.4.

8.4.2 Confidence Interval for Standard Deviation


To derive the confidence interval for the population standard deviation (σ), we
simply take the square root of the endpoints of the confidence interval for the
variance:

( √[(n − 1)s² / χ²_{α/2, n−1}] ,  √[(n − 1)s² / χ²_{1−α/2, n−1}] )        (8.3)

Important Notes
• These methods assume that the sample comes from a normally dis-
tributed population.

• The confidence interval will become wider with a decrease in sample


size or an increase in the confidence level.
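
As a quick illustration of equations (8.2) and (8.3), the sketch below computes
both intervals with scipy's chi-square quantile function, using the same inputs
as Problem 8.10 that follows (n = 25, s² = 20):

import numpy as np
import scipy.stats as stats

# Sample size and sample variance (the inputs of Problem 8.10)
n, s2 = 25, 20
alpha = 0.05  # for a 95% confidence level

# Chi-square critical values with n - 1 degrees of freedom
chi2_upper = stats.chi2.ppf(1 - alpha / 2, df=n - 1)  # area alpha/2 in the right tail
chi2_lower = stats.chi2.ppf(alpha / 2, df=n - 1)      # area 1 - alpha/2 in the right tail

# Equation (8.2): confidence interval for the variance
var_lower = (n - 1) * s2 / chi2_upper
var_upper = (n - 1) * s2 / chi2_lower

# Equation (8.3): square roots give the interval for the standard deviation
print(f"Variance CI: ({var_lower:.2f}, {var_upper:.2f})")   # approximately (12.19, 38.71)
print(f"Std. dev. CI: ({np.sqrt(var_lower):.2f}, {np.sqrt(var_upper):.2f})")  # approximately (3.49, 6.22)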

Problem 8.10. A sample of 25 measurements yields a sample variance of 20.


Construct a 95% confidence interval for the population variance and standard
deviation.

330
CHAPTER 8. CONFIDENCE INTERVAL ESTIMATION

Solution: Here, s² = 20, n = 25, and for a 95% confidence level with 24
degrees of freedom, χ²_{α/2, 24} ≈ 39.36 and χ²_{1−α/2, 24} ≈ 12.40.
The confidence interval for the variance is:

( (24 × 20)/39.36 ,  (24 × 20)/12.40 ) = (12.20, 38.71)

Thus, the 95% confidence interval for the population variance is (12.20, 38.71)
and the 95% confidence interval for the standard deviation is approximately:

( √12.20 , √38.71 ) = (3.49, 6.22).
Problem 8.11. A biostatistical study is conducted with a sample of 10 pa-

tients, and the following blood pressure measurements (in mmHg) are recorded:
120, 125, 130, 135, 140, 145, 150, 155, 160, 165. Calculate the variance and stan-
dard deviation of the blood pressure measurements. Also, compute a 95% con-
fidence interval for the variance.

Solution: First, calculate the mean x̄:

x̄ = (120 + 125 + 130 + 135 + 140 + 145 + 150 + 155 + 160 + 165)/10 = 142.5

Next, compute the variance:

s² = (1/(n−1)) (Σxᵢ² − n x̄²) = (1/(10−1)) (205125 − 10 × 142.5²) ≈ 229.17

i   xᵢ   xᵢ²
1 120 14400
2 125 15625
3 130 16900
4 135 18225
5 140 19600
6 145 21025
7 150 22500
8 155 24025
9 160 25600
10 165 27225
Total 1425 205125
For the 95% confidence interval for the variance, use the Chi-square distribution:

CI_{σ²} = [ (n − 1)s² / χ²_{α/2, n−1} ,  (n − 1)s² / χ²_{1−α/2, n−1} ]


where α = 0.05, n = 10, and the degrees of freedom ν = n − 1 = 9.

Using Chi-square critical values:

χ²_{0.025, 9} ≈ 19.023 and χ²_{0.975, 9} ≈ 2.700

CI_{σ²} = ( 9 × 229.17/19.023 ,  9 × 229.17/2.700 ) = [108.42, 763.89]

The 95% confidence interval for the variance of blood pressure measurements,
[108.42, 763.89], indicates that we are 95% confident the true population vari-
ance lies within this range. This suggests that there is significant variability
in blood pressure readings among patients, with potential values for variance
reflecting both low and high levels of dispersion.

Python Code

import numpy as np
import scipy.stats as stats

# Blood pressure measurements in mmHg
blood_pressure = [120, 125, 130, 135, 140, 145, 150, 155, 160, 165]

# Calculate variance and standard deviation
variance = np.var(blood_pressure, ddof=1)  # Sample variance
standard_deviation = np.std(blood_pressure, ddof=1)  # Sample standard deviation

print("Variance: {:.2f} sq. mmHg".format(variance))
print("Standard Deviation: {:.2f} mmHg".format(standard_deviation))

# Compute the 95% confidence interval for the variance
n = len(blood_pressure)  # Sample size
alpha = 0.05  # Significance level

# Chi-squared critical values
chi2_lower = stats.chi2.ppf(alpha / 2, df=n - 1)
chi2_upper = stats.chi2.ppf(1 - alpha / 2, df=n - 1)

# Confidence interval for the variance
ci_lower = (n - 1) * variance / chi2_upper
ci_upper = (n - 1) * variance / chi2_lower

print("95% Confidence Interval for the variance: ({:.2f}, {:.2f}) sq. mmHg".format(ci_lower, ci_upper))
Listing 8.2: The 95% Confidence Interval for the Population Variance.


Problem 8.12. A clinical trial measures the time (in minutes) taken for a
specific treatment to have an effect in 6 patients, yielding the following times:
22, 27, 25, 30, 24, 28. Determine the variance and standard deviation of the time
data. Also, compute a 99% confidence interval for the variance.

Solution: First, calculate the mean x̄:

x̄ = (22 + 27 + 25 + 30 + 24 + 28)/6 = 26
Next, compute the variance:

s² = (1/(n−1)) Σ(xᵢ − x̄)²
   = (1/(6−1)) [(22 − 26)² + (27 − 26)² + (25 − 26)² + (30 − 26)² + (24 − 26)² + (28 − 26)²]
   = (1/5) [16 + 1 + 1 + 16 + 4 + 4]
   = 42/5 = 8.4

The standard deviation is:

s = √s² = √8.4 ≈ 2.90
For the 99% confidence interval for the variance, use the Chi-square distri-
bution:

CI_{σ²} = [ (n − 1)s² / χ²_{α/2, n−1} ,  (n − 1)s² / χ²_{1−α/2, n−1} ]

where α = 0.01, n = 6, and the degrees of freedom ν = n − 1 = 5.


Using Chi-square critical values:

χ²_{0.005, 5} ≈ 16.75 and χ²_{0.995, 5} ≈ 0.4117

Then, the confidence interval is:

CI_{σ²} = ( 5 × 8.4/16.75 ,  5 × 8.4/0.4117 ) = [2.51, 102.02]
The 99% confidence interval for the variance of the treatment effect times,
[2.51, 102.02], indicates that we are 99% confident the true population variance
lies within this range. This suggests that there is considerable variability in the
time taken for the treatment to have an effect among patients.


Python Code

import numpy as np
import scipy.stats as stats

# Given times
data = np.array([22, 27, 25, 30, 24, 28])

# Calculate mean
mean = np.mean(data)

# Calculate variance (using ddof=1 for sample variance)
variance = np.var(data, ddof=1)

# Calculate standard deviation
std_dev = np.std(data, ddof=1)

# Degrees of freedom
n = len(data)
df = n - 1

# Chi-square critical values for 99% confidence interval
alpha = 0.01
chi2_lower = stats.chi2.ppf(alpha / 2, df)
chi2_upper = stats.chi2.ppf(1 - alpha / 2, df)

# Confidence interval for variance
ci_lower = (df * variance) / chi2_upper
ci_upper = (df * variance) / chi2_lower

# Output results
print(f"Mean: {mean:.2f}")
print(f"Variance: {variance:.2f}")
print(f"Standard Deviation: {std_dev:.2f}")
print(f"Critical Values: ({chi2_lower:.2f}, {chi2_upper:.2f})")
print(f"99% Confidence Interval for Variance: ({ci_lower:.2f}, {ci_upper:.2f})")

Listing 8.3: The 99% Confidence Interval for the Population Variance.

8.5 Confidence Intervals for Population Proportions

A confidence interval for a population proportion estimates the true proportion
of a population based on sample data. It is calculated using the sample
proportion p̂ = x/n, where x is the number of successes and n is the sample size.

For a given confidence level (e.g., 95%), the critical value z is determined
from the standard normal distribution. The margin of error (ME) is calculated
as:

ME = z √[p̂(1 − p̂)/n]

where the quantity

SE = √[p̂(1 − p̂)/n]

is called the standard error (SE) of p̂. The confidence interval is then constructed
as:

p̂ ± ME ⇒ (p̂ − ME, p̂ + ME)

This interval provides a range within which we can be confident that the
true population proportion lies, based on the sample data. Validity conditions
include having a sufficiently large sample size (n ≥ 30) and ensuring np ≥ 5
and n(1 − p) ≥ 5.

Confidence intervals are crucial for assessing uncertainty and making in-
formed decisions based on sample estimates.
Problem 8.13. In a survey of 500 voters, 300 indicated they would vote for
candidate A. Construct a 95% confidence interval for the proportion of voters
who support candidate A.

Solution: The sample size is n = 500 and the number of voters for candidate
A is x = 300. The sample proportion is p̂ = 300/500 = 0.6. For a 95% confidence
level, the critical value is z ≈ 1.96.

The margin of error (ME) is calculated as:

ME = z √[p̂(1 − p̂)/n] = 1.96 × √(0.6 × 0.4/500) ≈ 1.96 × 0.0219 ≈ 0.0429

Thus, the 95% confidence interval is:

p̂ ± ME = 0.6 ± 0.0429 ⇒ (0.5571, 0.6429)

We are 95% confident that the true proportion of voters supporting candi-
date A is between 55.71% and 64.29%.
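
This interval can be verified in a few lines of Python; the sketch below uses
only the counts given in the problem:

import numpy as np
import scipy.stats as stats

# Survey counts from the problem
n, x = 500, 300
p_hat = x / n  # 0.6

# 95% critical value and standard error
z = stats.norm.ppf(0.975)              # approximately 1.96
se = np.sqrt(p_hat * (1 - p_hat) / n)  # approximately 0.0219

me = z * se
print(f"95% CI: ({p_hat - me:.4f}, {p_hat + me:.4f})")  # approximately (0.5571, 0.6429)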

Problem 8.14. In a survey of 150 patients, 45 reported that they are satisfied
with their current medication. Estimate the proportion of patients satisfied with
their medication and calculate a 95% confidence interval for this proportion.


Solution: Let p̂ be the sample proportion of satisfied patients. Calculate p̂:

p̂ = 45/150 = 0.30

To compute the 95% confidence interval for the proportion, use the formula:

p̂ ± z √[p̂(1 − p̂)/n]

where n = 150 and z ≈ 1.96 for a 95% confidence level.
Compute the standard error (SE):

SE = √[p̂(1 − p̂)/n] = √(0.30 × 0.70/150) ≈ 0.037

Calculate the confidence interval:

0.30 ± 1.96 × 0.037 = 0.30 ± 0.072 ⇒ [0.228, 0.372]

So, the 95% confidence interval for the population proportion is [0.228, 0.372].
AF
Problem 8.15. A study on the effectiveness of a new drug finds that out of 200
patients, 50 show significant improvement. Estimate the proportion of patients
who show improvement and find a 90% confidence interval for this proportion.

Solution: Let p̂ be the sample proportion of patients showing improvement.
Calculate p̂:

p̂ = 50/200 = 0.25

To compute the 90% confidence interval for the proportion, use the formula:

p̂ ± z √[p̂(1 − p̂)/n]

where n = 200 and z ≈ 1.645 for a 90% confidence level.

Compute the standard error (SE):

SE = √[p̂(1 − p̂)/n] = √(0.25 × 0.75/200) ≈ 0.031

Calculate the confidence interval:

0.25 ± 1.645 × 0.031 = 0.25 ± 0.050 ⇒ [0.200, 0.300]

So, the 90% confidence interval for the proportion is [0.200, 0.300].


Python Code

import numpy as np
import scipy.stats as stats

# Given data
n = 200  # total number of patients
x = 50   # number of patients showing improvement

# Estimate the proportion
p_hat = x / n

# Confidence level
confidence_level = 0.90
alpha = 1 - confidence_level

# Standard error
se = np.sqrt((p_hat * (1 - p_hat)) / n)

# Z-score for the given confidence level
z_score = stats.norm.ppf(1 - alpha / 2)

# Margin of error
margin_of_error = z_score * se

# Confidence interval
lower_bound = p_hat - margin_of_error
upper_bound = p_hat + margin_of_error

# Output results
print(f"Estimated proportion of patients showing improvement: {p_hat:.4f}")
print(f"90% Confidence Interval: [{lower_bound:.4f}, {upper_bound:.4f}]")

Listing 8.4: The 90% Confidence Interval for the Population Proportion.

Problem 8.16. In a clinical trial, out of 80 participants, 28 reported a side


effect from the medication. Estimate the proportion of participants experiencing
side effects and compute a 99% confidence interval for this proportion.

Solution: Let p̂ be the sample proportion of participants experiencing side
effects. Calculate p̂:

p̂ = 28/80 = 0.35

To compute the 99% confidence interval for the proportion, use the formula:

p̂ ± z √[p̂(1 − p̂)/n]
where n = 80 and z ≈ 2.576 for a 99% confidence level.

Compute the standard error (SE):

SE = √[p̂(1 − p̂)/n] = √(0.35 × 0.65/80) ≈ 0.053

Calculate the confidence interval:

0.35 ± 2.576 × 0.053 = 0.35 ± 0.137 ⇒ [0.213, 0.487]

So, the 99% confidence interval for the proportion is [0.213, 0.487].
AF
8.6 Sample Size Estimation
Sample size estimation is crucial for ensuring that studies and experiments
in business are designed to provide reliable and valid results. This section
covers sample size estimation for estimating a population mean and a population
proportion, including business-related examples.

8.6.1 Sample Size for Estimating a Population Mean


When estimating the mean of a population, the required sample size depends
on the desired level of confidence, the acceptable margin of error, and the stan-
dard deviation of the population.

Estimating Sample Size with Known Population Variance


To estimate the sample size n required to achieve a desired margin of error E
with a confidence level of 1 − α, use the following formula:


n = (z × σ / E)²        (8.4)

where:
• z = z_{1−α/2} is the critical value (see Table 8.1) from the standard
normal distribution corresponding to a confidence level of 1 − α,

• σ is the standard deviation of the population,

338
CHAPTER 8. CONFIDENCE INTERVAL ESTIMATION

• E is the desired margin of error.


This formula allows researchers to determine the minimum sample size
needed to achieve a specified level of confidence while ensuring that the es-
timated mean falls within the desired margin of error. It is particularly useful
in cases where prior studies provide reliable estimates of the population stan-
dard deviation.
Problem 8.17. A company wants to estimate the average monthly spending
of its customers on their products. The company aims for a margin of error of
$10 with 95% confidence. If the standard deviation of monthly spending is $50,
determine the required sample size.

Solution: It is given that

σ = 50, E = 10, z = 1.96 (for 95% confidence level)

Calculate the sample size:

n = (1.96 × 50/10)² = (98/10)² = (9.8)² = 96.04
AF
Rounding up, the required sample size is 97.

Python Code

import scipy.stats as stats
import math

def calculate_sample_size(margin_of_error, standard_deviation, confidence_level):
    # Z-score for the given confidence level
    z_score = stats.norm.ppf((1 + confidence_level) / 2)

    # Calculate required sample size
    required_sample_size = (z_score * standard_deviation / margin_of_error) ** 2
    return math.ceil(required_sample_size)  # Round up to the nearest whole number

# Example usage
margin_of_error = 10     # Desired margin of error ($)
standard_deviation = 50  # Population standard deviation ($)
confidence_level = 0.95  # Confidence level

required_size = calculate_sample_size(margin_of_error, standard_deviation, confidence_level)
print("Required Sample Size: {}".format(required_size))

To use equation (8.4), we need a value for the population standard deviation
σ. Even if we do not know σ, we can still use equation (8.4) if we have a preliminary
value or planning value for it. Here are some practical ways to find this value:

1. Take the estimated population standard deviation from previous studies


as the value for σ.
2. Run a pilot study to gather some initial data. The standard deviation
from this sample can be used as the value for σ.

3. Use judgment or a “best guess” for the value of σ. For example, estimate
the largest and smallest values in the population. The difference between
these values gives you a range. A common suggestion is to divide this
range by 4 to get a rough estimate of the standard deviation, which can
then serve as the value for σ.
Problem 8.18. A market research firm wants to estimate the average monthly
revenue generated by small businesses. A pilot study estimated the standard
deviation of revenue to be $15,000. To ensure a margin of error of $2,000 with
a 90% confidence level, determine the required sample size.

Solution: It is given that

s = 15000, E = 2000, z = 1.645 (for 90% confidence level)

Calculate the sample size:

n = (1.645 × 15000/2000)² = (24675/2000)² = (12.3375)² ≈ 152.21

Rounding up, the required sample size is 153.


Problem 8.19. A company wants to estimate the average amount of time
employees spend on training each month. They conduct a pilot study with 15
employees and find that the standard deviation of training time is 8 hours. To
achieve a margin of error of 2 hours with a 95% confidence level, determine the
required sample size.


Solution: It is given that

s = 8 hours, E = 2 hours.

For 95% confidence, z = 1.96. Then

n = (1.96 × 8/2)² = (15.68/2)² = (7.84)² ≈ 61.47

Since the sample size must be a whole number, we round up to the next whole
number. Thus, the required sample size is n = 62.

Problem 8.20. The range for a set of data is estimated to be 36.

(a). What is the planning value for the population standard deviation?

(b). At 95% confidence, how large a sample would provide a margin of error
of 3?
AF
(c). At 95% confidence, how large a sample would provide a margin of error
of 2?

Solution: (a.)
The estimated population standard deviation (σ) can be calculated using
the range:

σ ≈ Range/4

Given that the range is 36:

σ ≈ 36/4 = 9
(b.) The formula for the sample size (n) is:

n = (z × σ / E)²
Where:
• z is the Z-score for 95% confidence (Z ≈ 1.96)

• σ is the estimated population standard deviation

• E is the margin of error


Using σ = 9 and E = 3:

n = (1.96 × 9/3)² = (17.64/3)² = (5.88)² ≈ 34.57
Rounding up, we find:

n ≈ 35
(c.) Using the same formula with E = 2:

n = (1.96 × 9/2)² = (17.64/2)² = (8.82)² ≈ 77.79

Rounding up, we find:

n ≈ 78

8.6.2 Sample Size for Estimating a Population Proportion


When estimating a population proportion, the sample size depends on the de-
sired confidence level, margin of error, and the estimated proportion.

To estimate the sample size n required to achieve a desired margin of error


E with a confidence level 1 − α, use the following formula:

n = z² p(1 − p) / E²
where,

• z = z_{α/2} is the critical value from the standard normal distribution
corresponding to a confidence level of 1 − α,

• p is the estimated proportion of the population,

• E is the desired margin of error.


If the true proportion p is unknown, use p = 0.5 for a conservative estimate.
Problem 8.21. A retailer wants to estimate the proportion of customers who
are satisfied with their new product. The retailer aims for a margin of error of
0.05 with 99% confidence. If the estimated proportion of satisfied customers is
0.6, determine the required sample size.


Solution: Given,

p = 0.6, E = 0.05, z = 2.576 (for 99% confidence level)

Calculate the sample size:

n = 2.576² × 0.6 × (1 − 0.6) / 0.05² = 6.6358 × 0.24 / 0.0025 ≈ 1.5926/0.0025 ≈ 637.03

Rounding up, the required sample size is 638.
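
This calculation is easy to check in Python. A minimal sketch, using the
tabulated critical value z = 2.576 so that it matches the hand computation
above, is:

import math

# Problem 8.21 with the tabulated critical value for 99% confidence
z, p, E = 2.576, 0.6, 0.05

n_raw = z ** 2 * p * (1 - p) / E ** 2
print(f"n = {n_raw:.2f} -> {math.ceil(n_raw)}")  # n = 637.03 -> 638
# Note: the exact quantile (2.5758...) gives 636.95, i.e., n = 637; the
# required n is sensitive to how the critical value is rounded.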

8.6.3 Sample Size Estimation for Finite Populations

When estimating the mean or proportion of a population, it is important to
consider whether the population is finite or small (generally under 5,000). In
this case, the standard sample size formulas may overestimate the needed sam-
ple size. Adjustments are made to account for the limited number of individuals
available, ensuring that the sample more accurately represents the population.
AF
The formula for estimating the sample size n when the population size N is
known is given by:
n = n₀ / (1 + n₀/N)
where,
• n: The final sample size adjusted for the finite population.

• n0 : The initial sample size calculated without considering the popula-


tion size.
• N : The total number of individuals in the population.

Example:
To illustrate the process of estimating the sample size for a finite population,
consider the following scenario: Suppose you are conducting a study to assess
the prevalence of a certain health condition within a population of residents in
a small town. You have determined the following parameters for your study:

• Population Size N : 1000

• Margin of Error E: 0.05 (5%)

• Confidence Level: 95% (corresponding z-value ≈ 1.96)

• Estimated Proportion p: 0.5 (maximum variability)


Step 1: Calculate the Initial Sample Size n0


Using the formula for the initial sample size, we can compute n0 :

n₀ = z² · p(1 − p) / E²

Substituting the values into the formula gives:

n₀ = (1.96)² · 0.5 · (1 − 0.5) / (0.05)² = (3.8416) · (0.25) / 0.0025 = 0.9604/0.0025 ≈ 384.16

Thus, the initial sample size n0 is approximately 384.
Remark 8.6.1. When calculating the initial sample size n0 , we usually round
to the nearest whole number since we cannot survey a fraction of a person. The
method of rounding can depend on the specific context or research guidelines.
Both rounding to the nearest whole number and rounding up are acceptable, as
long as the reasoning behind the choice is clear.
AF
Step 2: Adjust for Finite Population
Next, we apply the finite population correction to determine the adjusted sam-
ple size n:
n = n₀ / (1 + n₀/N)

Substituting n₀ = 384 and N = 1000:

n = 384 / (1 + 384/1000) = 384/1.384 ≈ 277.46

After rounding up, the final sample size n is 278.
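
Both steps of this example can be scripted directly. A minimal sketch, using
the same z, p, E, and N as above, is:

import math

# Inputs from the example
z, p, E, N = 1.96, 0.5, 0.05, 1000

# Step 1: initial sample size, rounded to the nearest whole number (Remark 8.6.1)
n0 = round(z ** 2 * p * (1 - p) / E ** 2)  # 384.16 -> 384

# Step 2: finite population correction
n = n0 / (1 + n0 / N)  # 384 / 1.384
print(f"Adjusted sample size: {n:.2f} -> {math.ceil(n)}")  # 277.46 -> 278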

8.7 Concluding Remarks


This chapter covered the fundamental concepts of interval estimation, including
confidence intervals for means, proportions, and variances. We explored the
interpretation and construction of confidence intervals and provided practical
examples to illustrate these concepts. Interval estimation is a powerful tool in
statistics, offering a range of values for population parameters and providing a
measure of reliability for the estimates.


8.8 Chapter Exercises


1. A nutritionist studies the effect of a new diet on weight loss. A sample
of 25 participants followed the diet for six weeks, resulting in an average
weight loss of 6.2 kg with a known population standard deviation of 1.5
kg. Construct a 95% confidence interval for the mean weight loss of all
participants on this diet.

2. An engineer tests the durability of a new type of material. A sample of


15 test pieces showed an average lifespan of 120 hours with a population
standard deviation of 20 hours. Calculate the 99% confidence interval for
the average lifespan of this material.

3. A teacher measures the effectiveness of a new teaching method. In a sam-
ple of 40 students, the average test score was 78 with a known population
standard deviation of 10. Construct a 90% confidence interval for the
mean test score of all students taught with this method.
4. A pharmaceutical company is evaluating the effectiveness of a new drug.
AFA sample of 50 patients reported an average improvement score of 8.5 on
a health scale, with a population standard deviation of 2. Create a 95%
confidence interval for the mean improvement score of all patients taking
the drug.
5. A sociologist studies the average number of hours people spend on social
media. In a sample of 30 individuals, the average time spent was 3.2 hours
with a known population standard deviation of 1.1 hours. Construct a
99% confidence interval for the mean time spent on social media by the
population.
6. A horticulturist studies the effect of a new watering technique on plant
growth. A sample of 10 plants is measured for growth in centimeters over


a month. The growth measurements are as follows:

11.8, 14.2, 15.0, 13.6, 12.3, 16.1, 14.4, 13.8, 15.7, 12.5

(i). Calculate the sample mean (x̄) and sample standard deviation (s) of
the growth measurements.
(ii). Using a 95% confidence level, construct a confidence interval for the
mean growth of the plants.
(iii). If the horticulturist wants to increase the confidence level to 99%,
how would that affect the width of the confidence interval? Calculate
the new confidence interval.
(iv). If the growth measurements were to decrease by 1.5 cm for each
plant due to an adverse weather condition, how would this change
the sample mean and the confidence interval?


(v). Based on the confidence interval obtained, discuss the effectiveness


of the new watering technique. What conclusions can be drawn?

7. A survey of 200 students revealed that 120 of them prefer online classes to
in-person classes. Construct a 95% confidence interval for the proportion
of all students who prefer online classes.
8. In a clinical trial, 150 patients were given a new treatment, and 90 re-
ported improvement in their condition. Calculate a 99% confidence inter-
val for the proportion of patients who respond positively to the treatment.
9. A marketing firm conducted a survey of 500 customers and found that 275

are satisfied with their services. Using a 90% confidence level, construct
a confidence interval for the proportion of all customers who are satisfied.
10. A poll conducted on 1,000 voters shows that 430 support a certain can-
didate. Determine the 95% confidence interval for the proportion of all
voters who support this candidate.
11. In a study of cholesterol levels in a population, the following measure-
ments (in mg/dL) were recorded: 210, 220, 230, 240, 250, 260. Calculate
the variance and standard deviation of the cholesterol levels. Also, com-
pute a 90% confidence interval for the variance.
12. A researcher wants to estimate the proportion of voters who support a
new policy with a margin of error of 0.05 and a confidence level of 95%.
If a previous study found that 60% of voters support the policy, calculate
the required sample size.
13. A company aims to determine the average time (in minutes) customers
spend on their website. They want to estimate the mean with a margin of

error of 2 minutes and a confidence level of 99%. If the standard deviation


from a pilot study is 10 minutes, find the required sample size.
14. A medical researcher wants to estimate the average blood pressure of
patients in a certain age group within 3 mmHg, with a 95% confidence
level. From previous studies, the standard deviation is known to be 12
mmHg. Calculate the sample size needed.
15. A social scientist wants to estimate the mean income of households in a
city. They want a margin of error of 500 dollars with a confidence level of
90%. If previous research shows that the standard deviation of household
income is approximately 3000 dollars, what sample size is necessary?
16. An education researcher plans to conduct a study on student performance
and wants to estimate the mean score on a standardized test with a margin
of error of 1.5 points. If the standard deviation of test scores is 6 points,
calculate the required sample size for a 95% confidence level.


[Figure: chi-square density curve with the upper-tail critical value χ²_α marked.]

Figure 8.4: Chi-Square Distribution


Table 8.4: Percentage points of the chi-square distribution (χ²_{ν,u})

ν χ².995 χ².99 χ².975 χ².95 χ².90 χ².75 χ².50 χ².25 χ².10 χ².05 χ².025 χ².01 χ².005 χ².001
1 0.00 0.00 0.00 0.00 0.02 0.10 0.45 1.32 2.71 3.84 5.02 6.63 7.88 10.83
2 0.01 0.02 0.05 0.10 0.21 0.58 1.39 2.77 4.61 5.99 7.38 9.21 10.60 13.81

T
3 0.07 0.12 0.22 0.35 0.58 1.21 2.37 4.11 6.25 7.81 9.35 11.34 12.84 16.27
4 0.21 0.30 0.48 0.71 1.06 1.92 3.36 5.39 7.78 9.49 11.14 13.28 14.86 18.47
5 0.41 0.55 0.83 1.15 1.61 2.67 4.35 6.63 9.24 11.07 12.83 15.09 16.75 20.52
6 0.68 0.87 1.24 1.64 2.20 3.45 5.35 7.84 10.64 12.59 14.45 16.81 18.55 22.46
7 0.99 1.24 1.69 2.17 2.83 4.25 6.35 9.04 12.02 14.07 16.01 18.48 20.28 24.32
8 1.34 1.65 2.18 2.73 3.49 5.07 7.34 10.22 13.36 15.51 17.53 20.09 21.95 26.12
AF
9 1.73 2.09 2.70 3.33 4.17 5.90 8.34 11.39 14.68 16.92 19.02 21.67 23.59 27.88
10 2.16 2.56 3.25 3.94 4.87 6.74 9.34 12.55 15.99 18.31 20.48 23.21 25.19 29.59
11 2.60 3.05 3.82 4.57 5.58 7.58 10.34 13.70 17.28 19.68 21.92 24.72 26.76 31.26
12 3.07 3.57 4.40 5.23 6.30 8.44 11.34 14.85 18.55 21.03 23.34 26.22 28.30 32.91
13 3.57 4.11 5.01 5.89 7.04 9.30 12.34 15.98 19.81 22.36 24.74 27.69 29.82 34.53
14 4.07 4.66 5.63 6.57 7.79 10.17 13.34 17.12 21.06 23.68 26.12 29.14 31.32 36.12
15 4.60 5.23 6.27 7.26 8.55 11.04 14.34 18.25 22.31 25.00 27.49 30.58 32.80 37.70
16 5.14 5.81 6.91 7.96 9.31 11.91 15.34 19.37 23.54 26.30 28.85 32.00 34.27 39.25
17 5.70 6.41 7.56 8.67 10.09 12.79 16.34 20.49 24.77 27.59 30.19 33.41 35.72 40.79
18 6.26 7.01 8.23 9.39 10.86 13.68 17.34 21.60 25.99 28.87 31.53 34.81 37.16 42.31
19 6.84 7.63 8.91 10.12 11.65 14.56 18.34 22.72 27.20 30.14 32.85 36.19 38.58 43.82
20 7.43 8.26 9.59 10.85 12.44 15.45 19.34 23.83 28.41 31.41 34.17 37.57 40.00 45.32
DR

21 8.03 8.90 10.28 11.59 13.24 16.34 20.34 24.93 29.62 32.67 35.48 38.93 41.40 46.80
22 8.64 9.54 10.98 12.34 14.04 17.24 21.34 26.04 30.81 33.92 36.78 40.29 42.80 48.27
23 9.26 10.20 11.69 13.09 14.85 18.14 22.34 27.14 32.01 35.17 38.08 41.64 44.18 49.73
24 9.89 10.86 12.40 13.85 15.66 19.04 23.34 28.24 33.20 36.42 39.36 42.98 45.56 51.18
25 10.52 11.52 13.12 14.61 16.47 19.94 24.34 29.34 34.38 37.65 40.65 44.31 46.93 52.62
26 11.16 12.20 13.84 15.38 17.29 20.84 25.34 30.43 35.56 38.89 41.92 45.64 48.29 54.05
27 11.81 12.88 14.57 16.15 18.11 21.75 26.34 31.53 36.74 40.11 43.19 46.96 49.64 55.48
28 12.46 13.56 15.31 16.93 18.94 22.66 27.34 32.62 37.92 41.34 44.46 48.28 50.99 56.89
29 13.12 14.26 16.05 17.71 19.77 23.57 28.34 33.71 39.09 42.56 45.72 49.59 52.34 58.30
30 13.79 14.95 16.79 18.49 20.60 24.48 29.34 34.80 40.26 43.77 46.98 50.89 53.67 59.70
40 20.71 22.16 24.43 26.51 29.05 33.66 39.34 45.62 51.81 55.76 59.34 63.69 66.77 73.40
50 27.99 29.71 32.36 34.76 37.69 42.94 49.33 56.33 63.17 67.50 71.42 76.15 79.49 86.66
60 35.53 37.48 40.48 43.19 46.46 52.29 59.33 66.98 74.40 79.08 83.30 88.38 91.95 99.61
70 43.28 45.44 48.76 51.74 55.33 61.70 69.33 77.58 85.53 90.53 95.02 100.42 104.22 112.32
80 51.17 53.54 57.15 60.39 64.28 71.14 79.33 88.13 96.58 101.88 106.63 112.33 116.32 124.84
90 59.20 61.75 65.65 69.13 73.29 80.62 89.33 98.64 107.56 113.14 118.14 124.12 128.30 137.21
100 67.33 70.06 74.22 77.93 82.36 90.13 99.33 109.14 118.50 124.34 129.56 135.81 140.17 149.45

Chapter 9

Hypothesis Testing for Decision Making
AF
9.1 Introduction
In data science, hypothesis testing is a fundamental aspect of statistical in-
ference, providing a systematic method for making decisions about population
parameters based on sample data. This chapter aims to equip readers with a
thorough understanding of hypothesis testing, its importance, and its applica-
tion in various statistical scenarios.

We begin by introducing the basic concepts of hypothesis testing, including


the formulation of null and alternative hypotheses and the significance level,
which determines the threshold for making decisions. The subsequent sections
DR

outline the step-by-step process of conducting hypothesis tests, emphasizing the


calculation of test statistics and the determination of acceptance and rejection
regions.

A crucial part of this chapter is understanding the role of the p-value in


hypothesis testing, as well as its interpretation and significance. We provide il-
lustrative examples involving normal distributions, addressing cases with both
known and unknown variances. Additionally, the chapter explores the concept
of power in hypothesis testing, demonstrating how to calculate it and why it is
vital for ensuring the reliability of test results.

Finally, we cover the methods for estimating the sample size required for
mean and proportion tests, highlighting the importance of adequate sample
sizes for achieving meaningful and accurate conclusions. The chapter concludes
with a set of exercises designed to reinforce the concepts and techniques dis-
cussed, providing practical experience in applying hypothesis testing methods.


This comprehensive introduction to hypothesis testing serves as a founda-


tion for more advanced statistical analyses, enabling readers to make informed
decisions based on empirical data.

9.2 Concepts of Hypothesis Testing


Consider a scenario where you work for an e-commerce company that has re-
cently redesigned its website to enhance its visual appeal. The marketing team
hypothesizes that this new design will lead to an increase in sales. To evaluate
this hypothesis, you conduct an A/B test, where half of the visitors are shown

the old website (Version A) and the other half are exposed to the new design
(Version B).

After running the test for several weeks, you gather data on the number of
purchases made by visitors from both versions of the website. Your goal is to
determine whether the new design genuinely results in more sales compared to
the old version or if any observed differences are attributable to random chance.

Hypothesis testing is a statistical methodology that allows researchers to


make inferences about population parameters based on sample data. This
method is essential for determining whether observed data can substantiate
a specific claim or hypothesis regarding the population. The process begins by
formulating two competing hypotheses:
• The null hypothesis, denoted as H0

• The alternative hypothesis, denoted as H1 or Ha


DR

The term “null” in “null hypothesis” reflects the concept of a “null effect”
or “no effect.” It serves as a baseline assumption that there is no significant
difference, relationship, or effect between the variables under investigation. Es-
sentially, the null hypothesis is the starting point for statistical testing, positing
that any observed differences or effects are due to random chance rather than
a true underlying effect. If the evidence is robust enough to reject the null
hypothesis, it suggests the presence of a significant effect or relationship that
merits further investigation.
Hypothesis: A hypothesis is an assertion or assumption regarding popu-
lation parameters or characteristics of random variables.
• Null Hypothesis (H0 ): The null hypothesis asserts that there
is no effect or difference between the groups or variables being
studied.

• Alternative Hypothesis (H1 or Ha ): The alternative hypothesis
proposes that there is a significant effect or difference between the
groups or variables being studied.

The alternative hypothesis represents the researcher’s objective, positing the


existence of an effect or difference.
For example, a pharmaceutical company may aim to assess whether a new
therapy provides greater benefits than an existing treatment. In another case,
a health organization might explore differences in physiological indicators be-
tween newborns in rural and urban environments. In such studies, researchers
formulate hypotheses about the population parameters and utilize sample data
to test these hypotheses. For instance, if the average cholesterol level in children
is established at 175 mg/dL, researchers could investigate whether the average

cholesterol level of children whose fathers died from heart disease is significantly
higher than this benchmark. In this case, the hypotheses are following:

• Null Hypothesis (H0 ): The average cholesterol level of children
whose fathers died from heart disease is equal to 175 mg/dL.
(H0 : µ = 175)

• Alternative Hypothesis (H1 or Ha ): The average cholesterol level
of children whose fathers died from heart disease is greater than 175
mg/dL. (H1 : µ > 175)

Importance of Hypothesis Testing


Hypothesis testing plays a crucial role in data analysis for several reasons:

1. Avoiding Misinterpretation of Results: Without hypothesis testing,


one might observe that Version B has slightly higher sales than Version

A and mistakenly conclude that the new design is definitively superior.


Such a conclusion may ignore the possibility that the difference could arise
from random variations in the data.
2. Structured Decision-Making: Hypothesis testing provides a system-
atic framework for evaluating whether the difference in sales is statisti-
cally significant. It enables a clear approach for determining whether the
observed effect is substantial enough to warrant changes to the website
design based on the data collected.
3. Informed Decision-Making: By employing hypothesis testing, you can
make data-driven decisions about implementing the new design site-wide.
If the test indicates that the new design significantly increases sales, you
may opt to adopt it. Conversely, if the results are not statistically sig-
nificant, you might decide to retain the old design or explore further
improvements.

350
CHAPTER 9. HYPOTHESIS TESTING FOR DECISION MAKING

9.3 Steps for Hypothesis Testing


Hypothesis testing involves the following steps:

Step 1: Formulate the null and alternative hypotheses

Step 2: Set the level of significance, usually denoted by α

Step 3: Collect data, define a test statistic, and then compute it

Step 4: Construct acceptance and rejection regions by collecting the tabulated
value from the sampling distribution of the statistic, such as

■ standard normal table for Z-test
■ standard t-table for T -test
■ F -table for F -test
■ χ2 -table for χ2 -test, etc.
for the level of significance α, or compute the p-value, and then define the deci-
sion rule

Step 5: Based on steps 3 and 4, draw a conclusion about H0 .


■ reject H0
▶ if the value of test statistic belongs to the rejection region
at level of significance 100α%.
▶ or the p-value is less than α.
■ otherwise, do not reject H0 at level of significance 100α%.

Details of each step of hypothesis testing for decision-making are explained in


the following subsections.

9.3.1 Formulating Hypotheses


A hypothesis becomes a statistical hypothesis when it is formulated in a
way that allows it to be tested using statistical methods. The first step in
hypothesis testing is formulating statistical hypotheses, which consist of the null
hypothesis (H0 ) and the alternative hypothesis (H1 or Ha ). Formulating the
null and alternative hypotheses involves clearly defining the research question
and the population parameter of interest. The hypotheses must be mutually
exclusive and collectively exhaustive, meaning that one of them must be true,
and both cannot be true simultaneously.


Statistical Hypothesis

Null Hypothesis (denoted by H0 ):

• assumed to be true
• a given value
• assumption, nothing new
• independent
• negation of the research aim
• usually contains an equality (e.g., =, ≥, ≤)

Alternative Hypothesis (denoted by H1 or Ha ):

• any statement without H0
• rejection of a given true value
• rejection of the assumption
• does not contain equality (usually contains >, <, ≠)

Null Hypothesis (H0 )


The null hypothesis represents a statement of no effect, no difference, or the
status quo. It is an assertion that any observed difference or effect in the data
is due to random chance rather than a real underlying effect. For example, if a
pharmaceutical company is testing a new drug, the null hypothesis might state
that the new drug has no effect on patients compared to the existing treatment.
Mathematically, the null hypothesis often includes an equality sign, such as

H0 : θ = θ0

where θ is the population parameter and θ0 is the hypothesized value of the
population parameter.

lation parameter.

Alternative Hypothesis (H1 or Ha )


The alternative hypothesis represents a statement that contradicts the null
hypothesis. It indicates the presence of an effect, a difference, or a relationship
that the researcher aims to detect. The alternative hypothesis is what the
researcher hopes to support through evidence from the data. It can be one-
sided or two-sided, depending on the research question. For example, if the
pharmaceutical company believes the new drug is more effective, the alternative
hypothesis might state that the new drug has a greater effect than the existing
treatment
H1 : θ > θ0


In two-sided tests, the hypothesis might state that there is a difference in either
direction
H1 : θ ̸= θ0

9.3.2 Level of Significance


The second step in hypothesis testing involves establishing the level of sig-
nificance, denoted by α. This threshold is chosen before conducting the test
and represents the probability of rejecting the null hypothesis (H0 ) when it is
actually true, known as a Type I error. Common values for α include 0.05,
0.01, and 0.10, with 0.05 being the most prevalent. For example, if α is set
at 0.05, there is a 5% chance of incorrectly concluding that a significant effect

exists when it does not. This chosen α level reflects the researcher’s tolerance
for making an erroneous decision. A lower α means a stricter criterion for re-
jecting H0 , which reduces the risk of a Type I error but may increase the risk
of a Type II error—failing to reject a false null hypothesis. This threshold thus
sets the stage for evaluating the test results, ensuring that decisions align with
the predetermined risk level.

Level of Significance: The level of significance is the threshold proba-


bility of rejecting the null hypothesis when it is true.

Type I and Type II Errors


Understanding Type I and Type II errors is crucial for interpreting hypothesis
testing results. A Type I error occurs when the null hypothesis is rejected when
it is true, with the probability of this error being denoted by α. For instance,
if α = 0.05, the chance of incorrectly rejecting H0 is 5%. Conversely, a Type
II error occurs when the null hypothesis is not rejected when it is false, with

the probability of this error denoted by β. The complement of β (i.e., 1 − β)


represents the power of the test, which measures the probability of correctly
rejecting a false null hypothesis. See the Table 9.1. The trade-off between α
and β is critical; reducing α to decrease Type I errors may increase β, making
it more likely to miss a true effect. Therefore, choosing appropriate values for
α and β involves balancing the risks of both types of errors in the context of
the study.

Types of error

• Type I error: Reject H0 when it is true.
• Type II error: Do not reject H0 when it is false.


Table 9.1: Level of Significance and Power of the test.

                                In Reality
Decision        H0 is TRUE                       H0 is FALSE
Accept H0       Correct Decision                 Type II error
                (1 − α = Confidence level)       (β = Pr(Type II Error))
Reject H0       Type I Error                     Correct Decision
                (α = Pr(Type I Error))           (1 − β = Power of the test)

9.3.3 Test Statistics
The test statistic provides a standardized value that is calculated from sample
data during a hypothesis test. It quantifies the discrepancy between the ob-
served sample data and the null hypothesis. The test statistic is a random
variable because it is derived from random sample data, and it follows a prob-
ability distribution. In general, the test statistic for the location parameter test
defined in null hypothesis H0 : θ = θ0 is

Test statistic = (θ̂ − θ0) / se(θ̂)

where θ is a location parameter (it could be a mean, median, quantile, etc.),
θ̂ is its estimator, and se(θ̂) is the standard error of θ̂.
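
To make the definition concrete, the following sketch computes this standardized
statistic for a sample mean, using simulated (entirely illustrative) data:

import numpy as np

# Illustrative data: hypothetical cholesterol measurements
rng = np.random.default_rng(42)
sample = rng.normal(loc=176, scale=30, size=40)

theta_0 = 175                                   # hypothesized value under H0
theta_hat = sample.mean()                       # estimator of the location parameter
se = sample.std(ddof=1) / np.sqrt(len(sample))  # standard error of the mean

test_statistic = (theta_hat - theta_0) / se
print(f"Test statistic: {test_statistic:.3f}")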

9.3.4 Acceptance and Rejection Regions


In hypothesis testing, we define acceptance and rejection regions to decide
whether to reject the null hypothesis H0 . The acceptance region is the range of

values for which we fail to reject the null hypothesis H0 . The rejection region
is the range of values for which we reject H0 . The critical value zcri defines the
boundary between these regions. Figure 9.1, Figure 9.2, and Figure 9.3 show
the acceptance and rejection regions for a left-tailed test, a right-tailed test,
and a two-tailed test, respectively, at a significance level of α.


[Figure: standard normal density; the rejection region is the left tail below −zα, with the acceptance region to its right.]

Figure 9.1: Acceptance and Rejection Regions for a Left-Sided Test at α = 0.05.
[Figure: standard normal density; the rejection region is the right tail above zα, with the acceptance region to its left.]

Figure 9.2: Acceptance and Rejection Regions for a Right-Tailed Test at α = 0.05.


[Figure: standard normal density; the rejection regions are the two tails beyond ±zα/2, with the acceptance region between them.]

Figure 9.3: Acceptance and Rejection Regions for a Two-Sided Test at α = 0.05.

The graphs above represent the standard normal distribution (for Z-test),
which is a common distribution used in hypothesis testing. The curve shown is
the probability density function of the standard normal distribution, N (0, 1),
which is symmetric about the mean µ = 0.
• Critical Value (zcri ): Critical values help determine whether to reject
the null hypothesis H0 based on the significance level α and test type.

■ One-Tailed Test: For α, the critical value is zα (upper tail) or


−zα (lower tail). E.g., for α = 0.05, zα = 1.645. Reject H0 if

the test statistic exceeds the critical value.


■ Two-Tailed Test: For α, the critical values are ±zα/2 . E.g., for
α = 0.05, zα/2 ≈ 1.96. Reject H0 if the test statistic is outside
±zα/2 .

• Acceptance Region: Area to the left of zα where H0 is not rejected.

• Rejection Region: Area to the right of zα where H0 is rejected.

Interpreting the Regions


• If the test statistic is in the acceptance region, do not reject H0 (the
data is consistent with H0 ).

• If the test statistic is in the rejection region, reject H0 (the data is


unlikely under H0 ).


9.3.5 Decision Rules


In hypothesis testing, there are two common methods for making decisions:
the critical-value method and the p-value method. The critical-value method
involves comparing the test statistic to a critical value. The decision rules are:

Critical-Value Method:

• One-tailed test (right-tailed): Reject H0 if the test statistic is


greater than the critical value; otherwise, fail to reject H0 .

• One-tailed test (left-tailed): Reject H0 if the test statistic is less

T
than the critical value; otherwise, fail to reject H0 .

• Two-tailed test: Reject H0 if the test statistic is outside the critical


values; otherwise, fail to reject H0 .

The p-value Method:


• Reject H0 if the p-value is less than or equal to α; otherwise, fail to
reject H0 .
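
In practice, the critical values used in these rules can be obtained from scipy
rather than printed tables. A minimal sketch for the Z-test at α = 0.05 is:

import scipy.stats as stats

alpha = 0.05

# Critical values from the standard normal distribution
z_right = stats.norm.ppf(1 - alpha)    # right-tailed: about 1.645
z_left = stats.norm.ppf(alpha)         # left-tailed: about -1.645
z_two = stats.norm.ppf(1 - alpha / 2)  # two-tailed: about 1.960

print(f"Right-tailed: reject H0 if z_cal > {z_right:.3f}")
print(f"Left-tailed:  reject H0 if z_cal < {z_left:.3f}")
print(f"Two-tailed:   reject H0 if |z_cal| > {z_two:.3f}")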

9.4 The p-value


In hypothesis testing, the p-value helps us determine the strength of the evi-
dence against the null hypothesis, H0 . It is represented by the shaded area in
Figure 9.4. The p-value is the area under the curve to
the right of the test statistic in a right-tailed test (as shown in the graph). It
quantifies the evidence against the null hypothesis: a smaller p-value indicates

stronger evidence against H0 .

The p-value: The p-value represents the probability of observing the test
statistic as extreme as, or more extreme than, the value observed if the
null hypothesis is true. A small p-value (less than α) indicates that the
observed data is unlikely under the null hypothesis, leading us to reject
H0 .


[Figure: density curve with the tail area beyond the observed test statistic shaded.]

Figure 9.4: The p-value is shown as the shaded area.

Interpreting the p-value


• If the p-value is less than or equal to the significance level α, we re-
ject the null hypothesis H0 . This indicates that the observed data is
unlikely under H0 .

• If the p-value is greater than α, we fail to reject the null hypothesis H0 .


This means there is insufficient evidence to conclude that the observed
data is inconsistent with H0 .

Guidelines for Judging the Significance of a p-Value


• If 0.01 ≤ p-value < 0.05, then the results are significant.

• If 0.001 ≤ p-value < 0.01, then the results are highly significant.

• If p-value < 0.001, then the results are very highly significant.

• If p-value > 0.05, then the results are considered not statistically sig-
nificant (sometimes denoted by NS).
However, if 0.05 < p-value < 0.10, then a trend toward statistical significance
is sometimes noted.
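
These guidelines are easy to encode as a small convenience helper (the labels
below simply mirror the list above):

def significance_label(p_value):
    """Label a p-value according to the guidelines above."""
    if p_value < 0.001:
        return "very highly significant"
    elif p_value < 0.01:
        return "highly significant"
    elif p_value < 0.05:
        return "significant"
    elif p_value < 0.10:
        return "trend toward statistical significance"
    return "not statistically significant (NS)"

for p in (0.0004, 0.004, 0.03, 0.07, 0.20):
    print(p, "->", significance_label(p))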

9.5 Why is hypothesis testing so important?


Hypothesis testing is crucial because it provides an objective framework for
making decisions based on data, ensuring consistency and reproducibility in
research. By quantifying the strength of evidence against the null hypothesis,
it allows researchers to make informed decisions while controlling for Type I


and Type II errors. This statistical method is essential in scientific research,


policy-making, and business decisions, enhancing the credibility and reliability
of findings. Through hypothesis testing, researchers can test theories, validate
models, and make predictions, driving innovation and scientific progress by
challenging existing knowledge and exploring new ideas.

9.6 Hypothesis Testing for Means


This section focuses on hypothesis tests specifically for means, which are essen-
tial for making inferences about the average values of populations or comparing
means across different groups. We will explore three key types of tests:

• One-Sample Test of Means: This test assesses whether the mean of
a single sample differs from a known or hypothesized population mean.
It is useful when we want to determine if the sample mean is consistent
with a particular value.
AF• Testing Equality of Two Means: Here, we compare the means
of two independent samples to evaluate whether there is a significant
difference between them. This is crucial when examining the effects of
different treatments or conditions on two separate groups.

• Testing Equality of Several Means: This test extends the compari-


son to more than two groups, determining whether there are significant
differences among the means of multiple samples. It is commonly used
in experiments with multiple treatment levels or groups.

Each of these tests plays a critical role in statistical inference, helping re-
searchers and analysts make data-driven decisions and understand underlying

patterns. We will delve into the methodology, assumptions, and applications of


these tests in the following subsections.

9.6.1 One-Sample Test of Means


A one-sample test of means is a statistical method used to determine whether
the average (mean) of a single sample is significantly different from a known
population mean. This test is particularly useful for assessing whether sample
data aligns with a specific standard or expectation.

Formulating Hypotheses
Before conducting the test, researchers establish two hypotheses:

• Null Hypothesis (H0 ): The sample mean is equal to the hypothe-


sized population mean.
H0 : µ = µ0


• Alternative Hypothesis (Ha ): The sample mean is not equal to the


hypothesized population mean.

Ha : µ ̸= µ0

This is a two-tailed test because we are interested in deviations in both


directions (greater or less).

Assumptions
Before performing the test, several key assumptions must be considered:

1. Normality: The sample data should be approximately normally dis-
tributed, particularly if the sample size is small (less than 30).
2. Independence: Each observation within the sample must be indepen-
dent of others.

3. Sample Size: Larger samples enhance the approximation of a normal


distribution.

Choosing the Test


Researchers have two primary options for the statistical test:
• T-test: This test is appropriate for small samples (usually, n < 30)
when the population variance is unknown. For the T-test, the statistic is

T = (x̄ − µ0) / (s/√n)

where

s = √[ (1/(n−1)) Σ(xᵢ − x̄)² ]

is the sample standard deviation.

• Z-test: This test is suitable for larger samples (usually, n ≥ 30) or
when the population variance is known. For the Z-test, the test statistic is

Z = (x̄ − µ0) / (σ/√n)

In these formulas, x̄ represents the sample mean, µ0 is the population mean


being compared against, s is the sample standard deviation (for the T-test),
and σ is the population standard deviation (for the Z-test).


Critical-Value Method
Next, researchers determine a critical value using statistical tables based on the
chosen significance level (commonly set at 0.05). This value aids in deciding
whether to reject the null hypothesis. Alternatively, the p-value is calculated to
indicate the probability of observing the results if the null hypothesis is true.

Decision Rule for Z-test


Under H0, the sampling distribution of Z follows a standard normal distribution. Therefore, for a two-sided alternative hypothesis, we select the critical value zα/2 from the standard normal distribution. For a one-tailed test with significance level α, the critical value is denoted as zα (for the upper tail) or −zα (for the lower tail). For example:

    zα = 1.645 for α = 0.05.

Suppose the realized value of the test statistic (Z) is zcal. Then,

• for a two-sided test, reject H0 if and only if zcal ≥ zα/2 or zcal ≤ −zα/2 (equivalently, |zcal| ≥ zα/2);

• for a one-sided test, reject H0 if and only if

  ■ zcal > zα for an upper-tail test, or
  ■ zcal < −zα for a lower-tail test.

• If the p-value is less than or equal to the significance level (α), reject the null hypothesis H0.

Decision Rule for T-test

Under H0, the sampling distribution of the test statistic T follows a Student's t-distribution with ν = n − 1 degrees of freedom, where n is the sample size. Therefore, for a two-sided alternative hypothesis, we select the critical value tα/2,(n−1) from the Student's t-distribution with the appropriate degrees of freedom (n − 1). For a one-sided alternative hypothesis, the critical value is denoted by tα,(n−1). Suppose the realized value of the test statistic (T) is tcal. Then,

• for a two-sided test, reject H0 if and only if tcal ≤ −tα/2,(n−1) or tcal ≥ tα/2,(n−1) (equivalently, |tcal| ≥ tα/2,(n−1));

• for a one-sided test, reject H0 if and only if

  ■ tcal > tα,(n−1) for an upper-tail test, or
  ■ tcal < −tα,(n−1) for a lower-tail test.


The p-Value Method: Decision Rule


The p-value is calculated to indicate the probability of observing the results if
the null hypothesis is true. If the p-value is less than or equal to the significance
level (α), reject the null hypothesis H0 . To compute the p-value for a T -test,
we can use the following formula based on the t-distribution.

• For a two-tailed test:

p-value = P (T ≤ −tcal ) + P (T ≥ tcal ) = 2 × P (T > |tcal |)

• For a one-tailed test (greater than):

p-value = P (T > tcal )

• For a one-tailed test (less than):

p-value = P (T < tcal )

where T follows a t-distribution with degrees of freedom (df):


AF
df = n − 1

from scipy import stats

# Example inputs (assumed for illustration); replace with your own values
t_cal = 1.79   # realized value of the test statistic
df = 19        # degrees of freedom, n - 1

# Calculate p-value for a two-tailed test
p_value_two_tailed = 2 * (1 - stats.t.cdf(abs(t_cal), df))
print("Two-Tailed p-value:", p_value_two_tailed)

# Calculate p-value for a one-tailed test (greater than)
p_value_one_tailed_greater = 1 - stats.t.cdf(t_cal, df)
print("One-Tailed p-value (greater than):", p_value_one_tailed_greater)

# Calculate p-value for a one-tailed test (less than)
p_value_one_tailed_less = stats.t.cdf(t_cal, df)
print("One-Tailed p-value (less than):", p_value_one_tailed_less)

To compute the p-value for Z-test, we can use the following formulas:
• For a two-tailed test:

p-value = 2 × P (Z > |zcal |)

• For a one-tailed test (greater than):

p-value = P (Z > zcal )


• For a one-tailed test (less than):

p-value = P (Z < zcal )

Where Z follows a standard normal distribution.


from scipy import stats

# Example input (assumed for illustration); replace with your own value
z_cal = 0.9237   # realized value of the test statistic

# Calculate p-value for a two-tailed test
p_value_two_tailed = 2 * (1 - stats.norm.cdf(abs(z_cal)))
print("Two-Tailed p-value:", p_value_two_tailed)

# Calculate p-value for a one-tailed test (greater than)
p_value_one_tailed_greater = 1 - stats.norm.cdf(z_cal)
print("One-Tailed p-value (greater than):", p_value_one_tailed_greater)

# Calculate p-value for a one-tailed test (less than)
p_value_one_tailed_less = stats.norm.cdf(z_cal)
print("One-Tailed p-value (less than):", p_value_one_tailed_less)

Remark 9.6.1. If σ 2 is known, we always use the Z-test. If σ 2 is unknown,


then we have to check if the sample size n is large or not. In this case, we
replace the population standard deviation with the sample standard deviation s,
and:
(i). if n < 30, we use the T -test,

(ii). if n ≥ 30, we can use the Z-test.
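This decision rule is easy to express in Python. Below is a minimal illustrative sketch (the helper name and example values are ours, not from any library); it returns the statistic used, its realized value, and a two-sided p-value:

from math import sqrt
from scipy import stats

def one_sample_mean_test(x_bar, mu0, n, sigma=None, s=None):
    """Two-sided one-sample mean test following the rule in Remark 9.6.1."""
    if sigma is not None:                       # sigma^2 known: Z-test
        z = (x_bar - mu0) / (sigma / sqrt(n))
        return "Z", z, 2 * stats.norm.sf(abs(z))
    if n >= 30:                                 # sigma^2 unknown, large n: Z-test with s
        z = (x_bar - mu0) / (s / sqrt(n))
        return "Z", z, 2 * stats.norm.sf(abs(z))
    t = (x_bar - mu0) / (s / sqrt(n))           # sigma^2 unknown, small n: T-test
    return "T", t, 2 * stats.t.sf(abs(t), n - 1)

# Example (assumed values): x̄ = 92.2, µ0 = 89, n = 12, known σ = 12
print(one_sample_mean_test(92.2, 89, 12, sigma=12.0))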


Problem 9.1. From long-term experience, a factory owner knows that a worker

can produce a product in an average time of 89 minutes. However, on Sunday


morning, it seems that it takes longer. Assuming the population variance is
144, the owner collects a sample of size n = 12 and finds that the sample mean
is x̄ = 92.2. To determine whether this impression is correct, how should this
be justified?

Solution
To test whether the production time on Sunday morning is significantly longer,
a sample of size n = 12 was taken, yielding a sample mean x̄ = 92.2. It
is assumed that the production time is normally distributed with a known
variance σ 2 = 144.
We start by setting up the hypothesis test:

H0 : µ = 89 vs. H1 : µ > 89.

Next, we select the significance level α = 0.05.


The test statistic is calculated as follows:

    zcal = (x̄ − µ0)/(σ/√n) = (92.2 − 89)/√(144/12) = 0.9237.
For a significance level of α = 0.05, the critical value from the z-distribution
is zα = z0.05 = 1.645.
According to the decision rule, we will reject H0 if zcal ≥ zα . Since zcal =
0.9237 is less than 1.645, we do not reject the null hypothesis H0 : µ = 89 at
the 5% significance level. Thus, there is insufficient evidence to conclude that
it takes longer to produce on Sunday morning.

Alternatively, we can also use the p-value method. The p-value is calcu-

lated as:
p-value = Pr(Z > 0.9237) = 0.1778.
Since the p-value is greater than 0.05, we do not reject the null hypothesis
H0 at the 5% significance level. This confirms that there is not enough
evidence to suggest that production time is longer on Sunday morning.
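The calculation above can be verified quickly with scipy:

from math import sqrt
from scipy import stats

z_cal = (92.2 - 89) / sqrt(144 / 12)       # realized test statistic
p_value = stats.norm.sf(z_cal)             # upper-tail p-value, Pr(Z > z_cal)
print(round(z_cal, 4), round(p_value, 4))  # 0.9237 0.1778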
Problem 9.2. (Cardiovascular Disease) We want to compare fasting serum-
cholesterol levels between recent Asian immigrants and the general U.S. popula-
tion. Cholesterol levels for U.S. women aged 21-40 are normally distributed with
a mean of 190 mg/dL. For recent female Asian immigrants, with levels assumed
to be normally distributed but with an unknown mean µ, we test H0 : µ = 190
against H1 : µ ̸= 190. Blood tests from 200 female Asian immigrants show a
mean of 181.52 mg/dL and a standard deviation of 40. What conclusions can
we draw from this data?

Solution

To test whether the mean cholesterol level differs from 190, we set up the
following hypotheses:
H0 : µ = 190 vs. H1 : µ ̸= 190.
Assuming that X follows a normal distribution, i.e., X ∼ N (µ, σ 2 ), with a
sample mean x̄ = 181.52, sample standard deviation s = 40, and sample size
n = 200, we proceed with the following steps.
First, we set the significance level to α = 0.05. The test statistic is calculated
as:
    zcal = (x̄ − µ0)/(s/√n) = (181.52 − 190)/(40/√200) = −3.00.
For a significance level of α = 0.05, the critical value from the z-distribution
is zα/2 = z0.025 = 1.96. Since |zcal | > 1.96, we reject the null hypothesis
H0 : µ = 190 at the 5% significance level. Thus, we conclude that the mean
cholesterol level of recent Asian immigrants is significantly different from that
of the general U.S. population.


Alternatively, we compute the p-value for the test statistic. The p-value
is given by:

p-value = 2 Pr(Z ≥ |zcal | | H0 ) = 2 Pr(Z ≥ |−3.00|) = 2×0.00137 = 0.0027

Since the p-value is less than α = 0.05, we reject the null hypothesis
H0 : µ = 190 at the 5% level of significance. Thus, we conclude that the
mean cholesterol level of recent Asian immigrants is significantly different
from that of the general U.S. population.

Problem 9.3. (Obstetrics Problem) We want to test if mothers with low


socioeconomic status (SES) have babies with lower birthweights than the na-

tional average. From 100 consecutive full-term births in a low-SES area, the
average birthweight is 115 oz with a standard deviation of 24 oz. Nationally,
the mean birthweight is 120 oz. Can we conclude that the mean birthweight in
this hospital is lower than the national average?
Solution
To determine whether the average birthweight is significantly less than 120, we
set up the following hypotheses:

H0 : µ = 120 vs. H1 : µ < 120.

Assuming that X follows a normal distribution, i.e., X ∼ N (µ, σ 2 ), with a sam-


ple mean x̄ = 115, sample standard deviation s = 24, and sample size n = 100,
we proceed as follows.

First, we set the significance level to α = 0.05. Since the sample size in this
case is large (n = 100), we could use either a T -test or a Z-test. However, we

typically use the Z-test in this situation. Therefore, the realized value of the test statistic is calculated as:

    zcal = (x̄ − µ0)/(s/√n) = (115 − 120)/(24/√100) = −2.08.

For a one-tailed test at α = 0.05, the right-sided critical value zcri = 1.645.
Since |zcal | = 2.08 is greater than 1.645, we reject the null hypothesis H0 : µ =
120 at the 5% level of significance. Therefore, we conclude that the average
birthweight in this hospital is lower than the national average.

Problem 9.4. From long-term experience, a factory owner knows that a worker can produce a product in an average time of 89 min. However, on Monday morning, there is the impression that it takes longer. To test whether this impression is correct or not, he collects a sample of size n = 12, and it is found that x̄ = 92.2 and s = 10.75. Test his claim.


Solution
We assume that the production time follows a normal distribution. To verify
whether the impression that it takes longer on Monday morning is correct, we
conduct a hypothesis test at a significance level of 5%.

We test the following hypotheses:

H0 : µ = 89

H1 : µ > 89
Assuming X follows a normal distribution with an unknown variance σ 2 ,

the test statistic is given by

    T = (x̄ − µ0)/(s/√n) ∼ t(n−1) under H0.
We use the significance level α to find the critical value tα,(n−1) . From
the Student’s t-distribution table (Table A6 in the Appendix), we find that
tα,(n−1) = t0.05,11 = 1.795.
The decision rule is to reject H0 if |tcal | > tα,(n−1) = 1.795. Given that
n = 12, x̄ = 92.2, and s = 10.75, the calculated test statistic is

    tcal = (92.2 − 89)/(10.75/√12) = 1.0312.
Since 1.0312 < 1.795, we cannot reject H0 at the 5% significance level.
Therefore, there is insufficient evidence to conclude that it takes longer to pro-
duce on Monday morning.

Alternatively, we find the same conclusion based on the p-value:

    p-value = Pr(T > tcal) = Pr(T > 1.0312) = 0.1623

Since the p-value is larger than 0.05, we do not reject H0 at the 5% significance level.
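A quick scipy check of this t-test:

from math import sqrt
from scipy import stats

t_cal = (92.2 - 89) / (10.75 / sqrt(12))   # realized test statistic
p_value = stats.t.sf(t_cal, df=11)         # upper-tail p-value, Pr(T > t_cal)
print(round(t_cal, 4), round(p_value, 4))  # 1.0312 and about 0.162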

Problem 9.5. (Cardiology) A recent clinical focus is on whether drugs can


reduce infarct size in patients who have experienced a myocardial infarction
within the last 24 hours. It is known that the average infarct size in untreated
patients is 25 (CK-g-EQ/m²). In contrast, among 8 patients treated with
the drug, the mean infarct size is 16 with a standard deviation of 10. Is there
evidence that the drug is effective in reducing infarct size?

Solution
We set up the hypothesis test with

H0 : µ = 25 vs. H1 : µ < 25.


Here, X is normally distributed, i.e., X ∼ N(µ, σ²), with sample statistics x̄ = 16, s = 10, and n = 8. Set the significance level to α = 0.05. Calculate the test statistic:

    tcal = (x̄ − µ0)/(s/√n) = (16 − 25)/(10/√8) = −2.55.
From the t-distribution table, find tα,(n−1) = t0.05,7 = 1.895.
Since |tcal | = | − 2.55| = 2.55 is larger than tα,(n−1) = t0.05,7 = 1.895 (see
Table A6 in the Appendix), we reject the null hypothesis H0 : µ = 25 at the
5% level of significance. Therefore, the evidence suggests that the mean infarct
size is lower than 25.
Alternatively, the p-value is calculated as follows:

    p-value = Pr(T ≤ tcal | H0) = Pr(T ≤ −2.55) = Pr(T ≥ 2.55) = 0.019

Since the p-value is less than α = 0.05, we reject the null hypothesis H0 : µ = 25. That is, the evidence suggests that the drug is effective in reducing the mean infarct size below 25.

Problem 9.6. (Cardiovascular Disease) A current area of research interest is the familial aggregation of cardiovascular risk factors in general
and lipid levels in particular. Suppose the “average” cholesterol level in children
is 175 mg/dL. A group of men who have died from heart disease within the past
year are identified, and the cholesterol levels of their offspring are measured.
Two hypotheses are considered:
(i). The average cholesterol level of these children is 175 mg/dL.

(ii). The average cholesterol level of these children is >175 mg/dL.


Suppose the mean cholesterol level of 10 children whose fathers died from heart
disease is 200 mg/dL and the sample standard deviation is 50 mg/dL. Test the
hypothesis that the mean cholesterol level is higher in this group than in the
general population.

Solution
To test whether the mean cholesterol level is significantly different from 175,
we set up the following hypotheses:

H0 : µ = 175 vs. H1 : µ > 175.

Assuming that X follows a normal distribution, i.e., X ∼ N (µ, σ 2 ), with a


sample mean x̄ = 200, sample standard deviation s = 50, and sample size
n = 10, we proceed with the following steps.


First, we set the significance level to α = 0.05. The test statistic is calculated as follows:

    tcal = (x̄ − µ0)/(s/√n) = (200 − 175)/(50/√10) = 1.58.
Next, from the t-distribution table, the critical value is t0.05,9 = 1.833.
Since the calculated test statistic tcal = 1.58 is less than the critical value
t0.05,9 = 1.833, we do not reject the null hypothesis H0 : µ = 175 at the 5%
significance level. Thus, we conclude that the mean cholesterol level of these
children does not significantly differ from that of an average child.

Alternatively, we calculate the p-value as follows:

p-value = Pr(T ≥ |tcal | | H0 ) = Pr(T ≥ 1.58) = 0.074.

(Note: The p-value can be obtained using the function TDIST(1.58, 9, 1)


in Excel, which yields 0.07428.)
Since the p-value is greater than α = 0.05, we do not reject the null
hypothesis H0 : µ = 175. Therefore, we conclude that the mean cholesterol
level of these children does not differ significantly from that of an average
child.
Problem 9.7. Suppose the mean pulse rate in healthy adults is 72 beats per
min. Research was conducted to examine the pulse rate in patients with hyper-
thyroidism. Twenty patients were randomly enrolled with a mean of 80 and a
standard deviation of 20. Assuming that the pulse rate follows a normal distri-
bution, is the mean pulse rate in hyperthyroidism patients different from that in
healthy adults?

Solution

To test whether the mean pulse rate differs from 72, we set up the following
hypotheses:
H0 : µ = 72 vs. H1 : µ ̸= 72.
Assuming that X follows a normal distribution, i.e., X ∼ N (µ, σ 2 ), with a
sample mean x̄ = 80, sample standard deviation s = 20, and sample size
n = 20, we perform the following steps.
First, we set the significance level to α = 0.05. The test statistic is calculated as:

    tcal = (x̄ − µ0)/(s/√n) = (80 − 72)/(20/√20) = 1.79.
For a two-tailed test with α = 0.05 and 19 degrees of freedom, the critical
t-value can be found using a t-distribution table or calculator; it is approximately
2.093. Since |tcal| = 1.79 < 2.093, we do not reject H0 : µ = 72.
This indicates that there is not enough statistical evidence to conclude that the
mean pulse rate in patients with hyperthyroidism is significantly different from
that in healthy adults.


Alternatively, we compute the p-value for the test statistic. The p-value
is given by:

p-value = 2 Pr(T ≥ |tcal | | H0 ) = 2 Pr(T ≥ |1.79|) = 0.0894.

(Note: The p-value can be obtained using the function TDIST(1.79, 19, 2)
in Excel, which gives 0.0894.)
Since the p-value is greater than α = 0.05, we do not reject the null
hypothesis H0 : µ = 72. Thus, there is insufficient evidence to conclude
that the mean pulse rate in hyperthyroidism patients differs from that in
healthy adults.

9.6.2 Testing Equality of Two Means
Testing the equality of two means is a common procedure in hypothesis testing,
used to determine if there is a significant difference between the means of two
populations. This can be done using various methods depending on the nature
of the samples.

9.6.3 Independent Samples T -Test


The two-sample t-test is used to compare the means of two independent groups.
The hypotheses for this test are defined as follows:

• Null Hypothesis (H0 ): The means of the two independent samples


are equal.
H0 : µ1 = µ2

• Alternative Hypothesis (H1 ): The means of the two independent


samples are not equal.
H1 : µ1 ̸= µ2
For a one-tailed test, the alternative hypothesis can be specified as:

■ Right-Tailed Test:
H1 : µ1 > µ2
■ Left-Tailed Test:
H1 : µ1 < µ2

Define the Significance Level: Choose a significance level, denoted by α.


This is the probability of rejecting the null hypothesis when it is actually true.
Common choices for α are 0.05, 0.01, and 0.10.


1. Equal Variances (Pooled T -Test)


When the variances of the two populations are assumed to be equal, the test statistic is calculated as follows:

    T = (x̄1 − x̄2) / √[ s²p (1/n1 + 1/n2) ]

where

    s²p = [ (n1 − 1)s²1 + (n2 − 1)s²2 ] / (n1 + n2 − 2)

Here:

• x̄1 and x̄2 are the sample means,

• s²1 and s²2 are the sample variances,

• n1 and n2 are the sample sizes,

• s²p is the pooled variance.

For pooled T -test, the degrees of freedom is

ν = n1 + n2 − 2.

Problem 9.8. (Hypertension) A sample of eight nonpregnant, premenopausal


oral contraceptive (OC) users aged 35-39 has a mean systolic blood pressure
(SBP) of 132.86 mm Hg with a standard deviation of 15.34 mm Hg. In addi-
tion, a sample of 21 non-OC users from the same age group has a mean SBP of
127.44 mm Hg and a standard deviation of 18.23 mm Hg. Are the mean SBP

values between these two groups equal?

Solution
We begin by setting up the hypotheses. Let µ1 and µ2 represent the mean
systolic blood pressures of OC users and non-OC users, respectively. The hy-
potheses are:

H0 : µ1 = µ2 or µ1 − µ2 = 0
H1 : µ1 ̸= µ2 or µ1 − µ2 ̸= 0

We assume the difference x̄1 − x̄2 follows a normal distribution with mean
0 under H0. The pooled variance estimate s²p is given by:

    s²p = [ (n1 − 1)s²1 + (n2 − 1)s²2 ] / (n1 + n2 − 2)


where n1 and n2 are the sample sizes, and s²1 and s²2 are the sample variances. For our data:

    s²p = [ (8 − 1)(15.34)² + (21 − 1)(18.23)² ] / (8 + 21 − 2) = 8293.9/27 = 307.18
We set the significance level α to 0.05. We consider the test statistic

    T = (x̄1 − x̄2) / √[ s²p (1/n1 + 1/n2) ].

Based on the sample information, the realized value of T is

    tcal = (132.86 − 127.44) / √[ 307.18 × (1/8 + 1/21) ] = 0.74
1

For a two-tailed test at the 0.05 significance level with 27 degrees of freedom, the right-side critical t-value is approximately 2.052. Since |tcal| = 0.74 < 2.052, we do not reject H0. This indicates that there is not enough statistical evidence to conclude that the mean systolic blood pressure between oral contraceptive users and non-users is significantly different.
Alternatively, the p-value is calculated as:

p-value = 2 Pr(T ≥ |tcal | | H0 ) = 2 Pr(T ≥ |0.74|) = 0.46

Since the p-value of 0.46 is greater than the significance level α =


0.05, we do not reject the null hypothesis H0 : µ1 − µ2 = 0. Therefore,
we conclude that there is no significant difference in mean systolic blood
pressure between OC users and non-OC users in the specified age group.
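The same pooled test can be reproduced directly from the summary statistics; a minimal sketch with scipy:

from scipy import stats

# Pooled (equal-variance) two-sample T-test from the Problem 9.8 summaries
res = stats.ttest_ind_from_stats(mean1=132.86, std1=15.34, nobs1=8,
                                 mean2=127.44, std2=18.23, nobs2=21,
                                 equal_var=True)
print(round(res.statistic, 2), round(res.pvalue, 2))   # 0.74 0.46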

2. Unequal Variances (Welch’s T -Test)


When the variances of the two populations are not assumed to be equal, use
Welch’s T -test. The test statistic T is as follows:
x̄1 − x̄2
T =r    
s21 n11 + s22 n12

For the Welch’s T -test, the degrees of freedom is


 2 2
s1 s22
n1 + n2
ν =  2 2  2 2
s1 s2
n1 n2

n1 −1 + n2 −1

Decision Rule: Compare the calculated t-statistic to the critical value:


• Reject H0 if |tcal | > critical value or if the p-value < α.

• Fail to Reject H0 if |tcal | ≤ critical value or if the p-value ≥ α.


Problem 9.9. A data scientist wants to determine if there is a significant dif-
ference in the mean sales figures before and after implementing a new marketing
strategy. The sales data before the strategy (sample size = 25) has a mean of
$5000 with a standard deviation of $800, and the sales data after the strategy
(sample size = 30) has a mean of $5300 with a standard deviation of $850. Test
if the new strategy had an impact at a 10% significance level.

Solution

The hypotheses for the test are as follows:

• Null Hypothesis (H0 ): The means of the two independent samples


are equal.
H0 : µ1 = µ2
• Alternative Hypothesis (H1 ): The means of the two independent
samples are not equal.
H1 : µ1 ̸= µ2

Test Statistic: To determine the test statistic for the two-sample t-test, we
use the following formula:
    T = (x̄1 − x̄2) / √( s²1/n1 + s²2/n2 )

We have,

• Sales before the strategy:



■ Sample size (n1 ) = 25


■ Mean (x̄1 ) = $5000
■ Standard deviation (s1 ) = $800

• Sales after the strategy:


■ Sample size (n2 ) = 30
■ Mean (x̄2 ) = $5300
■ Standard deviation (s2 ) = $850

Substituting the given values, we have:


    tcal = (5000 − 5300) / √( 800²/25 + 850²/30 ) = −300/222.85 = −1.35


Degrees of Freedom: To determine the degrees of freedom for the test, we use the formula:

    ν = (800²/25 + 850²/30)² / [ (800²/25)²/24 + (850²/30)²/29 ] ≈ 52

Decision: Using a t-table with ν ≈ 52 (Welch's approximation), the critical t-value for a two-tailed test at α = 0.10 is approximately ±1.675. Since |tcal| = 1.35 < 1.675, we fail to reject the null hypothesis. There is not enough evidence to suggest that the new marketing strategy had a significant impact on sales.
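The Welch version of the test is obtained with equal_var=False; a quick scipy check of this problem:

from scipy import stats

# Welch's (unequal-variance) two-sample T-test from summary statistics
res = stats.ttest_ind_from_stats(mean1=5000, std1=800, nobs1=25,
                                 mean2=5300, std2=850, nobs2=30,
                                 equal_var=False)
print(round(res.statistic, 2), round(res.pvalue, 3))   # -1.35, p > 0.10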

9.6.4 Paired T -Test


The paired t-test is used to determine whether there is a significant difference
between the means of two related groups. This test is commonly used in exper-
iments where measurements are taken on the same subjects under two different
conditions or at two different times. The hypotheses for the paired T -test are
defined as follows:
• Null Hypothesis (H0 ): The mean difference between paired obser-
vations is zero.
H0 : µd = 0

• Alternative Hypothesis (H1 ): The mean difference between paired


observations is not zero.
H1 : µd ̸= 0
For a one-tailed test, the alternative hypothesis can be specified as:

■ Right-Tailed Test:
H1 : µd > 0
■ Left-Tailed Test:
H1 : µd < 0

Test Statistic
Calculate the test statistic T using the formula:

    T = d̄ / (sd/√n)

where d̄ is the mean difference, sd is the standard deviation of the differences,
and n is the number of pairs.

Critical Value and Decision Rule:


• Determine the critical value from the t-distribution table based on α


and degrees of freedom (df = n − 1).

• For a two-tailed test, reject H0 if |t| is greater than the critical value.
For a one-tailed test, reject H0 if t falls in the direction specified by
Ha .
Problem 9.10. A nutritionist wants to evaluate the effectiveness of a new
dietary supplement in reducing cholesterol levels. A sample of 10 participants
had their cholesterol levels measured before and after taking the supplement for
a month. The cholesterol levels (in mg/dL) before and after the treatment are
provided as follows:

Before    After
220       210
240       230
215       205
230       215
250       240
210       200
225       210
240       220
235       225
245       230

Test if the dietary supplement had a significant effect on cholesterol levels at


the 5% significance level.

Solution
To determine whether the dietary supplement significantly affects cholesterol
levels, we use a paired t-test.
Hypotheses:
• Null Hypothesis (H0 ): There is no difference in cholesterol levels
before and after the treatment, i.e., µd = 0.

• Alternative Hypothesis (H1 ): There is a significant difference in


cholesterol levels, i.e., µd ̸= 0.
where µd is the mean difference in cholesterol levels before and after the
treatment.


First, we calculate the differences between the before and after measure-
ments for each participant:

Differences = [10, 10, 10, 15, 10, 10, 15, 20, 10, 15]

i     Before    After    Difference (di)
1     220       210      10
2     240       230      10
3     215       205      10
4     230       215      15
5     250       240      10
6     210       200      10
7     225       210      15
8     240       220      20
9     235       225      10
10    245       230      15

Next, we compute the mean difference (d̄) and the standard deviation of the differences (sd):

    d̄ = (10 + 10 + 10 + 15 + 10 + 10 + 15 + 20 + 10 + 15)/10 = 12.5

    sd = √[ Σ(di − d̄)²/(n − 1) ] = √[ ((10 − 12.5)² + (10 − 12.5)² + · · · + (15 − 12.5)²)/9 ] = 3.5355

The test statistic is calculated using:

    tcal = d̄ / (sd/√n) = 12.5 / (3.5355/√10) = 12.5/1.1180 ≈ 11.1803
With n − 1 = 9 degrees of freedom, we compare the calculated t-value to
the critical value from the t-distribution table at the 5% significance level. For
a two-tailed test with 9 degrees of freedom, the critical value is approximately
2.262.

Since 11.1803 exceeds 2.262, we reject the null hypothesis. At the 5% signif-
icance level, there is sufficient evidence to conclude that the dietary supplement
has a significant effect on reducing cholesterol levels.
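The paired test can be verified on the raw data with scipy:

from scipy import stats

before = [220, 240, 215, 230, 250, 210, 225, 240, 235, 245]
after = [210, 230, 205, 215, 240, 200, 210, 220, 225, 230]

res = stats.ttest_rel(before, after)          # paired T-test on the ten pairs
print(round(res.statistic, 4), res.pvalue)    # 11.1803, p well below 0.05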


9.7 Testing Equality of Several Means


9.7.1 Analysis of Variance (ANOVA)
Analysis of Variance (ANOVA) is a statistical technique used to compare the
means of multiple groups and assess whether there are statistically significant
differences between them. It is particularly useful when comparing more than
two independent groups or treatments.

For instance, to evaluate the effectiveness of different diabetes medications,


researchers design an experiment to explore the relationship between the type
of medication and the resulting blood sugar levels. The study involves a sam-

ple population, which is divided into several groups, each receiving a specific
medication over a trial period. After the trial, blood sugar levels are measured
for each participant. The mean blood sugar level is then calculated for each
group. ANOVA is used to compare these group means to determine if there are
significant differences, or if the means are statistically similar.
This method is also known as Fisher's analysis of variance, emphasizing
its capacity to examine how a categorical variable with multiple levels affects
a continuous variable. The application of ANOVA is determined by the re-
search design. Typically, ANOVAs are employed in three main forms: one-way
ANOVA, two-way ANOVA, and N-way ANOVA. The layout of data for one-way
ANOVA is shown in the following table.

Group 1    Group 2    ···    Group k
x11        x12        ···    x1k
x21        x22        ···    x2k
···        ···        ···    ···
xn₁1       xn₂2       ···    xnₖk
x̄1         x̄2         ···    x̄k
s²1        s²2        ···    s²k

Step 1: State the Hypotheses


• Null Hypothesis (H0 ): All group means are equal.

H0 : µ1 = µ2 = · · · = µk

• Alternative Hypothesis (H1 ): At least one group mean is different.

H1 : At least one µi differs from the others


Step 2: Collect and Summarize the Data


Organize the data into groups. Compute the following statistics for each group:
• x̄i : Mean of the i-th group

• s²i : Variance of the i-th group

• ni : Sample size of the i-th group


Calculate the overall mean x̄overall:

    x̄overall = ( Σᵢ₌₁ᵏ ni x̄i ) / ( Σᵢ₌₁ᵏ ni )

Step 3: Compute the ANOVA Table

Table 9.2: ANOVA Table

Source of Variation     Sum of Squares (SS)   Degrees of Freedom (df)   Mean Square (MS)     F-Statistic
Between Groups (SSB)    SSB                   k − 1                     MSB = SSB/(k − 1)    F = MSB/MSW
Within Groups (SSW)     SSW                   n − k                     MSW = SSW/(n − k)
Total (SST)             SST                   n − 1

Definitions:

• Sum of Squares Between Groups (SSB):

    SSB = Σᵢ₌₁ᵏ ni (x̄i − x̄overall)²

• Sum of Squares Within Groups (SSW):

    SSW = Σᵢ₌₁ᵏ (ni − 1) s²i

• Total Sum of Squares (SST):

    SST = Σᵢ₌₁ᵏ Σⱼ₌₁ⁿᵢ (xij − x̄overall)² = SSB + SSW

• Degrees of Freedom:


■ Between Groups: dfbetween = k − 1


■ Within Groups: dfwithin = n − k, where n is the total number of
observations

• Mean Squares:

  ■ Mean Square Between Groups (MSB):

      MSB = SSB/(k − 1)

  ■ Mean Square Within Groups (MSW):

      MSW = SSW/(n − k)

• F-Statistic:

    F = MSB/MSW

Step 4: Determine the Critical Value or p-Value

Find the critical value from the F-distribution table for dfbetween and dfwithin degrees of freedom at the desired significance level α. The F-distribution table is presented in Table A8 of the Appendix. Alternatively, compute the p-value associated with the F-statistic.

Step 5: Make a Decision

• If using the critical value:

  ■ If F-statistic > Critical value, reject the null hypothesis H0.
  ■ If F-statistic ≤ Critical value, do not reject the null hypothesis H0.

• If using the p-value:

  ■ If p-value ≤ α, reject the null hypothesis H0.
  ■ If p-value > α, do not reject the null hypothesis H0.

Step 6: Interpret the Results


Draw conclusions based on the test result:
• If the null hypothesis is rejected, conclude that there is a significant
difference in means among the groups.


• If the null hypothesis is not rejected, conclude that there is no signifi-


cant difference in means among the groups.
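These steps translate directly into Python. The sketch below (the function name is ours) builds the ANOVA quantities from raw group data and returns the F-statistic, the critical value, and the p-value:

import numpy as np
from scipy import stats

def one_way_anova(groups, alpha=0.05):
    """Compute F, the critical value, and the p-value for a one-way ANOVA."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = np.mean(np.concatenate(groups))
    ssb = sum(len(g) * (np.mean(g) - grand_mean) ** 2 for g in groups)  # SSB
    ssw = sum((len(g) - 1) * np.var(g, ddof=1) for g in groups)         # SSW
    msb, msw = ssb / (k - 1), ssw / (n - k)                             # MSB, MSW
    f_stat = msb / msw
    f_crit = stats.f.ppf(1 - alpha, k - 1, n - k)   # critical value
    p_value = stats.f.sf(f_stat, k - 1, n - k)
    return f_stat, f_crit, p_value

# Toy demo with three small groups
print(one_way_anova([[3, 4, 5], [6, 7, 8], [2, 3, 4]]))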
Problem 9.11. Suppose we want to compare the effectiveness of three different
blood pressure medications: Medication A, Medication B, and Medication C. We
have the following data on the systolic blood pressure of patients who received each
of the three medications.
• Medication A: 140, 145, 148, 142, 144

• Medication B: 130, 133, 135, 138, 132

• Medication C: 125, 128, 124, 127, 126

Test if there are significant differences in the mean blood pressure levels among
these groups at the 5% significance level.

Step 1: State the Hypotheses


• Null Hypothesis (H0 ): The mean systolic blood pressure is the same
across all three groups (Medication A, Medication B, Medication C).
H0 : µA = µB = µC

• Alternative Hypothesis (H1 ): At least one of the group means is


different.
H1 : At least one µi differs from the others.

Step 2: Collect and Summarize the Data


The data collected for the three groups are as follows:

Medication A Medication B Medication C


140 130 125
145 133 128
148 135 124
142 138 127
144 132 126
Mean (x̄i ) 143.8 133.6 126
Variance (s²i ) 9.2 9.3 2.5

We now calculate the overall mean (x̄overall):

    x̄overall = (5 × 143.8 + 5 × 133.6 + 5 × 126) / (5 + 5 + 5) = 134.47


Step 3: Compute the ANOVA Table


Sum of Squares Between Groups (SSB):
k
X
SSB = ni (x̄i − x̄overall )2
i=1
= 5 × (143.8 − 134.47)2 + 5 × (133.6 − 134.47)2 + 5 × (126 − 134.47)2
= 797.73

Sum of Squares Within Groups (SSW):

    SSW = (n1 − 1)s²1 + (n2 − 1)s²2 + (n3 − 1)s²3
        = 4 × 9.2 + 4 × 9.3 + 4 × 2.5
        = 84

Degrees of Freedom:
• Between Groups: dfbetween = k − 1 = 3 − 1 = 2

• Within Groups: dfwithin = n − k = 15 − 3 = 12


Mean Squares:
• Mean Square Between Groups (MSB):

    MSB = SSB/dfbetween = 797.73/2 = 398.87

• Mean Square Within Groups (MSW):

    MSW = SSW/dfwithin = 84/12 = 7

F-Statistic:
    F = MSB/MSW = 398.87/7 = 56.98

Table 9.3: ANOVA Table

Source of Variation   Sum of Squares (SS)   Degrees of Freedom (df)   Mean Square (MS)   F-Statistic
Between Groups        797.73                2                         398.87             56.98
Within Groups         84                    12                        7                  -
Total                 881.73                14                        -                  -


Step 4: Determine the Critical Value or p-Value


Using an F -distribution table, find the critical value for dfbetween = 2 and
dfwithin = 12 at a significance level α = 0.05. The critical value for F (2, 12) at
α = 0.05 is approximately 3.89.
Alternatively, using statistical software, we can calculate the p-value asso-
ciated with the F -statistic of 56.98, which is very small (much less than 0.05).

Step 5: Make a Decision


• Using the critical value: Since F = 56.98 is greater than the critical
value of 3.89, we reject the null hypothesis.

T
• Using the p-value: Since the p-value is less than 0.05, we also reject
the null hypothesis.
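The result can be confirmed with scipy's built-in one-way ANOVA:

from scipy import stats

med_a = [140, 145, 148, 142, 144]
med_b = [130, 133, 135, 138, 132]
med_c = [125, 128, 124, 127, 126]

f_stat, p_value = stats.f_oneway(med_a, med_b, med_c)
print(round(f_stat, 2), p_value)   # 56.98, p far below 0.05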

Problem 9.12. A company wants to determine if there are significant differ-


ences in the average productivity of employees working under three different
types of work environments. The productivity scores (in units produced per
hour) are collected from three groups of employees in different environments:
Environment A, Environment B, and Environment C. The data collected are:

Environment A Environment B Environment C


30 45 55
32 50 53
29 47 52
31 46 58
33 48 60

Test if there are significant differences in the mean productivity across the
three environments at the 5% significance level.

Solution
Step 1: Calculate Means and Overall Mean

381
CHAPTER 9. HYPOTHESIS TESTING FOR DECISION MAKING

Table 9.4: Means of Productivity Scores

Environment      Mean
Environment A    x̄A = (30 + 32 + 29 + 31 + 33)/5 = 31.00
Environment B    x̄B = (45 + 50 + 47 + 46 + 48)/5 = 47.20
Environment C    x̄C = (55 + 53 + 52 + 58 + 60)/5 = 55.60
Overall Mean     x̄overall = (31 + 47.20 + 55.60)/3 = 44.60

Environment A Environment B Environment C
30 45 55
32 50 53
29 47 52
AF 31
33
46
48
58
60
Mean (x̄i ) 31 47.2 55.6
Variance (s²i ) 2.5 3.7 11.3

Step 2: Sum of Squares Between Groups (SSB)

    SSB = n [ (x̄A − x̄overall)² + (x̄B − x̄overall)² + (x̄C − x̄overall)² ]
        = 5 [ (31 − 44.6)² + (47.20 − 44.6)² + (55.60 − 44.6)² ]
        = 1563.6
Step 3: Sum of Squares Within Groups (SSW)
Using the sample variances of the three groups from the table above:

    SSW = (nA − 1)s²A + (nB − 1)s²B + (nC − 1)s²C = 4 × 2.5 + 4 × 3.7 + 4 × 11.3 = 70


Step 4: Mean Squares

    MSB = SSB/(k − 1) = 1563.6/(3 − 1) = 781.8

    MSW = SSW/(n − k) = 70/(15 − 3) = 5.83

Step 5: F-Statistic

    F = MSB/MSW = 781.8/5.83 ≈ 134.02


Table 9.5: ANOVA Table

Source of Variation   Sum of Squares (SS)   Degrees of Freedom (df)   Mean Square (MS)   F-Statistic
Between Groups        1563.6                2                         781.8              134.02
Within Groups         70                    12                        5.83               -
Total                 1633.6                14                        -                  -

Step 6: Conclusion
Compare the calculated F-value to the critical value from the F-distribution table with k − 1 = 2 and n − k = 12 degrees of freedom at α = 0.05; the critical value is approximately 3.89.

• Since the calculated F-value (134.02) is greater than the critical value (3.89), we reject the null hypothesis.

• There is a significant difference in the mean productivity among the three environments.

9.8 Power of the Test


The power of a statistical test is a key measure of its effectiveness in detecting
a true effect when it exists. It is denoted by 1 − β, where β is the probability
of committing a Type II error; the power reflects the likelihood of correctly
rejecting a false null hypothesis. For example, if a test has a power of 0.80, it
means there is an 80% chance of detecting an effect if it is present.

Power of the test: The power of a test is the probability of correctly


rejecting the null hypothesis when it is false. That is,

    Power = Pr(reject H0 | H0 false) = 1 − β

A higher power indicates a greater likelihood of identifying a true effect.

Increasing the sample size or effect size, or decreasing variability, generally


enhances the test’s power. However, lowering α to reduce Type I errors can
also reduce power, highlighting the need for careful consideration in test design.
Ensuring adequate power is essential for reliably detecting meaningful effects
and making informed conclusions.


• The power of the two-tailed test H0 : µ = µ0 vs. H1 : µ ̸= µ0 for the specific alternative µ = µ1, where the underlying distribution is normal and the population variance σ² is assumed known, is given exactly by

    Power = Φ(−zα/2 + (µ0 − µ1)/(σ/√n)) + Φ(−zα/2 + (µ1 − µ0)/(σ/√n))    (9.1)

and approximately by

    Power ≈ Φ(−zα/2 + |µ1 − µ0|/(σ/√n))    (9.2)

where Φ is the cumulative distribution function of the standard normal distribution and zα is the critical value from the standard normal distribution corresponding to the significance level α.

• The power of the left-tail test H0 : µ = µ0 vs. H1 : µ < µ0:

    Power = Φ(−zα + (µ0 − µ1)/(σ/√n))    (9.3)

• The power of the right-tail test H0 : µ = µ0 vs. H1 : µ > µ0:

    Power = Φ(−zα + (µ1 − µ0)/(σ/√n))    (9.4)
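These formulas are straightforward to evaluate in Python; a minimal sketch (the function names are ours):

from math import sqrt
from scipy.stats import norm

def power_two_sided(mu0, mu1, sigma, n, alpha=0.05):
    """Exact power of the two-sided Z-test, Eq. (9.1)."""
    z = norm.ppf(1 - alpha / 2)
    shift = (mu1 - mu0) / (sigma / sqrt(n))
    return norm.cdf(-z + shift) + norm.cdf(-z - shift)

def power_one_sided(mu0, mu1, sigma, n, alpha=0.05):
    """Power of a one-sided Z-test, Eqs. (9.3)-(9.4)."""
    z = norm.ppf(1 - alpha)
    return norm.cdf(-z + abs(mu1 - mu0) / (sigma / sqrt(n)))

# One-sided example with the birthweight numbers used in Problem 9.13 below
print(round(power_one_sided(120, 115, 24, 100), 3))   # 0.669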

Problem 9.13. (Obstetrics) Compute the power of the test for the birthweight
R
data in Problem 9.3 with an alternative mean of 115 ounces (oz) and α = 0.05,
assuming the true standard deviation = 24 oz.

Solution
We have µ0 = 120 oz, µ1 = 115 oz, α = 0.05, σ = 24, n = 100. Thus,

    Power = Φ[−z0.05 + (120 − 115)√100/24]
          = Φ[−1.645 + 5 × 10/24]
          = Φ(0.438) = 0.669

Therefore, there is about a 67% chance of detecting a significant difference using


a 5% significance level with this sample size.
Problem 9.14. (Cardiovascular Disease, Pediatrics) Using a 5% level
of significance and a sample of size 10, compute the power of the test for the


cholesterol data in Problem 9.6, with an alternative mean of 190 mg/dL, a null
mean of 175 mg/dL, and a standard deviation of 50 mg/dL.

Solution
We have µ0 = 175 mg/dL, µ1 = 190 mg/dL, α = 0.05, σ = 50, n = 10. Thus,

    Power = Φ[−z0.05 + (190 − 175)√10/50]
          = Φ[−1.645 + 15 × √10/50]
          = Φ(−0.696) = 0.243

Therefore, the chance of finding a significant difference in this case is only 24.3%.
Problem 9.15. (Cardiovascular Disease, Pediatrics) Compute the power
of the test for the cholesterol data in Problem 9.6 with a significance level of
α = 0.01 vs. an alternative mean of 190 mg/dL.

Solution
We have µ0 = 175 mg/dL, µ1 = 190 mg/dL, α = 0.01, σ = 50, n = 10. Thus,

    Power = Φ[−z0.01 + (190 − 175)√10/50]
          = Φ[−2.326 + 15 × √10/50]
          = Φ(−1.377) = 0.08

which is lower than the power of 24.3% for α = 0.05, computed in Problem
9.14. What does this mean? It means that if the α level is lowered from 0.05
to 0.01, the β error will be higher or, equivalently, the power, which decreases
from 0.243 to 0.08, will be lower.

Factors Affecting the Power


The power of a test is influenced by several factors, including

(1). Significance level (α): If the significance level is made smaller (α


decreases), zα increases and hence the power decreases.

(2). Effect size (d): If the alternative mean is shifted farther away from
the null mean (d = |µ0 − µ1 | increases), then the power increases.

(3). Data variability: If the standard deviation of the distribution of in-


dividual observations increases (σ increases), then the power decreases.

(4). Sample size (n): If the sample size increases (n increases), then the
power increases.


9.9 Sample Size Estimation for the Mean Test


Estimating the appropriate sample size for a mean test is crucial to ensure that
the test has sufficient power to detect a significant difference. The sample size
needed depends on the desired level of statistical significance, the power of the
test, the population standard deviation, and the expected difference in means.
To estimate the required sample size, we need the quantity of the following:

• α: Significance level (commonly 0.05).

• β: Type II error rate (commonly 0.2, which gives a power of 0.8).

• σ: Population standard deviation.

• d: Minimum detectable difference in means (effect size).

• zα/2 : Critical value from the standard normal distribution for a two-
tailed test.
• zβ : Critical value from the standard normal distribution corresponding
to the power (1 − β) of the test.

9.9.1 When Testing for the Mean of a Normal Distribution (One-Sided Alternative)
Suppose we wish to test
    H0 : µ = µ0 vs. H1 : µ = µ1

where the data are normally distributed with mean µ and known variance σ 2 .
The sample size needed to conduct a one-sided test with significance level α
and probability of detecting a significant difference with power 100(1 − β)% is

    n = (zβ + zα)² σ² / (µ0 − µ1)²

where d = µ0 − µ1.
Problem 9.16. (Obstetrics) Consider the birthweight data in Problem 9.3.
Suppose that µ0 = 120 oz, µ1 = 115 oz, σ = 24, α = .05, 1 − β = 0.80, and we
use a one-sided test. Compute the appropriate sample size needed to conduct
the test.


Solution
Since the power 1 − β = 0.80, then β = 0.20. Therefore, the sample size is

    n = (zβ + zα)² σ² / (µ0 − µ1)² = (z0.20 + z0.05)² × 24² / (120 − 115)² = (0.84 + 1.645)² × 24² / 5² = 142.3 ≈ 143

The sample size is always rounded up, so we can be sure to achieve at least the
required level of power (in this case, 80%). Thus, a sample size of 143 is needed
to have an 80% chance of detecting a significant difference at the 5% level if

the alternative mean is 115 oz and a one-sided test is used.
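A minimal Python sketch of this sample-size formula (the function name is ours):

import math
from scipy.stats import norm

def sample_size_one_sided(mu0, mu1, sigma, alpha=0.05, power=0.80):
    z_alpha = norm.ppf(1 - alpha)
    z_beta = norm.ppf(power)        # z_beta, since beta = 1 - power
    n = (z_alpha + z_beta) ** 2 * sigma ** 2 / (mu0 - mu1) ** 2
    return math.ceil(n)             # always round up

print(sample_size_one_sided(120, 115, 24))   # 143 (Problem 9.16)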
Problem 9.17. (Cardiovascular Disease, Pediatrics) Consider the choles-
terol data in Problem 9.6. Suppose the null mean is 175 mg/dL, the alternative
mean is 190 mg/dL, the standard deviation is 50, and we wish to conduct a one-
sided significance test at the 5% level with a power of 90%. How large should
the sample size be?

Solution
Since the power 1 − β = 0.90, then β = 0.10. Therefore, the sample size is

    n = (zβ + zα)² σ² / (µ0 − µ1)² = (z0.10 + z0.05)² × 50² / (190 − 175)² = (1.28 + 1.645)² × 50² / 15² = 95.1 ≈ 96

Thus, 96 people are needed to achieve a power of 90% using a 5% significance


level.
9.9.2 Sample Size Estimation When Testing for the Mean
of a Normal Distribution (Two-Sided Alternative)
Suppose we wish to test
    H0 : µ = µ0 vs. H1 : µ = µ1

where the data are normally distributed with mean µ and known variance σ 2 .
The sample size needed to conduct a two-sided test with significance level α
and probability of detecting a significant difference with power 100(1 − β)% is

    n = (zβ + zα/2)² σ² / (µ0 − µ1)²

Note that this sample size is always larger than the corresponding sample size for a one-sided test, because zα/2 is larger than zα.


Problem 9.18. (Cardiology) Consider a study of the effect of a calcium-


channel-blocking agent on heart rate for patients with unstable angina. Suppose
we want at least 80% power for detecting a significant difference if the effect
of the drug is to change mean heart rate by 5 beats per minute over 48 hours
in either direction and σ = 10 beats per minute. How many patients should be
enrolled in such a study?

Solution
We assume α = 0.05 and σ = 10 beats per minute. We intend to use a two-sided test because we are not sure in what direction the heart rate will change after using the drug. Therefore, the sample size is estimated using the two-sided formulation. We have

    n = (zβ + zα/2)² σ² / (µ0 − µ1)² = (z0.20 + z0.025)² × 10² / 5² = (0.84 + 1.96)² × 100 / 25 = 31.36 ≈ 32
= 31.36 ≈ 32
Thus, 32 patients must be studied to have at least an 80% chance of finding
a significant difference using a two-sided test with α = 0.05 if the true mean
change in heart rate from using the drug is 5 beats per minute.
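The same calculation in Python, with zα replaced by zα/2 for the two-sided test:

import math
from scipy.stats import norm

# Problem 9.18: sigma = 10, d = 5, alpha = 0.05 (two-sided), power = 0.80
z_beta, z_half_alpha = norm.ppf(0.80), norm.ppf(1 - 0.05 / 2)
n = (z_beta + z_half_alpha) ** 2 * 10 ** 2 / 5 ** 2
print(math.ceil(n))   # 32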

9.10 Single Proportion Test


A single proportion test is used to determine whether a sample proportion is
significantly different from a hypothesized population proportion. This type of
test is commonly used in situations where we are interested in the proportion
of a certain characteristic in a population.

Steps for Single Proportion Test


1. Define the null and alternative hypotheses.
• H0 : p = p0 (the population proportion is p0 )
• H1 : p ̸= p0 (the population proportion is different from p0 )
2. Fix the level of significance α.
3. Calculate the sample proportion and test statistic.
• Let p̂ be the sample proportion:

      p̂ = x/n

  where x is the number of successes in the sample and n is the sample size.


Calculate the standard error of the proportion.


• The standard error SE is given by:

      SE = √( p0(1 − p0)/n )
Calculate the test statistic.
• The test statistic Z is given by:

      zcal = (p̂ − p0)/SE

4. Determine the critical value or p-value.
• For a two-tailed test at significance level α, the critical values are
−zα/2 and zα/2 .
• Alternatively, calculate the p-value corresponding to the test statistic Z.
5. Make a decision.
• If |zcal | > zα/2 , reject the null hypothesis H0 .
• If the p-value is less than α, reject the null hypothesis H0 .

Problem 9.19. Suppose we want to test if the proportion of defective items in


a production process is 5%. We take a sample of 200 items and find that 12 are
defective.

Solution

• Define the null and alternative hypotheses:

H0 : p = 0.05 and H1 : p ̸= 0.05

• Calculate the sample proportion:


      p̂ = 12/200 = 0.06

• Calculate the standard error:


      SE = √( 0.05 × 0.95 / 200 ) ≈ 0.0154

• Calculate the test statistic:


      zcal = (0.06 − 0.05)/0.0154 ≈ 0.649


• Determine the critical value for α = 0.05:


z0.025 ≈ 1.96

• Make a decision:
■ Since |zcal| = 0.649 < 1.96, we fail to reject the null hypothesis.
■ The p-value is greater than α = 0.05, so we fail to reject the null
hypothesis.
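The whole test can be reproduced in a few lines of Python:

import math
from scipy.stats import norm

x, n, p0 = 12, 200, 0.05             # data from Problem 9.19

p_hat = x / n
se = math.sqrt(p0 * (1 - p0) / n)    # standard error under H0
z_cal = (p_hat - p0) / se
p_value = 2 * norm.sf(abs(z_cal))    # two-tailed p-value
print(round(z_cal, 3), round(p_value, 3))   # 0.649 0.516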

9.10.1 Sample Size Estimation for Proportion Test

Estimating the appropriate sample size for a proportion test is crucial to ensure
that the test has sufficient power to detect a significant difference. The sample
size needed depends on the desired level of statistical significance, the power of
the test, and the expected proportion.

Steps for Sample Size Estimation


1. Define the parameters.
• α: Significance level (commonly 0.05).
• β: Type II error rate (commonly 0.2, which gives a power of 0.8).
• p0 : Hypothesized proportion under the null hypothesis.
• p1 : Expected proportion under the alternative hypothesis.
• zα/2 : Critical value from the standard normal distribution for a
two-tailed test.

• zβ : Critical value from the standard normal distribution corresponding to the type II error of the test.
2. Calculate the required sample size.
• The formula for the required sample size n is given by:

      n = [ ( zα/2 √(2p(1 − p)) + zβ √( p0(1 − p0) + p1(1 − p1) ) ) / (p1 − p0) ]²

  where p is the pooled proportion:

      p = (p0 + p1)/2
Problem 9.20. Suppose we want to estimate the sample size needed to test
whether the proportion of defective items is different from 5% (0.05) at a 5%
significance level with 80% power, and we expect the true proportion to be 8%
(0.08).


Solution
• Define the parameters:

α = 0.05, β = 0.2, p0 = 0.05, p1 = 0.08

• Find the critical values:

zα/2 = 1.96 (for a two-tailed test at 5% significance level)

and
zβ = 0.84 (for 80% power)

• Calculate the pooled proportion:
      p = (p0 + p1)/2 = (0.05 + 0.08)/2 = 0.065

• Calculate the required sample size:

      n = [ ( 1.96 √(2 × 0.065 × (1 − 0.065)) + 0.84 √(0.05 × 0.95 + 0.08 × 0.92) ) / (0.08 − 0.05) ]²
        = [ ( 1.96 × 0.3486 + 0.84 × 0.3480 ) / 0.03 ]²
        = [ ( 0.6833 + 0.2923 ) / 0.03 ]²
        ≈ 1058

Therefore, the required sample size is approximately 1058.
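A Python check of this formula (with unrounded critical values the answer is 1059, one more than the hand calculation with rounded z-values above):

import math
from scipy.stats import norm

alpha, power, p0, p1 = 0.05, 0.80, 0.05, 0.08
z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)
p_bar = (p0 + p1) / 2

num = (z_a * math.sqrt(2 * p_bar * (1 - p_bar))
       + z_b * math.sqrt(p0 * (1 - p0) + p1 * (1 - p1)))
n = (num / (p1 - p0)) ** 2
print(math.ceil(n))   # 1059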

9.11 Concluding Remarks


In this chapter, we explored the fundamental concepts of hypothesis testing, a
critical tool for decision-making in statistics. We began with the essential steps
of formulating hypotheses, establishing significance levels, and understanding
test statistics. The introduction of p-values and their interpretation highlighted
the nuances of drawing conclusions from statistical data.

We delved into various testing methodologies, including tests for means and
proportions, each with specific applications and considerations. The discus-
sions on power analysis and sample size estimation underscore the importance


of planning in hypothesis testing to ensure robust and reliable results.

As we conclude, it is clear that hypothesis testing not only enhances our


ability to make informed decisions based on data but also provides a structured
framework for assessing the validity of claims in diverse fields. Mastery of these
concepts will empower you to apply statistical reasoning effectively in your
future endeavors.

9.12 Chapter Exercises


1. A company claims that their new battery lasts an average of 500 hours.

To test this claim, a sample of 20 batteries is tested, and the sample mean
is found to be 490 hours with a sample standard deviation of 15 hours.
Test the company’s claim at a 5% significance level.
2. A researcher wants to test if the average height of a population of adult
males is different from 70 inches. A random sample of 30 men is selected,
and the sample mean height is found to be 68.5 inches with a standard
deviation of 3 inches. Perform a hypothesis test at the 5% significance
level.

(a) State the null and alternative hypotheses.


(b) Calculate the test statistic.
(c) Determine the critical value or p-value.
(d) Make a decision to reject or fail to reject the null hypothesis.

3. In a survey of 200 people, 120 indicated that they prefer coffee over tea.
Test whether the proportion of people who prefer coffee is different from

50% at the 1% significance level.

(a) State the null and alternative hypotheses.


(b) Calculate the sample proportion.
(c) Calculate the test statistic.
(d) Determine the critical value or p-value.
(e) Make a decision to reject or fail to reject the null hypothesis.

4. A pharmaceutical company is testing a new drug that they believe will


lower blood pressure by an average of 5 mmHg. The standard deviation
of blood pressure in the population is known to be 10 mmHg. Calculate
the power of the test if the sample size is 50 and the significance level is
0.05.


5. A study is being designed to estimate the mean weight of newborns in a


hospital. The standard deviation of birth weights is known to be 1.5 kg.
How many newborns need to be included in the sample to estimate the
mean weight within 0.2 kg with 95% confidence?

(a) State the desired confidence level and margin of error.


(b) Determine the critical value for the confidence level.
(c) Calculate the required sample size.

6. A political pollster wants to estimate the proportion of voters who support


a particular candidate. How large a sample is needed to estimate the true

proportion within 3% with 95% confidence?

(a) State the desired confidence level and margin of error.


(b) Determine the critical value for the confidence level.
(c) Use an estimated proportion (e.g., 0.5 if no prior estimate is avail-
AF able) to calculate the required sample size.

7. Benzene is a potential carcinogen, and a chemical company wants to de-


termine if the concentration of benzene in the air is greater than 1 ppm.
The following sample data represents the concentration of benzene (in
ppm) in the air at various locations within the company:

0.21 1.44 2.54 2.97 0.00


3.91 2.24 2.41 4.50 0.15
0.30 0.36 4.50 5.03 0.00
2.89 4.71 0.85 2.60 1.26

You are tasked with testing whether the concentration of benzene is


greater than 1 ppm at a 5% level of significance.
8. Suppose we want to estimate the sample size needed to test whether
the mean of a population is different from a specified value. We set
the significance level at 5% (α = 0.05), the power at 80% (β = 0.2),
the population standard deviation at 10, and the minimum detectable
difference in means at 3.
9. A manufacturer claims that their light bulbs last an average of 1,200
hours. A consumer group tests a sample of 30 bulbs and finds a mean
lifetime of 1,150 hours with a standard deviation of 100 hours. Test the
manufacturer’s claim at the 0.05 significance level.


10. A school claims that the average score of its students on a standardized
test is 75. A random sample of 50 students has a mean score of 72
with a standard deviation of 10. Conduct a hypothesis test at the 0.01
significance level to determine if the school’s claim is valid.
11. A researcher wants to compare the effectiveness of two different teaching
methods. Method A has a sample mean of 82 with a standard deviation
of 5 from 25 students. Method B has a sample mean of 78 with a standard
deviation of 6 from 30 students. Test the hypothesis that the two methods
are equally effective at a significance level of 0.05.
12. A study compares the average daily water consumption of two cities. City

X has a mean consumption of 150 liters with a standard deviation of 20
liters based on a sample of 40 households. City Y has a mean consumption
of 160 liters with a standard deviation of 25 liters based on a sample of
35 households. Conduct a hypothesis test at the 0.05 level.
13. A dietician wants to test the effectiveness of a new diet plan. She measures
the weight of 10 participants before and after the diet. The weights (in
kg) are as follows:
• Before: 85, 78, 90, 82, 88, 80, 76, 85, 87, 90
• After: 83, 76, 87, 80, 84, 78, 75, 82, 86, 89
Test whether the diet plan has significantly reduced weight at the 0.05
significance level.
14. A medical researcher is testing the effect of a new medication on blood
pressure. He measures the blood pressure of 12 patients before and after
treatment. The readings (in mmHg) are as follows:

• Before: 130, 135, 140, 128, 132, 138, 136, 134, 129, 137, 141, 133
• After: 125, 130, 135, 127, 130, 132, 128, 129, 126, 134, 138, 131
Conduct a hypothesis test at the 0.01 significance level.
15. Explain the concepts of Type I and Type II errors in the context of hy-
pothesis testing. Provide examples of each based on the exercises above.
16. A new software is claimed to improve productivity. If you decide to reject
the null hypothesis that it does not improve productivity (Type I error)
when in fact it does not, what are the consequences of such an error?
17. A researcher is conducting a study to evaluate whether a new drug is
effective in lowering blood pressure. The null hypothesis is that the drug
has no effect on blood pressure (i.e., the mean change in blood pressure is
zero), while the alternative hypothesis is that the drug does lower blood
pressure. The researcher expects that the drug will lower blood pressure


by an average of 5 mmHg, based on previous studies. The population


standard deviation for the change in blood pressure is known to be 12
mmHg. The sample size for the study is 40 patients, and the significance
level for the hypothesis test is set to 0.05. Given the information above,
calculate the power of the hypothesis test to detect a true mean difference
of 5 mmHg.
18. A company wants to test whether a new production process increases the
average output of widgets. The company conducts a hypothesis test using
a one-sample t-test with a significance level of 0.05. The null hypothesis is
that the mean output is 100 widgets per hour, and the alternative hypoth-
esis is that the mean output is greater than 100 widgets per hour. The

population standard deviation is known to be 15 widgets. The company
expects that the new process will increase the mean output by 5 widgets
per hour, i.e., the true population mean is 105 widgets per hour. Given
that the sample size is 25, calculate the power of the test to detect a true
mean of 105 widgets per hour.
Chapter 10

Correlation and Regression Analysis
10.1 Introduction
In the realm of data science, understanding the relationships between variables
is crucial for deriving actionable insights from data. This chapter provides a
comprehensive overview of correlation and regression analysis, fundamen-
tal techniques in statistical modeling that are essential for data-driven decision-
making. Correlation analysis allows data scientists to quantify the strength
and direction of (linear) relationships between two variables, offering a pre-
liminary understanding of their interdependencies.

We begin by exploring scatter diagrams, which offer a visual representation
of relationships between variables, followed by discussions on covariance and
correlation coefficients that formalize these relationships into quantifiable met-
rics. These foundational concepts pave the way for more advanced regression
techniques.

Regression analysis is a powerful tool used to model and predict the behavior
of a dependent variable based on one or more independent variables. This
chapter covers simple linear regression as well as multiple linear regression, pro-
viding a detailed examination of the assumptions, estimation procedures, and
interpretation of regression coefficients. We also delve into evaluating model
performance through metrics such as R2 and adjusted R2 , which are critical for
assessing the accuracy and reliability of the model.

To bridge theory and practice, Python code examples are integrated through-
out the chapter, demonstrating how to implement these techniques in real-world
data science applications. These practical illustrations enhance the understand-
ing of both theoretical concepts and their computational applications, equip-
ping data scientists with the tools needed to effectively analyze and interpret
complex datasets.

10.2 Scatter Diagram


A scatter diagram, also known as a scatter plot or scatter graph, is a type
of plot using Cartesian coordinates to illustrate the relationship between two
variables. In this diagram, one variable (usually the dependent variable, which is
the variable being estimated or predicted) is placed on the Y-axis, and the other
variable (usually the independent variable, which is used as the predictor) is scaled
on the X-axis. By plotting data points for pairs of observations, the scatter
diagram visually represents how the two variables relate to each other.

Let’s say we have data on the height (in cm) and weight (in kg) of a group
of individuals. The height and weight pairs might look like this:
Table 10.1: Height and Weight of Individuals

Height (cm)   Weight (kg)
150           55
160           62
170           64
180           71
190           73
185           81

The scatter diagram of the height (in cm) and weight (in kg) of a group of
individuals, as given in Table 10.1, is presented in Figure 10.1.

[Figure: scatter plot of Height (cm) on the X-axis against Weight (kg) on the Y-axis]

Figure 10.1: Scatter diagram of Height and Weight.

Here’s how to interpret a scatter plot:

1. Direction of the Relationship


• Positive Correlation: If the points trend upwards from left to right,
it indicates a positive correlation, meaning that as one variable in-
creases, the other tends to increase as well.

• Negative Correlation: If the points trend downwards from left to
  right, it indicates a negative correlation, meaning that as one variable
  increases, the other tends to decrease.

• No Correlation: If the points are scattered randomly with no clear
  pattern, there is no correlation between the variables.

2. Strength of the Relationship


• Strong Correlation: Points closely clustered around a line suggest a
strong relationship between the variables.

• Weak Correlation: Points more widely spread out but still following
a general trend indicate a weak relationship.

• No Correlation: Widely scattered points without a discernible pat-


tern suggest no relationship between the variables.


3. Form of the Relationship


• Linear Relationship: If the points form a straight line, the relation-
ship between the variables is linear, indicating that a change in one
variable is associated with a proportional change in the other.

• Non-Linear Relationship: If the points form a curved pattern, the


relationship is non-linear, which may indicate a quadratic, exponential,
or other more complex relationship.

4. Outliers
• Outliers: Points that are far from the general pattern of the data
are called outliers. They may indicate special cases or errors in data
collection and can significantly affect the interpretation of the scatter
plot.

Interpretation of Figure 10.1
Consider a scatter plot where height (cm) is plotted against weight (kg):

• Direction: If the points generally trend upwards, there is a positive
  correlation between height and weight.

• Strength: If the points are closely packed around a line, the correla-
tion is strong; if they are more spread out, the correlation is weaker.

• Form: If the points lie on or near a straight line, the relationship is
  linear.

• Outliers: If a point is far away from the others (e.g., a height of 190 cm
  but a weight of 55 kg), it may be an outlier, which requires further
  investigation.

10.3 Python Code: Scatter Diagram


import matplotlib.pyplot as plt

# Data from Table 10.1
heights = [150, 160, 170, 180, 190, 185]  # Heights in cm
weights = [55, 62, 64, 71, 73, 81]        # Weights in kg

# Create a scatter plot
plt.scatter(heights, weights, color='blue', marker='o')

# Add title and labels
plt.title('Height vs Weight')
plt.xlabel('Height (cm)')
plt.ylabel('Weight (kg)')

# Show grid
plt.grid(True)

# Display the plot
plt.show()

T
10.4 Covariance
A scatter diagram is a powerful tool for visualizing the relationship between
two variables. However, it has some limitations:
• No Quantitative Measure: While scatter plots can visually show
  trends, they do not provide a quantitative measure of the strength or
  direction of the relationship between the variables. For this, statistical
  measures like covariance or correlation are needed.

• Subjectivity in Interpretation: The interpretation of scatter plots


can sometimes be subjective, especially in cases of weak or non-linear
relationships where the trend is not immediately obvious.

• Sensitivity to Outliers: Scatter diagrams can be highly sensitive


to outliers, which can distort the perceived relationship between the
variables.
Some of these limitations can be minimized by another popular statistical
technique, called covariance.

Covariance: Covariance is a statistical measure that quantifies the degree


to which two random variables change together. It assesses whether an
increase in one variable corresponds to an increase or decrease in another
variable. The sample covariance, denoted by sxy , can be calculated using
the formula:
sxy = ( Σᵢ xi yi − n x̄ ȳ ) / (n − 1)
where:

• xi and yi are the individual data points of the variables x and y,

• x̄ and ȳ are the means of x and y, respectively,


• n is the number of data points.


Since

Σᵢ (xi − x̄)(yi − ȳ) = Σᵢ xi yi − n x̄ ȳ,

the covariance can also be calculated using the formula

sxy = Σᵢ (xi − x̄)(yi − ȳ) / (n − 1).

Interpretation of the value of Covariance:
• A positive covariance indicates that the variables tend to move in the
same direction, meaning that when one variable increases, the other
tends to increase as well.

• Conversely, a negative covariance suggests that the variables move in


opposite directions—when one variable increases, the other tends to
decrease.

• A covariance of zero (or close to zero) indicates that there is no (or only
  a weak) linear relationship between the two variables.
Problem 10.1. Consider a dataset containing the number of commercials aired
and the corresponding sales volume for a product over ten weeks:

Table 10.2: Sample Data for the San Francisco Electronics Store

Week   Number of commercials (X)   Sales Volume ($100s)
1      2                           50
2      5                           57
3      1                           41
4      3                           54
5      4                           54
6      1                           38
7      5                           63
8      3                           48
9      4                           59
10     2                           46

To further analyze this data, follow these steps:

401
CHAPTER 10. CORRELATION AND REGRESSION ANALYSIS

(i). Draw a scatter diagram of the data points representing the relationship
between the number of commercial advertisements and the sales volume.
(ii). Based on the scatter diagram, describe the observed trend or pattern in
the data.

(iii). Calculate the sample covariance between the number of commercial ad-
vertisements and the sales volume to quantitatively assess the degree of
association between these two variables.
(iv). Interpret the sample covariance value in terms of the strength and direc-
tion of the relationship between the number of commercial advertisements
and the sales volume.

Solution
(i). The scatter plot of Number of commercials (X) and Sales Volume ($100s)
is presented in Figure 10.2.
[Figure: scatter plot of Number of Commercials (X) on the X-axis against Sales Volume ($100s) on the Y-axis]

Figure 10.2: Scatter Plot of Number of Commercials vs. Sales Volume

(ii). The scatter plot in Figure 10.2 illustrates the relationship between the
number of commercials aired and the sales volume:

• Positive Relationship: The plot shows that as the number of com-


mercials increases, the sales volume generally increases as well, indi-
cating a positive relationship between the two variables.


• Data Points Distribution: The data points form an upward slope,


suggesting that higher numbers of commercials are associated with
higher sales volumes. The points are not perfectly aligned, indicating
some variability in the relationship.

• Strength of Relationship: Although the trend is positive, the re-


lationship is not perfectly linear, suggesting that other factors might
influence the sales volume or that the relationship may not be strictly
linear.

• Outliers: There are no significant outliers, indicating a relatively sta-


ble relationship between the two variables across the observed data.

(iii). Let’s calculate the sample covariance for the given data in Table 10.3:

Table 10.3: Data Table

i       xi    yi     xi·yi
1       2     50     100
2       5     57     285
3       1     41     41
4       3     54     162
5       4     54     216
6       1     38     38
7       5     63     315
8       3     48     144
9       4     59     236
10      2     46     92
Total   30    510    1629

Mean of X: x̄ = 30/10 = 3
Mean of Y: ȳ = 510/10 = 51


Therefore,

sxy = (1629 − 10 × 3 × 51) / (10 − 1) = 99/9 = 11
(iv). The positive covariance of 11 suggests a positive relationship between
the number of commercials aired and the sales volume, indicating that as the
number of commercials increases, the sales volume tends to increase as well.
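The computation above is easy to verify in Python. A minimal sketch using
numpy (np.cov with its default ddof=1 applies the same n − 1 denominator as
the sample covariance formula):

import numpy as np

# Weekly data from Table 10.2
commercials = np.array([2, 5, 1, 3, 4, 1, 5, 3, 4, 2])
sales = np.array([50, 57, 41, 54, 54, 38, 63, 48, 59, 46])

# np.cov returns the 2x2 covariance matrix; the off-diagonal
# entry is the sample covariance s_xy
cov_matrix = np.cov(commercials, sales)
print(cov_matrix[0, 1])  # expected: 11.0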

10.5 Correlation Analysis

Correlation analysis is a group of techniques to measure the strength and
direction of the relationship between two variables. There are different
types of correlation coefficients that can be used depending on the nature of
the variables being analyzed and the assumptions of the data. Some of the
common types of correlation coefficients include:
1. Pearson correlation coefficient: Measures the linear relationship
   between two continuous variables.
2. Spearman rank correlation coefficient: Measures association be-
tween ranked variables.
3. Kendall tau correlation coefficient: Measures similarity of orderings
of data pairs.
4. Point-biserial correlation coefficient: Measures association between
continuous and binary variables.
5. Phi coefficient: Measures association between two binary variables.

6. Cramér’s V: Measures association between two nominal variables.

10.6 Pearson’s Correlation Coefficient


Pearson’s Correlation Coefficient is a statistical measure that indicates the
extent to which two variables fluctuate together. In many contexts, especially
when dealing with linear relationships between continuous variables, Pearson’s
Correlation Coefficient is used and often referred to simply as the correlation
coefficient.

Pearson’s Correlation Coefficient: Pearson’s Correlation Coefficient
is a measure of the strength of the linear relationship between two con-
tinuous variables. It is denoted by r or rxy and defined as

r = ( Σᵢ xi yi − n x̄ ȳ ) / ( √(Σᵢ xi² − n x̄²) · √(Σᵢ yi² − n ȳ²) ).    (10.1)

It can be easily shown that

Σᵢ (xi − x̄)(yi − ȳ) = Σᵢ xi yi − n x̄ ȳ;

Σᵢ (xi − x̄)² = Σᵢ xi² − n x̄²;

and

Σᵢ (yi − ȳ)² = Σᵢ yi² − n ȳ².

Therefore, the formula given in Equation (10.1) can be written as

r = Σᵢ (xi − x̄)(yi − ȳ) / ( √(Σᵢ (xi − x̄)²) · √(Σᵢ (yi − ȳ)²) )
  = Σᵢ (xi − x̄)(yi − ȳ) / ( (n − 1) sx sy )
  = sxy / (sx sy)

where sx = √( Σᵢ (xi − x̄)² / (n − 1) ) is the sample standard deviation of x,
and analogously for sy.

10.6.1 Interpretation of the Value of the Correlation Coefficient
The following scale summarizes the strength and direction of the correlation
coefficient:

−1.00            perfect negative correlation
−1.00 to −0.50   strong negative correlation
about −0.50      moderate negative correlation
−0.50 to 0       weak negative correlation
0                no correlation
0 to +0.50       weak positive correlation
about +0.50      moderate positive correlation
+0.50 to +1.00   strong positive correlation
+1.00            perfect positive correlation

Problem 10.2. Consider Problem 10.1. Calculate the correlation coefficient
and interpret the results.

Solution
From the solution of Problem 10.1, we have the sample covariance sxy = 11.
The calculations for the variances are presented in Table 10.4.

Table 10.4: Data Table

i       xi    yi     xi²    yi²     xi·yi
1       2     50     4      2500    100
2       5     57     25     3249    285
3       1     41     1      1681    41
4       3     54     9      2916    162
5       4     54     16     2916    216
6       1     38     1      1444    38
7       5     63     25     3969    315
8       3     48     9      2304    144
9       4     59     16     3481    236
10      2     46     4      2116    92
Total   30    510    110    26576   1629

We can compute the sample standard deviations for the two variables:

sx = √( (Σᵢ xi² − n x̄²) / (n − 1) ) = √( (110 − 10 × 3²) / 9 ) = 1.49

sy = √( (Σᵢ yi² − n ȳ²) / (n − 1) ) = √( (26576 − 10 × 51²) / 9 ) = 7.93

Hence, the sample correlation coefficient equals

r = sxy / (sx sy) = 11 / (1.49 × 7.93) = 0.93,

which indicates a strong positive linear relationship between the number of
commercials and the sales volume.
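As a quick check, the same value is obtained in Python with scipy’s pearsonr;
a minimal sketch:

from scipy.stats import pearsonr

# Weekly data from Table 10.2
commercials = [2, 5, 1, 3, 4, 1, 5, 3, 4, 2]
sales = [50, 57, 41, 54, 54, 38, 63, 48, 59, 46]

r, p_value = pearsonr(commercials, sales)
print(f"r = {r:.2f}")  # expected: about 0.93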

[Figure: two panels of Y against X; in Panel A the points lie exactly on an upward line (r = +1), in Panel B exactly on a downward line (r = −1)]

Figure 10.3: Perfect Positive (r = +1) and Perfect Negative (r = −1) Linear
Relationship.

[Figure: Y against X with points scattered with no discernible pattern]

Figure 10.4: No or Weak Relationship (r → 0)

10.6.2 Properties of the Correlation Coefficient


The properties of Pearson’s correlation coefficient include:


1. Range: The correlation coefficient r lies in the range −1 ≤ r ≤ 1.
   • r = 1: Indicates a perfect positive linear correlation between
     variables. See Panel A in Figure 10.3.
   • r = −1: Indicates a perfect negative linear correlation between
     variables. See Panel B in Figure 10.3.
   • r = 0: Indicates no linear correlation between variables. See
     Figure 10.4.
   • |r| close to 1 indicates a strong relationship.
   • |r| close to 0 indicates a weak relationship.
2. Symmetry: The correlation between X and Y is the same as between
Y and X:
r(X, Y ) = r(Y, X)

3. Unit-Free: The correlation coefficient is a dimensionless number, mean-


ing it does not depend on the units of the variables.
4. Unaffected by Change of Origin and Scale: The correlation coeffi-
cient remains unchanged if we add a constant to one or both variables, or
if we multiply one or both variables by a positive constant:
A
r(aX + b, cY + d) = r(X, Y )

where a, b, c, and d are constants, with a > 0 and c > 0.


5. Linear Relationship: The correlation coefficient measures the strength
   and direction of a linear relationship between two variables. It does not
   capture non-linear relationships.
6. Sensitivity to Outliers: The correlation coefficient can be sensitive to
   outliers, which can distort the value of r and provide a misleading
   interpretation of the relationship between the variables.
7. Dependence on the Type of Data: The correlation coefficient is ap-
   propriate for continuous data but may not be suitable for ordinal or cat-
   egorical data without modification.
8. Pairwise Comparisons: The correlation coefficient is computed for
pairs of variables and does not extend naturally to more than two vari-
ables.

Problem 10.3. The following sample of observations was randomly selected.

x 4 5 3 6 10
y 4 6 5 7 7


(a). Draw a scatter diagram. Comment on your output.


(b). Determine the correlation coefficient and interpret the strength and
direction of the relationship between x and y.

Solution
(a)
The scatter plot given in Figure 10.5 visualizes the relationship between two
variables, X and Y , for the given sample of observations. Each point on the
scatter plot represents a pair of x and y values.

[Figure: scatter plot of y against x for the five sample observations]

Figure 10.5: The scatter plot between x and y.

The scatter plot indicates that there is a positive correlation between X and
Y , as the points generally trend upwards. This indicates that higher values of
X tend to be associated with higher values of Y . However, to draw definitive
conclusions about the strength and nature of this relationship, further statis-
tical analysis, such as calculating the Pearson correlation coefficient, would be
necessary.
(b)


xi     yi     xi²     yi²     xi·yi
4      4      16      16      16
5      6      25      36      30
3      5      9       25      15
6      7      36      49      42
10     7      100     49      70

Σxi = 28, Σyi = 29, Σxi² = 186, Σyi² = 175, Σxi·yi = 173
x̄ = 5.6, ȳ = 5.8

r = ( Σᵢ xi yi − n x̄ ȳ ) / ( √(Σᵢ xi² − n x̄²) · √(Σᵢ yi² − n ȳ²) )
  = (173 − 5 × 5.6 × 5.8) / ( √(186 − 5 × 5.6²) × √(175 − 5 × 5.8²) )
  = 0.7522

The value r = 0.7522 indicates a strong positive linear relationship between


the two variables X and Y . In general, values of r between 0.7 and 1.0 (or
-0.7 and -1.0 for negative relationships) are considered to represent a strong
relationship.

10.6.3 Testing the Significance of the Correlation Coefficient
Testing the significance of the correlation coefficient involves determining whether
the observed correlation between two variables is statistically significant, mean-
ing it is unlikely to have occurred by chance. Here’s a step-by-step guide on
how to perform this test:
1. Hypotheses:

   H0: ρ = 0 (There is no linear relationship between the two variables.)
   H1: ρ ≠ 0 (There is a linear relationship between the two variables.)

2. Level of significance: α

3. Test statistic:

   T = r √(n − 2) / √(1 − r²) ∼ t distribution with n − 2 degrees of freedom


4. Decision Rule
   Reject H0 if
   T > tα/2,n−2 or T < −tα/2,n−2,
   or equivalently, reject H0 if and only if |T| > tα/2,n−2,
   at the α level of significance.

Correlation Test: In general, if the null hypothesis is

H0 : ρ = 0

and if the null hypothesis is true, the test statistic T follows the Student’s-t
distribution with (n − 2) degrees of freedom, i.e., T ∼ t(n − 2).

T
Table 10.5: Decision rule for the test of hypothesis H0 : ρ = 0

Alternative hypothesis Reject H0 if


H1 : ρ < 0 T < −tα,n−2
H1 : ρ > 0 T > tα,n−2
H1 : ρ ̸= 0 T > tα/2,n−2 or T < −tα/2,n−2

For the example given in Problem 10.3, let tcal denote the realized value of T.
Therefore,

tcal = 0.7522 × √(5 − 2) / √(1 − 0.7522²) = 1.9772

The critical value is t0.025,3 = 3.182. Since |tcal| = 1.9772 < 3.182, we do
not reject the null hypothesis that ρ = 0 and conclude that there is not enough
evidence of a linear relationship between x and y in the population.
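The whole test can be reproduced in Python; the sketch below recomputes the
test statistic and critical value for the data of Problem 10.3 (scipy’s pearsonr
also reports the two-sided p-value of this same t test):

import numpy as np
from scipy.stats import pearsonr, t

x = np.array([4, 5, 3, 6, 10])
y = np.array([4, 6, 5, 7, 7])
n = len(x)

r, p_value = pearsonr(x, y)

# Test statistic T = r * sqrt(n - 2) / sqrt(1 - r^2)
t_cal = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
t_crit = t.ppf(1 - 0.05 / 2, df=n - 2)  # two-sided critical value

print(f"r = {r:.4f}, t_cal = {t_cal:.4f}, critical value = {t_crit:.3f}")
print(f"two-sided p-value = {p_value:.4f}")
# |t_cal| = 1.9772 < 3.182, so H0 is not rejected at the 0.05 level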

10.6.4 Python Code: Correlation Matrix


# Example 1: using the scipy library
from scipy.stats import pearsonr

# Data from Table 10.1
heights = [150, 160, 170, 180, 190, 185]  # Heights in cm
weights = [55, 62, 64, 71, 73, 81]        # Weights in kg

# Calculate Pearson correlation coefficient
correlation, p_value = pearsonr(heights, weights)

print(f"Pearson correlation coefficient: {correlation}")
print(f"P-value: {p_value}")

# Example 2: for a pandas dataset
import pandas as pd

# Example data (you should replace this with your actual data)
data = pd.DataFrame({
    'x1': [1, 2, 3, 4, 5],
    'x2': [2, 3, 4, 5, 6],
    'x3': [3, 4, 5, 6, 7]
})

# Compute correlation matrix
correlation_matrix = data.corr()
print(correlation_matrix)

# Compute correlation matrix rounded to 2 decimal places
correlation_matrix = data.corr().round(2)

# View the correlation matrix
print(correlation_matrix)

10.7 Rank Correlation


Rank correlation is a statistical measure used to evaluate the strength and
direction of the association between two ranked variables. Unlike Pearson’s
correlation coefficient, which assesses linear relationships between two contin-
uous variables, rank correlation methods focus on the order or ranking of the

data rather than the actual numerical values.

10.7.1 Key Types of Rank Correlation


Spearman’s Rank Correlation Coefficient (Spearman’s rho, ρ)
Spearman’s rank correlation coefficient is a non-parametric measure of rank
correlation. It measures the relationship between rankings of different ordi-
nal variables or different rankings of the same variable, where a “ranking” is
the assignment of the ordering labels “first”, “second”, “third”, etc. to differ-
ent observations of a particular variable. It assesses how well the relationship
between two variables can be described using a monotonic function.

If ri and si are the ranks of the i-th member according to the x and y variables
respectively, then the rank correlation coefficient is

rR = 1 − 6 Σᵢ di² / ( n(n² − 1) )

where n is the number of data points of the two variables and di = ri − si is
the difference between the ranks of the i-th element.

The Spearman correlation coefficient, rR, can take values from +1 to −1.

• rR = 1: Perfect positive correlation.

• rR = −1: Perfect negative correlation.

• rR = 0: No correlation.
Problem 10.4. Based on the following data, find the rank correlation between
marks of English and Mathematics courses.
English   56   75   45   71   62   64   58   80   76   61
Maths     66   70   40   60   65   56   59   77   67   63

Solution
The procedure for ranking these scores is as follows:

English (mark)   Maths (mark)   Rank (English)   Rank (Maths)   di    di²
56               66             9                4              5     25
75               70             3                2              1     1
45               40             10               10             0     0
71               60             4                7              −3    9
62               65             6                5              1     1
64               56             5                9              −4    16
58               59             8                8              0     0
80               77             1                1              0     0
76               67             2                3              −1    1
61               63             7                6              1     1

The realized value of the Spearman rank correlation is

rR = 1 − 6 Σᵢ di² / ( n(n² − 1) ) = 1 − (6 × 54) / ( 10(10² − 1) ) = 0.6727


This indicates a strong positive relationship between the ranks individuals ob-
tained in the Maths and English exam. That is, the higher you ranked in maths,
the higher you ranked in English also, and vice versa.

Equal Ranks or Tie in Ranks


• For tied observations, the rank correlation can be computed by adding
  m(m² − 1)/12 to the value of Σᵢ di², where m stands for the number of
  items whose ranks are equal.

• If there is more than one such group of items with common rank, this
  value is added as many times as the number of such groups.

• Then the formula for the rank correlation is

  rR = 1 − 6{ Σᵢ di² + m₁(m₁² − 1)/12 + m₂(m₂² − 1)/12 + · · · } / ( n(n² − 1) )
AF
Problem 10.5. Based on the following data, find the rank correlation between
marks of English and Mathematics courses.

English 56 75 45 71 61 64 58 80 76 61
Maths 70 70 40 60 65 56 59 70 67 80

Solution
The procedure for ranking these scores is as follows:
English (mark)   Maths (mark)   Rank (English)   Rank (Maths)   di     di²
56               70             9                3              6      36
75               70             3                3              0      0
45               40             10               10             0      0
71               60             4                7              −3     9
61               65             6.5              6              0.5    0.25
64               56             5                9              −4     16
58               59             8                8              0      0
80               70             1                3              −2     4
76               67             2                5              −3     9
61               80             6.5              1              5.5    30.25


• The mark 61 is repeated 2 times in series X (English), and hence m₁ = 2.

• In series Y (Maths), the mark 70 occurs 3 times, and hence m₂ = 3.

So the rank correlation is

rR = 1 − 6{ 104.5 + 2(2² − 1)/12 + 3(3² − 1)/12 } / ( 10(10² − 1) )
   = 1 − (6 × 107)/990
   = 0.3515

10.7.2 Applications of Rank Correlation
• Non-Linear Relationships: Rank correlation is useful when the re-
lationship between variables is not linear but still monotonic (e.g., one
variable consistently increases as the other does, but not necessarily in
a straight line).
• Ordinal Data: It is appropriate for ordinal data, where the values
represent rankings or ordered categories (e.g., customer satisfaction
ratings).

• Handling Outliers: Since rank correlation relies on the order of val-


ues rather than their magnitude, it is less sensitive to outliers than
Pearson’s correlation.

10.7.3 Python Code: Rank Correlation


In Python, you can find rank correlation using the scipy.stats module, which
provides a function called spearmanr() to calculate the Spearman rank correla-
tion coefficient. Here’s how you can do it:
import scipy.stats as stats

# Given data
english_marks = [56, 75, 45, 71, 61, 64, 58, 80, 76, 61]
maths_marks = [70, 70, 40, 60, 65, 56, 59, 70, 67, 80]

# Calculate the Spearman rank correlation coefficient
spearman_corr, p_value = stats.spearmanr(english_marks, maths_marks)

# Output the results
print(f"Spearman Rank Correlation Coefficient: {spearman_corr:.4f}")
print(f"P-value: {p_value:.4f}")


If you have two columns of data in a pandas DataFrame, you can calculate
the rank correlation directly from the DataFrame. Here’s an example:
# From a DataFrame
import pandas as pd

# Example DataFrame (you should replace this with your actual data)
data = pd.DataFrame({
    'english_marks': [56, 75, 45, 71, 61, 64, 58, 80, 76, 61],
    'maths_marks': [70, 70, 40, 60, 65, 56, 59, 70, 67, 80]
})

# Calculate the Spearman rank correlation coefficient;
# data.corr() returns a correlation matrix, so take the off-diagonal
# entry. Note that this approach does not provide a p-value.
rho = data.corr(method='spearman').iloc[0, 1]

print("Spearman rank correlation coefficient:", rho)
13 print ( "p - value : " , p_value )
14

15
AF
10.7.4 Kendall Tau Correlation Coefficient
Kendall’s tau is another non-parametric measure of rank correlation that eval-
uates the ordinal association between two variables.

Calculation of Kendall Tau


To calculate the Kendall Tau correlation coefficient:

1. Compare each pair of observations in terms of their ranks.

2. Determine whether they have the same order (concordant) or opposite
   order (discordant) for both variables.

3. Count the total number of concordant pairs (nc) and discordant pairs
   (nd).

4. Calculate the Kendall Tau coefficient using the formula:

   τ = (nc − nd) / ( ½ n(n − 1) )

5. It ranges from −1 to 1.

Interpretation:
• τ = 1: Perfect agreement between the rankings.


• τ = −1: Perfect disagreement between the rankings.

• τ = 0: No association.

• Concordant Pairs:
In the context of correlation coefficients such as Kendall Tau, con-
cordant pairs refer to pairs of observations where the ranks for
both variables follow the same order. In other words, if (Xi , Yi )
and (Xj , Yj ) are two pairs of observations, they are considered
concordant if both Xi < Xj and Yi < Yj or if both Xi > Xj and
Yi > Yj .

• Discordant Pairs: Discordant pairs, on the other hand, refer
to pairs of observations where the ranks for the variables have
opposite orders. In other words, if (Xi , Yi ) and (Xj , Yj ) are two
pairs of observations, they are considered discordant if Xi < Xj
and Yi > Yj or if Xi > Xj and Yi < Yj .
In the context of calculating correlation coefficients like Kendall Tau, un-
derstanding concordant and discordant pairs is crucial as they form the basis
for determining the strength and direction of association between two variables
based on their ranks.
Problem 10.6. Suppose we have the following data on two variables, X and
Y , with their corresponding ranks:

Observation   X     Y
1             10    15
2             15    10
3             20    20
4             25    25
5             30    40
Calculate the Kendall Tau correlation coefficient.

Concordant and Discordant Pairs


To find the number of concordant and discordant pairs in the given data, we
need to compare each pair of observations in terms of their ranks and determine
whether they are concordant or discordant. Let’s analyze each pair:

1. (10, 15) and (15, 10): Discordant


2. (10, 15) and (20, 20): Concordant


3. (10, 15) and (25, 25): Concordant
4. (10, 15) and (30, 40): Concordant
5. (15, 10) and (20, 20): Concordant
6. (15, 10) and (25, 25): Concordant
7. (15, 10) and (30, 40): Concordant
8. (20, 20) and (25, 25): Concordant

9. (20, 20) and (30, 40): Concordant
10. (25, 25) and (30, 40): Concordant
So, out of the 10 pairs of observations, there are 9 concordant pairs and 1
discordant pair.
• Number of concordant pairs (nc): 9

• Number of discordant pairs (nd): 1

• Total number of pairs: 5(5 − 1)/2 = 10

• Kendall Tau coefficient (τ):

  τ = (nc − nd) / ( ½ n(n − 1) ) = (9 − 1) / ( ½ × 5 × (5 − 1) ) = 8/10 = 0.8

The Kendall Tau correlation coefficient for the given data is τ = 0.8.
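The pair-by-pair comparison above can be automated. A small sketch that
counts concordant and discordant pairs directly and applies the formula (the
scipy-based code in Section 10.7.6 gives the same answer for this tie-free data):

from itertools import combinations

X = [10, 15, 20, 25, 30]
Y = [15, 10, 20, 25, 40]

nc = nd = 0
for i, j in combinations(range(len(X)), 2):
    # The product is positive for concordant pairs (same ordering in
    # X and Y) and negative for discordant pairs (opposite ordering)
    sign = (X[i] - X[j]) * (Y[i] - Y[j])
    if sign > 0:
        nc += 1
    elif sign < 0:
        nd += 1

n = len(X)
tau = (nc - nd) / (n * (n - 1) / 2)
print(nc, nd, tau)  # expected: 9 1 0.8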
DR

10.7.5 Advantages and Disadvantages


Rank correlation methods, such as Spearman’s rank correlation coefficient and
Kendall’s tau are especially useful when dealing with ordinal data or when the
assumptions of parametric methods (like Pearson’s correlation) are not met.
Here are some key advantages and disadvantages of rank correlation:
• Advantages:
■ Robust to Outliers: Rank correlation is not affected by ex-
treme values.
■ Non-parametric: No assumptions about the distribution of
the data.
■ Suitable for Ordinal Data: Can be used when data are ranked
or ordinal.

418
CHAPTER 10. CORRELATION AND REGRESSION ANALYSIS

• Disadvantages:
■ Less Sensitive: Might not capture the strength of the relation-
ship as well as Pearson’s correlation in the presence of linear
relationships.
■ Reduced Power: May have less statistical power compared to
parametric tests when assumptions for those tests are met.

10.7.6 Python Code: Kendall Tau


import scipy.stats as stats

# Given data
X = [10, 15, 20, 25, 30]
Y = [15, 10, 20, 25, 40]

# Calculate the Kendall Tau correlation coefficient
tau, p_value = stats.kendalltau(X, Y)

# Output the results
print(f"Kendall Tau correlation coefficient: {tau:.4f}")
print(f"P-value: {p_value:.4f}")

10.7.7 Exercises
1. Define a scatter diagram. What is its primary purpose in data analy-
sis, and how can it help in understanding the relationship between two
variables?

2. Describe how you would interpret a scatter diagram that shows a perfect
positive linear relationship between two variables. What characteristics
would you expect to see in the plot?
3. Explain how a scatter diagram can be used to detect non-linear relation-
ships between variables. Provide examples of different types of non-linear
relationships that might be observed.

4. What are some limitations of using a scatter diagram for analyzing rela-
tionships between variables? How can these limitations affect the inter-
pretation of the data?
5. Define covariance. What does it measure, and how is it different from
correlation?

6. What is meant by correlation analysis? How do you interpret the value
   of the correlation coefficient?


7. Define Pearson’s correlation coefficient. What does it measure, and how


is it calculated?
8. Describe the range of values that Pearson’s correlation coefficient can
take and what each value signifies about the relationship between two
variables.

9. Explain the concept of linearity in relation to Pearson’s correlation coef-


ficient. Why is Pearson’s correlation coefficient not appropriate for non-
linear relationships?
10. Discuss the assumptions underlying Pearson’s correlation coefficient. What
    assumptions need to be met for Pearson’s correlation coefficient to provide
    a valid measure of the strength and direction of the relationship?
11. How does Pearson’s correlation coefficient handle outliers in the data?
What impact can outliers have on the correlation coefficient, and how
might this affect data analysis?
12. What is the purpose of using rank correlation methods, such as Spear-
man’s and Kendall’s Tau, instead of Pearson’s correlation coefficient?
13. Explain the differences between Spearman rank correlation and Kendall
Tau correlation. In what situations might one method be preferred over
the other?

14. Discuss the advantages and limitations of using rank-based methods for
measuring the association between two variables.
15. Define Spearman’s rank correlation coefficient. How is it computed, and
what does it measure?

16. Describe how tied ranks are handled in the computation of Spearman’s
rank correlation coefficient. What impact do ties have on the correlation
value?
17. Define Kendall’s Tau correlation coefficient. How does it differ from
Spearman’s rank correlation in terms of interpretation and calculation?

18. Explain the concepts of concordant and discordant pairs in the context of
Kendall’s Tau. How are they used to calculate the correlation coefficient?
19. Kendall’s Tau is often considered more robust than Spearman’s rank cor-
relation in the presence of tied ranks. Discuss why this is the case and
how Kendall’s Tau adjusts for ties.

20. Given the following pairs of data representing the number of hours studied
(X) and the scores obtained (Y) by 10 students in an exam:


Student Hours Studied (X) Exam Score (Y)


1 2 50
2 3 55
3 5 60
4 6 65
5 8 70
6 9 75

T
7 10 80
8 12 85
9 14 90
10 15 95
AF (i). Plot the scatter diagram for this data. Describe the relationship
between the hours studied and the exam scores based on the
scatter plot.
(ii). Calculate the covariance between the number of hours studied
(X) and the exam scores (Y).
(iii). Interpret the sign and magnitude of the covariance. What does
it tell you about the relationship between the two variables?
(iv). Calculate the Pearson correlation coefficient between the hours
studied and exam scores.

(v). Interpret the value of the correlation coefficient. What does it


indicate about the strength and direction of the relationship be-
tween the two variables?
21. The owner of Maumee Ford-Volvo wants to study the relationship between
the age of a car and its selling price. Listed below is a random sample of
12 used cars sold at the dealership during the last year.


Car    Age (years)   Selling Price ($000)
1      9             8.1
2      7             6.0
3      11            3.6
4      12            4.0
5      8             5.0
6      7             10.0
7      8             7.6
8      11            8.0
9      10            8.0
10     12            6.0
11     6             8.6
12     6             8.0

(i). Draw a scatter diagram. Interpret the plot.


(ii). Determine the correlation coefficient.
(iii). Interpret the correlation coefficient. Does it surprise you that the
correlation coefficient is negative?
(iv). Test whether this linear relationship is significant or not.

22. Consider the following data on the ranks of two variables, X and Y :

Observation Rank of X Rank of Y


1 2 3
2 1 2
3 4 1
4 3 4

(a) Calculate the Spearman rank correlation coefficient for the above
data.
(b) Interpret the result in the context of the relationship between X and
Y.


23. Suppose we have the following ranks for two variables A and B:

Observation Rank of A Rank of B


1 1 2
2 2 1
3 3 3
4 4 5
5 5 4

(a) Calculate the Kendall Tau correlation coefficient for the data pro-
vided.
(b) Discuss the strength and direction of the relationship between A and
B based on your result.
24. You are provided with the following data on two variables, P and Q.
Compute both the Spearman rank correlation and Kendall Tau correlation
coefficients:
Observation P Q
1 85 92
2 78 85
3 92 88
4 70 76
5 88 90

(a) Compare the results from both rank correlation methods. Explain
any similarities or differences observed.

10.8 Regression Analysis


Imagine you are a data scientist working for a real estate company. Your task
is to predict the price of a house based on various features such as the number
of bedrooms, square footage, and location. You have a dataset with historical
sales data, and you want to use this data to build a model that can predict the
price of a house given its features. This is where regression analysis comes into
play.

The drawback of correlation analysis is that it only measures the strength


and direction of the (linear) relationship between two variables. It does not
provide information about causality or the nature of the relationship be-
yond linearity. Additionally, correlation coefficients can be affected by outliers


and may not capture complex relationships that exist between variables.

The motivation for regression analysis stems from the need to understand
and model the relationship between variables more comprehensively. Regression
analysis allows us to not only measure the direction and significant effect of
the relationship but also to make predictions and infer causality, provided
certain assumptions are met. By fitting a regression model, we can examine
how changes in one variable are associated with changes in another variable
while controlling for potential confounding factors.
• Dependent Variable: The outcome or response variable that is
  being predicted or explained in an analysis.

• Independent Variable: The predictor or explanatory variable
  that is used to predict or explain changes in the dependent vari-
  able.
Regression analysis enables deeper insights into the underlying mechanisms
driving the relationship between variables.
Regression analysis: Regression analysis is a set of statistical methods
used to estimate relationships between a dependent variable (also known
as the response or target variable) and one or more independent variables
(also known as predictors, features, or explanatory variables), allowing
for predictions and the determination of the strength and nature of these
relationships. This relationship can be expressed as

y = g(x1 , x2 , . . . , xp ) + e (10.2)

where y represents the dependent variable, x1 , x2 , . . . , xp are the indepen-


dent variables, and e denotes the error term.

In model (10.2), the function g(x1 , x2 , . . . , xp ) represents the systematic part


of the model. It describes the relationship between the predictor variables
x1 , x2 , . . . , xp and the response variable y. The specific form of g depends on
the type of regression model. For example, in linear regression, g is typically
a linear combination of the predictors:
g(x1 , x2 , . . . , xp ) = β0 + β1 x1 + β2 x2 + · · · + βp xp .
For instance, banks might use regression analysis to evaluate the risk asso-
ciated with home-loan applicants. In this context, independent variables such
as the applicant’s age, income, expenses, occupation, number of dependents,
and total credit are used to predict the likelihood of loan repayment, thereby
aiding in risk assessment and decision-making processes.

The main goal is to understand how changes in the independent variables


affect the dependent variable, and to use this understanding to make predic-
tions.


10.8.1 Types of Regression Analysis


In a broad sense, there are two types of regression models: (i) parametric regres-
sion models and (ii) nonparametric regression models. Parametric regression
involves assuming a specific functional form for the relationship between the de-
pendent variable and the independent variables. The model is defined by a finite
number of parameters. The most common example is linear regression, where
the relationship between the variables is assumed to be linear. Non-parametric
regression does not assume a specific functional form for the relationship be-
tween the dependent and independent variables. Instead, the nonparametric
regression model is more flexible and can adapt to the data’s structure. We
will not discuss nonparametric regression models; instead, we will focus solely

T
on parametric regression models.

There are various types of regression models depending on the nature of the
response variable. Some of them are mentioned below:

• for continuous response variable


■ Simple Linear Regression
■ Multiple Regression
■ Polynomial Regression
■ Multivariate Regression, etc.

• for categorical response variable


■ Logistic Regression
▶ binomial (also called ordinary) logistic regression

▶ multinomial logistic regression


▶ ordinal logistic regression
▶ alternating logistic regressions, etc.

• for discrete response variable

■ Poisson Regression
■ Negative Binomial Regression, etc.

• for survival-time (time-to-event) outcomes

■ Cox Regression (or proportional hazards regression)


■ Accelerated Failure Time Model (AFT model), etc.


10.8.2 Simple Regression Model


To explore the relationship between the average value of the dependent variable
Y given each value of the independent variable X, we consider the following
population regression function:

E(Y | X = x) = β0 + β1 x (10.3)

where
• E(Y | X = x) represents the expected (or average) value of Y when X
is equal to x.

T
• β0 and β1 are the parameters of the regression function, with β0 being
the intercept and β1 the slope of the regression line.

[Figure: three panels plotting E(Y) against x. Panel A: Positive Linear Relationship (regression line with intercept β0 and positive slope β1). Panel B: Negative Linear Relationship (intercept β0 and negative slope β1). Panel C: No Relationship (intercept β0, slope β1 is 0).]


Suppose xi is a given value of X for the i-th observation. Then the popula-
tion regression function (PRF) can be written as

E(Y | X = xi ) = β0 + β1 xi (10.4)

where E(Y | X = xi ) represents the expected value of Y given that X takes the
value xi . For a given value x = xi , the observed value yi for Y can be expressed
as:

yi = E(Y | X = xi ) + ei
= β0 + β1 xi + ei (10.5)

where ei is the error term for the i-th observation, representing the deviation
of the observed value yi from the expected value E(Y | X = xi ). The model
given by Equation (10.5) is known as the simple linear regression model.
This model is used to estimate the population regression function described in
Equation (10.4).
The classical simple linear regression (CSLR) model for i-th obser-
vation is

yi = β0 + β1 xi + ei ;   i = 1, 2, . . . , n        (10.6)

where:

• yi is the dependent variable (response) for observation i.

• xi is the independent variable (predictor) for observation i.

• β0 is the intercept of the regression line.



• β1 is the slope of the regression line.

• ei is the error term (residual) for observation i, representing the


difference between the observed and predicted values.

Note that the model (10.6) is a special case of (10.2) when g is a linear
function and p = 1.

A nonintercept model in regression analysis is a model where the regres-


sion line or hyperplane is forced to pass through the origin (the point where all
variables are zero). This means the intercept term is explicitly set to zero and
is not estimated from the data. The model equation simplifies to:

y = β1 x1 + β2 x2 + · · · + βp xp + ϵ

where y is the dependent variable, xi are the independent variables, βi are the
coefficients, and ϵ is the error term. There is no constant term (β0 ).


For instance, in a manufacturing process, if the number of products made


depends directly on the amount of raw material used, and no products can be
produced without any raw material, a non-intercept model is appropriate. The
model would be
Products = β1 × Raw Material + e
where β1 estimates the production rate per unit of raw material. This model
is suitable when there’s a clear theoretical reason to exclude the intercept, but
caution is needed if a baseline effect exists.
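Such a through-the-origin fit is straightforward with numpy; a minimal sketch
with hypothetical raw-material data (the variable names and values here are
illustrative only):

import numpy as np

# Hypothetical data: raw material used (kg) and products made
raw_material = np.array([10.0, 20.0, 30.0, 40.0])
products = np.array([52.0, 99.0, 153.0, 198.0])

# Least squares through the origin: supply only the x column,
# with no constant column, so no intercept is estimated
X = raw_material.reshape(-1, 1)
beta1, residuals, rank, sv = np.linalg.lstsq(X, products, rcond=None)
print(beta1[0])  # estimated production rate per unit of raw material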

10.8.3 Assumptions of the CSLR Model (10.6)

The assumptions of the CSLR model given in Equation (10.6) are as follows:
1. Linearity: The relationship between X and Y must be linear.

2. Independence of Errors: The errors should be independent of each


other. More precisely, there should be no correlation between the errors
and the independent variable X.
3. Zero Mean: For any fixed value of xi , the mean of the errors (residuals)
is zero. That is, E(ei | xi ) = 0.
4. Homoscedasticity: The variance of the errors (residuals) is constant
across all levels of the independent variable X. That is, for any fixed
value of xi , Var(ei | xi ) = σ 2 .
5. Normality: The errors are normally distributed for any fixed value of
xi .

Figure 10.6: Conditional distribution of the disturbances ei


Under these assumptions, we can write ei ∼ iid N(0, σ²), and hence, yi | xi ∼
N(β0 + β1 xi, σ²). Therefore, this model is also known as a normal linear
regression model. The graphical representation of the simple linear regression
model given in Equation (10.6) under these assumptions is presented in Figure
10.6.
Remark 10.8.1. Normality of the errors (or residuals) is not strictly required
for estimation. However, the normality assumption in Equation (10.6) is
necessary to perform hypothesis tests concerning the regression parameters,
as discussed next.
The estimated model (or estimated line) for Equation (10.6) can be
written as

ŷi = β̂0 + β̂1 xi

where

• ŷi is the estimator of E(Y | X = xi) based on the sample data.
• β̂0 is the estimator of β0.
• β̂1 is the estimator of β1.

Therefore, in terms of the sample regression function (SRF), the observed
value yi can be expressed as

yi = ŷi + êi

where êi is the residual, representing the deviation of the observed value yi
from the estimated value ŷi.

10.8.4 Ordinary Least Squares (OLS) Estimation


Ordinary Least Squares (OLS) is a method for estimating the parameters in
a linear regression model. The basic idea behind OLS is to find the best line
(or hyperplane in higher dimensions) that minimizes the sum of the squared
differences (error) between the observed values y and the predicted values of y
from the linear model.

Under the assumptions given in Section 10.8.3 for the simple linear regression
model provided in Equation (10.6), the OLS estimators for the parameters β0
and β1 are found by minimizing the sum of squared errors. Thus, the objective
function for OLS estimation is:
Minimize Q(β0, β1) = Σᵢ ei² = Σᵢ (yi − (β0 + β1 xi))²

where the objective function Q(β0, β1) = Σᵢ ei² is the sum of squared errors.


OLS Estimators
The OLS estimators for β0 and β1 can be derived by taking the partial deriva-
tives of Q(β0 , β1 ) with respect to β0 and β1 , setting them to zero, and solving
for the coefficients. The OLS estimators are:
β̂1 = ( Σᵢ xi yi − n x̄ ȳ ) / ( Σᵢ xi² − n x̄² )
β̂0 = ȳ − β̂1 x̄
where:

• β̂1 is the estimated slope of the regression line.
• β̂0 is the estimated intercept of the regression line.

• x̄ and ȳ are the sample means of the independent variable X and the
dependent variable Y , respectively.
Procedure for Calculating the OLS Estimators in a Simple
Linear Regression Model
For the simple linear regression model given in Equation (10.6), the OLS esti-
mators can be found using the following steps:
1. Objective Function:
Q(β0, β1) = Σᵢ ei² = Σᵢ (yi − β0 − β1 xi)²

2. Find the Estimators: The estimators for β0 and β1 are obtained by
   solving the following normal equations:

   ∂Q/∂β0 = −2 Σᵢ (yi − β0 − β1 xi) = 0        (10.7)

   ∂Q/∂β1 = −2 Σᵢ (yi − β0 − β1 xi) xi = 0     (10.8)

   From Equation (10.7), we get:

   Σᵢ (yi − β0 − β1 xi) = 0
   ⇒ Σᵢ yi − n β0 − β1 Σᵢ xi = 0
   ⇒ β0 = ȳ − β1 x̄


From Equation (10.8), we get:

Σᵢ (yi − β0 − β1 xi) xi = 0
⇒ Σᵢ xi yi − β0 Σᵢ xi − β1 Σᵢ xi² = 0        (10.9)

Substituting β0 = ȳ − β1 x̄ into Equation (10.9):

Σᵢ xi yi − (ȳ − β1 x̄) Σᵢ xi − β1 Σᵢ xi² = 0
⇒ Σᵢ xi yi − ȳ Σᵢ xi + β1 x̄ Σᵢ xi − β1 Σᵢ xi² = 0
⇒ ( Σᵢ xi yi − n x̄ ȳ ) − β1 ( Σᵢ xi² − n x̄² ) = 0
⇒ β1 = ( Σᵢ xi yi − n x̄ ȳ ) / ( Σᵢ xi² − n x̄² )

Hence, the Ordinary Least Squares (OLS) estimators are:

β̂1 = ( Σᵢ xi yi − n x̄ ȳ ) / ( Σᵢ xi² − n x̄² )
β̂0 = ȳ − β̂1 x̄

The estimator β̂1 can also be expressed as:

β̂1 = Σᵢ (xi − x̄)(yi − ȳ) / Σᵢ (xi − x̄)² = rxy (sy / sx)

where rxy is the correlation coefficient between X and Y, and sx and sy are the
sample standard deviations of X and Y, respectively.
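These closed-form formulas translate directly into a few lines of Python; a
minimal sketch, illustrated with the data of Problem 10.3:

import numpy as np

def ols_simple(x, y):
    """Closed-form OLS estimates for the model y = b0 + b1*x + e."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    b1 = (np.sum(x * y) - n * x.mean() * y.mean()) / \
         (np.sum(x**2) - n * x.mean()**2)
    b0 = y.mean() - b1 * x.mean()
    return b0, b1

b0, b1 = ols_simple([4, 5, 3, 6, 10], [4, 6, 5, 7, 7])
print(b0, b1)  # intercept about 3.767, slope about 0.363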

10.8.5 Interpretation of Regression Coefficients


The estimated regression model (or equation) is

ŷi = β̂0 + β̂1 xi

where β̂0 and β̂1 are, respectively, the estimators of β0 and β1.


• Intercept (β̂0): The intercept represents the expected value of the
  dependent variable Y when the realized value of the independent
  variable X is zero. In other words, it is the point where the regression
  line crosses the y-axis.


• Slope of the line (β̂1):

  ■ it shows the amount of change in ŷ for a change of one unit in X
  ■ a positive value for β̂1 indicates a direct relationship between the
    two variables and a negative value indicates an inverse relationship
  ■ the sign of β̂1 and the sign of rxy, the correlation coefficient, are
    always the same
Important properties related to correlation and regression coefficients are
stated in Theorem 10.1.

Theorem 10.1. Let β̂xy be the slope of the regression of y on x and β̂yx be the
slope of the regression of x on y. The estimated regression lines are:

• y = β̂xy x + bxy (regression of y on x)
• x = β̂yx y + byx (regression of x on y)

Then the geometric mean of β̂xy and β̂yx is equal to the absolute value of the
Pearson correlation coefficient r.
Proof. The slope β̂xy of the regression of y on x is given by:

β̂xy = r · (sy / sx)

where r is the Pearson correlation coefficient, sy is the standard deviation of y,
and sx is the standard deviation of x.

The slope β̂yx of the regression of x on y is given by:

β̂yx = r · (sx / sy)

To find the geometric mean of β̂xy and β̂yx, we compute:

Geometric Mean = √( β̂xy · β̂yx )

Substituting the expressions for β̂xy and β̂yx:

Geometric Mean = √( (r · sy/sx) · (r · sx/sy) ) = √(r²) = |r|

Thus, the geometric mean of the regression coefficients β̂xy and β̂yx is the
absolute value of the Pearson correlation coefficient r.


10.8.6 The Estimated Error Variance or Standard Error


The error variance or standard error of estimate measures the scatter, or dis-
persion, of the observed values around the line of regression. The formulas used
to compute the error variance and the standard error are:

σ̂² = Σᵢ (yi − ŷi)² / (n − 2) = ( Σᵢ yi² − β̂0 Σᵢ yi − β̂1 Σᵢ xi yi ) / (n − 2)

where ŷi = β̂0 + β̂1 xi. Hence, the standard error of estimate is

σ̂ = √( Σᵢ (yi − ŷi)² / (n − 2) ) = √( ( Σᵢ yi² − β̂0 Σᵢ yi − β̂1 Σᵢ xi yi ) / (n − 2) )

Problem 10.7 (Shorshe Ilish Restaurant Sales Dataset). Suppose data were
collected from a sample of 10 semesters for a restaurant located near the
university campus. For the i-th semester in the sample, xi is the size of the
student population (in thousands) and yi is the quarterly sales (in thousands of
dollars).

Table 10.6: Student Population and Quarterly Sales Data for Different
Semesters

Semester i   Student Population (1000s) xi   Quarterly Sales ($1000s) yi
1            2                               58
2            6                               105
3            8                               88
4            8                               118
5            12                              117
6            16                              137
7            20                              157
8            20                              169
9            22                              149
10           26                              202


(i). Show the relationship between the size of student population and the
quarterly sales. Make a comment on the diagram.

(ii). Write down the regression model for this example, and mention the
assumptions of the model.

(iii). Find the least square estimates and write the estimated regression model.
Interpret the results.

(iv). Draw the regression line on the scatter diagram.

(v). Predict quarterly sales for a restaurant to be located near a campus


with 16,000 students.

(vi). Find the value of the standard error of the estimates.

Solution
(i). The scatter plot in Figure 10.7 shows the relationship between student
population (in thousands) and quarterly sales (in thousands of dollars) based
on data from ten semesters.

[Figure: scatter plot of Student Population (1000s) on the X-axis against Quarterly Sales ($1000s) on the Y-axis]

Figure 10.7: Scatter Plot of Student Population vs. Quarterly Sales

Observations:
• There appears to be a positive correlation between student population
and quarterly sales. As the student population increases, the quarterly
sales also tend to increase.


• The data points are somewhat clustered along a line, suggesting a linear
relationship.

• Some points, such as the one where the student population is 8, show
variability in sales, indicating that factors other than student popula-
tion might also influence sales.

(ii). Simple regression model:

yi = β0 + β1 xi + ei ; i = 1, 2, . . . , 10 (10.10)

where

• yi = Quarterly Sales ($1000s)

• xi = Student Population (1000s)

• β0 is the intercept

• β1 is the slope coefficient

• ei is the error term, and we assume ei is normally distributed with
  mean 0 and constant variance (say, σ²).
Assumptions:
1. Linearity: The relationship between the dependent and independent
variables is linear.
2. Independence: The observations are independent of each other.
3. Homoscedasticity: The variance of the error terms is constant across
all levels of the independent variable.

4. Zero Mean: For any fixed value of xi , the mean of the errors (residuals)
is zero.
5. Normality of Errors: The error terms are normally distributed (impor-
tant for inference, but not for estimation).

(iii). The estimated regression model (or equation) of (10.10) is

ŷi = β̂0 + β̂1 xi

where β̂0 and β̂1 are, respectively, the estimators of β0 and β1. The ordinary
least squares (OLS) estimators of β0 and β1 are

β̂1 = ( Σᵢ xi yi − n x̄ ȳ ) / ( Σᵢ xi² − n x̄² );   β̂0 = ȳ − β̂1 x̄.


The Least Squares Estimates. The realized values of the OLS estimators
are

β̂1 = ( Σᵢ xi yi − n x̄ ȳ ) / ( Σᵢ xi² − n x̄² ) = 2840/568 = 5

β̂0 = ȳ − β̂1 x̄ = 130 − 5(14) = 60

based on the following computations:

xi      yi      xi²     xi·yi
2       58      4       116
6       105     36      630
8       88      64      704
8       118     64      944
12      117     144     1404
16      137     256     2192
20      157     400     3140
20      169     400     3380
22      149     484     3278
26      202     676     5252
Total   140     1300    2528    21040

Thus, the estimated regression equation is


ŷi = 60 + 5xi

The slope of the estimated regression equation (β̂1 = 5) is positive, implying


that as student population increases, sales increase. In fact, we can conclude
that an increase in the student population of 1000 is associated with an increase
of $5000 in expected sales; that is, quarterly sales are expected to increase by
$5 per student.

(iv). The regression line with the scatter plot is depicted in Figure 10.8.


[Figure: the same scatter plot with the fitted regression line ŷi = 60 + 5xi drawn through the points]

Figure 10.8: Scatter Plot of Student Population vs. Quarterly Sales with Re-
gression Line

(v). To predict quarterly sales for a restaurant to be located near a campus


with 16,000 students, we would compute

ŷ = 60 + 5 × 16 = 140

Hence, we would predict quarterly sales of $140,000 for this restaurant.



xi     yi     ŷi = 60 + 5xi   (yi − ŷi)²
2      58     70              144
6      105    90              225
8      88     100             144
8      118    100             324
12     117    120             9
16     137    140             9
20     157    160             9
20     169    160             81
22     149    170             441
26     202    190             144
140    1300   1300            1530
AF
(vi). The Standard Error of the Estimates

σ̂² = Σᵢ (yi − ŷi)² / (n − 2) = 1530 / (10 − 2) = 191.25  and  σ̂ = √191.25 = 13.83

Hence the standard error of the estimates is σ̂ = 13.83.
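The whole analysis can be reproduced with the statsmodels library; a sketch
for the data of Problem 10.7:

import numpy as np
import statsmodels.api as sm

population = np.array([2, 6, 8, 8, 12, 16, 20, 20, 22, 26])
sales = np.array([58, 105, 88, 118, 117, 137, 157, 169, 149, 202])

X = sm.add_constant(population)   # adds the intercept column
model = sm.OLS(sales, X).fit()

print(model.params)               # expected: [60.  5.]
print(model.predict([[1, 16]]))   # expected: [140.]
print(np.sqrt(model.mse_resid))   # standard error of estimate, about 13.83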

10.8.7 Coefficient of Determination

The coefficient of determination is the proportion of the total variation in the dependent variable Y that is explained by the independent variable(s) in a regression model. It is denoted by R² and defined by

R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}

where

• the total sum of squares (proportional to the variance of the data): SST = \sum_i (y_i - \bar{y})^2

• the sum of squares of residuals, also called the Residual or Error Sum of Squares (SSE): SSE = \sum_i (y_i - \hat{y}_i)^2 = \sum_i \hat{e}_i^2

For the Shorshe Ilish restaurant Sales Dataset given in Problem 10.7,
we have


xi      yi      (yi − ȳ)²   ŷi = 60 + 5xi   (yi − ŷi)²
2       58      5184        70              144
6       105     625         90              225
8       88      1764        100             144
8       118     144         100             324
12      117     169         120             9
16      137     49          140             9
20      157     729         160             9
20      169     1521        160             81
22      149     361         170             441
26      202     5184        190             144
140     1300    15730       1300            1530
hence, R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} = 1 - \frac{1530}{15730} = 0.9027

The R² = 0.9027 implies that 90.27% of the variability of the dependent variable is explained by the regression model, and the remaining 9.73% of the variability is still unexplained.
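
The same arithmetic can be verified in a few lines of NumPy; this is a minimal sketch using the fitted equation ŷi = 60 + 5xi from above.

import numpy as np

x = np.array([2, 6, 8, 8, 12, 16, 20, 20, 22, 26])
y = np.array([58, 105, 88, 118, 117, 137, 157, 169, 149, 202])

y_hat = 60 + 5 * x                    # fitted values from the estimated equation
sse = np.sum((y - y_hat) ** 2)        # residual (error) sum of squares
sst = np.sum((y - y.mean()) ** 2)     # total sum of squares
r2 = 1 - sse / sst
print(sse, sst, round(r2, 4))         # expected: 1530, 15730, 0.9027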

10.8.8 Relationship between R² and rxy

The relationship between R² and rxy is that the square of the coefficient of correlation (rxy) is equal to the coefficient of determination (R²) for the simple regression model. Mathematically,

R^2 = (r_{xy})^2

Hence, for the previous Problem 10.3,

R^2 = (r_{xy})^2 = (0.7522)^2 = 0.566

The estimated regression equation for the simple linear regression model is

ŷi = β̂0 + β̂1 xi

The sample correlation coefficient is

r_{xy} = (\text{sign of } \hat{\beta}_1)\,\sqrt{\text{coefficient of determination}} = (\text{sign of } \hat{\beta}_1)\,\sqrt{R^2}    (10.11)

where R² is the coefficient of determination for the simple regression model yi = β0 + β1 xi + ei ; i = 1, 2, . . . , n.

Remarks: Note that the relationship R² = (rxy)² only holds for the simple linear regression model.

10.8.9 Advantages and Disadvantages of R2
Here are some advantages and disadvantages of using R2 :
• R2 is a statistic that will give some information about the goodness-
of-fit of a model.
AF• In regression, the R2 coefficient of determination is a statistical measure
of how well the regression predictions approximate the real data points.

• An R² of 1 indicates that the regression predictions perfectly fit the data.

• R² increases as we increase the number of variables in the model (R² is monotone increasing with the number of variables included, i.e., it will never decrease).

10.8.10 Adjusted R²

An adjusted R² is a modification of R² that adjusts for the number of independent variables in a model (p) relative to the number of data points (n). The adjusted R² (denoted by R̄²) is defined as

\bar{R}^2 = 1 - \left(1 - R^2\right)\frac{n-1}{n-p-1}

where p is the total number of independent variables in the model (not including the constant term), and n is the sample size. It can also be written as:

\bar{R}^2 = 1 - \frac{SSE/df_e}{SST/df_t}

where dft is the degrees of freedom n − 1 of the estimate of the population variance of the dependent variable, and dfe is the degrees of freedom n − p − 1 of the estimate of the underlying population error variance.

440
CHAPTER 10. CORRELATION AND REGRESSION ANALYSIS

The explanation of this statistic is almost the same as R², but it penalizes the statistic as extra variables are included in the model. The term (n−1)/(n−p−1) is called the penalty for using more covariates in a model. When the number of covariates p increases, (1 − R²) will decrease, but (n−1)/(n−p−1) will increase. Whether more covariates improve the explanatory power of a model depends on the trade-off between R² and the penalty.

For the previous Problem 10.3, p = 1 and R² = 0.566 and hence,

\bar{R}^2 = 1 - (1 - 0.566)\,\frac{5-1}{5-1-1} = 0.4212
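
As a small illustration, the formula can be wrapped in a helper function (adjusted_r2 below is our own name, not a library routine); the result agrees with the hand calculation up to rounding.

def adjusted_r2(r2, n, p):
    """Adjusted R-squared for a model with p predictors and n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Values from Problem 10.3 as quoted in the text (n = 5, one predictor)
print(round(adjusted_r2(0.566, n=5, p=1), 4))   # approximately 0.421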

10.8.11 Python Code: Simple Regression Analysis

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Data: Student Population and Quarterly Sales
data = {
    'Restaurant': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Student Population (1000s)': [2, 6, 8, 8, 12, 16, 20, 20, 22, 26],
    'Quarterly Sales ($1000s)': [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]
}

# Convert the data to a DataFrame
df = pd.DataFrame(data)

# (i) Show the relationship between the size of the student population
# and the quarterly sales
plt.figure(figsize=(8, 6))
plt.scatter(df['Student Population (1000s)'], df['Quarterly Sales ($1000s)'], color='blue')
plt.title('Relationship between Student Population and Quarterly Sales')
plt.xlabel('Student Population (1000s)')
plt.ylabel('Quarterly Sales ($1000s)')
plt.grid(True)
plt.show()

# (iii) Write the estimated regression equation and find the least squares estimates
X = df['Student Population (1000s)']
y = df['Quarterly Sales ($1000s)']

# Add a constant to the independent variable matrix
X = sm.add_constant(X)

# Fit the regression model
model = sm.OLS(y, X).fit()

# Print the regression results
print(model.summary())

# The estimated regression equation:
# y = model.params[0] + model.params[1] * x
# Interpret the results based on the model summary.

# (iv) Draw the regression line
plt.figure(figsize=(8, 6))
plt.scatter(df['Student Population (1000s)'], df['Quarterly Sales ($1000s)'],
            color='blue', label='Data Points')
plt.plot(df['Student Population (1000s)'], model.predict(X),
         color='red', label='Regression Line')
plt.title('Regression Line: Quarterly Sales vs. Student Population')
plt.xlabel('Student Population (1000s)')
plt.ylabel('Quarterly Sales ($1000s)')
plt.legend()
plt.grid(True)
plt.show()

# (v) Predict quarterly sales for a restaurant near a campus with 16,000 students
predicted_sales = model.predict([1, 16])[0]  # [1, 16]: the 1 is for the constant term
print(f'Predicted Quarterly Sales for 16,000 students: ${predicted_sales:.2f} (in $1000s)')

# (vi) Find the value of the standard error of the estimate
standard_error = np.sqrt(model.mse_resid)
print(f'Standard Error of the Estimate: {standard_error:.2f}')
62


Figure 10.9: Python output for the practice example.

10.8.12 Interval Estimation and Hypothesis Testing

Confidence Interval for β0

The confidence interval for β0 can be computed using the standard formula for linear regression parameter estimates. The formula for the confidence interval for β0 is:

\hat{\beta}_0 \pm t_{\alpha/2,(n-2)} \cdot se(\hat{\beta}_0)

where:

• t_{α/2,(n−2)} is the critical value of the t-distribution with n − 2 degrees of freedom at a significance level of α/2 (where α is typically 0.05 for a 95% confidence interval),

• the standard error of the estimator β̂0 is

se(\hat{\beta}_0) = \sqrt{\hat{\sigma}^2\left(\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right)}.

Confidence Interval for β1

The confidence interval for β1 can be computed using the standard formula for linear regression parameter estimates. The formula for the confidence interval for β1 is:

\hat{\beta}_1 \pm t_{\alpha/2,(n-2)} \cdot se(\hat{\beta}_1)

where:

• t_{α/2,(n−2)} is the critical value of the t-distribution with n − 2 degrees of freedom at a significance level of α/2 (where α is typically 0.05 for a 95% confidence interval),

• the standard error of the estimator β̂1 is

se(\hat{\beta}_1) = \frac{\hat{\sigma}}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}}

where

\hat{\sigma} = \sqrt{\frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n-2}}

is the estimated standard error of the residuals (or the square root of the mean squared error, often obtained from the regression output). Once you have these values, you can compute the confidence interval.
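
In practice these intervals need not be computed by hand. Assuming the statsmodels fit from Section 10.8.11, conf_int() returns the intervals for β̂0 (first row) and β̂1 (second row) at once; a minimal sketch:

import numpy as np
import statsmodels.api as sm

x = np.array([2, 6, 8, 8, 12, 16, 20, 20, 22, 26])
y = np.array([58, 105, 88, 118, 117, 137, 157, 169, 149, 202])

model = sm.OLS(y, sm.add_constant(x)).fit()

# 95% confidence intervals, computed internally as estimate ± t_{0.025, n-2} * se
print(model.conf_int(alpha=0.05))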

10.8.13 The F-tests in Simple Linear Regression Model

In a simple linear regression model, the F-test is used to assess the overall significance of the regression model. The hypotheses are

H0 : β1 = 0
H1 : β1 ≠ 0

The null hypothesis H0 implies that the independent variable does not have any effect on the dependent variable; the alternative hypothesis H1 indicates that it does. The formula for the F-statistic in a simple linear regression model is:

F = \frac{SSR/1}{SSE/(n-2)}

where:

• SSR = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 is the sum of squares due to regression (explained),

• SSE is the sum of squared error (residual) terms,

• Under the null hypothesis, the F-statistic follows an F-distribution with p and n − p − 1 degrees of freedom, where p is the number of covariates (excluding the intercept) and n is the number of observations.

• In a simple linear regression model, p = 1 because there is only one independent variable (excluding the intercept).


• So, the degrees of freedom for the F-distribution in a simple linear regression model are 1 and n − 2.

• Once you compute the F-statistic, you can compare it to the critical value from the F-distribution at a chosen significance level (e.g., α = 0.05) to determine whether to reject the null hypothesis.

• If the F-statistic is greater than the critical value, you reject the null hypothesis and conclude that the model is significant. Otherwise, you fail to reject the null hypothesis.

Decision Rule for ANOVA F-test

Classical Approach

• Set the significance level α.

• Calculate the critical value Fcritical = Fα(1, n − 2) from the F-distribution with appropriate degrees of freedom.

• Decision Rule:
  ■ If calculated F > Fcritical, reject H0 and conclude that the regression model is statistically significant.
  ■ If calculated F ≤ Fcritical, fail to reject H0 and conclude that the regression model is not statistically significant.

p-value Approach

• Calculate the p-value associated with the calculated F-statistic.

• Decision Rule:
  ■ If p-value < α, reject H0 and conclude that the regression model is statistically significant.
  ■ If p-value ≥ α, fail to reject H0 and conclude that the regression model is not statistically significant.
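
A short sketch of both decision rules with scipy.stats, using the sums of squares computed earlier for the example data (SST = 15730, SSE = 1530):

from scipy import stats

n = 10                      # sample size in the example
sst, sse = 15730, 1530      # sums of squares from the worked example
ssr = sst - sse             # regression (explained) sum of squares

F = (ssr / 1) / (sse / (n - 2))            # F-statistic with (1, n-2) df
F_critical = stats.f.ppf(0.95, 1, n - 2)   # classical approach, alpha = 0.05
p_value = stats.f.sf(F, 1, n - 2)          # p-value approach
print(F, F_critical, p_value)              # F is about 74.25 > F_critical, so reject H0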

10.8.14 The t-tests in Simple Linear Regression Model

Consider the simple linear regression model:

yi = β0 + β1 xi + ei ;  i = 1, 2, . . . , n

To test the individual significance of the independent variable, the null hypothesis is

H0 : β1 = 0

Under H0, the test statistic is given by:

t = \frac{\hat{\beta}_1 - 0}{se(\hat{\beta}_1)}

where se(\hat{\beta}_1) = \frac{\hat{\sigma}}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}} and \hat{\sigma} = \sqrt{\frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n-2}}.

We reject H0 if |t| exceeds the critical value tcritical = t_{α/2,(n−2)} from the t-distribution with n − 2 degrees of freedom, where n is the sample size.

Decision Rule for t-test

Classical Approach

• Set the significance level α.

• Calculate the critical value tcritical from the t-distribution with appropriate degrees of freedom.

• Decision Rule for β̂1:
  ■ If |t| > tcritical = t_{α/2,(n−2)}, reject H0 and conclude that the corresponding coefficient β̂1 is statistically significant.
  ■ If |t| ≤ tcritical, fail to reject H0 and conclude that the corresponding coefficient β̂1 is not statistically significant.

p-value Approach

• Calculate the p-value associated with each calculated t-statistic.

• Decision Rule for β̂1:
  ■ If p-value < α, reject H0 and conclude that the corresponding coefficient β̂1 is statistically significant.
  ■ If p-value ≥ α, fail to reject H0 and conclude that the corresponding coefficient β̂1 is not statistically significant.


Note that the p-value can be obtained by using the formula:

p-value = 2 × min(P (T < −|t|), P (T > |t|))
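
For example, the two-sided p-value can be evaluated with scipy.stats. The numbers below come from the worked example (β̂1 = 5, and se(β̂1) = 13.829/√568 ≈ 0.5803), so |t| ≈ 8.62 and t² reproduces the F-statistic above up to rounding:

from scipy import stats

n = 10
t = 5 / 0.5803                               # beta1_hat / se(beta1_hat)
p_value = 2 * stats.t.sf(abs(t), df=n - 2)   # two-sided p-value
print(t, p_value)                            # p-value is far below 0.05, so reject H0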

Confidence Interval for E(Y|X = x*)

• The Confidence Interval for E(Y|X = x*) provides a range of values where we expect the mean response Y to lie for a given value of x* with a certain level of confidence.

• It is computed as:

\hat{y}^* \pm t_{\alpha/2,(n-2)} \cdot se(\hat{y}^*)

where ŷ* is the predicted value of Y for a given x*, t_{α/2,(n−2)} is the critical value of the t-distribution, n is the number of observations, and

se(\hat{y}^*) = \hat{\sigma}\sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}

is the standard error of the predicted value.

• x* is the specific value of the independent variable for which you're predicting the response.

• The confidence level is typically chosen to be 95% (α = 0.05).

For the Shorshe Ilish restaurant Sales Dataset given in Problem 10.7, we have σ̂ = 13.829. With x* = 10, x̄ = 14, and Σ(xi − x̄)² = 568, we have

se(\hat{y}^*) = 13.829\sqrt{\frac{1}{10} + \frac{(10-14)^2}{568}} = 13.829\sqrt{0.1282} = 4.95

With ŷ* = 60 + 5(10) = 110 and a margin of error of t_{0.025,8} × se(ŷ*) = 2.306 × 4.95 = 11.4147, the 95% confidence interval for the average quarterly sales of a Shorshe Ilish restaurant located near campus at fixed x* = 10 is

110 ± 11.4147

where t_{α/2,(n−2)} = t_{0.025,8} = 2.306.

Thus, the 95% confidence interval for the mean quarterly sales, given a student population of 10,000, ranges from $98,585 to $121,415. Observe that the confidence interval for the mean value of Y widens as x* deviates further from x̄. This behavior is illustrated graphically in Figure 10.10.

Figure 10.10: Confidence and prediction intervals for sales Y at given values of student population X

Prediction Interval for an Individual Value of Y

• The Prediction Interval for an Individual Value of Y provides a range of values where we expect a new observation of Y to lie with a certain level of confidence.

• It is wider than the Confidence Interval for the mean response E(Y|X = x*) because it accounts for the variability of individual observations around the regression line.

• It is computed as:

\hat{y}^* \pm t_{\alpha/2,(n-2)} \cdot s_{pred}

where

s_{pred}^2 = \hat{\sigma}^2 + se(\hat{y}^*)^2 \quad \text{and hence} \quad s_{pred} = \hat{\sigma}\sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}

is the standard error of the prediction and ŷ* is the predicted value of Y for a given x*.

• The confidence level is typically chosen to be 95% (α = 0.05).

For the Shorshe Ilish restaurant Sales Dataset given in Problem 10.7, the estimated standard deviation corresponding to the prediction of quarterly sales for a new restaurant located near a campus with 10,000 students is computed as follows:

s_{pred} = 13.829\sqrt{1 + \frac{1}{10} + \frac{(10-14)^2}{568}} = 13.829\sqrt{1.1282} = 14.69

The 95% prediction interval for quarterly sales for the Shorshe Ilish restaurant located near campus uses t_{α/2,(n−2)} = t_{0.025,8} = 2.306. Thus, with ŷ* = 110 and a margin of error of t_{0.025,8} × s_pred = 2.306 × 14.69 = 33.875, the 95% prediction interval is

110 ± 33.875

In dollars, this prediction interval is $76,125 to $143,875.
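
Both intervals can be reproduced with statsmodels' get_prediction; this is a minimal sketch for x* = 10, and the printed bounds should match the hand calculations above up to rounding.

import numpy as np
import statsmodels.api as sm

x = np.array([2, 6, 8, 8, 12, 16, 20, 20, 22, 26])
y = np.array([58, 105, 88, 118, 117, 137, 157, 169, 149, 202])
model = sm.OLS(y, sm.add_constant(x)).fit()

pred = model.get_prediction(np.array([[1.0, 10.0]]))   # row is [constant, x*]
print(pred.conf_int(alpha=0.05))             # CI for E(Y|X=10): about (98.6, 121.4)
print(pred.conf_int(obs=True, alpha=0.05))   # prediction interval: about (76.1, 143.9)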

Confidence intervals and prediction intervals become more precise as the value of the independent variable x* approaches x̄. The typical shapes of confidence intervals and the broader prediction intervals are illustrated together in Figure 10.11.

Figure 10.11: Confidence and prediction intervals for sales Y at given values of student population X

10.8.15 Python Code: Linear Regression Model


import numpy as np
import statsmodels.api as sm

# Define the data
x = np.array([2, 6, 8, 8, 12, 16, 20, 20, 22, 26])
y = np.array([58, 105, 88, 118, 117, 137, 157, 169, 149, 202])

# Add constant for intercept
x_with_const = sm.add_constant(x)

# Fit the model
model = sm.OLS(y, x_with_const).fit()

# Print summary of the regression model
print(model.summary())

10.8.16 Exercises
1. What is meant by regression analysis? Distinguish between correlation
and regression analysis.
2. Define and describe the main types of regression analysis (e.g., simple
linear regression, multiple linear regression). Provide a real-world example
for each type and explain why that particular type of regression would be
used.
3. Match the following scenarios with the appropriate type of regression
analysis:

(a) Predicting a student’s final grade based on hours studied.


(b) Modeling the relationship between house prices and multiple features
(e.g., square footage, number of bedrooms).

(c) Estimating the effect of different levels of advertising spend on sales


with a non-linear relationship.

4. The following sample of observations were randomly selected.

No. of TV Commercials (x)   2   5   1   3   4   1   5   3   4   2
Total Sales (y)             50  57  41  54  54  38  63  48  59  46

(i). Find the linear relationship between the number of TV commercials and total sales.
(ii). Fit a model y on x.
(iii). Find the coefficient of determination. Interpret your findings.
(iv). Find the adjusted R2 . Interpret your findings.


(v). Predict (or forecast) total sales when x = 5.

5. Consider the following sample of production volumes and total cost data
for a manufacturing operation.

Production Volume (units)   Total Cost ($)
400                         4000
450                         5000
550                         5400
600                         5900
700                         6400
750                         7000
550                         5500
615                         6000
(i). Use these data to develop an estimated regression equation that could be used to predict the total cost for a given production volume. Interpret the values of the regression intercept and slope coefficients. The company's production schedule shows 500 units must be produced next month. Predict the total cost for this operation.
(ii). Compute the coefficient of determination. What percentage of the variation in total cost can be explained by production volume?

6. Given a dataset with the following statistics:


• Pearson correlation coefficient r = 0.8
• Standard deviation of x, sx = 4
• Standard deviation of y, sy = 5

Compute the following:


(a) The regression coefficient βbxy for predicting y from x.
(b) The regression coefficient βbyx for predicting x from y.
7. Consider two regression coefficients:
• βbxy = 1.5

• βbyx = 0.7


Calculate the geometric mean of these regression coefficients and verify if it equals the absolute value of the Pearson correlation coefficient r.
8. You are given:
• βbxy = 2.0

• βbyx = 0.5
• Standard deviation of x, sx = 3
• Standard deviation of y, sy = 6
Determine the Pearson correlation coefficient r using the given regression coefficients and standard deviations.
9. If βbxy = 1.2 and βbyx = 0.9, find the value of r2 and compare it with the
product of the regression coefficients.
10. For a simple linear regression model, you have the following sums of
squares:
• Total sum of squares (SST) = 300
• Regression sum of squares (SSR) = 180
Calculate the residual sum of squares (SSres ) and discuss its relation to
the total and regression sum of squares.

11. Suppose the Pearson correlation coefficient between two variables is −0.6.
If the standard deviations are sx = 5 and sy = 10, calculate:

(a) βbxy

(b) βbyx
Discuss how a negative correlation affects the regression coefficients com-
pared to a positive correlation.

12. You have the regression coefficients:


• βbxy = −0.4

• βbyx = −1.1
Calculate the geometric mean of these coefficients and confirm if it matches
the absolute value of the Pearson correlation coefficient.
13. Given the following dataset, perform a simple linear regression analysis:


Hours Studied   Exam Score
1               55
2               60
3               65
4               70
5               75

(a) Compute the regression line equation Ŷ = β0 + β1 X.


(b) Plot the data and the regression line.

14. Discuss the key assumptions of the Classical Simple Linear Regression
(CSLR) model. For each assumption, provide an example of a potential
violation and explain how it could affect the results of the regression
analysis.
15. Using the dataset from Exercise 13, perform the following:

(a) Calculate the OLS estimates for β0 and β1 .


(b) Derive the formula for the OLS estimator and apply it to find the
estimates.

16. Given the regression equation Ŷ = 2 + 3X:

(a) Interpret the meaning of the intercept (β0 ) and slope (β1 ).
(b) Explain what the coefficient values tell you about the relationship
between X and Y .

17. Calculate the estimated error variance and the standard error of the es-
timate for the dataset used in Exercise 13. Show all steps and formulas
used in your calculations.
18. For the regression analysis in Exercise 13, compute the coefficient of de-
termination (R2 ). Interpret the value of R2 in the context of the given
data.
19. Explain the relationship between the coefficient of determination R2 and
the correlation coefficient rxy . If rxy is 0.8, what is R2 and what does it
signify about the regression model?
20. List and discuss the advantages and disadvantages of using R2 as a mea-
sure of goodness-of-fit in regression analysis.
21. Given a multiple regression model with 3 predictors, calculate the adjusted
R2 if the R2 is 0.85, the sample size is 50, and the number of predictors
is 3. Interpret the result.


22. Write a Python script using pandas and statsmodels to perform a simple
linear regression analysis on a dataset of your choice. The script should:

(a) Load the dataset.


(b) Perform the regression analysis.
(c) Output the regression coefficients, standard errors, and R2 value.

23. Given the regression output where Ŷ = 1+2X, construct a 95% confidence
interval for β1 (slope) if the standard error of β1 is 0.5. Also, perform a
hypothesis test to determine if β1 is significantly different from 0.

24. For the regression analysis provided in Exercise 13, perform an ANOVA
F-test to determine if the overall regression model is significant. Explain
your decision rule and interpret the result.
25. Perform a hypothesis test to determine whether β1 is significantly different
from 0 in a regression model where the estimated β1 is 2 and the standard
error of β1 is 0.4. Use a significance level of 0.05.
26. Calculate and interpret the confidence interval for the expected value
E(Y |X = x) using the dataset from Exercise 13.

10.9 Multiple Linear Regression


In a multiple linear regression model, we extend the simple linear regression
model (10.6) to include multiple independent variables. The general form of a
multiple linear regression model is:

yi = β0 + β1 x1i + β2 x2i + . . . + βp xpi + ei (10.12)



where,
• yi is the dependent variable (response variable) for the ith observation.

• x1i , x2i , . . . , xpi are the independent variables for the ith observation.

• β0 is the intercept term.

• β1 , β2 , . . . , βp are the coefficients corresponding to the independent


variables x1i , x2i , . . . , xpi .

• ei is the error term or residual for the ith observation.


10.9.1 Model Assumptions

Under the multiple regression model (10.12), the following assumptions are made.

Assumptions
1. Linearity: The relationship between the dependent variable yi and the independent variables x1i, x2i, . . . , xpi is linear.
2. Independence of Errors: The error terms ei are independent of each other.
3. Homoscedasticity: The variance of the error terms ei is constant for all values of the independent variables.
4. Normality of Errors: The error terms ei are normally distributed with mean zero.
5. No Perfect Multicollinearity: There is no perfect linear relationship among the independent variables.
6. No Autocorrelation: The errors (ei) are not correlated with each other over time or across observations.

These assumptions are essential for valid estimation and interpretation of the classical regression model.

10.9.2 Estimation Procedure

To estimate the parameters β0, β1, . . . , βp in the multiple regression model

yi = β0 + β1 x1i + β2 x2i + . . . + βp xpi + ei

using ordinary least squares (OLS) in matrix notation, we define

\vec{Y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad
X = \begin{pmatrix} 1 & x_{11} & x_{21} & \cdots & x_{p1} \\ 1 & x_{12} & x_{22} & \cdots & x_{p2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{1n} & x_{2n} & \cdots & x_{pn} \end{pmatrix}, \quad
\vec{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}, \quad
\vec{e} = \begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{pmatrix}.

In this case, the model can be written as

\vec{Y} = X\vec{\beta} + \vec{e}

To find the estimated coefficients β̂ of β using ordinary least squares (OLS), we minimize the sum of squared errors. The vector of errors is defined as

\vec{e} = \vec{Y} - X\vec{\beta}

Calculation of Estimated Coefficients

The sum of squared residuals is given by:

SSE = \vec{e}^{\,T}\vec{e} = \left(\vec{Y} - X\vec{\beta}\right)^T \left(\vec{Y} - X\vec{\beta}\right)

To minimize SSE, we take the derivative with respect to β and set it to zero:

\frac{\partial\, SSE}{\partial \vec{\beta}} = -2X^T\left(\vec{Y} - X\vec{\beta}\right) = 0

Solving for β gives the OLS estimator of β:

\hat{\vec{\beta}} = (X^T X)^{-1} X^T \vec{Y}

This equation provides the estimated coefficients β̂ that minimize the sum of squared residuals.
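
A minimal NumPy sketch of this matrix formula; the small design matrix below is made up purely for illustration (it is not data from the text), and np.linalg.solve is used instead of forming the inverse explicitly:

import numpy as np

# Toy design matrix: intercept column plus two predictors
X = np.array([[1.0, 9.0, 16.0],
              [1.0, 13.0, 14.0],
              [1.0, 11.0, 10.0],
              [1.0, 11.0, 8.0],
              [1.0, 14.0, 11.0]])
Y = np.array([10.0, 12.0, 14.0, 16.0, 18.0])

# beta_hat solves (X'X) beta = X'Y, i.e. beta_hat = (X'X)^{-1} X'Y
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
residuals = Y - X @ beta_hat
print(beta_hat)
print(residuals)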

10.9.3 Estimation Procedure of Error Variance

To estimate the error variance σ², we use the following formula:

\hat{\sigma}^2 = \frac{1}{n-p-1}\left(\vec{Y} - X\hat{\vec{\beta}}\right)^T \left(\vec{Y} - X\hat{\vec{\beta}}\right)

where:

• β̂ is the vector of estimated coefficients obtained using ordinary least squares (OLS).

• n is the number of observations.

• p is the number of independent variables (excluding the intercept).

The term (\vec{Y} - X\hat{\vec{\beta}})^T(\vec{Y} - X\hat{\vec{\beta}}) represents the sum of squared residuals, which measures the unexplained variability in the dependent variable after accounting for the effects of the independent variables. Dividing by n − p − 1, the degrees of freedom for the error term, provides an unbiased estimate of the error variance σ².


10.9.4 Mean of the OLS Estimator

To find the mean of β̂:

\hat{\vec{\beta}} = (X^T X)^{-1} X^T \vec{Y}
= (X^T X)^{-1} X^T (X\vec{\beta} + \vec{e})
= (X^T X)^{-1} X^T X \vec{\beta} + (X^T X)^{-1} X^T \vec{e}
= \vec{\beta} + (X^T X)^{-1} X^T \vec{e}

Taking the expectation:

E[\hat{\vec{\beta}}] = \vec{\beta} + (X^T X)^{-1} X^T E[\vec{e}] = \vec{\beta} + (X^T X)^{-1} X^T \vec{0} = \vec{\beta}

So, the mean of the OLS estimator is:

E[\hat{\vec{\beta}}] = \vec{\beta}

10.9.5 Variance of the OLS Estimator

To find the variance of β̂, recall that

\hat{\vec{\beta}} = \vec{\beta} + (X^T X)^{-1} X^T \vec{e}

The covariance matrix of β̂ is given by:

Var(\hat{\vec{\beta}}) = Var\left(\vec{\beta} + (X^T X)^{-1} X^T \vec{e}\right) = Var\left((X^T X)^{-1} X^T \vec{e}\right)

Since \vec{e} \sim N(\vec{0}, \sigma^2 I), we have Var(\vec{e}) = \sigma^2 I. Using the properties of variance:

Var\left((X^T X)^{-1} X^T \vec{e}\right) = (X^T X)^{-1} X^T Var(\vec{e}) X (X^T X)^{-1}
= (X^T X)^{-1} X^T (\sigma^2 I) X (X^T X)^{-1}
= \sigma^2 (X^T X)^{-1} X^T X (X^T X)^{-1}
= \sigma^2 (X^T X)^{-1}

In summary, the mean of the OLS estimator is E[β̂] = β, and its variance is:

Var(\hat{\vec{\beta}}) = \sigma^2 (X^T X)^{-1}

10.9.6 Coefficient of Determination

The coefficient of determination (R²) measures the proportion of the variance in the dependent variable (Y) that is explained by the independent variables (x1, x2, . . . , xp) in the regression model.

R^2 = 1 - \frac{SSE}{SST}

where:

• SSE = (\vec{Y} - X\hat{\vec{\beta}})^T(\vec{Y} - X\hat{\vec{\beta}}) is the sum of squared errors (residuals), representing the unexplained variability in the dependent variable.

• SST = (\vec{Y} - \bar{Y})^T(\vec{Y} - \bar{Y}) is the total sum of squares, representing the total variability in the dependent variable.

Interpretation:

• R² ranges from 0 to 1, where a higher value indicates a better fit of the regression model to the data.

• R² represents the proportion of the variance in the dependent variable that is explained by the independent variables.

• For example, if R² = 0.75, it means that 75% of the variance in the dependent variable is explained by the independent variables.

10.9.7 Adjusted R²

• The adjusted R² is denoted by R̄² and is defined as

\bar{R}^2 = 1 - (1 - R^2)\,\frac{n-1}{n-p-1}

where p is the total number of independent variables in the model (not including the constant term), and n is the sample size.

• It can also be written as:

\bar{R}^2 = 1 - \frac{SSE/df_e}{SST/df_t}

where dft is the degrees of freedom n − 1 of the estimate of the population variance of the dependent variable, and dfe is the degrees of freedom n − p − 1 of the estimate of the underlying population error variance. Hence,

\bar{R}^2 = 1 - \frac{SSE/(n-p-1)}{SST/(n-1)}

• The explanation of this statistic is almost the same as R², but it penalizes the statistic as extra variables are included in the model.

• The term (n−1)/(n−p−1) is called the penalty for using more covariates in a model.

• When the number of covariates p increases, (1 − R²) will decrease, but (n−1)/(n−p−1) will increase.

• Whether more covariates improve the explanatory power of a model depends on the trade-off between R² and the penalty.

Interpretation:

• R̄² penalizes the addition of unnecessary predictors to the model, unlike R².

• It is always less than or equal to R², and it increases only if the new term improves the model more than would be expected by chance.

• Therefore, R̄² is often preferred for comparing the goodness of fit of models with different numbers of predictors.

10.9.8 Example Dataset and Regression Calculations

Consider a dataset with 10 observations and two independent variables (x1 and x2).

i     x1i   x2i   yi
1     9     16    10
2     13    14    12
3     11    10    14
4     11    8     16
5     14    11    18
6     15    17    20
7     16    9     22
8     20    16    24
9     15    12    26
10    15    12    28

To fit a multiple regression model, we use the least squares method to estimate the coefficients β0, β1, and β2. The model equation is:

yi = β0 + β1 x1i + β2 x2i + ei

We calculate the estimated coefficients β̂ using the formula:

\hat{\vec{\beta}} = (X^T X)^{-1} X^T \vec{Y}

where X is the design matrix of independent variables, Y is the vector of the dependent variable, and β̂ is the vector of estimated coefficients.
Calculation of β
DR

 
1 9 16
 
1 13 14
 
 
1 11 10
   
10
 
1 11 8 
   

1 14 11
 12
For the given dataset, we have: X =   ; Y ⃗ =  . .

1 15 17
   .. 
   
1 16 9 
  28
 
1 20 16
 
 
1 15 12
 
1 15 12
⃗b
⃗ , and β.
Now, let’s calculate X T X, (X T X)−1 , X T Y
T
Calculation of X X

460
CHAPTER 10. CORRELATION AND REGRESSION ANALYSIS

 
1 9 16
  
1 1 1 ... 1 1
 13 14 
XT X = 
  
9 13 11 ...  1
15 11 10


. .. .. 
16 14 10 ... 12  .
. . .
1 15 12

 
10 139 125
 
=
139 2019 1757

T

125 1757 1651
⃗b
Calculation of β
Using matrix inversion, we find:
AF
(X T X)−1 =
1
T
det(X X)
adj(X T X) = 

−0.135

3.369 −0.135 −0.112
0.012

−0.003

−0.112 −0.003 0.012

Now, we calculate
 
2.821
⃗b
= (X T X)−1 X T ⃗y = 
 
β  1.591 

−0.475
DR

Calculation of Error Variance Estimation

i       x1i   x2i   yi    ŷi       (yi − ŷi)²
1       9     16    10    9.542    0.209764
2       13    14    12    16.856   23.58074
3       11    10    14    15.573   2.474329
4       11    8     16    16.523   0.273529
5       14    11    18    19.871   3.500641
6       15    17    20    18.613   1.923769
7       16    9     22    24.003   4.012009
8       20    16    24    27.043   9.259849
9       15    12    26    20.988   25.12014
10      15    12    28    20.988   49.16814
Total   139   125   190   190      119.5229
• The error variance estimate, σ̂², is calculated as:

\hat{\sigma}^2 = \frac{1}{n-p-1}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \frac{119.5229}{10-2-1} = 17.0747

where n is the number of observations, p is the number of predictors, yi is the observed value, and ŷi is the predicted value.

• Coefficient of determination:

R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{119.522}{330} = 0.6378

\bar{R}^2 = 1 - \frac{(1 - R^2)(n-1)}{n-p-1} = 1 - \frac{(1 - 0.6378)(10-1)}{10-2-1} = 0.5343

Goodness-of-fit R² and Adjusted R²

• The coefficient of determination R² = 0.6378 suggests that approximately 63.78% of the variance in the dependent variable (y) can be explained by the independent variables x1 and x2 included in the model.

• The Adjusted R² value takes into account the number of predictors and the sample size, providing a more conservative estimate of the model's goodness-of-fit. In this case, R̄² = 0.5343, indicating that approximately 53.43% of the variance in the dependent variable (y) is explained by the independent variables x1 and x2 after adjusting for the number of predictors and the sample size.
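
The hand calculations above can be verified with NumPy; a minimal sketch using the data from Section 10.9.8:

import numpy as np

x1 = np.array([9, 13, 11, 11, 14, 15, 16, 20, 15, 15])
x2 = np.array([16, 14, 10, 8, 11, 17, 9, 16, 12, 12])
y = np.array([10, 12, 14, 16, 18, 20, 22, 24, 26, 28])

X = np.column_stack([np.ones_like(x1), x1, x2]).astype(float)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # approx (2.821, 1.591, -0.475)

resid = y - X @ beta_hat
n, p = len(y), 2
sigma2_hat = resid @ resid / (n - p - 1)       # approx 17.07

sst = np.sum((y - y.mean()) ** 2)              # 330
r2 = 1 - (resid @ resid) / sst                 # approx 0.6378
r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)  # approx 0.5343
print(beta_hat, sigma2_hat, r2, r2_adj)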


10.9.9 F-test in Multiple Regression

• The F-test in multiple regression assesses the overall significance of the regression model.

• It tests whether at least one of the independent variables has a non-zero coefficient.

• The null hypothesis H0 for the F-test is:

H0 : β1 = β2 = · · · = βp = 0

where β1, β2, . . . , βp are the coefficients of the independent variables.
10.9.10 ANOVA Table in Regression Analysis

• The ANOVA table in multiple regression assesses the overall significance of the regression model.

• It partitions the total variance in the dependent variable into explained variance and unexplained variance.

• The table includes sums of squares (SS), degrees of freedom (df), mean squares (MS), and the F-test statistic.

Source of Variation   SS    df          MS                      F
Regression            SSR   p           MSR = SSR/p             F = MSR/MSE
Residual (Error)      SSE   n − p − 1   MSE = SSE/(n − p − 1)
Total                 SST   n − 1

• Reject the null hypothesis H0 if the calculated F-statistic is greater than the critical value from the F-distribution.

Decision Rule for ANOVA F-test

Classical Approach

• Set the significance level α.

• Calculate the critical value Fcritical from the F-distribution with appropriate degrees of freedom.

• Decision Rule:
  ■ If calculated F > Fcritical, reject H0 and conclude that the regression model is statistically significant.
  ■ If calculated F ≤ Fcritical, fail to reject H0 and conclude that the regression model is not statistically significant.

p-value Approach

• Calculate the p-value associated with the calculated F-statistic.

• Decision Rule:
  ■ If p-value < α, reject H0 and conclude that the regression model is statistically significant.
  ■ If p-value ≥ α, fail to reject H0 and conclude that the regression model is not statistically significant.

Example: ANOVA Table

Source       df   SS         MS         F        Significance F
Regression   2    210.4651   105.2326   6.1625   0.0286
Residual     7    119.5349   17.0764
Total        9    330
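
Assuming the same dataset from Section 10.9.8, the entries of this ANOVA table can be read directly off a statsmodels fit; a minimal sketch:

import pandas as pd
import statsmodels.api as sm

X = pd.DataFrame({'x1': [9, 13, 11, 11, 14, 15, 16, 20, 15, 15],
                  'x2': [16, 14, 10, 8, 11, 17, 9, 16, 12, 12]})
y = [10, 12, 14, 16, 18, 20, 22, 24, 26, 28]

model = sm.OLS(y, sm.add_constant(X)).fit()

# Sums of squares and the overall F-test, as reported by statsmodels
print(model.ess, model.ssr)          # explained SS (~210.47) and residual SS (~119.53)
print(model.fvalue, model.f_pvalue)  # F is about 6.16 with p about 0.029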

10.9.11 The t-tests in Multiple Regression

• The t-tests in multiple regression assess the significance of individual coefficients (parameters) in the model.

• Each t-test tests the null hypothesis that the corresponding coefficient is zero.

• The t-test statistic for the coefficient β̂i is calculated as:

t = \frac{\hat{\beta}_i}{se(\hat{\beta}_i)}

where β̂i is the estimated coefficient, and se(β̂i) is its standard error.

Decision Rule for t-tests in Multiple Regression

Classical Approach

• Set the significance level α.

• Calculate the critical value tcritical from the t-distribution with appropriate degrees of freedom.

• Decision Rule for each β̂i:
  ■ If |t| > tcritical, reject H0 and conclude that the corresponding coefficient β̂i is statistically significant.
  ■ If |t| ≤ tcritical, fail to reject H0 and conclude that the corresponding coefficient β̂i is not statistically significant.

p-value Approach

• Calculate the p-value associated with each calculated t-statistic.

• Decision Rule for each β̂i:
  ■ If p-value < α, reject H0 and conclude that the corresponding coefficient β̂i is statistically significant.
  ■ If p-value ≥ α, fail to reject H0 and conclude that the corresponding coefficient β̂i is not statistically significant.

Problem 10.8. Consider a dataset containing information about the quarterly sales of a restaurant chain, the student population near the restaurant, and the advertising budget. The following data has been collected:

Restaurant   Student Population (1000s)   Advertising Budget ($1000s)   Quarterly Sales ($1000s) (y)
1            2                            5                             58
2            6                            12                            105
3            8                            10                            88
4            8                            15                            118
5            12                           20                            117
6            16                           22                            137
7            20                           25                            157
8            20                           26                            169
9            22                           24                            149
10           26                           30                            202

(i) Using the above data, formulate a multiple linear regression model where
the quarterly sales (y) is the dependent variable, and both the student
population (x1 ) and the advertising budget (x2 ) are independent variables.
(ii) Write down the assumptions of the multiple linear regression model.


(iii) Estimate the regression coefficients for the model formulated in part (i) using the least squares method. Interpret the coefficients for both the student population and advertising budget.
(iv) Based on the regression model obtained in part (iii), what would be the estimated quarterly sales for a restaurant located near a campus with 16,000 students and an advertising budget of $20,000?
(v) Calculate the standard error of the estimate for the model obtained in part (iii).
(vi) How would you evaluate the goodness-of-fit for the regression model? What statistical metrics would you consider, and why?
(vii) Create scatter plots to show the relationship between:
• Student population and quarterly sales.
• Advertising budget and quarterly sales.
Overlay the regression lines on these plots and comment on the observed relationships.
(viii) If a new restaurant is to be established near a campus with a student population of 18,000 and an advertising budget of $25,000, use the regression model from part (iii) to predict the quarterly sales.
(ix) Discuss the limitations of using the linear regression model in this context.
What are some factors not considered by the model that could affect the
accuracy of your predictions?
(x) Test the significance of each regression coefficient at the 5% significance level. Clearly state the null and alternative hypotheses, the test statistic, and your conclusion.
(xi) Construct a 95% confidence interval for the coefficients of the student
population and advertising budget. Interpret the intervals.

Solution
(i).
The multiple linear regression model can be written as:

y = β0 + β1 x 1 + β2 x 2 + e

where:
• y is the quarterly sales in $1000s.

• x1 is the student population in 1000s.


• x2 is the advertising budget in $1000s.

• β0 , β1 , and β2 are the coefficients to be estimated.

• e is the error term.

(ii).
The assumptions of the multiple linear regression model are:
• Linearity: The relationship between the independent variables and the
dependent variable is linear.

• Independence: The observations are independent of each other.

• Homoscedasticity: The variance of the error terms is constant.

• Normality: The error terms are normally distributed.

• No multicollinearity: The independent variables are not highly corre-


AF
(iii).
lated with each other.

After fitting the regression model to the data, the estimated coefficients are:
ŷ = β̂0 + β̂1 x1 + β̂2 x2 = 38.62 + 0.97 x1 + 4.11 x2
Interpretation:

• The intercept β̂0 = 38.62 suggests that when both the student population and advertising budget are zero, the estimated quarterly sales would be $38,620. However, this may not have a practical interpretation.

• The coefficient β̂1 = 0.97 suggests that for each additional 1000 students, quarterly sales increase by $970, holding the advertising budget constant.

• The coefficient β̂2 = 4.11 suggests that for each additional $1000 spent on advertising, quarterly sales increase by $4110, holding the student population constant.

(iv).
Substituting x1 = 16 and x2 = 20 into the estimated regression equation:

ŷ = 38.62 + 0.97(16) + 4.11(20) = 136.34

Therefore, the estimated quarterly sales would be approximately $136,340.


(v).
The standard error of the estimate is calculated as:

\text{Standard Error} = \sqrt{\frac{\sum_i (y_i - \hat{y}_i)^2}{n-p}}

where n is the number of observations and p is the number of estimated parameters, including the intercept. From the regression model, the standard error of the estimate is:

Standard Error = 12.44

T
(vi).
The goodness-of-fit of the regression model can be evaluated using the following
metrics:
• R2 = 0.944: Represents the proportion of variance in the dependent
variable that is explained by the independent variables. A higher R2
AF •
indicates a better fit.

Adjusted R2 = 0.928: Adjusts the R2 value based on the number of


predictors, providing a more accurate measure in models with multiple
predictors.

• F -statistic: F = 59.36, with a p-value of 0.00004, tests the overall


significance of the regression model. A significant F -statistic indicates
that the model explains a significant portion of the variance in the
dependent variable.

• Residual plots: Used to check for homoscedasticity and the normality


DR

of the residuals.


(vii).
[Scatter plots of student population vs. quarterly sales and of advertising budget vs. quarterly sales, with the fitted regression lines overlaid.]

• The scatter plot of student population vs. quarterly sales shows a positive linear relationship, indicating that as the student population increases, quarterly sales also increase.

• The scatter plot of advertising budget vs. quarterly sales also shows a positive linear relationship, indicating that an increase in the advertising budget leads to higher quarterly sales.

(viii).
Substituting x1 = 18 and x2 = 25 into the estimated regression equation:

ŷ = 38.62 + 0.97(18) + 4.11(25) = 158.83

The predicted quarterly sales would be approximately $158,830.

(ix).
Limitations of the linear regression model in this context include:
• The model assumes a linear relationship between the variables, which
may not always hold true.

• It does not account for potential interactions between the student pop-
ulation and advertising budget.

• The model does not consider external factors like economic conditions,
competition, or seasonal variations that could impact sales.

• The model assumes that the relationships are constant over time, which
might not be the case in reality.


(x).
For each coefficient:

• Null Hypothesis (H0): The coefficient is equal to zero (βi = 0).

• Alternative Hypothesis (H1): The coefficient is not equal to zero (βi ≠ 0).

Test Statistics (from the regression output):

• For the intercept (β0): t = 3.229, p = 0.014.

• For student population (β1): t = 0.534, p = 0.610.

• For advertising budget (β2): t = 2.287, p = 0.056.

Conclusion:

• The intercept is statistically significant (p < 0.05).

• The coefficient for student population is not statistically significant (p > 0.05), so we fail to reject the null hypothesis.

• The coefficient for the advertising budget is marginally significant at the 5% level (p ≈ 0.056), close to the threshold, so we might consider it as not significant or marginally significant depending on the context.

(xi).
Using the regression output, the 95% confidence intervals are:

• For student population (β1): −3.337 ≤ β1 ≤ 5.283

• For advertising budget (β2): −0.140 ≤ β2 ≤ 8.369

Interpretation:

• For student population: We are 95% confident that the true coefficient lies within the interval [−3.337, 5.283]. Since this interval includes zero, it suggests that the effect of student population on sales may not be significant.

• For advertising budget: We are 95% confident that the true coefficient lies within the interval [−0.140, 8.369]. Since this interval also includes zero, it suggests that the effect of advertising budget on sales may not be significant.


10.9.12 Python Code: Multiple Linear Regression Model

Consider the Problem 10.8. Suppose we have data on quarterly sales (y), student population (x1), and advertising budget (x2). The Python code to solve this problem is the following.

import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Example Data: Student Population, Advertising Budget, and Quarterly Sales
data = {
    'Student Population (1000s)': [2, 6, 8, 8, 12, 16, 20, 20, 22, 26],
    'Advertising Budget ($1000s)': [5, 12, 10, 15, 20, 22, 25, 26, 24, 30],
    'Quarterly Sales ($1000s)': [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]
}

# Convert the data to a DataFrame
df = pd.DataFrame(data)

# Define the independent variables (X) and dependent variable (y)
X = df[['Student Population (1000s)', 'Advertising Budget ($1000s)']]
y = df['Quarterly Sales ($1000s)']

# Add a constant to the independent variables matrix
X = sm.add_constant(X)

# Fit the multiple linear regression model
model = sm.OLS(y, X).fit()

# Print the summary of the regression model
print(model.summary())

# (Optional) Scatter plots to visualize the relationship between each
# independent variable and the dependent variable
plt.figure(figsize=(12, 6))

# Scatter plot for Student Population vs Quarterly Sales
plt.subplot(1, 2, 1)
plt.scatter(df['Student Population (1000s)'], y, color='blue')
plt.plot(df['Student Population (1000s)'],
         model.params[0]
         + model.params[1] * df['Student Population (1000s)']
         + model.params[2] * df['Advertising Budget ($1000s)'].mean(),
         color='red')
plt.title('Quarterly Sales vs. Student Population')
plt.xlabel('Student Population (1000s)')
plt.ylabel('Quarterly Sales ($1000s)')
plt.grid(True)

# Scatter plot for Advertising Budget vs Quarterly Sales
plt.subplot(1, 2, 2)
plt.scatter(df['Advertising Budget ($1000s)'], y, color='green')
plt.plot(df['Advertising Budget ($1000s)'],
         model.params[0]
         + model.params[1] * df['Student Population (1000s)'].mean()
         + model.params[2] * df['Advertising Budget ($1000s)'],
         color='red')
plt.title('Quarterly Sales vs. Advertising Budget')
plt.xlabel('Advertising Budget ($1000s)')
plt.ylabel('Quarterly Sales ($1000s)')
plt.grid(True)

plt.tight_layout()
plt.show()

# Prediction example: Predict quarterly sales for a restaurant with a student
# population of 16,000 and an advertising budget of $20,000
new_data = pd.DataFrame({'const': [1],
                         'Student Population (1000s)': [16],
                         'Advertising Budget ($1000s)': [20]})
predicted_sales = model.predict(new_data)[0]
print(f'Predicted Quarterly Sales: ${predicted_sales:.2f} (in $1000s)')

# Standard Error of the Estimate
standard_error = np.sqrt(model.mse_resid)
print(f'Standard Error of the Estimate: {standard_error:.2f}')


Figure 10.12: Python output for the practice example.

10.9.13 Exercises
1. Explain the difference between simple linear regression and multiple linear
regression. How does the addition of more predictors in multiple linear
regression improve the model?
2. What assumptions must be met for multiple linear regression analysis to
be valid? Explain why each assumption is important.
3. What is R2 in the context of multiple linear regression? How is it different

from adjusted R2 , and why might adjusted R2 be preferred in some cases?

4. The following summary results are for a multiple regression model:

Table 10.7: Model Summary Statistics

Statistic Value
R-squared 0.76
Adjusted R-squared 0.74
F -statistic 38.42
p-value (F -statistic) 0.0001

473
CHAPTER 10. CORRELATION AND REGRESSION ANALYSIS

Table 10.8: Regression Coefficients

Variable Estimate Standard Error p-value


Intercept 2.50 0.65 0.001
X1 0.75 0.10 0.000
X2 -0.55 0.15 0.102
X2 1.20 0.20 0.000

(i). What is the regression model corresponding to the summary results given in Table 10.8? Describe the assumptions of multiple regression analysis. What diagnostic plots can be used to check these assumptions?
(ii). Explain the R-squared value given in Table 10.7, which indicates
the model’s performance. How does the adjusted R-squared value
improve on this interpretation?
(iii). Given the model summary in Table 10.8, write the estimated model and
(a) Interpret the coefficient of X1 .
(b) What does the intercept term represent in this context?
(iv). Based on the p-values of the coefficients in Table 10.8, identify which
predictors are statistically significant at the 5% significance level and
explain why.
(v). Calculate the predicted value of Y when X1 = 4, X2 = 2, and
X3 = 3. Show all steps in your calculation.

(vi). The F -statistic and its corresponding p-value are given in the model
summary. Explain what these values indicate about the overall
model.

5. Describe how hypothesis testing is used to determine the significance of the regression coefficients in a multiple linear regression model. What is the null hypothesis for each coefficient, and how is it tested?
6. What is the purpose of the F -test in multiple linear regression? How does
it differ from t-tests for individual coefficients?

10.10 Regression Model Diagnostics


Regression model diagnostics are critical for ensuring that the assumptions
underlying a regression model are met. These diagnostics help identify issues
such as non-linearity, heteroscedasticity, outliers, and multicollinearity, which
can affect the validity and interpretability of the model.


10.10.1 Assumptions of Linear Regression

The linear regression model is based on several key assumptions that must be checked to ensure the model is appropriate for the data:

• Linearity: The relationship between the independent variables and the dependent variable should be linear.

• Independence: Observations should be independent of each other.

• Homoscedasticity: The variance of the residuals should be constant across all levels of the independent variables.

• Normality of Residuals: The residuals (errors) of the model should be approximately normally distributed.

• No Multicollinearity: Independent variables should not be highly correlated with each other.
AF
10.10.2 Residual Plots

Residuals of the model given in Equation (10.6) are defined as

\hat{e}_i = y_i - \hat{y}_i \quad \text{for } i = 1, 2, \ldots, n.

These residuals can be regarded as the 'observed' errors. They are not the same as the unknown true errors

e_i = y_i - E(y_i) \quad \text{for } i = 1, 2, \ldots, n.

If the model is appropriate for the data at hand, the residuals should reflect the properties assumed for ei (i.e., independence, normality, zero mean, and constant variance). For diagnostic purposes, we sometimes use the semistudentized residual, which is defined as

\hat{e}_i^* = \frac{\hat{e}_i - \bar{\hat{e}}}{\sqrt{MSE}} = \frac{\hat{e}_i}{\sqrt{MSE}},    (10.13)

where

MSE = \sum_{i=1}^{n} \frac{(\hat{e}_i - \bar{\hat{e}})^2}{n-k} = \sum_{i=1}^{n} \frac{\hat{e}_i^2}{n-k} = s^2

and k = p + 1.

If √MSE were an estimate of the standard deviation of êi, then êi* would be the studentized residual. However, note that the standard deviation of êi is not equal to √MSE; it varies for each êi. The estimator √MSE is only an approximation of the standard deviation of êi; hence, we refer to êi* as a semistudentized residual.

Residual plots are a crucial tool for diagnosing issues in a regression model. A residual plot is a scatterplot of the residuals (errors) on the y-axis and the predicted values or one of the independent variables on the x-axis. The following plots of residuals (or semistudentized residuals) will be utilized here for this purpose:
1. Plot of residuals against the predictor variable.
2. Plot of absolute or squared residuals against the predictor variable.
3. Plot of residuals against fitted values.
4. Plot of residuals against time or other ordered sequence.
5. Plots of residuals against omitted predictor variables.
6. Box plot of residuals.
7. Normal probability plot of residuals.
A few questions to consider when you analyze the plots:

1. Do the residuals follow any pattern indicating nonlinearity?
2. Are there any outliers?
3. Does the assumption of constant variance look correct?
4. Label any qualitative variables on the plot. Any patterns?

• Detecting Non-linearity: If the residuals show a systematic pattern (e.g., curvature), this indicates that the relationship between the independent and dependent variables may not be linear.

• Identifying Heteroscedasticity: If the residuals display a funnel shape (widening or narrowing), this suggests heteroscedasticity, meaning the variance of the errors is not constant.
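
A minimal plotting sketch for three of these diagnostics (residuals versus fitted values, semistudentized residuals versus x, and a normal Q-Q plot), using the Shorshe Ilish data from Problem 10.7 purely for illustration:

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

x = np.array([2, 6, 8, 8, 12, 16, 20, 20, 22, 26])
y = np.array([58, 105, 88, 118, 117, 137, 157, 169, 149, 202])
model = sm.OLS(y, sm.add_constant(x)).fit()

resid = model.resid
semistud = resid / np.sqrt(model.mse_resid)   # semistudentized residuals, Eq. (10.13)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

axes[0].scatter(model.fittedvalues, resid)    # residuals vs fitted values
axes[0].axhline(0, color='red')
axes[0].set_title('Residuals vs. Fitted')

axes[1].scatter(x, semistud)                  # semistudentized residuals vs x
axes[1].axhline(0, color='red')
axes[1].set_title('Semistudentized Residuals vs. x')

sm.qqplot(resid, line='s', ax=axes[2])        # normal Q-Q plot of the residuals
axes[2].set_title('Normal Q-Q Plot')

plt.tight_layout()
plt.show()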

Univariate Plots of x and y

Univariate plots (see Figure 10.13(a)) of x and y are essential tools for exploring the individual distributions of the predictor variable x and the response variable y in a regression analysis. These plots, such as histograms, box plots, or density plots, allow us to visualize the central tendency, spread, and shape of each variable's distribution. By examining these plots, we can identify characteristics such as skewness, kurtosis, outliers, or gaps in the data, which may influence the relationship between x and y. Understanding the univariate distributions of x and y helps in choosing appropriate transformations or modeling approaches, and provides a foundation for interpreting the results of more complex analyses, such as regression or correlation.
Figure 10.13: Scatter Plot and Residual Plot for Understanding a Nonlinear Regression Function
Plot of the Residuals versus x

A plot of the residuals versus x (see Figure 10.13(b)) is more effective in detecting nonlinearity than a scatter plot (e.g., nonlinearity may be more prominent on a residual plot than on a scatter plot). It can also indicate other forms of model departure besides nonlinearity (e.g., non-constancy of variance); see Figure 10.14. Any observable pattern in the plot of the residuals versus x indicates a problem with the model assumptions!
Figure 10.14: Prototype Residual Plots.
Plot of the Residuals versus ŷ

For simple linear regression (with a single predictor), the residuals versus ŷ plot contains the same information as the plot of residuals versus x, but on a different scale. For multiple linear regression, this plot allows us to examine patterns in the residuals as ŷ increases. Ideally, there should be no systematic patterns.
Plot of the Residuals versus Time

The plot of residuals versus time (see Figure 10.15), also known as a sequence plot, is only meaningful when data are collected in a time sequence or some other type of ordered sequence. In this plot, any discernible pattern suggests a lack of independence among the residuals, indicating potential issues with the model.

Figure 10.15: Residual Time Sequence Plots Illustrating Nonindependence of Error Terms.

Plot of the Semistudentized Residuals versus X

Semistudentized residuals are defined in Equation (10.13). The plot shown in Figure 10.16 is particularly useful for identifying outliers in the data. By using the semistudentized version of the residual, it becomes easier to detect potential outliers, as the residuals are scaled relative to the variability of the data. Cases where êi* falls outside the range (−3, 3) can be considered outliers, indicating observations that deviate significantly from the fitted model.

Figure 10.16: Residual Plot with Outlier.

Plot of the Residuals in a Normal Quantile Plot

A Q-Q plot, also known as a quantile-quantile plot (see Figure 10.17), is used to assess the normality of the residuals in a regression model. In this plot, the ordered residuals are plotted against their corresponding quantiles from a normal distribution. If the residuals are normally distributed, the points should lie approximately along a straight line. Deviations from this line, especially in the tails of the distribution, indicate non-normality in the residuals. This can signal potential issues with the model, such as the presence of outliers, skewness, or other departures from the assumptions of normality.

Figure 10.17: Normal Probability Plots when Error Term Distribution Is Not Normal.
AF
Added-Variable plot
An Added-Variable plot, also known as a partial regression plot, is a diagnos-
tic tool used in multiple linear regression to assess the relationship between a
specific predictor variable and the response variable, while accounting for the
influence of other predictors in the model. The Added-Variable plot is created
by plotting the residuals from regressing the response variable on the other
predictors against the residuals from regressing the specific predictor on the
same set of predictors. This helps in identifying the unique contribution of the
predictor of interest after removing the effects of other variables in the model.

In some cases, an Added-Variable plot may also involve plotting residuals against potential predictor variables that are not included in the model but
could have important effects on the response. Any distinctive pattern(s) in such
a plot may indicate that the model could be improved by adding the omitted
predictor variable(s). If the plot shows a clear linear trend, it suggests that
the predictor has a significant linear relationship with the response variable.
However, deviations from linearity or a lack of pattern may indicate that the
predictor does not contribute significantly to the model, or that a more complex,
nonlinear relationship exists.
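statsmodels can draw added-variable (partial regression) plots via plot_partregress_grid; the sketch below applies it to a hypothetical two-predictor dataset:

import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
x1 = rng.normal(size=60)
x2 = 0.5 * x1 + rng.normal(size=60)       # correlated with x1
y = 1 + 2 * x1 - x2 + rng.normal(size=60)

X = sm.add_constant(np.column_stack([x1, x2]))
results = sm.OLS(y, X).fit()

# One panel per predictor: residuals of y given the other predictors
# versus residuals of that predictor given the other predictors.
fig = sm.graphics.plot_partregress_grid(results, exog_idx=[1, 2])
fig.tight_layout()
plt.show()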

10.10.3 Formal Tests


Graphical analysis can often be subjective, especially when patterns are not
very distinctive. In such cases, formal tests can be considered to assess potential
violations of the model assumptions.
• Tests for Normality

• Test for Autocorrelation

• Tests for Non-Constancy of Variance

  ■ Breusch-Pagan Test

• Outlier Identification

• Lack-of-Fit Test

Tests for Normality

The following goodness-of-fit tests can be used to examine the normality of the error terms:

• Chi-Square test

• Shapiro-Wilk test

• Kolmogorov-Smirnov test

• Lilliefors test

Shapiro-Wilk test for Normality


The Shapiro-Wilk test is a statistical test used to assess the normality of a
dataset. Specifically, it tests the null hypothesis that a sample comes from a
normally distributed population. The test calculates a W statistic, which mea-
sures how well the data conform to a normal distribution. A W value close to 1
indicates that the data are approximately normally distributed, while a W value
significantly less than 1 suggests a departure from normality. The associated p-value determines whether the null hypothesis can be rejected; a small p-value
(typically less than 0.05) indicates that the data significantly deviate from nor-
mality. The Shapiro-Wilk test is particularly powerful for detecting departures
from normality in small to moderately sized samples, making it a widely used
tool in statistical analysis. Using the Toluca Company dataset introduced next, the Python code for the Shapiro-Wilk test is presented below.

Example: Toluca Company dataset


The Toluca Company, which manufactures refrigeration equipment and replace-
ment parts, aimed to determine the optimal lot size for a specific replacement
part during a cost improvement program. Production involves setup, machin-
ing, and assembly, with labor hours varying by lot size. To establish the rela-
tionship between lot size and labor hours, data from 25 production runs were
collected over a stable six-month period. This data, shown in Table 10.9, records
lot sizes and corresponding work hours. Lot sizes are multiples of 10, as per


company policy. The conditions during this period were expected to remain
consistent for the next three years.

Table 10.9: Toluca Company Example.

Lot Size Work Hrs Lot Size Work Hrs


80 399 30 121
50 221 90 376
70 361 60 224
120 546 80 352
100 353 50 157
40 160 70 252

90 389 20 113
110 435 100 420
30 212 50 268
AF 90 377 110 421
30 273 90 468
40 244 80 342
70 323

We perform the Shapiro-Wilk test for normality using Python as follows.
Python Code: Shapiro-Wilk test for Normality

# Shapiro-Wilk test for Normality
import numpy as np
import statsmodels.api as sm
from scipy.stats import shapiro

# Define the data
LotSize = np.array([80, 30, 50, 90, 70, 60, 120, 80, 100, 50, 40, 70, 90,
                    20, 110, 100, 30, 50, 90, 110, 30, 90, 40, 80, 70])
WorkHrs = np.array([399, 121, 221, 376, 361, 224, 546, 352, 353, 157, 160,
                    252, 389, 113, 435, 420, 212, 268, 377, 421, 273, 468,
                    244, 342, 323])

# Fit the regression model
X = sm.add_constant(LotSize)  # add a constant term to the independent variable
model = sm.OLS(WorkHrs, X).fit()

# Obtain residuals
residuals = model.resid

# Perform Shapiro-Wilk test for normality
statistic, p_value = shapiro(residuals)

# Print the test results
print("Shapiro-Wilk Test Statistic:", statistic)
print("p-value:", p_value)

# Interpret the results
alpha = 0.05
if p_value > alpha:
    print("Residuals look Gaussian (fail to reject H0)")
else:
    print("Residuals do not look Gaussian (reject H0)")

10.10.4 Test for Autocorrelation


Tests for autocorrelation are usually meaningful only for data collected in a time series or in some other ordered sequence. In cross-sectional studies, data collection should be designed so that observations are independent. A runs test is frequently used to test for lack of randomness in the residuals arranged in time order. Another test, specifically designed for detecting lack of randomness in least squares residuals, is the Durbin-Watson test.

Durbin-Watson Test
The model is:
$$y_i = \beta_0 + \beta_1 x_i + e_i$$
where
$$e_i = \rho e_{i-1} + u_i, \qquad |\rho| < 1,$$
and the $u_i \sim N(0, \sigma^2)$ are independent.

Hypotheses:
$$H_0: \rho = 0 \quad \text{versus} \quad H_A: \rho > 0$$

Statistic:
$$D = \frac{\sum_{i=2}^{n} (e_i - e_{i-1})^2}{\sum_{i=1}^{n} e_i^2}$$

Decision Rule:
$$D > d_U \;\Rightarrow\; \text{do not reject } H_0$$
$$D < d_L \;\Rightarrow\; \text{reject } H_0$$
$$d_L \le D \le d_U \;\Rightarrow\; \text{inconclusive}$$

Values for $d_L$ and $d_U$ can be found in Table B.7 in Kutner et al.


Python Code: Durbin-Watson Test


To conduct the Durbin-Watson test in Python, we consider the Toluca Company
example dataset given in Section 10.10.3. The Python code is provided in the
following.
# Durbin-Watson Test
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

# Define the data
LotSize = np.array([80, 30, 50, 90, 70, 60, 120, 80, 100, 50, 40, 70, 90,
                    20, 110, 100, 30, 50, 90, 110, 30, 90, 40, 80, 70])
WorkHrs = np.array([399, 121, 221, 376, 361, 224, 546, 352, 353, 157, 160,
                    252, 389, 113, 435, 420, 212, 268, 377, 421, 273, 468,
                    244, 342, 323])

# Fit the regression model (with an intercept, as in the earlier listings)
X = sm.add_constant(LotSize)
model = sm.OLS(WorkHrs, X).fit()
print(model.summary())

# Perform the Durbin-Watson test
durbin_watson_statistic = durbin_watson(model.resid)
print("Durbin-Watson statistic:", durbin_watson_statistic)
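As a check, D can also be computed directly from its definition; this short sketch reuses the model object fitted in the listing above.

import numpy as np

e = np.asarray(model.resid)
# D = sum_{i=2}^{n} (e_i - e_{i-1})^2 / sum_{i=1}^{n} e_i^2
D = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
print("D computed from the definition:", D)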

10.10.5 Tests for Non-Constancy of Variance


Tests for non-constancy of variance, also known as tests for heteroscedasticity, assess whether the variance of residuals from a regression model is constant across levels of the predictor variables. One common test is the Breusch-Pagan test. The procedure for this test is as follows:

1. Fit the regression model and obtain the residuals ebi .


2. Regress the squared residuals eb2i on the predictor variables x.
3. Obtain the test statistic from the resulting regression. For the Breusch-
Pagan test, this statistic follows a chi-squared distribution with degrees
of freedom equal to the number of predictor variables.
4. Compare the test statistic to the chi-squared distribution to determine
the p-value. A small p-value indicates that the null hypothesis of constant
variance is rejected, suggesting the presence of heteroscedasticity.

Breusch-Pagan Test
• Requires ei to be independent and normally distributed.


• Requires large sample size.

• Can detect relationships such as

$$\log \sigma_i^2 = \gamma_0 + \gamma_1 x_i$$

Regress the squared residuals, $e_i^2$, against $x_i$ and obtain $SSR^*$ from this regression.

Hypotheses:
$$H_0: \gamma_1 = 0 \quad \text{versus} \quad H_A: \gamma_1 > 0$$

Statistic:
$$\chi^2_{BP} = \frac{SSR^*/2}{\left(SSE/n\right)^2} \qquad (10.14)$$

where $SSR^*$ is the regression sum of squares when regressing $\hat{e}^2$ on $x$, and $SSE$ is the error sum of squares when regressing $Y$ on $x$. Under $H_0$, the test statistic follows a $\chi^2$-distribution with degrees of freedom equal to the number of predictor variables.

Alternatively, the White test can be used, which is robust to various forms
of heteroscedasticity and follows a similar procedure but does not assume a
specific functional form of the variance.

Example: Toluca Company dataset


To conduct the Breusch-Pagan test for the Toluca Company example given in Section 10.10.3, we regress the squared residuals in Table 10.10, column 5, against $x$ and obtain $SSR^* = 7{,}896{,}128$. From the fitted regression we have $SSE = 54{,}825$. Hence, test statistic (10.14) is:
$$\chi^2_{BP} = \frac{7{,}896{,}128/2}{\left(54{,}825/25\right)^2} = 0.821$$


Table 10.10: Data on Lot Size and Work Hours

Run | Lot Size (x_i) | Work Hours (y_i) | Mean Response (ŷ_i) | Residual (e_i) | Squared Residual (e_i²)

1 80 399 347.98 51.02 2,603.0
2 30 121 169.47 -48.47 2,349.3
3 50 221 240.88 -19.88 395.2
... ... ... ... ... ...
23 40 244 205.17 38.83 1,507.8
24 80 342 347.98 -5.98 35.8
25 70 323 312.28 10.72 114.9

Total 1,750 7,807 7,807 0 54,825

To control the $\alpha$ risk at 0.05, we require $\chi^2(0.95; 1) = 3.84$. Since $\chi^2_{BP} = 0.821 < 3.84$, we conclude $H_0$, that the error variance is constant. The p-value of this test is 0.36, so the data are quite consistent with constancy of the error variance.
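The calculation above is easy to reproduce programmatically. The following sketch computes the statistic of Equation (10.14) directly from the two regressions, assuming the LotSize and WorkHrs arrays defined in the earlier listings; it should return approximately 0.821. Note that the library function het_breuschpagan used in the next listing reports the Lagrange-multiplier form of the test, so its value need not coincide numerically with Equation (10.14).

import numpy as np
import statsmodels.api as sm

X = sm.add_constant(LotSize)
fit = sm.OLS(WorkHrs, X).fit()

aux = sm.OLS(fit.resid ** 2, X).fit()  # regress squared residuals on x
SSR_star = aux.ess                     # regression sum of squares, SSR*
SSE = fit.ssr                          # error sum of squares of the original fit
n = len(WorkHrs)

chi2_bp = (SSR_star / 2) / (SSE / n) ** 2
print("chi2_BP =", round(chi2_bp, 3))  # approximately 0.821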

Python Code: Breusch-Pagan Test


To conduct the Breusch-Pagan test in Python, we consider the Toluca Company example dataset given in Section 10.10.3. The Python code is provided in the following.
# Breusch-Pagan Test
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Define the data
LotSize = np.array([80, 30, 50, 90, 70, 60, 120, 80, 100, 50, 40, 70, 90,
                    20, 110, 100, 30, 50, 90, 110, 30, 90, 40, 80, 70])
WorkHrs = np.array([399, 121, 221, 376, 361, 224, 546, 352, 353, 157, 160,
                    252, 389, 113, 435, 420, 212, 268, 377, 421, 273, 468,
                    244, 342, 323])

# Fit the regression model
X = sm.add_constant(LotSize)  # add a constant term to the independent variable
model = sm.OLS(WorkHrs, X).fit()
print(model.summary())

# Perform the Breusch-Pagan test
test_results = het_breuschpagan(model.resid, model.model.exog)

# Extract results
bp_test_statistic = test_results[0]
bp_test_p_value = test_results[1]

print(f'Breusch-Pagan test statistic: {bp_test_statistic}')
print(f'Breusch-Pagan p-value: {bp_test_p_value}')

Brown-Forsythe (Modified Levene) Test


The Brown-Forsythe test, also known as the Modified Levene’s test, is used
to assess the equality of variances across groups. It is a robust alternative to
Levene's test and is particularly useful when the data may not be normally
distributed.

The Brown-Forsythe test is used to test the null hypothesis that the vari-
ances of different groups are equal. It is a modification of Levene’s test, where
the median is used instead of the mean to make it more robust to deviations
from normality.

Procedure:
1. Arrange the residuals by increasing values of x.

2. Split the sample into two (more or less equal) groups.


• Group 1: $n_1$ observations with $x \le \tilde{x}$
• Group 2: $n_2$ observations with $x > \tilde{x}$
where $\tilde{x}$ denotes the median of $x$.
3. Compute di1 = |ei1 − ẽ1 | and di2 = |ei2 − ẽ2 |, where ẽ1 and ẽ2 denote the
medians of the residuals in the two groups.
Hypotheses:
$$H_0: \sigma^2_{\text{grp1}} = \sigma^2_{\text{grp2}} \quad \text{versus} \quad H_1: \sigma^2_{\text{grp1}} \neq \sigma^2_{\text{grp2}}$$
Statistic (a two-sample t-test):
$$t_{BF} = \frac{\bar{d}_1 - \bar{d}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \qquad \text{where} \qquad s_p^2 = \frac{\sum_{i=1}^{n_1} (d_{i1} - \bar{d}_1)^2 + \sum_{i=1}^{n_2} (d_{i2} - \bar{d}_2)^2}{n - 2}$$


Under H0 , the test statistic follows a t-distribution with n − 2 degrees of


freedom.
Table 10.11: Calculations for Brown-Forsythe Test for Constancy of Error Variance, Toluca Company Example.

Group 1
i | Run | Lot Size | Residual (e_i1) | d_i1 | (d_i1 − d̄_1)²
1 14 20 −20.77 0.89 1,929.41
2 2 30 −48.47 28.59 263.25
... ... ... ... ... ...
12 12 70 −60.28 40.40 19.49
13 25 70 10.72 30.60 202.07
Total — — — 582.60 12,566.6

ẽ_1 = −19.88, d̄_1 = 44.815

Group 2
i | Run | Lot Size | Residual (e_i2) | d_i2 | (d_i2 − d̄_2)²
1 1 80 51.02 53.70 637.56
2 8 80 4.02 6.70 473.06
... ... ... ... ... ...
11 20 110 −34.09 31.41 8.76
12 7 120 55.21 57.89 866.10
Total — — — 341.40 9,610.2

ẽ_2 = −2.68, d̄_2 = 28.45

We are now ready to calculate the test statistic:
$$s_p^2 = \frac{12{,}566.6 + 9{,}610.2}{25 - 2} = 964.21$$
Hence,
$$s_p = 31.05$$
Therefore,
$$t_{BF} = \frac{\bar{d}_1 - \bar{d}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} = \frac{44.815 - 28.450}{31.05 \sqrt{\frac{1}{13} + \frac{1}{12}}} = 1.32$$
To control the $\alpha$ risk at 0.05, we require $t(0.975; 23) = 2.069$. The decision rule therefore is:

• If $|t_{BF}| \le 2.069$, conclude the error variance is constant.

• If $|t_{BF}| > 2.069$, conclude the error variance is not constant.

Since $|t_{BF}| = 1.32 \le 2.069$, we conclude that the error variance is constant and does not vary with the level of $x$. The two-sided p-value of this test is 0.20.

Python Code: Brown-Forsythe (Modified Levene) Test


To conduct the Brown-Forsythe (Modified Levene) test in Python, we consider the Toluca Company example dataset given in Section 10.10.3. The listing below applies a binned, multi-group variant of the test to the work-hours values; a direct implementation of the two-group residual procedure described above is sketched after it.
import numpy as np
import pandas as pd
import scipy.stats as stats

# Define the data
LotSize = np.array([80, 30, 50, 90, 70, 60, 120, 80, 100, 50, 40, 70, 90,
                    20, 110, 100, 30, 50, 90, 110, 30, 90, 40, 80, 70])
WorkHrs = np.array([399, 121, 221, 376, 361, 224, 546, 352, 353, 157, 160,
                    252, 389, 113, 435, 420, 212, 268, 377, 421, 273, 468,
                    244, 342, 323])

# Convert to DataFrame
df = pd.DataFrame({'LotSize': LotSize, 'WorkHrs': WorkHrs})

# Define bins for grouping (adjust as needed)
bins = [0, 50, 70, 90, 110, np.inf]  # example bins
labels = ['0-50', '51-70', '71-90', '91-110', '111+']
df['Group'] = pd.cut(df['LotSize'], bins=bins, labels=labels)

# Compute medians for each group
group_medians = df.groupby('Group')['WorkHrs'].median()

# Compute absolute deviations from the group medians
df['Deviation'] = df.apply(
    lambda row: abs(row['WorkHrs'] - group_medians[row['Group']]), axis=1)

# Perform ANOVA on the deviations
anova_result = stats.f_oneway(
    df[df['Group'] == '0-50']['Deviation'],
    df[df['Group'] == '51-70']['Deviation'],
    df[df['Group'] == '71-90']['Deviation'],
    df[df['Group'] == '91-110']['Deviation'],
    df[df['Group'] == '111+']['Deviation']
)

print("Brown-Forsythe Test ANOVA Result:")
print(f"F-statistic: {anova_result.statistic:.4f}")
print(f"P-value: {anova_result.pvalue:.4f}")
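For comparison, the following sketch implements the two-group procedure described above directly: a median split on x, absolute deviations of the residuals from the group medians, and the pooled two-sample t statistic. It assumes the LotSize and WorkHrs arrays from Section 10.10.3 and should reproduce t_BF ≈ 1.32 with a two-sided p-value of about 0.20.

import numpy as np
import statsmodels.api as sm
from scipy import stats

X = sm.add_constant(LotSize)
e = sm.OLS(WorkHrs, X).fit().resid

med_x = np.median(LotSize)
e1 = e[LotSize <= med_x]                  # group 1: x <= median of x
e2 = e[LotSize > med_x]                   # group 2: x > median of x

d1 = np.abs(e1 - np.median(e1))           # deviations from group medians
d2 = np.abs(e2 - np.median(e2))

n1, n2 = len(d1), len(d2)
sp2 = (np.sum((d1 - d1.mean()) ** 2) +
       np.sum((d2 - d2.mean()) ** 2)) / (n1 + n2 - 2)
t_bf = (d1.mean() - d2.mean()) / np.sqrt(sp2 * (1 / n1 + 1 / n2))
p_value = 2 * stats.t.sf(abs(t_bf), n1 + n2 - 2)
print(f"t_BF = {t_bf:.2f}, two-sided p-value = {p_value:.2f}")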

10.10.6 Influential Observations, Outliers, and Cook's Distance

Cook's Distance
Cook's distance measures the influence of each observation on the fitted values of the model. It combines the effects of leverage and residual size to determine the influence of each data point.
$$D_i = \frac{\sum_{j=1}^{n} (\hat{y}_j - \hat{y}_{j(i)})^2}{p \cdot \text{MSE}}$$
where $\hat{y}_j$ is the $j$th fitted value, $\hat{y}_{j(i)}$ is the $j$th fitted value with the $i$th observation removed, $p$ is the number of regression parameters in the model, and MSE is the mean squared error of the model.

Influential Observations
Influential observations are data points that have a large impact on the esti-
mated coefficients of the regression model. They can significantly alter the fit
of the model if removed. Influential observations are identified using Cook’s
distance, where observations with Cook’s distance greater than 4/n (where n
is the number of observations) are considered influential.

Outliers
Outliers are data points that deviate significantly from the rest of the data.
They can affect the regression model’s accuracy and should be investigated
to determine if they are genuine data points or errors. Diagnostic tools such
as leverage plots and Cook’s distance can help identify outliers. Outliers are
identified by selecting observations with Cook’s distance greater than a certain
threshold (here, 4/n).

Python Code: Influential Observations, Outliers, and Cook's Distance

To compute Cook's distance and identify influential observations and outliers in Python, we consider the Toluca Company example dataset given in Section 10.10.3. The Python code is provided in the following.


import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Define the data
LotSize = np.array([80, 30, 50, 90, 70, 60, 120, 80, 100, 50, 40, 70, 90,
                    20, 110, 100, 30, 50, 90, 110, 30, 90, 40, 80, 70])
WorkHrs = np.array([399, 121, 221, 376, 361, 224, 546, 352, 353, 157, 160,
                    252, 389, 113, 435, 420, 212, 268, 377, 421, 273, 468,
                    244, 342, 323])

# Create DataFrame
df = pd.DataFrame({'LotSize': LotSize, 'WorkHrs': WorkHrs})

# Add a constant column for the intercept
X = sm.add_constant(df['LotSize'])
y = df['WorkHrs']

# Fit the linear regression model
model = sm.OLS(y, X).fit()

# Get influence measures
influence = model.get_influence()
cooks_d = influence.cooks_distance[0]

# Create a DataFrame for Cook's Distance
cooks_df = pd.DataFrame({
    'LotSize': df['LotSize'],
    'WorkHrs': df['WorkHrs'],
    "Cook's Distance": cooks_d
})

# Print the Cook's Distance values
print("Cook's Distance:")
print(cooks_df)

# Plot Cook's Distance
plt.figure(figsize=(10, 6))
plt.stem(np.arange(len(cooks_d)), cooks_d, basefmt=" ")
plt.xlabel('Observation Index')
plt.ylabel("Cook's Distance")
plt.title("Cook's Distance for Each Observation")
plt.show()

# Identify influential observations (threshold typically 4/n)
threshold = 4 / len(cooks_d)
influential = cooks_df[cooks_df["Cook's Distance"] > threshold]

print("Influential Observations:")
print(influential)

10.10.7 Multicollinearity
Multicollinearity is a common issue in regression analysis. It occurs when predictor variables in a regression model are highly correlated. Multicollinearity can lead to inaccurate estimates of the regression coefficients and their standard errors.

Multicollinearity: Multicollinearity is a statistical phenomenon where


two or more independent variables in a regression model are highly corre-
lated, leading to unreliable estimates of regression coefficients.
AF Causes of Multicollinearity
Multicollinearity can be caused by:
• High Correlation Among Predictors: When independent vari-
ables are highly correlated with each other.

• Redundant Variables: Including variables that are redundant or linear combinations of other predictors.

• Inclusion of Polynomial or Interaction Terms: When polynomial terms or interaction terms are added without centering the variables.

• Data Collection Issues: Poorly designed experiments or data collection methods that capture similar information across multiple variables.

What is the problem if one feature is a linear combination of other features? If one feature is a linear combination of other features, it indicates
multicollinearity. Let X be the design matrix of dimension n × p, where n is
the number of observations and p is the number of predictors. Each row of
X corresponds to an observation, and each column corresponds to a predictor
variable.
The multiple linear regression model can be represented as:
$$\vec{Y} = X\vec{\beta} + \vec{e}$$
where:

• $\vec{Y}$ is the $n \times 1$ vector of response variable values,

• $\vec{\beta}$ is the $p \times 1$ vector of regression coefficients,

• $\vec{e}$ is the $n \times 1$ vector of errors.

The estimator of $\vec{\beta}$ is
$$\hat{\vec{\beta}} = (X^T X)^{-1} X^T \vec{Y}.$$
If one feature is a linear combination of other features, then the rank of the design matrix $X$ is less than the number of predictors. That is,
$$\text{rank}(X) < p.$$

If the rank of $X$ is less than the number of predictors $p$, then the determinant of $X^T X$ is close to zero. If one feature is an exact linear combination of other features, then the determinant of $X^T X$ is exactly zero. Consequently, $(X^T X)^{-1}$ does not exist, which implies that the Ordinary Least Squares (OLS) estimator $\hat{\vec{\beta}} = (X^T X)^{-1} X^T \vec{Y}$ cannot be computed in such cases, and $\hat{\vec{\beta}}$ and $\text{Var}(\hat{\vec{\beta}}) = (X^T X)^{-1} \sigma^2$ are undefined.

In scenarios where multicollinearity is present, leading to a nearly singular $X^T X$, the elements of the OLS estimator $\hat{\vec{\beta}}$ can become very large and provide unreliable coefficient estimates. Additionally, the variance of $\hat{\vec{\beta}}$ becomes very large. This large variance results in small t-ratios, which undermines the validity of hypothesis tests. Furthermore, the large variance leads to wide confidence intervals for the true parameters, affecting the accuracy of inference about the parameters.

Therefore, when one feature is a linear combination of other features, it introduces significant problems for statistical inference and makes it challenging to interpret the effects of individual predictors on the response variable.
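A small numerical illustration with a hypothetical design matrix, in which one column is an exact multiple of another, makes the rank deficiency visible:

import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = 2 * x1                               # exact linear dependence
X = np.column_stack([np.ones(5), x1, x2])

XtX = X.T @ X
print("rank(X) =", np.linalg.matrix_rank(X))   # 2, not 3
print("det(X^T X) =", np.linalg.det(XtX))      # (numerically) zero

try:
    np.linalg.inv(XtX)                    # fails or yields meaningless values
except np.linalg.LinAlgError as err:
    print("Inversion failed:", err)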

Practical Consequences of Multicollinearity


Some practical consequences of multicollinearity include:

1. Large Variances and Covariances: Although Ordinary Least Squares (OLS) estimators are Best Linear Unbiased Estimators (BLUE), multicollinearity leads to large variances and covariances of the estimators. This makes precise estimation of coefficients difficult.
2. Wider Confidence Intervals: Due to the large variances of the estima-
tors, the confidence intervals for the coefficients tend to be much wider.
This often leads to the acceptance of the null hypothesis that the true
population coefficient is zero, even when it may not be.


3. Insignificant t-Ratios: The large variances result in t-ratios that tend


to be statistically insignificant. This makes it challenging to determine
whether individual coefficients are significantly different from zero.
4. High R2 with Insignificant Coefficients: Despite the t-ratios of one
or more coefficients being statistically insignificant, the overall measure
of goodness of fit, R2 , can still be very high. This can create a misleading
impression of model fit.
5. Sensitivity to Data Changes: OLS estimators and their standard er-
rors can be highly sensitive to small changes in the data. This instability
further complicates the interpretation and reliability of the regression re-
sults.

Variance Inflation Factor (VIF)

The Variance Inflation Factor (VIF) is a measure used to quantify how much the variance of an estimated regression coefficient increases due to multicollinearity. For a given predictor $X_j$ in a multiple regression model, the VIF, denoted $\text{VIF}_j$ for the $j$th predictor, is calculated as:
$$\text{VIF}_j = \frac{1}{1 - R_j^2}$$
where $R_j^2$ is the coefficient of determination from regressing the $j$th predictor variable on all the other predictor variables.

To check for multicollinearity in the model:


1. Calculate the Variance Inflation Factor (VIF) for each predictor variable,
x1i , x2i , . . . , xpi .
2. A commonly used rule of thumb is that if the VIF for any predictor variable exceeds 10, multicollinearity may be a concern. If a predictor is uncorrelated with the remaining predictors (for example, x2 and x3 in a three-variable regression model), its VIF equals 1.
3. Repeat steps 1-2 for each predictor variable in the model.

Python Code: Variance Inflation Factor (VIF)


To calculate VIF in Python using the statsmodels library, we can use the
following code for hypothetical data:
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Example DataFrame (note: X2 = 2*X1 and X3 = 6 - X1 exactly, so the
# predictors are perfectly collinear and the VIFs will be infinite)
df = pd.DataFrame({
    'X1': [1, 2, 3, 4, 5],
    'X2': [2, 4, 6, 8, 10],
    'X3': [5, 4, 3, 2, 1]
})

# Add a constant column for the intercept
X = sm.add_constant(df)

# Calculate VIF for each feature
vif = pd.DataFrame()
vif['Feature'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

print(vif)
AF
Tolerance of Limit (TOL)
The Tolerance of Limit (TOL) is a measure used to quantify how much of the variance of a predictor is not explained by the other predictors in a regression model. It is defined as:
$$\text{TOL}_j = \frac{1}{\text{VIF}_j} = 1 - R_j^2$$
When $R_j^2 = 1$ (i.e., perfect collinearity), $\text{TOL}_j = 0$, and when $R_j^2 = 0$ (i.e., no collinearity whatsoever), $\text{TOL}_j = 1$. Because of the intimate connection between VIF and TOL, one can use them interchangeably.

Python Code: Tolerance of Limit (TOL)


To calculate tolerance in Python, we can use the relationship with VIF, as VIF
is the reciprocal of tolerance. Here’s how we can calculate it using the same
example from before:
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Example DataFrame
df = pd.DataFrame({
    'X1': [1, 2, 3, 4, 5],
    'X2': [2, 4, 6, 8, 10],
    'X3': [5, 4, 3, 2, 1]
})

# Add a constant column for the intercept
X = sm.add_constant(df)

# Calculate VIF and tolerance for each feature
vif = pd.DataFrame()
vif['Feature'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['Tolerance'] = 1 / vif['VIF']

print(vif)

Remedial Measures for Multicollinearity


To address multicollinearity, several remedial measures can be employed:
AF 1. Remove Highly Correlated Predictors:
• Identify and remove one or more predictors that are highly cor-
related with others.
• Use correlation matrices, Variance Inflation Factor (VIF), or other
diagnostics to detect high multicollinearity.
2. Combine Predictors:
• Combine correlated predictors into a single composite variable.
• Use techniques such as Principal Component Analysis (PCA) to
create a new set of uncorrelated variables.

3. Apply Regularization Techniques:


• Regularization methods can help reduce multicollinearity by adding
a penalty to the size of the coefficients.
• Techniques include (illustrated in the sketch after this list):
■ Ridge Regression (L2 Regularization): Adds a penalty
proportional to the square of the magnitude of coefficients.
■ Lasso Regression (L1 Regularization): Adds a penalty
proportional to the absolute value of coefficients, which
can also perform feature selection by shrinking some coef-
ficients to zero.
4. Use Principal Component Analysis (PCA):
• PCA transforms the predictors into a set of orthogonal compo-
nents.


• Replace the original predictors with the principal components to


address multicollinearity.
5. Increase Sample Size:
• A larger sample size can reduce the variance of the coefficient
estimates and mitigate the effects of multicollinearity.
• Collect more data if feasible to improve the stability of the re-
gression model.
6. Center the Variables:
• Centering involves subtracting the mean of each predictor from the predictor values.
• This is particularly useful when dealing with polynomial or interaction terms.
7. Drop or Transform Variables:
AF • Drop variables that do not contribute significantly to the model
or transform variables to reduce multicollinearity.
• Conduct feature selection or use transformations to address is-
sues.
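As a brief illustration of remedial measure 3, the following sketch fits a ridge regression with scikit-learn to a hypothetical, nearly collinear dataset; the penalty strength alpha = 1.0 is an arbitrary choice for demonstration.

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(4)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.01, size=100)  # nearly collinear with x1
X = np.column_stack([x1, x2])
y = 3 * x1 + 2 * x2 + rng.normal(size=100)

ridge = Ridge(alpha=1.0)   # L2 penalty shrinks the coefficients
ridge.fit(X, y)
print("Ridge coefficients:", ridge.coef_)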

10.10.8 Exercises
1. Define multicollinearity. How can it affect the results of a multiple linear
regression model, and how can it be detected?

2. Explain how you would use a residual plot to assess the fit of a multiple
linear regression model. What patterns in the residual plot might suggest
problems with the model?
3. List the key assumptions of linear regression. For each assumption, pro-
vide a brief explanation of why it is important for the validity of the
regression model.
4. Given a dataset with a linear regression model fitted, describe how you
would check each of the assumptions (linearity, independence, homoscedas-
ticity, normality of errors).

5. Explain the Durbin-Watson test and how it is used to test for autocorre-
lation.
6. Explain how to perform the Breusch-Pagan test and the White test for
heteroscedasticity. Interpret the results of these tests and discuss how
they affect the validity of the regression model.


7. How would you handle outliers and influential observations? Discuss


methods such as robust regression or data transformation.
8. Explain how to detect multicollinearity in a regression model using Vari-
ance Inflation Factor (VIF) or condition indices.

9. If multicollinearity is present, what strategies can you use to address it?


Discuss methods like variable selection or principal component analysis.
10. How would you address an influential observation identified through Cook’s
Distance?

10.11 Concluding Remarks
This chapter has equipped you with essential techniques in correlation and re-
gression analysis, crucial for the practice of data science. By examining scatter
diagrams, covariance, and correlation coefficients, we laid the groundwork for
understanding the relationships between variables in a dataset.
The exploration of regression analysis, including both simple and multiple
linear regression models, has highlighted key aspects such as model assumptions,
coefficient interpretation, and model fit evaluation. These tools are fundamen-
tal for building predictive models, making data-driven decisions, and deriving
actionable insights from complex datasets.

The inclusion of Python code examples throughout the chapter bridges the
gap between theoretical concepts and practical implementation, demonstrating
how to apply these techniques in real-world data science scenarios. Mastering
these methods will enhance your ability to analyze data, validate findings, and contribute to evidence-based decision-making across various domains.

In summary, a solid grasp of correlation and regression analysis not only


improves our analytical skills but also enables us to leverage data effectively to
drive strategic decisions and solve complex problems in the field of data science.

10.12 Chapter Exercises


1. A study was conducted to investigate the effect of a new drug on blood
pressure. The dataset contains information on the age of patients, their
baseline blood pressure, and their blood pressure after 6 weeks of treat-
ment with the drug. The following data has been collected:


Patient | Age (years) | Baseline Blood Pressure (mm Hg) | Blood Pressure after 6 weeks (mm Hg) (y)

1 45 140 130
2 50 150 135
3 60 160 140
4 55 155 145
5 65 170 150
6 70 175 155
7 75 180 160
8 80 185 165
9 85 190 170
10 90 195 175
(i) Formulate a linear regression model to study the relationship be-
tween age (x1 ) and baseline blood pressure (x2 ) as independent vari-
ables and the blood pressure after 6 weeks (y) as the dependent
variable.
(ii) State the assumptions of the linear regression model in the context
of this study.
(iii) Estimate the regression coefficients using the least squares method.
Interpret the coefficients for age and baseline blood pressure.
(iv) Using the model obtained in part (iii), predict the blood pressure
after 6 weeks for a 65-year-old patient with a baseline blood pressure of 165 mm Hg.
(v) Discuss the potential impact of multicollinearity between age and
baseline blood pressure on the model’s estimates.
(vi) Calculate the standard error of the estimates and discuss the preci-
sion of the regression coefficients.
(vii) Perform a hypothesis test to determine whether age is a significant
predictor of blood pressure after 6 weeks. Use a 5% significance level.
(viii) Construct a 95% confidence interval for the coefficient of baseline
blood pressure. Interpret the interval.
(ix) Create a residual plot and comment on the model’s assumptions
regarding homoscedasticity and normality of errors.
(x) Discuss the limitations of using this linear regression model for pre-
dicting blood pressure after 6 weeks and suggest possible improve-
ments.


2. An engineer is studying the tensile strength of a new composite material.


The dataset contains information on the fiber content of the composite,
the curing temperature, and the measured tensile strength. The following
data has been collected:

Sample | Fiber Content (%) | Curing Temperature (°C) | Tensile Strength (MPa) (y)

1 30 150 450
2 35 160 470
3 40 170 480
4 45 180 490
5 50 190 510
6 55 200 530
7 60 210 540
8 65 220 550
9 70 230 560
10 75 240 570

(i) Develop a multiple linear regression model where tensile strength (y)
is the dependent variable, and fiber content (x1 ) and curing temper-
ature (x2 ) are independent variables.
(ii) Explain the assumptions of the multiple linear regression model in
the context of this engineering problem.
(iii) Estimate the regression coefficients using the least squares method. Interpret the coefficients for fiber content and curing temperature.


(iv) Using the model obtained in part (iii), estimate the tensile strength
of a composite with 50% fiber content cured at 200◦ C.
(v) Calculate the coefficient of determination (R2 ) and interpret its value
in terms of the model’s goodness-of-fit.
(vi) Conduct a hypothesis test to assess whether fiber content has a sta-
tistically significant effect on tensile strength. Use a 5% significance
level.
(vii) Construct a 95% confidence interval for the coefficient of curing tem-
perature. What does this interval indicate about the relationship
between curing temperature and tensile strength?
(viii) Plot the predicted tensile strength against the actual tensile strength
and comment on the model’s predictive accuracy.


(ix) Discuss how changes in fiber content and curing temperature might
interact to affect tensile strength. Could an interaction term be
included in the model?
(x) Analyze the residuals from the model to check for any violations of
the regression assumptions, such as non-linearity or heteroscedastic-
ity.

3. Suppose age (days), birthweight (oz), and SBP are measured for 16 infants
and the data are as shown in Table 10.12. What is the relationship be-
tween infant systolic blood pressure (SBP) and their age and birthweight?
Can we predict SBP based on these factors?

Table 10.12: Sample data for infant blood pressure, age, and birthweight for 16
infants

i | Age (days) (x1) | Birthweight (oz) (x2) | SBP (mm Hg) (y)

1 3 135 89
2 4 120 90
3 3 100 83
4 2 105 77
5 4 130 92
6 5 125 98
7 2 125 82
8 3 105 85
9 5 120 96
10 4 90 95
11 2 120 80
12 3 95 79
13 3 120 86
14 4 150 97
15 3 160 92
16 3 125 88

Based on the data provided in Example 3, answer the following questions:

(a) Descriptive Statistics:


i. Calculate the mean and standard deviation for the variables:


• Age (days) (x1 )


• Birthweight (oz) (x2 )
• SBP (mm Hg) (y)
ii. What do the mean and standard deviation tell you about the
distribution of each variable?
(b) Scatter Plots:
i. Create a scatter plot of Age (days) (x1 ) vs. SBP (mm Hg) (y).
ii. Create a scatter plot of Birthweight (oz) (x2 ) vs. SBP (mm Hg)
(y).

iii. Visually assess the relationship between SBP and each predictor.
Is the relationship linear or nonlinear?
(c) Correlation Analysis:
i. Calculate the Pearson correlation coefficient between Age (x1 )
and SBP (y).
ii. Calculate the Pearson correlation coefficient between Birthweight
(x2 ) and SBP (y).
iii. Interpret the correlation coefficients. Which variable is more
strongly correlated with SBP?
(d) Simple Linear Regression:
i. Perform a simple linear regression to predict SBP (y) based on
Age (x1 ). Write down the regression equation.
ii. Perform a simple linear regression to predict SBP (y) based on
Birthweight (x2 ). Write down the regression equation.
iii. Interpret the slope of each regression line. What does the slope
tell you about the relationship between each predictor and SBP?

(e) Multiple Linear Regression:


i. Perform a multiple linear regression using Age (x1 ) and Birth-
weight (x2 ) as predictors of SBP (y).
ii. Write down the multiple regression equation.
iii. Interpret the coefficients of Age and Birthweight in the context
of predicting SBP.
(f) Model Comparison:
i. Calculate the R2 value for the simple linear regression models
from Question 3d.
ii. Calculate the R2 value for the multiple linear regression model
from Question 3e.
iii. Compare the R2 values. Which model explains more variance in
SBP?
(g) Prediction:


i. Using the multiple regression model from Question 3e, predict


the SBP for an infant who is 4 days old with a birthweight of
115 oz.
ii. Compare this prediction to the SBP predicted by each of the sim-
ple linear regression models (age-based and birthweight-based).
(h) Model Assumptions:
i. Discuss the assumptions underlying the multiple linear regres-
sion model (e.g., linearity, independence, homoscedasticity, nor-
mality).
ii. How would you check these assumptions with the given data?
iii. Check for multicollinearity between Age and Birthweight. What are the potential implications of multicollinearity on your regression model?
Summarize the relationship between infant systolic blood pressure
(SBP) and the predictors (Age and Birthweight). Based on your
analysis, do you think Age and Birthweight are sufficient to predict
SBP? Suggest other factors that might be important to include in
the model.

4. Consider a dataset containing information about the annual revenue of a


tech company, the number of employees, and the research and develop-
ment (R&D) expenditure. The following data has been collected:

Company | Number of Employees (1000s) | R&D Expenditure (Millions of Dollars) | Annual Revenue (Millions of Dollars) (y)

1 5 8 120
2 12 15 180
3 9 10 150
4 14 20 200
5 18 25 250
6 22 28 280
7 25 30 320
8 30 35 350
9 28 32 340
10 35 40 400

(i) Using the above data, formulate a multiple linear regression model
where the annual revenue (y) is the dependent variable, and both
the number of employees (x1 ) and the R&D expenditure (x2 ) are
independent variables.
(ii) Write down the assumptions of the multiple linear regression model.
(iii) Estimate the regression coefficients for the model formulated in Ques-
tion 1 using the least squares method. Interpret the coefficients for
both the number of employees and R&D expenditure.


(iv) Based on the regression model obtained in part (iii), what would be
the estimated annual revenue for a company with 20,000 employees
and an R&D expenditure of $30 million?
(v) Calculate the standard error of the estimates for the model obtained
in Question (iii).
(vi) How would you evaluate the goodness-of-fit for the regression model?
What statistical metrics would you consider, and why?
(vii) Create scatter plots to show the relationship between:
• Number of employees and annual revenue.
• R&D expenditure and annual revenue.

Overlay the regression lines on these plots and comment on the ob-
served relationships.
(viii) If a new company is planning to hire 25,000 employees and spend
$35 million on R&D, use the regression model from Question (iii) to
predict the annual revenue.
(ix) Discuss the limitations of using the linear regression model in this
context. What are some factors not considered by the model that
could affect the accuracy of your predictions?
(x) Test the significance of each regression coefficient at the 5% signif-
icance level. Clearly state the null and alternative hypotheses, the
test statistic, and your conclusion.
(xi) Construct a 95% confidence interval for the coefficients of the number
of employees and R&D expenditure. Interpret the intervals.

5. Grocery Retailer: A large, national grocery retailer tracks productivity


and costs of its facilities closely. Data given in Table 10.13 were obtained

from a single distribution center for a one-year period. Each data point
for each variable represents one week of activity. The variables included
are:
• The number of cases shipped (X1 )
• The indirect costs of the total labor hours as a percentage (X2 )
• A qualitative predictor called holiday that is coded 1 if the week
has a holiday and 0 otherwise (X3 )
• The total labor hours (Y )

(i). Obtain the scatter plot matrix and the correlation matrix. What
information do these diagnostic aids provide here?
(ii). Write a multiple regression model to the data for three predictor
variables. State the estimate regression function.


(iii). Obtain the residuals and prepare a box plot of the residuals. What
information does this plot provide?
(iv). Plot the residuals against Y , X1 , X2 , X3 , and X1 X2 on separate
graphs. Also prepare a normal probability plot. Interpret the plots
and summarize your findings.
(v). Prepare a time plot of the residuals. Is there any indication that the
error terms are correlated? Discuss.
(vi). Conduct the Brown-Forsythe test for constancy of the error variance,
using α = 0.01. State the decision rule and conclusion.
(vii). Test whether there is a regression relation, using a level of significance of 0.05. State the alternatives, decision rule, and conclusion. What
does your test result imply about β1 , β2 , and β3 ? What is the P-
value of the test?
(viii). Calculate the coefficient of multiple determination R2 . How is this
measure interpreted here?
(ix). Four separate shipments with the following characteristics must be processed next month:

X1 X2 X3
230,000 7.50 0
250,000 7.30 0
280,000 7.10 0
340,000 6.90 0

Management desires predictions of the handling times for these ship-



ments so that the actual handling times can be compared with the
predicted times to determine whether any are out of line. Develop
the needed predictions, using the most efficient approach and a fam-
ily confidence coefficient of 95%.
(x). Three new shipments are to be received, each with X1 = 282, 000,
X2 = 7.10, and X3 = 0.
(a). Obtain a 95% prediction interval for the mean handling time
for these shipments.
(b). Convert the interval obtained in part (a) into a 95% pre-
diction interval for the total labor hours for the three ship-
ments.


Table 10.13: Grocery Retailer

y x1 x2 x3

4264 305657 7.17 0


4496 328476 6.2 0
4317 317164 4.61 0
4292 366745 7.02 0
4945 265518 8.61 1

4325 301995 6.88 0
4110 269334 7.23 0
4111 267631 6.27 0
4161 296350 6.49 0
4560 277223 6.37 0
4401 269189 7.05 0
4251 277133 6.34 0
4222 282892 6.94 0
4063 306639 8.56 0
4343 328405 6.71 0
4833 321773 5.82 1
4453 272319 6.82 0
4195 293880 8.38 0

4394 300867 7.72 0


4099 296872 7.67 0
4816 245674 7.72 1
4867 211944 6.45 1
4114 227996 7.22 0
4314 248328 8.5 0
4289 249894 8.08 0
4269 302660 7.26 0
4347 273848 7.39 0
4178 245743 8.12 0


4333 267673 6.75 0


4226 256506 7.79 0
4121 271854 7.89 0
3998 293225 9.01 0
4475 269121 8.01 0
4545 322812 7.21 0

4016 252225 7.85 0
4207 261365 6.14 0
4148 287645 6.76 0
4562 289666 7.92 0
4146 270051 8.19 0
4555 265239 7.55 0
4365 352466 6.94 0
4471 426908 7.25 0
5045 369989 9.65 1
4469 472476 8.2 0
4408 414102 8.02 0
4219 302507 6.72 0

4211 382686 7.23 0


4993 442782 7.61 1
4309 322303 7.39 0
4499 290455 7.99 0
4186 411750 7.83 0
4342 292087 7.77 0


Appendix

A.1: Table of 1000 random digits

01 32924 22324 18125 09077 26 96772 16443 39877 04653


02 54632 90374 94143 49295 27 52167 21038 14338 01395
03 88720 43035 97081 83373 28 69644 37198 00028 98195
04 21727 11904 41513 31653 29 71011 62004 81712 87536
05 80985 70799 57975 69282 30 31217 75877 85366 55500
06 40412 58826 94868 52632 31 64990 98735 02999 35521
07 43918 56807 75218 46077 32 48417 23569 59307 46550
08 26513 47480 77410 47741 33 07900 65059 48592 44087
09 18164 35784 44255 30124 34 74526 32601 24482 16981
10 39446 01375 75264 51173 35 51056 04402 58353 37332
11 16638 04680 98617 90298 36 39005 93458 63143 21817

12 16872 94749 44012 48884 37 67883 76343 78155 67733


13 65419 87092 78596 91512 38 06014 60999 87226 36071
14 05207 36702 56804 10498 39 93147 88766 04148 42471
15 78807 79243 13729 81222 40 01099 95731 47622 13294
16 69341 79028 64253 80447 41 89252 01201 58138 13809
17 41871 17566 61200 15994 42 41766 57239 50251 64675
18 25758 04625 43226 32986 43 92736 77800 81996 45646
19 06604 94486 40174 10742 44 45118 36600 68977 68831
20 82259 56512 48945 18183 45 73457 01579 00378 70197
21 07895 37090 50627 71320 46 49465 85251 42914 17277
22 59836 71148 42320 67816 47 15745 37285 23768 39302
23 57133 76610 89104 30481 48 28760 81331 78265 60690
24 76964 57126 87174 61025 49 82193 32787 70451 91141
25 27694 17145 32439 68245 50 89664 50242 12382 39379

A.2: Cumulative Distribution Function of the Standard Normal Distribution

z .00 .01 .02 .03 .04 .05 .06 .07 .08 .09

−3.4 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0002
−3.3 0.0005 0.0005 0.0005 0.0004 0.0004 0.0004 0.0004 0.0004 0.0004 0.0003
−3.2 0.0007 0.0007 0.0006 0.0006 0.0006 0.0006 0.0006 0.0005 0.0005 0.0005
−3.1 0.0010 0.0009 0.0009 0.0009 0.0008 0.0008 0.0008 0.0008 0.0007 0.0007
−3.0 0.0013 0.0013 0.0013 0.0012 0.0012 0.0011 0.0011 0.0011 0.0010 0.0010
−2.9 0.0019 0.0018 0.0018 0.0017 0.0016 0.0016 0.0015 0.0015 0.0014 0.0014

−2.8 0.0026 0.0025 0.0024 0.0023 0.0023 0.0022 0.0021 0.0021 0.0020 0.0019
−2.7 0.0035 0.0034 0.0033 0.0032 0.0031 0.0030 0.0029 0.0028 0.0027 0.0026
−2.6 0.0047 0.0045 0.0044 0.0043 0.0041 0.0040 0.0039 0.0038 0.0037 0.0036
−2.5 0.0062 0.0060 0.0059 0.0057 0.0055 0.0054 0.0052 0.0051 0.0049 0.0048
−2.4 0.0082 0.0080 0.0078 0.0075 0.0073 0.0071 0.0069 0.0068 0.0066 0.0064
−2.3 0.0107 0.0104 0.0102 0.0099 0.0096 0.0094 0.0091 0.0089 0.0087 0.0084
−2.2 0.0139 0.0136 0.0132 0.0129 0.0125 0.0122 0.0119 0.0116 0.0113 0.0110
−2.1 0.0179 0.0174 0.0170 0.0166 0.0162 0.0158 0.0154 0.0150 0.0146 0.0143
−2.0 0.0228 0.0222 0.0217 0.0212 0.0207 0.0202 0.0197 0.0192 0.0188 0.0183
−1.9 0.0287 0.0281 0.0274 0.0268 0.0262 0.0256 0.0250 0.0244 0.0239 0.0233
−1.8 0.0359 0.0351 0.0344 0.0336 0.0329 0.0322 0.0314 0.0307 0.0301 0.0294
−1.7 0.0446 0.0436 0.0427 0.0418 0.0409 0.0401 0.0392 0.0384 0.0375 0.0367
−1.6 0.0548 0.0537 0.0526 0.0516 0.0505 0.0495 0.0485 0.0475 0.0465 0.0455
−1.5 0.0668 0.0655 0.0643 0.0630 0.0618 0.0606 0.0594 0.0582 0.0571 0.0559

−1.4 0.0808 0.0793 0.0778 0.0764 0.0749 0.0735 0.0721 0.0708 0.0694 0.0681
−1.3 0.0968 0.0951 0.0934 0.0918 0.0901 0.0885 0.0869 0.0853 0.0838 0.0823
−1.2 0.1151 0.1131 0.1112 0.1093 0.1075 0.1056 0.1038 0.1020 0.1003 0.0985
−1.1 0.1357 0.1335 0.1314 0.1292 0.1271 0.1251 0.1230 0.1210 0.1190 0.1170
−1.0 0.1587 0.1562 0.1539 0.1515 0.1492 0.1469 0.1446 0.1423 0.1401 0.1379
−0.9 0.1841 0.1814 0.1788 0.1762 0.1736 0.1711 0.1685 0.1660 0.1635 0.1611
−0.8 0.2119 0.2090 0.2061 0.2033 0.2005 0.1977 0.1949 0.1922 0.1894 0.1867
−0.7 0.2420 0.2389 0.2358 0.2327 0.2296 0.2266 0.2236 0.2206 0.2177 0.2148
−0.6 0.2743 0.2709 0.2676 0.2643 0.2611 0.2578 0.2546 0.2514 0.2483 0.2451
−0.5 0.3085 0.3050 0.3015 0.2981 0.2946 0.2912 0.2877 0.2843 0.2810 0.2776
−0.4 0.3446 0.3409 0.3372 0.3336 0.3300 0.3264 0.3228 0.3192 0.3156 0.3121
−0.3 0.3821 0.3783 0.3745 0.3707 0.3669 0.3632 0.3594 0.3557 0.3520 0.3483
−0.2 0.4207 0.4168 0.4129 0.4090 0.4052 0.4013 0.3974 0.3936 0.3897 0.3859
−0.1 0.4602 0.4562 0.4522 0.4483 0.4443 0.4404 0.4364 0.4325 0.4286 0.4247
0.0 0.5000 0.4960 0.4920 0.4880 0.4840 0.4801 0.4761 0.4721 0.4681 0.4641

A.3: Cumulative Distribution Function of the Standard Normal Distribution

z .00 .01 .02 .03 .04 .05 .06 .07 .08 .09

0 0.5000 0.5040 0.5080 0.5120 0.5160 0.5200 0.5240 0.5279 0.5319 0.5359
0.1 0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.5754
0.2 0.5793 0.5832 0.5871 0.5910 0.5948 0.5987 0.6026 0.6064 0.6103 0.6141
0.3 0.6179 0.6217 0.6255 0.6293 0.6331 0.6368 0.6406 0.6443 0.6480 0.6517
0.4 0.6554 0.6591 0.6628 0.6664 0.6700 0.6736 0.6772 0.6808 0.6844 0.6879
0.5 0.6915 0.6950 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157 0.7190 0.7224

0.6 0.7258 0.7291 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486 0.7518 0.7549
0.7 0.7580 0.7612 0.7642 0.7673 0.7704 0.7734 0.7764 0.7794 0.7823 0.7852
0.8 0.7881 0.7910 0.7939 0.7967 0.7996 0.8023 0.8051 0.8079 0.8106 0.8133
0.9 0.8159 0.8186 0.8212 0.8238 0.8264 0.8289 0.8315 0.8340 0.8365 0.8389
1.0 0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8600 0.8621
1.1 0.8643 0.8665 0.8686 0.8708 0.8729 0.8749 0.8770 0.8790 0.8810 0.8830
1.2 0.8849 0.8869 0.8888 0.8907 0.8925 0.8944 0.8962 0.8980 0.8997 0.9015
1.3 0.9032 0.9049 0.9066 0.9082 0.9099 0.9115 0.9131 0.9147 0.9162 0.9177
1.4 0.9192 0.9207 0.9222 0.9236 0.9251 0.9265 0.9279 0.9292 0.9306 0.9319
1.5 0.9332 0.9345 0.9357 0.9370 0.9382 0.9394 0.9406 0.9418 0.9430 0.9441
1.6 0.9452 0.9463 0.9474 0.9485 0.9495 0.9505 0.9515 0.9525 0.9535 0.9545
1.7 0.9554 0.9564 0.9573 0.9582 0.9591 0.9599 0.9608 0.9616 0.9625 0.9633
1.8 0.9641 0.9649 0.9656 0.9664 0.9671 0.9678 0.9686 0.9693 0.9700 0.9706
1.9 0.9713 0.9719 0.9726 0.9732 0.9738 0.9744 0.9750 0.9756 0.9762 0.9767

2.0 0.9773 0.9778 0.9783 0.9788 0.9793 0.9798 0.9803 0.9808 0.9812 0.9817
2.1 0.9821 0.9826 0.9830 0.9834 0.9838 0.9842 0.9846 0.9850 0.9854 0.9857
2.2 0.9861 0.9864 0.9868 0.9871 0.9875 0.9878 0.9881 0.9884 0.9887 0.9890
2.3 0.9893 0.9896 0.9898 0.9901 0.9904 0.9906 0.9909 0.9911 0.9913 0.9916
2.4 0.9918 0.9920 0.9922 0.9924 0.9927 0.9929 0.9931 0.9932 0.9934 0.9936
2.5 0.9938 0.9940 0.9942 0.9943 0.9945 0.9946 0.9948 0.9949 0.9951 0.9952
2.6 0.9953 0.9955 0.9956 0.9957 0.9958 0.9959 0.9960 0.9961 0.9962 0.9963
2.7 0.9964 0.9965 0.9966 0.9967 0.9968 0.9969 0.9970 0.9971 0.9972 0.9973
2.8 0.9974 0.9975 0.9976 0.9977 0.9978 0.9979 0.9980 0.9981 0.9982 0.9983
2.9 0.9984 0.9985 0.9986 0.9987 0.9988 0.9989 0.9990 0.9991 0.9992 0.9993
3.0 0.9994 0.9995 0.9996 0.9997 0.9998 0.9999 1.0000 1.0000 1.0000 1.0000

511


Figure 10.18: Standard Normal Distribution


A.4: The left-sided critical points, −za, of the standard normal distribution for the probability a.

a 0 0.001 0.002 0.003 0.004 0.005 0.006 0.007 0.008 0.009

0.01 -2.326 -2.290 -2.257 -2.226 -2.197 -2.170 -2.144 -2.120 -2.097 -2.075
0.02 -2.054 -2.034 -2.014 -1.995 -1.977 -1.960 -1.943 -1.927 -1.911 -1.896
0.03 -1.881 -1.866 -1.852 -1.838 -1.825 -1.812 -1.799 -1.787 -1.774 -1.762
0.04 -1.751 -1.739 -1.728 -1.717 -1.706 -1.695 -1.685 -1.675 -1.665 -1.655
0.05 -1.645 -1.635 -1.626 -1.616 -1.607 -1.598 -1.589 -1.580 -1.572 -1.563
0.06 -1.555 -1.546 -1.538 -1.530 -1.522 -1.514 -1.506 -1.499 -1.491 -1.483
0.07 -1.476 -1.468 -1.461 -1.454 -1.447 -1.440 -1.433 -1.426 -1.419 -1.412
0.08 -1.405 -1.398 -1.392 -1.385 -1.379 -1.372 -1.366 -1.359 -1.353 -1.347
0.09 -1.341 -1.335 -1.329 -1.323 -1.317 -1.311 -1.305 -1.299 -1.293 -1.287
0.10 -1.282 -1.276 -1.270 -1.265 -1.259 -1.254 -1.248 -1.243 -1.237 -1.232

0.11 -1.227 -1.221 -1.216 -1.211 -1.206 -1.200 -1.195 -1.190 -1.185 -1.180
0.12 -1.175 -1.170 -1.165 -1.160 -1.155 -1.150 -1.146 -1.141 -1.136 -1.131
0.13 -1.126 -1.122 -1.117 -1.112 -1.108 -1.103 -1.098 -1.094 -1.089 -1.085
0.14 -1.080 -1.076 -1.071 -1.067 -1.063 -1.058 -1.054 -1.049 -1.045 -1.041

0.15 -1.036 -1.032 -1.028 -1.024 -1.019 -1.015 -1.011 -1.007 -1.003 -0.999

0.16 -0.994 -0.990 -0.986 -0.982 -0.978 -0.974 -0.970 -0.966 -0.962 -0.958
0.17 -0.954 -0.950 -0.946 -0.942 -0.938 -0.935 -0.931 -0.927 -0.923 -0.919
0.18 -0.915 -0.912 -0.908 -0.904 -0.900 -0.896 -0.893 -0.889 -0.885 -0.882
0.19 -0.878 -0.874 -0.871 -0.867 -0.863 -0.860 -0.856 -0.852 -0.849 -0.845
0.2 -0.842 -0.838 -0.834 -0.831 -0.827 -0.824 -0.820 -0.817 -0.813 -0.810



Figure 10.19: Standard Normal Distribution


A.5: The right-sided critical points, za, of the standard normal distribution for the probability a.

a 0 0.001 0.002 0.003 0.004 0.005 0.006 0.007 0.008 0.009


0.01 2.326 2.290 2.257 2.226 2.197 2.170 2.144 2.120 2.097 2.075
0.02 2.054 2.034 2.014 1.995 1.977 1.960 1.943 1.927 1.911 1.896
0.03 1.881 1.866 1.852 1.838 1.825 1.812 1.799 1.787 1.774 1.762
0.04 1.751 1.739 1.728 1.717 1.706 1.695 1.685 1.675 1.665 1.655
0.05 1.645 1.635 1.626 1.616 1.607 1.598 1.589 1.580 1.572 1.563
0.06 1.555 1.546 1.538 1.530 1.522 1.514 1.506 1.499 1.491 1.483
0.07 1.476 1.468 1.461 1.454 1.447 1.440 1.433 1.426 1.419 1.412
0.08 1.405 1.398 1.392 1.385 1.379 1.372 1.366 1.359 1.353 1.347
0.09 1.341 1.335 1.329 1.323 1.317 1.311 1.305 1.299 1.293 1.287
0.1 1.282 1.276 1.270 1.265 1.259 1.254 1.248 1.243 1.237 1.232
0.11 1.227 1.221 1.216 1.211 1.206 1.200 1.195 1.190 1.185 1.180
0.12 1.175 1.170 1.165 1.160 1.155 1.150 1.146 1.141 1.136 1.131
0.13 1.126 1.122 1.117 1.112 1.108 1.103 1.098 1.094 1.089 1.085
0.14 1.080 1.076 1.071 1.067 1.063 1.058 1.054 1.049 1.045 1.041
0.15 1.036 1.032 1.028 1.024 1.019 1.015 1.011 1.007 1.003 0.999

0.16 0.994 0.990 0.986 0.982 0.978 0.974 0.970 0.966 0.962 0.958
0.17 0.954 0.950 0.946 0.942 0.938 0.935 0.931 0.927 0.923 0.919
0.18 0.915 0.912 0.908 0.904 0.900 0.896 0.893 0.889 0.885 0.882
0.19 0.878 0.874 0.871 0.867 0.863 0.860 0.856 0.852 0.849 0.845
0.2 0.842 0.838 0.834 0.831 0.827 0.824 0.820 0.817 0.813 0.810



Figure 10.20: Student’s-t Distribution


A.6: Critical points of the t-distribution with its degrees of freedom (ν)

ν \ a    0.10    0.05    0.025    0.01    0.005    0.001    0.0005

1 3.078 6.314 12.706 31.821 63.657 318.31 636.62


2 1.886 2.920 4.303 6.965 9.925 22.326 31.598
3 1.638 2.353 3.182 4.541 5.841 10.213 12.924
4 1.533 2.132 2.776 3.747 4.604 7.173 8.610
5 1.476 2.015 2.571 3.365 4.032 5.893 6.869
AF 6
7
8
9
1.440
1.415
1.397
1.383
1.943
1.895
1.860
1.833
2.447
2.365
2.306
2.262
3.143
2.998
2.896
2.821
3.707
3.499
3.355
3.250
5.208
4.785
4.501
4.297
5.959
5.408
5.041
4.781
10 1.372 1.812 2.228 2.764 3.169 4.144 4.587
11 1.363 1.796 2.201 2.718 3.106 4.025 4.437
12 1.356 1.782 2.179 2.681 3.055 3.930 4.318
13 1.350 1.771 2.160 2.650 3.012 3.852 4.221
14 1.345 1.761 2.145 2.624 2.977 3.787 4.140
15 1.341 1.753 2.131 2.602 2.947 3.733 4.073
16 1.337 1.746 2.120 2.583 2.921 3.686 4.015
17 1.333 1.740 2.110 2.567 2.898 3.646 3.965

18 1.330 1.734 2.101 2.552 2.878 3.610 3.922


19 1.328 1.729 2.093 2.539 2.861 3.579 3.883
20 1.325 1.725 2.086 2.528 2.845 3.552 3.850
21 1.323 1.721 2.080 2.518 2.831 3.527 3.819
22 1.321 1.717 2.074 2.508 2.819 3.505 3.792
23 1.319 1.714 2.069 2.500 2.807 3.485 3.767
24 1.318 1.711 2.064 2.492 2.797 3.467 3.745
25 1.316 1.708 2.060 2.485 2.787 3.450 3.725
26 1.315 1.706 2.056 2.479 2.779 3.435 3.707
27 1.314 1.703 2.052 2.473 2.771 3.421 3.690
28 1.313 1.701 2.048 2.467 2.763 3.408 3.674
29 1.311 1.699 2.045 2.462 2.756 3.396 3.659
30 1.310 1.697 2.042 2.457 2.750 3.385 3.646
40 1.303 1.684 2.021 2.423 2.704 3.307 3.551
60 1.296 1.671 2.000 2.390 2.660 3.232 3.460
120 1.289 1.658 1.980 2.358 2.617 3.160 3.373
∞ 1.282 1.645 1.960 2.326 2.576 3.090 3.291



Figure 10.21: Chi-Square Distribution


A.7: Percentage points of the chi-square distribution (χ²a,ν)

ν χ2.995 χ2.99 χ2.975 χ2.95 χ2.90 χ2.75 χ2.50 χ2.25 χ2.10 χ2.05 χ2.025 χ2.01 χ2.005 χ2.001
1 0.00 0.00 0.00 0.00 0.02 0.10 0.45 1.32 2.71 3.84 5.02 6.63 7.88 10.83
2 0.01 0.02 0.05 0.10 0.21 0.58 1.39 2.77 4.61 5.99 7.38 9.21 10.60 13.81

3 0.07 0.12 0.22 0.35 0.58 1.21 2.37 4.11 6.25 7.81 9.35 11.34 12.84 16.27
4 0.21 0.30 0.48 0.71 1.06 1.92 3.36 5.39 7.78 9.49 11.14 13.28 14.86 18.47
5 0.41 0.55 0.83 1.15 1.61 2.67 4.35 6.63 9.24 11.07 12.83 15.09 16.75 20.52
6 0.68 0.87 1.24 1.64 2.20 3.45 5.35 7.84 10.64 12.59 14.45 16.81 18.55 22.46
7 0.99 1.24 1.69 2.17 2.83 4.25 6.35 9.04 12.02 14.07 16.01 18.48 20.28 24.32
8 1.34 1.65 2.18 2.73 3.49 5.07 7.34 10.22 13.36 15.51 17.53 20.09 21.95 26.12
9 1.73 2.09 2.70 3.33 4.17 5.90 8.34 11.39 14.68 16.92 19.02 21.67 23.59 27.88
10 2.16 2.56 3.25 3.94 4.87 6.74 9.34 12.55 15.99 18.31 20.48 23.21 25.19 29.59
11 2.60 3.05 3.82 4.57 5.58 7.58 10.34 13.70 17.28 19.68 21.92 24.72 26.76 31.26
12 3.07 3.57 4.40 5.23 6.30 8.44 11.34 14.85 18.55 21.03 23.34 26.22 28.30 32.91
13 3.57 4.11 5.01 5.89 7.04 9.30 12.34 15.98 19.81 22.36 24.74 27.69 29.82 34.53
14 4.07 4.66 5.63 6.57 7.79 10.17 13.34 17.12 21.06 23.68 26.12 29.14 31.32 36.12
15 4.60 5.23 6.27 7.26 8.55 11.04 14.34 18.25 22.31 25.00 27.49 30.58 32.80 37.70
16 5.14 5.81 6.91 7.96 9.31 11.91 15.34 19.37 23.54 26.30 28.85 32.00 34.27 39.25
17 5.70 6.41 7.56 8.67 10.09 12.79 16.34 20.49 24.77 27.59 30.19 33.41 35.72 40.79
18 6.26 7.01 8.23 9.39 10.86 13.68 17.34 21.60 25.99 28.87 31.53 34.81 37.16 42.31
19 6.84 7.63 8.91 10.12 11.65 14.56 18.34 22.72 27.20 30.14 32.85 36.19 38.58 43.82
20 7.43 8.26 9.59 10.85 12.44 15.45 19.34 23.83 28.41 31.41 34.17 37.57 40.00 45.32

21 8.03 8.90 10.28 11.59 13.24 16.34 20.34 24.93 29.62 32.67 35.48 38.93 41.40 46.80
22 8.64 9.54 10.98 12.34 14.04 17.24 21.34 26.04 30.81 33.92 36.78 40.29 42.80 48.27
23 9.26 10.20 11.69 13.09 14.85 18.14 22.34 27.14 32.01 35.17 38.08 41.64 44.18 49.73
24 9.89 10.86 12.40 13.85 15.66 19.04 23.34 28.24 33.20 36.42 39.36 42.98 45.56 51.18
25 10.52 11.52 13.12 14.61 16.47 19.94 24.34 29.34 34.38 37.65 40.65 44.31 46.93 52.62
26 11.16 12.20 13.84 15.38 17.29 20.84 25.34 30.43 35.56 38.89 41.92 45.64 48.29 54.05
27 11.81 12.88 14.57 16.15 18.11 21.75 26.34 31.53 36.74 40.11 43.19 46.96 49.64 55.48
28 12.46 13.56 15.31 16.93 18.94 22.66 27.34 32.62 37.92 41.34 44.46 48.28 50.99 56.89
29 13.12 14.26 16.05 17.71 19.77 23.57 28.34 33.71 39.09 42.56 45.72 49.59 52.34 58.30
30 13.79 14.95 16.79 18.49 20.60 24.48 29.34 34.80 40.26 43.77 46.98 50.89 53.67 59.70
40 20.71 22.16 24.43 26.51 29.05 33.66 39.34 45.62 51.81 55.76 59.34 63.69 66.77 73.40
50 27.99 29.71 32.36 34.76 37.69 42.94 49.33 56.33 63.17 67.50 71.42 76.15 79.49 86.66
60 35.53 37.48 40.48 43.19 46.46 52.29 59.33 66.98 74.40 79.08 83.30 88.38 91.95 99.61
70 43.28 45.44 48.76 51.74 55.33 61.70 69.33 77.58 85.53 90.53 95.02 100.42 104.22 112.32
80 51.17 53.54 57.15 60.39 64.28 71.14 79.33 88.13 96.58 101.88 106.63 112.33 116.32 124.84
90 59.20 61.75 65.65 69.13 73.29 80.62 89.33 98.64 107.56 113.14 118.14 124.12 128.30 137.21
100 67.33 70.06 74.22 77.93 82.36 90.13 99.33 109.14 118.50 124.34 129.56 135.81 140.17 149.45
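
Entries of Table A.7 follow the same convention: the subscript a on χ2 is the right-tail probability, so χ2_{a,ν} is the chi-square quantile at probability 1 - a. A minimal sketch, assuming SciPy:

# Percentage point of the chi-square distribution with nu degrees of freedom
from scipy.stats import chi2

a, nu = 0.05, 10
x_crit = chi2.ppf(1 - a, df=nu)   # inverse CDF at 1 - a
print(round(x_crit, 2))           # 18.31, matching row nu = 10, column chi2_.05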


Table 10.15: A.8. Critical points for α = 0.05 of the F-distribution with
numerator degrees of freedom df1 and denominator degrees of freedom df2.

df1
df2 1 2 3 4 5 6 7 8 9 10 12 15 20
1 161.4 199.5 215.7 224.6 230.2 234.0 236.8 238.9 240.5 241.9 243.9 245.9 248.0
2 18.51 19.00 19.16 19.25 19.30 19.33 19.35 19.37 19.38 19.40 19.41 19.43 19.45
3 10.13 9.55 9.28 9.12 9.01 8.94 8.89 8.85 8.81 8.79 8.74 8.70 8.66
4 7.71 6.94 6.59 6.39 6.26 6.16 6.09 6.04 6.00 5.96 5.91 5.86 5.80
5 6.61 5.79 5.41 5.19 5.05 4.95 4.88 4.82 4.77 4.74 4.68 4.62 4.56

6 5.99 5.14 4.76 4.53 4.39 4.28 4.21 4.15 4.10 4.06 4.00 3.94 3.87
7 5.59 4.74 4.35 4.12 3.97 3.87 3.79 3.73 3.68 3.64 3.57 3.51 3.44
8 5.32 4.46 4.07 3.84 3.69 3.58 3.50 3.44 3.39 3.35 3.28 3.22 3.15
9 5.12 4.26 3.86 3.63 3.48 3.37 3.29 3.23 3.18 3.14 3.07 3.01 2.94
10 4.96 4.10 3.71 3.48 3.33 3.22 3.14 3.07 3.02 2.98 2.91 2.85 2.77
11 4.84 3.98 3.59 3.36 3.20 3.09 3.01 2.95 2.90 2.85 2.79 2.72 2.65
12 4.75 3.89 3.49 3.26 3.11 3.00 2.91 2.85 2.80 2.75 2.69 2.62 2.54
13 4.67 3.81 3.41 3.18 3.03 2.92 2.83 2.77 2.71 2.67 2.60 2.53 2.46
14 4.60 3.74 3.34 3.11 2.96 2.85 2.76 2.70 2.65 2.60 2.53 2.46 2.39
15 4.54 3.68 3.29 3.06 2.90 2.79 2.71 2.64 2.59 2.54 2.48 2.40 2.33
16 4.49 3.63 3.24 3.01 2.85 2.74 2.66 2.59 2.54 2.49 2.42 2.35 2.28
17 4.45 3.59 3.20 2.96 2.81 2.70 2.61 2.55 2.49 2.45 2.38 2.31 2.23
18 4.41 3.55 3.16 2.93 2.77 2.66 2.58 2.51 2.46 2.41 2.34 2.27 2.19

19 4.38 3.52 3.13 2.90 2.74 2.63 2.54 2.48 2.42 2.38 2.31 2.23 2.16
20 4.35 3.49 3.10 2.87 2.71 2.60 2.51 2.45 2.39 2.35 2.28 2.20 2.12
21 4.32 3.47 3.07 2.84 2.68 2.57 2.49 2.42 2.37 2.32 2.25 2.18 2.10
22 4.30 3.44 3.05 2.82 2.66 2.55 2.46 2.40 2.34 2.30 2.23 2.15 2.07
23 4.28 3.42 3.03 2.80 2.64 2.53 2.44 2.37 2.32 2.27 2.20 2.13 2.05
24 4.26 3.40 3.01 2.78 2.62 2.51 2.42 2.36 2.30 2.25 2.18 2.11 2.03
25 4.24 3.39 2.99 2.76 2.60 2.49 2.40 2.34 2.28 2.24 2.16 2.09 2.01
26 4.23 3.37 2.98 2.74 2.59 2.47 2.39 2.32 2.27 2.22 2.15 2.07 1.99
27 4.21 3.35 2.96 2.73 2.57 2.46 2.37 2.31 2.25 2.20 2.13 2.06 1.97
28 4.20 3.34 2.95 2.71 2.56 2.45 2.36 2.29 2.24 2.19 2.12 2.04 1.96
29 4.18 3.33 2.93 2.70 2.55 2.43 2.35 2.28 2.22 2.18 2.10 2.03 1.94
30 4.17 3.32 2.92 2.69 2.53 2.42 2.33 2.27 2.21 2.16 2.09 2.01 1.93
40 4.08 3.23 2.84 2.61 2.45 2.34 2.25 2.18 2.12 2.08 2.00 1.92 1.84
60 4.00 3.15 2.76 2.53 2.37 2.25 2.17 2.10 2.04 1.99 1.92 1.84 1.75
120 3.92 3.07 2.68 2.45 2.29 2.18 2.09 2.02 1.96 1.91 1.83 1.75 1.66
1000 3.85 3.00 2.61 2.38 2.22 2.11 2.02 1.95 1.89 1.84 1.76 1.68 1.58
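
Table A.8 can likewise be reproduced programmatically, which is convenient when the required degrees of freedom are not listed. A minimal sketch, assuming SciPy:

# Upper 5% critical point of the F-distribution with (df1, df2) degrees of freedom
from scipy.stats import f

alpha, df1, df2 = 0.05, 7, 60
f_crit = f.ppf(1 - alpha, dfn=df1, dfd=df2)   # inverse CDF at 1 - alpha
print(round(f_crit, 2))                       # 2.17, matching row df2 = 60, column df1 = 7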
