Foundations of Data Science:
Applied Statistics and Probability with Python

Lecture Note

Professor
Department of Statistics and Data Science
Jahangirnagar University, Savar, Bangladesh
‘Are those who know equal to those who do not know?’ Only they will
remember [who are] people of understanding (Surah Al-Zumar (39:9),
Al-Quran).
Preface
In today’s data-driven world, the ability to analyze and interpret data has
become a crucial skill across various disciplines. As we navigate through an
abundance of information, the principles of statistics and probability serve as
the bedrock upon which data science stands. This book, Foundations of Data
Science: Applied Statistics and Probability with Python, aims to bridge the gap
between theory and practice, providing readers with the tools they need to harness the power of data effectively.
Whether you are a beginner or someone looking to refine your skills, I invite
you to dive in and explore the foundations that will empower you in your data
science endeavors.
Happy learning!
Table of Contents
1 Introduction to Data Science 1
1.1 Welcome to Data Science . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Key Components of Data Science . . . . . . . . . . . . . . . . . . 2
1.3 Concepts of Statistics . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3.1 Population and Sample . . . . . . . . . . . . . . . . . . . 6
1.3.2 Census and Sample Survey . . . . . . . . . . . . . . . . 7
1.3.3 Parameters and Statistic . . . . . . . . . . . . . . . . . 8
1.3.4 Types of Statistics . . . . . . . . . . . . . . . . . . . . 9
1.4 What is Data? . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4.1 Levels of Measurement . . . . . . . . . . . . . . . . . . . . 12
1.4.2 Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.5 Scope of Applied Statistics . . . . . . . . . . . . . . . . . . . . . 17
1.6 Statistical Methods in Data Science . . . . . . . . . . . . . . . . 17
1.7 Overview of Data Science Workflow . . . . . . . . . . . . . . . . 19
1.8 Popular Statistical Analysis Tools . . . . . . . . . . . . . . . . . . 20
1.9 Why Choose Python for This Book? . . . . . . . . . . . . . . . . 20
1.10 Getting Started with Python . . . . . . . . . . . . . . . . . . . . 22
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.2 Measures of Central Tendency . . . . . . . . . . . . . . . . . . . . 49
3.2.1 Arithmetic Mean . . . . . . . . . . . . . . . . . . . . . . . 50
3.2.2 Advantages and Disadvantages of Arithmetic Mean . . . . 50
3.2.3 Harmonic Mean . . . . . . . . . . . . . . . . . . . . . . . 52
Advantages and Disadvantages of Harmonic Mean . . . . 52
3.2.4 Geometric Mean . . . . . . . . . . . . . . . . . . . . . . . 53
Advantages and Disadvantages of Geometric Mean . . . . 54
3.2.5 Relationships Between Arithmetic Mean, Geometric Mean,
3.8.3 Deciles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
3.8.4 Interquartile Range (IQR) . . . . . . . . . . . . . . . . . . 86
3.8.5 Outlier Detection . . . . . . . . . . . . . . . . . . . . . . . 86
3.8.6 Python Code: Dispersion Measures . . . . . . . . . . . . . 88
3.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
3.10 Five-Number Summary and Boxplot . . . . . . . . . . . . . . . . 91
3.10.1 Five-Number Summary . . . . . . . . . . . . . . . . . . . 91
3.10.2 Boxplot . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
3.10.3 Importance of Boxplots . . . . . . . . . . . . . . . . . . . 93
3.10.4 Python Code: Boxplot . . . . . . . . . . . . . . . . . . . . 97
3.11 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
3.12 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . 99
3.13 Chapter Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.4.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
5.5 The Expectation of a Random Variable . . . . . . . . . . . . . . 168
5.5.1 Example: Testing Electronic Components . . . . . . . . . 169
5.5.2 Example: Metal Cylinder Production . . . . . . . . . . . 171
5.5.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
5.6 The Variance of a Random Variable . . . . . . . . . . . . . . . . 172
5.6.1 Example: Metal Cylinder Production . . . . . . . . . . . . 175
5.6.2 Chebyshev's Inequality . . . . . . . . . . . . . . . . . . . 176
5.6.3 Example: Blood Pressure Measurement . . . . . . . . . . 176
5.6.4 Example: Employee Salaries . . . . . . . . . . . . . . . . 177
5.6.5 Quantiles of Random Variables . . . . . . . . . . . . . . . 177
5.6.6 Example: Metal Cylinder Production . . . . . . . . . . . 178
5.6.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
5.7 Essential Generating Functions . . . . . . . . . . . . . . . . . . . 181
5.7.1 Moment Generating Function . . . . . . . . . . . . . . . . 182
5.7.2 Key Properties of MGF . . . . . . . . . . . . . . . . . . . 182
5.7.3 Probability Generating Function (PGF) . . . . . . . . . . 185
5.7.4 Characteristic Function (CF) . . . . . . . . . . . . . . . . 189
6.2.4 Characteristic Function . . . . . . . . . . . . . . . . . . . 230
6.2.5 Probability Generating Function . . . . . . . . . . . . . . 230
6.2.6 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
6.2.7 Applications . . . . . . . . . . . . . . . . . . . . . . . . . 232
6.2.8 Python Code for Bernoulli Distribution . . . . . . . . . . 233
6.2.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
6.3 Binomial Distribution . . . . . . . . . . . . . . . . . . . . . . . 236
6.3.1 Expected Value . . . . . . . . . . . . . . . . . . . . . . . 238
6.3.2 Variance and Standard Deviation . . . . . . . . . . . . . 239
6.3.3 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
6.3.4 Python Code for Binomial Distribution . . . . . . . . . . 243
6.3.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
6.4 Poisson Distribution . . . . . . . . . . . . . . . . . . . . . . . . . 246
6.4.1 Expected Value . . . . . . . . . . . . . . . . . . . . . . . . 251
6.4.2 Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
6.4.3 Moment Generating Function . . . . . . . . . . . . . . . . 253
6.4.4 Characteristic Function . . . . . . . . . . . . . . . . . . . 253
7.4.6 Python Code for Normal Distribution Characteristics . . 307
7.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
7.6 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . 311
7.7 Chapter Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . 312
Normal Distribution (Two-Sided Alternative) . . . . . . . 387
9.10 Single Proportion Test . . . . . . . . . . . . . . . . . . . . . . . . 388
9.10.1 Sample Size Estimation for Proportion Test . . . . . . . . 390
9.11 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . 391
9.12 Chapter Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . 392
10 Correlation and Regression Analysis 396
10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 396
10.2 Scatter Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . 397
10.3 Python Code: Scatter diagram . . . . . . . . . . . . . . . . . . . 399
10.4 Covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 400
10.5 Correlation Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 404
10.6 Pearson’s Correlation Coefficient . . . . . . . . . . . . . . . . . . 404
10.6.1 Interpretation of the value of Correlation Coefficient . . . 405
10.6.2 Properties of the Correlation Coefficient . . . . . . . . . . 407
10.6.3 Testing the Significance of the Correlation Coefficient . . 410
10.6.4 Python Code: Correlation Matrix . . . . . . . . . . . . . 411
10.7 Rank Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . 412
Confidence Interval for E(Y |X = x) . . . . . . . . . . . . 447
10.8.15 Python Code: Linear Regression Model . . . . . . . . . . 449
10.8.16 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 450
10.9 Multiple Linear Regression . . . . . . . . . . . . . . . . . . . . . 454
10.9.1 Model Assumptions . . . . . . . . . . . . . . . . . . . . . 455
10.9.2 Estimation Procedure . . . . . . . . . . . . . . . . . . . . 455
10.9.3 Estimation Procedure of Error Variance . . . . . . . . . . 456
10.9.4 Mean of the OLS Estimator . . . . . . . . . . . . . . . . . 457
10.9.5 Variance of the OLS Estimator . . . . . . . . . . . . . . . 457
10.9.6 Coefficient of Determination . . . . . . . . . . . . . . . . 458
10.9.7 Adjusted R2 . . . . . . . . . . . . . . . . . . . . . . . . . 458
10.9.8 Example Dataset and Regression Calculations . . . . . . . 459
Goodness-of-fit R2 and Adjusted R2 . . . . . . . . . . . . 462
10.9.9 F -test in Multiple Regression . . . . . . . . . . . . . . . . 463
10.9.10 ANOVA Table in Regression Analysis . . . . . . . . . . . 463
10.9.11 The t-tests in Multiple Regression . . . . . . . . . . . . . 464
10.9.12 Python Code: Linear Regression Model . . . . . . . . . . 471
10.9.13 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 473
Chapter 1
Introduction to Data Science
1.1 Welcome to Data Science
Welcome to the exciting world of data science, where numbers tell hidden stories
and reveal valuable insights! In today’s digital era, data is incredibly powerful,
and those who can understand and use it have the key to endless opportu-
nities. Data science is all about extracting meaningful information from vast
amounts of data to make informed decisions, solve complex problems, and drive
innovation.
In the era of big data and advanced analytics, data science has become a
crucial field in both academia and industry. At the heart of data science is the
ability to extract meaningful insights from data, a process that heavily relies on
statistical methods. To succeed in this field, a strong foundation in statistics is
essential. Statistics provides the tools and techniques needed to analyze data,
identify patterns, and draw reliable conclusions. Without this knowledge, it’s
challenging to make sense of the data and leverage its full potential.
CHAPTER 1. INTRODUCTION TO DATA SCIENCE
and data visualization. These skills allow data scientists to collect, clean, and
process data, build predictive models, and present their findings in a clear and
compelling way.
As we embark on this journey, we will dive deep into the statistical founda-
tions that every aspiring data scientist needs to master. We will also explore
how these principles are applied in real-world scenarios, from predicting cus-
tomer behavior to identifying trends in healthcare. So, buckle up and get ready
to unlock the secrets of data science!
1.2 Key Components of Data Science
Data Science is made up of several interrelated components, each playing a
crucial role in transforming raw data into actionable insights. Below are some
of the most important components:
Machine Learning
Machine learning, a subset of artificial intelligence, plays a central role in Data
Science by enabling computers to learn from data and make predictions or
decisions without explicit programming. It involves training models on histor-
ical data to detect patterns and generate accurate forecasts or classifications.
Machine learning is widely applied in areas such as recommendation systems,
fraud detection, image recognition, and natural language processing. Common
approaches include supervised learning, unsupervised learning, and reinforce-
ment learning. Popular tools and libraries used to build and deploy models
include Scikit-learn, TensorFlow, Keras, and PyTorch.
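The core supervised-learning idea described above, inferring labels for new points from labelled training data, can be shown without any library. The sketch below is a toy one-nearest-neighbour classifier on made-up one-dimensional data; real projects would typically use libraries such as Scikit-learn.

```python
# Toy supervised learning: 1-nearest-neighbour classification.
# The training points and labels below are made-up illustrative values.
def predict_1nn(train, label_of, x):
    """Return the label of the training point closest to x."""
    nearest = min(train, key=lambda p: abs(p - x))
    return label_of[nearest]

train = [1.0, 2.0, 8.0, 9.0]
label_of = {1.0: "low", 2.0: "low", 8.0: "high", 9.0: "high"}

print(predict_1nn(train, label_of, 1.5))  # a point near the "low" cluster
print(predict_1nn(train, label_of, 8.5))  # a point near the "high" cluster
```

Real machine-learning methods generalize this pattern: they fit a model to training data and use it to predict outcomes for unseen inputs.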
Deep learning
Deep learning is an advanced subset of machine learning that uses artificial
neural networks with multiple layers to model and understand complex pat-
terns in large volumes of data. It excels at tasks where traditional algorithms
struggle, such as image and speech recognition, natural language processing,
and autonomous systems. Deep learning models, such as convolutional neural
networks (CNNs) and recurrent neural networks (RNNs), learn hierarchical fea-
tures directly from raw data without the need for manual feature extraction.
These models require significant computational power and large datasets to per-
form effectively. Popular frameworks for developing deep learning applications
include TensorFlow, Keras, and PyTorch.
Data Engineering
Data Engineering involves the design, development, and maintenance of systems
and architectures that enable the collection, storage, and processing of large
datasets. Data engineers build data pipelines that move data from various
sources to storage and analytics platforms, ensuring it is clean, reliable, and
accessible for analysis. They work with tools and technologies like SQL, Apache
Spark, Hadoop, Airflow, and cloud services (e.g., AWS, Azure, GCP) to handle
big data efficiently. Data Engineering lays the foundation for data analysis and
machine learning by making high-quality data available to data scientists and
analysts.
Data Visualization
Data visualization is a critical aspect of Data Science that involves represent-
ing data and analytical results through visual formats such as charts, graphs,
maps, and dashboards. It helps transform complex datasets into easily under-
standable insights, enabling quicker interpretation and more informed decision-
making. Effective data visualization allows analysts and stakeholders to iden-
tify patterns, trends, and outliers that might not be immediately evident in raw
data. It is widely used in business intelligence, reporting, and exploratory data
analysis. Common tools and libraries include Tableau, Power BI, Matplotlib,
Seaborn, and Plotly, each offering powerful capabilities for creating both static
and interactive visualizations.
in industries where data is generated at high speed and scale, helping organi-
zations drive innovation, improve operations, and gain a competitive edge.
1.3 Concepts of Statistics

Statistics provides the tools that allow us to make sense of data and draw meaningful conclusions. As a data scientist, you will frequently encounter questions such as:
Suppose a company wants to determine how satisfied its customers are with its new product. Surveying every customer is impractical, so the company surveys a sample of 500 customers out of its 50,000 customers.
Collecting Data
The company distributes a satisfaction survey to 500 randomly selected customers, asking them to rate their satisfaction on a scale from 1 to 10.
Analyzing Data
Once the responses are collected, the company calculates the average satisfac-
tion score from the sample. Suppose the average satisfaction score from the 500
customers is 7.2.
Interpreting Data
Using statistical methods, the company estimates the average satisfaction score
for the entire population of 50,000 customers based on the sample. This involves
calculating confidence intervals to understand the range within which the true
average satisfaction score likely falls.
Presenting Data
The company creates a report with graphs and charts to visualize the distri-
bution of satisfaction scores, the average satisfaction score, and the confidence
interval.
Organizing Data
The company stores the survey data in a database, ensuring it is organized for
future analysis or reference.
Making Inferences
By analyzing the sample data, the company infers that the average satisfaction
score of all its customers is approximately 7.2, with a certain level of confidence
(e.g., 95% confidence interval).
Quantifying Uncertainty
To quantify the uncertainty of their estimate, the company calculates a confi-
dence interval. For instance, they might determine that they are 95% confident
that the true average satisfaction score of all customers is between 6.9 and 7.5.
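This calculation can be sketched with Python's standard library. The text does not give the sample standard deviation, so the value s = 3.4 below is an assumption chosen so that the resulting interval matches the one quoted above:

```python
import math
from statistics import NormalDist

n, xbar = 500, 7.2   # sample size and mean from the survey example
s = 3.4              # assumed sample standard deviation (not given in the text)

se = s / math.sqrt(n)                # standard error of the mean
z = NormalDist().inv_cdf(0.975)      # ~1.96 for a 95% interval
lower, upper = xbar - z * se, xbar + z * se
print(f"95% CI: ({lower:.1f}, {upper:.1f})")
```

With these assumed inputs the interval comes out to roughly (6.9, 7.5), consistent with the example.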
Through collecting, analyzing, interpreting, presenting, and organizing data
from a sample, the company uses statistics to make informed decisions about
customer satisfaction for the entire customer base. This approach helps the
company understand and quantify the uncertainty associated with their esti-
mate, enabling them to make more accurate and reliable business decisions.
1.3.1 Population and Sample
In statistics, it is often impractical or impossible to study an entire population.
Instead, we rely on samples. Understanding the difference between a population
and a sample is fundamental to conducting statistical analyses.
Population
The population refers to the entire group of individuals or instances about whom
we want to draw conclusions. It includes all possible observations or outcomes
that are of interest in a particular study or analysis.
Examples:
• If we are studying the prevalence of diabetes in adults aged 40-60 in
a country, the population would include all adults aged 40-60 in that
country. This would encompass every individual in that age range,
regardless of their health status, socioeconomic background, or other
characteristics.
Sample
A sample is a subset of the population that is selected for the actual study. The
goal is to choose a sample that is representative of the population so that the
findings can be used to make inferences or generalizations about the popula-
tion.
Examples:
• To study the prevalence of diabetes, it would be impractical to examine every adult aged 40-60 in the country. Instead, researchers might select
a sample of 1,000 adults from this age group and measure their blood
sugar levels. The results from this sample can then be used to estimate
the prevalence of diabetes in the entire population of adults aged 40-60.
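A small simulation makes the population-versus-sample idea concrete. The synthetic population below assumes a true prevalence of 12%, which is purely illustrative:

```python
import random

random.seed(42)  # reproducible illustration

# Synthetic population of 100,000 adults: 1 = has diabetes, 0 = does not.
# The assumed true prevalence of 12% is illustrative, not a real figure.
population = [1] * 12_000 + [0] * 88_000

# Draw a simple random sample of 1,000 adults and estimate the prevalence.
sample = random.sample(population, 1000)
estimate = sum(sample) / len(sample)
print(f"Estimated prevalence: {estimate:.1%}")
```

The estimate lands close to the true 12% even though only 1% of the population was examined, which is exactly why sampling works.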
Importance of Census
The census data is essential for:
• Informing government policies and development strategies.
Sample Surveys
Sample surveys are a method of collecting data from a subset of the population
to infer information about the entire population. In Bangladesh, sample surveys
are conducted for various purposes, including economic research, social policy
evaluation, and program assessment.
Types of Sample Surveys
In Bangladesh, different types of sample surveys are carried out by the Bangladesh
Bureau of Statistics (BBS) and other organizations. Some key surveys include:
• Labor Force Survey (LFS): Assesses employment, unemployment, and labor market conditions.
1.3.3 Parameters and Statistic

Parameter
A parameter is a numerical value that summarizes a characteristic of a population. It is a fixed value, although its exact value is often unknown. Parameters describe the entire population.
Examples:
• Population Mean (µ): The average height of all adult men in a
country.
Statistic
A statistic is a numerical value that summarizes a characteristic of a sample. It is used to estimate the corresponding population parameter. Statistics can vary from sample to sample.
Examples:
• Sample Mean (x̄): The average height of 100 randomly selected adult
men from the population.
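The distinction can be illustrated with a short simulation; the population heights below are synthetic values, not real data:

```python
import random
import statistics

random.seed(0)

# Synthetic "population": heights (cm) of 10,000 adult men.
population = [random.gauss(175, 7) for _ in range(10_000)]
mu = statistics.mean(population)   # parameter: fixed, usually unknown in practice

# A statistic computed from one random sample of 100 men.
sample = random.sample(population, 100)
xbar = statistics.mean(sample)     # varies from sample to sample
print(f"mu = {mu:.1f}, x-bar = {xbar:.1f}")
```

Rerunning the sampling step gives a slightly different x̄ each time, while µ stays fixed.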
• Spreads: To understand how different the data points are from each
other, such as how spread out the heights of people in a group are.
• Visual Tools: Charts, graphs, and tables that make it easier to see
patterns and trends in data.
For example, suppose you wanted to know how much time students spend
studying for exams in your school. Instead of asking every student, you could
ask a small group of students (a sample) and then use their answers to make
an estimate about the entire student population.
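The study-time example can be sketched as follows; the hours below are made-up illustrative values:

```python
import statistics

# Hypothetical weekly study hours reported by a sample of 10 students.
sample_hours = [6, 8, 5, 10, 7, 9, 6, 8, 7, 8]

# Descriptive statistics: summarize the sample itself.
mean_hours = statistics.mean(sample_hours)
sd_hours = statistics.stdev(sample_hours)
print(f"mean = {mean_hours:.1f} h, sd = {sd_hours:.1f} h")

# Inferential step: use the sample mean as an estimate of the
# average study time for the entire student population.
```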
Inferential statistics: Techniques for making predictions or inferences
about a population based on sample data.
1.4 What is Data?

Data: Raw facts and figures or units of information that can be collected for analysis, which can be quantitative or qualitative.
Types of Data
Data can be categorized into several types, each with its own characteristics
and uses. These categories help us understand different ways to collect and
analyze information. The primary data types are as follows:
(i). Quantitative Data or Numeric Data: Quantitative data or numeric data deals with numbers and things that can be measured or counted. Examples include:
• How old you are: 15, 16, 17, etc.
• How many students are in a class: 25, 30, 35, etc.
• The temperature in a city: 20°C, 25°C, etc.
(ii). Qualitative Data or Non-numeric Data: Unlike quantitative data,
qualitative data or non-numeric data does not deal with numbers. Instead, it describes categories, qualities, or characteristics. This type of
data is often used to answer questions like “what kind?” or “which one?”
Sometimes, it is called categorical data. Qualitative data includes var-
ious forms of descriptive information, and examples include:
• Your favorite color: Red, Blue, Green, etc.
• The type of pets people have: Dog, Cat, Fish, etc.
• The kind of food people like: Pizza, Pasta, Salad, etc.
1.4.1 Levels of Measurement

Data can be classified into four levels of measurement:
(i). Nominal
(ii). Ordinal
(iii). Interval
(iv). Ratio
Nominal Level
The nominal level of measurement is the most basic type of data categoriza-
tion. It classifies data into distinct categories that do not have a natural order
or ranking. These categories are mutually exclusive and collectively exhaustive,
meaning each observation fits into one and only one category, and all categories
together include all possible observations.
Examples:
• Gender (male, female)
Ordinal Level
The ordinal level of measurement classifies data into categories that have a
meaningful order or ranking among them. Unlike nominal data, ordinal data
allow for comparisons of the relative position of items, but the intervals between
the categories are not necessarily equal or known. Ordinal data are widely used
in surveys, questionnaires, and educational assessments to gauge attitudes, per-
ceptions, and performance levels.
Characteristics of Ordinal Data:
• Categorical with Order: Data are grouped into categories that have
a logical sequence or ranking.
Examples:
• Survey responses such as Strongly Agree, Agree, Neutral, Disagree, Strongly Disagree.
Interval Level
The interval level of measurement involves data that are not only ordered but
also have equal intervals between values. Unlike ordinal data, the differences
between values are meaningful. However, interval data do not have a true zero
point, which means that ratios are not meaningful. Despite lacking a true zero
point, interval data provide a higher level of detail compared to nominal and
ordinal data, allowing for more sophisticated analytical techniques.
Characteristics of Interval Data:
• Equal Intervals: The differences between values are meaningful. For example, consider temperature measured in degrees Celsius. The difference between 10°C and 20°C is 10 degrees, and the difference between 20°C and 30°C is also 10 degrees. These equal intervals allow us to perform meaningful addition and subtraction.
• No True Zero Point: Zero does not indicate the absence of the quantity being measured, so ratios are not meaningful.
Examples:
• Temperature Measurements in Celsius or Fahrenheit: 10°C, 20°C, 30°C.
Ratio Level
The ratio level of measurement is the highest level of measurement and includes
all the properties of the interval level, with the addition of a true zero point.
This allows for meaningful comparisons and calculation of ratios. The pres-
ence of a true zero point allows for a full range of mathematical and statistical
operations, making ratio data the most informative and versatile level of mea-
surement.
Characteristics of Ratio Data:
• True Zero Point: Zero indicates the absence of the quantity being
measured, making ratios meaningful.
Examples:
• Length of Engineering Components: 5 m, 10 m, 15 m.
By grasping these basic statistical concepts, data scientists can better analyze
and interpret data, leading to more accurate and meaningful insights. As we
delve deeper into data science, these foundational principles will serve as the
building blocks for more advanced topics and applications.
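The practical difference between the levels lies in which operations are meaningful. A minimal sketch, assuming a made-up five-point ordinal scale:

```python
# Ordinal data: order comparisons are valid, arithmetic on codes is not.
scale = ["Strongly Disagree", "Disagree", "Neutral", "Agree", "Strongly Agree"]
rank = {label: i for i, label in enumerate(scale)}  # encode the ordering

responses = ["Agree", "Neutral", "Strongly Agree", "Agree"]
print(max(responses, key=rank.get))   # ranking/comparison is meaningful

# Ratio data, by contrast, support full arithmetic:
lengths = [5, 10, 15]                 # metres; zero means "no length"
print(lengths[1] / lengths[0])        # the ratio 2.0 is meaningful
```

Note that subtracting the ordinal ranks (e.g. Agree minus Neutral) would not yield a meaningful quantity, which is exactly what separates ordinal from interval and ratio data.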
1.4.2 Variables
In statistics and data science, a variable is any characteristic or properties that
can take on different values. Variables are essential for research, as they rep-
resent the different factors or elements that can change or vary across different
individuals, conditions, or time periods. They can take on different values de-
pending on the nature of the data being collected.
Understanding the type of a variable is important because it determines the kind of statistical analysis that can be performed. Variables are broadly categorized
into qualitative and quantitative variables, each with further subtypes.
Qualitative Variables
Qualitative variables, also known as categorical variables, describe categories
or groups. These variables represent characteristics that cannot be measured
numerically but can be classified into distinct groups.
Quantitative Variables
Quantitative variables, also known as numerical variables, represent measurable
quantities and can be expressed numerically. They can be further divided into
discrete and continuous variables.
ments, and data collection methods.
the spread of the data.
• Regression Analysis: Common techniques include:
■ Linear Regression: Modeling the relationship between a depen-
dent variable and one or more independent variables.
■ Logistic Regression: Used for binary classification problems where
the outcome is categorical (e.g., yes/no, success/failure).
dataset while preserving as much information as possible.
• Time Series Analysis: Time series methods are used for analyzing
data collected over time. Techniques such as:
■ ARIMA (AutoRegressive Integrated Moving Average): A model
used for forecasting time series data.
■ Seasonal Decomposition: Identifying and removing seasonal pat-
terns from time series data to better understand underlying
trends.
1.8 Popular Statistical Analysis Tools
There are several tools and software commonly used for statistical analysis in
data science, including:
relational databases.
2. Comprehensive Libraries
Python boasts a rich ecosystem of libraries that are essential for statistical
analysis and data science. Some of the most popular libraries include:
• NumPy: Provides support for large multidimensional arrays and ma-
trices, along with a collection of mathematical functions to operate on
these arrays.
and other advanced mathematical functions.
• Statsmodels: Enables statistical modeling, hypothesis testing, and data exploration.
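As a quick taste of this ecosystem, a minimal NumPy sketch with made-up values:

```python
import numpy as np

# Illustrative heights (cm); NumPy applies operations to whole arrays at once.
heights = np.array([170.0, 165.0, 180.0, 175.0])
print(heights.mean())         # arithmetic mean of the array
print(heights.std(ddof=1))    # sample standard deviation
```

Later chapters use these libraries extensively for the statistical methods they cover.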
4. Integration Capabilities
Python integrates well with other languages and technologies. It can be used
alongside other data tools like SQL for database queries, or even integrated
with languages such as R or C++ for specialized tasks. This flexibility makes
Python a versatile tool in a data scientist’s toolkit.
6. Versatility
Python is not only used for data analysis but also for web development, au-
tomation, scripting, and even artificial intelligence and machine learning. This
versatility means that once you learn Python, you can apply your skills to a
wide range of problems and projects.
7. Real-World Applications
Many industry leaders and tech giants like Google, Facebook, and NASA use
Python for data analysis and machine learning. This real-world application un-
derscores Python’s reliability and effectiveness in handling complex data tasks.
8. Continuous Development
Python and its libraries are continuously being developed and improved by the
community, ensuring that users have access to the latest tools and techniques
in data science and statistics.
In summary, Python's combination of simplicity, powerful libraries, community support, integration capabilities, and versatility makes it an ideal choice
for data scientists and statisticians. Throughout this book, we will leverage
Python to demonstrate various statistical tools and methodologies, ensuring
that you can apply what you learn to real-world data challenges effectively.
Learning Path
To get the most out of this book, follow the chapters sequentially. Practice the
examples and exercises provided to reinforce your understanding.
By mastering statistical concepts and methods, data scientists can perform effective analyses
and make data-driven decisions. In the following chapters, we will delve deeper
into specific statistical methods and explore how they can be applied to real-
world data science problems.
By the end of this book, you will have a solid foundation in applied statistics
and probability using Python. You will be equipped with the skills to tackle
real-world data science problems.
11. List and describe three Python libraries commonly used in data science.
What functionalities do they provide?
12. Explain the importance of community and support in choosing Python
for data science.
13. How does Python’s integration capability benefit a data scientist working
on complex projects?
14. Identify the type of data (quantitative or qualitative) for each of the
following:
(i). The colors of cars in a parking lot.
(ii). The heights of students in a class.
(iii). The brands of smartphones owned by a group of people.
(iv). The number of books read by a group of students in a year.
(v). The types of cuisine served at different restaurants.
15. For each of the following variables, identify the level of measurement
(nominal, ordinal, interval, or ratio):
(i). The ranking of movies from a film festival.
(ii). The temperatures in degrees Celsius recorded over a week.
(iii). The number of steps taken by an individual in a day.
(iv). The blood types of patients in a hospital.
(v). The ages of participants in a survey.
16. Categorize the following data sets as either nominal, ordinal, interval, or
ratio:
(i). Survey responses (Strongly Agree, Agree, Neutral, Disagree, Strongly
Disagree).
(ii). Birth years of employees in a company.
(iii). Types of pets owned (dog, cat, bird, etc.).
(iv). Test scores out of 100.
(v). Customer satisfaction ratings on a scale of 1 to 10.
Chapter 2
Data Exploration: Tabular and Graphical Displays
2.1 Introduction
In data science, data exploration is a crucial phase in the data analysis work-
flow, preceding more complex statistical modeling and hypothesis testing. It
provides a comprehensive overview of the dataset, allowing analysts to under-
stand the underlying structure and characteristics of the data. Through this
process, one can identify data quality issues, such as missing values or outliers,
and gain insights that inform the choice of appropriate analytical methods.
This chapter covers basic methods for exploring data using tables and charts.
First, we will look at tables, which are a simple and effective way to organize
and summarize data. Then, we will explore charts and graphs, which help
us see patterns and trends more easily. By looking at both qualitative and
quantitative data through these methods, we aim to build a solid foundation
for more advanced analysis and ensure a strong approach to exploring data.
CHAPTER 2. DATA EXPLORATION: TABULAR AND GRAPHICAL
DISPLAYS
Key Components
• Frequency Tables: Display the count (or frequency) of each distinct
value or category or interval in a dataset, helping to summarize and
understand the distribution of the data.
Example
A frequency table showing the distribution of test scores for a class of students.
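A frequency table like this can be built in a few lines; the letter grades below are made-up illustrative data:

```python
from collections import Counter

# Hypothetical test grades for a class of 10 students.
scores = ["B", "A", "C", "B", "B", "A", "D", "C", "B", "A"]

freq = Counter(scores)        # count each distinct value
for grade in sorted(freq):
    print(grade, freq[grade])
```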
Graphical displays can reveal trends, relationships, and distributions more intuitively than tables.
Common graphical methods include histograms, bar charts, pie charts, stem-
and-leaf plots, and box plots. Each type of graphical display serves a specific
purpose:
Graphical methods are indispensable for exploring the data visually and
communicating findings to a broader audience. They help to simplify complex
datasets and highlight patterns that might not be immediately apparent from
tabular data alone.
Several graphical methods can be used to display categorical data in a way that highlights the frequency and distribution of the different
categories. Here are common methods:
Bar Chart: A chart that uses rectangular bars to represent the frequency
or proportion of categories in a dataset, with the length of each bar corre-
sponding to its value.
The bar chart is a powerful tool for visualizing categorical data. It allows
for easy comparison between different categories. This makes it straightforward
to identify patterns, trends, and outliers within the data.
Problem 2.1. Suppose we conducted a survey asking students about their fa-
vorite movie genre from a list of options: Comedy, Action, Romance, Drama,
and Science Fiction. We gathered responses from a total of 20 students, and
their preferences are given in Table 2.1 and as follows:
To draw the bar chart, the frequency distribution of the students' preferences is given in Table 2.2. The bar chart is presented in Figure 2.1.
Table 2.2: Frequency Distribution

Genre             Frequency (fi)   Relative Frequency
Comedy            4                20%
Action            5                25%
Romance           6                30%
Drama             1                5%
Science Fiction   4                20%
Total             Σ fi = 20        100%
[Figure 2.1: Bar chart of favorite movie genres, with the number of students on the y-axis: Comedy 4, Action 5, Romance 6, Drama 1, Science Fiction 4.]
import matplotlib.pyplot as plt
from collections import Counter

# Survey responses of the 20 students
data = [
    "Comedy", "Science Fiction", "Comedy", "Comedy",
    "Action", "Science Fiction", "Action", "Romance",
    "Action", "Romance", "Science Fiction", "Romance",
    "Romance", "Romance", "Action", "Drama",
    "Comedy", "Romance", "Action", "Science Fiction"
]

# Count the frequency of each genre
movie_counts = Counter(data)
genres = list(movie_counts.keys())
counts = list(movie_counts.values())

# Draw the bar chart
plt.bar(genres, counts)
plt.xlabel('Movie Genre')
plt.ylabel('Number of Students')
plt.title('Favorite Movie Genres')
plt.show()
Pie Chart: A circular chart that represents the proportion of each cate-
gory in a dataset as slices, allowing easy comparison of relative frequencies.
Pie charts are a straightforward way to show how different parts contribute
to a whole, making them a popular choice for visualizing proportions in data
analysis.
Problem 2.2. Refer to Table 2.1 for the dataset needed to create a pie chart.
Analyze the pie chart and interpret the key findings from the data visualized.
Solution
To draw the pie chart, the frequency distribution of the students' preferences
is given in Table 2.2. The pie chart is presented in Figure 2.2.
Table 2.3: Frequency Distribution

Genre             Frequency   Relative Frequency   Angle (degrees)
Comedy            4           20%                  72
Action            5           25%                  90
Romance           6           30%                  108
Drama             1           5%                   18
Science Fiction   4           20%                  72
Total             Σ fi = 20   100%                 360
[Figure 2.2: Pie chart of favorite movie genres: Comedy 20%, Action 25%, Romance 30%, Drama 5%, Science Fiction 20%.]
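The slice sizes in Table 2.3 can be derived directly from the frequencies; this is a minimal sketch of the computation, after which the chart itself could be rendered with matplotlib's `plt.pie(counts, labels=genres, autopct='%1.0f%%')`:

```python
# Frequencies from Table 2.2
genre_counts = {
    "Comedy": 4, "Action": 5, "Romance": 6,
    "Drama": 1, "Science Fiction": 4,
}

total = sum(genre_counts.values())  # 20 students

# Relative frequency (%) and central angle (degrees) of each slice
slices = {
    genre: (100 * f / total, 360 * f / total)
    for genre, f in genre_counts.items()
}

for genre, (pct, angle) in slices.items():
    print(f"{genre}: {pct:.0f}% -> {angle:.0f} degrees")
```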
Problem 2.3. Imagine you conducted a survey to find out the types of physical
activities performed by patients in a rehabilitation program. You collected data
from 100 patients, and the results are as follows:
• Walking: 25 patients
• Cycling: 20 patients
• Swimming: 15 patients
• Yoga: 15 patients
• Strength Training: 10 patients
• Pilates: 8 patients
• Dancing: 7 patients
(a). What is the name of the variable under study? Is this a qualitative vari-
able? If the answer is no, why?
(b). Make a pie chart to represent the distribution of physical activities among
the patients. Write a summary based on this. Be sure to label the chart
accurately and show the percentage of patients for each activity.
Solution
(a). The variable under study is the type of physical activity performed by
patients in the rehabilitation program. Yes, this is a qualitative variable
because it describes categories or types of activities rather than numerical
values.
(b). Pie Chart:
[Pie chart: Walking 25%, Cycling 20%, Swimming 15%, Yoga 15%, Strength Training 10%, Pilates 8%, Dancing 7%.]
Summary:
The pie chart shows the distribution of physical activities among the pa-
tients in the rehabilitation program. Walking is the most common activity,
with 25% of the patients participating in it. This is followed by cycling
(20%), swimming (15%), and yoga (15%). Strength training is performed
by 10% of the patients, while pilates and dancing are the least common
activities, with 8% and 7% participation, respectively.
• Step 3: Decide on the number of classes (k) in the frequency distribution:
k = 1 + 3.322 log10(n).
Alternatively, we can also choose k such that 2^k ≥ n.
• Step 4: Determine the class interval (h) size.
• Step 5: Decide the starting point: the lower class limit or class bound-
ary should cover the smallest value in the raw data.
Solution
1. Step 1: sort data in ascending order
363 369 371 371 377 381 382 386 387 389 390
391 392 393 394 395 399 400 401 405 407 409
2. Step 2:
The minimum observation is 363 and the maximum observation is 431.
3. Step 3: Number of classes
k = 1 + 3.322 log10 (30) = 5.907 ≈ 6
4. Step 4:
h ≥ (Maximum observation − Minimum observation) / Number of classes = (431 − 363)/6 = 11.33 ≈ 12
5. Step 5: Decide the starting point: 360.
Using all steps, the frequency distribution table is presented in Table 2.4.
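Steps 3 and 4 above can be checked in Python; this is a minimal sketch of the class-count rule (known as Sturges' rule) and the class-width computation:

```python
import math

n = 30                     # number of observations
x_min, x_max = 363, 431    # minimum and maximum observations

# Step 3: number of classes, k = 1 + 3.322 * log10(n)
k_raw = 1 + 3.322 * math.log10(n)
k = math.ceil(k_raw)       # 5.907 -> 6 classes

# Step 4: class width h >= (max - min) / k, rounded up
h = math.ceil((x_max - x_min) / k)   # 68 / 6 = 11.33 -> 12

print(k, h)
```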
import pandas as pd

data = [
    # ... (first 12 values truncated in the source)
    431, 401, 363, 391, 405, 382,
    400, 381, 399, 415, 428, 422,
    395, 371, 410, 419, 386, 390
]

df = pd.DataFrame(data, columns=['Expenditure'])
To draw the frequency polygon, we have to calculate the midpoint for each class interval. The
midpoint is the average of the lower and upper bounds of the class interval
(see Table 2.5). Then plot the midpoints on the x-axis and the corresponding
frequencies on the y-axis. The frequency polygon is depicted in Figure 2.3.
Table 2.5: Frequency Distribution of Weekly Expenditure

Class Interval   Frequency   Relative Frequency   Midpoint
360 - 372        4           0.1333               366
372 - 384        3           0.1000               378
384 - 396        9           0.3000               390
396 - 408        5           0.1667               402
408 - 420        5           0.1667               414
420 - 432        4           0.1333               426
Total            30          1.0000
[Figure 2.3: Frequency polygon of weekly expenditure, plotting frequency (y-axis) against expenditure (x-axis, 360 to 432).]
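Because the raw data listing is partly truncated in this chapter, the polygon can be drawn directly from the class midpoints and frequencies in Table 2.5; this sketch closes the polygon with zero-frequency points one class width (12) below and above the data:

```python
import matplotlib.pyplot as plt

# Class midpoints and frequencies from Table 2.5
midpoints = [366, 378, 390, 402, 414, 426]
frequencies = [4, 3, 9, 5, 5, 4]

# Close the polygon with zero-frequency points one class width away
xs = [midpoints[0] - 12] + midpoints + [midpoints[-1] + 12]
ys = [0] + frequencies + [0]

plt.plot(xs, ys, marker='o', label='Frequency Polygon')
plt.xlabel('Expenditure')
plt.ylabel('Frequency')
plt.title('Frequency Polygon')
plt.legend()
plt.show()
```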
data = [
    # ... (first values truncated in the source)
    400, 381, 399, 415, 428, 422,
    395, 371, 410, 419, 386, 390
]

# Define the class intervals (bins): from 360 to 432 with an interval of 12
bins = range(360, 440, 12)
labels = [f'{bins[i]}-{bins[i+1]-1}' for i in range(len(bins) - 1)]  # Labels for each bin
Table 2.6: Distribution of Weekly Expenditure of 30 Students
In the frequency polygon, the peak at the midpoint of 390 indicates that most
students’ expenditures fall around this value. The shape of the polygon, with
a rise to the peak and a gradual decline, shows that while expenditures are
somewhat concentrated in the 384-396 range, there is a moderate spread across
other ranges. This visualization helps quickly grasp the central tendency and
variability of expenditures among students.
Figure 2.4: Ogive Curve of Weekly Expenditure of 30 Students
[Ogive: cumulative frequency (0 to 30, y-axis) plotted against expenditure (360 to 432, x-axis).]
In the example of weekly expenditures for 30 students, the ogive curve illustrates that as expenditure increases, the cumulative number of students also
rises. The curve starts at the cumulative frequency of 4 for the interval 360-372 and gradually increases to 30 for the interval 420-432, reflecting that 30
students' expenditures are up to 432. The steepness of the curve indicates intervals with higher frequencies, while flatter sections show lower frequencies.
Key features like the median can be identified where the curve reaches 50%
of the total cumulative frequency (15 students), and the quartiles reveal the
spread of expenditures across different percentiles. Overall, the ogive provides
insights into data distribution, helping to visualize the proportion of students
spending up to various amounts.
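The cumulative frequencies behind the ogive can be sketched from the grouped frequencies alone (upper class boundaries from Table 2.5):

```python
from itertools import accumulate

# Frequencies and upper class boundaries from Table 2.5
frequencies = [4, 3, 9, 5, 5, 4]
upper_bounds = [372, 384, 396, 408, 420, 432]

# Running total of frequencies: the y-values of the ogive
cumulative = list(accumulate(frequencies))

for ub, cf in zip(upper_bounds, cumulative):
    print(f"up to {ub}: {cf} students")
```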
df['Bins'] = pd.cut(df['Expenditure'], bins=bins, labels=labels, right=False)
frequency_distribution = df['Bins'].value_counts().sort_index()
cumulative_frequency = frequency_distribution.cumsum()
plt.show()
2.5.7 Histogram
A histogram is a graphical representation of the distribution of numerical data.
It consists of a series of adjacent rectangles, or bars, where each bar’s height
corresponds to the frequency or count of data points falling within a specific
range or bin. It provides a visual summary of data distribution, helping to
identify patterns such as trends, peaks, and the spread of data.
The histogram below shows the distribution of
weekly expenditures for 30 students. The x-axis represents the expenditure
ranges (bins), and the y-axis represents the number of students in each range.
[Histogram of weekly expenditure: frequency (y-axis) for the bins 360-372 through 420-432 (x-axis).]
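Since the raw data listing is truncated in the source, the histogram's bar heights can still be recovered from the grouped frequencies: placing each class at its midpoint and weighting by the class frequency reproduces the counts. This is a sketch of that idea:

```python
import numpy as np

# Bin edges, midpoints, and grouped frequencies from Table 2.5
bins = [360, 372, 384, 396, 408, 420, 432]
midpoints = [366, 378, 390, 402, 414, 426]
frequencies = [4, 3, 9, 5, 5, 4]

# Each midpoint falls in its own bin, so weighting by the class
# frequency reproduces the histogram counts without the raw data
counts, edges = np.histogram(midpoints, bins=bins, weights=frequencies)

print(counts)  # one bar height per class
```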
data = [
    # ... (first values truncated in the source)
    400, 381, 399, 415, 428, 422,
    395, 371, 410, 419, 386, 390
]

# Define the bin intervals with an interval of 12
bin_start = 360
bin_end = 440
bin_interval = 12
bins = list(range(bin_start, bin_end + bin_interval, bin_interval))
Construct a stem-and-leaf plot for the following systolic blood pressure readings
(in mmHg) of 15 patients.
120, 122, 124, 126, 128, 130, 132, 134, 136, 138, 140, 142, 144, 146, 148
Solution
The stem-and-leaf plot is presented in Table 2.7. The “stem” represents the
tens and the “leaf” represents the ones digit. For example, for 120, the stem is
12 and the leaf is 0.
• The leaves show the units digits for each stem, indicating the exact values in the dataset.
Solution
The temperatures recorded in Celsius are:
Stem   Leaf
−15    4
−10    0
−7     2
−3     8
0      5
6      0
12     0
18     9
21     6
25     3
In this plot:
• The ‘stem’ represents the integer part (before the decimal point) of
each temperature. For instance, the temperature 25.3 can be broken
down into a stem of 25 (representing the integer part) and a leaf of 3
(representing the decimal part).
• The ‘leaf’ represents the decimal part (after the decimal point) of each
temperature.
from collections import defaultdict

# Data
data = [25, -4, 12, 1, -10, 19, -7, 6, 22, -15]

# Split into non-negative and negative values; absolute values are used
# for the negative side so that stems and leaves are well defined
pos_data = [v for v in data if v >= 0]
neg_data = [abs(v) for v in data if v < 0]

def create_stem_leaf(values):
    stem_leaf = defaultdict(list)
    for value in values:
        stem = value // 10
        leaf = value % 10
        stem_leaf[stem].append(leaf)
    return stem_leaf

pos_stem_leaf = create_stem_leaf(pos_data)
neg_stem_leaf = create_stem_leaf(neg_data)
These methods provide a foundation for statistical analysis and more complex data science endeavors. As you move forward,
the principles outlined here will serve as a cornerstone for more advanced topics,
reinforcing the importance of clear, accurate, and insightful data presentation
in the broader field of data science.
1. Flu, Cold, Flu, Allergies, Cold, Flu, Cold, Flu, Allergies, Flu, Flu, Allergies, Cold, Flu, Cold, Flu, Cold, Allergies, Cold, Flu
Construct a frequency distribution table for the diseases. Draw a bar chart
and a pie chart based on the frequency distribution. Interpret the results
and comment on the most and least common diseases in the sample.
2. A local community center conducted a survey to find out the preferred
recreational activities of its members. The results of the survey are sum-
marized below:
Running 10
3. The following table given in Table 2.8, shows the distribution of sales (in
thousands of units) of five different products in a company during the first
quarter of the year.
Table 2.8: Sales Distribution of Products
(a). Calculate the percentage share of each product in the total sales.
(b). Make a pie chart. Write a summary based on this.
4. Imagine you conducted a survey to find out how people spend their leisure
time on a typical weekend. You collected data from 100 respondents, and
the results are as follows:
• Watching TV: 30 respondents
• Reading: 20 respondents
• Playing Sports: 15 respondents
• Socializing with Friends: 10 respondents
• Playing Video Games: 10 respondents
• Hiking and Outdoor Activities: 8 respondents
45, 67, 53, 52, 61, 59, 68, 72, 56, 54,
63, 75, 49, 62, 60, 58, 66, 64, 55, 70
102, 98, 105, 110, 95, 107, 101, 99, 103, 106,
104, 100, 108, 97, 96, 109, 111, 94, 93, 92
Create a frequency distribution table using an appropriate number of
classes and appropriate class intervals for the given ages.
7. Here are the weekly sales figures (in units) for a pharmaceutical company
over a period of 20 weeks:
23, 27, 31, 35, 29, 33, 22, 28, 26, 34,
32, 25, 30, 24, 36, 21, 37, 38, 39, 20
Construct a frequency distribution table using an appropriate number of
classes and appropriate class intervals for the given sales figures.
8. The following data represents the weights (in kilograms) of a sample of
fruits used in a nutritional study:
5.2, 6.3, 7.1, 8.4, 5.5, 6.8, 7.6, 8.0, 5.9, 6.1,
7.3, 8.7, 5.0, 6.5, 7.8, 8.1, 5.7, 6.9, 7.4, 8.5
Make a frequency distribution table using an appropriate number of classes
and appropriate class intervals for the given weights.
9. You are given the following set of scores from a recent medical examina-
tion:
82, 91, 85, 87, 89, 95, 88, 92, 84, 90,
93, 83, 86, 96, 94, 81, 97, 98, 99, 80
Create a frequency distribution table using an appropriate number of
classes and appropriate class intervals for the given exam scores.
10. Consider the following data set representing the heights (in cm) of 10
plants:
Data: 150, 155, 160, 162, 165, 168, 170, 175, 180, 185
(a). Construct a stem-and-leaf plot for the data set.
(b). What is the maximum height in the data set?
Chapter 3
Data Exploration: Numerical Measures
3.1 Introduction
In this chapter, we delve into the fundamental concepts of data exploration
with a focus on numerical measures. Understanding these measures is crucial
for analyzing and interpreting data effectively. We begin by examining vari-
ous measures of central tendency, which provide insights into the typical value
within a dataset. These include the arithmetic mean, harmonic mean, geometric
mean, and median, each with its unique properties, advantages, and limitations.
Next, measures of dispersion, such as the range, variance, and standard
deviation, help us understand the spread and variability of data points around
the central value. The chapter will also cover measures of distribution shape,
including skewness and kurtosis, which describe the asymmetry and peakedness
of the data distribution.
Moreover, we will discuss quartiles, percentiles, and deciles, which are instru-
mental in dividing the data into meaningful segments, and outline methods for
detecting outliers. The chapter concludes with an overview of the five-number
summary and boxplots, essential tools for summarizing and visualizing data
distribution. Python code will be provided throughout to illustrate practical
applications of these concepts.
CHAPTER 3. DATA EXPLORATION: NUMERICAL MEASURES
(i). Mean
(ii). Median
(iii). Mode
These measures provide a single value that represents the middle or center
of the data distribution and are essential for summarizing large datasets.
\[
\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i
\]
Consider a study measuring the time in days for patients to recover from a
specific illness, with recovery times being 4, 5, 6, 7, 8 days. We can use the
arithmetic mean to find the average recovery time:
\[
\bar{x} = \frac{4 + 5 + 6 + 7 + 8}{5} = 6 \text{ days}.
\]
Advantages
(i) Rigidity and Simplicity: It is rigidly defined, simple, easy to under-
stand, and easy to calculate.
(ii) Uses all data points: It is based upon all the observations in the data
set.
(iii) Uniqueness: Its value being unique allows for comparisons between dif-
ferent sets of data.
(iv) Mathematical Properties: The arithmetic mean has useful mathematical properties. For instance, it can be used in further statistical analysis,
like calculating variance and standard deviation.
(v) Best for Symmetric Distributions: The arithmetic mean is a reliable
measure of central tendency when the data follows a symmetric distribu-
tion (like a normal distribution), where the mean is the most representa-
tive value.
(vi) Stability: It is least affected by sampling fluctuations compared to other
measures of central tendency.
Disadvantages
(i) Sensitivity to Outliers: The mean is highly affected by extreme values
or outliers. A few very high or low numbers can skew the mean, making
it not represent the ”typical” value of the data set.
(ii) Not Suitable for Skewed Distributions: In data sets that are heavily
skewed, the mean may not reflect the central location accurately, as it can
be pulled toward the extreme values in the tail.
Problem 3.1. Suppose we have the following data on the systolic blood pressure
(in mmHg) of 10 patients:
120, 130, 125, 140, 135, 128, 132, 138, 124, 126
Solution
To calculate the mean systolic blood pressure:
\[
\bar{x} = \frac{120 + 130 + 125 + 140 + 135 + 128 + 132 + 138 + 124 + 126}{10} = \frac{1298}{10} = 129.8 \text{ mmHg}
\]
The formula for the harmonic mean x̄HM of n values x1, x2, . . . , xn is:
\[
\bar{x}_{HM} = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}
\]
Using the recovery times 4, 5, 6, 7, 8 days, we calculate the harmonic mean to
find the average rate of recovery.
\[
\bar{x}_{HM} = \frac{5}{\frac{1}{4} + \frac{1}{5} + \frac{1}{6} + \frac{1}{7} + \frac{1}{8}} = \frac{5}{0.25 + 0.20 + 0.1667 + 0.1429 + 0.125} \approx 5.65
\]
So, the harmonic mean of the recovery times is approximately 5.65 days,
which represents an average recovery time weighted by the rates of recovery.
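The harmonic mean of the recovery times can be verified with the standard library's `statistics.harmonic_mean`:

```python
from statistics import harmonic_mean

recovery_times = [4, 5, 6, 7, 8]

hm = harmonic_mean(recovery_times)
print(round(hm, 2))  # 5.65
```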
Disadvantages
(i) Sensitivity to Zero Values: The harmonic mean cannot be computed
if any value in the dataset is zero, as it involves division by the values.
(ii) Less Intuitive: It is less intuitive than the arithmetic mean and is not as
commonly used, which can make interpretation and communication more
challenging.
(iii) Not Suitable for All Data Types: It is not suitable for data that does
not represent rates or ratios. It is not typically used for general numerical
data where other means are more appropriate.
(iv) Potential for Misleading Results: In cases where there is significant
variability in the data, particularly if large values are present, the har-
monic mean can provide misleading results.
(v) Complex Calculation: The calculation of the harmonic mean is more
complex compared to the arithmetic mean, which can be a drawback in
some practical applications.
\[
\bar{x}_{GM} = \sqrt[n]{x_1 \times x_2 \times \cdots \times x_n} = \left(\prod_{i=1}^{n} x_i\right)^{1/n}
\]
(iii) Suitable for Proportional Data: It is well suited when
values are in proportional or percentage terms, such as in economic or
financial analysis.
(iv) Stability in Long-Term Growth Rates: In contexts like investment
returns, the geometric mean offers a more accurate measure of average
growth rates over time compared to the arithmetic mean.
Disadvantages
(i) Cannot Handle Zero or Negative Values: The geometric mean is
undefined for datasets containing zero or negative values, as it involves
taking the nth root of the product of values.
(ii) Less Intuitive: It is less intuitive and harder to understand compared
to the arithmetic mean, making it less accessible for some audiences.
(iii) Requires Logarithmic Transformation: Calculating the geometric
mean involves the logarithm of values, which adds complexity compared
to simpler averages.
x̄ ≥ x̄GM ≥ x̄HM ,
where x̄, x̄GM , and x̄HM represent the arithmetic mean, geometric mean, and
harmonic mean, respectively. This inequality reflects a fundamental property
of these measures of central tendency.
Theorem 3.1. Let x1, x2, . . . , xn be positive real numbers. Let x̄AM denote
the arithmetic mean, x̄GM denote the geometric mean, and x̄HM denote the
harmonic mean of these numbers. Then,
\[
\bar{x}_{AM} \ge \bar{x}_{GM} \ge \bar{x}_{HM},
\]
with equality if and only if x1 = x2 = · · · = xn.

Proof. Part 1: Proof of x̄AM ≥ x̄GM (AM-GM Inequality)
Consider the function f (x) = − ln(x) for x > 0. The first derivative is
f ′(x) = −1/x, and the second derivative is f ′′(x) = 1/x². Since f ′′(x) > 0 for all
x > 0, the function f (x) = − ln(x) is strictly convex on the interval (0, ∞).
By Jensen's inequality for the strictly convex function f (x) = − ln(x),
\[
-\ln(\bar{x}_{AM}) \le \frac{1}{n}\sum_{i=1}^{n}\bigl(-\ln(x_i)\bigr) = -\ln\left(\left(\prod_{i=1}^{n} x_i\right)^{1/n}\right) = -\ln(\bar{x}_{GM}).
\]
Since the natural logarithm function ln(x) is strictly increasing, the function
− ln(x) is strictly decreasing. Therefore, multiplying by −1 and reversing the
inequality sign gives:
\[
\ln(\bar{x}_{AM}) \ge \ln(\bar{x}_{GM}).
\]
Again, due to the strictly increasing nature of ln(x), we conclude:
\[
\bar{x}_{AM} \ge \bar{x}_{GM}.
\]
Part 2: Proof of x̄GM ≥ x̄HM (GM-HM Inequality)
Consider the reciprocals of the positive numbers x1, x2, . . . , xn, which are
1/x1, 1/x2, . . . , 1/xn. Applying the AM-GM inequality (proven in Part 1) to these
positive numbers, we have:
\[
\frac{\frac{1}{x_1} + \frac{1}{x_2} + \cdots + \frac{1}{x_n}}{n} \ge \left(\frac{1}{x_1} \cdot \frac{1}{x_2} \cdots \frac{1}{x_n}\right)^{1/n} = \frac{1}{\left(\prod_{i=1}^{n} x_i\right)^{1/n}} = \frac{1}{\bar{x}_{GM}}.
\]
Taking the reciprocal of both sides of the inequality (since both sides are
positive, the inequality sign reverses):
\[
\frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}} \le \bar{x}_{GM}, \quad \text{i.e.,} \quad \bar{x}_{HM} \le \bar{x}_{GM}.
\]
The equality in the AM-GM inequality applied to 1/x1, 1/x2, . . . , 1/xn holds if and
only if 1/x1 = 1/x2 = · · · = 1/xn. This condition is equivalent to x1 = x2 = · · · = xn.
Therefore, equality in x̄GM ≥ x̄HM holds if and only if x1 = x2 = · · · = xn.
3.2.6 Median
The median is a measure of central tendency that divides a dataset into two
equal halves. Let x1 , x2 , . . . , xn be a set of n numerical observations. To find
the median, first arrange the data in ascending (or descending) order. Let the
ordered data set be denoted by x(1) , x(2) , . . . , x(n) , where x(i) is the i-th value
in the ordered set.
The median is then defined as follows:
Let m = (n + 1)/2. Then
\[
\text{Median} =
\begin{cases}
x_{(m)} & \text{if } m \text{ is an integer (i.e., } n \text{ is odd)} \\[4pt]
\dfrac{x_{(\lfloor m \rfloor)} + x_{(\lceil m \rceil)}}{2} & \text{if } m \text{ is not an integer (i.e., } n \text{ is even)}
\end{cases}
\]
Here, ⌊m⌋ is the floor function (the greatest integer less than or equal
to m), and ⌈m⌉ is the ceiling function (the smallest integer greater than
or equal to m).
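The definition above translates directly into code; this is a minimal sketch using the same floor/ceiling convention:

```python
import math

def median(values):
    """Median following the definition above, with m = (n + 1) / 2."""
    ordered = sorted(values)
    n = len(ordered)
    m = (n + 1) / 2
    if m == int(m):                      # n is odd
        return ordered[int(m) - 1]       # x_(m), 1-based indexing
    lo = ordered[math.floor(m) - 1]      # x_(floor(m))
    hi = ordered[math.ceil(m) - 1]       # x_(ceil(m))
    return (lo + hi) / 2

print(median([7, 8, 5, 6, 9, 7, 6, 10, 8]))   # 7    (odd n)
print(median([85, 92, 78, 95, 88, 80]))       # 86.5 (even n)
```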
Since there are seven observations (an odd number), the median is the fourth
value. In this case, m = (n + 1)/2 = (7 + 1)/2 = 4, which is an integer, and hence the
median is x_(4).
the median is found by taking the average of the fourth and fifth values. In this
case, m = (n + 1)/2 = (8 + 1)/2 = 4.5 is not an integer. Hence, the median is
\[
\text{Median} = \frac{x_{(\lfloor 4.5 \rfloor)} + x_{(\lceil 4.5 \rceil)}}{2} = \frac{x_{(4)} + x_{(5)}}{2} = \frac{225 + 230}{2} = 227.5
\]
The median is particularly useful in data science for understanding the typical
value of a dataset because it is not skewed by extreme values.
Problem 3.2. Consider the following data on the number of hours of sleep per
night for a group of 9 adults:
7, 8, 5, 6, 9, 7, 6, 10, 8
What is the median hours of sleep per night?
Solution
First, arrange the data in ascending order:
5, 6, 6, 7, 7, 8, 8, 9, 10
In this case, n = 9, so m = (n + 1)/2 = (9 + 1)/2 = 5, which is an integer. Hence,
Median = x_(5) = 7 hours
Problem 3.3. Consider the following data on the test scores of 6 students:
85, 92, 78, 95, 88, 80
Solution
First, arrange the data in ascending order:
78, 80, 85, 88, 92, 95
In this case, n = 6, so m = (n + 1)/2 = (6 + 1)/2 = 3.5, which is not an integer.
Hence, we use the second case of the median formula:
\[
\text{Median} = \frac{x_{(\lfloor 3.5 \rfloor)} + x_{(\lceil 3.5 \rceil)}}{2} = \frac{x_{(3)} + x_{(4)}}{2}
\]
From the ordered data, x_(3) = 85 and x_(4) = 88. Therefore,
\[
\text{Median} = \frac{85 + 88}{2} = \frac{173}{2} = 86.5
\]
The median test score is 86.5.
(iv) Applicable to Ordinal Data: Can be used with ordinal data, where
data values are ranked but not necessarily numeric.
Disadvantages of Median
(i) Ignores Data Values: Does not take into account the magnitude of all
data values, only their order.
3.2.8 Mode
The mode is the value that appears most frequently in a dataset. A dataset can
have more than one mode if multiple values have the same highest frequency.
The mode is useful for categorical data or when identifying the most common
value.
Problem 3.4. Consider a manufacturing process where engineers are measur-
ing the diameter of a set of machine components to ensure they meet quality
specifications. The following is a list of diameters (in millimeters) of 20 com-
ponents that were measured:
50, 52, 51, 50, 53, 52, 54, 50, 51, 52, 55, 50, 53, 50, 51, 52, 55, 50, 52, 51
Solution
To find the mode of this dataset:
50, 52, 51, 50, 53, 52, 54, 50, 51, 52, 55, 50, 53, 50, 51, 52, 55, 50, 52, 51
• Diameter 50 occurs 6 times.
• Diameter 52 occurs 5 times.
• Diameter 51 occurs 4 times.
• Diameters 53 and 55 occur 2 times each.
• Diameter 54 occurs once.
Hence, the mode is 50 mm, the most frequently occurring diameter.
Problem 3.5. The blood types of 15 patients are recorded as follows:
A, O, B, AB, O, A, B, O, A, A, B, O, O, A, B
Solution
The frequency distribution of blood types is:
• A: 5
• B: 4
• O: 5
• AB: 1
Since blood types A and O both have the highest frequency, the dataset is
bimodal:
Mode = A and O
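The standard library's `statistics.multimode` returns every value tied for the highest frequency, which handles bimodal data like the blood types above:

```python
from statistics import multimode

blood_types = ["A", "O", "B", "AB", "O", "A", "B", "O",
               "A", "A", "B", "O", "O", "A", "B"]

modes = multimode(blood_types)
print(modes)  # ['A', 'O'] -- both occur five times
```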
(iii) Reflects Commonality: Represents the value or values that occur most
often, which can be useful for understanding common trends or prefer-
ences.
(iv) Handles Categorical Data: Ideal for categorical data where numerical
calculations are not applicable.
Disadvantages of Mode
(i) May Not Be Unique: A dataset can have more than one mode or no
mode at all, which can complicate interpretation.
(ii) Not Useful for Continuous Data: Less useful for continuous data
with many unique values, as identifying the most frequent value can be
challenging.
(iii) Does Not Reflect Data Distribution: Does not provide information
about the spread or shape of the data distribution.
(iv) Insensitive to Changes: The mode does not account for changes in
data that do not affect frequency, potentially overlooking variations.
• Mode: Best for categorical data and identifying the most frequent
values.
\[
\bar{x}_{WM} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}
\]
where:
• xi represents the i-th data point,
• wi represents the weight assigned to xi.
Problem 3.6. A biostatistician is analyzing the average blood pressure readings
of three different age groups in a study. The average blood pressure for each age
group and the number of individuals in each group are as follows:
• Age Group 1: Average blood pressure = 120 mmHg, Number of individuals = 30
Solution:
The weighted mean is
Solution:
The weighted mean is
\[
\bar{x}_{WM} = \frac{(85 \times 10) + (90 \times 20) + (80 \times 15)}{10 + 20 + 15} = \frac{850 + 1800 + 1200}{45} = \frac{3850}{45} \approx 85.56\%
\]
Hence, the weighted mean efficacy rate is approximately 85.56%.
Problem 3.8. A public health researcher records the incidence rates of a disease
in three different regions. The incidence rates and the population sizes for each
region are:
• Region A: Incidence rate = 0.02 cases per person, Population = 50, 000
• Region B: Incidence rate = 0.03 cases per person, Population = 75, 000
• Region C: Incidence rate = 0.04 cases per person, Population = 25, 000
Solution:
The weighted mean is
\[
\bar{x}_{WM} = \frac{(0.02 \times 50{,}000) + (0.03 \times 75{,}000) + (0.04 \times 25{,}000)}{50{,}000 + 75{,}000 + 25{,}000} = \frac{1000 + 2250 + 1000}{150{,}000} = \frac{4250}{150{,}000} \approx 0.02833 \text{ cases per person}
\]
Hence, the weighted mean incidence rate is approximately 0.02833 cases per person.
Problem 3.9. An academic advisor evaluates the performance of students
based on their grades in three courses, with different credit hours for each course:
• Course 1: Grade = 85, Credits = 3
• Course 2: Grade = 90, Credits = 4
• Course 3: Grade = 80, Credits = 2
Solution:
The weighted mean is
\[
\bar{x}_{WM} = \frac{(85 \times 3) + (90 \times 4) + (80 \times 2)}{3 + 4 + 2} = \frac{255 + 360 + 160}{9} = \frac{775}{9} \approx 86.11
\]
Hence, the weighted mean grade is approximately 86.11.
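The weighted means in Problems 3.7 and 3.9 reduce to a one-line computation; a minimal sketch:

```python
def weighted_mean(values, weights):
    # x̄_WM = Σ w_i x_i / Σ w_i
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

print(round(weighted_mean([85, 90, 80], [10, 20, 15]), 2))  # 85.56 (Problem 3.7)
print(round(weighted_mean([85, 90, 80], [3, 4, 2]), 2))     # 86.11 (Problem 3.9)
```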
Class Interval   Frequency   Cumulative Frequency   Midpoint
360 - 372        4           4                      366
372 - 384        3           7                      378
384 - 396        9           16                     390
396 - 408        5           21                     402
408 - 420        5           26                     414
420 - 432        4           30                     426
Total            30
Mean Estimation
The mean for grouped data is given by the formula:
\[
\bar{x} = \frac{\sum f_i m_i}{\sum f_i}
\]
where f_i is the frequency and m_i the midpoint of the i-th class. The midpoints are:
\[
m_1 = \frac{360 + 372}{2} = 366, \quad
m_2 = \frac{372 + 384}{2} = 378, \quad
m_3 = \frac{384 + 396}{2} = 390,
\]
\[
m_4 = \frac{396 + 408}{2} = 402, \quad
m_5 = \frac{408 + 420}{2} = 414, \quad
m_6 = \frac{420 + 432}{2} = 426.
\]
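Applying the grouped-data formula to these midpoints and frequencies reproduces the mean used later in this chapter:

```python
# Midpoints and frequencies of the six classes
midpoints = [366, 378, 390, 402, 414, 426]
frequencies = [4, 3, 9, 5, 5, 4]

# x̄ = Σ f_i m_i / Σ f_i
grouped_mean = sum(f * m for f, m in zip(frequencies, midpoints)) / sum(frequencies)
print(grouped_mean)  # 396.4
```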
Median Estimation
The median for grouped data is given by the formula:
\[
\text{Median} = l + \frac{\frac{n}{2} - cf}{f} \times h
\]
where l is the lower boundary of the median class, n is the total frequency,
cf is the cumulative frequency before the median class, f is the frequency of
the median class, and h is the class width.
Mode Estimation
The mode for grouped data is given by the formula:
\[
\text{Mode} = l + \frac{f_1 - f_0}{(f_1 - f_0) + (f_1 - f_2)} \times h
\]
where l is the lower boundary of the modal class, f1 is the frequency of the
modal class, f0 is the frequency of the class before the modal class, f2 is the
frequency of the class after the modal class, and h is the class width.
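A sketch of both formulas applied to the weekly-expenditure distribution, where the median class and the modal class are both 384-396 (n = 30, h = 12):

```python
n = 30
h = 12

# Median class 384-396: cumulative frequency before it is 7, its frequency is 9
l, cf, f = 384, 7, 9
median = l + (n / 2 - cf) / f * h
print(round(median, 2))  # 394.67

# Modal class 384-396: f1 = 9, f0 = 3 (class before), f2 = 5 (class after)
l, f1, f0, f2 = 384, 9, 3, 5
mode = l + (f1 - f0) / ((f1 - f0) + (f1 - f2)) * h
print(round(mode, 1))    # 391.2
```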
import numpy as np
from scipy.stats import gmean, hmean, mode

# Sample data
data = [10, 15, 15, 20, 25, 30]

# Arithmetic Mean
arithmetic_mean = np.mean(data)
print(f"Arithmetic Mean: {arithmetic_mean}")

# Geometric Mean
geometric_mean = gmean(data)
print(f"Geometric Mean: {geometric_mean}")

# Harmonic Mean
harmonic_mean = hmean(data)
print(f"Harmonic Mean: {harmonic_mean}")

# Median
median = np.median(data)
print(f"Median: {median}")

# Mode
mode_value, count = mode(data)
print(f"Mode: {mode_value}, Count: {count}")
The output is:
• Arithmetic Mean: ≈ 19.17
• Geometric Mean: ≈ 17.98
• Harmonic Mean: ≈ 16.82
• Median: 17.5
• Mode: 15
3.3 Exercises
1. Suppose we have the following dataset representing the scores of students
in a test:
85, 90, 78, 92, 88, 76, 95, 89, 84, 91
45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100
3. An engineer records the data transmission speeds of four network connections
as follows (in Mbps):
10, 20, 30, 40
(a) Calculate the arithmetic mean of the data transmission speeds.
(b) Compute the harmonic mean of the data transmission speeds.
(c) Calculate the median of the data transmission speeds.
4. The following dataset represents the time (in hours) taken by a worker to
complete four tasks:
4, 6, 8, 12
(a) Calculate the arithmetic mean of the task completion times.
(b) Compute the harmonic mean of the task completion times.
5. A company tracks the monthly returns (in percentage) on three different
investments over a year:
4%, 8%, 12%
82, 85, 88, 82, 90, 85, 92, 82, 88, 85, 87, 90, 82, 85, 92, 90, 87, 88, 85, 82
• Weights:
2, 3, 1
Compute the weighted mean of the hours spent on activities.
10. A company reports the sales (in units) of three different products with
the following weights:
• Sales:
200, 300, 400
• Weights:
1, 3, 2
Calculate the weighted mean of the sales data.
11. Define the weighted mean. A student’s overall grade in a course is de-
termined by three components: homework, exams, and projects. The
weights assigned to each component are as follows: homework (30%), ex-
• Frequencies:
[8, 15, 20, 12]
13. A student receives grade points in three subjects with the following weights:
• Grade points:
3.5, 3.7, 4
• Credit:
3, 4, 2
Compute an average of the grade points.
14. A healthcare provider tracks the number of patients visiting a clinic over
a week (in patients per day) as follows:
(a) Find the median number of patients per day.
(b) Determine the mode of the number of patients.
3.4.1 Range
The range is the simplest measure of variability and is calculated as the differ-
ence between the maximum and minimum values in a dataset. Mathematically,
the range is defined as:
Range = xmax − xmin
In a study measuring the blood glucose levels of 10 patients, the highest reading
is 120 mg/dL and the lowest reading is 85 mg/dL. So, the range is
Range = 120 − 85 = 35 mg/dL.
3.4.2 Variance
Variance measures the average squared deviation of each data point from the
mean. It reflects how data points spread out around the mean. Mathematically,
the sample variance is denoted by s2 and is defined as:
s² = [1/(n − 1)] Σᵢ₌₁ⁿ (xi − x̄)² = [1/(n − 1)] [ Σᵢ₌₁ⁿ xi² − n x̄² ]
where x̄ is the arithmetic mean, xi represents the data points, and n is the
number of data points.
Consider the weights of 5 patients: 65, 70, 75, 80, 85 kg. We have,
x̄ = (65 + 70 + 75 + 80 + 85)/5 = 75

s² = [1/(5 − 1)] [(65 − 75)² + (70 − 75)² + (75 − 75)² + (80 − 75)² + (85 − 75)²]
   = (1/4)[100 + 25 + 0 + 25 + 100] = 250/4 = 62.5 kg²
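The same calculation can be reproduced in Python; a minimal sketch, where ddof=1 selects the sample (not population) versions:

```python
import numpy as np

weights = [65, 70, 75, 80, 85]  # patient weights (kg)

# ddof=1 divides by n - 1, giving the sample variance and standard deviation
s2 = np.var(weights, ddof=1)
s = np.std(weights, ddof=1)
print(s2)           # 62.5
print(round(s, 2))  # 7.91
```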
Using the variance from the previous example, the sample standard deviation
is
s = √62.5 ≈ 7.91 kg
Problem 3.10. A physician collected an initial measurement of hemoglobin
(g/L) after the admission of 10 inpatients to a hospital’s department of cardiol-
ogy. The hemoglobin measurements were 139, 158, 120, 112, 122, 132, 97, 104,
159, and 129 g/L. Calculate the variance and standard deviation of hemoglobin
level.
Solution
The variance is calculated as follows:
xi      xi²
139     19321
158     24964
120     14400
112     12544
122     14884
132     17424
97      9409
104     10816
159     25281
129     16641

Sum: Σ xi = 1272,  Σ xi² = 165684
With x̄ = 1272/10 = 127.2, we have s² = [1/(n − 1)][Σ xi² − n x̄²] = (165684 − 10 × 127.2²)/9 = 3885.6/9 ≈ 431.73 (g/L)², so the standard deviation of hemoglobin for these 10 patients was s = √431.73 ≈ 20.8 g/L.
For grouped data with class frequencies fi and class midpoints mi, the sample variance is

s² = Σ fi (mi − x̄)² / (Σ fi − 1)

To compute the sample variance for grouped data, we define di = (mi − x̄) and hence

s² = Σ fi di² / (Σ fi − 1).
Now, we can easily compute s2 . The detailed calculations are given in Table
3.2.
where
x̄ = 396.4
Sample variance:
s² = 10288.64/(30 − 1) = 10288.64/29 ≈ 354.78
The corresponding standard deviation for grouped data is

s = √[ Σ fi (mi − x̄)² / (Σ fi − 1) ]

Standard deviation:

s = √354.78 ≈ 18.83
3.5 Exercises
1. Given the following dataset of temperatures (in degrees Celsius) recorded
over a week:
22, 25, 19, 23, 24, 20, 21
(a) Calculate the range of the temperatures.
2. Consider the following dataset representing the scores of 8 students in a
test:
75, 85, 95, 80, 90, 70, 88, 92
• Frequencies:
[6, 10, 15, 12, 7]
5. A company tracks the number of units sold each month for the last 6
months:
150, 170, 160, 180, 175, 165
(a) Calculate the standard deviation of the units sold.
6. A class of students took two different tests with the following scores:
• Test 1 Scores:
78, 82, 85, 88, 90
• Test 2 Scores:
72, 80, 78, 85, 90
• Frequencies:
[5, 8, 12, 6]
8. A company records the monthly sales (in thousands of dollars) over the
last year:
45, 52, 48, 55, 50, 47, 53, 60, 49, 51, 54, 57
(a) Find the range of the monthly sales data.
(b) Find the standard deviation of the monthly sales data.
3.6.1 Skewness
Skewness is a statistical measure that characterizes the degree of asymmetry
of a distribution around its mean. It indicates whether the data are
concentrated more on one side of the mean compared to the other.
Coefficient of Skewness
The coefficient of skewness is a standardized measure of skewness that allows
for comparison of the degree of asymmetry between different distributions. It
provides insight into the direction and extent of the skew of a data distribution.
This method uses the difference between the mean and the median to gauge
skewness.
Interpretation
• Sk = 0: The distribution is symmetric.
• Sk > 0: Positive skewness (right-skewed distribution).
• Sk < 0: Negative skewness (left-skewed distribution).
Types of Skewness
Skewness can be categorized into three main types based on the direction of
the asymmetry:
• Visual Description: The tail on the left side of the distribution is
longer or fatter.
• Example: Exam scores where most students score well, but a few
score significantly lower.
Problem 3.11. Consider the following dataset of the number of hours spent
studying per day by a group of students 2, 4, 4, 4, 5, 6, 8, and 10 hours.
Calculate the skewness.
Solution
Fisher-Pearson Coefficient of Skewness: The mean is
x̄ = (Σᵢ₌₁ⁿ xi)/n = (2 + 4 + 4 + 4 + 5 + 6 + 8 + 10)/8 = 43/8 = 5.375 hours
xi        x̄        xi − x̄    (xi − x̄)²   (xi − x̄)/s   ((xi − x̄)/s)³
2.0000    5.3750   -3.3750   11.3906     -1.3184      -2.2914
4.0000    5.3750   -1.3750    1.8906     -0.5371      -0.1549
4.0000    5.3750   -1.3750    1.8906     -0.5371      -0.1549
4.0000    5.3750   -1.3750    1.8906     -0.5371      -0.1549
5.0000    5.3750   -0.3750    0.1406     -0.1465      -0.0031
6.0000    5.3750    0.6250    0.3906      0.2441       0.0146
8.0000    5.3750    2.6250    6.8906      1.0254       1.0781
10.0000   5.3750    4.6250   21.3906      1.8066       5.8968
Sum: 43             45.8750                            4.2300
Skewness formula:
Skewness = [n/((n − 1)(n − 2))] Σᵢ₌₁ⁿ ((xi − x̄)/s)³.
For n = 8:

Skewness = [8/((8 − 1)(8 − 2))] × 4.230 = (8/42) × 4.230 = 0.8057
Pearson's second coefficient of skewness, which uses the median, gives

Sk = 3(x̄ − Median)/s = 3 × (5.375 − 4.5)/2.56 = 1.0254
The skewness of the dataset is approximately 1.0254. This positive value
indicates that the distribution is right-skewed, meaning it has a longer tail on
the right side.
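The adjusted Fisher-Pearson result above can be cross-checked with SciPy; a minimal sketch, where bias=False applies the same n/((n − 1)(n − 2)) small-sample correction used in the formula:

```python
from scipy.stats import skew

hours = [2, 4, 4, 4, 5, 6, 8, 10]

# bias=False gives the adjusted Fisher-Pearson coefficient of skewness
g1 = skew(hours, bias=False)
print(round(g1, 4))  # 0.8057
```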
3.6.2 Kurtosis
Even with knowledge of central tendency, dispersion, and skewness, we still
don’t have a full understanding of a distribution. To gain a complete perspec-
tive on the shape of the distribution, we also need to consider kurtosis. Kurtosis
is a statistical measure that describes the shape, or peakedness, of the probabil-
ity distribution of a real-valued random variable. It indicates whether the data
K = [ (1/n) Σᵢ₌₁ⁿ (xi − x̄)⁴ ] / [ (1/n) Σᵢ₌₁ⁿ (xi − x̄)² ]² − 3
Alternatively, many software programs (such as Excel’s KURT function, which
uses a bias-corrected formula) use the following formula to measure kurtosis.
K = [ n(n + 1) / ((n − 1)(n − 2)(n − 3)) ] Σᵢ₌₁ⁿ ((xi − x̄)/s)⁴ − 3(n − 1)² / ((n − 2)(n − 3))
Both formulas aim to measure the kurtosis of a dataset, but the second
formula includes additional terms to correct for bias, making it more accurate
for small sample sizes. The term −3 in the simpler formula adjusts for ex-
cess kurtosis, normalizing the kurtosis value to compare it against a normal
distribution.
Interpretation
• K = 0: The distribution has the same kurtosis as a normal distribution (mesokurtic).
• K > 0: The distribution has heavier tails and a sharper peak than a normal distribution (leptokurtic).
• K < 0: The distribution has lighter tails and a flatter peak than a normal distribution (platykurtic).
Problem 3.12. Consider Problem 3.11. Calculate the kurtosis of the following
dataset, which represents the number of hours spent studying per day by a group
of students: 2, 4, 4, 4, 5, 6, 8, and 10 hours.
Solution
In the solution to Problem 3.11, the mean is x̄ = 5.375 hours and the standard
deviation is s = 2.56 hours. The calculation for the kurtosis is provided in Table
3.4.
xi        x̄        xi − x̄    (xi − x̄)²   (xi − x̄)/s   ((xi − x̄)/s)⁴
2.0000    5.3750   -3.3750   11.3906     -1.3184       3.0209
4.0000    5.3750   -1.3750    1.8906     -0.5371       0.0832
4.0000    5.3750   -1.3750    1.8906     -0.5371       0.0832
4.0000    5.3750   -1.3750    1.8906     -0.5371       0.0832
5.0000    5.3750   -0.3750    0.1406     -0.1465       0.0005
6.0000    5.3750    0.6250    0.3906      0.2441       0.0036
8.0000    5.3750    2.6250    6.8906      1.0254       1.1055
10.0000   5.3750    4.6250   21.3906      1.8066      10.6535
Sum: 43             45.8750                           15.03358
Using the above table, we have

Σᵢ₌₁ⁿ ((xi − x̄)/s)⁴ = 15.03358

Bias-Correction Factor = n(n + 1)/((n − 1)(n − 2)(n − 3)) = (8 × 9)/(7 × 6 × 5) = 72/210 ≈ 0.343

So,

K = 0.343 × 15.03358 − 3(n − 1)²/((n − 2)(n − 3)) = 5.154 − 4.9 ≈ 0.2543
The excess kurtosis of the dataset is approximately 0.2543, the positive value
suggests that the distribution is leptokurtic. This means the distribution has
heavier tails and is more peaked compared to a normal distribution, indicating
a higher probability of extreme values or outliers.
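As a cross-check, SciPy can reproduce this value; a minimal sketch, where fisher=True (the default) returns excess kurtosis and bias=False applies the same small-sample correction as the second formula above:

```python
from scipy.stats import kurtosis

hours = [2, 4, 4, 4, 5, 6, 8, 10]

# Bias-corrected excess kurtosis of the study-hours data
k = kurtosis(hours, fisher=True, bias=False)
print(round(k, 3))  # 0.254
```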
If the mean blood pressure is x̄ = 128.67 mmHg and the standard deviation
is s = 7.91 mmHg, the CV is:
CV = (7.91/128.67) × 100% ≈ 6.15%
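A one-line check in Python, using the values from the blood-pressure example above:

```python
# Summary statistics taken from the example
mean_bp = 128.67  # mmHg
sd_bp = 7.91      # mmHg

cv = sd_bp / mean_bp * 100  # coefficient of variation, in percent
print(f"CV = {cv:.2f}%")    # CV = 6.15%
```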
3.7 Exercises
1. Given the following dataset representing the monthly number of new cus-
tomer sign-ups for a company:
Calculate the skewness of the dataset. Use a statistical software or formula
for skewness calculation.
2. The following dataset represents the heights (in cm) of 10 students:
150, 155, 160, 165, 170, 175, 180, 185, 190, 195
Determine the kurtosis of the height dataset. Use a statistical software or
formula for kurtosis calculation.
3. Consider the dataset representing the weekly earnings (in dollars) of 5
freelancers:
400, 420, 450, 480, 500
(a) Calculate the mean and standard deviation of the earnings.
(b) Compute the coefficient of variation (CV) for the earnings dataset.
4. A company tracks monthly sales figures (in thousands of dollars) for a
year:
40, 42, 44, 45, 47, 50, 52, 55, 60, 65, 70, 75
(a) Calculate the skewness of the sales data.
(b) Determine the kurtosis of the sales data.
5. A dataset of exam scores for a class is given as follows:
(a) Find the mean and standard deviation of the exam scores.
(b) Compute the coefficient of variation (CV) for the exam scores.
6. The daily maximum temperatures (in degrees Celsius) for a week are:
3.8.1 Quartiles
Quartiles provide a concise summary of a data set, highlighting its central ten-
dency and variability without requiring a full description of the data. They help
identify the spread of the data, allowing you to see how values are distributed.
Quartiles divide a data set into four equal parts, each representing 25% of the
data. The range between the first quartile (Q1 ) and the third quartile (Q3 ) is
known as the interquartile range (IQR), which indicates where the middle 50%
of the data lies.
The kth quartile can be computed as

Qk = x(⌊m⌋) + f · (x(⌈m⌉) − x(⌊m⌋)),  k = 1, 2, 3

where

m = ((n + 1)/4) × k

is the position in the ordered data set, and

• f = m − ⌊m⌋ is the fractional part of m
• x(⌊m⌋) is the value at the integer part of the mth position
• Q1 (First Quartile): The value below which 25% of the data falls.
• Q3 (Third Quartile): The value below which 75% of the data falls.
For example, for the floor function: ⌊2.4⌋ = 2, and for the ceiling function:
⌈2.4⌉ = 3.
Problem 3.13. Suppose that a college placement office sent a questionnaire
to a sample of business school graduates requesting information on monthly
starting salaries. Table 3.5 shows the collected data.
Table 3.5: Monthly Starting Salaries for a Sample of 12 Business School Grad-
uates
Graduate    Monthly Starting Salary ($)
1           5850
2           5950
3           6050
4           5880
5           5755
6           5710
7           5890
8           6130
9           5940
10          6325
11          5920
12          5880
Compute the quartiles of the monthly starting salary for the sample of 12
business college graduates.
Solution
To compute the quartiles, we first sort the data in ascending order:
5710, 5755, 5850, 5880, 5880, 5890, 5920, 5940, 5950, 6050, 6130, 6325
Quartile Computations
• Minimum: 5710
• Maximum: 6325
For n = 12, the position is m = ((12 + 1)/4) × 1 = 3.25, i.e., the 3.25th position:

Q1 = x(3) + 0.25 · (x(4) − x(3))
   = 5850 + 0.25 × (5880 − 5850)
   = 5850 + 0.25 × 30
   = 5850 + 7.5
   = 5857.5
Proceeding in the same way, m = ((12 + 1)/4) × 2 = 6.5 gives
Q2 = x(6) + 0.5 × (x(7) − x(6)) = 5890 + 0.5 × 30 = 5905,
and m = ((12 + 1)/4) × 3 = 9.75 gives
Q3 = x(9) + 0.75 × (x(10) − x(9)) = 5950 + 0.75 × 100 = 6025.
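NumPy can reproduce these hand calculations. The sketch below assumes NumPy ≥ 1.22; method='weibull' uses the same (n + 1)-based positions as the formula in this section, whereas NumPy's default interpolation method can give slightly different values.

```python
import numpy as np

salaries = [5850, 5950, 6050, 5880, 5755, 5710,
            5890, 6130, 5940, 6325, 5920, 5880]

# 'weibull' matches the (n + 1)k/4 positional rule used in the text
q1, q2, q3 = np.quantile(salaries, [0.25, 0.5, 0.75], method='weibull')
print(q1, q2, q3)  # 5857.5 5905.0 6025.0
```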
3.8.2 Percentiles
Percentiles divide a data set into 100 equal parts, providing a more detailed
breakdown of the data distribution. A kth percentile can be defined as
Pk = x(⌊m⌋) + f · (x(⌈m⌉) − x(⌊m⌋)),  k = 1, 2, . . . , 99

where

m = ((n + 1)/100) × k

is the position in the ordered data set.
• P10 (10th Percentile): The value below which 10% of the data falls.
• P25 (25th Percentile): The value below which 25% of the data falls.
(Q1)
• P50 (50th Percentile): The value below which 50% of the data falls.
(Median)
• P75 (75th Percentile): The value below which 75% of the data
falls. (Q3)
• P90 (90th Percentile): The value below which 90% of the data falls.

3.8.3 Deciles
Deciles divide a data set into ten equal parts, representing specific percentiles.
A kth decile can be defined as
Dk = x(⌊m⌋) + f · (x(⌈m⌉) − x(⌊m⌋)),  k = 1, 2, . . . , 9

where

m = ((n + 1)/10) × k

is the position in the ordered data set.
• D1 (1st Decile): The value below which 10% of the data falls.
• D2 (2nd Decile): The value below which 20% of the data falls.
• D3 (3rd Decile): The value below which 30% of the data falls.
• D4 (4th Decile): The value below which 40% of the data falls.
• D5 (5th Decile): The value below which 50% of the data falls. (Me-
dian)
• D6 (6th Decile): The value below which 60% of the data falls.
• D7 (7th Decile): The value below which 70% of the data falls.
• D8 (8th Decile): The value below which 80% of the data falls.
• D9 (9th Decile): The value below which 90% of the data falls.
• D10 (10th Decile): The value below which 100% of the data falls.
(Maximum)
In a dataset of test scores: 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, the first quar-
tile (Q1 ) is 62.5 and the third quartile (Q3 ) is 87.5. Hence, the interquartile
range is
IQR = 87.5 − 62.5 = 25
A data point xi is considered a mild outlier if it falls more than 1.5 × IQR
below Q1 or above Q3, and an extreme outlier if

xi < Q1 − 3 × IQR or xi > Q3 + 3 × IQR
Problem 3.14. Consider the following systolic blood pressure (SBP) readings
(in mmHg):
165, 50, 110, 120, 125, 130, 135, 140, 145, 150, 155,
160, 175, 180, 185, 190, 195, 200, 115, 220, 170
Calculate the first and third quartiles and then determine mild and extreme
outliers.
Solution
We will calculate the quartiles, IQR, and determine mild and extreme outliers
by following the steps. The sorted data is:
50, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155,
160, 165, 170, 175, 180, 185, 190, 195, 200, 220
The number of data points is n = 21.
Finding Q1
m = ((n + 1)/4) × 1 = ((21 + 1)/4) × 1 = 5.5
Using the formula:

Q1 = x(⌊5.5⌋) + f · (x(⌈5.5⌉) − x(⌊5.5⌋))
   = x(5) + 0.5 × (x(6) − x(5))
   = 125 + 0.5 × (130 − 125)
   = 127.5

Finding Q3

m = ((n + 1)/4) × 3 = ((21 + 1)/4) × 3 = 16.5

Q3 = x(16) + 0.5 × (x(17) − x(16)) = 180 + 0.5 × (185 − 180) = 182.5

Compute IQR:

IQR = Q3 − Q1 = 182.5 − 127.5 = 55
Identify Outliers:

• Mild Outliers: Mild outliers are values below Q1 − 1.5 × IQR = 45 or
above Q3 + 1.5 × IQR = 265. The readings do not have any mild outliers.
• Extreme Outliers: Extreme outliers are values below Q1 − 3 × IQR = −37.5
or above Q3 + 3 × IQR = 347.5. Again, there are no extreme outliers.
import numpy as np
from scipy.stats import iqr, skew, kurtosis

# Sample data
data = [10, 15, 15, 20, 25, 30]

# Range
data_range = np.ptp(data)
print(f"Range: {data_range}")

# Variance
variance = np.var(data, ddof=1)
print(f"Variance: {variance}")

# Standard Deviation
std_deviation = np.std(data, ddof=1)
print(f"Standard Deviation: {std_deviation}")

# Coefficient of Variation
coef_of_variation = std_deviation / np.mean(data)
print(f"Coefficient of Variation: {coef_of_variation}")
# Quartiles
quartiles = np.percentile(data, [25, 50, 75])
print(f"Quartiles (25th, 50th, 75th): {quartiles}")

# Interquartile Range
data_iqr = iqr(data)
print(f"IQR: {data_iqr}")

# Skewness
data_skewness = skew(data)
print(f"Skewness: {data_skewness}")

# Kurtosis
data_kurtosis = kurtosis(data)
print(f"Kurtosis: {data_kurtosis}")
The output of the code will be (approximately):

• Range: 20
• Variance: 54.17
• Standard Deviation: 7.36
• Coefficient of Variation: 0.384
• Quartiles (25th, 50th, 75th): [15.0, 17.5, 23.75]
• IQR: 8.75
• Skewness: 0.305
• Kurtosis: -1.152
3.9 Exercises
1. Given the following dataset representing the monthly expenses (in dollars)
of 12 households:
450, 600, 550, 620, 700, 480, 510, 540, 580, 660, 710, 690
Compute the first quartile, the second quartile (or median), and the third
quartile of the dataset.
2. Given the following dataset representing the ages of 12 participants in a
study:
22, 25, 28, 30, 32, 35, 37, 40, 42, 45, 48, 50
Compute the first quartile, the third quartile, and the interquartile range (IQR)
of the ages.
3. The dataset represents the heights (in cm) of 20 plants measured over a
period:
40, 42, 45, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63
Compute the 1st decile (D1), 5th decile (D5, which is the median), and
the 9th decile (D9) of the plant heights.
4. Consider the following dataset representing the number of books read by
10 students in a year:
12, 15, 18, 20, 22, 25, 28, 30, 35, 100
(a) Compute Q1 and Q3 .
(b) Identify any outliers in the dataset using the Interquartile Range
(IQR) method.
5. Given the following dataset of monthly rainfall (in mm) for 8 cities:
100, 110, 120, 150, 170, 190, 200, 220
(a) Compute the quartiles (Q1, Q2, Q3) for the dataset.
(b) Detect any potential outliers using the IQR method.
6. The following dataset represents the scores of 25 students in an exam:
45, 47, 48, 50, 51, 53, 55, 57, 58, 60, 62, 63, 65,
67, 68, 70, 72, 74, 76, 78, 80, 82, 84, 86, 88
(a) Calculate the 10th percentile (P10) and the 80th percentile (P80) of
the exam scores.
30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 110, 120, 130, 140, 150
(a) Find the 3rd decile (D3) and 7th decile (D7) of the salary data.
(b) Use the IQR method to detect any outliers in the salary data.
8. The following dataset represents the number of hours spent on the internet
per week by a sample of 20 people:
5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24
(a) Compute the quartiles (Q1, Q2, Q3) for the dataset.
(b) Find the 25th percentile (P25) and the 75th percentile (P75) of the
dataset.
(iv). third quartile (Q3 )
Solution
To find the five-number summary, we first arrange the data in ascending order:
170, 175, 180, 185, 190, 190, 195, 195, 200, 200, 205, 210, 215, 220, 225
• Minimum: The smallest value in the dataset.
• First Quartile (Q1 ): The median of the lower half of the dataset (not
including the median if the number of observations is odd).
Q1 = 185 mg/dL
• Third Quartile (Q3 ): The median of the upper half of the dataset
(not including the median if the number of observations is odd).
Q3 = 210 mg/dL
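The five-number summary above can be reproduced in Python. The sketch below assumes the ordered cholesterol data from this solution and uses NumPy's (n + 1)-based 'weibull' quantile method (NumPy ≥ 1.22) so the quartiles match the hand calculation.

```python
import numpy as np

chol = [170, 175, 180, 185, 190, 190, 195, 195,
        200, 200, 205, 210, 215, 220, 225]

# 'weibull' reproduces the (n + 1)-position rule used in the text
q1, med, q3 = np.quantile(chol, [0.25, 0.5, 0.75], method='weibull')
five_num = (min(chol), q1, med, q3, max(chol))
print(five_num)  # (170, 185.0, 195.0, 210.0, 225)
```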
3.10.2 Boxplot
A boxplot (or box-and-whisker plot) is a graphical representation of the five-
number summary. It displays the median, quartiles, and potential outliers.
• Median: The 50th percentile, dividing the dataset into two equal
halves.
• Third Quartile (Q3 ): The 75th percentile of the data.
• Outliers: Data points that fall outside the range of 1.5 times the IQR
from Q1 and Q3.
[Figure: Schematic boxplot showing Q1, the median, Q3, the lower and upper limits (LL, UL), the interquartile range (IQR), and outliers beyond the whiskers.]
the median. The position of the median within the box indicates the data’s
skewness: if it is centered, the data is symmetrical; if skewed towards one
end, it shows skewness. Second, look at the length of the whiskers extending
from the box, which represent the range of the data within 1.5 times the IQR
from the quartiles; data points beyond this range are considered outliers, which
are plotted as individual points. Third, check for outliers and extreme values,
which are represented as dots outside the whiskers. A higher number of outliers
might suggest variability or anomalies in the data. Fourth, observe the width
of the box and the lengths of the whiskers to assess data spread and identify
potential data dispersion. Finally, compare multiple boxplots side-by-side to
analyze differences between groups, noting shifts in the median, variations in
the IQR, and the presence of outliers. By focusing on these elements, you can
gain insights into the data’s distribution characteristics, identify patterns, and
make informed conclusions about the underlying data trends.
Remark 3.10.1. The lines extending from either end of the box are called
whiskers. The term whisker plot is often used interchangeably with “box-
plot”, focusing on the whiskers. It highlights the range of data within 1.5 times
the IQR from the quartiles but might not always show the box or median. Both
terms generally refer to the same plot, but “boxplot” is the more comprehensive
term, including the full depiction of the quartiles and median along with the
whiskers.
3.10.3 Importance of Boxplots
Boxplots are essential tools for:
• Visualizing Data Distribution: They show the range, quartiles,
and outliers of the data.
• Comparing Distributions: They allow for comparisons between dif-
ferent groups or datasets.
• Detecting Outliers: They help identify unusual data points that may
need further investigation.
5710, 5755, 5850, 5880, 5880, 5890, 5920, 5940, 5950, 6050, 6130, 6325
• Minimum: 5710
• Maximum: 6325
Draw the boxplot to represent this data and comment on your findings.
Solution
From the solution of 3.13, the five-number summary for the monthly starting
salaries is as follows:
• Minimum: 5710
• 1st Quartile (Q1 ): 5857.5
• Median (Q2 ): 5905
• 3rd Quartile (Q3 ): 6025
• Maximum: 6325
To find the lower and upper whiskers (bounds), we first calculate the interquartile
range (IQR):

IQR = Q3 − Q1 = 6025 − 5857.5 = 167.5

Using 1.5 times the IQR, the lower and upper bounds are:

Lower bound = Q1 − 1.5 × IQR = 5857.5 − 251.25 = 5606.25
Upper bound = Q3 + 1.5 × IQR = 6025 + 251.25 = 6276.25
• Lower outlier: Any value below 5606.25. The minimum value 5710 is above
this bound, so there is no lower outlier.
• Upper outlier: Any value above 6276.25. The value 6325 is greater
than 6276.25, so 6325 is an upper outlier.
[Figure: Boxplot of the monthly starting salaries; the axis runs from 5,600 to 6,400 dollars.]
Comment on the Boxplot: The boxplot of the monthly starting salaries reveals
a positively skewed distribution. The median (Q2 = 5905) is positioned
slightly closer to the lower quartile, and the right whisker is longer than the left
whisker, which indicates that there are some higher salaries pulling the data to
the right.
The lower whisker extends to 5710, while the upper whisker reaches
6130. The interquartile range (IQR), between Q1 = 5857.5 and Q3 = 6025, is
fairly narrow, indicating that the middle 50% of the data are clustered together.
However, the longer upper whisker and the presence of an outlier at 6325 show
that there are some higher salaries that deviate from the general pattern.
The outlier (6325) is a clear indication of the positive skewness in the
data, suggesting that while most starting salaries are within a consistent range,
a few graduates earn noticeably more.
180, 195, 170, 200, 210, 175, 205, 190, 195, 220, 185, 215, 200, 190, 275
Calculate the five-number summary and draw a boxplot to represent this data.
Solution
Given the ordered cholesterol levels:
170, 175, 180, 185, 190, 190, 195, 195, 200, 200, 205, 210, 215, 220, 275
• Minimum = 170
• Maximum = 275
• Median (Q2 ): Since there are 15 data points (odd), the median is the
middle value:

Q2 = x((n+1)/2) = x((15+1)/2) = x(8) = 195

• First Quartile (Q1 ):

m = (n + 1)/4 = (15 + 1)/4 = 4

The 4th value is:

Q1 = x(4) = 185

• Third Quartile (Q3 ):

m = ((15 + 1)/4) × 3 = 12

Therefore,

Q3 = x(12) = 210
• Calculate the IQR:
IQR = Q3 − Q1 = 210 − 185 = 25
[Figure 3.3: Boxplot of the cholesterol levels (mg/dL); the axis runs from 140 to 280.]
The boxplot of cholesterol levels given in Figure 3.3, provides a clear visual
representation of the data distribution. The spread of the data is captured
by the range from the minimum value (170 mg/dL) to the largest non-outlier value
(220 mg/dL), as well as by the interquartile range (IQR) of 25 mg/dL, which
measures the spread of the middle 50% of the data between the first quartile
(185 mg/dL) and the third quartile (210 mg/dL). There is an outlier in this
dataset which is 275. The median (195 mg/dL) lies closer to the first quartile,
suggesting a slight right skewness in the data, as the upper quartile range
is wider than the lower quartile range. Overall, the data appears relatively
symmetric but with a mild skew to the right, indicating that higher cholesterol
values are slightly more spread out than the lower values.
3.10.4 Python Code: Boxplot
To create a boxplot for the data given in Problem 3.13 in Python, you can use
the matplotlib library. Here’s the Python code that will generate a boxplot
for the monthly starting salaries provided in the table:
import matplotlib.pyplot as plt

# Monthly starting salaries from Table 3.5
salaries = [5850, 5950, 6050, 5880, 5755, 5710,
            5890, 6130, 5940, 6325, 5920, 5880]

# Create a boxplot
plt.figure(figsize=(10, 6))
plt.boxplot(salaries, vert=False, patch_artist=True,
            boxprops=dict(facecolor='lightblue', color='blue'),
            whiskerprops=dict(color='blue'),
            medianprops=dict(color='red'))
plt.xlabel('Monthly Starting Salary ($)')
plt.title('Boxplot of Monthly Starting Salaries')

# Show plot
plt.show()
3.11 Exercises
1. Consider the following dataset representing the daily temperatures (in °C)
recorded over 15 days:
18, 21, 20, 22, 24, 19, 23, 25, 27, 26, 28, 30, 29, 31, 32
(a) Compute the five-number summary for this dataset, including the
minimum, first quartile (Q1), median, third quartile (Q3), and max-
imum.
(b) Draw a boxplot and comment on your findings.
2. The following dataset represents housing prices (in thousands of dollars)
in a neighborhood:
220, 230, 250, 270, 290, 310, 330, 350, 370, 400
55, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78, 80, 82, 84, 86
4. Given the following dataset of monthly sales (in thousands of dollars) for
a retail store over 12 months:
25, 30, 28, 35, 33, 32, 31, 29, 37, 40, 42, 38
5, 6, 7, 8, 8, 9, 10, 10, 11, 12, 12, 13, 14, 15, 20
(a) Create a boxplot for this dataset.
(b) Identify and describe the distribution of the data, including any po-
tential outliers.
8. The following dataset represents the weights (in kg) of 18 animals:
2, 3, 3, 4, 5, 5, 6, 6, 7, 8, 8, 9, 10, 10, 11, 12, 13, 15
(a) Use Python to compute the five-number summary and create a box-
plot for this dataset.
(b) Write a brief explanation of how Python can be used to visualize
data distributions using boxplots.
2. The average number of patients visiting a hospital each day over a week
is recorded as 150, 160, 170, 140, 155, 165, 160.
5. A class has two sections with 25 and 30 students, and the average scores
for the sections are 85 and 90, respectively.
(a) Find the weighted average score for the entire class.
(b) Calculate the variance and standard deviation of the scores, assum-
ing that the scores in each section are the same.
7. For the dataset of daily temperatures: 22, 24, 21, 25, 23, 26, 24, 27,
(b) Find the mean, range, variance, and standard deviation of the daily
temperatures.
• Dataset A: 5, 8, 7, 6, 10
• Dataset B: 3, 6, 5, 7, 8, 9
(a) Find the median number of hours spent exercising per week.
(b) Compute the mean, range, variance, and standard deviation of the
10. A car travels at speeds of 50 km/h for the first part of the trip and
70 km/h for the second part. If the distances traveled are the same,
what is the harmonic mean of the two speeds?
12. The annual returns of two investment portfolios over the past 3 years are
10%, 15%, 5% and 7%, 12%, 8%.
(b) Compute the variance and standard deviation of the effectiveness
rates.
Calculate the weighted mean calorie intake per day for all patients.
15. A professor is calculating the overall grade for a student based on the
grades from three different assessments with the following weights:
16. This exercise focuses on comparing error rates for two software projects.
You will analyze the central tendency and variability of the error rates.
• Project Alpha: 5, 7, 6, 8, 5, 9, 6, 7, 8, 5, 7, 6
• Project Beta: 8, 10, 9, 7, 11, 9, 8, 10, 11, 9, 12, 10
(a) Draw a boxplot for the error rates for Project Alpha and Project
Beta.
• Start of Study: 22.5, 23.1, 22.8, 21.9, 23.5, 22.7, 23.0, 21.5,
22.3, 22.9, 21.8, 23.2
• After 6 Months: 21.0, 21.5, 21.8, 20.8, 22.0, 21.3, 21.6, 20.7,
21.1, 21.9, 20.9, 21.4
AF Using the above data, answer the following questions.
(a) Draw a boxplot for BMI at the start of the study and after six
months.
(b) How did the median BMI change from the start to after six months?
(c) Did the diet intervention lead to a reduction in the variability of
BMI?
(d) Are there any outliers in the BMI data at the start or after six
months? What can be inferred from them?
(e) Based on the boxplots, evaluate the effectiveness of the diet inter-
DR
vention.
18. This exercise involves analyzing cholesterol levels across three different
groups. You will interpret the boxplots to compare the cholesterol levels
between these groups.
• Group A: 190, 200, 195, 210, 205, 215, 202, 198, 220, 210, 195,
200
• Group B: 180, 185, 190, 175, 195, 190, 180, 185, 175, 190, 185,
180
• Group C: 210, 220, 215, 230, 225, 240, 220, 235, 225, 230, 240,
215
Chapter 4
Introduction to Probability
4.1 Introduction
In the realm of data science, understanding probability is essential for making
informed decisions based on uncertain and incomplete information. Probability
theory provides the mathematical foundation for analyzing data, modeling un-
certainty, and deriving insights from complex datasets. As data scientists, we
frequently encounter situations where outcomes are not deterministic but rather
subject to variability and chance. Probability offers tools and frameworks to
quantify this uncertainty and to make predictions that guide decision-making.
As we delve deeper, we will cover key topics such as joint and marginal
probabilities, conditional probability, and posterior probabilities. Each of these
concepts is crucial for analyzing relationships between variables, updating be-
liefs based on new data, and making predictions about future events.
By the end of this chapter, you will gain a solid understanding of probability and its applications in data science. This knowledge will equip you with
the tools needed to tackle complex data challenges and to make data-driven
decisions with confidence.
The outcome of an experiment is uncertain, but over repeated trials, patterns emerge
that allow us to assign probabilities to different outcomes.
Example of an Experiment
To test the fairness of a coin used in a cricket match to decide whether a
team bats or bowls first, we can design an experiment to assess whether the
coin has an equal probability of landing on heads or tails. The goal is to
determine if the coin is unbiased, meaning both outcomes, heads and tails, are
equally likely. In this experiment, the possible outcomes are heads or tails.
The procedure involves flipping the coin a significant number of times, say 100
flips, and recording the result of each flip. The next step is to analyze the data
by calculating the relative frequencies of heads and tails. These frequencies
are then compared to the expected probability of 0.5 for each outcome. If the
proportion of heads deviates significantly from 0.5, it may suggest the coin is
biased. Conversely, if the proportions are close to 0.5, there is no evidence to
suggest that the coin is unfair.
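The coin-fairness experiment described above can be sketched as a quick simulation. This is a minimal illustration, not the book's own code; the fixed seed and the flip count of 100 are arbitrary choices.

```python
import random

random.seed(42)  # fixed seed so the illustration is reproducible

n_flips = 100
flips = [random.choice("HT") for _ in range(n_flips)]

# Relative frequencies, to be compared with the expected 0.5 for each side
p_heads = flips.count("H") / n_flips
p_tails = flips.count("T") / n_flips
print(p_heads, p_tails)
```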
Examples
• Rolling a Die: A random experiment with outcomes {1, 2, 3, 4, 5,
6}, where each outcome is unpredictable.
Consider the experiment of rolling a fair six-sided die. The possible outcomes
of this experiment are the numbers that appear on the top face of the die after
a roll. Then the sample space S for this experiment is the set of all possible
outcomes. That is,
S = {1, 2, 3, 4, 5, 6}
Here are some examples of sample spaces in different contexts within data
science:
• The sample space for tossing a fair coin is
S = {Heads, Tails}.
• The sample space S for rolling two six-sided dice consists of all possible
ordered pairs (x1 , x2 ), where x1 represents the outcome of the first die
and x2 represents the outcome of the second die. Since each die has 6
DR
This sample space shows all the possible outcomes when two dice are
rolled simultaneously.
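The two-dice sample space can be enumerated directly; a small sketch:

```python
from itertools import product

# All ordered pairs (x1, x2) for two six-sided dice
S = list(product(range(1, 7), repeat=2))

print(len(S))  # 36
print(S[:3])   # [(1, 1), (1, 2), (1, 3)]
```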
For example, when rolling a die, the event of observing an even number is A = {2, 4, 6}.
4.3 Probability
Probability is a branch of mathematics that deals with the likelihood of differ-
ent outcomes in uncertain situations. It quantifies the chance of an event occur-
ring, providing a way to model and analyze randomness. In essence, probability
helps us understand and predict the behavior of systems in which outcomes are
not deterministic, but rather subject to chance.
For example, when rolling a fair six-sided die, the probability of getting a 3 is

P (3) = 1/6,

because there is only one “3” out of six possible outcomes.
Consider a sample space with n possible outcomes, S = {A1 , A2 , . . . , An }. Each probability P (Ai ) must lie between 0 and 1, and the probabilities of all outcomes must sum to 1.
For example, when tossing a fair coin, the probability of getting heads is 0.5,
and the probability of getting tails is also 0.5, which are both between 0 and 1.
The sum of these probabilities is
0.5 + 0.5 = 1
showing that the total probability of all possible outcomes (heads or tails) is
always 1.
Problem 4.1. An experiment has five outcomes, I, II, III, IV, and V. If P (I) =
0.08, P (II) = 0.20, and P (III) = 0.33, (a) what are the possible values for the
probability of outcome V? (b) If outcomes IV and V are equally likely, what are
their probability values?
Solution
(a).
An experiment has five outcomes: I, II, III, IV, and V. The given probabilities for outcomes I, II, and III are:

P (I) = 0.08, P (II) = 0.20, P (III) = 0.33.

Since the sum of the probabilities of all outcomes must equal 1, the sum of the probabilities of outcomes IV and V is:

P (IV) + P (V) = 1 − (0.08 + 0.20 + 0.33) = 0.39.

Thus, the possible values for the probability of outcome V depend on the probability of outcome IV: P (V) = 0.39 − P (IV), so P (V) can be any value with 0 ≤ P (V) ≤ 0.39.
(b).
If outcomes IV and V are equally likely, then

P (IV) = P (V) = 0.39/2 = 0.195.
Solution
An experiment has three outcomes: I, II, and III. Let the probabilities of these
outcomes be P (I), P (II), and P (III), respectively.
6x + 3x + x = 1

or,

10x = 1

∴ x = 1/10 = 0.1
Thus, the probabilities of the three outcomes are:
P (III) = x = 0.1
P (II) = 3x = 3 × 0.1 = 0.3
P (I) = 6x = 6 × 0.1 = 0.6
A ∪ B = {ω | ω ∈ A or ω ∈ B}
Example: Consider rolling a standard six-sided die. Let A = {2, 4, 6} (rolling an even number) and B = {4, 5, 6} (rolling a number greater than 3).
The union A ∪ B represents rolling a number that is either even or greater
than 3 (or both). The possible outcomes are
A ∪ B = {2, 4, 5, 6}.
A ∩ B = {ω | ω ∈ A and ω ∈ B}
Example: Using the same events A = {2, 4, 6} and B = {4, 5, 6} from the die roll, the intersection is
A ∩ B = {4, 6}.
For the die-rolling experiment, if A = {2, 4, 6}, then the complement of A is
Ac = {1, 3, 5}
The probability of the complementary event Ac is given by:
P (Ac ) = 1 − P (A)
Example: If A is the event of getting a head when flipping a coin, then the
complementary event Ac is the event of getting a tail. If P (A) = 0.5, then
P (Ac ) = 1 − 0.5 = 0.5.
Odds: The odds in favor of an event A are defined as the ratio of the
probability that the event A occurs to the probability that the event A
does not occur (i.e., the complement of A). Mathematically, the odds in
favor of A are given by:
Odds in favor of A = P (A) / P (Ac )
Solution
(a). The odds in favor of success are given by:

Odds = p / (1 − p)

If the odds are 1, then:

1 = p / (1 − p)
Solving for p:

1 − p = p ⇒ 1 = 2p ⇒ p = 1/2

So, p = 0.5.

(b). If instead the odds are 2, then:

2(1 − p) = p ⇒ 2 − 2p = p ⇒ 2 = 3p ⇒ p = 2/3

So, p = 2/3.
Equally Likely Events: Events that have the same probability of occur-
ring.
Example
Consider the experiment of rolling a fair six-sided die. The sample space is:
S = {1, 2, 3, 4, 5, 6}
Since the die is fair, each of the six outcomes is equally likely. The proba-
bility of each outcome is:
P ({1}) = P ({2}) = P ({3}) = P ({4}) = P ({5}) = P ({6}) = 1/6.
A∩B =∅
where ∩ denotes the intersection of events, and ∅ represents the empty
set, indicating that there are no outcomes common to both A and B.
Example:
Consider rolling a standard six-sided die. Let A = {2} (rolling a 2) and B = {5} (rolling a 5).
The events A and B are mutually exclusive because you cannot roll a 2 and
a 5 at the same time.
P (A ∪ B) = P (A) + P (B)
Generalization
For any finite or countable collection of mutually exclusive events A1 , A2 , . . . , An :
P (A1 ∪ A2 ∪ · · · ∪ An ) = P (A1 ) + P (A2 ) + · · · + P (An )
(i). 0 ≤ P (Ai ) ≤ 1,
(ii). P (A1 ) + P (A2 ) + · · · + P (An ) = 1.
If A and B are not mutually exclusive, then the addition rule is

P (A ∪ B) = P (A) + P (B) − P (A ∩ B).

If two events are mutually exclusive, then the probability of both occurring
is denoted as P (A ∩ B) and

P (A and B) = P (A ∩ B) = 0.
Problem 4.4. A single 6-sided die is rolled. What is the probability of rolling
a 2 or a 5?
Solution
• Pr(2) = 1/6 and Pr(5) = 1/6.
• Therefore,

Pr(2 or 5) = Pr(2) + Pr(5) = 1/6 + 1/6 = 1/3.
Problem 4.5. In a Math class of 30 students, 17 are boys and 13 are girls.
On a unit test, 4 boys and 5 girls made an A grade. If a student is chosen at
random from the class, what is the probability of choosing a girl or an A-grade
student?
Solution
• Pr(girl) = 13/30, Pr(A-grade student) = 9/30, and Pr(girl ∩ A-grade student) = 5/30.
• Therefore,

Pr(girl or A-grade student) = 13/30 + 9/30 − 5/30 = 17/30.
Probability can be categorized into different types based on how it is determined
or calculated. Below are the key types:
Classical Probability: the probability of an event is the ratio of the number of favorable outcomes to the total number of equally likely outcomes.
Example 1: For a fair six-sided die, the total number of possible outcomes is
6 (since the die has six faces). If we are interested in the probability of rolling a
3, there is only one favorable outcome (rolling a 3). Therefore, the probability
is:

P (rolling a 3) = 1/6
Example 2: For a standard deck of 52 playing cards, the total number of
possible outcomes is 52. If we want to know the probability of drawing an Ace,
there are 4 Aces in the deck (one in each suit). Therefore, the probability is:
P (drawing an Ace) = 4/52 = 1/13
Classical probability is particularly useful for situations where the outcomes
are well-defined, and each outcome is equally likely, such as rolling dice, drawing
cards, or selecting outcomes from a set of equally likely possibilities.
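Because classical probabilities are exact ratios, Python's fractions module is a natural fit (a minimal sketch; classical_probability is an illustrative helper, not a library function):

```python
from fractions import Fraction

def classical_probability(favorable, total):
    """Classical probability: favorable outcomes over equally likely total outcomes."""
    return Fraction(favorable, total)

print(classical_probability(1, 6))    # 1/6  (rolling a 3)
print(classical_probability(4, 52))   # 1/13 (drawing an Ace; Fraction reduces 4/52)
```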
Empirical Probability: this approach estimates probabilities from observed data. It is used when it is difficult to calculate theoretical probabilities, or when we want to verify theoretical predictions
by comparing them with actual outcomes. The empirical approach to probabil-
ity relies on the principle known as the law of large numbers. This principle
suggests that as the number of observations increases, the estimate of the prob-
ability becomes more accurate. Therefore, by gathering more data, one can
obtain a more precise estimation of the probability.
Law of large numbers: As the number of trials or observations increases,
the empirical probability of an event will get closer to its actual probability.
Example 1: For example, if we flip a coin 100 times and get 52 heads, the
estimated probability of getting a head would be:
P (Heads) = Number of heads / Total number of flips = 52/100 = 0.52
Based on the empirical data, the estimated probability of getting heads on a
coin flip is 0.52, or 52%.
Example 2: Suppose 100 flights from Dhaka to Chattogram were monitored, of which 95 arrived successfully and 5 did not. Then

P (Success) = 95/100 = 0.95

P (Failure) = 5/100 = 0.05
In this example, the empirical probability of a successful flight from Dhaka
to Chattogram is 0.95, while the probability of an unsuccessful flight is 0.05,
based on the actual outcomes of the monitored flights.
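The law of large numbers behind this approach can be illustrated with a quick simulation (a sketch; the trial counts and seed are arbitrary):

```python
import random

random.seed(1)

# Estimate P(Heads) for a fair coin with an increasing number of flips;
# the empirical estimate approaches the true value 0.5
for n in (100, 10_000, 1_000_000):
    heads = sum(random.random() < 0.5 for _ in range(n))
    print(f"{n:>9} flips: estimate = {heads / n:.4f}")
```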
Subjective Probability is based on personal judgment, intuition, or experience. For an event A, the subjective probability is denoted as:
P (A) = Subjective belief about A
In this approach, probabilities are not necessarily based on frequency or
equal likelihood but on personal estimation.
Example 1: Consider an entrepreneur deciding whether to launch a new
product. Since there is no historical data or empirical studies available for this
specific product, the entrepreneur uses their expertise and market knowledge
to estimate the likelihood of success.
P (Success) = 0.70
This subjective probability is based on the entrepreneur’s expertise and market knowledge, rather than on empirical data or mathematical models.
Example 2: Suppose you assess the chance that Team A will win an upcoming game as

P (A) = 0.70
This means you believe there is a 70% chance that Team A will win the
game. This probability is derived from your subjective evaluation of the teams’
recent performances, player conditions, and other relevant factors.
The joint probability is the probability that a student has a good study
habit and passed the exam:
P (A ∩ B) = 0.80
The marginal probability of A, that is, the probability that a student studied for the exam regardless of whether they passed or not, is obtained by summing the joint probabilities involving A.
Problem 4.6. Suppose two dice are thrown together. What is the probability
that at least one 6 is obtained on the two dice?
Solution
Since each die has 6 faces, the sample space contains 6 × 6 = 36 possible
outcomes:
S = { (1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6),
      (2, 1), (2, 2), (2, 3), (2, 4), (2, 5), (2, 6),
      (3, 1), (3, 2), (3, 3), (3, 4), (3, 5), (3, 6),
      (4, 1), (4, 2), (4, 3), (4, 4), (4, 5), (4, 6),
      (5, 1), (5, 2), (5, 3), (5, 4), (5, 5), (5, 6),
      (6, 1), (6, 2), (6, 3), (6, 4), (6, 5), (6, 6) }
In the sample space S, we see the possible outcomes with at least one 6 are
(6, 1), (6, 2), (6, 3), (6, 4), (6, 5), (6, 6), (1, 6), (2, 6), (3, 6), (4, 6), (5, 6).

Hence, the probability of obtaining at least one 6 is 11/36.
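The counting argument in Problem 4.6 can be verified by enumerating the sample space in Python (a sketch using only the standard library):

```python
from itertools import product
from fractions import Fraction

S = list(product(range(1, 7), repeat=2))   # 36 equally likely outcomes
event = [o for o in S if 6 in o]           # at least one 6

p = Fraction(len(event), len(S))
print(p)  # 11/36
```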
P (A|B) = P (A ∩ B) / P (B), provided P (B) > 0.

For example, if A and B are mutually exclusive, then P (A ∩ B) = 0, so

P (A | B) = P (A ∩ B) / P (B) = 0 / P (B) = 0.
Another example involves a scenario where event B is a subset of event A,
denoted B ⊆ A. In this case, if event B occurs, event A must also occur. Thus,
the probability of event A given that B has occurred should be one. This is
supported by the formula:
P (A | B) = P (A ∩ B) / P (B) = P (B) / P (B) = 1.
Problem 4.7. If somebody rolls a fair die without showing you but announces
that the result is even, then what is the probability of scoring a 6?
Solution
The sample space for a fair die roll is S = {1, 2, 3, 4, 5, 6}. The event (B) that
the result is even is B = {2, 4, 6}.
P (B) = 3/6 = 1/2

The event A of scoring a 6 given that the result is even is A = {6}, with P (A ∩ B) = 1/6.

P (A|B) = P (A ∩ B) / P (B) = (1/6) / (1/2) = 1/3
Problem 4.8. Suppose somebody rolls a red die and a blue die together without
showing you, but announces that at least one 6 has been scored. What is the
probability that the red die showed a 6?
Solution
In the sample space S mentioned in Problem 4.6, we see the possible outcomes
with at least one 6 are
B = {(6, 1), (6, 2), (6, 3), (6, 4), (6, 5), (6, 6), (1, 6), (2, 6), (3, 6), (4, 6), (5, 6)}
Therefore, the number of outcomes with at least one 6 is 11. The number of
outcomes where the red die scores a 6 is 6. That is,
A = {(6, 1), (6, 2), (6, 3), (6, 4), (6, 5), (6, 6)}.
P (A|B) = (6/36) / (11/36) = 6/11.
Problem 4.9. Suppose somebody rolls a red die and a blue die together. What
is the probability that the red die scores a 6 given that exactly one 6 of the two
outcomes has been scored?
Solution
In the sample space S mentioned in Problem 4.6, the possible outcomes with exactly one 6 are

B = {(6, 1), (6, 2), (6, 3), (6, 4), (6, 5), (1, 6), (2, 6), (3, 6), (4, 6), (5, 6)}.
P (A|B) = (5/36) / (10/36) = 1/2
A contingency table displays the joint frequency distribution of two categorical variables. Each cell in the table represents the count of observations where
the two variables take specific values. We use contingency tables to compute
marginal and joint probabilities.
Consider two categorical variables: Variable A with categories A1 and
A2, and Variable B with categories B1 and B2. The contingency table 4.1 is
structured as follows:
        A1      A2      Total
B1      a       b       a+b
B2      c       d       c+d
Total   a+c     b+d     n = a+b+c+d
        A1          A2          Total
B1      a/n         b/n         (a+b)/n
B2      c/n         d/n         (c+d)/n
Total   (a+c)/n     (b+d)/n     1
(i). Develop a joint probability table for these data. What are the marginal
probabilities? Suppose a male officer is selected randomly, what is the
chance that the officer will be promoted?
(ii). Suppose a female officer is selected randomly, what is the chance that
the officer will not be promoted? Suppose an officer is selected randomly
who got promotion, what is the chance that the officer will be male?
(iii). Suppose an officer is selected randomly who did not get promotion,
what is the chance that the officer will be female?
Solution
(i) Joint Probability Table and Marginal Probabilities
To develop the joint probability table, we divide each cell count by the total
number of officers, which is 1200.
                Male        Female      Total
Promoted        288/1200    36/1200     324/1200
Not Promoted    672/1200    204/1200    876/1200
Total           960/1200    240/1200    1
Simplifying the fractions, the probability that a randomly selected female officer is not promoted is:

P (Not Promoted | Female) = (204/1200) / (240/1200) = 204/240 = 0.85
Probability that a randomly selected officer who got promoted is
male:
P (Male | Promoted) = (288/1200) / (324/1200) = 288/324 ≈ 0.889
Probability that a randomly selected officer who was not promoted is female:

P (Female | Not Promoted) = (204/1200) / (876/1200) = 204/876 ≈ 0.233
Two events A and B are said to be independent if the occurrence of one does not affect the probability of the other. In mathematical terms, this is expressed as:
P (A ∩ B) = P (A) · P (B).
P (A | B) = P (A ∩ B) / P (B) = [P (A) · P (B)] / P (B) = P (A),

P (B | A) = P (A ∩ B) / P (A) = [P (A) · P (B)] / P (A) = P (B).
These equations indicate that knowing the occurrence of one event does not
change the probability of the other event.
P (A ∩ B) = P (A) · P (B).
P (A | B) = P (A), P (B | A) = P (B).
Example: Consider rolling a fair six-sided die and flipping a fair coin. Let A be the event that the die shows a 3 and B the event that the coin lands on heads. The coin has two possible outcomes: heads (H) or tails (T). Therefore, the
sample space for the coin flip is:
SA = {H, T }
The outcome of rolling the die does not affect the outcome of flipping the
coin, and vice versa. Therefore, events A and B are independent. We can verify
this as follows:
P (A) = 1/6, P (B) = 1/2
The combined sample space consists of 12 outcomes.
S = { (1, H), (2, H), (3, H), (4, H), (5, H), (6, H),
      (1, T ), (2, T ), (3, T ), (4, T ), (5, T ), (6, T ) }
P (A ∩ B) = P (die shows 3 and coin lands on heads) = 1/12

P (A) · P (B) = (1/6) · (1/2) = 1/12 = P (A ∩ B)
Thus, the events are independent.
(a). Suppose that the system works only when all four computers are work-
ing. What is the probability that the system works?
(b). Suppose that the system works only if at least one computer is working.
What is the probability that the system works?
(c). Suppose that the system works only if at least three computers are work-
ing. What is the probability that the system works?
Solution
(a). To find the probability in this scenario, we multiply the probabilities
of all four computers working, as they are independent.
(c).
Problem 4.12. Suppose that somebody secretly rolls two fair six-sided dice. What is the probability that the face-up value of the first one is 2, given the information that their sum is no greater than 5?
Solution
To find the probability that the face-up value of the first die is 2 given that
the sum of the two dice is no greater than 5, we use the concept of conditional
probability.
Let A be the event that the face-up value of the first die is 2, and B be
the event that the sum of the two dice is no greater than 5. We want to find
P (A | B).
The conditional probability P (A | B) is given by:
P (A | B) = P (A ∩ B) / P (B)
First, we determine P (B). The possible outcomes for the sum of the two dice being no greater than 5 are:

B = {(1, 1), (1, 2), (1, 3), (1, 4), (2, 1), (2, 2), (2, 3), (3, 1), (3, 2), (4, 1)},

so P (B) = 10/36. Among these, the outcomes with the first die showing 2 are (2, 1), (2, 2), and (2, 3), so P (A ∩ B) = 3/36.

Thus, the probability that the face-up value of the first die is 2 given that their sum is no greater than 5 is

P (A | B) = (3/36) / (10/36) = 3/10.
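This conditional probability can be checked by enumeration (a sketch that simply counts outcomes in B and in A ∩ B):

```python
from itertools import product
from fractions import Fraction

S = list(product(range(1, 7), repeat=2))   # all 36 ordered pairs
B = [o for o in S if sum(o) <= 5]          # sum no greater than 5
A_and_B = [o for o in B if o[0] == 2]      # ... and first die shows 2

print(Fraction(len(A_and_B), len(B)))  # 3/10
```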
S = A1 ∪ A2 ∪ · · · ∪ An
Let B be another event in the sample space given in Figure 4.1. The initial
question of interest is how to use the probabilities P (Ai ) and P (B | Ai ) to
calculate P (B), the probability of the event B. This can be achieved by noting
that

P (B) = P (B | A1 )P (A1 ) + P (B | A2 )P (A2 ) + · · · + P (B | An )P (An ).
This result, known as the Law of Total Probability, has the interpretation
that if it is known that one and only one of a series of events Ai can occur, then
the probability of another event B can be obtained as the weighted average of
the conditional probabilities P (B | Ai ), with weights equal to the probabilities
P (Ai ).
The law of total probability states that if you have a partition of the sample
space into mutually exclusive events, the probability of an event can be found by
summing the probabilities of the event occurring within each partition, weighted
by the probability of each partition.
Example
Suppose we have a sample space divided into three mutually exclusive events
A1 , A2 , and A3 with the following probabilities and conditional probabilities:
Each new car sold carries a one-year bumper-to-bumper warranty. The com-
pany has collected data showing the following conditional probabilities of making
a warranty claim:
The probability of interest is the probability that a claim on the warranty of the
car will be required. If B is the event that a claim is made, we want to find
P (B).
Solution
We can use the Law of Total Probability to find P (B). According to the Law
of Total Probability:
P (B) =P (B | Plant I) · P (Plant I) + P (B | Plant II) · P (Plant II)
+ P (B | Plant III) · P (Plant III) + P (B | Plant IV) · P (Plant IV)
[Tree diagram: each plant branches into Claim and No Claim, with Plant I: Claim 0.05, No Claim 0.95; Plant II: Claim 0.11, No Claim 0.89; Plant III: Claim 0.03, No Claim 0.97; Plant IV: Claim 0.08, No Claim 0.92.]
Substitute the given values:

P (B) = (0.05 × 0.20) + (0.11 × 0.24) + (0.03 × 0.25) + (0.08 × 0.31)
      = 0.0100 + 0.0264 + 0.0075 + 0.0248 = 0.0687

Thus, the probability that a claim on the warranty will be required is 0.0687.
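The weighted sum above is a direct translation of the Law of Total Probability into code (a sketch; the Plant I share of 0.20 is inferred so that the four shares sum to 1):

```python
# Plant shares and per-plant claim rates from the warranty example
# (the Plant I share, 0.20, is inferred so the shares sum to 1)
shares = {"I": 0.20, "II": 0.24, "III": 0.25, "IV": 0.31}
claim_rate = {"I": 0.05, "II": 0.11, "III": 0.03, "IV": 0.08}

# Law of Total Probability: P(Claim) = sum over plants of P(Claim | plant) * P(plant)
p_claim = sum(shares[p] * claim_rate[p] for p in shares)
print(round(p_claim, 4))  # 0.0687
```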
When an outcome depends on several factors, the Law of Total Probability allows us to compute the overall probability by summing the contributions from all mutually exclusive combinations of those factors.
Example: Consider classifying individuals by age group and smoking status. The probability of being in the young age group (less than 40 years) is
P (Young) = 0.60, and the probability of being in the old age group (40 years or
older) is P (Old) = 0.40. For the young age group, the conditional probabilities
of having high blood pressure (BP) are P (High BP | Young, Smoker) = 0.10 for
smokers and P (High BP | Young, Non-smoker) = 0.05 for non-smokers. For
the old age group, the conditional probabilities of having high BP are P (High BP |
Old, Smoker) = 0.40 for smokers and P (High BP | Old, Non-smoker) = 0.25
for non-smokers. The probability of being a smoker in the young age group
is P (Smoker | Young) = 0.30, and the probability of being a non-smoker is
P (Non-smoker | Young) = 0.70. In the old age group, the probability of being a smoker is P (Smoker | Old), with P (Non-smoker | Old) = 1 − P (Smoker | Old).
Solution
To compute the overall probability of having high blood pressure, P (High BP),
we use the Law of Total Probability. The total probability is given by:
P (High BP) = Σi Σj P (High BP | Ai , Sj ) · P (Sj | Ai ) · P (Ai )
Where
• P (High BP | Ai , Sj ) is the conditional probability of having high blood
pressure given that the individual belongs to age group Ai and has
smoking status Sj ,
[Tree diagram: Age branches into Young and Old; each age group branches into Smoker and Non-smoker; each of those branches into High BP and Not High BP, with the conditional probabilities given above.]
Therefore, substituting the given probabilities for each of the eight branches into the formula yields P (High BP).
P (A | B) = P (B | A) · P (A) / P (B)
where:
• P (A | B) is the posterior probability of event A given that event B has
occurred.
P (Ai | B) = P (B | Ai ) · P (Ai ) / [P (B | A1 )P (A1 ) + · · · + P (B | An )P (An )]
where:
the following conditional probabilities of making a warranty claim:

• P (claim | Plant I) = 0.05
• P (claim | Plant II) = 0.11
• P (claim | Plant III) = 0.03
• P (claim | Plant IV) = 0.08

If a claim is made on the warranty of the car, how does this change these
probabilities?
Solution
From Bayes’ theorem, the posterior probabilities are calculated as follows:
P (Plant III | Claim) = (0.25 × 0.03) / 0.0687 = 0.109

P (Plant IV | Claim) = P (Plant IV) · P (Claim | Plant IV) / P (Claim)

Substitute the given values:

P (Plant IV | Claim) = (0.31 × 0.08) / 0.0687 = 0.361
In summary, the posterior probabilities are:

• P (Plant I | Claim) = 0.146
• P (Plant II | Claim) = 0.384
• P (Plant III | Claim) = 0.109
• P (Plant IV | Claim) = 0.361
Notice that Plant II has the largest claim rate (0.11), and its posterior
probability (0.384) is much larger than its prior probability (0.24). This is
expected since the fact that a claim is made increases the likelihood that the
car has been assembled in a plant with a high claim rate. Similarly, Plant III
has the smallest claim rate (0.03), and its posterior probability (0.109) is much
smaller than its prior probability (0.25), as expected.
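All four posterior probabilities can be computed in a few lines (a sketch; the Plant I share of 0.20 is inferred from the worked solution):

```python
shares = {"I": 0.20, "II": 0.24, "III": 0.25, "IV": 0.31}
claim_rate = {"I": 0.05, "II": 0.11, "III": 0.03, "IV": 0.08}

# Denominator of Bayes' theorem: total probability of a claim
p_claim = sum(shares[p] * claim_rate[p] for p in shares)   # 0.0687

# Bayes' theorem: posterior is proportional to prior times likelihood
posterior = {p: shares[p] * claim_rate[p] / p_claim for p in shares}
for plant, prob in posterior.items():
    print(plant, round(prob, 3))   # I 0.146, II 0.384, III 0.109, IV 0.361
```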
Problem 4.16. Suppose it is known that 1% of the population suffers from
a particular disease. A blood test has a 97% chance of identifying the disease
for diseased individuals, but also has a 6% chance of falsely indicating that a
healthy individual has the disease.
(a) What is the probability that a person will have a positive blood test?
(b) If your blood test is positive, what is the chance that you have the disease?
(c) If your blood test is negative, what is the chance that you do not have the
disease?
Solution
(a) Probability of a Positive Blood Test
Let D be the event that a person has the disease, and Dc be the event that a
person does not have the disease. Let T + be the event of a positive test result,
and T − be the event of a negative test result.
P (D) = 0.01
P (Dc ) = 0.99
P (T + |D) = 0.97
P (T + |Dc ) = 0.06
By the Law of Total Probability,

P (T + ) = P (T + |D)P (D) + P (T + |Dc )P (Dc )
         = (0.97 × 0.01) + (0.06 × 0.99)
         = 0.0097 + 0.0594
         = 0.0691
So, the probability that a person will have a positive blood test is 0.0691.
(b) Probability of Having the Disease Given a Positive Test
We use Bayes’ theorem:
P (D|T + ) = P (T + |D)P (D) / P (T + )
           = (0.97 × 0.01) / 0.0691
           = 0.0097 / 0.0691
           ≈ 0.1403
So, if your blood test is positive, the chance that you have the disease is
approximately 0.1403 or 14.03%.
(c) Probability of Not Having the Disease Given a Negative Test
Here P (T − |Dc ) = 1 − 0.06 = 0.94 and P (T − ) = 1 − P (T + ) = 1 − 0.0691 = 0.9309. By Bayes’ theorem:

P (Dc |T − ) = P (T − |Dc )P (Dc ) / P (T − )
             = (0.94 × 0.99) / 0.9309
             = 0.9306 / 0.9309
             ≈ 0.9997
So, if your blood test is negative, the chance that you do not have the disease
is approximately 0.9997 or 99.97%.
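All three parts of Problem 4.16 can be reproduced numerically (a sketch; the variable names are illustrative):

```python
p_d = 0.01      # prevalence: P(D)
sens = 0.97     # sensitivity: P(T+ | D)
fpr = 0.06      # false-positive rate: P(T+ | no D)

p_pos = sens * p_d + fpr * (1 - p_d)                   # (a) Law of Total Probability
p_d_given_pos = sens * p_d / p_pos                     # (b) Bayes' theorem
p_nd_given_neg = (1 - fpr) * (1 - p_d) / (1 - p_pos)   # (c)

print(round(p_pos, 4))           # 0.0691
print(round(p_d_given_pos, 4))   # 0.1404
print(round(p_nd_given_neg, 4))  # 0.9997
```

Note that 0.0097/0.0691 = 0.14038, which the text truncates to 0.1403.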
As we move forward, the next chapter will delve into random variables and
their properties. Random variables are crucial for quantifying and modeling
uncertainty in a more structured way. We will explore different types of ran-
dom variables, their distributions, and key properties, further building on the
probability concepts introduced here. Mastering these topics will enhance your
ability to handle complex data challenges and apply statistical techniques effec-
tively. Understanding random variables is essential for advanced data analysis
and developing predictive models.
(b) P (B)
(c) P (A ∩ B)
(d) P (A ∪ B)
(e) P (Ac )
2. In a bag of 10 balls, 4 are red and 6 are blue. Two balls are drawn at
random without replacement. Define the following events:
• A: Drawing a red ball on the first draw.
• B: Drawing a red ball on the second draw.
Calculate the following probabilities:
(a) P (A)
(b) P (B | A)
(c) P (A ∩ B)
(d) P (A ∪ B)
3. You are given a deck of 52 playing cards. Define the following events:
• A: Drawing a card that is a heart.
• B: Drawing a card that is a queen.
Calculate the following probabilities:
(a) P (A)
(b) P (B)
(c) P (A ∩ B)
(d) P (A ∪ B)
(e) P (Ac )
(a) What is the probability that a young person likes exactly one of
the two social media platforms?
(b) What is the probability that a young person likes at least one of
the two platforms?
(c) What is the probability that a young person likes only Facebook
and not YouTube?
• A: Preferring coffee.
• B: Preferring tea.

Given:

• P (A) = 0.6
• P (A ∩ B) = 0.3

Calculate: P (A ∪ B)
7. In a class of 30 students, 18 like mathematics, 12 like science, and 8 like
both. If a student is chosen at random, calculate:
The probability of developing the health condition given the type of treat-
ment is known to be:
• P (Condition | A) = 0.10
• P (Condition | B) = 0.25
• P (Condition | C) = 0.15
Find the overall probability of a patient developing the health condition,
denoted as P (Condition).
Using Bayes’ theorem, calculate the following probabilities:
(a) P (A | Condition)
(b) P (B | Condition)
(c) P (C | Condition).
10. Suppose that somebody secretly rolls two fair six-sided dice. What is the probability that the face-up value of the first one is 3, given the information that their sum is no greater than 5?
The system works if components A and B work and either of the com-
ponents C or D works. The reliability (probability of working) of each
component is also shown in the above figure. Find the probability that
(a) the entire system works.
(b) the component C does not work, given that the entire system works.
Assume that the four components work independently.
(c) the component D does not work, given that the entire system works.
12. An agricultural research establishment grows vegetables and grades each
one as either good or bad for its taste, good or bad for its size, and good
or bad for its appearance. Overall 78% of the vegetables have a good
taste. However, only 69% of the vegetables have both a good taste and a
good size. Also, 5% of the vegetable have both a good taste and a good
appearance, but a bad size. Finally, 84% of the vegetables have either a
good size or a good appearance.
(a). If a vegetable has a good taste, what is the probability that it also has a good size?
(b). If a vegetable has a bad size and a bad appearance, what is the probability that it has a good taste?
13. A company produces electronic components, and it has two types of ma-
chines, A and B, that manufacture these components. Machine A pro-
duces 60% of the components, while Machine B produces 40%. Historical
data shows that 2% of the components produced by Machine A are defec-
tive, while 5% of the components produced by Machine B are defective.
(b) If a person tests positive, what is the probability that they actually
have the disease?
(c) If a person tests negative, what is the probability that they do not
have the disease?
15. As a mining company evaluates the likelihood of discovering a gold de-
posit in a specific region, they have gathered data on the probabilities
associated with geological features. Given that the probability of finding
a gold deposit is P (G) = 0.3, the likelihood of observing specific geolog-
ical features if a deposit is present is P (E|G) = 0.8, and the chance of
observing those features if no deposit exists is P (E|Gc ) = 0.1, answer the
following:
16. A factory produces 80% of products with Machine A and 20% with Ma-
chine B. If 2% of A’s products and 5% of B’s products are defective, what
is the probability that a defective product came from Machine A?
17. Suppose you are on a game show with three doors: one has a car, the
other two have goats. You choose Door 1. The host, who knows what’s
behind the doors, opens Door 3 to show a goat and asks if you want to
switch to Door 2.
DR
(a) What’s the chance of winning the car if you switch to Door 2?
(b) What’s the chance of winning the car if you stay with Door 1?
Chapter 5

Random Variable and Its Properties
5.1 Introduction
In the realm of data science, understanding and manipulating uncertainty is
a fundamental skill. At the core of this capability lies the concept of a ran-
dom variable. A random variable is a quantitative variable whose values are
determined by the outcome of a random phenomenon. It serves as a bridge
connecting the abstract world of probability theory to the concrete domain of
data analysis.
Random variables can be classified into two main types: discrete and contin-
uous. Discrete random variables take on a countable number of distinct values,
while continuous random variables can take on any value within an interval.
This chapter delves into the foundational aspects of random variables, ex-
ploring their properties and the critical role they play in statistical modeling
and data analysis. We will discuss probability distributions, expected values,
variances, and other key properties. By the end of this chapter, readers will gain a robust understanding of how random variables function and how they can be applied to
solve real-world problems in data science.
CHAPTER 5. RANDOM VARIABLE AND ITS PROPERTIES
Random Variable: A variable whose possible values are determined by
outcomes of a random experiment or process, with each value associated
with a probability.
Example: Suppose three electronic components are tested and each is classified as defective (D) or non-defective (N), and let X be the number of defective components in the sample. The possible values of X are denoted by x, and their
corresponding outcomes are listed in Table 5.1. The random variable X can
take on the following values:
• x = 0: No defective components (Outcome: NNN)
• x = 1: One defective component (Outcomes: NDN, NND, DNN)
• x = 2: Two defective components (Outcomes: NDD, DND, DDN)
• x = 3: Three defective components (Outcome: DDD)

Table 5.1: Outcomes and values of X
Outcome:  NNN  NDN  NND  DNN  NDD  DND  DDN  DDD
x:        0    1    1    1    2    2    2    3
Example: When flipping a fair coin multiple times, the number of heads observed in the series is a
discrete random variable. For example, if you flip a coin 10 times, the
number of heads (0 to 10) is a discrete outcome.
(i). Non-negativity: For every possible value x that X can take, the
probability p(x) is non-negative:
p(x) ≥ 0
(ii). Normalization: The sum of the probabilities for all possible values of X equals 1:

Σx p(x) = 1, where the sum runs over all x in the range of X.
(iii). Probability Assignment: For any specific value x, p(x) gives the
probability that the random variable X takes the value x:
p(x) = P (X = x)
Outcome              x    P (X = x)
{NNN}                0    1/8
{NDN, NND, DNN}      1    3/8
{NDD, DND, DDN}      2    3/8
{DDD}                3    1/8
Figure 5.1: Probability Mass Function of X (spikes of height 0.125, 0.375, 0.375, and 0.125 at x = 0, 1, 2, 3).
The cumulative distribution function (cdf) of a random variable X is defined as F (x) = P (X ≤ x).
x    P (X = x)    F (x)
0    1/8          0.125
1    3/8          0.500
2    3/8          0.875
3    1/8          1.000
[Step-function graph of the cdf F (x), rising from 0.125 at x = 0 to 1 at x = 3.]
5. Step Function: For discrete random variables, the cdf F (x) is a step
function, with the value increasing at each point where the random vari-
able takes a value.
Problem 5.1. An office has four copying machines, and the random variable
X measures how many of them are in use at a particular moment in time.
Suppose that P (X = 0) = 0.08, P (X = 1) = 0.11, P (X = 2) = 0.27, and
P (X = 3) = 0.33.
(a) Find P (X = 4). (b) Draw the line graph of the probability mass function. (c) Construct the cumulative distribution function F (x) and draw its graph.
Solution
(a) Since the sum of all probabilities must be 1, we have:
P (X = 4) = 1 − (P (X = 0) + P (X = 1) + P (X = 2) + P (X = 3))
= 1 − (0.08 + 0.11 + 0.27 + 0.33) = 1 − 0.79
= 0.21
(b) The graphical presentation of the probability mass function is the follow-
ing:
Figure 5.3: Probability Mass Function (bars of height 0.08, 0.11, 0.27, 0.33, 0.21 at x = 0, 1, 2, 3, 4).
(c) The cumulative distribution function is F (x) = P (X ≤ x):

x       0      1      2      3      4
F (x)   0.08   0.19   0.46   0.79   1.00
where,
F (0) = P (X = 0) = 0.08
F (1) = P (X ≤ 1) = P (X = 0) + P (X = 1) = 0.08 + 0.11 = 0.19
F (2) = P (X ≤ 2) = P (X = 0) + P (X = 1) + P (X = 2)
= 0.08 + 0.11 + 0.27 = 0.46
F (3) = P (X ≤ 3) = P (X = 0) + P (X = 1) + P (X = 2) + P (X = 3)
= 0.08 + 0.11 + 0.27 + 0.33 = 0.79
F (4) = P (X ≤ 4)
= P (X = 0) + P (X = 1) + P (X = 2) + P (X = 3) + P (X = 4)
= 0.08 + 0.11 + 0.27 + 0.33 + 0.21 = 1.00
[Step-function graph of F (x), rising from 0.08 at x = 0 to 1.00 at x = 4.]
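The pmf and cdf computations of Problem 5.1 can be sketched in Python (the dictionary-based representation is an illustrative choice):

```python
# PMF of the number of copying machines in use (Problem 5.1)
pmf = {0: 0.08, 1: 0.11, 2: 0.27, 3: 0.33}
pmf[4] = 1 - sum(pmf.values())   # probabilities must sum to 1, so P(X = 4) = 0.21

# CDF: F(x) = P(X <= x), a running sum of the pmf
cdf, running = {}, 0.0
for x in sorted(pmf):
    running += pmf[x]
    cdf[x] = round(running, 2)

print(round(pmf[4], 2))  # 0.21
print(cdf)               # {0: 0.08, 1: 0.19, 2: 0.46, 3: 0.79, 4: 1.0}
```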
Problem 5.2. Let the number of phone calls received by a switchboard during
a 5-minute interval be a random variable X with probability function
p(x) = e^(−2) 2^x / x!,  for x = 0, 1, 2, . . .
(a) Determine the probability that x equals 0, 1, 2, 3, 4, 5, and 6.
(b) Graph the probability mass function for these values of x.
(c) Determine the cumulative distribution function for these values of X.
Solution
(a) Probabilities
The probability function is given by
p(x) = e^(−2) 2^x / x!
P (X = 0) = e^(−2) 2^0 / 0! = e^(−2) ≈ 0.1353
P (X = 1) = e^(−2) 2^1 / 1! = 2e^(−2) ≈ 0.2707
P (X = 2) = e^(−2) 2^2 / 2! = 2e^(−2) ≈ 0.2707
P (X = 3) = e^(−2) 2^3 / 3! = (4/3)e^(−2) ≈ 0.1804
P (X = 4) = e^(−2) 2^4 / 4! = (2/3)e^(−2) ≈ 0.0902
P (X = 5) = e^(−2) 2^5 / 5! = (4/15)e^(−2) ≈ 0.0361
P (X = 6) = e^(−2) 2^6 / 6! = (4/45)e^(−2) ≈ 0.0120
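This probability function is the Poisson pmf with mean 2, so the values in part (a) can be checked with the standard math module (a minimal sketch):

```python
from math import exp, factorial

def p(x, lam=2.0):
    """Poisson pmf: p(x) = e^(-lam) * lam^x / x!"""
    return exp(-lam) * lam**x / factorial(x)

probs = [round(p(x), 4) for x in range(7)]
print(probs)  # [0.1353, 0.2707, 0.2707, 0.1804, 0.0902, 0.0361, 0.012]
```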
(b) Graph of the Probability Mass Function
[Bar chart of p(x): 0.1353, 0.2707, 0.2707, 0.1804, 0.0902, 0.0361, 0.0120 at x = 0, 1, . . . , 6.]
(c) Cumulative Distribution Function

F (0) = P (X ≤ 0) = P (X = 0) = 0.1353
F (1) = P (X ≤ 1) = P (X = 0) + P (X = 1) = 0.1353 + 0.2707 = 0.4060
F (2) = P (X ≤ 2) = P (X = 0) + P (X = 1) + P (X = 2)
= 0.4060 + 0.2707 = 0.6767
F (3) = P (X ≤ 3) = P (X = 0) + P (X = 1) + P (X = 2) + P (X = 3)
= 0.6767 + 0.1804 = 0.8571
F (4) = P (X ≤ 4) = P (X = 0) + P (X = 1) + P (X = 2) + P (X = 3) + P (X = 4)
= 0.8571 + 0.0902 = 0.9473
F (5) = P (X ≤ 5) = P (X = 0) + P (X = 1) + P (X = 2) + P (X = 3)
+ P (X = 4) + P (X = 5)
= 0.9473 + 0.0361 = 0.9834
F (6) = P (X ≤ 6) = P (X = 0) + P (X = 1) + P (X = 2) + P (X = 3)
+ P (X = 4) + P (X = 5) + P (X = 6)
= 0.9834 + 0.0120 = 0.9954
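A quick numerical cross-check of these values (a sketch using only the standard library):

```python
import math

# pmf and cdf of the switchboard example: p(x) = e^(-2) 2^x / x!
lam = 2.0

def pmf(x):
    return math.exp(-lam) * lam**x / math.factorial(x)

probs = [pmf(x) for x in range(7)]
cdf = [sum(probs[: x + 1]) for x in range(7)]

print([round(p, 4) for p in probs])
print([round(F, 4) for F in cdf])
```

The small discrepancies in the last digit (e.g., F(6) ≈ 0.9955 rather than 0.9954) come from rounding each pmf value before summing in the hand calculation.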
5.3.4 Exercises
1. An office has four copying machines, and the random variable X denotes
how many of them are in use in a particular time. Suppose the probability
mass function X is given below:
x 0 1 2 3 4
Pr(X = x) k 0.02 0.05 0.4 (k + 0.3)
(a) Find the value of k and draw the line graph of the probability mass function Pr(X = x).
(b) Find the value of Pr(X ≤ 2).
(c) Find the probability that at least two copying machines are in use.
(d) Find the cumulative distribution function F(x) and draw its graph.
2. An office has five printers and the random variable Y measures how many
of them are currently being used. Suppose that P (Y = 0) = 0.05, P (Y =
1) = 0.10, P (Y = 2) = 0.20, P (Y = 3) = 0.30, and P (Y = 4) = 0.25.
3. A hospital has six emergency rooms and the random variable Z measures
how many of them are occupied at a given time. Suppose that P (Z =
0) = 0.04, P (Z = 1) = 0.10, P (Z = 2) = 0.20, P (Z = 3) = 0.25,
P (Z = 4) = 0.20, and P (Z = 5) = 0.15.
4. A hospital has three emergency rooms, and the random variable W de-
notes how many of them are occupied at a particular time. Suppose the
probability mass function W is given below:
w 0 1 2 3
Pr(W = w) q 0.15 0.25 (q + 0.05)
(a) Find the value of q and draw the line graph of the probability mass function Pr(W = w).
(b) Find the value of Pr(W ≤ 2). Also, find the probability that at least
one emergency room is occupied.
(c) Find the cumulative distribution function F (w) and draw the graph
of F (w).
5. A clinic has three doctors and the random variable W measures how
many of them are available at a particular moment in time. Suppose that
P (W = 0) = 0.15, P (W = 1) = 0.20, and P (W = 2) = 0.30.
6. A warehouse has seven forklifts and the random variable V measures how
many of them are currently in operation. Suppose that P (V = 0) = 0.02,
P (V = 1) = 0.08, P (V = 2) = 0.18, P (V = 3) = 0.25, P (V = 4) = 0.20,
and P (V = 5) = 0.15.
7. A manufacturing plant has four assembly lines and the random variable
U measures how many of them are operating at a given time. Suppose
that P (U = 0) = 0.10, P (U = 1) = 0.20, and P (U = 2) = 0.35.
variable because it can take any value within a given range. For example,
the height could be 170.2 cm, 175.5 cm, etc.
2. Time Taken to Complete a Task: The time required to finish a task,
such as running a marathon, is a continuous random variable. It can be
measured in hours, minutes, seconds, and fractions of a second.
3. Temperature: The temperature at a specific location and time is a
continuous random variable. It can take any value within the possible
range of temperatures, such as 23.45°C, 37.8°C, etc.
6. Stock Price: The price of a stock is a continuous random variable. It can vary continuously and take on any value within the range of possible stock prices.
7. Age of an Individual: The age of a person can be considered a contin-
uous random variable if measured precisely. For instance, someone could
be 25.3 years old, 45.7 years old, etc.
8. Voltage in an Electrical Circuit: The voltage at a point in an electrical
circuit is a continuous random variable. It can take any value within the
possible voltage range.
2. Normalization: The total area under the pdf curve over the entire range of X is equal to 1:
∫_{−∞}^{∞} f(x) dx = 1.
Note that the pdf f (x) does not provide the probability of X taking any spe-
cific value (which is always zero for continuous random variables). Instead, it
indicates the density of probability at each point. To find the probability that
X falls within a specific interval [a, b] given in Figure 5.5, you integrate the pdf
over that interval:
P(a ≤ X ≤ b) = ∫_a^b f(x) dx.
Figure 5.5: The area under the probability density function f (x) between a and
b.
that is:
P(a ≤ X ≤ b) = ∫_a^b f(x) dx.
(ii). Find the probability that the diameter of the metal cylinder lies between
49.8 mm and 50.1 mm, i.e., calculate P (49.8 ≤ X ≤ 50.1).
Solution
(i). To determine whether f(x) = 1.5 − 6(x − 50.0)² for 49.5 ≤ x ≤ 50.5 and f(x) = 0 elsewhere is a valid probability density function (pdf), we need to check two conditions:
1. Non-negativity: f (x) ≥ 0 for all x.
2. Normalization: The total integral of f (x) over all possible values must
equal 1.
Non-negativity Check
We need to ensure that f (x) ≥ 0 for 49.5 ≤ x ≤ 50.5:
For 49.5 ≤ x ≤ 50.5, the quadratic term (x − 50.0)² is minimized at x = 50.0 and ranges from 0 to (0.5)² = 0.25. Hence f(x) = 1.5 − 6(x − 50.0)² ranges from 1.5 − 6(0.25) = 0 up to 1.5, so f(x) ≥ 0 on the interval.
Normalization Check
We need to integrate f (x) over the interval 49.5 ≤ x ≤ 50.5 and check if the
integral equals 1:
∫_{49.5}^{50.5} f(x) dx = ∫_{49.5}^{50.5} [1.5 − 6(x − 50.0)²] dx
                        = ∫_{49.5}^{50.5} 1.5 dx − ∫_{49.5}^{50.5} 6(x − 50.0)² dx
                        = 1.5 × (50.5 − 49.5) − ∫_{49.5}^{50.5} 6(x − 50.0)² dx      (5.2)
and, with the substitution u = x − 50.0,
∫_{−0.5}^{0.5} 6u² du = 6 [u³/3]_{−0.5}^{0.5} = 6 × (1/12) = 0.5
Therefore, from Equation (5.2), we have
∫_{49.5}^{50.5} f(x) dx = 1.5 − 0.5 = 1
Since both conditions are satisfied, f (x) is indeed a valid probability density
function.
[Figure: graph of the pdf f(x) = 1.5 − 6(x − 50.0)² plotted over 49.4 ≤ x ≤ 50.6.]
(ii). We can find the probability that a metal cylinder has a diameter between
49.8 and 50.1 mm which is
P(49.8 ≤ X ≤ 50.1) = ∫_{49.8}^{50.1} f(x) dx
 = ∫_{49.8}^{50.1} [1.5 − 6(x − 50.0)²] dx
 = ∫_{49.8}^{50.1} 1.5 dx − ∫_{49.8}^{50.1} 6(x − 50.0)² dx
 = 1.5 [x]_{49.8}^{50.1} − 6 ∫_{−0.2}^{0.1} u² du      [let u = x − 50.0]
 = 1.5(50.1 − 49.8) − 6 [u³/3]_{−0.2}^{0.1}
 = 0.45 − 6 ((0.1)³/3 − (−0.2)³/3)
 = 0.45 − 0.018 = 0.432
Thus, the probability that a metal cylinder has a diameter between 49.8 and
50.1 mm is 0.432 or 43.2%.
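Both the normalization check and this interval probability can be verified numerically; the midpoint-rule helper below is an illustrative assumption, not part of the text:

```python
# Numerical check of the metal-cylinder example using a simple
# midpoint rule (no external libraries assumed).
def f(x):
    return 1.5 - 6 * (x - 50.0) ** 2  # pdf on [49.5, 50.5]

def integrate(g, a, b, n=20_000):
    h = (b - a) / n
    return sum(g(a + (i + 0.5) * h) for i in range(n)) * h

total = integrate(f, 49.5, 50.5)   # should be 1
prob = integrate(f, 49.8, 50.1)    # should be 0.432
print(round(total, 4), round(prob, 4))  # 1.0 0.432
```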
f(x) = dF(x)/dx.
This relationship connects the probability density function to the cumulative
probability.
P (a ≤ X ≤ b) = F (b) − F (a).
As x → −∞, the cumulative distribution function F(x) approaches 0, indicating that the probability of the random variable falling below arbitrarily small values tends to 0.
Solution
F(x) = 0                                      for x < 49.5
F(x) = 1.5(x − 49.5) − 2(x − 50)³ − 0.25      for 49.5 ≤ x ≤ 50.5
F(x) = 1                                      for x > 50.5
[Figure: graph of the cumulative distribution function F(x), increasing from 0 at x = 49.5 to 1 at x = 50.5.]
f(x) = 2x for 0 ≤ x ≤ 1, and f(x) = 0 otherwise
(i). Verify that the function f (x) is a valid probability density function by
showing that the total area under the curve is equal to 1.
Solution
(i). To verify that the function
f(x) = 2x for 0 ≤ x ≤ 1, and f(x) = 0 otherwise,
is a valid probability density function (pdf), we calculate the integral over its
range:
∫_{−∞}^{∞} f(x) dx = ∫_0^1 2x dx
Calculating the integral:
∫_0^1 2x dx = [x²]_0^1 = 1² − 0² = 1
Since
∫_{−∞}^{∞} f(x) dx = 1,
we conclude that f (x) is a valid probability density function.
(ii). We compute
P(0.5 < X < 0.8) = ∫_{0.5}^{0.8} 2x dx = [x²]_{0.5}^{0.8} = (0.8)² − (0.5)² = 0.39
For 0 ≤ x ≤ 1:
F(x) = ∫_0^x 2t dt = [t²]_0^x = x²
(i). Find the probability density function (pdf) f(x). Verify that f(x) is a valid pdf.
Solution
(i). It is noted that f(x) ≥ 0 for all x and
∫_{−∞}^{∞} f(x) dx = ∫_0^1 3x² dx = 3 [x³/3]_0^1 = 1.
Since f(x) ≥ 0 and ∫_{−∞}^{∞} f(x) dx = 1, f(x) is a valid probability density function.
(ii). To find the probability P(0.2 < X < 0.5), we can use the pdf:
P(0.2 < X < 0.5) = ∫_{0.2}^{0.5} f(x) dx = ∫_{0.2}^{0.5} 3x² dx
                 = [x³]_{0.2}^{0.5} = (0.5)³ − (0.2)³
                 = 0.125 − 0.008 = 0.117
Alternatively,
P(0.2 < X < 0.5) = F(0.5) − F(0.2) = (0.5)³ − (0.2)³ = 0.117
Hence, the probability P (0.2 < X < 0.5) is 0.117.
Problem 5.7. Let X be a continuous random variable with the pdf:
f(x) = (1/2) e^{−|x|}   for −∞ < x < ∞
Compute F (x).
Solution
For −∞ < x < ∞, the cumulative distribution function is
F(x) = ∫_{−∞}^{x} (1/2) e^{−|t|} dt
Since e^{−|t|} can be split into two parts depending on the sign of t:
For x < 0:
F(x) = ∫_{−∞}^{x} (1/2) e^t dt = (1/2) [e^t]_{−∞}^{x} = (1/2)(e^x − 0) = (1/2) e^x
For x ≥ 0:
F(x) = ∫_{−∞}^{0} (1/2) e^t dt + ∫_{0}^{x} (1/2) e^{−t} dt
     = (1/2) [e^t]_{−∞}^{0} + (1/2) [−e^{−t}]_{0}^{x}
     = (1/2)(1 − 0) + (1/2)(1 − e^{−x})
     = 1 − (1/2) e^{−x}
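The piecewise CDF derived in this problem can be written as a small Python function (an illustrative sketch):

```python
import math

# cdf of the double-exponential density f(x) = 0.5 * exp(-|x|),
# following the piecewise result: F(x) = 0.5 e^x for x < 0,
# and F(x) = 1 - 0.5 e^(-x) for x >= 0.
def F(x):
    if x < 0:
        return 0.5 * math.exp(x)
    return 1.0 - 0.5 * math.exp(-x)

print(F(0), F(1), F(-1))
```

By symmetry of the density, F(x) + F(−x) = 1 for every x, which is a handy sanity check.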
5.4.3 Exercises
1. Consider a random variable measuring the following quantities. In each
case, state with reasons whether you think it is more appropriate to define
the random variable as discrete or continuous.
(a) The number of books in a library
(b) The duration of a phone call
(c) The number of steps a person takes in a day
(d) The amount of rainfall in a month
(e) The number of languages a person speaks
(f) The speed of a car on a highway
(a) Find the value of k and then make a sketch of the probability density
function.
(b) What is P (2.5 ≤ Z ≤ 4)?
(c) Construct and sketch the cumulative distribution function.
(b) What is P (X ≤ 2)?
(c) What is P (1 ≤ X ≤ 3)?
(d) Construct and sketch the probability density function.
5.5 The Expectation of a Random Variable
While the probability mass function or the probability density function provides
complete information about the probabilistic properties of a random variable,
it is often useful to use some summary measures of these properties. One of
the most fundamental summary measures is the expectation or mean of a ran-
dom variable, denoted by E(X), which represents the “average” value of the
random variable. Two random variables with the same expected value can be
considered to have the same average value, even though their probability mass
functions or probability density functions may differ significantly.
E[X] = Σ_x x · P(X = x)
Given the probability mass function (pmf) for X in Table 5.2, the calcula-
tions for the E[X] are shown in the following table.
x      P(X = x)    x · P(X = x)
0      1/8         0
1      3/8         3/8
2      3/8         6/8
3      1/8         3/8
Total              12/8

E[X] = Σ_x x · P(X = x) = 12/8 = 1.5
Therefore, the expected number of defective components E[X] is 1.5.
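This weighted-sum calculation translates directly into Python (an illustrative sketch with the pmf from the example):

```python
# Expected value of a discrete random variable: E[X] = sum of x * P(X = x).
# pmf from the defective-components example (1/8, 3/8, 3/8, 1/8).
xs = [0, 1, 2, 3]
probs = [1 / 8, 3 / 8, 3 / 8, 1 / 8]

expectation = sum(x * p for x, p in zip(xs, probs))
print(expectation)  # 1.5
```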
Problem 5.8.
An office has four copying machines, and the random variable X denotes how
many of them are in use in a particular time. Suppose the probability mass
function X is given below:
x 0 1 2 3 4
Pr(X = x) k 0.02 0.05 0.4 (k + 0.2)
Solution
(a) To find the value of k, we set up the equation based on the property that the sum of probabilities must equal 1:
k + 0.02 + 0.05 + 0.4 + (k + 0.2) = 1
or, 2k + 0.67 = 1
or, 2k = 0.33
∴ k = 0.33/2 = 0.165
(b) To find the expectation E(X) of the random variable X, we use the
formula for the expected value:
E(X) = Σ_{x=0}^{4} x · Pr(X = x)
Pr(X = 0) = k = 0.165,
Pr(X = 1) = 0.02,
Pr(X = 2) = 0.05,
Pr(X = 3) = 0.4,
Pr(X = 4) = k + 0.2 = 0.165 + 0.2 = 0.365.
Therefore,
E(X) = 0 · 0.165 + 1 · 0.02 + 2 · 0.05 + 3 · 0.4 + 4 · 0.365 = 2.78
E(X) = ∫_{49.5}^{50.5} x f(x) dx = ∫_{49.5}^{50.5} x [1.5 − 6(x − 50)²] dx
     = 1.5 [x²/2]_{49.5}^{50.5} − ∫_{49.5}^{50.5} 6x(x − 50)² dx      [let u = x − 50]
     = 1.5 (2550.25/2 − 2450.25/2) − 6 ∫_{−0.5}^{0.5} (u³ + 50u²) du
     = 75 − 6 [u⁴/4]_{−0.5}^{0.5} − 6 · 50 [u³/3]_{−0.5}^{0.5}
     = 75 − 0 − 6 · 50 · ((0.5)³/3 + (0.5)³/3)
     = 75 − 25 = 50
5.5.3 Exercises
1. Suppose the laptop repair costs are $50, $200, and $350 with respective probability values of 0.3, 0.2, and 0.5. What is the expected laptop repair cost?
2. Suppose the daily sales of a small shop are $100, $150, and $250 with
respective probability values of 0.4, 0.3, and 0.3. What is the expected
daily sales?
3. A game offers prizes of $10, $50, and $100 with respective probability
values of 0.6, 0.3, and 0.1. What is the expected prize amount?
4. Consider the waiting times (in minutes) at a bus stop: 5, 10, and 15 with
respective probability values of 0.5, 0.3, and 0.2. What is the expected
waiting time?
5. The lifetime (in years) of a certain type of light bulb is either 1, 3, or
5 with respective probability values of 0.2, 0.5, and 0.3. What is the
expected lifetime of the light bulb?
6. The number of daily website visits for a company is either 200, 500, or
800 with respective probability values of 0.25, 0.5, and 0.25. What is the
expected number of daily visits?
7. Let the temperature X in degrees Fahrenheit of a particular chemical
reaction with density
f(x) = (x − 190)/3600,   220 ≤ x ≤ 280.
Find the expectation of the temperature.
5.6 The Variance of a Random Variable
Another key summary measure of the distribution of a random variable is the
variance, which quantifies the spread or variability in the values that the random
variable can take. While the mean or expectation captures the central or average
value of the random variable, the variance measures the dispersion or deviation
of the random variable around its mean value. Specifically, the variance of a
random variable is defined as
Var(X) = E[(X − E(X))²],
or equivalently,
Var(X) = E(X²) − (E(X))².
The variance is a positive measure that indicates the spread of the distri-
bution of the random variable around its mean value. Larger values of the
variance suggest that the distribution is more spread out.
The concept of variance can be illustrated graphically. Figure 5.8 shows two
probability density functions with different mean values but identical variances.
The variances are the same because the shape or spread of the density func-
tions around their mean values is the same. In contrast, Figure 5.9 shows two
probability density functions with the same mean values but different variances.
The density function that is flatter and more spread out has the larger variance.
[Figure: two normal density curves with µ = 0, σ² = 1 and µ = 14, σ² = 1.]
Figure 5.8: Two normal distributions with different means but identical vari-
ances.
[Figure: two normal density curves with µ = 0, σ² = 1 and µ = 0, σ² = 4.]
Figure 5.9: Two normal distributions with identical means but different variances.
It is important to note that the standard deviation has the same units as the
random variable X, while the variance has units that are squared. For instance,
if the random variable X is measured in seconds, then the standard deviation
will also be in seconds, but the variance will be measured in seconds squared
(seconds2 ).
E[X²] = Σ_x x² · P(X = x)
Given the probability mass function (pmf) for X:
x      P(X = x)    x · P(X = x)    x² · P(X = x)
0      1/8         0               0
1      3/8         3/8             3/8
2      3/8         6/8             12/8
3      1/8         3/8             9/8
Total              12/8 = 1.5      24/8 = 3
Var(X) = 3 − (1.5)2
Var(X) = 3 − 2.25
Var(X) = 0.75
Therefore, the variance of X is 0.75.
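The shortcut formula Var(X) = E[X²] − (E[X])² can be checked in Python for the same pmf (an illustrative sketch):

```python
# Var(X) = E[X^2] - (E[X])^2 for the defective-components pmf.
xs = [0, 1, 2, 3]
probs = [1 / 8, 3 / 8, 3 / 8, 1 / 8]

mean = sum(x * p for x, p in zip(xs, probs))               # 1.5
second_moment = sum(x**2 * p for x, p in zip(xs, probs))   # 3.0
variance = second_moment - mean**2
print(variance)  # 0.75
```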
5.6.1 Example: Metal Cylinder Production
The probability density function of the diameter of a metal cylinder (X) is
f(x) = 1.5 − 6(x − 50.0)²   for 49.5 ≤ x ≤ 50.5
and E(X) = 50. To find the variance V(X), we need E(X²):
E(X²) = ∫_{49.5}^{50.5} x² f(x) dx = ∫_{49.5}^{50.5} x² [1.5 − 6(x − 50.0)²] dx
      = ∫_{49.5}^{50.5} 1.5x² dx − ∫_{49.5}^{50.5} 6x²(x − 50)² dx
      = 1.5 [x³/3]_{49.5}^{50.5} − 6 ∫_{−0.5}^{0.5} (u + 50)² u² du      [let u = x − 50]
      = 1.5 (50.5³/3 − 49.5³/3) − 6 ∫_{−0.5}^{0.5} (u⁴ + 100u³ + 2500u²) du
      = 3750.125 − 1250.075 = 2500.05
Thus, the variance is V(X) = E(X²) − (E(X))² = 2500.05 − 50² = 0.05, and the standard deviation is sd(X) = √0.05 = 0.2236.
Problem 5.9. Consider a random variable X representing the number of heads
in three tosses of a fair coin. The possible values of X are 0, 1, 2, and 3. The
pmf of X is given by:
P(X = x) = (3 choose x) · (1/2)³,   for x = 0, 1, 2, 3.
Find the expected value and variance of X.
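A hedged numerical check of this problem (a sketch that simply enumerates the pmf; it is not a substitute for the analytical solution):

```python
import math

# pmf of the number of heads in three fair-coin tosses:
# P(X = x) = C(3, x) * (1/2)^3.
pmf = {x: math.comb(3, x) * 0.5**3 for x in range(4)}

mean = sum(x * p for x, p in pmf.items())
variance = sum(x**2 * p for x, p in pmf.items()) - mean**2
print(mean, variance)  # 1.5 0.75
```

This agrees with the binomial formulas np = 3(1/2) and np(1 − p) = 3(1/2)(1/2).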
or equivalently,
P(µ − kσ ≤ X ≤ µ + kσ) ≥ 1 − 1/k².
1 − 1/k² ≥ 0.90
Solving for k:
1/k² ≤ 0.10
or, k² ≥ 1/0.10 = 10
∴ k ≥ √10 ≈ 3.16
So, at least 90% of systolic blood pressure measurements should fall within:
1 − 1/k² ≥ 0.80, so that
1/k² ≤ 0.20  ⟹  k² ≥ 1/0.20 = 5
k ≥ √5 ≈ 2.24
Thus, at least 80% of salaries should fall within:
F (x) = p
meaning that there is a probability p that the random variable is less than
the p-th quantile. The probability p is often expressed as a percentage, and
the corresponding quantiles are known as percentiles. For instance, the 70th
percentile is the value x for which F (x) = 0.70. It is important to note that
the 50th percentile of a distribution is also known as the median.
F (x) = p
This is also known as the p × 100-th percentile of the random variable.
The probability p signifies the chance that the random variable takes on a
value less than the p-th quantile.
The interquartile range, which is the distance between the upper and lower
quartiles as depicted in Figure 2.55, serves as an indicator of distribution spread
similar to variance. A larger interquartile range suggests that the distribution
of the random variable is more spread out.
The interquartile range, defined as the distance between these two quar-
tiles, provides a measure of distribution spread analogous to variance.
F (x) = 0.75.
That is,
1.5x − 2(x − 50.0)³ − 74.5 = 0.75
This equation can be solved numerically to find the precise value of x which
corresponds to Q3 = 50.17 mm.
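The numerical solution mentioned here can be sketched with a simple bisection search (the solver below is an illustrative assumption, not the book's method):

```python
# Solve F(x) = p for quartiles of the metal-cylinder distribution
# by bisection; F is monotone increasing on [49.5, 50.5].
def F(x):
    return 1.5 * x - 2 * (x - 50.0) ** 3 - 74.5

def solve(target, lo=49.5, hi=50.5, tol=1e-10):
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if F(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

q3 = solve(0.75)
q1 = solve(0.25)
print(round(q3, 2), round(q1, 2))  # 50.17 49.83
```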
[Figure: the pdf f(x) with the lower quartile Q1 and upper quartile Q3 marked on the x-axis.]
Similarly, solving F(x) = 0.25 results in Q1 = 49.83 mm. Consequently, the interquartile range is calculated as Q3 − Q1 = 50.17 − 49.83 = 0.34 mm.
5.6.7 Exercises
1. Consider the laptop repair costs discussed in question 1 of Exercises 5.5.3. Calculate the variance and standard deviation of the repair cost.
2. In the machine breakdown problem, suppose that electrical failures generally cost $400 to repair, mechanical failures have a repair cost of $550, and operator misuse failures have a repair cost of only $100. These repair costs generate a random variable, cost, as illustrated in the following table.
(c) Find the upper and lower quartiles of this random variable.
(d) What is the interquartile range?
(a) What is the probability that the strength is between 450 MPa and
550 MPa?
(b) What is the 95th percentile of the strength?
(c) Calculate the variance of the strength.
(d) What proportion of material samples have a strength greater than
600 MPa?
6. A random variable Z represents the systolic blood pressure (in mmHg) of
a population, which is uniformly distributed between 90 and 140 mmHg.
f(z) = 1/50   for 90 ≤ z ≤ 140.
(a) What is the variance of this random variable?
(b) What is the standard deviation of this random variable?
(c) Find the upper and lower quartiles of this random variable.
(c) If a randomly selected adult has a cholesterol level of 250 mg/dL,
what is the maximum probability that this level deviates from the
mean by at least 50 mg/dL according to Chebyshev’s Inequality?
for x ≥ 0.
(a) Find the expected battery failure time.
(b) What is the probability that the battery fails within the first 4 hours.
(c) Find the cumulative distribution function of the battery failure times.
(d) Find the median of the battery failure times.
It can help us find moments (for example, mean, variance), combine variables,
and understand the distribution better. The Moment Generating Function of
a random variable X is defined as:
M(t) = E[e^{tX}].
E[X²] = M″(0)
■ The r-th moment is found by taking the r-th derivative of the MGF and evaluating it at t = 0:
E[X^r] = d^r M(t)/dt^r |_{t=0}
• Combining Variables:
Solution
The MGF of a random variable X is given by:
M(t) = E[e^{tX}] = Σ_x e^{tx} P(X = x)
We have,
P (X = 1) = p and P (X = 0) = 1 − p
Substituting into the MGF formula:
M(t) = e^{t·1} · p + e^{t·0} · (1 − p) = pe^t + (1 − p)
Mean of X: To find the mean E[X], we differentiate the MGF with respect
to t and evaluate at t = 0:
M′(t) = d/dt [pe^t + (1 − p)] = pe^t
Evaluating at t = 0:
M ′ (0) = pe0 = p
Thus, the mean of X is:
E[X] = p
Variance of X:
To find the variance, we first calculate the second moment E[X 2 ], which is the
second derivative of the MGF evaluated at t = 0:
M″(t) = d/dt [pe^t] = pe^t
Evaluating at t = 0
M ′′ (0) = pe0 = p
E[X 2 ] = p
The variance is given by
Var(X) = E[X²] − (E[X])² = p − p² = p(1 − p)
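These MGF results can be spot-checked with numerical differentiation; the value p = 0.3 and the step size h are illustrative assumptions:

```python
import math

# Numerical check of the Bernoulli MGF: M(t) = p e^t + (1 - p),
# so M'(0) = p and M''(0) = p.
p = 0.3

def M(t):
    return p * math.exp(t) + (1 - p)

h = 1e-5
mean = (M(h) - M(-h)) / (2 * h)               # ~ E[X] = p
second = (M(h) - 2 * M(0) + M(-h)) / h**2     # ~ E[X^2] = p
variance = second - mean**2                   # ~ p(1 - p)
print(round(mean, 4), round(variance, 4))  # 0.3 0.21
```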
Problem 5.11 (Continuous Random Variable). The pdf of X is given by:
f(x) = 1 if 0 ≤ x ≤ 1, and f(x) = 0 otherwise.
This is a density of the uniform distribution on [0, 1]. Find the moment gener-
ating function of X, and hence find the mean and variance.
Solution
For this density the moment generating function is
M(t) = E[e^{tX}] = ∫_0^1 e^{tx} · 1 dx = [e^{tx}/t]_0^1 = (e^t − 1)/t,   for t ≠ 0.
M′(t) = d/dt [(e^t − 1)/t].
Applying the quotient rule:
M′(t) = [t · e^t − (e^t − 1)]/t² = (te^t − e^t + 1)/t² = (e^t(t − 1) + 1)/t².
To evaluate the mean, we need to find the limit of M ′ (t) as t → 0. We have
the indeterminate form 00 , so we apply L’Hôpital’s rule. To do this, we need to
differentiate the numerator and denominator separately. So, we find,
d/dt [(t − 1)e^t + 1] = e^t + (t − 1)e^t = te^t,
and
d/dt (t²) = 2t.
Hence, by L'Hôpital's rule,
E[X] = M′(0) = lim_{t→0} te^t / (2t) = lim_{t→0} e^t / 2 = 1/2.
M″(t) = [t² · te^t − 2t((t − 1)e^t + 1)] / t⁴ = [e^t(t² − 2t + 2) − 2] / t³.
Evaluating at t = 0:
M″(0) = lim_{t→0} [e^t(t² − 2t + 2) − 2] / t³ = 1/3.
So, E[X²] = 1/3.
Thus, the first two moments of a uniform random variable X on [0, 1] are
E[X] = 1/2 and E[X²] = 1/3,
and the variance is Var(X) = E[X²] − (E[X])² = 1/3 − 1/4 = 1/12.
variables that take non-negative integer values, such as the number of successes
in a binomial distribution, or the number of events in a Poisson distribution.
1. Normalization Property
The PGF at s = 1 is always equal to 1:
G(1) = Σ_{x=0}^{∞} 1^x P(x) = Σ_{x=0}^{∞} P(x) = 1
2. Probability Recovery
The probability that the random variable X takes the value x can be recovered
by differentiating the PGF:
P(X = x) = (1/x!) d^x G(s)/ds^x |_{s=0}
This formula allows for the extraction of individual probabilities from the PGF.
3. Expected Value (Mean)
The expected value E[X] of the random variable X can be obtained by differ-
entiating the PGF and evaluating at s = 1:
E[X] = d G(s)/ds |_{s=1}
This gives the mean of the distribution directly from the PGF.
4. Variance
The variance Var(X) can be derived using the PGF. First, compute the first and second derivatives of the PGF. The second derivative evaluated at s = 1 gives the factorial moment
E[X(X − 1)] = d²G(s)/ds² |_{s=1},
so that
Var(X) = G″(1) + G′(1) − (G′(1))².
This property is useful for dealing with sums of independent random variables.
6. PGF of a Constant
If X is a constant random variable, i.e., P (X = c) = 1 for some constant c, the
PGF is:
G(s) = sc
This reflects that the random variable always takes the value c, so the PGF has
only one non-zero term at x = c.
7. Derivative Relations
The r-th factorial moment E[X(X − 1) · · · (X − r + 1)] can be derived from the PGF by differentiating it r times and evaluating at s = 1:
E[X(X − 1) · · · (X − r + 1)] = d^r G(s)/ds^r |_{s=1}
This formula is useful for calculating higher-order moments of the distribution.
Problem 5.12. Consider a random variable X with parameter λ. The proba-
bility mass function of X is:
p(x) = λ^x e^{−λ} / x!,   for x = 0, 1, 2, . . . .
Find PGF, and hence mean and variance.
Solution
The PGF G(s) is defined as:
G(s) = E[s^X] = Σ_{x=0}^{∞} p(x) s^x.
Substituting the Poisson pmf:
G(s) = Σ_{x=0}^{∞} (λs)^x e^{−λ} / x! = e^{−λ} e^{λs} = e^{λ(s−1)}.
E[X] = G′ (1).
First, compute the derivative of G(s):
G(s) = e^{λ(s−1)} and G′(s) = λ e^{λ(s−1)}.
Evaluating at s = 1:
G′(1) = λ, so the mean is E[X] = λ. Differentiating again, G″(s) = λ² e^{λ(s−1)}, so E[X(X − 1)] = G″(1) = λ² and
Var(X) = λ² + λ − λ² = λ.
Thus, the variance Var(X) is λ.
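A numerical spot-check of the PGF results (λ = 4 and the finite-difference step are illustrative assumptions):

```python
import math

# Numerical check of the Poisson PGF: G(s) = exp(lam * (s - 1)),
# with G'(1) = lam, G''(1) = lam^2, and Var = G''(1) + G'(1) - G'(1)^2.
lam = 4.0

def G(s):
    return math.exp(lam * (s - 1))

h = 1e-5
g1 = (G(1 + h) - G(1 - h)) / (2 * h)           # ~ E[X] = lam
g2 = (G(1 + h) - 2 * G(1) + G(1 - h)) / h**2   # ~ E[X(X-1)] = lam^2
variance = g2 + g1 - g1**2                     # ~ lam
print(round(g1, 3), round(variance, 3))  # 4.0 4.0
```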
Applications
PGFs are used in various fields including:
The PGF is a compact and powerful tool for handling problems involving
sums of random variables and their distributions.
5.7.4 Characteristic Function (CF)
The characteristic function (CF) of a random variable X is a fundamental tool
in probability theory, and it is closely related to the moment generating func-
tion. The characteristic function provides an alternative way to describe the
distribution of X, and it is particularly useful in the study of sums of indepen-
dent random variables. It is important in data science, particularly in areas
related to probability theory, statistical inference, and stochastic processes.
φ(t) = E[e^{itX}],
7. Derivatives and Moments: The n-th moment of X, if it exists, is given by:
E[X^n] = i^{−n} d^n φ(t)/dt^n |_{t=0}.
• P(X = 0) = p
• P(X = 1) = 1 − p
where 0 ≤ p ≤ 1. Find the characteristic function of X, and hence find mean
and variance.
Solution
1. Characteristic Function
The characteristic function φ(t) of a discrete random variable X is defined as:
φ(t) = E[eitX ]
where E denotes the expectation and i is the imaginary unit.
For the given pmf:
φ(t) = E[e^{itX}] = Σ_x e^{itx} P(X = x)
Substituting the values for X:
φ(t) = e0 · p + eit · (1 − p)
φ(t) = p + (1 − p)eit
2. Mean of X
The mean E[X] can be derived from the characteristic function using property 7 with n = 1 (note i^{−1} = −i):
E[X] = −i · dφ(t)/dt |_{t=0}
dφ(t)/dt = d/dt [p + (1 − p)e^{it}] = (1 − p) · i e^{it}
Evaluating at t = 0:
dφ(t)/dt |_{t=0} = (1 − p) · i e^{i·0} = (1 − p) · i
E[X] = −i · (1 − p) · i = −i²(1 − p) = 1 − p
3. Variance of X
To find the variance, we first need the second moment E[X 2 ].
The second moment E[X 2 ] can be derived from the characteristic function
as follows:
E[X²] = − d²φ(t)/dt² |_{t=0}
d²φ(t)/dt² = d/dt [(1 − p) · i e^{it}] = (1 − p) · i² e^{it} = −(1 − p) e^{it}
Evaluating at t = 0:
d²φ(t)/dt² |_{t=0} = −(1 − p) e^{i·0} = −(1 − p)
Thus E[X²] = 1 − p, and Var(X) = E[X²] − (E[X])² = (1 − p) − (1 − p)² = p(1 − p).
Hence,
• Characteristic function: φ(t) = p + (1 − p)eit
• Mean: E[X] = 1 − p
• Variance: Var(X) = p(1 − p)
Find the characteristic function of X, and hence find the mean and variance.
Solution
Let’s find the characteristic function of a uniform random variable X on the
interval [0, 1].
The characteristic function is:
φ(t) = E[e^{itX}] = ∫_{−∞}^{∞} e^{itx} f(x) dx,
where
f(x) = 1 if 0 ≤ x ≤ 1, and f(x) = 0 otherwise.
φ(t) = ∫_0^1 e^{itx} dx = [e^{itx}/(it)]_0^1 = (e^{it} − 1)/(it).
This can also be written as:
φ(t) = sin(t)/t + i (1 − cos(t))/t,   for t ≠ 0.
Mean
The mean E[X] is:
E[X] = −i · dφ(t)/dt |_{t=0}
Using the power series e^{it} = Σ_{k=0}^{∞} (it)^k / k!, the characteristic function expands as
φ(t) = 1 + it/2 + (it)²/6 + · · · ,
so dφ(t)/dt |_{t=0} = i/2 and E[X] = −i · (i/2) = 1/2.
Variance
To find the variance, we first need the second moment E[X²]:
E[X²] = − d²φ(t)/dt² |_{t=0}
From the expansion φ(t) = 1 + it/2 + (it)²/6 + · · · , the t² coefficient gives
d²φ(t)/dt² |_{t=0} = 2 · i²/6 = −1/3,
so E[X²] = 1/3. The variance is:
Var(X) = E[X²] − (E[X])² = 1/3 − (1/2)² = 1/12
Hence,
• Mean: E[X] = 1/2
• Variance: Var(X) = 1/12
5.7.6 Exercises
1. Consider a discrete random variable Y with the following probability mass
function (pmf):
P(Y = k) = 1/2 for k = 1, and P(Y = k) = 1/2 for k = 2.
2. Let Z be a continuous random variable with the probability density function (pdf):
fZ(z) = 2z/θ² if 0 ≤ z ≤ θ, and 0 otherwise,
where θ > 0 is a parameter.
(a) Find the moment generating function MZ(t) of Z.
(b) Find the characteristic function φZ (t) of Z.
(c) Using the moment generating function, determine the mean and vari-
ance of Z.
P(X = k) = 3^{−k} / Σ_{i=0}^{∞} 3^{−i},   for k = 0, 1, 2, . . .
(a) Find the probability generating function (PGF) G(s) of the random
variable X.
(b) Use the PGF to determine E[X] and Var(X).
(c) Verify your results using the properties of the PGF.
6. Let V be an exponential random variable with rate parameter λ. The
probability density function is:
fV(v) = λe^{−λv} if v ≥ 0, and 0 otherwise.
(a) Find the moment generating function MV(t) of V.
(b) Find the characteristic function φV (t) of V .
(c) Using the moment generating function, determine the mean and vari-
ance of V .
F (x, y) = P (X ≤ x, Y ≤ y).
Let the random variable X denote the maintenance time in hours at a lo-
cation, taking values 1, 2, 3, and 4. Let the random variable Y represent the
number of servers at the location, taking values 1, 2, and 3. These two random
variables are considered jointly distributed.
The joint probability mass function pij for these variables is given in the
table below:
                        Number of Servers (Y)
Maintenance Time (X)      1      2      3
1                       0.12   0.08   0.01
2                       0.08   0.15   0.01
3                       0.07   0.21   0.02
4                       0.05   0.13   0.07
For example, the table shows that there is a 0.12 probability that X = 1
and Y = 1, meaning a randomly selected location has one server that takes one
hour to maintain. Similarly, the probability is 0.08 that a location with three
For instance, the probability that a location has no more than two servers and that the maintenance time does not exceed two hours is:
F(2, 2) = p11 + p12 + p21 + p22 = 0.12 + 0.08 + 0.08 + 0.15 = 0.43
fX,Y(x, y) = ∂²/∂x∂y P(X ≤ x, Y ≤ y).
The joint probability density function must satisfy the condition:
∬_{state space} f(x, y) dx dy = 1.
about the joint probabilistic behavior of the random variables X and Y . For
instance, the probability that a randomly selected ore sample has a zinc content
between 0.8 and 1.0 and an iron content between 25 and 30 is given by
∫_{0.8}^{1.0} ∫_{25.0}^{30.0} f(x, y) dy dx,
which evaluates to 0.092. Thus, only about 9% of the ore at this location
has mineral levels within these specified ranges.
pX(x) = Σ_y pX,Y(x, y)   and   pY(y) = Σ_x pX,Y(x, y)
Using these marginal distributions, we can easily find the mean of X and Y .
                        Number of Servers (Y)
Maintenance Time (X)      1      2      3    PX(x)
1                       0.12   0.08   0.01   0.21
2                       0.08   0.15   0.01   0.24
3                       0.07   0.21   0.02   0.30
4                       0.05   0.13   0.07   0.25
PY(y)                   0.32   0.57   0.11   1.00
Table 5.7: Joint probability mass function for server maintenance with marginal distributions.
Mean of X
µX = Σ_x x · PX(x) = 1 · 0.21 + 2 · 0.24 + 3 · 0.30 + 4 · 0.25 = 2.59
Expected Value of X²
E(X²) = Σ_x x² · PX(x) = 1² · 0.21 + 2² · 0.24 + 3² · 0.30 + 4² · 0.25 = 7.87
Variance of X
Var(X) = E(X²) − (µX)² = 7.87 − (2.59)² = 1.1619
Standard Deviation of X
σX = √Var(X) = √1.1619 ≈ 1.0779
Mean of Y
µY = Σ_y y · PY(y) = 1 · 0.32 + 2 · 0.57 + 3 · 0.11 = 1.79
Expected Value of Y²
E(Y²) = Σ_y y² · PY(y) = 1² · 0.32 + 2² · 0.57 + 3² · 0.11 = 3.59
Variance of Y
Var(Y) = E(Y²) − (µY)² = 3.59 − (1.79)² = 0.3859
Standard Deviation of Y
σY = √Var(Y) = √0.3859 ≈ 0.6212
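The marginal calculations above can be reproduced from the joint table in plain Python (an illustrative sketch):

```python
# Marginals and means recomputed from the joint pmf in Table 5.7.
# Rows are maintenance times x = 1..4, columns are server counts y = 1..3.
joint = [
    [0.12, 0.08, 0.01],
    [0.08, 0.15, 0.01],
    [0.07, 0.21, 0.02],
    [0.05, 0.13, 0.07],
]
xs = [1, 2, 3, 4]
ys = [1, 2, 3]

px = [sum(row) for row in joint]                       # marginal of X
py = [sum(row[j] for row in joint) for j in range(3)]  # marginal of Y

mu_x = sum(x * p for x, p in zip(xs, px))
mu_y = sum(y * p for y, p in zip(ys, py))
print(round(mu_x, 2), round(mu_y, 2))  # 2.59 1.79
```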
fX(x) = ∫_{20.0}^{35.0} f(x, y) dy
 = ∫_{20.0}^{35.0} [39/400 − 17(x − 1)²/50 − (y − 25)²/10,000] dy
 = [39y/400 − 17y(x − 1)²/50 − (y − 25)³/30,000]_{20.0}^{35.0}
 = 57/40 − 51(x − 1)²/10   for 0.5 ≤ x ≤ 1.5.
So, the expected zinc content E(X) is:
E(X) = ∫_{0.5}^{1.5} x fX(x) dx
 = ∫_{0.5}^{1.5} x [57/40 − 51(x − 1)²/10] dx
 = (57/40) ∫_{0.5}^{1.5} x dx − (51/10) ∫_{0.5}^{1.5} x(x − 1)² dx
 = 1.
The probability that the zinc content lies between 0.8 and 1.0 can be determined using the marginal probability density function. This probability is given by:
P(0.8 ≤ X ≤ 1.0) = ∫_{0.8}^{1.0} fX(x) dx
 = ∫_{0.8}^{1.0} [57/40 − 51(x − 1)²/10] dx
 = [57x/40 − 17(x − 1)³/10]_{0.8}^{1.0}
 = [1.425] − [1.1536]
 = 0.2714
Therefore, approximately 27% of the ore has a zinc content within these
limits.
The marginal probability density function of Y, the iron content of the ore, is given by:
fY(y) = ∫_{0.5}^{1.5} f(x, y) dx
 = ∫_{0.5}^{1.5} [39/400 − 17(x − 1)²/50 − (y − 25)²/10,000] dx
 = [39x/400 − 17(x − 1)³/150 − x(y − 25)²/10,000]_{0.5}^{1.5}
 = 83/1200 − (y − 25)²/10,000   for 20.0 ≤ y ≤ 35.0.
The expected iron content and the standard deviation of the iron content are E(Y) = 27.36 and σ = 4.27, respectively.
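The marginal-pdf results for the zinc content can be verified numerically; the midpoint-rule helper is an illustrative assumption:

```python
# Numerical checks for the ore example: marginal pdf of the zinc content
# f_X(x) = 57/40 - 51(x - 1)^2 / 10 on [0.5, 1.5].
def f_x(x):
    return 57 / 40 - 51 * (x - 1.0) ** 2 / 10

def integrate(g, a, b, n=20_000):
    h = (b - a) / n
    return sum(g(a + (i + 0.5) * h) for i in range(n)) * h

total = integrate(f_x, 0.5, 1.5)                  # ~ 1 (valid pdf)
prob = integrate(f_x, 0.8, 1.0)                   # ~ 0.2714
mean = integrate(lambda x: x * f_x(x), 0.5, 1.5)  # ~ 1 (expected zinc content)
print(round(total, 4), round(prob, 4), round(mean, 4))  # 1.0 0.2714 1.0
```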
pY|X(y|x) = pX,Y(x, y) / pX(x)
Using these conditional distributions, we can easily find the mean of X given
Y and Y given X. Conditional distributions are often used to make predictions,
assess risks, and uncover underlying patterns in data that may not be apparent
from marginal distributions alone.
For the joint distribution of Section 5.8.2, the conditional distribution of X given Y = 1 is:

x              1      2      3      4
p_{X|Y}(x|1)   0.375  0.250  0.219  0.156

For Y = 1:

E(X|Y = 1) = 1(0.375) + 2(0.250) + 3(0.219) + 4(0.156) = 2.1563
Similarly, we can find the conditional distribution of X given Y = 2 or Y = 3, along with its mean, variance, and standard deviation, and likewise the conditional distribution of Y given different values of X.
Suppose the zinc content of an ore sample is known to be x = 0.55; what can be said about its iron content? The information regarding the iron content Y is encapsulated in the conditional probability density function, which is expressed as:

f_{Y|X=0.55}(y) = f(0.55, y) / f_X(0.55)

Here, the denominator represents the marginal distribution of the zinc content X evaluated at 0.55. Evaluating f_X(0.55):

f_X(0.55) = 57/40 − 51(0.55 − 1.00)²/10 = 0.39225

Thus, the conditional probability density function becomes:

f_{Y|X=0.55}(y) = f(0.55, y) / 0.39225
               = [ 39/400 − 17(0.55 − 1.00)²/50 − (y − 25)²/10,000 ] / 0.39225

Simplifying, we get:

f_{Y|X=0.55}(y) = 0.073 − (y − 25)²/3922.5

for 20.0 ≤ y ≤ 35.0. From this density one can easily find the conditional expectation of the iron content, which is calculated to be 27.14, and the conditional standard deviation, which is approximately 4.19.
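The conditional mean and standard deviation can be checked by numerical integration (a sketch; the exact values come out to about 27.14 and 4.19):

```python
from math import sqrt
from scipy.integrate import quad

# Conditional density of iron content given zinc content x = 0.55
def f_cond(y):
    return (39/400 - 17*(0.55 - 1)**2/50 - (y - 25)**2/10000) / 0.39225

total, _ = quad(f_cond, 20.0, 35.0)                  # should be 1
EY, _ = quad(lambda y: y * f_cond(y), 20.0, 35.0)
EY2, _ = quad(lambda y: y**2 * f_cond(y), 20.0, 35.0)
print(round(total, 4), round(EY, 2), round(sqrt(EY2 - EY**2), 2))   # 1.0 27.14 4.19
```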
fX,Y(x, y) = fX(x) · fY(y).

Example:

• Let X be the result of rolling a fair six-sided die, and Y be the result of flipping a fair coin, where X can take values 1 through 6, and Y can take values 0 (tails) and 1 (heads). The events are independent, so:

  P(X = 3 and Y = 1) = P(X = 3) · P(Y = 1) = (1/6) · (1/2) = 1/12
Problem 5.15. It is known that the ratio of gallium to arsenide does not affect the functioning of gallium-arsenide wafers, which are the main components of microchips. Let X denote the ratio of gallium to arsenide and Y denote the functional wafers retrieved during a 1-hour period. X and Y are independent random variables with the joint density function

f(x, y) = x(1 + 3y²)/4  for 0 < x < 2 and 0 < y < 1, and f(x, y) = 0 elsewhere.

Show that X and Y are independent.
Solution

To show that X and Y are independent random variables, we need to verify that the joint density function f(x, y) can be factored into the product of the marginal density functions fX(x) and fY(y). Specifically, X and Y are independent if and only if the joint density function f(x, y) can be written as:

f(x, y) = fX(x) · fY(y).
First, compute the marginal density function of X:

fX(x) = ∫_0^1 f(x, y) dy.

Given the joint density function f(x, y) = x(1 + 3y²)/4 for 0 < x < 2 and 0 < y < 1, compute:

fX(x) = ∫_0^1 x(1 + 3y²)/4 dy
      = (x/4) ∫_0^1 (1 + 3y²) dy
      = (x/4) [ ∫_0^1 1 dy + ∫_0^1 3y² dy ]
      = (x/4)(1 + 1)
      = x/2.

So, the marginal density function for X is:

fX(x) = x/2,   0 < x < 2.
Next, compute the marginal density function of Y:

fY(y) = ∫_0^2 f(x, y) dx
      = ∫_0^2 x(1 + 3y²)/4 dx
      = [(1 + 3y²)/4] ∫_0^2 x dx
      = [(1 + 3y²)/4] · 2
      = (1 + 3y²)/2.

Therefore, the marginal density function for Y is:

fY(y) = (1 + 3y²)/2,   0 < y < 1.
Verify independence

Check whether f(x, y) can be written as fX(x) · fY(y):

fX(x) · fY(y) = (x/2) · (1 + 3y²)/2 = x(1 + 3y²)/4.

This matches the given joint density function f(x, y). Since f(x, y) = fX(x) · fY(y), the random variables X and Y are independent.
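The factorization can also be spot-checked numerically over a grid of support points (a sketch; the lambda names are ours):

```python
import numpy as np

# Joint density and the derived marginals for Problem 5.15
f = lambda x, y: x * (1 + 3*y**2) / 4
f_X = lambda x: x / 2
f_Y = lambda y: (1 + 3*y**2) / 2

# f(x, y) should equal f_X(x) * f_Y(y) everywhere on 0 < x < 2, 0 < y < 1
X, Y = np.meshgrid(np.linspace(0.01, 1.99, 50), np.linspace(0.01, 0.99, 50))
print(np.allclose(f(X, Y), f_X(X) * f_Y(Y)))   # True
```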
Covariance

Covariance is essential for understanding and quantifying relationships between variables, which is a cornerstone of many data science techniques and analyses. It measures the joint variability of two random variables and can be computed as

Cov(X, Y) = E(XY) − E(X)E(Y).

Correlation

Correlation is a normalized form of covariance that measures the strength and direction of the linear relationship between two random variables. This normalization makes correlation a more interpretable metric. It is widely used in statistical analysis, machine learning, and data visualization to reveal and quantify relationships that might otherwise be obscured by differences in scale or units. The correlation coefficient is defined as:

Corr(X, Y) = Cov(X, Y) / (σX σY)

where σX and σY are the standard deviations of X and Y, respectively.
Problem 5.16. Suppose the maintenance time X (in hours) and the number of servers Y have the joint probability distribution given below. Find the correlation of X and Y.

                          Number of Servers (Y)
                            1      2      3
Maintenance Time (X)    1   0.12   0.08   0.01
                        2   0.08   0.15   0.01
                        3   0.07   0.21   0.02
                        4   0.05   0.13   0.07

Solution
The correlation coefficient is

ρX,Y = Cov(X, Y) / (σX σY)

Given the standard deviations σX = 1.0779 and σY = 0.6212, we can find the correlation coefficient ρX,Y using the covariance Cov(X, Y) = 0.2239. Substituting the given values:

ρX,Y = 0.2239 / (1.0779 × 0.6212) ≈ 0.3344

The correlation of 0.3344 suggests that there is a tendency for more servers to need maintenance as the maintenance time increases.
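All of these quantities follow mechanically from the joint probability table, so they can be reproduced in a few lines of numpy (a sketch; the array p holds the joint pmf with X as rows and Y as columns):

```python
import numpy as np

# Joint pmf of maintenance time X (rows) and number of servers Y (columns)
p = np.array([[0.12, 0.08, 0.01],
              [0.08, 0.15, 0.01],
              [0.07, 0.21, 0.02],
              [0.05, 0.13, 0.07]])
x = np.array([1, 2, 3, 4])
y = np.array([1, 2, 3])

# Marginals, means, covariance, and standard deviations
px, py = p.sum(axis=1), p.sum(axis=0)
mx, my = x @ px, y @ py
cov = x @ p @ y - mx * my
sx = np.sqrt(x**2 @ px - mx**2)
sy = np.sqrt(y**2 @ py - my**2)
print(round(cov, 4), round(cov / (sx * sy), 4))   # 0.2239 0.3344
```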
Problem 5.17. Consider two continuous random variables X and Y with the following joint probability density function:

fX,Y(x, y) = 4xy  if 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1,  and 0 otherwise.
Solution

(a). Are X and Y independent?

To check if X and Y are independent, we need to verify whether the joint PDF factorizes into the product of the marginal PDFs of X and Y. That is, we need to check if:

fX,Y(x, y) = fX(x) · fY(y).
Marginal PDF of X:

The marginal PDF of X is obtained by integrating the joint PDF over all possible values of y:

fX(x) = ∫_0^1 fX,Y(x, y) dy.

For 0 ≤ x ≤ 1, we compute:

fX(x) = ∫_0^1 4xy dy = 4x ∫_0^1 y dy = 4x [y²/2]_0^1 = 4x · (1/2) = 2x.

Thus, the marginal PDF of X is:

fX(x) = 2x  if 0 ≤ x ≤ 1,  and 0 otherwise.
Marginal PDF of Y:

Similarly, the marginal PDF of Y is obtained by integrating the joint PDF over all possible values of x:

fY(y) = ∫_0^1 fX,Y(x, y) dx.

For 0 ≤ y ≤ 1, we compute:

fY(y) = ∫_0^1 4xy dx = 4y ∫_0^1 x dx = 4y [x²/2]_0^1 = 4y · (1/2) = 2y.

Thus, the marginal PDF of Y is:

fY(y) = 2y  if 0 ≤ y ≤ 1,  and 0 otherwise.

Since fX(x) · fY(y) = (2x)(2y) = 4xy = fX,Y(x, y), the random variables X and Y are independent.
(b). Compute the covariance of X and Y.

For X:

E[X] = ∫_0^1 x fX(x) dx = ∫_0^1 x(2x) dx = 2 ∫_0^1 x² dx = 2 [x³/3]_0^1 = 2/3.

For Y:

E[Y] = ∫_0^1 y fY(y) dy = ∫_0^1 y(2y) dy = 2 ∫_0^1 y² dy = 2 [y³/3]_0^1 = 2/3.

Compute E[XY]:

E[XY] = ∫_0^1 ∫_0^1 xy · 4xy dx dy
      = 4 ∫_0^1 x² dx ∫_0^1 y² dy
      = 4 · (1/3) · (1/3)
      = 4/9.

Therefore,

Cov(X, Y) = E[XY] − E[X]E[Y] = 4/9 − (2/3)(2/3) = 0.
Corr(X, Y) = Cov(X, Y) / (σX σY).

Since Cov(X, Y) = 0, the correlation is:

Corr(X, Y) = 0.
5.8.13 Linear Functions of a Random Variable

We will now explore some properties that simplify calculating the means and variances of random variables discussed in later chapters. These properties allow us to express expectations using other parameters that are either known or easily computed. The results presented apply to both discrete and continuous random variables, though proofs are provided only for the continuous case. We start with a theorem and two corollaries that should be intuitively understandable to the reader.

Theorem 5.1. If a and b are constants, then

E(aX + b) = aE(X) + b

Proof. For a continuous random variable X with density f(x),

E(aX + b) = ∫_{−∞}^{∞} (ax + b) f(x) dx = a ∫_{−∞}^{∞} x f(x) dx + b ∫_{−∞}^{∞} f(x) dx.

The first integral on the right is E(X) and the second integral equals 1. Therefore, we have

E(aX + b) = aE(X) + b.
Thus, we have shown that:

Var(aX + b) = a² Var(X)

and

σ(aX + b) = |a| σX.

Problem 5.19. Suppose that a temperature has a mean of 110°F and a standard deviation of 2.2°F. The conversion formula from Fahrenheit to Centigrade is given by:

F = 9C/5 + 32

where F is the temperature in Fahrenheit and C is the temperature in Centigrade. What are the mean and the standard deviation in degrees Centigrade?
Solution

To find the mean temperature in Centigrade, we use:

Cmean = (5/9)(Fmean − 32) = (5/9)(110 − 32) ≈ 43.3°C

Since the transformation is linear with slope 5/9, the standard deviation scales by |5/9|:

Csd = (5/9)(2.2) ≈ 1.22°C

Thus, the mean temperature is approximately 43.3°C and the standard deviation is approximately 1.22°C.
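The same result follows from the linear-transformation rules E(aX + b) = aE(X) + b and SD(aX + b) = |a| SD(X), sketched below:

```python
# C = (5/9) * (F - 32) = (5/9) * F - 160/9, a linear transformation of F
a, b = 5/9, -160/9
mean_F, sd_F = 110.0, 2.2

mean_C = a * mean_F + b     # E(C) = a * E(F) + b
sd_C = abs(a) * sd_F        # SD(C) = |a| * SD(F)
print(round(mean_C, 2), round(sd_C, 2))   # 43.33 1.22
```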
Theorem 5.2. The expected value of the sum or difference of two or more functions of a random variable X is the sum or difference of the expected values of the functions. That is,

E[g(X) ± h(X)] = E[g(X)] ± E[h(X)].

Solution

Here E[Y] = 1, and since

Y² = (X − 1)⁴,

we have E[Y²] = 3. Therefore,

Var(Y) = E[Y²] − (E[Y])² = 3 − 1² = 2.
Problem 5.21. The weekly demand for a particular drink, measured in thousands of liters, at a chain of convenience stores is a continuous random variable g(X) = X² + X − 2, where X has the following density function:

f(x) = 2(x − 1) for 1 < x < 2, and f(x) = 0 elsewhere.

Solution

To find the expected value of the weekly demand for the drink, we use Theorem 5.2:

E(X² + X − 2) = E(X²) + E(X) − E(2).

From Theorem 5.1, E(2) = 2. By direct integration, we find:

E(X) = ∫_1^2 2x(x − 1) dx = 5/3,

and

E(X²) = ∫_1^2 2x²(x − 1) dx = 17/6.

Thus,

E(X² + X − 2) = 17/6 + 5/3 − 2 = 5/2.

Therefore, the average weekly demand for the drink at this chain of convenience stores is 2500 liters.
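The expectation can be cross-checked by integrating g(x) f(x) directly with scipy (a sketch):

```python
from scipy.integrate import quad

# Density of X and the demand function g(X) = X^2 + X - 2
f = lambda x: 2 * (x - 1)
Eg, _ = quad(lambda x: (x**2 + x - 2) * f(x), 1.0, 2.0)
print(round(Eg, 4))   # 2.5
```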
is applied to the scores. This means, for example, that a raw score of x = 12 corresponds to a standardized score of y = (4 × 12) + 20 = 68. The expected value of the standardized scores follows from E(Y) = 4E(X) + 20, with variance Var(Y) = 4² Var(X).
5.8.14 Linear Combinations of Random Variables

When dealing with two random variables, X1 and X2, it is often beneficial to analyze the random variable formed by their sum. A general principle states that:

E(X1 + X2) = E(X1) + E(X2)

This means the expected value of the sum of two random variables is equal to the sum of their individual expected values.

In addition:

Var(X1 + X2) = Var(X1) + Var(X2) + 2 Cov(X1, X2)

Note that if the two random variables are independent, their covariance is zero, simplifying the variance of their sum to the sum of their variances:

Var(X1 + X2) = Var(X1) + Var(X2)

Thus, the variance of the sum of two independent random variables is equal to the sum of their individual variances. These results are straightforward, but it is crucial to remember that while the expected value of the sum of two random variables always equals the sum of their expected values, the variance of the sum only equals the sum of their variances if the random variables are independent.
If X1 and X2 are independent, then:

Var(X1 + X2) = Var(X1) + Var(X2)

More generally, consider the linear combination

Y = a1 X1 + · · · + an Xn + b

Its expectation is

E(Y) = a1 E(X1) + · · · + an E(Xn) + b

which is simply the linear combination of the expectations of the random variables Xi. Additionally, if the random variables X1, . . . , Xn are independent, then:

Var(Y) = a1² Var(X1) + · · · + an² Var(Xn)

Note that the constant b does not affect the variance of Y, and the coefficients ai are squared in this expression.

Theorem 5.3. If X1, . . . , Xn is a sequence of random variables and a1, . . . , an and b are constants, then

E(a1 X1 + · · · + an Xn + b) = a1 E(X1) + · · · + an E(Xn) + b,

and, if X1, . . . , Xn are independent,

Var(a1 X1 + · · · + an Xn + b) = a1² Var(X1) + · · · + an² Var(Xn).

Problem 5.22. Let X1, . . . , Xn be independent random variables, each with mean μ and variance σ². Find the mean and variance of the sample mean X̄ = (1/n) Σᵢ Xᵢ.

Solution
Mean of the Sample Mean: Using the linearity of expectation:

E(X̄) = E( (1/n) Σ_{i=1}^n Xᵢ ) = (1/n) Σ_{i=1}^n E(Xᵢ) = (1/n) Σ_{i=1}^n μ = nμ/n = μ
Variance of the Sample Mean: Since the Xᵢ are independent and each has variance σ²:

Var(X̄) = Var( (1/n) Σ_{i=1}^n Xᵢ )
        = (1/n²) Σ_{i=1}^n Var(Xᵢ)
        = (1/n²) Σ_{i=1}^n σ²
        = nσ²/n²
        = σ²/n.

Therefore, the mean and variance of the sample mean X̄ are:

E(X̄) = μ   and   Var(X̄) = σ²/n
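A short simulation illustrates these two results; with μ = 5, σ = 2, and n = 25 (values chosen purely for illustration), the sample means should average about 5 with variance about σ²/n = 0.16:

```python
import numpy as np

# Simulate many samples of size n and look at the distribution of the sample mean
rng = np.random.default_rng(0)
mu, sigma, n = 5.0, 2.0, 25
xbars = rng.normal(mu, sigma, size=(200_000, n)).mean(axis=1)

print(round(xbars.mean(), 2), round(xbars.var(), 3))   # approximately 5.0 and 0.16
```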
Problem 5.23. Let X1 and X2 represent the scores on two tests, with the following information:

E(X1) = 18, Var(X1) = 24, E(X2) = 30, Var(X2) = 60.

The scores are standardized as:

Y1 = (10/3) X1,   Y2 = (5/3) X2 + 50/3.

The final score is:

Z = (2/3) Y1 + (1/3) Y2.

(a). Calculate the expected value of the final score E(Z).

Solution

Using the linearity of expectation:

E(Y1) = (10/3) E(X1) = (10/3)(18) = 60,
E(Y2) = (5/3) E(X2) + 50/3 = (5/3)(30) + 50/3 = 200/3 ≈ 66.67.

Therefore,

E(Z) = (2/3) E(Y1) + (1/3) E(Y2) = (2/3)(60) + (1/3)(200/3) = 62.22.
Assuming the test scores X1 and X2 are independent,

Var(Z) = (2/3)² Var(Y1) + (1/3)² Var(Y2).

With

Var(Y1) = (10/3)² × 24 = 266.67,   Var(Y2) = (5/3)² × 60 = 166.67,

we obtain

Var(Z) = (4/9) × 266.67 + (1/9) × 166.67 = 137.04,
σZ = √137.04 = 11.71.

Hence,

E(Z) = 62.22, Var(Z) = 137.04, σZ = 11.71.
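The whole calculation is a direct application of the linear-combination rules and can be scripted (a sketch; independence of X1 and X2 is assumed, as in the variance calculation):

```python
# Means and variances of the raw test scores
E1, V1 = 18, 24
E2, V2 = 30, 60

# Standardizations Y1 = (10/3) X1 and Y2 = (5/3) X2 + 50/3
EY1, VY1 = (10/3) * E1, (10/3)**2 * V1
EY2, VY2 = (5/3) * E2 + 50/3, (5/3)**2 * V2

# Final score Z = (2/3) Y1 + (1/3) Y2, with X1 and X2 independent
EZ = (2/3) * EY1 + (1/3) * EY2
VZ = (2/3)**2 * VY1 + (1/3)**2 * VY2
print(round(EZ, 2), round(VZ, 2))   # 62.22 137.04
```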
5.8.15 Exercises

1. Suppose X, taking the values 1, 2, 3, and 4, is the service time in hours taken at Bashundhara residential area, and Y, taking the values 1, 2, and 3, is the number of air conditioner (AC) units at the same location. The joint probability distribution of X and Y is presented in Table 5.9.
(e) 5X − 9Z + 8
(f) −3Y − Z − 5
(g) X + 2Y + 3Z
(h) 6X + 2Y − Z + 16
3. Suppose that items from a manufacturing process are subject to three separate evaluations, and that the results of the first evaluation X1 have a mean value of 59 with a standard deviation of 10, the results of the second evaluation X2 have a mean value of 67 with a standard deviation of 13, and the results of the third evaluation X3 have a mean value of 72 with a standard deviation of 4. In addition, suppose that the results of the three evaluations can be taken to be independent of each other.
decides to use the weighted average 0.5Xα + 0.3Xβ + 0.2Xγ, what is the standard deviation of the cholesterol level obtained by the doctor?

8. Suppose that the impurity levels of water samples taken from a particular source are independent with a mean value of 3.87 and a standard deviation of 0.18.

   (a) What are the mean and the standard deviation of the sum of the impurity levels from two water samples?
   (b) What are the mean and the standard deviation of the sum of the impurity levels from three water samples?
   (c) What are the mean and the standard deviation of the average of the impurity levels from four water samples?
   (d) If the impurity levels of two water samples are averaged, and the result is subtracted from the impurity level of a third sample, what are the mean and the standard deviation of the resulting value?
logsf(x, p, loc=0)              Log of the survival function.
ppf(q, p, loc=0)                Percent point function (inverse of the cdf; percentiles).
isf(q, p, loc=0)                Inverse survival function (inverse of sf).
stats(p, loc=0, moments='mv')   Mean ('m'), variance ('v'), skew ('s'), and/or kurtosis ('k').
entropy(p, loc=0)               (Differential) entropy of the RV.
expect(func, p, loc=0, lb=None, ub=None, conditional=False)
                                Expected value of a function (of one argument) with respect to the distribution.
median(p, loc=0)                Median of the distribution.
mean(p, loc=0)                  Mean of the distribution.
var(p, loc=0)                   Variance of the distribution.
group of patients. The probability distribution of the number of adverse reactions is given by:

P(X = x) = 0.5 if x = 0,  0.3 if x = 1,  0.2 if x = 2.

Calculate the cumulative distribution function FH(h) and find the probability that a plant's height is between 121 and 123 cm.
4. Let the length L in meters of a certain type of fish have the probability density function

   fL(l) = 0.5l for 0 ≤ l ≤ 2, and 0 elsewhere.

   Find the expectation, variance, and cumulative distribution function FL(l) of the length.
3600
(a) Find the cumulative distribution function F(x) of the temperature and its median.
(b) Find the expectation and standard deviation of the temperature.
(c) Find the expectation and variance of Y, where Y = (5/9)X − 160/9.
6. The random variable X measures the concentration of ethanol in a chemical solution, and the random variable Y measures the acidity of the solution. They have a joint probability density function.
Chapter 6

Some Discrete Probability Distributions

6.1 Introduction

In the field of data science, understanding discrete probability distributions is crucial for analyzing and modeling data that can be categorized into distinct outcomes. These distributions help data scientists interpret and predict the likelihood of various events based on historical data, which can be essential for making informed decisions and developing predictive models. They appear in applications ranging from binary classification problems to event counting and rate modeling.
P(X = x) = p if x = 1, and 1 − p if x = 0,

or more compactly, if X ∼ Bernoulli(p), then the pmf is

P(X = x) = pˣ (1 − p)¹⁻ˣ,   x = 0, 1.

The expected value of X is

E(X) = 1 · P(X = 1) + 0 · P(X = 0) = 1 · p + 0 · (1 − p) = p

So, the mean of a Bernoulli distribution is:

E(X) = p
6.2.2 Variance

The variance Var(X) of a Bernoulli distributed random variable X is defined as:

Var(X) = E[(X − E(X))²] = E(X²) − [E(X)]²

First, compute E(X²):

E(X²) = Σ_x x² · P(X = x) = 1² · P(X = 1) + 0² · P(X = 0) = p

Now, using the formula for variance:

Var(X) = p − p² = p(1 − p)

So, the variance of a Bernoulli distribution is:

Var(X) = p(1 − p)
Properties
Problem 6.1. A factory produces light bulbs, and each bulb has a 95% chance of passing the quality control test. Define a random variable X such that X = 1 if a light bulb passes the quality control test (success) and X = 0 if it fails (failure).

(a). What is the probability that a randomly selected light bulb passes the quality control test?
(b). What is the expected value (mean) of X?
(c). What is the variance of X?

Solution

Let's define the random variable X as follows: X = 1 with probability p = 0.95 and X = 0 with probability 1 − p = 0.05.

(a). Probability of passing:

P(X = 1) = p = 0.95

So, the probability that a light bulb passes the quality control test is 0.95, or 95%.

(b). Expected value of X:

E(X) = p = 0.95

So, the expected value of X is 0.95.

(c). Variance of X:

The variance Var(X) of a Bernoulli distributed random variable X is given by:

Var(X) = p(1 − p) = 0.95 × 0.05 = 0.0475
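The answers to Problem 6.1 can be confirmed with scipy.stats.bernoulli:

```python
from scipy.stats import bernoulli

# Pass indicator for the quality-control test, p = 0.95
dist = bernoulli(0.95)
print(round(float(dist.pmf(1)), 4))   # 0.95
print(round(float(dist.mean()), 4))   # 0.95
print(round(float(dist.var()), 4))    # 0.0475
```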
Problem 6.2. A new vaccine is being tested for its effectiveness. In clinical
trials, it was found that the vaccine successfully immunizes 90% of the par-
ticipants. Define a random variable X such that X = 1 if a participant is
successfully immunized (success) and X = 0 if not (failure).
Solution

Let's define the random variable X as follows: X = 1 with probability p = 0.90 and X = 0 with probability 1 − p = 0.10.

(i) The probability that a participant is successfully immunized is P(X = 1) = p = 0.90.

(ii) Expected value of X:

E(X) = p = 0.90

(iii) Variance of X:

The variance Var(X) of a Bernoulli distributed random variable X is given by:

Var(X) = p(1 − p) = 0.90 × 0.10 = 0.09

(iv) Expected number immunized in a group of 10: If Y denotes the number of participants successfully immunized among n = 10 independent participants, then Y ∼ B(10, 0.90), so

E(Y) = np = 10 × 0.90 = 9

So, the expected number of participants successfully immunized in a group of 10 is 9.
Moment Generating Function

The moment generating function of X is defined as

MX(t) = E(e^{tX})

For a Bernoulli distributed random variable X with probability p of success (i.e., X = 1 with probability p and X = 0 with probability 1 − p):

MX(t) = (1 − p) e^{t·0} + p e^{t·1} = 1 − p + p eᵗ

Probability Generating Function

The probability generating function of X is defined as

GX(s) = E[s^X]

Since X can take values 0 and 1, we have:

GX(s) = Σ_x P(X = x) · sˣ = (1 − p) · s⁰ + p · s¹ = (1 − p) + p · s

So,

GX(s) = 1 − p + p · s
T
6.2.6 Example

Let's consider a biased coin where the probability of getting "Heads" is p = 0.7. The random variable X representing the outcome of a single coin flip follows a Bernoulli distribution with parameter p = 0.7.

The pmf of X is:

P(X = x) = 0.7 if x = 1, and 0.3 if x = 0.

The mean and variance of X are:

E(X) = 0.7   and   Var(X) = 0.7 × 0.3 = 0.21
Problem 6.3. A factory produces light bulbs, and each light bulb is tested for quality. The probability that a light bulb is defective is p = 0.1. Let X be a random variable that represents whether a randomly selected light bulb is defective (1 if defective, 0 if not defective).

Solution

Let X follow a Bernoulli distribution with parameter p = 0.1, i.e., X ∼ Bernoulli(0.1). Then

P(X = 1) = p = 0.1,   P(X = 0) = 1 − p = 1 − 0.1 = 0.9,

and the expected value is

E[X] = p = 0.1
• Genetics

  The Bernoulli distribution is used in genetics to model the inheritance of a particular gene. For example, X = 1 might represent the presence of a specific gene in an offspring, and X = 0 represents its absence, assuming a certain probability of inheritance.
6.2.8 Python Code for Bernoulli Distribution

In Python, we can calculate various characteristics of the Bernoulli distribution using the scipy.stats module. Below, we detail how to compute the probability mass function, cumulative distribution function, mean (expected value), variance, and probability generating function.

Python Code

Here's how we can compute these characteristics using Python:

import numpy as np
from scipy.stats import bernoulli

# Parameter: probability of success
p = 0.95

# Bernoulli distribution
dist = bernoulli(p)

# 1. Probability mass function at x = 0 and x = 1
x_values = np.array([0, 1])
print("PMF:", dist.pmf(x_values))

# 2. Cumulative distribution function
print("CDF:", dist.cdf(x_values))

# 3. Mean (expected value)
print("Mean:", dist.mean())

# 4. Variance
variance = dist.var()
print("Variance:", variance)

# 5. Probability generating function G(s) = 1 - p + p*s
pgf = lambda s: 1 - p + p * s
print("PGF at s = 0 and s = 1:", [pgf(s) for s in [0, 1]])
Explanations

• Probability Mass Function (pmf):

  ■ The pmf gives the probability of each outcome (0 or 1). Use dist.pmf(x_values) to compute these probabilities.

• Variance:

  ■ The variance of a Bernoulli distribution is p · (1 − p). Use dist.var() to compute this.
6.2.9 Exercises

1. A medical test for a certain disease has a 98% chance of correctly identifying a diseased person (true positive) and a 2% chance of incorrectly classifying a healthy person (false positive). Define a random variable X such that X = 1 if a randomly selected test result is positive and X = 0 otherwise.

   (a) What is the probability that a randomly selected test result is positive?
   (b) What is the expected value (mean) of X?
   (c) What is the variance of X?
   (d) In a group of 50 people who took the test, what is the expected number of positive test results?
2. A genetic trait is passed on to the next generation with a probability of 25%. Define a random variable X such that X = 1 if the trait is passed on (success) and X = 0 if it is not (failure).

   (a) What is the probability that a randomly selected offspring inherits the trait?
   (b) What is the expected value (mean) of X?
   (c) What is the variance of X?
   (d) In a group of 40 offspring, what is the expected number of offspring that will inherit the trait?
3. In a clinical trial, a new drug is found to be effective in 85% of the patients. Define a random variable X such that X = 1 if a patient responds positively to the drug (success) and X = 0 if not (failure).

   (a) What is the probability that a randomly selected patient responds positively?
   (b) What is the expected value (mean) of X?
   (c) What is the variance of X?
   (d) In a group of 100 patients, what is the expected number of positive responses?
4. A certain gene variant occurs in the population with a known probability. Define a random variable X such that X = 1 if a randomly selected subject has the gene variant and X = 0 otherwise.

   (a) What is the probability that a randomly selected subject has the gene variant?
   (b) What is the expected value (mean) of X?
   (c) What is the variance of X?
   (d) In a sample of 30 subjects, what is the expected number of subjects who have the gene variant?
6.3 Binomial Distribution

Consider n independent Bernoulli trials, each with success probability p. The probability of any particular sequence of n trials containing exactly k successes is

pᵏ × (1 − p)ⁿ⁻ᵏ

We need to account for all possible sequences of n trials that result in exactly k successes. The number of ways to choose k positions for successes out of n positions is given by the Binomial coefficient

C(n, k) = n! / (k!(n − k)!)

where C(n, k) represents the number of combinations of n items taken k at a time.
Multiplying, the probability of exactly k successes in n trials is

P(X = k) = C(n, k) pᵏ (1 − p)ⁿ⁻ᵏ   for k = 0, 1, 2, . . . , n,

where X is the random variable representing the number of successes.

The Binomial distribution is a discrete probability distribution. This distribution has the following conditions:

1. Fixed number of trials: The experiment consists of a fixed number n of trials.
2. Two outcomes: Each trial results in one of two outcomes, success or failure.
3. Constant probability: The probability of success p is the same for every trial.
4. Independence: The outcome of one trial is independent of the outcomes of other trials.

If X follows a Binomial distribution with parameters n and p, we write

X ∼ B(n, p).
The pmf can be visualized in Python as follows:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom

# Parameters
n = 10
p = 0.5

# Values
k = np.arange(0, n + 1)
pmf = binom.pmf(k, n, p)

# Plot
plt.bar(k, pmf, color='blue', edgecolor='black')
plt.xlabel('Number of Successes')
plt.ylabel('Probability')
plt.title('Binomial PMF (n=10, p=0.5)')
plt.xticks(k)
plt.grid(True)
plt.show()
Thus,

E(X) = n · p · 1 = np

Therefore, the expected value of X is

E(X) = np

Alternatively, X can be written as the sum of n independent Bernoulli trials Xᵢ,

X = X1 + X2 + · · · + Xn

with each Xᵢ ∼ Bernoulli(p) and E[Xᵢ] = p. Using the linearity of expectation, we have

E[X] = E[X1 + X2 + · · · + Xn] = E[X1] + E[X2] + · · · + E[Xn] = n · p

Thus, the expected value E[X] is

E[X] = np

Properties
6.3.3 Example

Let us consider a biased coin where the probability of getting "Heads" is p = 0.7. Suppose we flip this coin n = 10 times. The random variable X representing the number of "Heads" in 10 flips follows a Binomial distribution with parameters n = 10 and p = 0.7.

The pmf of X is

P(X = k) = C(10, k) (0.7)ᵏ (0.3)¹⁰⁻ᵏ   for k = 0, 1, 2, . . . , 10.

The mean and variance of X are

E(X) = 10 × 0.7 = 7   and   Var(X) = 10 × 0.7 × 0.3 = 2.1
Problem 6.4. A factory produces light bulbs, and each bulb is defective with probability 0.05, independently of the others. A random sample of 8 bulbs is inspected.

(a). What is the probability that exactly 2 of the 8 bulbs are defective?
(b). What is the probability that at most 2 bulbs are defective?
(c). What is the probability that more than 1 bulb is defective?
(d). What is the probability that at least 1 bulb is defective?
(e). What is the expected number of defective bulbs in the sample of 8?
(f). What is the variance and standard deviation of the number of defective bulbs in the sample?

Solution

Let X be the number of defective bulbs in the sample. Here, X follows a binomial distribution with parameters n = 8 and p = 0.05. The pmf of X is

P(X = k) = C(8, k) (0.05)ᵏ (1 − 0.05)⁸⁻ᵏ   for k = 0, 1, 2, . . . , 8.
(a). Using the pmf with k = 2,

P(X = 2) = C(8, 2) (0.05)² (0.95)⁶ = 28 × 0.0025 × 0.7351 ≈ 0.0515

Thus, the probability that exactly 2 bulbs are defective is approximately 0.0515.

(b). The probability that at most 2 bulbs are defective is

P(X ≤ 2) = P(X = 0) + P(X = 1) + P(X = 2)

Compute each term:

P(X = 0) = C(8, 0) (0.05)⁰ (0.95)⁸ ≈ 0.6634
P(X = 1) = C(8, 1) (0.05)¹ (0.95)⁷ ≈ 0.2793
P(X = 2) ≈ 0.0515 (from part (a))

Hence,

P(X ≤ 2) ≈ 0.6634 + 0.2793 + 0.0515 = 0.9942
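These binomial probabilities can be cross-checked with scipy.stats.binom; to four decimal places the exact values are 0.0515 for the pmf at k = 2 and 0.9942 for the cdf at k = 2:

```python
from scipy.stats import binom

# Number of defective bulbs in a sample of n = 8 with p = 0.05
n, p = 8, 0.05
print(round(float(binom.pmf(2, n, p)), 4))   # 0.0515
print(round(float(binom.cdf(2, n, p)), 4))   # 0.9942
```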
Problem 6.5. In a clinical trial, each of 10 patients independently responds positively to a treatment with probability 0.25. Let X be the number of patients who respond positively.

Solution

Let X be the number of patients who respond positively out of 10 patients, with each patient responding positively with probability p = 0.25. Then X follows a binomial distribution with pmf

P(X = k) = C(10, k) (0.25)ᵏ (1 − 0.25)¹⁰⁻ᵏ   for k = 0, 1, 2, . . . , 10.

(a). The probability that exactly 3 patients respond positively is

P(X = 3) = C(10, 3) (0.25)³ (0.75)⁷ = 120 × 0.015625 × 0.1335 ≈ 0.2503

(b). The probability that at most 3 patients respond positively is the cumulative probability P(X ≤ 3), which is the sum of the probabilities for X = 0, 1, 2, 3. Therefore,

P(X ≤ 3) = P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3)
         = C(10, 0)(0.25)⁰(0.75)¹⁰ + C(10, 1)(0.25)¹(0.75)⁹ + C(10, 2)(0.25)²(0.75)⁸ + 0.2503
         ≈ 0.0563 + 0.1877 + 0.2816 + 0.2503 = 0.7759

(c). For a binomial distribution X ∼ B(n, p), the expected value and variance are given by

E(X) = np = 10 × 0.25 = 2.5   and   Var(X) = np(1 − p) = 10 × 0.25 × 0.75 = 1.875
(d). Since Y represents the number of patients who do not respond positively, we can express Y = 10 − X. The number of patients who do not respond positively follows a binomial distribution Y ∼ B(n = 10, p = 0.75). The probability mass function of Y is

P(Y = k) = C(10, k) (0.75)ᵏ (0.25)¹⁰⁻ᵏ,   k = 0, 1, 2, . . . , 10

The expected value and variance of Y are

E(Y) = np = 10 × 0.75 = 7.5   and   Var(Y) = np(1 − p) = 10 × 0.75 × 0.25 = 1.875
6.3.4 Python Code for Binomial Distribution

In Python, you can compute various characteristics of the Binomial distribution using the scipy.stats module. Below is a demonstration of how to compute these characteristics.

Python Code

Here's how you can calculate various characteristics of a Binomial distribution:

import numpy as np
from scipy.stats import binom

# Define parameters
n = 10    # number of trials
p = 0.25  # probability of success

# Binomial distribution
dist = binom(n, p)

# 1. Probability mass function over the support
x_values = np.arange(0, n + 1)
print("PMF:", dist.pmf(x_values))

# 2. Cumulative distribution function
print("CDF:", dist.cdf(x_values))

# 3. Mean (expected value)
print("Mean:", dist.mean())

# 4. Variance
variance = dist.var()
print("Variance:", variance)

# 5. Probability generating function G(t) = (1 - p + p*t)^n
def pgf(t, n, p):
    return (1 - p + p * t) ** n

pgf_values = [pgf(t, n, p) for t in [0, 1]]
print("PGF values for t = 0 and t = 1:", pgf_values)
Explanations

• Probability Mass Function (pmf):

  ■ The pmf provides the probability of each number of successes. Compute these probabilities using dist.pmf(x_values).

• Variance:

  ■ The variance is given by n · p · (1 − p). Compute this using dist.var().
6.3.5 Exercises

1. Define the Binomial distribution. Include the conditions that must be met for a random variable to follow a Binomial distribution.

2. Prove that the expected value (mean) of a Binomially distributed random variable X with parameters n and p is E(X) = np. Also, prove that the variance of X is given by Var(X) = np(1 − p).

3. Prove that the sum of the probabilities of all possible outcomes of a Binomial random variable X with parameters n and p is equal to 1. That is, show that

   Σ_{k=0}^{n} C(n, k) pᵏ (1 − p)ⁿ⁻ᵏ = 1.

6. Show that

   C(n, k) = C(n, n − k)

   and use this property to demonstrate that the sum of the Binomial coefficients for a given n is symmetric around k = n/2.
7. Find the moment generating function (MGF) of the Binomial distribution
and use it to compute the first two moments (mean and variance).
8. A factory produces widgets, with a 1% defect rate. If a sample of 50 wid-
gets is tested, what is the probability that at least 3 widgets are defective?
9. A medical test has a 95% sensitivity and a 90% specificity. If 10 patients
are tested who all have the disease, calculate the probability that exactly
9 patients test positive.
10. A soccer player has a 60% chance of scoring a goal in a penalty kick. If the
player takes 12 penalty kicks, find the probability that the player scores
more than 8 goals.
P(X = k) = λᵏ e^{−λ} / k!,   k = 0, 1, 2, . . .

where

• k is the number of events (the number of customers), and
• λ is the average number of events per interval.

For example, with an average of λ = 4 customers per hour,

P(X = 6) = 4⁶ e^{−4} / 6! = 4096 e^{−4} / 720 ≈ 0.1042

That is, the probability of having exactly 6 customers in an hour is approximately 0.1042.
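This probability can be evaluated directly with scipy.stats.poisson:

```python
from scipy.stats import poisson

# Number of customers per hour, modeled as Poisson with rate 4
print(round(float(poisson.pmf(6, 4)), 4))   # 0.1042
```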
The Poisson distribution is a discrete probability distribution that models
the number of events occurring within a fixed interval of time or space,
given the following conditions:
(i). Each event happens independently of the others. (For example, if
we’re modeling the number of customers arriving at a coffee shop, the
number of customers arriving in one hour does not affect the number
of customers arriving in the next hour.)
(ii). The average rate (mean number of events) λ is constant over time or
space. This means that the expected number of events in any interval
of the same length is the same.
(iii). The events are relatively rare in the given interval. More specifically,
the probability of more than one event occurring in a very short interval
is negligible.
(iv). The number of events can only be whole numbers (0, 1, 2, ...). You
can’t have fractional events.
(v). The number of events is counted over a fixed interval of time or space.
The intervals are non-overlapping, meaning events in different intervals
do not influence each other.
When these conditions are met, the number of events occurring in a fixed interval follows a Poisson distribution. Observe that the series expansion of e^λ guarantees that the total probability sums to 1, as shown below:

Σ_{k=0}^{∞} P(X = k) = e^{−λ} Σ_{k=0}^{∞} λᵏ/k! = e^{−λ} (1/0! + λ/1! + λ²/2! + λ³/3! + · · ·) = e^{−λ} · e^λ = 1

Additionally, for a random variable X that follows a Poisson distribution with parameter λ (denoted X ∼ P(λ)), it holds that

E(X) = Var(X) = λ

The Poisson distribution is particularly useful for modeling the number of occurrences of a certain event within a specified unit of time, distance, or volume, with both its mean and variance equal to λ.
Problem 6.6.

(i). Calculate the pmf of a Poisson random variable with λ = 2 for integer values of k from 0 to 10.
(ii). Generate a plot to illustrate the pmf for λ = 2.
(iii). Discuss how the pmf and cumulative distribution functions for both λ = 2 and λ = 5 compare, particularly in terms of expected value and variance.
Solution
(i). The pmf of a Poisson random variable with parameter λ = 2 is given by
P(X = k) = e^{−2} · 2^k / k!,  k = 0, 1, 2, . . .
(ii). We plot this pmf for integer values of k from 0 to 10 in Figure 6.2.
Figure 6.2: The pmf of the Poisson distribution with λ = 2 for k = 0, 1, . . . , 10.
(iii). Figures 6.2 and 6.3 compare the probability mass functions and cumulative distribution functions of Poisson distributions with parameters λ = 2 and λ = 5. Since the mean and variance of a Poisson distribution both equal its parameter, the distribution with λ = 5 has a greater expected value and exhibits a wider spread than the one with λ = 2.
The Poisson distribution is used to model the number of rare events oc-
curring within a fixed interval of time or space, under the assumption that these
events occur independently and at a constant average rate. This distribution
helps answer questions about the likelihood of observing a specific number of
events given an average rate, such as predicting the number of genetic mutations
in bacterial cultures or call arrivals in a call center.
Problem 6.7. A quality inspector at a glass manufacturing company checks each glass sheet for imperfections. Suppose the number of flaws in each sheet follows a Poisson distribution with mean λ = 0.5 flaws per sheet.
(a). What proportion of glass sheets has no flaws?
(b). What proportion of glass sheets has two or more flaws and must be scrapped?
Figure 6.3: The pmf of the Poisson distribution with λ = 5 for k = 0, 1, . . . , 15.
Solution
(a). Probability of No Flaws: The probability that a glass sheet has no flaws (X = 0) is given by

P(X = 0) = e^{−0.5} · 0.5^0 / 0! = e^{−0.5} ≈ 0.607.

Thus, approximately 61% of the glass sheets are in “perfect” condition.
(b). Probability of Two or More Flaws:

P(X ≥ 2) = 1 − P(X = 0) − P(X = 1),

where

P(X = 1) = e^{−0.5} · 0.5^1 / 1! = 0.5e^{−0.5} ≈ 0.303.

Therefore,

P(X ≥ 2) = 1 − 0.607 − 0.303 ≈ 0.090.
Hence, about 9% of the glass sheets have two or more flaws and need to be
scrapped.
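These hand computations can be cross-checked with scipy (a sketch assuming, as in the solution above, that the flaw count is Poisson with λ = 0.5):

```python
from scipy.stats import poisson

dist = poisson(mu=0.5)  # lambda = 0.5 flaws per sheet, as in the solution

p0 = dist.pmf(0)          # P(X = 0)
p_ge2 = 1 - dist.cdf(1)   # P(X >= 2) = 1 - P(X <= 1)
print("P(X = 0):", round(p0, 3))
print("P(X >= 2):", round(p_ge2, 3))
```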
Problem 6.8. Suppose the number of individuals in a sample who carry a certain genetic mutation follows a Poisson distribution with mean λ = 2.
(a). Calculate the probability that exactly one individual in the sample has the mutation.
(b). Determine the probability that at least three individuals in the sample have
the mutation.
(c). Estimate the percentage of samples in which two or more individuals are
expected to have the mutation.
Solution
(a). Probability of Exactly One Individual with the Mutation
The probability that exactly one individual has the mutation is given by
P(X = 1) = e^{−2} · 2^1 / 1! = 2e^{−2} ≈ 0.2707.
(b). Probability of At Least Three Individuals with the Mutation

P(X ≥ 3) = 1 − P(X < 3) = 1 − [P(X = 0) + P(X = 1) + P(X = 2)]
Where,
P(X = 0) = e^{−2} ≈ 0.1353
P(X = 1) = 2e^{−2} ≈ 0.2707
P(X = 2) = 2² · e^{−2} / 2! ≈ 0.2707
Therefore,

P(X ≥ 3) = 1 − (0.1353 + 0.2707 + 0.2707) ≈ 0.3233.
(c). Percentage of Samples with Two or More Individuals Having the Mutation
To estimate the percentage of samples where two or more individuals have
the mutation:
P (X ≥ 2) = 1 − P (X < 2) = 1 − [P (X = 0) + P (X = 1)]
Thus,

P(X ≥ 2) = 1 − (0.1353 + 0.2707) = 0.5940.

The percentage is 0.5940 × 100% = 59.40%.
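All three answers can be verified together with scipy (a sketch assuming, as above, that X ∼ P(2)):

```python
from scipy.stats import poisson

dist = poisson(mu=2)  # lambda = 2, as in the solution

p1 = dist.pmf(1)          # (a) P(X = 1)
p_ge3 = 1 - dist.cdf(2)   # (b) P(X >= 3)
p_ge2 = 1 - dist.cdf(1)   # (c) P(X >= 2)
print(round(p1, 4), round(p_ge3, 4), round(p_ge2, 4))
```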
6.4.1 Mean

The expected value of a Poisson random variable X with parameter λ is

E(X) = ∑_{x=0}^{∞} x · P(X = x)
     = ∑_{x=0}^{∞} x · λ^x e^{−λ} / x!
     = 0 · λ^0 e^{−λ} / 0! + ∑_{x=1}^{∞} x · λ^x e^{−λ} / x!
     = ∑_{x=1}^{∞} λ^x e^{−λ} / (x − 1)!.

Substituting x′ = x − 1 gives

E(X) = λe^{−λ} ∑_{x′=0}^{∞} λ^{x′} / x′!.

Thus,

E(X) = λe^{−λ} · e^{λ} = λ.
6.4.2 Variance
To find the variance, we first need to calculate E(X²). We use

E(X²) = ∑_{x=0}^{∞} x² · P(X = x)
      = ∑_{x=0}^{∞} x² · λ^x e^{−λ} / x!
      = ∑_{x=1}^{∞} x · λ^x e^{−λ} / (x − 1)!.

Substituting x′ = x − 1,

E(X²) = e^{−λ} ∑_{x′=0}^{∞} (x′ + 1) λ^{x′+1} / x′!
      = e^{−λ} ( ∑_{x′=0}^{∞} x′ λ^{x′+1} / x′! + ∑_{x′=0}^{∞} λ^{x′+1} / x′! )
      = e^{−λ} ( λ ∑_{x′=0}^{∞} x′ λ^{x′} / x′! + λ ∑_{x′=0}^{∞} λ^{x′} / x′! )
      = e^{−λ} ( λ · λe^{λ} + λe^{λ} )
      = λ² + λ.

Hence,

Var(X) = E(X²) − [E(X)]² = λ² + λ − λ² = λ.
The moment generating function of X is

M_X(t) = E[e^{tX}] = ∑_{k=0}^{∞} e^{tk} · λ^k e^{−λ} / k! = e^{−λ} ∑_{k=0}^{∞} (λe^{t})^k / k! = e^{−λ} e^{λe^{t}} = e^{λ(e^{t} − 1)}.

Replacing t with it gives the characteristic function. Thus, the characteristic function ϕ_X(t) of a Poisson random variable X with parameter λ is

ϕ_X(t) = e^{λ(e^{it} − 1)}.
Problem 6.9. A factory produces 500 switches, and each switch is independently defective with probability 0.005. Find the probability that at most 3 switches are defective:
(i) Use the Binomial Distribution to compute the exact probability.
(ii) Use the Poisson Distribution to make an approximation.
Solution:
Given: total number of switches n = 500, and probability of a switch being defective p = 0.005. Let the random variable X represent the number of defective switches. We are required to find P(X ≤ 3), i.e., the probability that the number of defective switches is no more than 3.
(i). Using Binomial Distribution: The probability mass function for the
Binomial Distribution is given by
P(X = k) = (500 choose k) (0.005)^k (1 − 0.005)^{500−k},  for k = 0, 1, 2, . . . , 500.
P (X ≤ 3) = P (X = 0) + P (X = 1) + P (X = 2) + P (X = 3)
We have

P(X = 0) = (500 choose 0)(0.005)^0 (0.995)^{500} ≈ 0.08157
P(X = 1) = (500 choose 1)(0.005)^1 (0.995)^{499} ≈ 0.20495
P(X = 2) = (500 choose 2)(0.005)^2 (0.995)^{498} ≈ 0.25697
P(X = 3) = (500 choose 3)(0.005)^3 (0.995)^{497} ≈ 0.21435
Hence,

P(X ≤ 3) ≈ 0.08157 + 0.20495 + 0.25697 + 0.21435 = 0.75784.
(ii). Using Poisson Approximation: We approximate

P(X ≤ 3) = P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3),

where X follows a Poisson distribution with parameter λ = np = 500 × 0.005 = 2.5. The probability mass function of the Poisson distribution is

P(X = k) = e^{−2.5} · 2.5^k / k!,  for k = 0, 1, 2, . . . .
Therefore,

P(X = 0) = e^{−2.5} · 2.5^0 / 0! = e^{−2.5} ≈ 0.08208
P(X = 1) = e^{−2.5} · 2.5^1 / 1! = 2.5e^{−2.5} ≈ 0.20521
P(X = 2) = e^{−2.5} · 2.5^2 / 2! ≈ 0.25652
P(X = 3) = e^{−2.5} · 2.5^3 / 3! ≈ 0.21376
Hence,

P(X ≤ 3) ≈ 0.08208 + 0.20521 + 0.25652 + 0.21376 = 0.75757,

which is very close to the exact Binomial probability.
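The closeness of the approximation can be confirmed directly by comparing the Binomial and Poisson cdfs at k = 3 (a sketch using scipy.stats.binom and scipy.stats.poisson):

```python
from scipy.stats import binom, poisson

n, p = 500, 0.005
exact = binom(n, p).cdf(3)      # exact Binomial P(X <= 3)
approx = poisson(n * p).cdf(3)  # Poisson approximation with lambda = np = 2.5
print("Binomial:", round(exact, 5))
print("Poisson :", round(approx, 5))
```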
Python Code
Here’s how you can calculate various characteristics of the Poisson distribution:
import numpy as np
from scipy.stats import poisson

# Parameter of the Poisson distribution (example value)
lambda_ = 2

# Poisson distribution
dist = poisson(mu=lambda_)

# Probability mass function over k = 0, ..., 10
x_values = np.arange(0, 11)
pmf_values = dist.pmf(x_values)
print("PMF values:", pmf_values)

# Mean (Expected Value)
print("Mean (Expected Value):", dist.mean())

# Variance
variance = dist.var()
print("Variance:", variance)
Explanations
• Probability Mass Function (pmf):
  ■ The pmf provides the probability of observing a certain number of events. Compute these probabilities using dist.pmf(x_values).
• Mean (Expected Value):
  ■ The mean of a Poisson distribution is λ. Compute this using dist.mean().
• Variance:
  ■ The variance of a Poisson distribution is also λ. Compute this using dist.var().
6.4.7 Exercises
1. The number of patients arriving at a clinic follows a Poisson distribution
a sample of 1000 individuals.
(c) Find the probability of discovering 2 or more patients with the dis-
ease in a sample of 1000 individuals.
5. An employee receives an average of 8 emails per day. Assume the number
of emails follows a Poisson distribution.
(a) What is the probability of receiving exactly 10 emails in a day?
(b) Calculate the probability of receiving fewer than 6 emails in a day.
(c) Determine the probability of receiving more than 12 emails in a day.
6. A retail store has an average of 20 customers arriving per hour. The
number of customer arrivals follows a Poisson distribution.
(a) What is the probability that exactly 25 customers will arrive in an
hour?
(b) Find the probability that fewer than 15 customers arrive in an hour.
10. A library has an average of 12 book checkouts per day. The number of
checkouts follows a Poisson distribution.
(a) What is the probability of exactly 10 book checkouts in a day?
(b) Determine the probability of having 15 or more book checkouts in a
day.
(c) Calculate the probability of having fewer than 8 book checkouts in
a day.
11. In a small town, a rare disease has an average occurrence rate of 1 case
per month. Assume the number of cases follows a Poisson distribution.
(a) What is the probability of having exactly 2 cases of the disease in a
month?
(b) Find the probability of having no cases of the disease in a month.
(c) Calculate the probability of having at least 1 case of the disease in
a month.
(a) What is the probability that in any given period of 400 days there
will be an accident on one day?
(b) What is the probability that there are at most three days with an
accident?
Uniform distributions are vital for modeling complex systems and assessing the behavior of different scenarios. Similarly, in A/B testing, where subjects are randomly assigned to different groups, the assumption of a uniform distribution ensures that the groups are comparable, thereby enhancing the validity of the test results.
If the lower bound of X is zero and the upper bound is n, then the pmf is given by

P(X = x) = 1/(n + 1),  x = 0, 1, 2, . . . , n.
The mean and variance of X are

µ = n/2  and  σ² = n(n + 2)/12.
Proof. Mean: The mean µ of X is calculated as follows:

µ = E[X] = ∑_{x=0}^{n} x · P(X = x) = ∑_{x=0}^{n} x · 1/(n + 1) = (1/(n + 1)) ∑_{x=0}^{n} x.
Using the formula for the sum of the first n integers, we have

∑_{x=0}^{n} x = n(n + 1)/2.
Substituting this back into the equation for µ,

µ = (1/(n + 1)) · n(n + 1)/2 = n/2.

Variance: The variance is

σ² = E[X²] − (E[X])².
First, we compute E[X²]:

E[X²] = ∑_{x=0}^{n} x² · P(X = x) = ∑_{x=0}^{n} x² · 1/(n + 1) = (1/(n + 1)) ∑_{x=0}^{n} x².

Using the formula for the sum of the squares of the first n integers, we have

∑_{x=0}^{n} x² = n(n + 1)(2n + 1)/6,  so  E[X²] = n(2n + 1)/6.
Thus,

σ² = n(2n + 1)/6 − (n/2)² = (n² + 2n)/12 = n(n + 2)/12.
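The formulas µ = n/2 and σ² = n(n + 2)/12 can be checked numerically by computing the moments of the pmf directly (a sketch; n = 10 is chosen for illustration):

```python
import numpy as np

n = 10  # illustrative upper bound; X is uniform on {0, 1, ..., n}
x = np.arange(0, n + 1)
p = np.full(n + 1, 1 / (n + 1))  # pmf 1/(n+1) at each point

mean = (x * p).sum()
var = (x**2 * p).sum() - mean**2
print("Mean:", mean, "closed form:", n / 2)
print("Variance:", var, "closed form:", n * (n + 2) / 12)
```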
Problem 6.10. A company has a database of 100 customer IDs ranging from
1 to 100. The marketing team wants to select a random sample of 10 customer
IDs to send a promotional offer. Each customer ID should have an equal chance
of being selected.
(a). What is the probability of selecting any specific customer ID?
(b). If the marketing team selects 10 IDs, what is the expected number of times
a specific customer ID (e.g., ID 5) will be included in the sample?
Solution
(a). Probability of Selecting Any Specific Customer ID
Since the customer IDs are uniformly distributed, the probability mass function
(pmf) is
P(X = x) = 1/n,  x = 1, 2, . . . , 100.
Here, n = 100. Thus, the probability of selecting any specific customer ID
(say ID 5) is
P(X = 5) = 1/100 = 0.01.
(b). Expected Number of Times a Specific Customer ID Will Be
Included
When selecting 10 customer IDs from the total of 100, the probability of selecting a specific customer ID (like ID 5) in one draw is 1/100.
To find the expected number of times a specific customer ID (ID 5) will be
included in the sample of 10 IDs, we can use the formula for the expected value
E[X] = n · p,
where n is the number of trials (in this case, the number of IDs selected) and p
is the probability of success (selecting ID 5).
Here, n = 10 and p = 1/100, so

E[X] = 10 · (1/100) = 0.1.
Problem 6.11. A lottery box contains 50 tickets numbered from 1 to 50, and 5 tickets are drawn at random.
(a). What is the probability of drawing any specific ticket number?
(b). What is the expected number of times a specific ticket will be drawn?

Solution
(a). Probability of Drawing Any Specific Ticket Number
Since the tickets are uniformly distributed, the probability mass function (pmf)
is given by
P(X = x) = 1/n,  x = 1, 2, . . . , 50.
Here, n = 50. Thus, the probability of drawing any specific ticket number (e.g.,
ticket number 10) is:
P(X = 10) = 1/50 = 0.02.
(b). Expected Number of Times a Specific Ticket Will Be Drawn
When drawing 5 tickets from a total of 50, the probability of drawing a specific ticket (like ticket number 10) in one draw is 1/50.
To find the expected number of times a specific ticket (ticket number 10)
will be drawn in the sample of 5 tickets, we can use the expected value formula
E[X] = n · p,
where n is the number of draws (in this case, the number of tickets drawn)
and p is the probability of success (drawing ticket number 10).
Here, n = 5 and p = 1/50, so

E[X] = 5 · (1/50) = 0.1.
import numpy as np

# Bounds of the discrete uniform distribution (example values)
a_discrete, b_discrete = 1, 10

def pmf_discrete(x, a, b):
    if a <= x <= b:
        return 1 / (b - a + 1)
    else:
        return 0

def cdf_discrete(x, a, b):
    if x < a:
        return 0
    elif a <= x <= b:
        return (x - a + 1) / (b - a + 1)
    else:
        return 1

# PMF values
x_discrete_values = np.arange(a_discrete, b_discrete + 1)
pmf_values_discrete = [pmf_discrete(x, a_discrete, b_discrete) for x in x_discrete_values]
print("PMF values for x from {} to {}:".format(a_discrete, b_discrete), pmf_values_discrete)

# CDF values
cdf_values_discrete = [cdf_discrete(x, a_discrete, b_discrete) for x in x_discrete_values]
print("CDF values for x from {} to {}:".format(a_discrete, b_discrete), cdf_values_discrete)

# Variance
variance_discrete = ((b_discrete - a_discrete + 1) ** 2 - 1) / 12
print("Variance:", variance_discrete)

# Standard Deviation
std_dev_discrete = np.sqrt(variance_discrete)
print("Standard Deviation:", std_dev_discrete)
6.4.9 Exercises
1. A box contains 20 different colored balls, numbered from 1 to 20. If a ball
is drawn at random, what is the probability of drawing ball number 15?
2. A survey is conducted with 30 participants, each assigned a unique ID
from 1 to 30. If 5 IDs are randomly selected for a follow-up interview,
what is the expected number of times a specific ID (e.g., ID 12) will be
selected?
3. A raffle has 100 tickets numbered from 1 to 100. If the raffle draws 10
tickets, what is the probability that ticket number 25 will be drawn at
least once?
4. In a game, a player rolls a fair six-sided die. What is the probability of
rolling a specific number, say 4? Additionally, if the player rolls the die 10
times, what is the expected number of times the number 4 will appear?
5. A classroom has 15 students, each assigned a number from 1 to 15. If the
teacher randomly selects 3 students for a project, what is the probability
that student number 7 is selected?
6. A bag contains 50 different coupons numbered from 1 to 50. If 5 coupons
are drawn randomly, what is the expected number of times coupon number
30 will be drawn?
7. A committee consists of 12 members, each assigned a number from 1 to
12. If 4 members are randomly chosen to form a subcommittee, what is
the probability that member number 6 is included in the selection?
8. A local lottery involves selecting 6 numbers from a set of 1 to 49. What
is the probability that the number 7 is chosen in a single drawing? If you
play the lottery 10 times, what is the expected number of times you will
have number 7 in your selected numbers?
9. In a game show, contestants choose a number from 1 to 100. If a contes-
tant has chosen number 45, what is the probability that this number is
drawn if the game draws 5 numbers randomly without replacement?
2. Suppose a factory produces light bulbs, and each bulb has a 90% prob-
ability of being functional. If you randomly select one bulb, what is the
probability that it is functional?
3. A basketball player has a free throw success rate of 75%. If she takes 10
free throws, what is the probability that she makes exactly 8 of them?
(Use the binomial probability formula.)
4. In a survey, it is found that 60% of people prefer coffee over tea. If you
randomly sample 15 people, what is the probability that exactly 9 prefer
coffee?
5. A call center receives an average of 4 calls per hour. What is the proba-
bility that they receive exactly 6 calls in the next hour?
6. In a certain city, the average number of accidents at a particular inter-
section is 2 per month. What is the probability that there will be no
accidents in the next month?
7. A six-sided die is rolled. Define the random variable Y as the outcome of
the roll. Calculate the mean and variance of Y .
8. A spinner is divided into 8 equal sections numbered from 1 to 8. What
is the probability of landing on an even number when the spinner is spun
once?
Chapter 7
Some Continuous
Probability Distributions
7.1 Introduction
In probability theory and statistics, continuous probability distributions play
a fundamental role in modeling and analyzing real-world phenomena. Unlike
discrete distributions, which are defined for countable outcomes, continuous
distributions are used to describe outcomes that can take on any value within
a given range. This chapter delves into some of the most widely used continu-
ous probability distributions, including the Uniform, Exponential, and Normal
distributions.
These distributions are widely used in fields such as engineering, economics, and the natural sciences, due to their ability to represent diverse processes and events accurately. Understanding these distributions
enables us to calculate probabilities and make inferences about populations
based on sample data.
We begin with the Uniform distribution, which serves as a simple model for
random variables that have equally likely outcomes over a specific interval. Fol-
lowing this, we explore the Exponential distribution, commonly used to model
the time between events in a Poisson process. We then delve into the Normal
distribution, arguably the most important distribution in statistics, due to the
Central Limit Theorem’s implication that it approximates many natural phe-
nomena.
Each section will provide a detailed definition of the distribution, its proper-
ties, and practical examples to illustrate its application. Additionally, exercises
are included to reinforce the concepts and allow for hands-on practice in calculations.
7.2 Continuous Uniform Distribution
The continuous uniform distribution is commonly used in simulations, ran-
dom sampling, and scenarios where a uniform distribution of outcomes is as-
sumed, such as in generating random numbers or modeling processes where
each outcome within a specified range is equally probable. Its simplicity and
intuitive nature make it a fundamental concept in statistics and probability
theory.
Definition: A random variable X is said to follow a continuous uniform
distribution on the interval [a, b], denoted X ∼ U (a, b), if its probability
density function (pdf) is
f(x) = 1/(b − a) for a ≤ x ≤ b, and f(x) = 0 otherwise.
Figure: The pdf of the continuous uniform distribution, f(x) = 1/(b − a) on [a, b].
The mean of X is

E(X) = ∫_a^b x · 1/(b − a) dx = (1/(b − a)) · (b² − a²)/2 = (a + b)/2.

Hence,

Mean = E(X) = (a + b)/2.

Next,

E(X²) = ∫_a^b x² · 1/(b − a) dx = (1/(b − a)) [x³/3]_a^b = (b³ − a³)/(3(b − a)) = (b² + ab + a²)/3.

Hence, the variance is

Var(X) = (b² + ab + a²)/3 − ((a + b)/2)²
       = (4(b² + ab + a²) − 3(a² + 2ab + b²))/12
       = (b − a)²/12.
Problem 7.1. The time taken to serve a customer at a counter is uniformly distributed between 5 and 15 minutes.
(a). What is the probability that a customer is served within 10 minutes?
(c). What is the variance of the time for a customer to be served?
Solution
(a). Probability that a customer is served within 10 minutes:
Let X be the time taken to be served, and X ∼ Uniform(5, 15). To find
P (X ≤ 10), the pdf of X is given by
f(x) = 1/(15 − 5) = 1/10  for 5 ≤ x ≤ 15.
The probability is given by the integral of the pdf from 5 to 10:

P(X ≤ 10) = ∫_5^{10} (1/10) dx = (10 − 5)/10 = 0.5.
The variance of a uniform random variable is

Var(X) = (b − a)²/12.

Here, a = 5 and b = 15, so the variance is

Var(X) = (15 − 5)²/12 = 100/12 ≈ 8.33.
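Both answers can be reproduced with scipy.stats.uniform, which parameterizes the distribution by loc = a and scale = b − a (a sketch for the interval [5, 15]):

```python
from scipy.stats import uniform

# Service time uniform on [5, 15]: loc = 5, scale = 15 - 5 = 10
dist = uniform(loc=5, scale=10)

print("P(X <= 10):", dist.cdf(10))  # probability of being served within 10 minutes
print("Variance:", dist.var())      # (15 - 5)^2 / 12
```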
Problem 7.2. Suppose you are designing a random number generator that
outputs a number between 1 and 100. Each number in this range is equally
likely to be selected.
(a). What is the probability that the generator outputs a number between 20
and 50?
(b). Determine the mean and variance of the numbers generated by this
random number generator.
(c). If you generate 10,000 numbers, what is the expected number of times
a number between 20 and 50 is generated?
Solution
(a). Probability Calculation
Since the numbers generated are uniformly distributed between 1 and
100, we can model this using a continuous uniform distribution U(a, b)
with a = 1 and b = 100.
Here, f(x) = 1/(100 − 1) = 1/99 for 1 ≤ x ≤ 100.
To find the probability that the generator outputs a number between 20 and 50, we integrate the pdf:

P(20 ≤ X ≤ 50) = ∫_{20}^{50} f(x) dx = ∫_{20}^{50} (1/99) dx = (50 − 20)/99 ≈ 0.303.
7.2.2 Python Code for Uniform Distribution Character-
istics
The Uniform distribution models scenarios where all outcomes are equally likely
within a given interval. Below is Python code demonstrating how to compute
various characteristics for both Continuous and Discrete Uniform distributions.
import numpy as np
from scipy.stats import uniform

# Continuous uniform distribution on [a, b] (example bounds)
a, b = 0, 10
dist_continuous = uniform(loc=a, scale=b - a)

# Variance
variance_continuous = dist_continuous.var()
print("Variance:", variance_continuous)

# Standard Deviation
std_dev_continuous = dist_continuous.std()
print("Standard Deviation:", std_dev_continuous)

# Quantiles
quantiles_continuous = dist_continuous.ppf([0.25, 0.5, 0.75])  # 25th, 50th (median), and 75th percentiles
print("Quantiles at 0.25, 0.5, and 0.75:", quantiles_continuous)

# Percentiles
percentiles_continuous = dist_continuous.ppf([0.1, 0.9])  # 10th and 90th percentiles
print("Percentiles at 0.1 and 0.9:", percentiles_continuous)
7.2.3 Exercises
1. Consider a discrete uniform distribution where X takes values {1, 2, 3, . . . , n}. (i) Show that the expected value E(X) is (n + 1)/2, and (ii) find the variance of X.
5. Suppose the waiting time for a bus is uniformly distributed between 0 and
30 minutes. What is the probability that a person will wait more than 20
minutes?
6. A factory produces items with weights that are uniformly distributed
between 50 grams and 150 grams. What is the probability that a randomly
chosen item weighs between 80 grams and 120 grams?
7. The cholesterol level of adults in a certain region follows a uniform distri-
bution between 150 mg/dL and 250 mg/dL.
(a) Write the probability density function f (x) of the cholesterol level.
(b) What is the probability that a randomly selected adult has a choles-
terol level between 180 mg/dL and 220 mg/dL?
(c) Find the mean and variance of the cholesterol level.
inspection time for a randomly chosen product is more than 4 minutes.
10. The download speed of a certain internet connection is uniformly dis-
tributed between 10 Mbps and 100 Mbps. What is the probability that
the download speed at any given time is less than 50 Mbps?
11. The delivery time for a package from a warehouse to a customer is uniformly distributed between 2 and 7 days. What is the probability that a
package will be delivered in less than 4 days?
12. The time a wildlife photographer waits to see a particular bird is uniformly
distributed between 30 minutes and 3 hours. Find the probability that
the wait time is more than 2 hours.
Imagine you are at a busy city street waiting for a taxi. Taxis arrive at the
street randomly but at an average rate of 5 taxis per hour. You’re curious about
how long you might have to wait for the next taxi. Is there a way to predict or
understand the waiting time better?
You notice that sometimes the wait is short, and other times it can be quite
long. This variability and randomness in waiting times suggest a need for a
mathematical model to describe it.
The exponential distribution models the waiting time, or time between events. In our taxi example, it helps us understand and predict the waiting time until the next taxi arrives. The exponential distribution
provides a framework to calculate probabilities and make informed decisions
based on the average rate of arrivals.
Definition: A random variable X is said to follow an exponential distribution, denoted X ∼ Exp(λ), if its probability density function (pdf) is

f(x) = λe^{−λx}, x ≥ 0 (and f(x) = 0 otherwise),

where λ > 0 is the rate parameter.
Figure: The pdf of the exponential distribution with λ = 0.4, f(x) = 0.4e^{−0.4x}.
This probability is the area under the probability density function between
the points a = 1 and b = 2 as illustrated in Figure 7.3.
Figure 7.3: The area under the probability density function f (x) between a and
b.
The cdf is obtained by integrating the pdf:

F(x) = ∫_0^x λe^{−λt} dt = [−e^{−λt}]_0^x = −e^{−λx} + e^0 = 1 − e^{−λx}.
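The closed form F(x) = 1 − e^{−λx} can be checked against scipy's implementation (a sketch; λ = 0.4 is chosen for illustration, and note that scipy.stats.expon is parameterized by scale = 1/λ):

```python
import numpy as np
from scipy.stats import expon

lam = 0.4                    # illustrative rate parameter
dist = expon(scale=1 / lam)  # scipy parameterizes by scale = 1/lambda

x = np.array([1.0, 2.0, 5.0])
print("scipy cdf  :", dist.cdf(x))
print("closed form:", 1 - np.exp(-lam * x))
```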
Figure: The cdf of the exponential distribution with λ = 0.4, F(x) = 1 − e^{−0.4x}.
Recall that f(x) = λe^{−λx}, x ≥ 0. To find E[X^r], we compute

E[X^r] = ∫_0^∞ x^r f(x) dx = ∫_0^∞ x^r λe^{−λx} dx = r!/λ^r.
Mean
For the exponential distribution (specifically when r = 1),

E[X] = E[X¹] = 1!/λ¹ = 1/λ.

Variance
The variance Var(X) is calculated as follows:

Var(X) = E[X²] − (E[X])².

Calculating E[X²] (using r = 2), we have

E[X²] = 2!/λ² = 2/λ².

Hence,

Var(X) = 2/λ² − (1/λ)² = 2/λ² − 1/λ² = 1/λ².
Problem 7.3. A data center experiences server failures where the time between
failures follows an exponential distribution with a mean of 10 days.
(i). Calculate the rate parameter λ.
(ii). What is the probability that a server will fail within the next 3 days?
(iii). What is the probability that the server will last longer than 12 days without
DR
failing?
So, the probability that the server will last longer than 12 days without failing
is approximately 0.3012 (or 30.12%).
Problem 7.4. In a hospital, the time between arrivals of patients in the emer-
gency room follows an exponential distribution with an average time of 15 min-
utes.
(a) What is the probability that the time between two successive arrivals is
more than 20 minutes?
(b) What is the probability that the time between two successive arrivals is
less than 10 minutes?
(c) Calculate the expected time between two successive arrivals and its stan-
dard deviation.
Solution
(a). Let X be the time between arrivals, which follows an exponential distribution with parameter λ. The rate parameter λ is the reciprocal of the mean, so λ = 1/15 per minute. The probability that the time between two successive arrivals is more than 20 minutes is

P(X > 20) = e^{−20λ}.

Substituting λ = 1/15,

P(X > 20) = e^{−20/15} = e^{−4/3} ≈ 0.2636.

Thus, the probability that the time between two successive arrivals is more than 20 minutes is approximately 0.2636.
(b). The probability that the time between two successive arrivals is less than 10 minutes is

P(X < 10) = 1 − e^{−10/15} = 1 − e^{−2/3} ≈ 0.4866.

Thus, the probability that the time between two successive arrivals is less than 10 minutes is approximately 0.4866.
(c). The expected time between two successive arrivals (the mean of the exponential distribution) is given by

E(X) = 1/λ = 15 minutes.

The standard deviation of the time between two successive arrivals is the same as the mean for an exponential distribution, so

Standard deviation = 1/λ = 15 minutes.

Therefore, the expected time between two successive arrivals is 15 minutes, and the standard deviation is also 15 minutes.
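All three parts can be reproduced with scipy.stats.expon (a sketch; the mean of 15 minutes corresponds to scale = 15):

```python
from scipy.stats import expon

dist = expon(scale=15)  # mean 15 minutes, i.e. lambda = 1/15

p_more_20 = dist.sf(20)   # (a) P(X > 20), via the survival function
p_less_10 = dist.cdf(10)  # (b) P(X < 10)
print(round(p_more_20, 4), round(p_less_10, 4))
print("Mean:", dist.mean(), "Std dev:", dist.std())
```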
• Mean:
E(X) = 1/λ.
• Variance:
Var(X) = E(X²) − [E(X)]² = 1/λ².
• Standard Deviation:
σ_X = 1/λ.
• Memoryless Property: The exponential distribution has the memoryless property, which states that
P(X > s + t | X > s) = P(X > t) for all s, t ≥ 0.
• Characteristic Function:
φ_X(t) = E[e^{itX}] = λ/(λ − it), for t ∈ R.
• Quantile Function: The quantile function (inverse of the CDF) for 0 < p < 1 is given by
Q(p) = F^{−1}(p) = −(1/λ) ln(1 − p).
• Relationship with the Poisson Process: If X ∼ Exponential(λ),
it can be interpreted as the waiting time between events in a Poisson
process with rate λ.
To derive the characteristic function, write

φ_X(t) = ∫_0^∞ e^{itx} λe^{−λx} dx = λ ∫_0^∞ e^{−(λ−it)x} dx.

The integral is of the form ∫_0^∞ e^{−ax} dx = 1/a for ℜ(a) > 0, so

φ_X(t) = λ · 1/(λ − it) = λ/(λ − it).
Thus, the characteristic function of X is

φ_X(t) = λ/(λ − it).
Thus, the moment generating function of X is

M_X(t) = λ/(λ − t), for t < λ.
Proof. The proof of the memoryless property relies on the definition of condi-
tional probability and the exponential distribution’s probability density func-
tion.
By definition of conditional probability, we have

P(X > s + t | X > s) = P(X > s + t, X > s)/P(X > s) = P(X > s + t)/P(X > s),
and the survival function (which gives the probability that X is greater than a
certain value) is
P (X > x) = e−λx .
Using the survival function, we can rewrite the conditional probability:

P(X > s + t | X > s) = P(X > s + t)/P(X > s) = e^{−λ(s+t)}/e^{−λs} = e^{−λt} = P(X > t).
Thus, the memoryless property is proved.
Problem 7.5. Suppose that the waiting time for a bus at a certain bus stop is
exponentially distributed with a mean waiting time of 10 minutes. Let X denote
the waiting time. Given that a person has already waited for 5 minutes, what
is the probability that they will have to wait at least an additional 10 minutes?
Solution
To solve this problem, we use the properties of the exponential distribution, specifically its memoryless property. For an exponential distribution, the mean is given by

Mean = 1/λ.

Therefore,

λ = 1/Mean = 1/10.
We need to find the probability that the waiting time exceeds 15 minutes, given that the person has already waited for 5 minutes. Using the memoryless property,

P(X > 15 | X > 5) = P(X > 10).

The probability that X exceeds a certain time x is given by

P(X > x) = e^{−λx}.

Substituting λ = 1/10 and x = 10,

P(X > 10) = e^{−10/10} = e^{−1} ≈ 0.3679.
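The memoryless property can also be observed numerically: the conditional probability computed from survival functions matches the unconditional one (a sketch for the bus-stop setting with mean 10 minutes):

```python
from scipy.stats import expon

dist = expon(scale=10)  # mean waiting time 10 minutes, lambda = 1/10

# P(X > 15 | X > 5) via survival functions, compared with P(X > 10)
conditional = dist.sf(15) / dist.sf(5)
print(round(conditional, 4), round(dist.sf(10), 4))
```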
Problem 7.6. Assume that the lifetime of a light bulb follows an exponential
distribution with a mean lifetime of 1000 hours. Let X be the lifetime of the light
bulb. If a light bulb has already been used for 800 hours, what is the probability
that it will last at least an additional 500 hours?
Solution
To solve this problem, we use the memoryless property of the exponential distribution. Here’s a step-by-step solution.
The mean lifetime of the light bulb is 1000 hours. For an exponential distribution, the mean is given by

Mean = 1/λ.

Therefore,

λ = 1/Mean = 1/1000.

We need to find the probability that the light bulb will last for at least 500 more hours, given that it has already been used for 800 hours. Using the memoryless property,

P(X > 800 + 500 | X > 800) = P(X > 500).

Since P(X > x) = e^{−λx}, substituting λ = 1/1000 and x = 500 gives

P(X > 500) = e^{−500/1000} = e^{−0.5} ≈ 0.6065.
Problem 7.7. Suppose the time to failure of a component is exponentially distributed with a rate parameter λ = 0.01 failures per hour. Let
X represent the time to failure of the component. If the component has been
operational for 100 hours without failure, what is the probability that it will
operate for at least another 50 hours?
Solution
To solve this problem, we use the properties of the exponential distribution,
specifically its memoryless property. Here’s a step-by-step solution: The rate
parameter for the exponential distribution is given as λ = 0.01 failures per hour.
We need to find the probability that the component will operate for at least
50 more hours, given that it has already operated for 100 hours. Using the
memoryless property
P(X > 100 + 50 | X > 100) = P(X > 50).

Since P(X > x) = e^{−λx}, substituting λ = 0.01 and x = 50 gives

P(X > 50) = e^{−0.01×50} = e^{−0.5} ≈ 0.6065.
The exponential distribution also has many other important applications, for example in reliability analysis, queueing systems, and survival analysis, where it models lifetimes and waiting times as in the problems above.
Python Code
import numpy as np
from scipy.stats import expon

# Rate parameter (example value); scipy uses scale = 1/lambda
lambda_ = 0.5
scale = 1 / lambda_

# Exponential distribution
dist = expon(scale=scale)

# 1. Probability Density Function (pdf)
x_values = np.linspace(0, 10, 6)
pdf_values = dist.pdf(x_values)
print("PDF values:", pdf_values)

# 2. Cumulative Distribution Function (cdf)
cdf_values = dist.cdf(x_values)
print("CDF values:", cdf_values)

# 3. Mean (Expected Value)
mean = dist.mean()
print("Mean (Expected Value):", mean)

# 4. Variance
variance = dist.var()
print("Variance:", variance)

# 5. Standard Deviation
std_dev = dist.std()
print("Standard Deviation:", std_dev)

# 6. Quantiles
quantiles = dist.ppf([0.25, 0.5, 0.75])  # 25th, 50th (median), and 75th percentiles
print("Quantiles at 0.25, 0.5, and 0.75:", quantiles)

# 7. Percentiles
percentiles = dist.ppf([0.1, 0.9])  # 10th and 90th percentiles
print("Percentiles at 0.1 and 0.9:", percentiles)
Explanations
• Probability Density Function (PDF):
  ■ The pdf gives the relative likelihood of each value of X. Compute values using dist.pdf(x_values).
• Variance:
  ■ The variance is 1/λ². Compute this using dist.var().
• Standard Deviation:
  ■ The standard deviation is 1/λ. Compute this using dist.std().
• Quantiles:
  ■ Quantiles are values below which a given proportion of data falls. Compute these using dist.ppf() for the desired quantile probabilities.
• Percentiles:
  ■ Percentiles are specific quantiles, such as the 10th and 90th percentiles. Compute these using dist.ppf() for the desired percentiles.
• Moment Generating Function (MGF):
■ The MGF for the Exponential distribution is MX (t) = 1/(1 − t/λ) for
t < λ. Define a function mgf(t, lam) to compute MGF values.
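Following the bullet above, a minimal sketch of such a function (the parameter is named lam rather than lambda, which is a reserved word in Python):

```python
def mgf(t, lam):
    """MGF of the Exponential(lam) distribution: 1 / (1 - t/lam), valid for t < lam."""
    if t >= lam:
        raise ValueError("the MGF is defined only for t < lambda")
    return 1.0 / (1.0 - t / lam)

print(mgf(0.005, 0.01))  # 1 / (1 - 0.5) = 2.0
```

Note that mgf(0, lam) = 1, as it must be for any moment generating function.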
7.3.4 Exercises
1. The lifetime of a particular brand of lightbulb is exponentially distributed
with a mean of 1000 hours.
(a) What is the probability that a lightbulb lasts more than 1200 hours?
(b) What is the probability that a lightbulb lasts between 800 and 1200
hours?
2. A machine in a factory breaks down on average once every 500 hours.
The time between breakdowns is exponentially distributed.
(a) What is the probability that the machine will operate for at least
1000 hours without a breakdown?
(b) Determine the probability that the machine will break down within
the next 200 hours.
3. The waiting time for a specific genetic test result from a laboratory follows
an exponential distribution with a mean of 2 days.
(a) Find the probability that the test result will be available in less than
1 day.
(b) Find the probability that the test result will take more than 3 days.
4. The time until failure of a critical component in a medical device follows
an exponential distribution with a mean of 5 years.
(a) Calculate the probability that the component will fail within the first
3 years.
(b) Determine the probability that the component will last more than 7
years.
5. In a call center, the time between consecutive calls follows an exponential
distribution with a mean time of 4 minutes.
(a) What is the probability that the next call comes within 2 minutes?
(b) What is the probability that the next call will not come for at least
6 minutes?
8. The response time of a server to a network request is exponentially dis-
tributed with an average time of 0.5 seconds.
(a) What is the probability that the server responds in less than 0.3
seconds?
AF (b) Calculate the probability that the server takes more than 1 second
to respond.
9. The time it takes for a chemical reaction to complete in a lab experiment
follows an exponential distribution with a mean of 45 minutes.
(a) What is the probability that the reaction will complete in less than
30 minutes?
(b) What is the probability that the reaction will take more than 60
minutes to complete?
10. The lifespan of a certain species of laboratory mice follows an exponential
7.4 Normal Distribution
In the field of data science, the normal distribution plays a pivotal role due
to its ubiquitous nature and the mathematical properties that simplify analysis.
It is characterized by its symmetric, bell-shaped curve and is instrumental
in various statistical methods, including hypothesis testing, regression analysis,
and many machine learning algorithms.
The normal distribution is especially useful because of the Central Limit
Theorem, which states that the sum of a large number of independent, iden-
tically distributed variables tends toward a normal distribution, regardless of
the original distribution of the variables. This makes the normal distribution
a powerful tool for modeling real-world phenomena and for making inferences
about populations based on sample data.
A continuous random variable X is said to follow a normal distribution with mean µ and variance σ² if its probability density function is

f (x) = (1/(√(2π) σ)) e^{−(1/2)((x−µ)/σ)²},  −∞ < x < ∞,

where µ and σ² > 0 are the parameters of the distribution. The notation X ∼ N (µ, σ²)
denotes that the random variable X has a normal distribution with mean
µ and variance σ².
Figure: The probability density function f (x) of N (µ, σ²), a bell-shaped curve centered at the mean µ of the distribution.
The probability density function of a normal random variable is symmetric
around the mean value µ and exhibits a “bell-shaped” curve. Figure 7.6 displays
the probability density functions of normal distributions with µ = 5, σ = 2
and µ = 10, σ = 2. It illustrates that while altering the mean value µ shifts
the location of the density function, it does not affect its shape. In contrast,
Figure 7.7 presents the probability density functions of normal distributions
with µ = 5, σ = 2 and µ = 5, σ = 0.5. Here, the central position of the
density function remains the same, but its shape changes. Larger values of the
variance σ 2 lead to wider, flatter bell-shaped curves, whereas smaller values of
the variance σ 2 produce narrower, sharper bell-shaped curves.
Figure 7.6: Probability density functions of normal distributions with a common variance and different means; changing µ shifts the curve without changing its shape.

Figure 7.7: Probability density functions of normal distributions with a common mean and different variances; a larger σ² gives a wider, flatter curve.
2. Symmetry: The density curve is symmetric about the vertical line x = µ.
3. Mean, Median, and Mode: For a normal distribution, the mean, me-
dian, and mode are all equal and located at µ.
4. Total Area: The total area under the curve and above the horizontal
axis is equal to 1.
5. Inflection Points: The points at which the curve changes concavity are
located at µ − σ and µ + σ.
Figure: Approximately 68%, 95%, and 99.7% of the area under a normal curve lies within µ ± σ, µ ± 2σ, and µ ± 3σ, respectively.
10. Moment Generating Function: The moment generating function MX (t)
of a normal random variable X ∼ N (µ, σ 2 ) is given by:
MX (t) = exp(µt + (1/2)σ²t²).
11. Characteristic Function: The characteristic function φX (t) of a normal
random variable X ∼ N (µ, σ 2 ) is given by:
φX (t) = exp(iµt − (1/2)σ²t²).
Theorem 7.5. If X ∼ N (µ, σ²), then the mean and variance are µ and σ², respectively.
Proof. To evaluate the mean, we first calculate
E(X − µ) = ∫_{−∞}^{∞} (x − µ) (1/(√(2π) σ)) e^{−(1/2)((x−µ)/σ)²} dx.

Setting z = (x − µ)/σ and dx = σ dz, we obtain

E(X − µ) = (σ/√(2π)) ∫_{−∞}^{∞} z e^{−z²/2} dz = 0,
and hence E(X) = µ.
The variance of the normal distribution is given by
E[(X − µ)²] = (1/(√(2π) σ)) ∫_{−∞}^{∞} (x − µ)² e^{−(1/2)((x−µ)/σ)²} dx.

Again setting z = (x − µ)/σ and dx = σ dz, we obtain

E[(X − µ)²] = (σ²/√(2π)) ∫_{−∞}^{∞} z² e^{−z²/2} dz.

Integrating by parts with u = z and dv = z e^{−z²/2} dz, so that du = dz and v = −e^{−z²/2}, we find that

E[(X − µ)²] = (σ²/√(2π)) [−z e^{−z²/2}]_{−∞}^{∞} + (σ²/√(2π)) ∫_{−∞}^{∞} e^{−z²/2} dz = σ²(0 + 1) = σ²,

since the first term vanishes and (1/√(2π)) ∫_{−∞}^{∞} e^{−z²/2} dz = 1.
Theorem 7.6. For a normal distribution with mean µ and standard deviation
σ, the inflection points occur at x = µ − σ and x = µ + σ.
Proof. To find the inflection points, we need to determine where the second
derivative of the density function changes sign.
First Derivative

The first derivative of f (x) with respect to x is:

f ′(x) = −((x − µ)/(σ³√(2π))) e^{−(x−µ)²/(2σ²)}

Second Derivative

The second derivative of f (x) with respect to x is:

f ′′(x) = −(1/(σ³√(2π))) e^{−(x−µ)²/(2σ²)} (1 − (x − µ)²/σ²)

Simplifying:

f ′′(x) = (1/(σ⁵√(2π))) e^{−(x−µ)²/(2σ²)} ((x − µ)² − σ²)
Setting the Second Derivative to Zero
To find the inflection points, we set f ′′ (x) = 0:
(x − µ)2 − σ 2 = 0
Solving for x:
(x − µ)2 = σ 2
x − µ = ±σ
AF x=µ±σ
Thus, the inflection points of the normal distribution are at x = µ − σ and
x = µ + σ.
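The inflection points can be verified numerically; a sketch for the standard normal case (µ = 0, σ = 1), using a central-difference approximation of f ′′ so the sign change at x = µ + σ = 1 is visible:

```python
import math

def f(x, mu=0.0, sigma=1.0):
    # Normal probability density function
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def f2(x, h=1e-5):
    # Central-difference approximation of the second derivative of f
    return (f(x + h) - 2 * f(x) + f(x - h)) / h ** 2

# f'' changes sign between 0.9 and 1.1, bracketing the inflection point at x = 1
print(f2(0.9) < 0, f2(1.1) > 0)  # True True
```

By symmetry, the same sign change occurs at x = −1 = µ − σ.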
The standard normal distribution N (0, 1) has probability density function fZ (z) = (1/√(2π)) e^{−z²/2}.
Figure 7.9: The standard normal distribution with mean µ = 0 and standard
deviation σ = 1.
The symmetry of the standard normal distribution about 0 implies that if
the random variable Z has a standard normal distribution, then
1 − Φ(z) = Pr(Z ≥ z) = Pr(Z ≤ −z) = Φ(−z),
as illustrated in Figure 7.9. This equation can be rearranged to provide the
easily remembered relationship
Φ(z) + Φ(−z) = 1.
The plot presented in Figure 7.10, illustrates the cumulative distribution func-
tions Φ(z) and Φ(−z) of the standard normal distribution. The symmetry of
the standard normal distribution is evident from these plots.
Figure 7.10: Standard normal distribution with shaded areas for Φ(−2) and
Φ(2).
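The symmetry relation Φ(z) + Φ(−z) = 1 is easy to verify numerically; a sketch using the error function from the standard library:

```python
import math

def phi(z):
    # Standard normal CDF expressed via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Symmetry of the standard normal: Phi(z) + Phi(-z) = 1 for any z
for z in (0.18, 1.0, 2.0):
    print(round(phi(z) + phi(-z), 12))  # 1.0 each time
```

The identity Φ(z) = 0.5(1 + erf(z/√2)) follows directly from the definition of the error function.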
P (a ≤ X ≤ b) = F (b) − F (a).
Direct computation of F (·) for a general normal distribution can be challenging.
However, it is readily facilitated by using the cdf of the standard normal distri-
bution of Z. The following steps explain how to compute P (a ≤ X ≤ b) for a
normally distributed random variable X by leveraging the cdf of the standard
normal random variable Z.
Figure 7.11: The area under the probability density function f (x) between a
and b.
1. Identify the parameters: Determine the mean µ and the standard deviation σ of X.
2. Standardize the limits: Convert the lower limit a and the upper limit
b to their corresponding z-scores:
za = (a − µ)/σ and zb = (b − µ)/σ
3. Find the cumulative probabilities: Use the cdf of the standard nor-
mal distribution, denoted by Φ(z) = P (Z ≤ z), to find the cumulative
probabilities at za and zb :
Φ(za ) = Φ((a − µ)/σ) = P (Z ≤ (a − µ)/σ)
Φ(zb ) = Φ((b − µ)/σ) = P (Z ≤ (b − µ)/σ).

4. Calculate the probability: The probability P (a ≤ X ≤ b) is the dif-
ference between the cumulative probabilities at zb and za :

P (a ≤ X ≤ b) = P ((a − µ)/σ ≤ (X − µ)/σ ≤ (b − µ)/σ)
= P (za ≤ Z ≤ zb )
= P (Z ≤ zb ) − P (Z ≤ za )
= Φ(zb ) − Φ(za )
= Φ((b − µ)/σ) − Φ((a − µ)/σ).
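The steps above can be sketched as a small helper function (phi and normal_prob are illustrative names, not from the text):

```python
import math

def phi(z):
    # Standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def normal_prob(a, b, mu, sigma):
    """P(a <= X <= b) for X ~ N(mu, sigma^2), via standardization."""
    za = (a - mu) / sigma       # step 2: standardize the limits
    zb = (b - mu) / sigma
    return phi(zb) - phi(za)    # steps 3-4: difference of the CDF values

print(round(normal_prob(67, 73, 70, 3), 4))  # ≈ 0.6827
```

The exact answer 0.68269 rounds to 0.6827; four-decimal tables give 0.6826 because the individual Φ values are rounded first.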
Problem 7.9. Let Y ∼ N (10, 16). Find P (|Y − 10| ≥ 12).

Solution

Given Y ∼ N (10, 16), we know that the mean µ = 10 and standard deviation
σ = 4. We need to compute P (|Y − 10| ≥ 12). This is equivalent to the
following:

P (|Y − 10| ≥ 12) = P (Y ≤ −2) + P (Y ≥ 22)
= P (Z ≤ −3) + P (Z ≥ 3)
= Φ(−3) + [1 − Φ(3)]
≈ 0.00135 + [1 − 0.99865] = 0.0027
Problem 7.10. Assume the heights of adult males in a certain population are
normally distributed with a mean of 70 inches and a standard deviation of 3
inches. What is the probability that a randomly selected adult male has a height
AF
between 67 and 73 inches?
Solution
To find the probability that a randomly selected adult male has a height be-
tween 67 and 73 inches, we standardize the values and use the standard normal
distribution. Therefore,
P (67 ≤ X ≤ 73) = P ((67 − 70)/3 ≤ Z ≤ (73 − 70)/3)
= P (Z ≤ (73 − 70)/3) − P (Z ≤ (67 − 70)/3)
= Φ(1) − Φ(−1)
≈ 0.8413 − 0.1587 = 0.6826
Problem 7.11. Given a standard normal distribution, find the value of a such that P (Z ≥ a) = 0.72.

Solution

We need the value a for which

P (Z ≥ a) = 0.72

This is equivalent to:

P (Z < a) = 1 − 0.72 = 0.28

Using the standard normal distribution table, the value whose area to the left is 0.28 is

a ≈ −0.58

Thus, the value of a is approximately −0.58.
Problem 7.12. Given a standard normal distribution, find the value of k such
that
(a). P (Z > k) = 0.3015
(b). P (k < Z < −0.18) = 0.4197.
Solution (a). Finding k for P (Z > k) = 0.3015
The area to the right of k is 0.3015. Therefore, the area to the left of k
is:
1 − 0.3015 = 0.6985
Using the standard normal distribution table, we look up the value that
corresponds to an area of 0.6985 to the left. This value is:
k ≈ 0.52
(b). Finding k for P (k < Z < −0.18) = 0.4197

The area between k and −0.18 is 0.4197. Since P (Z < −0.18) = 0.4286, the area to the left of k is

0.4286 − 0.4197 = 0.0089

That is,

P (Z < k) = 0.0089

Using the standard normal distribution table, we look up the value that
corresponds to an area of 0.0089 to the left. This value is

k ≈ −2.37
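The table lookups in this problem can also be done with the inverse cdf; a sketch using Python's statistics.NormalDist:

```python
from statistics import NormalDist

Z = NormalDist()                      # standard normal
k_a = Z.inv_cdf(1 - 0.3015)           # (a): P(Z > k) = 0.3015
k_b = Z.inv_cdf(0.4286 - 0.4197)      # (b): P(Z < k) = 0.0089
print(round(k_a, 2), round(k_b, 2))   # 0.52  -2.37
```

inv_cdf inverts the cumulative distribution function, so it plays the role of reading the table "backwards".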
Problem 7.13. The heights of adult males in a population are normally distributed with a mean of 175 cm and a standard deviation of 6 cm. Below what height do the shortest 10% of adult males fall?

Solution

Let X denote the height of an adult male, so that X ∼ N (175, 6²).
We need to find the height x such that the cumulative probability up to x
is 0.10. In other words, we want to find x for which
P (X ≤ x) = 0.10.
To find the height below which the shortest 10% fall, we find the Z-score
for the 10th percentile, which is approximately z = −1.28. Using the formula
X = Z · σ + µ, we get
X = (−1.28) · 6 + 175 ≈ 167.32 cm
Thus, the height below which the shortest 10% of adult males fall is approx-
imately 167.32 cm.
Problem 7.14. Subscribers to The Wall Street Journal Interactive Edition spend an average of
27 hours per week using the computer at work. Assume the normal distribution
applies and that the standard deviation is 8 hours.
(a). What is the probability a randomly selected subscriber spends less than 10
hours using the computer at work?
(b). What percentage of the subscribers spends more than 35 hours per week
using the computer at work?
(c). A person is classified as a heavy user if he or she is in the upper 20%
in terms of hours of usage. How many hours must a subscriber use the
computer in order to be classified as a heavy user?
Solution
Let X be the number of hours a randomly selected subscriber spends using the
computer at work per week. Assume X follows a normal distribution:
X ∼ N (27, 82 )
where µ = 27 hours and σ = 8 hours.
(a). To find the probability that a subscriber spends less than 10 hours on
the computer, we need to calculate P (X < 10). First, convert this to
the standard normal variable Z:
Z = (X − µ)/σ = (10 − 27)/8 = −17/8 = −2.125
Using standard normal distribution tables or software, we find:

P (Z < −2.125) ≈ 0.0169

Thus, the probability that a randomly selected subscriber spends less
than 10 hours using the computer is approximately 0.0169 or 1.69%.
(b). To find the percentage of subscribers who spend more than 35 hours per
week, we need to calculate P (X > 35). Convert this to the standard
normal variable Z:
Z = (X − µ)/σ = (35 − 27)/8 = 8/8 = 1

Using standard normal distribution tables or software, we find:

P (Z > 1) = 1 − Φ(1) ≈ 1 − 0.8413 = 0.1587
Thus, the percentage of subscribers who spend more than 35 hours per
week is approximately 15.87%.
(c). A heavy user is in the upper 20% of usage, so we need the 80th percentile of X. The corresponding standard normal value is

z_{0.80} ≈ 0.84

x = µ + zσ = 27 + 0.84 × 8 = 27 + 6.72 = 33.72
Therefore, a subscriber must use the computer for at least 33.72 hours
per week to be classified as a heavy user.
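As a check, the three parts can be computed with Python's statistics.NormalDist; the exact values differ slightly from the two-decimal table lookups above (0.0168 vs. 0.0169, and 33.73 vs. 33.72):

```python
from statistics import NormalDist

X = NormalDist(mu=27, sigma=8)       # weekly hours of computer use

p_a = X.cdf(10)                      # (a) P(X < 10)
p_b = 1 - X.cdf(35)                  # (b) P(X > 35)
cutoff = X.inv_cdf(0.80)             # (c) 80th percentile: heavy-user cutoff
print(round(p_a, 4), round(p_b, 4), round(cutoff, 2))
```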
Theorem (Central Limit Theorem). Let X₁, X₂, . . . , Xₙ be independent and identically distributed random variables with mean µ and finite variance σ², and define the sample mean

X̄ₙ = (1/n) Σ_{i=1}^{n} Xᵢ.

Then, as n approaches infinity, the distribution of the standardized sample mean
approaches the standard normal distribution:

(X̄ₙ − µ)/(σ/√n) →d N (0, 1),

or equivalently,

X̄ₙ →d N (µ, σ²/n),

where →d denotes convergence in distribution.

For example, if X₁, . . . , X₃₀ are the outcomes of 30 rolls of a fair die, with µ = 3.5 and σ² = 35/12, then
X̄₃₀ can be approximated by a normal distribution with mean 3.5 and standard
deviation σ/√30:

X̄₃₀ →d N (3.5, 35/(12 × 30)).
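This convergence can be illustrated by simulation; a sketch with NumPy (the seed is arbitrary), drawing 10,000 sample means of n = 30 fair-die rolls each:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# 10,000 sample means, each computed from 30 rolls of a fair die
means = rng.integers(1, 7, size=(10_000, 30)).mean(axis=1)

print(means.mean())   # close to mu = 3.5
print(means.std())    # close to sqrt((35/12)/30) ≈ 0.3118
```

A histogram of these means would look approximately bell-shaped, even though each individual roll is uniform on {1, . . . , 6}.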
Problem 7.15. A researcher is studying the systolic blood pressure levels in
a population of adults. It is known that the systolic blood pressure levels are
normally distributed with a mean (µ) of 120 mmHg and a standard deviation
(σ) of 15 mmHg.
(a). What proportion of the population has systolic blood pressure levels be-
tween 110 mmHg and 130 mmHg?
(b). What is the probability that a randomly selected individual from this pop-
ulation has a systolic blood pressure level above 140 mmHg?
(c). If the researcher takes a random sample of 25 adults, what is the probabil-
ity that the sample mean systolic blood pressure is less than 115 mmHg?
Solution (a). Proportion of the Population between 110 mmHg and 130
mmHg
We need to find P (110 ≤ X ≤ 130) where X is the systolic blood
pressure level.
P (110 ≤ X ≤ 130) = P ((110 − 120)/15 ≤ (X − µ)/σ ≤ (130 − 120)/15)
= P (−0.67 ≤ Z ≤ 0.67)
= P (Z ≤ 0.67) − P (Z ≤ −0.67)
= Φ(0.67) − Φ(−0.67)
= 0.7486 − 0.2514
= 0.4972
So, approximately 49.72% of the population has systolic blood pressure
levels between 110 mmHg and 130 mmHg.

(b). Probability of a Systolic Blood Pressure Level above 140 mmHg

We need P (X > 140):

P (X > 140) = P (Z > (140 − 120)/15) = P (Z > 1.33) = 1 − Φ(1.33) ≈ 1 − 0.9082 = 0.0918

So, the probability that a randomly selected individual has a systolic blood
pressure level above 140 mmHg is approximately 0.0918 or 9.18%.
(c). Probability that the Sample Mean is below 115 mmHg

For a sample of n = 25 adults, the standard error of the sample mean is σ/√n = 15/√25 = 3, so

P (X̄ < 115) = P (Z < (115 − 120)/3) = P (Z ≤ −1.67) ≈ 0.0475

So, the probability that the sample mean systolic blood pressure of 25
adults is less than 115 mmHg is approximately 0.0475 or 4.75%.
Problem 7.16. The cholesterol levels in a population of adults are normally distributed with a mean (µ) of 200 mg/dL and a standard deviation (σ) of 25 mg/dL.
(a). What percentage of the population has cholesterol levels between 175 mg/dL
and 225 mg/dL?
(b). What is the probability that a randomly selected individual has a choles-
terol level below 180 mg/dL?
(c). If a sample of 36 adults is taken, what is the probability that the sample
AF mean cholesterol level is greater than 210 mg/dL?
Solution
(a). Percentage of the Population between 175 mg/dL and 225 mg/dL
Let X be the cholesterol level. Then,

P (175 ≤ X ≤ 225) = P ((175 − 200)/25 ≤ (X − µ)/σ ≤ (225 − 200)/25)
= P (−1 ≤ Z ≤ 1)
= Φ(1) − Φ(−1)
≈ 0.8413 − 0.1587 = 0.6826
So, approximately 68.26% of the population has cholesterol levels between 175
mg/dL and 225 mg/dL.
(b). Probability of a Cholesterol Level below 180 mg/dL

P (X < 180) = P ((X − µ)/σ < (180 − 200)/25)
= P (Z ≤ −0.8)
= Φ(−0.8) ≈ 0.2119
So, the probability that a randomly selected individual has a cholesterol level
below 180 mg/dL is approximately 0.2119 or 21.19%.
(c). Probability that the Sample Mean exceeds 210 mg/dL

For a sample of n = 36, the standard error of the sample mean is σ/√n = 25/6 ≈ 4.17, so

P (X̄ > 210) = P (Z > (210 − 200)/4.17) = P (Z > 2.4) = 1 − Φ(2.4) ≈ 1 − 0.9918 = 0.0082

So, the probability that the sample mean cholesterol level of 36 adults is greater
than 210 mg/dL is approximately 0.0082 or 0.82%.
Python Code
import numpy as np
from scipy.stats import norm

# Parameters of the normal distribution (illustrative values)
mu = 0
sigma = 1

# Normal distribution
dist = norm(loc=mu, scale=sigma)

# 1. Probability Density Function (PDF), on an illustrative grid
x_values = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 9)
print("PDF values:", dist.pdf(x_values))

# 2. Cumulative Distribution Function (CDF)
print("CDF values:", dist.cdf(x_values))

# 3. Mean (Expected Value)
mean = dist.mean()
print("Mean (Expected Value):", mean)

# 4. Variance
variance = dist.var()
print("Variance:", variance)

# 5. Standard Deviation
std_dev = dist.std()
print("Standard Deviation:", std_dev)

# 6. Quantiles
quantiles = dist.ppf([0.25, 0.5, 0.75])  # 25th, 50th (median), and 75th percentiles
print("Quantiles at 0.25, 0.5, and 0.75:", quantiles)

# 7. Percentiles
percentiles = dist.ppf([0.1, 0.9])  # 10th and 90th percentiles
print("Percentiles at 0.1 and 0.9:", percentiles)
Explanations
• Probability Density Function (PDF):
■ The PDF provides the likelihood of each value. Compute these
values using dist.pdf(x values).
• Variance:
■ The variance is σ². Compute this using dist.var().
• Standard Deviation:
■ The standard deviation is σ. Compute this using dist.std().
• Quantiles:
■ Quantiles are values below which a given proportion of data falls.
Compute these using dist.ppf() for the desired quantile probabilities.
• Percentiles:
■ Percentiles are specific quantiles, such as the 10th and 90th percentiles.
Compute these using dist.ppf() for the desired percentiles.
7.5 Exercises
1. Given a normal distribution X ∼ N (µ, σ 2 ), answer the following ques-
tions:
(a) If µ = 10 and σ = 2, what is the probability that X is less than 8?
(b) Find the probability that X lies between 8 and 12.
score needed to be in the top 5%.
5. Given a population with mean µ = 60 and standard deviation σ = 15:
(a) If a sample of size 50 is taken, what is the expected value and stan-
dard deviation of the sample mean?
(b) Using the Central Limit Theorem, find the probability that the sam-
ple mean is less than 58.
(c) Calculate the probability that the sample mean lies between 59 and
62.
6. A study measures the cholesterol levels (in mg/dL) of a group of patients,
which are found to follow a normal distribution with a mean of 200 mg/dL
and a standard deviation of 20 mg/dL.
(a) What is the probability that a randomly selected patient has a choles-
terol level between 180 mg/dL and 220 mg/dL?
(b) What is the 95th percentile of the cholesterol levels?
(c) Calculate the variance and standard deviation of the cholesterol lev-
els.
(d) If a cholesterol level above 240 mg/dL is considered high, what pro-
portion of the patients have high cholesterol levels?
7. Consider a population where the heights of adult women are approxi-
mately normally distributed with a mean of 65 inches and a standard
deviation of 4 inches.
(a) What is the probability that a fish of the first species weighs more
than 2.5 kg?
(b) What is the probability that a fish of the second species weighs be-
tween 2.5 kg and 3.5 kg?
(c) What is the expected value of the difference in weights D = X − Y ?
(d) What is the variance of the difference in weights D?
9. Consider a sample mean X̄ from a normal distribution with population
mean µ and variance σ 2 . Assume the sample size is n = 36.
(a) If µ = 50 and σ = 12, find the probability that the sample mean is
greater than 52.
(b) Calculate the probability that the sample mean lies between 48 and
51.
(c) Determine the value of x̄ such that P (X̄ ≤ x̄) = 0.95.
2. If the length of a rod is uniformly distributed between 10 and 20 cm, what
is the probability that a randomly selected rod is longer than 15 cm?
3. The time between arrivals of customers at a coffee shop follows an expo-
nential distribution with a mean of 5 minutes. What is the probability
that the next customer will arrive within 3 minutes?
4. A radioactive substance has a half-life of 10 years. What is the probability
that a sample will decay in less than 5 years?
5. A set of test scores is normally distributed with a mean of 70 and a
standard deviation of 10. What is the probability that a randomly selected
score is greater than 85?
6. In a factory, the weight of bags of flour is normally distributed with a
mean of 50 kg and a standard deviation of 2 kg. What percentage of bags
weigh between 48 kg and 52 kg?
9. A car rental service finds that the time a customer spends renting a car
follows a normal distribution with a mean of 4 days and a standard devi-
ation of 1.5 days. What is the probability that a customer rents a car for
less than 3 days?
z .00 .01 .02 .03 .04 .05 .06 .07 .08 .09
−3.4 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0002
−3.3 0.0005 0.0005 0.0005 0.0004 0.0004 0.0004 0.0004 0.0004 0.0004 0.0003
−3.2 0.0007 0.0007 0.0006 0.0006 0.0006 0.0006 0.0006 0.0005 0.0005 0.0005
−3.1 0.0010 0.0009 0.0009 0.0009 0.0008 0.0008 0.0008 0.0008 0.0007 0.0007
−3.0 0.0013 0.0013 0.0013 0.0012 0.0012 0.0011 0.0011 0.0011 0.0010 0.0010
−2.9 0.0019 0.0018 0.0018 0.0017 0.0016 0.0016 0.0015 0.0015 0.0014 0.0014
−2.8 0.0026 0.0025 0.0024 0.0023 0.0023 0.0022 0.0021 0.0021 0.0020 0.0019
−2.7 0.0035 0.0034 0.0033 0.0032 0.0031 0.0030 0.0029 0.0028 0.0027 0.0026
−2.6 0.0047 0.0045 0.0044 0.0043 0.0041 0.0040 0.0039 0.0038 0.0037 0.0036
−2.5 0.0062 0.0060 0.0059 0.0057 0.0055 0.0054 0.0052 0.0051 0.0049 0.0048
−2.4 0.0082 0.0080 0.0078 0.0075 0.0073 0.0071 0.0069 0.0068 0.0066 0.0064
−2.3 0.0107 0.0104 0.0102 0.0099 0.0096 0.0094 0.0091 0.0089 0.0087 0.0084
−2.2 0.0139 0.0136 0.0132 0.0129 0.0125 0.0122 0.0119 0.0116 0.0113 0.0110
−2.1 0.0179 0.0174 0.0170 0.0166 0.0162 0.0158 0.0154 0.0150 0.0146 0.0143
−2.0 0.0228 0.0222 0.0217 0.0212 0.0207 0.0202 0.0197 0.0192 0.0188 0.0183
−1.9 0.0287 0.0281 0.0274 0.0268 0.0262 0.0256 0.0250 0.0244 0.0239 0.0233
−1.8 0.0359 0.0351 0.0344 0.0336 0.0329 0.0322 0.0314 0.0307 0.0301 0.0294
−1.7 0.0446 0.0436 0.0427 0.0418 0.0409 0.0401 0.0392 0.0384 0.0375 0.0367
−1.6 0.0548 0.0537 0.0526 0.0516 0.0505 0.0495 0.0485 0.0475 0.0465 0.0455
−1.5 0.0668 0.0655 0.0643 0.0630 0.0618 0.0606 0.0594 0.0582 0.0571 0.0559
−1.4 0.0808 0.0793 0.0778 0.0764 0.0749 0.0735 0.0721 0.0708 0.0694 0.0681
−1.3 0.0968 0.0951 0.0934 0.0918 0.0901 0.0885 0.0869 0.0853 0.0838 0.0823
−1.2 0.1151 0.1131 0.1112 0.1093 0.1075 0.1056 0.1038 0.1020 0.1003 0.0985
−1.1 0.1357 0.1335 0.1314 0.1292 0.1271 0.1251 0.1230 0.1210 0.1190 0.1170
−1.0 0.1587 0.1562 0.1539 0.1515 0.1492 0.1469 0.1446 0.1423 0.1401 0.1379
−0.9 0.1841 0.1814 0.1788 0.1762 0.1736 0.1711 0.1685 0.1660 0.1635 0.1611
−0.8 0.2119 0.2090 0.2061 0.2033 0.2005 0.1977 0.1949 0.1922 0.1894 0.1867
−0.7 0.2420 0.2389 0.2358 0.2327 0.2296 0.2266 0.2236 0.2206 0.2177 0.2148
−0.6 0.2743 0.2709 0.2676 0.2643 0.2611 0.2578 0.2546 0.2514 0.2483 0.2451
−0.5 0.3085 0.3050 0.3015 0.2981 0.2946 0.2912 0.2877 0.2843 0.2810 0.2776
−0.4 0.3446 0.3409 0.3372 0.3336 0.3300 0.3264 0.3228 0.3192 0.3156 0.3121
−0.3 0.3821 0.3783 0.3745 0.3707 0.3669 0.3632 0.3594 0.3557 0.3520 0.3483
−0.2 0.4207 0.4168 0.4129 0.4090 0.4052 0.4013 0.3974 0.3936 0.3897 0.3859
−0.1 0.4602 0.4562 0.4522 0.4483 0.4443 0.4404 0.4364 0.4325 0.4286 0.4247
0.0 0.5000 0.4960 0.4920 0.4880 0.4840 0.4801 0.4761 0.4721 0.4681 0.4641
CHAPTER 7. SOME CONTINUOUS PROBABILITY DISTRIBUTIONS
z .00 .01 .02 .03 .04 .05 .06 .07 .08 .09
0 0.5000 0.5040 0.5080 0.5120 0.5160 0.5200 0.5240 0.5279 0.5319 0.5359
0.1 0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.5754
0.2 0.5793 0.5832 0.5871 0.5910 0.5948 0.5987 0.6026 0.6064 0.6103 0.6141
0.3 0.6179 0.6217 0.6255 0.6293 0.6331 0.6368 0.6406 0.6443 0.6480 0.6517
0.4 0.6554 0.6591 0.6628 0.6664 0.6700 0.6736 0.6772 0.6808 0.6844 0.6879
0.5 0.6915 0.6950 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157 0.7190 0.7224
0.6 0.7258 0.7291 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486 0.7518 0.7549
0.7 0.7580 0.7612 0.7642 0.7673 0.7704 0.7734 0.7764 0.7794 0.7823 0.7852
0.8 0.7881 0.7910 0.7939 0.7967 0.7996 0.8023 0.8051 0.8079 0.8106 0.8133
0.9 0.8159 0.8186 0.8212 0.8238 0.8264 0.8289 0.8315 0.8340 0.8365 0.8389
1.0 0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8600 0.8621
1.1 0.8643 0.8665 0.8686 0.8708 0.8729 0.8749 0.8770 0.8790 0.8810 0.8830
1.2 0.8849 0.8869 0.8888 0.8907 0.8925 0.8944 0.8962 0.8980 0.8997 0.9015
1.3 0.9032 0.9049 0.9066 0.9082 0.9099 0.9115 0.9131 0.9147 0.9162 0.9177
1.4 0.9192 0.9207 0.9222 0.9236 0.9251 0.9265 0.9279 0.9292 0.9306 0.9319
1.5 0.9332 0.9345 0.9357 0.9370 0.9382 0.9394 0.9406 0.9418 0.9430 0.9441
1.6 0.9452 0.9463 0.9474 0.9485 0.9495 0.9505 0.9515 0.9525 0.9535 0.9545
1.7 0.9554 0.9564 0.9573 0.9582 0.9591 0.9599 0.9608 0.9616 0.9625 0.9633
1.8 0.9641 0.9649 0.9656 0.9664 0.9671 0.9678 0.9686 0.9693 0.9699 0.9706
1.9 0.9713 0.9719 0.9726 0.9732 0.9738 0.9744 0.9750 0.9756 0.9762 0.9767
2.0 0.9773 0.9778 0.9783 0.9788 0.9793 0.9798 0.9803 0.9808 0.9812 0.9817
2.1 0.9821 0.9826 0.9830 0.9834 0.9838 0.9842 0.9846 0.9850 0.9854 0.9857
2.2 0.9861 0.9864 0.9868 0.9871 0.9875 0.9878 0.9881 0.9884 0.9887 0.9890
2.3 0.9893 0.9896 0.9898 0.9901 0.9904 0.9906 0.9909 0.9911 0.9913 0.9916
2.4 0.9918 0.9920 0.9922 0.9924 0.9927 0.9929 0.9931 0.9932 0.9934 0.9936
2.5 0.9938 0.9940 0.9942 0.9943 0.9945 0.9946 0.9948 0.9949 0.9951 0.9952
2.6 0.9953 0.9955 0.9956 0.9957 0.9959 0.9960 0.9961 0.9962 0.9963 0.9964
2.7 0.9965 0.9966 0.9967 0.9968 0.9969 0.9970 0.9971 0.9972 0.9973 0.9974
2.8 0.9974 0.9975 0.9976 0.9977 0.9977 0.9978 0.9979 0.9979 0.9980 0.9981
2.9 0.9981 0.9982 0.9982 0.9983 0.9984 0.9984 0.9985 0.9985 0.9986 0.9986
3.0 0.9987 0.9987 0.9987 0.9988 0.9988 0.9989 0.9989 0.9989 0.9990 0.9990
Chapter 8

Confidence Interval Estimation
8.1 Introduction
Interval estimation is a crucial concept in statistics and data science, providing
a range of values within which a population parameter is expected to lie. Unlike
point estimation, which gives a single value as an estimate of the population
parameter, interval estimation provides an interval, giving a measure of relia-
bility to the estimation.
A confidence interval is a range of values, computed from the sample data,
within which the parameter is likely to fall. This chapter will explore various
methods of constructing confidence intervals for different population parame-
ters, including means, proportions, and variances.
Let’s say we want to know the average height of adult women in a city. We
can’t measure every woman, so we take a sample of 100 women. From this sam-
ple, we find that the average height is 64 inches, and the variability in heights
is 3 inches.
Now, we're pretty confident that the average height of all women in the
city is around 64 inches, but we can't be absolutely certain. There's a chance
the true average is a bit higher or lower. To show this uncertainty, we use a
confidence interval. This is a range of values where we believe the true average
lies, with a certain level of confidence. For example, a 95% confidence interval
might be from 63.41 to 64.59 inches, meaning we’re 95% sure the true average
is between 63.41 and 64.59 inches.
In the next sections, we will explore the math and theory behind confidence
intervals.
When the population variance σ² is known, a 100(1 − α)% confidence interval for the population mean µ is given by

x̄ ± ME ⇒ (x̄ − ME, x̄ + ME) (8.1)

where

ME = z σ/√n
is called the margin of error (ME). This margin quantifies the uncertainty
associated with our estimate of the population mean. Here, x̄ represents the
sample mean, z is the critical value from the standard normal distribution
corresponding to P (Z < −z) = α2 and P (Z > z) = α2 , σ is the population
standard deviation, and n is the sample size. We denote this critical value as
z = z_{α/2}, as illustrated in Figure 8.1. The term σ/√n represents the standard error
(SE) of x̄.
The confidence interval for the population mean, as described in Equation
8.1, can be expressed as:
P (x̄ − z σ/√n ≤ µ ≤ x̄ + z σ/√n) = 1 − α.
Figure 8.1: Standard Normal Distribution with shaded area to the right of z α2
and to the left of −z α2
For example, if a poll reports 60% support for a candidate with a margin of error of ±4%, the actual support
could be between 56% and 64%. The expression of ME of the confidence
interval for the mean is given in Figure 8.2.
For a 100(1 − α)% confidence level, the critical value z = z_{α/2} is the point with area α/2 to its right and 1 − α/2 to its left.
Table 8.2: Values of α/2 with the corresponding critical values z_{α/2} (listed in five column pairs: α/2, z_{α/2}).
0.001 3.090 0.021 2.034 0.041 1.739 0.061 1.546 0.081 1.398
0.002 2.878 0.022 2.014 0.042 1.728 0.062 1.538 0.082 1.392
0.003 2.748 0.023 1.995 0.043 1.717 0.063 1.530 0.083 1.385
0.004 2.652 0.024 1.977 0.044 1.706 0.064 1.522 0.084 1.379
0.005 2.576 0.025 1.960 0.045 1.695 0.065 1.514 0.085 1.372
0.006 2.512 0.026 1.943 0.046 1.685 0.066 1.506 0.086 1.366
0.007 2.457 0.027 1.927 0.047 1.675 0.067 1.499 0.087 1.359
0.008 2.409 0.028 1.911 0.048 1.665 0.068 1.491 0.088 1.353
0.009 2.366 0.029 1.896 0.049 1.655 0.069 1.483 0.089 1.347
0.010 2.326 0.030 1.881 0.050 1.645 0.070 1.476 0.090 1.341
0.011 2.290 0.031 1.866 0.051 1.635 0.071 1.468 0.091 1.335
0.012 2.257 0.032 1.852 0.052 1.626 0.072 1.461 0.092 1.329
0.013 2.226 0.033 1.838 0.053 1.616 0.073 1.454 0.093 1.323
0.014 2.197 0.034 1.825 0.054 1.607 0.074 1.447 0.094 1.317
0.015 2.170 0.035 1.812 0.055 1.598 0.075 1.440 0.095 1.311
0.016 2.144 0.036 1.799 0.056 1.589 0.076 1.433 0.096 1.305
0.017 2.120 0.037 1.787 0.057 1.580 0.077 1.426 0.097 1.299
0.018 2.097 0.038 1.774 0.058 1.572 0.078 1.419 0.098 1.293
0.019 2.075 0.039 1.762 0.059 1.563 0.079 1.412 0.099 1.287
0.020 2.054 0.040 1.751 0.060 1.555 0.080 1.405 0.100 1.282
When the population variance σ 2 is unknown and the sample size is less
than 30 (i.e., n < 30), the confidence interval for the population mean µ is
given by:
x̄ ± ME ⇒ (x̄ − ME, x̄ + ME)
where

ME = t s/√n,
and t = t_{α/2,ν} is the critical value from the t-distribution with ν = n − 1 degrees
of freedom, s is the sample standard deviation, and n is the sample size. The
critical value of t for the desired confidence level can be found in Table 8.3.
Remark 8.3.1. When the sample size is greater than or equal to 30 (i.e.,
n ≥ 30), the central limit theorem suggests that the sampling distribution of
the sample mean is approximately normally distributed, even if the population
In summary: if σ² is known, ME = z σ/√n; if σ² is unknown and n ≥ 30, ME = z s/√n; and if σ² is unknown and n < 30, ME = t s/√n.
Figure 8.2: The margin of error (ME) for the confidence interval for the popu-
lation mean.
distribution is unknown. In such cases, it is recommended to use the critical
value z = z_{α/2} instead of t = t_{α/2}.
However, if the sample size is smaller than 30 and the population variance
is unknown, it is generally recommended to use the critical value from the t-
distribution, denoted as t = t_{α/2}, which accounts for additional variability due to
the smaller sample size.
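A t-based interval for the unknown-σ, n < 30 branch of Figure 8.2 can be sketched as follows; the sample values (x̄ = 120, s = 5, n = 25) are hypothetical, and scipy.stats.t supplies the critical value t_{α/2, n−1}:

```python
import math
from scipy.stats import t

def t_interval(xbar, s, n, conf=0.95):
    """CI for the mean when sigma is unknown and n is small (df = n - 1)."""
    tcrit = t.ppf(1 - (1 - conf) / 2, df=n - 1)   # t_{alpha/2, n-1}
    me = tcrit * s / math.sqrt(n)                 # margin of error
    return xbar - me, xbar + me

# Hypothetical sample: t_{0.025, 24} ≈ 2.064 (see Table 8.3, row 24)
print(t_interval(120, 5, 25))
```

Replacing t.ppf with the normal critical value reproduces the large-sample (n ≥ 30) branch of the diagram.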
Problem 8.1. A study records the time spent on social media by a sample of 30 teenagers. The sample mean is 120 minutes, and the population variance is known to be 25 minutes². Construct (i) 90%, (ii) 95%, and (iii) 99% confidence intervals for the mean time spent on social media by all teenagers.

Solution

It is given that

• Sample mean (x̄) = 120 minutes
• Population variance σ² = 25 minutes², so population standard deviation σ = √25 = 5 minutes
• Sample size (n) = 30
Table 8.3: Critical points t_{α/2,ν} of the t-distribution with degrees of freedom ν. Column headings give α/2.

ν 0.10 0.05 0.025 0.01 0.005 0.001 0.0005
7 1.415 1.895 2.365 2.998 3.499 4.785 5.408
8 1.397 1.860 2.306 2.896 3.355 4.501 5.041
9 1.383 1.833 2.262 2.821 3.250 4.297 4.781
10 1.372 1.812 2.228 2.764 3.169 4.144 4.587
11 1.363 1.796 2.201 2.718 3.106 4.025 4.437
12 1.356 1.782 2.179 2.681 3.055 3.930 4.318
13 1.350 1.771 2.160 2.650 3.012 3.852 4.221
14 1.345 1.761 2.145 2.624 2.977 3.787 4.140
15 1.341 1.753 2.131 2.602 2.947 3.733 4.073
16 1.337 1.746 2.120 2.583 2.921 3.686 4.015
17 1.333 1.740 2.110 2.567 2.898 3.646 3.965
18 1.330 1.734 2.101 2.552 2.878 3.610 3.922
19 1.328 1.729 2.093 2.539 2.861 3.579 3.883
20 1.325 1.725 2.086 2.528 2.845 3.552 3.850
21 1.323 1.721 2.080 2.518 2.831 3.527 3.819
22 1.321 1.717 2.074 2.508 2.819 3.505 3.792
23 1.319 1.714 2.069 2.500 2.807 3.485 3.767
24 1.318 1.711 2.064 2.492 2.797 3.467 3.745
25 1.316 1.708 2.060 2.485 2.787 3.450 3.725
(i). 90% Confidence Interval: For a 90% confidence interval, the critical
value z is approximately 1.645.
Calculating the margin of error:
Margin of Error = z × σ/√n = 1.645 × 5/√30 ≈ 1.502
Thus, the 90% confidence interval is:

120 ± 1.502 ⇒ (118.498, 121.502)
Comment: The 90% confidence interval suggests that we are 90% confident
that the true mean time spent on social media by all teenagers lies between
T
118.498 and 121.502 minutes.
(ii). 95% Confidence Interval: For a 95% confidence interval, the critical
value z is approximately 1.96. The margin of error is

    ME = 1.96 × 5/√30 ≈ 1.96 × 0.912 ≈ 1.790

Thus, the 95% confidence interval is:

    CI = 120 ± 1.790 ⇒ (118.210, 121.790)
(iii). 99% Confidence Interval: For a 99% confidence interval, the critical
value z is approximately 2.576. The margin of error is

    ME = 2.576 × 5/√30 ≈ 2.576 × 0.912 ≈ 2.352

Thus, the 99% confidence interval is:

    CI = 120 ± 2.352 ⇒ (117.648, 122.352)

Comment: The 99% confidence interval suggests that we are 99% confident
that the true mean time spent on social media by all teenagers is between
117.648 and 122.352 minutes. There is only a 1% chance that the true mean is
not in this interval.
Implications: A narrower interval provides a more precise estimate but with
lower confidence, while a wider interval provides higher confidence but with
less precision. The choice of confidence level depends on the desired level of
certainty and the acceptable margin of error for the study. For critical decisions,
a higher confidence level (e.g., 99%) might be preferred to minimize the risk
of error, even if it results in a wider interval. For less critical situations, a
lower confidence level (e.g., 90% or 95%) might be acceptable if a more precise
estimate is desired.
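As a quick numerical check, the three intervals of Problem 8.1 can be reproduced with a short Python sketch; `scipy.stats.norm.ppf` supplies the critical values, and small differences from the rounded figures above come only from rounding.

```python
import numpy as np
import scipy.stats as stats

x_bar, sigma, n = 120, 5, 30  # sample mean, population SD, sample size

for confidence_level in (0.90, 0.95, 0.99):
    z = stats.norm.ppf(1 - (1 - confidence_level) / 2)  # critical value
    me = z * sigma / np.sqrt(n)                          # margin of error
    print(f"{confidence_level:.0%} CI: ({x_bar - me:.3f}, {x_bar + me:.3f})")
```

Raising the confidence level raises z and therefore widens the interval, exactly as the implications above describe.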
Problem 8.2. A factory produces light bulbs with a known standard deviation
of 100 hours in their lifespan. A sample of light bulbs has an average lifespan
of 1000 hours. Construct a 99% confidence interval for the mean lifespan of the
light bulbs for the following sample sizes:
(i). n = 20
(ii). n = 30
(iii). n = 50
(iii). For n = 50:

    CI = 1000 ± 36.45 ⇒ (963.55, 1036.45)
This interval shows we are 99% confident that the true mean lifespan of the
light bulbs is between 963.55 and 1036.45 hours. With a larger sample size, the
confidence interval is even narrower, reflecting greater precision in estimating
the population mean.
Comparison of Confidence Intervals: For the three sample sizes, the confidence intervals are:
• For n = 20: CI = (942.31, 1057.69)
• For n = 30: CI = (952.97, 1047.03)
• For n = 50: CI = (963.55, 1036.45)
Increasing sample size, reducing variability, and lowering the confidence level
lead to a narrower confidence interval.
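The narrowing effect of a larger sample can be seen numerically by recomputing the 99% interval of Problem 8.2 for each sample size; tiny differences from the intervals quoted above come from rounding the critical value.

```python
import numpy as np
import scipy.stats as stats

x_bar, sigma = 1000, 100          # sample mean and population SD
z = stats.norm.ppf(0.995)         # critical value for a 99% confidence level

for n in (20, 30, 50):
    me = z * sigma / np.sqrt(n)   # margin of error shrinks as n grows
    print(f"n={n}: ({x_bar - me:.2f}, {x_bar + me:.2f}), width = {2 * me:.2f}")
```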
Problem 8.3. Consider a sample of 20 measurements of blood pressure with a
mean of 130 mmHg and a standard deviation of 15 mmHg. Construct a 95%
confidence interval for the population mean blood pressure.
Solution: In this case, we have, s = 15, x̄ = 130, n = 20. Since the sample
size is small (n < 30) and the population standard deviation is unknown, we will
use the t-distribution. The degrees of freedom (ν) is calculated as n − 1 = 19.
From the t-distribution table (see Table 8.3), the critical value t for a 95%
confidence level with 19 degrees of freedom is approximately 2.093. The margin
of error is

    ME = t × s/√n = 2.093 × 15/√20 ≈ 7.02

so

    CI = x̄ ± ME = 130 ± 7.02
This gives us:
CI = (130 − 7.02, 130 + 7.02) ⇒ (122.98, 137.02)
We are 95% confident that the true population mean blood pressure is between
122.98 mmHg and 137.02 mmHg.
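The interval from Problem 8.3 can be cross-checked with `scipy.stats.t.interval`, passing the confidence level, degrees of freedom, point estimate, and standard error:

```python
import numpy as np
import scipy.stats as stats

x_bar, s, n = 130, 15, 20
se = s / np.sqrt(n)  # standard error of the mean
ci = stats.t.interval(0.95, df=n - 1, loc=x_bar, scale=se)
print(ci)  # approximately (122.98, 137.02)
```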
Problem 8.4. A study measures the daily caffeine intake of 25 adults. The
sample has a mean intake of 200 mg and a standard deviation of 50 mg. Con-
struct a 90% confidence interval for the population mean daily caffeine intake.
Solution: Here, s = 50, x̄ = 200, n = 25. Since the sample size is small
(n < 30), we will use the t-distribution. The degrees of freedom (ν) is n − 1 = 24.
From Table 8.3, the critical value t for a 90% confidence level with 24 degrees
of freedom is approximately 1.711. The margin of error is

    ME = t × s/√n = 1.711 × 50/√25 = 1.711 × 10 = 17.11

so

    CI = x̄ ± ME = 200 ± 17.11
This gives us

    CI = (200 − 17.11, 200 + 17.11) ⇒ (182.89, 217.11)
Solution: In this case, we have s = 1.2, x̄ = 4.5, n = 25. Since the sample
size is small (n < 30) and the population standard deviation is unknown, we will
use the t-distribution. The degrees of freedom (ν) is calculated as n − 1 = 24.
From the appropriate t-distribution table (e.g., see Table 8.3), the critical value
t for a 90% confidence level with 24 degrees of freedom is approximately 1.711.
CI = x̄ ± ME = 4.5 ± 0.41064

This results in:

    CI = (4.5 − 0.411, 4.5 + 0.411) ⇒ (4.089, 4.911)
Solution: In this case, we have, s = 1.8, x̄ = 6.3, n = 30. Since the sample
size is large (n ≥ 30), we can use the normal distribution. The critical value z
for a 99% confidence level can be found from the standard normal distribution
table, which is approximately 2.576.
The margin of error (ME) is
    ME = z × s/√n = 2.576 × 1.8/√30 ≈ 2.576 × 0.328 ≈ 0.845

    CI = x̄ ± ME = 6.3 ± 0.845

This results in:

    CI = (6.3 − 0.845, 6.3 + 0.845) ⇒ (5.455, 7.145)
Problem 8.7. A biologist studies the effect of a new fertilizer on plant growth.
A sample of 10 plants is measured for growth in centimeters over a month. The
growth measurements are as follows:
12.5, 15.3, 14.0, 16.7, 13.5, 14.8, 15.1, 13.2, 17.0, 12.0
(i). Calculate the sample mean (x̄) and sample standard deviation (s) of the
growth measurements.
(ii). Using a 95% confidence level, construct a confidence interval for the mean
growth of the plants.
Solution
(i). Given the growth measurements of 10 plants:
12.5, 15.3, 14.0, 16.7, 13.5, 14.8, 15.1, 13.2, 17.0, 12.0
Using the computation table below, the sample mean (x̄) is

    x̄ = 144.1/10 = 14.41
i xi x2i
1 12.5 156.25
2 15.3 234.09
3 14.0 196.00
4 16.7 278.89
5 13.5 182.25
6 14.8 219.04
7 15.1 228.01
8 13.2 174.24
9 17.0 289.00
10 12.0 144.00
Total 144.1 2101.77
Using the formula for the sample variance, we have

    s² = (1/(n − 1)) (Σ xᵢ² − n x̄²) = (1/(10 − 1)) (2101.77 − 10 × 14.41²) ≈ 2.81

Hence, the sample standard deviation (s) is

    s = √s² = √2.81 ≈ 1.68
(ii). 95% Confidence Interval: For a 95% confidence level, the critical
value t = 2.262 for degrees of freedom ν = 9. So, the margin of error is
calculated as follows:

    ME = t × s/√n = 2.262 × 1.68/√10 ≈ 2.262 × 0.531 ≈ 1.20

The confidence interval is given by:

    CI = x̄ ± ME = 14.41 ± 1.20 ⇒ (13.21, 15.61)
Python Code

import numpy as np
import scipy.stats as stats

# Growth measurements of the 10 plants
growth_measurements = np.array([12.5, 15.3, 14.0, 16.7, 13.5,
                                14.8, 15.1, 13.2, 17.0, 12.0])

# (i) Sample mean and sample standard deviation
sample_mean = np.mean(growth_measurements)
sample_std_dev = np.std(growth_measurements, ddof=1)  # Sample standard deviation

# (ii) Construct a 95% confidence interval for the mean growth
confidence_level = 0.95
n = len(growth_measurements)  # Sample size
standard_error = sample_std_dev / np.sqrt(n)
t_critical = stats.t.ppf((1 + confidence_level) / 2, df=n - 1)

# Margin of error
margin_of_error = t_critical * standard_error

# Confidence interval
ci_lower = sample_mean - margin_of_error
ci_upper = sample_mean + margin_of_error

# Output results
print(f"Sample mean: {sample_mean:.2f}")
print(f"Sample standard deviation: {sample_std_dev:.2f}")
print(f"95% CI: ({ci_lower:.2f}, {ci_upper:.2f})")

Listing 8.1: The 95% Confidence Interval for the Population Mean.
Solution: Here, s = 15, x̄ = 190, n = 50. Since the sample size is large
(n ≥ 30), we will use the normal distribution. The critical value z for a 95%
confidence level is approximately 1.96.
The margin of error is ME = z × s/√n = 1.96 × 15/√50 ≈ 4.16, so

    CI = x̄ ± ME = 190 ± 4.16

This gives us:

    CI = (185.84, 194.16)
Solution: Here, s = 2.5, x̄ = 8 and n = 40. Since n ≥ 30, we use the normal
distribution. The critical value for a 95% confidence level is z ≈ 1.96.
    ME = z × s/√n = 1.96 × 2.5/√40 ≈ 1.96 × 0.395 ≈ 0.774

The 95% confidence interval is:

    CI = 8 ± 0.774 ⇒ (7.226, 8.774)
• s² = sample variance
• χ²_{1−α/2, n−1} = the lower chi-square critical value at α/2 with n − 1
  degrees of freedom
Important Notes
• These methods assume that the sample comes from a normally dis-
tributed population.
Solution: Here, s² = 20, n = 25, and for a 95% confidence level with 24
degrees of freedom, χ²_{α/2, 24} ≈ 39.36 and χ²_{1−α/2, 24} ≈ 12.40.
The confidence interval for the variance is:

    ((24 × 20)/39.36, (24 × 20)/12.40) = (12.20, 38.71)

Thus, the 95% confidence interval for the population variance is (12.20, 38.71)
and the 95% confidence interval for the standard deviation is approximately:

    (√12.20, √38.71) = (3.49, 6.22).
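The interval of Problem 8.10 can be verified directly from the summary statistics (s² = 20, n = 25), with `scipy.stats.chi2.ppf` supplying the critical values:

```python
import numpy as np
import scipy.stats as stats

s2, n, alpha = 20, 25, 0.05
chi2_upper = stats.chi2.ppf(1 - alpha / 2, df=n - 1)  # upper critical value, about 39.36
chi2_lower = stats.chi2.ppf(alpha / 2, df=n - 1)      # lower critical value, about 12.40
ci_var = ((n - 1) * s2 / chi2_upper, (n - 1) * s2 / chi2_lower)
ci_sd = tuple(np.sqrt(ci_var))
print(ci_var, ci_sd)
```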
Problem 8.11. A biostatistical study is conducted with a sample of 10 pa-
tients, and the following blood pressure measurements (in mmHg) are recorded:
120, 125, 130, 135, 140, 145, 150, 155, 160, 165. Calculate the variance and stan-
dard deviation of the blood pressure measurements. Also, compute a 95% con-
fidence interval for the variance.
i xi x2i
1 120 14400
2 125 15625
3 130 16900
4 135 18225
5 140 19600
6 145 21025
7 150 22500
8 155 24025
9 160 25600
10 165 27225
Total 1425 205125
For the 95% confidence interval for the variance, use the Chi-square distribution:
    CI_{σ²} = [ (n − 1)s² / χ²_{α/2, n−1} , (n − 1)s² / χ²_{1−α/2, n−1} ]
The resulting interval indicates considerable variability
in blood pressure readings among patients, with potential values for variance
reflecting both low and high levels of dispersion.
Python Code

import numpy as np
import scipy.stats as stats

# Blood pressure measurements (mmHg)
data = np.array([120, 125, 130, 135, 140, 145, 150, 155, 160, 165])
n = len(data)

variance = np.var(data, ddof=1)  # Sample variance
std_dev = np.std(data, ddof=1)   # Sample standard deviation

# 95% confidence interval for the variance (chi-square distribution)
alpha = 0.05
chi2_upper = stats.chi2.ppf(1 - alpha / 2, df=n - 1)
chi2_lower = stats.chi2.ppf(alpha / 2, df=n - 1)
ci_lower = (n - 1) * variance / chi2_upper
ci_upper = (n - 1) * variance / chi2_lower

print(f"Variance: {variance:.2f}, SD: {std_dev:.2f}")
print(f"95% CI for the variance: ({ci_lower:.2f}, {ci_upper:.2f})")

Listing 8.2: The 95% Confidence Interval for the Population Variance.
Problem 8.12. A clinical trial measures the time (in minutes) taken for a
specific treatment to have an effect in 6 patients, yielding the following times:
22, 27, 25, 30, 24, 28. Determine the variance and standard deviation of the time
data. Also, compute a 99% confidence interval for the variance.
Solution: The mean is x̄ = (22 + 27 + 25 + 30 + 24 + 28)/6 = 26, so the sample variance is

    s² = (1/(6 − 1)) [ (22 − 26)² + (27 − 26)² + (25 − 26)² + (30 − 26)² + (24 − 26)² + (28 − 26)² ]
       = (1/5) [16 + 1 + 1 + 16 + 4 + 4]
       = 42/5 = 8.4
The standard deviation is:

    s = √s² = √8.4 ≈ 2.90
For the 99% confidence interval for the variance, use the Chi-square distribution:

    CI_{σ²} = [ (n − 1)s² / χ²_{α/2, n−1} , (n − 1)s² / χ²_{1−α/2, n−1} ]

With χ²_{0.005, 5} ≈ 16.75 and χ²_{0.995, 5} ≈ 0.41, this gives

    CI_{σ²} = (5 × 8.4/16.75, 5 × 8.4/0.41) ≈ (2.51, 102.44)
Python Code

import numpy as np
import scipy.stats as stats

# Given times
data = np.array([22, 27, 25, 30, 24, 28])

# Calculate mean
mean = np.mean(data)

# Sample variance and standard deviation
variance = np.var(data, ddof=1)
std_dev = np.std(data, ddof=1)

# Degrees of freedom
n = len(data)
df = n - 1

# 99% confidence interval for the variance
alpha = 0.01
chi2_upper = stats.chi2.ppf(1 - alpha / 2, df)
chi2_lower = stats.chi2.ppf(alpha / 2, df)
ci_lower = df * variance / chi2_upper
ci_upper = df * variance / chi2_lower

# Output results
print(f"Variance: {variance:.2f}, SD: {std_dev:.2f}")
print(f"99% CI for the variance: ({ci_lower:.2f}, {ci_upper:.2f})")

Listing 8.3: The 99% Confidence Interval for the Population Variance.
The sample proportion is p̂ = x/n, where x is the number of successes and n is the sample size.
For a given confidence level (e.g., 95%), the critical value z is determined
from the standard normal distribution. The margin of error (ME) is calculated
as:

    ME = z √( p̂(1 − p̂)/n )

where the quantity

    SE = √( p̂(1 − p̂)/n )

is called the standard error (SE) of p̂. The confidence interval is then constructed
as:

    p̂ ± ME ⇒ (p̂ − ME, p̂ + ME)

This interval provides a range within which we can be confident that the
true population proportion lies, based on the sample data. Validity conditions
include having a sufficiently large sample size (n ≥ 30) and ensuring np ≥ 5
and n(1 − p) ≥ 5.
Confidence intervals are crucial for assessing uncertainty and making in-
formed decisions based on sample estimates.
Problem 8.13. In a survey of 500 voters, 300 indicated they would vote for
candidate A. Construct a 95% confidence interval for the proportion of voters
who support candidate A.
Solution: The sample size is n = 500 and the number of voters for candidate
A is x = 300. The sample proportion is p̂ = 300/500 = 0.6. For a 95% confidence
level, the critical value is z ≈ 1.96. The margin of error is

    ME = 1.96 √(0.6 × 0.4/500) ≈ 1.96 × 0.0219 ≈ 0.043

so the 95% confidence interval is approximately

    0.6 ± 0.043 ⇒ (0.557, 0.643)
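The calculation for Problem 8.13 takes only a few lines in Python, applying the normal-approximation formula given above:

```python
import numpy as np
import scipy.stats as stats

x, n = 300, 500
p_hat = x / n                              # sample proportion
z = stats.norm.ppf(0.975)                  # critical value for 95% confidence
me = z * np.sqrt(p_hat * (1 - p_hat) / n)  # margin of error
print(f"({p_hat - me:.3f}, {p_hat + me:.3f})")
```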
Problem 8.14. In a survey of 150 patients, 45 reported that they are satisfied
with their current medication. Estimate the proportion of patients satisfied with
their medication and calculate a 95% confidence interval for this proportion.
The sample proportion is p̂ = 45/150 = 0.3, and for a 95% confidence level z ≈ 1.96.
So, the 95% confidence interval for the population proportion is [0.228, 0.372].
Problem 8.15. A study on the effectiveness of a new drug finds that out of 200
patients, 50 show significant improvement. Estimate the proportion of patients
who show improvement and find a 90% confidence interval for this proportion.
The sample proportion is p̂ = 50/200 = 0.25. So, the 90% confidence interval for the proportion is [0.196, 0.304].
Python Code

import numpy as np
import scipy.stats as stats

# Given data
n = 200  # total number of patients
x = 50   # number of patients showing improvement

# Sample proportion
p_hat = x / n

# Confidence level
confidence_level = 0.90
alpha = 1 - confidence_level

# Standard error
se = np.sqrt((p_hat * (1 - p_hat)) / n)

# Z-score for the given confidence level
z_score = stats.norm.ppf(1 - alpha / 2)

# Margin of error
margin_of_error = z_score * se

# Confidence interval
lower_bound = p_hat - margin_of_error
upper_bound = p_hat + margin_of_error

# Output results
print(f"Estimated proportion of patients showing improvement: {p_hat:.3f}")
print(f"90% confidence interval: ({lower_bound:.3f}, {upper_bound:.3f})")

Listing 8.4: The 90% Confidence Interval for the Population Proportion.
To compute the 99% confidence interval for the proportion, use the formula:

    p̂ ± z √( p̂(1 − p̂)/n )

where n = 80 and z ≈ 2.576 for a 99% confidence level.
So, the 99% confidence interval for the proportion is [0.196, 0.504].
8.6 Sample Size Estimation
Sample size estimation is crucial for ensuring that studies and experiments
in business are designed to provide reliable and valid results. This section
covers sample size estimation for estimating a population mean and a population
proportion, including business-related examples.
    n = (z × σ / E)²     (8.4)

where:

• z = z_{1−α/2} is the critical value (see Table 8.1) from the standard
  normal distribution corresponding to a confidence level of 1 − α,
Calculate the sample size:

    n = (1.96 × 50/10)² = (98/10)² = (9.8)² = 96.04

Rounding up, the required sample size is 97.
Python Code

import math
import scipy.stats as stats

def required_sample_size(margin_of_error, standard_deviation, confidence_level):
    # Sample size for estimating a mean, using equation (8.4)
    z = stats.norm.ppf(1 - (1 - confidence_level) / 2)
    n = (z * standard_deviation / margin_of_error) ** 2
    return math.ceil(n)  # round up to the next whole number

# Example usage
margin_of_error = 10        # Desired margin of error ($)
standard_deviation = 50     # Population standard deviation ($)
confidence_level = 0.95     # Confidence level

print(required_sample_size(margin_of_error, standard_deviation, confidence_level))
To use equation (8.4), we need a value for the population standard deviation
σ. Even if we do not know σ, we can still use equation (8.4) if we have a
preliminary or planning value for it. Here are some practical ways to find this value:
3. Use judgment or a “best guess” for the value of σ. For example, estimate
the largest and smallest values in the population. The difference between
these values gives you a range. A common suggestion is to divide this
range by 4 to get a rough estimate of the standard deviation, which can
then serve as the value for σ.
Problem 8.18. A market research firm wants to estimate the average monthly
revenue generated by small businesses. A pilot study estimated the standard
deviation of revenue to be $15,000. To ensure a margin of error of $2,000 with
a 90% confidence level, determine the required sample size.
Solution: With z ≈ 1.645, σ = 15,000, and E = 2,000,

    n = (1.645 × 15000/2000)² = (24675/2000)² = (12.3375)² ≈ 152.2

Rounding up, the required sample size is 153.
Solution: It is given,
s = 8 hours, E = 2 hours,
Problem 8.20. The range for a set of data is estimated to be 36.
(a). What is the planning value for the population standard deviation?
(b). At 95% confidence, how large a sample would provide a margin of error
of 3?
(c). At 95% confidence, how large a sample would provide a margin of error
of 2?
Solution: (a.) The estimated population standard deviation (σ) can be calculated using
the range:

    σ ≈ Range/4

Given that the range is 36:

    σ ≈ 36/4 = 9
(b.) The formula for the sample size (n) is:

    n = (z × σ / E)²
Where:
• z is the Z-score for 95% confidence (Z ≈ 1.96)
Using σ = 9 and E = 3:

    n = (1.96 × 9/3)² = (17.64/3)² = (5.88)² ≈ 34.57

Rounding up, we find:

    n ≈ 35
(c.) Using the same formula with E = 2:

    n = (1.96 × 9/2)² = (17.64/2)² = (8.82)² ≈ 77.79

Rounding up, we find:

    n ≈ 78
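Parts (b) and (c) can be reproduced with a small helper function (the name `sample_size` is illustrative); `math.ceil` performs the rounding-up step.

```python
import math
import scipy.stats as stats

def sample_size(sigma, margin_of_error, confidence_level=0.95):
    # n = (z * sigma / E)^2, rounded up to the next whole number
    z = stats.norm.ppf(1 - (1 - confidence_level) / 2)
    return math.ceil((z * sigma / margin_of_error) ** 2)

sigma = 36 / 4  # planning value for sigma from the range
print(sample_size(sigma, 3))  # margin of error 3 -> 35
print(sample_size(sigma, 2))  # margin of error 2 -> 78
```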
For estimating a population proportion, the required sample size is

    n = z² p(1 − p) / E²

where p is a planning value for the population proportion and E is the desired
margin of error.
Solution: Given,
When estimating the mean or proportion of a population, it is important to
consider whether the population is finite or small (generally under 5,000). In
this case, the standard sample size formulas may overestimate the needed sam-
ple size. Adjustments are made to account for the limited number of individuals
available, ensuring that the sample more accurately represents the population.
The formula for estimating the sample size n when the population size N is
known is given by:
    n = n₀ / (1 + n₀/N)
where,
• n: The final sample size adjusted for the finite population.
Example:
To illustrate the process of estimating the sample size for a finite population,
consider the following scenario: Suppose you are conducting a study to assess
the prevalence of a certain health condition within a population of residents in
a small town. You have determined the following parameters for your study:
    n₀ = z² · p(1 − p) / E²

Substituting z = 1.96 (95% confidence), a planning value p = 0.5, and a margin
of error E = 0.05 gives:

    n₀ = (1.96² × 0.5 × 0.5) / 0.05² = 0.9604/0.0025 ≈ 384.16

Thus, the initial sample size n₀ is approximately 384.
Remark 8.6.1. When calculating the initial sample size n0 , we usually round
to the nearest whole number since we cannot survey a fraction of a person. The
method of rounding can depend on the specific context or research guidelines.
Both rounding to the nearest whole number and rounding up are acceptable, as
long as the reasoning behind the choice is clear.
Step 2: Adjust for Finite Population
Next, we apply the finite population correction to determine the adjusted sam-
ple size n:
    n = n₀ / (1 + n₀/N)

Substituting n₀ = 384 and N = 1000:

    n = 384 / (1 + 384/1000) = 384/1.384 ≈ 277.46

Rounding up, the required sample size is 278.
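The two-step calculation can be sketched in Python; the planning values below (p = 0.5, E = 0.05, z for 95% confidence, N = 1000) are those assumed in the example.

```python
import math
import scipy.stats as stats

N = 1000                    # population size
p, E = 0.5, 0.05            # planning proportion and margin of error
z = stats.norm.ppf(0.975)   # critical value for 95% confidence

n0 = z**2 * p * (1 - p) / E**2  # initial sample size
n = n0 / (1 + n0 / N)           # finite population correction
print(round(n0), math.ceil(n))
```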
3. A teacher measures the effectiveness of a new teaching method. In a sam-
ple of 40 students, the average test score was 78 with a known population
standard deviation of 10. Construct a 90% confidence interval for the
mean test score of all students taught with this method.
4. A pharmaceutical company is evaluating the effectiveness of a new drug.
A sample of 50 patients reported an average improvement score of 8.5 on
a health scale, with a population standard deviation of 2. Create a 95%
confidence interval for the mean improvement score of all patients taking
the drug.
5. A sociologist studies the average number of hours people spend on social
media. In a sample of 30 individuals, the average time spent was 3.2 hours
with a known population standard deviation of 1.1 hours. Construct a
99% confidence interval for the mean time spent on social media by the
population.
6. A horticulturist studies the effect of a new watering technique on plant
growth. The growth measurements (in cm) of a sample of 10 plants are:
11.8, 14.2, 15.0, 13.6, 12.3, 16.1, 14.4, 13.8, 15.7, 12.5
(i). Calculate the sample mean (x̄) and sample standard deviation (s) of
the growth measurements.
(ii). Using a 95% confidence level, construct a confidence interval for the
mean growth of the plants.
(iii). If the horticulturist wants to increase the confidence level to 99%,
how would that affect the width of the confidence interval? Calculate
the new confidence interval.
(iv). If the growth measurements were to decrease by 1.5 cm for each
plant due to an adverse weather condition, how would this change
the sample mean and the confidence interval?
7. A survey of 200 students revealed that 120 of them prefer online classes to
in-person classes. Construct a 95% confidence interval for the proportion
of all students who prefer online classes.
8. In a clinical trial, 150 patients were given a new treatment, and 90 re-
ported improvement in their condition. Calculate a 99% confidence inter-
val for the proportion of patients who respond positively to the treatment.
9. A marketing firm conducted a survey of 500 customers and found that 275
are satisfied with their services. Using a 90% confidence level, construct
a confidence interval for the proportion of all customers who are satisfied.
10. A poll conducted on 1,000 voters shows that 430 support a certain candidate.
Determine the 95% confidence interval for the proportion of all
voters who support this candidate.
11. In a study of cholesterol levels in a population, the following measure-
ments (in mg/dL) were recorded: 210, 220, 230, 240, 250, 260. Calculate
the variance and standard deviation of the cholesterol levels. Also, com-
pute a 90% confidence interval for the variance.
12. A researcher wants to estimate the proportion of voters who support a
new policy with a margin of error of 0.05 and a confidence level of 95%.
If a previous study found that 60% of voters support the policy, calculate
the required sample size.
13. A company aims to determine the average time (in minutes) customers
spend on their website. They want to estimate the mean with a margin of
Critical values χ²_α of the chi-square distribution with ν degrees of freedom

ν  χ².995  χ².99  χ².975  χ².95  χ².90  χ².75  χ².50  χ².25  χ².10  χ².05  χ².025  χ².01  χ².005  χ².001
1 0.00 0.00 0.00 0.00 0.02 0.10 0.45 1.32 2.71 3.84 5.02 6.63 7.88 10.83
2 0.01 0.02 0.05 0.10 0.21 0.58 1.39 2.77 4.61 5.99 7.38 9.21 10.60 13.81
3 0.07 0.12 0.22 0.35 0.58 1.21 2.37 4.11 6.25 7.81 9.35 11.34 12.84 16.27
4 0.21 0.30 0.48 0.71 1.06 1.92 3.36 5.39 7.78 9.49 11.14 13.28 14.86 18.47
5 0.41 0.55 0.83 1.15 1.61 2.67 4.35 6.63 9.24 11.07 12.83 15.09 16.75 20.52
6 0.68 0.87 1.24 1.64 2.20 3.45 5.35 7.84 10.64 12.59 14.45 16.81 18.55 22.46
7 0.99 1.24 1.69 2.17 2.83 4.25 6.35 9.04 12.02 14.07 16.01 18.48 20.28 24.32
8 1.34 1.65 2.18 2.73 3.49 5.07 7.34 10.22 13.36 15.51 17.53 20.09 21.95 26.12
9 1.73 2.09 2.70 3.33 4.17 5.90 8.34 11.39 14.68 16.92 19.02 21.67 23.59 27.88
10 2.16 2.56 3.25 3.94 4.87 6.74 9.34 12.55 15.99 18.31 20.48 23.21 25.19 29.59
11 2.60 3.05 3.82 4.57 5.58 7.58 10.34 13.70 17.28 19.68 21.92 24.72 26.76 31.26
12 3.07 3.57 4.40 5.23 6.30 8.44 11.34 14.85 18.55 21.03 23.34 26.22 28.30 32.91
13 3.57 4.11 5.01 5.89 7.04 9.30 12.34 15.98 19.81 22.36 24.74 27.69 29.82 34.53
14 4.07 4.66 5.63 6.57 7.79 10.17 13.34 17.12 21.06 23.68 26.12 29.14 31.32 36.12
15 4.60 5.23 6.27 7.26 8.55 11.04 14.34 18.25 22.31 25.00 27.49 30.58 32.80 37.70
16 5.14 5.81 6.91 7.96 9.31 11.91 15.34 19.37 23.54 26.30 28.85 32.00 34.27 39.25
17 5.70 6.41 7.56 8.67 10.09 12.79 16.34 20.49 24.77 27.59 30.19 33.41 35.72 40.79
18 6.26 7.01 8.23 9.39 10.86 13.68 17.34 21.60 25.99 28.87 31.53 34.81 37.16 42.31
19 6.84 7.63 8.91 10.12 11.65 14.56 18.34 22.72 27.20 30.14 32.85 36.19 38.58 43.82
20 7.43 8.26 9.59 10.85 12.44 15.45 19.34 23.83 28.41 31.41 34.17 37.57 40.00 45.32
21 8.03 8.90 10.28 11.59 13.24 16.34 20.34 24.93 29.62 32.67 35.48 38.93 41.40 46.80
22 8.64 9.54 10.98 12.34 14.04 17.24 21.34 26.04 30.81 33.92 36.78 40.29 42.80 48.27
23 9.26 10.20 11.69 13.09 14.85 18.14 22.34 27.14 32.01 35.17 38.08 41.64 44.18 49.73
24 9.89 10.86 12.40 13.85 15.66 19.04 23.34 28.24 33.20 36.42 39.36 42.98 45.56 51.18
25 10.52 11.52 13.12 14.61 16.47 19.94 24.34 29.34 34.38 37.65 40.65 44.31 46.93 52.62
26 11.16 12.20 13.84 15.38 17.29 20.84 25.34 30.43 35.56 38.89 41.92 45.64 48.29 54.05
27 11.81 12.88 14.57 16.15 18.11 21.75 26.34 31.53 36.74 40.11 43.19 46.96 49.64 55.48
28 12.46 13.56 15.31 16.93 18.94 22.66 27.34 32.62 37.92 41.34 44.46 48.28 50.99 56.89
29 13.12 14.26 16.05 17.71 19.77 23.57 28.34 33.71 39.09 42.56 45.72 49.59 52.34 58.30
30 13.79 14.95 16.79 18.49 20.60 24.48 29.34 34.80 40.26 43.77 46.98 50.89 53.67 59.70
40 20.71 22.16 24.43 26.51 29.05 33.66 39.34 45.62 51.81 55.76 59.34 63.69 66.77 73.40
50 27.99 29.71 32.36 34.76 37.69 42.94 49.33 56.33 63.17 67.50 71.42 76.15 79.49 86.66
60 35.53 37.48 40.48 43.19 46.46 52.29 59.33 66.98 74.40 79.08 83.30 88.38 91.95 99.61
70 43.28 45.44 48.76 51.74 55.33 61.70 69.33 77.58 85.53 90.53 95.02 100.42 104.22 112.32
80 51.17 53.54 57.15 60.39 64.28 71.14 79.33 88.13 96.58 101.88 106.63 112.33 116.32 124.84
90 59.20 61.75 65.65 69.13 73.29 80.62 89.33 98.64 107.56 113.14 118.14 124.12 128.30 137.21
100 67.33 70.06 74.22 77.93 82.36 90.13 99.33 109.14 118.50 124.34 129.56 135.81 140.17 149.45
Chapter 9

Hypothesis Testing for Decision Making
9.1 Introduction
In data science, hypothesis testing is a fundamental aspect of statistical in-
ference, providing a systematic method for making decisions about population
parameters based on sample data. This chapter aims to equip readers with a
thorough understanding of hypothesis testing, its importance, and its applica-
tion in various statistical scenarios.
Finally, we cover the methods for estimating the sample size required for
mean and proportion tests, highlighting the importance of adequate sample
sizes for achieving meaningful and accurate conclusions. The chapter concludes
with a set of exercises designed to reinforce the concepts and techniques dis-
cussed, providing practical experience in applying hypothesis testing methods.
CHAPTER 9. HYPOTHESIS TESTING FOR DECISION MAKING
the old website (Version A) and the other half are exposed to the new design
(Version B).
After running the test for several weeks, you gather data on the number of
purchases made by visitors from both versions of the website. Your goal is to
determine whether the new design genuinely results in more sales compared to
AF
the old version or if any observed differences are attributable to random chance.
The term “null” in “null hypothesis” reflects the concept of a “null effect”
or “no effect.” It serves as a baseline assumption that there is no significant
difference, relationship, or effect between the variables under investigation. Es-
sentially, the null hypothesis is the starting point for statistical testing, positing
that any observed differences or effects are due to random chance rather than
a true underlying effect. If the evidence is robust enough to reject the null
hypothesis, it suggests the presence of a significant effect or relationship that
merits further investigation.
Hypothesis: A hypothesis is an assertion or assumption regarding popu-
lation parameters or characteristics of random variables.
• Null Hypothesis (H0 ): The null hypothesis asserts that there
is no effect or difference between the groups or variables being
studied.
cholesterol level of children whose fathers died from heart disease is significantly
higher than this benchmark. In this case, the hypotheses are as follows:
Step 3: Collect data, define a test statistic, and then compute it.
Step 4: Find the critical value from the
■ standard normal table for Z-test
■ standard t-table for T-test
■ F-table for F-test
■ χ²-table for χ²-test, etc.
for the level of significance α, or compute the p-value, and then define the decision rule.
Statistical Hypothesis
A null hypothesis H₀ is typically:
• an assumption, nothing new, a given,
• the negation of the research aim.
Rejecting H₀ when it is true amounts to rejection of a true assumption. The
null hypothesis is usually stated as

    H₀ : θ = θ₀
In two-sided tests, the hypothesis might state that there is a difference in either
direction
H₁ : θ ≠ θ₀
The significance level α is the probability of a Type I error, that is, of concluding that an effect
exists when it does not. This chosen α level reflects the researcher's tolerance
for making an erroneous decision. A lower α means a stricter criterion for
rejecting H₀, which reduces the risk of a Type I error but may increase the risk
of a Type II error—failing to reject a false null hypothesis. This threshold thus
sets the stage for evaluating the test results, ensuring that decisions align with
the predetermined risk level.
Types of error
                                In Reality
Decision            H₀ is TRUE            H₀ is FALSE
Accept H₀           Correct Decision      Type II error
Reject H₀           Type I error          Correct Decision
9.3.3 Test Statistics
The test statistic provides a standardized value that is calculated from sample
data during a hypothesis test. It quantifies the discrepancy between the ob-
served sample data and the null hypothesis. The test statistic is a random
variable because it is derived from random sample data, and it follows a prob-
ability distribution. In general, the test statistic for the location parameter test
defined in null hypothesis H0 : θ = θ0 is
    Test statistic = (θ̂ − θ₀) / se(θ̂)

where θ is a location parameter; it could be a mean, median, quantile, etc.
The acceptance region is the range of values for which we fail to reject the null
hypothesis H₀. The rejection region is the range of values for which we reject H₀.
The critical value z_cri defines the boundary between these regions. Figure 9.1,
Figure 9.2, and Figure 9.3 show the acceptance and rejection regions for a
left-tailed test, a right-tailed test, and a two-tailed test, respectively, at a
significance level of α.
Figure 9.1: Acceptance and Rejection Regions for a Left-Sided Test at α = 0.05.

Figure 9.2: Acceptance and Rejection Regions for a Right-Sided Test at α = 0.05.
Figure 9.3: Acceptance and Rejection Regions for a Two-Sided Test at α = 0.05.
The graphs above represent the standard normal distribution (for Z-test),
which is a common distribution used in hypothesis testing. The curve shown is
the probability density function of the standard normal distribution, N (0, 1),
which is symmetric about the mean µ = 0.
• Critical Value (zcri ): Critical values help determine whether to reject
the null hypothesis H0 based on the significance level α and test type.
Critical-Value Method: Reject H₀ if the computed test statistic is more extreme
than the critical value; otherwise, fail to reject H₀.
The p-value: The p-value represents the probability of observing the test
statistic as extreme as, or more extreme than, the value observed if the
null hypothesis is true. A small p-value (less than α) indicates that the
observed data is unlikely under the null hypothesis, leading us to reject
H0 .
Figure 9.4: The p-value is shown as the shaded area.
• If 0.01 ≤ p-value < 0.05, then the results are statistically significant.
• If 0.001 ≤ p-value < 0.01, then the results are highly significant.
• If p-value < 0.001, then the results are very highly significant.
• If p-value > 0.05, then the results are considered not statistically sig-
nificant (sometimes denoted by NS).
However, if 0.05 < p-value < 0.10, then a trend toward statistical significance
is sometimes noted.
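These verbal conventions can be wrapped in a small helper function (the function name and labels below are illustrative, not a standard library routine):

```python
def significance_label(p_value):
    # Map a p-value to the conventional verbal label described above
    if p_value < 0.001:
        return "very highly significant"
    if p_value < 0.01:
        return "highly significant"
    if p_value < 0.05:
        return "statistically significant"
    if p_value < 0.10:
        return "trend toward statistical significance"
    return "not statistically significant (NS)"

print(significance_label(0.003))  # -> highly significant
```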
• One-Sample Test of Means: This test assesses whether the mean of
a single sample differs from a known or hypothesized population mean.
It is useful when we want to determine if the sample mean is consistent
with a particular value.
• Testing Equality of Two Means: Here, we compare the means
of two independent samples to evaluate whether there is a significant
difference between them. This is crucial when examining the effects of
different treatments or conditions on two separate groups.
Each of these tests plays a critical role in statistical inference, helping
researchers and analysts make data-driven decisions and understand underlying
patterns in the data.
Formulating Hypotheses
Before conducting the test, researchers establish two hypotheses:
H₀ : µ = µ₀ vs. Hₐ : µ ≠ µ₀
Assumptions
Before performing the test, several key assumptions must be considered:
1. Normality: The sample data should be approximately normally dis-
tributed, particularly if the sample size is small (less than 30).
2. Independence: Each observation within the sample must be indepen-
dent of others.
where

    s = √( (1/(n − 1)) Σᵢ₌₁ⁿ (xᵢ − x̄)² )
Critical-Value Method
Next, researchers determine a critical value using statistical tables based on the
chosen significance level (commonly set at 0.05). This value aids in deciding
whether to reject the null hypothesis. Alternatively, the p-value is calculated to
indicate the probability of observing the results if the null hypothesis is true.
significance level α, the critical value is denoted as zα (for the upper tail) or
−zα (for the lower tail). For example:
Suppose the realized value of the test statistic (Z) is z_cal. Then,
• for a right-sided test, reject H₀ if and only if z_cal ≥ z_α;
• for a left-sided test, reject H₀ if and only if z_cal ≤ −z_α;
• for a two-sided test, reject H₀ if and only if z_cal ≥ z_{α/2} or z_cal ≤ −z_{α/2}.
• If the p-value is less than or equal to the significance level (α), reject
the null hypothesis H0 .
To compute the p-value for the Z-test, we can use the following formulas:
• For a two-tailed test: p-value = 2[1 − Φ(|zcal |)].
• For a one-tailed test (greater than): p-value = 1 − Φ(zcal ).
• For a one-tailed test (less than): p-value = Φ(zcal ).
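These p-value formulas can be evaluated directly with SciPy's standard normal CDF Φ; the realized z value below is illustrative, not taken from a particular problem:

```python
from scipy.stats import norm

z = 1.79  # an illustrative realized value of the Z statistic

# Two-tailed test: p = 2 * [1 - Phi(|z|)]
p_two = 2 * (1 - norm.cdf(abs(z)))

# One-tailed test (greater than): p = 1 - Phi(z)
p_upper = 1 - norm.cdf(z)

# One-tailed test (less than): p = Phi(z)
p_lower = norm.cdf(z)

print(round(p_two, 4), round(p_upper, 4), round(p_lower, 4))
```

Note that the two-tailed p-value is exactly twice the smaller one-tailed p-value, by the symmetry of the normal distribution.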
Solution
To test whether the production time on Sunday morning is significantly longer,
a sample of size n = 12 was taken, yielding a sample mean x̄ = 92.2. It
is assumed that the production time is normally distributed with a known
variance σ 2 = 144.
We start by setting up the hypothesis test:
Alternatively, we can also use the p-value method. The p-value is calculated as:
p-value = Pr(Z > 0.9237) = 0.1778.
Since the p-value is greater than 0.05, we do not reject the null hypothesis
H0 at the 5% significance level. This confirms that there is not enough
evidence to suggest that production time is longer on Sunday morning.
Problem 9.2. (Cardiovascular Disease) We want to compare fasting serum-
cholesterol levels between recent Asian immigrants and the general U.S. popula-
tion. Cholesterol levels for U.S. women aged 21-40 are normally distributed with
a mean of 190 mg/dL. For recent female Asian immigrants, with levels assumed
to be normally distributed but with an unknown mean µ, we test H0 : µ = 190
against H1 : µ ̸= 190. Blood tests from 200 female Asian immigrants show a
mean of 181.52 mg/dL and a standard deviation of 40. What conclusions can
we draw from this data?
Solution
To test whether the mean cholesterol level differs from 190, we set up the
following hypotheses:
H0 : µ = 190 vs. H1 : µ ̸= 190.
Assuming that X follows a normal distribution, i.e., X ∼ N (µ, σ 2 ), with a
sample mean x̄ = 181.52, sample standard deviation s = 40, and sample size
n = 200, we proceed with the following steps.
First, we set the significance level to α = 0.05. The test statistic is calculated
as:
zcal = (x̄ − µ0 )/(s/√n) = (181.52 − 190)/(40/√200) = −3.00.
For a significance level of α = 0.05, the critical value from the z-distribution
is zα/2 = z0.025 = 1.96. Since |zcal | > 1.96, we reject the null hypothesis
H0 : µ = 190 at the 5% significance level. Thus, we conclude that the mean
cholesterol level of recent Asian immigrants is significantly different from that
of the general U.S. population.
Alternatively, we compute the p-value for the test statistic. The p-value
is given by:
p-value = 2 Pr(Z > |zcal |) = 2 Pr(Z > 3.00) = 0.0027.
Since the p-value is less than α = 0.05, we reject the null hypothesis
H0 : µ = 190 at the 5% level of significance. Thus, we conclude that the
mean cholesterol level of recent Asian immigrants is significantly different
from that of the general U.S. population.
Problem 9.3. (Obstetrics) We wish to determine whether the mean birthweight
in a low-socioeconomic-status (SES) hospital is lower than the na-
tional average. From 100 consecutive full-term births in a low-SES area, the
average birthweight is 115 oz with a standard deviation of 24 oz. Nationally,
the mean birthweight is 120 oz. Can we conclude that the mean birthweight in
this hospital is lower than the national average?
Solution
To determine whether the average birthweight is significantly less than 120, we
set up the following hypotheses:
H0 : µ = 120 vs. H1 : µ < 120.
First, we set the significance level to α = 0.05. Since the sample size in this
case is large (n = 100), we could use either a T -test or a Z-test. However, we
typically use the Z-test in this situation. Therefore, the realized value of test
statistic is calculated as:
zcal = (x̄ − µ0 )/(s/√n) = (115 − 120)/(24/√100) = −2.08.
For a one-tailed (left-sided) test at α = 0.05, the critical value is −z0.05 = −1.645.
Since zcal = −2.08 < −1.645, we reject the null hypothesis H0 : µ =
120 at the 5% level of significance. Therefore, we conclude that the average
birthweight in this hospital is lower than the national average.
Problem 9.4. From long-term experience, a factory owner knows that a
worker can produce a product in an average time of 89 min. However, on Monday
morning, there is the impression that it takes longer. To test whether this
impression is correct, he collects a sample of size n = 12, and it is
found that x̄ = 92.2 and s = 10.75. Test his claim.
Solution
We assume that the production time follows a normal distribution. To verify
whether the impression that it takes longer on Monday morning is correct, we
conduct a hypothesis test at a significance level of 5%.
H0 : µ = 89
H1 : µ > 89
Assuming X follows a normal distribution with an unknown variance σ 2 ,
the test statistic is given by
T = (x̄ − µ0 )/(s/√n) ∼ t(n−1) under H0 .
We use the significance level α to find the critical value tα,(n−1) . From
the Student’s t-distribution table (Table A6 in the Appendix), we find that
tα,(n−1) = t0.05,11 = 1.795.
The decision rule is to reject H0 if tcal > tα,(n−1) = 1.795. Given that
n = 12, x̄ = 92.2, and s = 10.75, the calculated test statistic is
tcal = (92.2 − 89)/(10.75/√12) = 1.0312.
Since 1.0312 < 1.795, we cannot reject H0 at the 5% significance level.
Therefore, there is insufficient evidence to conclude that it takes longer to pro-
duce on Monday morning.
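This one-sample t-test can be carried out from the summary statistics with `scipy.stats.t`; a minimal sketch:

```python
from scipy.stats import t

# Summary statistics from Problem 9.4
x_bar, mu0, s, n = 92.2, 89, 10.75, 12

t_cal = (x_bar - mu0) / (s / n ** 0.5)       # realized test statistic
t_crit = t.ppf(0.95, df=n - 1)               # upper 5% point of t with 11 df
p_value = 1 - t.cdf(t_cal, df=n - 1)         # one-sided (greater than) p-value

print(f"t = {t_cal:.4f}, critical value = {t_crit:.3f}, p = {p_value:.3f}")
```

Since the realized statistic falls below the critical value, the code reaches the same conclusion as the hand calculation: H0 is not rejected.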
Solution
We set up the hypothesis test with
p-value = Pr(T ≥ |tcal | | H0 ) = Pr(T ≥ | − 2.55|) = 0.019.
Since the p-value is less than α = 0.05, we reject the null hypothesis H0 .
That is, the evidence suggests that the mean birthweight from this hospital is
lower than the national average.
Solution
To test whether the mean cholesterol level is significantly different from 175,
we set up the following hypotheses:
H0 : µ = 175 vs. H1 : µ ̸= 175.
First, we set the significance level to α = 0.05. The test statistic is calculated
as follows:
tcal = (x̄ − µ0 )/(s/√n) = (200 − 175)/(50/√10) = 1.58.
Next, from the t-distribution table, the critical value for t9,0.95 is 1.833.
Since the calculated test statistic |tcal | = 1.58 is less than the critical value
t9,0.95 = 1.833, we do not reject the null hypothesis H0 : µ = 175 at the 5%
significance level. Thus, we conclude that the mean cholesterol level of these
children does not significantly differ from that of an average child.
p-value = 2 Pr(T ≥ |tcal | | H0 ) = 2 Pr(T ≥ 1.58) = 0.148. Since the p-value is greater than α = 0.05, we do not reject H0 .
Solution
To test whether the mean pulse rate differs from 72, we set up the following
hypotheses:
H0 : µ = 72 vs. H1 : µ ̸= 72.
Assuming that X follows a normal distribution, i.e., X ∼ N (µ, σ 2 ), with a
sample mean x̄ = 80, sample standard deviation s = 20, and sample size
n = 20, we perform the following steps.
First, we set the significance level to α = 0.05. The test statistic is calculated
as:
tcal = (x̄ − µ0 )/(s/√n) = (80 − 72)/(20/√20) = 1.79.
For a two-tailed test with an α = 0.05 and 19 degrees of freedom, the critical
t-values can be found using a t-distribution table or calculator. The critical t-
value is approximately 2.093. Since |tcal | = 1.79 < 2.093, we do not reject H0 : µ = 72.
This indicates that there is not enough statistical evidence to conclude that the
mean pulse rate in patients with hyperthyroidism is significantly different from
that in healthy adults.
Alternatively, we compute the p-value for the test statistic. The p-value
is given by:
p-value = 2 Pr(T19 ≥ 1.79) = 0.0894.
(Note: The p-value can be obtained using the function TDIST(1.79, 19, 2)
in Excel, which gives 0.0894.)
Since the p-value is greater than α = 0.05, we do not reject the null
hypothesis H0 : µ = 72. Thus, there is insufficient evidence to conclude
that the mean pulse rate in hyperthyroidism patients differs from that in
healthy adults.
9.6.2 Testing Equality of Two Means
Testing the equality of two means is a common procedure in hypothesis testing,
used to determine if there is a significant difference between the means of two
populations. This can be done using various methods depending on the nature
of the samples.
■ Right-Tailed Test:
H1 : µ1 > µ2
■ Left-Tailed Test:
H1 : µ1 < µ2
where
sp² = [(n1 − 1)s1² + (n2 − 1)s2²] / (n1 + n2 − 2)
Here:
ν = n1 + n2 − 2.
Solution
We begin by setting up the hypotheses. Let µ1 and µ2 represent the mean
systolic blood pressures of OC users and non-OC users, respectively. The hy-
potheses are:
H0 : µ1 = µ2 or µ1 − µ2 = 0
H1 : µ1 ̸= µ2 or µ1 − µ2 ̸= 0
We assume the difference x̄1 − x̄2 follows a normal distribution with mean
0 and variance σ 2 . The estimated variance s2 is given by:
where n1 and n2 are the sample sizes, and s21 and s22 are the sample variances.
For our data:
tcal = (132.86 − 127.44)/√( 307.18 (1/8 + 1/21) ) = 0.74
For a two-tailed test at the 0.05 significance level with degrees of freedom
27, the right-side critical t-value is approximately 2.052. Since |tcal | =
0.74 < 2.052, we do not reject H0 . This indicates that there is not enough
statistical evidence to conclude that the mean systolic blood pressure between
oral contraceptive users and non-users is significantly different.
Alternatively, the p-value is calculated as p-value = 2 Pr(T27 ≥ |tcal |), which is
greater than 0.05, leading to the same conclusion.
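The pooled two-sample comparison can be reproduced from the summary statistics; this sketch takes n1 = 8 and n2 = 21 as read from the computation above and uses `scipy.stats.t` for the critical value:

```python
from math import sqrt
from scipy.stats import t

# Summary statistics: group means, pooled variance, sample sizes
x1_bar, x2_bar = 132.86, 127.44
sp2, n1, n2 = 307.18, 8, 21

t_cal = (x1_bar - x2_bar) / sqrt(sp2 * (1 / n1 + 1 / n2))
df = n1 + n2 - 2
t_crit = t.ppf(0.975, df)  # two-sided critical value at alpha = 0.05

print(f"t = {t_cal:.2f}, critical value = {t_crit:.3f}")  # t = 0.74, critical value = 2.052
```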
Solution
The hypotheses for the test are as follows:
Test Statistic: To determine the test statistic for the two-sample t-test, we
use the following formula:
T = (x̄1 − x̄2 ) / √( s1²/n1 + s2²/n2 )
We have,
impact on sales.
■ Right-Tailed Test:
H1 : µd > 0
■ Left-Tailed Test:
H1 : µd < 0
Test Statistic
Calculate the test statistic t using the formula:
T = d̄ / (sd /√n)
where d¯ is the mean difference, sd is the standard deviation of the differences,
and n is the number of pairs.
• For a two-tailed test, reject H0 if |t| is greater than the critical value.
For a one-tailed test, reject H0 if t falls in the direction specified by
Ha .
Problem 9.10. A nutritionist wants to evaluate the effectiveness of a new
dietary supplement in reducing cholesterol levels. A sample of 10 participants
had their cholesterol levels measured before and after taking the supplement for
a month. The cholesterol levels (in mg/dL) before and after the treatment are
provided as follows:
Before | After
220 | 210
240 | 230
215 | 205
230 | 215
250 | 240
210 | 200
225 | 210
240 | 220
235 | 225
245 | 230
Solution
To determine whether the dietary supplement significantly affects cholesterol
levels, we use a paired t-test.
Hypotheses:
• Null Hypothesis (H0 ): There is no difference in cholesterol levels
before and after the treatment, i.e., µd = 0.
• Alternative Hypothesis (H1 ): The treatment changes cholesterol levels, i.e., µd ̸= 0.
First, we calculate the differences between the before and after measure-
ments for each participant:
Differences = [10, 10, 10, 15, 10, 10, 15, 20, 10, 15]
Participant | Before | After | Difference (di )
1 | 220 | 210 | 10
2 | 240 | 230 | 10
3 | 215 | 205 | 10
4 | 230 | 215 | 15
5 | 250 | 240 | 10
6 | 210 | 200 | 10
7 | 225 | 210 | 15
8 | 240 | 220 | 20
9 | 235 | 225 | 10
10 | 245 | 230 | 15
sd = √( Σ(di − d̄)² / (n − 1) ) = √( [(10 − 12.5)² + (10 − 12.5)² + · · · + (15 − 12.5)²]/9 ) = 3.5355
tcal = d̄/(sd /√n) = 12.5/(3.5355/√10) = 12.5/1.1180 ≈ 11.1803
With n − 1 = 9 degrees of freedom, we compare the calculated t-value to
the critical value from the t-distribution table at the 5% significance level. For
a two-tailed test with 9 degrees of freedom, the critical value is approximately
2.262.
Since 11.1803 exceeds 2.262, we reject the null hypothesis. At the 5% signif-
icance level, there is sufficient evidence to conclude that the dietary supplement
has a significant effect on reducing cholesterol levels.
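The same paired t-test can be run directly on the raw before/after measurements with `scipy.stats.ttest_rel`:

```python
import numpy as np
from scipy.stats import ttest_rel

# Cholesterol levels (mg/dL) for the 10 participants in Problem 9.10
before = np.array([220, 240, 215, 230, 250, 210, 225, 240, 235, 245])
after = np.array([210, 230, 205, 215, 240, 200, 210, 220, 225, 230])

t_stat, p_value = ttest_rel(before, after)  # paired (related-samples) t-test
print(f"t = {t_stat:.4f}, p-value = {p_value:.6f}")  # t = 11.1803
```

The reported statistic matches the hand calculation, and the very small p-value leads to the same rejection of H0.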
Consider, for example, a sample population, which is divided into several groups, each receiving a specific
medication over a trial period. After the trial, blood sugar levels are measured
for each participant. The mean blood sugar level is then calculated for each
group. ANOVA is used to compare these group means to determine if there are
significant differences, or if the means are statistically similar.
This method is also known as Fisher’s analysis of variance, emphasizing
its capacity to examine how a categorical variable with multiple levels affects
a continuous variable. The application of ANOVA is determined by the re-
search design. Typically, ANOVAs are employed in three main forms: one-way
ANOVA, two-way ANOVA, and N-way ANOVA. The layout of data for one-way
ANOVA is shown in the following table. The hypotheses for one-way ANOVA are:
H0 : µ1 = µ2 = · · · = µk vs. H1 : not all µi are equal.
Source of Variation | Sum of Squares | df | Mean Square | F
Between Groups | SSB | k − 1 | MSB = SSB/(k − 1) | F = MSB/MSW
Within Groups | SSW | n − k | MSW = SSW/(n − k) |
where n = Σᵏᵢ₌₁ nᵢ .
Definitions:
• Sum of Squares Between Groups (SSB):
SSB = Σᵏᵢ₌₁ nᵢ (x̄ᵢ − x̄overall )²
• Sum of Squares Within Groups (SSW):
SSW = Σᵢ Σⱼ (xij − x̄i )² = Σᵏᵢ₌₁ (ni − 1)si²
• Degrees of Freedom: k − 1 between groups and n − k within groups.
• Mean Squares:
■ Mean Square Between Groups (MSB): MSB = SSB/(k − 1)
■ Mean Square Within Groups (MSW): MSW = SSW/(n − k)
• F-Statistic:
F = MSB/MSW
Test if there are significant differences in the mean blood pressure levels among
these groups at the 5% significance level.
SSW = (n1 − 1)s1² + (n2 − 1)s2² + (n3 − 1)s3² = 4 × 9.2 + 4 × 9.3 + 4 × 2.5 = 84
Degrees of Freedom:
• Between Groups: dfbetween = k − 1 = 3 − 1 = 2
• Within Groups: dfwithin = n − k = 15 − 3 = 12
MSW = SSW/dfwithin = 84/12 = 7
F-Statistic:
F = MSB/MSW = 398.87/7 = 56.98
• Using the p-value: Since the p-value is less than 0.05, we also reject
the null hypothesis.
Test if there are significant differences in the mean productivity across the
three environments at the 5% significance level.
Solution
Step 1: Calculate Means and Overall Mean
Environment | Mean
Environment A | x̄A = (30 + 32 + 29 + 31 + 33)/5 = 31.00
Environment B | x̄B = (45 + 50 + 47 + 46 + 48)/5 = 47.20
Environment C | x̄C = (55 + 53 + 52 + 58 + 60)/5 = 55.60
Overall Mean | x̄overall = (31.00 + 47.20 + 55.60)/3 = 44.60
Environment A | Environment B | Environment C
30 | 45 | 55
32 | 50 | 53
29 | 47 | 52
31 | 46 | 58
33 | 48 | 60
Mean (x̄i ): 31 | 47.2 | 55.6
Variance (si²): 2.5 | 3.7 | 11.3
Step 2: Sum of Squares Between Groups (SSB)
SSB = 5(31 − 44.6)² + 5(47.2 − 44.6)² + 5(55.6 − 44.6)² = 5(184.96 + 6.76 + 121.00) = 1563.6
Step 3: Sum of Squares Within Groups (SSW)
Using the sample variances from the table above:
SSW = (5 − 1)(2.5) + (5 − 1)(3.7) + (5 − 1)(11.3) = 70
Steps 4–5: Mean Squares and F-Statistic
MSB = SSB/(k − 1) = 1563.6/2 = 781.8, MSW = SSW/(n − k) = 70/12 = 5.83,
F = MSB/MSW = 781.8/5.83 ≈ 134.02
Step 6: Conclusion
Compare the calculated F -value to the critical value from the F -distribution
table with k − 1 and n − k degrees of freedom at α = 0.05.
Assuming the critical value from the F -table is approximately 3.49:
• Since the calculated F -value (134.02) is greater than the critical value
(3.49), we reject the null hypothesis.
• We conclude that mean productivity differs significantly across the three
environments.
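The entire one-way ANOVA above can be verified with `scipy.stats.f_oneway` on the raw productivity data:

```python
from scipy.stats import f_oneway

# Productivity measurements for the three environments
env_a = [30, 32, 29, 31, 33]
env_b = [45, 50, 47, 46, 48]
env_c = [55, 53, 52, 58, 60]

f_stat, p_value = f_oneway(env_a, env_b, env_c)  # one-way ANOVA
print(f"F = {f_stat:.2f}, p-value = {p_value:.2e}")  # F = 134.02
```

The F statistic matches the hand calculation, and the tiny p-value leads to the same rejection of H0.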
The power of a test is the probability of rejecting a false null hypothesis. For example, if a test has a power of 0.80, it
means there is an 80% chance of detecting an effect if it is present.
Power = Φ( −zα/2 + (µ0 − µ1 )/(σ/√n) ) + Φ( −zα/2 + (µ1 − µ0 )/(σ/√n) )   (9.1)
and approximately by
Power ≈ Φ( −zα/2 + |µ1 − µ0 |/(σ/√n) )   (9.2)
Problem 9.13. (Obstetrics) Compute the power of the test for the birthweight
data in Problem 9.3 with an alternative mean of 115 ounces (oz) and α = 0.05,
assuming the true standard deviation = 24 oz.
Solution
We have µ0 = 120 oz, µ1 = 115 oz, α = 0.05, σ = 24, n = 100. Thus,
Power = Φ[ −z0.05 + (120 − 115)√100/24 ]
= Φ[ −1.645 + 5 × 10/24 ]
= Φ(0.438) = 0.669
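This power calculation is easy to script with SciPy; the sketch below uses the one-sided critical value z0.05, matching the solution above:

```python
from scipy.stats import norm

# Values from Problem 9.13
mu0, mu1, sigma, n, alpha = 120, 115, 24, 100, 0.05

z_alpha = norm.ppf(1 - alpha)  # one-sided critical value, approximately 1.645
power = norm.cdf(-z_alpha + abs(mu1 - mu0) / (sigma / n ** 0.5))
print(f"Power = {power:.3f}")  # Power = 0.669
```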
Problem 9.14. (Cardiovascular Disease, Pediatrics) Compute the power of the test for the
cholesterol data in Problem 9.6, with an alternative mean of 190 mg/dL, a null
mean of 175 mg/dL, and a standard deviation of 50 mg/dL.
Solution
We have µ0 = 175 mg/dL, µ1 = 190 mg/dL, α = 0.05, σ = 50, n = 10. Thus,
Power = Φ[ −z0.05 + (190 − 175)√10/50 ]
= Φ[ −1.645 + 15 × √10/50 ]
= Φ(−0.696) = 0.243
Therefore, the chance of finding a significant difference in this case is only 24.3%.
Problem 9.15. (Cardiovascular Disease, Pediatrics) Compute the power
of the test for the cholesterol data in Problem 9.6 with a significance level of
α = 0.01 vs. an alternative mean of 190 mg/dL.
AF
Solution
We have µ0 = 175 mg/dL, µ1 = 190 mg/dL, α = 0.01, σ = 50, n = 10. Thus,
Power = Φ[ −z0.01 + (190 − 175)√10/50 ]
= Φ[ −2.326 + 15 × √10/50 ]
= Φ(−1.377) = 0.08
which is lower than the power of 24.3% for α = 0.05, computed in Problem
9.14. What does this mean? It means that if the α level is lowered from 0.05
to 0.01, the β error will be higher or, equivalently, the power will be lower,
decreasing from 0.243 to 0.08.
(2). Effect size (d): If the alternative mean is shifted farther away from
the null mean (d = |µ0 − µ1 | increases), then the power increases.
(4). Sample size (n): If the sample size increases (n increases), then the
power increases.
• σ: Population standard deviation.
• zα/2 : Critical value from the standard normal distribution for a two-
tailed test.
• zβ : Critical value from the standard normal distribution corresponding
to the power (1 − β) of the test.
where the data are normally distributed with mean µ and known variance σ 2 .
The sample size needed to conduct a one-sided test with significance level α
and probability of detecting a significant difference with power 100(1 − β)% is
n = (zβ + zα )² σ² / (µ0 − µ1 )² = (zβ + zα )² σ² / d²
where d = µ0 − µ1 .
Problem 9.16. (Obstetrics) Consider the birthweight data in Problem 9.3.
Suppose that µ0 = 120 oz, µ1 = 115 oz, σ = 24, α = .05, 1 − β = 0.80, and we
use a one-sided test. Compute the appropriate sample size needed to conduct
the test.
Solution
Since the power is 1 − β = 0.80, we have β = 0.20. Therefore, the sample size is
n = (zβ + zα )² σ²/(µ0 − µ1 )² = (0.84 + 1.645)² × 24²/5² = 142.3 ≈ 143.
The sample size is always rounded up, so we can be sure to achieve at least the
required level of power (in this case, 80%). Thus, a sample size of 143 is needed
to have an 80% chance of detecting a significant difference at the 5% level if
the alternative mean is 115 oz and a one-sided test is used.
Problem 9.17. (Cardiovascular Disease, Pediatrics) Consider the choles-
terol data in Problem 9.6. Suppose the null mean is 175 mg/dL, the alternative
mean is 190 mg/dL, the standard deviation is 50, and we wish to conduct a one-
sided significance test at the 5% level with a power of 90%. How large should
the sample size be?
Solution
Since the power is 1 − β = 0.90, we have β = 0.10. Therefore, the sample size is
n = (zβ + zα )² σ²/(µ0 − µ1 )² = (1.28 + 1.645)² × 50²/15² = 95.1 ≈ 96.
H0 : µ = µ0 vs. H1 : µ = µ1
where the data are normally distributed with mean µ and known variance σ 2 .
The sample size needed to conduct a two-sided test with significance level α
and probability of detecting a significant difference with power 100(1 − β)% is
n = (zβ + zα/2 )² σ² / (µ0 − µ1 )²
Note that this sample size is always larger than the corresponding sample
size for a one-sided test, because zα/2 is larger than zα .
Solution
We assume α = 0.05 and σ = 10 beats per minute.. We intend to use a two-
sided test because we are not sure in what direction the heart rate will change
after using the drug. Therefore, the sample size is estimated using the two-sided
formulation. We have
n = (zβ + zα/2 )² σ²/(µ0 − µ1 )² = (z0.20 + z0.025 )² × 10²/5² = (0.84 + 1.96)² × 100/25
= 31.36 ≈ 32
Thus, 32 patients must be studied to have at least an 80% chance of finding
a significant difference using a two-sided test with α = 0.05 if the true mean
change in heart rate from using the drug is 5 beats per minute.
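The two-sided sample-size formula can be evaluated with SciPy's normal quantile function, which avoids looking up zβ and zα/2 in a table:

```python
import math
from scipy.stats import norm

# Heart-rate example: detect a mean change of 5 beats per minute, sigma = 10
alpha, power = 0.05, 0.80
sigma, delta = 10, 5  # delta = |mu1 - mu0|

z_alpha2 = norm.ppf(1 - alpha / 2)  # approximately 1.96
z_beta = norm.ppf(power)            # approximately 0.84
n = (z_beta + z_alpha2) ** 2 * sigma ** 2 / delta ** 2

print(math.ceil(n))  # 32
```

Rounding up with `math.ceil` mirrors the rule that sample sizes are always rounded up to guarantee the required power.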
4. Determine the critical value or p-value.
• For a two-tailed test at significance level α, the critical values are
−zα/2 and zα/2 .
• Alternatively, calculate the p-value corresponding to the test statistic Z.
5. Make a decision.
• If |zcal | > zα/2 , reject the null hypothesis H0 .
• If the p-value is less than α, reject the null hypothesis H0 .
Solution
• Make a decision:
■ Since |zcal | = 0.649 < 1.96, we fail to reject the null hypothesis.
■ The p-value is greater than α = 0.05, so we fail to reject the null
hypothesis.
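The steps above can be sketched in Python. The counts below (56 successes out of 100 trials, testing p0 = 0.5) are purely illustrative, not taken from a problem in this chapter:

```python
from scipy.stats import norm

# Hypothetical illustration: 56 successes out of n = 100, testing p0 = 0.5
x, n, p0 = 56, 100, 0.5

p_hat = x / n
z_cal = (p_hat - p0) / (p0 * (1 - p0) / n) ** 0.5   # one-sample proportion z statistic
p_value = 2 * (1 - norm.cdf(abs(z_cal)))            # two-sided p-value

print(f"z = {z_cal:.2f}, p-value = {p_value:.3f}")  # z = 1.20, p-value = 0.230
```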
Estimating the appropriate sample size for a proportion test is crucial to ensure
that the test has sufficient power to detect a significant difference. The sample
size needed depends on the desired level of statistical significance, the power of
the test, and the expected proportion.
Solution
• Define the parameters:
zα/2 = 1.96 (for a 5% significance level)
and
zβ = 0.84 (for 80% power)
• Calculate the pooled proportion:
p̄ = (p0 + p1 )/2 = (0.05 + 0.08)/2 = 0.065
• Calculate the required sample size per group:
n = [ zα/2 √(2 p̄(1 − p̄)) + zβ √(p0 (1 − p0 ) + p1 (1 − p1 )) ]² / (p1 − p0 )²
= [ 1.96√(2 × 0.065 × 0.935) + 0.84√(0.05 × 0.95 + 0.08 × 0.92) ]² / (0.03)²
= (1.96 × 0.3486 + 0.84 × 0.3480)² / 0.0009
= (0.6833 + 0.2923)² / 0.0009
≈ 1058
We delved into various testing methodologies, including tests for means and
proportions, each with specific applications and considerations. The discus-
sions on power analysis and sample size estimation underscore the importance
To test this claim, a sample of 20 batteries is tested, and the sample mean
is found to be 490 hours with a sample standard deviation of 15 hours.
Test the company’s claim at a 5% significance level.
2. A researcher wants to test if the average height of a population of adult
males is different from 70 inches. A random sample of 30 men is selected,
and the sample mean height is found to be 68.5 inches with a standard
deviation of 3 inches. Perform a hypothesis test at the 5% significance
level.
3. In a survey of 200 people, 120 indicated that they prefer coffee over tea.
Test whether the proportion of people who prefer coffee is different from
proportion within 3% with 95% confidence?
10. A school claims that the average score of its students on a standardized
test is 75. A random sample of 50 students has a mean score of 72
with a standard deviation of 10. Conduct a hypothesis test at the 0.01
significance level to determine if the school’s claim is valid.
11. A researcher wants to compare the effectiveness of two different teaching
methods. Method A has a sample mean of 82 with a standard deviation
of 5 from 25 students. Method B has a sample mean of 78 with a standard
deviation of 6 from 30 students. Test the hypothesis that the two methods
are equally effective at a significance level of 0.05.
12. A study compares the average daily water consumption of two cities. City
X has a mean consumption of 150 liters with a standard deviation of 20
liters based on a sample of 40 households. City Y has a mean consumption
of 160 liters with a standard deviation of 25 liters based on a sample of
35 households. Conduct a hypothesis test at the 0.05 level.
13. A dietician wants to test the effectiveness of a new diet plan. She measures
the weight of 10 participants before and after the diet. The weights (in
kg) are as follows:
• Before: 85, 78, 90, 82, 88, 80, 76, 85, 87, 90
• After: 83, 76, 87, 80, 84, 78, 75, 82, 86, 89
Test whether the diet plan has significantly reduced weight at the 0.05
significance level.
14. A medical researcher is testing the effect of a new medication on blood
pressure. He measures the blood pressure of 12 patients before and after
treatment. The readings (in mmHg) are as follows:
• Before: 130, 135, 140, 128, 132, 138, 136, 134, 129, 137, 141, 133
• After: 125, 130, 135, 127, 130, 132, 128, 129, 126, 134, 138, 131
Conduct a hypothesis test at the 0.01 significance level.
15. Explain the concepts of Type I and Type II errors in the context of hy-
pothesis testing. Provide examples of each based on the exercises above.
16. A new software is claimed to improve productivity. If you decide to reject
the null hypothesis that it does not improve productivity (Type I error)
when in fact it does not, what are the consequences of such an error?
17. A researcher is conducting a study to evaluate whether a new drug is
effective in lowering blood pressure. The null hypothesis is that the drug
has no effect on blood pressure (i.e., the mean change in blood pressure is
zero), while the alternative hypothesis is that the drug does lower blood
pressure. The researcher expects that the drug will lower blood pressure
population standard deviation is known to be 15 widgets. The company
expects that the new process will increase the mean output by 5 widgets
per hour, i.e., the true population mean is 105 widgets per hour. Given
that the sample size is 25, calculate the power of the test to detect a true
mean of 105 widgets per hour.
Chapter 10
Correlation and Regression Analysis
10.1 Introduction
In the realm of data science, understanding the relationships between variables
is crucial for deriving actionable insights from data. This chapter provides a
comprehensive overview of correlation and regression analysis, fundamen-
tal techniques in statistical modeling that are essential for data-driven decision-
making. Correlation analysis allows data scientists to quantify the strength
and direction of (linear) relationships between two variables, offering a pre-
liminary understanding of their interdependencies.
Regression analysis is a powerful tool used to model and predict the behavior
of a dependent variable based on one or more independent variables. This
chapter covers simple linear regression as well as multiple linear regression, pro-
viding a detailed examination of the assumptions, estimation procedures, and
interpretation of regression coefficients. We also delve into evaluating model
performance through metrics such as R2 and adjusted R2 , which are critical for
assessing the accuracy and reliability of the model.
To bridge theory and practice, Python code examples are integrated through-
out the chapter, demonstrating how to implement these techniques in real-world
data science applications. These practical illustrations enhance the understanding of these methods in practice.
CHAPTER 10. CORRELATION AND REGRESSION ANALYSIS
on the X-axis. By plotting data points for pairs of observations, the scatter
diagram visually represents how the two variables relate to each other.
Let’s say we have data on the height (in cm) and weight (in kg) of a group
of individuals. The height and weight pairs might look like this:
Table 10.1: Height and Weight of Individuals
Height (cm): 150, 160, 170, 180, 190, 185
Weight (kg): 55, 60, 65, 70, 75, 81
The scatter diagram of the height (in cm) and weight (in kg) of a group of
individuals, as given in Table 10.1, is presented in Figure 10.1.
Figure 10.1: Scatter diagram of Height and Weight.
• Weak Correlation: Points more widely spread out but still following
a general trend indicate a weak relationship.
4. Outliers
• Outliers: Points that are far from the general pattern of the data
are called outliers. They may indicate special cases or errors in data
collection and can significantly affect the interpretation of the scatter
plot.
Interpretation of Figure 10.1
Consider a scatter plot where height (cm) is plotted against weight (kg):
• Direction: If the points generally trend upwards, there is a positive
correlation between height and weight.
• Strength: If the points are closely packed around a line, the correla-
tion is strong; if they are more spread out, the correlation is weaker.
• Outliers: If a point is far away from others (e.g., a height of 190 cm
but a weight of 55 kg), it may be an outlier, which requires further
investigation.
import matplotlib.pyplot as plt

# Data
heights = [150, 160, 170, 180, 190, 185]  # Heights in cm
weights = [55, 60, 65, 70, 75, 81]        # Weights in kg

# Create the scatter plot
plt.scatter(heights, weights)
plt.xlabel('Height (cm)')
plt.ylabel('Weight (kg)')
plt.title('Scatter Diagram of Height and Weight')
# Show grid and display the plot
plt.grid(True)
plt.show()
10.4 Covariance
A scatter diagram is a powerful tool for visualizing the relationship between
two variables. However, it has some limitations:
• No Quantitative Measure: While scatter plots can visually show
trends, they do not provide a quantitative measure of the strength or
direction of the relationship between the variables. For this, statistical
measures like covariance or correlation are needed.
The sample covariance between X and Y is defined as
sxy = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / (n − 1).
Interpretation of the value of Covariance:
• A positive covariance indicates that the variables tend to move in the
same direction, meaning that when one variable increases, the other
tends to increase as well.
• A negative covariance indicates that the variables tend to move in
opposite directions.
• Zero (or close to zero) indicates that there is no (or a weak) linear relationship
between the two variables.
Problem 10.1. Consider a dataset containing the number of commercials aired
A
and the corresponding sales volume for a product over ten weeks:
Table 10.2: Sample Data for the San Francisco Electronics Store
Week | Commercials (x) | Sales Volume ($100s) (y)
1 | 2 | 50
2 | 5 | 57
3 | 1 | 41
4 | 3 | 54
5 | 4 | 54
6 | 1 | 38
7 | 5 | 63
8 | 3 | 48
9 | 4 | 59
10 | 2 | 46
(i). Draw a scatter diagram of the data points representing the relationship
between the number of commercial advertisements and the sales volume.
(ii). Based on the scatter diagram, describe the observed trend or pattern in
the data.
(iii). Calculate the sample covariance between the number of commercial ad-
vertisements and the sales volume to quantitatively assess the degree of
association between these two variables.
(iv). Interpret the sample covariance value in terms of the strength and direc-
tion of the relationship between the number of commercial advertisements
and the sales volume.
Solution
(i). The scatter plot of Number of commercials (X) and Sales Volume ($100s)
is presented in Figure 10.2.
Figure 10.2: Scatter diagram of Number of Commercials (X) and Sales Volume ($100s).
(ii). The scatter plot in Figure 10.2 illustrates the relationship between the
number of commercials aired and the sales volume:
(iii). Let’s calculate the sample covariance for the given data:
Table 10.3: Computations for the sample covariance
i | xi | yi | xi yi
1 | 2 | 50 | 100
2 | 5 | 57 | 285
3 | 1 | 41 | 41
4 | 3 | 54 | 162
5 | 4 | 54 | 216
6 | 1 | 38 | 38
7 | 5 | 63 | 315
8 | 3 | 48 | 144
9 | 4 | 59 | 236
10 | 2 | 46 | 92
Total | 30 | 510 | 1629
Mean of X : x̄ = 30/10 = 3
Mean of Y : ȳ = 510/10 = 51
Therefore,
sxy = (1629 − 10 × 3 × 51)/(10 − 1) = 99/9 = 11
(iv). The positive covariance of 11 suggests a positive relationship between
the number of commercials aired and the sales volume, indicating that as the
number of commercials increases, the sales volume tends to increase as well.
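This covariance can be checked with NumPy; `np.cov` with `ddof=1` uses the same (n − 1) divisor as the hand calculation:

```python
import numpy as np

# Data from Table 10.2
x = np.array([2, 5, 1, 3, 4, 1, 5, 3, 4, 2])            # number of commercials
y = np.array([50, 57, 41, 54, 54, 38, 63, 48, 59, 46])  # sales volume ($100s)

s_xy = np.cov(x, y, ddof=1)[0, 1]  # off-diagonal entry is the sample covariance
print(s_xy)  # 11.0
```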
10.5 Correlation Analysis
Correlation analysis is a group of techniques to measure the strength and
direction of the relationship between two variables. There are different
types of correlation coefficients that can be used depending on the nature of
the variables being analyzed and the assumptions of the data. Some of the
common types of correlation coefficients include:
1. Pearson correlation coefficient: Measures linear relationship be-
tween two continuous variables.
2. Spearman rank correlation coefficient: Measures association be-
tween ranked variables.
3. Kendall tau correlation coefficient: Measures similarity of orderings
of data pairs.
4. Point-biserial correlation coefficient: Measures association between
continuous and binary variables.
5. Phi coefficient: Measures association between two binary variables.
For computational purposes, note the identities
Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) = Σᵢ₌₁ⁿ xᵢ yᵢ − n x̄ ȳ;
Σᵢ₌₁ⁿ (xᵢ − x̄)² = Σᵢ₌₁ⁿ xᵢ² − n x̄²;
and
Σᵢ₌₁ⁿ (yᵢ − ȳ)² = Σᵢ₌₁ⁿ yᵢ² − n ȳ².
The sample correlation coefficient is then
r = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / ( (n − 1) sx sy ) = sxy / (sx sy )
where sx = √( (1/(n − 1)) Σᵢ₌₁ⁿ (xᵢ − x̄)² ) (the sample standard deviation); and analo-
gously for sy .
Problem 10.2. Consider Problem 10.1. Calculate the correlation coefficient
and interpret the results.
Solution
From the solution of Problem 10.1, we have the sample covariance sxy = 11.
The calculations for the variance are presented in Table 10.4.
i | xi | yi | xi² | yi² | xi yi
1 | 2 | 50 | 4 | 2500 | 100
2 | 5 | 57 | 25 | 3249 | 285
3 | 1 | 41 | 1 | 1681 | 41
4 | 3 | 54 | 9 | 2916 | 162
5 | 4 | 54 | 16 | 2916 | 216
6 | 1 | 38 | 1 | 1444 | 38
7 | 5 | 63 | 25 | 3969 | 315
8 | 3 | 48 | 9 | 2304 | 144
9 | 4 | 59 | 16 | 3481 | 236
10 | 2 | 46 | 4 | 2116 | 92
Total | 30 | 510 | 110 | 26576 | 1629
We can compute the sample standard deviations for the two variables:

sx = √[ (1/(n−1)) ( Σ xi² − n x̄² ) ] = √[ (1/9)(110 − 10 × 3²) ] = 1.49

sy = √[ (1/(n−1)) ( Σ yi² − n ȳ² ) ] = √[ (1/9)(26576 − 10 × 51²) ] = 7.93

Hence, the sample correlation coefficient equals

r = sxy / (sx sy) = 11 / (1.49 × 7.93) = 0.93
which indicates a strong positive linear relationship between the number of commercials and the sales volume.
[Figure: Scatter plots of perfectly correlated data. Panel A: perfect positive correlation (r = +1); Panel B: perfect negative correlation (r = −1).]
• |r| close to 0 indicates a weak relationship.
2. Symmetry: The correlation between X and Y is the same as between
Y and X:
r(X, Y ) = r(Y, X)
appropriate for continuous data but may not be suitable for ordinal or categorical data without modification.
8. Pairwise Comparisons: The correlation coefficient is computed for
pairs of variables and does not extend naturally to more than two vari-
ables.
x 4 5 3 6 10
y 4 6 5 7 7
Solution
(a)
The scatter plot given in Figure 10.5, visualizes the relationship between two
variables, X and Y , for the given sample of observations. Each point on the
scatter plot represents a pair of x and y values.
[Figure 10.5: Scatter plot of X versus Y for the given sample.]
The scatter plot indicates that there is a positive correlation between X and
Y , as the points generally trend upwards. This indicates that higher values of
X tend to be associated with higher values of Y . However, to draw definitive
conclusions about the strength and nature of this relationship, further statis-
tical analysis, such as calculating the Pearson correlation coefficient, would be
necessary.
(b)
xi    yi    xi²   yi²   xi·yi
4     4     16    16    16
5     6     25    36    30
3     5     9     25    15
6     7     36    49    42
10    7     100   49    70

Σ xi = 28 ; Σ yi = 29 ; Σ xi² = 186 ; Σ yi² = 175 ; Σ xi yi = 173
x̄ = 5.6 ; ȳ = 5.8
r = ( Σ xi yi − n x̄ ȳ ) / [ √( Σ xi² − n x̄² ) √( Σ yi² − n ȳ² ) ]
  = (173 − 5 × 5.6 × 5.8) / [ √(186 − 5 × (5.6)²) √(175 − 5 × (5.8)²) ]
  = 0.7522
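The arithmetic above can be reproduced, and cross-checked against NumPy's built-in routine, with a short script:

```python
import numpy as np

x = np.array([4, 5, 3, 6, 10], dtype=float)
y = np.array([4, 6, 5, 7, 7], dtype=float)

n = len(x)
# r = (sum(xy) - n*xbar*ybar) / sqrt((sum(x^2) - n*xbar^2)(sum(y^2) - n*ybar^2))
num = np.sum(x * y) - n * x.mean() * y.mean()
den = np.sqrt((np.sum(x**2) - n * x.mean()**2) * (np.sum(y**2) - n * y.mean()**2))
r = num / den
print(round(r, 4))
```

The same value is returned by `np.corrcoef(x, y)[0, 1]`.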
Testing the Significance of the Correlation Coefficient
Testing the significance of the correlation coefficient involves determining whether
the observed correlation between two variables is statistically significant, mean-
ing it is unlikely to have occurred by chance. Here’s a step-by-step guide on
how to perform this test:
1. Hypothesis
2. Level of significance α
3. Test statistic

T = r √(n − 2) / √(1 − r²) ∼ t distribution with n − 2 degrees of freedom
4. Decision Rule
Reject H0 if
T > tα/2,n−2 or T < −tα/2,n−2
equivalently,
Reject H0 if and only if |T | > tα/2,n−2
at significance level α.
H0 : ρ = 0
and if the null hypothesis is true, the test statistic T follows the Student’s-t
distribution with (n − 2) degrees of freedom, i.e., T ∼ t(n − 2).
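As a sketch (using r = 0.93 and n = 10 from Problem 10.2), this test can be carried out with SciPy's t distribution:

```python
import numpy as np
from scipy import stats

r, n = 0.93, 10                                  # from Problem 10.2
T = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)       # test statistic
p_value = 2 * stats.t.sf(abs(T), df=n - 2)       # two-sided p-value
t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 2)     # critical value at alpha = 0.05
print(round(T, 2), p_value < 0.05)
```

Since |T| = 7.16 exceeds t0.025,8 ≈ 2.306, H0 : ρ = 0 is rejected at the 5% level.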
Table 10.5: Decision rule for the test of hypothesis H0 : ρ = 0
import pandas as pd

# Data
heights = [150, 160, 170, 180, 190, 182]  # Heights in cm
weights = [55, 60, 65, 70, 75, 81]        # Weights in kg

# Create a DataFrame from the two lists
data = pd.DataFrame({'height': heights, 'weight': weights})

# Compute correlation matrix
correlation_matrix = data.corr()
print(correlation_matrix)

# Compute correlation matrix with 2 decimal places
correlation_matrix = data.corr().round(2)

# View the correlation matrix
print(correlation_matrix)
If ri , si are the ranks of the i-th member according to the x- and y-quality respectively, and di = ri − si , then the Spearman rank correlation coefficient is

rR = 1 − 6 Σ di² / ( n(n² − 1) )
• ρ = −1: Perfect negative correlation.
• ρ = 0: No correlation.
Problem 10.4. Based on the following data, find the rank correlation between
marks of English and Mathematics courses.
English   56   75   45   71   62   64   58   80   76   61
Maths     66   70   40   60   65   56   59   77   67   63
Solution
The procedure for ranking these scores is as follows:
English   Maths   Rank (Eng)   Rank (Maths)   di    di²
56        66      9            4              5     25
75        70      3            2              1     1
45        40      10           10             0     0
71        60      4            7              −3    9
62        65      6            5              1     1
64        56      5            9              −4    16
58        59      8            8              0     0
80        77      1            1              0     0
76        67      2            3              −1    1
61        63      7            6              1     1

With Σ di² = 54 and n = 10,

rR = 1 − 6 Σ di² / ( n(n² − 1) ) = 1 − (6 × 54)/(10 × 99) = 0.6727
This indicates a strong positive relationship between the ranks individuals ob-
tained in the Maths and English exam. That is, the higher you ranked in maths,
the higher you ranked in English also, and vice versa.
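The rank computation for Problem 10.4 can be verified with a short script; `rankdata` applied to the negated scores reproduces the "rank 1 = highest mark" convention used in the table:

```python
import numpy as np
from scipy import stats

english = np.array([56, 75, 45, 71, 62, 64, 58, 80, 76, 61])
maths = np.array([66, 70, 40, 60, 65, 56, 59, 77, 67, 63])

# Rank with 1 = highest mark, matching the worked table
rank_e = stats.rankdata(-english)
rank_m = stats.rankdata(-maths)

d = rank_e - rank_m
n = len(english)
r_R = 1 - 6 * np.sum(d**2) / (n * (n**2 - 1))   # Spearman formula (no ties)
print(round(r_R, 4))
```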
• If there are more than one such group of items with common rank, this
value is added as many times as the number of such groups.
• Then the formula for the rank correlation is

rR = 1 − 6{ Σ di² + (m1(m1² − 1))/12 + (m2(m2² − 1))/12 + · · · } / ( n(n² − 1) )

where m1 , m2 , . . . are the numbers of items in each group of tied ranks.
Problem 10.5. Based on the following data, find the rank correlation between marks of English and Mathematics courses.
English 56 75 45 71 61 64 58 80 76 61
Maths 70 70 40 60 65 56 59 70 67 80
Solution
The procedure for ranking these scores is as follows:
English   Maths   Rank (Eng)   Rank (Maths)   di     di²
56        70      9            3              6      36
75        70      3            3              0      0
45        40      10           10             0      0
71        60      4            7              −3     9
61        65      6.5          6              0.5    0.25
64        56      5            9              −4     16
58        59      8            8              0      0
80        70      1            3              −2     4
76        67      2            5              −3     9
61        80      6.5          1              5.5    30.25
rR = 1 − 6{ 104.5 + (2(2² − 1))/12 + (3(3² − 1))/12 } / ( 10(10² − 1) )
   = 1 − (6 × 107)/990
   = 0.3515
10.7.2 Applications of Rank Correlation
• Non-Linear Relationships: Rank correlation is useful when the re-
lationship between variables is not linear but still monotonic (e.g., one
variable consistently increases as the other does, but not necessarily in
a straight line).
• Ordinal Data: It is appropriate for ordinal data, where the values represent rankings or ordered categories (e.g., customer satisfaction ratings).
from scipy import stats

# Given data
english_marks = [56, 75, 45, 71, 61, 64, 58, 80, 76, 61]
maths_marks = [70, 70, 40, 60, 65, 56, 59, 70, 67, 80]

# Spearman rank correlation (tied ranks are handled via average ranks)
rho, p_value = stats.spearmanr(english_marks, maths_marks)
print(round(rho, 4))
If you have two columns of data in a pandas DataFrame, you can calculate
the rank correlation directly from the DataFrame. Here’s an example:
# From a DataFrame
import pandas as pd

# Example DataFrame (you should replace this with your actual data)
data = pd.DataFrame({
    'english_marks': [56, 75, 45, 71, 61, 64, 58, 80, 76, 61],
    'maths_marks': [70, 70, 40, 60, 65, 56, 59, 70, 67, 80]
})

# Spearman rank correlation between the two columns
rank_corr = data['english_marks'].corr(data['maths_marks'], method='spearman')
print(rank_corr)
10.7.4 Kendall Tau Correlation Coefficient
Kendall’s tau is another non-parametric measure of rank correlation that eval-
uates the ordinal association between two variables.
5. It ranges from −1 to 1.
Interpretation:
• τ = 1: Perfect agreement between the rankings.
• τ = 0: No association.
• Concordant Pairs:
In the context of correlation coefficients such as Kendall Tau, con-
cordant pairs refer to pairs of observations where the ranks for
both variables follow the same order. In other words, if (Xi , Yi )
and (Xj , Yj ) are two pairs of observations, they are considered
concordant if both Xi < Xj and Yi < Yj or if both Xi > Xj and
Yi > Yj .
• Discordant Pairs: Discordant pairs, on the other hand, refer
to pairs of observations where the ranks for the variables have
opposite orders. In other words, if (Xi , Yi ) and (Xj , Yj ) are two
pairs of observations, they are considered discordant if Xi < Xj
and Yi > Yj or if Xi > Xj and Yi < Yj .
In the context of calculating correlation coefficients like Kendall Tau, understanding concordant and discordant pairs is crucial as they form the basis
for determining the strength and direction of association between two variables
based on their ranks.
Problem 10.6. Suppose we have the following data on two variables, X and
Y , with their corresponding ranks:
Observation   X    Y
1             10   15
2             15   10
3             20   20
4             25   25
5             30   40
Calculate the Kendall Tau correlation coefficient.
9. (20, 20) and (30, 40): Concordant
10. (25, 25) and (30, 40): Concordant
So, out of the 10 pairs of observations, there are 9 concordant pairs and 1
discordant pair.
• Number of concordant pairs (nc ): 9
• Number of discordant pairs (nd ): 1

τ = (nc − nd ) / ( n(n − 1)/2 ) = (9 − 1)/10 = 0.8

The Kendall Tau correlation coefficient for the given data is τ = 0.8.
• Disadvantages:
■ Less Sensitive: Might not capture the strength of the relationship as well as Pearson's correlation in the presence of linear relationships.
■ Reduced Power: May have less statistical power compared to
parametric tests when assumptions for those tests are met.
from scipy import stats

# Given data
X = [10, 15, 20, 25, 30]
Y = [15, 10, 20, 25, 40]

# Kendall tau correlation coefficient and its p-value
tau, p_value = stats.kendalltau(X, Y)
print(round(tau, 2))
10.7.7 Exercises
1. Define a scatter diagram. What is its primary purpose in data analy-
sis, and how can it help in understanding the relationship between two
variables?
2. Describe how you would interpret a scatter diagram that shows a perfect
positive linear relationship between two variables. What characteristics
would you expect to see in the plot?
3. Explain how a scatter diagram can be used to detect non-linear relation-
ships between variables. Provide examples of different types of non-linear
relationships that might be observed.
4. What are some limitations of using a scatter diagram for analyzing rela-
tionships between variables? How can these limitations affect the inter-
pretation of the data?
5. Define covariance. What does it measure, and how is it different from
correlation?
10. What assumptions need to be met for Pearson's correlation coefficient to provide a valid measure of the strength and direction of the relationship?
11. How does Pearson’s correlation coefficient handle outliers in the data?
What impact can outliers have on the correlation coefficient, and how
might this affect data analysis?
12. What is the purpose of using rank correlation methods, such as Spear-
man’s and Kendall’s Tau, instead of Pearson’s correlation coefficient?
13. Explain the differences between Spearman rank correlation and Kendall
Tau correlation. In what situations might one method be preferred over
the other?
14. Discuss the advantages and limitations of using rank-based methods for
measuring the association between two variables.
15. Define Spearman’s rank correlation coefficient. How is it computed, and
what does it measure?
16. Describe how tied ranks are handled in the computation of Spearman’s
rank correlation coefficient. What impact do ties have on the correlation
value?
17. Define Kendall’s Tau correlation coefficient. How does it differ from
Spearman’s rank correlation in terms of interpretation and calculation?
18. Explain the concepts of concordant and discordant pairs in the context of
Kendall’s Tau. How are they used to calculate the correlation coefficient?
19. Kendall’s Tau is often considered more robust than Spearman’s rank cor-
relation in the presence of tied ranks. Discuss why this is the case and
how Kendall’s Tau adjusts for ties.
20. Given the following pairs of data representing the number of hours studied
(X) and the scores obtained (Y) by 10 students in an exam:
7 10 80
8 12 85
9 14 90
10 15 95
(i). Plot the scatter diagram for this data. Describe the relationship
between the hours studied and the exam scores based on the
scatter plot.
(ii). Calculate the covariance between the number of hours studied
(X) and the exam scores (Y).
(iii). Interpret the sign and magnitude of the covariance. What does
it tell you about the relationship between the two variables?
(iv). Calculate the Pearson correlation coefficient between the hours
studied and exam scores.
8     11    8.0
9     10    8.0
10    12    6.0
11    6     8.6
12    6     8.0

(i).
22. Consider the following data on the ranks of two variables, X and Y :
(a) Calculate the Spearman rank correlation coefficient for the above
data.
(b) Interpret the result in the context of the relationship between X and
Y.
23. Suppose we have the following ranks for two variables A and B:
(a) Calculate the Kendall Tau correlation coefficient for the data pro-
vided.
(b) Discuss the strength and direction of the relationship between A and
B based on your result.
24. You are provided with the following data on two variables, P and Q.
Compute both the Spearman rank correlation and Kendall Tau correlation coefficients:
Observation P Q
1 85 92
2 78 85
3 92 88
4 70 76
5 88 90
(a) Compare the results from both rank correlation methods. Explain
any similarities or differences observed.
and may not capture complex relationships that exist between variables.
The motivation for regression analysis stems from the need to understand
and model the relationship between variables more comprehensively. Regression
analysis allows us not only to measure the direction and significance of the relationship but also to make predictions and infer causality, provided
certain assumptions are met. By fitting a regression model, we can examine
how changes in one variable are associated with changes in another variable
while controlling for potential confounding factors.
• Dependent Variable: The outcome or response that is being predicted or explained in an analysis.

• Independent Variable: The predictor or factor that is used to predict or explain changes in the dependent variable.
Regression analysis enables deeper insights into the underlying mechanisms
driving the relationship between variables.
Regression analysis: Regression analysis is a set of statistical methods
used to estimate relationships between a dependent variable (also known
as the response or target variable) and one or more independent variables
(also known as predictors, features, or explanatory variables), allowing
for predictions and the determination of the strength and nature of these
relationships. This relationship can be expressed as
y = g(x1 , x2 , . . . , xp ) + e (10.2)
on parametric regression models.
There are various types of regression models depending on the nature of the
response variable. Some of them are mentioned below:
■ Poisson Regression
■ Negative Binomial Regression, etc.
E(Y | X = x) = β0 + β1 x (10.3)
where
• E(Y | X = x) represents the expected (or average) value of Y when X
is equal to x.
• β0 and β1 are the parameters of the regression function, with β0 being
the intercept and β1 the slope of the regression line.
[Figure: The population regression line E(Y) = β0 + β1 x. Panel A: positive slope β1 ; Panel B: negative slope β1 ; Panel C: slope β1 = 0 (no relationship), so the line is horizontal at the intercept β0 .]
Suppose xi is a given value of X for the i-th observation. Then the popula-
tion regression function (PRF) can be written as
E(Y | X = xi ) = β0 + β1 xi (10.4)
where E(Y | X = xi ) represents the expected value of Y given that X takes the
value xi . For a given value x = xi , the observed value yi for Y can be expressed
as:
yi = E(Y | X = xi ) + ei
= β0 + β1 xi + ei (10.5)
where ei is the error term for the i-th observation, representing the deviation
of the observed value yi from the expected value E(Y | X = xi ). The model
given by Equation (10.5) is known as the simple linear regression model.
This model is used to estimate the population regression function described in
Equation (10.4).
The classical simple linear regression (CSLR) model for i-th obser-
vation is
y i = β0 + β1 x i + e i ; i = 1, 2, . . . , n (10.6)
where:
Note that the model (10.6) is a special case of (10.2) when g is a linear
function and p = 1.
y = β1 x1 + β2 x2 + · · · + βp xp + ϵ
where y is the dependent variable, xi are the independent variables, βi are the
coefficients, and ϵ is the error term. There is no constant term (β0 ).
The assumptions of the CSLR model given in Equation (10.6) are as follows:
1. Linearity: The relationship between X and Y must be linear.
Under these assumptions, we can write ei ∼ iid N (0, σ 2 ), and hence, yi | xi ∼
N (β0 + β1 xi , σ 2 ). Therefore, this model is also known as a normal linear
regression model. The graphical representation of the simple linear regression
model given in Equation (10.6) under these assumptions is presented in Figure
10.6.
Remark 10.8.1. Normality of the errors (or residuals) is not strictly required.
However, the normality assumption in Equation (10.6) is necessary to perform
hypothesis tests concerning regression parameters, as discussed in the next section.
The estimated model (or estimated line) for Equation (10.6) can be
written as
ybi = βb0 + βb1 xi
where
• ybi is the estimator of E(Y | X = xi ) based on the sample data.
• β̂0 is the estimator of β0 .
• β̂1 is the estimator of β1 .

The observed value yi can then be decomposed as

yi = ŷi + êi
where ebi is the residual, representing the deviation of the observed value yi from
the estimated value ybi .
Under the assumptions given in Section 10.8.3 for the simple linear regression
model provided in Equation (10.6), the OLS estimators for the parameters β0
and β1 are found by minimizing the sum of squared errors. Thus, the objective
function for OLS estimation is:
n
X n
X 2
Minimize e2i = (yi − (β0 + β1 xi ))
i=1 i=1
Pn 2
where i=1 ei is the sum of squared errors.
OLS Estimators
The OLS estimators for β0 and β1 can be derived by taking the partial deriva-
tives of Q(β0 , β1 ) with respect to β0 and β1 , setting them to zero, and solving
for the coefficients. The OLS estimators are:
β̂1 = ( Σ xi yi − n x̄ ȳ ) / ( Σ xi² − n x̄² ) ;  β̂0 = ȳ − β̂1 x̄
where:
• βb1 is the estimated slope of the regression line.
• βb0 is the estimated intercept of the regression line.
• x̄ and ȳ are the sample means of the independent variable X and the
dependent variable Y , respectively.
Procedure for Calculating the OLS Estimators in a Simple
Linear Regression Model
For the simple linear regression model given in Equation (10.6), the OLS esti-
mators can be found using the following steps:
1. Objective Function:

Q(β0 , β1 ) = Σi ei² = Σi ( yi − β0 − β1 xi )²

2. Setting the partial derivative with respect to β1 equal to zero:

∂/∂β1 Σi ei² = −2 Σi ( yi − β0 − β1 xi ) xi = 0    (10.8)
Expanding Equation (10.8) and substituting β0 = ȳ − β1 x̄ gives

Σ xi yi − ȳ Σ xi + β1 x̄ Σ xi − β1 Σ xi² = 0
⇒ Σ xi yi − n ȳ x̄ − β1 ( Σ xi² − n x̄² ) = 0
⇒ β1 = ( Σ xi yi − n x̄ ȳ ) / ( Σ xi² − n x̄² )
Equivalently, in deviation form,

β̂1 = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)² = rxy ( sy / sx )
where rxy is the correlation coefficient between X and Y , and sx and sy are the
sample standard deviations of X and Y , respectively.
stated in Theorem 10.1.
Theorem 10.1. Let βbxy be the slope of the regression of y on x and βbyx be the
slope of the regression of x on y. The estimated regression lines are:
Then the geometric mean of βbxy and βbyx is equal to the absolute value of the
Pearson correlation coefficient r.
Proof. The slope β̂xy of the regression of y on x is given by:

β̂xy = r · ( sy / sx )

where r is the Pearson correlation coefficient, sy is the standard deviation of y, and sx is the standard deviation of x. Similarly, the slope of the regression of x on y is β̂yx = r · ( sx / sy ). Hence

β̂xy · β̂yx = r² , so √( β̂xy β̂yx ) = |r| .

Thus, the geometric mean of the regression coefficients β̂xy and β̂yx is the absolute value of the Pearson correlation coefficient r.
The estimated error variance is

σ̂² = Σi (yi − ŷi )² / (n − 2) = ( Σi yi² − β̂0 Σi yi − β̂1 Σi xi yi ) / (n − 2)

where

ŷi = β̂0 + β̂1 xi

Hence, the standard error of estimate is

σ̂ = √[ Σi (yi − ŷi )² / (n − 2) ] = √[ ( Σi yi² − β̂0 Σi yi − β̂1 Σi xi yi ) / (n − 2) ]
Problem 10.7 (Shorshe Ilish restaurant Sales Dataset). Suppose data were
collected from a sample of 10 semesters from a restaurant located near to the
university campus. For the ith semester in the sample, xi is the size of the
student population (in thousands) and yi is the quarterly sales (in thousands of dollars).
Table 10.6: Student Population and Quarterly Sales Data for Different Semesters

Semester (i)   Student Population xi (1000s)   Quarterly Sales yi ($1000s)
1              2                               58
2              6                               105
3              8                               88
4              8                               118
5              12                              117
6              16                              137
7              20                              157
8              20                              169
9              22                              149
10             26                              202
(i). Show the relationship between the size of student population and the
quarterly sales. Make a comment on the diagram.
(ii). Write down the regression model for this example, and mention the
assumptions of the model.
(iii). Find the least square estimates and write the estimated regression model.
Interpret the results.
(vi). Find the value of the standard error of the estimates.
Solution
(i). The scatter plot in Figure 10.7 shows the relationship between student
population (in thousands) and quarterly sales (in thousands of dollars) based
on data from ten semesters.
[Figure 10.7: Scatter plot of student population (1000s) versus quarterly sales ($1000s).]
Observations:
• There appears to be a positive correlation between student population
and quarterly sales. As the student population increases, the quarterly
sales also tend to increase.
• The data points are somewhat clustered along a line, suggesting a linear
relationship.
• Some points, such as the one where the student population is 8, show variability in sales, indicating that factors other than student population might also influence sales.
yi = β0 + β1 xi + ei ; i = 1, 2, . . . , 10 (10.10)
where
• yi = Quarterly Sales ($1000s)
• β0 is intercept
• xi = Student Population (1000s)
• β1 is the slope coefficient
4. Zero Mean: For any fixed value of xi , the mean of the errors (residuals)
is zero.
5. Normality of Errors: The error terms are normally distributed (impor-
tant for inference, but not for estimation).
where β̂0 and β̂1 are, respectively, the estimators of β0 and β1 . The ordinary least squares (OLS) estimators of β0 and β1 are

β̂1 = ( Σ xi yi − n x̄ ȳ ) / ( Σ xi² − n x̄² ) ;  β̂0 = ȳ − β̂1 x̄ .
The Least Squares Estimation. Using the column totals from the table below, the OLS estimates are

β̂1 = ( Σ xi yi − n x̄ ȳ ) / ( Σ xi² − n x̄² ) = (21040 − 10 × 14 × 130)/(2528 − 10 × 14²) = 2840/568 = 5

β̂0 = ȳ − β̂1 x̄ = 130 − 5 × 14 = 60
xi     yi     xi²    xi·yi
2      58     4      116
6      105    36     630
8      88     64     704
8      118    64     944
12     117    144    1404
16     137    256    2192
20     157    400    3140
20     169    400    3380
22     149    484    3278
26     202    676    5252
Total  140    1300   2528   21040
ybi = 60 + 5xi
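The estimates β̂1 = 5 and β̂0 = 60 can be verified numerically (a minimal sketch using NumPy):

```python
import numpy as np

# Student population (1000s) and quarterly sales ($1000s) from Table 10.6
x = np.array([2, 6, 8, 8, 12, 16, 20, 20, 22, 26], dtype=float)
y = np.array([58, 105, 88, 118, 117, 137, 157, 169, 149, 202], dtype=float)

n = len(x)
# OLS formulas: slope from the sums, intercept from the means
beta1 = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x**2) - n * x.mean()**2)
beta0 = y.mean() - beta1 * x.mean()
print(beta0, beta1)
```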
(iv). The regression line with the scatter plot is depicted in Figure 10.8.
Figure 10.8: Scatter Plot of Student Population vs. Quarterly Sales with Re-
gression Line
For a student population of 16 (thousand), the predicted quarterly sales are ŷ = 60 + 5 × 16 = 140 ($1000s).
xi     yi     ŷi     (yi − ŷi )²
2      58     70     144
6      105    90     225
8      88     100    144
8      118    100    324
12     117    120    9
16     137    140    9
20     157    160    9
20     169    160    81
22     149    170    441
26     202    190    144
Total  140    1300   1300   1530
(vi). The Standard Error of the Estimates

σ̂² = Σi (yi − ŷi )² / (n − 2) = 1530/(10 − 2) = 191.25 and σ̂ = √191.25 = 13.83

Hence the standard error of the estimates is σ̂ = 13.83.
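A quick numerical check of the error variance (a sketch; the fitted values come from the estimated line ŷ = 60 + 5x):

```python
import numpy as np

x = np.array([2, 6, 8, 8, 12, 16, 20, 20, 22, 26], dtype=float)
y = np.array([58, 105, 88, 118, 117, 137, 157, 169, 149, 202], dtype=float)

y_hat = 60 + 5 * x                    # fitted values from the estimated line
sse = np.sum((y - y_hat)**2)          # sum of squared residuals
sigma2_hat = sse / (len(x) - 2)       # estimated error variance, df = n - 2
sigma_hat = np.sqrt(sigma2_hat)       # standard error of estimate
print(sse, round(sigma_hat, 3))
```

This value of σ̂ ≈ 13.83 is the one used in the interval estimates later in the chapter.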
R² = 1 − SSE/SST = 1 − Σi (yi − ŷi )² / Σi (yi − ȳ)²

where

• SST = Σi (yi − ȳ)² is the total sum of squares (proportional to the variance of the data).
For the Shorshe Ilish restaurant Sales Dataset given in Problem 10.7,
we have
xi     yi     (yi − ȳ)²   ŷi     (yi − ŷi )²
2      58     5184        70     144
6      105    625         90     225
8      88     1764        100    144
8      118    144         100    324
12     117    169         120    9
16     137    49          140    9
20     157    729         160    9
20     169    1521        160    81
22     149    361         170    441
26     202    5184        190    144
Total  140    1300        15730  1300   1530

Hence, R² = 1 − Σi (yi − ŷi )² / Σi (yi − ȳ)² = 1 − 1530/15730 = 0.9027
The R² = 0.9027 implies that 90.27% of the variability of the dependent variable is explained by the regression model, and the remaining 9.73% of the variability is still unexplained.
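The same R² can be obtained directly from the residuals (a sketch):

```python
import numpy as np

x = np.array([2, 6, 8, 8, 12, 16, 20, 20, 22, 26], dtype=float)
y = np.array([58, 105, 88, 118, 117, 137, 157, 169, 149, 202], dtype=float)

y_hat = 60 + 5 * x                  # fitted values from the estimated line
sse = np.sum((y - y_hat)**2)        # error sum of squares
sst = np.sum((y - y.mean())**2)     # total sum of squares
r2 = 1 - sse / sst
print(round(r2, 4))
```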
The relationship between R² and rxy is that the square of the coefficient of correlation (rxy ) is equal to the coefficient of determination (R²) for the simple regression model. Mathematically,

R² = (rxy )²
The estimated regression equation for simple linear regression model pro-
vides
ybi = βb0 + βb1 xi
Remarks: Note that the relationship R² = (rxy )² only holds for the simple linear regression model.
10.8.9 Advantages and Disadvantages of R2
Here are some advantages and disadvantages of using R2 :
• R2 is a statistic that will give some information about the goodness-
of-fit of a model.
• In regression, the R² coefficient of determination is a statistical measure of how well the regression predictions approximate the real data points.
10.8.10 Adjusted R2
R̄² = 1 − ( SSE/dfe ) / ( SST/dft )
where dft is the degrees of freedom n − 1 of the estimate of the population
variance of the dependent variable, and dfe is the degrees of freedom n − p − 1
of the estimate of the underlying population error variance.
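The definition translates into a one-line function; the example numbers (R² = 0.85, n = 50, p = 3) are illustrative only:

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R-squared for n observations and p predictors.

    Uses the identity SSE/SST = 1 - R^2, so
    adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1).
    """
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Illustrative values: R^2 = 0.85 with n = 50 observations and p = 3 predictors
print(round(adjusted_r2(0.85, 50, 3), 4))
```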
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Data: Student Population and Quarterly Sales
data = {
    'Restaurant': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Student Population (1000s)': [2, 6, 8, 8, 12, 16, 20, 20, 22, 26],
    'Quarterly Sales ($1000s)': [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]
}
df = pd.DataFrame(data)

# Fit the simple linear regression model
X = sm.add_constant(df['Student Population (1000s)'])  # add intercept term
y = df['Quarterly Sales ($1000s)']
model = sm.OLS(y, X).fit()

# Interpret the results based on the model summary
print(model.summary())

# Scatter plot with the fitted regression line
plt.scatter(df['Student Population (1000s)'], y, label='Observed Data')
plt.plot(df['Student Population (1000s)'], model.fittedvalues,
         color='red', label='Regression Line')
plt.title('Regression Line: Quarterly Sales vs. Student Population')
plt.xlabel('Student Population (1000s)')
plt.ylabel('Quarterly Sales ($1000s)')
plt.legend()
plt.grid(True)
plt.show()
10.8.12 Interval Estimation and Hypothesis Testing
Confidence Interval for β0
The confidence interval for β0 can be computed using the standard formula for
linear regression parameter estimates. The formula for the confidence interval
for β0 is:

β̂0 ± t α/2,(n−2) · se(β̂0 )
where:
• tα/2 is the critical value of the t-distribution with n − 2 degrees of
freedom at a significance level of α/2 (where α is typically 0.05 for a
95% confidence interval),
where,

σ̂ = √[ Σi (yi − ŷi )² / (n − 2) ]

is the estimated standard error of the residuals (or the square root of the mean squared error, often obtained from the regression output). Once you have these values, you can compute the confidence interval.
H0 : β1 = 0
H1 : β1 ̸= 0
The null hypothesis H0 implies that the independent variable(s) do not have any effect on the dependent variable; the alternative hypothesis H1 indicates that they do. The overall significance of the model is assessed with the F statistic:

F = ( SSR/1 ) / ( SSE/(n − 2) )
where:

• SSR = Σi (ŷi − ȳ)² is the sum of squared regression (explained),
• Once you compute the F -statistic, you can compare it to the critical
value from the F -distribution at a chosen significance level (e.g., α =
0.05) to determine whether to reject the null hypothesis.
• If the F -statistic is greater than the critical value, you reject the null
hypothesis and conclude that the model is significant. Otherwise, you
fail to reject the null hypothesis.
Classical Approach
• Set the significance level α.
• Decision Rule:
■ If p-value < α, reject H0 and conclude that the regression
model is statistically significant.
■ If p-value ≥ α, fail to reject H0 and conclude that the re-
gression model is not statistically significant.
yi = β0 + β1 xi + ei ; i = 1, 2, . . . , n
The test statistic for H0 : β1 = 0 is

t = ( β̂1 − 0 ) / se(β̂1 )

where se(β̂1 ) = σ̂ / √( Σ (xi − x̄)² ) and σ̂ = √[ Σ (yi − ŷi )² / (n − 2) ].

We reject H0 if |t| exceeds the critical value tcritical = t α/2,(n−2) from the t-distribution with n − 2 degrees of freedom, where n is the sample size.
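For the restaurant data this t test can be sketched as follows (β̂1 = 5, β̂0 = 60, and σ̂ come from the earlier computations):

```python
import numpy as np
from scipy import stats

x = np.array([2, 6, 8, 8, 12, 16, 20, 20, 22, 26], dtype=float)
y = np.array([58, 105, 88, 118, 117, 137, 157, 169, 149, 202], dtype=float)

n = len(x)
beta0, beta1 = 60.0, 5.0                       # OLS estimates from Problem 10.7
sigma_hat = np.sqrt(np.sum((y - (beta0 + beta1 * x))**2) / (n - 2))
se_beta1 = sigma_hat / np.sqrt(np.sum((x - x.mean())**2))

t_stat = (beta1 - 0) / se_beta1
t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 2)   # critical value at alpha = 0.05
print(round(t_stat, 2), abs(t_stat) > t_crit)
```

Since |t| ≈ 8.62 far exceeds t0.025,8 ≈ 2.306, H0 : β1 = 0 is rejected.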
p-value Approach
• Calculate the p-value associated with each calculated t-statistic.
• It is computed as:

ŷ* ± t α/2,(n−2) · se(ŷ*)

where ŷ* is the predicted value of Y for a given x*, tα/2 is the critical value of the t-distribution, n is the number of observations, and

se(ŷ*) = σ̂ √( 1/n + (x* − x̄)² / Σ (xi − x̄)² )

is the standard error of the predicted value.
For the restaurant data, with x* = 10, x̄ = 14, Σ (xi − x̄)² = 568, and σ̂ = 13.829:

se(ŷ*) = 13.829 √( 1/10 + (10 − 14)²/568 ) = 13.829 √0.1282 = 4.95

With ŷ* = 60 + 5 × 10 = 110 and a margin of error of t0.025,8 × 4.95 = 2.306 × 4.95 = 11.4147, the 95% confidence interval for the expected sales is

110 ± 11.4147
Figure 10.10: Confidence and prediction intervals for sales Y at given values of student population X
• It is wider than the Confidence Interval for the mean response E(Y |X =
x∗ ) because it accounts for the variability of individual observations
around the regression line.
• It is computed as:

ŷ* ± t α/2,(n−2) · spred

where

s²pred = σ̂² + se(ŷ*)² and hence spred = σ̂ √( 1 + 1/n + (x* − x̄)² / Σ (xi − x̄)² )

is the standard error of the prediction and ŷ* is the predicted value of Y for a given x*.
computed as follows

spred = 13.829 √( 1 + 1/10 + (10 − 14)²/568 ) = 13.829 √1.1282 = 14.69

The 95% prediction interval for quarterly sales for the Shorshe Ilish restaurant located near campus uses t α/2,(n−2) = t0.025,8 = 2.306. Thus, with ŷ* = 110 and a margin of error of t0.025 × spred = 2.306 × 14.69 = 33.875, the 95% prediction interval is

110 ± 33.875

In dollars, this prediction interval is $76,125 to $143,875.
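Both intervals can be reproduced from first principles (a sketch following the formulas above):

```python
import numpy as np
from scipy import stats

x = np.array([2, 6, 8, 8, 12, 16, 20, 20, 22, 26], dtype=float)
y = np.array([58, 105, 88, 118, 117, 137, 157, 169, 149, 202], dtype=float)

n = len(x)
x_star = 10.0
y_star = 60 + 5 * x_star                       # point prediction at x* = 10
sigma_hat = np.sqrt(np.sum((y - (60 + 5 * x))**2) / (n - 2))
sxx = np.sum((x - x.mean())**2)
t_crit = stats.t.ppf(0.975, df=n - 2)

# Confidence interval for the mean response E(Y | X = x*)
se_mean = sigma_hat * np.sqrt(1 / n + (x_star - x.mean())**2 / sxx)
ci = (y_star - t_crit * se_mean, y_star + t_crit * se_mean)

# Prediction interval for an individual response at x*
se_pred = sigma_hat * np.sqrt(1 + 1 / n + (x_star - x.mean())**2 / sxx)
pi = (y_star - t_crit * se_pred, y_star + t_crit * se_pred)
print(ci, pi)
```

The prediction interval strictly contains the confidence interval at every x*, reflecting the extra variability of an individual observation.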
Confidence intervals and prediction intervals become more precise as the value of the independent variable x* approaches x̄. The typical shapes of confidence intervals and the broader prediction intervals are illustrated together in Figure 10.11.
Figure 10.11: Confidence and prediction intervals for sales Y at given values of
student population X
10.8.16 Exercises
1. What is meant by regression analysis? Distinguish between correlation
and regression analysis.
2. Define and describe the main types of regression analysis (e.g., simple linear regression, multiple linear regression). Provide a real-world example for each type and explain why that particular type of regression would be used.
3. Match the following scenarios with the appropriate type of regression
analysis:
5. Consider the following sample of production volumes and total cost data
for a manufacturing operation.
Production Volume (units)   Total Cost ($)
700                         6400
750                         7000
550                         5500
615                         6000
(i). Use these data to develop an estimated regression equation that could be used to predict the total cost for a given production volume. Interpret the values of the regression intercept and slope coefficients. The company's production schedule shows 500 units must be produced next month. Predict the total cost for this operation.
(ii). Compute the coefficient of determination. What percentage of
the variation in total cost can be explained by production vol-
ume?
• βbyx = 0.7
• βbyx = 0.5
• Standard deviation of x, sx = 3
• Standard deviation of y, sy = 6
Determine the Pearson correlation coefficient r using the given regression coefficients and standard deviations.
9. If βbxy = 1.2 and βbyx = 0.9, find the value of r2 and compare it with the
product of the regression coefficients.
10. For a simple linear regression model, you have the following sums of squares:
• Total sum of squares (SST) = 300
• Regression sum of squares (SSR) = 180
Calculate the residual sum of squares (SSres ) and discuss its relation to
the total and regression sum of squares.
11. Suppose the Pearson correlation coefficient between two variables is −0.6.
If the standard deviations are sx = 5 and sy = 10, calculate:
(a) βbxy
(b) βbyx
Discuss how a negative correlation affects the regression coefficients com-
pared to a positive correlation.
• βbyx = −1.1
Calculate the geometric mean of these coefficients and confirm if it matches
the absolute value of the Pearson correlation coefficient.
13. Given the following dataset, perform a simple linear regression analysis:
14. Discuss the key assumptions of the Classical Simple Linear Regression
(CSLR) model. For each assumption, provide an example of a potential
violation and explain how it could affect the results of the regression
analysis.
15. Using the dataset from Exercise 13, perform the following:
(a) Interpret the meaning of the intercept (β0 ) and slope (β1 ).
(b) Explain what the coefficient values tell you about the relationship
between X and Y .
17. Calculate the estimated error variance and the standard error of the es-
timate for the dataset used in Exercise 13. Show all steps and formulas
used in your calculations.
18. For the regression analysis in Exercise 13, compute the coefficient of de-
termination (R2 ). Interpret the value of R2 in the context of the given
data.
19. Explain the relationship between the coefficient of determination R2 and
the correlation coefficient rxy . If rxy is 0.8, what is R2 and what does it
signify about the regression model?
20. List and discuss the advantages and disadvantages of using R2 as a mea-
sure of goodness-of-fit in regression analysis.
21. Given a multiple regression model with 3 predictors, calculate the adjusted
R2 if the R2 is 0.85, the sample size is 50, and the number of predictors
is 3. Interpret the result.
22. Write a Python script using pandas and statsmodels to perform a simple
linear regression analysis on a dataset of your choice. The script should:
23. Given the regression output where Ŷ = 1+2X, construct a 95% confidence
interval for β1 (slope) if the standard error of β1 is 0.5. Also, perform a
hypothesis test to determine if β1 is significantly different from 0.
24. For the regression analysis provided in Exercise 13, perform an ANOVA
F-test to determine if the overall regression model is significant. Explain
your decision rule and interpret the result.
25. Perform a hypothesis test to determine whether β1 is significantly different
from 0 in a regression model where the estimated β1 is 2 and the standard
error of β1 is 0.4. Use a significance level of 0.05.
26. Calculate and interpret the confidence interval for the expected value
E(Y |X = x) using the dataset from Exercise 13.
The multiple linear regression model is

yi = β0 + β1 x1i + β2 x2i + · · · + βp xpi + ei ,  i = 1, 2, . . . , n

where,
• yi is the dependent variable (response variable) for the ith observation.
• x1i , x2i , . . . , xpi are the independent variables for the ith observation.
• ei is the random error term for the ith observation.
Assumptions
1. Linearity: The relationship between the dependent variable yi and the
independent variables x1i , x2i , . . . , xpi is linear.
2. Independence of Errors: The error terms ei are independent of each
other.
3. Homoscedasticity: The variance of the error terms ei is constant for all
values of the independent variables.
4. Normality of Errors: The error terms ei are normally distributed with
mean zero.
5. No Perfect Multicollinearity: There is no perfect linear relationship among the independent variables.
6. No Autocorrelation: The errors (e) are not correlated with each other
over time or across observations.
These assumptions are essential for valid estimation and interpretation of
the classical regression model.
In matrix notation, the model is Y = Xβ + e, where

Y = (y1, y2, . . . , yn)ᵀ,  β = (β0, β1, . . . , βp)ᵀ,  e = (e1, e2, . . . , en)ᵀ,

and the design matrix is

    | 1  x11  x21  . . .  xp1 |
X = | 1  x12  x22  . . .  xp2 |
    | .   .    .   . . .   .  |
    | 1  x1n  x2n  . . .  xpn |

so that

e = Y − Xβ
Calculation of Estimated Coefficients
The sum of squared residuals is given by:

SSE = eᵀe = (Y − Xβ)ᵀ(Y − Xβ)

To minimize SSE, we take the derivative with respect to β and set it to zero:

∂SSE/∂β = −2Xᵀ(Y − Xβ) = 0

Solving for β, we get βb = (XᵀX)⁻¹XᵀY. Hence, the estimator of β is

βb = (XᵀX)⁻¹XᵀY

This equation provides the estimated coefficients βb that minimize the sum of squared residuals.
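As a minimal sketch of this closed-form solution (with a small made-up dataset; the variable names are my own), the estimator can be computed directly in NumPy:

```python
import numpy as np

# Small synthetic dataset: intercept column plus two predictors
X = np.array([[1.0, 2.0, 5.0],
              [1.0, 3.0, 4.0],
              [1.0, 5.0, 6.0],
              [1.0, 7.0, 3.0],
              [1.0, 8.0, 7.0]])
y = np.array([10.0, 12.0, 18.0, 22.0, 27.0])

# OLS estimator: beta_hat = (X^T X)^{-1} X^T y, via a linear solve
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# The residuals are orthogonal to the columns of X, which is exactly
# the normal-equation condition X^T (y - X beta_hat) = 0
residuals = y - X @ beta_hat
print(beta_hat)
print(X.T @ residuals)  # numerically close to the zero vector
```

In practice, `np.linalg.lstsq` (or a library such as statsmodels) is preferred over forming (XᵀX)⁻¹ explicitly, for numerical stability.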
Mean of the OLS Estimator
Substituting Y = Xβ + e into the estimator:

βb = (XᵀX)⁻¹XᵀY
   = (XᵀX)⁻¹Xᵀ(Xβ + e)
   = (XᵀX)⁻¹XᵀXβ + (XᵀX)⁻¹Xᵀe
   = β + (XᵀX)⁻¹Xᵀe

Taking the expectation:

E[βb] = E[β + (XᵀX)⁻¹Xᵀe]
      = β + (XᵀX)⁻¹XᵀE[e]
      = β + (XᵀX)⁻¹Xᵀ0
      = β

So, the mean of the OLS estimator is:

E[βb] = β
Variance of the OLS Estimator
Since βb = β + (XᵀX)⁻¹Xᵀe and Var(e) = σ²I, the covariance matrix of βb is, using the properties of variance,

Var(βb) = Var(β + (XᵀX)⁻¹Xᵀe)
        = (XᵀX)⁻¹Xᵀ Var(e) X(XᵀX)⁻¹
        = σ²(XᵀX)⁻¹XᵀX(XᵀX)⁻¹
        = σ²(XᵀX)⁻¹
In summary:
• The mean of the OLS estimator is E[βb] = β.
• The variance of the OLS estimator is Var(βb) = σ²(XᵀX)⁻¹.
10.9.6 Coefficient of Determination
The coefficient of determination (R2 ) measures the proportion of the variance
in the dependent variable (Y ) that is explained by the independent variables
(x1 , x2 , . . . , xp ) in the regression model.
The formula is

R² = 1 − SSE/SST

where:
• SSE = (Y − Xβb)ᵀ(Y − Xβb) is the sum of squared errors (residuals), representing the unexplained variability in the dependent variable.
• SST = (Y − Ȳ)ᵀ(Y − Ȳ) is the total sum of squares, representing the total variability in the dependent variable.
Interpretation: An R² close to 1 indicates that the model explains most of the variability in the dependent variable, while a value close to 0 indicates that it explains very little.
10.9.7 Adjusted R²
• The adjusted R² is denoted by R̄² and is defined as

R̄² = 1 − (1 − R²) · (n − 1)/(n − p − 1)

where p is the total number of independent variables in the model (not including the constant term), and n is the sample size.
Hence,

R̄² = 1 − [SSE/(n − p − 1)] / [SST/(n − 1)]
Interpretation:
• R̄² penalizes the addition of unnecessary predictors to the model, unlike R².
• Unlike R², R̄² can decrease (and can even be negative) when a predictor that adds little explanatory power is included.
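A tiny numeric sketch of these formulas (the sums of squares and sample sizes below are made up for illustration):

```python
# Hypothetical values: SST = 300, SSE = 120, n = 50 observations, p = 3 predictors
SST, SSE = 300.0, 120.0
n, p = 50, 3

r2 = 1 - SSE / SST
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(round(r2, 4))      # 0.6
print(round(adj_r2, 4))  # 0.5739 -- slightly below R^2, as expected
```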
 i   x1i  x2i  yi
 1    9   16   10
 2   13   14   12
 3   11   10   14
 4   11    8   16
 5   14   11   18
 6   15   17   20
 7   16    9   22
 8   20   16   24
 9   15   12   26
10   15   12   28
To fit a multiple regression model, we use the least squares method to estimate the coefficients β0, β1, and β2.
The model equation is:
yi = β0 + β1 x1i + β2 x2i + ei
We calculate the estimated coefficients βb using the formula:
βb = (X T X)−1 X T Y
where X is the design matrix of independent variables, Y is the vector of
the dependent variable, and βb is the vector of estimated coefficients.
Calculation of βb
For the given dataset, we have:

    | 1   9  16 |
    | 1  13  14 |
    | 1  11  10 |
    | 1  11   8 |
X = | 1  14  11 | ;   Y = (10, 12, 14, 16, 18, 20, 22, 24, 26, 28)ᵀ
    | 1  15  17 |
    | 1  16   9 |
    | 1  20  16 |
    | 1  15  12 |
    | 1  15  12 |
Now, let's calculate XᵀX, (XᵀX)⁻¹, XᵀY, and βb.

Calculation of XᵀX

       |  10   139   125 |
XᵀX =  | 139  2019  1757 |
       | 125  1757  1651 |
Calculation of βb
Using matrix inversion, we find:

              1                  |  3.369  −0.135  −0.112 |
(XᵀX)⁻¹ = --------- adj(XᵀX)  =  | −0.135   0.012  −0.003 |
          det(XᵀX)               | −0.112  −0.003   0.012 |

Now, we calculate

βb = (XᵀX)⁻¹XᵀY = (2.821, 1.591, −0.475)ᵀ
The last rows of the fitted-value table are shown below (the remaining rows are computed in the same way):

  i    x1i  x2i  yi    ybi      (yi − ybi)²
  8    20   16   24    27.043    9.2598
  9    15   12   26    20.988   25.1201
 10    15   12   28    20.988   49.1681
Total  139  125  190   190      119.5229
• The error variance estimate, σb², is calculated as:

  σb² = (1/(n − p − 1)) Σᵢ (yi − ybi)² = 119.5229/(10 − 2 − 1) = 17.0747
• Coefficient of determination:

  R² = 1 − SSE/SST = 1 − 119.5229/330 = 0.6378

• Adjusted coefficient of determination:

  R̄² = 1 − (1 − R²)(n − 1)/(n − p − 1) = 1 − (1 − 0.6378)(10 − 1)/(10 − 2 − 1) = 0.5343
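The hand calculation above can be reproduced with NumPy, using the ten observations from the table (a verification sketch):

```python
import numpy as np

x1 = np.array([9, 13, 11, 11, 14, 15, 16, 20, 15, 15], dtype=float)
x2 = np.array([16, 14, 10, 8, 11, 17, 9, 16, 12, 12], dtype=float)
y = np.array([10, 12, 14, 16, 18, 20, 22, 24, 26, 28], dtype=float)

# Design matrix with an intercept column
X = np.column_stack([np.ones(len(y)), x1, x2])

# beta_hat = (X^T X)^{-1} X^T y, computed via a linear solve
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(np.round(beta_hat, 3))  # approximately [2.821, 1.591, -0.475]

# Error variance, R^2 and adjusted R^2
resid = y - X @ beta_hat
sse = resid @ resid
sst = ((y - y.mean()) ** 2).sum()
n, p = len(y), 2
sigma2_hat = sse / (n - p - 1)
r2 = 1 - sse / sst
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(sigma2_hat, 2), round(r2, 4), round(adj_r2, 4))
```

Small differences in the last digit relative to the hand computation come from rounding the inverse matrix.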
10.9.10 ANOVA Table in Regression Analysis
• The ANOVA table in multiple regression assesses the overall significance of the regression model; it tests

  H0 : β1 = β2 = · · · = βp = 0

• It partitions the total variance in the dependent variable into explained variance and unexplained variance.
• The table includes sums of squares (SS), degrees of freedom (df), mean squares (MS), and the F-test statistic.
Source of Variation   SS    df         MS                      F
Regression            SSR   p          MSR = SSR/p             F = MSR/MSE
Residual (Error)      SSE   n − p − 1  MSE = SSE/(n − p − 1)
Total                 SST   n − 1
• Decision Rule:
■ If calculated F > Fcritical , reject H0 and conclude that the
regression model is statistically significant.
p-value Approach
• Calculate the p-value associated with the calculated F -statistic.
• Decision Rule:
■ If p-value < α, reject H0 and conclude that the regression model is statistically significant.
■ If p-value ≥ α, fail to reject H0 and conclude that the regression model is not statistically significant.
For the worked example above, the ANOVA table is:

            df   SS        MS        F       Significance F
Regression   2   210.4651  105.2326  6.1625  0.0286
Residual     7   119.5349   17.0764
Total        9   330
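The F statistic and its p-value in this table can be reproduced from the sums of squares with scipy (a quick check):

```python
from scipy import stats

# Sums of squares and degrees of freedom from the ANOVA table above
ssr, sse = 210.4651, 119.5349
df_reg, df_res = 2, 7

msr = ssr / df_reg  # mean square for regression
mse = sse / df_res  # mean square for error
F = msr / mse

# Upper-tail p-value from the F(2, 7) distribution
p_value = stats.f.sf(F, df_reg, df_res)
print(round(F, 4), round(p_value, 4))  # about 6.1625 and 0.0286
```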
• Each t-test tests the null hypothesis that the corresponding coefficient is zero:

  t = βbi / SE(βbi)

  where βbi is the estimated coefficient, and SE(βbi) is its standard error.
Classical Approach
• Set the significance level α.
• Compute t for each coefficient and compare |t| with the critical value t(α/2, n − p − 1); reject H0 if |t| exceeds it.
p-value Approach
• Decision Rule for each βbi :
■ If p-value < α, reject H0 and conclude that the correspond-
ing coefficient βbi is statistically significant.
■ If p-value ≥ α, fail to reject H0 and conclude that the cor-
responding coefficient βbi is not statistically significant.
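As a sketch with made-up numbers, the two-sided p-value for one coefficient comes from the t distribution with n − p − 1 degrees of freedom:

```python
from scipy import stats

# Hypothetical output: beta_hat_i = 2.0 with SE(beta_hat_i) = 0.4,
# from a simple regression with n = 25 observations and p = 1 predictor
beta_hat_i, se_i, n, p = 2.0, 0.4, 25, 1
dof = n - p - 1

t_stat = beta_hat_i / se_i
p_value = 2 * stats.t.sf(abs(t_stat), dof)  # two-sided p-value

alpha = 0.05
print(t_stat, p_value < alpha)  # 5.0 True -> reject H0: beta_i = 0
```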
 i   Student Population (1000s, x1)   Advertising Budget ($1000s, x2)   Quarterly Sales ($1000s, y)
 …               …                                  …                              …
 8              20                                 26                             169
 9              22                                 24                             149
10              26                                 30                             202
(i) Using the above data, formulate a multiple linear regression model where
the quarterly sales (y) is the dependent variable, and both the student
population (x1 ) and the advertising budget (x2 ) are independent variables.
(ii) Write down the assumptions of the multiple linear regression model.
(iii) Estimate the regression coefficients for the model formulated in Question
1 using the least squares method. Interpret the coefficients for both the
student population and advertising budget.
(iv) Based on the regression model obtained in part (a), what would be the es-
timated quarterly sales for a restaurant located near a campus with 16,000
students and an advertising budget of $20,000?
(v) Calculate the standard error of the estimates for the model obtained in
Question (iii).
(vi) How would you evaluate the goodness-of-fit for the regression model? What statistical metrics would you consider, and why?
(vii) Create scatter plots to show the relationship between:
• Student population and quarterly sales.
• Advertising budget and quarterly sales.
Overlay the regression lines on these plots and comment on the observed relationships.
(viii) If a new restaurant is to be established near a campus with a student population of 18,000 and an advertising budget of $25,000, use the regression model from part (iii) to predict the quarterly sales.
(ix) Discuss the limitations of using the linear regression model in this context.
What are some factors not considered by the model that could affect the
accuracy of your predictions?
(x) Test the significance of each regression coefficient at the 5% significance level. Clearly state the null and alternative hypotheses, the test statistic, and your conclusion.
(xi) Construct a 95% confidence interval for the coefficients of the student
population and advertising budget. Interpret the intervals.
Solution
(i).
The multiple linear regression model can be written as:
y = β0 + β1 x 1 + β2 x 2 + e
where:
• y is the quarterly sales in $1000s.
• x1 is the student population in 1000s.
• x2 is the advertising budget in $1000s.
• e is the random error term.
(ii).
The assumptions of the multiple linear regression model are:
• Linearity: The relationship between the independent variables and the
dependent variable is linear.
• Independence: The observations are independent of each other.
• Homoscedasticity: The error variance is constant across observations.
• Normality: The errors are normally distributed with mean zero.
(iii).
After fitting the regression model to the data, the estimated coefficients are:
yb = βb0 + βb1 x1 + βb2 x2
= 38.62 + 0.97x1 + 4.11x2
Interpretation:
• The intercept βb0 = 38.62 suggests that when both the student population and advertising budget are zero, the estimated quarterly sales are $38,620 (an extrapolation with little practical meaning).
• The coefficient βb1 = 0.97 suggests that for each additional 1000 stu-
dents, quarterly sales increase by $970, holding the advertising budget
constant.
• The coefficient βb2 = 4.11 suggests that for each additional $1000 spent
on advertising, quarterly sales increase by $4110, holding the student
population constant.
(iv).
Substituting x1 = 16 and x2 = 20 into the estimated regression equation:
yb = 38.62 + 0.97(16) + 4.11(20) = 141.30
Therefore, the estimated quarterly sales would be $141,300.
(v).
The standard error of the estimate is calculated as:

Standard Error = sqrt( Σ(yi − ybi)² / (n − p − 1) )
(vi).
The goodness-of-fit of the regression model can be evaluated using the following metrics:
• R² = 0.944: Represents the proportion of variance in the dependent variable that is explained by the independent variables. A higher R² indicates a better fit.
• Adjusted R²: Adjusts R² for the number of predictors and guards against overfitting.
• Standard error of the estimate: Measures the typical size of the residuals.
(vii).
• The scatter plot of student population vs. quarterly sales shows a positive linear relationship, indicating that as the student population increases, quarterly sales also increase.
• The scatter plot of advertising budget vs. quarterly sales also shows a positive linear relationship, indicating that an increase in the advertising budget leads to higher quarterly sales.
(viii).
Substituting x1 = 18 and x2 = 25 into the estimated regression equation:

yb = 38.62 + 0.97(18) + 4.11(25) = 158.83

Therefore, the predicted quarterly sales would be $158,830.
(ix).
Limitations of the linear regression model in this context include:
• The model assumes a linear relationship between the variables, which
may not always hold true.
• It does not account for potential interactions between the student pop-
ulation and advertising budget.
• The model does not consider external factors like economic conditions,
competition, or seasonal variations that could impact sales.
• The model assumes that the relationships are constant over time, which
might not be the case in reality.
(x).
For each coefficient:
• Null Hypothesis (H0): The coefficient is equal to zero (βi = 0).
• Alternative Hypothesis (HA): The coefficient is not equal to zero (βi ≠ 0).
Conclusion:
• The intercept is statistically significant (p < 0.05).
(xi).
Using the regression output, the 95% confidence intervals are:
• For student population (β1): −3.337 ≤ β1 ≤ 5.283. Since this interval includes zero, the effect of student population on sales is not statistically significant at the 5% level.
• For advertising budget: We are 95% confident that the true coefficient lies within the interval [−0.140, 8.369]. Since this interval also includes zero, it suggests that the effect of advertising budget on sales may not be significant.
data = {
    'Student Population (1000s)': [...],
    'Advertising Budget ($1000s)': [..., 25, 26, 24, 30],
    'Quarterly Sales ($1000s)': [58, 105, 88, 118, 117, 137,
                                 157, 169, 149, 202]
}

# Convert the data to a DataFrame
df = pd.DataFrame(data)
plt.subplot(1, 2, 2)
plt.scatter(df['Advertising Budget ($1000s)'], y, color='green')
plt.plot(df['Advertising Budget ($1000s)'],
         model.params[0]
         + model.params[1] * df['Student Population (1000s)'].mean()
         + model.params[2] * df['Advertising Budget ($1000s)'],
         color='red')
plt.title('Quarterly Sales vs. Advertising Budget')
plt.xlabel('Advertising Budget ($1000s)')
plt.ylabel('Quarterly Sales ($1000s)')
plt.grid(True)

plt.tight_layout()
plt.show()
10.9.13 Exercises
1. Explain the difference between simple linear regression and multiple linear
regression. How does the addition of more predictors in multiple linear
regression improve the model?
2. What assumptions must be met for multiple linear regression analysis to
be valid? Explain why each assumption is important.
3. What is R² in the context of multiple linear regression? How is it different from the adjusted R²?
Table 10.7: Model summary
Statistic               Value
R-squared               0.76
Adjusted R-squared      0.74
F-statistic             38.42
p-value (F-statistic)   0.0001
given in Table 10.8? Describe the assumptions of multiple regression analysis. What diagnostic plots can be used to check these assumptions?
(ii). Explain the R-squared value given in Table 10.7, which indicates
the model’s performance. How does the adjusted R-squared value
improve on this interpretation?
(iii). Given the model summary in Table 10.8, write the estimated model and
(a) Interpret the coefficient of X1 .
(b) What does the intercept term represent in this context?
(iv). Based on the p-values of the coefficients in Table 10.8, identify which
predictors are statistically significant at the 5% significance level and
explain why.
(v). Calculate the predicted value of Y when X1 = 4, X2 = 2, and
X3 = 3. Show all steps in your calculation.
(vi). The F -statistic and its corresponding p-value are given in the model
summary. Explain what these values indicate about the overall
model.
• Normality of Residuals: The residuals (errors) of the model should
be approximately normally distributed.
These residuals can be regarded as the ‘observed’ errors. They are not the same
as the unknown true errors
ei = yi − E(yi ) for i = 1, 2, . . . , n.
If the model is appropriate for the data at hand, the residuals should reflect the properties assumed for ei (i.e., independence, normality, zero mean, and constant variance). For diagnostic purposes, we sometimes use the semistudentized residual, which is defined as

ei* = ei / sqrt(MSE),

where MSE is the mean squared error of the fitted model.
Residual plots are a crucial tool for diagnosing issues in a regression model.
A residual plot is a scatterplot of the residuals (errors) on the y-axis and the
predicted values or one of the independent variables on the x-axis. The following
plots of residuals (or semistudentized residuals) will be utilized here for this purpose:
1. Plot of residuals against predictor variable.
2. Plot of absolute or squared residuals against predictor variable.
3. Plot of residuals against fitted values.
4. Plot of residuals against time or outlier sequence.
5. Plots of residuals against omitted predictor variables.
6. Box plot of residuals.
7. Normal probability plot of residuals.
A few questions to consider when you analyze plots
1. Do the residuals follow any pattern indicating nonlinearity?
2. Are there any outliers?
3. Does the assumption of constant variance look correct?
4. Label any qualitative variables on the plot. Any patterns?
If the residuals display a systematic pattern (e.g., curvature), this indicates that the relationship between the independent and dependent variables may not be linear.
and provides a foundation for interpreting the results of more complex analyses,
such as regression or correlation.
Figure 10.13: Scatter Plot and Residual Plot for Understanding a Nonlinear Regression Function
Plot of the Residuals versus x
A plot of the residuals versus x (see Figure 10.13(b)) is more effective in detecting nonlinearity than a scatter plot (e.g., nonlinearity may be more prominent on a residual plot than on a scatter plot). It can also indicate other forms of model departure besides nonlinearity (e.g., non-constancy of variance); see Figure 10.11. Any observable pattern in the plot of the residuals versus x indicates a problem with the model assumptions!
Plot of the Residuals versus yb
For simple linear regression (with a single predictor), the residuals versus yb plot contains the same information as the plot of residuals versus x, but on a different scale. For multiple linear regression, this plot allows us to examine patterns in the residuals as yb increases. Ideally, there should be no systematic patterns.
Figure 10.15: Residual Time Sequence Plots Illustrating Nonindependence of
Error Terms.
Figure 10.17: Normal Probability Plots when the Error Term Distribution Is Not Normal.
Departures from a straight line in the normal probability plot can signal potential issues with the model, such as the presence of outliers, skewness, or other departures from the assumption of normality.
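The "straightness" of a normal probability plot can also be quantified: scipy's probplot returns the correlation between the ordered residuals and the corresponding normal quantiles, which should be close to 1 under normality (a sketch with simulated residuals):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
normal_resid = rng.normal(0, 2, 100)          # well-behaved residuals
skewed_resid = rng.exponential(2, 100) - 2.0  # right-skewed residuals

# probplot returns ((theoretical, ordered), (slope, intercept, r))
_, (_, _, r_normal) = stats.probplot(normal_resid)
_, (_, _, r_skewed) = stats.probplot(skewed_resid)
print(round(r_normal, 3), round(r_skewed, 3))  # r_normal is the larger
```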
Added-Variable plot
An Added-Variable plot, also known as a partial regression plot, is a diagnos-
tic tool used in multiple linear regression to assess the relationship between a
specific predictor variable and the response variable, while accounting for the
influence of other predictors in the model. The Added-Variable plot is created
by plotting the residuals from regressing the response variable on the other
predictors against the residuals from regressing the specific predictor on the
same set of predictors. This helps in identifying the unique contribution of the
predictor of interest after removing the effects of other variables in the model.
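This construction can be checked numerically: by the Frisch-Waugh-Lovell theorem, the slope of the added-variable regression equals the coefficient of that predictor in the full multiple regression. A sketch with simulated data (the variable names are my own):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)  # predictors are correlated
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(size=n)

def ols_resid(target, x):
    """Residuals from regressing `target` on x (with an intercept)."""
    X = np.column_stack([np.ones(len(x)), x])
    beta = np.linalg.lstsq(X, target, rcond=None)[0]
    return target - X @ beta

# Added-variable plot coordinates for x2:
# residuals of y ~ x1 against residuals of x2 ~ x1
ry = ols_resid(y, x1)
rx = ols_resid(x2, x1)
av_slope = (rx @ ry) / (rx @ rx)

# Coefficient of x2 in the full regression y ~ x1 + x2
X_full = np.column_stack([np.ones(n), x1, x2])
beta_full = np.linalg.lstsq(X_full, y, rcond=None)[0]
print(av_slope, beta_full[2])  # the two values agree
```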
• Outlier Identification
• Lack-of-Fit Test

Tests for Normality of the Error Terms
Formal tests of the normality of the error terms include:
• Chi-Square test
• Shapiro-Wilk test
• Kolmogorov-Smirnov test
• Lilliefors test
For each of these tests, the p-value determines whether the null hypothesis of normality can be rejected; a small p-value (typically less than 0.05) indicates that the data significantly deviate from normality. The Shapiro-Wilk test is particularly powerful for detecting departures from normality in small to moderately sized samples, making it a widely used tool in statistical analysis. Using a hypothetical dataset, the Python code for the Shapiro-Wilk test is explained in the next section.
company policy. The conditions during this period were expected to remain
consistent for the next three years.
Lot Size   Work Hours     Lot Size   Work Hours
   90         389             20         113
  110         435            100         420
   30         212             50         268
   90         377            110         421
   30         273             90         468
   40         244             80         342
   70         323
# Obtain residuals
residuals = model.resid

# Shapiro-Wilk test on the residuals (stats is scipy.stats)
stat, p_value = stats.shapiro(residuals)

alpha = 0.05
if p_value > alpha:
    print("Residuals look Gaussian (fail to reject H0)")
else:
    print("Residuals do not look Gaussian (reject H0)")
Durbin-Watson Test
The model is:

yi = β0 + β1 xi + ei

where

ei = ρ ei−1 + ui,  |ρ| < 1,  and  ui ~ N(0, σ²) independent

Hypotheses:

H0 : ρ = 0  versus  HA : ρ > 0

Statistic:

D = Σ_{i=2}^{n} (ei − ei−1)² / Σ_{i=1}^{n} ei²

Decision Rule:
D > dU  ⇒ do not reject H0
D < dL  ⇒ reject H0
dL ≤ D ≤ dU  ⇒ inconclusive
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

LotSize = np.array([..., 90, 40, 80, 70])  # first entries omitted
WorkHrs = np.array([399, 121, 221, 376, 361, 224, 546,
                    352, 353, 157, 160, 252, 389, 113, 435, 420, 212, 268,
                    377, 421, 273, 468, 244, 342, 323])

# Fit the regression (adding a constant for the intercept)
X = sm.add_constant(LotSize)
model = sm.OLS(WorkHrs, X).fit()
print(model.summary())

# Perform the Durbin-Watson test
durbin_watson_statistic = durbin_watson(model.resid)
print("Durbin-Watson statistic:", durbin_watson_statistic)
A further assumption to check is homoscedasticity: whether the error variance is constant across levels of the predictor variables. One common test is the Breusch-Pagan test. The procedure for this test is as follows:
Breusch-Pagan Test
• Requires ei to be independent and normally distributed.
log σi2 = γ0 + γ1 xi
Regress the squared residuals, ei², against xi and obtain SSR* from this regression.
Hypotheses:

H0 : γ1 = 0  versus  HA : γ1 > 0

Statistic:

χ²_BP = (SSR*/2) / (SSE/n)²    (10.14)

where SSR* is the regression sum of squares when regressing eb² on x and SSE is the error sum of squares when regressing Y on x. Under H0, the test statistic follows a χ²-distribution.
Alternatively, the White test can be used, which is robust to various forms
of heteroscedasticity and follows a similar procedure but does not assume a
specific functional form of the variance.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# (LotSize, WorkHrs and the fitted model as in the Durbin-Watson example)
test_results = het_breuschpagan(model.resid, model.model.exog)

# Extract results
bp_test_statistic = test_results[0]
bp_test_p_value = test_results[1]

print(f'Breusch-Pagan statistic: {bp_test_statistic}')
print(f'Breusch-Pagan p-value: {bp_test_p_value}')
The Brown-Forsythe test is used to test the null hypothesis that the vari-
ances of different groups are equal. It is a modification of Levene’s test, where
the median is used instead of the mean to make it more robust to deviations
from normality.
Procedure:
1. Arrange the residuals by increasing values of x and split them into two groups (e.g., low-x and high-x).
2. Within each group, compute the absolute deviations of the residuals from the group median.
3. Compare the two groups of absolute deviations with a two-sample t-test (equivalently, Levene's test with the median as center).
Group 1
 i   Run   Lot Size   Residual (eil)    dil     (dil − d̄1)²
 …    …       …            …             …           …
12   12      70         −60.28         40.40       19.49
13   25      70          10.72         30.60      202.07
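In practice this comparison is usually delegated to scipy's Levene test with center='median', which is exactly the Brown-Forsythe variant (a sketch; the two residual groups below are hypothetical):

```python
import numpy as np
from scipy import stats

# Hypothetical residuals split into a low-x group and a high-x group
group1 = np.array([-19.9, 10.7, -60.3, 5.2, 14.8, -3.1, 22.4, -12.6])
group2 = np.array([-35.0, 48.2, -51.7, 60.1, -44.9, 38.5, -58.3, 41.0])

# center='median' makes Levene's test the Brown-Forsythe test
stat, p_value = stats.levene(group1, group2, center='median')
print(round(stat, 3), round(p_value, 3))
```

A small p-value would suggest that the residual spread differs between the groups, i.e., non-constant error variance.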
# Convert to DataFrame
df = pd.DataFrame({'LotSize': LotSize, 'WorkHrs': WorkHrs})
Cook's Distance
Cook's distance, Di, measures the influence of the ith observation on all fitted values:

Di = Σ_{j=1}^{n} (ŷj − ŷj(i))² / (p · MSE)

where ŷj is the jth fitted value, ŷj(i) is the jth fitted value with the ith observation removed, p is the number of predictors in the model, and MSE is the mean squared error of the model.
Influential Observations
Influential observations are data points that have a large impact on the esti-
mated coefficients of the regression model. They can significantly alter the fit
of the model if removed. Influential observations are identified using Cook’s
distance, where observations with Cook’s distance greater than 4/n (where n
is the number of observations) are considered influential.
Outliers
Outliers are data points that deviate significantly from the rest of the data.
They can affect the regression model’s accuracy and should be investigated
to determine if they are genuine data points or errors. Diagnostic tools such
as leverage plots and Cook’s distance can help identify outliers. Outliers are
identified by selecting observations with Cook’s distance greater than a certain
threshold (here, 4/n).
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

# (LotSize and WorkHrs arrays as defined earlier)

# Create DataFrame
df = pd.DataFrame({'LotSize': LotSize, 'WorkHrs': WorkHrs})

# Add a constant column for intercept
X = sm.add_constant(df['LotSize'])
y = df['WorkHrs']

model = sm.OLS(y, X).fit()

# Cook's distance for each observation
influence = model.get_influence()
cooks_d = influence.cooks_distance[0]

# Flag observations with Cook's distance above the 4/n threshold
n = len(y)
print('Influential observations:', np.where(cooks_d > 4 / n)[0])
10.10.7 Multicollinearity
Multicollinearity is a common issue in regression analysis. It occurs when predictor variables in a regression model are highly correlated. Multicollinearity can lead to inaccurate estimates of the regression coefficients and their standard errors.
Consider the multiple regression model Y = Xβ + e with OLS estimator

βb = (XᵀX)⁻¹XᵀY

If one feature is a linear combination of other features, then the rank of the design matrix X is less than the number of predictors. That is,

rank(X) < p.

If the rank of X is less than the number of predictors p, then the determinant of XᵀX is close to zero. If one feature is an exact linear combination of other features, then the determinant of XᵀX is exactly zero. Consequently, (XᵀX)⁻¹ does not exist, which implies that the Ordinary Least Squares (OLS) estimator

βb = (XᵀX)⁻¹XᵀY

cannot be computed in such cases. Both βb and Var(βb) = σ²(XᵀX)⁻¹ are then undefined or, in the near-singular case, highly unstable.
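A minimal NumPy sketch makes the breakdown concrete: when one column of X is an exact linear combination of the others, XᵀX is singular and the inversion fails:

```python
import numpy as np

# x2 is exactly 2 * x1, so the predictors are perfectly collinear
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = 2.0 * x1
X = np.column_stack([np.ones(5), x1, x2])

print(np.linalg.matrix_rank(X))  # 2, although X has 3 columns
print(np.linalg.det(X.T @ X))    # (numerically) zero

try:
    np.linalg.inv(X.T @ X)
except np.linalg.LinAlgError as err:
    print('inversion failed:', err)  # singular matrix
```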
The Variance Inflation Factor (VIF) is a measure used to quantify how much the variance of an estimated regression coefficient increases due to multicollinearity. For a given predictor in a multiple regression model, the VIF, denoted VIFj for the jth predictor, is calculated as:

VIFj = 1 / (1 − Rj²)

where Rj² is the coefficient of determination from regressing the jth predictor variable on all the other predictor variables.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Example DataFrame
df = pd.DataFrame({
    'X1': [1, 2, 3, 4, 5],
    'X2': [2, 4, 6, 8, 10],
    'X3': [5, 4, 3, 2, 1]
})

# Compute the VIF for each predictor
X = sm.add_constant(df)
vif = pd.DataFrame({
    'feature': X.columns,
    'VIF': [variance_inflation_factor(X.values, i)
            for i in range(X.shape[1])]
})
print(vif)
# Here X2 = 2*X1 and X3 = 6 - X1, so the predictors are perfectly
# collinear and the VIFs are infinite (or numerically enormous).
Tolerance of Limit (TOL)
The Tolerance of Limit (TOL) is a measure used to quantify how much of the
variance of a predictor is not explained by the other predictors in a regression
model. It is defined as:

TOLj = 1/VIFj = 1 − Rj²

When Rj² = 1 (i.e., perfect collinearity), TOLj = 0, and when Rj² = 0 (i.e., no collinearity whatsoever), TOLj is 1. Because of the intimate connection between VIF and TOL, one can use them interchangeably.
Using the VIF values computed above, the tolerance for each predictor is simply the reciprocal:

tol = 1 / vif['VIF']
print(tol)
6. Centering the Predictors:
• Centering involves subtracting the mean of each predictor from the predictor values.
• This is particularly useful when dealing with polynomial or interaction terms.
7. Drop or Transform Variables:
• Drop variables that do not contribute significantly to the model or transform variables to reduce multicollinearity.
• Conduct feature selection or use transformations to address is-
sues.
10.10.8 Exercises
1. Define multicollinearity. How can it affect the results of a multiple linear
regression model, and how can it be detected?
2. Explain how you would use a residual plot to assess the fit of a multiple
linear regression model. What patterns in the residual plot might suggest
problems with the model?
3. List the key assumptions of linear regression. For each assumption, pro-
vide a brief explanation of why it is important for the validity of the
regression model.
4. Given a dataset with a linear regression model fitted, describe how you
would check each of the assumptions (linearity, independence, homoscedas-
ticity, normality of errors).
5. Explain the Durbin-Watson test and how it is used to test for autocorre-
lation.
6. Explain how to perform the Breusch-Pagan test and the White test for
heteroscedasticity. Interpret the results of these tests and discuss how
they affect the validity of the regression model.
10.11 Concluding Remarks
This chapter has equipped you with essential techniques in correlation and re-
gression analysis, crucial for the practice of data science. By examining scatter
diagrams, covariance, and correlation coefficients, we laid the groundwork for
understanding the relationships between variables in a dataset.
The exploration of regression analysis, including both simple and multiple
linear regression models, has highlighted key aspects such as model assumptions,
coefficient interpretation, and model fit evaluation. These tools are fundamen-
tal for building predictive models, making data-driven decisions, and deriving
actionable insights from complex datasets.
The inclusion of Python code examples throughout the chapter bridges the
gap between theoretical concepts and practical implementation, demonstrating
how to apply these techniques in real-world data science scenarios. Mastering these methods will enhance your ability to analyze data, validate findings, and build reliable predictive models.
 i   Age (x1)   Baseline BP (x2, mm Hg)   BP after 6 weeks (y, mm Hg)
 …      …                …                           …
 6     70              175                         155
 7     75              180                         160
 8     80              185                         165
 9     85              190                         170
10     90              195                         175
(i) Formulate a linear regression model to study the relationship between age (x1) and baseline blood pressure (x2) as independent variables and the blood pressure after 6 weeks (y) as the dependent variable.
(ii) State the assumptions of the linear regression model in the context
of this study.
(iii) Estimate the regression coefficients using the least squares method.
Interpret the coefficients for age and baseline blood pressure.
(iv) Using the model obtained in part (iii), predict the blood pressure after 6 weeks for a 65-year-old patient with a baseline blood pressure of 165 mm Hg.
(v) Discuss the potential impact of multicollinearity between age and
baseline blood pressure on the model’s estimates.
(vi) Calculate the standard error of the estimates and discuss the preci-
sion of the regression coefficients.
(vii) Perform a hypothesis test to determine whether age is a significant
predictor of blood pressure after 6 weeks. Use a 5% significance level.
(viii) Construct a 95% confidence interval for the coefficient of baseline
blood pressure. Interpret the interval.
(ix) Create a residual plot and comment on the model’s assumptions
regarding homoscedasticity and normality of errors.
(x) Discuss the limitations of using this linear regression model for pre-
dicting blood pressure after 6 weeks and suggest possible improve-
ments.
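Parts (iii) and (iv) can be carried out in Python with NumPy's least-squares solver. The sketch below uses small hypothetical data (not the exercise table) purely to show the mechanics; the variable names mirror the exercise.

```python
import numpy as np

# Hypothetical data for illustration only -- not the exercise table.
age = np.array([45, 52, 58, 61, 66, 70, 74, 80], dtype=float)               # x1
baseline = np.array([150, 142, 160, 148, 165, 158, 170, 162], dtype=float)  # x2
y = np.array([138, 135, 151, 144, 159, 155, 166, 164], dtype=float)         # BP after 6 weeks

# Design matrix with an intercept column; lstsq computes the least-squares fit.
X = np.column_stack([np.ones_like(age), age, baseline])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1, b2 = beta  # intercept, age coefficient, baseline-BP coefficient

# Part (iv): predict for a 65-year-old with baseline BP 165 mm Hg.
y_hat = b0 + b1 * 65 + b2 * 165
```

With real data, `statsmodels.api.OLS` would additionally report the standard errors, t-tests, and confidence intervals needed for parts (vi)-(viii).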
i  Fiber content (x1)  Curing temperature (x2)  Tensile strength (y)
3  40                  170                      480
4  45                  180                      490
5  50                  190                      510
6  55                  200                      530
7  60                  210                      540
8  65                  220                      550
9  70                  230                      560
10 75                  240                      570
(i) Develop a multiple linear regression model where tensile strength (y)
is the dependent variable, and fiber content (x1 ) and curing temper-
ature (x2 ) are independent variables.
(ii) Explain the assumptions of the multiple linear regression model in
the context of this engineering problem.
(iii) Estimate the regression coefficients using the least squares method.
DR
500
CHAPTER 10. CORRELATION AND REGRESSION ANALYSIS
(ix) Discuss how changes in fiber content and curing temperature might
interact to affect tensile strength. Could an interaction term be
included in the model?
(x) Analyze the residuals from the model to check for any violations of
the regression assumptions, such as non-linearity or heteroscedastic-
ity.
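The multicollinearity concern in parts (ix) and (x) can be seen numerically. In the rows of the table shown above, curing temperature is an exact linear function of fiber content (x2 = 2·x1 + 90), so the design matrix is rank-deficient. A short NumPy sketch, using only the rows visible here:

```python
import numpy as np

# Rows of the exercise table shown above (fiber content, curing temperature,
# tensile strength); earlier rows are omitted from this sketch.
x1 = np.array([40, 45, 50, 55, 60, 65, 70, 75], dtype=float)
x2 = np.array([170, 180, 190, 200, 210, 220, 230, 240], dtype=float)
y = np.array([480, 490, 510, 530, 540, 550, 560, 570], dtype=float)

X = np.column_stack([np.ones_like(x1), x1, x2])
# x2 = 2*x1 + 90 exactly in these rows, so rank(X) = 2 rather than 3.
rank = np.linalg.matrix_rank(X)

# Least squares still produces unique fitted values (the projection onto the
# column space), so the residual diagnostics of part (x) remain meaningful
# even though the individual coefficients are not identifiable.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
```

With real data that are nearly (rather than exactly) collinear, variance inflation factors are the usual diagnostic.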
3. Suppose age (days), birthweight (oz), and SBP are measured for 16 infants
and the data are as shown in Table 10.12. What is the relationship be-
tween infant systolic blood pressure (SBP) and their age and birthweight?
Can we predict SBP based on these factors?
Table 10.12: Sample data for infant blood pressure, age, and birthweight for 16
infants
i Age (days) (x1 ) Birthweight (oz) (x2 ) SBP (mm Hg) (y)
1  3  135  89
2  4  120  90
3  3  100  83
4  2  105  77
5  4  130  92
6  5  125  98
7  2  125  82
8  3  105  85
9  5  120  96
10 4  90   95
11 2  120  80
12 3  95   79
13 3  120  86
14 4  150  97
15 3  160  92
16 3  125  88
iii. Visually assess the relationship between SBP and each predictor.
Is the relationship linear or nonlinear?
(c) Correlation Analysis:
i. Calculate the Pearson correlation coefficient between Age (x1 )
and SBP (y).
ii. Calculate the Pearson correlation coefficient between Birthweight
(x2 ) and SBP (y).
iii. Interpret the correlation coefficients. Which variable is more
strongly correlated with SBP?
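Part (c) can be sketched with `scipy.stats.pearsonr`, entering the Table 10.12 data directly:

```python
import numpy as np
from scipy.stats import pearsonr

# Table 10.12: age (days), birthweight (oz), and SBP (mm Hg) for 16 infants.
age = np.array([3, 4, 3, 2, 4, 5, 2, 3, 5, 4, 2, 3, 3, 4, 3, 3])
bw = np.array([135, 120, 100, 105, 130, 125, 125, 105, 120, 90,
               120, 95, 120, 150, 160, 125])
sbp = np.array([89, 90, 83, 77, 92, 98, 82, 85, 96, 95,
                80, 79, 86, 97, 92, 88])

r_age, p_age = pearsonr(age, sbp)  # part (c)i: correlation of Age with SBP
r_bw, p_bw = pearsonr(bw, sbp)     # part (c)ii: correlation of Birthweight with SBP
```

Here age turns out to be the more strongly correlated of the two predictors, which answers part (c)iii.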
(d) Simple Linear Regression:
i. Perform a simple linear regression to predict SBP (y) based on
Age (x1 ). Write down the regression equation.
ii. Perform a simple linear regression to predict SBP (y) based on
Birthweight (x2 ). Write down the regression equation.
iii. Interpret the slope of each regression line. What does the slope
tell you about the relationship between each predictor and SBP?
are the potential implications of multicollinearity on your regres-
sion model?
Summarize the relationship between infant systolic blood pressure
(SBP) and the predictors (Age and Birthweight). Based on your
analysis, do you think Age and Birthweight are sufficient to predict
SBP? Suggest other factors that might be important to include in
the model.
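A minimal sketch of the multiple regression itself, fitting SBP on both Age and Birthweight from Table 10.12 by ordinary least squares:

```python
import numpy as np

# Table 10.12 data (age in days, birthweight in oz, SBP in mm Hg).
age = np.array([3, 4, 3, 2, 4, 5, 2, 3, 5, 4, 2, 3, 3, 4, 3, 3], dtype=float)
bw = np.array([135, 120, 100, 105, 130, 125, 125, 105, 120, 90,
               120, 95, 120, 150, 160, 125], dtype=float)
sbp = np.array([89, 90, 83, 77, 92, 98, 82, 85, 96, 95,
                80, 79, 86, 97, 92, 88], dtype=float)

X = np.column_stack([np.ones_like(age), age, bw])
beta, *_ = np.linalg.lstsq(X, sbp, rcond=None)  # intercept, b_age, b_bw

# Coefficient of determination R^2 from fitted values.
fitted = X @ beta
r2 = 1 - np.sum((sbp - fitted) ** 2) / np.sum((sbp - sbp.mean()) ** 2)
```

Both slope estimates come out positive, and together the two predictors explain most of the variation in SBP, which bears on the closing question of whether Age and Birthweight suffice.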
i  Employees (x1, thousands)  R&D expenditure (x2, $ millions)  Annual revenue (y)
3  9                          10                                150
4  14                         20                                200
5  18                         25                                250
6  22                         28                                280
7  25                         30                                320
8  30                         35                                350
9  28                         32                                340
10 35                         40                                400
(i) Using the above data, formulate a multiple linear regression model
where the annual revenue (y) is the dependent variable, and both
the number of employees (x1 ) and the R&D expenditure (x2 ) are
independent variables.
(ii) Write down the assumptions of the multiple linear regression model.
(iii) Estimate the regression coefficients for the model formulated in part
(i) using the least squares method. Interpret the coefficients for
both the number of employees and R&D expenditure.
(iv) Based on the regression model obtained in part (iii), what would be
the estimated annual revenue for a company with 20,000 employees
and an R&D expenditure of $30 million?
(v) Calculate the standard error of the estimates for the model obtained
in part (iii).
(vi) How would you evaluate the goodness-of-fit for the regression model?
What statistical metrics would you consider, and why?
(vii) Create scatter plots to show the relationship between:
• Number of employees and annual revenue.
• R&D expenditure and annual revenue.
Overlay the regression lines on these plots and comment on the ob-
served relationships.
(viii) If a new company is planning to hire 25,000 employees and spend
$35 million on R&D, use the regression model from part (iii) to
predict the annual revenue.
(ix) Discuss the limitations of using the linear regression model in this
context. What are some factors not considered by the model that
could affect the accuracy of your predictions?
(x) Test the significance of each regression coefficient at the 5% signif-
icance level. Clearly state the null and alternative hypotheses, the
test statistic, and your conclusion.
(xi) Construct a 95% confidence interval for the coefficients of the number
of employees and R&D expenditure. Interpret the intervals.
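The estimation and prediction steps (parts (iii), (iv), and (viii)) can be sketched with NumPy on the rows of the table shown above; the resulting numbers are illustrative only, since only part of the dataset is reproduced here.

```python
import numpy as np

# Rows of the table above: employees x1 (thousands), R&D spend x2 ($ millions),
# annual revenue y.
x1 = np.array([9, 14, 18, 22, 25, 30, 28, 35], dtype=float)
x2 = np.array([10, 20, 25, 28, 30, 35, 32, 40], dtype=float)
y = np.array([150, 200, 250, 280, 320, 350, 340, 400], dtype=float)

X = np.column_stack([np.ones_like(x1), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

def predict(emp_thousands, rd_millions):
    """Plug new predictor values into the fitted equation."""
    return beta[0] + beta[1] * emp_thousands + beta[2] * rd_millions

rev_20_30 = predict(20, 30)  # part (iv): 20,000 employees, $30M R&D
rev_25_35 = predict(25, 35)  # part (viii): 25,000 employees, $35M R&D
```

Note that x1 and x2 are highly correlated here, so the individual coefficient estimates are unstable even though predictions near the observed data remain reasonable, which is worth raising in part (ix).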
Data were obtained from a single distribution center for a one-year period.
Each data point for each variable represents one week of activity. The
variables included are:
• The number of cases shipped (X1 )
• The indirect costs of the total labor hours as a percentage (X2 )
• A qualitative predictor called holiday that is coded 1 if the week
has a holiday and 0 otherwise (X3 )
• The total labor hours (Y )
(i). Obtain the scatter plot matrix and the correlation matrix. What
information do these diagnostic aids provide here?
(ii). Fit a multiple regression model to the data with three predictor
variables. State the estimated regression function.
(iii). Obtain the residuals and prepare a box plot of the residuals. What
information does this plot provide?
(iv). Plot the residuals against the fitted values Ŷ and against X1 , X2 , X3 ,
and X1 X2 on separate graphs. Also prepare a normal probability plot.
Interpret the plots and summarize your findings.
(v). Prepare a time plot of the residuals. Is there any indication that the
error terms are correlated? Discuss.
(vi). Conduct the Brown-Forsythe test for constancy of the error variance,
using α = 0.01. State the decision rule and conclusion.
(vii). Test whether there is a regression relation, using a level of significance
of 0.05. State the alternatives, decision rule, and conclusion. What
does your test result imply about β1 , β2 , and β3 ? What is the P-
value of the test?
(viii). Calculate the coefficient of multiple determination R2 . How is this
measure interpreted here?
(ix). Four separate shipments with the following characteristics must be
processed next month:
X1      X2   X3
230,000 7.50 0
250,000 7.30 0
280,000 7.10 0
340,000 6.90 0
Management desires predictions of the handling times for these ship-
ments so that the actual handling times can be compared with the
predicted times to determine whether any are out of line. Develop
the needed predictions, using the most efficient approach and a fam-
ily confidence coefficient of 95%.
(x). Three new shipments are to be received, each with X1 = 282, 000,
X2 = 7.10, and X3 = 0.
(a). Obtain a 95% prediction interval for the mean handling time
for these shipments.
(b). Convert the interval obtained in part (a) into a 95% pre-
diction interval for the total labor hours for the three ship-
ments.
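A prediction interval like the one in part (x)(a) follows the general formula ŷ_new ± t(1−α/2, n−p) · s · √(1 + x′_new (X′X)⁻¹ x_new). The sketch below uses small simulated data with hypothetical generating coefficients, and for simplicity omits the holiday indicator X3; none of these numbers come from the exercise dataset.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated weekly data: cases shipped (x1), indirect cost % (x2), labor hours (y).
# The generating coefficients are hypothetical, chosen only for realistic scale.
n = 30
x1 = rng.uniform(200_000, 350_000, n)
x2 = rng.uniform(6.0, 9.0, n)
y = 4000 + 0.002 * x1 + 30 * x2 + rng.normal(0, 50, n)

X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
p = X.shape[1]
mse = resid @ resid / (n - p)  # s^2, the residual mean square

# New week with X1 = 282,000 cases and X2 = 7.10 (X3 dropped in this sketch).
x_new = np.array([1.0, 282_000, 7.10])
y_hat = x_new @ beta
se_pred = np.sqrt(mse * (1 + x_new @ np.linalg.solve(X.T @ X, x_new)))
t_crit = stats.t.ppf(0.975, n - p)
lower, upper = y_hat - t_crit * se_pred, y_hat + t_crit * se_pred
```

For simultaneous predictions at a family confidence level, as in part (ix), the t multiplier would be replaced by a Bonferroni or Scheffé multiplier.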
y    x1     x2   x3
4325 301995 6.88 0
4110 269334 7.23 0
4111 267631 6.27 0
4161 296350 6.49 0
4560 277223 6.37 0
4401 269189 7.05 0
4251 277133 6.34 0
4222 282892 6.94 0
4063 306639 8.56 0
4343 328405 6.71 0
4833 321773 5.82 1
4453 272319 6.82 0
4195 293880 8.38 0
4016 252225 7.85 0
4207 261365 6.14 0
4148 287645 6.76 0
4562 289666 7.92 0
4146 270051 8.19 0
4555 265239 7.55 0
4365 352466 6.94 0
4471 426908 7.25 0
5045 369989 9.65 1
4469 472476 8.20 0
4408 414102 8.02 0
4219 302507 6.72 0
Appendix
A.1: Table of 1000 random digits
Cumulative standard normal distribution, Φ(z)
z .00 .01 .02 .03 .04 .05 .06 .07 .08 .09
−3.4 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0002
−3.3 0.0005 0.0005 0.0005 0.0004 0.0004 0.0004 0.0004 0.0004 0.0004 0.0003
−3.2 0.0007 0.0007 0.0006 0.0006 0.0006 0.0006 0.0006 0.0005 0.0005 0.0005
−3.1 0.0010 0.0009 0.0009 0.0009 0.0008 0.0008 0.0008 0.0008 0.0007 0.0007
−3.0 0.0013 0.0013 0.0013 0.0012 0.0012 0.0011 0.0011 0.0011 0.0010 0.0010
−2.9 0.0019 0.0018 0.0018 0.0017 0.0016 0.0016 0.0015 0.0015 0.0014 0.0014
−2.8 0.0026 0.0025 0.0024 0.0023 0.0023 0.0022 0.0021 0.0021 0.0020 0.0019
−2.7 0.0035 0.0034 0.0033 0.0032 0.0031 0.0030 0.0029 0.0028 0.0027 0.0026
−2.6 0.0047 0.0045 0.0044 0.0043 0.0041 0.0040 0.0039 0.0038 0.0037 0.0036
−2.5 0.0062 0.0060 0.0059 0.0057 0.0055 0.0054 0.0052 0.0051 0.0049 0.0048
−2.4 0.0082 0.0080 0.0078 0.0075 0.0073 0.0071 0.0069 0.0068 0.0066 0.0064
−2.3 0.0107 0.0104 0.0102 0.0099 0.0096 0.0094 0.0091 0.0089 0.0087 0.0084
−2.2 0.0139 0.0136 0.0132 0.0129 0.0125 0.0122 0.0119 0.0116 0.0113 0.0110
−2.1 0.0179 0.0174 0.0170 0.0166 0.0162 0.0158 0.0154 0.0150 0.0146 0.0143
−2.0 0.0228 0.0222 0.0217 0.0212 0.0207 0.0202 0.0197 0.0192 0.0188 0.0183
−1.9 0.0287 0.0281 0.0274 0.0268 0.0262 0.0256 0.0250 0.0244 0.0239 0.0233
−1.8 0.0359 0.0351 0.0344 0.0336 0.0329 0.0322 0.0314 0.0307 0.0301 0.0294
−1.7 0.0446 0.0436 0.0427 0.0418 0.0409 0.0401 0.0392 0.0384 0.0375 0.0367
−1.6 0.0548 0.0537 0.0526 0.0516 0.0505 0.0495 0.0485 0.0475 0.0465 0.0455
−1.5 0.0668 0.0655 0.0643 0.0630 0.0618 0.0606 0.0594 0.0582 0.0571 0.0559
−1.4 0.0808 0.0793 0.0778 0.0764 0.0749 0.0735 0.0721 0.0708 0.0694 0.0681
−1.3 0.0968 0.0951 0.0934 0.0918 0.0901 0.0885 0.0869 0.0853 0.0838 0.0823
−1.2 0.1151 0.1131 0.1112 0.1093 0.1075 0.1056 0.1038 0.1020 0.1003 0.0985
−1.1 0.1357 0.1335 0.1314 0.1292 0.1271 0.1251 0.1230 0.1210 0.1190 0.1170
−1.0 0.1587 0.1562 0.1539 0.1515 0.1492 0.1469 0.1446 0.1423 0.1401 0.1379
−0.9 0.1841 0.1814 0.1788 0.1762 0.1736 0.1711 0.1685 0.1660 0.1635 0.1611
−0.8 0.2119 0.2090 0.2061 0.2033 0.2005 0.1977 0.1949 0.1922 0.1894 0.1867
−0.7 0.2420 0.2389 0.2358 0.2327 0.2296 0.2266 0.2236 0.2206 0.2177 0.2148
−0.6 0.2743 0.2709 0.2676 0.2643 0.2611 0.2578 0.2546 0.2514 0.2483 0.2451
−0.5 0.3085 0.3050 0.3015 0.2981 0.2946 0.2912 0.2877 0.2843 0.2810 0.2776
−0.4 0.3446 0.3409 0.3372 0.3336 0.3300 0.3264 0.3228 0.3192 0.3156 0.3121
−0.3 0.3821 0.3783 0.3745 0.3707 0.3669 0.3632 0.3594 0.3557 0.3520 0.3483
−0.2 0.4207 0.4168 0.4129 0.4090 0.4052 0.4013 0.3974 0.3936 0.3897 0.3859
−0.1 0.4602 0.4562 0.4522 0.4483 0.4443 0.4404 0.4364 0.4325 0.4286 0.4247
−0.0 0.5000 0.4960 0.4920 0.4880 0.4840 0.4801 0.4761 0.4721 0.4681 0.4641
z .00 .01 .02 .03 .04 .05 .06 .07 .08 .09
0.0 0.5000 0.5040 0.5080 0.5120 0.5160 0.5199 0.5239 0.5279 0.5319 0.5359
0.1 0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.5754
0.2 0.5793 0.5832 0.5871 0.5910 0.5948 0.5987 0.6026 0.6064 0.6103 0.6141
0.3 0.6179 0.6217 0.6255 0.6293 0.6331 0.6368 0.6406 0.6443 0.6480 0.6517
0.4 0.6554 0.6591 0.6628 0.6664 0.6700 0.6736 0.6772 0.6808 0.6844 0.6879
0.5 0.6915 0.6950 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157 0.7190 0.7224
0.6 0.7258 0.7291 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486 0.7518 0.7549
0.7 0.7580 0.7612 0.7642 0.7673 0.7704 0.7734 0.7764 0.7794 0.7823 0.7852
0.8 0.7881 0.7910 0.7939 0.7967 0.7996 0.8023 0.8051 0.8079 0.8106 0.8133
0.9 0.8159 0.8186 0.8212 0.8238 0.8264 0.8289 0.8315 0.8340 0.8365 0.8389
1.0 0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8600 0.8621
1.1 0.8643 0.8665 0.8686 0.8708 0.8729 0.8749 0.8770 0.8790 0.8810 0.8830
1.2 0.8849 0.8869 0.8888 0.8907 0.8925 0.8944 0.8962 0.8980 0.8997 0.9015
1.3 0.9032 0.9049 0.9066 0.9082 0.9099 0.9115 0.9131 0.9147 0.9162 0.9177
1.4 0.9192 0.9207 0.9222 0.9236 0.9251 0.9265 0.9279 0.9292 0.9306 0.9319
1.5 0.9332 0.9345 0.9357 0.9370 0.9382 0.9394 0.9406 0.9418 0.9430 0.9441
1.6 0.9452 0.9463 0.9474 0.9485 0.9495 0.9505 0.9515 0.9525 0.9535 0.9545
1.7 0.9554 0.9564 0.9573 0.9582 0.9591 0.9599 0.9608 0.9616 0.9625 0.9633
1.8 0.9641 0.9649 0.9656 0.9664 0.9671 0.9678 0.9686 0.9693 0.9700 0.9706
1.9 0.9713 0.9719 0.9726 0.9732 0.9738 0.9744 0.9750 0.9756 0.9762 0.9767
2.0 0.9773 0.9778 0.9783 0.9788 0.9793 0.9798 0.9803 0.9808 0.9812 0.9817
2.1 0.9821 0.9826 0.9830 0.9834 0.9838 0.9842 0.9846 0.9850 0.9854 0.9857
2.2 0.9861 0.9864 0.9868 0.9871 0.9875 0.9878 0.9881 0.9884 0.9887 0.9890
2.3 0.9893 0.9896 0.9898 0.9901 0.9904 0.9906 0.9909 0.9911 0.9913 0.9916
2.4 0.9918 0.9920 0.9922 0.9924 0.9927 0.9929 0.9931 0.9932 0.9934 0.9936
2.5 0.9938 0.9940 0.9942 0.9943 0.9945 0.9946 0.9948 0.9949 0.9951 0.9952
2.6 0.9953 0.9955 0.9956 0.9957 0.9958 0.9959 0.9960 0.9961 0.9962 0.9963
2.7 0.9964 0.9965 0.9966 0.9967 0.9968 0.9969 0.9970 0.9971 0.9972 0.9973
2.8 0.9974 0.9975 0.9976 0.9977 0.9978 0.9979 0.9980 0.9981 0.9982 0.9983
2.9 0.9984 0.9985 0.9986 0.9987 0.9988 0.9989 0.9990 0.9991 0.9992 0.9993
3.0 0.9987 0.9987 0.9987 0.9988 0.9988 0.9989 0.9989 0.9989 0.9990 0.9990
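The entries of this table can be reproduced (and extended to any z) with `scipy.stats.norm`, avoiding interpolation between tabulated values:

```python
from scipy.stats import norm

p_upper = norm.cdf(1.96)   # area to the left of z = 1.96, approx. 0.9750
p_lower = norm.cdf(-2.50)  # area to the left of z = -2.50, approx. 0.0062
z_crit = norm.ppf(0.975)   # inverse lookup: the z with left-tail area 0.975
```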
Values −z_a of the standard normal distribution such that the area to the left
of −z_a equals the probability a.
a .00 .01 .02 .03 .04 .05 .06 .07 .08 .09
0.01 -2.326 -2.290 -2.257 -2.226 -2.197 -2.170 -2.144 -2.120 -2.097 -2.075
0.02 -2.054 -2.034 -2.014 -1.995 -1.977 -1.960 -1.943 -1.927 -1.911 -1.896
0.03 -1.881 -1.866 -1.852 -1.838 -1.825 -1.812 -1.799 -1.787 -1.774 -1.762
0.04 -1.751 -1.739 -1.728 -1.717 -1.706 -1.695 -1.685 -1.675 -1.665 -1.655
0.05 -1.645 -1.635 -1.626 -1.616 -1.607 -1.598 -1.589 -1.580 -1.572 -1.563
0.06 -1.555 -1.546 -1.538 -1.530 -1.522 -1.514 -1.506 -1.499 -1.491 -1.483
0.07 -1.476 -1.468 -1.461 -1.454 -1.447 -1.440 -1.433 -1.426 -1.419 -1.412
0.08 -1.405 -1.398 -1.392 -1.385 -1.379 -1.372 -1.366 -1.359 -1.353 -1.347
0.09 -1.341 -1.335 -1.329 -1.323 -1.317 -1.311 -1.305 -1.299 -1.293 -1.287
0.10 -1.282 -1.276 -1.270 -1.265 -1.259 -1.254 -1.248 -1.243 -1.237 -1.232
0.11 -1.227 -1.221 -1.216 -1.211 -1.206 -1.200 -1.195 -1.190 -1.185 -1.180
0.12 -1.175 -1.170 -1.165 -1.160 -1.155 -1.150 -1.146 -1.141 -1.136 -1.131
0.13 -1.126 -1.122 -1.117 -1.112 -1.108 -1.103 -1.098 -1.094 -1.089 -1.085
0.14 -1.080 -1.076 -1.071 -1.067 -1.063 -1.058 -1.054 -1.049 -1.045 -1.041
0.15 -1.036 -1.032 -1.028 -1.024 -1.019 -1.015 -1.011 -1.007 -1.003 -0.999
0.16 -0.994 -0.990 -0.986 -0.982 -0.978 -0.974 -0.970 -0.966 -0.962 -0.958
0.17 -0.954 -0.950 -0.946 -0.942 -0.938 -0.935 -0.931 -0.927 -0.923 -0.919
0.18 -0.915 -0.912 -0.908 -0.904 -0.900 -0.896 -0.893 -0.889 -0.885 -0.882
0.19 -0.878 -0.874 -0.871 -0.867 -0.863 -0.860 -0.856 -0.852 -0.849 -0.845
0.2 -0.842 -0.838 -0.834 -0.831 -0.827 -0.824 -0.820 -0.817 -0.813 -0.810
Values z_a of the standard normal distribution such that the area to the right
of z_a equals the probability a.
a .00 .01 .02 .03 .04 .05 .06 .07 .08 .09
0.16 0.994 0.990 0.986 0.982 0.978 0.974 0.970 0.966 0.962 0.958
0.17 0.954 0.950 0.946 0.942 0.938 0.935 0.931 0.927 0.923 0.919
0.18 0.915 0.912 0.908 0.904 0.900 0.896 0.893 0.889 0.885 0.882
0.19 0.878 0.874 0.871 0.867 0.863 0.860 0.856 0.852 0.849 0.845
0.2 0.842 0.838 0.834 0.831 0.827 0.824 0.820 0.817 0.813 0.810
Critical values t_a of the t-distribution with ν degrees of freedom, for
upper-tail probabilities a = 0.10, 0.05, 0.025, 0.01, 0.005, 0.001, 0.0005
(table body not reproduced).
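These critical values can also be computed directly with `scipy.stats.t` rather than read from a printed table:

```python
from scipy.stats import t

# Upper-tail critical value t_a with nu degrees of freedom: P(T > t_a) = a.
t_025_10 = t.ppf(1 - 0.025, 10)  # approx. 2.228
t_05_20 = t.ppf(1 - 0.05, 20)    # approx. 1.725
```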
Critical values χ²_a of the χ²-distribution with ν degrees of freedom.
ν χ2.995 χ2.99 χ2.975 χ2.95 χ2.90 χ2.75 χ2.50 χ2.25 χ2.10 χ2.05 χ2.025 χ2.01 χ2.005 χ2.001
1 0.00 0.00 0.00 0.00 0.02 0.10 0.45 1.32 2.71 3.84 5.02 6.63 7.88 10.83
2 0.01 0.02 0.05 0.10 0.21 0.58 1.39 2.77 4.61 5.99 7.38 9.21 10.60 13.81
3 0.07 0.12 0.22 0.35 0.58 1.21 2.37 4.11 6.25 7.81 9.35 11.34 12.84 16.27
4 0.21 0.30 0.48 0.71 1.06 1.92 3.36 5.39 7.78 9.49 11.14 13.28 14.86 18.47
5 0.41 0.55 0.83 1.15 1.61 2.67 4.35 6.63 9.24 11.07 12.83 15.09 16.75 20.52
6 0.68 0.87 1.24 1.64 2.20 3.45 5.35 7.84 10.64 12.59 14.45 16.81 18.55 22.46
7 0.99 1.24 1.69 2.17 2.83 4.25 6.35 9.04 12.02 14.07 16.01 18.48 20.28 24.32
8 1.34 1.65 2.18 2.73 3.49 5.07 7.34 10.22 13.36 15.51 17.53 20.09 21.95 26.12
9 1.73 2.09 2.70 3.33 4.17 5.90 8.34 11.39 14.68 16.92 19.02 21.67 23.59 27.88
10 2.16 2.56 3.25 3.94 4.87 6.74 9.34 12.55 15.99 18.31 20.48 23.21 25.19 29.59
11 2.60 3.05 3.82 4.57 5.58 7.58 10.34 13.70 17.28 19.68 21.92 24.72 26.76 31.26
12 3.07 3.57 4.40 5.23 6.30 8.44 11.34 14.85 18.55 21.03 23.34 26.22 28.30 32.91
13 3.57 4.11 5.01 5.89 7.04 9.30 12.34 15.98 19.81 22.36 24.74 27.69 29.82 34.53
14 4.07 4.66 5.63 6.57 7.79 10.17 13.34 17.12 21.06 23.68 26.12 29.14 31.32 36.12
15 4.60 5.23 6.27 7.26 8.55 11.04 14.34 18.25 22.31 25.00 27.49 30.58 32.80 37.70
16 5.14 5.81 6.91 7.96 9.31 11.91 15.34 19.37 23.54 26.30 28.85 32.00 34.27 39.25
17 5.70 6.41 7.56 8.67 10.09 12.79 16.34 20.49 24.77 27.59 30.19 33.41 35.72 40.79
18 6.26 7.01 8.23 9.39 10.86 13.68 17.34 21.60 25.99 28.87 31.53 34.81 37.16 42.31
19 6.84 7.63 8.91 10.12 11.65 14.56 18.34 22.72 27.20 30.14 32.85 36.19 38.58 43.82
20 7.43 8.26 9.59 10.85 12.44 15.45 19.34 23.83 28.41 31.41 34.17 37.57 40.00 45.32
21 8.03 8.90 10.28 11.59 13.24 16.34 20.34 24.93 29.62 32.67 35.48 38.93 41.40 46.80
22 8.64 9.54 10.98 12.34 14.04 17.24 21.34 26.04 30.81 33.92 36.78 40.29 42.80 48.27
23 9.26 10.20 11.69 13.09 14.85 18.14 22.34 27.14 32.01 35.17 38.08 41.64 44.18 49.73
24 9.89 10.86 12.40 13.85 15.66 19.04 23.34 28.24 33.20 36.42 39.36 42.98 45.56 51.18
25 10.52 11.52 13.12 14.61 16.47 19.94 24.34 29.34 34.38 37.65 40.65 44.31 46.93 52.62
26 11.16 12.20 13.84 15.38 17.29 20.84 25.34 30.43 35.56 38.89 41.92 45.64 48.29 54.05
27 11.81 12.88 14.57 16.15 18.11 21.75 26.34 31.53 36.74 40.11 43.19 46.96 49.64 55.48
28 12.46 13.56 15.31 16.93 18.94 22.66 27.34 32.62 37.92 41.34 44.46 48.28 50.99 56.89
29 13.12 14.26 16.05 17.71 19.77 23.57 28.34 33.71 39.09 42.56 45.72 49.59 52.34 58.30
30 13.79 14.95 16.79 18.49 20.60 24.48 29.34 34.80 40.26 43.77 46.98 50.89 53.67 59.70
40 20.71 22.16 24.43 26.51 29.05 33.66 39.34 45.62 51.81 55.76 59.34 63.69 66.77 73.40
50 27.99 29.71 32.36 34.76 37.69 42.94 49.33 56.33 63.17 67.50 71.42 76.15 79.49 86.66
60 35.53 37.48 40.48 43.19 46.46 52.29 59.33 66.98 74.40 79.08 83.30 88.38 91.95 99.61
70 43.28 45.44 48.76 51.74 55.33 61.70 69.33 77.58 85.53 90.53 95.02 100.42 104.22 112.32
80 51.17 53.54 57.15 60.39 64.28 71.14 79.33 88.13 96.58 101.88 106.63 112.33 116.32 124.84
90 59.20 61.75 65.65 69.13 73.29 80.62 89.33 98.64 107.56 113.14 118.14 124.12 128.30 137.21
100 67.33 70.06 74.22 77.93 82.36 90.13 99.33 109.14 118.50 124.34 129.56 135.81 140.17 149.45
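The same values are available programmatically from `scipy.stats.chi2`; for example, two entries of the ν = 10 row above:

```python
from scipy.stats import chi2

# Upper-tail critical values: P(X > x) = a  <=>  x = chi2.ppf(1 - a, nu).
chi2_05_10 = chi2.ppf(1 - 0.05, 10)  # approx. 18.31, matching the table
chi2_95_10 = chi2.ppf(1 - 0.95, 10)  # approx. 3.94
```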
Table 10.15: A.8: Critical points (α = 0.05) of the F -distribution with
degrees of freedom (df1, df2).
df1
df2 1 2 3 4 5 6 7 8 9 10 12 15 20
1 161.4 199.5 215.7 224.6 230.2 234.0 236.8 238.9 240.5 241.9 243.9 245.9 248.0
2 18.51 19.00 19.16 19.25 19.30 19.33 19.35 19.37 19.38 19.40 19.41 19.43 19.45
3 10.13 9.55 9.28 9.12 9.01 8.94 8.89 8.85 8.81 8.79 8.74 8.70 8.66
4 7.71 6.94 6.59 6.39 6.26 6.16 6.09 6.04 6.00 5.96 5.91 5.86 5.80
5 6.61 5.79 5.41 5.19 5.05 4.95 4.88 4.82 4.77 4.74 4.68 4.62 4.56
6 5.99 5.14 4.76 4.53 4.39 4.28 4.21 4.15 4.10 4.06 4.00 3.94 3.87
7 5.59 4.74 4.35 4.12 3.97 3.87 3.79 3.73 3.68 3.64 3.57 3.51 3.44
8 5.32 4.46 4.07 3.84 3.69 3.58 3.50 3.44 3.39 3.35 3.28 3.22 3.15
9 5.12 4.26 3.86 3.63 3.48 3.37 3.29 3.23 3.18 3.14 3.07 3.01 2.94
10 4.96 4.10 3.71 3.48 3.33 3.22 3.14 3.07 3.02 2.98 2.91 2.85 2.77
11 4.84 3.98 3.59 3.36 3.20 3.09 3.01 2.95 2.90 2.85 2.79 2.72 2.65
12 4.75 3.89 3.49 3.26 3.11 3.00 2.91 2.85 2.80 2.75 2.69 2.62 2.54
13 4.67 3.81 3.41 3.18 3.03 2.92 2.83 2.77 2.71 2.67 2.60 2.53 2.46
14 4.60 3.74 3.34 3.11 2.96 2.85 2.76 2.70 2.65 2.60 2.53 2.46 2.39
15 4.54 3.68 3.29 3.06 2.90 2.79 2.71 2.64 2.59 2.54 2.48 2.40 2.33
16 4.49 3.63 3.24 3.01 2.85 2.74 2.66 2.59 2.54 2.49 2.42 2.35 2.28
17 4.45 3.59 3.20 2.96 2.81 2.70 2.61 2.55 2.49 2.45 2.38 2.31 2.23
18 4.41 3.55 3.16 2.93 2.77 2.66 2.58 2.51 2.46 2.41 2.34 2.27 2.19
19 4.38 3.52 3.13 2.90 2.74 2.63 2.54 2.48 2.42 2.38 2.31 2.23 2.16
20 4.35 3.49 3.10 2.87 2.71 2.60 2.51 2.45 2.39 2.35 2.28 2.20 2.12
21 4.32 3.47 3.07 2.84 2.68 2.57 2.49 2.42 2.37 2.32 2.25 2.18 2.10
22 4.30 3.44 3.05 2.82 2.66 2.55 2.46 2.40 2.34 2.30 2.23 2.15 2.07
23 4.28 3.42 3.03 2.80 2.64 2.53 2.44 2.37 2.32 2.27 2.20 2.13 2.05
24 4.26 3.40 3.01 2.78 2.62 2.51 2.42 2.36 2.30 2.25 2.18 2.11 2.03
25 4.24 3.39 2.99 2.76 2.60 2.49 2.40 2.34 2.28 2.24 2.16 2.09 2.01
26 4.23 3.37 2.98 2.74 2.59 2.47 2.39 2.32 2.27 2.22 2.15 2.07 1.99
27 4.21 3.35 2.96 2.73 2.57 2.46 2.37 2.31 2.25 2.20 2.13 2.06 1.97
28 4.20 3.34 2.95 2.71 2.56 2.45 2.36 2.29 2.24 2.19 2.12 2.04 1.96
29 4.18 3.33 2.93 2.70 2.55 2.43 2.35 2.28 2.22 2.18 2.10 2.03 1.94
30 4.17 3.32 2.92 2.69 2.53 2.42 2.33 2.27 2.21 2.16 2.09 2.01 1.93
40 4.08 3.23 2.84 2.61 2.45 2.34 2.25 2.18 2.12 2.08 2.00 1.92 1.84
60 4.00 3.15 2.76 2.53 2.37 2.25 2.17 2.10 2.04 1.99 1.92 1.84 1.75
120 3.92 3.07 2.68 2.45 2.29 2.18 2.09 2.02 1.96 1.91 1.83 1.75 1.66
1000 3.85 3.00 2.61 2.38 2.22 2.11 2.02 1.95 1.89 1.84 1.76 1.68 1.58
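Entries of Table 10.15 can likewise be computed with `scipy.stats.f`, which also covers degrees of freedom not tabulated:

```python
from scipy.stats import f

# Critical point at alpha = 0.05: P(F > f_crit) = 0.05 with (df1, df2).
f_3_10 = f.ppf(0.95, 3, 10)   # approx. 3.71, matching the df1 = 3, df2 = 10 entry
f_1_30 = f.ppf(0.95, 1, 30)   # approx. 4.17
```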