
© U. Dinesh Kumar, IIM Bangalore

Business Analytics
The Science of Data-Driven Decision Making
Second Edition

Chapter 3 – Introduction to Probability

U. Dinesh Kumar

"If you torture the data long enough, it will confess!"
– Ronald Coase

Business Analytics – The Science of Data-Driven Decision Making, © U. Dinesh Kumar, IIM Bangalore
Probability Theory - Terminologies

Random Experiment

• A random experiment is an experiment in which the outcome is not known with certainty.

• Predictive analytics mainly deals with random experiments such as:
  – Predicting the quarterly revenue of an organization
  – Customer churn
  – Demand for a product at a future time period, etc.
Sample Space

• The sample space is the universal set that consists of all possible outcomes of an experiment.

• It is represented using the letter "S".

• Individual outcomes are called elementary events.

• A sample space can be finite or infinite.
Event

• An event (E) is a subset of a sample space, and probability is usually calculated with respect to an event.

• The Venn diagram indicates that the event E is a subset of the sample space S, that is, E ⊆ S.
Probability Estimation Using Relative Frequency

• The classical approach to estimating the probability of an event is based on the relative frequency of the occurrence of that event.

• According to frequency estimation, the probability of an event X, P(X), is given by

  P(X) = (Number of observations in favour of event X) / (Total number of observations) = n(X)/N
Example 3.1

A website displays 10 advertisements, and the revenue generated by the website depends on visitors clicking on any of the advertisements displayed on the website. Data collected by the company reveal that out of 2500 visitors, 30 people clicked on 1 advertisement, 15 clicked on 2 advertisements, and 5 clicked on 3 advertisements. The remaining visitors did not click on any advertisement. Calculate:

(a) The probability that a visitor to the website will click on an advertisement.
(b) The probability that a visitor will click on at least two advertisements.
(c) The probability that a visitor will not click on any advertisement.
Solution

(a) The number of visitors clicking on an advertisement is 30 + 15 + 5 = 50, and the total number of visitors is 2500. Thus, the probability that a visitor to the website will click on an advertisement is 50/2500 = 0.02.

(b) The number of visitors clicking on at least 2 advertisements is 15 + 5 = 20. Thus, the probability that a visitor will click on at least 2 advertisements is 20/2500 = 0.008.

(c) The probability that a visitor will not click on any advertisement is 2450/2500 = 0.98.
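The relative-frequency estimates in Example 3.1 can be checked with a short sketch (the click counts are taken from the example):

```python
# Relative-frequency probability estimates for Example 3.1.
# 2450 of the 2500 visitors clicked on no advertisement.
clicks = {0: 2450, 1: 30, 2: 15, 3: 5}
N = sum(clicks.values())  # total visitors = 2500

p_click = sum(n for k, n in clicks.items() if k >= 1) / N         # (a)
p_at_least_two = sum(n for k, n in clicks.items() if k >= 2) / N  # (b)
p_none = clicks[0] / N                                            # (c)

print(p_click, p_at_least_two, p_none)  # 0.02 0.008 0.98
```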
Algebra of Events

• Assume that X, Y and Z are three events of a sample space. Then the following algebraic relationships are valid and are useful while deriving probabilities of events:

• Commutative rule: X ∪ Y = Y ∪ X and X ∩ Y = Y ∩ X

• Associative rule: (X ∪ Y) ∪ Z = X ∪ (Y ∪ Z) and (X ∩ Y) ∩ Z = X ∩ (Y ∩ Z)

• Distributive rule: X ∪ (Y ∩ Z) = (X ∪ Y) ∩ (X ∪ Z)
                    X ∩ (Y ∪ Z) = (X ∩ Y) ∪ (X ∩ Z)
Contd…

• The following rules, known as De Morgan's laws on complementary sets, are useful while deriving probabilities:

  (X ∪ Y)^C = X^C ∩ Y^C
  (X ∩ Y)^C = X^C ∪ Y^C

  where X^C and Y^C are the complementary events of X and Y, respectively.
Axioms of Probability

According to the axiomatic theory of probability, the probability of an event E satisfies the following axioms:

1. The probability of event E always lies between 0 and 1. That is, 0 ≤ P(E) ≤ 1.

2. The probability of the universal set S is 1. That is, P(S) = 1.

3. P(X ∪ Y) = P(X) + P(Y), where X and Y are mutually exclusive events.
Elementary Rules of Probability

The elementary rules of probability are directly deduced from the original three axioms of probability, using set theory relationships:

1. For any event A, the probability of the complementary event, written A^C, is given by

   P(A^C) = 1 – P(A)

   If P(A) is the probability of observing a fraudulent transaction at an e-commerce portal, then P(A^C) is the probability of observing a genuine transaction.

2. The probability of an empty or impossible event, ∅, is zero:

   P(∅) = 0
3. If the occurrence of an event A implies that an event B also occurs, so that the event class A is a subset of the event class B, then the probability of A is less than or equal to the probability of B:

   P(A) ≤ P(B)

4. The probability that either event A or event B occurs, or both occur, is given by

   P(A ∪ B) = P(A) + P(B) – P(A ∩ B)

5. If A and B are mutually exclusive events, so that P(A ∩ B) = 0, then

   P(A ∪ B) = P(A) + P(B)

6. If A1, A2, …, An are n events that form a partition of the sample space S, then their probabilities must add up to 1:

   P(A1) + P(A2) + … + P(An) = Σ(i=1..n) P(Ai) = 1
Joint Probability

• Let A and B be two events in a sample space. Then the joint probability of the two events, written as P(A ∩ B), is given by

  P(A ∩ B) = (Number of observations in A ∩ B) / (Total number of observations)
Example 3.2

At an e-commerce customer service centre, a total of 112 complaints were received. 78 customers complained about late delivery of the items and 40 complained about poor product quality.

(a) Calculate the probability that a customer will complain about both late delivery and product quality.
(b) What is the probability that a complaint is only about poor quality of the product?
Solution to Example 3.2

• Let A = late delivery and B = poor quality of the product. Let n(A) and n(B) be the number of complaints in favour of A and B, so n(A) = 78 and n(B) = 40. Since the total number of complaints is 112, every complaint falls in A ∪ B, hence

  n(A ∩ B) = n(A) + n(B) – n(A ∪ B) = 78 + 40 – 112 = 6

• The probability of a complaint about both late delivery and poor product quality is

  P(A ∩ B) = n(A ∩ B) / (Total number of complaints) = 6/112 ≈ 0.0536

• The probability that the complaint is only about poor quality is

  1 – P(A) = 1 – 78/112 ≈ 0.3036
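The inclusion–exclusion step in Example 3.2 can be sketched as follows (counts taken from the example):

```python
# Example 3.2 via inclusion–exclusion: n(A ∩ B) = n(A) + n(B) − n(A ∪ B).
n_total = 112   # every complaint is about late delivery, poor quality, or both
n_A = 78        # late delivery
n_B = 40        # poor product quality

n_both = n_A + n_B - n_total            # complaints about both issues
p_both = n_both / n_total               # (a)
p_only_quality = 1 - n_A / n_total      # (b) complement of "mentions late delivery"

print(n_both, round(p_both, 4), round(p_only_quality, 4))  # 6 0.0536 0.3036
```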
Marginal, Independent and Conditional Probability

• Marginal probability is simply the probability of an event X, denoted by P(X), without any conditions.

• Independent events: Two events A and B are independent when the occurrence of one event (say event A) does not affect the probability of occurrence of the other event (event B). Mathematically, two events A and B are independent when

  P(A ∩ B) = P(A) × P(B)

• Conditional probability: If A and B are events in a sample space, then the conditional probability of the event B given that the event A has already occurred, denoted by P(B|A), is defined as

  P(B|A) = P(A ∩ B) / P(A),  P(A) ≠ 0
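These two definitions can be sketched together: compute a conditional probability and then test the independence condition. The probabilities below are illustrative numbers, not values from the chapter:

```python
# Conditional probability and an independence check on hypothetical numbers.
p_A = 0.4          # P(A)
p_B = 0.5          # P(B)
p_A_and_B = 0.2    # P(A ∩ B)

p_B_given_A = p_A_and_B / p_A                       # P(B|A) = P(A ∩ B)/P(A)
independent = abs(p_A_and_B - p_A * p_B) < 1e-12    # P(A ∩ B) == P(A)P(B)?

print(p_B_given_A, independent)  # 0.5 True
```

Here P(B|A) = P(B) = 0.5, which is exactly what independence means: knowing A occurred does not change the probability of B.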
Application of Simple Probability Rules in Analytics

• Association rule mining is one of the popular algorithms used to solve problems such as market basket analysis and recommender systems.

• Market basket analysis (MBA) is used frequently by retailers to predict products a customer is likely to buy together, which can further be used for designing planograms and product promotions.
Association Rule Mining

• Association rule learning (also known as association rule mining) is a method of finding associations between different entities in a database.

• An association rule is a relationship of the form X → Y (that is, X implies Y).
Association Rule Learning Example – Binary Representation of Point-of-Sale Data

Transaction ID | Apple | Orange | Strawberry | Grapes | Plums | Banana | Green Apple
1              |   1   |   1    |     1      |   0    |   1   |   1    |     1
2              |   0   |   1    |     0      |   0    |   0   |   1    |     1
3              |   0   |   0    |     0      |   0    |   0   |   1    |     1
4              |   1   |   0    |     0      |   0    |   1   |   0    |     0
5              |   1   |   0    |     0      |   0    |   1   |   1    |     1
6              |   0   |   1    |     1      |   0    |   0   |   0    |     1
7              |   0   |   1    |     1      |   0    |   0   |   0    |     1
• In the table, transaction ID is the transaction reference number and apple, orange, etc. are the different SKUs sold by the store. Binary code is used to represent whether the SKU was purchased (equal to 1) or not (equal to 0) during a transaction. The strength of association between two mutually exclusive subsets can be measured using 'support', 'confidence' and 'lift'.

• Support between two sets (of products purchased) is calculated using the joint probability of those events:

  Support = P(X ∩ Y) = n(X ∩ Y) / N

• where n(X ∩ Y) is the number of times both X and Y are purchased together and N is the total number of transactions.
Association Rule Learning Contd…

• Confidence is the conditional probability of purchasing product Y given that product X is purchased. It measures the probability of event Y (customer buying product Y) given that event X has occurred (the customer has already purchased product X). That is,

  Confidence = P(Y|X) = P(X ∩ Y) / P(X)

• Lift: The third measure in association rule mining is lift, which is given by

  Lift = P(X ∩ Y) / [P(X) × P(Y)]

• Association rules can be generated based on threshold values of support, confidence and lift. For example, assume that the cut-off for support is 0.25 and for confidence is 0.5 (lift should be more than 1).
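The three measures can be sketched directly on the point-of-sale table above, here for the candidate rule Apple → Plums (the choice of rule is illustrative):

```python
# Support, confidence and lift for the rule Apple → Plums, computed from the
# 7-transaction point-of-sale table in the text.
transactions = [
    {"Apple", "Orange", "Strawberry", "Plums", "Banana", "Green Apple"},  # 1
    {"Orange", "Banana", "Green Apple"},                                  # 2
    {"Banana", "Green Apple"},                                            # 3
    {"Apple", "Plums"},                                                   # 4
    {"Apple", "Plums", "Banana", "Green Apple"},                          # 5
    {"Orange", "Strawberry", "Green Apple"},                              # 6
    {"Orange", "Strawberry", "Green Apple"},                              # 7
]
N = len(transactions)

def measures(x, y):
    n_x = sum(x in t for t in transactions)
    n_y = sum(y in t for t in transactions)
    n_xy = sum(x in t and y in t for t in transactions)
    support = n_xy / N                           # P(X ∩ Y)
    confidence = n_xy / n_x                      # P(Y|X)
    lift = support / ((n_x / N) * (n_y / N))     # P(X ∩ Y) / (P(X) P(Y))
    return support, confidence, lift

print(measures("Apple", "Plums"))  # support ≈ 0.4286, confidence = 1.0, lift ≈ 2.3333
```

With the example cut-offs (support > 0.25, confidence > 0.5, lift > 1), the rule Apple → Plums would be accepted.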
Bayes Theorem

• Bayes theorem is one of the most important concepts in analytics since several problems are solved using Bayesian statistics.

  P(A|B) = P(A ∩ B) / P(B)  and  P(B|A) = P(A ∩ B) / P(A)

• Using the two equations, we can show that

  P(B|A) = P(A|B) P(B) / P(A)
Terminologies Used to Describe Various Components in Bayes Theorem

  P(B|A) = P(A|B) P(B) / P(A)

1. P(B) is called the prior probability (the estimate of the probability without any additional information).

2. P(B|A) is called the posterior probability (that is, given that the event A has occurred, what is the probability of occurrence of event B). That is, after the additional information (or additional evidence) that A has occurred, what is the estimated probability of occurrence of B.

3. P(A|B) is called the likelihood of observing evidence A if B is true.

4. P(A) is the prior probability of A.
Monty Hall Problem

Monty Hall Problem Using Bayes Theorem

• Let C1, C2, and C3 be the events that the car is behind door 1, 2, and 3, respectively. Let D1, D2, and D3 be the events that Monty opens door 1, 2, and 3, respectively. The prior probabilities of C1, C2, and C3 are

  P(C1) = P(C2) = P(C3) = 1/3

• Assume that the player has chosen door 1 and Monty opens door 2 to reveal a goat. We would now like to calculate the posterior probability P(C1|D2), that is, the probability that the car is behind door 1 (the door chosen initially by the player) when Monty has provided the additional information that the car is not behind door 2.
• Using Bayes theorem,

  P(C1|D2) = P(D2|C1) P(C1) / P(D2) = (1/2)(1/3) / (1/2) = 1/3

• P(D2|C1) = 1/2 (if the car is behind door 1, then Monty can open either door 2 or door 3)

  P(D2) = P(D2|C1)P(C1) + P(D2|C2)P(C2) + P(D2|C3)P(C3) = (1/2)(1/3) + 0 + 1 × (1/3) = 1/2

• Note that P(C2|D2) = 0. Thus P(C3|D2) = 1 – P(C1|D2) = 1 – 1/3 = 2/3.

• Thus, changing the initial choice will increase the probability of winning the car. Alternatively,

  P(C3|D2) = P(D2|C3) P(C3) / P(D2) = 1 × (1/3) / (1/2) = 2/3

• P(D2|C3) = 1 (if the car is behind door 3 and the player has chosen door 1, Monty has to open door 2 with probability 1)
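The same posterior can be computed exactly by coding the conditional probabilities P(D2|Ci) and applying Bayes theorem, using exact fractions to avoid rounding:

```python
# Exact Monty Hall posteriors, with the player having picked door 1
# and Monty opening door 2.
from fractions import Fraction

half, third = Fraction(1, 2), Fraction(1, 3)

# P(D2 | Ci): probability Monty opens door 2 given the car is behind door i.
p_D2_given_C = {1: half, 2: Fraction(0), 3: Fraction(1)}
prior = {i: third for i in (1, 2, 3)}

p_D2 = sum(p_D2_given_C[i] * prior[i] for i in (1, 2, 3))              # = 1/2
posterior = {i: p_D2_given_C[i] * prior[i] / p_D2 for i in (1, 2, 3)}  # Bayes

print(posterior[1], posterior[3])  # 1/3 2/3 — switching doubles the chance of winning
```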
GENERALIZATION OF BAYES THEOREM
Event generated from mutually exclusive subsets

Example 3.4

Black boxes used in aircraft are manufactured by three companies A, B and C. 75% are manufactured by A, 15% by B, and 10% by C. The defect rates of black boxes manufactured by A, B, and C are 4%, 6%, and 8%, respectively. If a black box tested randomly is found to be defective, what is the probability that it is manufactured by company A?
Solution to Example 3.4

• Let A, B, C be the events corresponding to the black box being manufactured by companies A, B, and C, respectively, and let D be the event that a black box is defective. We are interested in calculating the probability P(A|D):

  P(A|D) = P(D|A) P(A) / P(D)

• Now P(D|A) = 0.04 and P(A) = 0.75. Using the total probability rule,

  P(D) = 0.75 × 0.04 + 0.15 × 0.06 + 0.10 × 0.08 = 0.047

  So, P(A|D) = (0.04 × 0.75) / 0.047 ≈ 0.6383
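Example 3.4 is the generalized Bayes theorem in miniature, and can be sketched as:

```python
# Example 3.4: posterior P(A | defective) via the total probability rule.
prior = {"A": 0.75, "B": 0.15, "C": 0.10}       # manufacturing shares
p_defect = {"A": 0.04, "B": 0.06, "C": 0.08}    # defect rate per company

p_D = sum(prior[c] * p_defect[c] for c in prior)   # P(D) = Σ P(D|i) P(i) = 0.047
posterior_A = p_defect["A"] * prior["A"] / p_D     # Bayes theorem

print(round(p_D, 3), round(posterior_A, 4))  # 0.047 0.6383
```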
Random Variables

• A random variable is a function that maps every outcome in the sample space to a real number.

• That is, it is a function that assigns a real number to each sample point in the sample space S.

• A random variable is a robust and convenient way of representing the outcome of a random experiment.
Discrete Random Variables

• If the random variable X can assume only a finite or countably infinite set of values, then it is called a discrete random variable.
• Examples of discrete random variables are:
  – Credit rating (usually classified into different categories such as low, medium and high, or using labels such as AAA, AA, A, BBB, etc.).
  – Number of orders received at an e-commerce retailer, which can be countably infinite.
  – Customer churn (the random variable takes binary values: 1. Churn and 2. Do not churn).
  – Fraud (the random variable takes binary values: 1. Fraudulent transaction and 2. Genuine transaction).
  – Any experiment that involves counting (for example, number of returns in a day from customers of e-commerce portals such as Amazon and Flipkart; number of customers not accepting job offers from an organization).
Probability Mass Function

• For a discrete random variable, the probability that a random variable X takes a specific value xi, P(X = xi), is called the probability mass function P(xi).

• That is, a probability mass function is a function that maps each outcome of a random experiment to a probability.
Expected Value

• The expected value (or mean) of a discrete random variable is given by

  E(X) = Σ(i=1..n) xi P(xi)

• where xi is the specific value taken by the discrete random variable X and P(xi) is the corresponding probability, that is, P(X = xi).
Variance and Standard Deviation

• The variance of a discrete random variable is given by

  Var(X) = Σ(i=1..n) [xi – E(X)]² P(xi)

• The standard deviation of a discrete random variable is given by

  σ = √Var(X)
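The mean, variance and standard deviation formulas can be sketched on a small PMF. The distribution below is a hypothetical example, not one taken from the chapter:

```python
# Mean, variance and standard deviation of a discrete random variable,
# using an illustrative PMF (hypothetical number-of-clicks distribution).
from math import sqrt

pmf = {0: 0.5, 1: 0.3, 2: 0.2}   # P(X = x); probabilities sum to 1

mean = sum(x * p for x, p in pmf.items())                 # E(X) = Σ x P(x)
var = sum((x - mean) ** 2 * p for x, p in pmf.items())    # Var(X) = Σ (x − E(X))² P(x)
std = sqrt(var)

print(round(mean, 2), round(var, 2), round(std, 3))  # 0.7 0.61 0.781
```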
Probability Density Function (pdf)

• The probability density function, f(x), is defined through the probability that the value of the random variable X lies in an infinitesimally small interval defined by x and x + Δx:

  f(x) = lim(Δx→0) P(x ≤ X ≤ x + Δx) / Δx
Cumulative Distribution Function (CDF)

• The cumulative distribution function (CDF) of a continuous random variable is defined by

  F(a) = P(X ≤ a) = ∫ from –∞ to a of f(x) dx
• The probability density function and cumulative distribution function of a continuous random variable satisfy the following properties:

  f(x) ≥ 0
  F(∞) = ∫ from –∞ to ∞ of f(x) dx = 1

• The probability between two values a and b, P(a ≤ X ≤ b), is the area between the values a and b under the probability density function:

  P(a ≤ X ≤ b) = ∫ from a to b of f(x) dx = F(b) – F(a)
• The expected value of a continuous random variable, E(X), is given by

  E(X) = ∫ from –∞ to ∞ of x f(x) dx

• The variance of a continuous random variable, Var(X), is given by

  Var(X) = ∫ from –∞ to ∞ of [x – E(X)]² f(x) dx
Binomial Distribution

• A random variable X is said to follow a binomial distribution when:
  – The random variable can have only two outcomes, success and failure (also known as Bernoulli trials).
  – The objective is to find the probability of getting k successes out of n trials.
  – The probability of success is p and thus the probability of failure is (1 – p).
  – The probability p is constant and does not change between trials.
Probability Mass Function (PMF) of the Binomial Distribution

• The PMF of the binomial distribution (the probability that the number of successes will be exactly x out of n trials) is given by

  PMF(x) = P(X = x) = C(n, x) pˣ (1 – p)ⁿ⁻ˣ,  0 ≤ x ≤ n

  where C(n, x) = n! / [x!(n – x)!]
Mean and Variance of the Binomial Distribution

• The mean of a binomial distribution is given by

  Mean = E(X) = Σ(x=0..n) x PMF(x) = Σ(x=0..n) x C(n, x) pˣ (1 – p)ⁿ⁻ˣ = np

• The variance of a binomial distribution is given by

  Var(X) = Σ(x=0..n) [x – E(X)]² PMF(x) = Σ(x=0..n) [x – E(X)]² C(n, x) pˣ (1 – p)ⁿ⁻ˣ = np(1 – p)

• If the number of trials (n) in a binomial distribution is large, then it can be approximated by a normal distribution with mean np and variance np(1 – p).
Example 3.5

Fashion Trends Online (FTO) is an e-commerce company that sells women's apparel. It is observed that about 10% of their customers return the items purchased by them for many reasons (such as size, colour, or material mismatch). On a particular day, 20 customers purchased items from FTO. Calculate:

(a) The probability that exactly 5 customers will return the items.
(b) The probability that a maximum of 5 customers will return the items.
(c) The probability that more than 5 customers will return the items purchased by them.
(d) The average number of customers who are likely to return the items.
(e) The variance and the standard deviation of the number of returns.
Solution

In this case, n = 20 and p = 0.1.

(a) The probability that exactly 5 customers will return the items purchased is

  P(X = 5) = C(20, 5) (0.1)⁵ (0.9)¹⁵ ≈ 0.03192

(b) The probability that a maximum of 5 customers will return the items purchased is

  P(X ≤ 5) = Σ(k=0..5) C(20, k) (0.1)ᵏ (0.9)²⁰⁻ᵏ ≈ 0.9887

(c) The probability that more than 5 customers will return the product is

  P(X > 5) = 1 – P(X ≤ 5) = 1 – 0.9887 = 0.0113

(d) The average number of customers who are likely to return the items is

  E(X) = n × p = 20 × 0.1 = 2

(e) The variance of a binomial distribution is given by

  Var(X) = n × p × (1 – p) = 20 × 0.1 × 0.9 = 1.8

  and the corresponding standard deviation is √1.8 ≈ 1.3416.
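All five parts of Example 3.5 can be reproduced from the binomial PMF:

```python
# Example 3.5 with the binomial PMF: n = 20 trials, p = 0.1 return probability.
from math import comb, sqrt

n, p = 20, 0.1
pmf = lambda x: comb(n, x) * p**x * (1 - p)**(n - x)

p_exactly_5 = pmf(5)                          # (a)
p_at_most_5 = sum(pmf(k) for k in range(6))   # (b)
p_more_than_5 = 1 - p_at_most_5               # (c)
mean, var = n * p, n * p * (1 - p)            # (d), (e)

print(round(p_exactly_5, 5), round(p_at_most_5, 4), round(p_more_than_5, 4))
print(mean, var, round(sqrt(var), 4))  # 2.0 1.8 1.3416
```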
Poisson Distribution

• The Poisson distribution is used when we have to find the probability of the number of events occurring in a fixed interval.
• The probability mass function of a Poisson distribution is given by

  P(X = k) = e^(–λ) λᵏ / k!,  k = 0, 1, 2, …

• where λ is the rate of occurrence of the events per unit of measurement.
• The cumulative distribution function of a Poisson distribution is given by

  P(X ≤ k) = Σ(i=0..k) e^(–λ) λⁱ / i!
• The mean and variance of a Poisson random variable are given by E(X) = λ and Var(X) = λ.

• Figures: probability mass function and cumulative distribution function of a Poisson random variable (λ = 4).
Example

On average, about 20 customers per day cancel their orders placed at Fashion Trends Online. Calculate the probability that the number of cancellations on a day is exactly 20, and the probability that the maximum number of cancellations is 25.

Solution

The probability that the number of cancellations is exactly 20 is given by

  P(X = 20) = e^(–20) 20²⁰ / 20! ≈ 0.0888

The probability that the maximum number of cancellations will be 25 is given by

  P(X ≤ 25) = Σ(k=0..25) e^(–20) 20ᵏ / k! ≈ 0.8878
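The cancellation example can be checked directly from the Poisson PMF:

```python
# Poisson cancellations example: rate λ = 20 cancellations per day.
from math import exp, factorial

lam = 20
pmf = lambda k: exp(-lam) * lam**k / factorial(k)

p_exactly_20 = pmf(20)                          # P(X = 20)
p_at_most_25 = sum(pmf(k) for k in range(26))   # P(X ≤ 25)

print(round(p_exactly_20, 4), round(p_at_most_25, 4))  # 0.0888 0.8878
```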
Geometric Distribution

• The geometric distribution represents a random experiment in which the random variable counts the number of trials needed to obtain the first success.

• The probability mass function of a geometric distribution is given by

  P(X = x) = P(success at xth trial) = (1 – p)^(x–1) p,  x = 1, 2, 3, …

• The cumulative distribution function is given by

  F(x) = P(X ≤ x) = 1 – (1 – p)ˣ

• The mean and variance of a geometric distribution are given by

  E(X) = 1/p  and  Var(X) = (1 – p)/p²
Figures: probability mass function and cumulative distribution function of a geometric distribution (p = 0.3).
Memoryless Property of the Geometric Distribution

• The memoryless property is a special property of the geometric distribution in which the conditional probability P(X > i + j | X > i) depends only on the value j, not on the value i. We know that

  P(X > i) = 1 – P(X ≤ i) = 1 – [1 – (1 – p)ⁱ] = (1 – p)ⁱ

  P(X > i + j | X > i) = P(X > i + j and X > i) / P(X > i) = P(X > i + j) / P(X > i) = (1 – p)^(i+j) / (1 – p)ⁱ = (1 – p)ʲ

• Note that P(X > j) = (1 – p)ʲ. Thus, P(X > i + j | X > i) = P(X > j).

• The memoryless property is an important property that simplifies calculations associated with conditional probabilities.
Example

Local Dhaniawala (LD) is an online grocery store and has an innovative feature which predicts whether the customer has forgotten to buy an item that is very common among customers of grocery items. The probability that a customer buys milk in each shopping visit is 0.2.

(a) Calculate the probability that the customer's first purchase of milk happens during the 5th visit.
(b) Calculate the average time between purchases of milk.
(c) If a customer has not purchased milk during the past 3 shopping visits, what is the probability that the customer will not buy milk for another 2 visits?
Solution

(a) The probability that the customer's first purchase of milk happens on the 5th visit is given by

  P(X = 5) = (1 – 0.2)⁴ × 0.2 = 0.08192

(b) The average time between purchases of milk is

  E(X) = 1/p = 1/0.2 = 5 visits

(c) Given that a customer has not purchased milk for the past 3 shopping visits, by the memoryless property the probability that the customer will not buy for another 2 visits is

  P(X > 3 + 2 | X > 3) = P(X > 2) = (1 – p)² = (1 – 0.2)² = 0.64
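The three parts, including the memoryless step, can be sketched as:

```python
# Geometric-distribution example: probability of buying milk per visit p = 0.2.
p = 0.2

p_first_at_5 = (1 - p) ** 4 * p     # (a) first success on the 5th visit
mean_visits = 1 / p                 # (b) E(X) = 1/p
p_two_more = (1 - p) ** 2           # (c) memoryless: P(X > 5 | X > 3) = P(X > 2)

print(round(p_first_at_5, 5), mean_visits, round(p_two_more, 2))  # 0.08192 5.0 0.64
```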
Parameters of Continuous Distributions

• Scale parameter: The scale parameter defines the range of the continuous distribution. The larger the scale parameter value, the larger the spread of the distribution.

• Shape parameter: The shape parameter defines the shape of the probability distribution. Changes to the value of the shape parameter change the shape of the distribution.

• Location parameter: The location parameter locates (or shifts) the distribution on the horizontal axis.
Uniform Distribution

• Probability density function:

  f(x) = 1/(b – a) for x ∈ [a, b], and 0 otherwise

• Cumulative distribution function:

  F(x) = 0 for x < a;  (x – a)/(b – a) for a ≤ x < b;  1 for x ≥ b

• The mean and variance of the uniform distribution are

  E(X) = (a + b)/2  and  Var(X) = (b – a)²/12
Exponential Distribution

• The exponential distribution is a single-parameter continuous distribution that is traditionally used for modelling time to failure of electronic components.

• The probability density function and cumulative distribution function of the exponential distribution are given by

  f(x) = λ e^(–λx),  λ > 0, x ≥ 0
  F(x) = 1 – e^(–λx)

• The parameter λ is the scale parameter and represents the rate of occurrence of the event; 1/λ is the mean time between events.
Probability Density Function of an Exponential Distribution

• The mean and variance of an exponential distribution are given by

  E(X) = 1/λ  and  Var(X) = 1/λ²

• The expected value 1/λ is the mean time between events.
Memoryless Property of the Exponential Distribution

• The exponential distribution is the only continuous probability distribution that has the memoryless property. That is,

  P(X > t + s | X > t) = P(X > s)

  P(X > t + s | X > t) = P(X > t + s and X > t) / P(X > t) = P(X > t + s) / P(X > t) = e^(–λ(t+s)) / e^(–λt) = e^(–λs)
Example

The time to failure of an avionic system follows an exponential distribution with a mean time between failures (MTBF) of 1000 hours.

(a) Calculate the probability that the system will fail before 1000 hours.
(b) Calculate the probability that it will not fail up to 2000 hours.
(c) Calculate the time by which 10% of the systems will fail (that is, calculate the P10 life).
Solution

(a) In this case λ = 1/1000, so the probability that the system will fail by 1000 hours is

  F(1000) = 1 – e^(–λt) = 1 – e^(–1000/1000) = 1 – e^(–1) ≈ 0.6321

(b) The probability that the system will not fail up to 2000 hours is

  P(X > 2000) = 1 – F(2000) = e^(–2000/1000) = e^(–2) ≈ 0.1353

(c) The time by which 10% of the systems will fail satisfies

  F(t) = 0.10  →  1 – e^(–λt) = 0.1  →  e^(–λt) = 0.9

  So, t = –(1/λ) ln(0.9) = –1000 ln(0.9) ≈ 105.36 hours

  That is, by 105.36 hours, 10% of the items will fail.
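The avionics example can be reproduced from the exponential CDF and its inverse:

```python
# Exponential time-to-failure example: MTBF 1000 hours, so λ = 1/1000 per hour.
from math import exp, log

lam = 1 / 1000

p_fail_by_1000 = 1 - exp(-lam * 1000)   # (a) F(1000) = 1 − e^(−1)
p_survive_2000 = exp(-lam * 2000)       # (b) P(X > 2000) = e^(−2)
t10 = -log(0.9) / lam                   # (c) F(t) = 0.10  →  t = −ln(0.9)/λ

print(round(p_fail_by_1000, 4), round(p_survive_2000, 4), round(t10, 2))
# 0.6321 0.1353 105.36
```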
Normal Distribution

• The normal distribution, also known as the Gaussian distribution, is one of the most popular continuous distributions in the field of analytics, especially due to its use in multiple contexts.
• The probability density function and the cumulative distribution function are given by

  f(x) = [1/(σ√(2π))] e^(–(x – μ)²/(2σ²)),  –∞ < x < ∞

  F(x) = ∫ from –∞ to x of [1/(σ√(2π))] e^(–(t – μ)²/(2σ²)) dt,  –∞ < x < ∞

• Here μ and σ are the mean and standard deviation of the normal distribution.
• In Excel, NORM.DIST(x, μ, σ, TRUE) can be used for calculating the cumulative distribution function of a normal distribution with mean μ and standard deviation σ; with the last argument set to FALSE, it returns the probability density function.

• Figures: probability density function and cumulative distribution function of a normal distribution.
Properties of Normal Distribution
1. Theoretical normal density functions are defined between −∞ and +∞.
2. It is a two-parameter distribution, where the parameter μ is the mean (location parameter) and the parameter σ is the standard deviation (scale parameter).
3. All normal distributions have a symmetrical bell shape around the mean μ (thus it is also the median). μ is also the mode of the normal distribution; that is, μ is the mean, the median, as well as the mode.
4. For any normal distribution, the areas between specific values measured in terms of μ and σ are given by:

Value of Random Variable                                   | Area under the Normal Distribution (CDF)
μ − σ ≤ X ≤ μ + σ (area within one sigma of the mean)      | 0.6827
μ − 2σ ≤ X ≤ μ + 2σ (area within two sigma of the mean)    | 0.9545
μ − 3σ ≤ X ≤ μ + 3σ (area within three sigma of the mean)  | 0.9973

5. Any linear transformation of a normal random variable is also a normal random variable. That is, if X is a normal random variable, then the linear transformation AX + B (where A and B are two constants) is also a normal random variable.
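The one-, two- and three-sigma areas in the table can be reproduced directly from the standard normal CDF; a minimal stdlib sketch:

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Area within k sigma of the mean: F(k) - F(-k)
areas = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}
for k, a in areas.items():
    print(k, round(a, 4))  # 1 -> 0.6827, 2 -> 0.9545, 3 -> 0.9973
```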
• If X1 and X2 are two independent normal random variables with means μ1 and μ2 and variances σ1² and σ2², respectively, then X1 + X2 is also a normal random variable with mean μ1 + μ2 and variance σ1² + σ2².

• The sampling distribution of the mean of a large sample drawn from a population with any distribution is likely to follow a normal distribution; this result is known as the central limit theorem.
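The additivity of independent normals can be illustrated by simulation; a minimal sketch using only the Python standard library, with illustrative parameters (not from the text):

```python
import random
from statistics import mean, pvariance

# Sums of two independent normals: the sample mean should be near
# mu1 + mu2 and the sample variance near sigma1^2 + sigma2^2.
random.seed(42)
mu1, s1 = 5.0, 2.0
mu2, s2 = 3.0, 1.0

sums = [random.gauss(mu1, s1) + random.gauss(mu2, s2) for _ in range(200_000)]

print(round(mean(sums), 1))       # close to mu1 + mu2 = 8.0
print(round(pvariance(sums), 1))  # close to s1**2 + s2**2 = 5.0
```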
Standard Normal Variable
• A normal random variable with mean μ = 0 and σ = 1 is called the standard normal variable and is usually represented by Z.
• The probability density function and cumulative distribution function of a standard normal variable are given by

f(z) = (1/√(2π)) e^(−z²/2)

F(z) = ∫_{−∞}^{z} (1/√(2π)) e^(−t²/2) dt
• By using the following transformation, any normal random variable X can be converted into a standard normal variable:

Z = (X − μ)/σ

• The random variable X can be written in the form of a standard normal random variable using the relationship X = μ + σZ.
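A short sketch of standardization and its inverse, with illustrative values μ = 68 and σ = 12:

```python
from statistics import NormalDist

mu, sigma = 68.0, 12.0   # illustrative parameters
x = 90.0

z = (x - mu) / sigma     # Z = (X - mu) / sigma
back = mu + sigma * z    # X = mu + sigma * Z recovers x

# P(X <= x) equals P(Z <= z): the transformation preserves probabilities.
p_x = NormalDist(mu, sigma).cdf(x)
p_z = NormalDist().cdf(z)

print(round(z, 4))              # 1.8333
print(abs(back - x) < 1e-9)     # True
print(abs(p_x - p_z) < 1e-9)    # True
```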
• A simple approximation of the standard normal CDF is given by Tocher (1963):

P(Z ≤ z) = F(z) ≈ e^(2kz) / (1 + e^(2kz)),  where k = √(2/π)

• Another, more accurate approximation is provided by Bryc (2002):

P(Z ≤ z) = F(z) ≈ 1 − ((z² + A₁z + A₂) / (√(2π) z³ + B₁z² + B₂z + 2A₂)) e^(−z²/2)

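Tocher's approximation is easy to check against an exact CDF; a minimal stdlib sketch:

```python
import math
from statistics import NormalDist

def tocher_cdf(z: float) -> float:
    """Tocher's (1963) logistic approximation to the standard normal CDF."""
    k = math.sqrt(2 / math.pi)
    return math.exp(2 * k * z) / (1 + math.exp(2 * k * z))

exact = NormalDist().cdf
for z in (0.0, 0.5, 1.0, 2.0):
    print(z, round(tocher_cdf(z), 4), round(exact(z), 4))
# At these points the approximation stays within about 0.02 of the exact CDF.
```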
Example
According to a survey on the use of smartphones in India, smartphone users spend 68 minutes a day on average sending messages, and the corresponding standard deviation is 12 minutes. Assume that the time spent sending messages follows a normal distribution.

(a) What proportion of smartphone users spend more than 90 minutes a day sending messages?
(b) What proportion of customers spend less than 20 minutes?
(c) What proportion of customers spend between 50 minutes and 100 minutes?
Solution
It is given that μ = 68 minutes and σ = 12 minutes.
(a) The proportion of customers spending more than 90 minutes is given by P(X > 90) = 1 − P(X ≤ 90) = 1 − F(90).
The standard normal random variable value for X = 90 is given by

Z = (x − μ)/σ = (90 − 68)/12 = 1.8333

That is, F(X = 90) = F(Z = 1.8333). From the standard normal distribution table, the area under the curve for Z = 1.8333 is 0.9666. Thus P(X > 90) = 1 − P(X ≤ 90) = 1 − F(90) = 1 − 0.9666 = 0.0334.

Alternatively, using Excel, we get
P(X > 90) = 1 − P(X ≤ 90) = 1 − NORM.DIST(90, 68, 12, true) = 0.0334
(b) The proportion of customers spending less than 20 minutes is
P(X ≤ 20) = F(20)
Using the Excel function, NORM.DIST(20, 68, 12, true) = 3.1671 × 10⁻⁵

(c) The proportion of customers spending between 50 and 100 minutes is given by
P(50 ≤ X ≤ 100) = F(100) − F(50)
= NORM.DIST(100, 68, 12, true) − NORM.DIST(50, 68, 12, true) = 0.9293
Chi-Square Distribution
• The chi-square distribution with k degrees of freedom [denoted as χ²(k)] is obtained by adding the squares of k independent standard normal random variables; it is widely used in non-parametric tests.
• Consider a normal random variable X1 with mean μ1 and standard deviation σ1. Then we can define Z1 (the standard normal random variable) as

Z1 = (X1 − μ1)/σ1

• Then

Z1² = ((X1 − μ1)/σ1)²

follows a chi-square distribution with one degree of freedom [χ²(1)].
• Let X2 be a normal random variable with mean μ2 and standard deviation σ2, and let Z2 be the corresponding standard normal variable. Then the random variable Z1² + Z2², given by

Z1² + Z2² = ((X1 − μ1)/σ1)² + ((X2 − μ2)/σ2)²

follows a chi-square distribution with 2 degrees of freedom.

• A chi-square distribution with k degrees of freedom is given by the sum of squares of standard normal random variables Z1, Z2, …, Zk obtained by transforming normal random variables X1, X2, …, Xk with mean values μ1, μ2, …, μk and corresponding standard deviations σ1, σ2, …, σk. That is,

χ²(k) = Z1² + Z2² + … + Zk² = ((X1 − μ1)/σ1)² + ((X2 − μ2)/σ2)² + … + ((Xk − μk)/σk)²

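The definition can be illustrated by simulation; a minimal stdlib sketch (k = 5 is an illustrative choice) that checks the sample moments against the known mean k and variance 2k of a chi-square distribution:

```python
import random
from statistics import mean, pvariance

# Sum the squares of k independent standard normals, many times over,
# and compare the sample moments with the theoretical mean k and variance 2k.
random.seed(7)
k = 5
samples = [sum(random.gauss(0, 1) ** 2 for _ in range(k))
           for _ in range(100_000)]

print(round(mean(samples), 1))      # close to k = 5
print(round(pvariance(samples), 1)) # close to 2k = 10
```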
The probability density function of χ²(k) is given by

f(x) = (1 / (2^(k/2) Γ(k/2))) x^(k/2 − 1) e^(−x/2)

where Γ(k/2) is a Gamma function, given by

Γ(k) = ∫_{0}^{∞} x^(k−1) e^(−x) dx
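The density formula can be transcribed directly, since `math.gamma` is available in the standard library; for k = 2 the chi-square density reduces to the exponential density 0.5·e^(−x/2), which serves as a sanity check:

```python
import math

def chi2_pdf(x: float, k: int) -> float:
    """Chi-square density, transcribed from the formula above."""
    return (x ** (k / 2 - 1) * math.exp(-x / 2)) / (2 ** (k / 2) * math.gamma(k / 2))

# Sanity check at an arbitrary point x = 2 for k = 2 degrees of freedom.
x = 2.0
print(abs(chi2_pdf(x, 2) - 0.5 * math.exp(-x / 2)) < 1e-12)  # True
```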
• The cumulative distribution function of a chi-square distribution with k degrees of freedom is given by

F(x) = γ(k/2, x/2) / Γ(k/2)

• where γ(k/2, x/2) is the lower incomplete Gamma function, given by

γ(k, x) = ∫_{0}^{x} t^(k−1) e^(−t) dt
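The CDF can be evaluated numerically from these two formulas; a minimal sketch using trapezoidal integration (valid for k ≥ 2; the k = 1 integrand is singular at t = 0 and needs special handling). For k = 2 the chi-square CDF has the closed form 1 − e^(−x/2), which serves as a check:

```python
import math

def lower_gamma(s: float, x: float, steps: int = 20_000) -> float:
    """Numerically integrate t^(s-1) * e^(-t) on [0, x] by the trapezoidal rule."""
    h = x / steps
    total = sum((i * h) ** (s - 1) * math.exp(-i * h) for i in range(1, steps))
    f0 = 1.0 if s == 1 else 0.0          # integrand at t = 0 (s >= 1 assumed)
    fx = x ** (s - 1) * math.exp(-x)     # integrand at t = x
    return h * (total + (f0 + fx) / 2)

def chi2_cdf(x: float, k: int) -> float:
    """F(x) = lower_gamma(k/2, x/2) / Gamma(k/2)."""
    return lower_gamma(k / 2, x / 2) / math.gamma(k / 2)

print(round(chi2_cdf(2.0, 2), 4))  # 0.6321, matching 1 - exp(-1)
```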
[Figure: Probability density function of chi-square distribution for different values of k]
[Figure: Cumulative distribution of chi-square distribution with k degrees of freedom]
Properties of chi-square distribution
• The mean and standard deviation of a chi-square distribution are k and √(2k), where k is the degrees of freedom.
• As the degrees of freedom k increases, the probability density function of a chi-square distribution approaches the normal distribution.
• The chi-square goodness of fit test is one of the popular tests for checking whether data follow a specific probability distribution.
Student’s t-Distribution
• Student’s t-distribution (or simply t-distribution) arises while estimating the population mean of a normal distribution from a sample that is small and/or when the population standard deviation is unknown.
• The distribution was developed by William Gosset under the pseudonym ‘Student’ while working for the Guinness Brewery in Dublin, Ireland (Student, 1908), and is thus called Student’s t-distribution.
• Assume that X1, X2, …, Xn are n observations (that is, a sample of size n) from a normal distribution with mean μ and standard deviation σ. Let

X̄ = (1/n) Σᵢ Xᵢ

S² = (1/(n − 1)) Σᵢ (Xᵢ − X̄)²

• where X̄ and S are the mean and standard deviation estimated from the sample X1, X2, …, Xn. Then the random variable t defined by

t = (X̄ − μ) / (S/√n)

follows a t-distribution with (n − 1) degrees of freedom.
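Computing the t statistic is a one-liner with the stdlib `statistics` module; the sample and the hypothesised mean below are illustrative, not from the text:

```python
import math
from statistics import mean, stdev

def t_statistic(sample, mu0):
    """t = (xbar - mu0) / (s / sqrt(n)), as defined above."""
    n = len(sample)
    return (mean(sample) - mu0) / (stdev(sample) / math.sqrt(n))

# Illustrative sample of size n = 5, tested against mu0 = 2.
t = t_statistic([1, 2, 3, 4, 5], mu0=2)
print(round(t, 4))  # 1.4142, with n - 1 = 4 degrees of freedom
```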
• The probability density function of the t-distribution with n degrees of freedom is given by

f(x) = [Γ((n + 1)/2) / (√(nπ) Γ(n/2))] (1 + x²/n)^(−(n+1)/2)

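The density formula transcribes directly via `math.gamma`; as a sanity check, for n = 1 the t-distribution is the standard Cauchy distribution, whose density at 0 is 1/π:

```python
import math

def t_pdf(x: float, n: int) -> float:
    """Student's t density, transcribed from the formula above."""
    coeff = math.gamma((n + 1) / 2) / (math.sqrt(n * math.pi) * math.gamma(n / 2))
    return coeff * (1 + x * x / n) ** (-(n + 1) / 2)

print(abs(t_pdf(0.0, 1) - 1 / math.pi) < 1e-12)  # True
```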
[Figure: Cumulative distribution function of Student’s t-distribution]
Properties of t-distribution:
• The mean of a t-distribution with 2 or more degrees of freedom is 0.
• The standard deviation of the t-distribution is √(n/(n − 2)) for n > 2, where n is the number of degrees of freedom.
• As the degrees of freedom n increases, the probability density function of a t-distribution approaches the density function of the standard normal distribution. For n > 120, the area under the probability density function of a t-distribution is very close to the area under a standard normal distribution.
• The t-distribution is an important distribution for hypothesis testing of the mean of a population and for comparing the means of two populations.
F-Distribution
F-distribution (short form of Fisher’s distribution, named after statistician Ronald Fisher) is a ratio of two chi-square distributions. Let Y1 and Y2 be two independent chi-square distributions with k1 and k2 degrees of freedom, respectively. Then the random variable X defined as

X = (Y1/k1) / (Y2/k2)

follows an F-distribution. The probability density function of an F-distribution is given by

f(x) = [Γ((k1 + k2)/2) / (Γ(k1/2) Γ(k2/2))] (k1/k2)^(k1/2) x^(k1/2 − 1) (1 + k1x/k2)^(−(k1 + k2)/2)

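The F density also transcribes directly via `math.gamma`; as a sanity check, for k1 = k2 = 2 the formula reduces to f(x) = 1/(1 + x)²:

```python
import math

def f_pdf(x: float, k1: int, k2: int) -> float:
    """F-distribution density, transcribed from the formula above."""
    coeff = (math.gamma((k1 + k2) / 2)
             / (math.gamma(k1 / 2) * math.gamma(k2 / 2))) * (k1 / k2) ** (k1 / 2)
    return coeff * x ** (k1 / 2 - 1) * (1 + k1 * x / k2) ** (-(k1 + k2) / 2)

# Sanity check at x = 1: for k1 = k2 = 2 the density is 1 / (1 + 1)^2 = 0.25.
x = 1.0
print(abs(f_pdf(x, 2, 2) - 1 / (1 + x) ** 2) < 1e-12)  # True
```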
[Figure: Probability density function of F-distribution]
[Figure: Cumulative distribution function of F-distribution]
Properties of F-distribution:
• The mean of the F-distribution is k2/(k2 − 2), for k2 > 2.
• The standard deviation of the F-distribution is

√[ 2k2² (k1 + k2 − 2) / (k1 (k2 − 2)² (k2 − 4)) ]  for k2 > 4.

• The F-distribution is non-symmetrical and the shape of the distribution depends on the values of k1 and k2.
• The F-distribution is used in Analysis of Variance (ANOVA) to test the mean values of multiple groups.
Summary
• The concepts of probability, random variables and probability distributions are foundations of data science. Knowledge of these concepts is important for framing and solving analytics problems.

• A random variable is a function that maps an outcome of a random experiment to a real number, and it plays an important role in analytics since many key performance indicators used across industries are random variables.

• Basic probability concepts such as joint events, independent events, conditional probability and Bayes’ theorem are useful for predicting the probability of an event of importance. These concepts are used in algorithms such as association rule learning, which is used in solving analytics problems such as market basket analysis and recommender systems.
• Discrete probability distributions such as the binomial distribution, Poisson distribution and geometric distribution are used for modelling discrete random variables.

• Continuous distributions such as the normal distribution, chi-square distribution, t-distribution and F-distribution play an important role in hypothesis testing.