CS 215
Data Analysis and Interpretation
Expectation
Suyash P. Awate
Expectation
• “Expectation” of the random variable;
“Expected value” of the random variable;
“Mean” of the random variable.
• “Expected value” isn’t necessarily the value that is
most likely to be observed in the
random experiment
• Can think of it as the center of mass of
the probability mass/density function
Expectation
• Definition:
Expectation of a Discrete Random Variable: E[X] := ∑_i x_i P(X = x_i)
• Frequentist interpretation of probabilities and expectation
• If a random experiment is repeated infinitely many times,
then the proportion of number of times event E occurs is the probability P(E)
• If a random experiment underlying a discrete random variable X
is repeated infinitely many times,
then the proportion of number of experiments when X takes value x is P(X=x)
• So, in N→∞ experiments, number of times X takes value xi will → N.P(X=xi)
• So, across all N→∞ experiments,
arithmetic average of observed values will
→ (1/N) ∑_i x_i (N·P(X=x_i))
= ∑_i x_i P(X=x_i) = E[X]
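• A minimal numpy sketch (not from the slides; the fair die and the seed are arbitrary choices) of the frequentist argument above: the long-run arithmetic average of observed values approaches ∑_i x_i P(X=x_i).
```python
# Frequentist view of expectation: the average of many die rolls
# approaches sum_i x_i * P(X = x_i).
import numpy as np

rng = np.random.default_rng(0)                 # arbitrary seed
values = np.arange(1, 7)                       # faces of a fair die
pmf = np.full(6, 1 / 6)                        # P(X = x_i)

expectation = np.sum(values * pmf)             # E[X] from the definition
samples = rng.integers(1, 7, size=1_000_000)   # N repeated experiments
empirical_average = samples.mean()             # (1/N) * sum of observed values

print(expectation, empirical_average)          # both close to 3.5
```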
Expectation
• Another Formulation of Expectation
• Recall:
• Discrete random variable X is a function defined on a probability space {Ω,ℬ,P}
• Function X:Ω→R, maps each element in sample space Ω to a single numerical value
belonging to the set of real numbers
[Diagram: X(·) maps s ∈ Ω to x = X(s)]
• E[X] := ∑_i x_i P(X = x_i) = ∑_{s∈Ω} X(s) P(s)
Expectation
• Example
• “Expected value” for the uniform random variable modelling a die roll
• Values on die are {1,2,3,4,5,6}
• E[X] = 3.5
• Expectation of a uniform random variable (discrete case)
• If X has a uniform distribution over the n consecutive integers in [a,b],
then E[X] = (a+b)/2
Expectation
• Example
• Expectation of a binomial random variable (when n=1, this is Bernoulli)
• E[X] = ∑_{k=0}^{n} k C(n,k) p^k (1−p)^{n−k} = np ∑_{j=0}^{m} C(m,j) p^j (1−p)^{m−j} = np,
using the substitutions j := k – 1 and m := n – 1
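• A small numeric check (an illustration, not part of the slides; n and p are arbitrary) that summing k·P(X=k) over the binomial PMF indeed gives np.
```python
# Binomial mean check: sum_k k * C(n,k) p^k (1-p)^(n-k) equals n*p.
import math
import numpy as np

n, p = 20, 0.3                                       # arbitrary parameters
k = np.arange(n + 1)
pmf = np.array([math.comb(n, int(ki)) * p**ki * (1 - p)**(n - ki) for ki in k])

print(np.sum(k * pmf), n * p)                        # both 6.0 (up to float error)
```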
Expectation
• Example
• Expectation of a Poisson random variable: E[X] = ∑_{k=0}^{∞} k e^{−λ} λ^k / k! = λ
• Consider random arrivals/hits occurring at a constant average rate λ>0,
i.e., λ arrivals/hits (typically) per unit time
• This gives meaning to parameter λ as the average number of arrivals in unit time
Expectation
• Definition:
Expectation of a Continuous Random Variable: E[X] := ∫_{−∞}^{+∞} x P(x) dx
• Frequentist interpretation of probabilities and expectation
• If a random experiment underlying a continuous random variable X
is repeated N→∞ times,
then,
for a tiny interval [x,x+Δx],
the proportion of time X takes values within interval is approximately P(x)Δx
• So, in N→∞ experiments,
number of times we will get X within [xi,xi+Δx] is approximately N.P(xi)Δx
• So, across all N→∞ experiments,
arithmetic average of all observed values is
approximately (1/N) ∑_i x_i (N·P(x_i)Δx) = ∑_i x_i P(x_i)Δx
• In the limit that Δx→0, this average→E[X]
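• A minimal sketch of the Riemann-sum intuition above, using numpy and the exponential PDF as an arbitrary example (true mean 1/λ): E[X] ≈ ∑_i x_i P(x_i)Δx on a fine grid.
```python
# Approximate E[X] = integral of x * P(x) dx by a Riemann sum on a fine grid.
import numpy as np

lam = 2.0                                  # arbitrary rate
dx = 1e-4
x = np.arange(0.0, 50.0, dx)               # [0, 50] holds essentially all the mass
pdf = lam * np.exp(-lam * x)               # P(x) for the exponential

approx_mean = np.sum(x * pdf * dx)         # sum_i x_i * P(x_i) * dx
print(approx_mean, 1 / lam)                # both close to 0.5
```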
Expectation
• Another Formulation of Expectation
[Diagram: X(·) maps s ∈ Ω to x = X(s)]
• Recall:
• Random variable X is a function defined on a probability space {Ω,ℬ,P}
• Function X:Ω→R, maps each element in sample space Ω to a single numerical value
belonging to the set of real numbers
• E[X] := ∫_{−∞}^{+∞} x P(x) dx = ∫_Ω X(s) P(s) ds
• Intuition remains the same as in the discrete case
• Using probability-mass conservation:
P(x)Δx is approximated by P(s1)Δs1 + P(s2)Δs2 + …,
where the intervals [s1,s1+Δs1], [s2,s2+Δs2], … in Ω map into [x,x+Δx]
• Thus, x·P(x)Δx is approximated by
X(s1)·P(s1)Δs1 + X(s2)·P(s2)Δs2 + …
• A more rigorous proof needs advanced results in real analysis
Expectation
• Mean as the center of mass
• By definition,
mean m := E[X] := ∫ x P(x) dx
• Thus, ∫ (x − m) P(x) dx = 0
• Mass P(x)dx
placed around location ‘x’
applies a torque ∝ P(x)dx.(x−m)
at the fulcrum placed at location ‘m’
• Because the integral ∫ (x − m) P(x) dx is zero,
the net torque around the fulcrum ‘m’ is zero
• Hence, ‘m’ is the center of mass
Expectation
• Example
• Expectation of a uniform random variable (continuous case):
if X is uniform over [a,b], then E[X] = ∫_a^b x · 1/(b−a) dx = (a+b)/2
Expectation
• Example
• Expectation of an exponential random variable
PDF: P(x) = 0, for all x < 0; P(x) = λ exp(−λx), ∀x ≥ 0
CDF: F(x) = 0, for all x < 0; F(x) = 1 − exp(−λx), ∀x ≥ 0
• Consider random arrivals/hits occurring
at a constant average rate λ > 0
• E[X] = ∫_0^∞ x λ exp(−λx) dx = 1/λ; define β := 1/λ
• This gives meaning to parameter β as the average inter-arrival time
• A larger arrival/hit rate leads to a shorter average inter-arrival time
Expectation
• Example
• Expectation of a Gaussian random variable: E[X] = ∫ x N(x; μ, σ2) dx = μ,
by symmetry of the PDF around μ
Expectation
• Example
• Expectation of a limiting case of binomial
• As n tends to infinity,
binomial
tends to a
“Gaussian” form
• Gaussian expectation μ(=np here) is
consistent with binomial expectation np
Expectation
• Linearity of Expectation
• For both discrete and continuous random variables
• For random variables X and Y having a joint probability space (Ω,ẞ,P),
the following rules hold:
• E[X + Y] = E[X] + E[Y]
• Either (discrete case) LHS = ∑_x ∑_y (x+y) P(x,y) = ∑_x x P(x) + ∑_y y P(y) = RHS
• Or (continuous case) LHS = ∫_x ∫_y (x+y) P(x,y) dx dy = ∫_x x (∫_y P(x,y) dy) dx + ∫_y y (∫_x P(x,y) dx) dy = RHS
• E[X + c] = E[X] + c, where ‘c’ is a constant
• E[a X] = a E[X], where ‘a’ is a scalar constant
• This generalizes to: E[a1X1 + … + anXn + c] = a1E[X1] + … + anE[Xn] + c
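• A simulation sketch (not from the slides; the distributions, constants, and seed are arbitrary) of linearity: E[aX + bY + c] = aE[X] + bE[Y] + c holds even when X and Y are dependent.
```python
# Linearity of expectation, checked on deliberately dependent X and Y.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(3.0, 1.0, size=1_000_000)
y = x**2 + rng.uniform(0.0, 1.0, size=x.size)     # Y depends on X on purpose
a, b, c = 2.0, -0.5, 4.0

lhs = np.mean(a * x + b * y + c)                  # E[aX + bY + c]
rhs = a * np.mean(x) + b * np.mean(y) + c         # aE[X] + bE[Y] + c
print(lhs, rhs)                                   # agree up to sampling noise
```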
Expectation
• Expectation of a “function of a random variable”
• Let us define values y := Y(x), or “Y(.) is a function of the random variable X”
[Diagram: s ↦ x := X(s) ↦ y := Y(x) := Y(X(s))]
• Discrete random variable: E[Y(X)] := E_{P(X)}[Y(X)] := ∑_{x_i} Y(x_i) P(x_i)
• Continuous random variable: E[Y(X)] := E_{P(X)}[Y(X)] := ∫_x Y(x) P(x) dx
• Property:
• Just as E_{P(S)}[X(S)] = E_{P(X)}[X], …
• … we get E_{P(X)}[Y(X)] = E_{P(Y)}[Y]
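• A minimal sketch of the discrete formula above (the fair die and Y(x) := x2 are arbitrary illustrative choices): ∑_i Y(x_i) P(x_i) matches the average of Y over simulated draws of X.
```python
# E[Y(X)] = sum_i Y(x_i) P(x_i), checked against E_{P(Y)}[Y] estimated by sampling.
import numpy as np

rng = np.random.default_rng(2)
values = np.arange(1, 7)                         # fair die
pmf = np.full(6, 1 / 6)

lotus = np.sum(values**2 * pmf)                  # sum_i Y(x_i) P(x_i), Y(x) = x^2
samples = rng.integers(1, 7, size=1_000_000)
empirical = np.mean(samples**2)                  # average of observed Y values

print(lotus, empirical)                          # both close to 91/6 ≈ 15.17
```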
Expectation
• Expectation of a function of multiple random variables
• Definition: When we have multiple random variables X1,…,Xn with
a joint PMF/PDF P(X1,…,Xn) and
a function of the multiple random variables g(X1,…,Xn),
then we define the expectation of g(X1,…,Xn) as:
E[g(X1, …, Xn)] := ∑_{x1,…,xn} g(x1, …, xn) P(X1 = x1, …, Xn = xn)   (discrete)
or
E[g(X1, …, Xn)] := ∫_{x1,…,xn} g(x1, …, xn) P(x1, …, xn) dx1 … dxn   (continuous)
• If X and Y are independent, then E[XY] = E[X] E[Y]
• Proof:
• ∑_{x,y} x y P(X=x, Y=y) = ∑_{x,y} x y P(X=x) P(Y=y) = (∑_x x P(X=x)) (∑_y y P(Y=y))
Expectation
• Tail-sum formula
• Let X be a discrete random variable taking values in set of natural numbers
• Then, E[X] = ∑_{k=1}^{∞} P(X ≥ k)
• Proof: arrange the masses in a triangular array, where row x repeats P(X=x) exactly x times:
P(x=1)
P(x=2) P(x=2)
P(x=3) P(x=3) P(x=3)
P(x=4) P(x=4) P(x=4) P(x=4)
…
Sum over rows (row number = x): ∑_x x P(X=x) = E[X]
Sum over columns (column number = k): ∑_k P(X ≥ k)
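• A small numeric check (an illustration; the fair die is an arbitrary choice) of the tail-sum formula for a discrete variable taking values 1..6.
```python
# Tail-sum formula: E[X] = sum_{k>=1} P(X >= k) for a fair die.
import numpy as np

values = np.arange(1, 7)
pmf = np.full(6, 1 / 6)

mean_direct = np.sum(values * pmf)                              # E[X]
tail_sum = sum(pmf[values >= k].sum() for k in range(1, 7))     # sum_k P(X >= k)
print(mean_direct, tail_sum)                                    # both 3.5
```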
Expectation
• Tail-sum formula
• Let X be a continuous random variable taking non-negative values
• Notation: For random variable X, PDF is fX(.) and CDF is FX(.)
• Then, E[X] = ∫_0^∞ P(X > t) dt = ∫_0^∞ (1 − F_X(t)) dt
• Proof:
E[X] = ∫_0^∞ x f_X(x) dx = ∫_0^∞ (∫_0^x dt) f_X(x) dx = ∫_0^∞ (∫_t^∞ f_X(x) dx) dt = ∫_0^∞ (1 − F_X(t)) dt,
by swapping the order of integration over the region {(t,x): 0 ≤ t ≤ x}
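• A numeric sketch of the continuous tail-sum formula, using the exponential CDF as an arbitrary example: ∫_0^∞ (1 − F_X(t)) dt ≈ E[X] = 1/λ.
```python
# Continuous tail-sum: E[X] = integral over t >= 0 of (1 - F_X(t)) dt.
import numpy as np

lam = 2.0                                  # arbitrary rate
dt = 1e-4
t = np.arange(0.0, 50.0, dt)
survival = np.exp(-lam * t)                # 1 - F_X(t) for the exponential

print(np.sum(survival * dt), 1 / lam)      # both close to 0.5
```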
Expectation in Life
• Action without expectation → Happiness [Indian Philosophy]
Quantile, Quartile
• Definition: For a discrete/continuous random variable
with a PMF/PDF P(.), the q-th quantile
(where 0<q<1) is any real number ‘xq’
such that P(X≤xq) ≥ q and P(X≥xq) ≥ 1-q
• Quartiles: q = 0.25 (1st quartile),
q = 0.5 (2nd), q = 0.75 (3rd)
• Percentiles
• q=0.25 → 25th percentile
• Box plot,
box-and-whisker plot
• Inter-Quartile Range
(IQR)
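• A minimal numpy sketch (synthetic data; parameters and seed are arbitrary) of the quartiles and the IQR; np.quantile computes empirical quantiles of a data sample.
```python
# Quartiles and inter-quartile range of a data sample.
import numpy as np

rng = np.random.default_rng(3)
data = rng.normal(10.0, 2.0, size=10_000)        # synthetic data

q1, q2, q3 = np.quantile(data, [0.25, 0.50, 0.75])
iqr = q3 - q1
print(q1, q2, q3, iqr)                           # q2 is the median
```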
Quantile, Median
• Definition:
For a discrete/continuous random variable with a PMF/PDF P(.),
the median is any real number ‘m’
such that P(X≤m) ≥ 0.5 and P(X≥m) ≥ 0.5
• Median = second quartile
• Definition:
For a continuous random variable with a PDF P(.),
the median is any real number ‘m’
such that P(X≤m) = P(X>m)
• CDF: F_X(m) = 0.5
• A PDF can be associated with multiple medians
Mode
• For discrete X
• Mode m is a value for which the PMF value P(X=m) is maximum
• A PMF can have multiple modes
• For continuous X
• Mode ‘m’ is any local maximum of the PDF P(.)
• A PDF can have multiple modes
• Unimodal PDF = A PDF having only 1 local maximum
• Bimodal PDF:
2 local maxima
• Multimodal PDF:
2 or more
local maxima
Mean, Median, Mode
• For continuous X, for unimodal and symmetric distributions,
mode = mean = median
• Assuming symmetry
around mode,
mass on left of mode =
mass on right of mode
• So, mode = median
• Assuming symmetry
around mode,
every P(x)dx mass on left of mode
is matched by
a P(x)dx mass on right of mode
• So, mode = mean
Variance
• Definition: Var(X) := E[(X-E[X])2]
• A measure of the spread of the mass (in PMF or PDF) around the mean
• Property: Variance is always non-negative
• Property: Var(X) = E[X2] – (E[X])2
• Proof: LHS =
E[(X-E[X])2]
= E[ X2 + (E[X])2 – 2.X.E[X] ]
= E[X2] + (E[X])2 – 2(E[X])2
= E[X2] – (E[X])2 = RHS
• Definition: Standard deviation is the square root of the variance
• Units of variance = square of units of values taken by random variable
• Units of standard deviation = units of values taken by random variable
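• A quick numeric check (not from the slides; the distribution and seed are arbitrary) that the two variance formulas above agree on sampled data.
```python
# Variance two ways: E[(X - E[X])^2] versus E[X^2] - (E[X])^2.
import numpy as np

rng = np.random.default_rng(4)
x = rng.exponential(scale=2.0, size=1_000_000)   # true variance = scale^2 = 4

m = x.mean()
var_def = np.mean((x - m) ** 2)                  # E[(X - E[X])^2]
var_alt = np.mean(x**2) - m**2                   # E[X^2] - (E[X])^2
print(var_def, var_alt)                          # both close to 4
```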
Variance
• Variance of a Uniform Random Variable
• Discrete case
• X has uniform distribution over n integers {a, a+1, …, b}
• Here, n = b–a+1
• Variance = (n2 – 1) / 12
Variance
• Variance of a Binomial Random Variable
• Var(X) = E[X2] – (E[X])2 , where E[X] = np
Variance
• Variance of a Binomial Random Variable
• Var(X) = E[X2] – (E[X])2 , where E[X] = np
• So, E[X2]
= np (mp + 1)
= np ((n–1)p + 1)
= (np)2 + np(1-p)
• Thus, Var(X) = np(1–p) = npq
• Interpretation
• When p=0 or p=1,
then Var(X) = 0,
which is the minimum possible
• When p=q=0.5,
then Var(X) is maximized
Variance
• Variance of a Poisson Random Variable
• Var(X) = E[X2] – (E[X])2 , where E[X] = λ
Variance
• Variance of a Poisson Random Variable
• Var(X) = E[X2] – (E[X])2 , where E[X] = λ
• So, E[X2]
= λ (λ.1 + 1)
= λ2 + λ
• Thus, Var(X) = λ
• Interpretation
• Mean of Poisson random variable was also λ
• Standard deviation of Poisson random variable is λ0.5
• As mean increases, so does variance (and standard deviation)
• When the mean increases by a factor of N (i.e., an N-times larger signal = number of arrivals/hits),
the standard deviation (spread) increases only by a factor of N^0.5
• So, as N increases,
the variability in the number of arrivals/hits, relative to the average arrival/hit rate, decreases
Variance
• Variance of a Uniform Random Variable
• Continuous case
• X has uniform distribution over [a,b]
• Variance = (b – a)2 / 12
Variance
• Variance of an Exponential Random Variable
PDF: P(x) = 0, for all x < 0; P(x) = λ exp(−λx), ∀x ≥ 0
CDF: F(x) = 0, for all x < 0; F(x) = 1 − exp(−λx), ∀x ≥ 0
• Var(X) = E[X2] – (E[X])2, where E[X] = β := 1/λ
• So, Var(X) = β2, i.e., β = E[X] = SD(X); unlike the Poisson case, where SD(X) = (E[X])^0.5
Variance
• Variance of a Gaussian Random Variable
• Var(X) = E[X2] – (E[X])2, where E[X] = μ
Variance
• Variance of a Gaussian Random Variable
• Var(X) = E[X2] – (E[X])2 , where E[X] = μ
• Hint: the integrand reduces to terms of the form t·(t·exp(−t2)), which can be integrated by parts
Variance
• Example
• Variance of a limiting case of binomial
• As n tends to infinity,
binomial
tends to
Gaussian
• Gaussian variance σ2 (= npq in this case) is
consistent with binomial variance npq
Variance
• Property: Var(aX+c) = a2Var(X)
• Adding a constant to a random variable doesn’t change the variance (spread)
• This only shifts the PDF/PMF
• If Y := X + c, then Var(Y) = Var(X)
• If we scale a random variable by ‘a’, then the variance gets scaled by a2
• If Y := aX, then Var(Y) = a2Var(X)
• Proof: Var(aX+c) = E[(aX + c – aE[X] – c)2] = E[a2(X – E[X])2] = a2Var(X)
Variance
• Property: Var(X+Y) = Var(X) + Var(Y) + 2(E[XY] – E[X]E[Y])
• Proof: Var(X+Y) = E[(X+Y)2] – (E[X+Y])2
= E[X2] + E[Y2] + 2E[XY] – (E[X])2 – (E[Y])2 – 2E[X]E[Y]
= Var(X) + Var(Y) + 2(E[XY] – E[X]E[Y])
• If X and Y are independent,
then E[XY] = E[X] E[Y], and so Var(X+Y) = Var(X) + Var(Y)
• If X,Y,Z are independent, then
Var(X+Y+Z) = Var(X+Y) + Var(Z) = Var(X) + Var(Y) + Var(Z)
• For independent random variables X1, …, Xn;
Var(X1 + … + Xn) = Var(X1) + … + Var(Xn)
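• A simulation sketch (arbitrary distributions, constants, and seed) of the two properties above: Var(aX+c) = a2Var(X), and Var(X+Y) = Var(X) + Var(Y) for independent X and Y.
```python
# Variance under scaling/shifting, and additivity for independent variables.
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(1.0, 3.0, size=1_000_000)          # Var(X) ~ 9
y = rng.uniform(0.0, 6.0, size=x.size)            # independent of X, Var(Y) ~ 3
a, c = -2.0, 7.0

print(np.var(a * x + c), a**2 * np.var(x))        # both close to 36
print(np.var(x + y), np.var(x) + np.var(y))       # both close to 12
```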
Markov’s Inequality
• Theorem: Let X be a random variable with PDF P(.).
Let u(.) be a non-negative-valued function.
Let ‘c’ be a positive constant.
Then, P(u(X) ≥ c) ≤ E[u(X)] / c
• Proof:
• E[u(X)] = ∫x:u(x)≥c u(x) P(x) dx + ∫x:u(x)<c u(x) P(x) dx
• Because u(.) takes non-negative values, each integral above is non-negative
• So, E[u(X)] ≥ ∫x:u(x)≥c u(x) P(x) dx
≥ c ∫x:u(x)≥c P(x) dx
= c P(u(X) ≥ c)
• Because c>0, we get E[u(X)]/c ≥ P(u(X) ≥ c)
• Special case → when X takes non-negative values & u(x) := x,
we get P(X ≥ c) ≤ E[X] / c
Chebyshev’s Inequality
(Recall Markov’s Inequality: P(u(X) ≥ c) ≤ E[u(X)] / c)
• Theorem: Let X be a random variable with PDF P(.),
finite expectation E[X], and finite variance Var(X).
Then, P(|X-E[X]| ≥ a) ≤ Var(X) / a2
• Proof:
• Define random variable u(X) := (X-E[X])2
• Then, by Markov’s inequality, P(u(X) ≥ a2) ≤ E[u(X)] / a2
• LHS = P(|X-E[X]| ≥ a)
• RHS = Var(X) / a2
• Q.E.D.
• Corollary: If random variable X has standard deviation σ, then
P(|X-E[X]| ≥ kσ) ≤ 1/k2
• This is consistent with the notion of standard deviation (σ) or variance (σ2)
measuring the spread of the PDF around the mean (center of mass)
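• A numeric sketch (an arbitrary exponential example and seed) checking both Markov’s and Chebyshev’s bounds on simulated data: the empirical tail probabilities never exceed the bounds.
```python
# Markov: P(X >= c) <= E[X]/c for non-negative X; Chebyshev: P(|X-E[X]| >= k*sigma) <= 1/k^2.
import numpy as np

rng = np.random.default_rng(6)
x = rng.exponential(scale=1.0, size=1_000_000)    # non-negative, E[X] = 1

c = 3.0
print(np.mean(x >= c), x.mean() / c)              # empirical tail vs Markov bound

k = 2.0
dev = np.abs(x - x.mean())
print(np.mean(dev >= k * x.std()), 1 / k**2)      # empirical tail vs Chebyshev bound
```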
Chebyshev’s Inequality
Chebyshev
• Pafnuty Chebyshev
• Founding father of Russian mathematics
• Students: Lyapunov, Markov
• First person to think
systematically in terms of
random variables and their
moments and expectations
Markov
• Andrey Markov
• Russian mathematician best known for
his work on stochastic processes
• Advisor: Chebyshev
• Students: Voronoy
• One year after doctoral defense,
appointed extraordinary professor
• He figured out that he could use chains to model
the alternation of vowels and consonants
in Russian literature
Jensen’s Inequality
• Theorem: Let X be any random variable; f(.) be any convex function.
Then, E[f(X)] ≥ f(E[X])
(A real-valued function is called convex if the line segment between any two points
on the graph of the function lies above/never-below the graph between the two points.)
• Proof:
• Let m := E[X], can be anywhere on real line
• Consider a tangent (subderivative line) to f(.) at [m,f(m)]
• This line is, say, Y = aX+b,
which lies at/below (never above) f(X)
• Then, f(m) = am+b
• Then,
E[f(X)] ≥ E[aX+b]
= aE[X] + b
= f(E[X])
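• A minimal numeric illustration (the convex function exp(.) and the Gaussian X are arbitrary choices) of E[f(X)] ≥ f(E[X]).
```python
# Jensen's inequality for the convex function f(x) = exp(x).
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(0.0, 1.0, size=1_000_000)

lhs = np.mean(np.exp(x))          # E[f(X)], close to exp(0.5) ~ 1.65 here
rhs = np.exp(np.mean(x))          # f(E[X]), close to exp(0) = 1
print(lhs, rhs, lhs >= rhs)
```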
Jensen’s Inequality
• Corollary: Let X be any random variable; g(.) be any concave function.
Then, E[g(X)] ≤ g(E[X])
(A real-valued function is called concave if the line segment between any two points
on the graph of the function lies below/never-above the graph between the two points.)
• Proof:
• Let m := E[X], can be anywhere on real line
• Consider a tangent (subderivative line) to g(.) at [m,g(m)]
• This line is, say, Y = aX+b,
which lies at/above (never below) g(X)
• Then, g(m) = am+b
• Then,
E[g(X)] ≤ E[aX+b]
= aE[X] + b
= g(E[X])
Jensen
• Johan Jensen
• Danish mathematician and engineer
• President of the Danish Mathematical Society
from 1892 to 1903
• Never held any academic position
• Engineer for Copenhagen Telephone Company
• Became head of its technical department
• Learned advanced math topics by himself
• All his mathematics research
was carried out in his spare time
Minimizer of Expected Absolute Deviation
• Theorem: E[|X – c|] is minimum when c = Median(X)
• Case 1: Let c ≤ m := Median(X)
• E[|X – c|] = ∫_{−∞}^{c} (c − x) P(x) dx + ∫_{c}^{∞} (x − c) P(x) dx   (say, A + B)
• A = ∫_{−∞}^{m} (c − x) P(x) dx − ∫_{c}^{m} (c − x) P(x) dx   (say, A1 – A2)
• B = ∫_{c}^{m} (x − c) P(x) dx + ∫_{m}^{∞} (x − c) P(x) dx   (say, B1 + B2)
• Now, B1 – A2 = 2 ∫_{c}^{m} (x − c) P(x) dx ≥ 0
• A1 = ∫_{−∞}^{m} (c − m) P(x) dx + ∫_{−∞}^{m} (m − x) P(x) dx   (say, A11 + A12)
• B2 = ∫_{m}^{∞} (x − m) P(x) dx + ∫_{m}^{∞} (m − c) P(x) dx   (say, B21 + B22)
• Now, A11 + B22 = –(m–c) (1–P(x≥m)) + (m–c) P(x≥m) = (m–c) (2P(x≥m)–1) ≥ 0
• Now, A12 + B21 = E[|X – m|]
• So, A+B = E[|X – m|] + (m–c) (2P(x≥m) – 1) + 2 ∫_{c}^{m} (x − c) P(x) dx
• Value of c minimizing A+B is c = m
Minimizer of Expected Absolute Deviation
• Theorem: E[|X – c|] is minimum when c = Median(X)
• Case 2: Let m := Median(X) ≤ c
• E[|X – c|] = ∫_{−∞}^{c} (c − x) P(x) dx + ∫_{c}^{∞} (x − c) P(x) dx   (say, A + B)
• A = ∫_{−∞}^{m} (c − x) P(x) dx + ∫_{m}^{c} (c − x) P(x) dx   (say, A1 + A2)
• B = − ∫_{m}^{c} (x − c) P(x) dx + ∫_{m}^{∞} (x − c) P(x) dx   (say, – B1 + B2)
• Now, A2 – B1 = 2 ∫_{m}^{c} (c − x) P(x) dx ≥ 0
• A1 = ∫_{−∞}^{m} (c − m) P(x) dx + ∫_{−∞}^{m} (m − x) P(x) dx   (say, A11 + A12)
• B2 = ∫_{m}^{∞} (x − m) P(x) dx + ∫_{m}^{∞} (m − c) P(x) dx   (say, B21 + B22)
• Now, A11 + B22 = (c–m) P(x≤m) – (c–m) (1–P(x≤m)) = (c–m) (2P(x≤m)–1) ≥ 0
• Now, A12 + B21 = E[|X – m|]
• So, A+B = E[|X – m|] + (c–m) (2P(x≤m) – 1) + 2 ∫_{m}^{c} (c − x) P(x) dx
• Value of c minimizing A+B is c = m
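• A simulation sketch (a skewed exponential example; the grid and seed are arbitrary) of the theorem: over a grid of candidate values c, the empirical E[|X − c|] is smallest near the median.
```python
# Sweep c and find the minimizer of the mean absolute deviation E[|X - c|].
import numpy as np

rng = np.random.default_rng(8)
x = rng.exponential(scale=1.0, size=200_000)     # skewed; median is ln 2 ~ 0.693

cs = np.linspace(0.0, 3.0, 301)
mad = np.array([np.mean(np.abs(x - c)) for c in cs])
best_c = cs[np.argmin(mad)]

print(best_c, np.median(x), np.log(2))           # all close to 0.693
```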
Mean, Median, Standard Deviation
• Theorem:
Mean(X) and Median(X) are within a distance of SD(X) of each other
• Proof:
• Distance between mean and median
= |E[X] – Median(X)|
= |E[X – Median(X)]|
This is |E[.]|, where |.| is a convex function. Apply Jensen’s inequality.
≤ E[|X – Median(X)|]
≤ E[|X – E[X]|] (because Median(X) minimizes expected absolute deviation)
= E[Sqrt{ (X – E[X])2 }]
This is E[Sqrt(.)], where Sqrt(.) is a concave function. Apply Jensen’s inequality.
≤ Sqrt{ E[ (X – E[X])2 ] }
= Sqrt{ Var(X) } = SD(X)
Law of Large Numbers
• This justifies why the expectation is motivated as an average over a
large number of random experiments (“long-term average”)
• Let random variables X1, …, Xi, …, Xn be ‘n’ independent and identically
distributed (i.i.d.), each with mean μ=E[Xi] and finite variance v=Var(Xi)
• Let the average, over ‘n’ experiments, be modeled by
a random variable X̄ := (X1 + … + Xn) / n
• Then, the expected average E[X̄] = μ, by the linearity of expectation
• But, in specific runs, how close is X̄ to the expectation μ ?
• So, we analyze the spread of X̄ around μ
• Var(X̄) = Var(X1/n) + … + Var(Xn/n) = n(v/n2) = v/n
Law of Large Numbers
• This justifies why the expectation is motivated as an average over a
large number of random experiments
• Law of large numbers: For all ε > 0, as n→∞, P(|X̄ – μ| ≥ ε) → 0
• Proof: Using Chebyshev’s inequality,
P(|X̄ – μ| ≥ ε)
≤ Var(X̄) / ε2
= v / (nε2)
→0, as n→∞
• Thus, as the average X̄ uses data from a larger number of experiments ‘n’,
the event of “X̄ being farther from μ than ε” has a probability that tends to 0
Law of Large Numbers
• Example
• This also gives us a way to
compute an “estimate” of
the expectation μ of a
random variable X
from “observations”/data
• What is the estimate ?
• X̄
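• A minimal sketch of the law of large numbers (uniform [0,1] samples, so μ = 0.5; the run counts, ε, and seed are arbitrary): the probability that X̄ deviates from μ by more than ε shrinks as n grows.
```python
# Empirical P(|Xbar - mu| >= eps) for increasing sample sizes n.
import numpy as np

rng = np.random.default_rng(9)
mu, eps = 0.5, 0.05
for n in [10, 100, 1_000, 10_000]:
    xbar = rng.uniform(0.0, 1.0, size=(2_000, n)).mean(axis=1)   # 2000 runs of size n
    print(n, np.mean(np.abs(xbar - mu) >= eps))                  # deviation probability
```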
Law of Large Numbers
www.nature.com/articles/nmeth.2613
Covariance
• For random variables X and Y, consider the joint PMF/PDF P(X,Y)
• Covariance: A measure of how the values taken by X and Y vary
together (“co”-“vary”)
• Definition: Cov(X,Y) := E[(X – E[X])(Y – E[Y])]
• Interpretation:
• Define U(X) := X – E[X] and V(Y) := Y – E[Y] (Note: U and V have expectation 0)
• In the joint distribution P(U,V),
if larger (more +ve) values of U typically correspond to larger values of V, and
smaller (more –ve) values of U typically correspond to smaller values of V,
then U and V co-vary positively
• In the joint distribution P(U,V),
if larger values of U typically correspond to smaller values of V, and …
then U and V co-vary negatively
• Property: Symmetry: Cov(X,Y) = Cov(Y,X)
Covariance
• Examples
Covariance
• Property: Cov(X,Y) = E[XY] – E[X]E[Y]
• Proof:
• Cov(X,Y) = E[(X – E[X])(Y – E[Y])] = E[XY] – E[X]E[Y] – E[X]E[Y] + E[X]E[Y] = E[XY] – E[X]E[Y]
• So, Var(X+Y) = Var(X) + Var(Y) + 2(E[XY] – E[X]E[Y]) = Var(X) + Var(Y) + 2Cov(X,Y)
• Also, when X and Y are independent, then Cov(X,Y) = 0
• Property: When Var(X) and Var(Y) are finite, and one of them is 0,
then Cov(X,Y)=0
• Property: When Y := mX + c (with finite m), what is Cov(X,Y) ?
• Cov(X,Y) = E[XY] – E[X]E[Y]
= E[mX2 + cX] – E[X](m.E[X] + c)
= m.E[X2] – m(E[X])2 = m.Var(X)
• When Var(X)>0, covariance is ∝ line-slope ‘m’, and has same sign as that of m
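• A numeric check (arbitrary distribution, slope, intercept, and seed) of Cov(X,Y) = E[XY] − E[X]E[Y], and of Cov(X, mX + c) = m·Var(X) for a linearly related Y.
```python
# Covariance two ways, and covariance with a linear function of X.
import numpy as np

rng = np.random.default_rng(10)
x = rng.normal(2.0, 1.5, size=1_000_000)              # Var(X) ~ 2.25
m, c = -3.0, 5.0
y = m * x + c

cov_def = np.mean((x - x.mean()) * (y - y.mean()))    # E[(X-E[X])(Y-E[Y])]
cov_alt = np.mean(x * y) - x.mean() * y.mean()        # E[XY] - E[X]E[Y]
print(cov_def, cov_alt, m * np.var(x))                # all close to -6.75
```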
Covariance
• Bilinearity of Covariance
• Let X, X1, X2, Y, Y1, Y2 be random variables. Let c be a scalar constant.
• Property: Cov(X1 + X2, Y) = Cov(X1, Y) + Cov(X2, Y) = Cov(Y, X1 + X2)
• Proof (first part; second part follows from symmetry):
Cov(X1 + X2, Y) = E[(X1 + X2)Y] – E[X1 + X2]E[Y]
= (E[X1Y] – E[X1]E[Y]) + (E[X2Y] – E[X2]E[Y]) = Cov(X1, Y) + Cov(X2, Y)
• Property: Cov(aX, Y) = a.Cov(X, Y) = Cov(X, aY)
• Proof (first part):
• Cov(aX, Y)
= E[ aXY ] − E[ aX ]E[ Y ]
= a (E[ XY ] − E[ X ]E[ Y ])
= a Cov(X,Y)
Standardized Random Variable
• Definition:
If X is a random variable, then its standardized form is given by
X* := (X – E[X]) / SD(X), where SD(.) gives the standard deviation
• Property: E[X*] = 0, Var(X*) = 1
• Proof: E[X*] = (E[X] – E[X]) / SD(X) = 0;
Var(X*) = Var(X – E[X]) / (SD(X))2 = Var(X) / Var(X) = 1
• X* is unit-less
• X* is obtained by:
• First shifting/translating X to make mean 0, and
• Then scaling the shifted variable to make variance 1
Correlation
• For covariance, the magnitude isn’t easy to interpret (unlike its sign)
• Correlation: A measure of how the values taken by X and Y vary
together (“co”-“relate”) obtained by rescaling covariance
• Pearson’s correlation coefficient
• Assuming X and Y are linearly related, correlation magnitude shows the
strength of the (functional/deterministic) relationship between X and Y
• Let ‘SD’ = standard deviation
• Definition: Cor(X,Y) := Cov(X,Y) / (SD(X) SD(Y))
• Thus, Cor(X,Y) = E[X*Y*], where X* and Y* are the standardized variables
= E[X*Y*] – E[X*]E[Y*]
= Cov(X*,Y*)
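• A minimal sketch (the 0.6 correlation is constructed by design; the seed is arbitrary) of Cor(X,Y) = E[X*Y*]: standardize both variables, average their product, and compare with np.corrcoef.
```python
# Pearson correlation as the expectation of the product of standardized variables.
import numpy as np

rng = np.random.default_rng(11)
x = rng.normal(0.0, 1.0, size=1_000_000)
y = 0.6 * x + 0.8 * rng.normal(0.0, 1.0, size=x.size)    # Cor(X,Y) = 0.6 by design

xs = (x - x.mean()) / x.std()                            # X*
ys = (y - y.mean()) / y.std()                            # Y*
print(np.mean(xs * ys), np.corrcoef(x, y)[0, 1])         # both close to 0.6
```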
Correlation
• Property: -1 ≤ Cor(X,Y) ≤ 1
• Proof:
• First inequality
• 0 ≤ E[(X*+Y*)2]
= E[(X*)2] + E[(Y*)2] + 2E[X*Y*]
= 2(1 + Cor(X,Y))
• So, –1 ≤ Cor(X,Y)
• Second inequality
• 0 ≤ E[(X*–Y*)2]
= E[(X*)2] + E[(Y*)2] – 2E[X*Y*]
= 2(1 – Cor(X,Y))
• So, Cor(X,Y) ≤ 1
Correlation
• Property: If X and Y are linearly related, i.e., Y = mX + c,
and are non-constant (i.e., SD(X)>0 and SD(Y)>0),
then |Cor(X,Y)| = 1
• Proof:
• When Y = mX + c, then SD(Y) = |m| SD(X)
• Cor(X,Y)
= Cov(X,Y) / (SD(X) SD(Y))
= mVar(X) / (SD(X) |m|SD(X))
= ±1
= sign of the slope m
Correlation
• Property: If |Cor(X,Y)| = 1, then X and Y are linearly related
• Proof:
• If Cor(X,Y) = 1, then E[(X*–Y*)2] = 2(1 – Cor(X,Y)) = 0
• For discrete X,Y: this must imply X*=Y* for all (x’,y’) where P(X=x’,Y=y’) > 0
• Else the summation underlying the expectation cannot be zero
• For continuous X,Y: this must imply X*=Y* for all measures (dx’,dy’) where P(dx’,dy’) > 0
• X* and Y* can be unequal only on a countable set of isolated points where P(dx’,dy’) > 0
• Else the integral underlying the expectation cannot be zero
• If Cor(X,Y) = (–1), then E[(X*+Y*)2] = 2(1 + Cor(X,Y)) = 0
• For discrete X,Y: this must imply X*=(–Y*) for all (x’,y’) where P(X=x’,Y=y’) > 0
• For continuous X,Y: this must imply X*=(–Y*) for all measures (dx’,dy’) where P(dx’,dy’) > 0
• Inequality can hold only on a countable set of isolated points where P(dx’,dy’) > 0
• If X* = ±Y*, then Y must be of the form mX+c
Correlation
• If |Cor(X,Y)|=1 (or Y=mX+c), then
how to find the equation of the line from data {(xi,yi): i=1,…,n}?
• By the way: line must pass through (E[X],E[Y])
• Because, when X=E[X], value of Y must be mE[X]+c, but that also equals E[Y]
• We proved that: if Y=mX+c, then |Cor(X,Y)|=1 and Y* = ±X* = Cor(X,Y) X*
• So, (Y – E[Y]) / SD(Y) = Cor(X,Y) (X – E[X]) / SD(X)
• So, Y = E[Y] + SD(Y) Cor(X,Y) (X – E[X]) / SD(X)
• So, Y = E[Y] + Cov(X,Y) (X – E[X]) / Var(X)
• This gives the equation of the line with:
• Slope m := Cov(X,Y) / Var(X)
• Intercept c := E[Y] – Cov(X,Y) E[X] / Var(X)
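• A small sketch of the slope/intercept formulas above, estimated from data {(xi,yi)}; the data here is synthetic and exactly linear (an assumption made only for illustration).
```python
# Line through the data: slope = Cov(X,Y)/Var(X), intercept = E[Y] - slope*E[X].
import numpy as np

rng = np.random.default_rng(12)
x = rng.uniform(0.0, 10.0, size=100_000)
y = 2.5 * x - 4.0                                  # exactly linear, so |Cor| = 1

slope = np.cov(x, y)[0, 1] / np.var(x, ddof=1)     # Cov(X,Y) / Var(X)
intercept = y.mean() - slope * x.mean()
print(slope, intercept)                            # close to 2.5 and -4.0
```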
Correlation
• Examples
Correlation
• Four sets of data with the same correlation of 0.816
• Blue line indicates the line passing through (E[X],E[Y]) with slope = 0.816
(more on this when we study estimation)
• So, correlation = 0.816
doesn’t always mean that data
lies along a line of slope 0.816
• This indicates the likely
misinterpretation of correlation
when variables underlying data
aren’t linearly related
Correlation
• Zero correlation doesn’t imply independence
• We showed that independence implies zero covariance/correlation,
but the converse isn’t always true
• Example: Let X be uniformly distributed within [-1,+1]. Let Y := X2.
• Cov(X,X2) = E[X.X2] – E[X]E[X2] = E[X3] – 0.E[X2] = 0
• Thus, Cov(X,Y) = 0 = Cor(X,Y) even though Y is a deterministic function of X
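• A numeric version of the example above (the seed is arbitrary): X uniform on [−1,+1] and Y = X2 are clearly dependent, yet their sample correlation is essentially zero.
```python
# Zero correlation despite a deterministic (nonlinear) relationship.
import numpy as np

rng = np.random.default_rng(13)
x = rng.uniform(-1.0, 1.0, size=1_000_000)
y = x**2

print(np.corrcoef(x, y)[0, 1])     # close to 0 even though Y is a function of X
```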
Correlation
• Non-zero correlation doesn’t imply causation
• https://hbr.org/2015/06/beware-spurious-correlations
• https://science.sciencemag.org/content/348/6238/980.2
• http://www.tylervigen.com/spurious-correlations
Correlation
• Non-zero correlation
doesn’t imply causation