CS 771A: Intro to Machine Learning, IIT Kanpur Endsem Exam (22 Nov 2022)
Total: 50 marks
Instructions:
1. This question paper contains 3 pages (6 sides of paper). Please verify.
2. Write your name, roll number, department in block letters with ink on each page.
3. Write your final answers neatly with a blue/black pen. Pencil marks may get smudged.
4. Don’t overwrite/scratch answers, especially in MCQs.
Q1. Write T or F for True/False in the box and give justification below. (4 x (1+2) = 12 marks)
1. The Nikola company shares have a 40% chance of crashing if its owner Ksümnöle tweets something silly. The shares have a 10% chance of crashing if no silly tweet is sent. Ksümnöle tweets something silly with a 20% chance. Then, the probability that Nikola shares will crash is less than 20%. Justify by calculating the probability. [T]
By the law of total probability, we have
ℙ[Fall] = ℙ[Fall | Tweet] ⋅ ℙ[Tweet] + ℙ[Fall | ¬Tweet] ⋅ ℙ[¬Tweet] = (4/10) ⋅ (2/10) + (1/10) ⋅ (8/10) = 16/100,
which is indeed less than 20%.
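As a quick sanity check, the same total-probability arithmetic can be run in a few lines of Python (a minimal sketch; the variable names are ours):

p_fall_given_tweet = 0.40
p_fall_given_no_tweet = 0.10
p_tweet = 0.20

# Law of total probability: P[Fall] = P[Fall|Tweet] P[Tweet] + P[Fall|~Tweet] P[~Tweet]
p_fall = p_fall_given_tweet * p_tweet + p_fall_given_no_tweet * (1 - p_tweet)
print(p_fall)  # 0.16, i.e., 16% < 20%, so the statement is True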
2. Given three vectors 𝐱, 𝐲, 𝐳 ∈ ℝ² such that 𝐱⊤𝐳 > 𝐱⊤𝐲, it is always the case that ‖𝐱 − 𝐳‖₂² < ‖𝐱 − 𝐲‖₂². Give a proof if True else give a counter example. [F]
We can give a counter example that is essentially one-dimensional. Consider 𝐱 = (1, 0) = 𝐲 and 𝐳 = (100, 0). We have 𝐱⊤𝐲 = 1 < 100 = 𝐱⊤𝐳. However, we also have ‖𝐱 − 𝐳‖₂² = 99² > 0 = ‖𝐱 − 𝐲‖₂².
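A quick NumPy check of this counterexample (our own verification, not part of the official solution):

import numpy as np

x = np.array([1.0, 0.0])
y = np.array([1.0, 0.0])
z = np.array([100.0, 0.0])

print(x @ z > x @ y)  # True: the hypothesis x.z > x.y holds
print(np.sum((x - z) ** 2), np.sum((x - y) ** 2))  # 9801.0 0.0: yet ||x-z||^2 > ||x-y||^2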
3. Consider the set 𝒳 = {−1, +1}³ of 3D vectors with ±1 coordinates. Any map 𝜙: 𝒳 → ℝᵈ s.t. for all 𝐱, 𝐲 ∈ 𝒳, 𝜙(𝐱)⊤𝜙(𝐲) = (1 + 𝐱⊤𝐲)² must use 𝑑 ≥ 10 dims. Give a proof if True else give a map using fewer dimensions as a counter example. [F]
In general, we do need 1 + 2 ⋅ 3 + (3 choose 2) = 10 dims. However, for vectors with ±1 coordinates, the dims encoding 𝑥𝑖² can be omitted since (+1)² = 1 = (−1)². The resulting map looks like
𝜙(𝑥1, 𝑥2, 𝑥3) = √2 ⋅ (√2, 𝑥1, 𝑥2, 𝑥3, 𝑥1𝑥2, 𝑥1𝑥3, 𝑥2𝑥3) ∈ ℝ⁷
We have
𝜙(𝐱)⊤𝜙(𝐲) = 4 + 2𝑥1𝑦1 + 2𝑥2𝑦2 + 2𝑥3𝑦3 + 2𝑥1𝑥2𝑦1𝑦2 + 2𝑥1𝑥3𝑦1𝑦3 + 2𝑥2𝑥3𝑦2𝑦3
= 1 + 𝑥1²𝑦1² + 𝑥2²𝑦2² + 𝑥3²𝑦3² + 2𝑥1𝑦1 + 2𝑥2𝑦2 + 2𝑥3𝑦3 + 2𝑥1𝑥2𝑦1𝑦2 + 2𝑥1𝑥3𝑦1𝑦3 + 2𝑥2𝑥3𝑦2𝑦3
= 1 + (𝑥1𝑦1 + 𝑥2𝑦2 + 𝑥3𝑦3)² + 2(𝑥1𝑦1 + 𝑥2𝑦2 + 𝑥3𝑦3) = (1 + 𝐱⊤𝐲)²
where the second equality uses 𝑥𝑖² = 1 = 𝑦𝑖².
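The identity can also be verified mechanically. Here is a minimal Python sketch (our own check; the function name phi is ours) that tests the 7-dimensional map against the kernel on all 8 points of 𝒳:

import itertools
import numpy as np

def phi(x):
    # 7-dim map: sqrt(2) * (sqrt(2), x1, x2, x3, x1*x2, x1*x3, x2*x3)
    x1, x2, x3 = x
    return np.sqrt(2) * np.array([np.sqrt(2), x1, x2, x3, x1 * x2, x1 * x3, x2 * x3])

X = list(itertools.product([-1, 1], repeat=3))
for x, y in itertools.product(X, X):
    assert np.isclose(phi(x) @ phi(y), (1 + np.dot(x, y)) ** 2)
print("phi(x).phi(y) = (1 + x.y)^2 on all of X")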
4. If 𝑋, 𝑌 ∈ ℝ³ˣ³ are rank-one matrices, then 𝑋 + 𝑌 can never be rank one, no matter what 𝑋, 𝑌 are. Give a brief proof if True else give a counter example. [F]
Take 𝑋 = 𝟏𝟏⊤ = 𝑌; both are rank-one, but their sum is 2 ⋅ 𝟏𝟏⊤, which is itself rank one.
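A one-line NumPy confirmation of this counterexample (our own check):

import numpy as np

X = np.ones((3, 3))  # X = Y = 11^T, each of rank one
print(np.linalg.matrix_rank(X), np.linalg.matrix_rank(X + X))  # 1 1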
Q2. (Informative non-response models) Melbo is studying how one’s income level affects one’s
reluctance to reveal one’s income publicly. 𝑛 people were chosen with incomes 𝑋1 , 𝑋2 , … , 𝑋𝑛 .
Melbo knows that the income levels 𝑋𝑖 are distributed as independent standard Gaussian random
variables i.e., 𝑋𝑖 ∼ 𝒩(0,1) for all 𝑖 (let us interpret positive 𝑋𝑖 as higher-than-median income and
negative 𝑋𝑖 as lower-than-median income). However, not everyone wants to reveal their income.
When Melbo conducts the survey, the responses are 𝑍1, 𝑍2, …, 𝑍𝑛. If the 𝑖-th person reveals their income, then 𝑍𝑖 = 𝑋𝑖 else 𝑍𝑖 = 𝜙. It is known that ℙ[𝑍𝑖 ≠ 𝜙 | 𝑋𝑖] = exp(−𝛼²𝑋𝑖²/2), where 𝛼 > 0 is an unknown parameter to be learnt. (Total 12 marks)
1. Is a rich person e.g., 𝑋𝑖 = 100 more likely or less likely to reveal their income than a person
with close-to-median income e.g., 𝑋𝑗 = −0.01? Give brief justification. (1+1 = 2 marks)
A rich person is less likely to reveal their income than someone with close-to-median income. This is because ℙ[𝑍𝑖 ≠ 𝜙 | 𝑋𝑖 = 𝑣] is high only if |𝑣| is small, i.e., the reveal probability dips whenever 𝑣 has large magnitude, never mind its sign.
2. Is a poor person e.g., 𝑋𝑖 = −10 more likely or less likely to reveal their income than a person
with close-to-median income e.g., 𝑋𝑗 = 0.1? Give brief justification. (1+1 = 2 marks)
A poor person is also less likely to reveal their income than someone with close-to-median income. This is because ℙ[𝑍𝑖 ≠ 𝜙 | 𝑋𝑖 = 𝑣] is high only if |𝑣| is small, i.e., the reveal probability dips whenever 𝑣 has large magnitude, never mind its sign.
3. Derive an expression for ℙ[𝑍𝑖 ≠ 𝜙], the prior probability of a person revealing their income. Show steps and give your answer as a function ℎ(𝛼). Hint: the density of a Gaussian looks like (1/√(2𝜋𝜎²)) ⋅ exp(−(𝑋 − 𝜇)²/(2𝜎²)) and 𝑋𝑖 ∼ 𝒩(0, 1). Also, ∫_{−∞}^{∞} exp(−𝑎²𝑡²/2) 𝑑𝑡 = √(2𝜋/𝑎²). (4 marks)
By the law of total probability, we get
ℎ(𝛼) ≝ ℙ[𝑍𝑖 ≠ 𝜙] = ∫_{−∞}^{∞} ℙ[𝑍𝑖 ≠ 𝜙, 𝑋𝑖 = 𝑣] 𝑑𝑣 = 1/√(1 + 𝛼²)
since
∫_{−∞}^{∞} ℙ[𝑍𝑖 ≠ 𝜙 | 𝑋𝑖 = 𝑣] ⋅ ℙ[𝑋𝑖 = 𝑣] 𝑑𝑣 = (1/√(2𝜋)) ⋅ ∫_{−∞}^{∞} exp(−(1 + 𝛼²)𝑣²/2) 𝑑𝑣 = 1/√(1 + 𝛼²)
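The closed form can be sanity-checked numerically, e.g., with SciPy quadrature (a minimal sketch of our own; the test value 𝛼 = 1.5 is arbitrary):

import numpy as np
from scipy.integrate import quad

alpha = 1.5  # arbitrary test value

# Integrand: P[Z != phi | X = v] times the N(0, 1) density of X
integrand = lambda v: np.exp(-alpha**2 * v**2 / 2) * np.exp(-v**2 / 2) / np.sqrt(2 * np.pi)

numeric, _ = quad(integrand, -np.inf, np.inf)
print(numeric, 1 / np.sqrt(1 + alpha**2))  # both ~0.5547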
4. Write down an expression for the negative log-likelihood of the form (no derivation needed)
ℒ(𝛼) = − ∑_{𝑖:𝑍𝑖≠𝜙} ln ℙ[𝑍𝑖 ≠ 𝜙, 𝑋𝑖] − ∑_{𝑖:𝑍𝑖=𝜙} ln ℙ[𝑍𝑖 = 𝜙]
Notice that the terms in the first summation involve a joint probability. (2 marks)
By using ℙ[𝑍𝑖 ≠ 𝜙, 𝑋𝑖] = ℙ[𝑍𝑖 ≠ 𝜙 | 𝑋𝑖] ⋅ ℙ[𝑋𝑖], we get
ℒ(𝛼) = −𝑛1 ⋅ ln(1/√(2𝜋)) + ∑_{𝑖:𝑍𝑖≠𝜙} (1 + 𝛼²)𝑋𝑖²/2 − 𝑛0 ⋅ ln(1 − 1/√(1 + 𝛼²))
where 𝑛0 = |{𝑖: 𝑍𝑖 = 𝜙}| and 𝑛1 = 𝑛 − 𝑛0.
5. Write down an expression for the gradient ℒ′(𝛼) (no derivation needed). (2 marks)
ℒ′(𝛼) = 𝛼 ⋅ ∑_{𝑖:𝑍𝑖≠𝜙} 𝑋𝑖² − (𝑛0/(√(1 + 𝛼²) − 1)) ⋅ (𝛼/(1 + 𝛼²))
where 𝑛0 = |{𝑖: 𝑍𝑖 = 𝜙}|.
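The gradient expression can be checked against the negative log-likelihood by finite differences (our own sketch, on synthetic data drawn under the stated model):

import numpy as np

rng = np.random.default_rng(0)
alpha_true, n = 2.0, 100_000
X = rng.standard_normal(n)
revealed = rng.random(n) < np.exp(-alpha_true**2 * X**2 / 2)  # Z_i != phi
S, n0, n1 = np.sum(X[revealed]**2), np.sum(~revealed), np.sum(revealed)

def nll(a):
    return (-n1 * np.log(1 / np.sqrt(2 * np.pi))
            + (1 + a**2) * S / 2
            - n0 * np.log(1 - 1 / np.sqrt(1 + a**2)))

def grad(a):
    return a * S - (n0 / (np.sqrt(1 + a**2) - 1)) * (a / (1 + a**2))

a, eps = 1.3, 1e-6
print(grad(a), (nll(a + eps) - nll(a - eps)) / (2 * eps))  # should agree closely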
Q3. (Quantile regression) Can we find the 𝑘th smallest number in a set of 𝑛 numbers simply by solving an optimization problem?! It turns out this is indeed possible using a trick called quantile regression. For a set of real numbers 𝑥1 < 𝑥2 < ⋯ < 𝑥𝑛 (sorted in ascending order for the sake of simplicity), for any integer 𝑘 = 0, 1, 2, …, 𝑛, consider the problem argmin𝑧∈[𝑥1,𝑥𝑛] 𝑓𝑘(𝑧), with
𝑓𝑘(𝑧) ≝ (𝑘/𝑛 − 1) ⋅ ∑_{𝑥𝑖<𝑧} (𝑥𝑖 − 𝑧) + (𝑘/𝑛) ⋅ ∑_{𝑥𝑖≥𝑧} (𝑥𝑖 − 𝑧)
There are no duplicates in 𝑥1, …, 𝑥𝑛. Assume that an empty sum equals 0.
1. Find a minimizer for argmin𝑧∈[𝑥1,𝑥𝑛] 𝑓𝑛(𝑧), i.e., 𝑘 = 𝑛. Show brief derivation. (1+1 = 2 marks)
We have 𝑓𝑛(𝑧) = ∑_{𝑥𝑖≥𝑧} (𝑥𝑖 − 𝑧). Notice that 𝑓𝑛(𝑧) ≥ 0 for all 𝑧 by definition. However, we see that 𝑓𝑛(𝑥𝑛) = 𝑥𝑛 − 𝑥𝑛 = 0. Thus, 𝑥𝑛 is a minimizer for argmin𝑧∈[𝑥1,𝑥𝑛] 𝑓𝑛(𝑧).
2. Find a minimizer for argmin𝑧∈[𝑥1,𝑥𝑛] 𝑓0(𝑧), i.e., 𝑘 = 0. Show brief derivation. (1+1 = 2 marks)
Similarly, 𝑓0(𝑧) = −∑_{𝑥𝑖<𝑧} (𝑥𝑖 − 𝑧). Notice that 𝑓0(𝑧) ≥ 0 for all 𝑧 by definition. However, we see that 𝑓0(𝑥1) = 0 since it is an empty sum. Thus, 𝑥1 is a minimizer for argmin𝑧∈[𝑥1,𝑥𝑛] 𝑓0(𝑧).
3. Let us handle 𝑘 ∈ [1, 𝑛 − 1]. Show brief derivation that if 𝑥𝑗 < 𝑎 < 𝑏 ≤ 𝑥𝑗+1, 𝑎 ≠ 𝑏, then
a. We have 𝑓𝑘(𝑎) > 𝑓𝑘(𝑏) if 1 ≤ 𝑗 < 𝑘.
b. We have 𝑓𝑘(𝑎) < 𝑓𝑘(𝑏) if 𝑘 < 𝑗 < 𝑛.
c. We have 𝑓𝑘(𝑎) = 𝑓𝑘(𝑏) if 𝑗 = 𝑘, i.e., for 𝑥𝑘 < 𝑎 < 𝑏 ≤ 𝑥𝑘+1. (4+4+4 = 12 marks)
After establishing a few more results like the ones above (which you do not have to show), we can deduce that any value of 𝑧 ∈ [𝑥𝑘, 𝑥𝑘+1) is a minimizer for argmin𝑧∈[𝑥1,𝑥𝑛] 𝑓𝑘(𝑧). (Total 16 marks)
Let 𝑥𝑗 < 𝑎 < 𝑏 ≤ 𝑥𝑗+1. Notice that no summation has a zero coefficient when 𝑘 ∈ [1, 𝑛 − 1]:
𝑓𝑘(𝑎) = (𝑘/𝑛 − 1) ⋅ ∑_{𝑖=1}^{𝑗} (𝑥𝑖 − 𝑎) + (𝑘/𝑛) ⋅ ∑_{𝑖=𝑗+1}^{𝑛} (𝑥𝑖 − 𝑎)
𝑓𝑘(𝑏) = (𝑘/𝑛 − 1) ⋅ ∑_{𝑖=1}^{𝑗} (𝑥𝑖 − 𝑏) + (𝑘/𝑛) ⋅ ∑_{𝑖=𝑗+1}^{𝑛} (𝑥𝑖 − 𝑏)
giving us 𝑓𝑘(𝑎) − 𝑓𝑘(𝑏) = [(𝑘/𝑛 − 1) ⋅ 𝑗 + (𝑘/𝑛) ⋅ (𝑛 − 𝑗)] ⋅ (𝑏 − 𝑎) = (𝑘 − 𝑗)(𝑏 − 𝑎). Since 𝑏 > 𝑎:
1. If 𝑗 < 𝑘, i.e., if 𝑗 ≤ 𝑘 − 1, we get 𝑓𝑘(𝑎) > 𝑓𝑘(𝑏)
2. If 𝑗 > 𝑘, i.e., if 𝑗 ≥ 𝑘 + 1, we get 𝑓𝑘(𝑎) < 𝑓𝑘(𝑏)
3. If 𝑗 = 𝑘, we get 𝑓𝑘(𝑎) = 𝑓𝑘(𝑏)
To properly establish the minimizer, some more results are needed (showing these results is not a part of the problem). Let 𝑥𝑗 < 𝑐 < 𝑥𝑗+1 for some 𝑗 ≥ 1. We have
𝑓𝑘(𝑥𝑗) = (𝑘/𝑛 − 1) ⋅ ∑_{𝑖=1}^{𝑗−1} (𝑥𝑖 − 𝑥𝑗) + (𝑘/𝑛) ⋅ ∑_{𝑖=𝑗+1}^{𝑛} (𝑥𝑖 − 𝑥𝑗)
𝑓𝑘(𝑐) = (𝑘/𝑛 − 1) ⋅ ∑_{𝑖=1}^{𝑗} (𝑥𝑖 − 𝑐) + (𝑘/𝑛) ⋅ ∑_{𝑖=𝑗+1}^{𝑛} (𝑥𝑖 − 𝑐)
𝑓𝑘(𝑥𝑗) − 𝑓𝑘(𝑐) = [(𝑘/𝑛 − 1) ⋅ (𝑗 − 1) + (𝑘/𝑛) ⋅ (𝑛 − 𝑗)] ⋅ (𝑐 − 𝑥𝑗) − (𝑘/𝑛 − 1) ⋅ (𝑥𝑗 − 𝑐)
= [(𝑘/𝑛 − 1) ⋅ 𝑗 + (𝑘/𝑛) ⋅ (𝑛 − 𝑗)] ⋅ (𝑐 − 𝑥𝑗) = (𝑘 − 𝑗)(𝑐 − 𝑥𝑗)
Since 𝑐 > 𝑥𝑗 by design, we have
1. 𝑓𝑘(𝑥𝑗) > 𝑓𝑘(𝑐) whenever 𝑗 < 𝑘
2. 𝑓𝑘(𝑥𝑗) < 𝑓𝑘(𝑐) whenever 𝑗 > 𝑘
3. 𝑓𝑘(𝑥𝑘) = 𝑓𝑘(𝑐)
This completes the argument.
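The whole argument is easy to sanity-check numerically. The sketch below (our own illustration, not part of the exam) minimizes 𝑓𝑘 over a fine grid and confirms that the minimizer lands in the flat stretch between the 𝑘-th and (𝑘+1)-th smallest elements:

import numpy as np

def f_k(z, x, k):
    # f_k(z) = (k/n - 1) sum_{x_i < z} (x_i - z) + (k/n) sum_{x_i >= z} (x_i - z)
    n = len(x)
    return (k / n - 1) * np.sum(x[x < z] - z) + (k / n) * np.sum(x[x >= z] - z)

rng = np.random.default_rng(42)
x = np.sort(rng.uniform(0, 100, size=25))  # distinct values in ascending order
k = 7

grid = np.linspace(x[0], x[-1], 100_001)
z_star = grid[np.argmin([f_k(z, x, k) for z in grid])]
print(x[k - 1] <= z_star <= x[k])  # True: minimizer lies between x_k and x_{k+1} (0-based x[k-1], x[k])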
Q4. (Robust mean estimation) Melbo has got samples 𝑋1, …, 𝑋𝑛 from a Gaussian with unknown mean 𝜇 but known variance 𝜎² = 1/(2𝜋), i.e., with density 𝑓(𝑋; 𝜇) = exp(−𝜋(𝑋 − 𝜇)²). Melbo wishes to estimate 𝜇 using these samples but is stuck since some samples were corrupted by Melbo’s enemy Oblem. It is not known which samples Oblem corrupted. Let’s use latent variables to solve
this problem. For each 𝑖, we say 𝑍𝑖 = 1 if we think 𝑋𝑖 is corrupted else 𝑍𝑖 = 0. For any 𝜇 ∈ ℝ, we are told that ℙ[𝑍𝑖 = 1 | 𝜇] = 𝜂, that ℙ[𝑋𝑖 | 𝜇, 𝑍𝑖 = 1] = 𝜖, and that ℙ[𝑋𝑖 | 𝜇, 𝑍𝑖 = 0] = 𝑓(𝑋𝑖; 𝜇). Thus, we suspect that Oblem corrupted around an 𝜂 fraction of the samples and we assume that a corrupted sample can take any value with probability 𝜖. Assume 𝜖, 𝜂 < 1/10 and that both are known.
1. For a given 𝜇, derive a rule to find out whether ℙ[𝑍𝑖 = 1 | 𝑋𝑖, 𝜇] > ℙ[𝑍𝑖 = 0 | 𝑋𝑖, 𝜇] or not.
Applying Bayes’ rule tells us that
ℙ[𝑍𝑖 = 1 | 𝑋𝑖, 𝜇] ∝ ℙ[𝑋𝑖 | 𝜇, 𝑍𝑖 = 1] ⋅ ℙ[𝑍𝑖 = 1 | 𝜇] = 𝜖 ⋅ 𝜂
ℙ[𝑍𝑖 = 0 | 𝑋𝑖, 𝜇] ∝ ℙ[𝑋𝑖 | 𝜇, 𝑍𝑖 = 0] ⋅ ℙ[𝑍𝑖 = 0 | 𝜇] = exp(−𝜋(𝑋𝑖 − 𝜇)²) ⋅ (1 − 𝜂)
Thus, ℙ[𝑍𝑖 = 1 | 𝑋𝑖, 𝜇] > ℙ[𝑍𝑖 = 0 | 𝑋𝑖, 𝜇] if |𝑋𝑖 − 𝜇| > √((1/𝜋) ⋅ ln((1 − 𝜂)/(𝜖 ⋅ 𝜂)))
2. Suppose we are given values of 𝑍1, …, 𝑍𝑛 ∈ {0, 1}. Derive an expression for the MLE estimate
argmax𝜇∈ℝ ∏_{𝑖=1}^{𝑛} ℙ[𝑋𝑖 | 𝜇, 𝑍𝑖]
We have
∏_{𝑖=1}^{𝑛} ℙ[𝑋𝑖 | 𝜇, 𝑍𝑖] = ∏_{𝑖:𝑍𝑖=0} exp(−𝜋(𝑋𝑖 − 𝜇)²) ⋅ ∏_{𝑖:𝑍𝑖=1} 𝜖
Taking logarithms and setting the derivative w.r.t. 𝜇 to zero gives us the MLE estimate as
𝜇MLE = (1/|{𝑖: 𝑍𝑖 = 0}|) ⋅ ∑_{𝑖:𝑍𝑖=0} 𝑋𝑖
Note that this allows us to execute alternating optimization to help Melbo solve the problem even
in the presence of corruptions. We can initialize 𝜇 (say randomly), then use part 1 to set 𝑍𝑖 values
for each 𝑖 (set 𝑍𝑖 = 1 if ℙ[𝑍𝑖 = 1 | 𝑋𝑖 , 𝜇] > ℙ[𝑍𝑖 = 0 | 𝑋𝑖 , 𝜇] else set 𝑍𝑖 = 0), then use part 2 to
update 𝜇 given these 𝑍𝑖 values and then repeat the process till convergence. (5 + 5 = 10 marks)
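A minimal sketch of this alternating optimization (our own illustration under the stated model; the initialization, 𝜂, 𝜖, and Oblem’s corruption scheme are demo choices):

import numpy as np

rng = np.random.default_rng(1)
mu_true, n, eta, eps = 5.0, 2_000, 0.05, 0.01
sigma = 1 / np.sqrt(2 * np.pi)  # known variance sigma^2 = 1/(2 pi)
X = rng.normal(mu_true, sigma, size=n)
corrupt = rng.random(n) < eta
X[corrupt] = rng.uniform(-50, 50, size=corrupt.sum())  # Oblem's corruptions (demo choice)

mu = float(np.median(X))  # initialization (demo choice)
threshold = np.sqrt(np.log((1 - eta) / (eps * eta)) / np.pi)
for _ in range(20):
    Z = np.abs(X - mu) > threshold  # part 1: Z_i = 1 iff P[Z_i=1|X_i,mu] > P[Z_i=0|X_i,mu]
    mu = X[~Z].mean()               # part 2: MLE of mu over samples deemed clean
print(mu)  # close to mu_true = 5.0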