MLF PA GA Sol
Answer: C
Explanation: A model represents reality/a system in some mathematical form or another.
A model is never an exact representation; it is a mathematical simplification of reality.
A model always makes some assumptions. For instance, the assumption could be that the data points lie in 2 dimensions or 3 dimensions.
Answer: D
Explanation: We know that there are 2 broad categories for supervised algorithms:
regression and classification.
Classification is where we have to put the data points into some “class” based on the
features and regression is where we predict a numerical value. In A, B & C, we would
classify the data points into some class(es) whereas, option D requires us to predict a
numerical value.
3. (1 point) For spam detection, if we use traditional programming rather than a machine
learning approach, which problems may be faced?
Course: Machine Learning - Foundations Page 2 of 7
A. The need to check for a long list of possible patterns that may be present.
B. Difficulty in maintaining the program containing the complex hard-coded rules.
C. The need to keep writing new rules as the spammers become innovative.
D. All of these.
Answer: D
Explanation: Traditional programming would require explicitly defining and checking
for specific patterns that indicate spam. This would involve creating a long list of pos-
sible patterns that may be present in spam messages.
Traditional programming for spam detection often involves creating complex rules and
conditions to identify spam. Maintaining and updating such a program can be cumber-
some.
If you compare spam messages from a decade ago with today's, you'll notice a significant change in their patterns. So, traditional programming would require us to constantly update our code and rules to keep up with spam detection trends.
4. (1 point) In a regression model, the parameters w_i's and b of the function
f(X) = Σ_{j=1}^{d} w_j x_j + b, where X = [x_1, x_2, ..., x_d],
A. are strictly integers.
B. always lie in the range [0,1].
C. are any real value.
D. are any imaginary value.
Answer: C
Explanation: The weights w and the bias term b are not restricted to integers or a specific range of real numbers. They can take any real value. Regression models do not make use of imaginary numbers.
Answer: A,C
If the given problem statement allows you to predict the results into a specific class from
a specific set of classes, then it can be said that the problem is a classification problem.
In option A, based on the writing style, the data point would be tagged with a specific class: either “Male”, “Female”, or any other class, if present.
In option B, price is a numerical quantity that is being predicted. Hence, this would be
a regression problem, rather than a classification problem.
In option C, there are 2 classes for which the predictions will be made. The classes can be, for instance, “will rain” and “will not rain”.
In option D, the prediction is a numerical quantity. Therefore, this will be a regression problem.
6. (1 point) Identify the task that needs the use of regression in the following.
A. Predict the height of a person based on his weight.
B. Predict the country a person belongs to based on his linguistic features.
C. Predict whether the price of gold will increase tomorrow or not based on data of the last 25 days.
D. Predict whether a movie is comedy or tragedy based on its reviews.
Answer: A
Explanation: For regression problems, the predicted value must be a numerical quan-
tity. This numerical quantity is usually continuous.
In option A, you are predicting a numerical value. Therefore, this will fall under the
category of regression problem.
In option B, specific classes are predicted. In this case, the classes would be the set of all countries.
In option C, again, the predictions being made are of specific classes and not numerical
quantities. Here, classes are: { will increase, will not increase }
In option D, the specific style/genre of the movie is to be predicted. Here, the classes would be: { Comedy, Tragedy }.
Answer: A,B
Explanation: To identify unsupervised learning problems, see if the problem requires
us to group/cluster based on the patterns in the data. Unsupervised learning problems
have unlabelled data (the target column is absent).
In option A, we are grouping, hence unsupervised.
In option B we are making clusters, hence unsupervised.
In option C, We typically train a model for the following problem using labelled data
to help us identify spam or not spam for future unseen emails (based on the patterns
learned from the labelled data). Hence, it’s a supervised problem.
In option D, we would be “grouping” the customers based on their buying behaviour to
help us identify their gender. We don’t have explicit labels available with us. So, this is
also an unsupervised problem.
Answer: D
Explanation: First, we need to understand the notation used here.
1(expression) = output → if the expression inside () is true, then the output is 1, if the
expression inside () is false, the output is 0.
This is known as the indicator function. In mathematics, an indicator function of a sub-
set of a set is a function that maps elements of the subset to one, and all other elements
to zero.
Now, in option A, the expression is true, hence the output will be 1. Therefore, it is a
correct option.
In option B, the expression is false (10 mod 3 is equal to 1), hence the output is 0.
Therefore, this option is correct as well.
In option C, the expression is false, hence the output should be 0. Again, the option is
correct.
In option D, the expression is true but the output mentioned is 0, whereas it should have
been 1. Therefore, it’s an incorrect statement and hence, our answer.
A. f : R^d → R
B. f : R^d → {+1, −1}
C. f : R^d → R^{d′} where d′ < d
Answer: B
Explanation: Only option B is where we have a set of classes. In classification models,
a set of features is mapped to a set of classes.
10. (1 point) Which of the following can form a good encoder decoder pair for d-dimensional data?
A. f : R^d → R^{d′} where d′ < d and g : R^{d′} → R^d where d′ > d
B. f : R^d → R^{d′} where d′ > d and g : R^{d′} → R^d where d′ < d
C. f : R^d → R^{d′} where d′ < d and g : R^{d′} → R^d where d′ < d
D. f : R^d → R^{d′} where d′ > d and g : R^{d′} → R^d where d′ > d
Answer: C
Explanation: A good encoder is one which reduces the dimensions.
A good decoder is one which maps back to higher-dimensional data (preferably the same dimension as the original data).
Only option C satisfies the aforementioned requirements.
Answer: C
12. (2 points) Consider the following data set where each data point consists of three fea-
tures x1 , x2 and x3 :
x1 x2 x3
10 10 9
13 12 13
5 5 4
8 7 7
Consider two encoder functions f and f̃ with decoders g and g̃ respectively, aiming to reduce the dimensionality of the data set from 3 to 1:
Pair 1: f(x_1, x_2, x_3) = x_1 − x_2 + x_3 and g(u) = [u, u, u]
Pair 2: f̃(x_1, x_2, x_3) = (x_1 + x_2 + x_3)/3 and g̃(u) = [u, u, u]
The reconstruction loss of the encoder decoder pair is the mean of the squared distance
between the reconstructed input and input.
Explanation:
We can calculate the f () and f˜() as following:
x1 x2 x3 f (x1 , x2 , x3 ) f˜(x1 , x2 , x3 )
10 10 9 9 9.66
13 12 13 14 12.66
5 5 4 4 4.66
8 7 7 8 7.33
Now, we can also calculate the g(u) and g̃(u) which would give us the vectors
g(u) g̃(u)
[9, 9, 9] [9.66, 9.66, 9.66]
[14, 14, 14] [12.66, 12.66, 12.66]
[4, 4, 4] [4.66, 4.66, 4.66]
[8, 8, 8] [7.33, 7.33, 7.33]
loss_pair1 = (1/n) Σ_{i=1}^{n} ||g(f(x^i)) − x^i||²
= ( ||[−1, −1, 0]||² + ||[1, 2, 1]||² + ||[−1, −1, 0]||² + ||[0, 1, 1]||² ) / 4
= (2 + 6 + 2 + 2) / 4 = 3

loss_pair2 = (1/n) Σ_{i=1}^{n} ||g̃(f̃(x^i)) − x^i||²
= ( ||[0.34, 0.34, 0.66]||² + ||[0.34, 0.66, 0.34]||² + ||[0.34, 0.34, 0.66]||² + ||[0.67, 0.33, 0.33]||² ) / 4
≈ (0.67 + 0.67 + 0.67 + 0.67) / 4 ≈ 0.67
Answer: D
Solution:
The vector [2, 4, −5] contains 3 components and all of them are real numbers.
So, [2, 4, −5]^T ∈ R³.
∴ Option D is correct.
2. (1 point) Which of the following may not be an appropriate choice of loss function for
regression?
A. (1/n) Σ_{i=1}^{n} (f(x_i) − y_i)²
B. (1/n) Σ_{i=1}^{n} |f(x_i) − y_i|
C. (1/n) Σ_{i=1}^{n} 1(f(x_i) ≠ y_i)
Answer: C
Solution:
Here, option C, that is, Loss = (1/n) Σ_{i=1}^{n} 1(f(x_i) ≠ y_i), may be a good choice for classification, but it is not a good choice for regression.
You can see that this loss function increases when the prediction is not equal to the label. However, it does so with a fixed loss of 1. Ideally, we would want the increase in loss to be proportionate to the amount of discrepancy between the prediction and the label.
∴ Option C is correct.
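The contrast between these loss functions can be seen numerically; the prediction and label values below are made up purely for illustration:

```python
import numpy as np

# Toy regression data: predictions and true labels (illustrative values).
preds = np.array([2.5, 0.0, 2.0, 8.0])
labels = np.array([3.0, -0.5, 2.0, 7.0])

squared = np.mean((preds - labels) ** 2)    # penalizes large errors quadratically
absolute = np.mean(np.abs(preds - labels))  # penalizes errors linearly
zero_one = np.mean(preds != labels)         # fixed penalty of 1 per mismatch

print(squared, absolute, zero_one)
```

Note how the 0-1 loss treats a miss of 0.5 and a miss of 1.0 identically, which is why it is unsuitable for regression.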
A. Predicting the amount of rainfall in May 2022 in North India based on precip-
itation data of the year 2021.
B. Predicting the price of a land based on its area and distance from the market.
C. Predicting whether an email is spam or not.
D. Predicting the number of Covid cases on a given day based on previous month
data.
Answer: C
Solution:
Here, in options A, B and D, we can see that we have to predict some kind of real number, namely the amount of rainfall, the price of land and the number of cases. These kinds of problems are more suited to regression. Option C, however, is predicting which category the datapoint (email) falls into. It is an example of binary classification.
∴ Option C is correct.
Answer: C
Solution:
A. Since 355 is odd, 355%2 = 1. So, the statement inside the indicator function is
true. That is, 1(355%2 = 1) = 1. Since this option is a true statement, it will not be
marked.
B. Since 788 is even, 788%2 = 0. So, the statement inside the indicator function is
false. That is, 1(788%2 = 1) = 0. Since this option is a true statement, it will not be
marked.
C. Since 355 is odd, 355%2 = 1. So, the statement inside the indicator function is
false. That is, 1(355%2 = 0) = 0. Since this option is a false statement, it will be
marked.
D. Since 788 is even, 788%2 = 0. So, the statement inside the indicator function is
true. That is, 1(788%2 = 0) = 1. Since this option is a true statement, it will not be
marked.
5. (1 point) Which of the following is false regarding supervised and unsupervised machine
learning?
A. Unsupervised machine learning helps you to find different kinds of unknown
patterns in data.
B. Regression and classification are two types of supervised machine learning tech-
niques while clustering and density estimation are two types of unsupervised
learning.
C. In unsupervised learning model, the data contains both input and output
variables while in supervised learning model, the data contains only input
data.
Answer: C
Solution:
Here, option C is a false statement. It is in fact the supervised learning model in which the data contains both input and output variables, and the unsupervised learning model in which the data contains only input data.
∴ Option C is correct.
Answer: C
Solution:
The output of a regression model, linear regression for example, can be any real number.
It is continuous and can be within any range.
∴ Option C is correct.
7. (1 point) (Multiple select) Which of the following is/are supervised learning task(s)?
A. Making different groups of customers based on their purchase history.
B. Predicting whether a loan client may default or not based on previous credit
history.
C. Grouping similar Wikipedia articles as per their content.
D. Estimating the revenue of a company for a given year based on number of
items sold.
Answer: B,D
Solution: Options B and D have labelled targets to predict (default / no default, revenue), so they are supervised. Options A and C group unlabelled data, so they are unsupervised.
8. (1 point) Which of the following is used for predicting a continuous target variable?
A. Classification
B. Regression
C. Density Estimation
D. Dimensionality Reduction
Answer: B
Solution:
Out of the options, the technique used for prediction of a continuous target variable is
regression.
∴ Option B is correct.
9. (1 point) Consider the following: “The ___ is used to fit the model; the ___ is used for model selection; the ___ is used for computing the generalization error.”
Which of the following will fill the above blanks correctly?
A. Test set; Validation set; training set
B. Training set; Test set; Validation set
C. Training set; Validation set; Test set
D. Test set; Training set; Validation set
Answer: C
Solution:
The training set is used to fit our model. After that, the validation set is used to select
the best model. Then, the test set is used for computing the generalization error.
∴ Option C is correct.
Answer: C
Solution:
1. This is the negative log likelihood loss and is used for density estimation.
2. This is computing the error between the reconstructed datapoint and actual datapoint
and is used in dimensionality reduction.
3. This is the squared error loss and it is used for regression.
4. This loss function simply compares if prediction and label are equal or not. This is
used in classification.
∴ Option C is correct.
11. (1 point) Compute the loss when Pair 1 and Pair 2 (shown below) are used for dimen-
sionality reduction for the data given in the following Table:
x1 x2
1 0.5
2 2.3
3 3.1
4 3.9
Consider the loss function to be (1/n) Σ_{i=1}^{n} ||g(f(x^i)) − x^i||².
Here f (x) is the encoder function and g(x) is the decoder function.
Pair 1:
Pair 2:
We are given an encoder (f ) and a decoder (g) function. To solve this question, we
will take each datapoint xi and encode it using encoder function getting f (xi ) and then
decode it to get g(f (xi )). Then the squared error would be given as ||g(f (xi )) − xi ||2 .
We would then take the average of this error over all datapoints to get the loss.
Pair 1:
Loss = (1/4) Σ_{i=1}^{4} ||g(f(x^i)) − x^i||² = (1/4)(0.625 + 10.625 + 19.2245 + 30.425) ≈ 15.224
Pair 2:
12. (1 point) Consider the following 4 training examples. We want to learn a function
x y
-1 0.0319
0 0.8692
1 1.9566
2 3.0343
f (x) = ax + b which is parameterized by (a,b). Using average squared error as the loss
function, which of the following parameters would be best to model the given data?
A. (1, 1)
B. (1, 2)
C. (2, 1)
D. (2, 2)
Answer: A
Solution: For each of the parameters given, we have a different function to estimate y. For each function we will estimate each label y^i.
Then the loss will be given by (1/4) Σ_{i=1}^{4} ||f(x^i) − y^i||².
x y x + 1 x + 2 2x + 1 2x + 2
−1 0.0319 0 1 −1 0
0 0.8692 1 2 1 2
1 1.9566 2 3 3 4
2 3.0343 3 4 5 6
A. f(x) = x + 1: loss ≈ 0.0053
B. f(x) = x + 2: loss ≈ 1.0593
C. f(x) = 2x + 1: loss ≈ 1.5086
D. f(x) = 2x + 2: loss ≈ 3.5626
Since, the loss for f (x) = x + 1 is the smallest, the parameters (1, 1) are the best fit for
this model.
∴ Option A is correct.
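The loss comparison above can be reproduced with a few lines of NumPy (the helper name is illustrative):

```python
import numpy as np

# Training examples from the question.
x = np.array([-1, 0, 1, 2], dtype=float)
y = np.array([0.0319, 0.8692, 1.9566, 3.0343])

# Candidate (a, b) parameters for f(x) = a*x + b.
candidates = [(1, 1), (1, 2), (2, 1), (2, 2)]

def avg_squared_error(a, b):
    # Average squared error of the line a*x + b over the training data.
    return np.mean((a * x + b - y) ** 2)

losses = {(a, b): avg_squared_error(a, b) for a, b in candidates}
best = min(losses, key=losses.get)
print(best)  # (1, 1)
```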
X y
[ 2] 5.8
[ 3] 8.3
[ 6] 18.3
[ 7] 21
[ 8] 22
What will be the amount of loss when the functions g = 3x₁ + 1 and h = 2x₁ + 2 are used to represent the regression line? Consider the average squared error as the loss function.
g: Loss = (1.2² + 1.7² + 0.7² + 1² + 3²)/5 = 14.82/5 = 2.964
h: Loss = (0.2² + 0.3² + 4.3² + 5² + 4²)/5 = 59.62/5 = 11.924
X y
[ 4, 2] +1
[ 8, 4] +1
[ 2, 6] -1
[ 4, 10] -1
[ 10, 2] +1
[ 12, 8] -1
What will be the average misclassification error when the functions g(X) = sign(x1 −
x2 − 2) and h(X) = sign(x1 + x2 − 10) are used to classify the data points into classes
+1 or −1.
g:
h:
The average misclassification error for a function f is given by (1/n) Σ_{i=1}^{n} 1(f(X^i) ≠ y^i).
X
[1,2,3]
[2,3,4]
[-1,0,1]
[0,1,1]
Give the reconstruction error for this encoder decoder pair. The reconstruction error is
the mean of the squared distance between the reconstructed input and input.
We are given an encoder (f ) and a decoder (g) function. To solve this question, we
will take each datapoint xi and encode it using encoder function getting f (xi ) and then
decode it to get g(f (xi )). Then, the squared error would be given as ||g(f (xi )) − xi ||2 .
We would then take the average of this error over all datapoints to get the loss.
So, the loss will be 138/4 = 34.5.
∴ The answer is 34.5.
Course: Machine Learning - Foundations
Week 2 - Practice Questions
Is f (x) continuous at x = 0?
A. False
B. True
Answer: B
Solution:
f(x) = sin(x)/x for x ≠ 0, and f(x) = 1 for x = 0.
Since lim_{x→0} sin(x)/x = 1 = f(0), f is continuous at x = 0.
2. If U = [10, 100], A = [30, 50] and B = [50, 90], which of the following is/are false?
(Consider all values to be integers.)
Answer: A, D, F
Solution:
We know that,
Ac = U \ A
= [10, 100] \ [30, 50]
= [10, 30) ∪ (50, 100]
∴ Option A is false and option B is true.
Next,
A ∪ B = [30, 50] ∪ [50, 90]
= [30, 90]
∴ Option C is true.
Next,
A ∩ B = [30, 50] ∩ [50, 90]
= {50} ≠ ∅
∴ Option D is false and option E is true
Next,
Ac ∩ B c = (A ∪ B)c
= U \ (A ∪ B)
= [10, 100] \ [30, 90]
= [10, 30) ∪ (90, 100]
= [10, 30) ∪ [91, 100]
Finally, option F is false.
Answer: D
Solution:
We have x = [x_1, x_2, ..., x_d]^T and y = [y_1, y_2, ..., y_d]^T.
Now, x^T y = [x_1 x_2 ... x_d] [y_1, y_2, ..., y_d]^T = Σ_{i=1}^{d} x_i y_i
Also, by definition, x · y = Σ_{i=1}^{d} x_i y_i
∴ Option D is correct.
Answer: D
Solution:
The linear approximation of a function f around x = a is given by L(x) = f(a) + f′(a)(x − a).
Here, f(x) = tan(x) and a = 0. Substituting these in the above equation, we get L(x) = tan(0) + sec²(0)(x − 0) = x.
∴ Option D is correct.
5. The partial derivative of x3 + y 2 w.r.t. x at x = 1 and y = 2 is .
Answer: 3
Solution: ∂/∂x (x³ + y²) = 3x², which at x = 1 equals 3.
Is f (x) continuous?
A. Yes
B. No
Answer: A
Solution:
Now, ∀ x < 1, f is a linear function, which makes it continuous, and ∀ x ≥ 1, f is constant, which also makes it continuous. So, the only point we need to check is x = 1:
LHL = lim_{x→1⁻} f(x) = lim_{h→0} f(1 − h) = lim_{h→0} 9 = 9
Similarly, RHL = f(1) = 9, since f is constant for x ≥ 1. As LHL = RHL = f(1), f is continuous at x = 1 and hence continuous everywhere.
7. Which of the following is the best approximation of e0.019 ? (Use linear approximation
around 0).
A. 1
B. 0
C. 0.019
D. 1.019
Answer: D
Solution:
The linear approximation of a function f around x = a is given by L(x) = f(a) + f′(a)(x − a). With f(x) = eˣ and a = 0, eˣ ≈ 1 + x, so e^0.019 ≈ 1 + 0.019 = 1.019.
Answer: B
Solution:
The linear approximation of a function f around (x, y) = (a, b) is given by
L(x, y) = f(a, b) + ∇f(a, b) · [x − a, y − b]^T = f(a, b) + (x − a) f_x(a, b) + (y − b) f_y(a, b)
∴ Option B is correct.
Answer: B
Solution:
The gradient (∇f) of f(x, y) = x²y is
∇f(x, y) = [f_x(x, y), f_y(x, y)]^T = [2xy, x²]^T
10. The directional derivative of f (x, y, z) = x2 + 3y + z 2 at (1, 2, 1) along the unit vector
in the direction of [1, -2, 1] is .
Answer: -0.816
Solution:
The directional derivative is given by D_û f = ∇f · û. Here ∇f(x, y, z) = [2x, 3, 2z]^T, so ∇f(1, 2, 1) = [2, 3, 2]^T, and û = [1, −2, 1]/√6. Hence D_û f(1, 2, 1) = (2 − 6 + 2)/√6 = −2/√6 ≈ −0.816.
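A generic directional-derivative helper (the function name is illustrative) can confirm this value:

```python
import numpy as np

def directional_derivative(grad_f, point, direction):
    # D_u f = gradient at the point dotted with the unit vector along `direction`.
    u = np.asarray(direction, dtype=float)
    u = u / np.linalg.norm(u)
    return np.dot(grad_f(*point), u)

# f(x, y, z) = x^2 + 3y + z^2, so grad f = [2x, 3, 2z].
grad_f = lambda x, y, z: np.array([2 * x, 3.0, 2 * z])

d = directional_derivative(grad_f, (1, 2, 1), [1, -2, 1])
print(round(d, 3))  # -0.816
```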
11. Find the direction of steepest ascent for the function x2 + y 3 + z 4 at point (1, 1, 1).
A. [2/√29, 3/√29, 4/√29]
B. [−2/√29, 3/√29, 4/√29]
C. [−2/√29, −3/√29, 4/√29]
D. [2/√29, −3/√29, 4/√29]
Answer: A
Solution:
The direction of the steepest ascent for any function is the direction of the gradient itself. Here, ∇f(x, y, z) = [2x, 3y², 4z³]^T, so ∇f(1, 1, 1) = [2, 3, 4]^T, and normalizing by ||∇f(1, 1, 1)|| = √29 gives [2/√29, 3/√29, 4/√29].
∴ Option A is correct.
12. The directional derivative of f (x, y, z) = x + y + z at point (-1, 1, -1) along the unit
vector in the direction of [1, -1, 1] is .
Answer: 0.577
Solution:
The directional derivative is given by
D_û f(−1, 1, −1) = ∇f(−1, 1, −1) · û
First, let us compute the gradient:
∇f(x, y, z) = [f_x, f_y, f_z]^T = [1, 1, 1]^T
⇒ ∇f(−1, 1, −1) = [1, 1, 1]^T
The unit vector in the direction of u = [1, −1, 1]^T is û = u/||u|| = (1/√3)[1, −1, 1]^T
∴ D_û f(−1, 1, −1) = [1, 1, 1] · (1/√3)[1, −1, 1] = 1/√3 ≈ 0.577
13. Which of the following is/are the vector equations of a line that passes through (1, 2, 3)
and (4, 0, 1)?
(i) [x, y, z] = [1, 2, 3] + α[3, −2, −2]
(ii) [x, y, z] = [4, 0, 1] + α[−3, 2, 2]
(iii) [x, y, z] = [1, 2, 3] + α[4, 0, 1]
(iv) [x, y, z] = [4, 0, 1] + α[1, 2, 3]
Answer: E
Solution:
The vector equation of a line passing through two points a and b is given by
[x, y, z] = a + α (b − a), α ∈ R
Here b − a = [4, 0, 1] − [1, 2, 3] = [3, −2, −2], so (i) and (ii) are valid equations of the line, while (iii) and (iv) use the points themselves as direction vectors.
Answer: D
Solution:
The Cauchy-Schwarz inequality states |a^T b| ≤ ||a|| · ||b||, with equality if and only if a is a scalar multiple of b. Since here a and b are negative scalar multiples of each other, we get
−||a|| · ||b|| = a^T b
∴ Option D is correct.
Course: Machine Learning - Foundations
Week 2 - Graded assignment
Answer: D
Explanation: Option A is not defined at x = 1 therefore, it’ll have a breakpoint there.
Hence, not continuous.
In option B, the function is again not continuous at x = 1. One may try to simplify the
option as follows:
(x² − 1)/(x − 1) = (x − 1)(x + 1)/(x − 1)
Please note that you cannot cancel out (x − 1) here because you would be assuming that
x − 1 is not equal to 0. But, we get (x − 1) = 0 at x = 1. Here, limits exist but that
doesn’t necessarily mean that the function is continuous.
Option C is discontinuous at x = 2.
Option D is continuous at all points.
2. Regarding a d-dimensional vector x, which of the following four options is not equivalent
to the rest three options?
A. x^T x
B. ||x||²
C. Σ_{i=1}^{d} x_i²
D. x x^T
Answer: D
Explanation:
x · x = x^T x = Σ_{i=1}^{d} x_i²
||x|| = √(x_1² + x_2² + ... + x_d²)
⇒ ||x||² = x_1² + x_2² + ... + x_d² = Σ_{i=1}^{d} x_i²
x^T x ≠ x x^T, since x x^T is a d × d matrix while the other three expressions are scalars.
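The scalar-versus-matrix distinction is easy to see in NumPy:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])

inner = x @ x                      # x^T x, a scalar: 1 + 4 + 9 = 14
norm_sq = np.linalg.norm(x) ** 2   # ||x||^2, the same scalar
sum_sq = np.sum(x ** 2)            # sum of x_i^2, again the same scalar
outer = np.outer(x, x)             # x x^T, a 3x3 matrix, NOT a scalar

print(inner, norm_sq, sum_sq, outer.shape)
```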
Answer: B, D
Explanation:
f (x) is continuous at x = 3 if limx→3− f (x) = limx→3+ f (x) = f (3)
LHL ̸= RHL
Therefore, the function is not continuous at x = 3
Answer: 1.011
Explanation: To approximate the value of e^0.011 by linearizing eˣ around x = 0, we can use the first-order Taylor expansion of eˣ around x = a, which is given by:
eˣ ≈ e^a + e^a(x − a)
where a is the point around which we are linearizing (in this case, a = 0).
Using this approximation, we have e^0.011 ≈ 1 + 0.011 = 1.011.
5. Approximate √3.9 by linearizing √x around x = 4.
Answer: 1.975
Explanation: To approximate the value of √3.9 by linearizing √x around x = 4, we can use the first-order Taylor expansion of √x around x = a, which is given by:
√x ≈ √a + (1/(2√a))(x − a)
Using this approximation, we have:
√3.9 ≈ √4 + (1/(2√4))(3.9 − 4) = 2 + (1/4)(−0.1) = 2 − 0.025 = 1.975
Therefore, the approximate value of √3.9 obtained by linearizing √x around x = 4 is approximately 1.975.
Answer: A, D, E, F
Explanation: If 2 vectors are perpendicular to each other, the 2 vectors must have the
dot product equal to 0.
Answer: B
Explanation:
∇f(x, y) = [3x², 3y²]^T
⇒ ∇f(2, 2) = [12, 12]^T
L_{x*,y*}[f](x, y) = f(x*, y*) + ∇f(x*, y*)^T · [x − x*, y − y*]^T
= 16 + [12, 12] · [x − 2, y − 2]^T
= 16 + 12x − 24 + 12y − 24
= 12x + 12y − 32
Answer: A
Explanation:
∇f(x, y) = [3x²y², 2x³y]^T ⇒ ∇f(1, 2) = [3(1)²(2)², 2(1)³(2)]^T = [12, 4]^T
A. [1, 2, 3]
B. [-1, 2, 3]
C. [0, 2, 3]
D. [2, 0, 3]
Answer: C
Explanation: The gradient of f = x³ + y² + z³ is given by:
∇f = [∂f/∂x, ∂f/∂y, ∂f/∂z] = [3x², 2y, 3z²]
10. For two vectors a and b, which of the following is true as per Cauchy-Schwarz inequality?
(i) aT b ≤ ||a|| ∗ ||b||
(ii) aT b ≥ −||a|| ∗ ||b||
(iii) aT b ≥ ||a|| ∗ ||b||
(iv) aT b ≤ −||a|| ∗ ||b||
A. (i) only
B. (ii) only
C. (iii) only
D. (iv) only
E. (i) and (ii)
F. (iii) and (iv)
Answer: 0.816
Explanation: The directional derivative is given by the dot product of the gradient at a point with a unit vector along which the directional derivative is needed.
∇f(x, y, z) = [3x², 2y, 3z²]^T
⇒ ∇f(1, 1, 1) = [3, 2, 3]^T
Next, let's find the unit vector along [1, −2, 1]. To do that, we divide the vector by its magnitude: u = [1, −2, 1]/||[1, −2, 1]||
Calculating the magnitude: ||[1, −2, 1]|| = √(1² + (−2)² + 1²) = √6
⇒ u = [1/√6, −2/√6, 1/√6]^T
D_u[f] = ∇f(1, 1, 1) · u = [3, 2, 3] · [1/√6, −2/√6, 1/√6] = (3 − 4 + 3)/√6 = 2/√6
Therefore, the directional derivative of f(x, y, z) at (1, 1, 1) in the direction of the unit vector along [1, −2, 1] is 2/√6 ≈ 0.816.
12. The direction of steepest ascent for the function 2x + y 3 + 4z at the point (1, 0, 1) is
A. [2/√20, 0, 4/√20]
B. [1/√29, 0, 1/√29]
C. [−2/√29, 0, 4/√29]
D. [2/√20, 0, −4/√20]
Answer: A
Explanation:
Let f(x, y, z) = 2x + y³ + 4z
∇f(x, y, z) = [2, 3y², 4]^T
⇒ ∇f(1, 0, 1) = [2, 0, 4]^T
To obtain the direction of steepest ascent, we need to normalize the gradient vector. The magnitude of the gradient vector is:
||∇f(1, 0, 1)|| = √(2² + 0² + 4²) = √20 = 2√5
Therefore, the direction of steepest ascent for the function 2x + y³ + 4z at the point (1, 0, 1) is [2/√20, 0, 4/√20].
Answer: 0.577
Explanation: To find the directional derivative of f(x, y, z) = x + y + z at (−1, 1, 0) in the direction of the unit vector along [1, −1, 1], we need to calculate the dot product of the gradient of f at that point with the unit vector.
∇f(x, y, z) = [1, 1, 1]^T
⇒ ∇f(−1, 1, 0) = [1, 1, 1]^T
Next, let's find the unit vector along [1, −1, 1]. To do that, we divide the vector by its magnitude: u = [1, −1, 1]/||[1, −1, 1]||
Calculating the magnitude: ||[1, −1, 1]|| = √(1² + (−1)² + 1²) = √3
Therefore, u = (1/√3)[1, −1, 1] = [1/√3, −1/√3, 1/√3]
D_u[f] = ∇f(−1, 1, 0) · u = (1, 1, 1) · [1/√3, −1/√3, 1/√3] = 1/√3
Therefore, the directional derivative of f(x, y, z) = x + y + z at (−1, 1, 0) in the direction of the unit vector along [1, −1, 1] is 1/√3 ≈ 0.577.
14. Which of the following is the equation of the line passing through (7, 8, 6) in the direction of the vector [1, 2, 3]?
A. [1, 2, 3] + α[−6, −6, 3]
B. [7, 8, 9] + α[−6, −6, 3]
C. [1, 2, 3] + α[6, 6, 3]
D. [7, 8, 6] + α[6, 6, 3]
E. [7, 8, 6] + α[1, 2, 3]
F. [1, 2, 3] + α[7, 8, 6]
Answer: E
Explanation: A line through the point u ∈ Rd along a vector v ∈ Rd is given by the
equation
x = u + αv
=⇒ x = [7, 8, 6] + α[1, 2, 3]
So, option E is the answer.
Course: Machine Learning - Foundations
Week 3: Practice questions
1. (1 point) What is the length of the vector [1, 1, −1]^T?
A. 1.73
B. 1.71
C. 1.72
D. 1.74
Answer: A
Solution:
Using the definition of the length of a vector,
||[1, 1, −1]^T|| = √(1² + 1² + (−1)²) = √3 ≈ 1.732
∴ Option A is correct.
2. (1 point) The inner product of [1, 0, 3]^T and [−1, 2, 4]^T is
A. 11
B. 12
C. 31
D. 20
Answer: A
Solution:
Using the definition of the standard inner product,
⟨[1, 0, 3]^T, [−1, 2, 4]^T⟩ = (1)(−1) + (0)(2) + (3)(4) = 11
∴ Option A is correct.
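Both computations map directly to NumPy's `norm` and `dot`:

```python
import numpy as np

# Length of [1, 1, -1]: sqrt(1 + 1 + 1) = sqrt(3).
v = np.array([1.0, 1.0, -1.0])
length = np.linalg.norm(v)
print(round(length, 2))  # 1.73

# Standard inner product of [1, 0, 3] and [-1, 2, 4].
a = np.array([1, 0, 3])
b = np.array([-1, 2, 4])
print(np.dot(a, b))  # 11
```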
3. (1 point) The rank of the matrix A = [[0, 1, 2], [1, 2, 1], [2, 7, 8]] is
A. 0
B. 1
C. 2
D. 3
Answer: C
Solution:
To solve this question, we can first find the row echelon form R of A. Then the rank of the matrix is the number of pivots (or basic variables) in R.
A = [[0, 1, 2], [1, 2, 1], [2, 7, 8]]
R1 ⇆ R2: [[1, 2, 1], [0, 1, 2], [2, 7, 8]]
R3 → R3 − 2R1: [[1, 2, 1], [0, 1, 2], [0, 3, 6]]
R3 → R3 − 3R2: [[1, 2, 1], [0, 1, 2], [0, 0, 0]] = R
∴ Option C is correct.
4. (1 point) The rank of the matrix A = [[1, 0, 2], [2, 1, 0], [3, 2, 1]] is
A. 0
B. 1
C. 2
D. 3
Answer: D
Solution:
We can take the determinant of A. Expanding the determinant along R1:
det(A) = (1)[(1)(1) − (0)(2)] − 0 + (2)[(2)(2) − (1)(3)] = 1 + 2 = 3 ≠ 0
We can see that the determinant of this 3×3 matrix is non-zero. This implies the matrix
is full rank. That is, rank(A)= 3.
∴ Option D is correct.
5. (1 point) Can we span the entire 4-d space using the four column vectors given in the
following matrix?
1 2 3 4
0 2 2 0
1 0 3 0
0 1 0 4
A. Yes
B. No
Answer: A
Solution:
Here, we wish to find out if the column space of a 4 × 4 matrix spans all of 4-d space.
This happens if and only if the determinant of the matrix is non-zero. Expanding the determinant along C4:
det = −(4)·det([[0, 2, 2], [1, 0, 3], [0, 1, 0]]) + 0 − 0 + (4)·det([[1, 2, 3], [0, 2, 2], [1, 0, 3]])
= −(4)(2) + (4)(4) = 8 ≠ 0
Since the determinant is non-zero, the four column vectors span all of R⁴.
∴ Option A is correct.
2 6 8
3 7 10
4 8 12
5 9 14
A. 0
B. 1
C. 2
D. 3
Answer: C
Solution:
Let the above matrix be A. Now, to solve this question, we can first find the row echelon
form of A. Then, the rank of the matrix is the number of pivots (or basic variables) in R.
A = [[2, 6, 8], [3, 7, 10], [4, 8, 12], [5, 9, 14]]
R1 → (1/2)R1: [[1, 3, 4], [3, 7, 10], [4, 8, 12], [5, 9, 14]]
R2 → R2 − 3R1, R3 → R3 − 4R1, R4 → R4 − 5R1: [[1, 3, 4], [0, −2, −2], [0, −4, −4], [0, −6, −6]]
R2 → (−1/2)R2: [[1, 3, 4], [0, 1, 1], [0, −4, −4], [0, −6, −6]]
R3 → R3 + 4R2, R4 → R4 + 6R2: [[1, 3, 4], [0, 1, 1], [0, 0, 0], [0, 0, 0]] = R
We can see that R has 2 pivots, so A has a rank of 2.
∴ Option C is correct.
7. (1 point) The rank of the matrix [[1, 2, 3], [2, 3, 6], [4, 5, 9]] is
A. 1
B. 2
C. 3
D. 4
Answer: C
Solution:
Let us find the determinant of the matrix.
det = 1[(3)(9) − (5)(6)] − 2[(2)(9) − (4)(6)] + 3[(2)(5) − (4)(3)]
= −3 − 2(−6) + 3(−2) = 3 ≠ 0
We can see that the determinant of this 3×3 matrix is non-zero. This implies the matrix
is full rank. That is, rank(A)= 3.
∴ Option C is correct.
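The ranks computed by hand in questions 3 and 7 can be verified with NumPy:

```python
import numpy as np

# Matrices from questions 3 and 7 of this section.
A = np.array([[0, 1, 2], [1, 2, 1], [2, 7, 8]])
B = np.array([[1, 2, 3], [2, 3, 6], [4, 5, 9]])

print(np.linalg.matrix_rank(A))  # 2
print(np.linalg.matrix_rank(B))  # 3
```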
Answer: B
Solution: To solve this question, we will make use of the rank-nullity theorem, which states that for any m × n matrix A,
rank(A) + nullity(A) = n
Using the fact that here, n = 3 and rank(A) = 2, and substituting these values into the
above equation, we get nullity(A) = 1.
∴ Option B is correct.
9. (1 point) Which of the following represents the row space of the matrix [[2, 4, 6, 8], [1, 3, 0, 5], [1, 1, 6, 3]]?
(Note: span {S} denotes the set of linear combinations of the elements of S)
A. Span{[1, 0, 9, 2]^T, [0, 1, −3, 1]^T}
B. Span{[9, 3, 1, 0]^T, [−2, 1, 0, 1]^T}
C. Span{[1, 0, −9, 0]^T, [0, 1, 0, 1]^T}
D. Span{[0, 3, 1, 0]^T, [3, −1, 0, 1]^T}
Answer: A
Solution:
To solve this question, we will make use of the fact that the row space of a matrix
does not change when applying row operations. Let the matrix given be A and it’s
reduced row echelon form be R. Then, rowspace(A) =rowspace(R).
A = [[2, 4, 6, 8], [1, 3, 0, 5], [1, 1, 6, 3]]
R1 → (1/2)R1: [[1, 2, 3, 4], [1, 3, 0, 5], [1, 1, 6, 3]]
R2 → R2 − R1, R3 → R3 − R1: [[1, 2, 3, 4], [0, 1, −3, 1], [0, −1, 3, −1]]
R3 → R3 + R2: [[1, 2, 3, 4], [0, 1, −3, 1], [0, 0, 0, 0]]
R1 → R1 − 2R2: [[1, 0, 9, 2], [0, 1, −3, 1], [0, 0, 0, 0]] = R
Since rowspace(A) = rowspace(R), a basis for the row space of A can be the pivot rows of R. That is,
rowspace(A) = span{[1, 0, 9, 2]^T, [0, 1, −3, 1]^T}
∴ Option A is correct.
10. (1 point) Find the projection matrix for v = [3, 3, 3]^T
A. [[1/3, 1/3, 1/3], [1/3, 1/3, 1/3], [1/3, 1/3, 1/3]]
B. [[−1/3, 1/3, 1/3], [−1/3, 1/3, −1/3], [1/3, −1/3, 1/3]]
C. [[−1/3, 1/3, −1/3], [1/3, −1/3, 1/3], [−1/3, 1/3, −1/3]]
D. [[−1/3, 1/3, 1/3], [1/3, −1/3, 1/3], [1/3, 1/3, −1/3]]
Answer: A
Solution:
The projection matrix P for a vector v is given by
P = vv^T / (v^T v)
Here, v = [3, 3, 3]^T, so the outer product is
vv^T = [[9, 9, 9], [9, 9, 9], [9, 9, 9]]
and v^T v = 27. Finally, P is given by
P = (1/27)[[9, 9, 9], [9, 9, 9], [9, 9, 9]] = [[1/3, 1/3, 1/3], [1/3, 1/3, 1/3], [1/3, 1/3, 1/3]]
∴ Option A is correct.
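The formula P = vv^T / (v^T v) translates directly to NumPy's outer and inner products:

```python
import numpy as np

v = np.array([3.0, 3.0, 3.0])

# Projection matrix onto the line spanned by v: P = v v^T / (v^T v).
P = np.outer(v, v) / np.dot(v, v)
print(P)  # every entry is 1/3

# Sanity check: a projection matrix is idempotent, i.e. P @ P == P.
print(np.allclose(P @ P, P))  # True
```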
11. (1 point) Find the projection of [1, −4, 2] along [1, −2, −3]
A. [3/14, −6/14, −9/14]
B. [3/14, 6/14, 9/14]
C. [−3/14, −6/14, −9/14]
D. [3/14, 6/14, −9/14]
Answer: A
Solution:
The projection of a vector a along a vector b is given by
proj_b a = (a^T b / b^T b) b
Here, a = [1, −4, 2]^T and b = [1, −2, −3]^T. Substituting these values into the above expression, we get
proj_b a = [(1)(1) + (−4)(−2) + (2)(−3)] / [1² + (−2)² + (−3)²] · [1, −2, −3]^T
= (3/14)[1, −2, −3]^T
= [3/14, −6/14, −9/14]^T
∴ Option A is correct.
Course: Machine Learning - Foundations
Week 4: Test questions
Answer: A
Solution: If a vector b is orthogonal to the column space of P, then b is orthogonal to the vector a from which the projection matrix P is built.
That is, b ⊥ a, so a · b = 0 and P b = 0.
⇒ P b = 0 = 0 · b
Hence the eigenvalue is zero.
Answer: D
Solution: We already know that row operations change neither the rank nor the row space of a matrix.
Adding a multiple of one row to another does not change the determinant either. For example, consider the matrix
A = [R1; R2; R3]
where R1, R2 and R3 are row vectors, and form B by the row operation R2 → R2 − R1:
B = [R1; R2 − R1; R3]
Since the determinant is linear in each row,
det(B) = det(A) − det([R1; R1; R3]) = det(A) − 0 = det(A)
because a matrix with a repeated row has determinant zero.
Answer: D
Solution: The eigenvalues of a symmetric matrix are always real. Eigenvectors corresponding to two distinct eigenvalues of a matrix are independent; if the matrix is symmetric, then they are orthogonal too.
Since the eigenvectors of a symmetric matrix are orthogonal, the diagonalization of a symmetric matrix results in an orthogonal diagonalization.
Answer: A
Solution: As mentioned in the above question, eigenvectors corresponding to distinct eigenvalues of a matrix are independent.
B. 0
C. 6
D. -6
E. -2
Answer: D
Solution: The product of the eigenvalues is equal to the determinant of the matrix. So, the determinant is 1 × (−2) × 3 = −6.
6. (1 point) The trace of a 2 × 2 matrix is -1 and its determinant is -6. Its eigenvalues will
be
A. -1, 3
B. 2, 3
C. 2, -3
D. -2, 3
Answer: C
Solution: The trace (sum of diagonal elements) of a matrix is equal to the sum of the eigenvalues, and the determinant is equal to their product.
Since the given matrix is of 2 × 2 dimension, it has two eigenvalues.
λ1 + λ2 = −1, λ1 λ2 = −6; solving, we get λ1 = 2 and λ2 = −3.
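This can be checked numerically. The matrix below is a made-up example with the given trace and determinant (the question fixes only those two invariants, not the matrix itself):

```python
import numpy as np

# Any 2x2 matrix with trace -1 and determinant -6 has eigenvalues 2 and -3.
A = np.array([[0.0, 2.0],
              [3.0, -1.0]])   # trace = 0 + (-1) = -1, det = 0*(-1) - 2*3 = -6
eigvals = np.sort(np.linalg.eigvals(A).real)

print(eigvals)   # eigenvalues -3 and 2
```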
7. (1 point) If the eigenvalues of a matrix are -1, 0 and 4, then its trace and determinant
are
Trace:
Determinant:
Answer: 3, 0
Solution: Using the same relations as above, trace = −1 + 0 + 4 = 3 and determinant = (−1) × 0 × 4 = 0.
8. (1 point) The characteristic polynomial for the matrix A = [1 1; 1 3] is
A. λ² − 4λ + 1
B. λ² − 4λ
C. λ² − 4λ − 2
D. λ² + 4λ + 2
E. λ² − 4λ + 2
Answer: E
Solution: For any matrix A, the characteristic polynomial is obtained from the relation det(A − λI) = 0.
⇒ det([1 1; 1 3] − λ[1 0; 0 1]) = 0
⇒ det([1−λ 1; 1 3−λ]) = 0
By simplifying we get λ² − 4λ + 2 = 0.
9. (1 point) The eigenvalues of matrix A = [1 1; 2 3] are
A. 2 + √3, 2 − √3
B. √3, −√3
C. 0, 1
D. √5, −√5
Answer: A
Solution: To find the eigenvalues, we solve the characteristic polynomial equation just like we did for the above problem, using det([a b; c d]) = ad − bc.
det([1−λ 1; 2 3−λ]) = 0
⇒ (1 − λ)(3 − λ) − 2 · 1 = 0
⇒ λ² − 4λ + 1 = 0
⇒ λ = 2 ± √3.
10. (2 points) If the eigenvalues of a matrix A are 0 -1 and 5, then the eigenvalues of A3
are
A. 0, -1 and 5
B. 0, -1 and 125
C. 0, 1 and -125
D. 0, 1 and -5
Answer: B
Solution: If λ is an eigenvalue of matrix A, then λᵏ is an eigenvalue of Aᵏ.
Since the eigenvalues of A are 0, −1, and 5, the eigenvalues of A³ are 0³ = 0, (−1)³ = −1, and 5³ = 125.
Hence option B is correct.
Answer: A
Solution: Following the lectures, the closed form for the Fibonacci numbers is
F_k = (1/√5) ((1 + √5)/2)^k − (1/√5) ((1 − √5)/2)^k
Since |(1 − √5)/2| < 1, the second term is negligible for large k, so
F_110 = (1/√5) ((1 + √5)/2)^110 − (1/√5) ((1 − √5)/2)^110 ≈ (1/√5) ((1 + √5)/2)^110.
Hence, option A is correct.
12. (1 point) (Multiple Select) Let A be an n × n matrix. Which of the following statements
is/are false?
A. If A has r non-zero eigenvalues, then rank of A is at least r.
B. If one of the eigenvalues of A is zero, then |A| ≠ 0.
C. If x is an eigenvector of A, then so is every vector on the line through x.
D. If 0 is an eigenvalue of A, then A cannot be invertible.
Answer: B
Solution:
If a matrix has r non-zero eigenvalues, then its rank is at least r, hence the first option is true.
If one of the eigenvalue is zero, then determinant is zero, since determinant is product
of eigenvalues. So, option B is false.
If x is an eigenvector of A, then cx is also an eigenvector for any non-zero constant c, i.e., every vector on the line through x. So, option C is true.
If 0 is eigenvalue, then determinant is zero, hence it is not invertible.
13. (2 points) The eigenvalues of the matrix [1 2; 2 4] are
Answer: 0, 5
Solution: By using the characteristic polynomial equation, we get (1 − λ)(4 − λ) − 4 = 0
⇒ λ² − 5λ = 0. Solving, we get the eigenvalues 0 and 5.
14. (2 points) (Multiple Select) For the matrix given in the previous question, which of the following vectors is/are its eigenvector(s)?
A. [1; 2]
B. [−2; 1]
C. [1; 1]
D. [1; −2]
Answer: A, B
Solution: We will find an eigenvector for each of the eigenvalues 0 and 5.
For the λ = 0 case:
We know that (A − λI)x = 0, so (A − 0I)x = 0. Let x = [x1; x2].
⇒ [1 2; 2 4][x1; x2] = [0; 0]
⇒ [x1 + 2x2; 2x1 + 4x2] = [0; 0]
Let us take x1 = 1; then x2 = −0.5, so the first eigenvector is c[1; −0.5]. Among the given options, B satisfies this when c = −2, that is, −2 [1; −0.5] = [−2; 1].
For the λ = 5 case:
Here we get the equation (A − 5I)x = 0:
[1−5 2; 2 4−5][x1; x2] = [0; 0]
⇒ −4x1 + 2x2 = 0 and 2x1 − x2 = 0
If x1 = 1, then x2 = 2.
Hence, the eigenvector is of the form c[1; 2]. So, option A is correct.
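Both eigenpairs can be verified directly (a small numpy check, variable names are ours):

```python
import numpy as np

# Check the two eigenpairs found above for A = [[1, 2], [2, 4]].
A = np.array([[1.0, 2.0],
              [2.0, 4.0]])
v_a = np.array([1.0, 2.0])    # option A: eigenvalue 5
v_b = np.array([-2.0, 1.0])   # option B: eigenvalue 0

print(np.allclose(A @ v_a, 5 * v_a))   # True
print(np.allclose(A @ v_b, 0 * v_b))   # True
```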
Answer: 1,9,16
Solution: Here P^-1 A P gives an upper triangular matrix whose diagonal elements are the eigenvalues. So, the eigenvalues of A are −1, 3, and 4, and the eigenvalues of A² are (−1)² = 1, 3² = 9, and 4² = 16.
16. (2 points) The best second order polynomial that fits the data set
x y
0 0
1.3 1.5
4 1.2
is
A. 1.35x2 + 0.3x
B. 1.25x2 + 0.45x
C. −0.316x2 + 1.56x
D. −0.25x2 + 0.5
Answer: C
Solution: Here we are asked to find the second order polynomial that fits the data. This is similar to linear regression with multiple features, where the single feature x appears with powers up to 2.
The equation will be y = θ0 + θ1 x + θ2 x².
Y = [0; 1.5; 1.2], A = [1 0 0; 1 1.3 1.69; 1 4 16], and θ = [θ0; θ1; θ2]
To minimize the squared error, θ = (A^T A)^(-1) A^T Y.
A^T A = [3 5.3 17.69; 5.3 17.69 66.197; 17.69 66.197 258.8561]
A^T Y = [2.7; 6.75; 21.735]
Solving the normal equations (A^T A)θ = A^T Y gives
θ ≈ [0; 1.565; −0.316]
(In fact, with three data points a second order polynomial fits the data exactly, and the first data point forces θ0 = 0.) So the fitted polynomial is y ≈ 1.56x − 0.316x².
So, option C is correct.
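The normal-equation solution above can be reproduced in a few lines of numpy (a sketch; the variable names are ours):

```python
import numpy as np

# Normal-equation fit of y = θ0 + θ1 x + θ2 x² to the three data points.
x = np.array([0.0, 1.3, 4.0])
y = np.array([0.0, 1.5, 1.2])
A = np.vander(x, 3, increasing=True)           # columns [1, x, x^2]
theta = np.linalg.solve(A.T @ A, A.T @ y)      # θ = (A^T A)^(-1) A^T y

print(theta.round(3))   # θ0 = 0, θ1 ≈ 1.565, θ2 ≈ -0.316
```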
Course: Machine Learning - Foundations
Week 5: Practice questions
1. (1 point) The complex conjugate of matrix A = [1−i 1−3i; 6+4i 35−2i] is
A. [1−i 1−3i; 6+4i 35−2i]
B. [1+i 1+3i; 6−4i 35+2i]
C. [−1+i −1−3i; −6+4i −35−2i]
D. [1−i 1−3i; 6−4i 35−2i]
Answer: B
Explanation: The complex conjugate of a matrix is given by simply taking the conjugate of all the components of the matrix. That is,
A = [1−i 1−3i; 6+4i 35−2i] ⇒ Ā = [1+i 1+3i; 6−4i 35+2i]
∴ Option B is correct.
2. (1 point) The complex conjugate transpose of matrix A = [3−2i 5+i; 1+4i 7−2i] is
A. [7+i 5+4i; 3−i 3−2i]
B. [5−i 3−4i; 1+i 7+2i]
C. [3+i 5−i; 1+4i 7−2i]
D. [3+2i 1−4i; 5−i 7+2i]
Answer: D
Explanation: The complex conjugate transpose of a matrix is given by simply taking the transpose of the complex conjugate of the matrix. That is,
A = [3−2i 5+i; 1+4i 7−2i] ⇒ A* = (Ā)^T = [3+2i 1−4i; 5−i 7+2i]
∴ Option D is correct.
3. (1 point) The inner product of x = [1−i; 2i] and y = [−1−i; i] is
A. 7 − 6i
B. 4 − 4i
C. 2 − 2i
D. 3 + 4i
Answer: C
Explanation: The inner product between two complex vectors x and y is given by multiplying the complex conjugate transpose of one vector by the other. That is,
<x, y> = x* y
       = [1+i  −2i] [−1−i; i]
       = (1 + i)(−1 − i) + (−2i)(i)
       = −(1 + i)² + 2
       = −2i + 2
       = 2 − 2i
∴ Option C is correct.
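numpy's `vdot` implements exactly this inner product (it conjugates its first argument), so the computation can be checked directly:

```python
import numpy as np

# <x, y> = x* y; numpy's vdot conjugates its first argument.
x = np.array([1 - 1j, 2j])
y = np.array([-1 - 1j, 1j])
inner = np.vdot(x, y)

print(inner)   # (2-2j)
```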
4. (1 point) The square of the length of vector x = [2−i; 4−i] is
A. 16
B. 17
C. 31
D. 22
Answer: D
Explanation: The squared length of a vector x is given by taking the inner product of x with itself. So,
||x||² = x* x = (2 + i)(2 − i) + (4 + i)(4 − i) = 5 + 17 = 22
∴ Option D is correct.
5. (1 point) The matrix A = [(1+i)/√3 (1+i)/√6; i/√3 2i/√6] is unitary.
A. True
B. False
Answer: B
Explanation: The condition for A to be unitary is that its columns must be pairwise orthogonal and of unit length. Let us check these conditions.
Let the columns of the matrix A be a1 and a2. For orthogonality, the inner product must be 0. Here,
<a1, a2> = ((1−i)(1+i) + (−i)(2i)) / √18 = (2 + 2)/√18 = 4/√18 ≠ 0
Since the columns are not orthogonal, A is not unitary.
6. (1 point) The matrix Z = [1 2 3; 2 4 5; 3 5 6] is Hermitian.
A. True
B. False
Answer: True
Explanation: Z is real and symmetric, so Z* = Z^T = Z, i.e., Z is Hermitian.
Answer: C, D
Explanation: By definition, a given matrix A is hermitian if and only if A∗ = A. Let
us check this for all the options.
Option A.
A* = (Ā)^T = [1 3−i; 3+i −i] ≠ A
Since the matrix is not Hermitian, option A will not be marked.
Option B.
B* = (B̄)^T = [0 3+2i; 3+2i 4] ≠ B
Since the matrix is not Hermitian, option B will not be marked.
Option C.
C* = (C̄)^T = [3 2−i −3i; 2+i 0 1−i; 3i 1+i 0] = C
Since the matrix is Hermitian, option C will be marked.
Option D.
D* = (D̄)^T = [−1 2 3; 2 0 −1; 3 −1 4] = D
Since the matrix is Hermitian, option D will be marked.
8. (1 point) The eigenvalues of matrix A = [3 2−i −3i; 2+i 0 1−i; 3i 1+i 0] are
A. -1,-6 and 2
B. 1, -6 and -2
C. 1, 6 and 2
D. -1, 6 and -2
Answer: D
Explanation: For this question, we can use the fact that the trace of a matrix is the
same as the sum of the eigenvalues. Here,
tr(A) = 3 + 0 + 0 = 3
Out of these options, only option D gives a sum of 3 for the eigenvalues to match the
trace of A.
∴ Option D is correct.
9. (2 points) Let A = k[1+i 1−i; 1−i 1+i], where k ∈ R. A is unitary if k is
A. 1/2
B. 1
C. 1/4
D. 1/8
Answer: A
Explanation: To solve this question, we will use the fact that, for a unitary matrix A, its columns must be of unit length.
Here, let a1 = [k(1+i); k(1−i)]. Then, we need k to be such that ||a1|| = 1.
||a1||² = <a1, a1> = a1* a1
       = k² [1−i  1+i] [1+i; 1−i]
       = k² ((1−i)(1+i) + (1+i)(1−i))
       = 4k²
So, we have 4k² = 1² ⇒ k = 1/2.
∴ Option A is correct.
10. (2 points) Let A = (1/2)[1+i √k; 1−i √k i], where k ∈ R. A is unitary if k is
A. 1/2
B. 1
C. 2
D. 1/4
Answer: C
Explanation: To solve this question, we will use the fact that, for a unitary matrix A, its columns must be of unit length.
Here, let a2 = (1/2)[√k; √k i]. Then, we need k to be such that ||a2|| = 1.
||a2||² = <a2, a2> = a2* a2
       = (1/4) [√k  −√k i] [√k; √k i]
       = (1/4)(k + k)
       = k/2
So, we have k/2 = 1² ⇒ k = 2.
∴ Option C is correct.
11. (3 points) A matrix A = [2 1+i; 1−i 3] can be written as A = U D U*, where U is a unitary matrix and D is a diagonal matrix. Then, U and D respectively are
A. U = [(−1−i)/√3 (1+i)/√6; 1/√3 2/√6], D = [1 0; 0 4]
B. U = [(−1+i)/√3 (−1+i)/√6; 1/√3 2/√6], D = [1 0; 0 4]
C. U = [(−1+i)/√3 (−1+i)/√6; −1/√3 −2/√6], D = [1 0; 0 4]
D. U = [(−1+i)/√3 (−1+i)/√6; −1/√3 −2/√6], D = [−1 0; 0 −4]
Answer: A
Explanation: Since the matrix A can be written as U D U*, A is unitarily diagonalizable. Let us start by finding the eigenvalues and eigenvectors. The characteristic polynomial is c(λ).
c(λ) = |A − λI| = det([2−λ 1+i; 1−i 3−λ]) = (2 − λ)(3 − λ) − 2 = λ² − 5λ + 4 ⇒ λ = 1, 4
So, the eigenvalues are 1 and 4 ⇒ D = [1 0; 0 4].
Let the eigenvectors be v1 and v2. Then,
E_{λ=1} = null([1 1+i; 1−i 2]) = null([1 1+i; 0 0]) ⇒ v1 = [−1−i; 1]
E_{λ=4} = null([−2 1+i; 1−i −1]) = null([1 −(1+i)/2; 0 0]) ⇒ v2 = [1+i; 2]
Converting v1 and v2 to unit vectors and putting them in a matrix, we get
U = [(−1−i)/√3 (1+i)/√6; 1/√3 2/√6].
∴ Option A is correct.
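The derived factors can be checked numerically (a small sketch; array names are ours):

```python
import numpy as np

# Verify the decomposition A = U D U* derived above.
A = np.array([[2, 1 + 1j],
              [1 - 1j, 3]])
U = np.array([[(-1 - 1j) / np.sqrt(3), (1 + 1j) / np.sqrt(6)],
              [1 / np.sqrt(3),         2 / np.sqrt(6)]])
D = np.diag([1.0, 4.0])

print(np.allclose(U @ D @ U.conj().T, A))       # True
print(np.allclose(U.conj().T @ U, np.eye(2)))   # True, so U is unitary
```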
12. (1 point) (Multiple select) Which of the following matrices is/are unitary?
A. [1/√2 i/√2; −i/√2 1/√2]
B. [−1/√2 i/√2; i/√2 1/√2]
C. [1/√2 −i/√2; i/√2 1/√2]
D. [1/√2 i/√2; i/√2 −1/√2]
E. [1/√2 i/√2; i/√2 1/√2]
Answer: E
Explanation: For a matrix to be unitary, the columns of the matrix must be pairwise orthogonal and of unit length. Let us check the inner product of the two columns for each option.
Option A.
<[1/√2; −i/√2], [i/√2; 1/√2]> = (1/√2)(i/√2) + (i/√2)(1/√2) = i ≠ 0
Option B.
<[−1/√2; i/√2], [i/√2; 1/√2]> = (−1/√2)(i/√2) + (−i/√2)(1/√2) = −i ≠ 0
Option C.
<[1/√2; i/√2], [−i/√2; 1/√2]> = (1/√2)(−i/√2) + (−i/√2)(1/√2) = −i ≠ 0
Option D.
<[1/√2; i/√2], [i/√2; −1/√2]> = (1/√2)(i/√2) + (−i/√2)(−1/√2) = i ≠ 0
Option E.
<[1/√2; i/√2], [i/√2; 1/√2]> = (1/√2)(i/√2) + (−i/√2)(1/√2) = 0
Out of all the options, only the matrix in option E has orthogonal columns. Further, we can calculate using the inner product that the columns are also of unit length. That is,
||[1/√2; i/√2]|| = ||[i/√2; 1/√2]|| = 1
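Both conditions (orthogonal, unit-length columns) are equivalent to M* M = I, which gives a one-line numerical check of option E:

```python
import numpy as np

# A matrix M is unitary iff M* M = I; checking option E directly.
s = 1 / np.sqrt(2)
E = np.array([[s, 1j * s],
              [1j * s, s]])

print(np.allclose(E.conj().T @ E, np.eye(2)))   # True
```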
13. (1 point) Let U and V be two symmetric matrices. Consider the following statements:
1. U V is symmetric.
2. U + V is symmetric.
Then,
A. both statements are true.
B. both statements are false.
C. 1. is false.
D. 2. is false.
Answer: C
Explanation: Let us look at both statements.
Statement 1.
(U V)^T = V^T U^T = V U, which in general is not equal to U V, since matrix multiplication is not commutative.
So, statement 1 is false.
Statement 2.
(U + V)^T = U^T + V^T = U + V
So, statement 2 is true.
∴ Option C is correct.
Course: Machine Learning - Foundations
Week 5: Test questions
1. (1 point) Consider two non-zero vectors x ∈ C^n and y ∈ C^n. Suppose the inner product between x and y obeys the commutative property (i.e., x · y = y · x). This implies that
A. y must be a conjugate transpose of x
B. y is equal to x
C. y must be orthogonal to x
D. y must be a scalar (possibly complex) multiple of x
Explanation: Let us look at each option and see what makes the commutative property
hold.
Option A.
Assuming y = x∗
x· y = x∗ y = x∗ x∗
This multiplication is not defined, so option A is incorrect.
Option B.
Assuming y = x
x· y = y· x
This is trivially true, so option B is correct.
Option C.
Assuming x∗ y = 0
x· y = x∗ y = 0 = y ∗ x = y· x
So, option C is correct.
Option D.
Assuming y = zx where z ∈ C:
x · y = x*(zx) = z(x* x), while y · x = (zx)* x = z̄(x* x). These agree only when z is real, so option D does not hold in general.
2. (1 point) The inner product of two distinct vectors x and y that are drawn randomly
from C100 is 0.8 − 0.37i.The vector x is scaled by a scalar 1 − 2i to obtain a new vector
z, then the inner product between z and y is
A. 0.06 − 1.97i
B. 1.54 − 1.23i
C. 1.54 + 1.23i
D. 0.8 − 0.37i
E. Not possible to calculate
Explanation: From the question, it is given that x* y = 0.8 − 0.37i. Let us replace x with z = (1 − 2i)x and compute:
<z, y> = ((1 − 2i)x)* y = (1 + 2i) x* y = (1 + 2i)(0.8 − 0.37i) = 1.54 + 1.23i
∴ Option C is correct.
3. (1 point) Select the correct statement(s). The eigenvalue decomposition for the matrix A = [0 −1; 1 0]
A. doesn’t exist over R but exists over C
B. doesn’t exist over C but exists over R
C. neither exists over R nor exists over C
D. exists over both C and R
Explanation: The characteristic polynomial is λ² + 1 = 0, so the eigenvalues are λ = ±i, which are complex. So the decomposition doesn’t exist over R, but since there are two distinct eigenvalues, the decomposition will exist over C.
∴ Option A is correct.
4. (1 point) Consider the complex matrix S = [1 1+i −2−2i; 1−i 1 −i; −2+2i i 1]. The matrix is
A. Hermitian and Symmetric
B. Symmetric but not Hermitian
C. Neither Hermitian nor Symmetric
D. Hermitian but not Symmetric
Explanation: For this question, we can check both conditions. We can see that here
S T ̸= S, so S is not symmetric. But taking the complex conjugate transpose, S ∗ = S,
so S is hermitian.
∴ Option D is correct.
Explanation: The requirement for a matrix U to be unitary is that its columns must be pairwise orthogonal and of unit length.
Here, if we consider the diagonal matrix D = 2I, say. Then multiplying D with U will
cause all the columns to double, which in turn will change the length of the columns.
So, the resultant matrix will not be unitary.
∴ Option B is correct.
6. (3 points) The eigenvectors of matrix A = [3 2−i −3i; 2+i 0 1−i; 3i 1+i 0] are
A. [−1; 1+2i; 1], [1−21i; 6−9i; 13], [1+3i; −2−i; 5]
B. [1; 1−2i; 1], [1−21i; 6−9i; 13], [1+3i; −2−i; 5]
C. [−1; 1−2i; −1], [1−21i; 6−9i; 13], [1+3i; −2−i; 5]
D. [−1; 1+2i; 1], [1−21i; 6−9i; 13], [1−3i; 2−i; −5]
Explanation: To solve this question, we can use trial and error from the options and see which one of them is an eigenvector.
A [−1; 1+2i; 1] = −1 · [−1; 1+2i; 1]
A [1−21i; 6−9i; 13] = 6 · [1−21i; 6−9i; 13]
A [1+3i; −2−i; 5] = −2 · [1+3i; −2−i; 5]
So, all the vectors from option A are eigenvectors, as upon multiplication with the matrix A the vectors get scaled (by −1, 6, and −2 respectively).
∴ Option A is correct.
7. (1 point) A matrix A = (1/2)[k+i √2; k−i √2 i] is unitary if k is
A. 1/2
B. 1
C. −1/2
D. −1
E. ±1
F. ±1/2
Explanation: To solve this question, we will use the fact that, for a unitary matrix A, its columns must be pairwise orthogonal. That is, we need
0 = <a1, a2> = a1* a2
  = (1/4) [k−i  k+i] [√2; √2 i]
  = (1/4)(√2(k − i) + √2(k + i)i)
  = (1/4)(√2 k − √2 i + √2 k i − √2)
  = ((√2 k − √2) + i(√2 k − √2)) / 4
So, we have √2 k − √2 = 0 ⇒ k = 1.
∴ Option B is correct.
8. (3 points) A matrix A = [1 1+i; 1−i 1] can be written as A = U D U*, where U is a unitary matrix and D is a diagonal matrix. Then, U and D, respectively, are
A. U = [(1+i)/2 −(1+i)/2; 1/√2 1/√2], D = [1+√2 0; 0 1−√2]
B. U = [(−1+i)/2 (−1−i)/2; 1/√2 1/√2], D = [1+√2 0; 0 1−√2]
C. U = [(1+i)/2 −(1+i)/2; 1/√2 1/√2], D = [−1+√2 0; 0 1−√2]
D. U = [(1−i)/2 −(1−i)/2; 1/√2 1/√2], D = [1+√2 0; 0 1−√2]
Explanation: The characteristic polynomial is
c(λ) = |A − λI| = det([1−λ 1+i; 1−i 1−λ]) = (1 − λ)² − 2 = λ² − 2λ − 1 ⇒ λ = 1 + √2, 1 − √2
So, the eigenvalues are 1 + √2 and 1 − √2 ⇒ D = [1+√2 0; 0 1−√2].
Let the eigenvectors be v1 and v2. Then,
E_{λ=1+√2} = null([−√2 1+i; 1−i −√2]) = null([1 −(1+i)/√2; 0 0]) ⇒ v1 = [(1+i)/√2; 1]
E_{λ=1−√2} = null([√2 1+i; 1−i √2]) = null([1 (1+i)/√2; 0 0]) ⇒ v2 = [−(1+i)/√2; 1]
Converting v1 and v2 to unit vectors (each has length √2) and putting them in a matrix, we get
U = [(1+i)/2 −(1+i)/2; 1/√2 1/√2].
∴ Option A is correct.
9. (2 points) The matrix Z = [0 −1; 1 0] has
A. only real eigenvalues.
B. one real and one complex eigenvalue.
C. no real eigenvalues.
Explanation:
|Z − λI| = det([−λ −1; 1 −λ]) = λ² + 1 ⇒ λ = ±i
∴ Option C is correct.
10. (1 point) (Multiple select) Which of the following matrices is/are unitary?
cos θ − sin θ
A.
sin θ − cos θ
cos θ sin θ
B.
sin θ cos θ
− cos θ sin θ
C.
sin θ cos θ
cos θ sin θ
D.
− sin θ cos θ
− cos θ sin θ
E.
sin θ − cos θ
Answer: C, D
Explanation: The conditions for a particular matrix A to be unitary are that the columns need to be of unit length and pairwise orthogonal.
We can see here that due to the identity sin²θ + cos²θ = 1, all matrices here satisfy the first condition.
For the second condition, we need the inner product of the columns to come out to 0. This is true only for options C and D.
Let U and V be two unitary matrices. Consider the following statements:
1. U V is unitary.
2. U + V is unitary.
Explanation: The conditions for a particular matrix A to be unitary are that the columns need to be of unit length and pairwise orthogonal. These conditions are captured in the property that for a unitary matrix U, U* = U^-1. For the first statement, we have
(U V)* = V* U* = V^-1 U^-1 = (U V)^-1
This shows that U V is a unitary matrix. However, the same cannot be said for statement 2, because U^-1 + V^-1 ≠ (U + V)^-1 in general.
∴ Option D is correct.
Explanation: For this question, we can simply multiply the vectors in each of the op-
tions to the matrix A. If this multiplication has an effect of scaling the vectors, that
implies it is an eigenvector.
Option A.
[1 1+i; 1−i 2] [−1−i; 1] = [0; 0] = 0 · [−1−i; 1]
The vector is scaled by 0, so it is an eigenvector with eigenvalue 0, and option A is correct.
Option B.
Option C.
[1 1+i; 1−i 2] [(1+i)/2; 1] = 3 · [(1+i)/2; 1]
The vector is scaled by 3, so it is an eigenvector with eigenvalue 3, and option C is correct.
Option D.
This vector is simply a multiple of the vector in option C, so it is also an eigenvector.
1. f(x, y) = 2xy + y²
fx = 2y = 0 ⇒ y = 0
fy = 2x + 2y = 0 ⇒ x = 0
So, the stationary point is (0, 0)
2.
A = [4 1 −1; 1 2 1; −1 1 2]
The characteristic equation is
λ³ − 8λ² + 17λ − 6 = 0
Solving we get
λ1 = 0.438
λ2 = 4.56
λ3 = 3
Since all eigenvalues are greater than 0, A is positive definite.
3.
A = [6 2; 2 1]
Here, since a = 6 > 0 and ac − b2 = 6 − 4 = 2 > 0, A is positive definite.
4.
f(x, y) = 4 + x³ + y³ − 3xy
fx = 3x² − 3y and fy = 3y² − 3x
6.
f(x, y) = −3x² − 6xy − 6y²
To check at (0, 0):
First derivative test: fx = −6x − 6y and fy = −6x − 12y, both of which vanish at (0, 0).
Second derivative test: fxx = −6 < 0, fyy = −12 < 0, fxy = −6, and
D = fxx fyy − fxy² = (−6)(−12) − 36 = 36 > 0.
Since D > 0 and fxx < 0, the point (0, 0) is a maximum.
7.
A = [6 5; 5 4]
Here, since a = 6 > 0 and ac − b² = 24 − 25 = −1 < 0, A is indefinite.
8.
A = [−6 0 0; 0 −5 9; 0 0 −7]
The eigenvalues of A (a triangular matrix) are its diagonal entries −6, −5, and −7; all negative. So, A is negative definite.
9.
λ1 + λ2 = 6 ....(i)
λ1 × λ2 = 8
(λ1 − λ2)² = (λ1 + λ2)² − 4(λ1 × λ2) = 36 − 32 = 4
λ1 − λ2 = 2, −2 ...(ii)
Solving eq. (i) and (ii) we get λ1 = 4 and λ2 = 2. Since both eigenvalues are positive, A is positive definite.
10.
A = [1 2; 2 1] ⇒ A^T A = [5 4; 4 5]
The characteristic polynomial of A^T A is
λ² − 10λ + 9 = 0
Solving we get
λ = 9, 1
So, the singular values are
σ1 = √9 = 3 and σ2 = √1 = 1
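The singular values can be read straight off numpy as a check:

```python
import numpy as np

# Singular values of A = [[1, 2], [2, 1]] straight from numpy.
A = np.array([[1.0, 2.0],
              [2.0, 1.0]])
sigma = np.linalg.svd(A, compute_uv=False)

print(sigma)   # singular values 3 and 1
```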
11.
A = [0 1 1; √2 2 0; 0 1 1]

A^T A = [2 2√2 0; 2√2 6 2; 0 2 2]

The characteristic polynomial for A^T A is
λ³ − 10λ² + 16λ = 0
Solving we get λ = 0, 2, 8.

For λ = 8:
(A^T A − 8I)X = 0
[−6 2√2 0; 2√2 −2 2; 0 2 −6][x; y; z] = [0; 0; 0]
−6x + 2√2 y = 0
2√2 x − 2y + 2z = 0
2y − 6z = 0
Let z = t; then y = 3t and x = √2 t, so
v1 = [x; y; z] = t[√2; 3; 1]
On normalizing (||[√2; 3; 1]|| = 2√3), for t = 1 we get
x1 = [1/√6; √3/2; 1/(2√3)]

For λ = 2:
(A^T A − 2I)X = 0
[0 2√2 0; 2√2 4 2; 0 2 0][x; y; z] = [0; 0; 0]
2√2 y = 0 ⇒ y = 0
2√2 x + 4y + 2z = 0
Let z = k; then x = −k/√2, so
v2 = [x; y; z] = k[−1/√2; 0; 1]
On normalizing, for k = 1 we get
x2 = [−1/√3; 0; 2/√6]

For λ = 0:
(A^T A)X = 0
2x + 2√2 y = 0
2√2 x + 6y + 2z = 0
2y + 2z = 0
Let z = l; then y = −l and x = √2 l, so
v3 = [x; y; z] = l[√2; −1; 1]
On normalizing, for l = 1 we get
x3 = [1/√2; −1/2; 1/2]

Putting the normalized eigenvectors in a matrix (the right singular vectors):
Q2 = [1/√6 −1/√3 1/√2; √3/2 0 −1/2; 1/(2√3) 2/√6 1/2]
⇒ Q2^T = [1/√6 √3/2 1/(2√3); −1/√3 0 2/√6; 1/√2 −1/2 1/2]

The left singular vectors are yi = A xi / σi, with σ1 = √8 = 2√2 and σ2 = √2:

y1 = A x1 / σ1 = (1/(2√2)) A [1/√6; √3/2; 1/(2√3)] = [1/√6; 2/√6; 1/√6]

y2 = A x2 / σ2 = (1/√2) A [−1/√3; 0; 2/√6] = [1/√3; −1/√3; 1/√3]

For y3 we need a unit vector orthogonal to y1 and y2. Writing y3 = [a; b; c]:
[a b c][1/√6; 2/√6; 1/√6] = 0 ⇒ a + 2b + c = 0 .....(i)
[a b c][1/√3; −1/√3; 1/√3] = 0 ⇒ a − b + c = 0 ......(ii)
Let b = k1 and c = k2. From (i), a = −2k1 − k2; from (ii), a = k1 − k2.
So, −2k1 − k2 = k1 − k2 ⇒ k1 = 0, and a = −k2, b = 0, c = k2:
y3 = k2[−1; 0; 1]
Normalizing, for k2 = 1,
y3 = [−1/√2; 0; 1/√2]

Q1 = [1/√6 1/√3 −1/√2; 2/√6 −1/√3 0; 1/√6 1/√3 1/√2]
12.
A = [1 0; 1 1] ⇒ A^T = [1 1; 0 1]
A^T A = [2 1; 1 1]
The characteristic polynomial of A^T A is
λ² − 3λ + 1 = 0
Solving we get
λ = 2.618, 0.382
So,
σ1 = √2.618 = 1.618
σ2 = √0.382 = 0.618
13.
A = [2 0; 1 2]
A^T A = [2 1; 0 2][2 0; 1 2] = [5 2; 2 4]
The characteristic equation is
λ² − 9λ + 16 = 0
Solving we get
λ = (9 ± √17)/2 = 6.56155, 2.43845
σ1 = √6.56155 = 2.56, σ2 = √2.43845 = 1.56
Σ = [2.56 0; 0 1.56]
For λ = 6.56155, we have
(A^T A − 6.56155I)X = 0
[−1.56 2; 2 −2.56][x; y] = 0
−1.56x + 2y = 0 ⇒ 2x = 2.56y
Let y = k; then x = 1.28k. So, after normalizing,
v1 = [0.788; 0.615]
For λ = 2.43845, we have
[2.56 2; 2 1.56][x; y] = 0
2.56x + 2y = 0
2x + 1.56y = 0
Let y = k; then x = −0.78k. So, after normalizing,
v2 = [−0.615; 0.788]
Q2 = [0.788 −0.615; 0.615 0.788]
Now,
y1 = A v1 / σ1 = [0.615; 0.788]
Similarly,
y2 = A v2 / σ2 = [−0.788; 0.615]
So,
A = [0.615 −0.788; 0.788 0.615] [2.56 0; 0 1.56] [0.788 −0.615; 0.615 0.788]^T
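The factorization can be cross-checked with numpy's SVD (a sketch; numpy may flip the signs of the singular vectors, but the product is unchanged):

```python
import numpy as np

# Rebuild A = [[2, 0], [1, 2]] from its SVD factors as a check on the hand
# computation above.
A = np.array([[2.0, 0.0],
              [1.0, 2.0]])
Q1, sigma, Q2T = np.linalg.svd(A)

print(sigma.round(2))                              # [2.56 1.56]
print(np.allclose(Q1 @ np.diag(sigma) @ Q2T, A))   # True
```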
Week 6 Graded Assignment Solution
1.
f(x, y) = x² + y²
fx = ∂f/∂x = 2x
fy = ∂f/∂y = 2y
Put fx = 0 and fy = 0:
fx = 0 ⇒ 2x = 0 ⇒ x = 0
fy = 0 ⇒ 2y = 0 ⇒ y = 0
2.
A = [4 2; 2 2] ⇔ [a b; b c]
Here a = 4 > 0 and ac − b² = 8 − 4 = 4 > 0, so A is positive definite.
Now,
B = [1 1; 1 2] ⇔ [a b; b c]
Here a = 1 > 0 and ac − b² = 2 − 1 = 1 > 0, so B is positive definite.
3.
A = [2 −1 1; −1 2 −1; 1 −1 2]
The characteristic polynomial is found using the formula
λ³ − [trace(A)]λ² + Σ[minors of diagonal elements of A]λ − det(A) = 0
λ³ − 6λ² + (3 + 3 + 3)λ − 4 = 0
λ³ − 6λ² + 9λ − 4 = 0
Solving we get λ = 4, 1, 1 Since all eigenvalues are greater than zero, A is
positive definite.
4.
f(x, y) = 2x² + 2xy + 2y² − 6x
fx = 4x + 2y − 6
fy = 2x + 4y
For a stationary point, fx = 0 and fy = 0:
4x + 2y − 6 = 0 ⇒ 2x + y = 3
2x + 4y = 0 ⇒ x = −2y
Solving we get x = 2 and y = −1
5.
[x y z] [a b c; d e f; g h i] [x; y; z]
= [ax + dy + gz  bx + ey + hz  cx + fy + iz] [x; y; z]
= ax² + ey² + iz² + (b + d)xy + (c + g)xz + (f + h)yz
Comparing this with x² + y² − z² − xy + yz + xz we get
a = 1, e = 1, i = −1, b + d = −1, c + g = 1, f + h = 1
Since fxx > 0 and fyy > 0, the point (0, 0) is a minimum.
7.
A = [4 2; 2 3] ⇔ [a b; b c]
Here a = 4 > 0 and ac − b² = 12 − 4 = 8 > 0, so A is positive definite.
8.
A = [1 2; 2 1] ⇔ [a b; b c]
Here ac − b² = 1 − 4 = −3 < 0, so A is indefinite.
9.
A = [3 0 0; 0 5 0; 0 0 7]
Since A is a diagonal matrix, the eigenvalues are 3, 5 and 7.
10.
A = [1 1 0 1; 0 0 0 1; 1 1 0 0]

A A^T = [1 1 0 1; 0 0 0 1; 1 1 0 0] [1 0 1; 1 0 1; 0 0 0; 1 1 0] = [3 1 2; 1 1 0; 2 0 2]

The characteristic polynomial of A A^T is
λ³ − 6λ² + 6λ = 0
Solving we get
λ = 0, 3 ± √3
Now, σ = √λ. So,
σ1 = √(3 + √3) and σ2 = √(3 − √3)
are the non-zero singular values.
11.
A = [1 0 1 0; 0 1 0 1]

A A^T = [1 0 1 0; 0 1 0 1] [1 0; 0 1; 1 0; 0 1] = [2 0; 0 2]

The characteristic polynomial of A A^T is
λ² − 4λ + 4 = 0
Solving we get
λ = 2, 2
σ = √λ = √2

Σ = [√2 0 0 0; 0 √2 0 0]

Now for λ = 2,
(A A^T − 2I)X = 0
([2 0; 0 2] − 2[1 0; 0 1])[x; y] = 0
[0 0; 0 0][x; y] = 0
so any vector satisfies the equation:
for x = 1, y = 0: v1 = [1; 0]
for x = 0, y = 1: v2 = [0; 1]

Q1 = [1 0; 0 1]

y1 = A^T x1 / σ1 = (1/√2) [1 0; 0 1; 1 0; 0 1] [1; 0] = [1/√2; 0; 1/√2; 0]

Similarly,
y2 = (1/√2) [1 0; 0 1; 1 0; 0 1] [0; 1] = [0; 1/√2; 0; 1/√2]

To complete the orthonormal basis we need y3 and y4 = [a; b; c; d] with
[a b c d] [1; 0; 1; 0] = 0 and [a b c d] [0; 1; 0; 1] = 0
From the above we have a + c = 0 and b + d = 0.
Let c = k1 and d = k2; then [a; b; c; d] = k1[−1; 0; 1; 0] + k2[0; −1; 0; 1]. Normalizing,
y3 = (1/√2)[1; 0; −1; 0] for k1 = −1 and y4 = (1/√2)[0; 1; 0; −1] for k2 = −1
So,
Q2 = (1/√2) [1 0 1 0; 0 1 0 1; 1 0 −1 0; 0 1 0 −1]
Q2^T = (1/√2) [1 0 1 0; 0 1 0 1; 1 0 −1 0; 0 1 0 −1]
Answer: B
Mean vector = x̄ = (1/n) Σᵢ xᵢ = (1/3) [(0 + 1 + 2); (2 + 1 + 0)] = [1; 1]
2. (2 points) The covariance matrix C = (1/n) Σᵢ (xᵢ − x̄)(xᵢ − x̄)^T for the data points x1, x2, x3 is
0 0
A.
0 0
0.5 0.5
B.
0.5 0.5
0.67 −0.67
C.
−0.67 0.67
1 0
D.
0 1
Answer: C
C = (1/3) ([−1; 1][−1 1] + [0; 0][0 0] + [1; −1][1 −1]) = (1/3) [2 −2; −2 2] = [0.67 −0.67; −0.67 0.67]
3. (2 points) The eigenvalues of the covariance matrix C = (1/n) Σᵢ (xᵢ − x̄)(xᵢ − x̄)^T are
A. 0.5, 0.5
B. 1, 1
C. 4/3, 0
D. 0, 0
Answer: C
Characteristic equation:
det([2/3 − λ  −2/3; −2/3  2/3 − λ]) = 0
The determinant of the obtained matrix gives λ(λ − 4/3) = 0.
Eigenvalues:
The roots are λ1 = 4/3, λ2 = 0.
Eigenvectors:
For λ1 = 4/3: C − (4/3)I = [−2/3 −2/3; −2/3 −2/3]
The null space of this matrix is spanned by [−1; 1], so the corresponding unit eigenvector is u1 = (1/√2)[−1; 1].
For λ2 = 0: C − 0I = [2/3 −2/3; −2/3 2/3]
The null space of this matrix is spanned by [1; 1], so the corresponding unit eigenvector is u2 = (1/√2)[1; 1].
4. (2 points) The eigenvectors of the covariance matrix C = (1/n) Σᵢ (xᵢ − x̄)(xᵢ − x̄)^T are
(Note: The eigenvectors should be arranged in the descending order of eigenvalues from left to right in the matrix.)
1 0
A.
0 1
0.5 0.5
B.
1 1
1 1
C.
1 1
−0.7 0.7
D.
0.7 0.7
Answer: D
Refer the solution of the previous question.
5. (2 points) The data points x1 , x2 , x3 are projected onto the one dimensional space using
PCA as points z1 , z2 , z3 respectively.
A. z1 = [1; 1], z2 = [1; 1], z3 = [1; 1]
B. z1 = [0.5; 0.5], z2 = [0; 0], z3 = [−0.5; −0.5]
C. z1 = [0; 2], z2 = [1; 1], z3 = [2; 0]
D. z1 = [−1; 1], z2 = [0; 0], z3 = [1; −1]
Answer: D
With λ1 = 4/3 and u1 = (1/√2)[−1; 1], each point is projected onto u1 as zi = (u1^T xi) u1:
u1^T x1 = (1/√2)(−1 · 0 + 1 · 2) = √2 ⇒ z1 = √2 u1 = [−1; 1]
u1^T x2 = (1/√2)(−1 + 1) = 0 ⇒ z2 = [0; 0]
u1^T x3 = (1/√2)(−2 + 0) = −√2 ⇒ z3 = −√2 u1 = [1; −1]
6. (1 point) The approximation error J is given by Σᵢ ||xᵢ − zᵢ||². What could be the possible value of the reconstruction error?
A. 1
B. 2
C. 10
D. 20
Answer: B
Reconstruction error, J = n1 Σni=1 ||xi − zi ||2 = 31 [(12 + 12 ) + (12 + 12 ) + (12 + 12 )] = 2
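The whole PCA pipeline above (centering, covariance, top eigenvector, projection) can be reproduced in numpy; a minimal sketch with our own variable names:

```python
import numpy as np

# PCA on the three 2-D points: center, eigendecompose the covariance,
# project onto the top eigenvector, as in the questions above.
X = np.array([[0.0, 2.0], [1.0, 1.0], [2.0, 0.0]])
Xc = X - X.mean(axis=0)                  # centered data, mean = [1, 1]
C = Xc.T @ Xc / len(X)                   # covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)     # eigenvalues in ascending order
u1 = eigvecs[:, -1]                      # direction of largest variance
Z = np.outer(Xc @ u1, u1)                # projected (centered) points

print(np.round(Z))   # rows [-1, 1], [0, 0], [1, -1]
```

Note that the result does not depend on the sign numpy chooses for the eigenvector, since u1 appears twice in the projection.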
Course: Machine Learning - Foundations
Test Questions
Lecture Details: Week 7
Answer: B
Explanation:
x̄ = (1/3) Σᵢ xᵢ = (1/3) ([0; 1; 2] + [1; 1; 1] + [2; 1; 0]) = [1; 1; 1]
∴ Option B is correct.
2. (2 points) The covariance matrix C = (1/n) Σᵢ (xᵢ − x̄)(xᵢ − x̄)^T of the data points x1, x2, x3 is
A. [0 0 0; 0 0 0; 0 0 0]
B. [0.5 0.5 0.5; 0.5 0.5 0.5; 0.5 0.5 0.5]
C. [0.67 0 −0.67; 0 0 0; −0.67 0 0.67]
D. [1 0 0; 0 1 0; 0 0 1]
Answer: C
Explanation: To solve this question we first take the data and center it by subtracting the mean. Doing this gives us the centered dataset:
x1 − x̄ = [−1; 0; 1], x2 − x̄ = [0; 0; 0], x3 − x̄ = [1; 0; −1]
Now, we can use the formula C = (1/n) Σᵢ (xᵢ − x̄)(xᵢ − x̄)^T to get the covariance matrix C:
C = (1/3) ([1 0 −1; 0 0 0; −1 0 1] + [0 0 0; 0 0 0; 0 0 0] + [1 0 −1; 0 0 0; −1 0 1])
  = (1/3) [2 0 −2; 0 0 0; −2 0 2]
  ≈ [0.67 0 −0.67; 0 0 0; −0.67 0 0.67]
∴ Option C is correct.
3. (2 points) The eigenvalues of the covariance matrix C = (1/n) Σᵢ (xᵢ − x̄)(xᵢ − x̄)^T are
A. 2, 0, 0
B. 1, 1, 1
C. 1.34, 0, 0
D. 0.5, 0, 0.5
Answer: C
Explanation: To find the eigenvalues, we find the characteristic polynomial and then find the roots.
|C − λI| = det([0.67−λ 0 −0.67; 0 −λ 0; −0.67 0 0.67−λ])
         = (0.67 − λ)(−λ)(0.67 − λ) + (−0.67)(−0.67λ)
         = −λ(0.67 − λ)² + 0.45λ
         = λ²(1.34 − λ)
Solving for the roots, we get
|C − λI| = 0 ⇒ λ = 1.34, 0, 0
∴ Option C is correct.
4. (2 points) The eigenvectors of the covariance matrix C = (1/n) Σᵢ (xᵢ − x̄)(xᵢ − x̄)^T are
(Note: The eigenvectors should be arranged in the descending order of eigenvalues from left to right in the matrix.)
A. [1 0 1; 0 1 0; 1 0 1]
B. [0.71 0 1; 0 0.71 0; 0.71 0.71 0]
C. [−0.71 0 0.71; 0 1 0; 0.71 0 0.71]
D. [0.33 0 0; 0.33 1 0; 0.34 0 1]
Answer: C
Explanation: To solve this question, let us consider the eigenvalues one by one and find the eigenvectors.
For λ = 1.34:
E_1.34 = null(C − 1.34I) = null([−0.67 0 −0.67; 0 −1.34 0; −0.67 0 −0.67])
which gives x = −z and y = 0, i.e. the direction [−1; 0; 1], or [−0.71; 0; 0.71] after normalizing.
Now, λ = 0:
E_0 = null(C) = null([0.67 0 −0.67; 0 0 0; −0.67 0 0.67]) = null([1 0 −1; 0 0 0; 0 0 0])
    = col([0 1; 1 0; 0 1])
Keep in mind that the eigenvectors themselves can be scaled; scaling appropriately, we can see that the columns of option C match our eigenvectors.
∴ Option C is correct.
5. (2 points) The data points x1 , x2 , x3 are projected onto the one dimensional space using
PCA as points z1 , z2 , z3 respectively. (Use eigenvector with the maximum eigenvalue for
this projection.)
A. z1 = [1; 1; 1], z2 = [1; 1; 1], z3 = [1; 1; 1]
B. z1 = [0.5; 0.5; 0.5], z2 = [0; 0; 0], z3 = [−0.5; −0.5; −0.5]
C. z1 = [0; 2; 2], z2 = [1; 1; 1], z3 = [2; 2; 0]
D. z1 = [0; 1; 2.0164], z2 = [1.0082; 1; 1.0082], z3 = [2.0164; 1; 0]
Answer: D
Explanation: Projecting each centered point onto the top eigenvector u1 ≈ [−0.71; 0; 0.71] and adding back the mean, zi = x̄ + u1 u1^T (xi − x̄), gives in exact arithmetic z1 = [0; 1; 2], z2 = [1; 1; 1], z3 = [2; 1; 0]. Option D matches these values up to the rounding of the eigenvector entries to 0.71.
∴ Option D is correct.
6. (1 point) The approximation error J on the given data set is given by J = Σᵢ ||xᵢ − zᵢ||². What is the reconstruction error?
What is the reconstruction error?
A. 6.724 × 10−4
B. 5
C. 10
D. 20
Answer: A
Explanation: Let us plug the values of xi and zi into the formula and find the approximation error.
x1 − z1 = [0; 0; −0.0164], x2 − z2 = [−0.0082; 0; −0.0082], x3 − z3 = [−0.0164; 0; 0]
J = Σᵢ ||xᵢ − zᵢ||² = 2.69 × 10⁻⁴ + 1.34 × 10⁻⁴ + 2.69 × 10⁻⁴ ≈ 6.724 × 10⁻⁴
∴ Option A is correct.
Course: Machine Learning - Foundations
Week 7: Practice questions
1. (2 points) Two positive numbers have a sum of 60. What is the maximum product of
one number times the square of other number?
A. 0
B. 32000
C. 60000
D. 64000
Answer: B
Let the two numbers be x and y, with x + y = 60.
The objective function from the question will be
f(x) = x²(60 − x)
For optima, f′(x) = 0: 120x − 3x² = 0 ⇒ x = 0, 40
The product is maximum when x = 40 (so y = 20), and the maximum product is 40² × 20 = 32000.
Answer: A
Objective function (squared distance from (0, 1.5) to a point (x, x² + 1) on the curve): f(x) = (x − 0)² + (x² + 1 − 1.5)²
f(x) = x⁴ + 0.25
For minima, f′(x) = 0:
4x³ = 0
x = 0
Corresponding y = 1
3. (2 points) The volume of largest cone that can be inscribed in a circle of radius 3 m is
(correct up to two decimal places)
Answer: 33.51 m³
V = (1/3)πr²h
For a cone inscribed in a sphere of radius 3, with x the distance from the centre to the base:
r² = 9 − x²
h = 3 + x
For maxima, V′(x) = 0:
−3x² − 6x + 9 = 0
x = 1, −3
x cannot be negative, so x = 1, giving r = √8 = 2.828 and h = 4.
V = (1/3)π(8)(4) ≈ 33.51
4. (2 points) The area of largest rectangle that can be inscribed in a circle of radius 4 is
A. 16
B. 8
C. 32
D. 20
Answer: C
Let x and y be the two sides of the rectangle.
x² + y² = 64
y = √(64 − x²)
A = xy = x√(64 − x²)
For maxima, dA/dx = 0, which gives
x = 4√2 and y = 4√2
A = 32
(Question 5-8 have common data) A manufacturing plant produces two products M and
N. Maximum production capacity is 700 for total production. At least 270 units must
be produced every day. Machine hours consumption per unit is 6 hours for M and 5
hours for N. At least 1100 machine hours must be used daily. Manufacturing cost is Rs
25 for M and Rs 35 for N.
Let, x1 = No of units of M produced per day
and x2 = No of units of N produced per day
Answer: C
We need to minimize the cost function.
B. x1 + x2 ≥ 700
C. x1 + x2 ≥ 270
D. x1 + x2 = 700
Answer: A
Maximum production capacity is 700.
Answer: D
Minimum production capacity is 270.
Answer: B
At least 1100 hours must be used.
(Questions 9-11 have common data)
A factory manufactures two products A and B. To manufacture one unit of A, 3 machine
hours and 5 labour hours are required. To manufacture product B, 2 machine hours and
4 labour hours are required. In a month, 270 machine hours and 280 labour hours are
available. Profit per unit for A is Rs. 55 and for B is Rs. 15.
Let x1 =Number of units of A produced per month
and x2 =Number of units of B produced per month
Answer: A
We need to maximise profit.
Answer: B
270 hours available.
Answer: B
280 labour hours is available.
12. (2 points) The value of a function at a point x = 5 is 3.2 and the value of the function’s
derivative at point x = 5 is 1.2. What will be the approximate value of the function at
a point x = 5.2(First order approximation)?
Answer: 3.44
According to Taylor's series,
f(x + h) = f(x) + h f′(x) + (h²/2) f″(x) + ......
Here x = 5, h = 0.2, so to first order
f(x + h) ≈ 3.2 + 0.2 × 1.2 = 3.44
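The first-order approximation is a one-liner to check (a sketch; the variable names are ours):

```python
# First-order Taylor approximation f(x + h) ≈ f(x) + h f'(x), using the
# values from the question: f(5) = 3.2, f'(5) = 1.2, h = 0.2.
f_x, df_x, h = 3.2, 1.2, 0.2
approx = f_x + h * df_x

print(round(approx, 2))   # 3.44
```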
14. (2 points) The area of the largest rectangle that can be inscribed in a circle of radius 1
is
A. 1
B. 1.5
C. 6
D. 2
Answer: D
Follow the same steps as Question 4, with radius 1.
Course: Machine Learning - Foundations
8 Test questions
Week 7:
1. (2 points) Two positive numbers have a sum of 60. What is the minimum product of
one number times the square of other number?
A. 0
B. 900
C. 60
D. 240
Answer: A
Let the two numbers be x and y
x+y=60
The objective function from the question is
f(x) = x^2(60 − x)
For optima, f'(x) = 0: 120x − 3x^2 = 0
x = 0, 40
The product is minimum when x = 0.
Answer: A,C
Objective function f(x) = (x − 0)^2 + (x^2 + 1 − 2)^2
f(x) = x^4 − x^2 + 1
For minima f'(x) = 0
4x^3 − 2x = 0
x = 0, 0.707, −0.707
Corresponding y = 1, 1.5, 1.5
3. (2 points) The volume of the largest cone that can be inscribed in a circle of radius 6 m
is (correct up to two decimal places)
Answer: 268.19 m3
V = (1/3)πr^2 h
r = √(36 − x^2)
h = 6 + x
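As a numeric cross-check of the formulas above (a Python sketch; here x denotes the offset of the cone's base from the centre, an interpretation we infer from r and h), the maximiser is x = 2, giving V = 256π/3:

```python
import math

# Cone volume as a function of the base offset x from the centre:
# V(x) = (1/3) * pi * (36 - x^2) * (6 + x)
def volume(x):
    return math.pi * (36.0 - x * x) * (6.0 + x) / 3.0

xs = [i / 10000 for i in range(60001)]   # grid over [0, 6]
best_x = max(xs, key=volume)
best_V = volume(best_x)
# best_x = 2, best_V = 256*pi/3, roughly 268
```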
Answer: D
1000 machine hours must be used daily.
(Questions 9-11 have common data)
A factory manufactures two products A and B. To manufacture one unit of A, 1 machine
hours and 2 labour hours are required. To manufacture product B, 2 machine hours and
1 labour hours are required. In a month, 200 machine hours and 140 labour hours are
available. Profit per unit for A is Rs. 45 and for B is Rs. 35.
Let x1 =Number of units of A produced per month
and x2 =Number of units of B produced per month
Answer: A
We need to maximize profit.
Answer: B
Total machine hours available=200.
Answer: B
Total labour hour available is 140.
Answer: A,C,D
At critical points, the gradient of a function is 0.
As we move towards a minimum, the magnitude of the gradient decreases.
12. The value of a function at point 10 is 100. The values of the function’s first and second
order derivatives at this point are 20 and 2 respectively. What will be the function’s
approximate value correct up to two decimal places at the point 10.5 (Use second order
approximation)?
Answer: 110.25
According to Taylor's series,
f(x + h) = f(x) + h f'(x) + (h^2/2) f''(x) + ......
Here x = 10, h = 0.5, so
f(x + h) ≈ 100 + 0.5 × 20 + (0.5^2/2) × 2 = 110.25
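The arithmetic can be written out directly (a minimal Python sketch using the given values):

```python
# Second-order Taylor approximation:
# f(x + h) ~ f(x) + h*f'(x) + (h^2/2)*f''(x)
f_val, df_val, d2f_val, h = 100.0, 20.0, 2.0, 0.5
approx = f_val + h * df_val + (h ** 2 / 2.0) * d2f_val
# approx == 110.25
```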
13. (2 points) For the function f (x) = x sin(x) − 1, with an initial guess of x0 = 2.5, and
step size of 0.1, as per gradient descent algorithm, what will be the value of the function
after 4 iterations? (Correct up to 3 decimal places)
14. (2 points) The value of f (x1 , x2 ) = 4x21 − 4x1 x2 + 2x22 with an initial guess of (2, 3)
after two iterations of gradient descent algorithm will be ............... Take the step size
1
η = t+1 , where t= 0,1,2....
Answer: 130
x_{t+1} = x_t − η_t ∇f(x_t)
∇f = (8x1 − 4x2, −4x1 + 4x2)
Iteration 1 (η_0 = 1): x^(1) = (−2, −1)
Iteration 2 (η_1 = 1/2): x^(2) = (4, −3)
f(4, −3) = 130
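The two iterations can be reproduced in a few lines (a Python sketch; function names are ours):

```python
# Two gradient-descent iterations on f(x1, x2) = 4x1^2 - 4x1x2 + 2x2^2
# starting from (2, 3) with step size eta_t = 1/(t+1), t = 0, 1.
def f(x1, x2):
    return 4 * x1 ** 2 - 4 * x1 * x2 + 2 * x2 ** 2

def grad(x1, x2):
    return (8 * x1 - 4 * x2, -4 * x1 + 4 * x2)

x = (2.0, 3.0)
for t in range(2):
    eta = 1.0 / (t + 1)
    g = grad(*x)
    x = (x[0] - eta * g[0], x[1] - eta * g[1])
# x == (4.0, -3.0), f(*x) == 130.0
```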
15. (2 points) The point of minimum for the function f (x1 , x2 ) = x21 − x1 x2 + 2x22 with an
initial guess of (3, 2) with step size=0.5 using gradient descent algorithm after second
iteration will be .............. (correct up to 3 decimal places)
16. (2 points) Suppose we have n data points randomly distributed in space, given by D = {x_1, x_2, ....., x_n}. A function f(p) is defined to calculate the sum of squared distances of the data points from a fixed point, say p. Let f(p) = Σ_{i=1}^{n} (p − x_i)^2. What is the value of p so that f(p) is minimum?
A. x_1 + x_2 + ..... + x_n
B. x_1 − x_2 + x_3 − x_4 + ....
C. (x_1 + x_2 + .... + x_n)/n
D. (x_1 − x_2 + x_3 − x_4 + ....)/n
Answer: C
f(p) = Σ_{i=1}^{n} (p − x_i)^2 = (p − x_1)^2 + ...... + (p − x_n)^2
f'(p) = 2(p − x_1) + ..... + 2(p − x_n)
For minima f'(p) = 0
(p − x_1) + (p − x_2) + ..... + (p − x_n) = 0
np − (x_1 + x_2 + .... + x_n) = 0
p = (x_1 + x_2 + .... + x_n)/n
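The result (the mean minimises the sum of squared distances) can be checked on a small sample (a Python sketch; the data values are illustrative, not from the question):

```python
# The mean minimises f(p) = sum((p - x_i)^2): check on a small sample.
data = [2.0, 3.0, 7.0, 8.0]

def f(p):
    return sum((p - x) ** 2 for x in data)

p_star = sum(data) / len(data)   # the mean, 5.0
# f at the mean is no larger than at nearby points
```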
Course: Machine Learning - Foundations
Practice Questions - Solution
Lecture Details: Week 8
1. (1 point) Given: if x_1, x_2, (x_1 + x_2)/2 ∈ S then (3/4)x_1 + (1/4)x_2 ∈ S. Is this a true statement?
A. Yes
B. No
Answer: A
If x_1, x_2, (x_1 + x_2)/2 ∈ S, then the midpoint of x_1 and (x_1 + x_2)/2, namely (x_1 + (x_1 + x_2)/2)/2 = (3/4)x_1 + (1/4)x_2, is also in S.
It is clear that if the set is midpoint convex, then such combinations lie in the set.
By definition, for a convex set S: if x_1, x_2 ∈ S then λx_1 + (1 − λ)x_2 ∈ S for λ ∈ [0, 1].
2. (1 point) Which of the following is a convex function?
A. f (x) = ax + b over R where a, b ∈ R
B. f (x) = eax over R where a ∈ R
C. f (x) = x2 over R
D. f (x) = x3 over R
Answer: A, B, C
1. f(x) = ax + b over R where a, b ∈ R
∂f/∂x = a, ∂^2 f/∂x^2 = 0
The second order derivative is non-negative. Hence, the function is a convex function.
2. f(x) = e^{ax} over R where a ∈ R
∂f/∂x = a e^{ax}, ∂^2 f/∂x^2 = a^2 e^{ax} ≥ 0 ∀ x ∈ R
The second order derivative is non-negative (positive curvature). Hence, the function is a convex function.
3. f(x) = x^2 over R
∂f/∂x = 2x, ∂^2 f/∂x^2 = 2 ≥ 0
The second order derivative is non-negative (positive curvature). Hence, the function is a convex function.
4. f(x) = x^3 over R
∂f/∂x = 3x^2, ∂^2 f/∂x^2 = 6x
The second order derivative depends on x and can be negative or positive. Hence, the function is non-convex over R.
Answer: A
f : R^2 → R, f(x, y) = ax^4 + 8y
f_x = ∂f/∂x = 4ax^3, f_xx = ∂^2 f/∂x^2 = 12ax^2
f_y = ∂f/∂y = 8, f_yy = ∂^2 f/∂y^2 = 0
f_xy = ∂^2 f/∂x∂y = 0
The Hessian matrix, H = [f_xx f_xy; f_xy f_yy] = [12ax^2 0; 0 0]
The determinant of the Hessian matrix, D = f_xx f_yy − f_xy^2 = 0
For the function to be a convex function, the second order partial derivative with respect to x should be positive, in other words f_xx > 0.
For this to be true, a > 0.
4. (1 point) Which of the following Hessian matrices corresponds to a convex function?
A. [−2 2; 2 −2]
B. [2 1; 1 2]
C. [−2 2; 2 0]
D. [−2 2; 2 2]
Answer: B
The Hessian matrix is denoted as H = [f_xx f_xy; f_xy f_yy].
A function f(x, y) is convex when f_xx > 0 and D = f_xx f_yy − f_xy^2 ≥ 0. Only option B satisfies both conditions (f_xx = 2 > 0, D = 4 − 1 = 3 > 0).
Answer: A, B
The Hessian matrix is denoted as H = [f_xx f_xy; f_xy f_yy].
The Hessian is positive semi-definite when f_xx ≥ 0 and D = f_xx f_yy − f_xy^2 ≥ 0.
The Hessian is positive definite when f_xx > 0 and D = f_xx f_yy − f_xy^2 > 0.
In both cases, the function fulfils the criteria of convexity.
Answer: A, B
Please refer to the previous solution.
Answer: A
Please refer to the lecture videos.
Answer: A, D
Since the function is convex, its second order partial derivative will be non-negative. Hence, both (A) and (D) are true.
9. (1 point) What is the relationship between eigenvalues of the hessian matrix of twice
differentiable convex function?
A. All eigenvalues are non-negative
B. Eigenvalues are both positive and negative
C. All eigenvalues are non-positive
D. There is no relationship
Answer: A
For a positive definite matrix, the eigenvalues are always positive. For a positive semi-definite matrix, the eigenvalues are always non-negative.
For a twice differentiable convex function, the Hessian matrix is positive semi-definite (or definite), so its eigenvalues are non-negative.
Therefore, in the case of a convex function, option (A) is the true answer.
10. (1 point) A batch of cookies requires 4 cups of flour, and a cake requires 7 cups of flour. What would be the constraint limiting the number of cookies (a) and cakes (b) that can be made with 50 cups of flour?
A. 4a + 7b ≤ 50
B. 7a + 4b ≤ 50
C. 11(a + b) ≤ 50
D. 4a.7b ≤ 50
Answer: A
Each batch of cookies requires 4 cups of flour ⇒ 4a.
Each cake requires 7 cups of flour ⇒ 7b.
At most 50 cups of flour are available. Hence the constraint is 4a + 7b ≤ 50.
D. (1/√2, 0, 1/√2)
Answer: C
Given
f(x, y, z) = x + z
g(x, y, z) = x^2 + y^2 + z^2 = 1
To get the critical points we solve ∇f(x, y, z) = λ∇g(x, y, z), which gives the following equations:
(i) differentiating w.r.t. x: 1 = 2λx
(ii) differentiating w.r.t. y: 0 = 2λy
(iii) differentiating w.r.t. z: 1 = 2λz
From these we get x = z = 1/(2λ) and y = 0.
Substituting into x^2 + y^2 + z^2 = 1, the critical points are
(−1/√2, 0, −1/√2) and (1/√2, 0, 1/√2).
Here f(−1/√2, 0, −1/√2) ≤ f(1/√2, 0, 1/√2), and since the constraint set is a sphere:
(−1/√2, 0, −1/√2) is the constrained minimum point,
and (1/√2, 0, 1/√2) is the constrained maximum point.
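The constrained extremes of f = x + z on the unit sphere can be checked numerically (a Python sketch parametrising the sphere by spherical angles; names are ours). The extremes should be ±√2:

```python
import math

# Grid over the unit sphere to locate the extremes of f = x + z;
# they should be +/- sqrt(2), at (+/-1/sqrt(2), 0, -/+... ) as derived above.
best_min, best_max = float("inf"), float("-inf")
n = 360
for i in range(n + 1):
    theta = math.pi * i / n              # polar angle
    for j in range(2 * n):
        phi = math.pi * j / n            # azimuth
        x = math.sin(theta) * math.cos(phi)
        z = math.cos(theta)
        v = x + z
        best_min = min(best_min, v)
        best_max = max(best_max, v)
```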
Answer: D
Refer to the previous solution.
13. (1 point) Find the points on the surface y 2 = 1 + xz that are closest to the origin.
A. (0, −1, 0)
B. (1, 1, 1)
C. (0, 0, 0)
D. (0, 2, 0)
E. (1, 2, 0)
Answer: A
To get the closest point on the surface to a given point, we can create a distance function and minimise it.
Since the coordinates of the origin are (0, 0, 0), the distance is
d = √((x − 0)^2 + (y − 0)^2 + (z − 0)^2)
Hence (squaring), the objective function is f(x, y, z) = x^2 + y^2 + z^2
with the constraint g(x, y, z) = y^2 − 1 − xz = 0.
To get the critical points we solve ∇f(x, y, z) = λ∇g(x, y, z):
(i) differentiating w.r.t. x: 2x = −λz
(ii) differentiating w.r.t. y: 2y = 2λy → λ = 1 (for y ≠ 0)
(iii) differentiating w.r.t. z: 2z = −λx
With λ = 1, equations (i) and (iii) give x = z = 0.
Putting x = z = 0 in y^2 − 1 − xz = 0 we get y = 1, −1.
So the points closest to the origin are (0, 1, 0) and (0, −1, 0).
14. (1 point) The minimum value of the function f(x, y) = xy^2 on the circle x^2 + y^2 = 1 is (correct up to two decimal places) .
Answer: −0.38 (the minimum value is −2/(3√3) ≈ −0.38)
15. (1 point) (multiple select) The minimum value of the function f(x, y) = xy^2 on the circle x^2 + y^2 = 1 occurs at the below points:
A. (√3/3, √6/3)
B. (−√3/3, √6/3)
C. (√3/3, −√6/3)
D. (−√3/3, −√6/3)
Answer: B, D
On the circle, y^2 = 1 − x^2, so f = x(1 − x^2). Setting f'(x) = 1 − 3x^2 = 0 gives x = ±√3/3, and the minimum value −2/(3√3) is attained at x = −√3/3 with y = ±√6/3, i.e. at options B and D.
Course: Machine Learning - Foundations
Graded Questions - Solution
Lecture Details: Week 8
1. (1 point) Points (0, 0), (5, 0), (5, 5), (0, 5) form a convex hull. Which of the following points are part of this convex hull?
A. (1,1)
B. (1,-1)
C. (-1,1)
D. (-1,-1)
Answer: A
Let (x_1, y_1) = (0, 0), (x_2, y_2) = (5, 0), (x_3, y_3) = (5, 5), (x_4, y_4) = (0, 5). Any point of the convex hull of these points belongs to the set
S = {(x, y) | x = λ_1 x_1 + λ_2 x_2 + λ_3 x_3 + λ_4 x_4, y = λ_1 y_1 + λ_2 y_2 + λ_3 y_3 + λ_4 y_4, λ_i ∈ [0, 1], λ_1 + λ_2 + λ_3 + λ_4 = 1}
The convex hull of these 4 points forms a square, and any point inside or on the square is part of the convex hull.
The point (1,1) lies inside this square, while (1,-1), (-1,1) and (-1,-1) lie outside it.
2. (1 point) Given S is a convex set and the points x1 , x2 , x3 , x4 ∈ S. Which of the following
points must be the part of convex set S:
A. 0.1x1 + 0.2x2 + 0.3x3 + 0.4x4
B. −0.1x1 + −0.2x2 + 0.6x3 + 0.7x4
C. 0.1x1 + 0.12 x2 + 0.13 x3 + 0.13 x4
D. 0.25x1 + 0.25x2 + 0.25x3 + 0.25x4
Answer: A, D
The convex combination of the points will always be the part of the convex set and is
known as convex hull.
For the convex combination, the coefficients should be non-negative and should sum to
1. Therefore , options (A) and (D) are correct.
D. None of these
Answer: A
f(x) = x^2 + y^2 = v^T A v = [x y] [1 0; 0 1] [x; y]
where A = [a b; b c] = [1 0; 0 1], v = [x; y]
Here, a > 0 and ac − b^2 = 1·1 − 0^2 = 1 > 0. This shows the matrix A is a positive definite matrix. Therefore, the function is a convex function.
4. (1 point) What is the boundary value of x for the function (x − 3)^3 + (y + 1)^2 to remain convex?
A. x ≥ 1
B. x ≥ 2
C. x ≥ 3
D. None of these
Answer: C
f(x, y) = (x − 3)^3 + (y + 1)^2
First order partial derivatives: f_x = 3(x − 3)^2, f_y = 2(y + 1)
Second order partial derivatives: f_xx = 6(x − 3), f_xy = 0, f_yy = 2
The second order partial derivative with respect to x, f_xx = 6(x − 3), changes sign at the point x = 3:
when x ≥ 3, f_xx ≥ 0; otherwise f_xx < 0.
The determinant of the Hessian matrix is
D = det [f_xx f_xy; f_xy f_yy] = det [6(x − 3) 0; 0 2] = 12(x − 3)
D is non-negative when x ≥ 3, and the function remains convex.
Answer: A
η = 0.02, d = [1; 2], ∇f(x, y)|_(3,2) = [2x; 2y]|_(3,2) = [6; 4]
The linear approximation of the function f at the point (x + ηd) is
f(x, y) + η d^T ∇f(x, y) = (3^2 + 1^2) + 0.02 × ([1 2]·[6; 4]) = 10 + 0.02 × 14 = 10.28
Answer: D
Refer to the solution of the previous question
Answer: E
Refer to the solution of the previous question
10. (1 point) A box (cuboid shaped) is to be made out of cardboard with a total area of 24 cm^2. The maximum volume occupied by the box will be .
Answer: 8 cm^3 (a cube of side 2 cm: surface area 6 × 2^2 = 24 cm^2, volume 2^3 = 8 cm^3)
11. (1 point) You are planning to setup a manufacturing business where the revenue (r) is
a function of labour units (l), material units (m) and fixed cost (c), r = l.m2 + 2c. You
have an annual budget (b) of 1004 million rupees, b = 2l + 16m + c to run the business.
What would be maximum revenue that can be generated from the business in million
rupees under the optimum combination of labour units (l), material units (m) and fixed
cost(c).
A. 1944
B. 2036
C. 2072
D. 2080
Answer: A
Given
Revenue, r = f(l, m, c) = l·m^2 + 2c; budget constraint, g(l, m, c) = 2l + 16m + c = 1004
∇f(l, m, c) = [m^2; 2lm; 2], ∇g(l, m, c) = [2; 16; 1]
We find the values of l, m, c, λ that simultaneously satisfy ∇f(l, m, c) = λ∇g(l, m, c) and g(l, m, c) = 2l + 16m + c = 1004 to get the extreme points.
Solving ∇f = λ∇g: [m^2; 2lm; 2] = λ[2; 16; 1] ⟹ λ = 2, m = ±2, l = ±8
Solving g(l, m, c) = 2l + 16m + c = 1004 ⟹ c = 1004 − 2l − 16m
Therefore, the extreme point coordinates will be:
(l_1, m_1, c_1) = (8, 2, 956), Revenue r = f(8, 2, 956) = 8 × 2^2 + 2 × 956 = 1944
(l_2, m_2, c_2) = (−8, −2, 1052), Revenue r = f(−8, −2, 1052) = −8 × (−2)^2 + 2 × 1052 = 2072
Since the configuration cannot be negative (l ≥ 0, m ≥ 0, c ≥ 0), the maximum revenue among the feasible stationary points is achieved under the configuration l = 8, m = 2, c = 956. The maximum revenue is 1944.
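The two stationary points can be evaluated directly, keeping only the feasible (non-negative) one (a Python sketch; names are ours):

```python
# Evaluate the revenue at the two Lagrange stationary points and
# keep the feasible one (l, m, c all non-negative).
def revenue(l, m, c):
    return l * m * m + 2 * c

candidates = [(8, 2, 956), (-8, -2, 1052)]
feasible = [p for p in candidates if min(p) >= 0]
best = max(revenue(*p) for p in feasible)
# revenue(8, 2, 956) == 1944; revenue(-8, -2, 1052) == 2072; best == 1944
```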
12. (1 point) The distance of the point on the sphere x^2 + y^2 + z^2 = 3 closest to the point (2,2,2) is .
Answer: √3 ≈ 1.73 (the distance from the centre to (2, 2, 2) is 2√3; subtracting the radius √3 gives √3)
13. (1 point) The distance of the point on the sphere x^2 + y^2 + z^2 = 3 farthest from the point (2,2,2) is .
Answer: 3√3 ≈ 5.20 (2√3 + √3 = 3√3)
Answer: A
If the first derivative of f(x) at a point x is positive (negative), then f(x) is increasing (decreasing) at that point.
f'(x) = −4x; at x = −3 we get f'(−3) = 12 > 0. Hence increasing.
Answer: D
Global minimum occurs at a point x when ∇(f (x)) = 0.
−1 + 4x = 0
x = 0.25
3. (1 point) (Multiple select) Consider two convex functions f(x) = x^2 and g(x) = e^{3x^2}. Choose the correct convex function(s) that is a resultant of a combination of f(x) and g(x).
A. h(x) = x^2 + e^{3x^2}
B. h(x) = x^2 e^{−3x^2}
C. h(x) = x^2 e^{3x^2}
D. h(x) = x^2 − e^{3x^2}
Answer: A, C
h(x) = f(x) + g(x) = x^2 + e^{3x^2} is a convex function (a sum of convex functions is convex).
h(x) = f(x) × g(x) = x^2 e^{3x^2} is a convex function.
4. (1 point) Consider two functions g(x) = 2x − 3 and f (x) = x − 10ln(5x). Select the
true statement.
A. h = f og is a convex function.
B. h = f og is a concave function.
Answer: A
Since f (x) is a convex function and g(x) is a linear function, h = f og is also a convex
function.
(Common data for Q5-Q7) Given below is a set of data points and their labels.
X y
[1,0] 1.5
[2,1] 2.9
[3,2] 3.4
[4,2] 3.8
[5,3] 5.3
To perform linear regression on this data set, the sum of squares error with respect to
w is to be minimized.
5. (1 point) Which of the following is the optimal w* computed using the analytical method?
A. [1.5; 0]
B. [1.255; −0.317]
C. [1.512; 0.004]
D. [−1.5; 0]
Answer: B
Optimal w* = (X^T X)^{−1}(X^T y)
X = [1 0; 2 1; 3 2; 4 2; 5 3]
X^T X = [55 31; 31 18]
X^T y = [59.2; 33.2]
w* = [1.255; −0.317]
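Since X^T X is only 2×2, the normal equations can be solved by hand (a Python sketch; variable names are ours):

```python
# Solve the 2x2 normal equations (X^T X) w = X^T y by hand.
X = [[1, 0], [2, 1], [3, 2], [4, 2], [5, 3]]
y = [1.5, 2.9, 3.4, 3.8, 5.3]

s11 = sum(r[0] * r[0] for r in X)               # 55
s12 = sum(r[0] * r[1] for r in X)               # 31
s22 = sum(r[1] * r[1] for r in X)               # 18
t1 = sum(r[0] * yi for r, yi in zip(X, y))      # 59.2
t2 = sum(r[1] * yi for r, yi in zip(X, y))      # 33.2

det = s11 * s22 - s12 * s12                     # 29
w1 = (s22 * t1 - s12 * t2) / det
w2 = (s11 * t2 - s12 * t1) / det
# (w1, w2) is approximately (1.255, -0.317)
```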
6. (1 point) Let w^1 be initialized to [0.1; 0.1]. Gradient descent optimization is used to find the value of the optimal w*. For the first iteration t = 1, which of the following is the gradient computed with respect to w^1?
A. [50.6; 28.3]
B. [8.6; 4.9]
C. [−50.6; −28.3]
D. [−8.6; −4.9]
Answer: C
Gradient ∇f(w) = (X^T X)w − X^T y
∇f(w^1) = [−50.6; −28.3]
7. (1 point) Using the gradient descent update equation with a learning rate ηt = 0.1, compute the value of w at t = 2.
A. [5.16; −2.93]
B. [5.16; 2.93]
C. [5.5; 3.5]
D. [5.5; −3.5]
Answer: B
Given w^1 = [0.1; 0.1]
Update equation: w^{t+1} = w^t − ηt ∇f(w^t)
w^2 = [0.1; 0.1] − 0.1 × [−50.6; −28.3]
w^2 = [5.16; 2.93]
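The gradient and the update step for these two questions can be reproduced together (a Python sketch using the X^T X and X^T y computed earlier):

```python
# Gradient and one descent step for the regression problem:
# grad = (X^T X) w - X^T y, then w2 = w1 - eta * grad.
XtX = [[55.0, 31.0], [31.0, 18.0]]
Xty = [59.2, 33.2]
w1 = [0.1, 0.1]
eta = 0.1

grad = [XtX[0][0] * w1[0] + XtX[0][1] * w1[1] - Xty[0],
        XtX[1][0] * w1[0] + XtX[1][1] * w1[1] - Xty[1]]
w2 = [w1[0] - eta * grad[0], w1[1] - eta * grad[1]]
# grad is approximately [-50.6, -28.3]; w2 is approximately [5.16, 2.93]
```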
(Common data for Q8-Q10) A rectangle has a perimeter of 20 m. Using the Lagrange
multiplier method, find the height and width of the rectangle which results in maximum
area.
13. (1 point) Choose the equivalent Lagrange function for the problem.
A. L(x, y, λ) = x + y − λ(15x + 20y − 1000)
B. L(x, y, λ) = x + y + λ(15x + 20y + 1000)
C. L(x, y, λ) = xy + λ(15x + 20y − 1000)
D. L(x, y, λ) = xy − λ(15x + 20y + 1000)
Answer: C
Lagrange function = f(x) + λh(x) = xy + λ(15x + 20y − 1000)
14. (2 points) Minimize the function f = x21 + 60x1 + x22 subject to the constraints g1 =
x1 − 80 ≥ 0 and g2 = x1 + x2 − 120 ≥ 0 using KKT conditions. Which of the following
is the optimal solution set?
A. [x∗1 , x∗2 ] = [80, 40]
B. [x∗1 , x∗2 ] = [−80, −40]
C. [x∗1 , x∗2 ] = [45, 75]
D. [x∗1 , x∗2 ] = [−45, −75]
Answer: A
The KKT conditions for the problem with multiple variables are given as follows:
(i) Stationarity condition
∂f/∂x_i + Σ_{j=1}^{2} u_j ∂g_j/∂x_i = 0, i = 1, 2
Therefore, we get
2x1 + 60 + u1 + u2 = 0 (1)
2x2 + u2 = 0 (2)
(ii) Complementary slackness condition
ui gi = 0, i = 1, 2
Therefore, we get
u1 (x1 − 80) = 0 (3)
u2 (x1 + x2 − 120) = 0 (4)
(iii) Primal feasibility condition
gi ≥ 0
Therefore, we get
x1 − 80 ≥ 0 (5)
x1 + x2 − 120 ≥ 0 (6)
(iv) Dual feasibility condition: since the multipliers enter the stationarity condition with a plus sign (for this minimization with g_i ≥ 0), we require
u_i ≤ 0, i = 1, 2
Therefore, we get
u_1 ≤ 0 (7)
u_2 ≤ 0 (8)
From (3), either u1 = 0 or x1 = 80
Case (i): Let u1 = 0
Substitute in (1);
2x1 + 60 + u2 = 0
u2
x1 = − − 30 (9)
2
Substitute in (2);
2x2 + u2 = 0
u2
x2 = − (10)
2
Substitute in (9) and (10) in (4);
u2 u2
u2 (− − 30 − − 120) = 0
2 2
u2 (−u2 − 150) = 0
From this, u2 = 0 or u2 = −150
Using u2 = 0 in (9) and (10) we get x1 = −30 and x2 = 0, respectively. This violates (5) and (6).
Using u2 = −150 in (9) and (10) we get, x1 = 45 and x2 = 75, respectively. This violates
(5).
Case (ii): Let x1 = 80, substitute in (1)
2(80) + 60 + u1 + u2 = 0
u2 = −u1 − 220 (11)
Substitute (11) in (2):
u1 = 2x2 − 220 (12)
Using (12) in (11) we get u2 = −2x2 . Using x1 = 80 and u2 = −2x2 in (4) we get
−2x2 (80 + x2 − 120) = 0
x2 (x2 − 40) = 0 (13)
From this, either x2 = 0 or x2 = 40.
Case (ii-a): Let x2 = 0, then from (12) we get u1 = −220.
Substituting x1 = 80, x2 = 0 and u1 = −220 in (6) will violate the condition.
Case (ii-b): Let x2 = 40
Substituting x1 = 80, x2 = 40 in (5) and (6), the conditions are satisfied.
Substituting x1 = 80, x2 = 40 in (11) and (12), we get u1 = −140 and u2 = −80. These
values satisfy conditions (7) and (8).
Thus the optimal solution is [x∗1 , x∗2 ] = [80, 40] because it satisfies the KKT conditions.
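The KKT solution can be sanity-checked by brute force on an integer feasible grid (a Python sketch; the grid bounds are our own choice):

```python
# Brute-force check of the KKT solution on an integer feasible grid:
# minimise f = x1^2 + 60*x1 + x2^2 s.t. x1 >= 80 and x1 + x2 >= 120.
def f(x1, x2):
    return x1 ** 2 + 60 * x1 + x2 ** 2

best_val, best_pt = float("inf"), None
for x1 in range(80, 201):
    for x2 in range(-200, 201):
        if x1 + x2 >= 120 and f(x1, x2) < best_val:
            best_val, best_pt = f(x1, x2), (x1, x2)
# best_pt == (80, 40), best_val == 12800
```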
Course: Machine Learning - Foundations
Test Questions
Lecture Details: Week 10
1. (1 point) For a given point x = 2, is the function f (x) = xe−3x increasing or decreasing?
A. increasing
B. decreasing
Answer: B
If the first derivative of f(x) at a point x is positive (negative), then f(x) is increasing (decreasing) at that point.
f'(x) = e^{−3x}(1 − 3x); at x = 2 we get f'(2) = −5e^{−6} ≈ −0.012. Hence decreasing.
2. (1 point) Find the value of a function f(x) = x + 3x^2 at its global minimum point.
Answer: -0.0833
range: -0.09,-0.08
Answer: B
In order to find the interval over which f(x) is convex, let us find where f''(x) > 0.
f'(x) = (1 + 2x^2) e^{x^2}
f''(x) = e^{x^2}(4x^3 + 6x)
Therefore f''(x) > 0 implies e^{x^2}(4x^3 + 6x) > 0.
To satisfy this inequality, we want an interval of x for which e^{x^2} > 0 and (4x^3 + 6x) > 0.
Because e raised to any power is positive, the first condition is satisfied for any value of x.
The second condition can be written as 2x(2x^2 + 3) > 0.
(2x^2 + 3) is greater than 0 for all x, so 2x(2x^2 + 3) is positive only when x > 0.
Thus, in interval notation, the largest interval of x for which f(x) is convex is (0, ∞).
4. (1 point) (Multiple select) Let the composition of two functions f (x) = sin(x) − 2x2 + 1
and g(x) = ex be h = f og. At a point x = 5, Select the true statement(s).
A. h(x) is a convex function.
B. h(x) is a concave function.
C. h(x) is a non-decreasing function.
D. h(x) is a decreasing function.
Answer: B,D
L(x, y, λ) = x + y + λ(x^2 + y^2 − 1)
∂L/∂x = 2λx + 1 = 0 ⟹ x = −1/(2λ)  (1)
∂L/∂y = 2λy + 1 = 0 ⟹ y = −1/(2λ)  (2)
∂L/∂λ = x^2 + y^2 − 1 = 0  (3)
Substituting (1) and (2) in (3), we get
1/(4λ^2) + 1/(4λ^2) = 1
λ = ±1/√2  (4)
Using (4) in (1) and (2) we get x = ∓1/√2 and y = ∓1/√2. Since we want to minimize f(x, y), we shall consider x = −1/√2 and y = −1/√2.
Minimum value of f(x, y) = −√2 ≈ −1.414.
6. (1 point) For the functions g(x) = (3x + 2)2 and f (x) = ex , select the plot that corre-
sponds to the correct composition h = f og.
A.
B.
C.
D.
Answer: A
The functions f(x), g(x) and their composition h(x) were plotted here (plots omitted in this text version).
(Common data for Q7, Q8) Linear programming deals with the problem of finding a vector x that minimizes a given linear function c^T x, where x ranges over all vectors (x ≥ 0) satisfying a given system of linear equations Ax = b. Here A is an m × n matrix, c, x ∈ R^n and b ∈ R^m.
7. (1 point) Choose the correct dual program with y as the dual variable for the above linear program from the following.
A. min_y by subject to A^T y ≥ c
B. max_y b^T y subject to A^T y ≤ c
C. max_y b^T y subject to A^T y ≥ c
D. max_y by subject to A^T y ≤ c
Answer: B
8. (1 point) From the below given statements regarding constraints and decision variables related to the primal and dual problems of the linear program, choose the correct statement.
A. Primal problem has m constraints and m decision variables whereas dual problem has n constraints and n decision variables.
B. Primal problem has m constraints and n decision variables whereas dual problem has n constraints and m decision variables.
C. Primal problem has n constraints and n decision variables whereas dual problem has m constraints and m decision variables.
D. Primal problem has n constraints and m decision variables whereas dual problem has m constraints and n decision variables.
Answer: B
9. (2 points) Let a set of data points with five samples and two features per sample be
X = [1 2; 2 3; 4 2.5; 6 4; 7.5 5] and the corresponding labels be y = [1.5; 2; 2.5; 3; 4].
Perform linear regression on this data set and choose the optimal solution for w* to minimize the sum of squares error.
A. [0.2763; 1.2039]
B. [0.0691; 0.3010]
C. [0.1382; 0.6019]
D. [0.0276; 0.1204]
Answer: C
Optimal w* = (X^T X)^{−1}(X^T y)
X^T X = [113.25 79.5; 79.5 60.25]
X^T y = [63.5; 47.25]
w* = [0.1382; 0.6019]
A, B and D are scalar multiples of C.
(Common data for Q10, Q11) Krishna runs a steel fabrication industry and produces steel products. He regularly purchases raw steel for Rs.500 per ton. His revenue is modeled by a function R(s) = 100√s, where s is the tons of steel purchased. His budget for steel purchase is Rs.150000.
10. (1 point) Using the Lagrangian function, find the amount of raw steel to be purchased to get maximum revenue.
Answer: 300
range: 298, 302
Maximum revenue: 100 × √300 = 1732.05
12. (1 point) Consider a vector ŵ = [2; 4; 3] in R^3. In R^3, there are many unit vectors. Use the Lagrange method to find the unit vector which gives the minimum dot product.
A. û = (1/(2λ)) [2; 4; 3], with λ ≥ 0
B. û = (−1/(3λ)) [2; 4; 3], with λ ≥ 0
C. û = (−1/(4λ)) [2; 4; 3], with λ ≥ 0
D. û = (−1/(2λ)) [2; 4; 3], with λ ≥ 0
Answer: D
Let the unit vector be u = [x; y; z].
The objective is to minimize u·ŵ subject to x^2 + y^2 + z^2 = 1.
f(x, y, z) = u·ŵ = 2x + 4y + 3z
subject to
0.5y1 + y2 ≥ 6
2y1 + 2y2 ≥ 14
y1 + 4y2 ≥ 13
y1 ≥ 0, y2 ≥ 0
Answer: C
[y1*, y2*] = [11, 0.5] satisfies all the constraints and gives the minimum value of v.
PRACTICE QUESTIONS
1. (1 point) The continuous random variable X represents the amount of sunshine in hours
between noon and 8 pm at a skiing resort in the high season. The probability density
function, f (x), of X is modelled by
f(x) = kx^2 for 0 ≤ x ≤ 8; 0 otherwise
Find the probability that on a particular day in the high season there is more than two
hours of sunshine between noon and 8 pm.
Answer: 0.98
Explanation: To solve this question, we will use the pdf to find the required probability. But first we must find k using the fact that the total probability is 1.
∫ f_X(x) dx = ∫_0^8 kx^2 dx = (k/3) x^3 |_0^8 = (512/3) k
Since the total probability is 1 ⟹ k = 3/512. So, now the probability that there is more than two hours of sunshine is P(X > 2), given by
P(X > 2) = ∫_2^8 f_X(x) dx = ∫_2^8 (3/512) x^2 dx = (3/512)(x^3/3) |_2^8 = (8^3 − 2^3)/512 ≈ 0.98
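The two integrals can be verified numerically (a Python sketch using a simple midpoint rule; names are ours):

```python
# Numeric check of P(X > 2) for f(x) = (3/512) x^2 on [0, 8].
def pdf(x):
    return 3.0 / 512.0 * x * x

def integrate(g, a, b, n=200000):
    # midpoint-rule quadrature
    h = (b - a) / n
    return h * sum(g(a + (i + 0.5) * h) for i in range(n))

total = integrate(pdf, 0.0, 8.0)    # should be 1
p_gt_2 = integrate(pdf, 2.0, 8.0)   # should be 504/512, about 0.98
```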
Calculate the value of b.
Answer: -6
Explanation: To solve this question, we will use the fact that the total probability is 1.
∫ f_X(x) dx = ∫_0^1 (6x + bx^2) dx = (3x^2 + (b/3)x^3) |_0^1 = 3 + b/3
Since the total probability is 1 ⟹ 3 + b/3 = 1 ⟹ b = −6.
∴ The answer is −6
Answer: 0.8
Explanation: To solve this question, we will use the pdf to find the expectation. But first we must find k using the fact that the total probability is 1.
∫ f_X(x) dx = ∫_0^1 4x^k dx = (4/(k+1)) x^{k+1} |_0^1 = 4/(k+1)
Since the total probability is 1 ⟹ 4/(k+1) = 1 ⟹ k = 3. So, now the expectation is given by
E(X) = ∫ x f_X(x) dx = ∫_0^1 x · 4x^3 dx = (4/5) x^5 |_0^1 = 0.8
4. (1 point) The time at which a passenger train will reach the station is uniformly distributed between 2:00 PM and 4:00 PM. What is the probability that the train reaches the station exactly at 04:00 PM?
Answer: 0
Explanation: Let X be the random variable denoting the time the train will arrive
such that 2 ≤ X ≤ 4. Now, X is a continuous random variable so the probability of X
being equal to a single point is 0. That is, P (X = 4) = 0.
∴ The answer is 0.
Answer: C
We can see why this is true. We know that the cdf of an exponential distribution is F_X(x) = 1 − e^{−λx} ⟹ P(X > x) = 1 − F_X(x) = e^{−λx}. Now,
P(X > x + k | X > k) = P(X > x + k ∧ X > k) / P(X > k)
                     = P(X > x + k) / P(X > k)
                     = e^{−λ(x+k)} / e^{−λk}
                     = e^{−λx}
                     = P(X > x)
∴ Option C is correct.
6. (1 point) The lifetime of a electric bulb is exponentially distributed with a mean life of
18 months. If there is a 60% chance that an electric bulb will last for at most t months,
then what is the value of t?
Answer: 16.5
Explanation: Let X be the random variable denoting the lifetime of an electric bulb in months. Since the mean life is 18 months, the rate parameter λ = 1/18, which makes the pdf f_X(x) = (1/18) e^{−x/18}.
Now, the probability that an electric bulb will last for at most t months is P(X < t), given by
P(X < t) = 1 − e^{−t/18} = 0.6 ⟹ e^{−t/18} = 0.4 ⟹ t = (−18) ln(0.4) ≈ 16.5
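The closed-form answer is a one-liner to check (a Python sketch):

```python
import math

# Solve 1 - exp(-t/18) = 0.6 for t (exponential lifetime, mean 18 months).
t = -18.0 * math.log(0.4)
# t is approximately 16.49 months
```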
E(X) = ∫_a^b x f_X(x) dx = ∫_a^b x/(b−a) dx = x^2/(2(b−a)) |_a^b = (b^2 − a^2)/(2(b−a)) = (a+b)/2
E(X^2) = ∫_a^b x^2 f_X(x) dx = ∫_a^b x^2/(b−a) dx = x^3/(3(b−a)) |_a^b = (b^3 − a^3)/(3(b−a)) = (b^2 + a^2 + ab)/3
Answer: A, B, C, D
Firstly, we know this equation is a bell curve, so option A is correct. Also, we can
see that this curve is symmetric about the mean. This is because of the fact that
fX (µ − x) = fX (x). So, Option B is correct.
When considering the standard normal, the mean is 0 and variance is 1. Since it is
still a pdf, the area under it will be 1. So, Option C is correct.
Finally, Option D is correct for the same reason as B.
Calculate P(0 < X < 1/2, 0 < Y < 1/2)
Answer: 0.0625
Explanation: To solve this question, we will use the pdf to find the required probability. But first we must find c using the fact that the total probability is 1.
∫∫ f_{X,Y}(x, y) d(x, y) = ∫_0^1 ∫_0^1 cxy dx dy = c · (x^2/2)|_0^1 · (y^2/2)|_0^1 = c/4
Since the total probability is 1 ⟹ c = 4. So, now the required probability is given by
P(0 < X < 1/2, 0 < Y < 1/2) = ∫_0^0.5 ∫_0^0.5 4xy dx dy = 4 · (x^2/2)|_0^0.5 · (y^2/2)|_0^0.5 = 1/16 = 0.0625
10. ( points) Let X be a uniformly distributed random variable with µ_X = 15 and σ_X^2 = 25/3. Calculate P(X > 17)
Answer: 0.3
σ_X^2 = (b − a)^2/12 = 25/3, and with a + b = 2µ_X = 30 this gives (2b − 30)^2 = 100 ⟹ b = 20 ⟹ a = 10
So, X ~ U(10, 20). Finally, the required probability is given by
P(X > 17) = ∫_17^20 1/(20 − 10) dx = (x/10)|_17^20 = 3/10 = 0.3
Calculate P(0 ≤ X ≤ 4)
Answer: 0.77
Explanation: To solve this question, we will use the pdf to find the required probability. But first we must find a using the fact that the total probability is 1.
∫ f_X(x) dx = ∫_0^3 ax dx + ∫_3^6 a(6 − x) dx = (a/2) x^2 |_0^3 + a(6x − x^2/2) |_3^6
= 9a/2 + a[(36 − 18) − (18 − 9/2)] = 9a
Since the total probability is 1 ⟹ a = 1/9. So, now the required probability is given by
P(0 < X < 4) = ∫_0^3 x/9 dx + ∫_3^4 (6 − x)/9 dx = (1/18) x^2 |_0^3 + (1/9)(6x − x^2/2) |_3^4
= 1/2 + (1/9)[(24 − 8) − (18 − 9/2)] = 0.5 + 2.5/9 ≈ 0.77
∴ The answer is 0.77
12. Let X be exponentially distributed with parameter λ, then which of the following is/are
true about the variance of X:
Answer: A, B, C
13. (1 point) Which of the following options is/are always true for three events A, B and C
of a random experiment?
a. If A ⊂ B then P (B|A) = 1
b. If B ⊂ A then P (B|A) = 1
c. If B ⊂ A then P (A|B) = 1
d. If P (A|B) > P (A) then P (B|A) > P (B) (Assuming the events have non-zero
probabilities)
Answer: A, C, D
Explanation: To solve this question, we will use the definition of conditional probability, P(B|A) = P(B ∩ A)/P(A).
Now, if A ⊂ B ⟹ B ∩ A = A ⟹ P(B|A) = P(A)/P(A) = 1, and similarly P(A|B) = 1 when B ⊂ A.
So, Options A and C are correct.
If however, we have B ⊂ A ⟹ B ∩ A = B ⟹ P(B|A) = P(B)/P(A), which need not equal 1.
So, Option B is incorrect.
Finally,
P(A|B) > P(A)
⟹ P(A ∩ B)/P(B) > P(A)
⟹ P(A ∩ B)/P(A) > P(B)
⟹ P(B|A) > P(B)
So, Option D is correct.
14. (1 point) Let the random experiment of selecting a number from a set of integers from
1 to 20, both inclusive. Assuming all numbers are equally likely to occur. Let A be
the event that the selected number is odd, B be the event that the selected number is
divisible by 3. Choose the correct option from the following:
A. A and B are dependent on each other.
B. A and B are independent on each other.
C. Can’t say
Answer: B
Explanation: Two events A and B are independent if the occurrence of one does not affect the other; that is, P(A|B) = P(A) or equivalently P(A ∩ B) = P(A)P(B).
Here P(A) = 10/20 = 1/2, P(B) = 6/20 = 3/10, and A ∩ B = {3, 9, 15}, so P(A ∩ B) = 3/20 = P(A)P(B).
∴ Option B is correct.
15. (1 point) Mayur rolls a fair die repeatedly until a number that is multiple of 3 is ob-
served. Let the random variable N represent the total number of times the die is rolled.
Find the probability distribution of N .
A. f_N(k) = (2/3) × (1/3)^{k−1}, k = 1, 2, 3, . . . ; 0 otherwise
B. f_N(k) = (1/3) × (2/3)^{k−1}, k = 1, 2, 3, . . . ; 0 otherwise
C. f_N(k) = (1/2) × (1/2)^{k−1}, k = 1, 2, 3, . . . ; 0 otherwise
D. f_N(k) = (1/6) × (5/6)^{k−1}, k = 1, 2, 3, . . . ; 0 otherwise
Answer: B
Explanation: Here, the random variable N, representing the number of dice rolls, follows a geometric distribution. The probability of success p is the probability of getting a multiple of 3. So, we get p = 2/6 = 1/3.
The pmf of a geometric distribution is given by f_N(k) = (1 − p)^{k−1} p.
For p = 1/3, we get f_N(k) = (1/3) × (2/3)^{k−1}.
∴ Option B is correct.
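Two quick sanity checks on this pmf (a Python sketch; truncating the infinite sums at a large K): the probabilities sum to 1 and the mean equals 1/p = 3.

```python
# Sanity checks on the geometric pmf f_N(k) = (1/3) * (2/3)^(k-1).
p = 1.0 / 3.0

def pmf(k):
    return p * (1.0 - p) ** (k - 1)

total = sum(pmf(k) for k in range(1, 2001))       # should be ~1
mean = sum(k * pmf(k) for k in range(1, 2001))    # should be ~1/p = 3
```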
16. (1 point) Shelly wrote an exam that contains 20 multiple choice questions. Each ques-
tion has 4 options out of which only one option is correct and each question carries 1
mark. She knows the correct answer of 10 questions, and for the remaining 10 questions,
she chooses the options at random. Assume that all the questions are independent. Find
the probability that she will score 18 marks in the exam.
8 2
10 1 3
a. × ×
8 4 4
8 2
10 1 3
b. × ×
2 4 4
2 8
10 1 3
c. × ×
2 4 4
8 2
10 3 1
d. × ×
8 4 4
Answer: A, B
Explanation: Here, we need to find the probability that Shelly scores 18 marks. We
know that she knows the answer of 10 questions, which guarantees her 10 marks. Let X
be the random variable denoting the number of questions she will guess correctly. This
will bring her total marks to be 10 + X.
Now, since she guesses on a total of 10 questions, each independent with a probability
of 0.25 of being correct, we can say that X follows a binomial distribution. That is,
X ∼ Bin(n = 10, p = 0.25).
She scores 18 marks exactly when X = 8, so the required probability is P(X = 8) =
C(10, 8) × (1/4)^8 × (3/4)^2. Since C(10, 8) = C(10, 2), options A and B are both correct.
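The required probability can be evaluated directly; the short sketch below (an illustrative addition, using only the standard library) computes P(X = 8) and confirms that C(10, 8) = C(10, 2):

```python
from math import comb

# X ~ Bin(10, 0.25): number of correct answers among the 10 guessed questions.
n, p = 10, 0.25
p_18_marks = comb(n, 8) * p**8 * (1 - p) ** 2  # score 18 <=> X = 8
print(comb(10, 8) == comb(10, 2))  # True: options A and B coincide
print(p_18_marks)                  # roughly 0.000386
```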
17. (1 point) Suppose the number of runs scored off a delivery is uniform in {1, 2, 3, 4, 5, 6},
independent of what happens in other deliveries. A batsman needs to bat till he hits a
four. What is the probability that he needs fewer than 6 deliveries to do so? (Answer
the question correct to two decimal points.)
Answer: 0.6
Explanation: Let X be the random variable denoting the number of deliveries such
that the X-th delivery scores 4 runs. We can see that X here is a geometric random
variable, where p, the probability of success, is 1/6. That is, X ∼ Geom(p = 1/6).
So the pmf of X becomes fX(x) = (1/6) × (5/6)^(x−1), x = 1, 2, 3, . . .
Therefore, P(X < 6) = P(X ≤ 5) = 1 − (5/6)^5 = 4651/7776 ≈ 0.60.
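Summing the pmf over x = 1, . . . , 5 gives the required probability; a small exact-arithmetic check (an illustrative addition, not part of the original solution):

```python
from fractions import Fraction

# X ~ Geom(p = 1/6): pmf f(x) = (1/6)(5/6)^(x-1).  Sum x = 1..5 and
# cross-check against the tail formula 1 - (5/6)^5.
p = Fraction(1, 6)
prob = sum((1 - p) ** (x - 1) * p for x in range(1, 6))
print(prob, float(prob))  # 4651/7776, roughly 0.598, i.e. 0.6 to two decimals
```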
18. (1 point) Let X and Y be two random variables with joint PMF fX,Y(x, y) given below.
Calculate Cov(X, Y).

         X = 1   X = 2   X = 3
Y = 1    0.25    0.25    0
Y = 2    0       0.25    0.25

Answer: 0.25
Explanation: We can find the covariance between X and Y using the formula Cov(X, Y) =
E(XY) − E(X)E(Y). Let us find the relevant expectations.
E(X) = Σx x P(X = x) = (1)(0.25) + (2)(0.5) + (3)(0.25) = 2
E(Y) = Σy y P(Y = y) = (1)(0.5) + (2)(0.5) = 1.5
E(XY) = Σx,y xy P(X = x ∩ Y = y) = (1)(1)(0.25) + (2)(1)(0.25) + (2)(2)(0.25) + (3)(2)(0.25) = 3.25
∴ Cov(X, Y) = 3.25 − (2)(1.5) = 0.25
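The same expectations can be computed mechanically from the joint pmf (an illustrative Python sketch; the table values are hard-coded from the question):

```python
# Joint pmf from the table: columns X = 1, 2, 3; rows Y = 1, 2.
pmf = {(1, 1): 0.25, (2, 1): 0.25, (3, 1): 0.0,
       (1, 2): 0.0,  (2, 2): 0.25, (3, 2): 0.25}

E_X = sum(x * q for (x, y), q in pmf.items())
E_Y = sum(y * q for (x, y), q in pmf.items())
E_XY = sum(x * y * q for (x, y), q in pmf.items())
cov = E_XY - E_X * E_Y
print(E_X, E_Y, E_XY, cov)  # 2.0 1.5 3.25 0.25
```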
19. (1 point) Let X and Y be two random variables with joint PMF fX,Y (x, y) given below.
Calculate fY |X=2 (2).
         X = 1   X = 2   X = 3
Y = 1    0.25    0.25    0
Y = 2    0.125   a1      0.125

Answer: 0.5
Explanation: To find the required probability, we first must find a1. We can use the
law of total probability here:
Σx,y P(X = x ∩ Y = y) = 1 ⟹ 0.25 + 0.25 + 0 + 0.125 + a1 + 0.125 = 1 ⟹ a1 = 0.25
Then fY|X=2(2) = P(X = 2, Y = 2)/P(X = 2) = 0.25/(0.25 + 0.25) = 0.5
20. (1 point) Two random variables X and Y are jointly distributed with PMF
fX,Y(x, y) = ax + y/4 for x, y ∈ {0, 1}; 0 otherwise
Calculate the value of a.
Answer: 0.25
Explanation: For this question we can find the value of a by using the law of total
probability:
Σx,y fX,Y(x, y) = 1
⟹ (0 + 0) + (0 + 1/4) + (a + 0) + (a + 1/4) = 1 ⟹ 2a + 1/2 = 1 ⟹ a = 0.25
Answer: 0.2
Explanation: For this question we can find the value of k by using the law of total
probability.
1 = Σx P(X = x)
⟹ 1 = P(X = 1) + P(X = 2) + P(X = 3)
⟹ 1 = (0) + (k) + (4k)
⟹ 1 = 5k ⟹ k = 0.2
x          1     2     3     4
P(X = x)   a     b     c     0.3

x          1     2     3     4
FX(x)      0.2   0.6   0.7   d

Answer: 1.7
Explanation: Let us look at the pmf first. From the law of total probability, we know
that Σ(x=1 to 4) P(X = x) = 1 ⟹ a + b + c + 0.3 = 1 ⟹ a + b + c = 0.7.
From the CDF, d = FX(4) = P(X ≤ 4) = 1.
So, a + b + c + d = 0.7 + 1 = 1.7
23. (1 point) A series of four matches is played between India and England. Let the random
variable X represent the absolute difference in the number of matches won by India and
England. Find the set of possible values that X can take. (Assume that the match does
not result in a tie.)
A. {0,2,4}
B. {0,1,2,4}
C. {0,1,2,3,4}
D. {0,1,3,4}
Answer: A
Explanation: Let Y be the random variable denoting the number of wins for India.
Since 4 matches are played, the number of wins for England is 4 − Y .
X is the absolute difference in the number of wins, we get, X = |Y − (4 − Y )| = |2Y − 4|.
Since Y represents the number of wins, Y can take on the values 0, 1, 2, 3, 4. This implies
X can take on the values 0, 2, 4.
∴ Option A is correct.
24. (1 point) There are five multiple choice questions asked in an exam. There is a 70%
chance that Shelly will solve a question correctly, independently of the rest. Let X be
the random variable that represents the number of questions she solves correctly.
Which of the following is the probability mass function of X?
A. P(X = x) = 0.00243, x = 0
              0.02835, x = 1
              0.1323,  x = 2
              0.3087,  x = 3
              0.36015, x = 4
              0.16807, x = 5

B. P(X = x) = 0.00243, x = 0
              0.02835, x = 1
              0.16807, x = 2
              0.3087,  x = 3
              0.36015, x = 4
              0.1323,  x = 5

C. P(X = x) = 0.00243, x = 0
              0.02835, x = 1
              0.3087,  x = 2
              0.1323,  x = 3
              0.36015, x = 4
              0.16807, x = 5

D. P(X = x) = 0.00243, x = 0
              0.01835, x = 1
              0.1223,  x = 2
              0.2987,  x = 3
              0.37015, x = 4
              0.19807, x = 5
Answer: A
Explanation: Let X be the random variable denoting the number of questions solved
correctly. Since there are 5 questions in total, each solved correctly independently with
probability 0.7, X follows a binomial distribution.
That is, X ∼ Bin(n = 5, p = 0.7) ⟹ P(X = x) = C(5, x) (0.7)^x (0.3)^(5−x).
Simplifying for each value of x,
P(X = x) = 0.00243, x = 0
           0.02835, x = 1
           0.1323,  x = 2
           0.3087,  x = 3
           0.36015, x = 4
           0.16807, x = 5
∴ Option A is correct.
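A quick script (an illustrative addition, not part of the original solution) reproduces the pmf values in option A from the binomial formula:

```python
from math import comb

# pmf of X ~ Bin(5, 0.7), rounded to five decimals as in option A.
n, p = 5, 0.7
pmf = {x: comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(n + 1)}
print({x: round(q, 5) for x, q in pmf.items()})
# {0: 0.00243, 1: 0.02835, 2: 0.1323, 3: 0.3087, 4: 0.36015, 5: 0.16807}
```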
Course: Machine Learning - Foundations
Week 11 Questions
GRADED QUESTIONS
1. (1 point) The number of hours Messi spends each day practicing in ground is modelled
by the continuous random variable X, with p.d.f. f (x) defined by
fX(x) = a(x − 1)(6 − x) for 1 < x < 6; 0 otherwise
Find the probability that Messi will practice between 2 and 5 hours in ground on a
randomly selected day.
Answer: 0.80
We know that ∫−∞^∞ f(x) dx = 1.
Solving the above equation with the given f(x), the value of a can be calculated, i.e. a = 6/125.
Then P(2 ≤ X ≤ 5) = ∫2^5 f(x) dx = (6/125) × (33/2) = 99/125 = 0.792 ≈ 0.8
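Since the integrand is quadratic, a single application of Simpson's rule is exact; the sketch below (an illustrative addition) recovers a and the probability with exact rational arithmetic:

```python
from fractions import Fraction

# f(x) = a (x - 1)(6 - x) on (1, 6): the shape is a quadratic, and
# Simpson's rule is exact for quadratics, so one panel gives exact integrals.
def simpson(g, lo, hi):
    mid = (lo + hi) / 2
    return (hi - lo) / 6 * (g(lo) + 4 * g(mid) + g(hi))

shape = lambda x: (x - 1) * (6 - x)
total = simpson(shape, Fraction(1), Fraction(6))  # 125/6
a = 1 / total                                     # 6/125
prob = a * simpson(shape, Fraction(2), Fraction(5))
print(a, prob)  # 6/125 99/125
```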
2. (1 point) Let X be a continuous random variable with PDF
fX(x) = ax for 0 < x < 2; a(4 − x) for 2 ≤ x ≤ 4; 0 otherwise
Calculate P(1 ≤ X ≤ 3)
Answer: 0.75
We know that ∫−∞^∞ f(x) dx = 1.
Solving the above equation with the given f(x), the value of a can be calculated, i.e. a = 1/4.
By symmetry about x = 2, P(1 ≤ X ≤ 3) = 2 ∫1^2 f(x) dx = 2 × 3/8 = 0.75
3. (1 point) The probability density function of X is given by
fX(x) = x for 0 < x < 1; 2 − x for 1 < x < 2; 0 otherwise
Calculate E(X)
Answer: 1
Solution:
We know that ∫−∞^∞ f(x) dx = 1.
E(X) = ∫0^2 x f(x) dx = ∫0^1 x² dx + ∫1^2 x(2 − x) dx = 1/3 + 2/3 = 1
4. (1 point) The distribution of the lengths of a cricket bat is uniform between 80 cm and
100 cm. There is no cricket bat outside this range. The mean and variance of the lengths
of the cricket bat are a and b. Calculate a + b
Answer: 123.33
E(X) = (h + l)/2 = (100 + 80)/2 = 90
V(X) = (h − l)²/12 = (100 − 80)²/12 ≈ 33.33
a + b = 90 + 33.33 = 123.33
5. ( points) Suppose that random variable X is uniformly distributed between 0 and 10.
Then find P(X + 10/X ≥ 7). (Write the answer up to two decimal places.)
Answer: 0.7
Solve the inequality X + 10/X ≥ 7. For X > 0 this is X² − 7X + 10 ≥ 0, i.e.
(X − 2)(X − 5) ≥ 0, which holds for X ≤ 2 or X ≥ 5.
So P(X + 10/X ≥ 7) = P(X ≤ 2) + P(X ≥ 5) = 2/10 + 5/10 = 0.7
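A Monte Carlo check (an illustrative addition; the sample size and seed are arbitrary choices) agrees with the analytical value of 0.7:

```python
import random

# Estimate P(X + 10/X >= 7) for X ~ Uniform(0, 10) by simulation.
random.seed(0)
n = 200_000
hits = 0
for _ in range(n):
    x = random.uniform(0, 10)
    if x > 0 and x + 10 / x >= 7:  # guard against the (measure-zero) x = 0
        hits += 1
print(hits / n)  # close to 0.7
```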
Answer: A, B, C, D
Option A: Standard normal variates (distributions) have a mean of 0 and variance of 1.
The SD is the square root of the variance.
Option C: The normal distribution resembles a bell curve. The maximum concentration
is around the mean value/middle portion.
Option D: The spread of the distribution will change based on the SD. Hence, the
shape is dependent on the standard deviation.
Let X and Y be continuous random variables with joint density
fXY(x, y) = cxy for 0 < x < 2, 1 < y < 3; 0 otherwise
Answer: 1/8
Solution:
We know that ∫−∞^∞ ∫−∞^∞ fXY(x, y) dy dx = 1
∫0^2 ∫1^3 cxy dy dx = c × (∫0^2 x dx) × (∫1^3 y dy) = c × 2 × 4 = 8c = 1
⟹ c = 1/8
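Because the density factorizes over x and y, the normalizing constant follows from two one-dimensional integrals; a small exact-arithmetic check (an illustrative addition, not part of the original solution):

```python
from fractions import Fraction

# Normalization: ∫0^2 ∫1^3 c x y dy dx = c * (∫0^2 x dx) * (∫1^3 y dy) = 1.
int_x = Fraction(2**2 - 0**2, 2)  # ∫0^2 x dx = 2
int_y = Fraction(3**2 - 1**2, 2)  # ∫1^3 y dy = 4
c = 1 / (int_x * int_y)
print(c)  # 1/8
```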
8. ( points) Calculate P(0 < X < 1, 1 < Y < 2)
Answer: 3/32
Solution:
P(0 < X < 1, 1 < Y < 2) = ∫0^1 ∫1^2 (1/8) xy dy dx = (1/8) × (1/2) × (3/2) = 3/32
9. ( points) Calculate P (0 < X < 1, Y > 2)
Answer: 5/32
Solution:
P(0 < X < 1, Y > 2) = ∫0^1 ∫2^3 (1/8) xy dy dx = (1/8) × (1/2) × (5/2) = 5/32
10. ( points) Calculate P ((X + Y ) < 3)
Answer: 0.25
Solution:
P((X + Y) < 3) = ∫0^2 ∫1^(3−x) (1/8) xy dy dx = 0.25
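The double integral over the triangular region can be checked numerically (an illustrative sketch; the grid resolution is an arbitrary choice):

```python
# P(X + Y < 3) = ∫0^2 ∫1^(3 - x) (1/8) x y dy dx.  The inner integral is
# done analytically; the outer one with a midpoint rule.
n = 2000
h = 2 / n
total = 0.0
for i in range(n):
    x = (i + 0.5) * h
    y_hi = 3 - x
    if y_hi > 1:  # region requires y in (1, 3 - x)
        total += (x / 8) * (y_hi**2 - 1) / 2 * h
print(round(total, 4))  # 0.25
```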
11. ( points) Calculate FX(1)
Answer: 0.25
Solution:
FX(1) = P(X ≤ 1) = ∫0^1 ∫1^3 (1/8) xy dy dx = 0.25
Answer: 3/8
Solution:
Answer: 0.25
FX,Y(x, y) = FX(x) × FY(y)
14. (1 point) Suppose a random variable X is best described by a uniform probability dis-
tribution with range 1 to 5. Find the value of a such that P (X ≤ a) = 0.5
Answer: 3
Solution: P(X ≤ 3) = (3 − 1)/(5 − 1) = 0.5, from the area under the uniform density.
15. (1 point) If X is an exponential random variable with rate parameter λ then which of
the following statement(s) is(are) correct.
a) E[X] = 1/λ
b) Var[X] = 1/λ²
c) P (X > x + k|X > k) = P (X > x) for k, x ≥ 0.
d) P (X > x + k|X > k) = P (X > k) for k, x ≥ 0.
Answer: A, B, C
Solution:
Options (a) and (b) are the standard mean and variance of the exponential distribution,
and option (c) is its memorylessness property.
16. (1 point) (Multiple Select) For three events, A, B, and C, with P (C) > 0, Which of
the following is/are correct?
A. P (Ac |C)= 1 - P (A|C)
B. P (ϕ|C) = 0
C. P (A|C) ≤ 1
D. if A ⊂ B then P (A|C) ≤ P (B|C)
Answer: A, B, C, D
Option A: Using standard probability properties. For an event E, P(E) = 1 − P(E^c);
the same identity holds for the conditional measure P(· | C), so P(A^c|C) = 1 − P(A|C).
Option B: The option asks for the probability of the empty set given that an event C
has already occurred, where P(C) > 0. Since P(ϕ ∩ C) = P(ϕ) = 0, we get P(ϕ|C) = 0.
Option C: The probability of an event given to another with non-zero probability will
always be less than or equal to 1 because the total probability can only be 1.
Option D: A larger set has probability at least that of any smaller set it contains. Since
A ⊂ B, we have A ∩ C ⊆ B ∩ C, hence P(A|C) ≤ P(B|C).
17. (2 points) (Multiple Select) Let the random experiment be tossing an unbiased coin
two times. Let A be the event that the first toss results in a head, B be the event that
the second toss results in a tail and C be the event that on both the tosses, the coin
landed on the same side. Choose the correct statements from the following:
A. A and C are independent events.
B. A and B are independent events.
C. B and C are independent events.
D. A, B, and C are independent events.
Answer: A, B, C
Solution:
A = {HT, HH}
B = {HT, TT}
C = {TT, HH}
P(A) = 1/2, P(B) = 1/2, P(C) = 1/2
A ∩ B = {HT}, A ∩ C = {HH}, C ∩ B = {TT}, each with probability 1/4
P(A ∩ B) = P(A) × P(B). Hence, option B is correct.
P(A ∩ C) = P(A) × P(C). Hence, option A is correct.
P(C ∩ B) = P(C) × P(B). Hence, option C is correct.
However, A ∩ B ∩ C = ϕ, so P(A ∩ B ∩ C) = 0 ≠ 1/8 = P(A)P(B)P(C). The three events
are pairwise independent but not mutually independent, so option D is incorrect.
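The independence checks can be reproduced by enumerating the four equally likely outcomes (an illustrative Python addition, not part of the original solution):

```python
from itertools import product
from fractions import Fraction

# Enumerate the 4 equally likely outcomes of two fair tosses.
outcomes = [a + b for a, b in product('HT', repeat=2)]
A = {o for o in outcomes if o[0] == 'H'}   # first toss heads
B = {o for o in outcomes if o[1] == 'T'}   # second toss tails
C = {o for o in outcomes if o[0] == o[1]}  # both tosses the same side

P = lambda E: Fraction(len(E), len(outcomes))
print(P(A & B) == P(A) * P(B),
      P(A & C) == P(A) * P(C),
      P(B & C) == P(B) * P(C))        # True True True (pairwise independent)
print(P(A & B & C), P(A) * P(B) * P(C))  # 0 1/8 -> not mutually independent
```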
18. (2 points) (Multiple Select) If A1, A2, A3, . . . , An are non-empty disjoint sets and
subsets of sample space S, and a set An+1 is also a subset of S, then which of the following
statements are true?
A. The sets A1 ∩ An+1, A2 ∩ An+1, A3 ∩ An+1, . . . , An ∩ An+1 are disjoint.
B. If An+1, An are disjoint then A1, A2, . . . , An−1 are disjoint with An+1.
C. The sets A1, A2, A3, . . . , An, ϕ are disjoint.
D. The sets A1, A2, A3, . . . , An, S are disjoint.
Answer: A,C
Option A: Consider two sets Ai ∩ An+1 and Aj ∩ An+1, where i ≠ j. The intersection of
these two sets is (Ai ∩ An+1) ∩ (Aj ∩ An+1) = (Ai ∩ Aj) ∩ An+1 = ϕ ∩ An+1 = ϕ.
Hence, Ai ∩ An+1 and Aj ∩ An+1 are disjoint sets for all i ≠ j. Therefore, the sets
A1 ∩ An+1, A2 ∩ An+1, A3 ∩ An+1, . . . , An ∩ An+1 are disjoint.
Option B: If An+1 is disjoint with An, it doesn't mean that it'll be disjoint with the
other n − 1 sets as well. For example, with 3 disjoint sets A1, A2, A3, take A4 = A1 ∪ A2.
Now, A4 is disjoint with A3 but not with the other two.
Option C: A1, . . . , An are disjoint with each other (given). Each of them is also disjoint
with ϕ (the intersection with ϕ is empty).
Option D: Each of the n sets is a subset of S, so its intersection with S is the set itself,
which is non-empty. Hence, not disjoint.
19. (3 points) A triangular spinner having three outcomes can land on one of the numbers
0, 1 and 2 with the probabilities shown in the table.

Outcome       0     1     2
Probability   0.7   0.2   0.1

The spinner is spun twice. The total of the numbers on which it lands is denoted by X.
The probability distribution of X is:

A.  x         2       3       4       5      6
    P(X = x)  49/100  28/100  1/100   4/100  18/100

B.  x         2       3       4       5      6
    P(X = x)  28/100  49/100  18/100  1/100  4/100

C.  x         0       1       2       3      4
    P(X = x)  49/100  28/100  18/100  4/100  1/100

D.  x         2       3       4       5      6
    P(X = x)  28/100  49/100  18/100  4/100  1/100
Answer: C
The maximum sum you can get is 4 (2 + 2), so the support of X is {0, 1, 2, 3, 4}.
P(X = 0) = P(0 and 0) = 0.7 × 0.7 = 49/100
P(X = 1) = 2 × P(0) × P(1) = 2 × 0.7 × 0.2 = 28/100
P(X = 2) = 2 × P(0) × P(2) + P(1)² = 2 × 0.7 × 0.1 + 0.2² = 18/100
P(X = 3) = 2 × P(1) × P(2) = 2 × 0.2 × 0.1 = 4/100
P(X = 4) = P(2 and 2) = 0.1 × 0.1 = 1/100
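Enumerating the nine ordered pairs of spins reproduces distribution C (an illustrative Python sketch, not part of the original solution):

```python
from itertools import product
from collections import defaultdict
from fractions import Fraction

# Two independent spins with P(0) = 0.7, P(1) = 0.2, P(2) = 0.1;
# X is the total.  Exact arithmetic avoids float rounding.
p = {0: Fraction(7, 10), 1: Fraction(2, 10), 2: Fraction(1, 10)}
dist = defaultdict(Fraction)
for a, b in product(p, repeat=2):
    dist[a + b] += p[a] * p[b]
print({x: float(q) for x, q in sorted(dist.items())})
# {0: 0.49, 1: 0.28, 2: 0.18, 3: 0.04, 4: 0.01}
```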
20. (1 point) When throwing a fair die, what is the variance of the number of throws needed
to get a 1?
Answer: 30
Solution:
Var(X) = (1 − p)/p² = (1 − 1/6)/(1/6)² = (5/6) × 36 = 30
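The closed form can be cross-checked against a truncated series for E[X] and E[X²] (an illustrative addition; the truncation point is an arbitrary choice):

```python
# X ~ Geom(p = 1/6): number of throws needed to see a 1 on a fair die.
p = 1 / 6
var_formula = (1 - p) / p**2  # (1 - p)/p^2 = 30

# Truncated-series cross-check; the tail beyond K is negligible.
K = 500
EX = sum(k * (1 - p) ** (k - 1) * p for k in range(1, K))
EX2 = sum(k * k * (1 - p) ** (k - 1) * p for k in range(1, K))
print(var_formula, EX2 - EX**2)  # both approximately 30
```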
21. (1 point) Joint pmf of two random variables X and Y are given in Table
x \ y    1      2     3     fX(x)
1        0.05   0     a1    0.15
2        0.1    0.2   a3    a2
3        a4     0.2   a5    0.45
fY(y)    0.3    0.4   a6
Answer: 0.22
Solution:
Σx,y fXY(x, y) = 1 .............(i)
fX(x) = Σ(y∈RY) fXY(x, y) .............(ii)
fY(y) = Σ(x∈RX) fXY(x, y) .............(iii)
22. (1 point) (Multiple Select) Which of the following options is/are correct?
A. If Cov[X, Y ] = 0, then X and Y are independent random variables.
B. Cov[X, X] = V ar(X)
C. If X and Y are two independent random variables and Z = X + Y then
fZ(z) = Σx fX(x) × fY(z − x)
D. If X and Y are two independent random variables and Z = X + Y then
fZ(z) = Σy fX(x) × fY(z − x)
Answer: B, C
Solution:
Option A is incorrect: zero covariance does not imply independence.
Option B: Cov[X, X] is the covariance between X and itself, i.e. Var(X).
Option C is the convolution formula for the sum of two independent random variables.
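Option C's convolution formula can be illustrated with a small example (the two fair-die pmfs below are hypothetical, chosen only for illustration):

```python
from collections import defaultdict
from fractions import Fraction as F

# f_Z(z) = sum_x f_X(x) f_Y(z - x) for independent X, Y (option C).
fX = {x: F(1, 6) for x in range(1, 7)}  # fair die
fY = dict(fX)                           # second, independent fair die

fZ = defaultdict(F)
for x, px in fX.items():
    for y, py in fY.items():
        fZ[x + y] += px * py            # full joint-sum construction

z = 7                                    # check one point via the convolution
conv = sum(fX[x] * fY.get(z - x, F(0)) for x in fX)
print(fZ[7], conv)  # 1/6 1/6
```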
Answer: B, C
Solution:
For k:
FX(3) = 1 ⟹ (3³ + k)/40 = 1 ⟹ 27 + k = 40 ⟹ k = 13
We will get
P(X = 1) = 14/40
P(X = 2) = 7/40
P(X = 3) = 19/40
Now easily with the Var(X) equation we will get
Var(X) = E(X²) − (E(X))² = 213/40 − (85/40)² = 259/320
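With the pmf recovered from the CDF (k = 13 from FX(3) = 1), the variance can be verified exactly (an illustrative Python sketch, not part of the original solution):

```python
from fractions import Fraction

# pmf recovered from F_X(x) = (x^3 + 13)/40 at x = 1, 2, 3.
pmf = {1: Fraction(14, 40), 2: Fraction(7, 40), 3: Fraction(19, 40)}
EX = sum(x * q for x, q in pmf.items())        # 17/8
EX2 = sum(x * x * q for x, q in pmf.items())   # 213/40
var = EX2 - EX**2
print(var)  # 259/320
```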
24. (1 point) In a game of Ludo, Player A needs to repeatedly throw an unbiased die till
he gets a 6. What is the probability that he needs fewer than 4 throws? (Answer the
question correct to two decimal points.)
Answer: 0.42
Solution:
P(6) = 1/6. The number of throws needed is geometric with p = 1/6, so
P(fewer than 4 throws) = 1 − (5/6)³ = 91/216 ≈ 0.42
25. (1 point) (Multiple Select) Let X and Y be two random variables with joint PMF
fXY (x, y) given in Table 10.3.
x \ y    0     1     2
0        1/6   1/4   1/8
1        1/8   1/6   1/6

Which of the following options is/are correct for fXY(x, y) given in Table 10.3?
A. P(X = 0, Y ≤ 1) = 5/12
B. P(X = 0, Y ≤ 1) = 7/12
C. X and Y are independent.
D. X and Y are dependent.
Answer: A, D
P(X = 0, Y ≤ 1) = P(X = 0 and Y = 0) + P(X = 0 and Y = 1) = 1/6 + 1/4 = 10/24 = 5/12
P(X = 0) = 1/6 + 1/4 + 1/8 = 13/24
P(Y = 0) = 1/6 + 1/8 = 7/24
Now, if they are independent, then the product of the marginal should be equal to the
joint probability.
PXY(0, 0) = (13/24) × (7/24) = 91/576 ≈ 0.158 (if independent)
Also, the table says that PXY(0, 0) = 1/6 ≈ 0.1667
Because the two are not equal, we can conclude that they are not independent.
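The dependence check can be reproduced with exact rational arithmetic (an illustrative addition; the table values are hard-coded from the question):

```python
from fractions import Fraction as F

# Joint pmf: rows x = 0, 1; columns y = 0, 1, 2 (values from the question).
pmf = {(0, 0): F(1, 6), (0, 1): F(1, 4), (0, 2): F(1, 8),
       (1, 0): F(1, 8), (1, 1): F(1, 6), (1, 2): F(1, 6)}

p_event = pmf[(0, 0)] + pmf[(0, 1)]                  # P(X = 0, Y <= 1)
pX0 = sum(q for (x, y), q in pmf.items() if x == 0)  # P(X = 0) = 13/24
pY0 = sum(q for (x, y), q in pmf.items() if y == 0)  # P(Y = 0) = 7/24
print(p_event)                  # 5/12
print(pX0 * pY0, pmf[(0, 0)])   # 91/576 vs 1/6 -> dependent
```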
26. (1 point) A discrete random variables X has the probability function as given in table
10.4.
x       1   2   3   4   5   6
P(X)    a   a   a   b   b   0.3

Answer: 0.3
Σ P(X = x) = 1 ⟹ 3a + 2b + 0.3 = 1 ⟹ 3a + 2b = 0.7
E(X) = Σ P(X = xi) × xi ⟹ 6a + 9b = 2.4
Solving the two equations, a = 0.1 and b = 0.2, so a + b = 0.3
27. (1 point) A discrete random variable X has the probability function as follows.
P(X = x) = k × (1 − x)², for x = 1, 2, 3; 0, otherwise
Evaluate E(X)
Answer: 2.8
Solution:
Σ P(X = x) = 1 ⟹ k(0) + k(1) + k(4) = 1 ⟹ 5k = 1 ⟹ k = 0.2
E(X) = Σ P(X = xi) × xi = (0)(1) + (0.2)(2) + (0.8)(3) = 2.8