MLF PA GA Sol

The document contains practice questions and answers for a Machine Learning course, focusing on concepts such as models, regression, classification, and unsupervised learning. It includes explanations for each answer to enhance understanding of the topics. The questions cover various scenarios and applications of machine learning techniques.

Course: Machine Learning - Foundations

Week 1 (Practice questions)

1. (1 point) Which of the following is true about a model?

1. A model is a mathematical representation of reality.


2. A model is an exact representation of a system.
3. A model uses no assumptions.
A. 1 and 2
B. 2 and 3
C. only 1
D. only 2

Answer: C
Explanation: A model represents reality (or a system) in some mathematical form or another.
A model is never an exact representation; it is a mathematical simplification of reality.
A model always makes some assumptions. For instance, the assumption could be that the
data points fit in 2 dimensions or 3 dimensions.

2. (1 point) Identify which of the following problem requires regression algorithm.


A. Predicting the country that a person belongs to based on his physical features.
B. Predicting the animal present in a given image based on sample images of
various animals.
C. Predicting the topic of a given Wikipedia article based on article’s keywords.
D. Predicting the stock price of a company on a given day based on revenue
growth and profit after tax.

Answer: D
Explanation: We know that there are 2 broad categories for supervised algorithms:
regression and classification.
Classification is where we have to put the data points into some “class” based on the
features and regression is where we predict a numerical value. In A, B & C, we would
classify the data points into some class(es) whereas, option D requires us to predict a
numerical value.

3. (1 point) For spam detection, if we use traditional programming rather than a machine
learning approach, which problems may be faced?

A. The need to check for a long list of possible patterns that may be present.
B. Difficulty in maintaining the program containing the complex hard-coded rules.
C. The need to keep writing new rules as the spammers become innovative.
D. All of these.

Answer: D
Explanation: Traditional programming would require explicitly defining and checking
for specific patterns that indicate spam. This would involve creating a long list of pos-
sible patterns that may be present in spam messages.
Traditional programming for spam detection often involves creating complex rules and
conditions to identify spam. Maintaining and updating such a program can be cumber-
some.
If you compare the spam messages from a decade ago, you’ll notice a heavy change in
the pattern. So, traditional programming will require us to constantly update our code
and rules to keep up with the spam detection trend.

4. (1 point) In a regression model, the parameters wi 's and b of the function
f(X) = Σ_{j=1}^{d} wj xj + b, where X = [x1, x2, ..., xd],
A. are strictly integers.
B. always lie in the range [0,1].
C. are any real value.
D. are any imaginary value.

Answer: C
Explanation: The weights w and the bias term b are not restricted to integers or a specific
range of real numbers; they can take any real value. Regression models do not make
use of imaginary numbers.

5. (1 point) Identify classification problem in the following statements


A. Find the gender of a person by analyzing writing style.
B. Predict the price of a car based on engine power, mileage etc.
C. Predict whether there will be abnormally heavy rainfall tomorrow or not based
on previous data.
D. Predict the number of street accidents that will happen this month based on
traffic volume count.

Answer: A,C

If the given problem statement allows you to predict the results into a specific class from
a specific set of classes, then it can be said that the problem is a classification problem.

In option A, based on the writing style, the data point would be tagged as a specific
class - Either “Male”, “Female” or any other class, if present.
In option B, price is a numerical quantity that is being predicted. Hence, this would be
a regression problem, rather than a classification problem.
In option C, there are 2 classes for which the predictions will be made. The classes can
be, for instance, "will rain" and "will not rain".
In option D, the prediction is a numerical quantity. Therefore, this will be a regression
problem.

6. (1 point) Identify the task that requires the use of regression among the following
A. Predict the height of a person based on his weight.
B. Predict the country a person belongs to based on his linguistic features.
C. Predict whether the price of gold will increase tomorrow or not based on data
of the last 25 days.
D. Predict whether a movie is comedy or tragedy based on its reviews.

Answer: A
Explanation: For regression problems, the predicted value must be a numerical quan-
tity. This numerical quantity is usually continuous.
In option A, you are predicting a numerical value. Therefore, this will fall under the
category of regression problem.
In option B, specific classes are predicted. In this case, the classes would be the set of
all countries.
In option C, again, the predictions being made are specific classes and not numerical
quantities. Here, the classes are { will increase, will not increase }.
In option D, the specific style/genre of the movie is to be predicted. Here, the classes
are { comedy, tragedy }.

7. (1 point) Which of the following are examples of unsupervised learning problems?


A. Grouping tweets based on topic similarity
B. Making clusters of cells having similar appearance under microscope.
C. Checking whether an email is spam or not.
D. Identify the gender of online customers based on buying behaviour.

Answer: A,B
Explanation: To identify unsupervised learning problems, see if the problem requires
us to group/cluster based on the patterns in the data. Unsupervised learning problems
have unlabelled data (the target column is absent).
In option A, we are grouping, hence unsupervised.
In option B we are making clusters, hence unsupervised.
In option C, We typically train a model for the following problem using labelled data
to help us identify spam or not spam for future unseen emails (based on the patterns
learned from the labelled data). Hence, it’s a supervised problem.
In option D, we would be “grouping” the customers based on their buying behaviour to
help us identify their gender. We don’t have explicit labels available with us. So, this is
also an unsupervised problem.

8. (1 point) Which of the following is/are incorrect?


A. 1(2 is even) = 1
B. 1(10 % 3 = 0) = 0
C. 1(0.5 ∉ R) = 0
D. 1(2 ∈ {2, 3, 4}) = 0

Answer: D
Explanation: First, we need to understand the notation used here.
1(expression) = output → if the expression inside () is true, then the output is 1, if the
expression inside () is false, the output is 0.
This is known as the indicator function. In mathematics, an indicator function of a sub-
set of a set is a function that maps elements of the subset to one, and all other elements
to zero.

Now, in option A, the expression is true, hence the output will be 1. Therefore, it is a
correct option.
In option B, the expression is false (10 mod 3 is equal to 1), hence the output is 0.
Therefore, this option is correct as well.
In option C, the expression is false, hence the output should be 0. Again, the option is
correct.
In option D, the expression is true but the output mentioned is 0, whereas it should have
been 1. Therefore, it’s an incorrect statement and hence, our answer.
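The indicator function is easy to check mechanically. The sketch below is ours, not part of the original solution; the helper name `indicator` is an assumption:

```python
def indicator(condition: bool) -> int:
    """1(condition): maps a true statement to 1 and a false one to 0."""
    return 1 if condition else 0

a = indicator(2 % 2 == 0)      # option A: 2 is even, so 1(...) = 1
b = indicator(10 % 3 == 0)     # option B: 10 mod 3 is 1, so 1(...) = 0
c = indicator(False)           # option C: 0.5 IS real, so "0.5 not in R" is false -> 0
d = indicator(2 in {2, 3, 4})  # option D: true, so 1(...) = 1, not 0 as the option claims
```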

9. (1 point) Which of the following functions corresponds to a classification model?



A. f : R^d → R
B. f : R^d → {+1, −1}
C. f : R^d → R^{d′} where d′ < d

Answer: B
Explanation: Only option B is where we have a set of classes. In classification models,
a set of features is mapped to a set of classes.

10. (1 point) Which of the following can form a good encoder-decoder pair for d-dimensional
data?

A. f : R^d → R^{d′} where d′ < d
   g : R^{d′} → R^d where d′ > d

B. f : R^d → R^{d′} where d′ > d
   g : R^{d′} → R^d where d′ < d

C. f : R^d → R^{d′} where d′ < d
   g : R^{d′} → R^d where d′ < d

D. f : R^d → R^{d′} where d′ > d
   g : R^{d′} → R^d where d′ > d

Answer: C
Explanation: A good encoder would be the one which reduces the dimensions.
A good decoder would be one which gives us higher dimensional data (preferably the
same dimension as the original data).
Only option C satisfies the aforementioned requirements.

11. (2 points) Consider the following two scenarios:


(1) Given the details of a person’s sample, the lab technician wants to find whether the
person is suffering from cancer or not.
(2) You are the manufacturer of a mobile company. You wish to know how many of the
mobiles are expected to be sold in the next six months.
A. Both (1) and (2) are suited for regression.
B. Both (1) and (2) are suited for classification.
C. Problem (1) is better suited for classification while (2) is better suited for
regression.
D. Problem (2) is better suited for classification while (1) is better suited for
regression.

Answer: C

Explanation: In scenario 1, we have 2 classes, cancer and not-cancer, so this scenario
is suited for classification. In scenario 2, we have a regression problem, as we want to
predict the number of mobiles to be sold. Therefore, option C is the correct answer.

12. (2 points) Consider the following data set where each data point consists of three fea-
tures x1 , x2 and x3 :

x1 x2 x3
10 10 9
13 12 13
5 5 4
8 7 7

Consider two encoder functions f and f̃ with decoders g and g̃ respectively, aiming to
reduce the dimensionality of the data set from 3 to 1:
Pair 1: f(x1, x2, x3) = x1 − x2 + x3 and g(u) = [u, u, u]
Pair 2: f̃(x1, x2, x3) = (x1 + x2 + x3)/3 and g̃(u) = [u, u, u]
The reconstruction loss of the encoder-decoder pair is the mean of the squared distance
between the reconstructed input and the input.

Answer: Pair 1: 3(Range 2.95 to 3.05)


Pair 2: 0.667(Range 0.63 to 0.7)

Explanation:
We can calculate the f () and f˜() as following:

x1 x2 x3 f (x1 , x2 , x3 ) f˜(x1 , x2 , x3 )
10 10 9 9 9.66
13 12 13 14 12.66
5 5 4 4 4.66
8 7 7 8 7.33

Now, we can also calculate the g(u) and g̃(u) which would give us the vectors

g(u) g̃(u)
[9, 9, 9] [9.66, 9.66, 9.66]
[14, 14, 14] [12.66, 12.66, 12.66]
[4, 4, 4] [4.66, 4.66, 4.66]
[8, 8, 8] [7.33, 7.33, 7.33]

Now, we calculate the loss using the loss function below:

loss_pair1 = (1/n) Σ_{i=1}^{n} ||g(f(x_i)) − x_i||²
           = ( ||[−1, −1, 0]||² + ||[1, 2, 1]||² + ||[−1, −1, 0]||² + ||[0, 1, 1]||² ) / 4
           = (2 + 6 + 2 + 2) / 4 = 3

loss_pair2 = (1/n) Σ_{i=1}^{n} ||g̃(f̃(x_i)) − x_i||²
           = ( ||[0.34, 0.34, 0.66]||² + ||[0.34, 0.66, 0.34]||² + ||[0.34, 0.34, 0.66]||² + ||[0.67, 0.33, 0.33]||² ) / 4
           = (0.667 + 0.667 + 0.667 + 0.667) / 4 ≈ 0.667
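The two reconstruction losses can be verified numerically. A minimal sketch (ours; the names `f1`, `f2`, `g`, and `recon_loss` are assumptions, not from the question):

```python
# Data points and the two encoder/decoder pairs from the question.
data = [(10, 10, 9), (13, 12, 13), (5, 5, 4), (8, 7, 7)]

f1 = lambda x: x[0] - x[1] + x[2]          # Pair 1 encoder
f2 = lambda x: (x[0] + x[1] + x[2]) / 3    # Pair 2 encoder
g = lambda u: (u, u, u)                    # both decoders replicate u into 3 dims

def recon_loss(f, g, data):
    # mean of the squared distance between g(f(x)) and x
    total = 0.0
    for x in data:
        rec = g(f(x))
        total += sum((r - xi) ** 2 for r, xi in zip(rec, x))
    return total / len(data)

loss_pair1 = recon_loss(f1, g, data)   # 3.0
loss_pair2 = recon_loss(f2, g, data)   # 2/3, about 0.667
```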
Course: Machine Learning - Foundations
Week 1 (Graded assignment)

1. (1 point) [2, 4, -5] belongs to which of the following?


A. R
B. R+
C. Both R+ and R−
D. R3

Answer: D
Solution:
The vector [2, 4, −5] contains 3 components and all of them are real numbers.
So, [2, 4, −5]ᵀ ∈ R³.
∴ Option D is correct.

2. (1 point) Which of the following may not be an appropriate choice of loss function for
regression?
A. (1/n) Σ_{i=1}^{n} (f(x_i) − y_i)²

B. (1/n) Σ_{i=1}^{n} |f(x_i) − y_i|

C. (1/n) Σ_{i=1}^{n} 1(f(x_i) ≠ y_i)

Answer: C
Solution:

Here, option C, that is, Loss = (1/n) Σ_{i=1}^{n} 1(f(x_i) ≠ y_i), may be a good choice
for classification, but it is not a good choice for regression.

You can see that this loss function will increase when the prediction is not equal to
a label. However, it does this with a fixed loss of 1. Ideally, we would want the loss
increase to be proportionate to the amount of discrepancy between the prediction and
the label.

∴ Option C is correct.

3. (1 point) Identify which of the following requires use of classification technique.



A. Predicting the amount of rainfall in May 2022 in North India based on precip-
itation data of the year 2021.
B. Predicting the price of a land based on its area and distance from the market.
C. Predicting whether an email is spam or not.
D. Predicting the number of Covid cases on a given day based on previous month
data.

Answer: C
Solution:

Here, in options A, B and D, we can see that we have to predict some kind of real
number: the amount of rainfall, the price of land, and the number of cases. These
problems are better suited to regression. Option C, however, predicts which category
the data point (email) falls into; it is an example of binary classification.

∴ Option C is correct.

4. (1 point) (Multiple Select) Mark all incorrect statements in the following


A. 1(355%2 = 1) = 1
B. 1(788%2 = 1) = 0
C. 1(355%2 = 0) = 1
D. 1(788%2 = 0) = 1

Answer: C
Solution:

Let’s look at each option one by one.

A. Since 355 is odd, 355%2 = 1. So, the statement inside the indicator function is
true. That is, 1(355%2 = 1) = 1. Since this option is a true statement, it will not be
marked.

B. Since 788 is even, 788%2 = 0. So, the statement inside the indicator function is
false. That is, 1(788%2 = 1) = 0. Since this option is a true statement, it will not be
marked.

C. Since 355 is odd, 355%2 = 1. So, the statement inside the indicator function is
false. That is, 1(355%2 = 0) = 0. Since this option is a false statement, it will be
marked.

D. Since 788 is even, 788%2 = 0. So, the statement inside the indicator function is
true. That is, 1(788%2 = 0) = 1. Since this option is a true statement, it will not be
marked.

∴ Only option C is correct.

5. (1 point) Which of the following is false regarding supervised and unsupervised machine
learning?
A. Unsupervised machine learning helps you to find different kinds of unknown
patterns in data.
B. Regression and classification are two types of supervised machine learning tech-
niques while clustering and density estimation are two types of unsupervised
learning.
C. In unsupervised learning model, the data contains both input and output
variables while in supervised learning model, the data contains only input
data.

Answer: C
Solution:

Here, option C is a false statement. It is in fact the supervised learning model in which
the data contains both input and output variables, and the unsupervised learning model
in which the data contains only input data.

∴ Option C is correct.

6. (1 point) The output of a regression model


A. is discrete.
B. is continuous and always within a finite range.
C. is continuous with any range.
D. may be discrete or continuous.

Answer: C
Solution:

The output of a regression model, linear regression for example, can be any real number.
It is continuous and can be within any range.

∴ Option C is correct.

7. (1 point) (Multiple select) Which of the following is/are supervised learning task(s)?
A. Making different groups of customers based on their purchase history.
B. Predicting whether a loan client may default or not based on previous credit
history.
C. Grouping similar Wikipedia articles as per their content.
D. Estimating the revenue of a company for a given year based on number of
items sold.

Answer: B,D
Solution:

Let’s take each option one by one.

A. Making different groups is an example of clustering, which is an unsupervised learning


task.
B. Predicting whether a client may default or not is an example of binary classification,
which is a supervised learning task.
C. Again, grouping similar articles is an example of clustering, which is an unsupervised
learning task.
D. Estimation of revenue which is a real continuous number is an example of regression,
a supervised learning task.

∴ Options B and D are correct.

8. (1 point) Which of the following is used for predicting a continuous target variable?
A. Classification
B. Regression
C. Density Estimation
D. Dimensionality Reduction

Answer: B
Solution:

Out of the options, the technique used for prediction of a continuous target variable is
regression.

∴ Option B is correct.

9. (1 point) Consider the following: "The ____ is used to fit the model; the ____ is used
for model selection; the ____ is used for computing the generalization error."
Which of the following will fill the above blanks correctly?
A. Test set; Validation set; training set
B. Training set; Test set; Validation set
C. Training set; Validation set; Test set
D. Test set; Training set; Validation set

Answer: C
Solution:

The training set is used to fit our model. After that, the validation set is used to select
the best model. Then, the test set is used for computing the generalization error.

∴ Option C is correct.

10. (1 point) Consider the following loss functions:

1. (1/n) Σ_{i=1}^{n} −log(P(X_i))

2. (1/n) Σ_{i=1}^{n} ||g(f(X_i)) − X_i||²

3. (1/n) Σ_{i=1}^{n} (f(X_i) − Y_i)²

4. (1/n) Σ_{i=1}^{n} 1(f(X_i) ≠ Y_i)
The above loss functions pertain to which of the following ML techniques (in that or-
der)?
A. Dimensionality Reduction, Regression, Classification, Density Estimation
B. Dimensionality Reduction, Classification, Density Estimation, Regression
C. Density Estimation, Dimensionality Reduction, Regression, Classification
D. Classification, Density Estimation, Dimensionality Reduction, Regression
E. Classification, Dimensionality Reduction, Regression, Density Estimation

Answer: C
Solution:

Let’s go over them one by one.



1. This is the negative log likelihood loss and is used for density estimation.
2. This is computing the error between the reconstructed datapoint and actual datapoint
and is used in dimensionality reduction.
3. This is the squared error loss and it is used for regression.
4. This loss function simply compares if prediction and label are equal or not. This is
used in classification.

∴ Option C is correct.

11. (1 point) Compute the loss when Pair 1 and Pair 2 (shown below) are used for dimen-
sionality reduction for the data given in the following Table:

x1 x2
1 0.5
2 2.3
3 3.1
4 3.9

Consider the loss function to be (1/n) Σ_{i=1}^{n} ||g(f(x_i)) − x_i||².

1. Pair 1: f(x) = (x1 − x2), g(u) = [u/2, u/2]

2. Pair 2: f(x) = (x1 + x2)/2, g(u) = [u/2, u/2]

Here f (x) is the encoder function and g(x) is the decoder function.
Pair 1:
Pair 2:

Answer: Pair 1: 15.225

Pair 2: 3.806 [Range could be 3.61 to 4]
Solution:

We are given an encoder (f ) and a decoder (g) function. To solve this question, we
will take each datapoint xi and encode it using encoder function getting f (xi ) and then
decode it to get g(f (xi )). Then the squared error would be given as ||g(f (xi )) − xi ||2 .
We would then take the average of this error over all datapoints to get the loss.

Pair 1:

x_i          f(x_i)   g(f(x_i))        g(f(x_i)) − x_i    ||g(f(x_i)) − x_i||²
x1 [1, 0.5]   0.5     [0.25, 0.25]     [−0.75, −0.25]     0.625
x2 [2, 2.3]  −0.3     [−0.15, −0.15]   [−2.15, −2.45]     10.625
x3 [3, 3.1]  −0.1     [−0.05, −0.05]   [−3.05, −3.15]     19.225
x4 [4, 3.9]   0.1     [0.05, 0.05]     [−3.95, −3.85]     30.425

Loss = (1/4) Σ_{i=1}^{4} ||g(f(x_i)) − x_i||² = (0.625 + 10.625 + 19.225 + 30.425)/4 = 15.225

Pair 2:

x_i          f(x_i)   g(f(x_i))        g(f(x_i)) − x_i    ||g(f(x_i)) − x_i||²
x1 [1, 0.5]   0.75    [0.375, 0.375]   [−0.625, −0.125]   0.406
x2 [2, 2.3]   2.15    [1.075, 1.075]   [−0.925, −1.225]   2.356
x3 [3, 3.1]   3.05    [1.525, 1.525]   [−1.475, −1.575]   4.656
x4 [4, 3.9]   3.95    [1.975, 1.975]   [−2.025, −1.925]   7.806

Loss = (1/4) Σ_{i=1}^{4} ||g(f(x_i)) − x_i||² = (0.406 + 2.356 + 4.656 + 7.806)/4 ≈ 3.806

∴ The answer is 15.225 and 3.806.
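A quick numerical check of both losses (the names `f_pair1`, `f_pair2`, and `recon_loss` are ours, not from the original solution):

```python
# The four 2-D points and the shared decoder from the question.
data = [(1, 0.5), (2, 2.3), (3, 3.1), (4, 3.9)]

f_pair1 = lambda x: x[0] - x[1]          # Pair 1 encoder
f_pair2 = lambda x: (x[0] + x[1]) / 2    # Pair 2 encoder
g = lambda u: (u / 2, u / 2)             # decoder used by both pairs

def recon_loss(f, g, data):
    # mean squared distance between the reconstruction g(f(x)) and x
    total = 0.0
    for x in data:
        rec = g(f(x))
        total += sum((r - xi) ** 2 for r, xi in zip(rec, x))
    return total / len(data)

loss1 = recon_loss(f_pair1, g, data)   # 15.225
loss2 = recon_loss(f_pair2, g, data)   # 3.80625
```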

12. (1 point) Consider the following 4 training examples. We want to learn a function

x y
-1 0.0319
0 0.8692
1 1.9566
2 3.0343

f (x) = ax + b which is parameterized by (a,b). Using average squared error as the loss
function, which of the following parameters would be best to model the given data?
A. (1, 1)
B. (1, 2)
C. (2, 1)
D. (2, 2)

Answer: A
Solution: For each of the parameters given, we have a different function to estimate y.
For each function we will estimate each label y i .
Then the loss will be given by (1/4) Σ_{i=1}^{4} ||f(x_i) − y_i||².

x y x + 1 x + 2 2x + 1 2x + 2
−1 0.0319 0 1 −1 0
0 0.8692 1 2 1 2
1 1.9566 2 3 3 4
2 3.0343 3 4 5 6

Let’s go over each option one by one.

A. f(x) = x + 1

Loss = (1/4) [ (0 − 0.0319)² + (1 − 0.8692)² + (2 − 1.9566)² + (3 − 3.0343)² ]
     = (1/4) [ (−0.0319)² + 0.1308² + 0.0434² + (−0.0343)² ]
     ≈ 0.005

B. f(x) = x + 2

Loss = (1/4) [ (1 − 0.0319)² + (2 − 0.8692)² + (3 − 1.9566)² + (4 − 3.0343)² ]
     = (1/4) [ 0.9681² + 1.1308² + 1.0434² + 0.9657² ]
     ≈ 1.06

C. f(x) = 2x + 1

Loss = (1/4) [ (−1 − 0.0319)² + (1 − 0.8692)² + (3 − 1.9566)² + (5 − 3.0343)² ]
     = (1/4) [ (−1.0319)² + 0.1308² + 1.0434² + 1.9657² ]
     ≈ 1.51

D. f(x) = 2x + 2

Loss = (1/4) [ (0 − 0.0319)² + (2 − 0.8692)² + (4 − 1.9566)² + (6 − 3.0343)² ]
     = (1/4) [ (−0.0319)² + 1.1308² + 2.0434² + 2.9657² ]
     ≈ 3.56

∴ Option A is correct.
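The four candidate parameter pairs can also be compared programmatically; a short sketch (the helper name `avg_sq_loss` is ours):

```python
# Average squared error of f(x) = a*x + b over the four training points.
data = [(-1, 0.0319), (0, 0.8692), (1, 1.9566), (2, 3.0343)]

def avg_sq_loss(a, b):
    return sum((a * x + b - y) ** 2 for x, y in data) / len(data)

losses = {p: avg_sq_loss(*p) for p in [(1, 1), (1, 2), (2, 1), (2, 2)]}
best = min(losses, key=losses.get)   # (1, 1), the pair with the smallest loss
```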

13. (1 point) Consider the following input data points:

X y
[ 2] 5.8
[ 3] 8.3
[ 6] 18.3
[ 7] 21
[ 8] 22

What will be the amount of loss when the functions g = 3x1 + 1 and h = 2x1 + 2 are used
to represent the regression line. Consider the average squared error as loss function.
g:

h:

Answer: g: 2.964 [Range could be 2.82 to 3.11]


h: 11.924 [Range could be 11.32 to 12.52]
Solution:
The average squared loss for regression line f(x) is given by (1/5) Σ_{i=1}^{5} (y_i − f(x_i))².

x   y      g(x)   h(x)   (y − g(x))²   (y − h(x))²
2   5.8    7      6      1.44          0.04
3   8.3    10     8      2.89          0.09
6   18.3   19     14     0.49          18.49
7   21     22     16     1             25
8   22     25     18     9             16
                  Sum:   14.82         59.62

We can see that the loss for g(x) = 3x1 + 1 is 14.82/5 = 2.964 and the loss for
h(x) = 2x1 + 2 is 59.62/5 = 11.924.
∴ Answer is 2.964 and 11.924.
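The same averages in a quick sketch (the helper name `avg_sq_loss` is ours, not from the solution):

```python
X = [2, 3, 6, 7, 8]
y = [5.8, 8.3, 18.3, 21, 22]

def avg_sq_loss(f):
    # average squared error of the line f over the five points
    return sum((yi - f(xi)) ** 2 for xi, yi in zip(X, y)) / len(X)

loss_g = avg_sq_loss(lambda x: 3 * x + 1)   # 14.82 / 5 = 2.964
loss_h = avg_sq_loss(lambda x: 2 * x + 2)   # 59.62 / 5 = 11.924
```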

14. (2 points) Consider the following input data points:

X y
[ 4, 2] +1
[ 8, 4] +1
[ 2, 6] -1
[ 4, 10] -1
[ 10, 2] +1
[ 12, 8] -1

What will be the average misclassification error when the functions g(X) = sign(x1 −
x2 − 2) and h(X) = sign(x1 + x2 − 10) are used to classify the data points into classes
+1 or −1.
g:

h:

Answer: g: 1/6(Range 0.158 to 0.175)


h: 1/2 ( Range 0.475 to 0.525)
Solution:

The average misclassification error for a function f(x) is given by (1/n) Σ_{i=1}^{n} 1(f(X_i) ≠ y_i).

x         y     g(x)   h(x)   1(y ≠ g(x))   1(y ≠ h(x))
(4, 2)    1     1      −1     0             1
(8, 4)    1     1      1      0             0
(2, 6)    −1    −1     −1     0             0
(4, 10)   −1    −1     1      0             1
(10, 2)   1     1      1      0             0
(12, 8)   −1    1      1      1             1
                       Sum:   1             3

So, the loss for g(x) = sign(x1 − x2 − 2) is 1/6 and the loss for h(x) = sign(x1 + x2 − 10) is 1/2.
∴ Answer is 1/6 and 1/2.
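A sketch of the same computation. Note one assumption: `sign(0)` is treated as +1, which matches the worked table (g classifies (4, 2), which lies exactly on the boundary, as +1):

```python
# Labelled points from the question.
points = [((4, 2), 1), ((8, 4), 1), ((2, 6), -1),
          ((4, 10), -1), ((10, 2), 1), ((12, 8), -1)]

sign = lambda v: 1 if v >= 0 else -1   # sign(0) -> +1 by assumption

def misclass_error(f):
    # fraction of points whose predicted class differs from the label
    return sum(1 for x, label in points if f(x) != label) / len(points)

err_g = misclass_error(lambda x: sign(x[0] - x[1] - 2))    # 1/6
err_h = misclass_error(lambda x: sign(x[0] + x[1] - 10))   # 1/2
```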

15. (1 point) f(x1, x2, x3) = (x1 + 2x2)/2 is used as the encoder function and g(u) = [u, 2u, 3u] is used
as decoder function for dimensionality reduction of following data set.

X
[1,2,3]
[2,3,4]
[-1,0,1]
[0,1,1]

Give the reconstruction error for this encoder decoder pair. The reconstruction error is
the mean of the squared distance between the reconstructed input and input.

Answer: 34.5 (Range 32.78 to 36.22)


Solution:

We are given an encoder (f ) and a decoder (g) function. To solve this question, we
will take each datapoint xi and encode it using encoder function getting f (xi ) and then
decode it to get g(f (xi )). Then, the squared error would be given as ||g(f (xi )) − xi ||2 .
We would then take the average of this error over all datapoints to get the loss.

xi f (xi ) g (f (xi )) ||g (f (xi )) − xi ||2


(1, 2, 3) 2.5 (2.5, 5, 7.5) 31.5
(2, 3, 4) 4 (4, 8, 12) 93
(−1, 0, 1) −0.5 (−0.5, −1, −1.5) 7.5
(0, 1, 1) 1 (1, 2, 3) 6
                                     Sum: 138

So, the loss will be 138/4 = 34.5.
∴ Answer is 34.5.
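A quick check of the reconstruction error; the encoder f(x) = (x1 + 2·x2)/2 below is the form the worked values in the table imply, and the variable names are ours:

```python
data = [(1, 2, 3), (2, 3, 4), (-1, 0, 1), (0, 1, 1)]

f = lambda x: (x[0] + 2 * x[1]) / 2    # encoder: maps 3 dims down to 1
g = lambda u: (u, 2 * u, 3 * u)        # decoder: maps back up to 3 dims

total = 0.0
for x in data:
    rec = g(f(x))
    total += sum((r - xi) ** 2 for r, xi in zip(rec, x))
error = total / len(data)   # 138 / 4 = 34.5
```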
Course: Machine Learning - Foundations
Week 2 - Practice Questions

1. Consider a function f(x) such that

   f(x) = sin(x)/x if x ≠ 0, and f(x) = 1 if x = 0.

Is f(x) continuous at x = 0?
A. False
B. True

Answer: B
Solution:
Now, f is continuous at x = 0 if lim_{x→0+} f(x) = lim_{x→0−} f(x) = f(0).

First we will compute the left hand limit at x = 0 :

LHL = lim_{x→0−} f(x)
    = lim_{h→0} f(0 − h)
    = lim_{h→0} sin(−h)/(−h)
    = lim_{h→0} sin(h)/h
    = 1

Next we will compute the right hand limit at x = 0 :

RHL = lim_{x→0+} f(x)
    = lim_{h→0} f(0 + h)
    = lim_{h→0} sin(h)/h
    = 1

∵ LHL = RHL = f(0) = 1 =⇒ f is continuous at x = 0.

Since the statement is true, option B is correct.



2. If U = [10, 100], A = [30, 50] and B = [50, 90], which of the following is/are false?
(Consider all values to be integers.)

A. A^c = [10, 30] ∪ [50, 100]
B. A^c = [10, 30) ∪ (50, 100]
C. A ∪ B = [30, 90]
D. A ∩ B = ∅
E. A ∩ B = {50}
F. A^c ∩ B^c = [10, 30) ∪ [91, 100]

Answer: A, D
Solution:
We know that,
Ac = U \ A
= [10, 100] \ [30, 50]
= [10, 30) ∪ (50, 100]
∴ Option A is false and option B is true.

Next,
A ∪ B = [30, 50] ∪ [50, 90]
= [30, 90]
∴ Option C is true.

Next,
A ∩ B = [30, 50] ∩ [50, 90]
= {50} ≠ ∅
∴ Option D is false and option E is true

Next,
Ac ∩ B c = (A ∪ B)c
= U \ (A ∪ B)
= [10, 100] \ [30, 90]
= [10, 30) ∪ (90, 100]
= [10, 30) ∪ [91, 100]
Finally, option F is true, so only options A and D are false.
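The set identities can be double-checked over the integer universe; a sketch (ours) using Python sets, where `range` upper bounds are exclusive:

```python
U = set(range(10, 101))   # [10, 100]
A = set(range(30, 51))    # [30, 50]
B = set(range(50, 91))    # [50, 90]

Ac = U - A                      # complement of A within U
union = A | B
inter = A & B
demorgan = (U - A) & (U - B)    # A^c ∩ B^c

ok_B = Ac == set(range(10, 30)) | set(range(51, 101))   # [10,30) ∪ (50,100]
ok_C = union == set(range(30, 91))                      # [30, 90]
ok_E = inter == {50}
ok_F = demorgan == set(range(10, 30)) | set(range(91, 101))
```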

3. Consider two d-dimensional vectors x and y and the following terms:


(i) xᵀy
(ii) x · y
(iii) Σ_{i=1}^{d} x_i y_i

Which of the above terms are equivalent?


A. Only (i) and (ii)
B. Only (ii) and (iii)
C. Only (i) and (iii)
D. (i), (ii) and (iii)

Answer: D
Solution:
We have x = [x1, x2, ..., xd]ᵀ and y = [y1, y2, ..., yd]ᵀ.

Now, xᵀy = x1 y1 + x2 y2 + ... + xd yd = Σ_{i=1}^{d} x_i y_i.

Also, by definition, x · y = Σ_{i=1}^{d} x_i y_i.

Therefore, all three terms are equivalent.

∴ Option D is correct.

4. The linear approximation of tan(x) around x = 0 is:


A. 1 + x
B. 1 − x
C. x − 1
D. x

Answer: D
Solution:
The linear approximation of a function f around x = a is given by

L(x) = f (a) + (x − a)f ′ (a)



Here, f(x) = tan(x) and a = 0. Substituting these in the above equation, we get

L(x) = tan(0) + (x − 0) sec²(0)    [∵ (tan x)′ = sec² x]
     = x

∴ Option D is correct
5. The partial derivative of x³ + y² w.r.t. x at x = 1 and y = 2 is ____.

Answer: 3
Solution:

The partial derivative of x³ + y² w.r.t. x is

∂/∂x (x³ + y²) = 3x²

∴ At (x, y) = (1, 2), it will be

∂/∂x (x³ + y²) |_{(x,y)=(1,2)} = 3

So, the answer is 3.

6. Consider the following function:

   f(x) = 7x + 2 if x > 1, and f(x) = 9 if x ≤ 1.

Is f(x) continuous?
A. Yes
B. No

Answer: A
Solution:
Now, ∀ x > 1, f is a linear function, which makes it continuous, and ∀ x < 1, f is
constant, which also makes it continuous. So, the only point we need to check is x = 1:

LHL = lim_{x→1−} f(x)
    = lim_{h→0} f(1 − h)
    = lim_{h→0} 9
    = 9

RHL = lim_{x→1+} f(x)
    = lim_{h→0} f(1 + h)
    = lim_{h→0} 7(1 + h) + 2
    = 9

∴ LHL = RHL = f(1) = 9 =⇒ f is continuous at x = 1 =⇒ f is continuous.
So, option A is correct.

7. Which of the following is the best approximation of e0.019 ? (Use linear approximation
around 0).
A. 1
B. 0
C. 0.019
D. 1.019

Answer: D
Solution:
The linear approximation of a function f around x = a is given by

L(x) = f (a) + (x − a)f ′ (a)

Here, f(x) = eˣ and a = 0. Substituting these values in L(x), we get

L(x) = e⁰ + (x − 0)e⁰    [∵ (eˣ)′ = eˣ]
     = 1 + x

=⇒ e^0.019 ≈ L(0.019) = 1 + 0.019 = 1.019.


∴ Option D is correct.
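The quality of the linear approximation can be checked numerically; a sketch (the name `linear_approx_exp` is ours):

```python
import math

def linear_approx_exp(x):
    # L(x) = e^0 + (x - 0) * e^0 = 1 + x, the tangent line at 0
    return 1 + x

approx = linear_approx_exp(0.019)    # 1.019
exact = math.exp(0.019)
gap = exact - approx                 # roughly x**2 / 2, so very small here
```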

8. What is the linear approximation of f (x, y) = x2 + y 2 around (1, 1)?


A. 2x + 2y + 2
B. 2x + 2y − 2
C. 2x + 2y + 1
D. 2x + 2y − 1

Answer: B

Solution:
The linear approximation of a function f around (x, y) = (a, b) is given by

L(x, y) = f(a, b) + [x − a, y − b] · ∇f(a, b)
        = f(a, b) + (x − a) fx(a, b) + (y − b) fy(a, b)

Here, (a, b) = (1, 1) and f(x, y) = x² + y²,

∇f(x, y) = [2x, 2y]ᵀ =⇒ ∇f(1, 1) = [2, 2]ᵀ

So the linear approximation is given by

L(x, y) = f(1, 1) + (x − 1)(2) + (y − 1)(2)
        = 2x + 2y − 2

∴ Option B is correct.

9. What is the gradient of f (x, y) = x2 y at (1, 3)?


A. [1, 6]
B. [6, 1]
C. [1, 3]
D. [3, 1]

Answer: B
Solution:
The gradient (∇f) of f(x, y) = x²y is

∇f(x, y) = [fx(x, y), fy(x, y)]ᵀ = [2xy, x²]ᵀ

∴ At (x, y) = (1, 3), it will be

∇f(1, 3) = [6, 1]ᵀ

So, option B is correct.

10. The directional derivative of f(x, y, z) = x² + 3y + z² at (1, 2, 1) along the unit vector
in the direction of [1, −2, 1] is ____.

Answer: −0.816
Solution:
The directional derivative is given by

D_û f(1, 2, 1) = ∇f(1, 2, 1) · û

First, let us compute the gradient:

∇f(x, y, z) = [fx, fy, fz]ᵀ = [2x, 3, 2z]ᵀ

∇f(1, 2, 1) = [2, 3, 2]ᵀ

The unit vector in the direction of u = [1, −2, 1]ᵀ is û = u/||u|| = (1/√6) [1, −2, 1]ᵀ.

∴ D_û f(1, 2, 1) = ∇f(1, 2, 1) · û = (1/√6)(2 − 6 + 2) = −2/√6 ≈ −0.816.

So, the answer is −0.816.
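The same dot-product computation in a short sketch (the helper name `directional_derivative` is ours):

```python
import math

def directional_derivative(grad, u):
    # D_u f = grad . (u / ||u||), the derivative along the unit vector in direction u
    norm = math.sqrt(sum(ui * ui for ui in u))
    return sum(g * ui for g, ui in zip(grad, u)) / norm

grad = (2 * 1, 3, 2 * 1)                        # grad f = (2x, 3, 2z) at (1, 2, 1)
d = directional_derivative(grad, (1, -2, 1))    # -2 / sqrt(6), about -0.816
```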

11. Find the direction of steepest ascent for the function x² + y³ + z⁴ at the point (1, 1, 1).

A. [2/√29, 3/√29, 4/√29]
B. [−2/√29, 3/√29, 4/√29]
C. [−2/√29, −3/√29, 4/√29]
D. [2/√29, −3/√29, 4/√29]

Answer: A
Solution:
The direction of the steepest ascent for any function is the direction of the gradient itself.
So let us compute the gradient first:

∇f(x, y, z) = [fx, fy, fz]ᵀ = [2x, 3y², 4z³]ᵀ

∇f(1, 1, 1) = [2, 3, 4]ᵀ

The direction of the steepest ascent, û, is the direction of the gradient. That is,

û = ∇f(1, 1, 1) / ||∇f(1, 1, 1)|| = (1/√29) [2, 3, 4]ᵀ.

∴ Option A is correct.
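A quick numerical check of the normalized gradient (variable names are ours):

```python
import math

# grad f = (2x, 3y**2, 4z**3); at (1, 1, 1) this is (2, 3, 4)
grad = (2 * 1, 3 * 1 ** 2, 4 * 1 ** 3)
norm = math.sqrt(sum(g * g for g in grad))   # sqrt(4 + 9 + 16) = sqrt(29)
u_hat = tuple(g / norm for g in grad)        # direction of steepest ascent
```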

12. The directional derivative of f(x, y, z) = x + y + z at the point (−1, 1, −1) along the unit
vector in the direction of [1, −1, 1] is ____.

Answer: 0.577
Solution:
The directional derivative is given by

D_û f(−1, 1, −1) = ∇f(−1, 1, −1) · û

First, let us compute the gradient:

∇f(x, y, z) = [fx, fy, fz]ᵀ = [1, 1, 1]ᵀ

∇f(−1, 1, −1) = [1, 1, 1]ᵀ

The unit vector in the direction of u = [1, −1, 1]ᵀ is û = u/||u|| = (1/√3) [1, −1, 1]ᵀ.

∴ D_û f(−1, 1, −1) = ∇f(−1, 1, −1) · û = (1/√3)(1 − 1 + 1) = 1/√3 ≈ 0.577

So, the answer is 0.577.

13. Which of the following is/are the vector equations of a line that passes through (1, 2, 3)
and (4, 0, 1)?
(i) [x, y, z] = [1, 2, 3] + α[3, −2, −2]
(ii) [x, y, z] = [4, 0, 1] + α[−3, 2, 2]
(iii) [x, y, z] = [1, 2, 3] + α[4, 0, 1]
(iv) [x, y, z] = [4, 0, 1] + α[1, 2, 3]

A. (i) and (ii)


B. (iii) and (iv)
C. (i) and (iii)
D. (ii) and (iv)

Answer: A
Solution:
The vector equation of a line passing through two points a and b is given by

[x, y, z] = a + α (b − a) , α ∈ R

Taking a = [1, 2, 3] and b = [4, 0, 1], we have

[x, y, z] = [1, 2, 3] + α ( [4, 0, 1] − [1, 2, 3] )


= [1, 2, 3] + α [3, −2, −2]

Now, taking a = [4, 0, 1] and b = [1, 2, 3], we have

[x, y, z] = [4, 0, 1] + α ( [1, 2, 3] − [4, 0, 1] )


= [4, 0, 1] + α [−3, 2, 2]

So, statements (i) and (ii) are correct.
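A small NumPy check (illustrative, not from the original solution) confirms that parametrizations (i) and (ii) trace the same line:

```python
import numpy as np

a, b = np.array([1.0, 2.0, 3.0]), np.array([4.0, 0.0, 1.0])

def line_i(alpha):    # [1, 2, 3] + α[3, −2, −2]
    return a + alpha * (b - a)

def line_ii(alpha):   # [4, 0, 1] + α[−3, 2, 2]
    return b + alpha * (a - b)

# Both pass through both points, just at different parameter values.
assert np.allclose(line_i(0.0), a) and np.allclose(line_i(1.0), b)
assert np.allclose(line_ii(0.0), b) and np.allclose(line_ii(1.0), a)
# A generic point of (i) lies on (ii) with the parameter 1 − α.
assert np.allclose(line_i(0.3), line_ii(0.7))
print("both parametrizations trace the same line")
```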

14. As per Cauchy-Schwarz inequality, if b is a -ve scalar multiple of a, then,


A. aT b ≥ ||a|| ∗ ||b||
B. aT b ≤ ||a|| ∗ ||b||
C. aT b = ||a|| ∗ ||b||
D. aT b = −||a|| ∗ ||b||

Answer: D
Solution:
The Cauchy-Schwarz inequality states

|aT b| ⩽ ||a|| ∗ ||b||

=⇒ − ||a|| ∗ ||b|| ⩽ aT b ⩽ ||a|| ∗ ||b||

Equality holds only in the boundary case when one vector is a scalar multiple of the
other. Since here b is a negative scalar multiple of a, we get

− ||a|| ∗ ||b|| = aT b

∴ Option D is correct.
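This boundary case is easy to verify numerically; a short NumPy sketch with an arbitrarily chosen vector:

```python
import numpy as np

a = np.array([1.0, -2.0, 3.0])
b = -2.5 * a                               # b is a negative scalar multiple of a
lhs = a @ b
rhs = np.linalg.norm(a) * np.linalg.norm(b)
print(np.isclose(lhs, -rhs))               # True: aᵀb = −||a||·||b||
```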
Course: Machine Learning - Foundations
Week 2 - Graded assignment

1. Which of the following functions is/are continuous?


A. 1/(x − 1)
B. (x² − 1)/(x − 1)
C. sign(x − 2)
D. sin(x)

Answer: D
Explanation: Option A is not defined at x = 1 therefore, it’ll have a breakpoint there.
Hence, not continuous.
In option B, the function is again not continuous at x = 1. One may try to simplify the
option as follows:

(x² − 1)/(x − 1) = (x − 1)(x + 1)/(x − 1)
Please note that you cannot cancel out (x − 1) here because you would be assuming that
x − 1 is not equal to 0. But, we get (x − 1) = 0 at x = 1. Here, limits exist but that
doesn’t necessarily mean that the function is continuous.
Option C is discontinuous at x = 2.
Option D is continuous at all points.

2. Regarding a d-dimensional vector x, which of the following four options is not equivalent
to the rest three options?
A. xᵀx
B. ||x||²
C. Σᵢ₌₁ᵈ xᵢ²
D. xxᵀ

Answer: D
Explanation:
x · x = xᵀx = Σᵢ₌₁ᵈ xᵢ²

||x|| = √(x₁² + x₂² + ... + x_d²)

⟹ ||x||² = x₁² + x₂² + ... + x_d² = Σᵢ₌₁ᵈ xᵢ²

xᵀx ≠ xxᵀ (the former is a scalar, while the latter is a d × d matrix).

Therefore, options A, B, and C are equivalent but option D is different.

3. Consider the following function:


f(x) = { 3x + 3, if x ≥ 3
       { 2x + 8, if x < 3

Which of the following is/are true?


A. f (x) is continuous at x = 3.
B. f (x) is not continuous at x = 3.
C. f (x) is differentiable at x = 3.
D. f (x) is not differentiable at x = 3.

Answer: B, D
Explanation:
f (x) is continuous at x = 3 if limx→3− f (x) = limx→3+ f (x) = f (3)

lim_{x→3⁻} f(x) = lim_{x→3⁻} (2x + 8) = 2(3) + 8 = 14

lim_{x→3⁺} f(x) = lim_{x→3⁺} (3x + 3) = 3(3) + 3 = 12

LHL ≠ RHL
Therefore, the function is not continuous at x = 3.

For a function to be differentiable at a point, the minimum requirement is that it be
continuous at that point. As our function is not continuous, it cannot be differentiable.
Hence, options B and D are the correct options.

4. Approximate the value of e^0.011 by linearizing e^x around x = 0.

Answer: 1.011
Explanation: To approximate the value of e^0.011 by linearizing e^x around x = 0, we can
use the first-order Taylor expansion of e^x around the point x = a, which is given by:

e^x ≈ e^a + e^a (x − a)

where a is the point around which we are linearizing (in this case, a = 0).
Using this approximation, we have:

e^0.011 ≈ e^0 + e^0 (0.011 − 0) = 1 + 1(0.011) = 1.011

Therefore, the approximate value of e^0.011 obtained by linearizing e^x around x = 0 is
1.011.

5. Approximate √3.9 by linearizing √x around x = 4.

Answer: 1.975
Explanation: To approximate the value of √3.9 by linearizing √x around x = 4, we can
use the first-order Taylor expansion of √x around the point x = a, which is given by:

√x ≈ √a + (1/(2√a)) (x − a)

Using this approximation with a = 4, we have:

√3.9 ≈ √4 + (1/(2√4)) (3.9 − 4) = 2 + (1/4)(−0.1) = 2 − 0.025 = 1.975

Therefore, the approximate value of √3.9 obtained by linearizing √x around x = 4 is
1.975.
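Both linearizations above can be wrapped in a small helper; the following NumPy sketch (the `linearize` helper is my own illustration) reproduces the two approximations:

```python
import numpy as np

def linearize(f, df, a):
    """First-order Taylor approximation of f around the point a."""
    return lambda x: f(a) + df(a) * (x - a)

approx_exp = linearize(np.exp, np.exp, 0.0)                       # e^x around 0
approx_sqrt = linearize(np.sqrt, lambda x: 1 / (2 * np.sqrt(x)), 4.0)  # √x around 4

print(approx_exp(0.011))    # 1.011
print(approx_sqrt(3.9))     # 1.975
```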

6. Which of the following pairs of vectors are perpendicular to each other?


A. [2, 3, 5] and [-2, 3, -1]
B. [1, 0, 1] and [0, 1, 1]
C. [1, 2, 0] and [0, 1, 2]
D. [0, 1, 0] and [0, 0, 1]
E. [2, -3, 5] and [-2, 3, -5]
F. [1, 0, 0] and [0, 1, 0]

Answer: A, D, F
Explanation: Two vectors are perpendicular to each other if and only if their dot
product equals 0.

Only options A, D, and F result in a dot product of 0. (In option E, the second vector
is the negative of the first, so the dot product is −(2² + 3² + 5²) = −38 ≠ 0.)
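The dot products can be checked mechanically; a short NumPy sketch:

```python
import numpy as np

pairs = {
    "A": ([2, 3, 5], [-2, 3, -1]),
    "B": ([1, 0, 1], [0, 1, 1]),
    "C": ([1, 2, 0], [0, 1, 2]),
    "D": ([0, 1, 0], [0, 0, 1]),
    "E": ([2, -3, 5], [-2, 3, -5]),
    "F": ([1, 0, 0], [0, 1, 0]),
}
for name, (u, v) in pairs.items():
    # The pair is perpendicular exactly when this dot product is 0.
    print(name, np.dot(u, v))
```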

7. What is the linear approximation of f(x, y) = x³ + y³ around (2, 2)?


A. 4x + 4y − 8
B. 12x + 12y − 32
C. 12x + 4y − 8
D. 12x + 12y + 32

Answer: B
Explanation:

∇f(x, y) = (3x², 3y²)  ⟹  ∇f(2, 2) = (12, 12)

L_{x*,y*}[f](x, y) = f(x*, y*) + ∇f(x*, y*)ᵀ · (x − x*, y − y*)
                   = 16 + 12(x − 2) + 12(y − 2)
                   = 12x + 12y − 32

8. What is the gradient of f(x, y) = x³y² at (1, 2)?


A. [12, 4]
B. [4, 12]
C. [1, 4]
D. [4, 1]

Answer: A
Explanation:

∇f(x, y) = (3x²y², 2x³y)  ⟹  ∇f(1, 2) = (3(1)²(2)², 2(1)³(2)) = (12, 4)

9. The gradient of f = x³ + y² + z³ at x = 0, y = 1 and z = 1 is given by,

A. [1, 2, 3]
B. [-1, 2, 3]
C. [0, 2, 3]
D. [2, 0, 3]

Answer: C
Explanation: The gradient of f = x³ + y² + z³ is given by:

∇f = (∂f/∂x, ∂f/∂y, ∂f/∂z)

Taking the partial derivatives:

∂f/∂x = 3x², ∂f/∂y = 2y, ∂f/∂z = 3z²

Evaluating these partial derivatives at x = 0, y = 1, and z = 1:

∂f/∂x (0, 1, 1) = 3(0)² = 0
∂f/∂y (0, 1, 1) = 2(1) = 2
∂f/∂z (0, 1, 1) = 3(1)² = 3

Therefore, the gradient ∇f(0, 1, 1) = [0, 2, 3].

10. For two vectors a and b, which of the following is true as per Cauchy-Schwarz inequality?
(i) aT b ≤ ||a|| ∗ ||b||
(ii) aT b ≥ −||a|| ∗ ||b||
(iii) aT b ≥ ||a|| ∗ ||b||
(iv) aT b ≤ −||a|| ∗ ||b||

A. (i) only
B. (ii) only
C. (iii) only
D. (iv) only
E. (i) and (ii)
F. (iii) and (iv)

Answer: E ((i) and (ii))



Explanation: According to Cauchy-Schwarz inequality:

−||a|| · ||b|| ≤ aT b ≤ ||a|| · ||b||

11. The directional derivative of f(x, y, z) = x³ + y² + z³ at (1, 1, 1) in the direction of
the unit vector along v = [1, −2, 1] is ______.

vector along v = [1, −2, 1] is .

Answer: 0.816
Explanation: The directional derivative is given by the dot product of the gradient at
a point with the unit vector along which the directional derivative is needed.

∇f(x, y, z) = (3x², 2y, 3z²)  ⟹  ∇f(1, 1, 1) = (3, 2, 3)

Next, let's find the unit vector along [1, −2, 1]. To do that, we divide the vector by its
magnitude: ∥[1, −2, 1]∥ = √(1² + (−2)² + 1²) = √6, so u = (1/√6) [1, −2, 1].

D_u[f] = ∇f(1, 1, 1) · u = (3, 2, 3) · (1/√6) [1, −2, 1] = 2/√6

Therefore, the directional derivative of f(x, y, z) at (1, 1, 1) in the direction of the unit
vector along [1, −2, 1] is 2/√6 ≈ 0.816.

12. The direction of steepest ascent for the function 2x + y³ + 4z at the point (1, 0, 1) is

A. [2/√20, 0, 4/√20]
B. [1/√29, 0, 1/√29]
C. [−2/√29, 0, 4/√29]
D. [2/√20, 0, −4/√20]

Answer: A
Explanation:
Let f(x, y, z) = 2x + y³ + 4z. Then

∇f(x, y, z) = (2, 3y², 4)  ⟹  ∇f(1, 0, 1) = (2, 0, 4)

To obtain the direction of steepest ascent, we need to normalize the gradient vector.
The magnitude of the gradient vector is:

∥∇f(1, 0, 1)∥ = √(2² + 0² + 4²) = √20 = 2√5

Therefore, the direction of steepest ascent for the function 2x + y³ + 4z at the point
(1, 0, 1) is [2/√20, 0, 4/√20].

13. The directional derivative of f (x, y, z) = x + y + z at (−1, 1, 0) in the direction of unit


vector along [1, -1, 1] is .

Answer: 0.577
Explanation: To find the directional derivative of f(x, y, z) = x + y + z at (−1, 1, 0) in
the direction of the unit vector along [1, −1, 1], we need to calculate the dot product of
the gradient of f at that point with the unit vector.

∇f(x, y, z) = (1, 1, 1)  ⟹  ∇f(−1, 1, 0) = (1, 1, 1)

Next, let's find the unit vector along [1, −1, 1]. To do that, we divide the vector by its
magnitude: ∥[1, −1, 1]∥ = √(1² + (−1)² + 1²) = √3. Therefore,

u = (1/√3) [1, −1, 1] = [1/√3, −1/√3, 1/√3]

D_u[f] = ∇f(−1, 1, 0) · u = (1, 1, 1) · [1/√3, −1/√3, 1/√3] = 1/√3

Therefore, the directional derivative of f(x, y, z) = x + y + z at (−1, 1, 0) in the direction
of the unit vector along [1, −1, 1] is 1/√3 ≈ 0.577.

14. Which of the following is the equation of the line passing through (7, 8, 6) in the direction
of vector [1, 2, 3]
A. [1, 2, 3] + α[−6, −6, 3]
B. [7, 8, 9] + α[−6, −6, 3]
C. [1, 2, 3] + α[6, 6, 3]
D. [7, 8, 6] + α[6, 6, 3]
E. [7, 8, 6] + α[1, 2, 3]
F. [1, 2, 3] + α[7, 8, 6]

Answer: E
Explanation: A line through the point u ∈ Rd along a vector v ∈ Rd is given by the
equation
x = u + αv
=⇒ x = [7, 8, 6] + α[1, 2, 3]
So, option E is the answer.
Course: Machine Learning - Foundations
Week 3: Practice questions

 
1. (1 point) What is the length of the vector [1, 1, −1]ᵀ?
A. 1.73
B. 1.71
C. 1.72
D. 1.74

Answer: A
Solution:
Using the definition of the length of a vector,

||[1, 1, −1]ᵀ|| = √(1² + 1² + (−1)²) = √3 ≈ 1.732

∴ Option A is correct.

   
2. (1 point) The inner product of [1, 0, 3]ᵀ and [−1, 2, 4]ᵀ is
A. 11
B. 12
C. 31
D. 20

Answer: A
Solution:
Using the definition of the standard inner product,

⟨[1, 0, 3]ᵀ, [−1, 2, 4]ᵀ⟩ = (1)(−1) + (0)(2) + (3)(4) = 11

∴ Option A is correct.
3. (1 point) The rank of the matrix A = [ 0 1 2 ; 1 2 1 ; 2 7 8 ] is
A. 0
B. 1
C. 2
D. 3

Answer: C
Solution:
To solve this question, we can first find the row echelon form R of A. Then the rank of
the matrix is the number of pivots (i.e., basic variables) in R.

A = [ 0 1 2 ; 1 2 1 ; 2 7 8 ]
  → (R1 ⇆ R2)       [ 1 2 1 ; 0 1 2 ; 2 7 8 ]
  → (R3 → R3 − 2R1) [ 1 2 1 ; 0 1 2 ; 0 3 6 ]
  → (R3 → R3 − 3R2) [ 1 2 1 ; 0 1 2 ; 0 0 0 ] = R

We can see that R has 2 pivots, so A has a rank of 2.

∴ Option C is correct.
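Instead of row-reducing by hand, the rank can also be checked with NumPy; a quick sketch covering the matrices of this and the next question:

```python
import numpy as np

A = np.array([[0, 1, 2],
              [1, 2, 1],
              [2, 7, 8]])
B = np.array([[1, 0, 2],
              [2, 1, 0],
              [3, 2, 1]])
print(np.linalg.matrix_rank(A))   # 2
print(np.linalg.matrix_rank(B))   # 3
print(round(np.linalg.det(B)))    # 3: non-zero determinant, so B is full rank
```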

 
4. (1 point) The rank of the matrix A = [ 1 0 2 ; 2 1 0 ; 3 2 1 ] is
A. 0
B. 1
C. 2
D. 3

Answer: D
Solution:
We can take the determinant of A, expanding along the first row:

det(A) = (1)[(1)(1) − (0)(2)] − 0 + (2)[(2)(2) − (1)(3)] = 1 + 2 = 3 ≠ 0

We can see that the determinant of this 3×3 matrix is non-zero. This implies the matrix
is full rank. That is, rank(A)= 3.

∴ Option D is correct.

5. (1 point) Can we span the entire 4-d space using the four column vectors given in the
following matrix?

 
[ 1 2 3 4 ; 0 2 2 0 ; 1 0 3 0 ; 0 1 0 4 ]
A. Yes
B. No

Answer: A
Solution:
Here, we wish to find out if the column space of a 4 × 4 matrix spans all of 4-d space.
This can happen if and only if the determinant of the matrix is non-zero. Expanding
the determinant along the fourth column,

det = −(4) det[ 0 2 2 ; 1 0 3 ; 0 1 0 ] + 0 − 0 + (4) det[ 1 2 3 ; 0 2 2 ; 1 0 3 ]

Expanding both 3 × 3 sub-determinants along their first columns,

det = (−4)(−1)[ (2)(0) − (2)(1) ] + (4)( (1)[ (2)(3) − (2)(0) ] + (1)[ (2)(2) − (2)(3) ] )
    = (4)(−2) + (4)(6 − 2) = 8 ≠ 0

We can see that the determinant is non-zero. This implies the matrix is full rank. That
is, the column space = R⁴.

∴ Option A is correct.

6. (1 point) What is the rank of the following matrix?

 
[ 2 6 8 ; 3 7 10 ; 4 8 12 ; 5 9 14 ]
A. 0
B. 1
C. 2
D. 3

Answer: C
Solution:
Let the above matrix be A. Now, to solve this question, we can first find the row echelon
form of A. Then, the rank of the matrix is the number of pivots (or basic variables) in R.

A = [ 2 6 8 ; 3 7 10 ; 4 8 12 ; 5 9 14 ]
  → (R1 → (1/2)R1)                 [ 1 3 4 ; 3 7 10 ; 4 8 12 ; 5 9 14 ]
  → (R2 → R2 − 3R1, R3 → R3 − 4R1,
     R4 → R4 − 5R1)                [ 1 3 4 ; 0 −2 −2 ; 0 −4 −4 ; 0 −6 −6 ]
  → (R2 → (−1/2)R2)                [ 1 3 4 ; 0 1 1 ; 0 −4 −4 ; 0 −6 −6 ]
  → (R3 → R3 + 4R2, R4 → R4 + 6R2) [ 1 3 4 ; 0 1 1 ; 0 0 0 ; 0 0 0 ] = R

We can see that R has 2 pivots, so A has a rank of 2.

∴ Option C is correct.

 
7. (1 point) The rank of the matrix [ 1 2 3 ; 2 3 6 ; 4 5 9 ] is

A. 1
B. 2
C. 3
D. 4

Answer: C
Solution:
Let us find the determinant of the matrix, expanding along the first row:

det = 1[ (3)(9) − (5)(6) ] − 2[ (2)(9) − (4)(6) ] + 3[ (2)(5) − (4)(3) ]
    = −3 − 2(−6) + 3(−2) = 3 ≠ 0

We can see that the determinant of this 3×3 matrix is non-zero. This implies the matrix
is full rank. That is, rank(A)= 3.

∴ Option C is correct.

8. (1 point) Rank of a 4 × 3 matrix is 2, what is the dimension of its null space?


A. 3
B. 1
C. 2
D. 4

Answer: B
Solution: To solve this question, we will make use of the rank nullity theorem which
states for any m × n matrix,

rank(A) + nullity(A) = n

Using the fact that here, n = 3 and rank(A) = 2, and substituting these values into the
above equation, we get nullity(A) = 1.

∴ Option B is correct.
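The theorem can be illustrated numerically; in the sketch below the 4 × 3 matrix is my own example with a deliberately dependent third column:

```python
import numpy as np

# A 4×3 matrix of rank 2: the third column is the sum of the first two.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 2.0],
              [2.0, 1.0, 3.0]])
n = A.shape[1]
rank = np.linalg.matrix_rank(A)
nullity = n - rank                 # rank-nullity: rank(A) + nullity(A) = n
print(rank, nullity)               # 2 1
```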

 
9. (1 point) Which of the following represents the row space of the matrix
[ 2 4 6 8 ; 1 3 0 5 ; 1 1 6 3 ]?
(Note: span{S} denotes the set of linear combinations of the elements of S)
A. span{ [1, 0, 9, 2]ᵀ, [0, 1, −3, 1]ᵀ }
B. span{ [9, 3, 1, 0]ᵀ, [−2, 1, 0, 1]ᵀ }
C. span{ [1, 0, −9, 0]ᵀ, [0, 1, 0, 1]ᵀ }
D. span{ [0, 3, 1, 0]ᵀ, [3, −1, 0, 1]ᵀ }
Answer: A
Solution:

To solve this question, we will make use of the fact that the row space of a matrix
does not change when applying row operations. Let the matrix given be A and its
reduced row echelon form be R. Then, rowspace(A) = rowspace(R).

A = [ 2 4 6 8 ; 1 3 0 5 ; 1 1 6 3 ]
  → (R1 → (1/2)R1)                [ 1 2 3 4 ; 1 3 0 5 ; 1 1 6 3 ]
  → (R2 → R2 − R1, R3 → R3 − R1)  [ 1 2 3 4 ; 0 1 −3 1 ; 0 −1 3 −1 ]
  → (R3 → R3 + R2)                [ 1 2 3 4 ; 0 1 −3 1 ; 0 0 0 0 ]
  → (R1 → R1 − 2R2)               [ 1 0 9 2 ; 0 1 −3 1 ; 0 0 0 0 ] = R

Since rowspace(A) = rowspace(R), a basis for the row space of A is given by the nonzero
rows of R. That is,

rowspace(A) = span{ [1, 0, 9, 2]ᵀ, [0, 1, −3, 1]ᵀ }

∴ Option A is correct.

 
10. (1 point) Find the projection matrix for v = [3, 3, 3]ᵀ.
1 1 1
3 3 3
A.  13 1
3
1
3
1 1 1
3 3 3
−1
1 1

3 3 3
B.  −1 1 −1 
3 3 3
1 −1 1
3 3 3
 −1 1 −1

3 3 3
1 −1 1 
C. 3 3 3
−1 1 −1
3 3 3
 −1 1 1

3 3 3
1 −1 1 
D. 3 3 3
1 1 −1
3 3 3

Answer: A
Solution:
The projection matrix P for a vector v is given by

P = v vᵀ / (vᵀ v)

Here, v = [3, 3, 3]ᵀ, so the outer product is

v vᵀ = [ 9 9 9 ; 9 9 9 ; 9 9 9 ]

And the inner product is vᵀ v = ||v||² = 3² + 3² + 3² = 27.

Finally, P is given by

P = (1/27) [ 9 9 9 ; 9 9 9 ; 9 9 9 ] = (1/3) [ 1 1 1 ; 1 1 1 ; 1 1 1 ]

∴ Option A is correct.
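The projection matrix can be built and sanity-checked with NumPy (an illustrative sketch):

```python
import numpy as np

v = np.array([3.0, 3.0, 3.0])
P = np.outer(v, v) / (v @ v)       # P = v vᵀ / (vᵀ v)
print(P)                            # every entry equals 1/3

# Projection matrices are idempotent and symmetric.
print(np.allclose(P @ P, P), np.allclose(P, P.T))   # True True
```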

11. (1 point) Find the projection of [1, −4, 2] along [1, −2, −3]
A. [3/14, −6/14, −9/14]
B. [3/14, 6/14, 9/14]
C. [−3/14, −6/14, −9/14]
D. [3/14, 6/14, −9/14]

Answer: A
Solution:
The projection of a vector a along a vector b is given by

proj_b a = (aᵀb / bᵀb) b

Here, a = [1, −4, 2]ᵀ and b = [1, −2, −3]ᵀ. Substituting these values into the above
expression, we get

proj_b a = [ (1)(1) + (−4)(−2) + (2)(−3) ] / [ 1² + (−2)² + (−3)² ] · [1, −2, −3]ᵀ
         = (3/14) [1, −2, −3]ᵀ
         = [3/14, −6/14, −9/14]ᵀ

∴ Option A is correct.
Course: Machine Learning - Foundations
Week 4: Test questions

1. (1 point) If P is a projection matrix, then the eigenvalue corresponding to every nonzero


vector orthogonal to the column space of P is
A. 0
B. 1
C. -1

Answer: A
Solution: If a vector b is orthogonal to the column space of P, then b is orthogonal to
the vector (say a) from which the projection matrix P is built.
That is, b ⊥ a, so aᵀb = 0 and P b = 0.
⇒ P b = 0 · b
Hence the eigenvalue is zero.
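A numerical illustration of this fact (my own sketch): for a rank-one projection matrix, any vector orthogonal to a is sent to the zero vector, and the spectrum is {0, 1}:

```python
import numpy as np

a = np.array([3.0, 3.0, 3.0])
P = np.outer(a, a) / (a @ a)            # projection onto the line through a

b = np.array([1.0, -1.0, 0.0])          # b ⊥ a, since a·b = 0
print(P @ b)                             # the zero vector, i.e. P b = 0·b
print(np.linalg.eigvalsh(P))             # eigenvalues 0, 0, 1 (ascending)
```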

2. (1 point) (Multiple select) Subtracting a multiple of one row from another


A. changes the determinant of the matrix.
B. changes the column space of the matrix.
C. changes the rank of the matrix.
D. changes the eigenvalues of the matrix.

Answer: D
Solution: Row operations do not change the rank or the column space of a matrix, so
options B and C are incorrect.

Subtracting a multiple of one row from another also leaves the determinant unchanged.
For example, let A have rows R1, R2, R3, and form B from A by replacing R2 with
R2 − R1. Since the determinant is linear in each row,

det(B) = det(A) − det(matrix with rows R1, R1, R3) = det(A) − 0 = det(A)

because the determinant of a matrix with two equal rows is zero.

Row operations do, however, change the eigenvalues, because the characteristic
polynomial changes. So, option D is correct.

3. (1 point) Consider the following statements regarding a real symmetric matrix A

1. The eigenvalues of A are always real.


2. The eigenvalues of A may be imaginary.
3. The eigenvectors corresponding to different eigenvalues of A are linearly indepen-
dent.
4. The eigenvectors corresponding to different eigenvalues of A are not linearly inde-
pendent.
5. A is orthogonally diagonalizable.
6. A is not diagonalizable.

Which of the above statements are true?


A. 2, 3 and 5
B. 1, 4 and 5
C. 2, 4 and 6
D. 1, 3 and 5

Answer: D
Solution: The eigenvalues of a real symmetric matrix are always real. Eigenvectors
corresponding to distinct eigenvalues of any matrix are linearly independent; if the
matrix is symmetric, they are moreover orthogonal.
Since the eigenvectors of a symmetric matrix can be chosen orthonormal, the
diagonalization of a symmetric matrix is an orthogonal diagonalization.

4. (1 point) The eigenvectors corresponding to distinct eigenvalues of a matrix


A. are linearly independent
B. are linearly dependent
C. have no relation

Answer: A
Solution: As mentioned in the previous question, eigenvectors corresponding to distinct
eigenvalues of a matrix are linearly independent.

5. (1 point) The determinant of a 3 × 3 matrix having eigenvalues 1, -2 and 3 is


A. 2

B. 0
C. 6
D. -6
E. -2

Answer: D
Solution: The product of the eigenvalues equals the determinant of the matrix. So, the
determinant is 1 × (−2) × 3 = −6.

6. (1 point) The trace of a 2 × 2 matrix is -1 and its determinant is -6. Its eigenvalues will
be
A. -1, 3
B. 2, 3
C. 2, -3
D. -2, 3

Answer: C
Solution: Trace (sum of diagonal elements) of a matrix is equal to the sum of the
eigenvalues.
Since the given matrix is of 2 × 2 dimension, it will have two eigen values.
λ1 + λ2 = −1, λ1 λ2 = −6, by solving we get λ1 = 2 and λ2 = −3

7. (1 point) If the eigenvalues of a matrix are -1, 0 and 4, then its trace and determinant
are
Trace:
Determinant:

Answer: 3, 0
Solution: Using the same relations as above, the trace is −1 + 0 + 4 = 3 and the
determinant is (−1) × 0 × 4 = 0.
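Both relations (trace = sum of eigenvalues, determinant = product of eigenvalues) can be confirmed with NumPy; the triangular matrix below is my own example with eigenvalues −1, 0, 4:

```python
import numpy as np

A = np.array([[-1.0, 2.0, 0.0],
              [ 0.0, 0.0, 5.0],
              [ 0.0, 0.0, 4.0]])       # upper triangular: eigenvalues −1, 0, 4
print(np.trace(A))                      # 3.0, the sum of the eigenvalues
print(np.linalg.det(A))                 # 0.0, the product of the eigenvalues
print(np.sort(np.linalg.eigvals(A)))    # eigenvalues −1, 0, 4
```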
 
8. (1 point) The characteristic polynomial for the matrix A = [ 1 1 ; 1 3 ] is

A. λ² − 4λ + 1
B. λ² − 4λ
C. λ² − 4λ − 2
D. λ² + 4λ + 2
E. λ² − 4λ + 2

Answer: E
Solution: For any matrix A, the characteristic polynomial equation is formed using the
relation det(A − λI) = 0:

det [ 1−λ  1 ; 1  3−λ ] = 0

By simplifying, we get λ² − 4λ + 2 = 0.
 
9. (1 point) The eigenvalues of matrix A = [ 1 1 ; 2 3 ] are

A. 2 + √3, 2 − √3
B. √3, −√3
C. 0, 1
D. √5, −√5

Answer: A
Solution: To find the eigenvalues, we solve the characteristic polynomial equation just
as in the previous problem, using det[ a b ; c d ] = ad − bc:

det [ 1−λ  1 ; 2  3−λ ] = 0
⟹ (1 − λ)(3 − λ) − 2 · 1 = 0
⟹ λ² − 4λ + 1 = 0
⟹ λ = 2 ± √3.

10. (2 points) If the eigenvalues of a matrix A are 0, −1 and 5, then the eigenvalues of
A³ are
A. 0, -1 and 5
B. 0, -1 and 125
C. 0, 1 and -125
D. 0, 1 and -5

Answer: B
Solution: If λ is an eigenvalue of A, then λᵏ is an eigenvalue of Aᵏ. Since the eigenvalues
of A are 0, −1, and 5, the eigenvalues of A³ are 0³ = 0, (−1)³ = −1, and 5³ = 125.
Hence option B is correct.
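A quick numerical check (the similarity construction is my own illustration) that cubing a matrix cubes its eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(1)
S = rng.standard_normal((3, 3))                         # generically invertible
A = S @ np.diag([0.0, -1.0, 5.0]) @ np.linalg.inv(S)    # eigenvalues 0, −1, 5
A3 = np.linalg.matrix_power(A, 3)
print(np.sort(np.linalg.eigvals(A3).real))              # ≈ [−1, 0, 125]
```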
11. (1 point) The 110th term of the Fibonacci sequence is approximately given by

A. (1/√5) ((1+√5)/2)¹¹⁰
B. (1/√5) ((1−√5)/2)¹¹⁰
C. (1/√5) ((1+√5)/2)⁻¹¹⁰
D. (−1/√5) ((1+√5)/2)¹¹⁰

Answer: A
Solution: From the lectures (Binet's formula),

F_k = (1/√5) ((1+√5)/2)ᵏ − (1/√5) ((1−√5)/2)ᵏ

Since |(1−√5)/2| < 1, the second term is negligible for large k, so

F₁₁₀ ≈ (1/√5) ((1+√5)/2)¹¹⁰

Hence, option A is correct.
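Binet's formula can be compared against the integer recurrence; a small sketch (function names are my own):

```python
import numpy as np

phi = (1 + np.sqrt(5)) / 2
psi = (1 - np.sqrt(5)) / 2

def fib(k):                     # exact integer recurrence, F_0 = 0, F_1 = 1
    a, b = 0, 1
    for _ in range(k):
        a, b = b, a + b
    return a

binet = (phi**10 - psi**10) / np.sqrt(5)
print(fib(10), round(binet))    # 55 55
# The psi term shrinks geometrically, so for large k a single term suffices:
print(abs(phi**30 / np.sqrt(5) - fib(30)))   # tiny (about 2e-07)
```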

12. (1 point) (Multiple Select) Let A be an n × n matrix. Which of the following statements
is/are false?
A. If A has r non-zero eigenvalues, then rank of A is at least r.
B. If one of the eigenvalues of A is zero, then |A| ≠ 0.
C. If x is an eigenvector of A, then so is every vector on the line through x.
D. If 0 is an eigenvalue of A, then A cannot be invertible.

Answer: B
Solution:
If a matrix has r non-zero eigenvalues, then its rank is at least r, hence option A is true.
If one of the eigenvalues is zero, then the determinant is zero, since the determinant is
the product of the eigenvalues. So, option B is false.
If x is an eigenvector, then cx is also an eigenvector for any non-zero real constant c.
So, option C is true.
If 0 is an eigenvalue, then the determinant is zero, hence A is not invertible. So, option
D is true.
 
13. (2 points) The eigenvalues of the matrix [ 1 2 ; 2 4 ] are

Answer: 0, 5
Solution: From the characteristic polynomial equation, (1 − λ)(4 − λ) − 4 = 0
⟹ λ² − 5λ = 0. Solving, we get the eigenvalues 0 and 5.

14. (2 points) (Multiple Select) For the matrix given in the previous question, which of the
following vectors is/are its eigenvector(s)?
 
A. [1, 2]ᵀ
B. [−2, 1]ᵀ
C. [1, 1]ᵀ
D. [1, −2]ᵀ

Answer: A, B
Solution: We will find an eigenvector for each of the eigenvalues 0 and 5.

For λ = 0:
We know that (A − λI)x = 0, so here Ax = 0. Writing x = [x1, x2]ᵀ,

[ 1 2 ; 2 4 ] [x1, x2]ᵀ = [0, 0]ᵀ  ⟹  x1 + 2x2 = 0 and 2x1 + 4x2 = 0

Taking x1 = 1 gives x2 = −0.5, so the eigenvectors are of the form c [1, −0.5]ᵀ. Among
the given options, B satisfies this with c = −2, that is, −2 [1, −0.5]ᵀ = [−2, 1]ᵀ.

For λ = 5:
Here the equation is (A − 5I)x = 0:

[ 1−5  2 ; 2  4−5 ] [x1, x2]ᵀ = [0, 0]ᵀ  ⟹  −4x1 + 2x2 = 0 and 2x1 − x2 = 0

If x1 = 1, then x2 = 2. Hence the eigenvectors are of the form c [1, 2]ᵀ. So, option A is
correct.
2

15. (3 points) Suppose that A, P are 3 × 3 matrices, and P is an invertible matrix.
If P⁻¹AP = [ −1 0 3 ; 0 3 8 ; 0 0 4 ], then the eigenvalues of the matrix A² are

Answer: 1,9,16
Solution: P⁻¹AP is an upper triangular matrix, so its eigenvalues are its diagonal
entries. Similar matrices have the same eigenvalues, hence the eigenvalues of A are −1,
3, and 4, and the eigenvalues of A² are (−1)², 3², and 4².

Eigenvalues of A² are 1, 9, and 16.

16. (2 points) The best second order polynomial that fits the data set

x y
0 0
1.3 1.5
4 1.2

is
A. 1.35x² + 0.3x
B. 1.25x² + 0.45x
C. −0.316x² + 1.56x
D. −0.25x² + 0.5

Answer: C
Solution: We are asked to find the second-order polynomial y = θ0 + θ1 x + θ2 x² that
best fits the data. This is linear regression with features 1, x, and x²:

Y = [0, 1.5, 1.2]ᵀ,  A = [ 1 0 0 ; 1 1.3 1.69 ; 1 4 16 ],  θ = [θ0, θ1, θ2]ᵀ

To minimize the squared error, θ = (AᵀA)⁻¹AᵀY. Since here A is square and invertible
(three data points, three parameters), this reduces to solving Aθ = Y exactly. The first
data point gives θ0 = 0, and the remaining two equations,

1.3 θ1 + 1.69 θ2 = 1.5
4 θ1 + 16 θ2 = 1.2

give θ2 ≈ −0.316 and θ1 ≈ 1.56. So y ≈ −0.316x² + 1.56x.
So, option C is correct.
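The least-squares fit can be reproduced with NumPy's `lstsq` (an illustrative sketch of the computation, not the original worked matrices):

```python
import numpy as np

x = np.array([0.0, 1.3, 4.0])
y = np.array([0.0, 1.5, 1.2])

# Design matrix for y = θ0 + θ1·x + θ2·x²
A = np.vstack([np.ones_like(x), x, x**2]).T
theta, *_ = np.linalg.lstsq(A, y, rcond=None)
print(theta)    # ≈ [0, 1.565, -0.316], matching option C
```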
Course: Machine Learning - Foundations
Week 5: Practice questions

 
1. (1 point) The complex conjugate of matrix A = [ 1−i  1−3i ; 6+4i  35−2i ] is
 
A. [ 1−i  1−3i ; 6+4i  35−2i ]
B. [ 1+i  1+3i ; 6−4i  35+2i ]
C. [ −1+i  −1−3i ; −6+4i  −35−2i ]
D. [ 1−i  1−3i ; 6−4i  35−2i ]

Answer: B
Explanation: The complex conjugate of a matrix is given by simply taking the conju-
gate of all the components of the matrix. That is,
   
A = [ 1−i  1−3i ; 6+4i  35−2i ]  ⟹  Ā = [ 1+i  1+3i ; 6−4i  35+2i ]

∴ Option B is correct.

 
2. (1 point) The complex conjugate transpose of matrix A = [ 3−2i  5+i ; 1+4i  7−2i ] is
 
A. [ 7+i  5+4i ; 3−i  3−2i ]
B. [ 5−i  3−4i ; 1+i  7+2i ]
C. [ 3+i  5−i ; 1+4i  7−2i ]
D. [ 3+2i  1−4i ; 5−i  7+2i ]

Answer: D
Explanation: The complex conjugate transpose of a matrix is given by simply taking
the transpose of the complex conjugate of the matrix. That is,

A* = (Ā)ᵀ = [ 3+2i  1−4i ; 5−i  7+2i ]

∴ Option D is correct.

   
3. (1 point) The inner product of x = [1−i, 2i]ᵀ and y = [−1−i, i]ᵀ is
A. 7 − 6i
B. 4 − 4i
C. 2 − 2i
D. 3 + 4i

Answer: C
Explanation: The inner product between two complex vectors x and y is given by
multiplying the complex conjugate transpose of one vector by the other. That is,

⟨x, y⟩ = x*y = [1+i, −2i] [−1−i, i]ᵀ
       = (1+i)(−1−i) + (−2i)(i)
       = −2i + 2
       = 2 − 2i
∴ Option C is correct.
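NumPy's `vdot` implements exactly this conjugate-then-dot inner product; a one-line check:

```python
import numpy as np

x = np.array([1 - 1j, 2j])
y = np.array([-1 - 1j, 1j])
z = np.vdot(x, y)        # vdot conjugates its first argument: x* y
print(z)                  # (2-2j)
```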

 
4. (1 point) The square of the length of vector x = [2−i, 4−i]ᵀ is
A. 16
B. 17
C. 31
D. 22

Answer: D
Explanation: The squared length of a vector x is given by taking the inner product of
x with itself. So,

||x||2 = < x, x >


= x∗ x
 
  2−i
= 2+i 4+i
4−i
= 22 + 12 + 42 + 1 2
= 22
Course: Machine Learning - Foundations Page 3 of 10

∴ Option D is correct.

5. (1 point) The matrix A = [ (1+i)/√3  (1+i)/√6 ; i/√3  2i/√6 ] is unitary.
A. True
B. False

Answer: B
Explanation: The condition for A to be unitary is that its columns must be pairwise
orthogonal and of unit length. Let us check the orthogonality condition.

Let the columns of the matrix A be a1 and a2. For orthogonality, the inner product
must be 0:

⟨a1, a2⟩ = a1* a2 = ((1−i)/√3)((1+i)/√6) + (−i/√3)(2i/√6)
         = 2/√18 + 2/√18
         = 4/(3√2)
         ≠ 0

Since A does not have orthogonal columns, it is not unitary.

∴ Option B is correct.
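The unitarity test A*A = I can be run directly in NumPy (illustrative sketch):

```python
import numpy as np

A = np.array([[(1 + 1j) / np.sqrt(3), (1 + 1j) / np.sqrt(6)],
              [1j / np.sqrt(3),       2j / np.sqrt(6)]])
ok = np.allclose(A.conj().T @ A, np.eye(2))   # unitary iff A* A = I
print(ok)                                      # False: the columns are not orthogonal
```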

 
6. (1 point) The matrix Z = [ 1 2 3 ; 2 4 5 ; 3 5 6 ] is Hermitian.
A. True
B. False

Answer: A (True)
Explanation: By definition, a given matrix A is Hermitian if and only if A* = A. Let
us check this for Z. Since Z is real and symmetric,

Z* = Z̄ᵀ = Zᵀ = Z

So, Z is Hermitian.
∴ Option A is correct.

7. (1 point) (Multiple select) Which of the following matrices are Hermitian?


 
A. [ 1  3−i ; 3+i  i ]
B. [ 0  3−2i ; 3−2i  4 ]
C. [ 3  2−i  −3i ; 2+i  0  1−i ; 3i  1+i  0 ]
D. [ −1 2 3 ; 2 0 −1 ; 3 −1 4 ]

Answer: C, D
Explanation: By definition, a given matrix A is hermitian if and only if A∗ = A. Let
us check this for all the options.

Option A.

A* = Āᵀ = [ 1  3−i ; 3+i  −i ] ≠ A

Since the matrix is not Hermitian, option A will not be marked.

Option B.

B* = B̄ᵀ = [ 0  3+2i ; 3+2i  4 ] ≠ B

Since the matrix is not Hermitian, option B will not be marked.

Option C.

C* = C̄ᵀ = [ 3  2−i  −3i ; 2+i  0  1−i ; 3i  1+i  0 ] = C

Since the matrix is Hermitian, option C will be marked.

Option D.

D* = D̄ᵀ = [ −1 2 3 ; 2 0 −1 ; 3 −1 4 ] = D

Since the matrix is Hermitian, option D will be marked.

 
8. (1 point) The eigenvalues of matrix A = [ 3  2−i  −3i ; 2+i  0  1−i ; 3i  1+i  0 ] are
A. -1,-6 and 2
B. 1, -6 and -2
C. 1, 6 and 2
D. -1, 6 and -2

Answer: D
Explanation: For this question, we can use the fact that the trace of a matrix is the
same as the sum of the eigenvalues. Here,

tr(A) = 3 + 0 + 0 = 3

Out of these options, only option D gives a sum of 3 for the eigenvalues to match the
trace of A.
∴ Option D is correct.

 
9. (2 points) Let A = k [ 1+i  1−i ; 1−i  1+i ], where k ∈ R. A is unitary if k is
A. 1/2
B. 1
C. 1/4
D. 1/8

Answer: A
Explanation: To solve this question, we will use the fact that, for a unitary matrix A,
its columns must be of unit length.

Here, let a1 = [k(1+i), k(1−i)]ᵀ. Then we need k such that ||a1|| = 1:

||a1||² = ⟨a1, a1⟩ = a1* a1
        = k² [ (1−i)(1+i) + (1+i)(1−i) ]
        = k² (1² + 1² + 1² + 1²)
        = 4k²

So, we have 4k² = 1² ⟹ k = 1/2.
2
∴ Option A is correct.

10. (2 points) Let A = (1/2) [ 1+i  √k ; 1−i  √k·i ], where k ∈ R. A is unitary if k is

A. 1/2
B. 1
C. 2
D. 1/4

Answer: C
Explanation: To solve this question, we will use the fact that, for a unitary matrix A,
its columns must be of unit length.
Here, let a2 = (1/2) [√k, √k·i]ᵀ. Then we need k such that ||a2|| = 1:

||a2||² = ⟨a2, a2⟩ = a2* a2
        = (1/4) [ √k·√k + (−√k·i)(√k·i) ]
        = (1/4)(k + k)
        = k/2

So, we have k/2 = 1² ⟹ k = 2.
∴ Option C is correct.
11. (3 points) A matrix A = [ 2  1+i ; 1−i  3 ] can be written as A = UDU*, where U is
a unitary matrix and D is a diagonal matrix. Then, U and D respectively are
A. U = [ (−1−i)/√3  (1+i)/√6 ; 1/√3  2/√6 ],  D = [ 1 0 ; 0 4 ]
B. U = [ (−1+i)/√3  (−1+i)/√6 ; 1/√3  2/√6 ],  D = [ 1 0 ; 0 4 ]
C. U = [ (−1+i)/√3  (−1+i)/√6 ; −1/√3  −2/√6 ],  D = [ 1 0 ; 0 4 ]
D. U = [ (−1+i)/√3  (−1+i)/√6 ; −1/√3  −2/√6 ],  D = [ −1 0 ; 0 −4 ]

Answer: A
Explanation: Since the matrix A can be written as UDU*, A is unitarily diagonalizable.
Let us start by finding the eigenvalues and eigenvectors. The characteristic polynomial
is c(λ):

c(λ) = |A − λI| = (2−λ)(3−λ) − (1+i)(1−i) = λ² − 5λ + 4  ⟹  λ = 1, 4

So, the eigenvalues are 1 and 4, giving D = [ 1 0 ; 0 4 ].

Let the eigenvectors be v1 and v2. For λ = 1, (A − I)x = 0 reduces to x₁ + (1+i)x₂ = 0,
so we can take v1 = [−1−i, 1]ᵀ. For λ = 4, (A − 4I)x = 0 reduces to −2x₁ + (1+i)x₂ = 0,
so we can take v2 = [1+i, 2]ᵀ.

Converting v1 and v2 to unit vectors (||v1|| = √3, ||v2|| = √6) and putting them in a
matrix, we get

U = [ (−1−i)/√3  (1+i)/√6 ; 1/√3  2/√6 ]
∴ Option A is correct.
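Because A is Hermitian, `np.linalg.eigh` produces this unitary diagonalization directly (a sketch; eigh orders eigenvalues ascending and may pick different eigenvector phases than the hand computation):

```python
import numpy as np

A = np.array([[2, 1 + 1j],
              [1 - 1j, 3]])
# eigh returns real eigenvalues (ascending) and orthonormal eigenvectors.
w, U = np.linalg.eigh(A)
print(w)                                              # [1. 4.]
print(np.allclose(U @ np.diag(w) @ U.conj().T, A))    # True: A = U D U*
```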

12. (1 point) (Multiple select) Which of the following matrices is/are unitary?
A. [ 1/√2  i/√2 ; −i/√2  1/√2 ]
B. [ −1/√2  i/√2 ; i/√2  1/√2 ]
C. [ 1/√2  −i/√2 ; i/√2  1/√2 ]
D. [ 1/√2  i/√2 ; i/√2  −1/√2 ]
E. [ 1/√2  i/√2 ; i/√2  1/√2 ]

Answer: E
Explanation: For a matrix to be unitary, the columns of the matrix must be pairwise
orthogonal (and of unit length). Let us check orthogonality for each option, taking the
inner product of the two columns (conjugate the first, then dot with the second).

Option A: ⟨a1, a2⟩ = (1/2)[ (1)(i) + (i)(1) ] = i ≠ 0
Option B: ⟨a1, a2⟩ = (1/2)[ (−1)(i) + (−i)(1) ] = −i ≠ 0
Option C: ⟨a1, a2⟩ = (1/2)[ (1)(−i) + (−i)(1) ] = −i ≠ 0
Option D: ⟨a1, a2⟩ = (1/2)[ (1)(i) + (−i)(−1) ] = i ≠ 0
Option E: ⟨a1, a2⟩ = (1/2)[ (1)(i) + (−i)(1) ] = 0

Out of all the options, only the matrix in option E has orthogonal columns. Further,
we can calculate using the inner product that the columns are also of unit length:

|| [1/√2, i/√2]ᵀ || = || [i/√2, 1/√2]ᵀ || = 1
∴ Only option E is correct.

13. (1 point) Let U and V be two symmetric matrices. Consider the following statements:

1. U V is symmetric.
2. U + V is symmetric.

Then,
A. both statements are true.
B. both statements are false.
C. 1. is false.
D. 2. is false.

Answer: C
Explanation: Let us look at both statements.

Statement 1.

(UV)ᵀ = VᵀUᵀ = VU, which in general is not equal to UV.

So, statement 1 is false.

Statement 2.

(U + V)ᵀ = Uᵀ + Vᵀ = U + V

So, statement 2 is true.

∴ Option C is correct.
Course: Machine Learning - Foundations
Week 5: Test questions

1. (1 point) Consider two non-zero vectors x ∈ Cⁿ and y ∈ Cⁿ. Suppose the inner product
between x and y obeys the commutative property (i.e., x · y = y · x). This implies that
A. y must be a conjugate transpose of x
B. y is equal to x
C. y must be orthogonal to x
D. y must be a scalar (possibly complex) multiple of x

Explanation: Let us look at each option and see what makes the commutative property
hold.

Option A.
Assuming y = x∗
x· y = x∗ y = x∗ x∗
This multiplication is not defined, so option A is incorrect.

Option B.
Assuming y = x
x· y = y· x
This is trivially true, so option B is correct.

Option C.
Assuming x∗ y = 0
x· y = x∗ y = 0 = y ∗ x = y· x
So, option C is correct.

Option D.
Assuming y = zx where z ∈ C,
x · y = x∗ y = x∗ (zx) = z x∗ x, whereas y · x = (zx)∗ x = z̄ x∗ x.
These are equal only when z is real, so in general x · y ≠ y · x, and option D is incorrect.

∴ Options B, C are correct.

2. (1 point) The inner product of two distinct vectors x and y that are drawn randomly
from C100 is 0.8 − 0.37i.The vector x is scaled by a scalar 1 − 2i to obtain a new vector
z, then the inner product between z and y is

A. 0.06 − 1.97i
B. 1.54 − 1.23i
C. 1.54 + 1.23i
D. 0.8 − 0.37i
E. Not possible to calculate

Explanation: From the question, it is given that x∗ y = 0.8 − 0.37i. Let us replace x
with (1 − 2i)x and compute:

[(1 − 2i)x]∗ y = conj(1 − 2i) x∗ y = (1 + 2i) x∗ y = (1 + 2i)(0.8 − 0.37i) = 1.54 + 1.23i

∴ Option C is correct.
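The conjugate-linearity rule used above can also be checked numerically; the random vectors below are fabricated purely for the check:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100) + 1j * rng.standard_normal(100)
y = rng.standard_normal(100) + 1j * rng.standard_normal(100)

# np.vdot conjugates its first argument, i.e. it computes x* y,
# so <(1-2i)x, y> must equal conj(1-2i) <x, y> = (1+2i) <x, y>.
assert np.isclose(np.vdot((1 - 2j) * x, y), (1 + 2j) * np.vdot(x, y))

# With the question's value x* y = 0.8 - 0.37i:
print((1 + 2j) * (0.8 - 0.37j))  # approximately 1.54 + 1.23i
```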

3. (1 point) Select the correct statement(s). The eigenvalue decomposition for the matrix
A = [0, −1; 1, 0]
A. doesn’t exist over R but exists over C
B. doesn’t exist over C but exists over R
C. neither exists over R nor exists over C
D. exists over both C and R

Explanation: Let us find the eigenvalues of the matrix A.

|A − λI| = (−λ)2 − (−1) = λ2 + 1 =⇒ λ = i, −i

Here, we can see that the eigenvalues are complex. So, the decomposition doesn’t exist
over R but since there are two distinct eigenvalues, the decomposition will exist over C.
∴ Option A is correct.

 
4. (1 point) Consider the complex matrix
S = [1, 1+i, −2−2i; 1−i, 1, −i; −2+2i, i, 1].
The matrix is
A. Hermitian and Symmetric
B. Symmetric but not Hermitian
C. Neither Hermitian nor Symmetric
D. Hermitian but not Symmetric

Explanation: For this question, we can check both conditions. We can see that here
S T ̸= S, so S is not symmetric. But taking the complex conjugate transpose, S ∗ = S,
so S is hermitian.

∴ Option D is correct.

5. (1 point) Suppose that a unitary matrix U is multiplied by a diagonal matrix D with
dii ∈ R; then the resultant matrix will always be unitary. The statement is
A. True
B. False

Explanation: The requirement for a matrix U to be unitary is that its columns must
be pairwise orthogonal and of unit length.

Here, consider the diagonal matrix D = 2I, say. Multiplying U by D doubles
every column of U, which changes the lengths of the columns.
So, the resultant matrix will not be unitary.

∴ Option B is correct.
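A minimal numerical counterexample (the choice D = 2I and the particular U are assumptions made for the illustration):

```python
import numpy as np

U = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2)  # a real unitary matrix
D = 2 * np.eye(2)                                     # diagonal with d_ii in R
M = U @ D

# The columns of M have length 2, so M* M = 4I != I
print(np.allclose(M.conj().T @ M, np.eye(2)))  # False
```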

 
6. (3 points) The eigenvectors of matrix A = [3, 2−i, −3i; 2+i, 0, 1−i; 3i, 1+i, 0] are
A. [−1, 1+2i, 1]ᵀ, [1−21i, 6−9i, 13]ᵀ, [1+3i, −2−i, 5]ᵀ
B. [1, 1−2i, 1]ᵀ, [1−21i, 6−9i, 13]ᵀ, [1+3i, −2−i, 5]ᵀ
C. [−1, 1−2i, −1]ᵀ, [1−21i, 6−9i, 13]ᵀ, [1+3i, −2−i, 5]ᵀ
D. [−1, 1+2i, 1]ᵀ, [1−21i, 6−9i, 13]ᵀ, [1−3i, 2−i, −5]ᵀ

Explanation: To solve this question, we can use trial and error from the options and
see which vectors only get scaled upon multiplication by A:

A [−1, 1+2i, 1]ᵀ = [1, −1−2i, −1]ᵀ = −1 · [−1, 1+2i, 1]ᵀ
A [1−21i, 6−9i, 13]ᵀ = [6−126i, 36−54i, 78]ᵀ = 6 · [1−21i, 6−9i, 13]ᵀ
A [1+3i, −2−i, 5]ᵀ = [−2−6i, 4+2i, −10]ᵀ = −2 · [1+3i, −2−i, 5]ᵀ

So, all the vectors from option A are eigenvectors, as upon multiplication with the matrix
A each vector gets scaled (by −1, 6 and −2, respectively).

∴ Option A is correct.
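These three matrix-vector products can be confirmed numerically (a check, not part of the original solution):

```python
import numpy as np

A = np.array([[3, 2 - 1j, -3j],
              [2 + 1j, 0, 1 - 1j],
              [3j, 1 + 1j, 0]])
vectors = [np.array([-1, 1 + 2j, 1]),
           np.array([1 - 21j, 6 - 9j, 13]),
           np.array([1 + 3j, -2 - 1j, 5])]
# Eigenvalues -1, 6 and -2, as found above
for v, lam in zip(vectors, [-1, 6, -2]):
    assert np.allclose(A @ v, lam * v)
print("all three option-A vectors are eigenvectors")
```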

7. (1 point) A matrix A = (1/2)[k+i, √2; k−i, √2 i] is unitary if k is
A. 1/2
B. 1
C. −1/2
D. −1
E. ±1
F. ±1/2

Explanation: To solve this question, we will use the fact that, for a unitary matrix A,
its columns must be pairwise orthogonal. That is, we need

0 = < a1, a2 >
= a1∗ a2
= (1/4) [k − i, k + i] [√2, √2 i]ᵀ
= (1/4) (√2(k − i) + √2 i(k + i))
= ((√2 k − √2) + i(√2 k − √2)) / 4

So, we have √2 k − √2 = 0 =⇒ k = 1.

∴ Option B is correct.

 
8. (3 points) A matrix A = [1, 1+i; 1−i, 1] can be written as A = U DU ∗, where U is a
unitary matrix and D is a diagonal matrix. Then, U and D, respectively, are

A. U = (1/2)[1+i, −1−i; √2, √2], D = [1+√2, 0; 0, 1−√2]
B. U = (1/2)[−1+i, −1−i; √2, √2], D = [1+√2, 0; 0, 1−√2]
C. U = (1/2)[1+i, −1−i; √2, √2], D = [−1+√2, 0; 0, 1−√2]
D. U = (1/2)[1−i, √2; −√2, 1−i], D = [1+√2, 0; 0, 1−√2]

Explanation: Since the matrix A can be written as U DU ∗, A is unitarily diagonalizable.

Let us start by finding the eigenvalues and eigenvectors. The characteristic polynomial
is c(λ):

c(λ) = |A − λI| = |1−λ, 1+i; 1−i, 1−λ| = (1 − λ)² − 2 = λ² − 2λ − 1 =⇒ λ = 1 + √2, 1 − √2

So, the eigenvalues are 1 + √2 and 1 − √2 =⇒ D = [1+√2, 0; 0, 1−√2].
Let the eigenvectors be v1 and v2. Then,

E_{λ=1+√2} = null([−√2, 1+i; 1−i, −√2]) = null([1, −(1+i)/√2; 0, 0]) =⇒ v1 = [(1+i)/√2, 1]ᵀ

E_{λ=1−√2} = null([√2, 1+i; 1−i, √2]) = null([1, (1+i)/√2; 0, 0]) =⇒ v2 = [−(1+i)/√2, 1]ᵀ

Converting v1 and v2 to unit vectors (each has length √2) and putting them in a matrix, we get

U = [(1+i)/2, (−1−i)/2; 1/√2, 1/√2]

∴ Option A is correct.
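The decomposition found above can be verified numerically:

```python
import numpy as np

A = np.array([[1, 1 + 1j], [1 - 1j, 1]])
U = np.array([[(1 + 1j) / 2, -(1 + 1j) / 2],
              [1 / np.sqrt(2), 1 / np.sqrt(2)]])
D = np.diag([1 + np.sqrt(2), 1 - np.sqrt(2)])

assert np.allclose(U.conj().T @ U, np.eye(2))  # U is unitary
assert np.allclose(U @ D @ U.conj().T, A)      # A = U D U*
print("A = U D U* verified")
```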


9. (2 points) The matrix Z = [0, −1; 1, 0] has
A. only real eigenvalues.
B. one real and one complex eigenvalue.
C. no real eigenvalues.

Explanation: Let us compute the eigenvalues of Z.

|Z − λI| = |−λ, −1; 1, −λ| = λ² + 1 =⇒ λ = ±i
∴ Option C is correct.

10. (1 point) (Multiple select) Which of the following matrices is/are unitary?
A. [cos θ, − sin θ; sin θ, − cos θ]
B. [cos θ, sin θ; sin θ, cos θ]
C. [− cos θ, sin θ; sin θ, cos θ]
D. [cos θ, sin θ; − sin θ, cos θ]
E. [− cos θ, sin θ; sin θ, − cos θ]

Explanation: The conditions for a particular matrix A to be unitary is that the columns
need to be of unit length and pairwise orthogonal.

We can see here that due to the property sin²θ + cos²θ = 1, all matrices here satisfy the first condition.
For the second condition, we need the inner product of the columns to come out to 0.
This will only be true for options C and D.

∴ Option C, D are correct.
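The same conclusion can be reached numerically at a sample angle (θ = 0.7 is an arbitrary choice for the check):

```python
import numpy as np

t = 0.7
c, s = np.cos(t), np.sin(t)
options = {
    "A": np.array([[c, -s], [s, -c]]),
    "B": np.array([[c, s], [s, c]]),
    "C": np.array([[-c, s], [s, c]]),
    "D": np.array([[c, s], [-s, c]]),
    "E": np.array([[-c, s], [s, -c]]),
}
unitary = [name for name, M in options.items()
           if np.allclose(M.T @ M, np.eye(2))]
print(unitary)  # ['C', 'D']
```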

11. (2 points) Let U and V be two unitary matrices. Then

1. U V is unitary.
2. U + V is unitary.

A. Both statements are true.


B. Both statements are false.
C. 1. is false.
D. 2. is false.

Explanation: The conditions for a particular matrix A to be unitary is that the columns
need to be of unit length and pairwise orthogonal. These conditions are captured in the
property that for a unitary matrix U , U ∗ = U −1 . For the first statement, we have

(U V )∗ = V ∗ U ∗ = V −1 U −1 = (U V )−1

This shows that U V is a unitary matrix. However the same cannot be said for statement
2 because of the fact that U −1 + V −1 ̸= (U + V )−1
∴ Option D is correct.

12. (2 points) (Multiple select) Which of the following is/are eigenvectors of the Hermitian
matrix A = [1, 1+i; 1−i, 2]?
A. [−1−i, 1]ᵀ
B. [−2−2i, 2]ᵀ
C. [(1+i)/2, 1]ᵀ
D. [1+i, 2]ᵀ
E. All of these.

Explanation: For this question, we can simply multiply the vector in each of the options
by the matrix A. If this multiplication has the effect of scaling the vector, that
implies it is an eigenvector.

Option A.
[1, 1+i; 1−i, 2][−1−i, 1]ᵀ = [0, 0]ᵀ
This corresponds to the eigenvalue 0, so option A is correct.

Option B.
This vector is simply a multiple of the vector in option A, so it is also an eigenvector.

Option C.
[1, 1+i; 1−i, 2][(1+i)/2, 1]ᵀ = 3 [(1+i)/2, 1]ᵀ
This corresponds to the eigenvalue 3, so option C is correct.

Option D.
This vector is simply a multiple of the vector in option C, so it is also an eigenvector.

∴ All options are correct.
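A compact numerical check of all four options:

```python
import numpy as np

A = np.array([[1, 1 + 1j], [1 - 1j, 2]])
# Options A, B belong to eigenvalue 0; options C, D to eigenvalue 3
for v, lam in [(np.array([-1 - 1j, 1]), 0),
               (np.array([-2 - 2j, 2]), 0),
               (np.array([(1 + 1j) / 2, 1]), 3),
               (np.array([1 + 1j, 2]), 3)]:
    assert np.allclose(A @ v, lam * v)
print("all four option vectors are eigenvectors")
```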


Week 6 Practice Assignment Solution

1. f (x, y) = 2xy + y 2
fx = 2y = 0 ⇒ y = 0
fy = 2x + 2y = 0 ⇒ x = 0
So, the stationary point is (0, 0)

2. A = [4, 1, −1; 1, 2, 1; −1, 1, 2]
The characteristic equation is

λ3 − 8λ2 + 17λ − 6 = 0

Solving we get
λ1 = 0.438
λ2 = 4.56
λ3 = 3
Since all eigenvalues are greater than 0, A is positive definite.

3. A = [6, 2; 2, 1]
Here, since a = 6 > 0 and ac − b² = 6 − 4 = 2 > 0, A is positive definite.

4.
f (x, y) = 4 + x3 + y 3 − 3xy
fx = 3x2 − 3y and fy = 3y 2 − 3x

Solving fx = 0 and fy = 0 gives x = y = 1


So, f has a stationary point at (1, 1).
5. Given,
x2 − z 2 + 2yz + 2zx
In the vᵀAv representation, the matrix A has the coefficients of x², y² and z²
as its diagonal elements.
Only option (D) satisfies this.

6.
f (x, y) = −3x² − 6xy − 6y²
To check at (0, 0):
First derivative test:
fx = −6x − 6y = 0 at (0, 0)
fy = −6x − 12y = 0 at (0, 0)

Now the second derivative test:
fxx = −6 < 0
fyy = −12 < 0
fxy = −6, so fxx fyy − fxy² = 72 − 36 = 36 > 0
So, the point (0, 0) is a maximum.

7. A = [6, 5; 5, 4]
Here, since a = 6 > 0 and ac − b² = 24 − 25 = −1 < 0, A is indefinite.

8. A = [−6, 0, 0; 0, −5, 9; 0, 0, −7]
The eigenvalues of A (an upper triangular matrix) are −6, −5, and −7; all negative. So, A is negative definite.

9. Let the eigenvalues of A be λ1 and λ2. Given,
λ1 + λ2 = 6 ....(i)
λ1 × λ2 = 8
(λ1 − λ2)² = (λ1 + λ2)² − 4(λ1 × λ2)
(λ1 − λ2)² = 36 − 32 = 4
λ1 − λ2 = 2, −2 ...(ii)
Solving eq. (i) and (ii) we get λ1 = 4 and λ2 = 2. So, A is positive definite.

10. A = [1, 2; 2, 1] =⇒ Aᵀ A = [5, 4; 4, 5]

The characteristic polynomial of Aᵀ A is
λ² − 10λ + 9 = 0
Solving we get
λ = 9, 1
So, the singular values are
σ1 = √9 = 3 and σ2 = √1 = 1

11. A = [0, 1, 1; √2, 2, 0; 0, 1, 1]

Aᵀ A = [2, 2√2, 0; 2√2, 6, 2; 0, 2, 2]

The characteristic polynomial for Aᵀ A is
λ³ − 10λ² + 16λ = 0
Solving we get λ = 8, 2, 0, so the singular values are σ1 = 2√2, σ2 = √2 and σ3 = 0.

For λ = 8:
(Aᵀ A − 8I)x = 0
[−6, 2√2, 0; 2√2, −2, 2; 0, 2, −6][x, y, z]ᵀ = 0
−6x + 2√2 y = 0
2√2 x − 2y + 2z = 0
2y − 6z = 0
Let z = t. Then y = 3t and x = √2 t:
v1 = [x, y, z]ᵀ = t [√2, 3, 1]ᵀ
On normalizing (for t = 1) we get
v1 = [1/√6, √3/2, 1/(2√3)]ᵀ

For λ = 2:
(Aᵀ A − 2I)x = 0
[0, 2√2, 0; 2√2, 4, 2; 0, 2, 0][x, y, z]ᵀ = 0
2√2 y = 0 ⇒ y = 0
2√2 x + 4y + 2z = 0
Let z = k. Then x = −k/√2:
v2 = [x, y, z]ᵀ = k [−1/√2, 0, 1]ᵀ
On normalizing (for k = −√2, say) we get
v2 = [1/√3, 0, −2/√6]ᵀ

For λ = 0:
(Aᵀ A)x = 0
[2, 2√2, 0; 2√2, 6, 2; 0, 2, 2][x, y, z]ᵀ = 0
2x + 2√2 y = 0
2√2 x + 6y + 2z = 0
2y + 2z = 0
Let z = l. Then y = −l and x = √2 l:
v3 = [x, y, z]ᵀ = l [√2, −1, 1]ᵀ
On normalizing (for l = 1) we get
v3 = [1/√2, −1/2, 1/2]ᵀ

Q2 = [1/√6, 1/√3, 1/√2; √3/2, 0, −1/2; 1/(2√3), −2/√6, 1/2]
⇒ Q2ᵀ = [1/√6, √3/2, 1/(2√3); 1/√3, 0, −2/√6; 1/√2, −1/2, 1/2]

y1 = A v1 / σ1 = (1/(2√2)) [0, 1, 1; √2, 2, 0; 0, 1, 1][1/√6, √3/2, 1/(2√3)]ᵀ = [1/√6, 2/√6, 1/√6]ᵀ

y2 = A v2 / σ2 = (1/√2) [0, 1, 1; √2, 2, 0; 0, 1, 1][1/√3, 0, −2/√6]ᵀ = [−1/√3, 1/√3, −1/√3]ᵀ

Since σ3 = 0, 1/σ3 is indeterminate, so y3 cannot be obtained as A v3 /σ3.

Let y3 = [a, b, c]ᵀ such that < y1, y3 > = < y2, y3 > = 0:
[a, b, c][1/√6, 2/√6, 1/√6]ᵀ = 0 ⇒ a + 2b + c = 0 .....(i)
[a, b, c][−1/√3, 1/√3, −1/√3]ᵀ = 0 ⇒ a − b + c = 0 ......(ii)
Let b = k1 and c = k2.
From (i), a = −2k1 − k2; from (ii), a = k1 − k2.
So −2k1 − k2 = k1 − k2 ⇒ k1 = 0, giving a = −k2, b = 0 and c = k2:
y3 = k2 [−1, 0, 1]ᵀ
For k2 = −1, upon normalizing we get
y3 = [1/√2, 0, −1/√2]ᵀ

Q1 = [1/√6, −1/√3, 1/√2; 2/√6, 1/√3, 0; 1/√6, −1/√3, −1/√2]
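The hand-computed factors can be cross-checked against NumPy's SVD (a verification sketch, not part of the original solution):

```python
import numpy as np

A = np.array([[0, 1, 1],
              [np.sqrt(2), 2, 0],
              [0, 1, 1]])
# Singular values 2*sqrt(2), sqrt(2), 0
s = np.linalg.svd(A, compute_uv=False)
assert np.allclose(s, [2 * np.sqrt(2), np.sqrt(2), 0])

Q1 = np.array([[1/np.sqrt(6), -1/np.sqrt(3),  1/np.sqrt(2)],
               [2/np.sqrt(6),  1/np.sqrt(3),  0],
               [1/np.sqrt(6), -1/np.sqrt(3), -1/np.sqrt(2)]])
Q2 = np.array([[1/np.sqrt(6),      1/np.sqrt(3),  1/np.sqrt(2)],
               [np.sqrt(3)/2,      0,            -1/2],
               [1/(2*np.sqrt(3)), -2/np.sqrt(6),  1/2]])
S = np.diag([2*np.sqrt(2), np.sqrt(2), 0])
assert np.allclose(Q1 @ S @ Q2.T, A)   # A = Q1 Sigma Q2^T
print("SVD factors verified")
```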

12. A = [1, 0; 1, 1] ⇒ Aᵀ = [1, 1; 0, 1]
Aᵀ A = [2, 1; 1, 1]
The characteristic polynomial of Aᵀ A is
λ² − 3λ + 1 = 0

Solving we get
λ = 2.618, 0.382
So,
σ1 = √2.618 = 1.618
σ2 = √0.382 = 0.618

13. A = [2, 0; 1, 2]
Aᵀ A = [2, 1; 0, 2][2, 0; 1, 2] = [5, 2; 2, 4]
The characteristic equation is
λ² − 9λ + 16 = 0
Solving we get
λ = (9 ± √17)/2 = 6.5616, 2.4384
σ1 = 2.56, σ2 = 1.56
Σ = [2.56, 0; 0, 1.56]

For λ = 6.5616, we have
(Aᵀ A − 6.5616 I)x = 0
[−1.56, 2; 2, −2.56][x, y]ᵀ = 0
−1.56x + 2y = 0
2x = 2.56y
Let y = k, then x = 1.28k. So, after normalizing,
v1 = [0.788, 0.615]ᵀ

Similarly, for the other eigenvalue, we have
[2.56, 2; 2, 1.56][x, y]ᵀ = 0
2.56x + 2y = 0
2x + 1.56y = 0
Let y = k, then x = −0.78k. So, after normalizing,
v2 = [−0.615, 0.788]ᵀ

Q2 = [0.788, −0.615; 0.615, 0.788]
Now,
y1 = A v1 / σ1 = [0.615, 0.788]ᵀ
Similarly,
y2 = A v2 / σ2 = [−0.788, 0.615]ᵀ
So,
A = [0.615, −0.788; 0.788, 0.615] [2.56, 0; 0, 1.56] [0.788, −0.615; 0.615, 0.788]ᵀ
Week 6 Graded Assignment Solution

1. f (x, y) = x² + y²
fx = ∂f /∂x = 2x
fy = ∂f /∂y = 2y
Put fx = 0 and fy = 0:

fx = 0 ⇒ 2x = 0 ⇒ x = 0
fy = 0 ⇒ 2y = 0 ⇒ y = 0

Thus, the stationary point is (0, 0).

2. A = [4, 2; 2, 2] ⇔ [a, b; b, c]

Here, a = 4 > 0 and ac − b² = 4(2) − 2² = 4 > 0.
So, A is positive definite.

Now, B = [1, 1; 1, 2] ⇔ [a, b; b, c]

Here, a = 1 > 0 and ac − b² = 1(2) − 1² = 1 > 0.
So, B is positive definite.

A + B = [4, 2; 2, 2] + [1, 1; 1, 2] = [5, 3; 3, 4] ⇔ a = 5 > 0 and ac − b² = 5(4) − 3² = 11 > 0.
So, A + B is also positive definite.

3. A = [2, −1, 1; −1, 2, −1; 1, −1, 2]
The characteristic polynomial is

λ3 − 6λ2 + {3 + 3 + 3}λ − 4 = 0
FORMULA:
x3 − [trace(A)]x2 + Σ[Minors of diagonal elements(A)]x − det(A) = 0

λ3 − 6λ2 + 9λ − 4 = 0
Solving we get λ = 4, 1, 1 Since all eigenvalues are greater than zero, A is
positive definite.
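A quick numerical cross-check of the eigenvalues (not part of the original solution):

```python
import numpy as np

A = np.array([[2, -1, 1], [-1, 2, -1], [1, -1, 2]])
vals = np.linalg.eigvalsh(A)         # ascending: 1, 1, 4
assert np.allclose(vals, [1, 1, 4])
assert (vals > 0).all()              # all positive => positive definite
print("A is positive definite")
```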

4.
f (x, y) = 2x2 + 2xy + 2y 2 − 6x
fx = 4x + 2y − 6
fy = 2x + 4y
For stationary point, fx = 0 and fy = 0

4x + 2y − 6 = 0 ⇒ 2x + y = 3
2x + 4y = 0 ⇒ x = −2y
Solving we get x = 2 and y = −1

Hence, the stationary point = (2, −1)

5. [x, y, z] [a, b, c; d, e, f; g, h, i] [x, y, z]ᵀ
= [ax + dy + gz, bx + ey + hz, cx + f y + iz] [x, y, z]ᵀ
= ax² + ey² + iz² + (b + d)xy + (c + g)xz + (f + h)yz
Comparing this with x² + y² − z² − xy + yz + xz, we get
a = 1, e = 1, i = −1, b + d = −1, c + g = 1, f + h = 1

Only in options A and C are the diagonal elements 1, 1, −1.
- If we check option C, then f + h = −1, which is not satisfied. So, this is incorrect.
- If we check option A, then b + d = −1, c + g = 1, f + h = 1 are all satisfied.
So, option A is correct.
6.
f (x, y) = 3x² + 4xy + 2y²
fx = 6x + 4y ⇒ fxx = 6
fy = 4x + 4y ⇒ fyy = 4, and fxy = 4

Since fxx > 0 and fxx fyy − fxy² = 24 − 16 = 8 > 0, the point (0, 0) is a minimum.

7. A = [4, 2; 2, 3] ⇔ [a, b; b, c]

Here, a = 4 > 0 and ac − b² = 4(3) − 2² = 8 > 0.


So, A is positive definite.

8. A = [1, 2; 2, 1] ⇔ [a, b; b, c]

Here, a = 1 > 0 but ac − b² = 1(1) − 2 × 2 = −3 < 0.

Since ac − b² < 0, A is NOT positive definite.

9. A = [3, 0, 0; 0, 5, 0; 0, 0, 7]
Since A is a diagonal matrix, the eigenvalues are 3, 5 and 7.

Now since all eigenvalues are greater than 0, A is positive definite.

10. A = [1, 1, 0, 1; 0, 0, 0, 1; 1, 1, 0, 0]

A Aᵀ = [1, 1, 0, 1; 0, 0, 0, 1; 1, 1, 0, 0] [1, 0, 1; 1, 0, 1; 0, 0, 0; 1, 1, 0] = [3, 1, 2; 1, 1, 0; 2, 0, 2]

The characteristic polynomial of A Aᵀ is
λ³ − 6λ² + 6λ = 0

Solving we get
λ = 0, 3 ± √3

Now, σ = √λ.
So,
σ1 = √(3 + √3) and σ2 = √(3 − √3)
are the non-zero singular values.

11. A = [1, 0, 1, 0; 0, 1, 0, 1]

A Aᵀ = [1, 0, 1, 0; 0, 1, 0, 1] [1, 0; 0, 1; 1, 0; 0, 1] = [2, 0; 0, 2]

The characteristic polynomial of A Aᵀ is
λ² − 4λ + 4 = 0

Solving we get
λ = 2, 2
σ = √λ = √2
Σ = [√2, 0, 0, 0; 0, √2, 0, 0]

Now for λ = 2,
(A Aᵀ − 2I)x = 0
([2, 0; 0, 2] − 2I)[x, y]ᵀ = [0, 0; 0, 0][x, y]ᵀ = 0
so any vector [x, y]ᵀ is an eigenvector. Choosing
for x = 1, y = 0: u1 = [1, 0]ᵀ
for x = 0, y = 1: u2 = [0, 1]ᵀ
Q1 = [1, 0; 0, 1]

y1 = Aᵀ u1 / σ1 = (1/√2) [1, 0; 0, 1; 1, 0; 0, 1][1, 0]ᵀ = [1/√2, 0, 1/√2, 0]ᵀ
Similarly,
y2 = Aᵀ u2 / σ2 = (1/√2) [1, 0; 0, 1; 1, 0; 0, 1][0, 1]ᵀ = [0, 1/√2, 0, 1/√2]ᵀ

For the other two columns of Q2 we need [a, b, c, d]ᵀ with
[a, b, c, d][1, 0, 1, 0]ᵀ = 0 and [a, b, c, d][0, 1, 0, 1]ᵀ = 0
From the above we have a + c = 0 and b + d = 0.
Let c = k1 and d = k2:
k1 [−1, 0, 1, 0]ᵀ + k2 [0, −1, 0, 1]ᵀ
Normalizing → (1/√2) k1 [−1, 0, 1, 0]ᵀ and (1/√2) k2 [0, −1, 0, 1]ᵀ. Now,
y3 = (1/√2)[1, 0, −1, 0]ᵀ for k1 = −1 and y4 = (1/√2)[0, 1, 0, −1]ᵀ for k2 = −1
So,
Q2 = (1/√2) [1, 0, 1, 0; 0, 1, 0, 1; 1, 0, −1, 0; 0, 1, 0, −1]
Q2ᵀ = (1/√2) [1, 0, 1, 0; 0, 1, 0, 1; 1, 0, −1, 0; 0, 1, 0, −1]
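As a numerical cross-check of the singular values and the factorization:

```python
import numpy as np

A = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0]])
U, s, Vt = np.linalg.svd(A)
assert np.allclose(s, [np.sqrt(2), np.sqrt(2)])

S = np.zeros((2, 4))                 # Sigma padded to the 2x4 shape
S[0, 0], S[1, 1] = s
assert np.allclose(U @ S @ Vt, A)
print("singular values:", s)
```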

Course: Machine Learning - Foundations


Practice solution: Week 7

Questions 1-6 are based on common data.


Consider the data points x1, x2, x3 to answer the following questions:
x1 = [0, 2]ᵀ, x2 = [1, 1]ᵀ, x3 = [2, 0]ᵀ
1. (1 point) The mean vector of the data points x1, x2, x3 is
A. [0, 0]ᵀ
B. [1, 1]ᵀ
C. [0, 1]ᵀ
D. [0.5, 0.5]ᵀ

Answer: B
Mean vector = x̄ = (1/n) Σᵢ xᵢ = (1/3) [(0 + 1 + 2), (2 + 1 + 0)]ᵀ = [1, 1]ᵀ

2. (2 points) The covariance matrix C = (1/n) Σᵢ (xᵢ − x̄)(xᵢ − x̄)ᵀ for the data points x1, x2, x3
is
A. [0, 0; 0, 0]
B. [0.5, 0.5; 0.5, 0.5]
C. [0.67, −0.67; −0.67, 0.67]
D. [1, 0; 0, 1]

Answer: C
C = (1/3) ([−1, 1]ᵀ[−1, 1] + [0, 0]ᵀ[0, 0] + [1, −1]ᵀ[1, −1]) = (1/3) [2, −2; −2, 2]

3. (2 points) The eigenvalues of the covariance matrix C = (1/n) Σᵢ (xᵢ − x̄)(xᵢ − x̄)ᵀ are
A. 0.5, 0.5
B. 1, 1
C. 4/3, 0
D. 0, 0

Answer: C
Characteristic equation:
|2/3 − λ, −2/3; −2/3, 2/3 − λ| = 0
The determinant of this matrix gives λ(λ − 4/3) = 0.
Eigenvalues:
The roots are λ1 = 4/3, λ2 = 0.
Eigenvectors:
For λ1 = 4/3: C − λ1 I = [−2/3, −2/3; −2/3, −2/3]
The null space of this matrix is spanned by [−1, 1]ᵀ; the corresponding unit eigenvector is u1 = (1/√2)[−1, 1]ᵀ.
For λ2 = 0: C − λ2 I = [2/3, −2/3; −2/3, 2/3]
The null space of this matrix is spanned by [1, 1]ᵀ; the corresponding unit eigenvector is u2 = (1/√2)[1, 1]ᵀ.

4. (2 points) The eigenvectors of the covariance matrix C = (1/n) Σᵢ (xᵢ − x̄)(xᵢ − x̄)ᵀ are
(Note: The eigenvectors should be arranged in the descending order of eigenvalues from
left to right in the matrix.)
A. [1, 0; 0, 1]
B. [0.5, 0.5; 1, 1]
C. [1, 1; 1, 1]
D. [−0.7, 0.7; 0.7, 0.7]

Answer: D
Refer to the solution of the previous question.
5. (2 points) The data points x1, x2, x3 are projected onto the one dimensional space using
PCA as points z1, z2, z3 respectively.
A. z1 = [1, 1]ᵀ, z2 = [1, 1]ᵀ, z3 = [1, 1]ᵀ
B. z1 = [0.5, 0.5]ᵀ, z2 = [0, 0]ᵀ, z3 = [−0.5, −0.5]ᵀ
C. z1 = [0, 2]ᵀ, z2 = [1, 1]ᵀ, z3 = [2, 0]ᵀ
D. z1 = [−1, 1]ᵀ, z2 = [0, 0]ᵀ, z3 = [1, −1]ᵀ

Answer: D

λ1 = 4/3, u1 = (1/√2)[−1, 1]ᵀ
z1 = (u1ᵀ x1) u1 = ([0, 2] · (1/√2)[−1, 1]ᵀ) (1/√2)[−1, 1]ᵀ = [−1, 1]ᵀ
z2 = (u1ᵀ x2) u1 = ([1, 1] · (1/√2)[−1, 1]ᵀ) (1/√2)[−1, 1]ᵀ = [0, 0]ᵀ
z3 = (u1ᵀ x3) u1 = ([2, 0] · (1/√2)[−1, 1]ᵀ) (1/√2)[−1, 1]ᵀ = [1, −1]ᵀ

6. (1 point) The approximation error J is given by J = (1/n) Σᵢ ||xᵢ − zᵢ||². What could be the
possible value of the reconstruction error?
A. 1
B. 2
C. 10
D. 20

Answer: B
Reconstruction error, J = (1/3) Σᵢ ||xᵢ − zᵢ||² = (1/3)[(1² + 1²) + (1² + 1²) + (1² + 1²)] = 2
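The whole PCA computation for these three points can be replayed numerically (a sketch, not part of the original solution):

```python
import numpy as np

X = np.array([[0.0, 2.0], [1.0, 1.0], [2.0, 0.0]])
Xc = X - X.mean(axis=0)
C = Xc.T @ Xc / len(X)               # [[2/3, -2/3], [-2/3, 2/3]]
vals, vecs = np.linalg.eigh(C)       # ascending eigenvalues: 0, 4/3
assert np.allclose(vals, [0, 4 / 3])

u1 = vecs[:, -1]                     # top eigenvector, +-(1/sqrt(2))[-1, 1]
Z = np.outer(X @ u1, u1)             # projections; invariant to the sign of u1
assert np.allclose(Z, [[-1, 1], [0, 0], [1, -1]])
print(Z)
```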
Course: Machine Learning - Foundations
Test Questions
Lecture Details: Week 7

Questions 1-6 are based on common data


Consider these data points to answer the following questions:
x1 = [0, 1, 2]ᵀ, x2 = [1, 1, 1]ᵀ, x3 = [2, 1, 0]ᵀ
1. (1 point) The mean vector of the data points x1, x2, x3 is
A. [0, 0, 0]ᵀ
B. [1, 1, 1]ᵀ
C. [0.9, 0.6, 0.3]ᵀ
D. [0.5, 0.5, 0.5]ᵀ

Answer: B

Explanation:
x̄ = (1/3) Σᵢ xᵢ = (1/3) ([0, 1, 2]ᵀ + [1, 1, 1]ᵀ + [2, 1, 0]ᵀ) = [1, 1, 1]ᵀ

∴ Option B is correct.

2. (2 points) The covariance matrix C = (1/n) Σᵢ (xᵢ − x̄)(xᵢ − x̄)ᵀ of the data points x1, x2, x3
is
A. [0, 0, 0; 0, 0, 0; 0, 0, 0]
B. [0.5, 0.5, 0.5; 0.5, 0.5, 0.5; 0.5, 0.5, 0.5]
C. [0.67, 0, −0.67; 0, 0, 0; −0.67, 0, 0.67]
D. [1, 0, 0; 0, 1, 0; 0, 0, 1]

Answer: C

Explanation: To solve this question we first take the data and center it by subtracting
the mean. Doing this will give us the centered dataset:
x1 − x̄ = [−1, 0, 1]ᵀ, x2 − x̄ = [0, 0, 0]ᵀ, x3 − x̄ = [1, 0, −1]ᵀ

Now, we can use the formula C = (1/n) Σᵢ (xᵢ − x̄)(xᵢ − x̄)ᵀ to get the covariance matrix C:

C = (1/3) ([1, 0, −1; 0, 0, 0; −1, 0, 1] + [0, 0, 0; 0, 0, 0; 0, 0, 0] + [1, 0, −1; 0, 0, 0; −1, 0, 1])
= (1/3) [2, 0, −2; 0, 0, 0; −2, 0, 2]
≈ [0.67, 0, −0.67; 0, 0, 0; −0.67, 0, 0.67]

∴ Option C is correct.

3. (2 points) The eigenvalues of the covariance matrix C = (1/n) Σᵢ (xᵢ − x̄)(xᵢ − x̄)ᵀ are
A. 2, 0, 0
B. 1, 1, 1
C. 1.34, 0, 0
D. 0.5, 0, 0.5

Answer: C

Explanation: To find the eigenvalues, we find the characteristic polynomial and then
find its roots:
|C − λI| = |0.67 − λ, 0, −0.67; 0, −λ, 0; −0.67, 0, 0.67 − λ|
= (0.67 − λ)(−λ)(0.67 − λ) + (−0.67)(−0.67λ)
= −λ³ + 1.34λ² − 0.45λ + 0.45λ
= λ²(1.34 − λ)
Solving for the roots, we get
|C − λI| = 0 =⇒ λ = 1.34, 0, 0

∴ Option C is correct.
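A numerical cross-check that the only nonzero eigenvalue is 4/3 ≈ 1.34:

```python
import numpy as np

X = np.array([[0.0, 1.0, 2.0], [1.0, 1.0, 1.0], [2.0, 1.0, 0.0]])
Xc = X - X.mean(axis=0)
C = Xc.T @ Xc / len(X)
vals = np.linalg.eigvalsh(C)         # ascending
assert np.allclose(vals, [0, 0, 4 / 3])
print(vals[-1])                      # ~1.3333, rounded to 1.34 above
```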

4. (2 points) The eigenvectors of the covariance matrix C = (1/n) Σᵢ (xᵢ − x̄)(xᵢ − x̄)ᵀ are
(Note: The eigenvectors should be arranged in the descending order of eigenvalues from
left to right in the matrix.)
A. [1, 0, 1; 0, 1, 0; 1, 0, 1]
B. [0.71, 0, 1; 0, 0.71, 0; 0.71, 0.71, 0]
C. [−0.71, 0, 0.71; 0, 1, 0; 0.71, 0, 0.71]
D. [0.33, 0, 0; 0.33, 1, 0; 0.34, 0, 1]

Answer: C

Explanation: To solve this question, let us consider the eigenvalues one by one and find
the eigenvectors.

Consider the eigenvalue λ = 1.34:
E₁.₃₄ = null(C − 1.34I)
= null([−0.67, 0, −0.67; 0, −1.34, 0; −0.67, 0, −0.67])
= null([1, 0, 1; 0, 1, 0; 0, 0, 0])
= span{[−1, 0, 1]ᵀ}

Now, λ = 0:
E₀ = null(C)
= null([0.67, 0, −0.67; 0, 0, 0; −0.67, 0, 0.67])
= null([1, 0, −1; 0, 0, 0; 0, 0, 0])
= span{[0, 1, 0]ᵀ, [1, 0, 1]ᵀ}

So, the eigenvectors, in order of eigenvalue, are the columns of
[−1, 0, 1; 0, 1, 0; 1, 0, 1]

Keep in mind that the eigenvectors themselves can be scaled; scaling appropriately
(by 1/√2 ≈ 0.71), we can see that the columns of option C match our eigenvectors.

∴ Option C is correct.

5. (2 points) The data points x1, x2, x3 are projected onto the one dimensional space using
PCA as points z1, z2, z3 respectively. (Use the eigenvector with the maximum eigenvalue for
this projection.)
A. z1 = [1, 1, 1]ᵀ, z2 = [1, 1, 1]ᵀ, z3 = [1, 1, 1]ᵀ
B. z1 = [0.5, 0.5, 0.5]ᵀ, z2 = [0, 0, 0]ᵀ, z3 = [−0.5, −0.5, −0.5]ᵀ
C. z1 = [0, 2, 2]ᵀ, z2 = [1, 1, 1]ᵀ, z3 = [2, 2, 0]ᵀ
D. z1 = [0, 1, 2.0164]ᵀ, z2 = [1.0082, 1, 1.0082]ᵀ, z3 = [2.0164, 1, 0]ᵀ

Answer: D

Explanation: To solve this question, we can calculate zᵢ to be the projection of xᵢ onto
the principal eigenvector v = [1, 0, −1]ᵀ, for which ||v||² = 2. That is,
z1 = proj_v x1 = (x1 · v/||v||²) v = [−1, 0, 1]ᵀ
z2 = proj_v x2 = (x2 · v/||v||²) v = [0, 0, 0]ᵀ
z3 = proj_v x3 = (x3 · v/||v||²) v = [1, 0, −1]ᵀ

∴ Option D is correct.

6. (1 point) The approximation error J on the given data set is given by J = (1/n) Σᵢ ||xᵢ − zᵢ||².
What is the reconstruction error?
A. 6.724 × 10^−4
B. 5
C. 10
D. 20

Answer: A

Explanation: Let us plug in the values of xᵢ and zᵢ into the formula and find the
approximation error:

J = (1/3) Σᵢ ||xᵢ − zᵢ||²
= (1/3) (||[1, 1, 1]ᵀ||² + ||[1, 1, 1]ᵀ||² + ||[1, 1, 1]ᵀ||²)
= 3

∴ Option A is correct.
Course: Machine Learning - Foundations
8 Practice questions
Week 7:

1. (2 points) Two positive numbers have a sum of 60. What is the maximum product of
one number times the square of the other number?
A. 0
B. 32000
C. 60000
D. 64000

Answer: B
Let the two numbers be x and y, with x + y = 60.
The objective function from the question will be
f (x) = x²(60 − x)
For optima, f ′(x) = 0: 120x − 3x² = 0 gives x = 0, 40.
The product is maximum when x = 40:
maximum product = 40² × 20 = 32000
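A brute-force check of the maximizer (illustrative only; an integer grid is sufficient here):

```python
best = max(range(61), key=lambda x: x * x * (60 - x))
assert best == 40
assert best * best * (60 - best) == 32000
print(best, best * best * (60 - best))  # 40 32000
```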

2. (2 points) (Multiple select) The point on y = x² + 1 closest to (0, 1.5) is

A. (0, 1)
B. (0.707, −1.5)
C. (−0.707, 1.5)
D. (0, −1)

Answer: A
Objective function (squared distance): f (x) = (x − 0)² + (x² + 1 − 1.5)²
f (x) = x⁴ + 0.25
For the minimum, f ′(x) = 0:
4x³ = 0 ⟹ x = 0
The corresponding y = 1.

3. (2 points) The volume of the largest cone that can be inscribed in a sphere of radius 3 m is
(correct up to two decimal places)

Answer: 33.51 m³
V = (1/3)πr²h, with
r² = 9 − x²
h = 3 + x
For the maximum, V ′(x) = 0:
−3x² − 6x + 9 = 0 ⟹ x = 1, −3
x cannot be negative, so x = 1, giving r = 2.828, h = 4 and
V = 33.51

4. (2 points) The area of the largest rectangle that can be inscribed in a circle of radius 4 is
A. 16
B. 8
C. 32
D. 20

Answer: C
Let x and y be the two sides of the rectangle, so x² + y² = 64 (the diagonal is a diameter) and
y = √(64 − x²)
A = xy = x √(64 − x²)
For the maximum, dA/dx = 0 gives
x = 4√2, y = 4√2
A = 32
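A grid-search sketch confirming the answer (the grid resolution is an arbitrary choice):

```python
import math

xs = [i / 1000 for i in range(8001)]                       # x in [0, 8]
best = max(xs, key=lambda x: x * math.sqrt(64 - x * x))
area = best * math.sqrt(64 - best * best)
assert abs(best - 4 * math.sqrt(2)) < 1e-2                 # x = 4*sqrt(2)
assert abs(area - 32) < 1e-3
print(round(best, 3), round(area, 2))
```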
(Questions 5-8 have common data) A manufacturing plant produces two products M and
N. Maximum production capacity is 700 for total production. At least 270 units must
be produced every day. Machine hours consumption per unit is 6 hours for M and 5
hours for N. At least 1100 machine hours must be used daily. Manufacturing cost is Rs
25 for M and Rs 35 for N.
Let, x1 = No of units of M produced per day
and x2 = No of units of N produced per day

5. (1 point) The objective function for above problem is


A. min f (x) = 25x1 + 55x2
B. min f (x) = 35x1 + 25x2
C. min f (x) = 25x1 + 35x2
D. min f (x) = 10x1 + 35x2

Answer: C
We need to minimize the cost of the function.

6. (1 point) The constraint due to maximum production capacity is


A. x1 + x2 ≤ 700

B. x1 + x2 ≥ 700
C. x1 + x2 ≥ 270
D. x1 + x2 = 700

Answer: A
Maximum production capacity is 700.

7. (1 point) The constraint due to minimum production capacity is


A. x1 + x2 ≠ 270
B. x1 + x2 = 270
C. x1 + x2 ≤ 270
D. x1 + x2 ≥ 270

Answer: D
Minimum production capacity is 270.

8. (1 point) The constraint due to machine hour consumption is


A. 6x1 + 5x2 ≤ 1100
B. 6x1 + 5x2 ≥ 1100
C. 6x1 + 5x2 ≠ 1100
D. 6x1 + 5x2 = 1100

Answer: B
At least 1100 hours must be used.
(Questions 9-11 have common data)
A factory manufactures two products A and B. To manufacture one unit of A, 3 machine
hours and 5 labour hours are required. To manufacture product B, 2 machine hours and
4 labour hours are required. In a month, 270 machine hours and 280 labour hours are
available. Profit per unit for A is Rs. 55 and for B is Rs. 15.
Let x1 =Number of units of A produced per month
and x2 =Number of units of B produced per month

9. (1 point) The objective function for above problem is


A. max f (x) = 55x1 + 15x2
B. min f (x) = 55x1 + 15x2
C. max f (x) = 15x1 + 45x2
D. min f (x) = 15x1 + 55x2

Answer: A
We need to maximise profit.

10. (2 points) The constraint for machine hours is



A. 3x1 + 2x2 ≥ 270


B. 3x1 + 2x2 ≤ 270
C. 3x1 + 2x2 ≠ 270
D. 3x1 + 2x2 = 270

Answer: B
270 hours available.

11. (2 points) The constraint for labour hours is


A. 5x1 + 4x2 = 280
B. 5x1 + 4x2 ≤ 280
C. 5x1 + 4x2 ≥ 280
D. 5x1 + 4x2 ≠ 280

Answer: B
280 labour hours is available.

12. (2 points) The value of a function at a point x = 5 is 3.2 and the value of the function’s
derivative at the point x = 5 is 1.2. What will be the approximate value of the function at
the point x = 5.2 (first order approximation)?

Answer: 3.44
According to Taylor’s series,
f (x + h) = f (x) + h f ′(x) + (h²/2) f ′′(x) + ......
Here x = 5, h = 0.2, so to first order
f (x + h) ≈ 3.2 + 0.2 × 1.2 = 3.44

13. (2 points) For the function f (x) = (x sin x − 1)/2, with an initial guess of x0 = −7, and step
size of 0.25, the value of the function after two iterations is (correct up to 3 decimal
places)

Answer: -2.471
xn+1 = xn − η f ′(xn), with f ′(x) = (sin x + x cos x)/2
After first iteration x1 = −6.258
After second iteration x2 = −5.479
f (−5.479) = −2.47
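The two iterations can be replayed in code (a sketch, not part of the original solution):

```python
import math

f = lambda x: (x * math.sin(x) - 1) / 2
df = lambda x: (math.sin(x) + x * math.cos(x)) / 2  # product rule

x = -7.0
for _ in range(2):
    x -= 0.25 * df(x)
print(round(x, 3), round(f(x), 3))  # x ~ -5.479, f(x) ~ -2.47
```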

14. (2 points) The area of the largest rectangle that can be inscribed in a circle of radius 1
is
A. 1
B. 1.5
C. 6

D. 2

Answer: D
Follow same steps as q no 4
Course: Machine Learning - Foundations
8 Test questions
Week 7:

1. (2 points) Two positive numbers have a sum of 60. What is the minimum product of
one number times the square of other number?
A. 0
B. 900
C. 60
D. 240

Answer: A
Let the two numbers be x and y
x+y=60
objective function from the question will be,
f (x) = x2 (60 − x)
For optima f 0 (x) = 0, 120x − 3x2 = 0
x = 0, 40
Product is minimum when x=0.

2. (2 points) (Multiple select) The point on y = x2 + 1 closest to (0,2) is


A. (0.707, 1.5)
B. (0.707, -1.5)
C. (-0.707,1.5)
D. (-0.707, -1.5)

Answer: A,C
Objective function f (x) = (x − 0)2 + (x2 + 1 − 2)2
f (x) = x4 − x2 + 1
For minima f 0 (x) = 0
4x3 − 2x = 0
x = 0, 0.707, −0.707
Corresponding y = 1, 1.5, 1.5

3. (2 points) The volume of the largest cone that can be inscribed in a sphere of radius 6 m
is (correct up to two decimal places)

Answer: 268.19 m³
V = (1/3)πr²h, with
r² = 36 − x²
h = 6 + x
For the maximum, V ′(x) = 0:
−3x² − 12x + 36 = 0 ⟹ x = 2, −6
x cannot be negative, so x = 2, giving r = 5.65, h = 8 and
V = 268.19
(Questions 5-8 have common data) A firm produces two products A and B. Maximum
production capacity is 500 for total production. At least 200 units must be produced
every day. Machine hours consumption per unit is 5 hours for A and 3 hours for B. At
least 1000 machine hours must be used daily. Manufacturing cost is Rs 30 for A and Rs
20 for B.
Let x1 = No of units of A produced per day
and x2 = No of units of B produced per day
4. (1 point) The objective function for above problem is
A. min f (x) = 30x1 + 20x2
B. min f (x) = 15x1 + 55x2
C. min f (x) = 5x1 + 155x2
D. min f (x) = 30x1 − 20x2
Answer: A
We should minimise cost function.
Objective function is
min f (x) = 30x1 + 20x2
5. (2 points) The constraint due to maximum production capacity is
A. x1 + x2 ≥ 500
B. x1 + x2 ≤ 500
C. x1 + x2 ≠ 500
D. x1 + x2 = 500
Answer: B
Maximum production capacity is 500.
6. (2 points) The constraint due to minimum production capacity is
A. x1 + x2 = 200
B. x1 + x2 ≤ 200
C. x1 + x2 ≥ 200
D. x1 + x2 ≠ 200
Answer: C
Minimum production capacity is 200.

7. (2 points) The constraint due to machine hour consumption is


A. 5x1 + 3x2 ≤ 1000
B. 5x1 + 3x2 ≠ 1000
C. 5x1 + 3x2 = 1000
D. 5x1 + 3x2 ≥ 1000

Answer: D
1000 machine hours must be used daily.
(Questions 9-11 have common data)
A factory manufactures two products A and B. To manufacture one unit of A, 1 machine
hours and 2 labour hours are required. To manufacture product B, 2 machine hours and
1 labour hours are required. In a month, 200 machine hours and 140 labour hours are
available. Profit per unit for A is Rs. 45 and for B is Rs. 35.
Let x1 =Number of units of A produced per month
and x2 =Number of units of B produced per month

8. (1 point) The objective function for above problem is


A. max f (x) = 45x1 + 35x2
B. min f (x) = 45x1 + 35x2
C. max f (x) = 35x1 + 45x2
D. min f (x) = 35x1 + 45x2

Answer: A
We need to maximize profit.

9. (2 points) The constraint for machine hours is


A. x1 + 2x2 ≥ 200
B. x1 + 2x2 ≤ 200
C. x1 + 2x2 ≠ 200
D. x1 + 2x2 = 200

Answer: B
Total machine hours available=200.

10. (2 points) The constraint for labour hours is


A. 2x1 + x2 = 140
B. 2x1 + x2 ≤ 140
C. 2x1 + x2 ≥ 140
D. 2x1 + x2 6= 140

Answer: B
Total labour hour available is 140.

11. (2 points) (Multiple select) Gradient of a continuous and differentiable function


A. is zero at a minimum
B. is non zero at a maximum
C. is zero at a saddle point
D. decreases as you get closer to minimum

Answer: A,C,D
For critical points gradient of a function is 0.
As we move towards minima gradient decreases.

12. The value of a function at point 10 is 100. The values of the function’s first and second
order derivatives at this point are 20 and 2 respectively. What will be the function’s
approximate value correct up to two decimal places at the point 10.5 (Use second order
approximation)?

Answer: 110.25
According to Taylor’s series,
f (x + h) = f (x) + h f ′(x) + (h²/2) f ′′(x) + ......
Here x = 10, h = 0.5
∴ f (x + h) = 100 + 0.5 × 20 + (0.5²/2) × 2 = 110.25

13. (2 points) For the function f (x) = x sin(x) − 1, with an initial guess of x0 = 2.5, and
step size of 0.1, as per gradient descent algorithm, what will be the value of the function
after 4 iterations? (Correct up to 3 decimal places)

Answer: -1.710 (-1.624 to-1.795)


xn+1 = xn − η f ′(xn), with f ′(x) = sin x + x cos x
After first iteration x1 = 2.640
After second iteration x2 = 2.823
After third iteration x3 = 3.059
After fourth iteration x4 = 3.355
f (3.355) = −1.710

14. (2 points) The value of f (x1, x2) = 4x1² − 4x1 x2 + 2x2² with an initial guess of (2, 3)
after two iterations of the gradient descent algorithm will be ............... Take the step size
η = 1/(t + 1), where t = 0, 1, 2, ....

Answer: 130
xn+1 = xn − η∇f (xn)
∇f = [8x1 − 4x2, −4x1 + 4x2]ᵀ
First iteration (η = 1): (2, 3) → (−2, −1)
Second iteration (η = 1/2): (−2, −1) → (4, −3)
f (4, −3) = 130
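The two decaying-step iterations can be replayed numerically:

```python
import numpy as np

f = lambda p: 4 * p[0]**2 - 4 * p[0] * p[1] + 2 * p[1]**2
grad = lambda p: np.array([8 * p[0] - 4 * p[1], -4 * p[0] + 4 * p[1]])

p = np.array([2.0, 3.0])
for t in range(2):
    p = p - (1.0 / (t + 1)) * grad(p)   # eta = 1/(t+1)
print(p, f(p))  # [ 4. -3.] 130.0
```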

15. (2 points) The point of minimum for the function f (x1 , x2 ) = x21 − x1 x2 + 2x22 with an
initial guess of (3, 2) with step size=0.5 using gradient descent algorithm after second
iteration will be .............. (correct up to 3 decimal places)

Answer: 2.312 (accepted range: 2.196 to 2.428)

x_{n+1} = x_n − η ∇f(x_n)
∇f = [2x1 − x2, −x1 + 4x2]ᵀ
After the first iteration: x⁽¹⁾ = (1, −0.5)
After the second iteration: x⁽²⁾ = (−0.25, 1)
f(−0.25, 1) = 2.3125 ≈ 2.312
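The two iterations above can be checked with a short script (a sketch; the function name is illustrative):

```python
def gradient_descent_2d(x1, x2, eta, steps):
    # f(x1, x2) = x1^2 - x1*x2 + 2*x2^2
    for _ in range(steps):
        g1 = 2 * x1 - x2       # ∂f/∂x1
        g2 = -x1 + 4 * x2      # ∂f/∂x2
        x1, x2 = x1 - eta * g1, x2 - eta * g2
    return x1, x2

a, b = gradient_descent_2d(3, 2, 0.5, 2)
print(a, b, a**2 - a*b + 2*b**2)  # -0.25 1.0 2.3125
```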

16. (2 points) Suppose we have n data points randomly distributed in space given by D =
{x1, x2, ..., xn}. A function f(p) is defined to calculate the sum of squared distances of the data points from a fixed point, say p. Let f(p) = Σᵢ₌₁ⁿ (p − xᵢ)². What is the value of p so that f(p) is minimum?
A. x1 + x2 + ... + xn
B. x1 − x2 + x3 − x4 + ...
C. (x1 + x2 + ... + xn)/n
D. (x1 − x2 + x3 − x4 + ...)/n
Answer: C
f(p) = Σᵢ₌₁ⁿ (p − xᵢ)² = (p − x1)² + ... + (p − xn)²
f'(p) = 2(p − x1) + ... + 2(p − xn)
For a minimum, f'(p) = 0:
(p − x1) + (p − x2) + ... + (p − xn) = 0
np − (x1 + x2 + ... + xn) = 0
p = (x1 + x2 + ... + xn)/n
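The conclusion that the mean minimises the sum of squared distances can be checked empirically on synthetic data (a small sketch; the data are randomly generated for illustration):

```python
import random

random.seed(0)
data = [random.uniform(-10, 10) for _ in range(100)]
mean = sum(data) / len(data)

def f(p):
    # Sum of squared distances from p to the data points
    return sum((p - x) ** 2 for x in data)

# f is no larger at the mean than at nearby candidate points
assert all(f(mean) <= f(mean + d) for d in (-1.0, -0.1, 0.1, 1.0))
print(round(mean, 3))
```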
Course: Machine Learning - Foundations
Practice Questions - Solution
Lecture Details: Week 8

1. (1 point) Given that if x1, x2, (x1 + x2)/2 ∈ S then (3/4)x1 + (1/4)x2 ∈ S. Is this a true statement?
A. Yes
B. No

Answer: A
If x1, x2, (x1 + x2)/2 ∈ S, then the midpoint of x1 and (x1 + x2)/2, namely (x1 + (x1 + x2)/2)/2 = (3/4)x1 + (1/4)x2, is also in S.
It is clear that if the set is midpoint convex, then the set is a convex set.
By definition, for a convex set S, if x1, x2 ∈ S =⇒ λx1 + (1 − λ)x2 ∈ S, λ ∈ [0, 1].
2. (1 point) Which of the following is a convex function?
A. f (x) = ax + b over R where a, b ∈ R
B. f (x) = eax over R where a ∈ R
C. f (x) = x2 over R
D. f (x) = x3 over R

Answer: A, B, C
1. f(x) = ax + b over R where a, b ∈ R:
f'(x) = a, f''(x) = 0. The second derivative is non-negative, hence the function is convex.
2. f(x) = e^(ax) over R where a ∈ R:
f'(x) = a e^(ax), f''(x) = a² e^(ax) ≥ 0 for all x ∈ R. The second derivative is non-negative (non-negative curvature), hence the function is convex.
3. f(x) = x² over R:
f'(x) = 2x, f''(x) = 2 ≥ 0. The second derivative is non-negative, hence the function is convex.
4. f(x) = x³ over R:
f'(x) = 3x², f''(x) = 6x. The second derivative can be negative or positive depending on x, hence the function is not convex.
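These conclusions can also be spot-checked numerically using the midpoint-convexity inequality f((a+b)/2) ≤ (f(a)+f(b))/2 (a rough sampling-based sanity check, not a proof; the helper name is illustrative):

```python
import math
import random

def looks_convex(f, lo, hi, trials=2000):
    # Samples pairs (a, b) and checks f((a+b)/2) <= (f(a)+f(b))/2
    random.seed(1)
    for _ in range(trials):
        a, b = random.uniform(lo, hi), random.uniform(lo, hi)
        if f((a + b) / 2) > (f(a) + f(b)) / 2 + 1e-9:
            return False
    return True

print(looks_convex(lambda x: 2 * x + 3, -5, 5))        # linear case
print(looks_convex(lambda x: math.exp(2 * x), -2, 2))  # exponential case
print(looks_convex(lambda x: x * x, -5, 5))            # quadratic case
print(looks_convex(lambda x: x ** 3, -5, 5))           # cubic case
```

The first three print True and the cubic prints False, matching the second-derivative analysis above.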

3. (1 point) For what value of a is the function f : R² → R, f(x, y) = ax⁴ + 8y a convex function?
A. a > 0
B. a < 1
C. a ≥ 1
D. None of these

Answer: A
f(x, y) = ax⁴ + 8y
fx = ∂f/∂x = 4ax³, fxx = ∂²f/∂x² = 12ax²
fy = ∂f/∂y = 8, fyy = ∂²f/∂y² = 0
fxy = ∂²f/∂x∂y = 0
The Hessian matrix is H = [fxx fxy; fxy fyy] = [12ax² 0; 0 0]
The determinant of the Hessian matrix is D = fxx·fyy − fxy² = 0
For the function to be convex, the second-order partial derivative with respect to x should be positive (for x ≠ 0), in other words fxx > 0.
For this to be true, a > 0.

4. (1 point) Which of the following Hessian matrices corresponds to a convex function?
A. [−2 2; 2 −2]
B. [2 1; 1 2]
C. [−2 2; 2 0]
D. [−2 2; 2 2]

Answer: B
The Hessian matrix is denoted H = [fxx fxy; fxy fyy].
A function f(x, y) is convex when fxx > 0 and D = fxx·fyy − fxy² ≥ 0; only option B (fxx = 2 > 0, D = 2·2 − 1² = 3 > 0) satisfies this.

5. (1 point) Function f : R^d → R, f(x) = xᵀAx is a convex function if

A. A is a positive definite matrix
B. A is a positive semi-definite matrix
C. A is a negative definite matrix
D. A is a negative semi-definite matrix

Answer: A, B
The Hessian matrix is denoted H = [fxx fxy; fxy fyy].
The Hessian is positive semi-definite when fxx ≥ 0 and D = fxx·fyy − fxy² ≥ 0, and positive definite when fxx > 0 and D > 0.
In both cases, the function fulfils the criteria of convexity.

6. (1 point) A twice differentiable function f : Rⁿ → R is convex if and only if, at every x ∈ Rⁿ, its
A. Hessian matrix is positive definite
B. Hessian matrix is positive semi-definite
C. Hessian matrix is negative definite
D. Hessian matrix is negative semi-definite

Answer: A, B
Please refer to the previous solution.

7. (1 point) Given a function f : Rⁿ → R, the linear approximation of the function f at the point (x + εd) is:
A. f(x) + ε dᵀ∇f(x)
B. f(x) + ε ∇f(x)
C. f(x) + dᵀ∇f(x)
D. None of these

Answer: A
Please refer to the lecture videos.

8. (1 point) (multiple select) A function in one variable is said to be convex function if it


has:
A. Positive curvature
B. Negative curvature
C. Non-positive curvature
D. Non-negative curvature

Answer: A, D
Since the function is convex, its second-order derivative is non-negative (non-negative curvature); positive curvature is the strict case.
Hence, both (A) and (D) are correct answers.

9. (1 point) What is the relationship between eigenvalues of the hessian matrix of twice
differentiable convex function?
A. All eigenvalues are non-negative
B. Eigenvalues are both positive and negative
C. All eigenvalues are non-positive
D. There is no relationship

Answer: A
For a positive definite matrix, the eigenvalues are always positive; for a positive semi-definite matrix, the eigenvalues are always non-negative.
For a twice differentiable convex function, the Hessian matrix is positive semi-definite (or positive definite), so its eigenvalues are non-negative.
Therefore, for a convex function, option (A) is the correct answer.

10. (1 point) A batch of cookies requires 4 cups of flour, and a cake requires 7 cups of flour.
What would be the constraint limiting the amount of cookies(a) and cakes(b) that can
be made with 50 cups of flour.
A. 4a + 7b ≤ 50
B. 7a + 4b ≤ 50
C. 11(a + b) ≤ 50
D. 4a.7b ≤ 50

Answer: A
Each batch of cookies (a) requires 4 cups of flour ⇒ 4a cups.
Each cake (b) requires 7 cups of flour ⇒ 7b cups.
The maximum amount of flour available is 50 cups, hence the constraint is 4a + 7b ≤ 50.

11. (1 point) If the objective function to be minimised is f(x, y, z) = x + z and the constraint equation is g(x, y, z) = x² + y² + z² = 1, the point where the minimum value occurs will be
A. (1/√2, 0, −1/√2)
B. (−1/√2, 0, 1/√2)
C. (−1/√2, 0, −1/√2)
D. (1/√2, 0, 1/√2)

Answer: C
Given
f (x, y, z) = x + z
g(x, y, z) = x2 + y 2 + z 2 = 1
To get critical point we need to solve ∇f (x, y, z) = λ∇g(x, y, z):
using above we get following equation:
(i) diff w.r.t x ⇒ 1 = 2λx
(ii) diff w.r.t y ⇒ 0 = 2λy
(iii) diff w.r.t z ⇒ 1 = 2λz
Using the above we get x = z = 1/(2λ) and y = 0.
Substituting in x² + y² + z² = 1, we get the critical points
(−1/√2, 0, −1/√2) and (1/√2, 0, 1/√2).
Here f(−1/√2, 0, −1/√2) ≤ f(1/√2, 0, 1/√2), and since the constraint equation describes a sphere:
f(−1/√2, 0, −1/√2) is the constrained minimum point, and
f(1/√2, 0, 1/√2) is the constrained maximum point.

12. (1 point) If the objective function to be maximized is f(x, y, z) = x + z and the constraint equation is g(x, y, z) = x² + y² + z² = 1, the point where the maximum value occurs will be
A. (1/√2, 0, −1/√2)
B. (−1/√2, 0, 1/√2)
C. (−1/√2, 0, −1/√2)
D. (1/√2, 0, 1/√2)

Answer: D
Refer to the previous solution.

13. (1 point) Find the points on the surface y 2 = 1 + xz that are closest to the origin.
A. (0, −1, 0)
B. (1, 1, 1)
C. (0, 0, 0)
D. (0, 2, 0)

E. (1, 2, 0)

Answer: A
To get the closest point on a surface to a given point, we can create a distance function and minimise it.
Since the coordinates of the origin are (0, 0, 0), the distance is d = √(x² + y² + z²). Hence:
objective function: f(x, y, z) = x² + y² + z²
constraint equation: g(x, y, z) = y² − 1 − xz = 0
To get the critical points we solve ∇f(x, y, z) = λ∇g(x, y, z), which gives the equations:
(i) diff. w.r.t. x ⇒ 2x = −λz
(ii) diff. w.r.t. y ⇒ 2y = 2λy → λ = 1
(iii) diff. w.r.t. z ⇒ 2z = −λx
Using the above we get x = z = 0.
Putting x = z = 0 in y² − 1 − xz = 0, we get y = 1, −1.
So the points closest to the origin are (0, 1, 0) and (0, −1, 0).

14. (1 point) The minimum value of the function f (x, y) = xy 2 on the circle x2 + y 2 = 1 is
(correct upto two decimal places) .

Answer: −0.39, Range −0.50 to 0.00

Given f(x, y) = xy², g(x, y) = x² + y² = 1
∇f(x, y) = [y², 2xy]ᵀ, ∇g(x, y) = [2x, 2y]ᵀ
We find the values of x, y, λ that simultaneously satisfy the equations ∇f(x, y) = λ∇g(x, y) and g(x, y) = x² + y² = 1 to get the extreme points.
Solving ∇f(x, y) = λ∇g(x, y):
[y², 2xy]ᵀ = λ[2x, 2y]ᵀ =⇒ x = λ or 0, y = ±√2·λ or 0
The point (0, 0) does not lie on the circle g(x, y) = x² + y² = 1.
Solving g(x, y) = x² + y² = 1: λ² + 2λ² = 1 =⇒ λ = ±√3/3
Therefore, the extreme point coordinates will be:
(x1, y1) = (√3/3, √6/3), f(√3/3, √6/3) = 0.39
(x2, y2) = (−√3/3, √6/3), f(−√3/3, √6/3) = −0.39
(x3, y3) = (√3/3, −√6/3), f(√3/3, −√6/3) = 0.39
(x4, y4) = (−√3/3, −√6/3), f(−√3/3, −√6/3) = −0.39
We can see the function f(x, y) attains its minimum value, −0.39, at the points (−√3/3, √6/3) and (−√3/3, −√6/3).

15. (1 point) (multiple select) The minimum value of the function f (x, y) = xy 2 on the
circle x2 + y 2 = 1 occurs at the below points:
A. (√3/3, √6/3)
B. (−√3/3, √6/3)
C. (√3/3, −√6/3)
D. (−√3/3, −√6/3)

Answer: B, D
Refer to the solution of the previous question
Course: Machine Learning - Foundations
Graded Questions - Solution
Lecture Details: Week 8

1. (1 point) Points (0, 0), (5, 0), (5, 5), (0, 5) form a convex hull. Which of the following points is part of this convex hull?
A. (1,1)
B. (1,-1)
C. (-1,1)
D. (-1,-1)

Answer: A
Let (x1, y1) = (0, 0), (x2, y2) = (5, 0), (x3, y3) = (5, 5), (x4, y4) = (0, 5). Any point in the convex hull of these points belongs to the set
S = {(x, y) | x = λ1x1 + λ2x2 + λ3x3 + λ4x4, y = λ1y1 + λ2y2 + λ3y3 + λ4y4, λᵢ ∈ [0, 1], λ1 + λ2 + λ3 + λ4 = 1}
The convex hull of these 4 points forms a square, and any point inside or on the square is part of the convex hull.
The point (1,1) lies inside this square, while (1,-1), (-1,1) and (-1,-1) lie outside it.
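Membership in a convex hull like this square can be tested programmatically. A small sketch (taking the square's vertices in counter-clockwise order; the helper name is illustrative):

```python
def in_convex_polygon(pt, verts):
    # For a convex polygon with counter-clockwise vertices, the point is inside
    # iff it lies to the left of (or on) every directed edge.
    px, py = pt
    for i in range(len(verts)):
        x1, y1 = verts[i]
        x2, y2 = verts[(i + 1) % len(verts)]
        # Cross product of the edge vector and the vector to the point
        if (x2 - x1) * (py - y1) - (y2 - y1) * (px - x1) < 0:
            return False
    return True

square = [(0, 0), (5, 0), (5, 5), (0, 5)]
for p in [(1, 1), (1, -1), (-1, 1), (-1, -1)]:
    print(p, in_convex_polygon(p, square))
```

Only (1, 1) prints True, matching the answer above.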

2. (1 point) Given S is a convex set and the points x1 , x2 , x3 , x4 ∈ S. Which of the following
points must be the part of convex set S:
A. 0.1x1 + 0.2x2 + 0.3x3 + 0.4x4
B. −0.1x1 − 0.2x2 + 0.6x3 + 0.7x4
C. 0.1x1 + 0.1²x2 + 0.1³x3 + 0.1³x4
D. 0.25x1 + 0.25x2 + 0.25x3 + 0.25x4

Answer: A, D
A convex combination of points of a convex set is always part of the convex set; the set of all such combinations is known as the convex hull.
For a convex combination, the coefficients should be non-negative and should sum to 1. Therefore, options (A) and (D) are correct.

3. (1 point) Which of the following is a convex function in R²?

A. f(x, y) = x² + y²
B. f(x, y) = −x² − y²
C. f(x, y) = x² − y²
D. None of these

Answer: A
f(x, y) = x² + y² = vᵀAv, where A = [a b; b c] = [1 0; 0 1] and v = [x, y]ᵀ
Here a > 0 and ac − b² = 1·1 − 0² = 1 > 0, which shows the matrix A is positive definite. Therefore, the function is a convex function.

4. (1 point) What is the boundary value of x so that the function (x − 3)3 + (y + 1)2 to
remain convex?
A. x ≥ 1
B. x ≥ 2
C. x ≥ 3
D. None of these

Answer: C
f(x, y) = (x − 3)³ + (y + 1)²
First-order partial derivatives: fx = 3(x − 3)², fy = 2(y + 1)
Second-order partial derivatives: fxx = 6(x − 3), fxy = 0, fyy = 2
fxx = 6(x − 3) changes sign at x = 3: when x ≥ 3, fxx ≥ 0; otherwise fxx < 0.
The determinant of the Hessian matrix is
D = det [fxx fxy; fxy fyy] = det [6(x − 3) 0; 0 2] = 12(x − 3)
D is non-negative when x ≥ 3, and the function remains convex.

5. (1 point) Select the most appropriate linear approximation of f(x, y) = x² + y² at a small step ε = 0.02 in the direction (1, 2) from the point (3, 2)
A. 11.02
B. 10.28
C. 9.98
D. None of these

Answer: D
ε = 0.02, d = [1, 2]ᵀ, ∇f(x, y)|(3,2) = [2x, 2y]ᵀ = [6, 4]ᵀ
The linear approximation of f at the point (x + εd) is
f(x, y) + ε dᵀ∇f(x, y) = (3² + 2²) + 0.02 × (1·6 + 2·4) = 13 + 0.02 × 14 = 13.28
None of the listed values equals 13.28, so the answer is D.

6. (1 point) The minimum value of f (x, y) = x2 + 4y 2 − 2x + 8y subject to the constraint


x + 2y = 7 is .

Answer: 27.00, Range 26.50 to 27.50

Given f(x, y) = x² + 4y² − 2x + 8y, g(x, y) = x + 2y = 7
∇f(x, y) = [2x − 2, 8y + 8]ᵀ, ∇g(x, y) = [1, 2]ᵀ
We find the values of x, y, λ that simultaneously satisfy ∇f(x, y) = λ∇g(x, y) and g(x, y) = x + 2y = 7 to get the extreme points.
Solving ∇f(x, y) = λ∇g(x, y):
[2x − 2, 8y + 8]ᵀ = λ[1, 2]ᵀ =⇒ x = (λ + 2)/2, y = (λ − 4)/4
Solving g(x, y) = x + 2y = 7: (λ + 2)/2 + 2(λ − 4)/4 = 7 =⇒ 2λ − 2 = 14 =⇒ λ = 8
Therefore, the extreme point coordinates will be (x1, y1) = ((λ + 2)/2, (λ − 4)/4) = (5, 1)
f(5, 1) = 5² + 4·1² − 2·5 + 8·1 = 25 + 4 − 10 + 8 = 27
Taking 2 neighbouring points on the line g(x, y) = x + 2y = 7:
f(3, 2) = 3² + 4·2² − 2·3 + 8·2 = 9 + 16 − 6 + 16 = 35
f(1, 3) = 1² + 4·3² − 2·1 + 8·3 = 1 + 36 − 2 + 24 = 59
We can see the function f(x, y) has a minimum at the point (5, 1).
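A quick numerical confirmation (a sketch: a stationarity check plus a scan along the constraint line, not part of the original solution):

```python
def f(x, y):
    return x**2 + 4*y**2 - 2*x + 8*y

x_s, y_s, lam = 5.0, 1.0, 8.0
# Stationarity: grad f = lambda * grad g, where grad g = (1, 2)
assert (2*x_s - 2, 8*y_s + 8) == (lam * 1, lam * 2)

# Along the constraint, x = 7 - 2y; scan y to confirm the minimum at y = 1
ys = [i / 100 for i in range(-500, 501)]
best_y = min(ys, key=lambda y: f(7 - 2*y, y))
print(best_y, f(7 - 2*best_y, best_y))  # 1.0 27.0
```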
7. (1 point) The minimum value of f(x, y) = x² + 4y² − 2x + 8y subject to the constraint
x + 2y = 7 occurs at the below point:
A. (5,5)
B. (-5,5)
C. (1,5)
D. (5,1)

Answer: D
Refer to the solution of the previous question

8. (1 point) The distance of the plane x + y − 2z = 6 from the origin is .


Answer: 2.45, Range 2.30 to 2.60
Given squared distance d² = f(x, y, z) = x² + y² + z², g(x, y, z) = x + y − 2z = 6
∇f(x, y, z) = [2x, 2y, 2z]ᵀ, ∇g(x, y, z) = [1, 1, −2]ᵀ
We find the values of x, y, z, λ that simultaneously satisfy ∇f(x, y, z) = λ∇g(x, y, z) and g(x, y, z) = x + y − 2z = 6 to get the extreme points.
Solving ∇f(x, y, z) = λ∇g(x, y, z):
[2x, 2y, 2z]ᵀ = λ[1, 1, −2]ᵀ =⇒ x = λ/2, y = λ/2, z = −λ
Solving g(x, y, z) = x + y − 2z = 6: λ/2 + λ/2 + 2λ = 6 =⇒ λ = 2
Therefore, the extreme point coordinates will be (x1, y1, z1) = (λ/2, λ/2, −λ) = (1, 1, −2)
f(1, 1, −2) = 1² + 1² + (−2)² = 1 + 1 + 4 = 6
Taking 2 neighbouring points on the plane g(x, y, z) = x + y − 2z = 6:
f(0, 0, −3) = 0² + 0² + (−3)² = 9
f(2, 2, −1) = 2² + 2² + (−1)² = 9
We can see the function f(x, y, z) has a minimum at the point (1, 1, −2). The minimum distance of the plane x + y − 2z = 6 from the origin is d = √6 = 2.45.
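The same number follows from the standard point-to-plane distance formula |d|/‖n‖, which also gives the foot of the perpendicular (a quick sketch):

```python
import math

# Plane x + y - 2z = 6 has normal n = (1, 1, -2)
a, b, c, d = 1, 1, -2, 6
norm_sq = a*a + b*b + c*c
dist = abs(d) / math.sqrt(norm_sq)                    # |d| / ||n||
closest = tuple(d * t / norm_sq for t in (a, b, c))   # foot of the perpendicular
print(round(dist, 2), closest)  # 2.45 (1.0, 1.0, -2.0)
```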

9. (1 point) The point on the plane x + y − 2z = 6 that is closest to the origin is


A. (0, 0, 0)
B. (1, 1, 1)
C. (−1, 1, 2)
D. (1, 2, 0)
E. (1, 1, −2)

Answer: E
Refer to the solution of the previous question

10. (1 point) A box (cuboid shaped) is to made out of the cardboard with the total area of
24 cm2 . The maximum volume occupied by the box will be .

Answer: 8, Range 7.50 to 8.50


Given volume V = f(x, y, z) = xyz,
surface area g(x, y, z) = 2xy + 2yz + 2zx = 24 =⇒ g(x, y, z) = xy + yz + zx = 12
∇f(x, y, z) = [yz, zx, xy]ᵀ, ∇g(x, y, z) = [y + z, z + x, x + y]ᵀ
We find the values of x, y, z, λ that simultaneously satisfy ∇f(x, y, z) = λ∇g(x, y, z) and g(x, y, z) = xy + yz + zx = 12 to get the extreme points.
Solving ∇f(x, y, z) = λ∇g(x, y, z):
[yz, zx, xy]ᵀ = λ[y + z, z + x, x + y]ᵀ =⇒ xy + xz = xy + yz = xz + yz = xyz/λ =⇒ x = y = z = 0 or 2λ
Solving g(x, y, z) = xy + yz + zx = 12: x = y = z = 2λ = ±2
Since x, y, z are lengths, they cannot be negative. Therefore, the extreme point coordinates will be:
(x1, y1, z1) = (2λ, 2λ, 2λ) = (2, 2, 2), volume V = f(2, 2, 2) = 2·2·2 = 8
(x2, y2, z2) = (0, 0, 0), volume V = f(0, 0, 0) = 0
The box has maximum volume when it is a cube with edge 2 cm, and the maximum volume is 8 cm³.

11. (1 point) You are planning to setup a manufacturing business where the revenue (r) is
a function of labour units (l), material units (m) and fixed cost (c), r = l.m2 + 2c. You
have an annual budget (b) of 1004 million rupees, b = 2l + 16m + c to run the business.
What would be maximum revenue that can be generated from the business in million
rupees under the optimum combination of labour units (l), material units (m) and fixed
cost(c).
A. 1944
B. 2036
C. 2072
D. 2080

Answer: A
Given revenue r = f(l, m, c) = l·m² + 2c, budget constraint g(l, m, c) = 2l + 16m + c = 1004
∇f(l, m, c) = [m², 2lm, 2]ᵀ, ∇g(l, m, c) = [2, 16, 1]ᵀ
We find the values of l, m, c, λ that simultaneously satisfy ∇f(l, m, c) = λ∇g(l, m, c) and g(l, m, c) = 2l + 16m + c = 1004 to get the extreme points.
Solving ∇f(l, m, c) = λ∇g(l, m, c):
[m², 2lm, 2]ᵀ = λ[2, 16, 1]ᵀ =⇒ λ = 2, m = ±2, l = ±8
Solving g(l, m, c) = 2l + 16m + c = 1004: c = 1004 − 2l − 16m
Therefore, the extreme point coordinates will be:
(l1, m1, c1) = (8, 2, 956), revenue r = f(8, 2, 956) = 8·2² + 2·956 = 1944
(l2, m2, c2) = (−8, −2, 1052), revenue r = f(−8, −2, 1052) = −8·(−2)² + 2·1052 = 2072
Since the configurations cannot be negative, l ≥ 0, m ≥ 0, c ≥ 0.
We can see the maximum revenue is achieved under the configuration l = 8, m = 2, c = 956. The maximum revenue is 1944.

12. (1 point) The distance of the point on the sphere x2 + y 2 + z 2 = 3 closest to the point
(2,2,2) is .

Answer: 1.73, Range 1.50 to 2.00


Given squared distance d² = f(x, y, z) = (x − 2)² + (y − 2)² + (z − 2)², g(x, y, z) = x² + y² + z² = 3
∇f(x, y, z) = [2(x − 2), 2(y − 2), 2(z − 2)]ᵀ, ∇g(x, y, z) = [2x, 2y, 2z]ᵀ
We find the values of x, y, z, λ that simultaneously satisfy ∇f(x, y, z) = λ∇g(x, y, z) and g(x, y, z) = x² + y² + z² = 3 to get the extreme points.
Solving ∇f(x, y, z) = λ∇g(x, y, z):
[2(x − 2), 2(y − 2), 2(z − 2)]ᵀ = λ[2x, 2y, 2z]ᵀ =⇒ x = y = z = 2/(1 − λ)
Solving g(x, y, z) = x² + y² + z² = 3: 12/(1 − λ)² = 3 =⇒ λ = 3, −1
Therefore, the extreme point coordinates will be:
(x1, y1, z1) = (1, 1, 1), d = √((1−2)² + (1−2)² + (1−2)²) = √3 = 1.73
(x2, y2, z2) = (−1, −1, −1), d = √((−1−2)² + (−1−2)² + (−1−2)²) = 3√3 = 5.20
Taking 2 neighbouring points on the sphere g(x, y, z) = x² + y² + z² = 3:
(x3, y3, z3) = (1, 1, −1), d = √((1−2)² + (1−2)² + (−1−2)²) = √11 = 3.32
(x4, y4, z4) = (1, −1, −1), d = √((1−2)² + (−1−2)² + (−1−2)²) = √19 = 4.36
We can see the point on the sphere closest to the point (2, 2, 2) is (1, 1, 1) and the farthest from the point (2, 2, 2) is (−1, −1, −1).

13. (1 point) The distance of the point on the sphere x2 + y 2 + z 2 = 3 farthest from the
point (2,2,2) is .

Answer: 5.20, Range 5.00 to 5.50


Refer to the solution of the previous question
Course: Machine Learning - Foundations
Practice Questions
Lecture Details: Week 9

1. (1 point) Let f (x) = −2x2 + 5. At x = −3, is f (x) increasing or decreasing?


A. increasing
B. decreasing

Answer: A
If the first derivative of f(x) at a point x is positive or negative, then f(x) is increasing or decreasing there, respectively.
f'(x) = −4x; at x = −3 we get f'(−3) = 12 > 0. Hence increasing.

2. (1 point) For a function f (x) = −x + 2x2 , the global minimum occurs at


A. x = −0.25
B. x = −0.5
C. x = 0.5
D. x = 0.25

Answer: D
Global minimum occurs at a point x when ∇(f (x)) = 0.
−1 + 4x = 0
x = 0.25
3. (1 point) (Multiple select) Consider two convex functions f(x) = x² and g(x) = e^(3x²). Choose the correct convex function(s) that is a resultant of combination of f(x) and g(x).
A. h(x) = x² + e^(3x²)
B. h(x) = x² e^(−3x²)
C. h(x) = x² e^(3x²)
D. h(x) = x² − e^(3x²)

Answer: A, C
h(x) = f(x) + g(x) = x² + e^(3x²) is a convex function (a sum of convex functions is convex).
h(x) = f(x) × g(x) = x² e^(3x²) is a convex function.

4. (1 point) Consider two functions g(x) = 2x − 3 and f(x) = x − 10 ln(5x). Select the true statement.
A. h = f∘g is a convex function.
B. h = f∘g is a concave function.

Answer: A
Since f(x) is a convex function and g(x) is a linear (affine) function, h = f∘g is also a convex function.
(Common data for Q5-Q7) Given below is a set of data points and their labels.

X y
[1,0] 1.5
[2,1] 2.9
[3,2] 3.4
[4,2] 3.8
[5,3] 5.3

To perform linear regression on this data set, the sum of squares error with respect to
w is to be minimized.
5. (1 point) Which of the following is the optimal w∗ computed using the analytical method?
 
A. [1.5, 0]ᵀ
B. [1.255, −0.317]ᵀ
C. [1.512, 0.004]ᵀ
D. [−1.5, 0]ᵀ
Answer: B
Optimal w∗ = (XᵀX)⁻¹(Xᵀy)
X = [1 0; 2 1; 3 2; 4 2; 5 3]
XᵀX = [55 31; 31 18]
Xᵀy = [59.2, 33.2]ᵀ
w∗ = [1.255, −0.317]ᵀ
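The normal equations (XᵀX)w = Xᵀy are small enough here to solve by hand with a 2×2 inverse (a sketch without external libraries; variable names are illustrative):

```python
X = [[1, 0], [2, 1], [3, 2], [4, 2], [5, 3]]
y = [1.5, 2.9, 3.4, 3.8, 5.3]

# Entries of X^T X and X^T y
a = sum(r[0] * r[0] for r in X)           # 55
b = sum(r[0] * r[1] for r in X)           # 31
c = sum(r[1] * r[1] for r in X)           # 18
p = sum(r[0] * t for r, t in zip(X, y))   # 59.2
q = sum(r[1] * t for r, t in zip(X, y))   # 33.2

det = a * c - b * b                       # det(X^T X) = 55*18 - 31^2 = 29
w1 = (c * p - b * q) / det
w2 = (a * q - b * p) / det
print(round(w1, 3), round(w2, 3))  # 1.255 -0.317
```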
 
6. (1 point) Let w¹ be initialized to [0.1, 0.1]ᵀ. Gradient descent optimization is used to find the value of the optimal w∗. For the first iteration t = 1, which of the following is the gradient computed with respect to w¹?
A. [50.6, 28.3]ᵀ
B. [8.6, 4.9]ᵀ
C. [−50.6, −28.3]ᵀ
D. [−8.6, −4.9]ᵀ
Answer: C
Gradient ∇f(w) = (XᵀX)w − Xᵀy
∇f(w¹) = [−50.6, −28.3]ᵀ

7. (1 point) Using the gradient descent update equation with a learning rate ηt = 0.1, compute the value of w at t = 2.
A. [5.16, −2.93]ᵀ
B. [5.16, 2.93]ᵀ
C. [5.5, 3.5]ᵀ
D. [5.5, −3.5]ᵀ
Answer: B
Given w¹ = [0.1, 0.1]ᵀ
Update equation: w² = w¹ − ηt ∇f(w¹)
w² = [0.1, 0.1]ᵀ − 0.1 × [−50.6, −28.3]ᵀ = [5.16, 2.93]ᵀ
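One full gradient-and-update step can be sketched in a few lines (reusing the XᵀX and Xᵀy values computed in Q5):

```python
XtX = [[55, 31], [31, 18]]
Xty = [59.2, 33.2]
w = [0.1, 0.1]
eta = 0.1

# Gradient of the sum-of-squares objective: (X^T X) w - X^T y
grad = [XtX[i][0] * w[0] + XtX[i][1] * w[1] - Xty[i] for i in range(2)]
# Gradient-descent update: w <- w - eta * grad
w = [w[i] - eta * grad[i] for i in range(2)]
print([round(g, 2) for g in grad], [round(v, 2) for v in w])  # [-50.6, -28.3] [5.16, 2.93]
```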
(Common data for Q8-Q10) A rectangle has a perimeter of 20 m. Using the Lagrange
multiplier method, find the height and width of the rectangle which results in maximum
area.

8. (1 point) What is the optimal height?


Answer: 5
range: 4.5 to 5.5

9. (1 point) What is the optimal width?


Answer: 5
range: 4.5 to 5.5
10. (1 point) Enter the value of Lagrange multiplier.
Answer: 2.5
The optimization problem can be stated as follows:
max f(x, y) = xy subject to 2x + 2y = 20
Solve the equation ∇f(x, y) = λ∇g(x, y):
∂f/∂x = λ ∂g/∂x and ∂f/∂y = λ ∂g/∂y
y = 2λ and x = 2λ
Therefore, y/2 = λ = x/2, which leads to x = y.
Substituting x = y in 2x + 2y = 20 we get y = 5. Therefore x = 5 and λ = 2.5.
11. (1 point) (Multiple select) Which of the following statements about primal and dual
problems is (are) true?.
A. Dual of dual is primal.
B. If either the primal or dual problem has an infeasible solution, then the value
of the objective function of the other is unbounded.
C. If either the primal or dual problem has a solution then the other also has a
solution and their optimum values are equal.
D. If one of the variables in the primal has unrestricted sign, the corresponding
constraint in the dual is satisfied with equality.
Answer: A,B,C,D
(Common data for Q12, Q13) Rahul is a consumer who wants to maximize his utility
subject to some constraints. He consumes two goods x and y and the utility function is
the product of x and y. His budget is Rs.1000. The per unit price of goods x and y are
Rs.15 and Rs.20 respectively.
12. (1 point) Choose the correct optimization problem.
A. maximize x + y subject to (15x)(20y) = 1000
B. maximize xy subject to 15x + 20y = 1000
C. minimize xy subject to 15x + 20y = 1000
D. minimize x + y subject to (15x)(20y) = 1000
Answer: B
Utility function to maximixe is xy
constraint is the budget: 15x + 20y = 1000

13. (1 point) Choose the equivalent Lagrange function for the problem.
A. L(x, y, λ) = x + y − λ(15x + 20y − 1000)
B. L(x, y, λ) = x + y + λ(15x + 20y + 1000)
C. L(x, y, λ) = xy + λ(15x + 20y − 1000)
D. L(x, y, λ) = xy − λ(15x + 20y + 1000)

Answer: C
Lagrange function = f(x) + λh(x) = xy + λ(15x + 20y − 1000)

14. (2 points) Minimize the function f = x21 + 60x1 + x22 subject to the constraints g1 =
x1 − 80 ≥ 0 and g2 = x1 + x2 − 120 ≥ 0 using KKT conditions. Which of the following
is the optimal solution set?
A. [x∗1 , x∗2 ] = [80, 40]
B. [x∗1 , x∗2 ] = [−80, −40]
C. [x∗1 , x∗2 ] = [45, 75]
D. [x∗1 , x∗2 ] = [−45, −75]

Answer: A
The KKT conditions for the problem with multiple variables are given as follows:
(i) Stationarity condition:
∂f/∂xᵢ + Σⱼ₌₁² uⱼ ∂gⱼ/∂xᵢ = 0, i = 1, 2

Therefore, we get
2x1 + 60 + u1 + u2 = 0 (1)
2x2 + u2 = 0 (2)
(ii) Complementary slackness condition

ui gi = 0, i = 1, 2

Therefore, we get
u1 (x1 − 80) = 0 (3)
u2 (x1 + x2 − 120) = 0 (4)
(iii) Primal feasibility condition
gi ≥ 0
Therefore, we get
x1 − 80 ≥ 0 (5)
x1 + x2 − 120 ≥ 0 (6)
(iv) Dual feasibility condition: since the stationarity condition is written here with a '+' sign on the multipliers (for constraints gⱼ ≥ 0 in a minimization), the multipliers satisfy uᵢ ≤ 0.
Therefore, we get
u1 ≤ 0 (7)
u2 ≤ 0 (8)
From (3), either u1 = 0 or x1 = 80
Case (i): Let u1 = 0
Substitute in (1);

2x1 + 60 + u2 = 0
u2
x1 = − − 30 (9)
2
Substitute in (2);

2x2 + u2 = 0
u2
x2 = − (10)
2
Substitute in (9) and (10) in (4);
u2 u2
u2 (− − 30 − − 120) = 0
2 2
u2 (−u2 − 150) = 0
From this, u2 = 0 or u2 = −150
Using u2 = 0 in (9) and (10) we get x1 = −30 and x2 = 0 respectively. This violates (5)
and (6).
Using u2 = −150 in (9) and (10) we get, x1 = 45 and x2 = 75, respectively. This violates
(5).
Case (ii): Let x1 = 80, substitute in (1)

2(80) + 60 + u1 + u2 = 0
u2 = −u1 − 220 (11)
Substitute (11) in (2):
u1 = 2x2 − 220 (12)
Using (12) in (11) we get u2 = −2x2 . Using x1 = 80 and u2 = −2x2 in (4) we get
−2x2 (80 + x2 − 120) = 0
x2 (x2 − 40) = 0 (13)
From this, either x2 = 0 or x2 = 40.
Case (ii-a): Let x2 = 0, then from (12) we get u1 = −220.
Substituting x1 = 80, x2 = 0 and u1 = −220 in (6) will violate the condition.
Case (ii-b): Let x2 = 40
Substituting x1 = 80, x2 = 40 in (5) and (6), the conditions are satisfied.
Substituting x1 = 80, x2 = 40 in (11) and (12), we get u1 = −140 and u2 = −80. These
values satisfy conditions (7) and (8).
Thus the optimal solution is [x∗1 , x∗2 ] = [80, 40] because it satisfies the KKT conditions.
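The KKT solution can be cross-checked with a brute-force scan of the feasible region (an integer-grid sketch, not part of the original derivation):

```python
def f(x1, x2):
    return x1**2 + 60*x1 + x2**2

# Scan integer points satisfying x1 >= 80 and x1 + x2 >= 120
best = min(
    (f(x1, x2), x1, x2)
    for x1 in range(80, 201)
    for x2 in range(-200, 201)
    if x1 + x2 >= 120
)
print(best)  # (12800, 80, 40)
```

The minimum over the grid is attained at (80, 40), matching the KKT solution.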
Course: Machine Learning - Foundations
Test Questions
Lecture Details: Week 10

1. (1 point) For a given point x = 2, is the function f(x) = x e^(−3x) increasing or decreasing?
A. increasing
B. decreasing

Answer: B
If the first derivative of f (x) at any point x is positive or negative, then we call f (x) as
increasing or decreasing respectively.
f'(x) = e^(−3x)(1 − 3x); at x = 2 we get f'(2) ≈ −0.012 < 0. Hence decreasing.

2. (1 point) Find the value of the function f(x) = x + 3x² at its global minimum point.
Answer: -0.0833
range: -0.09 to -0.08

At the global minimum, f'(x) = 0. Therefore 1 + 6x = 0, so x = −1/6.
f(−1/6) = −1/6 + 3·(1/36) = −1/12 ≈ −0.0833
2
3. (1 point) Choose the largest interval of x in which the function f(x) = x e^(x²) is convex.
A. (−0.5, 0.5)
B. (0, ∞)
C. (−∞, 0)
D. (1, 0)

Answer: B
To find the interval over which f(x) is convex, find where f''(x) > 0.
f'(x) = e^(x²)(2x² + 1)
f''(x) = e^(x²)(4x³ + 6x)
Therefore f''(x) > 0 implies e^(x²)(4x³ + 6x) > 0.
To satisfy this inequality, we need e^(x²) > 0 and (4x³ + 6x) > 0.
Because e raised to any power is positive, the first condition holds for every x.
The second condition can be written as 2x(2x² + 3) > 0.
(2x² + 3) is greater than 0 for all x, and 2x is positive only when x > 0.
Thus, in interval notation, the largest interval of x for which f(x) is convex is (0, ∞).

4. (1 point) (Multiple select) Let the composition of two functions f(x) = sin(x) − 2x² + 1 and g(x) = eˣ be h = f∘g. At the point x = 5, select the true statement(s).
A. h(x) is a convex function.
B. h(x) is a concave function.
C. h(x) is a non-decreasing function.
D. h(x) is a decreasing function.

Answer: B, D

h(x) = f(g(x)) = sin(eˣ) − 2e²ˣ + 1

First derivative: h'(x) = eˣ cos(eˣ) − 4e²ˣ
At x = 5, h'(5) ≈ −8.8 × 10⁴, which is less than 0. Hence h(x) is decreasing.
Second derivative: h''(x) = eˣ cos(eˣ) − e²ˣ sin(eˣ) − 8e²ˣ
At x = 5, h''(5) ≈ −1.6 × 10⁵, which is negative. Hence h(x) is concave.

5. (2 points) Find the minimum value of f (x, y) = x + y subject to x2 + y 2 = 1, where x,


y are the coordinates of points on the circumference of the unit circle.
Answer: -1.414
range: -1 to -2

The Lagrangian function for the problem is
L(x, y, λ) = x + y + λ(x² + y² − 1)
Take partial derivatives with respect to x, y, λ and equate to zero:
∂L/∂x = 1 + 2λx = 0 =⇒ x = −1/(2λ)   (1)
∂L/∂y = 1 + 2λy = 0 =⇒ y = −1/(2λ)   (2)
∂L/∂λ = x² + y² − 1 = 0   (3)
Substituting (1) and (2) in (3), we get
1/(4λ²) + 1/(4λ²) = 1 =⇒ λ = ±1/√2   (4)
Using (4) in (1) and (2) we get x = ∓1/√2 and y = ∓1/√2. Since we want to minimize f(x, y), we consider x = −1/√2 and y = −1/√2.
Minimum value of f(x, y) = −√2 ≈ −1.414.
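The same minimum can be confirmed by parametrizing the circle as (cos t, sin t), since f = cos t + sin t = √2·sin(t + π/4) (a numeric sketch):

```python
import math

# Minimize cos(t) + sin(t) over a dense grid of angles on the unit circle
n = 100000
fmin = min(math.cos(2 * math.pi * i / n) + math.sin(2 * math.pi * i / n)
           for i in range(n))
print(round(fmin, 3))  # -1.414
```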

6. (1 point) For the functions g(x) = (3x + 2)² and f(x) = eˣ, select the plot that corresponds to the correct composition h = f∘g.
A. / B. / C. / D. [candidate plots not reproduced]

Answer: A
[Figure: plots of f(x), g(x) and their composition h(x) = e^((3x+2)²); not reproduced]

(Common data for Q7, Q8) Linear programming deals with the problem of finding a
vector x that minimizes a given linear function cT x, where x ranges over all vectors
(x ≥ 0) satisfying a given system of linear equations Ax = b. Here A is a m × n matrix,
c, x ∈ Rn and b ∈ Rm .
7. (1 point) Choose the correct dual program with y as the dual variable for the above linear program from the following.
A. min_y by subject to Aᵀy ≥ c
B. max_y bᵀy subject to Aᵀy ≤ c
C. max_y bᵀy subject to Aᵀy ≥ c
D. max_y by subject to Aᵀy ≤ c

Answer: B
8. (1 point) From the below given statements regarding constraints and decision variables
related to the primal and dual problems of the linear program, choose the correct state-
ment.
A. Primal problem has m constraints and m decision variables whereas dual problem has n constraints and n decision variables.
B. Primal problem has m constraints and n decision variables whereas dual problem has n constraints and m decision variables.
C. Primal problem has n constraints and n decision variables whereas dual problem has m constraints and m decision variables.
D. Primal problem has n constraints and m decision variables whereas dual problem has m constraints and n decision variables.
Answer: B
9. (2 points) Let a set of data points with five samples and two features per sample be
X = [1 2; 2 3; 4 2.5; 6 4; 7.5 5] and the corresponding labels be y = [1.5, 2, 2.5, 3, 4]ᵀ. Perform linear regression on this data set and choose the optimal solution for w to minimize the sum of squares error.
A. [0.2763, 1.2039]ᵀ
B. [0.0691, 0.3010]ᵀ
C. [0.1382, 0.6019]ᵀ
D. [0.0276, 0.1204]ᵀ

Answer: C
Optimal w∗ = (XᵀX)⁻¹(Xᵀy)
XᵀX = [113.25 79.5; 79.5 60.25]
Xᵀy = [63.5, 47.25]ᵀ
w∗ = [0.1382, 0.6019]ᵀ
A, B and D are scalar multiples of C.
(Common data for Q10, Q11) Krishna runs a steel fabrication industry and produces
steel products. He regularly purchases raw steel for Rs.500 per ton. His revenue is modeled by the function R(s) = 100√s, where s is the tons of steel purchased. His budget for steel purchase is Rs.150000.
10. (1 point) Using the Lagrangian function, find the amount of raw steel to be purchased to get maximum revenue.
Answer: 300
range: 298 to 302

Lagrangian function: L(s, λ) = 100 s^0.5 + λ(500s − 150000)
Take partial derivatives with respect to s and λ and equate to zero:
∂L/∂s = 50 s^(−0.5) + 500λ = 0
∂L/∂λ = 500s − 150000 = 0
s = 300

11. (1 point) What is the maximum revenue value in Rs?

Answer: 1732.05
range: 1730 to 1734

Maximum revenue: 100 × √300 = 1732.05
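A one-line check: revenue 100√s is increasing in s, so the optimum spends the full budget (a sketch):

```python
import math

budget, price = 150000, 500
s = budget / price                 # 500*s = 150000  =>  s = 300 tons
revenue = 100 * math.sqrt(s)
print(s, round(revenue, 2))  # 300.0 1732.05
```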
 
12. (1 point) Consider a vector ŵ = [2, 4, 3]^T in R³. In R³, there are many unit vectors. Use
the Lagrange method to find the unit vector which gives the minimum dot product.

A. û = (1/(2λ)) [2, 4, 3]^T, with λ ≥ 0
B. û = (−1/(3λ)) [2, 4, 3]^T, with λ ≥ 0
C. û = (−1/(4λ)) [2, 4, 3]^T, with λ ≥ 0
D. û = (−1/(2λ)) [2, 4, 3]^T, with λ ≥ 0

Answer: D

Let the unit vector be u = [x, y, z]^T. The objective is to minimize u · ŵ subject to x² + y² + z² = 1.

f(x, y, z) = u · ŵ = 2x + 4y + 3z

The Lagrangian function can be written as follows:

L(x, y, z, λ) = (2x + 4y + 3z) + λ(x² + y² + z² − 1)

Take partial derivatives with respect to x, y, z and λ and equate to zero:

∂L/∂x = 2 + 2λx = 0 =⇒ x = −1/λ
∂L/∂y = 4 + 2λy = 0 =⇒ y = −2/λ
∂L/∂z = 3 + 2λz = 0 =⇒ z = −3/(2λ)

Hence û = (−1/(2λ)) [2, 4, 3]^T with λ ≥ 0; the unit-norm constraint fixes λ, giving û = −ŵ/∥ŵ∥.
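A quick pure-Python check of this conclusion (a sketch, not part of the original solution): the minimiser is −ŵ/∥ŵ∥ and the minimum dot product is −∥ŵ∥ = −√29.

```python
from math import sqrt

w = [2, 4, 3]
norm = sqrt(sum(c * c for c in w))        # ||w|| = sqrt(29)
u_hat = [-c / norm for c in w]            # candidate minimiser -w/||w||
min_dot = sum(a * b for a, b in zip(u_hat, w))
print(round(min_dot, 4))                  # -5.3852, i.e. -sqrt(29)
```

Any other unit vector has dot product at least −∥ŵ∥ by the Cauchy-Schwarz inequality.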

(Common data for Q13, Q14) Solve the following linear program using KKT conditions.

minimize v = 24y1 + 60y2

subject to
0.5y1 + y2 ≥ 6
2y1 + 2y2 ≥ 14
y1 + 4y2 ≥ 13
y1 ≥ 0, y2 ≥ 0

13. (2 points) Choose the optimal solution for [y1∗ , y2∗ ]


A. [8,2.5]
B. [10,6]
C. [11,0.5]
D. [10.5,1]

Answer: C
[y1*, y2*] = [11, 0.5] satisfies all the constraints and gives the minimum value of v.

14. (1 point) Enter the minimum value of v.


Answer: 294

v = 24(11) + 60(0.5) = 294
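Since the feasible region is a 2-D polyhedron, the optimum can be verified by enumerating the vertices (pairwise intersections of constraint boundaries) and keeping the feasible one with the smallest objective. A brute-force sketch, not part of the original solution:

```python
from itertools import combinations

# Constraints a1*y1 + a2*y2 >= b: the three given ones plus y1 >= 0, y2 >= 0
cons = [(0.5, 1, 6), (2, 2, 14), (1, 4, 13), (1, 0, 0), (0, 1, 0)]

def feasible(y1, y2, eps=1e-9):
    return all(a1 * y1 + a2 * y2 >= b - eps for a1, a2, b in cons)

best = None
for (a1, a2, b), (c1, c2, d) in combinations(cons, 2):
    det = a1 * c2 - c1 * a2
    if abs(det) < 1e-12:
        continue                      # parallel boundaries: no vertex
    # Cramer's rule for the 2x2 system of boundary equations
    y1 = (b * c2 - d * a2) / det
    y2 = (a1 * d - c1 * b) / det
    if feasible(y1, y2):
        v = 24 * y1 + 60 * y2
        if best is None or v < best[0]:
            best = (v, y1, y2)
print(best)  # (294.0, 11.0, 0.5)
```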


Course: Machine Learning - Foundations
Week 11 Questions

PRACTICE QUESTIONS

1. (1 point) The continuous random variable X represents the amount of sunshine in hours
between noon and 8 pm at a skiing resort in the high season. The probability density
function, f(x), of X is modelled by

f(x) = kx² for 0 ≤ x ≤ 8, and 0 otherwise

Find the probability that on a particular day in the high season there is more than two
hours of sunshine between noon and 8 pm.

Answer: 0.98

Explanation: We use the pdf to find the required probability, but first we must find k
using the fact that the total probability is 1.

∫ fX(x) dx = ∫₀⁸ kx² dx = (k/3) x³ |₀⁸ = (512/3) k

Since the total probability is 1 =⇒ k = 3/512. So the probability that there is
more than two hours of sunshine is

P(X > 2) = ∫₂⁸ (3/512) x² dx = (x³/512) |₂⁸ = (8³ − 2³)/512 ≈ 0.98

∴ The answer is 0.98
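Since the CDF here is F(x) = x³/512 on [0, 8], the computation can be checked in one line (sketch):

```python
# CDF of f(x) = 3x^2/512 on [0, 8] is F(x) = x^3 / 512
F = lambda x: x**3 / 512
print(round(1 - F(2), 4))  # 0.9844, i.e. ≈ 0.98
```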

2. (1 point) Let X be a continuous random variable with PDF

fX(x) = 6x + bx² for 0 < x < 1, and 0 otherwise

Calculate the value of b.

Answer: -6

Explanation: We use the fact that the total probability is 1:

∫ fX(x) dx = ∫₀¹ (6x + bx²) dx = [3x² + (b/3) x³] |₀¹ = 3 + b/3

Since the total probability is 1 =⇒ 3 + b/3 = 1 =⇒ b = −6.

∴ The answer is −6

3. (1 point) Let X be a continuous random variable with PDF

fX(x) = 4xᵏ for 0 < x < 1, and 0 otherwise

for some k, find E(X).

Answer: 0.8

Explanation: We use the pdf to find the expectation, but first we must find k using the
fact that the total probability is 1:

∫ fX(x) dx = ∫₀¹ 4xᵏ dx = (4/(k+1)) x^(k+1) |₀¹ = 4/(k+1)

Since the total probability is 1 =⇒ 4/(k+1) = 1 =⇒ k = 3. So the expectation is

E(X) = ∫ x fX(x) dx = ∫₀¹ x · 4x³ dx = (4/5) x⁵ |₀¹ = 0.8

∴ The answer is 0.8

4. (1 point) The time that passenger train will reach the station is uniformly distributed
between 2:00 PM and 4:00 PM. What is the probability that the train reaches station
exactly at 04:00 PM?

Answer: 0

Explanation: Let X be the random variable denoting the time the train will arrive
such that 2 ≤ X ≤ 4. Now, X is a continuous random variable so the probability of X
being equal to a single point is 0. That is, P (X = 4) = 0.

∴ The answer is 0.

5. (1 point) If X is an exponential random variable with rate parameter λ then which of


the following statement(s) is(are) correct.
A. P (X > x + k|X < k) = P (X > k) for k, x ≥ 0.
B. P (X > x + k|X > k) = P (X > k) for k, x ≤ 0.
C. P (X > x + k|X > k) = P (X > x) for k, x ≥ 0.
D. P (X > x + k|X > k) = P (X > k) for k, x ≥ 0.

Answer: C

Explanation: For an exponential random variable, we have a memory-less property.


Say X denotes waiting time in hours. Given that you have waited for k hours, waiting
for x more hours is the same as waiting for only x hours. That is,

P (X > x + k | X > k) = P (X > x)

We can see why this is true. We know that the cdf of an exponential distribution is
FX (x) = 1 − e−λx =⇒ P (X > x) = 1 − FX (x) = e−λx . Now,

P(X > x + k | X > k) = P(X > x + k and X > k) / P(X > k)
                     = P(X > x + k) / P(X > k)
                     = e^(−λ(x+k)) / e^(−λk)
                     = e^(−λx)
                     = P(X > x)
∴ Option C is correct.

6. (1 point) The lifetime of a electric bulb is exponentially distributed with a mean life of
18 months. If there is a 60% chance that an electric bulb will last for at most t months,
then what is the value of t?

Answer: 16.5

Explanation: Let X be the random variable denoting the lifetime of an electric bulb
in months. Since the mean life is 18 months, the rate parameter is λ = 1/18, which makes
the pdf fX(x) = (1/18) e^(−x/18) for x ≥ 0.

Now, the probability that an electric bulb will last for at most t months is P(X < t), given by

P(X < t) = 1 − e^(−t/18) = 0.6 =⇒ e^(−t/18) = 0.4 =⇒ t = (−18) ln(0.4) ≈ 16.5

∴ The answer is 16.5 months.
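The closed form t = −18 ln(0.4) can be checked quickly (sketch):

```python
from math import log, exp

t = -18 * log(0.4)                 # solve 1 - exp(-t/18) = 0.6 for t
print(round(t, 1))                 # 16.5
print(round(1 - exp(-t / 18), 2))  # 0.6 (sanity check of the CDF value)
```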

7. (1 point) (Multiple Select) Let X be uniformly distributed with parameters a and b,
then which of the following is/are true:

A. E(X²) = (b² + a² + ab)/3
B. f(x) = −1/(a − b) ; a ≤ x ≤ b
C. E(X) = (b + a)/2
D. V(X) = (b − a)²/12

Answer: A, B, C, D

Explanation: Here, we have a uniformly distributed random variable X ~ U(a, b).
Such a distribution has a constant pdf fX(x) = 1/(b − a), x ∈ [a, b].

Since fX(x) = 1/(b − a) = −1/(a − b) =⇒ Option B is correct.

Now, let us find the expectation of X:

E(X) = ∫ₐᵇ x/(b − a) dx = x²/(2(b − a)) |ₐᵇ = (b² − a²)/(2(b − a)) = (a + b)/2

So, Option C is correct.

Now, let us find the expectation of X²:

E(X²) = ∫ₐᵇ x²/(b − a) dx = x³/(3(b − a)) |ₐᵇ = (b³ − a³)/(3(b − a)) = (b² + a² + ab)/3

So, Option A is correct.

Now, let us find the variance of X:

V(X) = E(X²) − E(X)² = (b² + a² + ab)/3 − ((a + b)/2)²
     = (4b² + 4a² + 4ab − 3a² − 3b² − 6ab)/12 = (b − a)²/12

So, Option D is correct.

∴ Options A, B, C and D are correct.
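These closed forms can be checked against a midpoint Riemann sum for any concrete a, b (a sketch with a = 2, b = 5; these numbers are illustrative, not from the question):

```python
a, b, n = 2.0, 5.0, 100_000
dx = (b - a) / n
m1 = m2 = 0.0
for i in range(n):
    x = a + (i + 0.5) * dx          # midpoint rule
    m1 += x * dx / (b - a)          # accumulates E[X]
    m2 += x * x * dx / (b - a)      # accumulates E[X^2]
print(round(m1, 4), round(m2, 4), round(m2 - m1 * m1, 4))
# 3.5 13.0 0.75 -> (a+b)/2, (b^2+a^2+ab)/3, (b-a)^2/12
```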

8. (1 point) (Multiple Select) Which of the following option is/are correct?


A. The shape of the normal density curve is bell shaped.
B. Normal density curve is symmetric about its mean.
C. The area under a standard normal density curve is 1.
D. The standard normal density curve is symmetric about the value 0.

Answer: A, B, C, D

Explanation: The pdf of a normal curve is

fX(x) = (1/(σ√(2π))) exp( −(1/2) ((x − µ)/σ)² )

Firstly, we know this equation is a bell curve, so option A is correct. Also, we can
see that this curve is symmetric about the mean. This is because of the fact that
fX (µ − x) = fX (x). So, Option B is correct.

When considering the standard normal, the mean is 0 and variance is 1. Since it is
still a pdf, the area under it will be 1. So, Option C is correct.
Finally, Option D is correct for the same reason as B.

∴ Options A, B, C and D are correct.

9. (1 point) Let X and Y be continuous random variables with joint density

fXY(x, y) = cxy for 0 < x < 1, 0 < y < 1, and 0 otherwise

Calculate P(0 < X < 1/2, 0 < Y < 1/2)

Answer: 0.0625

Explanation: We use the pdf to find the required probability, but first we must find c
using the fact that the total probability is 1.

∫∫ fX,Y(x, y) d(x, y) = ∫₀¹∫₀¹ cxy dx dy = c · (x²/2 |₀¹) · (y²/2 |₀¹) = c/4

Since the total probability is 1 =⇒ c = 4. So, the required probability is given by

P(0 < X < 1/2, 0 < Y < 1/2) = ∫₀^0.5 ∫₀^0.5 4xy dx dy = 4 · (x²/2 |₀^0.5) · (y²/2 |₀^0.5) = 1/16 = 0.0625

∴ The answer is 0.0625

10. (1 point) Let X be a uniformly distributed random variable with µX = 15 and σX² = 25/3.
Calculate P(X > 17)

Answer: 0.3

Explanation: Here, we have a uniformly distributed random variable X ~ U(a, b). To
solve this question, we first find a and b. Given the mean,

µX = (a + b)/2 = 15 =⇒ a = 30 − b

Substituting the above into the equation for the variance:

σX² = (b − a)²/12 = 25/3 =⇒ (2b − 30)² = 100 =⇒ b = 20 =⇒ a = 10

So, X ~ U(10, 20). Finally, the required probability is given by

P(X > 17) = ∫₁₇²⁰ 1/(20 − 10) dx = (20 − 17)/10 = 0.3

∴ The answer is 0.3.
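Recovering a and b from the mean and variance can be scripted directly (sketch):

```python
mu, var = 15, 25 / 3
half = (12 * var) ** 0.5 / 2     # (b - a)/2 from var = (b - a)^2 / 12
a, b = mu - half, mu + half
p = (b - 17) / (b - a)           # uniform tail probability P(X > 17)
print(round(a, 2), round(b, 2), round(p, 2))  # 10.0 20.0 0.3
```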

11. (1 point) Let X be a continuous random variable with PDF

fX(x) = ax for 0 < x < 3; a(6 − x) for 3 ≤ x ≤ 6; 0 otherwise

Calculate P(0 ≤ X ≤ 4)

Answer: 0.77

Explanation: We use the pdf to find the required probability, but first we must find a
using the fact that the total probability is 1.

∫ fX(x) dx = ∫₀³ ax dx + ∫₃⁶ a(6 − x) dx
           = (a/2) x² |₀³ + a (6x − x²/2) |₃⁶
           = 9a/2 + a[(36 − 18) − (18 − 9/2)] = 9a

Since the total probability is 1 =⇒ a = 1/9. So, the required probability is given by

P(0 ≤ X ≤ 4) = ∫₀³ (x/9) dx + ∫₃⁴ ((6 − x)/9) dx
             = (1/18) x² |₀³ + (1/9)(6x − x²/2) |₃⁴
             = 1/2 + (1/9)[(24 − 8) − (18 − 9/2)] = 0.5 + 2.5/9 ≈ 0.77

∴ The answer is 0.77

12. Let X be exponentially distributed with parameter λ, then which of the following is/are
true about the variance of X:

a. V (X) = E[X 2 ] − (E[X])2


b. V (X) = E[X − E[X]]2
c. V (X) = (E[X])2
d. V (X) = E[X 2 ]

Answer: A, B, C

Explanation: According to the definition of the variance,

V(X) = E[(X − E[X])²] = E[X²] − (E[X])²

So, Options A and B are correct.


For an exponential distribution with rate parameter λ, E(X) = 1/λ and V(X) = 1/λ²,
so (E[X])² = 1/λ² = V(X). Also, E(X²) = V(X) + (E[X])² = 2/λ².
Given this information, we have option C is correct and D is incorrect.

∴ Options A, B and C are correct.

13. (1 point) Which of the following options is/are always true for three events A, B and C
of a random experiment?

a. If A ⊂ B then P (B|A) = 1
b. If B ⊂ A then P (B|A) = 1
c. If B ⊂ A then P (A|B) = 1
d. If P (A|B) > P (A) then P (B|A) > P (B) (Assuming the events have non-zero
probabilities)

Answer: A, C, D

Explanation: To solve this question, we will use the definition of conditional probability,
P(B|A) = P(B ∩ A)/P(A).

Now, if A ⊂ B =⇒ B ∩ A = A =⇒ P(B|A) = P(A)/P(A) = 1. Similarly, if B ⊂ A =⇒
A ∩ B = B =⇒ P(A|B) = P(B)/P(B) = 1.
So, Options A and C are correct.

If however we have B ⊂ A =⇒ B ∩ A = B =⇒ P(B|A) = P(B)/P(A), which need not equal 1.
So, Option B is incorrect.

Finally,

P(A|B) > P(A)
=⇒ P(A ∩ B)/P(B) > P(A)
=⇒ P(A ∩ B)/P(A) > P(B)
=⇒ P(B|A) > P(B)
So, Option D is correct.

∴ Options A, C and D are correct.

14. (1 point) Let the random experiment of selecting a number from a set of integers from
1 to 20, both inclusive. Assuming all numbers are equally likely to occur. Let A be
the event that the selected number is odd, B be the event that the selected number is
divisible by 3. Choose the correct option from the following:
A. A and B are dependent on each other.
B. A and B are independent on each other.
C. Can’t say

Answer: B
Explanation: To solve this question, we will use the definition of independent events.
Two events A and B are said to be independent if the occurrence of one event does not
effect the other. That is, P (A|B) = P (A) or equivalently, P (A ∩ B) = P (A)P (B).

Let X be the random variable uniformly distributed over the integers 1 to 20 inclusive.
We can compute:

P(A) = P(X is odd) = 10/20 = 0.5
P(B) = P(X is divisible by 3) = 6/20 = 0.3
P(A ∩ B) = P(X is odd and divisible by 3) = 3/20 = 0.15

Since P(A)P(B) = (0.5)(0.3) = 0.15 = P(A ∩ B), A and B are independent.

∴ Option B is correct.

15. (1 point) Mayur rolls a fair die repeatedly until a number that is a multiple of 3 is
observed. Let the random variable N represent the total number of times the die is rolled.
Find the probability distribution of N.

A. fN(k) = (2/3) × (1/3)^(k−1) for k = 1, 2, 3, . . . ; 0 otherwise
B. fN(k) = (1/3) × (2/3)^(k−1) for k = 1, 2, 3, . . . ; 0 otherwise
C. fN(k) = (1/2) × (1/2)^(k−1) for k = 1, 2, 3, . . . ; 0 otherwise
D. fN(k) = (1/6) × (5/6)^(k−1) for k = 1, 2, 3, . . . ; 0 otherwise

Answer: B

Explanation: Here, the random variable N, representing the number of die rolls, follows
a geometric distribution. The probability of success p is the probability of getting a
multiple of 3, so p = 2/6 = 1/3.

The pmf of a geometric distribution is given by fN(k) = (1 − p)^(k−1) p.

For p = 1/3, we get fN(k) = (1/3) × (2/3)^(k−1).

∴ Option B is correct.

16. (1 point) Shelly wrote an exam that contains 20 multiple choice questions. Each question
has 4 options out of which only one option is correct, and each question carries 1 mark.
She knows the correct answer to 10 questions, and for the remaining 10 questions she
chooses the options at random. Assume that all the questions are independent. Find
the probability that she will score 18 marks in the exam.

(C(n, k) below denotes the binomial coefficient.)

a. C(10, 8) × (1/4)⁸ × (3/4)²
b. C(10, 2) × (1/4)⁸ × (3/4)²
c. C(10, 2) × (1/4)² × (3/4)⁸
d. C(10, 8) × (3/4)⁸ × (1/4)²

Answer: A, B

Explanation: We need the probability that Shelly scores 18 marks. She knows the
answers to 10 questions, which guarantees her 10 marks. Let X be the random variable
denoting the number of questions she guesses correctly, bringing her total to 10 + X.

Since she guesses on a total of 10 questions, each being independent with a probability
of 0.25 of being correct, X follows a binomial distribution: X ~ Bin(n = 10, p = 0.25).

Putting it all together we get

P(marks = 18) = P(X + 10 = 18) = P(X = 8) = C(10, 8) (0.25)⁸ (0.75)²

So, Option A is correct.

Also, using the symmetry of binomial coefficients, C(n, k) = C(n, n − k) =⇒ C(10, 8) = C(10, 2).
So, Option B is also correct.

∴ Options A and B are correct.

17. (1 point) Suppose the number of runs scored off a delivery is uniform in {1, 2, 3, 4, 5, 6},
independent of what happens in other deliveries. A batsman needs to bat till he hits a
four. What is the probability that he needs fewer than 6 deliveries to do so? (Answer
the question correct to two decimal points.)

Answer: 0.6

Explanation: Let X be the random variable denoting the number of deliveries such
that the Xth delivery is 4 runs. X is a geometric random variable with success
probability p = 1/6, i.e. X ~ Geom(p = 1/6).

So the pmf of X becomes fX(x) = (1/6) × (5/6)^(x−1), x = 1, 2, 3, . . .

Now, we can calculate the required probability:

P(X < 6) = Σ_{x=1}^{5} P(X = x)
         = Σ_{x=1}^{5} (1/6)(5/6)^(x−1)
         = (1/6) · (1 − (5/6)⁵)/(1 − 5/6)
         = 1 − (5/6)⁵
         ≈ 0.6

∴ The answer is 0.6
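The geometric-series shortcut is a one-liner to verify (sketch):

```python
p = 1 / 6
prob = 1 - (1 - p) ** 5      # P(X < 6) = 1 - (5/6)^5 for a geometric variable
print(round(prob, 2))        # 0.6
```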
18. (1 point) Let X and Y be two random variables with joint PMF fX,Y(x, y) given below.
Calculate the covariance between X and Y.

        X=1    X=2    X=3
Y=1     0.25   0.25   0
Y=2     0      0.25   0.25

Answer: 0.25

Explanation: We can find the covariance between X and Y using the formula Cov(X, Y) =
E(XY) − E(X)E(Y). Let us find the relevant expectations.

E(X) = Σ x P(X = x) = (1)(0.25) + (2)(0.5) + (3)(0.25) = 2
E(Y) = Σ y P(Y = y) = (1)(0.5) + (2)(0.5) = 1.5
E(XY) = Σ xy P(X = x ∩ Y = y) = (1)(0.25) + (2)(0.25) + (4)(0.25) + (6)(0.25) = 3.25

Putting it all together we get

Cov(X, Y) = E(XY) − E(X)E(Y) = 3.25 − (2)(1.5) = 0.25

∴ The answer is 0.25.
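The covariance can be recomputed directly from the joint pmf (a sketch; the dictionary keys are (x, y) pairs from the table, with X taking values 1, 2, 3 and Y taking 1, 2):

```python
# Joint pmf from the table, keyed by (x, y)
pmf = {(1, 1): 0.25, (2, 1): 0.25, (3, 1): 0.0,
       (1, 2): 0.0,  (2, 2): 0.25, (3, 2): 0.25}

ex  = sum(x * p for (x, y), p in pmf.items())       # E[X]
ey  = sum(y * p for (x, y), p in pmf.items())       # E[Y]
exy = sum(x * y * p for (x, y), p in pmf.items())   # E[XY]
print(round(exy - ex * ey, 2))  # 0.25
```

Covariance is symmetric in X and Y, so the result does not depend on which variable is treated as the row label.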

19. (1 point) Let X and Y be two random variables with joint PMF fX,Y(x, y) given below.
Calculate fY|X=2(2).

        X=1     X=2    X=3
Y=1     0.25    0.25   0
Y=2     0.125   a1     0.125

Answer: 0.5

Explanation: To find the required probability, we first must find a1 using the law of
total probability:

Σ P(X = x ∩ Y = y) = 1
0.25 + 0.25 + 0 + 0.125 + a1 + 0.125 = 1
0.75 + a1 = 1 =⇒ a1 = 0.25

So, calculating the required probability:

fY|X=2(2) = P(Y = 2 | X = 2) = P(Y = 2 ∩ X = 2)/P(X = 2) = a1/(0.25 + a1) = 0.25/0.5 = 0.5

∴ The answer is 0.5.

20. (1 point) Two random variables X and Y are jointly distributed with PMF

fX,Y(x, y) = ax + y/4 for x, y ∈ {0, 1}, and 0 otherwise

Calculate the value of a.

Answer: 0.25

Explanation: We can find the value of a by using the law of total probability:

Σ fX,Y(x, y) = 1
fX,Y(0, 0) + fX,Y(1, 0) + fX,Y(0, 1) + fX,Y(1, 1) = 1
(0) + (a) + (1/4) + (a + 1/4) = 1
=⇒ 2a + 0.5 = 1 =⇒ a = 0.25

∴ The answer is 0.25

21. (1 point) A discrete random variable X has PMF as follows


(
k × (1 − x)2 for x = 1, 2, 3
P (X = x) =
0 otherwise

Calculate the value of k

Answer: 0.2

Explanation: For this question we can find the value of k by using the law of total
probability:

1 = Σ P(X = x) = P(X = 1) + P(X = 2) + P(X = 3) = (0) + (k) + (4k) = 5k =⇒ k = 0.2

∴ The answer is 0.2.


22. (1 point) A discrete random variable X has PMF as given below where a, b, c are con-
stants.

x 1 2 3 4
P (X = x) a b c 0.3

The CDF FX (x) is given below

x 1 2 3 4
FX (x) 0.2 0.6 0.7 d

Find the value of a + b + c + d.

Answer: 1.7

Explanation: Let us look at the pmf first. From the law of total probability, we know
that Σ_{x=1}^{4} P(X = x) = 1 =⇒ a + b + c + 0.3 = 1 =⇒ a + b + c = 0.7.

Now, since X takes the values 1, 2, 3, 4, we have P(X ≤ 4) = 1 =⇒ FX(4) = d = 1.

So, a + b + c + d = 1.7

∴ The answer is 1.7.

23. (1 point) A series of four matches is played between India and England. Let the random
variable X represent the absolute difference in the number of matches won by India and
England. Find the set of possible values that X can take. (Assume that the match does
not result in a tie.)

A. {0,2,4}
B. {0,1,2,4}
C. {0,1,2,3,4}
D. {0,1,3,4}

Answer: A
Explanation: Let Y be the random variable denoting the number of wins for India.
Since 4 matches are played, the number of wins for England is 4 − Y .
X is the absolute difference in the number of wins, we get, X = |Y − (4 − Y )| = |2Y − 4|.
Since Y represents the number of wins, Y can take on the values 0, 1, 2, 3, 4. This implies X
can take on the values 0, 2, 4.

∴ Option A is correct.

24. (1 point) There are five multiple choice questions asked in an exam. There is a 70% chance
that Shelly will solve a question correctly, independent of the rest. Let
X be the random variable that represents the number of questions she solves correctly.
Which of the following is the probability mass function of X?

A. P(X = x) = 0.00243 (x = 0), 0.02835 (x = 1), 0.1323 (x = 2), 0.3087 (x = 3), 0.36015 (x = 4), 0.16807 (x = 5)
B. P(X = x) = 0.00243 (x = 0), 0.02835 (x = 1), 0.16807 (x = 2), 0.3087 (x = 3), 0.36015 (x = 4), 0.1323 (x = 5)
C. P(X = x) = 0.00243 (x = 0), 0.02835 (x = 1), 0.3087 (x = 2), 0.1323 (x = 3), 0.36015 (x = 4), 0.16807 (x = 5)
D. P(X = x) = 0.00243 (x = 0), 0.01835 (x = 1), 0.1223 (x = 2), 0.2987 (x = 3), 0.37015 (x = 4), 0.19807 (x = 5)

Answer: A

Explanation: Let X be the random variable denoting the number of questions solved
correctly. There are 5 questions in total, each independently solved correctly with
probability 0.7, so X follows a binomial distribution:

X ~ Bin(n = 5, p = 0.7) =⇒ P(X = x) = C(5, x) (0.7)ˣ (0.3)^(5−x)

Simplifying for each value of x gives exactly the distribution in option A:

P(X = 0) = 0.00243, P(X = 1) = 0.02835, P(X = 2) = 0.1323,
P(X = 3) = 0.3087, P(X = 4) = 0.36015, P(X = 5) = 0.16807

∴ Option A is correct.
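The whole pmf table can be generated in a couple of lines (sketch):

```python
from math import comb

n, p = 5, 0.7
pmf = [round(comb(n, x) * p**x * (1 - p)**(n - x), 5) for x in range(n + 1)]
print(pmf)  # [0.00243, 0.02835, 0.1323, 0.3087, 0.36015, 0.16807]
```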
Course: Machine Learning - Foundations
Week 11 Questions

GRADED QUESTIONS
1. (1 point) The number of hours Messi spends each day practicing in ground is modelled
by the continuous random variable X, with p.d.f. f (x) defined by
f(x) = a(x − 1)(6 − x) for 1 < x < 6, and 0 otherwise

Find the probability that Messi will practice between 2 and 5 hours in the ground on a
randomly selected day.

Answer: 0.80

We know that ∫_{−∞}^{∞} f(x) dx = 1.

Solving the above equation with the given f(x), the value of a can be calculated: a = 6/125.

Then calculate P(2 ≤ X ≤ 5) = ∫₂⁵ f(x) dx ≈ 0.80
2. (1 point) Let X be a continuous random variable with PDF

fX(x) = ax for 0 < x < 2; a(4 − x) for 2 ≤ x ≤ 4; 0 otherwise

Calculate P(1 ≤ x ≤ 3)

Answer: 0.75

We know that ∫_{−∞}^{∞} f(x) dx = 1.

Solving the above equation with the given f(x), the value of a can be calculated: a = 1/4.

Then calculate P(1 ≤ X ≤ 3) = ∫₁² (x/4) dx + ∫₂³ ((4 − x)/4) dx = 3/8 + 3/8 = 0.75
3. (1 point) The probability density function of X is given by

fX(x) = x for 0 < x < 1; 2 − x for 1 < x < 2; 0 otherwise

Calculate E(X)

Answer: 1

Solution: We know that ∫_{−∞}^{∞} f(x) dx = 1, and

E(X) = ∫₀² x f(x) dx = ∫₀¹ x² dx + ∫₁² x(2 − x) dx = 1/3 + 2/3 = 1

4. (1 point) The distribution of the lengths of a cricket bat is uniform between 80 cm and
100 cm. There is no cricket bat outside this range. The mean and variance of the lengths
of the cricket bat are a and b. Calculate a + b

Answer: 123.33

E(X) = (h + l)/2 = (100 + 80)/2 = 90
V(X) = (h − l)²/12 = (100 − 80)²/12 ≈ 33.33

a + b = 90 + 33.33 ≈ 123.33

5. ( points) Suppose that random variable X is uniformly distributed between 0 and 10.
Then find P(X + 10/X ≥ 7). (Write the answer up to two decimal places.)

Answer: 0.7

Rearranging X + 10/X ≥ 7 gives the quadratic inequality X² − 7X + 10 ≥ 0, i.e.
(X − 2)(X − 5) ≥ 0, so for X ≥ 0 the inequality holds when

X ∈ [0, 2] ∪ [5, 10]

So the total area equals 0.2 + 0.5 = 0.7
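A grid check over (0, 10) confirms the region and the probability (a sketch, not part of the original solution):

```python
# P(X + 10/X >= 7) for X ~ U(0, 10): evaluate the condition on a fine grid
n = 100_000
hits = sum(1 for i in range(1, n) if (x := 10 * i / n) + 10 / x >= 7)
print(round(hits / n, 2))  # 0.7
```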

6. (1 point) (Multiple Select) Which of the following option is/are correct?


A. For a standard normal variate, the value of Standard Deviation is 1.
B. Normal Distribution is also known as Gaussian distribution.
C. In Normal distribution, the highest value of ordinate occurs at mean.
D. The shape of the normal curve depends on its standard deviation.

Answer: A, B, C, D
Option A: Standard normal variate(distributions) have a mean of 0 and variance of 1.
SD is the squared root of variance.

Option B: Normal distribution is indeed known as the Gaussian distribution.

Option C: The normal distribution resembles a bell curve. The maximum concentration
is around the mean value/middle portion.

Option D: The spread of the distribution will change based on the SD. Hence, the
shape is dependent on the standard deviation.
—————————————————————————————————————-
Let X and Y be continuous random variables with joint density

fXY(x, y) = cxy for 0 < x < 2, 1 < y < 3, and 0 otherwise

From the above information answer questions 7-13.

7. ( points) Calculate the value of c

Answer: 1/8

Solution: We know that ∫_{−∞}^{∞} ∫_{−∞}^{∞} fXY(x, y) dy dx = 1, i.e.
∫₀² ∫₁³ cxy dy dx = 8c = 1, so c = 1/8.

8. ( points) Calculate P(0 < X < 1, 1 < Y < 2)

Answer: 3/32

Solution: P(0 < X < 1, 1 < Y < 2) = ∫₀¹ ∫₁² (1/8) xy dy dx = (1/8)(1/2)(3/2) = 3/32

9. ( points) Calculate P(0 < X < 1, Y > 2)

Answer: 5/32

Solution: P(0 < X < 1, Y > 2) = ∫₀¹ ∫₂³ (1/8) xy dy dx = (1/8)(1/2)(5/2) = 5/32

10. ( points) Calculate P((X + Y) < 3)

Answer: 0.25

Solution: P(X + Y < 3) = ∫₀² ∫₁^(3−x) (1/8) xy dy dx = 0.25

11. ( points) Calculate FX(1)

Answer: 0.25

Solution: Here FX(x) is the marginal cumulative distribution, i.e. P(X ≤ x). It is
calculated by integrating x from 0 to x = 1 over all values of y:

FX(1) = ∫₁³ ∫₀¹ (1/8) xy dx dy = 0.25

12. ( points) Calculate FY(2)

Answer: 3/8

Solution: Here FY(y) is the marginal cumulative distribution, i.e. P(Y ≤ y). It is
calculated by integrating y from 1 to y = 2 over all values of x:

FY(2) = ∫₀² ∫₁² (1/8) xy dy dx = 3/8

13. ( points) Calculate FX,Y(1, 4)

Answer: 0.25

Since the joint density factorizes, X and Y are independent, so
FX,Y(1, 4) = FX(1) × FY(4) = (0.25)(1) = 0.25
14. (1 point) Suppose a random variable X is best described by a uniform probability dis-
tribution with range 1 to 5. Find the value of a such that P (X ≤ a) = 0.5

Answer: 3
Solution: P (X ≤ 3) = 0.5, From the area of Uniform distribution curve.

15. (1 point) If X is an exponential random variable with rate parameter λ then which of
the following statement(s) is(are) correct.

a) E[X] = 1/λ
b) Var[X] = 1/λ²
c) P(X > x + k | X > k) = P(X > x) for k, x ≥ 0.
d) P(X > x + k | X > k) = P(X > k) for k, x ≥ 0.

Answer: A, B, C

Solution: Options (a) and (b) are correct for the exponential distribution.

P(X > x + k | X > k) = P((X > x + k) ∩ (X > k))/P(X > k)
                     = P(X > x + k)/P(X > k)
                     = e^(−λ(x+k))/e^(−λk)
                     = e^(−λx)

Hence, Option C is also correct and option (d) is incorrect.

16. (1 point) (Multiple Select) For three events, A, B, and C, with P (C) > 0, Which of
the following is/are correct?
A. P (Ac |C)= 1 - P (A|C)
B. P (ϕ|C) = 0
C. P (A|C) ≤ 1
D. if A ⊂ B then P (A|C) ≤ P (B|C)

Answer: A, B, C, D
Option A: Using standard probability properties. If we have an event E, then:

P (E) = 1 − P (E C )
Option B: The option asks for the probability of getting a null set given that an event
C has already occurred. It is also given to us that the probability of the occurrence of
the event C is not zero. Hence, P (ϕ|C) = 0
Option C: The probability of an event given to another with non-zero probability will
always be less than or equal to 1 because the total probability can only be 1.
Option D: The probability of getting a bigger set is more than a smaller set. A is a
smaller set than B, hence P (A|C) ≤ P (B|C)
17. (2 points) (Multiple Select) Let the random experiment be tossing an unbiased coin
two times. Let A be the event that the first toss results in a head, B be the event that
the second toss results in a tail and C be the event that on both the tosses, the coin
landed on the same side. Choose the correct statements from the following:
A. A and C are independent events.
B. A and B are independent events.
C. B and C are independent events.
D. A, B, and C are independent events.
Answer: A, B, C
Solution:

A = {HT, HH}, B = {HT, TT}, C = {TT, HH}

P(A) = P(B) = P(C) = 1/2

A ∩ B = {HT}, C ∩ B = {TT}, A ∩ C = {HH}, so each pairwise intersection has probability 1/4.

P(A ∩ B) = P(A) × P(B), hence option B is correct.
P(A ∩ C) = P(A) × P(C), hence option A is correct.
P(C ∩ B) = P(C) × P(B), hence option C is correct.

However, A ∩ B ∩ C = ∅, so P(A ∩ B ∩ C) = 0 ≠ P(A)P(B)P(C) = 1/8. The three events are
pairwise independent but not mutually independent, so option D is incorrect.
18. (2 points) (Multiple Select) If A1, A2, A3, . . . , An are non-empty disjoint sets and subsets
of sample space S, and a set An+1 is also a subset of S, then which of the following
statements are true?

A. The sets A1 ∩ An+1, A2 ∩ An+1, A3 ∩ An+1, . . . , An ∩ An+1 are disjoint.
B. If An+1, An are disjoint then A1, A2, . . . , An−1 are disjoint with An+1.
C. The sets A1, A2, A3, . . . , An, ϕ are disjoint.
D. The sets A1, A2, A3, . . . , An, S are disjoint.

Answer: A,C
Option A: Consider two sets Ai ∩ An+1 and Aj ∩ An+1 , where i ̸= j. The intersection of
these two sets is

(Ai ∩ An+1 ) ∩ (Aj ∩ An+1 ) = (Ai ∩ Aj ) ∩ An+1


= ϕ ∩ An+1 (since Ai , Aj are disjoint sets)

Hence, Ai ∩ An+1 and Aj ∩ An+1 are disjoint sets for all i ̸= j. Therefore, the sets
A1 ∩ An+1 , A2 ∩ An+1 , A3 ∩ An+1 , , An ∩ An+1 are disjoint.
Option B: If An+1 is disjoint with An, it doesn't follow that it is disjoint with the
other n − 1 sets as well. For example, with 3 disjoint sets A1, A2, A3, take A4 = A1 ∪ A2.
Then A4 is disjoint with A3 but not with the other two.
Option C: A1 , ...An are disjoint with each other(given). Every n set will also be disjoint
with ϕ (Intersection will give empty set)
Option D: Every n set is a subset of S. So, it’ll result in something when taken an
intersection with S. Hence, not disjoint.

19. (3 points) A triangular spinner having three outcomes can land on one of the numbers
0, 1 and 2 with probabilities shown in the table.

Outcome      0    1    2
Probability  0.7  0.2  0.1

Table 1: Table 10.2: Probability distribution

The spinner is spun twice. The total of the numbers on which it lands is denoted by X.
The probability distribution of X is:

A. x = 2, 3, 4, 5, 6 with P(X = x) = 49/100, 28/100, 1/100, 4/100, 18/100
B. x = 2, 3, 4, 5, 6 with P(X = x) = 28/100, 49/100, 18/100, 1/100, 4/100
C. x = 0, 1, 2, 3, 4 with P(X = x) = 49/100, 28/100, 18/100, 4/100, 1/100
D. x = 2, 3, 4, 5, 6 with P(X = x) = 28/100, 49/100, 18/100, 4/100, 1/100

Answer: C

The minimum total is 0 (0 + 0) and the maximum is 4 (2 + 2), so X takes values in {0, 1, 2, 3, 4}.

P(X = 4) = P(2 and 2) = 0.1 × 0.1 = 1/100
P(X = 3) = P(2 then 1) + P(1 then 2) = (0.1 × 0.2) + (0.2 × 0.1) = 4/100
P(X = 2) = P(1 and 1) + P(0 then 2) + P(2 then 0) = (0.2)² + 2(0.7)(0.1) = 18/100
P(X = 1) = P(0 then 1) + P(1 then 0) = 2(0.7)(0.2) = 28/100
P(X = 0) = P(0 and 0) = (0.7)² = 49/100
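The full distribution of the two-spin total can be enumerated directly (a sketch, not part of the original solution):

```python
from itertools import product
from collections import defaultdict

spinner = {0: 0.7, 1: 0.2, 2: 0.1}
dist = defaultdict(float)
for a, b in product(spinner, repeat=2):   # two independent spins
    dist[a + b] += spinner[a] * spinner[b]
print({k: round(v, 2) for k, v in sorted(dist.items())})
# {0: 0.49, 1: 0.28, 2: 0.18, 3: 0.04, 4: 0.01}
```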

20. (1 point) When throwing a fair die, what is the variance of the number of throws needed
to get a 1?

Answer: 30

Solution: The number of throws needed is geometric with p = 1/6, so

Var(X) = (1 − p)/p² = (1 − 1/6)/(1/6)² = 30
21. (1 point) The joint pmf of two random variables X and Y is given in the table.

        y=1   y=2   y=3   fX(x)
x=1     0.05  0     a1    0.15
x=2     0.1   0.2   a3    a2
x=3     a4    0.2   a5    0.45
fY(y)   0.3   0.4   a6

Find the value of fY|X=3(1), i.e. P(Y = 1 | X = 3)

Answer: 0.33

Solution: Using

Σ fXY(x, y) = 1 ............. (i)
fX(x) = Σ_{y} fXY(x, y) ............. (ii)
fY(y) = Σ_{x} fXY(x, y) ............. (iii)

we get a1 = 0.10, a2 = 0.40, a3 = 0.1, a4 = 0.15, a5 = 0.1, a6 = 0.3.

fY|X=3(1) = fXY(3, 1)/fX(3) = 0.15/0.45 ≈ 0.33

22. (1 point) (Multiple Select) Which of the following options is/are correct?
A. If Cov[X, Y ] = 0, then X and Y are independent random variables.
B. Cov[X, X] = V ar(X)
C. If X andPY are two independent random variables and Z = X + Y then
fZ (z) = x fX (x) × fY (z − x)
D. If X andPY are two independent random variables and Z = X + Y then
fZ (z) = y fX (x) × fY (z − x)

Answer: B, C
Solution:

Option B
Cov[X, X] is the covariance between X and X i.e V ar(X)

Option C is correct from its definition.


23. (1 point) (Multiple Select) A discrete random variable X has the cumulative distribution
function defined as follows.

FX(x) = (x³ + k)/40, for x = 1, 2, 3

Which of the following options is/are correct for FX(x) as given?

A. k = 17
B. Var(X) = 259/320
C. k = 13
D. Var(X) = 249/310

Answer: B, C

Solution: For k, use FX(3) = 1:

(3³ + k)/40 = 1 =⇒ k = 13

To calculate the variance, first calculate the probability distribution of X:

P(X = 1) = FX(1) = 14/40
P(X = 2) = FX(2) − FX(1) = 21/40 − 14/40 = 7/40
P(X = 3) = FX(3) − FX(2) = 19/40

Now with the Var(X) formula we get Var(X) = 259/320.
24. (1 point) In a game of Ludo, Player A needs to repeatedly throw an unbiased die till
he gets a 6. What is the probability that he needs fewer than 4 throws? (Answer the
question correct to two decimal points.)

Answer: 0.42

Solution: P(6) = 1/6. The number of throws follows a geometric distribution, hence

P(X < 4) = Σ_{n=1}^{3} (1/6) × (1 − 1/6)^(n−1) = 1 − (5/6)³ ≈ 0.42
6 6

25. (1 point) (Multiple Select) Let X and Y be two random variables with joint PMF
fXY(x, y) given in Table 10.3.

        y=0   y=1   y=2
x=0     1/6   1/4   1/8
x=1     1/8   1/6   1/6

Table 10.3: Joint PMF of X and Y.

Which of the following options is/are correct for fXY(x, y) given in Table 10.3.

A. P(X = 0, Y ≤ 1) = 5/12
B. P(X = 0, Y ≤ 1) = 7/12
C. X and Y are independent.
D. X and Y are dependent.

Answer: A, D

P(X = 0, Y ≤ 1) = P(X = 0, Y = 0) + P(X = 0, Y = 1) = 1/6 + 1/4 = 10/24 = 5/12

P(X = 0) = 1/6 + 1/4 + 1/8 = 13/24
P(Y = 0) = 1/6 + 1/8 = 7/24

Now, if they were independent, the product of the marginals would equal the joint
probability:

P(X = 0) P(Y = 0) = (13/24)(7/24) = 91/576 ≈ 0.158 (if independent)

But the table says fXY(0, 0) = 1/6 ≈ 0.167.

Because the two are not equal, we can conclude that X and Y are not independent.

26. (1 point) A discrete random variable X has the probability function as given in Table 10.4.

x         1  2  3  4  5  6
P(X)      a  a  a  b  b  0.3

Table 2: Table 10.4: Probability distribution

If E(X) = 4.2, then evaluate a + b

Answer: 0.3

Σ P(X = x) = 1 =⇒ 3a + 2b + 0.3 = 1 =⇒ 3a + 2b = 0.7

E(X) = Σ xi P(X = xi) = a + 2a + 3a + 4b + 5b + 6(0.3) = 4.2 =⇒ 6a + 9b = 2.4

Solving both equations, we get a = 0.1 and b = 0.2, so a + b = 0.3
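The 2×2 linear system can be solved by elimination in a few lines (sketch):

```python
# 3a + 2b = 0.7 (total probability) and 6a + 9b = 2.4 (from E[X] = 4.2)
b = (2.4 - 2 * 0.7) / 5      # subtract twice the first equation: 5b = 1.0
a = (0.7 - 2 * b) / 3
print(round(a, 2), round(b, 2), round(a + b, 2))  # 0.1 0.2 0.3
```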

27. (1 point) A discrete random variable X has the probability function as follows.
(
k × (1 − x)2 , for x = 1, 2, 3
P (X = x) =
0, otherwise

Evaluate E(X)

Answer: 2.8
Solution:

Σ P(X = x) = 1 =⇒ k + 4k = 1 =⇒ k = 0.2

So P(X = 1) = 0, P(X = 2) = k = 0.2, P(X = 3) = 4k = 0.8, and

E(X) = Σ xi P(X = xi) = 0.2 × 2 + 0.8 × 3 = 0.4 + 2.4 = 2.8
