EE782 Advanced Topics in Machine Learning
End-Semester Examination Question Paper and Answer Sheet
November 24, 2023; 08:30 am to 11:30 am
ROLL NO. ________________ NAME: _________________________
Instructions:
Exam is open notes as long as the notes are on paper and not on an electronic device
Collaboration between a student and any other person or the Internet is prohibited
SUBMIT THIS SHEET ONLY. Answer all questions in the space given on this sheet only
Use separate sheet for rough work
Total marks = 27; weight in course 27%
1. Let f(x,y) be 2x + 3y^2 − 2x^2. Find the critical points of this function and characterize them into local
maxima, local minima, furrow (flat in one direction, minima in the other), saddle point, or point of
inflection in a given direction. Show your work. [1.5]
Setting the gradient (2 − 4x, 6y) to zero gives the critical point (0.5, 0) [0.5]
Hessian = [−4, 0; 0, 6] [0.5]
Saddle point because Hessian is diagonal, one eigenvalue is positive, the other is negative. [0.5]
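A minimal sketch (not part of the marking scheme) that checks the critical point and Hessian with sympy; the variable names are illustrative.

```python
# Hedged check of Q1 with sympy: find critical points and classify via the Hessian.
import sympy as sp

x, y = sp.symbols('x y')
f = 2*x + 3*y**2 - 2*x**2

grad = [sp.diff(f, v) for v in (x, y)]        # (2 - 4x, 6y)
crit = sp.solve(grad, (x, y), dict=True)      # [{x: 1/2, y: 0}]
H = sp.hessian(f, (x, y))                     # Matrix([[-4, 0], [0, 6]])
print(crit, H.eigenvals())                    # eigenvalues -4 and 6 -> saddle point
```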
2. Not carefully initializing weights in a deep neural network where every layer has a sigmoid
nonlinearity will lead to (a) no problem, (b) vanishing gradients, (c) exploding gradients, or (d) both
vanishing and exploding gradients? Explain the case for both vanishing and exploding gradients. [2]
For large positive or negative pre-activations, the sigmoid saturates and has near-zero gradient, which leads to
vanishing gradients if the weights are not carefully initialized. [1]
Exploding gradients would require repeated multiplication by large factors, but the sigmoid derivative is at most
0.25, so the backpropagated factors shrink rather than grow and gradients do not explode. [1]
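A small illustrative sketch (assuming PyTorch and an arbitrary 20-layer fully connected stack) showing how gradients shrink through repeated sigmoid layers:

```python
# Hedged sketch: gradients through a stack of sigmoid layers shrink because the
# sigmoid derivative is at most 0.25. Depth, width, and init are illustrative.
import torch

torch.manual_seed(0)
depth, width = 20, 64
x = torch.randn(1, width)
layers = [torch.nn.Linear(width, width) for _ in range(depth)]

h = x
for layer in layers:
    h = torch.sigmoid(layer(h))
h.sum().backward()

# Gradient norm at the first layer is many orders of magnitude smaller than at
# the last layer -> vanishing gradients.
print(layers[0].weight.grad.norm(), layers[-1].weight.grad.norm())
```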
3. Suppose one layer has an output of size C (one-dimensional array) before a nonlinearity is applied.
A second layer is a convolutional layer with ReLU nonlinearity, whose output has dimensions
H×W×C (three-dimensional tensor). The output of the first layer needs to be used as channel-wise
attention weight for the output of the second layer. Suggest a nonlinearity and any additional
operations that need to be applied to the output of the first layer for this purpose. [2]
First, the output of the first layer must be squashed into the 0-1 range, for which we use a softmax nonlinearity. [1]
Second, the dimensions are incompatible, so each of the C values is repeated (broadcast) H×W times to match the
H×W×C tensor. [1]
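A minimal sketch of the two steps, assuming PyTorch tensors with illustrative shapes; broadcasting plays the role of repeating each weight H×W times:

```python
# Hedged sketch of channel-wise attention: squash the C attention values to 0-1,
# then broadcast them over the spatial dimensions of the H x W x C feature map.
import torch

H, W, C = 8, 8, 16
attn_logits = torch.randn(C)                 # output of the first layer, size C
feat = torch.relu(torch.randn(H, W, C))      # output of the conv layer, H x W x C

weights = torch.softmax(attn_logits, dim=0)  # squash to the 0-1 range
attended = feat * weights.view(1, 1, C)      # broadcast over H and W
print(attended.shape)                        # torch.Size([8, 8, 16])
```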
4. For a convolutional layer with output of size H×W×C×B, where B is the batch size and C is the
number of channels, what will be the number of elements that will be averaged for computing one
mean during batch normalization? [1]
H·W·B elements will be averaged. We will get C such averages, one for each neuron/kernel/filter. [1]
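A quick sketch of the same computation, assuming a channels-last tensor in PyTorch with illustrative sizes:

```python
# Hedged sketch: the per-channel mean in batch norm averages over the batch and
# spatial dimensions, i.e. H*W*B elements per channel, giving C means.
import torch

B, H, W, C = 4, 8, 8, 16
x = torch.randn(B, H, W, C)                  # channels-last layout for clarity
per_channel_mean = x.mean(dim=(0, 1, 2))     # average over B, H, W
print(per_channel_mean.shape)                # torch.Size([16]) -> one mean per channel
```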
5. What will happen if we train a GAN that has a discriminator with low capacity (e.g. not enough
learnable parameters)? Will it lead to (a) mode collapse, (b) generation of unrealistic samples, or (c)
a discriminator that easily classifies between real and fake? Justify your answer. [1]
(b) generation of unrealistic samples, because the generator can generate low-quality (obvious) fakes, and
the discriminator will not be able to tell the difference between those and the real images.
6. Which of the following is a good principle for designing a loss function for regression that is
robust to outliers? Justify with an example of a robust loss function and how it treats inliers (non-
outliers) versus outliers. [1.5]
a) The loss function should be convex
b) The loss function should have a constant upper bound
c) The gradient of the loss function should have a constant upper bound
d) The absolute value of the gradient of the loss function should have a constant upper bound
(d) The absolute value of the gradient should be upper-bounded, because then an outlier cannot have more
than a certain max contribution to the overall gradient sum across samples. An example of this is Huber
loss, where samples with errors more than ±δ have a constant absolute gradient.
(b) is also correct if the reasoning is that, with a bounded loss, outliers fall on the flat part of the loss and
therefore get a vanishing gradient.
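A minimal sketch of (d) for the Huber loss gradient, with an illustrative delta of 1:

```python
# Hedged sketch: the Huber loss has a gradient whose absolute value is capped at
# delta, so a single outlier's contribution to the total gradient is bounded.
import numpy as np

def huber_grad(error, delta=1.0):
    # quadratic region: gradient = error; linear region: gradient = +/- delta
    return np.where(np.abs(error) <= delta, error, delta * np.sign(error))

errors = np.array([0.1, 0.5, 2.0, 50.0])     # the last value mimics an outlier
print(huber_grad(errors))                    # -> [0.1 0.5 1.  1. ]
```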
7. Write the formula for generalized cross entropy and explain how it might help deal with
mislabeled samples by drawing an approximate graph and explaining the role of its hyperparameter.
[1.5]
L_q = (1 − y_ij^q)/q, where y_ij is the predicted probability of the correct class. In the limit q→0, this becomes
CE [0.5], and for q=1, this becomes MAE [0.5]. For any q in between, each sample's gradient is the CE gradient
scaled by y_ij^q, which caps the influence of samples whose given label gets a low predicted probability, i.e.,
likely mislabeled samples [1].
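A small numeric sketch of the GCE formula (the function name and the value q=0.7 are illustrative):

```python
# Hedged sketch of generalized cross entropy (1 - p^q) / q, where p is the
# predicted probability of the correct class and q is the hyperparameter.
import numpy as np

def gce(p_correct, q=0.7):
    return (1.0 - p_correct ** q) / q

p = np.array([0.05, 0.5, 0.95])
print(gce(p, q=0.7))   # grows much more slowly than CE as p -> 0
print(-np.log(p))      # cross entropy for comparison (diverges as p -> 0)
```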
8. According to the paper titled “Normalized Loss Functions for Deep Learning with Noisy Labels”
by Ma et al., (a) draw the approximate graphs of cross entropy and (b) normalized cross entropy for
the predicted probability of the correct class [Hint: assume binary classification], and (c) explain the
advantage of normalized cross entropy over cross entropy when dealing with mislabeled samples, as
well as (d) the disadvantage of using only an active loss (e.g. NCE) based on the graphs. [2]
[Graphs: CE and NCE as functions of the predicted probability of the correct class.] [0.5 + 0.5]
NCE has a limited gradient, which limits the impact of mislabeled samples. [0.5]
However, the NCE gradient is nearly 0 when the predicted probability of the correct class is close to 0 (error close
to 1), which discourages learning on correctly labeled but initially misclassified samples. For this, we need a
passive loss as well. [0.5]
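A quick numeric sketch comparing CE with normalized CE; the NCE expression below is the two-class special case of the normalization in Ma et al.:

```python
# Hedged sketch: CE vs. normalized CE for binary classification as a function of
# the predicted probability p of the correct class.
import numpy as np

p = np.linspace(0.01, 0.99, 99)
ce = -np.log(p)
nce = np.log(p) / (np.log(p) + np.log(1.0 - p))   # two-class NCE

# CE grows without bound as p -> 0 (outsized influence of mislabeled samples),
# while NCE stays within [0, 1] but is nearly flat near p = 0, so it learns
# slowly on its own and benefits from an added passive loss.
print(ce[:3], nce[:3])
```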
9. List four different methods of augmenting images for self-supervised learning. [2]
Any four, e.g. rotation, flip, blur, noise addition, distortion, color jitter, grayscale. [0.5x4]
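A typical augmentation pipeline along these lines, assuming torchvision; the particular parameters are illustrative:

```python
# Hedged sketch of an image-augmentation pipeline for self-supervised learning.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224),                # crop / distortion
    transforms.RandomHorizontalFlip(),                # flip
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),       # color jitter
    transforms.RandomGrayscale(p=0.2),                # grayscale
    transforms.GaussianBlur(kernel_size=23),          # blur
    transforms.ToTensor(),
])
```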
10. Give the general architecture of a neural network that is being trained in a self-supervised
manner to restore images of old degraded photographs. Your answer must give an example each of
(a) the dimensions of the input layer, (b) the dimensions and nonlinearity (if any) of the output layer,
(c) the loss function, (d) a plausible architecture, and (e) a method to create the training dataset. [2.5]
(a) HxWx3
(b) HxWx3, with sigmoid (because pixel values lie in the range 0 to 1)
(c) Pixel-wise MSE, MAE, or MS-SSIM
(d) UNet (with skip connections)
(e) Take clean images and simulate degradation to create old-looking photos (see the sketch below)
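A minimal sketch of step (e), assuming synthetic noise, fading, and scratches as the degradations (purely illustrative choices):

```python
# Hedged sketch: create (degraded, clean) training pairs so the network learns
# to restore images in a self-supervised way.
import numpy as np

def degrade(img):                             # img: H x W x 3 float array in [0, 1]
    out = img.copy()
    out += np.random.normal(0, 0.05, img.shape)        # film grain / noise
    out = np.clip(out * 0.8 + 0.1, 0.0, 1.0)           # faded contrast
    out[np.random.randint(out.shape[0]), :, :] = 1.0   # a random scratch line
    return out

clean = np.random.rand(256, 256, 3)           # stand-in for a real clean photo
pair = (degrade(clean), clean)                # (input, target) for pixel-wise MSE/MAE
```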
11. What could be an advantage in few-shot learning of decreasing the relative distance of a query
sample from the prototype of its class as opposed to the support samples of its class? [1]
If one sample in the support set is an outlier, averaging it into the prototype dilutes its impact, so training is
more stable.
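A small sketch of the idea, assuming Euclidean distances and an artificially injected outlier in the support set:

```python
# Hedged sketch: the query is compared to the class prototype (mean of support
# embeddings), so one outlying support sample is averaged down rather than
# dominating the comparison.
import torch

support = torch.randn(5, 64)                 # 5 support embeddings for one class
support[0] += 10.0                           # make one support sample an outlier
query = torch.randn(64)

proto = support.mean(dim=0)                  # prototype = class mean
d_proto = torch.norm(query - proto)          # distance to prototype (outlier diluted)
d_each = torch.norm(query - support, dim=1)  # per-support distances (one term dominated)
print(d_proto, d_each)
```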
12. Suppose that we want to use graph neural networks on Facebook communities to classify them
into those who might respond to a particular ad versus those who will not. Give at least two
examples of vertex attributes and two examples of edge attributes that one can use. [2]
Vertex: age, gender, location etc.
Edge: friend connection, # likes, # comments on each other's posts
13. Write the Laplacian matrix for the following graph where the vertex serial numbers are written
inside the vertex. [1.5]
[Graph figure: 4 vertices numbered 1-4, with edges 1-3, 1-4, and 2-3.]
D = diag(2, 1, 2, 1)
A (adjacency):
0 0 1 1
0 0 1 0
1 1 0 0
1 0 0 0
L = D - A:
 2  0 -1 -1
 0  1 -1  0
-1 -1  2  0
-1  0  0  1
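The same result as a quick numpy sketch (purely a check, not part of the marking scheme):

```python
# Hedged sketch: L = D - A for the 4-vertex graph above (edges 1-3, 1-4, 2-3).
import numpy as np

A = np.array([[0, 0, 1, 1],
              [0, 0, 1, 0],
              [1, 1, 0, 0],
              [1, 0, 0, 0]])
D = np.diag(A.sum(axis=1))      # degrees: 2, 1, 2, 1
L = D - A
print(L)
```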
14. What is a pseudo-label for a classification problem? [1]
A pseudo-label is a label assigned to an unlabeled sample by the model's own prediction, which is then treated as if
it were a valid ground-truth label for further training.
15. Write the formula for entropy and describe one way to use it for semi-supervised classification.
[1]
−Σ_j y_ij log y_ij, where y_ij is the predicted probability of class j for sample i. [0.5]
For SSL, we can minimize the entropy of the predictions on unlabeled samples together with the CE on labeled samples
(see the sketch below). [0.5]
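A minimal sketch of such a combined loss, assuming PyTorch; the weight lam on the entropy term is an illustrative hyperparameter:

```python
# Hedged sketch: CE on labeled samples plus entropy minimization on unlabeled ones.
import torch
import torch.nn.functional as F

def ssl_loss(logits_labeled, targets, logits_unlabeled, lam=0.1):
    ce = F.cross_entropy(logits_labeled, targets)
    p = F.softmax(logits_unlabeled, dim=1)
    entropy = -(p * torch.log(p + 1e-8)).sum(dim=1).mean()
    return ce + lam * entropy      # lam weights the unlabeled entropy term
```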
16. Explain how Grad-CAM (gradient-weighted class activation map) works to localize objects when
a CNN is only trained for image classification. [1.5]
It computes the gradient of the class score with respect to the activations of the last convolutional layer, which
indicates which activations matter for that class. These gradients are averaged spatially within each channel to give
one weight per channel. The activation maps are then summed with these channel weights and passed through a ReLU,
giving a heatmap that localizes the object.
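A hedged Grad-CAM sketch, assuming a torchvision ResNet-18 and its last convolutional block as the target layer (illustrative choices, not necessarily the lecture's setup):

```python
# Hedged Grad-CAM sketch: channel weights from spatially averaged gradients,
# weighted sum of activation maps, then ReLU.
import torch
from torchvision import models

model = models.resnet18(weights=None).eval()
acts, grads = {}, {}
layer = model.layer4                                   # last conv block

layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))

x = torch.randn(1, 3, 224, 224)                        # stand-in image
score = model(x)[0, 281]                               # score of some target class
score.backward()

w = grads['g'].mean(dim=(2, 3), keepdim=True)          # channel weights: spatially averaged grads
cam = torch.relu((w * acts['a']).sum(dim=1))           # weighted sum of activations, then ReLU
print(cam.shape)                                       # 1 x 7 x 7 heatmap to upsample onto the image
```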
17. List and briefly describe the key differences between how dropout is used for regular training
and inference versus how it is used for uncertainty estimation. [1.5]
In regular DO, we turn it off during inference, and scale the weights down. [0.5]
In uncertainty estimation, we keep DO on during inference and do not scale the weights. [0.5]
Then we take the variability of the estimation for various instances of DO during inference. [0.5]
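A minimal MC-dropout sketch in PyTorch; the architecture, dropout rate, and 50 forward passes are illustrative:

```python
# Hedged sketch of MC dropout: keep dropout stochastic at inference and use the
# spread of predictions across forward passes as the uncertainty estimate.
import torch
import torch.nn.functional as F

model = torch.nn.Sequential(
    torch.nn.Linear(10, 64), torch.nn.ReLU(),
    torch.nn.Dropout(p=0.5),
    torch.nn.Linear(64, 3),
)
model.train()                      # keeps Dropout active (unlike model.eval())

x = torch.randn(1, 10)
with torch.no_grad():
    probs = torch.stack([F.softmax(model(x), dim=1) for _ in range(50)])
mean_pred, uncertainty = probs.mean(dim=0), probs.std(dim=0)
```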
18. Describe two different ways to compare the performance of uncertainty estimation methods for
classification. [1.5]
AUC of outlier versus inlier identification [0.5]
Accuracy computed over the x% least-uncertain samples, plotted against x; better uncertainty estimates give higher
accuracy as the more-uncertain samples are excluded (see the sketch below). [1]
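A sketch of the second evaluation with stand-in arrays (real correctness flags and uncertainty scores would come from a trained model):

```python
# Hedged sketch: accuracy on the fraction of least-uncertain samples, for a few
# retention fractions. The random arrays below are illustrative stand-ins.
import numpy as np

correct = np.random.rand(1000) > 0.2        # 1 = prediction was correct
uncertainty = np.random.rand(1000)          # per-sample uncertainty scores

order = np.argsort(uncertainty)             # least uncertain first
for frac in (0.2, 0.5, 1.0):
    keep = order[: int(frac * len(order))]
    print(frac, correct[keep].mean())       # accuracy on retained samples
```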