

EECS 182 Deep Neural Networks


Spring 2023 Anant Sahai Homework 7
This homework is due on Friday, March 17, 2023, at 10:59PM.

1. Auto-encoder: Learning without Labels


So far, with supervised learning algorithms we have used labelled datasets D = {X, y} to learn a mapping
fθ from input x to labels y. In this problem, we now consider algorithms for unsupervised learning, where
we are given only inputs x, but no labels y. At a high-level, unsupervised learning algorithms leverage
some structure in the dataset to define proxy tasks, and learn representations that are useful for solving
downstream tasks.
Autoencoders present a family of such algorithms, where we consider the problem of learning a function
fθ : Rm → Rk from input x to an intermediate representation z of x (usually k ≪ m). For autoencoders,
we use reconstructing the original signal as a proxy task, i.e. representations are more likely to be useful for
downstream tasks if they are low-dimensional but encode sufficient information to approximately reconstruct
the input. Broadly, autoencoder architectures have two modules:

• An encoder fθ : Rm → Rk that maps input x to a representation z.


• A decoder gϕ : Rk → Rm that maps representation z to a reconstruction x̂ of x.

In such architectures, the parameters (θ, ϕ) are learnable and are trained with gradient descent to minimize the reconstruction error ℓ(x̂i, xi) = ∥xi − x̂i∥²₂:

θenc, ϕdec = argmin_{θ,ϕ} (1/N) Σ_{i=1}^N ℓ(x̂i, xi),   where x̂i = gϕ(fθ(xi)).

Note that the above optimization problem does not require labels y. In practice however, we would want to use
the pretrained models to solve the downstream task at hand, e.g. classifying MNIST digits. Linear Probing
is one such approach, where we fix the encoder weights, and learn a simple linear classifier over the features
z = encoder(x).
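To make this concrete, here is a minimal PyTorch sketch (not part of the assignment notebook; the dimensions, architecture, and hyperparameters are placeholders) of an autoencoder trained on the reconstruction proxy task, followed by a linear probe on the frozen encoder:

import torch
import torch.nn as nn

m, k, num_classes = 100, 20, 4   # placeholder dimensions

encoder = nn.Sequential(nn.Linear(m, 64), nn.ReLU(), nn.Linear(64, k))
decoder = nn.Sequential(nn.Linear(k, 64), nn.ReLU(), nn.Linear(64, m))
recon_loss = nn.MSELoss()
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

def ae_step(x):
    # Proxy task: reconstruct x from its latent code z = f_theta(x); no labels used.
    z = encoder(x)
    x_hat = decoder(z)
    loss = recon_loss(x_hat, x)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Linear probing: freeze the encoder and fit a linear classifier on z = encoder(x).
probe = nn.Linear(k, num_classes)
probe_opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
ce = nn.CrossEntropyLoss()

def probe_step(x, y):
    with torch.no_grad():            # encoder weights stay fixed
        z = encoder(x)
    loss = ce(probe(z), y)
    probe_opt.zero_grad()
    loss.backward()
    probe_opt.step()
    return loss.item()

# Example usage with random placeholder data:
x = torch.randn(32, m)
y = torch.randint(0, num_classes, (32,))
print(ae_step(x), probe_step(x, y))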

(a) Designing AutoEncoders


Please follow the instructions in this notebook. You will train autoencoders, denoising autoencoders,
and masked autoencoders on a synthetic dataset and the MNIST dataset. Once you have finished the notebook:
• Download submission_log.json and submit it to “Homework 7 (Code)” in Gradescope.
• Answer the following questions in your submission of the written assignment:
(i) Show your visualization of the vanilla autoencoder with different latent representation sizes.
Solution:
See Figure 1. Please refer to the solution notebook for the code.


Figure 1: Visualization of the vanilla autoencoder with different latent representation sizes.

(ii) Based on your previous visualizations, answer this question: How does changing the latent
representation size of the autoencoder affect the model’s performance in terms of recon-
struction accuracy and linear probe accuracy? Why?
Solution: Based on the given synthetic dataset, each data point has 100 dimensions, with 20
high-variance dimensions affecting the class label. The following observations can be made from
the visualizations in Figure 1.
Firstly, the reconstruction error of the autoencoder decreases as the size of the latent representation
increases. However, this reduction becomes marginal when the size of the latent representation
exceeds the dimension of the data (100).
Secondly, the linear probe accuracy increases as the size of the latent representation increases.
When the size of the latent representation exceeds the dimension of the data (100), the linear probe
accuracy approaches ∼100% even without training the autoencoder. However, when the size of
the latent representation equals the number of informative dimensions (20), training the autoencoder becomes essential for achieving a high linear probe accuracy. The linear probe accuracy
finally converges to ≥95% with training. On the other hand, if the size of the latent representation
is as small as 5, it fails to capture all useful information in the input data, leading to significantly
lower linear probe accuracy.

(b) PCA & AutoEncoders


In the case where the encoder fθ and decoder gϕ are linear functions, the model is termed a linear autoencoder. In particular, assume that we have data xi ∈ Rm and consider two weight matrices: an encoder W1 ∈ Rk×m and a decoder W2 ∈ Rm×k (with k < m). Then, a linear autoencoder learns a low-dimensional embedding of the data X ∈ Rm×n (which we assume is zero-centered without loss of generality) by minimizing the objective

L(W1, W2; X) = (1/n) ∥X − W2 W1 X∥²_F.   (1)
n


We will assume σ1² > · · · > σk² > 0 are the k largest eigenvalues of (1/n) XX⊤. The assumption that σ1, . . . , σk are positive and distinct ensures identifiability of the principal components, and is common in this setting. We collect the top-k values as S = diag(σ1, . . . , σk), with the corresponding eigenvectors being the columns of Uk ∈ Rm×k. A well-established result from Baldi & Hornik (1989) shows that the principal components are the unique optimal solution of the linear autoencoder (up to sign changes of the projection directions). In this subpart, we take some steps towards proving this result.
(i) Write out the first order optimality conditions that the minima of Eq. 1 would satisfy.
Solution: We can compute the first order conditions for W1 and W2 , respectively. To get started,
let’s note that for matrices A ∈ Rd×n , W ∈ Rk×d

∥W A∥²_F = tr(A⊤ W⊤ W A)
∇_W ∥W A∥²_F = 2 W A A⊤

Now, consider taking gradient w.r.t the loss function


L(W1, W2; X) = (1/n) ∥X − W2 W1 X∥²_F
             = (1/n) ∥(I − W2 W1) X∥²_F
             = (1/n) ∥H(W1, W2) X∥²_F
where H(W1, W2) = I − W2 W1. Taking the gradient of the loss w.r.t. W2 (note that W2 enters through H with a minus sign), we get

∇_{W2} L = ∇_{W2} (1/n) ∥H(W1, W2) X∥²_F
         = −(2/n) H(W1, W2) X X⊤ W1⊤
         = −(2/n) (I − W2 W1) X X⊤ W1⊤

Similarly, the gradient w.r.t. W1 gives

∇_{W1} L = ∇_{W1} (1/n) ∥H(W1, W2) X∥²_F
         = −(2/n) W2⊤ H(W1, W2) X X⊤
         = −(2/n) W2⊤ (I − W2 W1) X X⊤

Setting both gradients to zero gives the first-order optimality conditions that any minimizer (W1, W2) must satisfy:

(X − W2 W1 X) X⊤ W1⊤ = 0
W2⊤ (X − W2 W1 X) X⊤ = 0

(ii) Show that the principal components Uk satisfy the optimality conditions outlined in (i).
Solution: (Note: The solution here is questionable. It is possible that the principal components
Uk do not satisfy the optimality conditions outlined in (i).)
Using the top-k principal components, set W2 = W1⊤ = Uk = [u1, ..., uk].

Here the ui ∈ Rm are eigenvectors of XX⊤, i.e. XX⊤ = U Σ² U⊤ = Σ_{i=1}^m σi² ui ui⊤, where U = [u1, . . . , um]. Next, consider the term I − W2 W1:

I − W2 W1 = I − Uk Uk⊤ = U diag(0_k, I_{m−k}) U⊤

Plugging this back into the first optimality condition from part (i), we have

(X − W2 W1 X) X⊤ W1⊤ = (I − W2 W1) X X⊤ W1⊤

The term X X⊤ W1⊤ = X X⊤ Uk can be simplified as

X X⊤ Uk = U Σ² U⊤ Uk = U Σ² [ I_{k,k} ; 0_{m−k,k} ]

Putting these together,

(I − W2 W1) X X⊤ W1⊤ = U diag(0_k, I_{m−k}) Σ² [ I_{k,k} ; 0_{m−k,k} ] = 0,

and the second condition W2⊤ (X − W2 W1 X) X⊤ = 0 follows similarly, since Uk⊤ (I − Uk Uk⊤) = 0.
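As a quick numerical check of this claim (not part of the original solution), one can verify with NumPy that W2 = W1⊤ = Uk makes both first-order conditions vanish on random centered data:

import numpy as np

rng = np.random.default_rng(0)
m, n, k = 6, 50, 2
X = rng.standard_normal((m, n))
X = X - X.mean(axis=1, keepdims=True)        # zero-center the data

# Top-k eigenvectors of X X^T (equivalently, top-k left singular vectors of X).
eigvals, eigvecs = np.linalg.eigh(X @ X.T)
Uk = eigvecs[:, np.argsort(eigvals)[::-1][:k]]

W2, W1 = Uk, Uk.T
R = X - W2 @ W1 @ X                          # reconstruction residual
print(np.allclose(R @ X.T @ W1.T, 0))        # first optimality condition: True
print(np.allclose(W2.T @ R @ X.T, 0))        # second optimality condition: True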

2. Self-supervised Linear Autoencoders


We consider linear models consisting of two weight matrices: an encoder W1 ∈ Rk×m and decoder W2 ∈
Rm×k (assume 1 < k < m). The traditional autoencoder model learns a low-dimensional embedding of the
n points of training data X ∈ Rm×n by minimizing the objective,
L(W1, W2; X) = (1/n) ∥X − W2 W1 X∥²_F   (5)

We will assume σ1² > · · · > σk² > σ²_{k+1} ≥ 0 are the k + 1 largest eigenvalues of (1/n) XX⊤. The assumption that σ1, . . . , σk are positive and distinct ensures identifiability of the principal components.
Consider an ℓ2 -regularized linear autoencoder where the objective is:

Lλ(W1, W2; X) = (1/n) ∥X − W2 W1 X∥²_F + λ ∥W1∥²_F + λ ∥W2∥²_F,   (6)

where ∥ · ∥²_F denotes the squared Frobenius norm of a matrix (i.e. the sum of squares of its entries).

(a) You want to use SGD-style training in PyTorch (involving the training points one at a time) and self-
supervision to find W1 and W2 which optimize (6) by treating the problem as a neural net being trained
in a supervised fashion. Answer the following questions and briefly explain your choice:
(i) How many linear layers do you need?
□ 0
□ 1


□ 2
□ 3
Solution: Two. We would use two linear layers: one for the encoder and one for the decoder.
(ii) What is the loss function that you will be using?
□ nn.L1Loss
□ nn.MSELoss
□ nn.CrossEntropyLoss
Solution: We should use nn.MSELoss to train the model (reconstruction under ℓ2 loss), since we want each vector to be close to its reconstruction in a squared-error sense.
(iii) Which of the following would you need to optimize (6) exactly as it is written? (Select all
that are needed)
□ Weight Decay
□ Dropout
□ Layer Norm
□ Batch Norm
□ SGD optimizer
Solution: We need to use Weight Decay to achieve the desired ℓ2 regularization, and of the optimizers listed, the SGD optimizer is the one that would work best.
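As a concrete illustration (a sketch, not required by the assignment), these choices translate into PyTorch roughly as follows; the dimensions, learning rate, and λ are placeholders:

import torch
import torch.nn as nn

m, k = 100, 20          # placeholder dimensions
lam = 1e-3              # placeholder regularization strength lambda

encoder = nn.Linear(m, k, bias=False)   # W1
decoder = nn.Linear(k, m, bias=False)   # W2
criterion = nn.MSELoss()                # reconstruction under squared error
# weight_decay adds wd*W to each gradient, matching the gradient 2*lam*W of lam*||W||_F^2.
opt = torch.optim.SGD(list(encoder.parameters()) + list(decoder.parameters()),
                      lr=1e-2, weight_decay=2 * lam)

x = torch.randn(64, m)                  # one minibatch of unlabeled data
x_hat = decoder(encoder(x))             # self-supervised target is the input itself
# nn.MSELoss averages over all entries, so the reconstruction term matches (6)
# up to a constant factor (which also rescales the effective lambda).
loss = criterion(x_hat, x)
opt.zero_grad()
loss.backward()
opt.step()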
(b) Do you think that the solution to (6) when we use a small nonzero λ has an inductive bias towards
finding a W2 matrix with approximately orthonormal columns? Argue why or why not?
(Hint: Think about the SVDs of W1 = U1 Σ1 V1⊤ and W2 = U2 Σ2 V2⊤. You can assume that if a k × m or m × k matrix has all k of its nonzero singular values equal to 1, then it must have orthonormal rows or columns. Remember that the squared Frobenius norm of a matrix is just the sum of the squares of its singular values. Further, think about the minimizer of σ² + 1/σ². Is it unique?)
Solution: If there were no regularization terms, we know that every minimizer must have W2 W1 acting like a projection matrix onto the subspace spanned by the k left singular vectors of X with the largest singular values.
This means that the W2 W1 that minimizes the main loss has to be the identity when restricted to that subspace.
Therefore we would expect W1 and W2 to be approximate pseudo-inverses of each other, since they are not square and the rank of either one is at most k.
Therefore regularizing by penalizing the Frobenius norms forces us to consider:

∥W1∥²_F + ∥W2∥²_F = ∥Σ1∥²_F + ∥Σ2∥²_F
                  = Σ_{i=1}^k ( σi² + 1/σi² )

where W1 contributes the σi terms and its approximate pseudo-inverse W2 contributes the 1/σi as its singular values.
Minimizing σ² + 1/σ² by taking the derivative results in setting 0 = 2σ − 2/σ³, which has a unique non-negative real solution at σ = 1; this is the unique minimizer since the expression clearly goes to ∞ as σ → ∞ or σ → 0.
If λ is small enough, then the optimization essentially decouples: the main loss forces W2 and W1 to be pseudoinverses and to have the product W2 W1 project onto the subspace spanned by the k singular vectors of X whose singular values are largest; and the regularization term forces the individual W2 and W1 to have all nonzero singular values as close to 1 as possible.


Once σi ≈ 1, the matrix has approximately orthonormal columns for W2 and approximately orthonor-
mal rows for W1 . You can see this by simply writing W = U ΣV ⊤ and then noticing for a tall W that
W ⊤ W = V Σ⊤ ΣV ⊤ ≈ V IV ⊤ = I and similarly for W W ⊤ for a wide W .
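A small numerical check of this argument (not part of the original solution): among factorizations W2 W1 that all realize the same rank-k projection Uk Uk⊤, the Frobenius penalty is smallest when the nonzero singular values are all 1, i.e. when W2 has orthonormal columns:

import numpy as np

rng = np.random.default_rng(0)
m, k = 10, 3

# Random orthonormal basis U_k for a k-dimensional subspace of R^m.
Uk, _ = np.linalg.qr(rng.standard_normal((m, k)))

def penalty(scales):
    # Frobenius penalty for W2 = U_k diag(scales), W1 = diag(1/scales) U_k^T.
    D = np.diag(scales)
    W2 = Uk @ D
    W1 = np.linalg.inv(D) @ Uk.T
    assert np.allclose(W2 @ W1, Uk @ Uk.T)   # product is always the same projection
    return np.sum(W1**2) + np.sum(W2**2)

for s in [0.25, 0.5, 1.0, 2.0, 4.0]:
    print(f"singular values = {s:4.2f}  ->  penalty = {penalty(s * np.ones(k)):.3f}")
# The penalty k*(s^2 + 1/s^2) is minimized at s = 1, i.e. orthonormal columns for W2.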

3. Justifying Scaled-Dot Product Attention


Suppose q, k ∈ Rd are two independent random vectors with q, k ∼ N(µ, σ² I), where µ ∈ Rd and σ ∈ R+. In other words, each component qi of q is drawn from a normal distribution with mean µi and standard deviation σ, and the same is true for k.

(a) Define E[q⊤k] in terms of µ, σ and d.

Solution: E[q⊤k] = E[Σ_{i=1}^d qi ki] = Σ_{i=1}^d E[qi ki] = Σ_{i=1}^d E[qi] E[ki] (by independence of q and k) = Σ_{i=1}^d µi² = ∥µ∥²₂

(b) Considering the practical case where µ = 0 and σ = 1, define Var(q⊤k) in terms of d.

Solution: In general, Var(q⊤k) = E[(q⊤k)²] − E[q⊤k]² = µ⊤Σ_q µ + µ⊤Σ_k µ + tr(Σ_q Σ_k) = 2σ² ∥µ∥²₂ + d σ⁴, which equals d when µ = 0 and σ = 1.
(Here is an alternate way of solving this.)

Var(q⊤k) = Var( Σ_{i=1}^d qi ki )                                  // definition of the dot product
         = Σ_{i=1}^d Var(qi ki)                                    // all qi and ki are independent
         = d · Var(q1 k1)                                          // the variance is the same for all i
         = d ( E[q1]² Var(k1) + E[k1]² Var(q1) + Var(q1) Var(k1) ) // variance of a product of independent variables
         = d · Var(q1) Var(k1)                                     // all the expectations are 0
         = d σ⁴
         = d

(c) Continue to use the setting in part (b), where µ = 0 and σ = 1. Let s be the scaling factor on the dot product. Suppose we want E[q⊤k / s] to be 0 and Var(q⊤k / s) to be 1. What should s be in terms of d?
Solution: From part (a), we know that E[q⊤k] = ∥µ∥²₂ = 0, so E[q⊤k / s] = 0 for any s. From part (b), we know that Var(q⊤k) = d. So for Var(q⊤k / s) = d / s² to equal 1, we can take s = √d.
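A quick Monte Carlo sanity check of parts (a)-(c) (not part of the original solution), assuming NumPy:

import numpy as np

rng = np.random.default_rng(0)
d, n_samples = 64, 200_000

q = rng.standard_normal((n_samples, d))   # q ~ N(0, I_d)
k = rng.standard_normal((n_samples, d))   # k ~ N(0, I_d), independent of q
dots = np.sum(q * k, axis=1)              # q^T k for each sample

print("mean of q^T k:", dots.mean())                              # ~ 0
print("variance of q^T k:", dots.var())                           # ~ d = 64
print("variance of q^T k / sqrt(d):", (dots / np.sqrt(d)).var())  # ~ 1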

4. Argmax Attention
Recall from lecture that we can think about attention as being queryable softmax pooling. In this problem,
we ask you to consider a hypothetical argmax version of attention where it returns exactly the value cor-
responding to the key that is most similar to the query, where similarity is measured using the traditional
inner-product.

(a) Perform argmax attention with the following keys and values:
Keys:

[1, 2, 0]⊤,  [0, 3, 4]⊤,  [5, 0, 0]⊤,  [0, 0, 1]⊤

Corresponding Values:

[2, 0, 1]⊤,  [1, 4, 3]⊤,  [0, −1, 4]⊤,  [1, 0, −1]⊤

using the following query:

q = [1, 1, 2]⊤

What would be the output of the attention layer for this query?
Hint: For example, argmax([1, 3, 2]) = [0, 1, 0]
Solution: Compute inner products of query with the keys:

⟨[1, 1, 2]⊤ , [1, 2, 0]⊤ ⟩ = 3

⟨[1, 1, 2]⊤ , [0, 3, 4]⊤ ⟩ = 11


⟨[1, 1, 2]⊤ , [5, 0, 0]⊤ ⟩ = 5
⟨[1, 1, 2]⊤ , [0, 0, 1]⊤ ⟩ = 2

We take argmax rather than softmax, and have argmax([3, 11, 5, 2]) = [0, 1, 0, 0].
Now, take the weighted sum of the value vectors (in this case all are zeroed out except for the one corresponding to the highest dot-product between query and key):
0 · [2, 0, 1]⊤ + 1 · [1, 4, 3]⊤ + 0 · [0, −1, 4]⊤ + 0 · [1, 0, −1]⊤
So the final output is [1, 4, 3]⊤.
(b) Note that instead of using softmax we used argmax to generate outputs from the attention layer. How
does this design choice affect our ability to usefully train models involving attention?
(Hint: think about how the gradients flow through the network in the backward pass. Can we learn to
improve our queries or keys during the training process?)
Solution: It wouldn’t work in real life, since the gradients only flow through the one element that
is selected by argmax and thus most of the parameters in the transformer layers would remain the
same if we did gradient-based training. The reason is that generically, the argmax is not sensitive to
small changes in the keys and queries, since any such tiny perturbations will not change the winner.
Consequently, the gradients with respect to the keys and queries will always be zero. This means that
the keys and queries will never be improved — or more precisely, the functions used to generate keys
and queries from the state will never get updated.
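A minimal PyTorch sketch (not from the original solution) that contrasts the two on the example from part (a): with softmax attention the query receives a gradient, while with argmax attention no gradient reaches the query or the keys because the selection is piecewise constant:

import torch

K = torch.tensor([[1., 2., 0.],
                  [0., 3., 4.],
                  [5., 0., 0.],
                  [0., 0., 1.]], requires_grad=True)
V = torch.tensor([[2., 0., 1.],
                  [1., 4., 3.],
                  [0., -1., 4.],
                  [1., 0., -1.]], requires_grad=True)
q = torch.tensor([1., 1., 2.], requires_grad=True)

# Softmax attention: every score influences the output, so gradients reach q and K.
out_soft = torch.softmax(K @ q, dim=0) @ V
out_soft.sum().backward()
print("q.grad under softmax:", q.grad)       # nonzero

# Argmax attention: argmax returns an integer index, which severs the graph to q and K.
q.grad, K.grad, V.grad = None, None, None
idx = torch.argmax(K @ q)                    # non-differentiable selection
out_hard = V[idx]
out_hard.sum().backward()
print("argmax output:", out_hard.detach())   # tensor([1., 4., 3.])
print("q.grad under argmax:", q.grad)        # None: no gradient flows to the query
print("V.grad under argmax:", V.grad)        # ones only in the selected row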

5. Kernelized Linear Attention


Softmax attention is widely adopted in transformers (Luong et al., 2015; Vaswani et al., 2017); however, its O(N²) complexity in memory and computation (N stands for the sequence length) often makes it less desirable for processing long documents like a book or a passage, where N could be in the thousands or beyond. There is a large body of research studying how to resolve this (see https://huggingface.co/blog/long-range-transformers).
In this context, this question presents a formulation of attention through the lens of kernels. A large portion of the exposition is adopted from Tsai et al. (2019). In particular, attention can be seen as applying a kernel over the inputs, with the kernel scores being the similarities between inputs. This formulation sheds light on individual components of the transformer's attention, and helps introduce alternative attention mechanisms that replace the "softmax" with linearized kernel functions, thus reducing the O(N²) complexity in memory and computation.
We first review the building blocks of the transformer. Let x ∈ RN×F denote a sequence of N feature vectors of dimension F. A transformer (Vaswani et al., 2017) is a function T : RN×F → RN×F defined by the composition of L transformer layers T1(·), . . . , TL(·) as follows,

Tl (x) = fl (Al (x) + x). (7)

The function fl (·) transforms each feature independently of the others and is usually implemented with
a small two-layer feedforward network. Al (·) is the self attention function and is the only part of the
transformer that acts across sequences.
We now focus on the self attention module, which involves the softmax. The self attention function Al (·)
computes, for every position, a weighted average of the feature representations of all other positions with
a weight proportional to a similarity score between the representations. Formally, the input sequence x is
projected by three matrices WQ ∈ RF ×D , WK ∈ RF ×D and WV ∈ RF ×M to corresponding representations
Q, K and V . The output for all positions, Al (x) = V ′ , is computed as follows,

Q = xWQ,  K = xWK,  V = xWV,
Al(x) = V′ = softmax(QK⊤ / √D) V.   (8)

Note that in the previous equation, the softmax function is applied row-wise to QK⊤. Following common terminology, Q, K and V are referred to as the “queries”, “keys” and “values” respectively.
Equation 8 implements a specific form of self-attention called softmax attention, where the similarity score is the exponential of the dot product between a query and a key. Given that subscripting a matrix with i returns the i-th row as a vector, we can write a generalized attention equation for any similarity function as follows,

V′_i = ( Σ_{j=1}^N sim(Qi, Kj) Vj ) / ( Σ_{j=1}^N sim(Qi, Kj) ).   (9)

Equation 9 is equivalent to equation 8 if we substitute the similarity function with sim_softmax(q, k) = exp(q⊤k / √D). This leads to

V′_i = ( Σ_{j=1}^N exp(Qi⊤ Kj / √D) Vj ) / ( Σ_{j=1}^N exp(Qi⊤ Kj / √D) ).   (10)

For computing the resulting self-attended feature Al(x) = V′, we need to compute all V′_i, i ∈ {1, ..., N}, in equation 10.


(a) Identify the conditions that need to be met by the sim function to ensure that V′_i in Equation 9 remains finite (the denominator never reaches zero).
Solution: In order for V′_i in Equation 9 to remain finite, it is necessary that Σ_{j=1}^N sim(Qi, Kj) ≠ 0 for all possible values of Q and K. This can be guaranteed by requiring the value of sim to always have the same sign (positive or negative).
For the purposes of this problem, we will only consider the case where sim consists of kernels k(x, y) : RD × RD → R+ that yield positive values.
(b) The definition of attention in equation 9 is generic and can be used to define several other attention
implementations.

(i) One potential attention variant is “polynomial kernel attention”, where the similarity sim(q, k) is measured by a polynomial kernel K (see https://en.wikipedia.org/wiki/Polynomial_kernel). Considering the special case of “quadratic kernel attention”, where the degree of the polynomial kernel is set to 2, derive sim(q, k) for quadratic kernel attention. (NOTE: any constant factor is set to 1.)
Solution: Quadratic kernel attention is sim(q, k) = (q⊤k + 1)² = ϕ(q)⊤ϕ(k).
(ii) One benefit of using kernelized attention is that we can represent a kernel using a feature map ϕ(·) (see https://en.wikipedia.org/wiki/Kernel_method). Derive the corresponding feature map ϕ(·) for the quadratic kernel.
Solution: The feature map ϕ(·) can be expressed as
ϕ(x) = [1, √2 x1, √2 x2, ..., √2 xD, x1², x2², ..., xD², √2 x1 x2, ..., √2 xD−1 xD]⊤

(iii) Considering a general kernel attention, where the kernel can be represented using a feature map such that K(q, k) = ϕ(q)⊤ϕ(k), rewrite the kernel attention of equation 9 with the feature map ϕ(·).
Solution: The general kernel attention can be rewritten as

V′_i = ( Σ_{j=1}^N ϕ(Qi)⊤ ϕ(Kj) Vj ) / ( Σ_{j=1}^N ϕ(Qi)⊤ ϕ(Kj) ).
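As a quick check (not part of the original solution), a NumPy implementation of this feature map confirms ϕ(q)⊤ϕ(k) = (q⊤k + 1)²:

import numpy as np

def phi_quadratic(x):
    # Feature map for the quadratic kernel (q^T k + 1)^2 with all constants set to 1.
    D = x.shape[0]
    cross = [np.sqrt(2) * x[i] * x[j] for i in range(D) for j in range(i + 1, D)]
    return np.concatenate(([1.0], np.sqrt(2) * x, x**2, cross))

rng = np.random.default_rng(0)
q, k = rng.standard_normal(5), rng.standard_normal(5)

lhs = phi_quadratic(q) @ phi_quadratic(k)
rhs = (q @ k + 1) ** 2
print(np.isclose(lhs, rhs))   # True: the feature map reproduces the quadratic kernel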

(c) We can rewrite the softmax attention in terms of equation 9 as equation 10. For computing all the V′_i (i ∈ {1, ..., N}), derive the time complexity (asymptotic computational cost) and space complexity (asymptotic memory requirement) of the above softmax attention in terms of the sequence length N, D and M.
NOTE: for the memory requirement, we need to store all intermediate results for backpropagation, including Q, K, V.
Solution: The computational graph can be illustrated with the following pseudocode:
for i in range(N):
    for j in range(N):
        S[i, j] = Q[i, :] @ K[j, :] / sqrt(D)      # (N, N) score matrix
for i in range(N):
    Z = 0
    for j in range(N):
        Z += exp(S[i, j])
    for j in range(N):
        A[i, j] = exp(S[i, j]) / Z                 # row-wise softmax
for i in range(N):
    O[i, :] = 0
    for j in range(N):
        O[i, :] += A[i, j] * V[j, :]               # weighted sum of values
Time: the computational cost of the three double for-loops is O(N²D), O(N²), and O(N²M), respectively, so the total time complexity is O(N²(D + M)) (or equivalently, O(N² max(D, M))).
Space: Q and K take O(ND) memory, S and A need O(N²) memory, and V and O need O(NM) memory. So the total space complexity is O(N(N + M + D)) (or equivalently, O(N max(N, M, D))).


(d) Assume we have a kernel K as the similarity function and that the kernel can be represented with a feature map ϕ(·). As in part (b), we can rewrite equation 9 with sim(x, y) = K(x, y) = ϕ(x)⊤ϕ(y). We can then further simplify it by making use of the associative property of matrix multiplication:

V′_i = ( ϕ(Qi)⊤ Σ_{j=1}^N ϕ(Kj) Vj⊤ ) / ( ϕ(Qi)⊤ Σ_{j=1}^N ϕ(Kj) ).   (11)

Note that the feature map ϕ(·) is applied row-wise to the matrices Q and K.
Consider using a linearized polynomial kernel ϕ(x) of degree 2, and assume M ≈ D. Derive the computational cost and memory requirement of this kernel attention as in (11).
Solution: The computational graph can be illustrated with the following pseudocode (C is the feature-map dimension):
U[:, :] = 0                                    # shape: [C, M]
for j in range(N):
    U[:, :] += outer(phi(K[j, :]), V[j, :])    # sum_j phi(K_j) V_j^T
Z[:] = 0                                       # shape: [C]
for j in range(N):
    Z[:] += phi(K[j, :])                       # sum_j phi(K_j)
for i in range(N):
    O[i, :] = phi(Q[i, :]) @ U[:, :]           # numerator of (11)
    O[i, :] /= phi(Q[i, :]) @ Z[:]             # denominator of (11)
Time: the computational cost of the three loops is O(NCM), O(NC), and O(NCM), respectively, so the total computational cost is O(NCM). For the quadratic kernel we have C = O(D²), and applying M ≈ D, the time complexity is O(ND³). If N ≫ D², kernelized linear attention with a quadratic polynomial kernel is faster than softmax attention.
Space: Q and K take O(ND) memory, U needs O(CM) memory, and V and O need O(NM) memory. So the total space complexity is O(N(D + M) + CM). For the quadratic kernel we have C = O(D²), and applying M ≈ D, the space complexity is O(ND + D³) (or equivalently O(D max(N, D²))). If N ≫ D², kernelized linear attention uses much less memory than softmax attention.
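To illustrate the reordering in equation (11), here is a small NumPy sketch (not from the original solution) comparing direct quadratic-kernel attention (equation 9) with its linearized form; the two agree up to floating-point error:

import numpy as np

def phi(X):
    # Row-wise quadratic-kernel feature map applied to an (N, D) matrix, output (N, C).
    N, D = X.shape
    ones = np.ones((N, 1))
    lin = np.sqrt(2) * X
    sq = X**2
    cross = np.stack([np.sqrt(2) * X[:, i] * X[:, j]
                      for i in range(D) for j in range(i + 1, D)], axis=1)
    return np.concatenate([ones, lin, sq, cross], axis=1)

rng = np.random.default_rng(0)
N, D, M = 8, 4, 3
Q = rng.standard_normal((N, D))
K = rng.standard_normal((N, D))
V = rng.standard_normal((N, M))

# Direct kernel attention (equation 9) with sim(q, k) = (q^T k + 1)^2: O(N^2) scores.
S = (Q @ K.T + 1) ** 2
out_direct = (S @ V) / S.sum(axis=1, keepdims=True)

# Linearized attention (equation 11): the sums over j are computed once and reused.
phiQ, phiK = phi(Q), phi(K)
U = phiK.T @ V                # sum_j phi(K_j) V_j^T, shape (C, M)
Z = phiK.sum(axis=0)          # sum_j phi(K_j), shape (C,)
out_linear = (phiQ @ U) / (phiQ @ Z)[:, None]

print(np.allclose(out_direct, out_linear))   # True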

6. Debugging DNNs (Optional)


(a) Your friends want to train a classifier for a new app they’re designing. They implement a deep con-
volutional network and train models with two configurations: a 20-layer model and a 56-layer model.


However, they observe the following training curves and are surprised that the 20-layer network has
better training as well as test error.

Figure 2: Training deep networks on CIFAR10

What are the potential reasons for this observation? Are there changes to the architecture design
that could help mitigate the problem?
Solution: The fact that the training error is also not getting better is a classic indication of some
sort of “underfitting” type behavior going on. Since the deeper model clearly has more expressive
capacity, the underfitting is not likely due to an issue with too much approximation error and hence has
to do with the behavior of training itself. Since we’re talking about Deep Learning here, the training
is happening using gradient based algorithms. So the reason is likely to involve updates that are too
small, which suggests some kind of dying-gradient problem.
In a deep network, dying gradients could be occurring because of poor initialization, in which case using He or Xavier initialization could help. But the question is asking about changes to the architecture
design, so this isn’t what was intended.
An architectural source of dying gradients is having gradients attenuated as they backprop through
multiple layers. This can be mitigated by adding residual/skip connections, which is an architectural
matter.
Adding normalization layers could also potentially help in such settings.
(b) You and your teammate want to compare batch normalization and layer normalization for the ImageNet
classification problem. You use ResNet-152 as a neural network architecture. The images have input
dimension 3 × 224 × 224 (channels, height, width). You want to use a batch size of 1024; however,
the GPU memory is so small that you cannot load the model and all 1024 samples at once — you can
only fit 32. Your teammate proposes using a gradient-accumulation algorithm:
Gradient accumulation refers to running the forward and backward pass of a model for a fixed number of steps (accumulation_steps) without updating the model parameters, while aggregating the gradients. The model parameters are instead updated once every accumulation_steps steps. This allows us to increase the effective batch size by a factor of accumulation_steps.
You implement the algorithm in PyTorch as:
model.train()
optimizer.zero_grad()
for i, (inputs, labels) in enumerate(training_set):
    predictions = model(inputs)
    loss = loss_function(predictions, labels)
    loss = loss / accumulation_steps
    loss.backward()
    if (i+1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
Note that the .backward() operator in PyTorch implicitly keeps accumulating the gradient for all the parameters unless it is zeroed out with an optimizer.zero_grad() call.
Before running actual experiments, your friend suggests that you should test whether the gradient
accumulation algorithm is implemented correctly. To do so, you collect the output logits (i.e. the
outputs of the last layer) from two models — ResNet-152, one with batchnorm and the other with
layernorm — using different combinations of batch sizes and the number of accumulation steps that
keep the effective batch size to 32.
Note that the effective batch size is the product of the batch size and accumulation_steps. In other words, the possible combinations for effective batch size 32 are:
(batch_size, accumulation_steps) = (1, 32), (2, 16), (4, 8), (8, 4), (16, 2), (32, 1).
Here, the (32,1) combination is the approach without the “gradient accumulation” trick, and we want
to see whether the others agree with this.
On running these tests, you observe that one of the models (either the one with batchnorm or the one with layernorm) doesn't pass the test. Which one do you expect to not pass the test, and why?
Solution: Batch normalization. Batch statistics are calculated over the actual batch, not over the effective batch, so by using smaller batches and accumulating we are actually computing something different. Meanwhile, layer normalization doesn't care about what is happening elsewhere in the batch and so its activations would not be affected by the batch size.
(c) You are training a CIFAR model and observe that the model is diverging (instead of the training loss
decreasing over iterations). Debug the pseudocode and give a correction that you believe would
actually result in reasonable convergence during training.
(Note: You can assume that the datasets are loaded correctly and the model is trained with the SGD optimizer with learning rate = 0.001 and batch size = 100.)
(HINT: Ideas from the previous part of this question might be relevant.)

model.train()
optimizer.zero_grad()
for (inputs, labels) in training_set:
    predictions = model(inputs)
    loss = loss_fn(predictions, labels)
    loss.backward()
    optimizer.step()

Figure 3: Training loss for CIFAR10

Solution: The problem here is that the gradients are not being zeroed out in the training loop.
So we keep applying old gradients over and over again along with new ones, which results in training instability. We need to move the optimizer.zero_grad() call to the top of the body of the training loop instead of having it once before the loop.
Note, you need to provide a very unambiguous patch/correction to the pseudocode for full credit.
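For concreteness, one such correction (exactly the fix described above, with optimizer.zero_grad() moved to the top of the loop body) is:

model.train()
for (inputs, labels) in training_set:
    optimizer.zero_grad()                 # reset accumulated gradients every iteration
    predictions = model(inputs)
    loss = loss_fn(predictions, labels)
    loss.backward()
    optimizer.step()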

7. Homework Process and Study Group


Citing sources and collaborators is an important part of life, including being a student!
We also want to understand what resources you find helpful and how much time homework is taking, so we
can change things in the future if possible.

(a) What sources (if any) did you use as you worked through the homework?
(b) If you worked with someone on this homework, who did you work with?
List names and student ID’s. (In case of homework party, you can also just describe the group.)
(c) Roughly how many total hours did you work on this homework? Write it down here where you’ll
need to remember it for the self-grade form.

References
Baldi, P. and Hornik, K. Neural networks and principal component analysis: Learning from examples
without local minima. Neural networks, 2(1):53–58, 1989.
Luong, M.-T., Pham, H., and Manning, C. D. Effective approaches to attention-based neural machine trans-
lation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing,
pp. 1412–1421, 2015.
Tsai, Y.-H. H., Bai, S., Yamada, M., Morency, L.-P., and Salakhutdinov, R. Transformer dissection: An
unified understanding for transformer’s attention via the lens of kernel. In Proceedings of the 2019 Con-
ference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference
on Natural Language Processing (EMNLP-IJCNLP), pp. 4344–4353, 2019.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I.
Attention is all you need. Advances in neural information processing systems, 30, 2017.
Contributors:

• Kumar Krishna Agrawal.


• Linyuan Gong.
• Sheng Shen.
• Anant Sahai.
• David M. Chan.
• Saagar Sanghavi.
• Shaojie Bai.
• Angelos Katharopoulos.
• Hao Peng.
• Suhong Moon.

