Homework 7 Solutions
In such architectures, the parameters (θ, ϕ) are learnable, and are trained with the learning objective of minimizing the reconstruction error ℓ(x̂i, xi) = ∥xi − x̂i∥²₂ using gradient descent:

θenc, ϕdec = argmin_{θ,ϕ} (1/N) Σ_{i=1}^N ℓ(x̂i, xi),    where x̂ = gϕ(fθ(x)).
Note that the above optimization problem does not require labels y. In practice, however, we want to use the pretrained model to solve the downstream task at hand, e.g. classifying MNIST digits. Linear Probing is one such approach, where we fix the encoder weights and learn a simple linear classifier over the features z = encoder(x), as sketched below.
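As a concrete illustration, here is a minimal linear-probing sketch in PyTorch (all names, dimensions, and the stand-in encoder below are illustrative placeholders, not the ones from the assignment): freeze the pretrained encoder and train only a linear head on its features.

import torch
import torch.nn as nn

# Placeholder dimensions and a stand-in for the pretrained encoder f_theta.
input_dim, latent_dim, num_classes = 100, 20, 3
encoder = nn.Linear(input_dim, latent_dim)

for p in encoder.parameters():
    p.requires_grad = False                 # freeze the encoder weights
encoder.eval()

probe = nn.Linear(latent_dim, num_classes)  # linear classifier over z = encoder(x)
optimizer = torch.optim.SGD(probe.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(64, input_dim)              # a batch of inputs
y = torch.randint(num_classes, (64,))       # and their labels
with torch.no_grad():
    z = encoder(x)                          # features from the frozen encoder
loss = loss_fn(probe(z), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()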
Figure 1: Visualization of the vanilla autoencoder with different latent representation sizes.
(ii) Based on your previous visualizations, answer this question: How does changing the latent representation size of the autoencoder affect the model's performance in terms of reconstruction accuracy and linear probe accuracy? Why?
Solution: Based on the given synthetic dataset, each data point has 100 dimensions, with 20
high-variance dimensions affecting the class label. The following observations can be made from
the visualizations in Figure 1.
Firstly, the reconstruction error of the autoencoder decreases as the size of the latent representation
increases. However, this reduction becomes marginal when the size of the latent representation
exceeds the dimension of the data (100).
Secondly, the linear probe accuracy increases as the size of the latent representation increases.
When the size of the latent representation exceeds the dimension of the data (100), the linear probe
accuracy approaches ∼100% even without training the autoencoder. However, when the size of the latent representation equals the number of informative dimensions (20), training the autoencoder becomes essential for achieving a high linear probe accuracy. The linear probe accuracy
finally converges to ≥95% with training. On the other hand, if the size of the latent representation
is as small as 5, it fails to capture all useful information in the input data, leading to significantly
lower linear probe accuracy.
We will assume σ1² > ··· > σk² > 0 are the k largest eigenvalues of (1/n)XX⊤. The assumption that σ1, ..., σk are positive and distinct ensures identifiability of the principal components, and is common in this setting. Therefore, writing S = diag(σ1, ..., σk), the top-k eigenvalues of (1/n)XX⊤ are the diagonal entries of S², with the corresponding eigenvectors given by the columns of Uk ∈ R^{m×k}. A well-established result from Baldi & Hornik (1989) shows that the principal components are the unique optimal solution to linear autoencoders (up to sign changes of the projection directions). In this subpart, we take some steps towards proving this result.
(i) Write out the first-order optimality conditions that the minima of Eq. 1 would satisfy.
Solution: We can compute the first-order conditions for W1 and W2, respectively. To get started, note that for matrices A ∈ R^{d×n} and W ∈ R^{k×d},

∥WA∥²_F = tr(A⊤W⊤WA),    ∇_W ∥WA∥²_F = 2WAA⊤.

Applying this to the reconstruction objective, any W1, W2 at a minimum must satisfy the first-order optimality conditions

(X − W2W1X) X⊤W1⊤ = 0,
W2⊤ (X − W2W1X) X⊤ = 0.
(ii) Show that the principal components Uk satisfy the optimality conditions outlined in (i).
Solution: Using the top-k principal components, set W2 = W1⊤ = Uk = [u1, ..., uk], where ui ∈ R^{m×1} and XX⊤ = UΣ²U⊤ with U = [u1, ..., um]. First, consider the term I − W2W1:

I − W2W1 = I − UkUk⊤    (2)
         = U [ 0_{k,k}  0 ; 0  I_{m−k,m−k} ] U⊤.    (3)

Plugging this back into the optimality conditions from part (i), the first condition becomes

(X − W2W1X) X⊤W1⊤ = (I − W2W1) XX⊤W1⊤,    (4)

and since

XX⊤W1⊤ = XX⊤Uk = UΣ²U⊤Uk = UΣ² [ I_{k,k} ; 0_{m−k,k} ],

we get

(X − W2W1X) X⊤W1⊤ = U [ 0_{k,k}  0 ; 0  I_{m−k,m−k} ] Σ² [ I_{k,k} ; 0_{m−k,k} ] = 0.

Similarly, for the second condition, W2⊤(I − W2W1) = Uk⊤(I − UkUk⊤) = Uk⊤ − Uk⊤ = 0, so W2⊤(X − W2W1X) X⊤ = 0 as well. Hence the principal components satisfy both optimality conditions.
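As a sanity check on this claim (an illustrative numerical sketch, not part of the official solution; the dimensions below are arbitrary), the following code verifies that W1 = Uk⊤, W2 = Uk make both gradients of the unregularized objective ∥X − W2W1X∥²_F vanish:

import numpy as np

rng = np.random.default_rng(0)
m, n, k = 8, 50, 3
X = rng.normal(size=(m, n))

# Top-k eigenvectors of X X^T, i.e. the top-k left singular vectors of X.
U, _, _ = np.linalg.svd(X)
Uk = U[:, :k]

W1, W2 = Uk.T, Uk
R = X - W2 @ W1 @ X                     # reconstruction residual

grad_W2 = -2 * R @ X.T @ W1.T           # gradient of ||X - W2 W1 X||_F^2 w.r.t. W2
grad_W1 = -2 * W2.T @ R @ X.T           # gradient w.r.t. W1

print(np.abs(grad_W2).max(), np.abs(grad_W1).max())   # both ~0 up to float error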
We will assume σ1² > ··· > σk² > σ_{k+1}² ≥ 0 are the k + 1 largest eigenvalues of (1/n)XX⊤. The assumption that σ1, ..., σk are positive and distinct ensures identifiability of the principal components.
Consider an ℓ2-regularized linear autoencoder where the objective is

min_{W1,W2} ∥X − W2W1X∥²_F + λ (∥W1∥²_F + ∥W2∥²_F),    (6)

where ∥·∥²_F represents the Frobenius norm squared of the matrix (i.e. the sum of squares of the entries).
(a) You want to use SGD-style training in PyTorch (involving the training points one at a time) and self-
supervision to find W1 and W2 which optimize (6) by treating the problem as a neural net being trained
in a supervised fashion. Answer the following questions and briefly explain your choice:
(i) How many linear layers do you need?
□ 0
□ 1
□ 2
□ 3
Solution: 2. We would use two linear layers: one for the encoder and one for the decoder.
(ii) What is the loss function that you will be using?
□ nn.L1Loss
□ nn.MSELoss
□ nn.CrossEntropyLoss
Solution: We should use nn.MSELoss to train the model (reconstruction under ℓ2 loss), since what we want is for each vector to be close to its reconstruction in a squared-error sense.
(iii) Which of the following would you need to optimize (6) exactly as it is written? (Select all
that are needed)
□ Weight Decay
□ Dropout
□ Layer Norm
□ Batch Norm
□ SGD optimizer
Solution: We need Weight Decay to achieve the desired ℓ2 regularization, and the SGD optimizer to perform the gradient-based updates.
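Putting parts (i)–(iii) together, a minimal PyTorch sketch of this setup could look as follows (the dimensions m, k and the regularization strength lam are placeholders; note that PyTorch's weight_decay corresponds to a penalty of (lam/2)∥W∥²_F on each weight matrix, so the constant in front of the regularizer differs from (6) by a factor of 2, and nn.MSELoss averages rather than sums the squared errors):

import torch
import torch.nn as nn

m, k, lam = 100, 20, 1e-3                 # placeholder dimensions and regularization strength

model = nn.Sequential(
    nn.Linear(m, k, bias=False),          # encoder W1
    nn.Linear(k, m, bias=False),          # decoder W2
)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=lam)

x = torch.randn(32, m)                    # a batch of training points; target = input (self-supervision)
optimizer.zero_grad()
loss = loss_fn(model(x), x)
loss.backward()
optimizer.step()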
(b) Do you think that the solution to (6) when we use a small nonzero λ has an inductive bias towards finding a W2 matrix with approximately orthonormal columns? Argue why or why not.
(Hint: Think about the SVDs of W1 = U1Σ1V1⊤ and W2 = U2Σ2V2⊤. You can assume that if a k × m or m × k matrix has all k of its nonzero singular values equal to 1, then it must have orthonormal rows or columns. Remember that the Frobenius norm squared of a matrix is just the sum of the squares of its singular values. Further, think about the minimizer of σ² + 1/σ². Is it unique?)
Solution: If there were no regularization term, we know that every minimizer must have W2W1 acting like a projection matrix onto the subspace spanned by the k largest (left) singular vectors of X.
This means that, to minimize the main loss, W2W1 has to be the identity when restricted to the subspace spanned by the k largest singular vectors of X.
Therefore we would expect W1, W2 to be approximate pseudo-inverses of each other, since they are not square and the rank of either one is at most k.
Regularizing by penalizing the Frobenius norms therefore forces us to consider terms of the form σi² + 1/σi², where W1 contributes singular values σi and its approximate pseudo-inverse W2 contributes the corresponding 1/σi.
Minimizing σ² + 1/σ² by taking derivatives results in setting 0 = −2/σ³ + 2σ, which has a unique non-negative real solution at σ = 1, and so this is the unique minimizer, since the expression clearly goes to ∞ as σ → ∞ or σ → 0.
If λ is small enough, then the optimization essentially decouples: the main loss forces W2 and W1 to be pseudo-inverses whose product W2W1 projects onto the subspace spanned by the k singular vectors of X with the largest singular values; and the regularization term forces the individual W2 and W1 to have all of their nonzero singular values as close to 1 as possible.
Once σi ≈ 1, the matrix has approximately orthonormal columns for W2 and approximately orthonor-
mal rows for W1 . You can see this by simply writing W = U ΣV ⊤ and then noticing for a tall W that
W ⊤ W = V Σ⊤ ΣV ⊤ ≈ V IV ⊤ = I and similarly for W W ⊤ for a wide W .
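The following small numerical sketch (illustrative only; the dimensions are arbitrary) constructs a tall matrix whose nonzero singular values are all 1 and confirms that its columns are orthonormal:

import numpy as np

rng = np.random.default_rng(0)
m, k = 6, 3
A = rng.normal(size=(m, k))
U_hat, _, Vt = np.linalg.svd(A, full_matrices=False)   # U_hat: m x k, Vt: k x k
W = U_hat @ Vt                                          # all k singular values replaced by 1

print(np.allclose(W.T @ W, np.eye(k)))                  # True: columns are orthonormal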
(c) Continue to use the setting in part (b), where µ = 0 and σ = 1. Let s be the scaling factor on the dot product. Suppose we want E[q⊤k/s] to be 0, and Var(q⊤k/s) to be 1. What should s be in terms of d?
Solution: From part (a), we know that E[q⊤k] = Σ_{i=1}^d E[qi]E[ki] = dµ² = 0. From part (b), we know that Var(q⊤k) = d. So for E[q⊤k/s] to be 0 and Var(q⊤k/s) = d/s² to be 1, we can take s = √d.
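A quick numerical check of this scaling (an illustrative sketch, not part of the official solution): with entries of q and k drawn i.i.d. with mean 0 and variance 1, the dot product has variance ≈ d, and dividing by s = √d brings the variance back to ≈ 1.

import torch

d, trials = 64, 200_000
q = torch.randn(trials, d)                   # each row is a query with i.i.d. N(0, 1) entries
k = torch.randn(trials, d)                   # each row is a key with i.i.d. N(0, 1) entries
dots = (q * k).sum(dim=1)                    # q^T k for each trial

print(dots.var().item())                     # ~ d
print((dots / d ** 0.5).var().item())        # ~ 1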
4. Argmax Attention
Recall from lecture that we can think about attention as being queryable softmax pooling. In this problem, we ask you to consider a hypothetical argmax version of attention that returns exactly the value corresponding to the key that is most similar to the query, where similarity is measured using the standard inner product.
(a) Perform argmax attention with the following keys and values:
Keys:
k1 = [1, 2, 0]⊤,  k2 = [0, 3, 4]⊤,  k3 = [5, 0, 0]⊤,  k4 = [0, 0, 1]⊤
Corresponding Values:
v1 = [2, 0, 1]⊤,  v2 = [1, 4, 3]⊤,  v3 = [0, −1, 4]⊤,  v4 = [1, 0, −1]⊤
What would be the output of the attention layer for the query q = [1, 1, 2]⊤?
Hint: For example, argmax([1, 3, 2]) = [0, 1, 0]
Solution: Compute the inner products of the query with the keys: [q⊤k1, q⊤k2, q⊤k3, q⊤k4] = [3, 11, 5, 2].
We take argmax rather than softmax, and have argmax([3, 11, 5, 2]) = [0, 1, 0, 0].
Now, take the weighted sum of the value vectors (in this case all are zeroed out except for the one corresponding to the highest dot product between query and key):

0 · [2, 0, 1]⊤ + 1 · [1, 4, 3]⊤ + 0 · [0, −1, 4]⊤ + 0 · [1, 0, −1]⊤ = [1, 4, 3]⊤

So the final output is [1, 4, 3]⊤.
(b) Note that instead of using softmax we used argmax to generate outputs from the attention layer. How
does this design choice affect our ability to usefully train models involving attention?
(Hint: think about how the gradients flow through the network in the backward pass. Can we learn to
improve our queries or keys during the training process?)
Solution: It wouldn’t work in real life, since the gradients only flow through the one element that
is selected by argmax and thus most of the parameters in the transformer layers would remain the
same if we did gradient-based training. The reason is that generically, the argmax is not sensitive to
small changes in the keys and queries, since any such tiny perturbations will not change the winner.
Consequently, the gradients with respect to the keys and queries will always be zero. This means that
the keys and queries will never be improved — or more precisely, the functions used to generate keys
and queries from the state will never get updated.
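The following small sketch (illustrative only, reusing the numbers from part (a)) contrasts the two cases: with softmax the query receives a (possibly small but) nonzero gradient, whereas with hard argmax selection the output does not depend differentiably on the query at all, so no gradient can flow back to it.

import torch
import torch.nn.functional as F

keys = torch.tensor([[1., 2., 0.], [0., 3., 4.], [5., 0., 0.], [0., 0., 1.]])
values = torch.tensor([[2., 0., 1.], [1., 4., 3.], [0., -1., 4.], [1., 0., -1.]])

# Softmax attention: the query gets a nonzero gradient.
q_soft = torch.tensor([1., 1., 2.], requires_grad=True)
soft_out = F.softmax(keys @ q_soft, dim=0) @ values
soft_out.sum().backward()
print(q_soft.grad)                # small but nonzero

# Argmax attention: selecting values[argmax] breaks the autograd path to the query.
q_hard = torch.tensor([1., 1., 2.], requires_grad=True)
hard_out = values[(keys @ q_hard).argmax()]
print(hard_out)                   # tensor([1., 4., 3.])
print(hard_out.requires_grad)     # False: no gradient signal reaches q_hard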
Such quadratic complexity is not desirable for processing a long document like a book or a passage, where N could be in the thousands or beyond. There is a large body of research studying how to resolve this.¹
In this context, this question presents a formulation of attention through the lens of kernels. A large portion of the exposition is adapted from Tsai et al. (2019). In particular, attention can be seen as applying a kernel over the inputs, with the kernel scores being the similarities between inputs. This formulation sheds light on individual components of the transformer's attention, and helps introduce alternative attention mechanisms that replace the softmax with linearized kernel functions, thus reducing the O(N²) complexity in memory and computation.
We first review the building blocks of the transformer. Let x ∈ R^{N×F} denote a sequence of N feature vectors of dimension F. A transformer (Vaswani et al., 2017) is a function T : R^{N×F} → R^{N×F} defined by the composition of L transformer layers T1(·), ..., TL(·) of the form

Tl(x) = fl(Al(x) + x).

The function fl(·) transforms each feature independently of the others and is usually implemented with a small two-layer feedforward network. Al(·) is the self-attention function and is the only part of the transformer that acts across the sequence.
We now focus on the self-attention module, which involves the softmax. The self-attention function Al(·) computes, for every position, a weighted average of the feature representations of all other positions, with a weight proportional to a similarity score between the representations. Formally, the input sequence x is projected by three matrices WQ ∈ R^{F×D}, WK ∈ R^{F×D} and WV ∈ R^{F×M} to corresponding representations Q, K and V. The output for all positions, Al(x) = V′, is computed as follows,

V′ = softmax(QK⊤/√D) V.    (8)

Note that in the previous equation, the softmax function is applied row-wise to QK⊤. Following common terminology, Q, K and V are referred to as the "queries", "keys" and "values" respectively.
Equation 8 implements a specific form of self-attention called softmax attention where the similarity score
is the exponential of the dot product between a query and a key. Given that subscripting a matrix with i
returns the i-th row as a vector, we can write a generalized attention equation for any similarity function as
follows,

V′i = ( Σ_{j=1}^N sim(Qi, Kj) Vj ) / ( Σ_{j=1}^N sim(Qi, Kj) ).    (9)

Equation 9 is equivalent to equation 8 if we substitute the similarity function with sim_softmax(q, k) = exp(q⊤k/√D). This leads to

V′i = ( Σ_{j=1}^N exp(Qi⊤Kj/√D) Vj ) / ( Σ_{j=1}^N exp(Qi⊤Kj/√D) ).    (10)

For computing the resulting self-attended feature Al(x) = V′, we need to compute all V′i for i ∈ {1, ..., N} in equation 10.
¹ https://huggingface.co/blog/long-range-transformers
(a) Identify the conditions that need to be met by the sim function to ensure that V′i in Equation 9 remains finite (the denominator never reaches zero).
Solution: In order for V′i in Equation 9 to remain finite, it is necessary that Σ_{j=1}^N sim(Qi, Kj) ≠ 0 for all possible values of Q and K. A simple way to guarantee this is to require that sim always takes values of the same sign (for example, strictly positive).
For the purposes of this problem, we will only consider the case where sim consists of kernels k(x, y) :
RF × RF → R+ that yield positive values.
(b) The definition of attention in equation 9 is generic and can be used to define several other attention implementations.
(i) One potential attention variant is "polynomial kernel attention", where the similarity function sim(q, k) is given by a polynomial kernel.² Considering the special case of "quadratic kernel attention", where the degree of the polynomial kernel is set to 2, derive sim(q, k) for quadratic kernel attention. (NOTE: any constant factor is set to be 1.)
² https://en.wikipedia.org/wiki/Polynomial_kernel
Solution: Quadratic kernel attention is sim(q, k) = (q⊤k + 1)² = ϕ(q)⊤ϕ(k).
(ii) One benefit of using kernelized attention is that we can represent the kernel using a feature map ϕ(·).³ Derive the corresponding feature map ϕ(·) for the quadratic kernel.
³ https://en.wikipedia.org/wiki/Kernel_method
Solution: The feature map ϕ(·) can be expressed as

ϕ(x) = [1, √2 x1, √2 x2, ..., √2 xD, x1², x2², ..., xD², √2 x1x2, ..., √2 x_{D−1}x_D]⊤.
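As a quick sanity check (an illustrative sketch, not part of the official solution), the following code verifies numerically that ϕ(q)⊤ϕ(k) = (q⊤k + 1)² for random vectors:

import numpy as np

def phi(x):
    # Quadratic-kernel feature map: [1, sqrt(2) x_i, x_i^2, sqrt(2) x_i x_j for i < j].
    D = len(x)
    cross = [np.sqrt(2) * x[i] * x[j] for i in range(D) for j in range(i + 1, D)]
    return np.concatenate(([1.0], np.sqrt(2) * x, x ** 2, cross))

rng = np.random.default_rng(0)
q, k = rng.normal(size=5), rng.normal(size=5)
print(np.isclose(phi(q) @ phi(k), (q @ k + 1) ** 2))   # True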
(iii) Considering a general kernel attention, where the kernel can be represented using a feature map such that K(q, k) = ϕ(q)⊤ϕ(k), rewrite the kernel attention of equation 9 with the feature map ϕ(·).
Solution: The general kernel attention can be rewritten as

V′i = ( Σ_{j=1}^N ϕ(Qi)⊤ϕ(Kj) Vj ) / ( Σ_{j=1}^N ϕ(Qi)⊤ϕ(Kj) ).
(c) We can rewrite the softmax attention in terms of equation 9 as equation 10. For all the Vi′ (i ∈
{1, ..., N }), derive the time complexity (asymptotic computational cost) and space complexity
(asymptotic memory requirement) of the above softmax attention in terms of sequence length N ,
D and M .
NOTE: for memory requirement, we need to store any intermediate results for backpropagation, in-
cluding all Q, K, V
Solution: The computational graph can be illustrated with the following code (Q ∈ R^{N×D}, K ∈ R^{N×D}, and V ∈ R^{N×M} are assumed to be given as arrays):

import numpy as np

S = np.empty((N, N))                  # S[i, j] = Q_i . K_j / sqrt(D)
for i in range(N):
    for j in range(N):
        S[i, j] = Q[i, :] @ K[j, :] / np.sqrt(D)

A = np.empty((N, N))                  # row-wise softmax of S
for i in range(N):
    Z = 0.0
    for j in range(N):
        Z += np.exp(S[i, j])
    for j in range(N):
        A[i, j] = np.exp(S[i, j]) / Z

O = np.zeros((N, M))                  # output V'
for i in range(N):
    for j in range(N):
        O[i, :] += A[i, j] * V[j, :]
Time: The computational cost of the three loop nests above is O(N²D), O(N²), and O(N²M), respectively, so the total time complexity is O(N²(D + M)) (or equivalently, O(N² max(D, M))).
Space: Q, K take O(ND) memory, S, A need O(N²) memory, and V, O need O(NM) memory, so the total memory requirement is O(N² + N(D + M)).
Note that the feature map ϕ(·) is applied row-wise to the matrices Q and K. Consider using a linearized polynomial kernel ϕ(x) of degree 2, and assume M ≈ D; derive the computational cost and memory requirement of this kernel attention as in (11).
Solution: The computational graph can be illustrated with the following code (here ϕ maps a D-dimensional row to a C-dimensional feature vector, with C = O(D²) for the quadratic kernel, and Q, K, V are given as before):

import numpy as np

U = np.zeros((C, M))                  # sum_j phi(K_j) V_j^T, shape [C, M]
for j in range(N):
    U += np.outer(phi(K[j, :]), V[j, :])

Z = np.zeros(C)                       # sum_j phi(K_j), shape [C]
for j in range(N):
    Z += phi(K[j, :])

O = np.empty((N, M))                  # output V'
for i in range(N):
    O[i, :] = phi(Q[i, :]) @ U
    O[i, :] /= phi(Q[i, :]) @ Z
Time: The computational cost of each of the three steps is O(NCM), O(NC), and O(NCM), respectively, so the total computational cost is O(NCM). For the quadratic kernel we have C = O(D²), and applying M ≈ D gives a total cost of O(ND³).
Space: U takes O(CM) memory, Z takes O(C), and storing the features ϕ(Q), ϕ(K) for backpropagation takes O(NC), so the total memory is O(NC + CM) = O(ND² + D³), with no N² term. If N ≫ D², kernelized linear attention uses much less memory than softmax attention.
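The two formulations compute the same output. The following sketch (illustrative shapes only, reusing the quadratic feature map from part (b)(ii)) checks that the linearized computation matches direct kernel attention with sim(q, k) = (q⊤k + 1)²:

import numpy as np

def phi(x):
    D = len(x)
    cross = [np.sqrt(2) * x[i] * x[j] for i in range(D) for j in range(i + 1, D)]
    return np.concatenate(([1.0], np.sqrt(2) * x, x ** 2, cross))

rng = np.random.default_rng(0)
N, D, M = 7, 4, 3
Q, K, V = rng.normal(size=(N, D)), rng.normal(size=(N, D)), rng.normal(size=(N, M))

# Direct kernel attention: materializes the N x N similarity matrix.
sim = (Q @ K.T + 1.0) ** 2
direct = (sim @ V) / sim.sum(axis=1, keepdims=True)

# Linearized attention: accumulate sum_j phi(K_j) V_j^T and sum_j phi(K_j) once.
phiK = np.stack([phi(K[j]) for j in range(N)])   # [N, C]
phiQ = np.stack([phi(Q[i]) for i in range(N)])   # [N, C]
U = phiK.T @ V                                   # [C, M]
Z = phiK.sum(axis=0)                             # [C]
linear = (phiQ @ U) / (phiQ @ Z)[:, None]

print(np.allclose(direct, linear))               # True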
However, they observe the following training curves and are surprised that the 20-layer network has better training error as well as test error.
What are the potential reasons for this observation? Are there changes to the architecture design
that could help mitigate the problem?
Solution: The fact that the training error is also not getting better is a classic indication of some sort of "underfitting"-type behavior. Since the deeper model clearly has more expressive capacity, the underfitting is not likely due to too much approximation error, and hence has to do with the behavior of training itself. Since we're talking about deep learning here, the training is happening using gradient-based algorithms. So the reason is likely to involve updates that are too small, which suggests some kind of dying-gradient problem.
In a deep network, dying gradients could be occurring because of poor initialization, in which case using He or Xavier initialization could help. But the question is asking about changes to the architecture design, so this isn't what was intended.
An architectural source of dying gradients is having gradients attenuated as they backprop through
multiple layers. This can be mitigated by adding residual/skip connections, which is an architectural
matter.
Adding normalization layers could also potentially help in such settings.
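For example, a residual block (a generic sketch, not the specific ResNet-style block from the problem) gives gradients an identity path around each group of layers:

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Generic residual block: the skip connection x + body(x) preserves an identity
    path for gradients even if the inner layers attenuate them."""
    def __init__(self, dim):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x):
        return x + self.body(x)    # skip connection

block = ResidualBlock(16)
out = block(torch.randn(8, 16))    # shape preserved: [8, 16]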
(b) You and your teammate want to compare batch normalization and layer normalization for the ImageNet
classification problem. You use ResNet-152 as a neural network architecture. The images have input
dimension 3 × 224 × 224 (channels, height, width). You want to use a batch size of 1024; however,
the GPU memory is so small that you cannot load the model and all 1024 samples at once — you can
only fit 32. Your teammate proposes using a gradient-accumulation algorithm:
Gradient accumulation refers to running the forward and backward passes of the model for a fixed number of steps (accumulation_steps) without updating the model parameters, while aggregating the gradients; the model parameters are then updated once every accumulation_steps steps. This allows us to increase the effective batch size by a factor of accumulation_steps.
You implement the algorithm in PyTorch as:

model.train()
optimizer.zero_grad()
for i, (inputs, labels) in enumerate(training_set):
    predictions = model(inputs)
    loss = loss_fn(predictions, labels)
    loss.backward()
    optimizer.step()
Solution: The problem here is that the gradients are not being zeroed out inside the training loop. So we keep applying old gradients over and over again along with new ones, which results in training instability. We need to move the optimizer.zero_grad() call to the beginning of the inside of the training loop instead of having it once before the loop.
Note, you need to provide a very unambiguous patch/correction to the pseudocode for full credit.
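For concreteness, one way to write the patched loop is shown below; it reuses the names from the listing above and applies exactly the fix described (zeroing gradients at the start of each iteration). The commented lines sketch the variant that actually accumulates over accumulation_steps micro-batches before each update; accumulation_steps is assumed to be defined elsewhere.

model.train()
for i, (inputs, labels) in enumerate(training_set):
    optimizer.zero_grad()                     # clear stale gradients from the previous step
    predictions = model(inputs)
    loss = loss_fn(predictions, labels)
    loss.backward()
    optimizer.step()

# Accumulation variant (sketch): scale the loss, and only step/zero every
# accumulation_steps iterations so gradients from several micro-batches add up.
#     loss = loss_fn(predictions, labels) / accumulation_steps
#     loss.backward()
#     if (i + 1) % accumulation_steps == 0:
#         optimizer.step()
#         optimizer.zero_grad()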
(a) What sources (if any) did you use as you worked through the homework?
(b) If you worked with someone on this homework, who did you work with?
List names and student ID’s. (In case of homework party, you can also just describe the group.)
(c) Roughly how many total hours did you work on this homework? Write it down here where you’ll
need to remember it for the self-grade form.
References
Baldi, P. and Hornik, K. Neural networks and principal component analysis: Learning from examples
without local minima. Neural networks, 2(1):53–58, 1989.
Luong, M.-T., Pham, H., and Manning, C. D. Effective approaches to attention-based neural machine trans-
lation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing,
pp. 1412–1421, 2015.
Tsai, Y.-H. H., Bai, S., Yamada, M., Morency, L.-P., and Salakhutdinov, R. Transformer dissection: An
unified understanding for transformer’s attention via the lens of kernel. In Proceedings of the 2019 Con-
ference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference
on Natural Language Processing (EMNLP-IJCNLP), pp. 4344–4353, 2019.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I.
Attention is all you need. Advances in neural information processing systems, 30, 2017.
Contributors: