
EECS 182 Deep Neural Networks


Spring 2023 Anant Sahai Homework 1
This homework is due on Friday, February 3, 2023, at 10:59PM.

1. Bias-Variance Tradeoff Review


(a) Show that we can decompose the expected mean squared error into three parts: bias, variance, and irreducible error $\sigma^2$:

$$\text{Error} = \text{Bias}^2 + \text{Variance} + \sigma^2$$

Formally, suppose we have a randomly sampled training set $D$ (drawn independently of our test data), and we select an estimator denoted $\theta = \hat{\theta}(D)$ (for example, via empirical risk minimization). The expected mean squared error for a test input $x$ can be decomposed as below:

$$\mathbb{E}_{Y \sim p(y|x),\, D}\big[(Y - f_{\hat{\theta}(D)}(x))^2\big] = \text{Bias}(f_{\hat{\theta}(D)}(x))^2 + \text{Var}(f_{\hat{\theta}(D)}(x)) + \sigma^2$$

You may find it helpful to recall the formulaic definitions of Bias and Variance, reproduced for you below:

$$\text{Bias}(f_{\hat{\theta}(D)}(x)) = \mathbb{E}_{Y \sim p(y|x),\, D}\big[f_{\hat{\theta}(D)}(x) - Y\big]$$

$$\text{Var}(f_{\hat{\theta}(D)}(x)) = \mathbb{E}_D\Big[\big(f_{\hat{\theta}(D)}(x) - \mathbb{E}[f_{\hat{\theta}(D)}(x)]\big)^2\Big]$$

Solution: For simplicity of notation, let $\mathbb{E}[\cdot]$ denote $\mathbb{E}_{Y \sim p(y|x),\, D}[\cdot]$.

$$\mathbb{E}[(Y - f_{\hat{\theta}(D)}(x))^2] = \mathbb{E}[f_{\hat{\theta}(D)}(x)^2 - 2Y f_{\hat{\theta}(D)}(x) + Y^2]$$

By independence of $Y$ and $D$ and linearity of expectation,

$$\mathbb{E}[(Y - f_{\hat{\theta}(D)}(x))^2] = \mathbb{E}[f_{\hat{\theta}(D)}(x)^2] - 2\,\mathbb{E}[Y]\,\mathbb{E}[f_{\hat{\theta}(D)}(x)] + \mathbb{E}[Y^2]$$

Noting the definition of variance,

\begin{align*}
\mathbb{E}[(Y - f_{\hat{\theta}(D)}(x))^2] &= \text{Var}(f_{\hat{\theta}(D)}(x)) + \mathbb{E}[f_{\hat{\theta}(D)}(x)]^2 - 2\,\mathbb{E}[Y]\,\mathbb{E}[f_{\hat{\theta}(D)}(x)] + \mathbb{E}[Y^2] \\
&= \text{Var}(f_{\hat{\theta}(D)}(x)) + \big(\mathbb{E}[f_{\hat{\theta}(D)}(x)] - \mathbb{E}[Y]\big)^2 + \text{Var}(Y \mid X = x) \\
&= \text{Var}(f_{\hat{\theta}(D)}(x)) + \text{Bias}(f_{\hat{\theta}(D)}(x))^2 + \text{Var}(Y \mid X = x)
\end{align*}

The conditional variance $\text{Var}(Y \mid x)$, which we will denote $\sigma^2$, captures the irreducible error that will be incurred no matter what learner $\hat{\theta}$ we use.
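
As a quick numerical sanity check (not part of the assigned solution), the decomposition can be verified by Monte Carlo with a toy one-dimensional linear model; all constants below (true slope, noise level, sample size, test point) are arbitrary choices for illustration.

```python
import numpy as np

# Monte Carlo sketch of E[(Y - f(x))^2] = Bias^2 + Var + sigma^2 for a toy
# 1-D OLS estimator. All constants here are arbitrary illustrative choices.
rng = np.random.default_rng(0)
theta_true, sigma, n, trials = 2.0, 0.5, 20, 20000
x_test = 1.5

preds = np.empty(trials)
for t in range(trials):
    X = rng.uniform(-1.0, 1.0, size=n)                  # fresh training set D
    y = theta_true * X + sigma * rng.standard_normal(n)
    theta_hat = (X @ y) / (X @ X)                       # 1-D ordinary least squares
    preds[t] = theta_hat * x_test                       # prediction at the test input

y_test = theta_true * x_test + sigma * rng.standard_normal(trials)
mse = np.mean((y_test - preds) ** 2)
bias_sq = (preds.mean() - theta_true * x_test) ** 2
var = preds.var()
print(mse, bias_sq + var + sigma ** 2)  # the two numbers should be close
```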


(b) Suppose our training dataset consists of $D = \{(x_i, y_i)\}_{i=1}^n$, where the only randomness comes from the label vector $Y = X\theta^* + \varepsilon$, where $\theta^*$ is the true underlying linear model and each noise variable $\varepsilon_i$ is i.i.d. with zero mean and variance 1. We use ordinary least squares to estimate $\hat{\theta}$ from this data. Calculate the bias and covariance of the $\hat{\theta}$ estimate and use that to compute the bias and variance of the prediction at a particular test input $x$. Recall that the OLS solution is given by

$$\hat{\theta} = (X^\top X)^{-1} X^\top Y,$$

where $X \in \mathbb{R}^{n \times d}$ is our (nonrandom) data matrix and $Y \in \mathbb{R}^n$ is the (random) vector of training targets. For simplicity, assume that $X^\top X$ is diagonal.
Solution: We first compute the bias of $\hat{\theta}$. Recalling that we have $Y = X\theta + \varepsilon$ for a noise vector $\varepsilon$, we then have

\begin{align*}
\mathbb{E}[\hat{\theta}] &= \mathbb{E}[(X^\top X)^{-1} X^\top (X\theta + \varepsilon)] \\
&= \mathbb{E}[\theta + (X^\top X)^{-1} X^\top \varepsilon] \\
&= \theta + (X^\top X)^{-1} X^\top \mathbb{E}[\varepsilon] \\
&= \theta && \text{since $\varepsilon$ has zero mean.}
\end{align*}

So our bias is
$$\text{bias} = \mathbb{E}[\hat{\theta} - \theta] = \mathbb{E}[\hat{\theta}] - \theta = \theta - \theta = 0.$$

We thus see that the OLS estimator $\hat{\theta}$ is an unbiased estimator of the true parameter $\theta$. Considering the bias of our estimate at a particular test input $x$, we see that our prediction is also unbiased:

$$\mathbb{E}[x^\top \hat{\theta} - x^\top \theta - \epsilon] = 0.$$

Next, we compute the variance of $\hat{\theta}$, proceeding by first computing the covariance of $\hat{\theta}$:

\begin{align*}
\mathbb{E}[(\hat{\theta} - \theta)(\hat{\theta} - \theta)^\top] &= \mathbb{E}[(X^\top X)^{-1} X^\top \varepsilon\varepsilon^\top ((X^\top X)^{-1} X^\top)^\top] \\
&= (X^\top X)^{-1} X^\top \mathbb{E}[\varepsilon\varepsilon^\top]\, X ((X^\top X)^{-1})^\top \\
&= (X^\top X)^{-1} X^\top I_n ((X^\top X)^{-1} X^\top)^\top && \text{noise variables are i.i.d.} \\
&= (X^\top X)^{-1} X^\top X ((X^\top X)^{-1})^\top \\
&= (X^\top X)^{-1}.
\end{align*}

Now for a particular test input $x$, we can compute the variance

\begin{align*}
\text{Var}[x^\top \hat{\theta}] &= \text{Var}[x^\top (\hat{\theta} - \theta)] \\
&= \mathbb{E}[x^\top (\hat{\theta} - \theta)(\hat{\theta} - \theta)^\top x] \\
&= x^\top (X^\top X)^{-1} x.
\end{align*}

For simplicity, suppose $X^\top X$ were a diagonal matrix (we could have applied an orthogonal transformation to achieve this) with sorted entries $\sigma_1^2 \geq \sigma_2^2 \geq \dots \geq \sigma_d^2$ (corresponding to the data variances in each dimension). Now we can easily compute the variance as $\sum_{i=1}^d x_i^2 / \sigma_i^2$, and we see that in directions where $\sigma_i$ is close to 0 (which means there is very little variance in the data in this dimension), the variance of our estimate can explode (and thus our risk as well).
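
A small NumPy sketch (arbitrary toy design matrix, unit-variance Gaussian noise as assumed above) can confirm numerically that the empirical covariance of the OLS estimates matches $(X^\top X)^{-1}$.

```python
import numpy as np

# Sketch: the empirical covariance of OLS estimates approaches (X^T X)^{-1}
# when Var(eps_i) = 1. Toy sizes chosen arbitrarily.
rng = np.random.default_rng(1)
n, d, trials = 50, 3, 20000
X = rng.standard_normal((n, d))        # fixed (nonrandom) design matrix
theta = rng.standard_normal(d)

estimates = np.empty((trials, d))
for t in range(trials):
    y = X @ theta + rng.standard_normal(n)              # eps ~ N(0, I)
    estimates[t] = np.linalg.solve(X.T @ X, X.T @ y)    # OLS estimate

print(np.cov(estimates.T))             # empirical covariance of theta_hat
print(np.linalg.inv(X.T @ X))          # should closely match (up to Monte Carlo error)
```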


2. Least Squares and the Min-norm problem from the Perspective of SVD
Consider the equation $Xw = y$, where $X \in \mathbb{R}^{m \times n}$ is a non-square data matrix, $w$ is a weight vector, and $y$ is a vector of labels corresponding to the datapoints in each row of $X$.
Let's say that $X = U\Sigma V^T$ is the (full) SVD of $X$. Here, $U$ and $V$ are orthonormal square matrices, and $\Sigma$ is an $m \times n$ matrix with non-zero singular values ($\sigma_i$) on the "diagonal".
For this problem, we define $\Sigma^\dagger$ as the $n \times m$ matrix with the reciprocals of the singular values ($\frac{1}{\sigma_i}$) along the "diagonal".

(a) First, consider the case where $m > n$, i.e. our data matrix $X$ has more rows than columns (tall matrix) and the system is overdetermined. How do we find the weights $w$ that minimize the error between $Xw$ and $y$? In other words, we want to solve $\min_w \|Xw - y\|_2$.
Solution: Meta: Students may be confused about which form of the SVD to use. Make sure they know it is the full SVD: $U$ and $V$ are square orthonormal matrices.
This is the classic least squares problem. The solution is given by

$$\hat{w} = (X^T X)^{-1} X^T y$$

This can be derived from vector calculus, and it also has an elegant interpretation as the orthogonal projection of $y$ onto the column space of $X$.
(b) Plug in the SVD $X = U\Sigma V^T$ and simplify. Be careful with dimensions!
Solution:
$$(X^T X)^{-1} X^T = (V\Sigma^T U^T U\Sigma V^T)^{-1} V\Sigma^T U^T$$
Since $U$ has orthonormal columns, $U^T U = I$. Notice $\Sigma^T \Sigma$ is a square, $n \times n$ diagonal matrix with the squared singular values $\sigma_i^2$ along the diagonal.
$$(X^T X)^{-1} X^T = (V\Sigma^T \Sigma V^T)^{-1} V\Sigma^T U^T$$
Apply the fact that $(AB)^{-1} = B^{-1}A^{-1}$, and that $V^{-1} = V^T$ since the matrix is orthonormal:
$$(X^T X)^{-1} X^T = V(\Sigma^T \Sigma)^{-1} V^T V\Sigma^T U^T$$
Simplify since $V^T V = I$:
$$(X^T X)^{-1} X^T = V(\Sigma^T \Sigma)^{-1} \Sigma^T U^T$$
Notice that $(\Sigma^T \Sigma)^{-1}\Sigma^T$ is an $n \times m$ matrix with the reciprocals of the singular values, $\frac{1}{\sigma_i}$, on the "diagonal". We can call this matrix $\Sigma^\dagger$. Note that this isn't a true matrix inverse (since the matrix $\Sigma$ is not square). So we can write our answer as
$$(X^T X)^{-1} X^T = V\Sigma^\dagger U^T$$
You should draw out the matrix shapes and convince yourself that all the matrix multiplications make sense.
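
A short sketch with an arbitrary tall toy matrix, confirming that the left-inverse equals $V\Sigma^\dagger U^T$, which is exactly what np.linalg.pinv computes:

```python
import numpy as np

# Sketch: for a tall, full-column-rank X, (X^T X)^{-1} X^T = V Sigma^dagger U^T.
rng = np.random.default_rng(2)
m, n = 8, 3                                  # tall: more rows than columns
X = rng.standard_normal((m, n))

left_inverse = np.linalg.inv(X.T @ X) @ X.T
U, s, Vt = np.linalg.svd(X, full_matrices=True)
Sigma_dag = np.zeros((n, m))                 # n x m, reciprocals on the "diagonal"
Sigma_dag[:n, :n] = np.diag(1.0 / s)

print(np.allclose(left_inverse, Vt.T @ Sigma_dag @ U.T))   # True
print(np.allclose(left_inverse, np.linalg.pinv(X)))        # True
```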
(c) You'll notice that the least-squares solution is in the form $w^* = Ay$. What happens if we left-multiply $X$ by our matrix $A$? This is why the matrix $A$ of the least-squares solution is called the left-inverse.
Solution: $(X^T X)^{-1} X^T X = I$. We can also see this from our SVD interpretation:


$$V\Sigma^\dagger U^T U\Sigma V^T = V\Sigma^\dagger \Sigma V^T = V V^T = I$$
Students should make sure to understand why $\Sigma^\dagger \Sigma = I$ (what are the dimensions, and what are the entries?).
This is why the least-squares solution is called the left-inverse.
(d) Now, let's consider the case where $m < n$, i.e. the data matrix $X$ has more columns than rows and the system is underdetermined. There exist infinitely many solutions for $w$, but we seek the minimum-norm solution, i.e. we want to solve $\min \|w\|_2$ s.t. $Xw = y$. What is the minimum-norm solution?
Solution: The min-norm problem is solved by
$$w = X^T (XX^T)^{-1} y$$
We can see this by choosing the $w$ that has zero component in the nullspace of $X$, so that $w$ lies in the range of $X^T$. Alternatively, one can write the Lagrangian, take the dual, apply the KKT conditions, and solve to get the same answer.
(e) Plug in the SVD $X = U\Sigma V^T$ and simplify. Be careful with dimensions!
Solution:
\begin{align*}
X^T (XX^T)^{-1} &= (U\Sigma V^T)^T (U\Sigma V^T (U\Sigma V^T)^T)^{-1} \\
&= V\Sigma^T U^T (U\Sigma V^T V\Sigma^T U^T)^{-1} \\
&= V\Sigma^T U^T U(\Sigma\Sigma^T)^{-1} U^T \\
&= V\Sigma^T (\Sigma\Sigma^T)^{-1} U^T
\end{align*}
Here, $\Sigma^T (\Sigma\Sigma^T)^{-1}$ is an $n \times m$ matrix with the reciprocals of the singular values, $\frac{1}{\sigma_i}$, on the "diagonal". We can call this matrix $\Sigma^\dagger$, so that we have
$$X^T (XX^T)^{-1} = V\Sigma^\dagger U^T$$
(f) You'll notice that the min-norm solution is in the form $w^* = By$. What happens if we right-multiply $X$ by our matrix $B$? This is why the matrix $B$ of the min-norm solution is called the right-inverse.
Solution: Similar to the previous part, $XX^T (XX^T)^{-1} = I$. This can also be seen from the SVD perspective.
This is why the min-norm solution is called the right-inverse.
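
Analogously, a sketch with an arbitrary wide toy matrix, confirming that the min-norm solution matrix equals $V\Sigma^\dagger U^T$ and acts as a right-inverse:

```python
import numpy as np

# Sketch: for a wide, full-row-rank X, X^T (X X^T)^{-1} = V Sigma^dagger U^T,
# and it is a right-inverse of X.
rng = np.random.default_rng(3)
m, n = 3, 8                                  # wide: more columns than rows
X = rng.standard_normal((m, n))
y = rng.standard_normal(m)

right_inverse = X.T @ np.linalg.inv(X @ X.T)
U, s, Vt = np.linalg.svd(X, full_matrices=True)
Sigma_dag = np.zeros((n, m))
Sigma_dag[:m, :m] = np.diag(1.0 / s)

print(np.allclose(right_inverse, Vt.T @ Sigma_dag @ U.T))  # True
print(np.allclose(X @ right_inverse, np.eye(m)))           # X B = I (right-inverse)
print(np.allclose(X @ (right_inverse @ y), y))             # the min-norm w satisfies Xw = y
```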

3. The 5 Interpretations of Ridge Regression


(a) Perspective 1: Optimization Problem. Ridge regression can be understood as the unconstrained optimization problem
$$\underset{w}{\operatorname{argmin}}\; \|y - Xw\|_2^2 + \lambda\|w\|_2^2, \tag{1}$$
where $X \in \mathbb{R}^{n \times d}$ is a data matrix, and $y \in \mathbb{R}^n$ is the target vector of measurement values. What's new compared to the simple OLS problem is the addition of the $\lambda\|w\|_2^2$ term, which can be interpreted as a "penalty" on the weights being too big.


Use vector calculus to expand the objective and solve this optimization problem for $w$.
Solution: Call our objective $f$. Expand:
$$f(w) = w^T X^T X w - 2 w^T X^T y + y^T y + \lambda w^T w$$
Take the gradient with respect to $w$ and set it to zero:
$$\nabla_w f = 2 X^T X w - 2 X^T y + 2\lambda w = 0$$
Solve for $w$:
$$(X^T X + \lambda I) w = X^T y$$
$$w = (X^T X + \lambda I)^{-1} X^T y$$
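
As a sanity check (toy data, arbitrary $\lambda$), the closed-form solution can be verified numerically: random perturbations of it never decrease the ridge objective.

```python
import numpy as np

# Sketch: the closed-form w minimizes ||y - Xw||^2 + lam * ||w||^2 (toy data).
rng = np.random.default_rng(4)
X = rng.standard_normal((30, 5))
y = rng.standard_normal(30)
lam = 0.1

w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)

def objective(w):
    return np.sum((y - X @ w) ** 2) + lam * np.sum(w ** 2)

# Perturbing the minimizer should never decrease the objective.
print(all(objective(w_ridge) <= objective(w_ridge + 0.01 * rng.standard_normal(5))
          for _ in range(100)))   # True
```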

(b) Perspective 2: "Hack" of Shifting the Singular Values. In the previous part, you should have found that the optimal $w$ is given by
$$w = (X^T X + \lambda I)^{-1} X^T y$$
(If you didn't get this, you should check your work for the previous part.)
Let $X = U\Sigma V^T$ be the (full) SVD of $X$. Recall that $U$ and $V$ are square orthonormal (norm-preserving) matrices, and $\Sigma$ is an $n \times d$ matrix with singular values $\sigma_i$ along the "diagonal". Plug this into the ridge regression solution and simplify. What happens to the singular values of $(X^T X + \lambda I)^{-1} X^T$ when $\sigma_i \ll \lambda$? What about when $\sigma_i \gg \lambda$?
Solution: We want to plug $X = U\Sigma V^T$ into $w = (X^T X + \lambda I)^{-1} X^T y$. $U$ and $V$ are square (although they may be of different sizes). Recall that $U$ and $V$ are orthonormal (real unitary) matrices, so $U^T U = UU^T = I$ and $VV^T = V^T V = I$.
\begin{align*}
w &= ((U\Sigma V^T)^T (U\Sigma V^T) + \lambda I)^{-1} (U\Sigma V^T)^T y \\
&= (V\Sigma^T U^T U\Sigma V^T + \lambda I)^{-1} (V\Sigma^T U^T) y \\
&= (V\Sigma^T \Sigma V^T + \lambda VIV^T)^{-1} (V\Sigma^T U^T y) \\
&= (V(\Sigma^T\Sigma + \lambda I)V^T)^{-1} (V\Sigma^T U^T y) \\
&= (V^{-T}(\Sigma^T\Sigma + \lambda I)^{-1}V^{-1}) (V\Sigma^T U^T y) \\
&= V(\Sigma^T\Sigma + \lambda I)^{-1}\Sigma^T U^T y
\end{align*}

Now, let's consider the case when $n > d$. We have
$$w = V \begin{bmatrix} \sigma_1^2 + \lambda & 0 & \cdots & 0 \\ 0 & \sigma_2^2 + \lambda & \cdots & 0 \\ & & \ddots & \\ 0 & 0 & \cdots & \sigma_d^2 + \lambda \end{bmatrix}^{-1} \begin{bmatrix} \sigma_1 & 0 & \cdots & 0 & 0 & \cdots & 0 \\ 0 & \sigma_2 & \cdots & 0 & 0 & \cdots & 0 \\ & & \ddots & & & & \\ 0 & 0 & \cdots & \sigma_d & 0 & \cdots & 0 \end{bmatrix} U^T y$$


 
$$= V \begin{bmatrix} \frac{\sigma_1}{\sigma_1^2 + \lambda} & 0 & \cdots & 0 & 0 & \cdots & 0 \\ 0 & \frac{\sigma_2}{\sigma_2^2 + \lambda} & \cdots & 0 & 0 & \cdots & 0 \\ & & \ddots & & & & \\ 0 & 0 & \cdots & \frac{\sigma_d}{\sigma_d^2 + \lambda} & 0 & \cdots & 0 \end{bmatrix} U^T y$$

You can see the "diagonal" terms in the SVD form are $\frac{\sigma_i}{\sigma_i^2 + \lambda}$. By adding $\lambda$ to the denominator, we prevent an explosion if any $\sigma_i$ is close to zero. That is, when $\sigma_i \ll \lambda$, the $\sigma_i^2$ term becomes negligible and the effective singular value of the matrix in the solution is $\frac{\sigma_i}{\lambda}$. When $\sigma_i \gg \lambda$, the $\lambda$ term is negligible and the effective singular value of the matrix in the solution is $\frac{1}{\sigma_i}$, the same as it would be without any regularization.
The case when $n = d$ is similar, but without the extra zero columns since the matrix $\Sigma$ is square. The case when $n < d$ is also similar, except we would now have additional rows of zeros below the square block. You can work out for yourself what those look like.
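
The shrinkage of the effective singular values $\frac{\sigma_i}{\sigma_i^2 + \lambda}$ is easy to see numerically; the singular values and $\lambda$ values below are arbitrary examples.

```python
import numpy as np

# Sketch: effective singular values sigma / (sigma^2 + lam) for a few lam values.
sigmas = np.array([10.0, 1.0, 0.1, 0.01])
for lam in [0.0, 0.01, 1.0]:
    print(lam, sigmas / (sigmas ** 2 + lam))
# For sigma >> lam the factor is ~1/sigma (no effect); for sigma << lam it is
# ~sigma/lam, so tiny singular values no longer blow up the solution.
```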
(c) Perspective 3: Maximum A Posteriori (MAP) Estimation. Ridge regression can be viewed as finding the MAP estimate when we apply a prior to the (now viewed as random) parameters $W$. In particular, we can think of the prior for $W$ as being $\mathcal{N}(0, I)$ and view the random $Y$ as being generated using $Y = x^T W + \sqrt{\lambda}\,N$, where the noise $N$ is distributed i.i.d. (across training samples) as $\mathcal{N}(0, 1)$. At the vector level, we have $Y = XW + \sqrt{\lambda}\,N$. Note that the $X$ matrix, whose rows are the $n$ different training points, is not random.
Show that (1) is the MAP estimate for $W$ given an observation $Y = y$.
Solution: From how we define MAP estimation,
\begin{align*}
MAP(w \mid Y = y) &= \underset{w}{\operatorname{argmax}}\; f(w \mid Y = y) \\
&= \underset{w}{\operatorname{argmax}}\; \frac{f(w, y)}{f(y)}
\end{align*}
The denominator doesn't affect the argmax since it doesn't depend on $w$, so we can omit it. Then we can use the chain rule to expand the numerator:
$$= \underset{w}{\operatorname{argmax}}\; f(w) f(y \mid w)$$
Now, split the conditional joint density into a product of conditional densities, since each element of $y$ is independent given $w$:
$$= \underset{w}{\operatorname{argmax}}\; f(w) \prod_{i=1}^{n} f(y_i \mid w)$$
We can now recall that the standard normal pdf is $f_Z(z) = \frac{e^{-z^2/2}}{\sqrt{2\pi}}$. To find $f(y_i \mid w)$, we know that $y_i = x_i^T w + \sqrt{\lambda}\,N_i$, where $N_i \sim \mathcal{N}(0, 1)$, so we'd have $\frac{y_i - x_i^T w}{\sqrt{\lambda}} \sim \mathcal{N}(0, 1)$. Plugging the densities into the previous expression (with the normalizing constants written explicitly), we have
$$MAP = \underset{w}{\operatorname{argmax}}\; \frac{e^{-\|w\|^2/2}}{(2\pi)^{d/2}} \prod_{i=1}^{n} \frac{e^{-\left(\frac{y_i - x_i^T w}{\sqrt{\lambda}}\right)^2/2}}{\sqrt{2\pi\lambda}}$$


We can ignore the multiplicative scaling constants (since they don't affect the value of the argmax). Also, since log is a monotonically increasing function, we can take the log without affecting the argmax; taking logs is useful since it turns the product into a sum. So we now have:
$$MAP = \underset{w}{\operatorname{argmax}}\; -\frac{\|w\|^2}{2} - \frac{1}{\lambda}\sum_i \frac{(y_i - x_i^T w)^2}{2}$$
We can drop the factor of $\frac{1}{2}$, collect the summation terms into a vector and take its norm-squared, and turn the argmax into an argmin by negating the objective:
$$MAP = \underset{w}{\operatorname{argmin}}\; \|w\|^2 + \frac{1}{\lambda}\|y - Xw\|^2$$
Finally, we can multiply through by $\lambda$ without changing the argmin, since it is a positive constant:
$$MAP = \underset{w}{\operatorname{argmin}}\; \lambda\|w\|^2 + \|Xw - y\|^2$$
And now this is the same form as the original ridge regression optimization problem (1).
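
As a numerical illustration (toy data; using scipy.optimize.minimize is just one convenient way to maximize the posterior, not part of the required derivation), the MAP estimate under this model coincides with the ridge solution:

```python
import numpy as np
from scipy.optimize import minimize

# Sketch: maximizing the posterior under a N(0, I) prior and N(x^T w, lam)
# likelihood reproduces ridge regression with penalty lam (toy data).
rng = np.random.default_rng(5)
X = rng.standard_normal((40, 4))
y = rng.standard_normal(40)
lam = 0.5

def neg_log_posterior(w):
    # constants dropped; this is ||w||^2/2 + ||y - Xw||^2 / (2*lam)
    return 0.5 * np.sum(w ** 2) + np.sum((y - X @ w) ** 2) / (2 * lam)

w_map = minimize(neg_log_posterior, np.zeros(4)).x
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)
print(np.allclose(w_map, w_ridge, atol=1e-4))   # True
```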
(d) Perspective 4: Fake Data. Another way to interpret ridge regression is as ordinary least squares on an augmented data set, i.e. adding a bunch of fake data points to our data. Consider the following augmented measurement vector $\hat{y}$ and data matrix $\hat{X}$:
$$\hat{y} = \begin{bmatrix} y \\ 0_d \end{bmatrix}, \qquad \hat{X} = \begin{bmatrix} X \\ \sqrt{\lambda} I_d \end{bmatrix},$$
where $0_d$ is the zero vector in $\mathbb{R}^d$ and $I_d \in \mathbb{R}^{d \times d}$ is the identity matrix. Show that the classical OLS optimization problem $\operatorname{argmin}_w \|\hat{y} - \hat{X}w\|_2^2$ has the same minimizer as (1).
Solution: There are two easy ways of seeing the answer. The first is to look at the optimization problem itself and expand out the terms.
Recall that $\|\hat{y} - \hat{X}w\|_2^2 = \sum_{i=1}^{n+d} (\hat{y}_i - \hat{x}_i w)^2$, where the $\hat{x}_i$ are the rows of $\hat{X}$: the squared norm of the error is the sum of squared errors in the individual coordinates. Our augmentation adds $d$ more terms to that sum, which exactly give the ridge regularization. To see that, we can write
\begin{align*}
\sum_{i=1}^{n} (\hat{y}_i - \hat{x}_i w)^2 &= \sum_{i=1}^{n} (y_i - x_i w)^2 = \|y - Xw\|_2^2 \\
\sum_{i=n+1}^{n+d} (\hat{y}_i - \hat{x}_i w)^2 &= \sum_{i=n+1}^{n+d} (\sqrt{\lambda}\, w_{i-n})^2 = \lambda\|w\|_2^2 \\
\sum_{i=1}^{n+d} (\hat{y}_i - \hat{x}_i w)^2 &= \|y - Xw\|_2^2 + \lambda\|w\|_2^2.
\end{align*}

Alternatively, we can look at the solution and simplify it. We know that the solution to ordinary least squares for the augmented data is just
\begin{align*}
(\hat{X}^\top \hat{X})^{-1}\hat{X}^\top\hat{y} &= \left(\begin{bmatrix} X \\ \sqrt{\lambda} I_d \end{bmatrix}^\top \begin{bmatrix} X \\ \sqrt{\lambda} I_d \end{bmatrix}\right)^{-1} \begin{bmatrix} X \\ \sqrt{\lambda} I_d \end{bmatrix}^\top \begin{bmatrix} y \\ 0_d \end{bmatrix} \\
&= \left(\begin{bmatrix} X^\top & \sqrt{\lambda} I_d \end{bmatrix} \begin{bmatrix} X \\ \sqrt{\lambda} I_d \end{bmatrix}\right)^{-1} \begin{bmatrix} X^\top & \sqrt{\lambda} I_d \end{bmatrix} \begin{bmatrix} y \\ 0_d \end{bmatrix} \\
&= (X^\top X + \lambda I_d)^{-1} X^\top y
\end{align*}
Notice that this is the same as the solution for ridge regression. Either way, we get the desired result.
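
A quick sketch (toy data, arbitrary $\lambda$) of the fake-data view: running plain least squares on the augmented system reproduces the ridge solution.

```python
import numpy as np

# Sketch: OLS on the augmented system [X; sqrt(lam) I], [y; 0] equals ridge.
rng = np.random.default_rng(6)
n, d, lam = 25, 4, 0.3
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

X_aug = np.vstack([X, np.sqrt(lam) * np.eye(d)])    # d fake data points
y_aug = np.concatenate([y, np.zeros(d)])            # with target 0

w_aug, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print(np.allclose(w_aug, w_ridge))   # True
```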
(e) Perspective 5: Fake Features. For this last interpretation, let's instead construct an augmented design matrix in the following way:
$$\check{X} = \begin{bmatrix} X & \sqrt{\lambda} I_n \end{bmatrix}$$
i.e. we stack $X$ with $\sqrt{\lambda} I_n$ horizontally. Now our problem is underdetermined: the new dimension $d + n$ is larger than the number of points $n$. Therefore, there are infinitely many values $\eta \in \mathbb{R}^{d+n}$ for which $\check{X}\eta = y$. We are interested in the min-norm solution, i.e. the solution to
$$\underset{\eta}{\operatorname{argmin}}\; \|\eta\|_2^2 \quad \text{s.t.} \quad \check{X}\eta = y. \tag{2}$$
Show that this is yet another form of ridge regression and that the first $d$ coordinates of $\eta^*$ form the minimizer of (1).
Solution: Let's look inside the $(d+n)$-dimensional vector $\eta$ by writing it as $\eta = \begin{bmatrix} w \\ \xi \end{bmatrix}$. Here, $w$ is $d$-dimensional and $\xi$ is $n$-dimensional. Then (2) expands to
$$\underset{w,\xi}{\operatorname{argmin}}\; \|w\|_2^2 + \|\xi\|_2^2 \quad \text{s.t.} \quad Xw + \sqrt{\lambda}\,\xi = y.$$
The constraint just says that $\sqrt{\lambda}\,\xi = y - Xw$. In other words, $\sqrt{\lambda}\,\xi$ is the classic residual. This yields $\xi = \frac{1}{\sqrt{\lambda}}(y - Xw)$, and plugging that into the first part we get
$$\underset{w}{\operatorname{argmin}}\; \|w\|_2^2 + \frac{1}{\lambda}\|y - Xw\|_2^2.$$
When considering whether the optimization problem is equivalent, we need to think about the minimizer and not the minimum itself. For this, we simply notice that scaling the objective by a constant factor doesn't change the minimizers, so
$$\underset{w}{\operatorname{argmin}}\; \|w\|_2^2 + \frac{1}{\lambda}\|y - Xw\|_2^2 = \underset{w}{\operatorname{argmin}}\; \lambda\|w\|_2^2 + \|y - Xw\|_2^2,$$
which is equivalent to (1).
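
A sketch of the fake-features view with toy data: taking the min-norm solution of $\check{X}\eta = y$ via the pseudo-inverse and reading off its first $d$ coordinates recovers the ridge weights.

```python
import numpy as np

# Sketch: the first d coordinates of the min-norm solution of
# [X  sqrt(lam) I_n] eta = y are the ridge weights (toy data).
rng = np.random.default_rng(7)
n, d, lam = 6, 10, 0.4
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

X_check = np.hstack([X, np.sqrt(lam) * np.eye(n)])  # n x (d + n), full row rank
eta = np.linalg.pinv(X_check) @ y                   # min-norm solution
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print(np.allclose(eta[:d], w_ridge))   # True
```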


(f) We know that the Moore-Penrose pseudo-inverse for an underdetermined system (wide matrix) is given by $A^\dagger = A^T (AA^T)^{-1}$, which corresponds to the min-norm solution for $A\eta = z$. That is, the optimization problem
$$\underset{\eta}{\operatorname{argmin}}\; \|\eta\|_2 \quad \text{s.t.} \quad A\eta = z$$
is solved by $\eta = A^\dagger z$. Let $\hat{w}$ be the minimizer of (1).
Use the pseudo-inverse to show that solving the optimization problem in (2) yields
$$\hat{w} = X^T (XX^T + \lambda I)^{-1} y$$

Then, show that this is equivalent to the standard formula for ridge regression:
$$\hat{w} = (X^T X + \lambda I)^{-1} X^T y$$
Hint: It may be helpful to review the kernel ridge form.


Solution: First, we simply need to plug our matrix into the pseudo-inverse formula provided and simplify:
$$\begin{bmatrix} \hat{w} \\ \xi \end{bmatrix} = \check{X}^T(\check{X}\check{X}^T)^{-1} y = \begin{bmatrix} X^T \\ \sqrt{\lambda} I_n \end{bmatrix} \left(\begin{bmatrix} X & \sqrt{\lambda} I_n \end{bmatrix} \begin{bmatrix} X^T \\ \sqrt{\lambda} I_n \end{bmatrix}\right)^{-1} y = \begin{bmatrix} X^T \\ \sqrt{\lambda} I_n \end{bmatrix} (XX^T + \lambda I)^{-1} y$$
Looking at just the top $d$ entries, we see $\hat{w} = X^T(XX^T + \lambda I)^{-1} y$, as desired.
Now, let's show the two forms of ridge regression are equivalent. That is, let's show
$$(X^T X + \lambda I)^{-1} X^T = X^T (XX^T + \lambda I)^{-1}$$
We can left-multiply both sides by $(X^T X + \lambda I)$ and right-multiply both sides by $(XX^T + \lambda I)$ to get
$$X^T (XX^T + \lambda I) = (X^T X + \lambda I) X^T$$
Distributing the matrix multiplication, we have
$$X^T XX^T + \lambda X^T = X^T XX^T + \lambda X^T,$$
which is always true, as desired.
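
The two forms are also easy to check numerically on toy data; the second ("kernel") form only requires inverting the $n \times n$ matrix $XX^T + \lambda I$, which is cheaper when $d \gg n$.

```python
import numpy as np

# Sketch: the primal and kernel forms of the ridge solution agree (toy data).
rng = np.random.default_rng(8)
n, d, lam = 5, 12, 0.2                 # wide case: d > n
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

w_primal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)   # d x d solve
w_kernel = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), y)   # n x n solve
print(np.allclose(w_primal, w_kernel))   # True
```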


(g) We know that the solution to ridge regression (1) is given by $\hat{w}_r = (X^\top X + \lambda I)^{-1} X^\top y$. What happens when $\lambda \to \infty$? It is for this reason that ridge regularization is sometimes referred to as "shrinkage."
Solution: As $\lambda \to \infty$, the matrix $(X^\top X + \lambda I)^{-1}$ converges to the zero matrix, and so we have $w = 0$.
(h) What happens to the solution of ridge regression when you take the limit $\lambda \to 0$? Consider both the case when $X$ is wide (underdetermined system) and the case when $X$ is tall (overdetermined system).
Solution: When $X$ is wide (underdetermined), we converge to the min-norm solution:
$$w^* = X^T (XX^T)^{-1} y$$
When $X$ is tall (overdetermined), we converge to the OLS solution:
$$w^* = (X^T X)^{-1} X^T y$$
Both of these can be seen by using the relevant form of ridge regression and plugging in $\lambda = 0$ directly, or by using the SVD perspective.
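
Both limits can be illustrated with an arbitrary toy tall matrix:

```python
import numpy as np

# Sketch: large lam shrinks w toward 0; lam -> 0 recovers OLS for a tall X.
rng = np.random.default_rng(9)
X = rng.standard_normal((20, 3))
y = rng.standard_normal(20)

def ridge(lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

print(np.linalg.norm(ridge(1e6)))                                        # nearly 0
print(np.allclose(ridge(1e-10), np.linalg.lstsq(X, y, rcond=None)[0]))   # True
```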


4. General Case Tikhonov Regularization

Consider the optimization problem:
$$\min_x\; \|W_1(Ax - b)\|_2^2 + \|W_2(x - c)\|_2^2$$
where $W_1$, $A$, and $W_2$ are matrices and $x$, $b$, and $c$ are vectors. $W_1$ can be viewed as a generic weighting of the residuals, and $W_2$ along with $c$ can be viewed as a generic weighting of the parameters.

(a) Solve this optimization problem manually by expanding it out as matrix-vector products, setting the gradient to 0, and solving for $x$.
Solution: Expand our objective function:
$$f(x) = (Ax - b)^T W_1^T W_1 (Ax - b) + (x - c)^T W_2^T W_2 (x - c)$$
$$f(x) = x^T A^T W_1^T W_1 A x - 2 b^T W_1^T W_1 A x + b^T W_1^T W_1 b + x^T W_2^T W_2 x - 2 c^T W_2^T W_2 x + c^T W_2^T W_2 c$$
Now take the gradient and set it to 0:
$$\nabla f = 2 A^T W_1^T W_1 A x - 2 A^T W_1^T W_1 b + 2 W_2^T W_2 x - 2 W_2^T W_2 c = 0$$
Isolating the $x$ terms on one side, we have:
$$(A^T W_1^T W_1 A + W_2^T W_2) x = A^T W_1^T W_1 b + W_2^T W_2 c$$
So we can solve to get
$$x = (A^T W_1^T W_1 A + W_2^T W_2)^{-1} (A^T W_1^T W_1 b + W_2^T W_2 c)$$

(b) Construct an appropriate matrix $C$ and vector $d$ that allow you to rewrite this problem as
$$\min_x\; \|Cx - d\|^2$$
and use the OLS solution ($x^* = (C^T C)^{-1} C^T d$) to solve it. Confirm your answer is in agreement with the previous part.
Solution: We can rewrite our problem in least-squares form using
$$C = \begin{bmatrix} W_1 A \\ W_2 \end{bmatrix}, \qquad d = \begin{bmatrix} W_1 b \\ W_2 c \end{bmatrix}$$
Now, using the least squares solution and simplifying, we get
\begin{align*}
x^* = (C^T C)^{-1} C^T d &= \left(\begin{bmatrix} W_1 A \\ W_2 \end{bmatrix}^\top \begin{bmatrix} W_1 A \\ W_2 \end{bmatrix}\right)^{-1} \begin{bmatrix} W_1 A \\ W_2 \end{bmatrix}^\top \begin{bmatrix} W_1 b \\ W_2 c \end{bmatrix} \\
&= \left(\begin{bmatrix} A^T W_1^T & W_2^T \end{bmatrix} \begin{bmatrix} W_1 A \\ W_2 \end{bmatrix}\right)^{-1} \begin{bmatrix} A^T W_1^T & W_2^T \end{bmatrix} \begin{bmatrix} W_1 b \\ W_2 c \end{bmatrix}
\end{align*}

$$x^* = (A^T W_1^T W_1 A + W_2^T W_2)^{-1} (A^T W_1^T W_1 b + W_2^T W_2 c)$$
which is the same as the previous part, as desired.
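
A quick numerical check with arbitrary random matrices that the stacked least-squares form and the closed-form expression agree:

```python
import numpy as np

# Sketch: general Tikhonov closed form vs. the stacked least-squares form.
rng = np.random.default_rng(10)
A = rng.standard_normal((20, 4))
b = rng.standard_normal(20)
c = rng.standard_normal(4)
W1 = np.diag(rng.uniform(0.5, 2.0, 20))     # arbitrary residual weighting
W2 = np.diag(rng.uniform(0.5, 2.0, 4))      # arbitrary parameter weighting

x_closed = np.linalg.solve(A.T @ W1.T @ W1 @ A + W2.T @ W2,
                           A.T @ W1.T @ W1 @ b + W2.T @ W2 @ c)
C = np.vstack([W1 @ A, W2])
d = np.concatenate([W1 @ b, W2 @ c])
x_stacked, *_ = np.linalg.lstsq(C, d, rcond=None)
print(np.allclose(x_closed, x_stacked))   # True
```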
(c) Choose $W_1$, $W_2$, and $c$ such that this reduces to the simple case of ridge regression that you've seen in the previous problem, $x^* = (A^T A + \lambda I)^{-1} A^T b$.
Solution: This reduces to ridge regression when $W_1 = I$, $W_2 = \sqrt{\lambda} I$, and $c = 0$. You can see this in both the optimization problem and the result.

5. Coding Fully Connected Networks


In this coding assignment, you will be building a fully-connected neural network from scratch using NumPy.
You will have the choice between two options:

• Use Google Colab (Recommended). Open this url and follow the instructions in the notebook.
• Use a local Conda environment. Clone https://github.com/gonglinyuan/cs182hw1 and refer to README.md for further instructions.

For this question, please submit a .zip file of your completed work to the Gradescope assignment titled "Homework 1 (Code)". Please answer the following question in your submission of the written assignment:

(a) Did you notice anything about the comparative difficulty of training the three-layer net vs. training the five-layer net?

Solution: Training a five-layer neural network is more difficult and more sensitive to hyperparameters
(initialization scale and learning rate).

6. Visualizing features from local linearization of neural nets


This problem expects you to modify the Jupyter Notebook you were given in the first discussion section for
the course to allow the visualization of the effective “features” that correspond to the local linearization of
the network in the neighborhood of the parameters.
We provide you with some starter code on Google Colab. For this question, please do not submit your
code to Gradescope. Instead, just include your plots and comments regarding the questions in the subparts.
(a) Visualize the features corresponding to $\frac{\partial}{\partial w_i^{(1)}} y(x)$ and $\frac{\partial}{\partial b_i^{(1)}} y(x)$, where the $w_i^{(1)}$ are the first hidden layer's weights and the $b_i^{(1)}$ are the first hidden layer's biases. These derivatives should be evaluated at least at both the random initialization and the final trained network. When visualizing these features, plot them as a function of the scalar input $x$, the same way that the notebook plots the constituent "elbow" features that are the outputs of the penultimate layer.
Solution: See notebook.
(b) During training, we can imagine that we have a generalized linear model with a feature matrix made up of the linearized features corresponding to each learnable parameter. We know from our analysis of gradient descent that the singular values and singular vectors of this feature matrix are important.
Use the SVD of this feature matrix to plot the singular values and visualize the "principal features" that correspond to the d-dimensional singular vectors multiplied by all the features corresponding to the parameters.


(HINT: Remember that the feature matrix whose SVD you are taking has n rows, where each row corresponds to one training point, and d columns, where each column corresponds to one of the learnable features. Meanwhile, you are going to be plotting/visualizing the "principal features" as functions of x, even at places where you don't have training points.)
Solution: See notebook.
(c) Augment the Jupyter notebook to add a second hidden layer of the same size as the first hidden layer, fully connected to the first hidden layer. Allow the visualization of the features corresponding to the parameters in both hidden layers, as well as the "principal features" and the singular values.
Solution: See notebook.

7. Homework Process and Study Group


Citing sources and collaborators are an important part of life, including being a student!
We also want to understand what resources you find helpful and how much time homework is taking, so we
can change things in the future if possible.

(a) What sources (if any) did you use as you worked through the homework?
(b) If you worked with someone on this homework, who did you work with?
List names and student IDs. (In case of homework party, you can also just describe the group.)
(c) Roughly how many total hours did you work on this homework? Write it down here where you’ll
need to remember it for the self-grade form.

Contributors:

• Brandon Trabucco.

• Saagar Sanghavi.

• Alexander Tsigler.

• Anant Sahai.

• Jane Yu.

• Philipp Moritz.

• Soroush Nasiriany.

• Linyuan Gong.

• Sheng Shen.

Homework 1, © UCB EECS 182, Spring 2023. All Rights Reserved. This may not be publicly shared without explicit permission.
