
EECS 182 Deep Neural Networks


Spring 2023 Anant Sahai Homework 1
This homework is due on Friday, February 3, 2023, at 10:59PM.

1. Bias-Variance Tradeoff Review


(a) Show that we can decompose the expected mean squared error into three parts: bias, variance, and irreducible error $\sigma^2$:

$$\text{Error} = \text{Bias}^2 + \text{Variance} + \sigma^2$$

Formally, suppose we have a randomly sampled training set $D$ (drawn independently of our test data), and we select an estimator denoted $\theta = \hat{\theta}(D)$ (for example, via empirical risk minimization). The expected mean squared error for a test input $x$ can be decomposed as below:

$$\mathbb{E}_{Y \sim p(y|x),\, D}\big[(Y - f_{\hat{\theta}(D)}(x))^2\big] = \text{Bias}(f_{\hat{\theta}(D)}(x))^2 + \text{Var}(f_{\hat{\theta}(D)}(x)) + \sigma^2$$

You may find it helpful to recall the formulaic definitions of Bias and Variance, reproduced for you below:

$$\text{Bias}(f_{\hat{\theta}(D)}(x)) = \mathbb{E}_{Y \sim p(y|x),\, D}\big[f_{\hat{\theta}(D)}(x) - Y\big]$$

$$\text{Var}(f_{\hat{\theta}(D)}(x)) = \mathbb{E}_D\Big[\big(f_{\hat{\theta}(D)}(x) - \mathbb{E}[f_{\hat{\theta}(D)}(x)]\big)^2\Big]$$

Solution: For simplicity of notation, let $\mathbb{E}[\cdot]$ denote $\mathbb{E}_{Y \sim p(y|x),\, D}[\cdot]$.

$$\mathbb{E}[(Y - f_{\hat{\theta}(D)}(x))^2] = \mathbb{E}[f_{\hat{\theta}(D)}(x)^2 - 2Y f_{\hat{\theta}(D)}(x) + Y^2]$$

By independence of $Y$ and $D$ and linearity of expectation,

$$\mathbb{E}[(Y - f_{\hat{\theta}(D)}(x))^2] = \mathbb{E}[f_{\hat{\theta}(D)}(x)^2] - 2\,\mathbb{E}[Y]\,\mathbb{E}[f_{\hat{\theta}(D)}(x)] + \mathbb{E}[Y^2]$$

Noting the definition of variance,

\begin{align*}
\mathbb{E}[(Y - f_{\hat{\theta}(D)}(x))^2] &= \text{Var}(f_{\hat{\theta}(D)}(x)) + \mathbb{E}[f_{\hat{\theta}(D)}(x)]^2 - 2\,\mathbb{E}[Y]\,\mathbb{E}[f_{\hat{\theta}(D)}(x)] + \mathbb{E}[Y^2] \\
&= \text{Var}(f_{\hat{\theta}(D)}(x)) + \big(\mathbb{E}[f_{\hat{\theta}(D)}(x)] - \mathbb{E}[Y]\big)^2 + \text{Var}(Y \mid X = x) \\
&= \text{Var}(f_{\hat{\theta}(D)}(x)) + \text{Bias}(f_{\hat{\theta}(D)}(x))^2 + \text{Var}(Y \mid X = x)
\end{align*}

The conditional variance $\text{Var}(Y \mid x)$, which we will denote $\sigma^2$, captures the irreducible error that will be incurred no matter what learner $\hat{\theta}$ we use.
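
As a quick numerical sanity check (not part of the assigned solution), the decomposition can be verified by Monte Carlo with a toy one-dimensional linear model; all constants below (true slope, noise level, sample size, test point) are arbitrary choices for illustration.

```python
import numpy as np

# Monte Carlo sketch of E[(Y - f(x))^2] = Bias^2 + Var + sigma^2 for a toy
# 1-D OLS estimator. All constants here are arbitrary illustrative choices.
rng = np.random.default_rng(0)
theta_true, sigma, n, trials = 2.0, 0.5, 20, 20000
x_test = 1.5

preds = np.empty(trials)
for t in range(trials):
    X = rng.uniform(-1.0, 1.0, size=n)                  # fresh training set D
    y = theta_true * X + sigma * rng.standard_normal(n)
    theta_hat = (X @ y) / (X @ X)                       # 1-D ordinary least squares
    preds[t] = theta_hat * x_test                       # prediction at the test input

y_test = theta_true * x_test + sigma * rng.standard_normal(trials)
mse = np.mean((y_test - preds) ** 2)
bias_sq = (preds.mean() - theta_true * x_test) ** 2
var = preds.var()
print(mse, bias_sq + var + sigma ** 2)  # the two numbers should be close
```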


(b) Suppose our training dataset consists of $D = \{(x_i, y_i)\}_{i=1}^n$, where the only randomness comes from the label vector $Y = X\theta^* + \varepsilon$, where $\theta^*$ is the true underlying linear model and each noise variable $\varepsilon_i$ is i.i.d. with zero mean and variance 1. We use ordinary least squares to estimate $\hat{\theta}$ from this data. Calculate the bias and covariance of the $\hat{\theta}$ estimate and use that to compute the bias and variance of the prediction at a particular test input $x$. Recall that the OLS solution is given by

$$\hat{\theta} = (X^\top X)^{-1} X^\top Y,$$

where $X \in \mathbb{R}^{n \times d}$ is our (nonrandom) data matrix and $Y \in \mathbb{R}^n$ is the (random) vector of training targets. For simplicity, assume that $X^\top X$ is diagonal.
Solution: We first compute the bias of $\hat{\theta}$. Recalling that we have $Y = X\theta + \varepsilon$ for a noise vector $\varepsilon$, we then have

\begin{align*}
\mathbb{E}[\hat{\theta}] &= \mathbb{E}[(X^\top X)^{-1} X^\top (X\theta + \varepsilon)] \\
&= \mathbb{E}[\theta + (X^\top X)^{-1} X^\top \varepsilon] \\
&= \theta + (X^\top X)^{-1} X^\top \mathbb{E}[\varepsilon] \\
&= \theta && \text{since $\varepsilon$ has zero mean.}
\end{align*}

So our bias is
$$\text{bias} = \mathbb{E}[\hat{\theta} - \theta] = \mathbb{E}[\hat{\theta}] - \theta = \theta - \theta = 0.$$

We thus see that the OLS estimator $\hat{\theta}$ is an unbiased estimator of the true parameter $\theta$. Considering the bias of our estimate at a particular test input $x$, we see that our prediction is also unbiased:

$$\mathbb{E}[x^\top \hat{\theta} - x^\top \theta - \epsilon] = 0.$$

Next, we compute the variance of $\hat{\theta}$, proceeding by first computing the covariance of $\hat{\theta}$:

\begin{align*}
\mathbb{E}[(\hat{\theta} - \theta)(\hat{\theta} - \theta)^\top] &= \mathbb{E}[(X^\top X)^{-1} X^\top \varepsilon\varepsilon^\top ((X^\top X)^{-1} X^\top)^\top] \\
&= (X^\top X)^{-1} X^\top \mathbb{E}[\varepsilon\varepsilon^\top]\, X ((X^\top X)^{-1})^\top \\
&= (X^\top X)^{-1} X^\top I_n ((X^\top X)^{-1} X^\top)^\top && \text{noise variables are i.i.d.} \\
&= (X^\top X)^{-1} X^\top X ((X^\top X)^{-1})^\top \\
&= (X^\top X)^{-1}.
\end{align*}

Now for a particular test input $x$, we can compute the variance

\begin{align*}
\text{Var}[x^\top \hat{\theta}] &= \text{Var}[x^\top (\hat{\theta} - \theta)] \\
&= \mathbb{E}[x^\top (\hat{\theta} - \theta)(\hat{\theta} - \theta)^\top x] \\
&= x^\top (X^\top X)^{-1} x.
\end{align*}

For simplicity, suppose $X^\top X$ were a diagonal matrix (we could have applied an orthogonal transformation to achieve this) with sorted entries $\sigma_1^2 \geq \sigma_2^2 \geq \dots \geq \sigma_d^2$ (corresponding to the data variances in each dimension). Now we can easily compute the variance as $\sum_{i=1}^d x_i^2 / \sigma_i^2$, and we see that in directions where $\sigma_i$ is close to 0 (which means there is very little variance in the data in this dimension), the variance of our estimate can explode (and thus our risk as well).
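
A small NumPy sketch (arbitrary toy design matrix, unit-variance Gaussian noise as assumed above) can confirm numerically that the empirical covariance of the OLS estimates matches $(X^\top X)^{-1}$.

```python
import numpy as np

# Sketch: the empirical covariance of OLS estimates approaches (X^T X)^{-1}
# when Var(eps_i) = 1. Toy sizes chosen arbitrarily.
rng = np.random.default_rng(1)
n, d, trials = 50, 3, 20000
X = rng.standard_normal((n, d))        # fixed (nonrandom) design matrix
theta = rng.standard_normal(d)

estimates = np.empty((trials, d))
for t in range(trials):
    y = X @ theta + rng.standard_normal(n)              # eps ~ N(0, I)
    estimates[t] = np.linalg.solve(X.T @ X, X.T @ y)    # OLS estimate

print(np.cov(estimates.T))             # empirical covariance of theta_hat
print(np.linalg.inv(X.T @ X))          # should closely match (up to Monte Carlo error)
```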


2. Least Squares and the Min-norm problem from the Perspective of SVD
Consider the equation $Xw = y$, where $X \in \mathbb{R}^{m \times n}$ is a non-square data matrix, $w$ is a weight vector, and $y$ is a vector of labels corresponding to the datapoints in each row of $X$.
Let's say that $X = U\Sigma V^T$ is the (full) SVD of $X$. Here, $U$ and $V$ are orthonormal square matrices, and $\Sigma$ is an $m \times n$ matrix with non-zero singular values ($\sigma_i$) on the "diagonal".
For this problem, we define $\Sigma^\dagger$ as the $n \times m$ matrix with the reciprocals of the singular values ($\frac{1}{\sigma_i}$) along the "diagonal".

(a) First, consider the case where $m > n$, i.e. our data matrix $X$ has more rows than columns (tall matrix) and the system is overdetermined. How do we find the weights $w$ that minimize the error between $Xw$ and $y$? In other words, we want to solve $\min_w \|Xw - y\|_2$.
Solution: Meta: Students may be confused about which form of the SVD to use. Make sure they know it is the full SVD: $U$ and $V$ are square orthonormal matrices.
This is the classic least squares problem. The solution is given by

$$\hat{w} = (X^T X)^{-1} X^T y$$

This can be derived from vector calculus, and it also has an elegant interpretation as the orthogonal projection of $y$ onto the column space of $X$.
(b) Plug in the SVD $X = U\Sigma V^T$ and simplify. Be careful with dimensions!
Solution:
$$(X^T X)^{-1} X^T = (V\Sigma^T U^T U\Sigma V^T)^{-1} V\Sigma^T U^T$$
Since $U$ has orthonormal columns, $U^T U = I$. Notice $\Sigma^T \Sigma$ is a square, $n \times n$ diagonal matrix with the squared singular values $\sigma_i^2$ along the diagonal.
$$(X^T X)^{-1} X^T = (V\Sigma^T \Sigma V^T)^{-1} V\Sigma^T U^T$$
Apply the fact that $(AB)^{-1} = B^{-1}A^{-1}$, and that $V^{-1} = V^T$ since the matrix is orthonormal:
$$(X^T X)^{-1} X^T = V(\Sigma^T \Sigma)^{-1} V^T V\Sigma^T U^T$$
Simplify since $V^T V = I$:
$$(X^T X)^{-1} X^T = V(\Sigma^T \Sigma)^{-1} \Sigma^T U^T$$
Notice that $(\Sigma^T \Sigma)^{-1}\Sigma^T$ is an $n \times m$ matrix with the reciprocals of the singular values, $\frac{1}{\sigma_i}$, on the "diagonal". We can call this matrix $\Sigma^\dagger$. Note that this isn't a true matrix inverse (since the matrix $\Sigma$ is not square). So we can write our answer as
$$(X^T X)^{-1} X^T = V\Sigma^\dagger U^T$$
You should draw out the matrix shapes and convince yourself that all the matrix multiplications make sense.
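
A short sketch with an arbitrary tall toy matrix, confirming that the left-inverse equals $V\Sigma^\dagger U^T$, which is exactly what np.linalg.pinv computes:

```python
import numpy as np

# Sketch: for a tall, full-column-rank X, (X^T X)^{-1} X^T = V Sigma^dagger U^T.
rng = np.random.default_rng(2)
m, n = 8, 3                                  # tall: more rows than columns
X = rng.standard_normal((m, n))

left_inverse = np.linalg.inv(X.T @ X) @ X.T
U, s, Vt = np.linalg.svd(X, full_matrices=True)
Sigma_dag = np.zeros((n, m))                 # n x m, reciprocals on the "diagonal"
Sigma_dag[:n, :n] = np.diag(1.0 / s)

print(np.allclose(left_inverse, Vt.T @ Sigma_dag @ U.T))   # True
print(np.allclose(left_inverse, np.linalg.pinv(X)))        # True
```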
(c) You'll notice that the least-squares solution is in the form $w^* = Ay$. What happens if we left-multiply $X$ by our matrix $A$? This is why the matrix $A$ of the least-squares solution is called the left-inverse.
Solution: $(X^T X)^{-1} X^T X = I$. We can also see this from our SVD interpretation:


$$V\Sigma^\dagger U^T U\Sigma V^T = V\Sigma^\dagger \Sigma V^T = V V^T = I$$
Students should make sure to understand why $\Sigma^\dagger \Sigma = I$ (what are the dimensions, and what are the entries?).
This is why the least-squares solution is called the left-inverse.
(d) Now, let's consider the case where $m < n$, i.e. the data matrix $X$ has more columns than rows and the system is underdetermined. There exist infinitely many solutions for $w$, but we seek the minimum-norm solution, i.e. we want to solve $\min \|w\|_2$ s.t. $Xw = y$. What is the minimum-norm solution?
Solution: The min-norm problem is solved by
$$w = X^T (XX^T)^{-1} y$$
We can see this by choosing the $w$ that has zero component in the nullspace of $X$, so that $w$ lies in the range of $X^T$. Alternatively, one can write the Lagrangian, take the dual, apply the KKT conditions, and solve to get the same answer.
(e) Plug in the SVD $X = U\Sigma V^T$ and simplify. Be careful with dimensions!
Solution:
\begin{align*}
X^T (XX^T)^{-1} &= (U\Sigma V^T)^T (U\Sigma V^T (U\Sigma V^T)^T)^{-1} \\
&= V\Sigma^T U^T (U\Sigma V^T V\Sigma^T U^T)^{-1} \\
&= V\Sigma^T U^T U(\Sigma\Sigma^T)^{-1} U^T \\
&= V\Sigma^T (\Sigma\Sigma^T)^{-1} U^T
\end{align*}
Here, $\Sigma^T (\Sigma\Sigma^T)^{-1}$ is an $n \times m$ matrix with the reciprocals of the singular values, $\frac{1}{\sigma_i}$, on the "diagonal". We can call this matrix $\Sigma^\dagger$, so that we have
$$X^T (XX^T)^{-1} = V\Sigma^\dagger U^T$$
(f) You'll notice that the min-norm solution is in the form $w^* = By$. What happens if we right-multiply $X$ by our matrix $B$? This is why the matrix $B$ of the min-norm solution is called the right-inverse.
Solution: Similar to the previous part, $XX^T (XX^T)^{-1} = I$. This can also be seen from the SVD perspective.
This is why the min-norm solution is called the right-inverse.
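
Analogously, a sketch with an arbitrary wide toy matrix, confirming that the min-norm solution matrix equals $V\Sigma^\dagger U^T$ and acts as a right-inverse:

```python
import numpy as np

# Sketch: for a wide, full-row-rank X, X^T (X X^T)^{-1} = V Sigma^dagger U^T,
# and it is a right-inverse of X.
rng = np.random.default_rng(3)
m, n = 3, 8                                  # wide: more columns than rows
X = rng.standard_normal((m, n))
y = rng.standard_normal(m)

right_inverse = X.T @ np.linalg.inv(X @ X.T)
U, s, Vt = np.linalg.svd(X, full_matrices=True)
Sigma_dag = np.zeros((n, m))
Sigma_dag[:m, :m] = np.diag(1.0 / s)

print(np.allclose(right_inverse, Vt.T @ Sigma_dag @ U.T))  # True
print(np.allclose(X @ right_inverse, np.eye(m)))           # X B = I (right-inverse)
print(np.allclose(X @ (right_inverse @ y), y))             # the min-norm w satisfies Xw = y
```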

3. The 5 Interpretations of Ridge Regression


(a) Perspective 1: Optimization Problem. Ridge regression can be understood as the unconstrained optimization problem
$$\underset{w}{\operatorname{argmin}}\; \|y - Xw\|_2^2 + \lambda\|w\|_2^2, \tag{1}$$
where $X \in \mathbb{R}^{n \times d}$ is a data matrix, and $y \in \mathbb{R}^n$ is the target vector of measurement values. What's new compared to the simple OLS problem is the addition of the $\lambda\|w\|_2^2$ term, which can be interpreted as a "penalty" on the weights being too big.


Use vector calculus to expand the objective and solve this optimization problem for $w$.
Solution: Call our objective $f$. Expand:
$$f(w) = w^T X^T X w - 2 w^T X^T y + y^T y + \lambda w^T w$$
Take the gradient with respect to $w$ and set it to zero:
$$\nabla_w f = 2 X^T X w - 2 X^T y + 2\lambda w = 0$$
Solve for $w$:
$$(X^T X + \lambda I) w = X^T y$$
$$w = (X^T X + \lambda I)^{-1} X^T y$$
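
As a sanity check (toy data, arbitrary $\lambda$), the closed-form solution can be verified numerically: random perturbations of it never decrease the ridge objective.

```python
import numpy as np

# Sketch: the closed-form w minimizes ||y - Xw||^2 + lam * ||w||^2 (toy data).
rng = np.random.default_rng(4)
X = rng.standard_normal((30, 5))
y = rng.standard_normal(30)
lam = 0.1

w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)

def objective(w):
    return np.sum((y - X @ w) ** 2) + lam * np.sum(w ** 2)

# Perturbing the minimizer should never decrease the objective.
print(all(objective(w_ridge) <= objective(w_ridge + 0.01 * rng.standard_normal(5))
          for _ in range(100)))   # True
```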

(b) Perspective 2: "Hack" of Shifting the Singular Values. In the previous part, you should have found that the optimal $w$ is given by
$$w = (X^T X + \lambda I)^{-1} X^T y$$
(If you didn't get this, you should check your work for the previous part.)
Let $X = U\Sigma V^T$ be the (full) SVD of $X$. Recall that $U$ and $V$ are square orthonormal (norm-preserving) matrices, and $\Sigma$ is an $n \times d$ matrix with singular values $\sigma_i$ along the "diagonal". Plug this into the ridge regression solution and simplify. What happens to the singular values of $(X^T X + \lambda I)^{-1} X^T$ when $\sigma_i \ll \lambda$? What about when $\sigma_i \gg \lambda$?
Solution: We want to plug $X = U\Sigma V^T$ into $w = (X^T X + \lambda I)^{-1} X^T y$. $U$ and $V$ are square (although they may be of different sizes). Recall that $U$ and $V$ are orthonormal (real unitary) matrices, so $U^T U = UU^T = I$ and $VV^T = V^T V = I$.
\begin{align*}
w &= ((U\Sigma V^T)^T (U\Sigma V^T) + \lambda I)^{-1} (U\Sigma V^T)^T y \\
&= (V\Sigma^T U^T U\Sigma V^T + \lambda I)^{-1} (V\Sigma^T U^T) y \\
&= (V\Sigma^T \Sigma V^T + \lambda VIV^T)^{-1} (V\Sigma^T U^T y) \\
&= (V(\Sigma^T\Sigma + \lambda I)V^T)^{-1} (V\Sigma^T U^T y) \\
&= (V^{-T}(\Sigma^T\Sigma + \lambda I)^{-1}V^{-1}) (V\Sigma^T U^T y) \\
&= V(\Sigma^T\Sigma + \lambda I)^{-1}\Sigma^T U^T y
\end{align*}

Now, let's consider the case when $n > d$. We have
$$w = V \begin{bmatrix} \sigma_1^2 + \lambda & 0 & \cdots & 0 \\ 0 & \sigma_2^2 + \lambda & \cdots & 0 \\ & & \ddots & \\ 0 & 0 & \cdots & \sigma_d^2 + \lambda \end{bmatrix}^{-1} \begin{bmatrix} \sigma_1 & 0 & \cdots & 0 & 0 & \cdots & 0 \\ 0 & \sigma_2 & \cdots & 0 & 0 & \cdots & 0 \\ & & \ddots & & & & \\ 0 & 0 & \cdots & \sigma_d & 0 & \cdots & 0 \end{bmatrix} U^T y$$


 
$$= V \begin{bmatrix} \frac{\sigma_1}{\sigma_1^2 + \lambda} & 0 & \cdots & 0 & 0 & \cdots & 0 \\ 0 & \frac{\sigma_2}{\sigma_2^2 + \lambda} & \cdots & 0 & 0 & \cdots & 0 \\ & & \ddots & & & & \\ 0 & 0 & \cdots & \frac{\sigma_d}{\sigma_d^2 + \lambda} & 0 & \cdots & 0 \end{bmatrix} U^T y$$

You can see the "diagonal" terms in the SVD form are $\frac{\sigma_i}{\sigma_i^2 + \lambda}$. By adding $\lambda$ to the denominator, we prevent an explosion if any $\sigma_i$ is close to zero. That is, when $\sigma_i \ll \lambda$, the $\sigma_i^2$ term becomes negligible and the effective singular value of the matrix in the solution is $\frac{\sigma_i}{\lambda}$. When $\sigma_i \gg \lambda$, the $\lambda$ term is negligible and the effective singular value of the matrix in the solution is $\frac{1}{\sigma_i}$, the same as it would be without any regularization.
The case when $n = d$ is similar, but without the extra zero columns since the matrix $\Sigma$ is square. The case when $n < d$ is also similar, except we would now have additional rows of zeros below the square block. You can work out for yourself what those look like.
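
The shrinkage of the effective singular values $\frac{\sigma_i}{\sigma_i^2 + \lambda}$ is easy to see numerically; the singular values and $\lambda$ values below are arbitrary examples.

```python
import numpy as np

# Sketch: effective singular values sigma / (sigma^2 + lam) for a few lam values.
sigmas = np.array([10.0, 1.0, 0.1, 0.01])
for lam in [0.0, 0.01, 1.0]:
    print(lam, sigmas / (sigmas ** 2 + lam))
# For sigma >> lam the factor is ~1/sigma (no effect); for sigma << lam it is
# ~sigma/lam, so tiny singular values no longer blow up the solution.
```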
(c) Perspective 3: Maximum A Posteriori (MAP) Estimation. Ridge regression can be viewed as finding the MAP estimate when we apply a prior to the (now viewed as random) parameters $W$. In particular, we can think of the prior for $W$ as being $\mathcal{N}(0, I)$ and view the random $Y$ as being generated using $Y = x^T W + \sqrt{\lambda}\,N$, where the noise $N$ is distributed i.i.d. (across training samples) as $\mathcal{N}(0, 1)$. At the vector level, we have $Y = XW + \sqrt{\lambda}\,N$. Note that the $X$ matrix, whose rows are the $n$ different training points, is not random.
Show that (1) is the MAP estimate for $W$ given an observation $Y = y$.
Solution: From how we define MAP estimation,
\begin{align*}
MAP(w \mid Y = y) &= \underset{w}{\operatorname{argmax}}\; f(w \mid Y = y) \\
&= \underset{w}{\operatorname{argmax}}\; \frac{f(w, y)}{f(y)}
\end{align*}
The denominator doesn't affect the argmax since it doesn't depend on $w$, so we can omit it. Then we can use the chain rule to expand the numerator:
$$= \underset{w}{\operatorname{argmax}}\; f(w) f(y \mid w)$$
Now, split the conditional joint density into a product of conditional densities, since each element of $y$ is independent given $w$:
$$= \underset{w}{\operatorname{argmax}}\; f(w) \prod_{i=1}^{n} f(y_i \mid w)$$
We can now recall that the standard normal pdf is $f_Z(z) = \frac{e^{-z^2/2}}{\sqrt{2\pi}}$. To find $f(y_i \mid w)$, we know that $y_i = x_i^T w + \sqrt{\lambda}\,N_i$, where $N_i \sim \mathcal{N}(0, 1)$, so we'd have $\frac{y_i - x_i^T w}{\sqrt{\lambda}} \sim \mathcal{N}(0, 1)$. Plugging the densities into the previous expression (with the normalizing constants written explicitly), we have
$$MAP = \underset{w}{\operatorname{argmax}}\; \frac{e^{-\|w\|^2/2}}{(2\pi)^{d/2}} \prod_{i=1}^{n} \frac{e^{-\left(\frac{y_i - x_i^T w}{\sqrt{\lambda}}\right)^2/2}}{\sqrt{2\pi\lambda}}$$


We can ignore the multiplicative scaling constants (since they don't affect the value of the argmax). Also, since log is a monotonically increasing function, we can take the log without affecting the argmax; taking logs is useful since it turns the product into a sum. So we now have:
$$MAP = \underset{w}{\operatorname{argmax}}\; -\frac{\|w\|^2}{2} - \frac{1}{\lambda}\sum_i \frac{(y_i - x_i^T w)^2}{2}$$
We can drop the factor of $\frac{1}{2}$, collect the summation terms into a vector and take its norm-squared, and turn the argmax into an argmin by negating the objective:
$$MAP = \underset{w}{\operatorname{argmin}}\; \|w\|^2 + \frac{1}{\lambda}\|y - Xw\|^2$$
Finally, we can multiply through by $\lambda$ without changing the argmin, since it is a positive constant:
$$MAP = \underset{w}{\operatorname{argmin}}\; \lambda\|w\|^2 + \|Xw - y\|^2$$
And now this is the same form as the original ridge regression optimization problem (1).
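
As a numerical illustration (toy data; using scipy.optimize.minimize is just one convenient way to maximize the posterior, not part of the required derivation), the MAP estimate under this model coincides with the ridge solution:

```python
import numpy as np
from scipy.optimize import minimize

# Sketch: maximizing the posterior under a N(0, I) prior and N(x^T w, lam)
# likelihood reproduces ridge regression with penalty lam (toy data).
rng = np.random.default_rng(5)
X = rng.standard_normal((40, 4))
y = rng.standard_normal(40)
lam = 0.5

def neg_log_posterior(w):
    # constants dropped; this is ||w||^2/2 + ||y - Xw||^2 / (2*lam)
    return 0.5 * np.sum(w ** 2) + np.sum((y - X @ w) ** 2) / (2 * lam)

w_map = minimize(neg_log_posterior, np.zeros(4)).x
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)
print(np.allclose(w_map, w_ridge, atol=1e-4))   # True
```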
(d) Perspective 4: Fake Data. Another way to interpret ridge regression is as ordinary least squares on an augmented data set, i.e. adding a bunch of fake data points to our data. Consider the following augmented measurement vector $\hat{y}$ and data matrix $\hat{X}$:
$$\hat{y} = \begin{bmatrix} y \\ 0_d \end{bmatrix}, \qquad \hat{X} = \begin{bmatrix} X \\ \sqrt{\lambda} I_d \end{bmatrix},$$
where $0_d$ is the zero vector in $\mathbb{R}^d$ and $I_d \in \mathbb{R}^{d \times d}$ is the identity matrix. Show that the classical OLS optimization problem $\operatorname{argmin}_w \|\hat{y} - \hat{X}w\|_2^2$ has the same minimizer as (1).
Solution: There are two easy ways of seeing the answer. The first is to look at the optimization problem itself and expand out the terms.
Recall that $\|\hat{y} - \hat{X}w\|_2^2 = \sum_{i=1}^{n+d} (\hat{y}_i - \hat{x}_i w)^2$, where the $\hat{x}_i$ are the rows of $\hat{X}$: the squared norm of the error is the sum of squared errors in the individual coordinates. Our augmentation adds $d$ more terms to that sum, which exactly give the ridge regularization. To see that, we can write
\begin{align*}
\sum_{i=1}^{n} (\hat{y}_i - \hat{x}_i w)^2 &= \sum_{i=1}^{n} (y_i - x_i w)^2 = \|y - Xw\|_2^2 \\
\sum_{i=n+1}^{n+d} (\hat{y}_i - \hat{x}_i w)^2 &= \sum_{i=n+1}^{n+d} (\sqrt{\lambda}\, w_{i-n})^2 = \lambda\|w\|_2^2 \\
\sum_{i=1}^{n+d} (\hat{y}_i - \hat{x}_i w)^2 &= \|y - Xw\|_2^2 + \lambda\|w\|_2^2.
\end{align*}

Alternatively, we can look at the solution and simplify it. We know that the solution to ordinary least squares for the augmented data is just
\begin{align*}
(\hat{X}^\top \hat{X})^{-1}\hat{X}^\top\hat{y} &= \left(\begin{bmatrix} X \\ \sqrt{\lambda} I_d \end{bmatrix}^\top \begin{bmatrix} X \\ \sqrt{\lambda} I_d \end{bmatrix}\right)^{-1} \begin{bmatrix} X \\ \sqrt{\lambda} I_d \end{bmatrix}^\top \begin{bmatrix} y \\ 0_d \end{bmatrix} \\
&= \left(\begin{bmatrix} X^\top & \sqrt{\lambda} I_d \end{bmatrix} \begin{bmatrix} X \\ \sqrt{\lambda} I_d \end{bmatrix}\right)^{-1} \begin{bmatrix} X^\top & \sqrt{\lambda} I_d \end{bmatrix} \begin{bmatrix} y \\ 0_d \end{bmatrix} \\
&= (X^\top X + \lambda I_d)^{-1} X^\top y
\end{align*}
Notice that this is the same as the solution for ridge regression. Either way, we get the desired result.
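
A quick sketch (toy data, arbitrary $\lambda$) of the fake-data view: running plain least squares on the augmented system reproduces the ridge solution.

```python
import numpy as np

# Sketch: OLS on the augmented system [X; sqrt(lam) I], [y; 0] equals ridge.
rng = np.random.default_rng(6)
n, d, lam = 25, 4, 0.3
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

X_aug = np.vstack([X, np.sqrt(lam) * np.eye(d)])    # d fake data points
y_aug = np.concatenate([y, np.zeros(d)])            # with target 0

w_aug, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print(np.allclose(w_aug, w_ridge))   # True
```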
(e) Perspective 5: Fake Features. For this last interpretation, let's instead construct an augmented design matrix in the following way:
$$\check{X} = \begin{bmatrix} X & \sqrt{\lambda} I_n \end{bmatrix}$$
i.e. we stack $X$ with $\sqrt{\lambda} I_n$ horizontally. Now our problem is underdetermined: the new dimension $d + n$ is larger than the number of points $n$. Therefore, there are infinitely many values $\eta \in \mathbb{R}^{d+n}$ for which $\check{X}\eta = y$. We are interested in the min-norm solution, i.e. the solution to
$$\underset{\eta}{\operatorname{argmin}}\; \|\eta\|_2^2 \quad \text{s.t.} \quad \check{X}\eta = y. \tag{2}$$
Show that this is yet another form of ridge regression and that the first $d$ coordinates of $\eta^*$ form the minimizer of (1).
Solution: Let's look inside the $(d+n)$-dimensional vector $\eta$ by writing it as $\eta = \begin{bmatrix} w \\ \xi \end{bmatrix}$. Here, $w$ is $d$-dimensional and $\xi$ is $n$-dimensional. Then (2) expands to
$$\underset{w,\xi}{\operatorname{argmin}}\; \|w\|_2^2 + \|\xi\|_2^2 \quad \text{s.t.} \quad Xw + \sqrt{\lambda}\,\xi = y.$$
The constraint just says that $\sqrt{\lambda}\,\xi = y - Xw$. In other words, $\sqrt{\lambda}\,\xi$ is the classic residual. This yields $\xi = \frac{1}{\sqrt{\lambda}}(y - Xw)$, and plugging that into the first part we get
$$\underset{w}{\operatorname{argmin}}\; \|w\|_2^2 + \frac{1}{\lambda}\|y - Xw\|_2^2.$$
When considering whether the optimization problem is equivalent, we need to think about the minimizer and not the minimum itself. For this, we simply notice that scaling the objective by a constant factor doesn't change the minimizers, so
$$\underset{w}{\operatorname{argmin}}\; \|w\|_2^2 + \frac{1}{\lambda}\|y - Xw\|_2^2 = \underset{w}{\operatorname{argmin}}\; \lambda\|w\|_2^2 + \|y - Xw\|_2^2,$$
which is equivalent to (1).
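
A sketch of the fake-features view with toy data: taking the min-norm solution of $\check{X}\eta = y$ via the pseudo-inverse and reading off its first $d$ coordinates recovers the ridge weights.

```python
import numpy as np

# Sketch: the first d coordinates of the min-norm solution of
# [X  sqrt(lam) I_n] eta = y are the ridge weights (toy data).
rng = np.random.default_rng(7)
n, d, lam = 6, 10, 0.4
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

X_check = np.hstack([X, np.sqrt(lam) * np.eye(n)])  # n x (d + n), full row rank
eta = np.linalg.pinv(X_check) @ y                   # min-norm solution
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print(np.allclose(eta[:d], w_ridge))   # True
```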


(f) We know that the Moore-Penrose pseudo-inverse for an underdetermined system (wide matrix) is given by $A^\dagger = A^T (AA^T)^{-1}$, which corresponds to the min-norm solution for $A\eta = z$. That is, the optimization problem
$$\underset{\eta}{\operatorname{argmin}}\; \|\eta\|_2 \quad \text{s.t.} \quad A\eta = z$$
is solved by $\eta = A^\dagger z$. Let $\hat{w}$ be the minimizer of (1).
Use the pseudo-inverse to show that solving the optimization problem in (2) yields
$$\hat{w} = X^T (XX^T + \lambda I)^{-1} y$$

Then, show that this is equivalent to the standard formula for ridge regression:
$$\hat{w} = (X^T X + \lambda I)^{-1} X^T y$$
Hint: It may be helpful to review the kernel ridge form.


Solution: First, we simply need to plug our matrix into the pseudo-inverse formula provided and simplify:
$$\begin{bmatrix} \hat{w} \\ \xi \end{bmatrix} = \check{X}^T(\check{X}\check{X}^T)^{-1} y = \begin{bmatrix} X^T \\ \sqrt{\lambda} I_n \end{bmatrix} \left(\begin{bmatrix} X & \sqrt{\lambda} I_n \end{bmatrix} \begin{bmatrix} X^T \\ \sqrt{\lambda} I_n \end{bmatrix}\right)^{-1} y = \begin{bmatrix} X^T \\ \sqrt{\lambda} I_n \end{bmatrix} (XX^T + \lambda I)^{-1} y$$
Looking at just the top $d$ entries, we see $\hat{w} = X^T(XX^T + \lambda I)^{-1} y$, as desired.
Now, let's show the two forms of ridge regression are equivalent. That is, let's show
$$(X^T X + \lambda I)^{-1} X^T = X^T (XX^T + \lambda I)^{-1}$$
We can left-multiply both sides by $(X^T X + \lambda I)$ and right-multiply both sides by $(XX^T + \lambda I)$ to get
$$X^T (XX^T + \lambda I) = (X^T X + \lambda I) X^T$$
Distributing the matrix multiplication, we have
$$X^T XX^T + \lambda X^T = X^T XX^T + \lambda X^T,$$
which is always true, as desired.
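
The two forms are also easy to check numerically on toy data; the second ("kernel") form only requires inverting the $n \times n$ matrix $XX^T + \lambda I$, which is cheaper when $d \gg n$.

```python
import numpy as np

# Sketch: the primal and kernel forms of the ridge solution agree (toy data).
rng = np.random.default_rng(8)
n, d, lam = 5, 12, 0.2                 # wide case: d > n
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

w_primal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)   # d x d solve
w_kernel = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), y)   # n x n solve
print(np.allclose(w_primal, w_kernel))   # True
```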


(g) We know that the solution to ridge regression (1) is given by $\hat{w}_r = (X^\top X + \lambda I)^{-1} X^\top y$. What happens when $\lambda \to \infty$? It is for this reason that ridge regularization is sometimes referred to as "shrinkage."
Solution: As $\lambda \to \infty$, the matrix $(X^\top X + \lambda I)^{-1}$ converges to the zero matrix, and so we have $w = 0$.
(h) What happens to the solution of ridge regression when you take the limit $\lambda \to 0$? Consider both the case when $X$ is wide (underdetermined system) and the case when $X$ is tall (overdetermined system).
Solution: When $X$ is wide (underdetermined), we converge to the min-norm solution:
$$w^* = X^T (XX^T)^{-1} y$$
When $X$ is tall (overdetermined), we converge to the OLS solution:
$$w^* = (X^T X)^{-1} X^T y$$
Both of these can be seen by using the relevant form of ridge regression and plugging in $\lambda = 0$ directly, or by using the SVD perspective.
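
Both limits can be illustrated with an arbitrary toy tall matrix:

```python
import numpy as np

# Sketch: large lam shrinks w toward 0; lam -> 0 recovers OLS for a tall X.
rng = np.random.default_rng(9)
X = rng.standard_normal((20, 3))
y = rng.standard_normal(20)

def ridge(lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

print(np.linalg.norm(ridge(1e6)))                                        # nearly 0
print(np.allclose(ridge(1e-10), np.linalg.lstsq(X, y, rcond=None)[0]))   # True
```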


4. General Case Tikhonov Regularization

Consider the optimization problem:
$$\min_x\; \|W_1(Ax - b)\|_2^2 + \|W_2(x - c)\|_2^2$$
where $W_1$, $A$, and $W_2$ are matrices and $x$, $b$, and $c$ are vectors. $W_1$ can be viewed as a generic weighting of the residuals, and $W_2$ along with $c$ can be viewed as a generic weighting of the parameters.

(a) Solve this optimization problem manually by expanding it out as matrix-vector products, setting the gradient to 0, and solving for $x$.
Solution: Expand our objective function:
$$f(x) = (Ax - b)^T W_1^T W_1 (Ax - b) + (x - c)^T W_2^T W_2 (x - c)$$
$$f(x) = x^T A^T W_1^T W_1 A x - 2 b^T W_1^T W_1 A x + b^T W_1^T W_1 b + x^T W_2^T W_2 x - 2 c^T W_2^T W_2 x + c^T W_2^T W_2 c$$
Now take the gradient and set it to 0:
$$\nabla f = 2 A^T W_1^T W_1 A x - 2 A^T W_1^T W_1 b + 2 W_2^T W_2 x - 2 W_2^T W_2 c = 0$$
Isolating the $x$ terms on one side, we have:
$$(A^T W_1^T W_1 A + W_2^T W_2) x = A^T W_1^T W_1 b + W_2^T W_2 c$$
So we can solve to get
$$x = (A^T W_1^T W_1 A + W_2^T W_2)^{-1} (A^T W_1^T W_1 b + W_2^T W_2 c)$$

(b) Construct an appropriate matrix $C$ and vector $d$ that allow you to rewrite this problem as
$$\min_x\; \|Cx - d\|^2$$
and use the OLS solution ($x^* = (C^T C)^{-1} C^T d$) to solve it. Confirm your answer is in agreement with the previous part.
Solution: We can rewrite our problem in least-squares form using
$$C = \begin{bmatrix} W_1 A \\ W_2 \end{bmatrix}, \qquad d = \begin{bmatrix} W_1 b \\ W_2 c \end{bmatrix}$$
Now, using the least squares solution and simplifying, we get
\begin{align*}
x^* = (C^T C)^{-1} C^T d &= \left(\begin{bmatrix} W_1 A \\ W_2 \end{bmatrix}^\top \begin{bmatrix} W_1 A \\ W_2 \end{bmatrix}\right)^{-1} \begin{bmatrix} W_1 A \\ W_2 \end{bmatrix}^\top \begin{bmatrix} W_1 b \\ W_2 c \end{bmatrix} \\
&= \left(\begin{bmatrix} A^T W_1^T & W_2^T \end{bmatrix} \begin{bmatrix} W_1 A \\ W_2 \end{bmatrix}\right)^{-1} \begin{bmatrix} A^T W_1^T & W_2^T \end{bmatrix} \begin{bmatrix} W_1 b \\ W_2 c \end{bmatrix}
\end{align*}

$$x^* = (A^T W_1^T W_1 A + W_2^T W_2)^{-1} (A^T W_1^T W_1 b + W_2^T W_2 c)$$
which is the same as the previous part, as desired.
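
A quick numerical check with arbitrary random matrices that the stacked least-squares form and the closed-form expression agree:

```python
import numpy as np

# Sketch: general Tikhonov closed form vs. the stacked least-squares form.
rng = np.random.default_rng(10)
A = rng.standard_normal((20, 4))
b = rng.standard_normal(20)
c = rng.standard_normal(4)
W1 = np.diag(rng.uniform(0.5, 2.0, 20))     # arbitrary residual weighting
W2 = np.diag(rng.uniform(0.5, 2.0, 4))      # arbitrary parameter weighting

x_closed = np.linalg.solve(A.T @ W1.T @ W1 @ A + W2.T @ W2,
                           A.T @ W1.T @ W1 @ b + W2.T @ W2 @ c)
C = np.vstack([W1 @ A, W2])
d = np.concatenate([W1 @ b, W2 @ c])
x_stacked, *_ = np.linalg.lstsq(C, d, rcond=None)
print(np.allclose(x_closed, x_stacked))   # True
```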
(c) Choose $W_1$, $W_2$, and $c$ such that this reduces to the simple case of ridge regression that you've seen in the previous problem, $x^* = (A^T A + \lambda I)^{-1} A^T b$.
Solution: This reduces to ridge regression when $W_1 = I$, $W_2 = \sqrt{\lambda} I$, and $c = 0$. You can see this in both the optimization problem and the result.

5. Coding Fully Connected Networks


In this coding assignment, you will be building a fully-connected neural network from scratch using NumPy.
You will have the choice between two options:

• Use Google Colab (Recommended). Open this url and follow the instructions in the notebook.
• Use a local Conda environment. Clone https://github.com/gonglinyuan/cs182hw1 and refer to README.md for further instructions.

For this question, please submit a .zip file of your completed work to the Gradescope assignment titled "Homework 1 (Code)". Please answer the following question in your submission of the written assignment:

(a) Did you notice anything about the comparative difficulty of training the three-layer net vs. training the five-layer net?

Solution: Training a five-layer neural network is more difficult and more sensitive to hyperparameters
(initialization scale and learning rate).

6. Visualizing features from local linearization of neural nets


This problem expects you to modify the Jupyter Notebook you were given in the first discussion section for
the course to allow the visualization of the effective “features” that correspond to the local linearization of
the network in the neighborhood of the parameters.
We provide you with some starter code on Google Colab. For this question, please do not submit your
code to Gradescope. Instead, just include your plots and comments regarding the questions in the subparts.
(a) Visualize the features corresponding to $\frac{\partial}{\partial w_i^{(1)}} y(x)$ and $\frac{\partial}{\partial b_i^{(1)}} y(x)$, where the $w_i^{(1)}$ are the first hidden layer's weights and the $b_i^{(1)}$ are the first hidden layer's biases. These derivatives should be evaluated at least at both the random initialization and the final trained network. When visualizing these features, plot them as a function of the scalar input $x$, the same way that the notebook plots the constituent "elbow" features that are the outputs of the penultimate layer.
Solution: See notebook.
(b) During training, we can imagine that we have a generalized linear model with a feature matrix made up of the linearized features corresponding to each learnable parameter. We know from our analysis of gradient descent that the singular values and singular vectors of this feature matrix are important.
Use the SVD of this feature matrix to plot the singular values and visualize the "principal features" that correspond to the d-dimensional singular vectors multiplied by all the features corresponding to the parameters.


(HINT: Remember that the feature matrix whose SVD you are taking has n rows, where each row corresponds to one training point, and d columns, where each column corresponds to one of the learnable features. Meanwhile, you are going to be plotting/visualizing the "principal features" as functions of x, even at places where you don't have training points.)
Solution: See notebook.
(c) Augment the Jupyter notebook to add a second hidden layer of the same size as the first hidden layer, fully connected to the first hidden layer. Allow the visualization of the features corresponding to the parameters in both hidden layers, as well as the "principal features" and the singular values.
Solution: See notebook.

7. Homework Process and Study Group


Citing sources and collaborators are an important part of life, including being a student!
We also want to understand what resources you find helpful and how much time homework is taking, so we
can change things in the future if possible.

(a) What sources (if any) did you use as you worked through the homework?
(b) If you worked with someone on this homework, who did you work with?
List names and student IDs. (In case of homework party, you can also just describe the group.)
(c) Roughly how many total hours did you work on this homework? Write it down here where you’ll
need to remember it for the self-grade form.

Contributors:

• Brandon Trabucco.

• Saagar Sanghavi.

• Alexander Tsigler.

• Anant Sahai.

• Jane Yu.

• Philipp Moritz.

• Soroush Nasiriany.

• Linyuan Gong.

• Sheng Shen.

Homework 1, © UCB EECS 182, Spring 2023. All Rights Reserved. This may not be publicly shared without explicit permission.
