Deep Learning Course Overview
Lecture 1 – Introduction

1.1 Introduction
Team
 Lecturer:
  I Prof. Dr.-Ing. Andreas Geiger
 TAs:
  I Dr. Joo-Ho Lee
  I Songyou Peng
  I Aditya Prakash
  I Christian Reiser
  I Axel Sauer
Contents
Exercises
   I Every 2 weeks (6 assignments in total)
   I Handed out on Wednesdays via ILIAS and introduced via Zoom
   I Q&A every other Wednesday via Zoom
   I Can be conducted in groups of up to 2 students
       I No sharing across groups
       I Every group member must submit the solution
       I Find a partner via ILIAS booking pool
   I Assignments involve pen & paper as well as coding tasks
        I Assignments 1-3: Educational Deep-Learning Framework (Python NumPy)
        I Assignments 4-6: PyTorch (Google Colab)
    I At least 50% of the assignments must be completed successfully to participate in the exam
    I Successfully completing 75% earns a 0.3 grade bonus in the exam
Lecture Notes
  Books:
   I Goodfellow, Bengio, Courville: Deep Learning
     http://www.deeplearningbook.org
Materials & Credits
  Courses:
   I McAllester (TTI-C): Fundamentals of Deep Learning
      http://mcallester.github.io/ttic-31230/Fall2020/
Materials & Credits
  Tutorials:
   I The Python Tutorial
      https://docs.python.org/3/tutorial/
   I NumPy Quickstart
      https://numpy.org/devdocs/user/quickstart.html
   I PyTorch Tutorial
      https://pytorch.org/tutorials/
    I LaTeX / Overleaf Tutorial
      https://www.overleaf.com/learn
  Frameworks / IDEs:
    I Visual Studio Code
      https://code.visualstudio.com/
   I Google Colab
      https://colab.research.google.com
Prerequisites
  Linear Algebra:
   I Vectors: x, y ∈ R^n
   I Matrices: A, B ∈ R^(m×n)
   I Operations: A^T, A^(-1), Tr(A), det(A), A + B, AB, Ax, x^T y
   I Norms: ||x||_1, ||x||_2, ||x||_∞, ||A||_F
   I SVD: A = U D V^T
  (A minimal NumPy sketch of these operations is shown below.)
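  The following minimal NumPy sketch (not part of the slides; variable names and sizes are illustrative) exercises the operations listed above:

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
B = rng.normal(size=(3, 3))
x = rng.normal(size=3)
y = rng.normal(size=3)

A.T                            # transpose A^T
np.linalg.inv(A)               # inverse A^(-1) (A is square and here almost surely invertible)
np.trace(A)                    # Tr(A)
np.linalg.det(A)               # det(A)
A + B, A @ B, A @ x            # matrix sum, matrix product, matrix-vector product
x @ y                          # inner product x^T y
np.linalg.norm(x, 1), np.linalg.norm(x, 2), np.linalg.norm(x, np.inf)   # vector norms
np.linalg.norm(A, 'fro')       # Frobenius norm ||A||_F
U, D, Vt = np.linalg.svd(A)    # SVD: A = U @ np.diag(D) @ Vt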
1.2 History of Deep Learning
A Brief History of Deep Learning
  [Timeline figure, 1950–2020: the three waves of neural network research – Cybernetics, Connectionism, Deep Learning]
A Brief History of Deep Learning
  1986: Backpropagation
   I Remains main workhorse today
  [Timeline figure, 1950–2020: Minsky/Papert, Neocognitron, Backpropagation]
Rumelhart, Hinton and Williams: Learning representations by back-propagating errors. Nature, 1986.
A Brief History of Deep Learning
  [Timeline figure, 1950–2020: LSTM]
Hochreiter, Schmidhuber: Long short-term memory. Neural Computation, 1997.
A Brief History of Deep Learning
  1998: Convolutional Networks (ConvNets)
   I But did not scale up (yet)
  [Timeline figure, 1950–2020: ConvNet]
LeCun, Bottou, Bengio, Haffner: Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
A Brief History of Deep Learning
  2012: ImageNet / AlexNet
   I Via GPU training, deep models, and data
   I Sparked deep learning revolution
  [Timeline figure, 1950–2020: ImageNet/AlexNet]
Krizhevsky, Sutskever, Hinton: ImageNet classification with deep convolutional neural networks. NIPS, 2012.
A Brief History of Deep Learning
  [Timeline figure, 1950–2020: Datasets]
Geiger, Lenz and Urtasun: Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite. CVPR, 2012.
A Brief History of Deep Learning
  [Timeline figure, 1950–2020: Datasets]
Dosovitskiy et al.: FlowNet: Learning Optical Flow with Convolutional Networks. ICCV, 2015.
A Brief History of Deep Learning
  2014: Generalization
   I Empirical demonstration that deep representations generalize well
     despite a large number of parameters
   I Pre-train CNN on large amounts of data on a generic task (e.g., ImageNet)
   I Fine-tune (re-train) only the last layers on little data from a new task
   I State-of-the-art performance
  [Timeline figure, 1950–2020: Generalization]
Razavian, Azizpour, Sullivan, Carlsson: CNN Features Off-the-Shelf: An Astounding Baseline for Recognition. CVPR Workshops, 2014.
A Brief History of Deep Learning
  2014: Visualization
   I Goal: provide insights into what the network (black box) has learned
   I Visualized image regions that most strongly activate various neurons
     at different layers of the network
   I Found that higher levels capture more abstract semantic information
  [Timeline figure, 1950–2020: Visualization]
Zeiler and Fergus: Visualizing and Understanding Convolutional Networks. ECCV, 2014.
A Brief History of Deep Learning
  [Timeline figure, 1950–2020: Deep RL]
Mnih et al.: Human-level control through deep reinforcement learning. Nature, 2015.
A Brief History of Deep Learning
     2016: WaveNet
        I Deep generative model
          of raw audio waveforms
        I Generates speech which
          mimics human voice
        I Generates music
  [Timeline figure, 1950–2020: WaveNet]
Oord et al.: WaveNet: A Generative Model for Raw Audio. arXiv, 2016.
A Brief History of Deep Learning
  [Timeline figure, 1950–2020: Style Transfer]
Gatys, Ecker and Bethge: Image Style Transfer Using Convolutional Neural Networks. CVPR, 2016.
A Brief History of Deep Learning
  [Timeline figure, 1950–2020: AlphaGo]
Silver et al.: Mastering the game of Go without human knowledge. Nature, 2017.
A Brief History of Deep Learning
  [Timeline figure, 1950–2020: Mask R-CNN]
He, Gkioxari, Dollár and Girshick: Mask R-CNN. ICCV, 2017.
A Brief History of Deep Learning
  [Timeline figure, 1950–2020: BERT/GLUE]
Vaswani et al.: Attention is All you Need. NIPS, 2017.
A Brief History of Deep Learning
  [Timeline figure, 1950–2020: BERT/GLUE]
Devlin, Chang, Lee and Toutanova: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv, 2018.
A Brief History of Deep Learning
   I But: Computers still fail in dialogue
  [Timeline figure, 1950–2020: BERT/GLUE]
Wang et al.: GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. ICLR, 2019.
A Brief History of Deep Learning
  [Timeline figure, 1950–2020: Turing Award]
A Brief History of Deep Learning
  [Timeline figure, 1950–2020: 3D Deep Learning]
Niemeyer, Mescheder, Oechsle, Geiger: Differentiable Volumetric Rendering: Learning Implicit 3D Representations without 3D Supervision. CVPR, 2020.
A Brief History of Deep Learning
    2020: GPT-3
       I Language model by OpenAI
       I 175 Billion parameters
       I Text-in / text-out interface
       I Many use cases: coding, poetry,
         blogging, news articles, chatbots
       I Controversial discussions
       I Licensed exclusively to Microsoft
         on September 22, 2020
  [Timeline figure, 1950–2020: GPT-3]
Brown et al.: Language Models are Few-Shot Learners. arXiv, 2020.
A Brief History of Deep Learning
  Current Challenges
   I Un-/Self-Supervised Learning
   I Interactive learning
   I Accuracy (e.g., self-driving)
   I Robustness and generalization
   I Inductive biases
   I Understanding and mathematics
   I Memory and compute
   I Ethics and legal questions
   I Does “Moore’s Law of AI” continue?
1.3 Machine Learning Basics
Goodfellow et al.: Deep Learning, Chapter 5
 http://www.deeplearningbook.org/contents/ml.html
Learning Problems
   I Supervised learning
       I Learn model parameters using a dataset of data-label pairs {(x_i, y_i)}_{i=1}^N
       I Examples: Classification, regression, structured prediction
   I Unsupervised learning
       I Learn model parameters using a dataset without labels {x_i}_{i=1}^N
       I Examples: Clustering, dimensionality reduction, generative models
   I Self-supervised learning
       I Learn model parameters using a dataset of data-data pairs {(x_i, x'_i)}_{i=1}^N
       I Examples: Self-supervised stereo/flow, contrastive learning
   I Reinforcement learning
       I Learn model parameters using active exploration from sparse rewards
       I Examples: Deep Q-learning, policy gradients, actor-critic
Supervised Learning
Classification, Regression, Structured Prediction
  Classification / Regression:

      f : X → N   or   f : X → R
Classification
"Beach"
                                                             48
Regression
  [Example figure: an input mapped to a continuous value, e.g., 143,52 €]
   I Mapping: f_w : R^N → R
Structured Prediction
                                              "Das Pferd
                                               frisst keinen
                                               Gurkensalat."
                                                               48
Structured Prediction
  [Example figure: structured output, e.g., a scene with objects labeled "Can" and "Monkey"]
Structured Prediction
   I Mapping: f_w : R^(W×H×N) → {0, 1}^(M^3)
   I Suppose: 32^3 voxels, binary variable per voxel (occupied/free)
   I Question: How many different reconstructions? 2^(32^3) = 2^32768
   I Comparison: Number of atoms in the universe? ∼ 2^273
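  A quick Python check of these exponents (taking roughly 10^82 atoms in the observable universe as the assumed reference value):

import math

num_voxels = 32 ** 3                                    # 32,768 binary voxels
print("reconstructions: 2 **", num_voxels)              # 2 ** 32768
print("atoms:           2 ** %.0f" % (82 * math.log2(10)))   # ~2 ** 272, i.e. roughly the slide's 2^273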
Linear Regression
  Let X denote a dataset of size N and let (x_i, y_i) ∈ X denote its elements (y_i ∈ R).
  Goal: Predict y for a previously unseen input x. The input x may be multidimensional.
  [Figure: ground-truth function and noisy observations of a 1D regression dataset]
Linear Regression
  The error function E(w) measures the displacement along the y dimension between
  the data points (green) and the model f(x, w) (red) specified by the parameters w.

      f(x, w) = w^T x

      E(w) = Σ_{i=1}^N (f(x_i, w) − y_i)²  =  Σ_{i=1}^N (x_i^T w − y_i)²

  [Figure: ground truth, noisy observations and linear fit]

  Stacking the inputs x_i into the rows of a matrix X and the targets y_i into a vector y,
  the gradient of the error is

      ∇_w E(w) = 2 X^T X w − 2 X^T y

  As E(w) is quadratic and convex in w, its minimizer (wrt. w) is given in closed form:

      ∇_w E(w) = 0   ⇒   w = (X^T X)^(-1) X^T y

  The matrix (X^T X)^(-1) X^T is also called the Moore-Penrose inverse or pseudoinverse.
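  As an illustration, a minimal NumPy sketch of this closed-form solution (the data-generating line and the noise level below are assumptions, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
y = 0.5 + 2.0 * x + 0.1 * rng.normal(size=50)   # assumed ground-truth line plus noise

X = np.stack([np.ones_like(x), x], axis=1)      # features (1, x) per example; column of ones models the bias
w = np.linalg.pinv(X) @ y                       # pseudoinverse: (X^T X)^(-1) X^T y
# equivalently: w, *_ = np.linalg.lstsq(X, y, rcond=None)

print("estimated w:", w)                        # should be close to (0.5, 2.0)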
Example: Line Fitting
  [Figure: left – linear fit (red) to noisy observations (green) and ground truth;
   right – error curve over the weight w1 with its minimum]

      f(x, w) = Σ_{j=0}^M w_j x^j = w^T x   with features   x = (1, x^1, x^2, ..., x^M)^T
  Tasks:
   I Training: Estimate w from dataset X
   I Inference: Predict y for novel x given estimated w
  Note:
   I Features can be anything, including multi-dimensional inputs (e.g., images, audio),
     radial basis functions, sine/cosine functions, etc. In this example: monomials.
Polynomial Curve Fitting
      f(x, w) = Σ_{j=0}^M w_j x^j = w^T x   with features   x = (1, x^1, x^2, ..., x^M)^T

      E(w) = Σ_{i=1}^N (f(x_i, w) − y_i)²
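  A minimal NumPy sketch of fitting this model with monomial features (the sine-shaped data and the degree M = 3 below are assumptions for illustration):

import numpy as np

def poly_features(x, M):
    # design matrix with monomial features (1, x, x**2, ..., x**M) per row
    return np.stack([x**j for j in range(M + 1)], axis=1)

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=10)
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=10)   # assumed noisy observations

M = 3
w = np.linalg.lstsq(poly_features(x, M), y, rcond=None)[0]   # least-squares fit

x_new = np.linspace(0, 1, 100)
y_pred = poly_features(x_new, M) @ w                         # predictions of the fitted polynomial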
Polynomial Curve Fitting
  The error function from above is quadratic in w but not in x:

      E(w) = Σ_{i=1}^N (f(x_i, w) − y_i)²
           = Σ_{i=1}^N (w^T x_i − y_i)²
           = Σ_{i=1}^N ( Σ_{j=0}^M w_j x_i^j − y_i )²

  [Figure: polynomial fits of two different degrees M to the data]
Polynomial Curve Fitting
  [Figure: polynomial fits with M = 3 (left) and M = 9 (right), shown with ground truth,
   noisy observations and test set]
  Plots of polynomials of various degrees M (red) fitted to the data (green). We observe
  underfitting (M = 0/1) and overfitting (M = 9). This is a model selection problem.
Capacity, Overfitting and Underfitting
  Goal:
   I Perform well on new, previously unseen inputs (test set, blue),
     not only on the training set (green)
   I This is called generalization and separates ML from optimization
   I Assumption: training and test data are drawn independently and identically
     distributed (i.i.d.) from the distribution p_data(x, y)
  [Figure: ground truth, noisy training observations and test set]
Capacity, Overfitting and Underfitting
  Terminology:
   I Capacity: Complexity of functions which can be represented by model f
   I Underfitting: Model too simple, does not achieve low error on training set
   I Overfitting: Training error small, but test error (= generalization error) large
  [Figure: polynomial fits with M = 1, 3 and 9, and training/test error as a function of the
   degree of the polynomial]
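  To make the under-/overfitting behaviour concrete, the following sketch (assuming the same kind of noisy sine data as in the polynomial fitting example above) sweeps the degree M and prints training and test error; the training error decreases monotonically while the test error eventually increases again:

import numpy as np

def poly_features(x, M):
    return np.stack([x**j for j in range(M + 1)], axis=1)

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 1, size=10)
y_train = np.sin(2 * np.pi * x_train) + 0.2 * rng.normal(size=10)
x_test = rng.uniform(0, 1, size=100)
y_test = np.sin(2 * np.pi * x_test) + 0.2 * rng.normal(size=100)

for M in range(10):
    Xtr, Xte = poly_features(x_train, M), poly_features(x_test, M)
    w = np.linalg.lstsq(Xtr, y_train, rcond=None)[0]
    train_err = np.mean((Xtr @ w - y_train) ** 2)    # training error
    test_err = np.mean((Xte @ w - y_test) ** 2)      # generalization (test) error
    print(f"M={M}: train={train_err:.4f}  test={test_err:.4f}")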
Capacity, Overfitting and Underfitting
  General Approach: Split the dataset into training, validation and test sets
   I Choose hyperparameters (e.g., degree of polynomial, learning rate in a neural net, ...)
     using the validation set. Important: Evaluate only once on the test set (typically not available).
  [Figure: example split – 60% training, 20% validation, 20% test]
   I When the dataset is small, use (k-fold) cross-validation instead of a fixed split
     (see the sketch below).
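  A minimal sketch of k-fold cross-validation (plain NumPy; `fit` and `predict` are placeholders for any model, e.g., the least-squares fit from above):

import numpy as np

def k_fold_mse(X, y, fit, predict, k=5):
    # average validation MSE over k folds; fit/predict are user-supplied callables
    idx = np.random.permutation(len(y))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        w = fit(X[train], y[train])
        errors.append(np.mean((predict(X[val], w) - y[val]) ** 2))
    return np.mean(errors)

# example usage with a linear least-squares model:
# fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
# predict = lambda X, w: X @ w
# score = k_fold_mse(X, y, fit, predict, k=5)

  Splitting on shuffled indices keeps the folds disjoint; the same routine can then be reused to compare any hyperparameter setting (polynomial degree, regularization weight, ...).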
Ridge Regression
  Polynomial Curve Model:

      f(x, w) = Σ_{j=0}^M w_j x^j = w^T x   with features   x = (1, x^1, x^2, ..., x^M)^T

  Ridge Regression:

      E(w) = Σ_{i=1}^N (f(x_i, w) − y_i)² + λ Σ_{j=0}^M w_j²

  [Figure: ridge regression fits with M = 9 for weak and strong regularization]
  Plots of a polynomial with degree M = 9 fitted to 10 data points using ridge regression.
  Left: weak regularization (λ = 10^-8). Right: strong regularization (λ = 10^3).
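  A minimal sketch of the corresponding closed-form solution (assuming, as in the objective above, that the bias weight w_0 is regularized together with all other weights):

import numpy as np

def ridge_fit(X, y, lam):
    # minimize ||Xw - y||^2 + lam * ||w||^2 in closed form
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# example usage with the monomial features from the polynomial fitting sketch:
# w_weak   = ridge_fit(poly_features(x, 9), y, lam=1e-8)   # close to the unregularized fit
# w_strong = ridge_fit(poly_features(x, 9), y, lam=1e3)    # heavily shrunk weights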
Ridge Regression
  [Figure: left – model weights vs. regularization weight; right – training and generalization
   error vs. regularization weight]
  Left: With low regularization, parameters can become very large (ill-conditioning).
  Right: Select the model with the smallest generalization error on the validation set.
Estimators, Bias and Variance
  Point Estimator:
   I A point estimator g(·) is a function that maps a dataset X to model parameters ŵ:

      ŵ = g(X)
Estimators, Bias and Variance
  Bias:  bias(ŵ) = E[ŵ] − w        Variance:  Var(ŵ) = E[(ŵ − E[ŵ])²]
  Bias-Variance Dilemma:
   I Statistical learning theory tells us that we cannot make both arbitrarily small
     at the same time ⇒ there is a trade-off
Estimators, Bias and Variance
  [Figure: estimates (thin curves) obtained from many datasets, their mean, and the ground
   truth, for weak and strong regularization]
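  A rough simulation sketch (the sine data, the degree 9 and the two regularization weights are assumptions for illustration) that mirrors the figure: fit the model to many independently drawn datasets and inspect the mean and spread of the estimates. Weak regularization gives high variance; strong regularization gives low variance but a biased mean estimate.

import numpy as np

def fit_ridge(x, y, lam, M=9):
    X = np.stack([x**j for j in range(M + 1)], axis=1)
    return np.linalg.solve(X.T @ X + lam * np.eye(M + 1), X.T @ y)

rng = np.random.default_rng(0)
for lam in (1e-8, 10.0):
    fits = []
    for _ in range(100):                                   # 100 independent datasets
        x = rng.uniform(0, 1, size=10)
        y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=10)
        fits.append(fit_ridge(x, y, lam))
    fits = np.array(fits)
    print(f"lambda={lam:g}: mean |w| = {np.abs(fits.mean(0)).mean():.2f}, "
          f"avg std across datasets = {fits.std(0).mean():.2f}")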
  Variations:
   I If we choose p_model(y|x, w) to be a Laplace distribution, we obtain an
     estimator that minimizes the L1 norm:  ŵ = argmin_w ||Xw − y||_1
   I Assuming a Gaussian distribution over the parameters w and performing
     maximum a-posteriori (MAP) estimation yields ridge regression:

         argmax_w p(w|y, x) = argmax_w p(y|x, w) p(w)
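  To make the second point concrete, here is the standard short derivation (the noise variance σ² and the prior variance τ² are assumptions not stated on the slide): with p(y|x, w) = Π_i N(y_i; w^T x_i, σ²) and p(w) = N(0, τ² I), taking the negative logarithm of the MAP objective gives

      − log [ p(y|x, w) p(w) ] = (1/(2σ²)) Σ_{i=1}^N (w^T x_i − y_i)² + (1/(2τ²)) ||w||² + const,

  so the MAP estimate minimizes Σ_i (w^T x_i − y_i)² + λ ||w||² with λ = σ²/τ², i.e. exactly the ridge regression objective.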
Maximum Likelihood Estimation
  Remarks:
   I Consistency: As the number of training samples approaches infinity (N → ∞),
     the maximum likelihood (ML) estimate converges to the true parameters
   I Efficiency: Among consistent estimators, the ML estimate converges fastest
     (i.e., has the lowest asymptotic variance) as N increases
   I These theoretical considerations make ML estimators appealing