Backpropagation
Fatemeh Seyyedsalehi
                                  Sharif University of Technology
                                         Spring 2025
Constructing the MLP
      MLPs are capable of representing any function
      But how do we construct it?
          ▶   I.e., how do we determine the weights (and biases) of the network to
              best represent a target function?
          ▶   Assuming that the architecture of the network is given
      By minimizing expected error
                    W = \arg\min_W \int_X e(f(X; W), t(X))\, p(X)\, dX
                      = \arg\min_W \mathbb{E}\big[e(f(X; W), t(X))\big]
Estimating the True Function
      The true function t(x) is unknown, so sample it
          ▶   Basically, get input-output pairs for a number of samples of input
           ▶   i.e., prepare the training dataset
      Estimate the function from the samples
      The empirical estimate of the expected error is the average error over
      the samples:
                          \mathbb{E}\big[e(f(X; W), t(X))\big] \approx \frac{1}{T} \sum_{i=1}^{T} e(f(X_i; W), y_i)
      We can hope that minimizing the empirical loss will minimize the true
      loss
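
A minimal sketch of computing this empirical estimate (my own illustration, not from the slides): the toy linear "network" f, the squared-error divergence e, and the synthetic dataset are all placeholder assumptions.

```python
import numpy as np

# Hypothetical stand-ins: a tiny "network" f(X; W) and a squared-error divergence e.
def f(X, W):
    return X @ W                       # placeholder linear model

def e(pred, target):
    return (pred - target) ** 2        # placeholder per-sample error

# Toy training set (X_i, y_i), i = 1..T
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
W_true = np.array([1.0, -2.0, 0.5])
y = X @ W_true + 0.1 * rng.normal(size=100)

W = rng.normal(size=3)                     # candidate parameters
empirical_loss = np.mean(e(f(X, W), y))    # (1/T) * sum_i e(f(X_i; W), y_i)
print(empirical_loss)
```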
Training Models and Loss Functions
      We seek parameters that produce the best possible mapping from
      input to output for the task at hand.
      A loss function or cost function returns a single number describing
      the mismatch between:
          ▶   Model predictions f (X ; W )
          ▶   Ground-truth outputs yi
      We shifted perspective to think of neural networks as computing
      probability distributions p(y | x, W) over the output space.
          ▶   This led to a principled approach for building loss functions.
          ▶   Maximizing the likelihood of the observed data under these
              distributions.
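
To make the connection concrete, here is a hedged sketch (my own illustration, with made-up probabilities) showing that maximizing the likelihood of i.i.d. samples is the same as minimizing the sum of negative log-probabilities, which is the loss form used in the examples that follow.

```python
import numpy as np

# Assumed per-sample predicted probabilities p(y_i | x_i, W) for the observed labels.
probs = np.array([0.9, 0.8, 0.95, 0.7])

likelihood = np.prod(probs)                  # product over i.i.d. samples
neg_log_likelihood = -np.sum(np.log(probs))  # the corresponding loss L[W]

# Maximizing the likelihood is equivalent to minimizing the negative log-likelihood.
assert np.isclose(neg_log_likelihood, -np.log(likelihood))
```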
Example 1: Univariate Regression
      The loss function is given by:
                                     L[W] = -\sum_{i=1}^{N} \log p(y_i \mid x_i, W)
      Taking the conditional probability to be a normal distribution, we
      have:
             \arg\min_W \left[ -\sum_{i=1}^{N} \log \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - f(x_i; W))^2}{2\sigma^2}\right) \right]
                  = \arg\min_W \left[ \sum_{i=1}^{N} (y_i - f(x_i; W))^2 \right]
      Least squares!
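
A small numerical check of this derivation (my own sketch; the targets, predictions, and σ are made up): with σ fixed, the Gaussian negative log-likelihood equals the sum of squared errors up to a constant offset and a positive scale, so both objectives share the same minimizer.

```python
import numpy as np

y = np.array([1.2, -0.3, 0.8])         # targets y_i
pred = np.array([1.0, 0.0, 1.0])       # network outputs f(x_i; W)
sigma = 1.0                            # fixed noise standard deviation

# Gaussian negative log-likelihood, term by term
nll = np.sum(0.5 * np.log(2 * np.pi * sigma**2)
             + (y - pred) ** 2 / (2 * sigma**2))

sse = np.sum((y - pred) ** 2)          # least-squares objective

# nll = const + sse / (2 sigma^2): same argmin over the predictions
const = len(y) * 0.5 * np.log(2 * np.pi * sigma**2)
assert np.isclose(nll, const + sse / (2 * sigma**2))
```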
Example 2: Binary Classification
      The Bernoulli distribution is a suitable probability distribution over
      the domain of such predictions, y ∈ {0, 1}:
                                          p(y \mid \lambda) = (1 - \lambda)^{1-y} \cdot \lambda^{y}
      The neural network can be trained to predict the parameter λ.
                 L[W] = -\sum_{i=1}^{N} \Big((1 - y_i) \log[1 - f(x_i; W)] + y_i \log[f(x_i; W)]\Big)
      Binary cross-entropy loss!
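
A short sketch of the resulting loss (illustrative, with made-up labels and predictions): the Bernoulli negative log-likelihood of labels y_i ∈ {0, 1} under predicted parameters λ_i = f(x_i; W) is exactly the binary cross-entropy.

```python
import numpy as np

y = np.array([1, 0, 1, 1])               # ground-truth labels y_i
lam = np.array([0.9, 0.2, 0.7, 0.6])     # predicted Bernoulli parameters f(x_i; W)

# Negative log-likelihood of the Bernoulli model == binary cross-entropy
bce = -np.sum((1 - y) * np.log(1 - lam) + y * np.log(lam))
print(bce)
```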
Example 3: Multiclass Classification
      The categorical distribution is a suitable one for this domain:
      y ∈ {1, 2, ..., k}
      The neural network should predict k parameters λ_1, ..., λ_k ∈ [0, 1]
      that sum to 1.
      Usually we use the Softmax function in this situation:
                                    \text{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}
      where the z_j are the outputs of the network.
      Multiclass cross-entropy loss!
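
A minimal sketch (illustrative logits and label, not from the slides) of the softmax and the resulting multiclass cross-entropy loss for a single sample with true class y_i.

```python
import numpy as np

z = np.array([2.0, 0.5, -1.0])     # network outputs z_j (logits), k = 3 classes
y = 0                              # true class index for this sample

# Softmax: exponentiate and normalize so the outputs lie in [0, 1] and sum to 1
lam = np.exp(z - z.max()) / np.sum(np.exp(z - z.max()))   # shift for numerical stability
assert np.isclose(lam.sum(), 1.0)

# Multiclass cross-entropy: negative log-probability of the true class
loss = -np.log(lam[y])
print(lam, loss)
```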
Our Main Problem
The Learning Algorithm
      Searching in the hypothesis space
      Next: a course on optimization and how to do it in neural networks.
      The following slides are selected from the CMU 11-785 Deep Learning
      course.