CSE472, CSE386
Artificial Intelligence
  Neural Networks
Prof. Mahmoud Khalil
    Summer 2024
                          1
 Artificial Neural Networks
Artificial neural networks are inspired by brains and neurons
A neural network is a graph with nodes, or units, connected by links
Each link has an associated weight, a real number
Typically, each node i outputs a real number, which is fed as input to the nodes connected to i
The output of a node is a function of the weighted sum of the node’s inputs
                                                                                             2
Basic Concepts
• A neural network maps a set of inputs to a set of outputs.
• The number of inputs/outputs is variable.
• The network itself is composed of an arbitrary number of nodes, or units,
  connected by links, with an arbitrary topology.
• A link from unit i to unit j serves to propagate the activation ai from i to j,
  and it has a weight Wij.
• What can a neural network do?
     Compute a known function / approximate an unknown function
     Pattern recognition / signal processing
     Learn to do any of the above
[Figure: a "Neural Network" box mapping Input 0 … Input n to Output 0 … Output m]
                                                                                                     3
Fully Connected NN
[Figure: a fully connected network, with different types of nodes]
                                      4
Artificial Neuron, Node, or Unit: A Mathematical Abstraction
An artificial neuron (node or unit) is a processing unit i with:
• Input edges, each with a weight (positive or negative; the weights can change over time through learning).
• An input function: the weighted sum of its inputs,
       in_i = Σ_{j=0}^{n} W_{j,i} a_j
• An activation function g, typically non-linear, applied to the input function:
       a_i = g(in_i) = g( Σ_{j=0}^{n} W_{j,i} a_j )
• Output edges, each with a weight (positive or negative; they too can change over time through learning).
In short: a processing element producing an output based on a function of its inputs
                                                                                                                    5
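To make this computation concrete, here is a minimal Python sketch of a single unit (the helper name `unit_output` and the example numbers are illustrative, not from the slides), assuming a sigmoid is chosen for g:

```python
import math

def unit_output(weights, activations, g):
    """One unit i: weighted-sum input function in_i, then activation g(in_i)."""
    in_i = sum(w * a for w, a in zip(weights, activations))  # in_i = sum_j W_{j,i} * a_j
    return g(in_i)

def sigmoid(x):
    # One common non-linear choice for the activation function g.
    return 1.0 / (1.0 + math.exp(-x))

# Example (made-up numbers): a_0 = -1 is a bias input, a_1 and a_2 are real inputs.
a = [-1.0, 0.5, 0.8]      # a_0, a_1, a_2
W = [0.2, 1.0, -0.4]      # W_{0,i}, W_{1,i}, W_{2,i}
print(unit_output(W, a, sigmoid))
```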
Activation Functions
[Figure: plots of the step function, the sign function, and the sigmoid function]
                                                          6
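For reference, the three activation functions named above can be written out directly; a minimal Python sketch (thresholding at 0 is assumed here, which is the usual convention):

```python
import math

def step(x):     # step function: 1 if x >= 0, else 0
    return 1 if x >= 0 else 0

def sign(x):     # sign function: +1 if x >= 0, else -1
    return 1 if x >= 0 else -1

def sigmoid(x):  # smooth, differentiable squashing function with values in (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

for x in (-2.0, 0.0, 2.0):
    print(x, step(x), sign(x), round(sigmoid(x), 3))
```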
Normalizing Unit Thresholds
 If t is the threshold value of the output unit, then
      step_t( Σ_{j=1}^{n} Wj·Ij ) = step_0( Σ_{j=0}^{n} Wj·Ij ),   where W0 = t and I0 = −1
 - We can always assume that the unit's threshold is 0. This allows thresholds to be
   learned like any other weight.
 - We can even allow output values in [0, 1] by replacing step_0 with the sigmoid function.
                                                                                          7
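A small numerical check of this trick, as a sketch: folding the threshold t into an extra weight W0 = t on a constant input I0 = −1 leaves the unit's output unchanged. The numbers below are made up for illustration, and the step function is assumed to fire on "greater than or equal":

```python
def step(x, t=0.0):
    return 1 if x >= t else 0

# Original unit: threshold t applied to the weighted sum of the "real" inputs.
W, I, t = [0.6, 0.9], [1.0, 1.0], 1.2
out_thresholded = step(W[0] * I[0] + W[1] * I[1], t)

# Normalized unit: threshold 0, extra bias input I0 = -1 with weight W0 = t.
W0, I0 = t, -1.0
out_normalized = step(W0 * I0 + W[0] * I[0] + W[1] * I[1], 0.0)

print(out_thresholded, out_normalized)   # identical by construction
```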
Units as Logic Gates
Units with a threshold activation function can act as logic gates;
we can use these units to compute Boolean functions of their inputs.
A threshold unit is activated (outputs 1) when:
      Σ_{j=1}^{n} W_{j,i} a_j  ≥  W_{0,i}
                                                                     8
AND
      x1    x2    output
      0     0       0
      0     1       0
      1     0       0
      1     1       1

Threshold unit: w1 = 1, w2 = 1, bias weight W0 = 1.5 (on bias input −1), inputs x1, x2.
A threshold unit is activated (outputs 1) when:
      Σ_{j=1}^{n} W_{j,i} a_j  ≥  W_{0,i}
                                                                 9
OR
      x1    x2    output
      0     0       0
      0     1       1
      1     0       1
      1     1       1

Threshold unit: w1 = 1, w2 = 1, bias weight w0 = 0.5 (on bias input −1), inputs x1, x2.
A threshold unit is activated (outputs 1) when:
      Σ_{j=1}^{n} W_{j,i} a_j  ≥  W_{0,i}
                                                                 10
NOT
      x1    output
      0       1
      1       0

Threshold unit: w1 = −1, bias weight w0 = −0.5 (on bias input −1), input x1.
A threshold unit is activated (outputs 1) when:
      Σ_{j=1}^{n} W_{j,i} a_j  ≥  W_{0,i}
So, units with a threshold activation function can act as logic gates
given the appropriate input and bias weights (see the sketch below).
                                                                                  11
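Putting the last three slides together, here is a minimal sketch of AND, OR, and NOT as threshold units. The AND and OR weights are taken from the slides; the NOT weights (w1 = −1, w0 = −0.5) are assumed standard values, since the slide's figure is truncated:

```python
def threshold_unit(weights, inputs, w0):
    """Fires (outputs 1) when the weighted sum of the inputs reaches the bias weight w0."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) >= w0 else 0

AND = lambda x1, x2: threshold_unit([1, 1], [x1, x2], 1.5)   # weights from the AND slide
OR  = lambda x1, x2: threshold_unit([1, 1], [x1, x2], 0.5)   # weights from the OR slide
NOT = lambda x1:     threshold_unit([-1],   [x1],    -0.5)   # assumed values for NOT

for x1 in (0, 1):
    for x2 in (0, 1):
        print("AND/OR", x1, x2, "->", AND(x1, x2), OR(x1, x2))
    print("NOT", x1, "->", NOT(x1))
```

Each gate is the same threshold unit; only the weights and the bias weight differ, which is exactly the point of the slide.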
Network Structures
• Feed-forward networks
   • Activation flows from input layer to output layer
   • single-layer perceptrons
   • multi-layer perceptrons
   • Feed-forward networks implement functions,
     have no internal state (only weights).
• Recurrent networks
   • Feed the outputs back into the network's own inputs
       The network is a dynamical system
       (stable states, oscillations, chaotic behavior)
       The response of the network depends on its initial state
   • Can support short-term memory
   • More difficult to understand
                                                           12
    Feed-Forward Network
[Figure: a feed-forward network with two input units, two hidden units, and one output unit; the bias units are omitted for simplicity]
Each unit receives input only from units in the immediately preceding layer.
Given an input vector x = (x1, x2), the activations of the input units are set to the values of the
input vector, i.e., (a1, a2) = (x1, x2), and the network computes the output unit's activation from
the hidden activations, which are themselves functions of the inputs. For example, with input units
1 and 2, hidden units 3 and 4, and output unit 5:
      a5 = g( W3,5·a3 + W4,5·a4 ) = g( W3,5·g(W1,3·a1 + W2,3·a2) + W4,5·g(W1,4·a1 + W2,4·a2) )
By adjusting the weights we get different functions:
that is how learning is done in neural networks! (A small sketch of this computation follows below.)
                                                                                                                  13
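A minimal sketch of this 2-2-1 computation, assuming a sigmoid activation g and arbitrary illustrative weights (the unit numbering 1 to 5 follows the description above):

```python
import math

def g(x):                      # sigmoid activation, one common choice
    return 1.0 / (1.0 + math.exp(-x))

def feed_forward(x1, x2, W):
    """2-2-1 feed-forward network (bias units omitted for simplicity).
    W is a dict of link weights, e.g. W[(1, 3)] is the weight on the link 1 -> 3."""
    a1, a2 = x1, x2                                    # input units pass the inputs through
    a3 = g(W[(1, 3)] * a1 + W[(2, 3)] * a2)            # hidden unit 3
    a4 = g(W[(1, 4)] * a1 + W[(2, 4)] * a2)            # hidden unit 4
    a5 = g(W[(3, 5)] * a3 + W[(4, 5)] * a4)            # output unit 5
    return a5

# Arbitrary example weights; changing them changes the function the network computes.
W = {(1, 3): 0.5, (2, 3): -1.0, (1, 4): 1.5, (2, 4): 0.3, (3, 5): 2.0, (4, 5): -1.0}
print(feed_forward(1.0, 0.0, W))
```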
Single Layer Feed-Forward Networks (perceptron)
  • Single-layer neural network (perceptron network):
     A network with all the inputs connected directly to the outputs.
          – Output units all operate separately: no shared weights.
     Since each output unit is independent of the others,
     we can limit our study to single-output perceptrons.
                                                                                   14
Perceptron Learning Intuition
  • Weight update
  • Inputs Ij (j = 1, 2, …, n)
  • Single output O; target output T.
  • Consider some initial weights.
  • Define the example error: Err = T − O
  • Now just move the weights in the right direction!
  • If the error is positive, then we need to increase O:
           Err > 0  ⇒  need to increase O;
  • If the error is negative, then we need to decrease O:
           Err < 0  ⇒  need to decrease O;
  • Each input unit j contributes Wj·Ij to the total input:
           if Ij is positive, increasing Wj tends to increase O;
           if Ij is negative, increasing Wj tends to decrease O;
  • So we use the update rule   Wj ← Wj + α · Ij · Err,   where α is the learning rate (see the sketch below).
  [Figure: a perceptron with inputs Ij, weights Wj, and output O]
                                                                                               15
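A one-step sketch of this update rule (the function name, the example numbers, and the choice α = 0.1 are illustrative, not from the slides):

```python
def perceptron_update(W, I, target, alpha=0.1):
    """One perceptron weight update: W_j <- W_j + alpha * I_j * Err, with Err = T - O."""
    O = 1 if sum(w * i for w, i in zip(W, I)) > 0 else 0   # current perceptron output
    err = target - O                                       # > 0: O too small, < 0: O too large
    return [w + alpha * i * err for w, i in zip(W, I)]

# Example: the output is 0 but the target is 1, so the weights on active inputs increase.
print(perceptron_update([0.0, -0.2, -0.1], [1, 1, 0], target=1))
```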
Perceptron Learning: Example
 •   Let's consider an example.
 •   Framework and notation:
 •   0/1 signals
 •   Input vector:
           X = ⟨x0, x1, x2, …, xn⟩
 •   Weight vector:
           W = ⟨w0, w1, w2, …, wn⟩
 • The fixed input x0 = 1 with its weight w0 simulates the threshold (the threshold is −w0).
 • O is the output (0 or 1) (single output).
 • Learning rate = 1.
 • Threshold function:
           S = Σ_{k=0}^{n} wk·xk ;  if S > 0 then O = 1, else O = 0
                                                                                          16
Perceptron Learning: Example
(Recall:  Err = T − O  and  Wj ← Wj + α · Ij · Err.)
• Set of training examples, each example is a pair (xi, yi),
  i.e., an input vector and a label y (0 or 1).
• Learning procedure, called the "error-correcting method":
• Start with the all-zero weight vector.
• Cycle (repeatedly) through the training examples, and for each example do:
     • If the perceptron outputs 0 while it should be 1,
              add the input vector to the weight vector.
     • If the perceptron outputs 1 while it should be 0,
              subtract the input vector from the weight vector.
     • Otherwise do nothing.
This is intuitively correct (e.g., if the output is 0 but it should be 1, the weights are
increased). A runnable sketch of this procedure follows below.
                                                                                                     17
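Here is a runnable sketch of the error-correcting method applied to the OR examples worked through on the following slides (learning rate 1, threshold S > 0 as above); it reproduces the epoch-by-epoch trace and ends with W = <0, 1, 1>:

```python
# Error-correcting perceptron learning for OR, following the slides:
# x0 = 1 is the bias input, O = 1 if S = sum_k w_k * x_k > 0 else 0, learning rate 1.
examples = [
    ((1, 0, 0), 0),   # example 1
    ((1, 0, 1), 1),   # example 2
    ((1, 1, 0), 1),   # example 3
    ((1, 1, 1), 1),   # example 4
]

W = [0, 0, 0]                                  # start with the all-zero weight vector
for epoch in range(1, 11):                     # a few epochs are enough for this toy problem
    errors = 0
    for x, label in examples:
        S = sum(w * xi for w, xi in zip(W, x))
        O = 1 if S > 0 else 0
        if O != label:                         # W <- W + (label - O) * x, i.e. add or subtract x
            W = [w + (label - O) * xi for w, xi in zip(W, x)]
            errors += 1
    print("epoch", epoch, "weights", W)
    if errors == 0:                            # converged: every example classified correctly
        break
# The final weights <0, 1, 1> match the trace in the slides.
```

Adding or subtracting the whole input vector is just the rule Wj ← Wj + α·Ij·Err with α = 1 and 0/1 inputs.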
Perceptron Learning: Example
 • Consider learning the logical OR function.
 • Our examples are:
 • Training samples:

                  x0   x1   x2   label
          1        1    0    0     0
          2        1    0    1     1
          3        1    1    0     1
          4        1    1    1     1

     Activation function:  S = Σ_{k=0}^{n} wk·xk ;  if S > 0 then O = 1, else O = 0
                                                                                        18
Perceptron Learning: Example
(Recall:  S = Σ_{k=0}^{n} wk·xk ;  if S > 0 then O = 1, else O = 0.
 Error-correcting method: if the perceptron outputs 0 while it should be 1, add the input vector
 to the weight vector; if it outputs 1 while it should be 0, subtract the input vector from the
 weight vector; otherwise do nothing.)
• We'll use a single perceptron with three inputs I0, I1, I2, weights w0, w1, w2, and output O.
• We'll start with all weights 0: W = <0,0,0>

• Example 1:  I = <1,0,0>, label = 0, W = <0,0,0>
• Perceptron: 1·0 + 0·0 + 0·0 = 0, so S = 0 ⇒ output 0
•      It classifies it as 0, so correct; do nothing.

• Example 2:  I = <1,0,1>, label = 1, W = <0,0,0>
• Perceptron: 1·0 + 0·0 + 1·0 = 0 ⇒ output 0
•      It classifies it as 0, while it should be 1, so we add the input to the weights:
•              W = <0,0,0> + <1,0,1>= <1,0,1>                                                        19
Perceptron Learning: Example
 • Example 3:  I = <1,1,0>, label = 1, W = <1,0,1>
 • Perceptron: 1·1 + 1·0 + 0·1 > 0 ⇒ output = 1
 •      It classifies it as 1, correct; do nothing.
 •             W = <1,0,1>

 • Example 4:  I = <1,1,1>, label = 1, W = <1,0,1>
 • Perceptron: 1·1 + 1·0 + 1·1 > 0 ⇒ output = 1
 •      It classifies it as 1, correct; do nothing.
 •             W = <1,0,1>
                                                                      20
Perceptron Learning: Example
 • Epoch 2, through the examples, W = <1,0,1>.

 • Example 1:  I = <1,0,0>, label = 0, W = <1,0,1>
 • Perceptron: 1·1 + 0·0 + 0·1 > 0 ⇒ output 1
 •      It classifies it as 1, while it should be 0, so subtract the input from the weights:
 •             W = <1,0,1> − <1,0,0> = <0, 0, 1>

 • Example 2:  I = <1,0,1>, label = 1, W = <0,0,1>
 • Perceptron: 1·0 + 0·0 + 1·1 > 0 ⇒ output 1
 •      It classifies it as 1, so correct; do nothing.
                                                                                         21
Perceptron Learning: Example
 • Example 3:  I = <1,1,0>, label = 1, W = <0,0,1>
 • Perceptron: 1·0 + 1·0 + 0·1 = 0 ⇒ output = 0
 •      It classifies it as 0, while it should be 1, so add the input to the weights:
 •              W = I + W = <1, 1, 1>

 • Example 4:  I = <1,1,1>, label = 1, W = <1,1,1>
 • Perceptron: 1·1 + 1·1 + 1·1 > 0 ⇒ output = 1
 •      It classifies it as 1, correct; do nothing.
 •              W = <1,1,1>
                                                                                22
Perceptron Learning: Example
• Epoch 3, through the examples, W = <1,1,1>.

• Example 1:  I = <1,0,0>, label = 0, W = <1,1,1>
• Perceptron: 1·1 + 0·1 + 0·1 > 0 ⇒ output 1
•       It classifies it as 1, while it should be 0, so subtract the input from the weights:
•              W = <1,1,1> − <1,0,0> = <0, 1, 1>

• Example 2:  I = <1,0,1>, label = 1, W = <0, 1, 1>
• Perceptron: 1·0 + 0·1 + 1·1 > 0 ⇒ output 1
•      It classifies it as 1, so correct; do nothing.
                                                                                      23
Perceptron Learning: Example
 • Example 3:  I = <1,1,0>, label = 1, W = <0, 1, 1>
 • Perceptron: 1·0 + 1·1 + 0·1 > 0 ⇒ output = 1
 •     It classifies it as 1, correct; do nothing.
 • Example 4:  I = <1,1,1>, label = 1, W = <0, 1, 1>
 • Perceptron: 1·0 + 1·1 + 1·1 > 0 ⇒ output = 1
 •     It classifies it as 1, correct; do nothing.
                                                          24
Perceptron Learning: Example
 • Epoch 4, through the examples, W = <0, 1, 1>.

 • Example 1:  I = <1,0,0>, label = 0, W = <0,1,1>
 • Perceptron: 1·0 + 0·1 + 0·1 = 0 ⇒ output 0
 •      It classifies it as 0, so correct; do nothing.

   So the final weight vector W = <0, 1, 1> classifies all examples correctly, and the
   perceptron has learned the function!
   [Figure: the learned perceptron computing OR, with W0 = 0, W1 = 1, W2 = 1]

 Aside: in more realistic cases the bias (W0) will not be 0.
 (This was just a toy example!)
 Also, in general, there are many more inputs (100 to 1,000).
                                                                                    25
Perceptron Learning: Example

       Epoch        x0   x1   x2   Target   w0   w1   w2   Output   Error   New w0   New w1   New w2
     1  example 1    1    0    0     0       0    0    0      0        0       0        0        0
        example 2    1    0    1     1       0    0    0      0        1       1        0        1
        example 3    1    1    0     1       1    0    1      1        0       1        0        1
        example 4    1    1    1     1       1    0    1      1        0       1        0        1
     2               1    0    0     0       1    0    1      1       -1       0        0        1
                     1    0    1     1       0    0    1      1        0       0        0        1
                     1    1    0     1       0    0    1      0        1       1        1        1
                     1    1    1     1       1    1    1      1        0       1        1        1
     3               1    0    0     0       1    1    1      1       -1       0        1        1
                     1    0    1     1       0    1    1      1        0       0        1        1
                     1    1    0     1       0    1    1      1        0       0        1        1
                     1    1    1     1       0    1    1      1        0       0        1        1
     4               1    0    0     0       0    1    1      0        0       0        1        1
                                                                                         26
Expressiveness of Perceptron
       What hypothesis space can a perceptron represent?
      Even more complex Boolean functions, such as the majority function.
      But can it represent any arbitrary Boolean function?
                                                                        27
Expressiveness of Perceptron
A threshold perceptron returns 1 iff the weighted sum of its inputs
(including the bias) is positive, i.e., iff
      Σ_{j=0}^{n} Wj·Ij > 0   (equivalently, W · I > 0),
i.e., iff the input is on one side of the hyperplane it defines.
         Perceptron ⇒ linear separator:
         a linear discriminant function, or linear decision surface.
         The weights determine the slope and the bias determines the offset.
                                                                      28
Linear Separability
   Consider example with two inputs, x1, x2:
   [Figure: positive (+) and negative examples scattered in the (x1, x2) plane, separated by a line]
   We can view the trained network as defining a "separation line".
   What is its equation?
         w0 + w1·x1 + w2·x2 = 0
         ⇒  x2 = −(w1/w2)·x1 − w0/w2
   Perceptron used for classification.
                                                                                  29
Linear Separability
   [Figure: the OR function plotted in the (x1, x2) plane (linearly separable)]
                                         30
Linear Separability
   [Figure: the AND function plotted in the (x1, x2) plane (linearly separable)]
                                          31
Linear Separability
   [Figure: the XOR function plotted in the (x1, x2) plane]
                                          32
Linear Separability
   [Figure: the XOR function plotted in the (x1, x2) plane: not linearly separable]
   Minsky & Papert (1969)
   Bad news: Perceptrons can only represent linearly separable functions.
                                                                                       33
    Linear Separability XOR
• Consider a threshold perceptron for the logical XOR function (two inputs),
  which outputs 1 when  w1·x1 + w2·x2 > T.
• Our examples are:

          x1    x2    label
   1       0     0      0
   2       1     0      1
   3       0     1      1
   4       1     1      0

• Given our examples, we have the following inequalities for the perceptron:
     • From (1):  0 + 0 ≤ T
     • From (2):  w1 + 0 > T
     • From (3):  0 + w2 > T
     • From (4):  w1 + w2 ≤ T   (contradiction: (2) and (3) give w1 + w2 > 2T, and (1) gives T ≥ 0, so w1 + w2 > T)
                          So, XOR is not linearly separable.
                                                                                                    34
Convergence of Perceptron Learning Algorithm
    Perceptron converges to a consistent function, if…
      • … training data linearly separable
      • … step size α sufficiently small
      • … no “hidden” units
                                                         35
Non Linear Classifiers
       • The XOR problem
                      x1   x2   XOR   Class
                       0   0     0     B
                       0   1     1     A
                       1   0     1     A
                       1   1     0     B
                                              36
Non Linear Classifiers
       • There is no single line (hyperplane) that separates class A from class B.
         In contrast, the AND and OR operations are linearly separable problems.
                                                                       37
The Two‐Layer Perceptron
  • For the XOR problem, draw two lines instead of one.
                            38
The Two‐Layer Perceptron
  • Then class B is located outside the shaded area and class A inside. This is a
    two‐phase design.
      • Phase 1: Draw two lines (hyperplanes), g1(x) = 0 and g2(x) = 0.
        Each of them is realized by a perceptron. The outputs of the perceptrons will be
                   yi = f(gi(x)) ∈ {0, 1},   i = 1, 2,
        depending on the position of x (i.e., on which side of each line it falls).
     • Phase 2: Find the position of x w.r.t. both lines, based on the values of y1,
       y2.
                                                                                       39
The Two‐Layer Perceptron
                              1st phase           2nd phase
                    x1    x2      y1      y2
                     0     0     0(−)    0(−)       B(0)
                     0     1     1(+)    0(−)       A(1)
                     1     0     1(+)    0(−)       A(1)
                     1     1     1(+)    1(+)       B(0)
   • Equivalently: the computations of the first phase perform a mapping
                              x → y = [y1, y2]^T
                                                                          40
The Two‐Layer Perceptron
 The decision is now performed on the transformed y data:
       g(y) = 0
 This can be performed via a second line, which can also be realized by a perceptron.
                                                                                        41
The Two‐Layer Perceptron
        • Computations of the first phase perform a mapping
          that transforms the nonlinearly separable problem to a
          linearly separable one.
         • The architecture: [Figure: the resulting two-layer network]
                                                                   42
The Two‐Layer Perceptron
   • This is known as the two-layer perceptron, with one hidden and one output layer.
     The activation functions are step functions:
                  f(·) ∈ {0, 1}
   • The neurons (nodes) of the figure realize the following lines (hyperplanes),
     as verified by the sketch below:
                  g1(x) = x1 + x2 − 1/2 = 0
                  g2(x) = x1 + x2 − 3/2 = 0
                  g(y)  = y1 − 2·y2 − 1/2 = 0
                                                                         43
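A short sketch verifying that these three lines implement XOR. The step convention f(z) = 1 for z ≥ 0 is an assumption; no example falls exactly on a boundary, so a strict inequality would give the same outputs:

```python
def f(z):
    """Step activation: 1 when the argument is non-negative, 0 otherwise (assumed convention)."""
    return 1 if z >= 0 else 0

def two_layer_xor(x1, x2):
    # First (hidden) phase: the two lines g1 and g2 from the slide.
    y1 = f(x1 + x2 - 0.5)        # g1(x) = x1 + x2 - 1/2
    y2 = f(x1 + x2 - 1.5)        # g2(x) = x1 + x2 - 3/2
    # Second (output) phase: a single line in the transformed (y1, y2) space.
    return f(y1 - 2 * y2 - 0.5)  # g(y)  = y1 - 2*y2 - 1/2

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", two_layer_xor(x1, x2))   # reproduces XOR: 0, 1, 1, 0
```

The hidden layer maps the four inputs to the three points (0,0), (1,0), (1,1) in y-space, where class A and class B are linearly separable, which is exactly the two-phase idea of the preceding slides.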