Basic Design Approaches To Accelerating Deep Neural Networks
Rangharajan Venkatesan
NVIDIA Corporation
email: rangharajanv@nvidia.com

Research Interests
• Machine Learning Accelerators
• High-Level Synthesis
• Low-Power VLSI Design
• SoC Design Methodologies
• Focus on inference
  • Most of the techniques are generic and applicable to training as well
• This tutorial covers:
  • Key metrics
  • Design considerations
  • Hardware optimizations
  • Hardware/software co-design techniques
[Figure: growth in data and the demand for efficient compute. Bianco et al., IEEE Access, 2018; Ack: Bill Dally, GTC China, 2020; Ack: Anand Raghunathan, Purdue University; Ref: "Showdown", The Economist, 19 Nov. 2016.]
Different platforms:
• Programmable processors
• Reconfigurable FPGAs: leverage the reconfigurability of the FPGA to accelerate a specific neural network
• Fixed-function accelerators
These platforms span a wide range of performance.
Source: https://nicsefc.ee.tsinghua.edu.cn/projects/neural-network-accelerator/
[Figure: Deep Learning is a subset of Machine Learning, which is a subset of Artificial Intelligence.]
[Figure: neural network with an input layer, hidden layers, and an output layer.]

• Activation layer
  • ReLU, tanh, sigmoid, Leaky ReLU, Clipped ReLU, Swish
• Pooling layer
  • Max pooling, average pooling, unpooling
• Fully-connected layer
Convolution Layer

A filter (R×S) slides over the input fmap (H×W); at each output position, element-wise multiplication is followed by partial-sum (psum) accumulation:

    for e = [0:E)
      for f = [0:F)
        for r = [0:R)
          for s = [0:S)
            Out[e][f] += Weight[r][s] * Input[e+r][f+s]
With many input channels (C), partial sums are also accumulated across channels:

    for e = [0:E)
      for f = [0:F)
        for c = [0:C)
          for r = [0:R)
            for s = [0:S)
              Out[e][f] += Weight[r][s][c] * Input[e+r][f+s][c]
With many output channels (M), each output channel has its own filter:

    for m = [0:M)
      for e = [0:E)
        for f = [0:F)
          for c = [0:C)
            for r = [0:R)
              for s = [0:S)
                Out[e][f][m] += Weight[r][s][c][m] * Input[e+r][f+s][c]
With many input fmaps (batch size N), producing many output fmaps:

    for n = [0:N)
      for m = [0:M)
        for e = [0:E)
          for f = [0:F)
            for c = [0:C)
              for r = [0:R)
                for s = [0:S)
                  Out[e][f][m][n] += Weight[r][s][c][m] * Input[e+r][f+s][c][n]

(Dimensions: input H×W×C×N, filters R×S×C×M, output E×F×M×N.)
Sze et al., Synthesis Lectures on Computer Architecture, 2020
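A minimal runnable sketch of the full loop nest in Python/NumPy (dimension values and array names are illustrative; stride-1, no padding, as in the loop nest above):

    import numpy as np

    # Illustrative sizes: batch N, input channels C, input HxW,
    # filter RxS, output channels M, output ExF (stride 1, no padding).
    N, C, H, W, R, S, M = 2, 3, 8, 8, 3, 3, 4
    E, F = H - R + 1, W - S + 1

    inp = np.random.rand(N, C, H, W).astype(np.float32)
    wgt = np.random.rand(M, C, R, S).astype(np.float32)
    out = np.zeros((N, M, E, F), dtype=np.float32)

    # Direct translation of the 7-deep convolution loop nest.
    for n in range(N):
        for m in range(M):
            for e in range(E):
                for f in range(F):
                    for c in range(C):
                        for r in range(R):
                            for s in range(S):
                                out[n, m, e, f] += wgt[m, c, r, s] * inp[n, c, e + r, f + s]

    # Cross-check against a vectorized formulation.
    windows = np.lib.stride_tricks.sliding_window_view(inp, (R, S), axis=(2, 3))
    ref = np.einsum('ncefrs,mcrs->nmef', windows, wgt)
    assert np.allclose(out, ref, atol=1e-3)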
Activation Layer
• Introduces non-linearity into the network (common choices sketched below)
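A minimal sketch of a few of these activation functions in Python/NumPy (function names are mine, for illustration):

    import numpy as np

    def relu(x):
        # max(0, x), element-wise
        return np.maximum(0.0, x)

    def leaky_relu(x, alpha=0.01):
        # keeps a small slope alpha for negative inputs
        return np.where(x > 0, x, alpha * x)

    def clipped_relu(x, ceiling=6.0):
        # ReLU with an upper bound (e.g., ReLU6)
        return np.minimum(np.maximum(0.0, x), ceiling)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def swish(x):
        # x * sigmoid(x)
        return x * sigmoid(x)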
Pooling Layer

Max pooling example (2×2 window, stride = 2); each output is the maximum of its window, e.g. Max(1, 2, 4, 6) = 6:

    1  2  2  3
    4  6  5  8     2x2 max pooling       6  8
    3  1  4  4     with stride = 2       3  4
    2  1  3  3
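The example above can be checked with a short Python/NumPy snippet (the reshape trick assumes the window size divides the input size):

    import numpy as np

    x = np.array([[1, 2, 2, 3],
                  [4, 6, 5, 8],
                  [3, 1, 4, 4],
                  [2, 1, 3, 3]])

    k = 2  # 2x2 window, stride 2
    pooled = x.reshape(x.shape[0] // k, k, x.shape[1] // k, k).max(axis=(1, 3))
    print(pooled)  # [[6 8]
                   #  [3 4]]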
Fully-Connected Layer
[Figure: weight matrix (M×C) multiplied by input vector (C) produces output vector (M).]

    for m = [0:M)
      for c = [0:C)
        Out[m] += Weight[m][c] * Input[c]
Matrix-Vector Multiplication
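The loop nest is exactly a matrix-vector product; a one-line NumPy equivalent (shapes illustrative):

    import numpy as np

    M, C = 4, 8
    weight = np.random.rand(M, C)
    x = np.random.rand(C)

    out = weight @ x  # Out[m] = sum over c of Weight[m][c] * Input[c]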
The deep learning stack, from networks down to hardware:

Neural Networks: ResNet, MobileNet, BERT
DL Framework: PyTorch, TensorFlow, Caffe
Compiler: TVM, TimeLoop, ZigZag

Co-design across different levels for efficient hardware.
Key Metrics
• Energy efficiency
  • Energy/inference, TOPS/W
• Area efficiency
  • Inferences/sec/mm², TOPS/mm²
• Flexibility
  • Support for different types of neural networks and layers
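A toy worked example of how these metrics relate (all numbers hypothetical):

    # Hypothetical accelerator: 8 TOPS peak at 4 W in 10 mm^2,
    # running a model that needs 4e9 ops per inference.
    tops, watts, area_mm2, ops_per_inf = 8.0, 4.0, 10.0, 4e9

    energy_eff = tops / watts                # 2.0 TOPS/W
    area_eff = tops / area_mm2               # 0.8 TOPS/mm^2
    inf_per_sec = tops * 1e12 / ops_per_inf  # 2000 inferences/s at peak
    energy_per_inf = watts / inf_per_sec     # 0.002 J = 2 mJ per inference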
• System
  • Array of PEs
  • Global buffer
  • Controller
  • DRAM
MVA vs. MMA (efficiency increases from MVA to MMA):
• MVA: no spatial reuse, high control overheads
• MMA: high spatial reuse, low control overheads, but high effort to achieve good utilization for some layer types
Memory hierarchy: PE scratchpads → Global Buffer → DRAM (large capacity, high latency, high energy)
• Examples: Venkatesan et al., HotChips 2019; Zimmer et al., JSSC 2020; Google TPU; NVDLA
Dataflows: Temporal Data Reuse
• Output-Stationary (OS) Dataflow (loop-nest sketch below)
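A minimal loop-nest sketch of an output-stationary dataflow in Python (illustrative): each output element is held in a local accumulator until all of its partial sums are formed, so psums never leave the PE:

    import numpy as np

    E, F, C, R, S = 4, 4, 3, 3, 3
    inp = np.random.rand(C, E + R - 1, F + S - 1)
    wgt = np.random.rand(C, R, S)
    out = np.zeros((E, F))

    for e in range(E):
        for f in range(F):
            acc = 0.0  # psum stays stationary in a PE-local register
            for c in range(C):
                for r in range(R):
                    for s in range(S):
                        acc += wgt[c, r, s] * inp[c, e + r, f + s]
            out[e, f] = acc  # each output written exactly once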
[Evaluation: Network: ResNet-50; Dataset: ImageNet; Technology: 16ff.]
[Figure: datapath fed by Buffer 1/Buffer 2, filled from lower-level memory (double buffering).]
• Buffet achieves a 2.3X reduction in energy-delay product (EDP) and a 2.1X gain in area efficiency over DMA with double buffering (baseline sketched below)
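For reference, a minimal ping-pong sketch in Python of the double-buffering baseline (structure only; names are hypothetical): the next tile is fetched into one buffer while the datapath consumes the other:

    def double_buffered_process(tiles, fill, compute):
        # fill(tile) models a DMA transfer; compute(buf) consumes one tile.
        buffers = [None, None]
        buffers[0] = fill(tiles[0])  # prefetch the first tile
        for i in range(len(tiles)):
            if i + 1 < len(tiles):
                # In hardware this fill overlaps with the compute below.
                buffers[(i + 1) % 2] = fill(tiles[i + 1])
            compute(buffers[i % 2])

    double_buffered_process(list(range(4)), fill=lambda t: t * 10, compute=print)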
Dataflows: Spatial Data Reuse

[Figure: array of PEs connected through an interconnect to memory, exchanging Weights, Input Activations, Output Activations, and Partial Sums. Four distribution patterns:]

Pattern-1
• Unicast Weights
• Multicast Input activations
• Unicast Output activations

Pattern-2
• Multicast Weights
• Unicast Input activations
• Unicast Output activations

Pattern-3
• Unicast Weights
• Unicast Input activations
• Unicast Partial sums
• Unicast Output activations

Pattern-4
• Unicast Weights
• Unicast Input activations (PEs receive input activations from different input channels (C))
• Unicast Partial sums
• Unicast Output activations
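A minimal sketch of Pattern-1 in Python/NumPy (shapes illustrative): the same input activations are multicast to every PE, while each PE holds its own (unicast) weights and produces its own (unicast) output:

    import numpy as np

    num_pes, C = 4, 8
    x = np.random.rand(C)                            # multicast to all PEs
    w = [np.random.rand(C) for _ in range(num_pes)]  # one filter per PE (unicast)

    # Each PE computes one output activation; outputs return unicast to memory.
    out = np.array([w[p] @ x for p in range(num_pes)])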
[Figure: scale-up via a Network-on-Chip within a die and a Network-on-Package across dies. Venkatesan et al., HotChips 2019; Zimmer et al., JSSC 2020.]
• Opportunities
  • Data reuse
  • Parallelism
  • Pipelining
• Example (sketched below)
  • An architecture implementing weight-stationary dataflow
  • Tile weights and distribute them to different PEs (PE1, PE2, PE3)
  • Compute different output activations by streaming in the input activations
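A minimal sketch of this weight-stationary example in Python/NumPy (all sizes illustrative): each PE keeps its weight tile resident and reuses it for every input activation that streams past:

    import numpy as np

    num_pes, C_per_pe, T = 3, 4, 16  # 3 PEs, 4 channels per tile, 16 time steps
    # Weight tiles are loaded once per PE and reused across the whole stream.
    pe_weights = [np.random.rand(C_per_pe) for _ in range(num_pes)]

    outputs = np.zeros((num_pes, T))
    for t in range(T):                           # stream input activations
        x_t = np.random.rand(num_pes, C_per_pe)  # activation slices at time t
        for p in range(num_pes):
            outputs[p, t] = pe_weights[p] @ x_t[p]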
[Figure: alternative tilings mapped onto 4×4 PE arrays.]

• Large number of possible tilings for a given layer and hardware configuration
• >10x difference in performance and energy across tilings
• Need to explore the tiling space to achieve the best energy and performance (see the sketch below)
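A tiny illustration of why the tiling space is large (layer sizes illustrative): even counting only divisor-based tile sizes for four loop dimensions of one layer gives over a thousand options, before loop order or multi-level tiling is considered:

    from math import prod

    def divisors(n):
        return [d for d in range(1, n + 1) if n % d == 0]

    # One ResNet-50-like layer: C=256, M=512, E=F=14 (illustrative).
    dims = {'C': 256, 'M': 512, 'E': 14, 'F': 14}
    num_tilings = prod(len(divisors(n)) for n in dims.values())
    print(num_tilings)  # 9 * 10 * 4 * 4 = 1440 tile-size choices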
[Figure: accuracy vs. hardware cost; small accuracy loss for a large efficiency gain.]

Quantization
• Post-training quantization: Pre-trained Model → Quantization → Quantized Model
• Quantization with re-training: Pre-trained Model → Quantization → Quantized Model, with a Re-Training loop to recover accuracy
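A minimal post-training quantization sketch in Python/NumPy (symmetric, per-tensor, 8-bit; all choices illustrative):

    import numpy as np

    def quantize_int8(w):
        scale = np.abs(w).max() / 127.0  # map the largest |w| to the int8 range
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        return q.astype(np.float32) * scale

    w = np.random.randn(64, 64).astype(np.float32)  # stand-in pre-trained weights
    q, scale = quantize_int8(w)
    w_hat = dequantize(q, scale)
    print(np.abs(w - w_hat).max())  # small quantization error per weight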
Source: https://nicsefc.ee.tsinghua.edu.cn/projects/neural-network-accelerator/
Venkatesan et al., ICCAD 2019
Summary
• Deep neural networks are increasingly used across a wide range of applications
  • Large amounts of data
  • High computation demand