Basic Design Approaches To Accelerating Deep Neural Networks
Rangharajan Venkatesan
NVIDIA Corporation
email: rangharajanv@nvidia.com

Research Interests
• Machine Learning Accelerators
• High-Level Synthesis
• Low-Power VLSI Design
• SoC Design Methodologies
• Focus on inference
  • Most of the techniques are generic and applicable to training as well
• This tutorial covers:
  • Key metrics
  • Design considerations
  • Hardware optimizations
  • Hardware/software co-design techniques
[Figure: growth in data and the demand for efficient compute. Bianco et al., IEEE Access, 2018; Ack: Bill Dally, GTC China, 2020; Ack: Anand Raghunathan, Purdue University; Ref: "Showdown", The Economist, 19 Nov. 2016.]
Different platforms:
• Programmable processors
• Reconfigurable FPGAs: leverage the reconfigurability of the FPGA to accelerate a specific neural network
• Fixed-function accelerators
These platforms span a wide range of performance.
Source: https://nicsefc.ee.tsinghua.edu.cn/projects/neural-network-accelerator/
[Figure: Deep Learning is a subset of Machine Learning, which is a subset of Artificial Intelligence.]
[Figure: neural network with an input layer, hidden layers, and an output layer.]

• Activation layer
  • ReLU, tanh, sigmoid, Leaky ReLU, Clipped ReLU, Swish
• Pooling layer
  • Max pooling, average pooling, unpooling
• Fully-connected layer
Convolution Layer

A filter (R×S) slides over the input fmap (H×W); at each output position, element-wise multiplication is followed by partial-sum (psum) accumulation:

    for e = [0:E)
      for f = [0:F)
        for r = [0:R)
          for s = [0:S)
            Out[e][f] += Weight[r][s] * Input[e+r][f+s]
With many input channels (C), partial sums are also accumulated across channels:

    for e = [0:E)
      for f = [0:F)
        for c = [0:C)
          for r = [0:R)
            for s = [0:S)
              Out[e][f] += Weight[r][s][c] * Input[e+r][f+s][c]
With many output channels (M), each output channel has its own filter:

    for m = [0:M)
      for e = [0:E)
        for f = [0:F)
          for c = [0:C)
            for r = [0:R)
              for s = [0:S)
                Out[e][f][m] += Weight[r][s][c][m] * Input[e+r][f+s][c]
With many input fmaps (batch size N), producing many output fmaps:

    for n = [0:N)
      for m = [0:M)
        for e = [0:E)
          for f = [0:F)
            for c = [0:C)
              for r = [0:R)
                for s = [0:S)
                  Out[e][f][m][n] += Weight[r][s][c][m] * Input[e+r][f+s][c][n]

(Dimensions: input H×W×C×N, filters R×S×C×M, output E×F×M×N.)
Sze et al., Synthesis Lectures on Computer Architecture, 2020
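A minimal runnable sketch of the full loop nest in Python/NumPy (dimension values and array names are illustrative; stride-1, no padding, as in the loop nest above):

    import numpy as np

    # Illustrative sizes: batch N, input channels C, input HxW,
    # filter RxS, output channels M, output ExF (stride 1, no padding).
    N, C, H, W, R, S, M = 2, 3, 8, 8, 3, 3, 4
    E, F = H - R + 1, W - S + 1

    inp = np.random.rand(N, C, H, W).astype(np.float32)
    wgt = np.random.rand(M, C, R, S).astype(np.float32)
    out = np.zeros((N, M, E, F), dtype=np.float32)

    # Direct translation of the 7-deep convolution loop nest.
    for n in range(N):
        for m in range(M):
            for e in range(E):
                for f in range(F):
                    for c in range(C):
                        for r in range(R):
                            for s in range(S):
                                out[n, m, e, f] += wgt[m, c, r, s] * inp[n, c, e + r, f + s]

    # Cross-check against a vectorized formulation.
    windows = np.lib.stride_tricks.sliding_window_view(inp, (R, S), axis=(2, 3))
    ref = np.einsum('ncefrs,mcrs->nmef', windows, wgt)
    assert np.allclose(out, ref, atol=1e-3)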
Activation Layer
• Introduces non-linearity into the network (common choices sketched below)
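A minimal sketch of a few of these activation functions in Python/NumPy (function names are mine, for illustration):

    import numpy as np

    def relu(x):
        # max(0, x), element-wise
        return np.maximum(0.0, x)

    def leaky_relu(x, alpha=0.01):
        # keeps a small slope alpha for negative inputs
        return np.where(x > 0, x, alpha * x)

    def clipped_relu(x, ceiling=6.0):
        # ReLU with an upper bound (e.g., ReLU6)
        return np.minimum(np.maximum(0.0, x), ceiling)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def swish(x):
        # x * sigmoid(x)
        return x * sigmoid(x)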
Pooling Layer

Max pooling example (2×2 window, stride = 2); each output is the maximum of its window, e.g. Max(1, 2, 4, 6) = 6:

    1  2  2  3
    4  6  5  8     2x2 max pooling       6  8
    3  1  4  4     with stride = 2       3  4
    2  1  3  3
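The example above can be checked with a short Python/NumPy snippet (the reshape trick assumes the window size divides the input size):

    import numpy as np

    x = np.array([[1, 2, 2, 3],
                  [4, 6, 5, 8],
                  [3, 1, 4, 4],
                  [2, 1, 3, 3]])

    k = 2  # 2x2 window, stride 2
    pooled = x.reshape(x.shape[0] // k, k, x.shape[1] // k, k).max(axis=(1, 3))
    print(pooled)  # [[6 8]
                   #  [3 4]]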
Fully-Connected Layer
[Figure: weight matrix (M×C) multiplied by input vector (C) produces output vector (M).]

    for m = [0:M)
      for c = [0:C)
        Out[m] += Weight[m][c] * Input[c]
Matrix-Vector Multiplication
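The loop nest is exactly a matrix-vector product; a one-line NumPy equivalent (shapes illustrative):

    import numpy as np

    M, C = 4, 8
    weight = np.random.rand(M, C)
    x = np.random.rand(C)

    out = weight @ x  # Out[m] = sum over c of Weight[m][c] * Input[c]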
The deep learning stack, from networks down to hardware:

Neural Networks: ResNet, MobileNet, BERT
DL Framework: PyTorch, TensorFlow, Caffe
Compiler: TVM, TimeLoop, ZigZag

Co-design across different levels for efficient hardware.
Key Metrics
• Energy efficiency
  • Energy/inference, TOPS/W
• Area efficiency
  • Inferences/sec/mm², TOPS/mm²
• Flexibility
  • Support for different types of neural networks and layers
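A toy worked example of how these metrics relate (all numbers hypothetical):

    # Hypothetical accelerator: 8 TOPS peak at 4 W in 10 mm^2,
    # running a model that needs 4e9 ops per inference.
    tops, watts, area_mm2, ops_per_inf = 8.0, 4.0, 10.0, 4e9

    energy_eff = tops / watts                # 2.0 TOPS/W
    area_eff = tops / area_mm2               # 0.8 TOPS/mm^2
    inf_per_sec = tops * 1e12 / ops_per_inf  # 2000 inferences/s at peak
    energy_per_inf = watts / inf_per_sec     # 0.002 J = 2 mJ per inference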
• System
  • Array of PEs
  • Global buffer
  • Controller
  • DRAM
MVA vs. MMA (efficiency increases from MVA to MMA):
• MVA: no spatial reuse, high control overheads
• MMA: high spatial reuse, low control overheads, but high effort to achieve good utilization for some layer types
Memory hierarchy: PE scratchpads → Global Buffer → DRAM (large capacity, high latency, high energy)
• Examples: Venkatesan et al., HotChips 2019; Zimmer et al., JSSC 2020; Google TPU; NVDLA
Dataflows: Temporal Data Reuse
• Output-Stationary (OS) Dataflow (loop-nest sketch below)
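A minimal loop-nest sketch of an output-stationary dataflow in Python (illustrative): each output element is held in a local accumulator until all of its partial sums are formed, so psums never leave the PE:

    import numpy as np

    E, F, C, R, S = 4, 4, 3, 3, 3
    inp = np.random.rand(C, E + R - 1, F + S - 1)
    wgt = np.random.rand(C, R, S)
    out = np.zeros((E, F))

    for e in range(E):
        for f in range(F):
            acc = 0.0  # psum stays stationary in a PE-local register
            for c in range(C):
                for r in range(R):
                    for s in range(S):
                        acc += wgt[c, r, s] * inp[c, e + r, f + s]
            out[e, f] = acc  # each output written exactly once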
[Evaluation: Network: ResNet-50; Dataset: ImageNet; Technology: 16ff.]
[Figure: datapath fed by Buffer 1/Buffer 2, filled from lower-level memory (double buffering).]
• Buffet achieves a 2.3X reduction in energy-delay product (EDP) and a 2.1X gain in area efficiency over DMA with double buffering (baseline sketched below)
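For reference, a minimal ping-pong sketch in Python of the double-buffering baseline (structure only; names are hypothetical): the next tile is fetched into one buffer while the datapath consumes the other:

    def double_buffered_process(tiles, fill, compute):
        # fill(tile) models a DMA transfer; compute(buf) consumes one tile.
        buffers = [None, None]
        buffers[0] = fill(tiles[0])  # prefetch the first tile
        for i in range(len(tiles)):
            if i + 1 < len(tiles):
                # In hardware this fill overlaps with the compute below.
                buffers[(i + 1) % 2] = fill(tiles[i + 1])
            compute(buffers[i % 2])

    double_buffered_process(list(range(4)), fill=lambda t: t * 10, compute=print)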
Dataflows: Spatial Data Reuse

[Figure: array of PEs connected through an interconnect to memory, exchanging Weights, Input Activations, Output Activations, and Partial Sums. Four distribution patterns:]

Pattern-1
• Unicast Weights
• Multicast Input activations
• Unicast Output activations

Pattern-2
• Multicast Weights
• Unicast Input activations
• Unicast Output activations

Pattern-3
• Unicast Weights
• Unicast Input activations
• Unicast Partial sums
• Unicast Output activations

Pattern-4
• Unicast Weights
• Unicast Input activations (PEs receive input activations from different input channels (C))
• Unicast Partial sums
• Unicast Output activations
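A minimal sketch of Pattern-1 in Python/NumPy (shapes illustrative): the same input activations are multicast to every PE, while each PE holds its own (unicast) weights and produces its own (unicast) output:

    import numpy as np

    num_pes, C = 4, 8
    x = np.random.rand(C)                            # multicast to all PEs
    w = [np.random.rand(C) for _ in range(num_pes)]  # one filter per PE (unicast)

    # Each PE computes one output activation; outputs return unicast to memory.
    out = np.array([w[p] @ x for p in range(num_pes)])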
[Figure: scale-up via a Network-on-Chip within a die and a Network-on-Package across dies. Venkatesan et al., HotChips 2019; Zimmer et al., JSSC 2020.]
• Opportunities
  • Data reuse
  • Parallelism
  • Pipelining
• Example (sketched below)
  • An architecture implementing weight-stationary dataflow
  • Tile weights and distribute them to different PEs (PE1, PE2, PE3)
  • Compute different output activations by streaming in the input activations
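A minimal sketch of this weight-stationary example in Python/NumPy (all sizes illustrative): each PE keeps its weight tile resident and reuses it for every input activation that streams past:

    import numpy as np

    num_pes, C_per_pe, T = 3, 4, 16  # 3 PEs, 4 channels per tile, 16 time steps
    # Weight tiles are loaded once per PE and reused across the whole stream.
    pe_weights = [np.random.rand(C_per_pe) for _ in range(num_pes)]

    outputs = np.zeros((num_pes, T))
    for t in range(T):                           # stream input activations
        x_t = np.random.rand(num_pes, C_per_pe)  # activation slices at time t
        for p in range(num_pes):
            outputs[p, t] = pe_weights[p] @ x_t[p]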
[Figure: alternative tilings mapped onto 4×4 PE arrays.]

• Large number of possible tilings for a given layer and hardware configuration
• >10x difference in performance and energy across tilings
• Need to explore the tiling space to achieve the best energy and performance (see the sketch below)
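A tiny illustration of why the tiling space is large (layer sizes illustrative): even counting only divisor-based tile sizes for four loop dimensions of one layer gives over a thousand options, before loop order or multi-level tiling is considered:

    from math import prod

    def divisors(n):
        return [d for d in range(1, n + 1) if n % d == 0]

    # One ResNet-50-like layer: C=256, M=512, E=F=14 (illustrative).
    dims = {'C': 256, 'M': 512, 'E': 14, 'F': 14}
    num_tilings = prod(len(divisors(n)) for n in dims.values())
    print(num_tilings)  # 9 * 10 * 4 * 4 = 1440 tile-size choices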
[Figure: accuracy vs. hardware cost; small accuracy loss for a large efficiency gain.]

Quantization
• Post-training quantization: Pre-trained Model → Quantization → Quantized Model
• Quantization with re-training: Pre-trained Model → Quantization → Quantized Model, with a Re-Training loop to recover accuracy
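A minimal post-training quantization sketch in Python/NumPy (symmetric, per-tensor, 8-bit; all choices illustrative):

    import numpy as np

    def quantize_int8(w):
        scale = np.abs(w).max() / 127.0  # map the largest |w| to the int8 range
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        return q.astype(np.float32) * scale

    w = np.random.randn(64, 64).astype(np.float32)  # stand-in pre-trained weights
    q, scale = quantize_int8(w)
    w_hat = dequantize(q, scale)
    print(np.abs(w - w_hat).max())  # small quantization error per weight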
Source: https://nicsefc.ee.tsinghua.edu.cn/projects/neural-network-accelerator/
Venkatesan et al., ICCAD 2019
Summary
• Deep neural networks are increasingly used across a wide range of applications
  • Large amounts of data
  • High computation demand