ISSCC 2021 T8
Yvain Thonnart
CEA-LIST
Data movement is crucial in the performance of systems
[Figure: Cerebras CS-1, from K. Rocki et al., SC '20]
[Figure: cores with L1 caches, L2 caches, and L3/MEM, connected by Tx/Rx links (one labeled Zc) through an interconnect]
Bus:
As a system function:
Command - Address - Data protocol between system modules
As a structure: “Shared Bus”
interconnect with shared command/address/data lines
Issues:
No buffering possible
Long wiring
Large capacitance
Issue:
Long combinational paths → low frequency
[Figure: initiator and targets 1 to N connected through arbiters, with a read data path]
Routers – switches
Send data from input ports to appropriate output ports
Roughly equivalent terms for on-chip interconnects; the terminology comes from Internet networks. Preferred use:
Switch: little or no queuing, ports not directional
Router: often more queuing, directional ports
[Figure: switch with N, W, E, S ports]
Yvain Thonnart T8: On-Chip Interconnects: Basic Concepts, 19 of 100
Designs and Future Opportunities
Bus and/or NoC: buses meet networks
Local handshake between FIFOs & switches enables complex routing topologies
Advanced buses are NoCs!
Interconnect as a general term
[Figure: switches connected by Tx/Rx links; array of processing elements (PE)]
Dataflow architectures
Static/dynamic data streams
Transaction-based architectures
Memory-mapped transactions
Requests & Responses
Complex interconnect:
Longer transfers
Routing and arbitration needed
Contention
Different traffic classes
[Figure: transaction timeline with request, forward, response, and completion phases]
[Figure: sources (Src) and destinations (Dst) connected through a switch, shown as equivalent (⇔) to a matrix marking each pair of sources S1 to S3 and destinations with X or ∅]
Multi-layered
To limit the design complexity of many-port switches
Pipelined as necessary
[Figure: sources S1 to S3 reach destinations D1 to D3 through several layers of switches]
[Figure panels: Star/Fat-Tree; Folded Ring (1D); Mesh (2D); each built from switches]
Most often centralized control
Quasi-static scheduling or centralized arbitration
Potential time multiplexing: alternating switch configurations every cycle
Potential deadlock
No packet can make progress to
destination
Simultaneous transfers slow down the network
Depends on architecture
Topology bottlenecks
Amount of pipelining in the network
[Figure: plot of average end-to-end latency (ns), with the saturation threshold marked]
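[Figure: ready/stalled flow control: sender-to-receiver pipeline of enable (En) / valid (V) registers, with the ready/stalled signal returning over Lb pipeline stages]
[Figure: credit-based flow control: credit counter at the sender (+1 per returned credit, tested against 0), Lf forward pipeline stages of valid (V) registers, (Lf+Lb) buffer slots at the receiver]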
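Credit-based flow control with Lf forward and Lb backward pipeline stages can be explored with a small cycle-level model. This is only an illustrative sketch: the function name `simulate_credit_link` is hypothetical, and it assumes the receiver drains one flit per cycle and returns the credit in the same cycle. Under those assumptions, (Lf+Lb) credits cover the round trip and keep the link busy every cycle, while fewer credits throttle it.

```python
from collections import deque


def simulate_credit_link(initial_credits, lf, lb, cycles):
    """Return the number of flits delivered to the receiver in `cycles` cycles.

    Assumptions (not from the slides): lf >= 1, lb >= 1, the sender always
    has data, and the receiver consumes each flit on arrival.
    """
    credits = initial_credits
    forward = deque([0] * lf)    # flits in flight toward the receiver
    backward = deque([0] * lb)   # credits in flight toward the sender
    delivered = 0
    for _ in range(cycles):
        # A credit released lb cycles ago reaches the sender.
        credits += backward.popleft()
        # The sender emits one flit only if it holds a credit.
        send = 1 if credits > 0 else 0
        credits -= send
        forward.append(send)
        # A flit sent lf cycles ago reaches the receiver and frees a slot.
        arrived = forward.popleft()
        delivered += arrived
        backward.append(arrived)
    return delivered


if __name__ == "__main__":
    for c in (2, 5):
        print(c, "credits:", simulate_credit_link(c, lf=3, lb=2, cycles=1000))
```

With Lf=3 and Lb=2, five credits let the sender transmit on every cycle, whereas two credits allow only two flits per five-cycle round trip.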
Pipelining: elastic buffers
Depth-2 FIFOs break the backward path without credit counters
No additional buffering at the receiver: uses a lower total amount of storage
Second FIFO slot used to amortize stalls
[Figure: chain of valid (V) registers between sender and receiver]
[Figure: network interface from core: demux, size converter, and FIFO, with Valid, Data, and Ready signals]
[Figure: switch structure. In: routing tables steer demuxes. Out: arbiters (Arb) drive muxes]
Packet delimiters to route the whole packet to the same port
Begin of Packet on header flit: BOP (optional)
End of Packet on last flit: EOP
[Figure: flit format with BOP/EOP bits: header flit (10) with route info, data payload flits (00)]
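A minimal sketch of BOP/EOP framing, assuming each flit carries a two-bit (BOP, EOP) tag ahead of its word, so that a header reads 10, a body flit 00, and the last flit 01. The function names `packetize` and `depacketize` and the tag 11 for a single-flit packet are assumptions made for illustration.

```python
def packetize(route_info, payload_words):
    """Split one packet into (tag, word) flits, tag being the BOP and EOP bits."""
    words = [route_info] + list(payload_words)
    flits = []
    for index, word in enumerate(words):
        bop = 1 if index == 0 else 0                # set on the header flit
        eop = 1 if index == len(words) - 1 else 0   # set on the last flit
        flits.append((f"{bop}{eop}", word))
    return flits


def depacketize(flits):
    """Regroup a flit stream into packets using the BOP/EOP bits."""
    packets, current = [], None
    for tag, word in flits:
        if tag[0] == "1":          # BOP: start a new packet
            current = []
        current.append(word)
        if tag[1] == "1":          # EOP: packet complete
            packets.append(current)
            current = None
    return packets


if __name__ == "__main__":
    stream = packetize("R3", ["d0", "d1"]) + packetize("R1", [])
    print(stream)
    print(depacketize(stream))
```

Because the delimiters travel with the flits, a switch can lock an output port at BOP and release it at EOP without knowing the packet length in advance.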
[Figure: table-based routing: a table maps address (@) 0x00FF… to route ID R3, carried in the header flit (10); initiators I0 to I3 and targets T0 to T3 around router R3]
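[Figure: source routing: the header flit (10) carries the path xxxEENL; successive switches read E (leaving ENL), E (leaving NL), N (leaving L), then L (local delivery)]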
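Source routing can be illustrated by walking the path string EENL hop by hop: each switch reads the first direction and forwards the remainder. The sketch below is an assumption-laden illustration: the function name `follow_source_route`, the (x, y) grid coordinates, and the convention that E/N increase x/y are hypothetical.

```python
# Direction letters mapped to grid moves (assumed convention: E = +x, N = +y).
STEP = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}


def follow_source_route(start, path):
    """Consume one direction per switch until the local port 'L' is reached.

    Returns the delivery coordinates and, for each hop, the direction taken
    together with the route string left in the header.
    """
    x, y = start
    trace = []
    remaining = path
    while remaining:
        direction, remaining = remaining[0], remaining[1:]
        if direction == "L":
            return (x, y), trace
        trace.append((direction, remaining))
        dx, dy = STEP[direction]
        x, y = x + dx, y + dy
    raise ValueError("path ended without a local (L) delivery")


if __name__ == "__main__":
    print(follow_source_route((0, 0), "EENL"))
```

The switches need no routing table in this scheme; the cost is a header that grows with the path length.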
[Figure: coordinate-based routing: the header flit (10) carries destination (0,1). R(2,0): X<2, go W. R(1,0): X<1, go W. R(0,0): X=0 and Y>0, go N. R(0,1): X=0 and Y=1, deliver locally (L)]
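The coordinate-based decisions in this example (resolve X first, then Y, then deliver locally) correspond to dimension-ordered XY routing. A minimal sketch follows; the function names `xy_next_hop` and `xy_route`, and the convention that E/N increase X/Y, are assumptions for illustration.

```python
# Grid moves per output direction (assumed convention: E = +x, N = +y).
MOVE = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}


def xy_next_hop(current, dest):
    """Pick the output port at router `current` for a packet going to `dest`."""
    (cx, cy), (dx, dy) = current, dest
    if dx > cx:
        return "E"
    if dx < cx:
        return "W"
    if dy > cy:
        return "N"
    if dy < cy:
        return "S"
    return "L"  # already at the destination: local port


def xy_route(source, dest):
    """List (router, output port) pairs along the path from source to dest."""
    position, hops = source, []
    while True:
        port = xy_next_hop(position, dest)
        hops.append((position, port))
        if port == "L":
            return hops
        step = MOVE[port]
        position = (position[0] + step[0], position[1] + step[1])


if __name__ == "__main__":
    print(xy_route((2, 0), (0, 1)))
```

Each router only compares its own coordinates with the destination in the header, so no table or path string is needed.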
Between virtual channels:
Flits can be interleaved
Some virtual channels may have higher priority
High-priority packets can preempt low-priority packets
[Figure: two packets, from input #1 and input #2, each with BOP/EOP bits: header flit (10), payload flit (00), last payload flit (01)]
Arbitration: policies
For arbitration between directions, priority is meaningless
Issue: all directions must be served equally (fairness)
Solution: balanced arbiters
Round-robin: rotating priority between inputs => cost: 1 counter
Least recently used (LRU) => cost: ordered list of used ports
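A round-robin arbiter can be sketched with a single counter that records which input holds the highest priority next. The class name `RoundRobinArbiter` and its interface are hypothetical, chosen only to illustrate the rotating-priority policy.

```python
class RoundRobinArbiter:
    """Grant one requester per call, rotating priority after each grant."""

    def __init__(self, n_inputs):
        self.n_inputs = n_inputs
        self.pointer = 0  # the single counter: input with highest priority

    def grant(self, requests):
        """Return the index of the granted input, or None if nobody requests."""
        for offset in range(self.n_inputs):
            index = (self.pointer + offset) % self.n_inputs
            if requests[index]:
                # The winner becomes lowest priority for the next round.
                self.pointer = (index + 1) % self.n_inputs
                return index
        return None


if __name__ == "__main__":
    arbiter = RoundRobinArbiter(3)
    print([arbiter.grant([True, True, True]) for _ in range(4)])
```

When every input requests continuously, each one is served once per rotation, which is the fairness property the policy targets.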
[Figure: two packet formats with BOP/EOP bits. First: routing header flit (10) followed by 8-byte data flits (00). Second: routing header flit (10), a 2nd part of header (00) only if required, then flits (00) carrying a byte mask and 4-byte data]
Protocol-compatible bit reorder
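[Figure: crossing between clock domains clk1 (frequency F1, voltage V1) and clk2 (frequency F2, voltage V2): skew with timing margin]
[Figure: D flip-flop (D, Q, CK) with transistors N1, P1, N2, P2 and nodes VN1, VN2]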
Time characterization of metastability
Forced transitions on D close to clock edge
Measure clock-to-Q delay
Deterministic delay increases near metastable window
Used for setup/hold characterization
→ Stochastic delay becomes unbounded within metastability window Tw
Poisson process: can delay indefinitely
[Figure: clock-to-Q delay versus Δt, with setup margin, hold margin, window Tw, and time constant τ (63%)]
[Figure: D flip-flop (D, Q, CK) and its VN2 versus Vin curve with trip point; Vout skewed toward 1; error case that should never happen]
Resynchronization & metastability
Major problem: logic divergence & reconvergence after the unstable node
Solution: no divergence until certain to have resolved metastability
Output of the 2nd flip-flop
Or more if needed
[Figure: chain of flip-flops with outputs Q1, Q2, Q3, and waveforms of CK, D, Q1, Q2, Q3]
Synchronization failures
Mean time between failures (MTBF)
Depends on
Resync flip-flop parameters τ & Tw
Number of flip-flop stages N
Sampling frequency Fck (period Tck)
Input data toggle rate Fd
$$\mathrm{MTBF} = \frac{1}{\dfrac{T_w}{T_{ck}} \cdot F_d \cdot e^{-\frac{(N-1)\,T_{ck}}{\tau}}}$$
(Tw/Tck) · Fd: metastable events rate
e^{-(N-1)·Tck/τ}: probability of staying metastable after N flip-flops
Example for Fd = 100 MHz, Fck = 1 GHz:
Tw                     τ      N   MTBF
40ps                   60ps   2   4 seconds
40ps                   60ps   3   2 years
40ps                   30ps   2   2 years
40ps                   30ps   3   700 trillion years
Worst case (Tw/Tck=1)  30ps   2   1 month
Worst case (Tw/Tck=1)  30ps   3   30 trillion years
Resync. flip-flops tailored for low τ
Tw less important
N costly in latency
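The example figures can be checked numerically. The sketch below assumes the form MTBF = 1 / ((Tw/Tck) · Fd · exp(-(N-1)·Tck/τ)), that is, N-1 full clock periods of resolution time for N flip-flop stages; this form reproduces the orders of magnitude in the example table, but the function name `mtbf_seconds` and its parameter names are hypothetical.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def mtbf_seconds(tw, tau, n_stages, f_ck, f_d):
    """Mean time between synchronization failures, in seconds.

    tw, tau in seconds; f_ck (sampling) and f_d (data toggle) in hertz.
    Assumes (n_stages - 1) clock periods are available to resolve.
    """
    t_ck = 1.0 / f_ck
    event_rate = (tw / t_ck) * f_d                        # metastable events per second
    p_unresolved = math.exp(-(n_stages - 1) * t_ck / tau)  # still metastable after N stages
    return 1.0 / (event_rate * p_unresolved)


if __name__ == "__main__":
    for tw, tau, n in [(40e-12, 60e-12, 2), (40e-12, 60e-12, 3),
                       (40e-12, 30e-12, 2), (40e-12, 30e-12, 3),
                       (1e-9, 30e-12, 2), (1e-9, 30e-12, 3)]:
        seconds = mtbf_seconds(tw, tau, n, f_ck=1e9, f_d=100e6)
        print(f"Tw={tw:.0e}s tau={tau:.0e}s N={n}: "
              f"{seconds:.3g} s = {seconds / SECONDS_PER_YEAR:.3g} years")
```

The exponential term dominates: halving τ or adding one stage both double the exponent, whereas even the worst case Tw = Tck only scales the result linearly.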
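[Figure: dual-clock FIFO: write pointer (wptr) incremented (+1) on wvalid, read pointer (rptr) incremented (+1) on rready; each pointer crosses a synchronizer to the full? test (wready) and the empty? test (rvalid)]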
For large depths: Gray
Derived from binary code
2^N values for N bits
For small depths: Johnson (from 1-hot code)
Shift left & update lsb with inverted msb
2N values for N bits
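Both pointer encodings change exactly one bit per increment, which is what makes them safe to resynchronize. A minimal sketch follows, with hypothetical function names; the Gray conversion uses the usual binary XOR shifted-binary rule, and the Johnson step follows the "shift left, lsb = inverted msb" description.

```python
def binary_to_gray(value):
    """Gray code derived from a binary count: 2**N values for N bits."""
    return value ^ (value >> 1)


def johnson_next(state, n_bits):
    """Shift left and load the lsb with the inverted msb."""
    msb = (state >> (n_bits - 1)) & 1
    mask = (1 << n_bits) - 1
    return ((state << 1) & mask) | (msb ^ 1)


def johnson_sequence(n_bits):
    """Full Johnson cycle starting from all zeros: 2*N values for N bits."""
    state, sequence = 0, []
    for _ in range(2 * n_bits):
        sequence.append(state)
        state = johnson_next(state, n_bits)
    return sequence


if __name__ == "__main__":
    print([format(binary_to_gray(i), "03b") for i in range(8)])
    print([format(s, "03b") for s in johnson_sequence(3)])
```

Gray code uses every N-bit pattern but needs a binary counter and conversion logic; the Johnson counter is a plain shift register, which suits small depths.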
[Figure: same dual-clock FIFO, with Nw resynchronization stages on the write side and Nr on the read side]
Slow side may require fewer resynchronization flip-flops than fast side
Tck larger in MTBF equation
To keep peak throughput, need 1 transfer per slow cycle:
FIFO Depth ≥ (Nslow+1) + (Nfast+1)*Tfast/Tslow
Larger FIFO depth can be considered if interconnect must be freed upstream
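The depth bound can be evaluated directly. In this sketch the function name `min_fifo_depth` is hypothetical, clock periods are assumed to be given as integers in picoseconds, and the bound is rounded up to the next whole number of entries.

```python
from fractions import Fraction
from math import ceil


def min_fifo_depth(n_slow, n_fast, t_slow_ps, t_fast_ps):
    """Smallest integer depth satisfying
    depth >= (Nslow + 1) + (Nfast + 1) * Tfast / Tslow.

    n_slow, n_fast: resynchronization stages on each side.
    t_slow_ps, t_fast_ps: clock periods as integers (picoseconds assumed).
    """
    bound = Fraction(n_slow + 1) + Fraction((n_fast + 1) * t_fast_ps, t_slow_ps)
    return ceil(bound)  # exact rational arithmetic avoids float rounding


if __name__ == "__main__":
    # Example values chosen for illustration: 250 MHz slow side, 1 GHz fast side.
    print(min_fifo_depth(n_slow=2, n_fast=3, t_slow_ps=4000, t_fast_ps=1000))
```

The fast-side synchronizer latency is scaled by Tfast/Tslow because it is counted in slow-side transfers, so a much faster neighbor adds little to the required depth.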
[Figure: sender and receiver stages linked by request/acknowledge handshakes: Ireq, Iack, Oreq, Oack]
[Figure: Muller C-element (C) with inputs A, B and output Z]
A  B  Z
0  0  0
0  1  Z-1 (hold last value)
1  0  Z-1 (hold last value)
1  1  1
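The truth table of the Muller C-element translates into a one-line state update: the output follows the inputs when they agree and holds otherwise. The class name `CElement` and the reset value of 0 are assumptions made for this sketch.

```python
class CElement:
    """Behavioral model of a 2-input Muller C-element."""

    def __init__(self, initial=0):
        self.z = initial  # stored output (assumed to reset to 0)

    def update(self, a, b):
        """Apply inputs a, b (0 or 1) and return the output Z."""
        if a == b:
            self.z = a    # both inputs agree: output follows them
        return self.z     # inputs differ: hold last value


if __name__ == "__main__":
    gate = CElement()
    for a, b in [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]:
        print(a, b, gate.update(a, b))
```

The hold behavior is what lets the gate act as a rendezvous: the output rises only after both inputs have risen and falls only after both have fallen.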
[Figure: pipelines of C-elements for the Data=0 and Data=1 rails with a shared Ack; select rails sel0 and sel1 (sel) with ack]
Using 3-input Muller gates, it is possible to add conditions:
Select signal of a mux
Enable signal of a demux
[Figure: C-element mux/demux with way0, way1, and acknowledge signals]
[Figure: FIFO with write/read pointers (wptr, rptr), full?/empty? tests, and a synchronizer, placed between sender and receiver]
Chiplet motivations
Cost driven
Modularity driven
Heterogeneous integration
using 3D stacking technologies
Chiplet challenges?
Eco-system maturity,
Technology & Architecture partitioning,
Chiplet Interfaces, testability, 3D EDA flow, etc. [D. Dutoit, Keynote, 3DIC’2014]
Passive Interposer
Passive nearest-neighbor connections
Chiplets: Clusters of Cores
SoC infrastructure: Analog, IOs, DFT
Active Interposer
Chiplets: Clusters of Cores
Power Management close to cores
SoC infrastructure: Analog, IOs, DFT
Additional features
3D-Plug:
• Logic interface
• µ-bumps
• µ-buffer std-cells
• DFT
Chiplet layout: 3D-Plug interfaces
[P. Vivet, ISSCC’2020]