Arch3 Pipelining Afterlecture
ETH Zürich
Spring 2023
31 March 2023
Extra Credit Assignment 1: Talk Analysis
◼ Intelligent Architectures for Intelligent Machines
◼ Watch and analyze this short lecture (33 minutes)
❑ https://www.youtube.com/watch?v=WxHribseelw (Oct 2022)
3
Extra Credit Assignment 2: Moore’s Law
◼ Guidelines on how to review papers critically
❑ Out-of-Order Execution
❑ Issues in OoO Execution: Load-Store Handling, …
5
Readings
◼ This week & next week
❑ Pipelining
◼ H&H, Chapter 7.5
❑ Pipelining Issues
◼ H&H, Chapter 7.7, 7.8.1-7.8.3
[Figure: Review of the single-cycle MIPS datapath: PC, instruction memory, register file, ALU, data memory, sign extension, and the PC+4 and branch-target (PCBranch) adders]
[Figure: A processor as combinational logic plus sequential logic (state): the current architectural state AS is transformed into the next state AS']
10
Review: Multi-Cycle MIPS Processor
[Figure: Multi-cycle MIPS processor: control unit (PCWrite, Branch, PCEn, IorD, PCSrc, MemWrite, IRWrite, ALUControl, ALUSrcA/B, RegWrite, MemtoReg, RegDst) and datapath with a single shared instruction/data memory, register file, ALU, and intermediate registers (Instr, Data, A, B, ALUOut)]
[Figure: Excerpt of the multi-cycle control FSM: memory and writeback states (S3: MemRead, S4: Mem Writeback, S5: MemWrite, S7: ALU Writeback, S10: ADDI Writeback) with their IorD, RegDst, MemtoReg, MemWrite, and RegWrite settings]
◼ We can easily handle multi-cycle memory access
13
Can We Do Better?
◼ What limitations do you see with the multi-cycle design?
◼ Limited concurrency
❑ Some hardware resources are idle during different phases of the
instruction processing cycle
❑ “Fetch” logic is idle when an instruction is being “decoded” or
“executed”
❑ Most of the datapath is idle when a memory access is
happening
14
Can We Use the Idle Hardware to Improve Concurrency?
16
Can Have Different Instructions in Different Stages
[Figure: The multi-cycle MIPS processor again (control unit and shared-memory datapath); its fetch, decode/execute, and memory resources could each hold a different instruction at the same time]
18
Pipelining: Basic Idea
◼ More systematically:
❑ Pipeline the execution of multiple instructions
❑ Analogy: “Assembly line processing” of instructions
◼ Idea:
❑ Divide the instruction processing cycle into distinct “stages” of
processing
❑ Ensure there are enough hardware resources to process one
instruction in each stage
❑ Process a different instruction in each stage
◼ Instructions consecutive in program order are processed in
consecutive stages
20
The Laundry Analogy
[Figure: Laundry timeline from 6 PM to 2 AM for loads A through D. Sequential: one load finishes every 120 minutes. Pipelined: the steps of different loads overlap, so one load finishes every 30 minutes.]
- 4 loads of laundry in parallel (in steady state)
- Throughput increased by 4x
- Latency per load is the same
- No additional resources
22
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
Pipelining Multiple Loads of Laundry: In Practice
[Figure: Laundry timelines with a slow dryer: unpipelined, 1 load every 150 minutes; pipelined, 1 load every 60 minutes]
- The slowest step (the dryer) decides throughput
23
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
Pipelining Multiple Loads of Laundry: In Practice
[Figure: Laundry timelines with slow dryers: one slow dryer gives 1 load every 150 minutes; two slow dryers used in parallel give 1 load every 30 minutes]
- Throughput restored (1 load per 30 minutes) using 2 dryers
24
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
A Real-Life Pipeline: Automobile Assembly
https://www.caranddriver.com/features/a15115930/fords-assembly-line-turns-100-how-it-really-put-the-world-on-wheels-feature/
By Apparent scan made by the original uploader User:Steveking 89., Fair use, https://en.wikipedia.org/w/index.php?curid=11009879
25
A Real-Life Pipeline: Automobile Assembly
https://medium.com/@ScottAmyx/toyotas-production-philosophy-combines-human-effort-with-automation-a9df78c64ee1 26
An Old Pipelined Computer: IBM Stretch
https://en.wikipedia.org/wiki/IBM_7030_Stretch 27
An Ideal Pipeline
◼ Goal: Increase throughput with little increase in cost
(hardware cost, in case of instruction processing)
29
More Realistic Pipeline: Throughput
◼ Nonpipelined version with combinational delay T
❑ Tput = 1 / (T + S), where S = register (sequential logic) delay
❑ [Figure: one combinational block of T ps and G gates between registers]
◼ k-stage pipelined version
❑ Tput = 1 / (T/k + S)
❑ [Figure: k combinational blocks, each of T/k ps and G/k gates, separated by pipeline registers]
❑ This picture ignores the resource and register replication that is likely needed
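A short worked comparison (my addition, assuming k perfectly balanced stages of delay T/k each and one register of delay S per stage):

    Speedup = Tput_pipelined / Tput_nonpipelined
            = [1 / (T/k + S)] / [1 / (T + S)]
            = (T + S) / (T/k + S)
            → k   as S → 0

That is, a k-stage pipeline approaches a k-fold throughput improvement only when register overhead and stage imbalance are negligible.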
Pipelining Instruction Processing
32
Remember: The Instruction Processing Cycle
❑ FETCH
❑ DECODE
❑ EVALUATE ADDRESS
❑ FETCH OPERANDS
❑ EXECUTE
❑ STORE RESULT
33
Remember: The Instruction Processing Cycle
34
Remember the Single-Cycle Microarchitecture
[Figure: Single-cycle MIPS microarchitecture with its control unit (RegDst, Jump, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, RegWrite) and the PCSrc1 = Jump and PCSrc2 = Br Taken next-PC muxes]
◼ Latency of one instruction = T; throughput ≈ 1/T
Based on original figure from [P&H CO&D, COPYRIGHT 2004
Elsevier. ALL RIGHTS RESERVED.]
35
Dividing the Single-Cycle Uarch Into Stages
Stage delays: IF = 200 ps, ID = 100 ps, EX = 200 ps, MEM = 200 ps, WB = 100 ps
IF: Instruction fetch | ID: Instruction decode / register file read | EX: Execute / address calculation | MEM: Memory access | WB: Write back
[Figure: The single-cycle datapath partitioned into the five stages; the jump path is ignored for now]
[Figure: Single-cycle execution of lw $1, 100($0); lw $2, 200($0); lw $3, 300($0): one instruction completes every 800 ps]
[Figure: Pipelined execution of the same three loads: after the pipeline fills, one instruction completes every 200 ps]
Half of the clock cycle is wasted in two stages (ID/RF and WB)
37
Enabling Pipelined Processing: Pipeline Registers
[Figure: The five-stage datapath with pipeline registers between the stages, carrying state such as PCF, IRD, AE, ImmE, PCE, BM, AoutM, nPCM, MDRW, and AoutW; each stage now has delay of roughly T/k ps instead of T ps]
◼ No resource is used by more than one stage
38
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
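Not on the slide, but as a concrete reference point: a pipeline register is simply a bank of flip-flops that captures one stage's outputs every cycle so the next stage sees a stable copy for a full cycle. A minimal Verilog sketch (the width parameter and port names are my assumptions):

// Minimal pipeline register: latches the producing stage's outputs once per cycle.
module pipe_reg #(parameter WIDTH = 32) (
  input  wire             clk,
  input  wire             rst,     // synchronous reset to a known (bubble) value
  input  wire [WIDTH-1:0] d,       // outputs of the previous stage
  output reg  [WIDTH-1:0] q        // inputs of the next stage
);
  always @(posedge clk) begin
    if (rst) q <= {WIDTH{1'b0}};
    else     q <= d;
  end
endmodule

Each pipeline register in the figures that follow (IF/ID, ID/EX, EX/MEM, MEM/WB) is an instance of such a register, sized to hold whatever its producing stage computes (instruction bits, register values, immediates, control signals).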
[Figure: lw moving through the pipelined datapath one stage per cycle: Instruction fetch, Instruction decode, Execution, Memory, Write back]
◼ All instruction classes must follow the same path and timing through the pipeline stages
◼ Any performance impact?
39
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
[Figure: lw $10, 20($1) and sub $11, $2, $3 in the pipeline together; in each clock cycle (1 through 6) they occupy different, adjacent stages]
t0 t1 t2 t3 t4 t5
Inst0 IF ID EX MEM WB
Inst1 IF ID EX MEM WB
Inst2 IF ID EX MEM WB
Inst3 IF ID EX MEM WB
Inst4 IF ID EX MEM
Inst5    IF ID EX
Inst6       IF ID
Inst7          IF
steady state (full pipeline)
41
Illustrating Pipeline Operation: Resource View
t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10
IF I0 I1 I2 I3 I4 I5 I6 I7 I8 I9 I10
ID I0 I1 I2 I3 I4 I5 I6 I7 I8 I9
EX I0 I1 I2 I3 I4 I5 I6 I7 I8
MEM I0 I1 I2 I3 I4 I5 I6 I7
WB I0 I1 I2 I3 I4 I5 I6
42
Control Points in a Pipeline
[Figure: Pipelined datapath annotated with its control points (RegWrite, Branch, MemWrite, MemtoReg, ALUSrc, ALUOp, MemRead, RegDst, PCSrc), the same set of control signals as the single-cycle design]
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
[Figure: Control signals are generated in Decode and carried along with the instruction through the ID/EX, EX/MEM, and MEM/WB pipeline registers, grouped into EX, M, and WB bundles]
[Figure: The H&H single-cycle datapath and its pipelined version, with signals suffixed by their stage (F, D, E, M, W): InstrD, SrcAE/SrcBE, WriteRegE/M/W, ALUOutM, WriteDataM, ReadDataW, SignImmE, PCPlus4F/D/E, PCBranchM, ResultW]
◼ Resource contention
52
Dependences and Their Types
◼ Also called “dependency” or less desirably “hazard”
◼ Two types
❑ Data dependence
❑ Control dependence
54
55
Data Dependences
◼ Data dependence types
❑ Flow dependence (true data dependence – read after write)
❑ Anti dependence (write after read)
❑ Output dependence (write after write)
❑ Example of an output dependence (Write-after-Write, WAW) on r3:
r3 ← r1 op r2
r5 ← r3 op r4
r3 ← r6 op r7
57
[Figure: lw $10, 20($1) and sub $11, $2, $3 flowing through the pipelined datapath over clock cycles 1 through 6 (same figure as before)]
59
Readings for Next Few Lectures
◼ H&H, Chapter 7.5-7.9
60
How to Handle Data Dependences
◼ Anti and output dependences are easier to handle
❑ write to the destination only in last stage and in program order
Output dependence (Write-after-Write, WAW) on r3:
r3 ← r1 op r2
r5 ← r3 op r4
r3 ← r6 op r7
62
RAW Dependence Handling
◼ Which of the following flow dependences lead to
conflicts in the 5-stage pipeline?
addi ra, r-, -   IF ID EX MEM WB
addi r-, ra, -      IF ID EX MEM WB
addi r-, ra, -         IF ID EX MEM
addi r-, ra, -            IF ID EX
addi r-, ra, -               IF ID?
addi r-, ra, -                  IF
63
Pipeline Stall: Resolving Data Dependence
          t0   t1   t2   t3   t4   t5
Insth     IF   ID   ALU  MEM  WB
Insti (i)      IF   ID   ALU  MEM  WB
Instj (j)           IF   ID   ID   ID ...   (held in ID until i's result is available)
Instk                    IF   IF   IF ...
Instl
i: rx ← ...     j: ... ← rx
[Figure: bubbles are inserted for dependence distance dist(i,j) = 1, 2, and 3; no bubble is needed for dist(i,j) = 4]
◼ Stall = make the dependent instruction wait
until its source data value is available
❑ 1. stop all up-stream stages
❑ 2. drain all down-stream stages
64
Interlocking
◼ Interlocking: Detection of dependence between instructions
in a pipelined processor to guarantee correct execution
◼ MIPS acronym?
65
Approaches to Dependence Detection (I)
◼ Scoreboarding
❑ Each register in register file has a Valid bit associated with it
❑ An instruction that is writing to the register resets the Valid bit
❑ An instruction in Decode stage checks if all its source and
destination registers are Valid
◼ Yes: No need to stall… No dependence
◼ No: Stall the instruction
◼ Advantage:
❑ Simple. 1 bit per register
◼ Disadvantage:
❑ Need to stall for all types of dependences, not only flow dep.
66
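A minimal Verilog sketch of such a scoreboard (my illustration, not from the slides; the 32-register count, the single write-back port, the port names, and the lack of special $0 handling are all assumptions):

// A minimal scoreboard: one Valid bit per architectural register.
module scoreboard (
  input  wire       clk,
  input  wire       rst,
  input  wire [4:0] rsD, rtD, rdD,   // sources and destination of the instruction in Decode
  input  wire       decode_writes,   // the instruction in Decode will write rdD
  input  wire [4:0] wb_reg,          // register being written back this cycle
  input  wire       wb_valid,
  output wire       stallD           // stall Decode if any needed register is not Valid
);
  reg [31:0] valid;                  // one Valid bit per register

  // Stall if a source (or, per the slide, the destination) is still being produced.
  assign stallD = ~valid[rsD] | ~valid[rtD] | (decode_writes & ~valid[rdD]);

  always @(posedge clk) begin
    if (rst)
      valid <= {32{1'b1}};                      // all registers Valid initially
    else begin
      if (wb_valid)
        valid[wb_reg] <= 1'b1;                  // writeback sets the Valid bit
      if (decode_writes & ~stallD)
        valid[rdD] <= 1'b0;                     // an issuing writer resets the Valid bit
    end
  end
endmodule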
[Figure: lw $10, 20($1) and sub $11, $2, $3 in the pipelined datapath, with the lw's destination register ($10) carried down the pipeline]
Approaches to Dependence Detection (II)
◼ Combinational dependence check logic
❑ Check whether any instruction in a later pipeline stage will write to a source register of the instruction being decoded
◼ Advantage:
❑ No need to stall on anti and output dependences
◼ Disadvantage:
❑ Logic is more complex than a scoreboard
❑ Logic becomes more complex as we make the pipeline deeper
and wider (flash-forward: think superscalar execution)
68
[Figure: lw $10, 20($1) and sub $11, $2, $3 in the pipelined datapath (repeated); once the dependence is detected, the hardware can either stall the dependent instruction or forward the needed value when it becomes available]
70
Data Forwarding/Bypassing
◼ Problem: A consumer (dependent) instruction has to wait in the
decode stage until the producer instruction writes its value
into the register file
72
We Covered Until This Point in Lecture
73
Digital Design & Computer Arch.
Lecture 12: Pipelining
ETH Zürich
Spring 2023
31 March 2023
Slides for Future Lectures
75
How to Implement Stalling
[Figure: Pipelined datapath with control; the control unit's outputs (RegWrite, Branch, MemWrite, ALUSrc, MemtoReg, MemRead, ALUOp, RegDst, PCSrc) travel with the instruction through the ID/EX, EX/MEM, and MEM/WB registers]
◼ Stall
❑ disable PC and IF/ID latching; ensure stalled instruction stays in its stage
❑ Insert “invalid” instructions/nops into the stage following the stalled one
(called “bubbles”)
76
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
RAW Data Dependence Example
◼ One instruction writes a register ($s0) and the next instructions
read this register => read after write (RAW) dependence
❑ add writes into $s0 in the first half of cycle 5
❑ and reads $s0 in cycle 3, obtaining the wrong value
❑ or reads $s0 in cycle 4, again obtaining the wrong value
❑ sub reads $s0 in the 2nd half of cycle 5, getting the correct value
❑ Subsequent instructions read the correct value of $s0
◼ Wrong results happen only if the pipeline handles data dependences incorrectly!
[Figure: Pipeline diagram (cycles 1 through 8) of add $s0, $s2, $s3; and $t0, $s0, $s1; or $t1, $s4, $s0; sub $t2, $s0, $s5; the register file is written in the first half of a cycle and read in the second half]
Compile-Time Detection and Elimination
[Figure: The compiler inserts two nops between add and and (cycles 1 through 10), so every consumer reads $s0 only after add has written it back]
[Figure: The original sequence without nops (cycles 1 through 8), for comparison]
[Figure: Pipelined datapath extended with forwarding: muxes in front of the ALU inputs (selected by ForwardAE and ForwardBE) can take ALUOutM from the Memory stage or ResultW from the Writeback stage, driven by a dependence (hazard) detection unit that compares RsE/RtE against WriteRegM/WriteRegW and checks RegWriteM/RegWriteW]
Data Forwarding: Implementation
◼ Forward to Execute stage from either:
❑ Memory stage or
❑ Writeback stage
◼ The forwarding logic for ForwardBE is the same, but with rsE replaced by rtE
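A Verilog sketch of this forwarding-select logic, in the spirit of the H&H-style unit in the figure (the select encoding 10 = Memory stage, 01 = Writeback stage, 00 = register file is an assumption of this sketch):

// Forwarding-mux select logic for the two ALU operands.
module forwarding_unit (
  input  wire [4:0] rsE, rtE,              // source registers of the instruction in Execute
  input  wire [4:0] WriteRegM, WriteRegW,  // destinations of the instructions in Memory/Writeback
  input  wire       RegWriteM, RegWriteW,  // those instructions actually write a register
  output reg  [1:0] ForwardAE, ForwardBE
);
  always @(*) begin
    // SrcA: prefer the younger producer (Memory stage) over Writeback; never forward $0.
    if      ((rsE != 5'd0) && (rsE == WriteRegM) && RegWriteM) ForwardAE = 2'b10;
    else if ((rsE != 5'd0) && (rsE == WriteRegW) && RegWriteW) ForwardAE = 2'b01;
    else                                                       ForwardAE = 2'b00;
    // SrcB: same logic, but checking rtE, as the slide notes.
    if      ((rtE != 5'd0) && (rtE == WriteRegM) && RegWriteM) ForwardBE = 2'b10;
    else if ((rtE != 5'd0) && (rtE == WriteRegW) && RegWriteW) ForwardBE = 2'b01;
    else                                                       ForwardBE = 2'b00;
  end
endmodule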
Data Forwarding Is Not Always Possible
[Figure: Pipeline diagram of lw $s0, 40($0); and $t0, $s0, $s1; or $t1, $s4, $s0; sub $t2, $s0, $s5. Trouble! The lw data is read from memory only at the end of cycle 4, after the dependent and has already needed it in Execute]
[Figure: The same sequence with and (and the instructions behind it) stalled for one cycle, so the lw result can be forwarded to and's Execute stage (cycles 1 through 9)]
Stalling and Dependence Detection Hardware
[Figure: Pipelined datapath with stall support: enable inputs (EN, driven by StallF and StallD) on the Fetch and Decode pipeline registers, a synchronous clear input (CLR, driven by FlushE) on the Execute pipeline register, and a dependence (hazard) detection unit that also observes MemtoRegE to detect load-use cases]
Hardware Needed for Stalling
◼ Stalls are supported by adding
❑ enable inputs (EN) to the Fetch and Decode pipeline registers
❑ synchronous reset/clear (CLR) input to the Execute pipeline
register
◼ or an INV bit associated with each pipeline register, indicating that
contents are INValid
88
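A Verilog sketch of a pipeline register with exactly these two controls, extending the plain register sketched earlier (signal polarities and names are assumptions):

// Pipeline register with stall (en) and flush/bubble (clr) support.
module pipe_reg_en_clr #(parameter WIDTH = 32) (
  input  wire             clk,
  input  wire             en,     // 1: capture new data; 0: stall (hold current contents)
  input  wire             clr,    // 1: synchronously clear, i.e., inject a bubble (nop)
  input  wire [WIDTH-1:0] d,
  output reg  [WIDTH-1:0] q
);
  always @(posedge clk) begin
    if      (clr) q <= {WIDTH{1'b0}};  // all-zero control signals act as a nop
    else if (en)  q <= d;
    // else: hold q (stalled instruction stays in its stage)
  end
endmodule

In this sketch, the Fetch and Decode registers would derive en from the StallF/StallD signals, and the Execute register's clr would be driven by FlushE.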
Control Dependence
◼ Question: What should the fetch PC be in the next cycle?
◼ Answer: The address of the next instruction
❑ All instructions are control dependent on previous ones. Why?
Branch Prediction
▪ Control dependence is a special case of data dependence: a dependence on the PC
▪ For beq:
▪ The conditional branch is not resolved until the fourth stage of the pipeline
▪ Instructions after the branch are fetched before the branch is resolved
▪ Simple "branch prediction" example:
▪ Always predict that the next sequential instruction is fetched
▪ Called "always not taken" prediction
▪ Flush (invalidate) the "not-taken path" instructions if the branch is taken
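A Verilog sketch of this policy with the branch resolved in the Memory stage (an illustration, not the slides' exact design; the signal names are assumptions chosen to resemble the figures):

// "Always not taken": keep fetching PC+4; on a taken branch resolved in Memory,
// redirect the PC and flush the three younger (wrong-path) instructions.
module predict_not_taken (
  input  wire [31:0] PCPlus4F,      // sequential fetch address (the prediction)
  input  wire [31:0] PCBranchM,     // branch target computed by the branch in Memory
  input  wire        BranchTakenM,  // branch in Memory resolved as taken
  output wire [31:0] PCNextF,       // next fetch address
  output wire        FlushD, FlushE, FlushM  // squash wrong-path instructions in IF/ID, ID/EX, EX/MEM
);
  assign PCNextF = BranchTakenM ? PCBranchM : PCPlus4F;
  assign FlushD  = BranchTakenM;
  assign FlushE  = BranchTakenM;
  assign FlushM  = BranchTakenM;
endmodule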
[Figure: Pipelined datapath with the branch resolved in the Memory stage; PCBranchM redirects the PC on a taken branch, and the stall/flush signals (StallF, StallD, FlushE) plus forwarding controls are generated by the dependence detection unit]
91
[Figure: A taken branch (20: beq $t1, $t2, 40) is resolved in the Memory stage; the three wrong-path instructions already in the pipeline (24: and $t0, $s0, $s1; 28: or $t1, $s4, $s0; 2C: sub $t2, $s0, $s5) are flushed, and fetch continues at 64: slt $t3, $s2, $s3]
92
[Figure: With the branch resolved earlier (in the Decode stage), only one wrong-path instruction (24: and $t0, $s0, $s1) must be flushed before fetch continues at 64: slt $t3, $s2, $s3]
94
Disadvantages
▪ Potential increase in clock cycle time?
→ Higher clock period and lower frequency?
▪ Additional hardware cost
→ Specialized and likely not used by other instructions
95
Recall: Performance Analysis Basics
◼ Execution time of a single instruction
❑ {CPI} x {clock cycle time}
◼ CPI: Number of cycles it takes to execute an instruction
96
[Figure: Pipelined datapath with early branch resolution: an equality comparator (EqualD) and the branch-target adder are moved into the Decode stage (producing PCSrcD and PCBranchD), which requires extra forwarding paths into Decode (ForwardAD, ForwardBD) and additional stall/flush logic in the dependence detection unit]
◼ Data forwarding for early branch resolution adds even more complexity
97
// Stalling logic:
assign lwstall = ((rsD == rtE) | (rtD == rtE)) & MemtoRegE;
// Stall signals:
assign StallF = lwstall | branchstall;
assign StallD = lwstall | branchstall;
assign FlushE = lwstall | branchstall;
98
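The code above uses branchstall without defining it. A sketch consistent with the early-branch-resolution datapath shown a few slides earlier, mirroring H&H-style logic (treat the exact expression as an assumption):

// branchstall: stall a branch being resolved in Decode if it needs a value that is
// still being computed by the ALU in Execute or still being loaded in Memory.
assign branchstall =
    (BranchD & RegWriteE & ((WriteRegE == rsD) | (WriteRegE == rtD))) |
    (BranchD & MemtoRegM & ((WriteRegM == rsD) | (WriteRegM == rtD)));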
100
More on Branch Prediction (I)
https://www.youtube.com/watch?v=h6l9yYSyZHM&list=PL5Q2soXY2Zi_FRrloMa2fUYWPGiZUBQo2&index=22
More on Branch Prediction (II)
https://www.youtube.com/watch?v=z77VpggShvg&list=PL5Q2soXY2Zi_FRrloMa2fUYWPGiZUBQo2&index=23
More on Branch Prediction (III)
https://www.youtube.com/watch?v=yDjsr-jTOtk&list=PL5PHm2jkkXmgVhh8CHAu9N76TShJqfYDt&index=4
Lectures on Branch Prediction
◼ Digital Design & Computer Architecture, Spring 2020, Lecture 16b
❑ Branch Prediction I (ETH Zurich, Spring 2020)
❑ https://www.youtube.com/watch?v=h6l9yYSyZHM&list=PL5Q2soXY2Zi_FRrloMa2fU
YWPGiZUBQo2&index=22
https://www.youtube.com/onurmutlulectures 104
Pipelined Performance Example
105
Assume:
▪ 40% of loads are used by the next instruction
▪ 25% of branches are mispredicted; a misprediction flushes the next instruction
▪ All jumps flush the next instruction fetched
▪ Average CPI = ?
107
▪ Average CPI = (0.25)(1.4)    loads
              + (0.10)(1)      stores
              + (0.11)(1.25)   beq
              + (0.02)(2)      jumps
              + (0.52)(1)      r-type
              = 1.15
▪ (The per-class CPIs follow from the assumptions above: loads 1 + 0.4 = 1.4 due to load-use stalls, beq 1 + 0.25 = 1.25 due to mispredictions, and jumps 2 because every jump flushes one fetched instruction.)
108
Tc = max {
    tpcq + tmem + tsetup                               (Fetch)
    2 (tRFread + tmux + teq + tAND + tmux + tsetup)    (Decode)
    tpcq + tmux + tmux + tALU + tsetup                 (Execute)
    tpcq + tmemwrite + tsetup                          (Memory)
    2 (tpcq + tmux + tRFwrite)                         (Writeback)
}
▪ Decode and Writeback use the register file and have only half a clock
cycle to complete → that is why there is a factor of 2 in front of them
109
112
113
Recall: How to Handle Data Dependences
◼ Anti and output dependences are easier to handle
❑ write to the destination only in last stage and in program order
116
Question to Ponder: Hardware vs. Software
◼ What is the role of the hardware vs. the software in the
order in which instructions are executed in the pipeline?
❑ Software based instruction scheduling → static scheduling
❑ Hardware based instruction scheduling → dynamic scheduling
117
More on Software vs. Hardware
◼ Software based scheduling of instructions → static scheduling
❑ Hardware executes the instructions in the compiler-dictated order
❑ Contrast this with dynamic scheduling: hardware can execute
instructions out of the compiler-specified order
❑ How does the compiler know the latency of each instruction?
https://www.youtube.com/watch?v=isBEVkIjgGA&list=PL5PHm2jkkXmi5CxxI7b3JCL1TWybTDtKq&index=18
Lectures on Static Instruction Scheduling
https://www.youtube.com/onurmutlulectures 120
Recall: Semantic Gap
◼ How close instructions & data types & addressing modes
are to high-level language (HLL)
[Figure: Two paths from an HLL program down to hardware control signals. Left: small semantic gap, an ISA with complex instructions, data types, and addressing modes. Right: large semantic gap, an ISA with simple instructions, data types, and addressing modes.]
[Figure: A third option: compile the HLL to a complex ISA such as x86-64 (small semantic gap), then use a software or hardware translator to map it onto the hardware]
https://en.wikipedia.org/wiki/Rosetta_(software)#Rosetta_2
123
An Example: Rosetta 2 Binary Translator
Apple M1,
2021
126
https://safari.ethz.ch/digitaltechnik/spring2021/lib/exe/fetch.php?media=boggs_ieeemicro_2015.pdf
Transmeta: x86 to VLIW Translation
Klaiber, “The Technology Behind Crusoe Processors,” Transmeta White Paper 2000.
https://www.wikiwand.com/en/Transmeta_Efficeon 127
https://classes.engineering.wustl.edu/cse362/images/c/c7/Paper_aklaiber_19jan00.pdf
More on Static Instruction Scheduling
https://www.youtube.com/watch?v=isBEVkIjgGA&list=PL5PHm2jkkXmi5CxxI7b3JCL1TWybTDtKq&index=18
Lectures on Static Instruction Scheduling
https://www.youtube.com/onurmutlulectures 129
Recall: How to Handle Data Dependences
◼ Anti and output dependences are easier to handle
❑ write to the destination only in last stage and in program order
131
Fine-Grained Multithreading
◼ Idea: Fetch from a different thread every cycle such that no
two instructions from a thread are in the pipeline concurrently
❑ Hardware has multiple thread contexts (PC+registers per thread)
❑ Threads are completely independent
❑ No instruction is fetched from the same thread until the prior
branch/instruction from the thread completes
[Figure: Pipelined datapath adapted for fine-grained multithreading]
◼ We need a PC and a register file for each thread + muxes and control
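A Verilog sketch of the per-thread PCs and round-robin thread selection (my illustration; four threads, a power-of-two thread count, and the absence of stalls are simplifying assumptions):

// Fine-grained multithreaded fetch: one PC per thread, switch thread every cycle.
module fgmt_fetch #(parameter NTHREADS = 4) (
  input  wire                        clk, rst,
  input  wire [31:0]                 next_pc,    // PC+4 (or redirect) for the thread fetched this cycle
  output reg  [$clog2(NTHREADS)-1:0] thread_id,  // thread selected this cycle
  output wire [31:0]                 fetch_pc    // address sent to the instruction memory
);
  reg [31:0] pc [0:NTHREADS-1];
  integer i;

  assign fetch_pc = pc[thread_id];

  always @(posedge clk) begin
    if (rst) begin
      thread_id <= 0;
      for (i = 0; i < NTHREADS; i = i + 1) pc[i] <= 32'h0;
    end else begin
      pc[thread_id] <= next_pc;          // update the PC of the thread just fetched
      thread_id     <= thread_id + 1'b1; // round-robin; wraps because NTHREADS is a power of two
    end
  end
endmodule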
Fine-Grained Multithreading (II)
◼ Idea: Fetch from a different thread every cycle such that no
two instructions from a thread are in the pipeline concurrently
135
Fine-Grained Multithreading in HEP
◼ Cycle time: 100ns
◼ 8 stages → 800 ns to
complete an
instruction
❑ assuming no memory
access
Burton Smith
(1941-2018)
136
Multithreaded Pipeline Example
Kongetira et al., “Niagara: A 32-Way Multithreaded Sparc Processor,” IEEE Micro 2005.
138
Fine-Grained Multithreading
◼ Advantages
+ No need for dependence checking between instructions
(only one instruction in pipeline from a single thread)
+ No need for branch prediction logic
+ Otherwise-bubble cycles used for executing useful instructions from
different threads
+ Improved system throughput, latency tolerance, pipeline utilization
◼ Disadvantages
- Extra hardware complexity: multiple hardware contexts (PCs, register
files, …), thread selection logic
- Reduced single thread performance (one instruction fetched every N
cycles from the same thread)
- Resource contention between threads in caches and memory
- Dependence checking logic between threads may be needed (load/store)
139
Modern GPUs are
FGMT Machines
140
NVIDIA GeForce GTX 285 “core”
[Figure: NVIDIA GeForce GTX 285 "core": 64 KB of storage for thread contexts (registers), plus texture (Tex) units]
Burton Smith
(1941-2018)
144
Further Reading for the Interested (II)
145
More on Multithreading (I)
https://www.youtube.com/watch?v=iqi9wFqFiNU&list=PL5PHm2jkkXmgDN1PLwOY_tGtUlynnyV6D&index=51
More on Multithreading (II)
https://www.youtube.com/watch?v=e8lfl6MbILg&list=PL5PHm2jkkXmgDN1PLwOY_tGtUlynnyV6D&index=52
More on Multithreading (III)
https://www.youtube.com/watch?v=7vkDpZ1-hHM&list=PL5PHm2jkkXmgDN1PLwOY_tGtUlynnyV6D&index=53
More on Multithreading (IV)
https://www.youtube.com/watch?v=-hbmzIDe0sA&list=PL5PHm2jkkXmgDN1PLwOY_tGtUlynnyV6D&index=54
Lectures on Multithreading
◼ Parallel Computer Architecture, Fall 2012, Lecture 9
❑ Multithreading I (CMU, Fall 2012)
❑ https://www.youtube.com/watch?v=iqi9wFqFiNU&list=PL5PHm2jkkXmgDN1PLwOY
_tGtUlynnyV6D&index=51
◼ Parallel Computer Architecture, Fall 2012, Lecture 10
❑ Multithreading II (CMU, Fall 2012)
❑ https://www.youtube.com/watch?v=e8lfl6MbILg&list=PL5PHm2jkkXmgDN1PLwOY_
tGtUlynnyV6D&index=52
◼ Parallel Computer Architecture, Fall 2012, Lecture 13
❑ Multithreading III (CMU, Fall 2012)
❑ https://www.youtube.com/watch?v=7vkDpZ1-
hHM&list=PL5PHm2jkkXmgDN1PLwOY_tGtUlynnyV6D&index=53
◼ Parallel Computer Architecture, Fall 2012, Lecture 15
❑ Speculation I (CMU, Fall 2012)
❑ https://www.youtube.com/watch?v=-
hbmzIDe0sA&list=PL5PHm2jkkXmgDN1PLwOY_tGtUlynnyV6D&index=54
https://www.youtube.com/onurmutlulectures 150
Pipelining and Precise Exceptions:
Preserving Sequential Semantics
Multi-Cycle Execution
◼ Not all instructions take the same amount of time for
“execution”
[Figure: A load/store occupies the Execute stage for many cycles (E E E E E E E E ...)]
152
Issues in Pipelining: Multi-Cycle Execute
◼ Instructions can take different number of cycles in EXECUTE
stage
❑ Integer ADD versus Integer DIVide
DIV R4 ← R1, R2   F D E E E E E E E E W
ADD R3 ← R1, R2   F D E W
                  F D E W
                  F D E W
DIV R2 ← R5, R6   F D E E E E E E E E W
ADD R7 ← R5, R6   F D E W
                  F D E W
◼ When to Handle
❑ Exceptions: when detected (and known to be non-speculative)
❑ Interrupts: when convenient
◼ Except for very high priority ones
❑ Power failure
❑ Machine check (error)
156
Checking for and Handling Exceptions in Pipelining
157
Why Do We Want Precise Exceptions?
◼ Semantics of the von Neumann model ISA specifies it
❑ Remember von Neumann vs. Dataflow
158
Ensuring Precise Exceptions
◼ Easy to do in single-cycle and multi-cycle machines
◼ Single-cycle
❑ Instruction boundaries == Cycle boundaries
◼ Multi-cycle
❑ Add special states in the control FSM that lead to the
exception or interrupt handlers
❑ Switch to the handler only at a precise state → before fetching
the next instruction
See H&H Section 7.7 for a treatment of exceptions in multi-cycle uarch 159
Precise Exceptions in Multi-Cycle FSM
160
Precise Exceptions in Multi-Cycle Datapath
See H&H Section 7.7 for a treatment of exceptions in multi-cycle uarch 161
Multi-Cycle Execute: More Complications
◼ Instructions can take different number of cycles in EXECUTE
stage
❑ Integer ADD versus Integer DIVide
DIV R4 ← R1, R2   F D E E E E E E E E W
ADD R3 ← R1, R2   F D E W
                  F D E W
                  F D E W
DIV R2 ← R5, R6   F D E E E E E E E E W
ADD R7 ← R5, R6   F D E W
                  F D E W

DIV R3 ← R1, R2   F D E E E E E E E E W
ADD R4 ← R1, R2   F D E E E E E E E E W
                  F D E E E E E E E E W
                  F D E E E E E E E E W
                  F D E E E E E E E E W
                  F D E E E E E E E E W
                  F D E E E E E E E E W
◼ Downside
❑ Worst-case instruction latency determines all instructions’ latency
◼ What about memory operations?
◼ Each functional unit takes worst-case number of cycles?
163
Solutions
◼ Reorder buffer
◼ History buffer
◼ Checkpointing
◼ Suggested reading
❑ Smith and Pleszkun, "Implementing Precise Interrupts in Pipelined
Processors,” IEEE Trans on Computers 1988 and ISCA 1985.
164
Solution I: Reorder Buffer (ROB)
◼ Idea: Complete instructions out-of-order, but reorder them
before making results visible to architectural state
◼ When instruction is decoded, it reserves the next-sequential
entry in the ROB
◼ When instruction completes, it writes result into ROB entry
◼ When an instruction is the oldest in the ROB and has completed
without exceptions, its result is moved to the register file or memory
[Figure: Instruction cache → register file → functional units; each functional unit writes its result into the reorder buffer, which updates the register file in program order]
165
Reorder Buffer
166
What’s in a ROB Entry?
V | DestRegID | DestRegVal | StoreAddr | StoreData | PC | Exception?
(plus valid bits for the destination-register and store fields, and other control bits)
F D E E E E E E E E R W
F D E R W
F D E R W
F D E R W
F D E E E E E E E E R W
F D E R W
F D E R W
168
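A SystemVerilog sketch of one such entry (my rendering; the field widths are assumptions):

// One reorder buffer entry, mirroring the fields listed above.
typedef struct packed {
  logic        valid;         // entry is allocated
  logic        done;          // result has been written into this entry (a control bit)
  logic [4:0]  dest_reg_id;   // DestRegID
  logic        dest_reg_v;    // valid bit for DestRegVal
  logic [31:0] dest_reg_val;  // DestRegVal
  logic        store_v;       // valid bit for StoreAddr/StoreData
  logic [31:0] store_addr;    // StoreAddr
  logic [31:0] store_data;    // StoreData
  logic [31:0] pc;            // PC of the instruction (for precise exception reporting)
  logic        exception;     // Exception?
} rob_entry_t;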
Reorder Buffer: How to Access?
◼ A register value can be in the register file, reorder buffer,
(or bypass/forwarding paths)
171
Important: Register Renaming with a Reorder Buffer
◼ Output and anti dependences are not true dependences
❑ WHY? The same register refers to values that have nothing to
do with each other
❑ They exist due to the lack of enough register IDs (i.e., names) in
the ISA
Output dependence (Write-after-Write, WAW) on r3:
r3 ← r1 op r2
r5 ← r3 op r4
r3 ← r6 op r7
173
Renaming Example
◼ Assume
❑ Register file has a pointer to the reorder buffer entry that
contains or will contain the value, if the register is not valid
❑ Reorder buffer works as described before
174
In-Order Pipeline with Reorder Buffer
◼ Decode (D): Access regfile/ROB, allocate entry in ROB, check if
instruction can execute, if so dispatch instruction
◼ Execute (E): Instructions can complete out-of-order
◼ Completion (R): Write result to reorder buffer
◼ Retirement/Commit (W): Check for exceptions; if none, write result to
architectural register file or memory; else, flush pipeline and start from
exception handler
◼ In-order dispatch/execution, out-of-order completion, in-order retirement
[Figure: Execution timelines in the in-order pipeline with a reorder buffer: F and D are shared, the Execute stage contains functional units of different latencies (integer add: 1 cycle; integer mul: 4 cycles; FP mul: 8 cycles; load/store: many cycles), and every instruction then passes through R (write result into the ROB) and W (in-order retirement)]
175
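A Verilog sketch of the in-order retirement decision at the ROB head (my simplification: one instruction retires per cycle, a power-of-two number of entries, and assumed signal names):

// In-order retirement from the head of the reorder buffer.
module rob_retire #(parameter ENTRIES = 16) (
  input  wire                       clk, rst,
  input  wire [ENTRIES-1:0]         valid,      // per entry: allocated
  input  wire [ENTRIES-1:0]         done,       // per entry: result written into the ROB
  input  wire [ENTRIES-1:0]         exception,  // per entry: instruction raised an exception
  output reg  [$clog2(ENTRIES)-1:0] head,       // oldest instruction in the ROB
  output wire                       retire,     // commit the head entry to architectural state
  output wire                       flush       // head raised an exception: flush and go to handler
);
  assign retire = valid[head] & done[head] & ~exception[head];
  assign flush  = valid[head] & done[head] &  exception[head];

  always @(posedge clk) begin
    if (rst)         head <= 0;
    else if (retire) head <= head + 1'b1;  // retirement proceeds strictly in program order
  end
endmodule

Allocation at decode would advance a matching tail pointer; because only the head may update architectural state, exceptions remain precise.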
Reorder Buffer Tradeoffs
◼ Advantages
❑ Conceptually simple for supporting precise exceptions
❑ Can eliminate false dependences
◼ Disadvantages
❑ Reorder buffer needs to be accessed to get the results that
are yet to be written to the register file
◼ CAM or indirection → increased latency and complexity
176
More on State Maintenance & Precise Exceptions
https://www.youtube.com/onurmutlulectures 177
More on State Maintenance & Precise Exceptions
https://www.youtube.com/onurmutlulectures 178
Lectures on State Maintenance & Recovery
◼ Computer Architecture, Spring 2015, Lecture 11
❑ Precise Exceptions, State Maintenance/Recovery (CMU, Spring 2015)
❑ https://www.youtube.com/watch?v=nMfbtzWizDA&list=PL5PHm2jkkXmi5CxxI7b3J
CL1TWybTDtKq&index=13
https://www.youtube.com/onurmutlulectures 179
Suggested Readings for the Interested
◼ Smith and Pleszkun, "Implementing Precise Interrupts in
Pipelined Processors,” IEEE Trans on Computers 1988 and
ISCA 1985.
◼ Backup Slides
180