Arch3 Pipelining Afterlecture

Digital Design & Computer Arch.

Lecture 12: Pipelining

Prof. Onur Mutlu

ETH Zürich
Spring 2023
31 March 2023
Extra Credit Assignment 1: Talk Analysis
◼ Intelligent Architectures for Intelligent Machines
◼ Watch and analyze this short lecture (33 minutes)
❑ https://www.youtube.com/watch?v=WxHribseelw (Oct 2022)

◼ Assignment – for 1% extra credit
❑ Write a good 1-page summary (following our guidelines)
◼ What are your key takeaways?
◼ What did you learn?
◼ What did you like or dislike?
◼ Submit your summary to Moodle – deadline: April 1
2
Extra Credit Assignment 2: Moore’s Law
◼ Paper review
◼ G.E. Moore, "Cramming more components onto integrated circuits," Electronics magazine, 1965

◼ Optional Assignment – for 1% extra credit
❑ Write a 1-page review
❑ Upload PDF file to Moodle – deadline: April 1

◼ I strongly recommend that you follow my guidelines for (paper) reviews (see next slide)
3
Extra Credit Assignment 2: Moore’s Law
◼ Guidelines on how to review papers critically
❑ Guideline slides: pdf, ppt
❑ Video: https://www.youtube.com/watch?v=tOL6FANAJ8c
❑ Example reviews on “Main Memory Scaling: Challenges and Solution Directions” (link to the paper)
◼ Review 1
◼ Review 2
❑ Example review on “Staged memory scheduling: Achieving high performance and scalability in heterogeneous systems” (link to the paper)
◼ Review 1
4
Agenda for Today & Next Few Lectures
◼ Last week & yesterday: Microarchitecture Fundamentals
❑ Single-cycle Microarchitectures
❑ Multi-cycle Microarchitectures

◼ This week & next: Pipelining
❑ Pipelining
❑ Pipelined Processor Design
◼ Control & Data Dependence Handling
◼ Precise Exceptions: State Maintenance & Recovery

◼ Next next week: Out-of-Order Execution
❑ Out-of-Order Execution
❑ Issues in OoO Execution: Load-Store Handling, …

[Sidebar figure: the transformation hierarchy: Problem, Algorithm, Program/Language, System Software, SW/HW Interface, Micro-architecture, Logic, Devices, Electrons]
5
Readings
◼ This week & next week
❑ Pipelining
◼ H&H, Chapter 7.5
❑ Pipelining Issues
◼ H&H, Chapters 7.7, 7.8.1-7.8.3

◼ Next week & after Easter Break
❑ Out-of-order execution
❑ H&H, Chapters 7.8-7.9
❑ Smith & Sohi, “The Microarchitecture of Superscalar Processors,” Proceedings of the IEEE, 1995
◼ More advanced pipelining
◼ Interrupt and exception handling
◼ Out-of-order and superscalar execution concepts
6
Review: Single-Cycle MIPS Processor (I)

[Figure: the single-cycle MIPS datapath with control signals PCSrc1=Jump and PCSrc2=Br Taken. Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.] JAL, JR, JALR omitted]
Review: Single-Cycle MIPS Processor (II)

[Figure: the single-cycle MIPS datapath with its control unit. Single-cycle processor. Harris and Harris, Chapter 7.3.]

8
Review: Single-Cycle MIPS FSM
◼ Single-cycle machine

[Figure: combinational logic computes the next architectural state AS’ from the current architectural state AS, which is held in sequential logic (state). AS: Architectural State]

9
Can We Do Better?

10
Review: Multi-Cycle MIPS Processor

[Figure: the multi-cycle MIPS datapath with its control unit. Multi-cycle processor. Harris and Harris, Chapter 7.4.]

11
Review: Multi-Cycle MIPS FSM

[Figure: the multi-cycle control FSM with states S0: Fetch, S1: Decode, S2: MemAdr, S3: MemRead, S4: Mem Writeback, S5: MemWrite, S6: Execute, S7: ALU Writeback, S8: Branch, S9: ADDI Execute, S10: ADDI Writeback, and S11: Jump. Multi-cycle processor. Harris and Harris, Chapter 7.4.]

We can easily handle multi-cycle memory access
12
Can We Do Better?

13
Can We Do Better?
◼ What limitations do you see with the multi-cycle design?

◼ Limited concurrency
❑ Some hardware resources are idle during different phases of
instruction processing cycle
❑ “Fetch” logic is idle when an instruction is being “decoded” or
“executed”
❑ Most of the datapath is idle when a memory access is
happening

14
Can We Use the Idle Hardware to Improve Concurrency?

◼ Goal: More concurrency → Higher instruction throughput


(i.e., more “work” completed in one cycle)

◼ Idea: When an instruction is using some resources in its


processing phase, process other instructions on idle
resources not needed by that instruction
❑ E.g., when an instruction is being decoded, fetch the next
instruction
❑ E.g., when an instruction is being executed, decode another
instruction
❑ E.g., when an instruction is accessing data memory (ld/st),
execute the next instruction
❑ E.g., when an instruction is writing its result into the register
file, access data memory for the next instruction
15
Can Have Different Instructions in Different Stages

❑ Fetch
❑ Decode
❑ Evaluate Address
❑ Fetch Operands
❑ Execute
❑ Store Result

1. Instruction fetch (IF)
2. Instruction decode and register operand fetch (ID/RF)
3. Execute/Evaluate memory address (EX/AG)
4. Memory operand fetch (MEM)
5. Store/writeback result (WB)

16
Can Have Different Instructions in Different Stages

[Figure: the multi-cycle MIPS processor datapath, shown again to suggest different instructions occupying different parts of it at once]

Of course, we need to be more careful than this! 17

Pipelining

18
Pipelining: Basic Idea
◼ More systematically:
❑ Pipeline the execution of multiple instructions
❑ Analogy: “Assembly line processing” of instructions

◼ Idea:
❑ Divide the instruction processing cycle into distinct “stages” of
processing
❑ Ensure there are enough hardware resources to process one
instruction in each stage
❑ Process a different instruction in each stage
◼ Instructions consecutive in program order are processed in
consecutive stages

◼ Benefit: Increases instruction processing throughput (1/CPI)


◼ Downside: Start thinking about this…
19
Example: Execution of Four Independent ADDs
◼ Multi-cycle: 4 cycles per instruction
1 instruction completed every 4 cycles
F D E W
        F D E W
                F D E W
                        F D E W
                                  Time →

◼ Pipelined: 4 cycles per 4 instructions (steady state)
1 instruction completed every 1 cycle
F D E W
  F D E W
    F D E W
      F D E W
          Time →

Is life always this beautiful?
20
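The cycle counts in these diagrams can be checked with a small sketch (Python; the function names are illustrative, not from the lecture):

```python
def multicycle_total(n_instructions, cycles_per_instr):
    # Multi-cycle machine: instructions run back to back
    return n_instructions * cycles_per_instr

def pipelined_total(n_instructions, n_stages):
    # Ideal pipeline: n_stages cycles to fill, then 1 instruction per cycle
    return n_stages + (n_instructions - 1)

# Four independent ADDs in the 4-stage (F, D, E, W) example above
assert multicycle_total(4, 4) == 16  # 1 instruction every 4 cycles
assert pipelined_total(4, 4) == 7    # the fourth ADD finishes in cycle 7
```

In steady state the fill cost is amortized: a long run of n instructions takes about n cycles instead of 4n.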
The Laundry Analogy

[Figure: a single load of laundry moving through washer, dryer, folding, and storing, from 6 PM on]

◼ “place one dirty load of clothes in the washer”
◼ “when the washer is finished, place the wet load in the dryer”
◼ “when the dryer is finished, take out the dry load and fold”
◼ “when folding is finished, put the clothes away”

- steps to do a load are sequentially dependent
- no dependence between different loads
- different steps do not require the same resource(s)

21
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
Pipelining Multiple Loads of Laundry

[Figure: four loads of laundry, processed sequentially (1 load every 120 mins) vs. pipelined (1 load every 30 mins)]

- 4 loads of laundry in parallel (in steady state)
- throughput increased by 4x
- latency per load is the same
- no additional resources

22
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
Pipelining Multiple Loads of Laundry: In Practice

[Figure: with a slow dryer, sequential processing completes 1 load every 150 mins; pipelining completes 1 load every 60 mins]

the slowest step (the dryer) decides throughput

23
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
Pipelining Multiple Loads of Laundry: In Practice

[Figure: with one slow dryer, 1 load completes every 150 mins sequentially; with 2 slow dryers operated in parallel, the pipeline completes 1 load every 30 mins]

throughput restored (1 load per 30-min) using 2 dryers

24
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
A Real-Life Pipeline: Automobile Assembly

https://www.caranddriver.com/features/a15115930/fords-assembly-line-turns-100-how-it-really-put-the-world-on-wheels-feature/
By Apparent scan made by the original uploader User:Steveking 89., Fair use, https://en.wikipedia.org/w/index.php?curid=11009879
25
A Real-Life Pipeline: Automobile Assembly

https://medium.com/@ScottAmyx/toyotas-production-philosophy-combines-human-effort-with-automation-a9df78c64ee1 26
An Old Pipelined Computer: IBM Stretch

https://en.wikipedia.org/wiki/IBM_7030_Stretch 27
An Ideal Pipeline
◼ Goal: Increase throughput with little increase in cost
(hardware cost, in case of instruction processing)

◼ Repetition of identical operations


❑ The same operation is repeated on a large number of different
inputs (e.g., all laundry loads go through the same steps)
◼ Repetition of independent operations
❑ No dependences between repeated operations
◼ Uniformly partitionable suboperations
❑ Processing can be evenly divided into uniform-latency
suboperations (that do not share resources)

◼ Fitting examples: automobile assembly line, doing laundry


❑ What about the instruction processing “cycle”?
28
Ideal Pipelining
Tput = Throughput

combinational logic (F,D,E,M,W), T psec → Tput = ~(1/T)

two stages: T/2 ps (F,D,E) | T/2 ps (M,W) → Tput = ~(2/T)

three stages: T/3 ps (F,D) | T/3 ps (E,M) | T/3 ps (M,W) → Tput = ~(3/T)

29
More Realistic Pipeline: Throughput
◼ Nonpipelined version with delay T
Tput = 1 / (T + S), where S = register (sequential logic) delay

◼ k-stage pipelined version
Tput_k-stage = 1 / (T/k + S)
Tput_max = 1 / (1 gate delay + S)
Register delay reduces throughput (sequencing overhead between stages)

This picture assumes “perfect division of work between stages (T/k)”
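As a sketch, the two throughput formulas above can be evaluated numerically (Python; the 800 ps and 50 ps figures are made-up example values, not from the slides):

```python
def tput_nonpipelined(T, S):
    # Tput = 1 / (T + S), S = register (sequential logic) delay
    return 1.0 / (T + S)

def tput_k_stage(T, S, k):
    # Tput_k-stage = 1 / (T/k + S), assuming perfect T/k division of work
    return 1.0 / (T / k + S)

base = tput_nonpipelined(800, 50)  # example: T = 800 ps, S = 50 ps
five = tput_k_stage(800, 50, 5)    # 5 stages: cycle time 160 + 50 = 210 ps
# Register overhead S caps the gain below the ideal k-fold speedup:
assert five / base < 5
```

Note that as k grows, throughput approaches 1/S but never exceeds it, which is exactly the sequencing-overhead point made above.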


More Realistic Pipeline: Cost
◼ Nonpipelined version with combinational cost G
Cost = G + R, where R = register cost

◼ k-stage pipelined version
Cost_k-stage = G + Rk
Registers increase hardware cost

This picture ignores resource and register replication that is likely needed (G/k and Rk)
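Combining this cost model with the throughput model from the previous slide shows the trade-off: deeper pipelines raise throughput but also register cost, so throughput per unit cost peaks at some intermediate depth. A sketch with made-up numbers (T = 800, S = 50, G = 1000, R = 100, all hypothetical):

```python
def pipeline_cost(G, R, k):
    # Cost_k-stage = G + R*k: one extra pipeline register per stage boundary
    return G + R * k

def tput_per_cost(T, S, G, R, k):
    # Cost-efficiency: throughput (1 / cycle time) divided by hardware cost
    return (1.0 / (T / k + S)) / pipeline_cost(G, R, k)

T, S, G, R = 800, 50, 1000, 100  # hypothetical example values
# A moderate depth beats both extremes on cost-efficiency:
assert tput_per_cost(T, S, G, R, 4) > tput_per_cost(T, S, G, R, 1)
assert tput_per_cost(T, S, G, R, 4) > tput_per_cost(T, S, G, R, 100)
```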
Pipelining Instruction Processing

32
Remember: The Instruction Processing Cycle

❑ FETCH
❑ DECODE
❑ EVALUATE ADDRESS
❑ FETCH OPERANDS
❑ EXECUTE
❑ STORE RESULT

33
Remember: The Instruction Processing Cycle

❑ Fetch
❑ Decode
❑ Evaluate Address
❑ Fetch Operands
❑ Execute
❑ Store Result

1. Instruction fetch (IF)
2. Instruction decode and register operand fetch (ID/RF)
3. Execute/Evaluate memory address (EX/AG)
4. Memory operand fetch (MEM)
5. Store/writeback result (WB)

34
Remember the Single-Cycle Microarchitecture

[Figure: the single-cycle MIPS datapath; the whole instruction completes in one long cycle of T, so Tput = ~(1/T). Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]]

35
Dividing the Single-Cycle Uarch Into Stages

[Figure: the single-cycle datapath partitioned into five stages: IF: Instruction fetch (200 ps), ID: Instruction decode/register file read (100 ps), EX: Execute/address calculation (200 ps), MEM: Memory access (200 ps), WB: Write back (100 ps); jump hardware ignored for now]

Is this the correct partitioning?
Why not 4 or 6 stages? Why not different boundaries?
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
36
Instruction Pipeline Throughput

[Figure: three lw instructions (lw $1, 100($0); lw $2, 200($0); lw $3, 300($0)), each going through Instruction fetch, Reg, ALU, Data access, Reg. Nonpipelined: one instruction starts every 800 ps, so 1 instruction completes every 800 ps. Pipelined: one instruction starts every 200 ps, so 1 instruction completes every 200 ps in steady state.]

5-stage speedup is 4, not 5 as predicted by the ideal model. Why?
Half of the clock cycle is wasted in two stages (ID/RF and WB)
37
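The 4x figure follows directly from the stage latencies: the clock period must cover the slowest stage, so the shorter stages waste part of every cycle. A quick check (Python):

```python
# Stage latencies from the partitioning shown earlier (ps):
# IF = 200, ID/RF = 100, EX = 200, MEM = 200, WB = 100
stage_ps = [200, 100, 200, 200, 100]

single_cycle_ps = sum(stage_ps)  # 800 ps per instruction, nonpipelined
cycle_ps = max(stage_ps)         # clock set by the slowest stage: 200 ps

speedup = single_cycle_ps / cycle_ps
assert speedup == 4.0            # not 5: ID/RF and WB each waste 100 ps
```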
Enabling Pipelined Processing: Pipeline Registers

No resource is used by more than one stage

[Figure: the five-stage datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB between the stages; stage-local copies of values (e.g., PCD+4, IRD, AE, BE, ImmE, AoutM, BM, AoutW, MDRW) travel with each instruction. Each stage now takes T/k ps instead of T ps.]

38
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
Pipelined Operation Example

[Figure: a single lw instruction flowing through the pipelined datapath cycle by cycle: Instruction fetch, Instruction decode, Execution, Memory, Write back]

All instruction classes must follow the same path and timing through the pipeline stages.
Any performance impact?

39
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]

Pipelined Operation Example

[Figure: lw $10, 20($1) followed by sub $11, $2, $3 flowing through the pipeline over clock cycles 1-6; in each cycle, lw is one stage ahead of sub]

Is life always this beautiful?

40
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
Illustrating Pipeline Operation: Operation View

      t0  t1  t2  t3  t4  t5
Inst0 IF  ID  EX  MEM WB
Inst1     IF  ID  EX  MEM WB
Inst2         IF  ID  EX  MEM WB
Inst3             IF  ID  EX  MEM WB
Inst4                 IF  ID  EX  MEM ...
                          IF  ID  EX  ...
                              IF  ID  ...
steady state (full pipeline)      IF  ...

41
Illustrating Pipeline Operation: Resource View

t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10

IF I0 I1 I2 I3 I4 I5 I6 I7 I8 I9 I10

ID I0 I1 I2 I3 I4 I5 I6 I7 I8 I9

EX I0 I1 I2 I3 I4 I5 I6 I7 I8

MEM I0 I1 I2 I3 I4 I5 I6 I7

WB I0 I1 I2 I3 I4 I5 I6

42
Control Points in a Pipeline

[Figure: the pipelined datapath annotated with its control points: PCSrc, RegWrite, Branch, MemWrite, ALUSrc, MemtoReg, MemRead, ALUOp, RegDst. Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]]

Identical set of control points as the single-cycle datapath
43
Control Signals in a Pipeline
◼ For a given instruction
❑ same control signals as single-cycle, but
❑ control signals required at different cycles, depending on stage
 Option 1: decode once using the same logic as single-cycle and buffer signals until consumed

[Figure: control signals generated at Decode and carried down the pipeline in the ID/EX, EX/MEM, and MEM/WB registers, with the EX, M, and WB signal groups consumed stage by stage]

 Option 2: carry relevant “instruction word/field” down the pipeline and decode locally within each or in a previous stage

Which one is better?
44
Pipelined Control Signals

[Figure: the pipelined datapath with control signals generated in Decode and buffered in the ID/EX, EX/MEM, and MEM/WB pipeline registers as WB, M, and EX signal groups. Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]]

45
Carnegie Mellon

Another Example: Single-Cycle and Pipelined

[Figure: the single-cycle MIPS datapath (top) and a first-attempt pipelined version (bottom), split into Fetch, Decode, Execute, Memory, and Writeback by pipeline registers]

Is this correct?
46
Another Example: Correct Pipelined Datapath

[Figure: the corrected pipelined datapath; the destination register id is carried along with the instruction through WriteRegE, WriteRegM, and WriteRegW so it reaches the register file together with the result in Writeback]

 WriteReg control signal must arrive at the same time as Result
Pipelined processor. Harris and Harris, Chapter 7.5
47
Another Example: Pipelined Control

[Figure: the pipelined datapath with the control unit decoding Op and Funct in the Decode stage; each control signal (RegWrite, MemtoReg, MemWrite, Branch, ALUControl, ALUSrc, RegDst) is delayed through the pipeline registers to the stage that uses it]

 Same control unit as single-cycle processor
 Control delayed to proper pipeline stage
48
Remember: An Ideal Pipeline
◼ Goal: Increase throughput with little increase in cost
(hardware cost, in case of instruction processing)

◼ Repetition of identical operations


❑ The same operation is repeated on a large number of different
inputs (e.g., all laundry loads go through the same steps)
◼ Repetition of independent operations
❑ No dependencies between repeated operations
◼ Uniformly partitionable suboperations
❑ Processing can be evenly divided into uniform-latency
suboperations (that do not share resources)

◼ Fitting examples: automobile assembly line, doing laundry


❑ What about the instruction processing “cycle”?
49
Instruction Pipeline: Not An Ideal Pipeline
◼ Identical operations ... NOT!
 different instructions → not all need the same stages
Forcing different instructions to go through the same pipe stages
→ external fragmentation (some pipe stages idle for some instructions)

◼ Uniform suboperations ... NOT!


 different pipeline stages → not the same latency
Need to force each stage to be controlled by the same clock
→ internal fragmentation (some pipe stages are fast but still have to
take the same clock cycle time)

◼ Independent operations ... NOT!


 instructions are not independent of each other
Need to detect and resolve inter-instruction dependences to ensure
the pipeline provides correct results
→ pipeline stalls (pipeline is not always moving)
50
Issues in Pipeline Design
◼ Balancing work in pipeline stages
❑ How many stages and what is done in each stage

◼ Keeping the pipeline correct, moving, and full in the


presence of events that disrupt pipeline flow
❑ Handling dependences
◼ Data
◼ Control
❑ Handling resource contention
❑ Handling long-latency (multi-cycle) operations

◼ Handling exceptions, interrupts

◼ Advanced: Improving pipeline throughput


❑ Minimizing stalls
51
Causes of Pipeline Stalls
◼ Stall: A condition when the pipeline stops moving

◼ Resource contention

◼ Dependences (between instructions)


❑ Data
❑ Control

◼ Long-latency (multi-cycle) operations

52
Dependences and Their Types
◼ Also called “dependency” or less desirably “hazard”

◼ Dependences dictate ordering requirements between


instructions

◼ Two types
❑ Data dependence
❑ Control dependence

◼ Resource contention is sometimes called resource


dependence
❑ However, this is not fundamental to (dictated by) program
semantics, so we will treat it separately
53
Handling Resource Contention
◼ Happens when instructions in two pipeline stages need the
same resource

◼ Solution 1: Eliminate the cause of contention


❑ Duplicate the resource or increase its throughput
◼ E.g., use separate instruction and data memories (caches)
◼ E.g., use multiple ports for memory structures

◼ Solution 2: Detect the resource contention and stall one of


the contending stages
❑ Which stage do you stall?
❑ Example: What if you had a single read and write port for the
register file?

54
Example Resource Dependence: RegFile

 The register file can be read and written in the same cycle:
▪ write takes place during the 1st half of the cycle
▪ read takes place during the 2nd half of the cycle => no problem!!!
▪ read/write from/to the register file has only half a clock cycle to complete

55
Data Dependences
◼ Data dependence types
❑ Flow dependence (true data dependence – read after write)
❑ Anti dependence (write after read)
❑ Output dependence (write after write)

◼ Which ones cause stalls in a pipelined machine?


❑ For all of them, we need to ensure semantics of the program
is correct
❑ Flow dependences always need to be obeyed because they
constitute true dependence on a value
❑ Anti and output dependences exist due to limited number of
architectural registers
◼ They are dependence on a name, not a value
◼ We will later see what we can do about them
56
Data Dependence Types
Flow dependence
r3 ← r1 op r2      Read-after-Write
r5 ← r3 op r4      (RAW)

Anti dependence
r3 ← r1 op r2      Write-after-Read
r1 ← r4 op r5      (WAR)

Output dependence
r3 ← r1 op r2      Write-after-Write
r5 ← r3 op r4      (WAW)
r3 ← r6 op r7
57
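The three dependence types can be detected mechanically from each instruction's read and write sets. A small sketch (Python; register names as strings, purely illustrative):

```python
def classify(i_writes, i_reads, j_writes, j_reads):
    # Dependences from instruction i to a later instruction j
    deps = set()
    if i_writes & j_reads:
        deps.add("RAW")  # flow: j reads a value i produces
    if i_reads & j_writes:
        deps.add("WAR")  # anti: j overwrites a register i still reads
    if i_writes & j_writes:
        deps.add("WAW")  # output: both write the same register
    return deps

# r3 <- r1 op r2 followed by r5 <- r3 op r4: flow dependence
assert classify({"r3"}, {"r1", "r2"}, {"r5"}, {"r3", "r4"}) == {"RAW"}
# r3 <- r1 op r2 followed by r1 <- r4 op r5: anti dependence
assert classify({"r3"}, {"r1", "r2"}, {"r1"}, {"r4", "r5"}) == {"WAR"}
# r3 <- r1 op r2 followed (later) by r3 <- r6 op r7: output dependence
assert classify({"r3"}, {"r1", "r2"}, {"r3"}, {"r6", "r7"}) == {"WAW"}
```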
Pipelined Operation Example

[Figure: lw $10, 20($1) followed by sub $11, $2, $3 in the pipeline, as before]

What if the SUB were dependent on LW?

58
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
Data Dependence Handling

59
Readings for Next Few Lectures
◼ H&H, Chapter 7.5-7.9

◼ Smith and Sohi, “The Microarchitecture of Superscalar


Processors,” Proceedings of the IEEE, 1995
❑ More advanced pipelining
❑ Interrupt and exception handling
❑ Out-of-order and superscalar execution concepts

60
How to Handle Data Dependences
◼ Anti and output dependences are easier to handle
❑ write to the destination only in last stage and in program order

◼ Flow dependences are more interesting & challenging

◼ Six fundamental ways of handling flow dependences


❑ Detect and wait until value is available in register file
❑ Detect and forward/bypass data to dependent instruction
❑ Detect and eliminate the dependence at the software level
◼ No need for the hardware to detect dependence
❑ Detect and move it out of the way for independent instructions
❑ Predict the needed value(s), execute “speculatively”, and verify
❑ Do something else (fine-grained multithreading)
◼ No need to detect
61
Recall: Data Dependence Types
Flow dependence
r3 ← r1 op r2      Read-after-Write
r5 ← r3 op r4      (RAW)

Anti dependence
r3 ← r1 op r2      Write-after-Read
r1 ← r4 op r5      (WAR)

Output dependence
r3 ← r1 op r2      Write-after-Write
r5 ← r3 op r4      (WAW)
r3 ← r6 op r7
62
RAW Dependence Handling
◼ Which of the following flow dependences lead to
conflicts in the 5-stage pipeline?

addi ra r- -   IF ID EX MEM WB
addi r- ra -      IF ID EX MEM WB
addi r- ra -         IF ID EX MEM
addi r- ra -            IF ID EX
addi r- ra -               IF ?ID
addi r- ra -                  IF
63
Pipeline Stall: Resolving Data Dependence

[Figure: instruction i writes rx; a later instruction j reads rx. Bubbles are inserted in j's ID stage when dist(i,j) = 1, 2, or 3; no bubble is needed when dist(i,j) = 4.]

i: rx ← _
j: _ ← rx

Stall = make the dependent instruction wait
until its source data value is available
1. stop all up-stream stages
2. drain all down-stream stages
64
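With values written to the register file only in WB and read only in ID, the number of bubbles depends only on the producer-consumer distance. A sketch (Python; the split-cycle option models the half-cycle register-file trick from the earlier RegFile slide, and the function name is illustrative):

```python
def stall_cycles(dist, split_cycle_regfile=False):
    """Bubbles needed between producer i and a dependent consumer j in
    the 5-stage pipeline, where dist = j - i in program order.  With a
    split-cycle register file (write in the 1st half of the cycle, read
    in the 2nd), WB and a dependent ID can share a cycle, saving one
    bubble."""
    slack = 3 if split_cycle_regfile else 4
    return max(0, slack - dist)

assert stall_cycles(1) == 3  # back-to-back dependence: 3 bubbles
assert stall_cycles(3) == 1
assert stall_cycles(4) == 0  # far enough apart: no stall needed
assert stall_cycles(1, split_cycle_regfile=True) == 2
```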
Interlocking
◼ Interlocking: Detection of dependence between instructions
in a pipelined processor to guarantee correct execution

◼ Software based interlocking


vs.
◼ Hardware based interlocking

◼ MIPS acronym?

65
Approaches to Dependence Detection (I)
◼ Scoreboarding
❑ Each register in register file has a Valid bit associated with it
❑ An instruction that is writing to the register resets the Valid bit
❑ An instruction in Decode stage checks if all its source and
destination registers are Valid
◼ Yes: No need to stall… No dependence
◼ No: Stall the instruction

◼ Advantage:
❑ Simple. 1 bit per register

◼ Disadvantage:
❑ Need to stall for all types of dependences, not only flow dep.
66
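A minimal sketch of the valid-bit scoreboard described above (Python; the class and method names are illustrative, not from the slides):

```python
class Scoreboard:
    def __init__(self, n_regs=32):
        self.valid = [True] * n_regs  # 1 Valid bit per register

    def can_issue(self, srcs, dst):
        # Decode checks that ALL source and destination regs are Valid
        return all(self.valid[r] for r in srcs) and self.valid[dst]

    def issue(self, dst):
        self.valid[dst] = False       # in-flight writer resets Valid

    def writeback(self, dst):
        self.valid[dst] = True        # value is now in the register file

sb = Scoreboard()
sb.issue(10)                          # e.g., lw $10, 20($1) in flight
assert not sb.can_issue({1, 10}, 11)  # a dependent instruction must stall
assert not sb.can_issue({2, 3}, 10)   # stalls on output dep. too (the downside)
sb.writeback(10)
assert sb.can_issue({1, 10}, 11)
```

Note how the destination check makes the scoreboard stall on anti and output dependences as well, which is exactly the disadvantage listed above.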
data 1
0
Write
data

Pipelined Operation Example
[Figure: P&H 5-stage pipelined datapath shown over clock cycles 1-3:
lw $10, 20($1) followed by sub $11, $2, $3 proceed through Instruction
Fetch, Instruction Decode, and Execution. Now assume SUB is dependent on LW.]
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
67
Approaches to Dependence Detection (II)
◼ Combinational dependence check logic
❑ Special logic checks if any instruction in later stages is
supposed to write to any source register of the instruction that
is being decoded
❑ Yes: stall the instruction/pipeline
❑ No: no need to stall… no flow dependence

◼ Advantage:
❑ No need to stall on anti and output dependences

◼ Disadvantage:
❑ Logic is more complex than a scoreboard
❑ Logic becomes more complex as we make the pipeline deeper
and wider (flash-forward: think superscalar execution)
68
Pipelined Operation Example
[Figure: P&H 5-stage pipelined datapath shown over clock cycles 1-3:
lw $10, 20($1) followed by sub $11, $2, $3 proceed through Instruction
Fetch, Instruction Decode, and Execution. Now assume SUB is dependent on LW.]
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
69
Once You Detect the Dependence in Hardware
◼ What do you do afterwards?

◼ Observation: Dependence between two instructions is


detected before the communicated data value becomes
available

◼ Option 1: Stall the dependent instruction right away


◼ Option 2: Stall the dependent instruction only when
necessary → data forwarding/bypassing
◼ Option 3: …

70
Data Forwarding/Bypassing
◼ Problem: A consumer (dependent) instruction has to wait in
decode stage until the producer instruction writes its value
in the register file

◼ Goal: We do not want to stall the pipeline unnecessarily

◼ Observation: The data value needed by the consumer


instruction can be supplied directly from a later stage in the
pipeline (instead of only from the register file)

◼ Idea: Add additional dependence check logic and data


forwarding paths (buses) to supply the producer’s value to
the consumer right after the value is available

◼ Benefit: Consumer can move in the pipeline until the point


the value can be supplied → less stalling
71
Data Dependence Handling:
Concepts and Implementation

72
We Covered
Until This Point
in Lecture

73
Digital Design & Computer Arch.
Lecture 12: Pipelining

Prof. Onur Mutlu

ETH Zürich
Spring 2023
31 March 2023
Slides for Future Lectures

75
How to Implement Stalling
[Figure: P&H 5-stage pipelined datapath with the control unit and pipeline
registers (IF/ID, ID/EX, EX/MEM, MEM/WB); control signals include PCSrc,
RegWrite, Branch, MemWrite, ALUSrc, MemtoReg, MemRead, ALUOp, and RegDst.]
◼ Stall
❑ disable PC and IF/ID latching; ensure stalled instruction stays in its stage
❑ Insert “invalid” instructions/nops into the stage following the stalled one
(called “bubbles”)
76
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
RAW Data Dependence Example
 One instruction writes a register ($s0) and the next instructions
read this register => read after write (RAW) dependence.
❑ add writes into $s0 in the first half of cycle 5
❑ and reads $s0 in cycle 3, obtaining the wrong value
❑ or reads $s0 in cycle 4, again obtaining the wrong value
❑ sub reads $s0 in the 2nd half of cycle 5, getting the correct value
❑ subsequent instructions read the correct value of $s0
Wrong results happen only if the pipeline handles data dependences incorrectly!

Time (cycles):       1    2    3    4    5    6    7    8
add $s0, $s2, $s3    IM   RF   +    DM   RF
and $t0, $s0, $s1         IM   RF   &    DM   RF
or  $t1, $s4, $s0              IM   RF   |    DM   RF
sub $t2, $s0, $s5                   IM   RF   -    DM   RF
Compile-Time Detection and Elimination
Time (cycles):       1    2    3    4    5    6    7    8    9    10
add $s0, $s2, $s3    IM   RF   +    DM   RF
nop                       IM   RF        DM   RF
nop                            IM   RF        DM   RF
and $t0, $s0, $s1                   IM   RF   &    DM   RF
or  $t1, $s4, $s0                        IM   RF   |    DM   RF
sub $t2, $s0, $s5                             IM   RF   -    DM   RF

◼ Insert enough independent instructions for the required result


to be ready by the time it is needed by a dependent one
❑ Reorder/reschedule/insert instructions at the compiler level
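The compiler-side fix above can be sketched as a pass that pads each producer-consumer pair with nops. This Python sketch is illustrative only: the `(dst, src1, src2)` register-tuple encoding and the required distance of 3 (matching the split-cycle register file used in these slides) are assumptions.

```python
# Sketch of compile-time RAW elimination: insert enough nops so that a
# consumer sits at least `dist` instructions after its nearest producer.
def insert_nops(program, dist=3):
    out = []
    for dst, s1, s2 in program:
        # find the nearest earlier instruction producing s1 or s2
        for back, (pdst, _, _) in enumerate(reversed(out), start=1):
            if pdst is not None and pdst in (s1, s2) and back < dist:
                # pad with (None, None, None) nops until the gap is `dist`
                out.extend([(None, None, None)] * (dist - back))
                break
        out.append((dst, s1, s2))
    return out
```

A real compiler would try to fill these slots with independent useful instructions (reordering) before falling back to nops.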
Data Forwarding
◼ Also called Data Bypassing

◼ Forward the result value to the dependent instruction


as soon as the value is available

◼ We have already seen the basic idea before


◼ Remember dataflow?
❑ Data value is supplied to dependent instruction as soon as it is
available
❑ Instruction executes when all its operands are available

◼ Data forwarding brings a pipeline closer to data flow


execution principles
Data Forwarding: Locations in Datapath

Time (cycles):       1    2    3    4    5    6    7    8
add $s0, $s2, $s3    IM   RF   +    DM   RF
and $t0, $s0, $s1         IM   RF   &    DM   RF
or  $t1, $s4, $s0              IM   RF   |    DM   RF
sub $t2, $s0, $s5                   IM   RF   -    DM   RF

Forwarding paths:
◼ From latched output of ALU to input of ALU
◼ From WB to input of ALU
◼ From WB to RF (internal in Register File)
Data Forwarding: Datapath & Control
Forwarding paths:
◼ From latched output of ALU to input of ALU
◼ From WB to input of ALU
◼ From Register File (internal forwarding)

[Figure: pipelined datapath with ForwardAE/ForwardBE muxes at the ALU
inputs, driven by a Dependence Detection (Hazard) Unit that compares
RsE/RtE against WriteRegM/WriteRegW and checks RegWriteM/RegWriteW.]
Data Forwarding: Implementation
◼ Forward to Execute stage from either:
❑ Memory stage or
❑ Writeback stage

◼ When should we forward from either Memory or Writeback


stage?
❑ If that stage will write to a destination register and the
destination register matches the source register
❑ If both the Memory & Writeback stages contain matching
destination registers, Memory stage has priority to forward its
data, because it contains the more recently executed instruction
Data Forwarding (in Pseudocode)
◼ Forward to Execute stage from either:
❑ Memory stage or
❑ Writeback stage

◼ Forwarding logic for ForwardAE (pseudo code):


if ((rsE != 0) AND (rsE == WriteRegM) AND RegWriteM) then
ForwardAE = 10 # forward from Memory stage
else if ((rsE != 0) AND (rsE == WriteRegW) AND RegWriteW) then
ForwardAE = 01 # forward from Writeback stage
else
ForwardAE = 00 # no forwarding

◼ Forwarding logic for ForwardBE same, but replace rsE with rtE
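The ForwardAE pseudocode above translates almost line for line into executable form. This Python sketch is for illustration; returning the 2-bit mux select as a string is a readability choice, not the hardware encoding.

```python
# Transcription of the ForwardAE forwarding logic: Memory stage has
# priority over Writeback because it holds the more recent instruction.
def forward_ae(rsE, WriteRegM, RegWriteM, WriteRegW, RegWriteW):
    if rsE != 0 and rsE == WriteRegM and RegWriteM:
        return "10"  # forward ALU result from the Memory stage
    if rsE != 0 and rsE == WriteRegW and RegWriteW:
        return "01"  # forward result from the Writeback stage
    return "00"      # no forwarding: read the register file
```

The ForwardBE function is identical with rtE in place of rsE.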
Data Forwarding Is Not Always Possible
Time (cycles):       1    2    3    4    5    6    7    8
lw  $s0, 40($0)      IM   RF   +    DM   RF
and $t0, $s0, $s1         IM   RF   &    DM   RF   ← Trouble!
or  $t1, $s4, $s0              IM   RF   |    DM   RF
sub $t2, $s0, $s5                   IM   RF   -    DM   RF

◼ Forwarding is usually sufficient to resolve RAW data dependences


◼ Unfortunately, there are cases when forwarding is not possible
❑ due to pipeline design and instruction latencies
❑ The lw instruction does not finish reading data until the end of Memory stage
→ its result cannot be forwarded to the Execute stage of the next instruction
unless we want a long critical path → breaks critical path design principle
Stalling Necessary for MEM-EX Dependence

Time (cycles):       1    2    3    4    5    6    7    8    9
lw  $s0, 40($0)      IM   RF   +    DM   RF
and $t0, $s0, $s1         IM   RF   RF   &    DM   RF
or  $t1, $s4, $s0              IM   IM   RF   |    DM   RF
sub $t2, $s0, $s5                       IM   RF   -    DM   RF
(Stall: and repeats Decode and or repeats Fetch in cycle 4)
Stalling and Dependence Detection Hardware
[Figure: pipelined datapath extended with stall hardware: enable (EN)
inputs on the PC and the Fetch/Decode pipeline registers, a clear (CLR)
input on the Execute pipeline register, and StallF, StallD, FlushE signals
driven by the Dependence Detection (Hazard) Unit alongside
ForwardAE/ForwardBE.]
Hardware Needed for Stalling
◼ Stalls are supported by adding
❑ enable inputs (EN) to the Fetch and Decode pipeline registers
❑ synchronous reset/clear (CLR) input to the Execute pipeline
register
◼ or an INV bit associated with each pipeline register, indicating that
contents are INValid

◼ When a lw stall occurs


❑ Keep the values in the Decode and Fetch stage pipeline registers
◼ StallD and StallF are asserted
❑ Clear the contents of the Execute stage register, introducing a
bubble
◼ FlushE is also asserted
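The load-use stall condition described above (and given later in the deck as Verilog) can be sketched directly. This is a Python sketch of the hazard-unit check, for illustration.

```python
# Stall Fetch/Decode and flush Execute when the instruction in Execute is
# a load (MemtoRegE) whose destination (rtE) is a source of the
# instruction currently in Decode.
def lw_stall(rsD, rtD, rtE, MemtoRegE):
    return MemtoRegE and (rsD == rtE or rtD == rtE)
```

When this returns True, StallF, StallD, and FlushE are all asserted for one cycle.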
A Special Case of Data Dependence
◼ Control dependence
❑ Data dependence on the Instruction Pointer / Program Counter

88
Control Dependence
◼ Question: What should the fetch PC be in the next cycle?
◼ Answer: The address of the next instruction
❑ All instructions are control dependent on previous ones. Why?

◼ If the fetched instruction is a non-control-flow instruction:


❑ Next Fetch PC is the address of the next-sequential instruction
❑ Easy to determine if we know the size of the fetched instruction

◼ If the instruction that is fetched is a control-flow instruction:


❑ How do we determine the next Fetch PC?

◼ In fact, how do we know whether or not the fetched


instruction is a control-flow instruction?
89
Carnegie Mellon

Branch Prediction
 Special case of data dependence: dependence on PC

 beq:
▪ Conditional branch is not resolved until the fourth stage of the pipeline
▪ Instructions after the branch are fetched before branch is resolved
▪ Simple “branch prediction” example:
▪ Always predict that the next sequential instruction is fetched
▪ Called “Always not taken” prediction
▪ Flush (invalidate) “not-taken path” instructions if the branch is taken

 Branch misprediction penalty


▪ number of instructions flushed when branch is incorrectly predicted
▪ Penalty can be reduced by resolving the branch earlier
▪ Called “Early branch resolution”
90
Carnegie Mellon

Our Pipeline So Far


[Figure: the pipelined datapath with data forwarding and stalling:
ForwardAE/ForwardBE, StallF, StallD, and FlushE, all driven by the Hazard
Unit (Dependence Detection Logic).]

91
Carnegie Mellon

Control Dependence: Flush on Misprediction


Time (cycles):        1    2    3    4    5    6    7    8    9
20 beq $t1, $t2, 40   IM   RF   -    DM   RF
24 and $t0, $s0, $s1       IM   RF   &    (flushed)
28 or  $t1, $s4, $s0            IM   RF   (flushed)
2C sub $t2, $s0, $s5                 IM   (flushed)
30 ...
64 slt $t3, $s2, $s3                      IM   RF   slt  DM   RF

Flush 3 instructions when the branch is taken (resolved in the Memory stage)
92
Carnegie Mellon

Pipeline with Early Branch Resolution

Dependence Detection Logic

Need to calculate branch target and condition in the Decode Stage
93


Carnegie Mellon

Early Branch Resolution


Time (cycles):        1    2    3    4    5    6    7    8    9
20 beq $t1, $t2, 40   IM   RF   -    DM   RF
24 and $t0, $s0, $s1       IM   (flushed)
28 or  $t1, $s4, $s0
2C sub $t2, $s0, $s5
30 ...
64 slt $t3, $s2, $s3            IM   RF   slt  DM   RF

Flush only 1 instruction when the branch is taken (resolved in the Decode stage)

94
Carnegie Mellon

Early Branch Resolution: Good Idea?


 Advantages
▪ Reduced branch misprediction penalty
→ Reduced CPI (cycles per instruction)

 Disadvantages
▪ Potential increase in clock cycle time?
→ Higher clock period and lower frequency?
▪ Additional hardware cost
→ Specialized and likely not used by other instructions

95
Recall: Performance Analysis Basics
◼ Execution time of a single instruction
❑ {CPI} x {clock cycle time}
◼ CPI: Number of cycles it takes to execute an instruction

◼ Execution time of an entire program


❑ Sum over all instructions [{CPI} x {clock cycle time}]
❑ {# of instructions} x {Average CPI} x {clock cycle time}

96
Carnegie Mellon

Data Forwarding for Early Branch Resolution


[Figure: pipelined datapath with early branch resolution: an equality
comparator (EqualD) and a branch-target adder (PCBranchD) in the Decode
stage drive PCSrcD, with extra forwarding paths ForwardAD/ForwardBD into
the comparator and a branch-stall condition added to the Dependence
Detection (Hazard) Unit.]

Data forwarding for early branch resolution adds even more complexity
97
Carnegie Mellon

Forwarding and Stalling Hardware Control


// Forwarding logic:
assign ForwardAD = (rsD != 0) & (rsD == WriteRegM) & RegWriteM;
assign ForwardBD = (rtD != 0) & (rtD == WriteRegM) & RegWriteM;

//Stalling logic:
assign lwstall = ((rsD == rtE) | (rtD == rtE)) & MemtoRegE;

assign branchstall = (BranchD & RegWriteE &


(WriteRegE == rsD | WriteRegE == rtD))
|
(BranchD & MemtoRegM &
(WriteRegM == rsD | WriteRegM == rtD));

// Stall signals:
assign StallF = lwstall | branchstall;
assign StallD = lwstall | branchstall;
assign FlushE = lwstall | branchstall;

98
Carnegie Mellon

Final Pipelined MIPS Processor (H&H)

Dependence Detection Logic

Includes always-taken br prediction, early branch resolution, forwarding, stall logic


99
Carnegie Mellon

Doing Better: Smarter Branch Prediction


 Guess whether or not branch will be taken
▪ Backward branches are usually taken (loops iterate many times)
▪ History of whether branch was previously taken can improve the guess

 Accurate branch prediction reduces the fraction of branches


requiring a flush

 Many sophisticated techniques are employed in modern


processors
▪ Including simple machine learning methods (perceptrons)
▪ We will see them in Branch Prediction lectures

100
More on Branch Prediction (I)

https://www.youtube.com/watch?v=h6l9yYSyZHM&list=PL5Q2soXY2Zi_FRrloMa2fUYWPGiZUBQo2&index=22
More on Branch Prediction (II)

https://www.youtube.com/watch?v=z77VpggShvg&list=PL5Q2soXY2Zi_FRrloMa2fUYWPGiZUBQo2&index=23
More on Branch Prediction (III)

https://www.youtube.com/watch?v=yDjsr-jTOtk&list=PL5PHm2jkkXmgVhh8CHAu9N76TShJqfYDt&index=4
Lectures on Branch Prediction
◼ Digital Design & Computer Architecture, Spring 2020, Lecture 16b
❑ Branch Prediction I (ETH Zurich, Spring 2020)
❑ https://www.youtube.com/watch?v=h6l9yYSyZHM&list=PL5Q2soXY2Zi_FRrloMa2fU
YWPGiZUBQo2&index=22

◼ Digital Design & Computer Architecture, Spring 2020, Lecture 17


❑ Branch Prediction II (ETH Zurich, Spring 2020)
❑ https://www.youtube.com/watch?v=z77VpggShvg&list=PL5Q2soXY2Zi_FRrloMa2fU
YWPGiZUBQo2&index=23

◼ Computer Architecture, Spring 2015, Lecture 5


❑ Advanced Branch Prediction (CMU, Spring 2015)
❑ https://www.youtube.com/watch?v=yDjsr-
jTOtk&list=PL5PHm2jkkXmgVhh8CHAu9N76TShJqfYDt&index=4

https://www.youtube.com/onurmutlulectures 104
Pipelined Performance Example

105
Carnegie Mellon

Pipelined Performance Example


 An important program consists of:
▪ 25% loads
▪ 10% stores
▪ 11% branches
▪ 2% jumps
▪ 52% R-type

 Assume:
▪ 40% of loads used by next instruction
▪ 25% of branches mispredicted; flushes the next instruction on misprediction
▪ All jumps flush the next instruction fetched

 What is the average CPI?


106
Carnegie Mellon

Pipelined Performance Example: CPI


 Load/Branch CPI = 1 when no stall/flush, 2 when stall/flush.
Thus:
▪ CPIlw = 1(0.6) + 2(0.4) = 1.4 Average CPI for load
▪ CPIbeq = 1(0.75) + 2(0.25) = 1.25 Average CPI for branch

 And
▪ Average CPI =

107
Carnegie Mellon

Pipelined Performance Example: CPI


 Load/Branch CPI = 1 when no stall/flush, 2 when stall/flush.
Thus:
▪ CPIlw = 1(0.6) + 2(0.4) = 1.4 Average CPI for load
▪ CPIbeq = 1(0.75) + 2(0.25) = 1.25 Average CPI for branch

 And
▪ Average CPI = (0.25)(1.4)     load
              + (0.1)(1)        store
              + (0.11)(1.25)    beq
              + (0.02)(2)       jump
              + (0.52)(1)       r-type
              = 1.15

108
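The arithmetic above can be checked directly. A Python sketch of the weighted-average CPI, using the instruction mix and stall/flush rates from the example:

```python
# Average CPI for the instruction mix: loads and branches pay an extra
# cycle when they stall or flush; stores, jumps, and R-type as given.
cpi_lw  = 1 * 0.60 + 2 * 0.40   # 40% of loads incur a 1-cycle stall
cpi_beq = 1 * 0.75 + 2 * 0.25   # 25% of branches flush one instruction
avg_cpi = (0.25 * cpi_lw + 0.10 * 1 + 0.11 * cpi_beq
           + 0.02 * 2 + 0.52 * 1)
```

The exact value is 1.1475, which the slide rounds to 1.15.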
Carnegie Mellon

Pipelined Performance Example: Cycle Time


 There are 5 stages, and 5 different timing paths:

Tc = max {
tpcq + tmem + tsetup fetch
2(tRFread + tmux + teq + tAND + tmux + tsetup ) decode
tpcq + tmux + tmux + tALU + tsetup execute
tpcq + tmemwrite + tsetup memory
2(tpcq + tmux + tRFwrite) writeback
}

 The clock cycle depends on the slowest stage

 Decode and Writeback use register file and have only half a
clock cycle to complete → that is why there is a 2 in front of them
109
Carnegie Mellon

Final Pipelined MIPS Processor (H&H)

Dependence Detection Logic

Includes always-taken br prediction, early branch resolution, forwarding, stall logic


110
Carnegie Mellon

Pipelined Performance Example: Cycle Time


Element Parameter Delay (ps)
Register clock-to-Q tpcq_PC 30
Register setup tsetup 20
Multiplexer tmux 25
ALU tALU 200
Memory read tmem 250
Register file read tRFread 150
Register file setup tRFsetup 20
Equality comparator teq 40
AND gate tAND 15
Memory write tmemwrite 220
Register file write tRFwrite 100

Tc = 2(tRFread + tmux + teq + tAND + tmux + tsetup )


= 2[150 + 25 + 40 + 15 + 25 + 20] ps
= 550 ps
111
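The cycle-time computation can likewise be checked. A Python sketch using the delay values from the table above (in ps); the factor of 2 on Decode and Writeback reflects the half-cycle register-file access:

```python
# Tc is set by the slowest of the five stage timing paths.
t = dict(pcq=30, setup=20, mux=25, alu=200, mem=250,
         rfread=150, eq=40, and_=15, memwrite=220, rfwrite=100)
Tc = max(
    t["pcq"] + t["mem"] + t["setup"],                          # Fetch
    2 * (t["rfread"] + t["mux"] + t["eq"]
         + t["and_"] + t["mux"] + t["setup"]),                 # Decode
    t["pcq"] + 2 * t["mux"] + t["alu"] + t["setup"],           # Execute
    t["pcq"] + t["memwrite"] + t["setup"],                     # Memory
    2 * (t["pcq"] + t["mux"] + t["rfwrite"]),                  # Writeback
)
```

Decode dominates at 2 × 275 ps = 550 ps.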
Carnegie Mellon

Pipelined Performance Example: Exec Time


 For a program with 100 billion instructions executing on a
pipelined MIPS processor:
▪ CPI = 1.15
▪ Tc = 550 ps

 Execution Time = (# instructions) × CPI × Tc


= (100 × 109)(1.15)(550 × 10-12)
= 63 seconds

112
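The final execution time follows from the performance equation. A Python sketch; the exact product is 63.25 s, which the slide rounds to 63 seconds:

```python
# Execution time = (# instructions) x CPI x Tc, with the values above.
n_insts = 100e9       # 100 billion instructions
cpi = 1.15
tc = 550e-12          # 550 ps cycle time, in seconds
exec_time = n_insts * cpi * tc
```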
Carnegie Mellon

Performance Summary for 3 MIPS microarch.


Processor      Execution Time (seconds)   Speedup (single-cycle is baseline)
Single-cycle   95                         1
Multicycle     133                        0.71
Pipelined      63                         1.51

 Pipelined implementation is the fastest of 3 implementations

 Even though we have a 5-stage pipeline, speedup is not 5X


over the single-cycle or the multi-cycle system!

113
Recall: How to Handle Data Dependences
◼ Anti and output dependences are easier to handle
❑ write to the destination only in last stage and in program order

◼ Flow dependences are more interesting

◼ Six fundamental ways of handling flow dependences


❑ Detect and wait until value is available in register file
❑ Detect and forward/bypass data to dependent instruction
❑ Detect and eliminate the dependence at the software level
◼ No need for the hardware to detect dependence
❑ Detect and move it out of the way for independent instructions
❑ Predict the needed value(s), execute “speculatively”, and verify
❑ Do something else (fine-grained multithreading)
◼ No need to detect
114
Recall: How to Handle Data Dependences
◼ Anti and output dependences are easier to handle
❑ write to the destination only in last stage and in program order

◼ Flow dependences are more interesting

◼ Six fundamental ways of handling flow dependences


❑ Detect and wait until value is available in register file
❑ Detect and forward/bypass data to dependent instruction
❑ Detect and eliminate the dependence at the software level
◼ No need for the hardware to detect dependence
❑ Detect and move it out of the way for independent instructions
❑ Predict the needed value(s), execute “speculatively”, and verify
❑ Do something else (fine-grained multithreading)
◼ No need to detect
115
Question to Ponder: Hardware vs. Software
◼ What is the role of the hardware vs. the software in data
dependence handling?
❑ Software based interlocking
❑ Hardware based interlocking
❑ Who inserts/manages the pipeline bubbles?
❑ Who finds the independent instructions to fill “empty” pipeline
slots?
❑ What are the advantages/disadvantages of each?
◼ Think of the performance equation as well

116
Question to Ponder: Hardware vs. Software
◼ What is the role of the hardware vs. the software in the
order in which instructions are executed in the pipeline?
❑ Software based instruction scheduling → static scheduling
❑ Hardware based instruction scheduling → dynamic scheduling

◼ How does each impact different metrics?


❑ Performance (and parts of the performance equation)
❑ Complexity
❑ Power consumption
❑ Reliability
❑ Cost
❑ …

117
More on Software vs. Hardware
◼ Software based scheduling of instructions → static scheduling
❑ Hardware executes the instructions in the compiler-dictated order
❑ Contrast this with dynamic scheduling: hardware can execute
instructions out of the compiler-specified order
❑ How does the compiler know the latency of each instruction?

◼ What information does the compiler not know that makes


static scheduling difficult?
❑ Answer: Anything that is determined at run time
◼ Variable-length operation latency, memory address, branch direction

◼ How can the compiler alleviate this (i.e., estimate the


unknown)?
❑ Answer: Profiling (done statically or dynamically)
118
More on Static Instruction Scheduling

https://www.youtube.com/watch?v=isBEVkIjgGA&list=PL5PHm2jkkXmi5CxxI7b3JCL1TWybTDtKq&index=18
Lectures on Static Instruction Scheduling

◼ Computer Architecture, Spring 2015, Lecture 16


❑ Static Instruction Scheduling (CMU, Spring 2015)
❑ https://www.youtube.com/watch?v=isBEVkIjgGA&list=PL5PHm2jkkXmi5CxxI7b3JC
L1TWybTDtKq&index=18

◼ Computer Architecture, Spring 2013, Lecture 21


❑ Static Instruction Scheduling (CMU, Spring 2013)
❑ https://www.youtube.com/watch?v=XdDUn2WtkRg&list=PL5PHm2jkkXmidJOd59RE
og9jDnPDTG6IJ&index=21

https://www.youtube.com/onurmutlulectures 120
Recall: Semantic Gap
◼ How close instructions & data types & addressing modes
are to high-level language (HLL)

Small Semantic Gap                      Large Semantic Gap

HLL                                     HLL
 ↓                                       ↓
ISA with Complex Inst & Data Types      ISA with Simple Inst & Data Types
& Addressing Modes                      & Addressing Modes
 ↓                                       ↓
HW Control Signals                      HW Control Signals

Easier mapping of HLL to ISA            Harder mapping of HLL to ISA
Less work for software designer         More work for software designer
More work for hardware designer         Less work for hardware designer
Optimization burden on HW               Optimization burden on SW
Recall: How to Change the Semantic Gap Tradeoffs
◼ Translate from one ISA into a different “implementation” ISA

HLL
 ↓   (small semantic gap)
ISA with Complex Inst & Data Types & Addressing Modes (e.g., x86-64)
 ↓
Software or Hardware Translator
 ↓
Implementation ISA with Simple Inst & Data Types & Addressing Modes
(e.g., ARM v8.4)
 ↓
HW Control Signals

SW, translator, HW can all perform operation re-ordering
122


An Example: Rosetta 2 Binary Translator

https://en.wikipedia.org/wiki/Rosetta_(software)#Rosetta_2 123
An Example: Rosetta 2 Binary Translator

Apple M1,
2021

Source: https://www.anandtech.com/show/16252/mac-mini-apple-m1-tested 124


Another Example: NVIDIA Denver

https://www.anandtech.com/show/8701/the-google-nexus-9-review/4 https://www.toradex.com/computer-on-modules/apalis-arm-family/nvidia-tegra-k1 125


https://safari.ethz.ch/digitaltechnik/spring2021/lib/exe/fetch.php?media=boggs_ieeemicro_2015.pdf
More on NVIDIA Denver Code Optimizer

126
https://safari.ethz.ch/digitaltechnik/spring2021/lib/exe/fetch.php?media=boggs_ieeemicro_2015.pdf
Transmeta: x86 to VLIW Translation

Proprietary VLIW ISA

X86 X86

Klaiber, “The Technology Behind Crusoe Processors,” Transmeta White Paper 2000.
https://www.wikiwand.com/en/Transmeta_Efficeon 127
https://classes.engineering.wustl.edu/cse362/images/c/c7/Paper_aklaiber_19jan00.pdf
More on Static Instruction Scheduling

https://www.youtube.com/watch?v=isBEVkIjgGA&list=PL5PHm2jkkXmi5CxxI7b3JCL1TWybTDtKq&index=18
Lectures on Static Instruction Scheduling

◼ Computer Architecture, Spring 2015, Lecture 16


❑ Static Instruction Scheduling (CMU, Spring 2015)
❑ https://www.youtube.com/watch?v=isBEVkIjgGA&list=PL5PHm2jkkXmi5CxxI7b3JC
L1TWybTDtKq&index=18

◼ Computer Architecture, Spring 2013, Lecture 21


❑ Static Instruction Scheduling (CMU, Spring 2013)
❑ https://www.youtube.com/watch?v=XdDUn2WtkRg&list=PL5PHm2jkkXmidJOd59RE
og9jDnPDTG6IJ&index=21

https://www.youtube.com/onurmutlulectures 129
Recall: How to Handle Data Dependences
◼ Anti and output dependences are easier to handle
❑ write to the destination only in last stage and in program order

◼ Flow dependences are more interesting

◼ Six fundamental ways of handling flow dependences


❑ Detect and wait until value is available in register file
❑ Detect and forward/bypass data to dependent instruction
❑ Detect and eliminate the dependence at the software level
◼ No need for the hardware to detect dependence
❑ Detect and move it out of the way for independent instructions
❑ Predict the needed value(s), execute “speculatively”, and verify
❑ Do something else (fine-grained multithreading)
◼ No need to detect
130
Fine-Grained Multithreading

131
Fine-Grained Multithreading
◼ Idea: Fetch from a different thread every cycle such that no
two instructions from a thread are in the pipeline concurrently
❑ Hardware has multiple thread contexts (PC+registers per thread)
❑ Threads are completely independent
❑ No instruction is fetched from the same thread until the prior
branch/instruction from the thread completes

+ No logic needed for handling control and


data dependences within a thread
+ High thread-level throughput
-- Single thread performance suffers
-- Extra logic for keeping thread contexts
-- Throughput loss when there are not
enough threads to keep the pipeline full
Each pipeline stage has an instruction from a different, completely-independent thread
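A toy sketch (not from the lecture) of the round-robin fetch policy described above: with at least as many threads as pipeline stages, no two in-flight instructions ever come from the same thread, so no dependence checking is needed.

```python
# Round-robin thread selection for fine-grained multithreaded fetch:
# cycle c fetches from thread (c mod num_threads).
def fgmt_fetch_order(num_threads, cycles):
    return [cycle % num_threads for cycle in range(cycles)]

# With 5 threads feeding a 5-stage pipeline, any window of 5 consecutive
# fetches contains 5 distinct threads.
order = fgmt_fetch_order(5, 20)
```

Real machines like the CDC 6600 PPU and HEP skip threads that are waiting (e.g., on memory), but the invariant is the same.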
Fine-Grained Multithreading: Basic Idea
[Figure: pipelined datapath modified for fine-grained multithreading: one
PC and one register file per hardware thread context, selected by muxes
and thread-select control; no forwarding or hazard logic is present.]

Each pipeline stage has an instruction from a different, completely-independent thread

We need a PC and a register file for each thread + muxes and control
Fine-Grained Multithreading (II)
◼ Idea: Fetch from a different thread every cycle such that no
two instructions from a thread are in the pipeline concurrently

◼ Tolerates control and data dependence resolution latencies by


overlapping the latency with useful work from other threads
◼ Improves pipeline utilization by taking advantage of multiple
threads
◼ Improves thread-level throughput but sacrifices per-thread
throughput & latency

◼ Thornton, “Parallel Operation in the Control Data 6600,” AFIPS 1964.


◼ Smith, “A pipelined, shared resource MIMD computer,” ICPP 1978.
134
Fine-Grained Multithreading: History
◼ CDC 6600’s peripheral processing unit is fine-grained
multithreaded
❑ Thornton, “Parallel Operation in the Control Data 6600,” AFIPS 1964.
❑ Processor executes a different I/O thread every cycle
❑ An operation from the same thread is executed every 10 cycles

◼ Denelcor HEP (Heterogeneous Element Processor)


❑ Smith, “A pipelined, shared resource MIMD computer,” ICPP 1978.
❑ 120 threads/processor
❑ Available queue vs. unavailable (waiting) queue for threads
❑ Each thread can have only 1 instruction in the processor pipeline
❑ Each thread independent
❑ To each thread, processor looks like a non-pipelined machine
❑ System throughput vs. single thread performance tradeoff

135
Fine-Grained Multithreading in HEP
◼ Cycle time: 100ns

◼ 8 stages → 800 ns to
complete an
instruction
❑ assuming no memory
access

◼ No control and data


dependence checking

Burton Smith
(1941-2018)

136
Multithreaded Pipeline Example

Slide credit: Joel Emer
137


Sun Niagara Multithreaded Pipeline

Kongetira et al., “Niagara: A 32-Way Multithreaded Sparc Processor,” IEEE Micro 2005.
138
Fine-Grained Multithreading
◼ Advantages
+ No need for dependence checking between instructions
(only one instruction in pipeline from a single thread)
+ No need for branch prediction logic
+ Otherwise-bubble cycles used for executing useful instructions from
different threads
+ Improved system throughput, latency tolerance, pipeline utilization

◼ Disadvantages
- Extra hardware complexity: multiple hardware contexts (PCs, register
files, …), thread selection logic
- Reduced single thread performance (one instruction fetched every N
cycles from the same thread)
- Resource contention between threads in caches and memory
- Dependence checking logic between threads may be needed (load/store)
139
Modern GPUs are
FGMT Machines

140
NVIDIA GeForce GTX 285 “core”

[Figure: one GTX 285 “core”: 64 KB of storage for thread contexts
(registers); data-parallel (SIMD) functional units with instruction stream
decode and control shared across 8 units; multiply-add and multiply units;
execution context storage.]
Slide credit: Kayvon Fatahalian


NVIDIA GeForce GTX 285 “core”

[Figure: the same GTX 285 “core” diagram: 64 KB of storage for thread
contexts (registers).]
◼ Groups of 32 threads share instruction stream (each group is


a Warp): they execute the same instruction on different data
◼ Up to 32 warps are interleaved in an FGMT manner
◼ Up to 1024 thread contexts can be stored
Slide credit: Kayvon Fatahalian
NVIDIA GeForce GTX 285 (~2009)

[Figure: full GTX 285 chip: 30 replicated cores, each paired with texture
units (Tex).]

30 cores on the GTX 285: 30,720 threads
Slide credit: Kayvon Fatahalian
Further Reading for the Interested (I)

Burton Smith
(1941-2018)

144
Further Reading for the Interested (II)

145
More on Multithreading (I)

https://www.youtube.com/watch?v=iqi9wFqFiNU&list=PL5PHm2jkkXmgDN1PLwOY_tGtUlynnyV6D&index=51
More on Multithreading (II)

https://www.youtube.com/watch?v=e8lfl6MbILg&list=PL5PHm2jkkXmgDN1PLwOY_tGtUlynnyV6D&index=52
More on Multithreading (III)

https://www.youtube.com/watch?v=7vkDpZ1-hHM&list=PL5PHm2jkkXmgDN1PLwOY_tGtUlynnyV6D&index=53
More on Multithreading (IV)

https://www.youtube.com/watch?v=-hbmzIDe0sA&list=PL5PHm2jkkXmgDN1PLwOY_tGtUlynnyV6D&index=54
Lectures on Multithreading
◼ Parallel Computer Architecture, Fall 2012, Lecture 9
❑ Multithreading I (CMU, Fall 2012)
❑ https://www.youtube.com/watch?v=iqi9wFqFiNU&list=PL5PHm2jkkXmgDN1PLwOY_tGtUlynnyV6D&index=51
◼ Parallel Computer Architecture, Fall 2012, Lecture 10
❑ Multithreading II (CMU, Fall 2012)
❑ https://www.youtube.com/watch?v=e8lfl6MbILg&list=PL5PHm2jkkXmgDN1PLwOY_tGtUlynnyV6D&index=52
◼ Parallel Computer Architecture, Fall 2012, Lecture 13
❑ Multithreading III (CMU, Fall 2012)
❑ https://www.youtube.com/watch?v=7vkDpZ1-hHM&list=PL5PHm2jkkXmgDN1PLwOY_tGtUlynnyV6D&index=53
◼ Parallel Computer Architecture, Fall 2012, Lecture 15
❑ Speculation I (CMU, Fall 2012)
❑ https://www.youtube.com/watch?v=-hbmzIDe0sA&list=PL5PHm2jkkXmgDN1PLwOY_tGtUlynnyV6D&index=54

https://www.youtube.com/onurmutlulectures 150
Pipelining and Precise Exceptions:
Preserving Sequential Semantics
Multi-Cycle Execution
◼ Not all instructions take the same amount of time for “execution”

◼ Idea: Have multiple different functional units that take different numbers of cycles
❑ Can be pipelined or not pipelined
❑ Can let independent instructions start execution on a different functional unit before a previous long-latency instruction finishes execution

          Integer add: E
          Integer mul: E E E E
F D  →    FP mul:      E E E E E E E E
          Load/store:  E E E E E E E E ...

152
Issues in Pipelining: Multi-Cycle Execute
◼ Instructions can take different number of cycles in EXECUTE
stage
❑ Integer ADD versus Integer DIVide

DIV R4  R1, R2 F D E E E E E E E E W
ADD R3  R1, R2 F D E W
F D E W
F D E W

DIV R2  R5, R6 F D E E E E E E E E W
ADD R7  R5, R6 F D E W
F D E W

❑ What is wrong with this picture in a Von Neumann architecture?


◼ Sequential semantics of the ISA NOT preserved!
◼ What if DIV incurs an exception?
153
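The out-of-order completion problem in the diagram above can be reproduced with a small timing sketch (illustrative only; the 8-cycle DIV and 1-cycle ADD latencies are taken from the figure, and the `completion_order` helper is my own):

```python
# Sketch (illustrative latencies): with per-opcode EXECUTE latencies and
# writeback happening as soon as EXECUTE ends, a younger ADD updates
# architectural state before an older DIV. If the DIV later raises an
# exception, the handler sees a state that is not the sequential one.

LATENCY = {"DIV": 8, "ADD": 1}  # cycles in EXECUTE (from the figure)

def completion_order(program):
    """Issue one instruction per cycle in program order; return the
    program-order indices in the order their writebacks occur."""
    finishes = []
    for i, op in enumerate(program):
        issue_cycle = i                      # in-order, one issue per cycle
        finishes.append((issue_cycle + LATENCY[op], i))
    finishes.sort()                          # earliest writeback first
    return [i for _, i in finishes]

order = completion_order(["DIV", "ADD", "ADD", "ADD"])
assert order == [1, 2, 3, 0]  # the younger ADDs write back before the DIV
```

Positions 1 to 3 (the ADDs) write back before position 0 (the older DIV), so an exception in the DIV finds younger results already committed to architectural state.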
Exceptions and Interrupts
◼ “Unplanned” changes or interruptions in program execution

◼ Due to internal problems in execution of the program → Exceptions

◼ Due to external events that need to be handled by the processor → Interrupts

◼ Both exceptions and interrupts require


❑ stopping of the current program
❑ saving the architectural state
❑ handling the exception/interrupt → switch to handler
❑ return back to program execution (if possible and makes sense)
154
Exceptions vs. Interrupts
◼ Cause
❑ Exceptions: internal to the running thread
❑ Interrupts: external to the running thread

◼ When to Handle
❑ Exceptions: when detected (and known to be non-speculative)
❑ Interrupts: when convenient
◼ Except for very high priority ones
❑ Power failure
❑ Machine check (error)

◼ Priority: process (exception), depends (interrupt)

◼ Handling Context: process (exception), system (interrupt)


155
Precise Exceptions/Interrupts
◼ The architectural state should be consistent (precise)
when the exception/interrupt is ready to be handled

1. All previous instructions should be completely retired.

2. No later instruction should be retired.

Retire = commit = finish execution and update arch. state

156
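The two conditions can be stated as a one-line check (a sketch; `is_precise` is a hypothetical helper, with instructions identified by their program-order index):

```python
def is_precise(retired, fault_idx):
    """Precise iff exactly the instructions older than the faulting one
    (program-order indices 0 .. fault_idx-1) have retired: all previous
    instructions completed, and no later instruction retired."""
    return retired == set(range(fault_idx))

assert is_precise({0, 1, 2}, 3)      # all older retired, none younger
assert not is_precise({0, 1, 3}, 2)  # instruction 3 retired too early
assert not is_precise({0}, 2)        # instruction 1 not yet retired
```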
Checking for and Handling Exceptions in Pipelining

◼ When the oldest instruction ready-to-be-retired is detected to have caused an exception, the control logic

❑ Ensures architectural state is precise (register file, PC, memory)

❑ Flushes all younger instructions in the pipeline

❑ Saves PC and registers (as specified by the ISA)

❑ Redirects the fetch engine to the appropriate exception handling routine

157
Why Do We Want Precise Exceptions?
◼ Semantics of the von Neumann model ISA specifies it
❑ Remember von Neumann vs. Dataflow

◼ Aids software debugging

◼ Enables (easy) recovery from exceptions

◼ Enables (easily) restartable processes

◼ Enables traps into software (e.g., software-implemented opcodes)

158
Ensuring Precise Exceptions
◼ Easy to do in single-cycle and multi-cycle machines

◼ Single-cycle
❑ Instruction boundaries == Cycle boundaries

◼ Multi-cycle
❑ Add special states in the control FSM that lead to the
exception or interrupt handlers
❑ Switch to the handler only at a precise state → before fetching
the next instruction

See H&H Section 7.7 for a treatment of exceptions in multi-cycle uarch 159
Precise Exceptions in Multi-Cycle FSM

160
Precise Exceptions in Multi-Cycle Datapath

See H&H Section 7.7 for a treatment of exceptions in multi-cycle uarch 161
Multi-Cycle Execute: More Complications
◼ Instructions can take different number of cycles in EXECUTE
stage
❑ Integer ADD versus Integer DIVide

DIV R4  R1, R2 F D E E E E E E E E W
ADD R3  R1, R2 F D E W
F D E W
F D E W

DIV R2  R5, R6 F D E E E E E E E E W
ADD R7  R5, R6 F D E W
F D E W

❑ What is wrong with this picture in a Von Neumann architecture?


◼ Sequential semantics of the ISA NOT preserved!
◼ What if DIV incurs an exception?
162
Ensuring Precise Exceptions in Pipelining
◼ Idea: Make each operation take the same amount of time

DIV R3  R1, R2 F D E E E E E E E E W
ADD R4  R1, R2 F D E E E E E E E E W
F D E E E E E E E E W
F D E E E E E E E E W
F D E E E E E E E E W
F D E E E E E E E E W
F D E E E E E E E E W

◼ Downside
❑ Worst-case instruction latency determines all instructions’ latency
◼ What about memory operations?
◼ Each functional unit takes worst-case number of cycles?

163
Solutions
◼ Reorder buffer

◼ History buffer
◼ Future register file
◼ Checkpointing
(We will not cover the last three; see suggested lecture videos from Spring 2015)

◼ Suggested reading
❑ Smith and Pleszkun, “Implementing Precise Interrupts in Pipelined Processors,” IEEE Trans on Computers 1988 and ISCA 1985.
164
Solution I: Reorder Buffer (ROB)
◼ Idea: Complete instructions out-of-order, but reorder them
before making results visible to architectural state
◼ When instruction is decoded, it reserves the next-sequential
entry in the ROB
◼ When instruction completes, it writes result into ROB entry
◼ When instruction oldest in ROB and it has completed
without exceptions, its result moved to reg. file or memory

[Datapath: Instruction Cache → Register File → Func Units → Reorder Buffer]

165
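The mechanism above can be sketched as a minimal model (my own simplification: entries hold only a destination register and a value; real entries carry the additional fields discussed on the next slide):

```python
# Minimal reorder-buffer sketch: instructions complete out of order into
# their ROB entries, but results are committed to the architectural
# register file strictly in program (allocation) order.

from collections import deque

class ROB:
    def __init__(self):
        self.entries = deque()   # kept in program (allocation) order

    def allocate(self, dest):
        """Reserve the next-sequential ROB entry at decode time."""
        entry = {"dest": dest, "value": None, "done": False}
        self.entries.append(entry)
        return entry

    def complete(self, entry, value):
        """Instruction finishes execution: write result into its entry."""
        entry["value"], entry["done"] = value, True

    def commit(self, regfile):
        """Retire only the oldest completed entries, in order."""
        while self.entries and self.entries[0]["done"]:
            e = self.entries.popleft()
            regfile[e["dest"]] = e["value"]

regfile = {}
rob = ROB()
div = rob.allocate("R4")     # older, slow instruction
add = rob.allocate("R3")     # younger, fast instruction
rob.complete(add, 7)         # ADD finishes first (out of order)
rob.commit(regfile)
assert regfile == {}         # nothing commits: oldest (DIV) not done yet
rob.complete(div, 42)
rob.commit(regfile)
assert regfile == {"R4": 42, "R3": 7}  # both commit, in program order
```

Note that `commit` stops at the first incomplete entry, which is exactly what enforces in-order update of architectural state.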
Reorder Buffer

◼ Buffers information about all instructions that are decoded but not yet retired/committed

166
What’s in a ROB Entry?
Valid bits for reg/data
V DestRegID DestRegVal StoreAddr StoreData PC Exception?
+ control bits

◼ Everything required to:


❑ correctly reorder instructions back into the program order
❑ update the architectural state with the instruction’s result(s), if
instruction can retire without any issues
❑ handle an exception/interrupt precisely, if an
exception/interrupt needs to be handled before retiring the
instruction

◼ Need valid bits to keep track of readiness of the result(s) and find out if the instruction has completed execution
167
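As a sketch, the entry format above maps naturally onto a record type (field names follow the slide; widths, encodings, and the extra control bits are implementation-specific):

```python
# One ROB entry as described on the slide, sketched as a dataclass.

from dataclasses import dataclass
from typing import Optional

@dataclass
class ROBEntry:
    valid: bool = False                 # V: is this entry allocated?
    dest_reg_id: Optional[int] = None   # architectural destination register
    dest_reg_val: Optional[int] = None  # result value, once produced
    val_ready: bool = False             # valid bit for dest_reg_val
    store_addr: Optional[int] = None    # for store instructions
    store_data: Optional[int] = None
    pc: Optional[int] = None            # needed to handle exceptions precisely
    exception: bool = False             # raise before retiring this entry?

e = ROBEntry(valid=True, dest_reg_id=3, pc=0x400)
e.dest_reg_val, e.val_ready = 99, True  # instruction completes execution
assert e.val_ready and not e.exception  # safe to retire when oldest
```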
Reorder Buffer: Independent Operations
◼ Result first written to ROB on instruction completion
◼ Result written to register file at commit time

F D E E E E E E E E R W
F D E R W
F D E R W
F D E R W
F D E E E E E E E E R W
F D E R W
F D E R W

◼ What if a later instruction needs a value in the reorder buffer?


❑ One option: stall the operation → stall the pipeline
❑ Better: Read the value from the reorder buffer. How?

168
Reorder Buffer: How to Access?
◼ A register value can be in the register file, reorder buffer,
(or bypass/forwarding paths)

❑ Register file: a Random Access Memory, indexed with Register ID (the Register ID is the address of an entry)
❑ Reorder buffer: a Content Addressable Memory, searched with Register ID (the Register ID is part of the content of an entry)

[Datapath: Instruction Cache → Register File → Func Units (+ bypass paths) → Reorder Buffer]
169
Simplifying Reorder Buffer Access
◼ Idea: Use indirection

◼ Access register file first (check if the register is valid)


❑ If register not valid, register file stores the ID of the reorder
buffer entry that contains (or will contain) the value of the
register
❑ Mapping of the register to a ROB entry: Register file maps the
register to a reorder buffer entry if there is an in-flight
instruction writing to the register

◼ Access reorder buffer next

◼ Now, reorder buffer does not need to be content addressable


170
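A sketch of the indirection (hypothetical encoding: a register file entry is either a ready value or the index of the ROB entry that contains, or will contain, it):

```python
# Indirection sketch: each register file entry is either ("VAL", value)
# or ("ROB", rob_index). A read checks the register file first and, if
# the register is not valid there, follows the stored index straight
# into the ROB, so no content-addressable search is needed.

def read_reg(regfile, rob, r):
    kind, x = regfile[r]
    if kind == "VAL":
        return x               # value is already architectural
    return rob[x]["value"]     # follow the ROB-entry index

rob = [{"value": 10}, {"value": None}]
regfile = {"R1": ("VAL", 5), "R2": ("ROB", 0)}
assert read_reg(regfile, rob, "R1") == 5    # hit in the register file
assert read_reg(regfile, rob, "R2") == 10   # redirected to ROB entry 0
```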
Reorder Buffer in Intel Pentium III

◼ Boggs et al., “The Microarchitecture of the Pentium 4 Processor,” Intel Technology Journal, 2001.

171
Important: Register Renaming with a Reorder Buffer
◼ Output and anti dependences are not true dependences
❑ WHY? The same register refers to values that have nothing to do with each other
❑ They exist due to lack of register IDs (i.e., names) in the ISA

◼ The register ID is renamed to the reorder buffer entry that will hold the register’s value
❑ Register ID → ROB entry ID
❑ Architectural register ID → Physical register ID
❑ After renaming, ROB entry ID used to refer to the register

◼ This eliminates anti and output dependences
❑ Gives the illusion that there are a large number of registers
172
Recall: Data Dependence Types
True (flow) dependence
r3  r1 op r2 Read-after-Write
r5  r3 op r4 (RAW) -- True
Anti dependence
r3  r1 op r2 Write-after-Read
r1  r4 op r5 (WAR) -- Anti

Output dependence
r3  r1 op r2 Write-after-Write
r5  r3 op r4 (WAW) -- Output
r3  r6 op r7
173
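The three cases can be classified mechanically (a sketch; instructions are encoded as `(dest, src1, src2)` tuples, older instruction first):

```python
# Dependence classification for a pair of instructions of the form
# dest <- src1 op src2, with the older instruction given first.

def dependences(older, younger):
    d_old, s1_old, s2_old = older
    d_y, s1_y, s2_y = younger
    deps = set()
    if d_old in (s1_y, s2_y):
        deps.add("RAW")   # true (flow): younger reads what older writes
    if d_y in (s1_old, s2_old):
        deps.add("WAR")   # anti: younger overwrites what older reads
    if d_y == d_old:
        deps.add("WAW")   # output: both write the same register
    return deps

# The three examples from the slide:
assert dependences(("r3", "r1", "r2"), ("r5", "r3", "r4")) == {"RAW"}
assert dependences(("r3", "r1", "r2"), ("r1", "r4", "r5")) == {"WAR"}
assert dependences(("r3", "r1", "r2"), ("r3", "r6", "r7")) == {"WAW"}
```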
Renaming Example
◼ Assume
❑ Register file has a pointer to the reorder buffer entry that
contains or will contain the value, if the register is not valid
❑ Reorder buffer works as described before

◼ Where is the latest definition of R3 for each instruction below in sequential order?
LD R0(0) → R3
LD R3, R1 → R10
MUL R1, R2 → R3
MUL R3, R4 → R11
ADD R5, R6 → R3
ADD R7, R8 → R12

174
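One way to answer the question mechanically, following the scheme above (the `rename` helper and list encoding are my own):

```python
# Renaming walk for the example: each instruction allocates the
# next-sequential ROB entry for its destination; a source register reads
# the latest mapping (a ROB tag) if one exists, else the architectural
# register itself.

def rename(program):
    """program: list of (srcs, dest). Returns each instruction's renamed
    source operands (ROB entry IDs where a mapping exists)."""
    rename_map = {}                      # arch register -> ROB entry ID
    out = []
    for rob_id, (srcs, dest) in enumerate(program):
        out.append([rename_map.get(s, s) for s in srcs])
        rename_map[dest] = rob_id        # dest now lives in this ROB entry
    return out

prog = [
    (["R0"],       "R3"),    # ROB0: LD  R0(0)  -> R3
    (["R3", "R1"], "R10"),   # ROB1: LD  R3,R1  -> R10
    (["R1", "R2"], "R3"),    # ROB2: MUL R1,R2  -> R3
    (["R3", "R4"], "R11"),   # ROB3: MUL R3,R4  -> R11
    (["R5", "R6"], "R3"),    # ROB4: ADD R5,R6  -> R3
    (["R7", "R8"], "R12"),   # ROB5: ADD R7,R8  -> R12
]
srcs = rename(prog)
assert srcs[1][0] == 0   # second LD reads R3 from ROB entry 0
assert srcs[3][0] == 2   # second MUL reads R3 from ROB entry 2
```

Each write to R3 allocates a fresh ROB entry (0, 2, 4), so every reader of R3 is renamed to the entry of the nearest earlier writer, and the WAR/WAW hazards among the three writers disappear.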
In-Order Pipeline with Reorder Buffer
◼ Decode (D): Access regfile/ROB, allocate entry in ROB, check if
instruction can execute, if so dispatch instruction
◼ Execute (E): Instructions can complete out-of-order
◼ Completion (R): Write result to reorder buffer
◼ Retirement/Commit (W): Check for exceptions; if none, write result to
architectural register file or memory; else, flush pipeline and start from
exception handler
◼ In-order dispatch/execution, out-of-order completion, in-order retirement
          Integer add: E
          Integer mul: E E E E
F D  →    FP mul:      E E E E E E E E    →  R  W
          Load/store:  E E E E E E E E ...

175
Reorder Buffer Tradeoffs
◼ Advantages
❑ Conceptually simple for supporting precise exceptions
❑ Can eliminate false dependences

◼ Disadvantages
❑ Reorder buffer needs to be accessed to get the results that
are yet to be written to the register file
◼ CAM or indirection → increased latency and complexity

◼ Other solutions aim to eliminate the disadvantages
❑ History buffer
❑ Future file
❑ Checkpointing
(We will not cover these; see suggested lecture videos from Spring 2015)

176
More on State Maintenance & Precise Exceptions

https://www.youtube.com/onurmutlulectures 177
More on State Maintenance & Precise Exceptions

https://www.youtube.com/onurmutlulectures 178
Lectures on State Maintenance & Recovery
◼ Computer Architecture, Spring 2015, Lecture 11
❑ Precise Exceptions, State Maintenance/Recovery (CMU, Spring 2015)
❑ https://www.youtube.com/watch?v=nMfbtzWizDA&list=PL5PHm2jkkXmi5CxxI7b3JCL1TWybTDtKq&index=13

◼ Digital Design & Computer Architecture, Spring 2019, Lecture 15a
❑ Reorder Buffer (ETH Zurich, Spring 2019)
❑ https://www.youtube.com/watch?v=9yo3yhUijQs&list=PL5Q2soXY2Zi8J58xLKBNFQFHRO3GrXxA9&index=17

https://www.youtube.com/onurmutlulectures 179
Suggested Readings for the Interested
◼ Smith and Pleszkun, “Implementing Precise Interrupts in Pipelined Processors,” IEEE Trans on Computers 1988 and ISCA 1985.

◼ Smith and Sohi, “The Microarchitecture of Superscalar Processors,” Proceedings of the IEEE, 1995.

◼ Hwu and Patt, “Checkpoint Repair for Out-of-order Execution Machines,” ISCA 1987.

◼ Backup Slides

180
