U.
Wisconsin CS/ECE 752
Advanced Computer Architecture I
Prof. David A. Wood
Unit 5: Dynamic Scheduling I
Slides developed by Amir Roth of University of Pennsylvania
with sources that included University of Wisconsin slides by
Mark Hill, Guri Sohi, Jim Smith, and David Wood.
Slides enhanced by Milo Martin, Mark Hill, and David Wood
with sources that included Profs. Asanovic, Falsafi, Hoe,
Lipasti, Shen, Smith, Sohi, Vijaykumar, and Wood
CS/ECE 752 (Wood): Dynamic Scheduling I
This Unit: Dynamic Scheduling I
Application
Dynamic scheduling
OS
Compiler
CPU
Firmware
Out-of-order execution
Scoreboard
Dynamic scheduling with WAW/WAR
I/O
Memory
Tomasulos algorithm
Add register renaming to fix WAW/WAR
Digital Circuits
Gates & Transistors
Next unit
Adding speculation and precise state
Dynamic load scheduling
CS/ECE 752 (Wood): Dynamic Scheduling I
The Problem With In-Order Pipelines
addf f0,f1,f2
mulf f2,f3,f2
subf f0,f1,f4
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
F D E+ E+ E+ W
F D d* d* E* E* E* E* E* W
F p* p* D E+ E+ E+ W
Whats happening in cycle 4?
mulf stalls due to RAW hazard
OK, this is a fundamental problem
subf stalls due to pipeline hazard
Why? subf cant proceed into D because addf is there
That is the only reason, and it isnt a fundamental one
Why cant subf go into D in cycle 4 and E+ in cycle 6?
CS/ECE 752 (Wood): Dynamic Scheduling I
Dynamic Scheduling: The Big Picture
add
sub
mul
div
I$
B
P
p2,p3,p4
p2,p4,p5
p2,p5,p6
p4,4,p7
regfile
insn buffer
D
D$
Ready Table
P2 P3 P4 P5 P6 P7
Yes Yes
Yes Yes Yes
Yes
Yes Yes Yes Yes
Yes
Yes Yes Yes Yes Yes Yes
add p2,p3,p4
sub p2,p4,p5 and div p4,4,p7
mul p2,p5,p6
Time
Instructions fetch/decoded/renamed into Instruction Buffer
Also called instruction window or instruction scheduler
Instructions (conceptually) check ready bits every cycle
Execute when ready
CS/ECE 752 (Wood): Dynamic Scheduling I
Register Renaming
To eliminate WAW and WAR hazards
Example
Names: r1,r2,r3
Locations: p1,p2,p3,p4,p5,p6,p7
Original mapping: r1p1, r2p2, r3p3, p4p7 are free
MapTable
FreeList
Raw insns
Renamed insns
r1
p1
p4
p4
p4
p4,p5,p6,p7
p5,p6,p7
p6,p7
p7
add
sub
mul
div
add
sub
mul
div
r2
p2
p2
p2
p2
r3
p3
p3
p5
p6
r2,r3,r1
r2,r1,r3
r2,r3,r3
r1,4,r1
p2,p3,p4
p2,p4,p5
p2,p5,p6
p4,4,p7
Renaming
+Removes WAW and WAR dependences
+Leaves RAW intact!
CS/ECE 752 (Wood): Dynamic Scheduling I
Dynamic Scheduling - OoO Execution
Dynamic scheduling
Totally in the hardware
Also called out-of-order execution (OoO)
Fetch many instructions into instruction window
Use branch prediction to speculate past (multiple) branches
Flush pipeline on branch misprediction
Rename to avoid false dependencies (WAW and WAR)
Execute instructions as soon as possible
Register dependencies are known
Handling memory dependencies more tricky (much more later)
Commit instructions in order
Any strange happens before commit, just flush the pipeline
Current machines: 64-100+ instruction scheduling window
CS/ECE 752 (Wood): Dynamic Scheduling I
Static Instruction Scheduling
Issue: time at which insns execute
Schedule: order in which insns execute
Related to issue, but the distinction is important
Scheduling: re-arranging insns to enable rapid issue
Static: by compiler
Requires knowledge of pipeline and program dependences
Pipeline scheduling: the basics
Requires large scheduling scope full of independent insns
Loop unrolling, software pipelining: increase scope for loops
Trace scheduling: increase scope for non-loops
Anything software can do hardware can do better
CS/ECE 752 (Wood): Dynamic Scheduling I
Motivation Dynamic Scheduling
Dynamic scheduling (out-of-order execution)
Execute insns in non-sequential (non-VonNeumann) order
+Reduce RAW stalls
+Increase pipeline and functional unit (FU) utilization
Original motivation was to increase FP unit utilization
+Expose more opportunities for parallel issue (ILP)
Not in-order can be in parallel
but make it appear like sequential execution
Important
But difficult
Next unit
CS/ECE 752 (Wood): Dynamic Scheduling I
Before We Continue
If we can do this in software
why build complex (slow-clock, high-power) hardware?
+ Performance portability
Dont want to recompile for new machines
+ More information available
Memory addresses, branch directions, cache misses
+ More registers available (??)
Compiler may not have enough to fix WAR/WAW hazards
+ Easier to speculate and recover from mis-speculation
Flush instead of recover code
But compiler has a larger scope
Compiler does as much as it can (not much)
Hardware does the rest
CS/ECE 752 (Wood): Dynamic Scheduling I
Going Forward: Whats Next
Well build this up in steps over the next few weeks
Scoreboarding - first OoO, no register renaming
Tomasulos algorithm - adds register renaming
Handling precise state and speculation
P6-style execution (Intel Pentium Pro)
R10k-style execution (MIPS R10k)
Handling memory dependencies
Conservative and speculative
Lets get started!
CS/ECE 752 (Wood): Dynamic Scheduling I
10
Dynamic Scheduling as Loop Unrolling
Three steps of loop unrolling
Step I: combine iterations
Increase scheduling scope for more flexibility
Step II: pipeline schedule
Reduce impact of RAW hazards
Step III: rename registers
Remove WAR/WAW violations that result from scheduling
CS/ECE 752 (Wood): Dynamic Scheduling I
11
Loop Example: SAX (SAXPY PY)
SAX (Single-precision A X)
Only because there wont be room in the diagrams for SAXPY
for (i=0;i<N;i++)
Z[i]=A*X[i];
0:
1:
2:
3:
4:
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
addi r1,4,r1
blt r1,r2,0
// loop
// A in f0
// i in r1
// N*4 in r2
Consider two iterations, ignore branch
ldf, mulf, stf, addi, ldf, mulf, stf
CS/ECE 752 (Wood): Dynamic Scheduling I
12
New Pipeline Terminology
regfile
I$
B
P
D$
In-order pipeline
Often written as F,D,X,W (multi-cycle X includes M)
Example pipeline: 1-cycle int (including mem), 3-cycle pipelined FP
CS/ECE 752 (Wood): Dynamic Scheduling I
13
New Pipeline Diagram
Insn
ldf X(r1),f1
c1 c2 c3
mulf f0,f1,f2 c3 c4+ c7
stf f2,Z(r1)
c7 c8 c9
addi r1,4,r1
c8 c9 c10
ldf X(r1),f1 c10 c11 c12
mulf f0,f1,f2 c12 c13+ c16
stf f2,Z(r1) c16 c17 c18
Alternative pipeline diagram
Down: insns
Across: pipeline stages
In boxes: cycles
Basically: stages cycles
Convenient for out-of-order
CS/ECE 752 (Wood): Dynamic Scheduling I
14
The Problem With In-Order Pipelines
regfile
I$
B
P
D$
In-order pipeline
Structural hazard: 1 insn register (latch) per stage
1 insn per stage per cycle (unless pipeline is replicated)
Younger insn cant pass older insn without clobbering it
Out-of-order pipeline
Implement passing functionality by removing structural hazard
CS/ECE 752 (Wood): Dynamic Scheduling I
15
Instruction Buffer
insn buffer
regfile
I$
B
P
D$
D1
D2
Trick: insn buffer (many names for this buffer)
Basically: a bunch of latches for holding insns
Implements iteration fusing here is your scheduling scope
Split D into two pieces
Accumulate decoded insns in buffer in-order
Buffer sends insns down rest of pipeline out-of-order
CS/ECE 752 (Wood): Dynamic Scheduling I
16
Dispatch and Issue
insn buffer
regfile
I$
B
P
D$
Dispatch (D): first part of decode
Allocate slot in insn buffer
New kind of structural hazard (insn buffer is full)
In order: stall back-propagates to younger insns
Issue (S): second part of decode
Send insns from insn buffer to execution units
+ Out-of-order: wait doesnt back-propagate to younger insns
CS/ECE 752 (Wood): Dynamic Scheduling I
17
Dispatch and Issue with Floating-Point
insn buffer
regfile
I$
B
P
D$
S
E*
E*
E
+
E
+
E*
E/
F-regfile
CS/ECE 752 (Wood): Dynamic Scheduling I
18
Dynamic Scheduling Algorithms
Three parts to loop unrolling
Scheduling scope: insn buffer
Pipeline scheduling and register renaming: scheduling algorithm
Look at two register scheduling algorithms
Register scheduler: scheduler based on register dependences
Scoreboard
No register renaming limited scheduling flexibility
Tomasulo
Register renaming more flexibility, better performance
Big simplification in this unit: memory scheduling
Pretend register algorithm magically knows memory dependences
A little more realism next unit
CS/ECE 752 (Wood): Dynamic Scheduling I
19
Scheduling Algorithm I: Scoreboard
Scoreboard
Centralized control scheme: insn status explicitly tracked
Insn buffer: Functional Unit Status Table (FUST)
First implementation: CDC 6600 [1964]
16 separate non-pipelined functional units (7 int, 4 FP, 5 mem)
No register bypassing
Our example: Simple Scoreboard
5 FU: 1 ALU, 1 load, 1 store, 2 FP (3-cycle, non-pipelined)
CS/ECE 752 (Wood): Dynamic Scheduling I
20
Scoreboard Data Structures
FU Status Table
FU, busy, op, R, R1, R2: destination/source register names
T: destination register tag (FU producing the value)
T1,T2: source register tags (FU producing the values)
Register Status Table
T: tag (FU that will write this register)
Tags interpreted as ready-bits
Tag == 0 Value is ready in register file
Tag != 0 Value is not ready, will be supplied by T
Insn status table
S,X bits for all active insns
CS/ECE 752 (Wood): Dynamic Scheduling I
21
Simple Scoreboard Data Structures
S X Insn
Fetched R1 R2 R
insns
Regfile
value
Reg Status T
op
FU Status
T1
==
==
==
==
T2
==
==
== CAMs
==
T
FU
Insn fields and status bits
Tags
Values
CS/ECE 752 (Wood): Dynamic Scheduling I
22
Scoreboard Pipeline
New pipeline structure: F, D, S, X, W
F (fetch)
Same as it ever was
D (dispatch)
Structural or WAW hazard ? stall : allocate scoreboard entry
S (issue)
RAW hazard ? wait : read registers, go to execute
X (execute)
Execute operation, notify scoreboard when done
W (writeback)
WAR hazard ? wait : write register, free scoreboard entry
W and RAW-dependent S in same cycle
W and structural-dependent D in same cycle
CS/ECE 752 (Wood): Dynamic Scheduling I
23
Scoreboard Dispatch (D)
S X Insn
Fetched R1 R2 R
insns
Regfile
value
Reg Status T
op
FU Status
T1
==
==
==
==
T2
==
==
==
==
T
FU
Stall for WAW or structural (Scoreboard, FU) hazards
Allocate scoreboard entry
Copy Reg Status for input registers
Set Reg Status for output register
CS/ECE 752 (Wood): Dynamic Scheduling I
24
Scoreboard Issue (S)
S X Insn
Fetched R1 R2 R
insns
Regfile
value
Reg Status T
op
FU Status
T1
==
==
==
==
T2
==
==
==
==
T
FU
Wait for RAW register hazards
Read registers
CS/ECE 752 (Wood): Dynamic Scheduling I
25
Issue Policy and Issue Logic
Issue
If multiple insns ready, which one to choose? Issue policy
Oldest first? Safe
Longest latency first? May yield better performance
Select logic: implements issue policy
W1 priority encoder
W: window size (number of scoreboard entries)
CS/ECE 752 (Wood): Dynamic Scheduling I
26
Scoreboard Execute (X)
S X Insn
Fetched R1 R2 R
insns
Regfile
value
Reg Status T
op
FU Status
T1
==
==
==
==
T2
==
==
==
==
T
FU
Execute insn
CS/ECE 752 (Wood): Dynamic Scheduling I
27
Scoreboard Writeback (W)
S X Insn
Fetched R1 R2 R
insns
Regfile
value
Reg Status T
op
FU Status
T1
==
==
==
==
T2
==
==
==
==
T
FU
Wait for WAR hazard
Write value into regfile, clear Reg Status entry
Compare tag to waiting insns input tags, match ? clear input tag
Free scoreboard entry
CS/ECE 752 (Wood): Dynamic Scheduling I
28
Scoreboard Data Structures
Insn Status
Insn
f0
f1
f2
r1
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
addi r1,4,r1
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
FU Status
FU busy op
ALU
LD
ST
FP1
FP2
Reg Status
Reg T
R1
R2
T1
T2
no
no
no
no
no
CS/ECE 752 (Wood): Dynamic Scheduling I
29
Scoreboard: Cycle 1
Insn Status
Insn
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
addi r1,4,r1
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
FU Status
FU busy op
ALU
LD
ST
FP1
FP2
no
yes ldf
no
no
no
Reg Status
Reg T
f0
f1
f2
r1
c1
R1
R2
T1
T2
f1
r1
CS/ECE 752 (Wood): Dynamic Scheduling I
LD
allocate
30
Scoreboard: Cycle 2
Insn Status
Insn
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
addi r1,4,r1
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
FU Status
FU busy op
ALU
LD
ST
FP1
FP2
c1
c2
c2
R1
R2
T1
T2
r1
f0
f1
LD
no
yes ldf f1
no
yes mulf f2
no
Reg Status
Reg T
f0
f1
f2
r1
CS/ECE 752 (Wood): Dynamic Scheduling I
LD
FP1
allocate
31
Scoreboard: Cycle 3
Insn Status
Insn
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
addi r1,4,r1
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
c1
c2
c3
c2
c3
Functional unit status
FU busy op
R
R1
ALU
LD
ST
FP1
FP2
no
yes ldf f1
yes stf yes mulf f2
no
f2
f0
Reg Status
Reg T
f0
f1
f2
r1
R2
T1
T2
r1
r1
f1
FP1
-
LD
CS/ECE 752 (Wood): Dynamic Scheduling I
LD
FP1
allocate
32
Scoreboard: Cycle 4
Insn Status
Insn
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
addi r1,4,r1
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
FU Status
FU busy op
ALU
LD
ST
FP1
FP2
Reg Status
Reg T
c1
c2
c3
c4
c2
c4
c3
c4
R1
R2
T1
T2
r1
f2
f0
r1
f1
FP1
-
LD
yes addi r1
no
yes stf yes mulf f2
no
CS/ECE 752 (Wood): Dynamic Scheduling I
f0
f1
f2
r1
LD
FP1
ALU
f1 written clear
allocate
free
f0 (LD) is ready issue mulf
33
Scoreboard: Cycle 5
Insn Status
Insn
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
addi r1,4,r1
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
FU Status
FU busy op
ALU
LD
ST
FP1
FP2
yes
yes
yes
yes
no
addi
ldf
stf
mulf
Reg Status
Reg T
c1
c2
c3
c4
c5
c2
c4
c3
c5
c4
R1
R2
T1
T2
r1
f1
f2
r1
f2
f0
r1
r1
f1
FP1
-
ALU
-
c5
CS/ECE 752 (Wood): Dynamic Scheduling I
f0
f1
f2
r1
LD
FP1
ALU
allocate
34
Scoreboard: Cycle 6
Insn Status
Insn
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
addi r1,4,r1
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
FU Status
FU busy op
ALU
LD
ST
FP1
FP2
yes
yes
yes
yes
no
addi
ldf
stf
mulf
c1
c2
c3
c4
c5
c2 c3 c4
c4 c5+
R1
R2
T1
T2
r1
f1
f2
r1
f2
f0
r1
r1
f1
FP1
-
ALU
-
c5
Reg Status
Reg T
f0
f1
f2
r1
c6
LD
FP1
ALU
D stall: WAW hazard w/ mulf (f2)
How to tell? RegStatus[f2] non-empty
CS/ECE 752 (Wood): Dynamic Scheduling I
35
Scoreboard: Cycle 7
Insn Status
Insn
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
addi r1,4,r1
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
FU Status
FU busy op
ALU
LD
ST
FP1
FP2
yes
yes
yes
yes
no
addi
ldf
stf
mulf
Reg Status
Reg T
c1
c2
c3
c4
c5
c2 c3 c4
c4 c5+
c5
c6
R1
R2
T1
T2
r1
f1
f2
r1
f2
f0
r1
r1
f1
FP1
-
ALU
-
CS/ECE 752 (Wood): Dynamic Scheduling I
f0
f1 LD
f2 FP1
r1 ALU
W wait: WAR hazard w/ stf (r1)
How to tell? Untagged r1 in FuStatus
Requires CAM
36
Scoreboard: Cycle 8
Insn Status
Insn
c1
c2
c3
c4
c5
c8
c2 c3 c4
c4 c5+ c8
c8
c5 c6
R1
R2
T1
T2
addi r1
ldf f1
stf -
r1
f2
r1
r1
FP1
ALU
-
mulf f2
f0
f1
LD
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
addi r1,4,r1
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
FU Status
FU busy op
ALU
LD
ST
FP1
FP2
yes
yes
yes
no
yes
Reg Status
Reg T
CS/ECE 752 (Wood): Dynamic Scheduling I
f0
f1 LD
f2 FP1 FP2
r1 ALU
W wait
first mulf done (FP1)
f1 (FP1) is ready issue stf
free
allocate
37
Scoreboard: Cycle 9
Insn Status
Insn
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
addi r1,4,r1
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
FU Status
FU busy op
ALU
LD
ST
FP1
FP2
c1
c2
c3
c4
c5
c8
c2 c3 c4
c4 c5+ c8
c8 c9
c5 c6 c9
c9
Reg Status
Reg T
f0
f1
f2
r1
LD
FP2
ALU
r1 written clear
D stall: structural hazard FuStatus[ST]
no
yes ldf f1
yes stf no
yes mulf f2
R1
R2
T1
T2
f2
r1
r1
ALU
-
f0
f1
LD
CS/ECE 752 (Wood): Dynamic Scheduling I
free
r1 (ALU) is ready issue ldf
38
Scoreboard: Cycle 10
Insn Status
Insn
ldf X(r1),f1
c1
mulf f0,f1,f2 c2
stf f2,Z(r1)
c3
addi r1,4,r1
c4
ldf X(r1),f1
c5
mulf f0,f1,f2 c8
stf f2,Z(r1) c10
FU Status
FU busy op
ALU
LD
ST
FP1
FP2
no
yes ldf f1
yes stf no
yes mulf f2
c2 c3 c4
c4 c5+ c8
c8 c9 c10
c5 c6 c9
c9 c10
Reg Status
Reg T
f0
f1
f2
r1
LD
FP2
W & structural-dependent D in same cycle
R1
R2
T1
T2
f2
r1
r1
FP2
f0
f1
LD
CS/ECE 752 (Wood): Dynamic Scheduling I
free, then allocate
39
In-Order vs. Scoreboard
Insn
In-Order
D
X
Scoreboard
W
D
S
X
ldf X(r1),f1
c1 c2 c3 c1 c2 c3
mulf f0,f1,f2 c3 c4+ c7 c2 c4 c5+
stf f2,Z(r1)
c7 c8 c9 c3 c8 c9
addi r1,4,r1
c8 c9 c10 c4 c5 c6
ldf X(r1),f1 c10 c11 c12 c5 c9 c10
mulf f0,f1,f2 c12 c13+ c16 c8 c11 c12+
stf f2,Z(r1) c16 c17 c18 c10 c15 c16
W
c4
c8
c10
c9
c11
c15
c17
Big speedup?
Only 1 cycle advantage for scoreboard
Why? addi WAR hazard
Scoreboard issued addi earlier (c8 c5)
But WAR hazard delayed W until c9
Delayed issue of second iteration
CS/ECE 752 (Wood): Dynamic Scheduling I
40
In-Order vs. Scoreboard II: Cache Miss
Insn
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
addi r1,4,r1
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
In-Order
D
X
c1
c7
c11
c12
c14
c16
c20
c2+
c8+
c12
c13
c15
c17+
c21
Scoreboard
W
D
S
X
c7
c11
c13
c14
c16
c20
c22
c1
c2
c3
c4
c5
c6
c7
c2
c8
c12
c5
c13
c15
c19
c3+
c9+
c13
c6
c14
c16+
c20
W
c8
c12
c14
c13
c15
c19
c21
Assume
5 cycle cache miss on first ldf
Ignore FUST structural hazards
Little relative advantage
addi WAR hazard (c7 c13) stalls second iteration
CS/ECE 752 (Wood): Dynamic Scheduling I
41
Scoreboard Redux
The good
+ Cheap hardware
InsnStatus + FuStatus + RegStatus ~ 1 FP unit in area
+ Pretty good performance
1.7X for FORTRAN (scientific array) programs
The less good
No bypassing
Is this a fundamental problem?
Limited scheduling scope
Structural/WAW hazards delay dispatch
Slow issue of truly-dependent (RAW) insns
WAR hazards delay writeback
Fix with hardware register renaming
CS/ECE 752 (Wood): Dynamic Scheduling I
42
Register Renaming
Register renaming (in hardware)
Change register names to eliminate WAR/WAW hazards
An elegant idea (like caching & pipelining)
Key: think of registers (r1,f0) as names, not storage locations
+ Can have more locations than names
+ Can have multiple active versions of same name
How does it work?
Map-table: maps names to most recent locations
SRAM indexed by name
On a write: allocate new location, note in map-table
On a read: find location of most recent write via map-table lookup
Small detail: must de-allocate locations at some point
CS/ECE 752 (Wood): Dynamic Scheduling I
43
Register Renaming Example
Parameters
Names: r1,r2,r3
Locations: p1,p2,p3,p4,p5,p6,p7
Original mapping: r1p1, r2p2, r3p3, p4p7 are free
MapTable
FreeList
Raw insns
Renamed insns
r1
p1
p4
p4
p4
p4,p5,p6,p7
p5,p6,p7
p6,p7
p7
add
sub
mul
div
add
sub
mul
div
r2
p2
p2
p2
p2
r3
p3
p3
p5
p6
r2,r3,r1
r2,r1,r3
r2,r3,r3
r1,4,r1
p2,p3,p4
p2,p4,p5
p2,p5,p6
p4,4,p7
Renaming
+Removes WAW and WAR dependences
+Leaves RAW intact!
CS/ECE 752 (Wood): Dynamic Scheduling I
44
Scheduling Algorithm II: Tomasulo
Tomasulos algorithm
Reservation stations (RS): instruction buffer
Common data bus (CDB): broadcasts results to RS
Register renaming: removes WAR/WAW hazards
First implementation: IBM 360/91 [1967]
Dynamic scheduling for FP units only
Bypassing
Our example: Simple Tomasulo
Dynamic scheduling for everything, including load/store
No bypassing (for comparison with Scoreboard)
5 RS: 1 ALU, 1 load, 1 store, 2 FP (3-cycle, pipelined)
CS/ECE 752 (Wood): Dynamic Scheduling I
45
Tomasulo Data Structures
Reservation Stations (RS#)
FU, busy, op, R: destination register name
T: destination register tag (RS# of this RS)
T1,T2: source register tags (RS# of RS that will produce value)
V1,V2: source register values
Thats new
Map Table
T: tag (RS#) that will write this register
Common Data Bus (CDB)
Broadcasts <RS#, value> of completed insns
Tags interpreted as ready-bits++
T==0 Value is ready somewhere
T!=0 Value is not ready, wait until CDB broadcasts T
CS/ECE 752 (Wood): Dynamic Scheduling I
46
Simple Tomasulo Data Structures
op
Reservation Stations
T1
==
==
==
==
T2
==
==
==
==
T
CDB.T
Fetched
insns
Regfile
value
V1
V2
CDB.V
Map Table
FU
Insn fields and status bits
Tags
Values
CS/ECE 752 (Wood): Dynamic Scheduling I
47
Simple Tomasulo Pipeline
New pipeline structure: F, D, S, X, W
D (dispatch)
Structural hazard ? stall : allocate RS entry
S (issue)
RAW hazard ? wait (monitor CDB) : go to execute
W (writeback)
Wait for CDB
Write register, free RS entry
W and RAW-dependent S in same cycle
W and structural-dependent D in same cycle
CS/ECE 752 (Wood): Dynamic Scheduling I
48
Tomasulo Dispatch (D)
op
Reservation Stations
T1
==
==
==
==
T2
==
==
==
==
T
CDB.T
Fetched
insns
Regfile
value
V1
V2
CDB.V
Map Table
FU
Stall for structural (RS) hazards
Allocate RS entry
Input register ready ? read value into RS : read tag into RS
Set register status (i.e., rename) for output register
CS/ECE 752 (Wood): Dynamic Scheduling I
49
Tomasulo Issue (S)
op
Reservation Stations
T1
==
==
==
==
T2
==
==
==
==
T
CDB.T
Fetched
insns
Regfile
value
V1
V2
CDB.V
Map Table
FU
Wait for RAW hazards
Read register values from RS
CS/ECE 752 (Wood): Dynamic Scheduling I
50
Tomasulo Execute (X)
op
Reservation Stations
T1
==
==
==
==
T2
==
==
==
==
T
CS/ECE 752 (Wood): Dynamic Scheduling I
CDB.T
Fetched
insns
Regfile
value
V1
V2
CDB.V
Map Table
FU
51
Tomasulo Writeback (W)
op
Reservation Stations
T1
==
==
==
==
T2
==
==
==
==
T
CDB.T
Fetched
insns
Regfile
value
V1
V2
CDB.V
Map Table
FU
Wait for structural (CDB) hazards
Output Reg Status tag still matches? clear, write result to register
CDB broadcast to RS: tag match ? clear tag, copy value
Free RS entry
CS/ECE 752 (Wood): Dynamic Scheduling I
52
Difference Between Scoreboard
S X Insn
Fetched R1 R2 R
insns
Regfile
value
Reg Status T
op
FU Status
T1
==
==
==
==
T2
==
==
==
==
T
CS/ECE 752 (Wood): Dynamic Scheduling I
FU
53
And Tomasulo
op
Reservation Stations
T1
==
==
==
==
T2
==
==
==
==
T
CDB.T
Fetched
insns
Regfile
value
V1
V2
CDB.V
Map Table
FU
What in Tomasulo implements register renaming?
Value copies in RS (V1, V2)
Insn stores correct input values in its own RS entry
+ Future insns can overwrite master copy in regfile, doesnt matter
CS/ECE 752 (Wood): Dynamic Scheduling I
54
Value/Copy-Based Register Renaming
Tomasulo-style register renaming
Called value-based or copy-based
Names: architectural registers
Storage locations: register file and reservation stations
Values can and do exist in both
Register file holds master (i.e., most recent) values
+RS copies eliminate WAR hazards
Storage locations referred to internally by RS# tags
Register table translates names to tags
Tag == 0 value is in register file
Tag != 0 value is not ready and is being computed by RS#
CDB broadcasts values with tags attached
So insns know what value they are looking at
CS/ECE 752 (Wood): Dynamic Scheduling I
55
Value-Based Renaming Example
ldf X(r1),f1 (allocated RS#2)
MT[r1] == 0 RS[2].V2 = RF[r1]
MT[f1] = RS#2
mulf f0,f1,f2 (allocate RS#4)
MT[f0] == 0 RS[4].V1 = RF[f0]
MT[f1] == RS#2 RS[4].T2 = RS#2
MT[f2] = RS#4
addf f7,f8,f0
Can write RF[f0] before mulf executes, why?
ldf X(r1),f1
Can write RF[f1] before mulf executes, why?
Can write RF[f1] before first ldf, why?
Map Table
Reg T
f0
f1
f2
r1
RS#2
RS#4
Reservation Stations
T FU busy op
R
T1
T2
2
4
[r1]
RS#2 [f0]
LD
FP1
CS/ECE 752 (Wood): Dynamic Scheduling I
yes ldf f1
yes mulf f2
V1
V2
56
Tomasulo Data Structures
Insn Status
Insn
1
2
3
4
5
ALU
LD
ST
FP1
FP2
CDB
T
V
f0
f1
f2
r1
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
addi r1,4,r1
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
Reservation Stations
T FU busy op
R
Map Table
Reg T
T1
T2
V1
V2
no
no
no
no
no
CS/ECE 752 (Wood): Dynamic Scheduling I
57
Tomasulo: Cycle 1
Insn Status
Insn
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
addi r1,4,r1
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
ALU
LD
ST
FP1
FP2
f0
f1
f2
r1
c1
Reservation Stations
T FU busy op
R
1
2
3
4
5
no
yes ldf
no
no
no
Map Table
Reg T
f1
RS#2
T1
T2
V1
V2
[r1]
CS/ECE 752 (Wood): Dynamic Scheduling I
CDB
T
V
allocate
58
Tomasulo: Cycle 2
Insn Status
Insn
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
addi r1,4,r1
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
c1
c2
c2
Reservation Stations
T FU busy op
R
1
2
3
4
5
ALU
LD
ST
FP1
FP2
no
yes ldf f1
no
yes mulf f2
no
Map Table
Reg T
f0
f1
f2
r1
RS#2
RS#4
T1
T2
V1
V2
[r1]
RS#2 [f0] -
CS/ECE 752 (Wood): Dynamic Scheduling I
CDB
T
V
allocate
59
Tomasulo: Cycle 3
Insn Status
Insn
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
addi r1,4,r1
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
c1
c2
c3
c2
c3
Reservation Stations
T FU busy op
R
1
2
3
4
5
ALU
LD
ST
FP1
FP2
no
yes ldf f1
yes stf yes mulf f2
no
f0
f1
f2
r1
T1
Map Table
Reg T
T2
V1
RS#2
RS#4
V2
[r1]
RS#4 [r1]
RS#2 [f0] -
CS/ECE 752 (Wood): Dynamic Scheduling I
CDB
T
V
allocate
60
Tomasulo: Cycle 4
Insn Status
Insn
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
addi r1,4,r1
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
c1
c2
c3
c4
c2
c4
c3
c4
Reservation Stations
T FU busy op
R
1
2
3
4
5
ALU
LD
ST
FP1
FP2
yes addi r1
no
yes stf yes mulf f2
no
Map Table
Reg T
CDB
T
V
f0
f1
f2
r1
RS#2 [f1]
RS#2
RS#4
RS#1
T1
T2
V1
[r1] -
ldf finished (W)
clear f1 RegStatus
CDB broadcast
V2
allocate
free
RS#4 [r1]
RS#2 [f0] CDB.V RS#2 ready
grab CDB value
CS/ECE 752 (Wood): Dynamic Scheduling I
61
Tomasulo: Cycle 5
Insn Status
Insn
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
addi r1,4,r1
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
c1
c2
c3
c4
c5
c2
c4
c3
c5
c4
c5
Reservation Stations
T FU busy op
R
1
2
3
4
5
ALU
LD
ST
FP1
FP2
yes
yes
yes
yes
no
addi
ldf
stf
mulf
r1
f1
f2
Map Table
Reg T
f0
f1
f2
r1
RS#2
RS#4
RS#1
T1
T2
V1
V2
RS#4
-
RS#1
-
[r1]
[f0]
[r1]
[f1]
CS/ECE 752 (Wood): Dynamic Scheduling I
CDB
T
V
allocate
62
Tomasulo: Cycle 6
Insn Status
Insn
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
addi r1,4,r1
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
c1
c2
c3
c4
c5
c6
c2 c3 c4
c4 c5+
c5
Map Table
Reg T
f0
f1
f2
r1
c6
RS#2
RS#4RS#5
RS#1
no D stall on WAW: scoreboard would
overwrite f2 RegStatus
anyone who needs old f2 tag has it
Reservation Stations
T FU busy op
R
T1
T2
V1
V2
1
2
3
4
5
RS#4
-
RS#1
RS#2
[r1]
[f0]
[f0]
[r1]
[f1]
-
ALU
LD
ST
FP1
FP2
yes
yes
yes
yes
yes
addi
ldf
stf
mulf
mulf
CDB
T
V
r1
f1
f2
f2
CS/ECE 752 (Wood): Dynamic Scheduling I
allocate
63
Tomasulo: Cycle 7
Insn Status
Insn
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
addi r1,4,r1
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
c1
c2
c3
c4
c5
c6
c2 c3 c4
c4 c5+
c5
c7
c6
Map Table
Reg T
c7
CDB
T
V
f0
RS#1 [r1]
f1 RS#2
f2 RS#5
r1 RS#1
no W wait on WAR: scoreboard would
anyone who needs old r1 has RS copy
D stall on store RS: structural
addi finished (W)
clear r1 RegStatus
CDB broadcast
Reservation Stations
T FU busy op
R
T1
T2
V1
V2
1
2
3
4
5
RS#4
-
RS#1
RS#2
[f0]
[f0]
CDB.V RS#1 ready
[r1] grab CDB value
[f1]
-
ALU
LD
ST
FP1
FP2
no
yes
yes
yes
yes
ldf
stf
mulf
mulf
f1
f2
f2
CS/ECE 752 (Wood): Dynamic Scheduling I
64
Tomasulo: Cycle 8
Insn Status
Insn
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
addi r1,4,r1
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
c1
c2
c3
c4
c5
c6
c2 c3 c4
c4 c5+ c8
c8
c5 c6 c7
c7 c8
Reservation Stations
T FU busy op
R
1
2
3
4
5
ALU
LD
ST
FP1
FP2
no
yes ldf f1
yes stf no
yes mulf f2
T1
T2
Map Table
Reg T
CDB
T
V
f0
f1
f2
r1
RS#4 [f2]
RS#2
RS#5
mulf finished (W)
dont clear f2 RegStatus
already overwritten by 2nd mulf (RS#5)
CDB broadcast
V1
V2
RS#4 -
CS/ECE 752 (Wood): Dynamic Scheduling I
[r1]
CDB.V [r1] RS#4 ready
grab CDB value
RS#2 [f0] 65
Tomasulo: Cycle 9
Insn Status
Insn
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
addi r1,4,r1
ldf X(r1),f1
mulf f0,f1,f2
stf f2,Z(r1)
c1
c2
c3
c4
c5
c6
c2 c3 c4
c4 c5+ c8
c8 c9
c5 c6 c7
c7 c8 c9
c9
Reservation Stations
T FU busy op
R
1
2
3
4
5
ALU
LD
ST
FP1
FP2
no
no
yes stf no
yes mulf f2
Map Table
Reg T
CDB
T
V
f0
f1
f2
r1
RS#2 [f1]
RS#2
RS#5
2nd mulf finished (W)
clear f1 RegStatus
CDB broadcast
T1
T2
V1
[f2] [r1]
RS#2 [f0] CDB.V RS#2 ready
grab CDB value
CS/ECE 752 (Wood): Dynamic Scheduling I
V2
66
Tomasulo: Cycle 10
Insn Status
Insn
ldf X(r1),f1
c1
mulf f0,f1,f2 c2
stf f2,Z(r1)
c3
addi r1,4,r1
c4
ldf X(r1),f1
c5
mulf f0,f1,f2 c6
stf f2,Z(r1) c10
ALU
LD
ST
FP1
FP2
CDB
T
V
f0
c2 c3 c4
f1
c4 c5+ c8
f2 RS#5
c8 c9 c10
r1
c5 c6 c7
c7 c8 c9 stf finished (W)
c9 c10
no output register no CDB broadcast
Reservation Stations
T FU busy op
R
1
2
3
4
5
Map Table
Reg T
no
no
yes stf no
yes mulf f2
T1
V1
V2
RS#5 -
[r1]
[f0] [f1]
CS/ECE 752 (Wood): Dynamic Scheduling I
T2
free allocate
67
Scoreboard vs. Tomasulo
Insn
Scoreboard
D
S
X
ldf X(r1),f1
c1 c2 c3
mulf f0,f1,f2 c2 c4 c5+
stf f2,Z(r1)
c3 c8 c9
addi r1,4,r1
c4 c5 c6
ldf X(r1),f1
c5 c9 c10
mulf f0,f1,f2 c8 c11 c12+
stf f2,Z(r1) c10 c15 c16
Hazard
Insn buffer
FU
RAW
WAR
WAW
Scoreboard
CS/ECE 752 (Wood): Dynamic Scheduling I
stall in D
wait in S
wait in S
wait in W
stall in D
Tomasulo
W
D
S
c4 c1 c2 c3 c4
c8 c2 c4 c5+ c8
c10 c3 c8 c9 c10
c9 c4 c5 c6 c7
c11 c5 c7 c8 c9
c15 c6 c9 c10+ c13
c17 c10 c13 c14 c15
Tomasulo
stall in D
wait in S
wait in S
none
none
68
Scoreboard vs. Tomasulo II: Cache Miss
Insn
Scoreboard
D
S
X
ldf X(r1),f1
c1 c2
mulf f0,f1,f2 c2 c8
stf f2,Z(r1)
c3 c12
addi r1,4,r1
c4 c5
ldf X(r1),f1
c8 c13
mulf f0,f1,f2 c12 c15
stf f2,Z(r1) c13 c19
c3+
c9+
c13
c6
c14
c16+
c20
Tomasulo
W
D
S
c8
c12
c14
c13
c15
c19
c21
c1
c2
c3
c4
c5
c6
c7
c2
c8
c12
c5
c7
c9
c13
c3+
c9+
c13
c6
c8
c10+
c14
c8
c12
c14
c7
c9
c13
c15
Assume
5 cycle cache miss on first ldf
Ignore FUST and RS structural hazards
+ Advantage Tomasulo
No addi WAR hazard (c7) means iterations run in parallel
CS/ECE 752 (Wood): Dynamic Scheduling I
69
Can We Add Superscalar?
Dynamic scheduling and multiple issue are orthogonal
E.g., Pentium4: dynamically scheduled 5-way superscalar
Two dimensions
N: superscalar width (number of parallel operations)
W: window size (number of reservation stations)
What do we need for an N-by-W Tomasulo?
RS: N tag/value w-ports (D), N value r-ports (S), 2N tag CAMs (W)
Select logic: WN priority encoder (S)
MT: 2N r-ports (D), N w-ports (D)
RF: 2N r-ports (D), N w-ports (W)
CDB: N (W)
Which are the expensive pieces?
CS/ECE 752 (Wood): Dynamic Scheduling I
70
Superscalar Select Logic
Superscalar select logic: WN priority encoder
Somewhat complicated (N2 logW)
Can simplify using different RS designs
Split design
Divide RS into N banks: 1 per FU?
Implement N separate W/N1 encoders
+ Simpler: N * logW/N
Less scheduling flexibility
FIFO design [Palacharla+]
Can issue only head of each RS bank
+ Simpler: no select logic at all
Less scheduling flexibility (but surprisingly not that bad)
CS/ECE 752 (Wood): Dynamic Scheduling I
71
Can We Add Bypassing?
op
Reservation Stations
T1
==
==
==
==
T2
==
==
==
==
CDB.T
Fetched
insns
Regfile
value
V1
V2
CDB.V
Map Table
T
FU
Yes, but its more complicated than you might think
In fact: requires a completely new pipeline
CS/ECE 752 (Wood): Dynamic Scheduling I
72
Why Out-of-Order Bypassing Is Hard
Insn
No Bypassing
D
S
X
Bypassing
W
D
S
ldf X(r1),f1
c1 c2 c3 c4 c1 c2 c3 c4
mulf f0,f1,f2 c2 c4 c5+ c8 c2 c3 c4+ c7
stf f2,Z(r1)
c3 c8 c9 c10 c3 c6 c7 c8
addi r1,4,r1
c4 c5 c6 c7 c4 c5 c6 c7
ldf X(r1),f1
c5 c7 c8 c9 c5 c7 c7 c9
mulf f0,f1,f2 c6 c9 c10+ c13 c6 c9 c8+ c13
stf f2,Z(r1) c10 c13 c14 c15 c10 c13 c11 c15
Bypassing: ldf X in c3 mulf X in c4 mulf S in c3
But how can mulf S in c3 if ldf W in c4? Must change pipeline
Modern scheduler
Split CDB tag and value, move tag broadcast to S
ldf tag broadcast now in cycle 2 mulf S in cycle 3
How do multi-cycle operations work? How do cache misses work?
CS/ECE 752 (Wood): Dynamic Scheduling I
73
Dynamic Scheduling Summary
Dynamic scheduling: out-of-order execution
Higher pipeline/FU utilization, improved performance
Easier and more effective in hardware than software
+More storage locations than architectural registers
+Dynamic handling of cache misses
Instruction buffer: multiple F/D latches
Implements large scheduling scope + passing functionality
Split decode into in-order dispatch and out-of-order issue
Stall vs. wait
Dynamic scheduling algorithms
Scoreboard: no register renaming, limited out-of-order
Tomasulo: copy-based register renaming, full out-of-order
CS/ECE 752 (Wood): Dynamic Scheduling I
74