Parallelism I: Inside the Core
Remember!
141L projects due next Wednesday at midnight.
141 Final next Tuesday (3-6 pm).
The final is comprehensive, same general format as the Midterm. Review the slides, the homeworks, and the quizzes.
Key Points
What does wide issue mean? How does it affect performance? How does it affect pipeline design?
What is the basic idea behind out-of-order execution?
What is the difference between a true and a false dependence? How do OOO processors remove false dependences?
What is Simultaneous Multithreading?
Parallelism
ET = IC * CPI * CT
IC is more or less fixed, and we have shrunk cycle time as far as we can. We have achieved a CPI of 1. Can we get faster?
We can reduce our CPI to less than 1. The processor must do multiple operations at once. This is called Instruction Level Parallelism (ILP).
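The iron law above can be checked with a quick calculation. A minimal sketch, assuming a made-up instruction count and clock rate purely for illustration:

```python
# Execution time: ET = IC * CPI * CT (the "iron law" of performance).
# The workload numbers below are made up for illustration.
ic = 1_000_000_000   # instruction count
ct = 0.5e-9          # cycle time: 0.5 ns (a 2 GHz clock)

et_scalar = ic * 1.0 * ct   # CPI = 1.0: one instruction per cycle
et_super  = ic * 0.5 * ct   # CPI = 0.5: a 2-wide pipeline, ideal case

print(et_scalar)             # 0.5 (seconds)
print(et_super)              # 0.25 (seconds)
print(et_scalar / et_super)  # 2.0 (speedup)
```

Halving CPI halves execution time, which is exactly the payoff the rest of this lecture chases.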
Approach 1: Widen the pipeline
[Pipeline diagram: two parallel pipes. Fetch PC and PC+4; Decode 2 instructions and fetch 4 register values; two EX units; two memory ops in Mem; Write back 2 values.]
Process two instructions at once instead of 1: one even-PC instruction and one odd-PC instruction. This often keeps the instruction fetch logic simpler.
This is a 2-wide, in-order, superscalar processor. Potential problems?
Single issue refresher

                   cycle: 0  1  2  3  4  5  6  7  8
add $s1,$s2,$s3           F  D  E  M  W
sub $s2,$s4,$s5              F  D  E  M  W
ld  $s3, 0($s2)                 F  D  E  M  W
add $t1, $s3, $s3                  F  D  D  E  M  W

Forwarding covers the sub-to-ld and ld-to-add dependences, but the load-use hazard still costs one stall cycle (the repeated D in the final add).
Dual issue: Ideal Case
                    0  1  2  3  4  5  6  7  8
add $s1,$s2,$s3     F  D  E  M  W
sub $s2,$s4,$s5     F  D  E  M  W
ld  $s3, 0($s2)        F  D  E  M  W
add $t1, $s3, $s3      F  D  E  M  W
...                       F  D  E  M  W
...                       F  D  E  M  W
...                          F  D  E  M  W
...                          F  D  E  M  W
...                             F  D  E  M  W
...                             F  D  E  M  W

CPI == 0.5!
Dual issue: Structural Hazards
We might not replicate everything. Perhaps only one multiplier, one shifter, and one load/store unit. What if the instruction is in the wrong place?
[Pipeline diagram: Fetch PC and PC+4; Decode 2 instructions and fetch 4 values; the upper pipe has the multiplier (*) and Mem; the lower pipe has the shifter (<<); Write back 2 values.]
If an upper instruction needs the lower pipeline, squash the lower instruction.
Dual issue: dealing with hazards

                            0  1  2  3  4  5  6  7
add   (fetch PC = 0)        F  D  E  M  W
sub                         F  D  E  M  W
Mul   (fetch PC = 8)           F  D  E  M  W
Shift                          F  D  E  M  W
Shift (fetch PC = 12)             F  D  E  M  W
Ld                                F  D  x  x  x
Shift (refetch, PC = 12)             F  x  x  x  x
Ld                                   F  D  E  M  W

On the first fetch at PC = 12, the Shift in the upper slot needs the lower pipe: the Shift moves to the lower pipe and the load is squashed. On the refetch, the load uses the lower pipe and the Shift slot becomes a noop.
Dual issue: Data Hazards
The lower instruction may need a value produced by the upper instruction. Forwarding cannot help us -- we must stall.
[Pipeline diagram: Fetch PC and PC+4; Decode 2 instructions and fetch 4 values; upper pipe with * and Mem; lower pipe with <<; Write back 2 values.]
Dual issue: dealing with hazards
Forwarding is essential! Both pipes stall.

                    0  1  2  3  4  5  6  7
add $s1, $s3, #4    F  D  E  M  W
sub $s4, $s1, #4    F  D  D  E  M  W
add ...                F  F  D  E  M  W
sub ...                F  F  D  E  M  W
and ...                      F  D  E  M  W
or  ...                      F  D  E  M  W

The sub needs $s1 from the add in its own pair, so it stalls one cycle and picks up the value by forwarding; every following pair stalls behind it.
Dual issue: Control Hazards
The upper instruction might be a branch, so the lower instruction might be on the wrong path.
Solution 1: Require branches to execute in the lower pipeline -- see structural hazards.
What about consecutive branches? -- Exercise for the reader.
What about branches to odd addresses? -- Squash the upper pipe.
[Pipeline diagram as before, with the branch unit in the lower pipe.]
Beyond Dual Issue
Wider pipelines are possible. There is often a separate floating point pipeline.
Wide issue leads to hardware complexity. Compiling gets harder, too.
In practice, processors use one of two options if they want more ILP:
Change the ISA and build a smart compiler: VLIW.
Keep the same ISA and build a smart processor: Out-of-order.
Going Out of Order: Data dependence refresher.
1: add $t1,$s2,$s3
2: sub $t2,$s3,$s4
3: or  $t5,$t1,$t2
4: add $t3,$t1,$t2

[Dependence graph: RAW edges from instructions 1 and 2 to both 3 and 4.]

We can parallelize instructions that do not have a read-after-write (RAW) dependence.
There is parallelism!! We can execute 1 & 2 at once and 3 & 4 at once.
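The refresher above can be checked mechanically: a RAW dependence exists whenever a later instruction reads a register an earlier one writes. A minimal sketch, assuming my own (dest, src1, src2) tuple encoding of the four instructions:

```python
# Each instruction is encoded as (dest, src1, src2).
# RAW: a later instruction reads a register an earlier one writes.
prog = [
    ("$t1", "$s2", "$s3"),  # 1: add $t1,$s2,$s3
    ("$t2", "$s3", "$s4"),  # 2: sub $t2,$s3,$s4
    ("$t5", "$t1", "$t2"),  # 3: or  $t5,$t1,$t2
    ("$t3", "$t1", "$t2"),  # 4: add $t3,$t1,$t2
]

raw = set()
for i, (dst, *_) in enumerate(prog):
    for j in range(i + 1, len(prog)):
        if dst in prog[j][1:]:        # a later instruction reads dst
            raw.add((i + 1, j + 1))   # 1-based, as on the slide

print(sorted(raw))  # [(1, 3), (1, 4), (2, 3), (2, 4)]
```

No edge connects 1 to 2, and none connects 3 to 4, so each pair can run in parallel, exactly as the slide claims.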
Data dependences
In general, if there is no dependence between two instructions, we can execute them in either order or simultaneously. But beware:

1: add $t1,$s2,$s3
2: sub $t1,$s3,$s4

Is there a dependence here? Can we reorder the instructions? Is the result the same?
No! The final value of $t1 is different.
False Dependence #1
Also called Write-after-Write (WAW) dependences, these occur when two instructions write to the same register. The dependence is false because no data flows between the instructions -- they just produce an output with the same name.
Beware again!
Is there a dependence here?

1: add $t1,$s2,$s3
2: sub $s2,$s3,$s4

Can we reorder the instructions? Is the result the same?
No! The value in $s2 that 1 needs will be destroyed.
False Dependence #2
This is a Write-after-Read (WAR) dependence. Again, it is false because no data flows between the instructions.
Out-of-Order Execution
Any sequence of instructions has a set of RAW, WAW, and WAR hazards that constrain its execution. Can we design a processor that extracts as much parallelism as possible, while still respecting these dependences?
The Central OOO Idea
1. Fetch a bunch of instructions
2. Build the dependence graph
3. Find all instructions with no unmet dependences
4. Execute them
5. Repeat
Example
1: add $t1,$s2,$s3
2: sub $t2,$s3,$s4
3: or  $t3,$t1,$t2
4: add $t5,$t1,$t2
5: or  $t4,$s1,$s3
6: mul $t2,$t3,$s5
7: sl  $t3,$t4,$t2
8: add $t3,$t5,$t1

[Dependence graph over the 8 instructions, with RAW, WAW, and WAR edges.]

8 instructions in 5 cycles!
Simplied OOO Pipeline
A new Schedule stage manages the Instruction Window. The window holds the set of instructions the processor examines: fetch and decode fill the window, and the execute stage drains it.
Typically, OOO pipelines are also wide, but it is not necessary.
Impacts: more forwarding, more stalls, longer branch resolution. Fundamentally more work per instruction.
[Pipeline diagram: Fetch -> Decode -> Schedule -> EX -> Mem -> Write back.]
The Instruction Window
The Instruction Window is the set of instructions the processor examines. The larger the window, the more parallelism the processor can find, but keeping the window filled is a challenge.
The fetch and decode stages fill the window; the execute stage drains it.
The Issue Window
[Diagram: one issue-window entry. The entry holds the decoded instruction data: opcode etc., source tags vrs and vrt, and rs_value/rt_value with valid bits. Comparators match the ALU output destination tags (alu_out_dst_0, alu_out_dst_1) against vrs and vrt to capture the forwarded values (alu_out_value_0, alu_out_value_1); when both operands are valid, the entry asserts Ready.]
The Issue Window
[Diagram: the Schedule stage. Arbitration logic selects ready instructions from the window and issues them to ALU0 and ALU1 for execution.]
Keeping the Window Filled
Keeping the instruction window filled is key! Instruction windows are about 32 instructions (size is limited by their complexity, which is considerable).
Branches arrive every 4-5 instructions. This means that the processor must predict 6-8 consecutive branches correctly to keep the window full.
On a mispredict, you flush the pipeline, which includes emptying the window.
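The 6-8 branch figure follows from the arithmetic: a 32-entry window with a branch every 4-5 instructions spans roughly 7 branches, and the window is only full when every one of them was predicted correctly. A quick calculation (the prediction accuracies are assumed, illustrative figures):

```python
# Probability that all branches covering the window are predicted
# correctly: accuracy ** n_branches.
window = 32
insts_per_branch = 4.5                         # a branch every 4-5 instructions
n_branches = round(window / insts_per_branch)  # ~7 branches in flight

for accuracy in (0.90, 0.95, 0.99):
    p_full = accuracy ** n_branches
    print(f"{accuracy:.0%} prediction -> window full {p_full:.0%} of the time")
```

Even at 95% per-branch accuracy the window is full only about 70% of the time, which is why branch prediction quality dominates OOO performance.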
How Much Parallelism is There?
Not much, in the presence of WAW and WAR dependences. These arise because we have a limited number of registers and must reuse them freely. How can we get rid of them?
Removing False Dependences
WAW and WAR dependences arise because we have too few registers. Let's add more!
But we can't! The architecture only gives us 32 (why? we only use 5 bits).
Solution: Define a set of internal physical registers that is as large as the number of instructions that can be in flight -- 128 in a recent Intel chip. Every instruction in the pipeline gets a register. Maintain a register mapping table that determines which physical register currently holds the value of each architectural register.
This is called Register Renaming.
Alpha 21264: Renaming
1: Add  r3, r2, r3   ->  Add  p4, p2, p3
2: Sub  r2, r1, r3   ->  Sub  p5, p1, p4
3: Mult r1, r3, r1   ->  Mult p6, p4, p1
4: Add  r2, r3, r1   ->  Add  p7, p4, p6
5: Add  r2, r1, r3   ->  Add  p8, p6, p4

Register map table (r1, r2, r3) after each instruction:
0:  p1  p2  p3
1:  p1  p2  p4
2:  p1  p5  p4
3:  p6  p5  p4
4:  p6  p7  p4
5:  p6  p8  p4

After renaming, only the true RAW dependences remain; the WAW and WAR edges among instructions 1-5 are gone.
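The table walk above can be reproduced in a few lines: rename an instruction's sources through the current map, then allocate a fresh physical register for its destination. A sketch of just the mechanism (free-list recycling and map recovery on mispredicts are omitted):

```python
# Rename (dest, src1, src2) instructions through a map table.
# Initially r1 -> p1, r2 -> p2, r3 -> p3; p4, p5, ... are free.
rmap = {"r1": "p1", "r2": "p2", "r3": "p3"}
next_phys = 4

prog = [("r3", "r2", "r3"),   # 1: Add  r3, r2, r3
        ("r2", "r1", "r3"),   # 2: Sub  r2, r1, r3
        ("r1", "r3", "r1"),   # 3: Mult r1, r3, r1
        ("r2", "r3", "r1"),   # 4: Add  r2, r3, r1
        ("r2", "r1", "r3")]   # 5: Add  r2, r1, r3

renamed = []
for dst, s1, s2 in prog:
    srcs = (rmap[s1], rmap[s2])    # sources use the current mapping
    rmap[dst] = f"p{next_phys}"    # destination gets a fresh physical reg
    next_phys += 1
    renamed.append((rmap[dst],) + srcs)

for r in renamed:
    print(r)
# ('p4', 'p2', 'p3')
# ('p5', 'p1', 'p4')
# ('p6', 'p4', 'p1')
# ('p7', 'p4', 'p6')
# ('p8', 'p6', 'p4')
```

The output matches the renamed operands on the slide; because every write gets a fresh register, no two instructions ever write the same name, so WAW and WAR edges simply cannot form.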
New OOO Pipeline
The register file is larger (to hold the physical registers). The pipeline is longer: more forwarding and a longer branch delay. The payoff had better be significant (and it is).
[Pipeline diagram: Fetch -> Decode -> Rename -> Schedule -> EX -> Mem -> Write back.]
Modern OOO Processors
The fastest machines in the world are OOO superscalars.
AMD Barcelona: 6-wide issue, 106 instructions in flight at once.
Intel Nehalem: 5-way issue to 12 ALUs, > 128 instructions in flight.
OOO provides the most benefit for memory operations: non-dependent instructions can keep executing during cache misses. This so-called memory-level parallelism is enormously important. CPU performance is (almost) all about memory performance nowadays (remember the memory wall graphs!).
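A back-of-the-envelope sketch of why memory-level parallelism matters so much (all numbers are assumed, illustrative figures): if misses are serviced one at a time their penalties add up, but if an OOO core keeps, say, 4 independent misses in flight, the total stall time divides by 4.

```python
# Illustrative numbers, not from any real machine.
misses  = 1000
penalty = 200        # cycles per cache miss
compute = 100_000    # cycles of useful (non-stalled) work

serial      = compute + misses * penalty       # one miss at a time
overlapped4 = compute + misses * penalty / 4   # 4 misses in flight

print(serial)       # 300000
print(overlapped4)  # 150000.0
```

With these numbers, overlapping misses doubles performance without making a single instruction execute faster.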
The Problem with OOO
Even the fastest OOO machines only get about 1-2 IPC, even though they are 4-5 wide. Problems:
Insufficient ILP within applications -- usually 1-2 per thread.
Poor branch prediction performance.
Single threads also have little memory parallelism.
Observation: On many cycles, many ALUs and instruction queue slots sit empty.
Simultaneous Multithreading
AKA HyperThreading in Intel machines. Run multiple threads at the same time: just throw all the instructions into the pipeline. Keep some separate data for each thread:
Renaming table
TLB entries
PCs
But the rest of the hardware is shared. It is surprisingly simple (but still quite complicated).
[Diagram: four fetch units (T1-T4) feeding shared Decode -> Rename -> Schedule -> EX -> Mem -> Write back pipelines.]
SMT Advantages
Exploit the ILP of multiple threads at once.
Less dependence on branch prediction (fewer correct predictions required per thread).
Less idle hardware (increased power efficiency).
Much higher IPC -- up to 4 (in simulation).
Disadvantages: threads can fight over resources and slow each other down.
Historical footnote: Invented, in part, by our own Dean Tullsen when he was at UW.