Parallelism I: Inside the Core
Remember!
141L projects due next Wednesday at midnight.
141 Final next Tuesday (3-6 pm).
The final is comprehensive, same general format as the Midterm. Review the slides, the homeworks, and the quizzes.
Key Points
What does wide issue mean? How does it affect performance? How does it affect pipeline design?
What is the basic idea behind out-of-order execution?
What is the difference between a true and a false dependence? How do OOO processors remove false dependences?
What is Simultaneous Multithreading?
Parallelism
ET = IC * CPI * CT
IC is more or less fixed, and we have shrunk cycle time as far as we can. We have achieved a CPI of 1. Can we get faster?
We can reduce our CPI to less than 1. The processor must do multiple operations at once. This is called Instruction Level Parallelism (ILP).
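The iron law above can be checked with a quick calculation. A minimal sketch, assuming a made-up instruction count and clock rate purely for illustration:

```python
# Execution time: ET = IC * CPI * CT (the "iron law" of performance).
# The workload numbers below are made up for illustration.
ic = 1_000_000_000   # instruction count
ct = 0.5e-9          # cycle time: 0.5 ns (a 2 GHz clock)

et_scalar = ic * 1.0 * ct   # CPI = 1.0: one instruction per cycle
et_super  = ic * 0.5 * ct   # CPI = 0.5: a 2-wide pipeline, ideal case

print(et_scalar)             # 0.5 (seconds)
print(et_super)              # 0.25 (seconds)
print(et_scalar / et_super)  # 2.0 (speedup)
```

Halving CPI halves execution time, which is exactly the payoff the rest of this lecture chases.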
Approach 1: Widen the pipeline
[Pipeline diagram: two parallel pipes. Fetch PC and PC+4; Decode 2 instructions and fetch 4 register values; two EX units; two memory ops in Mem; Write back 2 values.]
Process two instructions at once instead of 1: one even-PC instruction and one odd-PC instruction. This often keeps the instruction fetch logic simpler.
This is a 2-wide, in-order, superscalar processor. Potential problems?
Single issue refresher

                   cycle: 0  1  2  3  4  5  6  7  8
add $s1,$s2,$s3           F  D  E  M  W
sub $s2,$s4,$s5              F  D  E  M  W
ld  $s3, 0($s2)                 F  D  E  M  W
add $t1, $s3, $s3                  F  D  D  E  M  W

Forwarding covers the sub-to-ld and ld-to-add dependences, but the load-use hazard still costs one stall cycle (the repeated D in the final add).
Dual issue: Ideal Case
                    0  1  2  3  4  5  6  7  8
add $s1,$s2,$s3     F  D  E  M  W
sub $s2,$s4,$s5     F  D  E  M  W
ld  $s3, 0($s2)        F  D  E  M  W
add $t1, $s3, $s3      F  D  E  M  W
...                       F  D  E  M  W
...                       F  D  E  M  W
...                          F  D  E  M  W
...                          F  D  E  M  W
...                             F  D  E  M  W
...                             F  D  E  M  W

CPI == 0.5!
Dual issue: Structural Hazards
We might not replicate everything. Perhaps only one multiplier, one shifter, and one load/store unit. What if the instruction is in the wrong place?
[Pipeline diagram: Fetch PC and PC+4; Decode 2 instructions and fetch 4 values; the upper pipe has the multiplier (*) and Mem; the lower pipe has the shifter (<<); Write back 2 values.]
If an upper instruction needs the lower pipeline, squash the lower instruction.
Dual issue: dealing with hazards

                            0  1  2  3  4  5  6  7
add   (fetch PC = 0)        F  D  E  M  W
sub                         F  D  E  M  W
Mul   (fetch PC = 8)           F  D  E  M  W
Shift                          F  D  E  M  W
Shift (fetch PC = 12)             F  D  E  M  W
Ld                                F  D  x  x  x
Shift (refetch, PC = 12)             F  x  x  x  x
Ld                                   F  D  E  M  W

On the first fetch at PC = 12, the Shift in the upper slot needs the lower pipe: the Shift moves to the lower pipe and the load is squashed. On the refetch, the load uses the lower pipe and the Shift slot becomes a noop.
Dual issue: Data Hazards
The lower instruction may need a value produced by the upper instruction. Forwarding cannot help us -- we must stall.
[Pipeline diagram: Fetch PC and PC+4; Decode 2 instructions and fetch 4 values; upper pipe with * and Mem; lower pipe with <<; Write back 2 values.]
Dual issue: dealing with hazards
Forwarding is essential! Both pipes stall.

                    0  1  2  3  4  5  6  7
add $s1, $s3, #4    F  D  E  M  W
sub $s4, $s1, #4    F  D  D  E  M  W
add ...                F  F  D  E  M  W
sub ...                F  F  D  E  M  W
and ...                      F  D  E  M  W
or  ...                      F  D  E  M  W

The sub needs $s1 from the add in its own pair, so it stalls one cycle and picks up the value by forwarding; every following pair stalls behind it.
Dual issue: Control Hazards
The upper instruction might be a branch, so the lower instruction might be on the wrong path.
Solution 1: Require branches to execute in the lower pipeline -- see structural hazards.
What about consecutive branches? -- Exercise for the reader.
What about branches to odd addresses? -- Squash the upper pipe.
[Pipeline diagram as before, with the branch unit in the lower pipe.]
Beyond Dual Issue
Wider pipelines are possible. There is often a separate floating point pipeline.
Wide issue leads to hardware complexity. Compiling gets harder, too.
In practice, processors use one of two options if they want more ILP:
Change the ISA and build a smart compiler: VLIW.
Keep the same ISA and build a smart processor: Out-of-order.
Going Out of Order: Data dependence refresher.
1: add $t1,$s2,$s3
2: sub $t2,$s3,$s4
3: or  $t5,$t1,$t2
4: add $t3,$t1,$t2

[Dependence graph: RAW edges from instructions 1 and 2 to both 3 and 4.]

We can parallelize instructions that do not have a read-after-write (RAW) dependence.
There is parallelism!! We can execute 1 & 2 at once and 3 & 4 at once.
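The refresher above can be checked mechanically: a RAW dependence exists whenever a later instruction reads a register an earlier one writes. A minimal sketch, assuming my own (dest, src1, src2) tuple encoding of the four instructions:

```python
# Each instruction is encoded as (dest, src1, src2).
# RAW: a later instruction reads a register an earlier one writes.
prog = [
    ("$t1", "$s2", "$s3"),  # 1: add $t1,$s2,$s3
    ("$t2", "$s3", "$s4"),  # 2: sub $t2,$s3,$s4
    ("$t5", "$t1", "$t2"),  # 3: or  $t5,$t1,$t2
    ("$t3", "$t1", "$t2"),  # 4: add $t3,$t1,$t2
]

raw = set()
for i, (dst, *_) in enumerate(prog):
    for j in range(i + 1, len(prog)):
        if dst in prog[j][1:]:        # a later instruction reads dst
            raw.add((i + 1, j + 1))   # 1-based, as on the slide

print(sorted(raw))  # [(1, 3), (1, 4), (2, 3), (2, 4)]
```

No edge connects 1 to 2, and none connects 3 to 4, so each pair can run in parallel, exactly as the slide claims.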
Data dependences
In general, if there is no dependence between two instructions, we can execute them in either order or simultaneously. But beware:

1: add $t1,$s2,$s3
2: sub $t1,$s3,$s4

Is there a dependence here? Can we reorder the instructions? Is the result the same?
No! The final value of $t1 is different.
False Dependence #1
Also called Write-after-Write (WAW) dependences, these occur when two instructions write to the same register. The dependence is false because no data flows between the instructions -- they just produce an output with the same name.
Beware again!
Is there a dependence here?

1: add $t1,$s2,$s3
2: sub $s2,$s3,$s4

Can we reorder the instructions? Is the result the same?
No! The value in $s2 that 1 needs will be destroyed.
False Dependence #2
This is a Write-after-Read (WAR) dependence. Again, it is false because no data flows between the instructions.
Out-of-Order Execution
Any sequence of instructions has a set of RAW, WAW, and WAR hazards that constrain its execution. Can we design a processor that extracts as much parallelism as possible, while still respecting these dependences?
The Central OOO Idea
1. Fetch a bunch of instructions
2. Build the dependence graph
3. Find all instructions with no unmet dependences
4. Execute them
5. Repeat
Example
1: add $t1,$s2,$s3
2: sub $t2,$s3,$s4
3: or  $t3,$t1,$t2
4: add $t5,$t1,$t2
5: or  $t4,$s1,$s3
6: mul $t2,$t3,$s5
7: sl  $t3,$t4,$t2
8: add $t3,$t5,$t1

[Dependence graph over the 8 instructions, with RAW, WAW, and WAR edges.]

8 instructions in 5 cycles!
Simplied OOO Pipeline
A new Schedule stage manages the Instruction Window. The window holds the set of instructions the processor examines: fetch and decode fill the window, and the execute stage drains it.
Typically, OOO pipelines are also wide, but it is not necessary.
Impacts: more forwarding, more stalls, longer branch resolution. Fundamentally more work per instruction.
[Pipeline diagram: Fetch -> Decode -> Schedule -> EX -> Mem -> Write back.]
The Instruction Window
The Instruction Window is the set of instructions the processor examines. The larger the window, the more parallelism the processor can find, but keeping the window filled is a challenge.
The fetch and decode stages fill the window; the execute stage drains it.
The Issue Window
[Diagram: one issue-window entry. The entry holds the decoded instruction data: opcode etc., source tags vrs and vrt, and rs_value/rt_value with valid bits. Comparators match the ALU output destination tags (alu_out_dst_0, alu_out_dst_1) against vrs and vrt to capture the forwarded values (alu_out_value_0, alu_out_value_1); when both operands are valid, the entry asserts Ready.]
The Issue Window
[Diagram: the Schedule stage. Arbitration logic selects ready instructions from the window and issues them to ALU0 and ALU1 for execution.]
Keeping the Window Filled
Keeping the instruction window filled is key! Instruction windows are about 32 instructions (size is limited by their complexity, which is considerable).
Branches arrive every 4-5 instructions. This means that the processor must predict 6-8 consecutive branches correctly to keep the window full.
On a mispredict, you flush the pipeline, which includes emptying the window.
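The 6-8 branch figure follows from the arithmetic: a 32-entry window with a branch every 4-5 instructions spans roughly 7 branches, and the window is only full when every one of them was predicted correctly. A quick calculation (the prediction accuracies are assumed, illustrative figures):

```python
# Probability that all branches covering the window are predicted
# correctly: accuracy ** n_branches.
window = 32
insts_per_branch = 4.5                         # a branch every 4-5 instructions
n_branches = round(window / insts_per_branch)  # ~7 branches in flight

for accuracy in (0.90, 0.95, 0.99):
    p_full = accuracy ** n_branches
    print(f"{accuracy:.0%} prediction -> window full {p_full:.0%} of the time")
```

Even at 95% per-branch accuracy the window is full only about 70% of the time, which is why branch prediction quality dominates OOO performance.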
How Much Parallelism is There?
Not much, in the presence of WAW and WAR dependences. These arise because we have a limited number of registers and must reuse them freely. How can we get rid of them?
Removing False Dependences
WAW and WAR dependences arise because we have too few registers. Let's add more!
But we can't! The architecture only gives us 32 (why? we only use 5 bits).
Solution: Define a set of internal physical registers that is as large as the number of instructions that can be in flight -- 128 in a recent Intel chip. Every instruction in the pipeline gets a register. Maintain a register mapping table that determines which physical register currently holds the value of each architectural register.
This is called Register Renaming.
Alpha 21264: Renaming
1: Add  r3, r2, r3   ->  Add  p4, p2, p3
2: Sub  r2, r1, r3   ->  Sub  p5, p1, p4
3: Mult r1, r3, r1   ->  Mult p6, p4, p1
4: Add  r2, r3, r1   ->  Add  p7, p4, p6
5: Add  r2, r1, r3   ->  Add  p8, p6, p4

Register map table (r1, r2, r3) after each instruction:
0:  p1  p2  p3
1:  p1  p2  p4
2:  p1  p5  p4
3:  p6  p5  p4
4:  p6  p7  p4
5:  p6  p8  p4

After renaming, only the true RAW dependences remain; the WAW and WAR edges among instructions 1-5 are gone.
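The table walk above can be reproduced in a few lines: rename an instruction's sources through the current map, then allocate a fresh physical register for its destination. A sketch of just the mechanism (free-list recycling and map recovery on mispredicts are omitted):

```python
# Rename (dest, src1, src2) instructions through a map table.
# Initially r1 -> p1, r2 -> p2, r3 -> p3; p4, p5, ... are free.
rmap = {"r1": "p1", "r2": "p2", "r3": "p3"}
next_phys = 4

prog = [("r3", "r2", "r3"),   # 1: Add  r3, r2, r3
        ("r2", "r1", "r3"),   # 2: Sub  r2, r1, r3
        ("r1", "r3", "r1"),   # 3: Mult r1, r3, r1
        ("r2", "r3", "r1"),   # 4: Add  r2, r3, r1
        ("r2", "r1", "r3")]   # 5: Add  r2, r1, r3

renamed = []
for dst, s1, s2 in prog:
    srcs = (rmap[s1], rmap[s2])    # sources use the current mapping
    rmap[dst] = f"p{next_phys}"    # destination gets a fresh physical reg
    next_phys += 1
    renamed.append((rmap[dst],) + srcs)

for r in renamed:
    print(r)
# ('p4', 'p2', 'p3')
# ('p5', 'p1', 'p4')
# ('p6', 'p4', 'p1')
# ('p7', 'p4', 'p6')
# ('p8', 'p6', 'p4')
```

The output matches the renamed operands on the slide; because every write gets a fresh register, no two instructions ever write the same name, so WAW and WAR edges simply cannot form.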
New OOO Pipeline
The register file is larger (to hold the physical registers). The pipeline is longer: more forwarding and a longer branch delay. The payoff had better be significant (and it is).
[Pipeline diagram: Fetch -> Decode -> Rename -> Schedule -> EX -> Mem -> Write back.]
Modern OOO Processors
The fastest machines in the world are OOO superscalars.
AMD Barcelona: 6-wide issue, 106 instructions in flight at once.
Intel Nehalem: 5-way issue to 12 ALUs, > 128 instructions in flight.
OOO provides the most benefit for memory operations: non-dependent instructions can keep executing during cache misses. This so-called memory-level parallelism is enormously important. CPU performance is (almost) all about memory performance nowadays (remember the memory wall graphs!).
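A back-of-the-envelope sketch of why memory-level parallelism matters so much (all numbers are assumed, illustrative figures): if misses are serviced one at a time their penalties add up, but if an OOO core keeps, say, 4 independent misses in flight, the total stall time divides by 4.

```python
# Illustrative numbers, not from any real machine.
misses  = 1000
penalty = 200        # cycles per cache miss
compute = 100_000    # cycles of useful (non-stalled) work

serial      = compute + misses * penalty       # one miss at a time
overlapped4 = compute + misses * penalty / 4   # 4 misses in flight

print(serial)       # 300000
print(overlapped4)  # 150000.0
```

With these numbers, overlapping misses doubles performance without making a single instruction execute faster.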
The Problem with OOO
Even the fastest OOO machines only get about 1-2 IPC, even though they are 4-5 wide. Problems:
Insufficient ILP within applications -- usually 1-2 per thread.
Poor branch prediction performance.
Single threads also have little memory parallelism.
Observation: On many cycles, many ALUs and instruction queue slots sit empty.
Simultaneous Multithreading
AKA HyperThreading in Intel machines. Run multiple threads at the same time: just throw all the instructions into the pipeline. Keep some separate data for each thread:
Renaming table
TLB entries
PCs
But the rest of the hardware is shared. It is surprisingly simple (but still quite complicated).
[Diagram: four fetch units (T1-T4) feeding shared Decode -> Rename -> Schedule -> EX -> Mem -> Write back pipelines.]
SMT Advantages
Exploit the ILP of multiple threads at once.
Less dependence on branch prediction (fewer correct predictions required per thread).
Less idle hardware (increased power efficiency).
Much higher IPC -- up to 4 (in simulation).
Disadvantages: threads can fight over resources and slow each other down.
Historical footnote: Invented, in part, by our own Dean Tullsen when he was at UW.