Data Hazards
Hazards: Key Points
Hazards cause imperfect pipelining
They prevent us from achieving CPI = 1
They are generally caused by counter-flow data dependences in
the pipeline
Three kinds
Structural -- contention for hardware resources
Data -- a data value is not available when/where it is needed.
Control -- the next instruction to execute is not known.
Two ways to deal with hazards
Removal -- add hardware and/or complexity to work around the
hazard so it does not exist
Bypassing/forwarding
Speculation
Stall -- Sacrifice performance to prevent the hazard from
occurring
Stalling causes bubbles
Data Dependences
A data dependence occurs whenever one instruction
needs a value produced by another.
Register values (for now)
Also memory accesses (more on this later)
add $s0, $t0, $t1
sub $t2, $s0, $t3
sw $t1, 0($t2)
ld $t3, 0($t2)
ld $t4, 16($s4)
add $t3, $s0, $t4
and $t3, $t2, $t4
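A minimal sketch of how the register dependences in a sequence like the one above could be found mechanically. The `parse` and `raw_dependences` helpers are hypothetical names introduced for illustration, and the sketch tracks register values only; memory dependences (for example between a `sw` and a later `ld` to the same address) are not detected.

```python
import re

def parse(instr):
    """Split a MIPS-style instruction into (dest, sources). Hypothetical helper."""
    op, rest = instr.split(None, 1)
    regs = re.findall(r"\$\w+", rest)
    if op == "sw":              # a store reads both registers and writes none
        return None, regs
    return regs[0], regs[1:]    # otherwise the first register is the destination

def raw_dependences(program):
    """Return (producer_index, consumer_index, register) triples."""
    deps = []
    last_writer = {}            # register -> index of the most recent writer
    for i, instr in enumerate(program):
        dest, srcs = parse(instr)
        for reg in srcs:
            if reg in last_writer:
                deps.append((last_writer[reg], i, reg))
        if dest is not None:
            last_writer[dest] = i
    return deps

program = [
    "add $s0, $t0, $t1",
    "sub $t2, $s0, $t3",
    "sw $t1, 0($t2)",
    "ld $t3, 0($t2)",
]
deps = raw_dependences(program)
for producer, consumer, reg in deps:
    print(f"instruction {consumer} needs {reg} from instruction {producer}")
```

Each triple is a candidate data hazard: whether it actually causes one depends on how far apart the two instructions are in the pipeline.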
Dependences in the pipeline
In our simple pipeline, these instructions cause a hazard
Cycles
add $s0, $t0, $t1   Fetch   Decode  EX      Mem     Write back
sub $t2, $s0, $t3           Fetch   Decode  EX      Mem     Write back
How can we fix it?
Ideas?
Solution 1: Make the compiler deal with it.
Expose hazards to the big A architecture
A result is available N instructions after the instruction
that generates it.
In the meantime, the register file has the old value.
delay slots
What is N?
Can it change?
What can the compiler do?
Fetch   Decode  EX      Mem     Write back
Compiling for delay slots
The compiler must fill the delay slots with other
instructions
What if it can't? No-ops
Original order:
add $s0, $t0, $t1
sub $t2, $s0, $t3
and $t7, $t5, $t4
add $t3, $s0, $t4
Rearranged instructions:
add $s0, $t0, $t1
and $t7, $t5, $t4
add $t3, $s0, $t4
sub $t2, $s0, $t3
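A minimal sketch of the "No-ops" fallback, under stated assumptions: a result becomes visible `n_slots` instructions after its producer, instructions are given as hypothetical `(text, dest, sources)` tuples, and only read-after-write dependences in the listed order are considered. It is an illustration of the idea, not a real compiler pass.

```python
def fill_with_nops(program, n_slots):
    """Pad with nops so no instruction reads a register inside its delay slots.

    program: list of (text, dest, sources) tuples (a hypothetical encoding).
    n_slots: number of instructions after a producer that still see the old value.
    """
    out = []
    visible_at = {}   # register -> first output position that sees the new value
    for text, dest, srcs in program:
        earliest = max([visible_at.get(r, 0) for r in srcs], default=0)
        while len(out) < earliest:
            out.append("nop")
        out.append(text)
        if dest is not None:
            visible_at[dest] = len(out) + n_slots
    return out

original = [
    ("add $s0, $t0, $t1", "$s0", ["$t0", "$t1"]),
    ("sub $t2, $s0, $t3", "$t2", ["$s0", "$t3"]),
    ("and $t7, $t5, $t4", "$t7", ["$t5", "$t4"]),
    ("add $t3, $s0, $t4", "$t3", ["$s0", "$t4"]),
]
# Hand-reordered variant (an assumption for illustration): the independent
# `and` is moved up to sit between the first add and the sub.
reordered = [original[0], original[2], original[1], original[3]]

padded_original = fill_with_nops(original, n_slots=1)
padded_reordered = fill_with_nops(reordered, n_slots=1)
print(padded_original)
print(padded_reordered)
```

With one delay slot, the original order needs a nop after the first `add`, while the reordered variant needs none because a useful instruction already occupies the slot.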
Solution 2: Stall
When you need a value that is not ready, stall
Suspend the execution of the executing instruction
and those that follow.
This introduces a pipeline bubble. A bubble is a lack of
work to do. It moves through the pipeline like an
instruction.
Cycles
add $s0, $t0, $t1   Fetch   Decode  EX      Mem     Write back
sub $t2, $s0, $t3           Fetch   Stall   Stall   Decode  EX      Mem     Write back
Stalling the pipeline
Freeze all pipeline stages before the stage where
the hazard occurred.
Disable the PC update
Disable the pipeline registers
Insert nop control bits at stalled stage (decode in our
example)
How is this solution still potentially better than relying
on the compiler?
This is essentially equivalent to always inserting a
nop when a hazard exists
The compiler can still act like there are delay slots to avoid stalls.
Implementation details are not exposed in the ISA
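A rough sketch of how stall cycles could be counted for the simple five-stage pipeline, assuming no forwarding, registers read in Decode, and a register file that lets a Decode read see a value written back in the same cycle. The `schedule` function and its `(dest, sources)` encoding are hypothetical and only model register dependences.

```python
def schedule(program):
    """Return (decode_cycles, total_stall_cycles) for an in-order pipeline.

    program: list of (dest, sources) tuples.
    Cycle numbering: the first instruction is fetched in cycle 1.
    """
    write_back_cycle = {}   # register -> cycle in which its new value is written
    decode_cycles = []
    prev_decode = 1         # so the first instruction decodes in cycle 2
    stalls = 0
    for dest, srcs in program:
        earliest = prev_decode + 1   # one instruction enters Decode per cycle
        needed = max([write_back_cycle.get(r, 0) for r in srcs], default=0)
        decode = max(earliest, needed)
        stalls += decode - earliest  # each cycle of delay is one bubble
        decode_cycles.append(decode)
        if dest is not None:
            write_back_cycle[dest] = decode + 3   # Decode -> EX -> Mem -> Write back
        prev_decode = decode
    return decode_cycles, stalls

program = [
    ("$s0", ["$t0", "$t1"]),   # add $s0, $t0, $t1
    ("$t2", ["$s0", "$t3"]),   # sub $t2, $s0, $t3
]
decode_cycles, stalls = schedule(program)
print(decode_cycles, stalls)
```

Under these assumptions the dependent `sub` waits in Decode until the `add` reaches Write back, which costs two bubbles.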
The Impact of Stalling On Performance
ET = I * CPI * CT
I and CT are constant
What is the impact of stalling on CPI?
What do we need to know to figure it out?
Fraction of instructions that stall: 30%
Baseline CPI = 1
Stall CPI = 1 + 2 = 3
New CPI = 0.3*3 + 0.7*1 = 1.6
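The arithmetic above can be reproduced with a short sketch. The function name `cpi_with_stalls` is an assumption introduced here; the numbers are the ones from the slide (30% of instructions stall, each for 2 cycles).

```python
def cpi_with_stalls(base_cpi, stall_fraction, stall_cycles):
    """Average CPI when a fraction of instructions each pay extra stall cycles."""
    stalled_cpi = base_cpi + stall_cycles
    return stall_fraction * stalled_cpi + (1 - stall_fraction) * base_cpi

new_cpi = cpi_with_stalls(base_cpi=1, stall_fraction=0.3, stall_cycles=2)
# With I and CT constant, ET scales directly with CPI.
slowdown = new_cpi / 1
print(round(new_cpi, 2), round(slowdown, 2))
```

Since I and CT do not change, execution time grows by the same factor as CPI.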
Solution 3: Bypassing/Forwarding
Data values are computed in Ex and Mem but
publicized in write back
The data exists! We should use it.
[Figure: the pipeline stages Fetch, Decode, EX, Mem, Write back, annotated with where inputs are needed, where results are known, and where results are "published" to registers.]
Bypassing or Forwarding
Take the values, wherever they are
Cycles
add $s0, $t0, $t1   Fetch   Decode  EX      Mem     Write back
sub $t2, $s0, $t3           Fetch   Decode  EX      Mem     Write back
Forwarding Paths
Cycles
add $s0, $t0, $t1   Fetch   Decode  EX      Mem     Write back
sub $t2, $s0, $t3           Fetch   Decode  EX      Mem     Write back
sub $t2, $s0, $t3                   Fetch   Decode  EX      Mem     Write back
sub $t2, $s0, $t3                           Fetch   Decode  EX      Mem     Write back
Forwarding in Hardware
[Figure: pipelined datapath with forwarding. Labeled components: PC, Instruction Memory (Read Address), Register File (Read Addr 1, Read Addr 2, Write Addr, Write Data, Read Data 1, Read Data 2), Sign Extend (16 to 32), Shift left 2, Add, ALU, Data Memory (Address, Write Data, Read Data); pipeline registers IFetch/Dec, Dec/Exec, Exec/Mem, Mem/WB.]
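A minimal sketch of the selection a forwarding unit might make for one ALU operand. It assumes forwarding from the Exec/Mem and Mem/WB pipeline registers and a MIPS-style `$zero` register that is never forwarded. The function name and the dictionary encoding are hypothetical; real hardware does this with comparators and multiplexers.

```python
def forward_select(src_reg, exec_mem, mem_wb):
    """Pick where an ALU operand should come from.

    exec_mem and mem_wb describe the instruction currently held in that
    pipeline register: {"reg_write": bool, "dest": register name or None}.
    """
    if src_reg == "$zero":                       # assumed: hardwired to 0
        return "register file"
    if exec_mem["reg_write"] and exec_mem["dest"] == src_reg:
        return "Exec/Mem"                        # the most recent value wins
    if mem_wb["reg_write"] and mem_wb["dest"] == src_reg:
        return "Mem/WB"
    return "register file"

# sub $t2, $s0, $t3 is in EX while add $s0, $t0, $t1 sits in Exec/Mem.
add_result = {"reg_write": True, "dest": "$s0"}
empty = {"reg_write": False, "dest": None}
choice_s0 = forward_select("$s0", add_result, empty)
choice_t3 = forward_select("$t3", add_result, empty)
print(choice_s0, choice_t3)
```

Checking Exec/Mem before Mem/WB matters when both hold a write to the same register: the younger producer holds the value the consumer should see.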
Forwarding for Loads
Load values come from the Mem stage
Cycles
ld $s0, 0($t0)      Fetch   Decode  EX      Mem     Write back
sub $t2, $s0, $t3           Fetch   Decode  EX      Mem
Time travel presents significant
implementation challenges
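Because a loaded value is not available until the end of Mem, forwarding alone cannot feed an instruction that enters EX one cycle behind the load. A hedged sketch of the check a hazard detection unit could perform, using a hypothetical dictionary encoding of instructions:

```python
def must_stall(instr_in_ex, instr_in_decode):
    """True when the instruction in Decode needs a value that a load,
    currently in EX, will only produce at the end of Mem."""
    return (instr_in_ex["op"] == "ld"
            and instr_in_ex["dest"] in instr_in_decode["srcs"])

load = {"op": "ld", "dest": "$s0", "srcs": ["$t0"]}            # ld $s0, 0($t0)
use = {"op": "sub", "dest": "$t2", "srcs": ["$s0", "$t3"]}     # sub $t2, $s0, $t3
other = {"op": "and", "dest": "$t7", "srcs": ["$t5", "$t4"]}   # independent

stall_for_use = must_stall(load, use)
stall_for_other = must_stall(load, other)
print(stall_for_use, stall_for_other)
```

When the check fires, the dependent instruction is held for a cycle, after which the loaded value can be forwarded.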
What can we do?
Punt to the compiler
Easy enough.
Will work.
Same dangers apply as before.
If the compiler can't fix it, the hardware will stall
Always stall.
Forward when possible, stall otherwise
Here the compiler still has leverage
Hardware Cost of Forwarding
In our pipeline, adding forwarding required relatively
little hardware.
For deeper pipelines it gets much more
expensive
Roughly: (number of ALUs) * (pipeline stages you need to forward over)
Some modern processors have multiple ALUs (4-5)
And deeper pipelines (4-5 stages to forward across)
Not all forwarding paths need to be supported.
If a path does not exist, the processor will need to stall.
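The rough estimate above is easy to tabulate. This sketch only multiplies the two quantities the slide names; the function name is hypothetical, and the "simple pipeline" figures (one ALU, two stages to forward over) are an assumption about the five-stage design, not something the slide states.

```python
def forwarding_cost(num_alus, stages_forwarded_over):
    """Rough cost estimate from the slide: ALUs * stages to forward over."""
    return num_alus * stages_forwarded_over

simple = forwarding_cost(num_alus=1, stages_forwarded_over=2)   # assumed values
wide = forwarding_cost(num_alus=5, stages_forwarded_over=5)     # upper end of 4-5
print(simple, wide)
```

Even this crude count shows why wide, deep designs may leave some paths out and stall instead.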
Key Points: Control Hazards
Control hazards occur when we don't know what the
next instruction is
Mostly caused by branches
Strategies for dealing with them
Stall
Guess!
Leads to speculation
Flushing the pipeline
Strategies for making better guesses
Understand the difference between stall and flush
Control Hazards
add $s1, $s3, $s2
sub $s6, $s5, $s2
beq $s6, $s7, somewhere
and $s2, $s3, $s1
Computing the new PC
Fetch   Decode  EX      Mem     Write back
Computing the PC
Non-branch instruction
PC = PC + 4
When is PC ready?
Fetch   Decode  EX      Mem     Write back
Computing the PC
Branch instructions
bne $s1, $s2, offset
if ($s1 != $s2) { PC = PC + offset; } else { PC = PC + 4; }
When is the value ready?
Fetch   Decode  EX      Mem     Write back
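The two PC rules can be written as one small function. This sketch follows the simplified form used on the slides (PC + offset for a taken branch); the function name and arguments are hypothetical.

```python
def next_pc_bne(pc, s1, s2, offset):
    """Next PC for `bne $s1, $s2, offset`, using the slide's simplified rule."""
    if s1 != s2:
        return pc + offset   # taken: needs the register values to be compared
    return pc + 4            # not taken: same as any non-branch instruction

taken_pc = next_pc_bne(pc=100, s1=1, s2=2, offset=-40)
not_taken_pc = next_pc_bne(pc=100, s1=7, s2=7, offset=-40)
print(taken_pc, not_taken_pc)
```

PC + 4 can be computed as soon as the instruction is fetched, but the taken target cannot be chosen until the comparison of `$s1` and `$s2` is done, which is why the branch value is ready later.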
Option 2: Simple Prediction
Can a processor tell the future?
For non-taken branches, the new PC is ready
immediately.
Let's just assume the branch is not taken
Also called branch prediction or control
speculation
What if we are wrong?
Predict Not-taken
Cycles
Not-taken   bne $t2, $s0, somewhere   Fetch   Decode  EX      Mem     Write back
Taken       bne $t2, $s4, else                Fetch   Decode  EX      Mem     Write back
            add $s0, $t0, $t1                         Fetch   Decode  EX      Mem     Write back   (Squash)
            ...
else:       sub $t2, $s0, $t3                                 Fetch   Decode
We start the add, and then, when we discover
the branch outcome, we squash it.
We flush the pipeline.
Simple static Prediction
"static" means before run time
Many prediction schemes are possible
Predict taken
Pros? Loops are common
Predict not-taken
Pros? Not all branches are for loops.
Backward taken/forward not taken
Best of both worlds.
Implementing Backward taken/forward not taken
Changes in control
New inputs to the control unit
The sign of the offset
The result of the branch
New outputs from control
The flush signal.
Inserts noop bits in datapath and control
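The control changes listed above can be sketched as two small functions, one per new input. The names `predict_taken` and `flush_needed` are hypothetical; the sketch only captures the decision logic, not the datapath.

```python
def predict_taken(offset):
    """Backward taken / forward not taken: a negative offset jumps backward."""
    return offset < 0

def flush_needed(offset, actually_taken):
    """Flush when the real branch result disagrees with the static guess."""
    return predict_taken(offset) != actually_taken

loop_branch_ok = flush_needed(offset=-40, actually_taken=True)     # loop iterates
loop_branch_exit = flush_needed(offset=-40, actually_taken=False)  # loop exits
forward_skip = flush_needed(offset=24, actually_taken=True)        # forward, taken
print(loop_branch_ok, loop_branch_exit, forward_skip)
```

A backward loop branch is guessed correctly on every iteration except the exit, which is the case the scheme is designed around.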
The Importance of Pipeline depth
There are two important parameters of the
pipeline that determine the impact of branches
on performance
Branch decode time -- how many cycles does it take to
identify a branch (in our case, this is less than 1)
Branch resolution time -- cycles until the real branch
outcome is known (in our case, this is 2 cycles)
Pentium 4 pipeline
1. Branches take 19 cycles to resolve
2. Identifying a branch takes 4 cycles.
3. Stalling is not an option.
4. Not quite as bad now, but BP is still very important.
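To see why resolution time matters so much, here is a first-order sketch that charges each mispredicted branch the full resolution time. This model is a common simplification and is not taken from the slides; the branch fraction (20%) and misprediction rate (10%) are made-up illustrative values. Only the resolution times (2 cycles for the simple pipeline, 19 for the Pentium 4) come from the text.

```python
def cpi_with_branches(branch_fraction, mispredict_rate, penalty_cycles, base_cpi=1.0):
    """Average CPI when each mispredicted branch wastes penalty_cycles."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles

# Hypothetical workload: 20% branches, 10% of them mispredicted.
short_pipe = cpi_with_branches(0.2, 0.1, penalty_cycles=2)
long_pipe = cpi_with_branches(0.2, 0.1, penalty_cycles=19)
print(round(short_pipe, 2), round(long_pipe, 2))
```

With the same predictor, the longer resolution time turns a small CPI overhead into a large one, which is why prediction accuracy becomes so important in deep pipelines.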
Dynamic Branch Prediction
Long pipes demand higher accuracy than static
schemes can deliver.
Instead of making the guess once, make it
every time we see the branch.
Predict future behavior based on past behavior
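One simple way to predict from past behavior is to remember each branch's last outcome and guess that it repeats. The slides do not name a specific scheme, so this last-outcome predictor is an illustrative assumption, and the class and function names are hypothetical.

```python
class LastOutcomePredictor:
    """Predict that a branch does whatever it did the last time it ran."""

    def __init__(self):
        self.last = {}   # branch PC -> last observed outcome (True = taken)

    def predict(self, pc):
        return self.last.get(pc, False)   # assumed default: not taken

    def update(self, pc, taken):
        self.last[pc] = taken

def accuracy(predictor, trace):
    """trace: list of (pc, taken) pairs in execution order."""
    correct = 0
    for pc, taken in trace:
        if predictor.predict(pc) == taken:
            correct += 1
        predictor.update(pc, taken)
    return correct / len(trace)

# A loop branch at a made-up address: taken 9 times, then falls through.
trace = [(0x40, True)] * 9 + [(0x40, False)]
result = accuracy(LastOutcomePredictor(), trace)
print(result)
```

The predictor misses the first iteration and the loop exit but gets the eight iterations in between right. A static not-taken guess would be right only on the exit for this trace.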