Data Hazards
Time (in clock cycles)
IM
ALU
Reg
Reg
IM
AND R6, R1, R7
OR R8, R1, R9
CC 4
DM
Reg
IM
CC 5
CC 6
Reg
DM
Reg
Reg
DM
ALU
or
Danger!Danger!Danger!
SUB R4, R1, R5
IM
CC 3
ALU
Pipeline Hazards
CC 2
ALU
Program execution order (in instructions)
ADD R1, R2, R3
CC 1
XOR R10, R1, R11
IM
CSE 240A
Dean Tullsen
Data Hazard
lw R8, 10000(R3)
ID/EX
IF/ID
4
ADD
add R6, R2, R1
M
u
x
addi R3, R1, #35
MEM/WB
Instruction
memory
IR
IR 11..15
MEM/WB.IR
Registers
M
u
x
M
u
x
16
CSE 240A
Sign
extend
Data hazards are caused by data dependences
Data dependences, and thus data hazards, come in
3 flavors (not all of which apply to this pipeline).
RAW (read-after-write)
WAW (write-after-write)
WAR (write-after-read)
Branch
taken
IR 6..10
PC
Dean Tullsen
Data Dependence
EX/MEM
Zero?
Reg
CSE 240A
ALU
Data
memory
M
u
x
32
Dean Tullsen
CSE 240A
Dean Tullsen
RAW Hazard
IF
ID
EX
MEM
WB
IF
ID
EX
MEM
WB
RAW hazard is extremely common
CSE 240A
Dean Tullsen
CSE 240A
Dealing with Data Hazards
through Forwarding
WAR Hazard
Dean Tullsen
later instruction tries to write an operand before earlier instruction
reads it
The dependence
add R1, R2, R3
sub R2, R5, R4
The hazard?
add R1, R2, R3
IF
ID
EX
MEM
WB
IF
ID
EX
MEM
WB
sub R2, R5, R4
WAR hazard is uncommon/impossible in a reasonable (in-order)
pipeline
ADD R1, R2, R3
SUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9
CC 1
CC 2
IM
Reg
IM
CC 3
Reg
IM
CC 4
DM
Reg
IM
CC 5
CC 6
Reg
DM
Reg
Reg
DM
ALU
The hazard
add R1, R2, R3
sub R5, R1, R4
ALU
later instruction tries to write an operand before earlier
instruction writes it
The dependence
add R1, R2, R3
sub R1, R2, R4
The hazard
ID
EX
MEM MEM2 MEM3
lw R1, R2, R3 IF
IF
ID
EX
MEM
WB
sub R1, R2, R4
WAW hazard possible in a reasonable pipeline, but not in
the very simple pipeline were assuming.
ALU
later instruction tries to read an operand before earlier instruction
writes it
The dependence
add R1, R2, R3
sub R5, R1, R4
ALU
WAW Hazard
XOR R10, R1, R11
IM
CSE 240A
Dean Tullsen
CSE 240A
Reg
Dean Tullsen
WB
Dealing with Data Hazards through
Forwarding
AND R6, R4, R7
ID/EX
IF/ID
Zero?
MEM/WB
Branch
taken
IR 6..10
Instruction
memory
M
u
x
IR 11..15
IR
Registers
MEM/WB.IR
ALU
M
u
x
16
Sign
extend
Data
memory
M
u
x
(Im letting ADD stand in for all ALU operations)
32
CSE 240A
Dean Tullsen
CSE 240A
Dean Tullsen
More Forwarding
CC 1
CC 2
CC 3
CC 4
Forwarding and Stalling
CC 5
IM
Reg
ALU
LW
ADD R1, R2, R3
DM
R1, 0(R2)
Reg
ALU
IM
DM
Reg
ALU
IM
Reg
CC 4
CC 5
CC 6
DM
Reg
IM
Reg
Bubble
IM
Bubble
Reg
Bubble
IM
DM
DM
OR R8, R1, R9
CSE 240A
IM
CC 3
Reg
AND R6, R1, R7
SW 12(R1), R4
CC 2
Reg
SUB R4, R1, R5
LW R4, 0(R1)
CC 1
CC 6
ALU
PC
ADD -> ADD
ADD -> LW
ADD -> SW (2 operands)
LW -> ADD
LW -> LW
LW -> SW (2 operands)
ALU
ADD
EX/MEM
M
u
x
ADD R1, R2, R3
ALU
SUB R4, R1, R5
Forwarding Options
Dean Tullsen
CSE 240A
Reg
Dean Tullsen
Example
Avoiding Pipeline Stalls
lw R1, 1000(R2)
lw R3, 2000(R2)
add R4, R1, R3
lw R1, 3000(R2)
add R6, R4, R1
sw R6, 1000(R2)
ADD R1, R2, R3
SW R1, 1000(R2)
LW R7, 2000(R2)
ADD R5, R7, R1
LW R8, 2004(R2)
SW R7, 2008(R8)
ADD R8, R8, R2
this is a compiler technique called instruction scheduling.
LW R9, 1012(R8)
SW R9, 1016(R8)
CSE 240A
How big a problem are these pipeline
stalls?
Detecting ALU Input Hazards
13% of the loads in FP programs
25% of the loads in integer programs
ID/EX
opcode rd rs1 rs2
Dean Tullsen
CSE 240A
Dean Tullsen
CSE 240A
EX/MEM
MEM/WB
opcode rd
Dean Tullsen
opcode rd
CSE 240A
ALU
Dean Tullsen
Inserting Bubbles
Adding Datapaths
Set all control values in the EX/MEM register to safe
values (equivalent to a nop)
Keep same values in the ID/EX register and IF/ID register
Keep PC from incrementing
ID/EX
EX/MEM
MEM/WB
Zero?
M
u
x
ALU
Data
memory
M
u
x
CSE 240A
Dean Tullsen
CSE 240A
Dean Tullsen
Control Hazards
Branch Hazards
Instructions are not only dependent on instructions that
produce their operands, but also on all previous control
flow (branch, jump) instructions that lead to that
instruction.
add
add
bne
sub
and
beq
Branch
IF
I2
I3
I4
correct instruction
ID
EX
MEM
WB
IF
ID
EX
MEM
WB
IF
ID
EX
MEM
WB
IF
ID
EX
MEM
WB
IF
ID
EX
MEM
WB
branch taken
and
sub
CSE 240A
Dean Tullsen
CSE 240A
Dean Tullsen
Branch Stall Impact
New Datapath
ID/EX
If CPI = 1, 30% branch, Stall 3 cycles => new CPI = 1.9!
Two part solution:
ADD
IF/ID
Determine branch taken or not sooner, AND
Compute taken branch address earlier
ADD
IR
IR
Instruction
memory
Dean Tullsen
correct instruction
I4
I5
CSE 240A
MEM/WB.IR
Registers
ALU
Sign
extend
Data
M
u
x
memory
32
CSE 240A
Branch Hazards
I2
6..10
11..15
16
CSE 240A
IF
IR
M
u
x
Move Zero test to ID/RF stage
Adder to calculate new PC in ID/RF stage
1 clock cycle penalty for branch versus 3
Branch
M
u
x
PC
(limited MIPS) branch tests if register = 0 or 0
MIPS Solution:
MEM/WB
EX/MEM
Zero?
Dean Tullsen
What We Know About Branches
ID
EX
MEM
WB
IF
ID
EX
MEM
WB
IF
ID
EX
MEM
WB
IF
ID
EX
MEM
WB
IF
ID
EX
MEM
more conditional branches than unconditional
more forward than backward
67% of branches taken
backward branches taken 80%
WB
Dean Tullsen
CSE 240A
Dean Tullsen
Four Branch Hazard Alternatives
Four Branch Hazard Alternatives
#1: Stall until branch direction is clear
#2: Predict Branch Not Taken
#4: Delayed Branch
Define branch to take place AFTER a following instruction
Execute successor instructions in sequence
Squash instructions in pipeline if branch actually taken
Advantage of late pipeline state update
33% MIPS branches not taken on average
PC+4 already calculated, so use it to get next instruction
branch instruction
sequential successor1
sequential successor2
........
sequential successorn
branch target if taken
#3: Predict Branch Taken
67% MIPS branches taken on average
But havent calculated branch target address in this MIPS architecture
1 slot delay allows proper decision and branch target address in 5 stage
pipeline
MIPS uses this
MIPS still incurs 1 cycle branch penalty
Other machines: branch target known before outcome
CSE 240A
Dean Tullsen
CSE 240A
Delayed Branch
Dean Tullsen
Key Points
Where to get instructions to fill branch delay slot?
Branch delay of length n
Before branch instruction
From the target address: only valuable when branch taken
From fall through: only valuable when branch not taken
Cancelling branches allow more slots to be filled
Compiler effectiveness for single branch delay slot:
Hard to keep the pipeline completely full
Data Hazards require dependent instructions to wait for the
producer instruction
Most of the problem handled with forwarding (bypassing)
Sometimes stall still required (especially in modern processors)
Control hazards require control-dependent (post-branch)
instructions to wait for the branch to be resolved
Fills about 60% of branch delay slots
About 80% of instructions executed in branch delay slots useful in
computation
About 50% (60% x 80%) of slots usefully filled
CSE 240A
Dean Tullsen
CSE 240A
Dean Tullsen