CS-506 Advanced Computer Systems
Architecture
Lecture#10-11 ILP
Gul Munir Ujjan
Assistant Professor
CISE Department, NEDUET
Karachi
Data Hazards
One way to solve this problem is a simple hardware technique called forwarding (also called bypassing).
Forwarding can be generalized to include passing a result directly to the functional unit that requires it.
In the previous example, the result is not actually needed by the DSUB instruction until after the DADD instruction produces it.
If the result can be moved from the pipeline register where DADD stores it to where DSUB needs it, the need for a stall can be avoided.
Data Hazards
Where to find the ALU result?
The ALU result generated in the EX stage is passed
through the pipeline registers to the MEM and WB
stages, before it is finally written to the register file.
Since the pipeline registers already contain the ALU
result, we could just forward that value to subsequent
instructions, to prevent data hazards.
Forwarding Unit
Forwarding unit selects the correct ALU inputs for the
EX stage.
If there is no hazard, the ALU’s operands will come from the
register file, just like before.
If there is a hazard, the operands will come from either the
EX/MEM or MEM/WB pipeline registers instead.
The ALU sources will be selected by two new multiplexers,
with control signals named ForwardA and ForwardB.
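The selection logic just described can be sketched as a small Python model (a hypothetical sketch, not the actual datapath; the two-bit mux encodings follow the common textbook convention and may differ from a specific implementation):

```python
def forward_select(rs, ex_mem_rd, ex_mem_regwrite, mem_wb_rd, mem_wb_regwrite):
    """Mux control for one ALU source register rs (ForwardA or ForwardB):
    '10' -> forward from EX/MEM, '01' -> forward from MEM/WB,
    '00' -> read the register file as usual."""
    # EX/MEM has priority: it holds the most recently produced value.
    # Register 0 is hardwired to zero and is never forwarded.
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == rs:
        return "10"
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == rs:
        return "01"
    return "00"
```

Giving EX/MEM priority over MEM/WB matters when both pipeline registers hold a value for the same destination: the younger (EX/MEM) result is the one the consuming instruction must see.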
Detecting Hazards
Data hazard
A second way to resolve data hazards is to insert no-op instructions.
Data hazard
Solution (delayed-load approach): insert no-operation instructions to avoid the data conflict.
ADD R1, R2, R3
No-Op
No-Op
SUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9
XOR R10, R1, R11
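The no-op insertion above can be sketched as a tiny scheduling pass (a hypothetical Python model; instructions are (dest, sources) register tuples, and DELAY = 2 matches the two No-Ops in the example):

```python
DELAY = 2            # assumed slots a consumer must wait after its producer
NOP = (None, ())

def insert_nops(prog):
    """Insert NOPs so no instruction reads a register written
    within the previous DELAY slots."""
    out = []
    for dest, srcs in prog:
        # pad with NOPs while a recent instruction's destination
        # is still one of our sources
        while any(out[-k][0] in srcs
                  for k in range(1, min(DELAY, len(out)) + 1)):
            out.append(NOP)
        out.append((dest, srcs))
    return out
```

On the program above (ADD writes R1; SUB, AND, OR, XOR all read R1), only the SUB needs padding: once two No-Ops separate it from the ADD, the later readers of R1 are already far enough away.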
Data hazard
FO: fetch data value; WO: store the executed value
Data Hazard Classification
Three types of data hazards:
RAW : Read After Write
WAW : Write After Write
WAR : Write After Read
RAR : Read After Read ► Is this a hazard? (No: neither instruction writes the location, so the order of the two reads does not matter.)
Before discussing data hazard classification, let us take a brief look at dependences.
Dependences
Determining how one instruction depends on another is critical
to determining how much parallelism exists in a program and
how that parallelism can be exploited.
If two instructions are dependent, they are not parallel and
must be executed in order, although they may often be partially
overlapped.
There are three different types of dependences:
data dependences (also called true data dependences)
name dependences
control dependences
Data Dependence
An instruction y is data dependent on instruction x if
either of the following holds:
1. Instruction x produces a result that may be used by
instruction y.
2. Instruction y is data dependent on instruction z, and
instruction z is data dependent on instruction x.
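This two-part definition can be checked mechanically. The sketch below (hypothetical Python; instructions as (dest, sources) tuples indexed by program order) applies rule 1 directly and rule 2 through the chain; for simplicity it ignores intervening redefinitions of the register, which a real analysis would track:

```python
def depends(prog, x, y):
    """True if instruction y is data dependent on instruction x (x < y)."""
    dest_x = prog[x][0]
    # Rule 1: x's result is read directly by y
    if dest_x in prog[y][1]:
        return True
    # Rule 2: y depends on some intermediate z, and z depends on x
    return any(dest_x in prog[z][1] and depends(prog, z, y)
               for z in range(x + 1, y))
```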
Name Dependence
A name dependence occurs when two instructions use the same
register or memory location, called a name, but there is no flow
of data between the instructions associated with that name.
There are two types of name dependences between an
instruction x that precedes instruction y in program order:
1. An antidependence between instruction x and instruction y occurs
when instruction y writes a register or memory location that
instruction x reads. The original ordering must be preserved to ensure
that x reads the correct value.
2. An output dependence occurs when instruction x and instruction y
write the same register or memory location. The ordering between the
instructions must be preserved to ensure that the value finally written
corresponds to instruction y.
Name Dependence
1. Antidependence
S.D F4, 0(R1)
DADDIU R1, R1, #-8
2. Output Dependence
DSUBU R4, R5, R6
DADDU R4, R7, R8
Register Renaming: because a name dependence is not a true dependence, instructions involved in a name dependence can execute simultaneously or be reordered, if the register number used in the instructions is changed so that the instructions do not conflict.
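Register renaming itself can be sketched as follows (hypothetical Python; physical register numbers from 32 upward are an assumption standing in for a larger physical register file behind the 32 architectural names):

```python
def rename(prog):
    """Give every write a fresh physical register; rewrite reads to the
    most recent physical name. Removes WAR and WAW name dependences
    while preserving true (RAW) dependences."""
    mapping = {}             # architectural reg -> current physical reg
    next_phys = 32           # assumed start of the extra physical registers
    out = []
    for dest, srcs in prog:
        srcs = tuple(mapping.get(s, s) for s in srcs)  # read current names
        mapping[dest] = next_phys                      # fresh name per write
        out.append((next_phys, srcs))
        next_phys += 1
    return out
```

After renaming, the output-dependent pair DSUBU R4/DADDU R4 writes two different physical registers, so the two instructions no longer conflict.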
Control Dependence
A control dependence is a situation in which a program instruction executes only if a previous instruction evaluates in a way that allows its execution.
In other words, the execution of one program statement is directly influenced by the outcome of a previous statement, specifically a conditional statement or a loop condition.
Read After Write (RAW)
True data dependence
Example:
ADD R1, R2, R3
SUB R4, R1, R5
A read after write (RAW) data hazard refers to a situation where an instruction refers to a result that has not yet been calculated or retrieved.
This can occur because an instruction is executed after a previous instruction, and the previous instruction has not been completely processed through the pipeline.
Solution: program order must be preserved by inserting stalls (or by forwarding) to ensure that the second instruction receives the value from the first.
Write After Read (WAR)
Antidependence
Example:
x: R4 ← R1 + R5
y: R5 ← R1 + R2
Instruction y tries to write a destination before it is read by x, so x incorrectly gets the new value.
This hazard arises from an antidependence (a name dependence): the dependence is on the name R5, not on actual dataflow.
WAR hazards cannot occur in most static-issue pipelines, where reads happen early and writes happen late.
Solution: using a different destination register for instruction y (register renaming) removes the hazard.
Write After Write (WAW)
Output dependence
Example:
x: R2 ← R4 + R7
y: R2 ← R1 + R3
Instruction y tries to write an operand before it is written by x. The writes end up being performed in the wrong order, leaving the value written by x, rather than the value written by y, in the destination.
This hazard corresponds to an output dependence. WAW hazards are present only in pipelines that write in more than one pipe stage or allow an instruction to proceed even when a previous instruction is stalled.
Solution: delay the WB (write-back) of the second instruction until the first instruction has completed its write.
Handling Data Hazards
There are three techniques commonly used to handle data hazards in pipelining:
Forwarding (adding special circuitry to the pipeline)
Insertion of stalls
Re-ordering of the code
Control/ Branch Hazards
Branch hazards can cause a greater performance loss for pipelines than data hazards do.
When a branch instruction is executed, it may or may not change the PC.
If a branch changes the PC to its target address, it is a taken branch; otherwise, it is untaken.
Control/ Branch Hazards
There are FOUR schemes to handle branch hazards:
Freeze scheme
Predict-untaken scheme
Predict-taken scheme
Delayed branch
Note: we will compare them for both cases, branch taken and branch untaken.
Control/ Branch Hazards
The simplest scheme to handle branches is to freeze or flush the pipeline, holding or deleting any instructions after the branch until the branch destination is known.
Predict not taken: in terms of hardware, it is easier to assume that the branch will not be taken. Thus, we just need to increment the PC and continue execution, as for normal instructions.
If we are correct, then there is no instruction flushing and the pipeline keeps going without any performance penalty.
Control/ Branch Hazards
Predicted Taken: an alternative scheme is to treat every branch as taken. As soon as the branch is decoded and the target address is computed, we assume the branch to be taken and begin fetching and executing at the target.
Delayed Branch: the compiler rearranges the code so that a branch instruction's effect is delayed by however many cycles are required to discharge the dependences.
Instructions that will execute regardless of whether the branch is taken or not are placed after the branch instruction.
Such instructions are said to be in the branch delay slot.
Control/ Branch Hazards
5-Stage Pipelining
Control/ Branch Hazards
Branch Untaken (Freeze approach)
The simplest method of dealing with branches is to redo the fetch following
a branch
Control/ Branch Hazards
Branch Taken (Freeze approach)
The simplest method of dealing with branches is to redo the fetch following
a branch
Control/ Branch Hazards
Branch Taken (Freeze approach)
The simplest scheme to handle branches is to freeze the pipeline, holding or deleting any instructions after the branch until the branch destination is known.
The attractiveness of this solution lies primarily in its simplicity, both for hardware and software.
Control/ Branch Hazards
Branch Untaken (Predicted-untaken)
A higher-performance, and only slightly more complex, scheme is to treat every branch as not taken.
It is implemented by continuing to fetch instructions as if the branch were a normal instruction.
The pipeline looks the same if the branch is not taken.
If the branch is taken, we need to redo the fetch, restarting at the branch target.
Control/ Branch Hazards
Branch Untaken (Predicted-untaken)
Control/ Branch Hazards
Branch Taken (Predicted-untaken)
Control/ Branch Hazards
Branch Untaken (Predicted-Taken)
Control/ Branch Hazards
Branch Taken (Predicted-Taken)
An alternative scheme is to treat every branch as taken.
As soon as the branch is decoded and the target address is computed, we assume the branch to be taken and begin fetching and executing at the target.
Control/ Branch Hazards
Branch Taken (Predicted-Taken)
Control/ Branch Hazards
Delayed Branch
A fourth scheme, in use in some processors, is called delayed branch.
It is done at compile time: the compiler modifies the code.
The general format is:
branch instruction
delay slot
branch target if taken
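One simple compile-time policy for filling the delay slot can be sketched as follows (hypothetical Python; it only considers moving the instruction immediately before the branch, the common "from before" strategy, and falls back to a no-op when that instruction produces a value the branch needs):

```python
def fill_delay_slot(before, branch):
    """before: list of ('op', dest, srcs) instructions preceding the branch.
    branch:  ('beq', None, srcs).
    Returns (remaining instructions, branch, delay-slot instruction)."""
    if before and before[-1][1] not in branch[2]:
        # safe to move: the branch does not read this instruction's result,
        # and it executes whether or not the branch is taken
        return before[:-1], branch, before[-1]
    return before, branch, ('nop', None, ())   # nothing safe: insert a no-op
```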
Branch Prediction
The idea is to guess the outcome of the branch's condition test (i.e., whether or not the branch will be taken).
In computer architecture, a branch predictor is a digital circuit that tries to guess which way a branch (e.g., an if–then–else structure) will go before this is known definitively.
Branch Prediction
The purpose of the branch predictor is to improve the flow in
the instruction pipeline. Branch predictors play a critical role in
achieving high effective performance in many modern pipelined
microprocessor architectures.
It is not known for certain whether a conditional jump will be
taken or not taken until the condition has been calculated and
the conditional jump has passed the execution stage in the
instruction pipeline.
Branch Prediction
Without branch prediction, the processor would have
to wait until the conditional jump instruction has
passed the execute stage before the next instruction
can enter the fetch stage in the pipeline.
The branch predictor attempts to avoid this waste of
time by trying to guess whether the conditional jump
is most likely to be taken or not taken.
Branch Prediction
1. Static Branch Prediction
When branch prediction is not based on branch behavior, i.e., the predictor always predicts the branch in the same direction, either
Taken (MIPS, Stanford) or
Not-taken (Motorola MC88000),
such prediction is called static branch prediction.
It is static because the prediction is already known before the program is executed.
Branch Prediction
1. Static Branch Prediction
Some of the static prediction schemes include:
Predict all branches to be taken. This makes use of the observation that a majority of branches are taken. This primitive mechanism yields 60% to 70% accuracy.
Predict backward branches (branches that decrease the PC) to be taken and forward branches (branches that increase the PC) not to be taken.
Profiling can also be used to predict the outcome of a branch.
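The backward-taken/forward-not-taken heuristic reduces to a single address comparison (a sketch; in real ISAs the target is usually encoded as a PC-relative offset, so checking the sign of the offset gives the same answer):

```python
def static_predict(pc, target):
    """Backward branch (target below the branch's own PC, typically a
    loop back-edge) -> predict taken; forward branch -> predict not taken."""
    return target < pc      # True means predict taken
```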
Branch Prediction
2. Dynamic Branch Prediction
The goal of dynamic branch prediction is to make use of the run-time behavior of a branch to more accurately predict which direction (taken or not-taken) the branch will follow.
There are several dynamic branch predictors in use or under research today.
The simplest dynamic branch-prediction scheme is a branch-prediction buffer, or Branch History Table (BHT).
A branch-prediction buffer is a small memory indexed by the lower portion of the address (PC) of the branch instruction.
Each entry contains a single bit that says whether the branch was recently taken or not.
Branch Prediction
2. Dynamic Branch Prediction
Branch History Table or Branch Prediction Buffer
With such a buffer, we do not know whether the prediction is correct; the bit may have been put there by another branch that has the same low-order address bits.
The prediction is just a hint: if the hint turns out to be wrong, the prediction bit is inverted and stored back.
[1-bit predictor state diagram: state 0 (Predict Not-Taken) and state 1 (Predict Taken); a Taken outcome moves to state 1, a Not-Taken outcome moves to state 0.]
Branch Prediction
2. Dynamic Branch Prediction
This simple 1-bit prediction scheme has a performance shortcoming:
Even if a branch is almost always taken, we will likely predict incorrectly twice, rather than once: once upon loop exit and then again on loop re-entry.
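The double misprediction can be demonstrated with a small 1-bit predictor model (hypothetical Python; state 1 means predict taken, and the bit is simply flipped to the last observed outcome):

```python
def run_1bit(outcomes, state=1):
    """Count mispredictions of a 1-bit predictor over a list of
    branch outcomes (True = taken). Initial state predicts taken."""
    miss = 0
    for taken in outcomes:
        if (state == 1) != taken:
            miss += 1
        state = 1 if taken else 0   # bit always becomes the last outcome
    return miss

# a loop branch: taken 9 times, then not taken once at loop exit
loop = [True] * 9 + [False]
```

Running the loop repeatedly shows 2 misses per execution in steady state (one at the exit, one at the next entry), even though the branch is taken 90% of the time.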
Branch Prediction
2. Dynamic Branch Prediction
To remedy this weakness, 2-bit prediction
schemes are often used. In a 2-bit scheme,
a prediction must miss twice before it is
changed.
Branch Prediction
2. Dynamic Branch Prediction
The 2-bit scheme is actually a specialization of a more general scheme that has an n-bit saturating counter for each entry in the prediction buffer.
Saturating counter: each time the branch is taken, the counter is incremented; if the maximum value has already been reached, nothing is done. Each time the branch is not taken, the counter is decremented; if the minimum value 0 has been reached, it stays the same.
The first half of the states, i.e., from 0 to 2^(n-1) - 1, is treated as predict not-taken, whereas the upper half, from 2^(n-1) to 2^n - 1, is treated as predict taken.
Studies of n-bit predictors have shown that 2-bit predictors do almost as well.
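A minimal sketch of the n-bit saturating counter described above (hypothetical Python; predict() returns True, i.e. taken, for the upper half of the states):

```python
class SatCounter:
    """n-bit saturating counter in [0, 2**n - 1].
    States >= 2**(n-1) predict taken; the rest predict not-taken."""

    def __init__(self, n=2, value=0):
        self.max = 2 ** n - 1
        self.threshold = 2 ** (n - 1)
        self.value = value

    def predict(self):
        return self.value >= self.threshold   # True -> predict taken

    def update(self, taken):
        if taken:
            self.value = min(self.value + 1, self.max)   # saturate at max
        else:
            self.value = max(self.value - 1, 0)          # saturate at 0
```

With n = 2, a prediction must miss twice before it flips, which is exactly the 2-bit scheme: one not-taken outcome moves a strongly-taken counter (3) only to weakly-taken (2).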
Correlating Branches
Recent branches are correlated; that is, behaviour of recently-executed
branches affects prediction of current branch
if (d == 0)
d = 1;
if (d == 1)
…
OR
if (aa==2) -- branch b1
aa = 0;
if (bb==2) --- branch b2
bb = 0;
if(aa!=bb) { … --- branch b3
branch b3 – Clearly depends on the results of b1 and b2
If the first two branches are not taken then the third will be taken
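The correlation in the second example can be checked directly (hypothetical Python; "taken" here means the compiled conditional branch skips the if-body, as in the usual compiled form of these statements):

```python
def branch_outcomes(aa, bb):
    """Return (b1, b2, b3) taken/not-taken for the aa/bb example.
    b1 taken  <=> aa != 2 (skip 'aa = 0')
    b2 taken  <=> bb != 2 (skip 'bb = 0')
    b3 taken  <=> aa == bb (skip the 'aa != bb' body)"""
    b1 = aa != 2
    if not b1:
        aa = 0
    b2 = bb != 2
    if not b2:
        bb = 0
    b3 = aa == bb
    return b1, b2, b3
```

Whenever b1 and b2 are both not taken, both variables are cleared to 0, so b3 is always taken; a predictor that remembers the outcomes of b1 and b2 can therefore predict b3 perfectly in that case, which is exactly what a correlating (two-level) predictor exploits.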