Lec 04 Pipeline D Processor

The document discusses pipelined processor design. It begins with an example of pipelining laundry tasks into three stages of wash, dry, and fold to speed up completing multiple loads. Key aspects of pipelining a processor are covered, including introducing pipeline registers between stages to overlap instruction execution. Pipeline hazards that can occur from data or control dependencies between instructions are explained. Solutions like forwarding, stalling, and branch prediction help mitigate hazards to improve pipelined processor performance. Diagrams are used to illustrate the pipelined execution of instructions over multiple clock cycles.

Uploaded by

MaheshKota

Pipelined Processor Design

M S Bhat
Dept. of E&C,

NITK Suratkal

Presentation Outline
Pipelining versus Serial Execution
Pipelined Datapath and Control

Pipeline Hazards
Data Hazards and Forwarding
Load Delay, Hazard Detection, and Stall
Control Hazards
Delayed Branch and Dynamic Branch Prediction

8/16/2015

VL722

Pipelining Example
Laundry Example: Three Stages
1. Wash dirty load of clothes
2. Dry wet clothes
3. Fold and put clothes into drawers
Each stage takes 30 minutes to complete
Four loads of clothes to wash, dry, and fold

Sequential Laundry
[Figure: timeline from 6 PM to 12 AM in 30-minute steps; loads A, B, C, and D are washed, dried, and folded one after another]

Sequential laundry takes 6 hours for 4 loads

Intuitively, we can use pipelining to speed up laundry

Pipelined Laundry: Start Load ASAP

[Figure: timeline from 6 PM to 9 PM in 30-minute steps; the stages of successive loads overlap]

Pipelined laundry takes 3 hours for 4 loads
Speedup factor is 2 for 4 loads
Time to wash, dry, and fold one load is still the same (90 minutes)

Serial Execution versus Pipelining

Consider a task that can be divided into k subtasks
The k subtasks are executed on k different stages
Each subtask requires one time unit
The total execution time of the task is k time units

Pipelining is to overlap the execution

The k stages work in parallel on k different tasks
Tasks enter/leave pipeline at the rate of one task per time unit
[Figure: tasks passing through stages 1 to k, without and with pipelining]
Without Pipelining: one completion every k time units
With Pipelining: one completion every 1 time unit

Synchronous Pipeline
Uses clocked registers between stages
Upon arrival of a clock edge
All registers hold the results of previous stages simultaneously

The pipeline stages are combinational logic circuits

It is desirable to have balanced stages
Approximately equal delay in all stages

Clock period is determined by the maximum stage delay
[Figure: pipeline stages separated by clocked registers between Input and Output, all driven by a common Clock]

Pipeline Performance
Let $t_i$ = time delay in stage $S_i$
Clock cycle $t = \max(t_i)$ is the maximum stage delay

Clock frequency $f = 1/t = 1/\max(t_i)$

A pipeline can process n tasks in $k + n - 1$ cycles
k cycles are needed to complete the first task
$n - 1$ cycles are needed to complete the remaining $n - 1$ tasks

Ideal speedup of a k-stage pipeline over serial execution

$$S_k = \frac{\text{Serial execution in cycles}}{\text{Pipelined execution in cycles}} = \frac{nk}{k + n - 1}$$

$S_k \to k$ for large n
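The cycle counts and the ideal speedup above can be checked with a few lines of Python. This is a minimal sketch; the function names are illustrative and not part of the lecture. The laundry example (k = 3 stages, n = 4 loads, 30 minutes per stage) is used as the test case.

```python
def serial_cycles(k: int, n: int) -> int:
    """Cycles for n tasks when each task runs all k stages before the next starts."""
    return n * k


def pipelined_cycles(k: int, n: int) -> int:
    """Cycles for n tasks on a k-stage pipeline: k for the first task, 1 for each remaining task."""
    return k + n - 1


def ideal_speedup(k: int, n: int) -> float:
    """Ideal speedup S_k = nk / (k + n - 1)."""
    return serial_cycles(k, n) / pipelined_cycles(k, n)


STAGE_MINUTES = 30  # one laundry stage (wash, dry, or fold)

print(serial_cycles(3, 4) * STAGE_MINUTES)     # 360 minutes = 6 hours
print(pipelined_cycles(3, 4) * STAGE_MINUTES)  # 180 minutes = 3 hours
print(ideal_speedup(3, 4))                     # 2.0
```

For a fixed k the speedup approaches k as n grows, which is why a short run of four loads only reaches a factor of 2 on a 3-stage pipeline.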

MIPS Processor Pipeline

Five stages, one cycle per stage
1. IF: Instruction Fetch from instruction memory
2. ID: Instruction Decode, register read, and J/Br address
3. EX: Execute operation or calculate load/store address

4. MEM: Memory access for load and store

5. WB: Write Back result to register


Performance Example
Assume the following operation times for components:
Instruction and data memories: 200 ps

ALU and adders: 180 ps

Decode and Register file access (read or write): 150 ps
Ignore the delays in PC, mux, extender, and wires

Which of the following would be faster and by how much?

Single-cycle implementation for all instructions
Multicycle implementation optimized for every class of instructions

Assume the following instruction mix:

40% ALU, 20% Loads, 10% stores, 20% branches, & 10% jumps

Single-Cycle vs Multicycle Implementation

Break instruction execution into five steps
Instruction fetch

Instruction decode and register read

Execution, memory address calculation, or branch completion
Memory access or ALU instruction completion
Load instruction completion

One step = One clock cycle (clock cycle is reduced)

First 2 steps are the same for all instructions

Instruction    # cycles
ALU & Store    4
Branch         3
Load           5
Jump           2

Solution

Instruction   Instruction   Register   ALU         Data     Register
Class         Memory        Read       Operation   Memory   Write      Total
ALU           200           150        180                  150        680 ps
Load          200           150        180         200      150        880 ps
Store         200           150        180         200                 730 ps
Branch        200           150        180                             530 ps
Jump          200           150 (decode and update PC)                 350 ps

For fixed single-cycle implementation:

Clock cycle = 880 ps determined by longest delay (load instruction)

For multi-cycle implementation:

Clock cycle = max (200, 150, 180) = 200 ps (maximum delay at any step)
Average CPI = 0.4 × 4 + 0.2 × 5 + 0.1 × 4 + 0.2 × 3 + 0.1 × 2 = 3.8

Speedup = 880 ps / (3.8 × 200 ps) = 880 / 760 = 1.16
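The average CPI and the speedup can be reproduced as a weighted sum. The sketch below assumes the per-class cycle counts 4 (ALU), 5 (load), 4 (store), 3 (branch), and 2 (jump), which are the values consistent with the CPI of 3.8 quoted above; the dictionary names are illustrative.

```python
# Instruction mix from the example (fractions of executed instructions).
MIX = {"alu": 0.40, "load": 0.20, "store": 0.10, "branch": 0.20, "jump": 0.10}

# Assumed multicycle step counts per instruction class.
CYCLES = {"alu": 4, "load": 5, "store": 4, "branch": 3, "jump": 2}

SINGLE_CYCLE_PS = 880  # clock set by the longest instruction (load)
MULTI_CYCLE_PS = 200   # clock set by the slowest step


def average_cpi(mix, cycles):
    """Weighted average of cycles per instruction over the instruction mix."""
    return sum(mix[name] * cycles[name] for name in mix)


cpi = average_cpi(MIX, CYCLES)
speedup = SINGLE_CYCLE_PS / (cpi * MULTI_CYCLE_PS)
print(round(cpi, 2), round(speedup, 2))  # 3.8 1.16
```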


Single-Cycle vs Pipelined Performance

Consider a 5-stage instruction execution in which
Instruction fetch = Data memory access = 200 ps
ALU operation = 180 ps
Register read = register write = 150 ps

What is the clock cycle of the single-cycle processor?

What is the clock cycle of the pipelined processor?
What is the speedup factor of pipelined execution?
Solution

Single-Cycle Clock = 200+150+180+200+150 = 880 ps

[Figure: two instructions executed back to back on the single-cycle processor; each passes through IF, Reg, ALU, MEM, Reg in one 880 ps cycle]

Single-Cycle versus Pipelined contd

Pipelined clock cycle = max(200, 180, 150) = 200 ps
[Figure: three instructions overlapped in the pipeline; each of the stages IF, Reg, ALU, MEM, Reg takes one 200 ps clock cycle]

CPI for pipelined execution = 1
One instruction is completed in each cycle (ignoring pipeline fill)

Speedup of pipelined execution = 880 ps / 200 ps = 4.4
Instruction count and CPI are equal in both cases

Speedup factor is less than 5 (number of pipeline stages)
Because the pipeline stages are not balanced

Throughput = 1/Max(delay) = 1/200 ps = 5 × 10^9 instructions/sec
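The same numbers follow directly from the stage delays. This is a small sketch; the stage names used as dictionary keys are labels chosen here for readability.

```python
# Stage delays in picoseconds, taken from the example above.
STAGE_DELAYS_PS = {"IF": 200, "ID": 150, "EX": 180, "MEM": 200, "WB": 150}

# Single-cycle clock: one instruction passes through every stage in one cycle.
single_cycle_ps = sum(STAGE_DELAYS_PS.values())

# Pipelined clock: limited by the slowest stage.
pipelined_ps = max(STAGE_DELAYS_PS.values())

# With CPI = 1 in both designs, speedup is the ratio of clock periods.
pipeline_speedup = single_cycle_ps / pipelined_ps

# Throughput in instructions per second (1 ps = 1e-12 s).
throughput = 1 / (pipelined_ps * 1e-12)

print(single_cycle_ps, pipelined_ps, pipeline_speedup)
print(f"{throughput:.1e} instructions/sec")
```

If all five stages took 176 ps (880 / 5), the speedup would reach the full factor of 5; the unbalanced 200 ps stages are what hold it at 4.4.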

Pipeline Performance Summary



Single-Cycle Datapath
Shown below is the single-cycle datapath
How to pipeline this single-cycle datapath?
Answer: Introduce pipeline register at end of each stage
[Figure: single-cycle datapath divided into five stages: IF = Instruction Fetch, ID = Decode & Register Read, EX = Execute, MEM = Memory Access, WB = Write Back. Control signals shown: PCSrc, RegDst, RegWrite, ExtOp, ALUSrc, ALUCtrl, MemRead, MemWrite, MemtoReg]

Pipelined Datapath
Pipeline registers are shown in green, including the PC
Same clock edge updates all pipeline registers, register file, and data memory (for store instruction)

[Figure: pipelined datapath with pipeline registers between the stages IF = Instruction Fetch, ID = Decode & Register Read, EX = Execute, MEM = Memory Access, and WB = Write Back; register names include NPC, NPC2, Imm, ALUout, and WB Data]

Problem with Register Destination

Is there a problem with the register destination address?
Instruction in the ID stage different from the one in the WB stage

[Figure: pipelined datapath with the stages IF, ID, EX, MEM, and WB, highlighting the register file write address]

Instruction in the WB stage is not writing to its destination register but to the destination of a different instruction in the ID stage

Pipelining the Destination Register

Destination Register number should be pipelined
Destination register number is passed from ID to WB stage

The WB stage writes back data knowing the destination register

[Figure: pipelined datapath with the destination register number carried through the pipeline registers as Rd2, Rd3, and Rd4]

Graphically Representing Pipelines

Multiple instruction execution over multiple clock cycles
Instructions are listed in execution order from top to bottom
Clock cycles move from left to right

Figure shows the use of resources at each stage and each cycle

[Figure: resource-usage diagram with time in cycles (CC1 to CC8) on the horizontal axis and program execution order on the vertical axis, for the instructions:]
lw  R14, 8(R21)
add R17, R18, R19
ori R20, R11, 7
sub R13, R18, R11
sw  R18, 10(R19)

Instruction-Time Diagram
Instruction-Time Diagram shows:
Which instruction occupies which stage at each clock cycle

Instruction flow is pipelined over the 5 stages
Up to five instructions can be in the pipeline during the same cycle: Instruction Level Parallelism (ILP)
ALU instructions skip the MEM stage.
Store instructions skip the WB stage.

[Figure: instruction-time diagram, CC1 to CC9, with instruction order on the vertical axis, for the instructions:]
lw  R15, 8(R19)
lw  R14, 8(R21)
ori R12, R19, 7
sub R13, R18, R11
sw  R18, 10(R19)

Control Signals
[Figure: pipelined datapath annotated with the control signals PCSrc, Beq, Bne, RegDst, RegWrite, ExtOp, ALUSrc, ALUCtrl, MemRead, MemWrite, and MemtoReg]

Same control signals used in the single-cycle datapath

Pipelined Control

[Figure: pipelined datapath with the Main & ALU Control unit in the ID stage; the signals ExtOp, ALUSrc, ALUCtrl, J, Beq, Bne, MemRead, MemWrite, MemtoReg, RegWrite, and RegDst travel through the pipeline registers]
Pass control signals along pipeline just like the data

Pipelined Control Cont'd

ID stage generates all the control signals
Pipeline the control signals as the instruction moves
Extend the pipeline registers to include the control signals

Each stage uses some of the control signals

Instruction Decode and Register Read
Control signals are generated
RegDst is used in this stage

Execution Stage => ExtOp, ALUSrc, and ALUCtrl
Next PC uses J, Beq, Bne, and zero signals for branch control

Memory Stage => MemRead, MemWrite, and MemtoReg

Write Back Stage => RegWrite is used in this stage

Control Signals Summary

          Decode Stage   Execute Stage
Op        RegDst         ALUSrc    ExtOp     ALUCtrl
R-Type    1=Rd           0=Reg               func
addi      0=Rt           1=Imm     1=sign    ADD
slti      0=Rt           1=Imm     1=sign    SLT
andi      0=Rt           1=Imm     0=zero    AND
ori       0=Rt           1=Imm     0=zero    OR
lw        0=Rt           1=Imm     1=sign    ADD
sw                       1=Imm     1=sign    ADD
beq                      0=Reg               SUB
bne                      0=Reg               SUB

[Table columns Beq and Bne (Execute Stage), MemRd, MemWr, and MemReg (Memory Stage), and RegWrite (Write Back): values not shown]


Pipeline Hazards
Hazards: situations that would cause incorrect execution
If next instruction were launched during its designated clock cycle

1. Structural hazards
Caused by resource contention
Using same resource by two instructions during the same cycle

2. Data hazards
An instruction may compute a result needed by next instruction
Hardware can detect dependencies between instructions

3. Control hazards
Caused by instructions that change control flow (branches/jumps)
Delays in changing the flow of control

Hazards complicate pipeline control and limit performance


Structural Hazards
Problem
Attempt to use the same hardware resource by two different instructions during the same cycle

Example
Writing back ALU result in stage 4
Conflict with writing load data in stage 5

Structural Hazard: two instructions are attempting to write the register file during the same cycle

[Figure: instruction-time diagram, CC1 to CC9, for the instructions:]
lw  R14, 8(R21)
ori R12, R19, 7
sub R13, R18, R19
sw  R18, 10(R19)

Resolving Structural Hazards

Serious Hazard:
Hazard cannot be ignored

Solution 1: Delay Access to Resource

Must have mechanism to delay instruction access to resource
Delay all write backs to the register file to stage 5
ALU instructions bypass stage 4 (memory) without doing anything

Solution 2: Add more hardware resources (more costly)

Add more hardware to eliminate the structural hazard
Redesign the register file to have two write ports
First write port can be used to write back ALU results in stage 4

Second write port can be used to write back load data in stage 5


Data Hazards
Dependency between instructions causes a data hazard
The dependent instructions are close to each other
Pipelined execution might change the order of operand access

Read After Write (RAW) Hazard
Given two instructions I and J, where I comes before J
Instruction J should read an operand after it is written by I
Called a data dependence in compiler terminology
I: add R17, R18, R19   # R17 is written
J: sub R20, R17, R19   # R17 is read
Hazard occurs when J reads the operand before I writes it

Example of a RAW Data Hazard

[Figure: resource-usage diagram, CC1 to CC8, tracking the value of R18 over time for the instructions in program execution order:]
sub R18, R9, R11
add R20, R18, R13
or  R22, R11, R18
and R23, R12, R18
sw  R24, 10(R18)

Result of sub is needed by add, or, and, & sw instructions
Instructions add & or will read old value of R18 from reg file
During CC5, R18 is written at end of cycle

Solution 1: Stalling the Pipeline

[Figure: resource-usage diagram, CC1 to CC9, tracking the value of R18 over time; add is stalled and or is fetched late, for the instructions:]
sub R18, R9, R11
add R20, R18, R13
or  R22, R11, R18

Three stall cycles during CC3 through CC5 (wasting 3 cycles)
Stall cycles delay execution of add & fetching of or instruction

The add instruction cannot read R18 until beginning of CC6
The add instruction remains in the Instruction register until CC6
The PC register is not modified until beginning of CC6

Solution 2: Forwarding ALU Result

The ALU result is forwarded (fed back) to the ALU input
No bubbles are inserted into the pipeline and no cycles are wasted

ALU result is forwarded from ALU, MEM, and WB stages

[Figure: resource-usage diagram, CC1 to CC8, tracking the value of R18 over time, with forwarding paths, for the instructions in program execution order:]
sub R18, R9, R11
add R20, R18, R13
or  R22, R11, R18
and R23, R22, R18
sw  R24, 10(R18)

Implementing Forwarding
Two multiplexers added at the inputs of A & B registers
Data from ALU stage, MEM stage, and WB stage is fed back

Two signals: ForwardA and ForwardB control forwarding

[Figure: pipelined datapath with two 4-input multiplexers (inputs 0 to 3) on BusA and BusB, selected by ForwardA and ForwardB, fed by the ALU result, the MEM stage data, and the WB stage data; destination registers Rd2, Rd3, and Rd4 are carried along the pipeline]

Forwarding Control Signals

Signal          Explanation
ForwardA = 0    First ALU operand comes from register file = Value of (Rs)
ForwardA = 1    Forward result of previous instruction to A (from ALU stage)
ForwardA = 2    Forward result of 2nd previous instruction to A (from MEM stage)
ForwardA = 3    Forward result of 3rd previous instruction to A (from WB stage)
ForwardB = 0    Second ALU operand comes from register file = Value of (Rt)
ForwardB = 1    Forward result of previous instruction to B (from ALU stage)
ForwardB = 2    Forward result of 2nd previous instruction to B (from MEM stage)
ForwardB = 3    Forward result of 3rd previous instruction to B (from WB stage)

Forwarding Example
Instruction sequence:
lw  R12, 4(R8)
ori R15, R9, 2
sub R11, R12, R15

When sub instruction is in decode stage
ori will be in the ALU stage
lw will be in the MEM stage

ForwardA = 2 from MEM stage
ForwardB = 1 from ALU stage

[Figure: forwarding datapath with sub R11,R12,R15 in the decode stage, ori R15,R9,2 in the ALU stage, and lw R12,4(R8) in the MEM stage]

RAW Hazard Detection

Current instruction being decoded is in Decode stage
Previous instruction is in the Execute stage
Second previous instruction is in the Memory stage
Third previous instruction in the Write Back stage
If      ((Rs != 0) and (Rs == Rd2) and (EX.RegWrite))    ForwardA ← 1
Else if ((Rs != 0) and (Rs == Rd3) and (MEM.RegWrite))   ForwardA ← 2
Else if ((Rs != 0) and (Rs == Rd4) and (WB.RegWrite))    ForwardA ← 3
Else                                                     ForwardA ← 0

If      ((Rt != 0) and (Rt == Rd2) and (EX.RegWrite))    ForwardB ← 1
Else if ((Rt != 0) and (Rt == Rd3) and (MEM.RegWrite))   ForwardB ← 2
Else if ((Rt != 0) and (Rt == Rd4) and (WB.RegWrite))    ForwardB ← 3
Else                                                     ForwardB ← 0
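The priority order of the detection conditions can be written as one function that serves both ForwardA and ForwardB. This is a behavioral sketch in Python, not a hardware description; the function and parameter names are illustrative. It is checked against the earlier forwarding example (sub in decode, ori in the ALU stage, lw in the MEM stage).

```python
def forward_select(src, rd2, rd3, rd4, ex_regwrite, mem_regwrite, wb_regwrite):
    """Return the forwarding mux select (0 to 3) for one source register.

    src is Rs (for ForwardA) or Rt (for ForwardB) of the instruction in decode.
    rd2, rd3, rd4 are the destination registers held in the EX, MEM, and WB stages.
    The nearest producer wins, so the EX stage is checked first.
    """
    if src != 0 and src == rd2 and ex_regwrite:
        return 1  # result of previous instruction (ALU stage)
    if src != 0 and src == rd3 and mem_regwrite:
        return 2  # result of 2nd previous instruction (MEM stage)
    if src != 0 and src == rd4 and wb_regwrite:
        return 3  # result of 3rd previous instruction (WB stage)
    return 0      # no hazard: use the register file value


# sub R11, R12, R15 in decode; ori R15, R9, 2 in EX; lw R12, 4(R8) in MEM.
# The WB stage is assumed to hold no register-writing instruction.
stages = dict(rd2=15, rd3=12, rd4=0,
              ex_regwrite=True, mem_regwrite=True, wb_regwrite=False)
print(forward_select(12, **stages))  # ForwardA -> 2
print(forward_select(15, **stages))  # ForwardB -> 1
```

The test on register 0 matters because R0 is hardwired to zero in MIPS: an instruction that names R0 as its destination must never be treated as a producer.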

Hazard Detect and Forward Logic

[Figure: pipelined datapath with a Hazard Detect and Forward unit; its inputs are Rs, Rt, Rd2, Rd3, Rd4, and the RegWrite signals of the EX, MEM, and WB stages, and its outputs are ForwardA and ForwardB]


Load Delay
Unfortunately, not all data hazards can be resolved by forwarding
Load has a delay that cannot be eliminated by forwarding

In the example shown below

The LW instruction does not read data until end of CC4
Cannot forward data to ADD at end of CC3 - NOT possible

[Figure: resource-usage diagram, CC1 to CC8, in program order for the instructions:]
lw  R18, 20(R17)
add R20, R18, R13
or  R14, R11, R18
and R15, R18, R12

However, load can forward data to 2nd next and later instructions

Detecting RAW Hazard after Load

Detecting a RAW hazard after a Load instruction:
The load instruction will be in the EX stage
Instruction that depends on the load data is in the decode stage

Condition for stalling the pipeline

if ((EX.MemRead == 1)                        // Detect Load in EX stage
and (ForwardA == 1 or ForwardB == 1)) Stall  // RAW Hazard

Insert a bubble into the EX stage after a load instruction

Bubble is a no-op that wastes one clock cycle
Delays the dependent instruction after load by one cycle
Because of RAW hazard
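The stall condition can be expressed as a single predicate. The sketch below is a behavioral model with illustrative names; it takes the ForwardA and ForwardB values already computed for the instruction in decode.

```python
def must_stall(ex_mem_read: bool, forward_a: int, forward_b: int) -> bool:
    """True when the instruction in decode needs data from a load that is still in EX.

    ex_mem_read is the MemRead control signal of the instruction in the EX stage.
    forward_a or forward_b equal to 1 means the operand would be forwarded from
    the EX stage, where a load has not yet read its data from memory.
    """
    return ex_mem_read and (forward_a == 1 or forward_b == 1)


# lw R18, 20(R17) in EX and add R20, R18, R13 in decode: Rs matches Rd2.
print(must_stall(True, forward_a=1, forward_b=0))   # True: insert a bubble

# One cycle later the load is in MEM, so the operand comes from the MEM stage.
print(must_stall(False, forward_a=2, forward_b=0))  # False: forward normally
```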

Stall the Pipeline for one Cycle

ADD instruction depends on LW: stall at CC3
Allow Load instruction in ALU stage to proceed
Freeze PC and Instruction registers (NO instruction is fetched)
Introduce a bubble into the ALU stage (bubble is a NO-OP)

Load can forward data to next instruction after delaying it

[Figure: resource-usage diagram, CC1 to CC8, in program order; add is stalled in CC3 and a bubble follows lw through the pipeline, for the instructions:]
lw  R18, 20(R17)
add R20, R18, R13
or  R14, R11, R18

Showing Stall Cycles

Stall cycles can be shown on instruction-time diagram
Hazard is detected in the Decode stage
Stall indicates that instruction is delayed
Instruction fetching is also delayed after a stall
Example:
Data forwarding is shown using blue arrows
[Figure: instruction-time diagram, CC1 to CC10, with two Stall cycles, for the instructions:]
lw  R17, (R13)
lw  R18, 8(R17)
add R2, R18, R11
sub R1, R18, R2

Hazard Detect, Forward, and Stall

[Figure: pipelined datapath with a Hazard Detect, Forward, & Stall unit; in addition to ForwardA and ForwardB it takes MemRead from the EX stage and produces Stall, which disables the PC and selects Bubble = 0 for the control signals entering the EX stage]

Code Scheduling to Avoid Stalls

Compilers reorder code in a way to avoid load stalls
Consider the translation of the following statements:
A = B + C; D = E - F; // A thru F are in Memory

Slow code:
lw  R8,  4(R16)     # &B = 4(R16)
lw  R9,  8(R16)     # &C = 8(R16)
add R10, R8, R9     # stall cycle
sw  R10, 0(R16)     # &A = 0(R16)
lw  R11, 16(R16)    # &E = 16(R16)
lw  R12, 20(R16)    # &F = 20(R16)
sub R13, R11, R12   # stall cycle
sw  R13, 12(R0)     # &D = 12(R0)

Fast code: No Stalls
lw  R8,  4(R16)
lw  R9,  8(R16)
lw  R11, 16(R16)
lw  R12, 20(R16)
add R10, R8, R9
sw  R10, 0(R16)
sub R13, R11, R12
sw  R13, 12(R0)
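The stall counts for the two orderings can be checked mechanically. The sketch below assumes the stall rule described earlier (an instruction stalls for one cycle when it reads a register loaded by the immediately preceding lw) and handles only the instruction formats that appear in these two listings; the helper names are illustrative.

```python
import re


def parse(line):
    """Split an instruction into (opcode, destination register, source registers)."""
    op, rest = line.split(None, 1)
    regs = re.findall(r"R\d+", rest)
    if op == "sw":
        return op, None, regs    # a store writes no register and reads both
    return op, regs[0], regs[1:]  # first register is the destination


def load_use_stalls(program):
    """Count one stall for each instruction that uses the result of the lw just before it."""
    stalls = 0
    prev_op, prev_dest = None, None
    for line in program:
        op, dest, sources = parse(line)
        if prev_op == "lw" and prev_dest in sources:
            stalls += 1
        prev_op, prev_dest = op, dest
    return stalls


slow = ["lw R8, 4(R16)", "lw R9, 8(R16)", "add R10, R8, R9", "sw R10, 0(R16)",
        "lw R11, 16(R16)", "lw R12, 20(R16)", "sub R13, R11, R12", "sw R13, 12(R0)"]
fast = ["lw R8, 4(R16)", "lw R9, 8(R16)", "lw R11, 16(R16)", "lw R12, 20(R16)",
        "add R10, R8, R9", "sw R10, 0(R16)", "sub R13, R11, R12", "sw R13, 12(R0)"]

print(load_use_stalls(slow))  # 2
print(load_use_stalls(fast))  # 0
```

Both orderings execute the same eight instructions; the fast version only moves the loads of E and F earlier so that no load is followed immediately by its consumer.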

Name Dependence: Write After Read

Instruction J writes its result after it is read by I
Called anti-dependence by compiler writers
I: sub R12, R9, R11 # R9 is read
J: add R9, R10, R11 # R9 is written
Results from reuse of the name R9
NOT a data hazard in the 5-stage pipeline because:
Reads are always in stage 2
Writes are always in stage 5, and
Instructions are processed in order

Anti-dependence can be eliminated by renaming

Use a different destination register for add (eg, R13)

Name Dependence: Write After Write

Same destination register is written by two instructions
Called output-dependence in compiler terminology
I: sub R9, R12, R11 # R9 is written

J: add R9, R10, R11 # R9 is written again

Not a data hazard in the 5-stage pipeline because:

All writes are ordered and always take place in stage 5

However, can be a hazard in more complex pipelines

If instructions are allowed to complete out of order, and
Instruction J completes and writes R9 before instruction I

Output dependence can be eliminated by renaming R9

Read After Read is NOT a name dependence
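The three kinds of dependence between an earlier instruction I and a later instruction J differ only in which register sets overlap. The following sketch classifies a pair from its destination and source registers; the function name and the tuple layout are illustrative.

```python
def classify(i_dest, i_sources, j_dest, j_sources):
    """Return the dependences of J on I, where I comes before J in program order."""
    kinds = []
    if i_dest is not None and i_dest in j_sources:
        kinds.append("RAW")  # data dependence: J reads what I writes
    if j_dest is not None and j_dest in i_sources:
        kinds.append("WAR")  # anti-dependence: J overwrites what I reads
    if i_dest is not None and i_dest == j_dest:
        kinds.append("WAW")  # output dependence: both write the same register
    return kinds


# I: add R17, R18, R19   J: sub R20, R17, R19
print(classify("R17", ["R18", "R19"], "R20", ["R17", "R19"]))  # ['RAW']
# I: sub R12, R9, R11    J: add R9, R10, R11
print(classify("R12", ["R9", "R11"], "R9", ["R10", "R11"]))    # ['WAR']
# I: sub R9, R12, R11    J: add R9, R10, R11
print(classify("R9", ["R12", "R11"], "R9", ["R10", "R11"]))    # ['WAW']
```

In the second and third pairs both instructions also read R11, which produces no entry: a shared source register (read after read) constrains nothing.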

Hazards due to Loads and Stores

Consider the following statements:
sw R10, 0(R1)
lw R11, 6(R5)

Is there any Data Hazard possible in the above code sequence?

If 0(R1) == 6(R5), then it means the same memory location!!
Data dependency through Data Memory
But no Data Hazard since the memory is not accessed by the two instructions simultaneously; writing and reading happen in two consecutive clock cycles
But, in an out-of-order execution processor, this can lead to a Hazard!


Control Hazards
Jump and Branch can cause great performance loss
Jump instruction needs only the jump target address

Branch instruction needs two things:
Branch Result: Taken or Not Taken
Branch Target Address:
PC + 4                    If Branch is NOT taken
PC + 4 + 4 × immediate    If Branch is Taken

Jump and Branch targets are computed in the ID stage
At which point a new instruction is already being fetched

Jump Instruction: 1-cycle delay
Branch: 2-cycle delay for branch result (taken or not taken)
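The next-PC selection for a branch can be sketched as follows, assuming the usual MIPS encoding in which the 16-bit immediate is a signed word offset relative to PC + 4. The function names are illustrative.

```python
def sign_extend16(imm16: int) -> int:
    """Interpret the low 16 bits of imm16 as a signed two's-complement value."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def next_pc(pc: int, imm16: int, taken: bool) -> int:
    """PC + 4 when the branch is not taken, PC + 4 + 4 * immediate when it is taken."""
    target = pc + 4
    if taken:
        target += 4 * sign_extend16(imm16)
    return target & 0xFFFFFFFF  # keep the result within 32 bits


print(hex(next_pc(0x1000, 3, taken=False)))      # 0x1004
print(hex(next_pc(0x1000, 3, taken=True)))       # 0x1010
print(hex(next_pc(0x1000, 0xFFFF, taken=True)))  # 0x1000 (offset of -1 word)
```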

2-Cycle Branch Delay

Control logic detects a Branch instruction in the 2nd Stage
ALU computes the Branch outcome in the 3rd Stage
Next1 and Next2 instructions will be fetched anyway
Convert Next1 and Next2 into bubbles if branch is taken

[Figure: resource-usage diagram, cc1 to cc7; Beq R9,R10,L1 is followed by Next1 and Next2, which are converted into bubbles, and the target instruction at L1 is fetched once the branch target address is known]

Implementing Jump and Branch

Branch target & outcome are computed in ALU stage
Branch Delay = 2 cycles

[Figure: pipelined datapath in which J, Beq, Bne, and the ALU zero flag drive PCSrc to select the Jump or Branch Target, and Bubble = 0 replaces the control signals of the instructions being flushed]

Predict Branch NOT Taken

Branches can be predicted to be NOT taken
If branch outcome is NOT taken then
Next1 and Next2 instructions can be executed
Do not convert Next1 & Next2 into bubbles
No wasted cycles

Else, convert Next1 and Next2 into bubbles: 2 wasted cycles

[Figure: resource-usage diagram, cc1 to cc7; Beq R9,R10,L1 is resolved as NOT Taken in the ALU stage, and Next1 and Next2 continue through the pipeline]

Reducing the Delay of Branches

Branch delay can be reduced from 2 cycles to just 1 cycle
Branches can be determined earlier in the Decode stage
A comparator is used in the decode stage to determine branch decision, whether the branch is taken or not
Because of forwarding the delay in the second stage will be increased and this will also increase the clock cycle

Only one instruction that follows the branch is fetched

If the branch is taken then only one instruction is flushed

We should insert a bubble after jump or taken branch

This will convert the next instruction into a NOP

Reducing Branch Delay to 1 Cycle

[Figure: pipelined datapath with the branch comparison moved into the decode stage; J, Beq, Bne, and Zero drive PCSrc to select the Jump or Branch Target]
Data forwarded then compared: longer cycle
Reset signal converts next instruction after jump or taken branch into a bubble

Branch Behavior in Programs

Based on SPEC benchmarks on DLX
Branches occur with a frequency of 14% to 16% in integer programs and 3% to 12% in floating point programs.

About 75% of the branches are forward branches

60% of forward branches are taken
85% of backward branches are taken

67% of all branches are taken

Why are branches (especially backward branches) more likely to be taken than not taken?
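The overall taken rate is a weighted average of the forward and backward rates. The short check below uses the rounded percentages quoted above, so it lands close to, but not exactly on, the 67% figure; the variable names are illustrative.

```python
FORWARD_FRACTION = 0.75   # about 75% of branches are forward
FORWARD_TAKEN = 0.60      # 60% of forward branches are taken
BACKWARD_TAKEN = 0.85     # 85% of backward branches are taken

# Weighted average over forward and backward branches.
taken_fraction = (FORWARD_FRACTION * FORWARD_TAKEN
                  + (1 - FORWARD_FRACTION) * BACKWARD_TAKEN)

print(f"{taken_fraction:.4f}")  # 0.6625, roughly two thirds of all branches
```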

Four Branch Hazard Alternatives

#1: Stall until branch direction is clear
#2: Predict Branch Not Taken

Execute successor instructions in sequence

Flush instructions in pipeline if branch actually taken
Advantage of late pipeline state update
33% DLX branches not taken on average
PC+4 already calculated, so use it to get next instruction

#3: Predict Branch Taken

67% DLX branches taken on average
#4: Define branch to take place AFTER n following instructions (Delayed branch)


Branch Hazard Alternatives

Predict Branch Not Taken (previously discussed)
Successor instruction is already fetched
Do NOT Flush instruction after branch if branch is NOT taken
Flush only instructions appearing after Jump or taken branch

Delayed Branch
Define branch to take place AFTER the next instruction
Compiler/assembler fills the branch delay slot (for 1 delay cycle)

Dynamic Branch Prediction

Loop branches are taken most of time
Must reduce branch delay to 0, but how?
How to predict branch behavior at runtime?

Delayed Branch
Define branch to take place after the next instruction
Instruction in branch delay slot is always executed
Compiler (tries to) move a useful instruction into delay slot.
From before the Branch: Always helpful when possible

For a 1-cycle branch delay, we have one delay slot
    branch instruction
    branch delay slot   (next instruction)
    branch target       (if branch taken)

Compiler fills the branch delay slot
By selecting an independent instruction from before the branch

If no independent instruction is found
Compiler fills delay slot with a NO-OP

Before filling the delay slot:
label:
    . . .
    add R10,R11,R12
    beq R17,R16,label
    Delay Slot

After filling the delay slot:
label:
    . . .
    beq R17,R16,label
    add R10,R11,R12

Delayed Branch
From the Target: Helps when branch is taken. May duplicate instructions

Before:
    ADD  R2, R1, R3
    BEQZ R2, L1
    DELAY SLOT
L1: SUB  R4, R5, R6
L2:

After:
    ADD  R2, R1, R3
    BEQZ R2, L2
    SUB  R4, R5, R6
L1: SUB  R4, R5, R6
L2:

Instructions between BEQZ and SUB (in fall through) must not use R4.
Why is instruction at L1 duplicated? What if R5 or R6 changed?
(Because L1 can be reached by another path!)

Delayed Branch
From Fall Through: Helps when branch is not taken (default case)

Before:
    ADD  R2, R1, R3
    BEQZ R2, L1
    DELAY SLOT
    SUB  R4, R5, R6
L1:

After:
    ADD  R2, R1, R3
    BEQZ R2, L1
    SUB  R4, R5, R6
L1:

Instructions at target (L1 and after) must not use R4 till set (written) again.

Filling delay slots

Compiler effectiveness for single branch delay slot:
Fills about 60% of branch delay slots
About 80% of instructions executed in branch delay slots are useful in computation
About 50% (i.e., 60% x 80%) of slots usefully filled

Canceling branches or nullifying branches

Include a prediction of whether the branch is taken or not taken
If the prediction is correct, the instruction in the delay slot is executed
If the prediction is incorrect, the instruction in the delay slot is quashed.
Allow more slots to be filled from the target address or fall through

Drawback of Delayed Branching

New meaning for branch instruction
Branching takes place after next instruction (Not immediately!)

Impacts software and compiler

Compiler is responsible to fill the branch delay slot
For a 1-cycle branch delay: one branch delay slot

However, modern processors are deeply pipelined

Branch penalty is multiple cycles in deeper pipelines
Multiple delay slots are difficult to fill with useful instructions

MIPS used delayed branching in earlier pipelines

However, delayed branching is not useful in recent processors

Compiler Static Prediction of Taken/Untaken Branches

Two strategies examined
Backward branch predict taken, forward branch not taken
Profile-based prediction: record branch behavior, predict branch based on prior run

[Figure: bar chart of instructions per mispredicted branch (axis ticks at 100, 1000, 10000, and 100000) comparing profile-based and direction-based prediction on the benchmarks alvinn, compress, doduc, espresso, gcc, hydro2d, mdljsp2, ora, swm256, and tomcatv]

Zero-Delayed Branching
How to achieve zero delay for a jump or a taken branch?
Jump or branch target address is computed in the ID stage
Next instruction has already been fetched in the IF stage

Solution
Introduce a Branch Target Buffer (BTB) in the IF stage
Store the target address of recent branch and jump instructions

Use the lower bits of the PC to index the BTB

Each BTB entry stores Branch/Jump address & Target Address
Check the PC to see if the instruction being fetched is a branch

Update the PC using the target address stored in the BTB


Branch Target Buffer

The branch target buffer is implemented as a small cache
Stores the target address of recent branches and jumps

We must also have prediction bits

To predict whether branches are taken or not taken
The prediction bits are dynamically determined by the hardware
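[Figure: Branch Target & Prediction Buffer. The low-order bits of the PC are used as the index into a table holding the addresses of recent branches, their target addresses, and the predict bits. An address comparison (=) together with predict_taken controls a mux that selects between the incremented PC (Inc) and the stored target address.]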

Dynamic Branch Prediction

Prediction of branches at runtime using prediction bits
Prediction bits are associated with each entry in the BTB
Prediction bits reflect the recent history of a branch instruction

Typically few prediction bits (1 or 2) are used per entry

We don't know whether the prediction is correct until the branch is resolved
If correct prediction
Continue normal execution: no wasted cycles

If incorrect prediction (misprediction)

Flush the instructions that were incorrectly fetched: wasted cycles
Update prediction bits and target address for future use


Dynamic Branch Prediction (Contd.)

[Flowchart]
Use PC to address Instruction Memory and Branch Target Buffer
Found BTB entry with predict taken?
  Yes: PC = target address, then check: jump or taken branch?
    Yes: Correct prediction, no stall cycles
    No: Mispredicted branch (branch not taken)
        Update prediction bits
        Flush fetched instructions
        Restart PC after branch
  No: Increment PC, then check: jump or taken branch?
    No: Normal execution
    Yes: Mispredicted jump/branch
        Enter jump/branch address, target address, and set prediction in BTB entry
        Flush fetched instructions
        Restart PC at target address

1-bit Prediction Scheme

Prediction is just a hint that is assumed to be correct
If incorrect then fetched instructions are flushed

1-bit prediction scheme is simplest to implement

1 bit per branch instruction (associated with BTB entry)
Record last outcome of a branch instruction (Taken/Not taken)

Use last outcome to predict future behavior of a branch

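[State diagram: two states, Predict Not Taken and Predict Taken. A Taken outcome moves to (or stays in) Predict Taken; a Not Taken outcome moves to (or stays in) Predict Not Taken.]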

1-Bit Predictor: Shortcoming

Inner loop branch mispredicted twice!
Mispredict as taken on last iteration of inner loop
Then mispredict as not taken on first iteration of inner
loop next time around
outer:

inner:

bne , , inner

bne , , outer


2-bit Prediction Scheme

1-bit prediction scheme has a performance shortcoming
2-bit prediction scheme works better and is often used
4 states: strong and weak predict taken / predict not taken

Implemented as a saturating counter

Counter is incremented (saturating at max = 3) when the branch outcome is taken
Counter is decremented (saturating at min = 0) when the branch is not taken
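[State diagram: four states in order: Strong Predict Not Taken, Weak Predict Not Taken, Weak Predict Taken, Strong Predict Taken. A Taken outcome moves one state toward Strong Predict Taken; a Not Taken outcome moves one state toward Strong Predict Not Taken; the two strong states saturate.]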
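
A minimal sketch of the saturating counter described on this slide is given below. The class name, the encoding of the four states as 0 to 3, and the `STATE_NAMES` list are assumptions made for illustration.

```python
class SaturatingCounter:
    """n-bit saturating counter; the upper half of its range predicts taken."""

    def __init__(self, bits=2, value=0):
        self.max = (1 << bits) - 1          # 3 for a 2-bit counter
        self.threshold = 1 << (bits - 1)    # 2 for a 2-bit counter
        self.value = value

    def predict_taken(self):
        return self.value >= self.threshold

    def update(self, taken):
        # Increment on taken, decrement on not taken, saturating at both ends.
        if taken:
            self.value = min(self.value + 1, self.max)
        else:
            self.value = max(self.value - 1, 0)


# Assumed state encoding for the 2-bit case (index = counter value).
STATE_NAMES = ["strong not taken", "weak not taken", "weak taken", "strong taken"]

c = SaturatingCounter(value=3)                     # start in strong taken
c.update(False)                                    # one not-taken outcome: 3 -> 2
after_one_miss = (c.value, c.predict_taken())      # still predicts taken
c.update(False)                                    # second not-taken outcome: 2 -> 1
after_two_misses = (c.value, c.predict_taken())    # prediction has now flipped
```

Starting from a strong state, a single outcome in the other direction does not change the prediction, which is what removes the second misprediction of the 1-bit scheme on loop branches.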

2-bit Prediction Scheme

Alternative state machine


1-Bit Branch History Table

Example

while branch : always TAKEN

int main(void)
{
    int i, j;
    while (1)
        for (i = 1; i <= 4; i++) {
            j = i;   /* the body must not modify i, so the loop runs 4 times */
        }
}

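while branch outcomes: 11111111111111111...

for branch : TAKEN x 3, NOT TAKEN x 1

for branch outcomes: 1110111011101110...

1-bit BHT trace for the for branch:

branch outcome      1 1 1 0   1 1 1 0   1 ...
last outcome        1 1 1 1   0 1 1 1   0 ...
prediction          T T T T   N T T T   N ...
new last outcome    1 1 1 0   1 1 1 0   1 ...
correctness         O O O X   X O O X   X ...

Assume initial last outcome = 1

O : correct, X : mispredict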

Prediction accuracy for the for branch: 50% (after the first pass)
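
The 50% figure can be checked with a short simulation. This is a sketch under the slide's assumptions (outcome pattern 1110 repeating, initial last outcome = 1); the function name `one_bit_trace` is hypothetical.

```python
def one_bit_trace(outcomes, last=1):
    """Return a list of booleans: was each 1-bit prediction correct?"""
    results = []
    for outcome in outcomes:
        results.append(last == outcome)   # the prediction is the last outcome
        last = outcome                    # remember this outcome for next time
    return results


pattern = [1, 1, 1, 0]                    # for branch: taken x 3, not taken x 1
results = one_bit_trace(pattern * 100, last=1)

first_pass = "".join("O" if r else "X" for r in results[:4])
second_pass = "".join("O" if r else "X" for r in results[4:8])

steady = results[4:]                      # skip the pass that depends on the initial bit
accuracy = sum(steady) / len(steady)
```

Every loop exit costs two mispredictions here: one on the final not-taken outcome and one on the first taken outcome of the next pass.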

2-Bit Counter Scheme

Prediction is made based on the last two branch outcomes
Each BHT entry consists of 2 bits, usually a 2-bit counter, which
hold the state of an automaton (many different automata are possible)

[Figure: the branch outcome drives the automaton; the automaton state is stored in the BHT entry and supplies the prediction]

2-Bit BHT
2-bit scheme: the prediction is changed only after it mispredicts twice
MSB of the state symbol represents the prediction;
1: TAKEN, 0: NOT TAKEN

[State diagram: states 11 and 10 predict TAKEN; states 01 and 00 predict NOT TAKEN; edges are labeled T (taken) and NT (not taken)]

branch outcome       1  1  1  0  1  1  1  0  1  1  1  0 ...
counter value       11 11 11 11 10 11 11 11 10 11 11 11 ...
prediction           T  T  T  T  T  T  T  T  T  T  T  T ...
new counter value   11 11 11 10 11 11 11 10 11 11 11 10 ...
correctness          O  O  O  X  O  O  O  X  O  O  O  X ...

Assume initial counter value: 11
O : correct, X : mispredict

Prediction accuracy for the for branch: 75%
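
The trace above can be regenerated with a few lines of code. The sketch assumes a saturating 2-bit counter starting at 11 and the outcome pattern 1110; `two_bit_trace` is a hypothetical helper name.

```python
def two_bit_trace(outcomes, counter=3):
    """Return one row per branch: (counter, prediction, new counter, mark)."""
    rows = []
    for outcome in outcomes:
        predict_taken = counter >= 2              # MSB of the 2-bit counter
        correct = predict_taken == bool(outcome)
        if outcome:
            new_counter = min(counter + 1, 3)     # saturate at 11
        else:
            new_counter = max(counter - 1, 0)     # saturate at 00
        rows.append((format(counter, "02b"),
                     "T" if predict_taken else "N",
                     format(new_counter, "02b"),
                     "O" if correct else "X"))
        counter = new_counter
    return rows


rows = two_bit_trace([1, 1, 1, 0] * 3, counter=3)
counters = " ".join(r[0] for r in rows)
marks = " ".join(r[3] for r in rows)
accuracy = marks.count("O") / len(rows)
```

Only the loop-exit outcome is mispredicted, because one not-taken outcome moves the counter from 11 to 10 without flipping the prediction.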

Case for Correlating Predictors

Basic two-bit predictor schemes use the recent behavior of a branch
to predict its future behavior

To improve the prediction accuracy, look also at the recent behavior
of other branches

    if (aa == 2) aa = 0;
    if (bb == 2) bb = 0;
    if (aa != bb) { }

        subi R3, R1, #2
        bnez R3, L1        ; b1 (aa != 2), aa is in R1
        add  R1, R0, R0    ; aa = 0
    L1: subi R3, R2, #2
        bnez R3, L2        ; b2 (bb != 2), bb is in R2
        add  R2, R0, R0    ; bb = 0
    L2: sub  R3, R1, R2    ; R3 = aa - bb
        beqz R3, L3        ; b3 (aa == bb)

b3 is correlated with b1 and b2:
If b1 and b2 are both untaken, then b3 will be taken.
=> Use correlating predictors or two-level predictors.

Branch Correlation
Code Snippet
    if (aa==2)        // b1
        aa = 0;
    if (bb==2)        // b2
        bb = 0;
    if (aa!=bb) {     // b3
        ...
    }

[Figure: tree of the b1 and b2 outcomes leading to b3; edges are labeled 0 (NT) and 1 (T)]

Path    A: 0-0    B: 0-1    C: 1-0    D: 1-1
aa      aa=0      aa=0      aa≠2      aa≠2
bb      bb=0      bb≠2      bb=0      bb≠2

Branch direction
Not independent
Correlated to the path taken

Example: the decision of b3 can be known for certain beforehand if
the path to b3 is 0-0 (path A: aa = bb = 0)
Track the path using a 2-bit register

Correlating Branches
Example:
    if (d==0)
        d=1;
    if (d==1)

        BNEZ R1,b1       ; (b1)(d!=0)
        ADDI R1,R1,#1    ; since d==0, make d=1
b1 :    SUBI R3,R1,#1
        BNEZ R3,b2       ; (b2)(d!=1)
        .....
b2 :

Initial value of d   d==0?   b1   Value of d before b2   d==1?   b2
0                    Y       NT   1                      Y       NT
1                    N       T    1                      Y       NT
2                    N       T    2                      N       T

If b1 is NT, then b2 is NT

1-bit self history predictor, sequence of d = 2,0,2,0,2,0,...

d=?   b1 prediction   b1 action   New b1 prediction   b2 prediction   b2 action   New b2 prediction
2     NT              T           T                   NT              T           T
0     T               NT          NT                  T               NT          NT
2     NT              T           T                   NT              T           T
0     T               NT          NT                  T               NT          NT

All branches are mispredicted

Correlating Branches
Example:
    if (d==0)
        d=1;
    if (d==1)

        BNEZ R1,L1       ; branch b1 (d!=0)
        ADDI R1,R1,#1    ; since d==0, make d=1
L1 :    SUBI R3,R1,#1
        BNEZ R3,L2       ; branch b2 (d!=1)
        .....
L2 :

Self prediction bits (XX)   Global: prediction if last branch action was NT   Global: prediction if last branch action was T
NT/NT                       NT                                                NT
NT/T                        NT                                                T
T/NT                        T                                                 NT
T/T                         T                                                 T

Initial self prediction bits NT/NT and initial last branch was NT.

d=?   b1 prediction   b1 action   new b1 prediction   b2 prediction   b2 action   new b2 prediction
2     [NT]/NT         T           T/NT                NT/[NT]         T           NT/T
0     T/[NT]          NT          T/NT                [NT]/T          NT          NT/T
2     [T]/NT          T           T/NT                NT/[T]          T           NT/T
0     T/[NT]          NT          T/NT                [NT]/T          NT          NT/T

The prediction used is shown in brackets (red on the original slide)

Misprediction only in the first prediction
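
The behavior of this predictor with one bit of global history can be reproduced with a short simulation. It is a sketch under the slide's stated initial conditions (prediction bits NT/NT, last branch NT); the function name `simulate_1_1` and the data layout are assumptions.

```python
def simulate_1_1(d_values):
    """Run a predictor with 1 bit of global history and 1-bit predictions."""
    # pred[branch][last] is the 1-bit prediction used when the most recently
    # executed branch had outcome `last` (0 = not taken, 1 = taken).
    pred = {"b1": [False, False], "b2": [False, False]}   # initial NT/NT
    last = 0                                              # initial last branch: NT
    mispredicts = []
    for d in d_values:
        b1_taken = d != 0                 # b1: BNEZ on d
        if pred["b1"][last] != b1_taken:
            mispredicts.append(("b1", d))
        pred["b1"][last] = b1_taken       # update only the entry that was used
        last = int(b1_taken)

        if d == 0:
            d = 1                         # fall-through path sets d = 1

        b2_taken = d != 1                 # b2: BNEZ on d - 1
        if pred["b2"][last] != b2_taken:
            mispredicts.append(("b2", d))
        pred["b2"][last] = b2_taken
        last = int(b2_taken)
    return mispredicts, pred


mispredicts, final_bits = simulate_1_1([2, 0, 2, 0, 2, 0])
```

Both mispredictions happen on the first value of d; after that, the outcome of b1 selects the correct entry for b2 every time.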

Local/Global Predictors
Instead of maintaining a counter for each branch to
capture the common case,
Maintain a counter for each branch and surrounding pattern
If the surrounding pattern belongs to the branch being predicted,
the predictor is referred to as a local predictor
If the surrounding pattern includes neighboring branches, the
predictor is referred to as a global predictor


Correlated Branch Prediction

Idea: record m most recently executed branches as
taken or not taken, and use that pattern to select the
proper n-bit branch history table
In general, an (m,n) predictor records the last m branches
to select between 2^m history tables, each with n-bit
counters
Thus, the old 2-bit BHT is a (0,2) predictor

Global Branch History: m-bit shift register keeping T/NT

status of last m branches.
Each entry in the table has 2^m n-bit predictors.
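
As a quick check on the sizes involved, the storage of an (m,n) predictor follows directly from the definition: 2^m counters of n bits for each entry selected by the branch address. The helper name `predictor_bits` is hypothetical.

```python
def predictor_bits(m, n, entries):
    """Total storage in bits of an (m, n) predictor with `entries` table entries."""
    return (2 ** m) * n * entries


plain_bht = predictor_bits(0, 2, 4096)      # 2-bit BHT with 4096 entries
correlating = predictor_bits(2, 2, 1024)    # (2,2) predictor with 1024 entries
```

Both configurations use 8192 bits, so a comparison between a 4096-entry 2-bit BHT and a 1024-entry (2,2) predictor is a comparison at equal hardware cost.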

Correlating Branches
(2,2) predictor

Behavior of recent branches selects between four predictions of the next
branch, updating just that prediction

[Figure: the branch address (4 bits) selects a row of four 2-bit per-branch predictors; the 2-bit global branch history selects which of the four supplies the prediction]

Correlated Branch Predictor

[Figure: two schemes side by side.
2-bit Sat. Counter Scheme: a hash of the Branch PC (w bits) indexes a table of 2-bit counters, and the selected counter gives the prediction.
(2,2) Correlation Scheme: a hash of the Branch PC selects a row of 2-bit counters, and a 2-bit shift register (global branch history), fed with each subsequent branch direction, selects which counter in the row gives the prediction.]

(M,N) correlation scheme

M: shift register size (# bits)
N: N-bit counter

Accuracy of Different Schemes

[Figure: bar chart of the frequency of mispredictions (percent) on nasa7, matrix300, tomcatv, doducd, spice, fpppp, gcc, espresso, and eqntott for three predictors: 4,096 entries with 2 bits per entry; unlimited entries with 2 bits per entry; 1,024 entries (2,2)]

Two-Level Branch Predictor

[Figure: generalized correlated branch predictor. An N-bit Branch History Register (BHR), shifted left on update with the actual branch outcomes Rc-k ... Rc-1, holds the branch history pattern that indexes a Pattern History Table (PHT) with 2^N entries (00..00 through 11..11). The selected PHT entry (current state) gives the prediction; the FSM update logic combines it with Rc, the actual branch outcome, to produce the PHT update.]

1st level keeps branch history in Branch History Register (BHR)

2nd level segregates pattern history in Pattern History Table (PHT)

Branch History Register

An N-bit Shift Register = 2^N patterns in PHT
Shift-in branch outcomes
1 taken
0 not taken

First-in First-Out

BHR can be
Global
Per-set
Local (Per-address)

Pattern History Table

2^N entries addressed by N-bit BHR
Each entry keeps a counter (2-bit or more) for prediction
Counter update: the same as 2-bit counter
Can be initialized in alternate patterns (01, 10, 01, 10, ..)

Alias (or interference) problem

Two-Level Branch Prediction

[Figure: PC = 0x4001000C. The BHR for the branch's set holds 0110; the PHT index is 00110110, and that PHT entry holds the counter 10.]

MSB = 1
Predict Taken

Predictor Update (Actually, Not Taken)

[Figure: PC = 0x4001000C. The PHT entry at index 00110110 is decremented from 10 to 01; the BHR shifts from 0110 to 1100, so the next PHT index is 00111100.]

Update Predictor after branch is resolved
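
The prediction and update steps of this example can be sketched in code. The sketch assumes that the 8-bit PHT index is formed from four PC bits (above the 2-bit byte offset) concatenated with a 4-bit BHR, which reproduces the index values 00110110 and 00111100 from the slides but is an inference from them. It also keeps a single BHR rather than one per set, and the class name `TwoLevelPredictor` is hypothetical.

```python
HISTORY_BITS = 4   # BHR width, matching the 4-bit value 0110 on the slide
PC_BITS = 4        # assumed number of PC bits used in the PHT index


class TwoLevelPredictor:
    def __init__(self):
        self.bhr = 0                                        # branch history register
        self.pht = [1] * (1 << (PC_BITS + HISTORY_BITS))    # 2-bit counters

    def index(self, pc):
        pc_part = (pc >> 2) & ((1 << PC_BITS) - 1)   # skip the byte-offset bits
        return (pc_part << HISTORY_BITS) | self.bhr

    def predict(self, pc):
        return self.pht[self.index(pc)] >= 2          # MSB of the counter

    def update(self, pc, taken):
        i = self.index(pc)
        # Update the 2-bit saturating counter, then shift the outcome into the BHR.
        self.pht[i] = min(self.pht[i] + 1, 3) if taken else max(self.pht[i] - 1, 0)
        self.bhr = ((self.bhr << 1) | int(taken)) & ((1 << HISTORY_BITS) - 1)


p = TwoLevelPredictor()
pc = 0x4001000C
p.bhr = 0b0110                    # history before the branch
p.pht[0b00110110] = 0b10          # counter value in the selected entry

index_before = p.index(pc)        # PHT entry consulted for the prediction
predicted_taken = p.predict(pc)   # counter 10 has MSB = 1
p.update(pc, taken=False)         # the branch was actually not taken
index_after = p.index(pc)         # entry that the next prediction will use
```

After the update the counter at 00110110 holds 01 and the BHR holds 1100, matching the values on the update slide.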

Tournament Predictors
A local predictor might work well for some branches or
programs, while a global predictor might work well for others
Provide one of each and maintain another predictor to
identify which predictor is best for each branch
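[Figure: the Branch PC indexes a local predictor, a global predictor, and a tournament predictor (a table of 2-bit saturating counters); the tournament predictor controls a MUX that selects between the local and global predictions]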

Tournament Predictors
Multilevel branch predictor
Selector for the Global and Local predictors of correlating branch
prediction

Use n-bit saturating counter to choose between predictors

Usual choice between global and local predictors


Tournament Predictors
Advantage of tournament predictor is the ability to select the right
predictor for a particular branch
A typical tournament predictor selects global predictor 40% of the time
for SPEC integer benchmarks
AMD Opteron and Phenom use tournament-style predictors
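
A minimal sketch of the selection mechanism follows. It is not the design of any particular processor: the local component is simplified to one 2-bit counter per branch, the global component to 2-bit counters indexed by a 10-bit global history, and all names and table sizes are assumptions.

```python
def bump(counter, up):
    """Move a 2-bit saturating counter one step up or down."""
    return min(counter + 1, 3) if up else max(counter - 1, 0)


class TournamentPredictor:
    def __init__(self, entries=1024):
        self.mask = entries - 1             # entries assumed to be a power of two
        self.local = [1] * entries          # per-branch 2-bit counters, indexed by PC
        self.global_table = [1] * entries   # 2-bit counters indexed by global history
        self.chooser = [1] * entries        # 2-bit selectors: >= 2 means use global
        self.history = 0                    # global branch history register

    def _components(self, pc):
        i = (pc >> 2) & self.mask
        local_pred = self.local[i] >= 2
        global_pred = self.global_table[self.history] >= 2
        return i, local_pred, global_pred

    def predict(self, pc):
        i, local_pred, global_pred = self._components(pc)
        return global_pred if self.chooser[i] >= 2 else local_pred

    def update(self, pc, taken):
        i, local_pred, global_pred = self._components(pc)
        # Train the selector only when the two components disagree.
        if local_pred != global_pred:
            self.chooser[i] = bump(self.chooser[i], global_pred == taken)
        self.local[i] = bump(self.local[i], taken)
        self.global_table[self.history] = bump(self.global_table[self.history], taken)
        self.history = ((self.history << 1) | int(taken)) & self.mask


predictor = TournamentPredictor()
pc = 0x400020
mispredicts = []
for k in range(200):
    taken = (k % 2 == 0)                   # branch that alternates T, NT, T, NT, ...
    mispredicts.append(predictor.predict(pc) != taken)
    predictor.update(pc, taken)

late_mispredicts = sum(mispredicts[100:])  # mispredictions after warm-up
```

On this alternating branch the per-branch counter never settles, while the history-indexed table learns the pattern, so the selector moves to the global component for this branch.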

Accuracy v. Size (SPEC89)

[Figure: conditional branch misprediction rate (0% to 10%) versus total predictor size (Kbits) for three predictors: Local - 2 bit counters, Correlating - (2,2) scheme, and Tournament]

Tournament Predictors (Intel Core i7)

Based on predictors used in Core 2 Duo chip
Combines three different predictors
Two-bit
Global history
Loop exit predictor
Uses a counter to predict the exact number of taken
branches (number of loop iterations) for a branch
that is detected as a loop branch
Tournament: Tracks accuracy of each predictor
Main problem of speculation:
A mispredicted branch may lead to another branch
being mispredicted !

Branch Prediction is More Important Today

Conditional branches still comprise about 20% of instructions
Correct predictions are more important today - why?

pipelines deeper
branch not resolved until more cycles from fetching - therefore the
misprediction penalty greater
cycle times smaller - more emphasis on throughput (performance)
more functionality between fetch & execute

multiple instruction issue (superscalars & VLIW)

branch occurs almost every cycle
flushing & refetching more instructions

object-oriented programming
more indirect branches - which are harder to predict

dual of Amdahl's Law

other forms of pipeline stalling are being addressed - so the portion
of CPI due to branch delays is relatively larger

All this means that the potential stalling due to branches is greater

Branch Prediction is More Important Today

On the other hand,
Chips are denser so we can consider sophisticated
HW solutions
Hardware cost is small compared to the performance
gain

Directions in Branch Prediction

1: Improve the prediction
correlated (2-level) predictor (Pentium III - 512 entries, 2-bit,
Pentium Pro - 4 history bits)
hybrid local/global predictor (Alpha 21264)
confidence predictors
2: Determine the target earlier
branch target buffer (Pentium Pro, IA-64 Itanium)
next address in I-cache (Alpha 21264, UltraSPARC)
return address stack (Alpha 21264, IA-64 Itanium, MIPS R10000,
Pentium Pro, UltraSPARC-3)
3: Reduce misprediction penalty
fetch both instruction streams (IBM mainframes, SuperSPARC)
4: Eliminate the branch
predicated execution (IA-64 Itanium, Alpha 21264)


Pipelining Complications
Exceptions: Events other than branches or jumps that change the
normal flow of instruction execution. Some types of exceptions:
I/O Device request
Invoking an OS service from user program

Tracing Instruction execution

Breakpoint (programmer requested interrupt)
Integer arithmetic overflow
FP arithmetic anomaly

Page fault (page not in main memory)

Misaligned memory accesses
Memory protection violation
Use of undefined instruction
Hardware malfunction
Power failure

Pipelining Complications
Exceptions: Events other than branches or jumps
that change the normal flow of instruction execution.
5 instructions executing in 5 stage pipeline
How to stop the pipeline?
Who caused the interrupt?
How to restart the pipeline?
Stage   Problems causing the interrupts
IF      Page fault on instruction fetch; misaligned memory access; memory-protection violation
ID      Undefined or illegal opcode
EX      Arithmetic interrupt
MEM     Page fault on data fetch; misaligned memory access; memory-protection violation

Pipelining Complications
Simultaneous exceptions in more than one pipeline stage,
e.g.,
LOAD with data page fault in MEM stage
ADD with instruction page fault in IF stage

Solution #1
Interrupt status vector per instruction
Defer check until last stage, kill state update if exception

Solution #2
Interrupt ASAP
Restart everything that is incomplete
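
Solution #1 can be illustrated with a small sketch: each instruction carries its own status vector down the pipeline, and the vector is examined only when the instruction reaches the last stage, in program order. The data structure and the names `InFlight` and `commit_in_order` are assumptions made for illustration.

```python
from dataclasses import dataclass, field

STAGES = ["IF", "ID", "EX", "MEM"]      # stages in which an exception can be raised


@dataclass
class InFlight:
    name: str
    faults: dict = field(default_factory=dict)   # stage -> cause, filled as stages run


def commit_in_order(instructions):
    """Check status vectors at the last stage, in program order.

    Returns the instructions that updated state and the first exception found.
    """
    committed = []
    for inst in instructions:
        if inst.faults:
            # Report the fault from the earliest stage of this instruction and
            # suppress its state update and that of everything after it.
            stage = min(inst.faults, key=STAGES.index)
            return committed, (inst.name, stage, inst.faults[stage])
        committed.append(inst.name)
    return committed, None


# The ADD's instruction page fault happens earlier in time (in its IF stage),
# but the LOAD is earlier in program order.
load = InFlight("LOAD", {"MEM": "data page fault"})
add = InFlight("ADD", {"IF": "instruction page fault"})
committed, exc = commit_in_order([load, add])
```

Deferring the check to the last stage means exceptions are taken in program order, even when a later instruction faults first in time.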


Pipelining Complications
Our DLX pipeline only writes results at the end of the
instruction's execution. Not all processors do this.

Address modes: Auto-increment causes register change

during instruction execution
Interrupts: need to restore register state
Adds WAR and WAW hazards since writes happen not only in
last stage

Memory-Memory Move Instructions

Must be able to handle multiple page faults
VAX and x86 store values temporarily in registers

Condition Codes
Need to detect the last instruction to change condition codes


Fallacies and Pitfalls

Pipelining is easy!
The basic idea is easy

The devil is in the details

Detecting data hazards and stalling pipeline

Poor ISA design can make pipelining harder

Complex instruction sets (Intel IA-32)
Significant overhead to make pipelining work
IA-32 micro-op approach

Complex addressing modes


Pipeline Hazards Summary

Three types of pipeline hazards
Structural hazards: conflicts using a resource during same cycle
Data hazards: due to data dependencies between instructions

Control hazards: due to branch and jump instructions

Hazards limit the performance and complicate the design

Structural hazards: eliminated by careful design or more hardware
Data hazards are eliminated by data forwarding
However, load delay cannot be completely eliminated
Delayed branching can be a solution for control hazards
BTB with branch prediction can reduce branch delay to zero
Branch misprediction should flush the wrongly fetched instructions