Pipe Lining
Pipe Lining
1
The slow way
6 PM 7 8 9 10 11 Midnight
Time
30 40 20 30 40 20 30 40 20 30 40 20
2
Laundry Pipelining
▪ Start each load as soon as possible
— Overlap loads
6 PM 7 8 9 10 11 Midnight
Time
30 40 40 40 40 20
4
Instruction execution review
▪ Executing a MIPS instruction can take up to five steps.
Step Name Description
Instruction Fetch IF Read an instruction from memory.
Instruction Decode ID Read source registers and generate control signals.
Execute EX Compute an R-type result or a branch outcome.
Memory MEM Read or write the data memory.
Writeback WB Store a result in the destination register.
5
Single-cycle datapath diagram
0
M
PC Add u
x
4
Add 1
Shift
1ns left 2
PCSrc
RegWrite 2ns
Read Instruction
address [31-0]
I [25 - 21]
Read Read
2ns MemWrite MemToReg
MemRead
ALUSrc
RegDst
I [15 - 0] Sign
extend
6
Single-cycle review
▪ All five execution steps occur in one clock cycle.
▪ This means the cycle time must be long enough to accommodate all the
steps of the most complex instruction—a “lw” in our instruction set.
— If the register file has a 1ns latency and the memories and ALU have a
2ns latency, “lw” will require 8ns.
— Thus all instructions will take 8ns to execute.
▪ Each hardware element can only be used once per clock cycle.
— A “lw” or “sw” must access memory twice (in the IF and MEM stages),
so there are separate instruction and data memories.
— There are multiple adders, since each instruction increments the PC
(IF) and performs another computation (EX). On top of that, branches
also need to compute a target address.
7
Example: Instruction Fetch (IF)
▪ Let’s quickly review how lw is executed in the single-cycle datapath.
▪ We’ll ignore PC incrementing and branching for now.
▪ In the Instruction Fetch (IF) step, we read the instruction memory.
RegWrite
8
Instruction Decode (ID)
▪ The Instruction Decode (ID) step reads the source registers from the
register file.
RegWrite
9
Execute (EX)
▪ The third step, Execute (EX), computes the effective memory address
from the source register and the instruction’s constant field.
RegWrite
10
Memory (MEM)
▪ The Memory (MEM) step involves reading the data memory, from the
address computed by the ALU.
RegWrite
11
Writeback (WB)
▪ Finally, in the Writeback (WB) step, the memory value is stored into the
destination register.
RegWrite
MemWrite MemToReg
Read Instruction I [25 - 21]
Read Read
address [31-0]
register 1 data 1 Read Read
ALU 1
I [20 - 16] address data
Instruction Read Zero M
register 2 Read 0 Write u
memory 0 Result
data 2 M address x
M Write
u Write Data 0
u register
Registers memory
I [15 - 11] x ALUOp data
Write 1
1 data
MemRead
ALUSrc
RegDst
I [15 - 0] Sign
extend
12
A bunch of lazy functional units
▪ Notice that each execution step uses a different functional unit.
▪ In other words, the main units are idle for most of the 8ns cycle!
— The instruction RAM is used for just 2ns at the start of the cycle.
— Registers are read once in ID (1ns), and written once in WB (1ns).
— The ALU is used for 2ns near the middle of the cycle.
— Reading the data memory only takes 2ns as well.
▪ That’s a lot of hardware sitting around doing nothing.
13
Putting those slackers to work
▪ We shouldn’t have to wait for the entire instruction to complete before
we can re-use the functional units.
▪ For example, the instruction memory is free in the Instruction Decode
step as shown below, so...
14
Decoding and fetching together
▪ Why don’t we go ahead and fetch the next instruction while we’re
decoding the first one?
15
Executing, decoding and fetching
▪ Similarly, once the first instruction enters its Execute stage, we can go
ahead and decode the second instruction.
▪ But now the instruction memory is free again, so we can fetch the third
instruction!
16
Making Pipelining Work
▪ We’ll make our pipeline 5 stages long, to handle load instructions as they
were handled in the multi-cycle implementation
— Stages are: IF, ID, EX, MEM, and WB
▪ We want to support executing 5 instructions simultaneously: one in each
stage.
17
Break datapath into 5 stages
▪ Each stage has its own functional units.
▪ Each stage can execute in 2ns
— Just like the multi-cycle implementation
IF ID EXE MEM WB
RegWrite
6 PM 7 8 9
Time
30 40 40 40 40 20
19
A pipeline diagram
Clock cycle
1 2 3 4 5 6 7 8 9
lw $t0, 4($sp) IF ID EX MEM WB
sub $v0, $a0, $a1 IF ID EX MEM WB
and $t1, $t2, $t3 IF ID EX MEM WB
or $s0, $s1, $s2 IF ID EX MEM WB
add $sp, $sp, -4 IF ID EX MEM WB
21
Pipelining Performance
Speedup=2400/1400~2
22
Pipeline Datapath: Resource Requirements
Clock cycle
1 2 3 4 5 6 7 8 9
lw $t0, 4($sp) IF ID EX MEM WB
lw $t1, 8($sp) IF ID EX MEM WB
lw $t2, 12($sp) IF ID EX MEM WB
lw $t3, 16($sp) IF ID EX MEM WB
lw $t4, 20($sp) IF ID EX MEM WB
23
Pipelining other instruction types
Clock cycle
1 2 3 4 5 6 7 8 9
add $sp, $sp, -4 IF ID EX WB
sub $v0, $a0, $a1 IF ID EX WB
lw $t0, 4($sp) IF ID EX MEM WB
or $s0, $s1, $s2 IF ID EX WB
lw $t1, 8($sp) IF ID EX MEM WB
24
Important Observation
Clock cycle
1 2 3 4 5 6 7 8 9
add $sp, $sp, -4 IF ID EX WB
sub $v0, $a0, $a1 IF ID EX WB
lw $t0, 4($sp) IF ID EX MEM WB
or $s0, $s1, $s2 IF ID EX WB
lw $t1, 8($sp) IF ID EX MEM WB
25
A solution: Insert NOP stages
▪ Enforce uniformity
— Make all instructions take 5 cycles.
— Make them have the same stages, in the same order
• Some stages will do nothing for some instructions
R-type IF ID EX NOP WB
Clock cycle
1 2 3 4 5 6 7 8 9
add $sp, $sp, -4 IF ID EX NOP WB
sub $v0, $a0, $a1 IF ID EX NOP WB
lw $t0, 4($sp) IF ID EX MEM WB
or $s0, $s1, $s2 IF ID EX NOP WB
lw $t1, 8($sp) IF ID EX MEM WB
26
Summary
27
Pipelined datapath
1
PCSrc
Read Read
register 1 data 1 ALU MemWrite
Read Instruction Zero
address [31-0] Read Read
0 Result Address
register 2 data 2
Write Data
Instruction register 1 MemToReg
memory
memory Registers ALUOp
Write
data ALUSrc Write Read
data data 1
28
What about control signals?
▪ The control signals are generated in the same way as in the single-cycle
processor—after an instruction is fetched, the processor decodes it and
produces the appropriate control values.
▪ But just like before, some of the control signals will not be needed until
some later stage and clock cycle.
▪ These signals must be propagated through the pipeline until they reach
the appropriate stage. We can just pass them in the pipeline registers,
along with the other data.
▪ Control signals can be categorized by the pipeline stage that uses them.
29
Pipelined datapath and control
1
0
ID/EX
WB EX/MEM
PCSrc
Control M WB MEM/WB
IF/ID EX M WB
4
Add
P
C Add
Shift
RegWrite left 2
Read Read
register 1 data 1 ALU MemWrite
Read Instruction Zero
address [31-0] Read Read
0 Result Address
register 2 data 2
Write Data
Instruction register 1 MemToReg
memory
memory Registers ALUOp
Write
data ALUSrc Write Read
data data 1
30
An example execution sequence
▪ Here’s a sample sequence of instructions to execute.
1000: lw $8, 4($29)
addresses in 1004: sub $2, $4, $5
decimal 1008: and $9, $10, $11
1012: or $16, $17, $18
1016: add $13, $14, $0
▪ We’ll make some assumptions, just so we can show actual data values.
— Each register contains its number plus 100. For instance, register $8
contains 108, register $29 contains 129, and so forth.
— Every data memory location contains 99.
▪ Our pipeline diagrams will follow some conventions.
— An X indicates values that aren’t important, like the constant field of
an R-type instruction.
— Question marks ??? indicate values we don’t know, usually resulting
from instructions coming before and after the ones in our example.
31
Cycle 1 (filling)
IF: lw $8, 4($29) ID: ??? EX: ??? MEM: ??? WB: ???
0 ID/EX
WB EX/MEM
PCSrc Control M WB MEM/WB
IF/ID EX M WB
4
Add
P
1004
C Add
Shift
RegWrite (?) left 2
???
32
Cycle 2
IF: sub $2, $4, $5 ID: lw $8, 4($29) EX: ??? MEM: ??? WB: ???
0 ID/EX
WB EX/MEM
PCSrc Control M WB MEM/WB
IF/ID EX M WB
4
Add
P
C 1008
Add
Shift
RegWrite (?) left 2
29 ???
1004 Read Read
register 1 data 1 ALU MemWrite (?)
Read Instruction Zero
X X ???
address [31-0] Read Read ???
0 Result Address
register 2 data 2
Write ??? MemToReg
??? Data
Instruction register 1 (?)
memory
memory ??? Registers ALUOp (???)
Write ???
data ALUSrc (?) ??? Write Read
data data 1
4 Sign ???
RegDst (?) ???
extend MemRead (?)
0
8 ???
0 ??? ??? ???
X ???
1
???
33
Cycle 3
IF: and $9, $10, $11 ID: sub $2, $4, $5 EX: lw $8, 4($29) MEM: ??? WB: ???
0 ID/EX
WB EX/MEM
PCSrc Control M WB MEM/WB
IF/ID EX M WB
4
Add
P
1012
C Add
Shift
RegWrite (?) left 2
4 104 129
1008 Read Read
register 1 data 1 ALU MemWrite (?)
Read Instruction Zero
5 X
address [31-0] Read Read 105 ???
0 Result Address
register 2 data 2
??? Write 133 MemToReg
Data
Instruction register 1 (?)
memory
memory ??? Registers ALUOp (add)
Write
??? Write Read ???
data ALUSrc (1)
X Sign 4
RegDst (0)
extend MemRead (?) ???
0
X 8
0 8 ??? ???
2 X
1
???
34
Cycle 4
IF: or $16, $17, $18 ID: and $9, $10, $11 EX: sub $2, $4, $5 MEM: lw $8, 4($29) WB: ???
0 ID/EX
WB EX/MEM
PCSrc Control M WB MEM/WB
IF/ID EX M WB
4
Add
P
1016
C Add
Shift
RegWrite (?) left 2
10 110 104
1012 Read Read
register 1 data 1 ALU MemWrite (0)
Read Instruction Zero
11 105
address [31-0] Read Read 111 133
0 Result Address
register 2 data 2
–1
??? Write Data MemToReg
Instruction register 1 (?)
memory
memory ??? Registers ALUOp (sub)
Write
data ALUSrc (0) X Write Read 99 ???
data data 1
X Sign X
RegDst (1)
extend MemRead (1) ???
0
X X 0 2 8 ???
9 2
1
???
20
Cycle 5 (full)
IF: add $13, $14, $0 ID: or $16, $17, $18 EX: and $9, $10, $11 MEM: sub $2, $4, $5 WB:
lw $8, 4($29)
1
0 ID/EX
WB EX/MEM
PCSrc Control M WB MEM/WB
IF/ID EX M WB
4
Add
P
C 1020
Add
Shift
RegWrite (1) left 2
17 117 110
1016 Read Read
register 1 data 1 ALU MemWrite (0)
Read Instruction Zero
18 111
address [31-0] Read Read 118 -1
0 Result Address
register 2 data 2
8 Write 110 Data MemToReg
Instruction register 1 (1)
memory
memory 99 Registers ALUOp (and)
Write
data ALUSrc (0) 105 Write Read X 99
data data 1
X Sign X
RegDst (1)
extend MemRead (0) 133
0
X X 0 9 2 8
16 9
1
99
36
Cycle 6 (emptying)
IF: ??? ID: add $13, $14, $0 EX: or $16, $17, $18 MEM: and $9, $10, $11 WB: sub
$2, $4, $5
1
0 ID/EX
WB EX/MEM
PCSrc Control M WB MEM/WB
IF/ID EX M WB
4
Add
P
???
C Add
Shift
RegWrite (1) left 2
14 114
117
1020 Read Read
register 1 data 1 ALU MemWrite (0)
Zero
Read Instruction 0 118
Read Read 0 110
address [31-0] 0
register 2 data 2 Result Address
2 119 MemToReg
Write
1 Data
Instruction register (0)
-1 memory
memory Registers ALUOp (or)
Write X
data ALUSrc (0) 111 Write Read
1
data data
X Sign X
RegDst (1)
extend MemRead (0)
0
X X 16 9
0
13 16
1
37
Cycle 7
IF: ??? ID: ??? EX: add $13, $14, $0 MEM: or $16, $17, $18 WB: and
$9, $10, $11
1
0 ID/EX
WB EX/MEM
PCSrc Control M WB MEM/WB
IF/ID EX M WB
4
Add
P
C ???
Add
Shift
RegWrite (1) left 2
??? Sign X
RegDst (1)
extend MemRead (0) 110
0
??? X 0 13 16 9
??? 13
1
110
38
Cycle 8
IF: ??? ID: ??? EX: ??? MEM: add $13, $14, $0 WB: or $16,
$17, $18
1
0 ID/EX
WB EX/MEM
PCSrc Control M WB MEM/WB
IF/ID EX M WB
4
Add
P
C ???
Add
Shift
RegWrite (1) left 2
119
39
Cycle 9
IF: ??? ID: ??? EX: ??? MEM: ??? WB: add
$13, $14, $0
1
0 ID/EX
WB EX/MEM
PCSrc Control M WB MEM/WB
IF/ID EX M WB
4
Add
P
C ???
Add
Shift
RegWrite (1) left 2
114
40
Performance Revisited
▪ Assuming the following functional unit latencies:
ALU
mem Read
Mem Write
41
The pipelining paradox
42
Pipeline Hazards
• Data hazards
– Dependency: Instruction depends on the result of a previous instruction
still in the pipeline
Add $s0, $t0, $t1