HRY-312 Computer Organization Lecture 6 Introduction to Pipelining
Recall: Performance Evaluation
What is the average CPI?
  The state diagram gives the CPI for each instruction type; the workload gives the frequency of each type.

  Type          CPI_i   Frequency   CPI_i x Freq_i
  Arith/Logic     4       40%           1.6
  Load            5       30%           1.5
  Store           4       10%           0.4
  Branch          3       20%           0.6

Average CPI: 4.1
Can we get CPI < 4.1? There seems to be a lot of idle hardware.
Why not overlap instructions?
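As a quick check, a minimal Python sketch of the weighted-CPI calculation above (the per-type CPIs and frequencies come from the table; the variable names are just for illustration):

# Weighted average CPI for the multicycle machine, using the mix above.
instruction_mix = {
    # type: (CPI for that type, fraction of the workload)
    "arith/logic": (4, 0.40),
    "load":        (5, 0.30),
    "store":       (4, 0.10),
    "branch":      (3, 0.20),
}

average_cpi = sum(cpi * freq for cpi, freq in instruction_mix.values())
print(f"{average_cpi:.1f}")   # 4.1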
[Figure: the multicycle datapath from previous lectures -- PC, instruction register, register file, ALU, and a single ideal memory, connected through muxes and driven by the control signals PCWr, PCWrCond, PCSrc, IorD, MemWr, IRWr, RegDst, RegWr, ALUSelA, ALUSelB, ALUOp, ExtOp, and MemtoReg.]
The Big Picture: Where are We Now?
The Five Classic Components of a Computer: Processor (Control and Datapath), Memory, Input, Output
Next Topics:
  Pipelining by analogy
  Pipeline hazards
Pipelining is Natural! Laundry Example
  Ann, Brian, Cathy, and Dave each have one load of clothes to wash, dry, and fold.
  Washer takes 30 minutes
  Dryer takes 40 minutes
  Folder takes 20 minutes
Sequential Laundry
[Figure: timeline from 6 PM to midnight showing loads A, B, C, D done one after another, each taking 30 + 40 + 20 minutes.]
  Sequential laundry takes 6 hours for 4 loads.
  If they learned pipelining, how long would laundry take?
Pipelined Laundry: Start Work ASAP
[Figure: timeline from 6 PM onward showing loads A, B, C, D overlapped -- a new load enters the dryer every 40 minutes.]
  Pipelined laundry takes 3.5 hours for 4 loads.
Pipelining Lessons
[Figure: the pipelined laundry timeline again, for reference.]
  Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload.
  Pipeline rate is limited by the slowest pipeline stage.
  Multiple tasks operate simultaneously using different resources.
  Potential speedup = number of pipe stages.
  Unbalanced lengths of pipe stages reduce the speedup.
  Time to fill the pipeline and time to drain it reduce the speedup.
  Stall for dependences.
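To make the 6-hour vs. 3.5-hour numbers concrete, here is a minimal sketch (the stage times come from the slide; the simple "first load plus slowest stage per extra load" model is an assumption that happens to match this example):

# Laundry pipeline: wash 30 min, dry 40 min, fold 20 min, 4 loads.
stage_times = [30, 40, 20]   # minutes per stage
loads = 4

# Sequential: every load runs all stages back to back.
sequential = loads * sum(stage_times)                          # 4 * 90 = 360 min = 6 hours

# Pipelined: the first load takes the full 90 minutes, then each later
# load finishes one slowest-stage time (40 min) behind the previous one.
pipelined = sum(stage_times) + (loads - 1) * max(stage_times)  # 90 + 3*40 = 210 min = 3.5 hours

print(sequential / 60, pipelined / 60)   # 6.0 3.5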
The Five Stages of Load
[Figure: a load instruction flowing through five cycles -- Ifetch, Reg/Dec, Exec, Mem, Wr.]
  Ifetch:  fetch the instruction from the Instruction Memory
  Reg/Dec: register fetch and instruction decode
  Exec:    calculate the memory address
  Mem:     read the data from the Data Memory
  Wr:      write the data back to the register file
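The same breakdown as data, in case you want to step through it in code (the stage names and actions are from the slide; the structure is just one way to write it down):

# The five pipeline stages a load (lw) passes through, one per cycle.
LOAD_STAGES = [
    ("Ifetch",  "fetch the instruction from instruction memory"),
    ("Reg/Dec", "read source registers and decode the instruction"),
    ("Exec",    "compute the effective memory address (base + offset)"),
    ("Mem",     "read the data from data memory"),
    ("Wr",      "write the loaded data back to the register file"),
]

for cycle, (stage, action) in enumerate(LOAD_STAGES, start=1):
    print(f"Cycle {cycle}: {stage:7s} - {action}")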
Note: These 5 stages were there all along.
[State diagram of the multicycle control, organized by stage:]
  Fetch     (0000): IR <= MEM[PC]; PC <= PC + 4
  Decode    (0001): ALUout <= PC + SX
  Execute:
    R-type  (0100): ALUout <= A fun B
    ORi     (0110): ALUout <= A op ZX
    LW      (1000): ALUout <= A + SX
    SW      (1011): ALUout <= A + SX
    BEQ     (0010): if A = B then PC <= ALUout
  Memory:
    LW      (1001): M <= MEM[ALUout]
    SW      (1100): MEM[ALUout] <= B
  Write-back:
    R-type  (0101): R[rd] <= ALUout
    ORi     (0111): R[rt] <= ALUout
    LW      (1010): R[rt] <= M
Pipelining: Improve performance by increasing throughput
  Ideal speedup is the number of stages in the pipeline. Do we achieve this?
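As a rough model (not stated on the slide, but implied by the fill/drain remark in the laundry lessons): with k equal-length stages of time t and n instructions,

T_{\text{seq}} = n \cdot k \cdot t, \qquad
T_{\text{pipe}} = (k + n - 1) \cdot t, \qquad
\text{Speedup} = \frac{n k}{k + n - 1} \;\longrightarrow\; k \text{ as } n \to \infty

so the ideal speedup of k is approached only for long instruction sequences and perfectly balanced stages.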
Basic Idea
What do we need to add to split the datapath into stages?
Graphically Representing Pipelines
Can help with answering questions like:
  How many cycles does it take to execute this code?
  What is the ALU doing during cycle 4?
  Use this representation to help understand datapaths.
Conventional Pipelined Execution Representation
[Figure: staircase diagram -- successive instructions each pass through IFetch, Dcd, Exec, Mem, WB, with each instruction starting one cycle after the previous one. Program flow runs down the page; time runs to the right.]
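A minimal sketch that prints this kind of staircase chart (the stage names match the slide; the ideal one-instruction-per-cycle issue with no stalls is a simplifying assumption):

# Print an ideal 5-stage pipeline occupancy chart: instruction i enters
# the pipeline at cycle i and advances one stage per cycle, no stalls.
STAGES = ["IFetch", "Dcd", "Exec", "Mem", "WB"]

def pipeline_chart(num_instructions: int) -> None:
    total_cycles = num_instructions + len(STAGES) - 1
    header = "".join(f"{'C' + str(c + 1):>8}" for c in range(total_cycles))
    print(" " * 8 + header)
    for i in range(num_instructions):
        row = [""] * total_cycles
        for s, stage in enumerate(STAGES):
            row[i + s] = stage          # instruction i is in stage s at cycle i + s
        print(f"inst {i:2d} " + "".join(f"{cell:>8}" for cell in row))

pipeline_chart(4)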
Single Cycle, Multiple Cycle, vs. Pipeline
[Figure: the same three instructions (Load, Store, R-type) on three implementations.]
  Single-cycle implementation: each instruction takes one long clock cycle; the cycle must be long enough for a load, so shorter instructions waste time.
  Multicycle implementation: Load takes Ifetch, Reg, Exec, Mem, Wr; Store takes Ifetch, Reg, Exec, Mem; the R-type begins its Ifetch only afterwards.
  Pipeline implementation: Load, Store, and the R-type overlap, each starting one cycle after the previous instruction.
Why Pipeline? Suppose we execute 100 instructions.
  Single-cycle machine:
    45 ns/cycle x 1 CPI x 100 inst = 4500 ns
  Multicycle machine:
    10 ns/cycle x 4.6 CPI (due to inst mix) x 100 inst = 4600 ns
  Ideal pipelined machine:
    10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns
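The same arithmetic as a minimal Python sketch (all numbers are from the slide):

# Execution time for 100 instructions on three machines.
insts = 100

single_cycle = 45 * 1 * insts        # 45 ns/cycle, CPI = 1              -> 4500 ns
multicycle   = 10 * 4.6 * insts      # 10 ns/cycle, CPI = 4.6 (inst mix) -> 4600 ns
pipelined    = 10 * (1 * insts + 4)  # 10 ns/cycle, CPI = 1, +4 cycles to drain -> 1040 ns

print(single_cycle, multicycle, pipelined)   # 4500 4600.0 1040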
Why pipeline (cont.)?
[Figure: Inst 0 through Inst 4 flowing through Im (instruction memory), Reg, ALU, Dm (data memory), Reg -- in any one clock cycle, every stage of the datapath is busy working on a different instruction.]
Can pipelining get us into trouble? Yes: Pipeline Hazards
  Structural hazards: attempt to use the same resource in two different ways at the same time
    - e.g., a combined washer/dryer would be a structural hazard, or the folder is busy doing something else (watching TV)
  Control hazards: attempt to make a decision before the condition is evaluated
    - e.g., washing football uniforms and needing the proper detergent level: you must see the result after the dryer before putting the next load in
    - branch instructions
  Data hazards: attempt to use an item before it is ready
    - e.g., one sock of a pair is in the dryer and one is in the washer; you can't fold until the sock gets from the washer through the dryer
    - an instruction depends on the result of a prior instruction still in the pipeline
Can always resolve hazards by waiting:
  pipeline control must detect the hazard
  and take action (or delay action) to resolve it
Single Memory is a Structural Hazard
[Figure: Load followed by Instr 1-4, each using Mem for instruction fetch and Mem again for data access -- with a single memory, the Load's data access and a later instruction's fetch need the memory in the same cycle.]
Detection is easy in this case! (right half highlight means read, left half write)
Structural hazards limit performance. Example: if there are 1.3 memory accesses per instruction and only one memory access is possible per cycle, then
  average CPI >= 1.3; otherwise the memory would be more than 100% utilized.
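A minimal sketch of that bound (the 1.3 accesses per instruction and the one-access-per-cycle limit are from the slide; the formula is the utilization argument spelled out):

# Lower bound on CPI imposed by a single memory port.
mem_accesses_per_inst = 1.3   # instruction fetch plus ~30% loads/stores
mem_ports_per_cycle = 1

# Each cycle can service at most one access, so the machine needs at
# least this many cycles per instruction on average.
min_cpi = mem_accesses_per_inst / mem_ports_per_cycle
print(min_cpi)   # 1.3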
Control Hazard Solution #1: Stall
[Figure: Add, Beq, Load -- the Load after the Beq cannot be fetched until the branch decision is known, leaving bubbles ("lost potential") in the pipeline.]
Stall: wait until the decision is clear.
  Impact: 2 lost cycles (i.e., 3 clock cycles per branch instruction) => slow
  Move the decision to the end of decode:
    saves 1 cycle per branch
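To quantify the stall penalty, a minimal sketch reusing the 20% branch frequency quoted on the next slide (that frequency is borrowed, not stated here):

# CPI impact of stalling 2 cycles on every branch.
branch_freq = 0.20     # fraction of instructions that are branches (from the prediction slide)
branch_cpi  = 3        # 1 base cycle + 2 stall cycles
other_cpi   = 1

cpi_with_stall = branch_freq * branch_cpi + (1 - branch_freq) * other_cpi
print(f"{cpi_with_stall:.1f}")   # 1.4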
Control Hazard Solution #2: Predict
[Figure: Add, Beq, Load -- the Load is fetched immediately after the Beq on the predicted path and is squashed if the prediction was wrong.]
Predict: guess one direction, then back up if wrong.
  Impact: 0 lost cycles per branch instruction if right, 1 if wrong (right ~50% of the time)
  Need to squash and restart the following instruction if wrong
  Produces a CPI on branches of (1 x 0.5 + 2 x 0.5) = 1.5
  Total CPI might then be: 1.5 x 0.2 + 1 x 0.8 = 1.1 (20% branches)
More dynamic scheme: keep a history per branch (~90% right)
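The same calculation as a minimal sketch (the 50% accuracy, 1-cycle misprediction penalty, and 20% branch frequency are from the slide):

# CPI with a simple predict-and-squash scheme.
predict_accuracy = 0.5     # static prediction is right about half the time
mispredict_penalty = 1     # one squashed instruction
branch_freq = 0.20

branch_cpi = 1 * predict_accuracy + (1 + mispredict_penalty) * (1 - predict_accuracy)  # 1.5
total_cpi = branch_cpi * branch_freq + 1 * (1 - branch_freq)                           # 1.1
print(round(branch_cpi, 2), round(total_cpi, 2))   # 1.5 1.1

# With a dynamic predictor that is ~90% accurate, branch CPI drops further:
dynamic_branch_cpi = 1 * 0.9 + 2 * 0.1   # 1.1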
Control Hazard Solution #3: Delayed Branch
[Figure: Add, Beq, Misc, Load -- the Misc instruction in the delay slot executes regardless of the branch outcome, and the Load is fetched once the branch decision is known.]
Delayed branch: redefine branch behavior so the branch takes effect after the next instruction.
  Impact: 0 clock cycles per branch instruction if the compiler can find an instruction to put in the delay slot (~50% of the time)
  As we launch more instructions per clock cycle, the delay slot becomes less useful.
Data Hazard on r1
  add r1,r2,r3
  sub r4,r1,r3
  and r6,r1,r7
  or  r8,r1,r9
  xor r10,r1,r11
Data Hazard on r1: Dependencies backwards in time are hazards
[Figure: the five instructions above in the pipeline (stages IF, ID/RF, EX, MEM, WB). The add writes r1 in its WB stage, but the sub, and, and or read r1 in their ID/RF stages before that write completes -- the dependence arrows point backwards in time. The xor reads r1 late enough to get the new value.]
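A minimal sketch of how you might spot these read-after-write hazards on paper (the (dest, sources) encoding and the 3-instruction hazard window are illustrative assumptions, not the hardware mechanism; the window assumes the register file does not write and read in the same cycle):

# Flag read-after-write (RAW) hazards in a short instruction sequence.
# Each instruction is (text, destination register, source registers).
program = [
    ("add r1,r2,r3",   "r1",  ["r2", "r3"]),
    ("sub r4,r1,r3",   "r4",  ["r1", "r3"]),
    ("and r6,r1,r7",   "r6",  ["r1", "r7"]),
    ("or  r8,r1,r9",   "r8",  ["r1", "r9"]),
    ("xor r10,r1,r11", "r10", ["r1", "r11"]),
]

# In a 5-stage pipeline with writes in WB and reads in ID, any consumer
# within 3 instructions of the producer reads the old register value.
HAZARD_WINDOW = 3

for i, (name, dest, _) in enumerate(program):
    for j in range(i + 1, min(i + 1 + HAZARD_WINDOW, len(program))):
        later_name, _, sources = program[j]
        if dest in sources:
            print(f"RAW hazard: '{later_name}' reads {dest} written by '{name}'")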
Data Hazard Solution: Forward result from one stage to another
[Figure: the same five instructions, but now the add's result is forwarded from the pipeline registers after EX and MEM directly to the ALU inputs of the sub, and, and or, so they see the new value of r1 without waiting for WB.]
The or is OK even without forwarding if we define register read/write properly (write in the first half of the cycle, read in the second half).
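A minimal sketch of the usual forwarding conditions (this follows the textbook EX-hazard / MEM-hazard formulation; the signal names such as ex_mem_rd are illustrative, not names defined on the slide):

# Forwarding-unit conditions for one ALU operand in a 5-stage pipeline.
# Forward from EX/MEM (the newest result) in preference to MEM/WB.
def forward_select(id_ex_rs, ex_mem_regwrite, ex_mem_rd, mem_wb_regwrite, mem_wb_rd):
    """Return where the ALU operand should come from."""
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == id_ex_rs:
        return "EX/MEM"    # result computed last cycle, not yet written back
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == id_ex_rs:
        return "MEM/WB"    # result two cycles old, being written back now
    return "REGFILE"       # no hazard: use the value read in ID

# Example: sub r4,r1,r3 right after add r1,r2,r3 -> r1 comes from EX/MEM.
print(forward_select(id_ex_rs=1, ex_mem_regwrite=True, ex_mem_rd=1,
                     mem_wb_regwrite=False, mem_wb_rd=0))   # EX/MEM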
Forwarding (or Bypassing): What about Loads? Dependencies backwards in time are hazards
[Figure: lw r1,0(r2) followed immediately by sub r4,r1,r3 -- the loaded value is not available until the end of the lw's Mem stage, but the sub needs it at the start of its EX stage.]
  Can't solve with forwarding alone: must delay/stall the instruction that depends on the load.
Forwarding (or Bypassing): What about Loads?
[Figure: the same lw/sub pair, but now the sub is stalled for one cycle (a bubble is inserted) so the loaded value can be forwarded from the lw's Mem stage to the sub's EX stage.]
  Can't solve with forwarding alone: must delay/stall the instruction that depends on the load.
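A minimal sketch of the load-use stall check (this is the textbook hazard-detection-unit condition; the signal names are illustrative):

# Hazard detection for the load-use case: if the instruction in EX is a
# load and its destination matches a source of the instruction in ID,
# stall the pipeline for one cycle (and then forward from MEM).
def must_stall(id_ex_memread, id_ex_rt, if_id_rs, if_id_rt):
    return id_ex_memread and id_ex_rt in (if_id_rs, if_id_rt)

# Example: lw r1,0(r2) in EX, sub r4,r1,r3 in ID -> stall one cycle.
print(must_stall(id_ex_memread=True, id_ex_rt=1, if_id_rs=1, if_id_rt=3))   # True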
Designing a Pipelined Processor
  Go back and examine your datapath and control diagram
  Associate resources with states
  Ensure that flows do not conflict, or figure out how to resolve conflicts
  Assert control in the appropriate stage
Summary: Pipelining reduces CPI by overlapping many instructions
  Average throughput of approximately 1 CPI with a fast clock
Utilize the capabilities of the datapath:
  start the next instruction while working on the current one
  limited by the length of the longest stage (plus fill/flush)
  detect and resolve hazards
What makes it easy?
  all instructions are the same length
  just a few instruction formats
  memory operands appear only in loads and stores
What makes it hard?
  structural hazards: suppose we had only one memory
  control hazards: need to worry about branch instructions
  data hazards: an instruction depends on a previous instruction