View from 30,000 Feet
Note: we haven’t bothered
showing multiplexors
• What is the role of the Add units? Source: H&P textbook
• Explain the inputs to the data memory unit
• Explain the inputs to the ALU
• Explain the inputs to the register unit 4
Clocking Methodology
Source: H&P textbook
• Which of the above units need a clock?
• What is being saved (latched) on the rising edge of the clock?
Keep in mind that the latched value remains there for an entire cycle
5
Implementing R-type Instructions
• Instructions of the form add $t1, $t2, $t3
• Explain the role of each signal
Source: H&P textbook
6
Implementing Loads/Stores
• Instructions of the form lw $t1, 8($t2) and sw $t1, 8($t2)
Where does this input come from?
7
Source: H&P textbook
Implementing J-type Instructions
• Instructions of the form beq $t1, $t2, offset
Source: H&P textbook 8
View from 10,000 Feet
9
Source: H&P textbook
View from 5,000 Feet
10
Source: H&P textbook
Latches and Clocks in a Single-Cycle Design
Instr Reg Data
PC ALU Addr
Mem File Memory
• The entire instruction executes in a single cycle
• Green blocks are latches
• At the rising edge, a new PC is recorded
• At the rising edge, the result of the previous cycle is recorded
• At the falling edge, the address of LW/SW is recorded so
we can access the data memory in the 2nd half of the cycle 11
Multi-Stage Circuit
Instead of executing the entire instruction in a single
cycle (a single stage), let’s break up the execution into
multiple stages, each separated by a latch
Instr Reg Data
PC L2 L3 ALU L4 L5
Mem File Memory
Reg
File
12
The Assembly Line
Unpipelined Start and finish a job before moving to the next
Jobs
Time
A B C
A B C Break the job into smaller stages
A B C
A B C
Pipelined
13
Performance Improvements?
• Does it take longer to finish each individual job?
• Does it take shorter to finish a series of jobs?
• What assumptions were made while answering these
questions?
• Is a 10-stage pipeline better than a 5-stage pipeline?
4
A 5-Stage Pipeline
register write in the first half of the clock cycle (dotted part)
register read in the second half of the clock cycle (solid part)
so that in CC5, I4 can use the register released by I1 (otherwise directly I5
will be able to use it)
5
Source: H&P textbook
A 5-Stage Pipeline
Use the PC to access the I-cache and increment PC by 4
all instructions go through all stages
for eg, add instruction does not require DM, but still it will
take 5 clock cycles (it will wait for that particular clock cycle)
6
DM - data memory
A 5-Stage Pipeline improved throughput, thorughput becomes 5 times
Read registers, compare registers, compute branch target; for now, assume
branches take 2 cyc (there is enough work that branches can easily take more)
branches dont work well with pipelines
7
A 5-Stage Pipeline
ALU computation, effective address computation for load/store
8
A 5-Stage Pipeline
Memory access to/from data cache, stores finish in 4 cycles
9
A 5-Stage Pipeline
Write result of ALU computation or load into register file
because of the solid and dotted lines, we are able to use
the writing of I1 for the reading of I4. otherwise we would have to
wait for I5 for reading what I1 wrote.
10
Pipeline Summary
note: no skipping of stages. so that there is no
overtaking (faster instruction overtaking the
slower one)
RR ALU DM RW
still 5 cycles taken (even if DM is empty). IM is not shown, only the latter 4 cycles are shown
ADD R1, R2, R3 Rd R1,R2 R1+R2 -- Wr R3
BEQ R1, R2, 100 Rd R1, R2 -- -- --
Compare, Set PC
LD 8[R3] R6 Rd R3 R3+8 Get data Wr R6
here R3 is the address
ST 8[R3] R6 Rd R3,R6 R3+8 Wr data --
11
Performance Improvements?
• Does it take longer to finish each individual job?
yes, possibbly due to additional latch delays
• Does it take shorter to finish a series of jobs?
• What assumptions were made while answering these
questions?
– No dependences between instructions
– Easy to partition circuits into uniform pipeline stages
– No latch overhead
• Is a 10-stage pipeline better than a 5-stage pipeline?
12
Quantitative Effects
• As a result of pipelining:
Time in ns per instruction goes up
Each instruction takes more cycles to execute
But… average CPI remains roughly the same
Clock speed goes up becomes 5 times for 5 stage pipeline
Total execution time goes down, resulting in lower
average time per instruction
Under ideal conditions, speedup
= ratio of elapsed times between successive instruction
completions
= number of pipeline stages = increase in clock speed
13
Conflicts/Problems
• I-cache and D-cache are accessed in the same cycle – it
helps to implement them separately since, IM and DM might happen in the same clock
cycle, we must build separate hardware for these ops
• Registers are read and written in the same cycle – easy to
deal with if register read/write time equals cycle time/2
• Branch target changes only at the end of the second stage
-- what do you do in the meantime?
14
Hazards
• Structural hazards: different instructions in different stages
(or the same stage) conflicting for the same resource
for eg reading and writing in the same register in the same clock cycle solution: half cycles mein break kardo
eg(2) im and dm operations in the same clock cycle solution: keep separate IM and DM
• Data hazards: an instruction cannot continue because it
needs a value that has not yet been generated by an
earlier instruction dependencies, an instruction might need the output of some instruction that has still
not completed its 5 stages
• Control hazard: fetch cannot continue because it does
not know the outcome of an earlier branch – special case
of a data hazard – separate category because they are
treated in different ways
15
Structural Hazards
• Example: a unified instruction and data cache
stage 4 (MEM) and stage 1 (IF) can never coincide
• The later instruction and all its successors are delayed
until a cycle is found when the resource is free these
are pipeline bubbles
• Structural hazards are easy to eliminate – increase the
number of resources (for example, implement a separate
instruction and data cache, add more register ports)
5
Data Hazards
• An instruction produces a value in a given pipeline stage
• A subsequent instruction consumes that value in a pipeline
stage
• The consumer may have to be delayed so that the time
of consumption is later than the time of production
6
Example 1 – No Bypassing i1 and i2 have data hazard
• Show the instruction occupying each stage in each cycle (no bypassing)
if I1 is R1+R2R3 and I2 is R3+R4R5 and I3 is R7+R8R9
CYC-1 CYC-2 CYC-3 CYC-4 CYC-5 CYC-6 CYC-7 CYC-8
IF IF IF IF IF IF IF IF
D/R D/R D/R D/R D/R D/R D/R D/R
ALU ALU ALU ALU ALU ALU ALU ALU
DM DM DM DM DM DM DM DM
RW RW RW RW RW RW RW RW 7
Example 1 – No Bypassing
• Show the instruction occupying each stage in each cycle (no bypassing)
if I1 is R1+R2R3 and I2 is R3+R4R5 and I3 is R7+R8R9
CYC-1 CYC-2 CYC-3 CYC-4 CYC-5 CYC-6 CYC-7 CYC-8
IF IF IF IF IF IF IF IF
L2
I1 I2 I3 I3 I3
waiting for I2 to proceed
I4 I5
L3
D/R D/R D/R D/R D/R D/R D/R D/R
I1 I2 I2 I2
concluded finally
I3 I4
in the second half
of the clock cycle
L4 ALU ALU ALU ALU ALU ALU ALU ALU
I1 I2 I3
L5
DM DM DM DM DM DM DM DM
I1 this is a bubble
I2 I3
RW RW RW RW RW RW RW RW 8
I1 I2
Example 2 – Bypassing
• Show the instruction occupying each stage in each cycle (with bypassing)
if I1 is R1+R2R3 and I2 is R3+R4R5 and I3 is R3+R8R9.
Identify the input latch for each input operand.
CYC-1 CYC-2 CYC-3 CYC-4 CYC-5 CYC-6 CYC-7 CYC-8
IF IF IF IF IF IF IF IF
D/R D/R D/R D/R D/R D/R D/R D/R
ALU ALU ALU ALU ALU ALU ALU ALU
DM DM DM DM DM DM DM DM
RW RW RW RW RW RW RW RW 9
Example 2 – Bypassing Li Lj means that Li has been overwritten by Lj
L5 L3 because by the end of cyc4, L4 has been updated by the ALU op
• Show the instruction occupying each stage in each cycle (with bypassing)
if I1 is R1+R2R3 and I2 is R3+R4R5 and I3 is R3+R8R9.
Identify the input latch for each input operand.
observe that the result has been stored in L3 for I1 in cyc3 itself, and it is directly usable now for I2 (dont have to wait for all 5 cycles)
CYC-1 CYC-2 CYC-3 CYC-4 CYC-5 CYC-6 CYC-7 CYC-8
IF IF IF IF IF IF IF IF
I1 I2 I3 I4 I5
L2
D/R D/R D/R D/R D/R D/R D/R D/R
I1 I2 I3 I4
L3 L3 L3 L4 L3 L5 L3
ALU ALU ALU ALU ALU ALU ALU ALU
I1 I2 I3
L4
DM DM DM DM DM DM DM DM
I1 I2 I3
L5
RW RW RW RW RW RW RW RW
I1 I2 I3
Problem 1
IF D/R ALUL3 DM RW
i1 i1 i1 i1 i1
IF D/R ALU DM RW
L4
i2 i2 i2 i2
i2
IF D/R ALU DM RW
add $1, $2, $3
IF D/R ALU DM RW
lw $4, 8($1)
11
Problem 2
L2 L3 L4 L5
IF D/R ALU DM RW
i1 i1
i1 i1 i1
IF D/R ALU DM RW
i2 i2
lw $1, 8($2) IF D/R ALU DM RW
i2 i2
DM is in cyc4, so there is one cycle delay i2 i2
(still faster than non-bypass)
lw $4, 8($1) IF D/R ALU DM RW
12
Problem 3 1) read from L5
2) writing will happen in the first half and hence DM can access the written part in the second half
L5
IF D/R ALU DM RW
i1 i1 i1 i1
i1
IF D/R ALU DM RW
i2 i2 i2
i2 i2
IF D/R ALU DM RW
lw $1, 8($2)
IF D/R ALU DM RW
sw $1, 8($3)
13
Problem 4
A 7 or 9 stage pipeline, RR and RW take an entire stage
IF IF Dec Dec RR ALU RW
ALU DM DM RW
lw $1, 8($2)
add $4, $1, $3 9
Problem 4
A 7 or 9 stage pipeline, RR and RW take an entire stage
instruction
fetch decode
IF IF Dec Dec RR ALU RW
ALU DM DM RW
lw $1, 8($2)
add $4, $1, $3 10
Problem 4
Without bypassing: 4 stalls
IF:IF:DE:DE:RR:AL:DM:DM:RW
IF: IF :DE:DE:DE:DE: DE :DE:RR:AL:RW
With bypassing: 2 stalls
IF:IF:DE:DE:RR:AL:DM:DM:RW
IF: IF :DE:DE:DE:DE: RR :AL:RW
lw $1, 8($2)
IF IF Dec Dec RR ALU RW
add $4, $1, $3
ALU DM DM RW
11