Slide 6
Basic concepts
1
Overview
Pipelining is widely used in modern processors.
Pipelining improves system performance in terms of Throughput.
(The latency of each instruction remains unchanged.)
Throughput = No. of instructions completed per unit time!
Pipelining improves Throughput because multiple instructions can
now run concurrently/simultaneously! In the ideal case (with no
hazards): Throughput approaches 1 instruction/cycle.
Pipelined organization requires sophisticated compilation
techniques.
2
Making the Execution of Programs Faster
• Use Faster circuit technology to build the processor and the main memory.
(Faster Clock Rate means Shorter Clock Cycle!)
• Arrange the hardware so that more than one operation can be performed in
Parallel / at the same time. (This is the Pipelining!)
• In the latter way, the Throughput (number of operations performed per
second) is increased, even though the Latency (Elapsed time needed to
perform any one operation) is not changed.
3
Traditional Pipeline Concept
• Laundry Example
• Ann, Brian, Cathy, Dave
each have one load of clothes
to wash, dry, and fold
• Washer takes 30 minutes
• Dryer takes 40 minutes
• Folding takes 20 minutes
4
Traditional Pipeline Concept
[Figure: sequential laundry timeline from 6 PM to midnight; each load takes 30 + 40 + 20 minutes]
• Sequential laundry takes 6 hours for 4 loads
• If they learned pipelining, how long would laundry take?
5
Traditional Pipeline Concept
[Figure: pipelined laundry timeline from 6 PM; stage times 30, 40, 40, 40, 40, 20 minutes; task order A, B, C, D]
• Pipelined laundry takes 3.5 hours for 4 loads
• When Ann puts her dress from the Washer into the Dryer, someone
else (say, Brian) can start using the Washer.
• After Ann leaves the Dryer and starts Folding, Brian can move to
the Dryer (from the washer, after waiting for 10 minutes) and
Cathy can start using the Washer!
• So, Pipelining = Concurrent Operations! 6
Traditional Pipeline Concept
[Figure: pipelined laundry timeline (as on the previous slide)]
• Pipelining doesn’t help latency of a single task, it helps throughput of
entire workload
• Pipeline rate limited by slowest pipeline stage (Dryer = 40 minutes)
• Multiple tasks simultaneously operating using different resources
• Potential speedup = Number of pipeline stages
• Unbalanced lengths of pipe stages reduce speedup (40 min.)
• Time to “fill” pipeline and time to “drain” it reduces speedup
• Stall for Dependences
7
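The 3.5-hour figure can be checked with a short sketch (Python, illustrative; stage order and times taken from the slides):

```python
# Simulate the pipelined laundry from the slides: washer 30 min,
# dryer 40 min, folding 20 min, four loads A-D. A load enters a
# stage as soon as both the load and the stage are free.

stage_times = [30, 40, 20]        # washer, dryer, folder
stage_free = [0, 0, 0]            # time at which each stage becomes free
finish = []
for load in "ABCD":
    t = 0                         # every load is ready at 6 PM (t = 0)
    for s, dur in enumerate(stage_times):
        start = max(t, stage_free[s])
        t = start + dur
        stage_free[s] = t
    finish.append(t)

print(finish[-1] / 60)            # 3.5 hours, vs. 4 * 90 min = 6 hours sequentially
```

The dryer (the slowest stage) sets the rhythm: after the pipeline fills, one load finishes every 40 minutes.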
Basic concepts (contd..)
•Processor executes a program by fetching and executing instructions one after the
other.
•This is known as sequential execution.
•If Fi refers to the fetch step, and Ei refers to the execution step of instruction Ii,
then sequential execution looks like:
Time →
I1: F1 E1   I2: F2 E2   I3: F3 E3
What if the execution of one instruction is overlapped with the fetching of the
next one?
8
Basic concepts (contd..)
•Computer has two separate hardware units, one for fetching instructions and one
for executing instructions.
•Instruction is fetched by instruction fetch unit and deposited in an intermediate
buffer B1.
•Buffer enables the instruction execution unit to execute the instruction while the
fetch unit is fetching the next instruction.
•Results of the execution are deposited in the destination location specified by the
instruction.
[Figure: Instruction fetch unit → interstage buffer B1 → Execution unit]
9
Basic concepts (contd..)
•Computer is controlled by a clock whose period is such that the fetch and execute
steps of any instruction can be completed in one clock cycle.
•First clock cycle:
- Fetch unit fetches an instruction I1 (F1) and stores it in B1.
•Second clock cycle:
- Fetch unit fetches an instruction I2 (F2) , and execution unit executes instruction I1 (E1).
•Third clock cycle:
- Fetch unit fetches an instruction I3 (F3), and execution unit executes instruction I2 (E2).
•Fourth clock cycle:
- Execution unit executes instruction I3 (E3).
Time
Clock cycle 1 2 3 4
Instruction
I1 F1 E1
I2 F2 E2
I3 F3 E3
10
Basic concepts (contd..)
• In each clock cycle, the fetch unit fetches the next instruction, while
the execution unit executes the current instruction stored in the
interstage buffer.
• Fetch and the execute units can be kept busy all the time.
• If this pattern of fetch and execute can be sustained for a long time,
the completion rate of instruction execution will be twice that
achievable by the sequential operation.
• Fetch and execute units constitute a two-stage pipeline.
• Each stage performs one step in processing of an instruction.
• Interstage storage buffer holds the information that needs to be passed from
the fetch stage to execute stage.
• New information gets loaded into the buffer every clock cycle.
11
Basic concepts (contd..)
•Suppose the processing of an instruction is divided into four steps:
F Fetch: Read the instruction from the memory.
D Decode: Decode the instruction and fetch the source operands.
E Execute: Perform the operation specified by the instruction.
W Write: Store the result in the destination location.
•There are four distinct hardware units, for each one of the steps.
•Information is passed from one unit to the next through an interstage buffer.
•Three interstage buffers connecting four units.
•As an instruction progresses through the pipeline, the information needed by the
downstream units must be passed along.
[Figure: four-stage pipeline — F: Fetch instruction → B1 → D: Decode instruction and fetch operands → B2 → E: Execute operation → B3 → W: Write results]
12
Basic concepts (contd..) Time
Clock cycle 1 2 3 4 5 6 7
Instruction
I1 F1 D1 E1 W1
I2 F2 D2 E2 W2
I3 F3 D3 E3 W3
I4 F4 D4 E4 W4
Clock cycle 1: F1
Clock cycle 2: D1, F2
Clock cycle 3: E1, D2, F3
Clock cycle 4: W1, E2, D3, F4
Clock cycle 5: W2, E3, D4
Clock cycle 6: W3, E4
Clock cycle 7: W4
13
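The cycle-by-cycle listing above can be generated mechanically. A small Python sketch of the ideal four-stage schedule (no hazards assumed):

```python
# Ideal 4-stage pipeline schedule: instruction i performs stage s
# (F, D, E, W, numbered 0..3) in clock cycle i + s.

stages = ["F", "D", "E", "W"]
schedule = {}                              # cycle -> active steps
for i in range(1, 5):                      # instructions I1..I4
    for s, name in enumerate(stages):
        schedule.setdefault(i + s, []).append(f"{name}{i}")

for cycle in sorted(schedule):
    print(f"Clock cycle {cycle}: {', '.join(schedule[cycle])}")
# Clock cycle 4: W1, E2, D3, F4  (all four stages busy)
```

From cycle 4 onward every stage is occupied, which is why throughput approaches one instruction per cycle.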
Basic concepts (contd..)

Clock cycle   1  2  3  4  5  6  7
Instruction
I1            F1 D1 E1 W1
I2               F2 D2 E2 W2
I3                  F3 D3 E3 W3
I4                     F4 D4 E4 W4

(a) Sequential execution: I1: F1 E1, I2: F2 E2, I3: F3 E3
(b) Hardware organization: [Figure: Instruction fetch unit → interstage buffer B1 → Execution unit]
(c) Pipelined execution: the table above; each instruction’s fetch overlaps the processing of earlier instructions.

[Figure: four-stage pipeline — F: Fetch instruction → B1 → D: Decode instruction and fetch operands → B2 → E: Execute operation → B3 → W: Write results]
Instruction = Fetch + Decode + Execute + Write
“Fetch” stage fetches the next instruction from the cache or main memory
“Decode” stage prepares and fetches the source operands
“Execute” => the actual execution; may involve the ALU for Add, Sub, Mul, Div, And, Or, Shift, etc.
“Write” stage saves/stores the result, such as writing to a register or memory location
(Textbook page: 457)
Here: Throughput => (approx.) 1 instruction/cycle (starting
from the 4th cycle). And, Latency: 4 cycles for each instruction 16
Role of cache memory
17
Pipeline Performance
• The potential increase in performance from pipelining is
proportional to the number of pipeline stages, say n.
• n stages => n instructions can run simultaneously, and the clock pulse can be
(ideally) as small as just 1/n-th of the previous pulse length!!!
• Ideally, the Throughput is Increased n-times!!!
• However, this increase would be achieved only if all pipeline stages require the
same time to complete, and there is no interruption (hazard) throughout
program execution.
• Unfortunately, this is not true.
18
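Why equal stage times matter can be shown with a quick sketch (Python; the stage-time vectors are illustrative, not from the slides):

```python
# Sketch: the ideal n-fold speedup holds only with balanced stages,
# because the clock period must fit the slowest stage.

def pipeline_speedup(stage_times, k):
    """Speedup over sequential execution of the same k instructions."""
    sequential = sum(stage_times) * k
    cycle = max(stage_times)                        # slowest stage sets the clock
    pipelined = (len(stage_times) + k - 1) * cycle  # fill time + 1 per instruction
    return sequential / pipelined

k = 10_000
print(pipeline_speedup([1, 1, 1, 1], k))   # balanced: approaches 4
print(pipeline_speedup([1, 2, 1, 1], k))   # one slow stage: approaches 2.5
```

A single stage that takes twice as long drags the whole pipeline down to the slow stage's rate, regardless of how fast the other stages are.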
Pipeline performance
19
Pipeline Performance
21
Data Hazards
22
Data Hazards
Time →
Clock cycle   1  2  3  4   5  6  7  8  9
Instruction
I1 (Mul)      F1 D1 E1 W1
I2 (Add)         F2 D2 D2A E2 W2
I3                  F3     D3 E3 W3
I4                         F4 D4 E4 W4
•In cycle 5, the Write stage is idle, because it has no data to work with.
•Information in buffer B2 must be retained until the execution of instruction I2 is
complete.
•Stage 2, and by extension stage 1, cannot accept new instructions because the
information in B1 cannot be overwritten.
•Steps D3 and F4 must be postponed.
•A data hazard is a condition in which either the source or the destination operand is
not available at the time expected in the pipeline.
24
Data hazards
•Data hazard is a situation in which the pipeline is stalled because the data to be
operated on are delayed.
•Consider two instructions:
I1: A = 3 + A
I2: B = 4 x A
•If A = 5, and I1 and I2 are executed sequentially, B=32.
•In a pipelined processor, the execution of I2 can begin before the execution of I1.
•The value of A used in the execution of I2 will be the original value of 5 leading to
an incorrect result.
•Thus, instructions I1 and I2 depend on each other, because the data used by I2
depends on the results generated by I1.
•Results obtained using sequential execution of instructions should be the same as
the results obtained from pipelined execution.
•When two instructions depend on each other, they must be performed in the correct
order.
25
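The I1/I2 dependency above can be made concrete (plain Python, mirroring the slide's values):

```python
# I1: A = 3 + A;  I2: B = 4 * A, with A initially 5.
A = 5

# Correct sequential order: I2 reads A after I1 has written it.
A_after_I1 = 3 + A           # I1 -> 8
B_correct = 4 * A_after_I1   # I2 -> 32

# Hazard: the pipeline lets I2 fetch A before I1's result is written.
stale_A = A                  # I2 reads the original value, 5
B_wrong = 4 * stale_A        # -> 20, not 32

print(B_correct, B_wrong)    # 32 20
```

The pipeline must therefore detect the dependency and delay (or forward to) I2 so that the pipelined result matches the sequential one.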
Data hazards (contd..)

[Figure: clock-cycle diagram for the Mul followed by the Add (and I3, I4); the Add’s decode is delayed until the Mul’s result is written]
•Mul instruction places the results of the multiply operation in register R4 at the end
of clock cycle 4.
•Register R4 is used as a source operand in the Add instruction. Hence the Decode
Unit decoding the Add instruction cannot proceed until the Write step of the first
instruction is complete.
•Data dependency arises because the destination of one instruction is used as a source
in the next instruction.
26
Data hazard
•Execution of the instruction occurs in the E stage of the
pipeline.
•Execution of most arithmetic and logic operations would take
only one clock cycle.
•However, some operations such as division would take more
time to complete.
27
Control or instruction hazard
•Pipeline may be stalled because an instruction is not available at the expected time.
•For example, while fetching an instruction a cache miss may occur, and hence the
instruction may have to be fetched from the main memory.
•Fetching the instruction from the main memory takes much longer than fetching the
instruction from the cache.
•Thus, the fetch cycle of the instruction cannot be completed in one cycle.
•For example, the fetching of instruction I2 results in a cache miss.
•Thus, F2 takes 4 clock cycles instead of 1.
Time →
Clock cycle   1  2  3  4  5  6  7  8  9
Instruction
I1            F1 D1 E1 W1
I2               F2 F2 F2 F2 D2 E2 W2
I3                           F3 D3 E3 W3
28
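The effect of the miss on completion times can be sketched (Python; assumes the 4-cycle fetch from the slide and that Fetch is the only bottleneck):

```python
# I2's fetch misses in the cache and takes 4 cycles; D, E, W still
# take one cycle each, so each instruction completes 3 cycles after
# its fetch finishes.

MISS_PENALTY = 4
completion = {}
fetch_done = 0
for i in (1, 2, 3):
    fetch_done += MISS_PENALTY if i == 2 else 1   # Fetch is sequential
    completion[i] = fetch_done + 3                # + D, E, W
print(completion)   # {1: 4, 2: 8, 3: 9}: I2 and I3 finish 3 cycles late
```

Without the miss, I2 would finish in cycle 5 and I3 in cycle 6; the 3-cycle penalty propagates to every later instruction.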
Example of how a cache miss causes an instruction hazard and the pipeline stalls:

Time →
Clock cycle   1  2  3  4  5  6  7  8  9
Instruction
I1            F1 D1 E1 W1
I2               F2 F2 F2 F2 D2 E2 W2
I3                           F3 D3 E3 W3

(b) Function performed by each processor stage in successive clock cycles:
Clock cycle   1  2  3  4  5  6  7  8  9
F: Fetch      F1 F2 F2 F2 F2 F3
D: Decode        D1 –  –  –  D2 D3
E: Execute          E1 –  –  –  E2 E3
W: Write               W1 –  –  –  W2 W3

Idle periods –
stalls (bubbles)
29
Control or instruction hazard (contd..)
•Fetch operation for instruction I2 results in a cache miss, and the instruction fetch
unit must fetch this instruction from the main memory.
•Suppose fetching instruction I2 from the main memory takes 4 clock cycles.
•Instruction I2 will be available in buffer B1 at the end of clock cycle 5.
•The pipeline resumes its normal operation at this point.
•Decode unit is idle in cycles 3 through 5.
•Execute unit is idle in cycles 4 through 6.
•Write unit is idle in cycles 5 through 7.
•Such idle periods are called stalls or bubbles.
•Once created in one of the pipeline stages, a bubble moves downstream until it
reaches the last unit.
Time
Clock cycle 1 2 3 4 5 6 7 8 9
Stage
F: Fetch F1 F2 F2 F2 F2 F3
30
Structural hazard
31
Pipeline Performance: Structural hazard
Time →
Clock cycle   1  2  3  4  5  6  7
Instruction
I1            F1 D1 E1 W1
I2 (Load)        F2 D2 E2 M2 W2    (Load X(R1),R2)
I3                  F3 D3 E3 W3
I4                     F4 D4 E4
I5                        F5 D5
•Memory address X+R1 is computed in step E2 in cycle 4, memory access takes place
in cycle 5, operand read from the memory is written into register R2 in cycle 6.
•Execution of instruction I2 takes two clock cycles 4 and 5.
•In cycle 6, both instructions I2 and I3 require access to register file.
•Pipeline is stalled because the register file cannot handle two operations at once.
33
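A small sketch (Python, illustrative) of how the Load's extra M stage puts two register-file writes in the same cycle:

```python
# Each instruction starts one cycle after the previous one; the Load
# carries an extra M (memory access) stage, so its W slips into the
# same cycle as the following instruction's W.

plan = [
    ("I1",        ["F", "D", "E", "W"]),
    ("I2 (Load)", ["F", "D", "E", "M", "W"]),
    ("I3",        ["F", "D", "E", "W"]),
]

writes = {}                      # cycle -> instructions writing the register file
for start, (name, stages) in enumerate(plan, start=1):
    for offset, stage in enumerate(stages):
        if stage == "W":
            writes.setdefault(start + offset, []).append(name)

print(writes)   # {4: ['I1'], 6: ['I2 (Load)', 'I3']}: conflict in cycle 6
```

With a single write port, one of the two writes in cycle 6 must be stalled; this is exactly the structural hazard on the slide.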
Pipelining and performance
34
Handling Data Hazards in Hardware: Operand
Forwarding
• Instead of from the register file, the second instruction can get data
directly from the output of ALU immediately just after the “E” (i.e.,
execution) stage of the previous instruction is completed.
• A special arrangement needs to be made to “forward” the output of
ALU to the input of ALU.
35
Operand forwarding
•Data hazard occurs because the destination of one instruction is used as the source
in the next instruction.
•Hence, instruction I2 has to wait for the data to be written in the register file by the
Write stage at the end of step W1.
•However, these data are available at the output of the ALU once the Execute stage
completes step E1.
•Delay can be reduced or even eliminated if the result of instruction I1 can be
forwarded directly for use in step E2.
•This is called “operand forwarding”.
36
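The forwarding decision itself is just a multiplexer choice. A hedged Python sketch (the function name and operand values are illustrative, not from the datapath figure):

```python
# Mux at an ALU input: take the value on the destination (result) bus
# when the previous instruction's destination register matches this
# source register; otherwise take the value latched in SRC1/SRC2.

def alu_operand(src_reg, src_value, prev_dest_reg, prev_result):
    if src_reg == prev_dest_reg:
        return prev_result       # forwarding path from the ALU output
    return src_value             # normal path through SRC1/SRC2

# I1: Mul R2, R3, R4  (result destined for R4; the value 12 is assumed)
# I2: Add R5, R4, R6  (source R4 is forwarded, no stall needed)
print(alu_operand("R4", None, "R4", 12))   # 12, usable in E2 before W1
print(alu_operand("R5", 7, "R4", 12))      # 7, no dependency -> SRC value
```

In hardware this comparison is done by the control logic during decode, and the mux select lines steer the destination bus onto the ALU input.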
Operand forwarding (contd..)

[Figure: datapath with Source 1 / Source 2 buses, register file, ALU, interstage registers SRC1, SRC2 and RSLT, and the destination bus; two blue forwarding lines run from the destination bus to multiplexers at the ALU inputs]

•Similar to the three-bus organization.
•Registers SRC1, SRC2 and RSLT have been added.
•SRC1, SRC2 and RSLT are interstage buffers for pipelined operation.
•SRC1 and SRC2 are part of buffer B2.
•RSLT is part of buffer B3.
•Data forwarding mechanism is shown by the two blue lines.
•Two multiplexers connected at the inputs to the ALU allow the data on the destination bus to be selected instead of the contents of the SRC1 and SRC2 registers.
37
Operand forwarding (contd..)

I1: Mul R2, R3, R4
I2: Add R5, R4, R6

[Figure: same forwarding datapath as on the previous slide]

Clock cycle 3:
- Instruction I2 is decoded, and a data dependency is detected.
- The operand not involved in the dependency, register R5, is loaded into register SRC1.
Clock cycle 4:
- The product produced by I1 is available in register RSLT.
- The forwarding connection allows the result to be used in step E2.
Instruction I2 proceeds without interruption.
38
Handling data dependency in software
39
Superscalar operation
40
Superscalar operation (contd..)
41
Superscalar operation (contd..)
[Figure: F: Instruction fetch unit → instruction queue → dispatch to the Integer unit and the Floating-point unit]

The instruction fetch unit is capable of reading two instructions at a
time and storing them in the instruction queue.

If there is one integer and one floating-point instruction, and no
hazards, then both instructions are dispatched in the same clock cycle.
42
Superscalar operation (contd..)
43
Superscalar operation (contd..)
Clock cycle   1  2  3   4   5   6   7
I1 (Fadd)     F1 D1 E1A E1B E1C W1
I2 (Add)      F2 D2 E2  W2
I3 (Fsub)        F3 D3  E3  E3  E3  W3
I4 (Sub)         F4 D4  E4  W4
Clock cycle 3:
- I1 and I2 begin execution, I2 completes execution. I3 is dispatched to floating
point unit and I4 is dispatched to integer unit.
Clock cycle 4:
- I1 continues execution, I3 begins execution, I2 completes Write stage,
I4 completes execution.
Clock cycle 5:
- I1 completes execution, I3 continues execution, and I4 completes Write.
Order of completion is I2, I4, I1, I3
45
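The completion order above can be reproduced with a small scheduling sketch (Python; latencies taken from the table, two-wide fetch assumed, and the floating-point unit assumed internally pipelined so it accepts a new instruction each cycle):

```python
# Two instructions are fetched per cycle; the integer unit executes in
# 1 cycle, the pipelined floating-point unit in 3. Write happens in the
# cycle after execution ends.

exec_cycles = {"Fadd": 3, "Add": 1, "Fsub": 3, "Sub": 1}
program = [("I1", "Fadd"), ("I2", "Add"), ("I3", "Fsub"), ("I4", "Sub")]

last_start = {"fp": 0, "int": 0}     # last cycle each unit accepted an op
completions = []
for k, (name, op) in enumerate(program):
    fetch = 1 + k // 2               # I1, I2 in cycle 1; I3, I4 in cycle 2
    decode = fetch + 1
    unit = "fp" if op.startswith("F") else "int"
    start = max(decode + 1, last_start[unit] + 1)
    last_start[unit] = start
    write = start + exec_cycles[op]  # Write one cycle after execute ends
    completions.append((write, name))

print([name for _, name in sorted(completions)])   # ['I2', 'I4', 'I1', 'I3']
```

Instructions complete out of program order (I2, I4, I1, I3), which is why superscalar processors need extra mechanisms to preserve the appearance of sequential execution.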