UNIT-IV
2. Pipelining
Contents
• Basic concepts of pipelining
• Throughput and speedup
• Pipeline Hazards
Pipelining
• Pipelining is
– a technique of decomposing a sequential process into SUB-
OPERATIONS,
– with each sub-process being executed in a special DEDICATED
SEGMENT that operates concurrently with all other segments
• A pipeline is
– a collection of processing segments through which binary
information flows
– Each segment performs partial processing dictated by the way the
task is partitioned
– Result obtained in each segment is transferred to the next segment
in the pipeline
– Final result is obtained after the data have passed through all
segments
• Several computations can be in progress in distinct segments
at the SAME TIME
• Each segment consists of an input register followed by a combinational
circuit
– register holds the data
– combinational circuit performs the SUB-OPERATION
– output of the combinational circuit in a given segment is applied to the next
segment’s input register
– A CLOCK is applied to all registers after enough time has elapsed to perform all
segment activity
• Registers provide isolation between each segment
– So, operate on distinct data simultaneously
– overlapping of computation is made
• Demonstration pipeline organization
– Suppose that we want to perform the
combined multiply and add operations
with a stream of numbers
Ai * Bi + Ci for i = 1, 2, 3, . . . , 7
– Each sub-operation is to be
implemented in a segment within a
pipeline
– Each segment has one or two registers
and a combinational circuit (shown in
Fig.)
• R1 through R5 are registers that receive
new data with every clock pulse
• multiplier and adder are combinational
circuits
– SUB-OPERATIONS performed in EACH
SEGMENT of the pipeline are as
follows:
Segment 1: R1 ← Ai, R2 ← Bi (input Ai and Bi)
Segment 2: R3 ← R1 * R2, R4 ← Ci (multiply, and input Ci)
Segment 3: R5 ← R3 + R4 (add Ci to the product)
Figure: Example of pipeline processing
• Effect of Each Clock is shown in Table
– It takes THREE clock pulses to fill up the pipe and retrieve the first output
from R5
– From there on, each clock produces a new output and moves the data one
step down the pipeline
TABLE: Content of Registers in Pipeline Example

Clock     Segment 1       Segment 2          Segment 3
Pulse     R1     R2       R3       R4        R5
1         A1     B1       --       --        --
2         A2     B2       A1*B1    C1        --
3         A3     B3       A2*B2    C2        A1*B1 + C1
4         A4     B4       A3*B3    C3        A2*B2 + C2
5         A5     B5       A4*B4    C4        A3*B3 + C3
6         A6     B6       A5*B5    C5        A4*B4 + C4
7         A7     B7       A6*B6    C6        A5*B5 + C5
8         --     --       A7*B7    C7        A6*B6 + C6
9         --     --       --       --        A7*B7 + C7
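The clock-by-clock behavior in the table can be reproduced with a short simulation sketch (illustrative only, not from the text; register loads are modeled as edge-triggered, computed from the previous clock's contents, and Ci is applied one clock after Ai and Bi, as in the table):

```python
def pipeline_multiply_add(A, B, C):
    """Simulate the three-segment pipeline computing Ai * Bi + Ci.
    Registers are edge-triggered: new contents are computed from the
    OLD contents, so all segments work concurrently."""
    n = len(A)
    R1 = R2 = R3 = R4 = None
    results = []
    for t in range(n + 2):                    # n inputs + 2 clocks to drain
        new_R5 = (R3 + R4) if R3 is not None else None   # segment 3: add
        new_R3 = (R1 * R2) if R1 is not None else None   # segment 2: multiply
        new_R4 = C[t - 1] if 1 <= t <= n else None       # Ci, one clock late
        new_R1 = A[t] if t < n else None                 # segment 1: inputs
        new_R2 = B[t] if t < n else None
        R1, R2, R3, R4 = new_R1, new_R2, new_R3, new_R4
        if new_R5 is not None:
            results.append(new_R5)
    return results

print(pipeline_multiply_add([1, 2, 3], [4, 5, 6], [7, 8, 9]))  # [11, 18, 27]
```

Note that the first result appears only after the pipe is filled (three clock pulses), matching the table.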
General Considerations
• A task is defined as the total operation performed in going through all the
segments of the pipeline
• BEHAVIOUR OF A PIPELINE can be illustrated with a SPACE-TIME DIAGRAM.
– It shows the segment utilization as a function of time
• SPACE-TIME DIAGRAM of a FOUR-SEGMENT PIPELINE is demonstrated in
below Fig.
– horizontal axis displays the time in clock cycles
– vertical axis gives the segment number (e.g., fetch, decode, read, and execute)
Figure: Space-time diagram for pipeline
• SIX tasks (T1 through T6) are executed in FOUR segments
(each task, for example, an addition Ai + Bi for i = 1, …, 6)
• In first clock cycle, task 1 is handled by segment 1
• In second clock cycle,
– segment 2 is busy with T1,
– segment 1 is busy with task T2
• Continuing in this manner,
– first task T1 is completed after the fourth clock cycle
– from then on, the pipe completes a task EVERY CLOCK CYCLE
– That is, once the pipeline is full, it takes ONLY ONE CLOCK PERIOD to obtain an output
Figure: Space-time diagram for pipeline (segments F = fetch, D = decode,
R = read, E = execute; the pipeline is full from clock cycle 4 onward)
• Consider,
– a ‘k’-segment pipeline with a clock cycle time ‘tp’, is used to
execute ‘n’ tasks
– first task T1 requires a time = ‘ktp’ to complete its operation
– remaining n-1 tasks emerge from the pipe at the rate of one task
per clock cycle and completed after a time = (n-1)tp
– Therefore, to complete ‘n’ tasks using a k-segment pipeline requires
k + (n - 1) clock cycles
For example,
– Above Fig. shows four segments and six tasks
– The time required to complete all the operations is
4 + (6 - 1) = 9 clock cycles
Numerical example
• Let,
– TIME it takes to process a sub-operation in each segment, tp = 20 ns
– No. of SEGMENTS in pipeline k = 4
• No. of tasks to be executed in sequence n = 100
– Time taken to complete 100 tasks in PIPELINE system
(k + n - 1)tp = (4 + 99) x 20 = 2060 ns
• Assuming that
– TIME it takes to process a task is the SAME in the PIPELINE and NON-PIPELINE
circuits then
tn = ktp = 4 x 20 = 80 ns
– So, Time taken to complete 100 tasks in NON-PIPELINE system
nktp = 100 x 80 = 8000 ns
• Then, speedup ratio = 8000/2060 = 3.88
• As the number of tasks increases, the speedup will approach 4 (= No. of
SEGMENTS in pipeline)
• If tn = 60 ns, the speedup becomes tn/tp = 60/20 = 3
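The relationships above can be checked with a short sketch (the function names are illustrative, not from the text):

```python
def pipeline_cycles(k, n):
    """Clock cycles needed to complete n tasks in a k-segment pipeline."""
    return k + (n - 1)

def speedup(k, n, tp, tn=None):
    """Speedup of a k-segment pipeline (cycle time tp) over a
    non-pipelined circuit with task time tn (defaults to k*tp)."""
    if tn is None:
        tn = k * tp                       # same task time in both circuits
    return (n * tn) / (pipeline_cycles(k, n) * tp)

print(pipeline_cycles(4, 6))              # 9 clock cycles, as in the figure
print(round(speedup(4, 100, 20), 2))      # 3.88, the ratio 8000/2060
```

As n grows, the speedup approaches tn/tp; with tn = k·tp that limit is k, the number of segments.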
• There are two areas of computer design where the
pipeline organization is applicable
• Arithmetic Pipeline
– divides an arithmetic operation into sub-operations for
execution in the pipeline segments
• Instruction Pipeline
– operates on a stream of instructions by overlapping the
fetch, decode, and execute phases of the instruction
cycle
Arithmetic Pipeline
• Used in very high speed computers
• Used to implement
– floating-point operations
– multiplication of fixed-point numbers
– similar computations encountered in scientific problems
• Example : Pipeline Unit for floating-point addition and subtraction
– inputs to the floating-point adder pipeline are two normalized floating-
point binary numbers
X = A x 2^a
Y = B x 2^b
– A and B are two fractions that represent the mantissas
– a and b are the exponents
• Floating-point addition and subtraction can be performed in four
segments, as shown in Fig.
Figure: Pipeline for floating-point addition and subtraction
(inputs: exponents a, b and mantissas A, B; registers R between segments;
Segment 1: compare exponents by subtraction; Segment 2: choose exponent,
align mantissa; Segment 3: add or subtract mantissas; Segment 4: adjust
exponent and normalize result)

• Registers (R) are placed between the segments to store intermediate
results
• Sub-operations that are performed in the FOUR segments are:
1. Compare the exponents
2. Align the mantissas
3. Add or subtract the mantissas
4. Normalize the result
• Numerical Example clarify the sub-operations performed in each segment
• Consider the two normalized floating-point numbers:
X = 0.9504 x 10^3
Y = 0.8200 x 10^2
• In FIRST SEGMENT:
– two exponents are subtracted to obtain 3 - 2 = 1
– The larger exponent 3 is chosen as the exponent of the result.
• In SECOND SEGMENT:
– shifts the mantissa of Y to the right to obtain
X = 0.9504 x 10^3
Y = 0.0820 x 10^3
– This aligns the two mantissas under the same exponent.
• In THIRD SEGMENT:
– Addition of the two mantissas produces the sum Z = 1.0324 x 10^3
• In FOURTH SEGMENT:
– Sum is adjusted by normalizing the result (fraction with a non-zero first digit)
– This is done by shifting the mantissa once to the right and incrementing the exponent by
one to obtain the Normalized Sum
Z = 0.10324 x 10^4
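The four sub-operations can be mirrored in a minimal decimal sketch (an illustration only; the actual pipeline segments are binary combinational circuits, and the function name is not from the text):

```python
def fp_add(a_mant, a_exp, b_mant, b_exp, digits=5):
    """Four-step floating-point addition on decimal (mantissa, exponent)
    pairs, mirroring the pipeline segments of the example."""
    # Segment 1: compare the exponents (by subtraction); keep the larger
    diff = a_exp - b_exp
    exp = max(a_exp, b_exp)
    # Segment 2: align the mantissas by shifting the smaller operand right
    if diff > 0:
        b_mant = b_mant / (10 ** diff)
    else:
        a_mant = a_mant / (10 ** -diff)
    # Segment 3: add the mantissas
    mant = a_mant + b_mant
    # Segment 4: normalize so the fraction has a non-zero first digit
    while abs(mant) >= 1.0:
        mant /= 10.0
        exp += 1
    while mant != 0 and abs(mant) < 0.1:
        mant *= 10.0
        exp -= 1
    return round(mant, digits), exp

print(fp_add(0.9504, 3, 0.8200, 2))   # (0.10324, 4), as in the example
```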
• Comparator, shifter, adder-subtractor, incrementer, and decrementer
in the floating-point pipeline are implemented with combinational
circuits
• Suppose that the
– time delays of the four segments are t1 = 60ns, t2 = 70ns,
t3 = 100ns, t4 = 80 ns
– interface registers have a delay of tr = 10 ns
• Clock Cycle is chosen to be
tp = t3 + tr = 110 ns (the largest segment delay, t3, determines the cycle)
• An equivalent NON-PIPELINE floating-point adder-subtractor will have a
delay time
tn = t1 + t2 + t3 + t4 + tr = 320 ns
• In this case the Pipelined Adder has a SPEEDUP over the Non-pipelined
Adder is
tn /tp =320/110 = 2.9
Instruction Pipeline
• An instruction pipeline reads consecutive
instructions from memory while previous
instructions are being executed in other segments
• In the most general case, the computer needs to
process each instruction with the following
sequence of steps:
1. Fetch the instruction from memory
2. Decode the instruction
3. Calculate the effective address
4. Fetch the operands from memory
5. Execute the instruction
6. Store the result in the proper place
• Difficulties that will prevent the instruction pipeline
from operating at its maximum rate
– Different segments may take different times to operate
on the incoming information
– Some segments are skipped for certain operations
• For example: a register mode instruction DOES NOT need an
effective address calculation.
– Two or more segments may require memory access at
the same time, causing one segment to WAIT until
another is finished with the memory
• Memory access conflicts are sometimes resolved by using TWO
memory buses for accessing instructions and data in separate
memory modules
Example: Four-Segment Instruction Pipeline
• Assume, decoding of the instruction can be combined
with the calculation of the effective address into one
segment
• Instruction execution and storing of the result can be
combined into one segment
(as for most instructions the result is saved in a
processor register)
• This reduces the instruction pipeline into FOUR segments
• Figure shows how the instruction cycle in the CPU can be processed with a
Four-segment Pipeline
• While the FIRST instruction is being executed in SEGMENT 4,
• the SECOND instruction in sequence is busy fetching an operand from
memory in SEGMENT 3,
• the effective address may be calculated in a separate arithmetic circuit
for the THIRD instruction,
• and, whenever the memory is available, the FOURTH and ALL subsequent
instructions can be fetched and placed in an instruction FIFO

Figure: Four-Segment CPU Pipeline
(Segment 1: fetch instruction from memory; Segment 2: decode instruction
and calculate effective address, with a branch test; Segment 3: fetch
operand from memory; Segment 4: execute instruction, with an interrupt
test; on a branch or interrupt, update PC and empty the pipe)
• Figure shows the operation of the instruction pipeline
– Time in the horizontal axis is divided into steps of equal duration
– Four segments are represented in the diagram with an abbreviated symbol
1. FI is the segment that fetches an instruction
2. DA is the segment that decodes the instruction and calculates the
effective address
3. FO is the segment that fetches the operand
4. EX is the segment that executes the instruction
Figure: Timing of instruction pipeline
• Assume, processor has SEPARATE Instruction and DATA Memories
– so that operation in FI and FO can proceed at the Same time
• In the ABSENCE of a branch instruction, each segment operates on
different instructions
– in step 4, instruction 1 is being executed in segment EX;
– the operand for instruction 2 is being fetched in segment FO;
– instruction 3 is being decoded in segment DA;
– and instruction 4 is being fetched from memory in segment FI
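The segment occupancy shown in the timing diagram can be generated with a small sketch (assuming no branches and separate instruction and data memories, as the text does; names are illustrative):

```python
def pipeline_timing(n_instructions, segments=("FI", "DA", "FO", "EX")):
    """Return, for each step, the instruction number occupying each
    segment of a four-segment instruction pipeline."""
    k = len(segments)
    steps = []
    for step in range(1, n_instructions + k):
        row = {}
        for s, name in enumerate(segments):
            instr = step - s              # instruction in this segment
            if 1 <= instr <= n_instructions:
                row[name] = instr
        steps.append(row)
    return steps

print(pipeline_timing(4)[3])  # step 4: {'FI': 4, 'DA': 3, 'FO': 2, 'EX': 1}
```

Step 4 reproduces the situation described above: instruction 1 in EX, 2 in FO, 3 in DA, 4 in FI.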
• Assume, instruction 3 is a BRANCH INSTRUCTION
– this instruction is decoded in segment DA in step 4,
– the transfer from FI to DA of the other instructions is HALTED until the branch
instruction is executed in step 6
– If the branch is TAKEN, a NEW Instruction is fetched in step 7
– If the branch is NOT Taken, the instruction fetched PREVIOUSLY in step 4 can be used
– The pipeline then continues until a new branch instruction is encountered
• Another DELAY may occur in the pipeline
– if the EX segment needs to store the result of the operation in the data memory
while FO segment needs to fetch an operand
– So, segment FO must WAIT until segment EX has finished its operation
• In general, there are THREE major difficulties
(Hazards) that cause the instruction pipeline to
DEVIATE from its normal operation
1. Resource Conflicts caused by access to memory by
two segments at the same time
- Resolved by using separate instruction and data
memories
2. Data Dependency Conflicts arise when an instruction
depends on the result of a previous instruction, but
this result is not yet available
3. Branch Difficulties arise from branch and other
instructions that change the value of PC
Data Dependency
• A data dependency occurs when an instruction needs data that are not
yet available
• Degradation of performance in an instruction pipeline is due to possible
COLLISION of Data or Address
For example,
– an instruction in the FO segment may need to fetch an operand that is being
generated at the same time by the previous instruction in segment EX
– Therefore, the second instruction must WAIT for the data to be made
available by the first instruction.
• Similarly, an ADDRESS dependency may occur when an operand address
CANNOT be calculated because the information needed by the
addressing mode is NOT available.
For example,
– an instruction with register indirect mode can NOT proceed to fetch the
operand if the previous instruction is loading the address into the register
– Therefore, the operand access to memory must be DELAYED until the
required address is available
• Pipelined computers deal with such conflicts caused by
DATA DEPENDENCIES in a variety of ways:
– hardware interlocks
– operand forwarding
– delayed load
• Insert HARDWARE interlocks
– An INTERLOCK is a CIRCUIT that detects instructions
whose source operands are destinations of instructions
farther up in the pipeline
– The INSTRUCTION whose source is not available is
DELAYED by enough clock cycles to resolve the conflict
• Operand forwarding
– uses SPECIAL HARDWARE to detect a conflict and then
avoid it by routing the data through special
(additional) paths between pipeline segments
For example,
– instead of transferring an ALU result into a destination
register, the HARDWARE checks the destination
operand, and if it is needed as a source in the next
instruction, it passes the result DIRECTLY into the ALU
Input, BYPASSING the register file.
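This bypass check can be sketched as follows (a simplified model of the forwarding hardware, not an actual implementation; all names are illustrative):

```python
def read_operand(reg, regfile, ex_dest, ex_result):
    """Fetch an operand with forwarding: if the register being read is
    the destination of the instruction still in the EX segment, bypass
    the register file and route the ALU result directly to the ALU input."""
    if reg == ex_dest:
        return ex_result          # special forwarding path
    return regfile[reg]           # normal register-file read

regs = {"R1": 5, "R2": 0}         # R2 still holds a stale value
print(read_operand("R2", regs, ex_dest="R2", ex_result=7))  # 7 (forwarded)
print(read_operand("R1", regs, ex_dest="R2", ex_result=7))  # 5 (from file)
```

Without the bypass, the second instruction would read the stale value 0 from R2 and produce a wrong result.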
• Delayed load
– COMPILER is designed to detect a data conflict and
REORDER the instructions as necessary to delay the
loading of the conflicting data by inserting
NO-OPERATION instructions.
Handling of Branch Instructions
• A branch instruction can be conditional or unconditional
• An unconditional branch always alters the sequential program
flow by loading the program counter with target address
• In a conditional branch, the control selects the target
instruction if the condition is satisfied
• Various HARDWARE techniques to minimize the performance
degradation caused by instruction branching:
– Pre-fetch target instruction
– branch target buffer
– loop buffer
– branch prediction
– delayed branch
• Pre-fetch target instruction
– One way of handling a conditional branch is to prefetch the target
instruction and instruction following the branch
– Both are saved until the branch is executed
– If the branch condition is successful, the pipeline continues from
the branch target instruction; an extension of this procedure is to
continue fetching instructions from both places until the branch
decision is made.
• Branch Target Buffer (BTB)
– BTB is an associative memory included in the fetch segment of the
pipeline
– Each entry in the BTB consists of the address of a previously
executed branch instruction and the target instruction for that
branch
– When the pipeline decodes a branch instruction
• If instruction is in the BTB, the instruction is available directly and
prefetch continues from the new path
• If the instruction is NOT in the BTB, the pipeline shifts to a new instruction
stream and stores the target instruction in the BTB
– Advantage: branch instructions that have occurred previously are
readily available in the pipeline without interruption.
• Loop buffer
– A variation of the BTB is the loop buffer.
– This is a small very high speed register file maintained by the instruction fetch
segment of the pipeline
– When a program loop is detected in the program, it is stored in the loop buffer
– and executed directly without having to access memory
• Branch prediction
– A pipeline with branch prediction uses some additional logic to guess the outcome
of a conditional branch instruction before it is executed
– The pipeline then begins prefetching the instruction stream from the predicted path
– A correct prediction eliminates the wasted time caused by branch penalties
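The text does not specify the guessing logic; one widely used scheme, shown here purely as an assumption, is a 2-bit saturating counter kept per branch:

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 0-1 predict not-taken,
    states 2-3 predict taken. One wrong outcome does not immediately
    flip a strongly held prediction (hysteresis)."""
    def __init__(self):
        self.state = 0                 # start as strongly not-taken

    def predict(self):
        return self.state >= 2         # True means "predict taken"

    def update(self, taken):
        """Move the counter toward the actual branch outcome."""
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

p = TwoBitPredictor()
for _ in range(3):
    p.update(True)                     # branch taken three times
print(p.predict())                     # True
p.update(False)                        # a single not-taken outcome...
print(p.predict())                     # True: prediction survives it
```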
• Delayed Branch
– This procedure employed in most RISC Processors
– COMPILER detects the Branch instructions and REARRANGES the machine language
CODE SEQUENCE by INSERTING useful INSTRUCTIONS that keep the pipeline
operating without interruptions.
Example:
– Insertion of a NO-OPERATION instruction after a Branch instruction.
– This causes the computer to fetch the target instruction during the execution of the
no-operation instruction, allowing a continuous flow of the pipeline
RISC Pipeline
Example: Three-Segment Instruction Pipeline
• Instruction cycle is divided into three sub-operations and implemented in
three segments:
I: Instruction fetch
A: ALU operation
E: Execute instruction
• I segment
– fetches the instruction from program memory
• A segment
– instruction is decoded and an ALU operation is performed
– It performs an operation for a data manipulation instruction
– it evaluates the effective address for a load or store instruction, (OR) it calculates the
branch address for a program control instruction
• E segment
– directs the output of the ALU to one of three destinations
– It transfers
• the result of the ALU operation into a destination register, (OR)
• the effective address to the data memory for a load or store, (OR)
• the branch address to the program counter
Delayed Load
• Consider operation of the following four
instructions:
1. LOAD:  R1 ← M[address 1]
2. LOAD:  R2 ← M[address 2]
3. ADD:   R3 ← R1 + R2
4. STORE: M[address 3] ← R3
• If the three-segment pipeline proceeds without interruptions,
– there will be a data conflict in instruction 3 because the operand in R2 is not yet available
in the A segment
• ‘E’ segment in clock cycle 4 is in the process of placing the memory data into R2.
• ‘A’ segment in clock cycle 4 is using the data from R2,
– but the value in R2 will NOT be the correct value since it has NOT yet been transferred from
memory
• COMPILER inserts a NO-OP (no-operation) instruction
– This instruction is fetched from memory but has NO OPERATION,
– thus wasting a clock cycle.
Figure: Three-segment pipeline timing with DATA CONFLICT
• Figure (b) shows the same program with a NO-OP instruction inserted after the load to R2
instruction
– The data is loaded into R2 in clock cycle 4
– The add instruction uses the value of R2 in step 5
• NO-OP instruction is used to advance one clock cycle in order to compensate for the DATA CONFLICT
in the pipeline
– NOP is performed in segment A during clock cycle 4 OR segment E during clock cycle 5
• Concept of delaying the use of the data loaded from memory is referred to as DELAYED LOAD
• Advantage:
– DATA DEPENDENCY is taken care of by the COMPILER rather than HARDWARE.
Figure: Three-segment pipeline timing with DELAYED LOAD
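The compiler-side fix can be sketched as a pass over the instruction list (a simplified model assuming a tuple representation of (operation, destination, sources); a real compiler would first try to move a useful instruction into the slot before falling back to a NO-OP):

```python
def insert_delayed_load_nops(program):
    """program: list of (op, dest, srcs) tuples. Insert a NOP after any
    LOAD whose destination register is used as a source by the very next
    instruction, resolving the data conflict in the three-segment pipeline."""
    out = []
    for i, (op, dest, srcs) in enumerate(program):
        out.append((op, dest, srcs))
        if op == "LOAD" and i + 1 < len(program):
            next_srcs = program[i + 1][2]
            if dest in next_srcs:
                out.append(("NOP", None, ()))   # delay the conflicting use
    return out

prog = [("LOAD", "R1", ()),
        ("LOAD", "R2", ()),
        ("ADD", "R3", ("R1", "R2")),
        ("STORE", None, ("R3",))]
print(insert_delayed_load_nops(prog))
```

Only the load to R2 gets a NOP after it: R1 is loaded two cycles before the ADD, so it causes no conflict, exactly as in the figure.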
Delayed Branch
• A branch instruction delays the pipeline
operation until the instruction at the branch
address is fetched
– Most RISC processors rely on the COMPILER to
REDEFINE the branches
– This method is referred to as DELAYED BRANCH
• Delayed branches
– analyze the instructions before and after the branch
– and REARRANGE the PROGRAM SEQUENCE by
inserting useful instructions in the delay steps
• The program for this EXAMPLE consists of FIVE INSTRUCTIONS:
– Load from memory to R1
– Increment R2
– Add R3 to R4
– Subtract R5 from R6
– Branch to address X
• In Fig(a) the COMPILER inserts two NO-OP instructions after the branch
• Branch address X is TRANSFERRED to PC in clock cycle 7
• Fetching of the instruction at X is delayed by two clock cycles by the no-op instructions
• After the program counter PC has been UPDATED, the instruction in X starts the fetch
phase at clock cycle 8
Figure: Example of delayed branch using NO-OP instructions
• The program in Fig.(b) is REARRANGED by placing the add and subtract
instructions after the branch instruction
• PC is updated to the value of X in clock cycle 5,
• add and subtract instructions are fetched from memory and executed in the
proper sequence
– if the load instruction is at address 101 and X is equal to 350,
– the branch instruction is fetched from address 103
– add instruction is fetched from address 104 and executed in clock cycle 6
– subtract instruction is fetched from address 105 and executed in clock cycle 7
– Since the value of X is transferred to PC in clock cycle 5 in the E segment,
– the instruction at the branch address (address 350) is fetched in
clock cycle 6
Figure: Example of delayed branch with rearranged instructions
(program addresses 101-105; branch target address 350 loaded into PC)