Slide 6

The document discusses the concept of pipelining in processors to improve performance. Pipelining involves dividing instruction processing into discrete stages and allowing multiple instructions to progress through the pipeline concurrently by ensuring each stage operates on a different instruction. This overlap of instruction processing across pipeline stages improves throughput by enabling the processor to complete more instructions per clock cycle compared to sequential processing.

Uploaded by imkawsarcr

Chapter 8: Pipelining

Basic concepts

• Speed of execution of programs can be improved in two ways:


• Faster circuit technology to build the processor and the memory.
• Arrange the hardware so that a number of operations can be performed
simultaneously. The number of operations performed per second is
increased although the elapsed time needed to perform any one operation is
not changed.
• Pipelining is an effective way of organizing concurrent activity in a
computer system to improve the speed of execution of programs.

1
Overview
 Pipelining is widely used in modern processors.
 Pipelining improves system performance in terms of throughput
(the latency of each instruction remains unchanged).
 Throughput = number of instructions completed per unit time.
Pipelining improves throughput because multiple instructions can
now run concurrently. In the ideal case (with no
hazards): throughput approaches 1 instruction/cycle.
 Pipelined organization requires sophisticated compilation
techniques.

2
Making the Execution of Programs Faster
• Use Faster circuit technology to build the processor and the main memory.
(Faster Clock Rate means Shorter Clock Cycle!)
• Arrange the hardware so that more than one operation can be performed in
Parallel / at the same time. (This is the Pipelining!)
• In the latter way, the Throughput (number of operations performed per
second) is increased, even though the Latency (Elapsed time needed to
perform any one operation) is not changed.

3
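The throughput-versus-latency point above can be checked with a quick sketch. This is illustrative only (function names are ours), and it assumes an ideal pipeline with equal-length stages and no hazards:

```python
def sequential_cycles(n_instructions, n_stages):
    # Each instruction passes through every stage before the next one starts.
    return n_instructions * n_stages

def pipelined_cycles(n_instructions, n_stages):
    # The first instruction fills the pipeline; afterwards one
    # instruction completes per clock cycle.
    return n_stages + (n_instructions - 1)

n, k = 100, 4
seq = sequential_cycles(n, k)    # 400 cycles
pipe = pipelined_cycles(n, k)    # 103 cycles
print(round(seq / pipe, 2))      # speedup approaches k for large n
```

For a long program the speedup approaches the number of stages k, which is the "potential speedup = number of pipeline stages" claim made later in these slides.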
Traditional Pipeline Concept

• Laundry Example
• Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold
• Washer takes 30 minutes
• Dryer takes 40 minutes
• “Folder” takes 20 minutes

4
Traditional Pipeline Concept

[Figure: sequential laundry timeline from 6 PM to midnight; each load takes 30 + 40 + 20 minutes, back to back]

• Sequential laundry takes 6 hours for 4 loads
• If they learned pipelining, how long would laundry take?

5
Traditional Pipeline Concept

[Figure: pipelined laundry timeline; stage occupancy 30, 40, 40, 40, 40, 20 minutes]

• Pipelined laundry takes 3.5 hours for 4 loads
• When Ann puts her dress from the washer into the dryer, someone else (say, Brian) can start using the washer.
• After Ann leaves the dryer and starts folding, Brian can move to the dryer (from the washer, after waiting for 10 minutes) and Cathy can start using the washer.
• So, pipelining = concurrent operations! 6
Traditional Pipeline Concept
• Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload
• Pipeline rate is limited by the slowest pipeline stage (Dryer = 40 minutes)
• Multiple tasks operate simultaneously using different resources
• Potential speedup = number of pipeline stages
• Unbalanced lengths of pipe stages reduce speedup (40 min.)
• Time to “fill” the pipeline and time to “drain” it reduces speedup
• Stalls for dependences

7
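The laundry arithmetic above can be sketched in a few lines. The helper names are ours; the formula assumes (as in the slides) that the slowest stage sets the pace once the pipeline is full:

```python
def sequential_time(loads, stage_times):
    # One load finishes completely before the next begins.
    return loads * sum(stage_times)

def pipelined_time(loads, stage_times):
    # The first load passes through every stage; each later load trails
    # the previous one by the slowest stage (here, the 40-minute dryer).
    return sum(stage_times) + (loads - 1) * max(stage_times)

stages = [30, 40, 20]                  # washer, dryer, folder (minutes)
print(sequential_time(4, stages))      # 360 minutes = 6 hours
print(pipelined_time(4, stages))       # 210 minutes = 3.5 hours
```

This reproduces the 6-hour and 3.5-hour figures on the slides.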
Basic concepts (contd..)
•Processor executes a program by fetching and executing instructions one after the
other.
•This is known as sequential execution.
•If Fi refers to the fetch step, and Ei refers to the execution step of instruction Ii,
then sequential execution looks like:
Time →   F1 E1   F2 E2   F3 E3   (sequential execution)

What if the execution of one instruction is overlapped with the fetching of the
next one?

8
Basic concepts (contd..)
•Computer has two separate hardware units, one for fetching instructions and one
for executing instructions.
•Instruction is fetched by instruction fetch unit and deposited in an intermediate
buffer B1.
•Buffer enables the instruction execution unit to execute the instruction while the
fetch unit is fetching the next instruction.
•Results of the execution are deposited in the destination location specified by the
instruction.

[Figure: Instruction fetch unit → interstage buffer B1 → Execution unit]

9
Basic concepts (contd..)
•Computer is controlled by a clock whose period is such that the fetch and execute
steps of any instruction can be completed in one clock cycle.
•First clock cycle:
- Fetch unit fetches an instruction I1 (F1) and stores it in B1.
•Second clock cycle:
- Fetch unit fetches an instruction I2 (F2) , and execution unit executes instruction I1 (E1).
•Third clock cycle:
- Fetch unit fetches an instruction I3 (F3), and execution unit executes instruction I2 (E2).
•Fourth clock cycle:
- Execution unit executes instruction I3 (E3).

Clock cycle    1    2    3    4
I1             F1   E1
I2                  F2   E2
I3                       F3   E3

10
Basic concepts (contd..)

• In each clock cycle, the fetch unit fetches the next instruction, while
the execution unit executes the current instruction stored in the
interstage buffer.
• The fetch and execute units can be kept busy all the time.
• If this pattern of fetch and execute can be sustained for a long time,
the completion rate of instruction execution will be twice that
achievable by the sequential operation.
• Fetch and execute units constitute a two-stage pipeline.
• Each stage performs one step in processing of an instruction.
• Interstage storage buffer holds the information that needs to be passed from
the fetch stage to execute stage.
• New information gets loaded into the buffer every clock cycle.

11
Basic concepts (contd..)
•Suppose the processing of an instruction is divided into four steps:
F Fetch: Read the instruction from the memory.
D Decode: Decode the instruction and fetch the source operands.
E Execute: Perform the operation specified by the instruction.
W Write: Store the result in the destination location.
•There is a distinct hardware unit for each of the four steps.
•Information is passed from one unit to the next through an interstage buffer.
•Three interstage buffers connecting four units.
•As an instruction progresses through the pipeline, the information needed by the
downstream units must be passed along.

Interstage buffers:

F (Fetch instruction) → B1 → D (Decode instruction and fetch operands) → B2 → E (Execute operation) → B3 → W (Write results)

12
Basic concepts (contd..)

Clock cycle    1    2    3    4    5    6    7
I1             F1   D1   E1   W1
I2                  F2   D2   E2   W2
I3                       F3   D3   E3   W3
I4                            F4   D4   E4   W4

Clock cycle 1: F1
Clock cycle 2: D1, F2
Clock cycle 3: E1, D2, F3
Clock cycle 4: W1, E2, D3, F4
Clock cycle 5: W2, E3, D4
Clock cycle 6: W3, E4
Clock cycle 7: W4
13
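The per-cycle listing above follows a simple rule: in an ideal pipeline, instruction i (1-based) performs stage s (0-based) in cycle i + s. A small sketch (our own helper, not from the textbook) generates the whole schedule:

```python
def schedule(n_instructions, stages=("F", "D", "E", "W")):
    # Instruction i (1-based) performs stage s (0-based) in cycle i + s,
    # assuming an ideal pipeline with no hazards.
    by_cycle = {}
    for i in range(1, n_instructions + 1):
        for s, name in enumerate(stages):
            by_cycle.setdefault(i + s, []).append(f"{name}{i}")
    return by_cycle

for cycle, steps in sorted(schedule(4).items()):
    print(f"Clock cycle {cycle}: {', '.join(steps)}")
```

Running it reproduces the listing on the slide, e.g. cycle 4 is W1, E2, D3, F4 and cycle 6 is W3, E4.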
Basic concepts (contd..)

Clock cycle    1    2    3    4    5    6    7
I1             F1   D1   E1   W1
I2                  F2   D2   E2   W2
I3                       F3   D3   E3   W3
I4                            F4   D4   E4   W4

During clock cycle #4:


•Buffer B1 holds instruction I3, which is being decoded by the instruction-decoding
unit. Instruction I3 was fetched in cycle 3.
•Buffer B2 holds the source and destination operands for instruction I2. It also holds
the information needed for the Write step (W2) of instruction I2. This information
will be passed to the stage W in the following clock cycle.
•Buffer B3 holds the results produced by the execution unit and the destination
information for instruction I1.
14
Pipelining: How Implemented in a Computer? Also:
How Pipelining Improves Throughput, but not Latency

Throughput: number of instructions completed per unit time.
Latency: number of clock cycles elapsed between the starting and ending of an instruction.

Suppose: Instruction = Fetch + Execution.

(a) Sequential execution:  F1 E1   F2 E2   F3 E3

(b) Hardware organization:  Instruction fetch unit → interstage buffer B1 → Execution unit

(c) Pipelined execution:
    Clock cycle    1    2    3    4
    I1             F1   E1
    I2                  F2   E2
    I3                       F3   E3

Fig. 8.1. Basic idea of instruction pipelining

Note: Sequential execution (fig. a) => throughput = 0.5 instruction/cycle;
Pipelined execution (fig. c) => (approx.) 1 instruction/cycle (starting from the 2nd cycle).
In both cases: latency of each instruction is 2 cycles.
15
Use the Idea of Pipelining in a Computer

(a) Instruction execution divided into four steps:
    Instruction = Fetch + Decode + Execution + Write
    The “Fetch” stage fetches the next instruction from the cache or main memory.
    The “Decode” stage prepares and fetches the source operands.
    “Execution” => the actual execution; may involve the ALU for Add, Sub, Mul, Div, And, Or, Shift, etc.
    The “Write” stage saves/stores the result, such as writing to a register or memory location.

(b) Hardware organization:
    F (Fetch instruction) → B1 → D (Decode instruction and fetch operands) → B2 → E (Execute operation) → B3 → W (Write results)

Figure 8.2. A 4-stage pipeline. (Textbook page: 457)

Here: throughput => (approx.) 1 instruction/cycle (starting from the 4th cycle). And, latency: 4 cycles for each instruction. 16
Role of cache memory

• Each stage in the pipeline is expected to complete its operation in


one clock cycle:
• Clock period should be sufficient to complete the longest task.
• Units which complete the tasks early remain idle for the remaining clock
period.
• Tasks being performed in different stages should require about the same
amount of time for pipelining to be effective.
• If instructions are to be fetched from the main memory, the
instruction fetch stage could take as much as ten times longer than
the other stage operations inside the processor.
• However, if instructions are to be fetched from the cache memory
which is on the processor chip, the time required to fetch the
instruction would be more or less similar to the time required for
other basic operations.

17
Pipeline Performance
• The potential increase in performance from pipelining is
proportional to the number of pipeline stages, say n.
• n stages => n instructions can be in flight simultaneously, and the clock period can
(ideally) be as small as 1/n-th of the non-pipelined clock period.
• Ideally, the throughput is increased n-fold.
• However, this increase would be achieved only if all pipeline stages require the
same time to complete, and there is no interruption (hazard) throughout
program execution.
• Unfortunately, this is not true in practice.

18
Pipeline performance

• Potential increase in performance achieved by using pipelining is


proportional to the number of pipeline stages.
• For example, if the number of pipeline stages is 4, then the rate of instruction
processing is 4 times that of sequential execution of instructions.
• Pipelining does not cause a single instruction to be executed faster, it is the
throughput that increases.
• This rate can be achieved only if the pipelined operation can be
sustained without interruption throughout program execution.
• If a pipelined operation cannot be sustained without interruption,
the pipeline is said to “stall”.
• A condition that causes the pipeline to stall is called a “hazard”.

19
Pipeline Performance

• Again, pipelining does not result in individual


instructions being executed faster; rather, it is the
throughput that increases.
• Throughput is measured by the rate at which
instruction execution is completed.
• Pipeline stall causes degradation in pipeline
performance.
• We need to identify all hazards that may cause the
pipeline to stall and to find ways to minimize their
impact.
20
Pipeline Performance
• Any condition that causes a pipeline to stall is called a
Hazard
• Data hazard – any condition in which either the source or
the destination operands of an instruction are not
available at the time expected in the pipeline. So some
operation has to be delayed, and the pipeline stalls.
• Instruction (control) hazard – a delay in the availability of
an instruction causes the pipeline to stall.
• Structural hazard – the situation when two instructions
require the use of a given hardware resource at the same
time.

21
Data Hazards

• We must ensure that the results obtained when instructions are


executed in a pipelined processor are identical to those obtained
when the same instructions are executed sequentially.
• Hazard occurs
A ← 3 + R1
B←4×A
• No hazard
A←5×C
B ← 20 + C
• When two operations depend on each other, they must be
executed sequentially in the correct order.
• Another example:
Mul R2, R3, R4  R4 = R2 × R3
Add R5, R4, R6  R6 = R4 + R5

22
Data Hazards

Clock cycle    1    2    3    4    5    6    7    8    9
I1 (Mul)       F1   D1   E1   W1
I2 (Add)            F2   D2   D2   D2A  E2   W2
I3                       F3             D3   E3   W3
I4                                 F4        D4   E4   W4

Mul R2, R3, R4
Add R5, R4, R6

Figure 8.6. Pipeline stalled by data dependency between D2 and W1.


23
Clock cycle    1    2    3    4    5    6    7    8    9
I1 (Mul)       F1   D1   E1   W1
I2 (Add)            F2   D2   D2   D2A  E2   W2
I3                       F3             D3   E3   W3
I4                                 F4        D4   E4   W4

Figure 8.6. Pipeline stalled by data dependency between D2 and W1.

Data hazard (contd..)

•In cycles 5 and 6, the Write stage is idle, because it has no data to work with.
•Information in buffer B2 must be retained until the execution of instruction I2 is
complete.
•Stage 2, and by extension stage 1, cannot accept new instructions because the
information in B1 cannot be overwritten.
•Steps D3 and F4 must be postponed.
•A data hazard is a condition in which either the source or the destination operand is
not available at the time expected in the pipeline.
24
Data hazards
•Data hazard is a situation in which the pipeline is stalled because the data to be
operated on are delayed.
•Consider two instructions:
I1: A = 3 + A
I2: B = 4 x A
•If A = 5, and I1 and I2 are executed sequentially, B=32.
•In a pipelined processor, the execution of I2 can begin before the execution of I1.
•The value of A used in the execution of I2 will be the original value of 5 leading to
an incorrect result.
•Thus, instructions I1 and I2 depend on each other, because the data used by I2
depends on the results generated by I1.
•Results obtained using sequential execution of instructions should be the same as
the results obtained from pipelined execution.
•When two instructions depend on each other, they must be performed in the correct
order.

25
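The A = 3 + A, B = 4 × A example above can be made concrete with a toy sketch (our own illustration, not a real pipeline model): the "hazard" case lets I2 read A before I1's write-back has happened.

```python
def run_sequential(a):
    a = 3 + a          # I1: A = 3 + A
    b = 4 * a          # I2: B = 4 * A  (sees the updated A)
    return b

def run_with_hazard(a):
    stale_a = a        # I2 reads A before I1's write-back completes
    a = 3 + a          # I1: A = 3 + A
    b = 4 * stale_a    # I2 uses the stale value of A
    return b

print(run_sequential(5))   # 32 (correct, as on the slide)
print(run_with_hazard(5))  # 20 (incorrect: used the original A = 5)
```

The whole point of hazard detection is to make the pipelined machine always produce the `run_sequential` answer.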
Data hazards (contd..)

Clock cycle    1    2    3    4    5    6    7    8    9
I1 (Mul)       F1   D1   E1   W1                          Mul R2, R3, R4
I2 (Add)            F2   D2   D2   D2A  E2   W2           Add R5, R4, R6
I3                       F3             D3   E3   W3
I4                                 F4        D4   E4   W4

•Mul instruction places the results of the multiply operation in register R4 at the end
of clock cycle 4.
•Register R4 is used as a source operand in the Add instruction. Hence the Decode
Unit decoding the Add instruction cannot proceed until the Write step of the first
instruction is complete.
•Data dependency arises because the destination of one instruction is used as a source
in the next instruction.
26
Data hazard
•Execution of the instruction occurs in the E stage of the
pipeline.
•Execution of most arithmetic and logic operations would take
only one clock cycle.
•However, some operations such as division would take more
time to complete.

27
Control or instruction hazard
•Pipeline may be stalled because an instruction is not available at the expected time.
•For example, while fetching an instruction a cache miss may occur, and hence the
instruction may have to be fetched from the main memory.
•Fetching the instruction from the main memory takes much longer than fetching the
instruction from the cache.
•Thus, the fetch cycle of the instruction cannot be completed in one cycle.
•For example, the fetching of instruction I2 results in a cache miss.
•Thus, F2 takes 4 clock cycles instead of 1.

Clock cycle    1    2    3    4    5    6    7    8    9
I1             F1   D1   E1   W1
I2                  F2   F2   F2   F2   D2   E2   W2
I3                                      F3   D3   E3   W3
28
Pipeline Performance: Instruction hazard

(a) Instruction execution steps in successive clock cycles:

Clock cycle    1    2    3    4    5    6    7    8    9
I1             F1   D1   E1   W1
I2                  F2   F2   F2   F2   D2   E2   W2
I3                                      F3   D3   E3   W3

(b) Function performed by each processor stage in successive clock cycles:

Clock cycle    1    2    3    4    5    6    7    8    9
F: Fetch       F1   F2   F2   F2   F2   F3
D: Decode           D1   idle idle idle D2   D3
E: Execute               E1   idle idle idle E2   E3
W: Write                      W1   idle idle idle W2   W3

Figure 8.4. Pipeline stall caused by a cache miss in F2.

Example of how a cache miss causes an instruction hazard and the pipeline stalls.
Idle periods – stalls (bubbles).
29
Control or instruction hazard (contd..)
•Fetch operation for instruction I2 results in a cache miss, and the instruction fetch
unit must fetch this instruction from the main memory.
•Suppose fetching instruction I2 from the main memory takes 4 clock cycles.
•Instruction I2 will be available in buffer B1 at the end of clock cycle 5.
•The pipeline resumes its normal operation at this point.
•Decode unit is idle in cycles 3 through 5.
•Execute unit is idle in cycles 4 through 6.
•Write unit is idle in cycles 5 through 7.
•Such idle periods are called as stalls or bubbles.
•Once created in one of the pipeline stages, a bubble moves downstream until it
reaches the last unit.

Clock cycle    1    2    3    4    5    6    7    8    9
F: Fetch       F1   F2   F2   F2   F2   F3
D: Decode           D1   idle idle idle D2   D3
E: Execute               E1   idle idle idle E2   E3
W: Write                      W1   idle idle idle W2   W3

30
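The completion times in Figure 8.4 follow directly from the miss penalty: a k-cycle fetch adds k − 1 bubble cycles that the slow instruction and everything behind it inherit. A sketch (our own helper, valid for this single-stall scenario):

```python
def completion_cycle(i, n_stages=4, slow_instr=None, fetch_cycles=1):
    # Ideal completion is cycle i + n_stages - 1; a multi-cycle fetch of
    # instruction `slow_instr` delays it and every later instruction by
    # (fetch_cycles - 1) bubble cycles.
    extra = fetch_cycles - 1 if slow_instr is not None and i >= slow_instr else 0
    return i + n_stages - 1 + extra

# Figure 8.4: I2's fetch takes 4 cycles because of a cache miss.
print([completion_cycle(i, slow_instr=2, fetch_cycles=4) for i in (1, 2, 3)])
# → [4, 8, 9]: W1 in cycle 4, W2 in cycle 8, W3 in cycle 9
```

These match the W row of the stage-view figure above.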
Structural hazard

• Two instructions require the use of a hardware resource at the same


time.
• Most common case is in access to the memory:
• One instruction needs to access the memory as part of the Execute or Write
stage.
• Other instruction is being fetched.
• If instructions and data reside in the same cache unit, only one instruction
can proceed and the other is delayed.
• Many processors have separate data and instruction caches to avoid
this delay.
• In general, structural hazards can be avoided by providing sufficient
resources on the processor chip.

31
Pipeline Performance: Structural hazard

Clock cycle    1    2    3    4    5    6    7
I1             F1   D1   E1   W1
I2 (Load)           F2   D2   E2   M2   W2
I3                       F3   D3   E3        W3
I4                            F4   D4        E4
I5                                 F5        D5

Figure 8.5. Effect of a Load instruction on pipeline timing.

I2: Load X(R1), R2    I3: MOV R3, R2

The Execution stage E2 computes X+R1. Then an extra memory access stage, M2, is
required to fetch the source operand from the memory address [X+R1].

Structural hazard, because the “Write” unit can’t perform two different writes in
one clock pulse. 32
Structural hazard (contd..)
Clock cycle    1    2    3    4    5    6    7
I1             F1   D1   E1   W1
I2 (Load)           F2   D2   E2   M2   W2   (Load X(R1), R2)
I3                       F3   D3   E3        W3
I4                            F4   D4        E4
I5                                 F5        D5

•Memory address X+R1 is computed in step E2 in cycle 4, memory access takes place
in cycle 5, operand read from the memory is written into register R2 in cycle 6.
•Execution of instruction I2 takes two clock cycles 4 and 5.
•In cycle 6, both instructions I2 and I3 require access to register file.
•Pipeline is stalled because the register file cannot handle two operations at once.

33
Pipelining and performance

• When a hazard occurs, one of the stages in the pipeline cannot


complete its operation in one clock cycle.
• The pipeline stalls causing a degradation in performance.
• Performance level of one instruction completion in each clock cycle is
the upper limit for the throughput that can be achieved in a pipelined
processor.

34
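The upper limit stated above can be quantified with a small sketch (our own formula; it assumes one stall-free instruction per cycle after the pipeline fills, with every hazard adding whole stall cycles):

```python
def effective_throughput(n_instructions, n_stages, stall_cycles):
    # Total cycles = pipeline fill time + one cycle per remaining
    # instruction + all stall (bubble) cycles.
    total = n_stages + (n_instructions - 1) + stall_cycles
    return n_instructions / total

print(round(effective_throughput(100, 4, 0), 2))    # 0.97 -- near the 1/cycle limit
print(round(effective_throughput(100, 4, 25), 2))   # 0.78 -- hazards push it down
```

One instruction per cycle is the ceiling; every stall cycle pushes the achieved throughput below it.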
Handling Data Hazard in Hardware: Operand
Forwarding
• Instead of reading from the register file, the second instruction can get the data
directly from the output of the ALU, immediately after the “E” (i.e.,
execution) stage of the previous instruction is completed.
• A special arrangement needs to be made to “forward” the output of
the ALU to the input of the ALU.

35
Operand forwarding
•Data hazard occurs because the destination of one instruction is used as the source
in the next instruction.
•Hence, instruction I2 has to wait for the data to be written in the register file by the
Write stage at the end of step W1.
•However, these data are available at the output of the ALU once the Execute stage
completes step E1.
•Delay can be reduced or even eliminated if the result of instruction I1 can be
forwarded directly for use in step E2.
•This is called “operand forwarding”.

36
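The effect of forwarding can be sketched with a toy model (our own illustration, not the datapath itself): with forwarding, the dependent Add takes R4 from the ALU output; without it, the Add sees the stale register value.

```python
def run(regs, forwarding):
    # I1: Mul R2, R3, R4 -- the product appears at the ALU output first
    alu_out = regs["R2"] * regs["R3"]
    # I2: Add R5, R4, R6 -- needs R4 before I1's Write stage has run
    r4 = alu_out if forwarding else regs["R4"]   # stale value if not forwarded
    regs["R4"] = alu_out                         # I1's Write stage completes
    regs["R6"] = r4 + regs["R5"]
    return regs["R6"]

regs = {"R2": 2, "R3": 3, "R4": 0, "R5": 10, "R6": 0}
print(run(dict(regs), forwarding=True))    # 16  (2*3 + 10, correct)
print(run(dict(regs), forwarding=False))   # 10  (used stale R4 = 0)
```

Forwarding gives the same answer as sequential execution without waiting for the Write stage, which is exactly why it removes the stall.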
Operand forwarding (contd..)

[Figure: register file and ALU with added interstage registers SRC1 and SRC2 (part of buffer B2) at the ALU inputs and RSLT (part of buffer B3) on the destination bus; two forwarding paths run from the destination bus back to the ALU inputs]

•Similar to the three-bus organization.
•Registers SRC1, SRC2 and RSLT have been added.
•SRC1, SRC2 and RSLT are interstage buffers for pipelined operation.
•SRC1 and SRC2 are part of buffer B2.
•RSLT is part of buffer B3.
•The data forwarding mechanism is shown by the two blue lines.
•Two multiplexers connected at the inputs to the ALU allow the data on the destination
bus to be selected instead of the contents of the SRC1 and SRC2 registers.
37
Operand forwarding (contd..)

I1: Mul R2, R3, R4
I2: Add R5, R4, R6

Clock cycle 3:
- Instruction I2 is decoded, and a data dependency is detected.
- The operand not involved in the dependency, register R5, is loaded into register SRC1.
Clock cycle 4:
- The product produced by I1 is available in register RSLT.
- The forwarding connection allows the result to be used in step E2.
Instruction I2 proceeds without interruption.
38
Handling data dependency in software

• Data dependency may be detected by the hardware while decoding


the instruction:
• Control hardware may delay reading of a register by an appropriate number of
clock cycles until its contents become available. The pipeline stalls for
that many clock cycles.
• Detecting data dependencies and handling them can also be
accomplished in software.
• Compiler can introduce the necessary delay by introducing an appropriate
number of NOP instructions. For example, if a two-cycle delay is needed
between two instructions then two NOP instructions can be introduced
between the two instructions.

I1: Mul R2, R3, R4


NOP
NOP
I2: Add R5, R4, R6

39
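The compiler-side fix above can be sketched as a toy pass (our own illustration; the tuple instruction format and the fixed two-cycle delay are assumptions) that inserts NOPs whenever an instruction reads the previous instruction's destination:

```python
def insert_nops(program, delay=2):
    # program: list of (opcode, dest, sources) tuples in program order.
    out, prev_dest = [], None
    for opcode, dest, sources in program:
        if prev_dest is not None and prev_dest in sources:
            # Dependent on the previous result: give it time to land.
            out.extend([("NOP", None, ())] * delay)
        out.append((opcode, dest, sources))
        prev_dest = dest
    return out

prog = [("Mul", "R4", ("R2", "R3")),
        ("Add", "R6", ("R4", "R5"))]
print([op for op, _, _ in insert_nops(prog)])   # ['Mul', 'NOP', 'NOP', 'Add']
```

This reproduces the Mul / NOP / NOP / Add sequence shown on the slide.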
Superscalar operation

• Pipelining enables multiple instructions to be executed concurrently


by dividing the execution of an instruction into several stages:
• Instructions enter the pipeline in strict program order.
• If the pipeline does not stall, one instruction enters the pipeline and one
instruction completes execution in one clock cycle.
• Maximum throughput of a pipelined processor is one instruction per clock
cycle.
• An alternative approach is to equip the processor with multiple
processing units to handle several instructions in parallel in each
stage.

40
Superscalar operation (contd..)

• If a processor has multiple processing units then several instructions


can start execution in the same clock cycle.
• Processor is said to use “multiple issue”.
• These processors are capable of achieving instruction execution
throughput of more than one instruction per cycle.
• These processors are known as “superscalar processors”.

41
Superscalar operation (contd..)

[Figure: instruction fetch unit (F) feeding an instruction queue; a dispatch unit at the front of the queue issues instructions to a floating-point unit and an integer unit, whose results go to a Write-results stage (W)]

The instruction fetch unit is capable of reading two instructions at a time and
storing them in the instruction queue.
The dispatch unit retrieves up to two instructions at a time from the front of the queue.
The processor has two execution units: integer and floating point.
If there is one integer and one floating-point instruction, and no hazards, then both
instructions are dispatched in the same clock cycle.
42
Superscalar operation (contd..)

• Various hazards cause an even greater deterioration in performance in
the case of a superscalar processor.
• Compiler can avoid many hazards by careful ordering of instructions:
• For example, the compiler should try to interleave floating-point and integer
instructions.
• Dispatch unit can then dispatch two instructions in most clock cycles, and
keep both integer and floating point units busy most of the time.
• If the compiler can order instructions in such a way that the available
hardware units can be kept busy most of the time, high performance
can be achieved.

43
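The dispatch rule described above (up to two instructions per cycle, one per execution unit, issued in program order) can be sketched as follows. This is our own simplified model, ignoring execution latency and hazards:

```python
from collections import deque

def dispatch(program):
    # program: list of (name, unit) pairs in program order.
    q, cycles = deque(program), []
    while q:
        issued, units_used = [], set()
        # Issue up to two instructions per cycle, at most one per unit,
        # stopping at the first instruction whose unit is already taken
        # (in-order issue from the front of the queue).
        while q and len(issued) < 2 and q[0][1] not in units_used:
            name, unit = q.popleft()
            issued.append(name)
            units_used.add(unit)
        cycles.append(issued)
    return cycles

prog = [("Fadd", "float"), ("Add", "int"), ("Fsub", "float"), ("Sub", "int")]
print(dispatch(prog))   # [['Fadd', 'Add'], ['Fsub', 'Sub']]
```

Interleaved float/int instructions dispatch two per cycle, while two consecutive integer instructions are forced into separate cycles, which is exactly the ordering effect the compiler is asked to exploit.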
Superscalar operation (contd..)
Clock cycle    1    2    3    4    5    6    7
I1 (Fadd)      F1   D1   E1A  E1B  E1C  W1
I2 (Add)       F2   D2   E2   W2
I3 (Fsub)           F3   D3   E3A  E3B  E3C  W3
I4 (Sub)            F4   D4   E4   W4

•Instructions in the floating-point unit take three cycles to execute.


•Floating-point unit is organized as a three-stage pipeline.
•Instructions in the integer unit take one cycle to execute.
•Integer unit is organized as a single-stage pipeline.
•Clock cycle 1:
- Instructions I1 (floating point) and I2 (integer) are fetched.
•Clock cycle 2:
- Instructions I1 and I2 are decoded and dispatched; I3 and I4 are fetched.
44
Superscalar operation (contd..)
Clock cycle    1    2    3    4    5    6    7
I1 (Fadd)      F1   D1   E1A  E1B  E1C  W1
I2 (Add)       F2   D2   E2   W2
I3 (Fsub)           F3   D3   E3A  E3B  E3C  W3
I4 (Sub)            F4   D4   E4   W4
Clock cycle 3:
- I1 and I2 begin execution, I2 completes execution. I3 is dispatched to floating
point unit and I4 is dispatched to integer unit.
Clock cycle 4:
- I1 continues execution, I3 begins execution, I2 completes Write stage,
I4 completes execution.
Clock cycle 5:
- I1 completes execution, I3 continues execution, and I4 completes Write.
Order of completion is I2, I4, I1, I3
45
