Instruction Level Parallelism
and Superscalar Processors
Chapter 14
William Stallings
Computer Organization and
Architecture
7th Edition
What is Superscalar?
• Common instructions (arithmetic,
load/store, conditional branch) can be
initiated simultaneously and executed
independently
• Applicable to both RISC & CISC
Why Superscalar?
• Most operations are on scalar
quantities (see RISC notes)
• Improve these operations by
executing them concurrently in
multiple pipelines
• Requires multiple functional units
• Requires re-arrangement of
instructions
General Superscalar
Organization
Superpipelined
• Many pipeline stages need less than half a
clock cycle
• Double internal clock speed gets two tasks
per external clock cycle
• Superscalar allows parallel fetch and
execute
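As a rough illustration of the comparison on this slide, the idealized cycle counts for a base pipeline, a superpipelined machine and a superscalar machine can be worked out as below. This is only a sketch under stated assumptions: no dependencies or stalls, a hypothetical base pipeline of k stages, and degree 2 for both alternatives. The function names are illustrative and do not come from the text.

```python
import math

def base_cycles(n, k=4):
    """Cycles for n instructions on a k-stage pipeline, one issue per cycle."""
    return k + n - 1

def superpipelined_cycles(n, k=4, degree=2):
    """Each stage is split into `degree` sub-stages, so a new instruction
    can start every 1/degree of a base clock cycle (assumed ideal)."""
    return k + (n - 1) / degree

def superscalar_cycles(n, k=4, degree=2):
    """`degree` instructions start together in every base clock cycle."""
    return k + math.ceil(n / degree) - 1

if __name__ == "__main__":
    n = 6  # hypothetical instruction count
    print(base_cycles(n), superpipelined_cycles(n), superscalar_cycles(n))
    # prints: 9 6.5 6
```

Under these assumptions both alternatives approach the same steady-state throughput; the superpipelined machine only falls slightly behind at start-up.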
Limitations
• Instruction level parallelism: the degree to
which the instructions can be executed
in parallel (in theory)
• To achieve it:
– Compiler based optimisation
– Hardware techniques
• Limited by
– Data dependency
– Procedural dependency
– Resource conflicts
True Data (Write-Read)
Dependency
• ADD r1, r2 (r1 := r1+r2;)
• MOVE r3, r1 (r3 := r1;)
• Can fetch and decode second
instruction in parallel with first
• Can NOT execute second instruction
until first is finished
Procedural Dependency
• Cannot execute instructions
after a (conditional) branch in
parallel with instructions
before a branch
• Also, if instruction length is not
fixed, instructions have to be
decoded to find out how many
fetches are needed (cf. RISC)
• This prevents simultaneous
fetches
Resource Conflict
• Two or more instructions requiring access
to the same resource at the same time
– e.g. functional units, registers, bus
• Similar to true data dependency, but it is
possible to duplicate resources
Effect of
Dependencies
Design Issues
• Instruction level parallelism
– Some instructions in a sequence are
independent
– Execution can be overlapped or re-ordered
– Governed by data and procedural
dependency
• Machine Parallelism
– Ability to take advantage of instruction level
parallelism
– Governed by number of parallel pipelines
(Re-)ordering instructions
• Order in which instructions are
fetched
• Order in which instructions are
executed – instruction issue
• Order in which instructions
change registers and memory -
commitment or retiring
In-Order Issue
In-Order Completion
• Issue instructions in the
order they occur
• Not very efficient – not used
in practice
• May fetch >1 instruction
• Instructions must stall if
necessary
An Example
• I1 requires two cycles to execute
• I3 and I4 compete for the same
functional unit
• I5 depends on the value
produced by I4
• I5 and I6 compete for the same
functional unit
In-Order Issue In-Order
Completion (Diagram)
In-Order Issue Out-of-Order
Completion (Diagram)
In-Order Issue
Out-of-Order Completion
• Output (write-write) dependency
– R3:= R2 + R5; (I1)
– R4:= R3 + 1; (I2)
– R3:= R5 + 1; (I3)
– R6:= R3 + 1; (I4)
– I2 depends on result of I1 - data
dependency
– If I3 completes before I1, the input for I4
will be wrong - output dependency: I1&I3-I4
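The effect of the output dependency above can be sketched with a tiny simulation. It assumes hypothetical initial values for R2 and R5 and models only the order in which I1 and I3 write R3. The function `run` is illustrative and does not describe any real processor.

```python
def run(write_order, r2=10, r5=20):
    """Apply the writes of I1 and I3 to R3 in the given completion order,
    then let I4 read R3. Initial register values are assumptions."""
    regs = {"R2": r2, "R5": r5}
    results = {
        "I1": r2 + r5,  # R3 := R2 + R5
        "I3": r5 + 1,   # R3 := R5 + 1
    }
    for name in write_order:
        regs["R3"] = results[name]   # the later write wins
    regs["R6"] = regs["R3"] + 1      # I4: R6 := R3 + 1
    return regs

if __name__ == "__main__":
    print(run(["I1", "I3"])["R6"])  # program order: 22
    print(run(["I3", "I1"])["R6"])  # I3 completes first: 31 (wrong)
```

When I3 writes before I1, the stale result of I1 is left in R3, so I4 computes from the wrong input.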
Out-of-Order Issue
Out-of-Order Completion
• Decouple decode pipeline from execution
pipeline
• Can continue to fetch and decode until this
pipeline is full
• When a functional unit becomes available
an instruction can be executed
• Since instructions have been decoded,
processor can look ahead – instruction
window
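The benefit of looking ahead in the instruction window can be sketched with a simplified issue model applied to the six-instruction example from the earlier slide. The model is an assumption made for illustration: it covers only the execute phase, issues at most two instructions per cycle, and uses hypothetical functional unit names (u1 to u4). It does not reproduce the fetch, decode and write-back stages of the diagrams.

```python
def schedule(program, in_order, width=2):
    """Return {name: (start_cycle, finish_cycle)} for the execute phase.

    program: list of dicts with keys name, unit, latency, deps.
    in_order: if True, stop at the first instruction that cannot issue.
    """
    times = {}
    unit_free_at = {}            # unit -> first cycle in which it is free
    waiting = list(program)
    cycle = 1
    while waiting:
        issued = []
        for ins in waiting:
            if len(issued) == width:
                break
            deps_done = all(d in times and times[d][1] < cycle
                            for d in ins["deps"])
            unit_free = unit_free_at.get(ins["unit"], 1) <= cycle
            if deps_done and unit_free:
                finish = cycle + ins["latency"] - 1
                times[ins["name"]] = (cycle, finish)
                unit_free_at[ins["unit"]] = finish + 1
                issued.append(ins["name"])
            elif in_order:
                break            # later instructions must wait behind this one
        waiting = [i for i in waiting if i["name"] not in issued]
        cycle += 1
    return times

# Constraints from the example slide; unit names are hypothetical.
EXAMPLE = [
    {"name": "I1", "unit": "u1", "latency": 2, "deps": []},
    {"name": "I2", "unit": "u2", "latency": 1, "deps": []},
    {"name": "I3", "unit": "u3", "latency": 1, "deps": []},
    {"name": "I4", "unit": "u3", "latency": 1, "deps": []},
    {"name": "I5", "unit": "u4", "latency": 1, "deps": ["I4"]},
    {"name": "I6", "unit": "u4", "latency": 1, "deps": []},
]

def last_cycle(times):
    return max(finish for _, finish in times.values())
```

In this simplified model, `last_cycle(schedule(EXAMPLE, in_order=True))` is 5 and `last_cycle(schedule(EXAMPLE, in_order=False))` is 4, because I6 can issue ahead of the stalled I4 and I5.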
Out-of-Order Issue Out-of-Order
Completion (Diagram)
Antidependency
• Read-write dependency: I2-I3
– R3:=R3 + R5; (I1)
– R4:=R3 + 1; (I2)
– R3:=R5 + 1; (I3)
– R7:=R3 + R4; (I4)
– I3 must not complete before I2 has read its
operands: I2 needs the value I1 wrote to R3,
and I3 overwrites R3
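The three kinds of data dependency discussed so far (true, output, anti) can be found mechanically in a straight-line sequence. The following is a minimal sketch, assuming each instruction is described by a hypothetical (name, destination, sources) tuple. It reports true dependencies only against the most recent writer of a register.

```python
def find_dependencies(program):
    """Return (kind, earlier, later, register) tuples.

    program: list of (name, dest, sources) tuples in program order.
    """
    last_writer = {}          # register -> most recent instruction writing it
    readers_since_write = {}  # register -> readers since that write
    deps = []
    for name, dest, sources in program:
        for reg in dict.fromkeys(sources):   # de-duplicate, keep order
            if reg in last_writer:
                deps.append(("true", last_writer[reg], name, reg))
            readers_since_write.setdefault(reg, []).append(name)
        for reader in readers_since_write.get(dest, []):
            if reader != name:
                deps.append(("anti", reader, name, dest))
        if dest in last_writer:
            deps.append(("output", last_writer[dest], name, dest))
        last_writer[dest] = name
        readers_since_write[dest] = []
    return deps

# The four instructions from the antidependency slide.
PROGRAM = [
    ("I1", "R3", ("R3", "R5")),  # R3 := R3 + R5
    ("I2", "R4", ("R3",)),       # R4 := R3 + 1
    ("I3", "R3", ("R5",)),       # R3 := R5 + 1
    ("I4", "R7", ("R3", "R4")),  # R7 := R3 + R4
]

if __name__ == "__main__":
    for dep in find_dependencies(PROGRAM):
        print(dep)
```

For this sequence the sketch reports the antidependency I2-I3 on R3, together with the output dependency I1-I3 and the true dependencies I1-I2, I3-I4 and I2-I4.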
Register Renaming
• Output and antidependencies
occur because register
contents may not reflect the
correct program flow
• May result in a pipeline stall
• The usual reason is storage
conflict
• Registers can be allocated
dynamically
Register Renaming example
• R3b:=R3a + R5a (I1)
• R4b:=R3b + 1 (I2)
• R3c:=R5a + 1 (I3)
• R7b:=R3c + R4b (I4)
• A register name without a label (a, b, c)
refers to the logical register
• A name with a label refers to the hardware
register allocated to hold that value
• Removes antidependency I2-I3 and output
dependency I1&I3-I4
• Needs extra registers
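The renaming in the example above can be reproduced with a short sketch. It assumes every logical register starts in hardware register "a" and that each write allocates the next letter. The function `rename` and the (destination, expression) input format are hypothetical, and real hardware uses a mapping table and a free list rather than letters.

```python
import re
from string import ascii_lowercase

def rename(program):
    """Rename registers in a list of (dest, expression) pairs.

    Sources use the current mapping; each destination gets a fresh label.
    """
    version = {}  # logical register -> index of its current label

    def current(reg):
        return f"{reg}{ascii_lowercase[version.setdefault(reg, 0)]}"

    renamed = []
    for dest, expr in program:
        # Rename the sources first, so they see the mapping before this write.
        new_expr = re.sub(r"R\d+", lambda m: current(m.group()), expr)
        version[dest] = version.get(dest, 0) + 1
        renamed.append(f"{dest}{ascii_lowercase[version[dest]]} := {new_expr}")
    return renamed

if __name__ == "__main__":
    program = [
        ("R3", "R3 + R5"),  # I1
        ("R4", "R3 + 1"),   # I2
        ("R3", "R5 + 1"),   # I3
        ("R7", "R3 + R4"),  # I4
    ]
    for line in rename(program):
        print(line)
```

Because I3 writes R3c while I2 still reads R3b, the two instructions no longer share a storage location and can proceed in either order.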
Machine Parallelism
• Duplication of Resources
• Out of order issue
• Renaming
• Not worth duplicating functions without
register renaming
• Need instruction window large enough
(more than 8)
Speedups Without Procedural
Dependencies (with out-of-order issue)
Branch Prediction
• Intel 80486 fetches both next sequential
instruction after branch and branch target
instruction
• Gives two cycle delay if branch taken (two
decode cycles)
RISC - Delayed Branch
• Calculate result of branch before unusable
instructions pre-fetched
• Always execute single instruction
immediately following branch
• Keeps pipeline full while fetching new
instruction stream
• Not as good for superscalar
– Multiple instructions need to execute in delay
slot
• Revert to branch prediction
Superscalar Execution
Pentium 4
• 80486 - CISC
• Pentium – some superscalar
components
– Two separate integer execution units
• Pentium Pro – full-blown
superscalar
• Subsequent models refine &
enhance superscalar design
Pentium 4 Operation
• Fetch instructions from memory in order of static
program
• Translate instruction into one or more fixed
length RISC instructions (micro-operations)
• Execute micro-ops on superscalar pipeline
– micro-ops may be executed out of order
• Commit results of micro-ops to register set in
original program flow order
• Outer CISC shell with inner RISC core
• Inner RISC core pipeline at least 20 stages
– Some micro-ops require multiple execution stages
– cf. five stage pipeline on Pentium
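The combination of out-of-order execution with in-order commitment can be sketched generically as a queue that retires only from its head. This is an illustrative model, not a description of the Pentium 4 hardware; the micro-op names and the function `retire_in_order` are assumptions.

```python
from collections import deque

def retire_in_order(program_order, completion_order):
    """For each completing micro-op, list the micro-ops that retire.

    A micro-op retires only when it and all earlier micro-ops have completed.
    """
    pending = deque(program_order)  # oldest micro-op at the left
    done = set()
    log = []
    for uop in completion_order:
        done.add(uop)
        retired = []
        while pending and pending[0] in done:
            retired.append(pending.popleft())
        log.append((uop, retired))
    return log

if __name__ == "__main__":
    order = ["u1", "u2", "u3", "u4"]
    for event in retire_in_order(order, ["u2", "u1", "u4", "u3"]):
        print(event)
```

Here u2 finishes first but cannot retire until u1 has finished, so the register set is always updated in original program order.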
Pentium 4 Pipeline
Stages 1-9
• 1-2 (BTB&I-TLB, F/t): Fetch instructions,
static branch prediction, split into 4 micro-ops
• 3-4 (TC): Dynamic branch prediction,
sequencing micro-ops
• 5: Feed into out-of-order execution logic
• 6 (R/a): Allocating resources (registers)
• 7-8 (R/a): Renaming registers and
removing false dependencies
• 9 (mopQ): Re-ordering micro-ops
Stages 10-20
• 10-14 (Sch): Scheduling (FIFO) and
dispatching micro-ops towards available
execution unit
• 15-16 (RF): Storing pending operations
• 17 (ALU, Fop): Execution of micro-ops
• 18 (ALU, Fop): Compute flags
• 19 (ALU): Branch check – feedback to
stages 3-4
• 20: Retiring instructions
Pentium 4 Block Diagram