Prepared by: Karlene M.
Black COA - Sem 1
A communication pathway connecting two or
more devices
A link in which information is usually
broadcast to all devices (not just intended
recipient)
consists of multiple communication lines
transmitting digital information in parallel
A component in which width is important in
determining performance
often grouped
◦ A number of channels in one bus
e.g. 32 bit data bus is 32 separate single bit
channels
systems can contain multiple buses
Major advantages of bus organization are
versatility and low cost.
Major disadvantage is that insufficient
bandwidth may create a communication
bottleneck, limiting the maximum I/O throughput.
Performance factors
◦ Physical size (latency & bandwidth)
◦ Number and type of connected devices (taps)
Parallel lines on circuit boards (often yellow)
Ribbon cables
Strip connectors on motherboards
e.g. PCI
Sets of wires
1. System (processor-memory) buses
Address bus
Data bus
Control bus
2. I/O buses
3. Backplane buses
contains the source or destination address of
the data on the data bus
e.g. CPU needs to read an instruction
(data) from a given location in memory
Bus width determines maximum memory
capacity of system
e.g. 8080 has 16 bit address bus giving 64k address
space
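The relationship between address-bus width and addressable memory can be checked with a short sketch (a generic illustration, not tied to any one machine):

```python
def address_space(bus_width_bits):
    """Number of distinct addresses a bus of the given width can carry."""
    return 2 ** bus_width_bits

# Intel 8080: 16 address lines -> 65536 addresses = 64K
print(address_space(16))
# A 32-bit address bus reaches 4G addresses
print(address_space(32))
```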
Used for moving data between modules
◦ Remember that there is no difference between
“data” and “instruction” at this level
Width is a key determinant of performance
◦ 8, 16, 32, 64 bit
set of lines used to control use of the data and
address lines by the attached devices
includes
◦ Memory read/write signal
◦ Interrupt acknowledgement
◦ Interrupt request
◦ Clock signals
◦ I/O Read or Write
◦ Transfer ACK
◦ Bus request
◦ Bus grant
◦ Reset
Type
• dedicated or multiplexed
Arbitration
• centralised or distributed
Timing
• synchronous or asynchronous
Width
Transfer type
• serial or parallel
Dedicated
◦ Separate data & address lines
Multiplexed
◦ Shared lines
◦ Address valid or data valid control line
◦ Advantage - fewer lines
◦ Disadvantages
More complex control
Degradation of performance
Ensuring only one device uses the bus at a
time – avoiding collisions
Choosing a master among multiple requests
◦ Try to implement priority and fairness (no device
“starves”)
Uses a master-slave mechanism
Two main schemes:
• centralized and distributed
Components
Bus master: component that can initiate a bus
request
◦ Bus typically has several masters
◦ Processor, but I/O devices can also be masters
Components
Daisy-chain: devices connect to bus in priority
order
◦ High-priority devices intercept/deny requests by low-
priority ones
◦ Simple, but slow and can’t ensure fairness
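A minimal sketch of the daisy-chain scheme described above; the device ordering and grant logic here are illustrative assumptions, not a real bus protocol:

```python
def daisy_chain_grant(requests):
    """Grant the bus to the highest-priority requester.

    `requests` lists devices in physical chain order (index 0 = closest
    to the arbiter = highest priority); True means the device is
    requesting the bus. Returns the index granted, or None.
    """
    for position, requesting in enumerate(requests):
        if requesting:
            return position  # this device absorbs the grant signal
    return None

# Device 1 intercepts the grant, so device 2 must wait (possible starvation)
print(daisy_chain_grant([False, True, True]))
```

Note how fairness fails: as long as device 1 keeps requesting, device 2 never wins the bus.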
New trend: Point-to-point busses
•Pro: No arbitration, no “master”, fast,
simple, source synchronous
•Con: need lots of wires or requires high
per-wire bandwidth
Centralized
◦ single hardware device controls bus access
- (bus controller or bus arbiter)
◦ May be part of CPU or separate
Distributed
◦ any module (except passive devices like
memory) can become the bus master e.g.
CPU and DMA controller
◦ Access control logic is on all modules
◦ Modules work together to control bus
Co-ordination of events on bus
Synchronous – events are controlled by a clock
Asynchronous – timing is handled by well-
defined specifications, i.e., a response is
delivered within a specified time after a
request
Events determined by clock signals
Control Bus includes clock line
A single 1-0 cycle is a bus cycle
All devices can read clock line
Usually sync on leading/rising edge
Usually a single cycle for an event
Analogy – Orchestra conductor with baton
Usually stricter in terms of its timing
requirements
Devices must have certain tolerances to
provide responses to signal stimuli
More flexible allowing slower devices to
communicate on same bus with faster
devices.
Performance of faster devices, however, is
limited to speed of bus
Wider the bus the better the data transfer rate
or the wider the addressable memory space
Serial “width” is determined by
length/duration of frame
Lots of devices on one bus leads to:
◦ Propagation delays
Long data paths mean that co-ordination of bus use
can adversely affect performance
If aggregate data transfer approaches bus capacity
Most systems use multiple buses to overcome
these problems
Historically, parallel has been used for high
speed peripherals (e.g., SCSI, parallel port
zip drives rather than serial port). High
speed serial, however, has begun to replace
this need
Serial communication also used to be
restricted to point-to-point
communications. Now there's an increasing
prevalence of multipoint
working space (temporary storage) available
to CPU (a MUST have)
Number and function vary between processor
designs
One of the major design decisions
Top level of memory hierarchy
hardware devices for holding values
mainly used to:
◦ control execution of a program
◦ hold an instruction that is currently
being executed
fast memory, almost always connected to
circuitry that allows various arithmetic,
logical, control, and other manipulations, as
well as possibly setting internal flags
found in all computers but not all registers
can be manipulated by the programmer
(examples, IR, MAR, MBR)
include those that may be controlled by the
programmer - called operational registers
and are often referred to as the register set
of the machine.
(Register diagram: x86 examples such as EAX, AX, ESI)
Make them general purpose
◦ Increase flexibility and programmer options
◦ Increase instruction size & complexity
Make them specialized
◦ Smaller (faster) instructions
◦ Less flexibility
General Purpose
Data
Address
Condition Codes
May be true general purpose
May be restricted
May be used for data or addressing
◦ Data
Accumulator
◦ Addressing
Segment
How Many?
Between 8 - 64
Fewer = more memory references
More does not reduce memory references
and takes up processor real estate
How big?
Large enough to hold full address
Large enough to hold full word
Often possible to combine two data registers
Example: C programming
◦ double a;
◦ long long int a;
register-to-register instructions
memory-to-register instructions
memory-to-memory instructions
load-and-store instructions
Ax – Used to accumulate the results of additions,
subtractions and so forth.
Bx – Base register. Used to point to the starting
address (called the base) of a structure in memory.
Cx – Count register. Frequently specifies the
number of times some operation is to repeat.
Dx – Data register. Holds data, perhaps passed to a
subroutine for processing.
Sp (stack pointer) – points to the top of the
processor’s stack
Bp (base pointer) – usually addresses variables
stored inside the stack
Si (Source index) and di (destination index) – aka
string registers
IP (Instruction Pointer) – specifies the next
machine code instruction to be executed, relative to
the segment located by CS. IP is rarely referred to
directly.
Cs (code segment) – addresses the start of the
program's machine code in memory
Ds (data segment) – addresses the start of the
program's variables
Ss (stack segment) – locates the start of the
program's stack space
Es (extra segment) – locates an additional data
segment if needed, although in many programs, es
and ds address the same memory, facilitating some
operations tied to these registers.
Also called status bits
Boolean variables within the processor status
(condition code) register
Describe the status of the CPU and the
program currently being executed
Values are set (1) or cleared (0) after the
execution of an instruction
O or of – Overflow flag
D or df – Direction flag
I or if – interrupt enable flag
T or tf – trap flag
S or sf – sign flag
Z or zf – zero flag
A or af – auxiliary flag
P or pf – parity flag
C or cf – carry
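As a rough illustration of how some of these flags are derived, here is a sketch of an 8-bit add that sets carry, zero and sign. The flag-update rules of a real x86 CPU are more involved (overflow and auxiliary carry are omitted here):

```python
def add8_flags(a, b):
    """8-bit addition returning (result, carry, zero, sign) flags."""
    total = a + b
    result = total & 0xFF          # keep the low 8 bits
    cf = 1 if total > 0xFF else 0  # carry out of bit 7
    zf = 1 if result == 0 else 0   # result was zero
    sf = (result >> 7) & 1         # copy of the most significant bit
    return result, cf, zf, sf

print(add8_flags(0xFF, 0x01))  # wraps to zero: carry and zero flags set
```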
Sets of individual bits
◦ e.g. result of last operation was zero
Can be read (implicitly) by programs
◦ e.g. Jump if zero
Cannot (usually) be set by programs
Program Counter
Instruction Decoding Register
Memory Address Register
Memory Buffer Register
May have registers pointing to:
◦ Process control blocks
◦ Interrupt Vectors
N.B. CPU design and operating system design
are closely linked
Two main classes – RISC and CISC
RISC = Reduced Instruction Set Computer.
- type of microprocessor architecture that
utilizes a small, highly-optimized set of
instructions,
CISC = Complex Instruction Set Computer
- type of microprocessor architecture that
utilizes a more specialized set of instructions
RISC
• Emphasis on software
• Relatively large general purpose register sets
• Single-clock, reduced instructions only
• Register to register: "LOAD" and "STORE" are
independent instructions
• Instructions operate upon operands stored in
registers rather than in main memory
• Low cycles per second, large code sizes
• A hardwired control unit
• Relatively few addressing modes
• Spends more transistors on memory registers
• Fixed-length, easily decodable instruction formats
• Emphasis on hardware
• Memory-to-memory: "LOAD" and "STORE"
incorporated in instructions
• Includes multi-clock complex instructions
• A large number of instructions
• A large number of addressing modes
• Small code sizes, high cycles per second
• Transistors used for storing complex instructions
• A range of variable-length instruction formats
with variable execution times
• A microprogrammed control unit
A.k.a. writing microcode
method used to implement machine instructions
in a CPU relatively easily, often using less
hardware than with other methods.
a set of very detailed and rudimentary lowest-
level routines which controls and sequences the
actions needed to execute particular
instructions, sometimes also to decode them.
a machine instruction implemented by a series of
microinstructions - loosely comparable to how an
interpreter implements a high-level language statement
using a series of machine instructions
Microprograms are carefully designed and optimized
for the fastest possible execution, since a slow
microprogram would yield a slow machine instruction
which would in turn cause all programs using that
instruction to be slow.
Microcode
the element of a microprogram
normally written by the CPU engineer during
the design phase
generally not meant to be visible or changeable
by a normal programmer, not even an
assembly programmer
can be dramatically changed with a new
microarchitecture generation
Microcode
often used to let one microarchitecture emulate
another, usually more powerful, architecture.
used as a synonym for firmware, whether or not
it actually implements the microprogramming of
a processor e.g. firmware used in a hard drive
Microcode
Microinstruction providing the bits which control
the functional elements that internally compose a
CPU
transforms a complex electronic design challenge
(the control of a CPU) into a less-complex
programming challenge
can be characterized as horizontal or vertical.
each microinstruction directly controls CPU
elements
typically contained in a fairly wide control store,
it is not uncommon for each word to be 56 bits
or more.
on each tick of a sequencer clock, a microcode
word is read, decoded, and used to control the
functional elements which make up the CPU.
A microprogram word comprises fairly tightly
defined groups of bits.
each microinstruction requires subsequent
decoding before it can control CPU elements,
since the bit fields may pass through
intermediate combinatory logic which in turn
generates the actual control signals for internal
CPU elements (ALU, registers, etc.).
bit fields themselves directly produce the
control signals.
requires small instruction lengths and storage
requires more time to decode, resulting in a
slower CPU clock.
may be just the assembly language of a simple
conventional computer that is emulating a more
complex computer.
As transistors became cheaper, horizontal
microcode came to dominate the design of CPUs
using microcode, with vertical microcode no
longer being used.
Each type of CPU has its own machine
language and assembly language, so an
assembly language program written for
one type of CPU won't run on another.
Machine languages
consist entirely of numbers
almost impossible for humans to read and
write.
Assembly languages
have the same structure and set of commands
as machine languages
enable a programmer to use names instead of
numbers
Three basic characteristics differentiate
microprocessors:
1. Clock speed
2. Bandwidth/Bus speed
3. Instruction set
Clock speed
also called clock rate
the speed at which a microprocessor executes
instructions.
regulated by an internal clock
synchronizes all the various computer
components.
Clock speed
faster the clock => more instructions that may be
executed per second
stated in Hertz (Hz)
e.g. 1 MHz = 1 million cycles per second
Does not imply better performance
major factor in determining the power of a
computer
Bus speed
measured in MHz
often hampers the performance of the computer
by having a slower speed than the processor
Ideally should be the same as the CPU clock
speed
Instruction set:
The complete collection of instructions that
the microprocessor can execute:
◦ Machine Code
◦ Binary
Usually represented by assembly codes
Determines a computer family
Instruction set:
Dictates how programs (software) are written for a
microprocessor
example: the SIMP computer understands
10 instructions, and any program written for it
uses those ten instructions in various ways to
accomplish some surprisingly complicated tasks.
Instruction set:
sometimes a larger instruction set will equal
better performance.
For example, one difference between the Pentium III
and the Pentium 4 is that the Pentium 4 has a larger
instruction set.
plays an important role in co-design of the
embedded computer system
links the software and hardware part of the
system.
design process supports software interface
implementation and hardware interface
synthesis.
enables software running on the
microcontroller to control external devices
consists of the sequential logic that
physically connects the devices to the
microcontroller and the software drivers
that allow code to access the device
functions.
model that renders lower-level details of
computer systems temporarily invisible in
order to facilitate the design of sophisticated
systems
the notion that we can concentrate on one
“level” of the big picture at a time, with
confidence that we can then connect
effectively with the levels above and below.
framing the levels of abstraction
appropriately is one of the most important
skills in any undertaking.
on the other hand, abstraction does not
mean being clueless about the neighboring
levels.
interface between the hardware and the
lowest-level software - one of the most
important abstractions – referred to as the
Instruction Set Architecture (ISA)
Omits unnecessary details
High Level Language
Compiler
Assembly Language
Assembler
Binary Machine Language
What kind of interface should the hardware
present to the software?
• Types of instructions?
• Instruction representation?
• How do we get from instruction 1 to 2 (or to 7
instead)?
• Software’s view of storage? Where do variables live?
• Does the hardware help to support function calls? If
so, how?
includes anything programmers need to
know to make a binary machine language
program work correctly, including
instructions, I/O devices, and so on.
Note: The operating system will encapsulate the
details of doing I/O, allocating memory, and other
low-level system functions, so that application
programmers do not need to worry about such
details. The combination of the basic instruction
set and the operating system interface provided for
application programmers is called the application
binary interface (ABI).
allows computer designers to talk about
functions independently from the hardware
that performs them. Computer designers
distinguish architecture from an
implementation of an architecture along the
same lines: an implementation is hardware
that obeys the architecture abstraction.
Operation code (Op code) - What to do
Source Operand reference – Object(s) to use
Result Operand reference – Where to store result
Next Instruction Reference - When you have done
that, do this...
a unique bit pattern for machine code
a symbolic representation for human
readability
◦ e.g. ADD, SUB, LOAD
ADD A,B
Operate Instructions
◦ process data (addition, logical operations, etc.)
Data Movement Instructions …
◦ move data between memory locations and registers.
Control Instructions …
◦ change the sequence of execution of instructions in
the stored program.
The default is sequential execution: the PC is incremented
by 1 at the start of every Fetch, in preparation for the next
one.
Control instructions set the PC to a new value during the
Execute phase, so the next instruction comes from a
different place in the program.
This allows us to build control structures such as loops
and branches.
Length
1. Fixed length
• 32 or 64 bits (depends on architecture – MIPS16
is 16-bit)
+ Simple implementation: compute next PC
using only PC
– Code density: 32 or 64 bits for a NOP?
Length
2. Variable length
– Complex implementation
+ Code density
3. Compromise: a mix of the two, e.g. a few fixed lengths
Encoding
- transforming information from one format to
another
Length
• 32-bits
• MIPS16: 16-bit variants of common instructions for
density
Encoding
• 4 formats, simple encoding
R-Type Instruction (register to register)
opcode (6) | rs (5) | rt (5) | rd (5) | sa (5) | function (6)
I-Type Instruction (register with immediate operand)
opcode (6) | rs (5) | rt (5) | immediate (16)
J-Type Instruction (jump to target)
opcode (6) | target (26)
opcode (31-26)
Opcode is short for "operation code". The
opcode is a binary encoding for the
instruction.
rs (25-21)
This is the first source register. The source
register is the register that holds one of the
arguments of the operation.
rt (20-16)
This is the second source register.
rd (15-11)
This is the destination register. The
destination register is the register where the
result of the operation is stored.
sa (10-6)
The number of bits to shift. Used in shift
instructions.
function (5-0)
An additional 6 bits used to specify the
operation, in addition to the opcode
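Putting the field layout together, a sketch of decoding an R-type word by shifting and masking; the example encoding assumes the standard MIPS numbering of the fields:

```python
def decode_r_type(word):
    """Split a 32-bit MIPS R-type instruction word into its six fields."""
    return {
        "opcode":   (word >> 26) & 0x3F,  # bits 31-26
        "rs":       (word >> 21) & 0x1F,  # bits 25-21, first source
        "rt":       (word >> 16) & 0x1F,  # bits 20-16, second source
        "rd":       (word >> 11) & 0x1F,  # bits 15-11, destination
        "sa":       (word >> 6)  & 0x1F,  # bits 10-6, shift amount
        "function": word         & 0x3F,  # bits 5-0
    }

# add $8, $9, $10 encodes as 0x012A4020 (opcode 0, function 0x20)
print(decode_r_type(0x012A4020))
```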
3 addresses
◦ Operand 1, Operand 2, Result
◦ a = b + c;
◦ May be a fourth – next instruction (usually
implicit)
◦ Not common
◦ Needs very long words to hold everything
2 addresses
◦ One address doubles as operand and
result
◦ a = a + b
◦ Reduces length of instruction
◦ Requires some extra work
Temporary storage to hold some results
1 address
◦ Implicit second address
◦ Usually a register (accumulator)
◦ Common on early machines
0 (zero) addresses
◦ All addresses implicit
◦ Uses a stack
◦ e.g. push a
◦ push b
◦ add
◦ pop c
◦ c = a + b
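The zero-address sequence above can be sketched as a toy stack machine; the instruction names and data representation are illustrative only:

```python
def run_stack_machine(program, variables):
    """Evaluate a tiny zero-address instruction sequence.

    Operands are implicit: PUSH/POP name a variable, while ADD
    works entirely on the top two stack entries.
    """
    stack = []
    for op, *arg in program:
        if op == "PUSH":
            stack.append(variables[arg[0]])
        elif op == "ADD":
            stack.append(stack.pop() + stack.pop())
        elif op == "POP":
            variables[arg[0]] = stack.pop()
    return variables

# c = a + b, exactly as in the slide: push a; push b; add; pop c
print(run_stack_machine(
    [("PUSH", "a"), ("PUSH", "b"), ("ADD",), ("POP", "c")],
    {"a": 2, "b": 3, "c": 0}))
```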
More addresses
◦ More complex (powerful?) instructions
◦ More registers
Inter-register operations are quicker
◦ Fewer instructions per program
Fewer addresses
◦ Less complex (powerful?) instructions
◦ More instructions per program
◦ Faster fetch/execution of instructions
Operation repertoire
◦ How many ops?
◦ What can they do?
◦ How complex are they?
Data types
Instruction formats
◦ Length of op code field
◦ Number of addresses
Registers
◦ Number of CPU registers available
◦ Which operations can be performed on which
registers?
Addressing modes ( will be discussed later…)
RISC v CISC
Addresses
Numbers
◦ Integer/floating point
Characters
◦ ASCII etc.
Logical Data
◦ Bits or flags
(Aside: Is there any difference between numbers and characters?
Ask a C programmer!)
Operation type encoded in instruction opcode
Operations act on operands (stored in registers)
◦ Data Transfer
◦ Arithmetic
◦ Logical
◦ Conversion
◦ I/O
◦ System Control
◦ Transfer of Control
Data Transfer
Specify
◦ Source
◦ Destination
◦ Amount of data
May be different instructions for different
movements
◦ e.g. IBM 370
Or one instruction and different addresses
◦ e.g. VAX
Arithmetic
Add, Subtract, Multiply, Divide, Mod/rem
◦ (signed/unsigned)
Packed integer: padd, pmul, pand, por…
(saturating/wraparound)
Floating point
◦ add, sub, mul, div, sqrt
May include
◦ Increment (a++)
◦ Decrement (a--)
◦ Negate (-a)
Shift and Rotate operations
Logical
Bitwise operations
AND, OR, NOT, XOR, SLL, SRL, SRA
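A sketch of the three shift operations on a 32-bit word, showing the difference between logical (zero-fill) and arithmetic (sign-fill) right shifts. Python integers are unbounded, so the masking stands in for a fixed register width:

```python
WIDTH = 32
MASK = (1 << WIDTH) - 1

def sll(x, n):                        # shift left logical
    return (x << n) & MASK

def srl(x, n):                        # shift right logical: zeros shifted in
    return (x & MASK) >> n

def sra(x, n):                        # shift right arithmetic: sign bit replicated
    x &= MASK
    if x >> (WIDTH - 1):              # negative in two's complement
        return ((x >> n) | (MASK << (WIDTH - n))) & MASK
    return x >> n

print(hex(srl(0x80000000, 4)))   # top bits become zero
print(hex(sra(0x80000000, 4)))   # top bits copy the sign
```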
Conversion
E.g. Binary to Decimal
Input/Output
May be specific instructions
May be done using data movement
instructions (memory mapped)
May be done by a separate controller (DMA)
Systems Control
Privileged instructions
CPU needs to be in specific state
◦ Ring 0 on 80386+
◦ Kernel mode
For operating systems use
Transfer of Control
Branch
◦ e.g. branch to x if result is zero
Skip
◦ e.g. increment and skip if zero
◦ ISZ Register1
◦ Branch xxxx
◦ ADD A
Subroutine call
◦ c.f. interrupt call
What order do we read numbers that occupy
more than one byte?
Endianness
◦ byte ordering used to represent types of data
◦ the transmission order over a network or other
medium
Big endian
◦ most significant byte (MSB) value is stored at the
memory location with the lowest address
Little endian
◦ The least significant byte (LSB) value is stored at the
lowest address
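The two byte orders can be demonstrated with Python's standard struct module:

```python
import struct

value = 0x12345678
big    = struct.pack(">I", value)   # big endian: MSB at the lowest address
little = struct.pack("<I", value)   # little endian: LSB at the lowest address
print(big.hex())
print(little.hex())
```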
refers to the ways in which instructions
reference their operands.
Programmers use statement labels to refer
to instructions and the names of variables
or expressions to refer to the variables
themselves.
Direct – moves a byte or word between a
memory location specified in the instruction
and a register
MOV AL, [1234h]
Register indirect (base relative or indexed) –
transfers a byte or word of data between a
register and the memory location addressed by
an index (DI or SI) or base register (BP or BX)
MOV AX, [BX]
Register – transfers a byte or word from the
source register to the destination register
MOV BX, CX
Immediate – transfers an immediate byte or
word of data into the destination register or
memory location
MOV AX, 3456h
Base plus index (base relative indexed) –
transfers a byte or word of data between a
register and the memory location addressed
by a base register (BP or BX) plus an index
(DI or SI) register
MOV DX, [BX + DI]
Register relative – transfers a byte or word of
data between a register and the memory location
addressed by an index (DI or SI) or base register
(BP or BX) plus a constant displacement
MOV AX, [BX + 1000h]
Base relative plus index – transfers a byte or
word of data between a register and the
memory location addressed by a base register
(BP or BX) plus an index register (DI or SI)
plus a displacement
MOV DX, [BP + SI + 1000h]
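All of the memory-addressing modes above reduce to summing some subset of base, index and displacement; a sketch, with register values made up purely for illustration:

```python
def effective_address(base=0, index=0, displacement=0):
    """Effective address for the base-plus-index family of modes.

    Each 8086 mode is a special case: register indirect uses only a
    base, register relative adds a displacement, and so on.
    """
    return (base + index + displacement) & 0xFFFF  # 16-bit offset wraps

BX, DI = 0x2000, 0x0010                            # illustrative values
print(hex(effective_address(base=BX)))             # like MOV AX, [BX]
print(hex(effective_address(base=BX, index=DI)))   # like MOV DX, [BX + DI]
print(hex(effective_address(base=BX, index=DI, displacement=0x1000)))
```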
For the CPU to function effectively, it must:
◦ Fetch instructions
◦ Interpret instructions
◦ Fetch data
◦ Process data
◦ Write data
The Control Unit orchestrates the complete
execution of each instruction:
◦ At its heart is a Finite State Machine (FSM) that sets
up the state of the logic circuits according to each
instruction.
◦ This process is governed by the system clock - the
FSM goes through one transition (“machine cycle”)
for each tick of the clock.
Six phases of the complete Instruction Cycle
◦ Fetch: load IR with instruction from memory
◦ Decode: determine action to take (set up inputs for
ALU, RAM, etc.)
◦ Evaluate address: compute memory address of
operands, if any
◦ Fetch operands: read operands from memory or
registers
◦ Execute: carry out instruction
◦ Store results: write result to destination (register or
memory)
Fetch
◦ The first step is to read an instruction from memory.
◦ This actually takes several smaller steps, or “micro-
instructions”:
MAR ← (PC) ; use the value in PC to access memory
PC ← (PC) + 1 ; increment the value of PC
MDR ← Mem[MAR] ; read memory location into MDR
IR ← (MDR) ; copy (MDR) to IR
◦ Steps 1, 2 & 4 take a single machine cycle each, but step 3
(memory access) can take many machine cycles.
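The fetch micro-steps can be sketched as a toy simulation; the register names follow the text, and the memory contents are arbitrary:

```python
def fetch(cpu, memory):
    """Perform one instruction fetch, following the micro-steps above."""
    cpu["MAR"] = cpu["PC"]            # MAR <- (PC)
    cpu["PC"] = cpu["PC"] + 1         # PC  <- (PC) + 1
    cpu["MDR"] = memory[cpu["MAR"]]   # MDR <- Mem[MAR]
    cpu["IR"] = cpu["MDR"]            # IR  <- (MDR)
    return cpu

cpu = {"PC": 0, "MAR": 0, "MDR": 0, "IR": 0}
memory = [0xA1, 0xB2, 0xC3]           # arbitrary instruction words
fetch(cpu, memory)
print(hex(cpu["IR"]), cpu["PC"])      # IR holds the fetched word, PC advanced
```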
Decode
◦ The opcode is input to a decoder, which sets
up the ensuing sequence of events according
the instruction.
Evaluate Address
◦ Computes the address of the operand (if
any), or of the memory location to be
accessed: e.g. the location from which to
obtain a value in a LOAD instruction.
This is known as the Effective Address (EA).
Fetch Operands
◦ Obtains the source operand(s), if required for
execution.
◦ Operands can come from Registers or RAM, or
be embedded in the instruction itself.
The Effective Address (EA) determined in the
previous step is used to obtain an operand
from memory.
Execute
◦ the instruction is executed.
e.g. if the opcode was ADD, the two source
operands are added by the ALU.
If the opcode was a control instruction, a
value is written to the PC
NB: Data Movement instructions don’t have an execute phase
Store Result
◦ If there is a result from the operation it is
written to memory (using the EA), or to a
register.
Note: Some instructions don't need all 6 phases
If only using registers, skip Evaluate Address
If only moving data, skip Execute
Start over …
◦ The control unit just keeps repeating this whole
process: so it now Fetches a new instruction from the
address currently stored in the PC.
Recall that the PC was incremented in the first step
(FETCH), so the instruction retrieved will be the next in the
program as stored in memory - unless the instruction just
executed changed the contents of the PC.
May require memory access to fetch operands
Indirect addressing requires more memory
accesses
Can be thought of as additional instruction
subcycle
A “good” design goal of any system is to have all
of its components performing useful work all of
the time – high efficiency
Following the instruction cycle in a sequential
fashion does not permit this level of efficiency
Compare the instruction cycle to an automobile
assembly line… do you see a problem?
Solution: Perform all tasks concurrently, but on
different (sequential) instructions
◦ i.e Overlap the stages on instruction cycle
◦ The result is temporal parallelism instruction
pipeline
An ideal pipeline divides a task into k
independent sequential subtasks
◦ Each subtask requires 1 time unit to complete
◦ The task itself then requires k time units to complete
For n iterations of the task, the execution times
will be:
◦ With no pipelining: n x k time units
◦ With pipelining: k + (n -1) time units
Speedup of a k-stage pipeline is thus:
S = n k / [k +(n -1)] ==> k (for large n)
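The speedup formula can be checked numerically:

```python
def pipeline_speedup(k, n):
    """Speedup of a k-stage pipeline over sequential execution of n tasks:
    S = n*k / (k + (n - 1))."""
    return (n * k) / (k + (n - 1))

print(pipeline_speedup(5, 5))      # modest speedup for few iterations
print(pipeline_speedup(5, 1000))   # approaches k = 5 for large n
```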
does not decrease the time for individual
instruction execution
Increases instruction throughput
◦ The throughput of the instruction pipeline is
determined by how often an instruction exits the
pipeline.
introduces hazards as it changes the relative timing
of instructions by overlapping their execution
situations in which a correct program ceases to
work correctly due to implementing the
processor with a pipeline
three fundamental types of hazard:
◦ data hazards
◦ branch hazards
◦ structural hazards
Note: Only branch hazards and RAW data hazards are possible in MIPS pipeline
Data hazards
◦ when reads and writes of data occur in a different
order in the pipeline than in the program code,
resulting in data dependencies
◦ an instruction uses the result of a previous
instruction (RAW)
e.g. ADD R1, R2, R3 then ADD R4, R1, R5
or SW R1, 3(R2) then LW R3, 3(R2)
◦ Write After Read (WAR), Write After Write (WAW), and
Read After Write (RAW) hazards
Data hazards
WAR
◦ the reverse of a RAW: in the code a write occurs after
a read, but the pipeline causes write to happen first
WAW
◦ when two writes occur out of order - when there is no
read in between
RAW
◦ when, in the code as written, one instruction reads a
location after an earlier instruction writes new data to
it, but in the pipeline the write occurs after the read
(so the instruction doing the read gets stale data).
Example (Data hazard)
add $1, $2, $3     IF  ID  EX  MEM WB
add $4, $5, $6         IF  ID  EX  MEM WB
add $7, $8, $9             IF  ID  EX  MEM WB
add $10, $11, $12              IF  ID  EX  MEM WB
add $13, $14, $1                   IF  ID  EX  MEM WB   (data arrives early; OK)
add $15, $16, $7                       IF  ID  EX  MEM WB   (data arrives on time; OK)
add $17, $18, $13                          IF  ID  EX  MEM WB   (uh, oh)
add $19, $20, $17                              IF  ID  EX  MEM WB   (uh, oh again)
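A sketch of detecting such RAW hazards by scanning instruction tuples; the 3-instruction window is an assumption matching a 5-stage pipeline with split-cycle register access, and the representation is purely illustrative:

```python
def raw_hazards(instructions, distance=3):
    """Flag RAW hazards: a read of a register written fewer than
    `distance` instructions earlier (i.e. before its WB completes).

    Each instruction is a (dest, src1, src2) tuple.
    """
    hazards = []
    for i, (_, s1, s2) in enumerate(instructions):
        for j in range(max(0, i - distance), i):
            dest = instructions[j][0]
            if dest in (s1, s2):
                hazards.append((j, i, dest))   # (writer, reader, register)
    return hazards

program = [("$1", "$2", "$3"),   # add $1, $2, $3
           ("$4", "$5", "$1"),   # add $4, $5, $1  <- reads $1 too soon
           ("$6", "$1", "$7")]   # add $6, $1, $7  <- reads $1 too soon
print(raw_hazards(program))
```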
Branch/Control hazards
occurs when a decision needs to be made,
but the information needed to make the
decision is not available yet
JMP LOOP
…
LOOP: ADD R1, R2, R3
Structural hazards
occur when a single piece of hardware is used
in more than one stage of the pipeline, so it's
possible for two instructions to need it at the
same time
Four possible techniques
1. Stall
◦ Can resolve any type of hazard
◦ Detect the hazard
◦ Freeze the pipeline up to the dependent stage until
the hazard is resolved - simply make the later
instruction wait until the hazard resolves itself
◦ Undesirable because it slows down the machine, but
may be necessary.
2. Bypass/Forward
◦ detects condition - If the data is available
somewhere, but is just not where we want it, create
extra data paths to “forward” the data to where it is
needed
◦ no need to stall - eliminates stalls for single-cycle
operations - reduces longest stall to N-1 cycles for N-
cycle operations
◦ best solution, since it doesn't slow the machine down
and doesn't change the semantics of the instruction
set.
3. Document/punt –
define instruction sequences that are
forbidden, or change the semantics of
instructions, to account for the hazard -
worst solution, both because it results in
obscure conditions on permissible
instruction sequences, and (more
importantly) because it ties the instruction
set to a particular pipeline implementation.
4. Add hardware
◦ most appropriate to structural hazards; if a
piece of hardware has to be used twice in
an instruction, see if there is a way to
duplicate the hardware.
                   1   2   3   4   5   6   7   8   9
ADD R1, R2, R3     IF  ID  EX  MEM WB
SUB R4, R5, R1         IF  ID  EX  MEM WB
AND R6, R1, R7             IF  ID  EX  MEM WB
OR  R8, R1, R9                 IF  ID  EX  MEM WB
XOR R10, R1, R11                   IF  ID  EX  MEM WB
Analysis of hazards
All the instructions after the ADD use the result of the
ADD instruction (in R1).
The ADD instruction writes the value of R1 in the WB
stage and the SUB instruction reads the value during the
ID stage, resulting in a data hazard; unless
precautions are taken to prevent it, the SUB instruction
will read the wrong value and try to use it.
For the AND instruction, the write of R1 does not
complete until the end of cycle 5. Thus, the AND
instruction that reads the registers during cycle 4 will
receive the wrong result.
Analysis of hazards
The OR instruction can be made to operate without
incurring a hazard by a simple implementation
technique. The technique is to perform register file
reads in the second half of the cycle, and writes in the
first half. Because both WB for ADD and ID for OR are
performed in one cycle 5, the write to register file by
ADD will perform in the first half of the cycle, and the
read of registers by OR will perform in the second half
of the cycle.
Analysis of hazards
The XOR instruction operates properly, because its
register read occur in cycle 6 after the register write
by ADD.
Multiple Streams
Prefetch Branch Target
Loop buffer
Branch prediction
Delayed branching
Multiple Streams
Have two pipelines
Prefetch each branch into a separate pipeline
Use appropriate pipeline
◦ Leads to bus & register contention
◦ Multiple branches lead to further pipelines
being needed
Prefetch Branch Target
Target of branch is prefetched in addition to
instructions following branch
Keep target until branch is executed
Used by IBM 360/91
Loop Buffer
Very fast memory
Maintained by fetch stage of pipeline
Check buffer before fetching from memory
Very good for small loops or jumps
c.f. cache
Used by CRAY-1
Loop Buffer Diagram
Branch Prediction
Predict never taken
◦ Assume that jump will not happen
◦ Always fetch next instruction
◦ 68020 & VAX 11/780
◦ VAX will not prefetch after branch if a page fault would
result (O/S v CPU design)
Predict always taken
◦ Assume that jump will happen
◦ Always fetch target instruction
Branch Prediction
Predict by Opcode
◦ Some instructions are more likely to result in a jump
◦ Can get up to 75% success
Taken/Not taken switch
◦ Based on previous history
◦ Good for loops
Delayed Branch
◦ Do not take jump until you have to
◦ Rearrange instructions
Performance of a machine is determined by:
Instruction count (determined by compiler & ISA)
Clock cycle time
Clock cycles per instruction
Processor design (datapath and control) will
determine:
Clock cycle time
Clock cycles per instruction
CPU Time = execution time
= instruction count x CPI x clock cycle time
= instruction count x CPI / clock rate
NB: clock cycle time = 1 / clock rate; CPI = cycles per
instruction
Performance ~ 1/Execution Time
CPU PerformanceA / CPU PerformanceB
= Execution TimeB / Execution TimeA
Example
System A takes 12 seconds to execute a
program. System B takes 20 seconds to
execute the same program. Which system is
faster and by how much?
Example (soln)
NB: Slower time = faster machine
CPU PerformanceA / CPU PerformanceB
= Execution TimeB / Execution TimeA
(Normally read: faster as compared to slower)
=> CPU PerformanceA / CPU PerformanceB
= 20 seconds / 12 seconds
System A is 20/12 = 1.67 times or 67% faster than
System B for this application.
67% performance improvement
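The ratio above can be computed directly (a trivial sketch of the performance-ratio formula, using the 12 s and 20 s figures from the example):

```python
# Relative performance: performance is inversely proportional to
# execution time, so the faster machine's speedup is the ratio of the
# slower time to the faster time.
def relative_performance(time_a, time_b):
    """How many times faster system A is than system B."""
    return time_b / time_a


ratio = relative_performance(12, 20)
print(f"System A is {ratio:.2f}x faster "
      f"({(ratio - 1) * 100:.0f}% improvement)")
```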
Computing CPI
The CPI is the average number of cycles per
instruction.
If, for each instruction type, we know its frequency and the
number of cycles needed to execute it, we can compute the
overall CPI as follows:
CPI = Σ (CPIi x Fi)
For example
Op       F      CPI   CPI x F   % Time
ALU      50%    1     0.5       23
Load     20%    5     1.0       45
Store    10%    3     0.3       14
Branch   20%    2     0.4       18
Total    100%         2.2       100
(the frequencies (F) must always add up to 100%)
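The weighted CPI and the % Time column can be reproduced from the table (the instruction mix is taken from the example above):

```python
# Overall CPI as a frequency-weighted sum, plus each class's share of
# execution time (its cycles as a fraction of all cycles).
mix = {                        # instruction class -> (frequency, CPI)
    "ALU":    (0.50, 1),
    "Load":   (0.20, 5),
    "Store":  (0.10, 3),
    "Branch": (0.20, 2),
}

overall_cpi = sum(f * cpi for f, cpi in mix.values())
for op, (f, cpi) in mix.items():
    share = f * cpi / overall_cpi
    print(f"{op:>6}: {share:5.1%} of time")
print(f"Overall CPI = {overall_cpi}")
```

Note that the slow Load class (20% of instructions) accounts for the largest share of execution time.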
Estimating Performance Improvements
Assume a processor currently requires 10 seconds to
execute a program and performance improves by 50% per
year.
By what factor does processor performance improve in 5
years?
Factor of improvement (foi) = (1 + rate)^years
Factor of improvement = (1 + 0.5)^5 = 7.59
How long will it take a processor to execute the program
after 5 years?
Exec_Timenew = Exec_Timeold / foi = 10 / 7.59 = 1.32 seconds
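The compound-improvement arithmetic can be checked directly (the 50% annual rate, 5-year horizon, and 10 s starting time are from the example):

```python
# Compound performance improvement: (1 + rate) ** years.
def factor_of_improvement(rate, years):
    return (1 + rate) ** years


foi = factor_of_improvement(0.5, 5)     # 1.5 ** 5 = 7.59375
new_time = 10 / foi                     # old time divided by the factor
print(f"foi = {foi:.2f}, new execution time = {new_time:.2f} s")
```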
Example:
You have a system that contains a special
processor for doing floating-point operations.
You have determined that 50% of your
computations can use the floating-point
processor. The speedup of the floating pointing-
point processor is 15. What is the overall
speedup?
Execution time_new = Execution time_old x
[(1 – Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]
The overall speedup is the ratio of the execution times:
Speedup_overall = Execution time_old / Execution time_new
= 1 / [(1 – Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]
Overall speedup achieved by using the
floating-point processor.
Overall speedup = 1 / [(1 – 0.5) + 0.5/15]
= 1 / (0.5 + 0.033)
= 1.876
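The speedup formula translates into a one-line function (the 0.5 fraction and enhanced speedup of 15 come from the example; the exact value is 1.875, which the slide rounds via 0.033 to 1.876):

```python
# Overall speedup when only a fraction of the work is enhanced:
# the unenhanced fraction limits the achievable gain.
def overall_speedup(fraction_enhanced, speedup_enhanced):
    return 1 / ((1 - fraction_enhanced)
                + fraction_enhanced / speedup_enhanced)


print(overall_speedup(0.5, 15))   # ≈ 1.875
```

Note that even with a speedup of 15 on half the work, the overall gain is under 2x, because the other half of the work is untouched.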
Example: (Impact of optimizing compiler)
Assume the following program makeup:
Operation   Freq.   Clock Cycles
ALU         43%     1
Load        21%     2
Store       12%     2
Branch      24%     2
Assume a 20 ns clock (50 MHz); an optimizing compiler
eliminates 50% of all ALU operations.
Solution (NOT Optimized):
MIPS = instruction count / (Exec Time x 10^6)
= clock rate / (CPI x 10^6)
Avg. CPI = (0.43 x 1) + (0.21 x 2) + (0.12 x 2) + (0.24 x 2)
= 1.57
MIPS = 50 MHz / (1.57 x 10^6)
= (50 x 10^6) / (1.57 x 10^6)
= 31.8
NB: the 10^6 factors cancel, so the result comes out directly
in millions of instructions per second
Solution (Optimized):
Avg. CPI = (0.43/2 x 1 + 0.21 x 2 + 0.12 x 2 + 0.24 x 2)
/ (1 – 0.43/2)
= 1.73
MIPS = 50 MHz / (1.73 x 10^6)
= (50 x 10^6) / (1.73 x 10^6)
= 28.9
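Both solutions can be checked numerically (the mix and 50 MHz clock come from the example; note that eliminating ALU operations shrinks the instruction count as well as the cycle count, which is why the optimized CPI divides by 1 – 0.43/2):

```python
# MIPS before and after an optimizing compiler removes half the ALU ops.
clock_rate = 50e6     # 20 ns clock period -> 50 MHz

# instruction class -> (frequency, cycles)
mix = [("ALU", 0.43, 1), ("Load", 0.21, 2),
       ("Store", 0.12, 2), ("Branch", 0.24, 2)]

cpi = sum(f * c for _, f, c in mix)            # 1.57
mips = clock_rate / (cpi * 1e6)                # 31.8

# Optimized: half the ALU ops (each costing 1 cycle) disappear,
# reducing both total cycles and total instructions.
alu_f = 0.43
cpi_opt = (cpi - (alu_f / 2) * 1) / (1 - alu_f / 2)   # 1.73
mips_opt = clock_rate / (cpi_opt * 1e6)               # 28.9
```

The MIPS rating drops even though the program got faster: the cheap 1-cycle instructions were removed, raising the average CPI. This is a classic illustration of why MIPS can mislead.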
An implementation of a CPU must be considered together
with the processor design – the sequencing and execution
of instructions:
◦ the individual components that are necessary
◦ the use of a clock
◦ registers and memory
Datapath
The interconnection of functional units that
make up the processor, such as ALUs,
decoders, and multiplexers that perform data
processing operations
A component of most processors
Works in conjunction with the control unit
Two implementations
Single cycle implementation
◦ uses a single clock cycle for every instruction.
(Start execution with one clock edge, and complete on the next edge)
◦ Advantage: One clock cycle per instruction
◦ Disadvantage: long cycle time
Multicycle implementation
◦ instructions use multiple clock cycles.
◦ Advantage: Shorter execution time
◦ Disadvantage: More complex control
[Figure: MIPS single-cycle datapath – PC, instruction
memory, register file, sign-extend unit, ALU, and data
memory, connected through multiplexers and driven by the
control signals PCSrc, RegDst, RegWrite, ALUSrc, ALUOp,
MemRead, MemWrite, and MemtoReg]
Cycle time = Σ (stage delays)
Execution Time = IC * CPI * Cycle Time
Processor design (datapath and control) will
determine:
◦ Clock cycle time
◦ Clock cycles per instruction
All combinational logic must stabilize within
one clock cycle.
All state elements will be written exactly
once at the end of the clock.
CPI = 1
All operations must occur in parallel
Used to improve performance in real computer
systems
Divides instruction execution into multiple clock
cycles.
Functional units can be reused on different cycles, saving
hardware
◦ ALU used to compute address and to increment PC
◦ Memory used for instruction and data
Control signals not determined solely by
instruction
◦ e.g., what should the ALU do for a “subtract”
instruction?
Shorter instructions can use fewer clock cycles, so
total execution time is reduced.
Cycle time = time for longest stage
Execution time = cycle time * IC * CPI
Advantages
◦ The work required by the typical instructions can be
divided over approximately equal, smaller elementary
operations.
◦ The clock cycle can be set to the longest elementary
operation.
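The trade-off can be illustrated numerically (the stage delays, instruction count, and average multicycle CPI below are hypothetical assumptions, chosen only to make the comparison concrete):

```python
# Single-cycle vs multicycle timing with assumed stage delays (ns)
# for fetch, decode, execute, memory access, and writeback.
stages = [200, 100, 150, 200, 100]
ic = 1_000_000                      # assumed instruction count

# Single cycle: the clock must cover every stage of every instruction.
single_cycle_time = sum(stages)     # 750 ns, CPI = 1
single_exec = ic * 1 * single_cycle_time

# Multicycle: the clock covers only the longest stage, but each
# instruction takes several cycles (assume an average CPI of 3.5,
# since most instructions skip some stages).
multi_cycle_time = max(stages)      # 200 ns
avg_cpi = 3.5
multi_exec = ic * avg_cpi * multi_cycle_time
```

Under these assumptions the multicycle design wins (700 ms vs 750 ms of stage time per million instructions) because instructions that skip stages no longer pay for them.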
Multicycle Datapath
The performance of a pipelined processor may
be considered:
Cycle time
= longest stage + pipeline overhead
Execution time
= cycle time * (no. of stages + IC – 1)
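The pipelined formulas plug in directly (the stage delay, latch overhead, pipeline depth, and instruction count are hypothetical assumptions):

```python
# Pipelined execution time: the first instruction takes n_stages cycles
# to drain the pipe, then one instruction completes per cycle.
longest_stage = 200        # ns, slowest stage
overhead = 5               # ns, pipeline register (latch) overhead
n_stages = 5
ic = 1_000_000             # assumed instruction count

cycle_time = longest_stage + overhead
exec_time = cycle_time * (n_stages + ic - 1)   # ns
```

For large instruction counts the `n_stages - 1` fill term is negligible, and throughput approaches one instruction per (slightly lengthened) cycle.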
the processor is not the only component in the
system that determines overall system
performance
Speed is NOT performance e.g. a system running
with a Pentium 150 is not necessarily "50% faster"
than one with a Pentium 100.
speeding up the processor only improves system
performance for those aspects of system use that
depend on the processor.
most processors are already much faster than the
devices that support them, so they spend a great deal
of time waiting around for data that they can use
One of the most important factors that influences
overall system performance is memory bus speed e.g.
a slower processor running on a 66 MHz system bus
can provide comparable performance to a faster one
on a 60 MHz bus (Pentium 133 vs Pentium 150 in
some benchmarks)
very high clock speeds sometimes result in minimal
improvement in overall system performance, even if the
processor's own performance increases a great deal (e.g. the
Pentium 200, which by most benchmarks provides less than
10% system performance improvement over the Pentium 166,
despite scoring 20% higher on processor-only benchmarks) –
a sort of law of diminishing returns in processor speed.