CPT 214
COMPUTER ARCHITECTURE
HANDOUT
This book has been digitally prepared by the Information Management and Administrator's System (InfoMAS)
http://futmininfo.hexat.com
This material, its content and layout belong to the original owner, the lecturer at SICT, FUTMINNA. InfoMAS claims no ownership of any part of this material apart from the cover page put together by our team.
This book is FREE for use by anyone, anywhere, at no cost and with no restrictions whatsoever. You can download this book at:
Http://futmininfo.hexat.com
Http://futmininfo.blogspot.com
Http://infomas.hexat.com
In addition, three more operators have been defined for Boolean algebra: XOR (Exclusive OR), NOR (NOT OR) and NAND (NOT AND).
However, for designing and analysing a logical circuit it is convenient to use the AND, OR and NOT operators, because AND and OR obey many of the same laws as multiplication and addition in the ordinary algebra of real numbers.
We will discuss the XOR, NOR and NAND operators in the next subsection.
Point 3:
The basic logical identities used in Boolean algebra are:
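For reference, the commonly used identities include the following (here A.B denotes A AND B, A + B denotes A OR B, and A' denotes NOT A):
A + 0 = A                    A . 1 = A
A + 1 = 1                    A . 0 = 0
A + A = A                    A . A = A
A + A' = 1                   A . A' = 0
(A')' = A
A + B = B + A                A . B = B . A                   (commutative laws)
A + (B + C) = (A + B) + C    A . (B . C) = (A . B) . C       (associative laws)
A . (B + C) = A.B + A.C      A + B.C = (A + B) . (A + C)     (distributive laws)
(A + B)' = A' . B'           (A . B)' = A' + B'              (De Morgan's laws)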
We will not give proofs of these identities here. The use of some of the identities is shown in the example given after the definition of a Boolean function. De Morgan's laws are very useful in simplifying logical expressions.
1.2 Boolean Function:
A Boolean function is defined as an algebraic expression formed with binary variables, the logic operation symbols, parentheses, and an equals sign. For example,
F = A.B + C is a Boolean function.
The value of the Boolean function F can be either 0 or 1.
A Boolean function can be translated into a logic diagram and vice versa (we will discuss this in the next section); therefore, if we express the logic operations in Boolean algebraic form and simplify the expression, we obtain a simplified form of the logic circuit.
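As an illustration, the minimal C sketch below evaluates F = A.B + C for every combination of inputs; its output is exactly the truth table a designer would draw for this function:

#include <stdio.h>

/* Evaluate the Boolean function F = A.B + C for all input combinations. */
int main(void)
{
    printf("A B C | F\n");
    for (int a = 0; a <= 1; a++)
        for (int b = 0; b <= 1; b++)
            for (int c = 0; c <= 1; c++) {
                int f = (a & b) | c;        /* . is AND (&), + is OR (|) */
                printf("%d %d %d | %d\n", a, b, c, f);
            }
    return 0;
}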
The AND gate is an electronic circuit that gives a high output (1) only if all its inputs are high.
A dot (.) is used to show the AND operation i.e. A.B. Bear in mind that this dot is sometimes
omitted i.e. AB
The OR gate is an electronic circuit that gives a high output (1) if one or more of its inputs are
high. A plus (+) is used to show the OR operation.
The NOT gate is an electronic circuit that produces an inverted version of the input at its
output. It is also known as an inverter. If the input variable is A, the inverted output is known
as NOT A. This is also shown as A', or A with a bar over the top, as shown at the outputs. The
diagrams below show two ways in which the NAND logic gate can be configured to produce a
NOT gate. It can also be done using NOR logic gates in the same way.
This is a NOT-AND gate which is equal to an AND gate followed by a NOT gate. The outputs
of all NAND gates are high if any of the inputs are low. The symbol is an AND gate with a
small circle on the output. The small circle represents an inversion.
This is a NOT-OR gate which is equal to an OR gate followed by a NOT gate. The outputs of
all NOR gates are low if any of the inputs are high. The symbol is an OR gate with a small
circle on the output. The small circle represents an inversion.
The 'Exclusive-OR' gate is a circuit which will give a high output if either, but not both, of its two inputs is high. An encircled plus sign (⊕) is used to show the EOR operation.
The 'Exclusive-NOR' gate circuit does the opposite of the EOR gate. It will give a low output if either, but not both, of its two inputs is high. The symbol is an EXOR gate with a small circle on the output; the small circle represents an inversion. The NAND and NOR gates are called universal functions since, with either one, the AND, OR and NOT functions can be generated.
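To make the universality claim concrete, here is a small C sketch that builds NOT, AND and OR out of a single NAND primitive and prints their truth tables:

#include <stdio.h>

/* A single NAND primitive operating on 1-bit values. */
static int nand(int a, int b) { return !(a && b); }

/* NOT, AND and OR built only from NAND. */
static int not_(int a)        { return nand(a, a); }
static int and_(int a, int b) { return not_(nand(a, b)); }
static int or_(int a, int b)  { return nand(not_(a), not_(b)); }

int main(void)
{
    printf("A B | NOT A  A AND B  A OR B\n");
    for (int a = 0; a <= 1; a++)
        for (int b = 0; b <= 1; b++)
            printf("%d %d |   %d       %d        %d\n",
                   a, b, not_(a), and_(a, b), or_(a, b));
    return 0;
}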
Note:
A function in sum-of-products form can be implemented using NAND gates alone by replacing all the AND and OR gates by NAND gates. A function in product-of-sums form can be implemented using NOR gates alone by replacing all the AND and OR gates by NOR gates. Also note that a truth table with n inputs has 2^n rows, which makes it easy to compare the outputs of different gates.
The truth tables of NAND and NOR can be derived from NOT (A AND B) and NOT (A OR B) respectively. Exclusive OR (XOR) is a special gate whose output is one only if its inputs are not equal. The inverse of exclusive OR can be used as a comparator, which will produce a one output only if its two inputs are equal.
Digital circuits often use only one or two types of gates, for simplicity in fabrication. Therefore, one must think in terms of functionally complete sets of gates. What does a functionally complete set imply? A set of gates by which any Boolean function can be implemented is called a functionally complete set. The functionally complete sets include: (AND, NOT), (NOR), (NAND), (OR, NOT), etc.
The basic design issue related to combinational circuits is the minimization of the number of gates. The normal constraints for combinational circuit design are:
· The depth of the circuit should not exceed a specified level.
· The number of input lines to a gate (fan-in) and the number of gates its output can feed (fan-out) are constrained by the circuit's power constraints.
You should be familiar with all the terms given in the above diagram except the term "disk cache". The disk cache may be a portion of RAM, sometimes called a soft disk cache, that is used to speed up access to the disk. In some newer technologies such a memory can be part of the disk drive itself; such a memory is sometimes called a hard disk cache or buffer. These hard disk caches are more effective, but they are expensive and therefore smaller. Almost all modern disk drives include a small amount of internal cache.
Some of the main attributes that are considered for the storage devices are physical size, energy
consumption and reliability. The physical size also depends on the storage density of the
memories. This is also coupled with the question of portability of memory. The energy
consumption is another key factor that determines the power consumption of the computer and
cooling system requirements. The higher the power consumption, the costlier the equipment
required for the internal cooling of the computer. Reliability is measured as mean time to
failure. Storage devices which require mechanical motion, e.g. hard disks, are more prone to failure than semiconductor memories, which are totally electronic in nature. Very high-speed semiconductor memories are also prone to failure as the technology moves towards its limits.
Cost: Another key factor of prime concern in a memory system is cost. It is normally expressed per bit. The cost of a memory system includes not only the device itself but also the cost of the access circuitry and peripherals essential for the operation of the memory. Table 3 also shows the per-bit cost of these memories. Please note that as the access time of a memory increases, its per-bit cost decreases.
These are small fast memories placed between the processor and the main memory (see Figure
51). Caches are faster than the main memory. Then why do we need the main memory at all?
Well, once again it is cost which determines this. Caches, although fast, are very expensive memories and are therefore used only in small sizes. For example, caches of 128K, 256K etc. are normally used in typical Pentium-based systems, whereas such systems may have 4 to 128 MB of RAM or even more. Thus, small cache memories are intended to provide fast memory retrieval without sacrificing the size of memory (the main memory provides the capacity). If we have such a small amount of fast memory, how can it be advantageous in increasing the overall speed of memory references? The answer lies in the "principle of locality", which says that if a particular memory location is accessed, then it is highly likely that it and its nearby locations will be accessed in the near future.
The cache contains a copy of certain portions of main memory. A memory read or write operation is first checked against the cache, and if the desired location is available in the cache it is used by the CPU directly. Otherwise, a block of words is read from main memory into the cache and the word is then used by the CPU from the cache. Since the cache has limited space, a portion called a slot needs to be vacated for this incoming block; the contents of the vacated slot are written back to main memory at the position they belong to. The reason for
bringing a block of words to cache is once again locality of reference. We expect that the next
few addresses will be close to this address and therefore, the block of words is transferred from
the main memory to the cache. Thus, for a word which is not in the cache, the access time is slightly more than the access time of main memory without a cache. But because of locality of reference, the next few words are likely to be in the cache, thus enhancing the overall speed of memory references. For example, if a memory read cycle takes 100 ns and a cache read cycle takes 20 ns, then for four consecutive references the first brings the block from main memory into the cache (about 100 ns) and the next three are served from the cache (20 ns each), giving roughly 100 + 3 x 20 = 160 ns instead of the 4 x 100 = 400 ns needed without a cache.
Thus, the closer the reference interval, the better the performance of cache and that is why a
structured code is considered a good programming practice, since it provides maximum
possible locality. The performance of cache is closely related to the nature of the programs
being executed; therefore, it is very difficult to say what the optimum size of a cache should be, but in general a cache size between 128K and 256K is considered optimum for most cases. Another important aspect of a cache is the size of the block (refer to Figure 51), which is
normally equivalent to 4 to 8 addressable units (which may be words, bytes etc). Figure 51
gives the cache memory organization.
A cache consists of a number of slots. As mentioned earlier, the cache is smaller than main memory; therefore, there is no possibility of a one-to-one mapping of the contents of main memory onto the cache.
So how will the computer identify which block is residing in the cache? This is achieved by storing the block address in the tag field of the slot. In the above figure four consecutive words are put in a single block. Thus, if we have 2^n words in main memory, the total number of blocks in main memory is 2^n / 4 = 2^(n-2) (the size of one block being 4 words). Therefore, we need at least (n-2) bits for the tag field. The important question in cache memory is how a main memory block can be mapped into the cache. There are three ways of doing so: direct, associative and set associative.
3.3 Direct Mapping: In this mapping each block of main memory is mapped to one fixed slot of the cache only. For example, if a cache has four slots, then main memory blocks 0, 4, 8, 12, 16, … can be found only in slot 0; blocks 1, 5, 9, 13, 17, … in slot 1; blocks 2, 6, 10, 14, 18, … in slot 2; and blocks 3, 7, 11, 15, 19, … in slot 3. This can be defined mathematically as i = j mod m, where i is the cache slot number, j is the main memory block number and m is the number of slots in the cache.
In this technique, it can be easily determined whether a block is in cache or not. (How?)
This technique is simple, but it has a disadvantage. Suppose two words that are referenced alternately and repeatedly happen to fall in the same slot; then their two blocks will be swapped in and out of the cache continuously, reducing the efficiency of the cache. A direct mapping example is shown in Figure 52(a). Please note the mapping of the main memory address to the cache tag and slot. All addresses are in hexadecimal notation.
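The short C sketch below shows how an address would be split under direct mapping; the 4-slot cache and 4-word block size are assumed to match the example above, and the sample addresses are invented:

#include <stdio.h>

#define WORDS_PER_BLOCK 4   /* block size assumed: 4 words */
#define CACHE_SLOTS     4   /* cache size assumed: 4 slots */

int main(void)
{
    unsigned addresses[] = { 0x0000, 0x0013, 0x0058, 0x1058 };

    for (int i = 0; i < 4; i++) {
        unsigned addr  = addresses[i];
        unsigned block = addr / WORDS_PER_BLOCK;  /* which block the word lies in          */
        unsigned word  = addr % WORDS_PER_BLOCK;  /* word offset within the block          */
        unsigned slot  = block % CACHE_SLOTS;     /* direct mapping: slot = block mod m    */
        unsigned tag   = block / CACHE_SLOTS;     /* tag identifies which block is in slot */
        printf("address 0x%04X -> block %u, slot %u, tag %u, word %u\n",
               addr, block, slot, tag, word);
    }
    return 0;
}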
3.4 Associative Mapping: In associative mapping any block of memory can be mapped to any slot of the cache. But here the main difficulty is to determine whether a block is in the cache or not. This determination is normally carried out by comparing the tag of the required block with the tags of all the cache slots simultaneously. The main disadvantage of this mapping is the complex circuitry required to examine all the cache slots in parallel to determine the presence or absence of a block in the cache. An associative mapping example is shown in Figure 52(b).
3.5 Set Associative Mapping: This is a compromise between the two mappings described above, combining the advantages of both direct and associative caches. The cache is divided into a number of sets, say t. The scheme is that direct mapping is used to map a main memory block to one of the t sets, and within that set the block may be placed in any slot. Thus, we have associative mapping inside each set. Please refer to Figure 52(c) for an example. Another important feature of a cache is the replacement algorithm. For direct mapping no algorithm is needed, since only one slot can be occupied by a given block, but in associative and set associative mapping a block may be placed in any of several slots. So which slot should be vacated for the new block? One of the common strategies is to vacate the least recently used slot. The reason is simply that the probability of accessing a block which was last used quite long ago is lower than that of blocks used more recently. This scheme is also derived from the principle of locality. Other schemes, which may not be as effective as least recently used, include first-in-first-out (FIFO) and random replacement.
4. It should have a provision for data buffering
Data buffering is quite useful for smoothing out the gap in speed between the CPU and the I/O devices. The data buffers are registers which hold the I/O information temporarily. The I/O is performed in short bursts in which data is stored in a buffer area, while the device can take its own time to accept it. In an I/O-device-to-CPU transfer, data is first transferred to the buffer and then passed on to the CPU from these buffer registers. Thus, the I/O operation does not tie up the bus for the slower I/O devices.
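As a rough illustration of the idea (not any particular I/O module's design), the C sketch below models a small FIFO buffer: the "device" deposits bytes one at a time, and the "CPU" later drains them in one short burst:

#include <stdio.h>

#define BUF_SIZE 8

/* A very small FIFO buffer standing in for an I/O module's data registers. */
static unsigned char buf[BUF_SIZE];
static int head = 0, tail = 0, count = 0;

static int buf_put(unsigned char b)        /* device side: slow, one byte at a time */
{
    if (count == BUF_SIZE) return 0;       /* buffer full */
    buf[tail] = b;
    tail = (tail + 1) % BUF_SIZE;
    count++;
    return 1;
}

static int buf_get(unsigned char *b)       /* CPU side: drains the buffer in a burst */
{
    if (count == 0) return 0;              /* buffer empty */
    *b = buf[head];
    head = (head + 1) % BUF_SIZE;
    count--;
    return 1;
}

int main(void)
{
    for (unsigned char c = 'A'; c <= 'F'; c++)  /* device deposits data at its own pace */
        buf_put(c);

    unsigned char c;
    while (buf_get(&c))                         /* CPU picks it all up in one short burst */
        printf("CPU read %c from buffer\n", c);
    return 0;
}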
5. Error detection mechanisms should be built in
The error detection mechanisms may involve checking for mechanical errors as well as data communication errors. These errors should be reported to the CPU. Examples of the kinds of mechanical errors that can occur in devices are a paper jam in a printer, mechanical failure, electrical failure, etc. Data communication errors may be checked by using a parity bit.
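For example, a simple even-parity check over a 7-bit ASCII character could look like the C sketch below; the sender sets the eighth bit so that the total number of 1s is even, and the receiver flags an error when it is not:

#include <stdio.h>

/* Count the 1 bits in the low 7 bits and return the even-parity bit. */
static int parity_bit(unsigned char c)
{
    int ones = 0;
    for (int i = 0; i < 7; i++)
        ones += (c >> i) & 1;
    return ones & 1;                              /* 1 if the count of ones is odd */
}

int main(void)
{
    unsigned char ch = 'G';                           /* 7-bit ASCII character             */
    unsigned char sent = ch | (parity_bit(ch) << 7);  /* eighth bit carries the parity     */

    unsigned char received = sent ^ 0x04;             /* flip one bit to simulate an error */

    int ok = (parity_bit(received) == ((received >> 7) & 1));
    printf("received byte 0x%02X: parity %s\n", received, ok ? "OK" : "ERROR detected");
    return 0;
}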
The data is stored or buffered in data registers. The status registers are used to indicate the current state of a device, e.g., the device is busy, the system bus is busy, the device is out of order, etc. If an I/O module takes over more of the processing burden it is called an I/O channel or I/O processor. A primitive I/O module which requires detailed control by the processor is called an I/O controller or device controller. These I/O controllers are normally used in microcomputers, while I/O processors are mainly used in mainframes, because the I/O work of a microcomputer is normally limited to a single user's job and we therefore do not expect a volume of I/O large enough to justify the investment in expensive I/O processors. Once we have defined a general structure for an I/O module, the next question is how the I/O operations are actually performed. The next section tries to answer this basic query in a general way.
24
In programmed I/O, the I/O operations are completely controlled by the CPU. The CPU executes programs that initiate, direct and terminate an I/O operation. It requires little special I/O hardware, but it is quite time consuming for the CPU, since the CPU has to wait for the slower I/O operations.
Another technique suggested to reduce the waiting by the CPU is interrupt-driven I/O. The CPU issues the I/O command to the I/O module and starts doing other work, which may be the execution of a separate program. When the I/O operation is complete, the I/O module interrupts the CPU to inform it that the I/O has finished. The CPU may then proceed with the execution of this program. In both programmed I/O and interrupt-driven I/O, the CPU is responsible for
extracting data from the memory for output and storing data in the memory for input. Such a
requirement does not exist in DMA where the memory can be accessed directly by the I/O
module. Thus, the I/O module can store or extract data in/from the memory. We will discuss
programmed I/O and interrupt-driven I/O in this section and DMA in the next section.
An arrangement in which the I/O is looked after entirely by a separate dedicated processor is referred to as an I/O channel or I/O processor. The basic advantage of these devices is that they free the CPU from the burden of input/output. During this time the CPU can do other work, thus effectively increasing CPU utilization. We will discuss I/O channels and I/O processors in the next section.
· Transfer of data from CPU registers to memory.
In addition, in the programmed I/O method it is the responsibility of the CPU to constantly check the status of the I/O device, to see whether it has become free (in case output is desired) or whether it has finished inputting the current series of data (in case input is going on). Thus, programmed I/O is a very time consuming method in which the CPU wastes a lot of time checking and verifying the status of the I/O device. Let us now try to focus on how this input/output is performed. Figure 56(a) gives the block diagram for transferring a block of data word by word using the programmed I/O technique.
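A rough C sketch of that word-by-word loop is shown below. The register addresses and the meaning of the status bit are invented purely for illustration; real hardware defines its own:

#include <stdint.h>
#include <stddef.h>

/* Hypothetical device registers (addresses invented for this sketch). */
#define DEV_STATUS (*(volatile uint8_t *)0x4000u)   /* bit 0 = device ready */
#define DEV_DATA   (*(volatile uint8_t *)0x4001u)   /* data register        */

/* Programmed I/O: the CPU itself busy-waits on the status register
   and moves the block one word (here, one byte) at a time.          */
void pio_write_block(const uint8_t *block, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        while ((DEV_STATUS & 0x01u) == 0)   /* CPU wastes time polling...   */
            ;                               /* ...until the device is ready */
        DEV_DATA = block[i];                /* transfer one word            */
    }
}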
I/O Instructions: To carry out input/output, the CPU issues I/O-related instructions. These instructions consist of two components:
· The address of the input/output device, specifying the I/O device and I/O module; and
· An input/output command.
There are four types of I/O Commands, which can be classified as:
CONTROL, TEST, READ and WRITE.
CONTROL commands are device-specific and are used to give specific instructions to the device; e.g., a magnetic tape may require rewinding or moving forward by a block. The TEST command checks status, such as whether a device is ready or is in an error condition.
The READ command is used for input of data from an input device and the WRITE command is used for output of data to an output device.
The other part of an I/O instruction is the address of the I/O device. In systems with programmed I/O, the I/O module, the main memory and the CPU normally share the system bus. Thus, each I/O module must interpret the address lines to determine whether the command is for itself. In other words: how does the CPU specify which device to access? There are two methods of doing so, called memory-mapped I/O and I/O-mapped I/O.
If we use a single address space for memory locations and I/O devices, i.e., the CPU treats the status and data registers of the I/O module as memory locations, then memory and I/O devices can be accessed using the same instructions. This is referred to as memory-mapped I/O. For memory-mapped I/O, only a single READ and a single WRITE line are needed for memory or I/O module read or write operations. These lines are activated by the CPU for either memory access or I/O device access.
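In C terms, memory-mapped I/O means that a device register is reached through an ordinary memory address, so ordinary load and store instructions perform the I/O. The address and device name in the sketch below are invented for illustration:

#include <stdint.h>

/* With memory-mapped I/O the device's data register simply occupies an
   address in the normal memory space (0x4001 is invented for this sketch). */
#define UART_DATA (*(volatile uint8_t *)0x4001u)

void send_byte(uint8_t b)
{
    UART_DATA = b;     /* an ordinary store instruction performs the output */
}

uint8_t receive_byte(void)
{
    return UART_DATA;  /* an ordinary load instruction performs the input   */
}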
5.0 INSTRUCTION SET CHARACTERISTICS
Till now we have discussed instructions in an abstract way. Now let us discuss the various characteristics of instructions in detail. First, however, let us find out the significance of the instruction set. One thing which should be kept in mind is that the instruction set is a boundary which is looked upon in the same fashion by the computer designer and the programmer. From the computer designer's point of view, the instruction set provides the functional requirements of the CPU: in fact, one of the main tasks in implementing a CPU design is to implement the instruction set for that CPU. From the user's point of view, machine instructions or assembly instructions are needed for low-level programming. In addition, a user should also be aware of the registers, the data types supported by the machine and the functioning of the ALU. We will discuss the registers in Unit 2 and the ALU in Unit 3 of this module.
An explanation of the machine instruction set gives extensive detail about the CPU of a machine. In fact, the operations which a CPU can perform are determined by its machine instructions. Now, let us answer some introductory questions about the instruction set.
An important aspect of the references to operands and results is: where are those operands located? In memory, in the CPU registers, or in an I/O device? If the operands are located in registers, an instruction can be executed faster than if the operands are located in memory. The main reason is that memory access time is higher than register access time.
In most instruction sets, many instruction formats are used. An instruction is first read into an instruction register (IR); the CPU then decodes the instruction, extracts the required operands on the basis of the references made through the instruction fields, and processes them.
Since the binary representation of an instruction is difficult to comprehend and is seldom used for presentation, we will use symbolic representations of instructions in this unit, along with comments wherever desired.
5.3 What are the types of instructions?
The instructions can be categorized under the following:
· Data Processing Instructions: These instructions are used for arithmetical and logic
operations in a machine. Examples of data processing instructions are: Arithmetic, Boolean,
shift, character and string processing instructions, stack and register manipulation instructions, vector instructions, etc.
· Data Storage/Retrieval Instructions: Since the data processing operations are normally performed on data stored in CPU registers, we need instructions to move data between memory and the registers. These are called data storage/retrieval instructions.
Examples of data storage and retrieval instructions are load and store instructions.
· Data Movement Instructions: These are basically input/output instructions. They are
required to bring in programs and data from various devices to memory or to communicate the
results to the input/output devices. Some of these instructions can be: start input/output, halt
input/output, TEST input/output etc.
· Control Instructions: These instructions are used for testing the status of computation through the Processor Status Word (PSW). Another such instruction is the branch instruction, used for transfer of control. We will discuss these instructions in more detail later.
· Miscellaneous Instructions: These instructions do not fit in any of the above categories.
Some of these instructions are: interrupt or supervisory call, swapping, return from interrupt,
halt instruction or some privileged instruction of operating systems.
What are the factors which play important roles in the selection/design of an instruction set for a machine?
Instruction set design is the most complex, yet interesting and much analysed, aspect of computer design. The instruction set plays an important role in the design of the CPU, as it defines many of its functions. Since the instruction set is the means by which a programmer controls the CPU, users' views must be considered while designing the instruction set. Some of the important design issues relating to instruction design are:
· How many and what operations should be provided?
· What are the operand data types to be provided?
· What should be the instruction format? This includes issues like instruction length, number of addresses, lengths of the various elements of an instruction, etc.
· What is the number of registers which can be referenced by an instruction and how are they
used?
· What are the modes of specifying an operand address? We will try to analyze these issues in
some detail in this and subsequent sections. However, we have kept the level of discussion very
general. A special example of an instruction set is given at the end of this unit.
All machines provide instructions for performing arithmetic operations on fixed point and floating point numbers. Many machines provide arithmetic instructions which operate on packed decimal digits.
Characters: Another very common data type is the character, or string of characters. The most widely used character representation is ASCII (American Standard Code for Information Interchange). It uses 7 bits for coding each character, which allows 128 different characters. Some of these characters are control characters, which may be used in data communication. The eighth bit of an ASCII byte may be used as a parity bit. One special feature of ASCII, which facilitates conversion between 7-bit ASCII and 4-bit packed decimal numbers, is that the last four bits of the ASCII codes for the digits 0-9 are the binary equivalents of 0-9. EBCDIC, another character code, facilitates conversion to packed decimal in a similar way: the digits 0 through 9 in EBCDIC are represented as 1111 0000 through 1111 1001.
Logical Data: In general, a data word or any other addressable unit such as a byte, half word, etc. is treated as a single unit of data. But can we consider an n-bit data unit as consisting of n items of 1 bit each? If we treat each bit of an n-bit unit as an item, then it can be considered logical data. Each of these n items can have the value 0 or 1.
What are the advantages of such a bit oriented view of data? The advantages of such a view
will be:
· We can store an array of Boolean or binary data items most efficiently.
· We will be in a position to manipulate the bits of any data item.
But where do we need to manipulate the bits of a data item? Examples of such cases are the shifting of significant bits in a floating point operation, or the conversion of ASCII to packed decimal, where we need only the four rightmost bits of each ASCII byte. Please note that for extracting a decimal digit from ASCII, the data is first treated as logical data and can then be used in arithmetic operations as numeric data. Thus, the operation performed on a unit of data determines the type of that unit of data at that instance. This statement may not hold for high level languages, but it holds good for machine level language.
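The C sketch below does exactly that: it treats two ASCII digit characters as logical data, masks out their four rightmost bits, and packs the two decimal digits into one byte:

#include <stdio.h>

int main(void)
{
    char ascii[] = "47";                    /* two ASCII digits: 0x34, 0x37                */

    unsigned char hi = ascii[0] & 0x0F;     /* logical view: keep the 4 rightmost bits -> 4 */
    unsigned char lo = ascii[1] & 0x0F;     /* -> 7                                         */

    unsigned char packed = (hi << 4) | lo;  /* packed decimal: one digit per nibble         */

    printf("ASCII \"%s\" -> packed decimal byte 0x%02X\n", ascii, packed);
    return 0;
}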
The register architecture, a general classification based on the register set of the computer, is sometimes described according to the number of addresses in instructions. These classifications are:
Evaluation-Stack Architecture: These machines are zero-address machines and their operands are taken implicitly from the top of the stack. The ALU of such machines directly references a stack which can be implemented in main memory, in registers, or in both. These machines contain instructions like PUSH and POP to load a value onto the stack and to store a value into memory, respectively. Please note that PUSH and POP are not zero-address instructions; they contain one address. The main advantages of such an architecture are:
· Very short instructions
· Since stacks are normally made within CPU in such an architecture machine, the operands are
close to the ALU, thus effecting fast execution of instructions
· Excellent support for subroutines.
While the main disadvantages of this architecture are:
· Not very general in nature, in comparison to other architectures
· Difficult to program for applications like text and string processing.
One example of such a machine is the Burroughs B6700. However, these machines are not very common today because more general-purpose machines are desired. The stack machine uses Polish notation for the evaluation of arithmetic expressions. Polish notation was introduced by the Polish logician Jan Lukasiewicz. The main theme of this notation is that an arithmetic expression such as A x B can be expressed with the operator before the operands (prefix or Polish notation, x A B) or after the operands (suffix or reverse Polish notation, A B x), instead of between them.
In stacks we use suffix, or reverse Polish, notation for evaluating expressions. The rule is to push the values of the variables onto the top of the stack and to perform each operation on the top elements of the stack as an operator is encountered. The priority of the operators is kept in mind when generating the reverse Polish notation. For example, the expression A x B + C x D / E will be represented in reverse Polish notation as:
ABxCDxE/+
Thus, the program for evaluating F = A x B + C x D / E will be:
PUSH A /Transfer the value of A onto the top of the stack/
PUSH B /Transfer the value of B onto the top of the stack/
MULT /Multiply. It removes the values of A and B from the stack and pushes the product A x B/
PUSH C /Transfer C onto the top of the stack/
PUSH D /Transfer D onto the top of the stack/
MULT /Remove the values of C and D from the stack and push the product C x D/
PUSH E /Transfer E onto the top of the stack/
DIV /Remove C x D and E from the stack and push (C x D) / E/
ADD /Add the top two values on the stack/
POP F /Transfer the result back to memory location F/
Figure 66: Sample program for evaluating A x B + C x D / E using zero-address instructions.
Please note that PUSH and POP are not zero-address instructions.
The execution of the above program using stack is represented diagrammatically in figure 67.
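The same evaluation can be mimicked in C with an explicit stack; the sketch below follows the PUSH/MULT/DIV/ADD/POP sequence of Figure 66 step by step, using invented sample values for the operands:

#include <stdio.h>

static double stack[16];
static int top = -1;                        /* index of the top of the stack */

static void   push(double v) { stack[++top] = v; }
static double pop_(void)     { return stack[top--]; }

int main(void)
{
    double A = 2, B = 3, C = 8, D = 5, E = 4;   /* sample operand values */

    push(A); push(B);
    push(pop_() * pop_());                  /* MULT: A x B                      */
    push(C); push(D);
    push(pop_() * pop_());                  /* MULT: C x D                      */
    push(E);
    { double e = pop_(), cd = pop_(); push(cd / e); }   /* DIV: (C x D) / E     */
    push(pop_() + pop_());                  /* ADD the top two values           */

    double F = pop_();                      /* POP F                            */
    printf("F = A*B + C*D/E = %g\n", F);    /* 2*3 + 8*5/4 = 16                 */
    return 0;
}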
6.0 PIPELINING DESIGN TECHNIQUES
There exist two basic techniques to increase the instruction execution rate of a processor.
These are to increase the clock rate, thus decreasing the instruction execution time, or
alternatively to increase the number of instructions that can be executed simultaneously.
Pipelining and instruction-level parallelism are examples of the latter technique. Pipelining
owes its origin to car assembly lines. The idea is to have more than one instruction being
processed by the processor at the same time. Similar to the assembly line, the success of a
pipeline depends upon dividing the execution of an instruction among a number of subunits
(stages), each performing part of the required operations. A possible division is to consider
instruction fetch (IF), instruction decode (ID), operand fetch (OF), instruction execution (IE), and storing of results (IS) as the subtasks needed for the execution of an instruction.
In this case, it is possible to have up to five instructions in the pipeline at the same time, thus
reducing instruction execution latency.
It is clear from the figure that the total time required to process three instructions (I1, I2, I3) is
only six time units if four-stage pipelining is used as compared to 12 time units if sequential
processing is used. A possible saving of up to 50% in the execution time of these three
instructions is obtained. In order to formulate some performance measures for the goodness of a
pipeline in processing a series of tasks, a space time chart (called the Gantt’s chart) is used. The
chart shows the succession of the subtasks in the pipe with respect to time. Figure 9.2 shows a
Gantt’s chart. In this chart, the vertical axis represents the subunits (four in this case) and the
horizontal axis represents time (measured in terms of the time unit required for each unit to
perform its task). In developing the Gantt’s chart, we assume that the time (T) taken by each
subunit to perform its task is the same; we call this the unit time.
As can be seen from the figure, 13 time units are needed to finish executing 10 instructions (I1
to I10). This is to be compared to 40 time units if sequential processing is used (ten instructions
each requiring four time units).
In the following analysis, we provide three performance measures for the goodness of a
pipeline. These are the Speed-up S(n), Throughput U(n), and Efficiency E(n). It should be
noted that in this analysis we assume that the unit time T = t units.
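In their usual form, assuming m pipeline stages, n instructions and a unit time t per stage, these measures can be written as:

Speed-up    S(n) = time without pipelining / time with pipelining
                 = (n x m x t) / ((m + n - 1) x t) = n x m / (m + n - 1)
Throughput  U(n) = n / ((m + n - 1) x t)     (instructions completed per unit time)
Efficiency  E(n) = S(n) / m = n / (m + n - 1)

For the example above (m = 4 stages, n = 10 instructions) this gives m + n - 1 = 13 time units, a speed-up of 40/13 (about 3.1) and an efficiency of 10/13 (about 0.77).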
6.2. Instruction Pipeline
The simple analysis made in figure 9.1 ignores an important aspect that can affect the
performance of a pipeline, that is, pipeline stall. A pipeline operation is said to have been
stalled if one unit (stage) requires more time to perform its function, thus forcing other stages to
become idle. Consider, for example, the case of an instruction fetch that incurs a cache miss.
Assume also that a cache miss requires three extra time units. Figure 9.3 illustrates the effect of
having instruction I2 incurring a cache miss (assuming the execution of ten instructions I1 to
I10).
The figure shows that, due to the extra time units needed to fetch instruction I2, the pipeline stalls; that is, the fetching of instruction I3 and of subsequent instructions is delayed. Such situations create what is known as a pipeline bubble (or pipeline hazard). The creation of a
pipeline bubble leads to wasted unit times, thus leading to an overall increase in the number of
time units needed to finish executing a given number of instructions. The number of time units
needed to execute the 10 instructions shown in Figure 9.3 is now 16 time units, compared to 13
time units if there were no cache misses.
Pipeline hazards can take place for a number of other reasons. Among these are instruction dependency and data dependency. These are explained below.
6.2.1 Pipeline "Stall" Due to Instruction Dependency
Correct operation of a pipeline requires that an operation performed by a stage MUST NOT depend on the operation(s) performed by other stage(s). Instruction dependency refers to the case whereby the fetching of an instruction depends on the results of executing a previous instruction. Instruction dependency manifests itself in the execution of a conditional branch instruction. Consider, for example, the case of a "branch if negative" instruction. In this case, the next instruction to fetch will not be known until the result of executing that "branch if negative" instruction is known. In the following discussion, we will assume that the instruction following a conditional branch instruction is not fetched until the result of executing the branch instruction is known (stored). The following example shows the effect of instruction dependency on a pipeline.
Example 1 Consider the execution of ten instructions I1–I10 on a pipeline consisting of four
pipeline stages: IF (instruction fetch), ID (instruction decode), IE (instruction execute), and IS
(instruction results store). Assume that the instruction I4 is a conditional branch instruction and
that when it is executed, the branch is not taken, that is, the branch condition(s) is(are) not
satisfied. Assume also that when the branch instruction is fetched, the pipeline stalls until the
result of executing the branch instruction is stored. Show the succession of instructions in the
pipeline; that is, show the Gantt’s chart. Figure 9.4 shows the required Gantt’s chart. The
bubble created due to the pipeline stall is clearly shown in the figure.
6.3 Pipeline “Stall” Due to Data Dependency
Data dependency in a pipeline occurs when a source operand of instruction Ii depends on the
results of executing a preceding instruction, Ij, i > j. It should be noted that although instruction
Ii can be fetched, its operand(s) may not be available until the results of instruction Ij are
stored. The following example shows the effect of data dependency on a pipeline.
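In the register-transfer style used elsewhere in this handout, the three instructions described next amount to something like the following (the operand order shown, destination first, is an assumption of this sketch):
ADD R3, R1, R2 /R3 <- R1 + R2/
SHL R3 /R3 <- R3 shifted left by one bit position/
SUB R4, R5, R6 /R4 <- R5 - R6/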
In this piece of code, the first instruction, call it Ii, adds the contents of two registers R1 and R2
and stores the result in register R3. The second instruction, call it Ii+1, shifts the contents of R3
one bit position to the left and stores the result back into R3. The third instruction, call it Ii+2 ,
stores the result of subtracting the content of R6 from the content of R5 in register R4. In order
to show the effect of such data dependency, we will assume that the pipeline consists of five
stages, IF, ID, OF, IE, and IS. In this case, the OF stage represents the operand fetch stage. The
functions of the remaining four stages remain the same as explained before. Figure 9.5 shows
the Gantt’s chart for this piece of code. As shown in the figure, although instruction Ii+1 has
been successfully decoded during time unit k + 2, this instruction cannot proceed to the OF unit
during time unit k + 3. This is because the operand to be fetched by Ii+1 during time unit k+3
should be the content of register R3, which has been modified by execution of instruction Ii .
However, the modified value of R3 will not be available until the end of time unit k + 4. This
will require instruction Ii+1 to wait (at the output of the ID unit) until k + 5. Notice that
instruction Ii+2 will
have also to wait (at the output of the IF unit) until such time that instruction Ii+1 proceeds to the ID unit. The net result is that a pipeline stall takes place due to the data dependency that exists between instruction Ii and instruction Ii+1. The data dependency presented in the above example arose because register R3 is the destination for both instructions Ii and Ii+1. This is called a write-after-write data dependency. Taking into consideration that any register can be written into (or read from), a total of four different possibilities exist, including the write-after-write case. The other three cases are read-after-write, write-after-read, and read-after-read. Among the four cases, the read-after-read case should not lead to a pipeline stall, because a register read operation does not change the content of the register. Among the remaining three cases, the write-after-write (see the above example) and the read-after-write cases lead to pipeline stall.
The following piece of code illustrates the read-after-write case:
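In the same register-transfer style, one possible form of the two instructions described next is the following (the particular source operands and the use of a memory load are assumptions of this sketch):
ADD R3, R1, R2 /R3 <- R1 + R2 (a write into R3)/
LOAD R4, (R3) /R4 <- the memory word whose address is in R3 (a read of the new value of R3)/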
In this case, the first instruction modifies the content of register R3 (through a write operation)
while the second instruction uses the modified contents of R3 (through a read operation) to load
a value into register R4. While these two instructions are proceeding within a pipeline, care
should be taken so that the value of register R3 read in the second instruction is the updated
value resulting from execution of the previous instruction. Figure 9.6 shows the Gantt’s chart
for this case assuming that the first instruction is called Ii and the second instruction is called
Ii+1. It is clear that the operand of the second instruction cannot be fetched during time unit
k + 3 and that it has to be delayed until time unit k + 5. This is because the modified value of the content of register R3 will not be available until time slot k + 5. Fetching the operand of the second instruction during time slot k + 3 would lead to incorrect results.
Figure 9.7 illustrates the progression of these instructions in the pipeline taking into
consideration the data dependencies mentioned above. The assumption made in constructing
the Gantt’s chart in Figure 9.7 is that fetching an operand by an instruction that depends on the
results of a previous instruction execution is delayed until such operand is available, that is, the
result is stored. A total of 16 time units are required to execute the given seven instructions
taking into consideration the data dependencies among the different instructions.
The discussion on pipeline stall due to instruction and data dependencies should reveal three
main points about the problems associated with having such dependencies.
These are:
1. Both instruction and data dependencies lead to added delay in the pipeline.
2. Instruction dependency can lead to the fetching of the wrong instruction.
3. Data dependency can lead to the fetching of the wrong operand.
There exist a number of methods to deal with the problems resulting from instruction and data
dependencies. Some of these methods try to prevent the fetching of the wrong instruction or the
wrong operand while others try to reduce the delay incurred in the pipeline due to the existence
of instruction or data dependency. A number of these methods are introduced below.
6.4. Instruction-Level Parallelism
Contrary to pipeline techniques, instruction-level parallelism (ILP) is based on the idea of
multiple issue processors (MIP). An MIP has multiple pipelined datapaths for instruction
execution. Each of these pipelines can issue and execute one instruction per cycle. Figure 9.17
shows the case of a processor having three pipes. For comparison purposes, we also show in the
same figure the sequential and the single pipeline case. It is clear from the figure that while the
limit on the number of cycles per instruction in the case of a single pipeline is CPI = 1, the MIP
can achieve CPI < 1.
In order to make full use of ILP, an analysis should be made to identify the instruction and data
dependencies that exist in a given program. This analysis should lead to the appropriate
scheduling of the group of instructions that can be issued simultaneously while retaining the
program correctness. Static scheduling results in the use of very long instruction word (VLIW)
architectures, while dynamic scheduling results in the use of superscalar architectures.
In VLIW, an instruction represents a bundle of many operations to be issued simultaneously.
The compiler is responsible for checking all dependencies and making the appropriate
groupings/scheduling of operations. This is in contrast with superscalar architectures, which
rely entirely on the hardware for scheduling of instructions.
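As a tiny illustration of what such grouping means (the registers and operations are invented), consider:
R1 <- R2 + R3 /independent of the next operation/
R4 <- R5 - R6 /independent of the previous one: both can be placed in the same bundle or issued in the same cycle/
R7 <- R1 * R4 /depends on both results above, so it must be scheduled in a later cycle/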
Superscalar Architectures
A scalar machine is able to perform only one arithmetic operation at a time. A superscalar architecture (SPA) is able to fetch, decode, execute, and store the results of several instructions at the same time. It does so by transforming a static and sequential instruction stream into a dynamic and parallel one, in order to execute a number of instructions simultaneously. Upon completion, the SPA restores the original sequential order of the instruction stream, so that instructions are completed in the original order. In an SPA, instruction processing consists of the fetch, decode, issue, and commit stages. During the fetch stage, multiple instructions are fetched simultaneously.
Branch prediction and speculative execution are also performed during the fetch stage. This is
done in order to keep on fetching instructions beyond branch and jump instructions. Decoding
is done in two steps. Predecoding is performed between the main memory and the cache and is
responsible for identifying branch instructions.
Actual decoding is used to determine the following for each instruction: (1) the operation to be
performed; (2) the location of the operands; and (3) the location where the results are to be
stored. During the issue stage, those instructions among the dispatched ones that can start
execution are identified. During the commit stage, generated values/results are written into their
destination registers.
The most crucial step in processing instructions in SPAs is the dependency analysis. The
complexity of such analysis grows quadratically with the instruction word size. This puts a
limit on the degree of parallelism that can be achieved with SPAs such that a degree of
parallelism higher than four will be impractical. Beyond this limit, the dependence analysis and
scheduling must be done by the compiler. This is the basis for the VLIW approach.