
ACA (15CS72) Notes Module-2

MODULE-2

Chapter 4 Processors and Memory Hierarchy


4.1 Advanced Processor Technology

4.1.1 Design Space of Processors

• Processors can be "mapped" onto a space that has clock rate and cycles per instruction (CPI) as
coordinates. Each processor type occupies a region of this space.
• Newer technologies are enabling higher clock rates.
• Manufacturers are also trying to lower the number of cycles per instruction.
• Thus the "future processor space" is moving toward the lower right of the processor design space.

CISC and RISC Processors

• Complex Instruction Set Computing (CISC) processors like the Intel 80486, the Motorola 68040,
the VAX/8600, and the IBM S/390 typically use microprogrammed control units, and have lower clock
rates and higher CPI figures than RISC processors.

• Reduced Instruction Set Computing (RISC) processors like the Intel i860, SPARC, MIPS R3000,
and IBM RS/6000 have hard-wired control units, higher clock rates, and lower CPI figures.

Superscalar Processors

• This subclass of the RISC processors allow multiple instructions to be issued simultaneously
during each cycle.


• The effective CPI of a superscalar processor should be less than that of a generic scalar RISC
processor.

• Clock rates of scalar RISC and superscalar RISC machines are similar.

VLIW Machines

• Very Long Instruction Word machines typically have many more functional units than superscalars
(and thus the need for longer – 256 to 1024 bits – instructions to provide control for them).

• These machines mostly use microprogrammed control units with relatively slow clock rates
because of the need to use ROM to hold the microcode.

Superpipelined Processors

• These processors typically use a multiphase clock (actually several clocks that are out of phase
with each other, each phase perhaps controlling the issue of another instruction) running at a
relatively high rate.

• The CPI in these machines tends to be relatively high (unless multiple instruction issue is used).

• Processors in vector supercomputers are mostly superpipelined and use multiple functional units
for concurrent scalar and vector operations.

Instruction Pipelines

• A typical instruction includes four phases:


– fetch
– decode
– execute
– write-back
• These four phases are frequently performed in a pipeline, or "assembly line," manner, as
illustrated in Fig. 4.2.
• The pipeline, like an industrial assembly line, receives successive instructions from its input
end and executes them in a streamlined, overlapped fashion as they flow through.
• A pipeline cycle is intuitively defined as the time required for each phase to complete its
operation, assuming equal delay in all phases (pipeline stages).

Basic definitions associated with Pipeline operations:

• Instruction pipeline cycle – the time required for each phase to complete its operation
(assuming equal delay in all phases)

• Instruction issue latency – the time (in cycles) required between the issuing of two adjacent
instructions

• Instruction issue rate – the number of instructions issued per cycle (the degree of a
superscalar)

• Simple operation latency – the delay (after the previous instruction) associated with the
completion of a simple operation (e.g. integer add) as compared with that of a complex
operation (e.g. divide).

• Resource conflicts – when two or more instructions demand use of the same functional unit(s)
at the same time.
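
A minimal Python sketch (illustrative only; it assumes an ideal four-stage pipeline with equal
one-cycle stages, an issue latency of one cycle, and no resource conflicts) makes these
definitions concrete by computing the total cycle count for N instructions and the resulting
effective CPI:

    def ideal_pipeline_cycles(num_instructions: int, num_stages: int = 4) -> int:
        """Cycles to push N instructions through an ideal k-stage pipeline:
        k cycles to fill the pipe, then one completion per cycle."""
        return num_stages + (num_instructions - 1)

    def effective_cpi(num_instructions: int, num_stages: int = 4) -> float:
        """The effective CPI approaches 1 as the instruction count grows."""
        return ideal_pipeline_cycles(num_instructions, num_stages) / num_instructions

    for n in (4, 16, 256):
        print(n, ideal_pipeline_cycles(n), round(effective_cpi(n), 3))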

Pipelined Processors

• A base scalar processor:

– issues one instruction per cycle

– has a one-cycle latency for a simple operation

– has a one-cycle latency between instruction issues

– can be fully utilized if instructions can enter the pipeline at a rate of one per cycle

• For a variety of reasons, instructions might not be able to be pipelined as aggressively as in a
base scalar processor. In these cases, we say the pipeline is underpipelined.

• CPI rating is 1 for an ideal pipeline. Underpipelined systems will have higher CPI ratings,
lower clock rates, or both.


• Figure 4.3 shows the data path architecture and control unit of a typical, simple scalar processor
which does not employ an instruction pipeline. Main memory, I/O controllers, etc. are connected to
the external bus.
• The control unit generates control signals required for the fetch, decode, ALU operation, memory
access, and write result phases of instruction execution.


• The control unit itself may employ hardwired logic, or—as was more common in older CISC style
processors—microcoded logic.
• Modern RISC processors employ hardwired logic, and even modern CISC processors make use of
many of the techniques originally developed for high-performance RISC processors.

4.1.2 Instruction Set Architectures

• CISC
– Many different instructions
– Many different operand data types
– Many different operand addressing formats
– Relatively small number of general purpose registers
– Many instructions directly match high-level language constructions
• RISC
– Many fewer instructions than CISC (freeing chip space for more functional units!)
– Fixed instruction format (e.g. 32 bits) and simple operand addressing
– Relatively large number of registers
– Small CPI (close to 1) and high clock rates

Architectural Distinctions

• CISC
– Unified cache for instructions and data (in most cases)
– Microprogrammed control units and ROM in earlier processors (hard-wired controls
units now in some CISC systems)
• RISC
– Separate instruction and data caches
– Hard-wired control units


• CISC Advantages

– Smaller program size (fewer instructions)
– Simpler control unit design
– Simpler compiler design

• RISC Advantages
– Has potential to be faster
– Many more registers
• RISC Problems
– More complicated register decoding system
– Hardwired control is less flexible than microcode

4.1.3 CISC Scalar Processors

• Early systems had only integer fixed point facilities.


• Modern machines have both fixed and floating point facilities, sometimes as parallel functional
units.
• Many CISC scalar machines are underpipelined.

Representative CISC Processors:

– VAX 8600
– Motorola MC68040
– Intel Pentium


VAX 8600 processor

• The VAX 8600 was introduced by Digital Equipment Corporation in 1985.


• This machine implemented a typical CISC architecture with microprogrammed control.
• The instruction set contained about 300 instructions with 20 different addressing modes.



• The CPU in the VAX 8600 consisted of two functional units for concurrent execution of
integer and floating point instructions.
• The unified cache was used for holding both instructions and data.
• There were 16 GPRs in the instruction unit. Instruction pipelining was built with six stages in
the VAX 8600, as in most CISC machines.
• The instruction unit prefetched and decoded instructions, handled branching operations, and
supplied operands to the two functional units in a pipelined fashion.
• A Translation Lookaside Buffer (TLB) was used in the memory control unit for fast generation
of a physical address from a virtual address.
• Both integer and floating point units were pipelined.
• The performance of the processor pipelines relied heavily on the cache hit ratio and on minimal
branching damage to the pipeline flow.

4.1.4 RISC Scalar Processors

• Designed to issue one instruction per cycle

• RISC and CISC scalar processors should have the same performance if clock rates and program
lengths are equal.


• RISC moves less frequent operations into software, thus dedicating hardware resources to the
most frequently used operations.

Representative RISC Processors:

– Sun SPARC

– Intel i860

– Motorola M88100

– AMD 29000

SPARCs (Scalable Processor Architecture) and Register Windows

• SPARC family chips have been produced by Cypress Semiconductor, Inc., among others. Figure 4.7
shows the architecture of the Cypress CY7C601 SPARC processor and of the CY7C602 FPU.
• The Sun SPARC instruction set contains 69 basic instructions
• The SPARC runs each procedure with a set of thirty-two 32-bit IU registers.
• Eight of these registers are global registers shared by all procedures, and the remaining 24 are
window registers associated with each individual procedure.
• The concept of using overlapped register windows is the most important feature introduced by the
Berkeley RISC architecture.


• Fig. 4.8 shows eight overlapping windows (formed with 64 local registers and 64 overlapped
registers) and eight globals with a total of 136 registers, as implemented in the Cypress 601.

• Each register window is divided into three eight-register sections, labeled Ins, Locals, and Outs.
• The local registers are only locally addressable by each procedure. The Ins and Outs are shared
among procedures.
• The calling procedure passes parameters to the called procedure via its Outs (r8 to r15) registers,
which are the Ins registers of the called procedure.
• The window of the currently running procedure is called the active window pointed to by a current
window pointer.
• A window invalid mask is used to indicate which window is invalid. The trap base register serves
as a pointer to a trap handler.


• A special register is used to create a 64-bit product in multiple-step instructions. Procedures can
also be called without changing the window.
• The overlapping windows can significantly save the time required for interprocedure
communications, resulting in much faster context switching among cooperative procedures.

4.2 Superscalar, Vector Processors

 A CISC or a RISC scalar processor can be improved with a superscalar or vector architecture.
 Scalar processors are those executing one instruction per cycle.
 Only one instruction is issued per cycle, and only one completion of instruction is expected from
the pipeline per cycle.
 In a superscalar processor, multiple instructions are issued per cycle and multiple results are
generated per cycle.
 A vector processor executes vector instructions on arrays of data; each vector instruction involves a
string of repeated operations, which are ideal for pipelining with one result per cycle.


4.2.1 Superscalar Processors

 Superscalar processors are designed to exploit more instruction-level parallelism in user programs.
 Only independent instructions can be executed in parallel without causing a wait state. The amount
of instruction level parallelism varies widely depending on the type of code being executed.
 It has been observed that the average value is around 2 for code without loop unrolling. Therefore,
for these codes there is not much benefit gained from building a machine that can issue more than
three instructions per cycle.
 The instruction-issue degree in a superscalar processor has thus been limited to 2 to 5 in practice.

Pipelining in Superscalar Processors

• The fundamental structure of a three-issue superscalar pipeline is illustrated in Fig. 4.11.

• Superscalar processors were originally developed as an alternative to vector processors, with a
view to exploiting a higher degree of instruction-level parallelism.

 A superscalar processor of degree m can issue m instructions per cycle.
 The base scalar processor, implemented either in RISC or CISC, has m = 1.
 In order to fully utilize a superscalar processor of degree m, m instructions must be executable in
parallel. This situation may not be true in all clock cycles.
 In that case, some of the pipelines may be stalling in a wait state.
 In a superscalar processor, the simple operation latency should require only one cycle, as in the
base scalar processor.
 Due to the desire for a higher degree of instruction-level parallelism in programs, the superscalar
processor depends more on an optimizing compiler to exploit parallelism.
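
A minimal sketch (illustrative, not from the text) of how the effective CPI of a degree-m
superscalar rises above the ideal 1/m when some cycles cannot find m independent instructions:

    def effective_cpi_superscalar(issue_counts: list[int], degree: int) -> float:
        """`issue_counts[c]` is the number of instructions actually issued in
        cycle c (between 0 and `degree`); cycles issuing fewer than `degree`
        instructions leave some of the pipelines stalled in a wait state."""
        assert all(0 <= c <= degree for c in issue_counts)
        return len(issue_counts) / sum(issue_counts)

    # A 3-issue machine that only finds 3, 1, 2 and 3 independent instructions
    # in four consecutive cycles: CPI ~0.44 instead of the ideal 1/3.
    print(effective_cpi_superscalar([3, 1, 2, 3], degree=3))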


Representative Superscalar Processors


Typical Superscalar Architecture

• A typical superscalar will have

– multiple instruction pipelines

– an instruction cache that can provide multiple instructions per fetch

– multiple buses among the function units

• In theory, all functional units can be simultaneously active.


4.2.2 VLIW Architecture

• VLIW = Very Long Instruction Word


• Instructions usually hundreds of bits long.
• Each instruction word essentially carries multiple "short instructions."
• Each of the "short instructions" is effectively issued at the same time.
• (This is related to the long words frequently used in microcode.)
• Compilers for VLIW architectures should optimally try to predict branch outcomes to properly
group instructions.
Pipelining in VLIW Processors

• Decoding of instructions is easier in VLIW than in superscalars, because each "region" of an
instruction word is usually limited as to the type of instruction it can contain.

• Code density in VLIW is less than in superscalars, because if a "region" of a VLIW word isn't
needed in a particular instruction, it must still exist (to be filled with a "no op").

• Superscalars can be object-code compatible with scalar processors; such compatibility is difficult
to achieve between parallel (VLIW) and non-parallel implementations of a VLIW architecture.


VLIW Opportunities

• "Random" parallelism among scalar operations is exploited in VLIW, instead of the regular
parallelism exploited in a vector or SIMD machine.

• The efficiency of the machine is entirely dictated by the success, or "goodness," of the
compiler in planning the operations to be placed in the same instruction words.

• Different implementations of the same VLIW architecture may not be binary-compatible with
each other, resulting in different latencies.

VLIW Summary

• VLIW reduces the effort required to detect parallelism using hardware or software techniques.

• The main advantage of VLIW architecture is its simplicity in hardware structure and instruction
set.

• Unfortunately, VLIW does require careful analysis of code in order to "compact" the most
appropriate "short" instructions into a VLIW word.

4.2.3 Vector Processors

• A vector processor is a coprocessor designed to perform vector computations.
• A vector is a one-dimensional array of data items (each of the same data type).
• Vector processors are often used in multipipelined supercomputers.

Architectural types include:


1. Register-to-Register (with shorter instructions and register files)

2. Memory-to-Memory (longer instructions with memory addresses)

1. Register-to-Register Vector Instructions

• Assume Vi is a vector register of length n, si is a scalar register, M(1:n) is a memory array of
length n, and "ο" is a vector operation.

• Typical instructions include the following:

– V1 ο V2 → V3 (element-by-element operation)

– s1 ο V1 → V2 (scaling of each element)

– V1 ο V2 → s1 (binary reduction, e.g. sum of products)

– M(1:n) → V1 (load a vector register from memory)

– V1 → M(1:n) (store a vector register into memory)


– ο V1 → V2 (unary vector, e.g. negation)

– ο V1 → s1 (unary reduction, e.g. sum of the vector elements)

2. Memory-to-Memory Vector Instructions

• Typical memory-to-memory vector instructions (using the same notation as above) include
the following:

– M1(1:n) ο M2(1:n) → M3(1:n) (binary vector)

– s1 ο M1(1:n) → M2(1:n) (scaling)

– ο M1(1:n) → M2(1:n) (unary vector)

– M1(1:n) ο M2(1:n) → M(k) (binary reduction)
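
The register-to-register instruction forms above can be mimicked with a short Python sketch
(plain lists stand in for vector registers; the helper names are purely illustrative):

    import operator
    from typing import Callable, List

    Vector = List[float]

    def vbinary(op: Callable[[float, float], float], v1: Vector, v2: Vector) -> Vector:
        """V1 o V2 -> V3: element-by-element operation on equal-length vectors."""
        assert len(v1) == len(v2)
        return [op(a, b) for a, b in zip(v1, v2)]

    def vscale(op: Callable[[float, float], float], s1: float, v1: Vector) -> Vector:
        """s1 o V1 -> V2: apply a scalar to every element."""
        return [op(s1, a) for a in v1]

    def vreduce(op: Callable[[float, float], float], v1: Vector, v2: Vector) -> float:
        """V1 o V2 -> s1: binary reduction, e.g. a sum of products (dot product)."""
        return sum(op(a, b) for a, b in zip(v1, v2))

    # Each "instruction" performs a string of n repeated operations, which is the
    # pattern a vector pipeline streams through at one result per cycle.
    v1, v2 = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]
    print(vbinary(operator.add, v1, v2))      # [5.0, 7.0, 9.0]
    print(vscale(operator.mul, 2.0, v1))      # [2.0, 4.0, 6.0]
    print(vreduce(operator.mul, v1, v2))      # 32.0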

Pipelines in Vector Processors


• Vector processors can usually make effective use of multiple deep pipelines in parallel; the number
of such parallel pipelines is effectively limited by the number of functional units.

• As usual, the effectiveness of a pipelined system depends on the availability and use of an
effective compiler to generate code that makes good use of the pipeline facilities.


Symbolic Processors
• Symbolic processors are somewhat unique in that their architectures are tailored toward the
execution of programs in languages similar to LISP, Scheme, and Prolog.

• In effect, the hardware provides a facility for the manipulation of the relevant data objects with
"tailored" instructions.

• These processors (and programs of these types) may invalidate assumptions made about more
traditional scientific and business computations.

4.3 Hierarchical Memory Technology

 Storage devices such as registers, caches, main memory, disk devices, and backup storage are often
organized as a hierarchy as depicted in Fig. 4.17.
 The memory technology and storage organization at each level is characterized by five parameters:

1. access time ti (round-trip time from CPU to ith level)


2. memory size si (number of bytes or words in level i)
3. cost per byte ci
4. transfer bandwidth bi (rate of transfer between levels)
5. unit of transfer xi (grain size for transfers between levels i and i+1)


Memory devices at a lower level are:

 faster to access,
 are smaller in capacity,
 are more expensive per byte,
 have a higher bandwidth, and
 have a smaller unit of transfer.

In general, ti-1 < ti, si-1 < si, ci-1 > ci, bi-1 > bi and xi-1 < xi for i = 1, 2, 3, and 4 in the
hierarchy where i = 0 corresponds to the CPU register level.
The cache is at level 1, main memory at level 2, the disks at level 3 and backup storage at level 4.
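
A minimal sketch (with purely illustrative numbers) that records the five parameters per level
and checks the monotonic relations stated above:

    from dataclasses import dataclass

    @dataclass
    class Level:
        name: str
        access_time_ns: float   # t_i
        size_bytes: int         # s_i
        cost_per_byte: float    # c_i
        bandwidth_mbps: float   # b_i
        transfer_unit: int      # x_i

    # Illustrative values only; real figures depend on the technology generation.
    hierarchy = [
        Level("registers", 0.5, 512, 100.0, 80_000, 8),
        Level("cache", 2, 512 * 1024, 1.0, 20_000, 64),
        Level("main memory", 50, 8 * 2**30, 0.01, 5_000, 4096),
        Level("disk", 5e6, 2 * 2**40, 1e-5, 500, 64 * 1024),
        Level("backup storage", 1e9, 50 * 2**40, 1e-6, 100, 512 * 1024),
    ]

    # Check t(i-1) < t(i), s(i-1) < s(i), c(i-1) > c(i), b(i-1) > b(i), x(i-1) < x(i).
    for lo, hi in zip(hierarchy, hierarchy[1:]):
        assert lo.access_time_ns < hi.access_time_ns
        assert lo.size_bytes < hi.size_bytes
        assert lo.cost_per_byte > hi.cost_per_byte
        assert lo.bandwidth_mbps > hi.bandwidth_mbps
        assert lo.transfer_unit < hi.transfer_unit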

Registers and Caches


Registers
• The registers are part of the processor.
• Register assignment is made by the compiler.
• Register transfer operations are directly controlled by the processor after instructions are decoded.
• Register transfer is conducted at processor speed, in one clock cycle.


Caches
 The cache is controlled by the MMU and is programmer-transparent.
 The cache can also be implemented at one or multiple levels, depending on the speed and
application requirements.
 Multi-level caches are built either on the processor chip or on the processor board.
 Multi-level cache systems have become essential to deal with memory access latency.

Main Memory (Primary Memory)

• It is usually much larger than the cache and is often implemented with the most cost-effective RAM
chips, such as DDR SDRAMs, i.e. double data rate synchronous dynamic RAMs.
• The main memory is managed by an MMU in cooperation with the operating system.

Disk Drives and Backup Storage

• The disk storage is considered the highest level of on-line memory.
• It holds the system programs such as the OS and compilers, and user programs and their data sets.
• Optical disks and magnetic tape units are off-line memory for use as archival and backup storage.
• They hold copies of present and past user programs and processed results and files.
• Disk drives are also available in the form of RAID arrays.

Peripheral Technology
 Peripheral devices include printers, plotters, terminals, monitors, graphics displays, optical
scanners, image digitizers, output microfilm devices etc.
 Some I/O devices are tied to special-purpose or multimedia applications.

4.3.2 Inclusion, Coherence, and Locality

Information stored in a memory hierarchy (M1, M2,…, Mn) satisfies 3 important properties:
1. Inclusion
2. Coherence
3. Locality
 We consider cache memory the innermost level M1, which directly communicates with the CPU
registers.
 The outermost level Mn contains all the information words stored. In fact, the collection of all
addressable words in Mn forms the virtual address space of a computer.


 Program and data locality is characterized below as the foundation for using a memory hierarchy
effectively.


1. The Inclusion Property

• The inclusion property is stated as:

M1 ⊂ M2 ⊂ … ⊂ Mn

• The implication of the inclusion property is that all items of information in the "innermost"
memory level (cache) also appear in the outer memory levels.


• The inverse, however, is not necessarily true. That is, the presence of a data item in level Mi+1 does
not imply its presence in level Mi. We call a reference to a missing item a "miss."

2. The Coherence Property

The requirement that copies of data items at successive memory levels be consistent is called the
"coherence property."

Coherence Strategies

• Write-through

– As soon as a data item in Mi is modified, immediate update of the corresponding data


item(s) in Mi+1, Mi+2, … Mn is required.

– This is the most aggressive (and expensive) strategy.

• Write-back

– The data item in Mi+1 corresponding to a modified item in Mi is not updated until the
modified item (or the block/page/etc. in Mi that contains it) is replaced or removed.

– This is the most efficient approach, but cannot be used (without modification) when
multiple processors share Mi+1, …, Mn.

3. Locality of References

• Memory references are generated by the CPU for either instruction or data access.
• These accesses tend to be clustered in certain regions in time, space, and ordering.
There are three dimensions of the locality property:

– Temporal locality – if location M is referenced at time t, then it (location M) will be
referenced again at some later time t + Δt.

– Spatial locality – if location M is referenced at time t, then a nearby location M ± m will
be referenced at time t + Δt.

– Sequential locality – if location M is referenced at time t, then locations M+1, M+2, …
will be referenced at times t + Δt, t + Δt′, etc.

• In each of these patterns, both m and Δt are "small."


• H&P suggest that 90 percent of the execution time in most programs is spent executing only 10
percent of the code.


Working Sets

• The set of addresses (bytes, pages, etc.) referenced by a program during the interval from t to
t + Δt, where Δt is called the working set parameter, changes slowly.

• This set of addresses, called the working set, should be present in the higher levels of M if a
program is to execute efficiently (that is, without requiring numerous movements of data items
from lower levels of M). This is called the working set principle.
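
A minimal sketch of the working set W(t, Δt) over a discrete page trace (the trace values are
illustrative):

    def working_set(page_trace: list[int], t: int, delta_t: int) -> set[int]:
        """W(t, Δt): the distinct pages referenced in the interval [t, t + Δt).
        Keeping this set resident in the upper levels of the hierarchy is the
        working set principle described above."""
        return set(page_trace[t:t + delta_t])

    trace = [0, 1, 2, 4, 2, 3, 7, 2, 1, 1, 0, 3]   # an illustrative page trace
    print(working_set(trace, t=2, delta_t=5))       # {2, 3, 4, 7}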

4.3.3 Memory Capacity Planning
The performance of a memory hierarchy is determined by the Effective Access Time Teff to any level
in the hierarchy. It depends on the hit ratios and access frequencies.

Hit Ratios

• When a needed item (instruction or data) is found in the level of the memory hierarchy being
examined, it is called a hit. Otherwise (when it is not found), it is called a miss (and the item
must be obtained from a lower level in the hierarchy).

• The hit ratio hi for Mi is the probability (between 0 and 1) that a needed data item is found
when sought in memory level Mi.

• The miss ratio is then simply 1 − hi.


• We assume h0 = 0 and hn = 1.

Access Frequencies

• The access frequency fi to level Mi is

fi = (1 − h1)(1 − h2) … (1 − hi-1) hi

• Note that f1 = h1, and f1 + f2 + … + fn = 1.

Effective Access Times

• There are different penalties associated with misses at different levels in the memory hierarchy.

– A cache miss is typically 2 to 4 times as expensive as a cache hit (assuming success at
the next level).

– A page fault (miss) is 3 to 4 orders of magnitude as costly as a page hit.

• The effective access time of a memory hierarchy can be expressed as

Teff = f1 t1 + f2 t2 + … + fn tn
     = h1 t1 + (1 − h1) h2 t2 + … + (1 − h1)(1 − h2) … (1 − hn-1) hn tn

• The effective access time is still dependent on program behavior and memory design choices.
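
A minimal sketch (with illustrative hit ratios and access times) that computes the access
frequencies fi and the effective access time Teff defined above:

    def access_frequencies(hit_ratios: list[float]) -> list[float]:
        """f_i = (1 - h_1)(1 - h_2)...(1 - h_{i-1}) * h_i, with h_n = 1 at the last level."""
        freqs, still_missing = [], 1.0
        for h in hit_ratios:
            freqs.append(still_missing * h)
            still_missing *= (1.0 - h)
        return freqs

    def effective_access_time(hit_ratios: list[float], access_times: list[float]) -> float:
        """T_eff = sum of f_i * t_i over all levels."""
        return sum(f * t for f, t in zip(access_frequencies(hit_ratios), access_times))

    # Illustrative three-level hierarchy: cache, main memory, disk (h3 = 1).
    hits = [0.95, 0.99, 1.0]
    times_ns = [2, 50, 5_000_000]
    print(access_frequencies(hits))                 # [0.95, 0.0495, 0.0005]
    print(effective_access_time(hits, times_ns))    # ~2504 ns, dominated by the rare disk access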

Hierarchy Optimization

The total cost of a memory hierarchy is estimated as follows:

Ctotal = c1 s1 + c2 s2 + … + cn sn

This implies that the cost is distributed over the n levels. Since c1 > c2 > c3 > … > cn, we have to
choose s1 < s2 < s3 < … < sn.
The optimal design of a memory hierarchy should result in a Teff close to the t1 of M1 and a total
cost close to the cost of Mn.
The optimization process can be formulated as a linear programming problem, given a ceiling C0 on
the total cost, that is, a problem to minimize Teff subject to the constraint Ctotal < C0.


4.4 Virtual Memory Technology

• To facilitate the use of memory hierarchies, the memory addresses normally generated by
modern processors executing application programs are not physical addresses, but are rather
virtual addresses of data items and instructions.

• Physical addresses, of course, are used to reference the available locations in the real physical
memory of a system.

• Virtual addresses must be mapped to physical addresses before they can be used.

Virtual to Physical Mapping

• The mapping from virtual to physical addresses can be formally defined as a function which, for
each virtual address, either returns the physical address allocated to it or reports a miss.

• The mapping returns a physical address if a memory hit occurs. If there is a memory miss, the
referenced item has not yet been brought into primary memory.
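
A minimal sketch of this virtual-to-physical mapping, using a small page table held in a
dictionary (the page size and table contents are illustrative):

    from typing import Optional

    PAGE_SIZE = 4096
    page_table = {0: 7, 1: 3, 4: 12}     # virtual page number -> page frame number

    def translate(virtual_address: int) -> Optional[int]:
        """Return the physical address on a hit, or None on a miss
        (the referenced page is not yet resident in primary memory)."""
        vpn, offset = divmod(virtual_address, PAGE_SIZE)
        pfn = page_table.get(vpn)
        if pfn is None:
            return None                  # memory miss: a page fault must bring the page in
        return pfn * PAGE_SIZE + offset

    print(hex(translate(0x1ABC)))        # virtual page 1 -> frame 3, i.e. 0x3abc
    print(translate(0x9000))             # None: page 9 is not resident (miss)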

Mapping Efficiency

• The efficiency with which the virtual to physical mapping can be accomplished significantly
affects the performance of the system.

• Efficient implementations are more difficult in multiprocessor systems where additional


problems such as coherence, protection, and consistency must be addressed.

Virtual Memory Models

1. Private Virtual Memory

2. Shared Virtual Memory



1. Private Virtual Memory

– In this scheme, each processor has a separate virtual address space, but all processors share the
same physical address space.

– Advantages:

• Small processor address space

• Protection on a per-page or per-process basis

• Private memory maps, which require no locking

– Disadvantages

• The synonym problem – different virtual addresses in different/same virtual spaces


point to the same physical page

• The same virtual address in different virtual spaces may point to different pages in
physical memory

2. Shared Virtual Memory

• All processors share a single shared virtual address space, with each processor being given a
portion of it.

• Some of the virtual addresses can be shared by multiple processors.

Advantages:

• All addresses are unique


• Synonyms are not allowed

Disadvantages

• Processors must be capable of generating large virtual addresses (usually > 32 bits)

• Since the page table is shared, mutual exclusion must be used to guarantee atomic updates

• Segmentation must be used to confine each process to its own address space

• The address translation process is slower than with private (per processor) virtual memory

Memory Allocation

Both the virtual address space and the physical address space are divided into fixed-length pieces.

– In the virtual address space these pieces are called pages.

– In the physical address space they are called page frames.

• The purpose of memory allocation is to allocate pages of virtual memory using the page frames of
physical memory.

4.4.2 TLB, Paging, and Segmentation

Both the virtual memory and physical memory are partitioned into fixed-length pages. The purpose of
memory allocation is to allocate pages of virtual memory to the page frames of the physical memory.

Address Translation Mechanisms

 The process demands the translation of virtual addresses into physical addresses. Various schemes
for virtual address translation are summarized in Fig. 4.21a.

 The translation demands the use of translation maps which can be implemented in various ways.
 Translation maps are stored in the cache, in associative memory, or in the main memory.


 To access these maps, a mapping function is applied to the virtual address. This function generates
a pointer to the desired translation map.
 This mapping can be implemented with a hashing or congruence function.
 Hashing is a simple computer technique for converting a long page number into a short one with
fewer bits.
 The hashing function should randomize the virtual page number and produce a unique hashed
number to be used as the pointer.

Translation Lookaside Buffer


 The TLB is a high-speed lookup table which stores the most recently or likely referenced page
entries.
 A page entry consists of essentially a (virtual page number, page frame number) pair. It is hoped
that pages belonging to the same working set will be directly translated using the TLB entries.
 The use of a TLB and PTs for address translation is shown in Fig 4.21b. Each virtual address is
divided into 3 fields:

– The leftmost field holds the virtual page number,
– the middle field identifies the cache block number,
– the rightmost field is the word address within the block.

 Our purpose is to produce the physical address consisting of the page frame number, the block
number, and the word address.
 The first step of the translation is to use the virtual page number as a key to search through the TLB
for a match.
 The TLB can be implemented with a special associative memory (content addressable memory) or
use part of the cache memory.

 In case of a match (a hit) in the TLB, the page frame number is retrieved from the matched page
entry. The cache block and word address are copied directly.
 In case the match cannot be found (a miss) in the TLB, a hashed pointer is used to identify one of
the page tables where the desired page frame number can be retrieved.
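
A minimal sketch of this TLB-then-page-table translation path (the TLB capacity, page table
contents and helper names are illustrative):

    from collections import OrderedDict

    PAGE_TABLE = {0: 5, 1: 9, 2: 4, 7: 1}    # virtual page number -> page frame number
    TLB_CAPACITY = 2
    tlb = OrderedDict()                       # most recently used page entries

    def tlb_translate(vpn: int, block: int, word: int) -> tuple[int, int, int]:
        """Translate (virtual page, block, word) into (frame, block, word).
        Step 1: search the TLB for the virtual page number (on a hit the block
        and word fields are copied through unchanged).  Step 2: on a TLB miss,
        fall back to the page table and refill the TLB, evicting the least
        recently used entry when the TLB is full."""
        if vpn in tlb:                        # TLB hit
            tlb.move_to_end(vpn)
            return tlb[vpn], block, word
        pfn = PAGE_TABLE[vpn]                 # TLB miss: consult the page table
        tlb[vpn] = pfn
        if len(tlb) > TLB_CAPACITY:
            tlb.popitem(last=False)           # evict the oldest entry
        return pfn, block, word

    print(tlb_translate(1, 3, 2))   # miss, then (9, 3, 2)
    print(tlb_translate(1, 0, 0))   # hit in the TLB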

Implementing Virtual Memory


There are 3 approaches to implement virtual memory:

1. Paging

2. Segmentation

3. A combination of the two called Paged Segmentation

1. Paging memory

• Memory is divided into fixed-size blocks called pages.


• Main memory contains some number of page frames, which is smaller than the number of pages in
the virtual memory.

• For example, if the page size is 2K and the physical memory is 16M (8K page frames) and the
virtual memory is 4G (2M pages), then there is a 256-to-1 mapping factor.

• A page map table is used for implementing a mapping, with one entry per virtual page.

2. Segmented memory

• In a segmented memory management system the blocks to be replaced in main memory are
potentially of unequal length and here the segments correspond to logical blocks of code or data.

For example, a subroutine or procedure.

• Segments, then, are "atomic" in the sense that either the whole segment should be in main
memory, or none of the segment should be there.

• The segments may be placed anywhere in main memory, but the instructions or data in one
segment must be contiguous.

3. Paged Segmentation

• It is a combination of paging and segmentation concepts


• Within each segment, the addresses are divided into fixed size pages

• Each virtual address is divided into 3 fields

– Segment Number

– Page Number

– Offset
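
A minimal sketch splitting a virtual address into the three fields of paged segmentation (the
field widths chosen here are illustrative):

    SEG_BITS, PAGE_BITS, OFFSET_BITS = 4, 16, 12

    def split_virtual_address(va: int) -> tuple[int, int, int]:
        """Return (segment number, page number, offset) for a 32-bit address."""
        offset = va & ((1 << OFFSET_BITS) - 1)
        page = (va >> OFFSET_BITS) & ((1 << PAGE_BITS) - 1)
        segment = (va >> (OFFSET_BITS + PAGE_BITS)) & ((1 << SEG_BITS) - 1)
        return segment, page, offset

    print(split_virtual_address(0xA12345FF))   # (10, 4660, 1535) = segment 0xA, page 0x1234, offset 0x5FF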

Inverted paging

• Besides direct mapping, address translation maps can also be implemented with inverted mapping
(Fig. 4.21c).
• In an inverted page table, one entry is created for each page frame that has been allocated to users.
Any virtual page number can be paired with a given physical page number.
• Inverted page tables are accessed either by an associative search or by the use of a hashing
function.

• In using an inverted PT, only virtual pages that are currently resident in physical memory are
included. This provides a significant reduction in the size of the page tables.
• The generation of a long virtual address from a short physical address is done with the help of
segment registers, as demonstrated in Fig. 4.21c.

• The leading 4 bits (denoted sreg) of a 32-bit address name a segment register.
• The register provides a segment id that replaces the 4-bit sreg to form a long virtual address.
• This effectively creates a single long virtual address space with segment boundaries at multiples of
256 Mbytes (2^28 bytes).
• The IBM RT/PC had a 12-bit segment id (4096 segments) and a 40-bit virtual address space.
• Either associative page tables or inverted page tables can be used to implement inverted mapping.

• The inverted page table can also be assisted with the use of a TLB. An inverted PT avoids the use
of a large page table or a sequence of page tables.
• Given a virtual address to be translated, the hardware searches the inverted PT for that address and,
if it is found, uses the table index of the matching entry as the address of the desired page frame.
• A hashing table is used to search through the inverted PT.
• The size of an inverted PT is governed by the size of the physical space, while that of traditional
PTs is determined by the size of the virtual space.
• Because of limited physical space, no multiple levels are needed for the inverted page table.
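
A minimal sketch of an inverted page table searched through a hash table, as described above
(the chained hashing scheme and the sizes are illustrative):

    from collections import defaultdict

    NUM_FRAMES = 8

    # inverted_pt[frame] holds the (process id, virtual page number) resident in that
    # frame, or None if the frame is free; its size is governed by physical memory,
    # not by the size of the virtual address space.
    inverted_pt = [None] * NUM_FRAMES

    # A hash table of buckets (chained, so collisions are tolerated) maps a hashed
    # (pid, vpn) key to candidate frame numbers, avoiding a linear search.
    buckets = defaultdict(list)

    def insert(pid: int, vpn: int, frame: int) -> None:
        inverted_pt[frame] = (pid, vpn)
        buckets[hash((pid, vpn)) % NUM_FRAMES].append(frame)

    def lookup(pid: int, vpn: int):
        """Return the frame holding (pid, vpn), or None (page fault)."""
        for frame in buckets.get(hash((pid, vpn)) % NUM_FRAMES, []):
            if inverted_pt[frame] == (pid, vpn):
                return frame              # the matching entry's index is the frame number
        return None

    insert(pid=1, vpn=0x4A, frame=2)
    insert(pid=1, vpn=0x10, frame=5)
    print(lookup(1, 0x4A))   # 2
    print(lookup(1, 0x99))   # None -> page fault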

4.4.3 Page Replacement Policies

• Memory management policies include the allocation and deallocation of memory pages to active
processes and the replacement of memory pages.
• Demand paging refers to the process in which a resident page in main memory is replaced by a
new page transferred from the disk when a page fault occurs.
• Since the number of available page frames is much smaller than the number of pages, the frames
will eventually be fully occupied.
• In order to accommodate a new page, one of the resident pages must be replaced.
• The goal of a page replacement policy is to minimize the number of possible page faults so that the
effective memory-access time can be reduced.
• The effectiveness of a replacement algorithm depends on the program behavior and memory traffic
patterns encountered.
• A good policy should match the program locality property. The policy is also affected by page size
and by the number of available frames.

Page Traces: A page trace is a sequence of page frame numbers (PFNs) generated during the
execution of a given program.

The following page replacement policies are specified in a demand paging memory system for a page
fault at time t.
(1) Least recently used (LRU)—This policy replaces the page in R(t) which has the longest backward
distance, i.e. the resident page whose most recent reference lies farthest in the past.


(2) Optimal (OPT) algorithm—This policy replaces the page in R(t) with the longest forward
distance, i.e. the resident page whose next reference lies farthest in the future.

(3) First-in-first-out (FIFO)—This policy replaces the page in R(t) which has been in memory for the
longest time.
(4) Least frequently used (LFU)—This policy replaces the page in R(t) which has been least
referenced in the past.
(5) Circular FIFO—This policy joins all the page frame entries into a circular FIFO queue using a
pointer to indicate the front of the queue.
• An allocation bit is associated with each page frame. This bit is set upon initial allocation of a page
to the frame.
• When a page fault occurs, the queue is circularly scanned from the pointer position.
• The pointer skips the allocated page frames and replaces the very first unallocated page frame.
• When all frames are allocated, the front of the queue is replaced, as in the FIFO policy.

(6) Random replacement—This is a trivial algorithm which chooses any page for replacement
randomly.

Example:
Consider a paged virtual memory system with a two-level hierarchy: main memory M1 and disk
memory M2.
Assume a page size of four words. The number of page frames in M1 is 3, labeled a, b and c; and the
number of pages in M2 is 10, identified by 0, 1, 2, …, 9. The ith page in M2 consists of word
addresses 4i to 4i + 3 for all i = 0, 1, 2, …, 9.
A certain program generates a sequence of word addresses, which are grouped together if they
belong to the same page. The sequence of page numbers so formed is the page trace.

Page tracing experiments are described below for three page replacement policies: LRU, OPT, and
FIFO, respectively. The successive pages loaded in the page frames (PFs) form the trace entries.
Initially, all PFs are empty.
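
A minimal sketch that simulates the LRU and FIFO policies over a page trace with three page
frames and counts the hits (the trace below is illustrative, not the one used in the example):

    def simulate(trace: list[int], num_frames: int, policy: str) -> int:
        """Run a page trace through `num_frames` frames and return the hit count.
        policy = "LRU" replaces the resident page with the longest backward
        distance; policy = "FIFO" replaces the page resident for the longest time."""
        frames: list[int] = []           # resident set R(t), victim kept at the front
        hits = 0
        for page in trace:
            if page in frames:
                hits += 1
                if policy == "LRU":      # refresh recency; FIFO leaves the order alone
                    frames.remove(page)
                    frames.append(page)
            else:                        # page fault
                if len(frames) == num_frames:
                    frames.pop(0)        # evict the victim
                frames.append(page)
        return hits

    trace = [0, 1, 2, 4, 2, 3, 7, 2, 1, 0, 3]
    for policy in ("LRU", "FIFO"):
        print(policy, simulate(trace, num_frames=3, policy=policy), "hits")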



ACA (15CS72) Notes Module-III

Module-III

Chapter 5 Bus, Cache and Shared Memory


5.1 Bus Systems

• The system bus of a computer operates on a contention basis.

• Several active devices such as processors may request use of the bus at the same time.
• Only one of them can be granted access to the bus at a time.
• The effective bandwidth available to each processor is inversely proportional to the number of
processors contending for the bus.
• For this reason, most bus-based commercial multiprocessors have been small in size.
• The simplicity and low cost of a bus system made it attractive in building small multiprocessors
ranging from 4 to 16 processors.

5.1.1 Backplane Bus Specification

 A backplane bus interconnects processors, data storage and peripheral devices in a tightly coupled
hardware.


• The system bus must be designed to allow communication between devices on the bus without
disturbing the internal activities of all the devices attached to the bus.
• Timing protocols must be established to arbitrate among multiple requests. Operational rules must
be set to ensure orderly data transfers on the bus.
 Signal lines on the backplane are often functionally grouped into several buses as shown in Fig 5.1.
Various functional boards are plugged into slots on the backplane. Each slot is provided with one
or more connectors for inserting the boards as demonstrated by the vertical arrows.

Data Transfer Bus (DTB)

• Data address and control lines form the data transfer bus (DTB) in VME bus.
• Address lines are used to broadcast the data and device address
– Proportional to log of address space size
• Data lines proportional to memory word length
• Control lines specify read/write, timing, and bus error conditions


Bus Arbitration and Control

 The process of assigning control of the DTB to a requester is called arbitration. Dedicated lines are
reserved to coordinate the arbitration process among several requesters.
 The requester is called a master, and the receiving end is called a slave.
 Interrupt lines are used to handle interrupts, which are often prioritized. Dedicated lines may be
used to synchronize parallel activities among the processor modules.
 Utility lines include signals that provide periodic timing (clocking) and coordinate the power-up
and power-down sequences of the system.
 The backplane is made of signal lines and connectors.
 A special bus controller board is used to house the backplane control logic, such as the system
clock driver, arbiter, bus timer, and power driver.


Functional Modules

A functional module is a collection of electronic circuitry that resides on one functional board (Fig.
5.1) and works to achieve special bus control functions.
Special functional modules are introduced below:
• Arbiter is a functional module that accepts bus requests from the requester module and grants
control of the DTB to one requester at a time.
• Bus timer measures the time each data transfer takes on the DTB and terminates the DTB cycle if
a transfer takes too long.
• Interrupter module generates an interrupt request and provides status/ID information when an
interrupt handler module requests it.
• Location monitor is a functional module that monitors data transfers over the DTB. A power
monitor watches the status of the power source and signals when power becomes unstable.
• System clock driver is a module that provides a clock timing signal on the utility bus. In addition,
board interface logic is needed to match the signal line impedance, the propagation time, and
termination values between the backplane and the plug-in boards.

Physical Limitations



• Due to electrical, mechanical, and packaging limitations, only a limited number of boards can be
plugged into a single backplane.
• Multiple backplane buses can be mounted on the same backplane chassis.
• The bus system is difficult to scale, mainly limited by packaging constraints.

5.1.2 Addressing and Timing Protocols

• Two types of printed circuit boards connected to a bus: active and passive
• Active devices like processors can act as bus masters or as slaves at different times.
• Passive devices like memories can act only as slaves.
• The master can initiate a bus cycle
– Only one can be in control at a time
• The slaves respond to requests by a master
– Multiple slaves can respond
Bus Addressing

• The backplane bus is driven by a digital clock with a fixed cycle time, called the bus cycle.
• The backplane has a limited physical size, so it will not appreciably skew information relative to the clock.


• Factors affecting bus delay:


– Source’s line drivers, destination’s receivers, slot capacitance, line length, and bus
loading effects
• Design should minimize overhead time, so most bus cycles used for useful operations
• Identify each board with a slot number
• When slot number matches contents of high-order address lines, the board is selected as a slave
(slot addressing)

Broadcall and Broadcast


• Most bus transactions involve only one master and one slave.


• Broadcall: read operation where multiple slaves place data on bus
– detects multiple interrupt sources

• Broadcast: write operation involving multiple slaves


– Implements multicache coherence on the bus
• Timing protocols are needed to synchronize master and slave operations.
• Figure 5.2 shows a typical timing sequence when information is transferred over a bus from a
source to a destination.
• Most bus timing protocols implement such a sequence.

Synchronous Timing

• All bus transaction steps take place at fixed clock edges as shown in Fig. 5.3a.
• The clock signals are broadcast to all potential masters and slaves.
• Clock cycle time determined by slowest device on bus
• Once the data becomes stabilized on the data lines, the master uses Data-ready pulse to initiate
the transfer
• The Slave uses Data-accept pulse to signal completion of the information transfer.
• Synchronous timing is simple, requires less circuitry, and is suitable for devices with relatively similar speeds.


Asynchronous Timing

• Based on handshaking or interlocking mechanism as shown in Fig. 5.3b.


• No fixed clock cycle is needed.
• The rising edge (1) of the data-ready signal from the master triggers the rising edge (2) of the data-
accept signal from the slave.
• The second signal triggers the falling (3) of the data-ready clock and removal of data from the bus.
• The third signal triggers the trailing edge (4) of the data accept clock.
• This four-edge handshaking (interlocking) process is repeated until all the data is transferred.
Advantages: provides the freedom of variable-length clock signals for devices of different speeds
• No response time restrictions
• More flexible
Disadvantage: More complex and costly

5.1.3 Arbitration, Transaction and Interrupt

Arbitration

• Process of selecting next bus master
• Bus tenure is duration of master’s control
• It restricts the tenure of the bus to one master at a time.
• Competing requests must be arbitrated on a fairness or priority basis
• Arbitration competition and bus transactions take place concurrently on a parallel bus over
separate lines
Central Arbitration
• Uses a central arbiter as shown in Fig 5.4a
• Potential masters are daisy chained in a cascade
• A special signal line propagates bus-grant from first master (at slot 1) to the last master (at slot
n).
• All requests share the same bus-request line
• The bus-request signals the rise of the bus-grant level, which in turn raises the bus-busy level
as shown in Fig. 5.4b.


• Simple scheme
• Easy to add devices
• Fixed-priority sequence – not fair
• Propagation of bus-grant signal is slow
• Not fault tolerant

Independent Requests and Grants

• Provide independent bus-request and grant signals for each master as shown in Fig. 5.5a.
• No daisy chaining is used in this scheme.
• Require a central arbiter, but can use a priority or fairness based policy
• More flexible and faster than a daisy-chained policy
• Larger number of lines – costly


Distributed Arbitration

• Each master has its own arbiter and unique arbitration number as shown in Fig. 5.5b.
• Uses arbitration number to resolve arbitration competition
• When two or more devices compete for the bus, the winner is the one whose arbitration number is
the largest; this is determined by parallel contention arbitration.
• Each potential master sends its arbitration number to the shared-bus request/grant (SBRG) lines
and compares its own number with the SBRG number.
• If the SBRG number is greater, the requester is dismissed. At the end, the winner’s arbitration
number remains on the arbitration bus. After the current bus transaction is completed, the winner
seizes control of the bus.
• Priority based scheme
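
A minimal sketch of parallel contention arbitration, with the shared SBRG lines modeled as a
wired-OR settled one bit at a time from the most significant bit down (the arbitration numbers
are illustrative):

    def arbitrate(requesters: list[int]) -> int:
        """Return the arbitration number of the winning (largest-numbered) master."""
        width = max(requesters).bit_length()
        survivors = list(requesters)
        for bit in reversed(range(width)):
            line = any(n & (1 << bit) for n in survivors)       # wired-OR of driven bits
            if line:
                # A requester driving 0 here sees a greater number on the lines
                # and withdraws from the competition.
                survivors = [n for n in survivors if n & (1 << bit)]
        return survivors[0]

    print(arbitrate([0b0101, 0b1001, 0b0111]))   # 0b1001 (9) wins: the largest number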


Transfer Modes

• Address-only transfer: no data


• Compelled-data transfer: an address transfer followed by a block of one or more data transfers to
one or more contiguous addresses.
• Packet-data transfer: an address transfer followed by a fixed-length block of data transfers to a
set of contiguous addresses.
• Connected: carry out master’s request and a slave’s response in a single bus transaction
• Split: splits request and response into separate transactions
– Allow devices with long latency or access time to use bus resources more efficiently
– May require two or more connected bus transactions

Interrupt Mechanisms

• Interrupt: is a request from I/O or other devices to a processor for service or attention
• A priority interrupt bus is used to pass the interrupt signals
• Interrupter must provide status and identification information
• Have an interrupt handler for each request line

• Interrupts can be handled by message passing on data lines on a time-sharing basis.
– Save lines, but use cycles
– Use of time-shared data bus lines is a virtual-interrupt

5.1.4 IEEE and other Standards

• Open bus standard Futurebus+ to support:


– 64 bit address space
– Throughput required by multi-RISC or future generations of multiprocessor
architectures
• Expandable or scalable
• Independent of particular architectures and processor technologies

Standard Requirements

The major objectives of the Futurebus+ standards committee were to create a bus standard that would
provide a significant step forward in improving the facilities and performance available to the
designers of multiprocessor systems.


Below are the design requirements set by the IEEE 896.1-1991 Standards Committee to provide a
stable platform on which several generations of computer systems could be based:
• Independence for an open standard
• Asynchronous timing protocol
• Optional packet protocol
• Distributed arbitration protocols
• Support of high reliability and fault tolerant applications
• Ability to lock modules without deadlock or livelock
• Circuit-switched and split transaction protocols
• Support of real-time mission-critical computations with multiple priority levels
• 32 or 64 bit addressing
• Direct support of snoopy cache-based multiprocessors.
• Compatible message passing protocols

5.2 Cache Memory Organizations


Cache memory is the fast memory that lies between the registers and main memory (RAM) in the
memory hierarchy. It holds recently used data and/or instructions.

5.2.1 Cache Addressing Models

• Most multiprocessor systems use private caches for each processor as shown in Fig. 5.6
• Have an interconnection network between caches and main memory
• Caches can be addressed using either a Physical Address or Virtual Address.
• Two different cache design models are:
– Physical address cache
– Virtual address cache


Physical address cache


• When the cache is addressed by a physical address it is called a physical address cache. The cache is
indexed and tagged with the physical address.
• Cache lookup must occur after address translation in the TLB or MMU. No aliasing is allowed, so
that the address is always uniquely translated without confusion.
• After a cache miss, a block is loaded from main memory.
• Either a write-back or a write-through policy is used on a write.

Advantages:
• No cache flushing on a context switch
• No aliasing problem thus fewer cache bugs in OS kernel.
• Simple design
• Requires little intervention from OS kernel

Disadvantages:
Slowdown in accessing the cache until the MMU/TLB finishes translating the address


Virtual Address caches


• When a cache is indexed or tagged with virtual address it is called virtual address cache.
• In this model both cache and MMU translation or validation are done in parallel.
• The physical address generated by the MMU can be saved in tags for later write back but is not
used during the cache lookup operations.
Advantages:
• do address translation only on a cache miss
• faster for hits because no address translation
• More efficient access to cache

Disadvantages:
• Cache flushing is needed on a context switch (for example, without flushing, local data segments
could get an erroneous hit for virtual addresses already cached after the virtual address space changes).
• Aliasing problem (several different virtual addresses cannot span the same physical addresses
without being duplicated in cache).

The Aliasing Problem


• The major problem associated with a virtual address cache is aliasing.

• Different logically addressed data have the same index/tag in the cache
• Confusion if two or more processors access the same physical cache location
• Flush cache when aliasing occurs, but leads to slowdown
• Apply special tagging with a process key or with a physical address

5.2.2 Direct Mapping Cache and Associative Cache


• The transfer of information from main memory to cache memory is conducted in units of cache
blocks or cache lines.
• Four block placement schemes are presented below. Each placement scheme has its own merits
and demerits.
• The ultimate performance depends upon cache access patterns, organization, and management
policy
• Blocks in caches are called block frames Bi (i ≤ m), and blocks in main memory are called
blocks Bj (j ≤ n), where n >> m, n = 2^s and m = 2^r.
• Each block has b = 2^w words, so the cache holds a total of m·b = 2^(r+w) words and main
memory holds n·b = 2^(s+w) words.


Direct Mapping Cache


• Direct mapping maps n/m = 2^(s−r) memory blocks to one block frame in the cache.
• Placement is by a modulo-m function: block Bj is mapped to block frame Bi, i.e.
Bj → Bi if i = j mod m
• There is a unique block frame Bi that each Bj can load into.
• There is no way to implement a block replacement policy.
• This Direct mapping is very rigid but is the simplest cache organization to implement.
The memory address is divided into 3 fields:
– The lower w bits specify the word offset within each block.
– The upper s bits specify the block address in main memory
– The leftmost (s-r) bits specify the tag to be matched


The block field (r bits) is used to implement the (modulo-m) placement, where m = 2^r.
Once the block Bi is uniquely identified by this field, the tag associated with the addressed block is
compared with the tag in the memory address.
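
A minimal sketch of the direct-mapped address decomposition just described (the field widths
s = 12, r = 6 and w = 2 are illustrative):

    S_BITS, R_BITS, W_BITS = 12, 6, 2

    def direct_map_fields(address: int) -> tuple[int, int, int]:
        """Split a memory address into (tag, block frame index, word offset)."""
        word = address & ((1 << W_BITS) - 1)
        block = (address >> W_BITS) & ((1 << R_BITS) - 1)    # frame = block mod m, m = 2**R_BITS
        tag = address >> (W_BITS + R_BITS)
        return tag, block, word

    tag, frame, word = direct_map_fields(0x2F7B)
    print(tag, frame, word)   # the r-bit block field picks the unique frame;
                              # the tag stored there must equal `tag` for a hit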

• Advantages
– Simple hardware
– No associative search
– No block replacement policy needed
– Lower cost
– Higher speed
• Disadvantages
– Rigid mapping
– Poorer hit ratio
– Prohibits parallel virtual address translation
– Use larger cache size with more block frames to avoid contention

Fully Associative Cache




• Each block in main memory can be placed in any of the available block frames as shown in
Fig. 5.10a.
• Because of this flexibility, an s-bit tag is needed in each cache block.
• As s > r, this represents a significant increase in tag length.
• The name fully associative cache is derived from the fact that an m-way associative search
requires tag to be compared with all block tags in the cache. This scheme offers the greatest
flexibility in implementing block replacement policies for a higher hit ratio.
• An m-way comparison of all tags is very time consuming if the tags are compared sequentially
using RAMs. Thus an associative memory is needed to achieve a parallel comparison with all
tags simultaneously.
• This demands higher implementation cost for the cache. Therefore, a Fully Associative Cache
has been implemented only in moderate size.
• Fig. 5.10b shows a four-way mapping example using a fully associative search. The tag is 4-
bits long because 16 possible cache blocks can be destined for the same block frame.

• Advantages:
– Offers most flexibility in mapping cache blocks
– Higher hit ratio
– Allows better block replacement policy with reduced block contention

• Disadvantages:
– Higher hardware cost
– Only moderate size cache
– Expensive search process

Set Associative Caches


• In a k-way associative cache, the m cache block frames are divided into v=m/k sets, with k
blocks per set
• Each set is identified by a d-bit set number, where 2^d = v.
• The cache block tags are now reduced to s-d bits.
• In practice, the set size k, or associativity, is chosen as 2, 4, 8, 16 or 64 depending on a tradeoff
among block size w, cache size m and other performance/cost factors.

• Compare the tag with the k tags within the identified set as shown in Fig 5.11a.
• Since k is rather small in practice, the k-way associative search is much more economical than
the full associativity.
• In general, a block Bj can be mapped into any one of the available frames Bf in a set Si defined
below.
Bj → Bf ∈ Si if j (mod v) = i
• The matched tag identifies the current block which resides in the frame.
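A minimal Python sketch of this k-way placement rule follows; the values of m and k are illustrative assumptions.

# k-way set-associative placement sketch (hypothetical parameters).
m, k = 256, 4            # block frames in the cache and associativity
v = m // k               # number of sets, v = m/k

def set_index(j):
    # Memory block Bj can be placed in any frame of set Si with i = j mod v.
    return j % v

def is_hit(cache_sets, j, tag):
    # Only the k tags of the selected set are compared, not all m tags.
    return tag in cache_sets[set_index(j)]

cache_sets = [set() for _ in range(v)]      # each set holds at most k tags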

Sector Mapping Cache


• Partition both the cache and main memory into fixed-size sectors. Then use a fully associative search, i.e., each sector can be placed in any of the available sector frames.
• The memory requests are destined for blocks, not for sectors.
• This can be filtered out by comparing the sector tag in the memory address with all sector tags
using a fully associative search.
• If a matched sector frame is found (a cache hit), the block field is used to locate the desired
block within the sector frame.

If a cache miss occurs, the missing block is fetched from the main memory and brought into a
congruent block frame in available sector.
• That is the ith block in a sector must be placed into the ith block frame in a destined sector
frame.
• Attach a valid bit to each block frame to indicate whether the block is valid or invalid.

• When the contents of the block frame are replaced from a new sector, the remaining block
frames in the same sector are marked invalid. Only the block frames from the most recently
referenced sector are marked valid for reference.

Advantages:
• Flexible to implement various block replacement algorithms
• Economical to perform a fully associative search over a limited number of sector tags.
• Sector partitioning offers more freedom in grouping cache lines at both ends of the mapping.

5.2.4 Cache Performance Issues
As far as cache performance is concerned, trade-offs exist among the cache size, set number, block size and memory speed. The important aspects of cache design with regard to performance are:

Cycle counts
• This refers to the number of basic machine cycles needed for cache access, update and coherence
control.
• Cache speed is affected by underlying static or dynamic RAM technology, the cache organization
and the cache hit ratios.
• The write through or write back policy also affect the cycle count.
• Cache size, block size, set number, and associativity all affect the cycle count.
• The cycle count is closely tied to the hit ratio: raising the hit ratio (by tuning the above cache parameters) lowers the average cycle count.

Hit ratio
• The hit ratio is the number of hits divided by the total number of CPU references to memory (hits plus misses).
• Hit ratio is affected by cache size and block size
• Increases w.r.t. increasing cache size
• Limited cache size, initial loading, and changes in locality prevent 100% hit ratio

Effect of Block Size:


• With a fixed cache size, cache performance is sensitive to the block size.
• As block size increases, hit ratio improves due to spatial locality
• Peaks at optimum block size, then decreases
• If too large, many words in cache not used

Effect of set number


• In a set associative cache, the effects of set number are obvious.
• For a fixed cache capacity, the hit ratio may decrease as the number of sets increases.
• As the set number increases from 32 to 64, 128 and 256, the decrease in the hit ratio is rather small.
• When the set number increases to 512 and beyond, the hit ratio decreases faster.

5.3 Shared Memory Organizations


Memory interleaving provides a higher bandwidth for pipelined access of contiguous memory locations.
Methods for allocating and deallocating main memory to multiple user programs are considered for
optimizing memory utilization.

5.3.1 Interleaved Memory Organization


• In order to close up the speed gap between the CPU/cache and main memory built with RAM
modules, an interleaving technique is presented below which allows pipelined access of the parallel
memory modules.
• The memory design goal is to broaden the effective memory bandwidth so that more memory words can be accessed per unit time.
The ultimate purpose is to match the memory bandwidth with the bus bandwidth and with the
processor bandwidth.

Memory Interleaving
• The main memory is built with multiple modules.
• These memory modules are connected to a system bus or a switching network to which other
resources such as processors or I/O devices are also connected.
• Once presented with a memory address, each memory module returns with one word per cycle.
• It is possible to present different addresses to different memory modules so that parallel access of
multiple words can be done simultaneously or in a pipelined fashion.

Consider a main memory formed with m = 2^a memory modules, each containing w = 2^b words of memory cells. The total memory capacity is m·w = 2^(a+b) words.
These memory words are assigned linear addresses. Different ways of assigning linear addresses result
in different memory organizations.
Besides random access, the main memory is often block-accessed at consecutive addresses.

Figure 5.15 shows two address formats for memory interleaving.


• Low-order interleaving
• High-order interleaving

Low-order interleaving
• Low-order interleaving spreads contiguous memory locations across the m modules horizontally
(Fig. 5.15a).
• This implies that the low-order a bits of the memory address are used to identify the memory
module.
• The high-order b bits are the word addresses (displacement) within each module.
• Note that the same word address is applied to all memory modules simultaneously. A module
address decoder is used to distribute module addresses.

High-order interleaving
• High-order interleaving uses the high-order a bits as the module address and the low-order b bits as
the word address within each module (Fig. 5.15b).
• Contiguous memory locations are thus assigned to the same memory module. In each memory
cycle, only one word is accessed from each module.
• Thus the high-order interleaving cannot support block access of contiguous locations.
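The two schemes above differ only in which address bits select the module. A minimal Python sketch (the values of a and b are illustrative assumptions):

# Interleaved-memory address mapping sketch: m = 2**a modules, w = 2**b words each.
a, b = 3, 8

def low_order(addr):
    # Low-order interleaving: low a bits select the module, high b bits the word.
    return addr & ((1 << a) - 1), addr >> a

def high_order(addr):
    # High-order interleaving: high a bits select the module, low b bits the word.
    return addr >> b, addr & ((1 << b) - 1)

# Eight consecutive addresses fall in eight different modules only under low-order interleaving:
print([low_order(x)[0] for x in range(8)])    # [0, 1, 2, 3, 4, 5, 6, 7]
print([high_order(x)[0] for x in range(8)])   # [0, 0, 0, 0, 0, 0, 0, 0]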

Pipelined Memory Access


• Access of the m memory modules can be overlapped in a pipelined fashion.


• For this purpose, the memory cycle (called the major cycle) is subdivided into m minor cycles.

• An eight-way interleaved memory (with m=8 and w=8 and thus a=b=3) is shown in Fig. 5.16a.
• Let θ be the major cycle and τ the minor cycle. These two cycle times are related as follows:
τ = θ/m
where m = degree of interleaving, θ = total time to complete access of one word, and τ = actual time to produce one word.
• The total block access time is 2θ and the effective access time of each word is τ.
• The timing of the pipelined access of the 8 contiguous memory words is shown in Fig. 5.16b.
• This type of concurrent access of contiguous words has been called a C-access memory scheme.
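A minimal timing sketch of this C-access behaviour, using an illustrative value for θ:

# C-access timing sketch: theta = major cycle, tau = theta/m = minor cycle.
m = 8                        # degree of interleaving
theta = 100.0                # total time to complete access of one word (ns, assumed)
tau = theta / m              # effective access time per word once the pipeline is full

def block_access_time(n_words):
    # The first word takes a full theta; each later word emerges every tau.
    return theta + (n_words - 1) * tau

print(block_access_time(m))  # 187.5 ns, roughly 2*theta for an m-word block
print(block_access_time(1))  # 100.0 ns, i.e. theta for a single (scalar) access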

5.3.2 Bandwidth and Fault Tolerance


Hellerman (1967) has derived an equation to estimate the effective increase in memory bandwidth
through multiway interleaving. A single memory module is assumed to deliver one word per memory
cycle and thus has a bandwidth of 1.
Memory Bandwidth

The memory bandwidth B of an m-way interleaved memory is upper-bounded by m and lower-bounded by 1. The Hellerman estimate of B is

B = m^0.56 ≈ √m      (5.5)
where m is the number of interleaved memory modules.
 This equation implies that if 16 memory modules are used, then the effective memory bandwidth is
approximately four times that of a single module.
 This pessimistic estimate is due to the fact that block access of various lengths and access of single
words are randomly mixed in user programs.
 Hellerman's estimate was based on a single-processor system. If memory-access conflicts from
multiple processors (such as the hot spot problem) are considered, the effective memory bandwidth
will be further reduced.
 In a vector processing computer, the access time of a long vector with n elements and stride
distance 1 has been estimated by Cragon (1992) as follows:
 It is assumed that the n elements are stored in contiguous memory locations in an m-way
interleaved memory system.

The average time t1 required to access one element in a vector is estimated by

t1 = (θ/m)(1 + (m - 1)/n)      (5.6)

When n → ∞ (very long vector), t1 → θ/m = τ.
As n → 1 (scalar access), t1 → θ.
Equation 5.6 conveys the message that interleaved memory appeals to pipelined access of long vectors;
the longer the better.
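A minimal Python sketch evaluating both estimates numerically (θ and the module counts used here are illustrative assumptions):

import math

def hellerman_bandwidth(m):
    # Eq. 5.5: effective bandwidth of an m-way interleaved memory, B = m**0.56 ~ sqrt(m).
    return math.sqrt(m)

def avg_access_time(theta, m, n):
    # Eq. 5.6: average time per element of an n-element, stride-1 vector.
    return (theta / m) * (1 + (m - 1) / n)

print(hellerman_bandwidth(16))           # ~4x the bandwidth of a single module
print(avg_access_time(100.0, 8, 1))      # 100.0  -> theta   (scalar access)
print(avg_access_time(100.0, 8, 10**6))  # ~12.5  -> theta/m (very long vector)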

Fault Tolerance
 High- and low-order interleaving can be combined to yield many different interleaved memory
organizations.
 Sequential addresses are assigned in the high-order interleaved memory in each memory module.
 This makes it easier to isolate faulty memory modules in a memory bank of m memory modules.
 When one module failure is detected, the remaining modules can still be used by opening a window in the address space.
 This fault isolation cannot be carried out in a low-order interleaved memory, in which a module

failure may paralyze the entire memory bank.
 Thus low-order interleaving memory is not fault-tolerant.

5.3.3 Memory Allocation Schemes


• Virtual memory allows many s/w processes time-shared use of main memory
• Memory manager handles the swapping
• It monitors amount of available main memory and decides which processes should reside and
which to remove.
Allocation Policies
• Memory swapping: process of moving blocks of data between memory levels
• Nonpreemptive allocation: an incoming process is swapped in only into free memory; resident processes are not swapped out to make room
– Easier to implement, less efficient
• Preemptive allocation: has the freedom to preempt (swap out) an executing process
– More complex, expensive, and flexible
• Local allocation: considers only the resident working set of the faulting process
– Used by most computers

• Global allocation: considers the history of the working sets of all resident processes in making
a swapping decision

Swapping Systems
• Allow swapping only at entire process level
• Swap device: configurable section of a disk set aside for temp storage of data swapped
• Swap space: portion of disk set aside
• Depending on system, may swap entire processes only, or the necessary pages

Swapping in UNIX
• System calls that result in a swap:
– Allocation of space for child process being created
– Increase in size of a process address space
– Increased space demand by stack for a process
– Demand for space by a returning process swapped out previously
• Special process 0 is the swapper

Demand Paging Systems

• Allows only pages to be transferred b/t main memory and swap device
• Pages are brought in only on demand
• Allows process address space to be larger than physical address space
• Offers flexibility to dynamically accommodate large # of processes in physical memory on
time-sharing basis

Working Sets

• Set of pages referenced by the process during last n memory refs (n=window size)
• Only working sets of active processes are resident in memory

Other Policies

• Hybrid memory systems combine advantages of swapping and demand paging


• Anticipatory paging prefetches pages based on anticipation
– Difficult to implement


5.4 Sequential and Weak Consistency Models

• Memory inconsistency: when memory access order differs from program execution order
Sequential consistency: memory accesses (I and D) consistent with program execution order

Memory Consistency Issues


• Memory model: behavior of a shared memory system as observed by processors
• Choosing a memory model – compromise between a strong model minimally restricting s/w
and a weak model offering efficient implementation
• Primitive memory operations: load, store, swap

Event Orderings
• Processes: concurrent instruction streams executing on different processors
• Consistency models specify the order by which events from one process should be observed by
another
• Event ordering helps determine if a memory event is legal for concurrent accesses

• Program order: order by which memory access occur for execution of a single process, w/o
any reordering
The event ordering can he used to declare whether a memory event is legal or illegal, when several
processes are accessing a common set of memory locations.

A program order is the order by which memory accesses occur for the execution of a single process,
provided that no program reordering has taken place.
Three primitive memory operations for the purpose of specifying memory consistency models are
defined:
(1) A load by processor Pi is considered performed with respect to processor Pk at a point of time
when the issuing of a store to the same location by Pk cannot affect the value returned by the load.

(2) A store by Pi is considered performed with respect to Pk at a point of time when an issued load to the same address by Pk returns the value written by this store.

(3) A load is globally performed if it is performed with respect to all processors and if the store that is
the source of the returned value has been performed with respect to all processors.


As illustrated in Fig. 5.19a, a processor can execute instructions out of program order using a
compiler to resequence instructions in order to boost performance.
A uniprocessor system allows these out-of-sequence executions provided that hardware interlock
mechanisms exist to check data and control dependences between instructions.
 When a processor in a multiprocessor system executes a concurrent program as illustrated in Fig.
5.19b, local dependence checking is necessary but may not be sufficient to preserve the intended
outcome of a concurrent execution.


Difficulty in Maintaining Correctness on an MIMD


(a) The order in which instructions belonging to different streams are executed is not fixed in a parallel
program. If no synchronization among the instruction streams exists, then a large number of different
instruction interleavings is possible.
(b) If for performance reasons the order of execution of instructions belonging to the same stream is different from the program order, then an even larger number of instruction interleavings is possible.

(c) If accesses are not atomic with multiple copies of the same data coexisting as in a cache-based
system, then different processors can individually observe different interleavings during the same
execution. In this case, the total number of possible execution instantiations of a program becomes
even larger.

Atomicity
Three categories of multiprocessor memory behavior:
• Program order preserved and uniform observation sequence by all processors
• Out-of-program-order allowed and uniform observation sequence by all processors
• Out-of-program-order allowed and nonuniform sequences observed by different processors

Atomic memory accesses: memory updates are known to all processors at the same time
Non-atomic: having individual program orders that conform is not a sufficient condition for sequential
consistency
– Multiprocessor cannot be strongly ordered

Lamport’s Definition of Sequential Consistency


A multiprocessor system is sequentially consistent if the result of any execution is the same as
if the operations of all the processors were executed in some sequential order, and the
operations of each individual processor appear in this sequence in the order specified by its
program.

5.4.2 Sequential Consistency Model


• Sufficient conditions:
1. Before a load is allowed to perform wrt any other processor, all previous loads must be
globally performed and all previous stores must be performed wrt all processors
2. Before a store is allowed to perform wrt any other processor, all previous loads must be
globally performed and all previous stores must be performed wrt to all processors

Sequential Consistency Axioms


1. A load always returns the value written by the latest store to the same location

2. The memory order conforms to a total binary order in which shared memory is accessed in real
time over all loads/stores
3. If two operations appear in a particular program order, they appear in the same memory order
4. Swap op is atomic with respect to stores. No other store can intervene between load and store
parts of swap
5. All stores and swaps must eventually terminate

Implementation Considerations
• A single-port memory services one operation at a time
• The order in which the switch is thrown determines the global order of memory access operations
• Strong ordering preserves the program order in all processors
• Sequential consistency model leads to poor memory performance due to the imposed strong
ordering of memory events

5.4.3 Weak Consistency Models

• Multiprocessor model may range from strong (sequential) consistency to various degrees of
weak consistency
• Two models considered
– DSB (Dubois, Scheurich and Briggs) model
– TSO (Total Store Order) model

DSB Model
Dubois, Scheurich and Briggs have derived a weak consistency model by relating memory request
ordering to synchronization points in the program. We call this the DSB model specified by the
following 3 conditions:
1. All previous synchronization accesses must be performed, before a load or a store access is
allowed to perform wrt any other processor.
2. All previous load and store accesses must be performed, before a synchronization access is
allowed to perform wrt any other processor.

3. Synchronization accesses sequentially consistent with respect to one another

TSO Model
Sindhu, Frailong and Cekleov have specified the TSO weak consistency model with 6 behavioral
axioms.
1. Load returns latest store result
2. Memory order is a total binary relation over all pairs of store operations
3. If two stores appear in a particular program order, then they must also appear in the same
memory order
4. If a memory operation follows a load in program order, then it must also follow load in
memory order
5. A swap operation is atomic with respect to other stores – no other store can interleave between
load/store parts of swap
6. All stores and swaps must eventually terminate.

Chapter-6 Pipelining and Superscalar Techniques

6.1 Linear Pipeline Processors


A linear pipeline processor is a cascade of processing stages which are linearly connected to perform a
fixed function over a stream of data flowing from one end to the other.
In modern computers, linear pipelines are applied for instruction execution, arithmetic computation,
and memory-access operations.

6.1.1 Asynchronous and Synchronous Models


 A linear pipeline processor is constructed with k processing stages. External inputs(operands) are
fed into the pipeline at the first stage S1.
 The processed results are passed from stage Si to stage Si+1, for all i = 1, 2, …, k-1. The final result emerges from the pipeline at the last stage Sk.
 Depending on the control of data flow along the pipeline, we model linear pipelines in two
categories: Asynchronous and Synchronous.

Asynchronous Model
 As shown in the figure data flow between adjacent stages in an asynchronous pipeline is controlled
by a handshaking protocol.
 When stage Si is ready to transmit, it sends a ready signal to stage Si+1. After stage Si+1 receives the incoming data, it returns an acknowledge signal to Si.
 Asynchronous pipelines are useful in designing communication channels in message-passing multicomputers, where pipelined wormhole routing is practiced. Asynchronous pipelines may have a variable throughput rate.
 Different amounts of delay may be experienced in different stages.

Synchronous Model:
 Synchronous pipelines are illustrated in Fig. Clocked latches are used to interface between stages.
 The latches are made with master-slave flip-flops, which can isolate inputs from outputs.
 Upon the arrival of a clock pulse, all latches transfer data to the next stage simultaneously.
 The pipeline stages are combinational logic circuits. It is desired to have approximately equal
delays in all stages.
 These delays determine the clock period and thus the speed of the pipeline. Unless otherwise specified, only synchronous pipelines are studied.
The utilization pattern of successive stages in a synchronous pipeline is specified by a reservation
table.

 For a linear pipeline, the utilization follows the diagonal streamline pattern shown in Fig. 6.1c.
 This table is essentially a space-time diagram depicting the precedence relationship in using the
pipeline stages.
 Successive tasks or operations are initiated one per cycle to enter the pipeline. Once the pipeline is
filled up, one result emerges from the pipeline for each additional cycle.
 This throughput is sustained only if the successive tasks are independent of each other.

6.1.2 Clocking and Timing Control


The clock cycle τ of a pipeline is determined below. Let τi be the time delay of the circuitry in stage Si
and d the time delay of a latch, as shown in Fig 6.1b.

Clock Cycle and Throughput :


Denote the maximum stage delay as τm, and we can write τ as
τ = max{τi} + d = τm + d
 At the rising edge of the clock pulse, the data is latched to the master flip-flops of each latch
register. The clock pulse has a width equal to d.
 In general, τm >> d by one to two orders of magnitude.
 This implies that the maximum stage delay τm dominates the clock period. The pipeline frequency
is defined as the inverse of the clock period.
f=1/τ
 If one result is expected to come out of the pipeline per cycle, f represents the maximum
throughput of the pipeline.
 Depending on the initiation rate of successive tasks entering the pipeline, the actual throughput of
the pipeline may be lower than f.
 This is because more than one clock cycle has elapsed between successive task initiations.

Clock Skewing:

Ideally, we expect the clock pulses to arrive at all stages (latches) at the same time.
However, due to a problem known as clock skewing the same clock pulse may arrive at different
stages with a time offset of s.
 Let tmax be the time delay of the longest logic path within a stage
 tmin is the shortest logic path within a stage.
 To avoid a race in two successive stages, we must choose
τm >= tmax + s and d <= tmin - s
 These constraints translate into the following bounds on the clock period when clock skew takes
effect:
d + tmax + s <= τ <= τm + tmin - s
 In the ideal case s = 0, tmax = τm, and tmin = d. Thus, we have τ= τm + d
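A minimal sketch that checks these bounds for assumed stage delays, latch delay and skew (all values are hypothetical):

# Clock-period bounds in the presence of clock skew s (all delays in ns, assumed).
tau_m, t_max, t_min, d, s = 10.0, 10.0, 3.0, 1.0, 0.5

lower_bound = d + t_max + s        # the clock period tau must be at least this
upper_bound = tau_m + t_min - s    # and at most this
no_race     = d <= t_min - s       # latch constraint to avoid a race between stages

print(lower_bound, upper_bound, no_race)   # a feasible tau exists only if lower <= upper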

6.1.3 Speedup, Efficiency and Throughput of Pipeline


Ideally, a linear pipeline of k stages can process n tasks in k + (n — 1) clock cycles, where k cycles are
needed to complete the execution of the very first task and the remaining n-1 tasks require n - 1 cycles.

 Thus the total time required is
Tk = [k + (n - 1)] τ
 where τ is the clock period.


 Consider an equivalent-function nonpipelined processor which has a flow-through delay of kτ. The
amount of time it takes to execute n tasks on this nonpipelined processor is,
T1 = nkτ

Speedup Factor
The speedup factor of a k-stage pipeline over an equivalent nonpipelined processor is defined as
Sk = T1 / Tk = nkτ / ([k + (n - 1)]τ) = nk / [k + (n - 1)]
Efficiency and Throughput


The efficiency Ek of a linear k-stage pipeline is defined as
Ek = Sk / k = n / [k + (n - 1)]
The efficiency approaches 1 when n → ∞, and a lower bound on Ek is 1/k when n = 1.

The pipeline throughput Hk is defined as the number of tasks (operations) performed per unit time:
Hk = n / ([k + (n - 1)]τ) = Ek · f
The maximum throughput f occurs when Ek → 1 as n → ∞.
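A minimal Python sketch of these three measures (the values of k, n and τ are illustrative assumptions):

def pipeline_metrics(k, n, tau):
    # Speedup, efficiency and throughput of a k-stage linear pipeline over n tasks.
    Tk = (k + (n - 1)) * tau      # pipelined execution time
    T1 = n * k * tau              # equivalent nonpipelined execution time
    Sk = T1 / Tk                  # speedup  = nk / (k + n - 1)
    Ek = Sk / k                   # efficiency, approaches 1 as n grows
    Hk = n / Tk                   # throughput = Ek * f, where f = 1/tau
    return Sk, Ek, Hk

print(pipeline_metrics(k=4, n=64, tau=10e-9))   # ~ (3.82, 0.96, 9.55e7 tasks/s)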

6.2 Non Linear Pipeline Processors


 A dynamic pipeline can be reconfigured to perform variable functions at different times.
 The traditional linear pipelines are static pipelines because they are used to perform fixed
functions.
 A dynamic pipeline allows feed forward and feedback connections in addition to the streamline
connections.

6.2.1 Reservation and Latency analysis:


 In a static pipeline, it is easy to partition a given function into a sequence of linearly ordered
subfunctions.

 However, function partitioning in a dynamic pipeline becomes quite involved because the pipeline
stages are interconnected with loops in addition to streamline connections.
 A multifunction dynamic pipeline is shown in Fig 6.3a. This pipeline has three stages.
 Besides the streamline connections from S1 to S2 and from S2 to S3, there is a feed forward
connection from S1 to S3 and two feedback connections from S3 to S2 and from S3 to S1.
 These feed forward and feedback connections make the scheduling of successive events into the
pipeline a nontrivial task.
 With these connections, the output of the pipeline is not necessarily from the last stage.
 In fact, following different dataflow patterns, one can use the same pipeline to evaluate different
functions

Reservation Tables:
 The reservation table for a static linear pipeline is trivial in the sense that data flow follows a linear
streamline.
 The reservation table for a dynamic pipeline becomes more interesting because a nonlinear pattern
is followed.
 Given a pipeline configuration, multiple reservation tables can be generated for the evaluation of
different functions.
 Two reservation tables are given in Fig6.3b and 6.3c, corresponding to a function X and a function
Y, respectively.
 Each function evaluation is specified by one reservation table. A static pipeline is specified by a
single reservation table.

 A dynamic pipeline may be specified by more than one reservation table. Each reservation table
displays the time-space flow of data through the pipeline for one function evaluation.
 Different functions may follow different paths on the reservation table.
 A number of pipeline configurations may be represented by the same reservation table.
 There is a many-to-many mapping between various pipeline configurations and different
reservation tables.
 The number of columns in a reservation table is called the evaluation time of a given function.

Latency Analysis
 The number of time units (clock cycles) between two initiations of a pipeline is the latency
between them.
 Latency values must be non negative integers. A latency of k means that two initiations are
separated by k clock cycles.
 Any attempt by two or more initiations to use the same pipeline stage at the same time will cause a
collision.
 A collision implies resource conflicts between two initiations in the pipeline. Therefore, all

collisions must be avoided in scheduling a sequence of pipeline initiations.
 Some latencies will cause collisions, and some will not.
 Latencies that cause collisions are called forbidden latencies.

6.2.2 Collision Free Scheduling


 When scheduling events in a nonlinear pipeline, the main objective is to obtain the shortest average
latency between initiations without causing collisions.
 Collision Vector: By examining the reservation table, one can distinguish the set of permissible
latencies from the set of forbidden latencies.
 For a reservation table with n columns, the maximum forbidden latency is m ≤ n - 1. The permissible latency p should be as small as possible.
 The choice is made in the range 1 <= p <= m-1.
 A permissible latency of p = 1 corresponds to the ideal case. In theory, a latency of 1 can always be
achieved in a static pipeline which follows a linear (diagonal or streamlined) reservation table.
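A minimal Python sketch that derives the forbidden latencies and the collision vector from a reservation table; the table used here is an arbitrary 3-stage, 6-cycle example (an assumption, not the table of Fig. 6.3):

# Rows are stages, columns are clock cycles; 1 marks a cycle in which the stage is used.
table = [
    [1, 0, 0, 0, 0, 1],   # S1
    [0, 1, 1, 0, 0, 0],   # S2
    [0, 0, 0, 1, 1, 0],   # S3
]

def forbidden_latencies(table):
    # A latency p is forbidden if two initiations p cycles apart would use the same stage.
    forbidden = set()
    for row in table:
        used = [t for t, mark in enumerate(row) if mark]
        forbidden |= {abs(a - b) for a in used for b in used if a != b}
    return forbidden

def collision_vector(table):
    # Bit Ci = 1 if latency i is forbidden, for i = 1 .. n-1 (n = number of columns).
    n = len(table[0])
    forb = forbidden_latencies(table)
    return [1 if i in forb else 0 for i in range(1, n)]

print(sorted(forbidden_latencies(table)))   # [1, 5] for this table
print(collision_vector(table))              # [1, 0, 0, 0, 1], i.e. C = (C1 ... C5)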

6.3 Instruction Pipeline Design

6.3.2 Mechanisms for Instruction Pipelining

With internal data forwarding (store-load forwarding), the hardware can eliminate an unnecessary load operation (LD R2, M) that follows a store (ST M, R1) to the same location and replace it with the move operation (MOVE R2, R1).

Hazard Avoidance
• The read and write of shared variables by different instructions in a pipeline may lead to different
results if these instructions are executed out of order.
• As shown in Fig. 6.15, three types of logic hazards are possible:
• Consider two instructions I and J. Instruction J is assumed to logically follow instruction I
according to program order.

• If the actual execution order of these two instructions violates the program order, incorrect results may be read or written, thereby producing hazards.
Hazards should be prevented before these instructions enter the pipeline, such as by holding
instruction J until the dependence on instruction I is resolved.

• We use the notation D(I) and R(I) for the domain and range of an instruction I.
– Domain contains the Input Set to be used by instruction I
– Range contains the Output Set of instruction I

Listed below are conditions under which possible hazards can occur:

R(I) ∩ D(J) ≠ φ for RAW hazard (flow dependence)

R(I) ∩ R(J) ≠ φ for WAW hazard (output dependence)

D(I) ∩ R(J) ≠ φ for WAR hazard (antidependence)
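A minimal Python sketch of this check, using sets for the domain and range of the two instructions (the register names are illustrative):

def hazards(D_I, R_I, D_J, R_J):
    # J logically follows I in program order.
    return {
        "RAW (flow dependence)":   bool(R_I & D_J),   # J reads what I writes
        "WAW (output dependence)": bool(R_I & R_J),   # both write the same location
        "WAR (antidependence)":    bool(D_I & R_J),   # J writes what I reads
    }

# I: R1 <- R2 + R3 ;  J: R4 <- R1 + R5
print(hazards(D_I={"R2", "R3"}, R_I={"R1"},
              D_J={"R1", "R5"}, R_J={"R4"}))   # only the RAW entry is True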

6.3.4 Branch Handling Techniques

MODULE-IV

Chapter-7 Multiprocessors and Multicomputers

7.1 Multiprocessor system interconnect

 Parallel processing demands the use of efficient system interconnects for fast communication
among multiple processors and shared memory, I/O and peripheral devices.
 Hierarchical buses, crossbar switches and multistage networks are often used for this purpose.
 A generalized multiprocessor system is depicted in Fig. 7.1. This architecture combines features
from the UMA, NUMA and COMA models.

 Each processor Pi is attached to its own local memory and private cache.
 These multiple processors are connected to shared memory through an interprocessor memory network (IPMN).

 Processors share the access of I/O and peripheral devices through Processor-I/O Network (PION).
Both IPMN and PION are necessary in a shared-resource multiprocessor.
 An optional Interprocessor Communication Network (IPCN) can permit processor communication
without using shared memory.

Network Characteristics
The networks are designed with many choices, such as timing, switching and control strategy; for example, in a dynamic network the multiprocessor interconnections are under program control.

Timing
 Synchronous – controlled by a global clock which synchronizes all network activity.
 Asynchronous – use handshaking or interlock mechanisms for communication and
especially suitable for coordinating devices with different speed.

Switching Method
 Circuit switching – a pair of communicating devices control the path for the entire duration of data transfer
 Packet switching – large data transfers are broken into smaller pieces, each of which can compete for use of the path

Network Control
 Centralized – global controller receives and acts on requests
 Distributed – requests handled by local devices independently

7.1.1 Hierarchical Bus Systems

 A bus system consists of a hierarchy of buses connecting various system and subsystem
components in a computer.
 Each bus is formed with a number of signal, control, and power lines. Different buses are used to
perform different interconnection functions.
 In general, the hierarchy of bus systems are packaged at different levels as depicted in Fig. 7.2,
including local buses on boards, backplane buses, and I/O buses.

 Local Bus: Buses implemented on printed-circuit boards are called local buses.
 On a processor board one often finds a local bus which provides a common communication path
among major components (chips) mounted on the board.
 A memory board uses a memory bus to connect the memory with the interface logic.
 An I/O board or network interface board uses a data bus. Each of these board buses consists of
signal and utility lines.

Backplane Bus
A backplane is a printed circuit on which many connectors are used to plug in functional boards. A system bus, consisting of shared signal paths and utility lines, is built on the backplane. This system bus provides a common communication path among all plug-in boards.

I/O Bus
Input/output devices are connected to a computer system through an I/O bus such as the SCSI (Small Computer Systems Interface) bus.
This bus is made of coaxial cables with taps connecting disks, printers and other devices to a processor through an I/O controller.
Special interface logic is used to connect various board types to the backplane bus.

Hierarchical Buses and Caches


This is a multilevel tree structure in which the leaf nodes are processors and their private caches
(denoted Pj and C1j in Fig. 7.3). These are divided into several clusters, each of which is connected
through a cluster bus.
An intercluster bus is used to provide communications among the clusters. Second level caches
(denoted as C2i) are used between each cluster bus and the intercluster bus. Each second level cache

must have a capacity that is at least an order of magnitude larger than the sum of the capacities of all
first-level caches connected beneath it.

 Each single cluster operates on a single-bus system. Snoopy bus coherence protocols can be used to
establish consistency among first level caches belonging to the same cluster.
 Second level caches are used to extend consistency from each local cluster to the upper level.
 The upper level caches form another level of shared memory between each cluster and the main
memory modules connected to the intercluster bus.
 Most memory requests should be satisfied at the lower level caches.
 Intercluster cache coherence is controlled among the second-level caches and the resulting effects
are passed to the lower level.

7.1.2 Crossbar Switch and Multiport Memory


Single stage networks are sometimes called recirculating networks because data items may have to pass
through the single stage many times. The crossbar switch and the multiported memory organization are
both single-stage networks.

Even if two processors attempt to access the same memory module (or I/O device) at the same time, only one of the requests is serviced at a time.

Multistage Networks
Multistage networks consist of multiple stages of switch boxes, and should be able to connect any input to any output.
A multistage network is called blocking if the simultaneous connections of some multiple input/output
pairs may result in conflicts in the use of switches or communication links.
A nonblocking multistage network can perform all possible connections between inputs and outputs by
rearranging its connections.

Crossbar Networks
Crossbar networks connect every input to every output through a crosspoint switch. A crossbar network is a
single stage, non-blocking permutation network.
In an n-processor, m-memory system, an n × m crossbar network is used. Each crosspoint in the crossbar is a unary switch which can be open or closed, providing a point-to-point connection path between the processor and a memory module.

Crosspoint Switch Design


Out of n crosspoint switches in each column of an n * m crossbar mesh, only one can be connected at a
time.

Crosspoint switches must be designed to handle the potential contention for each memory module. A crossbar switch avoids competition for bandwidth by using O(N²) switches to connect N inputs to N outputs.
Although highly non-scalable, crossbar switches are a popular mechanism for connecting a small number
of workstations, typically 20 or fewer.

Each processor provides a request line, a read/write line, a set of address lines, and a set of data lines to a
crosspoint switch for a single column. The crosspoint switch eventually responds with an
acknowledgement when the access has been completed.

Multiport Memory
Since crossbar switches are expensive and not suitable for systems with many processors or memory
modules, multiport memory modules may be used instead.
A multiport memory module has multiple connection points for processors (or I/O devices), and the
memory controller in the module handles the arbitration and switching that might otherwise have been
accomplished by a crosspoint switch.

A two-function switch box can assume only two possible states, namely straight or exchange. A four-function switch box can be in any of four possible states. A multistage network is capable of connecting any input terminal to any output terminal. Multistage networks are basically constructed from so-called shuffle-exchange switching elements, each of which is a 2 × 2 crossbar; multiple layers of these elements are connected to form the network.

7.1.3 Multistage and Combining Networks


Multistage networks are used to build larger multiprocessor systems. We describe two multistage
networks, the Omega network and the Butterfly network, that have been built into commercial
machines.
Routing in Omega Networks
An 8-input Omega network is shown in Fig. 7.8.
In general, an n-input Omega network has log2n stages. The stages are labeled from 0 to log2n — 1
from the input end to the output end.

Data routing is controlled by inspecting the destination code in binary. When the ith high-order bit of
the destination code is a 0, a 2 x 2 switch at stage i connects the input to the upper output.
Otherwise, the input is directed to the lower output.


 Two switch settings are shown in Figs. 7.8a and b with respect to permutations Π1 = (0,7,6,4,2)
(1,3)(5) and Π2= (0,6,4,7,3) (1,5)(2), respectively.
 The switch settings in Fig. 7.8a are for the implementation of Π1, which maps 0→7, 7→6, 6→4, 4→2, 2→0, 1→3, 3→1, 5→5.
 Consider the routing of a message from input 001 to output 011. This involves the use of switches
A, B, and C. Since the most significant bit of the destination 011 is a "zero," switch A must be set
straight so that the input 001 is connected to the upper output (labeled 2).

 The middle bit in 011 is a "one," thus input 4 to switch B is connected to the lower output with a
"crossover" connection.
 The least significant bit in 011 is a "one," implying a flat connection in switch C.
 Similarly, the switches A, E, and D are set for routing a message from input 101 to output 101.
There exists no conflict in all the switch settings needed to implement the permutation Π1 in Fig.
7.8a.

 Now consider implementing the permutation Π2 in the 8-input Omega network (Fig. 7.8b). Conflicts in switch settings do exist in three switches identified as F, G, and H. The conflicts occurring at F are caused by the desired routings 000→110 and 100→111.
 Since both destination addresses have a leading bit 1, both inputs to switch F must be connected to
the lower output.
 To resolve the conflicts, one request must be blocked.
 Similarly we see conflicts at switch G between 011→000 and 111→011, and at switch H between 101→001 and 011→000. At switches I and J, broadcast is used from one input to two outputs,

which is allowed if the hardware is built to have four legitimate states as shown in fig. 2.24a.
 The above example indicates the fact that not all permutations can be implemented in one pass
through the Omega network.
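A minimal Python sketch of the destination-tag rule used above (a 0 bit selects the upper output, a 1 bit the lower):

def omega_route(dest, n):
    # n-input Omega network, n a power of 2, log2(n) stages of 2x2 switches.
    stages = n.bit_length() - 1
    bits = format(dest, "0{}b".format(stages))    # destination code in binary
    return ["upper" if b == "0" else "lower" for b in bits]

print(omega_route(0b011, 8))   # ['upper', 'lower', 'lower'] for destination 011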

Routing in Butterfly Networks


 This class of networks is constructed with crossbar switches as building blocks. Fig. 7.10 shows
two Butterfly networks of different sizes.
 Fig. 7.10a shows a 64-input Butterfly network built with two stages (2 = log8 64) of 8×8 crossbar switches.
 The eight-way shuffle function is used to establish the interstage connections between stage 0 and
stage 1.
 In Fig. 7.10b, a three-stage Butterfly network is constructed for 512 inputs, again with 8X8
crossbar switches.
 Each of the 64X64 boxes in Fig. 7.10b is identical to the two-stage Butterfly network in Fig. 7.10a.


 In total, sixteen 8x8 crossbar switches are used in Fig. 7.10a and 16 x 8+8 x 8 = 192 are used in
Fig. 7.10b. Larger Butterfly networks can be modularly constructed using more stages.
 Note that no broadcast connections are allowed in a Butterfly network, making these networks a
restricted subclass of Omega networks.

The Hot-Spot Problem


 When the network traffic is nonuniform, a hot spot may appear corresponding to a certain memory
module being excessively accessed by many processors at the same time.
 For example, a semaphore variable being used as a synchronization barrier may become a hot spot
since it is shared by many processors.
 Hot spots may degrade the network performance significantly. In the NYU Ultracomputer and the
IBM RP3 multiprocessor, a combining mechanism has been added to the Omega network.
 The purpose was to combine multiple requests heading for the same destination at switch points
where conflicts are taking place.
 An atomic read-modify-write primitive Fetch&Add(x,e), has been developed to perform parallel
memory updates using the combining network.

Fectch&Add
 This atomic memory operation is effective in implementing an N-way synchronization with a
complexity independent of N.

 In a Fetch&Add(x, e) operation, x is an integer variable in shared memory and e is an integer increment.
 When a single processor executes this operation, the semantics is

Fetch&Add(x, e)
{ temp ← x;
x ← temp + e;      (7.1)
return temp }

 When N processes attempt to Fetch&Add(x, e) the same memory word simultaneously, the
memory is updated only once following a serialization principle.
 The sum of the N increments, e1 + e2 + • • • + eN, is produced in any arbitrary serialization of the N
requests.
 This sum is added to the memory word x, resulting in a new value x + e1 + e2 + • • • + eN
 The values returned to the N requests are all unique, depending on the serialization order followed.
 The net result is similar to a sequential execution of N Fetch&Adds but is performed in one
indivisible operation.
 Two simultaneous requests are combined in a switch as illustrated in Fig. 7.11.

One of the following operations will be performed if processor P1 executes Ans1 ← Fetch&Add(x, e1) and P2 executes Ans2 ← Fetch&Add(x, e2) simultaneously on the shared variable x.
If the request from P1 is executed ahead of that from P2, the following values are returned:
Ans1 ← x
Ans2 ← x + e1      (7.2)

If the execution order is reversed, the following values are returned:
Ans1 ← x + e2
Ans2 ← x
Regardless of the executing order, the value x+ e1 + e2 is stored in memory.
It is the responsibility of the switch box to form the sum e1+ e2, transmit the combined request
Fetch&Add(x, e1 + e2), store the value e1 (or e2) in a wait buffer of the switch and return the values x
and x+ e1 to satisfy the original requests Fetch&Add(x, e1) and Fetch&Add(x, e2) respectively, as
shown in fig. 7.11 in four steps.
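A minimal Python sketch of this combining behaviour (the initial value of x is an arbitrary assumption):

x = 10    # shared variable

def fetch_and_add(increment):
    # Atomic read-modify-write: the memory word is updated only once.
    global x
    old = x
    x = old + increment
    return old

def combined(e1, e2):
    # The switch forwards one combined request Fetch&Add(x, e1 + e2) and
    # serializes P1 ahead of P2: Ans1 = x, Ans2 = x + e1.
    base = fetch_and_add(e1 + e2)
    return base, base + e1

print(combined(3, 4))   # (10, 13)
print(x)                # 17 = 10 + 3 + 4, as if the two Fetch&Adds ran one after the other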

7.2 Cache Coherence and Synchronization Mechanisms


Cache Coherence Problem:

 In a memory hierarchy for a multiprocessor system, data inconsistency may occur between
adjacent levels or within the same level.
 For example, the cache and main memory may contain inconsistent copies of the same data object.
 Multiple caches may possess different copies of the same memory block because multiple
processors operate asynchronously and independently.
 Caches in a multiprocessing environment introduce the cache coherence problem. When multiple
processors maintain locally cached copies of a unique shared-memory location, any local
modification of the location can result in a globally inconsistent view of memory.
 Cache coherence schemes prevent this problem by maintaining a uniform state for each cached
block of data.
 Cache inconsistencies caused by data sharing, process migration or I/O are explained below.

Inconsistency in Data sharing:
The cache inconsistency problem occurs only when multiple private caches are used.
In general, three sources of the problem are identified:
 sharing of writable data,
 process migration
 I/O activity.

• Consider a multiprocessor with two processors, each using a private cache and both sharing the
main memory.
• Let X be a shared data element which has been referenced by both processors. Before update, the
three copies of X are consistent.
• If processor P writes new data X’ into the cache, the same copy will be written immediately into
the shared memory under a write through policy.
• In this case. inconsistency occurs between the two copies (X and X') in the two caches.
• On the other hand, inconsistency may also occur when a write back policy is used, as shown on the
right.
• The main memory will be eventually updated when the modified data in the cache are replaced or
invalidated.

Process Migration and I/O


The figure shows the occurrence of inconsistency after a process containing a shared variable X
migrates from processor 1 to processor 2 using the write-back cache on the right. In the middle, a
process migrates from processor 2 to processor1 when using write-through caches.

In both cases, inconsistency appears between the two cache copies, labeled X and X’. Special
precautions must be exercised to avoid such inconsistencies. A coherence protocol must be established
before processes can safely migrate from one processor to another.

Two Protocol Approaches for Cache Coherence


• Many of the early commercially available multiprocessors used bus-based memory systems.
• A bus is a convenient device for ensuring cache coherence because it allows all processors in the
system to observe ongoing memory transactions.
• If a bus transaction threatens the consistent state of a locally cached object, the cache controller can
take appropriate actions to invalidate the local copy.
• Protocols using this mechanism to ensure coherence are called snoopy protocols because each
cache snoops on the transactions of other caches.

• On the other hand, scalable multiprocessor systems interconnect processors using short point-to-
point links in direct or multistage networks.
• Unlike the situation in buses, the bandwidth of these networks increases as more processors are
added to the system.
• However, such networks do not have a convenient snooping mechanism and do not provide an
efficient broadcast capability. In such systems, the cache coherence problem can be solved using
some variant of directory schemes.

Protocol Approaches for Cache Coherence:


1. Snoopy Bus Protocol
2. Directory Based Protocol

1. Snoopy Bus Protocol


 Snoopy protocols achieve data consistency among the caches and shared memory through a bus
watching mechanism.
 In the following diagram, two snoopy bus protocols create different results. Consider 3 processors

(P1, P2, Pn) maintaining consistent copies of block X in their local caches (Fig. 7.14a) and in the
shared memory module marked X.

 Using a write-invalidate protocol, the processor P1 modifies (writes) its cache from X to X’, and all
other copies are invalidated via the bus (denoted I in Fig. 7.14b). Invalidated blocks are called
dirty, meaning they should not be used.
 The write-update protocol (Fig. 7.14c) demands the new block content X’ be broadcast to all cache
copies via the bus.
 The memory copy also updated if write through caches are used. In using write-back caches, the
memory copy is updated later at block replacement time.

Write Through Caches:


• The states of a cache block copy change with respect to read, write and replacement operations in the cache. Fig. 7.15 shows the state transitions for two basic write-invalidate snoopy protocols developed for write-through and write-back caches, respectively.
• A block copy of a write-through cache attached to processor i can assume one of two possible cache states: valid or invalid.

• A remote processor is denoted j, where j ≠ i. For each of the two cache states, six possible events may take place.
• Note that all cache copies of the same block use the same transition graph in making state changes.
• In a valid state (Fig. 7.15a), all processors can read (R(i), R(j)) safely. Local processor i can also
write(W(i)) safely in a valid state. The invalid state corresponds to the case of the block either
being invalidated or being replaced (Z(i) or Z(j)).

Write Back Caches:


• The valid state of a write-back cache can be further split into two cache states, Labeled RW(read-
write) and RO(read-only) as shown in Fig.7.15b.
• The INV (invalidated or not-in-cache) cache state is equivalent to the invalid state mentioned
before. This three-state coherence scheme corresponds to an ownership protocol.
• When the memory owns a block, caches can contain only the RO copies of the block. In other
words, multiple copies may exist in the RO state and every processor having a copy (called a
keeper of the copy) can read (R(i),R(j)) safely.
• The Inv state is entered whenever a remote processor writes (W(j)) its local copy or the local

processor replaces (Z(i)) its own block copy.
The RW state corresponds to only one cache copy existing in the entire system owned by the local
processor i.
• Read (R(i)) and write(W(i)) can be safely performed in the RW state. From either the RO state or
the INV state, the cache block becomes uniquely owned when a local write (W(i)) takes place.

2. Directory Based Protocol

7.4 Message – Passing Mechanisms

Two Message Passing mechanisms are:
1. Store and Forward Routing
2. Wormhole Routing
1. Store and Forward Routing

 Packets are the basic unit of information flow in a store-and-forward network.


 Each node is required to use a packet buffer.

 A packet is transmitted from a source node to a destination node through a sequence of


intermediate nodes.
 When a packet reaches an intermediate node, it is first stored in the buffer.
 Then it is forwarded to the next node if the desired output channel and a packet buffer in the
receiving node are both available.

2. Wormhole Routing

 Packets are subdivided into smaller flits. Flit buffers are used in the hardware routers attached to
nodes.
 The transmission from the source node to the destination node is done through a sequence of
routers.
 All the flits in the same packet are transmitted in order as inseparable companions in a pipelined
fashion.
 Only the header flit knows where the packet is going.
 All the data flits must follow the header flit.
 Flits from different packets cannot be mixed up. Otherwise they may be towed to the wrong
destination.

Asynchronous Pipelining

 The pipelining of successive flits in a packet is done asynchronously using a handshaking protocol
as shown in Fig. 7.28. Along the path, a 1-bit ready/request (R/A) line is used between adjacent
routers.


 When the receiving router (D) is ready (Fig. 7.28a) to receive a flit (i.e., a flit buffer is available),
   it pulls the R/A line low. When the sending router (S) is ready (Fig. 7.28b), it raises the line high
   and transmits flit i through the channel.
 While the flit is being received by D (Fig. 7.28c), the R/A line is kept high. After flit i is removed
   from D's buffer (i.e., transmitted to the next node) (Fig. 7.28d), the cycle repeats itself for the
   transmission of the next flit i+1 until the entire packet is transmitted.
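
The four handshake steps (a)-(d) can be mimicked with a toy Python routine. This is only a sketch of
the protocol's logic; the flit names and the representation of the R/A line are illustrative assumptions.

# Toy sketch (assumption): one flit transfer per handshake cycle over the R/A line.

def transfer_flits(flits):
    ra_line = "low"          # (a) D is ready: a flit buffer is available, R/A pulled low
    delivered = []
    for flit in flits:
        assert ra_line == "low"      # S may only transmit when D has signalled readiness
        ra_line = "high"             # (b) S raises the line and transmits the flit
        d_buffer = flit              # (c) D receives the flit; R/A stays high meanwhile
        delivered.append(d_buffer)   # (d) flit forwarded to the next node, buffer freed...
        ra_line = "low"              # ...and the cycle repeats for flit i+1
    return delivered

print(transfer_flits(["header", "data-1", "data-2"]))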


Advantages:
 Very efficient
 Faster clock

Latency Analysis:


 The communication latency in store-and-forward networks is directly proportional to the distance
   (the number of hops) between the source and the destination.
T_SF = (L/W) (D + 1)

 Wormhole routing has a latency almost independent of the distance between the source and the
   destination:

T_WH = L/W + (F/W) D

where   L: packet length (in bits)
        W: channel bandwidth (in bits per second)
        D: distance (number of nodes traversed minus 1)
        F: flit length (in bits)
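
Plugging illustrative numbers into the two formulas shows why wormhole routing scales so much better
with distance. All parameter values in the sketch below are made up for the example; only the
formulas come from the text above.

# Sketch: compare store-and-forward and wormhole latencies for sample values.
# All parameter values below are chosen only for illustration.

L = 1024.0   # packet length in bits
F = 32.0     # flit length in bits
W = 1e6      # channel bandwidth in bits per second

def t_sf(distance):
    # Store-and-forward: the whole packet is retransmitted on every hop.
    return (L / W) * (distance + 1)

def t_wh(distance):
    # Wormhole: only the header pays the per-hop delay; the rest pipelines behind it.
    return L / W + (F / W) * distance

for D in (1, 4, 16):
    print(f"D={D:2d}  T_SF={t_sf(D)*1e3:7.3f} ms  T_WH={t_wh(D)*1e3:7.3f} ms")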

7.4.2 Deadlock and Virtual Channels

The communication channels between nodes in a wormhole-routed multicomputer network are actually
shared by many possible source and destination pairs. The sharing of a physical channel leads to the
concept of virtual channels.

Virtual channels

 A virtual channel is logical link between two nodes. It is formed by a flit buffer in the source node,
a physical channel between them and a flit buffer in the receiver node.
 Four flit buffers are used at the source node and receiver node respectively. One source buffer is
paired with one receiver buffer to form a virtual channel when the physical channel is allocated for
the pair.
 Thus the physical channel is time shared by all the virtual channels. By adding the virtual channel
the channel dependence graph can be modified and one can break the deadlock cycle.
 Here the cycle can be converted to spiral thus avoiding a deadlock. Virtual channel can be
implemented with either unidirectional channel or bidirectional channels.
 However a special arbitration line is needed between adjacent nodes interconnected by
bidirectional channel. This line determines the direction of information flow.
 The virtual channel may reduce the effective channel bandwidth available to each request.


 There exists a tradeoff between network throughput and communication latency in determining the
degree of using virtual channels.

Deadlock Avoidance

By adding two virtual channels, V3 and V4 in Fig. 7.32c, one can break the deadlock cycle. A modified
channel-dependence graph is obtained by using the virtual channels V3 and V4, after the use of channel
C2, instead of reusing C3 and C4.

The cycle in Fig. 7.32b is being converted to a spiral, thus avoiding a deadlock. Channel multiplexing
can be done at the flit level or at the packet level if the packet length is sufficiently short.

Virtual channels can be implemented with either unidirectional channels or bidirectional channels.
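
The effect of adding virtual channels can be checked mechanically by testing the channel-dependence
graph for cycles. The Python sketch below uses a generic four-channel cycle in the spirit of Fig.
7.32; the exact edge set is an assumption for illustration, not a reproduction of the figure.

# Sketch: detect a cycle in a channel-dependence graph (assumed edges, for illustration).

def has_cycle(graph):
    # DFS with colors: 0 = unvisited, 1 = on the current path, 2 = finished.
    color = {node: 0 for node in graph}
    def dfs(node):
        color[node] = 1
        for nxt in graph.get(node, []):
            if color[nxt] == 1:
                return True
            if color[nxt] == 0 and dfs(nxt):
                return True
        color[node] = 2
        return False
    return any(color[n] == 0 and dfs(n) for n in graph)

# Original dependence cycle: C1 -> C2 -> C3 -> C4 -> C1 (deadlock possible).
cyclic = {"C1": ["C2"], "C2": ["C3"], "C3": ["C4"], "C4": ["C1"]}

# After adding virtual channels: traffic leaving C2 uses V3 and V4 instead of
# reusing C3 and C4, turning the cycle into a spiral.
acyclic = {"C1": ["C2"], "C2": ["V3"], "V3": ["V4"], "V4": [], "C3": ["C4"], "C4": ["C1"]}

print(has_cycle(cyclic))    # True  -> deadlock cycle exists
print(has_cycle(acyclic))   # False -> cycle broken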

Chapter-8 Multivector and SIMD Computers

8.1 Vector Processing Principles

Vector Processing Definitions

Vector: A vector is a set of scalar data items, all of the same type, stored in memory. Usually, the
vector elements are ordered to have a fixed addressing increment between successive elements called
the stride.

Vector Processor: A vector processor is an ensemble of hardware resources, including vector


registers, functional pipelines, processing elements, and register counters, for performing vector
operations.

Vector Processing: Vector processing occurs when arithmetic or logical operations are applied to
vectors. It is distinguished from scalar processing which operates on one or one pair of data.
Vector processing is faster and more efficient than scalar processing.


Vectorization: The conversion from scalar code to vector code is called vectorization.

Vectorizing Compiler: A compiler capable of vectorization is called a Vectorizing Compiler


(vectorizer).

8.1.1 Vector Instruction Types


There are six types of vector instructions. These are defined by mathematical mappings between their
working registers or memory where vector operands are stored.

1. Vector - Vector instructions


2. Vector - Scalar instructions
3. Vector - Memory instructions
4. Vector reduction instructions
5. Gather and scatter instructions
6. Masking instructions


1. Vector - Vector instructions: One or two vector operands are fetched from the respective
vector registers, enter through a functional pipeline unit, and produce results in another vector
register.

F1: Vi → Vj
F2: Vi × Vj → Vk

Examples: V1 = sin(V2), V3 = V1 + V2


2. Vector - Scalar instructions: Elements of a vector register are combined with a scalar value.

F3: s × Vi → Vj

Examples: V2 = 6 + V1

3. Vector - Memory instructions: This corresponds to load-store operations between the vector
registers (V) and the memory (M).

F4: M → V (Vector Load)
F5: V → M (Vector Store)

Examples: V2 = Y (vector load), X = V1 (vector store)
4. Vector reduction instructions: include maximum, minimum, sum, and mean value.

F6: Vi → s
F7: Vi × Vj → s

5. Gather and scatter instructions: Two vector registers are used to gather or scatter
vector elements randomly throughout the memory, corresponding to the following mappings:

F8: M → Vi × Vj (Gather)
F9: Vi × Vj → M (Scatter)

Gather is an operation that fetches from memory the nonzero elements of a sparse
vector using indices.
Scatter does the opposite, storing into memory a sparse vector whose nonzero
entries are indexed.

6. Masking instructions: The mask vector is used to compress or to expand a vector to a shorter
or longer index vector (bit-per-index correspondence).

F10: Vi × Vm → Vj (Vm is a binary vector)

 The gather, scatter, and masking instructions are very useful in handling sparse vectors or sparse
matrices often encountered in practical vector processing applications.
 Sparse matrices are those in which most of the entries are zeros.
 Advanced vector processors implement these instructions directly in hardware.
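
A minimal Python sketch of what the gather (F8), scatter (F9), and masking (F10) mappings compute is
given below. The function names and data are ours, chosen only to illustrate the index-driven
behaviour; no hardware semantics are implied.

# Sketch: software analogues of the F8 (gather), F9 (scatter), and F10 (masking) mappings.

def gather(memory, index_vector):
    # Fetch the elements of a sparse vector from memory using an index vector.
    return [memory[i] for i in index_vector]

def scatter(memory, index_vector, value_vector):
    # Store the elements of a dense value vector into the indexed memory locations.
    for i, v in zip(index_vector, value_vector):
        memory[i] = v

def compress(vector, mask):
    # Masking: compress a vector to a shorter one using a binary mask (bit per element).
    return [x for x, m in zip(vector, mask) if m]

memory = [0.0, 4.0, 0.0, 0.0, 7.0, 0.0, 9.0, 0.0]
idx = [1, 4, 6]                      # positions of the nonzero entries

v1 = gather(memory, idx)             # [4.0, 7.0, 9.0]
scatter(memory, idx, [x * 2 for x in v1])
print(memory)                        # nonzero entries doubled in place

print(compress([5, 0, 3, 0, 8], [1, 0, 1, 0, 1]))   # [5, 3, 8]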


8.1.2 Vector-Access Memory Schemes


The flow of vector operands between the main memory and vector registers is usually pipelined with
multiple access paths.
Vector Operand Specifications
 Vector operands may have arbitrary length.
 Vector elements are not necessarily stored in contiguous memory locations.
 To access a vector in memory, one must specify its base, stride, and length.
 Since each vector register has fixed length, only a segment of the vector can be loaded into a vector
register.
 Vector operands should be stored in memory to allow pipelined and parallel access. Access itself
should be pipelined.

Three types of Vector-access memory organization schemes


1. C-Access memory organization

The m-way low-order interleaved memory structure allows m words to be accessed concurrently and
overlapped.
The access cycles in different memory modules are staggered. The low-order a bits select the
modules, and the high-order b bits select the word within each module, where m = 2^a and a + b = n
is the address length.

 To access a vector with a stride of 1, successive addresses are latched in the address buffer at the
   rate of one per cycle. Effectively it takes m minor cycles to fetch m words, which equals one
   (major) memory cycle, as shown in Fig. 5.16b.
 If the stride is 2, the successive accesses must be separated by two minor cycles in order to avoid
   access conflicts. This reduces the memory throughput by one-half.
 If the stride is 3, there is no module conflict and the maximum throughput (m words) results.
 In general, C-access will yield the maximum throughput of m words per memory cycle if the stride
is relatively prime to m, the number of interleaved memory modules.
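
The stride/conflict behaviour of C-access can be verified with a few lines of Python. The choice
m = 8 below is arbitrary; the point is that the number of distinct modules touched, and hence the
throughput, depends on gcd(stride, m).

from math import gcd

# Sketch: low-order interleaving sends address a to module a mod m.
# A stride hits all m modules (full throughput) only when gcd(stride, m) == 1.

m = 8   # number of interleaved modules (m = 2^a)

def modules_touched(stride, accesses=64):
    return len({(i * stride) % m for i in range(accesses)})

for stride in (1, 2, 3, 4):
    distinct = modules_touched(stride)
    note = "max throughput" if gcd(stride, m) == 1 else "module conflicts"
    print(f"stride {stride}: {distinct} of {m} modules used -> {note}")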

2. S-Access memory organization


All memory modules are accessed simultaneously in a synchronized manner. The high order (n-a)
bits select the same offset word from each module.

At the end of each memory cycle (Fig. 8.3b), m = 2^a consecutive words are latched. If the stride is
greater than 1, then the throughput decreases, roughly proportionally to the stride.

3. C/S-Access memory organization


 Here C-access and S-access are combined.
 n access buses are used with m interleaved memory modules attached to each bus.


 The m modules on each bus are m-way interleaved to allow C-access.


 In each memory cycle, at most m·n words are fetched if the n buses are fully used with
   pipelined memory accesses.
 The C/S-access memory is suitable for use in vector multiprocessor configurations. It provides
   parallel pipelined access of a vector data set with high bandwidth.
 A special vector cache design is needed within each processor in order to guarantee smooth data
movement between the memory and multiple vector processors.

8.3 Compound Vector Processing


A compound vector function (CVF) is defined as a composite function of vector operations
converted from a looping structure of linked scalar operations.

Do 10 I=1,N
Load R1, X(I)
Load R2, Y(I)
Multiply R1, S
Add R2, R1
Store Y(I), R2
10 Continue
where X(I) and Y(I), I=1, 2,…. N, are two source vectors originally residing in the memory. After the
computation, the resulting vector is stored back to the memory. S is an immediate constant supplied to
the multiply instruction.


After vectorization, the above scalar SAXPY code is converted to a sequence of five vector
instructions:
M(x : x + N-1) → V1        Vector Load
M(y : y + N-1) → V2        Vector Load
S × V1 → V1                Vector Multiply
V2 + V1 → V2               Vector Add
V2 → M(y : y + N-1)        Vector Store

x and y are the starting memory addresses of the X and Y vectors, respectively; V1 and V2 are two
N-element vector registers in the vector processor.

CVF: Y(1:N) = S × X(1:N) + Y(1:N), or Y(I) = S × X(I) + Y(I)


where Index I implies that all vector operations involve N elements.
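
For reference, the same SAXPY computation written directly in Python. This is a scalar description
of what the five vector instructions above accomplish, not a model of the vector hardware.

# Sketch: the SAXPY compound vector function Y(1:N) = S * X(1:N) + Y(1:N).

def saxpy(s, x, y):
    # Conceptually: vector load X, vector load Y, vector multiply by the scalar s,
    # vector add, vector store back into Y.
    return [s * xi + yi for xi, yi in zip(x, y)]

X = [1.0, 2.0, 3.0, 4.0]
Y = [10.0, 20.0, 30.0, 40.0]
S = 0.5
print(saxpy(S, X, Y))   # [10.5, 21.0, 31.5, 42.0]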

 Typical CVFs for one-dimensional arrays are load, store, multiply, divide, logical and
   shifting operations.
 The number of available vector registers and functional pipelines impose some restrictions on how
many CVFs can be executed simultaneously.
 Chaining:
Chaining is an extension of technique of internal data forwarding practiced in scalar processors.
Chaining is limited by the small number of functional pipelines available in a vector processor.

 Strip-mining:
When a vector has a length greater than that of the vector registers, segmentation of the long vector
into fixed-length segments is necessary. One vector segment is processed at a time (in Cray
computers segment is 64 elements).
 Recurrence:
The special case of vector loops in which the output of a functional pipeline may feed back into
one of its own source vector registers
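
Strip-mining can be sketched as an outer loop over fixed-length segments. The 64-element segment
length follows the Cray example mentioned above; everything else in the sketch is illustrative.

# Sketch: strip-mining a long vector into fixed-length segments that fit the
# vector registers, processing one segment (here 64 elements) at a time.

SEGMENT = 64   # vector register length, as in the Cray example above

def strip_mined_saxpy(s, x, y):
    result = []
    for start in range(0, len(x), SEGMENT):
        end = start + SEGMENT
        # Each pass corresponds to one load/multiply/add/store of a segment.
        result.extend(s * xi + yi for xi, yi in zip(x[start:end], y[start:end]))
    return result

N = 200   # longer than one vector register, so four segments are needed
print(len(strip_mined_saxpy(2.0, [1.0] * N, [1.0] * N)))   # 200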


8.4 SIMD Computer Organizations

SIMD Implementation Models (two models for constructing SIMD supercomputers)
SIMD models are differentiated based on the memory distribution and the addressing scheme used.
Most SIMD computers use a single control unit and distributed memories, except for a few that use
associative memories.
1. Distributed memory model

 Spatial parallelism is exploited among the PEs.
 A distributed memory SIMD consists of an array of PEs (supplied with local memory) which are
controlled by the array control unit.
 Program and data are loaded into the control memory through the host computer and distributed
from there to PEs local memories.
 An instruction is sent to the control unit for decoding. If it is a scalar or program control operation,
it will be directly executed by a scalar processor attached to the control unit.
 If the decoded instruction is a vector operation, it will be broadcast to all the PEs for parallel
execution.
 Partitioned data sets are distributed to all the local memories attached to the PEs through a vector
   data bus.


 PEs are interconnected by a data routing network which performs inter-PE data communications
such as shifting, permutation and other routing operations.

2. Shared Memory Model


 An alignment network is used as the inter-PE memory communication network. This network is
   controlled by the control unit.
 The alignment network must be properly set to avoid access conflicts.
 Figure below shows a variation of the SIMD computer using shared memory among the PEs.
 Most SIMD computers were built with distributed memories.

8.4.2 CM-2 Architecture
The Connection Machine CM-2 produced by Thinking Machines Corporation was a fine-grain MPP
computer using thousands of bit-slice PEs in parallel to achieve a peak processing speed of above 10
Gflops.
Program Execution Paradigm
All programs started execution on a front-end, which issued microinstructions to the back-end
processing array when data-parallel operations were desired. The sequencer broke down these
microinstructions and broadcast them to all data processors in the array.


Data sets and results could be exchanged between the front-end and the processing array in one of
three ways as shown in the figure:

 Broadcasting: Broadcasting was carried out through the broadcast bus to all data
processors at once.
 Global combining: Global combining allowed the front-end to obtain the sum, largest
value, logical OR etc, of values one from each processor.
 Scalar memory bus: Scalar bus allowed the front-end to read or to write one 32-bit value
at a time from or to the memories attached to the data processors.
Processing Array
The processing array contained from 4K to 64K bit-slice data processors(PEs), all of which were
controlled by a sequencer.


Processing Nodes
Each data processing node contained 32 bit-slice data processors, an optional floating point accelerator
and interfaces for inter processor communication.
Hypercube Routers
The router nodes on all processor chips were wired together to form a Boolean n-cube. A full
configuration of CM-2 had 4096 router nodes on processor chips interconnected as a 12-dimensional
hypercube.

Major Applications of CM-2


The CM-2 has been applied in almost all the MPP and grand challenge applications, for example:
 document retrieval using relevance feedback,
 memory-based reasoning, as in the medical diagnostic system called QUACK for simulating the
   diagnosis of a disease,
 bulk processing of natural languages,
 SPICE-like VLSI circuit analysis and layout,
 computational fluid dynamics,
 signal/image/vision processing and integration,
 neural network simulation and connectionist modeling,
 dynamic programming,
 context-free parsing,
 ray tracing graphics,
 computational geometry problems.

8.4.3 MasPar MP-1 Architecture

The MP-1 architecture consists of four subsystems:


i) PE array,
ii) Array Control Unit (ACU),
iii) UNIX subsystem with standard I/O,
iv) High-speed I/O subsystem





 The UNIX subsystem handles traditional serial processing.
 The high-speed I/O, working together with the PE array, handles massively parallel computing.
 The MP-1 family includes configurations with 1024, 4096, and up to 16,384 processors. The peak
   performance of the 16K-processor configuration is 26,000 MIPS in 32-bit RISC integer operations.
   The system also has a peak floating-point capability of 1.5 Gflops in single-precision and 650
   Mflops in double-precision operations.
 Array Control Unit The ACU is a 14-MIPS scalar RISC processor using a demand paging
instruction memory. The ACU fetches and decodes MP-1 instructions, computes addresses and
scalar data values, issues control signals to the PE array, and monitors the status of the PE array.
The ACU is microcoded to achieve horizontal control of the PE array. Most scalar ACU instructions
execute in one 70-ns clock. The whole ACU is implemented on one PC board.
A separately implemented functional unit, called a memory machine, is used in parallel with the ACU.
The memory machine performs PE array load and store operations, while the ACU broadcasts arithmetic,
logic, and routing instructions to the PEs for parallel execution.


Chapter 9- Scalable, Multithreaded, and Dataflow Architectures

9.1 Latency Hiding Techniques.


9.1.1 Shared Virtual Memory
Single-address-space multiprocessors/multicomputers must use shared virtual memory.
The Architecture Environment
 The Dash architecture was a large-scale, cache-coherent, NUMA multiprocessor system as
depicted in Fig. 9.1.


 It consisted of multiple multiprocessor clusters connected through a scalable, low-latency
   interconnection network.

 Physical memory was distributed among the processing nodes in various clusters. The distributed
memory formed a global address space.
 Cache coherence was maintained using an invalidating, distributed directory-based protocol. For
each memory block, the directory kept track of remote nodes caching it.
 When a write occurred, point-to-point messages were sent to invalidate remote copies of the block.
 Acknowledgement messages were used to inform the originating node when an invalidation was
completed.

 Two levels of local cache were used per processing node. Loads and writes were separated with the
use of write buffers for implementing weaker memory consistency models.
 The main memory was shared by all processing nodes in the same cluster. To facilitate prefetching
and the directory-based coherence protocol, directory memory and remote-access caches were used
for each cluster.
 The remote-access cache was shared by all processors in the same cluster.
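
The invalidation mechanism described in these bullets can be illustrated with a short Python sketch
of a per-block directory. This is an idealized illustration of the idea, not the Dash implementation;
the class and method names are ours.

# Sketch: a per-block directory that tracks which remote nodes cache a block and
# sends point-to-point invalidations on a write, counting the acknowledgements expected.

class Directory:
    def __init__(self):
        self.sharers = {}            # block -> set of nodes caching it

    def read(self, block, node):
        self.sharers.setdefault(block, set()).add(node)

    def write(self, block, writer):
        targets = self.sharers.get(block, set()) - {writer}
        for node in targets:
            self.invalidate(block, node)          # point-to-point message
        self.sharers[block] = {writer}            # writer now holds the only copy
        return len(targets)                       # acknowledgements expected

    def invalidate(self, block, node):
        print(f"invalidate block {block} at node {node}")

d = Directory()
d.read("B0", 1); d.read("B0", 2); d.read("B0", 3)
acks = d.write("B0", 1)
print("acks expected:", acks)      # node 1 waits for 2 acknowledgements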

The SVM Concept
 Figure 9.2 shows the structure of a distributed shared memory. A global virtual address space is
shared among processors residing at a large number of loosely coupled processing nodes.
 The idea of Shared virtual memory (SVM) is to implement coherent shared memory on a network
of processors without physically shared memory.
 The coherent mapping of SVM on a message-passing multicomputer architecture is shown in Fig.
9.2b.

 The system uses virtual addresses instead of physical addresses for memory references.

 Each virtual address space can be as large as a single node can provide and is shared by all nodes in
the system.
 The SVM address space is organized in pages which can be accessed by any node in the system. A
memory-mapping manager on each node views its local memory as a large cache of pages for its
associated processor.
Page Swapping
 A memory reference causes a page fault when the page containing the memory location is not in a
processor’s local memory.

 When a page fault occurs, the memory manager retrieves the missing page from the memory of
another processor.
 If there is a page frame available on the receiving node, the page is moved in.
 Otherwise, the SVM system uses page replacement policies to find an available page frame,
swapping its contents to the sending node.
 A hardware MMU can set the access rights (nil, read-only, writable) so that a memory access
   violating memory coherence will cause a page fault.
 The memory coherence problem is solved in IVY through distributed fault handlers and their
servers. To client programs, this mechanism is completely transparent.
 The large virtual address space allows programs to be larger in code and data space than the
physical memory on a single node.
 This SVM approach offers the ease of shared-variable programming in a message-passing
environment.
 In addition, it improves software portability and enhances system scalability through modular
memory growth.
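
A highly simplified sketch of the page-fault path described above is given below. The names and the
trivial replacement policy are ours; IVY's actual fault handlers and servers are far more elaborate.

# Sketch: on a page fault, fetch the missing page from its current holder;
# if no local frame is free, swap a victim page back to the sending node first.

class SvmNode:
    def __init__(self, name, frames):
        self.name = name
        self.frames = frames           # max pages resident locally
        self.pages = {}                # page id -> contents

    def access(self, page, owner):
        if page in self.pages:
            return self.pages[page]    # hit in local memory
        # Page fault: make room if necessary (trivial replacement policy).
        if len(self.pages) >= self.frames:
            victim, contents = self.pages.popitem()
            owner.pages[victim] = contents        # swap the victim back to the other node
        self.pages[page] = owner.pages.pop(page)  # move the missing page in
        return self.pages[page]

n1 = SvmNode("N1", frames=1)
n2 = SvmNode("N2", frames=4)
n2.pages = {"P0": "data0", "P1": "data1"}
print(n1.access("P0", n2))   # fault: P0 migrates from N2 to N1
print(n1.access("P1", n2))   # fault again: P0 is swapped back, P1 moves in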


Latency hiding can be accomplished through 4 complementary approaches:


i) Prefetching techniques which bring instructions or data close to the processor before
they are actually needed
ii) Coherent caches supported by hardware to reduce cache misses
iii) Relaxed memory consistency models by allowing buffering and pipelining of memory
references
iv) Multiple-contexts support to allow a processor to switch from one context to another
when a long latency operation is encountered.

1. Prefetching Techniques
Prefetching uses knowledge about the expected misses in a program to move the
corresponding data close to the processor before it is actually needed.
Prefetching can be classified based on whether it is
 binding or
 non-binding,
or whether it is controlled by
 hardware or
 software.
 Binding prefetching : the value of a later reference (eg, a register load) is bound at the time when
the prefetch completes.
 Non binding prefetching : brings data close to the processor, but the data remains visible to the
cache coherence protocol and is thus kept consistent until the processor actually reads the value.
 Hardware Controlled Prefetching: includes schemes such as long cache lines and instruction
lookahead.
 Software Controlled Prefetching: explicit prefetch instructions are issued. Allows the prefetching
to be done selectively and extends the possible interval between prefetch issue and actual reference.
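
Software-controlled prefetching can be pictured as issuing a prefetch hint a fixed number of
iterations ahead of use. In the sketch below, prefetch() is a stand-in for the explicit prefetch
instruction (here a no-op so the example runs anywhere); the prefetch distance of 8 is an arbitrary
illustrative choice.

# Sketch: software-controlled prefetching with an explicit prefetch "instruction".

PREFETCH_DISTANCE = 8      # issue the prefetch several iterations ahead of use

def prefetch(address):
    pass                   # placeholder for a non-binding prefetch hint

def sum_array(a):
    total = 0
    for i in range(len(a)):
        if i + PREFETCH_DISTANCE < len(a):
            prefetch(id(a[i + PREFETCH_DISTANCE]))   # bring the future element close
        total += a[i]                                # by the time it is used, it should be cached
    return total

print(sum_array(list(range(100))))   # 4950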

2. Coherent Caches
 While the cache coherence problem is easily solved for small bus-based multiprocessors through
the use of snoopy cache coherence protocols, the problem is much more complicated for large scale
multiprocessors that use general interconnection networks.


 As a result, some large scale multiprocessors did not provide caches, others provided caches that
must be kept coherent by software, and still others provided full hardware support for coherent
caches.
 Caching of shared read-write data provided substantial gains in performance. The largest benefit
came from a reduction of cycles wasted due to read misses. The cycles wasted due to write misses
were also reduced.
 Hardware cache coherence is an effective technique for substantially increasing the performance
with no assistance from the compiler or programmer.

3. Relaxed memory consistency models


Different consistency models can be defined by relaxing one or more requirements of
sequential consistency; these are called relaxed consistency models. These models do not
provide memory consistency at the hardware level. Instead, the programmers are responsible for
implementing memory consistency by applying synchronization techniques.
There are four dimensions along which relaxed consistency models are compared:
 Relaxation
 Synchronizing vs. non-synchronizing
 Issue vs. view-based
 Relative model strength

9.2 Principles of Multithreading
9.2.1 Multithreading Issues and Solutions
Multithreading demands that the processor be designed to handle multiple contexts simultaneously on
a context-switching basis.

Architecture Environment
A multithreaded MPP system is modeled by a network of processor (P) and memory (M) nodes as
shown in Fig. 9.11a. The distributed memories form a global address space.
Four machine parameters are defined below to analyze the performance of this network:
1. The Latency (L): This is the communication latency on a remote memory access. The value of
L includes the network delays, cache-miss penalty and delays caused by contentions in split
transactions.


2. The number of Threads (N): This is the number of threads that can be interleaved in each
processor. A thread is represented by a context consisting of a program counter, a register set
and the required context status words.
3. The context-switching overhead (C): This refers to the cycles lost in performing context
switching in a processor. This time depends on the switch mechanism and the amount of
processor states devoted to maintaining active threads.
4. The interval between switches (R): This refers to the cycles between switches triggered by
remote reference. The inverse p=1/R is called the rate of requests for remote accesses. This
reflects a combination of program behavior and memory system design.
In order to increase efficiency, one approach is to reduce the rate of requests by using distributed
coherent caches. Another is to eliminate processor waiting through multithreading.

Multithreaded Computations
Fig 9.11b shows the structure of the multithreaded parallel computations model.
The computation starts with a sequential thread (1), followed by supervisory scheduling (2), where
the processors begin threads of computation (3), by intercomputer messages that update variables
among the nodes when the computer has distributed memory (4), and finally by synchronization
prior to beginning the next unit of parallel work (5).


The communication overhead period (4) inherent in distributed memory structures is usually
distributed throughout the computation and is possibly completely overlapped.
Message passing overhead in multicomputers can be reduced by specialized hardware operating in
parallel with computation.
Communication bandwidth limits granularity, since a certain amount of data has to be transferred
with other nodes in order to complete a computational grain. Message passing calls (4) and
synchronization (5) are nonproductive.
Fast mechanisms to reduce or to hide these delays are therefore needed. Multithreading is not
capable of speedup in the execution of single threads, while weak ordering or relaxed consistency
models are capable of doing this.

Problems of Asynchrony
Massively parallel processors operate asynchronously in a network environment. The asynchrony
triggers two fundamental latency problems:
1. Remote loads
2. Synchronizing loads


Solutions to Asynchrony Problem


1. Multithreading Solutions
2. Distributed Caching


1. Multithreading Solutions – Multiplex among many threads


When one thread issues a remote-load request, the processor begins work on another thread, and so on
(Fig. 9.13a).


 Clearly the cost of thread switching should be much smaller than that of the latency of the remote
load, or else the processor might as well wait for the remote load’s response.
 As the internode latency increases, more threads are needed to hide it effectively. Another concern
is to make sure that messages carry continuations. Suppose, after issuing a remote load from thread
T1 (Fig 9.13a), we switch to thread T2, which also issues a remote load.
 The responses may not return in the same order. This may be caused by requests traveling different
distances, through varying degrees of congestion, to destination nodes whose loads differ greatly,
etc.
 One way to cope with the problem is to associate each remote load and response with an identifier
for the appropriate thread, so that it can be reenabled on the arrival of a response.
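
The "identifier per remote load" idea can be sketched as a pending-request table keyed by thread id,
so that an out-of-order response re-enables exactly the thread that issued the request. The names
and structures below are illustrative only.

# Sketch: tag each remote-load request with its thread id so an out-of-order
# response can re-enable exactly the thread that issued it.

pending = {}          # request id -> thread id blocked on it
ready = []            # threads that may be scheduled again

def issue_remote_load(request_id, thread_id):
    pending[request_id] = thread_id      # thread blocks; processor switches away

def response_arrives(request_id):
    thread_id = pending.pop(request_id)  # responses may arrive in any order
    ready.append(thread_id)              # re-enable the correct thread

issue_remote_load("req-A", "T1")
issue_remote_load("req-B", "T2")
response_arrives("req-B")                # B returns first
response_arrives("req-A")
print(ready)                             # ['T2', 'T1']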

2. Distributed Caching
 The concept of distributed caching is shown in Fig. 9.13b. Every memory location has an owner
   node. For example, N1 owns B and N2 owns A.
 The directories are used to contain import-export lists and state whether the data is shared (for
   reads, many caches may hold copies) or exclusive (for writes, one cache holds the current value).
   The directories multiplex among a small number of contexts to cover the cache loading effects.

 Distributed caching offers a solution for the remote-loads problem, but not for the
   synchronizing-loads problem.
 Multithreading offers a solution for remote loads and possibly for synchronizing loads.
 The two approaches can be combined to solve both types of remote access problems.

9.2.2 Multiple-Context Processors


Multithreaded systems are constructed with multiple-context (multithreaded) processors.
Enhanced Processor Model
 A conventional single-thread processor will wait during a remote reference; it is idle for a period of
   time L.
 A multithreaded processor, as modeled in Fig. 9.14a, will suspend the current context and switch to
another, so after some fixed number of cycles it will again be busy doing useful work, even though
the remote reference is outstanding.
 Only if all the contexts are suspended (blocked) will the processor be idle.


The objective is to maximize the fraction of time that the processor is busy, so we will use the
efficiency of the processor as our performance index, given by:

        Efficiency = busy / (busy + switching + idle)
where busy, switching and idle represent the amount of time, measured over some large interval, that
the processor is in the corresponding state.
The basic idea behind a multithreaded machine is to interleave the execution of several contexts in
order to dramatically reduce the value of idle, but without overly increasing the magnitude of
switching.
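
Using the parameters N, R, L and C defined in Section 9.2.1, a commonly used back-of-the-envelope
model of this efficiency can be written down in a few lines of Python. The saturation/linear split
below is our formulation of the standard model, stated as an assumption rather than quoted from the
notes; the numeric values are illustrative.

# Sketch: processor efficiency under a simple multithreading model.
# Each thread computes for R cycles, then issues a remote reference that takes L cycles;
# every context switch costs C cycles.

def efficiency(N, R, L, C):
    if (N - 1) * (R + C) >= L:
        # Saturated: there are always enough other threads to hide the latency.
        return R / (R + C)
    # Linear region: the processor still idles for part of each L-cycle wait.
    return N * R / (R + C + L)

R, L, C = 16, 128, 4
for N in (1, 2, 4, 8, 16):
    print(f"N={N:2d}  efficiency={efficiency(N, R, L, C):.2f}")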

Context-Switching Policies
Different multithreaded architectures are distinguished by the context-switching policies adopted.
Four switching policies are:
1. Switch on Cache miss – This policy corresponds to the case where a context is preempted
when it causes a cache miss.

In this case, R is taken to be the average interval between misses (in Cycles) and L the time
required to satisfy the miss.
Here, the processor switches contexts only when it is certain that the current one will be
delayed for a significant number of cycles.
2. Switch on every load - This policy allows switching on every load, independent of whether it
will cause a miss or not.
In this case, R represents the average interval between loads. A general multithreading model
assumes that a context is blocked for L cycles after every switch; but in the case of a switch-on-
load processor, this happens only if the load causes a cache miss.
3. Switch on every instruction – This policy allows switching on every instruction, independent
of whether it is a load or not. Successive instructions become independent, which will benefit
pipelined execution.
4. Switch on block of instructions – Blocks of instructions from different threads are interleaved.
This will improve the cache-hit ratio due to locality. It will also benefit single-context
performance.
