MODULE-2
   Processors can be "mapped" to a space that has clock rate and cycles per instruction (CPI) as
    coordinates. Each processor type occupies a region of this space.
   Newer technologies are enabling higher clock rates.
   Manufacturers are also trying to lower the number of cycles per instruction.
   Thus the "future processor space" is moving toward the lower right of the processor design space.
CISC and RISC Processors
•   Complex Instruction Set Computing (CISC) processors like the Intel 80486, the Motorola 68040,
    the VAX/8600, and the IBM S/390 typically use microprogrammed control units, and have lower
    clock rates and higher CPI figures than RISC processors.
•   Reduced Instruction Set Computing (RISC) processors like the Intel i860, SPARC, MIPS R3000,
    and IBM RS/6000 have hard-wired control units, higher clock rates, and lower CPI figures.
Superscalar Processors
•   This subclass of the RISC processors allows multiple instructions to be issued simultaneously
    during each cycle.
•   The effective CPI of a superscalar processor should be less than that of a generic scalar RISC
    processor.
• Clock rates of scalar RISC and superscalar RISC machines are similar.
VLIW Machines
•   Very Long Instruction Word machines typically have many more functional units than superscalars
    (and thus the need for longer – 256 to 1024 bits – instructions to provide control for them).
•   These machines mostly use microprogrammed control units with relatively slow clock rates
    because of the need to use ROM to hold the microcode.
Superpipelined Processors
•   These processors typically use a multiphase clock (actually several clocks that are out of phase
    with each other, each phase perhaps controlling the issue of another instruction) running at a
    relatively high rate.
• The CPI in these machines tends to be relatively high (unless multiple instruction issue is used).
• Processors in vector supercomputers are mostly superpipelined and use multiple functional units
    for concurrent scalar and vector operations.
Instruction Pipelines
Basic definitions associated with Pipeline operations:
   •   Instruction pipeline cycle – the time required for each phase to complete its operation
       (assuming equal delay in all phases)
   •   Instruction issue latency – the time (in cycles) required between the issuing of two adjacent
       instructions
   •   Instruction issue rate – the number of instructions issued per cycle (the degree of a
       superscalar)
   •   Simple operation latency – the delay (after the previous instruction) associated with the
       completion of a simple operation (e.g. integer add) as compared with that of a complex
       operation (e.g. divide).
    •   Resource conflicts – when two or more instructions demand use of the same functional unit(s)
        at the same time.
Pipelined Processors
– can be fully utilized if instructions can enter the pipeline at a rate of one per cycle
    •   CPI rating is 1 for an ideal pipeline. Underpipelined systems will have higher CPI ratings,
        lower clock rates, or both.
•   Figure 4.3 shows the data path architecture and control unit of a typical, simple scalar processor
    which does not employ an instruction pipeline. Main memory, I/O controllers, etc. are connected to
    the external bus.
•   The control unit generates control signals required for the fetch, decode, ALU operation, memory
    access, and write result phases of instruction execution.
•   The control unit itself may employ hardwired logic, or—as was more common in older CISC style
    processors—microcoded logic.
•   Modern RISC processors employ hardwired logic, and even modern CISC processors make use of
    many of the techniques originally developed for high-performance RISC processors.
    •   CISC
           –   Many different instructions
           –   Many different operand data types
           –   Many different operand addressing formats
           –   Relatively small number of general purpose registers
           –   Many instructions directly match high-level language constructions
    •   RISC
           –   Many fewer instructions than CISC (freeing chip space for more functional units!)
           –   Fixed instruction format (e.g. 32 bits) and simple operand addressing
           –   Relatively large number of registers
           –   Small CPI (close to 1) and high clock rates
Architectural Distinctions
    •   CISC
           – Unified cache for instructions and data (in most cases)
           – Microprogrammed control units and ROM in earlier processors (hard-wired control
             units now in some CISC systems)
    •   RISC
           – Separate instruction and data caches
           – Hard-wired control units
• CISC Advantages
          – Smaller program size (fewer instructions)
          – Simpler control unit design
          – Simpler compiler design
   •   RISC Advantages
          – Has potential to be faster
          – Many more registers
   •   RISC Problems
          – More complicated register decoding system
          – Hardwired control is less flexible than microcode
     Representative CISC scalar processors include:
            –   VAX 8600
            –   Motorola MC68040
            –   Intel Pentium
   •   The CPU in the VAX 8600 consisted of two functional units for concurrent execution of
       integer and floating point instructions.
   •   The unified cache was used for holding both instructions and data.
   •   There were 16 GPRs in the instruction unit. Instruction pipelining was built with six stages in
       the VAX 8600, as in most CISC machines.
   •   The instruction unit prefetched and decoded instructions, handled branching operations, and
       supplied operands to the two functional units in a pipelined fashion.
   •   A Translation Lookaside Buffer (TLB) was used in the memory control unit for fast generation
       of a physical address from a virtual address.
   •   Both integer and floating point units were pipelined.
   •   The performance of the processor pipelines relied heavily on the cache hit ratio and on minimal
       branching damage to the pipeline flow.
   •   RISC and CISC scalar processors should have the same performance if clock rate and program
       lengths are equal.
    •   RISC moves less frequent operations into software, thus dedicating hardware resources to the
        most frequently used operations.
Representative RISC scalar processors include:
– Sun SPARC
– Intel i860
– Motorola M88100
– AMD 29000
•   SPARC family chips have been produced by Cypress Semiconductor, Inc. Figure 4.7 shows the
    architecture of the Cypress CY7C601 SPARC processor and of the CY7C602 FPU.
•   The Sun SPARC instruction set contains 69 basic instructions
•   The SPARC runs each procedure with a set of thirty-two 32-bit IU registers.
•   Eight of these registers are global registers shared by all procedures, and the remaining 24 are
    window registers associated with each individual procedure.
•   The concept of using overlapped register windows is the most important feature introduced by the
    Berkeley RISC architecture.
•   Fig. 4.8 shows eight overlapping windows (formed with 64 local registers and 64 overlapped
    registers) and eight globals with a total of 136 registers, as implemented in the Cypress 601.
•   Each register window is divided into three eight-register sections, labeled Ins, Locals, and Outs.
•   The local registers are only locally addressable by each procedure. The Ins and Outs are shared
    among procedures.
•   The calling procedure passes parameters to the called procedure via its Outs (r8 to r15) registers,
    which are the Ins registers of the called procedure.
•   The window of the currently running procedure is called the active window pointed to by a current
    window pointer.
•   A window invalid mask is used to indicate which window is invalid. The trap base register serves
    as a pointer to a trap handler.
•   A special register is used to create a 64-bit product in multiple step instructions. Procedures can
    also be called without changing the window.
•   The overlapping windows can significantly save the time required for interprocedure
    communications, resulting in much faster context switching among cooperative procedures.
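To make the overlapped-window idea concrete, the sketch below models eight windows of 24 registers each plus 8 globals (136 physical registers, as in the Cypress 601), with the caller's Outs becoming the callee's Ins. This is a minimal illustration in Python, not SPARC hardware; the register naming and the window-advance step are simplifying assumptions.

```python
# Illustrative model of SPARC-style overlapped register windows.
NWINDOWS = 8
PHYS_REGS = 8 + NWINDOWS * 16      # 8 globals + 8 windows * (8 locals + 8 overlap) = 136
assert PHYS_REGS == 136

def phys_index(cwp, reg):
    """Map a per-procedure register name (g0-g7, o0-o7, l0-l7, i0-i7)
    to a physical register index for the window selected by cwp."""
    kind, num = reg[0], int(reg[1:])
    if kind == 'g':                # globals are shared by all procedures
        return num
    base = 8 + (cwp * 16) % (NWINDOWS * 16)
    if kind == 'o':                # Outs of window cwp
        return base + num
    if kind == 'l':                # Locals are private to window cwp
        return base + 8 + num
    if kind == 'i':                # Ins of window cwp = Outs of window cwp+1 (its caller)
        return 8 + ((cwp + 1) * 16) % (NWINDOWS * 16) + num
    raise ValueError(reg)

# Caller (window 3) passes a parameter in o0; after a window switch the callee
# (window 2) sees the same physical register as its i0 -- no copying is needed.
caller_cwp, callee_cwp = 3, 2
assert phys_index(caller_cwp, 'o0') == phys_index(callee_cwp, 'i0')
print("caller o0 and callee i0 share physical register",
      phys_index(caller_cwp, 'o0'))
```

The shared physical register is exactly why parameter passing and context switching among cooperating procedures avoid memory traffic.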
   A CISC or a RISC scalar processor can be improved with a superscalar or vector architecture.
   Scalar processors are those executing one instruction per cycle.
   Only one instruction is issued per cycle, and only one instruction completion is expected from
    the pipeline per cycle.
   In a superscalar processor, multiple instructions are issued per cycle and multiple results are
    generated per cycle.
   A vector processor executes vector instructions on arrays of data; each vector instruction involves a
    string of repeated operations, which are ideal for pipelining with one result per cycle.
   Superscalar processors are designed to exploit more instruction-level parallelism in user programs.
   Only independent instructions can be executed in parallel without causing a wait state. The amount
    of instruction level parallelism varies widely depending on the type of code being executed.
   It has been observed that the average value is around 2 for code without loop unrolling. Therefore,
    for these codes there is not much benefit gained from building a machine that can issue more than
    three instructions per cycle.
   The instruction-issue degree in a superscalar processor has thus been limited to 2 to 5 in practice.
   A superscalar processor of degree m can issue m instructions per cycle.
   The base scalar processor, implemented either in RISC or CISC, has m = 1.
   In order to fully utilize a superscalar processor of degree m, m instructions must be executable in
    parallel. This situation may not be true in all clock cycles.
   In that case, some of the pipelines may be stalling in a wait state.
   In a superscalar processor, the simple operation latency should require only one cycle, as in the
    base scalar processor.
   Due to the desire for a higher degree of instruction-level parallelism in programs, the superscalar
    processor depends more on an optimizing compiler to exploit parallelism.
   •   Code density in VLIW is less than in superscalars, because if a "region" of a VLIW word isn't
       needed in a particular instruction, it must still exist (to be filled with a "no-op").
   •   Superscalars can be object-code compatible with scalar processors; such compatibility between
       parallel and non-parallel architectures is difficult to achieve with VLIW.
VLIW Opportunities
   •   The efficiency of the machine is entirely dictated by the success, or ―goodness,‖ of the
       compiler in planning the operations to be placed in the same instruction words.
   •   Different implementations of the same VLIW architecture may not be binary-compatible with
       each other, because their instruction latencies and degrees of parallelism differ.
VLIW Summary
• VLIW reduces the effort required to detect parallelism using hardware or software techniques.
   •   The main advantage of VLIW architecture is its simplicity in hardware structure and instruction
       set.
   •   Unfortunately, VLIW does require careful analysis of code in order to "compact" the most
       appropriate "short" instructions into a VLIW word.
Vector Processors
   •   A vector processor is a coprocessor designed to perform vector computations.
   •   A vector is a one-dimensional array of data items (each of the same data type).
   •   Vector processors are often used in multipipelined supercomputers.
   •   Typical memory-to-memory vector instructions (using the same notation as given in the
       previous slide) include these:
   •   Vector processors can usually effectively use large pipelines in parallel, the number of such
       parallel pipelines effectively limited by the number of functional units.
   •   As usual, the effectiveness of a pipelined system depends on the availability and use of an
       effective compiler to generate code that makes good use of the pipeline facilities.
Symbolic Processors
   • Symbolic processors are somewhat unique in that their architectures are tailored toward the
     execution of programs in languages similar to LISP, Scheme, and Prolog.
    •   In effect, the hardware provides a facility for the manipulation of the relevant data objects with
        ―tailored‖ instructions.
    •   These processors (and programs of these types) may invalidate assumptions made about more
        traditional scientific and business computations.
4.3 Memory Hierarchical Technology
   Storage devices such as registers, caches, main memory, disk devices, and backup storage are often
    organized as a hierarchy as depicted in Fig. 4.17.
   The memory technology and storage organization at each level is characterized by five parameters:
    the access time ti, the memory size si, the cost per byte ci, the transfer bandwidth bi, and the unit
    of transfer xi.
Memory devices at a lower level are:
                faster to access,
                are smaller in capacity,
                are more expensive per byte,
                have a higher bandwidth, and
                have a smaller unit of transfer.
In general, ti-1 < ti,   si-1 < si,   ci-1 > ci,   bi-1 > bi   and   xi-1 < xi   for i = 1, 2, 3, and 4 in the
hierarchy where i = 0 corresponds to the CPU register level.
The cache is at level 1, main memory at level 2, the disks at level 3 and backup storage at level 4.
Caches
   The cache is controlled by the MMU and is programmer-transparent.
   The cache can also be implemented at one or multiple levels, depending on the speed and
    application requirements.
   Multi-level caches are built either on the processor chip or on the processor board.
   Multi-level cache systems have become essential to deal with memory access latency.
Peripheral Technology
   Peripheral devices include printers, plotters, terminals, monitors, graphics displays, optical
    scanners, image digitizers, output microfilm devices etc.
   Some I/O devices are tied to special-purpose or multimedia applications.
Information stored in a memory hierarchy (M1, M2,…, Mn) satisfies 3 important properties:
    1. Inclusion
    2. Coherence
    3. Locality
   We consider cache memory the innermost level M1, which directly communicates with the CPU
    registers.
   The outermost level Mn contains all the information words stored. In fact, the collection of all
    addressable words in Mn forms the virtual address space of a computer.
   Program and data locality is characterized below as the foundation for using a memory hierarchy
    effectively.
•   The inverse, however, is not necessarily true. That is, the presence of a data item in level Mi+1 does
    not imply its presence in level Mi. We call a reference to a missing item a "miss."
The requirement that copies of data items at successive memory levels be consistent is called the
―coherence property.‖
Coherence Strategies
• Write-through
           – The data item in Mi+1 is updated immediately whenever the corresponding item in Mi
             is modified.
• Write-back
           – The update of the data item in Mi+1 corresponding to a modified item in Mi is not
             carried out until it (or the block/page/etc. in Mi that contains it) is replaced or removed.
           – This is the most efficient approach, but cannot be used (without modification) when
             multiple processors share Mi+1, …, Mn.
3. Locality of References
    •   Memory references are generated by the CPU for either instruction or data access.
    •   These accesses tend to be clustered in certain regions in time, space, and ordering.
        There are three dimensions of the locality property:
           – Temporal locality – recently referenced items (instructions or data) are likely to be
             referenced again in the near future.
           – Spatial locality – if location M is referenced at time t, then a nearby location M + Δm
             tends to be referenced within time t + Δt.
           – Sequential locality – in typical programs, instructions tend to be executed in sequential
             order unless a branch redirects the flow.
Working Sets
   •   The set of addresses (bytes, pages, etc.) referenced by a program during the interval from t to t+
       t, where t is called the working set parameter, changes slowly.
   •   This set of addresses, called the working set, should be present in the higher levels of M if a
       program is to execute efficiently (that is, without requiring numerous movements of data items
       from lower levels of M). This is called the working set principle.
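As an illustration, the short Python sketch below computes the working set W(t, Δt) over a hypothetical page-reference trace; the trace and the window size are made-up values, not data from the text.

```python
def working_set(trace, t, window):
    """Return the set of pages referenced in the interval (t - window, t].

    trace  -- list of page numbers, one per time step (time starts at 0)
    t      -- current time index
    window -- the working-set parameter (delta t)
    """
    start = max(0, t - window + 1)
    return set(trace[start:t + 1])

# Hypothetical reference trace of virtual page numbers.
trace = [0, 1, 2, 0, 1, 3, 0, 3, 1, 2, 4, 4, 4, 1]

# The working set changes slowly: with a window of 5 references it stays small
# even though the program touches 5 distinct pages overall.
for t in (4, 8, 12):
    print(f"W(t={t}, window=5) = {sorted(working_set(trace, t, 5))}")
```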
4.3.3 Memory Capacity Planning
The performance of a memory hierarchy is determined by the Effective Access Time Teff to any level
in the hierarchy. It depends on the hit ratios and access frequencies.
Hit Ratios
   •   When a needed item (instruction or data) is found in the level of the memory hierarchy being
       examined, it is called a hit. Otherwise (when it is not found), it is called a miss (and the item
       must be obtained from a lower level in the hierarchy).
   •   The hit ratio, hi, for Mi is the probability (between 0 and 1) that a needed data item is found
       when sought in memory level Mi.
• We assume h0 = 0 and hn = 1.
Access Frequencies
• There are different penalties associated with misses at different levels in the memory hierarchy.
•   The access frequency to Mi is fi = (1 - h1)(1 - h2)…(1 - hi-1)hi, and the frequencies sum to 1.
•   The effective access time is
                 Teff = Σ fi·ti = h1t1 + (1 - h1)h2t2 + … + (1 - h1)(1 - h2)…(1 - hn-1)hntn
•   The effective access time is still dependent on program behavior and memory design choices.
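A quick way to check the Teff formula is to evaluate it numerically; the sketch below does that for a hypothetical three-level hierarchy. The hit ratios and access times are invented for illustration, not figures from the text.

```python
def effective_access_time(hit_ratios, access_times):
    """Compute Teff = sum_i f_i * t_i, where the access frequency
    f_i = (1-h1)(1-h2)...(1-h_{i-1}) * h_i.  Assumes the last level always hits."""
    teff, miss_prob = 0.0, 1.0
    for h, t in zip(hit_ratios, access_times):
        f = miss_prob * h          # probability the access is satisfied at this level
        teff += f * t
        miss_prob *= (1.0 - h)     # probability of missing all levels so far
    return teff

# Hypothetical 3-level hierarchy: cache, main memory, disk.
h = [0.95, 0.999, 1.0]             # hit ratios (h_n = 1)
t = [10e-9, 100e-9, 5e-3]          # access times in seconds
print(f"Teff = {effective_access_time(h, t) * 1e9:.1f} ns")
```

Even with a tiny miss probability at main memory, the slow disk level dominates much of Teff, which is why high hit ratios at the upper levels matter so much.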
Hierarchy Optimization
This implies that the cost is distributed over n levels. Since c1 > c2 > c3 > … > cn, we have to choose s1
< s2 < s3 < … < sn.
The optimal design of a memory hierarchy should result in a Teff close to the t1 of M1 and a total
cost close to the cost of Mn.
The optimization process can be formulated as a linear programming problem, given a ceiling C0 on
the total cost: that is, minimize Teff = Σ fi·ti subject to si > 0, ti > 0 for every level i and
Ctotal = Σ ci·si < C0.
      •    To facilitate the use of memory hierarchies, the memory addresses normally generated by
           modern processors executing application programs are not physical addresses, but are rather
           virtual addresses of data items and instructions.
      •    Physical addresses, of course, are used to reference the available locations in the real physical
           memory of a system.
• Virtual addresses must be mapped to physical addresses before they can be used.
Virtual to Physical Mapping
• The mapping from virtual to physical addresses can be formally defined as follows:
      •    The mapping returns a physical address if a memory hit occurs. If there is a memory miss, the
           referenced item has not yet been brought into primary memory.
Mapping Efficiency
      •    The efficiency with which the virtual to physical mapping can be accomplished significantly
           affects the performance of the system.
Private Virtual Memory
    – In this scheme, each processor has a separate virtual address space, but all processors share the
      same physical address space.
    – Advantages:
– Disadvantages
           •   The same virtual address in different virtual spaces may point to different pages in
               physical memory
Shared Virtual Memory
•   All processors share a single shared virtual address space, with each processor being given a
    portion of it.
Advantages:
Disadvantages
• Processors must be capable of generating large virtual addresses (usually > 32 bits)
• Since the page table is shared, mutual exclusion must be used to guarantee atomic updates
• Segmentation must be used to confine each process to its own address space
• The address translation process is slower than with private (per processor) virtual memory
Memory Allocation
Both the virtual address space and the physical address space are divided into fixed-length pieces.
•   The purpose of memory allocation is to allocate pages of virtual memory using the page frames of
    physical memory.
4.4.2 TLB, Paging, and Segmentation
Both the virtual memory and physical memory are partitioned into fixed-length pages. The purpose of
memory allocation is to allocate pages of virtual memory to the page frames of the physical memory.
   The process demands the translation of virtual addresses into physical addresses. Various schemes
    for virtual address translation are summarized in Fig. 4.21a.
   The translation demands the use of translation maps which can be implemented in various ways.
   Translation maps are stored in the cache, in associative memory, or in the main memory.
   To access these maps, a mapping function is applied to the virtual address. This function generates
    a pointer to the desired translation map.
   This mapping can be implemented with a hashing or congruence function.
   Hashing is a simple computer technique for converting a long page number into a short one with
    fewer bits.
   The hashing function should randomize the virtual page number and produce a unique hashed
    number to be used as the pointer.
   The virtual address is divided into three fields:
         – The leftmost field holds the virtual page number,
         – the middle field identifies the cache block number,
         – the rightmost field is the word address within the block.
   Our purpose is to produce the physical address consisting of the page frame number, the block
    number, and the word address.
   The first step of the translation is to use the virtual page number as a key to search through the TLB
    for a match.
   The TLB can be implemented with a special associative memory (content addressable memory) or
    use part of the cache memory.
   In case of a match (a hit) in the TLB, the page frame number is retrieved from the matched page
    entry. The cache block and word address are copied directly.
   In case the match cannot be found (a miss) in the TLB, a hashed pointer is used to identify one of
    the page tables where the desired page frame number can be retrieved.
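The two-step TLB-then-page-table lookup described above can be sketched as follows. The dictionary-based TLB, the hashed page table, and the chosen field widths are illustrative assumptions, not the organization of any particular machine.

```python
# Illustrative virtual-to-physical translation: try the TLB first, then fall
# back to a page table.  Assumed address layout: |virtual page #|block #|word|.
BLOCK_BITS, WORD_BITS = 4, 2

tlb = {0x12: 0x7, 0x34: 0x3}          # virtual page number -> page frame number
page_table = {0x56: 0x9, 0x12: 0x7}   # backing translation map

def translate(vaddr):
    word  = vaddr & ((1 << WORD_BITS) - 1)
    block = (vaddr >> WORD_BITS) & ((1 << BLOCK_BITS) - 1)
    vpn   = vaddr >> (WORD_BITS + BLOCK_BITS)
    if vpn in tlb:                     # TLB hit: frame number retrieved directly
        pfn = tlb[vpn]
    elif vpn in page_table:            # TLB miss: consult the page table ...
        pfn = page_table[vpn]
        tlb[vpn] = pfn                 # ... and update the TLB
    else:
        raise LookupError("page fault: page not resident")  # OS must load the page
    # Block number and word address are copied directly into the physical address.
    return (pfn << (WORD_BITS + BLOCK_BITS)) | (block << WORD_BITS) | word

print(hex(translate((0x12 << 6) | 0x2A)))   # hits in the TLB
print(hex(translate((0x56 << 6) | 0x01)))   # misses in the TLB, found in the page table
```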
1. Paging
2. Segmentation
1. Paging memory
•   Main memory contains some number of pages which is smaller than the number of pages in the
    virtual memory.
•   For example, if the page size is 2K, the physical memory is 16M (8K pages) and the virtual
    memory is 4G (2M pages), then there is a 256-to-1 mapping of virtual pages to page frames.
• A page map table is used for implementing a mapping, with one entry per virtual page.
2. Segmented memory
•   In a segmented memory management system the blocks to be replaced in main memory are
    potentially of unequal length and here the segments correspond to logical blocks of code or data.
•   Segments, then, are "atomic" in the sense that either the whole segment should be in main
    memory, or none of the segment should be there.
•   The segments may be placed anywhere in main memory, but the instructions or data in one
    segment should be contiguous.
3. Paged Segmentation
• Within each segment, the addresses are divided into fixed size pages, so each virtual address has
  three fields:
– Segment Number
– Page Number
– Offset
Inverted paging
•   Besides direct mapping, address translation maps can also be implemented with inverted mapping
    (Fig. 4.21c).
•   An inverted page table is created for each page frame that has been allocated to users. Any virtual
    page number can be paired with a given physical page number.
•   Inverted page tables are accessed either by an associative search or by the use of a hashing
    function.
•   In using an inverted PT, only virtual pages that are currently resident in physical memory are
    included. This provides a significant reduction in the size of the page tables.
•   The generation of a long virtual address from a short physical address is done with the help of
    segment registers, as demonstrated in Fig. 4.21c.
•   The leading 4 bits (denoted sreg) of a 32-bit address name a segment register.
•   The register provides a segment id that replaces the 4-bit sreg to form a long virtual address.
•   This effectively creates a single long virtual address space with segment boundaries at multiples of
    256 Mbytes (2^28 bytes).
•   The IBM RT/PC had a 12-bit segment id (4096 segments) and a 40-bit virtual address space.
•   Either associative page tables or inverted page tables can be used to implement inverted mapping.
•   The inverted page table can also be assisted with the use of a TLB. An inverted PT avoids the use
    of a large page table or a sequence of page tables.
•   Given a virtual address to be translated, the hardware searches the inverted PT for that address and,
    if it is found, uses the table index of the matching entry as the address of the desired page frame.
•   A hashing table is used to search through the inverted PT.
•   The size of an inverted PT is governed by the size of the physical space, while that of traditional
    PTs is determined by the size of the virtual space.
•   Because of limited physical space, no multiple levels are needed for the inverted page table.
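A minimal sketch of inverted-table lookup follows, with a Python dict standing in for the hash-assisted search and the frame numbers serving as the table indices. It illustrates the idea only; it is not the IBM RT/PC mechanism.

```python
# Inverted page table: one entry per physical page frame, holding the virtual
# page (and owner) currently mapped there.  Lookup goes through a hash table
# keyed by (process id, virtual page number) that points back into the table.
inverted_pt = [                       # index = page frame number
    {"vpn": 0x1A, "pid": 1},
    {"vpn": 0x07, "pid": 2},
    {"vpn": 0x33, "pid": 1},
]
hash_anchor = {(1, 0x1A): 0, (2, 0x07): 1, (1, 0x33): 2}   # (pid, vpn) -> frame

def translate(pid, vpn, offset_bits=12, offset=0):
    frame = hash_anchor.get((pid, vpn))
    if frame is None or inverted_pt[frame]["vpn"] != vpn:
        raise LookupError("page fault")       # virtual page not resident
    return (frame << offset_bits) | offset    # frame index is the physical page number

print(hex(translate(1, 0x33, offset=0x2C)))   # frame 2 -> physical address 0x202c
```

Note that the table has one entry per page frame, so its size tracks the physical memory size, as stated above.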
•   Memory management policies include the allocation and deallocation of memory pages to active
    processes and the replacement of memory pages.
•   In demand paging memory systems, page replacement refers to the process in which a resident
    page in main memory is replaced by a new page transferred from the disk.
•   Since the number of available page frames is much smaller than the number of pages, the frames
    will eventually be fully occupied.
•   In order to accommodate a new page, one of the resident pages must be replaced.
•   The goal of a page replacement policy is to minimize the number of possible page faults so that the
    effective memory-access time can be reduced.
•   The effectiveness of a replacement algorithm depends on the program behavior and memory traffic
    patterns encountered.
•   A good policy should match the program locality property. The policy is also affected by page size
    and by the number of available frames.
Page Traces: A page trace is a sequence of page frame numbers (PFNs) generated during the
execution of a given program.
The following page replacement policies are specified in a demand paging memory system for a page
fault at time t.
(1) Least recently used (LRU)—This policy replaces the page in R(t) which has the longest backward
distance:
(2) Optimal (OPT) algorithm—This policy replaces the page in R(t) with the longest forward
distance:
(3) First-in-first-out (FIFO)—This policy replaces the page in R(t) which has been in memory for the
longest time.
(4) Least frequently used (LFU)—This policy replaces the page in R(t) which has been least
referenced in the past.
(5) Circular FIFO—This policy joins all the page frame entries into a circular FIFO queue using a
pointer to indicate the front of the queue.
•   An allocation bit is associated with each page frame. This bit is set upon initial allocation of a page
    to the frame.
•   When a page fault occurs, the queue is circularly scanned from the pointer position.
•   The pointer skips the allocated page frames and replaces the very first unallocated page frame.
•   When all frames are allocated, the front of the queue is replaced, as in the FIFO policy.
(6) Random replacement—This is a trivial algorithm which chooses any page for replacement
randomly.
Example:
Consider a paged virtual memory system with a two-level hierarchy: main memory M1 and disk
memory M2.
Assume a page size of four words. The number of page frames in M1 is 3, labeled a, b and c; and the
number of pages in M2 is 10, identified by 0, 1, 2, …, 9. The ith page in M2 consists of word
addresses 4i to 4i + 3 for all i = 0, 1, 2, …, 9.
A certain program generates the following sequence of word addresses which are grouped (underlined)
together if they belong to the same page. The sequence of page numbers so formed is the page trace:
Page tracing experiments are described below for three page replacement policies: LRU, OPT, and
FIFO, respectively. The successive pages loaded in the page frames (PFs) form the trace entries.
Initially, all PFs are empty.
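Since the word-address sequence itself is not reproduced here, the sketch below runs the same kind of experiment on a made-up page trace with three page frames, counting page faults under LRU, OPT and FIFO. The trace values are illustrative only.

```python
from collections import OrderedDict

def simulate(trace, frames, policy):
    """Count page faults for a page trace with a given number of frames.
    policy is 'LRU', 'FIFO', or 'OPT' (OPT looks ahead in the trace)."""
    resident = OrderedDict()          # page -> None, ordered by load/use time
    faults = 0
    for i, page in enumerate(trace):
        if page in resident:
            if policy == "LRU":
                resident.move_to_end(page)   # refresh recency on a hit
            continue
        faults += 1
        if len(resident) == frames:          # must evict a resident page
            if policy in ("LRU", "FIFO"):
                resident.popitem(last=False) # front of queue = oldest / least recent
            else:                            # OPT: evict page with longest forward distance
                future = trace[i + 1:]
                victim = max(resident,
                             key=lambda p: future.index(p) if p in future else len(future) + 1)
                del resident[victim]
        resident[page] = None
    return faults

# Hypothetical page trace (virtual page numbers), 3 page frames as in the example.
trace = [0, 1, 2, 4, 2, 3, 7, 2, 1, 0, 3]
for policy in ("LRU", "OPT", "FIFO"):
    print(policy, simulate(trace, 3, policy), "page faults")
```

OPT always gives the fewest faults but needs the full future trace, so it serves only as a yardstick for the realizable LRU and FIFO policies.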
Module-III
   A backplane bus interconnects processors, data storage and peripheral devices in a tightly coupled
    hardware configuration.
    The system bus must be designed to allow communication between devices on the bus without
    disturbing the internal activities of all the devices attached to the bus.
    Timing protocols must be established to arbitrate among multiple requests. Operational rules must
    be set to ensure orderly data transfers on the bus.
   Signal lines on the backplane are often functionally grouped into several buses as shown in Fig 5.1.
    Various functional boards are plugged into slots on the backplane. Each slot is provided with one
    or more connectors for inserting the boards as demonstrated by the vertical arrows.
    •   Data address and control lines form the data transfer bus (DTB) in VME bus.
    •   Address lines broadcast data and device address
           –    Proportional to log of address space size
    •   Data lines proportional to memory word length
    •   Control lines specify read/write, timing, and bus error conditions
Bus Arbitration and Control
   The process of assigning control of the DTB to a requester is called arbitration. Dedicated lines are
    reserved to coordinate the arbitration process among several requesters.
   The requester is called a master, and the receiving end is called a slave.
   Interrupt lines are used to handle interrupts, which are often prioritized. Dedicated lines may be
    used to synchronize parallel activities among the processor modules.
   Utility lines include signals that provide periodic timing (clocking) and coordinate the power-up
    and power-down sequences of the system.
   The backplane is made of signal lines and connectors.
   A special bus controller board is used to house the backplane control logic, such as the system
    clock driver, arbiter, bus timer, and power driver.
Functional Modules
A functional module is a collection of electronic circuitry that resides on one functional board (Fig.
5.1) and works to achieve special bus control functions.
Special functional modules are introduced below:
•   Arbiter is a functional module that accepts bus requests from the requester module and grants
    control of the DTB to one requester at a time.
•   Bus timer measures the time each data transfer takes on the DTB and terminates the DTB cycle if
    a transfer takes too long.
•   Interrupter module generates an interrupt request and provides status/ID information when an
    interrupt handler module requests it.
•   Location monitor is a functional module that monitors data transfers over the DTB. A power
    monitor watches the status of the power source and signals when power becomes unstable.
•   System clock driver is a module that provides a clock timing signal on the utility bus. In addition,
    board interface logic is needed to match the signal line impedance, the propagation time, and
    termination values between the backplane and the plug-in boards.
Physical Limitations
•   Due to electrical, mechanical, and packaging limitations, only a limited number of boards can be
    plugged into a single backplane.
•   Multiple backplane buses can be mounted on the same backplane chassis.
•   The bus system is difficult to scale, mainly limited by packaging constraints.
    •   Two types of printed circuit boards connected to a bus: active and passive
    •   Active devices like processors can act as bus masters or as slaves at different times.
    •   Passive devices like memories can act only as slaves.
    •   The master can initiate a bus cycle
           –   Only one can be in control at a time
    •   The slaves respond to requests by a master
           –   Multiple slaves can respond
Bus Addressing
    •   The backplane bus is driven by a digital clock with a fixed cycle time, called the bus cycle.
    •   The backplane has a limited physical size, so information on the bus will not be significantly
        skewed between devices.
Synchronous Timing
   •   All bus transaction steps take place at fixed clock edges as shown in Fig. 5.3a.
   •   The clock signals are broadcast to all potential masters and slaves.
   •   Clock cycle time determined by slowest device on bus
   •   Once the data becomes stabilized on the data lines, the master uses Data-ready pulse to initiate
       the transfer
   •   The Slave uses Data-accept pulse to signal completion of the information transfer.
   •   Simple, less circuitry, suitable for devices with relatively the same speed.
Asynchronous Timing
   •   Asynchronous timing is based on a handshaking or interlocking mechanism between the source
       and the destination; no fixed clock cycle is needed, and each data transfer takes as long as the
       participating devices require.
Arbitration
    •   Process of selecting next bus master
    •   Bus tenure is duration of master’s control
    •   It restricts the tenure of the bus to one master at a time.
    •   Competing requests must be arbitrated on a fairness or priority basis
    •   Arbitration competition and bus transactions take place concurrently on a parallel bus over
        separate lines
Central Arbitration
   • Uses a central arbiter as shown in Fig 5.4a
    •   Potential masters are daisy chained in a cascade
    •   A special signal line propagates bus-grant from first master (at slot 1) to the last master (at slot
        n).
    •   All requests share the same bus-request line
    •   The bus-request signals the rise of the bus-grant level, which in turn raises the bus-busy level
        as shown in Fig. 5.4b.
   •   Simple scheme
   •   Easy to add devices
   •   Fixed-priority sequence – not fair
   •   Propagation of bus-grant signal is slow
   •   Not fault tolerant
Independent Requests and Grants
   •   Provide independent bus-request and grant signals for each master as shown in Fig. 5.5a.
   •   No daisy chaining is used in this scheme.
   •   Require a central arbiter, but can use a priority or fairness based policy
   •   More flexible and faster than a daisy-chained policy
   •   Larger number of lines – costly
Distributed Arbitration
•   Each master has its own arbiter and unique arbitration number as shown in Fig. 5.5b.
•   Uses arbitration number to resolve arbitration competition
•   When two or more devices compete for the bus, the winner is the one whose arbitration number is
    the largest, as determined by parallel contention arbitration.
•   All potential masters send their arbitration numbers to the shared-bus request/grant (SBRG) lines,
    and each compares its own number with the SBRG number.
•   If the SBRG number is greater, the requester is dismissed. At the end, the winner’s arbitration
    number remains on the arbitration bus. After the current bus transaction is completed, the winner
    seizes control of the bus.
•   Priority based scheme
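The parallel contention rule can be illustrated in a few lines of Python. The wired-OR behaviour of the SBRG lines is approximated bit by bit here, purely as a sketch of the winner-selection logic, not of any particular bus.

```python
def arbitrate(requesters, width=8):
    """Return the arbitration number that wins the SBRG contention.

    Each requester drives its number onto the shared lines; from the most
    significant bit down, a requester whose bit is 0 while the wired-OR line
    reads 1 is dismissed.  The largest number always survives."""
    alive = set(requesters)
    for bit in reversed(range(width)):
        line = any((r >> bit) & 1 for r in alive)         # wired-OR of driven bits
        if line:                                          # someone is driving a 1
            alive = {r for r in alive if (r >> bit) & 1}  # the others drop out
    (winner,) = alive      # exactly one number remains on the arbitration bus
    return winner

# Hypothetical competing masters with unique arbitration numbers.
print(arbitrate({0x2B, 0x7A, 0x55}))   # 0x7A (122) wins: highest arbitration number
```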
Transfer Modes
Interrupt Mechanisms
   •   Interrupt: is a request from I/O or other devices to a processor for service or attention
   •   A priority interrupt bus is used to pass the interrupt signals
   •   Interrupter must provide status and identification information
   •   Have an interrupt handler for each request line
   •   Interrupts can be handled by message passing on the data lines on a time-sharing basis.
           –   This saves dedicated interrupt lines, but uses bus cycles
           –   The use of time-shared data bus lines to implement interrupts is called a virtual interrupt
Standard Requirements
The major objectives of the Futurebus+ standards committee were to create a bus standard that would
provide a significant step forward in improving the facilities and performance available to the
designers of multiprocessor systems.
Below are the design requirements set by the IEEE 896.1-1991 Standards Committee to provide a
stable platform on which several generations of computer systems could be based:
   •   Architecture-, processor-, and technology-independent open standard
   •   Asynchronous timing protocol
   •   Optional packet protocol
   •   Distributed arbitration protocols
   •   Support of high reliability and fault tolerant applications
   •   Ability to lock modules without deadlock or livelock
   •   Circuit-switched and split transaction protocols
   •   Support of real-time mission critical computations w/multiple priority levels
   •   32 or 64 bit addressing
   •   Direct support of snoopy cache-based multiprocessors.
   •   Compatible message passing protocols
5.2 Cache Memory Organizations
   •   Caches are fast local memories that hold the most recently used data and/or instructions.
   •   Most multiprocessor systems use private caches for each processor as shown in Fig. 5.6
   •   Have an interconnection network between caches and main memory
   •   Caches can be addressed using either a Physical Address or Virtual Address.
   •   Two different cache design models are:
           –   Physical address cache
           –   Virtual address cache
Physical Address Cache
   •   When a cache is indexed and tagged with the physical address, it is called a physical address
       cache. Cache lookup must occur after address translation in the TLB or MMU. No aliasing is
       allowed, so the address is always uniquely translated without confusion.
   •   After a cache miss, a block is loaded from main memory.
   •   Either a write-back or write-through policy may be used.
Advantages:
   •   No cache flushing on a context switch
   •   No aliasing problem thus fewer cache bugs in OS kernel.
   •   Simple design
   •   Requires little intervention from OS kernel
Disadvantages:
   Slowdown in accessing the cache until the MMU/TLB finishes translating the address
Virtual Address Cache
•   When a cache is indexed or tagged with the virtual address, it is called a virtual address cache.
•   In this model both cache and MMU translation or validation are done in parallel.
•   The physical address generated by the MMU can be saved in tags for later write back but is not
    used during the cache lookup operations.
Advantages:
    •   do address translation only on a cache miss
    •   faster for hits because no address translation
    •   More efficient access to cache
Disadvantages:
    •   Cache flushing on a context switch (example : local data segments will get an erroneous hit for
        virtual addresses already cached after changing virtual address space, if no cache flushing).
    •   Aliasing problem (several different virtual addresses cannot span the same physical addresses
        without being duplicated in cache).
    •   Different logically addressed data have the same index/tag in the cache
    •   Confusion if two or more processors access the same physical cache location
    •   Flush cache when aliasing occurs, but leads to slowdown
    •   Apply special tagging with a process key or with a physical address
Direct Mapping Cache
The block field (r bits) is used to implement the (modulo-m) block placement, where m = 2^r.
Once the block Bi is uniquely identified by this field, the tag associated with the addressed block is
compared with the tag in the memory address.
   •   Advantages
           –   Simple hardware
           –   No associative search
            –   No block replacement policy needed
           –   Lower cost
           –   Higher speed
   •   Disadvantages
           –   Rigid mapping
           –   Poorer hit ratio
           –   Prohibits parallel virtual address translation
           –   Use larger cache size with more block frames to avoid contention
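To see the direct-mapping rule in action, the sketch below splits a physical address into tag, block and word fields and checks for a hit. The field widths and the tiny cache arrays are invented for illustration.

```python
# Direct-mapped cache lookup: block i of memory can live only in frame (i mod m).
WORD_BITS, BLOCK_BITS = 2, 4          # 4 words per block, m = 2^4 = 16 block frames
NFRAMES = 1 << BLOCK_BITS

cache_tags  = [None] * NFRAMES        # tag stored with each block frame
cache_valid = [False] * NFRAMES

def lookup(addr):
    """Return (hit, frame).  On a miss the block is installed and the tag updated."""
    frame = (addr >> WORD_BITS) & (NFRAMES - 1)     # frame index = block number mod m
    tag   = addr >> (WORD_BITS + BLOCK_BITS)
    hit = cache_valid[frame] and cache_tags[frame] == tag
    if not hit:                                     # simple fill on miss
        cache_tags[frame], cache_valid[frame] = tag, True
    return hit, frame

print(lookup(0x1234))   # (False, 13) - cold miss, block is installed
print(lookup(0x1234))   # (True,  13) - now a hit
print(lookup(0x5234))   # (False, 13) - same frame, different tag: the rigid mapping conflicts
```

The last access shows the block contention that the rigid mapping causes, which is why a larger cache with more block frames helps.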
Fully Associative Cache
   •   Each block in main memory can be placed in any of the available block frames as shown in
       Fig. 5.10a.
   •   Because of this flexibility, an s-bit tag is needed in each cache block.
   •   As s > r, this represents a significant increase in tag length.
   •   The name fully associative cache is derived from the fact that an m-way associative search
       requires tag to be compared with all block tags in the cache. This scheme offers the greatest
       flexibility in implementing block replacement policies for a higher hit ratio.
   •   An m-way comparison of all tags is very time consuming if the tags are compared sequentially
       using RAMs. Thus an associative memory is needed to achieve a parallel comparison with all
       tags simultaneously.
   •   This demands higher implementation cost for the cache. Therefore, a Fully Associative Cache
       has been implemented only in moderate size.
   •   Fig. 5.10b shows a four-way mapping example using a fully associative search. The tag is 4-
       bits long because 16 possible cache blocks can be destined for the same block frame.
   •   Advantages:
           –   Offers most flexibility in mapping cache blocks
           –   Higher hit ratio
           –   Allows better block replacement policy with reduced block contention
   •   Disadvantages:
           –   Higher hardware cost
           –   Only moderate size cache
           –   Expensive search process
Set-Associative Cache
   •   Compare the tag with the k tags within the identified set as shown in Fig. 5.11a.
   •   Since k is rather small in practice, the k-way associative search is much more economical than
       the full associativity.
   •   In general, a block Bj can be mapped into any one of the available frames Bf in a set Si defined
       below.
                 Bj → Bf ∈ Si      if j (mod v) = i
   •   The matched tag identifies the current block which resides in the frame.
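A small sketch of k-way set-associative placement and lookup follows. The set count, associativity and list-of-lists organization are illustrative choices, not a specific design.

```python
# k-way set-associative cache: block j may occupy any frame in set (j mod v).
K, V, WORD_BITS = 2, 8, 2             # 2-way, v = 8 sets, 4 words per block

cache = [[None] * K for _ in range(V)]   # each set holds up to K tags

def access(addr):
    block = addr >> WORD_BITS
    s, tag = block % V, block // V       # set index and tag
    if tag in cache[s]:
        return "hit"
    # Miss: place the block in the set, evicting the oldest entry (FIFO within the set).
    cache[s].pop(0)
    cache[s].append(tag)
    return "miss"

# Three different blocks map to the same set; the second reference hits,
# but two later blocks push the first one out of the 2-way set.
for a in (0x040, 0x040, 0x240, 0x440, 0x040):
    print(hex(a), access(a))
```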
Sector Mapping Cache
   •   If a cache miss occurs, the missing block is fetched from main memory and brought into a
       congruent block frame in an available sector.
   •   That is, the ith block in a sector must be placed into the ith block frame in the destined sector
       frame.
   •   Attach a valid bit to each block frame to indicate whether the block is valid or invalid.
   •   When the contents of the block frame are replaced from a new sector, the remaining block
       frames in the same sector are marked invalid. Only the block frames from the most recently
       referenced sector are marked valid for reference.
Advantages:
   •   Flexible to implement various block replacement algorithms
   •   Economical to perform a fully associative search over a limited number of sector tags
   •   Sector partitioning offers more freedom in grouping cache lines at both ends of the mapping.
5.2.4 Cache Performance Issues
As far as cache performance is concerned, trade-offs exist among the cache size, set number, block
size and memory speed. Important aspects of cache design with regard to performance are:
Cycle counts
•   This refers to the number of basic machine cycles needed for cache access, update and coherence
    control.
•   Cache speed is affected by underlying static or dynamic RAM technology, the cache organization
    and the cache hit ratios.
•   The write through or write back policy also affect the cycle count.
•   Cache size, block size, set number, and associativity affect count
•   The cycle count is directly related to the hit ratio; it decreases almost linearly with increasing
    values of the above cache parameters, as the hit ratio improves.
Hit ratio
•   The hit ratio is number of hits divided by total number of CPU references to memory (hits plus
    misses).
•   Hit ratio is affected by cache size and block size
•   Increases w.r.t. increasing cache size
•   Limited cache size, initial loading, and changes in locality prevent 100% hit ratio
Memory Bandwidth
•   The memory bandwidth is the number of memory words that can be accessed per unit time.
•   The ultimate purpose is to match the memory bandwidth with the bus bandwidth and with the
    processor bandwidth.
Memory Interleaving
•     The main memory is built with multiple modules.
•     These memory modules are connected to a system bus or a switching network to which other
      resources such as processors or I/O devices are also connected.
•     Once presented with a memory address, each memory module returns with one word per cycle.
•     It is possible to present different addresses to different memory modules so that parallel access of
      multiple words can be done simultaneously or in a pipelined fashion.
Consider a main memory formed with m = 2^a memory modules, each containing w = 2^b words of
memory cells. The total memory capacity is m·w = 2^(a+b) words.
These memory words are assigned linear addresses. Different ways of assigning linear addresses result
in different memory organizations.
Besides random access, the main memory is often block-accessed at consecutive addresses.
Low-order interleaving
•   Low-order interleaving spreads contiguous memory locations across the m modules horizontally
    (Fig. 5.15a).
•   This implies that the low-order a bits of the memory address are used to identify the memory
    module.
•   The high-order b bits are the word addresses (displacement) within each module.
•   Note that the same word address is applied to all memory modules simultaneously. A module
    address decoder is used to distribute module addresses.
High-order interleaving
•   High-order interleaving uses the high-order a bits as the module address and the low-order b bits as
    the word address within each module (Fig. 5.15b).
•   Contiguous memory locations are thus assigned to the same memory module. In each memory
    cycle, only one word is accessed from each module.
•   Thus the high-order interleaving cannot support block access of contiguous locations.
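The two interleaving schemes differ only in which address bits select the module; the short sketch below makes that concrete for hypothetical values of a and b.

```python
# Address decomposition for m = 2^a interleaved modules of w = 2^b words each.
a, b = 3, 4                       # hypothetical: 8 modules, 16 words per module

def low_order(addr):
    """Low-order interleaving: module = low a bits, word = high b bits."""
    return addr & ((1 << a) - 1), addr >> a

def high_order(addr):
    """High-order interleaving: module = high a bits, word = low b bits."""
    return addr >> b, addr & ((1 << b) - 1)

# Consecutive addresses spread across modules (low-order) but stay inside one
# module (high-order), which is why only low-order interleaving supports block access.
for addr in range(4):
    print(addr, "low-order ->", low_order(addr), " high-order ->", high_order(addr))
```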
•   An eight-way interleaved memory (with m=8 and w=8 and thus a=b=3) is shown in Fig. 5.16a.
•   Let θ be the major cycle and τ the minor cycle. These two cycle times are related as follows:
                 τ = θ/m
        m = degree of interleaving
        θ = total time to complete access of one word
        τ = actual time to produce one word
        Total block access time is 2θ
        Effective access time of each word is τ
•   The timing of the pipelined access of the 8 contiguous memory words is shown in Fig. 5.16b.
•   This type of concurrent access of contiguous words has been called a C-access memory scheme.
The memory bandwidth B of an m-way interleaved memory is upper-bounded by m and lower-
bounded by 1. The Hellerman estimate of B is
                 B ≈ m^0.56 ≈ √m                                   (5.5)
where m is the number of interleaved memory modules.
   This equation implies that if 16 memory modules are used, then the effective memory bandwidth is
    approximately four times that of a single module.
   This pessimistic estimate is due to the fact that block access of various lengths and access of single
    words are randomly mixed in user programs.
   Hellerman's estimate was based on a single-processor system. If memory-access conflicts from
    multiple processors (such as the hot spot problem) are considered, the effective memory bandwidth
    will be further reduced.
   In a vector processing computer, the access time of a long vector with n elements and stride
    distance 1 has been estimated by Cragon (1992) as follows:
   It is assumed that the n elements are stored in contiguous memory locations in an m-way
    interleaved memory system. The average access time per element is then
                 t1 = (θ/m)·(1 + (m - 1)/n)                        (5.6)
    which approaches τ = θ/m for very long vectors (n → ∞) and degrades to θ for scalar access (n = 1).
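Both estimates are easy to evaluate numerically; the sketch below plugs in a module count, a major cycle time and a few vector lengths (all arbitrary values chosen for illustration) to show how bandwidth and per-element access time scale.

```python
def hellerman_bandwidth(m):
    """Hellerman estimate: effective bandwidth of m interleaved modules ~ m**0.56 (~sqrt(m))."""
    return m ** 0.56

def cragon_access_time(theta, m, n):
    """Average per-element access time for an n-element, stride-1 vector (Eq. 5.6 form)."""
    return (theta / m) * (1 + (m - 1) / n)

print(f"B(16 modules) ~ {hellerman_bandwidth(16):.1f} words/cycle")   # roughly 4x one module

theta = 100e-9                         # hypothetical major cycle of 100 ns
for n in (1, 8, 1024):
    t1 = cragon_access_time(theta, m=8, n=n)
    print(f"n={n:5d}: t1 = {t1*1e9:6.1f} ns")   # tends toward theta/m = 12.5 ns for long vectors
```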
Fault Tolerance
   High- and low-order interleaving can be combined to yield many different interleaved memory
    organizations.
   Sequential addresses are assigned in the high-order interleaved memory in each memory module.
   This makes it easier to isolate faulty memory modules in a memory bank of m memory modules.
   When one module failure is detected, the remaining modules can still be used by opening a
    window in the address space.
   This fault isolation cannot be carried out in a low-order interleaved memory, in which a module
    failure may paralyze the entire memory bank.
   Thus low-order interleaving memory is not fault-tolerant.
   •   Global allocation: considers the history of the working sets of all resident processes in making
       a swapping decision
Swapping Systems
   •   Allow swapping only at entire process level
   •   Swap device: configurable section of a disk set aside for temp storage of data swapped
   •   Swap space: portion of disk set aside
   •   Depending on system, may swap entire processes only, or the necessary pages
Swapping in UNIX
   •   System calls that result in a swap:
           –   Allocation of space for child process being created
           –   Increase in size of a process address space
           –   Increased space demand by stack for a process
           –   Demand for space by a returning process swapped out previously
   •   Special process 0 is the swapper
Demand Paging Systems
   •   Allow only pages to be transferred between main memory and the swap device
   •   Pages are brought in only on demand
   •   Allows process address space to be larger than physical address space
   •   Offers flexibility to dynamically accommodate large # of processes in physical memory on
       time-sharing basis
Working Sets
   •   Set of pages referenced by the process during last n memory refs (n=window size)
   •   Only working sets of active processes are resident in memory
Other Policies
5.4 Sequential and Weak Consistency Models
   •   Memory inconsistency: when memory access order differs from program execution order
   •   Sequential consistency: memory accesses (I and D) consistent with program execution order
Event Orderings
   •   Processes: concurrent instruction streams executing on different processors
   •   Consistency models specify the order by which events from one process should be observed by
       another
   •   Event ordering helps determine if a memory event is legal for concurrent accesses
    •   Program order: order by which memory access occur for execution of a single process, w/o
        any reordering
The event ordering can be used to declare whether a memory event is legal or illegal, when several
processes are accessing a common set of memory locations.
A program order is the order by which memory accesses occur for the execution of a single process,
provided that no program reordering has taken place.
Three primitive memory operations for the purpose of specifying memory consistency models are
defined:
(1) A load by processor Pi is considered performed with respect to processor Pk at a point of time
when the issuing of a store to the same location by Pk cannot affect the value returned by the load.
(2) A store by Pi is considered performed with respect to Pk at a point of time when an issued load to
the same address by Pk returns the value written by this store.
(3) A load is globally performed if it is performed with respect to all processors and if the store that is
the source of the returned value has been performed with respect to all processors.
    As illustrated in Fig. 5.19a, a processor can execute instructions out of program order using a
    compiler to resequence instructions in order to boost performance.
    A uniprocessor system allows these out-of-sequence executions provided that hardware interlock
    mechanisms exist to check data and control dependences between instructions.
   When a processor in a multiprocessor system executes a concurrent program as illustrated in Fig.
    5.19b, local dependence checking is necessary but may not be sufficient to preserve the intended
    outcome of a concurrent execution.
(c) If accesses are not atomic with multiple copies of the same data coexisting as in a cache-based
system, then different processors can individually observe different interleavings during the same
execution. In this case, the total number of possible execution instantiations of a program becomes
even larger.
Atomicity
   Three categories of multiprocessor memory behavior:
   •   Program order preserved and uniform observation sequence by all processors
   •   Out-of-program-order allowed and uniform observation sequence by all processors
   •   Out-of-program-order allowed and nonuniform sequences observed by different processors
Atomic memory accesses: memory updates are known to all processors at the same time
Non-atomic: having individual program orders that conform is not a sufficient condition for sequential
consistency
       –    Multiprocessor cannot be strongly ordered
       A multiprocessor system is sequentially consistent if the result of any execution is the same as
       if the operations of all the processors were executed in some sequential order, and the
       operations of each individual processor appear in this sequence in the order specified by its
       program.
   2. The memory order conforms to a total binary order in which shared memory is accessed in real
       time over all loads/stores
   3. If two operations appear in particular program order, same memory order
   4. Swap op is atomic with respect to stores. No other store can intervene between load and store
       parts of swap
   5. All stores and swaps must eventually terminate
Implementation Considerations
   •   The shared memory has a single port that services one operation at a time
   •   The order in which the memory "switch" is thrown determines the global order of memory
       access operations
   •   Strong ordering preserves the program order in all processors
   •   Sequential consistency model leads to poor memory performance due to the imposed strong
       ordering of memory events
   •   Multiprocessor model may range from strong (sequential) consistency to various degrees of
       weak consistency
   •   Two models considered
           –   DSB (Dubois, Scheurich and Briggs) model
           –   TSO (Total Store Order) model
DSB Model
Dubois, Scheurich and Briggs have derived a weak consistency model by relating memory request
ordering to synchronization points in the program. We call this the DSB model specified by the
following 3 conditions:
   1. All previous synchronization accesses must be performed, before a load or a store access is
       allowed to perform wrt any other processor.
   2. All previous load and store accesses must be performed, before a synchronization access is
       allowed to perform wrt any other processor.
   3. Synchronization accesses sequentially consistent with respect to one another
TSO Model
Sindhu, Frailong and Cekleov have specified the TSO weak consistency model with 6 behavioral
axioms.
   1. Load returns latest store result
   2. Memory order is a total binary relation over all pairs of store operations
   3. If two stores appear in a particular program order, then they must also appear in the same
       memory order
   4. If a memory operation follows a load in program order, then it must also follow load in
       memory order
   5. A swap operation is atomic with respect to other stores – no other store can interleave between
       load/store parts of swap
   6. All stores and swaps must eventually terminate.
Asynchronous Model
   As shown in the figure data flow between adjacent stages in an asynchronous pipeline is controlled
    by a handshaking protocol.
   When stage Si is ready to transmit, it sends a ready signal to stage Si+1. After stage Si+1 receives the
    incoming data, it returns an acknowledge signal to Si.
   Asynchronous pipelines are useful in designing communication channels in message-passing
    multicomputers, where pipelined wormhole routing is practiced. Asynchronous pipelines may have
    a variable throughput rate.
   Different amounts of delay may be experienced in different stages.
Synchronous Model:
   Synchronous pipelines are illustrated in Fig. Clocked latches are used to interface between stages.
   The latches are made with master-slave flip-flops, which can isolate inputs from outputs.
   Upon the arrival of a clock pulse, all latches transfer data to the next stage simultaneously.
   The pipeline stages are combinational logic circuits. It is desired to have approximately equal
    delays in all stages.
   These delays determine the clock period and thus the speed of the pipeline. Unless otherwise
    specified, only synchronous pipelines are studied.
    The utilization pattern of successive stages in a synchronous pipeline is specified by a reservation
    table.
   For a linear pipeline, the utilization follows the diagonal streamline pattern shown in Fig. 6.1c.
   This table is essentially a space-time diagram depicting the precedence relationship in using the
    pipeline stages.
   Successive tasks or operations are initiated one per cycle to enter the pipeline. Once the pipeline is
    filled up, one result emerges from the pipeline for each additional cycle.
   This throughput is sustained only if the successive tasks are independent of each other.
   At the rising edge of the clock pulse, the data is latched to the master flip-flops of each latch
    register. The clock pulse has a width equal to d.
   In general, τm >> d by one to two orders of magnitude.
   This implies that the maximum stage delay τm dominates the clock period. The pipeline frequency
    is defined as the inverse of the clock period.
                f=1/τ
   If one result is expected to come out of the pipeline per cycle, f represents the maximum
    throughput of the pipeline.
   Depending on the initiation rate of successive tasks entering the pipeline, the actual throughput of
    the pipeline may be lower than f.
   This is because more than one clock cycle has elapsed between successive task initiations.
Clock Skewing:
    Ideally, we expect the clock pulses to arrive at all stages (latches) at the same time.
    However, due to a problem known as clock skewing, the same clock pulse may arrive at different
     stages with a time offset of s.
    Let tmax be the time delay of the longest logic path within a stage and tmin that of the shortest
     logic path within a stage.
   To avoid a race in two successive stages, we must choose
        τm >= tmax + s      and       d <= tmin - s
   These constraints translate into the following bounds on the clock period when clock skew takes
    effect:
        d + tmax + s <= τ <= τm + tmin - s
   In the ideal case s = 0, tmax = τm, and tmin = d. Thus, we have τ= τm + d
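As a quick illustration of these bounds (added here, not from the text), the helper below computes the feasible clock-period range for assumed values of τm, d, tmax, tmin and s; with zero skew it reproduces τ = τm + d.

def clock_period_bounds(tau_m, d, t_max, t_min, s):
    # Feasible clock period: d + t_max + s <= tau <= tau_m + t_min - s; None if no period works.
    lower = d + t_max + s
    upper = tau_m + t_min - s
    return (lower, upper) if lower <= upper else None

print(clock_period_bounds(tau_m=9, d=1, t_max=9, t_min=1, s=0))     # (10, 10): tau = tau_m + d
print(clock_period_bounds(tau_m=9, d=1, t_max=9, t_min=2, s=0.5))   # (10.5, 10.5) with skew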
Speedup Factor
The speedup factor of a k-stage pipeline over an equivalent nonpipelined processor, for n tasks, is defined as
        Sk = n k τ / [k + (n − 1)] τ = n k / (k + n − 1)
The pipeline throughput Hk is defined as the number of tasks (operations) performed per unit time:
        Hk = n / [k + (n − 1)] τ = n f / (k + n − 1)
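The two formulas can be evaluated directly; the short sketch below (added for illustration, with assumed values of k, n and f) shows the speedup approaching k and the throughput approaching f as n grows.

def pipeline_speedup(k, n):
    # Sk = n*k / (k + n - 1): speedup of a k-stage pipeline over a nonpipelined processor for n tasks.
    return n * k / (k + n - 1)

def pipeline_throughput(k, n, f):
    # Hk = n*f / (k + n - 1): tasks completed per unit time at clock frequency f.
    return n * f / (k + n - 1)

print(pipeline_speedup(4, 64))              # ~3.82, approaching k = 4 for large n
print(pipeline_throughput(4, 64, 100e6))    # ~9.55e7 tasks per second, approaching f = 1e8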
   However, function partitioning in a dynamic pipeline becomes quite involved because the pipeline
    stages are interconnected with loops in addition to streamline connections.
   A multifunction dynamic pipeline is shown in Fig 6.3a. This pipeline has three stages.
   Besides the streamline connections from S1 to S2 and from S2 to S3, there is a feed forward
    connection from S1 to S3 and two feedback connections from S3 to S2 and from S3 to S1.
   These feed forward and feedback connections make the scheduling of successive events into the
    pipeline a nontrivial task.
   With these connections, the output of the pipeline is not necessarily from the last stage.
   In fact, following different dataflow patterns, one can use the same pipeline to evaluate different
    functions
Reservation Tables:
   The reservation table for a static linear pipeline is trivial in the sense that data flow follows a linear
    streamline.
   The reservation table for a dynamic pipeline becomes more interesting because a nonlinear pattern
    is followed.
   Given a pipeline configuration, multiple reservation tables can be generated for the evaluation of
    different functions.
   Two reservation tables are given in Fig6.3b and 6.3c, corresponding to a function X and a function
    Y, respectively.
   Each function evaluation is specified by one reservation table. A static pipeline is specified by a
    single reservation table.
   A dynamic pipeline may be specified by more than one reservation table. Each reservation table
    displays the time-space flow of data through the pipeline for one function evaluation.
   Different functions may follow different paths on the reservation table.
   A number of pipeline configurations may be represented by the same reservation table.
   There is a many-to-many mapping between various pipeline configurations and different
    reservation tables.
   The number of columns in a reservation table is called the evaluation time of a given function.
Latency Analysis
   The number of time units (clock cycles) between two initiations of a pipeline is the latency
    between them.
   Latency values must be non negative integers. A latency of k means that two initiations are
    separated by k clock cycles.
   Any attempt by two or more initiations to use the same pipeline stage at the same time will cause a
    collision.
   A collision implies resource conflicts between two initiations in the pipeline. Therefore, all
    collisions must be avoided in scheduling a sequence of pipeline initiations.
   Some latencies will cause collisions, and some will not.
   Latencies that cause collisions are called forbidden latencies.
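The sketch below (added for illustration, not from the text) derives the forbidden latencies directly from a reservation table: any distance between two marks in the same row is forbidden, since a second initiation started that many cycles later would need the same stage in the same cycle. The three-stage table used here is hypothetical.

def forbidden_latencies(reservation_table):
    # reservation_table[s] lists the cycle numbers in which stage s is used by one evaluation.
    forbidden = set()
    for cycles in reservation_table:
        for i, t1 in enumerate(cycles):
            for t2 in cycles[i + 1:]:
                forbidden.add(abs(t2 - t1))
    return sorted(forbidden)

table_X = [
    [0, 5, 6],   # stage S1 used in cycles 0, 5, 6
    [1, 3],      # stage S2 used in cycles 1, 3
    [2, 4, 7],   # stage S3 used in cycles 2, 4, 7
]
print(forbidden_latencies(table_X))   # [1, 2, 3, 5, 6] -- any other latency is permissible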
Internal data forwarding: a store into memory location M from register R1 that is immediately followed
by a load of M into register R2 eliminates the load operation (LD R2, M) and replaces it with the move
operation (MOVE R2, R1).
Hazard Avoidance
•   The read and write of shared variables by different instructions in a pipeline may lead to different
    results if these instructions are executed out of order.
•   As shown in Fig. 6.15, three types of logic hazards are possible:
•   Consider two instructions I and J. Instruction J is assumed to logically follow instruction I
    according to program order.
• If the actual execution order of these two instructions violates the program order, incorrect results
    may be read or written, thereby producing hazards.
    Hazards should be prevented before these instructions enter the pipeline, such as by holding
    instruction J until the dependence on instruction I is resolved.
•   We use the notation D(I) and R(I) for the domain and range of an instruction I.
       –   Domain contains the Input Set to be used by instruction I
       –   Range contains the Output Set of instruction I
Listed below are conditions under which possible hazards can occur:
•   RAW (read-after-write) hazard: R(I) ∩ D(J) ≠ ∅
•   WAW (write-after-write) hazard: R(I) ∩ R(J) ≠ ∅
•   WAR (write-after-read) hazard: D(I) ∩ R(J) ≠ ∅
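A minimal sketch (added for illustration) applying these conditions, with D(.) taken as the set of registers an instruction reads and R(.) as the set it writes; the example instruction pair is hypothetical.

def detect_hazards(D_I, R_I, D_J, R_J):
    hazards = []
    if R_I & D_J:
        hazards.append("RAW")   # J reads something I writes
    if R_I & R_J:
        hazards.append("WAW")   # J writes something I also writes
    if D_I & R_J:
        hazards.append("WAR")   # J overwrites something I still needs to read
    return hazards

# I: R3 <- R1 + R2   followed by   J: R4 <- R3 * R5
print(detect_hazards(D_I={"R1", "R2"}, R_I={"R3"},
                     D_J={"R3", "R5"}, R_J={"R4"}))    # ['RAW']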
MODULE-IV
   Parallel processing demands the use of efficient system interconnects for fast communication
    among multiple processors and shared memory, I/O and peripheral devices.
   Hierarchical buses, crossbar switches and multistage networks are often used for this purpose.
   A generalized multiprocessor system is depicted in Fig. 7.1. This architecture combines features
    from the UMA, NUMA and COMA models.
   Each processor Pi is attached to its own local memory and private cache.
   These processors are connected to the shared-memory modules through an interprocessor-memory
    network (IPMN).
   Processors share access to I/O and peripheral devices through a Processor-I/O Network (PION).
    Both IPMN and PION are necessary in a shared-resource multiprocessor.
   An optional Interprocessor Communication Network (IPCN) can permit processor communication
    without using shared memory.
Network Characteristics
The networks are designed with many choices, such as timing, switching and control strategy; for example,
in the case of a dynamic network the multiprocessor interconnections are established under program control.
    Timing
          Synchronous – controlled by a global clock which synchronizes all network activity.
          Asynchronous – use handshaking or interlock mechanisms for communication and
           especially suitable for coordinating devices with different speed.
    Switching Method
          Circuit switching – a pair of communicating devices control the path for the entire duration
            of data transfer
          Packet switching – large data transfers are broken into smaller pieces, each of which can
            compete for use of the path
    Network Control
                  Centralized – global controller receives and acts on requests
                  Distributed – requests handled by local devices independently
   A bus system consists of a hierarchy of buses connecting various system and subsystem
    components in a computer.
   Each bus is formed with a number of signal, control, and power lines. Different buses are used to
    perform different interconnection functions.
   In general, bus systems are packaged at different levels as depicted in Fig. 7.2, including local buses
    on boards, backplane buses, and I/O buses.
   Local Bus: Buses implemented on printed-circuit boards are called local buses.
   On a processor board one often finds a local bus which provides a common communication path
    among major components (chips) mounted on the board.
   A memory board uses a memory bus to connect the memory with the interface logic.
   An I/O board or network interface board uses a data bus. Each of these board buses consists of
    signal and utility lines.
Backplane Bus
A backplane is a printed circuit on which many connectors are used to plug in functional boards. A
system bus, consisting of shared signal paths and utility lines, is built on the backplane. This system bus
provides a common communication path among all plug-in boards.
I/O Bus
Input/Output devices are connected to a computer system through an I/O bus such as the SCSI (Small
Computer Systems Interface) bus.
This bus is made of coaxial cables with taps connecting disks, printers and other devices to a processor
through an I/O controller.
Special interface logic is used to connect various board types to the backplane bus.
Each second-level cache must have a capacity that is at least an order of magnitude larger than the sum of
the capacities of all first-level caches connected beneath it.
   Each single cluster operates on a single-bus system. Snoopy bus coherence protocols can be used to
    establish consistency among first level caches belonging to the same cluster.
   Second level caches are used to extend consistency from each local cluster to the upper level.
   The upper level caches form another level of shared memory between each cluster and the main
    memory modules connected to the intercluster bus.
   Most memory requests should be satisfied at the lower level caches.
   Intercluster cache coherence is controlled among the second-level caches and the resulting effects
    are passed to the lower level.
This is because even if two processors attempted to access the same memory module (or I/O device) at the
same time, only one of the requests is serviced at a time.
Multistage Networks
Multistage networks consist of multiple stages of switch boxes, and should be able to connect any input to
any output.
A multistage network is called blocking if the simultaneous connections of some multiple input/output
pairs may result in conflicts in the use of switches or communication links.
A nonblocking multistage network can perform all possible connections between inputs and outputs by
rearranging its connections.
Crossbar Networks
Crossbar networks connect every input to every output through a crosspoint switch. A crossbar network is a
single stage, non-blocking permutation network.
In an n-processor, m-memory crossbar network, each crosspoint is a unary switch which can be open or
closed, providing a point-to-point connection path between a processor and a memory module.
Crosspoint switches must be designed to handle the potential contention for each memory module. A
crossbar switch avoids competition for bandwidth by using O(N²) switches to connect N inputs to N
outputs.
Although highly non-scalable, crossbar switches are a popular mechanism for connecting a small number
of workstations, typically 20 or fewer.
Each processor provides a request line, a read/write line, a set of address lines, and a set of data lines to a
crosspoint switch for a single column. The crosspoint switch eventually responds with an
acknowledgement when the access has been completed.
Multiport Memory
Since crossbar switches are expensive and not suitable for systems with many processors or memory
modules, multiport memory modules may be used instead.
A multiport memory module has multiple connection points for processors (or I/O devices), and the
memory controller in the module handles the arbitration and switching that might otherwise have been
accomplished by a crosspoint switch.
A two-function switch box can assume only two possible states, namely straight or exchange. However, a
four-function switch box can be in any of four possible states. A multistage network is capable of connecting
any input terminal to any output terminal. Multi-stage networks are basically constructed from so-called
shuffle-exchange switching elements, each of which is basically a 2 x 2 crossbar. Multiple layers of these
elements are connected to form the network.
Data routing is controlled by inspecting the destination code in binary. When the ith high-order bit of
the destination code is a 0, a 2 x 2 switch at stage i connects the input to the upper output.
Otherwise, the input is directed to the lower output.
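A short sketch (added here, not from the text) of this destination-tag rule for an n-input Omega network; the function name and the 8-input example are illustrative only.

def omega_output_selects(dst, n):
    # Output chosen at each of the log2(n) stages: destination bit 0 -> upper output, bit 1 -> lower output.
    stages = n.bit_length() - 1
    return ["lower" if (dst >> (stages - 1 - i)) & 1 else "upper"
            for i in range(stages)]

# Destination 011 in an 8-input network: bits 0, 1, 1 select upper, lower, lower,
# matching the example of switches A, B and C discussed below.
print(omega_output_selects(0b011, 8))   # ['upper', 'lower', 'lower']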
   Two switch settings are shown in Figs. 7.8a and b with respect to permutations Π1 = (0,7,6,4,2)
    (1,3)(5) and Π2 = (0,6,4,7,3)(1,5)(2), respectively.
   The switch settings in Fig. 7.8a are for the implementation of Π1, which maps 0→7, 7→6, 6→4,
    4→2, 2→0, 1→3, 3→1, 5→5.
   Consider the routing of a message from input 001 to output 011. This involves the use of switches
    A, B, and C. Since the most significant bit of the destination 011 is a "zero," switch A must be set
    straight so that the input 001 is connected to the upper output (labeled 2).
   The middle bit in 011 is a "one," thus input 4 to switch B is connected to the lower output with a
    "crossover" connection.
   The least significant bit in 011 is a "one," implying a flat connection in switch C.
   Similarly, the switches A, E, and D are set for routing a message from input 101 to output 101.
    There exists no conflict in all the switch settings needed to implement the permutation Π1 in Fig.
    7.8a.
   Now consider implementing the permutation Π2 in the 8-input Omega network (Fig. 7.8b).
    Conflicts in switch settings do exist in three switches identified as F, G, and H. The conflicts
    occurring at F are caused by the desired routings 000→110 and 100→111.
   Since both destination addresses have a leading bit 1, both inputs to switch F must be connected to
    the lower output.
   To resolve the conflicts, one request must be blocked.
   Similarly, we see conflicts at switch G between 011→000 and 111→011, and at switch H between
    101→001 and 011→000. At switches I and J, broadcast is used from one input to two outputs,
    which is allowed if the hardware is built to have four legitimate states as shown in Fig. 2.24a.
   The above example indicates the fact that not all permutations can be implemented in one pass
    through the Omega network.
   In total, sixteen 8x8 crossbar switches are used in Fig. 7.10a and 16 x 8+8 x 8 = 192 are used in
    Fig. 7.10b. Larger Butterfly networks can be modularly constructed using more stages.
   Note that no broadcast connections are allowed in a Butterfly network, making these networks a
    restricted subclass of Omega networks.
Fetch&Add
   This atomic memory operation is effective in implementing an N-way synchronization with a
    complexity independent of N.
   In a Fetch&Add(x, e) operation, x is an integer variable in shared memory and e is an integer
    increment.
   When a single processor executes this operation, the semantics is
        Fetch&Add(x, e)
           {      temp ← x;
                  x ← temp + e;                                                                        (7.1)
                  return temp     }
   When N processes attempt to Fetch&Add(x, e) the same memory word simultaneously, the
    memory is updated only once following a serialization principle.
   The sum of the N increments, e1 + e2 + • • • + eN, is produced in any arbitrary serialization of the N
    requests.
   This sum is added to the memory word x, resulting in a new value x + e1 + e2 + • • • + eN
   The values returned to the N requests are all unique, depending on the serialization order followed.
   The net result is similar to a sequential execution of N Fetch&Adds but is performed in one
    indivisible operation.
   Two simultaneous requests are combined in a switch as illustrated in Fig. 7.11.
One of the following operations will be performed if processor P1 executes Ans1 ← Fetch&Add(x, e1)
and P2 executes Ans2 ← Fetch&Add(x, e2) simultaneously on the shared variable x.
If the request from P1 is executed ahead of that from P2, the following values are returned:
        Ans1 ← x
        Ans2 ← x + e1                                                                                  (7.2)
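The Python sketch below (added for illustration) mimics this semantics, with a lock standing in for the combining/serialization hardware; the class and variable names are hypothetical.

import threading

class SharedWord:
    def __init__(self, value=0):
        self.value = value
        self._lock = threading.Lock()      # stands in for the hardware serialization

    def fetch_and_add(self, e):
        with self._lock:                   # the read-modify-write is indivisible
            temp = self.value
            self.value = temp + e
            return temp

x = SharedWord(0)
results = [None] * 4

def worker(i):
    results[i] = x.fetch_and_add(1)        # N = 4 simultaneous Fetch&Add(x, 1) requests

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads: t.start()
for t in threads: t.join()

print(sorted(results), x.value)            # [0, 1, 2, 3] 4 -- unique returns, memory updated once per request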
   In a memory hierarchy for a multiprocessor system, data inconsistency may occur between
    adjacent levels or within the same level.
   For example, the cache and main memory may contain inconsistent copies of the same data object.
   Multiple caches may possess different copies of the same memory block because multiple
    processors operate asynchronously and independently.
   Caches in a multiprocessing environment introduce the cache coherence problem. When multiple
    processors maintain locally cached copies of a unique shared-memory location, any local
    modification of the location can result in a globally inconsistent view of memory.
   Cache coherence schemes prevent this problem by maintaining a uniform state for each cached
    block of data.
   Cache inconsistencies caused by data sharing, process migration or I/O are explained below.
Inconsistency in Data sharing:
The cache inconsistency problem occurs only when multiple private caches are used.
In general, three sources of the problem are identified:
              sharing of writable data,
              process migration
              I/O activity.
•   Consider a multiprocessor with two processors, each using a private cache and both sharing the
    main memory.
•   Let X be a shared data element which has been referenced by both processors. Before update, the
    three copies of X are consistent.
•   If processor P writes new data X’ into the cache, the same copy will be written immediately into
    the shared memory under a write through policy.
•   In this case, inconsistency occurs between the two copies (X and X') in the two caches.
•   On the other hand, inconsistency may also occur when a write back policy is used, as shown on the
    right.
•   The main memory will be eventually updated when the modified data in the cache are replaced or
    invalidated.
In both cases, inconsistency appears between the two cache copies, labeled X and X'. Special
precautions must be exercised to avoid such inconsistencies. A coherence protocol must be established
before processes can safely migrate from one processor to another.
•   On the other hand, scalable multiprocessor systems interconnect processors using short point-to-
    point links in direct or multistage networks.
•   Unlike the situation in buses, the bandwidth of these networks increases as more processors are
    added to the system.
•   However, such networks do not have a convenient snooping mechanism and do not provide an
    efficient broadcast capability. In such systems, the cache coherence problem can be solved using
    some variant of directory schemes.
   Consider multiple processors (P1, P2, ..., Pn) maintaining consistent copies of block X in their local
    caches (Fig. 7.14a) and in the shared memory module marked X.
   Using a write-invalidate protocol, the processor P1 modifies (writes) its cache from X to X’, and all
    other copies are invalidated via the bus (denoted I in Fig. 7.14b). Invalidated blocks are called
    dirty, meaning they should not be used.
   The write-update protocol (Fig. 7.14c) demands the new block content X’ be broadcast to all cache
    copies via the bus.
   The memory copy also updated if write through caches are used. In using write-back caches, the
    memory copy is updated later at block replacement time.
•   A remote processor is denoted j, where j ≠ i. For each of the two cache states, six possible events
    may take place.
•   Note that all cache copies of the same block use the same transition graph in making state changes.
•   In a valid state (Fig. 7.15a), all processors can read (R(i), R(j)) safely. Local processor i can also
    write(W(i)) safely in a valid state. The invalid state corresponds to the case of the block either
    being invalidated or being replaced (Z(i) or Z(j)).
•   The block becomes invalid when a remote processor writes (W(j)) to its copy or when the local
    processor replaces (Z(i)) its own block copy.
•   The RW state corresponds to only one cache copy existing in the entire system, owned by the local
    processor i.
•   Read (R(i)) and write(W(i)) can be safely performed in the RW state. From either the RO state or
    the INV state, the cache block becomes uniquely owned when a local write (W(i)) takes place.
   2. Directory Based Protocol
7.4 Message – Passing Mechanisms
     Two Message Passing mechanisms are:
       1. Store and Forward Routing
       2. Wormhole Routing
      1. Store and Forward Routing
         In store-and-forward routing, packets are the basic unit of information flow. Each node is
         required to buffer the entire packet before forwarding it to the next node along the path, so the
         latency grows with the product of the packet length and the distance travelled.
      2. Wormhole Routing
   Packets are subdivided into smaller flits. Flit buffers are used in the hardware routers attached to
    nodes.
   The transmission from the source node to the destination node is done through a sequence of
    routers.
   All the flits in the same packet are transmitted in order as inseparable companions in a pipelined
    fashion.
   Only the header flit knows where the packet is going.
   All the data flits must follow the header flit.
   Flits from different packets cannot be mixed up. Otherwise they may be towed to the wrong
    destination.
Asynchronous Pipelining
   The pipelining of successive flits in a packet is done asynchronously using a handshaking protocol
    as shown in Fig. 7.28. Along the path, a 1-bit ready/request (R/A) line is used between adjacent
    routers.
   When the receiving router (D) is ready (Fig. 7.28a) to receive a flit (i.e., a flit buffer is available), it
    pulls the R/A line low. When the sending router (S) is ready (Fig. 7.28b), it raises the line high and
    transmits flit i through the channel.
   While the flit is being received by D (Fig. 7.28c), the R/A line is kept high. After flit i is removed
    from D's buffer (i.e., transmitted to the next node) (Fig. 7.28d), the cycle repeats itself for the
    transmission of the next flit i+1 until the entire packet is transmitted.
Advantages:
         Very efficient
        Faster clock
Latency Analysis:
   Wormhole Routing has a latency almost independent of the distance between the source and the
    destination
       TWH = L/W + (F × D)/W
       where L is the packet length (in bits), W the channel bandwidth, F the flit length, and D the
       distance (number of hops traversed).
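For comparison (the store-and-forward expression used below is the standard one from the literature and is not reproduced in this excerpt), the sketch evaluates both latencies for assumed packet, flit and distance values, showing why wormhole latency is nearly distance-independent.

def latency_store_and_forward(L, W, D):
    # T_SF = (L / W) * (D + 1): the whole packet is buffered at every intermediate node.
    return (L / W) * (D + 1)

def latency_wormhole(L, W, F, D):
    # T_WH = L / W + (F * D) / W: only the header flit pays the per-hop cost.
    return L / W + (F * D) / W

# Example: 1024-bit packet, 8-bit flits, 1 bit per cycle channel, 10 hops.
print(latency_store_and_forward(L=1024, W=1, D=10))   # 11264.0 cycles
print(latency_wormhole(L=1024, W=1, F=8, D=10))       # 1104.0 cycles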
A physical channel between two adjacent nodes is actually shared by many possible source and
destination pairs. The sharing of a physical channel leads to the concept of virtual channels.
Virtual channels
   A virtual channel is logical link between two nodes. It is formed by a flit buffer in the source node,
    a physical channel between them and a flit buffer in the receiver node.
   Four flit buffers are used at the source node and receiver node respectively. One source buffer is
    paired with one receiver buffer to form a virtual channel when the physical channel is allocated for
    the pair.
   Thus the physical channel is time shared by all the virtual channels. By adding the virtual channel
    the channel dependence graph can be modified and one can break the deadlock cycle.
   Here the cycle can be converted to spiral thus avoiding a deadlock. Virtual channel can be
    implemented with either unidirectional channel or bidirectional channels.
   However a special arbitration line is needed between adjacent nodes interconnected by
    bidirectional channel. This line determines the direction of information flow.
   The virtual channel may reduce the effective channel bandwidth available to each request.
   There exists a tradeoff between network throughput and communication latency in determining the
    degree of using virtual channels.
Deadlock Avoidance
By adding two virtual channels, V3 and V4 in Fig. 7.32c, one can break the deadlock cycle. A modified
channel-dependence graph is obtained by using the virtual channels V3 and V4, after the use of channel
C2, instead of reusing C3 and C4.
The cycle in Fig. 7.32b is being converted to a spiral, thus avoiding a deadlock. Channel multiplexing
can be done at the flit level or at the packet level if the packet length is sufficiently short.
Virtual channels can be implemented with either unidirectional channels or bidirectional channels.
Vector: A vector is a set of scalar data items, all of the same type, stored in memory. Usually, the
vector elements are ordered to have a fixed addressing increment between successive elements called
the stride.
Vector Processing: Vector processing occurs when arithmetic or logical operations are applied to
vectors. It is distinguished from scalar processing which operates on one or one pair of data.
Vector processing is faster and more efficient than scalar processing.
Vectorization: The conversion from scalar code to vector code is called vectorization.
   1. Vector - Vector instructions: One or two vector operands are fetched from the respective
        vector registers, enter through a functional pipeline unit, and produce results in another vector
        register.
                F1: Vi → Vj
                F2: Vi x Vj → Vk
   2. Vector - Scalar instructions: A scalar operand s is combined with a vector operand.
                F3: s x Vi → Vj
                Examples: V2 = 6 + V1
   3. Vector - Memory instructions: This corresponds to a store or load between a vector register (V)
        and the memory (M).
                F4: M → V (Vector Load)
                F5: V → M (Vector Store)
                Examples: X → V1 (vector load), V2 → Y (vector store)
   4. Vector reduction instructions: These include maximum, minimum, sum, and mean value.
                F6: Vi → s
                F7: Vi x Vj → s
   5. Gather and scatter instructions: Two vector registers are used to gather or scatter
        vector elements randomly throughout the memory, corresponding to the following mappings:
                F8: M → Vi x Vj (Gather)
                F9: Vi x Vj → M (Scatter)
                Gather is an operation that fetches from memory the nonzero elements of a sparse
                vector using indices.
                Scatter does the opposite, storing into memory a sparse vector whose nonzero
                entries are indexed.
     6. Masking instructions: The mask vector is used to compress or to expand a vector to a shorter
     or longer index vector (bit-per-index correspondence).
                F10: Vi x Vm → Vj (Vm is a binary vector)
   The gather, scatter, and masking instructions are very useful in handling sparse vectors or sparse
    matrices often encountered in practical vector processing applications.
   Sparse matrices are those in which most of the entries are zeros.
   Advanced vector processors implement these instructions directly in hardware.
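A brief NumPy sketch (added for illustration; this is ordinary array indexing, not CVF hardware) of what gather (F8), scatter (F9) and masking (F10) accomplish on a sparse vector; the data values are made up.

import numpy as np

memory = np.array([0.0, 3.5, 0.0, 0.0, 7.25, 0.0, 1.0, 0.0])

# Gather (F8): fetch the nonzero elements of the sparse vector using an index vector.
indices = np.flatnonzero(memory)          # index vector of the nonzero entries
gathered = memory[indices]                # dense vector of the nonzero values

# Scatter (F9): store (possibly updated) values back to their indexed home slots.
scattered = np.zeros_like(memory)
scattered[indices] = gathered * 2.0

# Masking (F10): compress a vector under a binary mask vector Vm.
mask = memory > 1.0                       # binary vector, one bit per element
compressed = memory[mask]

print(indices, gathered, scattered, compressed, sep="\n")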
         The m-way low-order interleaved memory structure allows m words to be accessed concurrently
         and overlapped.
     The access cycles in different memory modules are staggered. The low-order a bits select the
     modules, and the high-order b bits select the word within each module, where m = 2^a and a + b = n
     is the address length.
 To access a vector with a stride of 1, successive addresses are latched in the address buffer at the
     rate of one per cycle.
    Effectively it takes m minor cycles to fetch m words, which equals one (major) memory cycle as
    stated in Fig. 5.16b.
    If the stride is 2, the successive accesses must be separated by two minor cycles in order to avoid
    access conflicts. This reduces the memory throughput by one-half.
   If the stride is 3, there is no module conflict and the maximum throughput (m words) results.
   In general, C-access will yield the maximum throughput of m words per memory cycle if the stride
    is relatively prime to m, the number of interleaved memory modules.
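A small sketch (added here, not from the text) of why this happens: with m low-order interleaved modules, element i of a vector accessed with stride s falls in module (base + i·s) mod m, so the number of distinct modules usable per memory cycle is m / gcd(m, s).

from math import gcd

def words_per_memory_cycle(m, stride):
    # Distinct modules touched by m consecutive accesses = m / gcd(m, stride).
    return m // gcd(m, stride)

m = 8   # assumed eight-way low-order interleaving (a = 3 address bits select the module)
for stride in (1, 2, 3, 4):
    print(stride, words_per_memory_cycle(m, stride))
# stride 1 -> 8 words per memory cycle (full throughput), stride 2 -> 4 (halved),
# stride 3 -> 8 (relatively prime to m), stride 4 -> 2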
At the end of each memory cycle (Fig. 8.3b), m = 2^a consecutive words are latched. If the stride is
greater than 1, then the throughput decreases, roughly proportionally to the stride.
         It provides parallel pipelined access of a vector data set with high bandwidth.
        A special vector cache design is needed within each processor in order to guarantee smooth data
         movement between the memory and multiple vector processors.
                          Do 10 I=1,N
                          Load R1, X(I)
                          Load R2, Y(I)
                          Multiply R1, S
                          Add R2, R1
                          Store Y(I), R2
                          10 Continue
where X(I) and Y(I), I=1, 2,…. N, are two source vectors originally residing in the memory. After the
computation, the resulting vector is stored back to the memory. S is an immediate constant supplied to
the multiply instruction.
After vectorization, the above scalar SAXPY code is converted to a sequence of five vector
instructions:
                M( x : x + N-1) → V1                 Vector Load
                M( y : y + N-1) → V2                 Vector Load
                S x V1 → V1                          Vector Multiply
                V2 + V1 → V2                         Vector Add
                V2 → M( y : y + N-1)                 Vector Store
       x and y are the starting memory addresses of the X and Y vectors, respectively; V1 and V2 are two
       N-element vector registers in the vector processor.
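For illustration only (not part of the notes), the same SAXPY update written with NumPy array operations, which mirrors the vector load / multiply / add / store sequence above:

import numpy as np

def saxpy(s, x, y):
    # Y(i) <- S * X(i) + Y(i) for i = 1..N, expressed as whole-array (vector) operations.
    return s * x + y          # vector multiply followed by vector add

x = np.arange(1.0, 6.0)       # X(1..5)
y = np.ones(5)                # Y(1..5)
print(saxpy(2.0, x, y))       # [ 3.  5.  7.  9. 11.]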
 Typical CVFs (compound vector functions) for one-dimensional arrays are load, store, multiply,
     divide, logical and shifting operations.
   The number of available vector registers and functional pipelines impose some restrictions on how
    many CVFs can be executed simultaneously.
   Chaining:
    Chaining is an extension of technique of internal data forwarding practiced in scalar processors.
    Chaining is limited by the small number of functional pipelines available in a vector processor.
   Strip-mining:
    When a vector has a length greater than that of the vector registers, segmentation of the long vector
    into fixed-length segments is necessary. One vector segment is processed at a time (in Cray
    computers segment is 64 elements).
   Recurrence:
     The special case of vector loops in which the output of a functional pipeline may feed back into
     one of its own source vector registers
SIMD Implementation Models OR (Two models for constructing SIMD Super Computers)
    SIMD models differentiates on base of memory distribution and addressing scheme used.
    Most SIMD computers use a single control unit and distributed memories, except for a few that use
associative memories.
     1. Distributed memory model
    Spatial parallelism is exploited among the PEs.
    A distributed memory SIMD consists of an array of PEs (supplied with local memory) which are
     controlled by the array control unit.
    Program and data are loaded into the control memory through the host computer and distributed
     from there to PEs local memories.
    An instruction is sent to the control unit for decoding. If it is a scalar or program control operation,
     it will be directly executed by a scalar processor attached to the control unit.
    If the decoded instruction is a vector operation, it will be broadcast to all the PEs for parallel
     execution.
    Partitioned data sets are distributed to all the local memories attached to the PEs through a vector
     data bus.
   PEs are interconnected by a data routing network which performs inter-PE data communications
    such as shifting, permutation and other routing operations.
     8.4.2 CM-2 Architecture
The Connection Machine CM-2 produced by Thinking Machines Corporation was a fine-grain MPP
computer using thousands of bit-slice PEs in parallel to achieve a peak processing speed of above 10
Gflops.
Program Execution Paradigm
All programs started execution on a front-end, which issued microinstructions to the back-end
processing array when data-parallel operations were desired. The sequencer broke down these
microinstructions and broadcast them to all data processors in the array.
Data sets and results could be exchanged between the front-end and the processing array in one of
three ways as shown in the figure:
          Broadcasting: Broadcasting was carried out through the broadcast bus to all data
           processors at once.
          Global combining: Global combining allowed the front-end to obtain the sum, largest
           value, logical OR etc, of values one from each processor.
          Scalar memory bus: Scalar bus allowed the front-end to read or to write one 32-bit value
           at a time from or to the memories attached to the data processors.
Processing Array
The processing array contained from 4K to 64K bit-slice data processors(PEs), all of which were
controlled by a sequencer.
Processing Nodes
Each data processing node contained 32 bit-slice data processors, an optional floating point accelerator
and interfaces for inter processor communication.
Hypercube Routers
The router nodes on all processor chips were wired together to form a Boolean n-cube. A full
configuration of CM-2 had 4096 router nodes on processor chips interconnected as a 12-dimensional
hypercube.
    The UNIX subsystem handles traditional serial processing.
    The high-speed I/O, working together with the PE array, handles massively parallel computing.
     The MP-1 family includes configurations with 1024, 4096, and up to 16,384 processors. The peak
     performance of the 16K-processor configuration is 26,000 MIPS in 32-bit RISC integer operations.
     The system also has a peak floating-point capability of 1.5 Gflops in single-precision and 650
     Mflops in double-precision operations.
   Array Control Unit The ACU is a 14-MIPS scalar RISC processor using a demand paging
    instruction memory. The ACU fetches and decodes MP-1 instructions, computes addresses and
    scalar data values, issues control signals to the PE array, and monitors the status of the PE array.
The ACU is microcoded to achieve horizontal control of the PE array. Most scalar ACU instructions
execute in one 70-ns clock. The whole ACU is implemented on one PC board.
An implemented functional unit, called a memory machine, is used in parallel with the ACU. The
memory machine performs PE array load and store operations, while the ACU broadcasts arithmetic,
logic, and routing instructions to the PEs for parallel execution.
   Physical memory was distributed among the processing nodes in various clusters. The distributed
    memory formed a global address space.
   Cache coherence was maintained using an invalidating, distributed directory-based protocol. For
    each memory block, the directory kept track of remote nodes caching it.
   When a write occurred, point-to-point messages were sent to invalidate remote copies of the block.
   Acknowledgement messages were used to inform the originating node when an invalidation was
    completed.
   Two levels of local cache were used per processing node. Loads and writes were separated with the
    use of write buffers for implementing weaker memory consistency models.
   The main memory was shared by all processing nodes in the same cluster. To facilitate prefetching
    and the directory-based coherence protocol, directory memory and remote-access caches were used
    for each cluster.
   The remote-access cache was shared by all processors in the same cluster.
     The SVM Concept
   Figure 9.2 shows the structure of a distributed shared memory. A global virtual address space is
    shared among processors residing at a large number of loosely coupled processing nodes.
   The idea of Shared virtual memory (SVM) is to implement coherent shared memory on a network
    of processors without physically shared memory.
   The coherent mapping of SVM on a message-passing multicomputer architecture is shown in Fig.
    9.2b.
 The system uses virtual addresses instead of physical addresses for memory references.
   Each virtual address space can be as large as a single node can provide and is shared by all nodes in
    the system.
   The SVM address space is organized in pages which can be accessed by any node in the system. A
    memory-mapping manager on each node views its local memory as a large cache of pages for its
    associated processor.
    Page Swapping
   A memory reference causes a page fault when the page containing the memory location is not in a
    processor’s local memory.
   When a page fault occurs, the memory manager retrieves the missing page from the memory of
    another processor.
   If there is a page frame available on the receiving node, the page is moved in.
   Otherwise, the SVM system uses page replacement policies to find an available page frame,
    swapping its contents to the sending node.
   A hardware MMU can set the access rights (nil, read-only, writable) so that a memory access
    violating memory coherence will cause a page fault.
   The memory coherence problem is solved in IVY through distributed fault handlers and their
    servers. To client programs, this mechanism is completely transparent.
   The large virtual address space allows programs to be larger in code and data space than the
    physical memory on a single node.
   This SVM approach offers the ease of shared-variable programming in a message-passing
    environment.
   In addition, it improves software portability and enhances system scalability through modular
    memory growth.
     1. Prefetching Techniques
             Prefetching uses knowledge about the expected misses in a program to move the
             corresponding data close to the processor before it is actually needed.
             Prefetching can be classified based on whether it is
                   Binding
                   Non-binding
             or whether it is controlled by
                   Hardware
                   Software
 Binding prefetching: the value of a later reference (e.g., a register load) is bound at the time when
     the prefetch completes.
 Non binding prefetching : brings data close to the processor, but the data remains visible to the
    cache coherence protocol and is thus kept consistent until the processor actually reads the value.
 Hardware Controlled Prefetching: includes schemes such as long cache lines and instruction
    lookahead.
 Software Controlled Prefetching: explicit prefetch instructions are issued. Allows the prefetching
    to be done selectively and extends the possible interval between prefetch issue and actual reference.
     2. Coherent Caches
   While the cache coherence problem is easily solved for small bus-based multiprocessors through
    the use of snoopy cache coherence protocols, the problem is much more complicated for large scale
    multiprocessors that use general interconnection networks.
   As a result, some large scale multiprocessors did not provide caches, others provided caches that
    must be kept coherent by software, and still others provided full hardware support for coherent
    caches.
   Caching of shared read-write data provided substantial gains in performance. The largest benefit
    came from a reduction of cycles wasted due to read misses. The cycles wasted due to write misses
    were also reduced.
   Hardware cache coherence is an effective technique for substantially increasing the performance
    with no assistance from the compiler or programmer.
Memory consistency models can be characterized in terms of:
               Relaxation
               Synchronizing vs non-synchronizing
               Issue vs View-based
               Relative model strength
9.2 Principles of Multithreading
9.2.1 Multithreading Issues and Solutions
Multithreading demands that the processor be designed to handle multiple contexts simultaneously on
a context-switching basis.
Architecture Environment
A multithreaded MPP system is modeled by a network of processor (P) and memory (M) nodes, as
shown in Fig. 9.11a. The distributed memories form a global address space.
Four machine parameters are defined below to analyze the performance of this network:
   1. The Latency (L): This is the communication latency on a remote memory access. The value of
       L includes the network delays, cache-miss penalty and delays caused by contentions in split
       transactions.
   2. The number of Threads (N): This is the number of threads that can be interleaved in each
       processor. A thread is represented by a context consisting of a program counter, a register set
       and the required context status words.
   3. The context-switching overhead (C): This refers to the cycles lost in performing context
       switching in a processor. This time depends on the switch mechanism and the amount of
       processor states devoted to maintaining active threads.
   4. The interval between switches (R): This refers to the cycles between switches triggered by
       remote reference. The inverse p=1/R is called the rate of requests for remote accesses. This
       reflects a combination of program behavior and memory system design.
   In order to increase efficiency, one approach is to reduce the rate of requests by using distributed
   coherent caches. Another is to eliminate processor waiting through multithreading.
   Multithreaded Computations
   Fig 9.11b shows the structure of the multithreaded parallel computations model.
   The computation starts with a sequential thread (1), followed by supervisory scheduling (2), where
   the processors begin threads of computation (3), by intercomputer messages that update variables
   among the nodes when the computer has distributed memory (4), and finally by synchronization
   prior to beginning the next unit of parallel work (5).
   The communication overhead period (4) inherent in distributed memory structures is usually
   distributed throughout the computation and is possibly completely overlapped.
   Message passing overhead in multicomputers can be reduced by specialized hardware operating in
   parallel with computation.
   Communication bandwidth limits granularity, since a certain amount of data has to be transferred
   with other nodes in order to complete a computational grain. Message passing calls (4) and
   synchronization (5) are nonproductive.
   Fast mechanisms to reduce or to hide these delays are therefore needed. Multithreading is not
   capable of speedup in the execution of single threads, while weak ordering or relaxed consistency
   models are capable of doing this.
   Problems of Asynchrony
   Massively parallel processors operate asynchronously in a network environment. The asynchrony
   triggers two fundamental latency problems:
               1. Remote loads
               2. Synchronizing loads
   Clearly the cost of thread switching should be much smaller than that of the latency of the remote
    load, or else the processor might as well wait for the remote load’s response.
   As the internode latency increases, more threads are needed to hide it effectively. Another concern
    is to make sure that messages carry continuations. Suppose, after issuing a remote load from thread
    T1 (Fig 9.13a), we switch to thread T2, which also issues a remote load.
   The responses may not return in the same order. This may be caused by requests traveling different
    distances, through varying degrees of congestion, to destination nodes whose loads differ greatly,
    etc.
   One way to cope with the problem is to associate each remote load and response with an identifier
    for the appropriate thread, so that it can be reenabled on the arrival of a response.
    2. Distributed Caching
   The concept of distributed caching is shown in Fig. 9.13b. Every memory location has an owner
    node. For example, N1 owns B and N2 owns A.
   The directories are used to contain import-export lists and state whether the data is shared (for
    reads, many caches may hold copies) or exclusive (for writes, one cache holds the current value).
    The directories multiplex among a small number of contexts to cover the cache loading effects.
    The Distributed Caching offers a solution for the remote-loads problem, but not for the
    synchronizing-loads problem.
   Multithreading offers a solution for remote loads and possibly for synchronizing loads.
   The two approaches can be combined to solve both types of remote access problems.
The objective is to maximize the fraction of time that the processor is busy; we use the efficiency of the
processor as our performance index, given by:
        Efficiency = busy / (busy + switching + idle)
where busy, switching and idle represent the amount of time, measured over some large interval, that
the processor is in the corresponding state.
The basic idea behind a multithreaded machine is to interleave the execution of several contexts in
order to dramatically reduce the value of idle, but without overly increasing the magnitude of
switching.
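The helper below (added for illustration) computes this efficiency index. The multithreaded_efficiency function uses a common two-regime approximation (linear growth with N until the latency L is fully hidden, then saturation at R/(R + C)), stated here as an assumption rather than a formula from this excerpt.

def efficiency(busy, switching, idle):
    return busy / (busy + switching + idle)

def multithreaded_efficiency(N, R, L, C):
    # Assumed model: before saturation each thread's R useful cycles are spread over an
    # R + C + L period shared with the other threads; after saturation only the context
    # switch overhead C separates useful runs.
    linear = N * R / (R + C + L)
    saturated = R / (R + C)
    return min(linear, saturated)

for N in (1, 2, 4, 8):
    print(N, round(multithreaded_efficiency(N, R=16, L=128, C=4), 2))
# 1 -> 0.11, 2 -> 0.22, 4 -> 0.43, 8 -> 0.8 (saturated): idle shrinks as N grows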
Context-Switching Policies
Different multithreaded architectures are distinguished by the context-switching policies adopted.
Four switching policies are:
   1. Switch on cache miss – This policy corresponds to the case where a context is preempted
       when it causes a cache miss.
       In this case, R is taken to be the average interval between misses (in cycles) and L the time
       required to satisfy the miss.
       Here, the processor switches contexts only when it is certain that the current one will be
       delayed for a significant number of cycles.
   2. Switch on every load - This policy allows switching on every load, independent of whether it
       will cause a miss or not.
       In this case, R represents the average interval between loads. A general multithreading model
       assumes that a context is blocked for L cycles after every switch; but in the case of a switch-on-
       load processor, this happens only if the load causes a cache miss.
   3. Switch on every instruction – This policy allows switching on every instruction, independent
        of whether it is a load or not. Successive instructions become independent, which will benefit
       pipelined execution.
   4. Switch on block of instruction – Blocks of instructions from different threads are interleaved.
       This will improve the cache-hit ratio due to locality. It will also benefit single-context
       performance.