Advanced Computer Architecture
Chapter 7
Multiprocessors and Multicomputers
 Book: “Advanced Computer Architecture – Parallelism, Scalability, Programmability”, Hwang & Jotwani
                      In this chapter…
•   Multiprocessor System Interconnects
•   Cache Coherence and Synchronization Mechanisms
•   Three Generations of Multicomputers
•   Message Routing Schemes
 MULTIPROCESSOR SYSTEM INTERCONNECTS
Sumit Mittu, Assistant Professor, CSE/IT, Lovely Professional University
  MULTIPROCESSOR SYSTEM INTERCONNECTS
• Network Characteristics
   o Topology
       • Dynamic Networks
   o Timing control protocol
       • Synchronous (with global clock)
       • Asynchronous (with handshake or interlocking mechanism)
   o Switching method
       • Circuit switching
       • Packet switching
   o Control Strategy
       • Centralized (global controller to receive requests from all devices and grant network access)
       • Distributed (requests handled by local devices independently)
  MULTIPROCESSOR SYSTEM INTERCONNECTS
• Hierarchical Bus System
   o Local Bus (board level)
        • Memory bus, data bus
   o Backplane Bus (backplane level)
        • VME bus (IEEE 1014-1987), Multibus II (IEEE 1296-1987), Futurebus+ (IEEE 896.1-1991)
   o I/O Bus (I/O level)
   o E.g. Encore Multimax multiprocessor's nanobus
        • 20 slots
        • 32-bit address path
        • 64-bit data path
        • Clock rate: 12.5 MHz
        • Total Memory bandwidth: 100 Megabytes per second
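The quoted 100 MB/s total bandwidth follows directly from the data-path width and clock rate; a quick check of the arithmetic:

```python
# Peak bus bandwidth = data-path width (in bytes) x clock rate (Hz),
# using the Encore Multimax nanobus figures listed above.
data_path_bits = 64
clock_rate_hz = 12.5e6  # 12.5 MHz

bandwidth_bytes_per_s = (data_path_bits // 8) * clock_rate_hz
print(bandwidth_bytes_per_s / 1e6)  # 100.0 megabytes per second
```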
 MULTIPROCESSOR SYSTEM INTERCONNECTS
• Hierarchical Buses and Caches
   o Cache Levels
       • First level caches
       • Second level caches
   o Buses
       • (Intra) Cluster Bus
       • Inter-cluster bus
   o Cache coherence
       • Snoopy cache protocol for coherence among first level caches of same cluster
       • Intra-cluster cache coherence controlled among second level caches and results passed to
          first level caches
   o Use of Bridges between multiprocessor clusters
  MULTIPROCESSOR SYSTEM INTERCONNECTS
• Hierarchical Buses and Caches
 MULTIPROCESSOR SYSTEM INTERCONNECTS
• Crossbar Switch Design
   o Based on number of network stages
       • Single stage (or recirculating) networks
       • Multistage networks
            o Blocking networks
            o Non-blocking (re-arranging) networks
       • Crossbar networks
             o n × m and n² cross-point switch design
            o Crossbar benefits and limitations
• Multiport Memory Design
   o Multiport Memory
MULTIPROCESSOR SYSTEM INTERCONNECTS
            CACHE COHERENCE MECHANISMS
• Cache Coherence Problem
   o Inconsistent copies of same memory block in different caches
   o Sources of inconsistency:
       • Sharing of writable data
       • Process migration
       • I/O activity
• Protocol Approaches
   o Snoopy Bus Protocol
   o Directory Based Protocol
• Write Policies
   o (Write-back, Write-through) x (Write-invalidate, Write-update)
            CACHE COHERENCE MECHANISMS
• Snoopy Bus Protocols
   o Write-through caches
      • Write invalidate coherence protocol for write-through caches
      • Write-update coherence protocol for write-through caches
      • Data item states:
             o VALID
             o INVALID
      • Possible operations:
              o Read by same processor R(i); read by different processor R(j)
              o Write by same processor W(i); write by different processor W(j)
              o Replace by same processor Z(i); replace by different processor Z(j)
           CACHE COHERENCE MECHANISMS
• Snoopy Bus Protocols
   o Write-through caches – write invalidate scheme
            New state after each operation:

            Operation    From Valid    From Invalid
              R(i)         Valid          Valid
              W(i)         Valid          Valid
              Z(i)         Invalid        Invalid
              R(j)         Valid          Invalid
              W(j)         Invalid        Invalid
              Z(j)         Valid          Invalid
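The write-through, write-invalidate transition table above can be sketched as a small state machine (a minimal illustration in plain Python, using the slide's R/W/Z and i/j notation):

```python
# Write-through, write-invalidate snoopy protocol transitions.
# Events: Read/Write/Replace by the same processor (i) or another (j).
TRANSITIONS = {
    ("Valid",   "R(i)"): "Valid",
    ("Valid",   "W(i)"): "Valid",
    ("Valid",   "Z(i)"): "Invalid",
    ("Valid",   "R(j)"): "Valid",
    ("Valid",   "W(j)"): "Invalid",
    ("Valid",   "Z(j)"): "Valid",
    ("Invalid", "R(i)"): "Valid",
    ("Invalid", "W(i)"): "Valid",
    ("Invalid", "Z(i)"): "Invalid",
    ("Invalid", "R(j)"): "Invalid",
    ("Invalid", "W(j)"): "Invalid",
    ("Invalid", "Z(j)"): "Invalid",
}

def next_state(state, event):
    return TRANSITIONS[(state, event)]

# A remote write invalidates a valid copy; a local read makes it valid again.
state = "Valid"
for event in ["W(j)", "R(i)"]:
    state = next_state(state, event)
print(state)  # Valid
```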
           CACHE COHERENCE MECHANISMS
• Snoopy Bus Protocols
   o Write-back caches
       • Ownership protocol: write-invalidate coherence protocol for write-back caches
      • Data item states:
             o RO : Read Only (Valid state)
             o RW : Read Write (Valid state)
             o INV : Invalid state
      • Possible operations:
              o Read by same processor R(i); read by different processor R(j)
              o Write by same processor W(i); write by different processor W(j)
              o Replace by same processor Z(i); replace by different processor Z(j)
                  CACHE COHERENCE MECHANISMS
  • Snoopy Bus Protocols
          o Write-back caches – write invalidate (ownership protocol) scheme
  New state after each operation:

  Operation    From RO (Valid)    From RW (Valid)    From INV (Invalid)
    R(i)             RO                 RW                  RO
    W(i)             RW                 RW                  RW
    Z(i)             INV                INV                 INV
    R(j)             RO                 RO                  INV
    W(j)             INV                INV                 INV
    Z(j)             RO                 RW                  INV
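The ownership-protocol table above can likewise be traced programmatically (a minimal sketch; the event names follow the slide's notation):

```python
# Write-back ownership (write-invalidate) protocol transitions, with
# states RO (read-only), RW (read-write, owned) and INV (invalid).
OWNERSHIP = {
    "RO":  {"R(i)": "RO",  "W(i)": "RW", "Z(i)": "INV",
            "R(j)": "RO",  "W(j)": "INV", "Z(j)": "RO"},
    "RW":  {"R(i)": "RW",  "W(i)": "RW", "Z(i)": "INV",
            "R(j)": "RO",  "W(j)": "INV", "Z(j)": "RW"},
    "INV": {"R(i)": "RO",  "W(i)": "RW", "Z(i)": "INV",
            "R(j)": "INV", "W(j)": "INV", "Z(j)": "INV"},
}

# A local write takes ownership (RW); a remote read then demotes the
# block to RO, since the owner must supply it and give up exclusivity.
state = "RO"
for event in ["W(i)", "R(j)"]:
    state = OWNERSHIP[state][event]
print(state)  # RO
```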
           CACHE COHERENCE MECHANISMS
• Snoopy Bus Protocols
   o Write-once Protocol
      • First write using write-through policy
      • Subsequent writes using write-back policy
      • In both cases, data item copy in remote caches is invalidated
      • Data item states:
              o Valid : cache block is consistent with the main memory copy
              o Reserved : data has been written exactly once and is consistent with the main memory copy
              o Dirty : data has been written more than once and is not consistent with the main memory copy
              o Invalid : block is not found in the cache or is inconsistent with the main memory copy
          CACHE COHERENCE MECHANISMS
• Snoopy Bus Protocols
   o Write-once Protocol
      • Cache events and actions:
             o Read-miss
             o Read-hit
             o Write-miss
             o Write-hit
             o Block replacement
         CACHE COHERENCE MECHANISMS
• Multilevel Cache Coherence
             CACHE COHERENCE MECHANISMS
• Protocol Performance issues
    o Snoopy Cache Protocol Performance determinants:
        • Workload Patterns
        • Implementation Efficiency
    o Goals/Motivation behind using snooping mechanism
        • Reduce bus traffic
        • Reduce effective memory access time
    o Data Pollution Point
         • Miss ratio decreases as block size increases (larger blocks raise the probability of
            finding a desired data item in the cache), but only up to the data pollution point.
         • Beyond the data pollution point, the miss ratio starts to increase again as block size grows.
     o Ping-Pong effect on data shared between multiple caches
         • If two processes update a data item alternately, the data will continually migrate between
            the two caches, causing a high miss rate.
  THREE GENERATIONS OF MULTICOMPUTERS
• Multicomputer v/s Multiprocessor
• Design Choices for Multicomputers
   o Processors
       • Low cost commodity (off-the-shelf) processors
   o Memory Structure
       • Distributed memory organization
       • Local memory with each processor
   o Interconnection Schemes
        • Message passing, point-to-point, direct networks with send/receive semantics, with or
           without uniform message communication speed
   o Control Strategy
       • Asynchronous MIMD, MPMD and SPMD operations
  THREE GENERATIONS OF MULTICOMPUTERS
• The Past, Present and Future Development
   o First Generation
        • Example Systems: Caltech’s Cosmic Cube, Intel iPSC/1, Ametek S/14, nCube/10
   o Second Generation
        • Example Systems: iPSC/2, i860, Delta, nCube/2, Supernode 1000, Ametek Series 2010
   o Third Generation
        • Example Systems: Caltech’s Mosaic C, J-Machine, Intel Paragon
    o First- and second-generation multicomputers are regarded as medium-grain systems
    o Third-generation multicomputers are regarded as fine-grain systems
    o A fine-grain approach combined with shared memory can, in theory, combine the relative merits
      of multiprocessors and multicomputers in a heterogeneous processing environment
   THREE GENERATIONS OF MULTICOMPUTERS

                                          1st Generation   2nd Generation   3rd Generation
   Typical Node Attributes
     MIPS                                        1               10              100
     MFLOPS (scalar)                            0.1               2               40
     MFLOPS (vector)                             10              40              200
     Memory Size (in MB)                        0.5               4               32
   Typical System Attributes
     Number of Nodes (N)                         64              256             1024
     MIPS                                        64             2560            100 K
     MFLOPS (scalar)                            6.4              512             40 K
     MFLOPS (vector)                            640             10 K            200 K
     Memory Size (in MB)                         32              1 K             32 K
   Communication Latency
     Local neighbour (in microseconds)         2000                5              0.5
     Non-local node (in microseconds)          6000                5              0.5
                 MESSAGE PASSING SCHEMES
• Message Routing Schemes
• Message Formats
   o Messages
   o Packets
    o Flits (flow control digits)
       • Data Only Flits
       • Sequence Number
       • Routing Information
• Store-and-forward routing
• Wormhole routing
            MESSAGE PASSING SCHEMES
• Asynchronous Pipelining
                MESSAGE PASSING SCHEMES
• Latency Analysis
   o L: Packet length (in bits)
   o W: Channel Bandwidth (in bits per second)
   o D: Distance (number of nodes traversed minus 1)
   o F: Flit length (in bits)
    o Communication Latency in Store-and-forward Routing
        • T_SF = L(D + 1) / W
    o Communication Latency in Wormhole Routing
        • T_WH = L/W + F·D/W
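The two latency formulas above can be compared numerically (a minimal sketch; the packet, flit, bandwidth and distance values are illustrative, not from the slides):

```python
# Store-and-forward vs wormhole routing latency, per the formulas:
#   T_SF = L * (D + 1) / W        T_WH = L / W + F * D / W
def t_sf(L, W, D):
    return L * (D + 1) / W

def t_wh(L, W, D, F):
    return L / W + F * D / W

# Hypothetical numbers: 1024-bit packet, 8-bit flits,
# 1 Gbit/s channels, a 9-hop path (D = 8).
L, W, D, F = 1024, 1e9, 8, 8
print(t_sf(L, W, D) * 1e6)      # store-and-forward latency, microseconds
print(t_wh(L, W, D, F) * 1e6)   # wormhole latency, microseconds
```

Wormhole routing's latency is nearly independent of distance when L >> F·D, which is why it dominates in later multicomputer generations.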
              Advanced Computer Architecture
                                      Chapter 8
Multivector and SIMD Computers
Book: “Advanced Computer Architecture – Parallelism, Scalability, Programmability”, Hwang & Jotwani
                      In this chapter…
•   Vector Processing Principles
•   Compound Vector Operations
•   Vector Loops and Chaining
•   SIMD Computer Implementation Models
             VECTOR PROCESSING PRINCIPLES
• Vector Processing Definitions
    o   Vector
    o   Stride
    o   Vector Processor
    o   Vector Processing
    o   Vectorization
    o   Vectorizing Compiler or Vectorizer
• Vector Instruction Types
    o Vector-vector instructions
    o Vector-scalar instructions
    o Vector-memory instructions
          VECTOR PROCESSING PRINCIPLES
• Vector-Vector Instructions
    o F1:         Vi → Vj
    o F2:         Vi × Vj → Vk
    o Examples:   V1 = sin(V2);  V3 = V1 + V2
• Vector-Scalar Instructions
    o F3:         s × Vi → Vj
    o Examples:   V2 = 6 + V1
• Vector-Memory Instructions
    o F4:         M → V           (Vector Load)
    o F5:         V → M           (Vector Store)
    o Examples:   X = V1;  V2 = Y
            VECTOR PROCESSING PRINCIPLES
• Vector Reduction Instructions
   o F6:       Vi → s
   o F7:       Vi × Vj → s
• Gather and Scatter Instructions
   o F8:       M → Vi × Vj        (Gather)
   o F9:       Vi × Vj → M        (Scatter)
• Masking
   o F10:      Vi × Vm → Vj       (Vm is a binary vector)
• Examples…
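The gather, scatter and masking semantics can be sketched in plain Python (an illustrative model only; the index vector, memory contents and variable names are invented, not machine syntax):

```python
# Gather/scatter/masking semantics, modelled with Python lists.
memory = [10, 20, 30, 40, 50, 60]

# Gather (F8): indexed loads from memory into a vector register.
v_idx = [4, 0, 2]
v1 = [memory[i] for i in v_idx]            # [50, 10, 30]

# Scatter (F9): indexed stores from a vector register into memory.
for i, val in zip(v_idx, [99, 98, 97]):
    memory[i] = val                        # memory[4]=99, [0]=98, [2]=97

# Masking (F10): compress a vector under a binary mask vector Vm.
v2 = [7, 8, 9, 10]
vm = [1, 0, 1, 0]
v3 = [x for x, m in zip(v2, vm) if m]      # [7, 9]

print(v1, memory, v3)
```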
          VECTOR PROCESSING PRINCIPLES
• Vector-Access Memory Schemes
   o Vector-operand Specifications
       • Base address, stride and length
   o C-Access Memory Organization
       • Low-order m-way interleaved memory
   o S-access Memory Organizations
       • High-order m-way interleaved memory
   o C/S Access Memory Organization
• Early Supercomputers (Vector Processors)
   o Cray Series             ETA 10E           NEC Sx-X 44
   o CDC Cyber               Fujitsu VP2600    Hitachi 820/80
           VECTOR PROCESSING PRINCIPLES
• Relative Vector/Scalar Performance
   o Vector/scalar speed ratio                r
   o Vectorization ratio in program           f
    o Relative Performance P is given by:
         • P = 1 / ((1 − f) + f/r) = r / ((1 − f)·r + f)
   o When f is low, the speedup cannot be high even with very high r
    o Limiting Case:
        • P → 1 as f → 0
    o Maximum Case:
        • P → r as f → 1
    o Powerful single-chip processors and multicore systems-on-a-chip provide High-Performance
      Computing (HPC) using MIMD and/or SPMD configurations with a large number of processors.
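The formula and its limiting cases are easy to check numerically; note how a modest vectorization ratio caps the speedup even with a very fast vector unit:

```python
# Relative vector/scalar performance P = r / ((1 - f) * r + f),
# where f is the vectorization ratio and r the vector/scalar speed ratio.
def relative_performance(f, r):
    return r / ((1 - f) * r + f)

# Even with r = 100, f = 0.5 gives a speedup below 2:
print(round(relative_performance(0.5, 100), 2))   # 1.98
print(round(relative_performance(0.0, 100), 2))   # 1.0   (limiting case, f -> 0)
print(round(relative_performance(1.0, 100), 2))   # 100.0 (maximum case, f -> 1)
```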
           COMPOUND VECTOR PROCESSING
• Compound Vector Operations
   o Compound Vector Functions (CVFs)
       • Composite function of vector operations converted from a looping structure of linked scalar
         operations
   o CVF Example: The SAXPY (Single-precision A multiply X Plus Y) Code
       • For I = 1 to N
            o Load            R1, X(I)
            o Load            R2, Y(I)
            o Multiply        R1, A
            o Add             R2, R1
            o Store           Y(I), R2
         • (End of Loop)
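The SAXPY loop above, transcribed directly into Python with the same load/multiply/add/store steps per iteration (values of A, X and Y are illustrative):

```python
# SAXPY: Y(I) = A * X(I) + Y(I) for I = 1..N, as in the scalar loop above.
A = 2.0
X = [1.0, 2.0, 3.0]
Y = [10.0, 20.0, 30.0]

for i in range(len(X)):
    r1 = X[i]          # Load     R1, X(I)
    r2 = Y[i]          # Load     R2, Y(I)
    r1 = r1 * A        # Multiply R1, A
    r2 = r2 + r1       # Add      R2, R1
    Y[i] = r2          # Store    Y(I), R2

print(Y)  # [12.0, 24.0, 36.0]
```

A vectorizing compiler converts exactly this linked scalar structure into a single compound vector operation.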
           COMPOUND VECTOR PROCESSING
• One-dimensional CVF Examples
    o V1(I) = V2(I) + V3(I) × V4(I)
    o V1(I) = B(I) + C(I)
    o A(I) = V1(I) × S + B(I)
    o A(I) = V1(I) + B(I) + C(I)
    o A(I) = Q × V1(I) × (R × B(I) + C(I)), etc.
    Legend:
    o Vi(I) are vector registers
    o A(I), B(I), C(I) are vectors in memory
    o Q, R, S are scalars available from scalar registers
           COMPOUND VECTOR PROCESSING
• Vector Loops
   o Vector segmentation or strip-mining approach
   o Example
• Vector Chaining
   o Example: SAXPY code
       • Limited Chaining using only one memory-access pipe in Cray-I
       • Complete Chaining using three memory-access pipes in Cray X-MP
• Functional Unit Independence
• Vector Recurrence
             COMPOUND VECTOR PROCESSING
           SIMD COMPUTER ORGANIZATIONS
• SIMD Computer Variants
   o Array Processor
   o Associative Processor
• SIMD Processor v/s SISD v/s Vector Processor Operation
   o Illustration: for(i=0;i<5;i++) a[i] = a[i]+2;
   o Lockstep mode of operation in SIMD processor
   o Relative Performance comparison
• SIMD Implementation Models
   o Distributed Memory Model
       • E.g. Illiac IV
   o Shared memory Model
       • E.g. BSP (Burroughs Scientific Processor)
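The slide's illustration for(i=0;i<5;i++) a[i] = a[i]+2; can be contrasted in SISD and SIMD styles (a conceptual sketch only; Python executes both sequentially, but the second form models one lockstep step across PEs):

```python
# for(i=0;i<5;i++) a[i] = a[i] + 2;  -- SISD vs SIMD execution models.
a = [1, 2, 3, 4, 5]

# SISD: five sequential time steps, one element per instruction.
sisd = []
for x in a:
    sisd.append(x + 2)

# SIMD: conceptually one time step; the same instruction is broadcast
# and every PE adds 2 to its own local element in lockstep.
simd = [x + 2 for x in a]

print(sisd == simd, simd)  # True [3, 4, 5, 6, 7]
```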
            SIMD COMPUTER ORGANIZATIONS
• SIMD Instructions
   o Scalar Operations
       • Arithmetic/Logical
   o Vector Operations
       • Arithmetic/Logical
   o Data Routing Operations
       • Permutations, broadcasts, multicasts, rotation and shifting
   o Masking Operations
       • Enable/Disable PEs
• Host and I/O
• Bit-slice and Word-slice Processing
   o WSBS, WSBP, WPBS, WPBP
              Advanced Computer Architecture
                                      Chapter 9
            …Dataflow Architectures
Book: “Advanced Computer Architecture – Parallelism, Scalability, Programmability”, Hwang & Jotwani
                       In this chapter…
•   Evolution of Dataflow Computers
•   Dataflow Graphs
•   Static v/s Dynamic Data Flow Computers
•   Pure Dataflow Machines
•   Explicit Token Store Machines
•   Hybrid and Unified Architectures
•   Dataflow v/s Control flow Computers
    DATAFLOW AND HYBRID ARCHITECTURES
• Data-driven machines
• Evolution of Dataflow Machines
• Dataflow Graphs
   o Dataflow Graphs examples.
   o Activity Templates and Activity Store
   o Example: dataflow graph for cos x
         • cos x ≅ 1 − x²/2! + x⁴/4! − x⁶/6!
   o More examples
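The four-term series for cos x above is easy to verify against the library cosine (the test value x = 0.5 is illustrative):

```python
import math

# cos x ≈ 1 - x²/2! + x⁴/4! - x⁶/6!, the expression whose dataflow
# graph is discussed above; each term is an independent computation.
def cos_approx(x):
    return (1
            - x**2 / math.factorial(2)
            + x**4 / math.factorial(4)
            - x**6 / math.factorial(6))

x = 0.5
print(cos_approx(x), math.cos(x))  # the two values agree to ~1e-7
```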
    DATAFLOW AND HYBRID ARCHITECTURES
• Static Dataflow Computers
   o Special Feedback Acknowledge Signals between nodes
   o Firing Rule:
         • A node is enabled as soon as tokens are present on all input arcs and there is no token on
            any of its output arcs
   o Example: Dennis Machine (1974)
• Dynamic Dataflow Computers
    o Tagged tokens
    o Firing Rule:
         • A node is enabled as soon as tokens with identical tags are present at each of its input arcs
    o Example: MIT Tagged Token Dataflow Architecture (TTDA) machine (simulated only, never built)
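The tagged-token firing rule can be sketched as a small matching store (a minimal illustration; the node/arc names and token values are invented):

```python
# Tagged-token firing rule: a two-input node fires only when tokens
# with identical tags are present on both of its input arcs.
from collections import defaultdict

waiting = defaultdict(dict)   # (node, tag) -> {arc: value}

def arrive(node, arc, tag, value, fired):
    store = waiting[(node, tag)]
    store[arc] = value
    if len(store) == 2:       # both input arcs hold a token with this tag
        fired.append((node, tag, store["left"] + store["right"]))
        del waiting[(node, tag)]

fired = []
arrive("add", "left", tag=0, value=3, fired=fired)
arrive("add", "left", tag=1, value=10, fired=fired)  # different tag: waits
arrive("add", "right", tag=0, value=4, fired=fired)  # tag 0 matches: fires
print(fired)  # [('add', 0, 7)]
```

Explicit token store machines replace this associative matching with direct memory access guarded by full/empty bits.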
    DATAFLOW AND HYBRID ARCHITECTURES
• Diagrams of static dataflow and dynamic dataflow, from Hwang and Briggs
    DATAFLOW AND HYBRID ARCHITECTURES
• Pure Dataflow Machines
   o TTDA(1983)
       • TTDA was simulated but never built
   o Manchester Dataflow Computer (1982)
       • Operated asynchronously using a separate clock for each PE
   o ETL Sigma-1 (1987)
        • 128 PEs, fully synchronous with a 10-MHz clock
       • Implemented I-structured memory proposed in TTDA
       • Lacked High Level Languages for users
    DATAFLOW AND HYBRID ARCHITECTURES
• Explicit Token Store Machines
    o Eliminates associative token matching.
    o Waiting token memory is directly accessed using full/empty bits.
   o Examples
        • MIT/Motorola Monsoon (proposed 1988; operational 1991)
              o Multithreading support
              o 8 processors
              o 8 I-structure memory modules
              o 8x8 crossbar network
        • ETL EM-4 (1989)
              o Extension of Sigma-1
              o Proposed 1024 nodes; Operational Implementation 80 nodes
    DATAFLOW AND HYBRID ARCHITECTURES
• Hybrid and Unified Architectures
   o Combining Features of von-Neumann and Dataflow architectures
   o Examples:
       • MIT P-RISC (1988)
       • IBM Empire (1991)
       • MIT/Motorola *T (1991)
   o “RISC-ified” dataflow architecture
       • Implemented in P-RISC
       • Splitting complex dataflow instructions into separate simple component instructions
       • Tighter encoding and longer threads for better performance
• Dataflow Processing v/s Control Flow Processing
    DATAFLOW AND HYBRID ARCHITECTURES
• Computing ab + a/c with:
   (a) control flow; (b) dataflow. Pure dataflow basic execution pipeline: (c) single-token-per-arc dataflow;
   (d) tagged-token dataflow; (e) explicit token store dataflow