UNIT - IV
Parallel and Scalable Architectures, Multiprocessors and Multicomputers, Multiprocessor
system interconnects, cache coherence and synchronization mechanism, Three Generations of
Multicomputers, Message-passing Mechanisms, Multivector and SIMD computers.
                        Parallel and Scalable Architectures (10 Marks)
1. Introduction
       Parallel architecture refers to computer systems that use multiple processing
        elements (PEs) to perform tasks simultaneously.
       Scalable architecture means the system can increase performance proportionally
        as processors are added, without major redesign.
Such architectures are the backbone of supercomputers, data centers, AI/ML workloads,
and scientific simulations.
2. Need for Parallel and Scalable Architectures
       Increasing demand for high-performance computing (HPC).
       Limitations of single-core sequential processing (Von Neumann bottleneck).
       To support large-scale data processing, graphics, AI, weather forecasting,
        simulations.
3. Characteristics
    1. Multiple Processing Elements: Can be CPUs, GPUs, or vector processors.
    2. Levels of Parallelism:
           o Instruction-level (ILP) → pipelining.
           o Data-level (DLP) → SIMD, vector.
           o Task-level (TLP) → independent processes in parallel.
           o Thread-level (Multithreading).
    3. Interconnection Network: Provides communication between processors (Bus,
       Crossbar, Mesh, Hypercube).
    4. Scalability: System must maintain efficiency as processors grow.
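A minimal sketch of the data- and thread-level parallelism listed above, using an OpenMP parallel loop (assumes a compiler with OpenMP support, e.g. gcc -fopenmp; the array names and sizes are illustrative only):

    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    int main(void) {
        static double a[N], b[N], c[N];
        /* Initialise the input vectors sequentially. */
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

        /* Data-level parallelism: the iterations are independent,
           so OpenMP splits them across the available threads. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        printf("c[10] = %f, threads available = %d\n",
               c[10], omp_get_max_threads());
        return 0;
    }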
4. Types of Parallel Architectures
    1. Shared Memory Multiprocessors (SMP):
          o All processors share a common memory.
          o  Easy programming but limited scalability.
          o  Example: Intel Xeon multiprocessor.
   2. Distributed Memory Multicomputers:
         o Each processor has its own local memory.
         o Communication via message passing (MPI, PVM).
         o Scalable but harder to program.
         o Example: IBM SP2, Beowulf clusters.
   3. Hybrid Architectures:
         o Combine shared + distributed memory.
         o Used in modern supercomputers.
5. Scalability
      A system is scalable if:
          o Performance grows as processors are added.
          o Communication overhead is minimal.
          o Memory system supports large-scale data access.
      Scalability Models:
          o Amdahl’s Law: pessimistic (fixed workload).
          o Gustafson’s Law: optimistic (scaled workload).
          o Sun–Ni Law: considers workload + memory constraints.
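A small worked sketch of the first two scalability laws in plain C, using the formulas as commonly stated (the parallel fraction f = 0.9 is only an illustrative value):

    #include <stdio.h>

    /* Amdahl's law (fixed workload): speedup = 1 / ((1 - f) + f/N) */
    double amdahl(double f, int n)    { return 1.0 / ((1.0 - f) + f / n); }

    /* Gustafson's law (scaled workload): speedup = (1 - f) + f * N */
    double gustafson(double f, int n) { return (1.0 - f) + f * n; }

    int main(void) {
        double f = 0.9;          /* fraction of the work that is parallelizable */
        for (int n = 2; n <= 1024; n *= 4)
            printf("N=%4d  Amdahl=%6.2f  Gustafson=%7.2f\n",
                   n, amdahl(f, n), gustafson(f, n));
        return 0;
    }

The output shows Amdahl's speedup saturating near 1/(1 - f) = 10, while Gustafson's speedup keeps growing with N.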
6. Advantages
      Increased speedup and throughput.
      Handles large problem sizes efficiently.
      Supports parallel programming models (OpenMP, CUDA, MPI).
      Flexible for scientific, AI, and real-time applications.
7. Challenges
      Synchronization among processors.
      Cache coherence problem in shared memory.
      Communication overhead in distributed memory.
      Load balancing across processors.
8. Diagram
         +-----------+     +-----------+     +-----------+
         | Processor |     | Processor | ... | Processor |
         +-----------+     +-----------+     +-----------+
               \               |                /
             ----------- Interconnection -----------
                   Shared / Distributed Memory
                             Multiprocessor System Interconnects
1. Introduction
       In multiprocessor systems, multiple CPUs need to communicate with each other
        and with memory & I/O devices.
       The interconnection network provides this communication path.
       A good interconnect should be fast, reliable, scalable, and cost-effective.
2. Requirements of Interconnects
   1.   High Bandwidth – To support many processors.
   2.   Low Latency – Fast data transfer between processors.
   3.   Scalability – Should work efficiently as the system grows.
   4.   Fault Tolerance – Ability to reroute if a link fails.
3. Types of Interconnection Networks
A. Bus-based Interconnect
       All processors share a common communication bus.
       Advantages: Simple, low cost.
       Disadvantages: Bus contention → performance drops with more CPUs.
       Example: Early multiprocessors.
   CPU1 --\
   CPU2 ----> [ Shared Bus ] ---> Memory / I/O
   CPU3 --/
B. Crossbar Switch
       Every processor has a direct path to every memory module via switches.
       Advantages: High speed, no contention (if enough switches).
       Disadvantages: Very expensive for large systems.
   CPU1 ---|X|--- M1
   CPU2 ---|X|--- M2
   CPU3 ---|X|--- M3
C. Multistage Networks (Indirect Interconnects)
       Use multiple switching stages (e.g., Omega, Butterfly, Clos).
       Advantages: Cheaper than crossbar, scalable.
       Disadvantages: Possible blocking (two requests may collide).
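A minimal sketch of destination-tag (self-routing) through an Omega network, assuming an 8x8 network with three stages; at each stage one bit of the destination address (most significant bit first) selects the upper or lower output of a 2x2 switch:

    #include <stdio.h>

    #define STAGES 3                 /* log2(8) stages for an 8x8 Omega network */

    /* Print the switch outputs used when routing from 'src' to 'dst'. */
    void omega_route(unsigned src, unsigned dst) {
        printf("route %u -> %u:", src, dst);
        for (int s = STAGES - 1; s >= 0; s--) {
            unsigned bit = (dst >> s) & 1;    /* destination tag bit for this stage */
            printf(" stage%d:%s", STAGES - 1 - s, bit ? "lower" : "upper");
        }
        printf("\n");
    }

    int main(void) {
        omega_route(2, 5);
        omega_route(6, 1);
        return 0;
    }

Blocking occurs when two simultaneous routes need the same switch output at some stage.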
D. Topology-based Interconnects (for Multicomputers)
   1.   Ring: Each processor connected to two neighbors. (Simple but slow).
   2.   Mesh / Torus: Processors in grid form, scalable, used in clusters.
   3.   Hypercube: Each node connected to log2(N) neighbors; very scalable.
   4.   Tree / Fat-tree: Hierarchical, good for large systems.
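A small sketch showing why a hypercube node has log2(N) neighbours: each neighbour address differs from the node's address in exactly one bit (illustrative C, assuming N is a power of two):

    #include <stdio.h>

    /* List the neighbours of 'node' in a hypercube of 'dim' dimensions
       (N = 2^dim nodes); neighbour k is reached by flipping bit k. */
    void hypercube_neighbors(unsigned node, int dim) {
        printf("node %u:", node);
        for (int k = 0; k < dim; k++)
            printf(" %u", node ^ (1u << k));   /* flip one address bit */
        printf("\n");
    }

    int main(void) {
        for (unsigned n = 0; n < 8; n++)       /* 3-D hypercube, 8 nodes */
            hypercube_neighbors(n, 3);
        return 0;
    }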
                          Cache Coherence and Synchronization Mechanisms
1. Introduction
       In multiprocessor systems, each CPU may have its own cache to reduce memory
        access time.
       Problem: Inconsistency occurs when multiple caches store different copies of the
        same memory location.
       Solution: Cache coherence protocols + synchronization mechanisms ensure data
        consistency and orderly access.
2. Cache Coherence Problem
       Example:
            o CPU1 and CPU2 both cache variable X (say X = 2).
            o CPU1 updates X to 5, but CPU2’s cache still holds the old value X = 2.
            o Result: the two processors see an inconsistent view of memory.
Conditions for Cache Coherence
   1. Write Propagation: Updates to a variable must be visible to all processors.
   2. Transaction Serialization: All processors must see memory operations in the same
      order.
3. Cache Coherence Protocols
A. Write Policies
      Write-through: Update both cache and memory (slower, but simple).
      Write-back: Update only cache, write to memory later (efficient, but complex).
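A toy sketch contrasting the two write policies on a single cached word (purely illustrative, not a real cache model):

    #include <stdio.h>
    #include <stdbool.h>

    /* Toy model of one cache line backed by one memory word. */
    int  memory_word = 0;          /* main-memory copy            */
    int  cache_word  = 0;          /* cached copy                 */
    bool dirty       = false;      /* used only by write-back     */

    void write_through(int v) {
        cache_word  = v;           /* update cache ...            */
        memory_word = v;           /* ... and memory immediately  */
    }

    void write_back(int v) {
        cache_word = v;            /* update cache only           */
        dirty = true;              /* memory updated on eviction  */
    }

    void evict(void) {
        if (dirty) { memory_word = cache_word; dirty = false; }
    }

    int main(void) {
        write_through(5);
        printf("write-through: cache=%d memory=%d\n", cache_word, memory_word);
        write_back(9);
        printf("write-back before eviction: cache=%d memory=%d\n", cache_word, memory_word);
        evict();
        printf("write-back after eviction:  cache=%d memory=%d\n", cache_word, memory_word);
        return 0;
    }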
B. Protocol Types
   1. Directory-based Protocols
          o A directory keeps a central record of which caches hold a copy of each
            memory block.
          o Updates need only point-to-point messages to the listed caches (no bus
            broadcast) → scalable for large systems.
   2. Snoopy Protocols
         o All caches monitor (snoop) a common bus.
         o If one CPU updates, others invalidate or update their cache copies.
         o Suitable for bus-based multiprocessors.
4. Popular Snoopy Protocols
      MSI (Modified, Shared, Invalid)
      MESI (Modified, Exclusive, Shared, Invalid) → common in Intel processors.
      MOESI and MESIF → advanced variants.
5. Synchronization Mechanism
Ensures orderly and mutually exclusive access to shared data/resources.
A. Hardware Mechanisms
   1. Locks/Atomic Instructions:
         o Test-and-Set, Compare-and-Swap, Fetch-and-Add → atomic updates.
   2. Barriers: All processors wait until every processor reaches a synchronization point.
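A minimal sketch of a test-and-set style spin lock built on C11 atomics; atomic_flag is the portable counterpart of a hardware test-and-set instruction. This is illustrative only (no back-off, not production locking code):

    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_flag lock = ATOMIC_FLAG_INIT;
    static int shared_counter = 0;

    void critical_increment(void) {
        /* Spin until the atomic test-and-set reports the flag was clear. */
        while (atomic_flag_test_and_set(&lock))
            ;                                   /* busy-wait */
        shared_counter++;                       /* critical section */
        atomic_flag_clear(&lock);               /* release the lock */
    }

    int main(void) {
        for (int i = 0; i < 5; i++)
            critical_increment();
        printf("counter = %d\n", shared_counter);
        return 0;
    }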
B. Software Mechanisms
   1. Semaphores & Mutexes: Control access to critical sections.
   2. Monitors & Condition Variables: High-level synchronization constructs.
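A small sketch of a mutex protecting a critical section with POSIX threads (standard pthread calls; the thread count and loop bound are illustrative):

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static long counter = 0;

    /* Each thread increments the shared counter under the mutex. */
    void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&m);     /* enter critical section */
            counter++;
            pthread_mutex_unlock(&m);   /* leave critical section */
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[4];
        for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
        printf("counter = %ld (expected 400000)\n", counter);
        return 0;
    }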
6. Example (MESI Protocol Flow)
        CPU1 writes to a block → changes state to Modified.
        Other caches mark block as Invalid.
        Next time another CPU reads, it fetches updated value → coherence maintained.
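A toy sketch of these transitions as a state enum and transition function. This is a simplification, not the full MESI protocol (the Exclusive-state transitions and bus transactions are omitted):

    #include <stdio.h>

    typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;

    /* Next state of a line in THIS cache for a given event.
       Events: 'R' local read, 'W' local write, 'S' snooped remote write. */
    mesi_t mesi_next(mesi_t s, char event) {
        switch (event) {
        case 'W': return MODIFIED;                     /* local write: take ownership  */
        case 'S': return INVALID;                      /* another CPU wrote: invalidate */
        case 'R': return (s == INVALID) ? SHARED : s;  /* local read: fetch as Shared   */
        }
        return s;
    }

    int main(void) {
        const char *name[] = { "Invalid", "Shared", "Exclusive", "Modified" };
        mesi_t cpu2 = SHARED;
        cpu2 = mesi_next(cpu2, 'S');   /* CPU1 writes the block -> CPU2 invalidates   */
        cpu2 = mesi_next(cpu2, 'R');   /* CPU2 reads again -> fetches the updated copy */
        printf("CPU2 line state: %s\n", name[cpu2]);
        return 0;
    }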
7. Diagram (Exam Sketch – Snoopy Protocol)
       +---------+          +---------+
       |  CPU1   |          |  CPU2   |
       |  Cache  |          |  Cache  |
       +---------+          +---------+
             \                  /
              \--- Shared Bus ---/
                       |
                  Main Memory
                               Three Generations of Multicomputers
1. Introduction
        A multicomputer is a parallel computer system consisting of multiple processors,
         each with its own local memory, connected by an interconnection network.
        Unlike multiprocessors (shared memory), multicomputers use message passing to
         communicate.
        The development of multicomputers can be classified into three generations, based
         on technology, interconnects, programming models, and performance goals.
2. First Generation (1980s – Early Multicomputers)
        Architecture:
            o Experimental designs with static interconnection topologies such as mesh,
                hypercube, ring, or torus.
            o Each node = processor + local memory.
        Communication:
            o Store-and-forward packet switching.
            o High latency, limited bandwidth.
        Programming Model:
            o Low-level message passing (send/receive calls).
            o No standard libraries. Programmer handled data distribution manually.
        Applications: Scientific and research computing.
        Examples: Intel iPSC (1985), nCUBE-10, Cosmic Cube.
👉 Limitations: Hard to program, lack of standards, limited scalability (hundreds of
processors at most).
3. Second Generation (1990s – Cluster-based Multicomputers)
      Architecture:
          o Clusters of workstations or PCs connected with high-speed networks.
          o Used commodity hardware (cheap processors + network cards).
      Communication:
          o Faster interconnects (Myrinet, Fast Ethernet).
          o Introduction of wormhole routing → lower latency than store-and-forward.
      Programming Model:
          o Standardized libraries: MPI (Message Passing Interface), PVM (Parallel
              Virtual Machine).
          o Easier programming with support for collective communication.
      Applications:
          o Weather forecasting, fluid dynamics, financial modeling, military simulations.
      Examples: IBM SP2, Intel Paragon, Beowulf clusters.
👉 Improvements: More scalable (thousands of processors), portable software, better
cost-performance ratio.
4. Third Generation (2000s – Present: HPC Clusters, Grids, and Clouds)
      Architecture:
          o Large-scale superclusters and supercomputers with thousands to millions
              of cores.
          o Integration of GPUs, accelerators, and multicore CPUs.
          o Support for grid computing and cloud computing.
      Communication:
          o Ultra-fast interconnects like InfiniBand, 10/40/100 Gigabit Ethernet, Omni-
              Path.
          o Low latency and high bandwidth with advanced routing + virtual channels.
      Programming Model:
           o Hybrid models: MPI + OpenMP (message passing between nodes + shared
              memory within a node); see the sketch after this section.
          o Support for distributed shared memory (DSM) and parallel programming
              frameworks.
          o Integration with cloud-based models for scalability.
      Applications:
          o AI/ML, Big Data Analytics, Molecular modeling, Astrophysics, Climate
              simulations, Quantum computing.
      Examples: IBM Blue Gene series, Cray XT, Tianhe-2 (China), Summit (USA),
       Fugaku (Japan).
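A minimal sketch of the hybrid MPI + OpenMP model referred to above: one MPI process per node, several shared-memory threads inside it. Compile with an MPI wrapper and OpenMP enabled, e.g. mpicc -fopenmp (illustrative only):

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int provided, rank, size;
        /* Ask MPI for thread support so OpenMP threads can coexist with it. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        #pragma omp parallel
        {
            /* Each MPI process runs several shared-memory threads. */
            printf("rank %d of %d, thread %d of %d\n",
                   rank, size, omp_get_thread_num(), omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }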
                                  Message-Passing Mechanism
1. Introduction
      In multicomputers, each processor has its own local memory → no shared memory.
      Processors must communicate by sending and receiving messages over an
       interconnection network.
      This is known as the Message-Passing Mechanism.
👉 Used in parallel computing, clusters, and distributed systems.
2. Features
   1. Explicit Communication – Processes exchange data via send/receive.
   2. Synchronization – Communication ensures coordination among processes.
   3. Portability – Standard APIs like MPI (Message Passing Interface) make programs
      portable.
   4. Scalability – Well-suited for large-scale systems (clusters, supercomputers).
3. Basic Operations
   1. Send (destination, message) – Transmits message to another process.
   2. Receive (source, message) – Accepts incoming message.
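A minimal point-to-point sketch of these two operations using standard MPI_Send and MPI_Recv calls (run with at least two processes, e.g. mpirun -np 2; the value sent is illustrative):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            /* Send(destination = 1, message = value) */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* Receive(source = 0, message = value) */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("process 1 received %d from process 0\n", value);
        }

        MPI_Finalize();
        return 0;
    }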
4. Types of Message Passing
   1. Synchronous vs Asynchronous
         o Synchronous: Sender waits until receiver acknowledges.
         o Asynchronous: Sender continues without waiting.
   2. Buffered vs Unbuffered
         o Buffered: Messages stored temporarily in system buffer.
         o Unbuffered: Direct handoff between sender and receiver.
   3. Direct vs Indirect
         o Direct: Sender specifies the exact receiver.
         o Indirect: Messages go via mailboxes/queues.
5. Message-Passing Models
      Point-to-Point – One sender ↔ one receiver.
      Broadcast / Multicast – One sender → multiple receivers.
      Collective Communication – Group operations (scatter, gather, reduce).
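A short sketch of collective communication: process 0 broadcasts a value to everyone, then MPI_Reduce sums each process's rank back at process 0 (illustrative; works for any number of processes):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size, sum = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Broadcast: process 0 shares a value with all processes. */
        int seed = (rank == 0) ? 7 : 0;
        MPI_Bcast(&seed, 1, MPI_INT, 0, MPI_COMM_WORLD);

        /* Reduce: sum every process's rank into 'sum' on process 0. */
        MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("seed=%d, sum of ranks 0..%d = %d\n", seed, size - 1, sum);

        MPI_Finalize();
        return 0;
    }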
6. Advantages
      Works in systems without shared memory.
      Scalable to thousands of processors.
      Standardized libraries (MPI, PVM) make programming easier.
7. Limitations
      Overhead due to copying messages and communication delays.
      More complex programming than shared memory.
8. Examples
      MPI (Message Passing Interface) – De facto standard for HPC.
      PVM (Parallel Virtual Machine) – Early library for cluster computing.
      Sockets – Used in distributed applications.
9. Diagram (Exam Sketch)
   Processor A + Memory --(Send)--> Interconnection Network --(Receive)--> Processor B + Memory