UNIT-1
Theory of parallelism
Q1. Give a brief note on parallel computers and its architectural model
Ans. Parallel computers are systems that can process multiple tasks simultaneously by
dividing a computational problem into smaller sub-problems and solving them
concurrently.
This approach significantly improves performance and efficiency, especially for tasks
involving large-scale computations.
Key characteristics of parallel computers include:
Multiple processors working together.
Concurrent execution of instructions.
High-speed interprocessor communication.
Why parallel computing?
● The real world is dynamic: many things happen at the same time in different places, generating enormous amounts of data that must be managed concurrently.
● Real-world data needs dynamic simulation and modeling, and parallel computing is the key to achieving this.
● Parallel computing provides concurrency and saves time and money.
● Managing large, complex datasets is practical only with a parallel computing approach.
● It ensures effective utilization of resources: the hardware is kept busy, whereas in serial computation only part of the hardware is used and the rest sits idle.
● It is also impractical to implement real-time systems using serial computing.
Architectural Models of Parallel Computers
Parallel computer architectures are categorized based on the arrangement of processors
and the way they communicate and share resources. Major architectural models include:
1.Flynn's Taxonomy
SISD (Single Instruction Single Data): A single processor executes one instruction at
a time on one data stream (traditional computers).
SIMD (Single Instruction Multiple Data): Multiple processors execute the same
instruction on different data streams simultaneously (e.g., GPUs).
MISD (Multiple Instruction Single Data): Rarely used, where multiple processors
execute different instructions on the same data stream.
MIMD (Multiple Instruction Multiple Data): Multiple processors execute different
instructions on different data streams (e.g., distributed systems).
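To make the SISD/SIMD contrast concrete, here is a toy Python sketch (the function names are illustrative, not from any real instruction set): SISD issues one instruction per data pair per step, while SIMD conceptually broadcasts one instruction across all data lanes at once.

```python
def sisd_add(a, b):
    """SISD style: one instruction operates on one data pair per time step."""
    result = []
    for x, y in zip(a, b):          # each iteration models one instruction issue
        result.append(x + y)
    return result

def simd_add(a, b):
    """SIMD style (conceptually): one broadcast 'add' applied to every lane.
    The lanes are simulated here; real SIMD hardware does this in one step."""
    return [x + y for x, y in zip(a, b)]

print(simd_add([1, 2, 3, 4], [10, 20, 30, 40]))  # [11, 22, 33, 44]
```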
2.Shared Memory Architecture
Processors share a common memory space.
Communication occurs through shared memory.
Suitable for tasks requiring frequent data sharing.
Example: Multi-core processors.
3.Distributed Memory Architecture
Each processor has its own local memory.
Communication occurs via message passing.
Scales better for large systems.
Example: Clusters.
4.Hybrid Architecture
Combines shared and distributed memory architectures.
Common in modern supercomputers.
Example: Systems with multi-core nodes connected via a high-speed network.
5.Dataflow Architecture
Execution is driven by the availability of data rather than a pre-determined
sequence of instructions.
Ideal for fine-grained parallelism.
NOTE: TO LEARN IN DETAIL ABOUT EACH ARCHITECTURE OF FLYNN'S TAXONOMY, REFER TO THE INSTRUCTOR'S PDF, UNIT-1, TOPIC "THE STATE OF COMPUTING", PG 3-6.
Q2. Discuss briefly about shared memory multiprocessor and its models.
Ans: Shared memory multiprocessor systems are a class of parallel computer
architectures where multiple processors share a common memory space. This
architecture allows all processors to access the same global memory,
facilitating interprocessor communication and data sharing. In shared memory
systems, multiple processors can read and write to a single, unified memory
space.
Memory is globally accessible by all processors. The processors are connected
to the shared memory through a bus or interconnection network.
Components
1. Processors: Multiple CPUs work simultaneously, each capable of
executing instructions independently.
2. Shared Memory: A single pool of memory accessible to all processors for
storing and retrieving data.
3. Interconnection Mechanism: A communication infrastructure (like buses
or networks) connects processors to memory.
4. Synchronization Mechanisms: Hardware/software tools (e.g., locks,
semaphores) to manage data consistency and prevent race conditions.
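As a small illustration of the synchronization mechanisms listed above, the following Python sketch uses a lock to protect a shared counter against race conditions; the worker function and thread count are made up for this example.

```python
import threading

counter = 0                      # shared memory location
lock = threading.Lock()          # synchronization mechanism (a lock)

def worker(increments):
    global counter
    for _ in range(increments):
        with lock:               # only one thread may enter at a time
            counter += 1         # the read-modify-write is now atomic

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 40000 — without the lock, some updates could be lost
```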
Shared memory machines can be divided into three categories based upon
memory access times
1.Uniform memory-access (UMA) model
The UMA Model: In a UMA multiprocessor model (below Fig.), the physical memory is uniformly shared by all the processors. All processors have equal access time to all memory words, which is why it is called uniform memory access. Each processor may use a private cache. Peripherals are also shared in some fashion.
Advantages: Simple to design, predictable memory access times.
Disadvantages: Limited scalability due to bus contention.
2.Non-uniform-memory-access (NUMA) model
Memory is physically distributed among processors, but processors can still
access memory across the system. Access times vary depending on the
memory's location. Two NUMA machine models are depicted in Fig. The
shared memory is physically distributed to all processors, called local
memories. The collection of all local memories forms a global address space
accessible by all processors.
Each processor has faster access to its local memory.
Example: Modern server-class systems.
Advantages: Scalable for larger systems.
Disadvantages: More complex memory management and potential
latency for remote memory access.
3.Cache-only memory architecture (COMA) model
COMA is a type of distributed shared memory architecture where the main
memory is not statically allocated. Instead, all memory in the system is treated
as a large, dynamically managed cache. This allows data to migrate to the
memory module of the processor that needs it, ensuring faster access and
reducing latency.
In COMA, the memory is highly flexible, and the system dynamically adjusts
the placement of data based on the demands of the processors. This
architecture is particularly beneficial for workloads with irregular or
unpredictable memory access patterns, as it brings data closer to the
processor, improving performance.
Q3. Explain the architecture of a vector supercomputer.
Ans. A vector supercomputer is a high-performance computing system
designed to perform operations on vectors (arrays of data) simultaneously. It is
optimized for handling large-scale computations in scientific and engineering
applications by leveraging vector processing, where a single instruction
operates on multiple data elements in parallel. This architecture is particularly
suited for tasks involving repetitive mathematical operations on large datasets,
such as simulations, weather modeling, and numerical analysis.
• A vector operand contains an ordered set of n elements, where n is called
the length of the vector. Each element in a vector is a scalar quantity, which
may be a floating-point number, an integer, a logical value or a character.
• A vector processor consists of a scalar processor and a vector unit, which
could be thought of as an independent functional unit capable of efficient
vector operations. A vector computer is often built on top of a scalar
processor. As shown in Fig, the vector processor is attached to the scalar
processor as an optional feature. Program and data are first loaded into the
main memory through a host computer. All instructions are first decoded by the scalar control unit. If the decoded instruction is a scalar operation or a
program control operation, it will be directly executed by the scalar processor
using the scalar functional pipelines.
Vector computers are designed with hardware that efficiently performs vector
operations. Instead of accessing operands directly from memory, data is first
loaded into registers, processed, and then stored back in registers. The
hardware uses pipelining to overlap the processing of operands. In pipelined
functional units, each stage handles a step of the operation on different data.
Once the pipeline is full, a new result is produced every clock cycle.
Vector Processor Classification
Vector processors are classified based on how they handle operands during
computations. The two main types are:
1. Memory-to-Memory Vector Processors:
Operands are fetched directly from memory for processing, and the results are
written back to memory without using intermediate registers.
o Operands reside in memory during computations.
o The processor directly interacts with memory for every operation.
Advantages:
o Simpler design as no vector registers are required.
o Suitable for large datasets that cannot fit in registers.
Disadvantages:
o Slower due to frequent memory accesses.
o Higher latency and increased memory bandwidth requirements.
2. Register-to-Register Vector Processors:
Data is first loaded from memory into vector registers, where it is processed.
The results are stored back in memory only after computation is complete.
o Operands are held in high-speed vector registers.
o Operations are performed entirely within the registers.
Advantages:
o Faster due to reduced memory access during computations.
o Efficient use of memory bandwidth.
Disadvantages:
o Requires large, high-speed vector registers, increasing hardware complexity.
o Limited by the size of the registers.
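The two operand-handling styles can be sketched in Python as follows; the flat `memory` list and the function names are illustrative, not a real ISA. The memory-to-memory version touches main memory on every operand access, while the register-to-register version does vector loads, computes entirely within registers, and stores once at the end.

```python
memory = [1.0, 2.0, 3.0, 4.0,      # vector A at addresses 0..3
          10.0, 20.0, 30.0, 40.0,  # vector B at addresses 4..7
          0.0, 0.0, 0.0, 0.0]      # result area at addresses 8..11

def mem_to_mem_add(src_a, src_b, dst, n):
    """Memory-to-memory style: every operand access goes to main memory."""
    for i in range(n):
        memory[dst + i] = memory[src_a + i] + memory[src_b + i]

def reg_to_reg_add(src_a, src_b, dst, n):
    """Register-to-register style: vector loads, compute in registers, one store."""
    v1 = [memory[src_a + i] for i in range(n)]   # vector load into register v1
    v2 = [memory[src_b + i] for i in range(n)]   # vector load into register v2
    v3 = [x + y for x, y in zip(v1, v2)]         # add entirely within registers
    for i in range(n):
        memory[dst + i] = v3[i]                  # single vector store at the end

reg_to_reg_add(0, 4, 8, 4)
print(memory[8:12])  # [11.0, 22.0, 33.0, 44.0]
```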
Q4.Briefly discuss the operational model of SIMD computers.
Ans: SIMD (Single Instruction, Multiple Data) computers are designed to
execute the same instruction simultaneously on multiple data elements. These
systems excel at performing data-parallel tasks, where the same operation
needs to be applied to large datasets, making them ideal for multimedia
processing, scientific simulations, and other data-intensive applications.
Modern SIMD examples include Intel’s SSE (Streaming SIMD Extensions) and
ARM’s NEON instruction sets.
SIMD Operational Model
An SIMD computer is represented by a 5-tuple: M = (N, C, I, M, R). Here's what each component means:
1. N:
o The number of Processing Elements (PEs) in the machine.
o Example: ILLIAC IV had 64 PEs; Connection Machine CM-2 had
65,536 PEs.
2. C:
o The set of instructions executed by the Control Unit (CU).
o Includes scalar operations and program flow controls.
3. I:
o The set of instructions broadcast by the CU to all PEs for parallel
execution.
o Examples: arithmetic, logic, data routing, masking, and local
operations.
4. M:
o Masking schemes used to enable or disable specific PEs during
execution.
o Masks allow only a subset of PEs to process data, providing
flexibility.
5. R:
o Data-routing functions that define communication patterns in the
interconnection network for inter-PE data transfer.
Vector Instructions in SIMD
A vector instruction specifies operations on multiple data elements stored in
vector registers. The fields in a vector instruction include:
1. Operation Code:
Specifies the operation to perform (e.g., addition, multiplication).
2. Base Address:
Refers to the memory location or vector register where operands are
fetched from or results are stored.
3. Address Increment:
Indicates how to locate the next element in a vector.
o Example: An increment of 1 means data is stored consecutively in
memory.
4. Address Offset:
Adds to the base address to compute the effective memory location.
5. Vector Length:
Specifies the number of elements in the vector to determine when to
stop processing.
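The address fields above can be illustrated with a small, hypothetical gather routine in Python (`gather_vector` is not a real instruction; the field names follow the list above):

```python
def gather_vector(memory, base, offset, increment, length):
    """Fetch `length` elements starting at base+offset, stepping by increment."""
    start = base + offset                    # base address + address offset
    return [memory[start + i * increment]    # address increment between elements
            for i in range(length)]          # vector length bounds the loop

memory = list(range(100))                    # fake linear memory: memory[i] == i

# increment of 1: elements are consecutive in memory
print(gather_vector(memory, base=10, offset=2, increment=1, length=4))  # [12, 13, 14, 15]
# increment of 5: strided access, e.g. one column of a 5-wide matrix
print(gather_vector(memory, base=0, offset=0, increment=5, length=4))   # [0, 5, 10, 15]
```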
Key Features of SIMD Computers
1. Parallel Processing: Multiple PEs operate in parallel under the control of
a single instruction stream.
2. Efficient for Large Data: Best suited for tasks involving repetitive
operations on large datasets.
3. Masking: Flexibility to enable/disable specific PEs for selective
processing.
4. Data Routing: Interconnection networks enable communication between
PEs for complex tasks.
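Masking (feature 3 above) can be sketched in a few lines of Python; the list-of-booleans mask encoding is made up for illustration, but the idea matches the text: only enabled PEs (lanes) apply the broadcast operation.

```python
def masked_simd_add(a, b, mask):
    """Apply 'a + b' only in lanes where mask is True; disabled lanes pass a through."""
    return [x + y if m else x for x, y, m in zip(a, b, mask)]

data  = [1, 2, 3, 4]
other = [10, 10, 10, 10]
mask  = [True, False, True, False]   # PEs 0 and 2 enabled, PEs 1 and 3 disabled

print(masked_simd_add(data, other, mask))  # [11, 2, 13, 4]
```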
Q5.Elaborate the PRAM and VLSI models
Ans: PRAM Model
The Parallel Random Access Machine (PRAM) model is a theoretical
framework used to analyze parallel algorithms and measure their
performance. It assumes an ideal parallel computer where multiple processors
can work together and access shared memory simultaneously in constant
time.
The PRAM model is ideal for studying theoretical parallel algorithms, focusing
on time and processor efficiency.
Key Features of PRAM
1. Processors:
o Contains n processors (or processing elements, PEs) working
together.
o All processors execute instructions in synchronization (SIMD: Single
Instruction, Multiple Data).
2. Shared Memory:
o A single, globally accessible memory shared by all processors.
o Memory access happens in constant time, regardless of the
number of processors.
3. Parallel Execution:
o Processors read from memory, compute, and write back to memory
in a synchronized manner (lockstep operation).
Memory Access Policies in PRAM
When multiple processors access memory, the system must define rules to
handle conflicts. These rules include:
1. Exclusive Read (ER):
o Only one processor can read from a specific memory location at a
time.
2. Exclusive Write (EW):
o Only one processor can write to a specific memory location at a
time.
3. Concurrent Read (CR):
o Multiple processors can read from the same memory location
simultaneously.
4. Concurrent Write (CW):
o Multiple processors can write to the same memory location
simultaneously. To avoid conflicts, a policy is needed to decide how
writes are handled.
Variants of PRAM Models
Different combinations of these memory policies lead to PRAM variants:
1. EREW (Exclusive Read, Exclusive Write):
o Most restrictive: Only one processor can read/write a memory
location at any given time.
2. CREW (Concurrent Read, Exclusive Write):
o Allows multiple processors to read, but only one can write to a
memory location.
3. CRCW (Concurrent Read, Concurrent Write):
o Most flexible: Allows multiple processors to read and write
simultaneously.
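To make the CW conflict policy concrete, here is a toy model of one CRCW write step using the "priority" rule (lowest-numbered processor wins), which is one common conflict-resolution policy; the encoding of writes as tuples is made up for illustration.

```python
def crcw_write_step(memory, writes):
    """One CRCW PRAM write step.
    writes: list of (processor_id, address, value) issued in the same step."""
    winners = {}
    for pid, addr, value in sorted(writes):  # sort => lowest pid is seen first
        if addr not in winners:              # first (highest-priority) write wins
            winners[addr] = value
    for addr, value in winners.items():      # commit one value per address
        memory[addr] = value
    return memory

mem = [0, 0, 0]
crcw_write_step(mem, [(2, 1, 99), (0, 1, 7), (1, 2, 5)])
print(mem)  # [0, 7, 5] — processor 0 beat processor 2 at address 1
```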
Why PRAM is Important
1. Idealized Model:
o It simplifies the study of parallel algorithms by ignoring
synchronization and memory access delays.
2. Algorithm Development:
o PRAM is widely used for designing and analyzing scalable parallel
algorithms.
VLSI Model
Very-Large-Scale Integration (VLSI) is the process of creating an integrated
circuit (IC) by combining thousands of electronic components, like transistors,
resistors, and capacitors, into a single chip.
The VLSI model deals with designing and arranging these components on a
chip while considering factors like power usage, speed, size, and
manufacturing limitations.
In parallel computers, VLSI chips are used to build processor arrays, memory
arrays, and large-scale switching networks. Modern VLSI chips are typically 2-
dimensional, and their size depends on the amount of memory they can store.
The efficiency of an algorithm on a VLSI chip is measured by its space complexity.
This is calculated using the chip area (A) and the time (T) it takes to run the algorithm. The product A × T represents the total number of bits the chip processes.
For certain tasks, there is a minimum value, f(s), depending on the problem size s, that limits how efficiently the chip can perform the computation, such that A · T² ≥ O(f(s)), where A = chip area and T = time.
Q6.Briefly discuss (a) Multivector & SIMD tracks (b) Multithreaded &
dataflow tracks.
Ans: (a) Multivector & SIMD tracks
Multivector Tracks:
These are traditional vector supercomputers. The CDC 7600 was the first
vector dual-processor system, and two subtracks were derived from it. The
Cray and Japanese supercomputers all followed the register-to-register
architecture, which focuses on vector processing, where operations are
performed on entire arrays (vectors) of data rather than on individual
elements. The Cray 1 pioneered the multivector development in 1978. The
Cray/MPP was a massively parallel system with distributed shared memory,
designed to work as a back-end accelerator engine compatible with the Cray
Y-MP series.
The other subtrack used the memory-to-memory architecture in building
vector supercomputers. Only the CDC Cyber 205 and its successor, the ETA10,
are identified here, for completeness in tracking different supercomputer
architectures.
SIMD Tracks:
In SIMD systems, a single instruction stream is broadcast to multiple
processors, which execute it simultaneously on different data elements.
The Illiac IV pioneered the construction of SIMD computers, although the
array-processor concept can be traced back far earlier, to the 1960s. The
subtrack consisting of the Goodyear MPP, the AMT/DAP610, and the TMC/CM-2
comprised SIMD machines built with bit-slice PEs. The CM-5 was a
synchronized MIMD machine executing in a multiple-SIMD mode.
The other subtrack corresponds to medium-grain SIMD computers using
word-wide PEs. The BSP [Kuck and Stokes, 1982] was a shared-memory SIMD
machine built with 16 processors updating a group of 17 memory modules
synchronously. The GF11 was developed at the IBM Watson Laboratory for
scientific simulation research use. The MasPar MP1 was the only medium-grain
SIMD computer to achieve production use in that time period.
(b) Multithreaded & dataflow tracks.
Multithreaded tracks:
Conventional von Neumann machines are built with processors that execute a
single context at a time; in other words, each processor maintains a single
thread of control with its hardware resources. In a multithreaded
architecture, each processor can execute multiple contexts at the same time.
The term multithreading implies that there are multiple threads of control
in each processor.
Multithreading offers an effective mechanism for hiding long latency in
building large-scale multiprocessors and is today a mature technology.
As shown in Fig., the multithreading idea was pioneered by Burton Smith
[1978] in the HEP system, which extended the concept of scoreboarding of
multiple functional units in the CDC 6400.
Dataflow Tracks:
Dataflow systems focus on executing instructions based on the availability of
data, rather than following a sequential order as in traditional architectures.
The key idea is to use a dataflow mechanism, instead of a control-flow
mechanism as in von Neumann machines, to direct the program flow. Fine-grain,
instruction-level parallelism is exploited in dataflow computers.
The dataflow concept was pioneered by Jack Dennis (1974) with a "static"
architecture. The concept later inspired the development of "dynamic"
dataflow computers. A series of tagged-token architectures was developed at
MIT by Arvind and coworkers.
Q7. Discuss in brief Data flow architecture.
Ans: In a dataflow computer, the execution of an instruction is driven by
data availability instead of being guided by a program counter. In theory,
any instruction should be ready for execution whenever its operands become
available.
The instructions in a data-driven program are not ordered in any way.
Instead of being stored separately in a main memory, data are directly held
inside instructions. Computational results (data tokens) are passed directly
between instructions.
The data generated by an instruction will be duplicated into many copies and
forwarded directly to all needy instructions. Data tokens, once consumed by
an instruction, will no longer be available for reuse by other instructions.
A Dataflow Architecture
As shown in Fig., the global architecture consists of n processing elements
(PEs) interconnected by an n × n routing network. The entire system supports
pipelined dataflow operations in all n PEs. Inter-PE communications are done
through the pipelined routing network.
Within each PE, the machine provides a low-level token-matching mechanism
which dispatches only those instructions whose input data (tokens) are
already available.
This data-driven scheme requires no program counter, and no control
sequencer. However, it requires special mechanisms to detect data availability,
to match data tokens with needy instructions, and to enable the chain reaction
of asynchronous instruction executions. No memory sharing between
instructions results in no side effects.
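The data-driven firing rule described above can be modeled in a few lines of Python; the node names and the graph encoding are made up for illustration. An instruction (node) fires as soon as all of its input tokens are available, with no program counter deciding the order.

```python
# Dataflow graph: each node lists its input tokens and the operation to apply.
graph = {
    "a+b": (["a", "b"], lambda x, y: x + y),
    "c*2": (["c"], lambda x: x * 2),
    "sum": (["a+b", "c*2"], lambda x, y: x + y),
}

def run_dataflow(graph, tokens):
    """tokens: initially available data; fire every node whose inputs are ready."""
    fired = set()
    while True:
        ready = [n for n, (ins, _) in graph.items()
                 if n not in fired and all(i in tokens for i in ins)]
        if not ready:                     # no instruction is enabled -> done
            return tokens
        for n in ready:                   # no program counter: order is data-driven
            ins, op = graph[n]
            tokens[n] = op(*(tokens[i] for i in ins))   # emit a result token
            fired.add(n)

out = run_dataflow(graph, {"a": 1, "b": 2, "c": 10})
print(out["sum"])  # 23, i.e. (1 + 2) + (10 * 2)
```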
Q8. Explain in detail about program flow mechanism.
Ans: Program flow mechanisms refer to the various methods and strategies
employed to control the sequence of execution of instructions in a computer
system. They determine how a program’s control moves from one instruction
to another and ensure that computations occur in the correct order,
respecting dependencies and achieving the desired output.
The main program flow mechanisms are:
Control-Driven Mechanism
Data-Driven Mechanism
Demand-Driven Mechanism
1. Control-Driven Mechanism
Conventional computers are based on a control flow mechanism by
which the order of program execution is explicitly stated in the user
programs. A program counter keeps track of the current instruction.
Instructions are executed sequentially unless a control flow instruction
(e.g., jump, call) changes the PC. Here sequential execution is default.
Flow control uses constructs like loops, conditionals, and jumps.
2. Data-Driven Mechanism
Dataflow computers are based on a data-driven mechanism which allows
the execution of any instruction to be driven by data (operand)
availability. Dataflow computers emphasize a high degree of parallelism
at the fine-grain instructional level.
Instructions are "enabled" for execution when their input data
(operands) are available.
Tokens carrying data are passed between instructions, and execution
proceeds asynchronously.
No program counter or central control mechanism.
3. Demand-Driven Mechanism:
Reduction computers are based on a demand-driven mechanism which
initiates an operation based on the demand for its results by other
computations.
Also known as lazy evaluation, it executes instructions only when their
results are required by another part of the program. Execution starts
from the final output requirement.
Instructions are triggered in reverse order of dependency.
Avoids unnecessary computation by focusing only on required results.
Execution is guided by demand propagation.
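The demand-driven idea can be sketched with thunks in Python, which is one common way to model lazy evaluation (the `thunk` helper and the `calls` log are made up for this example): nothing runs until its result is demanded, and demanding one result propagates demands to its inputs.

```python
calls = []  # records which computations actually ran, in demand order

def thunk(name, fn):
    """Wrap a computation so it only runs when its value is demanded."""
    cache = {}
    def force():
        if "v" not in cache:
            calls.append(name)      # the demand reached this computation
            cache["v"] = fn()       # run it once and cache the result
        return cache["v"]
    return force

a = thunk("a", lambda: 2 + 3)
b = thunk("b", lambda: 10 * 10)     # never demanded, so it never executes
c = thunk("c", lambda: a() * 4)     # demanding c propagates a demand to a

print(c())       # 20 — execution starts from the final output requirement
print(calls)     # ['c', 'a'] — b was never computed (lazy evaluation)
```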
Q9. Explain Bell's taxonomy of MIMD computers
Ans: Parallel computers appear as either SIMD or MIMD configurations. The
SIMDs appeal more to special-purpose applications. It is clear that SIMDs
are not size-scalable, but unclear whether large SIMDs are
generation-scalable. The fact that the CM-5 had an MIMD architecture, moving
away from the SIMD architecture of the CM-2, represents the architectural
trend. Furthermore, the boundary between multiprocessors and multicomputers
has become blurred in recent years.
The architectural trend for general-purpose parallel computers is in favor
of MIMD configurations with various memory configurations. Gordon Bell
(1992) has provided a taxonomy of MIMD machines, shown in Fig. He considers
shared-memory multiprocessors as having a single address space. Scalable
multiprocessors or multicomputers must use distributed memory.
Multiprocessors using centrally shared memory have limited scalability.
Multicomputers use distributed memories with multiple address spaces. They
are scalable with distributed memory. The evolution of fast LAN (local area
network)-connected workstations has created "commodity supercomputing".
Bell was the first to advocate high-speed workstation clusters
interconnected by high-speed switches in lieu of special-purpose
multicomputers. The CM-5 development was an early move in that direction.
NOTE: CAN ALSO REFER SPECTRUM Q20.
Q10. Explain system interconnect architectures.
Ans: System Interconnect Architectures refer to the frameworks used to
connect the various components of a computer system, like processors,
memory, and input/output devices. These architectures can be broadly
categorized into Static Interconnection and Dynamic Interconnection based on
their structure and flexibility.
1.Static Interconnection Architectures:
Static networks use point-to-point direct connections which do not alter
during execution. This means static interconnection systems have fixed
pathways for communication between components; these pathways do not change
during the system's operation. They are faster and simpler.
Types of Static interconnection architectures:
(a)Linear Array: Processors are arranged in a straight line, with each
connected to its immediate neighbors. The linear array is the simplest
connection topology: a one-dimensional network in which N nodes are
connected by N - 1 links in a line. Internal nodes have degree 2, and the
terminal nodes have degree 1.
(b)Ring: Processors form a circular topology, with each processor connected to
two neighbors (left and right). A ring is obtained by connecting the two
terminal nodes of a linear array with one extra link. A ring can be uni-
directional or bidirectional. It is symmetric with a constant node degree of 2.
(c)Chordal Ring: By increasing the node degree from 2 to 3 or 4, we obtain two
chordal rings with degree 3 and 4. One and two extra links are added to
produce the two chordal rings, respectively. In general, the more links added,
the higher the node degree and the shorter the network diameter.
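A quick check of the topology properties described above: the network diameter (the longest shortest path between any two nodes) of an N-node linear array versus a ring. The helper names here are illustrative.

```python
def linear_array_diameter(n):
    """Worst case is end to end: N - 1 hops."""
    return n - 1

def ring_diameter(n):
    """Messages can go either way around the ring, halving the worst case."""
    return n // 2

print(linear_array_diameter(8))  # 7
print(ring_diameter(8))          # 4 — one extra link roughly halves the diameter
```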
2.Dynamic Interconnection Architectures:
For multipurpose or general-purpose applications, we may need to use
dynamic connections which can implement all communication patterns based
on program demands. Instead of using fixed connections, switches or arbiters
must be used along the connecting paths to provide the dynamic connectivity.
In increasing order of cost and performance, dynamic connection networks
include bus systems, multistage interconnection networks (MIN), and crossbar
switch networks.
Types of Dynamic interconnection architectures:
(a)Bus System: A bus system is essentially a collection of wires and connectors
for data transactions among processors, memory modules, and peripheral
devices attached to the bus. The bus is used for only one transaction at a time
between source and destination.
In case of multiple requests, the bus arbitration logic must be able to allocate
or deallocate the bus, servicing the requests one at a time. A bus system has a
lower cost and provides a limited bandwidth compared to the other two
dynamic connection networks.
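The bus-arbitration step described above can be sketched as follows; the fixed-priority policy (lowest device id wins) is just one possible arbitration scheme, assumed here for illustration.

```python
def arbitrate(requests):
    """One arbitration round: grant the bus to one device, queue the rest.
    requests: list of device ids currently requesting the bus."""
    if not requests:
        return None, []
    granted = min(requests)                        # fixed priority: lowest id wins
    waiting = [r for r in requests if r != granted]  # serviced in later rounds
    return granted, waiting

granted, waiting = arbitrate([3, 1, 2])
print(granted, waiting)  # 1 [3, 2] — one transaction at a time on the bus
```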
(b)Multistage interconnection network: MINs have been used in both MIMD
and SIMD computers. A generalized multistage network is illustrated in the Fig
below. A number of a x b switches are used in each stage. Fixed interstage
connections are used between the switches in adjacent stages. The switches
can be dynamically set to establish the desired connections between the
inputs and outputs.
(c)Crossbar switch networks: The highest bandwidth and interconnection
capability are provided by crossbar networks. A crossbar network can be
visualized as a single-stage switch network. Like a telephone switchboard,
the crosspoint switches provide dynamic connections between
source-destination pairs. Each crosspoint switch can provide a dedicated
connection path between a pair, and can be set on or off dynamically upon
program demand.
Done!!