Layers of Implementing An Application in Software or Hardware Using Parallel Computers
• Automatic Serial Programming: For serial computers, the programmer writes code in
high-level languages (C, Java, FORTRAN), and compilers handle the hardware-specific
details, producing fast code without the programmer needing to understand the
underlying hardware.
• Parallel Computers and Programming: For parallel systems, parallelizing compilers
can handle simple tasks like distributing loops across processors for embarrassingly
parallel algorithms, but beyond that, programmers must have detailed knowledge of
how processors interact and when tasks should be executed.
Algorithms
According to IEEE Standard Dictionary of Electrical and Electronics Terms:
"An algorithm is a prescribed set of well-defined rules or processes for solving a problem in a
finite number of steps."
Serial Algorithms
A serial algorithm involves tasks that must be executed one after another due to data
dependencies. The directed graph (DG) representation of such an algorithm resembles a long
chain of dependent tasks, where each task relies on the result of the previous ones.
Fibonacci Example: To compute the 10th Fibonacci number n10, task T10 calculates n10 = n9 + n8, so it must wait for the results of tasks T9 and T8.
Parallel Algorithms
Parallel algorithms are characterized by tasks that can be executed simultaneously because
they are data-independent—one task does not depend on the result of another. In a directed
graph (DG), this would appear as a wide row of independent tasks.
This structure allows tasks to be executed in parallel, leading to faster processing when
compared to serial algorithms.
SPAs
A Serial-Parallel Algorithm (SPA) involves grouping tasks into stages, where tasks in each
stage can be executed concurrently (in parallel), but the stages themselves are executed
sequentially.
• SPA Characteristics:
o Stages: Tasks within each stage run in parallel.
o Sequential Stages: Each stage is executed one after another.
o Conversion to Parallel or Serial:
▪ If there is only one stage, the SPA becomes a parallel algorithm.
▪ If each stage contains only one task, it becomes a serial algorithm.
• Example of SPA: The CORDIC algorithm requires several iterations. In each iteration,
three operations are performed to update the variables x, y, and z. Here, δi and θi are
constants read from lookup tables, and m is a control parameter that selects the coordinate
system. The three updates within an iteration are independent of each other, but the
iterations themselves must be executed in sequential order.
SPAs balance between parallelism and sequential execution, making them useful for certain
computational tasks where full parallelism isn't possible.
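As an illustration of this stage structure, the sketch below implements a standard circular CORDIC rotation (m = 1, with δi taken from the sign of z and θi = arctan 2^-i); the exact constants and signs in the text's equations may differ, and the CORDIC scale factor is ignored here.

    /* A minimal SPA sketch, assuming the standard circular CORDIC rotation
     * form (m = 1); the constants/signs in the course material may differ.
     * Each iteration (stage) updates x, y, z; the three updates inside a
     * stage are independent, but the stages run one after another. */
    #include <math.h>
    #include <stdio.h>

    void cordic_rotate(double *x, double *y, double *z, int iterations)
    {
        for (int i = 0; i < iterations; i++) {        /* sequential stages  */
            double delta = (*z >= 0.0) ? 1.0 : -1.0;  /* from sign of z     */
            double theta = atan(pow(2.0, -i));        /* lookup-table value */
            double x_new = *x - delta * (*y) * pow(2.0, -i);
            double y_new = *y + delta * (*x) * pow(2.0, -i);
            double z_new = *z - delta * theta;
            *x = x_new; *y = y_new; *z = z_new;       /* commit the stage   */
        }
    }

    int main(void)
    {
        /* Rotate the point (1, 0) by 0.5 rad; the gain factor K is ignored. */
        double x = 1.0, y = 0.0, z = 0.5;
        cordic_rotate(&x, &y, &z, 16);
        printf("x = %f, y = %f (un-normalized)\n", x, y);
        return 0;
    }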
NSPAs
Non-Serial-Parallel Algorithms (NSPAs) do not conform to the classifications of serial,
parallel, or serial-parallel algorithms. NSPAs are represented by directed graphs (DGs) and
fall into two main categories depending on whether their graphs contain cycles: directed
acyclic graphs (DAGs) and directed cyclic graphs (DCGs). Regardless of the category, three
properties of the graph characterize the algorithm:
1. Work (W): The total amount of processing work required to complete the algorithm.
2. Depth (D): Also known as the critical path, it is the maximum path length between any
input node and any output node.
3. Parallelism (P): The degree of parallelism, defined as the maximum number of nodes
(tasks) that can be processed in parallel. The maximum number of parallel processors
that can be active at a given time is constrained by this degree.
These properties are crucial when mapping an NSPA onto a parallel computing platform for
optimal execution.
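As a concrete illustration, the short sketch below computes W, D, and a level-based estimate of P for a small hand-made task graph; the graph itself is invented for illustration and is not from the text.

    /* A small sketch, not from the text: computing Work (node count), Depth
     * (longest input-to-output chain, counted in nodes), and a simple
     * parallelism estimate (the widest level) for a hand-made DAG. */
    #include <stdio.h>

    #define N 6
    /* adj[i][j] = 1 means task i must finish before task j can start. */
    static const int adj[N][N] = {
        {0,0,1,1,0,0},   /* T0 -> T2, T3 */
        {0,0,0,1,1,0},   /* T1 -> T3, T4 */
        {0,0,0,0,0,1},   /* T2 -> T5     */
        {0,0,0,0,0,1},   /* T3 -> T5     */
        {0,0,0,0,0,1},   /* T4 -> T5     */
        {0,0,0,0,0,0}    /* T5 (output)  */
    };

    int main(void)
    {
        int level[N] = {0};          /* level[j] = longest chain ending at j */
        int depth = 1, width[N] = {0};

        /* Nodes are already in topological order (edges go low -> high).   */
        for (int j = 0; j < N; j++) {
            level[j] = 1;
            for (int i = 0; i < j; i++)
                if (adj[i][j] && level[i] + 1 > level[j])
                    level[j] = level[i] + 1;
            if (level[j] > depth) depth = level[j];
            width[level[j] - 1]++;
        }

        int parallelism = 0;
        for (int l = 0; l < depth; l++)
            if (width[l] > parallelism) parallelism = width[l];

        printf("Work W = %d tasks, Depth D = %d, Parallelism P >= %d\n",
               N, depth, parallelism);
        return 0;
    }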
RIA
Regular Iterative Algorithms (RIAs), introduced by Karp et al., are a class of algorithms found
in various fields such as signal processing, image processing, video processing, linear algebra,
and numerical simulations. These algorithms are particularly noteworthy because they can be
implemented on grid structures.
Key Features of RIAs:
• Dependence Graph: Unlike Directed Acyclic Graphs (DAGs), RIAs use dependence
graphs to represent task dependencies. These graphs do not have directed edges, but
rather show fixed patterns of dependencies among tasks.
• Fixed Dependency Pattern: Tasks in an RIA exhibit a regular, predictable pattern of
dependencies, which differentiates them from other types of algorithms.
Example of RIA:
• Pattern Matching Algorithm: Figure 1.5 illustrates a dependence graph for a pattern
matching algorithm, where nodes represent tasks, and edges indicate the fixed
dependencies among tasks.
• Matrix-Matrix Multiplication Algorithm (Algorithm 1.1): A simple example of an RIA.
The algorithm involves iterating over matrices A and B to compute the result matrix C
using regular dependencies on the algorithm indices i, j, and k.
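The matrix-matrix multiplication mentioned above can be written as the familiar triple loop. The book's exact Algorithm 1.1 is not reproduced here, but the standard formulation below shows the regular dependence of every computation point on the indices i, j, and k.

    /* Standard matrix-matrix multiplication C = A * B as a regular iterative
     * algorithm: every (i, j, k) point performs the same update, and the
     * dependencies follow a fixed, regular pattern over the indices. */
    #define NI 4
    #define NJ 4
    #define NK 4

    void matmul(const double A[NI][NK], const double B[NK][NJ], double C[NI][NJ])
    {
        for (int i = 0; i < NI; i++)
            for (int j = 0; j < NJ; j++) {
                C[i][j] = 0.0;
                for (int k = 0; k < NK; k++)
                    C[i][j] += A[i][k] * B[k][j];   /* regular dependence on i, j, k */
            }
    }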
Challenges in Parallelizing RIAs:
1. Data Exchange: Data can be exchanged in the form of words using a system clock, or
as packets in network-on-chip (NoC) architectures, routed between chip modules via
routers.
2. Memory System: Memory design involves choosing between shared memory modules
for multiple processors or dedicated memory for each processor. Mechanisms are
needed to ensure data integrity when processors share data, particularly for read and
write operations.
3. Task Partitioning: The program is broken into segments allocated to processors. In
coarse grain partitioning, large segments are assigned to processors, while fine grain
partitioning involves smaller segments (e.g., software processes or threads). This
partitioning can be done by the programmer or compiler.
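As a rough illustration of coarse versus fine grain partitioning (not taken from the text), the OpenMP sketch below lets the schedule clause decide how loop iterations are divided among processors: schedule(static) gives each thread one large contiguous chunk, while schedule(dynamic, 1) hands out single iterations on demand.

    /* Illustrative only: loop partitioning with OpenMP. schedule(static)
     * assigns each thread one large contiguous chunk (coarser grain), while
     * schedule(dynamic, 1) hands out single iterations on demand (finer grain). */
    #include <omp.h>

    void scale_coarse(double *a, int n, double s)
    {
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n; i++)
            a[i] *= s;                 /* embarrassingly parallel work */
    }

    void scale_fine(double *a, int n, double s)
    {
        #pragma omp parallel for schedule(dynamic, 1)
        for (int i = 0; i < n; i++)
            a[i] *= s;
    }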
MEASURING THE BENEFITS OF PARALLEL COMPUTING
1. Speedup Factor
• Speedup (S(N)) measures how much faster a task is completed with N processors
compared to 1 processor.
• Formula for Speedup:
o S(N) = Tp(1) / Tp(N)
o Tp(1): Time on a single processor.
o Tp(N): Time on N parallel processors.
• In an ideal case (no communication delay), speedup is linear:
o S(N)=N.
2. Communication Overhead
• Parallel computing involves overhead due to data transfer between processors and
memory, which can reduce speedup.
• Communication overhead comes from:
o Network delay: Time for data transmission and queuing in the network.
o Memory bandwidth: Limits on memory access speed (one word per access).
o Memory collisions: Multiple processors trying to access the same memory
block.
o Memory wall: Processing speed is much faster than memory speed.
• Ideal scenario (N independent tasks, each requiring τp to process):
o Processing time on 1 processor: Tp(1) = N τp
o Processing time on N processors: Tp(N) = τp
• Time to read the input data:
o 1 processor: Tr(1) = τm
o N processors with shared memory: Tr(N) = ατm (where α ≥ 1 accounts for memory
access limitations such as collisions).
• Time to write the results:
o 1 processor: Tw(1) = τm
o N processors: Tw(N) = ατm
• Total time (including overhead):
o For 1 processor: Ttotal(1) = 2τm + N τp
o For N processors: Ttotal(N) = 2ατm + τp
4. Speedup Formula with Overhead
• Taking communication overhead into account:
o S(N) = Ttotal(1) / Ttotal(N) = (2τm + N τp) / (2ατm + τp)
5. Key Insights
• The overhead terms 2τm and 2ατm keep the achievable speedup below the ideal value N.
• Speedup approaches N only when the memory access time ατm is small compared with the
processing time τp, that is, when the algorithm does enough computation per memory access.
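A small sketch that evaluates the speedup expression above for made-up parameter values (τp = 1, τm = 0.5), comparing contention-free memory access (α = 1) with fully serialized access (α = N):

    /* Evaluates the speedup-with-overhead expression derived above:
     * S(N) = (2*tau_m + N*tau_p) / (2*alpha*tau_m + tau_p).
     * The sample numbers below are illustrative, not from the text. */
    #include <stdio.h>

    double speedup(int N, double tau_p, double tau_m, double alpha)
    {
        double t1 = 2.0 * tau_m + N * tau_p;          /* one processor */
        double tN = 2.0 * alpha * tau_m + tau_p;      /* N processors  */
        return t1 / tN;
    }

    int main(void)
    {
        for (int N = 2; N <= 64; N *= 2)
            printf("N = %2d  S(ideal mem) = %6.2f  S(serialized mem) = %6.2f\n",
                   N, speedup(N, 1.0, 0.5, 1.0), speedup(N, 1.0, 0.5, (double)N));
        return 0;
    }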
MEMORY HIERARCHY
• Ideal Memory Attributes: Ideally, memory should be nonvolatile, have short access
times to match processor speed, large storage capacity, and be inexpensive. However,
such a perfect memory doesn't exist yet.
• Memory Hierarchy: Current systems use a hierarchy of memory types including
registers, cache, RAM, and mass storage (magnetic, optical, and flash drives) to
optimize cost and performance.
• Registers & Cache: Registers are the fastest and located within the CPU, but their
capacity is limited. Cache memory, built using SRAM, is also very fast and close to the
CPU, but it has limited capacity compared to RAM.
• Cache Efficiency: Cache exploits temporal locality (reusing recently accessed data)
and spatial locality (accessing data close to current data). Data is transferred between
cache and main memory in blocks to improve performance.
• RAM (DRAM): RAM is slower but has a much larger capacity than cache. It forms the
main memory of the system. However, it is volatile and requires constant refreshing to
retain data, which makes it slower compared to registers or cache.
• Mass Storage: Mass storage, like magnetic and optical disks, is inexpensive and offers
large capacities. However, it is significantly slower since it relies on mechanical
operations.
• Flash Drives: Flash memory, used in solid-state drives (SSDs), offers faster
performance than traditional disks, is non-volatile, and has a growing capacity with
technological advances. However, its speed still doesn't match that of on-chip memory
like cache.
CACHE DESIGN
• Processing Speed: High when data/instructions are located in the cache; slows down
significantly when data is not in the cache.
• Cache Design: Several factors affect cache performance, including:
o Cache Size: A larger cache (of size 2^c) increases the chance of a cache hit.
o Mapping Technique: Determines how memory block addresses are associated
with cache line addresses.
o Cache Replacement Policy: Manages which blocks are loaded into the cache
and which lines are removed.
o Cache Hierarchy: Use of multiple cache levels (L1, L2) to improve performance
further.
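As an illustration of one common mapping technique, the sketch below shows direct mapping: a memory address is split into a tag, a cache line index, and a byte offset. The field widths chosen here (64-byte lines, 1024 lines) are arbitrary, not values from the text.

    /* A sketch of direct mapping: a memory address is split into
     * tag | line index | byte offset. Field widths are illustrative. */
    #include <stdint.h>
    #include <stdio.h>

    #define OFFSET_BITS 6                 /* 2^6  = 64-byte cache lines     */
    #define INDEX_BITS  10                /* 2^10 = 1024 lines in the cache */

    int main(void)
    {
        uint32_t addr   = 0x1234ABCD;
        uint32_t offset =  addr        & ((1u << OFFSET_BITS) - 1);
        uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
        uint32_t tag    =  addr >> (OFFSET_BITS + INDEX_BITS);

        /* The block can live only in line 'index'; a hit requires the stored
         * tag for that line to equal 'tag' and its valid bit to be set. */
        printf("tag = 0x%X, line = %u, offset = %u\n", tag, index, offset);
        return 0;
    }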
CACHE HIERARCHY
• Cache Hierarchy: Designed to enhance performance and reduce cache misses by
organizing cache into different levels.
• L1 Cache:
o On-chip, very fast, but with limited capacity.
o Directly communicates with the CPU.
• L2 Cache:
o Off-chip, slower than L1, but has a larger capacity.
o Built using fast SRAM technology.
• Purpose: To balance speed and capacity by using multiple cache levels, improving
access times compared to accessing main memory.
CACHE MISSES
Compulsory Misses:
• Occur when a block is accessed for the first time and has never been loaded
into the cache.
• Cache size has no effect on reducing these misses.
Capacity Misses:
• Occur when the cache is too small to hold all the blocks required by the
program.
• Can be reduced by increasing the cache size.
PIPELINING
• Pipelining: A technique to improve system throughput, or the rate of task completion
per unit time.
• Conditions for Effective Pipelining:
o Multiple instances of a task are required.
o The task can be divided into several smaller subtasks.
• Example: Car manufacturing, where many cars are produced and the process involves
several components.
• Execution Process:
o A task is divided into smaller stages, each completed faster than the whole task.
o Registers are placed between stages to store intermediate results, allowing
successive tasks to be processed efficiently.
*ESTIMATING PIPELINE SPEED
• For n task instances flowing through a pipeline of C stages, each with delay τ, the total
time is (C + n − 1)τ, compared with nCτ without pipelining; for large n the speedup
approaches C, the number of stages.
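A small sketch that evaluates this standard estimate for a hypothetical 5-stage pipeline:

    /* Sketch of the standard pipeline timing estimate (not tied to any
     * specific figure in the text): n tasks through a C-stage pipeline with
     * stage delay tau take (C + n - 1)*tau, versus n*C*tau without pipelining. */
    #include <stdio.h>

    int main(void)
    {
        const int    C   = 5;      /* pipeline stages (illustrative)    */
        const double tau = 1.0;    /* delay of one stage, in time units */

        for (int n = 1; n <= 1000; n *= 10) {
            double t_pipe   = (C + n - 1) * tau;
            double t_serial = (double)n * C * tau;
            printf("n = %4d  speedup = %5.2f (approaches C = %d)\n",
                   n, t_serial / t_pipe, C);
        }
        return 0;
    }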
VERY LONG INSTRUCTION WORD (VLIW) PROCESSORS
• Figure 2.16a: The VLIW processor controls two datapath units. A single VLIW word is
split into multiple instructions, each controlling a different unit.
• Figure 2.16b: This figure illustrates the contents of a VLIW word across multiple
machine cycles. Each row represents a different VLIW word issued at a specific cycle:
o Gray boxes represent active instructions within the VLIW word.
o Empty boxes represent no-op (no operation) instructions, used when the
compiler cannot resolve dependencies between instructions or when the
required datapath unit is unavailable.
Benefits of VLIW:
• Instruction scheduling is done by the compiler at compile time, so the hardware needs no
complex dynamic scheduling logic, which keeps the processor datapath and control simple.
SUPERSCALAR PROCESSORS
Key Features:
• Dynamic Scheduling: The processor includes a dynamic scheduler that examines the
instructions in memory and decides which to issue to each pipeline. This allows for out-
of-order execution, increasing instruction throughput.
• Multiple Pipelines: Superscalar processors have several instruction pipelines
operating in parallel. For example, a three-way superscalar processor (shown in
Figure 2.17) has three independent pipelines and datapath units, allowing up to three
instructions to be executed simultaneously per clock cycle.
• Instruction Execution Rate: Using this technique, the instruction execution rate can
exceed the clock rate, as multiple instructions are executed in parallel.
Both VLIW and superscalar processors rely on multiple Arithmetic Logic Units (ALUs) for
parallel execution. However, the key difference lies in how instructions are issued: in a VLIW
processor the compiler packs instructions into each long word at compile time, whereas a
superscalar processor issues instructions dynamically in hardware at run time.
While ILP and superscalar processors aim to maximize parallelism, several factors limit the
speedup achieved:
1. True Data Dependencies (RAW): Occur when an instruction needs to read a value
written by a prior instruction. This dependency is also known as Read After Write
(RAW). For example, in Figure 2.20, instruction I1 depends on I0's output.
2. Procedural Dependencies: Arise from branch instructions, as the next instruction
cannot be fetched until the branch outcome is determined. This is shown in Figure 2.21.
3. Resource Conflicts: Occur when multiple instructions require the same processor
resource, such as memory or the ALU. This can be resolved by duplicating resources
but may not always be feasible.
4. Output Dependencies (WAW): Known as Write After Write (WAW), this happens
when two instructions write to the same register. The order of execution must ensure
the correct value is written. Figure 2.22 shows an example where I0 and I2 write to the
same register R0.
5. Antidependencies (WAR): Occur when an instruction writes to a register after another
instruction has read from the same register, referred to as Write After Read (WAR). The
writing instruction must wait for the reading instruction to complete to avoid incorrect
results.
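The following snippet (registers r0, r1 and variables a, b are made up for illustration) marks where each kind of dependency arises:

    /* Illustrative only: the dependencies that limit instruction-level
     * parallelism, written as simple assignments (think of r0, r1 as
     * registers; all names are invented). */
    int r0, r1, a = 1, b = 2;

    void hazards(void)
    {
        r0 = a + b;      /* I0: writes r0, reads b                               */
        r1 = r0 * 2;     /* I1: RAW - must read r0 only after I0 has written it  */
        r0 = b - a;      /* I2: WAW - writes r0 again; must not finish before I0 */
        b  = r1 + 1;     /* I3: WAR - writes b only after I0 and I2 have read it */
    }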
MULTITHREADED PROCESSOR
A multithreaded processor can execute multiple threads simultaneously, sharing processor
resources among them. This is in contrast to a simple processor, which can only run one
thread at a time. A thread is a portion of a program that shares the same memory space and
computational resources with other threads.
1. Single-Threaded Processor:
a. A basic processor with one Arithmetic Logic Unit (ALU) runs only one thread at
a time.
b. When a thread, say T0, stalls (e.g., due to a memory access delay or a cache
miss), the program execution halts until the stall is resolved.
c. Figure 2.23a illustrates this scenario where a single-threaded processor is
forced to wait for memory access, pausing the thread's execution.
2. Multithreaded Processor with One ALU:
a. In a multithreaded processor, multiple threads (e.g., T0 and T1) can be
managed. When one thread, T0, stalls, the operating system (OS) can switch to
another thread, T1, keeping the processor busy.
b. The OS uses a pre-emptive scheduler to switch between threads. When T0 is
ready to resume execution, T1 will be paused, allowing T0 to continue. Figure
2.23b demonstrates this switch between threads to maintain continuous
execution.
3. Two-Way Superscalar Processor (Single Thread):
a. A two-way superscalar processor can run a single thread using two ALUs in
parallel, as long as data dependencies are resolved. This increases the
instruction execution rate, as multiple instructions can be processed
simultaneously.
b. Figure 2.23c depicts this scenario where thread T0 utilizes both ALUs
concurrently, improving performance through instruction-level parallelism.
4. Two-Way Superscalar Processor (Multithreading):
a. A multithreaded two-way superscalar processor can execute two threads in
parallel, switching between them when one stalls.
b. When thread T0 stalls, the processor switches to thread T1, keeping both ALUs
busy. This reduces idle time and ensures more efficient use of processor
resources. When T0 is ready to continue, it resumes execution. Figure 2.23d
illustrates this behaviour, where the processor alternates between two threads
to maintain continuous execution.
Advantages of Multithreading:
• Stalls in one thread are hidden by switching to another ready thread, so memory latency is
tolerated and the ALUs are kept busy.
• Processor resources are shared among several threads, improving overall throughput and
utilization.
PARALLEL COMPUTING
Parallel computing involves running multiple tasks simultaneously to improve performance,
especially for large or time-sensitive applications. Flynn's taxonomy classifies processors
based on the number of instructions and data they handle:
Flynn's Taxonomy:
• SISD (single instruction, single data): a conventional uniprocessor.
• SIMD (single instruction, multiple data): one instruction stream operates on many data
items in parallel.
• MISD (multiple instruction, single data): several instruction streams operate on the same
data; rarely used in practice.
• MIMD (multiple instruction, multiple data): independent processors execute different
instructions on different data, as in most multiprocessors.
Parallel Architectures:
• The following sections cover the main classes in use: shared-memory (UMA) and
distributed-memory (NUMA) multiprocessors, SIMD and systolic processors, cluster and
grid systems, and stream multiprocessors.
These architectures form the foundation for modern computing systems, balancing
performance, scalability, and application needs.
UMA & NUMA
• Shared memory model: All processors share a common memory space for
communication, known as parallel random-access memory (PRAM).
• Architecture: Processors access the shared memory through an interconnection
network, which could be a simple bus or a more complex network in larger systems.
• Memory access: Multiple processors can access the same memory, but this can
cause bottlenecks when many processors request memory access simultaneously.
o To mitigate this, systems can replace a single memory module with a bank of
memories and use interconnection networks to allow simultaneous memory
access.
• Cache coherence: A major issue in shared-memory systems is keeping the local
caches of different processors updated with the shared memory. This is managed using
cache coherence protocols.
• Programming:
o Memory read operations are simple and similar to serial programming.
o Write operations require locking mechanisms like semaphores or mutexes to
ensure data integrity during parallel execution.
o Programming libraries such as POSIX and OpenMP help in managing
synchronization with tools like locks, barriers, and monitors.
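A minimal sketch of the idea above using OpenMP: the critical directive serializes writes to a shared variable, while the reduction clause is the usual lock-free alternative. The loop and variable names are illustrative.

    /* Sketch of protecting a shared write in a shared-memory program: the
     * critical section lets only one thread update the shared sum at a time,
     * while the reduction clause gives each thread a private copy instead. */
    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        const int n = 1000000;
        long sum_critical = 0, sum_reduction = 0;

        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            #pragma omp critical          /* only one thread writes at a time */
            sum_critical += i;
        }

        #pragma omp parallel for reduction(+ : sum_reduction)
        for (int i = 0; i < n; i++)
            sum_reduction += i;           /* each thread keeps a private copy */

        printf("%ld %ld\n", sum_critical, sum_reduction);
        return 0;
    }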
• Local memory model: Each processor has its own local memory, which it can access
directly, while communication between processors requires message passing.
• Memory access:
o Processors can access other processors' memories via message-passing
interfaces (MPI), but the access time is non-uniform (NUMA), depending on
whether the processor is accessing its own memory or a remote one.
• Symmetric vs. Asymmetric:
o If all processors are identical, it’s called a symmetric multiprocessor (SMP).
o If the processors differ, it’s known as an asymmetric multiprocessor (ASMP).
• Scalability: These systems can scale to thousands of processors, making them
suitable for massive parallel computing or grid computing, used in large-scale scientific
computations.
SIMD & Systolic Processors
*CLUSTER & GRID COMPUTING
Cluster Computing
*SM
Stream Multiprocessor
• Definition: A type of SIMD or MIMD machine where processors are called streaming
processors (SPs) or thread processors.
• Stream Processor: Handles data streams with an instruction set that processes these
streams using kernels.
• GPU Connection: Stream processors are commonly found in GPUs, which can perform
general-purpose computations (GPGPU).
• Data Streams: Examples include vectors of floating-point numbers or groups of pixels
in video data.
• Locality:
o Temporal locality: Input data is reused a few times for output.
o Spatial locality: Input data is stored in the same memory block.
• Example: Modern GPUs like NVIDIA's Fermi.
Key Characteristics for Applications
TYPES OF COMMUNICATION
1. One to One (Unicast):
a. Involves one sender and one receiver.
b. Often used in SIMD machines where each processor exchanges data with its
neighbor.
c. Efficient data exchange is critical, with simple register-to-register transfers or
synchronized handshakes.
2. One to Many (Multicast):
a. One sender processor communicates with multiple receiver processors.
b. Useful when the same data needs to be sent to several processors
simultaneously.
c. Requires efficient data exchange similar to unicast.
3. One to All (Broadcast):
a. A single processor sends the same data to all processors.
b. Often used in SIMD machines and systolic arrays to provide data to all
processors in the system.
4. Gather:
a. Data is collected from multiple processors.
b. Time needed for gathering data depends on the number of processors and
transmission time.
5. Reduce:
a. Similar to gather but performs an operation (e.g., sum) on the collected data.
b. Can be hierarchical for efficiency in large systems.
c. Time depends on both gathering and processing.
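For illustration, these patterns map naturally onto MPI collective calls; the sketch below (root rank 0, integer data, at most 256 processes assumed) is not from the text.

    /* Sketch of how the communication patterns above map onto MPI collectives;
     * the root rank (0) and data sizes are illustrative. Compile with mpicc. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int value = rank;                 /* each processor's local datum */
        int bcast_val = 42;

        /* One to all (broadcast): the root sends the same word to everyone. */
        MPI_Bcast(&bcast_val, 1, MPI_INT, 0, MPI_COMM_WORLD);

        /* Gather: the root collects one word from every processor
         * (buffer assumes at most 256 processes; used only at the root).    */
        int gathered[256];
        MPI_Gather(&value, 1, MPI_INT, gathered, 1, MPI_INT, 0, MPI_COMM_WORLD);

        /* Reduce: like gather, but an operation (here a sum) is applied.    */
        int total = 0;
        MPI_Reduce(&value, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("broadcast = %d, sum of ranks 0..%d = %d\n",
                   bcast_val, size - 1, total);
        MPI_Finalize();
        return 0;
    }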
MESSAGE PASSING (MP) COMMUNICATION MECHANISM
Steps in MP Communication:
Synchronization Strategies:
• Synchronous/Blocking:
o Sender and receiver halt until the message is successfully sent/received.
• Asynchronous/Non-blocking:
o Sender and receiver continue execution without waiting for message
transmission.
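A minimal sketch of blocking versus non-blocking point-to-point message passing using standard MPI calls; the two-process setup, tags, and message contents are illustrative.

    /* Sketch of blocking vs. non-blocking message passing in MPI. */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, data = 7, recv_buf = 0;
        MPI_Request req;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* Blocking: the call returns only when the buffer is safe to reuse. */
            MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* Non-blocking: post the receive, overlap other work, then wait.   */
            MPI_Irecv(&recv_buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);
            /* ... useful computation could go here ... */
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        }

        MPI_Finalize();
        return 0;
    }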
MPI Standard:
Key Features:
Important Considerations:
1. Cache Coherence:
Cache Terminology:
• Cache coherence: Ensures data consistency between shared memory and processor
caches.
• Valid bit (V): Indicates whether the cache line is coherent with shared memory.
• Write-back: Updates shared memory only when a cache block is replaced.
• Write-through: Immediately updates shared memory when a cache block is modified.
1. Directory Protocols:
Maintain a directory to track the state of each block in the shared memory.
• Components:
o Local Cache Controllers: Handle cache updates for each processor.
o Central Controller: Maintains system-wide cache coherence.
o Directory: Part of shared memory, stores state of each shared block.
o Interconnection Network: Enables communication between controllers,
caches, and memory.
• Directory Protocol:
o Full-map directory has n + 2 bits per entry, where n is the number of processors.
o D-bit: Indicates whether data is valid (0) or modified (1).
o X-bit: Controls whether updates are broadcast (B) or not (NB).
o Pros: Tracks locations of shared blocks and ensures coherence.
o Cons: Can become a bottleneck as all transactions go through the central
controller. Entry size depends on the number of processors.
2. Snoopy Protocols:
• Components:
o Local Cache Controllers: Monitor their caches for coherence.
o Shared Memory: No directory or central controller is used.
o Interconnection Network: Supports broadcasting of data transmissions.
• Snoopy Protocol:
o Processors monitor network transactions to determine if a memory operation is
relevant.
o Invalidation-based Policy: Invalidate the cache copy and fetch updated data
from memory when needed.
o Update-based Policy: Update cache copy using data on the bus while shared
memory is updated.
• Pros: Simpler structure, no central controller.
• Cons: Limited by bus bandwidth, only one transaction can occur at a time.
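A highly simplified sketch of an invalidation-based snoopy policy (not the book's exact protocol): each cache line is INVALID, SHARED, or MODIFIED, and a remote cache controller updates the state of a line it holds based on the bus transactions it snoops.

    /* Much-simplified sketch of an invalidation-based snoopy policy. */
    #include <stdio.h>

    typedef enum { INVALID, SHARED, MODIFIED } line_state;
    typedef enum { BUS_READ, BUS_WRITE } bus_op;

    /* State change seen by a *remote* cache that snoops an operation on a
     * block it also holds. */
    line_state snoop(line_state s, bus_op op)
    {
        if (op == BUS_WRITE)
            return INVALID;              /* another cache is writing: invalidate */
        if (op == BUS_READ && s == MODIFIED)
            return SHARED;               /* supply the data, drop to shared      */
        return s;
    }

    int main(void)
    {
        line_state s = MODIFIED;
        s = snoop(s, BUS_READ);          /* another processor reads the block */
        s = snoop(s, BUS_WRITE);         /* another processor writes it       */
        printf("final state = %d (0 = INVALID)\n", s);
        return 0;
    }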
SYNCHRONIZATION AND MUTUAL EXCLUSION
1. Critical Section:
• The critical section is where a shared resource (like a variable, file, or data structure) is
accessed by multiple processes or threads.
• To prevent race conditions—where two or more threads modify the resource
simultaneously leading to unpredictable results—only one process is allowed to enter
the critical section at any time.
• If a process is inside its critical section, other processes must wait until the current
process finishes and releases the resource (unlocking it).
2. Atomic Operations:
• Atomic operations (like TestAndSet) are essential because they execute a series of
operations (reading, modifying, and writing) as a single, uninterruptible action.
• This is necessary in concurrent programming to prevent a situation where two
processes read the same value, make changes based on the outdated value, and then
both write conflicting updates to memory.
• For example, in the TestAndSet operation, it first reads the current value of a lock, then
sets the lock to "occupied" (TRUE), ensuring no other process can access the critical
section.
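A minimal spin-lock sketch built on C11's atomic_flag, whose test-and-set is a single uninterruptible operation as described above; the names and structure are illustrative.

    /* Sketch of a test-and-set spin lock using C11 atomics. */
    #include <stdatomic.h>

    static atomic_flag lock = ATOMIC_FLAG_INIT;

    void acquire(void)
    {
        /* Busy-wait (spin) until the previous value was FALSE, i.e. we got it. */
        while (atomic_flag_test_and_set(&lock))
            ;   /* spin */
    }

    void release(void)
    {
        atomic_flag_clear(&lock);   /* mark the lock free again */
    }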
3. Busy Waiting vs. Blocking:
• Busy waiting occurs when a process continuously checks (in a loop) whether the lock
is available. This can waste CPU time because the process is actively using the CPU
while waiting.
• Blocking, on the other hand, puts the waiting process in a sleep state, freeing up CPU
resources until the lock becomes available. However, blocking has overhead because
suspending and resuming processes take time.
• Busy waiting may be more efficient for very short waits, while blocking is preferred
when a process is expected to wait for a longer time.
4. Locks:
• A simple lock built directly on an atomic operation such as TestAndSet makes a waiting
process spin (busy-wait) until the lock is released.
5. Mutexes:
• A mutex (short for mutual exclusion) ensures that only one thread can enter the critical
section at a time. It has two key operations:
o wait(): Checks if the mutex is available (value 1). If available, it decrements the
value and allows the thread to proceed.
o signal(): After a thread finishes its work in the critical section, it increments the
mutex value to signal that the resource is available for other threads.
• Mutexes are implemented in operating system libraries (like POSIX threads) and are
commonly used to protect critical sections and manage resource access in multi-
threaded applications.
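A minimal POSIX-threads sketch in which pthread_mutex_lock plays the role of wait() and pthread_mutex_unlock the role of signal(); the shared counter is illustrative.

    /* Sketch of mutex-protected access to a shared variable with pthreads. */
    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static long counter = 0;                 /* shared resource */

    static void *worker(void *arg)
    {
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&m);          /* wait(): enter critical section */
            counter++;                       /* protected update               */
            pthread_mutex_unlock(&m);        /* signal(): leave and release    */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld\n", counter);  /* always 200000 */
        return 0;
    }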
6. Barriers:
• Unlike locks and mutexes that control access to shared variables, barriers are used in
parallel computing for event synchronization.
• In many parallel algorithms, tasks are divided among multiple threads that operate
independently. However, there may be certain points in the algorithm where all threads
must wait for each other to finish a particular stage before moving on to the next stage.
• Barriers ensure that all threads reach a certain point before any thread is allowed to
proceed further. This is useful in algorithms where synchronization points are needed
for correctness.
• A barrier typically tracks how many threads have reached it. Once all threads have
reached the barrier, they can continue with the next task.
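A minimal sketch of event synchronization with a POSIX barrier (available in most pthreads implementations): no thread starts stage 2 until all NTHREADS threads have finished stage 1. The stage names are illustrative.

    /* Sketch of barrier synchronization between parallel stages. */
    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    static pthread_barrier_t barrier;

    static void *stage_worker(void *arg)
    {
        long id = (long)arg;
        printf("thread %ld finished stage 1\n", id);
        pthread_barrier_wait(&barrier);      /* wait for all threads */
        printf("thread %ld starting stage 2\n", id);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        pthread_barrier_init(&barrier, NULL, NTHREADS);
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, stage_worker, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        pthread_barrier_destroy(&barrier);
        return 0;
    }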
7. Synchronization Issues:
• Deadlock: If two or more processes are waiting for each other to release resources,
none of them can proceed, leading to a deadlock. Proper handling of locks and
synchronization is required to avoid this.
• Starvation: This occurs when a process waits indefinitely to acquire a lock or access
the critical section because other processes continuously occupy the resource.
• Priority Inversion: A low-priority process holding a lock may delay higher-priority
processes waiting for the same lock. In some systems, priority inheritance mechanisms
are used to handle this.
Comparison:
• Locks are simple and efficient but can result in busy waiting and resource contention
when many processes attempt to acquire the lock.
• Mutexes are more structured and avoid busy waiting by suspending the waiting
process, but they are slightly more memory-intensive and require proper management
of wait() and signal() calls.
• Barriers are more suitable for parallel processing where synchronization of
independent tasks is necessary. They ensure that all threads reach a certain point
before proceeding, which helps maintain execution order in parallel algorithms.