
CHAPTER 1

Layers of implementing an application in software or hardware using parallel computers:

• Layer 5 (Application Layer): Defines the problem to be solved, along with its inputs and outputs, guiding the development of the algorithm.
• Layer 4 (Algorithm Design): Develops the algorithm, defining the tasks and their functional interdependencies. At this point, parallelism may not yet be evident.
• Layer 3 (Parallelization & Scheduling): Determines which parts of the algorithm can be executed in parallel and schedules the threads/tasks, whether the target is software or custom hardware (e.g., a VLSI implementation).
• Layer 2 (Coding Layer): Converts the parallel algorithm into code using high-level languages. Tools such as Cilk++, OpenMP, or CUDA help manage task execution on general-purpose platforms, while HDLs such as Verilog or VHDL are used for custom hardware platforms like systolic arrays.
• Layer 1 (Hardware Execution): Implements the algorithm or application on a parallel computing platform, either through multithreading on a parallel computer or using application-specific systems such as ASICs or FPGAs.

Challenges in Automating Parallelization

• Automatic Serial Programming: For serial computers, the programmer writes code in
high-level languages (C, Java, FORTRAN), and compilers handle the hardware-specific
details, producing fast code without the programmer needing to understand the
underlying hardware.
• Parallel Computers and Programming: For parallel systems, parallelizing compilers can automate only simple cases, such as distributing the iterations of a loop across processors for embarrassingly parallel algorithms (a minimal OpenMP sketch of such a loop follows below). Beyond that, programmers must have detailed knowledge of how the processors interact and when tasks should be executed.
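As a concrete illustration of the kind of loop a parallelizing compiler or a tool such as OpenMP can distribute automatically, here is a minimal sketch (array names and sizes are arbitrary) of an embarrassingly parallel loop: every iteration is independent, so the iterations can simply be split among threads.

#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double a[N], b[N], c[N];

    /* Initialize the input vectors serially. */
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

    /* Each iteration is independent of the others, so the loop is
       embarrassingly parallel: OpenMP splits the iterations among threads. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[N-1] = %f\n", c[N - 1]);
    return 0;
}

Compiled with an OpenMP-enabled compiler (e.g., -fopenmp), the pragma is all that is needed; without it, the same code runs serially.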
Algorithms
According to IEEE Standard Dictionary of Electrical and Electronics Terms:

"An algorithm is a prescribed set of well-defined rules or processes for solving a problem in a
finite number of steps."

Basic Components of an Algorithm

1. Tasks: The steps involved in solving a problem.


2. Dependencies: Some tasks depend on the results of others.
3. Inputs: The initial data required to start the algorithm.
4. Outputs: The final results of the algorithm.

Algorithm Dependence Graph (DG)


• A dependence graph is a set of nodes and edges. The nodes represent the tasks to be
done by the algorithm and the edges represent the data used by the tasks. This data
could be input, output, or internal results.
• A DG is a set of nodes and directed edges. The nodes represent the tasks to be done by the algorithm, and the directed edges represent the data dependencies among the tasks. The start of an edge is an output of one task, and the end of the edge is an input to the task that consumes it.
• A directed acyclic graph (DAG) is a DG that has no cycles or loops.
• An input edge in a DG is one that terminates on one or more nodes but does not start
from any node. It represents one of the algorithm inputs.
• An output edge in a DG is one that starts from a node but does not terminate on any
other node. It represents one of the algorithm outputs.
• An internal edge in a DG is one that starts from a node and terminates on one or more other nodes. It represents one of the algorithm's internal variables.
• An input node in a DG is one whose incoming edges are all input edges.
• An output node in a DG is one whose outgoing edges are all output edges.
• An internal node in a DG is one that has at least one incoming internal edge and at
least one outgoing internal edge.
Adjacency matrix
• Definition: The adjacency matrix A is a square matrix of size W×W, where each element a(i,j) indicates whether node/task i depends on the output of node/task j (a small C sketch after this list illustrates these definitions).
o a(i,j) = 1 means node i depends on node j.
o a(i,j) = 0 means there is no dependency.
o There are no self-loops, so a(i,i) = 0 for all nodes.
• Matrix Properties:
o The matrix is generally not symmetric: i depending on j does not imply that j depends on i.
• Input and Output Nodes:
o Input Nodes: Rows corresponding to input nodes have all zeroes, indicating that
these nodes do not depend on any other nodes.
o Output Nodes: Columns corresponding to output nodes have all zeroes,
indicating that no other nodes depend on these nodes.
• Internal Nodes: Nodes that are neither input nor output have one or more non-zero
elements in both their row and column, indicating they have dependencies on other
nodes and are depended on by other nodes.
• Parent-Child Relationship: If a(i,j)=1, node j is a parent of node i, meaning node i
relies on the output of node j.
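The following small C sketch (using a hypothetical four-task graph) shows how these definitions translate into code: an all-zero row marks an input node and an all-zero column marks an output node.

#include <stdio.h>

#define W 4   /* number of tasks/nodes in this hypothetical example */

int main(void) {
    /* a[i][j] = 1 means task i depends on the output of task j.
       The diagonal is zero: no self-loops. */
    int a[W][W] = {
        {0, 0, 0, 0},   /* task 0: input node (row of zeroes)       */
        {1, 0, 0, 0},   /* task 1 depends on task 0                 */
        {1, 0, 0, 0},   /* task 2 depends on task 0                 */
        {0, 1, 1, 0}    /* task 3 depends on tasks 1 and 2 (output) */
    };

    for (int i = 0; i < W; i++) {
        int row = 0, col = 0;
        for (int j = 0; j < W; j++) {
            row += a[i][j];   /* dependencies of node i (its parents)   */
            col += a[j][i];   /* nodes that depend on node i (children) */
        }
        if (row == 0)      printf("node %d is an input node\n", i);
        else if (col == 0) printf("node %d is an output node\n", i);
        else               printf("node %d is an internal node\n", i);
    }
    return 0;
}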

Classifying Algorithms Based on Task Dependences


• Serial algorithms
• Parallel algorithms
• Serial–parallel algorithms (SPAs)
• Nonserial–parallel algorithms (NSPAs)
• Regular iterative algorithms (RIAs)

Serial Algorithms
A serial algorithm involves tasks that must be executed one after another due to data
dependencies. The directed graph (DG) representation of such an algorithm resembles a long
chain of dependent tasks, where each task relies on the result of the previous ones.

Fibonacci Example: To compute the 10th Fibonacci number n10, task T10 calculates
n10 = n8 + n9, with the initial conditions n0 = 0 and n1 = 1.


This structure shows that each Fibonacci number can only be calculated after the previous
two numbers have been computed, making it inherently serial. Tasks must be performed in
sequence, with no possibility of parallel execution.
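A minimal C version of this example makes the serial chain explicit: each task T_i can only run after T_{i-1} and T_{i-2} have produced their results.

#include <stdio.h>

/* Each task T_i computes n_i = n_{i-1} + n_{i-2}; it cannot start until the
   two previous tasks have finished, so the tasks form a serial chain. */
int main(void) {
    long n[11];
    n[0] = 0;                             /* initial condition n0 */
    n[1] = 1;                             /* initial condition n1 */
    for (int i = 2; i <= 10; i++)
        n[i] = n[i - 1] + n[i - 2];       /* task T_i depends on T_{i-1} and T_{i-2} */
    printf("n10 = %ld\n", n[10]);         /* prints 55 */
    return 0;
}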

Parallel Algorithms
Parallel algorithms are characterized by tasks that can be executed simultaneously because
they are data-independent—one task does not depend on the result of another. In a directed
graph (DG), this would appear as a wide row of independent tasks.

• Examples of Parallel Algorithms:


o Web server: Each incoming request can be handled independently of other
requests, allowing for concurrent processing.
o Multitasking in operating systems: The system can handle different
applications (e.g., web browser, word processor) simultaneously, as they
operate independently.

This structure allows tasks to be executed in parallel, leading to faster processing when
compared to serial algorithms.

SPAs
A Serial-Parallel Algorithm (SPA) involves grouping tasks into stages, where tasks in each
stage can be executed concurrently (in parallel), but the stages themselves are executed
sequentially.

• SPA Characteristics:
o Stages: Tasks within each stage run in parallel.
o Sequential Stages: Each stage is executed one after another.
o Conversion to Parallel or Serial:
▪ If there is only one stage, the SPA becomes a parallel algorithm.
▪ If each stage contains only one task, it becomes a serial algorithm.
• Example of SPA: The CORDIC algorithm requires several iterations. In each iteration, three coupled operations update x, y, and z using the constants δi and θi, taken from lookup tables, and a control parameter m (a sketch of one common form of these updates follows at the end of this section). The three operations within an iteration form one stage and can run concurrently, but the iterations themselves must be executed in sequential order.

SPAs balance between parallelism and sequential execution, making them useful for certain
computational tasks where full parallelism isn't possible.
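The sketch below shows one common form of the CORDIC iteration (sign conventions differ between rotation modes, and the delta/theta tables here are placeholders rather than real CORDIC constants). It illustrates the SPA structure: the three updates inside a stage are mutually independent, while the stages themselves run strictly in sequence.

#include <stdio.h>

#define ITER 8

int main(void) {
    double delta[ITER], theta[ITER];
    double x = 1.0, y = 0.0, z = 0.5;    /* arbitrary starting values */
    double m = 1.0;                      /* control parameter */

    for (int i = 0; i < ITER; i++) {     /* fill placeholder lookup tables */
        delta[i] = 1.0 / (1 << i);
        theta[i] = delta[i];             /* placeholder, not real CORDIC angles */
    }

    /* Stages (iterations) are strictly sequential, but the three updates
       inside one stage are independent of each other and could run in parallel. */
    for (int i = 0; i < ITER; i++) {
        double xi = x, yi = y, zi = z;
        x = xi - m * delta[i] * yi;
        y = yi + delta[i] * xi;
        z = zi - theta[i];
    }
    printf("x=%f y=%f z=%f\n", x, y, z);
    return 0;
}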

NSPAs
Non-Serial-Parallel Algorithms (NSPAs) do not conform to the traditional classifications of
serial, parallel, or serial-parallel algorithms. NSPAs can be represented by directed graphs
(DGs) and are further divided into two main categories based on whether their graphs contain
cycles:

1. Directed Acyclic Graph (DAG):


a. The DG does not contain any cycles.
b. An example is shown in Figure 1.4a, where tasks are executed in a flow without
returning to previous tasks.
2. Directed Cyclic Graph (DCG):
a. The DG contains cycles, indicating feedback loops.
b. Figure 1.4b is an example of a DCG, commonly encountered in feedback control
systems.
c. In such systems, tasks may reuse previous outputs, such as in control loops
(e.g., error signal in feedback systems).

Key Constructs of NSPA Graphs:

• Nodes: Represent tasks in the algorithm.


• Directed Edges: Represent the direction of data flow between tasks. If task Tj depends on the output of task Ti, there is a directed edge from node i to node j.

Properties of NSPA Graphs:

1. Work (W): The total amount of processing work required to complete the algorithm.
2. Depth (D): Also known as the critical path, it is the maximum path length between any
input node and any output node.
3. Parallelism (P): The degree of parallelism, defined as the maximum number of nodes
(tasks) that can be processed in parallel. The maximum number of parallel processors
that can be active at a given time is constrained by this degree.

These properties are crucial when mapping an NSPA onto a parallel computing platform for
optimal execution.

RIA
Regular Iterative Algorithms (RIAs), introduced by Karp et al., are a class of algorithms found
in various fields such as signal processing, image processing, video processing, linear algebra,
and numerical simulations. These algorithms are particularly noteworthy because they can be
implemented on grid structures.
Key Features of RIAs:

• Dependence Graph: Unlike Directed Acyclic Graphs (DAGs), RIAs use dependence
graphs to represent task dependencies. These graphs do not have directed edges, but
rather show fixed patterns of dependencies among tasks.
• Fixed Dependency Pattern: Tasks in an RIA exhibit a regular, predictable pattern of
dependencies, which differentiates them from other types of algorithms.

Example of RIA:

• Pattern Matching Algorithm: Figure 1.5 illustrates a dependence graph for a pattern
matching algorithm, where nodes represent tasks, and edges indicate the fixed
dependencies among tasks.
• Matrix-Matrix Multiplication Algorithm (Algorithm 1.1): A simple example of an RIA.
The algorithm involves iterating over matrices A and B to compute the result matrix C
using regular dependencies on the algorithm indices i, j, and k.
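A plain C triple loop in the spirit of Algorithm 1.1 shows the regular dependence pattern: every task is indexed by (i, j, k), and the only serial chain is the accumulation over k.

#include <stdio.h>

#define I 4
#define J 4
#define K 4

int main(void) {
    double A[I][K], B[K][J], C[I][J];

    /* Arbitrary input values. */
    for (int i = 0; i < I; i++)
        for (int k = 0; k < K; k++) A[i][k] = i + k;
    for (int k = 0; k < K; k++)
        for (int j = 0; j < J; j++) B[k][j] = k - j;

    /* Regular iterative structure: the dependencies involve only the fixed
       index pattern of i, j, and k, which is what makes this an RIA. */
    for (int i = 0; i < I; i++)
        for (int j = 0; j < J; j++) {
            C[i][j] = 0.0;
            for (int k = 0; k < K; k++)        /* the accumulation over k is   */
                C[i][j] += A[i][k] * B[k][j];  /* the only serial dependence   */
        }

    printf("C[0][0] = %f\n", C[0][0]);
    return 0;
}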
Challenges in Parallelizing RIAs:

• Parallelization Complexity: Unlike simpler algorithm classes (serial, parallel, SPA),


RIAs are more complex to parallelize due to their fixed dependency patterns.
• Higher Dimensionality: While dependence graphs are useful for algorithms with one or
two indices, RIAs with more dimensions, like the matrix-matrix multiplication (3 indices:
i, j, and k), require more formal techniques for analysis and visualization.

*PARALLEL COMPUTING DESIGN CONSIDERATIONS


1. Processor Architecture: The design of a parallel computing system starts with
selecting a processor architecture, which could range from simple elements to
superscalar processors running multithreaded operating systems.

2. Interconnection Network: Processors must communicate via an interconnection


network, which should support simultaneous communication between arbitrary pairs
of processors. A bus is a simple form of this network, but modern systems use
networks-on-chips (NoC) for efficient data exchange.

3. Data Exchange: Data can be exchanged in the form of words using a system clock, or
as packets in NoC architectures, routed between chip modules via routers.

4. Memory System: Memory design involves choosing between shared memory modules
for multiple processors or dedicated memory for each processor. Mechanisms are
needed to ensure data integrity when processors share data, particularly for read and
write operations.
5. Task Partitioning: The program is broken into segments allocated to processors. In
coarse grain partitioning, large segments are assigned to processors, while fine grain
partitioning involves smaller segments (e.g., software processes or threads). This
partitioning can be done by the programmer or compiler.

6. Synchronization: Proper synchronization is required among tasks to maintain program


correctness and data integrity, and this responsibility can fall to the programmer or
operating system.

PARALLEL ALGORITHMS AND PARALLEL ARCHITECTURES


1. Data-Level Parallelism: Operates on multiple bits of data simultaneously, using
techniques like bit-parallel addition, multiplication, and systolic arrays for processing
multiple data samples.
2. Instruction-Level Parallelism (ILP): Executes more than one instruction
simultaneously, often using instruction pipelining to improve efficiency.
3. Thread-Level Parallelism (TLP): Involves executing multiple software threads at the
same time on one or more processors, where a thread is a lightweight process sharing
resources with other threads.
4. Process-Level Parallelism: Classic multitasking where multiple programs (processes)
run concurrently, each reserving its own resources like memory and registers.

*RELATING PARALLEL ALGORITHM AND PARALLEL ARCHITECTURE
• Parallel Algorithm: An algorithm is considered parallel if two or more parts can be
executed independently, often identified through "FOR" or "WHILE" loops in the code.
• Software-Hardware Parallelism Link: Parallelism in software depends on supporting
hardware capable of executing different threads or processes concurrently, such as on
different processors.
• Parallel Architecture: According to IEEE, parallel architecture involves a
multiprocessor system that supports parallel processing. The programmer, compiler,
or operating system supplies tasks to keep the processors busy.
• Examples of Parallel Algorithms:
o Scientific Computing: Physical simulations, differential equation solvers, wind
tunnel simulations, weather simulation.
o Computer Graphics: Image processing, video compression, ray tracing.
o Medical Imaging: MRI, computerized tomography (CT).
• Non-Parallel Algorithms: Many algorithms in areas like online medical data, banking,
data mining, and databases are not inherently parallel.
• Challenge: The challenge is to develop computer architectures and software to speed
up information technology applications that are not readily parallelizable.

*IMPLEMENTATION OF ALGORITHMS: A TWO-SIDED PROBLEM


• Two-Sided Problem: Involves the relationship between parallel algorithms and parallel
architectures.
• Route A:
o Given an algorithm, explore possible parallel hardware or processor arrays.
o Ensure correct implementation according to performance requirements and
system constraints.
o Question: What are the possible parallel processor architectures for a given
parallel algorithm?
• Route B:
o Given a parallel architecture or multicore system, explore how to best
implement a given algorithm.
o Focus on how to allocate tasks of the algorithm to different processors.
o Question: How can the different tasks of the algorithm be allocated to different
processors?
• Parallel Programming: Achieved using multithreading design techniques, handled by
the application programmer, software compiler, and operating system.
• Key Considerations:
o Mapping tasks to different processors.
o Scheduling task execution based on algorithm data dependency and I/O
requirements.
o Identifying data communication between processors and I/O.

MEASURING BENEFITS OF PARALLEL COMPUTING

1. Speedup Factor

• Speedup (S(N)) measures how much faster a task is completed with N processors
compared to 1 processor.
• Formula for Speedup:
o S(N) = Tp(1) / Tp(N)
o Tp(1): Time on a single processor.
o Tp(N): Time on N parallel processors.
• In an ideal case (no communication delay), speedup is linear:
o S(N) = N.

2. Communication Overhead

• Parallel computing involves overhead due to data transfer between processors and
memory, which can reduce speedup.
• Communication overhead comes from:
o Network delay: Time for data transmission and queuing in the network.
o Memory bandwidth: Limits on memory access speed (one word per access).
o Memory collisions: Multiple processors trying to access the same memory
block.
o Memory wall: Processing speed is much faster than memory speed.

3. Estimating Speedup with Communication Overhead

• Processing time (N independent tasks, each requiring τp):
o 1 processor: Tp(1) = N τp
o N processors: Tp(N) = τp
• Time to read the input data:
o 1 processor: Tr(1) = N τm
o N processors with shared memory: Tr(N) = α N τm (where α adjusts for memory access limitations).
• Writing the results:
o 1 processor: Tw(1) = N τm
o N processors: Tw(N) = α N τm
• Total time (including overhead):
o 1 processor: Ttotal(1) = N τp + 2N τm
o N processors: Ttotal(N) = τp + 2α N τm

4. Speedup Formula with Overhead

• Speedup considering communication overhead:
o S(N) = Ttotal(1) / Ttotal(N) = N(τp + 2τm) / (τp + 2α N τm)
• Memory Mismatch Ratio (R): R = τm / τp, so the speedup can be written as
o S(N) = N(1 + 2R) / (1 + 2α N R)

5. Key Insights

• Full speedup occurs when RN≪1 (small memory access delays).


• If RN>0.1, speedup decreases significantly, leading to communication-bound
problems.
• Multicore processors reduce memory delay, making them more efficient than
multiprocessor systems.
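The short C program below simply evaluates the speedup expression with communication overhead derived above for a few values of N and R (the chosen values are arbitrary); it shows numerically how the speedup collapses once the product NR is no longer small.

#include <stdio.h>

/* Evaluates S(N) = N(1 + 2R) / (1 + 2*alpha*N*R): alpha models memory access
   limitations and R = tau_m / tau_p is the memory mismatch ratio. */
static double speedup(double N, double R, double alpha) {
    return N * (1.0 + 2.0 * R) / (1.0 + 2.0 * alpha * N * R);
}

int main(void) {
    double alpha = 1.0;                      /* fully serialized shared memory */
    double Ns[] = {2, 8, 64, 1024};
    double Rs[] = {0.0001, 0.01, 0.1};

    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 3; j++)
            printf("N=%6.0f R=%.4f NR=%8.3f S=%8.2f\n",
                   Ns[i], Rs[j], Ns[i] * Rs[j], speedup(Ns[i], Rs[j], alpha));
    return 0;
}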

Amdahl’s Law for Multiprocessor Systems

• If a fraction f of the algorithm must be executed serially and the remaining fraction (1 − f) can be distributed evenly over N processors, the speedup is S(N) = 1 / (f + (1 − f)/N).
• As N grows, S(N) saturates at 1/f: the serial fraction, not the number of processors, limits the achievable speedup.

GUSTAFSON–BARSIS’S LAW

• Assumes the problem size grows with the number of processors, so the parallel portion of the work scales while the serial portion stays fixed. The scaled speedup is S(N) = N − f(N − 1), where f is the serial fraction of the execution time measured on the parallel system.
• For small f this predicts nearly linear speedup, which is why large, scalable problems benefit more from added processors than Amdahl’s law alone suggests.

*APPLICATIONS OF PARALLEL COMPUTING
1. General Impact:
• Parallel computing is powerful and impacts many areas of daily life.
• Examples include web search engines that provide real-time results as we type.
2. Climate Modelling:
• Used for weather forecasting and predicting global climate changes.
• High-resolution climate models require massive computational power.
• Climate simulations involve billions of 3D weather cells and floating-point
operations, which need parallel processing to complete efficiently.
3. CT Scanning (Medical Imaging):
• CT scans and MRI generate high-resolution images of the body for medical
diagnosis.
• Generating images requires processing millions of pixels and performing trillions
of floating-point operations.
• Real-time image generation for better diagnosis relies heavily on parallel
computing.
4. Computational Fluid Dynamics (CFD):
• CFD studies the flow of fluids and gases in various fields like weather prediction,
blood flow, and airplane design.
• Complex fluid simulations require solving systems of equations at grid points,
needing enormous computing power.
• Parallel computing accelerates these simulations, reducing time from years to
minutes or days.
• Key Applications of CFD:
1. Ocean currents and global weather.
2. Blood flow in arteries and heart deformation.
3. Airplane wing design and car body shape optimization.
CHAPTER 2

*INCREASING PROCESSOR CLOCK FREQUENCY

• Increasing Clock Frequency: Boosts the number of instructions a computer can


execute per unit time.
• Delays in Switching: Logic gates and system buses need time to switch states,
affecting speed.
• Technology Factors: Clock speed depends on underlying silicon tech like NMOS,
CMOS, and bipolar.
• Gate Circuits: Circuit types like CMOS, domino logic, and current-mode logic also
affect clock speed.
• Dynamic Power Dissipation: Power dissipation (p) is proportional to capacitance (C),
clock frequency (f), and supply voltage (V), as given by p = C f V².
• Reducing Power Consumption: Techniques include reducing capacitance through
finer chip processes and lowering supply voltage (e.g., from 5.0V to 1.2V).
• Voltage Limit: Further lowering voltage affects the gate's noise margin, limiting how
much the voltage can be reduced.

MEMORY HIERARCHY

• Ideal Memory Attributes: Ideally, memory should be nonvolatile, have short access
times to match processor speed, large storage capacity, and be inexpensive. However,
such a perfect memory doesn't exist yet.
• Memory Hierarchy: Current systems use a hierarchy of memory types including
registers, cache, RAM, and mass storage (magnetic, optical, and flash drives) to
optimize cost and performance.
• Registers & Cache: Registers are the fastest and located within the CPU, but their
capacity is limited. Cache memory, built using SRAM, is also very fast and close to the
CPU, but it has limited capacity compared to RAM.
• Cache Efficiency: Cache exploits temporal locality (reusing recently accessed data)
and spatial locality (accessing data close to current data). Data is transferred between
cache and main memory in blocks to improve performance.
• RAM (DRAM): RAM is slower but has a much larger capacity than cache. It forms the
main memory of the system. However, it is volatile and requires constant refreshing to
retain data, which makes it slower compared to registers or cache.
• Mass Storage: Mass storage, like magnetic and optical disks, is inexpensive and offers
large capacities. However, it is significantly slower since it relies on mechanical
operations.
• Flash Drives: Flash memory, used in solid-state drives (SSDs), offers faster
performance than traditional disks, is non-volatile, and has a growing capacity with
technological advances. However, its speed still doesn't match that of on-chip memory
like cache.

CACHE MEMORY OPERATION


• Cache Memory and Main Memory Blocks: Memory and cache communicate using
blocks of words. Main memory is divided into blocks, and the cache stores a smaller
number of corresponding lines that include both the data and an associated tag.
• Main Memory Block Addressing: Main memory is divided into 2^b blocks, each
addressed by a b-bit address.
• Cache Lines: The cache stores 2^c lines, where each line contains a block from the
main memory along with a tag. The tag identifies which block from the main memory
corresponds to the cache line.
• Cache Hit: When the processor requests data, it checks if the word is present in the
cache. If it is, this is called a cache hit, and the data is delivered quickly to the
processor.
• Cache Miss: If the requested word is not in the cache, it's a cache miss. The processor
pauses, and the block containing the word is fetched from main memory into the cache
before the word is sent to the processor.
• Data Transfer Speed: Cache-to-main memory data transfer occurs in blocks, with the
speed of the data transfer matching the speed of DRAM (main memory) access.

CACHE DESIGN

• Processing Speed: High when data/instructions are located in the cache; slows down
significantly when data is not in the cache.
• Cache Design: Several factors affect cache performance, including:
o Cache Size: Larger caches (2^c) increase chances of cache hits.
o Mapping Technique: Determines how memory block addresses are associated
with cache line addresses.
o Cache Replacement Policy: Manages which blocks are loaded into the cache
and which lines are removed.
o Cache Hierarchy: Use of multiple cache levels (L1, L2) to improve performance
further.
CACHE HIERARCHY
• Cache Hierarchy: Designed to enhance performance and reduce cache misses by
organizing cache into different levels.
• L1 Cache:
o On-chip, very fast, but with limited capacity.
o Directly communicates with the CPU.
• L2 Cache:
o Off-chip, slower than L1, but has a larger capacity.
o Built using fast SRAM technology.
• Purpose: To balance speed and capacity by using multiple cache levels, improving
access times compared to accessing main memory.

MAPPING MEMORY INTO CACHE LINES

• Mapping Memory Blocks into Cache Lines: Establishes correspondence between


memory blocks and cache lines.
• Memory Example:
o 64K memory with 16-bit address.
o 12-bit block address (for 16 words per block), and 4 bits for word within the
block.
• Cache Memory:
o Example: 128 blocks, requiring 7 bits for cache line address.
• Mapping Techniques:
o Direct Mapping:
▪ Block mapped to cache line using least significant 7 bits.
▪ 5-bit tag to handle multiple blocks mapping to the same line.
o Associative Mapping:
▪ Block can be placed in any available cache line.
▪ Uses full tag to associate a block with a cache line.
o Set-Associative Mapping:
▪ Cache divided into sets (e.g., 32 sets for 128 blocks, 4 blocks per set).
▪ Block placed in any available line within a set using 5 bits for set address
and 7 tag bits.
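The bit manipulation behind these mappings can be written down directly. The C sketch below decodes a 16-bit address for the example above (4-bit word offset, 12-bit block address) under both direct mapping and 4-way set-associative mapping; the address value itself is arbitrary.

#include <stdio.h>

int main(void) {
    unsigned addr = 0xBEEF;               /* arbitrary 16-bit address */

    unsigned word  = addr & 0xF;          /* 4-bit word within the block  */
    unsigned block = addr >> 4;           /* 12-bit block address         */

    /* Direct mapping: least significant 7 bits pick the cache line,
       the remaining 5 bits are stored as the tag. */
    unsigned line   = block & 0x7F;       /* 7-bit line index (128 lines) */
    unsigned tag_dm = block >> 7;         /* 5-bit tag                    */

    /* 4-way set-associative mapping: 32 sets -> 5-bit set index, 7-bit tag. */
    unsigned set    = block & 0x1F;       /* 5-bit set index (32 sets)    */
    unsigned tag_sa = block >> 5;         /* 7-bit tag                    */

    printf("word=%u block=%u | direct: line=%u tag=%u | set-assoc: set=%u tag=%u\n",
           word, block, line, tag_dm, set, tag_sa);
    return 0;
}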
EFFECTS OF CACHE SIZE ON CACHE MISSES
Compulsory Misses (Cold-Start Misses):

• Occur when a block is accessed for the first time and has never been loaded
into the cache.
• Cache size has no effect on reducing these misses.

Capacity Misses:

• Occur when the cache is too small to hold all the blocks required by the
program.
• Can be reduced by increasing the cache size.

Conflict Misses (Collision Misses):

• Occur in set-associative or direct-mapped caches when multiple blocks map


to the same cache set.
• Can be reduced by increasing associativity, cache size, or reducing block
size.

PIPELINING
• Pipelining: A technique to improve system throughput, or the rate of task completion
per unit time.
• Conditions for Effective Pipelining:
o Multiple instances of a task are required.
o The task can be divided into several smaller subtasks.
• Example: Car manufacturing, where many cars are produced and the process involves
several components.
• Execution Process:
o A task is divided into smaller stages, each completed faster than the whole task.
o Registers are placed between stages to store intermediate results, allowing
successive tasks to be processed efficiently.
*ESTIMATING PIPELINE SPEED

• If a task is divided into C pipeline stages, each completing in one clock period τ, then N tasks finish in (C + N − 1)τ clock periods instead of the NCτ needed without pipelining.
• The resulting speedup is NC / (C + N − 1), which approaches C as N becomes large; the slowest stage (plus the register delay between stages) determines the achievable clock period τ.
VERY LONG INSTRUCTION WORD (VLIW) PROCESSORS

VLIW processors exploit fine-grain parallelism by executing multiple instructions


simultaneously. This technique operates at the instruction level, where algorithms are broken
down to the smallest unit of operation.
Key Characteristics:

• Multiple Instructions in One Word: A VLIW word contains several instructions or


opcodes issued to the CPU at once.
• Compiler Role: The compiler is responsible for selecting instructions that can be
executed simultaneously. It ensures:
o There are no dependencies between instructions in a VLIW word.
o The hardware supports the concurrent execution of the issued instructions.

This pre-execution scheduling gives VLIW processors an advantage over pipelined


processors, as the code is organized before execution, reducing the likelihood of instruction
stalls.

VLIW Processor Architecture

• Figure 2.16a: The VLIW processor controls two datapath units. A single VLIW word is
split into multiple instructions, each controlling a different unit.
• Figure 2.16b: This figure illustrates the contents of a VLIW word across multiple
machine cycles. Each row represents a different VLIW word issued at a specific cycle:
o Gray boxes represent active instructions within the VLIW word.
o Empty boxes represent no-op (no operation) instructions, used when the
compiler cannot resolve dependencies between instructions or when the
required datapath unit is unavailable.

Benefits of VLIW:

• Increased Parallelism: By issuing multiple instructions in parallel, VLIW processors


can achieve higher instruction throughput.
• Reduced Complexity: Since instruction scheduling is done at compile-time, the
processor hardware is simpler compared to dynamic scheduling methods.

However, VLIW processors are highly dependent on compiler efficiency, as improper


instruction scheduling can lead to suboptimal performance (e.g., more no-op instructions).

INSTRUCTION-LEVEL PARALLELISM (ILP) AND SUPERSCALAR PROCESSORS

A superscalar processor is designed to execute multiple instructions simultaneously from


independent instruction pipelines, offering a higher degree of parallelism at the instruction
level. The core idea is to improve performance by allowing multiple instructions to be issued
and executed in parallel within a single clock cycle.

Key Features:

• Dynamic Scheduling: The processor includes a dynamic scheduler that examines the
instructions in memory and decides which to issue to each pipeline. This allows for out-
of-order execution, increasing instruction throughput.
• Multiple Pipelines: Superscalar processors have several instruction pipelines
operating in parallel. For example, a three-way superscalar processor (shown in
Figure 2.17) has three independent pipelines and datapath units, allowing up to three
instructions to be executed simultaneously per clock cycle.
• Instruction Execution Rate: Using this technique, the instruction execution rate can
exceed the clock rate, as multiple instructions are executed in parallel.

Instruction Pipelines in Superscalar Processors

• Two-Way Superscalar Pipeline: In Figure 2.18, the structure of a two-way superscalar


processor is illustrated, where two instruction pipelines work in parallel, each
controlling a separate datapath unit.
Comparing VLIW and Superscalar Processors

Both VLIW and superscalar processors rely on multiple Arithmetic Logic Units (ALUs) for
parallel execution. However, the key difference lies in how instructions are issued:

• VLIW processors issue instructions in a pre-scheduled order by the compiler, ensuring


no dependencies exist.
• Superscalar processors use dynamic scheduling at runtime, allowing for out-of-order
execution based on hardware capabilities (Figure 2.19).

ILP Limitations: Causes of Instruction Delays

While ILP and superscalar processors aim to maximize parallelism, several factors limit the
speedup achieved:

1. True Data Dependencies (RAW): Occur when an instruction needs to read a value written by a prior instruction. This dependency is also known as Read After Write (RAW). For example, in Figure 2.20, instruction I1 depends on I0's output.
2. Procedural Dependencies: Arise from branch instructions, as the next instruction
cannot be fetched until the branch outcome is determined. This is shown in Figure 2.21.
3. Resource Conflicts: Occur when multiple instructions require the same processor
resource, such as memory or the ALU. This can be resolved by duplicating resources
but may not always be feasible.
4. Output Dependencies (WAW): Known as Write After Write (WAW), this happens
when two instructions write to the same register. The order of execution must ensure
the correct value is written. Figure 2.22 shows an example where I_0 and I_2 write to the
same register R_0.
5. Antidependencies (WAR): Occur when a later instruction writes to a register that an earlier instruction still needs to read, referred to as Write After Read (WAR). The writing instruction must wait until the reading instruction has obtained the old value to avoid incorrect results (the short code fragment below illustrates RAW, WAW, and WAR).
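The fragment below (plain C statements standing in for machine instructions, with invented labels I0-I3 and registers r0-r3) illustrates the RAW, WAW, and WAR dependencies listed above.

/* Straight-line code annotated with the instruction-level dependencies
   described above; I0..I3 label the "instructions" (statements). */
int r0, r1, r2, r3;

void hazards(int a, int b, int c) {
    r0 = a + b;        /* I0: writes r0                                    */
    r1 = r0 + c;       /* I1: reads r0  -> true dependence on I0 (RAW)     */
    r0 = b * c;        /* I2: writes r0 -> output dependence with I0 (WAW)
                              and antidependence with I1 (WAR): I2 must not
                              overwrite r0 before I1 has read it           */
    r3 = r2 + 1;       /* I3: independent of I0-I2, could issue in parallel */
}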

MULTITHREADED PROCESSOR
A multithreaded processor can execute multiple threads simultaneously, sharing processor
resources among them. This is in contrast to a simple processor, which can only run one
thread at a time. A thread is a portion of a program that shares the same memory space and
computational resources with other threads.

1. Single-Threaded Processor:
a. A basic processor with one Arithmetic Logic Unit (ALU) runs only one thread at
a time.
b. When a thread, say T0, stalls (e.g., due to a memory access delay or a cache
miss), the program execution halts until the stall is resolved.
c. Figure 2.23a illustrates this scenario where a single-threaded processor is
forced to wait for memory access, pausing the thread's execution.
2. Multithreaded Processor with One ALU:
a. In a multithreaded processor, multiple threads (e.g., T0 and T1) can be
managed. When one thread, T0, stalls, the operating system (OS) can switch to
another thread, T1, keeping the processor busy.
b. The OS uses a pre-emptive scheduler to switch between threads. When T0 is
ready to resume execution, T1 will be paused, allowing T0 to continue. Figure
2.23b demonstrates this switch between threads to maintain continuous
execution.
3. Two-Way Superscalar Processor (Single Thread):
a. A two-way superscalar processor can run a single thread using two ALUs in
parallel, as long as data dependencies are resolved. This increases the
instruction execution rate, as multiple instructions can be processed
simultaneously.
b. Figure 2.23c depicts this scenario where thread T0 utilizes both ALUs
concurrently, improving performance through instruction-level parallelism.
4. Two-Way Superscalar Processor (Multithreading):
a. A multithreaded two-way superscalar processor can execute two threads in
parallel, switching between them when one stalls.
b. When thread T0 stalls, the processor switches to thread T1, keeping both ALUs
busy. This reduces idle time and ensures more efficient use of processor
resources. When T0 is ready to continue, it resumes execution. Figure 2.23d
illustrates this behaviour, where the processor alternates between two threads
to maintain continuous execution.

Advantages of Multithreading:

• Minimizes Stalls: Multithreading allows the processor to switch to another thread


when one thread stalls due to events like memory access delays, thus keeping the
processor active.
• Improved Resource Utilization: Multithreaded processors maximize the usage of
processor resources, especially in the case of superscalar architectures with multiple
ALUs, by allowing other threads to run when one is waiting.
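At the software level, the threads that a multithreaded processor interleaves are ordinary OS threads. A minimal POSIX threads sketch (the worker function and its output are illustrative) creates two such threads sharing one address space; whenever one of them blocks, the scheduler can run the other.

#include <pthread.h>
#include <stdio.h>

/* Two software threads sharing the same address space; while one thread is
   stalled (e.g., blocked on I/O), the OS can schedule the other. */
static void *worker(void *arg) {
    long id = (long)arg;
    printf("thread T%ld running\n", id);
    return NULL;
}

int main(void) {
    pthread_t t0, t1;
    pthread_create(&t0, NULL, worker, (void *)0L);
    pthread_create(&t1, NULL, worker, (void *)1L);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    return 0;
}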
CHAPTER 3

PARALLEL COMPUTING
Parallel computing involves running multiple tasks simultaneously to improve performance,
especially for large or time-sensitive applications. Flynn's taxonomy classifies processors
based on the number of instructions and data they handle:

Flynn's Taxonomy:

1. SISD (Single Instruction, Single Data): Traditional single-processor systems that


handle one instruction and one data stream at a time.
2. SIMD (Single Instruction, Multiple Data): Multiple processors execute the same
instruction on different data sets simultaneously. Examples include GPUs used in
graphics, video processing, and medical imaging.
3. MISD (Multiple Instruction, Single Data): Rare architecture where different
instructions are performed on the same data. Neural networks and fault-tolerant
systems can be examples.
4. MIMD (Multiple Instruction, Multiple Data): Each processor runs its own instructions
on different data, as seen in modern multicore processors.

Parallel Architectures:

1. Shared-memory multiprocessors: Multiple processors share the same memory. It’s


easy to program but can slow down due to memory access conflicts.
2. Distributed-memory multiprocessors: Each processor has its own memory and
communicates by passing messages, commonly used in large supercomputers.
3. SIMD processors: All processors perform the same task on different data, ideal for
repetitive tasks like image processing.
4. Systolic processors: Data moves through processors in a pipeline, great for tasks like
matrix operations.
5. Cluster computing: Multiple independent computers work together over a network to
solve tasks, often used in research.
6. Grid computing: Combines resources from many locations to tackle large
computations, like SETI@home.
7. Multicore processors: Modern CPUs with several cores that allow multiple tasks to
run simultaneously.
8. Streaming multiprocessors (SM): Found in GPUs, designed to handle thousands of
tasks efficiently, especially for data-heavy tasks like deep learning.

These architectures form the foundation for modern computing systems, balancing
performance, scalability, and application needs.
UMA & NUMA

Shared-Memory Multiprocessors (UMA)

• Shared memory model: All processors share a common memory space for
communication, known as parallel random-access memory (PRAM).
• Architecture: Processors access the shared memory through an interconnection
network, which could be a simple bus or a more complex network in larger systems.
• Memory access: Multiple processors can access the same memory, but this can
cause bottlenecks when many processors request memory access simultaneously.
o To mitigate this, systems can replace a single memory module with a bank of
memories and use interconnection networks to allow simultaneous memory
access.
• Cache coherence: A major issue in shared-memory systems is keeping the local
caches of different processors updated with the shared memory. This is managed using
cache coherence protocols.
• Programming:
o Memory read operations are simple and similar to serial programming.
o Write operations require locking mechanisms like semaphores or mutexes to
ensure data integrity during parallel execution.
o Programming libraries such as POSIX and OpenMP help in managing
synchronization with tools like locks, barriers, and monitors.

Distributed-Memory Multiprocessors (NUMA)

• Local memory model: Each processor has its own local memory, which it can access
directly, while communication between processors requires message passing.
• Memory access:
o Processors can access other processors' memories via message-passing
interfaces (MPI), but the access time is non-uniform (NUMA), depending on
whether the processor is accessing its own memory or a remote one.
• Symmetric vs. Asymmetric:
o If all processors are identical, it’s called a symmetric multiprocessor (SMP).
o If the processors differ, it’s known as an asymmetric multiprocessor (ASMP).
• Scalability: These systems can scale to thousands of processors, making them
suitable for massive parallel computing or grid computing, used in large-scale scientific
computations.
SIMD & Systolic Processors

• SIMD processors: a single control unit broadcasts the same instruction to an array of processing elements, each operating on its own portion of the data; well suited to repetitive data-parallel work such as image processing.
• Systolic processors: data is pumped rhythmically through a regular array of processing elements, each performing a simple operation and passing its result to a neighbour; well suited to matrix operations.
*CLUSTER & GRID COMPUTING
Cluster Computing

• Definition: A cluster is a group of two or more computers working together to solve a


problem or perform a task.
• Interconnection: The computers are connected via a local area network (LAN).
• Communication: Processors communicate using data packets over the LAN.
• Shared Memory: Shared memory, often implemented using RAID, allows multiple
processors to access and store data simultaneously.
• Client-Server Model: A client machine distributes tasks to the processors and gathers
the final results.

Grid (Cloud) Computing

• Definition: Grid computing involves accessing distributed computing resources over a


wide area network (WAN), such as the Internet.
• Scale: It is a large-scale version of cluster computing, where the LAN is replaced by a
WAN.
• Applications: Used for complex tasks like N-body simulations, seismic simulations,
and atmospheric simulations.
• Examples of Cloud Computing:
o Peer-to-peer (P2P) computing.
o Software as a service (SaaS), like Google Apps.
o Mass storage.
o Web applications and social networks.

MULTICORE AND MULTIPROCESSOR SYSTEM

A multicore system consists of:

1. general-purpose programmable cores
2. special-purpose accelerator cores
3. shared memory modules
4. NoC (interconnection network)
5. I/O interface

*SM (STREAMING MULTIPROCESSOR)

• Definition: A type of SIMD or MIMD machine where processors are called streaming
processors (SPs) or thread processors.
• Stream Processor: Handles data streams with an instruction set that processes these
streams using kernels.
• GPU Connection: Stream processors are commonly found in GPUs, which can perform
general-purpose computations (GPGPU).
• Data Streams: Examples include vectors of floating-point numbers or groups of pixels
in video data.
• Locality:
o Temporal locality: Input data is reused a few times for output.
o Spatial locality: Input data is stored in the same memory block.
• Example: Modern GPUs like NVIDIA's Fermi.
Key Characteristics for Applications

1. Compute Intensity: Heavy computation required.


2. Data Parallelism: Multiple data processed at once.
3. Consumer-Producer Locality: Temporal and spatial proximity of data during
processing.

TYPES OF COMMUNICATION
1. One to One (Unicast):
a. Involves one sender and one receiver.
b. Often used in SIMD machines where each processor exchanges data with its
neighbor.
c. Efficient data exchange is critical, with simple register-to-register transfers or
synchronized handshakes.
2. One to Many (Multicast):
a. One sender processor communicates with multiple receiver processors.
b. Useful when the same data needs to be sent to several processors
simultaneously.
c. Requires efficient data exchange similar to unicast.
3. One to All (Broadcast):
a. A single processor sends the same data to all processors.
b. Often used in SIMD machines and systolic arrays to provide data to all
processors in the system.
4. Gather:
a. Data is collected from multiple processors.
b. Time needed for gathering data depends on the number of processors and
transmission time.
5. Reduce:
a. Similar to gather but performs an operation (e.g., sum) on the collected data.
b. Can be hierarchical for efficiency in large systems.
c. Time depends on both gathering and processing.
MESSAGE PASSING (MP) COMMUNICATION MECHANISM

• MP is mainly used in distributed-memory systems, facilitating communication between


processes.
• Communication involves two key library functions:
o send(destination, message): Specifies the destination processor ID and the
data to send.
o recv(source, message type): Specifies the source processor ID and the type of
data to receive.

Steps in MP Communication:

1. Establish a Communication Link:


a. Depends on the interconnection network (physical or logical).
b. Properties include addressing, direction (unidirectional or bidirectional),
capacity, and message size.
2. Exchange Messages:
a. Uses the send() and recv() library calls.

Synchronization Strategies:

• Synchronous/Blocking:
o Sender and receiver halt until the message is successfully sent/received.
• Asynchronous/Non-blocking:
o Sender and receiver continue execution without waiting for message
transmission.

MPI Standard:

• Provides support for both one-to-one and broadcast communication.


CHAPTER 4
Shared-Memory Multiprocessors
• Shared-memory processors are popular for their simple and general programming
model.
• They provide a single physical address space for all processors, allowing easy sharing
of code and data.

Key Features:

• Each processor has local memory, cache, and internal registers.


• Shared memory is organized in separate modules and is accessible to all processors.
• Processors communicate via shared variables stored in shared memory.
• The interconnection network enables multiple memory read/write operations
simultaneously.

Important Considerations:

1. Cache Coherence:

Ensures copies of shared variables across caches are consistent.

2. Synchronization and Mutual Exclusion:

Necessary to control the access of multiple processors to shared resources.

Cache Coherence and Memory Consistency


• Private caches speed up execution by reducing memory access time and matching
processor speed.
• Cache coherence ensures that the data stored in caches and shared memory remain
consistent.
• Caches help reduce memory contention when multiple processors access the same
memory module.

Cache Terminology:

• Cache coherence: Ensures data consistency between shared memory and processor
caches.
• Valid bit (V): Indicates whether the cache line is coherent with shared memory.
• Write-back: Updates shared memory only when a cache block is replaced.
• Write-through: Immediately updates shared memory when a cache block is modified.

Cache Coherence Protocols:

1. Directory Protocols:

Maintain a directory to track the state of each block in the shared memory.

2. Snoopy Protocols:

Processors monitor memory transactions to ensure cache coherence.

DIRECTORY AND SNOOPY PROTOCOLS


Cache Coherence Using Directory Protocols:

• Components:
o Local Cache Controllers: Handle cache updates for each processor.
o Central Controller: Maintains system-wide cache coherence.
o Directory: Part of shared memory, stores state of each shared block.
o Interconnection Network: Enables communication between controllers,
caches, and memory.
• Directory Protocol:
o Full-map directory has n + 2 bits per entry, where n is the number of processors.
o D-bit: Indicates whether data is valid (0) or modified (1).
o X-bit: Controls whether updates are broadcast (B) or not (NB).
o Pros: Tracks locations of shared blocks and ensures coherence.
o Cons: Can become a bottleneck as all transactions go through the central
controller. Entry size depends on the number of processors.

Cache Coherence Using Snoopy Protocols:

• Components:
o Local Cache Controllers: Monitor their caches for coherence.
o Shared Memory: No directory or central controller is used.
o Interconnection Network: Supports broadcasting of data transmissions.
• Snoopy Protocol:
o Processors monitor network transactions to determine if a memory operation is
relevant.
o Invalidation-based Policy: Invalidate the cache copy and fetch updated data
from memory when needed.
o Update-based Policy: Update cache copy using data on the bus while shared
memory is updated.
• Pros: Simpler structure, no central controller.
• Cons: Limited by bus bandwidth, only one transaction can occur at a time.
SYNCHRONIZATION AND MUTUAL EXCLUSION
1. Critical Section:

• The critical section is where a shared resource (like a variable, file, or data structure) is
accessed by multiple processes or threads.
• To prevent race conditions—where two or more threads modify the resource
simultaneously leading to unpredictable results—only one process is allowed to enter
the critical section at any time.
• If a process is inside its critical section, other processes must wait until the current
process finishes and releases the resource (unlocking it).

2. Atomic Operations and TestAndSet:

• Atomic operations (like TestAndSet) are essential because they execute a series of
operations (reading, modifying, and writing) as a single, uninterruptible action.
• This is necessary in concurrent programming to prevent a situation where two
processes read the same value, make changes based on the outdated value, and then
both write conflicting updates to memory.
• For example, in the TestAndSet operation, it first reads the current value of a lock, then
sets the lock to "occupied" (TRUE), ensuring no other process can access the critical
section.
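A minimal sketch of a busy-waiting lock built on test-and-set, using the C11 atomic_flag type (the lock variable and function names are illustrative):

#include <stdatomic.h>
#include <stdio.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;   /* FALSE = free */

static void acquire(void) {
    /* atomic_flag_test_and_set atomically reads the old value and sets the
       flag to TRUE; we loop (busy wait) until the old value was FALSE. */
    while (atomic_flag_test_and_set(&lock))
        ;                                     /* spin */
}

static void release(void) {
    atomic_flag_clear(&lock);                 /* set the lock back to FALSE */
}

int main(void) {
    acquire();
    puts("inside the critical section");      /* only one thread at a time here */
    release();
    return 0;
}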

3. Busy Waiting vs. Blocking:

• Busy waiting occurs when a process continuously checks (in a loop) whether the lock
is available. This can waste CPU time because the process is actively using the CPU
while waiting.
• Blocking, on the other hand, puts the waiting process in a sleep state, freeing up CPU
resources until the lock becomes available. However, blocking has overhead because
suspending and resuming processes take time.
• Busy waiting may be more efficient for very short waits, while blocking is preferred
when a process is expected to wait for a longer time.

4. Locks:

• Locks are crucial for managing access to critical sections, especially in


multiprocessor systems where many processes or threads may attempt to modify
shared resources simultaneously.
• Without proper locking, shared data can become corrupted as multiple threads could
modify it concurrently.
• Locks can be hardware-supported (such as TestAndSet) or software-supported using
higher-level mechanisms like semaphores.
5. Mutex (Mutual Exclusion):

• A mutex (short for mutual exclusion) ensures that only one thread can enter the critical
section at a time. It has two key operations:
o wait(): Checks if the mutex is available (value 1). If available, it decrements the
value and allows the thread to proceed.
o signal(): After a thread finishes its work in the critical section, it increments the
mutex value to signal that the resource is available for other threads.
• Mutexes are implemented in operating system libraries (like POSIX threads) and are
commonly used to protect critical sections and manage resource access in multi-
threaded applications.
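A minimal POSIX threads sketch of a mutex protecting a critical section: pthread_mutex_lock plays the role of wait() and pthread_mutex_unlock the role of signal(); the shared counter is an arbitrary example of a shared resource.

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static long counter = 0;                 /* shared variable */

static void *worker(void *arg) {
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&m);          /* enter critical section (wait)   */
        counter++;                       /* safe read-modify-write          */
        pthread_mutex_unlock(&m);        /* leave critical section (signal) */
    }
    return arg;
}

int main(void) {
    pthread_t t0, t1;
    pthread_create(&t0, NULL, worker, NULL);
    pthread_create(&t1, NULL, worker, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("counter = %ld\n", counter);  /* always 200000 with the mutex */
    return 0;
}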

6. Barriers:

• Unlike locks and mutexes that control access to shared variables, barriers are used in
parallel computing for event synchronization.
• In many parallel algorithms, tasks are divided among multiple threads that operate
independently. However, there may be certain points in the algorithm where all threads
must wait for each other to finish a particular stage before moving on to the next stage.
• Barriers ensure that all threads reach a certain point before any thread is allowed to
proceed further. This is useful in algorithms where synchronization points are needed
for correctness.
• A barrier typically tracks how many threads have reached it. Once all threads have
reached the barrier, they can continue with the next task.
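A minimal POSIX threads sketch of barrier synchronization (thread count and messages are arbitrary): no thread starts stage 2 until all threads have finished stage 1.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static pthread_barrier_t barrier;

static void *worker(void *arg) {
    long id = (long)arg;
    printf("thread %ld finished stage 1\n", id);
    pthread_barrier_wait(&barrier);      /* event synchronization point */
    printf("thread %ld starting stage 2\n", id);
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    pthread_barrier_init(&barrier, NULL, NTHREADS);
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    pthread_barrier_destroy(&barrier);
    return 0;
}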

7. Synchronization Issues:

• Deadlock: If two or more processes are waiting for each other to release resources,
none of them can proceed, leading to a deadlock. Proper handling of locks and
synchronization is required to avoid this.
• Starvation: This occurs when a process waits indefinitely to acquire a lock or access
the critical section because other processes continuously occupy the resource.
• Priority Inversion: A low-priority process holding a lock may delay higher-priority
processes waiting for the same lock. In some systems, priority inheritance mechanisms
are used to handle this.

8. Comparison of Synchronization Primitives:

• Locks are simple and efficient but can result in busy waiting and resource contention
when many processes attempt to acquire the lock.
• Mutexes are more structured and avoid busy waiting by suspending the waiting
process, but they are slightly more memory-intensive and require proper management
of wait() and signal() calls.
• Barriers are more suitable for parallel processing where synchronization of
independent tasks is necessary. They ensure that all threads reach a certain point
before proceeding, which helps maintain execution order in parallel algorithms.
