UNIT - I: Parallel and Distributed Computing
Introduction
In the realm of modern computing, the quest for greater computational
power is incessant. As the physical limits of single-processor performance
are approached, Parallel and Distributed Computing emerges as a
critical paradigm to meet the ever-growing demands of complex scientific,
industrial, and commercial applications. This unit provides a
comprehensive introduction to the fundamental concepts, benefits,
theoretical underpinnings, and algorithmic principles of parallel and
distributed computing.
Parallel computing is a type of computation in which many calculations
or the execution of processes are carried out simultaneously. Large
problems can often be divided into smaller ones, which can then be solved
at the same time.
Distributed computing involves multiple autonomous computers that
communicate through a network and interact with each other in order to
achieve a common goal. A distributed system can be seen as a collection
of independent computers that appears to its users as a single coherent
system.
While related, the key distinction lies in memory architecture. In parallel
computing, processors often share memory, allowing for high-speed
communication. In distributed computing, each processor typically has its
own private memory, and communication occurs through message
passing over a network.
Benefits and Needs
The adoption of parallel and distributed computing is driven by a
multitude of benefits and pressing needs:
Increased Performance and Speed: By dividing a task among
multiple processors, the overall time to solution can be drastically
reduced. This is crucial for time-sensitive applications such as
weather forecasting, financial modeling, and real-time data analysis.
Solving Larger and More Complex Problems: Many scientific
and engineering problems are too large or complex to be solved by
a single computer in a reasonable amount of time. Parallel and
distributed systems provide the necessary computational resources
to tackle these "grand challenge" problems, such as simulating the
human brain or modeling climate change.
Cost-Effectiveness: It is often more economical to build a cluster
of commodity computers than to purchase a single, powerful
supercomputer. This approach allows organizations to achieve high
performance at a lower cost.
Improved Reliability and Fault Tolerance: In a distributed
system, the failure of a single computer does not necessarily bring
the entire system to a halt. The workload of the failed node can be
redistributed among the remaining nodes, enhancing the overall
resilience of the system.
Resource Sharing and Scalability: Distributed systems facilitate
the sharing of hardware, software, and data resources across a
network. They are also inherently scalable, as more computing
nodes can be added to the system to handle increasing workloads.
Concurrency: Many real-world applications are inherently
concurrent, involving multiple independent activities. Parallel and
distributed systems provide a natural framework for modeling and
executing such applications.
Parallel and Distributed Systems
The architecture of parallel and distributed systems can be broadly
categorized based on their memory organization and the way processors
communicate.
| Feature | Parallel Systems | Distributed Systems |
|---|---|---|
| Processor Coupling | Tightly coupled | Loosely coupled |
| Memory | Typically shared memory (processors share a common address space) | Distributed memory (each processor has its own private memory) |
| Communication | Via shared memory (fast) | Via message passing over a network (slower) |
| Synchronization | Implicit, through shared variables and locks | Explicit, through messages |
| Scalability | Can be limited by memory bus contention | Highly scalable by adding more machines |
| Fault Tolerance | A single hardware failure can affect the entire system | More fault-tolerant, as the failure of one node may not affect others |
| Example | Multi-core processors, Symmetric Multiprocessors (SMPs) | The Internet, a computer cluster, a grid computing system |
Programming Environment
Developing software for parallel and distributed systems requires
specialized programming environments that provide tools and libraries to
manage parallelism and communication. Key components of these
environments include:
Programming Models: These are abstractions that allow
programmers to express parallelism in their code. Common models
include:
o Shared Memory Model: All tasks share a common address
space, which they read and write to asynchronously.
Synchronization mechanisms like locks and semaphores are
used to control access to shared data.
o Message Passing Model: Tasks have their own local
memory and communicate by sending and receiving
messages. The Message Passing Interface (MPI) is a
standardized and widely used library for this model.
o Data Parallel Model: The same operation is performed
simultaneously on different subsets of a large dataset.
Programming Languages and Libraries:
o OpenMP (Open Multi-Processing): An API that supports
multi-platform shared-memory parallel programming in C, C++,
and Fortran. It uses compiler directives to parallelize code.
o MPI (Message Passing Interface): A standardized and
portable message-passing system designed to function on a
wide variety of parallel computers.
o Hadoop and MapReduce: A popular framework for
processing large datasets in a distributed computing
environment. MapReduce is a programming model that
abstracts the complexity of distributed processing.
Tools: Compilers, debuggers, and performance analysis tools
specifically designed for parallel and distributed environments are
essential for efficient software development.
Theoretical Foundations
The design and analysis of parallel algorithms are grounded in a set of
theoretical concepts and models:
Flynn's Taxonomy: A classification of computer architectures
based on the number of instruction streams and data streams:
o SISD (Single Instruction, Single Data): A traditional
sequential computer.
o SIMD (Single Instruction, Multiple Data): A single
instruction is executed on multiple data streams
simultaneously. Vector processors and GPUs are examples.
o MISD (Multiple Instruction, Single Data): Multiple
instructions operate on a single data stream. This is a rare
architecture.
o MIMD (Multiple Instruction, Multiple Data): Multiple
processors execute different instructions on different data
streams. Most modern parallel computers fall into this
category.
Amdahl's Law: This law gives the theoretical speedup in latency of
the execution of a task at fixed workload that can be expected of a
system whose resources are improved. It highlights that the
speedup of a program using multiple processors is limited by the
sequential fraction of the program. If P is the proportion of the
program that can be made parallel and (1−P) is the proportion that
is sequential, then the maximum speedup that can be achieved by
using N processors is: S(N) = 1 / ((1 − P) + P/N). A short numeric
illustration of this law and Gustafson's law follows this list.
Gustafson's Law: This law addresses the limitations of Amdahl's
law. It suggests that as the number of processors increases, the
problem size should also increase. For a fixed amount of time, a
larger problem can be solved with more processors.
PRAM (Parallel Random Access Machine) Model: A theoretical
model of a parallel computer with a shared memory. It consists of a
set of processors that can all access a shared memory in a
synchronous manner. The PRAM model is useful for designing and
analyzing parallel algorithms without being concerned with the
details of the underlying hardware. There are several variations of
the PRAM model based on how they handle read and write conflicts
to the same memory location (EREW, ERCW, CREW, CRCW).
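As a small numeric illustration of Amdahl's and Gustafson's laws above, the sketch below computes both speedups for an assumed parallel fraction P = 0.9. The figures, and the scaled-speedup form used for Gustafson's law, S(N) = (1 − P) + P·N, are common textbook choices rather than values taken from these notes:

```cpp
#include <cstdio>

int main() {
    const double P = 0.9;  // assumed fraction of the program that parallelizes
    for (int N : {2, 8, 64, 1024}) {
        double amdahl    = 1.0 / ((1.0 - P) + P / N); // fixed workload
        double gustafson = (1.0 - P) + P * N;         // workload scaled with N
        std::printf("N=%4d  Amdahl=%6.2f  Gustafson=%8.2f\n", N, amdahl, gustafson);
    }
    return 0;
}
```

Note how Amdahl's speedup saturates near 1/(1 − P) = 10 no matter how many processors are added, while the scaled speedup keeps growing with N.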
Parallel Algorithms— Introduction
A parallel algorithm is an algorithm whose work can be executed a piece at a
time on many different processing devices, with the partial results combined
at the end to produce the correct result. The goal of designing a parallel
algorithm is to reduce the overall computation time by distributing the
workload among multiple processors.
Parallel Models and Algorithms
Several models are used to design and analyze parallel algorithms:
Task/Channel Model: The problem is decomposed into a set of
tasks that communicate with each other through channels.
Master-Slave (or Master-Worker) Model: A master process
distributes work to a number of slave processes and collects the
results.
Pipeline Model: A stream of data is passed through a series of
processes, each of which performs a specific operation on the data.
Data Parallel Model: Each processor performs the same task on a
different part of the distributed data.
The design of efficient parallel algorithms involves addressing key
challenges such as:
Partitioning: Decomposing the problem into smaller tasks.
Communication: Managing the exchange of data between
processors.
Synchronization: Coordinating the execution of different tasks.
Load Balancing: Ensuring that all processors have an equal
amount of work.
Sorting
Sorting is a fundamental problem in computer science. Parallel sorting
algorithms aim to speed up the sorting process by using multiple
processors.
Parallel Sorting Algorithms
Parallel Bubble Sort (Odd-Even Transposition Sort): In this
algorithm, adjacent elements are compared and swapped in parallel.
It consists of alternating odd and even phases. In the odd phase,
elements at odd indices are compared and swapped with their right
neighbors; in the even phase, elements at even indices are compared and
swapped with their right neighbors. This process repeats until the list
is sorted (a sketch follows this list).
Parallel Merge Sort: This algorithm follows the divide-and-conquer
strategy. The list is divided into sublists, which are sorted in parallel
by different processors. Then, the sorted sublists are merged in
parallel to produce the final sorted list.
Parallel Quick Sort: Similar to its sequential counterpart, this
algorithm selects a pivot element and partitions the array into two
subarrays. The subarrays are then sorted recursively in parallel. The
main challenge is to perform the partitioning step in parallel
efficiently.
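Below is a compact sketch of odd-even transposition sort. The compare-swap pairs within a phase are disjoint, so they can all be done at once; the OpenMP pragma marks that opportunity. The sample data, the use of OpenMP, and the phase-to-index convention are illustrative assumptions:

```cpp
#include <algorithm>
#include <vector>
#include <cstdio>

void oddEvenSort(std::vector<int>& a) {
    const int n = static_cast<int>(a.size());
    for (int phase = 0; phase < n; ++phase) {
        // Even phase compares pairs (0,1), (2,3), ...; odd phase compares (1,2), (3,4), ...
        int start = (phase % 2 == 0) ? 0 : 1;
        #pragma omp parallel for      // the pairs are independent, so they may run in parallel
        for (int i = start; i < n - 1; i += 2)
            if (a[i] > a[i + 1])
                std::swap(a[i], a[i + 1]);
    }
}

int main() {
    std::vector<int> v = {9, 4, 7, 1, 8, 3, 2, 6};
    oddEvenSort(v);
    for (int x : v) std::printf("%d ", x);   // prints 1 2 3 4 6 7 8 9
    std::printf("\n");
    return 0;
}
```

After n phases the list is guaranteed to be sorted; with p processors each phase takes roughly O(n/p) comparisons.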
Matrix Multiplication
Matrix multiplication is a computationally intensive operation that is
common in many scientific and engineering applications. Parallel
algorithms can significantly speed up this process.
Consider the multiplication of two n×n matrices, C = A × B. The element Cij
is calculated as: Cij = Σ (k = 0 to n−1) Aik × Bkj
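A direct translation of this formula is shown below, with the rows of C distributed across threads by an OpenMP directive. The matrix size, fill values, and flat row-major storage are illustrative choices, not part of the original notes:

```cpp
#include <vector>
#include <cstdio>

int main() {
    const int n = 256;
    std::vector<double> A(n * n, 1.0), B(n * n, 2.0), C(n * n, 0.0);

    #pragma omp parallel for              // each row of C is independent work
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            double sum = 0.0;
            for (int k = 0; k < n; ++k)   // Cij = sum over k of Aik * Bkj
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }

    std::printf("C[0][0] = %.1f (expected %.1f)\n", C[0], 2.0 * n);
    return 0;
}
```

The block-based and Cannon-style algorithms described next improve on this row-wise decomposition by also partitioning A and B, which reduces memory traffic and communication.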
Parallel Matrix Multiplication Algorithms
Block-Based Decomposition: The matrices are partitioned into
smaller sub-matrices or blocks. Each processor is assigned the task
of multiplying a set of these blocks. The results from each processor
are then combined to form the final result matrix.
Cannon's Algorithm: A memory-efficient algorithm for distributed
memory systems. It involves shifting the blocks of matrices A and B
in a coordinated way among the processors to ensure that each
processor can compute its portion of the result matrix with minimal
communication.
Strassen's Algorithm (in parallel): Strassen's algorithm is a
divide-and-conquer algorithm that reduces the number of
multiplications required for 2×2 matrices from 8 to 7. This recursive
algorithm can be parallelized by performing the sub-matrix
multiplications in parallel.
Convex Hull
The convex hull of a set of points is the smallest convex polygon that
contains all the points. Finding the convex hull has applications in
computer graphics, pattern recognition, and computational geometry.
Parallel Convex Hull Algorithms
Parallel Quickhull Algorithm: This is a divide-and-conquer
algorithm.
1. Find the points with the minimum and maximum x-
coordinates. These two points are guaranteed to be on the
convex hull.
2. The line connecting these two points divides the set of points
into two subsets.
3. For each subset, find the point that is farthest from the line
segment. This point is also on the convex hull.
4. The three points form a triangle. The points inside this triangle
can be ignored. The remaining points outside the triangle are
partitioned into two new subsets based on the two new edges
of the triangle.
5. This process is applied recursively and in parallel to the new
subsets.
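A compact sequential sketch of the Quickhull recursion described above is given below; the two recursive calls marked in the comments are the natural points to spawn parallel tasks. The helper names and sample points are illustrative:

```cpp
#include <vector>
#include <algorithm>
#include <cstdio>

struct Pt { double x, y; };

// Positive if c lies to the left of the directed line a -> b;
// the magnitude is proportional to c's distance from that line.
double cross(const Pt& a, const Pt& b, const Pt& c) {
    return (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x);
}

// Collect hull vertices between a and b, looking only at points left of a -> b.
// The two recursive calls are independent and could run as parallel tasks.
void quickHull(const std::vector<Pt>& pts, Pt a, Pt b, std::vector<Pt>& hull) {
    int far_i = -1;
    double best = 0.0;
    for (int i = 0; i < static_cast<int>(pts.size()); ++i) {
        double d = cross(a, b, pts[i]);
        if (d > best) { best = d; far_i = i; }
    }
    if (far_i < 0) { hull.push_back(a); return; }   // a -> b is a hull edge
    quickHull(pts, a, pts[far_i], hull);            // could run in parallel...
    quickHull(pts, pts[far_i], b, hull);            // ...with this call
}

int main() {
    std::vector<Pt> pts = {{0,0},{4,0},{4,4},{0,4},{2,2},{1,3},{3,1}};
    auto mm = std::minmax_element(pts.begin(), pts.end(),
                                  [](const Pt& p, const Pt& q){ return p.x < q.x; });
    Pt lo = *mm.first, hi = *mm.second;   // leftmost and rightmost points are on the hull

    std::vector<Pt> hull;
    quickHull(pts, lo, hi, hull);         // points on one side of the line lo -> hi
    quickHull(pts, hi, lo, hull);         // points on the other side
    for (const auto& p : hull) std::printf("(%g, %g)\n", p.x, p.y);
    return 0;
}
```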
Pointer Based Data Structures
Pointer-based data structures, such as linked lists and trees, pose unique
challenges in parallel computing due to their dynamic and irregular
nature. Accessing and modifying these structures in parallel requires
careful synchronization to avoid race conditions and ensure data integrity.
Parallel Algorithms for Pointer-Based Data Structures
Parallel Linked List Traversal and Manipulation:
o Pointer Jumping (or Shortcut): This technique is used to
parallelize operations like finding the end of a list or
computing prefix sums on a linked list. Each node's pointer is
updated to point to its successor's successor. Repeating this step
O(log n) times halves the remaining traversal path in each round
(a sketch appears at the end of this section).
o Lock-based and Lock-free Approaches: To allow
concurrent insertions and deletions, fine-grained locking
(locking individual nodes) can be used. Lock-free algorithms,
which use atomic operations like Compare-And-Swap (CAS),
provide a non-blocking alternative, avoiding issues like
deadlocks and priority inversion.
Parallel Tree Algorithms:
o Parallel Tree Traversal: Different subtrees can be traversed
in parallel. For example, in a binary tree, one processor can
traverse the left subtree while another traverses the right
subtree.
o Parallel Search and Update in Balanced Trees (e.g.,
Red-Black Trees): Operations like search, insertion, and
deletion can be parallelized. However, rebalancing operations
require careful synchronization to maintain the tree's
properties. Techniques like hand-over-hand locking (where a
lock on a child node is acquired before releasing the lock on
the parent) can be used.
o Parallel Tree Contraction: A technique for reducing a tree
to a single vertex by repeatedly applying local operations like
removing leaves (rake) and compressing chains of nodes with
single children. This is useful for a variety of tree-based
problems.
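The sketch below illustrates pointer jumping for list ranking (computing each node's distance to the end of a linked list). It is written sequentially with double buffering to mimic the synchronous parallel step; in a PRAM-style setting the inner loop would run with one processor per node. The array encoding of the list is an implementation choice, not something prescribed by these notes:

```cpp
#include <vector>
#include <cstdio>

int main() {
    // next[i] is the successor of node i (the tail points to itself);
    // dist[i] starts as 1 hop for every node except the tail.
    std::vector<int> next = {1, 2, 3, 4, 4};
    std::vector<int> dist = {1, 1, 1, 1, 0};

    bool changed = true;
    while (changed) {                                   // O(log n) rounds in total
        changed = false;
        std::vector<int> nnext = next, ndist = dist;    // double buffer = one synchronous step
        for (std::size_t i = 0; i < next.size(); ++i) { // conceptually: for all i in parallel
            if (next[i] != next[next[i]]) {
                ndist[i] = dist[i] + dist[next[i]];     // accumulate the skipped distance
                nnext[i] = next[next[i]];               // jump over the successor
                changed = true;
            }
        }
        next = nnext;
        dist = ndist;
    }

    for (std::size_t i = 0; i < dist.size(); ++i)
        std::printf("node %zu is %d hops from the tail\n", i, dist[i]);
    return 0;
}
```

Each round roughly halves the number of hops left to traverse, which is why the computation finishes in O(log n) rounds instead of O(n) sequential steps.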
UNIT - II: Parallel and Distributed Computing
Synchronization
In parallel and distributed computing, synchronization is the mechanism of
coordinating the execution of multiple concurrent processes or threads to
ensure they cooperate correctly and maintain data consistency. Without
proper synchronization, systems can suffer from race conditions,
deadlocks, and data corruption.
Race Condition: A situation where the outcome of a computation
depends on the non-deterministic sequence or timing of operations from
different threads or processes. This occurs when multiple processes
access and manipulate shared data concurrently, and at least one of the
accesses is a write.
Synchronization Primitives
1. Locks (Mutexes): A lock, or a mutual exclusion (mutex) object, is
the simplest synchronization primitive. It ensures that only one
thread can execute a critical section of code at a time.
o lock() or acquire(): A thread calls this function to gain
ownership of the lock. If the lock is already held by another
thread, the calling thread will block until the lock is released.
o unlock() or release(): The thread that owns the lock calls
this function to release it, allowing other waiting threads to
acquire it.
o Usage: Protects a shared resource from concurrent access (see the example after this list).
2. Semaphores: A semaphore is a more general synchronization tool.
It is an integer variable that, apart from initialization, is accessed
only through two standard atomic operations: wait() and signal().
o Counting Semaphore: The integer value can range over an
unrestricted domain. It can be used to control access to a
resource with a finite number of instances (e.g., a pool of
database connections).
o Binary Semaphore: The integer value can range only
between 0 and 1. It behaves similarly to a mutex.
o Operations:
wait(S) or P(S): Decrements the semaphore value. If the
value becomes negative, the process executing wait is
blocked.
signal(S) or V(S): Increments the semaphore value. If the
value is not positive, a process blocked by a wait
operation is unblocked.
3. Monitors: A monitor is a high-level synchronization construct that
encapsulates shared data and the procedures that operate on that
data. It ensures that only one thread can be active within the
monitor at any given time, thus providing mutual exclusion
automatically. Monitors also include condition variables, which allow
threads to wait for certain conditions to become true inside the
critical section.
4. Barriers: A barrier is a synchronization point for a group of threads.
A thread arriving at a barrier must wait until all other threads in the
group have also arrived. Once all threads have reached the barrier,
they are all released and can proceed. This is useful for coordinating
stages of a parallel algorithm.
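To make the lock primitive above concrete, here is a minimal sketch using standard C++ threads (the thread count and iteration counts are arbitrary). Without the mutex, the increments from different threads would race and the final count would usually be wrong:

```cpp
#include <mutex>
#include <thread>
#include <vector>
#include <iostream>

int main() {
    long counter = 0;                 // shared data that the critical section protects
    std::mutex m;

    auto work = [&]() {
        for (int i = 0; i < 100000; ++i) {
            std::lock_guard<std::mutex> lock(m);  // acquire(); released automatically at scope exit
            ++counter;                            // critical section
        }
    };

    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t) threads.emplace_back(work);
    for (auto& t : threads) t.join();

    std::cout << "counter = " << counter << " (expected 400000)\n";
    return 0;
}
```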
Process Parallel Languages
These are programming languages, libraries, and APIs designed to express
and manage parallelism.
1. OpenMP (Open Multi-Processing):
o Model: Shared Memory. It is an API that uses compiler
directives (pragmas), library routines, and environment
variables to express parallelism.
o Usage: Excellent for parallelizing loops and sections of code
on multi-core processors. The programmer inserts pragmas
into the source code, and the compiler automatically
generates threaded code.
o Example (C++):
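A minimal illustration of the idea, assuming a loop whose iterations are independent (the array size and reduction variable are arbitrary choices for this sketch):

```cpp
#include <omp.h>
#include <vector>
#include <cstdio>

int main() {
    const int N = 1000000;
    std::vector<double> a(N);
    double sum = 0.0;

    // The directive splits the loop iterations among the available threads;
    // reduction(+:sum) gives each thread a private partial sum and combines
    // them safely when the loop ends.
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; ++i) {
        a[i] = 0.5 * i;
        sum += a[i];
    }

    std::printf("sum = %.1f using up to %d threads\n", sum, omp_get_max_threads());
    return 0;
}
```

Compiling with an OpenMP-enabled compiler (e.g. g++ -fopenmp) activates the directive; without that flag the pragma is ignored and the loop simply runs sequentially.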
2. MPI (Message Passing Interface):
o Model: Distributed Memory (Message Passing). MPI is a
standardized and portable library, not a language. It provides
functions for sending and receiving messages between
processes.
o Usage: The standard for high-performance computing (HPC)
on clusters and supercomputers. Each process has its own
private memory, and all communication is explicit.
o Example Concepts (C):
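A minimal point-to-point sketch using the standard MPI C API (the message contents are illustrative); it would typically be built with mpicc and launched with something like mpirun -np 2:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);                    /* start the MPI runtime */

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);      /* this process's id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);      /* total number of processes */

    if (size >= 2) {                           /* the sketch assumes at least 2 processes */
        if (rank == 0) {
            int data = 42;
            MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* send to rank 1 */
        } else if (rank == 1) {
            int data;
            MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                         /* receive from rank 0 */
            printf("rank 1 received %d from rank 0\n", data);
        }
    }

    MPI_Finalize();
    return 0;
}
```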
3. Go Language:
o Model: Concurrency based on Communicating Sequential
Processes (CSP).
o Usage: Go has built-in features for concurrency:
Goroutines: Lightweight threads managed by the Go
runtime. Starting one is as simple as go myFunction();.
Channels: Typed conduits through which you can send
and receive values with the channel operator <-. They
are used to communicate and synchronize between
goroutines.
o This model encourages sharing memory by communicating,
rather than communicating by sharing memory.
Architecture of Parallel and Distributed Systems
Architectures define how processors and memory are interconnected.
Parallel System Architectures (Tightly-Coupled)
1. Shared Memory Architecture: All processors share a single,
global address space.
o UMA (Uniform Memory Access): All processors have equal
access time to all memory locations. Symmetric
Multiprocessing (SMP) systems are a common example.
o NUMA (Non-Uniform Memory Access): Memory is
physically distributed, but logically shared. A processor can
access its own local memory faster than it can access the
memory local to another processor. Modern multi-socket
servers are NUMA systems.
2. Multi-core and Many-core:
o Multi-core: A single CPU chip contains multiple processing
cores, each capable of independent execution.
o Many-core: A specialized architecture with a very high
number of cores (tens to hundreds), like GPUs. They are
designed for high data parallelism (SIMD).
Distributed System Architectures (Loosely-Coupled)
1. Cluster Computing: A group of independent computers (nodes)
connected via a high-speed network, working together as a single
integrated system. Nodes are typically homogeneous.
2. Grid Computing: A distributed system formed by a collection of
heterogeneous, geographically dispersed, and administratively
autonomous resources. It aims to create a "virtual supercomputer."
3. Cloud Computing: A model for delivering on-demand computing
services—including servers, storage, databases, networking, and
software—over the Internet ("the cloud"). It relies heavily on
virtualization and distributed system principles.
Consistency and Replication
Replication is the process of creating and managing multiple copies of
data on different nodes in a distributed system. It is done for two main
reasons:
Reliability/Fault Tolerance: If one node fails, data is still
accessible from other replicas.
Performance: Placing data closer to users reduces access latency.
The main challenge with replication is consistency: ensuring that all
replicas have the same value over time.
The CAP Theorem: A fundamental theorem in distributed systems states
that it is impossible for a distributed data store to simultaneously provide
more than two out of the following three guarantees:
Consistency: Every read receives the most recent write or an error.
Availability: Every request receives a (non-error) response, without
the guarantee that it contains the most recent write.
Partition Tolerance: The system continues to operate despite an
arbitrary number of messages being dropped (or delayed) by the
network between nodes. Since network partitions are a fact of life, a
distributed system must choose between being consistent (CP) or
being available (AP).
Consistency Models
1. Strong Consistency:
o Linearizability (or Atomic Consistency): The strongest
model. All operations appear to have executed atomically at
some single point in time, consistent with the real-time
ordering of operations. A read is guaranteed to return the
value of the most recent completed write.
2. Weaker Consistency: These models relax the guarantees of strong
consistency to achieve higher availability and performance.
o Sequential Consistency: Weaker than linearizability. The
result of any execution is the same as if the operations of all
processors were executed in some sequential order, and the
operations of each individual processor appear in this
sequence in the order specified by its program.
o Causal Consistency: Requires that writes that are causally
related are seen by all processes in the same order.
Concurrent writes may be seen in different orders by different
processes.
o Eventual Consistency: The weakest model. If no new
updates are made to a given data item, eventually all
accesses to that item will return the last updated value. This is
widely used in highly available systems like Amazon's
DynamoDB and Apache Cassandra.
Security
Security in distributed systems is complex due to the lack of a single point
of control and the reliance on insecure public networks.
Key Security Goals (CIA Triad)
Confidentiality: Preventing the disclosure of information to
unauthorized individuals or systems.
o Mechanism: Encryption (of data at rest and in transit, e.g.,
SSL/TLS).
Integrity: Maintaining the consistency, accuracy, and
trustworthiness of data. Data should not be altered in transit or
tampered with by unauthorized parties.
o Mechanism: Cryptographic hashes and digital signatures.
Availability: Ensuring that authorized users can access information
and resources when required.
o Mechanism: Redundancy, failover systems, and protection
against Denial-of-Service (DoS) attacks.
Security Mechanisms
Authentication: Verifying the identity of a user, process, or device.
"Are you who you say you are?"
o Examples: Passwords, digital certificates, Kerberos.
Authorization: Determining whether an authenticated entity is
permitted to access a particular resource or perform a specific
operation. "Are you allowed to do that?"
o Examples: Access Control Lists (ACLs), Role-Based Access
Control (RBAC).
Secure Communication: Creating a secure channel over an
insecure network to protect confidentiality and integrity.
o Examples: Secure Sockets Layer (SSL) / Transport Layer
Security (TLS), Virtual Private Networks (VPNs).
Parallel Operating Systems
A Parallel OS is a type of operating system that manages resources and
provides services for computers with multiple processors
(multiprocessors). It is designed to maximize the performance and
efficiency of parallel computing hardware.
Key Functions and Challenges
1. Process and Thread Management:
o Scheduling: The OS scheduler must be able to dispatch
processes/threads across multiple CPUs efficiently. It must
consider issues like cache affinity (trying to keep a thread on
the same core to leverage cached data) and load balancing.
o Load Balancing: Distributing the workload evenly among all
available processors to prevent some processors from being
idle while others are overloaded.
2. Memory Management:
o In a UMA system, memory management is relatively
straightforward.
o In a NUMA system, the OS must be "NUMA-aware." It should
try to allocate memory for a process on the same node where
the process is running to minimize remote memory access
latency.
3. Synchronization Support: The OS kernel provides efficient
implementations of synchronization primitives like mutexes,
semaphores, and condition variables that can be used by user-level
applications.
4. I/O Management: The OS needs to manage parallel I/O operations
to prevent bottlenecks, allowing multiple processes to access
storage devices concurrently and efficiently.
Types of Operating Systems for Multi-Computer Systems
Network Operating System (NOS): Each computer in a network
runs its own OS. The NOS provides services that make the network
resources (e.g., files, printers) transparently accessible from any
computer. Examples: Windows Server, Novell NetWare. Users are
aware that there are multiple computers.
Distributed Operating System (DOS): A single operating system
that manages the resources of a collection of autonomous
computers and presents them to the user as a single, coherent
system (a "single system image"). Users are not aware of where
their programs are running or where their files are stored. Examples
are more academic (e.g., Amoeba, Plan 9) but their principles
influence modern distributed systems.
UNIT - II (continued): Synchronization, Languages, Architecture, and System Issues
This unit delves into the critical software and hardware mechanisms that
enable the functioning of parallel and distributed systems. We will explore
how processes coordinate their actions, the languages used to program
them, the architectures they run on, and the fundamental challenges of
maintaining data consistency, security, and operating system support.
Synchronization
In a parallel or distributed system where multiple processes or threads
access shared resources, synchronization is the mechanism that controls
the order of their execution to prevent incorrect results.
The Critical Section Problem
A critical section is a segment of code that accesses a shared resource
(e.g., a shared variable, a file, a database). The fundamental challenge is
to ensure that at most one process is executing in its critical section at
any given time. Failure to do so leads to a race condition, where the final
state of the shared resource depends on the non-deterministic order of
execution, often resulting in corrupted data.
To solve the critical section problem, a solution must satisfy three
requirements:
1. Mutual Exclusion: If a process is executing in its critical section,
then no other processes can be executing in their critical sections.
2. Progress: If no process is executing in its critical section and some
processes wish to enter their critical sections, then the selection of
the process that will enter its critical section next cannot be
postponed indefinitely.
3. Bounded Waiting: There must be a bound on the number of times
that other processes are allowed to enter their critical sections after
a process has made a request to enter its critical section and before
that request is granted.
Synchronization Primitives
These are low-level mechanisms provided by hardware or the operating
system to achieve synchronization.
Mutex (Mutual Exclusion Lock): The simplest synchronization
tool. A mutex acts like a key to a room (the critical section). A
process must "acquire" the lock before entering the critical section
and "release" it upon exiting. If a process tries to acquire a lock that
is already held, it will be blocked until the lock is released.
Semaphore: A more general synchronization tool. A semaphore is
an integer variable that, apart from initialization, is accessed only
through two standard atomic operations: wait() (or P()) and signal()
(or V()).
o Counting Semaphore: The integer value can range over an
unrestricted domain. It can be used to control access to a
resource with a finite number of instances (e.g., a pool of 5
connections).
o Binary Semaphore: The integer value can range only
between 0 and 1. It functions similarly to a mutex. (A counting-semaphore example follows this list.)
Monitor: A high-level language construct that provides a
convenient and effective mechanism for process synchronization. A
monitor is an abstract data type that encapsulates shared data and
the procedures that operate on that data. Only one process can be
active within the monitor at any one time, ensuring mutual
exclusion. Monitors also provide condition variables to allow
processes to wait for specific conditions to become true.
Barriers: A synchronization point for a group of processes. A
process arriving at a barrier must wait until all other processes in
the group have also reached the barrier. Once all have arrived, they
are all released to continue execution. This is useful for coordinating
stages of a parallel algorithm.
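As a concrete illustration of the counting semaphore mentioned above, the following sketch uses C++20's std::counting_semaphore (availability of a C++20 toolchain is an assumption; the pool size and timings are arbitrary) to model a pool of three connections shared by eight workers:

```cpp
#include <semaphore>
#include <thread>
#include <chrono>
#include <vector>
#include <cstdio>

// A counting semaphore initialized to 3 models a pool of 3 identical resources
// (e.g. database connections): at most 3 workers hold one at any moment, and
// the rest block in acquire() until a slot is released.
std::counting_semaphore<3> pool(3);

void worker(int id) {
    pool.acquire();                                             // wait(S): take a slot or block
    std::printf("worker %d got a connection\n", id);
    std::this_thread::sleep_for(std::chrono::milliseconds(50)); // pretend to use it
    pool.release();                                             // signal(S): give the slot back
}

int main() {
    std::vector<std::thread> workers;
    for (int i = 0; i < 8; ++i) workers.emplace_back(worker, i);
    for (auto& w : workers) w.join();
    return 0;
}
```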
Deadlocks
A deadlock is a state in which two or more processes are stuck in a
circular wait, each waiting for a resource held by another process in the
chain. For a deadlock to occur, four conditions (Coffman conditions) must
hold simultaneously:
1. Mutual Exclusion: At least one resource must be held in a non-
sharable mode.
2. Hold and Wait: A process must be holding at least one resource
and waiting to acquire additional resources held by other processes.
3. No Preemption: Resources cannot be forcibly taken from a
process; they must be explicitly released.
4. Circular Wait: A set of waiting processes {P₀, P₁, ..., Pₙ} must exist
such that P₀ is waiting for a resource held by P₁, P₁ is waiting for a
resource held by P₂, ..., and Pₙ is waiting for a resource held by P₀.
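A minimal sketch (not from the original notes) of how the circular-wait condition arises when two threads take two locks in opposite orders; whether a given run actually deadlocks depends on timing, so treat it as an illustration rather than a reliable demonstration:

```cpp
#include <mutex>
#include <thread>

std::mutex A, B;

// Thread 1 takes A then B; thread 2 takes B then A. If each grabs its first
// lock before the other releases, all four Coffman conditions hold and both
// threads wait forever (the program may hang when run).
void t1() { std::lock_guard<std::mutex> a(A); std::lock_guard<std::mutex> b(B); }
void t2() { std::lock_guard<std::mutex> b(B); std::lock_guard<std::mutex> a(A); }

int main() {
    std::thread x(t1), y(t2);
    x.join();
    y.join();
    return 0;
}
```

A common fix is to break the circular-wait condition by imposing a global lock order (always acquire A before B), or by acquiring both locks atomically with std::scoped_lock(A, B).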
Process Parallel Languages
These are programming languages, libraries, and models designed to
express and manage parallelism.
Shared Memory Programming (e.g., OpenMP, POSIX
Threads):
o Model: All threads or processes share a single address space.
Communication is implicit; threads communicate by reading
and writing to shared variables. Synchronization must be
explicitly managed by the programmer using primitives like
mutexes and barriers.
o OpenMP (Open Multi-Processing): An API that uses
compiler directives (#pragmas) to easily parallelize loops and
sections of C, C++, and Fortran code. It is a higher-level, more
user-friendly approach.
o POSIX Threads (Pthreads): A library-based approach that
provides a set of C function calls to create and manage
threads. It offers more fine-grained control than OpenMP but is
also more complex to use.
Message Passing (e.g., MPI):
o Model: Processes have their own private memory spaces.
Communication is explicit; processes must send and receive
messages to exchange data. There is no shared data, so race
conditions on variables are not an issue, but deadlocks in
communication can occur (e.g., a send waiting for a receive
that never comes).
o MPI (Message Passing Interface): A standardized and
portable message-passing library. It is the de-facto standard
for programming large-scale distributed memory systems
(clusters). It provides a rich set of routines for point-to-point
communication (e.g., MPI_Send, MPI_Recv) and collective
communication (e.g., MPI_Bcast for broadcast, MPI_Reduce for
reduction).
Data Parallel Languages (e.g., CUDA, OpenCL):
o Model: Focuses on performing the same operation
simultaneously on large datasets. This model is highly suitable
for GPUs (Graphics Processing Units). The programmer writes
a single "kernel" function that is executed by thousands of
threads, each operating on a different element of the data.
o CUDA (Compute Unified Device Architecture): A
proprietary parallel computing platform and programming
model created by NVIDIA for their GPUs.
o OpenCL (Open Computing Language): An open standard
for writing programs that execute across heterogeneous
platforms consisting of CPUs, GPUs, DSPs, and other
processors.
Architecture of Parallel and Distributed Systems
This refers to the hardware organization of the system, particularly the
processors, memory, and interconnection network.
Flynn's Taxonomy (Recap)
A classification based on instruction and data streams:
SISD (Single Instruction, Single Data): Uniprocessor systems.
SIMD (Single Instruction, Multiple Data): A single instruction is
executed on multiple data elements. Ideal for data parallel tasks
(e.g., GPUs, vector processors).
MISD (Multiple Instruction, Single Data): Rarely used in
practice.
MIMD (Multiple Instruction, Multiple Data): Multiple
autonomous processors execute different instructions on different
data. This is the most common type of parallel system,
encompassing multi-core processors and distributed clusters.
Memory Architecture
| Feature | Shared Memory | Distributed Memory |
|---|---|---|
| Concept | A single global address space is accessible by all processors. | Each processor has its own private memory. |
| Communication | Implicit, via memory reads/writes. Fast and low-latency. | Explicit, via message passing over a network. Slower. |
| Synchronization | Programmer's responsibility, using locks and semaphores. | Implicit in message passing (a receive follows a send). |
| Scalability | Limited by memory bus bandwidth and contention. | Highly scalable by adding more nodes to the network. |
| Programming | Easier for simple tasks, but prone to race conditions. | More complex due to explicit communication management. |
| Sub-types | UMA (Uniform Memory Access), NUMA (Non-Uniform Memory Access) | MPPs (Massively Parallel Processors), Clusters |
Cache Coherence: In shared memory systems with caches, the
problem of ensuring that all processors have a consistent view of
the memory. If one processor updates a value in its cache, other
processors' caches must be updated or invalidated. This is typically
handled in hardware by protocols like MESI.
Interconnection Networks
The network fabric that connects processors and memory. The topology of
this network affects performance, scalability, and cost. Examples include:
Bus: Simple and low-cost, but becomes a bottleneck.
Crossbar Switch: Provides direct connection from any processor to
any memory module, offering maximum bandwidth but at a very
high cost.
Mesh/Torus: Processors are arranged in a grid, suitable for many
scientific simulations.
Hypercube: A complex topology that offers good connectivity and
low diameter.
Consistency and Replication
Replication: The practice of storing copies of the same data on
multiple machines.
o Why Replicate?
1. Reliability/Fault Tolerance: If one replica becomes
unavailable, the system can continue to function using
other replicas.
2. Performance: Placing data near the clients that use it
reduces access latency and network traffic.
Consistency: The challenge created by replication. When a data
item is updated, all its replicas must also be updated. A
consistency model is a contract between the data store and the
application, specifying the rules for the visibility and ordering of
updates.
Data-Centric Consistency Models
These models define the consistency of the data store itself. They range
from strong to weak:
1. Strict Consistency: The strongest model. Any read to a data item
returns the value of the most recent write. It behaves as if there
were a single global clock, which is practically impossible to
implement in a distributed system.
2. Sequential Consistency: A slightly weaker model. The result of
any execution is the same as if the operations of all processors were
executed in some sequential order, and the operations of each
individual processor appear in this sequence in the order specified
by its program. It's intuitive but can be slow.
3. Causal Consistency: A further relaxation. It requires that writes
that are causally related must be seen by all processes in the same
order. Concurrent writes may be seen in a different order by
different processes.
4. Eventual Consistency: A very weak model. It guarantees that if no
new updates are made to a given data item, eventually all accesses
to that item will return the last updated value. This model offers
high availability and scalability and is widely used in modern
systems like NoSQL databases (e.g., Amazon's DynamoDB,
Cassandra).
Security
Security in parallel and distributed systems is complex due to the multiple
points of entry, network communication, and shared resources. Key
security goals are:
Confidentiality: Preventing the disclosure of information to
unauthorized individuals or systems.
o Threat: Eavesdropping on network communication.
o Solution: Cryptography (encrypting data in transit with
protocols like SSL/TLS and at rest).
Integrity: Ensuring that data cannot be modified in an
unauthorized or undetected manner.
o Threat: Man-in-the-middle attacks, data tampering.
o Solution: Using cryptographic checksums or digital
signatures to verify data integrity.
Availability: Ensuring that the system is operational and accessible
when needed.
o Threat: Denial of Service (DoS) attacks, which flood a
server or network with traffic to make it unavailable.
o Solution: Firewalls, traffic filtering, and redundant systems.
Authentication: Verifying the identity of a user, process, or
machine.
o Solution: Passwords, digital certificates, Kerberos.
Authorization (Access Control): Determining what an
authenticated user is allowed to do.
o Solution: Access Control Lists (ACLs), Role-Based Access
Control (RBAC).
Parallel Operating Systems
An operating system that manages the resources of a parallel machine. It
differs from a network operating system, which simply allows machines on
a network to communicate, and from a standard single-machine OS.
The goal of a true Distributed Operating System (a type of parallel OS
for loosely coupled systems) is to provide a single system image or
transparency, hiding the fact that there are multiple independent
computers. The entire system looks like a single powerful computer to the
user.
Key design issues and responsibilities:
Process and Thread Management: Scheduling processes/threads
across multiple CPUs to balance the load and maximize parallelism.
Resource Management: Managing the collective resources of the
system, including a global view of memory and I/O devices.
Communication: Providing efficient and transparent inter-process
communication (IPC) mechanisms that hide the underlying network
communication.
Synchronization: Providing OS-level support for synchronization
primitives (mutexes, semaphores) that work across different nodes
in the system.
File Systems: Implementing a distributed file system that provides
a unified view of files across all nodes, managing data location and
replication transparently.
Examples include specialized OSs for supercomputers (e.g., Cray's
UNICOS) and research systems (e.g., Amoeba). However, a more common
approach is to use a standard kernel like Linux with added software layers
and middleware (e.g., clustering software) to manage parallel execution
and resources.
Unit -3
As a first step, I have initiated parallel Google searches for the three core
topics of UNIT - III. The initial results have provided a solid foundation for
each area:
Management of Resources in Parallel Systems: I've found
materials covering fundamental concepts like resource management
goals (efficiency, scalability), scheduling algorithms (static vs.
dynamic), and load balancing techniques. The key challenges of
managing resources in a parallel environment are also highlighted.
Tools for Parallel Computing: The searches have yielded
information on various categories of tools, including programming
environments (like MPI and OpenMP, which I've covered before but
can now place in this context), performance analysis tools (like
profilers and tracers), and debuggers designed for parallel
applications.
Parallel Database Systems and Multimedia Object Servers: I
have found good introductory articles and notes on parallel
database architectures (shared nothing, shared disk, shared
memory), parallel query processing techniques (inter-query and
intra-query parallelism), and the unique challenges posed by
multimedia servers, such as storing and streaming large objects and
quality of service (QoS) requirements.
The information gathered is a good starting point, but to create "full
notes," I need to go deeper. For resource management, I need to detail
specific load balancing algorithms and discuss process migration. For
tools, I should name and briefly describe specific, widely-used examples
for performance analysis and debugging. For parallel databases, I need to
elaborate on how operations like joins and sorting are parallelized. For
multimedia servers, I should detail storage strategies and data placement
techniques. The next step will be to perform more targeted searches to
flesh out these details.Here's a thinking process that could lead to the
detailed notes for UNIT - III:
1. Deconstruct the User's Request:
o The user wants "full notes" for "UNIT - III".
o The topics are:
Management of Resources in Parallel Systems
Tools for Parallel Computing
Parallel Database Systems and Multimedia Object
Servers
2. Initial Brainstorming & Keyword Association (Internal
Knowledge & Search Strategy):
o Management of Resources: What are the key resources?
Processors (CPUs), memory, I/O, network bandwidth. What's
the main goal? Efficiency. How do we achieve it? By keeping
all processors busy with useful work. This immediately leads
to two core concepts:
Scheduling: Deciding which task runs on which
processor and when. This can be static (pre-computed)
or dynamic (at runtime).
Load Balancing: The process of distributing the
workload evenly. This is a direct consequence of
scheduling. I need to explain static vs. dynamic load
balancing and provide examples of algorithms for each
(e.g., Round Robin for static, work-stealing for dynamic).
I should also touch on Process Migration: moving a
process from one node to another, which is a key
mechanism for dynamic load balancing. I'll need to
explain the challenges involved (state transfer, residual
dependencies).
o Tools for Parallel Computing: This is a practical topic. What
does a parallel programmer need to be successful?
First, they need a way to write the code. I've covered
Programming Environments (MPI, OpenMP) in Unit II,
but it's crucial to list them here again as they are
foundational tools.
Second, once the code is written, it will have bugs.
Parallel bugs are harder to find than sequential ones
(race conditions, deadlocks). So, I need to discuss
Parallel Debuggers (like GDB with multi-process
support, TotalView, DDT). I should explain why they are
different from regular debuggers (managing multiple
states, non-determinism).
Third, the code might be correct but slow. The
programmer needs to know why. This leads to
Performance Analysis and Visualization Tools. I
should categorize them:
Profilers: Show where the program spends its
time (e.g., gprof, VTune).
Tracers: Record events over time (e.g., Vampir,
Score-P). This helps visualize communication
patterns and identify bottlenecks.
I should structure this section by the programmer's
workflow: Write -> Debug -> Analyze/Optimize.
o Parallel Database Systems and Multimedia Object
Servers: This is a two-part topic.
Parallel Database Systems:
Why parallelize a database? To handle huge
datasets (Big Data) and complex queries faster.
The most important concept is Architecture. I
need to detail the three main types:
Shared Memory: Simple to program, but
bottleneck issues.
Shared Disk: Balances simplicity and
scalability, but has contention at the disk
level.
Shared Nothing: The most scalable and
popular for large systems. Each node has its
own CPU, memory, and disk. I must
emphasize this one.
Next, I need to explain Parallel Query
Processing. How are database operations like
SELECT, JOIN, GROUP BY made parallel?
Inter-query parallelism: Running different
queries at the same time.
Intra-query parallelism: Parallelizing a
single complex query. This is the more
interesting part. I should explain how
operations are parallelized:
Parallel Sorting (using techniques from
Unit I).
Parallel Joins (e.g., Hash Join, Sort-
Merge Join).
I can use a simple table to compare the
architectures.
Multimedia Object Servers:
How is this different from a database? It deals with
large, continuous-media objects (video, audio).
The main requirements are high bandwidth and
low latency (for streaming).
What are the key challenges?
Storage: Storing massive files.
Real-time Delivery (QoS): The data must
be delivered at a steady rate (e.g., 30
frames per second). This is a Quality of
Service (QoS) requirement.
User Interaction: Supporting VCR-like
functions (pause, rewind, fast-forward).
What are the architectural solutions?
Data Placement/Striping: Spreading a
single video file across multiple disks (or
nodes in a shared-nothing system). This
allows multiple disk heads to read different
parts of the file simultaneously, aggregating
their bandwidth. I should explain different
striping strategies (round-robin).
Admission Control: The server must
decide if it has enough resources
(bandwidth, buffer space) to accept a new
streaming request without degrading
existing streams.
Buffering: Using memory buffers to smooth
out variations in disk access times and
network delays.
3. Structure the Notes:
o Create a main heading for "UNIT - III".
o Use a major heading (###) for each of the three main
syllabus topics.
o Under "Management of Resources," create subheadings for
"Goals," "Scheduling," "Load Balancing," and "Process
Migration." Use bullet points for different algorithm types.
o Under "Tools," organize by function: "Programming
Environments," "Debuggers," and "Performance Analysis
Tools." This creates a logical workflow.
o Under "Parallel Database Systems and Multimedia Object
Servers," create two distinct major sub-sections for each topic.
For databases, use subheadings for "Architectures" and
"Parallel Query Processing." A comparison table for
architectures is a great idea. For multimedia servers, use
subheadings for "Challenges" and "Key Techniques."
4. Flesh out each section (Drafting - combining search results
and internal knowledge):
o Resource Management: Define the goals clearly. When
explaining scheduling, link static to predictable workloads and
dynamic to unpredictable ones. For load balancing, describe a
simple static algorithm (Round Robin) and a more complex
dynamic one (Work Stealing, explaining the "thief" and
"victim" concept). For process migration, list the steps: freeze,
transfer state, forward messages, unfreeze. This makes the
abstract concept concrete.
o Tools: For each tool type, explain what it is and why it's
needed in a parallel context. Name-drop specific, well-known
tools to make it less abstract (e.g., gdb, TotalView, VTune,
Vampir). For performance tools, the distinction between
profiling ("where is the time spent?") and tracing ("what
happened over time?") is crucial.
o Databases & Multimedia:
Databases: Start with the "why." The architecture table
is key. For shared-nothing, emphasize that it's the
foundation of modern systems like Google's Bigtable,
Hadoop, and most data warehouses. When explaining
intra-query parallelism, use the JOIN operation as the
primary example, as it's often the most expensive.
Explain how a parallel hash join works (partitioning
phase, join phase).
Multimedia Servers: Contrast them with traditional
file servers and databases. The core concept to hammer
home is Quality of Service (QoS) and the need for
continuous, real-time data delivery. Explain disk striping
with a simple diagram or clear example (e.g., block 1 to
disk 1, block 2 to disk 2, etc.). Explain admission control
with an analogy: a bouncer at a club who only lets
people in if there's enough room to dance without
bumping into everyone.
5. Review and Refine:
o Read through the entire document.
o Check for clarity and flow. Does the explanation of resource
management set the stage for why performance tools are
needed? Does the discussion of parallel database
architectures (especially shared-nothing) naturally lead into
the discussion of multimedia servers which often use a similar
architecture?
o Ensure key terms are in bold.
o Add a brief introduction and conclusion for the unit to frame
the content.
o Make sure the level of detail is appropriate for "full notes" –
not a PhD thesis, but more than just a list of definitions. The
inclusion of how things work (e.g., work-stealing, parallel hash
join, disk striping) is what makes the notes comprehensive.
This systematic process ensures all parts of the user's request are covered in a structured, detailed, and easy-to-understand manner.

UNIT - III: Resource Management, Tools, and Specialized Systems
This unit explores the practical aspects of working with parallel systems.
We will cover how to efficiently manage the system's resources, the
essential tools required for developing and optimizing parallel
applications, and the specialized architectures for handling large-scale
data in parallel databases and multimedia servers.
Management of Resources in Parallel Systems
Effective resource management is crucial for achieving the high
performance promised by parallel computing. The primary goal is to
maximize system utilization and throughput while minimizing execution
time. This involves strategically allocating resources like processors,
memory, and I/O bandwidth to competing parallel tasks.
Key Goals of Resource Management
High Performance: Achieve the best possible speedup for
applications.
Efficiency: Ensure all processors are kept busy with useful
computation.
Scalability: The management strategy should work effectively as
the number of processors grows.
Fairness: Provide fair access to resources for different users and
applications.
Stability: The system should not become unstable under high load.
Task Scheduling
Scheduling determines which task runs on which processor at which time.
Static Scheduling:
o Task allocation to processors is determined before the
program begins execution.
o Works best when the workload of each task is predictable and
uniform.
o Advantages: Low runtime overhead as decisions are made
offline.
o Disadvantages: Inflexible. Cannot adapt to runtime
variations in load, leading to poor performance if the workload
is irregular.
o Example Algorithms: Round-robin, randomized allocation.
Dynamic Scheduling:
o Task allocation decisions are made at runtime.
o Adapts to the changing state of the system, making it suitable
for irregular and unpredictable problems.
o Advantages: Can achieve better load balance and overall
performance.
o Disadvantages: Incurs runtime overhead due to the
scheduling process itself.
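To make the contrast concrete, here is a minimal sketch (not part of the original notes) that times the same irregular loop under OpenMP's static and dynamic schedule clauses; the synthetic `irregular_work` function and all constants are illustrative assumptions.

```c
/* Minimal sketch: static vs. dynamic scheduling with OpenMP's schedule clause.
 * The "irregular work" loop is a made-up workload for illustration.
 * Compile with: gcc -fopenmp sched.c */
#include <stdio.h>
#include <omp.h>

/* Simulate a task whose cost grows with its index (irregular workload). */
static double irregular_work(int i) {
    double s = 0.0;
    for (int k = 0; k < (i + 1) * 1000; k++)
        s += k * 1e-9;
    return s;
}

int main(void) {
    const int n = 1000;
    double total = 0.0;

    double t0 = omp_get_wtime();
    /* Static: iterations are split into fixed chunks before the loop runs
     * (low overhead, but poor balance for irregular work). */
    #pragma omp parallel for schedule(static) reduction(+:total)
    for (int i = 0; i < n; i++)
        total += irregular_work(i);
    double t_static = omp_get_wtime() - t0;

    t0 = omp_get_wtime();
    /* Dynamic: idle threads grab the next chunk at runtime
     * (extra overhead, but better balance for irregular work). */
    #pragma omp parallel for schedule(dynamic, 16) reduction(+:total)
    for (int i = 0; i < n; i++)
        total += irregular_work(i);
    double t_dynamic = omp_get_wtime() - t0;

    printf("static: %.3f s, dynamic: %.3f s (total=%g)\n",
           t_static, t_dynamic, total);
    return 0;
}
```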
Load Balancing
Load balancing is the process of distributing the computational workload
across processors to ensure that no single processor becomes a
bottleneck while others are idle. It is a key component of dynamic
scheduling.
Static Load Balancing: A-priori distribution of work based on
known task characteristics. It's part of the static scheduling process.
Dynamic Load Balancing: Tasks are redistributed during
execution. Common strategies include:
o Centralized Schemes: A single "master" process manages
the workload and assigns tasks to "worker" processes. This is
simple but the master can become a bottleneck.
o Decentralized Schemes (Distributed): All processors
participate in the load balancing decisions. This is more
scalable and robust.
Work-Stealing: An idle processor (the "thief") actively looks for a busy processor (the "victim") and "steals" a task from its work queue. This is a highly effective and widely used technique (a minimal sketch follows this list).
Work-Sharing: Overloaded processors actively try to
offload tasks to underloaded or idle processors.
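Below is a minimal, mutex-based work-stealing sketch in C. It only illustrates the thief/victim idea under simplifying assumptions (a fixed task pool and coarse locking); production runtimes such as Cilk or TBB use lock-free deques. All names in the code are invented for the example.

```c
/* Mutex-based work-stealing sketch (illustration only). Compile: gcc -pthread ws.c */
#include <pthread.h>
#include <stdio.h>

#define NWORKERS 4
#define NTASKS   64

typedef struct {
    int tasks[NTASKS];
    int top;                 /* number of tasks still in this queue */
    pthread_mutex_t lock;
} queue_t;

static queue_t queues[NWORKERS];

/* Pop a task from our own queue, or -1 if it is empty. */
static int pop_local(int me) {
    int t = -1;
    pthread_mutex_lock(&queues[me].lock);
    if (queues[me].top > 0) t = queues[me].tasks[--queues[me].top];
    pthread_mutex_unlock(&queues[me].lock);
    return t;
}

/* The "thief": scan other queues and steal one task from a busy "victim". */
static int steal(int me) {
    for (int v = 0; v < NWORKERS; v++) {
        if (v == me) continue;
        int t = -1;
        pthread_mutex_lock(&queues[v].lock);
        if (queues[v].top > 0) t = queues[v].tasks[--queues[v].top];
        pthread_mutex_unlock(&queues[v].lock);
        if (t >= 0) return t;
    }
    return -1;
}

static void *worker(void *arg) {
    int me = (int)(long)arg;
    for (;;) {
        int t = pop_local(me);
        if (t < 0) t = steal(me);          /* become a thief when idle */
        if (t < 0) break;                  /* nothing left anywhere: done */
        printf("worker %d ran task %d\n", me, t);
    }
    return NULL;
}

int main(void) {
    pthread_t tid[NWORKERS];
    /* Deliberately imbalanced initial distribution: all tasks go to worker 0. */
    for (int w = 0; w < NWORKERS; w++) {
        pthread_mutex_init(&queues[w].lock, NULL);
        queues[w].top = 0;
    }
    for (int i = 0; i < NTASKS; i++) queues[0].tasks[queues[0].top++] = i;

    for (int w = 0; w < NWORKERS; w++)
        pthread_create(&tid[w], NULL, worker, (void *)(long)w);
    for (int w = 0; w < NWORKERS; w++)
        pthread_join(tid[w], NULL);
    return 0;
}
```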
Process Migration
Process migration is the act of transferring a process from one processor
to another during its execution. It is the underlying mechanism that
enables dynamic load balancing. The process involves:
1. Freezing the Process: Halting the process on the source node.
2. State Transfer: Capturing and transferring the process's entire
state (address space, execution context, open file descriptors) to the
destination node.
3. Redirection of Communication: Forwarding any messages or
signals meant for the process to its new location.
4. Restarting the Process: Unfreezing and resuming the process on
the destination node.
Process migration is a complex and costly operation, so it must be used
judiciously.
Tools for Parallel Computing
Developing, debugging, and optimizing parallel programs is significantly
more complex than for sequential programs. A suite of specialized tools is
essential for programmer productivity.
1. Programming Environments and Libraries
These are the foundational tools for writing parallel code.
Message Passing Interface (MPI): A library standard for writing
programs for distributed memory systems.
OpenMP (Open Multi-Processing): A directive-based API for
writing shared-memory parallel programs.
CUDA and OpenCL: Programming models and platforms for data-
parallel computing on GPUs.
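A minimal hybrid "hello world" shows how these environments combine in practice: MPI between processes and OpenMP threads inside each process. This is a generic sketch assuming a working MPI installation, not a program taken from the notes.

```c
/* Hybrid MPI + OpenMP hello sketch.
 * Compile with: mpicc -fopenmp hello.c   Run with: mpirun -np 4 ./a.out */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided, rank, size;
    /* Request an MPI thread level that allows OpenMP threads inside each rank. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    #pragma omp parallel
    {
        /* Each MPI process (distributed memory) spawns OpenMP threads (shared memory). */
        printf("rank %d of %d, thread %d of %d\n",
               rank, size, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}
```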
2. Parallel Debuggers
Finding bugs in parallel programs (e.g., race conditions, deadlocks) is
challenging due to non-deterministic execution.
Challenges for Debuggers:
o Managing and displaying the state of thousands of concurrent
processes/threads.
o The "probe effect," where the act of debugging alters the
program's execution and can hide the bug.
o Handling non-deterministic race conditions that may not
appear in every run.
Example Tools:
o GDB (GNU Debugger): Can be scripted to manage multiple
processes.
o TotalView / DDT (Linaro DDT): Powerful commercial
debuggers designed specifically for parallel codes, providing
graphical interfaces to manage and inspect many processes
simultaneously.
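The toy program below shows the kind of bug these debuggers (and race detectors such as ThreadSanitizer or Helgrind) exist to find: an unsynchronized counter update whose result changes from run to run. It is illustrative only.

```c
/* A classic data race and its fix. Compile with: gcc -fopenmp race.c */
#include <stdio.h>
#include <omp.h>

int main(void) {
    long counter = 0;
    #pragma omp parallel for
    for (int i = 0; i < 1000000; i++) {
        counter++;              /* BUG: unsynchronized read-modify-write -> race */
    }
    /* The result is usually < 1000000 and differs from run to run;
     * this non-determinism is exactly what makes such bugs hard to reproduce. */
    printf("racy counter = %ld\n", counter);

    counter = 0;
    #pragma omp parallel for reduction(+:counter)
    for (int i = 0; i < 1000000; i++) {
        counter++;              /* fixed: each thread keeps a private partial sum */
    }
    printf("correct counter = %ld\n", counter);
    return 0;
}
```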
3. Performance Analysis and Visualization Tools
Once a parallel program is running correctly, the next step is to optimize
its performance. Performance tools help identify bottlenecks and
inefficiencies.
Profilers:
o Function: Collect statistics about the program's execution,
showing where it spends the most time (which functions or
lines of code are "hotspots").
o Example Tools: gprof (standard Unix profiler), Intel VTune
Profiler, AMD uProf. These tools help identify computational
bottlenecks.
Tracers and Visualization Tools:
o Function: Record a time-stamped log of events (e.g., function
calls, message passing, synchronization events) during
program execution. This trace data can then be visualized on
a timeline to understand complex interactions between
processes. They are excellent for identifying communication
and load imbalance issues.
o Example Tools:
Vampir / Score-P: A widely used toolset to instrument,
trace, and visualize the performance of MPI and OpenMP
applications.
Paraver: A flexible performance visualizer that can
display a wide variety of information from trace files.
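Before reaching for a full profiler or tracer, the same question ("where does the time go?") can be asked by hand. The sketch below, assuming MPI, times a compute phase and a communication phase separately with MPI_Wtime; profilers and tracers automate and refine exactly this kind of breakdown.

```c
/* Hand-rolled compute vs. communicate timing sketch. Compile with mpicc. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank; MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = 0.0, global = 0.0;
    double t0 = MPI_Wtime();
    for (long i = 0; i < 50 * 1000 * 1000; i++)   /* "computation" phase */
        local += 1e-9 * i;
    double t_compute = MPI_Wtime() - t0;

    t0 = MPI_Wtime();
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    double t_comm = MPI_Wtime() - t0;             /* "communication" phase */

    printf("rank %d: compute %.4f s, communicate %.4f s\n", rank, t_compute, t_comm);
    MPI_Finalize();
    return 0;
}
```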
Parallel Database Systems
Parallel database systems leverage parallel processing to execute
database operations, enabling them to handle massive datasets
(terabytes or petabytes) and high query loads far beyond the capability of
a single machine.
Architectures of Parallel Databases
1. Shared Memory Architecture:
o All processors share a single main memory and a single set of
disks.
o Pros: Simple to implement and manage; efficient
communication between processors.
o Cons: Becomes a bottleneck at the memory bus and
interconnect; not scalable to a large number of processors.
2. Shared Disk Architecture:
o Each processor has its own private memory, but all processors
share access to all disks through an interconnection network.
o Pros: Good balance of scalability and simplicity; high
availability, as another processor can take over if one fails.
o Cons: The interconnection to the disk subsystem can become
a bottleneck; requires a distributed lock manager to handle
contention.
3. Shared Nothing Architecture:
o Each processor (or node) has its own private memory and its
own private disk(s). Processors communicate by passing
messages over a network. This is the most common
architecture for large-scale data warehousing and big data
systems.
o Pros: Highly scalable by simply adding more nodes; no
central point of contention.
o Cons: More complex to implement and manage; the cost of
data exchange and processing over the network is higher.
| Architecture | Scalability | Fault Tolerance | Cost / Complexity |
|---|---|---|---|
| Shared Memory | Low | Low | Low |
| Shared Disk | Medium | High | Medium |
| Shared Nothing | High | High (via replication) | High |
Parallel Query Processing
Inter-query Parallelism: Different queries are executed in parallel by different processors. This increases the overall query throughput.
Intra-query Parallelism: A single, complex query is broken down and its sub-operations are executed in parallel. This reduces the response time for the query.
o Parallel Sort: The dataset is partitioned among processors. Each processor sorts its local partition, and then the sorted partitions are merged.
o Parallel Join: This is often the most expensive database operation.
o Partitioned Hash Join: The two tables to be joined are "hashed" and partitioned across the processors. Each processor then performs a local hash join on its partitions. This is very effective in shared-nothing architectures (a short sketch follows this list).
o Sort-Merge Join: Both tables are sorted on the join attribute in parallel, and then the sorted tables are merged and the join is performed.
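The following single-process sketch illustrates the two phases of a partitioned hash join on toy in-memory tables; in a real shared-nothing system each partition would reside on a separate node and phase 2 would run in parallel. Table contents and sizes are assumptions made for the example.

```c
/* Sketch of a partitioned hash join on two tiny "tables" R(key, a) and S(key, b). */
#include <stdio.h>

#define P 4            /* number of partitions ("nodes") */
#define MAXROWS 16

typedef struct { int key; int val; } row_t;

static int hash_part(int key) { return key % P; }

int main(void) {
    row_t R[] = {{1,10},{2,20},{3,30},{4,40},{5,50},{6,60}};
    row_t S[] = {{2,200},{4,400},{6,600},{8,800}};
    int nR = 6, nS = 4;

    /* Phase 1: partition both tables by a hash of the join key. */
    row_t Rp[P][MAXROWS], Sp[P][MAXROWS];
    int nRp[P] = {0}, nSp[P] = {0};
    for (int i = 0; i < nR; i++) { int p = hash_part(R[i].key); Rp[p][nRp[p]++] = R[i]; }
    for (int i = 0; i < nS; i++) { int p = hash_part(S[i].key); Sp[p][nSp[p]++] = S[i]; }

    /* Phase 2: each partition is joined independently (this loop is what the P
     * nodes would execute in parallel). A nested loop stands in for the
     * per-partition hash table to keep the sketch short. */
    for (int p = 0; p < P; p++)
        for (int i = 0; i < nRp[p]; i++)
            for (int j = 0; j < nSp[p]; j++)
                if (Rp[p][i].key == Sp[p][j].key)
                    printf("join: key=%d a=%d b=%d (partition %d)\n",
                           Rp[p][i].key, Rp[p][i].val, Sp[p][j].val, p);
    return 0;
}
```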
Multimedia Object Servers
A multimedia object server is a specialized system designed to store, manage, and deliver large, continuous-media objects like video and audio to a large number of concurrent users. Its primary challenge is not just data volume, but the need for real-time, continuous delivery.
Key Challenges
High Storage Requirements: A single high-definition movie can be many gigabytes.
High Bandwidth Demands: Streaming video requires significant, sustained network bandwidth.
Real-time Delivery (Quality of Service - QoS): Data must be delivered at a constant rate (e.g., 30 frames per second) to avoid jitter or stalls in playback. This is the most critical requirement.
User Interactivity: Supporting VCR-style functions (play, pause, fast-forward, rewind) for many users.
Key Techniques and Architecture
Multimedia servers often use a shared-nothing or shared-disk architecture to achieve the necessary I/O bandwidth.
Data Placement (Striping): A single multimedia file (e.g., a movie) is broken into blocks and "striped" across multiple disks (see the sketch at the end of this section).
o Fine-grained Striping: Small blocks are distributed in a round-robin fashion. This allows the server to read from all disks simultaneously, aggregating their bandwidth to service a single high-bitrate stream.
o Coarse-grained Striping: Larger chunks of the file are placed on different disks. This allows different disks to service requests for different files independently.
Admission Control: Before starting a new stream, the server checks whether it has sufficient resources (CPU, disk bandwidth, network bandwidth, buffer space) to service the new request without degrading the QoS of existing streams. If not, the new request is rejected. This is essential for maintaining stability.
Disk Scheduling: Traditional disk scheduling algorithms (like First-Come, First-Served) are not suitable. Multimedia servers use real-time scheduling algorithms (e.g., SCAN-EDF, Earliest Deadline First) that prioritize read requests based on their playback deadlines to ensure continuity.
Buffering: Server-side and client-side buffers are used to smooth out variations (jitter) in disk access and network delivery times, ensuring a steady playback experience for the user.
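A compact sketch of two of the techniques above: round-robin block placement and a bandwidth-based admission-control test. The disk count, per-disk bandwidth, and stream bitrate are made-up numbers used only to show the logic.

```c
/* Round-robin striping and a simplified admission-control check (toy numbers). */
#include <stdio.h>

#define NDISKS 4

/* Round-robin placement: block i of a file goes to disk i mod NDISKS. */
static int disk_for_block(int block) { return block % NDISKS; }

/* Admit a new stream only if the aggregate bitrate still fits within the
 * server's total disk bandwidth (a deliberately simplified QoS test). */
static int admit(double used_mbps, double new_mbps, double capacity_mbps) {
    return used_mbps + new_mbps <= capacity_mbps;
}

int main(void) {
    for (int b = 0; b < 8; b++)
        printf("block %d -> disk %d\n", b, disk_for_block(b));

    double capacity = 4 * 200.0;        /* 4 disks x 200 Mb/s each (assumed) */
    double used = 0.0, stream = 25.0;   /* each stream needs 25 Mb/s (assumed) */
    int accepted = 0;
    while (admit(used, stream, capacity)) { used += stream; accepted++; }
    printf("admitted %d streams before refusing new requests\n", accepted);
    return 0;
}
```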
Unit-4
Here's a thinking process that could lead to the detailed notes for UNIT -
IV:
1. Deconstruct the User's Request:
o The user wants "full notes" for "UNIT - IV".
o The topics are:
1. Networking Aspects of Distributed and Parallel
Computing
2. Process-Parallel and Distributed Scientific Computing
2. Initial Brainstorming & Keyword Association (Internal
Knowledge & Search Strategy):
o Part 1: Networking Aspects: This is about the "glue" that
connects processors. It's not just about the internet (WWW);
it's about high-performance interconnects within a
supercomputer or cluster.
What connects them? Interconnection Networks.
What shapes do they take? Topologies. I need to
cover the classic ones: Bus, Ring, Mesh, Torus,
Hypercube, and modern ones like Fat-Tree. I should
create a table to compare them on metrics like
diameter, bisection bandwidth, and cost.
How does data move?
Switching Techniques: How is a path
established? Circuit switching (like an old
telephone call - dedicated path) vs. Packet
switching (like internet traffic - broken into
packets). I should also mention store-and-forward
vs. cut-through routing for packets.
Routing Algorithms: How is the path chosen?
This can be Deterministic (e.g., Dimension-Order
Routing in a mesh) or Adaptive (chooses path
based on network congestion).
What are the rules of communication?
Communication Protocols.
I must start with the standard TCP/IP. Explain its
role but also its limitations for HPC (high overhead
due to its general-purpose, reliable nature).
Then, I must introduce the high-performance
alternatives. The keyword here is RDMA (Remote
Direct Memory Access). I need to explain why
it's so much faster (kernel bypass, zero-copy). This
is a critical concept. I'll mention InfiniBand as the
primary hardware technology that implements
RDMA.
What measures performance? Latency and
Bandwidth. I must define both clearly and explain their
trade-offs.
o Part 2: Scientific Computing: This is about the application
of parallel computing to solve science and engineering
problems. The term "Process-Parallel" points towards MIMD-
style parallelism where different processes (often MPI
processes) work on parts of a larger problem.
How do we parallelize scientific problems? The
most common technique is Domain Decomposition. I
need to explain this concept clearly: break up the
physical domain (e.g., a car body, a weather map) into
sub-domains and assign each to a processor. This
creates "halo" regions that require communication.
What are the common computational
patterns/paradigms?
Embarrassingly Parallel: The easiest case. No
communication needed between tasks (e.g.,
Monte Carlo simulations, parameter sweeps). This
is a good starting point.
Synchronous Parallelism: The most common in
scientific computing. Processors compute on their
local data and then enter a global
synchronization/communication phase (e.g., a
barrier or MPI_Allreduce). This often happens in
iterative solvers.
Asynchronous Parallelism: Processors work and
communicate independently without global
barriers. This can be more efficient but is much
harder to program and debug.
What are the classic examples? I need to pick 2-3
canonical scientific problems and explain how they are
parallelized.
N-Body Simulation: Simulating gravitational or
electrostatic forces between particles (e.g.,
galaxies, molecules). I should explain the O(N^2)
all-pairs approach and how it's parallelized by
assigning particles to processors. Then, I should
mention more advanced, efficient methods like
Barnes-Hut (tree-based), which are also
parallelizable.
Finite Element Method (FEM): Used for
structural analysis, heat transfer, etc. The core
idea is to mesh a continuous object into discrete
elements. I'll explain that the parallelization
involves: 1) Parallel mesh generation, and 2)
Parallelizing the solver, which solves a large
system of linear equations (Ax=b). This involves
communication at the boundaries of the sub-
domains.
Computational Fluid Dynamics (CFD): Solving
equations of fluid flow (Navier-Stokes) on a grid.
This is very similar to FEM in its parallelization
strategy: decompose the grid, have each
processor solve the equations for its sub-grid, and
communicate boundary values with neighbors.
This is a classic example of synchronous, nearest-
neighbor communication.
3. Structure the Notes:
o Create a main heading for "UNIT - IV".
o Create two major sections using ### headings for the two
main topics.
o In "Networking Aspects":
Use subheadings for "Interconnection Networks,"
"Switching Techniques," "Routing Algorithms," and
"Communication Protocols."
The table comparing network topologies will be very
effective.
When discussing protocols, create a clear contrast
between TCP/IP and RDMA/InfiniBand.
o In "Scientific Computing":
Start with the foundational concept of "Domain
Decomposition." Use a simple diagram or a clear textual
description.
Use a subheading for "Common Computational
Paradigms" and explain the different patterns
(embarrassingly parallel, synchronous, asynchronous).
Use a major subheading "Case Studies in Scientific
Computing." Under this, create individual sections for
"N-Body Simulation," "Finite Element Method (FEM),"
and "Computational Fluid Dynamics (CFD)." For each
case study, explain:
1. What the problem is.
2. The parallelization strategy (usually domain
decomposition).
3. The nature of the computation and
communication.
4. Flesh out each section (Drafting):
o Networking: For the topology table, I'll fill in the columns for
Diameter, Bisection Width, Node Degree, and typical use
cases. For routing, I'll clearly define Dimension-Order Routing
as a simple, deterministic example. For protocols, I'll write a
paragraph explaining why the TCP/IP stack in the OS kernel is
slow for HPC (context switches, data copies) and then explain
how RDMA's kernel bypass and zero-copy features solve this.
o Scientific Computing: I'll start with a strong, clear
explanation of domain decomposition, maybe using an
analogy of dividing a large map among several people to color
in. For the N-body problem, I will explicitly write down the
O(N^2) complexity and show how distributing N particles
among P processors leads to O(N^2/P) computation, but also
introduces communication. For FEM/CFD, I'll emphasize the
iterative nature: Compute on the local domain ->
Communicate boundary data -> Synchronize -> Repeat.
This loop is the heart of most synchronous scientific
algorithms. I will explicitly mention the "halo" or "ghost cell"
exchange as the main communication pattern.
5. Review and Refine:
o Read through the entire document.
o Ensure the two parts feel connected. The "Networking
Aspects" section explains the how of communication, while
the "Scientific Computing" section provides concrete examples
of why efficient communication is so critical. I can add a
sentence to bridge the two sections.
o Check that key terms are bolded and clearly defined (latency,
bandwidth, RDMA, domain decomposition, halo exchange).
o Is the explanation of FEM and CFD distinct enough? I'll make
sure to mention the different underlying equations (e.g.,
structural mechanics for FEM, Navier-Stokes for CFD) even if
the parallel strategy is similar.
o Final check for flow and clarity. The notes should guide the
reader from the hardware and protocols up to the high-level
application of those concepts.

UNIT - IV: Networking and Scientific Computing
This unit bridges the gap between the underlying hardware and high-level
applications. We first explore the critical networking aspects that form the
communication backbone of any parallel or distributed system. We then
examine how these systems are applied to solve large-scale problems in
the domain of scientific computing.
Networking Aspects of Distributed and Parallel Computing
The performance of a parallel system is not just determined by the speed
of its processors, but equally by the speed and efficiency of the network
that connects them. This "interconnect" is fundamental to how quickly
data can be shared and how well tasks can be synchronized.
1. Interconnection Networks
The physical layout and connection structure of the network is called its
topology. The choice of topology involves trade-offs between
performance, cost, and scalability.
Key Metrics for Topologies:
o Diameter: The longest shortest path between any two nodes.
A smaller diameter implies lower worst-case communication
latency.
o Bisection Width: The minimum number of links that must be
cut to divide the network into two equal halves. A higher
bisection width indicates greater aggregate bandwidth and
less contention.
o Node Degree: The number of links connected to a single
node. Higher degrees can offer better performance but
increase cost and complexity.
Common Topologies:
| Topology | Diameter | Bisection Width | Node Degree | Pros / Cons |
|---|---|---|---|---|
| Bus | O(N) | 1 | 1 | Simple, low-cost. Not scalable, becomes a bottleneck. |
| Ring | O(N) | 2 | 2 | Simple, better than bus. Still not very scalable. |
| 2D Mesh | O(√N) | O(√N) | 4 | Good for problems with local communication (e.g., grids). |
| Torus | O(√N) | O(√N) | 4 | A mesh with wraparound connections, improving diameter and bisection width. |
| Hypercube | O(log N) | O(N) | O(log N) | Excellent connectivity and low diameter. Node degree grows with size. |
| Fat-Tree | O(log N) | O(N) | Varies | Modern, highly scalable topology used in many supercomputers. Provides higher bandwidth near the top of the tree. |
2. Switching Techniques
Switching determines how data moves from a source to a destination
node through the network's switches.
Circuit Switching: A dedicated physical path is established from
source to destination before the message is sent. The entire path is
reserved for the duration of the communication.
o Analogy: An old-fashioned telephone call.
o Pros: Guaranteed bandwidth, no contention once the circuit is
set up.
o Cons: High initial setup time (latency); inefficient for short,
bursty messages as the path is underutilized.
Packet Switching: The message is broken down into smaller units
called packets. Each packet is routed independently through the
network and may take a different path.
o Analogy: The postal system (or the Internet itself), where each packet is routed independently.
o Store-and-Forward: The entire packet is received and stored
at an intermediate switch before being forwarded to the next
hop. High latency.
o Cut-Through Routing: The switch begins forwarding the
packet as soon as the destination address is read, without
waiting for the full packet to arrive. Significantly reduces
latency.
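A rough, textbook-style latency model (not stated in the notes above) captures the difference. With packet length L, header length L_h, link bandwidth B, and h hops:

```latex
T_{\text{store-and-forward}} \approx h \cdot \frac{L}{B}
\qquad
T_{\text{cut-through}} \approx h \cdot \frac{L_h}{B} + \frac{L}{B}
```

For long packets the cut-through term h·L_h/B is negligible, so its latency is nearly independent of the hop count, whereas store-and-forward latency grows linearly with the number of hops.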
3. Routing Algorithms
A routing algorithm determines the specific path a packet takes through
the network.
Deterministic Routing: The path is determined entirely by the
source and destination addresses. It does not consider network
congestion.
o Example: Dimension-Order Routing (DOR) in a mesh. A
packet first travels along the X-dimension until its X-coordinate
matches the destination, then travels along the Y-dimension.
It's simple and deadlock-free but cannot route around congested areas (a small sketch follows this list).
Adaptive Routing: The path is chosen based on the current state
of the network, such as traffic or link failures.
o Pros: Can improve performance by avoiding hotspots and
balancing the network load.
o Cons: More complex to implement; can be prone to deadlocks
if not designed carefully.
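Here is a tiny sketch of dimension-order (XY) routing on a 2D mesh, as referenced above; the coordinates and output format are illustrative.

```c
/* Dimension-Order (XY) Routing sketch for a 2D mesh: move along X until the
 * column matches, then along Y. */
#include <stdio.h>

static void route_xy(int sx, int sy, int dx, int dy) {
    int x = sx, y = sy;
    printf("(%d,%d)", x, y);
    while (x != dx) { x += (dx > x) ? 1 : -1; printf(" -> (%d,%d)", x, y); }
    while (y != dy) { y += (dy > y) ? 1 : -1; printf(" -> (%d,%d)", x, y); }
    printf("\n");
}

int main(void) {
    route_xy(0, 0, 3, 2);   /* the path is fixed, regardless of congestion */
    return 0;
}
```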
4. Communication Protocols
Protocols are the set of rules governing communication.
TCP/IP (Transmission Control Protocol/Internet Protocol):
o The standard protocol suite for the Internet and general-
purpose networking.
o Role in Distributed Systems: Provides reliable, ordered,
error-checked delivery of data.
o Limitations for High-Performance Computing (HPC): TCP
has high overhead. It is processed by the OS kernel, which
involves multiple data copies and context switches,
introducing significant latency. This makes it unsuitable for the
rapid, low-latency messaging required by tightly coupled
parallel applications.
High-Performance Protocols (RDMA):
o RDMA (Remote Direct Memory Access): A technology that
allows a node to directly access the memory of another node
without involving the operating system of either node.
o Key Features:
Kernel Bypass: The network interface card (NIC)
communicates directly with the application's memory,
bypassing the slow OS kernel.
Zero-Copy: Data is transferred directly from the source
memory to the network and then to the destination
memory without any intermediate copies, dramatically
reducing CPU overhead and latency.
o InfiniBand: A high-performance networking standard
commonly used in supercomputers and data centers that is
built around the RDMA concept. It provides much higher
bandwidth and lower latency than traditional Ethernet with
TCP/IP.
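A simple first-order cost model (the standard "latency + bandwidth" model, not taken from the notes) makes the comparison quantitative:

```latex
T_{\text{msg}} = \alpha + \frac{m}{\beta}
```

Here α is the per-message start-up latency, m is the message size, and β is the link bandwidth. Kernel-based TCP/IP inflates α through data copies and context switches; RDMA's kernel bypass and zero-copy transfers remove precisely that overhead.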
Process-Parallel and Distributed Scientific Computing
Scientific computing (or computational science) uses computational power
to analyze and solve complex problems in science and engineering.
Parallel computing is the engine that makes it possible to tackle problems
of realistic size and complexity.
Domain Decomposition
This is the most common strategy for parallelizing scientific problems that
are based on a physical space or domain.
1. Decomposition: The physical domain of the problem (e.g., a car
body for crash simulation, a volume of air for weather forecasting) is
divided into smaller sub-domains.
2. Distribution: Each sub-domain is assigned to a different processor.
3. Computation: Each processor performs the core computation on its
local sub-domain.
4. Communication: Processors communicate data at the boundaries
of their sub-domains. These boundary regions are often called halos
or ghost cells. A processor needs data from its neighbor's halo
region to correctly compute values at the edge of its own sub-
domain.
This "compute-communicate" cycle is the fundamental pattern in many
synchronous parallel scientific algorithms.
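The sketch below, assuming MPI, shows this pattern for a 1-D decomposition: each rank keeps two ghost cells and swaps boundary values with its neighbours using MPI_Sendrecv. Array sizes and data values are illustrative.

```c
/* 1-D domain decomposition with a halo (ghost cell) exchange. Compile with mpicc. */
#include <mpi.h>
#include <stdio.h>

#define NLOCAL 8   /* interior cells per rank (assumed size) */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* u[0] and u[NLOCAL+1] are ghost cells holding the neighbours' boundary values. */
    double u[NLOCAL + 2];
    for (int i = 1; i <= NLOCAL; i++) u[i] = rank;   /* dummy local data */
    u[0] = u[NLOCAL + 1] = -1.0;                     /* unset ghosts */

    int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    /* Halo exchange: send my leftmost cell left and receive my right ghost
     * from the right neighbour, then the mirror-image exchange. */
    MPI_Sendrecv(&u[1],          1, MPI_DOUBLE, left,  0,
                 &u[NLOCAL + 1], 1, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&u[NLOCAL],     1, MPI_DOUBLE, right, 1,
                 &u[0],          1, MPI_DOUBLE, left,  1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* A compute step would now update u[1..NLOCAL] using the freshly received ghosts. */
    printf("rank %d: left ghost=%.0f, right ghost=%.0f\n", rank, u[0], u[NLOCAL + 1]);
    MPI_Finalize();
    return 0;
}
```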
Common Computational Paradigms
Embarrassingly Parallel: Problems where tasks are completely
independent of each other and require little to no communication.
These are the easiest to parallelize and scale almost perfectly.
o Examples: Monte Carlo simulations, rendering different
frames of a movie, parameter sweeps (running the same
simulation with many different input parameters).
Synchronous Parallelism: Processors perform computation on
their local data in parallel and then periodically coordinate through a
global synchronization/communication step (e.g., using a barrier or
collective operation like MPI_Allreduce). Most iterative solvers fall
into this category.
Asynchronous Parallelism: Processors work and communicate
independently without global synchronization points. This can hide
latency and be more efficient but is significantly harder to design,
implement, and debug.
Case Studies in Scientific Computing
1. N-Body Simulation
Problem: Simulating the motion of a set of particles that interact
with each other, such as stars in a galaxy (gravitational forces) or
atoms in a molecule (electrostatic forces).
Parallelization:
o All-Pairs Method (O(N^2)): In the simplest approach, the
force on each particle is calculated by summing the forces
from all other N-1 particles. This is parallelized by distributing
the particles among the processors. Each processor computes
the forces for its subset of particles, which requires
communicating to get the positions of all other particles.
o Hierarchical Methods (e.g., Barnes-Hut): For large N, the
O(N^2) complexity is too high. Tree-based algorithms like
Barnes-Hut reduce the complexity to O(N log N). They group
distant particles together and treat them as a single larger
particle, reducing the number of force calculations.
Parallelizing these algorithms is more complex, involving
parallel construction of the tree and traversing it.
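A minimal sketch of the all-pairs approach above, assuming MPI: positions are gathered with MPI_Allgather (the communication step), and each rank then computes forces only for the particles it owns (the O(N²/P) computation step). The 1-D positions and toy force law are simplifications made for illustration.

```c
/* All-pairs N-body force sketch. Compile with: mpicc nbody.c -lm */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define N 16   /* total particles (assumed divisible by the number of ranks) */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int nlocal = N / size;

    double *local_x = malloc(nlocal * sizeof(double));
    double *all_x   = malloc(N * sizeof(double));
    for (int i = 0; i < nlocal; i++)
        local_x[i] = rank * nlocal + i;          /* dummy positions */

    /* Communication: every rank needs every position for the O(N^2) sum. */
    MPI_Allgather(local_x, nlocal, MPI_DOUBLE, all_x, nlocal, MPI_DOUBLE,
                  MPI_COMM_WORLD);

    /* Computation: O(N^2 / P) work per rank. */
    for (int i = 0; i < nlocal; i++) {
        double xi = local_x[i], force = 0.0;
        for (int j = 0; j < N; j++) {
            double dx = all_x[j] - xi;
            if (fabs(dx) > 1e-12)
                force += dx / (dx * dx * fabs(dx));   /* toy ~1/r^2 attraction */
        }
        printf("rank %d particle %d force %.4f\n", rank, i, force);
    }
    free(local_x); free(all_x);
    MPI_Finalize();
    return 0;
}
```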
2. Finite Element Method (FEM)
Problem: A numerical technique for solving differential equations
that describe physical phenomena like structural stress on a bridge,
heat distribution in an engine block, or fluid flow. It works by
meshing a continuous object into a large number of discrete "finite
elements."
Parallelization:
1. Parallel Meshing: The domain decomposition strategy is
applied to the geometry, and each processor generates the
mesh for its sub-domain.
2. Parallel Solver: The FEM formulation results in a very large
system of linear equations (Ax=b). This system is solved
iteratively. In each iteration, a processor updates the values
for its local nodes based on the values of its neighbors. This
requires a halo exchange communication step at the end of
each iteration to update the boundary data before the next
iteration begins.
3. Computational Fluid Dynamics (CFD)
Problem: Simulating the behavior of fluid flows by numerically
solving the Navier-Stokes equations. Applications include designing
aircraft wings, predicting weather, and modeling blood flow.
Parallelization: The strategy is very similar to FEM. The fluid
domain is discretized into a grid (the set of sub-domains). Each
processor is assigned a part of the grid and solves the Navier-Stokes
equations for the cells in its sub-grid. Like FEM, this is an iterative
process requiring a synchronous, nearest-neighbor halo exchange in
each step to communicate boundary conditions (e.g., pressure,
velocity) between adjacent sub-grids.
Here’s a consolidated set of full notes covering High-Performance
Computing in Molecular Sciences, Communication & Multimedia
Applications for Parallel/Distributed Systems, and Distributed File
Systems:
UNIT - V FULL NOTES
1. High-Performance Computing in Molecular Sciences
Overview
High-Performance Computing (HPC) enables molecular scientists
to simulate and analyze complex molecular systems with high
accuracy and speed.
Key Applications
Molecular Dynamics Simulations: Simulate atomic-level
behavior of molecules over time.
Quantum Chemistry: Solving Schrödinger's equation for
chemical systems.
Protein Folding: Understanding how proteins assume their
functional shapes.
Drug Discovery: Modeling drug-receptor interactions to
accelerate pharmaceutical development.
HPC Techniques Used
Parallel Computing: Distributing computation across
CPUs/GPUs.
Message Passing Interface (MPI): For communication in
distributed memory systems.
OpenMP: Shared memory parallelism.
GPU Acceleration: Utilizing CUDA/OpenCL for faster
computation.
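As a very small illustration of how these techniques appear in molecular codes, the sketch below parallelizes a toy Lennard-Jones force loop over particles with OpenMP. Real packages such as GROMACS add neighbour lists, cut-offs, MPI domain decomposition, and GPU offload; all parameters here are assumptions.

```c
/* Toy MD kernel: Lennard-Jones pair forces, parallelized over particles.
 * Compile with: gcc -fopenmp lj.c -lm */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>

int main(void) {
    const int n = 1000;
    const double eps = 1.0, sigma = 1.0;     /* assumed LJ parameters */
    double *x = malloc(n * sizeof(double));
    double *f = malloc(n * sizeof(double));
    for (int i = 0; i < n; i++) { x[i] = 1.1 * i; f[i] = 0.0; }

    /* Each thread computes the total force on a disjoint set of particles. */
    #pragma omp parallel for schedule(dynamic, 32)
    for (int i = 0; i < n; i++) {
        double fi = 0.0;
        for (int j = 0; j < n; j++) {
            if (j == i) continue;
            double r = x[i] - x[j];
            double sr6 = pow(sigma / fabs(r), 6);
            /* Magnitude of -dU/dr for U(r) = 4*eps*((sigma/r)^12 - (sigma/r)^6). */
            double mag = 24.0 * eps * (2.0 * sr6 * sr6 - sr6) / fabs(r);
            fi += (r > 0 ? mag : -mag);
        }
        f[i] = fi;
    }
    printf("f[0] = %g, f[n/2] = %g\n", f[0], f[n / 2]);
    free(x); free(f);
    return 0;
}
```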
Tools & Software
GROMACS, NAMD, AMBER – Molecular dynamics.
Gaussian, ORCA – Quantum chemistry simulations.
2. Communication & Multimedia Applications for Parallel and Distributed Systems
Definition
Multimedia communication includes the transmission of audio,
video, and text across networked systems, often requiring real-
time or near-real-time performance.
Challenges
Bandwidth Management
Latency and Jitter
Synchronization of audio/video streams
Fault Tolerance and Scalability
Parallel and Distributed Computing Role
Real-Time Video Processing: Encoding/decoding across
distributed nodes.
Load Balancing: Distributing tasks like streaming, rendering,
etc.
Data Replication: Ensures fault tolerance and fast access.
Common Frameworks and Tools
MPI & OpenMP: Used to parallelize media processing
algorithms.
MapReduce: For processing large multimedia datasets.
Streaming Platforms: Hadoop-based systems for large-scale
video analysis.
Use Cases
Video Conferencing Systems (Zoom, MS Teams)
Content Delivery Networks (CDNs): Efficient delivery of
large-scale media.
Distributed Rendering: In animation and visual effects.
3. Distributed File Systems (DFS)
Definition
A Distributed File System allows access to files from multiple
hosts sharing via a network. Files are stored across multiple
nodes but appear as a single logical system to users.
Objectives
Transparency: Location, access, replication, concurrency.
Fault Tolerance: Ability to recover from failures.
Scalability: Handling increasing data volumes.
Types
Google File System (GFS)
Hadoop Distributed File System (HDFS)
Network File System (NFS)
Andrew File System (AFS)
HDFS – Key Features
Master-Slave Architecture: NameNode (metadata) &
DataNodes (data blocks).
Large Block Size: Default 128MB or more.
Replication: Default factor is 3 for fault tolerance.
Write-Once-Read-Many: Optimized for large streaming
reads.
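A quick worked example using the defaults listed above (file size assumed):

```latex
\frac{1024\ \text{MB file}}{128\ \text{MB/block}} = 8\ \text{blocks}, \qquad
8 \times 3\ (\text{replication}) = 24\ \text{block replicas}, \qquad
\text{raw storage} = 3 \times 1\ \text{GB} = 3\ \text{GB}
```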
Benefits of DFS
High availability.
Data locality awareness.
Efficient for Big Data and parallel processing.
Applications
Big Data Analytics
Cloud Storage Services
Multimedia Streaming
Scientific Data Management
Conclusion
UNIT V bridges the advanced concepts of high-performance and
distributed computing in various scientific and multimedia
contexts. From molecular modeling to managing massive data
with distributed file systems, these technologies form the
backbone of modern computational science and communication
systems.
1. High-Performance Computing (HPC) in Molecular Sciences 🧬
Applications & Models
HPC accelerates molecular dynamics (MD), quantum chemistry, and
protein modeling.
Modern MD packages like GROMACS use:
o MPI between nodes
o OpenMP or threading within nodes
o GPU offloading + SIMD instructions
Achieves massive speedups via fine-grained parallelism and
neighbor-search optimizations.
Levels of Parallelism
Inter-node: Cluster-based MPI communication.
Intra-node: Multithreading with OpenMP.
Hardware acceleration: SIMD vectorization, GPU usage.
Performance Aspects
Scalability: both strong and weak scaling.
Efficiency maximized by reducing communication latency, ensuring
load balance, and minimizing I/O bottlenecks.
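The standard definitions behind these terms (not written out in the notes above) are:

```latex
% P = number of processors, s = serial fraction of the program
\text{Speedup: } S(P) = \frac{T_1}{T_P}, \qquad
\text{Efficiency: } E(P) = \frac{S(P)}{P}
% Amdahl's law (strong scaling, fixed problem size):
S(P) \le \frac{1}{\,s + \frac{1 - s}{P}\,}
% Gustafson's law (weak scaling, problem size grows with P):
S(P) = P - s\,(P - 1)
```

Strong scaling is governed by Amdahl's law (fixed problem size), weak scaling by Gustafson's law (problem size grows with the processor count).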
2. Communication & Multimedia Applications for Parallel &
Distributed Systems
Programming Models
MPI: Standard for distributed-memory communication (point-to-
point and collective, up to MPI-4.1)
RPC/RMI: Enable distributed object interactions, especially in Java-
based multimedia apps.
Multimedia Applications
Parallel processing of audio/video codecs (like H.264), real-time
streaming, VR rendering.
Use pipeline parallelism: separate filter stages (e.g., capture →
encode → network) with worker threads.
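A minimal two-stage pipeline sketch (capture → encode) with a bounded queue between pthreads; frames are plain integers and every name is an assumption made for the example.

```c
/* Two-stage pipeline with a bounded queue. Compile with: gcc -pthread pipe.c */
#include <pthread.h>
#include <stdio.h>

#define QSIZE   4
#define NFRAMES 10

static int queue[QSIZE];
static int head = 0, tail = 0, count = 0, done = 0;
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;

/* Stage 1: "capture" frames and hand them to the next stage. */
static void *capture(void *arg) {
    (void)arg;
    for (int f = 0; f < NFRAMES; f++) {
        pthread_mutex_lock(&m);
        while (count == QSIZE) pthread_cond_wait(&not_full, &m);
        queue[tail] = f; tail = (tail + 1) % QSIZE; count++;
        pthread_cond_signal(&not_empty);
        pthread_mutex_unlock(&m);
    }
    pthread_mutex_lock(&m);
    done = 1;
    pthread_cond_broadcast(&not_empty);
    pthread_mutex_unlock(&m);
    return NULL;
}

/* Stage 2: "encode" frames as they arrive, overlapping with capture. */
static void *encode(void *arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&m);
        while (count == 0 && !done) pthread_cond_wait(&not_empty, &m);
        if (count == 0 && done) { pthread_mutex_unlock(&m); break; }
        int f = queue[head]; head = (head + 1) % QSIZE; count--;
        pthread_cond_signal(&not_full);
        pthread_mutex_unlock(&m);
        printf("encoded frame %d\n", f);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, capture, NULL);
    pthread_create(&t2, NULL, encode, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```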
Communication Layers & Protocols
Transport and upper layers (OSI 4–7): TCP for reliable delivery vs. UDP for real-time streams.
Strategies for latency hiding, jitter smoothing, buffering for QoS.
Multimedia apps often combine threaded shared-memory (e.g.,
OpenMP) + MPI for inter-node syncing.
3. Distributed File Systems (DFS)
Goals & Architectures
Provide scalable, reliable, high-throughput storage across
nodes.
Data distribution across servers; supports concurrency, caching,
replication, fault tolerance.
Key Examples
| File System | Description |
|---|---|
| Lustre | Widely used in HPC, powers many TOP500 systems |
| BeeGFS | Highly scalable parallel filesystem, latest release Mar 2025 |
| GPFS / IBM Spectrum Scale | Shared-disk/shared-nothing, used on the Summit supercomputer |
| OrangeFS | Evolved from PVFS; open-source HPC file system |
| HDFS | Hadoop's DFS, optimized for big data, not POSIX-compliant |
Design Considerations
Metadata vs. data separation: Metadata servers manage
namespace, chunk servers handle blocks.
Replication & consistency: replication across nodes ensures high availability & resilience.
Fault tolerance through redundancy and failover mechanisms
(e.g., HDFS NameNode).
Performance trade-offs: POSIX compliance vs throughput; HDFS
relaxes POSIX for speed.
4. Communication in Distributed Systems
Message Passing vs Shared Memory
Shared-memory (e.g., OpenMP): easy data access but needs sync
and has scalability limits
Distributed-memory (e.g., MPI): explicit messaging, high
scalability, but more complex programming
Algorithms and System Traits
Concepts like latency hiding, remote memory access, and non-
blocking collectives reduce communication overhead
Synchronization tools: barriers, locks, logical clocks, election
algorithms.
5. Distributed File Systems in Multimedia & Molecular Workflows
These subsystems often rely on DFS to:
Store large datasets (MD trajectories, video streams).
Enable parallel I/O access: MPI-IO, HDF5, NetCDF.
Employ caching strategies to reduce network latency.
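A small MPI-IO sketch of the parallel I/O pattern mentioned above: every rank writes its block of data to a disjoint offset in one shared file using a collective call. The file name and sizes are assumptions.

```c
/* Parallel write to one shared file with MPI-IO. Compile with mpicc. */
#include <mpi.h>
#include <stdio.h>

#define NLOCAL 1024

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank; MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double buf[NLOCAL];
    for (int i = 0; i < NLOCAL; i++) buf[i] = rank + i * 1e-6;  /* dummy data */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "trajectory.bin",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank writes at an offset determined by its rank: no two ranks overlap. */
    MPI_Offset offset = (MPI_Offset)rank * NLOCAL * sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, NLOCAL, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    if (rank == 0) printf("collective write complete\n");
    MPI_Finalize();
    return 0;
}
```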
6. Key Tools & Frameworks
GROMACS – state-of-the-art MD with full hybrid parallel support.
MPI & OpenMP – main parallel programming paradigms.
Lustre, BeeGFS, GPFS, HDFS, OrangeFS – critical storage
systems.
7. Study Tips & Reference Map
1. Understand hardware hierarchy – CPU cores, caches,
interconnects – to optimize code.
2. Learn MPI first, then combine with OpenMP (hybrid parallelism).
3. Explore DFS architecture: metadata/data flow, caching, and
failure modes.
4. Hands-on: set up an MPI+OpenMP molecular simulation (e.g., with GROMACS) using a parallel filesystem like BeeGFS/Lustre.
5. Core textbooks: Pacheco’s Intro to Parallel Programming,
Tanenbaum & van Steen’s Distributed Systems, Hennessy &
Patterson’s Computer Architecture.