Parallel and Distributed
Algorithms
Course for Undergraduate Students in the 3rd year
(Major in Computer Science)
Instructor:
Mihai L. Mocanu, Ph.D., Professor
E-mail: mihai.mocanu@edu.ucv.ro
Office hours (by appointment only): Friday 13:00-14:00
Course: 2024-2025_ParaDis_3CE (Google Workspace)
What this course is about
Parallelism/ the ability to think in parallel:
Classically, algorithm designers assume a computer with only
one processing element
The algorithms they design are “sequential”, that is, all steps of
the algorithm must be performed in a particular order
But today’s computers have multiple processors, multiple cores etc.,
each of which can perform its own chain of execution
This holds even for basic desktop computers (commodity computers), which
have multicore processors or pipelined processing elements
Not to mention the vast clusters of computers, aggregated in
networks, grids, clouds etc.
It is also worth mentioning that real-world processes may need algorithmic
descriptions that are far from sequential
Concurrent Computing
Concurrency means multiple computations are happening at
the same time
Theoretically, computer programming has long been
interested in the concurrency concept
Pioneering work on Petri Nets in the 1960s was among the
earliest ideas, but others followed (CSP, CCS)
“While concurrent program execution had been considered
for years, the computer science of concurrency
began with Edsger Dijkstra’s seminal 1965 paper that
introduced the mutual exclusion problem” - Leslie Lamport (2015)
The history of concurrency is long and there is an enormous
amount of development, on many perspectives
Parallel Computing
Parallel computing is based on the idea of breaking down a
task into smaller parts that can be processed simultaneously
The idea of concurrency is connected to parallelism; still, the
two must not be confused: concurrency is the simultaneous
progress of separate processes, while parallelism is the
decomposition of one process into several sub-processes that
operate at the same time
In the actual creation and operation of a parallel computer,
architectural intricacies and the hardware characteristics of its
components inevitably make themselves felt
Also, the characteristics of parallel computing should be
incorporated in the development as well as assessment of
parallel algorithms
Distributed Computing
Thanks to the emergence of large network interconnections,
many of the resources required by programming no longer
remain on a single system
Initially, dispersed systems were built independently; at some point,
access to remote systems became a requirement
Distributed systems emerged as collections of autonomous
entities that work together to solve an issue that cannot be
solved separately
High performance is not the main purpose of employing a
distributed system; what is needed is the best performance attainable under its constraints
The user-perceived response time is critical
Network delay and access to common resources can
cause significant latencies in highly distributed systems,
which should be reduced
Why study P&D Algorithms?
Suppose we’re working with a commodity multiprocessor system
(a p-processor system – like almost all are today)
Almost inevitably, we’ll run into some complex problem that we
want to solve in less time
Let T represent the amount of time that a single processor takes to
solve this problem; if we divide the work, can we hope to develop
an algorithm that takes T/p time to reach a solution?
Why will this rarely be fully achievable in practice?
Depending not only on the problem but also on the solution
we adopted, the processors will almost always spend some
time coordinating their work, which will lead to a penalty
Example 1
Sometimes a near-optimal speedup is easy to achieve
Suppose that we have an array of integers, and we want to display
all the negative integers in the array
27 -113 45 11 84 -32 -55 18 -7 23
We can divide the array into equal-sized segments, one
segment for each processor, and each processor can display
all the negative integers in its segment
The sequential algorithm would take O(n) time
In our multiprocessor algorithm each processor handles an
array segment with at most n/p elements
So, processing the whole array takes O(n/p) time
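As a quick illustration (not from the course materials), here is a minimal C/OpenMP sketch of this idea: the parallel loop with a static schedule hands each thread a contiguous chunk of roughly n/p elements, and each thread prints the negatives it finds in its own chunk. The array contents are the example values above.

#include <stdio.h>
#include <omp.h>

int main(void) {
    int a[] = { 27, -113, 45, 11, 84, -32, -55, 18, -7, 23 };
    int n = sizeof a / sizeof a[0];

    /* schedule(static) splits the iterations into contiguous,
       roughly equal segments, one per thread */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++)
        if (a[i] < 0)
            printf("%d\n", a[i]);

    return 0;
}

(Compile with, e.g., gcc -fopenmp; the output order depends on thread scheduling.)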
Example 2
There are some problems where it is hard to figure out how to use
multiple processors to speed up the processing time
Let’s consider the problem of arbitrary-precision arithmetic*: we
are given two integer arrays representing very large numbers (where
a[0] contains the 1’s digit, a[1] contains the 10’s digit, and so on) and
we want to construct a new array containing the sum of the two numbers
We know a sequential algorithm to do this that takes O(n) time.
We can’t easily adapt this algorithm to use many processors,
though: In processing each segment of the arrays, we need to know
whether there is a carry from the preceding segments.
Thus, this problem, like many others, seems inherently sequential.
Still, there is an algorithm that can add two such big-integers in
O(n/p + log p) time (Q: slowdown with log p?)
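For reference, here is a sketch of the sequential O(n) addition in C (digit arrays as described above, with the 1’s digit in a[0]); the carry variable is exactly the information each segment would have to wait for, which is what makes the naive parallelization hard. The function name and the assumption that sum has room for n+1 digits are ours.

/* Sequential addition of two n-digit numbers stored as digit arrays
   (least-significant digit first); sum must have room for n+1 digits. */
void add_bignum(const int a[], const int b[], int sum[], int n) {
    int carry = 0;                  /* the carry chains the segments together */
    for (int i = 0; i < n; i++) {
        int s = a[i] + b[i] + carry;
        sum[i] = s % 10;
        carry  = s / 10;
    }
    sum[n] = carry;
}

(The O(n/p + log p) parallel algorithm mentioned above typically removes this chain by computing all carries with a parallel prefix, i.e. carry-lookahead, computation.)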
PDA Field of Study inside the CS Area
[Diagram: Parallel & Distributed Algorithms as a field of study at the heart of Computer Science, overlapping Algorithms & Data Structures, Architecture, Database and Information Retrieval, Operating Systems, Artificial Intelligence and Robotics, Software Methodology & Software Engineering, Human-Computer Interaction, Numerical and Symbolic Computation, Programming Languages, and the Social and Professional Context (after the ACM & IEEE curriculum recommendations)]
Course objectives
To provide you with an introduction and overview to the
computational aspects of parallel and distributed computing
To introduce several important parallel computing models
To study the typical models for distributed computing.
To make basic concepts of P & D computing both accessible and
ready for immediate use
Broadly understand P&D architectures
Become familiar with typical software/ programming approaches
Be able to quickly adapt to any programming environment
Study the main classes of P&D algorithms
Learn the jargon … understanding what people are talking about
Be able to apply this knowledge
Course objectives (cont.)
More specifically, with the aim of drastically flattening the learning
curve in a parallel and/or distributed environment:
Study the development methods of efficient P&D algorithms using:
Data partitioning and functional partitioning
Parallelism exploitation for different process topologies
Efficient overcoming of communication latencies,
Load balancing
Termination detection
Failure tolerance etc.
Learn the conceptual models for the specification and verification
of P&D algorithms
Understand the complexity models of P&D algorithms
Learning Outcomes
On successfully completing this course you should understand a
number of different models of P&D computing and the basic
techniques for designing algorithms in these models.
Development techniques for parallel algorithms - examples from
the main classes: prefix algorithms, search, sort, selection, matrix
processing, graph algorithms, lists, trees
Main development techniques of classes of distributed algorithms:
wave algorithms, mutual exclusion, topology establishment,
termination, fault tolerance, leader election, genetic algorithms
Textbooks and other references
Vipin Kumar, Ananth Grama, Anshul Gupta, George Karypis – Introduction to Parallel
Computing, Pearson Education, 2003 (2nd edition), or Benjamin/Cummings, 1994 (1st edition)
http://srmcse.weebly.com/uploads/8/9/0/9/8909020/introduction_to_parallel_computing_second_editi
on-ananth_grama..pdf , or https://docs.google.com/...
Behrooz Parhami - Introduction to Parallel Processing: Algorithms and Architectures,
Kluwer Academic Publ, 2002
Dan Grigoras – Parallel Computing. From Systems to Applications, Computer Libris
Agora, 2000, ISBN 973-97534-6-9
Mihai Mocanu – Algorithms and Languages for Parallel Processing, Publ. University of
Craiova, updated yearly for 20+ years on my home page
George Coulouris, Jean Dollimore, Tim Kindberg and Gordon Blair, Distributed Systems
Concepts and Design (5th ed.), Addison Wesley, 2011
https://repository.dinus.ac.id/docs/ajar/George-Coulouris-Distributed-Systems-Concepts-and-Design-5th-Edition.pdf
P. Pacheco, M. Malensek – An Introduction to Parallel Programming, Morgan Kaufmann, 2022
https://books.google.ro/books?id=rElkCwAAQBAJ&pg=PA460&lpg=PA460&dq=Peter+Pacheco,+M
alensek,+Matthew+-+An+Introduction+to+Parallel+Programming++pdf&source=...
… and many others
Working Resources
Laboratory and Projects:
1. Mihai Mocanu, Alexandru Patriciu – Parallel Computing in C for Unix and
Windows NT Networks, Publ. University of Craiova, 1998
2. Christopher H. Nevison et al. – Laboratories for Parallel Computing, Jones and
Bartlett, 1994
3. SPIN multi-threaded software verification tool, http://spinroot.com/spin/
whatispin.html
Other resources are on the web page
Topics
1. Parallel Programming Platforms & Parallel Models
logical and physical organization
interconnection networks for parallel machines
communication costs in parallel machines
graph models: process - processor mappings, graph embeddings
PRAMs. Complexity measures
Why?
It is better to be aware of the physical and economic constraints and tradeoffs
of the parallel system you are designing for, so you are not sorry later.
Topics (continued)
2. Quick Introduction to Virtual Computing Environments
heterogeneous platforms: understand and set up working environment
semantics and syntax of basic communication operations
PVM (Parallel Virtual Machine) vs MPI (Message Passing Interface)
compiling and running PVM or MPI programs
other parallel execution models/ programming interfaces
pthreads (low-level) vs. OpenMP (high-level)
CPU/ GPU/ DSP programming: CUDA vs. OpenCL
etc.
Why?
You can start to program simple parallel programs early on.
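To give a flavour of what compiling and running an MPI program looks like, here is a generic minimal example in C (not taken from the lab materials):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);                 /* start the MPI runtime     */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's ID         */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */
    printf("Hello from process %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}

A typical build-and-run sequence (details depend on the installed MPI implementation) is: mpicc hello.c -o hello, then mpirun -np 4 ./hello.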
Topics (cont.)
3. Principles of Parallel Algorithm Design
basic PRAM techniques: doubling technique
summation trees and prefix summation
decomposition techniques
load balancing
techniques for reducing communication overhead
parallel algorithm models
Why?
These are fundamental issues that appear/apply to every parallel program.
You really should learn this stuff by heart.
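To make the “doubling technique” and “prefix summation” items above concrete, here is a sequential C simulation of the PRAM doubling (Hillis–Steele) prefix sum; each outer iteration corresponds to one parallel step in which all processors would act simultaneously, giving O(log n) parallel steps. The example data are arbitrary.

#include <stdio.h>

/* In-place inclusive prefix sum by doubling: after the round with offset d,
   x[i] holds the sum of the last min(2d, i+1) original elements ending at i;
   after the last round it holds the sum of x[0..i]. On a PRAM the inner
   loop is a single parallel step. */
void prefix_sum(int x[], int n) {
    for (int d = 1; d < n; d *= 2)
        for (int i = n - 1; i >= d; i--)   /* downward order mimics reading old values */
            x[i] += x[i - d];
}

int main(void) {
    int x[] = { 3, 1, 4, 1, 5, 9, 2, 6 };
    int n = sizeof x / sizeof x[0];
    prefix_sum(x, n);
    for (int i = 0; i < n; i++) printf("%d ", x[i]);  /* prints 3 4 8 9 14 23 25 31 */
    printf("\n");
    return 0;
}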
Topics (cont.)
4. Implementation & Cost of Basic Communication Operations
broadcast, reduction, scatter, gather, parallel prefix, …
interconnection networks: graph models of networks
network properties
Why?
These are fundamental primitives you will often use, and you should know
them well: not only what they do, but also how much they cost and when and
how to use them.
Going through the details of their implementation allows us to see how the principles
from the previous topic are applied to relatively simple problems.
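A generic sketch (not from the lab materials) of two of the primitives listed above, using MPI collectives: MPI_Bcast distributes a value from process 0 to all processes, and MPI_Reduce combines per-process values back at process 0 with a sum.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, size, x = 0, local, global;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) x = 42;                         /* value initially known only to the root */
    MPI_Bcast(&x, 1, MPI_INT, 0, MPI_COMM_WORLD);  /* broadcast: now every process has x     */

    local = rank * x;                              /* some per-process contribution          */
    MPI_Reduce(&local, &global, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);   /* reduction      */
    if (rank == 0) printf("sum = %d\n", global);

    MPI_Finalize();
    return 0;
}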
Topics (cont.)
5. Analytical Modeling of Parallel Programs
sources of overhead
execution time, speedup, efficiency, cost, Amdahl's law
Why?
Parallel programming is done to increase performance.
Debugging and profiling are extremely difficult in a parallel setting, so it is better
to understand from the beginning what performance to expect from a given
parallel program and, more generally, how to design parallel programs with
low execution time. It is also important to know the limits of what can be done
and what not.
Topics (cont.)
6. Parallel Dense Matrix Algorithms
matrix vector multiplication
matrix matrix multiplication
solving systems of linear equations
7. Parallel Graph Algorithms
minimum spanning tree
single-source shortest paths
all-pairs shortest paths
connected components
algorithms for sparse graphs
Why?
Classical problems with lots of applications, many interesting and useful
techniques exposed.
Topics (cont.)
8. Parallel Sorting
odd-even transpositions sort
sorting networks, bitonic sort
parallel quicksort, bucket and sample sort
9. Search Algorithms for Discrete Optimization Problems
search overhead factor, speedup anomalies
parallel depth-first search
sorting and searching on PRAMs
10. Geometric Algorithms
convex hulls, closest pair of points
Why?
As before, plus shows many examples of hard-to-parallelize problems.
Topics (cont.)
11. Concepts of distributed computation
termination detection
failure tolerance
distributed network topologies
12. Formal Models
state transitions and executions
asynchronous and synchronous models
causality constraints and the computation theorem
logical clocks (Lamport) and vector clocks (Mattern-Fidge)
Why?
It is important and useful to understand both their social importance (for the
Internet, the WWW, and small devices such as mobiles and sensors) and their
technical importance: inherent distribution improves scalability and reliability
Topics (cont.)
13. Distributed abstractions
event-based component model
specification of services
safety and liveness
node and channel failure models and links
timing assumptions
14. Failure detectors
classes of detectors
leader elections and reductions
quorums; byzantine quorums and leader election
Why?
Making use of abstractions, like timing assumptions and ways to encapsulate
them (failure detectors), proves to be very useful.
Topics (cont.)
15. Broadcast abstractions
(lazy/ eager/ uniform/ fail-silent) reliable broadcast
causal broadcast, vector-clock algorithms
performance of algorithms
16. Consistency
replicated shared memory
linearizable registers
linearizable single write multiple reader
multiple writers
17. Consensus
control-oriented vs. event-based, uniform consensus
Paxos and multi-Paxos consensus
Topics (cont.)
18. Reconfigurable Replicated State Machines
total order broadcasting
robust distributed networks
19. Random Walks
introduction to Markov processes
random walks (hitting time, cover time, (s,t)-connectivity)
20. Time and Clocks in Distributed Systems
time lease
clock drift and the complete algorithm
…
Grading (tentative)
20% practical laboratory assignments/ projects (L)
20% periodic evaluation through homeworks/exercises (H)
20% final test quiz (T)
40% final exam (E)
You have to get at least 50% on the final laboratory evaluation (L) in
order to be allowed to take the final exam in the next session.
You have to get at least 50% on the final exam (E) to pass, i.e. to
obtain a mark greater than 5. All the grades obtained contribute, with the
specified weights, to the final mark.
If you have problems with setting up your working environment and/or
running your programs, ask the TA for help/advice. He is there to help you
with that. Make use of his help, but do not abuse it for ordinary programming bugs
Assignments and evaluations
assignments from individual homeworks (may be a project) for a
total of 20 points
mostly programming in C, C++ with:
• threads (POSIX Threads and OpenMP)
• multiple processes, PVM, MPI, etc. implementing
(relatively) simple algorithms and load balancing techniques
• task-based and data-based parallelism (OpenCL)
• CUDA,
deep understanding of algorithms and performance models
continuous evaluation based on theoretical questions thrown in
to prepare you better for the final exam
Introduction
Parallel vs. Distributed Computing
Background. “Traditional” definitions. Code examples
Speedup. Amdahl’s law. Performance Measures
Parallel Platforms for High Performance Computing
The context and difficulties of parallel computing
HPC/ parallel computing evolution over 50 years
Grand challenge problems in demand for HPC (higher
computational speed)
A few core problems in Distributed Computing
Background
Parallel Computer: a computer with many processors that
are closely connected
frequently, all processors share the same memory
they also communicate by accessing this shared memory
Examples of parallel computers include the multicore
processors found in many computers today (even cheap
ones), as well as many graphics processing units (GPUs)
Parallel Computing: “using more than one computer, or a
computer with more than one processor, to solve a task ”
The parallel programming paradigm is not new: it has been
around for more than 50 years! Motives: faster computation,
larger amount of memory available, etc.
“... There is therefore nothing new in the idea of parallel
programming, but its application to computers. The author
cannot believe that there will be any insuperable difficulty in
extending it to computers. It is not to be expected that the
necessary programming techniques will be worked out
overnight. Much experimenting remains to be done. After all,
the techniques that are commonly used in programming today
were only won at the cost of considerable toil several years
ago. In fact the advent of parallel programming may do
something to revive the pioneering spirit in programming
which seems at the present to be degenerating into a rather
dull and routine occupation ...”
Gill, S. (1958), “Parallel Programming,” The Computer Journal, vol. 1, April 1958, pp. 2-10.
Background
Distributed Computer: one in which the processors are less
strongly connected –“a set of nodes connected by a network,
which appear to its users as a single coherent system”
A typical distributed system consists of many independent
computers, attached via network connections
Such an arrangement is called a cluster
In a distributed system, each processor has its own memory.
This precludes using shared memory for communicating
Processors instead communicate by sending messages
Although message passing is slower than shared memory, it
scales better for many processors, and it is cheaper
PARALLEL vs. Distributed Computing
A working definition may differentiate them by:
• Focus (relative): coarse, medium or fine grain
• Main goal: shorter running time!
• The processors contribute to a more efficient execution of a solution
to the same problem
• In parallel computing a problem involves lots of computations and
data (e.g. matrix multiplication, sorting), and, to attain efficiency,
communication must be kept to a minimum (optimal)
• In distributed systems the problems are different: often coordination
of resources is the most important (e.g. leader election, commit,
termination detection…)
Code example 1
For a shared-memory parallel computer, here is code intended
for finding the sum of all the elements in a long array.
Variables whose names begin with my_ are specific to each
processor; this might be implemented by storing these variables in
the individual processors’ registers
The code fragment assumes that a variable array has already been
set up with the numbers we want to add; there is a variable procs
that indicates how many processors our system has
In addition, we assume each processor has its own my_pid variable,
which stores that processor’s own processor ID, a unique number
between 0 and procs – 1
Parallel program
// 1. Determine where the processor’s segment is and add up numbers in the segment
count = array.length / procs;
my_start = my_pid * count;
my_total = array[my_start];
for (my_i = 1; my_i < count; my_i++)
    my_total += array[my_start + my_i];

// 2. Store subtotal into shared array, then add up the subtotals.
subtotals[my_pid] = my_total;          // denoted as line A in remarks below
my_total = subtotals[0];               // line B in remarks below
for (my_i = 1; my_i < procs; my_i++)
    my_total += subtotals[my_i];       // line C in remarks below

// 3. If array.length isn’t a multiple of procs, then total will exclude some
//    elements at the end of the array. Add these last elements in now.
for (my_i = procs * count; my_i < array.length; my_i++)
    my_total += array[my_i];
Analysis: computation
Here, we first divide the array into segments of length count, and
each processor adds up the elements within its segment, placing
that into its variable my_total.
We write this variable into shared memory in line A so that all
processors can read it; then we go through this shared array of
subtotals to find the total of the subtotals.
The last step is to take care of any numbers that may have been
excluded when we tried to divide the array into p equally-sized
segments.
Analysis: synchronization
Each processor must complete line A before any other processor
tries to use that saved value in line B or line C (discuss!)
One way of ensuring this is to build the computer so that all
processors share the same program counter as they step through
identical programs. Such an approach:
Allows all processors to execute line A simultaneously.
Though it works, it is quite rare because it can be difficult to
write a program so that all processors work identically
We often want different processors to perform different tasks
The more common approach is to allow each processor to execute
at its own pace, giving programmers the responsibility to include
code enforcing dependencies between processors’ work *
In our example above, we would add code between line A
and line B to enforce the restriction that all processors
complete line A before any proceed to line B and line C.
If we were using Java’s built-in features for supporting
such synchronization between threads, we could accomplish
this by introducing a new shared variable number_saved
whose value starts out at 0. The code following line A could
be as follows.
synchronized (subtotals) {
    number_saved++;
    if (number_saved == procs)
        subtotals.notifyAll();
    while (number_saved < procs) {
        try { subtotals.wait(); }
        catch (InterruptedException e) { }
    }
}
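In the C/pthreads setting used in the labs, the same “everyone must finish line A before anyone reads the subtotals” requirement is usually expressed with a barrier. A minimal sketch using the POSIX pthread_barrier_t (the array contents and sizes are made up; this is our illustration, not the course’s reference code):

#include <stdio.h>
#include <pthread.h>

#define PROCS 4
#define N     16

static int array[N] = {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16};
static int subtotals[PROCS];
static pthread_barrier_t barrier;      /* plays the role of the synchronization after line A */

static void *worker(void *arg) {
    int pid = (int)(long)arg;
    int count = N / PROCS, start = pid * count, total = 0;

    for (int i = 0; i < count; i++)    /* phase 1: add up own segment */
        total += array[start + i];
    subtotals[pid] = total;            /* "line A" */

    pthread_barrier_wait(&barrier);    /* no thread passes before all have stored a subtotal */

    total = 0;                         /* phase 2: "lines B and C" */
    for (int i = 0; i < PROCS; i++)
        total += subtotals[i];
    if (pid == 0) printf("total = %d\n", total);   /* 136 for this data */
    return NULL;
}

int main(void) {
    pthread_t t[PROCS];
    pthread_barrier_init(&barrier, NULL, PROCS);
    for (long p = 0; p < PROCS; p++)
        pthread_create(&t[p], NULL, worker, (void *)p);
    for (int p = 0; p < PROCS; p++)
        pthread_join(t[p], NULL);
    pthread_barrier_destroy(&barrier);
    return 0;
}

(Compile with, e.g., gcc -pthread.)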
Code example 2
From now on, we’ll be working with a message-passing system
implemented using the following two functions
void send(int dst_pid, int data) – sends a message containing the
integer data to the processor whose ID is dst_pid. Note that the
function’s return may be delayed until the receiving processor
requests to receive the data — though the message might instead
be buffered so that the function can return immediately
int receive(int src_pid) – waits until the processor whose ID is
src_pid sends a message and returns the integer in that message
(blocking receive). Some systems also support a non-blocking
receive, which returns immediately if the processor hasn’t yet
sent a message. Another variation is a receive that allows a
program to receive the first message sent by any processor*
Distributed code
On the same example, we assume that each processor already has
its segment of the array in its memory, called segment
The variable procs holds the number of processors in the system,
and pid holds the processor’s ID (a unique integer between 0 and
procs – 1, as before)
total = segment[0];
for (i = 1; i < segment.length; i++)
    total += segment[i];

if (pid > 0) {                       // each processor but 0 sends its total to processor 0
    send(0, total);
} else {                             // processor 0 adds all these totals up
    for (int k = 1; k < procs; k++)
        total += receive(k);
}
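The same pattern written with actual MPI calls, as a hedged sketch: the local segment is filled with made-up values here, and MPI_Send/MPI_Recv play the roles of the send/receive functions described above (a single MPI_Reduce call could replace the whole second phase):

#include <stdio.h>
#include <mpi.h>

#define SEG 4                              /* made-up segment length */

int main(int argc, char *argv[]) {
    int pid, procs, segment[SEG], total = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &pid);
    MPI_Comm_size(MPI_COMM_WORLD, &procs);

    for (int i = 0; i < SEG; i++)          /* fill the local segment (illustrative data) */
        segment[i] = pid * SEG + i;

    for (int i = 0; i < SEG; i++)          /* add up own segment */
        total += segment[i];

    if (pid > 0) {                         /* every processor but 0 sends its subtotal */
        MPI_Send(&total, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    } else {                               /* processor 0 receives and adds the subtotals */
        for (int k = 1; k < procs; k++) {
            int sub;
            MPI_Recv(&sub, 1, MPI_INT, k, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            total += sub;
        }
        printf("grand total = %d\n", total);
    }

    MPI_Finalize();
    return 0;
}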
Analysis
This code says that each processor should first add the elements
of its segment. Then each processor except processor 0 should
send its total to processor 0
Processor 0 waits to receive each of these messages in succession,
adding the total of that processor’s segment into its total. By the
end, processor 0 will have the total of all segments
In a large distributed system, this approach may be flawed since
inevitably there is a possibility that some processors would break,
often due to the failure of some equipment such as a hard disk or
power supply
We’ll ignore this issue here, but it is an important issue when
writing programs for large distributed systems in real life
Performance view
Parallel System: An optimized collection of processors, dedicated
to the execution of complex tasks; each processor executes in a
semi-independent manner a subtask and co-ordination may be
needed from time to time. The primary goal of parallel processing
is a significant increase in performance.
Distributed System: A collection of multiple autonomous
computers, communicating through a computer network, that
interact with each other in order to achieve a common goal.
Leslie Lamport: “A distributed system is one in which the failure
of a computer you didn't even know existed can render your own
computer unusable “
Remark. Parallel processing in distributed environments is not
only possible but also a cost-effective and attractive alternative
Speedup Factor
S(p) = (execution time using one processor, i.e. the best sequential algorithm) /
       (execution time using a multiprocessor with p processors) = t_s / t_p

The speedup factor can also be cast in terms of computational steps:

S(p) = (number of computational steps using one processor) /
       (number of parallel computational steps with p processors)
S(p) gives increase in speed by using “a multiprocessor”
Hints:
Use best sequential algorithm with single processor system
The underlying algorithm for the parallel implementation might be (and
usually is) different
Maximum Speedup
The maximum speedup is usually p with p processors (linear speedup)
If f is the fraction of the computation that is inherently serial, the speedup
factor is given by:

S(p) = t_s / (f·t_s + (1 − f)·t_s/p) = p / (1 + (p − 1)·f)

This equation is known as Amdahl’s law
Remark: It is possible, but unusual, to get superlinear speedup
(greater than p), due to specific reasons such as:
Extra memory in multiprocessor system
Nondeterministic algorithm
Maximum Speedup
[Figure: Amdahl’s law. (a) One processor: the total time t_s consists of a serial section f·t_s and parallelizable sections (1 − f)·t_s. (b) Multiple processors (p processors): the parallelizable part is divided among the processors, so t_p = f·t_s + (1 − f)·t_s/p. A further plot shows speedup against the number of processors.]
• Even with an infinite number of processors, the maximum speedup is limited to 1/f
• Ex: with only 5% of the computation being serial, the maximum speedup is 20
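The two claims above follow directly from Amdahl’s law; a short worked derivation (LaTeX notation):

S(p) = \frac{p}{1 + (p-1)f}, \qquad
\lim_{p \to \infty} S(p) = \frac{1}{f}, \qquad
f = 0.05 \;\Rightarrow\; S_{\max} = \frac{1}{0.05} = 20 .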
*
This is a perfect example of how NOT to multithread!
You are creating a new task for every call of a
recursive function
So, each task creates a new task, waits for that task to
finish and then adds the numbers from the result
Each thread has two jobs:
1 - to create a new thread
2 - to add two numbers
The overhead cost of creating each thread is going to
far outweigh the cost of adding two numbers together
Solutions:
1. Limit the number of threads
2. Start from the iterative algorithm instead
of the recursive one
Superlinear Speedup - Searching
[Figure: (a) Searching each sub-space sequentially: the sub-spaces, each taking t_s/p to search, are examined one after another; the solution is found after time x·t_s/p plus a further Δt spent inside the sub-space that holds it (x is indeterminate). (b) Searching each sub-space in parallel: all sub-spaces are searched simultaneously and the solution is found after time Δt.]

Speedup is given by:

S(p) = (x·t_s/p + Δt) / Δt

Worst case for the sequential search is when the solution is found in the last
sub-space searched; then the parallel version offers the greatest
benefit, i.e.

S(p) = (((p − 1)/p)·t_s + Δt) / Δt → ∞ as Δt tends to zero

Least advantage for the parallel version is when the solution is found in the
first sub-space of the sequential search, i.e.

S(p) = Δt / Δt = 1
Measures of Performance
Speedup is obviously a measure of performance (MoP), but what is really
relevant from the performance point of view?
• To computer scientists: speedup, execution time
• To applications people: size of problem, accuracy of solution etc.
Speedup of algorithm
= sequential execution time/execution time on p processors (with
the same data set).
Speedup on problem
= sequential execution time of best known sequential algorithm /
execution time on p processors.
A more honest measure of performance.
Avoids picking an easily parallelizable algorithm with poor
sequential execution time.
The Context of Parallel Processing
Facts:
• The explosive growth of digital computer architectures
• Therefore, the need for:
• a better understanding of concurrency
• user-friendliness, compactness and simplicity of code
• high performance computing (HPC) but low cost,
low power consumption a.o.
• HPC uni-processors are increasingly complex & expensive
• they have high power-consumption
• they may also be under-utilized - mainly due to the
lack of appropriate software.
Increasing performance of microprocessors
From 1986 to 2003, the performance of microprocessors
increased, on average, more than 50% per year [Henn, 2019*]
Since 2003, performance improvement has slowed to the point
that from 2015 to 2017, it increased at less than 4% per year
During the first period of rapid increase, software
developers could often simply wait for the next generation of
microprocessors to increase the performance of their applications
This difference between the two is dramatic: at 50% per year,
performance will increase by almost a factor of 60 in 10 years,
while at 4%, it will increase by about a factor of 1.5
* John Hennessy, David Patterson, Computer Architecture: A
Quantitative Approach. Morgan Kaufmann, 6th ed. (2019)
Possible trade-offs to achieve efficiency
What’s better?
The use of one or a small number of such complex processors,
OR
A moderate to very large number of simpler processors
The answer may seem simple, but there is a catch, forcing us first to
answer another question:
How “good” is communication between processors?
So:
When combined with a high-bandwidth, but logically simple,
inter-processor communication facility, the latter approach
may lead to significant increase in efficiency, not only at the
execution but also in earlier stages (i.e. in the design process)
The Difficulties of Parallel Processing
Two major problems have, over the years, prevented the
immediate and widespread adoption of such (moderately to)
massively parallel architectures:
the inter-processor communication bottleneck
the difficulty of algorithmic/ software development – this may
come at a very high cost
How were these problems overcome?
At very high clock rates, the link between the processor and
memory becomes very critical
- integrated processor/memory design optimization
- emergence of multiple-processor microchips
The emergence of standard programming and communication
models has removed some of the concerns with compatibility
and software design issues in parallel processing
Increasing performance of applications
There is a continuous demand for greater computational speed
from computer systems
As available speed grows, the number of problems that we can
seriously consider solving also increases
Remember: Computations must not only be completed,
but completed within a “reasonable” time period
Areas requiring great computational speed include
numerical modeling and simulation, scientific and
engineering problems, accurate medical imaging, Web
search, gaming, and many others
Grand Challenge Problems
Typically, they cannot be solved in a reasonable amount
of time with today’s computers. Obviously, an
execution time of 2 months is always unreasonable
Examples:
Global weather forecasting/ Climate modeling
Data analysis
Modeling large DNA structures/ Protein folding
Modeling motion of astronomical bodies
Pharmaceutical research for drugs and vaccines
Research for efficient clean energy sources, etc.
Global Weather Forecasting
Atmosphere modeled by dividing it into 3-dimensional cells
Computations in each cell are repeated many times to model
time passing
Suppose the whole global atmosphere is divided into cubic cells of
size 1 km, up to a height of 15 km - about 7.5 × 10^9 cells
(Earth's surface area = 510.1 million km^2 = 196.9 million sq. miles)
Suppose each calculation requires 200 floating-point operations;
then one time step takes 15 × 10^13 floating-point operations
To forecast the weather over 7 days using 10-minute intervals (about
10^3 time steps), a computer operating at 1 Gflops (10^9 flops) takes
15 × 10^7 s ≈ 250 days. To perform the calculation in under 30 min.
requires a computer operating at about 100 Tflops (10^14 flops)
Data Analysis
We generate and store every year a huge amount of data;
by some estimates, it doubles worldwide every 2 years
Most data is largely useless unless it's analyzed
Examples:
Knowing the sequence of nucleotides in DNA is, by
itself, of little use. Understanding how this sequence
affects our development and how it can cause disease
requires extensive analysis
In addition to genomics, huge quantities of data are
generated by particle colliders, such as the Large Hadron
Collider at CERN, astronomical research, and so on
Modeling Motion of Astronomical Bodies
Bodies are attracted to each others by gravitational forces
Movement of each body is predicted by calculating total
force on each body
With N bodies, there are N − 1 forces to calculate for each body, or
approx. N^2 calculations (N log_2 N for an efficient approximate
algorithm); after determining the new positions of the bodies, the
calculations are repeated
If a galaxy has, say, 10^11 stars, even if each
calculation is done in 1 ms (an extremely optimistic figure), it
takes 10^9 years for one iteration using the N^2 algorithm and
almost a year for one iteration using an efficient N log_2 N
approximate algorithm.
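To make the N^2 cost concrete, here is a sketch of the direct all-pairs force accumulation in C (the constants and problem size are illustrative only; softening and the time-integration step are omitted):

#include <math.h>

#define N 1024                          /* illustrative number of bodies */

typedef struct { double x, y, z, mass; } Body;

/* For each of the N bodies, sum the N-1 pairwise gravitational forces:
   about N*N force evaluations per iteration, which is the N^2 cost. */
void compute_forces(const Body b[N], double fx[N], double fy[N], double fz[N]) {
    const double G = 6.674e-11;
    for (int i = 0; i < N; i++) {
        fx[i] = fy[i] = fz[i] = 0.0;
        for (int j = 0; j < N; j++) {
            if (j == i) continue;
            double dx = b[j].x - b[i].x, dy = b[j].y - b[i].y, dz = b[j].z - b[i].z;
            double r2 = dx*dx + dy*dy + dz*dz;
            double f  = G * b[i].mass * b[j].mass / (r2 * sqrt(r2));
            fx[i] += f * dx;  fy[i] += f * dy;  fz[i] += f * dz;
        }
    }
}

(The N log N approximate methods, such as tree codes, avoid most of these pairwise evaluations by grouping distant bodies.)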
[Figure: Astrophysical N-body simulation – screen snapshot]
Pharmaceutical Research
There are many ways in which increased computational
power can be used in research into new drugs, vaccines
or medical treatments.
E.g., fighting COVID-19 required extensive research in
areas like bioinformatics, epidemiology, and molecular
modeling to understand the threat we’re facing and to
develop strategies to address it, but also…
…bringing together the Federal government, industry,
and academic leaders to provide access to the world’s
most powerful HPC resources in support of research
See https://covid19-hpc-consortium.org/ for details
Wide Range of Applications that really
depend now on HPC (Dongarra, 2017*)
• Airplane wing design
• Quantum chemistry
• Geophysical flows
• Noise reduction
• Diffusion of solid bodies in a liquid
• Computational materials research
• Weather forecasting
• Deep learning in neural networks
• Stochastic simulation
• Massively parallel data mining
…
Do we need powerful, HPC?
Yes, to solve much bigger problems much faster!
“Fine grain” and “coarse-grain” parallelism are different, i.e. the
latter is mainly applicable to long-running, scientific programs
Performance
- there are problems which can use any amount of
computing (i.e. simulation)
Capability
- to solve previously unsolvable problems (such as prime
number factorization): too big data sizes, real time
constraints
Capacity
- to handle a lot of processing much faster, perform more
precise computer simulations (e.g. weather prediction)
Why are HPC computers parallel?
From Transistors to FLOPS
By Moore’s law the number of transistors per unit area doubles every 18
months, so: how can we make better use of these transistors?
more execution units, graphical pipelines, processors etc.
But technology is not the only key, computer structure (architecture)
and organization are also important! (see next slide)
There are also inhibitors of parallelism, such as dependencies
The Data Communication Argument
Move computation towards data: for huge data, it is cheaper
The Memory/Disk Speed Argument
Parallel computers yield better memory system performance,
because of larger aggregate caches, higher aggregate bandwidth
How did parallel computers evolve?
Execution Speed
With a more than 10^2-fold increase in (floating-point) execution speed every
<10 years (more than the 2^6 increase suggested by Moore’s law)
Communication Technology
A factor which is critical to the performance of parallel
computing platforms
1985 – 1990: in spite of an average 20× increase in processor
performance, communication speed remained constant
Parallel Computing – How execution time decreased
[Figure: time (s) per floating-point instruction over the years]
Motto: “I think there is a world market for maybe five
computers” (Thomas Watson, IBM Chairman, 1943)
Towards Parallel Computing – The 5 ERAs
More on the historical perspective
Parallel computing has been here since the early days of computing.
Traditionally: custom HW, custom SW, high prices
The “doom” of Moore’s law:
- custom HW has a hard time catching up with commodity processors
Current trend: use commodity HW components, standardize SW
Parallelism sneaking into commodity computers:
• Instruction Level Parallelism - wide issue, pipelining, OOO (out-of-order execution)
• Data Level Parallelism – 3DNow, Altivec
• Thread Level Parallelism – Hyper-threading in Pentium IV
Transistor budgets allow for multiple processor cores on a chip.
More on historical perspective (cont.)
Most applications would benefit from being parallelized and
executed on a parallel computer.
• even PC applications, especially the most demanding ones – games,
multimedia
Chicken & Egg Problem:
1. Why build parallel computers when the applications are sequential?
2. Why parallelize applications when there are no parallel commodity
computers?
Answers:
1. What else to do with all those transistors?
2. Applications already are a bit parallel (wide issue, multimedia
instructions, hyper-threading), and this bit is growing.
Core Problems in Distributed Computing
• What types of problems are there?
• An example: Generals’ Problem
Two generals need to coordinate an attack
1. Must agree on time to attack
2. They’ll win only if they attack simultaneously
3. Communicate through messengers
4. Messengers may be killed on their way
Let’s try to solve it for generals g1 and g2
step 1: g1 sends time of attack to g2
Problem: how to ensure g2 received msg?
step 2 (Solution): let g2 ack receipt of msg
Problem: how to ensure g1 received the ack?
step 3 (Solution): let g1 ack the receipt of the ack…
…
This problem seems (is!) really impossible to solve!
Consensus
Two nodes need to agree on a value
Communicate by messages using an unreliable channel
Agreement is a core problem… agreeing on a number = consensus
The Consensus problem
All nodes propose a value
Some nodes might crash & stop responding
The algorithm must ensure:
All correct nodes eventually decide
Every node decides the same
Only decide on proposed values
Broadcast
Atomic Broadcast
A node broadcasts a message
If sender correct, all correct nodes deliver msg
All correct nodes deliver same messages
Messages delivered in the same order
Given Atomic broadcast
Can use it to solve Consensus
Every node broadcasts its proposal
Decide on the first received proposal
Messages received same order
All nodes will decide the same
Given Consensus
Can use it to solve Atomic broadcast
Atomic Broadcast equivalent to Consensus
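A toy C sketch of the reduction from Consensus to Atomic Broadcast described above. Everything here is our simplification: the totally ordered log simulates what an atomic-broadcast layer would provide, the node count and proposals are made up, and failures are ignored.

#include <stdio.h>

#define NODES 4

/* Simulated atomic-broadcast log: all nodes deliver the same messages
   in the same order (this total order is exactly what the atomic
   broadcast abstraction guarantees). */
static int log_vals[NODES];
static int log_len = 0;

static void abcast(int value) {        /* "atomically broadcast" a proposal */
    log_vals[log_len++] = value;
}

int main(void) {
    int proposal[NODES] = { 7, 3, 9, 3 };   /* hypothetical proposals */
    int decision[NODES];

    for (int n = 0; n < NODES; n++)    /* every node broadcasts its proposal */
        abcast(proposal[n]);

    for (int n = 0; n < NODES; n++)    /* every node decides on the first delivered proposal */
        decision[n] = log_vals[0];

    for (int n = 0; n < NODES; n++)    /* same order => same decision => agreement */
        printf("node %d decides %d\n", n, decision[n]);
    return 0;
}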