ADVANCED
PROGRAMMING
TECHNIQUES
VNU - UNIVERSITY of ENGINEERING & TECHNOLOGY
LECTURE 5: Introduction to Parallel Programming
CONTENTS
> Introduction
> Parallel computer models
> Parallel programming models
> Designing parallel programs
> Limits of parallel programming
Introduction
Serial computing:
▪ Problem = a series of instructions
▪ Executed sequentially on one processor
▪ One instruction executed at any moment
Parallel computing:
▪ Problem = a set of concurrent parts
▪ Parts = sets of series of instructions
▪ Instructions from each part run
concurrently on different processors
Concurrent vs. Parallel Programming
Concurrency enables pseudo-parallelism on a single CPU via rapid task switching
Parallelism enables simultaneous execution by distributing tasks across multiple CPUs
Common Types of Parallelism
> Parallel programming exploits a program's inherent parallelism to enable concurrent execution of its parallelizable components.
▪ Identifying and utilizing parallelism in programs can be challenging.
> Functional parallelism:
▪ Different functional tasks can be executed concurrently and independently.
▪ Analogy: a football team, where 11 players simultaneously and independently
execute their specific roles: attacking, defending, and goalkeeping.
> Data parallelism:
▪ Each task does the same work on unique & independent pieces of data (see the sketch below).
▪ Analogy: orchestra, where each musician is applying the same musical
instructions to their own specific part of the music.
> Pipeline parallelism:
▪ Multiple instructions are executed simultaneously, but at different stages.
▪ Analogy: assembly line in a factory.
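A minimal data-parallelism sketch (my example, not from the lecture), using OpenMP in C: the directive splits the iterations across threads, and each thread applies the same operation to its own independent slice of the array. Compile with -fopenmp.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    enum { N = 8 };
    double x[N] = {1, 2, 3, 4, 5, 6, 7, 8};

    /* Same work (scale by 2) on independent pieces of data */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        x[i] = 2.0 * x[i];

    for (int i = 0; i < N; i++)
        printf("%.1f ", x[i]);
    printf("\n");
    return 0;
}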
Why Parallel Programming?
> Increasing performance demands: solving larger problems faster.
> Leveraging multi-core processors and massively parallel systems (e.g. >11M cores on the El Capitan supercomputer).
Parallel Computers
> The definition of a “parallel computer” is not really precise.
▪ Almasi and Gottlieb [1989]: “a collection of processing elements that
communicate and cooperate to solve large problems fast”.
> Flynn’s taxonomy of computers [1972]
▪ Four classes by instruction and data streams: SISD, SIMD, MISD, MIMD.
[Figure: example instruction streams for each class]
> Other classifications
▪ Handler’s classification [1977]
▪ Structural classification, e.g. memory architectures
Parallel Computer’s Memory Architectures
> Includes shared, distributed, and hybrid memory.
▪ Shared memory computer: Uniform Memory Access (UMA) or Non-Uniform Memory Access (NUMA).
▪ Distributed memory computer.
▪ Hybrid shared-distributed memory computer.
MIMD Processing
> Distributed memory (a.k.a. loosely coupled multiprocessors)
▪ NO shared global memory address space.
▪ Real-life implementation: multicomputer network.
✓ Network-based multiprocessors.
▪ Usually programmed via message passing
✓ Explicit calls (send, receive) for communication
> Shared memory (a.k.a. tightly coupled multiprocessors)
▪ Shared global memory address space.
▪ Traditionally: symmetric multiprocessing (SMP).
▪ Real-life implementation: multi-core processors, multithreaded processors.
▪ Programming model similar to multitasking uniprocessor except
✓ Operations on shared data require synchronization
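A minimal sketch (not from the slides) of the synchronization requirement above: two Pthreads update one shared counter, and without the mutex some of the shared-memory updates could be lost.

#include <stdio.h>
#include <pthread.h>

long counter = 0;                                  /* Shared global data */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *work(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);                 /* Synchronize the update */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, work, NULL);
    pthread_create(&t2, NULL, work, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);            /* 2000000 with the lock */
    return 0;
}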
Common Parallel Programming Models
> Used as an abstraction above hardware & memory architectures.
▪ Any model can (theoretically) be implemented on any underlying hardware.
> Parallel programming models in common use:
▪ Shared memory (without threads)
✓ Implementations: e.g., SHMEM on distributed memory machines
▪ Threads
✓ Implementations: POSIX Threads, OpenMP.
▪ Message passing
✓ Implementation: MPI (Message Passing Interface)
▪ Hybrid
✓ Implementations: MPI+Pthreads, MPI+OpenMP
▪ Data parallel
✓ Implementations: Coarray Fortran, Unified Parallel C (UPC), Chapel.
▪ Others: SPMD, MPMD, etc.
Threads Programming Model
> A shared memory programming type
▪ A heavyweight process comprises multiple
concurrently executing paths (threads).
> Common implementations:
▪ A library of subroutines that are called from
within parallel source code.
▪ A set of compiler directives embedded in either serial or parallel source code (see the sketch below).
▪ In both cases, the programmer is responsible for determining parallelism.
> Main issues apart from load balancing:
▪ Synchronization and Performance overhead
▪ Scalability
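A minimal sketch of the directive-based style (my own example; it assumes OpenMP, listed above as a threads implementation): the directives are embedded in otherwise serial code, and it is the programmer, not the compiler, who decides that the two tasks may run concurrently. Compile with -fopenmp.

#include <stdio.h>
#include <omp.h>

void task_a(void) { printf("task A on thread %d\n", omp_get_thread_num()); }
void task_b(void) { printf("task B on thread %d\n", omp_get_thread_num()); }

int main(void)
{
    /* The programmer asserts, via directives, that A and B are independent */
    #pragma omp parallel sections
    {
        #pragma omp section
        task_a();
        #pragma omp section
        task_b();
    }
    return 0;
}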
Message Passing Programming Model
> Multiple tasks execute simultaneously on the same physical machine and/or across multiple networked machines.
▪ Each task uses its own local memory during computation.
▪ Tasks exchange data by sending and receiving messages.
> Implementations: usually comprise a library of subroutines.
▪ Calls to subroutines are embedded in the source code (see the MPI sketch below).
▪ The programmer is responsible for determining all parallelism.
> Main issues apart from partitioning and load balancing:
▪ Communication complexity and Performance overhead
▪ Scalability
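A minimal message-passing sketch (mine, not from the slides), using MPI in C: each task sees only its own local memory, and data moves between tasks through explicit send and receive calls. Compile with mpicc and run with, e.g., mpirun -np 2.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                       /* Lives in rank 0's local memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);    /* Now in rank 1's memory */
    }

    MPI_Finalize();
    return 0;
}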
Hybrid Programming Models
> MPI + threads (see the sketch below):
▪ Threads perform the computationally intensive kernels on each node's local data.
▪ MPI handles inter-node communication over the network.
> MPI + GPUs (CUDA):
▪ On-node GPUs run the computationally intensive kernels; CUDA handles data exchange between node-local memory and the GPUs.
▪ MPI executes tasks on the CPUs using local data and communicates with other nodes over the network.
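A minimal MPI+OpenMP hybrid sketch (my own example of the first combination above): OpenMP threads do the compute-intensive work on each rank's local share of the data, and MPI combines the per-rank results over the network. Compile with mpicc -fopenmp; the example assumes the number of ranks divides N evenly.

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char *argv[])
{
    int rank, size;
    long local = 0, global = 0;
    const long N = 1L << 20;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* On-node work: threads sum this rank's share of 0..N-1 */
    long chunk = N / size, start = rank * chunk, end = start + chunk;
    #pragma omp parallel for reduction(+:local)
    for (long i = start; i < end; i++)
        local += i;

    /* Inter-node communication: combine the per-rank partial sums */
    MPI_Reduce(&local, &global, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("total = %ld\n", global);

    MPI_Finalize();
    return 0;
}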
Parallel Program Design
> Automatic parallelization of serial programs: limited success.
▪ The best parallelization may be writing an entirely new parallel algorithm.
▪ There is no single, universal solution for designing parallel programs.
> Foster’s methodology (see the sketch after this list):
1. Partitioning: divide the computation to be performed and the data
operated on by the computation into small tasks.
2. Communication: determine what communication needs to be carried
out among the tasks identified in the previous step.
3. Agglomeration: combine tasks where feasible to reduce communication
overhead and improve efficiency, trading off some parallelism.
4. Mapping: assign tasks to processes/threads to balance workload.
✓ Static mapping: tasks assigned before execution.
✓ Dynamic mapping: tasks assigned during execution.
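A minimal sketch (my example, not Foster's own) annotating the four steps on a simple dot product, using OpenMP for brevity: the reduction expresses the communication, and the static schedule stands in for agglomeration and mapping. Compile with -fopenmp.

#include <stdio.h>
#include <omp.h>
#define N 1000000

int main(void)
{
    static double a[N], b[N];
    double dot = 0.0;
    for (long i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

    /* 1. Partitioning: one fine-grained task per product a[i]*b[i].
       2. Communication: partial results are combined into one sum (reduction).
       3. Agglomeration: schedule(static) groups elements into one chunk per
          thread to reduce overhead.
       4. Mapping: chunks are assigned to threads statically, before execution. */
    #pragma omp parallel for schedule(static) reduction(+:dot)
    for (long i = 0; i < N; i++)
        dot += a[i] * b[i];

    printf("dot = %.1f\n", dot);
    return 0;
}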
Partitioning
> Includes domain decomposition & functional decomposition
▪ Functional decomposition: the problem is decomposed according to the work that must be done; each task then performs a portion of the overall work.
▪ Domain decomposition: the data associated with the problem is decomposed; each parallel task then works on a portion of the data.
Performance Limits
> Amdahl’s law: potential speedup
of a parallel program is:
𝑆𝑝𝑒𝑒𝑑𝑢𝑝 = 1 / (𝑝/𝑁 + (1 − 𝑝))
▪ 𝑝: parallelizable fraction of the code
▪ 𝑁: number of processors
> Corollaries:
▪ Parallelism has diminishing returns
✓ more processors don’t always help
▪ Serial code portion is a serious bottleneck
✓ E.g., for 𝑝 = 0.9, 𝑠𝑝𝑒𝑒𝑑𝑢𝑝 < 10 no matter how many processors we use (see the numeric sketch below).
▪ The law does not consider other practical bottlenecks in the parallel portion, e.g.,
✓ Load imbalance
✓ Resource contention
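A quick numeric sketch (mine) of the corollaries: plugging 𝑝 = 0.9 into the formula above shows the speedup flattening out below 10 no matter how large 𝑁 becomes.

#include <stdio.h>

int main(void)
{
    double p = 0.9;                               /* Parallelizable fraction */
    int procs[] = {1, 2, 4, 8, 16, 1024, 1048576};
    for (int k = 0; k < 7; k++) {
        int N = procs[k];
        double speedup = 1.0 / (p / N + (1.0 - p));
        printf("N = %7d  speedup = %.3f\n", N, speedup);
    }
    return 0;
}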
Scalability
> A serial portion always exists in practical parallel programs:
▪ Synchronization operations cannot be parallelized
▪ Non-parallelizable loops, where each iteration depends on the previous one, e.g.,
for (i = 1; i < N; i++)
    A[i] = (A[i] + A[i-1]) / 2;
▪ Single thread prepares data and spawns parallel tasks
> Different perspective: as we increase the number of processors,
we can also increase the problem size proportionally.
▪ Gustafson’s law: 𝑆(𝑁) = 𝑁 − 𝑠(𝑁 − 1), where 𝑠 is the serial (non-parallelizable) fraction and 𝑆(𝑁) is the scaled speedup.
✓ E.g., with 𝑠 = 0.1 and 𝑁 = 100: 𝑆(𝑁) = 100 − 0.1 × 99 ≈ 90.1.
▪ Like Amdahl’s law, it recognizes the presence of a serial portion but argues
that its impact diminishes as the problem size increases.
▪ Parallel computing can achieve significant performance gains by tackling
problems that were intractable on serial machines due to their size.
Example: Parallel Summation psum_mutex
> Sum the numbers 0, …, 𝑛 − 1 in parallel:
▪ Partition the values [0, 𝑛 − 1] into 𝑡 ranges; each of 𝑡 threads processes one range.
▪ Simplest approach: threads sum into a global variable protected by a mutex.
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#define MAXTHREADS 16
long nelems_p_t;                    /* Elements summed per thread */
long gsum = 0;                      /* Global sum, shared by all threads */
pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;

void *psum(void *arg)
{
    long i, myid = *((long *)arg);
    long start = myid * nelems_p_t;
    long end = start + nelems_p_t;
    for (i = start; i < end; i++) {
        pthread_mutex_lock(&mtx);   /* One lock/unlock per addition */
        gsum += i;
        pthread_mutex_unlock(&mtx);
    }
    return NULL;
}

int main(int argc, char *argv[])
{
    long i, nelems = 1 << 30, myid[MAXTHREADS];
    pthread_t tid[MAXTHREADS];
    char *endptr;
    long nthreads = strtol(argv[1], &endptr, 10);
    nelems_p_t = nelems / nthreads;
    /* Create peer threads and wait for them to finish */
    for (i = 0; i < nthreads; i++) {
        myid[i] = i;
        pthread_create(&tid[i], NULL, psum, &myid[i]);
    }
    for (i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);
    printf("Total sum = %ld\n", gsum);
    return 0;
}
Example: Parallel Summation psum_array
> Eliminates need for mutex synchronization:
▪ Peer thread i sums into global array element psum[i].
▪ Main waits for threads to finish, then sums the elements of psum.
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#define MAXTHREADS 16
long nelems_p_t;                    /* Elements summed per thread */
long psum[MAXTHREADS] = {0};        /* Per-thread partial sums */

void *sum_array(void *arg)
{
    long i, myid = *((long *)arg);
    long start = myid * nelems_p_t;
    long end = start + nelems_p_t;
    for (i = start; i < end; i++)
        psum[myid] += i;            /* No lock needed: one writer per element */
    return NULL;
}

int main(int argc, char *argv[])
{
    long i, tsum = 0, nelems = 1 << 30, myid[MAXTHREADS];
    pthread_t tid[MAXTHREADS];
    char *endptr;
    long nthreads = strtol(argv[1], &endptr, 10);
    nelems_p_t = nelems / nthreads;
    /* Create peer threads and wait for them to finish */
    for (i = 0; i < nthreads; i++) {
        myid[i] = i;
        pthread_create(&tid[i], NULL, sum_array, &myid[i]);
    }
    for (i = 0; i < nthreads; i++) {
        pthread_join(tid[i], NULL);
        tsum += psum[i];
    }
    printf("Total sum = %ld\n", tsum);
    return 0;
}
Example: Parallel Summation psum_array (cont.)
> Performance: orders of magnitude faster than psum-mutex
Chart: Parallel Summation, elapsed seconds by threads (cores)
Threads (cores):  1(1)   2(2)   4(4)   8(8)   16(8)
psum-array (s):   5.36   4.24   2.54   1.64   0.94
Example: Parallel Summation psum_local
> Reduce memory references:
▪ Peer thread i sums into a local variable.
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#define MAXTHREADS 16
long nelems_p_t;                    /* Elements summed per thread */
long psum[MAXTHREADS] = {0};        /* Per-thread partial sums */

void *sum_local(void *arg)
{
    long i, myid = *((long *)arg);
    long start = myid * nelems_p_t;
    long end = start + nelems_p_t;
    long lsum = 0;                  /* Accumulate in a local variable */
    for (i = start; i < end; i++)
        lsum += i;
    psum[myid] = lsum;              /* One global write per thread */
    return NULL;
}

int main(int argc, char *argv[])
{
    long i, tsum = 0, nelems = 1 << 30, myid[MAXTHREADS];
    pthread_t tid[MAXTHREADS];
    char *endptr;
    long nthreads = strtol(argv[1], &endptr, 10);
    nelems_p_t = nelems / nthreads;
    /* Create peer threads and wait for them to finish */
    for (i = 0; i < nthreads; i++) {
        myid[i] = i;
        pthread_create(&tid[i], NULL, sum_local, &myid[i]);
    }
    for (i = 0; i < nthreads; i++) {
        pthread_join(tid[i], NULL);
        tsum += psum[i];
    }
    printf("Total sum = %ld\n", tsum);
    return 0;
}
Example: Parallel Summation psum_local (cont.)
> Performance: significantly faster than psum-array
Chart: Parallel Summation, elapsed seconds by threads (cores)
Threads (cores):  1(1)   2(2)   4(4)   8(8)   16(8)
psum-array (s):   5.36   4.24   2.54   1.64   0.94
psum-local (s):   1.98   1.14   0.60   0.32   0.33
NEXT LECTURE
[Flipped class] Parallel programming (cont.)
> Pre-class
▪ Study pre-class materials on Canvas
> In-class
▪ Reinforcement/enrichment discussion
> Post-class
▪ Homework
▪ Consultation (if needed)