
ADVANCED PROGRAMMING TECHNIQUES

VNU - UNIVERSITY of ENGINEERING & TECHNOLOGY


LECTURE 5: Introduction to Parallel Programming

CONTENTS

> Introduction
> Parallel computer models
> Parallel programming models
> Designing parallel programs
> Limits of parallel programming
Introduction

Serial computing:
▪ Problem = a series of instructions
▪ Executed sequentially on one processor
▪ One instruction executed at any moment

Parallel computing:
▪ Problem = a set of concurrent parts
▪ Each part = a series of instructions
▪ Instructions from each part run concurrently on different processors
Concurrent vs. Parallel Programming

Concurrency enables pseudo-parallelism on a single CPU via rapid task switching

Parallelism enables simultaneous execution by distributing tasks across multiple CPUs


Common Types of Parallelism
> Parallel programming: exploiting a program's inherent parallelism to enable concurrent execution of its parallelizable components.
▪ Identifying and utilizing parallelism in programs can be challenging.

> Functional parallelism:


▪ Different functional tasks can be executed concurrently and independently.
▪ Analogy: a football team, where 11 players simultaneously and independently execute their specific roles: attacking, defending, and goalkeeping.

> Data parallelism:


▪ Each task does the same work on a unique, independent piece of data (see the sketch after this list).
▪ Analogy: an orchestra, where each musician applies the same musical instructions to their own specific part of the music.

> Pipeline parallelism:


▪ Multiple instructions are executed simultaneously, but at different stages.
▪ Analogy: assembly line in a factory.
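
To make the data-parallel pattern concrete, here is a minimal Pthreads sketch (not from the lecture; the names scale_slice, N, and NT are illustrative): each thread applies the same operation to its own slice of an array, so no synchronization is needed.

/* Minimal data-parallelism sketch: NT threads, same work, disjoint data. */
#include <pthread.h>
#define N   1000000
#define NT  4

double x[N];

void *scale_slice(void *arg)
{
    long id = (long)arg;
    long chunk = N / NT;
    long start = id * chunk;
    long end   = (id == NT - 1) ? N : start + chunk;
    for (long i = start; i < end; i++)
        x[i] *= 2.0;                     /* same work, unique piece of data */
    return NULL;
}

int main(void)
{
    pthread_t tid[NT];
    for (long t = 0; t < NT; t++)
        pthread_create(&tid[t], NULL, scale_slice, (void *)t);
    for (long t = 0; t < NT; t++)
        pthread_join(tid[t], NULL);
    return 0;
}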
Why Parallel Programming?

> Increasing performance demands: solving larger problems faster.


> Leveraging multi-core processors (e.g., more than 11 million cores on El Capitan).
Parallel Computers
> The definition of a “parallel computer” is not really precise.
▪ Almasi and Gottlieb [1989]: “a collection of processing elements that
communicate and cooperate to solve large problems fast”.

> Flynn’s taxonomy of computers [1972]

[Figure: Flynn's taxonomy — SISD, SIMD, MISD, and MIMD, classified by the number of concurrent instruction streams and data streams.]

> Other classifications


▪ Handler’s classification [1977]
▪ Structural classification, e.g. memory architectures
Parallel Computer’s Memory Architectures
> Includes shared, distributed, and hybrid memory.
▪ Shared memory computer: either Uniform Memory Access (UMA) or Non-Uniform Memory Access (NUMA).
▪ Distributed memory computer.
▪ Hybrid shared-distributed memory computer.


MIMD Processing
> Distributed memory (a.k.a. loosely coupled multiprocessors)
▪ NO shared global memory address space.
▪ Real-life implementation: multicomputer network.
✓ Network-based multiprocessors.
▪ Usually programmed via message passing
✓ Explicit calls (send, receive) for communication

> Shared memory (a.k.a. tightly coupled multiprocessors)


▪ Shared global memory address space.
▪ Traditionally implemented as symmetric multiprocessing (SMP).
▪ Real-life implementation: multi-core processors, multithreaded processors.
▪ Programming model similar to multitasking uniprocessor except
✓ Operations on shared data require synchronization
Common Parallel Programming Models
> Used as an abstraction above hardware & memory architectures.
▪ Any model can (theoretically) be implemented on any underlying hardware.

> Parallel programming models in common use:


▪ Shared memory (without threads)
✓ Implementations: SHMEM on distributed memory machines
▪ Threads
✓ Implementations: POSIX Threads, OpenMP.
▪ Message passing
✓ Implementation: MPI (Message Passing Interface)
▪ Hybrid
✓ Implementations: MPI+Pthreads, MPI+OpenMP
▪ Data parallel
✓ Implementations: Coarray Fortran, Unified Parallel C (UPC), Chapel.
▪ Others: SPMD, MPMD, etc.
Threads Programming Model
> A shared-memory programming model:
▪ A heavyweight process comprises multiple concurrently executing paths (threads).

> Common implementations:
▪ A library of subroutines that are called from within parallel source code (e.g., Pthreads).
▪ A set of compiler directives embedded in either serial or parallel source code (e.g., OpenMP; see the sketch below).
▪ In both cases, the programmer is responsible for determining the parallelism.

> Main issues apart from load balancing:


▪ Synchronization and Performance overhead
▪ Scalability
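
As a hedged illustration of the compiler-directive approach above, the sketch below parallelizes a summation loop with an OpenMP pragma (assuming an OpenMP-capable compiler, e.g. gcc -fopenmp; the loop bound is arbitrary). The directive expresses the parallelism; thread creation, work splitting, and the final reduction are left to the compiler and runtime.

/* Compiler-directive sketch (OpenMP): the pragma splits the loop across threads;
 * reduction(+:sum) gives each thread a private copy merged at the end. */
#include <stdio.h>

int main(void)
{
    long n = 1L << 20, sum = 0;

    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; i++)
        sum += i;

    printf("sum = %ld\n", sum);
    return 0;
}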
Message Passing Programming Model

> Key characteristics:
▪ Multiple tasks execute simultaneously on the same physical machine and/or across multiple networked machines.
▪ Each task uses its own local memory during computation.
▪ Tasks exchange data through communications, by sending and receiving messages.

> Implementations usually comprise a library of subroutines (e.g., MPI; a sketch follows at the end of this section).
▪ Calls to the subroutines are embedded in source code.
▪ The programmer is responsible for determining all parallelism.

> Main issues apart from partitioning and load balancing:


▪ Communication complexity and Performance overhead
▪ Scalability
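
A minimal sketch of the message passing model (assuming an MPI installation; compile with mpicc and launch with, e.g., mpirun -np 2): rank 0 sends one value from its own local memory to rank 1, which receives it explicitly.

/* Explicit send/receive between two tasks with separate local memories. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank;
    long value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                       /* data lives in rank 0's local memory */
        MPI_Send(&value, 1, MPI_LONG, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_LONG, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %ld\n", value);
    }

    MPI_Finalize();
    return 0;
}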
Hybrid Programming Models

> Common combinations:
▪ Threads + MPI: threads perform the computationally intensive kernels on each node's local data, while MPI facilitates inter-node communication over the network (see the sketch below).
▪ GPU + MPI: on-node GPUs run the computationally intensive kernels, with CUDA handling data exchange between node-local memory and the GPUs, while MPI executes tasks on CPUs using local data and communicates with other nodes over the network.
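
A hedged sketch of the MPI+OpenMP hybrid listed among the implementations earlier: OpenMP threads compute a partial result on each node's local data, and MPI combines the partial results across nodes (the loop body and bounds are placeholders, not lecture code).

/* Hybrid sketch: OpenMP for on-node threading, MPI for inter-node communication. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, provided;
    long local = 0, global = 0;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* On-node: threads compute a partial result over local data. */
    #pragma omp parallel for reduction(+:local)
    for (long i = 0; i < 1000000; i++)
        local += i % 7;

    /* Off-node: ranks combine partial results over the network. */
    MPI_Reduce(&local, &global, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("global = %ld\n", global);

    MPI_Finalize();
    return 0;
}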
Parallel Program Design
> Automatic parallelization of serial programs: limited success.
▪ The best parallelization may be writing an entirely new parallel algorithm.
▪ There is no single, universal solution for designing parallel programs.

> Foster’s methodology:


1. Partitioning: divide the computation to be performed and the data
operated on by the computation into small tasks.
2. Communication: determine what communication needs to be carried
out among the tasks identified in the previous step.
3. Agglomeration: combine tasks where feasible to reduce communication
overhead and improve efficiency, trading off some parallelism.
4. Mapping: assign tasks to processes/threads to balance the workload.
✓ Static mapping: tasks assigned before execution.
✓ Dynamic mapping: tasks assigned during execution (see the sketch after this list).
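
To illustrate the difference, here is a small dynamic-mapping sketch (names such as worker, do_task, and NTASKS are illustrative, not from the lecture): instead of fixing the assignment up front, threads claim the next unassigned task from a shared counter at run time.

/* Dynamic mapping: threads repeatedly grab the next unassigned task. */
#include <stdio.h>
#include <pthread.h>

#define NTASKS 16

static long next_task = 0;                 /* index of the next unassigned task */
static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;

static void do_task(long t) { printf("task %ld\n", t); }

void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&mtx);
        long t = next_task++;              /* claim a task at run time */
        pthread_mutex_unlock(&mtx);
        if (t >= NTASKS) break;
        do_task(t);
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[4];
    for (int i = 0; i < 4; i++) pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(tid[i], NULL);
    return 0;
}

Static mapping would instead give each thread a fixed, precomputed set of tasks before any thread starts.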
Partitioning
> Includes domain decomposition & functional decomposition.
▪ Functional decomposition: the problem is decomposed according to the work that must be done; each task then performs a portion of the overall work.
▪ Domain decomposition: the data associated with the problem is decomposed; each parallel task then works on a portion of the data (an index-arithmetic sketch follows below).
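
A small index-arithmetic sketch of block domain decomposition (the helper block_range is illustrative, not part of the lecture code): n items are split as evenly as possible among p tasks, with the first n mod p tasks receiving one extra item.

/* Block domain decomposition: task `id` (0-based) gets items in [start, end). */
#include <stdio.h>

static void block_range(long n, long p, long id, long *start, long *end)
{
    long base = n / p, extra = n % p;          /* first `extra` tasks get one more item */
    *start = id * base + (id < extra ? id : extra);
    *end   = *start + base + (id < extra ? 1 : 0);
}

int main(void)
{
    long start, end;
    for (long id = 0; id < 4; id++) {
        block_range(10, 4, id, &start, &end);
        printf("task %ld: [%ld, %ld)\n", id, start, end);   /* sizes 3, 3, 2, 2 */
    }
    return 0;
}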
Performance Limits
> Amdahl's law: the potential speedup of a parallel program is

    Speedup = 1 / (p/N + (1 - p))

▪ p: parallelizable fraction of the code
▪ N: number of processors

> Corollaries:
▪ Parallelism has diminishing returns
✓ More processors don't always help.
▪ The serial code portion is a serious bottleneck
✓ E.g., for p = 0.9, speedup < 10 no matter how many processors we use (derived just below).
▪ The law does not consider other practical bottlenecks in the parallel portion, e.g.,
✓ Load imbalance
✓ Resource contention
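
The bound in the example above follows from letting N grow without limit in Amdahl's formula (a one-line derivation in LaTeX):

\lim_{N \to \infty} \frac{1}{\frac{p}{N} + (1 - p)} = \frac{1}{1 - p}
\qquad \Rightarrow \qquad
p = 0.9 \;\Rightarrow\; \text{Speedup} < \frac{1}{1 - 0.9} = 10 .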
Scalability
> A serial portion always exists in practical parallel programs:
▪ Synchronization operations cannot be parallelized
▪ Non-parallelizable loops, e.g.,
for (i = 1; i < N; i++)
    A[i] = (A[i] + A[i-1]) / 2;
▪ Single thread prepares data and spawns parallel tasks

> A different perspective: as we increase the number of processors, we can also increase the problem size proportionally.
▪ Gustafson's law: S(N) = N - (1 - p)(N - 1), where S(N) is the scaled speedup and p is again the parallelizable fraction (see the worked comparison below).
▪ Like Amdahl's law, it recognizes the presence of a serial portion but argues that its impact diminishes as the problem size increases.
▪ Parallel computing can achieve significant performance gains by tackling problems that were intractable on serial machines due to their size.
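
A worked comparison for p = 0.9 and N = 100, using the formulas above:

\text{Amdahl:}\quad \frac{1}{0.9/100 + 0.1} \approx 9.2 ,
\qquad
\text{Gustafson:}\quad S(100) = 100 - (1 - 0.9)(100 - 1) = 90.1 .

Amdahl fixes the problem size and asks how much faster it runs; Gustafson scales the problem with the machine and asks how much more work gets done in the same time.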
Example: Parallel Summation psum_mutex
> Sum numbers 0, … , 𝑛 − 1 in parallel:
▪ Partition values [1, 𝑛 − 1] into 𝑡 ranges, each of 𝑡 threads processes 1 range.
▪ Simplest approach: threads sum into a global variable protected by a mutex.
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

long nelems_p_t, gsum = 0;                    /* Global sum */
pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;

void *psum(void *arg)                         /* Peer thread routine */
{
    long i, myid = *((long *)arg);
    long start = myid * nelems_p_t + 1;
    long end = start + nelems_p_t;
    for (i = start; i < end; i++) {
        pthread_mutex_lock(&mtx);             /* Serialize every update to gsum */
        gsum += i;
        pthread_mutex_unlock(&mtx);
    }
    return NULL;
}

int main(int argc, char *argv[])
{
    long i, nelems = 1L << 30, myid[4];
    pthread_t tid[4];
    char *endptr;
    long nthreads = strtol(argv[1], &endptr, 10);
    nelems_p_t = nelems / nthreads;
    /* Create peer threads and wait for them to finish */
    for (i = 0; i < nthreads; i++) {
        myid[i] = i;
        pthread_create(&tid[i], NULL, psum, &myid[i]);
    }
    for (i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);
    printf("Total sum = %ld\n", gsum);
    return 0;
}
Example: Parallel Summation psum_array
> Eliminates need for mutex synchronization:
▪ Peer thread i sums into global array element psum[i].
▪ Main waits for threads to finish, then sums the elements of psum.
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

long nelems_p_t, psum[4] = {0};               /* Per-thread partial sums */

void *psum_thread(void *arg)                  /* Peer thread: sums into psum[myid] */
{
    long i, myid = *((long *)arg);
    long start = myid * nelems_p_t + 1;
    long end = start + nelems_p_t;
    for (i = start; i < end; i++)
        psum[myid] += i;                      /* No mutex: each thread owns psum[myid] */
    return NULL;
}

int main(int argc, char *argv[])
{
    long i, nelems = 1L << 30, myid[4], tsum = 0;
    pthread_t tid[4];
    char *endptr;
    long nthreads = strtol(argv[1], &endptr, 10);
    nelems_p_t = nelems / nthreads;
    /* Create peer threads and wait for them to finish */
    for (i = 0; i < nthreads; i++) {
        myid[i] = i;
        pthread_create(&tid[i], NULL, psum_thread, &myid[i]);
    }
    for (i = 0; i < nthreads; i++) {
        pthread_join(tid[i], NULL);
        tsum += psum[i];
    }
    printf("Total sum = %ld\n", tsum);
    return 0;
}
Example: Parallel Summation psum_array (cont.)
> Performance: orders of magnitude faster than psum-mutex
[Chart: Parallel Summation — elapsed seconds for psum-array]
Threads (cores):   1(1)   2(2)   4(4)   8(8)   16(8)
psum-array (s):    5.36   4.24   2.54   1.64   0.94
Example: Parallel Summation psum_local
> Reduce memory references:
▪ Peer thread i sums into a local variable.
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

long nelems_p_t, psum[4] = {0};               /* Per-thread partial sums */

void *psum_thread(void *arg)
{
    long i, lsum = 0;                         /* Local accumulator: no shared-memory traffic in the loop */
    long myid = *((long *)arg);
    long start = myid * nelems_p_t + 1;
    long end = start + nelems_p_t;
    for (i = start; i < end; i++)
        lsum += i;
    psum[myid] = lsum;                        /* One shared write per thread */
    return NULL;
}

int main(int argc, char *argv[])
{
    long i, nelems = 1L << 30, myid[4], tsum = 0;
    pthread_t tid[4];
    char *endptr;
    long nthreads = strtol(argv[1], &endptr, 10);
    nelems_p_t = nelems / nthreads;
    /* Create peer threads and wait for them to finish */
    for (i = 0; i < nthreads; i++) {
        myid[i] = i;
        pthread_create(&tid[i], NULL, psum_thread, &myid[i]);
    }
    for (i = 0; i < nthreads; i++) {
        pthread_join(tid[i], NULL);
        tsum += psum[i];
    }
    printf("Total sum = %ld\n", tsum);
    return 0;
}
Example: Parallel Summation psum_local (cont.)
> Performance: significantly faster than psum-array
[Chart: Parallel Summation — elapsed seconds for psum-array vs. psum-local]
Threads (cores):   1(1)   2(2)   4(4)   8(8)   16(8)
psum-array (s):    5.36   4.24   2.54   1.64   0.94
psum-local (s):    1.98   1.14   0.60   0.32   0.33
NEXT LECTURE

[Flipped class] Parallel programming (cont.)


> Pre-class
▪ Study pre-class materials on Canvas

> In class
▪ Reinforcement/enrichment discussion

> Post-class


▪ Homework
▪ Consultation (if needed)
