ADVANCED
PROGRAMMING
TECHNIQUES
VNU - UNIVERSITY of ENGINEERING & TECHNOLOGY
LECTURE 5: Introduction to Parallel Programming
CONTENTS
> Introduction
> Parallel computer models
> Parallel programming models
> Designing parallel programs
> Limits of parallel programming
Introduction
Serial computing:
▪ Problem = a series of instructions
▪ Executed sequentially on one processor
▪ One instruction executed at any moment
Parallel computing:
▪ Problem = a set of concurrent parts
▪ Parts = sets of series of instructions
▪ Instructions from each part run
concurrently on different processors
Concurrent vs. Parallel Programming
Concurrency enables pseudo-parallelism on a single CPU via rapid task switching
Parallelism enables simultaneous execution by distributing tasks across multiple CPUs
Common Types of Parallelism
> Parallel programming exploits a program's inherent parallelism to enable concurrent execution of its parallelizable components.
▪ Identifying and utilizing parallelism in programs can be challenging.
> Functional parallelism:
▪ Different functional tasks can be executed concurrently and independently.
▪ Analogy: a football team, where 11 players simultaneously and independently
execute their specific roles: attacking, defending, and goalkeeping.
> Data parallelism:
▪ Each task does the same work on unique & independent pieces of data (see the sketch below).
▪ Analogy: orchestra, where each musician is applying the same musical
instructions to their own specific part of the music.
> Pipeline parallelism:
▪ Multiple instructions are executed simultaneously, but at different stages.
▪ Analogy: assembly line in a factory.
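A minimal data-parallelism sketch (my example, not from the lecture), using OpenMP in C: the directive splits the iterations across threads, and each thread applies the same operation to its own independent slice of the array. Compile with -fopenmp.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    enum { N = 8 };
    double x[N] = {1, 2, 3, 4, 5, 6, 7, 8};

    /* Same work (scale by 2) on independent pieces of data */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        x[i] = 2.0 * x[i];

    for (int i = 0; i < N; i++)
        printf("%.1f ", x[i]);
    printf("\n");
    return 0;
}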
Why Parallel Programming?
> Increasing performance demands: solving larger problems faster.
> Leveraging multi-core processors and massively parallel systems (e.g. >11M cores on the El Capitan supercomputer).
Parallel Computers
> The definition of a “parallel computer” is not really precise.
▪ Almasi and Gottlieb [1989]: “a collection of processing elements that
communicate and cooperate to solve large problems fast”.
> Flynn’s taxonomy of computers [1972]
▪ Four classes by instruction and data streams: SISD, SIMD, MISD, MIMD.
[Figure: example instruction streams for each class]
> Other classifications
▪ Handler’s classification [1977]
▪ Structural classification, e.g. memory architectures
Parallel Computer’s Memory Architectures
> Includes shared, distributed, and hybrid memory.
▪ Shared memory computer: Uniform Memory Access (UMA) or Non-Uniform Memory Access (NUMA).
▪ Distributed memory computer.
▪ Hybrid shared-distributed memory computer.
MIMD Processing
> Distributed memory (a.k.a. loosely coupled multiprocessors)
▪ NO shared global memory address space.
▪ Real-life implementation: multicomputer network.
✓ Network-based multiprocessors.
▪ Usually programmed via message passing
✓ Explicit calls (send, receive) for communication
> Shared memory (a.k.a. tightly coupled multiprocessors)
▪ Shared global memory address space.
▪ Traditionally: symmetric multiprocessing (SMP).
▪ Real-life implementation: multi-core processors, multithreaded processors.
▪ Programming model similar to multitasking uniprocessor except
✓ Operations on shared data require synchronization
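A minimal sketch (not from the slides) of the synchronization requirement above: two Pthreads update one shared counter, and without the mutex some of the shared-memory updates could be lost.

#include <stdio.h>
#include <pthread.h>

long counter = 0;                                  /* Shared global data */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *work(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);                 /* Synchronize the update */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, work, NULL);
    pthread_create(&t2, NULL, work, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);            /* 2000000 with the lock */
    return 0;
}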
Common Parallel Programming Models
> Used as an abstraction above hardware & memory architectures.
▪ Any model can (theoretically) be implemented on any underlying hardware.
> Parallel programming models in common use:
▪ Shared memory (without threads)
✓ Implementations: e.g., SHMEM on distributed memory machines
▪ Threads
✓ Implementations: POSIX Threads, OpenMP.
▪ Message passing
✓ Implementation: MPI (Message Passing Interface)
▪ Hybrid
✓ Implementations: MPI+Pthreads, MPI+OpenMP
▪ Data parallel
✓ Implementations: Coarray Fortran, Unified Parallel C (UPC), Chapel.
▪ Others: SPMD, MPMD, etc.
Threads Programming Model
> A shared memory programming type
▪ A heavyweight process comprises multiple
concurrently executing paths (threads).
> Common implementations:
▪ A library of subroutines that are called from
within parallel source code.
▪ A set of compiler directives embedded in either serial or parallel source code (see the sketch below).
▪ In both cases, the programmer is responsible for determining parallelism.
> Main issues apart from load balancing:
▪ Synchronization and Performance overhead
▪ Scalability
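A minimal sketch of the directive-based style (my own example; it assumes OpenMP, listed above as a threads implementation): the directives are embedded in otherwise serial code, and it is the programmer, not the compiler, who decides that the two tasks may run concurrently. Compile with -fopenmp.

#include <stdio.h>
#include <omp.h>

void task_a(void) { printf("task A on thread %d\n", omp_get_thread_num()); }
void task_b(void) { printf("task B on thread %d\n", omp_get_thread_num()); }

int main(void)
{
    /* The programmer asserts, via directives, that A and B are independent */
    #pragma omp parallel sections
    {
        #pragma omp section
        task_a();
        #pragma omp section
        task_b();
    }
    return 0;
}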
Message Passing Programming Model
> Multiple tasks execute simultaneously on the same physical machine and/or across multiple networked machines.
▪ Each task uses its own local memory during computation.
▪ Tasks exchange data by sending and receiving messages.
> Implementations: usually comprise a library of subroutines.
▪ Calls to subroutines are embedded in the source code (see the MPI sketch below).
▪ The programmer is responsible for determining all parallelism.
> Main issues apart from partitioning and load balancing:
▪ Communication complexity and Performance overhead
▪ Scalability
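A minimal message-passing sketch (mine, not from the slides), using MPI in C: each task sees only its own local memory, and data moves between tasks through explicit send and receive calls. Compile with mpicc and run with, e.g., mpirun -np 2.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                       /* Lives in rank 0's local memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);    /* Now in rank 1's memory */
    }

    MPI_Finalize();
    return 0;
}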
Hybrid Programming Models
> MPI + threads (see the sketch below):
▪ Threads perform the computationally intensive kernels on each node's local data.
▪ MPI handles inter-node communication over the network.
> MPI + GPUs (CUDA):
▪ On-node GPUs run the computationally intensive kernels; CUDA handles data exchange between node-local memory and the GPUs.
▪ MPI executes tasks on the CPUs using local data and communicates with other nodes over the network.
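A minimal MPI+OpenMP hybrid sketch (my own example of the first combination above): OpenMP threads do the compute-intensive work on each rank's local share of the data, and MPI combines the per-rank results over the network. Compile with mpicc -fopenmp; the example assumes the number of ranks divides N evenly.

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char *argv[])
{
    int rank, size;
    long local = 0, global = 0;
    const long N = 1L << 20;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* On-node work: threads sum this rank's share of 0..N-1 */
    long chunk = N / size, start = rank * chunk, end = start + chunk;
    #pragma omp parallel for reduction(+:local)
    for (long i = start; i < end; i++)
        local += i;

    /* Inter-node communication: combine the per-rank partial sums */
    MPI_Reduce(&local, &global, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("total = %ld\n", global);

    MPI_Finalize();
    return 0;
}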
Parallel Program Design
> Automatic parallelization of serial programs: limited success.
▪ The best parallelization may be writing an entirely new parallel algorithm.
▪ There is no single, universal solution for designing parallel programs.
> Foster’s methodology (see the sketch after this list):
1. Partitioning: divide the computation to be performed and the data
operated on by the computation into small tasks.
2. Communication: determine what communication needs to be carried
out among the tasks identified in the previous step.
3. Agglomeration: combine tasks where feasible to reduce communication
overhead and improve efficiency, trading off some parallelism.
4. Mapping: assign tasks to processes/threads to balance workload.
✓ Static mapping: tasks assigned before execution.
✓ Dynamic mapping: tasks assigned during execution.
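A minimal sketch (my example, not Foster's own) annotating the four steps on a simple dot product, using OpenMP for brevity: the reduction expresses the communication, and the static schedule stands in for agglomeration and mapping. Compile with -fopenmp.

#include <stdio.h>
#include <omp.h>
#define N 1000000

int main(void)
{
    static double a[N], b[N];
    double dot = 0.0;
    for (long i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

    /* 1. Partitioning: one fine-grained task per product a[i]*b[i].
       2. Communication: partial results are combined into one sum (reduction).
       3. Agglomeration: schedule(static) groups elements into one chunk per
          thread to reduce overhead.
       4. Mapping: chunks are assigned to threads statically, before execution. */
    #pragma omp parallel for schedule(static) reduction(+:dot)
    for (long i = 0; i < N; i++)
        dot += a[i] * b[i];

    printf("dot = %.1f\n", dot);
    return 0;
}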
Partitioning
> Includes domain decomposition & functional decomposition
▪ Functional decomposition: the problem is decomposed according to the work that must be done; each task then performs a portion of the overall work.
▪ Domain decomposition: the data associated with the problem is decomposed; each parallel task then works on a portion of the data.
Performance Limits
> Amdahl’s law: potential speedup
of a parallel program is:
𝑆𝑝𝑒𝑒𝑑𝑢𝑝 = 1 / (𝑝/𝑁 + (1 − 𝑝))
▪ 𝑝: parallelizable fraction of the code
▪ 𝑁: number of processors
> Corollaries:
▪ Parallelism has diminishing returns
✓ more processors don’t always help
▪ Serial code portion is a serious bottleneck
✓ E.g., for 𝑝 = 0.9, 𝑠𝑝𝑒𝑒𝑑𝑢𝑝 < 10 no matter how many processors we use (see the numeric sketch below).
▪ The law does not consider other practical bottlenecks in the parallel portion, e.g.,
✓ Load imbalance
✓ Resource contention
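A quick numeric sketch (mine) of the corollaries: plugging 𝑝 = 0.9 into the formula above shows the speedup flattening out below 10 no matter how large 𝑁 becomes.

#include <stdio.h>

int main(void)
{
    double p = 0.9;                               /* Parallelizable fraction */
    int procs[] = {1, 2, 4, 8, 16, 1024, 1048576};
    for (int k = 0; k < 7; k++) {
        int N = procs[k];
        double speedup = 1.0 / (p / N + (1.0 - p));
        printf("N = %7d  speedup = %.3f\n", N, speedup);
    }
    return 0;
}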
Scalability
> A serial portion always exists in practical parallel programs:
▪ Synchronization operations cannot be parallelized
▪ Non-parallelizable loops, where each iteration depends on the previous one, e.g.,
for (i = 1; i < N; i++)
    A[i] = (A[i] + A[i-1]) / 2;
▪ Single thread prepares data and spawns parallel tasks
> Different perspective: as we increase the number of processors,
we can also increase the problem size proportionally.
▪ Gustafson’s law: 𝑆(𝑁) = 𝑁 − 𝑠(𝑁 − 1), where 𝑠 is the serial (non-parallelizable) fraction and 𝑆(𝑁) is the scaled speedup.
✓ E.g., with 𝑠 = 0.1 and 𝑁 = 100: 𝑆(𝑁) = 100 − 0.1 × 99 ≈ 90.1.
▪ Like Amdahl’s law, it recognizes the presence of a serial portion but argues
that its impact diminishes as the problem size increases.
▪ Parallel computing can achieve significant performance gains by tackling
problems that were intractable on serial machines due to their size.
Example: Parallel Summation psum_mutex
> Sum the numbers 0, …, 𝑛 − 1 in parallel:
▪ Partition the values [0, 𝑛 − 1] into 𝑡 ranges; each of 𝑡 threads processes one range.
▪ Simplest approach: threads sum into a global variable protected by a mutex.
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#define MAXTHREADS 16
long nelems_p_t;                    /* Elements summed per thread */
long gsum = 0;                      /* Global sum, shared by all threads */
pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;

void *psum(void *arg)
{
    long i, myid = *((long *)arg);
    long start = myid * nelems_p_t;
    long end = start + nelems_p_t;
    for (i = start; i < end; i++) {
        pthread_mutex_lock(&mtx);   /* One lock/unlock per addition */
        gsum += i;
        pthread_mutex_unlock(&mtx);
    }
    return NULL;
}

int main(int argc, char *argv[])
{
    long i, nelems = 1 << 30, myid[MAXTHREADS];
    pthread_t tid[MAXTHREADS];
    char *endptr;
    long nthreads = strtol(argv[1], &endptr, 10);
    nelems_p_t = nelems / nthreads;
    /* Create peer threads and wait for them to finish */
    for (i = 0; i < nthreads; i++) {
        myid[i] = i;
        pthread_create(&tid[i], NULL, psum, &myid[i]);
    }
    for (i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);
    printf("Total sum = %ld\n", gsum);
    return 0;
}
Example: Parallel Summation psum_array
> Eliminates need for mutex synchronization:
▪ Peer thread i sums into global array element psum[i].
▪ Main waits for threads to finish, then sums the elements of psum.
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#define MAXTHREADS 16
long nelems_p_t;                    /* Elements summed per thread */
long psum[MAXTHREADS] = {0};        /* Per-thread partial sums */

void *sum_array(void *arg)
{
    long i, myid = *((long *)arg);
    long start = myid * nelems_p_t;
    long end = start + nelems_p_t;
    for (i = start; i < end; i++)
        psum[myid] += i;            /* No lock needed: one writer per element */
    return NULL;
}

int main(int argc, char *argv[])
{
    long i, tsum = 0, nelems = 1 << 30, myid[MAXTHREADS];
    pthread_t tid[MAXTHREADS];
    char *endptr;
    long nthreads = strtol(argv[1], &endptr, 10);
    nelems_p_t = nelems / nthreads;
    /* Create peer threads and wait for them to finish */
    for (i = 0; i < nthreads; i++) {
        myid[i] = i;
        pthread_create(&tid[i], NULL, sum_array, &myid[i]);
    }
    for (i = 0; i < nthreads; i++) {
        pthread_join(tid[i], NULL);
        tsum += psum[i];
    }
    printf("Total sum = %ld\n", tsum);
    return 0;
}
Example: Parallel Summation psum_array (cont.)
> Performance: orders of magnitude faster than psum-mutex
Chart: Parallel Summation, elapsed seconds by threads (cores)
Threads (cores):  1(1)   2(2)   4(4)   8(8)   16(8)
psum-array (s):   5.36   4.24   2.54   1.64   0.94
Example: Parallel Summation psum_local
> Reduce memory references:
▪ Peer thread i sums into a local variable.
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#define MAXTHREADS 16
long nelems_p_t;                    /* Elements summed per thread */
long psum[MAXTHREADS] = {0};        /* Per-thread partial sums */

void *sum_local(void *arg)
{
    long i, myid = *((long *)arg);
    long start = myid * nelems_p_t;
    long end = start + nelems_p_t;
    long lsum = 0;                  /* Accumulate in a local variable */
    for (i = start; i < end; i++)
        lsum += i;
    psum[myid] = lsum;              /* One global write per thread */
    return NULL;
}

int main(int argc, char *argv[])
{
    long i, tsum = 0, nelems = 1 << 30, myid[MAXTHREADS];
    pthread_t tid[MAXTHREADS];
    char *endptr;
    long nthreads = strtol(argv[1], &endptr, 10);
    nelems_p_t = nelems / nthreads;
    /* Create peer threads and wait for them to finish */
    for (i = 0; i < nthreads; i++) {
        myid[i] = i;
        pthread_create(&tid[i], NULL, sum_local, &myid[i]);
    }
    for (i = 0; i < nthreads; i++) {
        pthread_join(tid[i], NULL);
        tsum += psum[i];
    }
    printf("Total sum = %ld\n", tsum);
    return 0;
}
Example: Parallel Summation psum_local (cont.)
> Performance: significantly faster than psum-array
Chart: Parallel Summation, elapsed seconds by threads (cores)
Threads (cores):  1(1)   2(2)   4(4)   8(8)   16(8)
psum-array (s):   5.36   4.24   2.54   1.64   0.94
psum-local (s):   1.98   1.14   0.60   0.32   0.33
NEXT LECTURE
[Flipped class] Parallel programming (cont.)
> Pre-class
▪ Study pre-class materials on Canvas
> In-class
▪ Reinforcement/enrichment discussion
> Post-class
▪ Homework
▪ Consultation (if needed)