
Introduction to OpenMP

OpenMP in small bites: Overview

Dr. Christian Terboven


History

• De-facto standard for Shared-Memory Parallelization.

• 1997: OpenMP 1.0 for FORTRAN
• 1998: OpenMP 1.0 for C and C++
• 1999: OpenMP 1.1 for FORTRAN
• 2000: OpenMP 2.0 for FORTRAN
• 2002: OpenMP 2.0 for C and C++
• 2005: OpenMP 2.5 now includes both programming languages.
• 05/2008: OpenMP 3.0
• 07/2011: OpenMP 3.1
• 07/2013: OpenMP 4.0
• 11/2015: OpenMP 4.5
• 11/2018: OpenMP 5.0

http://www.OpenMP.org

RWTH Aachen University is a member of the OpenMP Architecture Review Board (ARB) since 2006.
What is OpenMP?

• De-facto standard Application Programming Interface (API) to write shared memory parallel applications in C, C++, and Fortran

• Compiler Directives, Runtime routines and Environment variables
OpenMP Overview & Parallel Region
OpenMP's machine model

• OpenMP: Shared-Memory Parallel Programming Model

• All processors/cores access a shared main memory.

• Real architectures are more complex, as we will see later / as we have seen.

• Parallelization in OpenMP employs multiple threads.

[Figure: four processors, each with its own cache, connected via a crossbar/bus to a shared main memory]
OpenMP Memory Model

• All threads have access to the same, globally shared memory

• Data in private memory is only accessible by the thread owning this memory

• No other thread sees the change(s) in private memory

• Data transfer is through shared memory and is 100% transparent to the application

[Figure: several threads (T), each with its own private memory, attached to one shared memory]
OpenMP Execution Model

• OpenMP programs start with just one thread: the Master.

• Worker threads are spawned at Parallel Regions; together with the Master they form the Team of threads.

• In between Parallel Regions the Worker threads are put to sleep. The OpenMP Runtime takes care of all thread management work.

• Concept: Fork-Join.

• Allows for an incremental parallelization!

[Figure: the Master thread executes the serial parts; at each Parallel Region a team of Worker threads is forked and joined again afterwards]
Parallel Region and Structured Blocks

• The parallelism has to be expressed explicitly.

C/C++
#pragma omp parallel
{
   ...
   structured block
   ...
}

Fortran
!$omp parallel
   ...
   structured block
   ...
!$omp end parallel

• Structured Block
  − Exactly one entry point at the top
  − Exactly one exit point at the bottom
  − Branching in or out is not allowed
  − Terminating the program is allowed (abort / exit)

• Specification of the number of threads:
  − Environment variable: OMP_NUM_THREADS=…
  − Or: via the num_threads clause: add num_threads(num) to the parallel construct
Demo

Hello OpenMP World
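A minimal sketch of such a program (an assumed example, not necessarily the original demo source) could look like this:

C/C++
#include <stdio.h>
#include <omp.h>

int main(void)
{
    // Each thread of the team executes the structured block.
    #pragma omp parallel
    {
        printf("Hello OpenMP World from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}

Compile with OpenMP enabled (e.g. -fopenmp) so that the directive and the runtime routines take effect.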

Starting OpenMP Programs on Linux

• From within a shell, global setting of the number of threads:


export OMP_NUM_THREADS=4
./program

• From within a shell, one-time setting of the number of threads:


OMP_NUM_THREADS=4 ./program

Questions?
For Worksharing Construct
For Worksharing

• If only the parallel construct is used, each thread executes the whole Structured Block.

• Program Speedup: Worksharing

• OpenMP's most common Worksharing construct: for

C/C++
int i;
#pragma omp for
for (i = 0; i < 100; i++)
{
   a[i] = b[i] + c[i];
}

Fortran
INTEGER :: i
!$omp do
DO i = 0, 99
   a(i) = b(i) + c(i)
END DO

  − Distribution of loop iterations over all threads in a Team.
  − Scheduling of the distribution can be influenced.

• Loops often account for most of a program's runtime!
Worksharing illustrated

Serial pseudo-code:
do i = 0, 99
   a(i) = b(i) + c(i)
end do

Here: 4 Threads
Thread 1: do i = 0, 24   ... a(i) = b(i) + c(i) ... end do
Thread 2: do i = 25, 49  ... a(i) = b(i) + c(i) ... end do
Thread 3: do i = 50, 74  ... a(i) = b(i) + c(i) ... end do
Thread 4: do i = 75, 99  ... a(i) = b(i) + c(i) ... end do

All threads operate on the same arrays A(0)…A(99), B(0)…B(99), C(0)…C(99) in shared memory.
Demo

Vector Addition
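A self-contained sketch of such a vector addition (assumed code, not necessarily the original demo) is:

C/C++
#include <stdio.h>

#define N 100

int main(void)
{
    double a[N], b[N], c[N];

    // Initialize the input vectors serially.
    for (int i = 0; i < N; i++) {
        b[i] = (double)i;
        c[i] = 2.0 * (double)i;
    }

    // The loop iterations are distributed over the team of threads.
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        a[i] = b[i] + c[i];
    }

    printf("a[%d] = %f\n", N - 1, a[N - 1]);
    return 0;
}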

Influencing the For Loop Scheduling

• for-construct: OpenMP allows influencing how the iterations are scheduled among the threads of the team, via the schedule clause:
  − schedule(static [, chunk]): Iteration space divided into blocks of chunk size; blocks are assigned to threads in a round-robin fashion. If chunk is not specified: #threads blocks.
  − schedule(dynamic [, chunk]): Iteration space divided into blocks of chunk size (not specified: 1); blocks are scheduled to threads in the order in which threads finish previous blocks.
  − schedule(guided [, chunk]): Similar to dynamic, but the block size starts with an implementation-defined value, then is decreased exponentially down to chunk.

• Default on most implementations is schedule(static).
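As an illustration (assumed example values, not from the slides), the clause can be attached to a combined parallel for; here dynamic scheduling hands out chunks of 4 iterations to whichever thread is free:

C/C++
#include <stdio.h>
#include <omp.h>

#define N 16

int main(void)
{
    int owner[N];

    #pragma omp parallel for schedule(dynamic, 4)
    for (int i = 0; i < N; i++) {
        owner[i] = omp_get_thread_num();   // record which thread ran iteration i
    }

    for (int i = 0; i < N; i++)
        printf("iteration %2d executed by thread %d\n", i, owner[i]);
    return 0;
}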
Synchronization Overview

• Can all loops be parallelized with for-constructs? No!
  − Simple test: If the results differ when the code is executed backwards, the loop iterations are not independent. BUT: This test alone is not sufficient.

C/C++
int i, s = 0;
#pragma omp parallel for
for (i = 0; i < 100; i++)
{
   s = s + a[i];
}

This is a combined construct, which is semantically equivalent to
#pragma omp parallel
#pragma omp for
for(...) {…}

• Data Race: If between two synchronization points at least one thread writes to a memory location from which at least one other thread reads, the result is not deterministic (race condition).
Synchronization: Critical Region


• A Critical Region is executed by all threads, but by only one thread simultaneously (Mutual
Exclusion).
C/C++
#pragma omp critical (name)
{
... structured block ...
}

• Do you think this solution scales well?


C/C++
int i, s = 0;
#pragma omp parallel for
for (i = 0; i < 100; i++)
{
#pragma omp critical
{ s = s + a[i]; }
}

The Barrier Construct
The Barrier Construct

• OpenMP barrier (implicit or explicit)


− Threads wait until all threads of the current Team have reached the barrier

C/C++
#pragma omp barrier


• All worksharing constructs contain an implicit barrier at the end
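A small sketch (assumed, not from the slides) of an explicit barrier separating two phases of work inside a parallel region:

C/C++
#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        printf("phase 1: thread %d\n", omp_get_thread_num());

        // No thread starts phase 2 before all threads have finished phase 1.
        #pragma omp barrier

        printf("phase 2: thread %d\n", omp_get_thread_num());
    }
    return 0;
}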

Single and Master Construct
The Single Construct

C/C++ Fortran
#pragma omp single [clause] !$omp single [clause]
... structured block ... ... structured block ...
!$omp end single

• The single construct specifies that the enclosed structured block is executed by only one thread of the team.


− It is up to the runtime which thread that is.

• Useful for:
− I/O
− Memory allocation and deallocation, etc. (in general: setup work)
− Implementation of the single-creator parallel-executor pattern as we will see soon…
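A sketch (assumed, not from the slides) of the typical pattern: one thread performs the setup, then the whole team works on the data.

C/C++
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void)
{
    int n = 1000;
    double *data = NULL;

    #pragma omp parallel shared(data, n)
    {
        // Exactly one (arbitrary) thread allocates; the implicit barrier at the
        // end of single guarantees data is ready before any thread uses it.
        #pragma omp single
        {
            data = malloc(n * sizeof(double));
            printf("allocation done by thread %d\n", omp_get_thread_num());
        }

        #pragma omp for
        for (int i = 0; i < n; i++)
            data[i] = 0.0;
    }

    free(data);
    return 0;
}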

The Master Construct

C/C++ Fortran
#pragma omp master[clause] !$omp master[clause]


... structured block ... ... structured block ...
!$omp end master

• The master construct specifies that the enclosed structured block is executed only by the master
thread of a team.

• Note: The master construct is not a worksharing construct and does not contain an implicit barrier at the end.

Runtime Library

• C and C++:
  − If OpenMP is enabled during compilation, the preprocessor symbol _OPENMP is defined. To use the OpenMP runtime library, the header omp.h has to be included.
  − omp_set_num_threads(int): The specified number of threads will be used for the parallel region encountered next.
  − int omp_get_num_threads(): Returns the number of threads in the current team.
  − int omp_get_thread_num(): Returns the number of the calling thread in the team; the Master always has the id 0.

• Additional functions are available, e.g. to provide locking functionality.
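A small sketch (assumed, not from the slides) that uses these routines together and guards them with the _OPENMP symbol:

C/C++
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

int main(void)
{
#ifdef _OPENMP
    omp_set_num_threads(4);   // request 4 threads for the next parallel region

    #pragma omp parallel
    {
        // Each thread queries its own id and the team size.
        printf("Thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
#else
    printf("Compiled without OpenMP support.\n");
#endif
    return 0;
}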
Questions?
Data Scoping
Scoping Rules

• Managing the Data Environment is the challenge of OpenMP.

• Scoping in OpenMP: dividing variables into shared and private:
  − private-list and shared-list on Parallel Region
  − private-list and shared-list on Worksharing constructs
  − General default is shared for Parallel Regions, firstprivate for Tasks (Tasks are introduced later).
  − Loop control variables on for-constructs are private
  − Non-static variables local to Parallel Regions are private
  − private: A new, uninitialized instance is created for the task or for each thread executing the construct
    ▪ firstprivate: Initialization with the value before encountering the construct
    ▪ lastprivate: Value of the last loop iteration is written back to the Master
  − Static variables are shared
Privatization of Global/Static Variables

• Global / static variables can be privatized with the threadprivate directive


− One instance is created for each thread
▪ Before the first parallel region is encountered
▪ Instance exists until the program ends
▪ Does not work (well) with nested Parallel Region
− Based on thread-local storage (TLS)
▪ TlsAlloc (Win32-Threads), pthread_key_create (Posix-Threads), keyword __thread (GNU extension)

C/C++ Fortran
static int i; SAVE INTEGER :: i
#pragma omp threadprivate(i) !$omp threadprivate(i)
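A small illustrative sketch (assumed, not from the slides) of a threadprivate counter:

C/C++
#include <stdio.h>
#include <omp.h>

static int counter;                  // one instance per thread
#pragma omp threadprivate(counter)

int main(void)
{
    #pragma omp parallel num_threads(4)
    {
        counter = omp_get_thread_num();   // each thread writes its own copy
    }

    // With an unchanged team size and dynamic thread adjustment disabled,
    // the per-thread values persist into the next parallel region.
    #pragma omp parallel num_threads(4)
    {
        printf("Thread %d sees counter = %d\n",
               omp_get_thread_num(), counter);
    }
    return 0;
}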

Back to our bad scaling example

C/C++
int i, s = 0;
#pragma omp parallel for
for (i = 0; i < 100; i++)
{
#pragma omp critical
{ s = s + a[i]; }
}
It‘s your turn: Make It Scale!

#pragma omp parallel
{
   #pragma omp for
   for (i = 0; i < 100; i++)
   {
      s = s + a[i];
   }
} // end parallel

The serial loop

do i = 0, 99
   s = s + a(i)
end do

is split among the four threads into the chunks i = 0..24, 25..49, 50..74 and 75..99, all of which update the same shared variable s.
The Reduction Clause


• In a reduction-operation the operator is applied to all variables in the list. The variables have to be
shared.
− reduction(operator:list)
− The result is provided in the associated reduction variable
C/C++
int i, s = 0;
#pragma omp parallel for reduction(+:s)
for (i = 0; i < 100; i++)
{
   s = s + a[i];
}

  − Possible reduction operators with initialization value:
    + (0), * (1), - (0), & (~0), | (0), && (1), || (0), ^ (0), min (largest number), max (least number)
Reduction Operations

int a = 0;
#pragma omp parallel
#pragma omp for reduction(+:a)
for (int i = 0; i < 100; i++)
{
   a += i;
}

[Figure: in the serial part a = 0; when the parallel region starts, each thread gets a local copy initialized to 0; the threads compute partial sums (e.g. 300, 925, 1550, 2175) in their local copies; at the end of the for construct the reduction combines them and writes the final result 4950 to the shared variable a.]
Questions?
False Sharing
Memory Bottleneck

• There is a growing gap between core and memory performance:



− memory, since 1980: 1.07x per year improvement in latency
− single core: since 1980: 1.25x per year until 1986, 1.52x p. y. until 2000,
1.20x per year until 2005, then no change on a per-core basis

[Figure: SPECint benchmark performance and memory latency over time, illustrating the growing gap]

  − Source: John L. Hennessy, Stanford University, and David A. Patterson, University of California, September 25, 2012
Caches

• CPU is fast
  − Order of 3.0 GHz

• Caches:
  − Fast, but expensive
  − Thus small, order of MB

• Memory is slow
  − Order of 0.3 GHz
  − Large, order of GB

[Figure: hierarchy from core via on-chip cache and off-chip cache to memory]

• A good utilization of caches is crucial for good performance of HPC applications!
Visualization of the Memory Hierarchy

• Latency on the Intel Westmere-EP 3.06 GHz processor

[Figure: memory access latency in ns (0–20) versus memory footprint from 1 B to 2 GB, with the regions served by the L1, L2 and L3 caches marked]
Data in Caches

• When data is used, it is copied into caches.

• The hardware always copies chunks into the cache, so-called cache lines.

• This is useful when:
  − the data is used frequently (temporal locality)
  − consecutive data is used which is on the same cache line (spatial locality)

[Figure: two cores, each with an on-chip cache, connected via a bus to memory]
False Sharing

• False Sharing occurs when
  − different threads use elements of the same cache line
  − one of the threads writes to the cache line

• As a result the cache line is moved between the threads, although there is no real data dependency.

• Note: False Sharing is a performance problem, not a correctness issue.

[Figure: two cores, each with an on-chip cache, connected via a bus to memory; the written cache line bounces between the two caches]
Summing up vector elements again

#pragma omp parallel
{
   #pragma omp for
   for (i = 0; i < 100; i++)
   {
      s = s + a[i];
   }
} // end parallel

The serial loop

do i = 0, 99
   s = s + a(i)
end do

is split among the four threads into the chunks i = 0..24, 25..49, 50..74 and 75..99, all of which update the same shared variable s.
False Sharing

C/C++
double s_priv[nthreads];
#pragma omp parallel num_threads(nthreads)
{
   int t = omp_get_thread_num();
   #pragma omp for
   for (i = 0; i < 100; i++)
   {
      s_priv[t] += a[i];
   }
} // end parallel
for (i = 0; i < nthreads; i++)
{
   s += s_priv[i];
}
False Sharing

• No performance benefit for more threads!
• Reason: false sharing of s_priv
• Solution: padding so that only one variable per cache line is used

[Figure: MFLOPS over 1–12 threads, with and without false sharing; the version with false sharing does not scale]

Standard layout: several elements of s_priv (1, 2, 3, 4, …) share cache line 1, so the threads invalidate each other's cache lines.
With padding: each element (1, 2, 3, …) occupies its own cache line.
False Sharing avoided

double s_priv[nthreads * 8];


#pragma omp parallel num_threads(nthreads)
{
int t=omp_get_thread_num();
#pragma omp for
for (i = 0; i < 100; i++)
{
s_priv[t * 8] += a[i];
}
} // end parallel
for (i = 0; i < nthreads; i++)
{
s += s_priv[i * 8];
}
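An alternative to manual index padding is to make the padding explicit in the data type. The following sketch (assumed, not from the slides) presumes 64-byte cache lines:

C/C++
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define CACHE_LINE 64            /* assumed cache-line size in bytes */
#define N 100

/* Each per-thread partial sum occupies its own cache line. */
typedef struct {
    double value;
    char pad[CACHE_LINE - sizeof(double)];
} padded_double;

int main(void)
{
    int nthreads = 4;            /* example team size */
    double a[N], s = 0.0;
    for (int i = 0; i < N; i++) a[i] = 1.0;

    padded_double *s_priv = calloc((size_t)nthreads, sizeof(padded_double));

    #pragma omp parallel num_threads(nthreads)
    {
        int t = omp_get_thread_num();
        #pragma omp for
        for (int i = 0; i < N; i++)
            s_priv[t].value += a[i];
    }

    for (int i = 0; i < nthreads; i++)
        s += s_priv[i].value;

    printf("s = %f\n", s);
    free(s_priv);
    return 0;
}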
Example: PI
Example: Pi

   π = ∫₀¹ 4 / (1 + x²) dx

double f(double x)
{
   return (4.0 / (1.0 + x*x));
}

double CalcPi (int n)
{
   const double fH = 1.0 / (double) n;
   double fSum = 0.0;
   double fX;
   int i;

   #pragma omp parallel for
   for (i = 0; i < n; i++)
   {
      fX = fH * ((double)i + 0.5);
      fSum += f(fX);
   }
   return fH * fSum;
}
Example: Pi

What is wrong with this code?

(The code is the same as on the previous slide: the loop is parallelized with a plain parallel for, while fX and fSum are shared and updated by all threads.)
Example: Pi

double f(double x)
{
   return (4.0 / (1.0 + x*x));
}

double CalcPi (int n)
{
   const double fH = 1.0 / (double) n;
   double fSum = 0.0;
   double fX;
   int i;

   #pragma omp parallel for private(fX,i) reduction(+:fSum)
   for (i = 0; i < n; i++)
   {
      fX = fH * ((double)i + 0.5);
      fSum += f(fX);
   }
   return fH * fSum;
}
Example: Pi

What if we had forgotten the private(fX,i) reduction(+:fSum) clause?

(The code is otherwise the same as on the previous slide; the next slide discusses the resulting data race.)
Race Condition

• Data Race: the typical OpenMP programming error, when:
  − two or more threads access the same memory location, and
  − at least one of these accesses is a write, and
  − the accesses are not protected by locks or critical regions, and
  − the accesses are not synchronized, e.g. by a barrier.

• Non-deterministic occurrence: e.g. the sequence of the execution of parallel loop iterations is non-deterministic and may change from run to run

• In many cases private clauses, barriers or critical regions are missing

• Data races are hard to find using a traditional debugger
  − Use tools like Intel Inspector XE, ThreadSanitizer, Archer
Inspector XE – Results

[Screenshot: Intel Inspector XE showing (1) the detected problems, (2) filters and (3) the code location; the missing reduction is detected.]
Example: Pi

What if we just made the fSum variable private?
→ fSum == 0 after the loop (no update to the global variable), so CalcPi returns 0.

double f(double x)
{
   return (4.0 / (1.0 + x*x));
}

double CalcPi (int n)
{
   const double fH = 1.0 / (double) n;
   double fSum = 0.0;
   double fX;
   int i;

   #pragma omp parallel for private(fX,i,fSum)
   for (i = 0; i < n; i++)
   {
      fX = fH * ((double)i + 0.5);
      fSum += f(fX);
   }
   return fH * fSum;
}
Example: Pi Scalability

• Results for n = 2 · 10^9:


# Threads Runtime [sec.] Speedup
1 1.141 1.00
2 0.575 1.96
4 0.298 3.93
8 0.161 7.08
System: CLAIX2018 Node (Intel Xeon 8160)
Compiler: Intel Compiler 19.0, Flags: –fopenmp –O3

• Scalability is good (for up to 8 threads):


− About 100% of the runtime has been parallelized.
− As there is just one parallel region, there is virtually no overhead introduced by the parallelization.

− Problem is parallelizable in a trivial fashion ...

Questions?
Recursive approach to compute Fibonacci

• Fibonacci numbers
  − Form a sequence F(n) such that each number is the sum of the two preceding ones
  − F(0) = 0, F(1) = 1
  − F(n) = F(n-1) + F(n-2) (for n > 1)

int fib(int n) {
   if (n < 2) return n;
   int x = fib(n - 1);
   int y = fib(n - 2);
   return x+y;
}

int main(int argc, char* argv[])
{
   [...]
   fib(input);
   [...]
}

• On the following slides we will discuss three approaches to parallelize this recursive code with Tasking.
The task construct

• Deferring (or not) a unit of work (executable by any member of the team)
  − Always attached to a structured block

C/C++
#pragma omp task [clause[[,] clause]...]
{structured-block}

Fortran
!$omp task [clause[[,] clause]...]
   structured-block
!$omp end task

◼ Where clause is one of:
  → Data Environment: private(list), firstprivate(list), shared(list), default(shared | none), in_reduction(r-id: list) (≥ 5.0)
  → Cutoff Strategies: if(scalar-expression), mergeable, final(scalar-expression)
  → Dependencies: depend(dep-type: list)
  → Other: priority(priority-value), untied
Tasks in OpenMP: Data Scoping

• Some rules from Parallel Regions apply:

− Static and Global variables are shared
− Automatic Storage (local) variables are private
− Task variables are firstprivate unless shared in the enclosing context
▪ Only shared attribute is inherited
▪ Exception: Orphaned Task variables are firstprivate by default!

First version parallelized with Tasking (omp-v1)

int main(int argc, char* argv[])
{
   [...]
   #pragma omp parallel
   {
      #pragma omp single
      {
         fib(input);
      }
   }
   [...]
}

int fib(int n) {
   if (n < 2) return n;
   int x, y;
   #pragma omp task shared(x)
   {
      x = fib(n - 1);
   }
   #pragma omp task shared(y)
   {
      y = fib(n - 2);
   }
   #pragma omp taskwait
   return x+y;
}

• Only one Task / Thread enters fib() from main(); it is responsible for creating the two initial work tasks
• Taskwait is required, as otherwise x and y would be lost
taskwait directive

• The taskwait directive (shallow task synchronization)
  − It is a stand-alone directive

#pragma omp taskwait

  − wait on the completion of child tasks of the current task; just direct children, not all descendant tasks; includes an implicit task scheduling point (TSP)

#pragma omp parallel
#pragma omp single
{
   #pragma omp task            // task A
   {
      #pragma omp task         // task B
      { … }
      #pragma omp task         // task C
      { … #C.1; #C.2; … }
      #pragma omp taskwait     // waits for B and C, but not for C.1 and C.2
   }
} // implicit barrier will wait for C.x

[Figure: task tree A → {B, C}, C → {C.1, C.2}; the taskwait inside A waits only for its direct children B and C]
Fibonacci Illustration

• T1 enters fib(4)
• T1 creates tasks for fib(3) and fib(2)
• T1 and T2 execute tasks from the queue
• T1 and T2 create 4 new tasks
• T1 - T4 execute tasks

[Figure: call tree of fib(4) — fib(3) and fib(2) at the first level, fib(2), fib(1), fib(1), fib(0) below — and the task queue holding the created tasks]
Fibonacci Illustration

• T1 enters fib(4)
• T1 creates tasks for fib(3) and fib(2)
• T1 and T2 execute tasks from the queue
• T1 and T2 create 4 new tasks
• T1 - T4 execute tasks
• …

[Figure: the complete task tree of fib(4), now expanded down to fib(1) and fib(0)]
Scalability measurements (1/3)

• Overhead of task creation prevents scalability!

• The OpenMP runtime optimizes when there is only one thread!

[Figure: Speedup of Fibonacci with Tasks over 1–16 threads; omp-v1 compared to the optimal speedup]
if Clause

• The if clause of a task construct
  − allows optimizing task creation/execution
  − reduces parallelism but also reduces the pressure in the runtime's task pool
  − for "very" fine-grained tasks you may need to do your own (manual) if

#pragma omp task if(expression)
{structured-block}

• If the expression of the if clause evaluates to false
  − the encountering task is suspended
  − the new task is executed immediately
  − the parent task resumes when the task finishes

• This is known as an undeferred task
Improved parallelization with Tasking (omp-v2)

• Improvement: Don't create yet another task once a certain (small enough) n is reached

int main(int argc, char* argv[])
{
   [...]
   #pragma omp parallel
   {
      #pragma omp single
      {
         fib(input);
      }
   }
   [...]
}

int fib(int n) {
   if (n < 2) return n;
   int x, y;
   #pragma omp task shared(x) if(n > 30)
   {
      x = fib(n - 1);
   }
   #pragma omp task shared(y) if(n > 30)
   {
      y = fib(n - 2);
   }
   #pragma omp taskwait
   return x+y;
}
Scalability measurements (2/3)
login-t, E5-2650 v4, 2x 12 cores @ 2.20 GHz
Intel Compiler 16.0.2, fib(45) = 1134903170

• Speedup is better, but still not great

• Small tasks still have to be allocated in omp-v2!

[Figure: Speedup of Fibonacci with Tasks over 1–16 threads for omp-v1 and omp-v2, compared to the optimal speedup]
Improved parallelization with Tasking (omp-v3)

• Improvement: Skip the OpenMP overhead once a certain n is reached (no issue w/ production compilers)

int main(int argc, char* argv[])
{
   [...]
   #pragma omp parallel
   {
      #pragma omp single
      {
         fib(input);
      }
   }
   [...]
}

int fib(int n) {
   if (n < 2) return n;
   if (n <= 30)
      return serfib(n);
   int x, y;
   #pragma omp task shared(x)
   {
      x = fib(n - 1);
   }
   #pragma omp task shared(y)
   {
      y = fib(n - 2);
   }
   #pragma omp taskwait
   return x+y;
}
Scalability measurements (3/3)
login-t, E5-2650 v4, 2x 12 cores @ 2.20 GHz
Intel Compiler 16.0.2, fib(45) = 1134903170

• Looks promising…

[Figure: Speedup of Fibonacci with Tasks over 1–16 threads for omp-v1, omp-v2 and omp-v3, compared to the optimal speedup]
Runtime measurements (1/2)
login-t, E5-2650 v4, 2x 12 cores @ 2.20 GHz
Intel Compiler 16.0.2, fib(45) = 1134903170

• First two versions were slow because of overhead!

[Figure: Runtime of Fibonacci with Tasks in seconds (0–450) over 1–16 threads for omp-v1, omp-v2 and omp-v3]
Runtime measurements (2/2)
login-t, E5-2650 v4, 2x 12 cores @ 2.20 GHz
Intel Compiler 16.0.2, fib(45) = 1134903170

• Third version is comparable to the serial version w/o OpenMP ☺

[Figure: Runtime of Fibonacci with Tasks in seconds (0–8) over 1–16 threads for omp-v3 and the serial version]
Tasking Overheads

• Typical overheads in task-based programs are:
  − Task creation: populate task data structure, add task to task queue
  − Task execution: retrieve a task from the queue (may include work stealing)

• If tasks become too fine-grained, overhead becomes noticeable
  − Execution spends a higher relative amount of time in the runtime
  − Task execution contributing to runtime becomes significantly smaller

• A rough rule of thumb to avoid (visible) tasking overhead
  − OpenMP tasks: 80-100k instructions executed per task
  − TBB tasks: 30-50k instructions executed per task
  − Other programming models may have another ideal granularity!
Threads vs Tasks

• Threads do not compose well
  − Example: multi-threaded plugin in a multi-threaded application
  − Composition usually leads to oversubscription and load imbalance

• Task models are inherently composable
  − A pool of threads executes all created tasks
  − Tasks from different modules can freely mix

• Task models make complex algorithms easier to parallelize
  − Programmers can think in concurrent pieces of work
  − Mapping of concurrent execution to threads handled elsewhere
  − Task creation can be irregular (e.g., recursion, graph traversal)
Sometimes You’re Better off with Threads…


• Some scenarios are more amenable for traditional threads
− Granularity too coarse for tasking
− Isolation of autonomous agents

• Static allocation of parallel work is typically easier with threads



− Controlling allocation of work to cache hierarchy

• Graphical User Interfaces (event thread + worker threads)

• Request/response processing, e.g.,


− Web servers
− Database servers

Questions?
Tasks in OpenMP: Data Scoping

• Some rules from Parallel Regions apply:
  − Automatic Storage (local) variables are private
  − Static and Global variables are shared

• Tasking:
  − Variables are firstprivate unless shared in the enclosing context
    ▪ Only the shared attribute is inherited
  − Exception: Orphaned Task variables are firstprivate by default!
    ▪ See an example later on
Example: Data Scoping
Data Scoping Example (1/7)

1 int a = 1;
2 void foo()
3 {
4 int b = 2, c = 3;
5 #pragma omp parallel private(b)
6 {
7 int d = 4;
8 #pragma omp task
9 {
10 int e = 5;
11
12 // Scope of a:
13 // Scope of b:
14 // Scope of c:
15 // Scope of d:
16 // Scope of e:
17 } } }

Data Scoping Example (2/7)

1 int a = 1;
2 void foo()
3 {
4 int b = 2, c = 3;
5 #pragma omp parallel private(b)
6 {
7 int d = 4;
8 #pragma omp task
9 {
10 int e = 5;
11
12 // Scope of a: shared
13 // Scope of b:
14 // Scope of c:
15 // Scope of d:
16 // Scope of e:
17 } } }

Data Scoping Example (3/7)

1 int a = 1;
2 void foo()
3 {
4 int b = 2, c = 3;
5 #pragma omp parallel private(b)
6 {
7 int d = 4;
8 #pragma omp task
9 {
10 int e = 5;
11
12 // Scope of a: shared
13 // Scope of b: firstprivate
14 // Scope of c:
15 // Scope of d:
16 // Scope of e:
17 } } }

Data Scoping Example (4/7)

1 int a = 1;
2 void foo()
3 {
4 int b = 2, c = 3;
5 #pragma omp parallel private(b)
6 {
7 int d = 4;
8 #pragma omp task
9 {
10 int e = 5;
11
12 // Scope of a: shared
13 // Scope of b: firstprivate
14 // Scope of c: shared
15 // Scope of d:
16 // Scope of e:
17 } } }

Data Scoping Example (5/7)

1 int a = 1;
2 void foo()
3 {
4 int b = 2, c = 3;
5 #pragma omp parallel private(b)
6 {
7 int d = 4;
8 #pragma omp task
9 {
10 int e = 5;
11
12 // Scope of a: shared
13 // Scope of b: firstprivate
14 // Scope of c: shared
15 // Scope of d: firstprivate
16 // Scope of e:
17 } } }

Data Scoping Example (6/7)

Hint: Use default(none) to be forced to think about every variable if you do not see clearly.

1 int a = 1;
2 void foo()
3 {
4 int b = 2, c = 3;
5 #pragma omp parallel private(b)
6 {
7 int d = 4;
8 #pragma omp task
9 {
10 int e = 5;
11
12 // Scope of a: shared
13 // Scope of b: firstprivate
14 // Scope of c: shared
15 // Scope of d: firstprivate
16 // Scope of e: private
17 } } }
Data Scoping Example (7/7)

1 int a = 1;
2 void foo()
3 {
4 int b = 2, c = 3;
5 #pragma omp parallel private(b)
6 {
7 int d = 4;
8 #pragma omp task
9 {
10 int e = 5;
11
12 // Scope of a: shared, value of a: 1


13 // Scope of b: firstprivate, value of b: 0 / undefined
14 // Scope of c: shared, value of c: 3
15 // Scope of d: firstprivate, value of d: 4
16 // Scope of e: private, value of e: 5
17 } } }
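Following the hint above, a small sketch (assumed, not from the slides) of how default(none) forces an explicit scoping decision for every variable:

C/C++
#include <stdio.h>

int main(void)
{
    int a = 1, b = 2;

    // With default(none), every variable used inside the construct must be
    // listed explicitly; forgetting one is a compile-time error.
    #pragma omp parallel default(none) shared(a) firstprivate(b)
    {
        #pragma omp critical
        printf("a = %d, b = %d\n", a, b);
    }
    return 0;
}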

Scoping – Lifetime (1/2)

• How long do private / firstprivate instances exist?

int i = 5;
#pragma omp parallel firstprivate(i)
{
   // private copy per thread
   // initialized with 5
   // alive until end of parallel region
}

#pragma omp parallel
#pragma omp single
{
   int i = 5; // alive until end of single
   #pragma omp task
   {
      // firstprivate copy of i for the task
      // alive until end of task
   }
}
Scoping – Lifetime (2/2)

• How long do private / firstprivate instances exist?

int i = 5;
#pragma omp parallel firstprivate(i)
{
   // private copy per thread
   // initialized with 5
   // alive until end of parallel region
}

#pragma omp parallel
#pragma omp single
{
   int i = 5; // alive until end of single
   #pragma omp task
   {
      // firstprivate copy of i for the task
      // alive until end of task
   }
}

➢ Alive until the end of the assigned structured block or construct
Orphaned Task Variables

• Arguments passed by reference are firstprivate by default in orphaned task generating constructs, example:

void task_body (int &);

void gen_task (int &x) { // an orphaned task construct
   #pragma omp task      // x is implicitly determined firstprivate(x)
   task_body (x);
}

void test (int &y, int &z) {
   #pragma omp parallel private(y)
   {
      y = z + 2;
      gen_task (y); // no matter if the argument is determined private
      gen_task (z); // or shared in the enclosing context.
      y++;          // each thread has its own int object y refers to
      gen_task (y);
   }
}

• Question: What is the scoping of x?
  − General rule: the reference argument is firstprivate if not shared before
  − Problem: Due to the call by reference it might or might not be shared
  − Solution: The special OpenMP rule for orphaned tasks has to be applied.

Example taken from the OpenMP 4.5 Examples
Questions?
Task Synchronization
The barrier directive

• OpenMP barrier (implicit or explicit)


− All tasks created by any thread of the current Team are guaranteed to be completed at barrier exit

C/C++
#pragma omp barrier

taskwait directive

• The taskwait directive (shallow task synchronization)
  − It is a stand-alone directive

#pragma omp taskwait

  − wait on the completion of child tasks of the current task; just direct children, not all descendant tasks; includes an implicit task scheduling point (TSP)

#pragma omp parallel
#pragma omp single
{
   #pragma omp task            // task A
   {
      #pragma omp task         // task B
      { … }
      #pragma omp task         // task C
      { … #C.1; #C.2; … }
      #pragma omp taskwait     // waits for B and C, but not for C.1 and C.2
   }
} // implicit barrier will wait for C.x

[Figure: task tree A → {B, C}, C → {C.1, C.2}; the taskwait inside A waits only for its direct children B and C]
taskgroup directive

• The taskgroup construct (deep task synchronization)
  − attached to a structured block; waits for completion of all descendants of the current task; TSP at the end

#pragma omp taskgroup [clause[[,] clause]...]
{structured-block}

  − where clause (could only be): reduction(reduction-identifier: list-items)

#pragma omp parallel
#pragma omp single
{
   #pragma omp taskgroup
   {
      #pragma omp task         // task B
      { … }
      #pragma omp task         // task C
      { … #C.1; #C.2; … }
   } // end of taskgroup: waits for B, C, C.1 and C.2
}

[Figure: task tree A → {B, C}, C → {C.1, C.2}; the taskgroup at the end of A waits for all descendants]
The Barrier and Taskwait Constructs

• Task Synchronization explained:

#pragma omp parallel num_threads(np)
{
   #pragma omp task        // np Tasks created here, one by each thread
   function_A();
   #pragma omp barrier     // All Tasks guaranteed to be completed here
   #pragma omp single
   {
      #pragma omp task
      function_B();        // 1 Task created here
   }
}                          // B-Task guaranteed to be completed here
Questions?
Loops with Tasks
The taskloop Construct

• Task generating construct: decompose a loop into chunks, create a task for each loop chunk

#pragma omp taskloop [clause[[,] clause]…]
{structured-for-loops}

◼ Where clause is one of:
  → Data Environment: shared(list), private(list), firstprivate(list), lastprivate(list), default(sh | pr | fp | none), reduction(r-id: list), in_reduction(r-id: list)
  → Cutoff Strategies: if(scalar-expression), final(scalar-expression), mergeable
  → Scheduler (R/H): untied, priority(priority-value)
  → Chunks/Grain: grainsize(grain-size), num_tasks(num-tasks)
  → Miscellaneous: collapse(n), nogroup, allocate([allocator:] list)
Worksharing vs. taskloop constructs (1/2)

subroutine worksharing
   integer :: x
   integer :: i
   integer, parameter :: T = 16
   integer, parameter :: N = 1024

   x = 0
   !$omp parallel shared(x) num_threads(T)
   !$omp do
   do i = 1,N
      !$omp atomic
      x = x + 1
      !$omp end atomic
   end do
   !$omp end do
   !$omp end parallel
   write (*,'(A,I0)') 'x = ', x
end subroutine

subroutine taskloop
   integer :: x
   integer :: i
   integer, parameter :: T = 16
   integer, parameter :: N = 1024

   x = 0
   !$omp parallel shared(x) num_threads(T)
   !$omp taskloop
   do i = 1,N
      !$omp atomic
      x = x + 1
      !$omp end atomic
   end do
   !$omp end taskloop
   !$omp end parallel
   write (*,'(A,I0)') 'x = ', x
end subroutine
Worksharing vs. taskloop constructs (2/2)

The worksharing subroutine is unchanged; in the taskloop version the loop is now enclosed in a single construct, so only one thread creates the loop tasks:

subroutine taskloop
   integer :: x
   integer :: i
   integer, parameter :: T = 16
   integer, parameter :: N = 1024

   x = 0
   !$omp parallel shared(x) num_threads(T)
   !$omp single
   !$omp taskloop
   do i = 1,N
      !$omp atomic
      x = x + 1
      !$omp end atomic
   end do
   !$omp end taskloop
   !$omp end single
   !$omp end parallel
   write (*,'(A,I0)') 'x = ', x
end subroutine
Taskloop decomposition approaches

◼ Clause: grainsize(grain-size)
  → Chunks have at least grain-size iterations
  → Chunks have a maximum of 2x grain-size iterations

int TS = 4 * 1024;
#pragma omp taskloop grainsize(TS)
for ( i = 0; i<SIZE; i+=1) {
   A[i]=A[i]*B[i]*S;
}

◼ Clause: num_tasks(num-tasks)
  → Create num-tasks chunks
  → Each chunk must have at least one iteration

int NT = 4 * omp_get_num_threads();
#pragma omp taskloop num_tasks(NT)
for ( i = 0; i<SIZE; i+=1) {
   A[i]=A[i]*B[i]*S;
}

◼ If none of the previous clauses is present, the number of chunks and the number of iterations per chunk is implementation defined

◼ Additional considerations:
  → The order of the creation of the loop tasks is unspecified
  → Taskloop creates an implicit taskgroup region; nogroup → no implicit taskgroup region is created
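A small runnable sketch (assumed sizes and grainsize, not from the slides) combining the taskloop pattern with the single-creator idiom:

C/C++
#include <stdio.h>
#include <omp.h>

#define SIZE (1 << 16)

int main(void)
{
    static double A[SIZE], B[SIZE];
    double S = 2.0;

    for (int i = 0; i < SIZE; i++) { A[i] = 1.0; B[i] = 3.0; }

    #pragma omp parallel
    #pragma omp single
    {
        // One thread creates the loop tasks; chunks of at least 4096 iterations.
        #pragma omp taskloop grainsize(4096)
        for (int i = 0; i < SIZE; i++)
            A[i] = A[i] * B[i] * S;
    }   // the implicit taskgroup of the taskloop has completed here

    printf("A[0] = %f\n", A[0]);
    return 0;
}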
Questions?
Task Scheduling
Tasks in OpenMP: Scheduling

• Default: Tasks are tied to the thread that first executes them → not necessarily the creator. Scheduling constraints:
− Only the thread a task is tied to can execute it
− A task can only be suspended at task scheduling points
▪ Task creation, task finish, taskwait, barrier, taskyield
− If task is not suspended in a barrier, executing thread can only switch to a direct descendant of all tasks tied
to the thread

• Tasks created with the untied clause are never tied


− Resume at task scheduling points possibly by different thread
− No scheduling restrictions, e.g., can be suspended at any point
− But: More freedom to the implementation, e.g., load balancing

Unsafe use of untied Tasks

• Problem: Because untied tasks may migrate between threads at any task scheduling point, thread-centric constructs can yield unexpected results

• Remember when using untied tasks:
  − Avoid threadprivate variables
  − Avoid any use of thread-ids (i.e., omp_get_thread_num())
  − Be careful with critical regions and locks

• Simple Solution:
  − Create a tied task region with
      #pragma omp task if(0)
The taskyield Directive

• The taskyield directive specifies that the current task can be suspended in favor of execution of a different task.
  − Hint to the runtime for optimization and/or deadlock prevention

C/C++                       Fortran
#pragma omp taskyield       !$omp taskyield
taskyield Example (1/2)

#include <omp.h>

void something_useful();
void something_critical();

void foo(omp_lock_t * lock, int n)
{
   for(int i = 0; i < n; i++)
      #pragma omp task
      {
         something_useful();
         while( !omp_test_lock(lock) ) {
            #pragma omp taskyield
         }
         something_critical();
         omp_unset_lock(lock);
      }
}
taskyield Example (2/2)

#include <omp.h>

void something_useful();
void something_critical();

void foo(omp_lock_t * lock, int n)
{
   for(int i = 0; i < n; i++)
      #pragma omp task
      {
         something_useful();
         while( !omp_test_lock(lock) ) {
            #pragma omp taskyield   // The waiting task may be suspended here and
                                    // allow the executing thread to perform other
                                    // work; may also avoid deadlock situations.
         }
         something_critical();
         omp_unset_lock(lock);
      }
}
Tasks and Dependencies

• Catchy example: Building a house

[Figure: dependency graph for building a house, with tasks such as Build Walls, Put up Roof, Shingle Roof, Install Electrical, Drywall, Exterior Siding and Landscape]
Tasks and Dependencies

• Task dependencies constrain the execution order and the execution times of tasks

• Fine-grained synchronization of tasks

Traditional taskwait:
int x = 0;
#pragma omp parallel
#pragma omp single
{
   #pragma omp task
   std::cout << x << std::endl;
   #pragma omp taskwait
   #pragma omp task
   x++;
}

With dependencies:
int x = 0;
#pragma omp parallel
#pragma omp single
{
   #pragma omp task depend(in: x)
   std::cout << x << std::endl;
   #pragma omp task depend(inout: x)
   x++;
}

[Figure: timeline of task creation and execution for both variants — in each case task t2 (the increment) only executes after task t1 (the read); the taskwait blocks the creating task, while the depend clauses order only the two dependent tasks]
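A compilable sketch (assumed, not from the slides) of a small dependency chain on a shared variable:

C/C++
#include <stdio.h>

int main(void)
{
    int x = 0;

    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: x)    // producer
        x = 42;

        #pragma omp task depend(in: x)     // consumer: runs after the producer
        printf("x = %d\n", x);

        #pragma omp task depend(inout: x)  // runs after both tasks above
        x++;
    }

    printf("final x = %d\n", x);           // 43: all tasks done at the implicit barrier
    return 0;
}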
Questions?
NUMA Architectures
NUMA
Non-Uniform Memory Architecture

How To Distribute The Data?

double* A;
A = (double*) malloc(N * sizeof(double));

for (int i = 0; i < N; i++) {
   A[i] = 0.0;
}

[Figure: two NUMA nodes, each with cores, on-chip caches and local memory, connected by an interconnect; in which memory do the elements A[0] … A[N] end up?]
About Data Distribution

• Important aspect on cc-NUMA systems


− If not optimal, longer memory access times and hotspots

• At first sight, OpenMP does not provide explicit support for cc-NUMA

• Placement comes from the Operating System


− This is therefore Operating System dependent
− OpenMP 5.0 introduced Memory Management to provide fine-grained control

• Windows, Linux and Solaris all use the “First Touch” placement policy by default
− May be possible to override default (check the docs)

Non-Uniform Memory Architecture

• Serial code: all array elements are allocated in the memory of the NUMA node containing the core executing this thread

double* A;
A = (double*) malloc(N * sizeof(double));

for (int i = 0; i < N; i++) {
   A[i] = 0.0;
}

[Figure: with serial initialization, A[0] … A[N] all end up in the memory of the first NUMA node]
Non-Uniform Memory Architecture

• First Touch w/ parallel code: all array elements are allocated in the memory of the NUMA node containing the core executing the thread initializing the respective partition

double* A;
A = (double*) malloc(N * sizeof(double));

omp_set_num_threads(4);

#pragma omp parallel for
for (int i = 0; i < N; i++) {
   A[i] = 0.0;
}

[Figure: with parallel first-touch initialization, A[0] … A[N/2] end up in the memory of the first NUMA node and A[N/2] … A[N] in the memory of the second]
Get Info on the System Topology

• Before you design a strategy for thread binding, you should have a basic understanding of the system topology. Please use one of the following options on a target machine:

  − Intel MPI's cpuinfo tool
    ▪ module switch openmpi intelmpi
    ▪ cpuinfo
    ▪ Delivers information about the number of sockets (= packages) and the mapping of processor ids used by the operating system to cpu cores.

  − hwloc's tools
    ▪ lstopo (command line: hwloc-ls)
    ▪ Displays a graphical representation of the system topology, separated into NUMA nodes, along with the mapping of processor ids used by the operating system to cpu cores and additional info on caches.
Decide for Binding Strategy

• Selecting the "right" binding strategy depends not only on the topology, but also on the characteristics of your application.

  − Putting threads far apart, i.e. on different sockets
    ▪ May improve the aggregated memory bandwidth available to your application
    ▪ May improve the combined cache size available to your application
    ▪ May decrease performance of synchronization constructs

  − Putting threads close together, i.e. on two adjacent cores which possibly share some caches
    ▪ May improve performance of synchronization constructs
    ▪ May decrease the available memory bandwidth and cache size

• If you are unsure, just try a few options and then select the best one.
OpenMP 4.0: Places + Binding Policies (1/2)

• Define OpenMP Places
  − set of OpenMP threads running on one or more processors
  − can be defined by the user, i.e. OMP_PLACES=cores

• Define a set of OpenMP Thread Affinity Policies
  − SPREAD: spread OpenMP threads evenly among the places
  − CLOSE: pack OpenMP threads near the master thread
  − MASTER: collocate OpenMP threads with the master thread

• Goals
  − user has a way to specify where to execute OpenMP threads for
    ▪ locality between OpenMP threads / less false sharing / memory bandwidth
Places

• Assume the following machine:

   p0 p1 p2 p3 p4 p5 p6 p7

  − 2 sockets, 4 cores per socket, 4 hyper-threads per core

• Abstract names for OMP_PLACES:
  − threads: Each place corresponds to a single hardware thread on the target machine.
  − cores: Each place corresponds to a single core (having one or more hardware threads) on the target machine.
  − ll_caches (5.1): Each place corresponds to a set of cores that share the last level cache.
  − numa_domains (5.1): Each place corresponds to a set of cores for which their closest memory is the same memory and at a similar distance from the cores.
  − sockets: Each place corresponds to a single socket (consisting of one or more cores) on the target machine.
OpenMP 4.0: Places + Binding Policies (2/2)

• Example's Objective:
  − separate cores for the outer loop and near cores for the inner loop
• Outer Parallel Region: proc_bind(spread), Inner: proc_bind(close)
  − spread creates a partition, close binds threads within the respective partition

OMP_PLACES=(0,1,2,3), (4,5,6,7), ... = (0-3):8:4 = cores

#pragma omp parallel proc_bind(spread) num_threads(4)
#pragma omp parallel proc_bind(close) num_threads(4)

[Figure: 8 places p0 … p7, showing the initial placement, the distribution after spread with 4 threads, and the placement after close with 4 threads within each partition]
Serial vs. Parallel Initialization

• Performance of OpenMP-parallel STREAM vector assignment measured on a 2-socket Intel® Xeon® X5675 („Westmere") using the Intel® Composer XE 2013 compiler with different thread binding options:

[Figure: bandwidth in MB/s (0–50000) over 1–24 threads for: serial init. / no binding, serial init. / close binding, serial init. / spread binding, NUMA aware init. / close binding, NUMA aware init. / spread binding]
Questions?
