
Introduction to OpenMP

OpenMP in small bites: Overview

Dr. Christian Terboven


History

• De-facto standard for Shared-Memory Parallelization.

• 1997: OpenMP 1.0 for FORTRAN
• 1998: OpenMP 1.0 for C and C++
• 1999: OpenMP 1.1 for FORTRAN
• 2000: OpenMP 2.0 for FORTRAN
• 2002: OpenMP 2.0 for C and C++
• 2005: OpenMP 2.5 now includes both programming languages.
• 05/2008: OpenMP 3.0
• 07/2011: OpenMP 3.1
• 07/2013: OpenMP 4.0
• 11/2015: OpenMP 4.5
• 11/2018: OpenMP 5.0

http://www.OpenMP.org

RWTH Aachen University is a member of the OpenMP Architecture Review Board (ARB) since 2006.
What is OpenMP?

• De-facto standard Application Programming Interface (API) to write shared memory parallel applications in C, C++, and Fortran

• Compiler Directives, Runtime routines and Environment variables
OpenMP Overview & Parallel Region
OpenMP's machine model

• OpenMP: Shared-Memory Parallel Programming Model

• All processors/cores access a shared main memory.

• Real architectures are more complex, as we will see later / as we have seen.

• Parallelization in OpenMP employs multiple threads.

[Figure: four processors, each with its own cache, connected via a crossbar/bus to a shared main memory]
OpenMP Memory Model

• All threads have access to the same, globally shared memory

• Data in private memory is only accessible by the thread owning this memory

• No other thread sees the change(s) in private memory

• Data transfer is through shared memory and is 100% transparent to the application

[Figure: several threads (T), each with its own private memory, attached to one shared memory]
OpenMP Execution Model

• OpenMP programs start with just one thread: the Master.

• Worker threads are spawned at Parallel Regions; together with the Master they form the Team of threads.

• In between Parallel Regions the Worker threads are put to sleep. The OpenMP Runtime takes care of all thread management work.

• Concept: Fork-Join.

• Allows for an incremental parallelization!

[Figure: the Master thread executes the serial parts; at each Parallel Region a team of Worker threads is forked and joined again afterwards]
Parallel Region and Structured Blocks

• The parallelism has to be expressed explicitly.

C/C++
#pragma omp parallel
{
   ...
   structured block
   ...
}

Fortran
!$omp parallel
   ...
   structured block
   ...
!$omp end parallel

• Structured Block
  − Exactly one entry point at the top
  − Exactly one exit point at the bottom
  − Branching in or out is not allowed
  − Terminating the program is allowed (abort / exit)

• Specification of the number of threads:
  − Environment variable: OMP_NUM_THREADS=…
  − Or: via the num_threads clause: add num_threads(num) to the parallel construct
Demo

Hello OpenMP World
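A minimal sketch of such a program (an assumed example, not necessarily the original demo source) could look like this:

C/C++
#include <stdio.h>
#include <omp.h>

int main(void)
{
    // Each thread of the team executes the structured block.
    #pragma omp parallel
    {
        printf("Hello OpenMP World from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}

Compile with OpenMP enabled (e.g. -fopenmp) so that the directive and the runtime routines take effect.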

Starting OpenMP Programs on Linux

• From within a shell, global setting of the number of threads:


export OMP_NUM_THREADS=4
./program

• From within a shell, one-time setting of the number of threads:


OMP_NUM_THREADS=4 ./program

Questions?
For Worksharing Construct
For Worksharing

• If only the parallel construct is used, each thread executes the whole Structured Block.

• Program Speedup: Worksharing

• OpenMP's most common Worksharing construct: for

C/C++
int i;
#pragma omp for
for (i = 0; i < 100; i++)
{
   a[i] = b[i] + c[i];
}

Fortran
INTEGER :: i
!$omp do
DO i = 0, 99
   a(i) = b(i) + c(i)
END DO

  − Distribution of loop iterations over all threads in a Team.
  − Scheduling of the distribution can be influenced.

• Loops often account for most of a program's runtime!
Worksharing illustrated

Serial pseudo-code:
do i = 0, 99
   a(i) = b(i) + c(i)
end do

Here: 4 Threads
Thread 1: do i = 0, 24   ... a(i) = b(i) + c(i) ... end do
Thread 2: do i = 25, 49  ... a(i) = b(i) + c(i) ... end do
Thread 3: do i = 50, 74  ... a(i) = b(i) + c(i) ... end do
Thread 4: do i = 75, 99  ... a(i) = b(i) + c(i) ... end do

All threads operate on the same arrays A(0)…A(99), B(0)…B(99), C(0)…C(99) in shared memory.
Demo

Vector Addition
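A self-contained sketch of such a vector addition (assumed code, not necessarily the original demo) is:

C/C++
#include <stdio.h>

#define N 100

int main(void)
{
    double a[N], b[N], c[N];

    // Initialize the input vectors serially.
    for (int i = 0; i < N; i++) {
        b[i] = (double)i;
        c[i] = 2.0 * (double)i;
    }

    // The loop iterations are distributed over the team of threads.
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        a[i] = b[i] + c[i];
    }

    printf("a[%d] = %f\n", N - 1, a[N - 1]);
    return 0;
}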

Influencing the For Loop Scheduling

• for-construct: OpenMP allows influencing how the iterations are scheduled among the threads of the team, via the schedule clause:
  − schedule(static [, chunk]): Iteration space divided into blocks of chunk size; blocks are assigned to threads in a round-robin fashion. If chunk is not specified: #threads blocks.
  − schedule(dynamic [, chunk]): Iteration space divided into blocks of chunk size (not specified: 1); blocks are scheduled to threads in the order in which threads finish previous blocks.
  − schedule(guided [, chunk]): Similar to dynamic, but the block size starts with an implementation-defined value, then is decreased exponentially down to chunk.

• Default on most implementations is schedule(static).
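As an illustration (assumed example values, not from the slides), the clause can be attached to a combined parallel for; here dynamic scheduling hands out chunks of 4 iterations to whichever thread is free:

C/C++
#include <stdio.h>
#include <omp.h>

#define N 16

int main(void)
{
    int owner[N];

    #pragma omp parallel for schedule(dynamic, 4)
    for (int i = 0; i < N; i++) {
        owner[i] = omp_get_thread_num();   // record which thread ran iteration i
    }

    for (int i = 0; i < N; i++)
        printf("iteration %2d executed by thread %d\n", i, owner[i]);
    return 0;
}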
Synchronization Overview

• Can all loops be parallelized with for-constructs? No!
  − Simple test: If the results differ when the code is executed backwards, the loop iterations are not independent. BUT: This test alone is not sufficient.

C/C++
int i, s = 0;
#pragma omp parallel for
for (i = 0; i < 100; i++)
{
   s = s + a[i];
}

This is a combined construct, which is semantically equivalent to
#pragma omp parallel
#pragma omp for
for(...) {…}

• Data Race: If between two synchronization points at least one thread writes to a memory location from which at least one other thread reads, the result is not deterministic (race condition).
Synchronization: Critical Region


• A Critical Region is executed by all threads, but by only one thread simultaneously (Mutual
Exclusion).
C/C++
#pragma omp critical (name)
{
... structured block ...
}

• Do you think this solution scales well?


C/C++
int i, s = 0;
#pragma omp parallel for
for (i = 0; i < 100; i++)
{
#pragma omp critical
{ s = s + a[i]; }
}

The Barrier Construct
The Barrier Construct

• OpenMP barrier (implicit or explicit)


− Threads wait until all threads of the current Team have reached the barrier

C/C++
#pragma omp barrier


• All worksharing constructs contain an implicit barrier at the end
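A small sketch (assumed, not from the slides) of an explicit barrier separating two phases of work inside a parallel region:

C/C++
#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        printf("phase 1: thread %d\n", omp_get_thread_num());

        // No thread starts phase 2 before all threads have finished phase 1.
        #pragma omp barrier

        printf("phase 2: thread %d\n", omp_get_thread_num());
    }
    return 0;
}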

Single and Master Construct
The Single Construct

C/C++ Fortran
#pragma omp single [clause] !$omp single [clause]
... structured block ... ... structured block ...
!$omp end single

• The single construct specifies that the enclosed structured block is executed by only one thread of the team.


− It is up to the runtime which thread that is.

• Useful for:
− I/O
− Memory allocation and deallocation, etc. (in general: setup work)
− Implementation of the single-creator parallel-executor pattern as we will see soon…
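A sketch (assumed, not from the slides) of the typical pattern: one thread performs the setup, then the whole team works on the data.

C/C++
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void)
{
    int n = 1000;
    double *data = NULL;

    #pragma omp parallel shared(data, n)
    {
        // Exactly one (arbitrary) thread allocates; the implicit barrier at the
        // end of single guarantees data is ready before any thread uses it.
        #pragma omp single
        {
            data = malloc(n * sizeof(double));
            printf("allocation done by thread %d\n", omp_get_thread_num());
        }

        #pragma omp for
        for (int i = 0; i < n; i++)
            data[i] = 0.0;
    }

    free(data);
    return 0;
}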

The Master Construct

C/C++ Fortran
#pragma omp master[clause] !$omp master[clause]


... structured block ... ... structured block ...
!$omp end master

• The master construct specifies that the enclosed structured block is executed only by the master
thread of a team.

• Note: The master construct is not a worksharing construct and does not contain an implicit barrier at the end.

Runtime Library

• C and C++:
  − If OpenMP is enabled during compilation, the preprocessor symbol _OPENMP is defined. To use the OpenMP runtime library, the header omp.h has to be included.
  − omp_set_num_threads(int): The specified number of threads will be used for the parallel region encountered next.
  − int omp_get_num_threads(): Returns the number of threads in the current team.
  − int omp_get_thread_num(): Returns the number of the calling thread in the team; the Master always has the id 0.

• Additional functions are available, e.g. to provide locking functionality.
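A small sketch (assumed, not from the slides) that uses these routines together and guards them with the _OPENMP symbol:

C/C++
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

int main(void)
{
#ifdef _OPENMP
    omp_set_num_threads(4);   // request 4 threads for the next parallel region

    #pragma omp parallel
    {
        // Each thread queries its own id and the team size.
        printf("Thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
#else
    printf("Compiled without OpenMP support.\n");
#endif
    return 0;
}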
Questions?
Data Scoping
Scoping Rules

• Managing the Data Environment is the challenge of OpenMP.

• Scoping in OpenMP: dividing variables into shared and private:
  − private-list and shared-list on Parallel Region
  − private-list and shared-list on Worksharing constructs
  − General default is shared for Parallel Regions, firstprivate for Tasks (Tasks are introduced later).
  − Loop control variables on for-constructs are private
  − Non-static variables local to Parallel Regions are private
  − private: A new, uninitialized instance is created for the task or for each thread executing the construct
    ▪ firstprivate: Initialization with the value before encountering the construct
    ▪ lastprivate: Value of the last loop iteration is written back to the Master
  − Static variables are shared
Privatization of Global/Static Variables

• Global / static variables can be privatized with the threadprivate directive


− One instance is created for each thread
▪ Before the first parallel region is encountered
▪ Instance exists until the program ends
▪ Does not work (well) with nested Parallel Region
− Based on thread-local storage (TLS)
▪ TlsAlloc (Win32-Threads), pthread_key_create (Posix-Threads), keyword __thread (GNU extension)

C/C++ Fortran
static int i; SAVE INTEGER :: i
#pragma omp threadprivate(i) !$omp threadprivate(i)
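A small illustrative sketch (assumed, not from the slides) of a threadprivate counter:

C/C++
#include <stdio.h>
#include <omp.h>

static int counter;                  // one instance per thread
#pragma omp threadprivate(counter)

int main(void)
{
    #pragma omp parallel num_threads(4)
    {
        counter = omp_get_thread_num();   // each thread writes its own copy
    }

    // With an unchanged team size and dynamic thread adjustment disabled,
    // the per-thread values persist into the next parallel region.
    #pragma omp parallel num_threads(4)
    {
        printf("Thread %d sees counter = %d\n",
               omp_get_thread_num(), counter);
    }
    return 0;
}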

Back to our bad scaling example

C/C++
int i, s = 0;
#pragma omp parallel for
for (i = 0; i < 100; i++)
{
#pragma omp critical
{ s = s + a[i]; }
}
It‘s your turn: Make It Scale!

#pragma omp parallel
{
   #pragma omp for
   for (i = 0; i < 100; i++)
   {
      s = s + a[i];
   }
} // end parallel

The serial loop

do i = 0, 99
   s = s + a(i)
end do

is split among the four threads into the chunks i = 0..24, 25..49, 50..74 and 75..99, all of which update the same shared variable s.
The Reduction Clause


• In a reduction-operation the operator is applied to all variables in the list. The variables have to be
shared.
− reduction(operator:list)
− The result is provided in the associated reduction variable
C/C++
int i, s = 0;
#pragma omp parallel for reduction(+:s)
for (i = 0; i < 100; i++)
{
   s = s + a[i];
}

  − Possible reduction operators with initialization value:
    + (0), * (1), - (0), & (~0), | (0), && (1), || (0), ^ (0), min (largest number), max (least number)
Reduction Operations

int a = 0;
#pragma omp parallel
#pragma omp for reduction(+:a)
for (int i = 0; i < 100; i++)
{
   a += i;
}

[Figure: in the serial part a = 0; when the parallel region starts, each thread gets a local copy initialized to 0; the threads compute partial sums (e.g. 300, 925, 1550, 2175) in their local copies; at the end of the for construct the reduction combines them and writes the final result 4950 to the shared variable a.]
Questions?
False Sharing
Memory Bottleneck

• There is a growing gap between core and memory performance:



− memory, since 1980: 1.07x per year improvement in latency
− single core: since 1980: 1.25x per year until 1986, 1.52x p. y. until 2000,
1.20x per year until 2005, then no change on a per-core basis

[Figure: SPECint benchmark performance and memory latency over time, illustrating the growing gap]

  − Source: John L. Hennessy, Stanford University, and David A. Patterson, University of California, September 25, 2012
Caches

• CPU is fast
  − Order of 3.0 GHz

• Caches:
  − Fast, but expensive
  − Thus small, order of MB

• Memory is slow
  − Order of 0.3 GHz
  − Large, order of GB

[Figure: hierarchy from core via on-chip cache and off-chip cache to memory]

• A good utilization of caches is crucial for good performance of HPC applications!
Visualization of the Memory Hierarchy

• Latency on the Intel Westmere-EP 3.06 GHz processor

[Figure: memory access latency in ns (0–20) versus memory footprint from 1 B to 2 GB, with the regions served by the L1, L2 and L3 caches marked]
Data in Caches

• When data is used, it is copied into caches.

• The hardware always copies chunks into the cache, so-called cache lines.

• This is useful when:
  − the data is used frequently (temporal locality)
  − consecutive data is used which is on the same cache line (spatial locality)

[Figure: two cores, each with an on-chip cache, connected via a bus to memory]
False Sharing

• False Sharing occurs when
  − different threads use elements of the same cache line
  − one of the threads writes to the cache line

• As a result the cache line is moved between the threads, although there is no real data dependency.

• Note: False Sharing is a performance problem, not a correctness issue.

[Figure: two cores, each with an on-chip cache, connected via a bus to memory; the written cache line bounces between the two caches]
Summing up vector elements again

#pragma omp parallel
{
   #pragma omp for
   for (i = 0; i < 100; i++)
   {
      s = s + a[i];
   }
} // end parallel

The serial loop

do i = 0, 99
   s = s + a(i)
end do

is split among the four threads into the chunks i = 0..24, 25..49, 50..74 and 75..99, all of which update the same shared variable s.
False Sharing

C/C++
double s_priv[nthreads];
#pragma omp parallel num_threads(nthreads)
{
   int t = omp_get_thread_num();
   #pragma omp for
   for (i = 0; i < 100; i++)
   {
      s_priv[t] += a[i];
   }
} // end parallel
for (i = 0; i < nthreads; i++)
{
   s += s_priv[i];
}
False Sharing

• No performance benefit for more threads!
• Reason: false sharing of s_priv
• Solution: padding so that only one variable per cache line is used

[Figure: MFLOPS over 1–12 threads, with and without false sharing; the version with false sharing does not scale]

Standard layout: several elements of s_priv (1, 2, 3, 4, …) share cache line 1, so the threads invalidate each other's cache lines.
With padding: each element (1, 2, 3, …) occupies its own cache line.
False Sharing avoided

double s_priv[nthreads * 8];


#pragma omp parallel num_threads(nthreads)
{
int t=omp_get_thread_num();
#pragma omp for
for (i = 0; i < 100; i++)
{
s_priv[t * 8] += a[i];
}
} // end parallel
for (i = 0; i < nthreads; i++)
{
s += s_priv[i * 8];
}
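An alternative to manual index padding is to make the padding explicit in the data type. The following sketch (assumed, not from the slides) presumes 64-byte cache lines:

C/C++
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define CACHE_LINE 64            /* assumed cache-line size in bytes */
#define N 100

/* Each per-thread partial sum occupies its own cache line. */
typedef struct {
    double value;
    char pad[CACHE_LINE - sizeof(double)];
} padded_double;

int main(void)
{
    int nthreads = 4;            /* example team size */
    double a[N], s = 0.0;
    for (int i = 0; i < N; i++) a[i] = 1.0;

    padded_double *s_priv = calloc((size_t)nthreads, sizeof(padded_double));

    #pragma omp parallel num_threads(nthreads)
    {
        int t = omp_get_thread_num();
        #pragma omp for
        for (int i = 0; i < N; i++)
            s_priv[t].value += a[i];
    }

    for (int i = 0; i < nthreads; i++)
        s += s_priv[i].value;

    printf("s = %f\n", s);
    free(s_priv);
    return 0;
}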
Example: PI
Example: Pi

   π = ∫₀¹ 4 / (1 + x²) dx

double f(double x)
{
   return (4.0 / (1.0 + x*x));
}

double CalcPi (int n)
{
   const double fH = 1.0 / (double) n;
   double fSum = 0.0;
   double fX;
   int i;

   #pragma omp parallel for
   for (i = 0; i < n; i++)
   {
      fX = fH * ((double)i + 0.5);
      fSum += f(fX);
   }
   return fH * fSum;
}
Example: Pi

What is wrong with this code?

(The code is the same as on the previous slide: the loop is parallelized with a plain parallel for, while fX and fSum are shared and updated by all threads.)
Example: Pi

double f(double x)
{
   return (4.0 / (1.0 + x*x));
}

double CalcPi (int n)
{
   const double fH = 1.0 / (double) n;
   double fSum = 0.0;
   double fX;
   int i;

   #pragma omp parallel for private(fX,i) reduction(+:fSum)
   for (i = 0; i < n; i++)
   {
      fX = fH * ((double)i + 0.5);
      fSum += f(fX);
   }
   return fH * fSum;
}
Example: Pi

What if we had forgotten the private(fX,i) reduction(+:fSum) clause?

(The code is otherwise the same as on the previous slide; the next slide discusses the resulting data race.)
Race Condition

• Data Race: the typical OpenMP programming error, when:
  − two or more threads access the same memory location, and
  − at least one of these accesses is a write, and
  − the accesses are not protected by locks or critical regions, and
  − the accesses are not synchronized, e.g. by a barrier.

• Non-deterministic occurrence: e.g. the sequence of the execution of parallel loop iterations is non-deterministic and may change from run to run

• In many cases private clauses, barriers or critical regions are missing

• Data races are hard to find using a traditional debugger
  − Use tools like Intel Inspector XE, ThreadSanitizer, Archer
Inspector XE – Results

[Screenshot: Intel Inspector XE showing (1) the detected problems, (2) filters and (3) the code location; the missing reduction is detected.]
Example: Pi

What if we just made the fSum variable private?
→ fSum == 0 after the loop (no update to the global variable), so CalcPi returns 0.

double f(double x)
{
   return (4.0 / (1.0 + x*x));
}

double CalcPi (int n)
{
   const double fH = 1.0 / (double) n;
   double fSum = 0.0;
   double fX;
   int i;

   #pragma omp parallel for private(fX,i,fSum)
   for (i = 0; i < n; i++)
   {
      fX = fH * ((double)i + 0.5);
      fSum += f(fX);
   }
   return fH * fSum;
}
Example: Pi Scalability

• Results for n = 2 · 10^9:


# Threads Runtime [sec.] Speedup
1 1.141 1.00
2 0.575 1.96
4 0.298 3.93
8 0.161 7.08
System: CLAIX2018 Node (Intel Xeon 8160)
Compiler: Intel Compiler 19.0, Flags: –fopenmp –O3

• Scalability is good (for up to 8 threads):


− About 100% of the runtime has been parallelized.
− As there is just one parallel region, there is virtually no overhead introduced by the parallelization.

− Problem is parallelizable in a trivial fashion ...

Questions?
Recursive approach to compute Fibonacci

• Fibonacci numbers
  − Form a sequence F(n) such that each number is the sum of the two preceding ones
  − F(0) = 0, F(1) = 1
  − F(n) = F(n-1) + F(n-2) (for n > 1)

int fib(int n) {
   if (n < 2) return n;
   int x = fib(n - 1);
   int y = fib(n - 2);
   return x+y;
}

int main(int argc, char* argv[])
{
   [...]
   fib(input);
   [...]
}

• On the following slides we will discuss three approaches to parallelize this recursive code with Tasking.
The task construct

• Deferring (or not) a unit of work (executable by any member of the team)
  − Always attached to a structured block

C/C++
#pragma omp task [clause[[,] clause]...]
{structured-block}

Fortran
!$omp task [clause[[,] clause]...]
   structured-block
!$omp end task

◼ Where clause is one of:
  → Data Environment: private(list), firstprivate(list), shared(list), default(shared | none), in_reduction(r-id: list) (≥ 5.0)
  → Cutoff Strategies: if(scalar-expression), mergeable, final(scalar-expression)
  → Dependencies: depend(dep-type: list)
  → Other: priority(priority-value), untied
Tasks in OpenMP: Data Scoping

• Some rules from Parallel Regions apply:

− Static and Global variables are shared
− Automatic Storage (local) variables are private
− Task variables are firstprivate unless shared in the enclosing context
▪ Only shared attribute is inherited
▪ Exception: Orphaned Task variables are firstprivate by default!

First version parallelized with Tasking (omp-v1)

int main(int argc, char* argv[])
{
   [...]
   #pragma omp parallel
   {
      #pragma omp single
      {
         fib(input);
      }
   }
   [...]
}

int fib(int n) {
   if (n < 2) return n;
   int x, y;
   #pragma omp task shared(x)
   {
      x = fib(n - 1);
   }
   #pragma omp task shared(y)
   {
      y = fib(n - 2);
   }
   #pragma omp taskwait
   return x+y;
}

• Only one Task / Thread enters fib() from main(); it is responsible for creating the two initial work tasks
• Taskwait is required, as otherwise x and y would be lost
taskwait directive

• The taskwait directive (shallow task synchronization)
  − It is a stand-alone directive

#pragma omp taskwait

  − wait on the completion of child tasks of the current task; just direct children, not all descendant tasks; includes an implicit task scheduling point (TSP)

#pragma omp parallel
#pragma omp single
{
   #pragma omp task            // task A
   {
      #pragma omp task         // task B
      { … }
      #pragma omp task         // task C
      { … #C.1; #C.2; … }
      #pragma omp taskwait     // waits for B and C, but not for C.1 and C.2
   }
} // implicit barrier will wait for C.x

[Figure: task tree A → {B, C}, C → {C.1, C.2}; the taskwait inside A waits only for its direct children B and C]
Fibonacci Illustration

• T1 enters fib(4)
• T1 creates tasks for fib(3) and fib(2)
• T1 and T2 execute tasks from the queue
• T1 and T2 create 4 new tasks
• T1 - T4 execute tasks

[Figure: call tree of fib(4) — fib(3) and fib(2) at the first level, fib(2), fib(1), fib(1), fib(0) below — and the task queue holding the created tasks]
Fibonacci Illustration

• T1 enters fib(4)
• T1 creates tasks for fib(3) and fib(2)
• T1 and T2 execute tasks from the queue
• T1 and T2 create 4 new tasks
• T1 - T4 execute tasks
• …

[Figure: the complete task tree of fib(4), now expanded down to fib(1) and fib(0)]
Scalability measurements (1/3)

• Overhead of task creation prevents scalability!

• The OpenMP runtime optimizes when there is only one thread!

[Figure: Speedup of Fibonacci with Tasks over 1–16 threads; omp-v1 compared to the optimal speedup]
if Clause

• The if clause of a task construct
  − allows optimizing task creation/execution
  − reduces parallelism but also reduces the pressure in the runtime's task pool
  − for "very" fine-grained tasks you may need to do your own (manual) if

#pragma omp task if(expression)
{structured-block}

• If the expression of the if clause evaluates to false
  − the encountering task is suspended
  − the new task is executed immediately
  − the parent task resumes when the task finishes

• This is known as an undeferred task
Improved parallelization with Tasking (omp-v2)

• Improvement: Don't create yet another task once a certain (small enough) n is reached

int main(int argc, char* argv[])
{
   [...]
   #pragma omp parallel
   {
      #pragma omp single
      {
         fib(input);
      }
   }
   [...]
}

int fib(int n) {
   if (n < 2) return n;
   int x, y;
   #pragma omp task shared(x) if(n > 30)
   {
      x = fib(n - 1);
   }
   #pragma omp task shared(y) if(n > 30)
   {
      y = fib(n - 2);
   }
   #pragma omp taskwait
   return x+y;
}
Scalability measurements (2/3)
login-t, E5-2650 v4, 2x 12 cores @ 2.20 GHz
Intel Compiler 16.0.2, fib(45) = 1134903170

• Speedup is better, but still not great

• Small tasks still have to be allocated in omp-v2!

[Figure: Speedup of Fibonacci with Tasks over 1–16 threads for omp-v1 and omp-v2, compared to the optimal speedup]
Improved parallelization with Tasking (omp-v3)

• Improvement: Skip the OpenMP overhead once a certain n is reached (no issue w/ production compilers)

int main(int argc, char* argv[])
{
   [...]
   #pragma omp parallel
   {
      #pragma omp single
      {
         fib(input);
      }
   }
   [...]
}

int fib(int n) {
   if (n < 2) return n;
   if (n <= 30)
      return serfib(n);
   int x, y;
   #pragma omp task shared(x)
   {
      x = fib(n - 1);
   }
   #pragma omp task shared(y)
   {
      y = fib(n - 2);
   }
   #pragma omp taskwait
   return x+y;
}
Scalability measurements (3/3)
login-t, E5-2650 v4, 2x 12 cores @ 2.20 GHz
Intel Compiler 16.0.2, fib(45) = 1134903170

• Looks promising…

[Figure: Speedup of Fibonacci with Tasks over 1–16 threads for omp-v1, omp-v2 and omp-v3, compared to the optimal speedup]
Runtime measurements (1/2)
login-t, E5-2650 v4, 2x 12 cores @ 2.20 GHz
Intel Compiler 16.0.2, fib(45) = 1134903170

• First two versions were slow because of overhead!

[Figure: Runtime of Fibonacci with Tasks in seconds (0–450) over 1–16 threads for omp-v1, omp-v2 and omp-v3]
Runtime measurements (2/2)
login-t, E5-2650 v4, 2x 12 cores @ 2.20 GHz
Intel Compiler 16.0.2, fib(45) = 1134903170

• Third version is comparable to the serial version w/o OpenMP ☺

[Figure: Runtime of Fibonacci with Tasks in seconds (0–8) over 1–16 threads for omp-v3 and the serial version]
Tasking Overheads

• Typical overheads in task-based programs are:
  − Task creation: populate task data structure, add task to task queue
  − Task execution: retrieve a task from the queue (may include work stealing)

• If tasks become too fine-grained, overhead becomes noticeable
  − Execution spends a higher relative amount of time in the runtime
  − Task execution contributing to runtime becomes significantly smaller

• A rough rule of thumb to avoid (visible) tasking overhead
  − OpenMP tasks: 80-100k instructions executed per task
  − TBB tasks: 30-50k instructions executed per task
  − Other programming models may have another ideal granularity!
Threads vs Tasks

• Threads do not compose well
  − Example: multi-threaded plugin in a multi-threaded application
  − Composition usually leads to oversubscription and load imbalance

• Task models are inherently composable
  − A pool of threads executes all created tasks
  − Tasks from different modules can freely mix

• Task models make complex algorithms easier to parallelize
  − Programmers can think in concurrent pieces of work
  − Mapping of concurrent execution to threads handled elsewhere
  − Task creation can be irregular (e.g., recursion, graph traversal)
Sometimes You’re Better off with Threads…


• Some scenarios are more amenable for traditional threads
− Granularity too coarse for tasking
− Isolation of autonomous agents

• Static allocation of parallel work is typically easier with threads



− Controlling allocation of work to cache hierarchy

• Graphical User Interfaces (event thread + worker threads)

• Request/response processing, e.g.,


− Web servers
− Database servers

Questions?
Tasks in OpenMP: Data Scoping

• Some rules from Parallel Regions apply:
  − Automatic Storage (local) variables are private
  − Static and Global variables are shared

• Tasking:
  − Variables are firstprivate unless shared in the enclosing context
    ▪ Only the shared attribute is inherited
  − Exception: Orphaned Task variables are firstprivate by default!
    ▪ See an example later on
Example: Data Scoping
Data Scoping Example (1/7)

1 int a = 1;
2 void foo()
3 {
4 int b = 2, c = 3;
5 #pragma omp parallel private(b)
6 {
7 int d = 4;
8 #pragma omp task
9 {
10 int e = 5;
11
12 // Scope of a:
13 // Scope of b:
14 // Scope of c:
15 // Scope of d:
16 // Scope of e:
17 } } }

Data Scoping Example (2/7)

1 int a = 1;
2 void foo()
3 {
4 int b = 2, c = 3;
5 #pragma omp parallel private(b)
6 {
7 int d = 4;
8 #pragma omp task
9 {
10 int e = 5;
11
12 // Scope of a: shared
13 // Scope of b:
14 // Scope of c:
15 // Scope of d:
16 // Scope of e:
17 } } }

Data Scoping Example (3/7)

1 int a = 1;
2 void foo()
3 {
4 int b = 2, c = 3;
5 #pragma omp parallel private(b)
6 {
7 int d = 4;
8 #pragma omp task
9 {
10 int e = 5;
11
12 // Scope of a: shared
13 // Scope of b: firstprivate
14 // Scope of c:
15 // Scope of d:
16 // Scope of e:
17 } } }

Data Scoping Example (4/7)

1 int a = 1;
2 void foo()
3 {
4 int b = 2, c = 3;
5 #pragma omp parallel private(b)
6 {
7 int d = 4;
8 #pragma omp task
9 {
10 int e = 5;
11
12 // Scope of a: shared
13 // Scope of b: firstprivate
14 // Scope of c: shared
15 // Scope of d:
16 // Scope of e:
17 } } }

Data Scoping Example (5/7)

1 int a = 1;
2 void foo()
3 {
4 int b = 2, c = 3;
5 #pragma omp parallel private(b)
6 {
7 int d = 4;
8 #pragma omp task
9 {
10 int e = 5;
11
12 // Scope of a: shared
13 // Scope of b: firstprivate
14 // Scope of c: shared
15 // Scope of d: firstprivate
16 // Scope of e:
17 } } }

Data Scoping Example (6/7)

Hint: Use default(none) to be forced to think about every variable if you do not see clearly.

1 int a = 1;
2 void foo()
3 {
4 int b = 2, c = 3;
5 #pragma omp parallel private(b)
6 {
7 int d = 4;
8 #pragma omp task
9 {
10 int e = 5;
11
12 // Scope of a: shared
13 // Scope of b: firstprivate
14 // Scope of c: shared
15 // Scope of d: firstprivate
16 // Scope of e: private
17 } } }
Data Scoping Example (7/7)

1 int a = 1;
2 void foo()
3 {
4 int b = 2, c = 3;
5 #pragma omp parallel private(b)
6 {
7 int d = 4;
8 #pragma omp task
9 {
10 int e = 5;
11
12 // Scope of a: shared, value of a: 1


13 // Scope of b: firstprivate, value of b: 0 / undefined
14 // Scope of c: shared, value of c: 3
15 // Scope of d: firstprivate, value of d: 4
16 // Scope of e: private, value of e: 5
17 } } }
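Following the hint above, a small sketch (assumed, not from the slides) of how default(none) forces an explicit scoping decision for every variable:

C/C++
#include <stdio.h>

int main(void)
{
    int a = 1, b = 2;

    // With default(none), every variable used inside the construct must be
    // listed explicitly; forgetting one is a compile-time error.
    #pragma omp parallel default(none) shared(a) firstprivate(b)
    {
        #pragma omp critical
        printf("a = %d, b = %d\n", a, b);
    }
    return 0;
}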

Scoping – Lifetime (1/2)

• How long do private / firstprivate instances exist?

int i = 5;
#pragma omp parallel firstprivate(i)
{
   // private copy per thread
   // initialized with 5
   // alive until end of parallel region
}

#pragma omp parallel
#pragma omp single
{
   int i = 5; // alive until end of single
   #pragma omp task
   {
      // firstprivate copy of i for the task
      // alive until end of task
   }
}
Scoping – Lifetime (2/2)

• How long do private / firstprivate instances exist?

int i = 5;
#pragma omp parallel firstprivate(i)
{
   // private copy per thread
   // initialized with 5
   // alive until end of parallel region
}

#pragma omp parallel
#pragma omp single
{
   int i = 5; // alive until end of single
   #pragma omp task
   {
      // firstprivate copy of i for the task
      // alive until end of task
   }
}

➢ Alive until the end of the assigned structured block or construct
Orphaned Task Variables

• Arguments passed by reference are firstprivate by default in orphaned task generating constructs, example:

void task_body (int &);

void gen_task (int &x) { // an orphaned task construct
   #pragma omp task      // x is implicitly determined firstprivate(x)
   task_body (x);
}

void test (int &y, int &z) {
   #pragma omp parallel private(y)
   {
      y = z + 2;
      gen_task (y); // no matter if the argument is determined private
      gen_task (z); // or shared in the enclosing context.
      y++;          // each thread has its own int object y refers to
      gen_task (y);
   }
}

• Question: What is the scoping of x?
  − General rule: the reference argument is firstprivate if not shared before
  − Problem: Due to the call by reference it might or might not be shared
  − Solution: The special OpenMP rule for orphaned tasks has to be applied.

Example taken from the OpenMP 4.5 Examples
Questions?
Task Synchronization
The barrier directive

• OpenMP barrier (implicit or explicit)


− All tasks created by any thread of the current Team are guaranteed to be completed at barrier exit

C/C++
#pragma omp barrier

taskwait directive

• The taskwait directive (shallow task synchronization)
  − It is a stand-alone directive

#pragma omp taskwait

  − wait on the completion of child tasks of the current task; just direct children, not all descendant tasks; includes an implicit task scheduling point (TSP)

#pragma omp parallel
#pragma omp single
{
   #pragma omp task            // task A
   {
      #pragma omp task         // task B
      { … }
      #pragma omp task         // task C
      { … #C.1; #C.2; … }
      #pragma omp taskwait     // waits for B and C, but not for C.1 and C.2
   }
} // implicit barrier will wait for C.x

[Figure: task tree A → {B, C}, C → {C.1, C.2}; the taskwait inside A waits only for its direct children B and C]
taskgroup directive

• The taskgroup construct (deep task synchronization)
  − attached to a structured block; waits for completion of all descendants of the current task; TSP at the end

#pragma omp taskgroup [clause[[,] clause]...]
{structured-block}

  − where clause (could only be): reduction(reduction-identifier: list-items)

#pragma omp parallel
#pragma omp single
{
   #pragma omp taskgroup
   {
      #pragma omp task         // task B
      { … }
      #pragma omp task         // task C
      { … #C.1; #C.2; … }
   } // end of taskgroup: waits for B, C, C.1 and C.2
}

[Figure: task tree A → {B, C}, C → {C.1, C.2}; the taskgroup at the end of A waits for all descendants]
The Barrier and Taskwait Constructs

• Task Synchronization explained:

#pragma omp parallel num_threads(np)
{
   #pragma omp task        // np Tasks created here, one by each thread
   function_A();
   #pragma omp barrier     // All Tasks guaranteed to be completed here
   #pragma omp single
   {
      #pragma omp task
      function_B();        // 1 Task created here
   }
}                          // B-Task guaranteed to be completed here
Questions?
Loops with Tasks
The taskloop Construct

• Task generating construct: decompose a loop into chunks, create a task for each loop chunk

#pragma omp taskloop [clause[[,] clause]…]
{structured-for-loops}

◼ Where clause is one of:
  → Data Environment: shared(list), private(list), firstprivate(list), lastprivate(list), default(sh | pr | fp | none), reduction(r-id: list), in_reduction(r-id: list)
  → Cutoff Strategies: if(scalar-expression), final(scalar-expression), mergeable
  → Scheduler (R/H): untied, priority(priority-value)
  → Chunks/Grain: grainsize(grain-size), num_tasks(num-tasks)
  → Miscellaneous: collapse(n), nogroup, allocate([allocator:] list)
Worksharing vs. taskloop constructs (1/2)

subroutine worksharing
   integer :: x
   integer :: i
   integer, parameter :: T = 16
   integer, parameter :: N = 1024

   x = 0
   !$omp parallel shared(x) num_threads(T)
   !$omp do
   do i = 1,N
      !$omp atomic
      x = x + 1
      !$omp end atomic
   end do
   !$omp end do
   !$omp end parallel
   write (*,'(A,I0)') 'x = ', x
end subroutine

subroutine taskloop
   integer :: x
   integer :: i
   integer, parameter :: T = 16
   integer, parameter :: N = 1024

   x = 0
   !$omp parallel shared(x) num_threads(T)
   !$omp taskloop
   do i = 1,N
      !$omp atomic
      x = x + 1
      !$omp end atomic
   end do
   !$omp end taskloop
   !$omp end parallel
   write (*,'(A,I0)') 'x = ', x
end subroutine
Worksharing vs. taskloop constructs (2/2)

The worksharing subroutine is unchanged; in the taskloop version the loop is now enclosed in a single construct, so only one thread creates the loop tasks:

subroutine taskloop
   integer :: x
   integer :: i
   integer, parameter :: T = 16
   integer, parameter :: N = 1024

   x = 0
   !$omp parallel shared(x) num_threads(T)
   !$omp single
   !$omp taskloop
   do i = 1,N
      !$omp atomic
      x = x + 1
      !$omp end atomic
   end do
   !$omp end taskloop
   !$omp end single
   !$omp end parallel
   write (*,'(A,I0)') 'x = ', x
end subroutine
Taskloop decomposition approaches

◼ Clause: grainsize(grain-size)
  → Chunks have at least grain-size iterations
  → Chunks have a maximum of 2x grain-size iterations

int TS = 4 * 1024;
#pragma omp taskloop grainsize(TS)
for ( i = 0; i<SIZE; i+=1) {
   A[i]=A[i]*B[i]*S;
}

◼ Clause: num_tasks(num-tasks)
  → Create num-tasks chunks
  → Each chunk must have at least one iteration

int NT = 4 * omp_get_num_threads();
#pragma omp taskloop num_tasks(NT)
for ( i = 0; i<SIZE; i+=1) {
   A[i]=A[i]*B[i]*S;
}

◼ If none of the previous clauses is present, the number of chunks and the number of iterations per chunk is implementation defined

◼ Additional considerations:
  → The order of the creation of the loop tasks is unspecified
  → Taskloop creates an implicit taskgroup region; nogroup → no implicit taskgroup region is created
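A small runnable sketch (assumed sizes and grainsize, not from the slides) combining the taskloop pattern with the single-creator idiom:

C/C++
#include <stdio.h>
#include <omp.h>

#define SIZE (1 << 16)

int main(void)
{
    static double A[SIZE], B[SIZE];
    double S = 2.0;

    for (int i = 0; i < SIZE; i++) { A[i] = 1.0; B[i] = 3.0; }

    #pragma omp parallel
    #pragma omp single
    {
        // One thread creates the loop tasks; chunks of at least 4096 iterations.
        #pragma omp taskloop grainsize(4096)
        for (int i = 0; i < SIZE; i++)
            A[i] = A[i] * B[i] * S;
    }   // the implicit taskgroup of the taskloop has completed here

    printf("A[0] = %f\n", A[0]);
    return 0;
}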
Questions?
Task Scheduling
Tasks in OpenMP: Scheduling

• Default: Tasks are tied to the thread that first executes them → not necessarily the creator. Scheduling constraints:
− Only the thread a task is tied to can execute it
− A task can only be suspended at task scheduling points
▪ Task creation, task finish, taskwait, barrier, taskyield
− If task is not suspended in a barrier, executing thread can only switch to a direct descendant of all tasks tied
to the thread

• Tasks created with the untied clause are never tied


− Resume at task scheduling points possibly by different thread
− No scheduling restrictions, e.g., can be suspended at any point
− But: More freedom to the implementation, e.g., load balancing

Unsafe use of untied Tasks

• Problem: Because untied tasks may migrate between threads at any task scheduling point, thread-centric constructs can yield unexpected results

• Remember when using untied tasks:
  − Avoid threadprivate variables
  − Avoid any use of thread-ids (i.e., omp_get_thread_num())
  − Be careful with critical regions and locks

• Simple Solution:
  − Create a tied task region with
      #pragma omp task if(0)
The taskyield Directive

• The taskyield directive specifies that the current task can be suspended in favor of execution of a different task.
  − Hint to the runtime for optimization and/or deadlock prevention

C/C++                       Fortran
#pragma omp taskyield       !$omp taskyield
taskyield Example (1/2)

#include <omp.h>

void something_useful();
void something_critical();

void foo(omp_lock_t * lock, int n)
{
   for(int i = 0; i < n; i++)
      #pragma omp task
      {
         something_useful();
         while( !omp_test_lock(lock) ) {
            #pragma omp taskyield
         }
         something_critical();
         omp_unset_lock(lock);
      }
}
taskyield Example (2/2)

#include <omp.h>

void something_useful();
void something_critical();

void foo(omp_lock_t * lock, int n)
{
   for(int i = 0; i < n; i++)
      #pragma omp task
      {
         something_useful();
         while( !omp_test_lock(lock) ) {
            #pragma omp taskyield   // The waiting task may be suspended here and
                                    // allow the executing thread to perform other
                                    // work; may also avoid deadlock situations.
         }
         something_critical();
         omp_unset_lock(lock);
      }
}
Tasks and Dependencies

• Catchy example: Building a house

[Figure: dependency graph for building a house, with tasks such as Build Walls, Put up Roof, Shingle Roof, Install Electrical, Drywall, Exterior Siding and Landscape]
Tasks and Dependencies

• Task dependencies constrain the execution order and the execution times of tasks

• Fine-grained synchronization of tasks

Traditional taskwait:
int x = 0;
#pragma omp parallel
#pragma omp single
{
   #pragma omp task
   std::cout << x << std::endl;
   #pragma omp taskwait
   #pragma omp task
   x++;
}

With dependencies:
int x = 0;
#pragma omp parallel
#pragma omp single
{
   #pragma omp task depend(in: x)
   std::cout << x << std::endl;
   #pragma omp task depend(inout: x)
   x++;
}

[Figure: timeline of task creation and execution for both variants — in each case task t2 (the increment) only executes after task t1 (the read); the taskwait blocks the creating task, while the depend clauses order only the two dependent tasks]
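A compilable sketch (assumed, not from the slides) of a small dependency chain on a shared variable:

C/C++
#include <stdio.h>

int main(void)
{
    int x = 0;

    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: x)    // producer
        x = 42;

        #pragma omp task depend(in: x)     // consumer: runs after the producer
        printf("x = %d\n", x);

        #pragma omp task depend(inout: x)  // runs after both tasks above
        x++;
    }

    printf("final x = %d\n", x);           // 43: all tasks done at the implicit barrier
    return 0;
}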
Questions?
NUMA Architectures
NUMA
Non-Uniform Memory Architecture

How To Distribute The Data?

double* A;
A = (double*) malloc(N * sizeof(double));

for (int i = 0; i < N; i++) {
   A[i] = 0.0;
}

[Figure: two NUMA nodes, each with cores, on-chip caches and local memory, connected by an interconnect; in which memory do the elements A[0] … A[N] end up?]
About Data Distribution

• Important aspect on cc-NUMA systems


− If not optimal, longer memory access times and hotspots

• At first sight, OpenMP does not provide explicit support for cc-NUMA

• Placement comes from the Operating System


− This is therefore Operating System dependent
− OpenMP 5.0 introduced Memory Management to provide fine-grained control

• Windows, Linux and Solaris all use the “First Touch” placement policy by default
− May be possible to override default (check the docs)

Non-Uniform Memory Architecture

• Serial code: all array elements are allocated in the memory of the NUMA node containing the core executing this thread

double* A;
A = (double*) malloc(N * sizeof(double));

for (int i = 0; i < N; i++) {
   A[i] = 0.0;
}

[Figure: with serial initialization, A[0] … A[N] all end up in the memory of the first NUMA node]
Non-Uniform Memory Architecture

• First Touch w/ parallel code: all array elements are allocated in the memory of the NUMA node containing the core executing the thread initializing the respective partition

double* A;
A = (double*) malloc(N * sizeof(double));

omp_set_num_threads(4);

#pragma omp parallel for
for (int i = 0; i < N; i++) {
   A[i] = 0.0;
}

[Figure: with parallel first-touch initialization, A[0] … A[N/2] end up in the memory of the first NUMA node and A[N/2] … A[N] in the memory of the second]
Get Info on the System Topology

• Before you design a strategy for thread binding, you should have a basic understanding of the system topology. Please use one of the following options on a target machine:

  − Intel MPI's cpuinfo tool
    ▪ module switch openmpi intelmpi
    ▪ cpuinfo
    ▪ Delivers information about the number of sockets (= packages) and the mapping of processor ids used by the operating system to cpu cores.

  − hwloc's tools
    ▪ lstopo (command line: hwloc-ls)
    ▪ Displays a graphical representation of the system topology, separated into NUMA nodes, along with the mapping of processor ids used by the operating system to cpu cores and additional info on caches.
Decide for Binding Strategy

• Selecting the "right" binding strategy depends not only on the topology, but also on the characteristics of your application.

  − Putting threads far apart, i.e. on different sockets
    ▪ May improve the aggregated memory bandwidth available to your application
    ▪ May improve the combined cache size available to your application
    ▪ May decrease performance of synchronization constructs

  − Putting threads close together, i.e. on two adjacent cores which possibly share some caches
    ▪ May improve performance of synchronization constructs
    ▪ May decrease the available memory bandwidth and cache size

• If you are unsure, just try a few options and then select the best one.
OpenMP 4.0: Places + Binding Policies (1/2)

• Define OpenMP Places
  − set of OpenMP threads running on one or more processors
  − can be defined by the user, i.e. OMP_PLACES=cores

• Define a set of OpenMP Thread Affinity Policies
  − SPREAD: spread OpenMP threads evenly among the places
  − CLOSE: pack OpenMP threads near the master thread
  − MASTER: collocate OpenMP threads with the master thread

• Goals
  − user has a way to specify where to execute OpenMP threads for
    ▪ locality between OpenMP threads / less false sharing / memory bandwidth
Places

• Assume the following machine:

   p0 p1 p2 p3 p4 p5 p6 p7

  − 2 sockets, 4 cores per socket, 4 hyper-threads per core

• Abstract names for OMP_PLACES:
  − threads: Each place corresponds to a single hardware thread on the target machine.
  − cores: Each place corresponds to a single core (having one or more hardware threads) on the target machine.
  − ll_caches (5.1): Each place corresponds to a set of cores that share the last level cache.
  − numa_domains (5.1): Each place corresponds to a set of cores for which their closest memory is the same memory and at a similar distance from the cores.
  − sockets: Each place corresponds to a single socket (consisting of one or more cores) on the target machine.
OpenMP 4.0: Places + Binding Policies (2/2)

• Example's Objective:
  − separate cores for the outer loop and near cores for the inner loop
• Outer Parallel Region: proc_bind(spread), Inner: proc_bind(close)
  − spread creates a partition, close binds threads within the respective partition

OMP_PLACES=(0,1,2,3), (4,5,6,7), ... = (0-3):8:4 = cores

#pragma omp parallel proc_bind(spread) num_threads(4)
#pragma omp parallel proc_bind(close) num_threads(4)

[Figure: 8 places p0 … p7, showing the initial placement, the distribution after spread with 4 threads, and the placement after close with 4 threads within each partition]
Serial vs. Parallel Initialization

• Performance of OpenMP-parallel STREAM vector assignment measured on a 2-socket Intel® Xeon® X5675 („Westmere") using the Intel® Composer XE 2013 compiler with different thread binding options:

[Figure: bandwidth in MB/s (0–50000) over 1–24 threads for: serial init. / no binding, serial init. / close binding, serial init. / spread binding, NUMA aware init. / close binding, NUMA aware init. / spread binding]
Questions?
