Introduction to OpenMP: OpenMP in Small Bites: Overview

• OpenMP targets parallel applications in C, C++, and Fortran.
• It consists of compiler directives, runtime routines, and environment variables.
Introduction to High-Performance Computing | Dr. Christian Terboven | Chair for High-Performance Computing
OpenMP Overview & Parallel Region

OpenMP's machine model

• Parallelization in OpenMP employs multiple threads.

(Figure: multiple processors connected to a shared memory.)
OpenMP Memory Model

• All threads have access to the same, globally shared memory.
• Each thread additionally has its own private memory.
• Data transfer is through shared memory and is 100% transparent to the application.
OpenMP Execution Model

• The program starts with just one thread: the Master.
• Worker threads are spawned at Parallel Regions; together with the Master they form the Team of threads.
• In between Parallel Regions the Worker threads are put to sleep. The OpenMP Runtime takes care of all thread management work.
• Concept: Fork-Join.

(Figure: serial parts executed by the Master thread alternate with Parallel Regions executed by the whole team.)
Starting OpenMP Programs on Linux
Questions?
For Worksharing Construct

For Worksharing

• If only the parallel construct is used, each thread executes the Structured Block.
• Program Speedup: Worksharing
• OpenMP's most common Worksharing construct: for

C/C++:
    int i;
    #pragma omp for
    for (i = 0; i < 100; i++)
    {
        a[i] = b[i] + c[i];
    }

Fortran:
    INTEGER :: i
    !$omp do
    DO i = 0, 99
        a(i) = b(i) + c(i)
    END DO
Pseudo-Code (here: 4 threads):

Serial:
    do i = 0, 99
        a(i) = b(i) + c(i)
    end do

Parallel:
    Thread 1:  do i = 0, 24  ... a(i) = b(i) + c(i) ... end do
    Thread 2:  do i = 25, 49 ... a(i) = b(i) + c(i) ... end do
    Thread 3:  do i = 50, 74 ... a(i) = b(i) + c(i) ... end do
    Thread 4:  do i = 75, 99 ... a(i) = b(i) + c(i) ... end do

(Memory: the arrays A(0..99), B(0..99) and C(0..99) reside in shared memory.)
Demo
Vector Addition
Influencing the For Loop Scheduling

• for-construct: OpenMP allows the programmer to influence how the iterations are scheduled among the threads of the team, via the schedule clause:
− schedule(static [, chunk]): Iteration space divided into blocks of chunk size; blocks are assigned to threads in a round-robin fashion. If chunk is not specified: one block per thread.
− schedule(dynamic [, chunk]): Iteration space divided into blocks of chunk size (default: 1); blocks are scheduled to threads in the order in which threads finish previous blocks.
− schedule(guided [, chunk]): Similar to dynamic, but the block size starts with an implementation-defined value and is then decreased exponentially down to chunk.
Synchronization Overview

• Data Race: If, between two synchronization points, at least one thread writes to a memory location from which at least one other thread reads, the result is not deterministic (race condition).
Synchronization: Critical Region

• A Critical Region is executed by all threads, but by only one thread at a time (Mutual Exclusion).

C/C++:
    #pragma omp critical (name)
    {
        ... structured block ...
    }
The Barrier Construct

C/C++:
    #pragma omp barrier

• All worksharing constructs contain an implicit barrier at the end.
Single and Master Construct

The Single Construct

C/C++:
    #pragma omp single [clause]
    ... structured block ...

Fortran:
    !$omp single [clause]
    ... structured block ...
    !$omp end single

• The single construct specifies that the enclosed structured block is executed by only one thread of the team.
− It is up to the runtime which thread that is.
• Useful for:
− I/O
− Memory allocation and deallocation, etc. (in general: setup work)
− Implementation of the single-creator parallel-executor pattern, as we will see soon…
The Master Construct

C/C++:
    #pragma omp master
    ... structured block ...

Fortran:
    !$omp master
    ... structured block ...
    !$omp end master

• The master construct specifies that the enclosed structured block is executed only by the master thread of a team.
• Note: The master construct is not a worksharing construct and does not contain an implicit barrier at the end.
Runtime Library

• C and C++:
− If OpenMP is enabled during compilation, the preprocessor symbol _OPENMP is defined. To use the OpenMP runtime library, the header omp.h has to be included.
− omp_set_num_threads(int): The specified number of threads will be used for subsequently encountered parallel regions.
Questions?
Data Scoping
Scoping Rules
• Scoping in OpenMP: dividing variables into shared and private:
− private-list and shared-list on Parallel Region
− private-list and shared-list on Worksharing constructs
− General default is shared for Parallel Regions, firstprivate for Tasks (introduced later).
− Loop control variables on for-constructs are private.
− private: A new, uninitialized instance is created for the task or for each thread executing the construct.
  ▪ firstprivate: Initialization with the value from before encountering the construct.
  ▪ lastprivate: The value of the last loop iteration is written back to the original variable.
− Static variables are shared.
Privatization of Global/Static Variables

C/C++:
    static int i;
    #pragma omp threadprivate(i)

Fortran:
    INTEGER, SAVE :: i
    !$omp threadprivate(i)
Back to our bad scaling example

C/C++:
    int i, s = 0;
    #pragma omp parallel for
    for (i = 0; i < 100; i++)
    {
        #pragma omp critical
        { s = s + a[i]; }
    }

It's your turn: Make It Scale!
The Reduction Clause

• In a reduction operation, the operator is applied to all variables in the list. The variables have to be shared.
− reduction(operator:list)
− The result is provided in the associated reduction variable.

C/C++:
    int i, s = 0;
    #pragma omp parallel for reduction(+:s)
    for (i = 0; i < 100; i++)
    {
        s = s + a[i];
    }
Reduction Operations

    int a = 0;
    #pragma omp parallel
    #pragma omp for reduction(+:a)
    for (int i = 0; i < 100; i++)
    {
        a += i;
    }

(Figure: in the serial part a = 0; inside the parallel region each thread works on a local copy of a, initialized to 0; with four threads the partial sums are 300, 925, 1550 and 2175; the reduction combines the local copies and writes the final result 4950 to the shared variable.)
Questions?
False Sharing
Memory Bottleneck

(Figure: processor performance (SPECint benchmark) has grown far faster than memory latency over the years.)
− Source: John L. Hennessy, Stanford University, and David A. Patterson, University of California, September 25, 2012
Caches

• CPU is fast
− Order of 3.0 GHz
• Caches:
− Fast, but expensive
− Thus small, order of MB
• Memory is slow
− Order of 0.3 GHz
− Large, order of GB

(Figure: memory hierarchy from the core via on-chip and off-chip caches down to main memory.)
Visualization of the Memory Hierarchy

(Figure: measured access performance over memory footprints from 1 B to 2 GB; performance plateaus correspond to the L1, L2 and L3 caches and drop each time the working set exceeds a cache level.)
Data in Caches

• The hardware always copies whole chunks (cache lines) into the on-chip cache.
False Sharing

• False Sharing occurs when
− different threads use elements of the same cache-line, and
− one of the threads writes to the cache-line.
Summing up vector elements again
False Sharing

    double s_priv[nthreads];
    #pragma omp parallel num_threads(nthreads)
    {
        int t = omp_get_thread_num();
        #pragma omp for
        for (i = 0; i < 100; i++)
        {
            s_priv[t] += a[i];
        }
    } // end parallel
    for (i = 0; i < nthreads; i++)
    {
        s += s_priv[i];
    }
False Sharing

• No performance benefit for more threads!
• Reason: false sharing of s_priv
• Solution: padding, so that only one variable per cache line is used

(Figure: MFLOPS over 1 to 12 threads; the version with false sharing stays flat at low performance while the version without false sharing scales.)
Example: Pi

The integral computed: $\pi = \int_0^1 \frac{4}{1+x^2}\,dx$

    double f(double x)
    {
        return (4.0 / (1.0 + x*x));
    }

    double CalcPi (int n)
    {
        const double fH = 1.0 / (double) n;
        double fSum = 0.0;
        double fX;
        int i;

        #pragma omp parallel for
        for (i = 0; i < n; i++)
        {
            fX = fH * ((double)i + 0.5);
            fSum += f(fX);
        }
        return fH * fSum;
    }
Example: Pi

    double f(double x)
    {
        return (4.0 / (1.0 + x*x));
    }

    double CalcPi (int n)
    {
        const double fH = 1.0 / (double) n;
        double fSum = 0.0;
        double fX;
        int i;

        #pragma omp parallel for private(fX,i) reduction(+:fSum)
        for (i = 0; i < n; i++)
        {
            fX = fH * ((double)i + 0.5);
            fSum += f(fX);
        }
        return fH * fSum;
    }
Example: Pi

    double f(double x)
    {
        return (4.0 / (1.0 + x*x));
    }

    double CalcPi (int n)
    {
        const double fH = 1.0 / (double) n;
        double fSum = 0.0;
        double fX;
        int i;

        #pragma omp parallel for private(fX,i) reduction(+:fSum)   // What if we had forgotten this?
        for (i = 0; i < n; i++)
        {
            fX = fH * ((double)i + 0.5);
            fSum += f(fX);
        }
        return fH * fSum;
    }
Race Condition

• Data Race: the typical OpenMP programming error, when:
− two or more threads access the same memory location, and
− at least one of these accesses is a write, and
− the accesses are not protected by locks or critical regions, and
− the accesses are not synchronized, e.g. by a barrier.
• Non-deterministic occurrence: e.g. the sequence of the execution of parallel loop iterations is non-deterministic and may change from run to run.
• In many cases, private clauses, barriers or critical regions are missing.
Inspector XE – Results

(Screenshot: Inspector XE showing (1) the detected problems, (2) filters, (3) the code location; the missing reduction is detected.)
Example: Pi

    double f(double x)
    {
        return (4.0 / (1.0 + x*x));
    }

    double CalcPi (int n)
    {
        const double fH = 1.0 / (double) n;
        double fSum = 0.0;
        double fX;
        int i;

        #pragma omp parallel for private(fX,i,fSum)   // What if we just made the fSum variable private?
        for (i = 0; i < n; i++)
        {
            fX = fH * ((double)i + 0.5);
            fSum += f(fX);
        }
        return fH * fSum;   // fSum == 0 (no update to the global variable)
    }
Example: Pi Scalability
Questions?
Recursive approach to compute Fibonacci

• Fibonacci numbers
− Form a sequence F_n such that each number is the sum of the two preceding ones
− F_0 = 0, F_1 = 1
− F_n = F_{n-1} + F_{n-2} (for n > 1)
• On the following slides we will discuss three approaches to parallelize this recursive code with Tasking.
The task construct

• Deferring (or not) a unit of work (executable by any member of the team)
− Always attached to a structured block

    #pragma omp task [clause[[,] clause]…]
    {structured-block}

Where clause is one of:
→ private(list)
→ if(scalar-expression)
→ untied
Tasks in OpenMP: Data Scoping
First version parallelized with Tasking (omp-v1)

• Only one Task / Thread enters fib() from main(); it is responsible for creating the two initial work tasks.
• Taskwait is required, as otherwise x and y would be lost.
taskwait directive

• The taskwait directive (shallow task synchronization)
− It is a stand-alone directive

    #pragma omp taskwait

− waits on the completion of child tasks of the current task; just direct children, not all descendant tasks; includes an implicit task scheduling point (TSP)
Fibonacci Illustration

• T1 enters fib(4)
• T1 creates tasks for fib(3) and fib(2)
• T1 and T2 execute tasks from the queue
• T1 and T2 create 4 new tasks
• T1 - T4 execute tasks
• …

(Figure: the task queue holds fib(3) and fib(2); the call tree expands fib(4) into fib(3) and fib(2), and further into fib(2), fib(1), fib(1) and fib(0).)
Scalability measurements (1/3)

• Overhead of task creation prevents scalability!

(Figure: speedup of Fibonacci with tasks over 1 to 16 threads stays near 1 while the optimal speedup grows linearly.)
if Clause

• The if clause of a task construct
− allows one to optimize task creation/execution
− reduces parallelism but also reduces the pressure in the runtime's task pool
− for "very" fine-grained tasks you may need to do your own (manual) if

    #pragma omp task if(expression)
    {structured-block}
Improved parallelization with Tasking (omp-v2)
Scalability measurements (2/3)

login-t, E5-2650 v4, 2x 12 cores @ 2.20 GHz
Intel Compiler 16.0.2, fib(45) = 1134903170

• Speedup is better, but still not great: small tasks still have to be allocated in omp-v2!

(Figure: speedup of Fibonacci with tasks over 1 to 16 threads for omp-v1 and omp-v2, compared to optimal.)
Improved parallelization with Tasking (omp-v3)

    int main(int argc, char* argv[])
    {
        [...]
        #pragma omp parallel
        {
            #pragma omp single
            {
                fib(input);
            }
        }
        [...]
    }

    int fib(int n)
    {
        if (n <= 30)
            return serfib(n);
        int x, y;
        #pragma omp task shared(x)
        {
            x = fib(n - 1);
        }
        #pragma omp task shared(y)
        {
            y = fib(n - 2);
        }
        #pragma omp taskwait
        return x + y;
    }
Scalability measurements (3/3)

login-t, E5-2650 v4, 2x 12 cores @ 2.20 GHz
Intel Compiler 16.0.2, fib(45) = 1134903170

• Looks promising…

(Figure: speedup of Fibonacci with tasks over 1 to 16 threads; omp-v3 follows the optimal curve much more closely than omp-v1 and omp-v2.)
Runtime measurements (1/2)

login-t, E5-2650 v4, 2x 12 cores @ 2.20 GHz
Intel Compiler 16.0.2, fib(45) = 1134903170

• First two versions were slow because of overhead!

(Figure: runtime in seconds over 1 to 16 threads; omp-v1 and omp-v2 take up to ~450 s, while omp-v3 stays far below.)
Runtime measurements (2/2)

login-t, E5-2650 v4, 2x 12 cores @ 2.20 GHz
Intel Compiler 16.0.2, fib(45) = 1134903170

• Third version is comparable to the serial version w/o OpenMP ☺

(Figure: runtime in seconds over 1 to 16 threads for omp-v3 versus the serial version.)
Tasking Overheads

• Every task has overheads:
− Task creation: populate the task data structure, add the task to the task queue
− Task execution: retrieve a task from the queue (may include work stealing)
• The smaller the tasks:
− Execution spends a higher relative amount of time in the runtime
− The useful task work's contribution to the runtime becomes significantly smaller
Threads vs Tasks

• With threads:
− Example: multi-threaded plugin in a multi-threaded application
− Composition usually leads to oversubscription and load imbalance
• With tasks:
− A pool of threads executes all created tasks
− Tasks from different modules can freely mix
Sometimes You’re Better off with Threads…

• Some scenarios are more amenable to traditional threads
− Granularity too coarse for tasking
− Isolation of autonomous agents
Questions?
Tasks in OpenMP: Data Scoping

• Tasking:
− Variables are firstprivate unless shared in the enclosing context
  ▪ Only the shared attribute is inherited
  ▪ Exception: Orphaned Task variables are firstprivate by default!
− See an example later on
Example: Data Scoping

Data Scoping Example (1/7)

    int a = 1;
    void foo()
    {
        int b = 2, c = 3;
        #pragma omp parallel private(b)
        {
            int d = 4;
            #pragma omp task
            {
                int e = 5;

                // Scope of a:
                // Scope of b:
                // Scope of c:
                // Scope of d:
                // Scope of e:
    } } }
Data Scoping Example (6/7)

Hint: Use default(none) to be forced to think about every variable if you do not see it clearly.
Data Scoping Example (7/7)

    int a = 1;
    void foo()
    {
        int b = 2, c = 3;
        #pragma omp parallel private(b)
        {
            int d = 4;
            #pragma omp task
            {
                int e = 5;

                // Scope of a: shared, value of a: 1
                // Scope of b: firstprivate, value of b: 0 / undefined
                // Scope of c: shared, value of c: 3
                // Scope of d: firstprivate, value of d: 4
                // Scope of e: private, value of e: 5
    } } }
Scoping – Lifetime (1/2)

(Figure: annotated code showing the lifetime of shared and private variables within a parallel region.)
Scoping – Lifetime (2/2)

    // initialized with 5
    // alive until end of parallel region
    #pragma omp task
    {
        // firstprivate copy of i for the task
        // alive until end of task
    }

➢ Alive until the end of the assigned structured block or construct
Orphaned Task Variables

• In an orphaned task construct (one not lexically enclosed in a parallel region), variables are firstprivate by default.

    task_body(x);   // here x is shared in the enclosing context
taskwait directive

    #pragma omp taskwait

− waits on the completion of child tasks of the current task; just direct children, not all descendant tasks; includes an implicit task scheduling point (TSP)

    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task            // task A
        {
            #pragma omp task        // task B
            { … }
            #pragma omp task        // task C
            { … #C.1; #C.2; … }
            #pragma omp taskwait    // waits for B and C, but not for C.1 and C.2
        }
    } // implicit barrier will wait for C.x

(Figure: task tree with A at the root, children B and C, and C's children C.1 and C.2.)
taskgroup directive
Questions?
Loops with Tasks

The taskloop Construct

• Task-generating construct: decompose a loop into chunks, create a task for each loop chunk

    #pragma omp taskloop [clause[[,] clause]…]
    {structured-for-loops}

Data environment:
→ private(list), firstprivate(list), lastprivate(list), default(sh | pr | fp | none), reduction(r-id: list), in_reduction(r-id: list)
Scheduler strategies (runtime hints):
→ final(scalar-expression), mergeable, untied, priority(priority-value)
Chunks/grain:
→ grainsize(grain-size), num_tasks(num-tasks)
Miscellaneous:
→ collapse(n), nogroup, allocate([allocator:] list)
Worksharing vs. taskloop constructs

Worksharing version:

    x = 0
    !$omp parallel shared(x) num_threads(T)
    !$omp do
    do i = 1,N
      !$omp atomic
      x = x + 1
      !$omp end atomic
    end do
    !$omp end do
    !$omp end parallel
    write (*,'(A,I0)') 'x = ', x
    end subroutine

taskloop version:

    x = 0
    !$omp parallel shared(x) num_threads(T)
    !$omp single
    !$omp taskloop
    do i = 1,N
      !$omp atomic
      x = x + 1
      !$omp end atomic
    end do
    !$omp end taskloop
    !$omp end single
    !$omp end parallel
    write (*,'(A,I0)') 'x = ', x
    end subroutine
Taskloop decomposition approaches

grainsize(grain-size):
→ Chunks have at least grain-size and at most 2x grain-size iterations

    int TS = 4 * 1024;
    #pragma omp taskloop grainsize(TS)
    for (i = 0; i < SIZE; i += 1) {
        A[i] = A[i] * B[i] * S;
    }

num_tasks(num-tasks):
→ Create num-tasks chunks; each chunk must have at least one iteration

    int NT = 4 * omp_get_num_threads();
    #pragma omp taskloop num_tasks(NT)
    for (i = 0; i < SIZE; i += 1) {
        A[i] = A[i] * B[i] * S;
    }

→ taskloop creates an implicit taskgroup region; nogroup → no implicit taskgroup region is created
Questions?
Task Scheduling
Tasks in OpenMP: Scheduling
• Default: Tasks are tied to the thread that first executes them → not necessarily the creator. Scheduling constraints:
− Only the thread a task is tied to can execute it
− A task can only be suspended at task scheduling points
  ▪ Task creation, task finish, taskwait, barrier, taskyield
− If a task is not suspended in a barrier, the executing thread can only switch to a direct descendant of all tasks tied to the thread
Unsafe use of untied Tasks

• Problem: Because untied tasks may migrate between threads at any task scheduling point, thread-centric constructs can yield unexpected results.
• Simple Solution: Create a tied task region with

    #pragma omp task if(0)
The taskyield Directive

• The taskyield directive specifies that the current task can be suspended in favor of execution of a different task.
− Hint to the runtime for optimization and/or deadlock prevention

C/C++:
    #pragma omp taskyield

Fortran:
    !$omp taskyield
taskyield Example (1/2)

    #include <omp.h>

    void something_useful();
    void something_critical();

    void foo(omp_lock_t * lock, int n)
    {
        for (int i = 0; i < n; i++)
            #pragma omp task
            {
                something_useful();
                while ( !omp_test_lock(lock) ) {
                    #pragma omp taskyield
                }
                something_critical();
                omp_unset_lock(lock);
            }
    }
taskyield Example (2/2)

(Same code; annotation: the taskyield inside the spin-wait allows the executing thread to perform other work in the meantime, and may also avoid deadlock situations.)
Tasks and Dependencies

(Figure: house-building task graph with tasks such as Build Walls, Put up Roof, Shingle Roof, Install Electrical, Drywall, Exterior Siding and Landscape, where some tasks must complete before others can start.)
Tasks and Dependencies

• Task dependencies constrain the execution order and execution times of tasks: fine-grained synchronization between individual tasks rather than a global wait.

(Figure: with #pragma omp taskwait, t2 may only start after t1 has completed and the wait has passed; with an out dependency on t1 that t2 reads, t2 can start as soon as the value is produced. The diagram distinguishes a task's creation time from its execution time.)
Questions?
NUMA Architectures
NUMA
Non-Uniform Memory Architecture
• OpenMP does not provide explicit support for cc-NUMA at first sight.
• Windows, Linux and Solaris all use the "First Touch" placement policy by default.
− May be possible to override the default (check the docs).
Non-Uniform Memory Architecture

• Serial code: all array elements are allocated in the memory of the NUMA node containing the core executing the initializing thread.
• First Touch w/ parallel code: all array elements are allocated in the memory of the NUMA node containing the core executing the thread initializing the respective partition.
• Before you design a strategy for thread binding, you should have a basic understanding of the system topology. Please use one of the following options on a target machine:
− hwloc's tools
  ▪ lstopo (command line: hwloc-ls)
  ▪ Displays a graphical representation of the system topology, separated into NUMA nodes, along with the mapping of processor IDs used by the operating system to CPU cores, and additional info on caches.
Decide for Binding Strategy

• Selecting the "right" binding strategy depends not only on the topology, but also on the characteristics of your application:
− Putting threads far apart, i.e. on different sockets
  ▪ May improve the aggregated memory bandwidth available to your application
  ▪ May improve the combined cache size available to your application
  ▪ May decrease performance of synchronization constructs
− Putting threads close together, i.e. on two adjacent cores which possibly share some caches
  ▪ May improve performance of synchronization constructs
  ▪ May decrease the available memory bandwidth and cache size
• If you are unsure, just try a few options and then select the best one.
OpenMP 4.0: Places + Binding Policies (1/2)

• Goals
− the user has a way to specify where to execute OpenMP threads, for
− locality between OpenMP threads / less false sharing / memory bandwidth
Places

• Abstract names for OMP_PLACES:
− threads: Each place corresponds to a single hardware thread on the target machine.
− cores: Each place corresponds to a single core (having one or more hardware threads) on the target machine.
− ll_caches (5.1): Each place corresponds to a set of cores that share the last-level cache.
− numa_domains (5.1): Each place corresponds to a set of cores for which their closest memory is the same memory, at a similar distance from the cores.
− sockets: Each place corresponds to a single socket (consisting of one or more cores) on the target machine.
OpenMP 4.0: Places + Binding Policies (2/2)

• Example's objective:
− separate cores for the outer loop, near cores for the inner loop
• Outer Parallel Region: proc_bind(spread), Inner: proc_bind(close)
− spread creates a partition, close binds threads within the respective partition

    OMP_PLACES=(0,1,2,3),(4,5,6,7),... = (0-3):8:4 = cores

    #pragma omp parallel proc_bind(spread) num_threads(4)
    #pragma omp parallel proc_bind(close) num_threads(4)

• Example (places p0 … p7):
− initial: the master thread starts on one place
− spread 4: the four outer threads are distributed across the partition, e.g. to p0, p2, p4, p6
− close 4: each inner team is packed onto places adjacent to its partition's first place
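The same spread/close nesting can be set up via environment variables instead of the proc_bind clause. This is a configuration sketch for a hypothetical 8-core machine; adjust OMP_PLACES to the output of hwloc-ls on your own system.

```shell
# One place per core; 4 threads; nested policies: outer spread, inner close.
export OMP_PLACES=cores
export OMP_NUM_THREADS=4
export OMP_PROC_BIND=spread,close
# ./a.out   (hypothetical binary built with OpenMP support)
```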
Serial vs. Parallel Initialization

(Figure: memory bandwidth in MB/s over 1 to 24 threads for serial initialization with no/close/spread binding versus NUMA-aware initialization with close/spread binding; NUMA-aware initialization scales to much higher bandwidth.)
Questions?