OMP Common Core – Voss
A hands-on exploration
                   * The name “OpenMP” is the property of the OpenMP Architecture Review Board.      1
Preliminaries: Part 1
 • Disclosures
   – The views expressed in this tutorial are those of the
     people delivering the tutorial.
     – We are not speaking for our employers.
     – We are not speaking for the OpenMP ARB
 • We take these tutorials VERY seriously:
   – Help us improve … tell us how you would make this
     tutorial better.
                                                             2
Preliminaries: Part 2
                                                                3
Outline
 •   Introduction to OpenMP
 •   Creating Threads
 •   Synchronization
 •   Parallel Loops
 •   Data environment
 •   Memory model
 •   Irregular Parallelism and tasks
 •   Recap
 •   Beyond the common core:
     – Worksharing revisited
     – Synchronization: More than you ever wanted to know
     – Thread private data
     – Thread affinity and data locality
                                                            4
OpenMP* overview:
 • A set of compiler directives and library routines for parallel application programmers
 • Greatly simplifies writing multi-threaded (MT) programs in Fortran, C and C++
 • Standardizes established SMP practice + vectorization and heterogeneous device programming

 [Figure: page counts (not counting front matter, appendices or index) for versions of OpenMP,
  1996–2016; the specification grows steadily, with OpenMP 4.5 by far the largest.]

  The complexity of the full spec is overwhelming, so we focus on the 16 constructs most
  OpenMP programmers restrict themselves to … the so-called “OpenMP Common Core”
                                                                                       6
The OpenMP Common Core: Most OpenMP programs only use these 19 items

 [Figure: the OpenMP solution stack – the End User runs an Application, which sits on top of
  Directives/Compiler, the OpenMP library, and Environment variables.]
                                                                  8
 OpenMP basic syntax
• Most of the constructs in OpenMP are compiler directives.
                                           Example
     C/C++:   #pragma omp parallel private(x)
              { … }
     Fortran: !$OMP PARALLEL … !$OMP END PARALLEL

• A starting point is a plain serial C skeleton:
          #include <stdio.h>
          int main()
          { … }
  The statements executed by the threads are interleaved based on how the operating
  system schedules the threads.
                                                                                       12
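A minimal multithreaded sketch in this style (a hedged illustration, not a solution from the deck; the printed strings are arbitrary):

          #include <stdio.h>
          #include <omp.h>
          int main()
          {
             #pragma omp parallel
             {
                int ID = omp_get_thread_num();      /* each thread gets its own ID */
                printf(" hello(%d) ", ID);
                printf(" world(%d) \n", ID);        /* output from different threads interleaves */
             }
             return 0;
          }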
Outline
 •   Introduction to OpenMP
 •   Creating Threads
 •   Synchronization
 •   Parallel Loops
 •   Data environment
 •   Memory model
 •   Irregular Parallelism and tasks
 •   Recap
 •   Beyond the common core:
     – Worksharing revisited
     – Synchronization: More than you ever wanted to know
     – Thread private data
     – Thread affinity and data locality
                                                            13
 OpenMP programming model:
 Fork-Join Parallelism:
      Master thread spawns a team of threads as needed.
      Parallelism is added incrementally until performance goals are met,
         i.e., the sequential program evolves into a parallel program.

 [Figure: the master thread (shown in red) runs the sequential parts; at each parallel region it
  forks a team of threads, which join back at the end of the region; one region contains a
  nested parallel region.]
                                                                               14
Thread creation: Parallel regions

• Each thread executes the same code redundantly.

             double A[1000];
             omp_set_num_threads(4);
             #pragma omp parallel
             {
                int ID = omp_get_thread_num();
                pooh(ID, A);
             }
             printf("all done\n");

• A single copy of A is shared between all threads.
• Each of the four threads calls pooh(ID, A) with its own ID (0, 1, 2, 3).
• Threads wait at the end of the parallel region for all threads to finish
  before proceeding (i.e., a barrier).

* The name “OpenMP” is the property of the OpenMP Architecture Review Board
                                                                                              16
Thread creation: How many threads did
you actually get?
• You create a team of threads in OpenMP* with the parallel construct.
• You can request a number of threads with omp_set_num_threads()
• But is the number of threads requested the number you actually get?
    – NO! An implementation can silently decide to give you a team with fewer threads.
    – Once a team of threads is established … the system will not reduce the size of the team.
                                
    Mathematically, we know that:

        ∫₀¹ 4.0/(1+x²) dx = π

    We can approximate the integral as a sum of rectangles:

        Σᵢ F(xᵢ) Δx ≈ π      where F(x) = 4.0/(1+x²), each rectangle has width Δx,
                              and the height is taken at the middle of each interval.
                                                                     18
Serial PI program
See OMP_exercises/pi.c                                          19
Serial PI program
            #include <omp.h>
            static long num_steps = 100000;
            double step;
            int main ()
            {   int i; double x, pi, sum = 0.0;
                step = 1.0/(double) num_steps;
                for (i=0; i< num_steps; i++){
                    x = (i+0.5)*step;
                    sum = sum + 4.0/(1.0+x*x);
                }
                pi = step * sum;
            }
                                                 Run time (seconds) vs. number of threads
                                                 for the first SPMD version:

                                                      threads | 1st SPMD
                                                      --------+---------
                                                         1    |   1.86
                                                         2    |   1.03
                                                         3    |   1.08
                                                         4    |   0.97
                                                                  24
Why such poor scaling?                                 False sharing
• If independent data elements happen to sit on the same cache line, each
  update will cause the cache lines to “slosh back and forth” between threads
  … This is called “false sharing”.

 [Figure: the partial sums updated by different threads fall on the same L1 cache ($) lines,
  so the lines bounce between the cores’ caches.]

*Intel compiler (icpc) with default optimization level (O2) on Apple OS X 10.7.3 with a dual core
(four HW thread) Intel® Core™ i5 processor at 1.7 GHz and 4 GB DDR3 memory at 1.333 GHz.
                                                                                                27
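One common fix is to pad the per-thread partial sums so each lands on its own cache line. A sketch, assuming 64-byte cache lines (8 doubles per line); the array and function names are illustrative:

   #include <omp.h>
   #define NUM_THREADS 4
   #define PAD 8                          /* assume a 64-byte L1 cache line: 8 doubles per line */
   double sum[NUM_THREADS][PAD];          /* each thread's partial sum now gets its own line   */

   void accumulate(int nsteps, double step)
   {
      #pragma omp parallel
      {
         int i, id = omp_get_thread_num();
         int nthrds = omp_get_num_threads();
         for (i = id; i < nsteps; i += nthrds) {
            double x = (i + 0.5) * step;
            sum[id][0] += 4.0/(1.0+x*x);  /* only element [id][0] is ever used */
         }
      }
   }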
Changing the Number of Threads
• Inside the OpenMP runtime is an Internal Control Variable (ICV) for the
  default number of threads requested by a parallel construct.
                                                                             28
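A sketch of the standard ways to set and query that ICV (omp_set_num_threads(), omp_get_num_threads() and the OMP_NUM_THREADS environment variable are all standard OpenMP):

   #include <stdio.h>
   #include <omp.h>
   int main()
   {
      omp_set_num_threads(4);              /* change the ICV: request 4 threads */
      #pragma omp parallel
      {
         #pragma omp single
         printf("team size = %d\n", omp_get_num_threads());   /* how many we actually got */
      }
      return 0;
   }
   /* the same ICV can also be set before the run:   export OMP_NUM_THREADS=4 */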
Outline
 •   Introduction to OpenMP
 •   Creating Threads
 •   Synchronization
 •   Parallel Loops
 •   Data environment
 •   Memory model
 •   Irregular Parallelism and tasks
 •   Recap
 •   Beyond the common core:
     – Worksharing revisited
     – Synchronization: More than you ever wanted to know
     – Threadprivate data
     – Thread affinity and data locality
                                                            29
Synchronization
                                     Synchronization is used to
                                     impose order constraints and
                                     to protect access to shared
                                     data
                                                                    30
Synchronization: critical
• Mutual exclusion: Only one thread at a time can enter a
  critical region.

                    float res;
                    #pragma omp parallel
                    {   float B; int i, id, nthrds;
                        id = omp_get_thread_num();
                        nthrds = omp_get_num_threads();
                        for(i=id; i<niters; i+=nthrds){
                            B = big_job(i);
                            #pragma omp critical
                               res += consume (B);
                        }
                    }

  Threads wait their turn – only one at a time calls consume().
                                                            31
Synchronization: barrier
• Barrier: a point in a program all threads must reach before any threads are
  allowed to proceed.
• It is a “stand alone” pragma, meaning it is not associated with user code … it
  is an executable statement.

                  double Arr[8], Brr[8];       int numthrds;
                  omp_set_num_threads(8);
                  #pragma omp parallel
                  {   int id, nthrds;
                      id = omp_get_thread_num();
                      nthrds = omp_get_num_threads();
                      if (id==0) numthrds = nthrds;
                      Arr[id] = big_ugly_calc(id, nthrds);
                      #pragma omp barrier
                      Brr[id] = really_big_and_ugly(id, nthrds, Arr);
                  }

  Threads wait until all threads hit the barrier. Then they can go on.
                                                                              32
Exercise
 • In your first Pi program, you probably used an array to create
   space for each thread to store its partial sum.
                                                      threads | 1st SPMD
                                                      --------+---------
                                                         1    |   1.86
                                                         2    |   1.03
                                                         3    |   1.08
                                                         4    |   0.97
*Intel compiler (icpc) with default optimization level (O2) on Apple OS X 10.7.3 with a dual core
(four HW thread) Intel® Core™ i5 processor at 1.7 GHz and 4 GB DDR3 memory at 1.333 GHz.     34
Example: Using a critical section to remove impact of false sharing
#include <omp.h>
static long num_steps = 100000;            double step;
#define NUM_THREADS 2
void main ()
{           int nthreads; double pi=0.0;            step = 1.0/(double) num_steps;
            omp_set_num_threads(NUM_THREADS);
#pragma omp parallel
{
           int i, id, nthrds; double x, sum;
           id = omp_get_thread_num();
           nthrds = omp_get_num_threads();
           if (id == 0) nthreads = nthrds;
           for (i=id, sum=0.0; i< num_steps; i=i+nthrds) {
                      x = (i+0.5)*step;
                      sum += 4.0/(1.0+x*x);
           }
           #pragma omp critical
                 pi += sum * step;
}
}
• Create a scalar (sum) local to each thread to accumulate partial sums.
• No array, so no false sharing.
• sum goes “out of scope” beyond the parallel region … so you must sum it in here. You must
  protect the summation into pi with a critical region so updates don’t conflict.
                                                                                              35
Results*: pi program critical section
   • Original Serial pi program with 100000000 steps ran in 1.83 seconds.
*Intel compiler (icpc) with default optimization level (O2) on Apple OS X 10.7.3 with a dual core
(four HW thread) Intel® Core™ i5 processor at 1.7 GHz and 4 GB DDR3 memory at 1.333 GHz.
                                                                                                   36
Example: Using a critical section to remove impact of false sharing
#include <omp.h>
static long num_steps = 100000;            double step;
#define NUM_THREADS 2
void main ()
{           int nthreads; double pi=0.0;           step = 1.0/(double) num_steps;
            omp_set_num_threads(NUM_THREADS);
#pragma omp parallel
{
           int i, id, nthrds; double x;
           id = omp_get_thread_num();
           nthrds = omp_get_num_threads();
           if (id == 0) nthreads = nthrds;
           for (i=id; i< num_steps; i=i+nthrds){
                      x = (i+0.5)*step;
                      #pragma omp critical
                          pi += 4.0/(1.0+x*x);
           }
}
pi *= step;
}
• Be careful where you put a critical section.
• What would happen if you put the critical section inside the loop, as is done here?
                                                                                       37
Outline
 •   Introduction to OpenMP
 •   Creating Threads
 •   Synchronization
 •   Parallel Loops
 •   Data environment
 •   Memory model
 •   Irregular Parallelism and tasks
 •   Recap
 •   Beyond the common core:
     – Worksharing revisited
     – Synchronization: More than you ever wanted to know
     – Threadprivate data
     – Thread affinity and data locality
                                                            38
The loop worksharing constructs
• The loop worksharing construct splits up loop iterations among the threads in a team:

                  #pragma omp parallel
                  {
                     #pragma omp for
                        for(i=0; i<N; i++) { a[i] = a[i] + b[i]; }
                  }

  (an OpenMP parallel region containing a worksharing for construct)
                                                                           40
 Loop worksharing constructs:
 The schedule clause
• The schedule clause affects how loop iterations are mapped onto threads
   – schedule(static [,chunk])
        – Deal-out blocks of iterations of size “chunk” to each thread.
  – schedule(dynamic[,chunk])
        – Each thread grabs “chunk” iterations off a queue until all iterations have
          been handled.
  Schedule Clause   When To Use                          Overhead
  ---------------   ----------------------------------   ---------------------------------
  STATIC            Pre-determined and predictable        Least work at runtime:
                    by the programmer                     scheduling done at compile-time
  DYNAMIC           Unpredictable, highly variable        Most work at runtime:
                    work per iteration                    complex scheduling logic
                                                          used at run-time
                                                                                        41
Combined parallel/worksharing construct
                                                            42
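A minimal sketch of the combined form, which fuses the parallel region and the for worksharing construct from the previous slide into a single directive (the function wrapper is just for illustration):

      void vadd(int N, double *a, double *b)
      {
         #pragma omp parallel for      /* creates the team and splits the loop in one directive */
         for (int i = 0; i < N; i++)
            a[i] = a[i] + b[i];
      }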
Working with loops
• Basic approach
  – Find compute intensive loops
  – Make the loop iterations independent … so they can safely execute in
    any order without loop-carried dependencies
  – Place the appropriate OpenMP directive and test
 Remember: OpenMP makes the loop control index in a loop workshare construct
             private for you … you don’t need to do this yourself              47
   Example: Pi with a loop and a reduction
#include <omp.h>
static long num_steps = 100000;              double step;
void main ()
{   int i;          double pi, sum = 0.0;
    step = 1.0/(double) num_steps;
    #pragma omp parallel
    {
        double x;
        #pragma omp for reduction(+:sum)
        for (i=0; i< num_steps; i++){
            x = (i+0.5)*step;
            sum = sum + 4.0/(1.0+x*x);
        }
    }
    pi = step * sum;
}
• Create a team of threads … without a parallel construct, you’ll never have more than one thread.
• Create a scalar local to each thread to hold the value of x at the center of each interval.
• The for construct with reduction(+:sum) splits the loop among the threads and combines each
  thread’s partial sum into sum at the end.
*Intel compiler (icpc) with default optimization level (O2) on Apple OS X 10.7.3 with a dual core
(four HW thread) Intel® Core™ i5 processor at 1.7 GHz and 4 GB DDR3 memory at 1.333 GHz.
                                                                                                 49
 The nowait clause
• Barriers are really expensive. You need to understand when
  they are implied and how to skip them when it’s safe to do so.
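A sketch of where nowait typically goes; work1() and work2() are placeholders, and the barrier can be skipped only because the second loop does not read a[]:

      double work1(int i);   /* placeholder functions */
      double work2(int i);

      void two_loops(int n, double *a, double *b)
      {
         #pragma omp parallel
         {
            #pragma omp for nowait     /* skip the implied barrier at the end of this loop */
            for (int i = 0; i < n; i++)
               a[i] = work1(i);

            #pragma omp for            /* the implied barrier at the end of this loop remains */
            for (int i = 0; i < n; i++)
               b[i] = work2(i);
         }
      }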
Outline
 •   Introduction to OpenMP
 •   Creating Threads
 •   Synchronization
 •   Parallel Loops
 •   Data environment
 •   Memory model
 •   Irregular Parallelism and tasks
 •   Recap
 •   Beyond the common core:
     – Worksharing revisited
     – Synchronization: More than you ever wanted to know
     – Thread private data
     – Thread affinity and data locality
                                                            51
Data environment:
Default storage attributes
                                                                       52
Data sharing: Examples
                                                                                     54
Data sharing: Private clause
• private(var) creates a new local copy of var for each thread.

                      void wrong() {
                          int tmp = 0;
                      #pragma omp parallel for private(tmp)
                          for (int j = 0; j < 1000; ++j)
                                   tmp += j;               // tmp was not initialized
                          printf("%d\n", tmp);             // tmp is 0 here
                      }

  When you need to reference the variable tmp that exists prior to the construct,
  we call it the original variable.
                                                                                    55
   Data sharing: Private clause
   When is the original variable valid?
• The original variable’s value is unspecified if it is referenced
  outside of the construct.
     – Implementations may reference the original variable or a copy ….. a
       dangerous programming practice!
     – For example, consider what would happen if the compiler inlined work()?

   int tmp;
   void danger() {
        tmp = 0;
   #pragma omp parallel private(tmp)
        work();
        printf("%d\n", tmp);       // tmp has an unspecified value here
   }

   extern int tmp;
   void work() {
        tmp = 5;                   // unspecified which copy of tmp this updates
   }
                                                                          56
Firstprivate clause
• Variables are initialized from the shared variable: each thread’s private copy of incr
  starts with the value incr had before the construct (0 in this case).

      incr = 0;
      #pragma omp parallel for firstprivate(incr)
      for (i = 0; i <= MAX; i++) {
               if ((i%2)==0) incr++;
               A[i] = incr;
      }
                                                                57
Data sharing:
A data environment test
 • Consider this example of PRIVATE and FIRSTPRIVATE
                variables: A = 1,B = 1, C = 1
          #pragma omp parallel private(B) firstprivate(C)
 • Are A,B,C private to each thread or shared inside the parallel region?
 • What are their initial values inside and values after the parallel region?
 Data sharing: Default clause
                   #include <omp.h>
                   int main()
                   {
                       int i, j=5;    double x=1.0, y=42.0;

                       #pragma omp parallel for default(none) reduction(*:x)
                       for (i=0; i<N; i++){
                          for(j=0; j<3; j++)
                                x *= foobar(i, j, y);
                       }

                       printf(" x is %f\n",(float)x);
                   }
• default(none) forces you to define the storage attributes for variables that appear inside the
  static extent … the static extent is the code in the compilation unit that contains the construct.
• The compiler would complain about j and y, which is important since you don’t want j to be
  shared.
 The full OpenMP specification has other versions of the default clause, but they
           are not used very often so we skip them in the common core                  59
Mandelbrot Set
    Black is inside the set. Other colors indicate how quickly a point crossed the threshold.
                                                                                   60
Exercise: Mandelbrot set area
• Find and fix the errors (hint … the problem is with the data
  environment).
• Once you have a working version, try to optimize the
  program.
  – Try different schedules on the parallel loop.
  – Try different mechanisms to support mutual exclusion … do the
    efficiencies change?
                                                                    61
    The Mandelbrot area program
#include <omp.h>
# define NPOINTS 1000
# define MXITR 1000
struct d_complex{
   double r; double i;
};
void testpoint(struct d_complex);
struct d_complex c;
int numoutside = 0;

int main(){
  int i, j;
  double area, error, eps = 1.0e-5;
#pragma omp parallel for default(shared) private(c, j) \
   firstprivate(eps)
  for (i=0; i<NPOINTS; i++) {
    for (j=0; j<NPOINTS; j++) {
      c.r = -2.0+2.5*(double)(i)/(double)(NPOINTS)+eps;
      c.i = 1.125*(double)(j)/(double)(NPOINTS)+eps;
      testpoint(c);
    }
  }
  area = 2.0*2.5*1.125*(double)(NPOINTS*NPOINTS-numoutside)/(double)(NPOINTS*NPOINTS);
  error = area/(double)NPOINTS;
}

void testpoint(struct d_complex c){
   struct d_complex z;
   int iter;
   double temp;

   z = c;
   for (iter=0; iter<MXITR; iter++){
     temp = (z.r*z.r)-(z.i*z.i)+c.r;
     z.i = z.r*z.i*2+c.i;
     z.r = temp;
     if ((z.r*z.r+z.i*z.i)>4.0) {
       #pragma omp critical
         numoutside++;
       break;
     }
   }
}

•   eps was not initialized
•   Protect updates of numoutside
•   Which value of c does testpoint() see? Global or private?
                                                                                62
Outline
 •   Introduction to OpenMP
 •   Creating Threads
 •   Synchronization
 •   Parallel Loops
 •   Data environment
 •   Memory model
 •   Irregular Parallelism and tasks
 •   Recap
 •   Beyond the common core:
     – Worksharing revisited
     – Synchronization: More than you ever wanted to know
     – Thread private data
     – Thread affinity and data locality
                                                            63
OpenMP memory model
   OpenMP supports a shared memory model
   All threads share an address space, but it can get complicated:

 [Figure: a shared memory holding variable “a”, with processors proc1 … procN each having a
  private cache; copies of “a” may also live in some of those caches.]
                                                                                        64
OpenMP and relaxed consistency
 • OpenMP supports a relaxed-consistency
   shared memory model
   – Threads can maintain a temporary view of shared memory
     that is not consistent with that of other threads
                                                                             65
Flush operation
                                                                             66
Flush and synchronization
     This means if you are mixing reads and writes of a variable across multiple
     threads, you cannot assume the reading threads see the results of the writes
     unless:
     • the writing threads follow the writes with a construct that implies a flush.
     • the reading threads precede the reads with a construct that implies a flush.
     This is a rare event … or putting this another way, you should avoid writing
     code that depends on ordering reads/writes around flushes.
                                                                                      67
Outline
 •   Introduction to OpenMP
 •   Creating Threads
 •   Synchronization
 •   Parallel Loops
 •   Data environment
 •   Memory model
 •   Irregular Parallelism and tasks
 •   Recap
 •   Beyond the common core:
     – Worksharing revisited
     – Synchronization: More than you ever wanted to know
     – Thread private data
     – Thread affinity and data locality
                                                            68
What are tasks?
                                                                 69
What are tasks?
   A common pattern is to have one thread create the tasks while the
        other threads wait at a barrier and execute the tasks
                                                                       70
Single worksharing Construct
                                                              71
Task Directive
             #pragma omp task [clauses]
                            structured-block
 • Hint for the exercise: use tasks to print the indeterminate part of the output (i.e. the “race”
   or “car” parts).
• The taskwait directive:
      #pragma omp taskwait
  – i.e. wait until all tasks defined in the current task have completed.
  – Note: applies only to tasks generated in the current task, not to
    “descendants”.
                                                                 75
Example
 #pragma omp parallel
 {
   #pragma omp single
    {
      #pragma omp task
         fred();
      #pragma omp task
         daisy();
      #pragma omp taskwait
      #pragma omp task
         billy();
    }
 }

 fred() and daisy() must complete before billy() starts.
                                   76
Linked list traversal
     p = listhead ;
     while (p) {
       process(p);
       p=next(p) ;
     }
                                                  77
Parallel linked list traversal

#pragma omp parallel
{
  #pragma omp single
   {
    p = listhead ;
    while (p) {
       #pragma omp task firstprivate(p)
             {
               process (p);
             }
       p=next (p) ;
     }
   }
}

• Only one thread packages tasks.
• firstprivate(p) makes a copy of p when the task is packaged.
                                           78
Data scoping with tasks
• Variables can be shared, private or firstprivate with respect to a
  task.
• For reference, the serial Fibonacci code that the next slide parallelizes:

     int fib (int n)
     { int x, y;
       if (n < 2) return n;
       x = fib(n-1);
       y = fib (n-2);
       return (x+y);
     }

     int main()
     {
       int NW = 5000;
       fib(NW);
     }
                                                          81
Parallel Fibonacci
  int fib (int n)
  { int x, y;
    if (n < 2) return n;

  #pragma omp task shared(x)
    x = fib(n-1);
  #pragma omp task shared(y)
    y = fib (n-2);
  #pragma omp taskwait
    return (x+y);
  }

  int main()
  { int NW = 5000;
    #pragma omp parallel
    {
       #pragma omp single
          fib(NW);
    }
  }

• Binary tree of tasks
• Traversed using a recursive function
• A task cannot complete until all tasks below it in the tree are
  complete (enforced with taskwait)
• x, y are local, and so by default they are private to the current task
    – they must be shared on the child tasks so the children don’t create their own
      firstprivate copies at this level!
                                                          82
Divide and conquer
• Split the problem into smaller sub-problems; continue until
  the sub-problems can be solved directly
                                              3 Options:
                                                 Do work as you split
                                                  into sub-problems
                                                 Do work only at the
                                                  leaves
                                                 Do work as you
                                                  recombine
                                                                     83
Exercise: Pi with tasks
                                                                  84
Program: OpenMP tasks
#include <omp.h>
static long num_steps = 100000000;
#define MIN_BLK 10000000

double pi_comp(int Nstart, int Nfinish, double step)
{  int i, iblk;
   double x, sum = 0.0, sum1, sum2;
   if (Nfinish-Nstart < MIN_BLK){
      for (i=Nstart; i< Nfinish; i++){
         x = (i+0.5)*step;
         sum = sum + 4.0/(1.0+x*x);
      }
   }
   else{
      iblk = Nfinish-Nstart;
      #pragma omp task shared(sum1)
          sum1 = pi_comp(Nstart,         Nfinish-iblk/2, step);
      #pragma omp task shared(sum2)
          sum2 = pi_comp(Nfinish-iblk/2, Nfinish,        step);
      #pragma omp taskwait
        sum = sum1 + sum2;
   }
   return sum;
}

int main ()
{  int i;
   double step, pi, sum;
   step = 1.0/(double) num_steps;
   #pragma omp parallel
   {
      #pragma omp single
         sum = pi_comp(0, num_steps, step);
   }
   pi = step * sum;
}
                                                                                                     85
Results*: pi with tasks
*Intel compiler (icpc) with default optimization level (O2) on Apple OS X 10.7.3 with a dual core
(four HW thread) Intel® Core™ i5 processor at 1.7 GHz and 4 GB DDR3 memory at 1.333 GHz.
                                                                                                86
Using tasks
• Don’t use tasks for things already well supported by
  OpenMP
  – e.g. standard do/for loops
  – the overhead of using tasks is greater
                                                    87
Outline
 •   Introduction to OpenMP
 •   Creating Threads
 •   Synchronization
 •   Parallel Loops
 •   Data environment
 •   Memory model
 •   Irregular Parallelism and tasks
 •   Recap
 •   Beyond the common core:
     – Worksharing revisited
     – Synchronization: More than you ever wanted to know
     – Thread private data
     – Thread affinity and data locality
                                                            88
The OpenMP Common Core: Most OpenMP programs only use these 19 items
Visit the OpenMP booth and enter a drawing for a chance to win a copy of the book.
Drawing Tues and Wed @ 4:30, Thurs @ 2:00. You must be present to win.               93
Background references
                                                            96
Outline
 •   Introduction to OpenMP
 •   Creating Threads
 •   Synchronization
 •   Parallel Loops
 •   Data environment
 •   Memory model
 •   Irregular Parallelism and tasks
 •   Recap
 •   Beyond the common core:
     – Worksharing revisited
     – Synchronization: More than you ever wanted to know
     – Thread private data
     – Thread affinity and data locality
                                                            97
The loop worksharing constructs
                                                                        98
 Loop worksharing constructs:
 The schedule clause
• The schedule clause affects how loop iterations are mapped onto threads
   – schedule(static [,chunk])
        – Deal-out blocks of iterations of size “chunk” to each thread.
  – schedule(dynamic[,chunk])
        – Each thread grabs “chunk” iterations off a queue until all iterations have
          been handled.
  – schedule(guided[,chunk])
        – Threads dynamically grab blocks of iterations. The size of the block starts
          large and shrinks down to size “chunk” as the calculation proceeds.
  – schedule(runtime)
        – Schedule and chunk size taken from the OMP_SCHEDULE environment
          variable (or the runtime library).
  – schedule(auto)
        – Schedule is left up to the runtime to choose (does not have to be any of the
          above).
                                                           100
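A sketch of deferring the choice to run time with schedule(runtime); work() is a placeholder for an iteration of variable cost:

      double work(int i);    /* placeholder: iteration cost varies with i */

      void run(int n, double *results)
      {
         #pragma omp parallel for schedule(runtime)
         for (int i = 0; i < n; i++)
            results[i] = work(i);
      }
      /* experiment without recompiling, e.g.:
            export OMP_SCHEDULE="dynamic,4"
            export OMP_SCHEDULE="guided"        */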
 Nested loops
   For perfectly nested rectangular loops we can parallelize
    multiple loops in the nest with the collapse clause:
                                                                        103
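A minimal sketch of the collapse clause on a perfectly nested pair of rectangular loops (N, M and the array are illustrative):

      void scale(int N, int M, double a[N][M], double s)
      {
         /* collapse(2) forms one N*M iteration space, which is then divided among the threads */
         #pragma omp parallel for collapse(2)
         for (int i = 0; i < N; i++)
            for (int j = 0; j < M; j++)
               a[i][j] = s * a[i][j];
      }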
Outline
•   Introduction to OpenMP
•   Creating Threads
•   Synchronization
•   Parallel Loops
•   Data environment
•   Memory model
•   Irregular Parallelism and tasks
•   Recap
•   Beyond the common core:
    – Worksharing revisited
    – Synchronization: More than you ever wanted to know
    – Thread private data
    – Thread affinity and data locality
                                                           104
Synchronization
 Synchronization is used to impose order constraints and to protect
 access to shared data.
 • High level synchronization:
      – critical        (covered earlier)
      – barrier         (covered earlier)
      – atomic
      – ordered
 • Low level synchronization:
      – flush
      – locks (both simple and nested)
                                                                   105
Synchronization: atomic
• Atomic provides mutual exclusion, but only applies to the update of
  a single memory location.

      #pragma omp parallel
      {
           double tmp, B;
           B = DOIT();
           …
      }
                                                               106
Synchronization: atomic
                                                                    108
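A minimal sketch of the usual pattern: keep the expensive work outside and protect only the single update with atomic. DOIT() appears in the fragment above; big_ugly() and the shared accumulator X are placeholders:

      double DOIT(void);            /* per-thread work (placeholder)            */
      double big_ugly(double);      /* expensive post-processing (placeholder)  */
      double X = 0.0;               /* shared accumulator                       */

      void accumulate(void)
      {
         #pragma omp parallel
         {
            double tmp, B;
            B   = DOIT();
            tmp = big_ugly(B);      /* expensive part runs fully in parallel    */
            #pragma omp atomic
               X += tmp;            /* only this read-modify-write is protected */
         }
      }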
Parallel loop with ordered region
• An ordered clause on a loop worksharing construct
  – indicates that the loop contains an ordered region

 [Figure fragment: loop body with a dependence between x[r][c-1] and x[r][c].]
                                                                                          111
Parallelizing DOACROSS loops
           2 loops contribute to the pattern
              of dependencies … so the
            dependency relations for each
              depend(sink) is of length 2
                                                                                 112
OpenMP memory model
   OpenMP supports a shared memory model
   All threads share an address space, where variables can be stored or
    retrieved:

 [Figure: a shared memory holding variable “a”, with processors proc1 … procN each having a
  private cache; a copy of “a” may also live in one of those caches.]
                                                                           113
Flush operation
• Defines a sequence point at which a thread enforces a
  consistent view of memory.
 IMPORTANT POINT: The flush makes the calling thread’s temporary view match the
 view in shared memory. Flush by itself does not force synchronization.
                                                                                 114
Memory consistency: flush example
    Flush forces data to be updated in memory so other threads see the most
     recent value.

      double A;
      A = compute();
      #pragma omp flush(A)
        // flush to memory to make sure other
        // threads can pick up the right value

     Flush without a list: the flush set is all thread-visible variables.
     Flush with a list: the flush set is the listed variables.
                                                                                   115
Flush and synchronization
• A flush operation is implied by OpenMP synchronizations, e.g.,
  – at entry/exit of parallel regions
  – at implicit and explicit barriers
  – at entry/exit of critical regions
  – whenever a lock is set or unset
  ….
  (but not at entry to worksharing regions or entry/exit of master regions)
                                                                              116
    Example: prod_cons.c
    • Parallelize a producer/consumer program
       – One thread produces values that another thread consumes.
       – Often used with a stream of produced values to implement “pipeline parallelism”.
       – The key is to implement pairwise synchronization between threads.

int main()
{
  double *A, sum, runtime;       int flag = 0;

  A = (double *) malloc(N*sizeof(double));

  runtime = omp_get_wtime();

  fill_rand(N, A);           // Producer: fill an array of data

  sum = Sum_array(N, A);     // Consumer: sum the array

  runtime = omp_get_wtime() - runtime;

  printf(" In %lf secs, The sum is %lf \n",runtime,sum);
}
                                                                                 117
Pairwise synchronization in OpenMP
 • OpenMP lacks synchronization constructs that work between
   pairs of threads.
 • When needed, you have to build it yourself.
 • Pairwise synchronization
   – Use a shared flag variable
   – Reader spins waiting for the new flag value
   – Use flushes to force updates to and from memory
                                                           118
    Exercise: Producer/consumer
int main()
{
   double *A, sum, runtime; int numthreads, flag = 0;
   A = (double *)malloc(N*sizeof(double));
   #pragma omp parallel sections
   {
     #pragma omp section
      {
        fill_rand(N, A);
        flag = 1;
      }
      #pragma omp section
      {
        while (flag == 0){
        }
        sum = Sum_array(N, A);
      }
   }
}

• Put the flushes in the right places to make this program race-free.
• Do you need any other synchronization constructs to make this work?
                                                                                      119
 Solution (try 1): Producer/consumer
int main()
{
   double *A, sum, runtime; int numthreads, flag = 0;
   A = (double *)malloc(N*sizeof(double));
   #pragma omp parallel sections
   {
      #pragma omp section
       {
         fill_rand(N, A);
         #pragma omp flush
         flag = 1;
         #pragma omp flush (flag)
       }
       #pragma omp section
       {
         #pragma omp flush (flag)
         while (flag == 0){
              #pragma omp flush (flag)
         }
         #pragma omp flush
         sum = Sum_array(N, A);
       }
     }
}

• Use flag to signal when the “produced” value is ready.
• Flush forces a refresh to memory; it guarantees that the other thread sees the new value of A.
• Flush is needed on both the “reader” and “writer” sides of the communication.
• Notice you must put the flush inside the while loop to make sure the updated flag
  variable is seen.
       This program works with the x86 memory model (loads and stores use relaxed
       atomics), but it technically has a race … on the store and later load of flag    120
The OpenMP 3.1 atomics (1 of 2)
                                                                                121
  The OpenMP 3.1 atomics (2 of 2)
• Atomic can protect the assignment of a value (its capture) AND an
  associated update operation:
       # pragma omp atomic capture
               statement or structured block
• Where the statement is one of the following forms:
       v = x++;    v = ++x;       v = x--;    v = --x;        v = x binop expr;
• Where the structured block is one of the following forms:
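For example, one structured-block form (a hedged sketch of a fetch-and-add, not the specification’s full list of forms):

      int x = 0;                       /* shared counter */

      int fetch_and_add(int expr)
      {
         int v;
         #pragma omp atomic capture
         { v = x; x += expr; }         /* capture the old value, then update, atomically */
         return v;
      }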

     Note: a thread always accesses the most recent copy of the lock,
     so you don’t need to use a flush on the lock variable.
     Locks with hints were added in OpenMP 4.5 to suggest a lock strategy based on
     intended use (e.g. contended, uncontended, speculative, unspeculative).
                                                                                            126
  Synchronization: Simple locks
• Example: conflicts are rare, but to play it safe, we must assure mutual
  exclusion for updates to histogram elements.

  #pragma omp parallel for
  for(i=0; i<NBUCKETS; i++){
        omp_init_lock(&hist_locks[i]);            // one lock per element of hist
        hist[i] = 0;
  }
  #pragma omp parallel for
  for(i=0; i<NVALS; i++){
      ival = (int) sample(arr[i]);
      omp_set_lock(&hist_locks[ival]);            // enforce mutual exclusion on
         hist[ival]++;                            // updates to the hist array
      omp_unset_lock(&hist_locks[ival]);
    }
  for(i=0; i<NBUCKETS; i++)
   omp_destroy_lock(&hist_locks[i]);              // free up storage when done
                                                                               127
Lock Example from Gafort (SpecOMP’2001)
• Parallel loop in shuffle.f of Gafort.
• Exclusive access to array elements.
• Ordered locking prevents deadlock.

!$OMP PARALLEL PRIVATE(rand, iother, itemp, temp, my_cpu_id)
      my_cpu_id = 1
!$    my_cpu_id = omp_get_thread_num() + 1
!$OMP DO
      DO j=1,npopsiz-1
        CALL ran3(1,rand,my_cpu_id,0)
        iother=j+1+DINT(DBLE(npopsiz-j)*rand)
!$      IF (j < iother) THEN
!$        CALL omp_set_lock(lck(j))
!$        CALL omp_set_lock(lck(iother))
!$      ELSE
!$        CALL omp_set_lock(lck(iother))
!$        CALL omp_set_lock(lck(j))
!$      END IF
        itemp(1:nchrome)=iparent(1:nchrome,iother)
        iparent(1:nchrome,iother)=iparent(1:nchrome,j)
        iparent(1:nchrome,j)=itemp(1:nchrome)
        temp=fitness(iother)
        fitness(iother)=fitness(j)
        fitness(j)=temp
!$      IF (j < iother) THEN
!$        CALL omp_unset_lock(lck(iother))
!$        CALL omp_unset_lock(lck(j))
!$      ELSE
!$        CALL omp_unset_lock(lck(j))
!$        CALL omp_unset_lock(lck(iother))
!$      END IF
      END DO
!$OMP END DO
!$OMP END PARALLEL
                                                               129
Exercise
                                                               130
Outline
 •   Introduction to OpenMP
 •   Creating Threads
 •   Synchronization
 •   Parallel Loops
 •   Data environment
 •   Memory model
 •   Irregular Parallelism and tasks
 •   Recap
 •   Beyond the common core:
     – Worksharing revisited
     – Synchronization: More than you ever wanted to know
     – Thread private data
     – Thread affinity and data locality
                                                            131
Data sharing: Threadprivate
                                                            132
A threadprivate example (C)
• Use threadprivate to make a global variable private to a thread: each thread gets its
  own copy of counter, which persists between parallel regions.

       int counter = 0;
       #pragma omp threadprivate(counter)

       int increment_counter()
       {
          counter++;
          return (counter);
       }
                                            133
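A sketch of calling it from a parallel region (the declarations above are repeated so the fragment stands alone); each thread increments and reads only its own copy of counter:

       #include <stdio.h>
       #include <omp.h>

       int counter = 0;
       #pragma omp threadprivate(counter)

       int increment_counter() { counter++; return (counter); }

       int main()
       {
          #pragma omp parallel
          {
             int v = increment_counter();   /* every thread gets 1: each has its own counter */
             printf("thread %d: counter = %d\n", omp_get_thread_num(), v);
          }
          return 0;
       }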
Data copying: Copyin
• You initialize threadprivate data with a copyin clause on the parallel construct.
                                                                 134
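A minimal C sketch of copyin, reusing the threadprivate counter from the previous slide; at entry to the parallel region each thread’s copy is initialized from the master thread’s value:

       #include <omp.h>

       int counter = 0;
       #pragma omp threadprivate(counter)

       int main()
       {
          counter = 100;                        /* set the master thread's copy */
          #pragma omp parallel copyin(counter)
          {
             /* every thread starts with counter == 100 in its own copy */
             counter += omp_get_thread_num();
          }
          return 0;
       }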
 Data copying: Copyprivate
Used with a single region to broadcast values of privates from one member of a
team to the rest of the team.

   #include <omp.h>
   void input_parameters (int, int); // fetch values of input parameters
   void do_work(int, int);

   void main()
   {
     int Nsize, choice;

     #pragma omp parallel private (Nsize, choice)
     {
        #pragma omp single copyprivate (Nsize, choice)
           input_parameters (Nsize, choice);

        do_work(Nsize, choice);
     }
   }
                                                                        135
Exercise: Monte Carlo calculations
Using random numbers to solve tough problems
                                                                           137
Outline
 •   Introduction to OpenMP
 •   Creating Threads
 •   Synchronization
 •   Parallel Loops
 •   Data environment
 •   Memory model
 •   Irregular Parallelism and tasks
 •   Recap
 •   Beyond the common core:
     – Worksharing revisited
     – Synchronization: More than you ever wanted to know
     – Thread private data
     – Thread affinity and data locality
                                                            138
Thread Affinity and Data Locality
 • Affinity
   – Process Affinity: bind processes (MPI tasks, etc.) to CPUs
   – Thread Affinity: further binding threads to CPUs that are
     allocated to their parent process
 • Data Locality
   – Memory Locality: allocate memory as close as possible to the
     core on which the task that requested the memory is running
   – Cache Locality: use data in cache as much as possible
                                                                    139
Memory Locality
• Most systems today are Non-Uniform Memory Access (NUMA)
• Example: the Intel® Xeon Phi™ processor
    – Over 6 TF SP peak; full Xeon ISA compatibility through AVX-512
    – Up to 72 cores (36 tiles) in a 2D mesh architecture
    – Each tile: 2 cores (based on the Intel® Atom™ processor core), 2 VPUs (512-bit Vector
      Processing Units) per core, a shared 1MB L2, and a HUB connecting the tile to the mesh
    – 6 channels of DDR4, up to 384GB, ~90 GB/s
    – Up to 16GB high-bandwidth on-package memory (MCDRAM), exposed as a NUMA node,
      >400 GB/s sustained BW

Diagram is for conceptual purposes only and only illustrates a CPU and memory – it is not to scale and does not include
all functional areas of the CPU, nor does it represent actual component layout.
                                                                                                                  140
Memory Locality
*KNL: Intel® Xeon Phi™ processor 7250 with 68 cores @ 1.4 Ghz …
the “bootable” version that sits in a socket, not a co-processor
                                                                    141
Example Compute Nodes (Intel Haswell*)
   • An Intel Haswell node has 32 cores (64 CPUs), 128 GB DDR memory.
   • 2 NUMA domains per node, 16 cores per NUMA domain. 2 hardware
     threads (CPUs) per core.
   • Memory bandwidth is non-homogeneous among NUMA domains.
       – CPUs 0-15, 32-47 are closer to memory in NUMA domain 0, farther from memory in NUMA
         domain 1.
       – CPUs 16-31, 48-63 are closer to memory in NUMA domain 1, farther from memory in NUMA
         domain 0.
                                                                                                     143
  Tools to Check Compute Node Information (2)
 • Portable Hardware Locality (hwloc)
     – hwloc-ls: provides a graphical representation of the system
       topology, NUMA nodes, cache info, and the mapping of procs.
     % hwloc-ls
                                            Cori Haswell* node example
                                            32 cores, 2 sockets
 Step 2: Compute
 #pragma omp parallel for
 for (j=0; j<VectorSize; j++) {
    a[j] = b[j] + d*c[j];
 }
 OMP_PROC_BIND=close
                                                                                    145
“Perfect Touch” is Hard
                                                                     146
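A sketch of why “perfect touch” matters: under a first-touch NUMA policy, initialize the data in parallel with the same loop structure as the compute step (array names and VectorSize follow the STREAM-style fragment shown earlier; the sizes are illustrative):

   #include <omp.h>
   #define VectorSize 10000000
   static double a[VectorSize], b[VectorSize], c[VectorSize];

   void stream_like(double d)
   {
      int j;
      /* Step 1: initialize (first touch) in parallel -- each page lands in the
         memory closest to the thread that first writes it                       */
      #pragma omp parallel for
      for (j = 0; j < VectorSize; j++) {
         a[j] = 1.0; b[j] = 2.0; c[j] = 0.0;
      }

      /* Step 2: compute with the same static distribution of j across threads,
         so each thread mostly touches locally allocated memory                  */
      #pragma omp parallel for
      for (j = 0; j < VectorSize; j++) {
         a[j] = b[j] + d*c[j];
      }
   }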
Runtime Environment Variable:
OMP_PROC_BIND
 • Controls thread affinity within and between OpenMP places
 • OpenMP 3.1 only has OMP_PROC_BIND, either TRUE or
   FALSE.
   – If true, the runtime will not move threads around between processors.
 • OpenMP 4.0 still allows the above. Added options:
   – close: bind threads close to the master thread
   – spread: bind threads as evenly distributed (spread out) as possible
   – master: bind threads to the same place as the master thread
 • Examples:
   – OMP_PROC_BIND=spread
   – OMP_PROC_BIND=spread,close (for nested levels)
                                                                         147
Runtime Environment Variable:
OMP_PROC_BIND (2)
 • Use 4 cores total, 2 hyperthreads per core, and OMP_NUM_THREADS=4 as an
   example
 • none: no affinity setting.
 • close: Bind threads as close to each other as possible

                 Node    |    Core 0     |    Core 1     |    Core 2     |    Core 3
                         |  HT1  |  HT2  |  HT1  |  HT2  |  HT1  |  HT2  |  HT1  |  HT2
                 Thread  |   0   |   1   |   2   |   3   |       |       |       |
                                               - 148 -
Runtime Environment Variable:
OMP_PLACES (1)
 • OpenMP 4.0 added OMP_PLACES environment variable
   – To control thread allocation
   – defines a series of places to which the threads are assigned
 • OMP_PLACES can be
   – threads: each place corresponds to a single hardware thread on the
     target machine.
   – cores: each place corresponds to a single core (having one or more
     hardware threads) on the target machine.
   – sockets: each place corresponds to a single socket (consisting of one
     or more cores) on the target machine.
   – A list with explicit CPU ids (see next slide)
 • Examples:
   – export OMP_PLACES=threads
   – export OMP_PLACES=cores
                                                                         149
Runtime Environment Variable:
OMP_PLACES (2)
 • OMP_PLACES can also be
   – A list with explicit place values of CPU ids, such as:
      – "{0,1,2,3},{4,5,6,7},{8,9,10,11},{12,13,14,15}”
      – “{0:4},{4:4},{8:4},{12:4}” (default stride is 1)
      – Format: {lower-bound:length:stride}. Thus, specifying {0:3:2} is the
         same as specifying {0,2,4}
 • Examples:
   – export OMP_PLACES="{0:4:2},{1:4:2}" (which is equivalent to
     "{0,2,4,6},{1,3,5,7}")
   – export OMP_PLACES="{0:8:1}" (which is equivalent to
     "{0,1,2,3,4,5,6,7}")
                                                                           150
Other Runtime Environment Variables
for Affinity Support
 •   OMP_NUM_THREADS
 •   OMP_THREAD_LIMIT
 •   OMP_NESTED
 •   OMP_MAX_ACTIVE_LEVELS
                                                       151
OMP_PROC_BIND Choices for STREAM
                                Blue: OMP_PROC_BIND=close
OMP_NUM_THREADS=32              Red: OMP_PROC_BIND=spread
OMP_PLACES=threads              Both with First Touch
OMP_PROC_BIND=close
   Threads 0 to 31 bind to
   CPUs
   0,32,1,33,2,34,…15,47. All
   threads are in the first
   socket. The second socket
   is idle. Not optimal.
OMP_PROC_BIND=spread
   Threads 0 to 31 bind to
   CPUs 0,1,2,… to 31. Both
   sockets and memory are
   used to maximize memory
   bandwidth.
                                                            152
Affinity Clauses for OpenMP Parallel
Construct
 • The “num_threads” and “proc_bind” clauses can be used
   – The values set with these clauses take precedence over values set
     by runtime environment variables
 • Helps code portability
 • Examples:
   – C/C++:
      #pragma omp parallel num_threads(2) proc_bind(spread)
   – Fortran:
     !$omp parallel num_threads (2) proc_bind (spread)
    ...
     !$omp end parallel
                                                                         153
Runtime APIs for Thread Affinity Support
                                                                        154
Other Runtime APIs for Thread Affinity
Support
 •   omp_get_nested, omp_set_nested
 •   omp_get_thread_limit
 •   omp_get_level
 •   omp_get_active_level
 •   omp_get_max_active_levels, omp_set_max_active_levels
 •   omp_get_proc_bind, omp_set_proc_bind
 •   omp_get_num_threads, omp_set_num_threads
 •   omp_get_max_threads
                                                            155
Exercise: “First Touch” with STREAM
benchmark
• STREAM benchmark codes: stream.c, stream.f
• Check the source codes to see if “first touch” is implemented
• With “first touch” on (stream.c) and off (stream_nft.c), experiment with
  different OMP_NUM_THREADS and OMP_PROC_BIND settings to
  understand how “first touch” and OMP_PROC_BIND choices affect
  STREAM memory bandwidth results (look at the Best Rate for Triad in
  the output).
• Compare your results with the two STREAM plots shown earlier in this
  slide deck.
                                                                             156
Sample Nested OpenMP Program
 #include <omp.h>
 #include <stdio.h>
 void report_num_threads(int level)
 {
    #pragma omp single
    {
       printf("Level %d: number of threads in the team: %d\n", level, omp_get_num_threads());
    }
 }
 int main()
 {
    omp_set_dynamic(0);
    #pragma omp parallel num_threads(2)
    {
       report_num_threads(1);
       #pragma omp parallel num_threads(2)
       {
          report_num_threads(2);
          #pragma omp parallel num_threads(2)
          {
             report_num_threads(3);
          }
       }
    }
    return(0);
 }
 Output with nested parallelism disabled (the default):
    % a.out
    Level 1: number of threads in the team: 2
    Level 2: number of threads in the team: 1
    Level 3: number of threads in the team: 1
    Level 2: number of threads in the team: 1
    Level 3: number of threads in the team: 1
 Output with nested parallelism enabled:
    % export OMP_NESTED=true
    % export OMP_MAX_ACTIVE_LEVELS=3
    % a.out
    Level 1: number of threads in the team: 2
    Level 2: number of threads in the team: 2
    Level 2: number of threads in the team: 2
    Level 3: number of threads in the team: 2
    Level 3: number of threads in the team: 2
    Level 3: number of threads in the team: 2
    Level 3: number of threads in the team: 2
 Thread IDs at each nesting level:
    Level 0: P0
    Level 1: P0 P1
    Level 2: P0 P2; P1 P3
    Level 3: P0 P4; P2 P5; P1 P6; P3 P7
                                                                                                 157
 Process and Thread Affinity in Nested OpenMP
• A combination of OpenMP environment variables and runtime flags is needed
  for different compilers and different batch schedulers on different systems.
• Use the num_threads clause in the source code to set thread counts for nested
  regions (a sketch follows below).
• For most other non-nested regions, use the OMP_NUM_THREADS environment
  variable for simplicity and flexibility.
                                                                                 158
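 A minimal sketch of these recommendations combined (the thread counts are illustrative):
 the outer region spreads its threads across the node while the inner regions keep theirs
 close, matching the spread,close advice later in this deck. Run with OMP_PLACES set
 (for example, OMP_PLACES=cores) so that places are defined.

 #include <omp.h>
 #include <stdio.h>

 int main(void)
 {
    omp_set_max_active_levels(2);                            /* allow two levels of parallelism */

    #pragma omp parallel num_threads(4) proc_bind(spread)    /* outer team spread across the node */
    {
       int outer = omp_get_thread_num();

       #pragma omp parallel num_threads(2) proc_bind(close)  /* inner threads stay near their parent */
       {
          #pragma omp critical
          printf("outer %d / inner %d runs in place %d\n",
                 outer, omp_get_thread_num(), omp_get_place_num());
       }
    }
    return 0;
 }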
When to Use Nested OpenMP
                                                                      159
Use Multiple Threads in MKL
      export OMP_NESTED=true
      export OMP_PLACES=cores
      export OMP_PROC_BIND=spread,close
      export OMP_NUM_THREADS=6,4
      export MKL_DYNAMIC=false
      export OMP_MAX_ACTIVE_LEVELS=2
                                                                            FFT3D on KNC, Ng=64³ example
                                                                           Courtesy of Jeongnim Kim, Intel
*KNC: Intel® Xeon Phi™ processor (Knights Corner) … the first generation co-processor version of the chip.
                                                                                                             160
 Example Compute Nodes (Cori KNL*)
       •   A Cori KNL node has 68 cores/272 CPUs, 96 GB DDR memory, 16 GB
           high bandwidth on package memory (MCDRAM).
       •   Three cluster modes (all-to-all, quadrant, sub-NUMA clustering) are
           available at boot time to configure the KNL mesh interconnect.
       •   A quad,cache node has only 1 NUMA node with all CPUs on the NUMA node 0
           (DDR memory). The MCDRAM is hidden from the “numactl –H” result since it is a
           cache.
       •   A quad,flat node has only 2 NUMA nodes, with all CPUs on NUMA node 0 (DDR
           memory); NUMA node 1 has MCDRAM only.
       •   A snc2,flat node has 4 NUMA domains with DDR memory and all CPUs on NUMA
           nodes 0 and 1. (NUMA node 0 has physical cores 0 to 33 and all corresponding
           hyperthreads, and NUMA node 1 has physical cores 34 to 67 and all corresponding
           hyperthreads). NUMA nodes 2 and 3 have MCDRAM only.
*KNL: Intel® Xeon Phi™ processor 7250 with 68 cores @ 1.4 GHz                                161
Intel KNL Quad,Flat Node Example
                                           Cori KNL quad,flat node example
                                           68 cores (272 CPUs)
% numactl -H
  available: 2 nodes (0-1)
  node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
  37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73
  74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107
  108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134
  135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161
  162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188
  189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215
  216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242
  243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269
  270 271
  node 0 size: 96723 MB
  node 0 free: 93924 MB
  node 1 cpus:
  node 1 size: 16157 MB
  node 1 free: 16088 MB
  node distances:
  node   0    1
    0:  10   31
    1:  31   10

                        • The quad,flat mode has only 2 NUMA nodes, with all CPUs
                          on NUMA node 0 (DDR memory).
                        • NUMA node 1 has MCDRAM (high bandwidth memory).
                                                                                                                      162
Exercise: Affinity Choices and Results
 • Pure OpenMP code: xthi-omp.c
 • Nested OpenMP code: xthi-nested-omp.c
 • Sample output:
    % ./xthi-omp | sort -k4n
    Hello from thread 0, on nid00011. (core affinity = 0)
    Hello from thread 1, on nid00011. (core affinity = 4)
    Hello from thread 2, on nid00011. (core affinity = 8) ...
 • Experiment with different OMP_NUM_THREADS,
   OMP_PROC_BIND, and OMP_PLACES settings to check thread
   affinity on different compute node architectures (for example, Cori
   Haswell and KNL).
                                                                         163
Essential runtime settings for KNL MCDRAM
Memory Affinity
 • In quad,cache mode, no special setting is needed to use
   MCDRAM
 • In quad,flat mode (used as the example below)
   – NUMA node 1 is MCDRAM
 • Enforced memory mapping to MCDRAM:
   – If using >16 GB, malloc will fail
   – Launch the code as "numactl -m 1 ./myapp"
     (instead of "./myapp")
 • Preferred memory mapping to MCDRAM:
   – If using >16 GB, malloc will spill over to DDR
   – Launch the code as "numactl -p 1 ./myapp"
     (instead of "./myapp")
                                                              164
 Summary for Thread Affinity and Data Locality
   • Achieving the best data locality and optimal process and thread affinity is
     crucial for getting good performance with OpenMP, yet it is not
     straightforward to do.
        – Understand the node architecture first, with tools such as "numactl -H".
        – Verify your settings with simple examples before applying them to your real
          application.
   • Exploit the first-touch data policy and optimize code for cache locality.
   • Pay special attention to avoiding false sharing.
   • Putting threads far apart (spread) may improve aggregate memory
     bandwidth and the cache available to your application, but may also
     increase synchronization overhead; putting threads close has the
     opposite effect.
   • For nested OpenMP, setting OMP_PROC_BIND=spread,close is generally
     recommended.
   • Use the numactl -m or -p option to explicitly request memory allocation in a
     specific NUMA domain (for example: high bandwidth memory in KNL)
*KNL: Intel® Xeon Phi™ processor 7250 with 68 cores @ 1.4 GHz                           165
Appendices
• Challenge Problems
• Challenge Problems: solutions
   – Monte Carlo PI and random number generators
   – Molecular dynamics
   – Matrix multiplication
   – Linked lists
   – Recursive matrix multiplication
• Fortran and OpenMP
                                                   166
Challenge problems
                                                                             167
Challenge 1: Molecular dynamics
                                                                 168
Challenge 1: MD (cont.)
 • Once you have a working version, move the parallel region
   out to encompass the iteration loop in main.c
   – Code other than the forces loop must be executed by a single thread
     (or workshared).
   – How does the data sharing change?
 • The atomics are a bottleneck on most systems.
   – This can be avoided by introducing a temporary array for the force
     accumulation, with an extra dimension indexed by thread number
   – Which thread(s) should do the final accumulation into f?
                                                                           169
Challenge 1 MD: (cont.)
 • Another option is to use locks
   – Declare an array of locks
   – Associate each lock with some subset of the particles
   – Any thread that updates the force on a particle must hold the
     corresponding lock
   – Try to avoid unnecessary acquires/releases
   – What is the best number of particles per lock?
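 A minimal sketch of this idea, assuming a flattened force array as in the solutions later in
 this appendix; the lock count and the particle-to-lock mapping are illustrative.

 #include <omp.h>

 #define NLOCKS 64                      /* illustrative: experiment with particles per lock */
 omp_lock_t locks[NLOCKS];

 void init_locks(void)
 {
    for (int i = 0; i < NLOCKS; i++)
       omp_init_lock(&locks[i]);
 }

 /* add (fx,fy,fz) to the force on particle j while holding that particle's lock;
    here f[] is indexed as f[3*j], f[3*j+1], f[3*j+2] */
 void add_force(double *f, int npart, int j, double fx, double fy, double fz)
 {
    int l = (int)(((long)j * NLOCKS) / npart);   /* map particle j to one of NLOCKS locks */
    omp_set_lock(&locks[l]);
    f[3*j]   += fx;
    f[3*j+1] += fy;
    f[3*j+2] += fz;
    omp_unset_lock(&locks[l]);
 }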
                                                                     170
Challenge 2: Matrix multiplication
                                                            171
Challenge 3: Traversing linked lists
 • Consider the program linked.c
   – Traverses a linked list, computing a sequence of Fibonacci numbers
     at each node
 • Parallelize this program two different ways
    1. Use OpenMP tasks
    2. Use anything you choose in OpenMP other than tasks.
 • The second approach (no tasks) can be difficult and may
   take considerable creativity in how you approach the
    problem (which is why it's such a pedagogically valuable problem)
                                                                          172
Challenge 4: Recursive matrix multiplication
                                                                173
Challenge 4: Recursive matrix multiplication
        A, B, and C are each partitioned into 2×2 blocks:
    C1,1 = A1,1·B1,1 + A1,2·B2,1                 C1,2 = A1,1·B1,2 + A1,2·B2,2
    C2,1 = A2,1·B1,1 + A2,2·B2,1                C2,2 = A2,1·B1,2 + A2,2·B2,2
                                                                                174
Challenge 4: Recursive matrix multiplication
     How to multiply submatrices?
 • Use the same routine that is computing the full matrix
   multiplication
    – Quarter each input submatrix and output submatrix
    – Treat each sub-submatrix as a single element and multiply
        (Figure: each submatrix of A, B, and C — A1,1 through C2,2 — is itself quartered;
        the resulting sub-submatrices are multiplied using the same block formulas.)
// C11 += A11*B11
 matmultrec(mf, mf+(ml-mf)/2, nf, nf+(nl-nf)/2, pf, pf+(pl-pf)/2, A,B,C);
// C11 += A12*B21
 matmultrec(mf, mf+(ml-mf)/2, nf, nf+(nl-nf)/2, pf+(pl-pf)/2, pl, A,B,C);
   . . .
}
                                                   177
 Computers and random numbers
• We use “dice” to make random numbers:
  – Given previous values, you cannot predict the next value.
  – There are no patterns in the series … and it goes on forever.
• Computers are deterministic machines … set an initial state,
  run a sequence of predefined instructions, and you get a
  deterministic answer
  – By design, computers are not random and cannot produce random
    numbers.
• However, with some very clever programming, we can make
  “pseudo random” numbers that are as random as you need
  them to be … but only if you are very careful.
• Why do I care? Random numbers drive statistical methods
  used in countless applications:
  – Sample a large space of alternatives to find statistically good answers
    (Monte Carlo methods).
                                                                              178
Monte Carlo Calculations
Using Random numbers to solve tough problems
     pi = 4.0 * ((double)Ncirc/(double)num_trials);
     printf("\n %d trials, pi is %f \n",num_trials, pi);
 }
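 Only the closing lines of the slide's code survive above. For context, here is a minimal
 serial sketch of the Monte Carlo pi estimate they belong to; the trial count and the use of
 the random() generator defined on the following slides are assumptions, not the slide's
 exact code.

 #include <stdio.h>

 double random(void);                  /* the LCG from the next slides: returns a value in [0,1) */

 static long num_trials = 10000;       /* illustrative number of trials */

 int main(void)
 {
    long Ncirc = 0;
    double r = 1.0;                    /* radius of the circle; points fall in the unit square */

    for (long i = 0; i < num_trials; i++) {
       double x = random();
       double y = random();
       if (x*x + y*y <= r*r) Ncirc++;  /* the point landed inside the circle */
    }

    double pi = 4.0 * ((double)Ncirc/(double)num_trials);
    printf("\n %ld trials, pi is %f \n", num_trials, pi);
    return 0;
 }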
                                                                                     180
Linear Congruential Generator (LCG)
• LCG: Easy to write, cheap to compute, portable, OK quality
   If you pick the multiplier and addend correctly, LCG has a period of
    PMOD.
   Picking good LCG parameters is complicated, so look it up
    (Numerical Recipes is a good source). I used the following:
      MULTIPLIER = 1366
      ADDEND = 150889
      PMOD = 714025
                                                                           181
LCG code
 static long MULTIPLIER = 1366;
 static long ADDEND     = 150889;
 static long PMOD       = 714025;
 long random_last = 0;
 double random ()
 {
    long random_next;

    random_next = (MULTIPLIER * random_last + ADDEND) % PMOD;
    random_last = random_next;

    return ((double)random_next/(double)PMOD);
 }

 Seed the pseudo-random sequence by setting random_last.
                                                                      182
Running the PI_MC program with LCG generator
  (Figure: log10 relative error vs. log10 number of samples, for LCG with one thread
  and LCG with 4 threads, trials 1–3.)

  Run the same program the same way and get different answers!
  That is not acceptable!
  Issue: my LCG generator is not threadsafe.

       Program written using the Intel C/C++ compiler (10.0.659.2005) in Microsoft Visual Studio 2005 (8.0.50727.42) and running on a
       dual-core laptop (Intel T2400 @ 1.83 GHz with 2 GB RAM) running Microsoft Windows XP.
                                                                                                                                                 183
LCG code: threadsafe version
static long MULTIPLIER = 1366;
static long ADDEND     = 150889;
static long PMOD       = 714025;
long random_last = 0;
#pragma omp threadprivate(random_last)
double random ()
{
   long random_next;

   random_next = (MULTIPLIER * random_last + ADDEND) % PMOD;
   random_last = random_next;

   return ((double)random_next/(double)PMOD);
}

random_last carries state between random number computations. To make the
generator threadsafe, make random_last threadprivate so each thread has its
own copy.
                                                                             184
 Thread safe random number generators
  (Figure: log10 relative error vs. log10 number of samples, for LCG with 4 threads
  trials 1–3 and for the thread-safe LCG with 4 threads.)

  The thread-safe version gives the same answer each time you run the program.
  But for a large number of samples, its quality is lower than the one-thread
  result! Why?
                                                                            185
    Pseudo Random Sequences
• Random number Generators (RNGs) define a sequence of pseudo-random
  numbers of length equal to the period of the RNG
              (Figure: the pseudo-random sequence partitioned among Thread 1, Thread 2, and Thread 3.)
                                                                                 187
MKL Random number generators (RNG)
     MKL includes several families of RNGs in its vector statistics library.
     Specialized to efficiently generate vectors of random numbers
                                                                                        188
Wichmann-Hill generators (WH)
VSLStreamStatePtr stream;
#pragma omp threadprivate(stream)
                        …
vslNewStream(&stream, VSL_BRNG_WH+Thrd_ID, (int)seed);
                                                                       189
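 For completeness, a hedged sketch of how such a per-thread stream might be used to draw a
 vector of uniforms; vdRngUniform and VSL_RNG_METHOD_UNIFORM_STD are standard MKL VSL names,
 but the buffer size, function names, and seed handling here are illustrative.

 #include <mkl_vsl.h>
 #include <omp.h>

 #define BUF 1000                      /* illustrative: how many numbers to draw per call */

 VSLStreamStatePtr stream;             /* one stream per thread */
 #pragma omp threadprivate(stream)

 void rng_init(unsigned int seed)      /* call from inside the parallel region */
 {
    int id = omp_get_thread_num();
    vslNewStream(&stream, VSL_BRNG_WH + id, seed);   /* a different WH family member per thread */
 }

 void rng_fill(double *r)              /* fill r[0..BUF-1] with uniforms in [0,1) */
 {
    vdRngUniform(VSL_RNG_METHOD_UNIFORM_STD, stream, BUF, r, 0.0, 1.0);
 }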
                       Independent Generator for each thread
  (Figure: log10 relative error vs. log10 number of samples, for WH with one thread,
  WH with 2 threads, and WH with 4 threads.)

  Notice that once you get beyond the high-error, small-sample-count range,
  adding threads doesn't decrease the quality of the random sampling.
                                                                                  190
Leap Frog method
• Interleave samples in the sequence of pseudo random numbers:
   – Thread i starts at the ith number in the sequence
   – Stride through sequence, stride length = number of threads.
• Result … the same sequence of values regardless of the number of
  threads.
Used the MKL library with two generator streams per computation: one for the x values (WH) and one for the
y values (WH+1). Also used the leapfrog method to deal out iterations among threads.
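 A minimal sketch of the leapfrog idea applied to the LCG from earlier in this appendix (not
 the MKL streams used for the plot): each thread seeds itself at the i-th element, and every
 call then jumps ahead by the number of threads. Function names are illustrative, and 64-bit
 long arithmetic is assumed.

 #include <omp.h>

 static long MULTIPLIER = 1366;
 static long ADDEND     = 150889;
 static long PMOD       = 714025;

 /* per-thread state: the current element plus "leap" constants for a stride of
    nthreads, so x(k+n) = (mult_n * x(k) + add_n) % PMOD */
 static long random_last = 0, mult_n, add_n;
 #pragma omp threadprivate(random_last, mult_n, add_n)

 void leapfrog_init(long seed)          /* call from inside the parallel region */
 {
    int n  = omp_get_num_threads();
    int id = omp_get_thread_num();

    long a = 1, c = 0, x = seed;
    for (int i = 0; i < n; i++) {       /* compose the LCG with itself n times */
       c = (MULTIPLIER * c + ADDEND) % PMOD;
       a = (MULTIPLIER * a) % PMOD;
    }
    mult_n = a;
    add_n  = c;

    for (int i = 0; i < id; i++)        /* advance this thread's start to element id */
       x = (MULTIPLIER * x + ADDEND) % PMOD;
    random_last = x;
 }

 double leapfrog_random(void)           /* thread i returns elements i, i+n, i+2n, ... */
 {
    double val = (double)random_last / (double)PMOD;
    random_last = (mult_n * random_last + add_n) % PMOD;
    return val;
 }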
                                                                                                         192
Appendices
• Challenge Problems
• Challenge Problems: solutions
   – Monte Carlo PI and random number generators
   – Molecular dynamics
   – Matrix multiplication
   – Linked lists
   – Recursive matrix multiplication
• Fortran and OpenMP
                                                   193
Molecular dynamics: Solution
                                               (The compiler will warn you if you have
                                               missed some variables.)
                                                                        194
Molecular dynamics : Solution (cont.)
........
#pragma omp atomic
          f[j] -= forcex;
#pragma omp atomic
          f[j+1] -= forcey;
#pragma omp atomic            All updates to f must be
          f[j+2] -= forcez;   atomic
        }
      }
#pragma omp atomic
      f[i] += fxi;
#pragma omp atomic
      f[i+1] += fyi;
#pragma omp atomic
      f[i+2] += fzi;
    }
  }
                                                         195
Molecular dynamics : With orphaning
 }
 #pragma omp for reduction(+:epot,vir) schedule (static,32)
    for (int i=0; i<npart*3; i+=3) {
 ………                                                        All variables which used to
                                                              be shared here are now
                                                              implicitly determined
                                                                                       196
Molecular dynamics : With array reduction
           ftemp[myid][j] -= forcex;
           ftemp[myid][j+1] -= forcey;
           ftemp[myid][j+2] -= forcez;
       }
      }
       ftemp[myid][i]       += fxi;
       ftemp[myid][i+1]      += fyi;     Replace atomics with
       ftemp[myid][i+2]      += fzi;     accumulation into array
  }                                      with extra dimension
                                                                   197
Molecular dynamics : With array reduction
 ….
                                   Reduction can be done
 #pragma omp for                   in parallel
   for(int i=0;i<(npart*3);i++){
        for(int id=0;id<nthreads;id++){
             f[i] += ftemp[id][i];
           ftemp[id][i] = 0.0;
      }
   }                                    Zero ftemp for next time
                                         round
                                                                   198
Appendices
• Challenge Problems
• Challenge Problems: solutions
   – Monte Carlo PI and random number generators
   – Molecular dynamics
   – Matrix multiplication
   – Linked lists
   – Recursive matrix multiplication
• Fortran and OpenMP
                                                   199
Challenge: Matrix Multiplication
                                                            200
 Matrix multiplication
   There is much more that can be done. This is really just the first
   and most simple step.
  Results on an Intel dual core 1.83 GHz CPU, Intel IA-32 compiler 10.1 build 2                 201
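 The code behind this slide is not reproduced in the extracted text. A minimal sketch of the
 "first and most simple step" it refers to, assuming the usual triply nested loop with the
 outermost loop workshared (the names N, A, B, C are illustrative):

 /* simplest parallelization of C = A*B: workshare the outermost loop */
 void matmul(int N, double *A, double *B, double *C)
 {
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
       for (int j = 0; j < N; j++) {
          double tmp = 0.0;
          for (int k = 0; k < N; k++)
             tmp += A[i*N + k] * B[k*N + j];
          C[i*N + j] = tmp;
       }
 }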
Appendices
• Challenge Problems
• Challenge Problems: solutions
   – Monte Carlo PI and random number generators
   – Molecular dynamics
   – Matrix multiplication
   – Linked lists
   – Recursive matrix multiplication
• Fortran and OpenMP
                                                   202
Exercise: traversing linked lists
 • Consider the program linked.c
   – Traverses a linked list computing a sequence of Fibonacci numbers at
     each node.
 • Parallelize this program two different ways
    1. Use OpenMP tasks
    2. Use anything you choose in OpenMP other than tasks.
 • The second approach (no tasks) can be difficult and may
   take considerable creativity in how you approach the
   problem (hence why it's such a pedagogically valuable
   problem).
                                                                        203
 Linked lists with tasks
• See the file Linked_omp3_tasks.c
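 Linked_omp3_tasks.c itself is not reproduced here; the fragment below is a minimal sketch of
 the pattern it uses, reusing the head/processwork names from the no-tasks version later in
 this appendix (the node type name is an assumption): one thread walks the list and spawns a
 task per node.

 #pragma omp parallel
 {
    #pragma omp single                       /* one thread walks the list ... */
    {
       for (node *p = head; p != NULL; p = p->next) {
          #pragma omp task firstprivate(p)   /* ... and spawns a task per node */
          processwork(p);
       }
    }
 }                                           /* implicit barrier: all tasks finish here */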
                                                                           204
Exercise: traversing linked lists
 • Consider the program linked.c
   – Traverses a linked list computing a sequence of Fibonacci numbers at
     each node.
 • Parallelize this program two different ways
    1. Use OpenMP tasks
    2. Use anything you choose in OpenMP other than tasks.
 • The second approach (no tasks) can be difficult and may
   take considerable creativity in how you approach the
   problem (hence why it's such a pedagogically valuable
   problem).
                                                                        205
 Linked lists without tasks
• See the file Linked_omp25.c
while (p != NULL) {
    p = p->next;                                 Count number of items in the linked list
     count++;
}
p = head;
for(i=0; i<count; i++) {
     parr[i] = p;
                                                 Copy pointer to each node into an array
     p = p->next;
  }
#pragma omp parallel
{
    #pragma omp for schedule(static,1)
    for(i=0; i<count; i++)                       Process nodes in parallel with a for loop
      processwork(parr[i]);
}
                                                   Default schedule      schedule(static,1)
                                     One Thread    48 seconds            45 seconds
                                     Two Threads   39 seconds            28 seconds
   Results on an Intel dual core 1.83 GHz CPU, Intel IA-32 compiler 10.1 build 2             206
  Linked lists without tasks: C++ STL
• See the file Linked_cpp.cpp
int j = (int)nodelist.size();                      Count number of items in the linked list
#pragma omp parallel for schedule(static,1)
   for (int i = 0; i < j; ++i)
         processwork(nodelist[i]);
                                                   208
Recursive matrix multiplication
• Could be executed in parallel as 4 tasks
   – Each task executes the two calls for the same output submatrix of C
• However, the same number of multiplication operations is needed
  #define THRESHOLD 32768   // product size below which simple matmult code is called
  void matmultrec(int mf, int ml, int nf, int nl, int pf, int pl,
                  double **A, double **B, double **C)
  {
     if ((ml-mf)*(nl-nf)*(pl-pf) < THRESHOLD)
        matmult (mf, ml, nf, nl, pf, pl, A, B, C);
     else
     {
  #pragma omp task firstprivate(mf,ml,nf,nl,pf,pl)
  {
        matmultrec(mf, mf+(ml-mf)/2, nf, nf+(nl-nf)/2, pf, pf+(pl-pf)/2, A, B, C);   // C11 += A11*B11
        matmultrec(mf, mf+(ml-mf)/2, nf, nf+(nl-nf)/2, pf+(pl-pf)/2, pl, A, B, C);   // C11 += A12*B21
  }
  #pragma omp task firstprivate(mf,ml,nf,nl,pf,pl)
  {
        matmultrec(mf, mf+(ml-mf)/2, nf+(nl-nf)/2, nl, pf, pf+(pl-pf)/2, A, B, C);   // C12 += A11*B12
        matmultrec(mf, mf+(ml-mf)/2, nf+(nl-nf)/2, nl, pf+(pl-pf)/2, pl, A, B, C);   // C12 += A12*B22
  }
  #pragma omp task firstprivate(mf,ml,nf,nl,pf,pl)
  {
       matmultrec(mf+(ml-mf)/2, ml, nf, nf+(nl-nf)/2, pf, pf+(pl-pf)/2, A, B, C);    // C21 += A21*B11
       matmultrec(mf+(ml-mf)/2, ml, nf, nf+(nl-nf)/2, pf+(pl-pf)/2, pl, A, B, C);    // C21 += A22*B21
  }
  #pragma omp task firstprivate(mf,ml,nf,nl,pf,pl)
  {
       matmultrec(mf+(ml-mf)/2, ml, nf+(nl-nf)/2, nl, pf, pf+(pl-pf)/2, A, B, C);    // C22 += A21*B12
       matmultrec(mf+(ml-mf)/2, ml, nf+(nl-nf)/2, nl, pf+(pl-pf)/2, pl, A, B, C);    // C22 += A22*B22
  }
  #pragma omp taskwait
      }
  }                                                                                                      209
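 One detail the slide leaves implicit: the tasks above only run in parallel if the recursion
 is started from inside a parallel region, typically by a single thread. A minimal call
 sketch (the matrix size N is illustrative):

 #pragma omp parallel
 #pragma omp single
 matmultrec(0, N, 0, N, 0, N, A, B, C);      /* one thread starts the recursion; tasks do the rest */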
Appendices
• Challenge Problems
• Challenge Problems: solutions
   – Monte Carlo PI and random number generators
   – Molecular dynamics
   – Matrix multiplication
   – Linked lists
   – Recursive matrix multiplication
• Fortran and OpenMP
                                                   210
Fortran and OpenMP
                                                                 211
OpenMP:
Some syntax details for Fortran programmers
 • Most of the constructs in OpenMP are compiler directives.
   – For Fortran, the directives take one of the forms:
      C$OMP construct [clause [clause]…]
      !$OMP construct [clause [clause]…]
      *$OMP construct [clause [clause]…]
 • The OpenMP include file and lib module
      use omp_lib
      Include omp_lib.h
                                                               212
   OpenMP:
   Structured blocks (Fortran)
     – Most OpenMP constructs apply to structured blocks.
         –Structured block: a block of code with one point of
          entry at the top and one point of exit at the bottom.
         –The only “branches” allowed are STOP statements
          in Fortran and exit() in C/C++.
    C$OMP PARALLEL
    10 wrk(id) = garbage(id)
       res(id) = wrk(id)**2
       if(conv(res(id))) goto 10
    C$OMP END PARALLEL

    C$OMP PARALLEL DO
       do I=1,N
          res(I)=bigComp(I)
       end do
    C$OMP END PARALLEL DO
                                                                                 214
Runtime library routines
• The include file or module defines parameters
   – Integer parameter omp_lock_kind
   – Integer parameter omp_nest_lock_kind
   – Integer parameter omp_sched_kind
   – Integer parameter openmp_version
      – With a value that matches C's _OPENMP macro
• Fortran interfaces are similar to those used with C
   – Subroutine omp_set_num_threads (num_threads)
   – Integer function omp_get_num_threads()
   – Integer function omp_get_thread_num()
   – Subroutine omp_init_lock(svar)
      – Integer(kind=omp_lock_kind) svar
   – Subroutine omp_destroy_lock(svar)
   – Subroutine omp_set_lock(svar)
   – Subroutine omp_unset_lock(svar)
                                                        215
OpenMP compilers on Apple laptops: MacPorts
 • To use OpenMP on your Apple laptop:
 • Download Xcode. Be sure to setup the command line tools.
 • Download and use MacPorts to install the latest gnu compilers.
       sudo port select --set gcc mp-gcc6         Select the mp-enabled version of
                                                  the most recent gcc release
                                                                                         216
OpenMP compilers on Apple laptops: Homebrew
 • An alternate way to use OpenMP on your Apple laptop:
 • Install Homebrew. If Homebrew is already installed, skip to the install gcc section.