Optimising Serial Code
Advancing Science with Pawsey
         Learning Objectives
• Choose an algorithm for good performance
• Choose a language for good performance
• Understand the importance of standard
  conformance
• Write code that can be optimised for a modern
  CPU
• Locate bottlenecks in a serial code and address
  them
                           Course Roadmap
•   Algorithm: correctness/novelty, complexity, memory scaling, etc.
•   Programming: language selection, code efficiency, standardisation, etc.
•   Profiling: cache performance, FLOPS, IPC, etc.
•   Optimisation: data locality, loop transformation, vectorisation, etc.
•   Parallelisation
•   Performance
              Motivation
• Science will be limited by the code and
  computer.
• Development effort may be secondary to
  code performance.
• Many gains come with small effort.
• Code optimisation can force good
  programming styles.
Is it worth the time?
    From the XKCD comic "Is It Worth the Time?": https://xkcd.com/1205/
         Development Effort
1 Magnus day = 49 years on a dual-core
desktop.
• It is worth spending months of development
  time to save years or decades of runtime!
• This may go against common opinion.
You still need to consider unit testing, ease
of collaboration and code extension.
Know When to Stop Developing
Focus your effort!
1. Start with a good algorithm.
2. Write code with performance in mind.
3. Profile the code to determine where to spend
   your optimising effort.
4. Optimise the code, then profile again, and repeat
   until satisfactory performance achieved.
Know when to stop trying! Set an effort limit or a
runtime goal.
MEASURING PERFORMANCE
          Busy vs Effective
• Low processor utilisation does mean sub-
  optimal performance.
• High processor utilisation does not mean
  optimal performance.
• Why?
  • processor could be doing something
    inefficient
  • processor could be polling
                  Time
• What matters is real (your) time, measured
  by the clock on the wall; we call this walltime.
• You can use routines like gettimeofday,
  MPI_Wtime to measure program runtime.
• Don’t use many timing calls to profile the
  code. This is slow and alters the profile.
  Instead use a profiler, which we will do
  later!
• CPU clock speed can vary between runs!
                        Timing routines
•   C / C++
    clock_t clock(void): Returns the processor time used by the program, in "clock ticks". (Divide
    by CLOCKS_PER_SEC to convert to seconds.) Resolution is typically microseconds.
    time_t time(time_t *second): Returns the real time in seconds since the Unix epoch, 00:00:00
    Jan 1, 1970. (The value is also stored through "second" if it is not a NULL pointer.)
    Resolution is seconds.
•   Fortran
    CPU_TIME(TIME): Returns the processor time in seconds since the start of the program
    through "TIME", which is a REAL with INTENT(OUT). Resolution is typically milliseconds.
    DATE_AND_TIME([date,time,zone,values]): Returns character and binary data from the
    real-time clock and date. Resolution is milliseconds.
•   glibc
    int gettimeofday(struct timeval *tp, struct timezone *tzp): Returns the current calendar
    time as the elapsed time since the epoch in the structure “tp”. Resolution is microseconds.
•   MPI
    double MPI_Wtime( void ): Returns time in seconds since an arbitrary time in the past.
    Resolution is indicated by: double MPI_Wtick( void ).
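As an illustration, here is a minimal sketch (assuming a POSIX system; do_work() is a hypothetical compute kernel) of wrapping a code region with both a walltime and a CPU-time measurement:

/* timing_sketch.c: compare CPU time (clock) with walltime (gettimeofday).
   Assumes a POSIX system; do_work() is a hypothetical compute kernel. */
#include <stdio.h>
#include <time.h>
#include <sys/time.h>

static double do_work(long n)
{
    double s = 0.0;
    for (long i = 1; i <= n; i++)
        s += 1.0 / (double)i;          /* something for the CPU to chew on */
    return s;
}

int main(void)
{
    struct timeval t0, t1;
    clock_t c0, c1;

    gettimeofday(&t0, NULL);           /* walltime before */
    c0 = clock();                      /* CPU time before */

    double s = do_work(100000000L);

    c1 = clock();
    gettimeofday(&t1, NULL);

    double wall = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) * 1e-6;
    double cpu  = (double)(c1 - c0) / CLOCKS_PER_SEC;

    printf("result %.6f  walltime %.3f s  CPU time %.3f s\n", s, wall, cpu);
    return 0;
}

On a quiet node the two values are similar; a large gap between CPU time and walltime usually indicates time spent waiting, e.g. on I/O.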
CHOOSING AN ALGORITHM
                         Algorithms
Many common algorithms have:
•   a period of time
•   increments within that period
•   elements of interest
•   ways to visit these elements
•   a calculation <do something>

Algorithm A (grid):
    For t = t1 .. tn   {over time increments}
        For i = x1 .. xn
            For j = y1 .. yn
                For k = z1 .. zn
                    <Do something>

Algorithm B (N-body):
    For t = t1 .. tn   {over time increments}
        For each element
            For each other element
                <Do something>
                  Complexity
At this stage, we can make a few observations:
• Algorithm A will do #t · #x · #y · #z calculations = O(t·n³)
• Algorithm B will do #t · n · (n−1) calculations = O(t·n²)
… this is the order, or algorithmic complexity.
                    Scaling
• Consider memory scaling as well as
  compute scaling. Memory is finite.
                     Compute Time   Memory
3D grid with time    O(t·n³)        O(n³)
N-body               O(t·n²)        O(n)
• Other examples: the QR tridiagonal
  eigensolver is O(n²) in memory, while the MRRR (MR³)
  tridiagonal eigensolver is O(n) in memory.
 Example Alternative Algorithm
• Instead of calculating every pairwise interaction
  (brute force) in the N-body problem, represent
  hierarchies of bodies with multipoles in a
  spatial decomposition (the influence of very distant
  bodies is negligible).
• O(N²) interactions become O(N log N), or even
  O(N) with adaptive expansions.
  N         Brute force O(N²)    Multipole model O(N log N)
  1,000     1,000,000            7,000
      Choosing an Algorithm
There may be multiple algorithms to solve a
problem. Consider:
• E.g. O(n³) vs O(n log n), or a brute-force
  approach vs a clever approach that gives adequate results.
• Parallelisability constraints imposed by
  the algorithm.
• Domain decomposition often leads to
  good scaling and parallelisability. Load
  balancing may become an issue.
               Prefactor
• If two algorithms have the same complexity,
  does one have a smaller prefactor, i.e. fewer
  expensive operations per iteration? E.g. square roots,
  exponentials, complex arithmetic, sin, cos,
  etc.
• Algorithms that scale well might have a
  larger prefactor and be slower
  for small n. Perhaps use two algorithms
  and switch at a transition point based on n (see the sketch below).
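As a small C sketch of the two-algorithm idea: insertion sort (tiny prefactor, O(n²)) for short arrays and the library qsort (O(n log n), larger prefactor) otherwise. The crossover N_SWITCH and the function names are illustrative; a real crossover would be chosen by benchmarking on the target machine.

/* Dispatch on problem size: O(n^2) with a tiny prefactor for small n,
   O(n log n) with a larger prefactor otherwise. */
#include <stdio.h>
#include <stdlib.h>

#define N_SWITCH 32   /* hypothetical crossover point, tune by measurement */

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

static void insertion_sort(double *a, int n)
{
    for (int i = 1; i < n; i++) {
        double key = a[i];
        int j = i - 1;
        while (j >= 0 && a[j] > key) { a[j + 1] = a[j]; j--; }
        a[j + 1] = key;
    }
}

static void hybrid_sort(double *a, int n)
{
    if (n < N_SWITCH)
        insertion_sort(a, n);                          /* low prefactor  */
    else
        qsort(a, (size_t)n, sizeof(double), cmp_double); /* better scaling */
}

int main(void)
{
    double a[] = { 3.0, 1.0, 2.0, 5.0, 4.0 };
    hybrid_sort(a, 5);
    for (int i = 0; i < 5; i++) printf("%g ", a[i]);
    printf("\n");
    return 0;
}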
Rewrite your Maths: Simple Tricks
CHOOSING A LANGUAGE
         Choosing a language
Choose the right language for the job.
•   Performance.
•   Potential to use OpenMP, OpenACC, CUDA, and/or
    MPI.
•   Flexibility to adopt new technologies.
•   Portability.
•   Available performance and debugging tools.
•   Ease of programming.
Perhaps a hybrid approach, e.g. python/C or
python/Fortran
Other Language Considerations
Our Goals:
• Want to convert algorithms into code (without too
  much effort).
• Want to help the compiler produce the fastest
  code (without too much effort).
Some Considerations:
  •   Array / matrix support
  •   Complex number support
  •   Aliasing
  •   Static typing
                        Aliasing
Aliasing is when two or more variables *might* refer to the
same memory location. This constrains
compiler optimisations (e.g. it prevents code reordering).
• Fortran
    Assumes no aliasing. Avoid unnecessary use of pointers
    (stick to allocatable).
•   C
    Use the restrict keyword, standard since C99 (see the sketch below).
•   C++
    __restrict__ and similar are non-standard extensions offered by some compilers.
    Not portable.
Avoid unnecessary use of pointers.
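A minimal C99 sketch of restrict: the caller promises that the three arrays do not overlap, so the compiler is free to reorder and vectorise the loop. The function name is illustrative.

#include <stdio.h>

/* C99 "restrict": the caller promises that c, a and b do not alias,
   so the compiler may reorder and vectorise the loop. */
static void add_arrays(int n, double * restrict c,
                       const double * restrict a,
                       const double * restrict b)
{
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

int main(void)
{
    double a[4] = {1, 2, 3, 4}, b[4] = {10, 20, 30, 40}, c[4];
    add_arrays(4, c, a, b);
    printf("%g %g %g %g\n", c[0], c[1], c[2], c[3]);
    return 0;
}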
                Static Typing
• With static typing, the type is known at
  compile-time.
• The compiler can check that arguments passed to
  functions have types compatible with the function interfaces.
•   Permits optimisations at compile-time.
•   Improves reproducibility, reduces errors.
•   API is well documented for others.
•   Fortran, C, C++ all support static typing.
                     Language performance
                    Fortran       Julia   Python   R       Matlab   Octave   Mathematica   JavaScript     Go      SciLua      Java
                    (gcc 7.3.1)   1.0.0   3.6.3    3.5.0   R2018a   4.2.2    11.3          (V8 6.2.414)   go1.9   1.0.0-b12   1.8.0_17
fibonacci           0.98          1.32    94.2     263     17.6     10045    132.1         3.51           1.80    1.18        3.63
parse_integer       6.87          2.19    19.76    50.37   178.2    574.4    22.66         5.03           0.96    0.97        3.17
quicksort           1.18          0.99    37.57    57.93   2.36     2221.3   44.48         4.28           1.24    1.56        2.98
mandelbrot          0.70          0.68    65.66    195.5   9.84     5812     18.29         1.134          0.77    1.00        1.42
pi_sum              1.00          1.01    14.77    11.69   1.01     317.5    1.45          1.08           1.00    0.99        1.08
matrix_statistics   1.54          1.63    17.73    20.97   8.09     46.24    7.49          13.97          6.08    1.70        5.02
matrix_multiply     1.15          0.97    1.18     8.26    1.16     1.21     1.18          31.77          1.43    1.08        8.07
 Comparison from https://julialang.org/benchmarks/.
 Times relative to C (gcc 7.3.1). Lower is better.
 Ran on a single core (serial execution) on an Intel® Core™ i7-3960X 3.30GHz CPU with
 64GB of 1600MHz DDR3 RAM, running openSUSE LEAP 15.0 Linux.
 These are not optimal results, just how a typical researcher might program in those
 languages.
   Fortran Compiler Comparison
                    GCC    Intel   Cray
fibonacci           1      4.41    3.16
parse_integer       1      0.77    0.77
mandelbrot          1      0.50    -
quicksort           1      0.91    0.99
pi_sum              1      0.55    0.55
matrix_statistics   1      0.80    0.47
matrix_multiply     1      0.75    0.15
Comparison code perf.f90 from https://julialang.org/benchmarks/.
Runtimes normalised to the GNU runtimes. Lower is better.
Results on Magnus, all using "ftn -O3 perf.f90" under the relevant compiler environments.
STANDARDS CONFORMANCE
      Standard Conformance
• Your code will run on different systems and
  different compilers over the years.
• Standards conformance significantly improves
  the chance of reproducibility. (good for
  science!)
• Reproducibility means “apples vs apples”
  performance comparisons between systems
  and compilers, and between profiling runs.
• Do not rely on compiler extensions!
 How to write portable code (1)
• Use your compiler to check.
  •   gcc -std=c99 -pedantic myfile.c
  •   gcc -std=c++98 -pedantic myfile.cxx
  •   gfortran -std=f95 -pedantic myfile.f90
• Develop with multiple compilers. Cray
  compilers are very pedantic.
• Download a copy of the standard (or a
  draft).
 How to write portable code (2)
• Try initialising variables to different values;
  do you get the same answer?
  • gfortran -finit-real=inf -finit-logical=false -finit-character=t …
• Try runtime checking. This is slower than
  compile-time checking.
• gfortran -fcheck=all
            Portable Makefiles
• Don’t hard code compiler-specific flags.
  •   There is no standard for compiler flags, just for the
      code itself.
  •   -Wall is not portable among compilers
• Use Makefile variables, make them easy to find
  and modify:
CFLAGS=-g -Wall
#CFLAGS=-O3
http://www.gnu.org/software/make/manual/
                          Exercise 1
Exercises are publicly available via Github:
git clone https://github.com/PawseySC/Optimising-Serial-Code.git
It’s case sensitive. Download them!
Have a look to the various “Makefile”s.
       Exercise 2: matmul (1)
In the exercise directory:
  cd matrix && make matrix
This produces the executables matrix.O0,
matrix.O1, matrix.O2, matrix.O3
(Have a quick look at the Makefile).
Run them through the queue:
  sbatch run_matrices.slurm
Output is in the SLURM output file slurm-JOBID.out.
       Exercise 2: matmul (2)
Compare the timings:
• What effect does optimisation level have on
  calls to the external math routine dgemm, to
  the intrinsic matmul, to manual looping of
  matrix multiplication?
• Is there a single best method for any matrix
  size at high optimisation?
• What is the main difference between
  matmul3 and matmul4?
• If you have time, try with a different compiler.
MODERN COMPUTERS AND
OPTIMISATION
                         CPUs
•   Most CPUs have multi-level caches, to reduce the time
    taken to access data.
•   A CPU can do a lot of work in the time it takes to access
    main memory.
                                 Data Locality
Location                              Access time                             Access time (cycles)
Register                              <1ns                                    -
L1 cache                              1ns                                     4
L2 cache                              4ns                                     10
L3 cache                              15-30ns                                 40-75
Memory                                60ns                                    150
Solid state disk                      50us                                    130,000
Hard disk                             10ms                                    26,000,000
Tape                                  10sec                                   26,000,000,000
   Source: https://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf
         Cache: Access Patterns
•   Sequential access results in a higher rate of cache hits
•   Striding access has a low rate of hits, however modern processors
    can detect striding access patterns and pre-fetch cache lines
•   Random access is typically the worst, with a low rate of hits and no
    ability to predict subsequent access locations
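A small, illustrative C sketch of the difference: both loops sum the same array, but the first walks memory with stride 1 and reuses each cache line, while the second jumps by a large stride and touches a new cache line almost every iteration. The array size and stride are arbitrary.

/* Sequential vs strided traversal of the same data.  Both sums visit
   every element; the sequential loop walks memory contiguously, the
   strided loop does not.  Sizes are illustrative only. */
#include <stdio.h>
#include <stdlib.h>

#define N      (1 << 24)   /* number of doubles         */
#define STRIDE 1024        /* large stride, in elements */

int main(void)
{
    double *a = malloc((size_t)N * sizeof *a);
    if (!a) return 1;
    for (long i = 0; i < N; i++) a[i] = 1.0;

    double s_seq = 0.0, s_str = 0.0;

    /* cache-friendly: stride 1 */
    for (long i = 0; i < N; i++)
        s_seq += a[i];

    /* cache-unfriendly: stride of STRIDE elements */
    for (long start = 0; start < STRIDE; start++)
        for (long i = start; i < N; i += STRIDE)
            s_str += a[i];

    printf("sums: %g %g (both should equal %d)\n", s_seq, s_str, N);
    free(a);
    return 0;
}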
    Translation Lookaside Buffer
•   The page table maps the memory seen by the program
    (virtual memory) onto physical memory. It sits in main
    memory and is slow to access.
•   The Translation Lookaside Buffer (TLB) is a cache of
    recently used mappings (not of the data itself). As with
    the data caches, aim to work within a small set of virtual
    memory pages. These are typically 4 kB.
•   On a Cray you can use larger pages:
    •   Before compiling: module load craype-hugepages2M (the size does not matter)
    •   At runtime: module load craype-hugepagesXXXM (the size does matter)
       Being cache-friendly
• Read and write to contiguous chunks of
  memory. Data is transferred in a cache
  line.
• Avoid cache misses.
          Writing to Memory
• Don’t store data in RAM if you don’t need
  to. Use local temporary variables instead.
• These could be optimised into registers.
• In particular, don’t use global variables as
  local temporary variables.
• Similarly, avoid using array sections for
  temporary storage.
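A small C sketch of this habit (function names are illustrative): the first routine accumulates directly into memory through a pointer on every iteration, while the second keeps the running total in a local scalar that the compiler can hold in a register and stores it only once at the end.

/* Accumulating into memory vs into a local temporary. */
#include <stdio.h>

static void sum_into_array(const double *a, int n, double *result)
{
    result[0] = 0.0;
    for (int i = 0; i < n; i++)
        result[0] += a[i];            /* repeated stores to memory */
}

static void sum_into_local(const double *a, int n, double *result)
{
    double total = 0.0;               /* local temporary: register candidate */
    for (int i = 0; i < n; i++)
        total += a[i];
    result[0] = total;                /* single store at the end */
}

int main(void)
{
    double a[5] = {1, 2, 3, 4, 5}, r1, r2;
    sum_into_array(a, 5, &r1);
    sum_into_local(a, 5, &r2);
    printf("%g %g\n", r1, r2);
    return 0;
}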
       Instruction Pipelining
• Modern CPUs can complete multiple
  Instructions Per Cycle (IPC), also known
  as Instructions Per Clock.
• A typical Xeon core can retire up to 4 instructions
  per cycle; you need to keep the pipeline full to approach this.
              Loop Unrolling
Keep data in registers/cache via loop unrolling /
blocking with inlining. This can also improve IPC.
e.g. a(n)=somefunc(n) + a(n-1)
Unroll with stride 2:
do n=1,nmax,2
  a(n)=somefunc(n) + a(n-1)
  a(n+1)=somefunc(n+1) + a(n)
end do
                Loop Counts
• Inner loops should have high iteration counts,
  since loops themselves have a non-negligible
  cost.
• Counter to this, very small inner loops may get
  unrolled away. In this case do it manually.
Good                     Bad
do j=1,10                do n=1,10000
 do n=1,10000             do j=1,10
  work                     work
 end do                   end do
end do                   end do
       Knowing Loop Counts
• There is more potential for compiler
  optimisation when the loop count is
  known before the loop is started.
• Use “do”, “for” loops. Avoid “while” and
  “do … while” loops.
          Contiguous Memory
• Aim to work through contiguous chunks of
  memory. Avoid unnecessary striding.
• In Fortran, the consecutive elements of a column
  reside next to each other (column-major order).
• In C/C++, the consecutive elements of a row
  reside next to each other (row-major order)
Fortran                  C /C++
do j=1,10                for (i=0;i<10;i++)
  do i=1,10                for (j=0;j<10;j++)
    A(i,j)=something         a[i][j]=something
                Loop Blocking (1)
Loop blocking/tiling/strip-mining …
This code potentially contains many cache misses:
do J=1,Jmax
  do I=1,Imax
     A(I,J) = A(I,J) + B(J,I)
  end do
end do
A has stride 1, B has stride Jmax.
•   Block the loops so that small blocks of A and B sit in
    cache together.
•   If Imax and Jmax are very large, neither A nor B can sit in cache as a
    whole.
                   Loop Blocking (2)
[Figure: before blocking, A(I,J) is accessed contiguously while B(I,J) is accessed non-contiguously; after blocking, both A and B are accessed in small blocks.]
• The Cray Fortran compiler does this for you.
  Most other compilers currently do not.
                 Loop Blocking (3)
After applying the loop blocking technique:
do J=1,Jmax,Jblksize
    do I=1,Imax,Iblksize
         do JJ=J,min(J+Jblksize-1,Jmax)
             do II=I,min(I+Iblksize-1,Imax)
                  A(II,JJ) = A(II,JJ) + B(JJ,II)
             end do
         end do
    end do
end do
A still has stride 1 and B is still strided, but within each block only about
Iblksize x Jblksize elements of each array are touched.
•   Choose Iblksize and Jblksize so that a block of A and a block of B fit in
    cache together.
•   Only Iblksize x Jblksize elements of A or B need to sit in cache at a time.
    Not exhausting cache.
                   Branching
•   Remove branches from loops and change the loop
    bounds. Branches are bad for IPC and pre-emptive
    cache fetching.
•   Avoid GOTO statements (in Fortran, C, C++). They can
    affect cache/register use.
Before:                      After:
do n=1,nmax                  a(1)=0
  a(n)=somefunc(n)           a(nmax)=1
  if (n==1) a(n)=0           do n=2,nmax-1
  if (n==nmax) a(n)=1          a(n)=somefunc(n)
end do                       end do
                  Loop Fission
•   Too much work in loops means that registers and/or
    instruction cache may get exhausted.
•   Perhaps only part of a loop is vectorisable (execute a
    number of loop-iterations at the same time).
•   Break these loops up.
Before:                       After:
do n=1,nmax                   do n=1,nmax
  lots of work                  lots of work
  some I/O                    end do
end do                        do n=1,nmax
                                some I/O
                              end do
                   Loop Fusion
Before:                     After:
do n=1,nmax                 do n=1,nmax
  a(n)=somefunc(n)            a(n)=somefunc(n)
end do                        b(n)=someotherfunc(n)
do n=1,nmax                 end do
  b(n)=someotherfunc(n)
end do
• This is usually used in conjunction with other
  techniques.
• Whether this is beneficial depends on the
  work inside the loops. Too much work may
  exhaust registers or cache.
                Pointers
Avoid unnecessary use of pointers.
• Pointers might prevent some compiler
  optimisations. They should be fine if they
  have local scope.
• Pointers make it difficult to copy data to
  GPU memory.
• Hard to optimise cache use, e.g. with
  linked lists vs contiguous arrays.
    MMX / 3DNOW / SSE / AVX
•   These are Single-Instruction-Multiple-Data (SIMD)
    operations.
•   These are not part of standard Fortran / C / C++. They may
    get deprecated over time.
•   Compiler should insert these automatically. Write SIMD-
    friendly code.
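A short C sketch of a SIMD-friendly loop: unit stride, a trip count known at compile time, no branches and no loop-carried dependence, so an auto-vectorising compiler (e.g. gcc -O3) can emit SSE/AVX instructions without any intrinsics in the source. The array names and sizes are illustrative.

/* A loop the auto-vectoriser likes: unit stride, no branches, and no
   dependence between iterations. */
#include <stdio.h>

#define N 1024

int main(void)
{
    static double x[N], y[N];

    for (int i = 0; i < N; i++) {      /* initialise */
        x[i] = (double)i;
        y[i] = 1.0;
    }

    /* daxpy-style update: each iteration is independent */
    const double alpha = 2.0;
    for (int i = 0; i < N; i++)
        y[i] = alpha * x[i] + y[i];

    printf("y[N-1] = %g\n", y[N - 1]);   /* 2*1023 + 1 = 2047 */
    return 0;
}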
                    Inlining
• Function calls take time. You can remove this
  time by placing the code into the calling routine.
• Compilers can inline code for you when using
  high optimization levels. This is not guaranteed,
  but you can make it more likely:
• In C / C++, put the code in the same file as the
  calling routine. Use static functions.
• In C99 / C++, use the inline keyword.
• In Fortran, put the code in the same file / module
  as the calling routine. Use the compiler directive
  “forceinline”.
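A minimal C99 sketch: a small static inline helper defined in the same file as its caller gives the compiler everything it needs to inline the call at -O2/-O3. The function name is illustrative.

/* A helper made easy to inline: "static inline" and defined in the
   same translation unit as the caller. */
#include <stdio.h>

static inline double squared_distance(double x, double y)
{
    return x * x + y * y;
}

int main(void)
{
    double total = 0.0;
    for (int i = 0; i < 1000; i++)
        total += squared_distance((double)i, 2.0 * i);  /* candidate for inlining */
    printf("total = %g\n", total);
    return 0;
}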
COMPILER OUTPUT
             Cray Compiler Output
The Cray compiler can output an annotated version of the source file:
    ftn -rm mycode.f90
    Outputs mycode.lst
Examine the annotated file to see what the compiler has done.
       Intel Compiler Output
• Optimisation reports.
• Compiler flag: -qopt-report=3
• Have a look at the man page for other
  values of -qopt-report.
 Exercise 3: Cray compiler output
• Run:
  cd compiler_reports
  make matrix.cray
• Check out the manpage if needed
  •   man crayftn
• Examine the output in matrix.lst.O*
• What has the compiler done with routine calls and
  loops?
• Can you identify the reason for the timing results
  from the previous exercise?
Exercise 4: Intel compiler output
• Run:
  module swap PrgEnv-cray PrgEnv-intel
  cd compiler_reports
  make matrix.intel
• Seek help: ifort -help reports
• Examine the output in matrix.optrpt.O*
• This might be too much information; scale
  back the reporting options in the Makefile.
PROFILING
           Profiling phases
• Instrumentation: compile the source code
  with extra compiler flags that enable the
  recording of performance-relevant events.
• Measurement & analysis: run the
  instrumented application on a
  representative test case. Usually the
  instrumented application is much slower
  than the original one.
• Performance examination: collect and
  analyse the measurement results.
                            Profilers
•   On Cray supercomputers: Cray Tools (CrayPat)
    https://pubs.cray.com/content/S-2376/7.0.0/cray-performance-measurement-and-analysis-tools-user-guide/craypat
    https://support.pawsey.org.au/documentation/display/US/Profiling+with+Cray+Tools
•   Intel VTune: https://software.intel.com/en-us/intel-vtune-amplifier-xe
    https://support.pawsey.org.au/documentation/display/US/Profiling+with+Intel+VTune
•   Others: gprof, Arm MAP & Performance Reports, etc.
    https://support.pawsey.org.au/documentation/display/US/User+Training
    https://support.pawsey.org.au/documentation/display/US/Training+Material
                             Profilers
•   Profiling guide at Pawsey:
https://support.pawsey.org.au/documentation/display/US/Profiling
        Full Profiling with CrayPAT
Sampling experiment
Instrumentation
• module load perftools
• Compile code, using Cray compiler wrappers (ftn, cc, CC) &
     preserving object (.o) files
• pat_build myapp
    •   Generates executable named myapp+pat
Measurement & analysis
• Run ./myapp+pat as normal; this will generate an output
  directory myapp+pat+XX+YYs/ (or an .xf file for small runs)
• pat_report myapp+pat+XX+YYs/ > myapp.sampling.report
    (this also generates .ap2 file that can be viewed with Apprentice2,
    and a build-options.apa file to be used in a tracing experiment)
Performance examination
• Read myapp.sampling.report file
• or use Apprentice2 (with X11 forwarding activated: "ssh -X"):
    app2 myapp+pat+XX+YYs/ &
       Full Profiling with CrayPat
Tracing experiment
Instrumentation
• First, perform a sampling experiment to generate the file:
     myapp+pat+XX+YYs/build-options.apa
• pat_build -O myapp+pat+XX+YYs/build-options.apa
   •   Essentially pat_build -w -T funcs -g grps -u myapp (can be
       edited to change -T funcs and -g grps)
   •   Generates executable named myapp+apa
Measurement & analysis
• Run ./myapp+apa as normal; this will generate an output
  directory myapp+apa+XX+YYt/ (or an .xf file for small runs)
• pat_report myapp+apa+XX+YYt/ > myapp.tracing.report
    (also generates .ap2 file that can be viewed with Apprentice2)
Performance examination
• Read myapp.tracing.report file
• or use Apprentice2 (with X11 forwarding activated: "ssh -X"):
    app2 myapp+apa+XX+YYt/ &
  Exercise 5: profiling game of life (1)
Profile (sampling) the game_of_life code.
  •   module load perftools
  •   cd game_of_life
  •   make game_of_life.cray
  •   pat_build game_of_life.O3.cray
  •   sbatch run_game_profile.slurm
  •   pat_report game_of_life.O3.cray+pat+XX-YYs/ >
      game_of_life.O3.cray.sampling.report
Exercise 5: profiling game of life (2)
Examine
game_of_life.O3.cray.sampling.report
• Where is all the time spent? What occurs on
  these lines of the code?
• How good is our cache utilisation?
• Check optimisations in
  game_of_life.O3.lst
• Try again with other levels of optimisation
  (edit the Makefile)
          Exercise 6: profiling matrix
         multiplication (1) - sampling
Sampling experiment
•   module load perftools
•   cd profiling
•   ftn -c -O2 matrix.f90
•   ftn -o matrix.O2 matrix.o
•   pat_build matrix.O2
•   sbatch run_matrix_profile.slurm
•   Generate the report:
pat_report matrix.O2+pat+XX+YYs/ > matrix.O2.sampling.report
•   Examine matrix.O2.sampling.report
•   You can also use Apprentice2: app2 matrix.O2+pat+XX+YYs/ &
Exercise 6: profiling matrix (2)-(4) - sampling: [screenshots of the sampling report]
          Exercise 6: profiling matrix
          multiplication (5) - tracing
Tracing experiment
•   Edit the build_options file to define tracing options:
     vim matrix.O2+pat+XX-YYs/build_options.apa
Have the following:
   -g mpi,blas,io,heap
   -T test_matmul1$timeit_
   -T test_matmul2$timeit_
   -T test_matmul3$timeit_
   -T test_matmul4$timeit_
   -T test_matmul5$timeit_
•   Build the executable with the defined tracing options:
    pat_build -O matrix.O2+pat+XX-YYs/build_options.apa
Exercise 6: profiling matrix (6) tracing
•   Edit the job script: vim run_matrix_profile.slurm
    Have: expType=apa
•   sbatch run_matrix_profile.slurm
• Generate the report:
pat_report matrix.O2+apa+XX-YYt/ > matrix.O2.tracing.report
•   Examine: matrix.O2.tracing.report
•   You can also use Apprentice2: app2 matrix.O2+apa+XX+YYt/ &
•   Use the previous exercises of the matrix multiplication (timing and
    optimisation listing) to understand the results.
•   Repeat the exercise with lower levels of optimisation.
Seek help
• man pat_build
• man pat_report
• pat_help all . > all_pat_help
Exercise 6: profiling matrix (7)-(11) - tracing: [screenshots of the tracing report]
Exercise 6: profiling matrix (12)-(13) - tracing: [Apprentice2 screenshots]
From Apprentice2: app2 matrix.O2+apa+XX+YYt/ &
Exercise 6: profiling matrix multiplication (14)-(16) - tracing: [further screenshots]
CODING HABITS
             Global variables
• Avoid global variables unless necessary.
  They may make it difficult to convert to
  multithreaded code in further development.
• Pass variables through routine calls (there is a
  slight performance overhead); a sketch follows below.
  • In Fortran, arguments can be given intent(in) and
    intent(out) attributes; this assists the compiler.
  • Scoping in OpenMP becomes much easier.
  • May assist auto-threading by compilers.
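A small C sketch of the same habit (names are illustrative): the first routine depends on hidden global state, while the second receives everything it needs through its argument list, which makes it self-contained and easier to thread later.

/* Passing data through the argument list instead of via a global. */
#include <stdio.h>

double g_factor = 2.0;                       /* global state: avoid */

static void scale_global(double *a, int n)
{
    for (int i = 0; i < n; i++)
        a[i] *= g_factor;                    /* hidden dependency */
}

static void scale_arg(double *a, int n, double factor)
{
    for (int i = 0; i < n; i++)
        a[i] *= factor;                      /* explicit, thread-friendly */
}

int main(void)
{
    double a[3] = {1, 2, 3}, b[3] = {1, 2, 3};
    scale_global(a, 3);
    scale_arg(b, 3, 2.0);
    printf("%g %g\n", a[2], b[2]);
    return 0;
}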
       Parentheses in Fortran
• Try to avoid parentheses in Fortran; they force an
  evaluation order and prevent arithmetic
  rearrangement by the compiler. Use temporary variables
  instead.
• Compiler not permitted to rearrange this:
a = 2 * (c + d) – 2 * e
• Compiler allowed to rearrange this:
tmp=c+d
a=2*tmp – 2 * e
            Fortran Array Notation
Fortran array notation is convenient and easy to read, but many current compilers
do not optimise it well.
Some compilers are unlikely to fuse these operations:
A(:,:)=1.0
C(:,:)=A(:,:)+B(:,:)
In the meantime:
do j=1,n
   do i=1,m
     A(i,j)=1.0
     C(i,j)=A(i,j)+B(i,j)
   end do
end do
The Cray compiler does fuse array notation!
     Low Level Object Oriented
Low-level object-oriented programming has the potential for
poor performance, e.g. the strided (non-contiguous)
memory accesses in the loop below.
type atom_type
  integer :: atomic_number
  double precision :: mass
  double precision, dimension(3) :: position
end type atom_type
type(atom_type), dimension(n_atoms) :: atom_list
total_mass=0
do i=1,n_atoms
  total_mass=total_mass+atom_list(i)%mass
end do
  Low Level Object Oriented (2)
Less organised but faster code:
integer, dimension(n_atoms) :: atomic_numbers
double precision, dimension(n_atoms) :: masses
double precision, dimension(3,n_atoms) :: positions
total_mass=sum(masses)
Set a level for the trade-off between
maintainability/extensibility and performance.
         Special Case Code
Assume we have a code that handles arrays of
varying length, where:
• the code creates temporary arrays; and
• in practice an array of length 1 is the most
  common case.
Optimisation: write a separate routine for arrays
of length 1, and use temporary variables rather
than temporary arrays (see the sketch below).
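A hedged C sketch of the idea (the routines and the "work" are placeholders): a general path that allocates a temporary array, plus a special case for length 1 that uses a scalar temporary and avoids the allocation entirely.

/* Special-casing the common n == 1 situation. */
#include <stdio.h>
#include <stdlib.h>

static void process_general(double *a, int n)
{
    double *tmp = malloc((size_t)n * sizeof *tmp);   /* temporary array */
    if (!tmp) return;
    for (int i = 0; i < n; i++)
        tmp[i] = 2.0 * a[i] + 1.0;                   /* placeholder work */
    for (int i = 0; i < n; i++)
        a[i] = tmp[i];
    free(tmp);
}

static void process(double *a, int n)
{
    if (n == 1) {                          /* the common case in practice */
        double tmp = 2.0 * a[0] + 1.0;     /* scalar temporary, no malloc */
        a[0] = tmp;
    } else {
        process_general(a, n);
    }
}

int main(void)
{
    double one[1] = {3.0}, many[3] = {1.0, 2.0, 3.0};
    process(one, 1);
    process(many, 3);
    printf("%g  %g %g %g\n", one[0], many[0], many[1], many[2]);
    return 0;
}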
                          I/O
• Often the processor is doing little while
  waiting for I/O.
• Ways to reduce I/O overhead (a sketch follows below):
  •   Use buffering (i.e. don't disable it or flush unnecessarily).
  •   Output in binary, not formatted text.
  •   Use I/O libraries.
• Hierarchical Data Format (HDF5) is the
  name of a set of file formats and libraries
  designed to store and organize large
  amounts of numerical data.
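A small C sketch contrasting formatted text output with a single binary write of the same array (the file names are illustrative; a real code would more likely use a library such as HDF5):

/* Formatted text output vs one buffered binary write of the same data.
   The binary path writes the whole array in a single fwrite() call,
   with no number-to-text conversion. */
#include <stdio.h>

#define N 1000000

static double data[N];

int main(void)
{
    for (int i = 0; i < N; i++)
        data[i] = (double)i;

    /* formatted text: one conversion and one line per value */
    FILE *ftxt = fopen("out.txt", "w");
    if (!ftxt) return 1;
    for (int i = 0; i < N; i++)
        fprintf(ftxt, "%.17g\n", data[i]);
    fclose(ftxt);

    /* binary: one call, no conversions */
    FILE *fbin = fopen("out.bin", "wb");
    if (!fbin) return 1;
    fwrite(data, sizeof(double), N, fbin);
    fclose(fbin);

    return 0;
}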
            Exercise 7: I/O
Observe the effects of I/O techniques on
performance.
• module load cray-hdf5
• cd iobench && make iobench_hdf5
• sbatch run_iobench_hdf5.slurm
Look at the SLURM output file.
         Use version control
• Some of your attempts at optimisation will
  need to be undone. E.g. due to:
  • Incorrect results.
  • Slower performance.
• Use version control software. E.g. git,
  subversion. You should be using this
  anyway.
• Use informative commit messages when checking in.
MATH LIBRARIES
                  Popular libraries
•   BLAS: basic linear algebra such as matrix-vector or matrix-matrix
    operations.
•   LAPACK:
    •   Simple matrix/vector operations
    •   Linear equations solvers
    •   Linear least squares
    •   Eigensolvers
    •   Singular value decomposition
    •   Real + Complex
•   FFTW: fast Fourier transforms, real/complex
Optimised vendor versions available. e.g. Intel MKL, Cray Libsci, SGI
SCSL, IBM ESSL. Some are multi-threaded.
             Other libraries
• PLAPACK - better scaling eigensolver
  (MRRR algorithm)
• PARPACK – sparse eigensolver
• MUMPS – parallel sparse direct solver
• Hypre – parallel linear solver
• Scotch – graph partitioning
• SuperLU – parallel sparse linear solver
• Available on Cray systems via the cray-tpsl module.
             Intel Math Libraries
Intel MKL includes BLAS, LAPACK and FFT
libraries. It consists of multiple libraries; use the
MKL Link Line Advisor to work out the compiler link options:
http://software.intel.com/sites/products/mkl/MKL_Link_Line_Advisor.html
Example output:
   -L$(MKLROOT)/lib/intel64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm
$(MKLROOT) is set by “module load intel”
                PETSc / SLEPc
• An attempt at a black-box solver suite: it makes it
  easy to swap between solvers from various
  libraries, and it links to many common libraries.
•   C/C++ and Fortran interfaces
•   Linear equation solvers
•   Eigensolvers
•   Dense and Sparse
•   various finite element solver add-ons
                       Finish
• What's next?
   • Come to the parallel course.
   • Read optimisation guides from vendors.
     Intel and Cray in particular.
• Slides are available at
  https://support.pawsey.org.au/documentation/display/US/Training+Material