Sample codes for OpenMP/OpenACC implementation (a sketch of the source follows the commands)
- pgf90 -mp -fast -Minfo=accel -acc loop_omp_acc.f90
- pgcc -mp -fast -Minfo=accel -acc loop_omp_acc.c
- pgcc -acc -mp -Minfo=accel -fast -ta=tesla:cc70 loop_omp_acc.c
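The loop_omp_acc.c source itself is not reproduced in these notes; a minimal sketch of what such a combined OpenMP/OpenACC file might look like (the array size N and the loop body are assumptions, only the file name comes from the commands above):

```c
#include <stdio.h>
#define N 1000000

int main(void) {
    static float a[N], b[N];
    for (int i = 0; i < N; i++) b[i] = (float)i;

#ifdef _OPENACC
    /* built with -acc: offload the loop to the accelerator */
    #pragma acc parallel loop copyin(b[0:N]) copyout(a[0:N])
#else
    /* built with -mp/-fopenmp: multi-thread the loop on the CPU */
    #pragma omp parallel for
#endif
    for (int i = 0; i < N; i++)
        a[i] = 2.0f * b[i];

    printf("a[N-1] = %f\n", a[N - 1]);
    return 0;
}
```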
- Multi-threaded parallelism on the CPU
- Works only within a single SMP node
- Uses the memory shared among multiple processors
- Can be coupled with MPI/OpenACC
- Can be coupled with vectorization
- Inject "sentinel" directives around the loops
- Tells the compiler the region of interest for parallelism
- Declare private/shared variables for each thread
- Run on a single thread, or synchronize using atomic or critical sections
- Reduction operations for scalar variables (see the sketch after this list item)
- Arrays and vectors are not supported
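A minimal C sketch contrasting a scalar reduction with atomic synchronization; the dot-product kernel and the function names are assumptions, not from the notes:

```c
/* Scalar reduction: each thread accumulates a private partial sum,
   and the partial sums are combined once at the end of the loop. */
double dot(const double *x, const double *y, int n) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}

/* Same result via an atomic update: correct, but every single
   addition is synchronized, so it is usually much slower. */
double dot_atomic(const double *x, const double *y, int n) {
    double sum = 0.0;
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        #pragma omp atomic
        sum += x[i] * y[i];
    }
    return sum;
}
```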
- Nested loops and nested parallelism are supported
- Multiple levels of parallelism (see the collapse sketch below)
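One common way to exploit nested loops is the collapse clause (OpenMP 3.0 and later); a sketch with an assumed n-by-m scaling kernel:

```c
/* collapse(2) fuses the two loops into a single n*m iteration
   space, exposing more parallelism to distribute over threads. */
void scale(float *a, int n, int m, float s) {
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
            a[i * m + j] *= s;
}
```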
- Usually better than auto-parallelization
- Especially when the loop is complex
- Insert sentinels at the top and bottom of the loop
- Decide private/shared for each variable used inside the loop (see the sketch below)
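A sketch of both steps on an assumed saxpy-style loop; default(none) is added here so the compiler rejects any variable left unclassified (in C, only the top sentinel is written; the loop's closing brace acts as the bottom):

```c
void saxpy(float *y, const float *x, float a, int n) {
    float tmp;
    /* sentinel above the loop; every variable used inside is
       classified explicitly as private or shared */
    #pragma omp parallel for default(none) private(tmp) shared(x, y, a, n)
    for (int i = 0; i < n; i++) {
        tmp = a * x[i];   /* per-thread scratch: private */
        y[i] += tmp;      /* each thread owns distinct i: safe to share y */
    }
}
```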
- Compile the source code using
- -fopenmp for GCC
- -fopenmp or -qopenmp for Intel compiler
- Configure the number of threads to use, then run (a check program is sketched below)
- export OMP_NUM_THREADS=8
- ./a.out
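A tiny check program (hypothetical, not from the notes) to confirm the setting took effect:

```c
#include <stdio.h>
#include <omp.h>

/* With OMP_NUM_THREADS=8 this should print eight lines. */
int main(void) {
    #pragma omp parallel
    printf("thread %d of %d\n",
           omp_get_thread_num(), omp_get_num_threads());
    return 0;
}
```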
- Rule of thumb
- In most cases MPI parallelism beats OpenMP multi-threading
- But hybrid parallelism may beat pure MPI when the CPU density per NIC is high
- The sweet spot for OpenMP was 8-12 threads as of 2013-2018
- Don't guess, measure
- In Intel compiler: ifort -Ofast -qopenmp omp.f90
- In PGI compiler: pgf90 -fast -mp omp.f90
- For GPGPU computing
- -ta=multicore retargets the same code to the CPU
- OMP_NUM_THREADS may conflict with cgroup CPU limits
- export MP_BLIST=0,1,2,3,4,5 to pin threads to specific cores
- More abstract than CUDA
- CUDA is extremely hard to use
- Fortran/C/C++ supported
- Supported by the PGI and GCC compilers
- Can be coupled with OpenMP/MPI
- Similar look and feel to OpenMP
- OpenMP-style parallelism (or vectorization) on the GPGPU (see the sketch below)
- More explicit than OpenMP
- Can configure which GPU will be used
- Don't guess, measure
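A minimal OpenACC sketch (the array size and the kernel are assumptions): it reads like the OpenMP loops above, but the copyin/copy clauses make the host-device data movement explicit:

```c
#include <stdio.h>
#define N 1000000

int main(void) {
    static float x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    /* Explicit data clauses: x is copied to the GPU, y is copied
       in and back out; OpenMP's shared memory needs no such step. */
    #pragma acc parallel loop copyin(x[0:N]) copy(y[0:N])
    for (int i = 0; i < N; i++)
        y[i] += 2.0f * x[i];

    printf("y[0] = %f\n", y[0]);  /* expect 4.0 */
    return 0;
}
```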
- Compile the source code using
- -fopenacc for GCC
- -acc for PGI compiler
- -ta=tesla for GPU, -ta=multicore for CPU
- Monitor GPU status using command: nvidia-smi -l
- As of 2018, derived data types might not be supported
- Use structure of arrays (SoA), not array of structures (AoS), as sketched below
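A sketch of the two layouts (the particle fields are an assumed example); with SoA, consecutive GPU threads touch consecutive addresses, which coalesces memory accesses:

```c
#define N 100000

/* AoS: one particle's fields sit adjacent in memory, so reading
   one field across particles gives strided, uncoalesced accesses. */
struct particle { float x, y, z; };
struct particle p_aos[N];

/* SoA: each field is its own contiguous array. */
float px[N], py[N], pz[N];

void shift_x(float dx) {
    #pragma acc parallel loop copy(px[0:N])
    for (int i = 0; i < N; i++)
        px[i] += dx;   /* thread i reads address i: coalesced */
}
```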
- Details: how the OpenACC levels map onto an NVIDIA GPU (sketch below)
- gang => thread block
- worker => warp
- vector => thread
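A sketch making the mapping explicit; the sizes 256 and 128 are assumptions, tunables to be measured (the worker/warp level is left to the compiler here):

```c
/* gang -> thread block, vector -> thread; num_gangs and
   vector_length pick the block count and block size. */
void scale2d(float *a, int n, int m, float s) {
    #pragma acc parallel loop gang num_gangs(256) vector_length(128) \
                copy(a[0:n*m])
    for (int i = 0; i < n; i++) {
        #pragma acc loop vector
        for (int j = 0; j < m; j++)
            a[i * m + j] *= s;
    }
}
```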
- Limited multi-GPU support
- One MPI rank per GPU is recommended (see the sketch below)
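A sketch of the recommended setup; rank % ngpus assumes MPI ranks are placed consecutively on each node:

```c
#include <mpi.h>
#include <openacc.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Bind this rank to one GPU before any acc region runs. */
    int ngpus = acc_get_num_devices(acc_device_nvidia);
    acc_set_device_num(rank % ngpus, acc_device_nvidia);

    /* ...each rank now offloads its share of the work to its own GPU... */

    MPI_Finalize();
    return 0;
}
```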
- Multi-threaded random number generation is not supported
- Reductions inside an acc routine are not supported
- 2D arrays cannot be privatized in a loop (workaround sketched below)
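One possible workaround, sketched under assumptions (the 4x4 tile size and the row-sum kernel are hypothetical): flatten the scratch array to 1D and declare it inside the loop body, where locals are private by definition:

```c
/* Per-iteration 4x4 scratch tile, stored flat as tmp[r*4 + c]. */
void rowsum(const float *a, float *out, int n) {
    #pragma acc parallel loop copyin(a[0:n*16]) copyout(out[0:n])
    for (int i = 0; i < n; i++) {
        float tmp[16];            /* declared in the loop body: private */
        float s = 0.0f;
        for (int k = 0; k < 16; k++) {
            tmp[k] = a[i * 16 + k];
            s += tmp[k];
        }
        out[i] = s;
    }
}
```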