
OpenMP Performance Considerations
Introduction
• It may be possible to quickly write a correctly functioning OpenMP program
• But it is not so easy to create a program that provides the desired level of performance
• This is often because some basic programming rules have not been adhered to
• Programmers have developed rules of thumb for writing efficient sequential code
• These guarantee a certain base level of performance
• They can also be extended to OpenMP programs
• Best practice: write an efficient sequential program first, then introduce OpenMP constructs
Performance Considerations for Sequential Programs
• Poor single-processor performance is often caused by suboptimal usage of cache memory
• A cache miss at the highest level of the memory hierarchy is expensive
• 5-10 times more expensive than fetching the data from the cache
• The higher the miss frequency, the poorer the program performance
• In a shared-memory system the adverse effect is greater, because more threads are involved
• Each cache miss results in additional traffic on the system interconnect
• No system on the market has an interconnect with sufficient bandwidth to keep up with such traffic from every thread
Memory Access Patterns
• Memory Hierarchy:
• The largest, and also slowest, part of memory is known as main memory
• Main memory is organized into pages, a subset of which will be available to a given
application
• The memory levels closer to the processor are successively smaller and faster
and are collectively known as cache

• When the program is compiled, the compiler will arrange its data objects to be
stored in the main memory
• They will be transferred to cache when needed
Memory Access Patterns
• If the data requested is not present in cache, this is known as a cache miss
• The data must then be retrieved from a higher level of the memory hierarchy
• Program data is brought into cache in chunks called blocks
• Data that is already in cache may need to be removed, or “evicted”,
to make space for a new block of data

• The memory hierarchy cannot be programmed directly by the programmer or the compiler
• We can only influence which data is fetched into the cache and evicted from it
• The aim is to reduce the frequency with which cache misses occur
Memory Access Patterns
• A major goal is to organize data accesses so that values are used as
often as possible while they are still in cache
• Example: Let’s consider Arrays
• C specifies that the elements of an array are stored contiguously in memory
• Thus, if an array element is fetched into cache, “nearby” elements of the
array will be in the same cache block and will be fetched as part of the same
transaction
• If a computation that uses any of these values can be performed while they
are still in cache, it will be beneficial for performance
Loop Optimizations
• If we encounter a loop nest that traverses an array in an order that does not match its storage order, we can simply exchange the order of the loop headers and most likely experience a significant performance benefit
• This strategy is called loop interchange (or loop exchange)
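• A minimal sketch of loop interchange, assuming a two-dimensional array a of doubles with n rows and n columns:

/* Before: the inner loop walks down a column, so consecutive
   iterations touch different cache lines. */
for (int j = 0; j < n; j++)
    for (int i = 0; i < n; i++)
        a[i][j] = a[i][j] + 1.0;

/* After loop interchange: the inner loop walks along a row,
   which is contiguous in C, so each cache line is fully used. */
for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
        a[i][j] = a[i][j] + 1.0;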
Loop Optimizations
• Many programs spend much of their time executing loops, and that is where most array accesses occur
• A suitable reorganization of the computation in loop nests to exploit cache can therefore significantly improve a program’s performance
• A programmer should consider transforming a loop
• If accesses to arrays in the loop nest do not occur in the order in which they
are stored in memory
• If a loop has a large body and the references to an array element or its
neighbors are far apart
Loop Optimizations
• Loop transformations can be applied only if the changes to the code do not affect correct execution of the program
• If any memory location is referenced more than once in the loop nest and at least one of those references modifies its value, then their relative ordering must not be changed by the transformation
Loop Optimizations
• Loop transformations have other purposes:
• They may help the compiler to better utilize the instruction pipeline or may increase
the amount of exploitable parallelism
• They can also be applied to increase the size of parallel regions
• Loop unrolling
• Loop unrolling, also known as loop unwinding, is a loop transformation technique
that attempts to optimize a program's execution speed at the expense of its binary
size, which is an approach known as space–time tradeoff.
• Powerful technique to effectively reduce the overheads of loop execution
• Loop unrolling can help to improve cache line utilization by improving data reuse
• It can also help to increase the instruction-level parallelism
Loop Optimizations
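• A minimal sketch of loop unrolling with an unroll factor of 2, assuming arrays a and b of length n:

/* Original loop */
for (int i = 0; i < n; i++)
    a[i] = a[i] * b[i];

/* Unrolled by a factor of 2 */
int i;
for (i = 0; i < n - 1; i += 2) {
    a[i]   = a[i]   * b[i];
    a[i+1] = a[i+1] * b[i+1];
}
/* "Cleanup" loop for the remaining iteration when n is odd */
for (; i < n; i++)
    a[i] = a[i] * b[i];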

• In this example, the loop body executes 2 iterations in one pass


• This number is called the “unroll factor”
• A higher value tends to give better performance but also increases the
number of registers needed
• If the unroll factor does not divide the iteration count, the remaining
iterations must be performed outside this loop nest
• This is implemented through a second loop, the “cleanup” loop
Loop Optimizations
• Unroll and jam is an extension of loop unrolling that is appropriate
for some loop nests with multiple loops
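• A minimal sketch of unroll and jam, assuming two-dimensional arrays a and b, a one-dimensional array c, and an even iteration count n: the outer loop is unrolled by 2 and the two copies of the inner loop are jammed (fused) together

/* Original loop nest */
for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
        a[i][j] = b[i][j] + c[j];

/* Unroll and jam: outer loop unrolled by 2, inner bodies fused,
   so c[j] is reused while it is still in cache (or in a register). */
for (int i = 0; i < n; i += 2)
    for (int j = 0; j < n; j++) {
        a[i][j]   = b[i][j]   + c[j];
        a[i+1][j] = b[i+1][j] + c[j];
    }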
Loop Optimizations
• Can these loops be optimized using loop interchange?
Loop Optimizations
• Loop fusion merges two or more loops to create a bigger loop
• This might enable data in cache to be reused more frequently
• May increase the amount of computation per iteration in order to
improve the instruction-level parallelism
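• A minimal sketch of loop fusion, assuming arrays a, b, and c of length n:

/* Two separate loops: a[] is traversed twice, so it may have to be
   brought into cache twice. */
for (int i = 0; i < n; i++)
    a[i] = b[i] + 1.0;
for (int i = 0; i < n; i++)
    c[i] = a[i] * 2.0;

/* Fused loop: a[i] is produced and consumed while it is still in cache. */
for (int i = 0; i < n; i++) {
    a[i] = b[i] + 1.0;
    c[i] = a[i] * 2.0;
}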
Loop Optimizations
• Loop fission is a transformation that breaks up a loop into several
loops
• Sometimes, we may be able to improve use of cache this way
• Isolate a part that inhibits full optimization of the loop
• This technique is likely to be most useful if a loop nest is large and its
data does not fit into cache or if we can optimize parts of the loop in
different ways
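• A minimal sketch of loop fission, assuming arrays a, b, c, and d of length n and a hypothetical function f():

/* Original loop: the call to f() may inhibit full optimization
   of the simple array update. */
for (int i = 0; i < n; i++) {
    a[i] = b[i] + c[i];
    d[i] = f(d[i]);
}

/* After loop fission: the first loop can now be optimized
   (vectorized, parallelized) on its own. */
for (int i = 0; i < n; i++)
    a[i] = b[i] + c[i];
for (int i = 0; i < n; i++)
    d[i] = f(d[i]);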
Loop Optimizations
• Loop tiling or blocking
• A powerful transformation designed to tailor the number of memory references inside a loop iteration so that they fit into cache
• It is worthwhile if data sizes are large, the memory access pattern is poor, and there is data reuse in the loop
• Loop tiling replaces each original loop by a pair of loops
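• A minimal sketch of loop tiling applied to a transpose-like nest, assuming n x n arrays a and b and a tile size B chosen so that a tile of each array fits into cache:

#define B 64   /* tile (block) size; an assumed value */

/* Original loop nest: b is accessed with a large stride. */
for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
        a[i][j] = b[j][i];

/* Tiled version: each loop becomes a pair of loops, so the working
   set of one B x B tile stays in cache while it is used. */
for (int ii = 0; ii < n; ii += B)
    for (int jj = 0; jj < n; jj += B)
        for (int i = ii; i < ii + B && i < n; i++)
            for (int j = jj; j < jj + B && j < n; j++)
                a[i][j] = b[j][i];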
Use of Pointers and Contiguous Memory in C
• The memory model in C is such that, without additional information,
one must assume that all pointers may reference any memory
address
• This is generally referred to as the pointer aliasing problem
• It prevents a compiler from performing many program optimizations
• If pointers are guaranteed to point to portions of nonoverlapping
memory, optimizations can be applied
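• In C99 this guarantee can be given with the restrict qualifier; a minimal sketch (the function and array names are illustrative):

/* Without restrict, the compiler must assume a, b, and c may overlap,
   which prevents it from reordering or vectorizing the loop freely. */
void add(int n, double * restrict a,
         const double * restrict b, const double * restrict c)
{
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];   /* no aliasing: safe to optimize aggressively */
}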
Using Compilers
• Modern compilers implement most, if not all, of the loop
optimizations
• They perform a variety of analyses to determine whether they may be
applied
• The main one is known as data dependence analysis
• They also apply a variety of techniques to reduce the number of
operations performed and reorder code to better exploit the
hardware
• It is worthwhile to experiment with compiler options to squeeze the
maximum performance out of the application
Amdahl’s law
• Amdahl's Law is a principle used in parallel computing to predict the maximum
potential speedup when using multiple processors. It's named after Gene
Amdahl, a computer architect, who formulated it in 1967.

• If we denote by T_1 the execution time of an application on 1 processor, then in an ideal situation the execution time on P processors should be T_1/P
• If T_P denotes the execution time on P processors, then the parallel speedup is the ratio S = T_1/T_P
• Parallel speedup is a measure of the success of the parallelization


Amdahl’s law
• Virtually all programs contain some regions that are suitable for
parallelization and other regions that are not
• By using an increasing number of processors, the time spent in the
parallelized parts of the program is reduced, but the sequential
section remains the same
• Eventually the execution time is completely dominated by the time
taken to compute the sequential portion, which puts an upper limit
on the expected speedup
S = 1 / (f_par/P + (1 − f_par))
• f_par is the parallel fraction of the code and P is the number of processors
Amdahl’s law
• Suppose that 70% of a program's execution can be sped up if the program is parallelized and run on 16 processing units instead of one. What is the maximum speedup that can be achieved by the whole program?
• What is the maximum speedup if we increase the number of processing units to 32, then to 64, and then to 128?
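• A worked sketch using the formula above with f_par = 0.7:
P = 16:  S = 1 / (0.7/16 + 0.3) = 1 / 0.34375 ≈ 2.91
P = 32:  S = 1 / (0.7/32 + 0.3) ≈ 3.11
P = 64:  S = 1 / (0.7/64 + 0.3) ≈ 3.22
P = 128: S = 1 / (0.7/128 + 0.3) ≈ 3.27
• As P grows, the speedup approaches 1/0.3 ≈ 3.33, so the 30% sequential part limits the whole program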
Amdahl’s law
• Obstacles along the way to perfect linear speedup are the overheads
introduced by forking and joining threads, thread synchronization,
and memory accesses
• A measure of a program’s ability to decrease the execution time of
the code with an increasing number of processors is referred to as
parallel scalability
Measuring OpenMP Performance
• How to measure and identify what factors determine overall program
performance
• On Unix systems we can use: /bin/time ./a.out
Measuring OpenMP Performance
• Real: Program took 5.4 seconds from beginning to end
• User: The time the program spent executing outside any operating
system services
• Sys: The time spent on operating system services, such as
input/output routines

• CPU time: The sum of user and system time


• The real time is also referred to as wall-clock time or elapsed time
Measuring OpenMP Performance
• There may be a difference between real time and CPU time, for example when the application did not get a full processor to itself because of a high load on the system
• An OpenMP program also has additional overheads
• These overheads are collectively called the parallel overhead
• It includes the time to
• Create, start, and stop threads
• The extra work needed to figure out what each task is to perform
• The time spent waiting in barriers and at critical sections and locks
• The time spent computing some operations redundantly
Measuring OpenMP Performance
T_CPU(P) = (1 + O_P · P) · T_Serial

T_Elapsed(P) = (f/P + (1 − f) + O_P · P) · T_Serial

• T_Serial is the CPU time of the original serial version of the application
• P is the number of processors
• O_P is the parallel overhead added per processor, assumed to be a constant percentage
• f ∈ [0, 1] is the fraction of execution time that has been parallelized
Measuring OpenMP Performance
• Suppose the original program takes T_Serial = 10.20 seconds to run and code corresponding to 95% of the execution time has been parallelized. Assume that each additional processor adds a 2% overhead to the total CPU time. Compute the speedup and efficiency of the parallel program with 4 processors. Also estimate the T_CPU and T_Elapsed of the given program.
Solution
• To compute the speedup and efficiency of the parallel program with 4 processors, we need to calculate the total execution time with and without parallelization, and then use these values to find the speedup and efficiency.
• Given that 95% of the execution time is parallelized, we calculate the execution times with parallelization as shown below:
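• A worked sketch, applying the T_CPU and T_Elapsed formulas above with T_Serial = 10.20 s, f = 0.95, O_P = 0.02, and P = 4, and taking efficiency as speedup divided by processor count:
T_CPU(4) = (1 + 0.02 · 4) · 10.20 = 1.08 · 10.20 ≈ 11.02 seconds
T_Elapsed(4) = (0.95/4 + (1 − 0.95) + 0.02 · 4) · 10.20 = (0.2375 + 0.05 + 0.08) · 10.20 ≈ 3.75 seconds
Speedup S = T_Serial / T_Elapsed(4) = 10.20 / 3.75 ≈ 2.72
Efficiency E = S / P ≈ 2.72 / 4 ≈ 0.68, i.e. about 68%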
Measuring OpenMP Performance
• Suppose the original program takes T_Serial = 18.35 seconds to run and code corresponding to 72% of the execution time has been parallelized. Assume that each additional processor adds a 6% overhead to the total CPU time. Compute the speedup and efficiency of the parallel program with 8 processors. Also estimate the T_CPU and T_Elapsed of the given program.
Measuring OpenMP Performance
• The observable performance of OpenMP programs is influenced by at least
the following factors
• The manner in which memory is accessed by the individual threads
• The fraction of the work that is sequential, or replicated (Sequential overheads)
• The amount of time spent handling OpenMP constructs (Parallelization overheads)
• When a work-sharing directive is implemented, the work to be performed by each thread is
usually determined at run time
• The load imbalance between synchronization points (Load imbalance overheads)
• Threads perform different amounts of work in a work-shared region
• Threads might have to wait for a member of their team to carry out the work of a single
construct
• Other synchronization costs (Synchronization overheads)
Measuring OpenMP Performance
• Suppose that 65% of program execution can be sped up if the program is parallelized and run on 8 processing units instead of one. What is the maximum speedup that can be achieved by the whole program? What is the maximum speedup if we increase the number of processing units to 16?
• Suppose the original program takes T_Serial = 9.4 seconds to run and code corresponding to 68% of the execution time has been parallelized. Assume that 6 processors incur a total overhead of 0.24 units on the total CPU time. Compute the speedup and efficiency of the parallel program with 6 processors. Also estimate the T_CPU and T_Elapsed of the given program.
Best Practices
• Optimize Barrier Use
• Barriers are expensive operations
• The nowait clause can be used to remove barriers that are not needed for correctness
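• A minimal sketch using nowait, assuming arrays a, b, c, and d and sizes n and m; the barrier at the end of the first loop can be dropped because the second loop does not depend on its results:

#pragma omp parallel
{
    #pragma omp for nowait    /* no barrier: threads move on immediately */
    for (int i = 0; i < n; i++)
        a[i] = b[i] + 1.0;

    #pragma omp for           /* independent of a[], so the earlier barrier is not needed */
    for (int j = 0; j < m; j++)
        c[j] = d[j] * 2.0;
}   /* the implicit barrier at the end of the parallel region still synchronizes the team */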
Best Practices
• Avoid the Ordered Construct
• The ordered construct ensures that the corresponding block of code within a parallel loop is executed in the order of the loop iterations
• The runtime system has to keep track of which iterations have finished and possibly keep threads in a wait state until their results are needed
• Avoid Large Critical Regions
• The more code contained in a critical region, the greater the likelihood that threads have to wait to enter it, and the longer the potential wait times (see the sketch below)
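• A minimal sketch of keeping a critical region small, assuming a summation and a hypothetical function expensive(): each thread accumulates into a private variable and only the short final update is protected

double sum = 0.0;
#pragma omp parallel
{
    double local = 0.0;
    #pragma omp for
    for (int i = 0; i < n; i++)
        local += expensive(i);   /* the costly work stays outside the critical region */

    #pragma omp critical         /* only the cheap update is serialized */
    sum += local;
}

• In this particular case a reduction(+:sum) clause on the loop would avoid the critical construct altogether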
Best Practices
• Maximize Parallel Regions
• Overheads are associated with starting and terminating a parallel region
• Large parallel regions offer more opportunities for using data in cache and provide a bigger context for other compiler optimizations
• Therefore, minimize the number of parallel regions (see the sketch below)
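• A minimal sketch, assuming arrays a, b, and c of length n: instead of two separate parallel regions, one larger region encloses both work-sharing loops

/* Two parallel regions: the thread team is forked and joined twice. */
#pragma omp parallel for
for (int i = 0; i < n; i++)
    a[i] = b[i] + 1.0;
#pragma omp parallel for
for (int i = 0; i < n; i++)
    c[i] = a[i] * 2.0;

/* One larger parallel region: the team is created only once. */
#pragma omp parallel
{
    #pragma omp for
    for (int i = 0; i < n; i++)
        a[i] = b[i] + 1.0;
    #pragma omp for
    for (int i = 0; i < n; i++)
        c[i] = a[i] * 2.0;
}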


Best Practices
• Avoid Parallel Regions in Inner Loops
• Otherwise we repeatedly experience the overheads of the parallel construct
• If the parallel loop sits inside two enclosing loops of n iterations each, the overheads of the #pragma omp parallel for construct are incurred n² times (see the sketch below)
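• A minimal sketch, assuming a three-dimensional array a and a hypothetical function compute():

/* Poor: the parallel region is created and destroyed n * n times. */
for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++) {
        #pragma omp parallel for
        for (int k = 0; k < n; k++)
            a[i][j][k] = compute(i, j, k);
    }

/* Better: one parallel region around the whole nest; only the inner
   work-sharing construct is encountered repeatedly. */
#pragma omp parallel
for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++) {
        #pragma omp for
        for (int k = 0; k < n; k++)
            a[i][j][k] = compute(i, j, k);
    }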
Best Practices
• Address Poor Load Balance
• In some parallel algorithms, threads have different amounts of work to do
• The threads wait at the next synchronization point until the slowest one
completes
• A solution is to use the schedule clause
• The dynamic and guided workload distribution schedules have higher
overheads than does the static scheme
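• A minimal sketch, assuming a result array and a hypothetical function irregular_work() whose cost varies with i; the dynamic schedule hands out chunks of iterations at run time:

#pragma omp parallel for schedule(dynamic, 4)   /* chunks of 4 iterations handed out on demand */
for (int i = 0; i < n; i++)
    result[i] = irregular_work(i);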
Additional Performance Considerations
• The Single Construct Versus the Master Construct
• A single region can be executed by any thread, typically the first to encounter it
• A single construct has an implicit barrier at the end, whereas this is not the case for a master region
• A master construct can therefore be more efficient: the single construct requires more work in the OpenMP library
• The single construct might be more efficient if the master thread is not likely to be the first one to reach it and the threads need to synchronize at the end of the block
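• A minimal sketch contrasting the two constructs (read_input() and print_progress() are hypothetical functions):

#pragma omp parallel
{
    #pragma omp single     /* executed by one (any) thread; implicit barrier at the end */
    read_input();

    #pragma omp master     /* executed only by the master thread; no implied barrier */
    print_progress();
    #pragma omp barrier    /* add explicitly only if the team must wait here */
}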
Additional Performance Considerations
• Avoid False Sharing
• One of the factors limiting scalable performance is false sharing
• Cache coherence protocols use state bits to track the state of each cache line
• If a single byte in a line is updated by one thread, the entire cache line is invalidated in the other processors' caches and must be fetched again
• So when threads update different data elements that happen to lie in the same cache line, they interfere with each other
• This effect is known as false sharing
Additional Performance Considerations
• Avoid False Sharing
• Array padding can be used to eliminate the problem
• Declaring the array with a padded second dimension and changing the indexing from a[i] to a[i][0] eliminates the false sharing (see the sketch after this list)
• False sharing is likely to significantly impact performance under the
following conditions:
• Shared data is modified by multiple threads
• The access pattern is such that multiple threads modify the same cache line(s)
• These modifications occur in rapid succession
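• A minimal sketch of array padding; the padding of 8 doubles is an assumption chosen to match a 64-byte cache line, and NTHREADS is an assumed team size:

#include <omp.h>
#define NTHREADS 8

/* Prone to false sharing: all elements of a[] share one cache line. */
double a[NTHREADS];

/* Padded: each thread's element a_padded[i][0] sits in its own cache line. */
double a_padded[NTHREADS][8];   /* 8 doubles = 64 bytes (assumed line size) */

void update(void)
{
    #pragma omp parallel num_threads(NTHREADS)
    {
        int id = omp_get_thread_num();
        for (int iter = 0; iter < 1000000; iter++)
            a_padded[id][0] += 1.0;   /* indexing changed from a[id] to a_padded[id][0] */
    }
}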
Additional Performance Considerations
• Private Versus Shared Data
• The programmer may often choose whether data should be shared or private
• For example, if threads need unique read/write access to a one-dimensional array, one could declare a two-dimensional shared array with one row per thread
• Alternatively, each thread might allocate a one-dimensional private array within the parallel region
• In general, the latter approach is to be preferred over the former
• If there are frequent modifications to the data, the shared array may result in false sharing, which degrades performance
• If data is read but not written in a parallel region, it could be shared, ensuring
that each thread has (read) access to it.
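• A minimal sketch of the two alternatives; NTHREADS and N are assumed sizes:

#include <omp.h>
#define NTHREADS 8
#define N 1024

/* Shared two-dimensional array, one row per thread: elements at the
   boundary between two rows can share a cache line, so frequent writes
   may cause false sharing. */
double scratch[NTHREADS][N];

void variant_shared(void)
{
    #pragma omp parallel num_threads(NTHREADS)
    {
        int id = omp_get_thread_num();
        for (int i = 0; i < N; i++)
            scratch[id][i] = 0.0;
    }
}

/* Generally preferred: each thread allocates its own private array
   inside the parallel region, on its own stack. */
void variant_private(void)
{
    #pragma omp parallel
    {
        double my_scratch[N];
        for (int i = 0; i < N; i++)
            my_scratch[i] = 0.0;
        /* ... use my_scratch within the region ... */
    }
}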
END
