
OpenMP Performance Considerations
Introduction
• It may be possible to quickly write a correctly functioning OpenMP program
• But it is not so easy to create a program that provides the desired level of performance
• This is often because some basic programming rules have not been adhered to
• Programmers have developed rules of thumb for writing efficient sequential code
• These guarantee a certain base level of performance
• They can also be extended to OpenMP programs
• Best practice: write an efficient sequential program first, then introduce OpenMP constructs
Performance Considerations for Sequential Programs
• Poor single-processor performance is often caused by suboptimal usage of cache memory
• A cache miss at the highest level of the memory hierarchy is expensive
• 5-10 times more expensive than fetching the data from the cache
• The higher the miss frequency, the poorer the program performance
• In a shared-memory system the adverse effect is greater, because more threads are involved
• Each cache miss results in additional traffic on the system interconnect
• No system on the market has an interconnect with sufficient bandwidth to keep up with such traffic from every thread
Memory Access Patterns
• Memory Hierarchy:
• The largest, and also slowest, part of memory is known as main memory
• Main memory is organized into pages, a subset of which will be available to a given
application
• The memory levels closer to the processor are successively smaller and faster
and are collectively known as cache

• When the program is compiled, the compiler will arrange its data objects to be
stored in the main memory
• They will be transferred to cache when needed
Memory Access Patterns
• If the data requested is not present in cache, this is known as a cache miss
• The data must then be retrieved from a higher level of the memory hierarchy
• Program data is brought into cache in chunks called blocks
• Data that is already in cache may need to be removed, or “evicted”,
to make space for a new block of data

• The memory hierarchy cannot be programmed directly by the programmer or the compiler
• We can only influence which data is fetched into the cache and evicted from it
• The aim is to reduce the frequency with which cache misses occur
Memory Access Patterns
• A major goal is to organize data accesses so that values are used as
often as possible while they are still in cache
• Example: Let’s consider Arrays
• C specifies that the elements of an array are stored contiguously in memory
• Thus, if an array element is fetched into cache, “nearby” elements of the
array will be in the same cache block and will be fetched as part of the same
transaction
• If a computation that uses any of these values can be performed while they
are still in cache, it will be beneficial for performance
Loop Optimizations
• If we encounter a loop nest that traverses an array in an order that does not match its storage order, we can simply exchange the order of the loop headers and most likely experience a significant performance benefit
• This strategy is called loop interchange (or loop exchange)
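• A minimal sketch of loop interchange, assuming a two-dimensional array a of doubles with n rows and n columns:

/* Before: the inner loop walks down a column, so consecutive
   iterations touch different cache lines. */
for (int j = 0; j < n; j++)
    for (int i = 0; i < n; i++)
        a[i][j] = a[i][j] + 1.0;

/* After loop interchange: the inner loop walks along a row,
   which is contiguous in C, so each cache line is fully used. */
for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
        a[i][j] = a[i][j] + 1.0;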
Loop Optimizations
• Many programs spend much of their time executing loops, and that is where most array accesses occur
• A suitable reorganization of the computation in loop nests to exploit cache can therefore significantly improve a program’s performance
• A programmer should consider transforming a loop
• If accesses to arrays in the loop nest do not occur in the order in which they
are stored in memory
• If a loop has a large body and the references to an array element or its
neighbors are far apart
Loop Optimizations
• Loop transformations can be applied only if the changes to the code do not affect correct execution of the program
• If any memory location is referenced more than once in the loop nest and at least one of those references modifies its value, then their relative ordering must not be changed by the transformation
Loop Optimizations
• Loop transformations have other purposes:
• They may help the compiler to better utilize the instruction pipeline or may increase
the amount of exploitable parallelism
• They can also be applied to increase the size of parallel regions
• Loop unrolling
• Loop unrolling, also known as loop unwinding, is a loop transformation technique
that attempts to optimize a program's execution speed at the expense of its binary
size, which is an approach known as space–time tradeoff.
• Powerful technique to effectively reduce the overheads of loop execution
• Loop unrolling can help to improve cache line utilization by improving data reuse
• It can also help to increase the instruction-level parallelism
Loop Optimizations
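• A minimal sketch of loop unrolling with an unroll factor of 2, assuming arrays a and b of length n:

/* Original loop */
for (int i = 0; i < n; i++)
    a[i] = a[i] * b[i];

/* Unrolled by a factor of 2 */
int i;
for (i = 0; i < n - 1; i += 2) {
    a[i]   = a[i]   * b[i];
    a[i+1] = a[i+1] * b[i+1];
}
/* "Cleanup" loop for the remaining iteration when n is odd */
for (; i < n; i++)
    a[i] = a[i] * b[i];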

• In this example, the loop body executes 2 iterations in one pass


• This number is called the “unroll factor”
• A higher value tends to give better performance but also increases the
number of registers needed
• If the unroll factor does not divide the iteration count, the remaining
iterations must be performed outside this loop nest
• This is implemented through a second loop, the “cleanup” loop
Loop Optimizations
• Unroll and jam is an extension of loop unrolling that is appropriate
for some loop nests with multiple loops
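• A minimal sketch of unroll and jam, assuming two-dimensional arrays a and b, a one-dimensional array c, and an even iteration count n: the outer loop is unrolled by 2 and the two copies of the inner loop are jammed (fused) together

/* Original loop nest */
for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
        a[i][j] = b[i][j] + c[j];

/* Unroll and jam: outer loop unrolled by 2, inner bodies fused,
   so c[j] is reused while it is still in cache (or in a register). */
for (int i = 0; i < n; i += 2)
    for (int j = 0; j < n; j++) {
        a[i][j]   = b[i][j]   + c[j];
        a[i+1][j] = b[i+1][j] + c[j];
    }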
Loop Optimizations
• Can these loops be optimized using loop interchange?
Loop Optimizations
• Loop fusion merges two or more loops to create a bigger loop
• This might enable data in cache to be reused more frequently
• May increase the amount of computation per iteration in order to
improve the instruction-level parallelism
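• A minimal sketch of loop fusion, assuming arrays a, b, and c of length n:

/* Two separate loops: a[] is traversed twice, so it may have to be
   brought into cache twice. */
for (int i = 0; i < n; i++)
    a[i] = b[i] + 1.0;
for (int i = 0; i < n; i++)
    c[i] = a[i] * 2.0;

/* Fused loop: a[i] is produced and consumed while it is still in cache. */
for (int i = 0; i < n; i++) {
    a[i] = b[i] + 1.0;
    c[i] = a[i] * 2.0;
}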
Loop Optimizations
• Loop fission is a transformation that breaks up a loop into several
loops
• Sometimes, we may be able to improve use of cache this way
• Isolate a part that inhibits full optimization of the loop
• This technique is likely to be most useful if a loop nest is large and its
data does not fit into cache or if we can optimize parts of the loop in
different ways
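• A minimal sketch of loop fission, assuming arrays a, b, c, and d of length n and a hypothetical function f():

/* Original loop: the call to f() may inhibit full optimization
   of the simple array update. */
for (int i = 0; i < n; i++) {
    a[i] = b[i] + c[i];
    d[i] = f(d[i]);
}

/* After loop fission: the first loop can now be optimized
   (vectorized, parallelized) on its own. */
for (int i = 0; i < n; i++)
    a[i] = b[i] + c[i];
for (int i = 0; i < n; i++)
    d[i] = f(d[i]);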
Loop Optimizations
• Loop tiling or blocking
• A powerful transformation designed to tailor the number of memory references inside a loop iteration so that they fit into cache
• It is worthwhile if data sizes are large, the memory access pattern is poor, and there is data reuse in the loop
• Loop tiling replaces each original loop by a pair of loops
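• A minimal sketch of loop tiling applied to a transpose-like nest, assuming n x n arrays a and b and a tile size B chosen so that a tile of each array fits into cache:

#define B 64   /* tile (block) size; an assumed value */

/* Original loop nest: b is accessed with a large stride. */
for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
        a[i][j] = b[j][i];

/* Tiled version: each loop becomes a pair of loops, so the working
   set of one B x B tile stays in cache while it is used. */
for (int ii = 0; ii < n; ii += B)
    for (int jj = 0; jj < n; jj += B)
        for (int i = ii; i < ii + B && i < n; i++)
            for (int j = jj; j < jj + B && j < n; j++)
                a[i][j] = b[j][i];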
Use of Pointers and Contiguous Memory in C
• The memory model in C is such that, without additional information,
one must assume that all pointers may reference any memory
address
• This is generally referred to as the pointer aliasing problem
• It prevents a compiler from performing many program optimizations
• If pointers are guaranteed to point to portions of nonoverlapping
memory, optimizations can be applied
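• In C99 this guarantee can be given with the restrict qualifier; a minimal sketch (the function and array names are illustrative):

/* Without restrict, the compiler must assume a, b, and c may overlap,
   which prevents it from reordering or vectorizing the loop freely. */
void add(int n, double * restrict a,
         const double * restrict b, const double * restrict c)
{
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];   /* no aliasing: safe to optimize aggressively */
}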
Using Compilers
• Modern compilers implement most, if not all, of the loop
optimizations
• They perform a variety of analyses to determine whether they may be
applied
• The main one is known as data dependence analysis
• They also apply a variety of techniques to reduce the number of
operations performed and reorder code to better exploit the
hardware
• It is worthwhile to experiment with compiler options to squeeze the
maximum performance out of the application
Amdahl’s law
• Amdahl's Law is a principle used in parallel computing to predict the maximum
potential speedup when using multiple processors. It's named after Gene
Amdahl, a computer architect, who formulated it in 1967.

• If we denote by T_1 the execution time of an application on 1 processor, then in an ideal situation the execution time on P processors should be T_1/P
• If T_P denotes the execution time on P processors, then the parallel speedup is the ratio S = T_1/T_P
• Parallel speedup is a measure of the success of the parallelization


Amdahl’s law
• Virtually all programs contain some regions that are suitable for
parallelization and other regions that are not
• By using an increasing number of processors, the time spent in the
parallelized parts of the program is reduced, but the sequential
section remains the same
• Eventually the execution time is completely dominated by the time
taken to compute the sequential portion, which puts an upper limit
on the expected speedup
S = 1 / (f_par/P + (1 − f_par))
• f_par is the parallel fraction of the code and P is the number of processors
Amdahl’s law
• Suppose that 70% of a program's execution can be sped up if the program is parallelized and run on 16 processing units instead of one. What is the maximum speedup that can be achieved by the whole program?
• What is the maximum speedup if we increase the number of processing units to 32, then to 64, and then to 128?
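• A worked sketch using the formula above with f_par = 0.7:
P = 16:  S = 1 / (0.7/16 + 0.3) = 1 / 0.34375 ≈ 2.91
P = 32:  S = 1 / (0.7/32 + 0.3) ≈ 3.11
P = 64:  S = 1 / (0.7/64 + 0.3) ≈ 3.22
P = 128: S = 1 / (0.7/128 + 0.3) ≈ 3.27
• As P grows, the speedup approaches 1/0.3 ≈ 3.33, so the 30% sequential part limits the whole program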
Amdahl’s law
• Obstacles along the way to perfect linear speedup are the overheads
introduced by forking and joining threads, thread synchronization,
and memory accesses
• A measure of a program’s ability to decrease the execution time of
the code with an increasing number of processors is referred to as
parallel scalability
Measuring OpenMP Performance
• How to measure and identify what factors determine overall program
performance
• On Unix systems we can use: /bin/time ./a.out
Measuring OpenMP Performance
• Real: Program took 5.4 seconds from beginning to end
• User: The time the program spent executing outside any operating
system services
• Sys: The time spent on operating system services, such as
input/output routines

• CPU time: The sum of user and system time


• The real time is also referred to as wall-clock time or elapsed time
Measuring OpenMP Performance
• There may be a difference between real time and CPU time, for example when the application did not get a full processor to itself because of a high load on the system
• An OpenMP program also has additional overheads
• These overheads are collectively called the parallel overhead
• It includes the time to
• Create, start, and stop threads
• The extra work needed to figure out what each task is to perform
• The time spent waiting in barriers and at critical sections and locks
• The time spent computing some operations redundantly
Measuring OpenMP Performance
T_CPU(P) = (1 + O_P · P) · T_Serial

T_Elapsed(P) = (f/P + (1 − f) + O_P · P) · T_Serial

• T_Serial is the CPU time of the original serial version of the application
• P is the number of processors
• O_P is the parallel overhead added per processor, assumed to be a constant percentage
• f ∈ [0, 1] is the fraction of execution time that has been parallelized
Measuring OpenMP Performance
• Suppose the original program takes T_Serial = 10.20 seconds to run and code corresponding to 95% of the execution time has been parallelized. Assume that each additional processor adds a 2% overhead to the total CPU time. Compute the speedup and efficiency of the parallel program with 4 processors. Also estimate the T_CPU and T_Elapsed of the given program.
Solution
• To compute the speedup and efficiency of the parallel program with 4 processors, we need to calculate the total execution time with and without parallelization, and then use these values to find the speedup and efficiency.
• Given that 95% of the execution time is parallelized, we calculate the execution times with parallelization as shown below:
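• A worked sketch, applying the T_CPU and T_Elapsed formulas above with T_Serial = 10.20 s, f = 0.95, O_P = 0.02, and P = 4, and taking efficiency as speedup divided by processor count:
T_CPU(4) = (1 + 0.02 · 4) · 10.20 = 1.08 · 10.20 ≈ 11.02 seconds
T_Elapsed(4) = (0.95/4 + (1 − 0.95) + 0.02 · 4) · 10.20 = (0.2375 + 0.05 + 0.08) · 10.20 ≈ 3.75 seconds
Speedup S = T_Serial / T_Elapsed(4) = 10.20 / 3.75 ≈ 2.72
Efficiency E = S / P ≈ 2.72 / 4 ≈ 0.68, i.e. about 68%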
Measuring OpenMP Performance
• Suppose the original program takes T_Serial = 18.35 seconds to run and code corresponding to 72% of the execution time has been parallelized. Assume that each additional processor adds a 6% overhead to the total CPU time. Compute the speedup and efficiency of the parallel program with 8 processors. Also estimate the T_CPU and T_Elapsed of the given program.
Measuring OpenMP Performance
• The observable performance of OpenMP programs is influenced by at least
the following factors
• The manner in which memory is accessed by the individual threads
• The fraction of the work that is sequential, or replicated (Sequential overheads)
• The amount of time spent handling OpenMP constructs (Parallelization overheads)
• When a work-sharing directive is implemented, the work to be performed by each thread is
usually determined at run time
• The load imbalance between synchronization points (Load imbalance overheads)
• Threads perform different amounts of work in a work-shared region
• Threads might have to wait for a member of their team to carry out the work of a single
construct
• Other synchronization costs (Synchronization overheads)
Measuring OpenMP Performance
• Suppose that 65% of program execution can be sped up if the program is parallelized and run on 8 processing units instead of one. What is the maximum speedup that can be achieved by the whole program? What is the maximum speedup if we increase the number of processing units to 16?
• Suppose the original program takes T_Serial = 9.4 seconds to run and code corresponding to 68% of the execution time has been parallelized. Assume that 6 processors incur a total overhead of 0.24 units on the total CPU time. Compute the speedup and efficiency of the parallel program with 6 processors. Also estimate the T_CPU and T_Elapsed of the given program.
Best Practices
• Optimize Barrier Use
• Barriers are expensive operations
• The nowait clause can be used to remove barriers that are not needed for correctness
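• A minimal sketch using nowait, assuming arrays a, b, c, and d and sizes n and m; the barrier at the end of the first loop can be dropped because the second loop does not depend on its results:

#pragma omp parallel
{
    #pragma omp for nowait    /* no barrier: threads move on immediately */
    for (int i = 0; i < n; i++)
        a[i] = b[i] + 1.0;

    #pragma omp for           /* independent of a[], so the earlier barrier is not needed */
    for (int j = 0; j < m; j++)
        c[j] = d[j] * 2.0;
}   /* the implicit barrier at the end of the parallel region still synchronizes the team */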
Best Practices
• Avoid the Ordered Construct
• The ordered construct ensures that the corresponding block of code within a parallel loop is executed in the order of the loop iterations
• The runtime system has to keep track of which iterations have finished and possibly keep threads in a wait state until their results are needed
• Avoid Large Critical Regions
• The more code contained in a critical region, the greater the likelihood that threads have to wait to enter it, and the longer the potential wait times (see the sketch below)
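• A minimal sketch of keeping a critical region small, assuming a summation and a hypothetical function expensive(): each thread accumulates into a private variable and only the short final update is protected

double sum = 0.0;
#pragma omp parallel
{
    double local = 0.0;
    #pragma omp for
    for (int i = 0; i < n; i++)
        local += expensive(i);   /* the costly work stays outside the critical region */

    #pragma omp critical         /* only the cheap update is serialized */
    sum += local;
}

• In this particular case a reduction(+:sum) clause on the loop would avoid the critical construct altogether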
Best Practices
• Maximize Parallel Regions
• Overheads are associated with starting and terminating a parallel region
• Large parallel regions offer more opportunities for using data in cache and provide a bigger context for other compiler optimizations
• Therefore, minimize the number of parallel regions (see the sketch below)
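• A minimal sketch, assuming arrays a, b, and c of length n: instead of two separate parallel regions, one larger region encloses both work-sharing loops

/* Two parallel regions: the thread team is forked and joined twice. */
#pragma omp parallel for
for (int i = 0; i < n; i++)
    a[i] = b[i] + 1.0;
#pragma omp parallel for
for (int i = 0; i < n; i++)
    c[i] = a[i] * 2.0;

/* One larger parallel region: the team is created only once. */
#pragma omp parallel
{
    #pragma omp for
    for (int i = 0; i < n; i++)
        a[i] = b[i] + 1.0;
    #pragma omp for
    for (int i = 0; i < n; i++)
        c[i] = a[i] * 2.0;
}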


Best Practices
• Avoid Parallel Regions in Inner Loops
• Otherwise we repeatedly experience the overheads of the parallel construct
• If the parallel loop sits inside two enclosing loops of n iterations each, the overheads of the #pragma omp parallel for construct are incurred n² times (see the sketch below)
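• A minimal sketch, assuming a three-dimensional array a and a hypothetical function compute():

/* Poor: the parallel region is created and destroyed n * n times. */
for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++) {
        #pragma omp parallel for
        for (int k = 0; k < n; k++)
            a[i][j][k] = compute(i, j, k);
    }

/* Better: one parallel region around the whole nest; only the inner
   work-sharing construct is encountered repeatedly. */
#pragma omp parallel
for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++) {
        #pragma omp for
        for (int k = 0; k < n; k++)
            a[i][j][k] = compute(i, j, k);
    }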
Best Practices
• Address Poor Load Balance
• In some parallel algorithms, threads have different amounts of work to do
• The threads wait at the next synchronization point until the slowest one
completes
• A solution is to use the schedule clause
• The dynamic and guided workload distribution schedules have higher
overheads than does the static scheme
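• A minimal sketch, assuming a result array and a hypothetical function irregular_work() whose cost varies with i; the dynamic schedule hands out chunks of iterations at run time:

#pragma omp parallel for schedule(dynamic, 4)   /* chunks of 4 iterations handed out on demand */
for (int i = 0; i < n; i++)
    result[i] = irregular_work(i);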
Additional Performance Considerations
• The Single Construct Versus the Master Construct
• A single region can be executed by any thread, typically the first to encounter it
• A single construct has an implicit barrier at the end, whereas this is not the case for a master region
• A master construct can therefore be more efficient: the single construct requires more work in the OpenMP library
• The single construct might be more efficient if the master thread is not likely to be the first one to reach it and the threads need to synchronize at the end of the block
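• A minimal sketch contrasting the two constructs (read_input() and print_progress() are hypothetical functions):

#pragma omp parallel
{
    #pragma omp single     /* executed by one (any) thread; implicit barrier at the end */
    read_input();

    #pragma omp master     /* executed only by the master thread; no implied barrier */
    print_progress();
    #pragma omp barrier    /* add explicitly only if the team must wait here */
}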
Additional Performance Considerations
• Avoid False Sharing
• One of the factors limiting scalable performance is false sharing
• Cache coherence protocols use state bits to track the state of each cache line
• If a single byte in a line is updated by one thread, the entire cache line is invalidated in the other processors' caches and must be fetched again
• So when threads update different data elements that happen to lie in the same cache line, they interfere with each other
• This effect is known as false sharing
Additional Performance Considerations
• Avoid False Sharing
• Array padding can be used to eliminate the problem
• Declaring the array with a padded second dimension and changing the indexing from a[i] to a[i][0] eliminates the false sharing (see the sketch after this list)
• False sharing is likely to significantly impact performance under the
following conditions:
• Shared data is modified by multiple threads
• The access pattern is such that multiple threads modify the same cache line(s)
• These modifications occur in rapid succession
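• A minimal sketch of array padding; the padding of 8 doubles is an assumption chosen to match a 64-byte cache line, and NTHREADS is an assumed team size:

#include <omp.h>
#define NTHREADS 8

/* Prone to false sharing: all elements of a[] share one cache line. */
double a[NTHREADS];

/* Padded: each thread's element a_padded[i][0] sits in its own cache line. */
double a_padded[NTHREADS][8];   /* 8 doubles = 64 bytes (assumed line size) */

void update(void)
{
    #pragma omp parallel num_threads(NTHREADS)
    {
        int id = omp_get_thread_num();
        for (int iter = 0; iter < 1000000; iter++)
            a_padded[id][0] += 1.0;   /* indexing changed from a[id] to a_padded[id][0] */
    }
}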
Additional Performance Considerations
• Private Versus Shared Data
• The programmer may often choose whether data should be shared or private
• For example, if threads need unique read/write access to a one-dimensional array, one could declare a two-dimensional shared array with one row per thread
• Alternatively, each thread might allocate a one-dimensional private array within the parallel region
• In general, the latter approach is to be preferred over the former
• If there are frequent modifications to the data, the shared array may result in false sharing, which degrades performance
• If data is read but not written in a parallel region, it could be shared, ensuring
that each thread has (read) access to it.
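• A minimal sketch of the two alternatives; NTHREADS and N are assumed sizes:

#include <omp.h>
#define NTHREADS 8
#define N 1024

/* Shared two-dimensional array, one row per thread: elements at the
   boundary between two rows can share a cache line, so frequent writes
   may cause false sharing. */
double scratch[NTHREADS][N];

void variant_shared(void)
{
    #pragma omp parallel num_threads(NTHREADS)
    {
        int id = omp_get_thread_num();
        for (int i = 0; i < N; i++)
            scratch[id][i] = 0.0;
    }
}

/* Generally preferred: each thread allocates its own private array
   inside the parallel region, on its own stack. */
void variant_private(void)
{
    #pragma omp parallel
    {
        double my_scratch[N];
        for (int i = 0; i < N; i++)
            my_scratch[i] = 0.0;
        /* ... use my_scratch within the region ... */
    }
}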
END
