Unit 3 HPC


OpenMP in HPC

OpenMP (Open Multi-Processing) is a widely adopted API for parallel programming in shared-memory systems, playing a crucial
role in High-Performance Computing (HPC). It enables developers to parallelize applications efficiently, leveraging multi-core processors to
enhance performance.

🧠 OpenMP in High-Performance Computing (HPC)

In HPC, OpenMP is primarily utilized for parallelizing tasks within a single node, harnessing the power of multiple CPU cores. This is
particularly beneficial for applications that require intensive computation and can be divided into smaller, independent tasks.

Key Features:

 Shared Memory Model: Threads within a process share the same memory space, facilitating efficient data sharing and
communication.
 Compiler Directives: OpenMP uses compiler directives (e.g., #pragma omp parallel) to specify parallel regions in the code,
allowing for straightforward parallelization.
 Thread Management: It provides constructs for thread creation, synchronization, and management, simplifying the development of
parallel applications.
 Scalability: OpenMP allows applications to scale across multiple cores within a node, improving performance for parallel workloads.

⚙️ Hybrid Parallelism: Combining OpenMP with MPI

While OpenMP excels in shared-memory environments, many HPC applications require distributed-memory systems. To address this, a hybrid
parallelism model combining OpenMP with MPI (Message Passing Interface) is commonly employed.

Hybrid Model Overview:

 MPI: Manages parallelism across multiple nodes in a cluster, handling inter-process communication.
 OpenMP: Handles parallelism within each node, utilizing multiple cores to perform computations concurrently.

This hybrid approach allows applications to efficiently utilize both distributed and shared memory architectures, enhancing scalability and
performance. For instance, MPI manages communication between nodes, while OpenMP manages parallel computation within each node.
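
Below is a minimal sketch of the hybrid model, assuming an MPI installation and an OpenMP-capable compiler (built with something like mpicc -fopenmp); the output format is illustrative.

c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    // MPI: one process per node, communicating across the cluster
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // OpenMP: a team of threads within each process/node
    #pragma omp parallel
    {
        printf("MPI rank %d, OpenMP thread %d\n", rank, omp_get_thread_num());
    }

    MPI_Finalize();
    return 0;
}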

Best Practices for OpenMP in HPC

To maximize the effectiveness of OpenMP in HPC applications, consider the following best practices:

 Efficient Thread Management: Use appropriate thread counts to match the hardware capabilities, avoiding oversubscription of
cores.
 Data Locality: Organize data to enhance cache performance and minimize memory latency.
 Synchronization Minimization: Reduce the use of synchronization constructs to avoid bottlenecks.
 Load Balancing: Ensure that work is evenly distributed among threads to prevent idle times.
 Profiling and Optimization: Regularly profile the application to identify performance bottlenecks and optimize accordingly.
By adhering to these practices, developers can enhance the performance and scalability of their HPC applications using OpenMP. A brief sketch illustrating two of these practices (matching the thread count to the hardware and using a schedule clause for load balancing) follows.
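
This is an illustrative sketch only, not a tuning recipe; the array size, chunk size, and per-iteration work are placeholders, and in practice the thread count is usually set via OMP_NUM_THREADS rather than in code.

c
#include <omp.h>
#include <stdio.h>

int main() {
    // Match the thread count to the available cores to avoid oversubscription
    omp_set_num_threads(omp_get_num_procs());

    double work[1000];
    // schedule(dynamic) helps balance the load when iteration costs vary
    #pragma omp parallel for schedule(dynamic, 16)
    for (int i = 0; i < 1000; i++) {
        work[i] = i * 0.5;   // placeholder for uneven per-iteration work
    }

    printf("first = %f, last = %f\n", work[0], work[999]);
    return 0;
}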

Basics of OpenMP

OpenMP (Open Multi-Processing) is a widely used API for parallel programming in shared-memory systems, enabling developers to
write parallel code in C, C++, and Fortran. It simplifies the process of parallelizing applications by providing compiler directives, runtime
routines, and environment variables.

🧠 Core Concepts of OpenMP

1. Shared Memory Model

OpenMP is designed for shared-memory architectures, where multiple processors or cores can access a common memory space. This model allows threads to communicate by reading and writing to shared variables, facilitating efficient data sharing.

2. Thread-Based Parallelism

OpenMP utilizes threads as the basic unit of execution. The master thread forks additional threads to execute parallel regions of code, and these
threads synchronize upon completion.

3. Fork-Join Execution Model

OpenMP programs begin with a single master thread. When a parallel region is encountered, the master thread forks a team of threads to execute the enclosed code in parallel. After completing the parallel region, the threads synchronize and terminate, returning control to the master thread.

4. Compiler Directives

OpenMP uses compiler directives (e.g., #pragma omp parallel) to specify parallel regions in the code. These directives are interpreted by
the compiler to generate parallel code.

Basic OpenMP Example

Here's a simple example demonstrating OpenMP in C:

c
#include <stdio.h>
#include <omp.h>

int main() {
    #pragma omp parallel
    {
        printf("Hello, World! from thread %d\n", omp_get_thread_num());
    }
    return 0;
}

To compile and run this program:


bash
gcc -fopenmp -o hello hello.c
./hello

This program will print "Hello, World!" from each thread, with each thread identifying itself by its thread number.

🔧 Key Features of OpenMP

 Parallel Regions: Sections of code that can be executed in parallel are enclosed within #pragma omp parallel directives.
 Work Sharing: Distributes loop iterations or blocks of code among threads using constructs like #pragma omp for.
 Synchronization: Ensures correct execution order using constructs like #pragma omp barrier and #pragma omp
critical.
 Data Environment: Manages data sharing attributes (e.g., shared, private, firstprivate) to control variable visibility
among threads (see the sketch after this list).
 Runtime Control: Environment variables like OMP_NUM_THREADS control the number of threads used during execution.
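
A minimal sketch of these data-sharing clauses; the variable names, the team size of 4, and the use of a critical section for the update are illustrative choices.

c
#include <omp.h>
#include <stdio.h>

int main() {
    int shared_total = 0;   // shared: one copy visible to all threads
    int seed = 42;          // firstprivate: each thread gets its own initialized copy

    #pragma omp parallel shared(shared_total) firstprivate(seed) num_threads(4)
    {
        // declared inside the region, so private to each thread
        int local = seed + omp_get_thread_num();
        #pragma omp critical
        shared_total += local;
    }

    printf("shared_total = %d\n", shared_total);
    return 0;
}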

Parallel Regions and Work-Sharing Constructs

OpenMP (Open Multi-Processing) is a widely used API for parallel programming in shared-memory systems. It enables developers to parallelize
applications efficiently by providing compiler directives, runtime routines, and environment variables.

🧠 Parallel Regions in OpenMP

A parallel region in OpenMP is a block of code that is executed by multiple threads in parallel. It is defined using the #pragma omp
parallel directive.

Syntax:

c
#pragma omp parallel
{
    // Code to be executed in parallel
}

When the program encounters a parallel region, it creates a team of threads. Each thread executes the code within the parallel region concurrently. Variables declared inside the parallel region are private to each thread, while variables declared outside are shared among all threads by default.

🔄 Work-Sharing Constructs in OpenMP

Work-sharing constructs divide the execution of a block of code among the threads in a team. These constructs do not launch new threads; they distribute the work among existing threads. There is no implied barrier upon entry to a work-sharing construct; however, there is an implied barrier at the end unless a nowait clause is specified, as illustrated in the sketch below.
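
A minimal sketch of the implied barrier and the nowait clause; the array sizes are arbitrary, and skipping the barrier is safe here only because the second loop does not read the first loop's results.

c
#include <omp.h>
#include <stdio.h>

int main() {
    int a[100], b[100];

    #pragma omp parallel
    {
        // nowait removes the implied barrier at the end of this loop,
        // so threads may start the next loop without waiting for the others
        #pragma omp for nowait
        for (int i = 0; i < 100; i++)
            a[i] = i;

        #pragma omp for
        for (int i = 0; i < 100; i++)
            b[i] = 2 * i;   // independent of a[], so no barrier is needed between the loops
    }

    printf("%d %d\n", a[99], b[99]);
    return 0;
}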

Types of Work-Sharing Constructs:

1. #pragma omp for / #pragma omp do:


o Distributes iterations of a loop across the threads in a team.
o Each thread executes a subset of the loop iterations.
o Suitable for loops with independent iterations.
o Example:

c
#pragma omp parallel for
for (int i = 0; i < N; i++) {
    // Loop body
}

2. #pragma omp sections:


o Divides a block of code into separate sections.
o Each section is executed by one thread.
o Useful for tasks that can be performed independently.
o Example:

c
#pragma omp parallel sections
{
    #pragma omp section
    {
        // Code for section 1
    }
    #pragma omp section
    {
        // Code for section 2
    }
}

3. #pragma omp single:


o Specifies that the enclosed code block is executed by only one thread.
o Other threads skip the block.
o Useful for initialization or tasks that should be done only once.
o Example:

c
#pragma omp parallel
{
    #pragma omp single
    {
        // Code to be executed by only one thread
    }
}

4. #pragma omp workshare:


o A Fortran-only construct that divides the execution of the enclosed code (e.g., array assignments, FORALL and WHERE constructs) among the threads in a team.
o Each unit of work is executed only once, by one thread.
o There is an implicit barrier at the end unless the nowait clause is specified.
o Example:

fortran
!$omp parallel
  ! Code before workshare
  !$omp workshare
    ! Code to be distributed among threads
  !$omp end workshare
  ! Code after workshare
!$omp end parallel

Synchronization in OpenMP: Critical Sections and Barriers

Critical Sections

The critical construct ensures that a specific section of code is executed by only one thread at a time, preventing race conditions when
multiple threads access shared resources.

Syntax:

 C/C++:

c
#pragma omp critical [(name)]
{
    // Code to be executed by one thread at a time
}

 Fortran:

fortran
!$OMP CRITICAL [name]
  ! Code to be executed by one thread at a time
!$OMP END CRITICAL

If the optional name is omitted, all unnamed critical sections share a single global lock. Giving critical sections different names gives them separate locks, so unrelated critical sections no longer block one another, reducing contention. A sketch with two independently named critical sections follows the basic example below.

Example:

c
#include <omp.h>
#include <stdio.h>

int main() {
    int x = 0;
    #pragma omp parallel
    {
        #pragma omp critical
        {
            x = x + 1;
        }
    }
    printf("Final value of x: %d\n", x);
    return 0;
}

In this example, the increment of x is protected by the critical directive, ensuring that only one thread modifies x at a time.
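
A minimal sketch of named critical sections, assuming two unrelated shared counters; the names hit_lock and miss_lock are illustrative.

c
#include <omp.h>
#include <stdio.h>

int main() {
    int hits = 0, misses = 0;   // two unrelated shared counters

    #pragma omp parallel for
    for (int i = 0; i < 1000; i++) {
        if (i % 2 == 0) {
            // protected by its own named lock, so it never blocks miss_lock
            #pragma omp critical(hit_lock)
            hits++;
        } else {
            #pragma omp critical(miss_lock)
            misses++;
        }
    }

    printf("hits = %d, misses = %d\n", hits, misses);
    return 0;
}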

⛔ Barriers

The barrier construct synchronizes all threads in a team. When a thread encounters a barrier, it waits until all other threads have reached the same point before proceeding.

Syntax:

 C/C++:

c
#pragma omp barrier

 Fortran:

fortran
!$OMP BARRIER

Example:

c
#include <omp.h>
#include <stdio.h>

int main() {
    #pragma omp parallel
    {
        printf("Thread %d before barrier\n", omp_get_thread_num());
        #pragma omp barrier
        printf("Thread %d after barrier\n", omp_get_thread_num());
    }
    return 0;
}

In this example, all threads print a message before and after the barrier. The barrier ensures that all threads reach the same point before any thread
proceeds beyond it.

Threading, Synchronization, and Critical Sections in OpenMP

Threading in OpenMP

OpenMP employs a fork-join model for parallel execution:

1. Fork: The master thread spawns a team of threads to execute a parallel region.
2. Join: Upon completion, threads synchronize and terminate, returning control to the master thread.

Threads are identified by unique IDs, accessible via omp_get_thread_num(). The number of threads can be controlled using the OMP_NUM_THREADS environment variable or the num_threads clause of the #pragma omp parallel directive, as in the sketch below.
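
A minimal sketch of controlling and inspecting the team size; the request for four threads is illustrative (the runtime may grant fewer).

c
#include <omp.h>
#include <stdio.h>

int main() {
    // Request four threads for this region via the num_threads clause;
    // OMP_NUM_THREADS applies to regions that do not specify a count.
    #pragma omp parallel num_threads(4)
    {
        int id = omp_get_thread_num();     // unique ID within the team
        int n  = omp_get_num_threads();    // actual team size
        printf("Thread %d of %d\n", id, n);
    }
    return 0;
}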

🔐 Synchronization Constructs

Synchronization ensures correct execution order and data consistency among threads. Key constructs include:
1. Critical Sections

The critical construct restricts access to a block of code, allowing only one thread to execute it at a time. This prevents race conditions when
multiple threads access shared resources.

Syntax:

 C/C++:

c
#pragma omp critical
{
    // Code to be executed by one thread at a time
}

 Fortran:

fortran
!$OMP CRITICAL
  ! Code to be executed by one thread at a time
!$OMP END CRITICAL

Example:

c
#include <omp.h>
#include <stdio.h>

int main() {
    int x = 0;
    #pragma omp parallel
    {
        #pragma omp critical
        {
            x = x + 1;
        }
    }
    printf("Final value of x: %d\n", x);
    return 0;
}

In this example, the increment of x is protected by the critical directive, ensuring that only one thread modifies x at a time.

2. Barriers

The barrier construct synchronizes all threads in a team. When a thread encounters a barrier, it waits until all other threads have reached the
same point before proceeding.

Syntax:

 C/C++:

c
#pragma omp barrier

 Fortran:

fortran
!$OMP BARRIER

Example:

c
#include <omp.h>
#include <stdio.h>

int main() {
    #pragma omp parallel
    {
        printf("Thread %d before barrier\n", omp_get_thread_num());
        #pragma omp barrier
        printf("Thread %d after barrier\n", omp_get_thread_num());
    }
    return 0;
}

In this example, all threads print a message before and after the barrier. The barrier ensures that all threads reach the same point before any thread
proceeds beyond it.

Parallel Loops and Work Sharing in OpenMP

Parallel Loops in OpenMP

OpenMP provides the #pragma omp parallel for directive to parallelize loops, enabling concurrent execution of loop iterations by
multiple threads.

Syntax:

c
#pragma omp parallel for [clause[ [,] clause] ...]
for (initialization; condition; increment) {
    // Loop body
}

This directive combines the parallel and for constructs, creating a team of threads that divide the loop iterations among themselves. Each
thread executes a subset of iterations, enhancing performance through parallelism.

Example:

c
#include <omp.h>
#include <stdio.h>

int main() {
    int sum = 0;
    int a[100];
    for (int i = 0; i < 100; i++) {
        a[i] = i + 1;
    }

    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 100; i++) {
        sum += a[i];
    }

    printf("Total sum: %d\n", sum);
    return 0;
}

In this example, the reduction clause ensures that each thread maintains a private copy of sum; these copies are then combined at the end to produce the final result.

🔄 Work-Sharing Constructs

Work-sharing constructs in OpenMP allow for the division of work among threads without creating new threads. These constructs are used within
a parallel region to distribute tasks among the existing threads.

1. #pragma omp for / #pragma omp do


Distributes loop iterations across the threads in the team. Each thread executes a subset of iterations.

2. #pragma omp sections / #pragma omp section

Divides code into separate sections, each of which is executed by one thread.

3. #pragma omp single

Specifies that a block of code should be executed by only one thread.

4. #pragma omp workshare

Distributes the execution of a block of Fortran code (e.g., array assignments) among the threads in a team; this construct is available only in Fortran.

5. #pragma omp parallel for

A combination of parallel and for constructs, creating a team of threads and distributing loop iterations among them.

Loop-Level Parallelism in HPC

What Is Loop-Level Parallelism?

Loop-level parallelism entails executing multiple iterations of a loop simultaneously, leveraging multiple threads or processing units. This is
feasible when iterations are independent, meaning they do not share data that could lead to race conditions. By parallelizing loops, computational
tasks can be completed more quickly, making this method ideal for performance-critical applications.

🧵 Implementing Loop-Level Parallelism in HPC

In HPC, loop-level parallelism is typically achieved using parallel programming models like OpenMP, MPI, or hybrid approaches. These models
allow for the distribution of loop iterations across multiple processing units, enhancing computational efficiency.

Example with OpenMP:

c
#include <omp.h>
#include <stdio.h>

int main() {
    int sum = 0;
    int a[100];
    for (int i = 0; i < 100; i++) {
        a[i] = i + 1;
    }

    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 100; i++) {
        sum += a[i];
    }

    printf("Total sum: %d\n", sum);
    return 0;
}

In this example, the reduction clause ensures that each thread maintains a private copy of the sum variable; these copies are then combined at the end to produce the final result.

⚙️ Types of Loop-Level Parallelism

Loop-level parallelism can be categorized based on how iterations are distributed and dependencies are managed:

1. DO-ALL Parallelism (Independent Multithreading): Each loop iteration is independent, allowing all iterations to be executed in
parallel without inter-thread communication. This is the simplest form of parallelism and is highly efficient when applicable.
2. DO-ACROSS Parallelism (Cyclic Multithreading): Iterations are assigned to threads in a round-robin manner. Dependencies between iterations are managed by delaying the start of each iteration until all dependencies from previous iterations are satisfied. This approach increases parallelism by overlapping the sequential portion of iterations with parallel execution (a sketch using OpenMP's ordered construct appears after this list).
3. DO-PIPE Parallelism (Pipelined Multithreading): The loop body is divided into stages, each assigned to a different thread. Each
iteration of the loop is distributed across all threads, with each thread executing its assigned stage. This method is effective for loops
with cross-iteration dependencies, allowing for parallel execution while maintaining data flow integrity.
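
A minimal sketch of handling a loop-carried dependency with OpenMP's ordered construct; this only approximates DO-ACROSS-style execution, and the running-sum computation is an illustrative stand-in for real dependent work.

c
#include <omp.h>
#include <stdio.h>

int main() {
    int prefix[16];
    int running = 0;

    // The running sum depends on the previous iteration, so only the dependent
    // statements are serialized with "ordered"; the independent work above
    // them can still execute in parallel.
    #pragma omp parallel for ordered
    for (int i = 0; i < 16; i++) {
        int value = i * i;          // independent work, runs in parallel
        #pragma omp ordered
        {
            running += value;       // dependent work, executed in iteration order
            prefix[i] = running;
        }
    }

    printf("prefix[15] = %d\n", prefix[15]);
    return 0;
}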

⚠️ Challenges and Considerations

 Data Dependencies: Loops with dependencies between iterations (e.g., one iteration's output is another's input) cannot be trivially
parallelized. Identifying and managing these dependencies is crucial to avoid race conditions and ensure correct execution.
 Synchronization Overhead: Introducing parallelism often requires synchronization mechanisms to manage shared resources, which
can introduce overhead and reduce performance gains.
 Load Balancing: Uneven distribution of iterations among threads can lead to some threads being idle while others are overloaded,
affecting overall performance. Proper scheduling strategies are needed to balance the workload effectively.
