This project demonstrates how to run MPI (Message Passing Interface) programs across multiple nodes using a GNS3 network simulation. Four Docker containers act as compute nodes connected through an Ethernet switch, providing a controlled environment to observe real MPI communication over an emulated network layer.
- MPI Fundamentals -- What MPI is, the communication gap in HPC clusters, major implementations (OpenMPI, MPICH), core API functions, and network transport.
- Collective Operations -- Distribution (Broadcast, Scatter), Collection (Gather, Allgather), Computation (Reduce, Allreduce, Reduce-Scatter, Scan), Shuffling (All-to-All), and Synchronization (Barrier) explained with analogies and diagrams.
- Launching and Orchestrating MPI Jobs -- Launcher vs Worker roles, the launch sequence, the PRRTE/hwloc/PMIx stack, and Slurm integration.
- Communication Topologies -- How MPI internally implements collectives using tree, ring, recursive-halving, and other algorithms.
We use four Docker containers running Ubuntu, all connected to a single Ethernet switch.
| Node | IP Address |
|---|---|
| Node-A | 10.10.10.100 |
| Node-B | 10.10.10.101 |
| Node-C | 10.10.10.102 |
| Node-D | 10.10.10.103 |
The file "/etc/network/interfaces" inside each container defines how interfaces are brought up:
# eth0 gets IP via DHCP
auto eth0
iface eth0 inet dhcp
The management host acts as a DHCP server, handling IPv4 assignment for all nodes and providing the external connectivity needed to download packages. Refer to this guide to configure the management host.
Ping all other nodes from Node-A to make sure you have full connectivity.
Node-A (the launcher) needs passwordless SSH access to every other node so it can start processes without a password prompt.
Run this on Node-A:
# Generate an SSH key (press Enter for all prompts to leave no passphrase)
ssh-keygen -t rsa
# Copy the key to each node (password is gns3)
ssh-copy-id root@10.10.10.101
ssh-copy-id root@10.10.10.102
ssh-copy-id root@10.10.10.103
Verify by running ssh root@10.10.10.101, ssh root@10.10.10.102, and ssh root@10.10.10.103. If you log in without a password each time, it worked.
Install the MPI packages on all four machines so every node has the same compiler wrapper and runtime libraries.
Run this on Node-A, Node-B, Node-C, and Node-D:
apt update
apt install openmpi-bin libopenmpi-dev -y
Node-A needs to know which hosts are in the cluster and how many processes (slots) each node can run.
On Node-A, create a text file:
nano my_hosts.txt
Add the IP addresses:
localhost slots=1
10.10.10.101 slots=1
10.10.10.102 slots=1
10.10.10.103 slots=1
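If you prefer to script the setup, the same file can be generated in one shot (a sketch assuming the IPs listed above):

```shell
# Write my_hosts.txt with one slot per node (same content as above)
{
  echo "localhost slots=1"
  for i in 101 102 103; do
    echo "10.10.10.$i slots=1"
  done
} > my_hosts.txt
```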
For every example, the workflow is the same:
# Compile on Node-A
mpicc codes/XX_example.c -o XX_example
# Copy the binary to the same path on all other nodes
scp XX_example root@10.10.10.101:/root/
scp XX_example root@10.10.10.102:/root/
scp XX_example root@10.10.10.103:/root/
# Run with 4 processes across the cluster
mpirun --allow-run-as-root --hostfile my_hosts.txt -np 4 ./XX_example
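Since this compile-copy-run cycle repeats for all 18 examples, it can be wrapped in a small helper. The function name `deploy` and the `RUN` dry-run variable are illustrative, not part of the project; the paths and IPs match the setup above:

```shell
# Hypothetical helper: copy a binary to the worker nodes, then launch it.
# Set RUN=echo for a dry run that only prints the commands.
deploy() {
    bin=${1:?usage: deploy <binary>}
    for n in 10.10.10.101 10.10.10.102 10.10.10.103; do
        ${RUN:-} scp "$bin" "root@$n:/root/"
    done
    ${RUN:-} mpirun --allow-run-as-root --hostfile my_hosts.txt -np 4 "./$bin"
}
```

Usage: `deploy 01_hello_mpi` after compiling on Node-A.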
All source files live in the codes/ folder. They are numbered in the order you should work through them.
| # | Example | Level | Description |
|---|---|---|---|
| 01 | Hello World | 1 | Init MPI, print rank and hostname |
| 02 | Send / Receive | 2 | Root sends a value to each rank (point-to-point) |
| 03 | Send / Receive with Tags | 2 | Use tags to distinguish multiple messages |
| 04 | Ping-Pong | 2 | Two ranks bounce a value back and forth |
| 05 | Ring | 2 | Pass a token around a ring of ranks |
| 06 | Broadcast | 3 | Root sends one value to all ranks |
| 07 | Scatter | 3 | Root deals array elements to ranks |
| 08 | Gather | 3 | Ranks send values; root collects into array |
| 09 | Reduce | 3 | Combine values with SUM; result at root only |
| 10 | Allreduce | 4 | Combine values with SUM; result at every rank |
| 11 | Allgather | 4 | Each rank contributes a value; everyone gets all |
| 12 | Barrier | 4 | Synchronize all ranks (no data moved) |
| 13 | Scatter + Gather | 5 | Distribute work, compute locally, collect results |
| 14 | Scatterv / Gatherv | 5 | Variable-length chunks (uneven data split) |
| 15 | Reduce Operations | 5 | SUM, PROD, MAX, MIN on the same input |
| 16 | Alltoall | 5 | Every rank exchanges unique data with every other |
| 17 | Non-blocking | 6 | Isend/Irecv to overlap communication and computation |
| 18 | Parallel Sum | 6 | Full mini-app: scatter, compute, reduce, verify |
- Level 1 -- MPI basics (init, rank, size)
- Level 2 -- Point-to-point (Send / Recv)
- Level 3 -- Collectives: one-to-all / all-to-one (Bcast, Scatter, Gather, Reduce)
- Level 4 -- Collectives: all-to-all and synchronization (Allreduce, Allgather, Barrier)
- Level 5 -- Practical and advanced collectives (Scatter+Gather, Scatterv/Gatherv, Reduce ops, Alltoall)
- Level 6 -- Advanced (non-blocking, real app)
Quick reference -- MPI operations:
| Category | Operation | Who Sends | Who Receives | Data Movement |
|---|---|---|---|---|
| Point-to-point | Send/Recv | One rank | One rank | Direct message between two ranks |
| Distribution | Bcast | Root | All ranks | Identical copy → everyone |
| | Scatter | Root | All ranks | Split array → one chunk each |
| Collection | Gather | All ranks | Root | One chunk each → combined array |
| | Allgather | All ranks | All ranks | One chunk each → everyone gets all |
| Computation | Reduce | All ranks | Root | Combine with operation → root only |
| | Allreduce | All ranks | All ranks | Combine with operation → everyone |
| Shuffling | Alltoall | All ranks | All ranks | Personalized exchange (N x N) |
| Synchronization | Barrier | -- | -- | Block until all ranks arrive |
01 - Hello World (01_hello_mpi.c)
The absolute minimum MPI program. The same binary runs on all four nodes. Each process initializes MPI, discovers its own rank (unique ID) and the total number of processes, then prints a greeting. No data is exchanged between processes. This just proves MPI is working across all four nodes.
Sample output:
Hello from Rank 0 of 4 on Node-A
Hello from Rank 1 of 4 on Node-B
Hello from Rank 2 of 4 on Node-C
Hello from Rank 3 of 4 on Node-D
Ranks are assigned based on the order in the hostfile (top-to-bottom). The first entry gets rank 0, which typically acts as the "root" -- the coordinator that distributes work and collects results. The code never refers to a physical node by name; it only knows its rank. This separation of logical role (rank) from physical location (node) is a core MPI design principle.
02 - Send / Receive (02_send_recv.c)
Rank 0 sends a unique integer to every other rank using MPI_Send. Each receiver calls MPI_Recv to collect its value. This is the most fundamental communication primitive in MPI.
MPI_Send(&value, 1, MPI_INT, dest, tag, MPI_COMM_WORLD);
│ │ │ │ │ │
│ │ │ │ │ └─ communicator
│ │ │ │ └────── tag (message label, receiver must match it)
│ │ │ └──────────── destination rank (who to send to)
│ │ └───────────────────── data type
│ └──────────────────────── count (number of elements)
└──────────────────────────────── buffer (data to send)
MPI_Recv(&value, 1, MPI_INT, source, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
│ │ │ │ │ │ │
│ │ │ │ │ │ └─ status (info about received msg)
│ │ │ │ │ └────────────────── communicator
│ │ │ │ └─────────────────────── tag (must match sender's tag)
│ │ │ └─────────────────────────────── source rank (who to receive from)
│ │ └──────────────────────────────────────── data type
│ └─────────────────────────────────────────── count (max elements to receive)
└──────────────────────────────────────────────────── buffer (where to store received data)
Sample output:
[Rank 0] Sent value 100 to Rank 1
[Rank 0] Sent value 200 to Rank 2
[Rank 0] Sent value 300 to Rank 3
[Rank 1] Received value 100 from Rank 0
[Rank 2] Received value 200 from Rank 0
[Rank 3] Received value 300 from Rank 0
03 - Send / Receive with Tags (03_send_recv_tags.c)
Tags let a receiver pick which message to accept when multiple messages arrive from the same sender. Think of tags as subject lines on emails -- you can choose to open the "urgent" one before the "newsletter".
Here Rank 0 sends two messages to each rank: a "data" message (tag 1) and a "control" message (tag 2). Each receiver deliberately receives tag 2 first, proving that tags control which message is picked up -- not arrival order.
Sample output:
[Rank 0] Sent data=42 (tag 1) and control=99 (tag 2) to Rank 1
[Rank 0] Sent data=42 (tag 1) and control=99 (tag 2) to Rank 2
[Rank 0] Sent data=42 (tag 1) and control=99 (tag 2) to Rank 3
[Rank 1] Received control=99 (tag 2) FIRST, then data=42 (tag 1)
[Rank 2] Received control=99 (tag 2) FIRST, then data=42 (tag 1)
[Rank 3] Received control=99 (tag 2) FIRST, then data=42 (tag 1)
04 - Ping-Pong (04_ping_pong.c)
Rank 0 sends a value to Rank 1, which increments it and sends it back. This repeats for several rounds, demonstrating two-way communication and giving a feel for message latency.
Sample output:
[Round 0] Rank 0 sent 0, Rank 1 returned 1
[Round 1] Rank 0 sent 1, Rank 1 returned 2
[Round 2] Rank 0 sent 2, Rank 1 returned 3
...
05 - Ring Communication (05_ring.c)
Each rank passes a token to the next rank in a ring (0 → 1 → 2 → 3 → 0). Every rank adds its rank number to the token before forwarding. When the token returns to Rank 0 it contains the sum of all ranks.
Sample output:
[Rank 0] Starting token = 0
[Rank 1] Received 0, added 1, forwarding 1
[Rank 2] Received 1, added 2, forwarding 3
[Rank 3] Received 3, added 3, forwarding 6
[Rank 0] Token completed the ring with value 6
06 - Broadcast (06_broadcast.c)
Root sends ONE value to ALL ranks. After MPI_Bcast, every rank has the same copy.
All ranks call the same function: MPI_Bcast(). MPI uses the root parameter (0) to decide who sends and who receives -- the root sends, everyone else receives. This is how all collective operations work: one function, called by every rank, with MPI handling the roles internally.
MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);
│ │ │ │ │
│ │ │ │ └─ communicator (all ranks in the group)
│ │ │ └──── root = rank 0 (THIS is who sends)
│ │ └───────────── data type
│ └──────────────── count (1 element)
└──────────────────────── buffer (read on root, written on others)
Sample output:
[Rank 0] Before broadcast: value = 42 ← only root has the value
[Rank 1] Before broadcast: value = -1
[Rank 2] Before broadcast: value = -1
[Rank 3] Before broadcast: value = -1
[Rank 0] After broadcast: value = 42 ← now everyone has it
[Rank 1] After broadcast: value = 42
[Rank 2] After broadcast: value = 42
[Rank 3] After broadcast: value = 42
07 - Scatter (07_scatter.c)
Root splits an array and sends one element to each rank.
MPI_Scatter(sendbuf, 1, MPI_INT, &recvval, 1, MPI_INT, 0, MPI_COMM_WORLD);
│ │ │ │ │ │ │ │
│ │ │ │ │ │ │ └─ communicator
│ │ │ │ │ │ └──── root = rank 0 (who has the data)
│ │ │ │ │ └───────────── receive type
│ │ │ │ └──────────────── receive count (1 element per rank)
│ │ │ └────────────────────────── receive buffer (where each rank stores its piece)
│ │ └──────────────────────────────────── send type
│ └─────────────────────────────────────── send count (1 element PER RANK, not total)
└──────────────────────────────────────────────── send buffer (only matters on root)
Sample output:
[Rank 0] Data to scatter: 10 20 30 40
[Rank 0] Received 10
[Rank 1] Received 20
[Rank 2] Received 30
[Rank 3] Received 40
08 - Gather (08_gather.c)
The reverse of Scatter. Each rank sends one value; root collects them all into a single array.
MPI_Gather(&local_val, 1, MPI_INT, recvbuf, 1, MPI_INT, 0, MPI_COMM_WORLD);
│ │ │ │ │ │ │ │
│ │ │ │ │ │ │ └─ communicator
│ │ │ │ │ │ └──── root = rank 0 (who collects)
│ │ │ │ │ └───────────── receive type
│ │ │ │ └──────────────── receive count (1 element per rank)
│ │ │ └───────────────────────── receive buffer (only matters on root)
│ │ └────────────────────────────────── send type
│ └───────────────────────────────────── send count (1 element from each rank)
└───────────────────────────────────────────────── send buffer (each rank's contribution)
Sample output:
[Rank 0] Sending value 100 to root
[Rank 1] Sending value 101 to root
[Rank 2] Sending value 102 to root
[Rank 3] Sending value 103 to root
[Rank 0] Gathered: 100 101 102 103
09 - Reduce (09_reduce.c)
Each rank contributes a value. MPI combines them with an operation (SUM) and delivers the result to root ONLY. Other ranks do not receive the answer.
MPI_Reduce(&local_val, &global_sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
│ │ │ │ │ │ │
│ │ │ │ │ │ └─ communicator
│ │ │ │ │ └──── root = rank 0 (who gets the result)
│ │ │ │ └────────────── operation (SUM, PROD, MAX, MIN, ...)
│ │ │ └─────────────────────── data type
│ │ └────────────────────────── count (number of elements)
│ └─────────────────────────────────────── result buffer (only meaningful on root)
└─────────────────────────────────────────────────── input buffer (each rank's contribution)
Sample output:
[Rank 0] Local value = 10
[Rank 1] Local value = 20
[Rank 2] Local value = 30
[Rank 3] Local value = 40
[Rank 0] Global sum = 100
[Rank 1] Does NOT know the global sum
No single node does all the computation. MPI uses a tree internally so multiple nodes add in parallel:
Step 1 (parallel): Rank 0 (10) + Rank 1 (20) = 30
Rank 2 (30) + Rank 3 (40) = 70
Step 2: 30 + 70 = 100 → delivered to root
With 1024 ranks this takes only 10 steps (log2) instead of 1023 sequential additions. In real applications, the heavy computation happens before Reduce. Each rank processes its local data in parallel, then Reduce just combines the small results at the end.
10 - Allreduce (10_allreduce.c)
Like Reduce, but the result is delivered to ALL ranks. Equivalent to Reduce + Broadcast but faster internally.
MPI_Allreduce(&local_val, &global_sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
│ │ │ │ │ │
│ │ │ │ │ └─ communicator
│ │ │ │ └─────────── operation
│ │ │ └──────────────────── data type
│ │ └─────────────────────── count
│ └──────────────────────────────────── result buffer (EVERY rank gets the result)
└──────────────────────────────────────────────── input buffer (each rank's contribution)
No root parameter -- because there is no root. Everyone contributes and everyone receives.
Sample output:
[Rank 0] Global sum = 100 (everyone knows!)
[Rank 1] Global sum = 100 (everyone knows!)
[Rank 2] Global sum = 100 (everyone knows!)
[Rank 3] Global sum = 100 (everyone knows!)
11 - Allgather (11_allgather.c)
Each rank contributes one value. After Allgather every rank holds the complete array.
Gather gives the array to root only; Allgather gives it to everyone.
MPI_Allgather(&local_val, 1, MPI_INT, all_vals, 1, MPI_INT, MPI_COMM_WORLD);
│ │ │ │ │ │ │
│ │ │ │ │ │ └─ communicator
│ │ │ │ │ └─────────── receive type
│ │ │ │ └──────────────── receive count (per rank)
│ │ │ └─────────────────────────── receive buffer (EVERY rank gets the full array)
│ │ └──────────────────────────────────── send type
│ └─────────────────────────────────────── send count
└─────────────────────────────────────────────────── send buffer (each rank's contribution)
No root parameter -- every rank both sends and receives.
Sample output:
[Rank 0] Allgather result: 10 20 30 40
[Rank 1] Allgather result: 10 20 30 40
[Rank 2] Allgather result: 10 20 30 40
[Rank 3] Allgather result: 10 20 30 40
12 - Barrier (12_barrier.c)
A pure synchronization fence. No data is moved. Every rank blocks until ALL ranks have reached the barrier. Useful when you need to ensure all ranks have finished a phase before the next one starts. This is the simplest collective -- no buffers, no counts, no root. Just "wait for everyone."
Sample output:
[Rank 0] Doing work (sleeping 1 seconds)...
[Rank 0] Reached barrier
...
[Rank 3] Reached barrier
[Rank 0] Passed barrier -- all ranks are synchronized
[Rank 1] Passed barrier -- all ranks are synchronized
...
13 - Scatter + Compute + Gather (13_scatter_gather.c)
The classic parallel-computing pattern:
- Root scatters chunks of work to all ranks.
- Each rank processes its chunk locally (squares the value).
- Root gathers the results back.
Sample output:
[Root] Original data: 2 4 6 8
[Rank 0] Received 2, computing square = 4
[Rank 1] Received 4, computing square = 16
[Rank 2] Received 6, computing square = 36
[Rank 3] Received 8, computing square = 64
[Root] Gathered results: 4 16 36 64
14 - Scatterv / Gatherv -- Variable-length Chunks (14_scatterv_gatherv.c)
Real data rarely divides evenly. Scatterv/Gatherv let root send a different number of elements to each rank. Here 10 elements are split 3-3-2-2 across 4 ranks; each rank doubles its chunk before Gatherv collects the results.
Sample output:
[Root] Full array: 1 2 3 4 5 6 7 8 9 10
[Rank 0] Received 3 elements: 1 2 3
[Rank 1] Received 3 elements: 4 5 6
[Rank 2] Received 2 elements: 7 8
[Rank 3] Received 2 elements: 9 10
[Root] Gathered (doubled): 2 4 6 8 10 12 14 16 18 20
15 - Reduce with Different Operations (15_reduce_ops.c)
Runs four reductions on the same input to demonstrate built-in operations: SUM, PROD, MAX, and MIN.
Sample output:
[Root] SUM = 16
[Root] PROD = 105
[Root] MAX = 7
[Root] MIN = 1
16 - Alltoall -- Complete Exchange (16_alltoall.c)
Every rank sends a different piece of data to every other rank.
MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);
│ │ │ │ │ │ │
│ │ │ │ │ │ └─ communicator
│ │ │ │ │ └─────────── receive type
│ │ │ │ └──────────────── receive count (elements FROM each rank)
│ │ │ └───────────────────────── receive buffer (collects one piece from everyone)
│ │ └────────────────────────────────── send type
│ └───────────────────────────────────── send count (elements TO each rank)
└────────────────────────────────────────────── send buffer (element i goes to rank i)
No root -- every rank sends to every rank. Fully symmetric.
Sample output:
[Rank 0] Sending: 0 1 2 3 → Received: 0 4 8 12
[Rank 1] Sending: 4 5 6 7 → Received: 1 5 9 13
[Rank 2] Sending: 8 9 10 11 → Received: 2 6 10 14
[Rank 3] Sending: 12 13 14 15 → Received: 3 7 11 15
17 - Non-blocking Communication (17_nonblocking.c)
Blocking calls pause the caller until the message transfer completes. Non-blocking calls (MPI_Isend/MPI_Irecv) return immediately and give you a request handle. You call MPI_Wait when you actually need the data. This lets you overlap computation with communication.
MPI_Isend(&data, 1, MPI_INT, dest, tag, MPI_COMM_WORLD, &request);
│ │ │ │ │ │ │
│ │ │ │ │ │ └─ request handle (check/wait on later)
│ │ │ │ │ └────────────────── communicator
│ │ │ │ └─────────────────────── tag
│ │ │ └───────────────────────────── destination rank
│ │ └────────────────────────────────────── data type
│ └───────────────────────────────────────── count
└──────────────────────────────────────────────── buffer (do NOT modify until Wait completes)
MPI_Irecv(&data, 1, MPI_INT, source, tag, MPI_COMM_WORLD, &request);
(same layout -- returns immediately, fills buffer later)
MPI_Wait(&request, MPI_STATUS_IGNORE);
│ │
│ └─ status (info about the completed message)
└─────────── request handle (blocks until this operation finishes)
Sample output:
[Rank 0] All sends posted (non-blocking), doing other work...
[Rank 1] Recv posted (non-blocking), doing other work...
[Rank 2] Recv posted (non-blocking), doing other work...
[Rank 3] Recv posted (non-blocking), doing other work...
[Rank 0] All sends completed
[Rank 1] Received 111
[Rank 2] Received 222
[Rank 3] Received 333
18 - Parallel Array Summation (18_parallel_sum.c)
A mini-application that ties everything together. Root creates an array of 1,000,000 integers, scatters it, each rank computes a partial sum, Reduce combines the results, and root verifies and prints timing.
Sample output:
Sum of 1..1000000 = 500000500000 (expected 500000500000) PASS
Elapsed: 0.003247 seconds with 4 processes