This project demonstrates how to run MPI (Message Passing Interface) programs across multiple nodes using a GNS3 network simulation. Four Docker containers act as compute nodes connected through an Ethernet switch, providing a controlled environment to observe real MPI communication over an emulated network layer.
- MPI Fundamentals -- What MPI is, the communication gap in HPC clusters, major implementations (OpenMPI, MPICH), core API functions, and network transport.
- Collective Operations -- Distribution (Broadcast, Scatter), Collection (Gather, Allgather), Computation (Reduce, Allreduce, Reduce-Scatter, Scan), Shuffling (All-to-All), and Synchronization (Barrier) explained with analogies and diagrams.
- Launching and Orchestrating MPI Jobs -- Launcher vs Worker roles, the launch sequence, the PRRTE/hwloc/PMIx stack, and Slurm integration.
- Communication Topologies -- How MPI internally implements collectives using tree, ring, recursive-halving, and other algorithms.
We use four Docker containers running Ubuntu, all connected to a single Ethernet switch.
| Node | IP Address |
|---|---|
| Node-A | 10.10.10.100 |
| Node-B | 10.10.10.101 |
| Node-C | 10.10.10.102 |
| Node-D | 10.10.10.103 |
The file "/etc/network/interfaces" inside each container defines how interfaces are brought up:
# eth0 gets IP via DHCP
auto eth0
iface eth0 inet dhcp
The management host acts as a DHCP server, handling IPv4 assignment for all nodes and providing the external connectivity needed to download packages. Refer to this guide to configure the management host.
Ping all other nodes from Node-A to make sure you have full connectivity.
Node-A (the launcher) needs passwordless SSH access to every other node so it can start processes without a password prompt.
Run this on Node-A:
# Generate an SSH key (press Enter for all prompts to leave no passphrase)
ssh-keygen -t rsa
# Copy the key to each node (password is gns3)
ssh-copy-id root@10.10.10.101
ssh-copy-id root@10.10.10.102
ssh-copy-id root@10.10.10.103
Verify by running ssh root@10.10.10.101, ssh root@10.10.10.102, and ssh root@10.10.10.103. If you log in without a password each time, it worked.
Install the MPI packages on all four machines so every node has the same compiler wrapper and runtime libraries.
Run this on Node-A, Node-B, Node-C, and Node-D:
apt update
apt install openmpi-bin libopenmpi-dev -y
Node-A needs to know which hosts are in the cluster and how many processes (slots) each node can run.
On Node-A, create a text file:
nano my_hosts.txt
Add the IP addresses:
localhost slots=1
10.10.10.101 slots=1
10.10.10.102 slots=1
10.10.10.103 slots=1
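If you prefer to script the setup, the same file can be generated in one shot (a sketch assuming the IPs listed above):

```shell
# Write my_hosts.txt with one slot per node (same content as above)
{
  echo "localhost slots=1"
  for i in 101 102 103; do
    echo "10.10.10.$i slots=1"
  done
} > my_hosts.txt
```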
For every example, the workflow is the same:
# Compile on Node-A
mpicc codes/XX_example.c -o XX_example
# Copy the binary to the same path on all other nodes
scp XX_example root@10.10.10.101:/root/
scp XX_example root@10.10.10.102:/root/
scp XX_example root@10.10.10.103:/root/
# Run with 4 processes across the cluster
mpirun --allow-run-as-root --hostfile my_hosts.txt -np 4 ./XX_example
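Since this compile-copy-run cycle repeats for all 18 examples, it can be wrapped in a small helper. The function name `deploy` and the `RUN` dry-run variable are illustrative, not part of the project; the paths and IPs match the setup above:

```shell
# Hypothetical helper: copy a binary to the worker nodes, then launch it.
# Set RUN=echo for a dry run that only prints the commands.
deploy() {
    bin=${1:?usage: deploy <binary>}
    for n in 10.10.10.101 10.10.10.102 10.10.10.103; do
        ${RUN:-} scp "$bin" "root@$n:/root/"
    done
    ${RUN:-} mpirun --allow-run-as-root --hostfile my_hosts.txt -np 4 "./$bin"
}
```

Usage: `deploy 01_hello_mpi` after compiling on Node-A.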
All source files live in the codes/ folder. They are numbered in the order you should work through them.
| # | Example | Level | Description |
|---|---|---|---|
| 01 | Hello World | 1 | Init MPI, print rank and hostname |
| 02 | Send / Receive | 2 | Root sends a value to each rank (point-to-point) |
| 03 | Send / Receive with Tags | 2 | Use tags to distinguish multiple messages |
| 04 | Ping-Pong | 2 | Two ranks bounce a value back and forth |
| 05 | Ring | 2 | Pass a token around a ring of ranks |
| 06 | Broadcast | 3 | Root sends one value to all ranks |
| 07 | Scatter | 3 | Root deals array elements to ranks |
| 08 | Gather | 3 | Ranks send values; root collects into array |
| 09 | Reduce | 3 | Combine values with SUM; result at root only |
| 10 | Allreduce | 4 | Combine values with SUM; result at every rank |
| 11 | Allgather | 4 | Each rank contributes a value; everyone gets all |
| 12 | Barrier | 4 | Synchronize all ranks (no data moved) |
| 13 | Scatter + Gather | 5 | Distribute work, compute locally, collect results |
| 14 | Scatterv / Gatherv | 5 | Variable-length chunks (uneven data split) |
| 15 | Reduce Operations | 5 | SUM, PROD, MAX, MIN on the same input |
| 16 | Alltoall | 5 | Every rank exchanges unique data with every other |
| 17 | Non-blocking | 6 | Isend/Irecv to overlap communication and computation |
| 18 | Parallel Sum | 6 | Full mini-app: scatter, compute, reduce, verify |
- Level 1 -- MPI basics (init, rank, size)
- Level 2 -- Point-to-point (Send / Recv)
- Level 3 -- Collectives: one-to-all / all-to-one (Bcast, Scatter, Gather, Reduce)
- Level 4 -- Collectives: all-to-all and synchronization (Allreduce, Allgather, Barrier)
- Level 5 -- Practical and advanced collectives (Scatter+Gather, Scatterv/Gatherv, Reduce ops, Alltoall)
- Level 6 -- Advanced (non-blocking, real app)
Quick reference -- MPI operations:
| Category | Operation | Who Sends | Who Receives | Data Movement |
|---|---|---|---|---|
| Point-to-point | Send/Recv | One rank | One rank | Direct message between two ranks |
| Distribution | Bcast | Root | All ranks | Identical copy → everyone |
| | Scatter | Root | All ranks | Split array → one chunk each |
| Collection | Gather | All ranks | Root | One chunk each → combined array |
| | Allgather | All ranks | All ranks | One chunk each → everyone gets all |
| Computation | Reduce | All ranks | Root | Combine with operation → root only |
| | Allreduce | All ranks | All ranks | Combine with operation → everyone |
| Shuffling | Alltoall | All ranks | All ranks | Personalized exchange (N x N) |
| Synchronization | Barrier | -- | -- | Block until all ranks arrive |
01 - Hello World (01_hello_mpi.c)
The absolute minimum MPI program. The same binary runs on all four nodes. Each process initializes MPI, discovers its own rank (unique ID) and the total number of processes, then prints a greeting. No data is exchanged between processes. This just proves MPI is working across all four nodes.
Sample output:
Hello from Rank 0 of 4 on Node-A
Hello from Rank 1 of 4 on Node-B
Hello from Rank 2 of 4 on Node-C
Hello from Rank 3 of 4 on Node-D
Ranks are assigned based on the order in the hostfile (top-to-bottom). The first entry gets rank 0, which typically acts as the "root" -- the coordinator that distributes work and collects results. The code never refers to a physical node by name; it only knows its rank. This separation of logical role (rank) from physical location (node) is a core MPI design principle.
02 - Send / Receive (02_send_recv.c)
Rank 0 sends a unique integer to every other rank using MPI_Send. Each receiver calls MPI_Recv to collect its value. This is the most fundamental communication primitive in MPI.
MPI_Send(&value, 1, MPI_INT, dest, tag, MPI_COMM_WORLD);
│ │ │ │ │ │
│ │ │ │ │ └─ communicator
│ │ │ │ └────── tag (message label, receiver must match it)
│ │ │ └──────────── destination rank (who to send to)
│ │ └───────────────────── data type
│ └──────────────────────── count (number of elements)
└──────────────────────────────── buffer (data to send)
MPI_Recv(&value, 1, MPI_INT, source, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
│ │ │ │ │ │ │
│ │ │ │ │ │ └─ status (info about received msg)
│ │ │ │ │ └────────────────── communicator
│ │ │ │ └─────────────────────── tag (must match sender's tag)
│ │ │ └─────────────────────────────── source rank (who to receive from)
│ │ └──────────────────────────────────────── data type
│ └─────────────────────────────────────────── count (max elements to receive)
└──────────────────────────────────────────────────── buffer (where to store received data)
Sample output:
[Rank 0] Sent value 100 to Rank 1
[Rank 0] Sent value 200 to Rank 2
[Rank 0] Sent value 300 to Rank 3
[Rank 1] Received value 100 from Rank 0
[Rank 2] Received value 200 from Rank 0
[Rank 3] Received value 300 from Rank 0
03 - Send / Receive with Tags (03_send_recv_tags.c)
Tags let a receiver pick which message to accept when multiple messages arrive from the same sender. Think of tags as subject lines on emails -- you can choose to open the "urgent" one before the "newsletter".
Here Rank 0 sends two messages to each rank: a "data" message (tag 1) and a "control" message (tag 2). Each receiver deliberately receives tag 2 first, proving that tags control which message is picked up -- not arrival order.
Sample output:
[Rank 0] Sent data=42 (tag 1) and control=99 (tag 2) to Rank 1
[Rank 0] Sent data=42 (tag 1) and control=99 (tag 2) to Rank 2
[Rank 0] Sent data=42 (tag 1) and control=99 (tag 2) to Rank 3
[Rank 1] Received control=99 (tag 2) FIRST, then data=42 (tag 1)
[Rank 2] Received control=99 (tag 2) FIRST, then data=42 (tag 1)
[Rank 3] Received control=99 (tag 2) FIRST, then data=42 (tag 1)
04 - Ping-Pong (04_ping_pong.c)
Rank 0 sends a value to Rank 1, which increments it and sends it back. This repeats for several rounds, demonstrating two-way communication and giving a feel for message latency.
Sample output:
[Round 0] Rank 0 sent 0, Rank 1 returned 1
[Round 1] Rank 0 sent 1, Rank 1 returned 2
[Round 2] Rank 0 sent 2, Rank 1 returned 3
...
05 - Ring Communication (05_ring.c)
Each rank passes a token to the next rank in a ring (0 → 1 → 2 → 3 → 0). Every rank adds its rank number to the token before forwarding. When the token returns to Rank 0 it contains the sum of all ranks.
Sample output:
[Rank 0] Starting token = 0
[Rank 1] Received 0, added 1, forwarding 1
[Rank 2] Received 1, added 2, forwarding 3
[Rank 3] Received 3, added 3, forwarding 6
[Rank 0] Token completed the ring with value 6
06 - Broadcast (06_broadcast.c)
Root sends ONE value to ALL ranks. After MPI_Bcast, every rank has the same copy.
All ranks call the same function: MPI_Bcast(). MPI uses the root parameter (0) to decide who sends and who receives -- the root sends, everyone else receives. This is how all collective operations work: one function, called by every rank, with MPI handling the roles internally.
MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);
│ │ │ │ │
│ │ │ │ └─ communicator (all ranks in the group)
│ │ │ └──── root = rank 0 (THIS is who sends)
│ │ └───────────── data type
│ └──────────────── count (1 element)
└──────────────────────── buffer (read on root, written on others)
Sample output:
[Rank 0] Before broadcast: value = 42 ← only root has the value
[Rank 1] Before broadcast: value = -1
[Rank 2] Before broadcast: value = -1
[Rank 3] Before broadcast: value = -1
[Rank 0] After broadcast: value = 42 ← now everyone has it
[Rank 1] After broadcast: value = 42
[Rank 2] After broadcast: value = 42
[Rank 3] After broadcast: value = 42
07 - Scatter (07_scatter.c)
Root splits an array and sends one element to each rank.
MPI_Scatter(sendbuf, 1, MPI_INT, &recvval, 1, MPI_INT, 0, MPI_COMM_WORLD);
│ │ │ │ │ │ │ │
│ │ │ │ │ │ │ └─ communicator
│ │ │ │ │ │ └──── root = rank 0 (who has the data)
│ │ │ │ │ └───────────── receive type
│ │ │ │ └──────────────── receive count (1 element per rank)
│ │ │ └────────────────────────── receive buffer (where each rank stores its piece)
│ │ └──────────────────────────────────── send type
│ └─────────────────────────────────────── send count (1 element PER RANK, not total)
└──────────────────────────────────────────────── send buffer (only matters on root)
Sample output:
[Rank 0] Data to scatter: 10 20 30 40
[Rank 0] Received 10
[Rank 1] Received 20
[Rank 2] Received 30
[Rank 3] Received 40
08 - Gather (08_gather.c)
The reverse of Scatter. Each rank sends one value; root collects them all into a single array.
MPI_Gather(&local_val, 1, MPI_INT, recvbuf, 1, MPI_INT, 0, MPI_COMM_WORLD);
│ │ │ │ │ │ │ │
│ │ │ │ │ │ │ └─ communicator
│ │ │ │ │ │ └──── root = rank 0 (who collects)
│ │ │ │ │ └───────────── receive type
│ │ │ │ └──────────────── receive count (1 element per rank)
│ │ │ └───────────────────────── receive buffer (only matters on root)
│ │ └────────────────────────────────── send type
│ └───────────────────────────────────── send count (1 element from each rank)
└───────────────────────────────────────────────── send buffer (each rank's contribution)
Sample output:
[Rank 0] Sending value 100 to root
[Rank 1] Sending value 101 to root
[Rank 2] Sending value 102 to root
[Rank 3] Sending value 103 to root
[Rank 0] Gathered: 100 101 102 103
09 - Reduce (09_reduce.c)
Each rank contributes a value. MPI combines them with an operation (SUM) and delivers the result to root ONLY. Other ranks do not receive the answer.
MPI_Reduce(&local_val, &global_sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
│ │ │ │ │ │ │
│ │ │ │ │ │ └─ communicator
│ │ │ │ │ └──── root = rank 0 (who gets the result)
│ │ │ │ └────────────── operation (SUM, PROD, MAX, MIN, ...)
│ │ │ └─────────────────────── data type
│ │ └────────────────────────── count (number of elements)
│ └─────────────────────────────────────── result buffer (only meaningful on root)
└─────────────────────────────────────────────────── input buffer (each rank's contribution)
Sample output:
[Rank 0] Local value = 10
[Rank 1] Local value = 20
[Rank 2] Local value = 30
[Rank 3] Local value = 40
[Rank 0] Global sum = 100
[Rank 1] Does NOT know the global sum
No single node does all the computation. MPI uses a tree internally so multiple nodes add in parallel:
Step 1 (parallel): Rank 0 (10) + Rank 1 (20) = 30
Rank 2 (30) + Rank 3 (40) = 70
Step 2: 30 + 70 = 100 → delivered to root
With 1024 ranks this takes only 10 steps (log2) instead of 1023 sequential additions. In real applications, the heavy computation happens before Reduce. Each rank processes its local data in parallel, then Reduce just combines the small results at the end.
10 - Allreduce (10_allreduce.c)
Like Reduce, but the result is delivered to ALL ranks. Equivalent to Reduce + Broadcast but faster internally.
MPI_Allreduce(&local_val, &global_sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
│ │ │ │ │ │
│ │ │ │ │ └─ communicator
│ │ │ │ └─────────── operation
│ │ │ └──────────────────── data type
│ │ └─────────────────────── count
│ └──────────────────────────────────── result buffer (EVERY rank gets the result)
└──────────────────────────────────────────────── input buffer (each rank's contribution)
No root parameter -- because there is no root. Everyone contributes and everyone receives.
Sample output:
[Rank 0] Global sum = 100 (everyone knows!)
[Rank 1] Global sum = 100 (everyone knows!)
[Rank 2] Global sum = 100 (everyone knows!)
[Rank 3] Global sum = 100 (everyone knows!)
11 - Allgather (11_allgather.c)
Each rank contributes one value. After Allgather every rank holds the complete array.
Gather gives the array to root only; Allgather gives it to everyone.
MPI_Allgather(&local_val, 1, MPI_INT, all_vals, 1, MPI_INT, MPI_COMM_WORLD);
│ │ │ │ │ │ │
│ │ │ │ │ │ └─ communicator
│ │ │ │ │ └─────────── receive type
│ │ │ │ └──────────────── receive count (per rank)
│ │ │ └─────────────────────────── receive buffer (EVERY rank gets the full array)
│ │ └──────────────────────────────────── send type
│ └─────────────────────────────────────── send count
└─────────────────────────────────────────────────── send buffer (each rank's contribution)
No root parameter -- every rank both sends and receives.
Sample output:
[Rank 0] Allgather result: 10 20 30 40
[Rank 1] Allgather result: 10 20 30 40
[Rank 2] Allgather result: 10 20 30 40
[Rank 3] Allgather result: 10 20 30 40
12 - Barrier (12_barrier.c)
A pure synchronization fence. No data is moved. Every rank blocks until ALL ranks have reached the barrier. Useful when you need to ensure all ranks have finished a phase before the next one starts. This is the simplest collective -- no buffers, no counts, no root. Just "wait for everyone."
Sample output:
[Rank 0] Doing work (sleeping 1 seconds)...
[Rank 0] Reached barrier
...
[Rank 3] Reached barrier
[Rank 0] Passed barrier -- all ranks are synchronized
[Rank 1] Passed barrier -- all ranks are synchronized
...
13 - Scatter + Compute + Gather (13_scatter_gather.c)
The classic parallel-computing pattern:
- Root scatters chunks of work to all ranks.
- Each rank processes its chunk locally (squares the value).
- Root gathers the results back.
Sample output:
[Root] Original data: 2 4 6 8
[Rank 0] Received 2, computing square = 4
[Rank 1] Received 4, computing square = 16
[Rank 2] Received 6, computing square = 36
[Rank 3] Received 8, computing square = 64
[Root] Gathered results: 4 16 36 64
14 - Scatterv / Gatherv -- Variable-length Chunks (14_scatterv_gatherv.c)
Real data rarely divides evenly. Scatterv/Gatherv let root send a different number of elements to each rank. Here 10 elements are split 3-3-2-2 across 4 ranks; each rank doubles its chunk before Gatherv collects the results.
Sample output:
[Root] Full array: 1 2 3 4 5 6 7 8 9 10
[Rank 0] Received 3 elements: 1 2 3
[Rank 1] Received 3 elements: 4 5 6
[Rank 2] Received 2 elements: 7 8
[Rank 3] Received 2 elements: 9 10
[Root] Gathered (doubled): 2 4 6 8 10 12 14 16 18 20
15 - Reduce with Different Operations (15_reduce_ops.c)
Runs four reductions on the same input to demonstrate built-in operations: SUM, PROD, MAX, and MIN.
Sample output:
[Root] SUM = 16
[Root] PROD = 105
[Root] MAX = 7
[Root] MIN = 1
16 - Alltoall -- Complete Exchange (16_alltoall.c)
Every rank sends a different piece of data to every other rank.
MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);
│ │ │ │ │ │ │
│ │ │ │ │ │ └─ communicator
│ │ │ │ │ └─────────── receive type
│ │ │ │ └──────────────── receive count (elements FROM each rank)
│ │ │ └───────────────────────── receive buffer (collects one piece from everyone)
│ │ └────────────────────────────────── send type
│ └───────────────────────────────────── send count (elements TO each rank)
└────────────────────────────────────────────── send buffer (element i goes to rank i)
No root -- every rank sends to every rank. Fully symmetric.
Sample output:
[Rank 0] Sending: 0 1 2 3 → Received: 0 4 8 12
[Rank 1] Sending: 4 5 6 7 → Received: 1 5 9 13
[Rank 2] Sending: 8 9 10 11 → Received: 2 6 10 14
[Rank 3] Sending: 12 13 14 15 → Received: 3 7 11 15
17 - Non-blocking Communication (17_nonblocking.c)
Blocking calls pause the caller until the message transfer completes. Non-blocking calls (MPI_Isend/MPI_Irecv) return immediately and give you a request handle. You call MPI_Wait when you actually need the data. This lets you overlap computation with communication.
MPI_Isend(&data, 1, MPI_INT, dest, tag, MPI_COMM_WORLD, &request);
│ │ │ │ │ │ │
│ │ │ │ │ │ └─ request handle (check/wait on later)
│ │ │ │ │ └────────────────── communicator
│ │ │ │ └─────────────────────── tag
│ │ │ └───────────────────────────── destination rank
│ │ └────────────────────────────────────── data type
│ └───────────────────────────────────────── count
└──────────────────────────────────────────────── buffer (do NOT modify until Wait completes)
MPI_Irecv(&data, 1, MPI_INT, source, tag, MPI_COMM_WORLD, &request);
(same layout -- returns immediately, fills buffer later)
MPI_Wait(&request, MPI_STATUS_IGNORE);
│ │
│ └─ status (info about the completed message)
└─────────── request handle (blocks until this operation finishes)
Sample output:
[Rank 0] All sends posted (non-blocking), doing other work...
[Rank 1] Recv posted (non-blocking), doing other work...
[Rank 2] Recv posted (non-blocking), doing other work...
[Rank 3] Recv posted (non-blocking), doing other work...
[Rank 0] All sends completed
[Rank 1] Received 111
[Rank 2] Received 222
[Rank 3] Received 333
18 - Parallel Array Summation (18_parallel_sum.c)
A mini-application that ties everything together. Root creates an array of 1,000,000 integers, scatters it, each rank computes a partial sum, Reduce combines the results, and root verifies and prints timing.
Sample output:
Sum of 1..1000000 = 500000500000 (expected 500000500000) PASS
Elapsed: 0.003247 seconds with 4 processes