mprdc

a simple mapreduce implementation in go, built for learning.

based on the google mapreduce paper by jeffrey dean and sanjay ghemawat (2004).

what is mapreduce?

mapreduce is a programming model for processing large datasets in parallel across a cluster. the idea is simple:

map: transform input data into key-value pairs
shuffle: group all values by key
reduce: combine values for each key into a final result

flowchart LR
    subgraph map phase
        I1[input 1] --> M1[map]
        I2[input 2] --> M2[map]
        I3[input 3] --> M3[map]
    end

    subgraph shuffle
        M1 --> S{partition<br/>by key}
        M2 --> S
        M3 --> S
    end

    subgraph reduce phase
        S --> R1[reduce 0]
        S --> R2[reduce 1]
        S --> R3[reduce 2]
    end

    R1 --> O1[output 0]
    R2 --> O2[output 1]
    R3 --> O3[output 2]

architecture

the system uses a master-worker pattern:

flowchart TB
    subgraph master
        TQ[task queue]
        TS[task state tracking]
        TC[timeout checker]
    end

    subgraph workers
        W1[worker 1]
        W2[worker 2]
        W3[worker 3]
        W4[worker 4]
    end

    W1 <-->|assign task<br/>report done| master
    W2 <-->|assign task<br/>report done| master
    W3 <-->|assign task<br/>report done| master
    W4 <-->|assign task<br/>report done| master

master: holds all tasks, assigns them to workers, tracks state
workers: request tasks, execute map/reduce, report completion

task lifecycle

each task goes through these states:

stateDiagram-v2
    [*] --> idle
    idle --> in_progress: assigned to worker
    in_progress --> completed: worker reports done
    in_progress --> idle: timeout or failure
    completed --> [*]

data flow (word count example)

flowchart TB
    subgraph input files
        F1["doc1.txt<br/>'the quick brown fox'"]
        F2["doc2.txt<br/>'the fox is quick'"]
    end

    subgraph map phase
        F1 --> M1["map task 0"]
        F2 --> M2["map task 1"]
        M1 --> KV1["(the,1) (quick,1)<br/>(brown,1) (fox,1)"]
        M2 --> KV2["(the,1) (fox,1)<br/>(is,1) (quick,1)"]
    end

    subgraph intermediate files
        KV1 --> |"hash(key) % R"| IF
        KV2 --> |"hash(key) % R"| IF
        IF["mr-0-0, mr-0-1, mr-0-2<br/>mr-1-0, mr-1-1, mr-1-2"]
    end

    subgraph reduce phase
        IF --> R0["reduce 0<br/>reads mr-*-0"]
        IF --> R1["reduce 1<br/>reads mr-*-1"]
        IF --> R2["reduce 2<br/>reads mr-*-2"]
    end

    subgraph output
        R0 --> O0["mr-out-0<br/>the 2"]
        R1 --> O1["mr-out-1<br/>fox 2, quick 2"]
        R2 --> O2["mr-out-2<br/>brown 1, is 1"]
    end

intermediate file naming

the key insight is how intermediate files are organized:

         reduce partitions
              0   1   2
           ┌───┬───┬───┐
map task 0 │0-0│0-1│0-2│  ← map writes ROWS
map task 1 │1-0│1-1│1-2│
map task 2 │2-0│2-1│2-2│
           └───┴───┴───┘
             ↓   ↓   ↓
           reduce reads COLUMNS

map task m writes to mr-m-0, mr-m-1, mr-m-2 (one per partition)
reduce task r reads mr-0-r, mr-1-r, mr-2-r (from all map tasks)

features implemented

from the paper

feature	status	description
map/reduce	✅	core paradigm
partitioning	✅	`hash(key) % nReduce`
master-worker	✅	coordinator pattern
task timeouts	✅	detect crashed workers
task retry	✅	reschedule failed tasks
file splitting	✅	large files → multiple map tasks

additional

feature	description
exponential backoff	failed tasks wait before retry (1s → 2s → 4s...)
jitter	randomized delays to prevent thundering herd
line alignment	chunks don't split words/lines

fault tolerance

sequenceDiagram
    participant M as master
    participant W1 as worker 1
    participant W2 as worker 2

    M->>W1: assign task 0
    Note over W1: starts working...
    Note over W1: crashes!

    loop timeout checker
        M->>M: task 0 exceeded timeout
        M->>M: reset to idle
    end

    W2->>M: request task
    M->>W2: assign task 0
    W2->>M: report done
    M->>M: mark completed

usage

// define your map function
mapF := func(content string) []KeyValue[string, string] {
    words := strings.Fields(content)
    kvs := make([]KeyValue[string, string], 0, len(words))
    for _, word := range words {
        kvs = append(kvs, KeyValue[string, string]{
            Key:   strings.ToLower(word),
            Value: "1",
        })
    }
    return kvs
}

// define your reduce function
reduceF := func(key string, values []string) string {
    return strconv.Itoa(len(values))
}

// run mapreduce
files := []string{"data/doc1.txt", "data/doc2.txt"}
results := RunMapReduce(files, 3, 4, mapF, reduceF)
//                      ↑      ↑  ↑
//                      │      │  └── 4 workers
//                      │      └───── 3 reduce partitions
//                      └──────────── input files

run

go run .

output:

Word counts:
  the: 5
  fox: 2
  quick: 2
  ...

project structure

mprdc/
├── main.go      # orchestration, word count example
├── master.go    # task assignment, timeout detection, backoff
├── worker.go    # map/reduce execution, file i/o
├── types.go     # Task, KeyValue, MapFunc, ReduceFunc
└── README.md

references

mapreduce: simplified data processing on large clusters - the original paper
mit 6.824 - distributed systems course with mapreduce lab

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

mprdc

what is mapreduce?

architecture

task lifecycle

data flow (word count example)

intermediate file naming

features implemented

from the paper

additional

fault tolerance

usage

run

project structure

references

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
README.md		README.md
go.mod		go.mod
main.go		main.go
master.go		master.go
types.go		types.go
worker.go		worker.go

Folders and files

Latest commit

History

Repository files navigation

mprdc

what is mapreduce?

architecture

task lifecycle

data flow (word count example)

intermediate file naming

features implemented

from the paper

additional

fault tolerance

usage

run

project structure

references

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages