Skip to content

k-arindam/microgpt

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

8 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

🧠 MicroGPT β€” The Most Atomic GPT Implementation

🚧 This project is a work in progress. Features, optimizations, and additional language ports may be added over time.

A multi-language port of Andrej Karpathy's beautifully minimal GPT trainer. One file per language. Zero (or near-zero) external dependencies. The complete algorithm β€” everything else is just efficiency.


πŸ“– Overview

MicroGPT implements a complete GPT-2 style language model from scratch β€” including autograd, training, and inference β€” in a single file per language. It trains on a character-level dataset (names by default) and learns to generate novel, hallucinated names.

The project is a faithful port of the original Python implementation across eight languages:

Language File Dependencies Compile / Run
🐍 Python microgpt.py None (stdlib only) python microgpt.py
βš™οΈ C++ microgpt.cpp STL only g++ -std=c++17 -O2 -o microgpt microgpt.cpp && ./microgpt
πŸ¦€ Rust microgpt.rs std only rustc -O microgpt.rs -o microgpt && ./microgpt
🍎 Swift microgpt.swift Foundation only swiftc -O microgpt.swift -o microgpt && ./microgpt
🎯 Dart microgpt.dart dart:io + dart:math dart compile exe microgpt.dart -o microgpt && ./microgpt
🟣 Kotlin microgpt.kt kotlin.math + java.io kotlinc microgpt.kt -include-runtime -d microgpt.jar && java -jar microgpt.jar
🟑 JavaScript microgpt.js fs (Node.js built-in) node microgpt.js
πŸ”΅ Go microgpt.go stdlib only go run microgpt.go

Key principle: Every port preserves the exact same structure, variable naming conventions, and algorithmic flow as the original Python. Reading any version is reading the same algorithm.


πŸ—οΈ Architecture

The algorithm is structured into six cleanly separated sections, identical across all languages:

1. Data Loading & Tokenization

input.txt β†’ list of documents β†’ character-level tokenizer
  • Reads input.txt (a newline-separated list of names)
  • Extracts unique characters as tokens (IDs 0..n-1)
  • Adds a special BOS (Beginning of Sequence) token as the final ID
  • Shuffles the dataset with a fixed seed (42) for reproducibility

2. Autograd Engine (Value)

A scalar-valued autograd system that tracks computation graphs and computes gradients via reverse-mode differentiation (backpropagation).

Supported operations:

Operation Python C++ Rust Swift Dart Kotlin JavaScript Go
Addition a + b a + b a.add(&b) a + b a + b a + b a.add(b) Add(a,b)
Multiplication a * b a * b a.mul(&b) a * b a * b a * b a.mul(b) Mul(a,b)
Power a ** n vpow(a, n) a.vpow(n) a.vpow(n) a.vpow(n) a.vpow(n) a.pow(n) Pow(a,n)
Log a.log() vlog(a) a.vlog() a.vlog() a.vlog() a.vlog() a.log() Log(a)
Exp a.exp() vexp(a) a.vexp() a.vexp() a.vexp() a.vexp() a.exp() Exp(a)
ReLU a.relu() vrelu(a) a.relu() a.relu() a.relu() a.relu() a.relu() Relu(a)
Negation -a -a a.neg() -a -a -a a.neg() Neg(a)
Subtraction a - b a - b a.sub(&b) a - b a - b a - b a.sub(b) Sub(a,b)
Division a / b a / b a.div(&b) a / b a / b a / b a.div(b) Div(a,b)

Backward pass uses topological sorting (DFS) to propagate gradients through the computation graph β€” the chain rule, applied recursively.

3. Parameter Initialization

Weights are initialized as 2D matrices of Value nodes, drawn from a Gaussian distribution (mean=0, std=0.08):

state_dict:
  wte        β†’ [vocab_size Γ— n_embd]      Token embeddings
  wpe        β†’ [block_size Γ— n_embd]      Position embeddings
  lm_head    β†’ [vocab_size Γ— n_embd]      Output projection

  Per layer:
    attn_wq  β†’ [n_embd Γ— n_embd]          Query projection
    attn_wk  β†’ [n_embd Γ— n_embd]          Key projection
    attn_wv  β†’ [n_embd Γ— n_embd]          Value projection
    attn_wo  β†’ [n_embd Γ— n_embd]          Output projection
    mlp_fc1  β†’ [4Β·n_embd Γ— n_embd]        MLP up-projection
    mlp_fc2  β†’ [n_embd Γ— 4Β·n_embd]        MLP down-projection

Default hyperparameters:

Parameter Value Description
n_embd 16 Embedding dimension
n_head 4 Number of attention heads
n_layer 1 Number of transformer layers
block_size 16 Maximum sequence length
head_dim 4 Dimension per head (n_embd / n_head)

4. Model Architecture

Follows GPT-2 with minor simplifications:

                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                    β”‚         Token + Position         β”‚
                    β”‚           Embedding              β”‚
                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                               β”‚
                          β”Œβ”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”
                          β”‚ RMSNorm β”‚
                          β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜
                               β”‚
              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
              β”‚     Multi-Head Self-Attention   β”‚
              β”‚  (Q, K, V projections + output) β”‚
              β”‚  with causal KV-cache           β”‚
              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                               β”‚ + residual
                          β”Œβ”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”
                          β”‚ RMSNorm β”‚
                          β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜
                               β”‚
              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
              β”‚          MLP Block              β”‚
              β”‚  FC1 β†’ ReLU β†’ FC2               β”‚
              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                               β”‚ + residual
                          β”Œβ”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”
                          β”‚ lm_head β”‚
                          β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜
                               β”‚
                          [ logits ]

Differences from standard GPT-2:

  • LayerNorm β†’ RMSNorm (no learned scale/bias)
  • GeLU β†’ ReLU (simpler activation)
  • No biases anywhere in the model

5. Training Loop

For each step (1..1000):
  1. Pick a document, tokenize it with BOS delimiters
  2. Forward each token through the model (autoregressive)
  3. Compute cross-entropy loss against next-token targets
  4. Backward pass (compute all gradients)
  5. Adam optimizer update with:
     - Linear learning rate decay
     - Bias-corrected first moment (β₁ = 0.85)
     - Bias-corrected second moment (Ξ²β‚‚ = 0.99)

Optimizer hyperparameters:

Parameter Value
Learning rate 0.01 (linearly decayed to 0)
β₁ 0.85
Ξ²β‚‚ 0.99
Ξ΅ 1e-8
Steps 1000

6. Inference

After training, the model generates 20 new samples:

For each sample:
  1. Start with BOS token
  2. Feed through model to get next-token probabilities
  3. Apply temperature scaling (0.5 β€” slightly conservative)
  4. Sample from the probability distribution
  5. Repeat until BOS is generated or block_size is reached

⚑ Language-Specific Design Decisions

C++ (microgpt.cpp)

  • Value ownership: std::shared_ptr<Value> enables shared graph node references
  • Operator overloads: Global operator+, operator*, etc., for natural mathematical syntax
  • RNG: std::mt19937 (Mersenne Twister) + std::normal_distribution + std::discrete_distribution
  • Requires: C++17 (-std=c++17) for structured bindings

Rust (microgpt.rs)

  • Value ownership: Rc<RefCell<ValueInner>> β€” Rust's interior mutability pattern for shared mutable nodes
  • No external crates: Everything is std-only, including a hand-rolled xoshiro256** PRNG
  • Gaussian sampling: Box-Muller transform (no external dependency needed)
  • Weighted sampling: Hand-rolled cumulative distribution function

Swift (microgpt.swift)

  • Value as class: Reference semantics (like Python) via Swift's class keyword
  • Operator overloads: Global +, *, -, / operators matching Python's syntax
  • Identity tracking: ObjectIdentifier for the visited set in topological sort
  • RNG: Hand-rolled xoshiro256** + Box-Muller for Gaussian

Dart (microgpt.dart)

  • Value class: Dart classes are reference types by default (perfect fit)
  • Operator overloads: operator +, operator *, operator -, operator /
  • Named methods: vpow(), vlog(), vexp(), relu() for operations Dart can't overload
  • RNG: Hand-rolled xoshiro256** + Box-Muller for Gaussian
  • String handling: Character-by-character via index access (string[i])

Kotlin (microgpt.kt)

  • Value class: Kotlin classes are reference types; direct operator overloads (plus, times, minus, div, unaryMinus)
  • Data class for weights: data class LayerWeights for clean per-layer storage
  • RNG: Hand-rolled xoshiro256** with Long (64-bit signed) + Box-Muller
  • Signed 64-bit workaround: SplitMix64 seed constants expressed as signed Long literals

JavaScript (microgpt.js)

  • Value class: ES6 class with method-based API (.add(), .mul(), .sub(), .div())
  • No operator overloads: JS doesn't support them, so all operations are method calls
  • RNG: Hand-rolled xoshiro256** using BigInt for 64-bit precision + Box-Muller
  • Typed arrays: Float64Array for Adam optimizer buffers (memory-efficient)
  • IIFE pattern: Entire program wrapped in an immediately-invoked function

Go (microgpt.go)

  • Value struct: Pointer-based (*Value) for shared graph references
  • Free functions: Go has no operator overloading β€” all ops are top-level functions (Add, Mul, Pow, Log, Exp, Relu, etc.)
  • RNG: Hand-rolled xoshiro256** with uint64 + Box-Muller for Gaussian
  • Shuffle: Uses Rng.Shuffle with a swap callback (matches sort.Slice idiom)
  • Idiomatic Go: Exported fields (Data, Grad), unexported internals (children, localGrads)

πŸ“‚ Project Structure

microgpt/
β”œβ”€β”€ input.txt           # Training dataset (32K+ names)
β”œβ”€β”€ microgpt.py         # Original Python implementation
β”œβ”€β”€ microgpt.cpp        # C++17 port
β”œβ”€β”€ microgpt.rs         # Rust port
β”œβ”€β”€ microgpt.swift      # Swift port
β”œβ”€β”€ microgpt.dart       # Dart port
β”œβ”€β”€ microgpt.kt         # Kotlin port
β”œβ”€β”€ microgpt.js         # JavaScript (Node.js) port
β”œβ”€β”€ microgpt.go         # Go port
β”œβ”€β”€ benchmark.sh        # All-languages benchmark script
└── README.md           # This file

πŸš€ Quick Start

Prerequisites

  • An input.txt file in the working directory (the Python version auto-downloads it; other versions expect it to exist)

To auto-download input.txt:

curl -o input.txt https://raw.githubusercontent.com/karpathy/makemore/refs/heads/master/names.txt

Run any version

# Python (original)
python microgpt.py

# C++
g++ -std=c++17 -O2 -o microgpt_cpp microgpt.cpp && ./microgpt_cpp

# Rust
rustc -O microgpt.rs -o microgpt_rs && ./microgpt_rs

# Swift
swiftc -O microgpt.swift -o microgpt_swift && ./microgpt_swift

# Dart
dart compile exe microgpt.dart -o microgpt && ./microgpt

# Kotlin
kotlinc microgpt.kt -include-runtime -d microgpt.jar && java -jar microgpt.jar

# JavaScript (Node.js)
node microgpt.js

# Go
go run microgpt.go

Expected Output

num docs: 32033
vocab size: 27
num params: 7185
step    1 / 1000 | loss 3.2958
step    2 / 1000 | loss 3.2730
...
step 1000 / 1000 | loss 2.1042

--- inference (new, hallucinated names) ---
sample  1: Maliya
sample  2: Kori
sample  3: Aden
...

Note: Exact output values may vary slightly between languages due to differences in floating-point arithmetic and RNG implementations, but the overall behavior and convergence should be consistent.


⏱️ Benchmarking

A benchmark script is included that compiles and runs all implementations with maximum optimizations, then prints a ranked comparison table.

bash benchmark.sh

This compiles each language with its best optimization flags (-O3 for C++, -O for Rust/Swift, AOT for Dart, etc.), runs all 1000 training steps, and produces a results table like:

╔═══════════════════════════════════════════════════════════════════════╗
β•‘                        BENCHMARK RESULTS                              β•‘
╠═══════════════════════════════════════════════════════════════════════╣
β•‘  Rank β”‚ Language              β”‚ Compile (s) β”‚  Run (s)  β”‚ vs Fastest  β•‘
β•‘  ─────┼───────────────────────┼─────────────┼───────────┼─────────────║
β•‘    #1 β”‚ Rust (rustc -O)       β”‚          3s β”‚      6s   β”‚      1.00x  β•‘
β•‘    #2 β”‚ C++ (g++ -O3)         β”‚          1s β”‚      7s   β”‚      1.17x  β•‘
β•šβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•

Note: The above is a sample β€” actual timings depend on your hardware. Missing toolchains are gracefully skipped.


πŸ“Š Dependency Comparison

Language External Deps Total Imports
Python 0 os, math, random
C++ 0 <iostream>, <fstream>, <vector>, <cmath>, <random>, <algorithm>, <set>, <map>, <string>, <sstream>, <memory>, <functional>, <numeric>
Rust 0 std::cell, std::collections, std::fs, std::rc
Swift 0 Foundation
Dart 0 dart:io, dart:math
Kotlin 0 java.io.File, kotlin.math.*
JavaScript 0 fs (Node.js built-in)
Go 0 fmt, math, os, bufio, sort, strings

Every implementation achieves zero external dependencies β€” only standard library / language built-ins are used.


πŸ€” Why This Exists

This project demonstrates that:

  1. A GPT can be implemented in ~200-300 lines in any modern language
  2. Autograd doesn't require a framework β€” it's just the chain rule + a topological sort
  3. The algorithm is language-agnostic β€” the same structure maps naturally to C++, Rust, Swift, Dart, Kotlin, JavaScript, and Go
  4. Dependencies are optional β€” even RNG and Gaussian sampling can be done from scratch

It's a powerful educational tool for understanding transformers at the most fundamental level.


πŸ™ Credits & Acknowledgments

Massive shoutout to Andrej Karpathy for the original Python implementation. His work on making deep learning accessible β€” from micrograd to nanoGPT to this beautifully minimal microgpt.py β€” has been an incredible gift to the ML community. This multi-language port is a tribute to the elegance of his original algorithm. πŸŽ‰


"This file is the complete algorithm. Everything else is just efficiency."
β€” Andrej Karpathy