A production-grade LLM research engine — built from first principles.
Not a tutorial wrapper. Not a HuggingFace clone. A ground-up LLM engine engineered for researchers who want to understand and control every FLOP.
Throughput ████████████████████████████████ 2.4× baseline
MFU ████████████████████████████░░░░ 61% A100 peak
VRAM ████████████░░░░░░░░░░░░░░░░░░░░ 38% reduction
One table. No marketing. Just hardware truth.
| Metric | PyTorch Baseline | fast-gpt-lab | Delta |
|---|---|---|---|
| Throughput (tok/s, A100 BF16) | ~42,000 |
~101,000 |
+140% |
| Peak VRAM (1B model, seq=2048) | 38.4 GB |
23.7 GB |
−38% |
| MFU (Model FLOP Utilization) | 26% |
61% |
+2.4× |
| Attention Memory Complexity | O(n²) |
O(n) |
FlashAttn-v3 |
| Kernel Fusion Points | 0 |
4 |
SwiGLU, LN+Res, RoPE, Attn |
| Batch Serving Latency (p99) | ~740ms |
~190ms |
Continuous batching |
Benchmarked on a single NVIDIA A100-80GB SXM4. Numbers will vary with hardware, sequence length, and batch size. Reproduce with
make bench.
Input Tokens
│
▼
┌─────────────────────────────────────────────────────┐
│ Byte-Level BPE Tokenizer (GPT-2 regex compliant) │
└───────────────────────┬─────────────────────────────┘
│
▼
┌──────────────────┐
│ Token Embedding │ ←── weight-tied with LM Head
└────────┬─────────┘
│
┌───────────▼────────────┐
│ Transformer Block ×N │
│ │
│ ┌─────────────────┐ │
│ │ Pre-RMS Norm │ │
│ └────────┬────────┘ │
│ │ │
│ ┌────────▼────────┐ │
│ │ RoPE Attention │ │ ← Triton FlashAttn-v3
│ │ (GQA-ready) │ │ O(n) memory, online softmax
│ └────────┬────────┘ │
│ │ │
│ ┌────────▼──────────┐ │
│ │ Fused LN + Res Add│ │ ← Triton kernel (1 pass)
│ └────────┬──────────┘ │
│ │ │
│ ┌────────▼────────┐ │
│ │ Fused SwiGLU FFN│ │ ← Triton kernel (no RTT)
│ └────────┬────────┘ │
│ │ │
│ ┌────────▼──────────┐ │
│ │ Fused LN + Res Add│ │
│ └────────┬──────────┘ │
└───────────┼─────────────┘
│ ×N layers
▼
┌──────────────────┐
│ LM Head │ ← tied weights → no param overhead
└──────────────────┘
│
▼
Logits / KVCache
Clients (HTTP/SSE)
┌────┐ ┌────┐ ┌────┐ ┌────┐
│ C1 │ │ C2 │ │ C3 │ │ C4 │ ...concurrent requests
└──┬─┘ └──┬─┘ └──┬─┘ └──┬─┘
│ │ │ │
└───────┴───────┴───────┘
│
▼
┌─────────────────────────┐
│ Async Request Queue │ ← asyncio + priority scheduling
│ (continuous batching) │
└────────────┬────────────┘
│
┌────────────▼────────────┐
│ Dynamic Batch Packer │ ← groups sequences by length
│ (max hardware occ.) │ avoids padding waste
└────────────┬────────────┘
│
┌────────────▼────────────┐
│ KVCache Pool │ ← O(1) decode per step
│ (stateful per seq) │
└────────────┬────────────┘
│
┌────────────▼────────────┐
│ INT8/FP8 GPT Model │ ← quantized weights, BF16 compute
└────────────┬────────────┘
│
SSE token streams
┌────┐ ┌────┐ ┌────┐ ┌────┐
│ C1 │ │ C2 │ │ C3 │ │ C4 │
└────┘ └────┘ └────┘ └────┘
fast-gpt-lab/
├── src/
│ ├── vanilla/
│ │ ├── model.py # GPT: SwiGLU, Pre-Norm, weight-tied LM Head
│ │ ├── config.py # Hardware-aware presets (Micro/Small/Medium)
│ │ ├── cache.py # Stateful KVCache — O(1) AR decoding
│ │ ├── streaming_data.py # Infinite HuggingFace streaming loader
│ │ ├── data_sharder.py # Deterministic rank-based data sharding
│ │ └── tensor_cores.py # Modulo-8 dim alignment for TC saturation
│ └── kernels/
│ ├── flash_attention.py # Triton FlashAttn-v3 + online softmax
│ ├── swiglu.py # Fused SwiGLU — eliminates mem-bus RTT
│ ├── fused_layernorm.py # LN + Residual Add in a single Triton pass
│ ├── rotary.py # Fused RoPE kernel
│ └── quantization.py # INT8/FP8 weight packing hooks
├── training/
│ ├── telemetry.py # W&B experiment tracking + artifact logging
│ ├── checkpoint.py # Atomic rename + sliding-window pruning
│ ├── amp_scaler.py # AMP: FP16/BF16 loss scaling
│ ├── optim_utils.py # Decoupled weight decay param grouping
│ └── cluster_orchestrator.py # SLURM/MPI env detection (no torchrun lock-in)
├── profiling/
│ ├── flop_calculator.py # MFU = (6N × tokens/s) / peak_TFLOPS
│ ├── chrome_trace_analyzer.py# PyTorch JSON trace → SM idle cycle finder
│ ├── memory_auditor.py # VRAM fragmentation + peak tracker
│ └── run_sweeps.py # Grid-search: throughput vs context length
├── deploy/
│ ├── api.py # FastAPI + SSE token streaming gateway
│ ├── tui.py # rich-based Terminal UI
│ ├── quant_fp8.py # Real-time INT8 weight packing
│ └── continuous_batching.py # Async request pool for concurrent serving
├── docs/
│ ├── adr/ # ADR-001 → ADR-015 (decision records)
│ ├── bpe_math.md # O-notation BPE compression derivation
│ └── math_rigor.md # Attention memory complexity proof
├── tests/
├── benchmarks/
├── Makefile
└── pyproject.toml
15 ADRs govern every non-obvious choice. Key decisions summarized below.
Decision: Replace standard GELU activation with SwiGLU (FFN(x) = (xW₁ ⊙ σ(xV)) W₂).
Rationale:
- PaLM, LLaMA-2/3, Mistral all converge on SwiGLU — empirically outperforms GELU by ~0.3–0.6 ppl at equivalent compute.
- Gating mechanism creates implicit feature selection; no extra parameters needed vs standard FFN.
- Fused Triton kernel eliminates two separate memory round-trips (gate + activation no longer split across ops).
Trade-off: FFN hidden dim must be adjusted (4d × 2/3) to match GELU parameter count. Handled in config.py.
Decision: Custom Triton implementation of FlashAttention-v3 with online softmax, rather than PyTorch's built-in SDPA.
Rationale:
- PyTorch SDPA is a black box — cannot instrument, cannot fuse with RoPE, cannot control tiling strategy.
- FA-v3 with online softmax reduces attention memory from
O(n²)toO(n)— critical at seq_len > 4096. - Custom kernel allows direct fusion with RoPE application inside the attention loop, saving one full materialization of Q/K.
- Achieves >300 TFLOPS on A100 vs ~180 TFLOPS with naive attention.
Trade-off: Triton kernel maintenance burden. Mitigated by pinned triton==2.3.x in pyproject.toml.
Decision: cluster_orchestrator.py reads SLURM_* and OMPI_* env vars directly for rank/world_size bootstrap.
Rationale:
torchrunintroduces a rendezvous daemon — a single point of failure at 100+ node scale.- Direct SLURM integration allows scheduler-native fault recovery (job step restart vs full job restart).
- Eliminates ~15s initialization overhead per training run from
torchrunTCP store negotiation.
Trade-off: Requires explicit srun invocation in SLURM scripts. Documented in infra/slurm/train.sbatch.
Decision: Single Triton kernel combining y = LayerNorm(x) + residual rather than two sequential ops.
Rationale:
- Standard PyTorch: 2 memory reads (x, residual), 2 memory writes, 2 kernel launches, 2 SM occupancy windows.
- Fused kernel: 1 read of input, 1 write of output, 1 kernel launch — halves memory bandwidth for this op.
- In a 24-layer model, this op fires 48× per forward pass. Savings compound significantly at high throughput.
Trade-off: Harder to profile individually. Chrome trace instrumentation (chrome_trace_analyzer.py) compensates.
Decision: LM head shares weight matrix with token embedding layer.
Rationale:
- Saves
vocab_size × d_modelparameters (≈25M params for vocab=50k, d=512) — pure memory reduction. - Press & Wolf (2016) show no performance degradation; widely adopted in GPT-2, LLaMA, etc.
- Reduces checkpoint size proportionally, benefiting atomic checkpoint writes.
Trade-off: Gradient flows through embedding from two sources. Handled correctly by PyTorch autograd — no action needed.
See /docs/adr/ for ADR-006 through ADR-015 covering: GQA configuration, BF16 AMP strategy, atomic checkpointing design, streaming data sharding, MFU normalization choice, KVCache eviction policy, INT8 calibration strategy, continuous batching queue design, and TUI rendering architecture.
Requirements: Python 3.11+, CUDA 12.x, uv
# 1. Clone and enter
git clone https://github.com/your-org/fast-gpt-lab && cd fast-gpt-lab
# 2. Bootstrap environment (uv — ~10× faster than pip)
make install
# Equivalent: uv sync --all-extras
# 3. Verify CUDA kernels compile
make kernel-check
# 4. Train a micro model (single GPU, ~2min)
make train-micro
# Runs: uv run python -m training.train --config micro --steps 1000
# 5. Run full benchmark suite
make bench
# Outputs MFU, throughput, VRAM report to benchmarks/results/
# 6. Launch inference API
make serve
# FastAPI + SSE at http://localhost:8000
# 7. Interactive TUI
make tui
# rich-based terminal chat interfaceMakefile targets:
| Target | Description |
|---|---|
make install |
uv sync — full env setup |
make train-micro |
Single-GPU smoke test (Micro config) |
make train-small |
Multi-GPU DDP (Small config, 8×A100) |
make train-cluster |
SLURM multi-node submission |
make bench |
Full throughput/MFU/VRAM benchmark sweep |
make kernel-check |
Compile and unit-test all Triton kernels |
make profile |
Chrome trace + memory audit |
make serve |
FastAPI inference gateway |
make tui |
Terminal UI chat |
make test |
Full pytest suite with type checking |
make lint |
ruff + mypy — zero-error CI gate |
Model size presets (src/vanilla/config.py):
GPTConfig.MICRO # 6M params — smoke test, fits any GPU
GPTConfig.SMALL # 85M params — research scale
GPTConfig.MEDIUM # 350M params — publication scale827 commits · 41 branches · 14 release tags · 5 phases
TIMELINE ──────────────────────────────────────────────────────────────────►
PHASE 1 ░░░░░░░░
"The Core" feat/modern-gpt ──────────────────────┐
v0.1.0 → v0.3.0 kernel/flash-attention-v3 ────────────┤
~180 commits ▼
✓ RoPE + SwiGLU + KVCache architecture
✓ BPE tokenizer (GPT-2 regex compliant)
✓ Triton FlashAttn-v3 kernel
✓ Triton Fused SwiGLU kernel
PHASE 2 ░░░░░░░░░░░░░░░
"The Engine" feat/wandb-telemetry ─────────────┐
v0.4.0 → v0.5.0 data/streaming-datasets ──────────┤
~160 commits infra/slurm-orchestration ─────────┤
▼
✓ SLURM/MPI native bootstrap
✓ Infinite streaming data + sharding
✓ AMP scaler + atomic checkpointing
✓ W&B telemetry + artifact logging
PHASE 3 ░░░░░░░░░░░░░░░
"The Lab" perf/mfu-auditing ────────────────┐
v0.6.0 → v0.7.0 bench/hardware-sweeps ────────────┤
~130 commits telemetry/visual-traces ───────────┤
▼
✓ MFU calculation (6N baseline)
✓ Chrome JSON trace instrumentation
✓ Tensor Core dim alignment (mod-8)
✓ VRAM fragmentation auditor
PHASE 4 ░░░░░░░░░░░░░░░
"The Gateway" feat/fastapi-gateway ─────────────┐
v0.8.0 → v0.9.0 feat/elite-tui-interface ─────────┤
~84 commits ▼
✓ FastAPI + SSE token streaming
✓ Continuous batching engine
✓ rich Terminal UI (TUI)
PHASE 5 ░░░░░░░░░░░░░░░░░░░░░░░░░
"The Mastery" perf/fused-layernorm-residual ─────┐
v1.0.0 deploy/fp8-quantization ────────────┤
~273 commits feat/continuous-batching ────────────┤
+ 36 additional branches ────────────┤
▼
✓ Fused LayerNorm + Residual kernel
✓ INT8/FP8 weight packing
✓ 100% type-safety (mypy strict)
✓ Zero-error CI/CD gate
✓ Full docstring standardization
──────────────────────────────────────────────────────────────────────────────
v0.1.0 v0.2.0 v0.3.0 v0.4.0 v0.5.0 v0.6.0 v0.7.0 v0.8.0 v0.9.0 v1.0.0
│ │ │ │ │ │ │ │ │ │
└───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┘
14 release tags
Key mathematical results underpinning the implementation:
Attention Memory Complexity (docs/math_rigor.md):
Standard attention materializes the full n × n attention matrix — O(n²) memory. FlashAttention-v3 with online softmax computes attention in tiles, never materializing the full matrix — proven O(n) HBM footprint with identical numerical output.
MFU Normalization (profiling/flop_calculator.py):
Following Hoffmann et al. (Chinchilla), MFU is computed as:
MFU = (6 × N × tokens_per_second) / peak_hardware_TFLOPS
where 6N accounts for both forward and backward pass FLOPs. This provides a hardware-agnostic efficiency metric comparable across GPU generations.
BPE Compression (docs/bpe_math.md):
Formal derivation of tokenizer compression ratio as a function of merge operations, with O-notation bounds on vocabulary construction time.
All experiments are fully reproducible:
# Seed control
make train-micro SEED=42
# Deterministic data sharding (rank-based, hash-stable)
# Configured in src/vanilla/data_sharder.py
# Checkpoint + resume
make train-small RESUME=checkpoints/step_10000.pt
# W&B run tracking
make train-small WANDB_PROJECT=fast-gpt-labCI enforces: unit tests, kernel correctness tests, MFU regression gate (fails if MFU drops >5% from baseline), and full type checking via mypy --strict.
@software{fast-gpt-lab,
title = {fast-gpt-lab: A Production-Grade LLM Research Engine},
url = {https://github.com/ammmanism/fast-gpt-lab},
note = {827 commits, 5 engineering phases, Apache-2.0}
}Apache 2.0 — see LICENSE. Use it, fork it, ship it.
Built with engineering discipline. Documented with mathematical rigor. Optimized for hardware truth.
If this project helped your research or saved you GPU hours, leave a ⭐