youngharold edited this page Apr 24, 2026 · 12 revisions

Tightwad-Inference Wiki

Mixed-vendor GPU inference cluster manager with speculative decoding proxy.

Getting Started

Fastest path: run the Docker image configured entirely through environment variables (no config files):

docker run --rm --network host \
  -e TIGHTWAD_DRAFT_URL=http://192.168.1.10:11434 \
  -e TIGHTWAD_TARGET_URL=http://192.168.1.20:11434 \
  ghcr.io/youngharold/tightwad

Or install from a local clone of the repository and auto-discover servers:

pip install -e .
tightwad init    # scans LAN, finds Ollama/llama-server, generates config
tightwad proxy start

Six Modes

  1. Combined Mode — Speculation Over a Pool (the killer feature) — When a model doesn't fit on one machine, pool GPUs via RPC and speculate on top. Batch verification amortizes RPC latency — 32 tokens per round-trip instead of 1. 1.86x measured speedup on Llama 3.3 70B across 4 GPUs over WiFi.
  2. Speculative Decoding Proxy — A fast draft model proposes tokens, a large target model verifies them in batch. Output quality equivalent to the target model alone, up to 2x faster. Ships token IDs (bytes), not tensor data. Includes a live web dashboard at /dashboard with server health, SVG charts, and per-request timing.
  3. Multi-Drafter Consensus — Race multiple cheap drafters in parallel. When they agree, the target GPU is skipped entirely. Three consensus modes: strict, majority, any_disagree. Tree-based speculation for branching paths, Prometheus metrics for accept/fallback rates.
  4. RPC Cluster — Pool CUDA + ROCm + Metal GPUs across machines into a single OpenAI-compatible endpoint using llama.cpp RPC. Hot-swap models without restarting workers.
  5. Quality Gate — CPU Fleet Drafts, GPU Reviews — Works on full responses rather than individual tokens. A fleet of cheap machines generates complete responses; a powerful GPU reviews each one and approves, corrects, or rejects it. 60-80% pass unchanged. Start it with tightwad gate start.
  6. Swarm Transfer — BitTorrent-style P2P model distribution. Split GGUF files into 64 MB pieces with SHA256 hashes, pull from any peer that has pieces. Rarest-first selection, resume on interrupt, delta updates.
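
As a rough illustration of the Swarm Transfer idea in item 6, the sketch below hashes a stream in fixed-size pieces and orders missing pieces rarest-first. It is not Tightwad's implementation: the names piece_digests and rarest_first, the peer-availability dictionary, and the reading of "64 MB" as 64 MiB are all assumptions made for this example.

```python
import hashlib
import io
from collections import Counter
from typing import BinaryIO, Dict, List, Set

PIECE_SIZE = 64 * 1024 * 1024  # assumption: "64 MB" is read as 64 MiB


def piece_digests(stream: BinaryIO, piece_size: int = PIECE_SIZE) -> List[str]:
    """Return the SHA-256 hex digest of each fixed-size piece of a stream."""
    digests: List[str] = []
    while True:
        chunk = stream.read(piece_size)
        if not chunk:
            break
        digests.append(hashlib.sha256(chunk).hexdigest())
    return digests


def rarest_first(missing: Set[int], peers: Dict[str, Set[int]]) -> List[int]:
    """Order the pieces still needed so the least-replicated come first."""
    # Count how many peers advertise each piece index.
    counts = Counter(i for have in peers.values() for i in have)
    # Pieces that no known peer holds cannot be requested yet.
    available = [i for i in missing if counts[i] > 0]
    # Ties are broken by piece index to keep the order deterministic.
    return sorted(available, key=lambda i: (counts[i], i))


# Usage with a tiny in-memory "file" and 4-byte pieces (hypothetical peers).
manifest = piece_digests(io.BytesIO(b"a" * 10), piece_size=4)
order = rarest_first(
    {0, 1, 2},
    {"peer-a": {0, 1, 2}, "peer-b": {0, 1}, "peer-c": {0}},
)
```

Because every piece carries its own digest, a puller can verify each piece as it arrives and, after an interruption, re-request only the pieces it has not yet verified.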

Pages

  • Architecture — RPC cluster design, data flow, Docker deployment, and tensor split calculation
  • Speculative Decoding — How the proxy works, verification algorithm, use cases
  • MoE Support — Expert-aware placement, defusion, profile-guided placement, MoE benchmarks
  • Hardware Setup — Building llama.cpp for CUDA/ROCm workers and coordinator
  • Configuration — cluster.yaml reference, env var config, Docker deployment
  • CLI Reference — All commands and options (init, proxy, cluster, swarm)
  • Swarm Transfer — P2P model distribution (manifest, seeder, puller)
  • Benchmarking — Benchmark scripts, methodology, and published results
  • Troubleshooting — Common issues and fixes (proxy, RPC, Docker, init wizard)
  • Network Optimization — Bandwidth tuning and layer placement
  • OpenClaw Integration — Registering Tightwad as an OpenClaw provider

The Research Behind It

Speculative decoding is the technique that Google, DeepMind, and many production inference stacks already use to accelerate frontier models. Tightwad packages it for home hardware.

  • Leviathan, Kalman, Matias (2022/2023). Fast Inference from Transformers via Speculative Decoding. ICML 2023. arXiv:2211.17192 — the foundational paper.
  • Chen et al. (2023). Accelerating Large Language Model Decoding with Speculative Sampling. DeepMind. arXiv:2302.01318 — independent parallel formulation with stochastic-sampling extension.
  • Google Research (2024). Looking Back at Speculative Decoding. Blog retrospective from the original authors covering production deployment.
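
The draft-then-verify loop these papers build on can be sketched in its simplest greedy form, where a drafted token is kept only if it matches what the target model would have picked. This is a toy under stated assumptions, not Tightwad's proxy code: the models are hypothetical callables that map a token-ID prefix to the next token ID, the stochastic acceptance rule from the papers is not shown, and a real server would score all drafted positions in one batched forward pass instead of the loop used here.

```python
from typing import Callable, List

# Hypothetical model interface: token-ID prefix in, greedy next token ID out.
NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft: NextToken,
                     target: NextToken, k: int) -> List[int]:
    """Run one draft-then-verify round and return the tokens to append."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The target checks each position in order (batched in practice).
    accepted: List[int] = []
    for tok in proposed:
        expected = target(prefix + accepted)
        if tok != expected:
            # First mismatch: keep the target's token, drop the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every draft matched, so the target contributes one extra token.
    accepted.append(target(prefix + accepted))
    return accepted


def generate(prompt: List[int], draft: NextToken, target: NextToken,
             k: int, max_new: int) -> List[int]:
    """Generate max_new tokens using repeated speculative rounds."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(out, draft, target, k))
    return out[: len(prompt) + max_new]


def greedy(prompt: List[int], model: NextToken, n: int) -> List[int]:
    """Plain one-token-at-a-time decoding, used as the reference."""
    out = list(prompt)
    for _ in range(n):
        out.append(model(out))
    return out


# Toy models: the draft agrees with the target except at some positions.
def toy_target(prefix: List[int]) -> int:
    return (prefix[-1] * 3 + 1) % 11


def toy_draft(prefix: List[int]) -> int:
    return toy_target(prefix) if len(prefix) % 4 else 0


fast = generate([1, 2, 3], toy_draft, toy_target, k=4, max_new=20)
slow = greedy([1, 2, 3], toy_target, 20)
```

Every token that survives verification is one the target itself would have produced, so fast and slow are the same sequence; the draft model only changes how many target round-trips are needed to produce it.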
