Home
youngharold edited this page Apr 24, 2026
Mixed-vendor GPU inference cluster manager with speculative decoding proxy.
Fastest path: Docker with env vars (no config files):

    docker run --rm --network host \
      -e TIGHTWAD_DRAFT_URL=http://192.168.1.10:11434 \
      -e TIGHTWAD_TARGET_URL=http://192.168.1.20:11434 \
      ghcr.io/youngharold/tightwad

Or install and auto-discover servers:

    pip install -e .
    tightwad init   # scans LAN, finds Ollama/llama-server, generates config
    tightwad proxy start

- Combined Mode: Speculation Over a Pool (the killer feature). When a model doesn't fit on one machine, pool GPUs via RPC and speculate on top. Batch verification amortizes RPC latency: 32 tokens per round-trip instead of 1. 1.86x measured speedup on Llama 3.3 70B across 4 GPUs over WiFi.
- Speculative Decoding Proxy: A fast draft model proposes tokens, a large target model verifies them in batch. Output quality equivalent to the target model alone, up to 2x faster. Ships token IDs (bytes), not tensor data. Includes a live web dashboard at `/dashboard` with server health, SVG charts, and per-request timing.
- Multi-Drafter Consensus: Race multiple cheap drafters in parallel. When they agree, the target GPU is skipped entirely. Three consensus modes: `strict`, `majority`, `any_disagree`. Tree-based speculation for branching paths, Prometheus metrics for accept/fallback rates.
- RPC Cluster: Pool CUDA + ROCm + Metal GPUs across machines into a single OpenAI-compatible endpoint using llama.cpp RPC. Hot-swap models without restarting workers.
- Quality Gate: CPU Fleet Drafts, GPU Reviews. Full-response level (not token-level). A fleet of cheap machines generates complete responses; a powerful GPU reviews each one, approving, correcting, or rejecting. 60-80% pass unchanged. Command: `tightwad gate start`.
- Swarm Transfer: BitTorrent-style P2P model distribution. Split GGUF files into 64 MB pieces with SHA256 hashes, pull from any peer that has pieces. Rarest-first selection, resume on interrupt, delta updates.
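To make the Swarm Transfer idea concrete, here is a minimal Python sketch of piece hashing and rarest-first ordering. It is an illustration only, not Tightwad's actual manifest format or code: the function names (`build_manifest`, `verify_piece`, `rarest_first`) and the dictionary layout are hypothetical. The 64 MB piece size and the SHA256 hashing come from the description above.

```python
import hashlib
from collections import Counter

PIECE_SIZE = 64 * 1024 * 1024  # 64 MB pieces, as described above


def build_manifest(path, piece_size=PIECE_SIZE):
    """Split a file into fixed-size pieces and record a SHA256 per piece.

    The returned layout (list of dicts) is a hypothetical format for
    illustration, not the project's real manifest schema.
    """
    pieces = []
    with open(path, "rb") as f:
        index = 0
        while True:
            chunk = f.read(piece_size)
            if not chunk:
                break
            pieces.append({
                "index": index,
                "size": len(chunk),
                "sha256": hashlib.sha256(chunk).hexdigest(),
            })
            index += 1
    return pieces


def verify_piece(data, expected_sha256):
    """Check a downloaded piece against the hash stored in the manifest."""
    return hashlib.sha256(data).hexdigest() == expected_sha256


def rarest_first(needed, peer_pieces):
    """Order the needed piece indices so the least-replicated come first.

    peer_pieces maps a peer name to the set of piece indices it holds.
    Pieces that no peer holds are left out of the result.
    """
    counts = Counter(i for have in peer_pieces.values() for i in have)
    available = [i for i in needed if counts[i] > 0]
    return sorted(available, key=lambda i: (counts[i], i))
```

Because each piece is verified independently, a puller can resume after an interruption by re-hashing the pieces already on disk and requesting only the ones that are missing or fail verification.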
- Architecture — RPC cluster design, data flow, Docker deployment, and tensor split calculation
- Speculative Decoding — How the proxy works, verification algorithm, use cases
- MoE Support — Expert-aware placement, defusion, profile-guided placement, MoE benchmarks
- Hardware Setup — Building llama.cpp for CUDA/ROCm workers and coordinator
- Configuration — cluster.yaml reference, env var config, Docker deployment
- CLI Reference — All commands and options (init, proxy, cluster, swarm)
- Swarm Transfer — P2P model distribution (manifest, seeder, puller)
- Benchmarking — Benchmark scripts, methodology, and published results
- Troubleshooting — Common issues and fixes (proxy, RPC, Docker, init wizard)
- Network Optimization — Bandwidth tuning and layer placement
- OpenClaw Integration — Registering Tightwad as an OpenClaw provider
Speculative decoding is what Google, DeepMind, and most production inference stacks already use to accelerate frontier models. Tightwad packages it for home hardware.
- Leviathan, Kalman, Matias (2022/2023). Fast Inference from Transformers via Speculative Decoding. ICML 2023. arXiv:2211.17192 — the foundational paper.
- Chen et al. (2023). Accelerating Large Language Model Decoding with Speculative Sampling. DeepMind. arXiv:2302.01318 — independent parallel formulation with stochastic-sampling extension.
- Google Research (2024). Looking Back at Speculative Decoding. Blog retrospective from the original authors covering production deployment.
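The draft-then-verify loop these papers describe can be sketched in a few lines of Python. This is a simplified greedy variant under stated assumptions, not Tightwad's implementation and not the stochastic acceptance rule from the papers: a "model" is any function that returns the next token for a sequence, `toy_target` and `toy_draft` are made-up stand-ins, and the target is called once per position here, whereas a real server would score all drafted positions in one batched pass.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function


def speculative_step(prefix: List[Token], draft: NextToken,
                     target: NextToken, k: int = 4) -> List[Token]:
    """Run one draft-and-verify round and return the accepted tokens."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposed: List[Token] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The target checks each proposed position (simulated sequentially).
    accepted: List[Token] = []
    ctx = list(prefix)
    for tok in proposed:
        expected = target(ctx)
        if expected != tok:
            # First mismatch: keep the target's token and stop the round.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. Every draft token matched, so the target adds one extra token.
    accepted.append(target(ctx))
    return accepted


def generate(prefix: List[Token], draft: NextToken, target: NextToken,
             max_new: int, k: int = 4) -> List[Token]:
    """Repeat speculative rounds until max_new tokens have been produced."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new:
        out.extend(speculative_step(out, draft, target, k))
    return out[:len(prefix) + max_new]


def toy_target(ctx: List[Token]) -> Token:
    """Hypothetical target model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    """Hypothetical draft model: agrees with the target except after a 4."""
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10
```

In this greedy form every emitted token is one the target would have chosen itself, so the output matches target-only decoding. The speedup comes from accepting several tokens per target round when the draft is right.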