Home

Tightwad-Inference Wiki

A mixed-vendor GPU inference cluster manager with a speculative decoding proxy.

Getting Started

Fastest path: Docker with env vars (no config files):

docker run --rm --network host \
  -e TIGHTWAD_DRAFT_URL=http://192.168.1.10:11434 \
  -e TIGHTWAD_TARGET_URL=http://192.168.1.20:11434 \
  ghcr.io/youngharold/tightwad

Or install and auto-discover servers:

pip install -e .
tightwad init    # scans LAN, finds Ollama/llama-server, generates config
tightwad proxy start

Six Modes

Combined Mode — Speculation Over a Pool (the killer feature) — When a model doesn't fit on one machine, pool GPUs via RPC and speculate on top. Batch verification amortizes RPC latency — 32 tokens per round-trip instead of 1. 1.86x measured speedup on Llama 3.3 70B across 4 GPUs over WiFi.
Speculative Decoding Proxy — A fast draft model proposes tokens, a large target model verifies them in batch. Output quality equivalent to the target model alone, up to 2x faster. Ships token IDs (bytes), not tensor data. Includes a live web dashboard at /dashboard with server health, SVG charts, and per-request timing.
Multi-Drafter Consensus — Race multiple cheap drafters in parallel. When they agree, the target GPU is skipped entirely. Three consensus modes: strict, majority, any_disagree. Tree-based speculation for branching paths, Prometheus metrics for accept/fallback rates.
RPC Cluster — Pool CUDA + ROCm + Metal GPUs across machines into a single OpenAI-compatible endpoint using llama.cpp RPC. Hot-swap models without restarting workers.
Quality Gate — CPU Fleet Drafts, GPU Reviews — Operates on full responses rather than individual tokens. A fleet of cheap machines generates complete responses; a powerful GPU reviews each one, approving, correcting, or rejecting it. 60-80% pass unchanged. Start it with tightwad gate start.
Swarm Transfer — BitTorrent-style P2P model distribution. Split GGUF files into 64 MB pieces with SHA256 hashes, pull from any peer that has pieces. Rarest-first selection, resume on interrupt, delta updates.
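
The draft-then-verify loop behind the Speculative Decoding Proxy and Combined Mode can be sketched as follows. This is a minimal illustration of greedy speculative decoding, not Tightwad's actual implementation: the `NextToken` interface, the function names, and the toy models are all hypothetical, and the target is called once per position here where a real server would score all drafted positions in one batched forward pass.

```python
from typing import Callable, List, Sequence, Tuple

Token = int
# Hypothetical model interface: given a token prefix, return the greedy next token.
NextToken = Callable[[Sequence[Token]], Token]


def draft_tokens(draft: NextToken, prefix: Sequence[Token], k: int) -> List[Token]:
    """Let the cheap draft model propose k tokens autoregressively."""
    ctx = list(prefix)
    proposed: List[Token] = []
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)
    return proposed


def verify(target: NextToken, prefix: Sequence[Token], proposed: Sequence[Token]) -> List[Token]:
    """Return the tokens to commit for this round.

    Draft tokens are accepted while they match the target's greedy choice.
    The first mismatch is replaced by the target's token; if every draft
    token matches, the target contributes one extra token.
    """
    ctx = list(prefix)
    committed: List[Token] = []
    for tok in proposed:
        expected = target(ctx)
        if expected != tok:
            committed.append(expected)  # target's correction ends the round
            return committed
        committed.append(tok)
        ctx.append(tok)
    committed.append(target(ctx))  # all drafts accepted: bonus target token
    return committed


def speculative_generate(
    draft: NextToken, target: NextToken, prompt: Sequence[Token], max_new: int, k: int = 4
) -> Tuple[List[Token], int]:
    """Generate max_new tokens; also return the number of verification rounds."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < max_new:
        proposed = draft_tokens(draft, out, k)
        out.extend(verify(target, out, proposed))
        rounds += 1
    return out[: len(prompt) + max_new], rounds


# Toy models (assumptions for the demo): the target counts modulo 10,
# and the draft agrees except that it guesses wrong after a 2.
def target_model(ctx: Sequence[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def draft_model(ctx: Sequence[Token]) -> Token:
    return 0 if ctx[-1] == 2 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    tokens, rounds = speculative_generate(draft_model, target_model, [0], max_new=12, k=4)
    print(tokens, "rounds:", rounds)
```

Because every committed token is either confirmed or produced by the target, the output matches what the target would generate on its own under greedy decoding. The saving comes from needing fewer target round-trips than tokens.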

Pages

Architecture — RPC cluster design, data flow, Docker deployment, and tensor split calculation
Speculative Decoding — How the proxy works, verification algorithm, use cases
MoE Support — Expert-aware placement, defusion, profile-guided placement, MoE benchmarks
Hardware Setup — Building llama.cpp for CUDA/ROCm workers and coordinator
Configuration — cluster.yaml reference, env var config, Docker deployment
CLI Reference — All commands and options (init, proxy, cluster, swarm)
Swarm Transfer — P2P model distribution (manifest, seeder, puller)
Benchmarking — Benchmark scripts, methodology, and published results
Troubleshooting — Common issues and fixes (proxy, RPC, Docker, init wizard)
Network Optimization — Bandwidth tuning and layer placement
OpenClaw Integration — Registering Tightwad as an OpenClaw provider
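
For orientation before reading the Swarm Transfer page, the piece-hashing, rarest-first, and delta-update ideas described under Six Modes can be sketched like this. It is an illustrative sketch only: the function names and the manifest shape (a plain list of hex digests) are assumptions, not Tightwad's manifest format, and the data is hashed from memory where a real tool would stream a multi-gigabyte GGUF file from disk.

```python
import hashlib
from collections import Counter
from typing import Dict, List, Set

PIECE_SIZE = 64 * 1024 * 1024  # 64 MB pieces, as described for Swarm Transfer


def build_manifest(data: bytes, piece_size: int = PIECE_SIZE) -> List[str]:
    """Split data into fixed-size pieces and return one SHA256 hex digest per piece."""
    return [
        hashlib.sha256(data[i : i + piece_size]).hexdigest()
        for i in range(0, len(data), piece_size)
    ]


def verify_piece(manifest: List[str], index: int, piece: bytes) -> bool:
    """Check a downloaded piece against the manifest before accepting it."""
    return hashlib.sha256(piece).hexdigest() == manifest[index]


def rarest_first(needed: Set[int], peer_pieces: Dict[str, Set[int]]) -> List[int]:
    """Order needed pieces by how few peers hold them; skip pieces no peer has."""
    counts = Counter(idx for pieces in peer_pieces.values() for idx in pieces)
    available = [i for i in needed if counts[i] > 0]
    return sorted(available, key=lambda i: (counts[i], i))


def changed_pieces(old_manifest: List[str], new_manifest: List[str]) -> List[int]:
    """Delta update: indices of pieces that are new or whose hash changed."""
    return [
        i
        for i, digest in enumerate(new_manifest)
        if i >= len(old_manifest) or old_manifest[i] != digest
    ]


if __name__ == "__main__":
    # Tiny piece size so the demo runs instantly.
    old = build_manifest(b"abcdefghij", piece_size=4)
    new = build_manifest(b"abcdXfghij", piece_size=4)
    print("pieces to re-download:", changed_pieces(old, new))
    print("fetch order:", rarest_first({0, 1, 2}, {"a": {0, 1}, "b": {0, 2}, "c": {0}}))
```

Resume on interrupt falls out of the same structure: a puller only needs to remember which piece indices have already passed `verify_piece`.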

The Research Behind It

Speculative decoding is what Google, DeepMind, and most production inference stacks already use to accelerate frontier models. Tightwad packages it for home hardware.

Leviathan, Kalman, Matias (2022/2023). Fast Inference from Transformers via Speculative Decoding. ICML 2023. arXiv:2211.17192 — the foundational paper.
Chen et al. (2023). Accelerating Large Language Model Decoding with Speculative Sampling. DeepMind. arXiv:2302.01318 — independent parallel formulation with stochastic-sampling extension.
Google Research (2024). Looking Back at Speculative Decoding. Blog retrospective from the original authors covering production deployment.
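
As a brief summary of the acceptance rule these papers share (the notation here is chosen for this page and differs between the two papers): let $q$ be the draft model's next-token distribution and $p$ the target's. A drafted token $x \sim q$ is kept with the probability below; on rejection, a replacement is drawn from the normalized residual, so the committed tokens are distributed exactly as if sampled from $p$.

```latex
% Acceptance probability for a drafted token x ~ q(x)
P(\text{accept } x) = \min\!\left(1, \frac{p(x)}{q(x)}\right)

% On rejection, resample from the normalized positive residual
p'(x) = \frac{\max\bigl(0,\; p(x) - q(x)\bigr)}{\sum_{x'} \max\bigl(0,\; p(x') - q(x')\bigr)}

% Expected tokens committed per verification round (Leviathan et al.),
% with gamma drafted tokens and acceptance rate alpha, assuming
% acceptances are independent and identically distributed
\mathbb{E}[\text{tokens per round}] = \frac{1 - \alpha^{\gamma + 1}}{1 - \alpha}
```

The last expression shows why batch verification pays off over a slow link: the higher the acceptance rate $\alpha$, the more tokens each target round-trip commits.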
