docs(llama-cpp): reclassify GPU exclusivity — vLLM is the hard-exclusive one by a1exus · Pull Request #6 · a1exus/sparky

a1exus · 2026-05-23T18:06:46Z

You were right: the "llama-cpp ⊕ Ollama" exclusivity story was inherited from the classic-mode README and is wrong for router mode. vLLM is the one that actually requires an exclusive GPU lock.

Why

| Engine | VRAM posture |
| --- | --- |
| llama-cpp router mode | lazy: no model resident at start; loads on first request, LRU-evicts at `MODELS_MAX` |
| Ollama | lazy: loads on chat request, unloads after `keep_alive` (5 min default) |
| llama-cpp classic single-model mode | eager: `-ngl 999` pins every layer at boot (~65 GiB for the default 120b) |
| vLLM | eager: `--gpu-memory-utilization 0.9` reserves ~90% of VRAM at start whether or not it's serving |

Router mode and Ollama can both be up at once on the GB10, which is exactly what we observed: `ollama.spark-1822.local` came back with `Ollama is running` while llama-cpp was also healthy.
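To make the lazy/eager split concrete, here is a minimal sketch of how much VRAM each posture leaves free right after startup. The figures come from the table above; the dictionary, the function name, and the assumption that vLLM's 0.9 fraction applies to the same 124 GiB budget quoted in this PR are illustrative, not taken from the repo.

```python
# Illustrative model only: names are hypothetical, figures are the ones quoted in this PR.
GB10_VRAM_GIB = 124.0  # GB10 budget as stated in the PR

# VRAM claimed at process start, before any request is served.
STARTUP_CLAIM_GIB = {
    "llama-cpp router": 0.0,       # lazy: nothing resident until first request
    "ollama": 0.0,                 # lazy: loads on chat request
    "llama-cpp classic": 65.0,     # eager: -ngl 999 pins the default 120b (~65 GiB)
    "vllm": 0.9 * GB10_VRAM_GIB,   # eager: assumes the 0.9 fraction applies to 124 GiB
}


def headroom_after_start(engine: str, total_gib: float = GB10_VRAM_GIB) -> float:
    """Return the VRAM (GiB) left for other engines once `engine` has started."""
    return total_gib - STARTUP_CLAIM_GIB[engine]


if __name__ == "__main__":
    for engine in STARTUP_CLAIM_GIB:
        print(f"{engine:18s} leaves {headroom_after_start(engine):6.1f} GiB at start")
```

Under these assumptions vLLM leaves roughly 12 GiB free before it has served a single request, which is why it is treated as the hard-exclusive engine here.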

Changes

- `llama-cpp/README.md`: replace the `### GPU exclusivity` section with `### GPU sharing`: a 4-row table covering the four postures, the coexistence rule of thumb (`MODELS_MAX × worst_case_resident_VRAM + other engine's set ≤ 124 GiB`), and an explicit note that Ollama's `restart: unless-stopped` brings it back on Docker daemon restarts (so a one-shot stop is not durable).
- `docs/superpowers/specs/2026-05-22-llama-cpp-router-mode-design.md`: add a "GPU exclusivity reclassified" paragraph to the post-deployment-findings section.
- `vllm/README.md`: untouched; it already correctly states vLLM can't coexist with either of the others.
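The coexistence rule of thumb can be written as a one-line check. This is a sketch rather than code from the repo: `can_coexist` and its parameter names are hypothetical, and only the 124 GiB budget and the ~65 GiB size of the default 120b come from this PR.

```python
# Sketch of the rule of thumb:
#   MODELS_MAX * worst_case_resident_VRAM + other engine's set <= 124 GiB
# Function and parameter names are hypothetical.
GB10_VRAM_GIB = 124.0  # budget quoted in this PR


def can_coexist(
    models_max: int,
    worst_case_resident_gib: float,
    other_engine_set_gib: float,
    budget_gib: float = GB10_VRAM_GIB,
) -> bool:
    """Return True if the router's worst case plus the other engine's resident set fits."""
    router_worst_case_gib = models_max * worst_case_resident_gib
    return router_worst_case_gib + other_engine_set_gib <= budget_gib


if __name__ == "__main__":
    # 20 GiB for the other engine is an invented figure, used only for illustration.
    print(can_coexist(models_max=1, worst_case_resident_gib=65, other_engine_set_gib=20))
    print(can_coexist(models_max=2, worst_case_resident_gib=65, other_engine_set_gib=20))
    # The 120b (~65 GiB) resident on both engines at once: 130 GiB, over budget.
    print(can_coexist(models_max=1, worst_case_resident_gib=65, other_engine_set_gib=65))
```

The last case is the one where the explicit stop-and-start workflow is still needed: two copies of the 120b do not fit the budget at the same time.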

🤖 Generated with Claude Code

…he hard-exclusive one

The pre-router README claimed llama-cpp and Ollama can't coexist on the GB10: true for classic single-model mode (-ngl 999 pins all layers at boot), false for router mode. Router mode is lazy: no VRAM is claimed until a model is loaded. Ollama is also lazy (loads on chat request, unloads after keep_alive). So router mode + Ollama coexist as long as the simultaneous resident set fits the 124 GiB GB10.

vLLM is the actually-exclusive engine in this stack: --gpu-memory-utilization 0.9 reserves ~90% of VRAM at startup whether or not it's serving requests. Classic single-model mode also remains exclusive.

Replaces the "GPU exclusivity" subsection in llama-cpp/README.md with "GPU sharing": a four-row table comparing posture (lazy vs eager) and a coexistence rule of thumb. Adds an explicit note that Ollama has `restart: unless-stopped`, so a Docker daemon restart will re-up it even after a stop; the explicit stop-and-start workflow is still needed if you want hard exclusivity (e.g. loading the 120b on both engines simultaneously).

Spec doc gets a corresponding "GPU exclusivity reclassified" note in the post-deployment-findings section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

a1exus merged commit 052077a into main May 23, 2026
8 of 12 checks passed

a1exus mentioned this pull request May 23, 2026

docs(llama-cpp): clarify API and web UI share one port #7

Merged

a1exus deleted the llama-cpp-gpu-coexistence branch May 23, 2026 18:24

a1exus merged 1 commit into main from llama-cpp-gpu-coexistence
