
docs(llama-cpp): reclassify GPU exclusivity — vLLM is the hard-exclusive one #6

Merged
a1exus merged 1 commit into main from llama-cpp-gpu-coexistence on May 23, 2026

@a1exus a1exus commented May 23, 2026

You were right: the "llama-cpp ⊕ Ollama" exclusivity story was inherited from the classic-mode README and is wrong for router mode. vLLM is the one that actually requires an exclusive GPU lock.

Why

| Engine | VRAM posture |
| --- | --- |
| llama-cpp router mode | lazy: no model resident at start; loads on first request, LRU-evicts at `MODELS_MAX` |
| Ollama | lazy: loads on chat request, unloads after `keep_alive` (5 min default) |
| llama-cpp classic single-model mode | eager: `-ngl 999` pins every layer at boot (~65 GiB for the default 120b) |
| vLLM | eager: `--gpu-memory-utilization 0.9` reserves ~90% of VRAM at start whether or not it's serving |

Router mode and Ollama can both be up at once on the GB10, which is exactly what we observed: ollama.spark-1822.local came back with "Ollama is running" while llama-cpp was also healthy.

Changes

  • llama-cpp/README.md: replace the ### GPU exclusivity section with ### GPU sharing — a 4-row table covering the four postures, the coexistence rule of thumb (MODELS_MAX × worst_case_resident_VRAM + other engine's set ≤ 124 GiB), and an explicit note that Ollama's restart: unless-stopped brings it back on Docker daemon restarts (so a one-shot stop is not durable).
  • docs/superpowers/specs/2026-05-22-llama-cpp-router-mode-design.md: add a "GPU exclusivity reclassified" paragraph to the post-deployment-findings section.
  • vllm/README.md: untouched — it already correctly states vLLM can't coexist with either of the others.
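
For reviewers who want to sanity-check the rule of thumb, here is a minimal Python sketch of the inequality quoted above. The function and parameter names are hypothetical (they are not taken from the README). The 20 GiB Ollama figure is only an illustration, and reusing ~65 GiB as the 120b footprint on the second engine is an assumption.

```python
# Minimal sketch of the coexistence rule of thumb from this PR:
#   MODELS_MAX x worst_case_resident_VRAM + other engine's set <= 124 GiB
# All names below are hypothetical; only the 124 GiB budget, the ~65 GiB
# 120b figure and the 0.9 vLLM fraction come from the PR text.

GB10_BUDGET_GIB = 124.0  # memory budget quoted in the PR for the GB10


def router_worst_case_gib(models_max: int, worst_case_resident_gib: float) -> float:
    """Upper bound on router-mode residency once MODELS_MAX models are loaded."""
    return models_max * worst_case_resident_gib


def can_coexist(
    models_max: int,
    worst_case_resident_gib: float,
    other_engine_gib: float,
    budget_gib: float = GB10_BUDGET_GIB,
) -> bool:
    """Return True when the simultaneous resident set fits the budget."""
    total = router_worst_case_gib(models_max, worst_case_resident_gib) + other_engine_gib
    return total <= budget_gib


if __name__ == "__main__":
    # Router mode holding one ~65 GiB model next to an illustrative 20 GiB Ollama set.
    print(can_coexist(1, 65.0, 20.0))
    # The 120b resident on both engines at once (assumes ~65 GiB on each side).
    print(can_coexist(1, 65.0, 65.0))
    # vLLM reserving ~90% of the budget at start, alongside one 65 GiB router model.
    print(can_coexist(1, 65.0, 0.9 * GB10_BUDGET_GIB))
```

Under these assumptions only the first case fits. The second is the "120b on both engines" situation where the stop-and-start workflow is still needed, and the third shows why vLLM stays hard-exclusive.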

🤖 Generated with Claude Code

…he hard-exclusive one

The pre-router README claimed llama-cpp and Ollama can't coexist on the
GB10 — true for classic single-model mode (-ngl 999 pins all layers at
boot), false for router mode. Router mode is lazy: no VRAM is claimed
until a model is loaded. Ollama is also lazy (loads on chat request,
unloads after keep_alive). So router mode + Ollama coexist as long as
the simultaneous resident set fits the 124 GiB GB10.

vLLM is the actually-exclusive engine in this stack:
--gpu-memory-utilization 0.9 reserves ~90% of VRAM at startup whether
or not it's serving requests. Classic single-model mode also remains
exclusive.

Replaces the "GPU exclusivity" subsection in llama-cpp/README.md with
"GPU sharing" — a four-row table comparing posture (lazy vs eager) and
a coexistence rule of thumb. Adds an explicit note that Ollama has
`restart: unless-stopped`, so a Docker daemon restart will re-up it
even after a stop; the explicit stop-and-start workflow is still
needed if you want hard exclusivity (e.g. loading the 120b on both
engines simultaneously).

Spec doc gets a corresponding "GPU exclusivity reclassified" note in
the post-deployment-findings section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@a1exus a1exus merged commit 052077a into main May 23, 2026
8 of 12 checks passed
@a1exus a1exus deleted the llama-cpp-gpu-coexistence branch May 23, 2026 18:24