docs(llama-cpp): reclassify GPU exclusivity — vLLM is the hard-exclusive one #6
Merged
…he hard-exclusive one

The pre-router README claimed llama-cpp and Ollama can't coexist on the GB10. That is true for classic single-model mode (-ngl 999 pins all layers at boot) but false for router mode. Router mode is lazy: no VRAM is claimed until a model is loaded. Ollama is also lazy (it loads on chat request and unloads after keep_alive). So router mode and Ollama can coexist as long as the simultaneous resident set fits in the GB10's 124 GiB.

vLLM is the genuinely exclusive engine in this stack: --gpu-memory-utilization 0.9 reserves ~90% of VRAM at startup whether or not it's serving requests. Classic single-model mode also remains exclusive.

Replaces the "GPU exclusivity" subsection in llama-cpp/README.md with "GPU sharing": a four-row table comparing posture (lazy vs eager) and a coexistence rule of thumb. Adds an explicit note that Ollama has `restart: unless-stopped`, so a Docker daemon restart will re-up it even after a stop; the explicit stop-and-start workflow is still needed if you want hard exclusivity (e.g. loading the 120b on both engines simultaneously).

The spec doc gets a corresponding "GPU exclusivity reclassified" note in the post-deployment-findings section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
You were right: the "llama-cpp ⊕ Ollama" exclusivity story was inherited from the classic-mode README and is wrong for router mode. vLLM is the one that actually requires an exclusive GPU lock.
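Why

| Engine / mode | Posture | VRAM behavior |
| --- | --- | --- |
| llama-cpp (router mode) | lazy | Claims no VRAM until a model is loaded; resident models are capped by `MODELS_MAX` |
| Ollama | lazy | Loads on chat request, unloads after `keep_alive` (5 min default) |
| llama-cpp (classic single-model mode) | eager | `-ngl 999` pins every layer at boot (~65 GiB for the default 120b) |
| vLLM | eager | `--gpu-memory-utilization 0.9` reserves ~90% of VRAM at start whether or not it's serving |

Router mode and Ollama can both be up at once on the GB10, which is exactly what we observed when `ollama.spark-1822.local` came back with `Ollama is running` while `llama-cpp` was also healthy.

Changes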
- `llama-cpp/README.md`: replace the `### GPU exclusivity` section with `### GPU sharing`: a 4-row table covering the four postures, the coexistence rule of thumb (`MODELS_MAX × worst_case_resident_VRAM` + other engine's set ≤ 124 GiB), and an explicit note that Ollama's `restart: unless-stopped` brings it back on Docker daemon restarts (so a one-shot `stop` is not durable).
- `docs/superpowers/specs/2026-05-22-llama-cpp-router-mode-design.md`: add a "GPU exclusivity reclassified" paragraph to the post-deployment-findings section.
- `vllm/README.md`: untouched; it already correctly states vLLM can't coexist with either of the others.

🤖 Generated with Claude Code
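
The coexistence rule of thumb can be sketched as a small arithmetic check. This is a minimal illustration, not code from the PR: the function names are hypothetical, the 20 GiB Ollama resident set is an invented example value, and it assumes the ~65 GiB quoted for the default 120b is a fair worst case for a single router-mode model.

```python
# Sketch of the rule of thumb:
#   MODELS_MAX x worst_case_resident_VRAM + other engine's set <= 124 GiB
# All function names and the sample inputs below are illustrative assumptions.

GB10_VRAM_GIB = 124.0  # total figure quoted in this PR


def router_worst_case_gib(models_max: int, worst_case_model_gib: float) -> float:
    """Upper bound on router-mode residency: every slot holds the largest model."""
    return models_max * worst_case_model_gib


def fits_alongside(
    models_max: int,
    worst_case_model_gib: float,
    other_engine_gib: float,
    total_gib: float = GB10_VRAM_GIB,
) -> bool:
    """True if the router's worst case plus the other engine's set fits in VRAM."""
    needed = router_worst_case_gib(models_max, worst_case_model_gib) + other_engine_gib
    return needed <= total_gib


def vllm_reserved_gib(utilization: float = 0.9, total_gib: float = GB10_VRAM_GIB) -> float:
    """VRAM that an eager reservation at the given utilization fraction would claim."""
    return utilization * total_gib


if __name__ == "__main__":
    # One 65 GiB model plus a hypothetical 20 GiB Ollama set: 85 GiB, fits.
    print(fits_alongside(1, 65.0, 20.0))
    # Two 65 GiB slots plus the same Ollama set: 150 GiB, does not fit.
    print(fits_alongside(2, 65.0, 20.0))
    # An eager 0.9 reservation leaves roughly a tenth of the 124 GiB for anything else.
    print(round(GB10_VRAM_GIB - vllm_reserved_gib(), 1))
```

The last line shows why vLLM is treated as exclusive here: with about 90% reserved up front, the remaining headroom is far below the ~65 GiB a 120b load is described as needing.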