A backend/model switcher for Claude Code. One command picks your inference backend, sets the right environment, and launches Claude Code — no manual export juggling, no stale env vars between sessions.
claude-controller sits between you and Claude Code. It resolves dynamic model facts from the Ollama API, merges them with curated compatibility knowledge, pre-flight checks the target endpoint, then hands off to Claude Code with a clean environment. After your session, cc-health and cc-diagnose analyse what happened.
claude-controller/
├── claude-controller.sh     # main entry point: interactive picker, env manager, launcher
├── ollama_resolver.py       # queries Ollama API live for installed models, num_ctx, architecture
├── model_profiles.json      # curated compatibility knowledge: tool_compat, known_issues
├── README.md
├── docs/
│   └── architecture.svg     # system architecture diagram
└── tools/
    ├── cc-health            # post-session JSONL analyser: flags anomalies and failure patterns
    └── cc-diagnose          # Ollama performance diagnostic: VRAM, tok/s, GPU offload
# Clone or copy the repo, then make scripts executable
chmod +x claude-controller.sh tools/cc-health tools/cc-diagnose
# Optional: symlink to PATH
ln -sf "$(pwd)/claude-controller.sh" ~/.local/bin/claude-controller
ln -sf "$(pwd)/tools/cc-health" ~/.local/bin/cc-health
ln -sf "$(pwd)/tools/cc-diagnose" ~/.local/bin/cc-diagnose

Prerequisites: Claude Code CLI, Python 3, nc/ss/lsof, curl
For Ollama backend: install Ollama and pull at least one model.
For OpenAI or Gemini backends, install LiteLLM:
pip install 'litellm[proxy]' --break-system-packages

claude-controller

Pick a backend, pick a model, and Claude Code launches. That's it.
claude-controller # interactive picker → launches Claude Code
claude-controller --status # show active backend, model, compat profile
claude-controller --resolve <model> # full resolved profile (dynamic + curated)
claude-controller --profiles # list all curated model profiles
claude-controller --diagnose <model> # Ollama performance diagnostic
claude-controller --start-litellm # start LiteLLM proxy (kills stale process if needed)
claude-controller --stop-litellm # stop LiteLLM proxy
claude-controller --setup-litellm # regenerate LiteLLM config
claude-controller --export # print env vars for sourcing into current shell

mkdir -p ~/.config/switchmodel
cat >> ~/.config/switchmodel/keys.env <<EOF
OPENAI_API_KEY=sk-...
GEMINI_API_KEY=...
EOF
chmod 600 ~/.config/switchmodel/keys.env

The Anthropic backend uses your existing claude login session, so no API key is needed.
cc-health # analyse most recent session
cc-health --last 5 # analyse last 5 sessions
cc-health <session.jsonl> # analyse a specific file
cc-health --watch # watch for new sessions and auto-analyse

Reads Claude Code's JSONL session logs and flags:
| Flag | What it means |
|---|---|
| SYNTHETIC_ERROR | API call died before reaching any model: LiteLLM config or connectivity issue |
| TOOL_LOOP | Same tool called with identical params 2+ times: model stuck in retry loop |
| WRONG_PARAMS | Known bad patterns: pages:"" on Read (gpt-4o), ReadFile instead of Read (qwen3) |
| AGENT_FAILURE | Agent tool called with Claude model names LiteLLM cannot resolve |
| RAW_XML | <function=...> leaked as plain text: model partially trained on tool syntax |
| CONTEXT_CAP | All turns show identical input_tokens: conversation history being truncated |
| ROLE_SWITCH | Model spontaneously responded as a different assistant persona |
| ZERO_OUTPUT | Session produced no substantive output: task not completed |
| CACHE_EFFICIENCY | Low cache hit rate on a Claude model |
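
Two of these flags reduce to simple checks over per-turn records. The sketch below is illustrative only: it assumes each turn has already been parsed into a dict with the hypothetical keys tool, params and input_tokens, which is not the real Claude Code JSONL schema, and the min_turns guard is an assumption rather than documented cc-health behaviour.

```python
import json
from collections import Counter


def find_tool_loops(turns, threshold=2):
    """Return (tool, params_json) pairs seen at least `threshold` times.

    `turns` uses hypothetical keys "tool" and "params"; params are
    serialised with sorted keys so key order does not hide a repeat.
    """
    counts = Counter(
        (t["tool"], json.dumps(t["params"], sort_keys=True))
        for t in turns
        if t.get("tool")
    )
    return [key for key, n in counts.items() if n >= threshold]


def looks_context_capped(turns, min_turns=3):
    """True when every turn reports the same input_tokens value.

    min_turns is an assumed guard so very short sessions are not flagged.
    """
    tokens = [t["input_tokens"] for t in turns if "input_tokens" in t]
    return len(tokens) >= min_turns and len(set(tokens)) == 1


# Hand-written example turns, not taken from a real session log.
turns = [
    {"tool": "Read", "params": {"file_path": "a.py", "pages": ""}, "input_tokens": 32768},
    {"tool": "Read", "params": {"pages": "", "file_path": "a.py"}, "input_tokens": 32768},
    {"tool": "Glob", "params": {"pattern": "*.py"}, "input_tokens": 32768},
]
print(find_tool_loops(turns))
print(looks_context_capped(turns))
```

Serialising params with sorted keys means two calls that differ only in key order still count as identical, which matches the "identical params" wording of TOOL_LOOP.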
cc-diagnose # diagnose last-used model
cc-diagnose devstral-small-2:24b # diagnose a specific model
claude-controller --diagnose <model> # same, via the controller

Measures and reports:
- GPU vs CPU layer split and VRAM usage
- KV cache size at various num_ctx values vs available VRAM headroom
- Token speed: prefill (tok/s) and decode (tok/s)
- Reload cost between requests (cold-start detection)
- Context cap detection from recent JSONL logs
Dynamic facts — read live from the Ollama API via ollama_resolver.py:
- Which models are installed
- Real num_ctx from the modelfile and native context length from model weights
- Architecture, parameter size, quantization, capabilities
Curated knowledge — stored in model_profiles.json:
- tool_compat rating: native / partial / broken
- known_issues from observed Claude Code session behavior
- Human notes and recommendations
These are never mixed. model_profiles.json contains no num_ctx values or parameter counts — nothing Ollama already knows. This prevents the two sources from contradicting each other or going stale independently.
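
The separation can be pictured as a merge that refuses overlapping keys. This is a minimal sketch, not the code in ollama_resolver.py: the function name merge_profile and the exact key set in DYNAMIC_ONLY_KEYS are assumptions made for illustration.

```python
# Keys that only the live Ollama query may supply (assumed set, for illustration).
DYNAMIC_ONLY_KEYS = {
    "num_ctx", "context_length", "architecture",
    "parameter_size", "quantization", "capabilities",
}


def merge_profile(dynamic, curated):
    """Combine live facts with curated knowledge, rejecting any overlap.

    Raises ValueError if the curated entry tries to state something
    that Ollama already reports, so the two sources cannot disagree.
    """
    overlap = DYNAMIC_ONLY_KEYS & set(curated)
    if overlap:
        raise ValueError(f"curated profile must not define: {sorted(overlap)}")
    return {**dynamic, **curated}


merged = merge_profile(
    {"num_ctx": 49152, "architecture": "llama"},
    {"tool_compat": "partial", "known_issues": []},
)
print(merged)
```

Failing loudly on overlap, rather than letting one side win silently, keeps a stale curated value from masking what the API reports.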
For any Ollama model, ollama_resolver.py looks up a profile in this order and stops at the first match:
- Exact match: full model name (e.g. qwen3-coder:latest is rated broken specifically)
- Family prefix: progressively shorter prefix matches (qwen3-coder, devstral, llama3...)
- Architecture: the general.architecture string from Ollama's model_info (e.g. qwen3, llama)
- Unknown fallback: warns the user, proceeds with conservative defaults
A brand new model you just pulled gets a sensible answer based on its architecture even if it has no explicit profile entry.
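
The lookup order above can be sketched roughly as follows. Several details are assumptions: the architectures key in the profile data, shortening the prefix one character at a time, and the placeholder UNKNOWN_PROFILE defaults are all hypothetical, since the README does not specify them.

```python
import warnings

# Placeholder defaults; the real conservative defaults are not documented here.
UNKNOWN_PROFILE = {"tool_compat": "unknown", "known_issues": []}


def resolve_profile(model, architecture, profiles):
    """Return (match_kind, profile): exact > family prefix > architecture > unknown."""
    exact = profiles.get("exact", {})
    families = profiles.get("families", {})
    architectures = profiles.get("architectures", {})  # hypothetical key

    if model in exact:
        return "exact", exact[model]

    name = model.split(":", 1)[0]  # drop the tag, e.g. ":24b"
    for end in range(len(name), 0, -1):  # longest prefix first
        if name[:end] in families:
            return "family", families[name[:end]]

    if architecture in architectures:
        return "architecture", architectures[architecture]

    warnings.warn(f"no profile for {model!r}; using conservative defaults")
    return "unknown", dict(UNKNOWN_PROFILE)


# Example data only: apart from qwen3-coder:latest, these ratings are made up.
profiles = {
    "exact": {"qwen3-coder:latest": {"tool_compat": "broken"}},
    "families": {"devstral": {"tool_compat": "partial"}},
    "architectures": {"llama": {"tool_compat": "partial"}},
}
print(resolve_profile("devstral-small-2:24b", "mistral", profiles))
```

Trying the longest prefix first means a specific family entry such as qwen3-coder would win over a shorter one such as qwen3, if both existed.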
num_ctx is resolved in priority order:
- If the modelfile explicitly sets num_ctx, use that value and warn if it is below 16K
- Otherwise, use the native context_length from model weights (from model_info), capped at 131072
- Absolute fallback: 32768
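
As a rough sketch of that priority order (the function name and return shape are hypothetical, and 16K is assumed to mean 16384 tokens):

```python
AGENTIC_MIN_CTX = 16 * 1024   # "16K" taken as 16384; an assumption
NATIVE_CTX_CAP = 131072
FALLBACK_CTX = 32768


def resolve_num_ctx(modelfile_num_ctx, native_context_length):
    """Return (num_ctx, source, warn) following the priority order above.

    Either argument may be None when Ollama does not report it.
    """
    if modelfile_num_ctx is not None:
        # An explicit modelfile value wins, but a small one is flagged.
        return modelfile_num_ctx, "modelfile", modelfile_num_ctx < AGENTIC_MIN_CTX
    if native_context_length:
        return min(native_context_length, NATIVE_CTX_CAP), "native", False
    return FALLBACK_CTX, "fallback", False


print(resolve_num_ctx(8192, 262144))
print(resolve_num_ctx(None, 262144))
print(resolve_num_ctx(None, None))
```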
If the resolved value is missing or below the 16K agentic minimum, the controller automatically creates a pinned Modelfile variant (e.g. devstral-small-2-cc:49152ctx) with num_ctx baked in. This is the only reliable way to set context window size for Claude Code's HTTP requests to Ollama — environment variables like OLLAMA_NUM_CTX are ignored at the API layer.
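
A pinned variant boils down to a two-line Modelfile. The sketch below builds one; FROM and PARAMETER num_ctx are standard Modelfile instructions, while the helper name and the naming rule (base name plus -cc:<num_ctx>ctx) are inferred from the example above and may not match what the controller actually does.

```python
def pinned_variant(base_model, num_ctx):
    """Return (variant_name, modelfile_text) for a context-pinned copy.

    The naming scheme is inferred from devstral-small-2-cc:49152ctx
    and is an assumption, not the controller's documented rule.
    """
    name = base_model.split(":", 1)[0]  # drop the original tag
    variant = f"{name}-cc:{num_ctx}ctx"
    modelfile = f"FROM {base_model}\nPARAMETER num_ctx {num_ctx}\n"
    return variant, modelfile


variant, modelfile = pinned_variant("devstral-small-2:24b", 49152)
print(variant)
print(modelfile)
```

Writing the text to a file named Modelfile and running ollama create with that variant name and -f Modelfile would register the pinned model.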
The VRAM budget for the KV cache is estimated from nvidia-smi free memory minus model weight allocation, and num_ctx is scaled down automatically if it would cause spillage to RAM.
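
The README does not give the exact formula, so the following is only a sketch of the idea. It assumes the usual KV cache size for a transformer (keys plus values, per layer, per KV head) with a 2-byte cache element, and rounds down to a multiple of 1024 tokens; the function names, the rounding step and the model dimensions in the example are all illustrative.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache needed per token (factor 2 = keys and values)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def fit_num_ctx(requested, free_vram_bytes, weight_bytes, per_token_bytes, step=1024):
    """Scale num_ctx down so the KV cache fits in what is left after weights."""
    budget = free_vram_bytes - weight_bytes
    if budget <= 0:
        return 0  # weights alone do not fit; nothing left for the cache
    max_tokens = budget // per_token_bytes
    fitted = min(requested, max_tokens)
    return (fitted // step) * step  # round down to a tidy multiple


GiB = 1024 ** 3
# Made-up dimensions for illustration, not those of any specific model.
per_tok = kv_bytes_per_token(n_layers=40, n_kv_heads=8, head_dim=128)
print(per_tok)
print(fit_num_ctx(131072, 24 * GiB, 14 * GiB, per_tok))
```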
Every apply_* function calls sanitize_env() before setting new variables, clearing all backend-related env vars (ANTHROPIC_BASE_URL, ANTHROPIC_AUTH_TOKEN, ANTHROPIC_API_KEY, CLAUDE_MODEL, OLLAMA_KEEP_ALIVE). This prevents stale vars from a previous backend session leaking into the new one.
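
The real functions live in the shell script; the same pattern is shown here in Python for illustration. The variable list comes from the paragraph above, but the body of apply_ollama, including which variables it sets and the use of Ollama's default port 11434 as the base URL, is an assumption.

```python
BACKEND_VARS = (
    "ANTHROPIC_BASE_URL",
    "ANTHROPIC_AUTH_TOKEN",
    "ANTHROPIC_API_KEY",
    "CLAUDE_MODEL",
    "OLLAMA_KEEP_ALIVE",
)


def sanitize_env(env):
    """Return a copy of env with every backend-related variable removed."""
    return {k: v for k, v in env.items() if k not in BACKEND_VARS}


def apply_ollama(env, model, base_url="http://localhost:11434"):
    """Illustrative apply_* step: always start from a sanitized environment."""
    clean = sanitize_env(env)
    clean["ANTHROPIC_BASE_URL"] = base_url  # assumed variable choice
    clean["CLAUDE_MODEL"] = model
    return clean


# Leftovers from a previous LiteLLM session (example values).
stale = {
    "PATH": "/usr/bin",
    "ANTHROPIC_API_KEY": "old-key",
    "ANTHROPIC_BASE_URL": "http://localhost:4000",
}
print(apply_ollama(stale, "devstral-small-2-cc:49152ctx"))
```

Because sanitize_env runs first, the old API key disappears even though apply_ollama never mentions it.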
Before launching Claude Code, the controller verifies the target endpoint with a two-stage check: a TCP connect (is the port open?) followed by an HTTP health request (is the process actually responding?). This catches the common failure where a hung or crashed backend leaves the port looking open while no requests are served. A failed pre-flight exits immediately with a clear error, rather than letting Claude Code retry silently for 3 minutes.
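
A two-stage check of this kind can be sketched with the standard library alone. The function name, the default health path and the timeout are assumptions; the controller itself is a shell script and may use nc and curl instead.

```python
import http.client
import socket
import urllib.error
import urllib.request


def preflight(host, port, health_path="/", timeout=3.0):
    """Return (ok, detail). Stage 1: TCP connect. Stage 2: HTTP GET."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            pass  # the port accepts connections
    except OSError as exc:
        return False, f"tcp: {exc}"

    url = f"http://{host}:{port}{health_path}"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return True, f"http {resp.status}"
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status.
        return False, f"http: status {exc.code}"
    except (urllib.error.URLError, http.client.HTTPException, OSError) as exc:
        # Connected, but no usable HTTP response (hang, reset, garbage).
        return False, f"http: {exc}"


if __name__ == "__main__":
    ok, detail = preflight("127.0.0.1", 11434)
    print("ready" if ok else f"pre-flight failed ({detail})")
```

Returning which stage failed lets the caller print a specific error: a tcp failure means nothing is listening, while an http failure means something holds the port but is not serving requests.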
LiteLLM is always configured with drop_params: true, which silently strips Claude-specific parameters (like context_management) that non-Anthropic models reject with HTTP 400. A system prompt (CLAUDE_CODE_SYSTEM_PROMPT) is also injected for non-Claude backends to prevent known failure modes: gpt-4o spawning Claude sub-agents via the Agent tool, shallow Glob-only fallback after Agent failures, and the pages:"" empty-string parameter bug on Read tool calls.
Ratings are derived from real Claude Code JSONL session data:
| Rating | Meaning |
|---|---|
| ✅ native | Full Claude Code tool protocol support. No known issues. |
| partial | Works for most tasks. Known issues shown as warnings before launch. |
| ❌ broken | Fundamental tool-calling incompatibilities. Must type yes to proceed. |
gpt-4o (session a7385301, 3ae0395b): Emits pages:"" on Read tool calls — Claude Code rejects this every time and the model never self-corrects. Also attempts to spawn Claude sub-agents (claude-sonnet-4-6, claude-opus-4-7) via the Agent tool, which LiteLLM cannot resolve without Anthropic credentials. Both are addressed by the injected system prompt.
gemini-2.0-flash (session 2e393533): Rejected Claude Code's context_management parameter with HTTP 400 — fixed by enforcing drop_params: true.
qwen3-coder:latest (session 84b1c5cc): Hallucinated tool names (ReadFile instead of Read), wrong parameter names, read wrong files, spontaneously became a Google Drive assistant, emitted raw XML as plain text, fixed context cap at 32768 tokens per turn.
devstral-small-2-cc:49152ctx (validated): 100% GPU offload on RTX 3090, 52 tok/s decode, 48K context, no reload cost between turns. Optimal configuration for 24GB VRAM.
claude-sonnet-4-6 (session 97299b69): 100% success rate, 92.2% cache efficiency, 340 output tokens/turn, correct tool usage throughout. The reference baseline.
Edit model_profiles.json. Add to exact for a specific model tag, or families for a whole family:
"families": {
  "my-new-model": {
    "tool_compat": "partial",
    "known_issues": [
      "issue_name: Description of the problem and when it manifests."
    ],
    "notes": "Human-readable recommendation.",
    "session_evidence": ["session-uuid-here"]
  }
}

Run cc-health after a test session to gather evidence before writing a profile entry.
Edit the generated config at ~/.cache/agentrun/litellm_config.yaml, or regenerate it:
claude-controller --setup-litellm

The config always includes drop_params: true. Do not remove it.
| Path | Purpose |
|---|---|
| ~/.config/switchmodel/keys.env | API keys for OpenAI, Gemini |
| ~/.cache/agentrun/backend_state.json | Last-used backend and model |
| ~/.cache/agentrun/litellm_config.yaml | Generated LiteLLM config |
| ~/.cache/agentrun/litellm.log | LiteLLM proxy log |
| ~/.claude/projects/**/*.jsonl | Claude Code session logs (read by cc-health) |