claude-controller

A backend/model switcher for Claude Code. One command picks your inference backend, sets the right environment, and launches Claude Code — no manual export juggling, no stale env vars between sessions.

Architecture

claude-controller sits between you and Claude Code. It resolves dynamic model facts from the Ollama API, combines them with curated compatibility knowledge, runs a pre-flight check against the target endpoint, then hands off to Claude Code with a clean environment. After your session, cc-health and cc-diagnose analyse what happened.

Repository structure

claude-controller/
├── claude-controller.sh      # main entry point — interactive picker, env manager, launcher
├── ollama_resolver.py        # queries Ollama API live for installed models, num_ctx, architecture
├── model_profiles.json       # curated compatibility knowledge: tool_compat, known_issues
├── README.md
├── docs/
│   └── architecture.svg      # system architecture diagram
└── tools/
    ├── cc-health             # post-session JSONL analyser — flags anomalies and failure patterns
    └── cc-diagnose           # Ollama performance diagnostic — VRAM, tok/s, GPU offload

Installation

# Clone or copy the repo, then make scripts executable
chmod +x claude-controller.sh tools/cc-health tools/cc-diagnose

# Optional: symlink to PATH
ln -sf "$(pwd)/claude-controller.sh"  ~/.local/bin/claude-controller
ln -sf "$(pwd)/tools/cc-health"       ~/.local/bin/cc-health
ln -sf "$(pwd)/tools/cc-diagnose"     ~/.local/bin/cc-diagnose

Prerequisites: Claude Code CLI, Python 3, nc/ss/lsof, curl

For Ollama backend: install Ollama and pull at least one model.

For OpenAI or Gemini backends, install LiteLLM:

pip install 'litellm[proxy]' --break-system-packages

Quick start

claude-controller

Pick a backend, pick a model — Claude Code launches. That's it.

All commands

claude-controller                    # interactive picker → launches Claude Code
claude-controller --status           # show active backend, model, compat profile
claude-controller --resolve <model>  # full resolved profile (dynamic + curated)
claude-controller --profiles         # list all curated model profiles
claude-controller --diagnose <model> # Ollama performance diagnostic
claude-controller --start-litellm    # start LiteLLM proxy (kills stale process if needed)
claude-controller --stop-litellm     # stop LiteLLM proxy
claude-controller --setup-litellm    # regenerate LiteLLM config
claude-controller --export           # print env vars for sourcing into current shell

API keys

mkdir -p ~/.config/switchmodel
cat >> ~/.config/switchmodel/keys.env <<'EOF'
OPENAI_API_KEY=sk-...
GEMINI_API_KEY=...
EOF
chmod 600 ~/.config/switchmodel/keys.env

The Anthropic backend uses your existing claude login session — no API key needed.
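For reference, here is a minimal Python sketch of how a launcher might read keys.env. It parses simple KEY=VALUE lines, skips blanks and # comments, and warns if the file is accessible to group or others. The helper name load_keys is hypothetical. The controller itself may simply source the file from the shell.

```python
import stat
from pathlib import Path


def load_keys(path):
    """Parse a KEY=VALUE env file; return (keys, warnings). Hypothetical helper."""
    path = Path(path).expanduser()
    keys, warnings = {}, []
    mode = stat.S_IMODE(path.stat().st_mode)
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        # Any group/other permission defeats the chmod 600 step.
        warnings.append(f"{path} has mode {oct(mode)}; expected 0o600")
    for raw in path.read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # blank line, comment, or not an assignment
        name, value = line.split("=", 1)
        # Strip optional surrounding quotes, as a shell would for simple values.
        keys[name.strip()] = value.strip().strip("'\"")
    return keys, warnings
```

Usage: `keys, warnings = load_keys("~/.config/switchmodel/keys.env")`.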

Post-session tools

cc-health — session quality analyser

cc-health                  # analyse most recent session
cc-health --last 5         # analyse last 5 sessions
cc-health <session.jsonl>  # analyse a specific file
cc-health --watch          # watch for new sessions and auto-analyse

Reads Claude Code's JSONL session logs and flags:

Flag	What it means
`SYNTHETIC_ERROR`	API call died before reaching any model — LiteLLM config or connectivity issue
`TOOL_LOOP`	Same tool called with identical params 2+ times — model stuck in retry loop
`WRONG_PARAMS`	Known bad patterns: `pages:""` on Read (gpt-4o), `ReadFile` instead of `Read` (qwen3)
`AGENT_FAILURE`	Agent tool called with Claude model names LiteLLM cannot resolve
`RAW_XML`	`<function=...>` leaked as plain text — model partially trained on tool syntax
`CONTEXT_CAP`	All turns show identical `input_tokens` — conversation history being truncated
`ROLE_SWITCH`	Model spontaneously responded as a different assistant persona
`ZERO_OUTPUT`	Session produced no substantive output — task not completed
`CACHE_EFFICIENCY`	Low cache hit rate on a Claude model
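The following minimal Python sketch shows how two of these flags, TOOL_LOOP and CONTEXT_CAP, could be detected. It makes several assumptions about Claude Code's log layout:

- Each JSONL line is a JSON object.
- Assistant entries have "type": "assistant" and a "message" holding a "content" list and a "usage" dict.
- Tool calls are content blocks of type "tool_use".

The function names are hypothetical and are not cc-health's actual code.

```python
import json
from collections import Counter


def assistant_messages(lines):
    """Yield the 'message' dict of every assistant entry in a JSONL session log."""
    for line in lines:
        if not line.strip():
            continue
        entry = json.loads(line)
        if entry.get("type") == "assistant":
            yield entry.get("message", {})


def detect_tool_loop(lines, threshold=2):
    """TOOL_LOOP: names of tools called with identical input `threshold`+ times."""
    calls = Counter()
    for msg in assistant_messages(lines):
        content = msg.get("content")
        if not isinstance(content, list):
            continue  # plain-text replies carry no tool calls
        for block in content:
            if isinstance(block, dict) and block.get("type") == "tool_use":
                # Canonicalise the input so key order cannot hide a repeat.
                params = json.dumps(block.get("input", {}), sort_keys=True)
                calls[(block.get("name", "?"), params)] += 1
    return sorted({name for (name, _), n in calls.items() if n >= threshold})


def detect_context_cap(lines, min_turns=3):
    """CONTEXT_CAP: every turn reports the same input_tokens value."""
    per_turn = {}
    for i, msg in enumerate(assistant_messages(lines)):
        tokens = msg.get("usage", {}).get("input_tokens")
        if tokens is not None:
            # One API response may span several log lines; dedupe by message id.
            per_turn[msg.get("id", i)] = tokens
    values = list(per_turn.values())
    return len(values) >= min_turns and len(set(values)) == 1
```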

cc-diagnose — Ollama performance diagnostic

cc-diagnose                        # diagnose last-used model
cc-diagnose devstral-small-2:24b   # diagnose a specific model
claude-controller --diagnose <model>  # same, via the controller

Measures and reports:

GPU vs CPU layer split and VRAM usage
KV cache size at various num_ctx values vs available VRAM headroom
Token speed: prefill (tok/s) and decode (tok/s)
Reload cost between requests (cold-start detection)
Context cap detection from recent JSONL logs
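Several of these numbers can be derived from fields Ollama already reports. A non-streaming /api/generate response includes prompt_eval_count, prompt_eval_duration, eval_count, eval_duration and load_duration, with durations in nanoseconds. Each entry in the models list from /api/ps reports size and size_vram. The minimal Python sketch below turns those fields into metrics. The helper names and the one-second cold-start threshold are assumptions for illustration, not cc-diagnose's actual logic.

```python
NS_PER_S = 1_000_000_000


def _rate(count: int, duration_ns: int) -> float:
    # Durations can be 0 or missing (e.g. a fully cached prompt); avoid dividing by zero.
    return count / (duration_ns / NS_PER_S) if duration_ns else 0.0


def token_speeds(resp: dict) -> tuple:
    """(prefill tok/s, decode tok/s) from a non-streaming /api/generate response."""
    prefill = _rate(resp.get("prompt_eval_count", 0), resp.get("prompt_eval_duration", 0))
    decode = _rate(resp.get("eval_count", 0), resp.get("eval_duration", 0))
    return prefill, decode


def gpu_fraction(ps_model: dict) -> float:
    """Share of a loaded model held in VRAM, from one entry of /api/ps 'models'."""
    size = ps_model.get("size", 0)
    return ps_model.get("size_vram", 0) / size if size else 0.0


def is_cold_start(resp: dict, threshold_s: float = 1.0) -> bool:
    """Flag a reload between requests: load_duration above an assumed threshold."""
    return resp.get("load_duration", 0) / NS_PER_S > threshold_s
```

A gpu_fraction of 1.0 corresponds to 100% GPU offload. Anything lower means some layers are running on the CPU.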

Design

Two sources of model knowledge, kept strictly separate

Dynamic facts — read live from the Ollama API via ollama_resolver.py:

Which models are installed
Real num_ctx from the modelfile and native context length from model weights
Architecture, parameter size, quantization, capabilities

Curated knowledge — stored in model_profiles.json:

tool_compat rating: native / partial / broken
known_issues from observed Claude Code session behavior
Human notes and recommendations

These are never mixed. model_profiles.json contains no num_ctx values or parameter counts — nothing Ollama already knows. This prevents the two sources from contradicting each other or going stale independently.
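The dynamic side can be sketched in Python, assuming the JSON shape returned by Ollama's /api/show endpoint:

- a parameters string with one "name value" pair per line;
- a model_info map keyed like "<architecture>.context_length";
- a details object with parameter_size and quantization_level;
- a capabilities list.

The function names are hypothetical, and ollama_resolver.py may read these fields differently. The merge step enforces the separation rule by refusing curated keys that restate dynamic facts.

```python
# Keys that belong to Ollama and must never come from model_profiles.json.
FORBIDDEN_CURATED = {"num_ctx", "context_length", "parameter_size", "quantization", "architecture"}


def dynamic_facts(show: dict) -> dict:
    """Extract live facts from an Ollama /api/show response."""
    info = show.get("model_info", {})
    arch = info.get("general.architecture")
    num_ctx = None
    for line in show.get("parameters", "").splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0] == "num_ctx":
            num_ctx = int(parts[1])
    details = show.get("details", {})
    return {
        "architecture": arch,
        "modelfile_num_ctx": num_ctx,  # None when the modelfile does not set it
        "native_context_length": info.get(f"{arch}.context_length") if arch else None,
        "parameter_size": details.get("parameter_size"),
        "quantization_level": details.get("quantization_level"),
        "capabilities": show.get("capabilities", []),
    }


def merge_profile(dynamic: dict, curated: dict) -> dict:
    """Combine both sources at resolve time; curated data may not restate dynamic facts."""
    overlap = set(curated) & (set(dynamic) | FORBIDDEN_CURATED)
    if overlap:
        raise ValueError(f"curated profile duplicates dynamic facts: {sorted(overlap)}")
    return {**dynamic, **curated}
```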

Profile lookup chain

For any Ollama model, ollama_resolver.py runs:

Exact match — full model name (e.g. qwen3-coder:latest is rated broken specifically)
Family prefix — progressively shorter prefix matches (qwen3-coder, devstral, llama3...)
Architecture — the general.architecture string from Ollama's model_info (e.g. qwen3, llama)
Unknown fallback — warns the user, proceeds with conservative defaults
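The four-step chain can be sketched in Python as a longest-prefix lookup. The sketch assumes model_profiles.json has the exact and families maps shown under "Adding a new profile". As a simplification, it also looks up architecture names in families. resolve_profile is a hypothetical name and may not match how ollama_resolver.py structures the lookup.

```python
from typing import Optional, Tuple

UNKNOWN = {"tool_compat": "unknown", "known_issues": []}


def resolve_profile(model: str, architecture: Optional[str], profiles: dict) -> Tuple[str, dict]:
    """Return (match_kind, profile) for an Ollama model name such as 'qwen3-coder:latest'."""
    exact = profiles.get("exact", {})
    families = profiles.get("families", {})

    # 1. Exact match on the full name, tag included.
    if model in exact:
        return "exact", exact[model]

    # 2. Family prefix: the longest family key the untagged name starts with,
    #    equivalent to trying progressively shorter prefixes.
    base = model.split(":", 1)[0]
    candidates = [key for key in families if base.startswith(key)]
    if candidates:
        return "family", families[max(candidates, key=len)]

    # 3. Architecture from model_info's general.architecture (simplified: same map).
    if architecture and architecture in families:
        return "architecture", families[architecture]

    # 4. Unknown: the caller warns and falls back to conservative defaults.
    return "unknown", UNKNOWN
```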

A brand-new model you just pulled still gets a sensible compatibility rating from its architecture, even if it has no explicit profile entry.

num_ctx resolution and pinned variants

num_ctx is resolved in priority order:

If the modelfile explicitly sets num_ctx → use that value, and warn if it is below 16K
Otherwise, use the native context_length from the model weights (from model_info), capped at 131072
If neither is available, fall back to 32768
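The priority order above can be written as a short Python function. This is a minimal sketch: it reads "16K" as 16 × 1024 = 16384 tokens, which is an assumption, and the function name is hypothetical.

```python
from typing import List, Optional, Tuple

AGENTIC_MIN = 16 * 1024  # "16K" read as 16384 tokens (assumption)
NATIVE_CAP = 131072
FALLBACK = 32768


def resolve_num_ctx(modelfile_num_ctx: Optional[int],
                    native_context_length: Optional[int]) -> Tuple[int, List[str]]:
    """Apply the three-step priority order; return (num_ctx, warnings)."""
    warnings: List[str] = []
    # 1. An explicit modelfile value wins, even if it is too small.
    if modelfile_num_ctx:
        if modelfile_num_ctx < AGENTIC_MIN:
            warnings.append(f"num_ctx {modelfile_num_ctx} is below the {AGENTIC_MIN}-token agentic minimum")
        return modelfile_num_ctx, warnings
    # 2. Otherwise use the weights' native context length, capped.
    if native_context_length:
        return min(native_context_length, NATIVE_CAP), warnings
    # 3. Nothing known: conservative fallback.
    return FALLBACK, warnings
```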

If the resolved value is missing or below the 16K agentic minimum, the controller automatically creates a pinned Modelfile variant (e.g. devstral-small-2-cc:49152ctx) with num_ctx baked in. This is the only reliable way to set context window size for Claude Code's HTTP requests to Ollama — environment variables like OLLAMA_NUM_CTX are ignored at the API layer.
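Here is a minimal Python sketch of creating such a variant. FROM and PARAMETER num_ctx are standard Modelfile instructions, and `ollama create <name> -f <file>` builds a model from a Modelfile. The -cc:<N>ctx naming follows the example in the paragraph above. The function names and the temporary-directory approach are assumptions.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import Tuple


def pinned_variant(base_model: str, num_ctx: int) -> Tuple[str, str]:
    """Return (variant_name, modelfile_text), e.g. devstral-small-2-cc:49152ctx."""
    family = base_model.split(":", 1)[0]
    name = f"{family}-cc:{num_ctx}ctx"
    modelfile = f"FROM {base_model}\nPARAMETER num_ctx {num_ctx}\n"
    return name, modelfile


def create_pinned_variant(base_model: str, num_ctx: int) -> str:
    """Build the variant with `ollama create`; requires a local Ollama install."""
    name, modelfile = pinned_variant(base_model, num_ctx)
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "Modelfile"
        path.write_text(modelfile)
        subprocess.run(["ollama", "create", name, "-f", str(path)], check=True)
    return name
```

Because num_ctx is baked into the variant, Claude Code's plain HTTP requests get the larger window without sending any options.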

The VRAM budget for the KV cache is estimated from nvidia-smi free memory minus model weight allocation, and num_ctx is scaled down automatically if it would cause spillage to RAM.
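The budget arithmetic can be sketched as follows. For a standard transformer with grouped-query attention, the KV cache holds one key and one value vector per layer, per KV head, per token, so its size grows linearly with num_ctx.

This sketch assumes free VRAM is measured before the model loads, for example with `nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits` (MiB). It also ignores Ollama's compute buffers, so treat the result as an upper bound. The function names and the model dimensions in the test are illustrative, not taken from any specific model.

```python
MIB = 1024 ** 2


def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int, bytes_per_elem: int = 2) -> int:
    """K and V (factor 2) for every layer and KV head; bytes_per_elem=2 for an f16 cache."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def fit_num_ctx(requested: int, free_vram_mib: int, weights_mib: int,
                per_token_bytes: int, step: int = 1024) -> int:
    """Largest num_ctx <= requested whose KV cache fits in the VRAM left after the weights."""
    budget = (free_vram_mib - weights_mib) * MIB
    if budget <= 0:
        return 0  # the weights alone already spill to system RAM
    max_tokens = budget // per_token_bytes
    max_tokens -= max_tokens % step  # round down to a multiple of `step`
    return min(requested, max_tokens)
```

Take an illustrative model with 40 layers, 8 KV heads, head_dim 128 and an f16 cache. Each token costs 160 KiB, so a 48K context needs 7.5 GiB of cache. With 10 GiB left after the weights, a 128K request would be scaled down to 65536.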

Environment sanitization

Every apply_* function calls sanitize_env() before setting new variables, clearing all backend-related env vars (ANTHROPIC_BASE_URL, ANTHROPIC_AUTH_TOKEN, ANTHROPIC_API_KEY, CLAUDE_MODEL, OLLAMA_KEEP_ALIVE). This prevents stale vars from a previous backend session leaking into the new one.
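sanitize_env() and the apply_* functions live in the shell script. The same pattern is sketched here in Python, working on a copy of the environment. BACKEND_VARS mirrors the list above. apply_backend and its arguments are hypothetical stand-ins for the real apply_* functions.

```python
from typing import Dict, Mapping, Optional

BACKEND_VARS = (
    "ANTHROPIC_BASE_URL",
    "ANTHROPIC_AUTH_TOKEN",
    "ANTHROPIC_API_KEY",
    "CLAUDE_MODEL",
    "OLLAMA_KEEP_ALIVE",
)


def sanitize_env(env: Mapping[str, str]) -> Dict[str, str]:
    """Copy `env` without any backend-related variable left by a previous session."""
    return {k: v for k, v in env.items() if k not in BACKEND_VARS}


def apply_backend(env: Mapping[str, str], base_url: Optional[str], model: str,
                  extra: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Hypothetical apply_*: always sanitize first, then set only what this backend needs."""
    clean = sanitize_env(env)
    if base_url:  # None keeps Claude Code's default Anthropic endpoint
        clean["ANTHROPIC_BASE_URL"] = base_url
    clean["CLAUDE_MODEL"] = model
    clean.update(extra or {})
    return clean
```

A launcher would then call something like `subprocess.run(["claude"], env=apply_backend(os.environ, ...))`. The child process therefore never inherits a stale variable, and the parent shell is left untouched.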

Pre-flight health checks

Before launching Claude Code, the controller verifies the target endpoint with a two-stage check: TCP connect (port open?) followed by an HTTP health request (process actually responding?). This catches the common failure where a crashed process leaves a socket in TIME_WAIT — the port appears open but no requests are served. A failed pre-flight exits immediately with a clear error, rather than letting Claude Code retry silently for 3 minutes.
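Here is a minimal Python sketch of the two-stage check, using only the standard library. The health path is an assumption: Ollama answers GET / and /api/version when healthy, while a LiteLLM proxy needs whichever health route your version exposes. The function name and return shape are hypothetical.

```python
import socket
import urllib.error
import urllib.request
from typing import Tuple


def preflight(host: str, port: int, health_path: str = "/", timeout: float = 3.0) -> Tuple[bool, str]:
    """Two-stage endpoint check: TCP connect, then a real HTTP request."""
    # Stage 1: is anything accepting connections on the port?
    try:
        with socket.create_connection((host, port), timeout=timeout):
            pass
    except OSError as exc:
        return False, f"TCP connect to {host}:{port} failed: {exc}"

    # Stage 2: does the process behind it actually answer HTTP?
    url = f"http://{host}:{port}{health_path}"
    # Bypass any HTTP(S)_PROXY settings: the target is a local endpoint.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as resp:
            return True, f"HTTP {resp.status} from {url}"
    except urllib.error.HTTPError as exc:
        # Any status proves the process is alive; treat 5xx as unhealthy.
        return exc.code < 500, f"HTTP {exc.code} from {url}"
    except OSError as exc:  # URLError and timeouts are both OSError subclasses
        return False, f"no HTTP response from {url}: {exc}"
```

For a local Ollama on its default port this would be `ok, why = preflight("127.0.0.1", 11434, "/api/version")`, followed by exiting with `why` on failure. A listener that accepts connections but never replies fails at stage 2 after one timeout, instead of stalling Claude Code.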

LiteLLM configuration

LiteLLM is always configured with drop_params: true, which silently strips Claude-specific parameters (like context_management) that non-Anthropic models reject with HTTP 400. A system prompt (CLAUDE_CODE_SYSTEM_PROMPT) is also injected for non-Claude backends to prevent known failure modes: gpt-4o spawning Claude sub-agents via the Agent tool, shallow Glob-only fallback after Agent failures, and the pages:"" empty-string parameter bug on Read tool calls.
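Here is a minimal sketch of generating such a config in Python. model_list, litellm_params, the os.environ/VAR key reference and litellm_settings.drop_params are LiteLLM proxy config conventions. The aliases, the provider-to-key mapping and the function names are assumptions, and the system-prompt injection is not shown. Because JSON is valid YAML, the result can be written directly to litellm_config.yaml without a YAML library.

```python
import json
from typing import Dict

# Hypothetical mapping from LiteLLM provider prefix to the variable names in keys.env.
KEY_VARS = {"openai": "OPENAI_API_KEY", "gemini": "GEMINI_API_KEY"}


def build_litellm_config(models: Dict[str, str]) -> dict:
    """models maps the name Claude Code will request -> a LiteLLM 'provider/model' string."""
    model_list = []
    for alias, target in models.items():
        provider = target.split("/", 1)[0]
        model_list.append({
            "model_name": alias,
            "litellm_params": {
                "model": target,
                # LiteLLM reads the key from the environment; it is never written to disk.
                "api_key": f"os.environ/{KEY_VARS[provider]}",
            },
        })
    # drop_params tells LiteLLM to drop parameters the target provider does not support.
    return {"model_list": model_list, "litellm_settings": {"drop_params": True}}


def render(config: dict) -> str:
    """Serialise as JSON, which YAML parsers also accept."""
    return json.dumps(config, indent=2) + "\n"
```

The proxy would then be started with `litellm --config ~/.cache/agentrun/litellm_config.yaml`.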

Compatibility ratings

Ratings are derived from real Claude Code JSONL session data:

Rating	Meaning
✅ `native`	Full Claude Code tool protocol support. No known issues.
⚠️ `partial`	Works for most tasks. Known issues shown as warnings before launch.
❌ `broken`	Fundamental tool-calling incompatibilities. Must type `yes` to proceed.
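The launch gate implied by this table can be sketched like this. confirm_launch and its injectable ask/warn callbacks are hypothetical. Treating an unrated model as "warn and proceed" follows the unknown fallback in the profile lookup chain.

```python
from typing import Callable, Iterable


def confirm_launch(tool_compat: str, known_issues: Iterable[str],
                   ask: Callable[[str], str] = input,
                   warn: Callable[[str], None] = print) -> bool:
    """Decide whether to launch Claude Code for a model with the given rating."""
    for issue in known_issues:
        warn(f"warning: {issue}")
    if tool_compat == "broken":
        # Anything other than the literal word 'yes' aborts the launch.
        return ask("Model is rated broken. Type yes to proceed: ").strip() == "yes"
    if tool_compat not in ("native", "partial"):
        warn(f"warning: no compatibility rating ({tool_compat}); using conservative defaults")
    return True
```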

Known issues from session evidence

gpt-4o (session a7385301, 3ae0395b): Emits pages:"" on Read tool calls — Claude Code rejects this every time and the model never self-corrects. Also attempts to spawn Claude sub-agents (claude-sonnet-4-6, claude-opus-4-7) via the Agent tool, which LiteLLM cannot resolve without Anthropic credentials. Both are addressed by the injected system prompt.

gemini-2.0-flash (session 2e393533): Rejected Claude Code's context_management parameter with HTTP 400 — fixed by enforcing drop_params: true.

qwen3-coder:latest (session 84b1c5cc): Hallucinated tool names (ReadFile instead of Read), used wrong parameter names, read the wrong files, spontaneously became a Google Drive assistant, emitted raw XML as plain text, and showed a fixed context cap of 32768 tokens per turn.

devstral-small-2-cc:49152ctx (validated): 100% GPU offload on RTX 3090, 52 tok/s decode, 48K context, no reload cost between turns. Optimal configuration for 24GB VRAM.

claude-sonnet-4-6 (session 97299b69): 100% success rate, 92.2% cache efficiency, 340 output tokens/turn, correct tool usage throughout. The reference baseline.

Adding a new profile

Edit model_profiles.json. Add to exact for a specific model tag, or families for a whole family:

"families": {
  "my-new-model": {
    "tool_compat": "partial",
    "known_issues": [
      "issue_name: Description of the problem and when it manifests."
    ],
    "notes": "Human-readable recommendation.",
    "session_evidence": ["session-uuid-here"]
  }
}

Run cc-health after a test session to gather evidence before writing a profile entry.
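Before committing an entry, a quick check can catch both malformed fields and accidentally duplicated dynamic facts. This Python sketch rests on three assumptions:

- The file has top-level exact and families maps.
- known_issues entries follow the "issue_name: description" form shown in the example entry.
- Keys such as num_ctx must never appear, per the separation rule under Design.

validate_profiles and validate_file are hypothetical helpers.

```python
import json
from typing import List

RATINGS = {"native", "partial", "broken"}
# Facts Ollama already reports; they must stay out of the curated file.
DYNAMIC_KEYS = {"num_ctx", "context_length", "parameter_size", "quantization", "architecture"}


def validate_profiles(data: dict) -> List[str]:
    """Return a list of problems found in a parsed model_profiles.json."""
    problems = []
    for section in ("exact", "families"):
        for name, entry in data.get(section, {}).items():
            where = f"{section}.{name}"
            if entry.get("tool_compat") not in RATINGS:
                problems.append(f"{where}: tool_compat must be one of {sorted(RATINGS)}")
            for issue in entry.get("known_issues", []):
                if ":" not in issue:
                    problems.append(f"{where}: known issue lacks 'name: description' form: {issue!r}")
            leaked = DYNAMIC_KEYS & set(entry)
            if leaked:
                problems.append(f"{where}: dynamic facts do not belong here: {sorted(leaked)}")
    return problems


def validate_file(path: str) -> List[str]:
    """Load and validate a profiles file from disk."""
    with open(path, encoding="utf-8") as fh:
        return validate_profiles(json.load(fh))
```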

Extending LiteLLM model support

Edit the generated config at ~/.cache/agentrun/litellm_config.yaml, or regenerate it:

claude-controller --setup-litellm

The config always includes drop_params: true. Do not remove it.

Runtime files

Path	Purpose
`~/.config/switchmodel/keys.env`	API keys for OpenAI, Gemini
`~/.cache/agentrun/backend_state.json`	Last-used backend and model
`~/.cache/agentrun/litellm_config.yaml`	Generated LiteLLM config
`~/.cache/agentrun/litellm.log`	LiteLLM proxy log
`~/.claude/projects/*/.jsonl`	Claude Code session logs (read by cc-health)
