llama-dash turns a self-hosted local inference box into an observable, policy-controlled AI gateway: one UI for model state, request history, API keys, routing rules, proxy metrics, and client setup. The currently implemented inference backend is llama-swap over llama.cpp.
It is the single public entrypoint for OpenAI-compatible and Anthropic-compatible clients. llama-dash owns proxy policy, logging, auth, routing, and backend normalization; your selected inference backend owns local model processes and inference when traffic is routed to local models.
```
 OpenAI SDK / Claude Code / Continue / Open WebUI
                      │
                      ▼
               llama-dash :3000
   dashboard · auth · logs · routing · metrics
          │                        │
          ▼                        ▼
  llama-swap :8080         direct /v1 upstreams
llama.cpp models · peers     OpenAI · Anthropic
```
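From a client's point of view, llama-dash is the only endpoint that exists. A minimal TypeScript sketch with the OpenAI SDK (the model name and the `LLAMA_DASH_API_KEY` variable are placeholders for your own values):

```ts
import OpenAI from "openai";

// llama-dash is the only endpoint this client ever sees; the key is a
// llama-dash API key, not an upstream provider key.
const client = new OpenAI({
  baseURL: "http://localhost:3000/v1",
  apiKey: process.env.LLAMA_DASH_API_KEY,
});

const completion = await client.chat.completions.create({
  model: "my-local-model", // placeholder: a model name from your llama-swap config
  messages: [{ role: "user", content: "Hello through the gateway" }],
});

console.log(completion.choices[0].message.content);
```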
- Dashboard — live stats, sparklines, model timeline, upstream health, GPU monitoring, SSE-backed live refresh with ETag polling fallback, and update status.
- Model management — load/unload models, per-model stats, load history, config snippet.
- Request logging — every completed `/v1/*` call is queued for SQLite logging with searchable UI, histogram, and detail view.
- Transparent proxy — streaming SSE preserved, bounded body capture for logs, token counts scraped in-flight. Both OpenAI (`/v1/chat/completions`) and Anthropic (`/v1/messages`) shapes are supported — for example, point Claude Code at llama-dash via `ANTHROPIC_BASE_URL` to proxy and track your Claude Code usage as well (see the request sketch after this list).
- API keys — per-key rate limits (RPM/TPM), model allow-lists editable from the detail page, hashed at rest, per-key stats and model usage breakdown.
- Dashboard auth — Better Auth username/password and passkey session gate for the UI and `/api/*` with first-visit signup; `/v1/*` proxy auth stays API-key based.
- Policies — custom routing rules with real proxy enforcement for continue, model-rewrite, and policy-reject actions, plus explicit auth passthrough, direct HTTPS `/v1` upstream targets, encrypted upstream credential injection, per-key system prompt injection, and global request size limits.
- Attribution — configurable header mapping for client, end-user, and session metadata with setup examples for common clients.
- Request auditing — per-key usage tracking across all proxied calls.
- Prometheus metrics — `/metrics` exposes proxy request, token, latency-window, queue, upstream, running-model, and GPU gauges.
- GPU monitoring — NVIDIA, AMD, and Apple Silicon. VRAM, utilization, temperature, power.
- Config editor — edit llama-swap `config.yaml` in-browser with on-demand validation, enforced pre-save schema checks, and auto-reload.
- Inference backend facade — backend health, model list/running state, lifecycle actions, logs, and config are capability-driven so future runtimes can be added without weakening the llama-swap experience.
- Endpoints — copyable base URL, API key selector, code examples for curl, Python, TypeScript, Home Assistant, Claude Code, opencode, Open WebUI, and more.
- Playground — supports chat, image, speech, and transcription.
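As a concrete example of the Anthropic-shaped path through the proxy, a minimal `fetch` sketch against `/v1/messages` (the model name is a placeholder; whether it resolves to a local model or a direct upstream depends on your routing rules):

```ts
// Placeholder model name; llama-dash decides via routing rules whether this
// resolves locally or to a direct Anthropic upstream.
const res = await fetch("http://localhost:3000/v1/messages", {
  method: "POST",
  headers: {
    "content-type": "application/json",
    "x-api-key": process.env.LLAMA_DASH_API_KEY ?? "",
    "anthropic-version": "2023-06-01",
  },
  body: JSON.stringify({
    model: "claude-sonnet-4-5",
    max_tokens: 256,
    messages: [{ role: "user", content: "ping" }],
  }),
});
console.log(await res.json());
```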
- Give teammates API keys without exposing llama-swap directly on your network.
- See which models are running, which clients are using them, and where latency is coming from.
- Debug slow or failed requests with status, token usage, timing, routing, attribution, and upstream metadata in one place.
- Enforce model allow-lists, request size limits, model aliases, and routing rules before traffic reaches llama-swap/llama.cpp.
- Route Claude Code or other Anthropic clients through one observable gateway while preserving subscription/OAuth bearer flows.
- Keep Prometheus metrics and searchable SQLite request history for a single-box self-hosted AI stack.
Choose the compose file that matches your GPU vendor. Both setups use `./config/config.yaml` for llama-swap config, `./models/` for model files, and expose llama-dash on `http://localhost:3000`.
```sh
cp config/config.example.yaml config/config.yaml   # edit models
docker compose -f docker-compose.amd.yaml up -d
```

`docker-compose.amd.yaml` runs `ghcr.io/mostlygeek/llama-swap:rocm`, passes through `/dev/kfd` and `/dev/dri`, and also mounts `/dev/dri` into llama-dash so AMD GPU stats work in the dashboard.
```sh
cp config/config.example.yaml config/config.yaml   # edit models
docker compose -f docker-compose.nvidia.yaml up -d
```

`docker-compose.nvidia.yaml` runs `ghcr.io/mostlygeek/llama-swap:cuda` and requests `gpus: all` for the llama-swap service. This requires the NVIDIA Container Toolkit on the host.
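For the "edit models" step, a minimal sketch of one entry in `config/config.yaml`, assuming llama-swap's `models:`/`cmd` schema; the binary path, flags, and GGUF filename are placeholders, so check the llama-swap docs for the full schema:

```yaml
# Sketch of a single llama-swap model entry; paths and filenames are
# placeholders. ${PORT} is substituted by llama-swap at launch.
models:
  "my-local-model":
    cmd: >
      /app/llama-server --port ${PORT}
      -m /models/my-model-q4_k_m.gguf
    ttl: 300   # optional: unload after 300s idle
```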
- Node 24+
- pnpm
- A reachable llama-swap instance
```sh
cp .env.example .env   # edit INFERENCE_BASE_URL to point at your instance
pnpm install
pnpm db:migrate        # creates data/dash.db
pnpm dev               # http://localhost:5173
```

Copy `.env.example` to `.env` and fill in the values.
| Variable | Default | Notes |
|---|---|---|
| `INFERENCE_BACKEND` | `llama-swap` | Active inference backend. Only llama-swap is currently implemented. |
| `INFERENCE_BASE_URL` | `http://localhost:8080` | Inference backend base URL. No trailing slash. |
| `INFERENCE_INSECURE` | `false` | Skip TLS verification for inference backends with self-signed certs. |
| `INFERENCE_CONFIG_FILE` | (empty) | Absolute path to the backend config file. Required for the llama-swap config editor. |
| `DATABASE_PATH` | `data/dash.db` | SQLite file, relative to CWD. SQLite `:memory:` and `file:` URI paths are preserved for tests/special deployments. |
| `BETTER_AUTH_SECRET` | (none) | Secret for signing Better Auth session data; generate with `openssl rand -base64 33`. |
| `BETTER_AUTH_URL` | inferred | Optional external base URL for Better Auth redirects/cookies. Set this to the public HTTPS origin when using passkeys outside localhost. |
| `CREDENTIAL_ENCRYPTION_KEY` | (none) | 32+ character secret used to encrypt stored upstream provider credentials. Required before creating or injecting upstream credentials. |
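Putting the table together, a minimal `.env` sketch (the config path and both secrets are placeholders; generate your own):

```sh
# Example .env; secret values below are placeholders.
INFERENCE_BACKEND=llama-swap
INFERENCE_BASE_URL=http://localhost:8080
INFERENCE_CONFIG_FILE=/srv/llama/config.yaml
DATABASE_PATH=data/dash.db
BETTER_AUTH_SECRET=changeme-openssl-rand-base64-33
CREDENTIAL_ENCRYPTION_KEY=changeme-32-plus-character-secret
```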
See `docs/2026_05_03_inference_backends.md` for the backend abstraction, capability model, and future Ollama notes.
- `src/server/proxy/*` — the `/v1/*` pass-through: streaming SSE preserved, proxy context/body snapshots kept isolated, bounded request/response capture for logs, token counts scraped from responses as they fly by, and one queued SQLite row per completed request.
- `src/server/admin/*` — the `/api/*` admin surface consumed by the UI, with grouped route modules under `src/server/admin/routes/*` for models, requests, config, keys, aliases, routing, upstream credentials, settings, and system health. JSON GET responses support conditional ETag polling, `/api/events` streams lightweight dashboard events, and `/api/log-events` streams llama-swap logs only while the Logs page is mounted.
- `src/server/auth.ts` — Better Auth setup for dashboard username/password and passkey sessions; protects the UI and `/api/*`, not `/v1/*`. Signup is only allowed while no dashboard user exists.
- `src/server/gpu-poller.ts` — polls `nvidia-smi`/`rocm-smi`/`system_profiler` every 10s, caches the result in memory, and publishes GPU-change events for live dashboard refresh. AMD APUs use GTT (not VRAM) for actual usable memory; Apple shows unified memory and core count when available.
- `src/server/model-watcher.ts` — polls the inference backend running-model capability every 15s, diffs state, writes load/unload events to the `model_events` table, and publishes model-change events.
- `src/server/inference/*` — selected inference backend facade plus backend-specific adapters and hints.
- `src/server/llama-swap/client.ts` — typed client over llama-swap's HTTP API.
- `src/server/db/*` — Drizzle schema, SQLite initialization, and request/model-event indexes for common dashboard query paths. Apply migrations explicitly with `pnpm db:migrate`.
- `src/server/metrics.ts` — Prometheus text metrics for proxy requests, tokens, latency window gauges, queue depth/drops, upstream reachability, running models, and GPU gauges at `/metrics`.
- `Dockerfile`, `prod-server.mjs`, `docker-compose.amd.yaml`, `docker-compose.nvidia.yaml` — production container packaging for llama-dash by itself or bundled with llama-swap.
- `src/routes/*` — thin TanStack Start route entrypoints for `/`, `/login`, `/models`, `/models/:id`, `/requests`, `/logs`, `/system`, `/playground`, `/config`, `/settings`, `/keys`, `/keys/:id`, `/attribution`, `/policies`, `/endpoints`.
- `src/features/*` — feature-local page components and helpers grouped by route area (`dashboard`, `requests`, `keys`, `models`, `playground`, etc.).
- `src/lib/queries.ts` — TanStack Query hooks with SSE-driven cache invalidation for request, model, GPU, and system changes plus slow ETag polling fallback.
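To give a feel for the admin surface's live-update contract, a browser-context TypeScript sketch of the SSE-plus-ETag pattern described above (assumes an authenticated Better Auth session cookie; `/api/events` comes from the list above, everything else is illustrative):

```ts
// Subscribe to dashboard events over SSE; cookies carry the session.
const events = new EventSource("http://localhost:3000/api/events", {
  withCredentials: true,
});
events.onmessage = (e) => console.log("dashboard event:", e.data);

// Conditional-polling fallback: a 304 means nothing changed.
let etag: string | null = null;
async function pollJson(url: string): Promise<unknown | null> {
  const res = await fetch(url, {
    headers: etag ? { "if-none-match": etag } : {},
    credentials: "include",
  });
  if (res.status === 304) return null; // unchanged since last poll
  etag = res.headers.get("etag");
  return res.json();
}
```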
Route any Anthropic SDK (including claude-code) through llama-dash for logging, filtering, and per-request inspection. Supports Anthropic subscriptions. Traffic flows:
```
Claude Code ──► llama-dash :3000 (log + filter) ──► api.anthropic.com
```
Client config (`~/.claude/settings.json`):

```json
{
  "env": {
    "ANTHROPIC_BASE_URL": "http://<llama-dash-host>:3000"
  }
}
```

Leave `ANTHROPIC_AUTH_TOKEN` unset when using subscription OAuth — Claude Code manages the bearer itself and llama-dash passes it through unchanged.
In llama-dash, create an explicit routing rule under Policies that matches the `/v1/*` target path or Claude source model names, using the continue action, passthrough auth with the client `Authorization` header preserved, and the direct target `https://api.anthropic.com/v1`. All Anthropic requests are then transparently proxied through while llama-dash logs the traffic and applies your filters.
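Outside Claude Code, any Anthropic SDK client can be pointed the same way. A sketch with `@anthropic-ai/sdk` (the model name and localhost origin are placeholders; this assumes the Policies rule above forwards the client credential to the direct upstream):

```ts
import Anthropic from "@anthropic-ai/sdk";

// Point the SDK at llama-dash instead of api.anthropic.com; the gateway
// logs the request and forwards it per your routing rule.
const anthropic = new Anthropic({
  baseURL: "http://localhost:3000",
  apiKey: process.env.ANTHROPIC_API_KEY,
});

const msg = await anthropic.messages.create({
  model: "claude-sonnet-4-5", // placeholder model name
  max_tokens: 128,
  messages: [{ role: "user", content: "hello through llama-dash" }],
});
console.log(msg.content);
```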
This project was developed with significant assistance from LLMs. Architecture decisions, implementation, and documentation were all shaped through human-AI collaboration.
MIT