🦀 kani

LLM smart router. Classifies prompts by complexity and routes to the optimal model.

OpenAI API-compatible proxy — drop in as a base URL and let kani pick the right model automatically.

How it works

Request → Distilled Feature Classifier (15 dimensions) → Tier + Agentic Score → Capability Filter → Model Selection (round-robin) → Upstream Provider
                                                     │
                                                     └─ model unavailable → conservative default

Classification pipeline:

  1. Distilled feature classifier: a deterministic tokenCount plus 14 learned semantic dimensions
  2. Axis-based scoring: separate complexity and reasoning scores drive tier selection
  3. Independent agentic score: the agenticTask dimension is exposed as agentic_score without affecting the tier
  4. Conservative default: fall back to MEDIUM when the feature model is unavailable
  5. Capability filter: auto-detect vision/tools/json_mode requirements from the request and escalate to a capable model

Every request is logged to $XDG_STATE_HOME/kani/log/ (default: ~/.local/state/kani/log/) as training data for future model improvement.

Scoring approach

kani no longer relies on hand-maintained keyword lists or runtime LLM fallback for routing. The scorer is now distilled-feature-first:

  • compute tokenCount deterministically
  • infer 14 semantic dimensions (low / medium / high) using a learned multi-output classifier
  • compute separate complexity and reasoning axis scores from the dimensions
  • determine tier from these axis scores (SIMPLE / MEDIUM / COMPLEX / REASONING)
  • expose agentic_score from the agenticTask dimension independently (does not affect tier)
  • return a conservative default tier when the feature model, embedding configuration, embedding request, or prediction path is unavailable

Axis-based tier thresholds:

  • REASONING: reasoning_score >= 0.75 (4 reasoning dimensions averaged)
  • COMPLEX: complexity_score >= 0.8 (6 complexity dimensions averaged)
  • MEDIUM: complexity_score >= 0.5
  • SIMPLE: below all thresholds
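A minimal sketch of how the axis thresholds could turn into a tier. This is an illustration, not kani's scorer.py: the numeric mapping of low/medium/high to 0.0/0.5/1.0, the function names, the placeholder dimension names dimB to dimD, and checking REASONING before COMPLEX are all assumptions.

```python
# Sketch of axis-based tier selection (not kani's actual implementation).
# Assumption: semantic labels map to 0.0 / 0.5 / 1.0 before averaging.
LABEL_VALUE = {"low": 0.0, "medium": 0.5, "high": 1.0}


def axis_score(labels: dict, dimensions: list) -> float:
    """Average the numeric values of the given dimensions; missing ones count as low."""
    values = [LABEL_VALUE[labels.get(dim, "low")] for dim in dimensions]
    return sum(values) / len(values)


def select_tier(reasoning_score: float, complexity_score: float) -> str:
    """Apply the README thresholds, checking REASONING first (assumed order)."""
    if reasoning_score >= 0.75:
        return "REASONING"
    if complexity_score >= 0.8:
        return "COMPLEX"
    if complexity_score >= 0.5:
        return "MEDIUM"
    return "SIMPLE"


# reasoningMarkers appears in kani's logs; dimB..dimD are placeholder names.
labels = {"reasoningMarkers": "high", "dimB": "medium"}
reasoning = axis_score(labels, ["reasoningMarkers", "dimB", "dimC", "dimD"])
print(reasoning, select_tier(reasoning, 0.3))  # prints: 0.375 SIMPLE
```

Because each axis is an average, a single "high" dimension is not enough to reach REASONING on its own; several reasoning dimensions must agree.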

This makes routing behavior easier to improve with data, because changes come from retraining and calibration rather than runtime prompt engineering.

Quick start

Try without installing (uvx)

# Classify a prompt
uvx --from git+https://github.com/tumf/kani kani route "hello world"

# Start the proxy server
uvx --from git+https://github.com/tumf/kani kani serve

Local install

git clone https://github.com/tumf/kani.git && cd kani
uv sync

uv run kani route "hello world"
uv run kani serve

Usage — drop-in replacement for OpenAI / OpenRouter

kani speaks the OpenAI API. Change base_url and model; everything else stays the same.

Before (direct OpenAI)

from openai import OpenAI

client = OpenAI(
    api_key="sk-...",                          # OpenAI key
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "explain quicksort"}],
)

Before (OpenRouter)

from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",  # OpenRouter
    api_key="sk-or-...",                       # OpenRouter key
)

response = client.chat.completions.create(
    model="anthropic/claude-sonnet-4",
    messages=[{"role": "user", "content": "explain quicksort"}],
)

After (kani) — auto-routed

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:18420/v1",      # ← kani
    api_key="anything",                        # kani handles upstream auth
)

# kani picks the best model based on prompt complexity
response = client.chat.completions.create(
    model="kani/auto",
    messages=[{"role": "user", "content": "explain quicksort"}],
)

# Or pin a routing profile
response = client.chat.completions.create(
    model="kani/premium",  # always use best-quality models
    messages=[{"role": "user", "content": "prove P != NP"}],
)

curl

curl http://localhost:18420/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "kani/auto", "messages": [{"role": "user", "content": "hello"}]}'

That's it. Any tool or library that supports the OpenAI API works with kani — LangChain, LlamaIndex, Cursor, Continue, etc. Just point base_url at kani.

Routing profiles

Note: The routing profiles below are sample/reference defaults. Treat them as examples — you should tune the actual profile names, strategies, and model mappings to match your own workload and cost/quality goals.

| Profile | Strategy | Best for |
|---|---|---|
| kani/auto | Balanced cost/quality (default) | General use |
| kani/eco | Cheapest viable models | High volume, low stakes |
| kani/premium | Best quality models | Critical tasks |
| kani/agentic | Tool-use optimized | Agent workflows |

Capability-aware routing

kani automatically detects required capabilities from the request and routes to a model that supports them. If no model in the scored tier has the required capabilities, kani escalates to higher tiers.

Detected capabilities:

| Capability | Trigger |
|---|---|
| vision | image_url content block in messages |
| tools | tools or functions field in request |
| json_mode | response_format.type is json_object or json_schema |
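The trigger table can be read as a small function over the OpenAI-style request body. This is a hedged sketch, not kani's code: the function name is hypothetical, and treating any non-null tools/functions field as a tools requirement is an assumption.

```python
def detect_capabilities(request: dict) -> set:
    """Derive required capabilities from an OpenAI-style chat request body (sketch)."""
    required = set()
    for message in request.get("messages", []):
        content = message.get("content")
        # Multimodal content is a list of typed blocks; plain strings carry no images.
        if isinstance(content, list) and any(
            isinstance(block, dict) and block.get("type") == "image_url" for block in content
        ):
            required.add("vision")
    # Assumption: the field being present (non-null) is enough to require tools.
    if request.get("tools") is not None or request.get("functions") is not None:
        required.add("tools")
    response_format = request.get("response_format") or {}
    if response_format.get("type") in ("json_object", "json_schema"):
        required.add("json_mode")
    return required


request = {
    "messages": [{"role": "user", "content": [
        {"type": "text", "text": "what is in this picture?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
    ]}],
    "response_format": {"type": "json_object"},
}
print(sorted(detect_capabilities(request)))  # prints: ['json_mode', 'vision']
```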

Configuration: declare model metadata via prefix matching in config.yaml using model_rules:

model_rules:
  - prefix: "anthropic/claude-"
    capabilities: [vision, tools, json_mode]
  - prefix: "google/gemini-"
    capabilities: [vision, tools, json_mode]
  - prefix: "openai/gpt-4"
    capabilities: [vision, tools, json_mode]

model_rules is the primary model metadata surface. The older model_capabilities key remains a legacy alias and is normalized into model_rules only when model_rules is unset. When a request requires capabilities and no configured candidate declares the full required set, routing fails closed instead of selecting an incapable model.
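A sketch of prefix-based capability lookup combined with tier escalation and fail-closed behavior. Assumptions not stated above: the first matching prefix wins, tiers escalate in the order SIMPLE, MEDIUM, COMPLEX, REASONING, and the function names are hypothetical. The model IDs are taken from the sample configuration.

```python
from typing import Dict, List, Optional, Set, Tuple

TIER_ORDER = ["SIMPLE", "MEDIUM", "COMPLEX", "REASONING"]  # assumed escalation order


def capabilities_for(model: str, model_rules: List[dict]) -> Set[str]:
    """Capabilities of the first rule whose prefix matches (assumption: first match wins)."""
    for rule in model_rules:
        if model.startswith(rule["prefix"]):
            return set(rule.get("capabilities", []))
    return set()  # no rule matched: the model declares no capabilities


def pick_capable_model(scored_tier: str, tier_models: Dict[str, List[str]],
                       required: Set[str], model_rules: List[dict]) -> Optional[Tuple[str, str]]:
    """Search the scored tier, then higher tiers; None means routing fails closed."""
    start = TIER_ORDER.index(scored_tier)
    for tier in TIER_ORDER[start:]:
        for model in tier_models.get(tier, []):
            if required <= capabilities_for(model, model_rules):
                return tier, model
    return None


MODEL_RULES = [
    {"prefix": "anthropic/claude-", "capabilities": ["vision", "tools", "json_mode"]},
    {"prefix": "google/gemini-", "capabilities": ["vision", "tools", "json_mode"]},
    {"prefix": "openai/gpt-4", "capabilities": ["vision", "tools", "json_mode"]},
]
TIERS = {
    "SIMPLE": ["nvidia/gpt-oss-120b"],
    "MEDIUM": ["moonshotai/kimi-k2.5"],
    "COMPLEX": ["google/gemini-3.1-pro"],
}
print(pick_capable_model("SIMPLE", TIERS, {"vision"}, MODEL_RULES))  # prints: ('COMPLEX', 'google/gemini-3.1-pro')
```

Returning None instead of an arbitrary model mirrors the fail-closed rule: an image request is never silently sent to a model that cannot read it.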

API endpoints

| Endpoint | Method | Description |
|---|---|---|
| /v1/chat/completions | POST | Main proxy (OpenAI-compatible) |
| /v1/models | GET | List available models |
| /v1/route | POST | Debug: returns the routing decision without proxying |
| /admin/reload-config | POST | Admin-only safe config hot reload |
| /health | GET | Health + active config version metadata |

Routed responses include extra headers: X-Kani-Tier, X-Kani-Model, X-Kani-Score, X-Kani-Signals.

Configuration

config.yaml:

host: "0.0.0.0"
port: 18420
default_provider: openrouter
default_profile: auto

providers:
  openrouter:
    name: openrouter
    base_url: "https://openrouter.ai/api/v1"
    api_key: "${OPENROUTER_API_KEY}"
    # reasoning_style: openai | anthropic | dashscope | gemini | none (default: openai)
  cliproxy:
    name: cliproxy
    base_url: "http://127.0.0.1:8317/v1"
    api_key: "local-test-key"

profiles:
  auto:
    tiers:
      SIMPLE:
        # primary can be a single model or an ordered list for round-robin
        primary: ["google/gemini-2.5-flash", "google/gemini-2.5-flash-lite"]
        fallback: ["nvidia/gpt-oss-120b"]
      MEDIUM:
        primary:
          model: "moonshotai/kimi-k2.5"
          max_input_tokens: 128000
        fallback: null  # allowed; normalized to []
      COMPLEX:
        primary: "google/gemini-3.1-pro"
        fallback:
          - model: "anthropic/claude-sonnet-4.6"
            max_input_tokens: 200000
      REASONING:
        primary: "x-ai/grok-4-1-fast-reasoning"
        fallback: ["anthropic/claude-sonnet-4.6"]
      # provider: per-tier override (optional)

smart_proxy:
  fallback_backoff:
    enabled: true
    initial_delay_seconds: 5
    multiplier: 2
    max_delay_seconds: 300

  • ${VAR} syntax resolves environment variables
  • Provider resolution order is: model-entry provider > tier-level provider > default_provider
  • Configured model IDs are sent literally to the selected provider: kani does not treat the anthropic/ in anthropic/claude-sonnet-4.6 as a provider selector. To target a different provider, set a provider field on the model entry or tier
  • primary accepts a string, {model, provider, max_input_tokens} object, or a list of those; list entries are selected round-robin per profile+tier combination
  • fallback accepts the same string/object entries as primary; object entries can set max_input_tokens so candidates with a known input limit lower than the estimated prompt tokens are skipped
  • Candidates without max_input_tokens remain eligible because their input limit is unknown
  • fallback: null is accepted only at profiles.*.tiers.*.fallback and normalized to []
  • When primary fails, fallback attempts skip the failed primary candidate and deduplicate repeated model+provider entries
  • smart_proxy.fallback_backoff enables process-local exponential cooldowns for retryable non-streaming 429 / 5xx failures, keyed by model+provider
  • Cooled-down model+provider pairs are skipped during both primary selection and fallback execution; the same model on a different provider remains eligible
  • Successful recovery resets the failure streak for that exact model+provider pair, and restarting kani clears the in-memory cooldown registry
  • Config path: --config flag > $KANI_CONFIG env var > ./config.yaml > $XDG_CONFIG_HOME/kani/config.yaml > /etc/kani/config.yaml
  • Set KANI_ADMIN_TOKEN to enable POST /admin/reload-config (admin-only, separate from regular API keys)
  • Hot reload validates with strict=True and rejects non-reloadable field changes (host, port) with 409
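The selection and cooldown rules above can be sketched together. This is a hypothetical illustration, not kani's router.py: the class and function names, the global round-robin counter, rotating over only the currently eligible primaries, and the cooldown formula min(initial × multiplier^(streak−1), max) are assumptions chosen to match the fallback_backoff settings in the sample config.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Key = Tuple[str, str]  # (model, provider)


@dataclass(frozen=True)
class Candidate:
    model: str
    provider: str
    max_input_tokens: Optional[int] = None  # None: limit unknown, so always eligible

    @property
    def key(self) -> Key:
        return (self.model, self.provider)

    def fits(self, prompt_tokens: int) -> bool:
        return self.max_input_tokens is None or prompt_tokens <= self.max_input_tokens


def is_retryable(status: int, streaming: bool) -> bool:
    """Only non-streaming 429 / 5xx failures start a cooldown."""
    return not streaming and (status == 429 or 500 <= status <= 599)


@dataclass
class Cooldowns:
    """Process-local registry keyed by (model, provider); lost when the process restarts."""
    initial_delay_seconds: float = 5.0
    multiplier: float = 2.0
    max_delay_seconds: float = 300.0
    streaks: Dict[Key, int] = field(default_factory=dict)
    until: Dict[Key, float] = field(default_factory=dict)

    def record_failure(self, c: Candidate, now: float) -> float:
        streak = self.streaks.get(c.key, 0) + 1
        self.streaks[c.key] = streak
        delay = min(self.initial_delay_seconds * self.multiplier ** (streak - 1),
                    self.max_delay_seconds)
        self.until[c.key] = now + delay
        return delay

    def record_success(self, c: Candidate) -> None:
        # Recovery resets the streak for this exact model+provider pair only.
        self.streaks.pop(c.key, None)
        self.until.pop(c.key, None)

    def cooling(self, c: Candidate, now: float) -> bool:
        return self.until.get(c.key, float("-inf")) > now


_rr_counters: Dict[Tuple[str, str], int] = {}  # (profile, tier) -> next index


def pick_primary(profile: str, tier: str, primaries: List[Candidate],
                 prompt_tokens: int, cooldowns: Cooldowns, now: float) -> Optional[Candidate]:
    """Round-robin over eligible primaries with one counter per profile+tier."""
    eligible = [c for c in primaries
                if c.fits(prompt_tokens) and not cooldowns.cooling(c, now)]
    if not eligible:
        return None
    index = _rr_counters.get((profile, tier), 0)
    _rr_counters[(profile, tier)] = index + 1
    return eligible[index % len(eligible)]


def fallback_chain(failed: Candidate, fallbacks: List[Candidate],
                   prompt_tokens: int, cooldowns: Cooldowns, now: float) -> List[Candidate]:
    """Skip the failed primary, repeated pairs, too-small limits, and cooled-down pairs."""
    seen = {failed.key}
    chain: List[Candidate] = []
    for c in fallbacks:
        if c.key in seen or not c.fits(prompt_tokens) or cooldowns.cooling(c, now):
            continue
        seen.add(c.key)
        chain.append(c)
    return chain
```

With the sample settings, consecutive failures of one model+provider pair cool it down for 5, 10, 20, ... seconds, capped at 300; the same model behind another provider stays eligible because the key includes the provider.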

Smart-proxy context compaction

kani can optionally reduce context pressure for long-running conversations by compacting oversized message histories before proxying upstream (Phase A) and by pre-computing summaries in the background for reuse on later requests (Phase B).

All compaction behavior is opt-in and disabled by default. When disabled or when compaction fails, kani routes and proxies requests unchanged.

Configuration

Add a smart_proxy section to your config.yaml:

smart_proxy:
  context_compaction:
    enabled: true                        # master switch

    sync_compaction:
      enabled: true                      # Phase A: compact inline before proxying
      threshold_percent: 80.0            # compact when prompt ≥ 80% of context window
      protect_first_n: 1                 # turns to keep at head of conversation
      protect_last_n: 2                  # turns to keep at tail
      summary_profile: ""                # empty = resolve through default_profile

    background_precompaction:
      enabled: true                      # Phase B: pre-compute summaries async
      trigger_percent: 70.0              # start background job at 70% usage
      max_concurrency: 2                 # max parallel background jobs
      summary_ttl_seconds: 3600

    session:
      header_name: X-Kani-Session-Id    # client header for explicit session binding

    context_window_tokens: 128000        # assumed context window for threshold math

The model used for summary generation is chosen through a routing profile. Set summary_profile to a profile such as compress to route summaries through that profile; leave it empty to fall back to default_profile via the router's normal model resolution.
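A sketch of the threshold math and turn protection, using the defaults from the sample smart_proxy section. Assumptions: each message counts as one "turn", prompt size is already estimated in tokens, the function names are hypothetical, and background work is only scheduled when an explicit session header is present (per the session rules in the next part).

```python
from typing import List, Tuple


def compaction_action(prompt_tokens: int, has_session: bool,
                      context_window_tokens: int = 128_000,
                      threshold_percent: float = 80.0,
                      trigger_percent: float = 70.0) -> str:
    """Decide between inline compaction, a background job, or nothing (sketch)."""
    usage_percent = 100.0 * prompt_tokens / context_window_tokens
    if usage_percent >= threshold_percent:
        return "inline"       # Phase A may run with or without a session header
    if usage_percent >= trigger_percent and has_session:
        return "background"   # Phase B needs an explicit session to cache results
    return "none"


def split_for_compaction(messages: List[dict], protect_first_n: int = 1,
                         protect_last_n: int = 2) -> Tuple[List[dict], List[dict], List[dict]]:
    """Split into (head, middle, tail); only the middle would be summarized."""
    if len(messages) <= protect_first_n + protect_last_n:
        return messages, [], []  # everything is protected: nothing to summarize
    tail_start = len(messages) - protect_last_n
    return messages[:protect_first_n], messages[protect_first_n:tail_start], messages[tail_start:]
```

For example, with a 128,000-token window a 110,000-token prompt (about 86%) is compacted inline, while a 95,000-token prompt (about 74%) only schedules a background summary if the client sent a session header.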

Session identity

kani uses a stable session key only when the client sends the configured explicit session header:

  1. Explicit header — value of session.header_name (required for Phase B cache hits, persistence, incremental summarization, and background precompaction)
  2. No header — no session ID is derived; inline compaction may still run for an oversized request, but cache reuse and background precompaction are unavailable

The resolution mode is surfaced in the X-Kani-Compaction-Session response header as explicit or none.
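A small sketch of explicit-only session resolution. The function name is hypothetical, and treating an empty or whitespace-only header value as absent is an assumption; header names are matched case-insensitively because HTTP header names are case-insensitive.

```python
from typing import Mapping, Optional, Tuple


def resolve_session(headers: Mapping[str, str],
                    header_name: str = "X-Kani-Session-Id") -> Tuple[Optional[str], str]:
    """Return (session_id, mode); mode matches the X-Kani-Compaction-Session values."""
    wanted = header_name.lower()  # HTTP header names are case-insensitive
    for name, value in headers.items():
        if name.lower() == wanted and value.strip():
            return value.strip(), "explicit"
    return None, "none"  # no header: no session ID is derived from the messages


print(resolve_session({"x-kani-session-id": "my-session-1"}))  # prints: ('my-session-1', 'explicit')
```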

Operator telemetry

Each routed response includes compaction headers:

| Header | Values | Meaning |
|---|---|---|
| X-Kani-Compaction | off, skipped, inline, cached, failed | What compaction did |
| X-Kani-Compaction-Session | explicit, none | How the session was resolved |
| X-Kani-Compaction-Saved-Tokens | integer | Estimated tokens saved |

Structured log fields are emitted at INFO level on every compaction decision. Failures are logged at WARNING level and never propagate to the client.

Safe config hot reload (admin)

Reload the configuration without restarting the proxy through the admin-only endpoint:

# 1) set an admin token (separate from normal API keys)
export KANI_ADMIN_TOKEN="your-admin-token"

# 2) trigger reload after editing config.yaml
curl -X POST http://localhost:18420/admin/reload-config \
  -H "Authorization: Bearer ${KANI_ADMIN_TOKEN}"

Behavior:

  • Reload is applied only when strict config validation succeeds.
  • In-flight requests keep the state snapshot captured at request start.
  • Changes to host / port are rejected as non-reloadable with 409 and require process restart.
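A hypothetical sketch of the snapshot-and-swap pattern these rules describe. ConfigHolder, require_profiles, and the return messages are illustrative assumptions, not kani's API; only the 409 for host/port changes comes from this README.

```python
from typing import Callable, Tuple

NON_RELOADABLE_FIELDS = ("host", "port")


def require_profiles(config: dict) -> None:
    """Hypothetical stand-in for kani's strict validation."""
    if not isinstance(config.get("profiles"), dict):
        raise ValueError("profiles must be a mapping")


class ConfigHolder:
    def __init__(self, config: dict) -> None:
        self._config = config
        self.version = 1

    def snapshot(self) -> dict:
        """Called at request start; the request keeps this object until it finishes."""
        return self._config

    def reload(self, new_config: dict,
               validate: Callable[[dict], None] = require_profiles) -> Tuple[bool, str]:
        changed = [f for f in NON_RELOADABLE_FIELDS if new_config.get(f) != self._config.get(f)]
        if changed:
            return False, "409 non-reloadable: " + ", ".join(changed)
        try:
            validate(new_config)
        except ValueError as exc:
            return False, f"rejected: {exc}"
        # Swap the reference instead of mutating, so earlier snapshots are unaffected.
        self._config = new_config
        self.version += 1
        return True, "reloaded"
```

Replacing the whole config object, rather than editing it in place, is what lets in-flight requests finish against the state they started with.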

Docker Compose / local deployment

No additional services are required. Compaction state is persisted in SQLite under $XDG_DATA_HOME/kani/compaction.db (default: ~/.local/share/kani/compaction.db). Override with KANI_DATA_DIR.

# Verify compaction is active after startup:
curl -s http://localhost:18420/health | jq .
# Inspect a routed request's compaction outcome:
curl -v -X POST http://localhost:18420/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-Kani-Session-Id: my-session-1" \
  -d '{"model":"kani/auto","messages":[{"role":"user","content":"hello"}]}' \
  2>&1 | grep -i "x-kani-compaction"

Offline feature annotation

Runtime routing does not call an LLM. LLM usage is limited to offline dataset generation when logs are missing semantic labels.

Optional annotator configuration (for scripts/build_agentic_dataset.py --annotate-missing) can be set in config.yaml under feature_annotator, or overridden with env vars:

feature_annotator:
  model: "gemini-2.5-flash-lite"
  provider: "cliproxy"  # optional; defaults to default_provider

feature_annotator and llm_classifier connection details are provider-resolved. In config.yaml, set model + optional provider; do not set base_url or api_key directly in these sections.

| Env var | Default | Description |
|---|---|---|
| KANI_LLM_ANNOTATOR_MODEL | google/gemini-2.5-flash-lite | Annotation model |
| KANI_LLM_ANNOTATOR_BASE_URL | https://openrouter.ai/api/v1 | API endpoint |
| KANI_LLM_ANNOTATOR_API_KEY | $OPENROUTER_API_KEY | API key |

Priority is: CLI flags > env vars > config.yaml feature_annotator > built-in defaults.
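A sketch of that precedence for the annotator model only. The function and its cli_model parameter are hypothetical (this README does not name the CLI flag); base_url and api_key are left out because in config.yaml they come from provider resolution rather than from feature_annotator itself.

```python
import os
from typing import Mapping, Optional

BUILTIN_MODEL = "google/gemini-2.5-flash-lite"


def resolve_annotator_model(cli_model: Optional[str] = None,
                            env: Optional[Mapping[str, str]] = None,
                            feature_annotator: Optional[dict] = None) -> str:
    """CLI flag > KANI_LLM_ANNOTATOR_MODEL > config.yaml feature_annotator > built-in default."""
    env = os.environ if env is None else env
    if cli_model:
        return cli_model
    if env.get("KANI_LLM_ANNOTATOR_MODEL"):
        return env["KANI_LLM_ANNOTATOR_MODEL"]
    if feature_annotator and feature_annotator.get("model"):
        return feature_annotator["model"]
    return BUILTIN_MODEL


print(resolve_annotator_model(env={}, feature_annotator={"model": "gemini-2.5-flash-lite"}))
```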

Routing logs

All decisions are logged to $XDG_STATE_HOME/kani/log/routing-YYYY-MM-DD.jsonl (default: ~/.local/state/kani/log/):

{"timestamp":"2025-03-21T19:50:00","prompt_preview":"prove the Riemann...","tier":"REASONING","score":0.82,"confidence":0.87,"method":"distilled-features","agentic_score":1.0,"signals":{"tokenCount":38,"semanticLabels":{"reasoningMarkers":"high","agenticTask":"high"},"featureVersion":"v1"}}
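Because each line is one JSON object, the logs can be inspected with the standard library before building a dataset. A minimal sketch; the helper names are hypothetical, and the sample records are trimmed versions of the format shown above.

```python
import json
from collections import Counter
from typing import Iterable, List


def tier_counts(lines: Iterable[str]) -> Counter:
    """Count routing decisions per tier in JSONL lines."""
    counts: Counter = Counter()
    for line in lines:
        if line.strip():
            counts[json.loads(line).get("tier", "UNKNOWN")] += 1
    return counts


def missing_labels(lines: Iterable[str]) -> List[str]:
    """Prompt previews of records without semanticLabels (candidates for offline annotation)."""
    previews = []
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        if not record.get("signals", {}).get("semanticLabels"):
            previews.append(record.get("prompt_preview", ""))
    return previews


SAMPLE = [
    '{"tier":"REASONING","prompt_preview":"prove the Riemann...","signals":{"tokenCount":38,"semanticLabels":{"reasoningMarkers":"high"}}}',
    '{"tier":"SIMPLE","prompt_preview":"hello","signals":{"tokenCount":2}}',
]
print(tier_counts(SAMPLE))
print(missing_labels(SAMPLE))  # prints: ['hello']
```

Both helpers accept any iterable of lines, so an open file handle for a routing-YYYY-MM-DD.jsonl file can be passed directly.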

Use these logs to build distilled feature training data:

uv run python scripts/build_agentic_dataset.py \
  --output data/distilled_feature_dataset.json

When existing logs do not yet include semantic labels, you can annotate missing examples offline:

uv run python scripts/build_agentic_dataset.py \
  --annotate-missing \
  --output data/distilled_feature_dataset.json

Then train the multi-output feature classifier bundle:

uv run python scripts/train_classifier.py \
  --data data/distilled_feature_dataset.json \
  --output models

This writes models/feature_classifier.pkl with the sklearn multi-output classifier, per-dimension label encoders, weights, thresholds, and embedding metadata.

API key authentication

kani supports API key authentication to restrict proxy access. Keys are managed via the CLI and stored in $XDG_DATA_HOME/kani/api_keys.json.

When no keys are configured, all requests pass through without authentication (backward-compatible). As soon as one key is added, every API request must include a valid Authorization: Bearer <key> header.

# Create a key (auto-generated, shown once)
kani keys add hermes
#   kani-aBcDeFgH...  ← save this

# List keys (prefix only; secrets are not stored in plaintext)
kani keys list

# Remove a key by name or prefix
kani keys remove hermes

Using the key:

curl http://localhost:18420/v1/chat/completions \
  -H "Authorization: Bearer kani-aBcDeFgH..." \
  -H "Content-Type: application/json" \
  -d '{"model": "kani/auto", "messages": [{"role": "user", "content": "hello"}]}'

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:18420/v1",
    api_key="kani-aBcDeFgH...",  # kani API key
)

/health and /docs are exempt from authentication. No server restart is required; keys take effect immediately.
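A sketch of how "shown once, stored hashed" key handling and the pass-through rule could fit together. Assumptions: SHA-256 digests, a 13-character listing prefix, and the function names are illustrative only; kani's actual storage format in api_keys.json may differ.

```python
import hashlib
import hmac
import secrets
from typing import List, Optional, Tuple

EXEMPT_PATHS = {"/health", "/docs"}


def new_key() -> Tuple[str, dict]:
    """Return the plaintext key (show it once) and the record to persist."""
    plaintext = "kani-" + secrets.token_urlsafe(24)
    record = {
        "prefix": plaintext[:13],  # short, non-secret prefix for listing
        "sha256": hashlib.sha256(plaintext.encode()).hexdigest(),
    }
    return plaintext, record


def is_authorized(path: str, authorization: Optional[str], records: List[dict]) -> bool:
    if path in EXEMPT_PATHS or not records:
        return True  # exempt endpoint, or no keys configured (authentication disabled)
    if not authorization or not authorization.startswith("Bearer "):
        return False
    digest = hashlib.sha256(authorization[len("Bearer "):].encode()).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return any(hmac.compare_digest(digest, r["sha256"]) for r in records)
```

Because only digests are stored, a leaked key file does not reveal usable keys, and new records are picked up on the next check without restarting the server.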

CLI

kani serve [--config path] [--host 0.0.0.0] [--port 18420]
kani route "your prompt here" [--config path]
kani config [--config path]
kani keys add <name>
kani keys list
kani keys remove <name|prefix>

Architecture

src/kani/
├── scorer.py    # distilled feature scoring (15-dimensional classifier)
├── router.py    # Tier → model+provider mapping
├── proxy.py     # FastAPI OpenAI-compatible server
├── config.py    # YAML config loading, env var resolution
├── dirs.py      # XDG-compliant directory paths (config, data, logs)
├── logger.py    # JSONL routing log
└── cli.py       # Click CLI

Development

uv sync
uv run pytest tests/ -q    # 176 tests
uv run ruff check src/
uv run pyright src/

Credits

Scoring logic ported from ClawRouter (MIT license).

License

MIT
