Skip to content

Latest commit

 

History

History
332 lines (251 loc) · 7.97 KB

File metadata and controls

332 lines (251 loc) · 7.97 KB

llmfit REST API Guide

This document is for agent/client builders integrating with llmfit serve.

Purpose

llmfit serve exposes node-local model fit analysis (same core data used by TUI/CLI) over HTTP and serves a local web dashboard.

Primary use case:

  • Query each node in a cluster for top runnable models.
  • Aggregate externally (scheduler/controller/UI) for placement decisions.

Start the server

llmfit serve --port 8787

Global flags still apply:

llmfit --memory 24G --ram 64G --cpu-cores 16 --max-context 8192 serve --port 8787

Hardware overrides (--memory, --ram, --cpu-cores) are reflected in API responses, making the server report the overridden values instead of the detected hardware.

Base URL

Default local base URL:

http://127.0.0.1:8787

To expose outside localhost, pass --host 0.0.0.0.

If you are building from source and want the dashboard embedded in llmfit, build web assets first:

cd llmfit-web && npm ci && npm run build

Endpoints

GET /

Web dashboard entrypoint (same-origin UI for fit exploration).

GET /health

Liveness probe.

Example response:

{
  "status": "ok",
  "node": {
    "name": "worker-1",
    "os": "linux"
  }
}

GET /api/v1/system

Returns node identity + detected hardware.

Example response shape:

{
  "node": {
    "name": "worker-1",
    "os": "linux"
  },
  "system": {
    "total_ram_gb": 62.23,
    "available_ram_gb": 41.08,
    "cpu_cores": 14,
    "cpu_name": "Intel(R) Core(TM) Ultra 7 165U",
    "has_gpu": false,
    "gpu_vram_gb": null,
    "gpu_name": null,
    "gpu_count": 0,
    "unified_memory": false,
    "backend": "CPU (x86)",
    "gpus": []
  }
}

GET /api/v1/models

Returns filtered/sorted model-fit rows for this node.

Envelope shape:

{
  "node": { "name": "worker-1", "os": "linux" },
  "system": { "...": "..." },
  "total_models": 23,
  "returned_models": 10,
  "filters": { "...": "echo of query state" },
  "models": [
    {
      "name": "Qwen/Qwen2.5-Coder-7B-Instruct",
      "provider": "Qwen",
      "parameter_count": "7B",
      "params_b": 7.0,
      "context_length": 32768,
      "use_case": "Coding",
      "category": "Coding",
      "release_date": "2025-03-14",
      "is_moe": false,
      "fit_level": "good",
      "fit_label": "Good",
      "run_mode": "gpu",
      "run_mode_label": "GPU",
      "score": 86.5,
      "score_components": {
        "quality": 87.0,
        "speed": 81.2,
        "fit": 90.1,
        "context": 88.0
      },
      "estimated_tps": 42.5,
      "runtime": "llamacpp",
      "runtime_label": "llama.cpp",
      "best_quant": "Q5_K_M",
      "memory_required_gb": 5.8,
      "memory_available_gb": 12.0,
      "utilization_pct": 48.3,
      "notes": [],
      "gguf_sources": []
    }
  ]
}

GET /api/v1/models/top

Key scheduling endpoint. Same schema as /api/v1/models, but defaults to top 5 runnable entries.

Important behavior:

  • Defaults limit=5.
  • Excludes too_tight rows unless explicitly overridden (and top endpoint still keeps runnable semantics).

GET /api/v1/models/{name}

Path-constrained search. Equivalent to a text search scoped by {name}.

Useful for:

  • Client-side drilldown after selecting a model family.

Query parameters

Supported on /api/v1/models and /api/v1/models/top (also /api/v1/models/{name}):

  • limit (or alias n): max rows returned.
  • perfect: true|false (when true, only perfect fits).
  • min_fit: perfect|good|marginal|too_tight.
  • runtime: any|mlx|llamacpp.
  • use_case: general|coding|reasoning|chat|multimodal|embedding.
  • provider: provider substring filter.
  • search: free-text filter (name/provider/params/use-case/category).
  • sort: score|tps|params|mem|ctx|date|use_case.
  • include_too_tight: include unrunnable rows (defaults true for /models, false for /models/top).
  • max_context: per-request context cap used by memory estimation.
  • force_runtime: mlx|llamacpp|vllm — override automatic runtime selection during analysis (e.g. get llama.cpp recommendations on Apple Silicon instead of MLX).

Error handling

Invalid filter values return HTTP 400:

{
  "error": "invalid min_fit value: use perfect|good|marginal|too_tight"
}

Server errors return HTTP 500 with {"error": "..."}.

Client integration recommendations

1) Polling pattern for schedulers

For each node agent:

  1. Call /health.
  2. Call /api/v1/system.
  3. Call /api/v1/models/top?limit=K&min_fit=good.
  4. Attach node metadata and forward to your central scheduler.

2) Conservative placement defaults

For production placement, prefer:

min_fit=good
include_too_tight=false
sort=score
limit=5..20

3) Per-workload targeting

Examples:

  • Coding workloads: use_case=coding
  • Embedding workloads: use_case=embedding
  • Runtime constrained to llama.cpp fleet: runtime=llamacpp

4) Stable parsing

Treat unknown fields as forward-compatible additions:

  • Parse required fields you depend on.
  • Ignore unknown fields.

Curl examples

curl http://127.0.0.1:8787/health
curl http://127.0.0.1:8787/api/v1/system
curl "http://127.0.0.1:8787/api/v1/models?limit=20&min_fit=marginal&sort=score"
curl "http://127.0.0.1:8787/api/v1/models/top?limit=5&min_fit=good&use_case=coding"
curl "http://127.0.0.1:8787/api/v1/models/Mistral?runtime=any"

MCP Server Mode

llmfit can run as an MCP (Model Context Protocol) server over stdio, making it discoverable by AI agents (Claude, Cursor, etc.).

Start the MCP server

llmfit serve --mcp

Global hardware overrides still apply:

llmfit --memory 24G --ram 64G serve --mcp

MCP client configuration

Add to your MCP client config (e.g. claude_desktop_config.json):

{
  "mcpServers": {
    "llmfit": {
      "command": "llmfit",
      "args": ["serve", "--mcp"]
    }
  }
}

Available tools

Tool Description Parameters
get_system_specs Node hardware info (RAM, GPU, CPU) None
recommend_models Top models for this hardware limit?, use_case?, min_fit?, runtime?, license?, sort?
search_models Free-text model search query, limit?
plan_hardware Hardware requirements for a model model, context?, quant?, target_tps?
get_runtimes Installed inference runtimes None
get_installed_models Models in local runtimes None

NATS Event Publishing

When built with the nats feature, llmfit can publish hardware and model events to NATS for integration with coordination systems (e.g. Sympozium membrane).

Build with NATS support

cargo build --features nats

Enable event publishing

llmfit serve --send-events --nats-url nats://localhost:4222
llmfit serve --mcp --send-events  # also works with MCP mode

The NATS_URL environment variable is also supported.

Event subjects

Events are published to llmfit.{event_type}.{hostname}:

Subject Trigger Payload
llmfit.system.{hostname} Startup + every 60s System hardware specs
llmfit.fit.{hostname} After fit analysis Model fit summary
llmfit.plan.{hostname} After plan estimate Plan estimate
llmfit.runtimes.{hostname} Startup + on query Runtime availability
llmfit.installed.{hostname} Startup + on query Installed models

Event envelope

All events are wrapped in a common envelope:

{
  "timestamp": "1747058400",
  "hostname": "worker-1",
  "event_type": "system",
  "version": "1",
  "data": { ... }
}

Subscribe to events

nats sub 'llmfit.>'                    # all events from all nodes
nats sub 'llmfit.system.>'             # system specs from all nodes
nats sub 'llmfit.system.worker-1'      # system specs from specific node

Versioning notes

Current API prefix is v1.

If you build long-lived clients, pin to /api/v1/... and validate behavior with the local test script in scripts/test_api.py.