A no-nonsense coding agent that doesn't use 50K tokens of context to say "hello."
Minion is a purpose-built coding agent aimed at removing context bloat and keeping it out. Many agent frameworks use 20K-50K+ tokens when you've just said "hey." This happens because they load a large set of features and tools into the LLM's context.
This presents real challenges to running coding agents on local models, where you're often hosting the best AI you can, and you don't have much room for context.
Why do we care about this context?
- That first 50K of context is your fastest context for your model. We want that speed.
- That first 50K of context is where your model's attention mechanism is likely the best. We want that intelligence.
- That first 50K of context, riding along with every single message you send, adds to cost over time, even if you're on some API.
On a bare hey, the entire prompt minion sends is about 625 tokens:

| component | tokens |
|---|---|
| system prompt (`SYSTEM`) | ~98 |
| tool schemas (5 functions, `TOOLS`) | ~475 |
| the word "hey" | 1 |
| chat-template framing | ~50 |
| total | ~625 |

The variance is in the chat-template framing line: every server's chat template wraps the tool section differently (Qwen/Hermes add per-tool tags, llama.cpp adds a functions header, OpenAI injects its own), so the real total lands somewhere in the low 600s. The point is the floor, not the exact figure. That's roughly 30 to 80 times less than the harnesses that spend the first 20K–50K of your context before you've said anything, and that overhead is paid on every single turn.
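To make the arithmetic concrete, here is a small sketch that sums the approximate figures from the breakdown and compares the result with a 20K and a 50K harness prompt. The names `FLOOR_TOKENS` and `overhead_ratio` are made up for this illustration and are not part of minion.

```python
# Approximate figures copied from the breakdown above.
FLOOR_TOKENS = {
    "system prompt": 98,
    "tool schemas": 475,
    "user message": 1,
    "chat-template framing": 50,
}


def overhead_ratio(harness_tokens: int, floor_tokens: int) -> float:
    """How many times larger a harness's fixed prompt is than the floor."""
    return harness_tokens / floor_tokens


floor = sum(FLOOR_TOKENS.values())  # 624, i.e. "about 625"
for harness in (20_000, 50_000):
    ratio = overhead_ratio(harness, floor)
    print(f"{harness} tokens is about {ratio:.0f}x the floor of {floor}")
```

Since the framing figure varies by server, treat the ratios as ballpark values rather than measurements.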
You don't have to take our word for it. Point any harness at a local
server and say hey. Most print a token footer; minion does. The
number you see is the number that rides along every message you send,
and it's the cheapest thing to compare.
Nothing against the more feature-rich agents; we need those too. But we also need a very lightweight coding agent, and here it is.
Point it at any OpenAI-compatible endpoint — a local llama.cpp / vLLM / SGLang server, or a remote API like Z.ai or OpenAI itself — and start chatting with an agent that can read, write, edit, and run shell commands in your project.
The whole thing is one file (minion.py, ~3200 lines). No TUI framework, no
plugin system, no config file format. It reads from environment variables (and
~/.env), talks directly to the OpenAI SDK, and uses raw terminal escapes for
its interface. If you want to understand or modify how it works, you read one
file. That's the whole pitch.
It's built to survive the rough edges of self-hosted and open models: if the
server doesn't support native tool-calling, it falls back to parsing
<tool_call>…</tool_call> tags out of the model's text. If the server streams a
separate reasoning_content field (MiniMax-M3, DeepSeek-R1, etc.), it renders
that as a dim "thinking" block above the answer. It degrades gracefully rather
than demanding a perfect server.
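The tag-parsing fallback can be pictured roughly as follows. This is a minimal sketch, not the code in minion.py: it assumes each tag wraps a single JSON object with `name` and `arguments` keys (the Hermes-style convention), and `extract_tool_calls` is a hypothetical helper name.

```python
import json
import re

# Assumption: each <tool_call> tag wraps one JSON object shaped like
# {"name": ..., "arguments": {...}}.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def extract_tool_calls(text: str) -> tuple[str, list[dict]]:
    """Return (visible text with the tags removed, parsed tool calls)."""
    calls = []
    for raw in TOOL_CALL_RE.findall(text):
        try:
            payload = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed payloads instead of crashing
        if isinstance(payload, dict) and "name" in payload:
            calls.append({
                "name": payload["name"],
                "arguments": payload.get("arguments", {}),
            })
    visible = TOOL_CALL_RE.sub("", text).strip()
    return visible, calls
```

Skipping malformed payloads keeps the sketch short; a real agent would more likely report them back to the model so it can retry.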
pip install openai
export MINION_BASE_URL=http://localhost:8080/v1
export MINION_MODEL=your-model-name
export MINION_API_KEY=sk-noop # any string; local servers ignore it
python minion.py
If MINION_MODEL is unset, minion asks the server what it's serving.
If you'd rather have a minion command on your $PATH, install from this
repo:
pip install -e .
That registers a minion console script pointing at this checkout — edits you
make here are picked up immediately. Use pip install . (no -e) for a
non-editable install instead.
minion reads configuration from environment variables, and automatically loads
~/.env at startup (so you don't have to export things in every terminal).
MINION_BASE_URL=http://localhost:8080/v1
MINION_MODEL=your-model-name
MINION_API_KEY=sk-noop
Define named endpoints and switch between them at runtime:
MINION_SOURCES=local,zai
MINION_SOURCE_LOCAL_BASE_URL=http://localhost:8080/v1
MINION_SOURCE_LOCAL_API_KEY=sk-noop
MINION_SOURCE_ZAI_BASE_URL=https://api.z.ai/api/paas/v4
MINION_SOURCE_ZAI_API_KEY=$zai_test # $name = look up a key from env / ~/.env
MINION_SOURCE_ZAI_MODEL=glm-x-preview
See sources.example.env for a full annotated example.
Switch at runtime with /source [name]. The conversation context is preserved
across switches (use /reset if you want a clean slate).
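One way to read the `MINION_SOURCE_*` variables into a lookup table is sketched below. It assumes the variable layout shown above and treats a value starting with `$` as the name of another key to look up; `load_sources` is a hypothetical name, and the real parsing in minion.py may differ.

```python
def load_sources(env: dict[str, str]) -> dict[str, dict[str, str]]:
    """Build {source name: {base_url, api_key, model}} from env-style keys."""
    raw_names = env.get("MINION_SOURCES", "").split(",")
    names = [n.strip() for n in raw_names if n.strip()]
    sources = {}
    for name in names:
        prefix = f"MINION_SOURCE_{name.upper()}_"
        cfg = {}
        for field in ("BASE_URL", "API_KEY", "MODEL"):
            value = env.get(prefix + field)
            if value is None:
                continue  # field not set for this source
            if value.startswith("$"):
                # "$name" means: look the real value up under that key.
                value = env.get(value[1:], "")
            cfg[field.lower()] = value
        sources[name] = cfg
    return sources


example_env = {
    "MINION_SOURCES": "local,zai",
    "MINION_SOURCE_LOCAL_BASE_URL": "http://localhost:8080/v1",
    "MINION_SOURCE_LOCAL_API_KEY": "sk-noop",
    "MINION_SOURCE_ZAI_BASE_URL": "https://api.z.ai/api/paas/v4",
    "MINION_SOURCE_ZAI_API_KEY": "$zai_test",
    "MINION_SOURCE_ZAI_MODEL": "glm-x-preview",
    "zai_test": "placeholder-key",  # stand-in for a key kept in ~/.env
}
sources = load_sources(example_env)
```

Passing the environment in as a plain dict keeps the function easy to test; in practice you would hand it `os.environ` merged with the contents of `~/.env`.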

| flag | what it does |
|---|---|
| `--yolo` | start in never-prompt mode (auto-approve everything) |
| `--approval <all\|low\|medium\|high\|yolo>` | start with a non-default approval mode |
| `--source <name>` | start on a specific source |
| `--resume [target]` | resume a saved session; bare = most recent |
| `--session <id>` | start a fresh run attached to a specific session id |

minion auto-loads ~/.env at startup (override with MINION_ENV_FILE),
so per-user settings live in one place instead of being exported every shell.

| env var | what it does |
|---|---|
| `MINION_APPROVAL` | persistent default approval mode: all/low/medium/high/yolo (see below). CLI flags `--approval` / `--yolo` override it for a single run. |
| `MINION_BASE_URL` / `MINION_MODEL` / `MINION_API_KEY` | legacy single-source config (or the local fallback) |
| `MINION_SOURCES` / `MINION_SOURCE_*` | named multi-source endpoints |
| `MINION_HOME` / `MINION_SESSIONS_DIR` | where session JSON files are stored |
| `MINION_MALFORMED_STREAM_RETRIES` | max clean retries for malformed/truncated tool-call args or SSE streams before waiting for user input (default 2) |
| `MINION_REASONING_ONLY_CHARS` | reasoning-only stall cutoff before forcing a visible answer (default 36000; 0 disables) |
| `MINION_REASONING_ONLY_RETRIES` | forced-final-answer rescue attempts after a reasoning-only stall (default 1) |
| `MINION_TOOL_RESULT_CHARS` | per-tool-result char cap before it enters message history, to starve context-copying repetition (default 20000; 0 disables the cap, dedup still runs) |
| `MINION_RECOVERY_TEMPERATURE` / `MINION_RECOVERY_TOP_P` | standard sampler params used only for recovery retries (defaults 1.0 / 0.95; negative values omit them) |
| `MINION_RECOVERY_MIN_P` | min-p floor for recovery retries (llama.cpp extension via `extra_body`; default 0.02; negative omits it) |
| `MINION_RECOVERY_REPEAT_PENALTY` / `MINION_RECOVERY_REPEAT_LAST_N` | repeat penalty applied during recovery retries to lower a looping token's logit (defaults 1.2 / 512; negative omits them) |
| `MINION_RECOVERY_DRY_MULTIPLIER` / `MINION_RECOVERY_DRY_BASE` / `MINION_RECOVERY_DRY_ALLOWED_LENGTH` | DRY (Don't Repeat Yourself) anti-repetition params for recovery retries (defaults 0.8 / 1.75 / 2; set `MINION_RECOVERY_DRY_MULTIPLIER` to 0 to disable DRY) |
| `MINION_FORCED_FINAL_MAX_TOKENS` | token cap for the forced-final-answer rescue request (default 2048) |
| `MINION_MAX_TOKENS` | token cap for normal streaming requests (default 16000; 0 omits the cap) |
| `MINION_RISK_RETRIES` | connection retries for the command-risk classifier before prompting as high-risk (default 3) |
| `MINION_RISK_RETRY_SECONDS` | seconds to wait between command-risk classifier connection retries (default 1) |
| `MINION_SESSION_DESC_REFRESH` | refresh the model-generated session description every N turns (default 6; 0 disables) |
| `MINION_METRICS_URL` | optional endpoint to receive cumulative token-usage totals after each turn (default unset; see Metrics) |


| subcommand | what it does |
|---|---|
| `minion` | start the REPL |
| `minion sessions [query]` | list saved sessions, 10 per page (prints + exits); optional substring filter |
| `minion --sessions [query]` | alias for `minion sessions [query]` |


| command | what it does |
|---|---|
| `/source [name]` | list sources or switch to one (context preserved) |
| `/yolo` | toggle auto-approve for writes and bash |
| `/approval [level]` | show or set risk threshold (all/low/medium/high/yolo) |
| `/sessions [n]` | list recent sessions, or show one in full |
| `/resume [target]` | resume a past session (n/id/prefix/title) |
| `/save [title]` | save the current session (optional custom title) |
| `/delete [target]` | delete a saved session |
| `/compress` | summarize older turns into one, keep last 2 verbatim |
| `/compact` | alias for `/compress` |
| `/recover [note]` | force a visible checkpoint with recovery sampling after a bad stream |
| `/reset` | clear conversation, start a fresh session |
| `/clear` | alias for `/reset` |
| `/new` | alias for `/reset` |
| `/quit` | exit |

The prompt is a multi-line editor with a framed box:
- Enter submits; Alt+Enter or Ctrl+J inserts a newline
- Paste (bracketed-paste) inserts text verbatim, including newlines
- Up/Down navigate history; Left/Right move within the line
- Home/End jump to line start/end; Ctrl+U clears; Ctrl+C cancels
- Long lines word-wrap inside the box
Falls back to plain input() when stdin/stdout isn't a TTY.
Press Esc at any point during generation to stop the model and drop back to
the prompt. The stream is closed, partial output is discarded, and a synthetic
"you were interrupted" note is appended to context so the model knows what
happened. In-flight tool calls (e.g. a running run_bash) are not
cancelled — they run to completion. Ctrl+C kills the whole process if you need
a hard stop.
Every write / edit / bash call is risk-classified by a single cheap model call
before it runs. Levels: low (read-only or trivially reversible), medium
(modifies state but contained/reversible), high (destructive, hard to
reverse, or broad scope). The approval mode controls the maximum risk level
Minion may auto-allow:

| setting | prompts at | auto-allows |
|---|---|---|
| (default) / `all` | low + medium + high | none |
| `--approval low` | medium + high | low |
| `--approval medium` | high | low + medium |
| `--approval high` | none | low + medium + high |
| `--yolo` / `yolo` | none | everything; skips classifier |

The risk assessment is shown in brackets next to the prompt, so you have context for the decision:
allow rm -rf /tmp/foo? [risk: HIGH — recursive force delete in /tmp] [Y/n/esc]
At the prompt, press:
- Y (or Enter) to approve
- n to deny — the model is told the action was refused and can adapt
- Esc to stop the turn and drop back to the chat input so you can add more guidance. The escaped action is recorded as cancelled; if the model emitted multiple tool calls, any remaining ones are marked skipped so the context stays valid. A note is left so the model knows you pulled it back.
Auto-allowed calls print a one-liner:
↳ auto-allow [low] ls -la (read-only listing)
YOLO mode skips the classifier entirely. If the classifier call fails or returns
garbage, the action defaults to high (always prompts) so it errs on the side
of asking.
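The approval table boils down to a comparison between the classified risk and the configured threshold. The sketch below is an illustration with hypothetical names (`RISK_ORDER`, `needs_prompt`), not minion's actual code. It maps an unusable classifier result to `high`; how that case interacts with `--approval high` is not spelled out above, so the sketch simply follows the table.

```python
RISK_ORDER = ["low", "medium", "high"]


def needs_prompt(risk: str, mode: str = "all") -> bool:
    """Return True when the user must confirm an action of the given risk."""
    if mode == "yolo":
        return False  # yolo never prompts (and skips the classifier)
    if risk not in RISK_ORDER:
        risk = "high"  # failed or garbled classification counts as high
    if mode == "all":
        return True  # default: prompt at every level
    # Otherwise prompt only when the risk exceeds the auto-allow threshold.
    return RISK_ORDER.index(risk) > RISK_ORDER.index(mode)
```

For example, `needs_prompt("medium", "low")` is true because medium exceeds the `low` threshold, while `needs_prompt("low", "medium")` is false.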
Every chat is automatically saved to ~/.minion/sessions/ (override with
MINION_HOME or MINION_SESSIONS_DIR) — one JSON file per session holding
the exact message array the model sees plus a little metadata (id, title,
description, source, cwd, timestamps). Files are plain JSON and
human-readable/greppable.
- Auto-save happens after every model turn, so a crash or accidental close never loses your work. On Ctrl-D / Ctrl-C exit a grey `resume with: minion --resume <id>` hint is printed so you can pick right back up.
- The title is auto-derived from your first message; set a custom one with `/save <title>`.
- A short id (the 6-hex suffix) is shown in listings and accepted by `--resume` / `/resume`, so `minion --resume deadbe` works without typing the full timestamp.
- A model-generated description refreshes every `MINION_SESSION_DESC_REFRESH` turns (default 6; `0` disables) and appears as a dim subtitle under each session in `minion sessions` / `/sessions`, so it tracks the current task rather than freezing on the first message.
- Resume a session at startup with `minion --resume <target>` or mid-chat with `/resume <target>`. A `target` is a number from `/sessions`, a short id, a full session id, a unique id prefix, or an exact title. Bare `minion --resume` resumes your most recent session.
- On resume, the full conversation history is printed as a one-line-per-message recap (color-coded by role, tool calls shown as `→ name(...)`) so you immediately re-orient on what the chat was about.
- Discover saved sessions from the shell with `minion sessions` (prints and exits, no REPL). Add a substring query to filter: `minion sessions refactor` matches titles, descriptions, and ids. Listings show 10 sessions per page by default. Use `--page`/`-p` and `--limit`/`-n` to move through older sessions without loading every transcript.
- A resumed session reselects the source (endpoint + model) it was started on, so it lands on the same backend it was talking to.
- `/sessions <n>` shows the full transcript of a past session inline.
- Use `/sessions --page 2` for the next in-chat page. `/sessions 2` still opens session 2, so numeric selection keeps working.
- `/reset` starts a fresh session (it does not overwrite the old one).
$ minion sessions # browse the 10 most recent sessions, then exit
$ minion sessions --page 2 # browse the next page
$ minion sessions -n 5 -p 3 # page 3, 5 sessions per page
$ minion sessions refactor # filter sessions mentioning "refactor"
$ minion --resume 1 # resume the most recent session
$ minion --resume deadbe # resume by short id
$ minion --resume implement # resume the session titled "implement…"
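The target lookup described above (number, short id, full id, unique prefix, or exact title) could be sketched like this. It assumes session ids shaped like `20240101-120000-abcdef`, a list ordered most recent first, and a matching order chosen for the example; `resolve_target` is a hypothetical name and the real precedence in minion.py may differ.

```python
from typing import Optional


def resolve_target(target: str, sessions: list[dict]) -> Optional[dict]:
    """Pick a session; `sessions` is ordered most recent first."""
    if not target:
        return sessions[0] if sessions else None  # bare --resume
    # A number refers to the position in the listing (1 = most recent).
    if target.isdigit() and 1 <= int(target) <= len(sessions):
        return sessions[int(target) - 1]
    for session in sessions:
        short_id = session["id"].rsplit("-", 1)[-1]  # 6-hex suffix
        if target in (session["id"], short_id, session["title"]):
            return session
    # Fall back to an id prefix, accepted only when it is unambiguous.
    matches = [s for s in sessions if s["id"].startswith(target)]
    return matches[0] if len(matches) == 1 else None
```

Returning `None` for an ambiguous prefix leaves it to the caller to print an error instead of guessing.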
This is a deliberately lightweight take on session persistence — inspired by
how Hermes (hermes_state.py) stores sessions, but flat JSON files instead of
SQLite, since minion is a single local agent rather than a multi-platform
gateway.
Reasoning models sometimes stream a long reasoning_content block and then stop
without ever emitting visible content or a tool call — a silent, empty turn that
burns tokens for nothing. minion counts reasoning-only chars, and once the stream
reaches MINION_REASONING_ONLY_CHARS (default 36000) with no content or tool
call in sight, it cuts the stream and nudges the model to produce a visible answer
via the final_answer tool (FORCE_FINAL_NUDGE). After
MINION_REASONING_ONLY_RETRIES (default 1) forced-final attempts also stall,
minion gives up and returns to the chat input for guidance. Set the char limit to
0 to disable the cutoff. This is a plain char-count timeout — it makes no guess
about the content of the reasoning, only how much of it there is.
The same forced-final rescue path handles the no-signal case too: if a turn ends with reasoning but zero content and zero tool calls (and wasn't already cut), the char-count guard trips as if the limit had been hit, so a model that thinks in silence still gets one visible-answer nudge instead of a dead turn.
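The char-count guard can be modelled as a small loop over stream deltas. This sketch represents each delta as a `(reasoning_text, content_text, has_tool_call)` tuple, which is a simplification invented for the example, and `reasoning_stalled` is a hypothetical name. It also assumes that a limit of 0 disables only the mid-stream cut, not the end-of-stream check; the text above does not say which way that goes.

```python
def reasoning_stalled(chunks, limit: int = 36000) -> bool:
    """Return True when a turn produced reasoning but nothing visible.

    `chunks` is an iterable of (reasoning_text, content_text, has_tool_call).
    """
    reasoning_chars = 0
    for reasoning, content, has_tool_call in chunks:
        if content or has_tool_call:
            return False  # visible output arrived, so this is not a stall
        reasoning_chars += len(reasoning)
        if limit > 0 and reasoning_chars >= limit:
            return True  # cutoff reached: cut the stream here
    # Stream ended on its own: reasoning with no visible output still counts.
    return reasoning_chars > 0
```

A caller would respond to `True` by sending the forced-final nudge, and give up after the configured number of rescue attempts.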
A separate guard catches malformed or truncated tool-call args and SSE streams:
after MINION_MALFORMED_STREAM_RETRIES (default 2) clean retries, minion
stops retrying and waits for input. Each retry re-opens the stream with recovery
sampler params — more entropy and anti-repetition than a normal turn — to escape
the low-entropy attractor that produced the corruption.
Recovery retries (malformed-stream, reasoning-only-stall rescue, and /recover)
swap in a higher-entropy, anti-repetition sampler so the model doesn't collapse
back into the same broken output:

| knob | default | notes |
|---|---|---|
| `MINION_RECOVERY_TEMPERATURE` | 1.0 | raised (not lowered) so a repetition collapse gets more entropy, not a sharper greedy pass |
| `MINION_RECOVERY_TOP_P` | 0.95 | |
| `MINION_RECOVERY_MIN_P` | 0.02 | llama.cpp extension, rides in `extra_body`; negative omits it |
| `MINION_RECOVERY_REPEAT_PENALTY` | 1.2 | lowers a looping token's logit |
| `MINION_RECOVERY_REPEAT_LAST_N` | 512 | window for the repeat penalty |
| `MINION_RECOVERY_DRY_MULTIPLIER` | 0.8 | DRY anti-repetition; set to 0 to disable |
| `MINION_RECOVERY_DRY_BASE` | 1.75 | |
| `MINION_RECOVERY_DRY_ALLOWED_LENGTH` | 2 | |

Path/code punctuation (\n, :, ", *, /, \, `, ') are DRY
sequence breakers, so a long file path the model must emit verbatim is never
penalized as repetition. Normal turns pass no sampler params, so the server keeps
its own defaults unless a recovery path is in progress. Non-llama.cpp backends
ignore the unknown extra_body keys.
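As a rough picture of how these knobs might turn into request arguments, the sketch below builds keyword arguments for a chat-completion call from an environment dict. The helper name `recovery_sampler_kwargs` is hypothetical, and the `extra_body` key names (`min_p`, `repeat_penalty`, `repeat_last_n`, `dry_*`) are assumed to follow llama.cpp's server parameter names. The DRY sequence breakers are left out to keep it short.

```python
# Defaults copied from the table above; each value keeps its default's type.
DEFAULTS = {
    "TEMPERATURE": 1.0, "TOP_P": 0.95, "MIN_P": 0.02,
    "REPEAT_PENALTY": 1.2, "REPEAT_LAST_N": 512,
    "DRY_MULTIPLIER": 0.8, "DRY_BASE": 1.75, "DRY_ALLOWED_LENGTH": 2,
}


def recovery_sampler_kwargs(env: dict) -> dict:
    """Build sampler kwargs for a recovery retry from MINION_RECOVERY_* vars."""
    val = {}
    for key, default in DEFAULTS.items():
        raw = env.get(f"MINION_RECOVERY_{key}", default)
        val[key] = type(default)(float(raw))
    # Standard OpenAI params: a negative value means "omit".
    kwargs = {k.lower(): val[k] for k in ("TEMPERATURE", "TOP_P") if val[k] >= 0}
    # Backend extensions travel in extra_body; negatives are omitted too.
    extra = {k.lower(): val[k]
             for k in ("MIN_P", "REPEAT_PENALTY", "REPEAT_LAST_N") if val[k] >= 0}
    # DRY is sent only when its multiplier is positive (0 disables it).
    if val["DRY_MULTIPLIER"] > 0:
        for k in ("DRY_MULTIPLIER", "DRY_BASE", "DRY_ALLOWED_LENGTH"):
            extra[k.lower()] = val[k]
    if extra:
        kwargs["extra_body"] = extra
    return kwargs
```

With the OpenAI Python SDK the result can be splatted into the request, for example `client.chat.completions.create(model=..., messages=..., **recovery_sampler_kwargs(os.environ))`.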
You can trigger the same checkpoint path manually with /recover [optional note]
after interrupting a bad stream or once the prompt returns. The command appends a
manual recovery note to the conversation and immediately forces a final_answer
checkpoint with recovery sampling, instead of letting the model continue free-form.

| tool | args | notes |
|---|---|---|
| `read_file` | `path` | |
| `write_file` | `path`, `content` | overwrites; requires confirmation |
| `edit_file` | `path`, `old`, `new` | `old` must match exactly once |
| `list_dir` | `path` | |
| `run_bash` | `command` | requires confirmation |

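The "must match exactly once" rule for `edit_file` is easy to express on an in-memory string. This is a sketch of the rule only, with a hypothetical `apply_edit` helper; the real tool also reads and writes the file and goes through the approval flow.

```python
def apply_edit(text: str, old: str, new: str) -> str:
    """Replace `old` with `new`, refusing ambiguous or missing matches."""
    if not old:
        raise ValueError("old must not be empty")
    count = text.count(old)
    if count != 1:
        raise ValueError(f"old must match exactly once, found {count} matches")
    return text.replace(old, new, 1)
```

Rejecting zero or multiple matches forces the model to quote enough surrounding context to pin down a single location, which avoids edits landing in the wrong place.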
At startup (and after a /source / /yolo / /approval switch) minion
prints a one-line banner showing the model name, active source, approval
mode, and endpoint. The banner is printed into the normal scrollback —
there's no pinned/scroll-region status bar, so terminal scrollback works
normally and every line of output stays visible.
(An earlier version pinned a status bar at row 1 using a DECSTBM scroll region, like tmux/vim. It was removed because it broke terminal scrollback — lines scrolling off the top of the region never entered the scrollback buffer, so the chat became unscrollable in a plain terminal.)
Every request and streamed SSE chunk is appended to llamacpp.log next to the
script (JSONL). Useful for debugging what the model actually saw and returned.
minion records token usage for every model call (input, output, cache-read,
and reasoning tokens) and writes it into the session JSON under
~/.minion/sessions/. This is always on — the numbers are already in hand
from the stats footer, so persisting them costs nothing and a local usage log
is useful whether or not you point it anywhere. A saved session carries:
"input_tokens": 900,
"output_tokens": 230,
"cache_read_tokens": 600,
"reasoning_tokens": 120,
"api_calls": 2,
"started_at": 1700000000.0

The accounting matches the OpenAI usage convention (and what most dashboards
expect): input_tokens excludes cached prompt tokens (those are broken out
into cache_read_tokens), and output_tokens excludes reasoning tokens
(those are broken out into reasoning_tokens). Local llama.cpp servers report
via a timings object instead of streaming usage; both are normalized to the
same four fields. Totals accumulate across a session and are reloaded on
--resume / /resume, so they keep climbing across restarts rather than
resetting.
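The split into four fields can be illustrated with a small normalizer over an OpenAI-style `usage` payload. The input field names (`prompt_tokens`, `completion_tokens`, `prompt_tokens_details.cached_tokens`, `completion_tokens_details.reasoning_tokens`) are those of the OpenAI chat-completions usage object; `normalize_usage` is a hypothetical name, and the llama.cpp `timings` path is not covered here.

```python
def normalize_usage(usage: dict) -> dict:
    """Map an OpenAI-style usage dict onto the four session fields."""
    prompt = usage.get("prompt_tokens", 0)
    completion = usage.get("completion_tokens", 0)
    # The detail objects may be missing or null on some servers.
    cached = (usage.get("prompt_tokens_details") or {}).get("cached_tokens", 0)
    reasoning = (usage.get("completion_tokens_details") or {}).get(
        "reasoning_tokens", 0
    )
    return {
        "input_tokens": prompt - cached,          # excludes cached prompt tokens
        "output_tokens": completion - reasoning,  # excludes reasoning tokens
        "cache_read_tokens": cached,
        "reasoning_tokens": reasoning,
    }
```

Under this reading, the sample totals shown earlier would correspond to a raw report of 1500 prompt tokens (600 of them cached) and 350 completion tokens (120 of them reasoning).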
Optionally, set MINION_METRICS_URL to a POST endpoint and minion will also
push the same cumulative totals after each turn:
MINION_METRICS_URL=http://localhost:9121/api/tokens/push
The body is a small, dashboard-agnostic JSON blob:
{
"session_id": "20240101-120000-abcdef",
"model": "glm-4.6",
"source": "zai",
"input_tokens": 900,
"output_tokens": 230,
"cache_read_tokens": 600,
"reasoning_tokens": 120,
"api_calls": 2,
"started_at": 1700000000.0,
"ended_at": 1700000123.0
}

Totals are cumulative per session, so the endpoint can compute its own deltas between pushes. The push is fire-and-forget with a 1.5 s timeout and disables itself after the first failure: if the endpoint is down or absent, minion stops trying rather than adding latency to every turn. Leave the variable unset (the default) and nothing leaves the machine; only the local session-JSON totals remain.
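The "disable after the first failure" behaviour might look something like the sketch below. `MetricsPusher` and its `send` hook are hypothetical names introduced for the example, and the push here is synchronous for simplicity, whereas a real fire-and-forget push may well run off the main thread.

```python
import json
import urllib.request


class MetricsPusher:
    """POST cumulative totals; stop trying after the first failure."""

    def __init__(self, url, timeout=1.5, send=None):
        self.url = url
        self.timeout = timeout
        self.enabled = bool(url)  # unset URL means nothing is ever sent
        self._send = send or self._post  # `send` is injectable for testing

    def _post(self, payload: dict) -> None:
        request = urllib.request.Request(
            self.url,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(request, timeout=self.timeout):
            pass  # the response body is ignored

    def push(self, totals: dict) -> bool:
        """Return True if the payload was sent, False if skipped or failed."""
        if not self.enabled:
            return False
        try:
            self._send(totals)
            return True
        except Exception:
            self.enabled = False  # endpoint down or absent: never retry
            return False
```

Because the totals are cumulative, a dropped push loses nothing: the next successful push (in a later run) carries the full running total.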
minion was developed using the following models:
- minion (eating its own dog food)
- GLM 5.2 (Z.ai, open weights)
- MiniMax-M3 (MiniMax)
MIT License. See LICENSE.