llm-fuse exposes an LLM conversation as a FUSE filesystem. Each session shows up as a directory containing four files. You write a prompt to input, poll status until it shows idle, then read the model's response from output. The fourth file, usage.yaml, reports input tokens, cached input tokens, and output tokens along with cost information.
I built this because I wanted to script LLM interactions from the command line and couldn't find an easy way to do it.
A session directory has four files plus a writable config/ subdirectory:
$ llm-mount ~/sessions/scratch
Name: 01HXY7K9P5Q3MN8DQ2W4F6T0RZ
Path: /home/kev/sessions/scratch
Provider: fireworks
Model: qwen3p6-plus
$ ls ~/sessions/scratch
config input output status usage.yaml
Each file does one thing. config/ is covered later.
The input file accepts prompts. Anything you write to it counts as a complete prompt; the daemon submits the moment you close the handle. So this works:
$ echo "what's the capital of france" > ~/sessions/scratch/input
Reading input gives you the bare-form log of every prompt submitted through this session, one line per turn. Useful for tail -f input if you want a side process watching what got asked.
If you're writing a watcher of your own, listen for IN_CLOSE_WRITE, not IN_MODIFY. IN_MODIFY fires on partial buffered writes before the prompt has actually been submitted. IN_CLOSE_WRITE fires after the close round-trip completes, which is the point you can be sure the daemon has the full prompt.
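A watcher along those lines could look like the sketch below. It assumes inotifywait from inotify-tools is installed and that the prompt is already in the input log by the time the event fires. watch_prompts is a made-up helper name, not something llm-fuse ships.

```shell
# watch_prompts SESSION_DIR
# Prints each newly submitted prompt. Hypothetical helper, not part of llm-fuse.
watch_prompts() {
    # --monitor keeps inotifywait running; close_write maps to IN_CLOSE_WRITE.
    inotifywait --quiet --monitor --event close_write "$1/input" |
    while IFS= read -r event; do
        # input logs one line per turn, so the last line is the newest prompt.
        tail -n 1 "$1/input"
    done
}
```

Run it as watch_prompts ~/sessions/scratch in a side terminal. Swapping close_write for modify would reintroduce the partial-write problem described above.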
The output file is the streaming transcript. Tokens land in output as the model emits them. cat output returns the full conversation so far. tail -f output works during a generation, and it wakes up on every chunk because the daemon issues a real write through the FUSE layer per chunk to fire IN_MODIFY.
This wasn't the obvious thing to do. Kernel-side cache invalidation will get you fresh reads, but it bypasses fsnotify so external watchers stay asleep. Writing a sentinel byte through our own mount path pulls the inotify trigger that user-space tools depend on. The bytes themselves get discarded; the authoritative content is regenerated from session state on every read.
The status file holds the bare word idle or busy, reflecting whether a turn is in flight. It uses the same inotify-wake trick as output and fires on every flip. This is what makes a chat REPL practical: write a prompt, then loop on status until it drops back to idle, draining output as you go. The chat script in examples/ does exactly this.
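For a one-shot script, the same loop can be done with plain polling instead of inotify. This is a minimal sketch, not the logic from examples/llm-chat.sh. It assumes status already reads busy by the time the write to input returns; if that does not hold on your system, the loop could exit before the turn starts. llm_ask is a hypothetical name.

```shell
# llm_ask SESSION_DIR PROMPT [TIMEOUT_SECONDS]
# Hypothetical helper: submit one prompt, wait for idle, print the transcript.
llm_ask() {
    dir=$1
    prompt=$2
    remaining=${3:-60}

    # Closing the handle is what submits the prompt.
    printf '%s\n' "$prompt" > "$dir/input" || return 1

    # Poll once per second until the turn finishes or the timeout runs out.
    while [ "$(cat "$dir/status")" != "idle" ]; do
        if [ "$remaining" -le 0 ]; then
            echo "timed out waiting for idle" >&2
            return 1
        fi
        sleep 1
        remaining=$((remaining - 1))
    done

    # output holds the whole conversation so far, not just the last reply.
    cat "$dir/output"
}
```

Usage would be something like llm_ask ~/sessions/scratch "say hi in three words" 30.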
The usage.yaml file is regenerated on every read. It reports cumulative tokens and cost for the session, with inline comments explaining each field:
# Cumulative token usage and cost across every completed turn in this
# session. Re-rendered each time you read the file.
sent_tokens: 47
context_tokens: 312
tokens_in: 359
tokens_in_cached: 0
tokens_out: 488
# Pricing for fireworks/qwen3p6-plus (microcents per token): input=50, ...
cost: "0.0017"
sent_tokens is the new content you've sent across all turns. context_tokens is the chat template plus accumulated history that gets replayed on every request because the model is stateless. Because every turn replays everything before it, that cumulative figure grows roughly quadratically with the number of turns, even when each new prompt is short. tokens_in is the sum of the two. tokens_in_cached is the subset of input tokens that hit the provider's prompt cache.
Costs come from the per-model pricing block in your config. Self-hosted backends like Ollama get all-zero pricing, which is the canonical "no cost" form.
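To make the arithmetic concrete, here is a rough sketch in integer microcents. The rates mirror the sample Fireworks pricing ($0.50, $0.10 and $3.00 per million tokens, which is 50, 10 and 300 microcents per token). Billing cached tokens at the cached rate and the remainder of tokens_in at the full input rate is an assumption, and so is the absence of any rounding; the daemon's actual formula may differ. Both function names are made up for illustration.

```shell
# Rates in microcents per token, from the sample Fireworks pricing.
IN_RATE=50
CACHED_RATE=10
OUT_RATE=300

# cost_microcents TOKENS_IN TOKENS_IN_CACHED TOKENS_OUT
# Assumption: cached tokens are billed at the cached rate, the rest of
# tokens_in at the full input rate.
cost_microcents() {
    uncached=$(( $1 - $2 ))
    echo $(( uncached * IN_RATE + $2 * CACHED_RATE + $3 * OUT_RATE ))
}

# microcents_to_dollars N: one dollar is 100,000,000 microcents.
microcents_to_dollars() {
    printf '%d.%08d\n' $(( $1 / 100000000 )) $(( $1 % 100000000 ))
}

# Hypothetical session: 1000 input tokens, 400 of them cached, 200 output.
microcents_to_dollars "$(cost_microcents 1000 400 200)"   # prints 0.00094000
```

Integer arithmetic sidesteps floating-point drift, which matters when per-token prices are this small.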
llm-fused is the long-running daemon. It listens on a Unix domain socket and owns every active mount. llm-mount and llm-umount are user-facing CLIs that talk to the daemon over that socket.
The split exists because FUSE mounts have to outlive the command that creates them. If llm-mount did the mount itself, the kernel mount would die when the command returned. So the daemon holds the mount, and the CLIs are thin clients that ask it to start or stop one. Same shape as dockerd and docker.
The daemon process is meant to run under launchd (macOS) or systemd (Linux). User-level scope (systemctl --user) is usually the right one. The daemon needs write access to wherever you'll be mounting sessions. For my uses that includes my $HOME directory, which is why the sandboxing features of systemd and launchd aren't used.
You need Go 1.26.2 and a system that can mount FUSE filesystems (Linux with FUSE built in, or macOS with macFUSE). Build is driven through Taskfile:
$ task build
That produces three binaries in build/:
build/
├── llm-fused
├── llm-mount
└── llm-umount
The build target also runs gofmt -l cmd internal (fails on non-empty output) and go vet ./... before producing the binaries, so you don't ship something the linter would catch.
Tests live behind task test:
$ task test # unit tests + race tests
$ task coverage # coverage profile, reports the total: line
Real-FUSE integration tests are gated on LLM_FUSE_TESTS=1. The default task test never touches /dev/fuse, so it also runs on CI boxes that don't expose it.
Start the daemon:
$ ./build/llm-fused serve
In another shell, mount a session:
$ ./build/llm-mount ~/sessions/scratch
Name: 01HXY7K9P5Q3MN8DQ2W4F6T0RZ
Path: /home/kev/sessions/scratch
Provider: fireworks
Model: qwen3p6-plus
Send a prompt and read the reply:
$ echo "say hi in three words" > ~/sessions/scratch/input
$ cat ~/sessions/scratch/status
busy
$ # ...wait a moment...
$ cat ~/sessions/scratch/status
idle
$ cat ~/sessions/scratch/output
Hello there friend
$ cat ~/sessions/scratch/input
say hi in three words
output holds the model's responses. input is the bare-form log of every prompt you've submitted. Both grow as the session progresses. Check what it cost:
$ grep -v '^#' ~/sessions/scratch/usage.yaml
sent_tokens: 5
context_tokens: 13
tokens_in: 18
tokens_in_cached: 0
tokens_out: 4
cost: "0.0000"
When you're done, unmount it:
$ ./build/llm-umount ~/sessions/scratch
Unmounted: /home/kev/sessions/scratch
That's the whole interface.
examples/llm-chat.sh is a minimal interactive chat client built entirely on top of the FUSE mount. It uses inotifywait to wake up on output and status changes, and dd to drain the streaming bytes incrementally. It does not own the mount lifecycle — bring one up first, point the script at it, and it gives you a REPL on top.
$ llm-mount ~/sessions/repl
$ examples/llm-chat.sh ~/sessions/repl
> what's the airspeed velocity of an unladen swallow?
African or European?
> european
About 11 metres per second...
>
The script is about a hundred lines including comments, and most of those comments are explaining race conditions I hit and fixed. It's a good read if you're considering writing your own client.
Config is loaded once at startup from these paths, in this order (later wins):
/etc/llm-fuse/config.yaml
/usr/local/etc/llm-fuse/config.yaml
$XDG_CONFIG_HOME/llm-fuse/config.yaml
./llm-fuse.yaml # cwd, developer override
Then LLM_FUSE_* environment variables override anything from the files. The cwd file is named llm-fuse.yaml rather than config.yaml so it's self-identifying when it shows up in someone's project directory.
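When a setting is not taking effect, it helps to see which of those files actually exist on the machine. The sketch below prints the candidates in load order. Falling back to $HOME/.config when XDG_CONFIG_HOME is unset follows the XDG convention, but whether the daemon applies that fallback is an assumption. config_candidates is a hypothetical helper.

```shell
# config_candidates: print the config search path in load order (later wins).
# Hypothetical helper; the $HOME/.config fallback is an assumption.
config_candidates() {
    printf '%s\n' \
        /etc/llm-fuse/config.yaml \
        /usr/local/etc/llm-fuse/config.yaml \
        "${XDG_CONFIG_HOME:-$HOME/.config}/llm-fuse/config.yaml" \
        ./llm-fuse.yaml
}

# Show only the files that are present, lowest priority first.
config_candidates | while IFS= read -r f; do
    if [ -f "$f" ]; then
        echo "found: $f"
    fi
done
```

Run it from the directory where the daemon starts, since the last entry is relative to the working directory.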
A working config looks like this:
daemon:
  socket_path: /run/user/1000/llm-fuse.sock
llm:
  default_provider: fireworks
  providers:
    fireworks:
      default_model: qwen3p6-plus
      base_url: https://api.fireworks.ai/inference/v1/chat/completions
      api_key_env: FIREWORKS_AI_TOKEN
      pricing_unit: mtoken
      models:
        qwen3p6-plus:
          id: accounts/fireworks/models/qwen3p6-plus
          pricing:
            input_cost: "0.50"
            cached_input_cost: "0.10"
            output_cost: "3.00"
    ollama:
      default_model: llama3
      base_url: http://localhost:11434/v1/chat/completions
      models:
        llama3:
          id: llama3:latest
          pricing:
            input_cost: "0"
            cached_input_cost: "0"
            output_cost: "0"
A few things worth knowing:
api_key_env names an environment variable, not the value. The daemon reads the variable's contents at session-create time and uses that as the bearer token. Self-hosted providers (Ollama, vLLM, anything without auth) leave api_key_env empty and the daemon skips the Authorization header.
Pricing values are quoted strings. YAML would otherwise float-coerce them into something that won't round-trip cleanly through validation, and you'd discover this six weeks later when your cost field was still 0.0.
A provider can carry as many models as you want under its models: block. The default_model key picks which one a session binds to when the caller doesn't say. To pick something else, pass --provider and/or --model to llm-mount:
$ llm-mount --provider fireworks --model qwen3p6-plus ~/sessions/qwen
$ llm-mount --model llama3 ~/sessions/local # provider stays at the default
Provider and model both fall back to defaults: --provider to llm.default_provider, --model to that provider's default_model.
Mount paths are not in the config. They're passed at session-create time, on the llm-mount command line. The config knows about providers and models; the user knows where they want their sessions to live.
If daemon.socket_path is unset, the daemon falls back to $XDG_RUNTIME_DIR/llm-fuse.sock, then $TMPDIR/llm-fuse-<uid>.sock, then /tmp/llm-fuse-<uid>.sock. You usually don't have to think about this. The default works.
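If a script of your own needs to find the socket, the fallback chain can be mirrored roughly like this. The sketch ignores daemon.socket_path from the config, and stripping a trailing slash from TMPDIR is a guess at how the daemon joins the path. llm_fuse_socket is a hypothetical name.

```shell
# llm_fuse_socket: print the default socket path, following the fallback chain.
# Hypothetical helper; does not read daemon.socket_path from the config.
llm_fuse_socket() {
    uid=$(id -u)
    if [ -n "${XDG_RUNTIME_DIR:-}" ]; then
        echo "$XDG_RUNTIME_DIR/llm-fuse.sock"
    elif [ -n "${TMPDIR:-}" ]; then
        # Assumption: a trailing slash on TMPDIR (common on macOS) is dropped.
        echo "${TMPDIR%/}/llm-fuse-$uid.sock"
    else
        echo "/tmp/llm-fuse-$uid.sock"
    fi
}
```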
Each mount has a writable config/ subdirectory. Drop files into it to control the next turn. Two file shapes are recognized; everything else is ignored.
Files matching the regex ^system_prompt_[0-9][0-9]\.md$ are concatenated lex-sorted by name, with a blank line between fragments, and prepended as a system message on every turn.
The two-digit zero-padded sequence is mandatory. Lex sort is the only sort applied, and without padding system_prompt_10.md would sort ahead of system_prompt_2.md, so the regex rejects unpadded names outright. Real names look like system_prompt_01.md, system_prompt_02.md, etc. up to system_prompt_99.md.
$ echo "you are terse" > ~/sessions/scratch/config/system_prompt_01.md
$ echo "never apologize" > ~/sessions/scratch/config/system_prompt_02.md
$ echo "what's 2+2" > ~/sessions/scratch/input
The model receives a system message with body you are terse\n\nnever apologize, then the user message. If you change the fragments mid-session, the next turn picks them up — at the cost of busting the provider's prompt cache for that session, since the system message is part of what gets cached.
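To preview what the system message would look like before sending a turn, the selection and joining rules can be approximated in shell. This is a sketch of the behavior described above, not the daemon's code. It assumes each fragment's trailing newline is trimmed before joining, which is what the you are terse\n\nnever apologize example suggests. assemble_system_prompt is a made-up name.

```shell
# assemble_system_prompt CONFIG_DIR
# Hypothetical helper: print the fragments joined by a blank line.
assemble_system_prompt() {
    first=1
    # LC_ALL=C makes sort compare bytes, i.e. plain lexical order.
    ls "$1" | grep -E '^system_prompt_[0-9][0-9]\.md$' | LC_ALL=C sort |
    while IFS= read -r name; do
        # Separate fragments with a blank line, but not before the first one.
        [ "$first" -eq 1 ] || printf '\n\n'
        first=0
        # Command substitution strips the trailing newline that echo left.
        printf '%s' "$(cat "$1/$name")"
    done
}
```

Calling assemble_system_prompt ~/sessions/scratch/config prints the combined text; names such as system_prompt_3.md are skipped because they fail the regex.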
If config/schema.json exists and contains a valid JSON Schema, the daemon attaches it to the request as response_format: {type: "json_schema", ...}. The provider returns a structured response that conforms to the schema.
$ cat > ~/sessions/scratch/config/schema.json <<'EOF'
{
  "type": "object",
  "properties": {
    "answer": {"type": "number"},
    "reasoning": {"type": "string"}
  },
  "required": ["answer"]
}
EOF
$ echo "what's 2+2" > ~/sessions/scratch/input
A malformed schema.json fails the turn fast — the parse error lands in the transcript tail ([error: parse schema.json: ...]) and the model is not called. Fix the file and try again.
response_format is a server-side capability that varies by provider. Set supports_structured_output: true on the provider in your config to opt in. Sessions bound to providers without the flag drop schema.json and log a one-shot warning the first time it would have been applied. Fireworks and OpenAI honor the field; Ollama supports a subset (format: "json") but not full schema, so leave its flag off.
config/ is a flat namespace. mkdir, rmdir, and rename return EPERM. Plain cp, rm, cat, and > redirection all work. Files are stored in memory on the session and don't survive unmount.
Templates live in dist/:
dist/
├── llm-fused.service # systemd unit (Linux)
├── llm-fused.plist # launchd plist (macOS)
└── llm-fuse.yaml.example # annotated config
The fastest path is install.sh, which detects the host OS via uname -s, copies the binaries to /usr/local/bin, drops the matching supervisor unit into the right place, and starts it:
$ task build
$ ./install.sh
Re-running is safe — files get overwritten and the service is restarted, which is what you want after rebuilding.
Provider tokens are not handled by the script. Set them in the supervisor's environment after install: a drop-in like ~/.config/systemd/user/llm-fused.service.d/secrets.conf for systemd, or the EnvironmentVariables block in the installed plist for launchd. Tokens never go in the YAML config.
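For the systemd case, one way to create that drop-in is sketched below. It copies the token from the current shell's environment so the value never appears on a command line. It assumes the variable is exported and contains no whitespace or quote characters, since the sketch does no escaping for systemd's Environment= syntax. write_secrets_dropin is a hypothetical helper, not part of install.sh.

```shell
# write_secrets_dropin DROPIN_DIR VAR_NAME
# Hypothetical helper: write a systemd drop-in exporting VAR_NAME to the unit.
write_secrets_dropin() {
    mkdir -p "$1" || return 1
    (
        # Restrict permissions so a newly created file is readable by you only.
        umask 077
        printf '[Service]\nEnvironment=%s=%s\n' "$2" "$(printenv "$2")" \
            > "$1/secrets.conf"
    )
}
```

With FIREWORKS_AI_TOKEN exported, write_secrets_dropin ~/.config/systemd/user/llm-fused.service.d FIREWORKS_AI_TOKEN writes the file. Follow it with systemctl --user daemon-reload and systemctl --user restart llm-fused so the daemon sees the new environment.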
If you'd rather do it by hand, the steps are below.
Install the binaries somewhere on $PATH. The shipped service templates expect /usr/local/bin:
$ sudo install -m 0755 build/llm-fused build/llm-mount build/llm-umount /usr/local/bin/
If you put them somewhere else, edit ExecStart= in dist/llm-fused.service (or the ProgramArguments path in dist/llm-fused.plist) to match.
For Linux, install at user scope:
$ mkdir -p ~/.config/systemd/user
$ cp dist/llm-fused.service ~/.config/systemd/user/
$ systemctl --user daemon-reload
$ systemctl --user enable --now llm-fused
For macOS:
$ cp dist/llm-fused.plist ~/Library/LaunchAgents/com.poiesic.llm-fused.plist
$ launchctl load ~/Library/LaunchAgents/com.poiesic.llm-fused.plist
The daemon also exposes a small HTTP API on the same Unix socket, useful if you want to drive sessions from something other than the CLIs:
GET /status # build version, uptime
GET /sessions # list active sessions
POST /sessions # create + mount a session
DELETE /sessions/{name} # unmount + remove a session
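The read-only routes are easy to poke at with curl, which can speak HTTP over a Unix socket through its --unix-socket option. A minimal sketch follows; the socket path in the usage line is the one from the sample config, and llm_api_get is a made-up wrapper name. The request body for POST /sessions is not covered here, so check api/openapi.yaml before trying to create sessions this way.

```shell
# llm_api_get SOCKET_PATH ROUTE
# Hypothetical wrapper: issue a GET against the daemon's Unix socket.
llm_api_get() {
    # The host in the URL is a placeholder; curl connects via the socket.
    curl --silent --show-error --fail --unix-socket "$1" "http://localhost$2"
}
```

For example, llm_api_get /run/user/1000/llm-fuse.sock /status should return the build version and uptime, and /sessions the list of active sessions.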
The OpenAPI document at api/openapi.yaml is generated from the Huma route definitions in internal/api. A typed Go client lives in internal/cli/client/gen, wrapped by a hand-written surface in internal/cli/client. If you want to regenerate either, task gen does it.
Each direct subpackage of internal/ is its own subsystem with its own tests. Coverage gate is 85% line coverage on hand-written packages. The codegen output under internal/cli/client/gen is exempt; its correctness is verified by integration tests through the hand-written wrapper.
Pre-1.0. There are likely bugs. You've been warned.
Filing issues is welcome. PRs are welcome too, though I'd appreciate a discussion before anything large lands — I have opinions about scope and want to keep this thing small.
Apache License 2.0. See LICENSE for the full text. Copyright 2026 Poiesic Systems.