llama-swap on macOS

Native llama-swap setup for local Qwen3.6 MTP models using Homebrew and launchd.

Files

config.yaml — llama-swap model config
local.llama-swap.plist.template — launchd template with __PROJECT_DIR__ placeholder
install-launchd.sh — generates a machine-local plist in ~/Library/LaunchAgents/ and starts the service
uninstall-launchd.sh — stops the service and removes the generated plist from ~/Library/LaunchAgents/
.gitignore — ignores local runtime files

Installed binaries

Installed via Homebrew:

brew tap mostlygeek/llama-swap
brew install llama-swap llama.cpp

The launchd installer auto-detects llama-swap from your PATH, so no machine-specific binary path needs to be committed.

Models exposed

Each model is backed by an Unsloth MTP GGUF loaded via llama-server -hf, so it is downloaded automatically on first load.

qwen3.6-27b
qwen3.6-27b:nothink
qwen3.6-35b-a3b
qwen3.6-35b-a3b:nothink

Tested hardware

Tested on an Apple Silicon Mac with 64 GB unified memory.

qwen3.6-27b works well on this setup
qwen3.6-35b-a3b also works, but with less memory headroom
MTP uses slightly more memory than standard GGUFs
closing other heavy local AI apps is recommended

The :nothink variants use enable_thinking: false without forcing a model reload.

MTP best-practice defaults applied from the Unsloth guide:

--spec-type draft-mtp
--spec-draft-n-max 2
UD-Q4_K_XL quant via Hugging Face auto-download

Coding-optimized sampling defaults applied:

thinking mode: temperature 0.6, top_p 0.95, presence_penalty 0.0
non-thinking mode: temperature 1.0, top_p 0.95, presence_penalty 1.5

To tune throughput further, Unsloth suggests benchmarking --spec-draft-n-max values from 1 to 6. Their default recommendation is 2, and they generally advise against going higher, so treat larger values as experiments rather than defaults.
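A sweep over --spec-draft-n-max can be set up by generating the candidate flag sets first. The sketch below is only an illustration: spec_flags is a hypothetical helper, not part of this repo, and it prints flags without starting llama-server or measuring anything.

```shell
#!/bin/sh
# spec_flags N: print the MTP speculative-decoding flags for draft length N.
# Hypothetical helper; rejects values outside the 1..6 range discussed above.
spec_flags() {
  case $1 in
    [1-6]) printf '%s\n' "--spec-type draft-mtp --spec-draft-n-max $1" ;;
    *) echo "spec-draft-n-max must be an integer from 1 to 6, got: $1" >&2
       return 1 ;;
  esac
}

# Print one candidate flag set per line.
for n in 1 2 3 4 5 6; do
  spec_flags "$n"
done
```

Each printed line could then be swapped into the llama-server command line in config.yaml, one at a time, while comparing tokens per second on the same prompt.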

Start manually

cd /path/to/llamaswap
llama-swap --config ./config.yaml --listen 127.0.0.1:8080 --watch-config

Endpoint:

http://127.0.0.1:8080

Install autostart

The repo contains no absolute paths. Instead, install-launchd.sh detects its own directory and renders local.llama-swap.plist.template into a machine-local plist under ~/Library/LaunchAgents/.

From the project directory:

cd /path/to/llamaswap
./install-launchd.sh

What it does:

stops any existing local.llama-swap user agent
renders a local plist from ./local.llama-swap.plist.template
auto-detects the llama-swap binary from your PATH
writes the result to ~/Library/LaunchAgents/local.llama-swap.plist
bootstraps the agent with launchctl
starts it immediately
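The render and detect steps above can be sketched roughly as follows. This is not the actual contents of install-launchd.sh, which may differ. render_plist and find_binary are hypothetical names, and how the detected binary path ends up in the plist is not shown because the template only documents the __PROJECT_DIR__ placeholder.

```shell
#!/bin/sh
# render_plist TEMPLATE PROJECT_DIR: print TEMPLATE with every
# __PROJECT_DIR__ placeholder replaced by PROJECT_DIR.
render_plist() {
  template=$1
  project_dir=$2
  # Escape characters that are special in a sed replacement:
  # backslash, ampersand, and the | used as delimiter below.
  escaped=$(printf '%s' "$project_dir" | sed 's/[\\&|]/\\&/g')
  sed "s|__PROJECT_DIR__|$escaped|g" "$template"
}

# find_binary NAME: print how the shell resolves NAME from PATH
# (an absolute path for an installed binary), or fail with a message.
find_binary() {
  command -v "$1" || { echo "$1 not found in PATH" >&2; return 1; }
}

# Example use (illustrative only):
#   bin=$(find_binary llama-swap) || exit 1
#   render_plist ./local.llama-swap.plist.template "$(pwd)" \
#     > ~/Library/LaunchAgents/local.llama-swap.plist
```

Escaping the project directory before substitution keeps the render step safe for paths that contain characters such as & or |.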

Stop / uninstall autostart

cd /path/to/llamaswap
./uninstall-launchd.sh

Restart

Fast restart of the loaded service:

launchctl kickstart -k gui/$(id -u)/local.llama-swap

Clean reinstall after changing the plist template:

cd /path/to/llamaswap
./install-launchd.sh

Logs

Service logs (from the project directory):

tail -f ./llama-swap.out.log
tail -f ./llama-swap.err.log

Health check

curl http://127.0.0.1:8080/health
curl http://127.0.0.1:8080/v1/models
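Right after a restart the endpoint may not answer immediately, so a script that depends on it can poll the health URL first. A minimal sketch, assuming curl is installed; wait_for_health is a hypothetical helper, and the default of 30 attempts is arbitrary.

```shell
#!/bin/sh
# wait_for_health [URL] [ATTEMPTS]: poll URL about once per second until it
# returns a successful HTTP status, or give up after ATTEMPTS tries.
wait_for_health() {
  url=${1:-http://127.0.0.1:8080/health}
  attempts=${2:-30}
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP errors; --max-time bounds each try.
    if curl -fsS -o /dev/null --max-time 2 "$url" 2>/dev/null; then
      return 0
    fi
    # Pause between tries, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep 1
    fi
    i=$((i + 1))
  done
  return 1
}
```

For example, `wait_for_health && curl http://127.0.0.1:8080/v1/models` lists the models only once the service responds.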

Pi coding agent setup

Pi can use llama-swap as an OpenAI-compatible local provider. Add the models to:

~/.pi/agent/models.json

Recommended configuration:

{
  "providers": {
    "llama-swap": {
      "baseUrl": "http://127.0.0.1:8080/v1",
      "api": "openai-completions",
      "apiKey": "local",
      "compat": {
        "supportsDeveloperRole": false,
        "supportsReasoningEffort": false,
        "maxTokensField": "max_tokens",
        "thinkingFormat": "qwen-chat-template"
      },
      "models": [
        {
          "id": "qwen3.6-27b",
          "name": "Qwen3.6 27B Thinking",
          "reasoning": true,
          "input": ["text"],
          "contextWindow": 262144,
          "maxTokens": 32768,
          "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 }
        },
        {
          "id": "qwen3.6-27b:nothink",
          "name": "Qwen3.6 27B No Thinking",
          "reasoning": false,
          "input": ["text"],
          "contextWindow": 262144,
          "maxTokens": 32768,
          "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 }
        },
        {
          "id": "qwen3.6-35b-a3b",
          "name": "Qwen3.6 35B A3B Thinking",
          "reasoning": true,
          "input": ["text"],
          "contextWindow": 262144,
          "maxTokens": 32768,
          "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 }
        },
        {
          "id": "qwen3.6-35b-a3b:nothink",
          "name": "Qwen3.6 35B A3B No Thinking",
          "reasoning": false,
          "input": ["text"],
          "contextWindow": 262144,
          "maxTokens": 32768,
          "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 }
        }
      ]
    }
  }
}

Notes:

apiKey is required by Pi's provider config, but llama-swap ignores it; any non-empty value works.
supportsDeveloperRole: false makes Pi send the main instruction as a system message, which is safer for llama.cpp OpenAI compatibility.
supportsReasoningEffort: false prevents Pi from sending OpenAI-specific reasoning_effort parameters.
The :nothink model IDs use the enable_thinking: false aliases configured in config.yaml and avoid a model reload.
After editing models.json, open /model in Pi; the file is reloaded when the model picker opens.
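The id values in models.json presumably need to match the model names that llama-swap exposes, so it can help to list them and compare against the /v1/models output. The sketch below uses a hypothetical list_model_ids helper built on plain text matching, not a JSON parser, and assumes each "id" key and its value sit on one line, as in the configuration above.

```shell
#!/bin/sh
# list_model_ids FILE: print each "id" value found in FILE, one per line.
# Crude grep/sed match; it does not validate the JSON.
list_model_ids() {
  grep -o '"id"[[:space:]]*:[[:space:]]*"[^"]*"' "$1" |
    sed 's/.*:[[:space:]]*"\([^"]*\)"$/\1/'
}

# Example use:
#   list_model_ids ~/.pi/agent/models.json
```

Any id printed here that is missing from the llama-swap model list is a likely typo in models.json.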

Example requests

List models:

curl http://127.0.0.1:8080/v1/models

Chat completion with thinking enabled:

curl http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen3.6-27b",
    "messages": [{"role": "user", "content": "Explain mmap simply."}],
    "stream": false
  }'

Chat completion with thinking disabled:

curl http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen3.6-27b:nothink",
    "messages": [{"role": "user", "content": "Explain mmap simply."}],
    "stream": false
  }'

Notes

Models are fetched directly from Hugging Face via llama-server -hf on first use.
MTP uses slightly more RAM/VRAM than standard GGUFs; keep roughly 1 GB of extra headroom per loaded model.
Unsloth notes that thinking mode and non-thinking mode use different recommended sampling params; those presets are configured in config.yaml.
llama-swap is configured with --watch-config, so config changes are picked up automatically, but a restart is still the cleanest option after larger edits.
