Native llama-swap setup for local Qwen3.6 MTP models using Homebrew and launchd.
- config.yaml: llama-swap model config
- local.llama-swap.plist.template: launchd template with a __PROJECT_DIR__ placeholder
- install-launchd.sh: generates a machine-local plist in ~/Library/LaunchAgents/ and starts the service
- uninstall-launchd.sh: stops the service and removes the generated plist from ~/Library/LaunchAgents/
- .gitignore: ignores local runtime files
Installed via Homebrew:
brew tap mostlygeek/llama-swap
brew install llama-swap llama.cpp

The launchd installer auto-detects llama-swap from your PATH, so no machine-specific binary path needs to be committed.
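The PATH lookup mentioned above can be sketched with the POSIX command -v builtin. This is a minimal illustration, not the installer's actual code; the function name find_on_path is hypothetical.

```shell
# Sketch only: the real install-launchd.sh may detect the binary differently.
# Print where a command resolves on PATH, or fail with a hint on stderr.
find_on_path() {
  if _found=$(command -v "$1"); then
    printf '%s\n' "$_found"
  else
    printf 'error: %s not found on PATH; install it with Homebrew first\n' "$1" >&2
    return 1
  fi
}

# Example:
#   llama_swap_bin=$(find_on_path llama-swap) || exit 1
```

Resolving the binary at install time keeps the committed template free of machine-specific paths such as a particular Homebrew prefix.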
Backed by Unsloth MTP GGUFs via llama-server -hf, so models are fetched automatically on first load.
- qwen3.6-27b
- qwen3.6-27b:nothink
- qwen3.6-35b-a3b
- qwen3.6-35b-a3b:nothink
Tested on an Apple Silicon Mac with 64 GB unified memory.
- qwen3.6-27b works well on this setup
- qwen3.6-35b-a3b also works, but with less memory headroom
- MTP uses slightly more memory than standard GGUFs
- closing other heavy local AI apps is recommended
The :nothink variants use enable_thinking: false without forcing a model reload.
MTP best-practice defaults applied from the Unsloth guide:
- --spec-type draft-mtp
- --spec-draft-n-max 2
- UD-Q4_K_XL quant via Hugging Face auto-download
Coding-optimized sampling defaults applied:
- thinking mode: temperature 0.6, top_p 0.95, presence_penalty 0.0
- non-thinking mode: temperature 1.0, top_p 0.95, presence_penalty 1.5
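If a client needs to send these values explicitly, the same numbers can go in the request body as standard OpenAI-style sampling fields. The sketch below only builds the JSON; the helper name chat_body is hypothetical, and whether request values or the config.yaml presets take precedence depends on how the presets are applied, which is not shown here.

```shell
# Sketch only: print a chat-completions request body that carries the
# sampling values listed above.
# Usage: chat_body MODE MODEL PROMPT   (MODE is "thinking" or "nothink")
# Assumption: PROMPT contains no quotes or backslashes, since no JSON
# escaping is done here.
chat_body() {
  _mode=$1
  _model=$2
  _prompt=$3
  case $_mode in
    thinking) _temp=0.6; _pp=0.0 ;;
    nothink)  _temp=1.0; _pp=1.5 ;;
    *) printf 'unknown mode: %s\n' "$_mode" >&2; return 2 ;;
  esac
  printf '{"model":"%s","messages":[{"role":"user","content":"%s"}],"temperature":%s,"top_p":0.95,"presence_penalty":%s,"stream":false}\n' \
    "$_model" "$_prompt" "$_temp" "$_pp"
}

# Example:
#   chat_body nothink qwen3.6-27b:nothink 'Explain mmap simply.' |
#     curl -fsS http://127.0.0.1:8080/v1/chat/completions \
#       -H 'Content-Type: application/json' -d @-
```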
If you want to tune throughput further, Unsloth suggests benchmarking --spec-draft-n-max values from 1 to 6. Their default recommendation is 2, and they generally advise against going higher.
cd /path/to/llamaswap
llama-swap --config ./config.yaml --listen 127.0.0.1:8080 --watch-config

Endpoint:
http://127.0.0.1:8080
The repo contains no absolute paths. Instead, install-launchd.sh detects its own directory and renders local.llama-swap.plist.template into a machine-local plist under ~/Library/LaunchAgents/.
From the project directory:
cd /path/to/llamaswap
./install-launchd.sh

What it does:
- stops any existing local.llama-swap user agent
- renders a local plist from ./local.llama-swap.plist.template
- auto-detects the llama-swap binary from your PATH
- writes the result to ~/Library/LaunchAgents/local.llama-swap.plist
- bootstraps the agent with launchctl
- starts it immediately
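The directory detection and template rendering steps above can be sketched as follows. This is an illustration under stated assumptions, not the contents of install-launchd.sh: the function names script_dir and render_plist are hypothetical, and sed is assumed as the substitution tool.

```shell
# Sketch only: the repo's install-launchd.sh may be written differently.

# Print the absolute directory containing the given script path, so the
# result does not depend on the caller's working directory.
script_dir() {
  ( CDPATH= cd -- "$(dirname -- "$1")" && pwd )
}

# Render TEMPLATE to stdout, replacing every __PROJECT_DIR__ with PROJECT_DIR.
# Usage: render_plist TEMPLATE PROJECT_DIR
# '|' is the sed delimiter, so the path may contain '/' but must not
# contain '|', '&' or backslashes.
render_plist() {
  sed "s|__PROJECT_DIR__|$2|g" "$1"
}

# Example:
#   project_dir=$(script_dir "$0")
#   render_plist "$project_dir/local.llama-swap.plist.template" "$project_dir" \
#     > "$HOME/Library/LaunchAgents/local.llama-swap.plist"
#   launchctl bootstrap "gui/$(id -u)" "$HOME/Library/LaunchAgents/local.llama-swap.plist"
```

Rendering at install time is what lets the committed template stay free of absolute paths.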
cd /path/to/llamaswap
./uninstall-launchd.sh

Fast restart of the loaded service:
launchctl kickstart -k gui/$(id -u)/local.llama-swap

Clean reinstall after changing the plist:
cd /path/to/llamaswap
./install-launchd.sh

Service logs (from the project directory):
tail -f ./llama-swap.out.log
tail -f ./llama-swap.err.log

curl http://127.0.0.1:8080/health
curl http://127.0.0.1:8080/v1/models

Pi can use llama-swap as an OpenAI-compatible local provider. Add the models to:
~/.pi/agent/models.json
Recommended configuration:
{
  "providers": {
    "llama-swap": {
      "baseUrl": "http://127.0.0.1:8080/v1",
      "api": "openai-completions",
      "apiKey": "local",
      "compat": {
        "supportsDeveloperRole": false,
        "supportsReasoningEffort": false,
        "maxTokensField": "max_tokens",
        "thinkingFormat": "qwen-chat-template"
      },
      "models": [
        {
          "id": "qwen3.6-27b",
          "name": "Qwen3.6 27B Thinking",
          "reasoning": true,
          "input": ["text"],
          "contextWindow": 262144,
          "maxTokens": 32768,
          "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 }
        },
        {
          "id": "qwen3.6-27b:nothink",
          "name": "Qwen3.6 27B No Thinking",
          "reasoning": false,
          "input": ["text"],
          "contextWindow": 262144,
          "maxTokens": 32768,
          "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 }
        },
        {
          "id": "qwen3.6-35b-a3b",
          "name": "Qwen3.6 35B A3B Thinking",
          "reasoning": true,
          "input": ["text"],
          "contextWindow": 262144,
          "maxTokens": 32768,
          "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 }
        },
        {
          "id": "qwen3.6-35b-a3b:nothink",
          "name": "Qwen3.6 35B A3B No Thinking",
          "reasoning": false,
          "input": ["text"],
          "contextWindow": 262144,
          "maxTokens": 32768,
          "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 }
        }
      ]
    }
  }
}

Notes:
- apiKey is required by Pi's provider config, but llama-swap ignores it; any non-empty value works.
- supportsDeveloperRole: false makes Pi send the main instruction as a system message, which is safer for llama.cpp OpenAI compatibility.
- supportsReasoningEffort: false prevents Pi from sending OpenAI-specific reasoning_effort parameters.
- The :nothink model IDs use the enable_thinking: false aliases configured in config.yaml and avoid a model reload.
- After editing models.json, open /model in Pi; the file is reloaded when the model picker opens.
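One way to confirm that the IDs in models.json match what llama-swap serves is to compare them against the /v1/models response. The sketch below assumes the usual OpenAI-style list shape (a data array of objects with an id field) and uses crude text matching instead of a JSON parser; the function names model_ids and check_models are hypothetical.

```shell
# Sketch only: text matching instead of real JSON parsing.
# Assumption: ids contain no escaped quotes.

# Print every "id" value found in the JSON read from stdin, one per line.
model_ids() {
  grep -o '"id"[[:space:]]*:[[:space:]]*"[^"]*"' | sed 's/.*"\([^"]*\)"$/\1/'
}

# Read a model list on stdin and report whether each expected id
# (given as arguments) is present. Returns non-zero if any is missing.
check_models() {
  _served=$(model_ids)
  _missing=0
  for _id in "$@"; do
    if printf '%s\n' "$_served" | grep -qxF -- "$_id"; then
      printf 'ok       %s\n' "$_id"
    else
      printf 'missing  %s\n' "$_id"
      _missing=1
    fi
  done
  return "$_missing"
}

# Example:
#   curl -fsS http://127.0.0.1:8080/v1/models |
#     check_models qwen3.6-27b qwen3.6-27b:nothink \
#                  qwen3.6-35b-a3b qwen3.6-35b-a3b:nothink
```

A JSON-aware tool would be more robust if one is available; the text matching here only keeps the sketch dependency-free.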
List models:
curl http://127.0.0.1:8080/v1/models

Chat completion with thinking enabled:
curl http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen3.6-27b",
    "messages": [{"role": "user", "content": "Explain mmap simply."}],
    "stream": false
  }'

Chat completion with thinking disabled:
curl http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen3.6-27b:nothink",
    "messages": [{"role": "user", "content": "Explain mmap simply."}],
    "stream": false
  }'

- Models are now fetched directly from Hugging Face via llama-server -hf on first use.
- MTP uses slightly more RAM/VRAM than standard GGUFs; keep roughly 1 GB of extra headroom per loaded model.
- Unsloth notes that thinking mode and non-thinking mode use different recommended sampling params; those presets are configured in config.yaml.
- llama-swap is configured with --watch-config, so config changes are picked up automatically, but a restart is still the cleanest option after larger edits.
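
After a restart, the /health endpoint used earlier can be polled until the service responds again. The loop below is a minimal sketch; the function name wait_for_health and its default of 30 attempts one second apart are assumptions, not values from this repo.

```shell
# Sketch only: poll a URL until it answers with a success status.
# Usage: wait_for_health URL [ATTEMPTS] [DELAY_SECONDS]
# Returns 0 as soon as one request succeeds, 1 if every attempt fails.
wait_for_health() {
  _url=$1
  _attempts=${2:-30}
  _delay=${3:-1}
  _i=1
  while [ "$_i" -le "$_attempts" ]; do
    # -f: treat HTTP errors as failure, -sS: quiet but still print errors,
    # -o /dev/null: discard the response body.
    if curl -fsS -o /dev/null "$_url"; then
      return 0
    fi
    _i=$((_i + 1))
    sleep "$_delay"
  done
  return 1
}

# Example:
#   launchctl kickstart -k "gui/$(id -u)/local.llama-swap"
#   wait_for_health http://127.0.0.1:8080/health && echo "llama-swap is up"
```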