LLM inference, optimized for your Mac
Continuous batching and infinite SSD caching, managed directly from your menu bar.
Install · Quickstart · GitHub
Every LLM server I tried made me choose between convenience and control. I wanted to pin everyday models in memory, auto-swap heavier ones on demand, set context limits - and manage it all from a menu bar.
oMLX persists the KV cache to SSD - even when the context changes mid-conversation, all past context stays cached and reusable across requests, making local LLMs practical for real coding work with tools like Claude Code. That's why I built it.
Download the .dmg from Releases, drag to Applications, done.
```bash
git clone https://github.com/jundot/omlx.git
cd omlx
pip install -e .
```

Requires Python 3.10+ and Apple Silicon (M1/M2/M3/M4).
Launch oMLX from your Applications folder. The Welcome screen guides you through three steps - model directory, server start, and first model download. That's it.
```bash
omlx serve --model-dir ~/models
```

The server discovers models from subdirectories automatically. Any OpenAI-compatible client can connect to http://localhost:8000/v1. A built-in chat UI is also available at http://localhost:8000/admin/chat.
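For example, a minimal streaming chat request with the official `openai` Python package might look like the sketch below. The model name is a placeholder (use any folder name from your model directory), and the API key can be anything unless the server was started with `--api-key`:

```python
# Minimal streaming chat request against the local oMLX server.
# The model name is a placeholder (use a folder name from your --model-dir);
# the API key can be anything unless the server was started with --api-key.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="Qwen3-Coder-Next-8bit",
    messages=[{"role": "user", "content": "Write a haiku about Apple Silicon."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```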
oMLX is built on top of vllm-mlx, extending it with paged SSD caching, multi-model serving, an admin dashboard, Claude Code optimization, and Anthropic API support. Currently supports text-based LLMs - VLM and OCR model support is planned for upcoming milestones.
macOS menu bar app - Start, stop, and monitor the server from the menu bar without opening a terminal.
Admin dashboard - Web UI at /admin for model management, chat, real-time monitoring, and per-model settings.
Built-in model downloader - Search and download MLX models from HuggingFace directly in the admin dashboard. No CLI or git clone needed.
Claude Code optimization - Context scaling for running smaller-context models with Claude Code. Reported token counts are scaled so that auto-compact triggers at the right time, and SSE keep-alives prevent read timeouts during long prefill.
Paged KV cache with SSD tiering - Block-based cache management inspired by vLLM, with prefix sharing and Copy-on-Write. When GPU memory fills up, blocks are offloaded to SSD. On the next request with a matching prefix, they're restored from disk instead of recomputed from scratch - even after a server restart.
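For intuition, the prefix-reuse idea can be sketched roughly as follows. This is a conceptual sketch only, not oMLX's actual block manager: tokens are grouped into fixed-size blocks, each block is keyed by a hash chained over the whole prefix, and a new request reuses every leading block whose key is already cached.

```python
# Conceptual sketch of block-based prefix reuse (not oMLX's actual block manager).
# Tokens are grouped into fixed-size blocks; each block is keyed by a hash of every
# token up to and including it, so requests that share a prefix share block keys.
import hashlib

BLOCK_SIZE = 16  # hypothetical block size, for illustration only

def block_keys(token_ids: list[int]) -> list[str]:
    """Return one cache key per full block, chained over the whole prefix."""
    keys, running = [], hashlib.sha256()
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for start in range(0, full, BLOCK_SIZE):
        running.update(str(token_ids[start:start + BLOCK_SIZE]).encode())
        keys.append(running.copy().hexdigest())
    return keys

def reusable_blocks(token_ids: list[int], cache: set[str]) -> int:
    """Count leading blocks whose KV data is already cached (in memory or on SSD)."""
    hits = 0
    for key in block_keys(token_ids):
        if key not in cache:  # first miss ends the shared prefix
            break
        hits += 1
    return hits
```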
Continuous batching - Handles concurrent requests through mlx-lm's BatchGenerator. Prefill and completion batch sizes are configurable.
Multi-model serving - Load LLMs, embedding models, and rerankers within the same server. Least-recently-used models are evicted automatically when memory runs low. Pin frequently used models to keep them loaded.
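Eviction with pinning can be pictured with the small sketch below. It is illustrative only, not the actual engine pool: the real server evicts by memory budget (see `--max-model-memory`), whereas the sketch uses a simple count.

```python
# Illustrative sketch of LRU eviction with pinning (not the actual oMLX engine pool;
# the real server evicts by memory budget via --max-model-memory, not by count).
from collections import OrderedDict

class ModelPool:
    def __init__(self, max_loaded: int):
        self.max_loaded = max_loaded
        self.loaded: OrderedDict[str, object] = OrderedDict()  # name -> model, oldest first
        self.pinned: set[str] = set()

    def use(self, name: str, model: object) -> None:
        """Record a use, then evict least-recently-used unpinned models if over budget."""
        self.loaded[name] = model
        self.loaded.move_to_end(name)
        while len(self.loaded) > self.max_loaded:
            victim = next((n for n in self.loaded if n not in self.pinned), None)
            if victim is None:  # everything left is pinned; nothing can be evicted
                break
            del self.loaded[victim]  # the real server would unload the model weights here
```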
API compatibility - Drop-in replacement for OpenAI and Anthropic APIs.
| Endpoint | Description |
|---|---|
| `POST /v1/chat/completions` | Chat completions (streaming) |
| `POST /v1/completions` | Text completions (streaming) |
| `POST /v1/messages` | Anthropic Messages API |
| `POST /v1/embeddings` | Text embeddings |
| `POST /v1/rerank` | Document reranking |
| `GET /v1/models` | List available models |
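For the Anthropic side, the official `anthropic` Python SDK can be pointed at the same server by overriding its base URL. In this sketch the model name is a placeholder and the key is ignored unless `--api-key` is configured:

```python
# Anthropic Messages API request routed to oMLX's /v1/messages endpoint.
# Model name is a placeholder; the key is ignored unless --api-key is configured.
import anthropic

client = anthropic.Anthropic(base_url="http://localhost:8000", api_key="not-needed")

message = client.messages.create(
    model="Qwen3-Coder-Next-8bit",
    max_tokens=256,
    messages=[{"role": "user", "content": "Summarize what continuous batching does."}],
)
print(message.content[0].text)
```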
Tool calling & structured output - Supports all function calling formats available in mlx-lm, JSON schema validation, and MCP tool integration. Tool calling requires the model's chat template to support the tools parameter. The following model families are auto-detected via mlx-lm's built-in tool parsers:
| Model Family | Format |
|---|---|
| Llama, Qwen, DeepSeek, etc. | JSON `<tool_call>` |
| Qwen3 Coder | XML `<function=...>` |
| Gemma | `<start_function_call>` |
| GLM (4.7, 5) | `<arg_key>`/`<arg_value>` XML |
| MiniMax | Namespaced `<minimax:tool_call>` |
| Mistral | `[TOOL_CALLS]` |
| Kimi K2 | `<\|tool_calls_section_begin\|>` |
| Longcat | `<longcat_tool_call>` |

Models not listed above may still work if their chat template accepts tools and their output uses a recognized `<tool_call>` XML format. Streaming requests with tool calls buffer all content and emit results at completion.
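A minimal tool-calling request through the OpenAI-compatible endpoint might look like the sketch below. The model name is a placeholder and must be a model whose chat template supports tools; the `get_weather` tool is hypothetical and defined only for this example:

```python
# Minimal tool-calling request; the model name is a placeholder and must be a model
# whose chat template supports the tools parameter (e.g. one of the families above).
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, defined only for this example
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="Qwen3-Coder-Next-8bit",
    messages=[{"role": "user", "content": "What's the weather in Cupertino?"}],
    tools=tools,
)

for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, json.loads(call.function.arguments))
```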
Point --model-dir at a directory containing MLX-format model subdirectories:
```
~/models/
├── Step-3.5-Flash-8bit/
├── Qwen3-Coder-Next-8bit/
├── gpt-oss-120b-MXFP4-Q8/
└── bge-m3/
```
Models are auto-detected by type. You can also download models directly from the admin dashboard.
| Type | Models |
|---|---|
| LLM | Any model supported by mlx-lm |
| Embedding | BERT, BGE-M3, ModernBERT |
| Reranker | ModernBERT, XLM-RoBERTa |
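Embedding models are served through the standard OpenAI embeddings endpoint; a minimal request against the bge-m3 folder from the example layout above might look like this:

```python
# Embedding request through /v1/embeddings; "bge-m3" matches the example layout above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

result = client.embeddings.create(model="bge-m3", input=["hello world", "goodbye world"])
print(len(result.data), "vectors of dimension", len(result.data[0].embedding))
```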
```bash
# Memory limit for loaded models
omlx serve --model-dir ~/models --max-model-memory 32GB

# Enable SSD cache for KV blocks
omlx serve --model-dir ~/models --paged-ssd-cache-dir ~/.omlx/cache

# Adjust batch sizes
omlx serve --model-dir ~/models --prefill-batch-size 8 --completion-batch-size 32

# With MCP tools
omlx serve --model-dir ~/models --mcp-config mcp.json

# API key authentication
omlx serve --model-dir ~/models --api-key your-secret-key
```

All settings can also be configured from the web admin panel at /admin. Settings are persisted to ~/.omlx/settings.json, and CLI flags take precedence.
Architecture
```
FastAPI Server (OpenAI / Anthropic API)
│
├── EnginePool (multi-model, LRU eviction)
│   ├── BatchedEngine (LLMs, continuous batching)
│   ├── EmbeddingEngine
│   └── RerankerEngine
│
├── Scheduler (FCFS, configurable batch sizes)
│   └── mlx-lm BatchGenerator
│
└── Cache Stack
    ├── PagedCacheManager (GPU, block-based, CoW, prefix sharing)
    └── PagedSSDCacheManager (SSD tier, safetensors format)
```
```bash
git clone https://github.com/jundot/omlx.git
cd omlx
pip install -e ".[dev]"
pytest -m "not slow"
```

Requires Python 3.11+ and venvstacks (`pip install venvstacks`).
```bash
cd packaging

# Full build (venvstacks + app bundle + DMG)
python build.py

# Skip venvstacks (code changes only)
python build.py --skip-venv

# DMG only
python build.py --dmg-only
```

See packaging/README.md for details on the app bundle structure and layer configuration.
We welcome contributions! See Contributing Guide for details.
- Bug fixes and improvements
- Performance optimizations
- Documentation improvements
- MLX and mlx-lm by Apple
- vllm-mlx - oMLX originated as a fork of vllm-mlx v0.1.0, since re-architected with multi-model serving, paged SSD caching, an admin panel, and a standalone macOS menu bar app
- venvstacks - Portable Python environment layering for the macOS app bundle
- mlx-embeddings - Embedding model support for Apple Silicon