oMLX

LLM inference, optimized for your Mac
Continuous batching and infinite SSD caching, managed directly from your menu bar.


Install · Quickstart · GitHub


oMLX Admin Dashboard

Every LLM server I tried made me choose between convenience and control. I wanted to pin everyday models in memory, auto-swap heavier ones on demand, set context limits - and manage it all from a menu bar.

oMLX persists KV cache to SSD - even when context changes mid-conversation, all past context stays cached and reusable across requests, making local LLMs practical for real coding work with tools like Claude Code. That's why I built it.

Install

macOS App

Download the .dmg from Releases, drag to Applications, done.

From Source

git clone https://github.com/jundot/omlx.git
cd omlx
pip install -e .

Requires Python 3.10+ and Apple Silicon (M1/M2/M3/M4).

Quickstart

macOS App

Launch oMLX from your Applications folder. The Welcome screen guides you through three steps - model directory, server start, and first model download. That's it.

oMLX Welcome Screen

CLI

omlx serve --model-dir ~/models

The server discovers models from subdirectories automatically. Any OpenAI-compatible client can connect to http://localhost:8000/v1. A built-in chat UI is also available at http://localhost:8000/admin/chat.
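Any OpenAI SDK can then be pointed at that base URL. A minimal sketch using the official openai Python package, assuming the default port and no --api-key (so the key below is only a placeholder):

```python
# Minimal sketch: connect an OpenAI-compatible client to the local oMLX server.
# Assumes the server runs on http://localhost:8000 and no --api-key is set,
# so the api_key value is just a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# List the models discovered in --model-dir, then chat with the first one.
models = client.models.list()
model_id = models.data[0].id  # e.g. a subdirectory name from your model dir

response = client.chat.completions.create(
    model=model_id,
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
```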

Features

oMLX is built on top of vllm-mlx, extending it with paged SSD caching, multi-model serving, an admin dashboard, Claude Code optimization, and Anthropic API support. Currently supports text-based LLMs - VLM and OCR model support is planned for upcoming milestones.

macOS menu bar app - Start, stop, and monitor the server from a native menu bar app, without opening a terminal.

Admin dashboard - Web UI at /admin for model management, chat, real-time monitoring, and per-model settings.

Built-in model downloader - Search and download MLX models from HuggingFace directly in the admin dashboard. No CLI or git clone needed.

oMLX Model Downloader

Claude Code optimization - Context scaling support for running smaller context models with Claude Code. Scales reported token counts so that auto-compact triggers at the right timing, and SSE keep-alive prevents read timeouts during long prefill.
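As a rough illustration of the idea (a conceptual sketch only - the exact ratio oMLX applies is an assumption, not taken from its code): usage is reported as if the model had the larger context window the client assumes, so the client compacts its history before the local context actually overflows.

```python
# Conceptual sketch only - not oMLX's implementation. Shows why scaling
# reported token counts makes a client's auto-compact trigger at the right
# time when the local model's context is smaller than the client assumes.
def scale_reported_tokens(actual_tokens: int,
                          client_assumed_context: int = 200_000,
                          model_context: int = 32_768) -> int:
    """Report usage as if the model had the client's assumed context window."""
    scale = client_assumed_context / model_context
    return int(actual_tokens * scale)

# A request that used 8k of a 32k-token model is reported as ~49k of 200k,
# so the client compacts its history well before the local context overflows.
print(scale_reported_tokens(8_000))  # -> 48828
```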

Paged KV cache with SSD tiering - Block-based cache management inspired by vLLM, with prefix sharing and Copy-on-Write. When GPU memory fills up, blocks are offloaded to SSD. On the next request with a matching prefix, they're restored from disk instead of recomputed from scratch - even after a server restart.
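Conceptually, the lookup order works like the sketch below (hypothetical names and simplified hashing - not oMLX's actual cache classes): blocks are keyed by a hash of their full prefix, checked in GPU memory first, then on SSD, and only the remaining blocks are recomputed during prefill.

```python
# Conceptual sketch - hypothetical names, not oMLX's actual cache classes.
# Illustrates the lookup order behind paged prefix caching with an SSD tier.
import hashlib

BLOCK_SIZE = 16  # tokens per block (illustrative)

def block_keys(token_ids: list[int]) -> list[str]:
    """Hash each full block together with everything before it (prefix hash)."""
    keys, running = [], hashlib.sha256()
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for i in range(0, full, BLOCK_SIZE):
        running.update(str(token_ids[i:i + BLOCK_SIZE]).encode())
        keys.append(running.hexdigest())
    return keys

def fetch_blocks(keys, gpu_cache: dict, ssd_cache: dict):
    """Reuse GPU-resident blocks, restore SSD-resident ones, recompute the rest."""
    reused, restored, missing = [], [], []
    for key in keys:
        if key in gpu_cache:
            reused.append(key)
        elif key in ssd_cache:
            gpu_cache[key] = ssd_cache[key]   # promote back to GPU memory
            restored.append(key)
        else:
            missing.append(key)               # prefill recomputes these blocks
    return reused, restored, missing
```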

Continuous batching - Handles concurrent requests through mlx-lm's BatchGenerator. Prefill and completion batch sizes are configurable.

Multi-model serving - Load LLMs, embedding models, and rerankers within the same server. Least-recently-used models are evicted automatically when memory runs low. Pin frequently used models to keep them loaded.

API compatibility - Drop-in replacement for OpenAI and Anthropic APIs.

| Endpoint | Description |
| --- | --- |
| POST /v1/chat/completions | Chat completions (streaming) |
| POST /v1/completions | Text completions (streaming) |
| POST /v1/messages | Anthropic Messages API |
| POST /v1/embeddings | Text embeddings |
| POST /v1/rerank | Document reranking |
| GET /v1/models | List available models |
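For the Anthropic side, a minimal sketch with the official anthropic Python SDK pointed at the local server (again assuming the default port and no --api-key; pick a model name from GET /v1/models):

```python
# Minimal sketch: use the Anthropic SDK against oMLX's /v1/messages endpoint.
# Assumes the default port and no --api-key (the key below is a placeholder).
from anthropic import Anthropic

client = Anthropic(base_url="http://localhost:8000", api_key="not-needed")

message = client.messages.create(
    model="Qwen3-Coder-Next-8bit",  # substitute a model from GET /v1/models
    max_tokens=256,
    messages=[{"role": "user", "content": "Summarize what a KV cache is."}],
)
print(message.content[0].text)
```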

Tool calling & structured output - Supports all function calling formats available in mlx-lm, JSON schema validation, and MCP tool integration. Tool calling requires the model's chat template to support the tools parameter. The following model families are auto-detected via mlx-lm's built-in tool parsers:

| Model Family | Format |
| --- | --- |
| Llama, Qwen, DeepSeek, etc. | JSON `<tool_call>` |
| Qwen3 Coder | XML `<function=...>` |
| Gemma | `<start_function_call>` |
| GLM (4.7, 5) | `<arg_key>`/`<arg_value>` XML |
| MiniMax | Namespaced `<minimax:tool_call>` |
| Mistral | `[TOOL_CALLS]` |
| Kimi K2 | `<\|tool_calls_section_begin\|>` |
| Longcat | `<longcat_tool_call>` |

Models not listed above may still work if their chat template accepts tools and their output uses a recognized <tool_call> XML format. Streaming requests with tool calls buffer all content and emit results at completion.
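A short sketch of OpenAI-format tool calling against the server; the model name and the get_weather tool are illustrative only, and the model's chat template must accept the tools parameter:

```python
# Sketch of OpenAI-format tool calling against oMLX. The get_weather tool and
# the model name are illustrative; any model whose chat template accepts the
# tools parameter should work.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="Qwen3-Coder-Next-8bit",  # substitute one of your downloaded models
    messages=[{"role": "user", "content": "What's the weather in Seoul?"}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
```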

Models

Point --model-dir at a directory containing MLX-format model subdirectories:

~/models/
├── Step-3.5-Flash-8bit/
├── Qwen3-Coder-Next-8bit/
├── gpt-oss-120b-MXFP4-Q8/
└── bge-m3/

Models are auto-detected by type. You can also download models directly from the admin dashboard.

| Type | Models |
| --- | --- |
| LLM | Any model supported by mlx-lm |
| Embedding | BERT, BGE-M3, ModernBERT |
| Reranker | ModernBERT, XLM-RoBERTa |
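Embedding models are served through the standard OpenAI embeddings request. A minimal sketch using the bge-m3 directory from the layout above (default port, no --api-key assumed):

```python
# Minimal sketch: request embeddings from a model in the model directory
# (bge-m3 from the example layout above). Assumes default port, no --api-key.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

result = client.embeddings.create(
    model="bge-m3",
    input=["paged KV cache", "continuous batching"],
)
print(len(result.data), "vectors of dim", len(result.data[0].embedding))
```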

CLI Configuration

# Memory limit for loaded models
omlx serve --model-dir ~/models --max-model-memory 32GB

# Enable SSD cache for KV blocks
omlx serve --model-dir ~/models --paged-ssd-cache-dir ~/.omlx/cache

# Adjust batch sizes
omlx serve --model-dir ~/models --prefill-batch-size 8 --completion-batch-size 32

# With MCP tools
omlx serve --model-dir ~/models --mcp-config mcp.json

# API key authentication
omlx serve --model-dir ~/models --api-key your-secret-key

All settings can also be configured from the web admin panel at /admin. Settings are persisted to ~/.omlx/settings.json, and CLI flags take precedence.

Architecture
FastAPI Server (OpenAI / Anthropic API)
    │
    ├── EnginePool (multi-model, LRU eviction)
    │   ├── BatchedEngine (LLMs, continuous batching)
    │   ├── EmbeddingEngine
    │   └── RerankerEngine
    │
    ├── Scheduler (FCFS, configurable batch sizes)
    │   └── mlx-lm BatchGenerator
    │
    └── Cache Stack
        ├── PagedCacheManager (GPU, block-based, CoW, prefix sharing)
        └── PagedSSDCacheManager (SSD tier, safetensors format)

Development

CLI Server

git clone https://github.com/jundot/omlx.git
cd omlx
pip install -e ".[dev]"
pytest -m "not slow"

macOS App

Requires Python 3.11+ and venvstacks (pip install venvstacks).

cd packaging

# Full build (venvstacks + app bundle + DMG)
python build.py

# Skip venvstacks (code changes only)
python build.py --skip-venv

# DMG only
python build.py --dmg-only

See packaging/README.md for details on the app bundle structure and layer configuration.

Contributing

We welcome contributions! See Contributing Guide for details.

  • Bug fixes and improvements
  • Performance optimizations
  • Documentation improvements

License

Apache 2.0

Acknowledgments

  • MLX and mlx-lm by Apple
  • vllm-mlx - oMLX originated as a fork of vllm-mlx v0.1.0, since re-architected with multi-model serving, paged SSD caching, an admin panel, and a standalone macOS menu bar app
  • venvstacks - Portable Python environment layering for the macOS app bundle
  • mlx-embeddings - Embedding model support for Apple Silicon
