OpenAI-compatible API server for local LLM inference with MLX, llama.cpp, and CoreML.
A lightweight API server for running Apple MLX and GGUF models, with a number of quality-of-life additions and an extended OpenAI-style endpoint (can't really call it fully compatible anymore given the additions), plus on-the-fly model swapping and optional analytics.
- OpenAI-Compatible API: Works with existing OpenAI client libraries
  - Anthropic API-compatible functionality coming soon
- Multi-Provider:
  - MLX: Optimized for Apple Silicon (with additional Metal acceleration)
  - llama.cpp: Cross-platform support for GGUF models (CUDA, Vulkan, CPU)
    - Grateful for the continuous work put in by the core maintainers of llama.cpp, who've been pushing ahead since the first llama model.
  - CoreML: Experimental Speech-to-Text on Apple Silicon (Neural Engine)
- Vision Models: Process images with vision-language models (VLMs), with support for pushing down / resizing images from the client
- Performance Optimized:
  - Metal acceleration on macOS
  - Async processing and smart caching
  - Fast multipart image endpoint
  - Batch processing for 2-4x throughput
- Hot Swapping: Change models on the fly without restarting; the model is selected per request in the request body (see the example after this list)
- Analytics: Optional tracking and performance metrics
- macOS: Both backends (MLX, llama.cpp)
- Linux: llama.cpp backend (CUDA, CPU)
- Windows: llama.cpp backend (CUDA, Vulkan, CPU)
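
Hot swapping in practice: because the target model is named in each request body, pointing consecutive requests at different entries from your models.toml is all it takes. A minimal sketch, reusing the same model names as the examples further down:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

# First request runs against one configured model...
reply_a = client.chat.completions.create(
    model="qwen2.5-coder-1.5b",
    messages=[{"role": "user", "content": "Say hi in one word."}],
)

# ...and naming a different model in the next request swaps it in,
# with no server restart in between.
reply_b = client.chat.completions.create(
    model="qwen-14b",
    messages=[{"role": "user", "content": "Say hi in one sentence."}],
)

print(reply_a.choices[0].message.content)
print(reply_b.choices[0].message.content)
```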
macOS/Linux:

```bash
./setup.sh
```

Windows:

```powershell
.\setup.ps1
```

See Windows Installation Guide for detailed Windows setup.
```bash
git clone https://github.com/fblissjr/heylookitsanllm
cd heylookitsanllm

# Recommended: use uv sync for proper dependency resolution
uv sync                    # Base install
uv sync --extra mlx        # macOS only
uv sync --extra llama-cpp  # All platforms
uv sync --extra stt        # macOS only (CoreML) - very experimental
uv sync --extra analytics  # DuckDB analytics
uv sync --extra all        # Install everything

# Alternative: pip-style install (doesn't use lockfile)
uv pip install -e .
uv pip install -e .[mlx,llama-cpp]
```

macOS (Metal)
Included by default with mlx. For llama-cpp, run:
CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 uv pip install --force-reinstall --no-cache-dir llama-cpp-pythonLinux/Windows (CUDA)
```bash
# Linux
CMAKE_ARGS="-DGGML_CUDA=on" FORCE_CMAKE=1 uv pip install --force-reinstall --no-cache-dir llama-cpp-python
```

```powershell
# Windows (PowerShell)
$env:CMAKE_ARGS = "-DGGML_CUDA=on"
$env:FORCE_CMAKE = "1"
python -m pip install --force-reinstall --no-cache-dir llama-cpp-python
```

```bash
# Configure models first
cp models.toml.example models.toml
# Edit models.toml with your model paths

# Start server
heylookllm --log-level INFO
heylookllm --port 8080
```

Run the server as a persistent service that survives SSH disconnects:
```bash
# Install service (localhost only by default)
heylookllm service install

# Install for LAN access (behind VPN)
heylookllm service install --host 0.0.0.0

# Manage service
heylookllm service status
heylookllm service start
heylookllm service stop
heylookllm service restart
heylookllm service uninstall
```

See Service and Security Guide for detailed setup and firewall configuration.
Scan directories for models and auto-generate configuration:
```bash
heylookllm import --folder ~/models --output models.toml
heylookllm import --hf-cache --profile fast
```

Interactive docs available when the server is running:
- Swagger UI: http://localhost:8080/docs
- ReDoc: http://localhost:8080/redoc
- OpenAPI Schema: http://localhost:8080/openapi.json
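
Because the schema is served at `/openapi.json`, the extended endpoints can also be discovered programmatically. A small sketch that simply lists whatever paths a running server reports:

```python
import requests

# Fetch the OpenAPI schema from a running server and list its endpoints.
schema = requests.get("http://localhost:8080/openapi.json").json()
for path, methods in sorted(schema["paths"].items()):
    print(", ".join(m.upper() for m in methods), path)
```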
Core Endpoints (/v1)
- `GET /v1/models` - List available models
- `POST /v1/chat/completions` - Chat completion (text and vision)
- `POST /v1/chat/completions/multipart` - Fast raw image upload (57ms faster per image)
- `POST /v1/batch/chat/completions` - Batch processing (2-4x throughput)
- `POST /v1/embeddings` - Generate embeddings
- `POST /v1/hidden_states` - Extract hidden states from intermediate layers (MLX only)
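
For vision models, the regular `/v1/chat/completions` endpoint accepts OpenAI-style image content parts. A rough sketch, where the model name is a placeholder for whichever VLM you configured in models.toml:

```python
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

# Encode a local image as a base64 data URL (the standard OpenAI vision format).
with open("photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="your-vlm-model",  # placeholder: use a vision model from your models.toml
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```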
Speech-to-Text (macOS only)
- `POST /v1/audio/transcriptions` - Transcribe audio
- `POST /v1/audio/translations` - Translate audio to English
- `GET /v1/stt/models` - List STT models
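
With the stt extra installed, the audio endpoints follow the OpenAI audio API shape, so the standard client call should work. A sketch; the model id below is an assumption, check `GET /v1/stt/models` for what is actually available:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

# Transcribe a local audio file via the OpenAI-style audio endpoint.
with open("meeting.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-large-v3",  # assumption: pick an id from GET /v1/stt/models
        file=audio_file,
    )
print(transcript.text)
```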
Analytics and Admin
- `GET /v1/capabilities` - Discover server capabilities and optimizations
- `GET /v1/performance` - Real-time performance metrics
- `GET /v1/data/summary` - Analytics summary (requires analytics enabled)
- `POST /v1/data/query` - Query analytics data
- `POST /v1/admin/restart` - Restart server
- `POST /v1/admin/reload` - Reload model configuration
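
The discovery and metrics endpoints are plain GETs. A quick sketch that just prints whatever the server returns, making no assumptions about the response shape:

```python
import requests

base = "http://localhost:8080/v1"
# Ask the server which features and optimizations are enabled...
print(requests.get(f"{base}/capabilities").json())
# ...and pull the current performance metrics.
print(requests.get(f"{base}/performance").json())
```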
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"
)

# Chat
response = client.chat.completions.create(
    model="qwen2.5-coder-1.5b",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True
)
for chunk in response:
    print(chunk.choices[0].delta.content or "", end="")

# Embeddings
embedding = client.embeddings.create(
    input="Your text here",
    model="qwen2.5-coder-1.5b"
)
print(embedding.data[0].embedding[:5])  # First 5 dimensions
```

Hidden States (MLX only)
Extract intermediate layer hidden states for use with diffusion models:
```python
import requests

response = requests.post(
    "http://localhost:8080/v1/hidden_states",
    json={
        "model": "Qwen/Qwen3-4B",
        "input": "A photo of a cat",
        "layer_index": -2,           # Second-to-last layer
        "encoding_format": "base64"  # or "float" for JSON array
    }
)
result = response.json()
print(f"Shape: {result['data'][0]['shape']}")  # [seq_len, hidden_dim]
```

Process multiple requests efficiently with a 2-4x throughput improvement:
```python
import requests

response = requests.post(
    "http://localhost:8080/v1/batch/chat/completions",
    json={
        "requests": [
            {
                "model": "qwen-14b",
                "messages": [{"role": "user", "content": "Prompt 1"}],
                "max_tokens": 50
            },
            {
                "model": "qwen-14b",
                "messages": [{"role": "user", "content": "Prompt 2"}],
                "max_tokens": 50
            }
        ]
    }
)
```

Track performance metrics and request history.
- Setup: `python setup_analytics.py`
- Enable: Set `HEYLOOK_ANALYTICS_ENABLED=true`
- Analyze: `python analyze_logs.py` (or query over HTTP, as sketched below)
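
Once analytics is enabled, the summary can also be pulled over HTTP. A sketch that just prints the raw response; the field names depend on what the analytics backend records:

```python
import requests

# Requires the analytics extra and HEYLOOK_ANALYTICS_ENABLED=true on the server.
summary = requests.get("http://localhost:8080/v1/data/summary").json()
print(summary)
```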
Model not loading

```bash
heylookllm --log-level DEBUG
```

MIT License - see LICENSE file