Hey Look, It's an LLM

OpenAI-compatible API server for local LLM inference with MLX, llama.cpp, and CoreML.

A lightweight API server for running Apple MLX and GGUF models through a modified OpenAI-style endpoint (no longer strictly compatible given the additions), with on-the-fly model swapping, a number of quality-of-life extras, and optional analytics.

Key Features

  • OpenAI-Compatible API: Works with existing OpenAI client libraries; Anthropic API-compatible functionality is coming soon
  • Multi-Provider:
    • MLX: Optimized for Apple Silicon (with additional Metal acceleration)
    • llama.cpp: Cross-platform support for GGUF models (CUDA, Vulkan, CPU)
      • Grateful for the continuous work put in by the core maintainers of llama.cpp, who've been pushing ahead since the first llama model.
    • CoreML: Experimental Speech-to-Text on Apple Silicon (Neural Engine)
  • Vision Models: Process images with vision-language models (VLMs), with support for client-specified image resizing/downscaling
  • Performance Optimized:
    • Metal acceleration on macOS
    • Async processing and smart caching
    • Fast multipart image endpoint
    • Batch processing for 2-4x throughput
  • Hot Swapping: Change models on the fly without restarting; the model is selected per request in the request body (see the sketch after this list)
  • Analytics: Optional tracking and performance metrics
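
Because the model is chosen per request, swapping is just a matter of naming a different entry from models.toml in the request body. A minimal sketch (the second model name is a placeholder):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

# First request runs one model...
client.chat.completions.create(
    model="qwen2.5-coder-1.5b",
    messages=[{"role": "user", "content": "Hello!"}],
)

# ...the next request swaps to another, no restart needed.
client.chat.completions.create(
    model="gemma-3-4b-it",  # placeholder for a second entry in models.toml
    messages=[{"role": "user", "content": "Hello again!"}],
)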

Platform Support

  • macOS: Both backends (MLX, llama.cpp)
  • Linux: llama.cpp backend (CUDA, CPU)
  • Windows: llama.cpp backend (CUDA, Vulkan, CPU)

Quick Start

Installation

macOS/Linux:

./setup.sh

Windows:

.\setup.ps1

See Windows Installation Guide for detailed Windows setup.

Manual Installation

git clone https://github.com/fblissjr/heylookitsanllm
cd heylookitsanllm

# Recommended: use uv sync for proper dependency resolution
uv sync                            # Base install
uv sync --extra mlx                # macOS only
uv sync --extra llama-cpp          # All platforms
uv sync --extra stt                # macOS only (CoreML) - very experimental
uv sync --extra analytics          # DuckDB analytics
uv sync --extra all                # Install everything

# Alternative: pip-style install (doesn't use lockfile)
uv pip install -e .
uv pip install -e .[mlx,llama-cpp]

GPU Acceleration

macOS (Metal)

Metal acceleration is included by default with the mlx extra. For llama-cpp, run:

CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 uv pip install --force-reinstall --no-cache-dir llama-cpp-python

Linux/Windows (CUDA)

# Linux
CMAKE_ARGS="-DGGML_CUDA=on" FORCE_CMAKE=1 uv pip install --force-reinstall --no-cache-dir llama-cpp-python

# Windows (PowerShell)
$env:CMAKE_ARGS = "-DGGML_CUDA=on"
$env:FORCE_CMAKE = "1"
python -m pip install --force-reinstall --no-cache-dir llama-cpp-python

Start Server

# Configure models first
cp models.toml.example models.toml
# Edit models.toml with your model paths

# Start server
heylookllm --log-level INFO
heylookllm --port 8080

Run as Background Service (macOS/Linux)

Run the server as a persistent service that survives SSH disconnects:

# Install service (localhost only by default)
heylookllm service install

# Install for LAN access (behind VPN)
heylookllm service install --host 0.0.0.0

# Manage service
heylookllm service status
heylookllm service start
heylookllm service stop
heylookllm service restart
heylookllm service uninstall

See Service and Security Guide for detailed setup and firewall configuration.

Automatic Import

Scan directories for models and auto-generate configuration:

heylookllm import --folder ~/models --output models.toml
heylookllm import --hf-cache --profile fast

API Documentation

Interactive API docs are available while the server is running.

Key Endpoints

Core Endpoints (/v1)

  • GET /v1/models - List available models
  • POST /v1/chat/completions - Chat completion (text and vision)
  • POST /v1/chat/completions/multipart - Fast raw image upload (57ms faster per image); see the sketch after this list
  • POST /v1/batch/chat/completions - Batch processing (2-4x throughput)
  • POST /v1/embeddings - Generate embeddings
  • POST /v1/hidden_states - Extract hidden states from intermediate layers (MLX only)
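
A hedged sketch of calling the multipart endpoint with requests. The form field names below ("request" for the JSON payload, "images" for the files) are assumptions, not confirmed against the server; check the interactive docs for the actual schema:

import json
import requests

payload = {
    "model": "qwen-14b",
    "messages": [{"role": "user", "content": "Describe this image."}],
    "max_tokens": 100,
}

# Assumption: JSON body goes in a form field, raw images go as multipart files.
with open("photo.jpg", "rb") as f:
    resp = requests.post(
        "http://localhost:8080/v1/chat/completions/multipart",
        data={"request": json.dumps(payload)},
        files={"images": ("photo.jpg", f, "image/jpeg")},
    )
print(resp.json())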

Speech-to-Text (macOS only)

  • POST /v1/audio/transcriptions - Transcribe audio
  • POST /v1/audio/translations - Translate audio to English
  • GET /v1/stt/models - List STT models
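
A hedged sketch using the OpenAI client, assuming these endpoints mirror the OpenAI audio API; the model id and file name are placeholders:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

# Placeholder STT model id; list the real ones with GET /v1/stt/models.
with open("meeting.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=audio_file,
    )
print(transcript.text)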

Analytics and Admin

  • GET /v1/capabilities - Discover server capabilities and optimizations
  • GET /v1/performance - Real-time performance metrics
  • GET /v1/data/summary - Analytics summary (requires analytics enabled)
  • POST /v1/data/query - Query analytics data
  • POST /v1/admin/restart - Restart server
  • POST /v1/admin/reload - Reload model configuration
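
The read-only endpoints above are plain GETs, so probing them from Python is straightforward (response shapes depend on which extras are enabled):

import requests

# Discover what the running server supports, then pull live metrics.
caps = requests.get("http://localhost:8080/v1/capabilities").json()
perf = requests.get("http://localhost:8080/v1/performance").json()
print(caps)
print(perf)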

Example Usage (Python)

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"
)

# Chat
response = client.chat.completions.create(
    model="qwen2.5-coder-1.5b",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True
)

for chunk in response:
    print(chunk.choices[0].delta.content or "", end="")

# Embeddings
embedding = client.embeddings.create(
    input="Your text here",
    model="qwen2.5-coder-1.5b"
)
print(embedding.data[0].embedding[:5])  # First 5 dimensions
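
Vision requests use the standard OpenAI content-parts format with a base64 data URL; the model name below is a placeholder for a VLM entry in your models.toml:

import base64

# Encode a local image as a base64 data URL.
with open("photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="qwen2.5-vl-7b",  # placeholder VLM id
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)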

Hidden States (MLX only)

Extract intermediate layer hidden states for use with diffusion models:

import requests

response = requests.post(
    "http://localhost:8080/v1/hidden_states",
    json={
        "model": "Qwen/Qwen3-4B",
        "input": "A photo of a cat",
        "layer_index": -2,  # Second-to-last layer
        "encoding_format": "base64"  # or "float" for JSON array
    }
)
result = response.json()
print(f"Shape: {result['data'][0]['shape']}")  # [seq_len, hidden_dim]

Batch Processing

Process multiple requests efficiently with 2-4x throughput improvement:

import requests

response = requests.post(
    "http://localhost:8080/v1/batch/chat/completions",
    json={
        "requests": [
            {
                "model": "qwen-14b",
                "messages": [{"role": "user", "content": "Prompt 1"}],
                "max_tokens": 50
            },
            {
                "model": "qwen-14b",
                "messages": [{"role": "user", "content": "Prompt 2"}],
                "max_tokens": 50
            }
        ]
    }
)
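
The batch response bundles one completion per request. A hedged sketch of unpacking it, assuming results come back in request order under a "responses" key (the key name is an assumption; check the interactive docs):

# Assumption: one entry per request, in order, under "responses".
for item in response.json().get("responses", []):
    print(item["choices"][0]["message"]["content"])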

Analytics (Optional)

Track performance metrics and request history.

  1. Setup: python setup_analytics.py
  2. Enable: Set HEYLOOK_ANALYTICS_ENABLED=true
  3. Analyze: python analyze_logs.py

Troubleshooting

Model not loading? Run with debug logging to see what's happening:

heylookllm --log-level DEBUG

License

MIT License - see LICENSE file
