Run any LLM locally.
Use it from any language.
Deploy anywhere.
Mullama is a local LLM runtime written in Rust on top of llama.cpp.
Same CLI as Ollama. OpenAI- and Anthropic-compatible HTTP API. And —
unlike anything else in its class — native in-process bindings for
Rust, Python, Node.js, Go, PHP, and C/C++.
One dependency. No daemon required. No subprocess shims between your app and the model.
ollama run.Rust, Python, Node.js, Go, PHP, C/C++. Call the inference engine directly. No HTTP round-trips. No Python subprocess from your Node app. No FFI you wrote at 2am.
Same CLI (run, pull, serve, chat, list, ps, create, show, rm, cp). Same Modelfile format. Same model registry. Same port 11434. Your client code keeps working.
OpenAI-compatible at /v1. Anthropic-compatible at /v1/messages. LangChain, LlamaIndex, the openai SDK, and the anthropic SDK all just work — pointed at localhost.
Link the static C ABI library directly into your desktop app, your CLI tool, your Tauri / Electron / mobile build. No daemon, no orchestration, no network dependency.
CUDA, Metal, ROCm, Vulkan, OpenCL, SYCL, and RPC for distributed inference across machines. Apple Silicon uses Metal automatically. CPU works out of the box.
Vision via LLaVA, llava-phi3, Moondream. Real-time audio capture with voice activity detection and streaming responses. Embeddings via BGE and Nomic.
install
cargo add mullama Pick the channel that matches your stack. The shell one-liner pulls a static binary for Linux/macOS. The language package managers each ship the same native runtime, exposed through an idiomatic API for that language.
Windows users: iwr -useb https://mullama.cognisoc.com/install.ps1 | iex
GPU? Set the appropriate env var before building from source
(LLAMA_CUDA=1, LLAMA_METAL=1, LLAMA_HIPBLAS=1,
LLAMA_VULKAN=1, LLAMA_CLBLAST=1, LLAMA_SYCL=1,
LLAMA_RPC=1) or use ghcr.io/cognisoc/mullama:<ver>-cuda.
embed it
Same engine. Six idiomatic APIs. The model lives in your process — no daemon, no HTTP, no separate runtime to babysit.
use mullama::{Model, Context, ContextParams};
let model = Model::load("llama3.2-1b.gguf")?;
let mut ctx = Context::new(&model, ContextParams::default())?;
let response = ctx.generate("Hello, AI!", 256)?;
println!("{response}"); models it runs
Any GGUF file from Hugging Face works. Pre-configured aliases for the families below — type the short name and Mullama resolves to the right repo and quantization. Anything else: hf:owner/repo or a path to a .gguf.
how mullama compares
| Mullama | Ollama | llama.cpp | llama-cpp-python | vLLM | |
|---|---|---|---|---|---|
| GGUF models | ✓ | ✓ | ✓ | ✓ | ✗ |
| Native language bindings | 6 | HTTP only | C/C++ | Python | Python |
| Embed in-process | ✓ | ✗ | ✓ | ✓ | ✗ |
| OpenAI-compatible API | ✓ | ✓ | partial | ✗ | ✓ |
| Anthropic-compatible API | ✓ | ✗ | ✗ | ✗ | ✗ |
| Ollama Modelfile compatible | ✓ | ✓ | ✗ | ✗ | ✗ |
| Built-in Web UI / TUI | ✓ | ✗ | ✗ | ✗ | ✗ |
| GPU backends | 7 | 4 | 7 | 4 | CUDA |
| Multimodal (vision + audio) | ✓ | partial | ✓ | partial | ✗ |
what you can build
- › Chatbots & assistants — streaming, multi-turn, custom system prompts.
- › RAG pipelines — built-in embeddings (BGE, Nomic), ColBERT-style late interaction, grammar-constrained JSON output.
- › Voice assistants — real-time audio capture, VAD, speech-to-text, streaming LLM responses.
- › Production API servers — OpenAI-compatible SSE, per-model resource limits, LRU eviction, Prometheus metrics.
- › Edge deployments — embed a quantized model into your app. No network. No daemon.
- › Batch processing — parallel inference across documents with work-stealing scheduling (
parallelfeature). - › Local AI in Electron / Tauri / mobile — link the static C ABI library directly into your desktop app.
as a server
# start an OpenAI/Anthropic-compatible API on :11434
mullama serve --model llama3.2:1b
# point any client at it
export OPENAI_BASE_URL=http://localhost:11434/v1
export ANTHROPIC_BASE_URL=http://localhost:11434
# LangChain, LlamaIndex, openai-py, anthropic-py — all unchanged
The serve mode is wire-compatible with both OpenAI and Anthropic SDKs.
Existing client code that talks to localhost:11434 (the Ollama
port) keeps working without modification.
one runtime. any language. anywhere.
MIT licensed. Apache-2.0-friendly. Use it in commercial products without restriction.