mullama
$ status: process running on :11434

Run any LLM locally.
Use it from any language.
Deploy anywhere.

Mullama is a local LLM runtime written in Rust on top of llama.cpp. Same CLI as Ollama. OpenAI- and Anthropic-compatible HTTP API. And — unlike anything else in its class — native in-process bindings for Rust, Python, Node.js, Go, PHP, and C/C++.

One dependency. No daemon required. No subprocess shims between your app and the model.

~/dev — mullama run llama3.2:1b
$ mullama run llama3.2:1b "What is the capital of France?"
pulling manifest... 100%
resolving llama3.2:1b -> hf:bartowski/Llama-3.2-1B-Instruct-GGUF
backend: Metal · ctx: 4096 · model loaded in 412ms
The capital of France is Paris.
[57 tok · 38ms first-token · 142 tok/s]
$ _
Daemon auto-started in the background. Same UX as ollama run.
/six-native-bindings
six native bindings

Rust, Python, Node.js, Go, PHP, C/C++. Call the inference engine directly. No HTTP round-trips. No Python subprocess from your Node app. No FFI you wrote at 2am.

/drop-in-for-ollama
drop-in for ollama

Same CLI (run, pull, serve, chat, list, ps, create, show, rm, cp). Same Modelfile format. Same model registry. Same port 11434. Your client code keeps working.

/two-http-APIs
two http APIs

OpenAI-compatible at /v1. Anthropic-compatible at /v1/messages. LangChain, LlamaIndex, the openai SDK, and the anthropic SDK all just work — pointed at localhost.

/embed-in-process
embed in-process

Link the static C ABI library directly into your desktop app, your CLI tool, your Tauri / Electron / mobile build. No daemon, no orchestration, no network dependency.

/7-GPU-backends
7 GPU backends

CUDA, Metal, ROCm, Vulkan, OpenCL, SYCL, and RPC for distributed inference across machines. Apple Silicon uses Metal automatically. CPU works out of the box.

/multimodal-+-audio
multimodal + audio

Vision via LLaVA, llava-phi3, Moondream. Real-time audio capture with voice activity detection and streaming responses. Embeddings via BGE and Nomic.

// step 1

install

cargo add mullama

Pick the channel that matches your stack. The shell one-liner pulls a static binary for Linux/macOS. The language package managers each ship the same native runtime, exposed through an idiomatic API for that language.

Windows users: iwr -useb https://mullama.cognisoc.com/install.ps1 | iex

GPU? Set the appropriate env var before building from source (LLAMA_CUDA=1, LLAMA_METAL=1, LLAMA_HIPBLAS=1, LLAMA_VULKAN=1, LLAMA_CLBLAST=1, LLAMA_SYCL=1, LLAMA_RPC=1) or use ghcr.io/cognisoc/mullama:<ver>-cuda.

// step 2

embed it

Same engine. Six idiomatic APIs. The model lives in your process — no daemon, no HTTP, no separate runtime to babysit.

use mullama::{Model, Context, ContextParams};

let model = Model::load("llama3.2-1b.gguf")?;
let mut ctx = Context::new(&model, ContextParams::default())?;
let response = ctx.generate("Hello, AI!", 256)?;
println!("{response}");
// catalog

models it runs

Any GGUF file from Hugging Face works. Pre-configured aliases for the families below — type the short name and Mullama resolves to the right repo and quantization. Anything else: hf:owner/repo or a path to a .gguf.

Llama
llama3.2:1b · :3b · llama3.1:8b · :70b · llama3:8b
Qwen 2.5
qwen2.5:0.5b … :72b · qwen2.5-coder:7b · :14b · :32b
DeepSeek R1
deepseek-r1:1.5b · :7b · :14b · :32b · deepseek-coder:7b · :33b
Mistral
mistral:7b · mixtral:8x7b · codestral:22b
Phi
phi3:mini · phi3:medium · phi3.5:mini
Gemma
gemma2:2b · :9b · :27b
Vision
llava:7b · :13b · llava-phi3 · moondream:2b
Embeddings
nomic-embed · bge:small · bge:large
Code
starcoder2:3b · :7b · :15b
// diff

how mullama compares

Mullama Ollama llama.cpp llama-cpp-python vLLM
GGUF models
Native language bindings 6HTTP onlyC/C++PythonPython
Embed in-process
OpenAI-compatible API partial
Anthropic-compatible API
Ollama Modelfile compatible
Built-in Web UI / TUI
GPU backends 7474CUDA
Multimodal (vision + audio) partialpartial
Detailed: vs Ollama · vs llama.cpp
// patterns

what you can build

  • Chatbots & assistants — streaming, multi-turn, custom system prompts.
  • RAG pipelines — built-in embeddings (BGE, Nomic), ColBERT-style late interaction, grammar-constrained JSON output.
  • Voice assistants — real-time audio capture, VAD, speech-to-text, streaming LLM responses.
  • Production API servers — OpenAI-compatible SSE, per-model resource limits, LRU eviction, Prometheus metrics.
  • Edge deployments — embed a quantized model into your app. No network. No daemon.
  • Batch processing — parallel inference across documents with work-stealing scheduling (parallel feature).
  • Local AI in Electron / Tauri / mobile — link the static C ABI library directly into your desktop app.
// serve

as a server

# start an OpenAI/Anthropic-compatible API on :11434
mullama serve --model llama3.2:1b

# point any client at it
export OPENAI_BASE_URL=http://localhost:11434/v1
export ANTHROPIC_BASE_URL=http://localhost:11434

# LangChain, LlamaIndex, openai-py, anthropic-py — all unchanged

The serve mode is wire-compatible with both OpenAI and Anthropic SDKs. Existing client code that talks to localhost:11434 (the Ollama port) keeps working without modification.

one runtime. any language. anywhere.

MIT licensed. Apache-2.0-friendly. Use it in commercial products without restriction.