$ status: process running on :11434

Run any LLM locally.
Use it from any language.
Deploy anywhere.

Mullama is a local LLM runtime written in Rust on top of llama.cpp. Same CLI as Ollama. OpenAI- and Anthropic-compatible HTTP API. And — unlike anything else in its class — native in-process bindings for Rust, Python, Node.js, Go, PHP, and C/C++.

One dependency. No daemon required. No subprocess shims between your app and the model.

read the docs → github notes from the runtime

~/dev — mullama run llama3.2:1b

$ mullama run llama3.2:1b "What is the capital of France?"

pulling manifest... 100%

resolving llama3.2:1b -> hf:bartowski/Llama-3.2-1B-Instruct-GGUF

backend: Metal · ctx: 4096 · model loaded in 412ms

The capital of France is Paris.

[57 tok · 38ms first-token · 142 tok/s]

$ _

Daemon auto-started in the background. Same UX as ollama run.

/six-native-bindings

six native bindings

Rust, Python, Node.js, Go, PHP, C/C++. Call the inference engine directly. No HTTP round-trips. No Python subprocess from your Node app. No FFI you wrote at 2am.

/drop-in-for-ollama

drop-in for ollama

Same CLI (run, pull, serve, chat, list, ps, create, show, rm, cp). Same Modelfile format. Same model registry. Same port 11434. Your client code keeps working.

/two-http-APIs

two http APIs

OpenAI-compatible at /v1. Anthropic-compatible at /v1/messages. LangChain, LlamaIndex, the openai SDK, and the anthropic SDK all just work — pointed at localhost.

/embed-in-process

embed in-process

Link the static C ABI library directly into your desktop app, your CLI tool, your Tauri / Electron / mobile build. No daemon, no orchestration, no network dependency.

/7-GPU-backends

7 GPU backends

CUDA, Metal, ROCm, Vulkan, OpenCL, SYCL, and RPC for distributed inference across machines. Apple Silicon uses Metal automatically. CPU works out of the box.

/multimodal-+-audio

multimodal + audio

Vision via LLaVA, llava-phi3, Moondream. Real-time audio capture with voice activity detection and streaming responses. Embeddings via BGE and Nomic.

// step 1

install

cargo add mullama

pip install mullama

npm install mullama

go get github.com/cognisoc/mullama

composer require mullama/mullama

docker pull ghcr.io/cognisoc/mullama

curl -fsSL https://mullama.cognisoc.com/install.sh | sh

Pick the channel that matches your stack. The shell one-liner pulls a static binary for Linux/macOS. The language package managers each ship the same native runtime, exposed through an idiomatic API for that language.

Windows users: iwr -useb https://mullama.cognisoc.com/install.ps1 | iex

GPU? Set the appropriate env var before building from source (LLAMA_CUDA=1, LLAMA_METAL=1, LLAMA_HIPBLAS=1, LLAMA_VULKAN=1, LLAMA_CLBLAST=1, LLAMA_SYCL=1, LLAMA_RPC=1) or use ghcr.io/cognisoc/mullama:<ver>-cuda.

// step 2

embed it

Same engine. Six idiomatic APIs. The model lives in your process — no daemon, no HTTP, no separate runtime to babysit.

use mullama::{Model, Context, ContextParams};

let model = Model::load("llama3.2-1b.gguf")?;
let mut ctx = Context::new(&model, ContextParams::default())?;
let response = ctx.generate("Hello, AI!", 256)?;
println!("{response}");

from mullama import Model, Context

model = Model.load("llama3.2-1b.gguf", n_gpu_layers=32)
ctx = Context(model, n_ctx=4096)
print(ctx.generate("Hello, AI!"))

const { Model, Context } = require("mullama");

const model = Model.load("llama3.2-1b.gguf", { nGpuLayers: 32 });
const ctx = new Context(model, { nCtx: 4096 });
console.log(ctx.generate("Hello, AI!"));

import "github.com/cognisoc/mullama"

model, _ := mullama.LoadModel("llama3.2-1b.gguf", &mullama.ModelParams{NGpuLayers: 32})
ctx, _   := mullama.NewContext(model, &mullama.ContextParams{NCtx: 4096})
response, _ := ctx.Generate("Hello, AI!", 256, nil)

use Mullama\Model;
use Mullama\Context;

$model = Model::load("llama3.2-1b.gguf", ["nGpuLayers" => 32]);
$ctx   = new Context($model, ["nCtx" => 4096]);
echo $ctx->generate("Hello, AI!");

#include "mullama.h"

mullama_model* model = mullama_load_model("llama3.2-1b.gguf", NULL);
mullama_ctx*   ctx   = mullama_new_context(model, NULL);
char buf[2048];
mullama_generate(ctx, "Hello, AI!", buf, sizeof(buf));
puts(buf);

// catalog

models it runs

Any GGUF file from Hugging Face works. Pre-configured aliases for the families below — type the short name and Mullama resolves to the right repo and quantization. Anything else: hf:owner/repo or a path to a .gguf.

Llama

llama3.2:1b · :3b · llama3.1:8b · :70b · llama3:8b

Qwen 2.5

qwen2.5:0.5b … :72b · qwen2.5-coder:7b · :14b · :32b

DeepSeek R1

deepseek-r1:1.5b · :7b · :14b · :32b · deepseek-coder:7b · :33b

Mistral

mistral:7b · mixtral:8x7b · codestral:22b

Phi

phi3:mini · phi3:medium · phi3.5:mini

Gemma

gemma2:2b · :9b · :27b

Vision

llava:7b · :13b · llava-phi3 · moondream:2b

Embeddings

nomic-embed · bge:small · bge:large

Code

starcoder2:3b · :7b · :15b

// diff

how mullama compares

	Mullama	Ollama	llama.cpp	llama-cpp-python	vLLM
GGUF models	✓	✓	✓	✓	✗
Native language bindings	6	HTTP only	C/C++	Python	Python
Embed in-process	✓	✗	✓	✓	✗
OpenAI-compatible API	✓	✓	partial	✗	✓
Anthropic-compatible API	✓	✗	✗	✗	✗
Ollama Modelfile compatible	✓	✓	✗	✗	✗
Built-in Web UI / TUI	✓	✗	✗	✗	✗
GPU backends	7	4	7	4	CUDA
Multimodal (vision + audio)	✓	partial	✓	partial	✗

Detailed: vs Ollama · vs llama.cpp

// patterns

what you can build

› Chatbots & assistants — streaming, multi-turn, custom system prompts.
› RAG pipelines — built-in embeddings (BGE, Nomic), ColBERT-style late interaction, grammar-constrained JSON output.
› Voice assistants — real-time audio capture, VAD, speech-to-text, streaming LLM responses.
› Production API servers — OpenAI-compatible SSE, per-model resource limits, LRU eviction, Prometheus metrics.
› Edge deployments — embed a quantized model into your app. No network. No daemon.
› Batch processing — parallel inference across documents with work-stealing scheduling (parallel feature).
› Local AI in Electron / Tauri / mobile — link the static C ABI library directly into your desktop app.

// serve

as a server

# start an OpenAI/Anthropic-compatible API on :11434
mullama serve --model llama3.2:1b

# point any client at it
export OPENAI_BASE_URL=http://localhost:11434/v1
export ANTHROPIC_BASE_URL=http://localhost:11434

# LangChain, LlamaIndex, openai-py, anthropic-py — all unchanged

The serve mode is wire-compatible with both OpenAI and Anthropic SDKs. Existing client code that talks to localhost:11434 (the Ollama port) keeps working without modification.

one runtime. any language. anywhere.

MIT licensed. Apache-2.0-friendly. Use it in commercial products without restriction.

start with the docs read the source

Run any LLM locally. Use it from any language. Deploy anywhere.