Skip to content

cognisoc/mullama

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

51 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Mullama — local LLM inference in Rust, with native bindings for Python, Node.js, Go, PHP, and C

Crates.io PyPI npm CI Documentation License

What is Mullama?

Mullama is a local LLM runtime built on llama.cpp that runs GGUF models — Llama 3.2, Qwen 2.5, DeepSeek R1, Mistral, Phi 3, Gemma 2, LLaVA, and any other GGUF file from Hugging Face — directly inside your application. It exposes the same CLI as Ollama (mullama run, pull, serve, chat) plus native bindings for Rust, Python, Node.js, Go, PHP, and C/C++, so you can either run it as a server with OpenAI- and Anthropic-compatible HTTP APIs, or embed the inference engine in-process with zero HTTP overhead.

If you want to run a local LLM in your stack without an Ollama daemon, without spawning a Python subprocess from your Node app, and without writing your own FFI to llama.cpp — Mullama is the single dependency that covers all of it.

Install

# One-liner (Linux/macOS)
curl -fsSL https://mullama.cognisoc.com/install.sh | sh

# Windows (PowerShell)
iwr -useb https://mullama.cognisoc.com/install.ps1 | iex

# Or via your language's package manager
cargo add mullama                       # Rust
pip install mullama                     # Python
npm install mullama                     # Node.js / TypeScript
go get github.com/cognisoc/mullama      # Go
composer require mullama/mullama        # PHP
docker pull ghcr.io/cognisoc/mullama    # Docker

Quick start (CLI)

# Run a model — daemon auto-starts in the background
mullama run llama3.2:1b "What is the capital of France?"

# Interactive chat
mullama chat

# Start an OpenAI-compatible HTTP server on port 11434
mullama serve --model llama3.2:1b

Coming from Ollama? Your commands work unchanged — run, pull, serve, list, ps, create, show, rm, cp. Your Modelfiles work unchanged too.

Quick start (library)

Embed inference directly — no daemon, no HTTP, no separate process.

Rust:

use mullama::{Model, Context, ContextParams};

let model = Model::load("llama3.2-1b.gguf")?;
let mut ctx = Context::new(&model, ContextParams::default())?;
let response = ctx.generate("Hello, AI!", 256)?;
println!("{response}");

Python:

from mullama import Model, Context
model = Model.load("llama3.2-1b.gguf", n_gpu_layers=32)
ctx = Context(model, n_ctx=4096)
print(ctx.generate("Hello, AI!"))

Node.js:

const { Model, Context } = require("mullama");
const model = Model.load("llama3.2-1b.gguf", { nGpuLayers: 32 });
const ctx = new Context(model, { nCtx: 4096 });
console.log(ctx.generate("Hello, AI!"));

Go:

import "github.com/cognisoc/mullama"
model, _ := mullama.LoadModel("llama3.2-1b.gguf", &mullama.ModelParams{NGpuLayers: 32})
ctx, _ := mullama.NewContext(model, &mullama.ContextParams{NCtx: 4096})
response, _ := ctx.Generate("Hello, AI!", 256, nil)

PHP:

use Mullama\Model;
use Mullama\Context;
$model = Model::load("llama3.2-1b.gguf", ["nGpuLayers" => 32]);
$ctx = new Context($model, ["nCtx" => 4096]);
echo $ctx->generate("Hello, AI!");

See the binding-specific READMEs for full API docs: Python · Node.js · Go · PHP · C/C++ (FFI).

Supported models

Mullama runs any GGUF model from Hugging Face. The following are pre-configured aliases — type the short name and Mullama resolves it to the right HF repository and quantization:

  • Llamallama3.2:1b, llama3.2:3b, llama3.1:8b, llama3.1:70b, llama3:8b
  • Qwen 2.5qwen2.5:0.5b through qwen2.5:72b, plus qwen2.5-coder:7b / :14b / :32b
  • DeepSeek R1 (reasoning) — deepseek-r1:1.5b, :7b, :14b, :32b; plus deepseek-coder:7b, :33b
  • Mistralmistral:7b, mixtral:8x7b, codestral:22b
  • Phi (Microsoft) — phi3:mini, phi3:medium, phi3.5:mini
  • Gemma (Google) — gemma2:2b, :9b, :27b
  • Vision (multimodal)llava:7b, :13b, llava-phi3, moondream:2b
  • Embeddingsnomic-embed, bge:small, bge:large
  • Code completionstarcoder2:3b, :7b, :15b

Anything else in GGUF format works via hf:owner/repo (e.g. hf:bartowski/Llama-3.2-1B-Instruct-GGUF) or by giving a direct path to a .gguf file. The full registry lives in configs/models.toml.

Why Mullama?

Native bindings for 6 languages Rust, Python, Node.js, Go, PHP, C/C++ — call models directly, no HTTP round-trips, no Python subprocess from your Node app
Drop-in Ollama replacement Same CLI verbs, same Modelfile format, same model registry, same port (11434)
OpenAI + Anthropic API compatible Use your existing OpenAI SDK, LangChain, LlamaIndex, or Anthropic SDK against localhost:11434
Embed in any app Run inference in-process — no separate daemon, no orchestration
7 GPU backends CUDA, Metal (Apple Silicon), ROCm, Vulkan, OpenCL, SYCL (Intel Arc), RPC
Multimodal Text, image (LLaVA, Moondream), and real-time audio with voice activity detection
Built-in Web UI and TUI Chat interface, model management, and API playground

How Mullama compares

Mullama Ollama llama.cpp llama-cpp-python vLLM
GGUF models ✗ (uses safetensors)
Native language bindings 6 HTTP only C/C++ only Python only Python only
Embed in-process (no daemon)
OpenAI-compatible API partial
Anthropic-compatible API
Ollama Modelfile compatible
Built-in Web UI / TUI ✗ (third-party)
GPU backends 7 4 7 4 CUDA
Multimodal (vision + audio) partial partial

Full comparison · Migration guide from Ollama

What you can build

  • Chatbots and assistants — streaming responses, multi-turn context, custom system prompts.
  • RAG pipelines — built-in embeddings (BGE, Nomic), ColBERT-style late interaction, and grammar-constrained JSON output.
  • Voice assistants — real-time audio capture with voice activity detection, speech-to-text, streaming LLM responses.
  • Production API servers — OpenAI-compatible endpoints with SSE streaming, per-model resource limits, LRU model eviction, Prometheus metrics.
  • Edge deployments — embed a quantized model directly in your app; no network dependency, no daemon.
  • Batch processing — parallel inference across documents with work-stealing scheduling (parallel feature).
  • Local AI in Electron / Tauri / mobile — link the static C ABI library directly into your desktop app.

GPU acceleration

Set the environment variable for your hardware before building, or use the pre-built CUDA Docker image (ghcr.io/cognisoc/mullama:<version>-cuda):

Backend Env var Hardware
CUDA LLAMA_CUDA=1 NVIDIA GPUs
Metal LLAMA_METAL=1 Apple Silicon
ROCm LLAMA_HIPBLAS=1 AMD Radeon
Vulkan LLAMA_VULKAN=1 AMD / NVIDIA / Intel (cross-platform)
OpenCL LLAMA_CLBLAST=1 Intel / AMD legacy
SYCL LLAMA_SYCL=1 Intel Arc, Data Center GPUs
RPC LLAMA_RPC=1 Distributed inference across machines

FAQ

Is Mullama a drop-in replacement for Ollama?

Yes for the CLI, Modelfile format, model registry, and HTTP port (11434). The OpenAI-compatible API matches Ollama's. The Anthropic-compatible API is a Mullama addition that Ollama doesn't have. Existing Ollama Modelfiles work unchanged; existing client code pointed at localhost:11434 works unchanged.

How is Mullama different from llama.cpp?

Mullama uses llama.cpp as its inference engine — it's the same numerical kernel. What Mullama adds: a safe Rust API on top of the C++ library, six language bindings, an Ollama-compatible daemon with OpenAI/Anthropic HTTP APIs, model registry resolution, Modelfile parser, embedded Web UI, multimodal pipeline, streaming audio with VAD, and production features (per-model resource limits, LRU eviction, persistent stats via sled, Prometheus metrics).

Does Mullama require a GPU?

No. CPU inference works out of the box. GPU acceleration is optional and gated behind build-time environment variables — see the GPU acceleration table above. Apple Silicon Macs use Metal automatically.

Can I use the OpenAI Python SDK / LangChain / LlamaIndex with Mullama?

Yes. Point any OpenAI-compatible client at http://localhost:11434/v1. LangChain's ChatOpenAI, LlamaIndex's OpenAILike, and the official openai SDK all work without modification.

Does Mullama support function calling / tool use?

Yes for OpenAI-style tools in the chat completion API, and for grammar-constrained JSON output via the underlying llama.cpp grammar feature. See the tool calling guide.

What's the license?

MIT. Pure permissive — use it in commercial products without restriction.

Where do I report bugs or request features?

GitHub Issues for bugs and feature requests. Discussions for usage questions.

Documentation

Full documentation: docs.cognisoc.com/mullama.

Topics: installation, library usage per language, daemon configuration, API reference (OpenAI, Anthropic, Ollama-compatible), Modelfile format, GPU configuration, multimodal pipeline, streaming audio, embeddings, RAG patterns, deployment recipes (Docker, Kubernetes, systemd, launchd).

Contributing

git clone --recurse-submodules https://github.com/cognisoc/mullama.git
cd mullama
cargo test --all-features

See CONTRIBUTING.md for guidelines.

License

MIT — see LICENSE.

Releases

No releases published

Packages

 
 
 

Contributors