cyllama is a comprehensive, dependency-free Python library for AI inference built on the .cpp ecosystem:

- llama.cpp - LLM text generation
- whisper.cpp - speech-to-text transcription
- stable-diffusion.cpp - image and video generation

It combines the performance of compiled Cython wrappers with a simple, high-level Python API.
```python
from cyllama import complete

# One line is all you need
response = complete(
    "Explain quantum computing in simple terms",
    model_path="models/llama.gguf",
    temperature=0.7,
    max_tokens=200
)
print(response)
```

**High-Level API** - Get started in seconds:
```python
from cyllama import complete, chat, LLM

# One-shot completion
response = complete("What is Python?", model_path="model.gguf")

# Multi-turn chat
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is machine learning?"}
]
response = chat(messages, model_path="model.gguf")

# Reusable LLM instance (faster for multiple prompts)
llm = LLM("model.gguf")
response1 = llm("Question 1")
response2 = llm("Question 2")  # Model stays loaded!
```

**Streaming Support** - Real-time token-by-token output:
```python
for chunk in complete("Tell me a story", model_path="model.gguf", stream=True):
    print(chunk, end="", flush=True)
```

**Batch Processing** - Process multiple prompts 3-10x faster:
```python
from cyllama import batch_generate

prompts = ["What is 2+2?", "What is 3+3?", "What is 4+4?"]
responses = batch_generate(prompts, model_path="model.gguf")
```
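Assuming `batch_generate` returns one response per prompt, in input order, the results pair up directly with the prompts:

```python
# Pair each prompt with its generated response (order-preserving assumption)
for prompt, resp in zip(prompts, responses):
    print(f"{prompt} -> {resp}")
```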
**Speculative Decoding** - 2-3x speedup with draft models:

```python
from cyllama.llama.llama_cpp import Speculative, SpeculativeParams

# ctx_target and ctx_draft are previously created llama contexts
# for the target model and the smaller draft model, respectively.
spec = Speculative(ctx_target, ctx_draft)
params = SpeculativeParams(n_draft=16, p_min=0.75)
draft_tokens = spec.gen_draft(params, prompt_tokens, last_token)
```

**Memory Optimization** - Smart GPU layer allocation:
```python
from cyllama import estimate_gpu_layers

estimate = estimate_gpu_layers(
    model_path="model.gguf",
    available_vram_mb=8000
)
print(f"Recommended GPU layers: {estimate.n_gpu_layers}")
```
**N-gram Cache** - 2-10x speedup for repetitive text:

```python
from cyllama.llama.llama_cpp import NgramCache

cache = NgramCache()
cache.update(tokens, ngram_min=2, ngram_max=4)
draft = cache.draft(input_tokens, n_draft=16)
```

**OpenAI-Compatible API** - Drop-in replacement:
```python
from cyllama.integrations import OpenAIClient

client = OpenAIClient(model_path="model.gguf")
response = client.chat.completions.create(
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7
)
print(response.choices[0].message.content)
```

**LangChain Integration** - Seamless ecosystem access:
```python
from cyllama.integrations import CyllamaLLM
from langchain.chains import LLMChain

# prompt_template: a LangChain PromptTemplate defined elsewhere
llm = CyllamaLLM(model_path="model.gguf", temperature=0.7)
chain = LLMChain(llm=llm, prompt=prompt_template)
result = chain.run(topic="AI")
```

Cyllama includes a zero-dependency agent framework with three agent architectures:

**ReActAgent** - Reasoning + Acting agent with tool calling:
```python
from cyllama import LLM
from cyllama.agents import ReActAgent, tool
from simpleeval import simple_eval  # third-party package used by this example

@tool
def calculate(expression: str) -> str:
    """Evaluate a math expression safely."""
    return str(simple_eval(expression))

llm = LLM("model.gguf")
agent = ReActAgent(llm=llm, tools=[calculate])
result = agent.run("What is 25 * 4?")
print(result.answer)
```

**ConstrainedAgent** - Grammar-enforced tool calling for 100% reliability:
```python
from cyllama.agents import ConstrainedAgent

agent = ConstrainedAgent(llm=llm, tools=[calculate])
result = agent.run("Calculate 100 / 4")  # Guaranteed valid tool calls
```

**ContractAgent** - Contract-based agent with C++26-inspired pre/post conditions:
```python
from cyllama.agents import ContractAgent, tool, pre, post, ContractPolicy

@tool
@pre(lambda args: args['x'] != 0, "cannot divide by zero")
@post(lambda r: r is not None, "result must not be None")
def divide(a: float, x: float) -> float:
    """Divide a by x."""
    return a / x

agent = ContractAgent(
    llm=llm,
    tools=[divide],
    policy=ContractPolicy.ENFORCE,
    task_precondition=lambda task: len(task) > 10,
    answer_postcondition=lambda ans: len(ans) > 0,
)
result = agent.run("What is 100 divided by 4?")
```

See Agents Overview for detailed agent documentation.
**Whisper Transcription** - Transcribe audio files with timestamps:
```python
from cyllama.whisper import WhisperContext, WhisperFullParams
import numpy as np

# Load model and audio
ctx = WhisperContext("models/ggml-base.en.bin")
samples = load_audio_as_16khz_float32("audio.wav")  # Your audio loading function

# Transcribe
params = WhisperFullParams()
ctx.full(samples, params)

# Get results
for i in range(ctx.full_n_segments()):
    start = ctx.full_get_segment_t0(i) / 100.0
    end = ctx.full_get_segment_t1(i) / 100.0
    text = ctx.full_get_segment_text(i)
    print(f"[{start:.2f}s - {end:.2f}s] {text}")
```
See Whisper docs for full documentation.

**Image Generation** - Generate images from text using stable-diffusion.cpp:
```python
from cyllama.sd import text_to_image

# Simple text-to-image
images = text_to_image(
    model_path="models/sd_xl_turbo_1.0.q8_0.gguf",
    prompt="a photo of a cute cat",
    width=512,
    height=512,
    sample_steps=4,
    cfg_scale=1.0
)
images[0].save("output.png")
```

**Advanced Generation** - Full control with SDContext:
```python
from cyllama.sd import SDContext, SDContextParams, SampleMethod, Scheduler

params = SDContextParams()
params.model_path = "models/sd_xl_turbo_1.0.q8_0.gguf"
params.n_threads = 4

ctx = SDContext(params)
images = ctx.generate(
    prompt="a beautiful mountain landscape",
    negative_prompt="blurry, ugly",
    width=512,
    height=512,
    sample_method=SampleMethod.EULER,
    scheduler=Scheduler.DISCRETE
)
```

**CLI Tool** - Command-line interface:
```sh
# Text to image
python -m cyllama.sd txt2img \
    --model models/sd_xl_turbo_1.0.q8_0.gguf \
    --prompt "a beautiful sunset" \
    --output sunset.png

# Image to image
python -m cyllama.sd img2img \
    --model models/sd-v1-5.gguf \
    --init-img input.png \
    --prompt "oil painting style" \
    --strength 0.7

# Show system info
python -m cyllama.sd info
```

Supports SD 1.x/2.x, SDXL, SD3, FLUX, FLUX2, z-image-turbo, video generation (Wan/CogVideoX), LoRA, ControlNet, inpainting, and ESRGAN upscaling. See Stable Diffusion docs for full documentation.
**Simple RAG** - Query your documents with LLMs:
```python
from cyllama.rag import RAG

# Create RAG instance with embedding and generation models
rag = RAG(
    embedding_model="models/bge-small-en-v1.5-q8_0.gguf",
    generation_model="models/llama.gguf"
)

# Add documents
rag.add_texts([
    "Python is a high-level programming language.",
    "Machine learning is a subset of artificial intelligence.",
    "Neural networks are inspired by biological neurons."
])

# Query
response = rag.query("What is Python?")
print(response.text)
```

**Load Documents** - Support for multiple file formats:
```python
from cyllama.rag import RAG, load_directory

rag = RAG(
    embedding_model="models/bge-small-en-v1.5-q8_0.gguf",
    generation_model="models/llama.gguf"
)

# Load all documents from a directory
documents = load_directory("docs/", glob="**/*.md")
rag.add_documents(documents)

response = rag.query("How do I configure the system?")
```

**Hybrid Search** - Combine vector and keyword search:
```python
from cyllama.rag import HybridStore, Embedder

embedder = Embedder("models/bge-small-en-v1.5-q8_0.gguf")
store = HybridStore("knowledge.db", embedder)
store.add_texts(["Document content..."])

# Hybrid search with configurable weights
results = store.search("query", k=5, vector_weight=0.7, fts_weight=0.3)
```

**Agent Integration** - Use RAG as an agent tool:
```python
from cyllama import LLM
from cyllama.agents import ReActAgent
from cyllama.rag import RAG, create_rag_tool

rag = RAG(
    embedding_model="models/bge-small-en-v1.5-q8_0.gguf",
    generation_model="models/llama.gguf"
)
rag.add_texts(["Your knowledge base..."])

# Create a tool from the RAG instance
search_tool = create_rag_tool(rag)

llm = LLM("models/llama.gguf")
agent = ReActAgent(llm=llm, tools=[search_tool])
result = agent.run("Find information about X in the knowledge base")
```

Supports text chunking, multiple embedding pooling strategies, async operations, reranking, and SQLite-vector for persistent storage.
**GGUF File Manipulation** - Inspect and modify model files:
```python
from cyllama.llama.llama_cpp import GGUFContext

ctx = GGUFContext.from_file("model.gguf")
metadata = ctx.get_all_metadata()
print(f"Model: {metadata['general.name']}")
```
**Structured Output** - JSON schema to grammar conversion:

```python
from cyllama.llama.llama_cpp import json_schema_to_grammar

schema = {"type": "object", "properties": {"name": {"type": "string"}}}
grammar = json_schema_to_grammar(schema)
```
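The resulting grammar can then constrain generation so the model emits only schema-conforming JSON. A hypothetical sketch, assuming `complete()` accepts a `grammar` keyword (an assumption; check the API reference for how grammars are actually passed):

```python
from cyllama import complete
from cyllama.llama.llama_cpp import json_schema_to_grammar

schema = {"type": "object", "properties": {"name": {"type": "string"}}}
grammar = json_schema_to_grammar(schema)

# NOTE: the `grammar` keyword is assumed for illustration only.
response = complete(
    "Return a JSON object with a name field for a fantasy character.",
    model_path="model.gguf",
    grammar=grammar,
)
print(response)  # e.g. {"name": "..."}
```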
**Hugging Face Model Downloads**:

```python
from cyllama.llama.llama_cpp import download_model, list_cached_models, get_hf_file

# Download from Hugging Face (saves to ~/.cache/llama.cpp/)
download_model("bartowski/Llama-3.2-1B-Instruct-GGUF:latest")

# Or with explicit parameters
download_model(hf_repo="bartowski/Llama-3.2-1B-Instruct-GGUF:latest")

# Download a specific file to a custom path
download_model(
    hf_repo="bartowski/Llama-3.2-1B-Instruct-GGUF",
    hf_file="Llama-3.2-1B-Instruct-Q8_0.gguf",
    model_path="./models/my_model.gguf"
)

# Get file info without downloading
info = get_hf_file("bartowski/Llama-3.2-1B-Instruct-GGUF:latest")
print(info)  # {'repo': '...', 'gguf_file': '...', 'mmproj_file': '...'}

# List cached models
models = list_cached_models()
```

- Full llama.cpp API - Complete Cython wrapper with strong typing
- High-Level API - Simple, Pythonic interface (`LLM`, `complete`, `chat`)
- Streaming Support - Token-by-token generation with callbacks
- Batch Processing - Efficient parallel inference
- Multimodal - LLAVA and vision-language models
- Speculative Decoding - 2-3x inference speedup with draft models
- Full whisper.cpp API - Complete Cython wrapper
- High-Level API - Simple `transcribe()` function
- Multiple Formats - WAV, MP3, FLAC, and more
- Language Detection - Automatic or specified language
- Timestamps - Word and segment-level timing
- Full stable-diffusion.cpp API - Complete Cython wrapper
- Text-to-Image - SD 1.x/2.x, SDXL, SD3, FLUX, FLUX2
- Image-to-Image - Transform existing images
- Inpainting - Mask-based editing
- ControlNet - Guided generation with edge/pose/depth
- Video Generation - Wan, CogVideoX models
- Upscaling - ESRGAN 4x upscaling
- GPU Acceleration - Metal, CUDA, Vulkan backends
- Memory Optimization - Smart GPU layer allocation
- Agent Framework - ReActAgent, ConstrainedAgent, ContractAgent
- Framework Integration - OpenAI API, LangChain, FastAPI
**Performance**: Compiled Cython wrappers with minimal overhead

- Strong type checking at compile time
- Zero-copy data passing where possible
- Efficient memory management
- Native integration with llama.cpp optimizations

**Simplicity**: From 50 lines to 1 line for basic generation

- Intuitive, Pythonic API design
- Automatic resource management
- Sensible defaults, full control when needed

**Production-Ready**: Battle-tested and comprehensive

- 863+ passing tests with extensive coverage
- Comprehensive documentation and examples
- Proper error handling and logging
- Framework integration for real applications

**Up-to-Date**: Tracks bleeding-edge llama.cpp (currently b7126)

- Regular updates with latest features
- All high-priority APIs wrapped
- Performance optimizations included
- Current Version: 0.1.18 (December 2025)
- llama.cpp Version: b7126
- Build System: scikit-build-core + CMake
- Test Coverage: 863+ tests passing
- Platforms: macOS, Linux, and Windows (all tested)
- v0.1.18 (Dec 2025) - Remaining stable-diffusion.cpp wrapped
- v0.1.16 (Dec 2025) - Response class, Async API, Chat templates
- v0.1.12 (Nov 2025) - Initial wrapper of stable-diffusion.cpp
- v0.1.11 (Nov 2025) - ACP support, build improvements
- v0.1.10 (Nov 2025) - Agent Framework, bug fixes
- v0.1.9 (Nov 2025) - High-level APIs, integrations, batch processing, comprehensive documentation
- v0.1.8 (Nov 2025) - Speculative decoding API
- v0.1.7 (Nov 2025) - GGUF, JSON Schema, Downloads, N-gram Cache
- v0.1.6 (Nov 2025) - Multimodal test fixes
- v0.1.5 (Oct 2025) - Mongoose server, embedded server
- v0.1.4 (Oct 2025) - Memory estimation, performance optimizations
See CHANGELOG.md for complete release history.
To build cyllama:

- A recent version of `python3` (currently testing on Python 3.13).

- Git clone the latest version of `cyllama`:

  ```sh
  git clone https://github.com/shakfu/cyllama.git
  cd cyllama
  ```

- We use [uv](https://github.com/astral-sh/uv) for package management. If you don't have it, see the link above to install it; otherwise:

  ```sh
  uv sync
  ```

- Type `make` in the terminal. This will:

  - Download and build `llama.cpp`, `whisper.cpp`, and `stable-diffusion.cpp`
  - Install them into the `thirdparty` folder
  - Build `cyllama` using scikit-build-core + CMake
```sh
# Full build (default)
make                 # Build dependencies + editable install

# Build wheel for distribution
make wheel           # Creates wheel in dist/

# Backend-specific builds
make build-metal     # macOS Metal (default on macOS)
make build-cuda      # NVIDIA CUDA
make build-vulkan    # Vulkan (cross-platform)
make build-cpu       # CPU only

# Clean and rebuild
make clean           # Remove build artifacts
make reset           # Full reset including thirdparty
make remake          # Clean rebuild with tests
```

By default, cyllama builds with Metal support on macOS and CPU-only on Linux. To enable other GPU backends (CUDA, Vulkan, etc.):
```sh
# CUDA (NVIDIA GPUs)
make build-cuda

# Vulkan (cross-platform GPU)
make build-vulkan

# Multiple backends
export GGML_CUDA=1 GGML_VULKAN=1
make build
```

See Build Backends for comprehensive backend build instructions.
The tests directory in this repo provides extensive examples of using cyllama.

As a first step, however, you should download a smallish LLM in the .gguf format from Hugging Face. A good small model to start with, and the one assumed by the tests, is Llama-3.2-1B-Instruct-Q8_0.gguf. cyllama expects models to be stored in a models folder in the cloned cyllama directory. To create the models directory (if it doesn't exist) and download this model, just type:

```sh
make download
```

This basically just does:

```sh
cd cyllama
mkdir models && cd models
wget https://huggingface.co/unsloth/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q8_0.gguf
```

Now you can test it using llama-cli or llama-simple:
```sh
bin/llama-cli -c 512 -n 32 -m models/Llama-3.2-1B-Instruct-Q8_0.gguf \
    -p "Is mathematics discovered or invented?"
```

With 863+ passing tests, the library is ready for both quick prototyping and production use:
```sh
make test    # Run full test suite
```

You can also explore interactively:
```sh
python3 -i scripts/start.py
```

```python
>>> from cyllama import complete
>>> response = complete("What is 2+2?", model_path="models/Llama-3.2-1B-Instruct-Q8_0.gguf")
>>> print(response)
```

- User Guide - Comprehensive guide covering all features
- API Reference - Complete API documentation
- Cookbook - Practical recipes and patterns
- Changelog - Complete release history
- Examples - See `tests/examples/` for working code samples
- Full llama.cpp API wrapper with Cython
- High-level API (`LLM`, `complete`, `chat`)
- Async API support (`AsyncLLM`, `complete_async`, `chat_async`)
- Response class with stats and serialization
- Built-in chat template system (llama.cpp templates)
- Batch processing utilities
- OpenAI-compatible API client
- LangChain integration
- Speculative decoding
- GGUF file manipulation
- JSON schema to grammar conversion
- Model download helper
- N-gram cache
- OpenAI-compatible servers (PythonServer, EmbeddedServer, LlamaServer)
- Whisper.cpp integration
- Multimodal support (LLAVA)
- Memory estimation utilities
- Agent Framework (ReActAgent, ConstrainedAgent, ContractAgent)
- Stable Diffusion (stable-diffusion.cpp) - image/video generation
- RAG utilities (text chunking, document processing)
- Response caching for identical prompts
- Web UI for testing
Contributions are welcome! Please see the User Guide for development guidelines.
This project wraps llama.cpp, whisper.cpp, and stable-diffusion.cpp, all of which are distributed under the MIT license, as is cyllama.