3 unstable releases

| Version | Released |
|---|---|
| 0.2.0-rc.2 | Feb 10, 2026 |
| 0.1.0 | Dec 3, 2025 |
#594 in Machine learning
5.5MB
118K SLoC
Infernum
"From the depths, intelligence rises"
Local inference CLI for running large language models on your machine.
Quick Start
# Install
cargo install infernum
# Set your model
infernum config set-model TinyLlama/TinyLlama-1.1B-Chat-v1.0
# Start chatting
infernum chat
Features
- Local Inference: Run LLMs entirely on your machine - no API keys, no cloud
- Standard API: Industry-standard /v1/* server at localhost:8080 (see the sketch after this list)
- Interactive Chat: Full-featured CLI chat with history and session management
- Multi-Backend: CPU, CUDA (NVIDIA), and Metal (Apple Silicon) support
- Smart Caching: Models download once via HuggingFace Hub
- Streaming: Real-time token-by-token output
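Because the server exposes a standard /v1/* surface, most OpenAI-compatible clients should be able to point at localhost:8080 directly. A minimal sketch with infernum server running, assuming the conventional /v1/chat/completions route and request shape (the README only promises /v1/*, so treat both as assumptions):

# Query the local server, assuming an OpenAI-style chat completions route;
# swap in whichever model you configured
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        "messages": [{"role": "user", "content": "Hello"}],
        "stream": true
      }'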
Commands
# Chat interface
infernum chat [--model MODEL] [--system PROMPT]
# Start API server
infernum server [--port PORT]
# Download a model
infernum pull meta-llama/Llama-3.2-3B-Instruct
# List available models
infernum list
# System diagnostics
infernum doctor
# Configuration
infernum config set-model MODEL
infernum config get-model
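A typical workflow simply chains the commands above: pull a model once, set it as the default, then serve it.

# End-to-end flow built only from the commands listed above
infernum pull meta-llama/Llama-3.2-3B-Instruct
infernum config set-model meta-llama/Llama-3.2-3B-Instruct
infernum server --port 8080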
GPU Support
# Install with CUDA support
cargo install infernum --features cuda
# Install with Metal support (macOS)
cargo install infernum --features metal
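If a CPU-only build is already installed, cargo install needs --force to overwrite the existing binary. After reinstalling, infernum doctor is a natural way to confirm the result, assuming its diagnostics report which backend is active.

# Reinstall over an existing build (cargo's --force flag), then check
# the active backend, assuming doctor reports it in its diagnostics
cargo install infernum --features cuda --force
infernum doctor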
Architecture
Infernum is built from specialized components:
- abaddon: High-performance inference engine
- malphas: Model orchestration and lifecycle management
- stolas: Knowledge retrieval and RAG capabilities
- beleth: Agent framework for autonomous AI
- asmodeus: Model fine-tuning and adaptation
- dantalion: Observability and telemetry
- infernum-server: HTTP API server
Documentation
Full documentation available at infernum.daemoniorum.com
License
Licensed under either of Apache License, Version 2.0 or MIT license at your option.
Dependencies
~88–125MB
~2M SLoC