#model-inference #ocr #qwen #asr #flash-attention #multimodal #glm #chat #2-5 #ffmpeg

bin+lib aha

aha model inference library, now supports Qwen(2.5VL/3/3VL/3.5/ASR/3Embedding/3Reranker), MiniCPM4, VoxCPM/1.5, DeepSeek-OCR/2, Hunyuan-OCR, PaddleOCR-VL/1.5, RMBG2.0, GLM(ASR-Nano-2512/OCR), Fun-ASR-Nano-2512, LFM(2/2.5/2VL/2.5VL)

16 releases

Uses the Rust 2024 edition

0.2.5 Apr 6, 2026
0.2.4 Mar 24, 2026
0.2.0 Feb 14, 2026
0.1.8 Jan 17, 2026
0.0.0 Sep 21, 2025

#315 in Artificial intelligence

Apache-2.0

5.5MB
32K SLoC




aha

Lightweight AI Inference Engine — All-in-one Solution for Text, Vision, Speech, and OCR

aha is a high-performance, cross-platform AI inference engine built with Rust and the Candle framework. It brings state-of-the-art AI models to your local machine—no API keys, no cloud dependencies, just pure, fast AI running directly on your hardware.

Supported Models

Category    Models
Text        Qwen3, MiniCPM4, LFM2, LFM2.5
Vision      Qwen2.5-VL, Qwen3-VL, Qwen3.5, LFM2.5-VL, LFM2-VL
OCR         DeepSeek-OCR, DeepSeek-OCR-2, PaddleOCR-VL, PaddleOCR-VL1.5, Hunyuan-OCR, GLM-OCR
ASR         GLM-ASR-Nano, Fun-ASR-Nano, Qwen3-ASR
TTS         VoxCPM, VoxCPM1.5
Image       RMBG-2.0 (background removal)
Embedding   Qwen3-Embedding, all-MiniLM-L6-v2
Reranker    Qwen3-Reranker

Why aha?

  • 🚀 High-Performance Inference — Powered by the Candle framework for efficient tensor computation and model inference
  • 🔧 Unified Interface — One tool for text, vision, speech, and OCR
  • 📦 Local-First — All processing runs locally, no data leaves your machine
  • 🎯 Cross-Platform — Works on Linux, macOS, and Windows
  • ⚡ GPU Accelerated — Optional CUDA support for faster inference
  • 🛡️ Memory Safe — Built with Rust for reliability
  • 🧠 Attention Optimization — Optional Flash Attention support for faster long-sequence processing

Changelog

0.2.5 (2026-04-06)

  • add qwen3-embedding/qwen3-reranker/all-minilm-l6-v2

2026-04-03

  • CLI update: a subcommand must now be specified
  • Add repeat_penalty and repeat_last_n to ChatCompletionParameters
  • Add repeat-penalty support to generate

2026-04-02

  • Refactor generate code
  • Return <think>...</think> chain-of-thought content via the reasoning_content field
  • Add timing info to chat responses

2026-04-01

  • refactor deepseek_ocr/fun_asr_nano generate code

2026-03-31

  • Add server and cli modules
  • Model names now use ModelScope IDs
  • Update WhichModel
  • Add timing info to Usage
  • Remove the aha_openai_dive and chrono dependencies

2026-03-30

  • add LFM2.5VL-1.6B
  • add LFM2VL-1.6B

0.2.4 (2026-03-23)

  • add LFM2.5-1.2B-Instruct
  • add LFM2-1.2B

View full changelog

Quick Start

Installation

git clone https://github.com/jhqxxx/aha.git
cd aha
cargo build --release

Optional Features:

# CUDA (NVIDIA GPU acceleration)
cargo build --release --features cuda

# Metal (Apple GPU acceleration for macOS)
cargo build --release --features metal

# Flash Attention (faster inference)
cargo build --release --features cuda,flash-attn

# FFmpeg (multimedia processing)
cargo build --release --features ffmpeg

CLI Quick Reference


# List all supported models
aha list

# Download model only
aha download -m Qwen/Qwen3-ASR-0.6B

# Download model and start service
aha cli -m Qwen/Qwen3-ASR-0.6B

# Run inference directly (without starting service)
aha run -m Qwen/Qwen3-ASR-0.6B -i "audio.wav"

# Run local all-MiniLM-L6-v2 embedding (native safetensors)
aha run -m all-minilm-l6-v2 -i "Rust embedding test" --weight-path D:\model_download\all-MiniLM-L6-v2

# Run local all-MiniLM-L6-v2 embedding (GGUF)
aha run -m all-minilm-l6-v2 -i "Rust embedding test" --artifact-format gguf --gguf-path D:\model_download\All-MiniLM-L6-v2-Embedding-GGUF --tokenizer-dir D:\model_download\all-MiniLM-L6-v2

# Run local all-MiniLM-L6-v2 embedding (ONNX)
aha run -m all-minilm-l6-v2 -i "Rust embedding test" --artifact-format onnx --onnx-path D:\model_download\all-MiniLM-L6-v2\onnx --tokenizer-dir D:\model_download\all-MiniLM-L6-v2

# Run local GLM-OCR (GGUF)
aha run -m glm-ocr -i .\assets\img\ocr_test1.png --artifact-format gguf --gguf-path D:\model_download\GLM-OCR-GGUF

# Run local GLM-OCR (ONNX)
aha run -m glm-ocr -i .\assets\img\ocr_test1.png --artifact-format onnx --onnx-path D:\model_download\GLM-OCR-ONNX --tokenizer-dir D:\model_download\GLM-OCR-ONNX

# Start service only (model already downloaded)
aha serv -m Qwen/Qwen3-ASR-0.6B -p 10100
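The embedding commands above produce raw vectors; downstream, such vectors are usually compared with cosine similarity. A minimal, dependency-free sketch (the function name and toy dimensions are illustrative, not part of aha's API; real all-MiniLM-L6-v2 embeddings are 384-dimensional):

```rust
/// Cosine similarity between two embedding vectors.
/// Illustrative helper, not part of the aha API.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "embeddings must have the same dimension");
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}

fn main() {
    // Toy 4-dimensional "embeddings" for demonstration only.
    let a = [1.0_f32, 0.0, 1.0, 0.0];
    let b = [1.0_f32, 0.0, 1.0, 0.0];
    let c = [0.0_f32, 1.0, 0.0, 1.0];
    println!("{:.2}", cosine_similarity(&a, &b)); // identical vectors -> 1.00
    println!("{:.2}", cosine_similarity(&a, &c)); // orthogonal vectors -> 0.00
}
```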

Chat

aha serv -m Qwen/Qwen3-0.6B -p 10100

Then use the unified (OpenAI-compatible) API:

curl http://localhost:10100/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": false
  }'
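When calling the endpoint from code rather than curl, the request body is plain JSON. A std-only sketch that assembles (and minimally escapes) the same request body; the helper name is hypothetical, and a real client would normally use serde_json instead of hand-rolled escaping:

```rust
/// Build a chat-completion request body for the OpenAI-compatible endpoint.
/// Hand-rolled JSON for illustration only; prefer serde_json in real code.
fn chat_request_body(model: &str, user_content: &str, stream: bool) -> String {
    // Escape the characters that would break a JSON string literal.
    let escape = |s: &str| {
        s.chars()
            .flat_map(|c| match c {
                '"' => vec!['\\', '"'],
                '\\' => vec!['\\', '\\'],
                '\n' => vec!['\\', 'n'],
                c => vec![c],
            })
            .collect::<String>()
    };
    format!(
        r#"{{"model":"{}","messages":[{{"role":"user","content":"{}"}}],"stream":{}}}"#,
        escape(model),
        escape(user_content),
        stream
    )
}

fn main() {
    let body = chat_request_body("Qwen/Qwen3-0.6B", "Hello!", false);
    println!("{body}");
}
```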

Documentation

Document            Description
Getting Started     First steps with aha
Installation        Detailed installation guide
CLI Reference       Command-line interface
API Documentation   Library & REST API
Supported Models    Available AI models
Concepts            Architecture & design
Development         Contributing guide
Changelog           Version history

Development

Using aha as a Library

cargo add aha

// VoxCPM example
use aha::models::voxcpm::generate::VoxCPMGenerate;
use aha::utils::audio_utils::save_wav;
use anyhow::Result;

fn main() -> Result<()> {
    let model_path = "xxx/openbmb/VoxCPM-0.5B/";

    let mut voxcpm_generate = VoxCPMGenerate::init(model_path, None, None)?;

    let generate = voxcpm_generate.generate(
        "The sun is shining bright, flowers smile at me, birds say early early early".to_string(),
        None,
        None,
        2,
        100,
        10,
        2.0,
        false,
        6.0,
    )?;

    let _ = save_wav(&generate, "voxcpm.wav")?;
    Ok(())
}
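save_wav takes the generated samples and writes a WAV file. For context, audio samples are typically produced as f32 values in [-1.0, 1.0] and stored as 16-bit PCM on disk; the conversion can be sketched as below (illustrative only, aha's save_wav handles this internally):

```rust
/// Convert f32 samples in [-1.0, 1.0] to 16-bit PCM, the usual WAV sample
/// format. Illustrative sketch; not aha's actual implementation.
fn f32_to_i16_pcm(samples: &[f32]) -> Vec<i16> {
    samples
        .iter()
        .map(|&s| (s.clamp(-1.0, 1.0) * i16::MAX as f32) as i16)
        .collect()
}

fn main() {
    // Out-of-range input (2.0) is clamped before scaling.
    let pcm = f32_to_i16_pcm(&[0.0, 1.0, -1.0, 2.0]);
    println!("{:?}", pcm); // [0, 32767, -32767, 32767]
}
```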

Extending New Models

  • Create a new model file in src/models/
  • Export it in src/models/mod.rs
  • Add CLI inference support for the model in src/exec/
  • Add tests and examples in tests/
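The wiring above can be sketched as follows. Every name here (FooGenerate, the init/generate signatures) is hypothetical, loosely mirroring the VoxCPMGenerate pattern shown earlier; consult the existing modules in src/models/ for the real conventions. Real modules return anyhow::Result; this sketch uses plain values to stay dependency-free:

```rust
/// Hypothetical skeleton for a new model under src/models/.
/// Names and signatures are illustrative, not aha's actual API.
pub struct FooGenerate {
    model_path: String,
}

impl FooGenerate {
    /// Mirrors the `init(model_path, ...)` shape of e.g. VoxCPMGenerate.
    pub fn init(model_path: &str) -> FooGenerate {
        // A real implementation would load weights and the tokenizer here.
        FooGenerate { model_path: model_path.to_string() }
    }

    /// Runs inference on a prompt; a real implementation would execute the
    /// Candle forward pass and decode tokens instead of echoing the input.
    pub fn generate(&mut self, prompt: &str) -> String {
        format!("[{}] {}", self.model_path, prompt)
    }
}

fn main() {
    let mut model = FooGenerate::init("path/to/model");
    println!("{}", model.generate("hello"));
}
```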

Features

  • High-performance inference via Candle framework
  • Multi-modal model support (vision, language, speech)
  • Clean, easy-to-use API design
  • Minimal dependencies, compact binaries
  • Flash Attention support for long sequences
  • FFmpeg support for multimedia processing

License

Apache-2.0 — See LICENSE for details.

Acknowledgments

  • Candle - Excellent Rust ML framework
  • All model authors and contributors


Built with ❤️ by the aha team

We're continuously expanding our model support. Contributions are welcome!

If this project helps you, please consider giving us a ⭐ Star!

Dependencies

~70–115MB
~2M SLoC