lele: Bare-Metal Rust AI Inference Engine

lele is a standalone, dependency-free inference engine for ONNX models, built from scratch in pure Rust.

It rejects the "general-purpose runtime" approach (wrapping C++ libraries such as ONNX Runtime, or using heavy Torch ports) in favor of ahead-of-time (AOT) compilation: ONNX graphs are converted into specialized Rust source code with hand-crafted, domain-specific kernels.

What Can It Do?

lele runs real-world models for speech recognition, voice activity detection, text-to-speech, vision, and OCR, all in pure Rust, on CPU, with zero runtime dependencies:

Domain	Model	Task
ASR	SenseVoice Small	Multilingual speech recognition (INT8 quantized)
VAD	Silero VAD	Voice activity detection (streaming)
TTS	Supertonic 2 / 3	Text-to-speech (5 / 31 languages, expression tags)
TTS	Hojo-TTS-Light	Bilingual ZH/EN voice-cloning TTS (0.08B Token-LM)
TTS	MOSS-TTS-Nano	End-to-end neural TTS (SDOT int8 GEMV)
Vision	Yolo26	Real-time object detection
Vision	Yolo26n-Seg	Instance segmentation
OCR	PP-OCRv6_tiny	End-to-end text recognition (DET + REC)

All of the above also run in the browser via WebAssembly.

Performance (2026-06-13)

Single-threaded comparison against ONNX Runtime (ORT, CPU) on macOS (Apple Silicon). ORT is configured with intra_op_num_threads=1 and inter_op_num_threads=1. Speech models report steady-state real-time factor (RTF, lower is better), measured as a multi-run average after warmup.

Model	ORT	lele	Speedup
Silero VAD	RTF 0.002882	RTF 0.0022	1.31x
SenseVoice	RTF 0.0294	RTF 0.0256	1.15x
Supertonic 2	RTF 0.1667	RTF 0.0550	3.03x
Supertonic 3	—	RTF 0.1585	—
Yolo26	704.50 ms	534.97 ms	1.32x
Yolo26n-Seg	126.51 ms	64.82 ms	1.95x
Hojo-TTS-Light	RTF 8.75	RTF 6.69	1.31x
MOSS-TTS-Nano	RTF 0.376	RTF 0.293	1.28x
PP-OCRv6 DET	47.78 ms	46.83 ms	1.02x
PP-OCRv6 REC	2.94 ms/region	2.33 ms/region	1.26x

PP-OCRv6 uses NEON depthwise convolution (4-accumulator FMA, vld2q stride-2), GELU fusion, Conv+Add(bias) fusion, and NEON reduce_mean for SE blocks. Hojo-TTS-Light is a 3-stage pipeline (encoder → AR-LLM → decoder); lele uses a custom KV-cache for the AR loop. MOSS-TTS-Nano uses SDOT int8 GEMV (137 GFLOPS on Apple M2 Pro NEON).
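
For readers unfamiliar with the metric: RTF is processing time divided by the duration of the audio processed, so values below 1.0 are faster than real time. The sketch below shows that definition and how the Speedup column appears to follow from the ORT and lele columns (baseline divided by candidate). The function names are illustrative and are not part of lele.

```rust
use std::time::Duration;

/// Real-time factor: processing time divided by the duration of the audio
/// that was processed. Lower is better; below 1.0 is faster than real time.
fn real_time_factor(processing: Duration, audio: Duration) -> f64 {
    processing.as_secs_f64() / audio.as_secs_f64()
}

/// Speedup of `candidate` over `baseline` for a lower-is-better metric
/// (RTF or latency in milliseconds).
fn speedup(baseline: f64, candidate: f64) -> f64 {
    baseline / candidate
}
```

For example, `speedup(126.51, 64.82)` is about 1.95, which matches the Yolo26n-Seg row.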

Key Features

AOT Compilation Pipeline

lele is not an interpreter. It compiles ONNX graphs into Rust source code at build time:

ONNX → Rust codegen: Each operator becomes a direct function call with static buffer allocation. No graph traversal, no dispatch overhead at runtime.
Operator fusion: 12 built-in graph patterns are detected and fused into single optimized kernels:
- LayerNorm: 9-node subgraph (ReduceMean→Sub→Pow→ReduceMean→Add→Sqrt→Div→Mul→Add) → one layer_norm call
- GELU: 5-node subgraph (Div→Erf→Add→Mul→Mul) → one gelu call
- Conv + Add(bias): Conv→Add → single-pass fused bias convolution
- Conv + SiLU: Conv→Sigmoid→Mul → single-pass conv2d_silu
- Also: Conv + ReLU, SiLU (single / double / triple), Linear (MatMul→Add), Embedding Concat, Quantized Linear (6-node DynamicQuantizeLinear chain)
Constant folding: Shape, Unsqueeze, Squeeze, Concat, Slice, Cast, and ConstantOfShape with all-constant inputs are evaluated at compile time and removed from the graph.
Buffer allocation: Liveness analysis assigns each tensor to a reusable workspace buffer with zero-copy aliasing for Reshape/Squeeze/Unsqueeze/Identity/Cast.
Weight deduplication: Identical initializers share a single offset in the binary blob.
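
To make the buffer-allocation step more concrete, here is a minimal sketch of liveness-based buffer reuse. It illustrates the general idea only and is not lele's actual allocator: `LiveRange` and `assign_buffers` are hypothetical names, and the greedy first-fit policy is an assumption.

```rust
/// A tensor's live range in topological node order (inclusive on both ends),
/// plus its size in elements. Hypothetical type for illustration.
#[derive(Debug, Clone, Copy)]
struct LiveRange {
    def: usize,      // index of the node that produces the tensor
    last_use: usize, // index of the last node that reads it
    len: usize,      // number of elements
}

/// Greedy first-fit assignment: reuse the first workspace buffer whose
/// previous tenant is already dead, otherwise open a new buffer.
/// Returns (buffer index per tensor, required capacity per buffer).
fn assign_buffers(ranges: &[LiveRange]) -> (Vec<usize>, Vec<usize>) {
    // Visit tensors in the order they are produced.
    let mut order: Vec<usize> = (0..ranges.len()).collect();
    order.sort_by_key(|&i| ranges[i].def);

    let mut assignment = vec![0usize; ranges.len()];
    let mut capacity: Vec<usize> = Vec::new();   // size of each buffer
    let mut free_after: Vec<usize> = Vec::new(); // last_use of current tenant

    for i in order {
        let r = ranges[i];
        // A buffer is reusable only if its tenant died strictly before this
        // tensor is produced (a node cannot read and overwrite the same buffer).
        let slot = free_after.iter().position(|&end| end < r.def);
        let b = match slot {
            Some(b) => b,
            None => {
                capacity.push(0);
                free_after.push(0);
                capacity.len() - 1
            }
        };
        capacity[b] = capacity[b].max(r.len);
        free_after[b] = r.last_use;
        assignment[i] = b;
    }
    (assignment, capacity)
}
```

With a simple chain of four tensors where each is consumed by the next node, this sketch needs only two buffers, which alternate between producer and consumer.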

Extensible Compiler API

The Compiler builder lets you customize code generation for any model:

use lele::compiler::Compiler;

let compiler = Compiler::new()
    .with_name("MyModel")           // Set generated struct name
    .with_default_optimizations()   // Load fusion patterns + helper methods
    .with_constant_folding(true)    // Enable compile-time folding
    .with_override("MyCustomOp", |node, inputs, outputs, buf, w, indent| {
        // Generate custom Rust code for this operator here (body elided)
        Ok(())
    })
    // my_matcher and my_generator are user-supplied functions (not shown)
    .with_pattern("MyFusion", my_matcher, my_generator);  // Custom fusion

Runtime / Compiler Split

The compiler feature is enabled by default (default = ["compiler"]) but optional. For deployment, you can depend on lele with default-features = false and ship only the generated Rust code and the weights binary, with no protobuf parser and no ONNX dependency in the shipped build.

SIMD-Optimized Kernels

AArch64 (NEON): Hand-written intrinsics for matmul, convolution, activations, normalization. Includes specialized depthwise convolution (4-accumulator FMA latency hiding, vld2q stride-2 decimation), per-channel broadcast add/mul for SE blocks, NEON reduce_mean, and SDOT/I8MM int8 GEMV for quantized inference.
x86_64 (AVX2/FMA): SIMD paths for GEMM, convolution, LSTM/GRU gates, normalization.
WASM (SIMD128): Tiled matmul micro-kernel with f32x4 4× blocking, SIMD activations.
Apple Accelerate: Optional cblas_sgemm FFI for AMX-accelerated GEMM.
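
The "4-accumulator" idea mentioned for the NEON depthwise kernel can be illustrated in portable scalar Rust. The sketch below is not lele's kernel (which uses hand-written intrinsics); `dot_4acc` is a hypothetical name. It only shows why splitting a reduction across independent accumulators helps: consecutive multiply-adds no longer depend on each other, so the CPU can overlap them.

```rust
/// Dot product with four independent accumulators.
/// Each lane accumulates every fourth product, which breaks the serial
/// dependency chain of a single running sum.
fn dot_4acc(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "inputs must have equal length");
    let mut acc = [0.0f32; 4];
    let mut a4 = a.chunks_exact(4);
    let mut b4 = b.chunks_exact(4);
    for (ca, cb) in a4.by_ref().zip(b4.by_ref()) {
        for k in 0..4 {
            acc[k] += ca[k] * cb[k];
        }
    }
    // Combine the lanes, then handle the tail that did not fill a chunk.
    let mut sum = (acc[0] + acc[1]) + (acc[2] + acc[3]);
    for (x, y) in a4.remainder().iter().zip(b4.remainder()) {
        sum += x * y;
    }
    sum
}
```

Because the summation order differs from a naive left-to-right loop, results can differ from it in the last few bits for general floating-point inputs.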

Advanced Kernel Capabilities

INT8 quantization: DynamicQuantizeLinear, MatMulInteger, fused quantized linear with pre-packed weights
FP16 weights: f16 weight inference with dedicated GEMV kernel
Recurrent: LSTM and GRU with SIMD gate kernels
Autoregressive KV-cache: Custom cache for AR-LLM generation (used in Hojo-TTS)
STFT: Built-in Short-Time Fourier Transform for audio processing
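
As a reference for what dynamic INT8 quantization computes, here is a scalar sketch that follows the formula in the ONNX DynamicQuantizeLinear operator specification (per-tensor, u8 output, round half to even). It is not lele's implementation, which is fused and SIMD-optimized. The function names are hypothetical, and the fallback for an all-zero input is an assumption. It relies on `f32::round_ties_even`, available in recent stable Rust.

```rust
/// Dynamic per-tensor quantization of f32 to u8.
/// Returns (quantized values, scale, zero point).
fn dynamic_quantize_linear(x: &[f32]) -> (Vec<u8>, f32, u8) {
    // The range is widened to include 0.0 so that zero is exactly representable.
    let min = x.iter().copied().fold(0.0f32, f32::min);
    let max = x.iter().copied().fold(0.0f32, f32::max);

    let mut scale = (max - min) / 255.0;
    if scale == 0.0 {
        scale = 1.0; // assumption: fallback for a degenerate all-zero input
    }
    let zero_point = (-min / scale).round_ties_even().clamp(0.0, 255.0) as u8;

    let q: Vec<u8> = x
        .iter()
        .map(|&v| {
            let r = (v / scale).round_ties_even() + zero_point as f32;
            r.clamp(0.0, 255.0) as u8 // saturate to the u8 range
        })
        .collect();
    (q, scale, zero_point)
}

/// Inverse mapping: (q - zero_point) * scale.
fn dequantize(q: &[u8], scale: f32, zero_point: u8) -> Vec<f32> {
    q.iter()
        .map(|&v| (v as i32 - zero_point as i32) as f32 * scale)
        .collect()
}
```

For the input `[-1.0, 0.0, 1.0, 2.0]` the scale is 3/255 and the zero point is 85, so the four values map to 0, 85, 170, and 255.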

Audio Feature Extraction

Built-in DSP frontend for speech models:

FFT: Radix-2 real FFT with precomputed twiddle factors
Mel spectrogram: HTK mel scale with sparse filterbank (only nonzero weights stored)
LFR: Low Frame Rate stacking (reduces frame rate by N×)
CMVN: Cepstral Mean-Variance Normalization
Integrated pipeline: SenseVoiceFrontend — scale → pre-emphasis → Hann window → FFT → mel → log → LFR
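
Two of the steps above are easy to show in a few lines. The sketch below gives the standard HTK mel-scale conversion and a simplified LFR stacking routine. The names `hz_to_mel`, `mel_to_hz`, and `lfr` are illustrative and not lele's API. The tail-padding policy (repeating the last frame) is an assumption, and the stack and stride values used by SenseVoiceFrontend are not stated here.

```rust
/// HTK mel scale: mel = 2595 * log10(1 + hz / 700).
fn hz_to_mel(hz: f32) -> f32 {
    2595.0 * (1.0 + hz / 700.0).log10()
}

/// Inverse of `hz_to_mel`.
fn mel_to_hz(mel: f32) -> f32 {
    700.0 * (10f32.powf(mel / 2595.0) - 1.0)
}

/// Low Frame Rate stacking: concatenate `stack` consecutive frames and
/// advance by `stride` frames, so the output has about 1/stride as many frames.
/// Assumption: the tail is padded by repeating the last frame.
fn lfr(frames: &[Vec<f32>], stack: usize, stride: usize) -> Vec<Vec<f32>> {
    let mut out = Vec::new();
    if frames.is_empty() || stack == 0 || stride == 0 {
        return out;
    }
    let last = frames.len() - 1;
    let mut start = 0;
    while start < frames.len() {
        let mut stacked = Vec::with_capacity(stack * frames[0].len());
        for k in 0..stack {
            // Clamp the index so frames past the end repeat the last frame.
            stacked.extend_from_slice(&frames[(start + k).min(last)]);
        }
        out.push(stacked);
        start += stride;
    }
    out
}
```

With five one-dimensional frames, `stack = 3`, and `stride = 2`, this produces three stacked frames, the last of which is padded with copies of the final input frame.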

WebAssembly Support

lele compiles to WASM and runs ML inference directly in the browser with no server required.

The web demo includes ASR, TTS, object detection, and instance segmentation — all client-side.

Optimization	Impact
WASM SIMD128	Tiled matmul with `f32x4` (4x unroll)
Optimized Activations	SIMD paths for tanh/sigmoid/relu/silu
Release Settings	`opt-level=3`, `lto=true`, `codegen-units=1`, `panic="abort"`
Post-Processing	`wasm-opt -O3` for 5-15% size/speed gains

cd examples/web-demo
./build_wasm.sh
python3 -m http.server 8080 -d web
# Open http://localhost:8080

See examples/web-demo/README.md for details.

Supported ONNX Operators

60+ operators across the following categories:

Math: Add, Sub, Mul, Div, Mod, Pow, Sqrt, Neg, Exp, Log, Sin, Cos, Erf, Softplus, Clip, Round, Floor, Ceil, Reciprocal, CumSum, Max, Range, Einsum (partial)
Neural Network: Conv, ConvTranspose, ConvInteger, Gemm, MatMul, MatMulInteger, LSTM, GRU, BatchNormalization, LayerNormalization, MaxPool, AveragePool, GlobalAveragePool, Resize
Activation: Relu, LeakyRelu, Sigmoid, Tanh, Softmax, PRelu, HardSigmoid, ArgMax
Tensor: Reshape, Transpose, Concat, Split, Slice, Gather, GatherElements, Pad, Expand, Tile, Where, TopK, Flatten, Squeeze, Unsqueeze
Reduction: ReduceSum, ReduceMean, ReduceMax, ReduceL2
Comparison: Equal, Less, Greater, LessOrEqual, GreaterOrEqual, And, Or, Xor, Not
Control Flow: If (then/else subgraph recursion)
Signal: STFT
Other: Shape, Size, Cast, ConstantOfShape, DynamicQuantizeLinear, Identity, Constant

SiLU is available as a fused pattern (Sigmoid→Mul), and GELU is fused via a compiler pattern (Div→Erf→Add→Mul→Mul).
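
To show what fusing Sigmoid→Mul buys, the sketch below contrasts an unfused two-pass SiLU with a fused single-pass version. Both compute x * sigmoid(x). These are plain scalar illustrations with hypothetical names, not lele's generated code or SIMD kernels.

```rust
/// Unfused: Sigmoid writes an intermediate tensor, then Mul reads it back.
fn silu_unfused(x: &[f32]) -> Vec<f32> {
    let sig: Vec<f32> = x.iter().map(|&v| 1.0 / (1.0 + (-v).exp())).collect();
    x.iter().zip(&sig).map(|(&v, &s)| v * s).collect()
}

/// Fused: one pass over the data, in place, with no intermediate buffer.
fn silu_fused(x: &mut [f32]) {
    for v in x.iter_mut() {
        // x * sigmoid(x) == x / (1 + exp(-x))
        *v = *v / (1.0 + (-*v).exp());
    }
}
```

The fused form touches each element once and allocates nothing, which is the kind of saving a fusion pattern is meant to capture.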

Getting Started

Prerequisites

Rust (latest stable)
cargo

Option 1: CLI Codegen

Convert an ONNX model into Rust source code:

cargo run --release --bin lele_gen -- <model.onnx> <output.rs>

This produces <output.rs> (the model code) and <output>_weights.bin (the weights).

Option 2: Build-Script Integration (`lele-build`)

For a seamless build workflow, use the lele-build crate in your build.rs. It supports model download from Hugging Face Hub, URL, or local path, with caching and automatic regeneration:

# model.toml
[model]
source = "hf-hub"
repo = "your-org/your-model"
files = [{ file = "model.onnx" }]

[codegen]
class_name = "MyModel"

// build.rs
use lele_build::{config::ModelConfig, *};

fn main() {
    let cfg = ModelConfig::load("model.toml").unwrap();
    if need_regenerate_with_model("MyModel", "src/gen", "model.toml", None) {
        // Download the model and run codegen here (elided in this example)
    }
}

Set HF_ENDPOINT to use a Hugging Face mirror. Set LELE_SKIP_MODEL_GEN=1 to skip generation. If the download fails, a stub that still compiles is generated, so cargo build succeeds regardless.

Running Examples

./run_sensevoice.sh          # SenseVoice ASR
./run_silero.sh              # Silero VAD
./run_supertonic.sh          # Supertonic 2 TTS (5 languages)
./run_supertonic3.sh         # Supertonic 3 TTS (31 languages)
./run_yolo26.sh              # Yolo26 object detection
./run_yolo26n_seg.sh         # Yolo26n-Seg instance segmentation
./run_hojo_tts.sh "text"     # Hojo-TTS-Light voice cloning
cd examples/moss-tts-nano && cargo run --release -- "text"  # MOSS-TTS-Nano
cd examples/ppocr && cargo run --release -- image.png      # PP-OCRv6 OCR

Supported Models

Model	Type	Details
SenseVoiceSmall	ASR	Multilingual speech recognition
Silero VAD	VAD	Streaming voice activity detection
Supertonic 2	TTS	5 languages
Supertonic 3	TTS	31 languages, expression tags (`<laugh>`, `<breath>`, `<sigh>`)
Hojo-TTS-Light	TTS	0.08B bilingual ZH/EN, voice cloning (encoder → AR-LLM → decoder)
MOSS-TTS-Nano	TTS	End-to-end neural TTS with int8 GEMV
Yolo26	Vision	Real-time object detection
Yolo26n-Seg	Vision	Instance segmentation
PP-OCRv6_tiny	OCR	End-to-end text recognition (DET + REC pipeline)

Architecture

ONNX Model
    │
    ▼
┌──────────────┐     ┌───────────────────────┐
│  lele-build  │────▶│      Compiler         │
│ (download,   │     │  ┌─────────────────┐  │
│  cache)      │     │  │ Constant Fold   │  │
└──────────────┘     │  │ Pattern Fusion  │  │
                     │  │ Buffer Alloc    │  │
                     │  │ Code Gen        │  │
                     │  └─────────────────┘  │
                     └──────────┬────────────┘
                                │
                    ┌───────────▼───────────┐
                    │  model.rs + .bin      │
                    │  (pure Rust, no deps) │
                    └───────────┬───────────┘
                                │
              ┌─────────────────┼─────────────────┐
              ▼                 ▼                  ▼
        Native (NEON/AVX)   WASM (SIMD128)   Embedded

Roadmap

Broader operator coverage — support as many ONNX operators as possible so any model "just works"
CPU-first performance — push SIMD (NEON/AVX/WASM) and multi-threading to be the fastest CPU inference engine
Lightweight & edge — stay zero-dependency, minimal binary size, ideal for embedded and resource-constrained devices
More models — Whisper, CosyVoice, and other popular edge-friendly models
Advanced techniques — FlashAttention, PagedAttention for long-sequence efficiency (still CPU)
Voice API server — RESTful ASR/TTS/Denoise endpoints for edge deployment

License

MIT
