lele is a standalone, dependency-free inference engine for deep learning models, built from scratch in pure Rust.
It rejects the "general-purpose runtime" approach (wrapping C++ libraries such as ONNX Runtime, or using heavyweight Torch ports) in favor of hand-crafted, domain-specific kernels.
lele is designed to run deep learning models with minimal overhead, specifically speech models such as SenseVoice, Silero VAD, and TTS, and even vision models like Yolo26.
An in-depth comparison of lele against ONNX Runtime (CPU) on macOS (Apple Silicon). All benchmarks were run with single-thread affinity for a fair comparison.
| Model | ORT (CPU) | lele | Speedup |
|---|---|---|---|
| Silero VAD | 0.0031 RTF | 0.0016 RTF | 1.93x |
| SenseVoice | 0.032 RTF | 0.051 RTF | 0.63x |
| Supertonic | 0.122 RTF | 0.134 RTF | 0.91x |
| Yolo26 | 759.19 ms | 1050.56 ms | 0.72x |
Note: RTF (Real-Time Factor) is inference time divided by audio duration, so lower is better; for example, taking 0.16 s to process 100 s of audio gives an RTF of 0.0016. Yolo26 is not an audio model, so it is reported as inference latency in milliseconds instead.
- Zero Runtime Dependencies: Generated models are pure Rust.
- AOT Compilation: Converts ONNX models to specialized Rust code for maximum performance.
- SIMD Optimized: Hand-written kernels using Apple Silicon (NEON) and x86_64 (AVX/SSE) intrinsics; see the sketch after this list.
- Memory Efficient: Static buffer allocation and zero-copy weight loading.
- Speech Optimized: Built-in feature extraction for audio (FFT, Mel-spectrogram, LFR, CMVN).
- WebAssembly Ready: Full browser compatibility with WASM SIMD128 optimizations.
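To make the "hand-written kernels" point concrete, here is a minimal sketch of a NEON f32 dot-product micro-kernel in pure Rust. It is illustrative only, not lele's actual kernel code; the function name and structure are assumptions.

```rust
// Minimal NEON sketch (illustrative; not lele's actual kernel): dot product
// of two f32 slices, 4 lanes at a time, with a scalar tail.
#[cfg(target_arch = "aarch64")]
pub fn dot_neon(a: &[f32], b: &[f32]) -> f32 {
    use std::arch::aarch64::*;
    assert_eq!(a.len(), b.len());
    let chunks = a.len() / 4;
    unsafe {
        let mut acc = vdupq_n_f32(0.0); // 4-lane accumulator
        for i in 0..chunks {
            let va = vld1q_f32(a.as_ptr().add(i * 4)); // load 4 f32 from a
            let vb = vld1q_f32(b.as_ptr().add(i * 4)); // load 4 f32 from b
            acc = vfmaq_f32(acc, va, vb);              // acc += va * vb (fused)
        }
        let mut sum = vaddvq_f32(acc); // horizontal add of the 4 lanes
        for i in chunks * 4..a.len() {
            sum += a[i] * b[i]; // scalar tail for lengths not divisible by 4
        }
        sum
    }
}
```

The same pattern maps onto x86_64 via AVX/SSE intrinsics, and onto the browser via WASM SIMD128 (see below).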
lele compiles to WebAssembly and runs ML inference directly in the browser with no server required.
| Optimization | Impact |
|---|---|
| WASM SIMD128 | Tiled matmul micro-kernel with f32x4_mul/f32x4_add (4x unroll) |
| Optimized Activations | SIMD paths for tanh/sigmoid/relu/silu using polynomial exp approximation |
| Vectorized Normalization | SIMD softmax and layer_norm with horizontal reduction |
| Release Settings | opt-level=3, lto=true, codegen-units=1, panic="abort" |
| Post-Processing | wasm-opt -O3 for additional 5-15% size/speed gains |
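As a rough illustration of the SIMD128 micro-kernel style from the table above, here is a minimal 4-lane fused multiply-accumulate loop using `core::arch::wasm32` intrinsics. It is a simplified stand-in for lele's tiled matmul kernel, not the real thing, and assumes the crate is built with the `simd128` target feature enabled.

```rust
// Minimal SIMD128 sketch (not lele's real kernel): y[i] += a * x[i], four
// lanes at a time. Requires building with `-C target-feature=+simd128`.
#[cfg(target_arch = "wasm32")]
pub fn axpy_simd128(y: &mut [f32], a: f32, x: &[f32]) {
    use core::arch::wasm32::*;
    assert_eq!(y.len(), x.len());
    let va = f32x4_splat(a);
    let n = x.len() / 4 * 4;
    for i in (0..n).step_by(4) {
        unsafe {
            // Unaligned 128-bit loads of 4 f32 lanes from x and y.
            let vx = v128_load(x.as_ptr().add(i) as *const v128);
            let vy = v128_load(y.as_ptr().add(i) as *const v128);
            // vy + va * vx: the same f32x4_mul/f32x4_add pair named above.
            let r = f32x4_add(vy, f32x4_mul(va, vx));
            v128_store(y.as_mut_ptr().add(i) as *mut v128, r);
        }
    }
    for i in n..x.len() {
        y[i] += a * x[i]; // scalar tail for lengths not divisible by 4
    }
}
```

lele's real kernel additionally tiles and unrolls 4x, which is where the extra throughput comes from.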
Binary Size Reduction: 2.9 MB → 1.7 MB for SenseVoice, 42% smaller than the dev build once optimizations are applied.
Expected Runtime Speedup: 20-100x over unoptimized scalar WASM (10-50x from release mode, plus 2-4x from SIMD128).
```bash
cd examples/web-demo
./build_wasm.sh
python3 -m http.server 8080 -d web
# Open http://localhost:8080
```

See examples/web-demo/README.md for details.
- SenseVoiceSmall: High-accuracy multi-lingual ASR.
- Silero VAD: Reliable Voice Activity Detection.
- Supertonic: Fast and high-quality Text-to-Speech.
- Yolo26: Real-time object detection.
- Rust (latest stable)
- cargo
To compile an ONNX model into Rust code:
```bash
cargo run --release --bin lele_gen -- <model.onnx> <output_path.rs>
```
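To give a feel for what compiling an ONNX model to "specialized Rust code" means, here is a hypothetical sketch of the kind of code such a generator can emit: shapes are baked in as constants, so there is no graph interpreter and no allocation at inference time. The names, dimensions, and fusion below are made up for illustration and are not lele_gen's actual output.

```rust
// Hypothetical shape of AOT-generated code (illustrative; not lele_gen output).
// Dimensions are compile-time constants taken from the ONNX graph, so the
// optimizer can fully unroll/vectorize and no dynamic dispatch is needed.
const IN: usize = 64;
const OUT: usize = 32;

/// One fully-connected layer with a fused ReLU, emitted as a specialized function.
pub fn dense_relu(x: &[f32; IN], w: &[f32; OUT * IN], b: &[f32; OUT], y: &mut [f32; OUT]) {
    for o in 0..OUT {
        let mut acc = b[o];
        for i in 0..IN {
            acc += w[o * IN + i] * x[i]; // row-major weight layout
        }
        y[o] = acc.max(0.0); // ReLU fused into the matmul loop
    }
}
```

Static shapes of this kind are also what enable the static buffer allocation and zero-copy weight loading mentioned above.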
To run the bundled examples:

```bash
# SenseVoice ASR
./run_sensevoice.sh

# Supertonic TTS
./run_supertonic.sh

# Silero VAD
./run_silero.sh

# Yolo26 Object Detection
./run_yolo26.sh
```

- Performance optimizations (SIMD, multi-threading, etc.) to close the remaining gaps with ONNX Runtime and eventually surpass it.
- Support for more audio models (e.g., Whisper, CosyVoice)
- GPU acceleration backend (wgpu)
- Quantization (INT8/FP16)
- Advanced attention mechanisms (FlashAttention, PagedAttention)
- Voice API server (RESTful service), including ASR/TTS/Denoise endpoints.
MIT