Kenosis is a Rust CLI toolkit for quantizing, validating, inspecting, and comparing ONNX models. Its flagship feature is static INT8 quantization with ReLU-aware QDQ placement that achieves full QLinearConv fusion on stock ONNX Runtime — no custom operators required.
Production Results
Kenosis quantizes the PP-YOLOE+ object detection models deployed in production edge-AI pipelines. The gains below are validated in production:
| Model | Resolution | Cosine Similarity | Latency (INT8 vs FP32) | Speedup | Size (INT8) |
|---|---|---|---|---|---|
| PP-YOLOE+ Small | 320×320 | 0.998 | 23ms vs 44ms | 1.89× | 7.9 MB (3.9× smaller) |
| PP-YOLOE+ Small | 416×416 | 0.998 | 43ms vs 77ms | 1.80× | 7.9 MB (3.9× smaller) |
| PP-YOLOE+ Small | 640×640 | 0.999 | 111ms vs 187ms | 1.68× | 7.9 MB (3.8× smaller) |
Isolated Latency vs. Production Throughput
The latency figures above represent isolated, single-threaded compute reduction. However, in a real-world multi-camera edge deployment (where models are forced to share CPU cores and L3 cache), this compute reduction translates directly into density and throughput rather than absolute wall-clock latency:
- Cache Preservation: The 3.9× smaller memory footprint allows multiple INT8 models to fit entirely within the CPU's L3 cache, eliminating the memory-bus thrashing that crashes FP32 multi-cam pipelines.
- Thread Starvation Prevention: By executing in ~25ms of compute instead of ~44ms, the OS scheduler can juggle multiple video streams without starving the pipeline.
- The Result: In a 3-camera stress test on an 8-core edge CPU, FP32 pipelines experience severe thread starvation and memory bottlenecking, driving latency to ~43ms and dropping streams to a jittery 21 fps. Kenosis INT8 pipelines process efficiently enough to sustain 28–29 fps per camera, demonstrating that static quantization is the key to maximizing camera density on commodity hardware.
Classifier Benchmarks (Kenosis vs ORT Python Quantizer)
| Model | Cosine | Kenosis Latency | ORT Latency | Kenosis Advantage |
|---|---|---|---|---|
| SqueezeNet 1.1 | 0.999 | 2.82ms | 4.25ms | 51% faster |
| ResNet50 v2 | 0.999 | 38.0ms | 49.5ms | 24% faster |
Kenosis achieves 26/26 QLinearConv fusion on SqueezeNet (vs ORT's 26/26), plus 8/8 QLinearConcat and full pool fusion — with fewer residual DequantizeLinear nodes. The advantage comes from ReLU-aware QDQ placement that matches ORT's internal Conv+ReLU fusion patterns.
Key Features
| Feature | Kenosis | ORT Python |
|---|---|---|
| Static INT8 with ReLU-aware QDQ | ✅ | ❌ |
| Detection model mixed-precision | ✅ | ❌ |
| Non-vision tensor protection | ✅ | ❌ |
| Multi-input model calibration | ✅ | ❌ |
| Transformer & MatMul quantization | ✅ | ❌ |
| NLP synthetic calibration data | ✅ | ❌ |
| SNR-based sensitivity analysis | ✅ | ❌ |
| INT32 bias quantization w/ DQL | ✅ | ✅ |
| Per-channel weight quantization | ✅ | ✅ |
| Built-in validation + benchmarking | ✅ | ❌ |
| PaddlePaddle Constant extraction | ✅ | ❌ |
| Zero Python dependency | ✅ | ❌ |
| Cross-platform single binary | ✅ | ❌ |
Install
```sh
cargo install kenosis-cli
```
Or build from source:
```sh
git clone https://github.com/CoreEpoch/kenosis.git
cd kenosis
cargo build --release
```
Usage
Static INT8 Quantization (recommended)
The primary quantization mode. Produces QDQ-format models that run on stock ONNX Runtime with full INT8 acceleration.
```sh
# Standard vision model (SqueezeNet, ResNet, EfficientNet, etc.)
kenosis quantize model.onnx -o model_int8.onnx --static-int8

# Per-channel weights (better for models with high channel counts like ResNet)
kenosis quantize model.onnx -o model_int8.onnx --static-int8 --per-channel

# PaddlePaddle models (PP-YOLOE+, PP-LCNet, etc.)
kenosis quantize ppyoloe.onnx -o ppyoloe_int8.onnx --static-int8 --extract-constants

# Custom calibration sample count
kenosis quantize model.onnx -o model_int8.onnx --static-int8 --n-calib 40

# External calibration data (raw f32 binary files)
kenosis quantize model.onnx -o model_int8.onnx --static-int8 --calib-dir ./calib_data/
```
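If you supply your own calibration data, each file should contain one input tensor as raw f32 binary. The sketch below shows one way to produce such a file; the file naming, NCHW layout, and little-endian byte order are assumptions here, not documented Kenosis behavior.

```rust
// Illustrative sketch: dump one calibration sample as a raw f32 binary file
// of the kind --calib-dir consumes. The sample_000.bin name, NCHW layout,
// and little-endian byte order are assumptions, not Kenosis documentation.
use std::fs::{create_dir_all, File};
use std::io::{BufWriter, Write};

fn write_f32_sample(path: &str, data: &[f32]) -> std::io::Result<()> {
    let mut w = BufWriter::new(File::create(path)?);
    for &v in data {
        // 4 bytes per value, little-endian
        w.write_all(&v.to_le_bytes())?;
    }
    w.flush()
}

fn main() -> std::io::Result<()> {
    create_dir_all("calib_data")?;
    // One 1x3x320x320 input tensor, filled with a constant for illustration;
    // in practice this would be a preprocessed real image.
    let sample = vec![0.5f32; 3 * 320 * 320];
    write_f32_sample("calib_data/sample_000.bin", &sample)
}
```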
Validate Quantized Models
Compare a quantized model against its FP32 baseline — measures cosine similarity, Top-1 agreement, and latency side-by-side.
```sh
# Basic validation (50 samples, 200 timed runs)
kenosis validate model.onnx model_int8.onnx

# Custom sample counts
kenosis validate model.onnx model_int8.onnx -n 500 --timed 500
```
Output:
```
════════════════════════════════════════════════════════
📊 Kenosis Validation Report
════════════════════════════════════════════════════════
▸ Cosine similarity: 0.999128 (min 0.9986)
▸ Top-1 agreement: 83/100 (83%)
▸ Latency: 2.82ms vs 6.03ms (2.13× speedup)
▸ Size: 1.24 MB vs 4.73 MB (3.8× smaller)
▸ Verdict: EXCELLENT — production ready
════════════════════════════════════════════════════════
```
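The cosine-similarity figure in the report is the standard normalized dot product between the flattened FP32 and INT8 output tensors. A minimal sketch of that metric (illustrative only, not Kenosis's actual implementation):

```rust
// Sketch: cosine similarity between two flattened output tensors — the
// metric used above to compare FP32 and INT8 model outputs. Accumulation
// is done in f64 for numerical stability.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f64 {
    assert_eq!(a.len(), b.len());
    let (mut dot, mut na, mut nb) = (0f64, 0f64, 0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (x as f64, y as f64);
        dot += x * y;
        na += x * x;
        nb += y * y;
    }
    dot / (na.sqrt() * nb.sqrt())
}

fn main() {
    // Identical vectors score 1.0; small quantization perturbations stay close.
    let fp32 = [0.9f32, 0.05, 0.05];
    let int8 = [0.88f32, 0.06, 0.06];
    println!("{:.6}", cosine_similarity(&fp32, &int8));
}
```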
Inspect a Model
```sh
# Basic stats — ops, params, size, data types, largest tensors
kenosis inspect model.onnx
```
Utility Commands
```sh
# Cast to FP16/BF16
kenosis cast model.onnx -o model_fp16.onnx --precision fp16

# Compare two models
kenosis diff model.onnx model_int8.onnx
```
How Static INT8 Works
Kenosis's static INT8 pipeline applies eight coordinated optimizations:

1. Self-calibration — Automatically generates synthetic calibration inputs and runs them through the model via ONNX Runtime to collect per-tensor activation ranges. No external calibration data is required. Multi-input models and NLP inputs (token IDs, attention masks) are handled automatically.
2. Weight quantization — INT8 symmetric, per-tensor or per-channel. All scale computations are done in f64 to match ORT's internal precision.
3. INT32 bias quantization — `scale = activation_scale × weight_scale`, `zero_point = 0`. Wrapped with DequantizeLinear for ORT kernel fusion.
4. Zero-point nudged activation quantization — UINT8 asymmetric with post-hoc range adjustment ensuring float `0.0` maps exactly to the quantized zero point. This prevents rounding asymmetry from compounding across layers.
5. Activation-aware QDQ placement — ORT's Python quantizer places QDQ nodes on every Conv/MatMul output independently. Kenosis detects `Conv/MatMul → Activation` pairs (ReLU, LeakyRelu, Clip, HardSwish, Sigmoid) at the graph level and places QDQ after the activation instead, giving ORT's runtime optimizer a cleaner pattern that fuses into a single INT8 kernel. Combined with second-pass wrapping of Add, Concat, MaxPool, and AveragePool, this maximizes QLinear fusions.
6. Non-vision tensor protection — For multi-input models (detection, segmentation), tensors reachable from non-primary inputs (scale_factor, image_shape) are traced through the graph and excluded from quantization. This prevents metadata paths from being crushed by INT8 range limits.
7. Model output protection — Tensors that are direct model outputs are never QDQ-wrapped, preserving full FP32 precision in detection-head scores and bounding-box coordinates.
8. SNR sensitivity analysis — Computes the signal-to-noise ratio (SNR) of every layer's weight quantization, automatically identifies mathematically fragile layers, and keeps them in FP32, recovering otherwise catastrophic Top-1 accuracy drops.
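The arithmetic behind the weight, bias, and activation steps above can be sketched in a few lines. This is an illustrative toy, not Kenosis's internals: the function names are invented, and it assumes a non-degenerate activation range.

```rust
// Illustrative sketch of static INT8 scale math (not kenosis-core code).

/// Symmetric per-tensor INT8 weight scale: the largest magnitude maps to 127.
/// Computed in f64, matching the README's note about ORT-matching precision.
fn weight_scale(weights: &[f32]) -> f64 {
    let max_abs = weights.iter().fold(0f64, |m, &w| m.max((w as f64).abs()));
    max_abs / 127.0
}

/// Asymmetric UINT8 activation parameters with a "nudged" zero point:
/// the range is widened to contain 0.0, then the zero point is rounded to
/// an integer so that float 0.0 quantizes exactly, with no rounding error.
/// Assumes min < max (a non-degenerate calibration range).
fn activation_params(min: f64, max: f64) -> (f64, u8) {
    let (min, max) = (min.min(0.0), max.max(0.0));
    let scale = (max - min) / 255.0;
    let zero_point = (-min / scale).round(); // the nudge: integral zero point
    (scale, zero_point as u8)
}

fn main() {
    let (act_scale, zp) = activation_params(-1.0, 3.0);
    // float 0.0 lands exactly on the zero point
    let q_zero = (0.0 / act_scale + zp as f64).round() as u8;
    assert_eq!(q_zero, zp);

    // INT32 bias scale is the product of the two scales (step 3 above).
    let bias_scale = act_scale * weight_scale(&[0.4, -0.8, 0.1]);
    println!("act_scale={act_scale:.6} zp={zp} bias_scale={bias_scale:.8}");
}
```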
Detection Model Support
Kenosis handles the specific challenges of quantizing object detection models:
- Multi-input calibration — Auto-generates appropriate default values for secondary inputs (scale_factor → 1.0, shape tensors → 0.0)
- PaddlePaddle weight handling — Extracts inline Constant nodes, deduplicates shared weights (deepcopy tensors), and upgrades opset attributes (Squeeze, Unsqueeze, BatchNorm, Dropout)
- Mixed-precision detection head — Backbone and neck are fully INT8; detection head outputs and metadata paths stay FP32
- Scale factor preservation — The bounding box rescaling path remains live and dynamic, not frozen to calibration values
Architecture
```
kenosis/
├── crates/
│   └── kenosis-core/            # Library: quantization engine
│       └── src/
│           ├── model.rs         # OnnxModel load/save/traversal + Constant extraction
│           ├── static_int8.rs   # Static INT8 QDQ quantization pipeline
│           ├── inspect.rs       # Stats and analysis
│           ├── cast.rs          # FP16/BF16 casting
│           ├── diff.rs          # Model comparison
│           ├── proto.rs         # ONNX protobuf type definitions
│           └── error.rs         # Error types
├── apps/
│   └── kenosis-cli/             # Binary: CLI interface
│       └── src/commands/
│           ├── quantize.rs      # quantize command (static INT8)
│           ├── validate.rs      # validate command (accuracy + latency)
│           ├── inspect.rs       # inspect command
│           ├── cast.rs          # cast command
│           └── diff.rs          # diff command
```
License
Apache-2.0 — see LICENSE.
Built by Core Epoch.