Kenosis

Production-grade ONNX model quantization. Zero Python. Single Native Binary.

Kenosis is a Rust CLI toolkit for quantizing, validating, inspecting, and comparing ONNX models. Its flagship feature is static INT8 quantization with ReLU-aware QDQ placement that achieves full QLinearConv fusion on stock ONNX Runtime — no custom operators required.

Production Results

Kenosis quantizes the PP-YOLOE+ object detection models deployed in production edge AI pipelines. Measured gains:

| Model | Resolution | Cosine | Latency (INT8 vs FP32) | Speedup | Size |
|---|---|---|---|---|---|
| PP-YOLOE+ Small | 320×320 | 0.998 | 23ms vs 44ms | 1.89× | 7.9 MB (3.9× smaller) |
| PP-YOLOE+ Small | 416×416 | 0.998 | 43ms vs 77ms | 1.80× | 7.9 MB (3.9× smaller) |
| PP-YOLOE+ Small | 640×640 | 0.999 | 111ms vs 187ms | 1.68× | 7.9 MB (3.8× smaller) |

Isolated Latency vs. Production Throughput

The latency figures above represent isolated, single-threaded compute reduction. However, in a real-world multi-camera edge deployment (where models are forced to share CPU cores and L3 cache), this compute reduction translates directly into density and throughput rather than absolute wall-clock latency:

  • Cache Preservation: The 3.9× smaller memory footprint lets multiple INT8 models fit entirely within the CPU's L3 cache, eliminating the memory-bus thrashing that cripples FP32 multi-camera pipelines.
  • Thread Starvation Prevention: Because each inference takes ~25ms of compute instead of ~44ms, the OS scheduler can interleave multiple video streams without starving any pipeline.
  • The Result: In a 3-camera stress test on an 8-core edge CPU, FP32 pipelines suffer severe thread starvation and memory bottlenecks, driving latency to ~43ms and dropping streams to a jittery 21 fps. The Kenosis INT8 pipelines sustain 28–29 fps per camera, demonstrating that static quantization is the key to maximizing camera density on commodity hardware.

Classifier Benchmarks (Kenosis vs ORT Python Quantizer)

| Model | Cosine | Kenosis Latency | ORT Latency | Kenosis Advantage |
|---|---|---|---|---|
| SqueezeNet 1.1 | 0.999 | 2.82ms | 4.25ms | 51% faster |
| ResNet50 v2 | 0.999 | 38.0ms | 49.5ms | 24% faster |

Kenosis achieves 26/26 QLinearConv fusion on SqueezeNet (vs ORT's 26/26), plus 8/8 QLinearConcat and full pool fusion — with fewer residual DequantizeLinear nodes. The advantage comes from ReLU-aware QDQ placement that matches ORT's internal Conv+ReLU fusion patterns.
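A minimal sketch of that placement rule, assuming the graph is reduced to a flat list of op types (the helper name is hypothetical; the real implementation pattern-matches on the ONNX graph proto):

// Sketch of ReLU-aware QDQ placement over a flattened op sequence.
const FUSABLE_ACTIVATIONS: &[&str] =
    &["Relu", "LeakyRelu", "Clip", "HardSwish", "Sigmoid"];

// Returns the index of the node after which the Q/DQ pair should be inserted
// for the Conv/MatMul at `i`: after a following fusable activation when one
// exists, otherwise directly after the Conv/MatMul itself.
fn qdq_insertion_point(ops: &[&str], i: usize) -> usize {
    debug_assert!(ops[i] == "Conv" || ops[i] == "MatMul");
    match ops.get(i + 1) {
        Some(next) if FUSABLE_ACTIVATIONS.contains(next) => i + 1,
        _ => i,
    }
}

fn main() {
    let graph = ["Conv", "Relu", "Conv", "Add"];
    // Conv at 0 feeds a ReLU: QDQ goes after the ReLU, so the whole
    // Conv -> Relu -> Q -> DQ pattern can fuse into one QLinearConv kernel.
    assert_eq!(qdq_insertion_point(&graph, 0), 1);
    // Conv at 2 has no fusable activation: QDQ goes directly after it.
    assert_eq!(qdq_insertion_point(&graph, 2), 2);
}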

Key Features

  • Static INT8 with ReLU-aware QDQ placement
  • Detection-model mixed precision
  • Non-vision tensor protection
  • Multi-input model calibration
  • Transformer & MatMul quantization
  • NLP synthetic calibration data
  • SNR-based sensitivity analysis
  • INT32 bias quantization with DequantizeLinear wrapping
  • Per-channel weight quantization
  • Built-in validation and benchmarking
  • PaddlePaddle Constant extraction
  • Zero Python dependency
  • Cross-platform single binary

Install

cargo install kenosis-cli

Or build from source:

git clone https://github.com/CoreEpoch/kenosis.git
cd kenosis
cargo build --release

Usage

Quantize a Model

The primary quantization mode produces QDQ-format models that run on stock ONNX Runtime with full INT8 acceleration.

# Standard vision model (SqueezeNet, ResNet, EfficientNet, etc.)
kenosis quantize model.onnx -o model_int8.onnx --static-int8

# Per-channel weights (better for models with high channel counts like ResNet)
kenosis quantize model.onnx -o model_int8.onnx --static-int8 --per-channel

# PaddlePaddle models (PP-YOLOE+, PP-LCNet, etc.)
kenosis quantize ppyoloe.onnx -o ppyoloe_int8.onnx --static-int8 --extract-constants

# Custom calibration sample count
kenosis quantize model.onnx -o model_int8.onnx --static-int8 --n-calib 40

# External calibration data (raw f32 binary files)
kenosis quantize model.onnx -o model_int8.onnx --static-int8 --calib-dir ./calib_data/

Validate Quantized Models

Compare a quantized model against its FP32 baseline — measures cosine similarity, Top-1 agreement, and latency side-by-side.

# Basic validation (50 samples, 200 timed runs)
kenosis validate model.onnx model_int8.onnx

# Custom sample counts
kenosis validate model.onnx model_int8.onnx -n 500 --timed 500

Output:

  ════════════════════════════════════════════════════════
  📊  Kenosis Validation Report
  ════════════════════════════════════════════════════════
  ▸ Cosine similarity:  0.999128 (min 0.9986)
  ▸ Top-1 agreement:    83/100 (83%)
  ▸ Latency:            2.82ms vs 6.03ms (2.13× speedup)
  ▸ Size:               1.24 MB vs 4.73 MB (3.8× smaller)
  ▸ Verdict:            EXCELLENT — production ready
  ════════════════════════════════════════════════════════

Inspect a Model

# Basic stats — ops, params, size, data types, largest tensors
kenosis inspect model.onnx

Utility Commands

# Cast to FP16/BF16
kenosis cast model.onnx -o model_fp16.onnx --precision fp16

# Compare two models
kenosis diff model.onnx model_int8.onnx

How Static INT8 Works

Kenosis's static INT8 pipeline applies eight coordinated optimizations:

  1. Self-calibration — Automatically generates synthetic calibration inputs and runs them through the model via ONNX Runtime to collect per-tensor activation ranges. No external calibration data required. Multi-input models and NLP inputs (token IDs, attention masks) are handled automatically.

  2. Weight quantization — INT8 symmetric, per-tensor or per-channel. All scale computations are done in f64 to match ORT's internal precision (the sketch after this list illustrates the arithmetic of steps 2–4 and 8).

  3. INT32 bias quantization — scale = activation_scale × weight_scale, zero_point = 0. Wrapped with DequantizeLinear for ORT kernel fusion.

  4. Zero-point nudged activation quantization — UINT8 asymmetric with post-hoc range adjustment ensuring float 0.0 maps exactly to the quantized zero. Prevents rounding asymmetry from compounding across layers.

  5. Activation-aware QDQ placement — ORT's Python quantizer places QDQ nodes on every Conv/MatMul output independently. Kenosis detects Conv/MatMul → Activation pairs (ReLU, LeakyRelu, Clip, HardSwish, Sigmoid) at graph level and places QDQ after the activation instead. This gives ORT's runtime optimizer a cleaner pattern that fuses into a single INT8 kernel. Combined with second-pass wrapping of Add, Concat, MaxPool, and AveragePool, this maximizes QLinear fusions.

  6. Non-vision tensor protection — For multi-input models (detection, segmentation), tensors reachable from non-primary inputs (scale_factor, image_shape) are traced through the graph and excluded from quantization. This prevents metadata paths from being crushed by INT8 range limits.

  7. Model output protection — Tensors that are direct model outputs are never QDQ-wrapped, preserving full FP32 precision in detection head scores and bounding box coordinates.

  8. SNR Sensitivity Analysis — Computes Signal-to-Noise Ratio (SNR) for every layer's weight quantization. Automatically identifies mathematically fragile layers and protects them in FP32, recovering catastrophic Top-1 accuracy drops.
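For concreteness, the arithmetic in steps 2–4 and 8 can be sketched in a few lines of plain Rust. This is an illustrative sketch only, with hypothetical function names operating on flat f32 slices; it is not the kenosis-core API, which works on ONNX tensor protos:

// Illustrative sketch of steps 2-4 and 8 (NOT the kenosis-core API).

// Step 2: symmetric INT8 per-tensor weight scale, computed in f64 to match
// ORT's internal precision. Quantized value: round(w / scale) in [-127, 127].
fn weight_scale(weights: &[f32]) -> f64 {
    let max_abs = weights.iter().fold(0.0_f64, |m, &w| m.max((w as f64).abs()));
    (max_abs / 127.0).max(f64::MIN_POSITIVE)
}

// Step 3: the INT32 bias scale is the product of the input activation scale
// and the weight scale; the bias zero point is always 0.
fn bias_params(act_scale: f64, w_scale: f64) -> (f64, i32) {
    (act_scale * w_scale, 0)
}

// Step 4: UINT8 asymmetric activation parameters with zero-point nudging.
// The calibrated range is first widened to include 0.0, then the fractional
// ideal zero point is rounded so float 0.0 maps exactly onto an integer.
fn activation_params(min: f32, max: f32) -> (f64, u8) {
    let (lo, hi) = (min.min(0.0) as f64, max.max(0.0) as f64);
    let scale = ((hi - lo) / 255.0).max(f64::MIN_POSITIVE);
    let zero_point = (-lo / scale).round().clamp(0.0, 255.0) as u8;
    (scale, zero_point)
}

// Step 8: SNR (in dB) of a layer's weight quantization. Layers with low SNR
// are flagged as fragile and kept in FP32.
fn quantization_snr_db(weights: &[f32], scale: f64) -> f64 {
    let (mut signal, mut noise) = (0.0_f64, 0.0_f64);
    for &w in weights {
        let dequant = (w as f64 / scale).round().clamp(-127.0, 127.0) * scale;
        signal += (w as f64).powi(2);
        noise += (w as f64 - dequant).powi(2);
    }
    10.0 * (signal / noise.max(f64::MIN_POSITIVE)).log10()
}

fn main() {
    let w = [0.8_f32, -0.31, 0.05, 1.2, -0.9];
    let s = weight_scale(&w);
    println!("weight scale     : {s:.6}");
    println!("bias (scale, zp) : {:?}", bias_params(0.02, s));
    println!("act (scale, zp)  : {:?}", activation_params(-0.5, 6.1));
    println!("weight SNR       : {:.1} dB", quantization_snr_db(&w, s));
}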

Detection Model Support

Kenosis handles the specific challenges of quantizing object detection models:

  • Multi-input calibration — Auto-generates appropriate default values for secondary inputs (scale_factor → 1.0, shape tensors → 0.0); see the sketch after this list
  • PaddlePaddle weight handling — Extracts inline Constant nodes, deduplicates shared weights by deep-copying tensors, and upgrades opset attributes (Squeeze, Unsqueeze, BatchNorm, Dropout)
  • Mixed-precision detection head — Backbone and neck are fully INT8; detection head outputs and metadata paths stay FP32
  • Scale factor preservation — The bounding box rescaling path remains live and dynamic, not frozen to calibration values
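A sketch of the secondary-input default rule (hypothetical helper, not the actual kenosis-core API):

// Fill a secondary (non-image) calibration input with the documented default:
// scale_factor tensors get 1.0 (identity rescale), shape-like tensors get 0.0.
fn calib_default(input_name: &str, len: usize) -> Vec<f32> {
    let fill = if input_name.contains("scale_factor") { 1.0 } else { 0.0 };
    vec![fill; len]
}

fn main() {
    assert_eq!(calib_default("scale_factor", 2), vec![1.0, 1.0]);
    assert_eq!(calib_default("image_shape", 2), vec![0.0, 0.0]);
}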

Architecture

kenosis/
├── crates/
│   └── kenosis-core/           # Library: quantization engine
│       └── src/
│           ├── model.rs        # OnnxModel load/save/traversal + Constant extraction
│           ├── static_int8.rs  # Static INT8 QDQ quantization pipeline
│           ├── inspect.rs      # Stats and analysis
│           ├── cast.rs         # FP16/BF16 casting
│           ├── diff.rs         # Model comparison
│           ├── proto.rs        # ONNX protobuf type definitions
│           └── error.rs        # Error types
├── apps/
│   └── kenosis-cli/            # Binary: CLI interface
│       └── src/commands/
│           ├── quantize.rs     # quantize command (static INT8)
│           ├── validate.rs     # validate command (accuracy + latency)
│           ├── inspect.rs      # inspect command
│           ├── cast.rs         # cast command
│           └── diff.rs         # diff command

License

Apache-2.0 — see LICENSE.

Built by Core Epoch.
