Tags · stillwater-sc/kpu-sim

v0.8.0

v0.8.0: Native wheel infrastructure for PyPI

- scikit-build-core integration for multi-platform wheel builds
- cibuildwheel CI/CD for Linux, macOS, and Windows
- Fixed DFX parser library build (dfx_parser.cpp)
- Fixed MSVC symbol export for trace library
- Universal library v3.91 with /Zc:__cplusplus for MSVC C++20 support

Jan 26, 2026
f872e22

v0.7.12

v0.7.12 - Python-C++ Integration Release

Features:
- XUE Observation Architecture with hierarchical event tracking
- All Python Tensor operations route through C++ BehavioralComputeFabric
- Complete quantization infrastructure (INT8, calibration, Q/DQ ops)
- Roofline analysis with DRAM traffic recording

Technical Changes:
- Native functions: native_matmul, native_add/sub/mul/div, native_relu/gelu/silu/sigmoid/tanh
- XUE event hierarchy: NAMED_OP, ALU_PRIMITIVE, MEMORY categories
- Fixed double-counting in hierarchical FLOPs recording

Note: Full C++ simulation requires source build with CMake.
PyPI package provides Python orchestration with NumPy fallback.
Native wheel infrastructure planned for v0.8.0.

Jan 24, 2026
a168590

v0.7.11

v0.7.11: Calibration support for post-training quantization

Jan 22, 2026
14777f7

v0.7.10

v0.7.10 - Q/DQ Operations

Explicit Quantize/Dequantize operations for QAT and ONNX-style graphs:

Q/DQ Pattern:
  input -> Q -> DQ -> op -> Q -> DQ -> output

New APIs:
- kpu.QDQParams - Parameter container
- kpu.Q() / kpu.DQ() - Core quantize/dequantize
- kpu.fake_quantize() - Simulate quantization error
- kpu.qdq_linear() / qdq_matmul() / qdq_conv2d()
- kpu.create_qdq_params() - Calibration utility
- kpu.quantize_error() - Error metrics (SNR, etc.)

Features:
- Per-tensor and per-channel quantization
- Symmetric and asymmetric modes
- Serialization support
- Typical INT8 SNR: 40-46 dB

Use cases:
- Quantization-aware training (QAT)
- ONNX quantization representation
- Fine-grained quantization control
- Error analysis and debugging
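
The Q -> DQ round trip and the SNR metric can be illustrated with plain NumPy. This is a minimal sketch under stated assumptions, not the kpu implementation: the helper names (`quantize`, `dequantize`, `fake_quantize_sketch`, `snr_db`) are hypothetical, and the symmetric per-tensor scale is one common choice among several.

```python
import numpy as np

def quantize(x, scale, zero_point=0, qmin=-128, qmax=127):
    """Q: map floats to INT8 codes, q = clip(round(x / scale) + zero_point)."""
    q = np.round(x / scale) + zero_point
    return np.clip(q, qmin, qmax).astype(np.int8)

def dequantize(q, scale, zero_point=0):
    """DQ: map INT8 codes back to floats, x_hat = (q - zero_point) * scale."""
    return (q.astype(np.float32) - zero_point) * scale

def fake_quantize_sketch(x, scale, zero_point=0):
    """Q followed by DQ: the result stays in float but carries the quantization error."""
    return dequantize(quantize(x, scale, zero_point), scale, zero_point)

def snr_db(reference, approx):
    """Signal-to-noise ratio in dB between a reference tensor and its approximation."""
    noise = reference - approx
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))

rng = np.random.default_rng(0)
x = rng.standard_normal(4096).astype(np.float32)

# Symmetric per-tensor calibration (assumed): the largest magnitude maps to code 127.
scale = float(np.max(np.abs(x))) / 127.0
x_hat = fake_quantize_sketch(x, scale)

print(f"max abs error: {np.max(np.abs(x - x_hat)):.5f} (half a step is {scale / 2:.5f})")
print(f"INT8 SNR: {snr_db(x, x_hat):.1f} dB")
```

With this calibration no value is clipped, so the round-trip error is bounded by half a quantization step.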

Jan 22, 2026
af26d2b

v0.7.9

v0.7.9 - Mixed Precision Support

Mixed precision inference with different dtypes for weights and activations:

Configurations:
| Name            | Weights | Activations | Reduction |
|-----------------|---------|-------------|-----------|
| MIXED_INT8_FP16 | INT8    | FP16        | 3.3x      |
| MIXED_INT8_BF16 | INT8    | BF16        | 3.3x      |
| MIXED_INT4_FP16 | INT4    | FP16        | 5.0x      |
| MIXED_FP8_FP16  | FP8     | FP16        | 3.3x      |
| MIXED_FP8_BF16  | FP8     | BF16        | 3.3x      |

New APIs:
- kpu.MixedPrecisionConfig - Custom configurations
- kpu.mixed_precision_linear()
- kpu.mixed_precision_matmul()
- kpu.mixed_precision_conv2d()
- kpu.calculate_mixed_precision_traffic()

Key insight: Keeping activations in FP16/BF16 while compressing
weights to INT8/INT4 provides ~1.5x better accuracy than pure INT8
with similar memory benefits.

Common use case: LLM inference with INT4 weights + BF16 activations.
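
The weight-only compression idea can be sketched in NumPy as follows. Everything here is an illustrative assumption rather than the kpu API: the function names are hypothetical, the weights use symmetric per-output-channel INT8 scales, and the weights are dequantized to FP16 before the matmul. The sketch does not attempt to reproduce the reduction factors in the table above.

```python
import numpy as np

def quantize_weights_int8(w):
    """Symmetric per-output-channel INT8 quantization of an (out, in) weight matrix."""
    scale = np.max(np.abs(w), axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0.0, 1.0, scale)  # guard all-zero rows
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float16)

def mixed_linear_sketch(x_fp16, q_weights, scale):
    """Linear layer with INT8-stored weights and FP16 activations (no bias)."""
    w_fp16 = q_weights.astype(np.float16) * scale  # dequantize just before use
    return x_fp16 @ w_fp16.T

rng = np.random.default_rng(0)
w = rng.standard_normal((32, 64)).astype(np.float32)
x = rng.standard_normal((8, 64)).astype(np.float32)

q, s = quantize_weights_int8(w)
y = mixed_linear_sketch(x.astype(np.float16), q, s)

ref = x @ w.T
rel_err = float(np.linalg.norm(y.astype(np.float32) - ref) / np.linalg.norm(ref))
print(f"weight bytes: FP32 {w.nbytes}, INT8 {q.nbytes}")
print(f"relative error vs FP32: {rel_err:.4f}")
```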

Jan 22, 2026
79dad6c

v0.7.8

v0.7.8 - FP4 Support (Packed Storage)

4-bit floating-point quantization for extreme model compression:

FP4 Formats:
| Format | Range       | Values |
|--------|-------------|--------|
| E2M1   | [-6, 6]     | 15     |
| E1M2   | [-3.5, 3.5] | 15     |

E2M1 representable values:
  0, ±0.5, ±1.0, ±1.5, ±2.0, ±3.0, ±4.0, ±6.0

New APIs:
- kpu.fp4_quantize() / fp4_dequantize()
- kpu.pack_fp4() / unpack_fp4()
- kpu.fp4_matmul() / fp4_linear()
- kpu.fp4_range() / fp4_values() / fp4_info()
- kpu.FP4_E2M1 / FP4_E1M2 format constants

Memory: 8x bandwidth reduction vs FP32

FP4 completes the 4-bit quantization support alongside INT4.
Use cases: extreme edge deployment, model exploration.
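
A minimal pure-Python sketch of E2M1 quantization and nibble packing, built only from the value list above. The function names are hypothetical and not the kpu API. The nearest-value rounding with ties going to the smaller magnitude, saturation at ±6, and the low-nibble-first byte layout are all assumptions made for illustration.

```python
# Non-negative E2M1 magnitudes from the release notes; the list index is used as the 3-bit code.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)

def e2m1_encode(x):
    """Return a 4-bit code: sign in bit 3, index of the nearest magnitude in bits 0-2."""
    sign = 1 if x < 0 else 0
    mag = min(abs(x), E2M1_MAGNITUDES[-1])  # saturate out-of-range inputs
    index = min(range(8), key=lambda i: abs(E2M1_MAGNITUDES[i] - mag))
    return (sign << 3) | index

def e2m1_decode(code):
    """Map a 4-bit code back to its float value."""
    value = E2M1_MAGNITUDES[code & 0x7]
    return -value if code & 0x8 else value

def pack_nibbles(codes):
    """Pack 4-bit codes two per byte, low nibble first; pad odd lengths with zero."""
    codes = list(codes)
    if len(codes) % 2:
        codes.append(0)
    return bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))

def unpack_nibbles(data, count):
    """Inverse of pack_nibbles; count drops the padding nibble if one was added."""
    codes = []
    for byte in data:
        codes.append(byte & 0xF)
        codes.append(byte >> 4)
    return codes[:count]

values = [0.3, -2.4, 5.1, 100.0, -0.7]
codes = [e2m1_encode(v) for v in values]
packed = pack_nibbles(codes)
restored = [e2m1_decode(c) for c in unpack_nibbles(packed, len(codes))]
print(len(packed), "bytes ->", restored)
```

Five values occupy three bytes here, compared with 20 bytes as FP32, which is where the roughly 8x storage reduction comes from.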

Jan 22, 2026
3b7b34b

v0.7.7

v0.7.7 - INT4 Support (Packed Storage)

4-bit integer quantization for aggressive model compression:

INT4 Format:
- Signed: [-8, 7] (16 discrete values)
- Unsigned: [0, 15] (16 discrete values)
- Storage: 2 values per byte (packed)

New APIs:
- kpu.pack_int4() / unpack_int4() - Packed storage
- kpu.quantize_int4() / dequantize_int4()
- kpu.compute_int4_scale_zero_point()
- kpu.int4_matmul() / int4_linear()
- kpu.int4_packed_size() / int4_memory_bytes()
- kpu.int4_info()

Memory Comparison (1024x1024 matmul):
| Type | Traffic | Reduction |
|------|---------|-----------|
| FP32 | 12.6 MB | 1x |
| FP16 | 6.3 MB  | 2x |
| INT8 | 3.1 MB  | 4x |
| INT4 | 1.6 MB  | 8x |

INT4 is used for extreme compression where bandwidth is critical
and accuracy loss is acceptable (e.g., edge deployment, LLM inference).
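
The packed layout and the traffic table can be sketched in pure Python. The function names below are hypothetical, not the kpu API, and the low-nibble-first ordering is an assumption. The traffic helper assumes the matmul reads A and B and writes C exactly once, with megabytes counted as 10^6 bytes; under those assumptions it reproduces the figures in the table.

```python
def pack_int4_sketch(values):
    """Pack signed INT4 values in [-8, 7] two per byte, low nibble first."""
    for v in values:
        if not -8 <= v <= 7:
            raise ValueError(f"{v} is outside the signed INT4 range [-8, 7]")
    nibbles = [v & 0xF for v in values]  # two's-complement nibble
    if len(nibbles) % 2:
        nibbles.append(0)                # pad odd lengths
    return bytes(lo | (hi << 4) for lo, hi in zip(nibbles[0::2], nibbles[1::2]))

def unpack_int4_sketch(data, count):
    """Inverse of pack_int4_sketch; sign-extends each nibble."""
    out = []
    for byte in data:
        for nib in (byte & 0xF, byte >> 4):
            out.append(nib - 16 if nib >= 8 else nib)
    return out[:count]

def matmul_traffic_bytes(n, bits):
    """Bytes moved by an n x n matmul that reads A and B and writes C once."""
    return 3 * n * n * bits // 8

values = [-8, -1, 0, 7, 3]
packed = pack_int4_sketch(values)
print(len(packed), "bytes ->", unpack_int4_sketch(packed, len(values)))

for name, bits in (("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)):
    mb = matmul_traffic_bytes(1024, bits) / 1e6
    print(f"{name}: {mb:.1f} MB")
```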

Jan 22, 2026
1ad23c8

v0.7.3

v0.7.3 - FP8 Support (All Variants)

Complete 8-bit floating-point infrastructure supporting all common formats:

FP8 Formats:
| Format | Bits (exponent+mantissa) | Range  | Epsilon | Use Case         |
|--------|--------------------------|--------|---------|------------------|
| E2M5   | 2+5                      | ±3.9   | 0.031   | Gradients        |
| E3M4   | 3+4                      | ±15.5  | 0.063   | General          |
| E4M3   | 4+3                      | ±240   | 0.125   | Weights (NVIDIA) |
| E5M2   | 5+2                      | ±57344 | 0.250   | Activations      |

New APIs:
- kpu.fp8_matmul() - FP8 matrix multiplication
- kpu.fp8_linear() - FP8 linear layer
- kpu.cast_to_fp8() / cast_from_fp8() - Type conversion
- kpu.fp8_range() / fp8_precision() / fp8_info()
- kpu.FP8_E4M3, FP8_E5M2, etc. - Format constants

Performance:
- 4x memory bandwidth reduction vs FP32
- E4M3 recommended for weights (NVIDIA H100 standard)
- E5M2 recommended for activations (wider range)

Uses ml_dtypes for native E4M3/E5M2 when installed,
otherwise emulates by quantizing to FP8 precision.
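
One way such an emulation could work is to round the significand to the target number of mantissa bits and saturate at the format's maximum. The sketch below is an assumption about the general technique, not the kpu fallback code. The `emulate_fp8` name is hypothetical, the mantissa widths and maxima are taken from the table above, and subnormals and the minimum exponent are deliberately ignored.

```python
import math

# (explicit mantissa bits, largest finite magnitude), taken from the FP8 table above.
FP8_FORMATS = {
    "E4M3": (3, 240.0),
    "E5M2": (2, 57344.0),
}

def emulate_fp8(x, fmt="E4M3"):
    """Round x to the precision of fmt and saturate at its maximum.

    Simplifications (assumed): no subnormal handling, no minimum exponent,
    and infinities saturate instead of propagating.
    """
    mantissa_bits, max_value = FP8_FORMATS[fmt]
    if x == 0.0 or math.isnan(x):
        return x
    if math.isinf(x):
        return math.copysign(max_value, x)
    m, e = math.frexp(x)                 # x = m * 2**e with 0.5 <= |m| < 1
    steps = 2 ** (mantissa_bits + 1)     # implicit bit plus explicit mantissa bits
    rounded = math.ldexp(round(m * steps) / steps, e)  # round() is ties-to-even
    return max(-max_value, min(max_value, rounded))

for value in (1.06, 1.07, 1.2, 1000.0):
    print(value, "->", emulate_fp8(value, "E4M3"), emulate_fp8(value, "E5M2"))
```

The spacing just above 1.0 comes out as 2^-3 = 0.125 for E4M3 and 2^-2 = 0.25 for E5M2, which agrees with the Epsilon column.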

Jan 22, 2026
0b79c44

v0.7.2

v0.7.2 - BF16 (BFloat16) Support

BFloat16 operations with ml_dtypes integration and fallback emulation:

New APIs:
- kpu.bf16_matmul() - BF16 matrix multiplication
- kpu.bf16_linear() - BF16 linear layer
- kpu.bf16_conv2d() - BF16 2D convolution
- kpu.cast_to_bf16() / cast_from_bf16() - Type conversion
- kpu.bf16_range() - Returns the representable range (±3.4e38)
- kpu.bf16_precision() - Returns epsilon (~0.0078)
- kpu.is_bfloat16_native() - Check ml_dtypes availability

BF16 vs FP16 Comparison:
| Property | FP16 | BF16 |
|----------|------|------|
| Bytes | 2 | 2 |
| Range | ±6.5e4 | ±3.4e38 |
| Precision | ~3 digits | ~2 digits |
| Rel. Error | ~0.15% | ~2% |

BF16 trades precision for range, making it ideal for deep learning
where gradient magnitudes vary widely.

Optional dependency: pip install stillwater-kpu[bfloat16]
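
BF16 keeps the sign, the 8 exponent bits, and the top 7 mantissa bits of an IEEE float32, so a fallback can be written with bit manipulation alone. The following is a standard-library sketch of that idea, assuming round-to-nearest-even. The function names are hypothetical, NaN inputs are not treated specially, and this is not claimed to be how kpu emulates BF16.

```python
import struct

def to_bf16_bits(x):
    """Convert a Python float to 16 BF16 bits via float32, rounding to nearest even."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding = 0x7FFF + ((bits >> 16) & 1)  # ties go to the even BF16 value
    return ((bits + rounding) >> 16) & 0xFFFF

def from_bf16_bits(b):
    """Expand 16 BF16 bits back to a Python float by zero-filling the low mantissa bits."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]

def bf16_roundtrip(x):
    """Value of x after being stored as BF16."""
    return from_bf16_bits(to_bf16_bits(x))

for value in (1.0, 1.003, 1.0 + 2 ** -7, 3.14159, 3.0e38):
    print(value, "->", bf16_roundtrip(value))
```

The smallest step above 1.0 is 2^-7 = 0.0078125, in line with the epsilon quoted for `kpu.bf16_precision()`, and 3.0e38 stays finite, unlike in FP16.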

Jan 22, 2026
48a00fa

v0.7.1

v0.7.1 - FP16 Support

Native half-precision floating-point operations using NumPy float16:

New APIs:
- kpu.fp16_matmul() - FP16 matrix multiplication
- kpu.fp16_linear() - FP16 linear layer
- kpu.fp16_conv2d() - FP16 2D convolution
- kpu.cast_to_fp16() / cast_from_fp16() - Type conversion
- kpu.fp16_range() - Get representable range
- kpu.fp16_precision() - Get machine epsilon

Performance Characteristics:
- 2x memory bandwidth reduction vs FP32
- ~0.15-0.33% relative error vs FP32 baseline
- Range: ±65504
- Precision: ~3 decimal digits

Uses NumPy native float16 for realistic half-precision behavior
including reduced precision and potential overflow.
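
The range, epsilon, and overflow behavior described above can be observed directly with NumPy's `float16`. This sketch uses only NumPy and does not call the kpu API; the helper name `cast_roundtrip_fp16`, the matrix sizes, and the seed are arbitrary choices for illustration.

```python
import numpy as np

def cast_roundtrip_fp16(x):
    """Cast to float16 and back to float32; values beyond ±65504 become inf."""
    with np.errstate(over="ignore"):  # silence the expected overflow warning
        return np.asarray(x, dtype=np.float32).astype(np.float16).astype(np.float32)

info = np.finfo(np.float16)
print("max:", float(info.max), "epsilon:", float(info.eps))
print("70000 ->", cast_roundtrip_fp16([70000.0])[0])

# Compare a half-precision matmul against an FP32 baseline.
rng = np.random.default_rng(0)
a = rng.standard_normal((64, 64)).astype(np.float32)
b = rng.standard_normal((64, 64)).astype(np.float32)

ref = a @ b
half = (a.astype(np.float16) @ b.astype(np.float16)).astype(np.float32)
rel_err = float(np.linalg.norm(half - ref) / np.linalg.norm(ref))
print(f"relative error vs FP32: {rel_err:.4%}")
```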

Jan 22, 2026
8c578bd