Tags: stillwater-sc/kpu-sim

v0.8.0

Unverified

This tag is not signed, but one or more authors require that any tag attributed to them be signed.
v0.8.0: Native wheel infrastructure for PyPI

- scikit-build-core integration for multi-platform wheel builds
- cibuildwheel CI/CD for Linux, macOS, and Windows
- Fixed DFX parser library build (dfx_parser.cpp)
- Fixed MSVC symbol export for trace library
- Universal library v3.91 with /Zc:__cplusplus for MSVC C++20 support

v0.7.12

v0.7.12 - Python-C++ Integration Release

Features:
- XUE Observation Architecture with hierarchical event tracking
- All Python Tensor operations route through C++ BehavioralComputeFabric
- Complete quantization infrastructure (INT8, calibration, Q/DQ ops)
- Roofline analysis with DRAM traffic recording

Technical Changes:
- Native functions: native_matmul, native_add/sub/mul/div, native_relu/gelu/silu/sigmoid/tanh
- XUE event hierarchy: NAMED_OP, ALU_PRIMITIVE, MEMORY categories
- Fixed double-counting in hierarchical FLOPs recording

Note: Full C++ simulation requires a source build with CMake.
The PyPI package provides Python orchestration with a NumPy fallback.
Native wheel infrastructure is planned for v0.8.0.

v0.7.11

v0.7.11: Calibration support for post-training quantization

v0.7.10

v0.7.10 - Q/DQ Operations

Explicit Quantize/Dequantize operations for QAT and ONNX-style graphs:

Q/DQ Pattern:
  input -> Q -> DQ -> op -> Q -> DQ -> output

New APIs:
- kpu.QDQParams - Parameter container
- kpu.Q() / kpu.DQ() - Core quantize/dequantize
- kpu.fake_quantize() - Simulate quantization error
- kpu.qdq_linear() / qdq_matmul() / qdq_conv2d()
- kpu.create_qdq_params() - Calibration utility
- kpu.quantize_error() - Error metrics (SNR, etc.)

Features:
- Per-tensor and per-channel quantization
- Symmetric and asymmetric modes
- Serialization support
- INT8 typical SNR: 40-46 dB

Use cases:
- Quantization-aware training (QAT)
- ONNX quantization representation
- Fine-grained quantization control
- Error analysis and debugging
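
The Q -> DQ round trip and the SNR metric mentioned above can be sketched in plain NumPy. This is an illustrative sketch only: the helper names (`make_qparams`, `quantize`, `dequantize`, `snr_db`) are hypothetical and are not the `kpu.Q()` / `kpu.DQ()` signatures, which this tag message does not spell out.

```python
import numpy as np

def make_qparams(x, num_bits=8, symmetric=True):
    """Derive per-tensor (scale, zero_point, qmin, qmax) from the data range."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    if symmetric:
        # Symmetric mode: zero maps to code 0, range set by the largest magnitude.
        scale = max(float(np.max(np.abs(x))), 1e-12) / qmax
        zero_point = 0
    else:
        # Asymmetric mode: use the full [lo, hi] range, always including 0.
        lo, hi = min(float(x.min()), 0.0), max(float(x.max()), 0.0)
        scale = max(hi - lo, 1e-12) / (qmax - qmin)
        zero_point = int(round(qmin - lo / scale))
    return scale, zero_point, qmin, qmax

def quantize(x, scale, zero_point, qmin, qmax):
    """Q: float -> INT8 codes (this sketch stores codes as int8, so num_bits <= 8)."""
    q = np.round(x / scale) + zero_point
    return np.clip(q, qmin, qmax).astype(np.int8)

def dequantize(q, scale, zero_point):
    """DQ: INT8 codes -> float32."""
    return (q.astype(np.float32) - zero_point) * scale

def snr_db(reference, approx):
    """Signal-to-noise ratio of the quantization error, in dB."""
    noise = reference - approx
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))

rng = np.random.default_rng(0)
x = rng.standard_normal(4096).astype(np.float32)

scale, zp, qmin, qmax = make_qparams(x)
x_fq = dequantize(quantize(x, scale, zp, qmin, qmax), scale, zp)  # fake quantization
print(f"INT8 fake-quantization SNR: {snr_db(x, x_fq):.1f} dB")
```

Chaining `quantize` and `dequantize` is what a fake-quantize step does: the tensor stays in floating point but carries the rounding error of the integer grid.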

v0.7.9

v0.7.9 - Mixed Precision Support

Mixed precision inference with different dtypes for weights and activations:

Configurations:
| Name            | Weights | Activations | Reduction |
|-----------------|---------|-------------|-----------|
| MIXED_INT8_FP16 | INT8    | FP16        | 3.3x      |
| MIXED_INT8_BF16 | INT8    | BF16        | 3.3x      |
| MIXED_INT4_FP16 | INT4    | FP16        | 5.0x      |
| MIXED_FP8_FP16  | FP8     | FP16        | 3.3x      |
| MIXED_FP8_BF16  | FP8     | BF16        | 3.3x      |

New APIs:
- kpu.MixedPrecisionConfig - Custom configurations
- kpu.mixed_precision_linear()
- kpu.mixed_precision_matmul()
- kpu.mixed_precision_conv2d()
- kpu.calculate_mixed_precision_traffic()

Key insight: Keeping activations in FP16/BF16 while compressing
weights to INT8/INT4 provides ~1.5x better accuracy than pure INT8
with similar memory benefits.

Common use case: LLM inference with INT4 weights + BF16 activations.
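
A minimal NumPy sketch of the weights-compressed, activations-in-FP16 idea follows. It assumes symmetric per-output-channel INT8 weights and FP32 accumulation; `quantize_weights_int8` and `mixed_linear` are hypothetical names for illustration and do not describe how `kpu.mixed_precision_linear()` is implemented.

```python
import numpy as np

def quantize_weights_int8(w):
    """Symmetric per-output-channel INT8 quantization of an (out, in) weight matrix."""
    scale = np.max(np.abs(w), axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0.0, 1.0, scale)       # avoid dividing by zero
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def mixed_linear(x_fp16, w_q, w_scale):
    """y = x @ W^T with FP16 activations and INT8 weights, accumulated in FP32."""
    w_deq = w_q.astype(np.float32) * w_scale          # dequantize weights on the fly
    return x_fp16.astype(np.float32) @ w_deq.T

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 256)).astype(np.float32)
w = rng.standard_normal((64, 256)).astype(np.float32)

w_q, w_scale = quantize_weights_int8(w)
y_ref = x @ w.T                                       # FP32 baseline
y_mixed = mixed_linear(x.astype(np.float16), w_q, w_scale)

rel_err = np.linalg.norm(y_mixed - y_ref) / np.linalg.norm(y_ref)
print(f"relative error vs FP32: {rel_err:.4f}")
print(f"weight bytes: FP32 {w.nbytes}, INT8 {w_q.nbytes}")
```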

v0.7.8

v0.7.8 - FP4 Support (Packed Storage)

4-bit floating-point quantization for extreme model compression:

FP4 Formats:
| Format | Range      | Values |
|--------|------------|--------|
| E2M1   | [-6, 6]    | 15     |
| E1M2   | [-3.5, 3.5]| 15     |

E2M1 representable values:
  0, ±0.5, ±1.0, ±1.5, ±2.0, ±3.0, ±4.0, ±6.0

New APIs:
- kpu.fp4_quantize() / fp4_dequantize()
- kpu.pack_fp4() / unpack_fp4()
- kpu.fp4_matmul() / fp4_linear()
- kpu.fp4_range() / fp4_values() / fp4_info()
- kpu.FP4_E2M1 / FP4_E1M2 format constants

Memory: 8x bandwidth reduction vs FP32

FP4 completes the 4-bit quantization support alongside INT4.
Use cases: extreme edge deployment, model exploration.
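
The E2M1 value list above can be reproduced by decoding the bit fields directly. The sketch below assumes an exponent bias of 1, subnormals at exponent 0, and no codes reserved for inf/NaN; the function names are hypothetical and are not the `kpu.fp4_*` API. It also shows nearest-value quantization and two-codes-per-byte packing.

```python
import numpy as np

def e2m1_magnitudes(bias=1):
    """Decode the 8 non-negative E2M1 codes (2 exponent bits, 1 mantissa bit)."""
    mags = []
    for exp in range(4):
        for man in range(2):
            if exp == 0:   # subnormal: no implicit leading 1
                mags.append(man * 0.5 * 2.0 ** (1 - bias))
            else:          # normal: implicit leading 1
                mags.append((1.0 + man * 0.5) * 2.0 ** (exp - bias))
    return np.array(mags, dtype=np.float32)

MAGS = e2m1_magnitudes()

def quantize_e2m1(x):
    """Map floats to 4-bit codes: bit 3 is the sign, bits 0-2 index MAGS.

    Picks the nearest magnitude; ties go to the smaller one, and out-of-range
    inputs saturate at the largest magnitude.
    """
    x = np.asarray(x, dtype=np.float32)
    idx = np.argmin(np.abs(np.abs(x)[..., None] - MAGS), axis=-1).astype(np.uint8)
    sign = (x < 0).astype(np.uint8)
    return (sign << 3) | idx

def dequantize_e2m1(codes):
    """Map 4-bit codes back to float32 values."""
    sign = np.where(codes & 0x8, -1.0, 1.0).astype(np.float32)
    return sign * MAGS[codes & 0x7]

def pack_nibbles(codes):
    """Pack an even-length array of 4-bit codes, two per byte (low nibble first)."""
    pairs = codes.reshape(-1, 2)
    return (pairs[:, 0] | (pairs[:, 1] << 4)).astype(np.uint8)

x = np.array([0.2, -0.7, 2.4, 5.1, -9.0, 1.0], dtype=np.float32)
codes = quantize_e2m1(x)
print(MAGS)                      # the non-negative representable values
print(dequantize_e2m1(codes))    # x snapped onto the E2M1 grid
print(pack_nibbles(codes).nbytes, "bytes for", x.size, "values")
```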

v0.7.7

v0.7.7 - INT4 Support (Packed Storage)

4-bit integer quantization for aggressive model compression:

INT4 Format:
- Signed: [-8, 7] (16 discrete values)
- Unsigned: [0, 15] (16 discrete values)
- Storage: 2 values per byte (packed)

New APIs:
- kpu.pack_int4() / unpack_int4() - Packed storage
- kpu.quantize_int4() / dequantize_int4()
- kpu.compute_int4_scale_zero_point()
- kpu.int4_matmul() / int4_linear()
- kpu.int4_packed_size() / int4_memory_bytes()
- kpu.int4_info()

Memory Comparison (1024x1024 matmul):
| Type | Traffic | Reduction |
|------|---------|-----------|
| FP32 | 12.6 MB | 1x |
| FP16 | 6.3 MB  | 2x |
| INT8 | 3.1 MB  | 4x |
| INT4 | 1.6 MB  | 8x |

INT4 is used for extreme compression where bandwidth is critical
and accuracy loss is acceptable (e.g., edge deployment, LLM inference).
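
Packed storage and the traffic figures above can be illustrated with a short NumPy sketch. The nibble order (low nibble first) and the traffic model (three 1024x1024 matrices A, B, and C moved once each, with 1 MB = 10^6 bytes) are assumptions chosen because they reproduce the table; the function names are hypothetical and are not the `kpu.pack_int4()` API.

```python
import numpy as np

def pack_int4_signed(values):
    """Pack signed INT4 values in [-8, 7] two per byte (low nibble first)."""
    v = np.asarray(values, dtype=np.int8).ravel()
    if v.size % 2:
        v = np.append(v, np.int8(0))           # pad odd lengths with a zero nibble
    nibbles = (v & 0x0F).astype(np.uint8)       # keep the two's-complement low 4 bits
    return nibbles[0::2] | (nibbles[1::2] << 4)

def unpack_int4_signed(packed, count):
    """Inverse of pack_int4_signed; `count` drops any padding nibble."""
    packed = np.asarray(packed, dtype=np.uint8)
    nibbles = np.empty(packed.size * 2, dtype=np.uint8)
    nibbles[0::2] = packed & 0x0F
    nibbles[1::2] = packed >> 4
    signed = nibbles.astype(np.int8)
    signed[signed > 7] -= 16                    # sign-extend the 4-bit value
    return signed[:count]

def matmul_traffic_mb(n, bits):
    """Bytes moved for A, B, and C of an n x n matmul, in MB (10^6 bytes)."""
    return 3 * n * n * bits / 8 / 1e6

vals = np.arange(-8, 8, dtype=np.int8)
packed = pack_int4_signed(vals)
print(packed.nbytes, "bytes hold", vals.size, "INT4 values")

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {matmul_traffic_mb(1024, bits):.1f} MB")
```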

v0.7.3

v0.7.3 - FP8 Support (All Variants)

Complete 8-bit floating-point infrastructure supporting all common formats:

FP8 Formats:
| Format | Bits (exp+man) | Range  | Epsilon | Use Case         |
|--------|----------------|--------|---------|------------------|
| E2M5   | 2+5            | ±3.9   | 0.031   | Gradients        |
| E3M4   | 3+4            | ±15.5  | 0.063   | General          |
| E4M3   | 4+3            | ±240   | 0.125   | Weights (NVIDIA) |
| E5M2   | 5+2            | ±57344 | 0.250   | Activations      |

New APIs:
- kpu.fp8_matmul() - FP8 matrix multiplication
- kpu.fp8_linear() - FP8 linear layer
- kpu.cast_to_fp8() / cast_from_fp8() - Type conversion
- kpu.fp8_range() / fp8_precision() / fp8_info()
- kpu.FP8_E4M3, FP8_E5M2, etc. - Format constants

Performance:
- 4x memory bandwidth reduction vs FP32
- E4M3 recommended for weights (NVIDIA H100 standard)
- E5M2 recommended for activations (wider range)

Uses ml_dtypes for native E4M3/E5M2 when installed,
otherwise emulates by quantizing to FP8 precision.
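
The Range and Epsilon columns in the format table follow from the bit layout. The sketch below assumes an IEEE-style encoding (bias = 2^(exp_bits-1) - 1, all-ones exponent reserved for inf/NaN), which is the assumption that reproduces the table; `fp8_format_stats` is a hypothetical helper, not `kpu.fp8_range()`.

```python
def fp8_format_stats(exp_bits, man_bits):
    """Return (max finite value, epsilon) for an IEEE-style minifloat.

    Assumes bias = 2**(exp_bits - 1) - 1 and that the all-ones exponent
    is reserved for inf/NaN.
    """
    bias = 2 ** (exp_bits - 1) - 1
    max_exp = (2 ** exp_bits - 2) - bias               # largest non-reserved exponent
    max_value = 2.0 ** max_exp * (2.0 - 2.0 ** -man_bits)
    epsilon = 2.0 ** -man_bits                          # gap between 1.0 and the next value
    return max_value, epsilon

for name, e, m in [("E2M5", 2, 5), ("E3M4", 3, 4), ("E4M3", 4, 3), ("E5M2", 5, 2)]:
    max_value, eps = fp8_format_stats(e, m)
    print(f"{name}: max={max_value:g}, eps={eps:g}")
```

Under these assumptions E4M3 tops out at 240. The OCP-style E4M3 variant (`float8_e4m3fn` in ml_dtypes) gives up infinities to reuse the top exponent and reaches 448, so the range seen in practice depends on which variant is in use.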

v0.7.2

v0.7.2 - BF16 (BFloat16) Support

BFloat16 operations with ml_dtypes integration and fallback emulation:

New APIs:
- kpu.bf16_matmul() - BF16 matrix multiplication
- kpu.bf16_linear() - BF16 linear layer
- kpu.bf16_conv2d() - BF16 2D convolution
- kpu.cast_to_bf16() / cast_from_bf16() - Type conversion
- kpu.bf16_range() - Returns the representable range (approx. ±3.4e38)
- kpu.bf16_precision() - Returns epsilon (~0.0078)
- kpu.is_bfloat16_native() - Check ml_dtypes availability

BF16 vs FP16 Comparison:
| Property | FP16 | BF16 |
|----------|------|------|
| Bytes | 2 | 2 |
| Range | ±6.5e4 | ±3.4e38 |
| Precision | ~3 digits | ~2 digits |
| Rel. Error | ~0.15% | ~2% |

BF16 trades precision for range, making it ideal for deep learning
where gradient magnitudes vary widely.

Optional dependency: pip install stillwater-kpu[bfloat16]
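
When ml_dtypes is not installed, BF16 can be emulated in NumPy because a BF16 value is the top 16 bits of a float32. The sketch below shows one common way to do this (round to nearest, ties to even, then clear the low 16 bits). It is an assumption about what a fallback could look like, not the package's actual emulation code, and `round_to_bf16` is a hypothetical name.

```python
import numpy as np

def round_to_bf16(x):
    """Round float32 values to the nearest BF16 value, returned as float32.

    Uses round-to-nearest-even on the raw bits. NaN inputs are not
    special-cased in this sketch.
    """
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    # 0x7FFF plus the lowest kept bit implements ties-to-even.
    rounding_bias = np.uint32(0x7FFF) + ((bits >> np.uint32(16)) & np.uint32(1))
    rounded = (bits + rounding_bias) & np.uint32(0xFFFF0000)
    return rounded.view(np.float32)

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000).astype(np.float32)
x_bf16 = round_to_bf16(x)

rel_err = np.max(np.abs(x_bf16 - x) / np.abs(x))
print(f"max relative rounding error: {rel_err:.5f}")

# BF16 keeps the float32 exponent, so large magnitudes stay finite.
big = np.array([1e38], dtype=np.float32)
print(round_to_bf16(big))
```

The rounding error per element is bounded by 2^-8 (half of the ~0.0078 epsilon), while the 8-bit exponent preserves the full float32 range.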

v0.7.1

v0.7.1 - FP16 Support

Native half-precision floating-point operations using NumPy float16:

New APIs:
- kpu.fp16_matmul() - FP16 matrix multiplication
- kpu.fp16_linear() - FP16 linear layer
- kpu.fp16_conv2d() - FP16 2D convolution
- kpu.cast_to_fp16() / cast_from_fp16() - Type conversion
- kpu.fp16_range() - Get representable range
- kpu.fp16_precision() - Get machine epsilon

Performance Characteristics:
- 2x memory bandwidth reduction vs FP32
- ~0.15-0.33% relative error vs FP32 baseline
- Range: ±65504
- Precision: ~3 decimal digits

Uses NumPy native float16 for realistic half-precision behavior
including reduced precision and potential overflow.
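
The range, epsilon, and overflow behavior described above can be checked directly with NumPy's `float16`. This sketch uses only NumPy, not the `kpu.fp16_*` functions, and the 64x64 matmul is an arbitrary illustration size.

```python
import numpy as np

info = np.finfo(np.float16)
print("max finite:", info.max)
print("epsilon:", info.eps)

# Values beyond the FP16 range overflow to inf when cast down.
x = np.array([1000.0, 60000.0, 70000.0], dtype=np.float32)
with np.errstate(over="ignore"):
    x16 = x.astype(np.float16)
print(x16)

# Relative error of an FP16 matmul against an FP32 baseline.
rng = np.random.default_rng(0)
a = rng.standard_normal((64, 64)).astype(np.float32)
b = rng.standard_normal((64, 64)).astype(np.float32)
ref = a @ b
half = (a.astype(np.float16) @ b.astype(np.float16)).astype(np.float32)
rel = np.linalg.norm(half - ref) / np.linalg.norm(ref)
print(f"relative error vs FP32: {rel:.5f}")
```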