Tags: stillwater-sc/kpu-sim
v0.8.0: Native wheel infrastructure for PyPI

- scikit-build-core integration for multi-platform wheel builds
- cibuildwheel CI/CD for Linux, macOS, and Windows
- Fixed DFX parser library build (dfx_parser.cpp)
- Fixed MSVC symbol export for trace library
- Universal library v3.91 with /Zc:__cplusplus for MSVC C++20 support

v0.7.12 - Python-C++ Integration Release

Features:
- XUE Observation Architecture with hierarchical event tracking
- All Python Tensor operations route through C++ BehavioralComputeFabric
- Complete quantization infrastructure (INT8, calibration, Q/DQ ops)
- Roofline analysis with DRAM traffic recording

Technical Changes:
- Native functions: native_matmul, native_add/sub/mul/div, native_relu/gelu/silu/sigmoid/tanh
- XUE event hierarchy: NAMED_OP, ALU_PRIMITIVE, MEMORY categories
- Fixed double-counting in hierarchical FLOPs recording

Note: Full C++ simulation requires source build with CMake. PyPI package provides Python orchestration with NumPy fallback. Native wheel infrastructure planned for v0.8.0.

v0.7.10 - Q/DQ Operations

Explicit Quantize/Dequantize operations for QAT and ONNX-style graphs:

Q/DQ Pattern: input -> Q -> DQ -> op -> Q -> DQ -> output

New APIs:
- kpu.QDQParams - Parameter container
- kpu.Q() / kpu.DQ() - Core quantize/dequantize
- kpu.fake_quantize() - Simulate quantization error
- kpu.qdq_linear() / qdq_matmul() / qdq_conv2d()
- kpu.create_qdq_params() - Calibration utility
- kpu.quantize_error() - Error metrics (SNR, etc.)

Features:
- Per-tensor and per-channel quantization
- Symmetric and asymmetric modes
- Serialization support
- INT8 typical SNR: 40-46 dB

Use cases:
- Quantization-aware training (QAT)
- ONNX quantization representation
- Fine-grained quantization control
- Error analysis and debugging

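The Q -> DQ step can be illustrated with plain NumPy. This is a minimal sketch of symmetric per-tensor INT8 fake quantization, not the kpu implementation: the helper names `quantize`, `dequantize`, and `snr_db` are hypothetical, and the calibration rule (largest magnitude maps to code 127) is an assumption.

```python
import numpy as np

def quantize(x, scale, zero_point=0, qmin=-128, qmax=127):
    """Q: map float values to INT8 codes (hypothetical helper, not kpu.Q)."""
    q = np.round(x / scale) + zero_point
    return np.clip(q, qmin, qmax).astype(np.int8)

def dequantize(q, scale, zero_point=0):
    """DQ: map INT8 codes back to float32 (hypothetical helper, not kpu.DQ)."""
    return (q.astype(np.float32) - zero_point) * scale

def snr_db(reference, approx):
    """Signal-to-noise ratio of approx relative to reference, in dB."""
    noise = reference - approx
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000).astype(np.float32)

# Assumed calibration: symmetric per-tensor, largest magnitude maps to code 127.
scale = float(np.max(np.abs(x))) / 127.0

# Q followed by DQ keeps the tensor in float but injects the INT8 rounding error.
x_fq = dequantize(quantize(x, scale), scale)
print(f"SNR after Q -> DQ: {snr_db(x, x_fq):.1f} dB")
```

Because the data stays in floating point after DQ, the downstream op sees exactly the error a real INT8 path would introduce, which is what makes the pattern usable for QAT and error analysis.
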
v0.7.9 - Mixed Precision Support

Mixed precision inference with different dtypes for weights and activations:

Configurations:

| Name            | Weights | Acts | Reduction |
|-----------------|---------|------|-----------|
| MIXED_INT8_FP16 | INT8    | FP16 | 3.3x      |
| MIXED_INT8_BF16 | INT8    | BF16 | 3.3x      |
| MIXED_INT4_FP16 | INT4    | FP16 | 5.0x      |
| MIXED_FP8_FP16  | FP8     | FP16 | 3.3x      |
| MIXED_FP8_BF16  | FP8     | BF16 | 3.3x      |

New APIs:
- kpu.MixedPrecisionConfig - Custom configurations
- kpu.mixed_precision_linear()
- kpu.mixed_precision_matmul()
- kpu.mixed_precision_conv2d()
- kpu.calculate_mixed_precision_traffic()

Key insight: Keeping activations in FP16/BF16 while compressing weights to INT8/INT4 provides ~1.5x better accuracy than pure INT8 with similar memory benefits.

Common use case: LLM inference with INT4 weights + BF16 activations.

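As an illustration of the INT8-weights / FP16-activations idea, the sketch below quantizes only the weights and keeps activations in half precision. It is a NumPy approximation under stated assumptions (symmetric per-output-channel scales, weights dequantized to FP16 before the multiply); the function names are hypothetical and it does not reproduce the reduction factors in the table, which depend on how kpu counts traffic.

```python
import numpy as np

def quantize_weights_int8(w):
    """Symmetric per-output-channel INT8 quantization of a (K, N) weight matrix."""
    scale = np.max(np.abs(w), axis=0, keepdims=True) / 127.0   # shape (1, N)
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def mixed_int8_fp16_matmul(x_fp16, w_q, w_scale):
    """FP16 activations times INT8 weights; weights are dequantized to FP16 first."""
    w_fp16 = (w_q.astype(np.float32) * w_scale).astype(np.float16)
    return x_fp16 @ w_fp16

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 64)).astype(np.float32)    # activations
w = rng.standard_normal((64, 64)).astype(np.float32)   # weights

w_q, w_scale = quantize_weights_int8(w)
y_mixed = mixed_int8_fp16_matmul(x.astype(np.float16), w_q, w_scale)
y_ref = x @ w                                           # FP32 baseline

rel_err = np.linalg.norm(y_mixed.astype(np.float32) - y_ref) / np.linalg.norm(y_ref)
print(f"relative error vs FP32: {rel_err:.4f}")
print(f"weight bytes: FP32 {w.nbytes}, INT8 {w_q.nbytes}")
```
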
v0.7.8 - FP4 Support (Packed Storage)

4-bit floating-point quantization for extreme model compression:

FP4 Formats:

| Format | Range       | Values |
|--------|-------------|--------|
| E2M1   | [-6, 6]     | 15     |
| E1M2   | [-3.5, 3.5] | 15     |

E2M1 representable values: 0, ±0.5, ±1.0, ±1.5, ±2.0, ±3.0, ±4.0, ±6.0

New APIs:
- kpu.fp4_quantize() / fp4_dequantize()
- kpu.pack_fp4() / unpack_fp4()
- kpu.fp4_matmul() / fp4_linear()
- kpu.fp4_range() / fp4_values() / fp4_info()
- kpu.FP4_E2M1 / FP4_E1M2 format constants

Memory: 8x bandwidth reduction vs FP32

FP4 completes the 4-bit quantization support alongside INT4. Use cases: extreme edge deployment, model exploration.

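A small NumPy sketch can make the E2M1 value set and the packed layout concrete. The magnitudes come from the list above; everything else is an assumption for illustration: the code layout (bit 3 as sign, low 3 bits as an index into the magnitude table), the low-nibble-first packing order, saturation at ±6, and ties resolved toward the smaller magnitude. The function names are hypothetical and are not the kpu APIs.

```python
import numpy as np

# Non-negative E2M1 magnitudes from the release note.
E2M1_MAGNITUDES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)

def e2m1_quantize(x):
    """Return one 4-bit code per element (stored in uint8) for the nearest E2M1 value."""
    x = np.asarray(x, dtype=np.float32)
    mag = np.clip(np.abs(x), 0.0, 6.0)                       # saturate at +/-6
    idx = np.argmin(np.abs(mag[..., None] - E2M1_MAGNITUDES), axis=-1)
    sign = (x < 0).astype(np.uint8)
    return ((sign << 3) | idx.astype(np.uint8)).astype(np.uint8)

def e2m1_dequantize(codes):
    """Map 4-bit codes back to float32 values."""
    codes = np.asarray(codes, dtype=np.uint8)
    mag = E2M1_MAGNITUDES[codes & 0x7]
    return np.where((codes & 0x8) != 0, -mag, mag)

def pack_nibbles(codes):
    """Pack an even-length array of 4-bit codes, two per byte (low nibble first)."""
    codes = np.asarray(codes, dtype=np.uint8).ravel()
    return (codes[0::2] | (codes[1::2] << 4)).astype(np.uint8)

def unpack_nibbles(packed):
    """Inverse of pack_nibbles."""
    packed = np.asarray(packed, dtype=np.uint8)
    out = np.empty(packed.size * 2, dtype=np.uint8)
    out[0::2] = packed & 0x0F
    out[1::2] = packed >> 4
    return out

x = np.array([0.2, -0.7, 1.3, 2.4, -5.1, 9.0], dtype=np.float32)
codes = e2m1_quantize(x)
packed = pack_nibbles(codes)                 # 6 values stored in 3 bytes
x_hat = e2m1_dequantize(unpack_nibbles(packed))
print(x_hat)                                 # values: 0, -0.5, 1.5, 2, -6, 6
```

Six float32 inputs (24 bytes) end up in 3 bytes, which is the 8x reduction quoted above.
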
v0.7.7 - INT4 Support (Packed Storage)

4-bit integer quantization for aggressive model compression:

INT4 Format:
- Signed: [-8, 7] (16 discrete values)
- Unsigned: [0, 15] (16 discrete values)
- Storage: 2 values per byte (packed)

New APIs:
- kpu.pack_int4() / unpack_int4() - Packed storage
- kpu.quantize_int4() / dequantize_int4()
- kpu.compute_int4_scale_zero_point()
- kpu.int4_matmul() / int4_linear()
- kpu.int4_packed_size() / int4_memory_bytes()
- kpu.int4_info()

Memory Comparison (1024x1024 matmul):

| Type | Traffic | Reduction |
|------|---------|-----------|
| FP32 | 12.6 MB | 1x        |
| FP16 | 6.3 MB  | 2x        |
| INT8 | 3.1 MB  | 4x        |
| INT4 | 1.6 MB  | 8x        |

INT4 is used for extreme compression where bandwidth is critical and accuracy loss is acceptable (e.g., edge deployment, LLM inference).

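The packed layout and the traffic table can both be sketched in a few lines. The helpers below are hypothetical stand-ins, not the kpu functions: they assume two's-complement nibbles stored low nibble first, and the traffic estimate assumes each of the three 1024x1024 matrices (A, B, C) crosses the memory interface exactly once, which happens to reproduce the figures in the table.

```python
import numpy as np

def pack_int4_sketch(values):
    """Pack signed INT4 values in [-8, 7] two per byte (low nibble first)."""
    v = np.asarray(values, dtype=np.int8).ravel()
    if v.size % 2:
        v = np.append(v, np.int8(0))          # pad odd lengths with a zero
    nibbles = (v & 0x0F).astype(np.uint8)     # two's-complement low 4 bits
    return nibbles[0::2] | (nibbles[1::2] << 4)

def unpack_int4_sketch(packed, count):
    """Recover `count` signed INT4 values from packed bytes."""
    packed = np.asarray(packed, dtype=np.uint8)
    nibbles = np.empty(packed.size * 2, dtype=np.uint8)
    nibbles[0::2] = packed & 0x0F
    nibbles[1::2] = packed >> 4
    signed = nibbles.astype(np.int8)
    signed[signed > 7] -= 16                  # sign-extend the 4-bit value
    return signed[:count]

def matmul_traffic_mb(n, bits_per_element):
    """MB moved for an n x n matmul, assuming A and B are read and C written once."""
    return 3 * n * n * bits_per_element / 8 / 1e6

values = np.array([-8, -1, 0, 7, 3], dtype=np.int8)
packed = pack_int4_sketch(values)             # 5 values -> 3 bytes
restored = unpack_int4_sketch(packed, values.size)
print(restored)

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {matmul_traffic_mb(1024, bits):.1f} MB")
```
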
v0.7.3 - FP8 Support (All Variants)

Complete 8-bit floating-point infrastructure supporting all common formats:

FP8 Formats:

| Format | Bits | Range  | Epsilon | Use Case         |
|--------|------|--------|---------|------------------|
| E2M5   | 2+5  | ±3.9   | 0.031   | Gradients        |
| E3M4   | 3+4  | ±15.5  | 0.063   | General          |
| E4M3   | 4+3  | ±240   | 0.125   | Weights (NVIDIA) |
| E5M2   | 5+2  | ±57344 | 0.250   | Activations      |

New APIs:
- kpu.fp8_matmul() - FP8 matrix multiplication
- kpu.fp8_linear() - FP8 linear layer
- kpu.cast_to_fp8() / cast_from_fp8() - Type conversion
- kpu.fp8_range() / fp8_precision() / fp8_info()
- kpu.FP8_E4M3, FP8_E5M2, etc. - Format constants

Performance:
- 4x memory bandwidth reduction vs FP32
- E4M3 recommended for weights (NVIDIA H100 standard)
- E5M2 recommended for activations (wider range)

Uses ml_dtypes for native E4M3/E5M2 when installed, otherwise emulates by quantizing to FP8 precision.

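The Range and Epsilon columns follow from the exponent/mantissa split if one assumes IEEE-style conventions: bias 2^(e-1) - 1 and the all-ones exponent reserved for inf/NaN. The sketch below works under that assumption and uses a hypothetical helper name. Variants that reclaim the top exponent for finite values have a larger maximum (for example, the `float8_e4m3fn` type in ml_dtypes reaches 448 rather than 240).

```python
def minifloat_info(exp_bits, man_bits):
    """Max finite value and epsilon for an IEEE-style minifloat.

    Assumes bias = 2**(exp_bits - 1) - 1 and that the all-ones exponent
    field is reserved for inf/NaN.
    """
    bias = 2 ** (exp_bits - 1) - 1
    max_exponent = (2 ** exp_bits - 2) - bias          # largest normal exponent
    max_value = 2.0 ** max_exponent * (2.0 - 2.0 ** -man_bits)
    epsilon = 2.0 ** -man_bits                         # spacing just above 1.0
    return max_value, epsilon

for name, e, m in [("E2M5", 2, 5), ("E3M4", 3, 4), ("E4M3", 4, 3), ("E5M2", 5, 2)]:
    max_value, eps = minifloat_info(e, m)
    print(f"{name}: max = {max_value:g}, epsilon = {eps:g}")
```
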
v0.7.2 - BF16 (BFloat16) Support

BFloat16 operations with ml_dtypes integration and fallback emulation:

New APIs:
- kpu.bf16_matmul() - BF16 matrix multiplication
- kpu.bf16_linear() - BF16 linear layer
- kpu.bf16_conv2d() - BF16 2D convolution
- kpu.cast_to_bf16() / cast_from_bf16() - Type conversion
- kpu.bf16_range() - Returns (±3.4e38)
- kpu.bf16_precision() - Returns epsilon (~0.0078)
- kpu.is_bfloat16_native() - Check ml_dtypes availability

BF16 vs FP16 Comparison:

| Property   | FP16      | BF16      |
|------------|-----------|-----------|
| Bytes      | 2         | 2         |
| Range      | ±6.5e4    | ±3.4e38   |
| Precision  | ~3 digits | ~2 digits |
| Rel. Error | ~0.15%    | ~2%       |

BF16 trades precision for range, making it ideal for deep learning where gradient magnitudes vary widely.

Optional dependency: pip install stillwater-kpu[bfloat16]

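One common way to emulate BF16 without ml_dtypes is to keep only the upper 16 bits of the float32 bit pattern. The sketch below does that with round-to-nearest-even; it is an assumed approach for illustration, not necessarily how the kpu fallback works. The helper names are hypothetical, and NaN inputs are not handled.

```python
import numpy as np

def to_bf16_bits(x):
    """Round float32 to bfloat16 (nearest-even); returns uint16 bit patterns.

    NaN inputs are not handled in this sketch.
    """
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    # Add 0x7FFF plus the lowest kept bit so that ties round to even.
    rounding_bias = np.uint32(0x7FFF) + ((bits >> np.uint32(16)) & np.uint32(1))
    return ((bits + rounding_bias) >> np.uint32(16)).astype(np.uint16)

def from_bf16_bits(b):
    """Expand bfloat16 bit patterns back to float32 by zero-filling the low 16 bits."""
    widened = np.asarray(b, dtype=np.uint16).astype(np.uint32) << np.uint32(16)
    return widened.view(np.float32)

x = np.array([1.0, 3.14159265, 1.0e30, 70000.0], dtype=np.float32)
x_bf16 = from_bf16_bits(to_bf16_bits(x))
rel_err = np.abs(x_bf16 - x) / np.abs(x)

with np.errstate(over="ignore"):
    x_fp16 = x.astype(np.float16)      # the two large values overflow to inf

print("bf16:", x_bf16)
print("fp16:", x_fp16)
print("max bf16 relative error:", rel_err.max())
```

The same inputs that overflow in FP16 stay finite in BF16, at the cost of a coarser mantissa (7 stored bits, so epsilon is 2^-7 ≈ 0.0078).
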
v0.7.1 - FP16 Support

Native half-precision floating-point operations using NumPy float16:

New APIs:
- kpu.fp16_matmul() - FP16 matrix multiplication
- kpu.fp16_linear() - FP16 linear layer
- kpu.fp16_conv2d() - FP16 2D convolution
- kpu.cast_to_fp16() / cast_from_fp16() - Type conversion
- kpu.fp16_range() - Get representable range
- kpu.fp16_precision() - Get machine epsilon

Performance Characteristics:
- 2x memory bandwidth reduction vs FP32
- ~0.15-0.33% relative error vs FP32 baseline
- Range: ±65504
- Precision: ~3 decimal digits

Uses NumPy native float16 for realistic half-precision behavior including reduced precision and potential overflow.

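The range, epsilon, and overflow behavior described above can be checked directly with NumPy's own float16, without any kpu calls. This is only a sketch: the matrix sizes and random data are arbitrary choices, and the measured error will vary with shape and value distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((64, 64)).astype(np.float32)
b = rng.standard_normal((64, 64)).astype(np.float32)

# FP32 baseline versus the same product computed on float16 operands.
ref = a @ b
half = (a.astype(np.float16) @ b.astype(np.float16)).astype(np.float32)
rel_err = np.linalg.norm(half - ref) / np.linalg.norm(ref)
print(f"relative error vs FP32: {rel_err:.4%}")

# Representable range and machine epsilon as reported by NumPy.
info = np.finfo(np.float16)
print(f"max = {float(info.max)}, epsilon = {float(info.eps)}")

# Values beyond the representable range overflow to infinity.
with np.errstate(over="ignore"):
    overflowed = np.float32(70000.0).astype(np.float16)
print(overflowed)
```
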