Version 0.1.0 (released Jan 27, 2026; 1 unstable release)
haagenti-cuda
CUDA GPU acceleration for Haagenti tensor compression, enabling high-performance DCT/IDCT operations and HCT decompression directly on the GPU.
Features
- GPU DCT/IDCT: CUDA-accelerated Discrete Cosine Transform for spectral compression
- HCT Decompression: Reconstruct tensors from HCT V3 compressed format
- Memory Pipeline: Zero-copy streaming with pinned memory and CUDA streams
- Kernel Caching: NVRTC compilation caching for faster startup
Architecture
Traditional Pipeline:
Disk → CPU RAM → Decompress (CPU) → GPU Transfer → Inference
[5s] [2GB] [500ms] [200ms] [ready]
GPU Decompression Pipeline:
Disk → Pinned Memory → GPU Transfer → Decompress (GPU) → Inference
[3s] [staged] [150ms] [50ms] [ready]
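To make the comparison concrete, the stage timings in the two diagrams can be added up. This is only a sketch: it assumes the stages run strictly back to back (no overlap between streams), and the helper names `total_ms`, `traditional_stages`, and `gpu_stages` are illustrative, not part of the crate.

```rust
/// Sum per-stage latencies (in milliseconds) for a load pipeline.
fn total_ms(stages: &[(&str, u64)]) -> u64 {
    stages.iter().map(|(_, ms)| ms).sum()
}

/// Timed stages of the traditional pipeline, copied from the diagram.
fn traditional_stages() -> Vec<(&'static str, u64)> {
    vec![
        ("disk read", 5000),
        ("cpu decompress", 500),
        ("gpu transfer", 200),
    ]
}

/// Timed stages of the GPU decompression pipeline, copied from the diagram.
fn gpu_stages() -> Vec<(&'static str, u64)> {
    vec![
        ("disk read", 3000),
        ("gpu transfer", 150),
        ("gpu decompress", 50),
    ]
}
```

Under that sequential assumption, `total_ms(&traditional_stages())` is 5700 ms and `total_ms(&gpu_stages())` is 3200 ms. Any overlap from CUDA streams would shrink the second figure further.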
Usage
GPU DCT/IDCT
use haagenti_cuda::dct_gpu::GpuDctContext;
// Create GPU context
let mut ctx = GpuDctContext::new(0)?;
// Forward DCT (compress)
let dct_coeffs = ctx.dct_2d(&input_data, width, height)?;
// Inverse DCT (decompress)
let reconstructed = ctx.idct_2d(&dct_coeffs, width, height)?;
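For readers who want to see what the transform computes, here is a naive CPU reference for a 1D DCT and its inverse. It is a sketch, not the crate's kernel: it assumes the orthonormal DCT-II convention (this README does not state which normalization the crate uses), and `dct_1d` / `idct_1d` are illustrative names. A 2D DCT is separable, so it can be obtained by applying the 1D transform along rows and then along columns.

```rust
use std::f64::consts::PI;

/// Scale factor for coefficient `k` in the orthonormal convention (assumed).
fn scale(k: usize, n: usize) -> f64 {
    if k == 0 {
        (1.0 / n as f64).sqrt()
    } else {
        (2.0 / n as f64).sqrt()
    }
}

/// Orthonormal DCT-II of a 1D signal (naive O(n^2) reference).
fn dct_1d(x: &[f64]) -> Vec<f64> {
    let n = x.len();
    (0..n)
        .map(|k| {
            let sum: f64 = x
                .iter()
                .enumerate()
                .map(|(i, &v)| v * (PI * (2 * i + 1) as f64 * k as f64 / (2 * n) as f64).cos())
                .sum();
            scale(k, n) * sum
        })
        .collect()
}

/// Inverse transform (DCT-III with the matching normalization).
fn idct_1d(c: &[f64]) -> Vec<f64> {
    let n = c.len();
    (0..n)
        .map(|i| {
            let sum: f64 = c
                .iter()
                .enumerate()
                .map(|(k, &v)| {
                    scale(k, n) * v * (PI * (2 * i + 1) as f64 * k as f64 / (2 * n) as f64).cos()
                })
                .sum();
            sum
        })
        .collect()
}
```

With this normalization, `idct_1d(&dct_1d(&x))` reproduces `x` up to floating-point rounding, which makes it a convenient check against GPU output on small inputs.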
HCT Decompression
use haagenti_cuda::decompress::{GpuDecompressor, decompress_cpu};
// GPU decompression
let mut decompressor = GpuDecompressor::new(0)?;
let tensor = decompressor.decompress(&compressed_data, &[576, 576])?;
// CPU fallback (no GPU required)
let tensor = decompress_cpu(&compressed_data, &[576, 576])?;
Batch Decompression with Statistics
use haagenti_cuda::decompress::GpuDecompressor;
let mut decompressor = GpuDecompressor::new(0)?;
// Batch decompress multiple tensors with performance stats
let tensors: Vec<(&[u8], &[usize])> = vec![
(&compressed1, &[576, 576]),
(&compressed2, &[1024, 512]),
];
let (results, stats) = decompressor.decompress_batch_pipelined(&tensors)?;
println!("{}", stats.summary());
// "2 tensors, 1.2 MB -> 4.5 MB (3.8x), 12.3ms (365.8 MB/s)"
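The numbers in the sample summary line can be reproduced by hand. The sketch below assumes the ratio is decompressed size over compressed size and that throughput is measured on the decompressed output; both assumptions are inferred from the sample line, not taken from the crate's source. The function names are illustrative.

```rust
/// Compression ratio as decompressed size over compressed size (assumed definition).
fn compression_ratio(compressed_mb: f64, decompressed_mb: f64) -> f64 {
    decompressed_mb / compressed_mb
}

/// Throughput in MB/s, measured on the decompressed output (assumed definition).
fn throughput_mb_per_s(decompressed_mb: f64, elapsed_ms: f64) -> f64 {
    decompressed_mb / (elapsed_ms / 1000.0)
}
```

For the sample line, `compression_ratio(1.2, 4.5)` is about 3.75 and `throughput_mb_per_s(4.5, 12.3)` is about 365.85, which is consistent with the "3.8x" and "365.8 MB/s" shown.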
Direct GPU Memory
use haagenti_cuda::decompress::GpuDecompressor;
let mut decompressor = GpuDecompressor::new(0)?;
// Decompress directly to GPU memory (no host copy)
let gpu_tensor = decompressor.decompress_to_gpu(&compressed, &shape)?;
// Use directly in inference without data transfer
Performance
| Operation | Size | Throughput |
|---|---|---|
| GPU DCT 2D | 576x576 | 400-2100 MB/s |
| GPU IDCT 2D | 576x576 | 400-2100 MB/s |
| CPU DCT fallback | 576x576 | 50-100 MB/s |
Kernel Architecture
DCT kernels automatically select optimal implementation:
- Shared Memory Kernel: For tensors ≤ 32 in any dimension (fast)
- Direct Kernel: For medium tensors (no shared memory limit)
- FFT-based Kernel: For large tensors with dimensions > 4096 (O(n log n) via cuFFT, requires the cufft feature)
The shared memory and direct kernels both use NVRTC runtime compilation with caching for fast startup.
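The selection rule above can be sketched as a small function. This is an illustration, not the crate's dispatch code: `DctKernel` and `select_kernel` are hypothetical names, and the sketch assumes the comparison is made on the larger of the two dimensions, which the list above does not spell out.

```rust
/// Which DCT implementation to run (hypothetical enum for illustration).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DctKernel {
    SharedMemory,
    Direct,
    Fft,
}

/// Pick a kernel from the tensor dimensions.
/// Assumption: thresholds apply to the larger dimension.
fn select_kernel(width: usize, height: usize, fft_threshold: usize) -> DctKernel {
    let largest = width.max(height);
    if largest <= 32 {
        // Small tiles fit in shared memory.
        DctKernel::SharedMemory
    } else if largest > fft_threshold {
        // Large tensors: O(n log n) path.
        DctKernel::Fft
    } else {
        DctKernel::Direct
    }
}
```

With the default threshold of 4096, a 576x576 tensor would take the direct path and an 8192x8192 tensor the FFT path. Lowering the threshold to 2048 would move a 4096x4096 tensor to the FFT path as well.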
FFT-based DCT for Large Tensors
For tensors with dimensions > 4096, enable the cufft feature for O(n log n) DCT:
use haagenti_cuda::dct_gpu::GpuDctContext;
let mut ctx = GpuDctContext::new(0)?;
// Automatic selection: uses FFT for dimensions > 4096
let coeffs = ctx.dct_2d(&large_data, 8192, 8192)?;
// Manual control
ctx.set_fft_threshold(2048); // Use FFT for dimensions > 2048
// Force direct method (bypass FFT)
let direct_coeffs = ctx.dct_2d_direct(&data, width, height)?;
| Tensor Size | Direct DCT | FFT DCT | Speedup |
|---|---|---|---|
| 1024x1024 | 2.1ms | 0.8ms | 2.6x |
| 4096x4096 | 134ms | 3.2ms | 42x |
| 8192x8192 | 536ms | 6.8ms | 79x |
Feature Flags
- default: LZ4, Zstd, CPU fallback
- cuda: Enable GPU tests (requires NVIDIA GPU)
- cufft: FFT-based DCT via cuFFT (requires libcufft)
- cpu-fallback: CPU DCT using haagenti-core
- async: Tokio async support
- python: PyO3 bindings
Conformance Testing
GPU implementations must pass the HCT conformance test suite to be considered compliant with the HCT specification.
Running Conformance Tests
# Standard run (requires NVIDIA GPU)
cargo run --release --example conformance_test -p haagenti-cuda
# With cuFFT support for large tensors
cargo run --release --example conformance_test -p haagenti-cuda --features cufft
# WSL2 with GPU passthrough
LD_LIBRARY_PATH=/usr/lib/wsl/lib cargo run --release --example conformance_test -p haagenti-cuda
Expected Output
╔═══════════════════════════════════════════════════════════════╗
║ HCT GPU CONFORMANCE TEST SUITE ║
║ Reference: HCT-SPECIFICATION-DRAFT.md Section 7 ║
╚═══════════════════════════════════════════════════════════════╝
Running 6 conformance tests...
───────────────────────────────────────────────────────────────────────────
Test Vector Shape GPU Cosine Expected Status
───────────────────────────────────────────────────────────────────────────
sequential_4x4 4x4 1.000000 1.000000 PASS
identity_4x4 4x4 1.000000 1.000000 PASS
gaussian_8x8 8x8 0.999998 0.999998 PASS
low_rank_8x8 8x8 0.934903 0.934903 PASS
zeros_4x4 4x4 1.000000 1.000000 PASS
constant_4x4 4x4 1.000000 1.000000 PASS
───────────────────────────────────────────────────────────────────────────
Result: ALL CONFORMANCE TESTS PASSED
Test Vectors
The conformance tests use reference test vectors from the HCT specification:
| Vector | Shape | Retention | Purpose |
|---|---|---|---|
| sequential_4x4 | 4x4 | 50% | Basic DCT validation |
| identity_4x4 | 4x4 | 50% | Sparse input handling |
| gaussian_8x8 | 8x8 | 70% | Realistic weight distribution |
| low_rank_8x8 | 8x8 | 30% | Low-retention quality |
| zeros_4x4 | 4x4 | 50% | Edge case: all zeros |
| constant_4x4 | 4x4 | 25% | DC-only compression |
See docs/HCT-SPECIFICATION-DRAFT.md Section 7 for complete test vector definitions.
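The "GPU Cosine" column presumably reports the cosine similarity between the reconstructed tensor and the original. A minimal sketch of that metric follows. The function name is illustrative, and the handling of all-zero inputs (returning 1.0 when both vectors are zero) is an assumption chosen to match the zeros_4x4 row, not something confirmed by the specification.

```rust
/// Cosine similarity between two equally sized tensors, accumulated in f64.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f64 {
    assert_eq!(a.len(), b.len(), "tensors must have the same length");
    let (mut dot, mut norm_a, mut norm_b) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (x as f64, y as f64);
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    // Assumption: two all-zero tensors count as a perfect match.
    if norm_a == 0.0 && norm_b == 0.0 {
        return 1.0;
    }
    // Exactly one zero tensor: similarity is undefined, report 0.0.
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    dot / (norm_a.sqrt() * norm_b.sqrt())
}
```

A low-retention vector such as low_rank_8x8 is expected to score well below 1.0, so a conformance check would compare against the per-vector expected value rather than a single global threshold.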
Exit Codes
- 0: All conformance tests passed
- 1: One or more tests failed
- 2: GPU initialization failed
Requirements
- CUDA 12.0+ with NVRTC
- NVIDIA GPU with compute capability 7.0+
- cudarc 0.12+
HCT V3 Format
The HCT (Holographic Compressed Tensor) format stores spectral coefficients:
Header:
2 bytes: num_fragments (u16 LE)
For each fragment:
2 bytes: index (u16 LE)
2 bytes: flags (u16 LE)
8 bytes: checksum (u64 LE)
4 bytes: data_len (u32 LE)
data_len bytes: fragment data
Fragment data (V3 with bitmap):
4 bytes: num_coefficients (u32 LE)
bitmap: (num_elements + 7) / 8 bytes
coefficients: num_coefficients * 2 bytes (f16 LE)
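The layout above maps onto a small byte-level reader. The following is a sketch under stated assumptions, not the crate's parser: the type and function names are hypothetical, `num_elements` is assumed to be supplied by the caller (for example, the product of the tensor shape), checksums are read but not verified, and coefficients are left as raw f16 bytes.

```rust
/// One fragment record, following the layout listed above (hypothetical type).
#[derive(Debug, PartialEq)]
struct Fragment<'a> {
    index: u16,
    flags: u16,
    checksum: u64,
    data: &'a [u8],
}

/// V3 fragment payload split into its parts (hypothetical type).
#[derive(Debug, PartialEq)]
struct V3Payload<'a> {
    num_coefficients: u32,
    bitmap: &'a [u8],
    coefficients: &'a [u8], // raw f16 LE bytes, 2 per coefficient
}

/// Take `n` bytes from the front of `buf` and advance it; None if too short.
fn take<'a>(buf: &mut &'a [u8], n: usize) -> Option<&'a [u8]> {
    if buf.len() < n {
        return None;
    }
    let (head, tail) = buf.split_at(n);
    *buf = tail;
    Some(head)
}

/// Read a fixed-size byte array from the front of `buf`.
fn read_array<const N: usize>(buf: &mut &[u8]) -> Option<[u8; N]> {
    let bytes = take(buf, N)?;
    let mut out = [0u8; N];
    out.copy_from_slice(bytes);
    Some(out)
}

/// Parse the header and all fragment records. Checksums are not verified here.
fn parse_fragments(mut buf: &[u8]) -> Option<Vec<Fragment<'_>>> {
    let num_fragments = u16::from_le_bytes(read_array(&mut buf)?);
    let mut fragments = Vec::with_capacity(num_fragments as usize);
    for _ in 0..num_fragments {
        let index = u16::from_le_bytes(read_array(&mut buf)?);
        let flags = u16::from_le_bytes(read_array(&mut buf)?);
        let checksum = u64::from_le_bytes(read_array(&mut buf)?);
        let data_len = u32::from_le_bytes(read_array(&mut buf)?) as usize;
        let data = take(&mut buf, data_len)?;
        fragments.push(Fragment { index, flags, checksum, data });
    }
    Some(fragments)
}

/// Split V3 fragment data into coefficient count, bitmap, and coefficient bytes.
fn parse_v3_payload(mut data: &[u8], num_elements: usize) -> Option<V3Payload<'_>> {
    let num_coefficients = u32::from_le_bytes(read_array(&mut data)?);
    let bitmap = take(&mut data, (num_elements + 7) / 8)?;
    let coefficients = take(&mut data, num_coefficients as usize * 2)?;
    Some(V3Payload { num_coefficients, bitmap, coefficients })
}
```

Both functions return `None` on truncated input instead of panicking, which keeps malformed files from crashing a loader. The bit order inside the bitmap is not specified in this README, so the sketch leaves the bitmap undecoded.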
License
MIT