Custom CUDA kernels for accelerating inference of 1.58-bit ternary LLMs (weights in {-1, 0, +1}) with 2:4 structured sparsity (at most two nonzero weights in every group of four) on consumer Ampere GPUs. Exploits both ternary arithmetic (no multiplies) and Ampere's hardware sparse tensor cores to maximize throughput on the RTX 3060. Based on the Sparse-BitNet paper (Zhang et al., 2026).
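
To make the two ideas concrete, here is a minimal CPU reference sketch in plain C++. It is not this repo's kernel code. It assumes ternary weights stored as `int8_t` values in {-1, 0, +1} and a hypothetical compressed layout (`Group24`) that keeps two position/sign pairs per group of four weights. The names `Group24`, `compress_2_4`, and `dot_sparse_ternary` are made up for illustration, and the real sparse tensor core format differs in its details.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One group of 4 consecutive weights under 2:4 sparsity: at most two nonzero.
// Hypothetical layout for illustration only, not the hardware metadata format.
struct Group24 {
    std::array<std::uint8_t, 2> idx;   // positions (0..3) of the kept weights
    std::array<std::int8_t, 2> sign;   // +1 or -1; 0 marks an unused slot
};

// Compress a dense ternary row into 2:4 groups.
// Returns false if the length is not a multiple of 4, a value is outside
// {-1, 0, +1}, or a group has more than two nonzeros (`out` is then partial).
bool compress_2_4(const std::vector<std::int8_t>& w, std::vector<Group24>& out) {
    if (w.size() % 4 != 0) return false;
    out.clear();
    for (std::size_t base = 0; base < w.size(); base += 4) {
        Group24 g{};  // zero-initialized: both slots unused
        int kept = 0;
        for (std::uint8_t j = 0; j < 4; ++j) {
            const std::int8_t v = w[base + j];
            if (v == 0) continue;
            if (v != 1 && v != -1) return false;  // not ternary
            if (kept == 2) return false;          // violates 2:4
            g.idx[kept] = j;
            g.sign[kept] = v;
            ++kept;
        }
        out.push_back(g);
    }
    return true;
}

// Dot product of a compressed ternary row with int8 activations.
// Each kept weight contributes an add or a subtract; no weight-activation
// multiply is needed, and zeroed positions are never read.
std::int32_t dot_sparse_ternary(const std::vector<Group24>& groups,
                                const std::vector<std::int8_t>& x) {
    assert(x.size() == 4 * groups.size());
    std::int32_t acc = 0;
    for (std::size_t g = 0; g < groups.size(); ++g) {
        for (int k = 0; k < 2; ++k) {
            const std::int8_t s = groups[g].sign[k];
            const std::int32_t a = x[4 * g + groups[g].idx[k]];
            if (s > 0) acc += a;
            else if (s < 0) acc -= a;
        }
    }
    return acc;
}
```

For example, the row `{1, 0, -1, 0, 0, 0, 0, 1}` against activations `{3, 5, 2, 7, 1, 1, 1, 4}` reduces to `3 - 2 + 4`. Only half of the activations are ever touched, which is the work a 2:4 sparse kernel can skip.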