Tags: SHI-Labs/NATTEN
0.21.6 Release (#333)

Changelog:

* Fixed syntax error in natten.profiler (only affects `python < 3.12`).
* Fixed stride overflow issue (occurs with very large seqlen + heads) in Blackwell FMHA/FNA, Hopper FMHA/FNA, and TokPerm kernels.
* Blackwell FMHA / FNA improvements:
  * Deterministic backward pass added to Blackwell FMHA kernel (ported from [FBGEMM](https://github.com/pytorch/FBGEMM)).
  * Arbitrary head dimensions up to 128 are now supported, provided they are multiples of 8 for float16 and bfloat16, and multiples of 16 for fp8.
* Hopper FMHA improvements:
  * Causal mask support
  * Varlen support
  * Removed extra memory op on LSE in backward pass (both FMHA and FNA)
* Improved deterministic backward pass:
  * Improved deterministic mode and torch compile support for cutlass-fmha/cutlass-fna backends:
    * `backward_kv_splits` and `backward_use_pt_reduction` are now independent of the backward config. The backward config is strictly tile shapes, while these two knobs are verified entirely inside the registered torch op. This hides some of the complexity from torch.compile and therefore doesn't trigger recompilations in varlen.
    * `backward_use_pt_reduction` controls whether we use native PyTorch ops for computing the sum of dO * O (delta), instead of the cutlass kernel. The cutlass kernel was non-deterministic, but slightly faster. However, the torch implementation itself can sometimes behave non-deterministically for certain use cases. We therefore replaced the cutlass kernel with the kernel from Hopper and Blackwell FMHA/FNA, which is deterministic. The deterministic behavior now is to use the cutlass kernel, and not the PyTorch implementation.
    * All knobs affecting determinism are checked against PyTorch's deterministic mode. Just set PyTorch to deterministic and NATTEN will respect that setting.
* Attention Merge backward pass now allows an arbitrary number of splits.
* Improved testing utilities, logging, stability, and runtime.
* Improved error reporting and handling in libnatten.
* Added build flags for adding lineinfo and building with PTX (`NATTEN_BUILD_WITH_PTX`, `NATTEN_BUILD_WITH_LINEINFO`).
* Added runtime environment variable `NATTEN_LOG_PIPE`, which allows customizing where NATTEN logs are streamed.
* `cmake` dependency removed; it is only a build dependency.
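
The head-dimension rule stated for Blackwell FMHA/FNA in the 0.21.6 notes above can be written down as a small check. This is only an illustrative sketch: `blackwell_head_dim_supported` and the dtype labels are hypothetical names, not part of NATTEN's API, and the actual kernels may impose further constraints that the changelog does not mention.

```python
# Hypothetical helper, NOT a NATTEN API. It only encodes the rule stated in
# the 0.21.6 changelog: head dims up to 128, multiples of 8 for
# float16/bfloat16, multiples of 16 for fp8.
_REQUIRED_MULTIPLE = {
    "float16": 8,
    "bfloat16": 8,
    "fp8": 16,
}
MAX_HEAD_DIM = 128


def blackwell_head_dim_supported(head_dim: int, dtype: str) -> bool:
    """Return True if head_dim satisfies the changelog's stated rule."""
    if dtype not in _REQUIRED_MULTIPLE:
        raise ValueError(f"unknown dtype label: {dtype!r}")
    multiple = _REQUIRED_MULTIPLE[dtype]
    return 0 < head_dim <= MAX_HEAD_DIM and head_dim % multiple == 0


if __name__ == "__main__":
    for head_dim, dtype in [(72, "float16"), (72, "fp8"), (80, "fp8"), (136, "bfloat16")]:
        print(head_dim, dtype, blackwell_head_dim_supported(head_dim, dtype))
```

For example, a head dimension of 72 passes for float16 but not for fp8, because 72 is a multiple of 8 but not of 16.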
Release/v0.21.5 (#310)

* Extended Attention (FMHA) functionality:
  * Causal mask, variable length: for now only supported in CUTLASS FMHA and Blackwell FMHA.
* Torch.compile support added
  * All libnatten ops are now registered as torch ops, enabling full-graph compilation with NATTEN ops.
* TokPerm kernels: Moved dilation to batch instead of heads, which finally unblocks GQA/MQA.
* GQA/MQA support added for all FNA and FMHA operations.
  * CUTLASS FNA/FMHA and Hopper FNA/FMHA don't support it in the kernels natively, therefore it's implemented with graph transforms for now.
* Dedicated Token Permute kernels
  * Token Permute/Unpermute and padding operations are now implemented as their own kernels, and can be used instead of the PyTorch implementation.
* More accurate `merge_attentions` backward pass
  * Limits number of outputs that can be merged to only 2 when `requires_grad=True`.
* Misc bug fixes
* Wheels for torch 2.10, python 3.14
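
The v0.21.5 notes above do not say which graph transforms are used for GQA/MQA on backends without native support. One common way to emulate grouped-query attention is to repeat each KV head so that the KV head count matches the query head count. The sketch below shows only that head mapping, in plain Python. The function names are hypothetical, and the contiguous grouping convention is an assumption; NATTEN's actual transform may differ.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def kv_head_for_query_head(q_head: int, heads_q: int, heads_kv: int) -> int:
    """Map a query head index to the KV head it shares.

    Assumption: query heads are split into contiguous, equally sized groups,
    one group per KV head (MQA is the special case heads_kv == 1).
    """
    if heads_kv <= 0 or heads_q % heads_kv != 0:
        raise ValueError("heads_q must be a positive multiple of heads_kv")
    if not 0 <= q_head < heads_q:
        raise ValueError("q_head out of range")
    group_size = heads_q // heads_kv
    return q_head // group_size


def expand_kv_heads(kv_heads: Sequence[T], heads_q: int) -> List[T]:
    """Repeat per-head KV entries so there is one entry per query head."""
    heads_kv = len(kv_heads)
    return [
        kv_heads[kv_head_for_query_head(h, heads_q, heads_kv)]
        for h in range(heads_q)
    ]


if __name__ == "__main__":
    # 8 query heads sharing 2 KV heads: each KV head is repeated 4 times.
    print(expand_kv_heads(["kv0", "kv1"], heads_q=8))
```

After such an expansion, a kernel that expects equal head counts for Q, K, and V can be used unchanged, at the cost of extra memory for the repeated KV heads.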
Release v0.21.0 (#245)

* Allow head dims 8 and 16 in Flex backend (without compilation only).
* Minor bug fixes in Flex wrapper / backend.
* Add support for SM120, Blackwell RTX. At this time only the [CUTLASS FNA/FMHA](https://natten.org/backends/#cutlass-fna-fmha) and [Flex FNA/FMHA](https://natten.org/backends/#flex-fna-fmha) backends are supported.
* Minor bug fixes in Hopper/Blackwell FMHA.
* Removed fused additional KV from Blackwell FNA forward: no improvement in perf, complicates dilation, and difficult to maintain.
* **Add backward pass kernels to Hopper/Blackwell FMHA/FNA**: Both backends now support backprop, with the full set of features available in NATTEN. Expected performance gains compared to running CUTLASS (Ampere) FNA/FMHA are up to 10X (op-level). Hopper FMHA/FNA achieves more than 50% of FAv3 backward performance, while Blackwell FMHA/FNA can in some cases even outperform cuDNN backward.
Release/v0.20.0 (#226)

This release includes the new kernels and features discussed in [Generalized Neighborhood Attention](https://arxiv.org/abs/2504.16922), and more.

* New documentation and website: [natten.org](https://natten.org/)
* New kernels: [Hopper FNA](https://natten.org/backends/#hopper-fna-fmha), [Blackwell FNA](https://natten.org/backends/#blackwell-fna-fmha), and [Flex FNA](https://natten.org/backends/#flex-fna-fmha) with multi-dimensional tiling.
* Optional compilation of Flex Attention, and [guards](https://natten.org/context) against it due to instability. Read more [here](https://natten.org/context/#flex-attention-torchcompile).
* Add support for [Strided Neighborhood Attention](https://arxiv.org/abs/2504.16922):
  * You can now implement Neighborhood Attention with a delay step in the sliding window.
  * This feature can implement many sparse attention patterns, such as [HaloNet](https://arxiv.org/abs/2103.12731), Blocked Attention, and combinations in between.
  * All fused neighborhood attention APIs in NATTEN (`na1d`, `na2d`, `na3d`) and torch modules (`NeighborhoodAttention{1,2,3}d`) now allow a `stride` argument, which has the same profile as `kernel_size` and `dilation`: either an integer, or a tuple with the same profile as the token layout.
  * This feature is supported by all of our existing and new backends.
  * When `stride == kernel_size`, the operation will implement Blocked Attention (a.k.a. Window Self Attention in [Swin Transformer](https://arxiv.org/abs/2103.14030)).
* [Profiling toolkit](https://natten.org/profiler) now ships with NATTEN: `python -m natten.profiler`.
* Removed Autotuner, in favor of eventually replacing it with the profiling toolkit, [NATTEN Simulator](https://arxiv.org/abs/2504.16922), and:
  * Direct exposure of kernel configurations.
  * Interfaces for finding valid configurations for your use case.
  * [Profiler dry runs](https://natten.org/profiler/#dry-run) can also help you navigate available backends, and their configurations that are suitable for your use case.
  * [Profiler optimize mode](https://natten.org/profiler/#optimize) can search through their configurations, and find the fastest one for your use case.
* Dropped unfused / CPU backends in libnatten.
  * `na{1,2,3}d_{qk,qkrpb,av}` APIs and respective backends have been dropped. It was difficult to continue maintaining them, as they mostly ran with very outdated and naive kernels, and the exceptions to that were not at all flexible with respect to user parameters. Moving forward, we will only provide [Fused Neighborhood Attention](https://arxiv.org/abs/2403.04690) kernels, but unfused kernels may be revisited depending on demand and use case.
  * CPU implementations were all unfused, and were very limited as well, and are likewise removed.
  * Our new [Flex FNA](https://natten.org/backends/#flex-fna-fmha) backend will serve as the default option for non-NVIDIA GPU users.
* Dropped support for RPB.
* Dropped support for experimental torch ops.
* Massively improved error messages, type checking.
* Considerable refactor of libnatten, and reduced binary size.
* Unified interfaces for 1D/2D/3D forms, while still offering rank-specific interfaces.
* `torch < 2.7` is no longer officially supported.
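
The v0.20.0 notes above say that `stride`, `kernel_size`, and `dilation` each accept either an integer or a tuple matching the token layout, and that `stride == kernel_size` yields Blocked Attention. The sketch below illustrates that convention in plain Python. `normalize_param` and `is_blocked_attention` are hypothetical helpers written for this example, not NATTEN functions, and NATTEN's own validation is likely stricter.

```python
from typing import Tuple, Union

IntOrTuple = Union[int, Tuple[int, ...]]


def normalize_param(value: IntOrTuple, rank: int, name: str = "parameter") -> Tuple[int, ...]:
    """Broadcast an int to one value per token-layout dimension.

    `rank` is the number of token-layout dimensions (1 for na1d, 2 for na2d,
    3 for na3d). Tuples must already have exactly `rank` entries.
    """
    if isinstance(value, int):
        return (value,) * rank
    value = tuple(value)
    if len(value) != rank:
        raise ValueError(f"{name} must be an int or a tuple of length {rank}")
    return value


def is_blocked_attention(kernel_size: IntOrTuple, stride: IntOrTuple, rank: int) -> bool:
    """True when stride equals kernel_size along every dimension."""
    return normalize_param(kernel_size, rank, "kernel_size") == normalize_param(stride, rank, "stride")


if __name__ == "__main__":
    # 2-D token layout: kernel_size=4 with stride=(4, 4) is the blocked case.
    print(is_blocked_attention(4, (4, 4), rank=2))
    print(is_blocked_attention((4, 4), (4, 2), rank=2))
```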
v0.17.5 release (#215)

This commit:

* Torch 2.6 support.
* Dropped support for CTK < 12.0, and torch < 2.5
* Dropped deprecated ops (`natten.functional.natten*d{qk,qkrpb,av}`)

Prior commits:

* Added support for even-sized kernels!
  * NATTEN now allows any kernel size greater than 1 in fused ops.
  * Only available in Fused NA (both the CUTLASS 2.X kernels and Flex) for now.
  * NOTE: any even sized kernel 2r will force each token to attend to r tokens on the left, itself, and r - 1 tokens on the right (in non-corner cases).
* Added Flex Attention as a backend.
  * Now you can use Flex Attention instead of FNA through NATTEN directly.
  * Just import `use_flex_attention()` from natten, call it, and enjoy potentially significant speedups on newer architectures.
  * With support for additional KV tokens.
  * NOTE: we've been observing some instabilities with Flex Attention when using torch 2.6. We'll try to raise the issue with the PyTorch team, but please proceed with caution.
* Better precision on fused ops with additional KV.

Co-authored-by: Ali Hassani <ahassanijr@gmail.com>
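
The even-sized kernel rule from the v0.17.5 notes above (a kernel of size 2r attends to r tokens on the left, the token itself, and r - 1 tokens on the right) can be illustrated with a 1-D reference in plain Python. `neighborhood_1d` is a hypothetical helper, not a NATTEN function. The handling of corner cases here (shifting the window so it stays in bounds) is an assumption, since the notes only describe non-corner cases.

```python
from typing import List


def neighborhood_1d(index: int, length: int, kernel_size: int) -> List[int]:
    """Return the token indices a query at `index` attends to (1-D, no dilation)."""
    if kernel_size < 2 or kernel_size > length:
        raise ValueError("kernel_size must be in [2, length]")
    if not 0 <= index < length:
        raise ValueError("index out of range")

    # Odd kernel 2r + 1: r tokens on each side.
    # Even kernel 2r: r tokens on the left, r - 1 on the right.
    left = kernel_size // 2
    start = index - left

    # Assumption: near the edges the window is shifted, not truncated,
    # so every query still sees exactly kernel_size tokens.
    start = max(0, min(start, length - kernel_size))
    return list(range(start, start + kernel_size))


if __name__ == "__main__":
    # Even kernel, r = 2: two tokens on the left, itself, one on the right.
    print(neighborhood_1d(5, 10, 4))
    # Odd kernel, r = 1: one token on each side.
    print(neighborhood_1d(5, 10, 3))
```

With `kernel_size=4` and a query at index 5, the window covers indices 3 through 6, which matches the rule in the notes for r = 2.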