Tags · SHI-Labs/NATTEN

v0.21.6

0.21.6 Release (#333)

Changelog:

* Fixed syntax error in natten.profiler (only affects `python < 3.12`).
* Fixed stride overflow issue (occurs with very large seqlen + heads) in Blackwell FMHA/FNA, Hopper FMHA/FNA, and TokPerm kernels.
* Blackwell FMHA / FNA improvements:
    * Deterministic backward pass added to Blackwell FMHA kernel (ported from [FBGEMM](https://github.com/pytorch/FBGEMM)).
    * Arbitrary head dimensions up to 128 are now supported, provided they are multiples of 8 for float16 and bfloat16, and multiples of 16 for fp8.
* Hopper FMHA improvements:
    * Causal mask support
    * Varlen support
* Removed extra memory op on LSE in backward pass (both FMHA and FNA)
* Improved deterministic backward pass:
    * Improved deterministic mode and torch compile support for cutlass-fmha/cutlass-fna backends:
        * `backward_kv_splits` and `backward_use_pt_reduction` are now independent of backward config, which is now strictly tile shapes. These two knobs are verified entirely inside the registered torch op, which hides some of the complexity from torch.compile and therefore doesn't trigger recompilations in varlen.
        * `backward_use_pt_reduction` controls whether we use native PyTorch ops for computing the sum of dO * O (delta), instead of the cutlass kernel. The cutlass kernel was non-deterministic, but was slightly faster. However, the torch implementation itself can sometimes behave non-deterministically for certain use cases. We therefore replaced the cutlass kernel with the kernel from Hopper and Blackwell FMHA/FNA, which is deterministic. The deterministic behavior now is to use the (replaced) cutlass kernel, and not the PyTorch implementation.
        * All knobs affecting determinism are checked against PyTorch's deterministic mode. Just set PyTorch to deterministic and NATTEN will respect that setting.
* Attention Merge backward pass now allows arbitrary number of splits.
* Improved testing utilities, logging, stability, and runtime.
* Improved error reporting and handling in libnatten.
* Added build flags for adding lineinfo and building with PTX (`NATTEN_BUILD_WITH_PTX`, `NATTEN_BUILD_WITH_LINEINFO`).
* Added runtime environment variable `NATTEN_LOG_PIPE` which allows customizing where NATTEN logs are streamed.
* `cmake` dependency removed. It is only a build dependency.
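
The head-dimension rule stated above for the Blackwell FMHA/FNA kernels can be written down as a small check. This is only a sketch of the rule as worded in the changelog: `is_supported_head_dim`, `MAX_HEAD_DIM`, and `REQUIRED_MULTIPLE` are hypothetical names, not part of NATTEN, and the real kernels may enforce further constraints that are not listed here.

```python
# Hypothetical helper, not a NATTEN API: encodes only the rule quoted in the
# release notes (head dim up to 128; multiple of 8 for float16/bfloat16,
# multiple of 16 for fp8).
MAX_HEAD_DIM = 128

# Required head-dim multiple per element type, as listed in the changelog.
REQUIRED_MULTIPLE = {"float16": 8, "bfloat16": 8, "fp8": 16}


def is_supported_head_dim(head_dim: int, dtype: str) -> bool:
    """Return True if head_dim satisfies the stated rule for dtype."""
    if dtype not in REQUIRED_MULTIPLE:
        raise ValueError(f"unknown dtype: {dtype!r}")
    multiple = REQUIRED_MULTIPLE[dtype]
    return 0 < head_dim <= MAX_HEAD_DIM and head_dim % multiple == 0
```

Under this reading, a head dimension of 72 would pass for float16 and bfloat16 but not for fp8, since 72 is a multiple of 8 but not of 16.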

Apr 14, 2026
4472b95

v0.21.5

Release/v0.21.5 (#310)

* Extended Attention (FMHA) functionality:
    * Causal mask, variable length: for now only supported in CUTLASS FMHA and Blackwell FMHA.
* `torch.compile` support added
    * All libnatten ops are now registered as torch ops, enabling full-graph compilation with NATTEN ops.
* TokPerm kernels: Moved dilation to batch instead of heads, which finally unblocks GQA/MQA.
* GQA/MQA support added for all FNA and FMHA operations.
    * CUTLASS FNA/FMHA and Hopper FNA/FMHA don't support it in the kernels natively, therefore it's implemented with graph transforms for now.
* Dedicated Token Permute kernels
    * Token Permute/Unpermute and padding operations are now implemented as their own kernels, and can be used instead of the PyTorch implementation.
* More accurate `merge_attentions` backward pass
    * Limits number of outputs that can be merged to only 2 when `requires_grad=True`.
* Misc bug fixes
* Wheels for torch 2.10, python 3.14
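
For readers unfamiliar with merging attention outputs, the idea behind an operation like `merge_attentions` can be illustrated with the standard logsumexp-based combination of two partial softmax attention results. The sketch below is plain Python for a single query with scalar values. `attention` and `merge_two` are hypothetical names, and it is an assumption that NATTEN's op follows this same formulation.

```python
import math


def attention(scores, values):
    """Softmax attention for one query: returns (output, logsumexp)."""
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    out = sum(w * v for w, v in zip(weights, values)) / total
    return out, m + math.log(total)


def merge_two(out_a, lse_a, out_b, lse_b):
    """Combine two partial results into the result over the union of keys."""
    m = max(lse_a, lse_b)
    w_a = math.exp(lse_a - m)  # relative softmax mass of the first partition
    w_b = math.exp(lse_b - m)  # relative softmax mass of the second partition
    total = w_a + w_b
    merged = (w_a * out_a + w_b * out_b) / total
    return merged, m + math.log(total)
```

Attending to two disjoint key sets separately and merging the results this way gives the same output and logsumexp as attending to all keys at once, which is why each partial result must carry its logsumexp.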

Feb 8, 2026
f99e014

v0.21.1

Release 0.21.1 (#276)

Oct 26, 2025
6ef26ef

v0.21.0

Release v0.21.0 (#245)

* Allow head dims 8 and 16 in Flex backend (without compilation only).
* Minor bug fixes in Flex wrapper / backend.
* Add support for SM120, Blackwell RTX (at this time only [CUTLASS FNA/FMHA](https://natten.org/backends/#cutlass-fna-fmha) and [Flex FNA/FMHA](https://natten.org/backends/#flex-fna-fmha) backends are supported).
* Minor bug fixes in Hopper/Blackwell FMHA.
* Removed fused additional KV from Blackwell FNA forward: no improvement in perf, complicates dilation, and difficult to maintain.
* **Add backward pass kernels to Hopper/Blackwell FMHA/FNA**: Both backends now support backprop, with the full set of features available in NATTEN. Expected performance gains compared to running CUTLASS (Ampere) FNA/FMHA are up to 10X (op-level). Hopper FMHA/FNA achieves more than 50% of FAv3 backward performance, while Blackwell FMHA/FNA can in some cases even outperform cuDNN backward.

Jul 14, 2025
506816a

v0.20.1

0.20.1 release (#235)

Jun 14, 2025
49af5fa

v0.20.0

Release/v0.20.0 (#226)

This release includes the new kernels and features discussed in
[Generalized Neighborhood Attention](https://arxiv.org/abs/2504.16922),
and more.

* New documentation and website: [natten.org](https://natten.org/)
* New kernels: [Hopper FNA](https://natten.org/backends/#hopper-fna-fmha), [Blackwell FNA](https://natten.org/backends/#blackwell-fna-fmha), and [Flex FNA](https://natten.org/backends/#flex-fna-fmha) with multi-dimensional tiling.
    * Optional compilation of Flex Attention, and [guards](https://natten.org/context) against it due to instability. Read more [here](https://natten.org/context/#flex-attention-torchcompile).
* Add support for [Strided Neighborhood Attention](https://arxiv.org/abs/2504.16922):
    * You can now implement Neighborhood Attention with a delay step in the sliding window.
    * This feature can implement many sparse attention patterns, such as [HaloNet](https://arxiv.org/abs/2103.12731), Blocked Attention, and combinations in between.
    * All fused neighborhood attention APIs in NATTEN (`na1d`, `na2d`, `na3d`), and torch modules (`NeighborhoodAttention{1,2,3}d`) now allow a `stride` argument, which has identical profile to `kernel_size` and `dilation`: either an integer, or a tuple with the same profile as the token layout.
    * This feature is supported by all of our existing and new backends.
    * When `stride == kernel_size`, the operation will implement Blocked Attention (a.k.a. Window Self Attention in [Swin Transformer](https://arxiv.org/abs/2103.14030)).
* [Profiling toolkit](https://natten.org/profiler) now ships with NATTEN: `python -m natten.profiler`.
* Removed Autotuner, in favor of eventually replacing it with the profiling toolkit, [NATTEN Simulator](https://arxiv.org/abs/2504.16922), and:
    * Direct exposure of kernel configurations.
    * Interfaces for finding valid configurations for your use case.
    * [Profiler dry runs](https://natten.org/profiler/#dry-run) can also help you navigate available backends, and their configurations that are suitable for your use case.
    * [Profiler optimize mode](https://natten.org/profiler/#optimize) can search through their configurations, and find the fastest one for your use case.
* Dropped unfused / CPU backends in libnatten.
    * `na{1,2,3}d_{qk,qkrpb,av}` APIs and respective backends have been dropped. It was difficult to continue maintaining them, as they mostly ran with very outdated and naive kernels, and the exceptions to that were not at all flexible with respect to user parameters. Moving forward, we will only provide [Fused Neighborhood Attention](https://arxiv.org/abs/2403.04690) kernels, but unfused kernels may be revisited depending on demand and use case.
    * CPU implementations were all unfused, and were very limited as well, and are likewise removed.
    * Our new [Flex FNA](https://natten.org/backends/#flex-fna-fmha) backend will serve as the default option for non-NVIDIA GPU users.
* Dropped support for RPB.
* Dropped support for experimental torch ops.
* Massively improved error messages, type checking.
* Considerable refactor of libnatten, and reduced binary size.
* Unified interfaces for 1D/2D/3D forms, while still offering rank-specific interfaces.
* `torch < 2.7` is no longer officially supported.
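
The notes above say that `stride` takes the same form as `kernel_size` and `dilation` (an integer, or a tuple matching the token layout), and that `stride == kernel_size` gives Blocked Attention. A minimal sketch of that convention follows. `normalize` and `is_blocked_attention` are hypothetical helpers written for illustration, not NATTEN functions, and NATTEN's own validation may differ.

```python
from typing import Sequence, Tuple, Union

IntOrTuple = Union[int, Sequence[int]]


def normalize(value: IntOrTuple, rank: int, name: str) -> Tuple[int, ...]:
    """Expand an int to a rank-length tuple, or check a tuple's length."""
    if isinstance(value, int):
        return (value,) * rank
    result = tuple(value)
    if len(result) != rank:
        raise ValueError(f"{name} must have {rank} entries, got {len(result)}")
    return result


def is_blocked_attention(
    kernel_size: IntOrTuple, stride: IntOrTuple, rank: int
) -> bool:
    """True when stride equals kernel_size along every token-layout dimension."""
    return normalize(kernel_size, rank, "kernel_size") == normalize(
        stride, rank, "stride"
    )
```

For a 2D token layout, `kernel_size=(8, 8)` with `stride=8` would count as the blocked case under this reading, while `stride=1` would not.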

Jun 7, 2025
fcf36f9

v0.17.5

v0.17.5 release (#215)

This commit:

* Torch 2.6 support.
* Dropped support for CTK < 12.0, and torch < 2.5
* Dropped deprecated ops (natten.functional.natten*d{qk,qkrpb,av})

Prior commits:

* Added support for even-sized kernels!
    * NATTEN now allows any kernel size greater than 1 in fused ops.
    * Only available in Fused NA (both the CUTLASS 2.X kernels and Flex) for now.
    * NOTE: any even-sized kernel 2r will force each token to attend to r tokens on the left, itself, and r - 1 tokens on the right (in non-corner cases).
* Added Flex Attention as a backend.
    * Now you can use Flex Attention instead of FNA through NATTEN directly.
    * Just import `use_flex_attention()` from natten, call it, and enjoy potentially significant speedups on newer architectures.
    * With support for additional KV tokens.
    * NOTE: we've been observing some instabilities with Flex Attention when using torch 2.6. We'll try to raise the issue with the PyTorch team, but please proceed with caution.
* Better precision on fused ops with additional KV.
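
The even-sized kernel note above can be made concrete with a small 1D sketch: for a kernel of size 2r, a token away from the boundaries attends to r tokens on its left, itself, and r - 1 tokens on its right. `neighborhood` is a hypothetical illustration, not a NATTEN function. It covers only the non-corner case, and it assumes the usual symmetric window for odd sizes.

```python
from typing import List


def neighborhood(index: int, kernel_size: int) -> List[int]:
    """Indices attended to by token `index` in 1D, ignoring boundaries.

    Assumes the token is far enough from both ends that no clamping is
    needed (the "non-corner" case in the release notes).
    """
    if kernel_size <= 1:
        raise ValueError("kernel_size must be greater than 1")
    left = kernel_size // 2  # r tokens on the left for size 2r or 2r + 1
    start = index - left
    return list(range(start, start + kernel_size))
```

With `kernel_size=4` (r = 2), token 10 attends to tokens 8 through 11: two on the left, itself, and one on the right.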

---------

Co-authored-by: Ali Hassani <ahassanijr@gmail.com>

Mar 17, 2025
4831f05

v0.17.4

Release 0.17.4 (#196)

Refer to the changelog for details.

Jan 29, 2025
c198e15

v0.17.3

0.17.3 Release (#178)

Replaces 0.17.2, because 0.17.2 is broken on torch < 2.4. (We really
need automated testing.)

Fixes #177.

Nov 1, 2024
4ff5306

v0.17.1

Fix python 3.8 and 3.9 releases (#129)

Fixes #128.

May 19, 2024
15db931
