AITER v0.1.15.post1

Hotfix release to unblock downstream vLLM partners. Adds 4 cherry-picked fixes on top of v0.1.15, plus 5 prerequisite commits so that they apply cleanly.

v0.1.15.post1 inherits everything in v0.1.15 (full list in the v0.1.15 release notes), including PR #3304, the mla fp8 qh32 seqlen=1 persistent kernel for gfx950 (required for DSv3.2 with --kv-cache-dtype fp8_e4m3). The cherry-pick lives at commit 6415d586 on the release/v0.1.15 branch (added during v0.1.15-rc0); the commit title matches PR #3304 verbatim.

Cherry-picks (4 fixes Kenny Roche requested for vLLM unblock)

PR	Title	Fixes
#3540	Rebuild 32x384 kernel from new sources	MiniMax M2.5 OOB access in fmoe (#3471)
#3428	add MX_FP4_A8 tuned configs and dispatch for moe_gemm_a8w4	gpt-oss W4A8 regressions + crashes
#3492	Enabled stride-aware KV-cache block dim for non-contiguous layouts (fused_qk_norm_rope_cache_pts_quant_shuffle)	Qwen fusion non-contiguous KV-cache
#3546	[Triton][Gluon] fused_qk_rope_cat_and_cache_mla new grid layout	MLA perf + vLLM unit-test fix

Prerequisite commits (required for the above to cherry-pick cleanly): #3372 (LDS-aware num_stages), #3159 (hip kl refactor), #3358 (partial rope), #2888 (FP4 GFX12 support), GFX12 import fix.

Validation (GSM8K 3-shot, flexible-extract, mi355-gpu-15)

Model	v0.1.15.post1	v0.1.15	Threshold	Result
MiniMax-M2.5 (TP=2, fp8 KV)	0.9378	0.9340	0.92	PASS ↑
DeepSeek-R1-0528 (TP=8, fp8 KV)	0.9515	0.9431	0.94	PASS ↑
Qwen3-235B-A22B-FP8 (TP=8, fp8 KV)	0.8756	0.8795	0.87	PASS (within noise band)
GLM-5-FP8 (TP=8, fp8 KV)	0.9454	0.9431	0.93	PASS ↑
Kimi-K2.5-MXFP4 (TP=4, fp8 KV)	0.9363	0.9340	0.92	PASS ↑

5/5 PASS. 4/5 improved on the v0.1.15 baseline; Qwen3 is 0.0039 lower, which is within the noise band.

Related vLLM PRs (downstream)

For the full gpt-oss + Qwen path, you still need the vLLM-side companion PRs:

vLLM #44893 — Pass GateMode.INTERLEAVE for MXFP4 W4A16 fused MoE (Rohan138)
vLLM #44804 — Hybrid CDNA4 swizzle gate for A8W4 MoE (xiaohuguo2023)
vLLM intermediate_pad TP-aware fix (Rohan138)

Wheel Matrix

ROCm 7.0 / 7.1 / 7.2 × Python 3.10 / 3.12, manylinux_2_28 ABI. torch 2.10 (rocm7.0/7.1) / torch 2.11 (rocm7.2). Fat binary gfx942 + gfx950.

Install

pip install \
  --extra-index-url https://rocm.frameworks-devreleases.amd.com/whl-staging/gfx942-gfx950/ \
  https://github.com/ROCm/aiter/releases/download/v0.1.15.post1/<wheel-filename>

Partner deps same as v0.1.15: flydsl==0.1.9.dev599, triton>=3.6.0.

Known Issues

pip 26.0.1 rejects the wheel filename with "wrong number of parts". Workaround: rename the wheel to drop the .manylinux.2.28 infix before installing. A fix is planned for v0.1.16.

Feedback

Bug reports: https://github.com/ROCm/aiter/issues — tag v0.1.15.post1
Direct: peng.sun@amd.com

AITER v0.1.14.post1

Patch release on top of v0.1.14 for vLLM downstream bump. Adds one cherry-pick: PR #3304 (mla fp8 qh32 seqlen=1 persistent kernel support on gfx950 — DSv3.2 FP8 MLA decode path).

Plus two CI infra commits to enable manylinux_2_28 wheel rebuild on current builders.

What's in it (delta vs v0.1.14)

0f3c58e6e  ci: pull latest install_triton.sh + aiter-release.yaml from main
76d80cd3f  mla: add fp8 qh32 seqlen=1 persistent kernel support on gfx950 (#3304)
[v0.1.14 baseline at bd0534e96]

Validation

DeepSeek-R1-0528 (TP=8, kv_cache_dtype=fp8) on MI355X (gfx950): GSM8K 3-shot flexible-extract = 0.9439 (threshold 0.94, PASS).
All 6 wheels installable + import aiter validated on rocm/atom torch 2.10 ABI container.
vLLM downstream: ABI compatible with current rocm/vllm-dev:nightly torch 2.10 path (verified by build matrix torch_pin=2.10 for rocm7.0/7.1, torch_pin=2.11 for rocm7.2).

Wheel Matrix

6 wheels for ROCm 7.0 / 7.1 / 7.2 × Python 3.10 / 3.12, manylinux_2_28 ABI. Fat binary covers gfx942 (MI300/MI325X) + gfx950 (MI350/MI355X).

ROCm	Python	torch ABI	Size
7.0	3.10	2.10	~470 MB
7.0	3.12	2.10	~470 MB
7.1	3.10	2.10	~460 MB
7.1	3.12	2.10	~460 MB
7.2	3.10	2.11	~455 MB
7.2	3.12	2.11	~455 MB

Install

pip install https://github.com/ROCm/aiter/releases/download/v0.1.14.post1/<wheel-filename>

Known Issues

pip 26.0.1+ wheel filename parser

pip 26.0.1 (and possibly other 26.x versions) rejects this wheel with Invalid wheel filename (wrong number of parts): 'post1'. The combination of .post1 in the public version and .manylinux.2.28 in the local version segment confuses pip's wheel filename parser (the filename convention comes from PEP 427).

Workaround: download the wheel, rename to strip the +rocm7.X.manylinux.2.28 local segment, then install:

wget https://github.com/ROCm/aiter/releases/download/v0.1.14.post1/amd_aiter-0.1.14.post1+rocm7.1.manylinux.2.28-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl
mv amd_aiter-0.1.14.post1+rocm7.1.manylinux.2.28-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl \
   amd_aiter-0.1.14.post1-cp312-cp312-manylinux_2_28_x86_64.whl
pip install ./amd_aiter-0.1.14.post1-cp312-cp312-manylinux_2_28_x86_64.whl

Older pip (<= 25.x) installs the wheel directly from the URL without the rename.
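The rename can also be scripted. The following is a minimal sketch, not part of AITER: the helper name strip_local_version is hypothetical, and it only drops the +local part of the version field while leaving the platform tags untouched (the mv command in the workaround additionally shortens the platform tag). It has not been checked against pip 26.x.

```python
def strip_local_version(wheel_name):
    """Return wheel_name with the '+local' part of its version field removed.

    Assumes the standard wheel layout:
    name-version[-build]-python-abi-platform.whl
    """
    suffix = ".whl"
    if not wheel_name.endswith(suffix):
        raise ValueError(f"not a wheel filename: {wheel_name!r}")
    parts = wheel_name[: -len(suffix)].split("-")
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected number of fields in {wheel_name!r}")
    # Field 1 is the version; everything after '+' is the local segment.
    parts[1] = parts[1].split("+", 1)[0]
    return "-".join(parts) + suffix


original = (
    "amd_aiter-0.1.14.post1+rocm7.1.manylinux.2.28-cp312-cp312-"
    "manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl"
)
print(strip_local_version(original))
```

Renaming the downloaded file to the printed name (for example with os.rename) and then running pip install on it mirrors the manual steps.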

gpt-oss accuracy

Per Doug Lehr's note (ROCM-25517), gpt-oss accuracy on v0.1.14 was below standard. This is not fixed in v0.1.14.post1 — only #3304 was cherry-picked per Richard Li's request to unblock the vLLM bump. gpt-oss fix is targeted for v0.1.15.

#3001 not included

Richard's request was "#3001 if not already on the 0.1.14 line". #3001 was not on the v0.1.14 line. We evaluated cherry-picking it and found that it depends on a 7-PR chain, including a 1210-line tuner refactor (#3220). Bringing in the full chain would put the release window at risk and change far more than a post1 patch should. #3001 is therefore skipped in post1 and will land in v0.1.15.

Acknowledgments

Richard Li (vLLM team) — surfaced the DSv3.2 FP8 MLA gfx950 blocker and the minimum cherry-pick set
Alexios Lyrakis — author of the #3304 mla kernel

AITER v0.1.15

Bi-weekly release, paired with ATOM v0.1.4 (first AITER+ATOM paired release in the bi-weekly cadence pilot — see cadence proposal).

Same commit as v0.1.15-rc0 (8ddfc7510) — zero delta after 6-day RC soak with no partner issues filed. Release branch: release/v0.1.15.

Highlights since v0.1.14 (80 commits on main + 1 cherry-pick)

DSv4-Pro / V4-Flash kernels — fused compress attention (#3357), sparse prefill OPUS (#3225), fp8_mqa_logits re-add, indexer_qk_rope_quant_and_cache non-contiguous k support (#3301), DSv4 padding fix (#3184), DSv4 bf16 + fp8 a8w8 blockscale tunes (#3284 #3339 #3394)
MoE — fused dynamic MXFP8 quant + moe_sort HIP path (#3312), drop a_scale_one for fp8 stage1 + remove fp8 fuse_quant bypass (#3367), optimised prefill mxfp8 quant moe sort (#3398), LDS-aware num_stages selection for gfx950 (#3372), GLUON a8w4 optimisations (#3317)
FlyDSL — pin bumped to 0.1.9.dev599, fused qk_norm_rope_quant for DSv4-Pro decode (#3320), fused_compress_attn for V4-Pro/V4-Flash (#3357), dynamic layout fix (#3373)
Triton — tl.dot(..., acc=...) accumulator form (#3231), split-k common reduce (#3230), MoE gfx1250 optimisations (#3293), MoE routing support for expert_map (#3348), splitk deadlock fix (#3288)
mla — fp8 qh32 seqlen=1 persistent kernel for gfx950 (#3304, cherry-picked)
mhc_post / mhc_pre — fused rmsnorm (#3396), split-k acc_sq mask fix (#3278)
OPUS — bf16 gemm support (#2945), pa_sparse_prefill_opus (#3225), mono version m align assert fix (#3382), unroll loop + scale mfma update (#3329), synchronous fallback _async_load (#3336), CDNA-only v_pk_mul_f32 ASM guards (#3322 #3356)
gfx1200/1201 RDNA4 — FP8 dtype map (#3332), Gluon MoE optimisations (#3317)
DP-attention — CUDAGraph capture compatibility fix (#3375)
CK — submodule pin fix after CK re-sync with rocm-libraries (#3387)
CI/build infra — install_triton.sh pipefail bug fix, workflow installs flydsl from AMD mirror at build time

Validation (GSM8K 3-shot, flexible-extract, mi355-gpu-15)

Model	Score	Threshold	Result
DeepSeek-R1-0528 (TP=8, fp8 KV)	0.9431	0.94	PASS
MiniMax-M2.5 (TP=2, fp8 KV)	0.9340	0.92	PASS
Qwen3-235B-A22B-FP8 (TP=8, fp8 KV)	0.8795	0.87	PASS
GLM-5-FP8 (TP=8, fp8 KV)	0.9431	0.93	PASS
Kimi-K2.5-MXFP4 (TP=4, fp8 KV)	0.9340	0.93	PASS

5/5 PASS. Qwen3-235B-A22B passes cleanly for the first time on this base.

Wheel Matrix (6 wheels)

ROCm 7.0 / 7.1 / 7.2 × Python 3.10 / 3.12, manylinux_2_28 ABI (glibc 2.28+). Fat binary covers gfx942 (MI300/MI325X) + gfx950 (MI350/MI355X).

ROCm	Python	torch ABI	Size
7.0	3.10	2.10	466 MB
7.0	3.12	2.10	467 MB
7.1	3.10	2.10	459 MB
7.1	3.12	2.10	459 MB
7.2	3.10	2.11	452 MB
7.2	3.12	2.11	453 MB

Install

pip install \
  --extra-index-url https://rocm.frameworks-devreleases.amd.com/whl-staging/gfx942-gfx950/ \
  https://github.com/ROCm/aiter/releases/download/v0.1.15/<wheel-filename>

The --extra-index-url is required — see "Partner dependencies" below.

Partner dependencies (READ BEFORE INSTALLING)

1. `flydsl==0.1.9.dev599` (REQUIRED)

setup.py calls start_aot() which imports aiter.aot.flydsl.gemm at build time. Runtime aiter import also requires this exact version. Available only from the AMD nightlies mirror:

https://rocm.frameworks-devreleases.amd.com/whl-staging/gfx942-gfx950/

Always pass --extra-index-url <above> to pip. Without it, the ROCm/HIP JIT is silently disabled.

2. `triton>=3.6.0` (REQUIRED)

aiter/__init__.py enforces triton>=3.6.0 for the new Gluon kernels. Use the paired ATOM v0.1.4 container, which ships triton 3.6.0, or run the following before installing the aiter wheel:

pip install --force-reinstall triton==3.6.0

Paired container

rocm/atom:rocm7.2.4_ubuntu24.04_py3.12_pytorch_release_2.10.0_atom0.1.4 ships this AITER wheel pre-installed with matching triton + flydsl. Recommended pin point for partners.

Known Issues

rocm7.2 wheel built against torch 2.11 ABI. For deployments still on torch 2.10 ATOM containers, install the rocm7.1 wheel which uses torch 2.10 ABI (validated PASS on all 5 models).
pip 26.0.1 "wrong number of parts in filename": workaround — download the wheel and rename to drop the .manylinux.2.28 infix from the version segment before pip install. Tracked for v0.1.16.
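To make the torch ABI note concrete, here is a small sketch that maps an installed torch version to the ROCm wheel variants built against the same ABI. The mapping is copied from the wheel matrix for this release; the function and constant names are hypothetical and nothing here is an AITER API.

```python
# torch ABI per ROCm wheel variant, copied from the v0.1.15 wheel matrix.
TORCH_ABI_BY_ROCM = {"7.0": "2.10", "7.1": "2.10", "7.2": "2.11"}


def rocm_wheels_for_torch(torch_version):
    """Return the ROCm wheel variants whose torch ABI matches torch_version."""
    # Drop any '+local' suffix, then keep only major.minor.
    public = torch_version.split("+", 1)[0]
    major_minor = ".".join(public.split(".")[:2])
    return [rocm for rocm, abi in TORCH_ABI_BY_ROCM.items() if abi == major_minor]


print(rocm_wheels_for_torch("2.10.0+rocm7.1"))
print(rocm_wheels_for_torch("2.11.0"))
```

An empty result means none of the six wheels was built against that torch version. In a real environment the input would typically come from torch.__version__.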

Feedback

Bug reports: https://github.com/ROCm/aiter/issues — tag v0.1.15
Direct: peng.sun@amd.com

AITER v0.1.15-rc0

First release candidate for v0.1.15. Cut from main@e3940660b ("Add add mhc_pre fused rmsnorm (#3396)", 2026-05-28), with one cherry-pick (#3304 mla fp8 qh32 seqlen=1 persistent kernel for gfx950) and two CI fixes on top.

Release branch: release/v0.1.15 at 8ddfc7510.

Highlights since v0.1.14 (80 commits on main + 1 cherry-pick)

DSv4-Pro / V4-Flash kernels — fused compress attention (#3357), sparse prefill OPUS (#3225), fp8_mqa_logits re-add, indexer_qk_rope_quant_and_cache non-contiguous k support (#3301), DSv4 padding fix (#3184), DSv4 bf16 + fp8 a8w8 blockscale tunes (#3284 #3339 #3394)
MoE — fused dynamic MXFP8 quant + moe_sort HIP path (#3312), drop a_scale_one for fp8 stage1 + remove fp8 fuse_quant bypass (#3367), optimised prefill mxfp8 quant moe sort (#3398), LDS-aware num_stages selection for gfx950 (#3372), GLUON a8w4 optimisations (#3317)
FlyDSL — pin bumped to 0.1.9.dev599, fused qk_norm_rope_quant for DSv4-Pro decode (#3320), fused_compress_attn for V4-Pro/V4-Flash (#3357), dynamic layout fix (#3373)
Triton — tl.dot(..., acc=...) accumulator form (#3231), split-k common reduce (#3230), MoE gfx1250 optimisations (#3293), MoE routing support for expert_map (#3348), splitk deadlock fix (#3288)
mla — fp8 qh32 seqlen=1 persistent kernel for gfx950 (#3304, cherry-picked)
mhc_post / mhc_pre — fused rmsnorm (#3396), split-k acc_sq mask fix (#3278)
OPUS — bf16 gemm support (#2945), pa_sparse_prefill_opus (#3225), mono version m align assert fix (#3382), unroll loop + scale mfma update (#3329), synchronous fallback _async_load (#3336), CDNA-only v_pk_mul_f32 ASM guards (#3322 #3356)
gfx1200/1201 RDNA4 — FP8 dtype map (#3332), Gluon MoE optimisations (#3317)
DP-attention — CUDAGraph capture compatibility fix (#3375)
CK — submodule pin fix after CK re-sync with rocm-libraries (#3387)
CI/build infra — install_triton.sh pipefail bug fix, workflow installs flydsl from AMD mirror at build time (this RC)

Validation (GSM8K 3-shot, flexible-extract, mi355-gpu-15, validated on `e3940660b` base before cherry-pick)

Model	Score	Threshold	Result
DeepSeek-R1-0528 (TP=8, fp8 KV)	0.9431	0.94	PASS
MiniMax-M2.5 (TP=2, fp8 KV)	0.9340	0.92	PASS
Qwen3-235B-A22B-FP8 (TP=8, fp8 KV)	0.8795	0.87	PASS ✨
GLM-5-FP8 (TP=8, fp8 KV)	0.9431	0.93	PASS
Kimi-K2.5-MXFP4 (TP=4, fp8 KV)	0.9340	0.93	PASS

5/5 PASS. Qwen3-235B-A22B passes cleanly for the first time on this base (was borderline 0.8696 / 0.8650 on v0.1.14 / Option A candidate). Likely due to #3398 prefill mxfp8 moe sort + #3396 mhc_pre fused rmsnorm.

Wheel Matrix (6 wheels)

ROCm 7.0 / 7.1 / 7.2 × Python 3.10 / 3.12, manylinux_2_28 ABI (glibc 2.28+). Fat binary covers gfx942 (MI300/MI325X) + gfx950 (MI350/MI355X).

ROCm	Python	torch ABI	Size
7.0	3.10	2.10	466 MB
7.0	3.12	2.10	467 MB
7.1	3.10	2.10	459 MB
7.1	3.12	2.10	459 MB
7.2	3.10	2.11	452 MB
7.2	3.12	2.11	453 MB

Install

pip install \
  --extra-index-url https://rocm.frameworks-devreleases.amd.com/whl-staging/gfx942-gfx950/ \
  https://github.com/ROCm/aiter/releases/download/v0.1.15-rc0/<wheel-filename>

The --extra-index-url is required at this RC — see "Partner dependencies" below.

Partner dependencies (READ BEFORE INSTALLING)

v0.1.15 introduces two hard runtime/build dependencies that are not on public PyPI:

1. `flydsl==0.1.9.dev599` (REQUIRED)

setup.py calls start_aot() which imports aiter.aot.flydsl.gemm at build time. Runtime aiter import also requires this exact version. Available only from the AMD nightlies mirror:

https://rocm.frameworks-devreleases.amd.com/whl-staging/gfx942-gfx950/

Always pass --extra-index-url <above> to pip when installing the aiter wheel. Without it, the ROCm/HIP JIT is silently disabled (CK and HIP ops are gone; only Triton ops remain).

We expect flydsl 0.1.9 final release wheels on public PyPI by v0.1.15 final.

2. `triton>=3.6.0` (REQUIRED)

aiter/__init__.py enforces triton>=3.6.0 for the new Gluon kernels (#2695, #3219). Current ATOM containers (rocm/atom:rocm7.2.2_*_pytorch_release_2.10.0_atom0.1.2.post) ship triton 3.5.1 and will hit:

RuntimeError: aiter gluon kernels require triton>=3.6.0, found 3.5.1

Partner action: run this before installing the aiter wheel:

pip install --force-reinstall triton==3.6.0

Or wait for new ATOM container builds that ship triton 3.6+ by default.
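A quick pre-install check for the triton requirement might look like the sketch below. It is an illustration only: the helper names are hypothetical, the distribution name "triton" is an assumption (some stacks may package it under a different name), and pre-release suffixes such as rc1 are ignored by the simple parser.

```python
import re
from importlib.metadata import PackageNotFoundError, version

MIN_TRITON = (3, 6, 0)  # from the triton>=3.6.0 requirement in these notes


def release_tuple(ver):
    """Parse the leading 'X[.Y[.Z]]' of a version string into a 3-tuple."""
    match = re.match(r"(\d+)(?:\.(\d+))?(?:\.(\d+))?", ver)
    if match is None:
        raise ValueError(f"unparseable version: {ver!r}")
    return tuple(int(group or 0) for group in match.groups())


def triton_is_new_enough(ver):
    """True if ver satisfies the minimum triton release."""
    return release_tuple(ver) >= MIN_TRITON


if __name__ == "__main__":
    try:
        found = version("triton")  # assumed distribution name
    except PackageNotFoundError:
        print("triton is not installed")
    else:
        status = "ok" if triton_is_new_enough(found) else "too old"
        print(f"triton {found}: {status}")
```

Running it inside the target container before installing the wheel shows whether the pip install --force-reinstall step is needed.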

Known Issues

Kimi-K2.5-MXFP4 end-to-end requires ATOM with PR #670 (kwargs upgrade for aiter.fused_qk_rmsnorm). ATOM nightly tags from 2026-05-14 onward include this; older ATOM containers will hit AttributeError: 'float' object has no attribute 'size' at the MLA path. Tracking: #3177
rocm7.2 wheel built against torch 2.11 ABI. For deployments still on torch 2.10 ATOM containers, install the rocm7.1 wheel which uses torch 2.10 ABI (validated PASS on all 5 models).

Build / release engineering notes

This RC fixes two CI bugs that blocked the first 6-wheel build attempt:

install_triton.sh silent-fail on non-Debian builders. The dpkg -l rocm-core | awk line failed under set -o pipefail when dpkg returned 1 (container has no rocm-core), and setup.py swallowed the _run_install_triton() exception in try/except, leaving the container with no triton installed after the uninstall step ran. Surfaced as AttributeError: module 'triton' has no attribute 'language'. Fix: || true after the awk pipe so the pipeline tolerates dpkg's non-zero exit.
Workflow drops flydsl from requirements.txt. That worked for v0.1.13 / v0.1.14 because their setup.py could build without flydsl (try/except wrapped start_aot). v0.1.15 unconditionally calls start_aot() during setup.py, so flydsl is now a hard build-time dep. Fix: install flydsl==0.1.9.dev599 from the AMD mirror in the build step.

Both fixes will be backported to main in a follow-up PR.
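The pipefail behaviour behind the first bug can be reproduced in isolation. This is a sketch under stated assumptions, not the actual install_triton.sh: `false` stands in for `dpkg -l rocm-core` returning 1, the awk program is a placeholder, and the demo only runs where bash is available.

```python
import shutil
import subprocess

HAVE_BASH = shutil.which("bash") is not None


def pipeline_status(script):
    """Run a script with bash and return its exit status."""
    return subprocess.run(["bash", "-c", script], capture_output=True).returncode


# With pipefail, a failing first stage makes the whole pipeline fail,
# even though awk itself exits 0 on empty input.
BROKEN = "set -o pipefail; false | awk '{print $3}'"
# Appending '|| true' lets the pipeline tolerate the non-zero exit.
FIXED = "set -o pipefail; false | awk '{print $3}' || true"

if HAVE_BASH:
    print("without || true:", pipeline_status(BROKEN))
    print("with    || true:", pipeline_status(FIXED))
```

The first status is non-zero and the second is zero, which is the difference between the triton reinstall aborting after the uninstall step and the script continuing.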

Feedback

Bug reports: https://github.com/ROCm/aiter/issues — tag v0.1.15-rc0
Direct: peng.sun@amd.com

Update 2026-06-05 — wheel ABI fix

All 6 wheels were rebuilt with explicit torch_pin:

ROCm 7.0 / 7.1: torch==2.10
ROCm 7.2: torch==2.11

The initial wheels in this release were built without torch_pin, so CI pulled torch 2.12+rocm7.X (latest on the public index). That created a c10::cuda::getCurrentCUDAStream ABI mismatch when partners loaded module_gemm_a8w8_blockscale_bpreshuffle.so inside ATOM containers shipping torch 2.10 (rocm/atom:rocm7.2.2_*_atom0.1.2.post). The rebuilt wheels match torch 2.10 ABI for rocm7.0/7.1 and torch 2.11 for rocm7.2.

DSR1 GSM8K re-validated on MI355X with the rocm7.1 cp312 wheel: 0.9477 (threshold 0.94, PASS).

Partners who downloaded wheels before 2026-06-05 21:00 UTC should re-download.

AITER v0.1.14

Production release of AITER v0.1.14. Cut from release/v0.1.14 at commit bd0534e96. 19 commits land in v0.1.14 vs v0.1.13.

Highlights

#3057 DSv4 fusions phase 1 — first batch of Triton/ATOM-side DSv4 fusions, the headline feature of v0.1.14.
#3163 minimax fused qknorm+allreduce — fused qknorm + allreduce kernel for MiniMax-M2.x, ~10-15% TPS improvement on prefill TP=2 / TP=4.
#3189 grid-strided loop, drop 80-token cap — follow-up to #3163 that removes the prior hard cap of 80 tokens per kernel launch.
Kimi-K2.5-MXFP4 unblocked end-to-end when paired with ATOM containing PR #670 (kwargs upgrade for aiter.fused_qk_rmsnorm).

Validation (GSM8K 3-shot, flexible-extract, mi355-gpu-15)

Model	Score	Threshold	Result
DeepSeek-R1-0528 (TP=8, fp8 KV)	0.9484	0.94	PASS
MiniMax-M2.5 (TP=2, fp8 KV)	0.9393	0.92	PASS
Qwen3-235B-A22B-FP8 (TP=8, fp8 KV)	0.8696	0.87	borderline (within GSM8K noise ±0.005)
GLM-5-FP8 (TP=8, fp8 KV)	0.9393	0.93	PASS
Kimi-K2.5-MXFP4 (TP=4, fp8 KV)	0.9348	0.93	PASS (requires ATOM with PR #670)

5/5 models pass, counting Qwen3 as a borderline pass: it is below its threshold by 0.0004, which is within the ±0.005 GSM8K 3-shot noise band and corresponds to a single-question swing.
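The single-question claim can be checked with a little arithmetic. The sketch below assumes the run used the full GSM8K test split of 1,319 questions; these notes do not state the evaluated subset size, so treat N_QUESTIONS as an assumption.

```python
import math

N_QUESTIONS = 1319  # assumption: full GSM8K test split, not stated in these notes

score, threshold = 0.8696, 0.87

# Number of correct answers implied by the reported score.
correct = round(score * N_QUESTIONS)
# Smallest number of correct answers that reaches the threshold.
needed = math.ceil(threshold * N_QUESTIONS)

print(f"correct={correct}, needed={needed}, gap={needed - correct}")
print(f"one question is worth {1 / N_QUESTIONS:.4f}")
```

Under that assumption the reported score corresponds to 1147 correct answers and the threshold to 1148, so one additional correct answer would have cleared it.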

Wheel Matrix

6 wheels for ROCm 7.0 / 7.1 / 7.2 × Python 3.10 / 3.12, manylinux_2_28 ABI (glibc 2.28+). Fat binary covers gfx942 (MI300/MI325X) + gfx950 (MI350/MI355X).

Filename	Size	torch ABI
amd_aiter-0.1.14+rocm7.0...cp310	437 MB	2.10
amd_aiter-0.1.14+rocm7.0...cp312	438 MB	2.10
amd_aiter-0.1.14+rocm7.1...cp310	431 MB	2.10
amd_aiter-0.1.14+rocm7.1...cp312	431 MB	2.10
amd_aiter-0.1.14+rocm7.2...cp310	424 MB	2.11
amd_aiter-0.1.14+rocm7.2...cp312	425 MB	2.11

Install

pip install https://github.com/ROCm/aiter/releases/download/v0.1.14/<wheel-filename>

flydsl==0.1.7 is auto-resolved from PyPI as a runtime dep. Latest flydsl==0.1.8 also works (no API drift).

Known Issues

Kimi-K2.5-MXFP4 end-to-end requires ATOM with PR #670 (kwargs upgrade for aiter.fused_qk_rmsnorm). ATOM nightly tags from 2026-05-14 onward include this; older ATOM containers will hit AttributeError: 'float' object has no attribute 'size' at the MLA path. Tracking: #3177
rocm7.2 wheel was built against torch 2.11 ABI. For deployments still on torch 2.10 ATOM containers, install the rocm7.1 wheel which uses torch 2.10 ABI (validated PASS on all 5 models).

Cumulative Changes since v0.1.13

19 commits land in v0.1.14 vs v0.1.13. Grouped by area:

DSv4 / Triton-ATOM fusions

DSV4 fusions phase 1 (#3057)
Remove triton backend in dsv4_bf16_tuned_gemm.csv (#3171)

MoE

minimax ops: fused qknorm+allreduce kernel (#3163)
[custom_all_reduce] qknorm_allreduce_fusion_kernel_2stage: grid-strided loop, drop 80-token cap (#3189)
silu_and_mul_quant + Opt silu_and_mul (#3145)

FlyDSL

FlyDSL MXFP4 rounding alignment (#3153)
FlyDSL GDR decode kernel optimize (#3135)
FlyDSL xcd remap v2 (#3134)
FlyDSL per-kernel parallelism + AOT pool size (#3133)

Triton

mHC-post: post-stream + res-stream mixing optimization (#2920)
Triton blockscale num_stages pipelining (#3136)
Triton s_barrier sync waves (#3132)
feat(triton/rope): fused QKV split + QK RMSNorm + RoPE + paged KV (#2902)
Triton bench_gmm.py bug fix (#3154)

Bugfixes

fix gather mem violation (#3182)
[Bugfix][Triton] Honor transpose_bm in batched_gemm_a16wfp4_ fake tensor (#3166)

qk_rmsnorm_group_quant

refactor hip kl (-30% build time) (#3137)

CK_TILE

Use Unified Workspace for FMHA BWD (#2948)
Add nhead128,1 mask=1 + nhead128,4 fold to m16x4 (#3046)

Docs / Refactor

AITER May 2026 newsletter (#3170)
refactor + unify triton/bench_fav3_sage.py scripts (#2920)

AITER v0.1.13.post1

Patch release on top of v0.1.13 for SA InferenceX evaluation. Adds Kimi a16wi4 MoE support and a splitk dispatch fix. Built against the flydsl 0.1.4.post1.dev glibc-2.28 backport from the FlyDSL team.

What's in it (delta vs v0.1.13)

3a38da399  build(deps): pin flydsl>=0.1.4.post1.dev,<0.1.5
4365ee78c  fix splitk buffer dispatch (#3050)
9e831bad9  kimi a16wi4 moe support (#2863)

Wheels

ROCm 7.0 / 7.1 / 7.2 × Python 3.10 / 3.12, manylinux_2_28 ABI. Fat binary covers gfx942 (MI300/MI325X) + gfx950 (MI350/MI355X). Built against torch==2.10 (rocm 7.0/7.1) or torch==2.11 (rocm 7.2).

ROCm	Python	torch ABI	Size
7.0	3.10	2.10	~286 MB
7.0	3.12	2.10	~286 MB
7.1	3.10	2.10	~281 MB
7.1	3.12	2.10	~282 MB
7.2	3.10	2.11	~274 MB
7.2	3.12	2.11	~275 MB

Install

The wheel pins flydsl>=0.1.4.post1.dev,<0.1.5. Where to find a matching flydsl wheel depends on the glibc version of your environment:

Ubuntu 22.04+ / glibc ≥ 2.35 (vLLM ROCm containers, RHEL 9, etc.) — pip resolves cleanly from PyPI without extra index:

pip install https://github.com/ROCm/aiter/releases/download/v0.1.13.post1/<wheel-filename>

glibc 2.28-2.34 (RHEL 8, CentOS 8, Ubuntu 20.04, manylinux_2_28 builders) — PyPI's 0.1.4.x wheels are manylinux_2_35 only. Use the AMD nightlies mirror as --extra-index-url:

pip install \
  --extra-index-url https://rocm.frameworks-devreleases.amd.com/whl-staging/gfx942-gfx950/ \
  https://github.com/ROCm/aiter/releases/download/v0.1.13.post1/<wheel-filename>

The AMD mirror hosts flydsl-0.1.4.post1.dev20260515+fdd1c1e-cp{310,312}-manylinux_2_27_x86_64.whl (built on glibc 2.28 by the FlyDSL team to support older systems; tracking ROCm/FlyDSL#527).

Validation

DeepSeek-R1-0528 (TP=8, kv_cache_dtype=fp8): GSM8K 3-shot flexible-extract = 0.9507 (threshold 0.94, PASS) on mi355-gpu-9 (gfx950).
All 6 wheels installable + import aiter validated on vllm/vllm-openai-rocm:v0.19.1 (cp312) and AITER manylinux_2_28-builder (cp310/cp312).

Known Issues

Kimi-K2.5-MXFP4 end-to-end serving still blocked by an upstream FlyDSL JIT bug (name 'gy' is not defined / 'lds_out' is not defined in mixed_moe_gemm_2stage and flydsl_moe1_* kernels). Affects any AITER caller (ATOM/vLLM/SGLang) on the v0.1.13 line — the AITER PR #2958 API rename + ATOM PR #670 kwargs upgrade are not in this branch. For Kimi MXFP4 serving today, please use v0.1.14-rc0 (https://github.com/ROCm/aiter/releases/tag/v0.1.14-rc0) with ATOM containing PR #670.

Acknowledgments

Kiran Thumma + Felix Li (FlyDSL team) — for backporting glibc 2.28 support to the v0.1.4 line and publishing the 0.1.4.post1.dev wheel within ~24h of request.

Update 2026-05-28 — Verified by Amanzhol Salykov (FlyDSL team) with latest vLLM + Kimi-K2.5 int4 on MI355X; accuracy and perf both good. Published to GA.

AITER v0.1.14-rc0

⚠️ SUPERSEDED BY v0.1.14. Please use the final release (https://github.com/ROCm/aiter/releases/tag/v0.1.14). rc0 is left for historical reference only.

AITER v0.1.13

Production release of the v0.1.13 line. Same commit as v0.1.13-rc5 (cdcfa833b) after 5 RC iterations.

Highlights

DeepSeek R1 / GPT-OSS / Kimi / GLM-5 enablement maturing on MI300X / MI325X (gfx942) and MI350 / MI355X (gfx950)
New ASM fmoe kernels for gfx950 that bypass bf16→fp8 quantization, gated by AITER_XBFLOAT16=1 env var (default off, opt-in for safety) (#2262)
Substantial MLA improvements: MI350 MLA PS mode for new shapes (#2727, #2729, #2676), MoE PS mode for nhead=8/2 on MI308 (#2852), nhead=32 non-persistent decode crash fix on gfx950 (#2983)
FMHA / paged attention: runtime dispatch for >4 GB KV cache in batch prefill (#2893), top_k_per_row prefill fix for batched_token_num > 4096 (#2901), gfx942/gfx950 PA PS kernel update with stride_scale_page write (#2796)
RDNA4 expansion: FP8 support for gfx1200/gfx1201 (#2621), FlyDSL flash_attn_func backend for gfx1201 (R9600D) — first RDNA4-class attention backend in AITER (#2969 on main, included via baseline)
Triton kernel additions: Gluon-optimized MoE Int8 SmoothQuant kernel for small K (#2441), Triton fallback for MI455 GPT-OSS / DSFP4 (#2657), GLM-5 70k+300 GEMM configs for gfx942 (#2743)
FlyDSL maturity: BF16 GEMM tuned configs added/retuned for 6 models (#2733), AOT defaults via AITER_CONFIGS (#2756), if/else compatibility across versions (#2740), updated FlyDSL version pin
Bulk silo merge — kernel fixes and tuned configs in preparation for the v0.1.13.post1 line (#3004, #3005, #3024)
Quality of life: pandas FutureWarning suppressed and pybind11 type hint mismatch fix (#2980), Linux import errors no longer swallowed (#3049), std::unordered_map replaced with SynchronizedCache for thread safety (#2221), ctypes C-ABI error bridging to prevent worker crashes during kernel build (#2498)

Validation (mi355-gpu-15, GSM8K 3-shot, flexible-extract)

Model	Score	Threshold	Result
DeepSeek-R1-0528	0.9454	0.94	PASS
MiniMax-M2.5	0.9295	0.92	PASS
Qwen3-235B-FP8	0.8802	0.87	PASS

Wheel Matrix

6 wheels for ROCm 7.0 / 7.1 / 7.2 × Python 3.10 / 3.12, manylinux_2_28 ABI. Built against torch==2.10 (rocm7.0/7.1) or torch==2.11 (rocm7.2). Fat binary covers gfx942 (MI300/MI308/MI325X) + gfx950 (MI350/MI355X).

Cumulative Changes since v0.1.12.post2

149 commits land in v0.1.13 vs v0.1.12.post2. Full list available via git log v0.1.12.post2..v0.1.13. Highlights grouped by area:

MoE / FlyDSL kernels (44 commits)

ASM fmoe kernels for gfx950 with bf16→fp8 quantization bypass (#2262)
FlyDSL A8W4 MoE update (#2726)
GPT-OSS small-M MoE optimizations (#2775)
Kimi-K2.5 MoE tuned configs revert for batch sizes 32/64 (#2836) — Kimi int4 a16wi4 MoE (#2863) deferred to v0.1.13.post1
Triton Gluon-optimized MoE Int8 SmoothQuant for small K (#2441)
MoE tuner fixes (#2831, #2785, #2723)
fused_dynamic_mxfp4_quant_moe_sort_hip added (#2620, fix #2759)
CK_TILE bpreshuffle compile failure fix (#2811)
Bulk silo merge tuned configs and kernel fixes (#3004, #3005, #3024)
moe_routing_sigmoid_top1_fused tie-breaking fix (#2750)

MLA / Multi-head Latent Attention (9+ commits)

MI350 MLA PS mode support for new shapes (nhead 128,1 / 128,2 / 128,3 / 128,4 / 64,4 / 64,2 / 32,4) via mla_a16w16_qh32_qseqlen4_gqaratio32_ps.co (#2727)
gfx950 fp8 decode native qh32 qseqlen2 MLA PS kernel (#2676) and qh64 nhead=64 native kernel (#2636)
bf16 MLA decode kernel for gqa_ratio=64, qseqlen=1 (non-persistent) (#2729)
MLA PS mode nhead 8/2 on MI308 (#2852)
MLA Reduce and Metadata kernel rewritten with OPUS template (#2717)
gfx950 nhead=32 non-persistent decode crash fix (#2983)
OPUS lib improvements for MLA: mma step_k, dword copy via set_slice and inline asm for tr_load (#2652)

FMHA / Paged Attention

Runtime dispatch for >4 GB KV cache in batch prefill (#2893)
top_k_per_row prefill fix for batched_token_num > 4096 (#2901)
gfx942/gfx950 PA PS kernel update with stride_scale_page write in asm_pa (#2796)
fmha_fwd_v3 silence false warning when use_asm_v3 is disabled (#2744)
indexer_k_quant_and_cache preshuffled layout support (#2879)
car prefill kernel error fix for SGLang (#2745)

Triton path

Gluon-optimized MoE Int8 SmoothQuant kernel for small K (#2441)
Triton MHA UT reduction (#2612)
Adapt model benchmarking scripts to new bench_mha.py CLI (#2673)
Triton fallback for MI455 GPT-OSS and DSFP4 (#2657)
GLM-5 70k+300 GEMM configs for gfx942 (#2743)
Triton MoE GEMM shared memory exhaustion fix by reducing stage count (#2723)
Drop GLM5 Triton tuned GEMM (#2803)

FlyDSL

BF16 GEMM configs added/retuned for 6 models (#2733)
AITER_CONFIGS for FlyDSL AOT defaults (#2756)
if const_expr introduction (#2776)
if/else compatibility across versions (#2740)
A8W4 MoE update (#2726)
bf16 GEMM implementation and tuned config update (#2634)
A8W8 FlyDSL tune fix (#2809)
Linear attention rebase for new FlyDSL version (#2746)

Architecture enablement

RDNA4 (gfx1200/gfx1201): FP8 support added (#2621)
MI355X (gfx950): continued maturation across MoE, MLA, FMHA paths
MI350 (gfx950): MLA PS mode coverage expanded
MI308 (gfx942): MLA PS mode nhead 8/2 (#2852), i8gemm tuning (#2590)
MI300X (gfx942): gemma rmsnorm quant fusion (#2853), gemm_a16w16 torch tune (#2860)

Quality and safety fixes

pandas FutureWarning suppression and pybind11 type hint mismatch (#2980)
Linux import errors no longer swallowed (#3049)
std::unordered_map → SynchronizedCache for thread safety in CK paths (#2221)
ctypes C-ABI error bridging to prevent worker crashes during kernel build (#2498)
fused_qk_norm_group_quant stride error check fix (#2637)
fused_dynamic_mxfp4_quant_moe_sort_hip EP fix (#2759)
fused_gemm_afp4wfp4_a16w16 LDS exhaustion fix under ASYNC_COPY (#2784)
opus.hpp build time optimization kernel template (single-header C++ template, up to 61x faster builds vs standard torch extension)

Release engineering

torch_pin + torch_index_url workflow inputs for release-build CI (#2875)
manylinux_2_28 wheel matrix standardized: ROCm 7.0/7.1/7.2 × Python...

AITER v0.1.13-rc5

Fifth release candidate for v0.1.13, focused on adding asm_fmoe kernels for gfx950 (no bf16->fp8 quantization required) while removing RC4's Kimi int4 MoE changes.

Changes vs RC4

Reverted:

kimi a16wi4 moe support (#2863) — defer to v0.1.14
fix splitk buffer dispatch (#3050) — only needed by #2863

Cherry-picked from main:

Introduce asm fmoe kernels that do not require bf16->fp8 quantization (#2262) — new gfx950-only kernels behind AITER_XBFLOAT16=1 env var (default off)
[Bugfix] Suppress pandas FutureWarning and fix pybind11 type hint mismatch (#2980)

Validation (mi355-gpu-15, GSM8K 3-shot, flexible-extract)

Model	Score	Threshold	Result
DeepSeek-R1-0528	0.9454	0.94	PASS
MiniMax-M2.5	0.9295	0.92	PASS
Qwen3-235B-FP8	0.8802	0.87	PASS

Wheel Matrix

6 wheels for ROCm 7.0 / 7.1 / 7.2 × Python 3.10 / 3.12, manylinux_2_28 ABI. All built against torch==2.10 (rocm7.0/7.1) or torch==2.11 (rocm7.2, because torch 2.10 was removed from PyTorch's rocm7.2 index).

Cherry-pick Audit Summary (PR #2262)

PR #2262 introduces a new code path that is off by default and gfx950-only:

Triple-gated: quant_type==per_1x128 + gfx950 + AITER_XBFLOAT16=1 env var
Public C++ API unchanged
New *_pybind.cu shims and pre-compiled .co HSA binaries for gfx950
Zero merge conflicts on release/v0.1.13
No follow-up correctness fixes on main

Existing MI300/MI308/MI450 deployments and unset-env MI355X deployments are unaffected.
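The triple gate described in the audit can be pictured as a single predicate. This is an illustrative sketch only: the function name, argument names, and string values are hypothetical and are not taken from the AITER source; only the three conditions themselves come from the audit summary.

```python
import os


def use_xbf16_fmoe(quant_type, gfx_arch, env=None):
    """Return True only when all three gate conditions hold."""
    env = os.environ if env is None else env
    return (
        quant_type == "per_1x128"                  # quantization mode gate
        and gfx_arch == "gfx950"                   # architecture gate
        and env.get("AITER_XBFLOAT16") == "1"      # opt-in env var, default off
    )


# Unset env var on gfx950: the existing path is kept.
print(use_xbf16_fmoe("per_1x128", "gfx950", env={}))
# All three conditions met: the new kernels would be selected.
print(use_xbf16_fmoe("per_1x128", "gfx950", env={"AITER_XBFLOAT16": "1"}))
```

Because every condition must hold, leaving the environment variable unset is enough to keep any deployment on the previous code path.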

AITER v0.1.13-rc2

Release candidate 2 for v0.1.13. Pre-release — please smoke-test downstream and report back before final tag.

Changes since rc1

5 cherry-picks onto release/v0.1.13:

Bug Fixes

#2983 — [MLA] Fix nhead=32 non-persistent decode crash on gfx950: Corrects the decode dispatch condition for MLA attention when nhead=32 (e.g., Kimi-K2.5). Without this fix, gfx950 takes the non-persistent path and crashes during decode.
#2879 — Support preshuffled layout in indexer_k_quant_and_cache: Adds preshuffled weight layout support to blockscale GEMM and KV cache indexer, fixing a blocker for DI/SA inference paths.

New Features

#3005 — [Silo] Bulk merge kernel fixes + features: Adds 5 new Triton kernels — causal_conv1d_update_single_token, fused_rearrange_sigmoid_gdr, fused_fp8_quant, pa_mqa_logits, and gated delta rule decode optimizations. Includes corresponding op tests.

Config & Tuning

#3004 — [Silo] Bulk merge tuned configs: Adds MI355X (gfx950) tuned configs for Kimi-K2, GLM-4.7, Qwen3-Next-80B across GEMM and FMoE kernels.
#3024 — [Silo] Add configs missing from bulk merge #3004: Adds 6375 MI355X GEMM tunings for DeepSeek-V3.2 + MiniMax-M2.5 FMoE tunings. Deduplicates cross-file shape collisions (best us per shape wins).
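The "best us per shape wins" rule from #3024 amounts to keeping the fastest entry for each shape. A minimal sketch follows, assuming each tuning row carries M, N, K and a measured time in microseconds under the key us; the real CSV columns and the actual dedup script are not shown in these notes, so all names here are hypothetical.

```python
def dedupe_best_us(rows):
    """Keep, for each (M, N, K) shape, the row with the lowest `us`."""
    best = {}
    for row in rows:
        key = (row["M"], row["N"], row["K"])
        if key not in best or row["us"] < best[key]["us"]:
            best[key] = row
    return list(best.values())


# Rows as they might appear after concatenating several config files.
rows = [
    {"M": 16, "N": 7168, "K": 2048, "us": 12.4, "source": "file_a.csv"},
    {"M": 16, "N": 7168, "K": 2048, "us": 11.9, "source": "file_b.csv"},
    {"M": 32, "N": 7168, "K": 2048, "us": 14.0, "source": "file_a.csv"},
]

for row in dedupe_best_us(rows):
    print(row)
```

The colliding shape keeps the file_b.csv entry because it has the lower time; shapes that appear only once pass through unchanged.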

Files changed (rc1 → rc2)

44 files, +13k / -2k lines
12 new CSV config files / updates
5 new Triton kernels + 3 new test files
2 C++ kernel files (MLA + cache)

Compatibility Matrix

Component	Requirement
Container ABI	`vllm/vllm-openai-rocm:v0.19.1` (Ubuntu 22, glibc ≤ 2.35, libstdc++ ≤ GLIBCXX_3.4.30)
PyTorch	`torch==2.10.0+rocm7.1` (matches vllm-openai-rocm:v0.19.1; wheels are ABI-pinned to this build)
GPU arch	gfx942 (MI300X / MI325X), gfx950 (MI355X)
ROCm	7.0 / 7.1 / 7.2 (pick wheel matching your runtime)
Python	3.10 / 3.12
vLLM	Recommend latest main with PR vllm-project/vllm#40754 merged.
SGLang	Recommend ≥ v0.5.10. If using CUDA-graph + custom all-reduce on MI300X / MI355X, use a base image with ROCm ≥ 7.2.1.

Breaking Changes since v0.1.12.post2

None. Same as rc1.

Known Issues

Same as rc1 — see v0.1.13-rc1 release notes for details on the HIP graph capture issue (ROCm 7.2.0 + custom all-reduce).

Wheels

6 prebuilt wheels with PREBUILD_KERNELS=1 for gfx942 (MI300X/MI325X) + gfx950 (MI355X), manylinux_2_28 ABI, torch==2.10.0+rocm7.1 pin:

ROCm	Python 3.10	Python 3.12
7.2	`amd_aiter-0.1.13rc2+rocm7.2.manylinux.2.28-cp310-...`	`amd_aiter-0.1.13rc2+rocm7.2.manylinux.2.28-cp312-...`
7.1	`...+rocm7.1.manylinux.2.28-cp310-...`	`...+rocm7.1.manylinux.2.28-cp312-...`
7.0	`...+rocm7.0.manylinux.2.28-cp310-...`	`...+rocm7.0.manylinux.2.28-cp312-...`

Validation Status

ATOM 5-model accuracy: rc1 validated (pending rc2 revalidation)
vLLM ABI smoke: pending
MLA nhead=32 decode (#2983): pending silicon verification
Perf delta vs rc1: pending

Upgrade from rc1

pip install --pre --force-reinstall <wheel-url>
python -c "import aiter; print(aiter.__version__)"  # expect: 0.1.13rc2+rocm7.X.manylinux.2.28

Tagged from release/v0.1.13 HEAD = ab62c65757c4c41cb24c14b8e925a776c6124892.
