Releases: ROCm/aiter
v0.1.15.post1
AITER v0.1.15.post1
Hotfix release for downstream vLLM partner unblock. Adds 4 cherry-picked fixes on top of v0.1.15 (plus 5 prerequisite commits to make them apply cleanly).
Inherited from v0.1.15 (full list in the v0.1.15 release notes) — including PR #3304, mla fp8 qh32 seqlen=1 persistent kernel for gfx950 (required for DSv3.2 with --kv-cache-dtype fp8_e4m3). The cherry-pick lives at commit 6415d586 on the release/v0.1.15 branch (added during v0.1.15-rc0); the commit title matches PR #3304 verbatim.
Cherry-picks (4 fixes Kenny Roche requested for vLLM unblock)
| PR | Title | Fixes |
|---|---|---|
| #3540 | Rebuild 32x384 kernel from new sources | MiniMax M2.5 OOB access in fmoe (#3471) |
| #3428 | add MX_FP4_A8 tuned configs and dispatch for moe_gemm_a8w4 | gpt-oss W4A8 regressions + crashes |
| #3492 | Enabled stride-aware KV-cache block dim for non-contiguous layouts (fused_qk_norm_rope_cache_pts_quant_shuffle) | Qwen fusion non-contiguous KV-cache |
| #3546 | [Triton][Gluon] fused_qk_rope_cat_and_cache_mla new grid layout | MLA perf + vLLM unit-test fix |
Prerequisite commits (required for the above to cherry-pick cleanly): #3372 (LDS-aware num_stages), #3159 (hip kl refactor), #3358 (partial rope), #2888 (FP4 GFX12 support), GFX12 import fix.
Validation (GSM8K 3-shot, flexible-extract, mi355-gpu-15)
| Model | v0.1.15.post1 | v0.1.15 | Threshold | Result |
|---|---|---|---|---|
| MiniMax-M2.5 (TP=2, fp8 KV) | 0.9378 | 0.9340 | 0.92 | PASS ↑ |
| DeepSeek-R1-0528 (TP=8, fp8 KV) | 0.9515 | 0.9431 | 0.94 | PASS ↑ |
| Qwen3-235B-A22B-FP8 (TP=8, fp8 KV) | 0.8756 | 0.8795 | 0.87 | PASS (within noise band) |
| GLM-5-FP8 (TP=8, fp8 KV) | 0.9454 | 0.9431 | 0.93 | PASS ↑ |
| Kimi-K2.5-MXFP4 (TP=4, fp8 KV) | 0.9363 | 0.9340 | 0.92 | PASS ↑ |
5/5 PASS. 4/5 improved or equal to v0.1.15 baseline; Qwen3 single-question noise.
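The summary line above follows from simple arithmetic on the table. A minimal sketch, with scores and thresholds copied from the table; the names `ROWS` and `summarize` are illustrative only and are not part of AITER or its CI:

```python
# (model, v0.1.15.post1 score, v0.1.15 score, threshold), copied from the table.
ROWS = [
    ("MiniMax-M2.5", 0.9378, 0.9340, 0.92),
    ("DeepSeek-R1-0528", 0.9515, 0.9431, 0.94),
    ("Qwen3-235B-A22B-FP8", 0.8756, 0.8795, 0.87),
    ("GLM-5-FP8", 0.9454, 0.9431, 0.93),
    ("Kimi-K2.5-MXFP4", 0.9363, 0.9340, 0.92),
]


def summarize(rows):
    """Return (models at or above threshold, models not below the baseline)."""
    passed = [m for m, new, _old, thr in rows if new >= thr]
    not_worse = [m for m, new, old, _thr in rows if new >= old]
    return passed, not_worse


passed, not_worse = summarize(ROWS)
print(f"{len(passed)}/{len(ROWS)} PASS, "
      f"{len(not_worse)}/{len(ROWS)} improved or equal")
```

Only the Qwen3 row scores below its v0.1.15 baseline, while still clearing its threshold.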
Related vLLM PRs (downstream)
For the full gpt-oss + Qwen path, you still need the vLLM-side companion PRs:
- vLLM #44893 — Pass GateMode.INTERLEAVE for MXFP4 W4A16 fused MoE (Rohan138)
- vLLM #44804 — Hybrid CDNA4 swizzle gate for A8W4 MoE (xiaohuguo2023)
- vLLM intermediate_pad TP-aware fix (Rohan138)
Wheel Matrix
ROCm 7.0 / 7.1 / 7.2 × Python 3.10 / 3.12, manylinux_2_28 ABI. torch 2.10 (rocm7.0/7.1) / torch 2.11 (rocm7.2). Fat binary gfx942 + gfx950.
Install
pip install \
--extra-index-url https://rocm.frameworks-devreleases.amd.com/whl-staging/gfx942-gfx950/ \
https://github.com/ROCm/aiter/releases/download/v0.1.15.post1/<wheel-filename>
Partner deps same as v0.1.15: flydsl==0.1.9.dev599, triton>=3.6.0.
Known Issues
- pip 26.0.1 wheel filename "wrong number of parts": rename the wheel to drop the .manylinux.2.28 infix before install. Will be fixed in v0.1.16.
Feedback
- Bug reports: https://github.com/ROCm/aiter/issues — tag v0.1.15.post1
- Direct: peng.sun@amd.com
AITER v0.1.14.post1
Patch release on top of v0.1.14 for vLLM downstream bump. Adds one cherry-pick: PR #3304 (mla fp8 qh32 seqlen=1 persistent kernel support on gfx950 — DSv3.2 FP8 MLA decode path).
Plus two CI infra commits to enable manylinux_2_28 wheel rebuild on current builders.
What's in it (delta vs v0.1.14)
0f3c58e6e ci: pull latest install_triton.sh + aiter-release.yaml from main
76d80cd3f mla: add fp8 qh32 seqlen=1 persistent kernel support on gfx950 (#3304)
[v0.1.14 baseline at bd0534e96]
Validation
- DeepSeek-R1-0528 (TP=8, kv_cache_dtype=fp8) on MI355X (gfx950): GSM8K 3-shot flexible-extract = 0.9439 (threshold 0.94, PASS).
- All 6 wheels installable + import aiter validated on rocm/atom torch 2.10 ABI container.
- vLLM downstream: ABI compatible with current rocm/vllm-dev:nightly torch 2.10 path (verified by build matrix torch_pin=2.10 for rocm7.0/7.1, torch_pin=2.11 for rocm7.2).
Wheel Matrix
6 wheels for ROCm 7.0 / 7.1 / 7.2 × Python 3.10 / 3.12, manylinux_2_28 ABI. Fat binary covers gfx942 (MI300/MI325X) + gfx950 (MI350/MI355X).
| ROCm | Python | torch ABI | Size |
|---|---|---|---|
| 7.0 | 3.10 | 2.10 | ~470 MB |
| 7.0 | 3.12 | 2.10 | ~470 MB |
| 7.1 | 3.10 | 2.10 | ~460 MB |
| 7.1 | 3.12 | 2.10 | ~460 MB |
| 7.2 | 3.10 | 2.11 | ~455 MB |
| 7.2 | 3.12 | 2.11 | ~455 MB |
Install
pip install https://github.com/ROCm/aiter/releases/download/v0.1.14.post1/<wheel-filename>
Known Issues
pip 26.0.1+ wheel filename parser
pip 26.0.1 (and possibly other 26.x versions) rejects this wheel with Invalid wheel filename (wrong number of parts): 'post1'. The combination of .post1 in the public version and .manylinux.2.28 in the local version segment confuses pip's PEP 491 filename parser.
Workaround: download the wheel, rename to strip the +rocm7.X.manylinux.2.28 local segment, then install:
wget https://github.com/ROCm/aiter/releases/download/v0.1.14.post1/amd_aiter-0.1.14.post1+rocm7.1.manylinux.2.28-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl
mv amd_aiter-0.1.14.post1+rocm7.1.manylinux.2.28-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl \
amd_aiter-0.1.14.post1-cp312-cp312-manylinux_2_28_x86_64.whl
pip install ./amd_aiter-0.1.14.post1-cp312-cp312-manylinux_2_28_x86_64.whl
Older pip (<= 25.x) can install the wheel directly from the URL.
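The rename in the workaround can be scripted. The following is a minimal sketch that mirrors the mv command above: it drops the +local segment from the version field and, as in the example target name, keeps only the last platform tag. The function name `strip_local_version` is hypothetical and is not part of AITER or pip.

```python
def strip_local_version(wheel_name: str) -> str:
    """Return a wheel filename without the '+local' version segment.

    Mirrors the manual rename shown above; also keeps only the last
    platform tag, because the example target name does the same.
    """
    if not wheel_name.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {wheel_name!r}")
    parts = wheel_name[: -len(".whl")].split("-")
    # Expected fields: name-version-pytag-abitag-platform (optional build tag).
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected number of fields: {wheel_name!r}")
    parts[1] = parts[1].split("+", 1)[0]   # '0.1.14.post1+rocm7.1...' -> '0.1.14.post1'
    parts[-1] = parts[-1].split(".")[-1]   # keep the last compressed platform tag
    return "-".join(parts) + ".whl"


original = (
    "amd_aiter-0.1.14.post1+rocm7.1.manylinux.2.28-cp312-cp312-"
    "manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl"
)
renamed = strip_local_version(original)
print(renamed)
```

Note that the renamed file no longer records which ROCm version it targets, so it helps to keep wheels for different ROCm versions in separate directories.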
gpt-oss accuracy
Per Doug Lehr's note (ROCM-25517), gpt-oss accuracy on v0.1.14 was below standard. This is not fixed in v0.1.14.post1 — only #3304 was cherry-picked per Richard Li's request to unblock the vLLM bump. gpt-oss fix is targeted for v0.1.15.
#3001 not included
Per Richard's request "#3001 if not already on the 0.1.14 line": #3001 was not on v0.1.14 line. We evaluated cherry-picking it and found it depends on a 7-PR chain including a 1210-line tuner refactor (#3220). Bringing the full chain risks the release window and changes far more than the post1 patch surface should. #3001 will land in v0.1.15. Skipped from post1.
Acknowledgments
- Richard Li (vLLM team) — surfaced the DSv3.2 FP8 MLA gfx950 blocker and the minimum cherry-pick set
- Alexios Lyrakis — author of the #3304 mla kernel
v0.1.15
AITER v0.1.15
Bi-weekly release, paired with ATOM v0.1.4 (first AITER+ATOM paired release in the bi-weekly cadence pilot — see cadence proposal).
Same commit as v0.1.15-rc0 (8ddfc7510) — zero delta after 6-day RC soak with no partner issues filed. Release branch: release/v0.1.15.
Highlights since v0.1.14 (80 commits on main + 1 cherry-pick)
- DSv4-Pro / V4-Flash kernels — fused compress attention (#3357), sparse prefill OPUS (#3225), fp8_mqa_logits re-add, indexer_qk_rope_quant_and_cache non-contiguous k support (#3301), DSv4 padding fix (#3184), DSv4 bf16 + fp8 a8w8 blockscale tunes (#3284 #3339 #3394)
- MoE — fused dynamic MXFP8 quant + moe_sort HIP path (#3312), drop a_scale_one for fp8 stage1 + remove fp8 fuse_quant bypass (#3367), optimised prefill mxfp8 quant moe sort (#3398), LDS-aware num_stages selection for gfx950 (#3372), GLUON a8w4 optimisations (#3317)
- FlyDSL — pin bumped to 0.1.9.dev599, fused qk_norm_rope_quant for DSv4-Pro decode (#3320), fused_compress_attn for V4-Pro/V4-Flash (#3357), dynamic layout fix (#3373)
- Triton — tl.dot(..., acc=...) accumulator form (#3231), split-k common reduce (#3230), MoE gfx1250 optimisations (#3293), MoE routing support for expert_map (#3348), splitk deadlock fix (#3288)
- mla — fp8 qh32 seqlen=1 persistent kernel for gfx950 (#3304, cherry-picked)
- mhc_post / mhc_pre — fused rmsnorm (#3396), split-k acc_sq mask fix (#3278)
- OPUS — bf16 gemm support (#2945), pa_sparse_prefill_opus (#3225), mono version m align assert fix (#3382), unroll loop + scale mfma update (#3329), synchronous fallback _async_load (#3336), CDNA-only v_pk_mul_f32 ASM guards (#3322 #3356)
- gfx1200/1201 RDNA4 — FP8 dtype map (#3332), Gluon MoE optimisations (#3317)
- DP-attention — CUDAGraph capture compatibility fix (#3375)
- CK — submodule pin fix after CK re-sync with rocm-libraries (#3387)
- CI/build infra — install_triton.sh pipefail bug fix, workflow installs flydsl from AMD mirror at build time
Validation (GSM8K 3-shot, flexible-extract, mi355-gpu-15)
| Model | Score | Threshold | Result |
|---|---|---|---|
| DeepSeek-R1-0528 (TP=8, fp8 KV) | 0.9431 | 0.94 | PASS |
| MiniMax-M2.5 (TP=2, fp8 KV) | 0.9340 | 0.92 | PASS |
| Qwen3-235B-A22B-FP8 (TP=8, fp8 KV) | 0.8795 | 0.87 | PASS |
| GLM-5-FP8 (TP=8, fp8 KV) | 0.9431 | 0.93 | PASS |
| Kimi-K2.5-MXFP4 (TP=4, fp8 KV) | 0.9340 | 0.93 | PASS |
5/5 PASS. Qwen3-235B-A22B passes cleanly for the first time on this base.
Wheel Matrix (6 wheels)
ROCm 7.0 / 7.1 / 7.2 × Python 3.10 / 3.12, manylinux_2_28 ABI (glibc 2.28+). Fat binary covers gfx942 (MI300/MI325X) + gfx950 (MI350/MI355X).
| ROCm | Python | torch ABI | Size |
|---|---|---|---|
| 7.0 | 3.10 | 2.10 | 466 MB |
| 7.0 | 3.12 | 2.10 | 467 MB |
| 7.1 | 3.10 | 2.10 | 459 MB |
| 7.1 | 3.12 | 2.10 | 459 MB |
| 7.2 | 3.10 | 2.11 | 452 MB |
| 7.2 | 3.12 | 2.11 | 453 MB |
Install
pip install \
--extra-index-url https://rocm.frameworks-devreleases.amd.com/whl-staging/gfx942-gfx950/ \
https://github.com/ROCm/aiter/releases/download/v0.1.15/<wheel-filename>
The --extra-index-url is required — see "Partner dependencies" below.
Partner dependencies (READ BEFORE INSTALLING)
1. flydsl==0.1.9.dev599 (REQUIRED)
setup.py calls start_aot() which imports aiter.aot.flydsl.gemm at build time. Runtime aiter import also requires this exact version. Available only from the AMD nightlies mirror:
https://rocm.frameworks-devreleases.amd.com/whl-staging/gfx942-gfx950/
Always pass --extra-index-url <above> to pip. Without it, the ROCm/HIP JIT is silently disabled.
2. triton>=3.6.0 (REQUIRED)
aiter/__init__.py enforces triton>=3.6.0 for the new Gluon kernels. Use the paired ATOM v0.1.4 container, which ships triton 3.6.0, or run the following before installing the aiter wheel:
pip install --force-reinstall triton==3.6.0
Paired container
rocm/atom:rocm7.2.4_ubuntu24.04_py3.12_pytorch_release_2.10.0_atom0.1.4 ships this AITER wheel pre-installed with matching triton + flydsl. Recommended pin point for partners.
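Before installing the wheel into an existing environment, the two pins from "Partner dependencies" can be checked up front. This is a hedged sketch, not AITER's own check: the helper names are hypothetical, and it assumes the packages are installed under the distribution names flydsl and triton (some ROCm environments may ship triton under a different distribution name).

```python
import re
from importlib import metadata

REQUIRED_FLYDSL = "0.1.9.dev599"  # exact pin stated in these notes
MIN_TRITON = (3, 6, 0)            # triton>=3.6.0


def release_tuple(version: str) -> tuple:
    """Leading numeric components, padded to 3: '3.6' -> (3, 6, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unparseable version: {version!r}")
    parts = tuple(int(p) for p in match.group(0).split("."))
    return parts + (0,) * max(0, 3 - len(parts))


def check_partner_deps(installed: dict) -> list:
    """Return human-readable problems; an empty list means both pins are met."""
    problems = []
    flydsl = installed.get("flydsl")
    if flydsl != REQUIRED_FLYDSL:
        problems.append(f"flydsl=={REQUIRED_FLYDSL} required, found {flydsl}")
    triton = installed.get("triton")
    if triton is None or release_tuple(triton) < MIN_TRITON:
        problems.append(f"triton>=3.6.0 required, found {triton}")
    return problems


def installed_versions(names=("flydsl", "triton")) -> dict:
    """Installed distribution versions; missing packages map to None."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for problem in check_partner_deps(installed_versions()):
        print("WARNING:", problem)
```

The comparison uses only the numeric release part of the triton version, so local suffixes such as +rocm builds are ignored.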
Known Issues
- rocm7.2 wheel built against torch 2.11 ABI. For deployments still on torch 2.10 ATOM containers, install the rocm7.1 wheel which uses torch 2.10 ABI (validated PASS on all 5 models).
- pip 26.0.1 "wrong number of parts in filename": workaround — download the wheel and rename it to drop the .manylinux.2.28 infix from the version segment before pip install. Tracked for v0.1.16.
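The torch ABI note above amounts to a lookup against the wheel matrix. A minimal sketch, assuming only the ROCm-to-torch pairing listed in the matrix; the names `WHEEL_TORCH_ABI` and `compatible_rocm_wheels` are illustrative:

```python
# torch ABI each ROCm wheel line was built against, per the wheel matrix.
WHEEL_TORCH_ABI = {"7.0": "2.10", "7.1": "2.10", "7.2": "2.11"}


def compatible_rocm_wheels(torch_version: str) -> list:
    """ROCm wheel lines whose torch ABI matches the given torch version."""
    # '2.10.0+rocm7.1' -> '2.10'
    major_minor = ".".join(torch_version.split("+")[0].split(".")[:2])
    return [rocm for rocm, abi in WHEEL_TORCH_ABI.items() if abi == major_minor]


print(compatible_rocm_wheels("2.10.0"))
print(compatible_rocm_wheels("2.11.0+rocm7.2"))
```

For a torch 2.10 container this returns the 7.0 and 7.1 wheel lines, which is consistent with the recommendation above to install the rocm7.1 wheel; an empty list means no wheel in this release matches the installed torch.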
Feedback
- Bug reports: https://github.com/ROCm/aiter/issues — tag v0.1.15
- Direct: peng.sun@amd.com
AITER v0.1.15-rc0
First release candidate for v0.1.15. Cut from main@e3940660b ("Add add mhc_pre fused rmsnorm (#3396)", 2026-05-28), with one cherry-pick (#3304 mla fp8 qh32 seqlen=1 persistent kernel for gfx950) and two CI fixes on top.
Release branch: release/v0.1.15 at 8ddfc7510.
Highlights since v0.1.14 (80 commits on main + 1 cherry-pick)
- DSv4-Pro / V4-Flash kernels — fused compress attention (#3357), sparse prefill OPUS (#3225), fp8_mqa_logits re-add, indexer_qk_rope_quant_and_cache non-contiguous k support (#3301), DSv4 padding fix (#3184), DSv4 bf16 + fp8 a8w8 blockscale tunes (#3284 #3339 #3394)
- MoE — fused dynamic MXFP8 quant + moe_sort HIP path (#3312), drop a_scale_one for fp8 stage1 + remove fp8 fuse_quant bypass (#3367), optimised prefill mxfp8 quant moe sort (#3398), LDS-aware num_stages selection for gfx950 (#3372), GLUON a8w4 optimisations (#3317)
- FlyDSL — pin bumped to 0.1.9.dev599, fused qk_norm_rope_quant for DSv4-Pro decode (#3320), fused_compress_attn for V4-Pro/V4-Flash (#3357), dynamic layout fix (#3373)
- Triton — tl.dot(..., acc=...) accumulator form (#3231), split-k common reduce (#3230), MoE gfx1250 optimisations (#3293), MoE routing support for expert_map (#3348), splitk deadlock fix (#3288)
- mla — fp8 qh32 seqlen=1 persistent kernel for gfx950 (#3304, cherry-picked)
- mhc_post / mhc_pre — fused rmsnorm (#3396), split-k acc_sq mask fix (#3278)
- OPUS — bf16 gemm support (#2945), pa_sparse_prefill_opus (#3225), mono version m align assert fix (#3382), unroll loop + scale mfma update (#3329), synchronous fallback _async_load (#3336), CDNA-only v_pk_mul_f32 ASM guards (#3322 #3356)
- gfx1200/1201 RDNA4 — FP8 dtype map (#3332), Gluon MoE optimisations (#3317)
- DP-attention — CUDAGraph capture compatibility fix (#3375)
- CK — submodule pin fix after CK re-sync with rocm-libraries (#3387)
- CI/build infra — install_triton.sh pipefail bug fix, workflow installs flydsl from AMD mirror at build time (this RC)
Validation (GSM8K 3-shot, flexible-extract, mi355-gpu-15, validated on e3940660b base before cherry-pick)
| Model | Score | Threshold | Result |
|---|---|---|---|
| DeepSeek-R1-0528 (TP=8, fp8 KV) | 0.9431 | 0.94 | PASS |
| MiniMax-M2.5 (TP=2, fp8 KV) | 0.9340 | 0.92 | PASS |
| Qwen3-235B-A22B-FP8 (TP=8, fp8 KV) | 0.8795 | 0.87 | PASS ✨ |
| GLM-5-FP8 (TP=8, fp8 KV) | 0.9431 | 0.93 | PASS |
| Kimi-K2.5-MXFP4 (TP=4, fp8 KV) | 0.9340 | 0.93 | PASS |
5/5 PASS. Qwen3-235B-A22B passes cleanly for the first time on this base (was borderline 0.8696 / 0.8650 on v0.1.14 / Option A candidate). Likely due to #3398 prefill mxfp8 moe sort + #3396 mhc_pre fused rmsnorm.
Wheel Matrix (6 wheels)
ROCm 7.0 / 7.1 / 7.2 × Python 3.10 / 3.12, manylinux_2_28 ABI (glibc 2.28+). Fat binary covers gfx942 (MI300/MI325X) + gfx950 (MI350/MI355X).
| ROCm | Python | torch ABI | Size |
|---|---|---|---|
| 7.0 | 3.10 | 2.10 | 466 MB |
| 7.0 | 3.12 | 2.10 | 467 MB |
| 7.1 | 3.10 | 2.10 | 459 MB |
| 7.1 | 3.12 | 2.10 | 459 MB |
| 7.2 | 3.10 | 2.11 | 452 MB |
| 7.2 | 3.12 | 2.11 | 453 MB |
Install
pip install --extra-index-url https://rocm.frameworks-devreleases.amd.com/whl-staging/gfx942-gfx950/ https://github.com/ROCm/aiter/releases/download/v0.1.15-rc0/<wheel-filename>
The --extra-index-url is required at this RC — see "Partner dependencies" below.
Partner dependencies (READ BEFORE INSTALLING)
v0.1.15 introduces two hard runtime/build dependencies that are not on public PyPI:
1. flydsl==0.1.9.dev599 (REQUIRED)
setup.py calls start_aot() which imports aiter.aot.flydsl.gemm at build time. Runtime aiter import also requires this exact version. Available only from the AMD nightlies mirror:
https://rocm.frameworks-devreleases.amd.com/whl-staging/gfx942-gfx950/
Always pass --extra-index-url <above> to pip when installing the aiter wheel. Without it, ROCm/HIP JIT silently disables (CK and HIP ops gone, only Triton ops remain).
We expect flydsl 0.1.9 final release wheels on public PyPI by v0.1.15 final.
2. triton>=3.6.0 (REQUIRED)
aiter/__init__.py enforces triton>=3.6.0 for the new Gluon kernels (#2695, #3219). Current ATOM containers (rocm/atom:rocm7.2.2_*_pytorch_release_2.10.0_atom0.1.2.post) ship triton 3.5.1 and will hit:
RuntimeError: aiter gluon kernels require triton>=3.6.0, found 3.5.1
Partner action: before installing the aiter wheel:
pip install --force-reinstall triton==3.6.0
Or wait for new ATOM container builds that ship triton 3.6+ by default.
Known Issues
- Kimi-K2.5-MXFP4 end-to-end requires ATOM with PR #670 (kwargs upgrade for aiter.fused_qk_rmsnorm). ATOM nightly tags from 2026-05-14 onward include this; older ATOM containers will hit AttributeError: 'float' object has no attribute 'size' at the MLA path. Tracking: #3177
- rocm7.2 wheel built against torch 2.11 ABI. For deployments still on torch 2.10 ATOM containers, install the rocm7.1 wheel which uses torch 2.10 ABI (validated PASS on all 5 models).
Build / release engineering notes
This RC fixes two CI bugs that blocked the first 6-wheel build attempt:
- install_triton.sh silent-fail on non-Debian builders. The dpkg -l rocm-core | awk line failed under set -o pipefail when dpkg returned 1 (container has no rocm-core), and setup.py swallowed the _run_install_triton() exception in try/except, leaving the container with no triton installed after the uninstall step ran. Surfaced as AttributeError: module 'triton' has no attribute 'language'. Fix: || true after the awk pipe so the pipeline tolerates dpkg's non-zero exit.
- Workflow drops flydsl from requirements.txt. That worked for v0.1.13 / v0.1.14 because their setup.py could build without flydsl (try/except wrapped start_aot). v0.1.15 unconditionally calls start_aot() during setup.py, so flydsl is now a hard build-time dep. Fix: install flydsl==0.1.9.dev599 from the AMD mirror in the build step.
Both fixes will be backported to main in a follow-up PR.
Feedback
- Bug reports: https://github.com/ROCm/aiter/issues — tag v0.1.15-rc0
- Direct: peng.sun@amd.com
Update 2026-06-05 — wheel ABI fix
All 6 wheels were rebuilt with explicit torch_pin:
- ROCm 7.0 / 7.1: torch==2.10
- ROCm 7.2: torch==2.11
The initial wheels in this release were built without torch_pin, so CI pulled torch 2.12+rocm7.X (latest on the public index). That created a c10::cuda::getCurrentCUDAStream ABI mismatch when partners loaded module_gemm_a8w8_blockscale_bpreshuffle.so inside ATOM containers shipping torch 2.10 (rocm/atom:rocm7.2.2_*_atom0.1.2.post). The rebuilt wheels match torch 2.10 ABI for rocm7.0/7.1 and torch 2.11 for rocm7.2.
DSR1 GSM8K re-validated on MI355X with the rocm7.1 cp312 wheel: 0.9477 (threshold 0.94, PASS).
Partners who downloaded wheels before 2026-06-05 21:00 UTC should re-download.
AITER v0.1.14
Production release of AITER v0.1.14. Cut from release/v0.1.14 at commit bd0534e96. 19 commits land in v0.1.14 vs v0.1.13.
Highlights
- #3057 DSv4 fusions phase 1 — first batch of Triton/ATOM-side DSv4 fusions, the headline feature of v0.1.14.
- #3163 minimax fused qknorm+allreduce — fused qknorm + allreduce kernel for MiniMax-M2.x, ~10-15% TPS improvement on prefill TP=2 / TP=4.
- #3189 grid-strided loop, drop 80-token cap — follow-up to #3163 that removes the prior hard cap of 80 tokens per kernel launch.
- Kimi-K2.5-MXFP4 unblocked end-to-end when paired with ATOM containing PR #670 (kwargs upgrade for aiter.fused_qk_rmsnorm).
Validation (GSM8K 3-shot, flexible-extract, mi355-gpu-15)
| Model | Score | Threshold | Result |
|---|---|---|---|
| DeepSeek-R1-0528 (TP=8, fp8 KV) | 0.9484 | 0.94 | PASS |
| MiniMax-M2.5 (TP=2, fp8 KV) | 0.9393 | 0.92 | PASS |
| Qwen3-235B-A22B-FP8 (TP=8, fp8 KV) | 0.8696 | 0.87 | borderline (within GSM8K noise ±0.005) |
| GLM-5-FP8 (TP=8, fp8 KV) | 0.9393 | 0.93 | PASS |
| Kimi-K2.5-MXFP4 (TP=4, fp8 KV) | 0.9348 | 0.93 | PASS (requires ATOM with PR #670) |
5/5 models pass. Qwen3 is technically below threshold by 0.0004 — within the standard deviation of the GSM8K 3-shot run and corresponds to a single-question swing.
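The single-question claim can be checked by converting scores to question counts. This sketch assumes the run used the full GSM8K test split of 1,319 questions, which these notes do not state explicitly:

```python
import math

N_QUESTIONS = 1319   # GSM8K test split size (assumption: full split was used)
THRESHOLD = 0.87
REPORTED = 0.8696    # Qwen3-235B-A22B-FP8 score from the table above

correct = round(REPORTED * N_QUESTIONS)        # answers counted correct
needed = math.ceil(THRESHOLD * N_QUESTIONS)    # minimum count to reach 0.87
print(correct, needed, needed - correct)       # prints: 1147 1148 1
```

Under that assumption one question is worth about 0.00076, so a gap of 0.0004 is smaller than the score's resolution.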
Wheel Matrix
6 wheels for ROCm 7.0 / 7.1 / 7.2 × Python 3.10 / 3.12, manylinux_2_28 ABI (glibc 2.28+). Fat binary covers gfx942 (MI300/MI325X) + gfx950 (MI350/MI355X).
| Filename | Size | torch ABI |
|---|---|---|
| amd_aiter-0.1.14+rocm7.0...cp310 | 437 MB | 2.10 |
| amd_aiter-0.1.14+rocm7.0...cp312 | 438 MB | 2.10 |
| amd_aiter-0.1.14+rocm7.1...cp310 | 431 MB | 2.10 |
| amd_aiter-0.1.14+rocm7.1...cp312 | 431 MB | 2.10 |
| amd_aiter-0.1.14+rocm7.2...cp310 | 424 MB | 2.11 |
| amd_aiter-0.1.14+rocm7.2...cp312 | 425 MB | 2.11 |
Install
pip install https://github.com/ROCm/aiter/releases/download/v0.1.14/<wheel-filename>
flydsl==0.1.7 is auto-resolved from PyPI as a runtime dep. Latest flydsl==0.1.8 also works (no API drift).
Known Issues
- Kimi-K2.5-MXFP4 end-to-end requires ATOM with PR #670 (kwargs upgrade for aiter.fused_qk_rmsnorm). ATOM nightly tags from 2026-05-14 onward include this; older ATOM containers will hit AttributeError: 'float' object has no attribute 'size' at the MLA path. Tracking: #3177
- rocm7.2 wheel was built against torch 2.11 ABI. For deployments still on torch 2.10 ATOM containers, install the rocm7.1 wheel which uses torch 2.10 ABI (validated PASS on all 5 models).
Cumulative Changes since v0.1.13
19 commits land in v0.1.14 vs v0.1.13. Grouped by area:
DSv4 / Triton-ATOM fusions
MoE
- minimax ops: fused qknorm+allreduce kernel (#3163)
- [custom_all_reduce] qknorm_allreduce_fusion_kernel_2stage: grid-strided loop, drop 80-token cap (#3189)
- silu_and_mul_quant + Opt silu_and_mul (#3145)
FlyDSL
- FlyDSL MXFP4 rounding alignment (#3153)
- FlyDSL GDR decode kernel optimize (#3135)
- FlyDSL xcd remap v2 (#3134)
- FlyDSL per-kernel parallelism + AOT pool size (#3133)
Triton
- mHC-post: post-stream + res-stream mixing optimization (#2920)
- Triton blockscale num_stages pipelining (#3136)
- Triton s_barrier sync waves (#3132)
- feat(triton/rope): fused QKV split + QK RMSNorm + RoPE + paged KV (#2902)
- Triton bench_gmm.py bug fix (#3154)
Bugfixes
- fix gather mem violation (#3182)
- [Bugfix][Triton] Honor transpose_bm in batched_gemm_a16wfp4_ fake tensor (#3166)
qk_rmsnorm_group_quant
- refactor hip kl (-30% build time) (#3137)
CK_TILE
Docs / Refactor
AITER v0.1.13.post1
Patch release on top of v0.1.13 for SA InferenceX evaluation. Adds Kimi a16wi4 MoE support and a splitk dispatch fix. Built against the flydsl 0.1.4.post1.dev glibc-2.28 backport from the FlyDSL team.
What's in it (delta vs v0.1.13)
3a38da399 build(deps): pin flydsl>=0.1.4.post1.dev,<0.1.5
4365ee78c fix splitk buffer dispatch (#3050)
9e831bad9 kimi a16wi4 moe support (#2863)
Wheels
ROCm 7.0 / 7.1 / 7.2 × Python 3.10 / 3.12, manylinux_2_28 ABI. Fat binary covers gfx942 (MI300/MI325X) + gfx950 (MI350/MI355X). Built against torch==2.10 (rocm 7.0/7.1) or torch==2.11 (rocm 7.2).
| ROCm | Python | torch ABI | Size |
|---|---|---|---|
| 7.0 | 3.10 | 2.10 | ~286 MB |
| 7.0 | 3.12 | 2.10 | ~286 MB |
| 7.1 | 3.10 | 2.10 | ~281 MB |
| 7.1 | 3.12 | 2.10 | ~282 MB |
| 7.2 | 3.10 | 2.11 | ~274 MB |
| 7.2 | 3.12 | 2.11 | ~275 MB |
Install
The wheel pins flydsl>=0.1.4.post1.dev,<0.1.5. Where to find a matching flydsl wheel depends on your environment glibc:
Ubuntu 22.04+ / glibc ≥ 2.35 (vLLM ROCm containers, RHEL 9, etc.) — pip resolves cleanly from PyPI without extra index:
pip install https://github.com/ROCm/aiter/releases/download/v0.1.13.post1/<wheel-filename>
glibc 2.28-2.34 (RHEL 8, CentOS 8, Ubuntu 20.04, manylinux_2_28 builders) — PyPI's 0.1.4.x wheels are manylinux_2_35 only. Use the AMD nightlies mirror as --extra-index-url:
pip install \
--extra-index-url https://rocm.frameworks-devreleases.amd.com/whl-staging/gfx942-gfx950/ \
https://github.com/ROCm/aiter/releases/download/v0.1.13.post1/<wheel-filename>
The AMD mirror hosts flydsl-0.1.4.post1.dev20260515+fdd1c1e-cp{310,312}-manylinux_2_27_x86_64.whl (built on glibc 2.28 by the FlyDSL team to support older systems; tracking ROCm/FlyDSL#527).
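The choice between the two install commands depends only on the glibc version, so it can be automated. A minimal sketch using the glibc ranges stated above; the helper names are hypothetical, and platform.libc_ver() is used on the assumption that it reports glibc correctly for the running interpreter:

```python
import platform

AMD_MIRROR = "https://rocm.frameworks-devreleases.amd.com/whl-staging/gfx942-gfx950/"


def parse_glibc(version: str) -> tuple:
    """'2.35' -> (2, 35)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def extra_index_args(glibc: tuple) -> list:
    """pip arguments needed so that a matching flydsl wheel can be resolved."""
    if glibc >= (2, 35):
        return []  # PyPI's manylinux_2_35 flydsl wheels resolve directly
    if glibc >= (2, 28):
        return ["--extra-index-url", AMD_MIRROR]
    raise RuntimeError(
        f"glibc {glibc[0]}.{glibc[1]} is older than the manylinux_2_28 baseline"
    )


if __name__ == "__main__":
    lib, version = platform.libc_ver()
    if lib == "glibc" and version:
        print(extra_index_args(parse_glibc(version)))
```

The returned list can be spliced into a pip install command line ahead of the wheel URL.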
Validation
- DeepSeek-R1-0528 (TP=8, kv_cache_dtype=fp8): GSM8K 3-shot flexible-extract = 0.9507 (threshold 0.94, PASS) on mi355-gpu-9 (gfx950).
- All 6 wheels installable + import aiter validated on vllm/vllm-openai-rocm:v0.19.1 (cp312) and AITER manylinux_2_28-builder (cp310/cp312).
Known Issues
- Kimi-K2.5-MXFP4 end-to-end serving still blocked by an upstream FlyDSL JIT bug (name 'gy' is not defined / 'lds_out' is not defined in mixed_moe_gemm_2stage and flydsl_moe1_* kernels). Affects any AITER caller (ATOM/vLLM/SGLang) on the v0.1.13 line — the AITER PR #2958 API rename + ATOM PR #670 kwargs upgrade are not in this branch. For Kimi MXFP4 serving today, please use v0.1.14-rc0 (https://github.com/ROCm/aiter/releases/tag/v0.1.14-rc0) with ATOM containing PR #670.
Acknowledgments
- Kiran Thumma + Felix Li (FlyDSL team) — for backporting glibc 2.28 support to the v0.1.4 line and publishing the 0.1.4.post1.dev wheel within ~24h of request.
Update 2026-05-28 — Verified by Amanzhol Salykov (FlyDSL team) with latest vLLM + Kimi-K2.5 int4 on MI355X; accuracy and perf both good. Published to GA.
AITER v0.1.14-rc0
⚠️ SUPERSEDED BY v0.1.14 — please use the final release. rc0 is left for historical reference only.
SUPERSEDED BY v0.1.14 — please use https://github.com/ROCm/aiter/releases/tag/v0.1.14
v0.1.13
AITER v0.1.13
Production release of the v0.1.13 line. Same commit as v0.1.13-rc5 (cdcfa833b) after 5 RC iterations.
Highlights
- DeepSeek R1 / GPT-OSS / Kimi / GLM-5 enablement maturing on MI300X / MI325X (gfx942) and MI350 / MI355X (gfx950)
- New ASM fmoe kernels for gfx950 that bypass bf16→fp8 quantization, gated by AITER_XBFLOAT16=1 env var (default off, opt-in for safety) (#2262)
- Substantial MLA improvements: MI350 MLA PS mode for new shapes (#2727, #2729, #2676), MLA PS mode for nhead=8/2 on MI308 (#2852), nhead=32 non-persistent decode crash fix on gfx950 (#2983)
- FMHA / paged attention: runtime dispatch for >4 GB KV cache in batch prefill (#2893), top_k_per_row prefill fix for batched_token_num > 4096 (#2901), gfx942/gfx950 PA PS kernel update with stride_scale_page write (#2796)
- RDNA4 expansion: FP8 support for gfx1200/gfx1201 (#2621), FlyDSL flash_attn_func backend for gfx1201 (R9600D) — first RDNA4-class attention backend in AITER (#2969 on main, included via baseline)
- Triton kernel additions: Gluon-optimized MoE Int8 SmoothQuant kernel for small K (#2441), Triton fallback for MI455 GPT-OSS / DSFP4 (#2657), GLM-5 70k+300 GEMM configs for gfx942 (#2743)
- FlyDSL maturity: BF16 GEMM tuned configs added/retuned for 6 models (#2733), AOT defaults via AITER_CONFIGS (#2756), if/else compatibility across versions (#2740), updated FlyDSL version pin
- Bulk silo merge — kernel fixes and tuned configs in preparation for the v0.1.13.post1 line (#3004, #3005, #3024)
- Quality of life: pandas FutureWarning suppressed and pybind11 type hint mismatch fix (#2980), Linux import errors no longer swallowed (#3049), std::unordered_map replaced with SynchronizedCache for thread safety (#2221), ctypes C-ABI error bridging to prevent worker crashes during kernel build (#2498)
Validation (mi355-gpu-15, GSM8K 3-shot, flexible-extract)
| Model | Score | Threshold | Result |
|---|---|---|---|
| DeepSeek-R1-0528 | 0.9454 | 0.94 | PASS |
| MiniMax-M2.5 | 0.9295 | 0.92 | PASS |
| Qwen3-235B-FP8 | 0.8802 | 0.87 | PASS |
Wheel Matrix
6 wheels for ROCm 7.0 / 7.1 / 7.2 × Python 3.10 / 3.12, manylinux_2_28 ABI. Built against torch==2.10 (rocm7.0/7.1) or torch==2.11 (rocm7.2). Fat binary covers gfx942 (MI300/MI308/MI325X) + gfx950 (MI350/MI355X).
Cumulative Changes since v0.1.12.post2
149 commits land in v0.1.13 vs v0.1.12.post2. Full list available via git log v0.1.12.post2..v0.1.13. Highlights grouped by area:
MoE / FlyDSL kernels (44 commits)
- ASM fmoe kernels for gfx950 with bf16→fp8 quantization bypass (#2262)
- FlyDSL A8W4 MoE update (#2726)
- GPT-OSS small-M MoE optimizations (#2775)
- Kimi-K2.5 MoE tuned configs revert for batch sizes 32/64 (#2836) — Kimi int4 a16wi4 MoE (#2863) deferred to v0.1.13.post1
- Triton Gluon-optimized MoE Int8 SmoothQuant for small K (#2441)
- MoE tuner fixes (#2831, #2785, #2723)
- fused_dynamic_mxfp4_quant_moe_sort_hip added (#2620, fix #2759)
- CK_TILE bpreshuffle compile failure fix (#2811)
- Bulk silo merge tuned configs and kernel fixes (#3004, #3005, #3024)
- moe_routing_sigmoid_top1_fused tie-breaking fix (#2750)
MLA / Multi-head Latent Attention (9+ commits)
- MI350 MLA PS mode support for new shapes (nhead 128,1 / 128,2 / 128,3 / 128,4 / 64,4 / 64,2 / 32,4) via mla_a16w16_qh32_qseqlen4_gqaratio32_ps.co (#2727)
- gfx950 fp8 decode native qh32 qseqlen2 MLA PS kernel (#2676) and qh64 nhead=64 native kernel (#2636)
- bf16 MLA decode kernel for gqa_ratio=64, qseqlen=1 (non-persistent) (#2729)
- MLA PS mode nhead 8/2 on MI308 (#2852)
- MLA Reduce and Metadata kernel rewritten with OPUS template (#2717)
- gfx950 nhead=32 non-persistent decode crash fix (#2983)
- OPUS lib improvements for MLA: mma step_k, dword copy via set_slice and inline asm for tr_load (#2652)
FMHA / Paged Attention
- Runtime dispatch for >4 GB KV cache in batch prefill (#2893)
- top_k_per_row prefill fix for batched_token_num > 4096 (#2901)
- gfx942/gfx950 PA PS kernel update with stride_scale_page write in asm_pa (#2796)
- fmha_fwd_v3 silence false warning when use_asm_v3 is disabled (#2744)
- indexer_k_quant_and_cache preshuffled layout support (#2879)
- car prefill kernel error fix for SGLang (#2745)
Triton path
- Gluon-optimized MoE Int8 SmoothQuant kernel for small K (#2441)
- Triton MHA UT reduction (#2612)
- Adapt model benchmarking scripts to new bench_mha.py CLI (#2673)
- Triton fallback for MI455 GPT-OSS and DSFP4 (#2657)
- GLM-5 70k+300 GEMM configs for gfx942 (#2743)
- Triton MoE GEMM shared memory exhaustion fix by reducing stage count (#2723)
- Drop GLM5 Triton tuned GEMM (#2803)
FlyDSL
- BF16 GEMM configs added/retuned for 6 models (#2733)
- AITER_CONFIGS for FlyDSL AOT defaults (#2756)
- if const_expr introduction (#2776)
- if/else compatibility across versions (#2740)
- A8W4 MoE update (#2726)
- bf16 GEMM implementation and tuned config update (#2634)
- A8W8 FlyDSL tune fix (#2809)
- Linear attention rebase for new FlyDSL version (#2746)
Architecture enablement
- RDNA4 (gfx1200/gfx1201): FP8 support added (#2621)
- MI355X (gfx950): continued maturation across MoE, MLA, FMHA paths
- MI350 (gfx950): MLA PS mode coverage expanded
- MI308 (gfx942): MLA PS mode nhead 8/2 (#2852), i8gemm tuning (#2590)
- MI300X (gfx942): gemma rmsnorm quant fusion (#2853), gemm_a16w16 torch tune (#2860)
Quality and safety fixes
- pandas FutureWarning suppression and pybind11 type hint mismatch (#2980)
- Linux import errors no longer swallowed (#3049)
- std::unordered_map → SynchronizedCache for thread safety in CK paths (#2221)
- ctypes C-ABI error bridging to prevent worker crashes during kernel build (#2498)
- fused_qk_norm_group_quant stride error check fix (#2637)
- fused_dynamic_mxfp4_quant_moe_sort_hip EP fix (#2759)
- fused_gemm_afp4wfp4_a16w16 LDS exhaustion fix under ASYNC_COPY (#2784)
- opus.hpp build time optimization kernel template (single-header C++ template, up to 61x faster builds vs standard torch extension)
Release engineering
- torch_pin + torch_index_url workflow inputs for release-build CI (#2875)
- manylinux_2_28 wheel matrix standardized: ROCm 7.0/7.1/7.2 × Python...
v0.1.13-rc5
AITER v0.1.13-rc5
Fifth release candidate for v0.1.13, focused on adding asm_fmoe kernels for gfx950 (no bf16->fp8 quantization required) while removing RC4's Kimi int4 MoE changes.
Changes vs RC4
Reverted:
- kimi a16wi4 moe support (#2863) — defer to v0.1.14
- fix splitk buffer dispatch (#3050) — only needed by #2863
Cherry-picked from main:
- Introduce asm fmoe kernels that do not require bf16->fp8 quantization (#2262) — new gfx950-only kernels behind AITER_XBFLOAT16=1 env var (default off)
- [Bugfix] Suppress pandas FutureWarning and fix pybind11 type hint mismatch (#2980)
Validation (mi355-gpu-15, GSM8K 3-shot, flexible-extract)
| Model | Score | Threshold | Result |
|---|---|---|---|
| DeepSeek-R1-0528 | 0.9454 | 0.94 | PASS |
| MiniMax-M2.5 | 0.9295 | 0.92 | PASS |
| Qwen3-235B-FP8 | 0.8802 | 0.87 | PASS |
Wheel Matrix
6 wheels for ROCm 7.0 / 7.1 / 7.2 x Python 3.10 / 3.12, manylinux_2_28 ABI. All built against torch==2.10 (rocm7.0/7.1) or torch==2.11 (rocm7.2 - torch 2.10 was removed from PyTorch's rocm7.2 index).
Cherry-pick Audit Summary (PR #2262)
PR #2262 introduces a new code path that is off by default and gfx950-only:
- Triple-gated: quant_type==per_1x128 + gfx950 + AITER_XBFLOAT16=1 env var
- Public C++ API unchanged
- New *_pybind.cu shims and pre-compiled .co HSA binaries for gfx950
- Zero merge conflicts on release/v0.1.13
- No follow-up correctness fixes on main
Existing MI300/MI308/MI450 deployments and unset-env MI355X deployments are unaffected.
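The triple gate described in the audit can be pictured as a single predicate. This is an illustrative sketch only, not AITER's dispatch code: the function name and the string forms of the quant type and architecture are assumptions, and only the AITER_XBFLOAT16 variable name comes from these notes.

```python
import os


def xbfloat16_path_enabled(quant_type: str, gpu_arch: str, env=None) -> bool:
    """True only when all three gates from the audit summary hold."""
    env = os.environ if env is None else env
    return (
        quant_type == "per_1x128"                  # gate 1: quantization type
        and gpu_arch == "gfx950"                   # gate 2: architecture
        and env.get("AITER_XBFLOAT16") == "1"      # gate 3: explicit opt-in
    )


# With the variable unset, a gfx950 deployment stays on the existing path.
print(xbfloat16_path_enabled("per_1x128", "gfx950", env={}))
print(xbfloat16_path_enabled("per_1x128", "gfx950", env={"AITER_XBFLOAT16": "1"}))
```

Because the gates are combined with a logical AND, missing any one of them keeps the existing code path, which matches the statement that unset-env deployments are unaffected.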
v0.1.13-rc2
AITER v0.1.13-rc2
Release candidate 2 for v0.1.13. Pre-release — please smoke-test downstream and report back before final tag.
Changes since rc1
5 cherry-picks onto release/v0.1.13:
Bug Fixes
- #2983 — [MLA] Fix nhead=32 non-persistent decode crash on gfx950: Corrects the decode dispatch condition for MLA attention when nhead=32 (e.g., Kimi-K2.5). Without this fix, gfx950 takes the non-persistent path and crashes during decode.
- #2879 — Support preshuffled layout in indexer_k_quant_and_cache: Adds preshuffled weight layout support to blockscale GEMM and KV cache indexer, fixing a blocker for DI/SA inference paths.
New Features
- #3005 — [Silo] Bulk merge kernel fixes + features: Adds 5 new Triton kernels — causal_conv1d_update_single_token, fused_rearrange_sigmoid_gdr, fused_fp8_quant, pa_mqa_logits, and gated delta rule decode optimizations. Includes corresponding op tests.
Config & Tuning
- #3004 — [Silo] Bulk merge tuned configs: Adds MI355X (gfx950) tuned configs for Kimi-K2, GLM-4.7, Qwen3-Next-80B across GEMM and FMoE kernels.
- #3024 — [Silo] Add configs missing from bulk merge #3004: Adds 6375 MI355X GEMM tunings for DeepSeek-V3.2 + MiniMax-M2.5 FMoE tunings. Deduplicates cross-file shape collisions (best us per shape wins).
Files changed (rc1 → rc2)
- 44 files, +13k / -2k lines
- 12 new CSV config files / updates
- 5 new Triton kernels + 3 new test files
- 2 C++ kernel files (MLA + cache)
Compatibility Matrix
| Component | Requirement |
|---|---|
| Container ABI | vllm/vllm-openai-rocm:v0.19.1 (Ubuntu 22, glibc ≤ 2.35, libstdc++ ≤ GLIBCXX_3.4.30) |
| PyTorch | torch==2.10.0+rocm7.1 (matches vllm-openai-rocm:v0.19.1; wheels are ABI-pinned to this build) |
| GPU arch | gfx942 (MI300X / MI325X), gfx950 (MI355X) |
| ROCm | 7.0 / 7.1 / 7.2 (pick wheel matching your runtime) |
| Python | 3.10 / 3.12 |
| vLLM | Recommend latest main with PR vllm-project/vllm#40754 merged. |
| SGLang | Recommend ≥ v0.5.10. If using CUDA-graph + custom all-reduce on MI300X / MI355X, use a base image with ROCm ≥ 7.2.1. |
Breaking Changes since v0.1.12.post2
None. Same as rc1.
Known Issues
Same as rc1 — see v0.1.13-rc1 release notes for details on the HIP graph capture issue (ROCm 7.2.0 + custom all-reduce).
Wheels
6 prebuilt wheels with PREBUILD_KERNELS=1 for gfx942 (MI300X/MI325X) + gfx950 (MI355X), manylinux_2_28 ABI, torch==2.10.0+rocm7.1 pin:
| ROCm | Python 3.10 | Python 3.12 |
|---|---|---|
| 7.2 | amd_aiter-0.1.13rc2+rocm7.2.manylinux.2.28-cp310-... | amd_aiter-0.1.13rc2+rocm7.2.manylinux.2.28-cp312-... |
| 7.1 | ...+rocm7.1.manylinux.2.28-cp310-... | ...+rocm7.1.manylinux.2.28-cp312-... |
| 7.0 | ...+rocm7.0.manylinux.2.28-cp310-... | ...+rocm7.0.manylinux.2.28-cp312-... |
Validation Status
- ATOM 5-model accuracy: rc1 validated (pending rc2 revalidation)
- vLLM ABI smoke: pending
- MLA nhead=32 decode (#2983): pending silicon verification
- Perf delta vs rc1: pending
Upgrade from rc1
pip install --pre --force-reinstall <wheel-url>
python -c "import aiter; print(aiter.__version__)" # expect: 0.1.13rc2+rocm7.X.manylinux.2.28
Tagged from release/v0.1.13 HEAD = ab62c65757c4c41cb24c14b8e925a776c6124892.