Releases: ROCm/aiter
AITER v0.1.14-rc0
AITER v0.1.14-rc0
First release candidate for v0.1.14, cut from main at commit 759589676 ("[Triton] [ATOM] DSV4 fusions phase 1 (#3057)"). Per-rc cherry-picks (#3163 minimax fused qknorm+allreduce, #3189 follow-up) deferred to rc1.
Highlights
- DSv4 fusions phase 1 (#3057) — first batch of Triton/ATOM-side DSv4 fusions on top of v0.1.13.
- Validated 5/5 production models for accuracy on GSM8K 3-shot (flexible-extract).
- Kimi-K2.5-MXFP4 unblocked when paired with ATOM containing PR #670 (kwargs upgrade for
aiter.fused_qk_rmsnorm); see Known Issues for ABI pairing details.
Validation (GSM8K 3-shot, flexible-extract)
| Model | Score | Threshold | Result | Notes |
|---|---|---|---|---|
| DeepSeek-R1-0528 (TP=8, fp8 KV) | PASS | 0.94 | PASS | mi355-gpu-15 |
| MiniMax-M2.5 (TP=2, fp8 KV) | PASS | 0.92 | PASS | with HSA_NO_SCRATCH_RECLAIM=1 |
| Qwen3-235B-A22B-FP8 (TP=8, fp8 KV) | PASS | 0.87 | PASS | with ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION=1 |
| GLM-5-FP8 (TP=8, fp8 KV) | PASS | 0.93 | PASS | mi355-gpu-15 |
| Kimi-K2.5-MXFP4 (TP=4, fp8 KV) | 0.9303 | 0.93 | PASS | requires ATOM with PR #670 (kwargs upgrade) |
Wheel Matrix
6 wheels for ROCm 7.0 / 7.1 / 7.2 × Python 3.10 / 3.12, manylinux_2_28 ABI. Built against torch==2.10 (rocm7.0/7.1) or torch==2.11 (rocm7.2). Fat binary covers gfx942 (MI300/MI325X) + gfx950 (MI350/MI355X).
Install
pip install https://github.com/ROCm/aiter/releases/download/v0.1.14-rc0/<wheel-filename>

flydsl==0.1.7 is auto-resolved as a runtime dep.
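A quick post-install sanity check, mirroring the verify step used in the rc1/rc2 notes (the exact version string printed depends on the wheel you installed):

```python
# Sanity check that the rc0 wheel imports cleanly and reports its version.
import aiter
print(aiter.__version__)
```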
Known Issues
- Kimi-K2.5-MXFP4 requires ATOM with PR #670 merged for the kwargs upgrade to `aiter.fused_qk_rmsnorm`. ATOM nightly tags from 2026-05-14 onward include the fix; older ATOM containers will hit `AttributeError: 'float' object has no attribute 'size'` in the MLA path. Tracking: #3177
- The rocm7.2 wheel was built against the torch 2.11 ABI, but most current ATOM containers ship torch 2.10. For rocm7.2 deployments today, install the rocm7.1 wheel, which uses the torch 2.10 ABI (validated PASS on Kimi). The rocm7.2-native wheel becomes useful once torch 2.11 ATOM images land.
What's Next (rc1)
Cherry-picks pending from Markus's must-list:
- #3163 (4df783a) minimax fused qknorm+allreduce kernel
- #3189 (open) grid-strided loop on top of #3163
Final v0.1.14 expected end of next week (~2026-05-22).
Cumulative Changes since v0.1.13
19 commits land in v0.1.14-rc0 vs v0.1.13. Highlights grouped by area:
DSv4 / Triton-ATOM fusions
MoE / FlyDSL
- FlyDSL MXFP4 rounding alignment (#3153)
- FlyDSL GDR decode kernel optimize (#3135)
- FlyDSL xcd remap v2 (#3134)
- FlyDSL per-kernel parallelism + AOT pool size (#3133)
- silu_and_mul_quant + Opt silu_and_mul (#3145)
Triton
- mHC-post: post-stream + res-stream mixing (#2920)
- Triton blockscale num_stages pipelining (#3136)
- Triton s_barrier sync waves (#3132)
- feat(triton/rope): fused QKV split + QK RMSNorm + RoPE + paged KV (#2902)
- Triton bench_gmm.py bug fix (#3154)
Bugfixes
- fix gather mem violation (#3182)
- [Bugfix][Triton] Honor `transpose_bm` in `batched_gemm_a16wfp4_` fake tensor (#3166)
- `qk_rmsnorm_group_quant`
- refactor hip kl (-30% build time) (#3137)
CK_TILE
Docs / Refactor
v0.1.13
AITER v0.1.13
Production release of the v0.1.13 line. Same commit as v0.1.13-rc5 (cdcfa833b) after 5 RC iterations.
Highlights
- DeepSeek R1 / GPT-OSS / Kimi / GLM-5 enablement maturing on MI300X / MI325X (gfx942) and MI350 / MI355X (gfx950)
- New ASM fmoe kernels for gfx950 that bypass bf16→fp8 quantization, gated by the `AITER_XBFLOAT16=1` env var (default off, opt-in for safety) (#2262)
- Substantial MLA improvements: MI350 MLA PS mode for new shapes (#2727, #2729, #2676), MLA PS mode for nhead=8/2 on MI308 (#2852), nhead=32 non-persistent decode crash fix on gfx950 (#2983)
- FMHA / paged attention: runtime dispatch for >4 GB KV cache in batch prefill (#2893), top_k_per_row prefill fix for `batched_token_num > 4096` (#2901), gfx942/gfx950 PA PS kernel update with `stride_scale_page` write (#2796)
- RDNA4 expansion: FP8 support for `gfx1200`/`gfx1201` (#2621), FlyDSL `flash_attn_func` backend for `gfx1201` (R9600D) — first RDNA4-class attention backend in AITER (#2969 on main, included via baseline)
- Triton kernel additions: Gluon-optimized MoE Int8 SmoothQuant kernel for small K (#2441), Triton fallback for MI455 GPT-OSS / DSFP4 (#2657), GLM-5 70k+300 GEMM configs for gfx942 (#2743)
- FlyDSL maturity: BF16 GEMM tuned configs added/retuned for 6 models (#2733), AOT defaults via `AITER_CONFIGS` (#2756), if/else compatibility across versions (#2740), updated FlyDSL version pin
- Bulk silo merge — kernel fixes and tuned configs in preparation for the v0.1.13.post1 line (#3004, #3005, #3024)
- Quality of life: pandas FutureWarning suppressed and pybind11 type hint mismatch fixed (#2980), Linux import errors no longer swallowed (#3049), `std::unordered_map` replaced with `SynchronizedCache` for thread safety (#2221), ctypes C-ABI error bridging to prevent worker crashes during kernel build (#2498)
Validation (mi355-gpu-15, GSM8K 3-shot, flexible-extract)
| Model | Score | Threshold | Result |
|---|---|---|---|
| DeepSeek-R1-0528 | 0.9454 | 0.94 | PASS |
| MiniMax-M2.5 | 0.9295 | 0.92 | PASS |
| Qwen3-235B-FP8 | 0.8802 | 0.87 | PASS |
Wheel Matrix
6 wheels for ROCm 7.0 / 7.1 / 7.2 × Python 3.10 / 3.12, manylinux_2_28 ABI. Built against torch==2.10 (rocm7.0/7.1) or torch==2.11 (rocm7.2). Fat binary covers gfx942 (MI300/MI308/MI325X) + gfx950 (MI350/MI355X).
Cumulative Changes since v0.1.12.post2
149 commits land in v0.1.13 vs v0.1.12.post2. Full list available via git log v0.1.12.post2..v0.1.13. Highlights grouped by area:
MoE / FlyDSL kernels (44 commits)
- ASM fmoe kernels for gfx950 with bf16→fp8 quantization bypass (#2262)
- FlyDSL A8W4 MoE update (#2726)
- GPT-OSS small-M MoE optimizations (#2775)
- Kimi-K2.5 MoE tuned configs revert for batch sizes 32/64 (#2836) — Kimi int4 a16wi4 MoE (#2863) deferred to v0.1.13.post1
- Triton Gluon-optimized MoE Int8 SmoothQuant for small K (#2441)
- MoE tuner fixes (#2831, #2785, #2723)
- fused_dynamic_mxfp4_quant_moe_sort_hip added (#2620, fix #2759)
- CK_TILE bpreshuffle compile failure fix (#2811)
- Bulk silo merge tuned configs and kernel fixes (#3004, #3005, #3024)
- moe_routing_sigmoid_top1_fused tie-breaking fix (#2750)
MLA / Multi-head Latent Attention (9+ commits)
- MI350 MLA PS mode support for new shapes (nhead 128,1 / 128,2 / 128,3 / 128,4 / 64,4 / 64,2 / 32,4) via `mla_a16w16_qh32_qseqlen4_gqaratio32_ps.co` (#2727)
- gfx950 fp8 decode native qh32 qseqlen2 MLA PS kernel (#2676) and qh64 nhead=64 native kernel (#2636)
- bf16 MLA decode kernel for gqa_ratio=64, qseqlen=1 (non-persistent) (#2729)
- MLA PS mode nhead 8/2 on MI308 (#2852)
- MLA Reduce and Metadata kernel rewritten with OPUS template (#2717)
- gfx950 nhead=32 non-persistent decode crash fix (#2983)
- OPUS lib improvements for MLA: mma step_k, dword copy via `set_slice`, and inline asm for `tr_load` (#2652)
FMHA / Paged Attention
- Runtime dispatch for >4 GB KV cache in batch prefill (#2893)
- top_k_per_row prefill fix for `batched_token_num > 4096` (#2901)
- gfx942/gfx950 PA PS kernel update with `stride_scale_page` write in asm_pa (#2796)
- fmha_fwd_v3: silence false warning when `use_asm_v3` is disabled (#2744)
- `indexer_k_quant_and_cache` preshuffled layout support (#2879)
- car prefill kernel error fix for SGLang (#2745)
Triton path
- Gluon-optimized MoE Int8 SmoothQuant kernel for small K (#2441)
- Triton MHA UT reduction (#2612)
- Adapt model benchmarking scripts to the new `bench_mha.py` CLI (#2673)
- Triton fallback for MI455 GPT-OSS and DSFP4 (#2657)
- GLM-5 70k+300 GEMM configs for gfx942 (#2743)
- Triton MoE GEMM shared memory exhaustion fix by reducing stage count (#2723)
- Drop GLM5 Triton tuned GEMM (#2803)
FlyDSL
- BF16 GEMM configs added/retuned for 6 models (#2733)
- `AITER_CONFIGS` for FlyDSL AOT defaults (#2756)
- `if const_expr` introduction (#2776)
- if/else compatibility across versions (#2740)
- A8W4 MoE update (#2726)
- bf16 GEMM implementation and tuned config update (#2634)
- A8W8 FlyDSL tune fix (#2809)
- Linear attention rebase for new FlyDSL version (#2746)
Architecture enablement
- RDNA4 (gfx1200/gfx1201): FP8 support added (#2621)
- MI355X (gfx950): continued maturation across MoE, MLA, FMHA paths
- MI350 (gfx950): MLA PS mode coverage expanded
- MI308 (gfx942): MLA PS mode nhead 8/2 (#2852), i8gemm tuning (#2590)
- MI300X (gfx942): gemma rmsnorm quant fusion (#2853), `gemm_a16w16` torch tune (#2860)
Quality and safety fixes
- pandas FutureWarning suppression and pybind11 type hint mismatch (#2980)
- Linux import errors no longer swallowed (#3049)
- `std::unordered_map` → `SynchronizedCache` for thread safety in CK paths (#2221)
- ctypes C-ABI error bridging to prevent worker crashes during kernel build (#2498)
- `fused_qk_norm_group_quant` stride error check fix (#2637)
- `fused_dynamic_mxfp4_quant_moe_sort_hip` EP fix (#2759)
- `fused_gemm_afp4wfp4_a16w16` LDS exhaustion fix under ASYNC_COPY (#2784)
- opus.hpp build time optimization kernel template (single-header C++ template, up to 61x faster builds vs standard torch extension)
Release engineering
- `torch_pin` + `torch_index_url` workflow inputs for release-build CI (#2875)
- manylinux_2_28 wheel matrix standardized: ROCm 7.0/7.1/7.2 × Python...
v0.1.13-rc5
AITER v0.1.13-rc5
Fifth release candidate for v0.1.13, focused on adding asm_fmoe kernels for gfx950 (no bf16->fp8 quantization required) while removing RC4's Kimi int4 MoE changes.
Changes vs RC4
Reverted:
- kimi a16wi4 moe support (#2863) — deferred to v0.1.14
- fix splitk buffer dispatch (#3050) — only needed by #2863
Cherry-picked from main:
- Introduce asm fmoe kernels that do not require bf16->fp8 quantization (#2262) — new gfx950-only kernels behind the `AITER_XBFLOAT16=1` env var (default off)
- [Bugfix] Suppress pandas FutureWarning and fix pybind11 type hint mismatch (#2980)
Validation (mi355-gpu-15, GSM8K 3-shot, flexible-extract)
| Model | Score | Threshold | Result |
|---|---|---|---|
| DeepSeek-R1-0528 | 0.9454 | 0.94 | PASS |
| MiniMax-M2.5 | 0.9295 | 0.92 | PASS |
| Qwen3-235B-FP8 | 0.8802 | 0.87 | PASS |
Wheel Matrix
6 wheels for ROCm 7.0 / 7.1 / 7.2 x Python 3.10 / 3.12, manylinux_2_28 ABI. All built against torch==2.10 (rocm7.0/7.1) or torch==2.11 (rocm7.2 - torch 2.10 was removed from PyTorch's rocm7.2 index).
Cherry-pick Audit Summary (PR #2262)
PR #2262 introduces a new code path that is off by default and gfx950-only:
- Triple-gated: `quant_type==per_1x128` + gfx950 + `AITER_XBFLOAT16=1` env var
- Public C++ API unchanged
- New `*_pybind.cu` shims and pre-compiled `.co` HSA binaries for gfx950
- Zero merge conflicts on release/v0.1.13
- No follow-up correctness fixes on main
Existing MI300/MI308/MI450 deployments and unset-env MI355X deployments are unaffected.
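As a usage sketch of the opt-in (these notes do not specify exactly when AITER reads the variable, so setting it before the serving process starts, or before importing aiter, is the safe choice):

```python
import os

# Opt in to the bf16-native asm fmoe path on gfx950 (default off, per the triple gate above).
os.environ["AITER_XBFLOAT16"] = "1"

import aiter  # fmoe dispatch on gfx950 with per_1x128 quant can now take the new kernels
```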
v0.1.13-rc2
AITER v0.1.13-rc2
Release candidate 2 for v0.1.13. Pre-release — please smoke-test downstream and report back before final tag.
Changes since rc1
5 cherry-picks onto release/v0.1.13:
Bug Fixes
- #2983 — [MLA] Fix nhead=32 non-persistent decode crash on gfx950: corrects the decode dispatch condition for MLA attention when `nhead=32` (e.g., Kimi-K2.5). Without this fix, gfx950 takes the non-persistent path and crashes during decode.
- #2879 — Support preshuffled layout in `indexer_k_quant_and_cache`: adds preshuffled weight layout support to blockscale GEMM and the KV cache indexer, fixing a blocker for DI/SA inference paths.
New Features
- #3005 — [Silo] Bulk merge kernel fixes + features: adds 5 new Triton kernels — `causal_conv1d_update_single_token`, `fused_rearrange_sigmoid_gdr`, `fused_fp8_quant`, `pa_mqa_logits`, and gated delta rule decode optimizations. Includes corresponding op tests.
Config & Tuning
- #3004 — [Silo] Bulk merge tuned configs: adds MI355X (gfx950) tuned configs for Kimi-K2, GLM-4.7, and Qwen3-Next-80B across GEMM and FMoE kernels.
- #3024 — [Silo] Add configs missing from bulk merge #3004: adds 6375 MI355X GEMM tunings for DeepSeek-V3.2 plus MiniMax-M2.5 FMoE tunings. Deduplicates cross-file shape collisions (best us per shape wins; see the sketch below).
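The dedup rule amounts to a "keep the fastest tuning per shape" pass over the merged CSVs. A minimal sketch of that idea (the column names `M`, `N`, `K`, `us` and the file names are illustrative placeholders, not the exact schema of AITER's tuned-config files):

```python
import pandas as pd

# Merge tuned-config CSVs from multiple sources, then keep only the best (lowest-latency)
# entry per GEMM shape.
merged = pd.concat([pd.read_csv(p) for p in ["configs_a.csv", "configs_b.csv"]])
best = (merged.sort_values("us")                              # lower microseconds is better
              .drop_duplicates(subset=["M", "N", "K"], keep="first"))
best.to_csv("tuned_configs.csv", index=False)
```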
Files changed (rc1 → rc2)
- 44 files, +13k / -2k lines
- 12 new CSV config files / updates
- 5 new Triton kernels + 3 new test files
- 2 C++ kernel files (MLA + cache)
Compatibility Matrix
| Component | Requirement |
|---|---|
| Container ABI | vllm/vllm-openai-rocm:v0.19.1 (Ubuntu 22, glibc ≤ 2.35, libstdc++ ≤ GLIBCXX_3.4.30) |
| PyTorch | torch==2.10.0+rocm7.1 (matches vllm-openai-rocm:v0.19.1; wheels are ABI-pinned to this build) |
| GPU arch | gfx942 (MI300X / MI325X), gfx950 (MI355X) |
| ROCm | 7.0 / 7.1 / 7.2 (pick wheel matching your runtime) |
| Python | 3.10 / 3.12 |
| vLLM | Recommend latest main with PR vllm-project/vllm#40754 merged. |
| SGLang | Recommend ≥ v0.5.10. If using CUDA-graph + custom all-reduce on MI300X / MI355X, use a base image with ROCm ≥ 7.2.1. |
Breaking Changes since v0.1.12.post2
None. Same as rc1.
Known Issues
Same as rc1 — see v0.1.13-rc1 release notes for details on the HIP graph capture issue (ROCm 7.2.0 + custom all-reduce).
Wheels
6 prebuilt wheels with PREBUILD_KERNELS=1 for gfx942 (MI300X/MI325X) + gfx950 (MI355X), manylinux_2_28 ABI, torch==2.10.0+rocm7.1 pin:
| ROCm | Python 3.10 | Python 3.12 |
|---|---|---|
| 7.2 | amd_aiter-0.1.13rc2+rocm7.2.manylinux.2.28-cp310-... | amd_aiter-0.1.13rc2+rocm7.2.manylinux.2.28-cp312-... |
| 7.1 | ...+rocm7.1.manylinux.2.28-cp310-... | ...+rocm7.1.manylinux.2.28-cp312-... |
| 7.0 | ...+rocm7.0.manylinux.2.28-cp310-... | ...+rocm7.0.manylinux.2.28-cp312-... |
Validation Status
- ATOM 5-model accuracy: rc1 validated (pending rc2 revalidation)
- vLLM ABI smoke: pending
- MLA nhead=32 decode (#2983): pending silicon verification
- Perf delta vs rc1: pending
Upgrade from rc1
pip install --pre --force-reinstall <wheel-url>
python -c "import aiter; print(aiter.__version__)"  # expect: 0.1.13rc2+rocm7.X.manylinux.2.28

Tagged from release/v0.1.13 HEAD = ab62c65757c4c41cb24c14b8e925a776c6124892.
v0.1.13-rc1
AITER v0.1.13-rc1
Release candidate for v0.1.13. Pre-release — please smoke-test downstream and report back before final tag.
Compatibility Matrix
| Component | Requirement |
|---|---|
| Container ABI | vllm/vllm-openai-rocm:v0.19.1 (Ubuntu 22, glibc ≤ 2.35, libstdc++ ≤ GLIBCXX_3.4.30) |
| PyTorch | torch==2.10.0+rocm7.1 (matches vllm-openai-rocm:v0.19.1; wheels are ABI-pinned to this build) |
| GPU arch | gfx942 (MI300X / MI325X), gfx950 (MI355X) |
| ROCm | 7.0 / 7.1 / 7.2 (pick wheel matching your runtime) |
| Python | 3.10 / 3.12 |
| vLLM | Recommend latest main with PR vllm-project/vllm#40754 merged. Older vLLM that calls gemm_a4w4(A, B, A_scale, B_scale, out, ...) will silently pass out as bias on MXFP4 — see Breaking Changes. |
| SGLang | Recommend ≥ v0.5.10. If using CUDA-graph + custom all-reduce on MI300X / MI355X, use a base image with ROCm ≥ 7.2.1 (e.g. rocm/pytorch:rocm7.2.2_ubuntu22.04_py3.10_pytorch_release_2.9.1) — see Known Issues. |
Breaking Changes since v0.1.12.post2
None new in this release. For context, the gemm_a4w4 API changed in late 2025 (PR #1679, before v0.1.12) from

`gemm_a4w4(A, B, A_scale, B_scale, out, bpreshuffle=True)`

to

`y = gemm_a4w4(A, B, A_scale, B_scale, dtype=out_dtype, bpreshuffle=True)`

Both v0.1.12.post2 and v0.1.13-rc1 already use the new signature. vLLM PR vllm-project/vllm#40754 fixes vLLM's call site; install both AITER and a vLLM build that includes that PR for MXFP4 (Quark OCP MX) models.
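A call-site sketch of that migration, assuming the op is exposed as `aiter.gemm_a4w4` (the tensor dtype and the wrapper function are illustrative, not AITER's documented API surface):

```python
import torch
import aiter

def mxfp4_gemm(A, B, A_scale, B_scale):
    # Old (pre-#1679) call sites allocated `out` themselves and passed it positionally:
    #   out = torch.empty(..., dtype=torch.bfloat16, device="cuda")
    #   gemm_a4w4(A, B, A_scale, B_scale, out, bpreshuffle=True)
    # New signature: request the output dtype and take the returned tensor instead.
    return aiter.gemm_a4w4(A, B, A_scale, B_scale, dtype=torch.bfloat16, bpreshuffle=True)
```

Older vLLM call sites that still pass `out` positionally end up feeding it into a later parameter on MXFP4 models, which is the silent out-as-bias hazard called out in the Compatibility Matrix above.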
Known Issues
- HIP graph capture crash on ROCm 7.2.0 with custom all-reduce (sgl-project/sglang#23580 / #23581, #2857 / #2941).
  - Real root cause (confirmed by the ROCm team in the #2857 thread): ROCm 7.2.0's `hipEventQuery` ignores `THREAD_LOCAL` capture mode, so the NCCL watchdog on another thread invalidates the in-flight HIP graph capture. This triggers `hipErrorStreamCaptureInvalidated`; the next decode replay raises `HSA_STATUS_ERROR_EXCEPTION 0x1016`, and all TP scheduler subprocesses SIGABRT. AITER's `IPCBufferPool` change in v0.1.12.post1 only widened the race window; it did not introduce the bug.
  - Fix: upgrade to ROCm ≥ 7.2.1 (the runtime fix is in 7.2.1 and later; no AITER or PyTorch rebuild needed). Recommended base: `rocm/pytorch:rocm7.2.2_ubuntu22.04_py3.10_pytorch_release_2.9.1`.
  - Alternative workaround (only if pinned to ROCm 7.2.0): set `SGLANG_USE_AITER_AR=false` (sgl-project/sglang#23581 made this the SGLang default). AITER attention / MoE / fused RMSNorm paths remain enabled.
  - This is not an AITER-side bug, so v0.1.13 ships no code change for it. Tracking issue #2941 is closed accordingly.
Upgrade Checklist (from v0.1.12.post2)
# 1. Install rc1 wheel (--pre needed because rc1 is PEP 440 pre-release)
pip install --pre --force-reinstall https://github.com/ROCm/aiter/releases/download/v0.1.13-rc1/amd_aiter-0.1.13rc1+rocm7.2.manylinux.2.28-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl
# 2. Verify
python -c "import aiter; print(aiter.__version__)" # expect: 0.1.13rc1+rocm7.2.manylinux.2.28
# 3. Single-request smoke before opening to traffic
# (use your existing serving stack with ONE short request first, watch for HSA exceptions or shape errors)

Note: rc1 wheel filenames omit the .torch210 suffix that v0.1.12.post2 wheels carried. The torch ABI is unchanged — wheels are still built against torch==2.10.0+rocm7.1 (verified in build logs). The filename suffix will return in a future release.
What's new vs v0.1.12.post2
release/v0.1.13 is based on commit e039817a (selected from ATOM benchmark dashboard — ATOM commit 4f3c57970c showed 14/14 accuracy pass on this AITER head) plus 4 cherry-picked fixes:
- #2893 — fix(fmha): support >4GB KV cache in batch prefill via runtime dispatch
- #2901 — Fix top_k_per_row_prefill err when batched_token_num > 4096
- #2904 — revert gptoss tuned config (rolls back a perf-regressing tune)
- #2875 — ci(release): torch_pin + torch_index_url workflow inputs (build infra)
Total delta from v0.1.12.post2: ~118 commits in the release/v0.1.13 base plus the cherry-picks listed above.
ATOM accuracy validation (GSM8K 3-shot, MI355X 8-GPU)
All 5 tracked models pass thresholds on this rc:
| Model | Score (flex-extract) | Threshold | Result |
|---|---|---|---|
| DeepSeek-R1-0528 | 0.9553 | 0.94 | PASS |
| Kimi-K2.5-MXFP4 | 0.9340 | 0.93 | PASS |
| MiniMax-M2.5 | 0.9325 | 0.92 | PASS |
| GLM-5-FP8 | 0.9386 | 0.93 | PASS |
| Qwen3-235B-A22B-FP8 | 0.8772 | 0.87 | PASS |
Test environment: rocm/atom-dev:latest container, AITER 0.1.13rc1+rocm7.2, lm_eval local-completions with tokenized_requests=False, num_concurrent=65.
CI validation (release/v0.1.13)
- aiter-test (Standard Tests 10/10) — PASS
- sglang_downstream — PASS
- vLLM Benchmark — PASS (no regression vs v0.1.12.post2)
- atom-test (DeepSeek-R1-0528 MI300X+MI355X, gpt-oss-120b MI355X) — PASS
- Flash Attention Integration — PASS
- OPUS Test — PASS
- Black + Ruff — PASS
Wheels
6 prebuilt wheels with PREBUILD_KERNELS=1 for gfx942 (MI300X/MI325X) + gfx950 (MI355X), manylinux_2_28 ABI, torch==2.10.0+rocm7.1 pin (matches vllm/vllm-openai-rocm:v0.19.1 PyTorch ABI):
| ROCm | Python 3.10 | Python 3.12 |
|---|---|---|
| 7.2 | amd_aiter-0.1.13rc1+rocm7.2.manylinux.2.28-cp310-... | amd_aiter-0.1.13rc1+rocm7.2.manylinux.2.28-cp312-... |
| 7.1 | ...+rocm7.1.manylinux.2.28-cp310-... | ...+rocm7.1.manylinux.2.28-cp312-... |
| 7.0 | ...+rocm7.0.manylinux.2.28-cp310-... | ...+rocm7.0.manylinux.2.28-cp312-... |
What we need from downstream
- vLLM ml-ci-internal: smoke test DeepSeek-R1, gpt-oss-120b, Llama serving paths
- SGLang downstream: integration test against rc1 wheel
- ATOM standalone: confirm production benchmarks unchanged
If smoke is green, we'll cut v0.1.13 final from the same source commit. If issues surface, fix → rc2.
Tagged from release/v0.1.13 HEAD = 930c94120459bb352e1d7c68349b331b06397280.
v0.1.12.post2
AITER v0.1.12.post2
Post-release fix for v0.1.12 addressing two production blockers and a wheel ABI compatibility issue with vLLM containers.
Key Fixes
- Fix `c10_hip_check_implementation` undefined symbol (#2843, #2846 by @lingpeng-jin)
  Removed `C10_HIP_KERNEL_LAUNCH_CHECK()` from `gated_rmsnorm_quant_kernels.cu` — the macro signature drifted between PyTorch versions and broke loading in vllm/vllm-openai-rocm containers.
- Fix DSR1 + MI300X "GEMM is not supported" crash (#2864, vllm#39485, #2645 by @eppaneamd)
  Cherry-picked GEMM dispatch fix from main; surfaces the correct kernel for DeepSeek-R1 on gfx942 in vLLM serving.
- manylinux2_28 + torch 2.10 ABI pin
  All wheels now built on `pytorch/manylinux2_28-builder` (AlmaLinux 8 + devtoolset-11) with a `torch==2.10.0+rocm7.X` pin. Resolves:
  - `GLIBCXX_3.4.32` symbol mismatch with vllm/vllm-openai-rocm:v0.19.1 (Ubuntu 22 base)
  - `c10::cuda::getCurrentCUDAStream(signed char)` namespace masquerade drift between PyTorch 2.10 and 2.11
Validation
End-to-end DeepSeek-R1 serving in vllm/vllm-openai-rocm:v0.19.1:
- MI300X (gfx942) — PASS
- MI355X (gfx950) — PASS
Both architectures served from the same wheel.
Wheels
6 prebuilt wheels with PREBUILD_KERNELS=1 for gfx942 (MI300X/MI325X) + gfx950 (MI355X):
| ROCm | Python 3.10 | Python 3.12 |
|---|---|---|
| 7.2 | amd_aiter-0.1.12.post2+rocm7.2.manylinux_2_28.torch210-cp310 |
amd_aiter-0.1.12.post2+rocm7.2.manylinux_2_28.torch210-cp312 |
| 7.1 | amd_aiter-0.1.12.post2+rocm7.1.manylinux_2_28.torch210-cp310 |
amd_aiter-0.1.12.post2+rocm7.1.manylinux_2_28.torch210-cp312 |
| 7.0 | amd_aiter-0.1.12.post2+rocm7.0.manylinux_2_28.torch210-cp310 |
amd_aiter-0.1.12.post2+rocm7.0.manylinux_2_28.torch210-cp312 |
Install
pip install amd_aiter-0.1.12.post2+rocm7.2.manylinux_2_28.torch210-cp312-cp312-linux_x86_64.whl

Built from commit 28a7b6a5c85f4f34d266c8bbb9af6045b93f28fd (tag v0.1.12.post2).
v0.1.12.post1
AITER v0.1.12.post1
Post-release fix for v0.1.12 with 22 additional commits.
Key Fix
- Fix `fused_qk_rmsnorm_group_quant` stride check — relaxes `is_contiguous()` to `stride(1)==1` (last-dim contiguous), fixing crashes on DeepSeek MLA-architecture models via ATOM (#2671)
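For context, a minimal standalone illustration (plain PyTorch, not AITER code) of why the relaxed check matters: a tensor can be last-dim contiguous without being fully contiguous.

```python
import torch

x = torch.randn(8, 16)
y = x[:, :8]                    # column slice: strides stay (16, 1)

print(y.is_contiguous())        # False, the stricter check rejects tensors like this
print(y.stride(-1) == 1)        # True, last-dim contiguous is all the relaxed check requires
```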
Accuracy Validation (gsm8k 3-shot, ATOM upstream methodology)
All 5 models pass ATOM CI accuracy thresholds on MI355X (8x GPU):
| Model | TP | flexible-extract | Threshold | Baseline | Result |
|---|---|---|---|---|---|
| DeepSeek-R1-0528 | 8 | 0.9568 | 0.94 | 0.9553 | PASS |
| GLM-5-FP8 | 8 | 0.9439 | 0.93 | 0.9545 | PASS |
| Kimi-K2.5-MXFP4 | 4 | 0.9378 | 0.93 | 0.9409 | PASS |
| MiniMax-M2.5 | 2 | 0.9431 | 0.92 | 0.9401 | PASS |
| Qwen3-235B-A22B-Instruct-2507-FP8 | 8 | 0.9052 | 0.87 | 0.909 | PASS |
Test environment: ATOM 0.1.3.dev52, AITER 0.1.12.post1+rocm7.2.1, lm-eval with local-completions, tokenized_requests=False, num_concurrent=65.
Wheels
6 prebuilt wheels with PREBUILD_KERNELS=1 for gfx942 (MI300X/MI325X) + gfx950 (MI355X):
| ROCm | Python 3.10 | Python 3.12 |
|---|---|---|
| 7.2.1 | amd_aiter-0.1.12.post1+rocm7.2.1-cp310 |
amd_aiter-0.1.12.post1+rocm7.2.1-cp312 |
| 7.1.1 | amd_aiter-0.1.12.post1+rocm7.1.1-cp310 |
amd_aiter-0.1.12.post1+rocm7.1.1-cp312 |
| 7.0.2 | amd_aiter-0.1.12.post1+rocm7.0.2-cp310 |
amd_aiter-0.1.12.post1+rocm7.0.2-cp312 |
Install
pip install amd_aiter-0.1.12.post1+rocm7.2.1-cp312-cp312-linux_x86_64.whl

Built from commit 7b57073 (tag v0.1.12.post1).
v0.1.12-rc1
AITER v0.1.12 Release Notes
Release Date: 2026-04-10
Previous Release: v0.1.11.post1 (2026-03-05)
Commits: 334 (excluding release branch maintenance)
Supported GPU Architectures: gfx942 (MI300X/MI325X), gfx950 (MI355X)
Highlights
OPUS Migration -- Replacing CK Tile Primitives.
A major effort migrated internal kernel code from CK Tile APIs to the new OPUS (Operator Utility for Shader) abstraction layer. This includes replacing CK Tile in activation kernels (#2589), HIP kernels (#2533), allreduce (#2107), and type conversion primitives (#2331). OPUS also gained tiled scaled MFMA (#2384), finfo class (#2330), cast/numeric_limits enhancements (#2110), moe_sorting_opus (#2077), gfx950 smem transpose load (#2480), and comprehensive unit tests (#2017, #2040, #2127). This migration decouples AITER from CK internals and establishes OPUS as the portable device-code foundation.
FlyDSL Integration for MoE Kernels.
FlyDSL, AMD's high-performance domain-specific language, is now a first-class AITER dependency. Initial A4W4 MoE kernel support was imported (#2113) and enhanced (#2390), split-k GEMM was added (#2536), A4W4 MoE kernels were optimized (#2581), correctness and precision issues in split-k HGEMM were fixed (#2567), and FlyDSL was added to install requirements (#2430). The dependency was upgraded to v0.1.2 (#2635).
MLA (Multi-head Latent Attention) Enhancements.
MLA received extensive feature additions: HipKittens-based nhead=128 kernel (#2039), gfx950 A8W8 qh32 kernel (#1912), MLA persistent kernel LSE output (#2440), LSE-aware dispatch (#2378), FP8 return-LSE support (#2144), metadata split reference code (#2177), fast metadata update for decoding (#2215), and MI350-specific PS mode improvements including nhead=8/mtp=4 (#2461) and nhead64-to-nhead32 folding (#2570). Multiple NaN and accuracy bugs were also fixed (#2106, #2128, #2319).
ctypes Kernel Binding Refactoring.
Kernel dispatch was systematically migrated from pybind11 to ctypes, reducing build complexity and improving JIT build reliability. This includes the foundational ctypes binding refactor (#2255), paged attention ctypes migration (#2395), MoE ASM ctypes migration (#2341), int64 ctypes support (#2486), and a fix for ctypes JIT build issues with asm_topksoftmax (#2603).
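A generic illustration of that dispatch pattern (the library name `libaiter_kernels.so` and the `launch_topk_softmax` symbol below are hypothetical placeholders, not AITER's actual internals):

```python
import ctypes
import torch

# Load a pre-built shared object once and declare the launcher's C signature.
_lib = ctypes.CDLL("libaiter_kernels.so")          # hypothetical library name
_launch = _lib.launch_topk_softmax                 # hypothetical launcher symbol
_launch.restype = ctypes.c_int                     # plain C return code instead of a C++ exception
_launch.argtypes = [ctypes.c_void_p, ctypes.c_void_p, ctypes.c_int64, ctypes.c_void_p]

def topk_softmax(inp: torch.Tensor, out: torch.Tensor) -> None:
    stream = torch.cuda.current_stream().cuda_stream    # raw HIP stream handle as an int
    rc = _launch(ctypes.c_void_p(inp.data_ptr()),
                 ctypes.c_void_p(out.data_ptr()),
                 ctypes.c_int64(inp.numel()),
                 ctypes.c_void_p(stream))
    if rc != 0:
        raise RuntimeError(f"kernel launch failed with code {rc}")
```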
CK Dependency Removal for FMHA.
Flash MHA forward (#2353) and backward v3 (#2250) kernels had their Composable Kernel dependencies removed, and a build-time ENABLE_CK option was added (#2074) enabling fully CK-free builds. The torch dependency was also removed from the MHA shared library build (#2501). These changes reduce build times and external dependency complexity.
Warp Size Generalization.
HIP kernels were updated to support variable warp sizes rather than hardcoding warp_size=64. A WARP_SIZE macro was added to the common header for both host and device use (#2525), and topksoftmax, grouptopk, cache, and sample kernels were updated (#2599). This is essential for cross-architecture portability between CDNA (warp_size=64) and future targets.
Allreduce Refactoring and Fusion.
The custom allreduce path was refactored to support prefill-phase collective operations (#2453), an allreduce+rmsnorm+quant fusion pass was added (#1990), GPT-OSS-120B hidden_size=2880 support was enabled in fused allreduce rmsnorm (#2329), numerical accuracy was improved (#2586), a double-buffer option was added for cross_device_reduce_1stage (#2064), and CUDA graph capture compatibility was fixed (#2075).
Sage Attention v2 and Flash Attention Improvements.
Triton-based Sage Attention v2 received multiple updates: MXFP4 Q*K support (#2066), optimizations (#2045), stride fixes (#2117), mask fix (#2158), and a consolidated patch (#2240). Flash Attention v3 gained hipgraph support for KV cache (#2096), configurable Triton configs via environment variable (#2000), Windows build support (#2433), and integration CI (#1974).
RDNA Architecture Support.
AITER expanded beyond data center GPUs with gfx1150/1151 RDNA arch registration (#2014), improved RDNA config selection for Flash Attention (#2397) and general kernels (#2402), and RDNA CI infrastructure (#2222).
New Features
Attention & MLA
- Introduce HipKittens-based nhead=128 MLA kernel (#2039)
- Add gfx950 MLA A8W8 qh32 kernel (#1912)
- Add LSE output support for MLA decode qseqlen=1 persistent kernel (#2440)
- Add LSE-aware kernel dispatch for MLA (#2378)
- MLA PS mode FP8 with return LSE for nhead=128,4 (#2144)
- MLA PS mode add metadata split reference code (#2177)
- Add decode_update_mla_metadata_v1 for fast metadata update in decoding (#2215)
- MI350 MLA PS mode support nhead=8, mtp=4 (#2461)
- MI350 MLA PS mode fold nhead64,2 to nhead32,4 kernel (#2570)
- Add head_num=40 for MLA FP8 reduce kernel for Qwen3.5 (#2481)
- Upload mla_a8w8_qh64_qseqlen4_gqaratio16 config for MI300 (#2042)
- Add FP8 hdim=256 tile for batch prefill kernel (#2549)
- Support per_block for PA PS kernels (#2053)
- Add sliding window support for Triton sink attention (#2505)
- CK MHA backward: add sink attention score gradient support (#2321)
- MHA forward v3 hdim128 support per-tensor FP8 for MI300/MI308 (#2105)
- CK Tile FMHA backward use persistent kernels in deterministic mode (#2216)
- Optimize flash attention forward (#2265)
- Sage Attention v2: Q*K in MXFP4 (#2066)
- Sage Attention v2 patch (#2240)
- Hipgraph support for fav3 KV cache (#2096)
- Add FLASH_ATTENTION_FWD_TRITON_AMD_CONFIG_JSON env var support (#2000)
- Flash Attention Triton Windows build support (#2433)
MoE Kernels
- FlyDSL A4W4 MoE kernel import (#2113)
- FlyDSL A4W4 MoE support and kernel update (#2390)
- FlyDSL A4W4 MoE kernel optimization (#2581)
- Add FlyDSL split-k GEMM (#2536)
- Triton smoothquant int8 MoE kernel (#2049)
- Introduce ASM 64x256 kernels for MI300 (#2404)
- Introduce 64x256 fmoe kernels (#2279)
- Support topk_softmax with shared expert scoring function (#2356)
- Group_topk: moe_fused_gate support non-power-of-2 experts (192/96) (#2604)
- Update topk.py to support non-power-of-2 experts (Kimi-K2) for long contexts (#2359)
- Add moe_smooth_per_token_scaled_quant v1 and v2 (#2295)
- Add ASM topsoftmax support 384x8 (#2130)
- Support strided gating_score for topk_softmax (#2124)
GEMM
- Add FP8 blockscale ASM kernel (#2142)
- CK Tile A8W8 blockscale GEMM with preshuffleB support (#1954)
- Add Triton A8W8 split-k support (#2180)
- Add compiler configurations for bpreshuffle CK Tile modules (#2069)
- Add 32x128 and 64x128 ASM kernels for Qwen3-next TP4 (#2285)
- Add precision fix and gelu kernels for 64x256 (#2471)
- MI325 support gfx942 i8gemm tilesize 112x256 (#2006)
- Add igemm kernel for MI325 (#1968)
- Enable hipblaslt FP8 tuning (#2212)
- Add f32 MFMA support for 32x32x2f32 and 16x16x4f32 (#2070)
- Add fast gelu activation (#2220)
Fused Kernels
- Add fused_qk_norm_group_quant kernel (#2527)
- Add fused_qknorm HIP kernel (#2442)
- Fuse RMS + RoPE + block quantization kernel (#2027)
- Optimize fused_qk_norm_rope_blkquant kernel (#2206)
- Add allreduce+rmsnorm+quant fusion pass (#1990)
- Support GPT-OSS-120B hidden_size=2880 in fused allreduce rmsnorm (#2329)
- Add mhc_post HIP kernel (#2479)
- Add mhc_pre HIP kernel (mhc_pre_gemm_sqrsum, mhc_pre_big_fuse) (#2136)
- Add fused_qk_norm_rope_cache_quant rotary_dim parameter (#2199)
- Top-K Top-P sampling kernel optimization (#2034)
RDNA Support
- Adding gfx1150/51 to RDNA arch (#2014)
- Improve RDNA config selection for Flash Attention (#2397)
- Improve config selection for RDNA GPUs (#2402)
Other Features
- HIP causal conv1d decode kernel (#2084)
- PA decode gluon AOT C++ API (#2085)
- Support naive mrope in get_rope (#2292)
- Support FP8/MXFP4-quantized activation dtype (#2188)
- Support value_cache 5D shuffle layout with GPT-OSS-120B precision tests (#2217)
- Generate KV prefix preshuffle (#2288)
- Support dim(-1) allgather (#2162)
- Add ep, pp, dp group interface (#2137)
- Respect AITER_LOG_LEVEL for C++ stdout prints (#2086)
- Identify device name by chip ID (#2325)
- Support comments in tuned config CSV files (#2422)
- Defer expensive build operations to build_ext.run() (#1973)
- Hipgraph support: correct arg_size type from int to size_t (#2163)
- Add double-buffer option for cross_device_reduce_1stage (#2064)
- Use unreg path for custom all-reduce during CUDA graph capture (#2075)
Performance
Tuned Configs
- Add Kimi-K2.5 tuned configs for MI355X (#2619)
- Add DSv3-MXFP4 tuned configs for MI355X (#2616)
- Retune Kimi K2 MoE configs (#2625)
- Replace CK MoE config in TP4 configs (#2626)
- Add GLM-5 tuned configs (#2518)
- Add Qwen3.5 FP8 and A8W8 blockscale GEMM tuned configs (#2324)
- Tuned Qwen3.5 GEMM (#2485)
- Add tuned CSV files for GEMM and MoE to accelerate Kimi-K2 (#2290)
- Add MI355X (gfx950) tuned GEMM configs for FP4 and FP8 shapes (#2037)
- Tune 493 new FP4 GEMM shapes for LLM inference (#2092)
- Add new GEMM configuration files for various matrix sizes (#2024)
- GEMM and MoE tuning for DeepSeek-R1 InferenceX FP4 (#2261)
- Tune Triton GEMM kernel for MI355 DSV3 DP+EP configuration (#2016)
- MI325 igemm ASM tuning (#2125)
- Add blockPerCu support for CK Tile GEMMs and CK Tile MoE tuning (#2313)
- Update dsv3 ptpc A8W8 GEMM config (#2253)
- Add GEMM-A16W16-ATOMIC-N=256-K=6144 Triton GEMM tune config (#2213)
- Update gfx950 PA PS kernels and wire stride_scale_blk in asm_pa (#2569)
- Update gfx942 PA PS kernels and wire stride_scale_blk in asm_pa (#2522)
- Add more MoE/GEMM configs (#2506)
- Fix MoE stage2 tune config (#2438)
- Fix MoE GEMM tuned config (#2463)
- Remove duplicate tuned configs (#2219)
- Add FlyDSL split-k GEMM with Kimi-2 BF16 tuned config (#2536)
- Fix GEMM test failures and retune with latest Triton (#2434)
Kernel Optimizations
v0.1.9
What's Changed
- Fix sink error for asm fmha by @LJ-underdog in #1652
- add guard in case pynccl init failed by @valarLip in #1671
- One shot pa by @fsx950223 in #1670
- fix(pa_ps): fix pa_ps_asm .co for gfx950 by @dbyoung18 in #1669
- modify test_bf16gemm_test by @amd-ruitang3 in #1678
- [FIX/CI] Fix ruff CI check by @Boss2002n in #1675
- fix mha bwd golden perf issue by @JaxChen29 in #1666
- topk uplift v1 by @steamedMantou in #1662
- fix missing return in mha_bwd by @yuguo68 in #1688
- Remove the input parameter "out" in gemm_a4w4 by @junhaha666 in #1679
- fwd v3 hd192 optimize inst alignment for causal mode by @shay-li77 in #1663
- fix swa case mismatch by @JaxChen29 in #1694
- fixing the fp4 gemm tune script Exception caused by csv title inconsistency with code by @hongxiayang in #1686
- CI: Migrate Triton tests to aiter-1gpu-runner by @gyohuangxin in #1690
- add ntile 128 for a8 blkQ moe 1 stage by @zufayu in #1695
- Optimize RoPE in the cases that hdim is small. by @ruanjm in #1698
- rm garbage from whl by @amd-ruitang3 in #1696
- enhance prebuild logic by @zufayu in #1672
- LLfp4 qr cap for atom by @amirumoAMD in #1673
- [MLA] MLA conditions rewrite by @Zzz9990 in #1665
- [MLA] fake non persistent fix dp causal by @Zzz9990 in #1677
- add two fp4 tune shapes and tuned config by @hongxiayang in #1687
- Dev/a8w4 and a8w8splitk by @yadaish in #1667
- bf16_gemm_clean_in_kl by @amd-ruitang3 in #1700
- fix tuner by @valarLip in #1701
- add gen_fake for 4 gemm operators by @mqhc2020 in #1456
- fix llvm issue by @valarLip in #1703
- feat: Adaptive topk algorithm selection based on input characteristics by @ClementLinCF in #1578
- fix mha bwd build error by @JaxChen29 in #1705
- fix moe bug when pipever=v1 and nblk=64 by @lalala-sh in #1707
- fix by @valarLip in #1710
New Contributors
- @JaxChen29 made their first contribution in #1666
- @amirumoAMD made their first contribution in #1673
Full Changelog: v0.1.8...v0.1.9
v0.1.7
What's Changed
- add concat_and_cache_mla kernel by @yzhou103 in #1194
- Add triton_metadata_redirect module with with_metadata_path decorator by @jwu10003 in #1172
- [bug] fix qr when variable input by @lihaoyang-amd in #1191
- fix test_concat_cache_mla by @yzhou103 in #1198
- Update vllm_benchmark.yaml to use TW cluster to build vllm image by @gyohuangxin in #1200
- add hipblaslt swizzle feature by @yixionghuo in #1169
- mha fwd v3 gfx950 support dim_q=192 dim_v=128 by @minmengdie in #1188
- Fix FA cpp api multi target build by @slippedJim in #1196
- CI: parallel build Aiter whl packages for Python 3.10 and 3.12 by @gyohuangxin in #1204
- CI: Move some tests back to TW cluster by @gyohuangxin in #1199
- CI: Add timeout and retry when installing the python deps by @gyohuangxin in #1210
- add aiter namespace to rope by @valarLip in #1202
- update test_gemm_a16w16 by @amd-ruitang3 in #1205
- [fea]: custom all gather by @TennyWang1223 in #1207
- [TRITON] Add Positional Encoding (PE) support to Triton MHA kernels by @brunomazzottiamd in #1184
- Tune gemm op bf16 by @yzhou103 in #1190
- Add mha varlen fake for different from mha by @ZhangLirong-amd in #1214
- CI: Use the pre-built sglang image instead of building the sglang image from source. by @gyohuangxin in #1224
- [TRITON] Fix fp8 bmm op unit test bug on MI350 by @lucas-santos-amd in #1219
- CI: Change the image when building the Aiter release python3.12 package by @gyohuangxin in #1225
- CI: Add runner selection to workflow for flexible build host choice in Aiter release CI by @gyohuangxin in #1229
- [CK_TILE] FMHA BWD Optimizations for D48 for GFX950 by @DDEle in #1180
- Remove FA cpp api dependence of pandas by @slippedJim in #1230
- CI: Add unique tag after the names of uploaded packages by @gyohuangxin in #1236
- A8w8 asm codegen and tune by @yzhou103 in #1161
- Opt concat cache mla by @yzhou103 in #1239
- avoid copy ck by @valarLip in #1247
- CI: Add GPU_ARCH options in Aiter release CI by @gyohuangxin in #1253
- [MI35x] fix gfx950 build error by @yzhou103 in #1250
- update bf16 gemm asm by @amd-ruitang3 in #1238
- CI: Fix issues in Aiter release CI by @gyohuangxin in #1255
- add ninja into requirements by @amd-ruitang3 in #1254
- Wrapper gemm to fix get_config lru cache break by @ZhangLirong-amd in #1249
- Fix build bdist wheel error by @yzhou103 in #1256
- Added in GQA and 64-bit indexing by @kesavanramakrishnan in #1226
- Refine ca by @valarLip in #1263
- update mi308 fmoe fp16 asm,MI35x by @amd-ruitang3 in #1201
- Introducing opus by @carlushuang in #1264
- ca_refactor_fix by @valarLip in #1268
- fix_fp4_quant_dtype by @lalala-sh in #1271
- add sample use outer exponential by @junhaha666 in #1267
- Fix rowwise a8w8 gemm in swizzled hipb_mm by @kliuae-amd in #1258
- CI: Use TW cluster to run sglang tests by @gyohuangxin in #1273
- Catchall PR for all 355_wip related changes by @azaidy in #1148
- [MI35X] fix core check by @amd-ruitang3 in #1276
- Refactor gemm bf16 tuner by @yzhou103 in #1275
- CI: Operators tuning pipelines by @gyohuangxin in #1163
- fix the problem that v3's performance is worse than ck's by @minmengdie in #1237
- fix the python mha test run_perftest error by @minmengdie in #1277
- Reuse custom decorator in core and torch guard by @ZhangLirong-amd in #1278
- integrate deep gemm by @lalala-sh in #1265
- add a tuned fp4 gemm ds config and insert entries in untuned config by @hongxiayang in #1243
- Enable large batch size and optimization of non-Ragged batching by @valechen in #1269
- added a few more fw ds f4 untuned and tuned shapes in order to use asm module or kernel by @hongxiayang in #1298
- CI: Optimize autotuning pipeline and inital the docs by @gyohuangxin in #1286
- topk per row kernel by @ukannika in #1262
- fix aot by @fsx950223 in #1279
- Fix ATOM fp8 model quant fail issue in torch compile by @ZhangLirong-amd in #1299
- feat - pa_fwd support block map with stride in num_kv_heads_dim by @alibaba-miji in #1301
- Fix how to update accumulator for dot_scaled by @zhanglx13 in #1297
- CI: Optimize autotuning pipeline docs by @gyohuangxin in #1300
- Fix the lint issue by @gyohuangxin in #1307
- fix fwd perf calc error by @minmengdie in #1305
- add the asm kernel performance of fwd and bwd by @minmengdie in #1270
- Fused TopK and Sigmoid kernel by @samremes in #1251
- Ar rms by @TennyWang1223 in #1290
- Dsv32 cache by @junhaha666 in #1314
- Fix displaying supported architectures by @HollowMan6 in #1316
- using standalone pybind by @valarLip in #1317
- Enable mha bwd hd192_hd128 by @slippedJim in #1308
- CI: Add pre-check status check by @gyohuangxin in #1252
- [CK_TILE] fmha: Add backward pass support for padded inputs by @Jeff-Huang in #1212
- Mla splitkv enhance split alg inte by @valarLip in #1233
- Fix gemm tuner error mi350 by @yzhou103 in #1313
- CI: Skip triton setup in Aiter standard/multigpu tests and add retries when setting up triton by @gyohuangxin in #1325
- Fix global variable torch_fp8 initialization caused issue by @huizhougit in #1322
- [FEAT] [Triton] Add transpose scale to the triton fused_rms_fp8_group_quant by @tjtanaa in #1291
- [Triton] 355 wip Llama FP4 triton fusion + TP8 triton decode shape tunning by @k50112113 in #1315
- Kernel naming: add reusable constexpr repr helper and testing it with gemm_a16w16 by @Boss2002n in #1260
- Merge tuned file by @yzhou103 in #1327
- fix graph_breaks by return tensor for bool op by @ZhangLirong-amd in #1333
- fix_bf16gemm_asm by @amd-ruitang3 in #1329
- Improve Memory Usage in MLA by @ruanjm in #1338
- fix tune error caused by merge tuned_file by @yzhou103 in #1342
- rm rocblas op in aiter by @yzhou103 in #1337
- [Triton] DS a16w8 GEMM and fused reduce_rms_fp8_group_quant by @k50112113 in #1328
- Add block_m=16 for a8w8_ck_moe_blockscale by @huaiguxu in #1081
- Add Fused RMSNorm + FP8 Per-tensor Static Quantization Triton Kernel by @farlukas in #1330
- [TRITON] GEMM kernels nomenclature changes by @Boss2002n in #1283
- Temporarily run aiter standard and multigpu tests on the TW cluster, will switch back once the mirror registry is ready. by @gyohuangxin in #1359
- [Triton] Disable failing lean attention tests by @cagrikymk in #1357
- [Fix] update ck to fix fp4 gemm issue by @gino-lu in #1361
- add config by @valarLip in #1355
- add how_v3_bf16_cvt control to the Python API by @minmengdie in #1351
- [fix]: car 6 rank coredump by @TennyWang1223 in #1335
- Wrapper_flash_attn_backward custom op to avoid functionalize fallback and fix guard logic by @ZhangLirong-amd in #1348
- [TRITON] GEMM kernels nomenclature changes by @Boss2002n in https://github.com/ROCm/aiter/pull...