
Releases: ROCm/aiter

AITER v0.1.14-rc0 (Pre-release)

14 May 23:11 · commit 7595896

First release candidate for v0.1.14, cut from main at commit 759589676 ("[Triton] [ATOM] DSV4 fusions phase 1 (#3057)"). Planned cherry-picks (#3163 MiniMax fused qknorm+allreduce and the #3189 follow-up) are deferred to rc1.

Highlights

  • DSv4 fusions phase 1 (#3057) — first batch of Triton/ATOM-side DSv4 fusions on top of v0.1.13.
  • Validated 5/5 production-model accuracy on GSM8K 3-shot (flexible-extract).
  • Kimi-K2.5-MXFP4 unblocked when paired with ATOM containing PR #670 (kwargs upgrade for aiter.fused_qk_rmsnorm); see Known Issues for ABI pairing details.

Validation (GSM8K 3-shot, flexible-extract)

| Model | Score | Threshold | Result | Notes |
| --- | --- | --- | --- | --- |
| DeepSeek-R1-0528 (TP=8, fp8 KV) | PASS | 0.94 | PASS | mi355-gpu-15 |
| MiniMax-M2.5 (TP=2, fp8 KV) | PASS | 0.92 | PASS | with HSA_NO_SCRATCH_RECLAIM=1 |
| Qwen3-235B-A22B-FP8 (TP=8, fp8 KV) | PASS | 0.87 | PASS | with ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION=1 |
| GLM-5-FP8 (TP=8, fp8 KV) | PASS | 0.93 | PASS | mi355-gpu-15 |
| Kimi-K2.5-MXFP4 (TP=4, fp8 KV) | 0.9303 | 0.93 | PASS | requires ATOM with PR #670 (kwargs upgrade) |

Wheel Matrix

6 wheels for ROCm 7.0 / 7.1 / 7.2 × Python 3.10 / 3.12, manylinux_2_28 ABI. Built against torch==2.10 (rocm7.0/7.1) or torch==2.11 (rocm7.2). Fat binary covers gfx942 (MI300/MI325X) + gfx950 (MI350/MI355X).

Install

pip install https://github.com/ROCm/aiter/releases/download/v0.1.14-rc0/<wheel-filename>

flydsl==0.1.7 is auto-resolved as a runtime dependency.
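
To sanity-check the install before wiring it into a serving stack, a minimal check like the following can be used (a sketch; the expected version-string pattern is extrapolated from earlier rc tags, and flydsl is assumed to be importable under that name):

# Post-install sanity check (sketch). aiter.__version__ is used the same way as in
# the earlier rc upgrade notes; the flydsl import only confirms the dependency resolved.
import aiter
import flydsl  # noqa: F401  (presence check only)

print(aiter.__version__)  # expected to follow the 0.1.14rc0+rocm7.X.manylinux.2.28 pattern of earlier rcs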

Known Issues

  • Kimi-K2.5-MXFP4 requires ATOM with PR #670 merged for the kwargs upgrade to aiter.fused_qk_rmsnorm. ATOM nightly tags from 2026-05-14 onward include the fix; older ATOM containers hit AttributeError: 'float' object has no attribute 'size' in the MLA path (a detection sketch follows this list). Tracking: #3177
  • The rocm7.2 wheel is built against the torch 2.11 ABI, but most current ATOM containers ship torch 2.10. For rocm7.2 deployments today, install the rocm7.1 wheel, which uses the torch 2.10 ABI (validated PASS on Kimi). The rocm7.2-native wheel becomes useful once torch 2.11 ATOM images land.
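
For deployments that cannot immediately move to a post-2026-05-14 ATOM nightly, a startup probe along these lines can catch the mismatch before it surfaces mid-decode (a sketch only; it assumes aiter exposes fused_qk_rmsnorm at the top level and that the Python signature is introspectable, which is not guaranteed for compiled bindings):

# Fail fast on the aiter/ATOM pairing instead of hitting
# AttributeError: 'float' object has no attribute 'size' in the MLA path.
import inspect
import aiter

fn = getattr(aiter, "fused_qk_rmsnorm", None)
if fn is None:
    raise RuntimeError("this aiter build does not expose fused_qk_rmsnorm")
try:
    print(inspect.signature(fn))  # compare against the call site in your ATOM container (PR #670 uses kwargs)
except ValueError:
    print("signature not introspectable; rely on the ATOM nightly date (>= 2026-05-14) instead")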

What's Next (rc1)

Cherry-picks still pending from the Markus must-list:

  • #3163 (4df783a) minimax fused qknorm+allreduce kernel
  • #3189 (open) grid-strided loop on top of #3163

Final v0.1.14 expected end of next week (~2026-05-22).

Cumulative Changes since v0.1.13

19 commits land in v0.1.14-rc0 vs v0.1.13. Highlights grouped by area:

DSv4 / Triton-ATOM fusions

  • DSV4 fusions phase 1 (#3057)
  • Remove triton backend in dsv4_bf16_tuned_gemm.csv (#3171)

MoE / FlyDSL

  • FlyDSL MXFP4 rounding alignment (#3153)
  • FlyDSL GDR decode kernel optimize (#3135)
  • FlyDSL xcd remap v2 (#3134)
  • FlyDSL per-kernel parallelism + AOT pool size (#3133)
  • silu_and_mul_quant + Opt silu_and_mul (#3145)

Triton

  • mHC-post: post-stream + res-stream mixing (#2920)
  • Triton blockscale num_stages pipelining (#3136)
  • Triton s_barrier sync waves (#3132)
  • feat(triton/rope): fused QKV split + QK RMSNorm + RoPE + paged KV (#2902)
  • Triton bench_gmm.py bug fix (#3154)

Bugfixes

  • fix gather mem violation (#3182)
  • [Bugfix][Triton] Honor transpose_bm in batched_gemm_a16wfp4_ fake tensor (#3166)

qk_rmsnorm_group_quant

  • refactor hip kl (-30% build time) (#3137)

CK_TILE

  • Use Unified Workspace for FMHA BWD (#2948)
  • Add nhead128,1 mask=1 + nhead128,4 fold to m16x4 (#3046)

Docs / Refactor

  • AITER May 2026 newsletter (#3170)
  • refactor + unify triton/bench_fav3_sage.py scripts (#2920)

AITER v0.1.13

09 May 20:57

Production release of the v0.1.13 line. Same commit as v0.1.13-rc5 (cdcfa833b) after 5 RC iterations.

Highlights

  • DeepSeek R1 / GPT-OSS / Kimi / GLM-5 enablement maturing on MI300X / MI325X (gfx942) and MI350 / MI355X (gfx950)
  • New ASM fmoe kernels for gfx950 that bypass bf16→fp8 quantization, gated by AITER_XBFLOAT16=1 env var (default off, opt-in for safety) (#2262)
  • Substantial MLA improvements: MI350 MLA PS mode for new shapes (#2727, #2729, #2676), MoE PS mode for nhead=8/2 on MI308 (#2852), nhead=32 non-persistent decode crash fix on gfx950 (#2983)
  • FMHA / paged attention: runtime dispatch for >4 GB KV cache in batch prefill (#2893), top_k_per_row prefill fix for batched_token_num > 4096 (#2901), gfx942/gfx950 PA PS kernel update with stride_scale_page write (#2796)
  • RDNA4 expansion: FP8 support for gfx1200/gfx1201 (#2621), FlyDSL flash_attn_func backend for gfx1201 (R9600D) — first RDNA4-class attention backend in AITER (#2969 on main, included via baseline)
  • Triton kernel additions: Gluon-optimized MoE Int8 SmoothQuant kernel for small K (#2441), Triton fallback for MI455 GPT-OSS / DSFP4 (#2657), GLM-5 70k+300 GEMM configs for gfx942 (#2743)
  • FlyDSL maturity: BF16 GEMM tuned configs added/retuned for 6 models (#2733), AOT defaults via AITER_CONFIGS (#2756), if/else compatibility across versions (#2740), updated FlyDSL version pin
  • Bulk silo merge — kernel fixes and tuned configs in preparation for the v0.1.13.post1 line (#3004, #3005, #3024)
  • Quality of life: pandas FutureWarning suppressed and pybind11 type hint mismatch fix (#2980), Linux import errors no longer swallowed (#3049), std::unordered_map replaced with SynchronizedCache for thread safety (#2221), ctypes C-ABI error bridging to prevent worker crashes during kernel build (#2498)

Validation (mi355-gpu-15, GSM8K 3-shot, flexible-extract)

| Model | Score | Threshold | Result |
| --- | --- | --- | --- |
| DeepSeek-R1-0528 | 0.9454 | 0.94 | PASS |
| MiniMax-M2.5 | 0.9295 | 0.92 | PASS |
| Qwen3-235B-FP8 | 0.8802 | 0.87 | PASS |

Wheel Matrix

6 wheels for ROCm 7.0 / 7.1 / 7.2 × Python 3.10 / 3.12, manylinux_2_28 ABI. Built against torch==2.10 (rocm7.0/7.1) or torch==2.11 (rocm7.2). Fat binary covers gfx942 (MI300/MI308/MI325X) + gfx950 (MI350/MI355X).

Cumulative Changes since v0.1.12.post2

149 commits land in v0.1.13 vs v0.1.12.post2. Full list available via git log v0.1.12.post2..v0.1.13. Highlights grouped by area:

MoE / FlyDSL kernels (44 commits)

  • ASM fmoe kernels for gfx950 with bf16→fp8 quantization bypass (#2262)
  • FlyDSL A8W4 MoE update (#2726)
  • GPT-OSS small-M MoE optimizations (#2775)
  • Kimi-K2.5 MoE tuned configs revert for batch sizes 32/64 (#2836) — Kimi int4 a16wi4 MoE (#2863) deferred to v0.1.13.post1
  • Triton Gluon-optimized MoE Int8 SmoothQuant for small K (#2441)
  • MoE tuner fixes (#2831, #2785, #2723)
  • fused_dynamic_mxfp4_quant_moe_sort_hip added (#2620, fix #2759)
  • CK_TILE bpreshuffle compile failure fix (#2811)
  • Bulk silo merge tuned configs and kernel fixes (#3004, #3005, #3024)
  • moe_routing_sigmoid_top1_fused tie-breaking fix (#2750)

MLA / Multi-head Latent Attention (9+ commits)

  • MI350 MLA PS mode support for new shapes (nhead 128,1 / 128,2 / 128,3 / 128,4 / 64,4 / 64,2 / 32,4) via mla_a16w16_qh32_qseqlen4_gqaratio32_ps.co (#2727)
  • gfx950 fp8 decode native qh32 qseqlen2 MLA PS kernel (#2676) and qh64 nhead=64 native kernel (#2636)
  • bf16 MLA decode kernel for gqa_ratio=64, qseqlen=1 (non-persistent) (#2729)
  • MLA PS mode nhead 8/2 on MI308 (#2852)
  • MLA Reduce and Metadata kernel rewritten with OPUS template (#2717)
  • gfx950 nhead=32 non-persistent decode crash fix (#2983)
  • OPUS lib improvements for MLA: mma step_k, dword copy via set_slice and inline asm for tr_load (#2652)

FMHA / Paged Attention

  • Runtime dispatch for >4 GB KV cache in batch prefill (#2893)
  • top_k_per_row prefill fix for batched_token_num > 4096 (#2901)
  • gfx942/gfx950 PA PS kernel update with stride_scale_page write in asm_pa (#2796)
  • fmha_fwd_v3 silence false warning when use_asm_v3 is disabled (#2744)
  • indexer_k_quant_and_cache preshuffled layout support (#2879)
  • car prefill kernel error fix for SGLang (#2745)

Triton path

  • Gluon-optimized MoE Int8 SmoothQuant kernel for small K (#2441)
  • Triton MHA UT reduction (#2612)
  • Adapt model benchmarking scripts to new bench_mha.py CLI (#2673)
  • Triton fallback for MI455 GPT-OSS and DSFP4 (#2657)
  • GLM-5 70k+300 GEMM configs for gfx942 (#2743)
  • Triton MoE GEMM shared memory exhaustion fix by reducing stage count (#2723)
  • Drop GLM5 Triton tuned GEMM (#2803)

FlyDSL

  • BF16 GEMM configs added/retuned for 6 models (#2733)
  • AITER_CONFIGS for FlyDSL AOT defaults (#2756)
  • if const_expr introduction (#2776)
  • if/else compatibility across versions (#2740)
  • A8W4 MoE update (#2726)
  • bf16 GEMM implementation and tuned config update (#2634)
  • A8W8 FlyDSL tune fix (#2809)
  • Linear attention rebase for new FlyDSL version (#2746)

Architecture enablement

  • RDNA4 (gfx1200/gfx1201): FP8 support added (#2621)
  • MI355X (gfx950): continued maturation across MoE, MLA, FMHA paths
  • MI350 (gfx950): MLA PS mode coverage expanded
  • MI308 (gfx942): MLA PS mode nhead 8/2 (#2852), i8gemm tuning (#2590)
  • MI300X (gfx942): gemma rmsnorm quant fusion (#2853), gemm_a16w16 torch tune (#2860)

Quality and safety fixes

  • pandas FutureWarning suppression and pybind11 type hint mismatch (#2980)
  • Linux import errors no longer swallowed (#3049)
  • std::unordered_map → SynchronizedCache for thread safety in CK paths (#2221)
  • ctypes C-ABI error bridging to prevent worker crashes during kernel build (#2498)
  • fused_qk_norm_group_quant stride error check fix (#2637)
  • fused_dynamic_mxfp4_quant_moe_sort_hip EP fix (#2759)
  • fused_gemm_afp4wfp4_a16w16 LDS exhaustion fix under ASYNC_COPY (#2784)
  • opus.hpp build-time optimization (single-header C++ kernel template, up to 61x faster builds vs a standard torch extension)

Release engineering

  • torch_pin + torch_index_url workflow inputs for release-build CI (#2875)
  • manylinux_2_28 wheel matrix standardized: ROCm 7.0/7.1/7.2 × Python...

AITER v0.1.13-rc5 (Pre-release)

08 May 02:30

Fifth release candidate for v0.1.13, focused on adding asm_fmoe kernels for gfx950 (no bf16->fp8 quantization required) while removing RC4's Kimi int4 MoE changes.

Changes vs RC4

Reverted:

  • kimi a16wi4 moe support (#2863) — defer to v0.1.14
  • fix splitk buffer dispatch (#3050) — only needed by #2863

Cherry-picked from main:

  • Introduce asm fmoe kernels that do not require bf16->fp8 quantization (#2262) — new gfx950-only kernels behind AITER_XBFLOAT16=1 env var (default off)
  • [Bugfix] Suppress pandas FutureWarning and fix pybind11 type hint mismatch (#2980)

Validation (mi355-gpu-15, GSM8K 3-shot, flexible-extract)

| Model | Score | Threshold | Result |
| --- | --- | --- | --- |
| DeepSeek-R1-0528 | 0.9454 | 0.94 | PASS |
| MiniMax-M2.5 | 0.9295 | 0.92 | PASS |
| Qwen3-235B-FP8 | 0.8802 | 0.87 | PASS |

Wheel Matrix

6 wheels for ROCm 7.0 / 7.1 / 7.2 × Python 3.10 / 3.12, manylinux_2_28 ABI. All built against torch==2.10 (rocm7.0/7.1) or torch==2.11 (rocm7.2; torch 2.10 was removed from PyTorch's rocm7.2 index).

Cherry-pick Audit Summary (PR #2262)

PR #2262 introduces a new code path that is off by default and gfx950-only:

  • Triple-gated: quant_type==per_1x128 + gfx950 + AITER_XBFLOAT16=1 env var
  • Public C++ API unchanged
  • New *_pybind.cu shims and pre-compiled .co HSA binaries for gfx950
  • Zero merge conflicts on release/v0.1.13
  • No follow-up correctness fixes on main

Existing MI300/MI308/MI450 deployments and unset-env MI355X deployments are unaffected.
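
For operators who want to try the new path, the opt-in could look roughly like this (a sketch; it assumes the flag is read from the process environment, so setting it before importing aiter is the conservative choice):

# Opt in to the #2262 gfx950 fmoe kernels (off by default).
# The path only activates when all three gates hold: per_1x128 quant, gfx950, and this flag.
import os
os.environ["AITER_XBFLOAT16"] = "1"   # unset or "0" keeps the existing behaviour on all platforms

import aiter  # imported after the flag is set so kernel selection can see it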

AITER v0.1.13-rc2 (Pre-release)

06 May 04:04

Release candidate 2 for v0.1.13. Pre-release — please smoke-test downstream and report back before final tag.

Changes since rc1

5 cherry-picks onto release/v0.1.13:

Bug Fixes

  • #2983 [MLA] Fix nhead=32 non-persistent decode crash on gfx950: Corrects the decode dispatch condition for MLA attention when nhead=32 (e.g., Kimi-K2.5). Without this fix, gfx950 takes the non-persistent path and crashes during decode.
  • #2879 Support preshuffled layout in indexer_k_quant_and_cache: Adds preshuffled weight layout support to blockscale GEMM and the KV cache indexer, fixing a blocker for DI/SA inference paths.

New Features

  • #3005 [Silo] Bulk merge kernel fixes + features: Adds 5 new Triton kernels — causal_conv1d_update_single_token, fused_rearrange_sigmoid_gdr, fused_fp8_quant, pa_mqa_logits, and gated delta rule decode optimizations. Includes corresponding op tests.

Config & Tuning

  • #3004 [Silo] Bulk merge tuned configs: Adds MI355X (gfx950) tuned configs for Kimi-K2, GLM-4.7, Qwen3-Next-80B across GEMM and FMoE kernels.
  • #3024 [Silo] Add configs missing from bulk merge #3004: Adds 6375 MI355X GEMM tunings for DeepSeek-V3.2 + MiniMax-M2.5 FMoE tunings. Deduplicates cross-file shape collisions (best µs per shape wins).

Files changed (rc1 → rc2)

  • 44 files, +13k / -2k lines
  • 12 new CSV config files / updates
  • 5 new Triton kernels + 3 new test files
  • 2 C++ kernel files (MLA + cache)

Compatibility Matrix

| Component | Requirement |
| --- | --- |
| Container ABI | vllm/vllm-openai-rocm:v0.19.1 (Ubuntu 22, glibc ≤ 2.35, libstdc++ ≤ GLIBCXX_3.4.30) |
| PyTorch | torch==2.10.0+rocm7.1 (matches vllm-openai-rocm:v0.19.1; wheels are ABI-pinned to this build) |
| GPU arch | gfx942 (MI300X / MI325X), gfx950 (MI355X) |
| ROCm | 7.0 / 7.1 / 7.2 (pick the wheel matching your runtime) |
| Python | 3.10 / 3.12 |
| vLLM | Recommend latest main with PR vllm-project/vllm#40754 merged. |
| SGLang | Recommend ≥ v0.5.10. If using CUDA-graph + custom all-reduce on MI300X / MI355X, use a base image with ROCm ≥ 7.2.1. |
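
A quick runtime check against this matrix can look like the following (a sketch; the commented values are the expected defaults for the vllm-openai-rocm:v0.19.1 image and may differ in other containers):

# Container-ABI sanity check against the compatibility matrix above.
import platform
import sys
import torch

print("python:", sys.version.split()[0])   # expect 3.10 or 3.12
print("glibc :", platform.libc_ver())      # expect ('glibc', '2.35') or older
print("torch :", torch.__version__)        # expect 2.10.0+rocm7.1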

Breaking Changes since v0.1.12.post2

None. Same as rc1.

Known Issues

Same as rc1 — see v0.1.13-rc1 release notes for details on the HIP graph capture issue (ROCm 7.2.0 + custom all-reduce).

Wheels

6 prebuilt wheels with PREBUILD_KERNELS=1 for gfx942 (MI300X/MI325X) + gfx950 (MI355X), manylinux_2_28 ABI, torch==2.10.0+rocm7.1 pin:

| ROCm | Python 3.10 | Python 3.12 |
| --- | --- | --- |
| 7.2 | amd_aiter-0.1.13rc2+rocm7.2.manylinux.2.28-cp310-... | amd_aiter-0.1.13rc2+rocm7.2.manylinux.2.28-cp312-... |
| 7.1 | ...+rocm7.1.manylinux.2.28-cp310-... | ...+rocm7.1.manylinux.2.28-cp312-... |
| 7.0 | ...+rocm7.0.manylinux.2.28-cp310-... | ...+rocm7.0.manylinux.2.28-cp312-... |

Validation Status

  • ATOM 5-model accuracy: rc1 validated (pending rc2 revalidation)
  • vLLM ABI smoke: pending
  • MLA nhead=32 decode (#2983): pending silicon verification
  • Perf delta vs rc1: pending

Upgrade from rc1

pip install --pre --force-reinstall <wheel-url>
python -c "import aiter; print(aiter.__version__)"  # expect: 0.1.13rc2+rocm7.X.manylinux.2.28

Tagged from release/v0.1.13 HEAD = ab62c65757c4c41cb24c14b8e925a776c6124892.

AITER v0.1.13-rc1 (Pre-release)

25 Apr 06:11

Release candidate for v0.1.13. Pre-release — please smoke-test downstream and report back before final tag.

Compatibility Matrix

| Component | Requirement |
| --- | --- |
| Container ABI | vllm/vllm-openai-rocm:v0.19.1 (Ubuntu 22, glibc ≤ 2.35, libstdc++ ≤ GLIBCXX_3.4.30) |
| PyTorch | torch==2.10.0+rocm7.1 (matches vllm-openai-rocm:v0.19.1; wheels are ABI-pinned to this build) |
| GPU arch | gfx942 (MI300X / MI325X), gfx950 (MI355X) |
| ROCm | 7.0 / 7.1 / 7.2 (pick the wheel matching your runtime) |
| Python | 3.10 / 3.12 |
| vLLM | Recommend latest main with PR vllm-project/vllm#40754 merged. Older vLLM that calls gemm_a4w4(A, B, A_scale, B_scale, out, ...) will silently pass out as bias on MXFP4 — see Breaking Changes. |
| SGLang | Recommend ≥ v0.5.10. If using CUDA-graph + custom all-reduce on MI300X / MI355X, use a base image with ROCm ≥ 7.2.1 (e.g. rocm/pytorch:rocm7.2.2_ubuntu22.04_py3.10_pytorch_release_2.9.1) — see Known Issues. |

Breaking Changes since v0.1.12.post2

None new in this release. For context, the gemm_a4w4 API changed in late 2025 (PR #1679, before v0.1.12) from

gemm_a4w4(A, B, A_scale, B_scale, out, bpreshuffle=True)

to

y = gemm_a4w4(A, B, A_scale, B_scale, dtype=out_dtype, bpreshuffle=True)

Both v0.1.12.post2 and v0.1.13-rc1 already use the new signature. vLLM PR vllm-project/vllm#40754 fixes vLLM's call site; install both AITER and a vLLM build that includes that PR for MXFP4 (Quark OCP MX) models.
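
If a downstream call site has to run against both signatures during a transition, a guarded wrapper along these lines is one option (a sketch under stated assumptions: the output-shape calculation and the TypeError-based probe are illustrative and not taken from aiter or vLLM; the real vLLM fix is vllm-project/vllm#40754):

# Compatibility shim for the gemm_a4w4 signature change (illustrative only).
import torch
import aiter

def gemm_a4w4_compat(A, B, A_scale, B_scale, out_dtype, bpreshuffle=True):
    try:
        # Current API: the result tensor is returned and dtype is a keyword argument.
        return aiter.gemm_a4w4(A, B, A_scale, B_scale, dtype=out_dtype, bpreshuffle=bpreshuffle)
    except TypeError:
        # Pre-#1679 API: the caller allocates and passes the output buffer.
        # Assumed output shape; adjust if B is stored preshuffled or transposed.
        out = torch.empty((A.shape[0], B.shape[0]), dtype=out_dtype, device=A.device)
        aiter.gemm_a4w4(A, B, A_scale, B_scale, out, bpreshuffle=bpreshuffle)
        return out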

Known Issues

  • HIP graph capture crash on ROCm 7.2.0 with custom all-reduce (sgl-project/sglang#23580 / #23581, #2857 / #2941).
    • Real root cause (confirmed by the ROCm team in the #2857 thread): ROCm 7.2.0's hipEventQuery ignores THREAD_LOCAL capture mode, so the NCCL watchdog on another thread invalidates the in-flight HIP graph capture. This triggers hipErrorStreamCaptureInvalidated → the next decode replay raises HSA_STATUS_ERROR_EXCEPTION 0x1016 → SIGABRT in all TP scheduler subprocesses. AITER's IPCBufferPool change in v0.1.12.post1 only widened the race window; it did not introduce the bug.
    • Fix: upgrade to ROCm ≥ 7.2.1 (the runtime fix is in 7.2.1 and later, no AITER or PyTorch rebuild needed). Recommended base: rocm/pytorch:rocm7.2.2_ubuntu22.04_py3.10_pytorch_release_2.9.1.
    • Alternative workaround (only if pinned to ROCm 7.2.0): set SGLANG_USE_AITER_AR=false (sgl-project/sglang#23581 made this the SGLang default). AITER attention / MoE / fused RMSNorm paths remain enabled.
    • This is not an AITER-side bug, so v0.1.13 ships no code change for it. Tracking issue #2941 closed accordingly.

Upgrade Checklist (from v0.1.12.post2)

# 1. Install rc1 wheel (--pre needed because rc1 is PEP 440 pre-release)
pip install --pre --force-reinstall https://github.com/ROCm/aiter/releases/download/v0.1.13-rc1/amd_aiter-0.1.13rc1+rocm7.2.manylinux.2.28-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl

# 2. Verify
python -c "import aiter; print(aiter.__version__)"  # expect: 0.1.13rc1+rocm7.2.manylinux.2.28

# 3. Single-request smoke before opening to traffic
# (use your existing serving stack with ONE short request first, watch for HSA exceptions or shape errors)

Note: rc1 wheel filenames omit the .torch210 suffix that v0.1.12.post2 wheels carried. The torch ABI is unchanged — wheels are still built against torch==2.10.0+rocm7.1 (verified in build logs). Filename suffix will return in a future release.


What's new vs v0.1.12.post2

release/v0.1.13 is based on commit e039817a (selected from ATOM benchmark dashboard — ATOM commit 4f3c57970c showed 14/14 accuracy pass on this AITER head) plus 4 cherry-picked fixes:

  • #2893 fix(fmha): support >4GB KV cache in batch prefill via runtime dispatch
  • #2901 Fix top_k_per_row_prefill error when batched_token_num > 4096
  • #2904 revert gptoss tuned config (rolls back a perf-regressing tune)
  • #2875 ci(release): torch_pin + torch_index_url workflow inputs (build infra)

Total delta from v0.1.12.post2: ~118 commits in release/v0.1.13 base + 3 surgical fixes on top.

ATOM accuracy validation (GSM8K 3-shot, MI355X 8-GPU)

All 5 tracked models pass thresholds on this rc:

| Model | Score (flex-extract) | Threshold | Result |
| --- | --- | --- | --- |
| DeepSeek-R1-0528 | 0.9553 | 0.94 | PASS |
| Kimi-K2.5-MXFP4 | 0.9340 | 0.93 | PASS |
| MiniMax-M2.5 | 0.9325 | 0.92 | PASS |
| GLM-5-FP8 | 0.9386 | 0.93 | PASS |
| Qwen3-235B-A22B-FP8 | 0.8772 | 0.87 | PASS |

Test environment: rocm/atom-dev:latest container, AITER 0.1.13rc1+rocm7.2, lm_eval local-completions with tokenized_requests=False, num_concurrent=65.

CI validation (release/v0.1.13)

  • aiter-test (Standard Tests 10/10) — PASS
  • sglang_downstream — PASS
  • vLLM Benchmark — PASS (no regression vs v0.1.12.post2)
  • atom-test (DeepSeek-R1-0528 MI300X+MI355X, gpt-oss-120b MI355X) — PASS
  • Flash Attention Integration — PASS
  • OPUS Test — PASS
  • Black + Ruff — PASS

Wheels

6 prebuilt wheels with PREBUILD_KERNELS=1 for gfx942 (MI300X/MI325X) + gfx950 (MI355X), manylinux_2_28 ABI, torch==2.10.0+rocm7.1 pin (matches vllm/vllm-openai-rocm:v0.19.1 PyTorch ABI):

| ROCm | Python 3.10 | Python 3.12 |
| --- | --- | --- |
| 7.2 | amd_aiter-0.1.13rc1+rocm7.2.manylinux.2.28-cp310-... | amd_aiter-0.1.13rc1+rocm7.2.manylinux.2.28-cp312-... |
| 7.1 | ...+rocm7.1.manylinux.2.28-cp310-... | ...+rocm7.1.manylinux.2.28-cp312-... |
| 7.0 | ...+rocm7.0.manylinux.2.28-cp310-... | ...+rocm7.0.manylinux.2.28-cp312-... |

What we need from downstream

  • vLLM ml-ci-internal: smoke test DeepSeek-R1, gpt-oss-120b, Llama serving paths
  • SGLang downstream: integration test against rc1 wheel
  • ATOM standalone: confirm production benchmarks unchanged

If smoke is green, we'll cut v0.1.13 final from the same source commit. If issues surface, fix → rc2.

Tagged from release/v0.1.13 HEAD = 930c94120459bb352e1d7c68349b331b06397280.

AITER v0.1.12.post2

23 Apr 21:34 · commit 28a7b6a

Post-release fix for v0.1.12 addressing two production blockers and a wheel ABI compatibility issue with vLLM containers.

Key Fixes

  • Fix c10_hip_check_implementation undefined symbol (#2843, #2846 by @lingpeng-jin)
    Removed C10_HIP_KERNEL_LAUNCH_CHECK() from gated_rmsnorm_quant_kernels.cu — the macro signature drifted between PyTorch versions and broke loading in vllm/vllm-openai-rocm containers.

  • Fix DSR1 + MI300X "GEMM is not supported" crash (#2864, vllm#39485, #2645 by @eppaneamd)
    Cherry-picked GEMM dispatch fix from main; surfaces the correct kernel for DeepSeek-R1 on gfx942 in vLLM serving.

  • manylinux2_28 + torch 2.10 ABI pin
    All wheels now built on pytorch/manylinux2_28-builder (AlmaLinux 8 + devtoolset-11) with torch==2.10.0+rocm7.X pin. Resolves:

    • GLIBCXX_3.4.32 symbol mismatch with vllm/vllm-openai-rocm:v0.19.1 (Ubuntu 22 base)
    • c10::cuda::getCurrentCUDAStream(signed char) namespace masquerade drift between PyTorch 2.10 and 2.11

Validation

End-to-end DeepSeek-R1 serving in vllm/vllm-openai-rocm:v0.19.1:

  • MI300X (gfx942) — PASS
  • MI355X (gfx950) — PASS

Both architectures served from the same wheel.

Wheels

6 prebuilt wheels with PREBUILD_KERNELS=1 for gfx942 (MI300X/MI325X) + gfx950 (MI355X):

| ROCm | Python 3.10 | Python 3.12 |
| --- | --- | --- |
| 7.2 | amd_aiter-0.1.12.post2+rocm7.2.manylinux_2_28.torch210-cp310 | amd_aiter-0.1.12.post2+rocm7.2.manylinux_2_28.torch210-cp312 |
| 7.1 | amd_aiter-0.1.12.post2+rocm7.1.manylinux_2_28.torch210-cp310 | amd_aiter-0.1.12.post2+rocm7.1.manylinux_2_28.torch210-cp312 |
| 7.0 | amd_aiter-0.1.12.post2+rocm7.0.manylinux_2_28.torch210-cp310 | amd_aiter-0.1.12.post2+rocm7.0.manylinux_2_28.torch210-cp312 |

Install

pip install amd_aiter-0.1.12.post2+rocm7.2.manylinux_2_28.torch210-cp312-cp312-linux_x86_64.whl

Built from commit 28a7b6a5c85f4f34d266c8bbb9af6045b93f28fd (tag v0.1.12.post2).

AITER v0.1.12.post1

11 Apr 20:08 · commit 7b57073

Post-release fix for v0.1.12 with 22 additional commits.

Key Fix

  • Fix fused_qk_rmsnorm_group_quant stride check — relaxes is_contiguous() to stride(1)==1 (last-dim contiguous), fixing crashes on DeepSeek MLA architecture models via ATOM (#2671)
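
In torch terms, the relaxed check accepts any tensor whose last dimension is contiguous rather than requiring the whole tensor to be contiguous, roughly as below (an illustration of the condition only, not the kernel's actual code):

# A column-sliced view is not contiguous overall, but its last dim still has stride 1.
import torch

x = torch.randn(256, 128)[:, :64]
assert not x.is_contiguous()   # the old check would have rejected this layout
assert x.stride(1) == 1        # the relaxed check from #2671 accepts it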

Accuracy Validation (gsm8k 3-shot, ATOM upstream methodology)

All 5 models pass ATOM CI accuracy thresholds on MI355X (8x GPU):

| Model | TP | flexible-extract | Threshold | Baseline | Result |
| --- | --- | --- | --- | --- | --- |
| DeepSeek-R1-0528 | 8 | 0.9568 | 0.94 | 0.9553 | PASS |
| GLM-5-FP8 | 8 | 0.9439 | 0.93 | 0.9545 | PASS |
| Kimi-K2.5-MXFP4 | 4 | 0.9378 | 0.93 | 0.9409 | PASS |
| MiniMax-M2.5 | 2 | 0.9431 | 0.92 | 0.9401 | PASS |
| Qwen3-235B-A22B-Instruct-2507-FP8 | 8 | 0.9052 | 0.87 | 0.909 | PASS |

Test environment: ATOM 0.1.3.dev52, AITER 0.1.12.post1+rocm7.2.1, lm-eval with local-completions, tokenized_requests=False, num_concurrent=65.

Wheels

6 prebuilt wheels with PREBUILD_KERNELS=1 for gfx942 (MI300X/MI325X) + gfx950 (MI355X):

| ROCm | Python 3.10 | Python 3.12 |
| --- | --- | --- |
| 7.2.1 | amd_aiter-0.1.12.post1+rocm7.2.1-cp310 | amd_aiter-0.1.12.post1+rocm7.2.1-cp312 |
| 7.1.1 | amd_aiter-0.1.12.post1+rocm7.1.1-cp310 | amd_aiter-0.1.12.post1+rocm7.1.1-cp312 |
| 7.0.2 | amd_aiter-0.1.12.post1+rocm7.0.2-cp310 | amd_aiter-0.1.12.post1+rocm7.0.2-cp312 |

Install

pip install amd_aiter-0.1.12.post1+rocm7.2.1-cp312-cp312-linux_x86_64.whl

Built from commit 7b57073 (tag v0.1.12.post1).

AITER v0.1.12-rc1 (Pre-release)

10 Apr 20:25

AITER v0.1.12 Release Notes

Release Date: 2026-04-10
Previous Release: v0.1.11.post1 (2026-03-05)
Commits: 334 (excluding release branch maintenance)
Supported GPU Architectures: gfx942 (MI300X/MI325X), gfx950 (MI355X)

Highlights

OPUS Migration: Replacing CK Tile Primitives.
A major effort migrated internal kernel code from CK Tile APIs to the new OPUS (Operator Utility for Shader) abstraction layer. This includes replacing CK Tile in activation kernels (#2589), HIP kernels (#2533), allreduce (#2107), and type conversion primitives (#2331). OPUS also gained tiled scaled MFMA (#2384), finfo class (#2330), cast/numeric_limits enhancements (#2110), moe_sorting_opus (#2077), gfx950 smem transpose load (#2480), and comprehensive unit tests (#2017, #2040, #2127). This migration decouples AITER from CK internals and establishes OPUS as the portable device-code foundation.

FlyDSL Integration for MoE Kernels.
FlyDSL, AMD's high-performance domain-specific language, is now a first-class AITER dependency. Initial A4W4 MoE kernel support was imported (#2113) and enhanced (#2390), split-k GEMM was added (#2536), A4W4 MoE kernels were optimized (#2581), correctness and precision issues in split-k HGEMM were fixed (#2567), and FlyDSL was added to install requirements (#2430). The dependency was upgraded to v0.1.2 (#2635).

MLA (Multi-head Latent Attention) Enhancements.
MLA received extensive feature additions: HipKittens-based nhead=128 kernel (#2039), gfx950 A8W8 qh32 kernel (#1912), MLA persistent kernel LSE output (#2440), LSE-aware dispatch (#2378), FP8 return-LSE support (#2144), metadata split reference code (#2177), fast metadata update for decoding (#2215), and MI350-specific PS mode improvements including nhead=8/mtp=4 (#2461) and nhead64-to-nhead32 folding (#2570). Multiple NaN and accuracy bugs were also fixed (#2106, #2128, #2319).

ctypes Kernel Binding Refactoring.
Kernel dispatch was systematically migrated from pybind11 to ctypes, reducing build complexity and improving JIT build reliability. This includes the foundational ctypes binding refactor (#2255), paged attention ctypes migration (#2395), MoE ASM ctypes migration (#2341), int64 ctypes support (#2486), and a fix for ctypes JIT build issues with asm_topksoftmax (#2603).

CK Dependency Removal for FMHA.
Flash MHA forward (#2353) and backward v3 (#2250) kernels had their Composable Kernel dependencies removed, and a build-time ENABLE_CK option was added (#2074) enabling fully CK-free builds. The torch dependency was also removed from the MHA shared library build (#2501). These changes reduce build times and external dependency complexity.

Warp Size Generalization.
HIP kernels were updated to support variable warp sizes rather than hardcoding warp_size=64. A WARP_SIZE macro was added to the common header for both host and device use (#2525), and topksoftmax, grouptopk, cache, and sample kernels were updated (#2599). This is essential for cross-architecture portability between CDNA (warp_size=64) and future targets.

Allreduce Refactoring and Fusion.
The custom allreduce path was refactored to support prefill-phase collective operations (#2453), an allreduce+rmsnorm+quant fusion pass was added (#1990), GPT-OSS-120B hidden_size=2880 support was enabled in fused allreduce rmsnorm (#2329), numerical accuracy was improved (#2586), a double-buffer option was added for cross_device_reduce_1stage (#2064), and CUDA graph capture compatibility was fixed (#2075).

Sage Attention v2 and Flash Attention Improvements.
Triton-based Sage Attention v2 received multiple updates: MXFP4 Q*K support (#2066), optimizations (#2045), stride fixes (#2117), mask fix (#2158), and a consolidated patch (#2240). Flash Attention v3 gained hipgraph support for KV cache (#2096), configurable Triton configs via environment variable (#2000), Windows build support (#2433), and integration CI (#1974).

RDNA Architecture Support.
AITER expanded beyond data center GPUs with gfx1150/1151 RDNA arch registration (#2014), improved RDNA config selection for Flash Attention (#2397) and general kernels (#2402), and RDNA CI infrastructure (#2222).

New Features

Attention & MLA

  • Introduce HipKittens-based nhead=128 MLA kernel (#2039)
  • Add gfx950 MLA A8W8 qh32 kernel (#1912)
  • Add LSE output support for MLA decode qseqlen=1 persistent kernel (#2440)
  • Add LSE-aware kernel dispatch for MLA (#2378)
  • MLA PS mode FP8 with return LSE for nhead=128,4 (#2144)
  • MLA PS mode add metadata split reference code (#2177)
  • Add decode_update_mla_metadata_v1 for fast metadata update in decoding (#2215)
  • MI350 MLA PS mode support nhead=8, mtp=4 (#2461)
  • MI350 MLA PS mode fold nhead64,2 to nhead32,4 kernel (#2570)
  • Add head_num=40 for MLA FP8 reduce kernel for Qwen3.5 (#2481)
  • Upload mla_a8w8_qh64_qseqlen4_gqaratio16 config for MI300 (#2042)
  • Add FP8 hdim=256 tile for batch prefill kernel (#2549)
  • Support per_block for PA PS kernels (#2053)
  • Add sliding window support for Triton sink attention (#2505)
  • CK MHA backward: add sink attention score gradient support (#2321)
  • MHA forward v3 hdim128 support per-tensor FP8 for MI300/MI308 (#2105)
  • CK Tile FMHA backward use persistent kernels in deterministic mode (#2216)
  • Optimize flash attention forward (#2265)
  • Sage Attention v2: Q*K in MXFP4 (#2066)
  • Sage Attention v2 patch (#2240)
  • Hipgraph support for fav3 KV cache (#2096)
  • Add FLASH_ATTENTION_FWD_TRITON_AMD_CONFIG_JSON env var support (#2000)
  • Flash Attention Triton Windows build support (#2433)

MoE Kernels

  • FlyDSL A4W4 MoE kernel import (#2113)
  • FlyDSL A4W4 MoE support and kernel update (#2390)
  • FlyDSL A4W4 MoE kernel optimization (#2581)
  • Add FlyDSL split-k GEMM (#2536)
  • Triton smoothquant int8 MoE kernel (#2049)
  • Introduce ASM 64x256 kernels for MI300 (#2404)
  • Introduce 64x256 fmoe kernels (#2279)
  • Support topk_softmax with shared expert scoring function (#2356)
  • Group_topk: moe_fused_gate support non-power-of-2 experts (192/96) (#2604)
  • Update topk.py to support non-power-of-2 experts (Kimi-K2) for long contexts (#2359)
  • Add moe_smooth_per_token_scaled_quant v1 and v2 (#2295)
  • Add ASM topsoftmax support 384x8 (#2130)
  • Support strided gating_score for topk_softmax (#2124)

GEMM

  • Add FP8 blockscale ASM kernel (#2142)
  • CK Tile A8W8 blockscale GEMM with preshuffleB support (#1954)
  • Add Triton A8W8 split-k support (#2180)
  • Add compiler configurations for bpreshuffle CK Tile modules (#2069)
  • Add 32x128 and 64x128 ASM kernels for Qwen3-next TP4 (#2285)
  • Add precision fix and gelu kernels for 64x256 (#2471)
  • MI325 support gfx942 i8gemm tilesize 112x256 (#2006)
  • Add igemm kernel for MI325 (#1968)
  • Enable hipblaslt FP8 tuning (#2212)
  • Add f32 MFMA support for 32x32x2f32 and 16x16x4f32 (#2070)
  • Add fast gelu activation (#2220)

Fused Kernels

  • Add fused_qk_norm_group_quant kernel (#2527)
  • Add fused_qknorm HIP kernel (#2442)
  • Fuse RMS + RoPE + block quantization kernel (#2027)
  • Optimize fused_qk_norm_rope_blkquant kernel (#2206)
  • Add allreduce+rmsnorm+quant fusion pass (#1990)
  • Support GPT-OSS-120B hidden_size=2880 in fused allreduce rmsnorm (#2329)
  • Add mhc_post HIP kernel (#2479)
  • Add mhc_pre HIP kernel (mhc_pre_gemm_sqrsum, mhc_pre_big_fuse) (#2136)
  • Add fused_qk_norm_rope_cache_quant rotary_dim parameter (#2199)
  • Top-K Top-P sampling kernel optimization (#2034)

RDNA Support

  • Adding gfx1150/51 to RDNA arch (#2014)
  • Improve RDNA config selection for Flash Attention (#2397)
  • Improve config selection for RDNA GPUs (#2402)

Other Features

  • HIP causal conv1d decode kernel (#2084)
  • PA decode gluon AOT C++ API (#2085)
  • Support naive mrope in get_rope (#2292)
  • Support FP8/MXFP4-quantized activation dtype (#2188)
  • Support value_cache 5D shuffle layout with GPT-OSS-120B precision tests (#2217)
  • Generate KV prefix preshuffle (#2288)
  • Support dim(-1) allgather (#2162)
  • Add ep, pp, dp group interface (#2137)
  • Respect AITER_LOG_LEVEL for C++ stdout prints (#2086)
  • Identify device name by chip ID (#2325)
  • Support comments in tuned config CSV files (#2422)
  • Defer expensive build operations to build_ext.run() (#1973)
  • Hipgraph support: correct arg_size type from int to size_t (#2163)
  • Add double-buffer option for cross_device_reduce_1stage (#2064)
  • Use unreg path for custom all-reduce during CUDA graph capture (#2075)

Performance

Tuned Configs

  • Add Kimi-K2.5 tuned configs for MI355X (#2619)
  • Add DSv3-MXFP4 tuned configs for MI355X (#2616)
  • Retune Kimi K2 MoE configs (#2625)
  • Replace CK MoE config in TP4 configs (#2626)
  • Add GLM-5 tuned configs (#2518)
  • Add Qwen3.5 FP8 and A8W8 blockscale GEMM tuned configs (#2324)
  • Tuned Qwen3.5 GEMM (#2485)
  • Add tuned CSV files for GEMM and MoE to accelerate Kimi-K2 (#2290)
  • Add MI355X (gfx950) tuned GEMM configs for FP4 and FP8 shapes (#2037)
  • Tune 493 new FP4 GEMM shapes for LLM inference (#2092)
  • Add new GEMM configuration files for various matrix sizes (#2024)
  • GEMM and MoE tuning for DeepSeek-R1 InferenceX FP4 (#2261)
  • Tune Triton GEMM kernel for MI355 DSV3 DP+EP configuration (#2016)
  • MI325 igemm ASM tuning (#2125)
  • Add blockPerCu support for CK Tile GEMMs and CK Tile MoE tuning (#2313)
  • Update dsv3 ptpc A8W8 GEMM config (#2253)
  • Add GEMM-A16W16-ATOMIC-N=256-K=6144 Triton GEMM tune config (#2213)
  • Update gfx950 PA PS kernels and wire stride_scale_blk in asm_pa (#2569)
  • Update gfx942 PA PS kernels and wire stride_scale_blk in asm_pa (#2522)
  • Add more MoE/GEMM configs (#2506)
  • Fix MoE stage2 tune config (#2438)
  • Fix MoE GEMM tuned config (#2463)
  • Remove duplicate tuned configs (#2219)
  • Add FlyDSL split-k GEMM with Kimi-2 BF16 tuned config (#2536)
  • Fix GEMM test failures and retune with latest Triton (#2434)

Kernel Optimizations

  • Optimize prefill A4W4 MoE (#2233)
  • Optimized fused split GDR decode (#2326)
  • Optimize _moe_mxfp4_sort_kernel to reduce Triton recompilation (#2414)
  • Triton fav3...

v0.1.9

23 Dec 03:12 · commit 2d71438

Full Changelog: v0.1.8...v0.1.9

v0.1.7

11 Nov 02:38 · commit de14bec