Releases: ROCm/aiter
AITER v0.1.14-rc0
AITER v0.1.14-rc0
First release candidate for v0.1.14, cut from main at commit 759589676 ("[Triton] [ATOM] DSV4 fusions phase 1 (#3057)"). Per-rc cherry-picks (#3163 minimax fused qknorm+allreduce, #3189 follow-up) deferred to rc1.
Highlights
- DSv4 fusions phase 1 (#3057) — first batch of Triton/ATOM-side DSv4 fusions on top of v0.1.13.
- Validated 5/5 production models for accuracy on GSM8K 3-shot (flexible-extract).
- Kimi-K2.5-MXFP4 unblocked when paired with ATOM containing PR #670 (kwargs upgrade for
aiter.fused_qk_rmsnorm); see Known Issues for ABI pairing details.
Validation (GSM8K 3-shot, flexible-extract)
| Model | Score | Threshold | Result | Notes |
|---|---|---|---|---|
| DeepSeek-R1-0528 (TP=8, fp8 KV) | PASS | 0.94 | PASS | mi355-gpu-15 |
| MiniMax-M2.5 (TP=2, fp8 KV) | PASS | 0.92 | PASS | with HSA_NO_SCRATCH_RECLAIM=1 |
| Qwen3-235B-A22B-FP8 (TP=8, fp8 KV) | PASS | 0.87 | PASS | with ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION=1 |
| GLM-5-FP8 (TP=8, fp8 KV) | PASS | 0.93 | PASS | mi355-gpu-15 |
| Kimi-K2.5-MXFP4 (TP=4, fp8 KV) | 0.9303 | 0.93 | PASS | requires ATOM with PR #670 (kwargs upgrade) |
Wheel Matrix
6 wheels for ROCm 7.0 / 7.1 / 7.2 × Python 3.10 / 3.12, manylinux_2_28 ABI. Built against torch==2.10 (rocm7.0/7.1) or torch==2.11 (rocm7.2). Fat binary covers gfx942 (MI300/MI325X) + gfx950 (MI350/MI355X).
Install
pip install https://github.com/ROCm/aiter/releases/download/v0.1.14-rc0/<wheel-filename>

flydsl==0.1.7 is auto-resolved as a runtime dep.
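A quick post-install sanity check, mirroring the verify step used in the rc1/rc2 notes (the exact version string printed depends on the wheel you installed):

```python
# Sanity check that the rc0 wheel imports cleanly and reports its version.
import aiter
print(aiter.__version__)
```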
Known Issues
- Kimi-K2.5-MXFP4 requires ATOM with PR #670 merged for the kwargs upgrade to `aiter.fused_qk_rmsnorm`. ATOM nightly tags from 2026-05-14 onward include the fix; older ATOM containers will hit `AttributeError: 'float' object has no attribute 'size'` in the MLA path. Tracking: #3177
- The rocm7.2 wheel was built against the torch 2.11 ABI, but most current ATOM containers ship torch 2.10. For rocm7.2 deployments today, install the rocm7.1 wheel, which uses the torch 2.10 ABI (validated PASS on Kimi). The rocm7.2-native wheel becomes useful once torch 2.11 ATOM images land.
What's Next (rc1)
Cherry-picks pending from Markus's must-list:
- #3163 (4df783a) minimax fused qknorm+allreduce kernel
- #3189 (open) grid-strided loop on top of #3163
Final v0.1.14 expected end of next week (~2026-05-22).
Cumulative Changes since v0.1.13
19 commits land in v0.1.14-rc0 vs v0.1.13. Highlights grouped by area:
DSv4 / Triton-ATOM fusions
MoE / FlyDSL
- FlyDSL MXFP4 rounding alignment (#3153)
- FlyDSL GDR decode kernel optimize (#3135)
- FlyDSL xcd remap v2 (#3134)
- FlyDSL per-kernel parallelism + AOT pool size (#3133)
- silu_and_mul_quant + Opt silu_and_mul (#3145)
Triton
- mHC-post: post-stream + res-stream mixing (#2920)
- Triton blockscale num_stages pipelining (#3136)
- Triton s_barrier sync waves (#3132)
- feat(triton/rope): fused QKV split + QK RMSNorm + RoPE + paged KV (#2902)
- Triton bench_gmm.py bug fix (#3154)
Bugfixes
- fix gather mem violation (#3182)
- [Bugfix][Triton] Honor `transpose_bm` in `batched_gemm_a16wfp4_` fake tensor (#3166)
- `qk_rmsnorm_group_quant`
- refactor hip kl (-30% build time) (#3137)
CK_TILE
Docs / Refactor
v0.1.13
AITER v0.1.13
Production release of the v0.1.13 line. Same commit as v0.1.13-rc5 (cdcfa833b) after 5 RC iterations.
Highlights
- DeepSeek R1 / GPT-OSS / Kimi / GLM-5 enablement maturing on MI300X / MI325X (gfx942) and MI350 / MI355X (gfx950)
- New ASM fmoe kernels for gfx950 that bypass bf16→fp8 quantization, gated by the `AITER_XBFLOAT16=1` env var (default off, opt-in for safety) (#2262)
- Substantial MLA improvements: MI350 MLA PS mode for new shapes (#2727, #2729, #2676), MLA PS mode for nhead=8/2 on MI308 (#2852), nhead=32 non-persistent decode crash fix on gfx950 (#2983)
- FMHA / paged attention: runtime dispatch for >4 GB KV cache in batch prefill (#2893), top_k_per_row prefill fix for `batched_token_num > 4096` (#2901), gfx942/gfx950 PA PS kernel update with `stride_scale_page` write (#2796)
- RDNA4 expansion: FP8 support for `gfx1200`/`gfx1201` (#2621), FlyDSL `flash_attn_func` backend for `gfx1201` (R9600D) — first RDNA4-class attention backend in AITER (#2969 on main, included via baseline)
- Triton kernel additions: Gluon-optimized MoE Int8 SmoothQuant kernel for small K (#2441), Triton fallback for MI455 GPT-OSS / DSFP4 (#2657), GLM-5 70k+300 GEMM configs for gfx942 (#2743)
- FlyDSL maturity: BF16 GEMM tuned configs added/retuned for 6 models (#2733), AOT defaults via `AITER_CONFIGS` (#2756), if/else compatibility across versions (#2740), updated FlyDSL version pin
- Bulk silo merge — kernel fixes and tuned configs in preparation for the v0.1.13.post1 line (#3004, #3005, #3024)
- Quality of life: pandas FutureWarning suppressed and pybind11 type hint mismatch fixed (#2980), Linux import errors no longer swallowed (#3049), `std::unordered_map` replaced with `SynchronizedCache` for thread safety (#2221), ctypes C-ABI error bridging to prevent worker crashes during kernel build (#2498)
Validation (mi355-gpu-15, GSM8K 3-shot, flexible-extract)
| Model | Score | Threshold | Result |
|---|---|---|---|
| DeepSeek-R1-0528 | 0.9454 | 0.94 | PASS |
| MiniMax-M2.5 | 0.9295 | 0.92 | PASS |
| Qwen3-235B-FP8 | 0.8802 | 0.87 | PASS |
Wheel Matrix
6 wheels for ROCm 7.0 / 7.1 / 7.2 × Python 3.10 / 3.12, manylinux_2_28 ABI. Built against torch==2.10 (rocm7.0/7.1) or torch==2.11 (rocm7.2). Fat binary covers gfx942 (MI300/MI308/MI325X) + gfx950 (MI350/MI355X).
Cumulative Changes since v0.1.12.post2
149 commits land in v0.1.13 vs v0.1.12.post2. Full list available via git log v0.1.12.post2..v0.1.13. Highlights grouped by area:
MoE / FlyDSL kernels (44 commits)
- ASM fmoe kernels for gfx950 with bf16→fp8 quantization bypass (#2262)
- FlyDSL A8W4 MoE update (#2726)
- GPT-OSS small-M MoE optimizations (#2775)
- Kimi-K2.5 MoE tuned configs revert for batch sizes 32/64 (#2836) — Kimi int4 a16wi4 MoE (#2863) deferred to v0.1.13.post1
- Triton Gluon-optimized MoE Int8 SmoothQuant for small K (#2441)
- MoE tuner fixes (#2831, #2785, #2723)
- fused_dynamic_mxfp4_quant_moe_sort_hip added (#2620, fix #2759)
- CK_TILE bpreshuffle compile failure fix (#2811)
- Bulk silo merge tuned configs and kernel fixes (#3004, #3005, #3024)
- moe_routing_sigmoid_top1_fused tie-breaking fix (#2750)
MLA / Multi-head Latent Attention (9+ commits)
- MI350 MLA PS mode support for new shapes (nhead 128,1 / 128,2 / 128,3 / 128,4 / 64,4 / 64,2 / 32,4) via `mla_a16w16_qh32_qseqlen4_gqaratio32_ps.co` (#2727)
- gfx950 fp8 decode native qh32 qseqlen2 MLA PS kernel (#2676) and qh64 nhead=64 native kernel (#2636)
- bf16 MLA decode kernel for gqa_ratio=64, qseqlen=1 (non-persistent) (#2729)
- MLA PS mode nhead 8/2 on MI308 (#2852)
- MLA Reduce and Metadata kernel rewritten with OPUS template (#2717)
- gfx950 nhead=32 non-persistent decode crash fix (#2983)
- OPUS lib improvements for MLA: mma step_k, dword copy via `set_slice`, and inline asm for `tr_load` (#2652)
FMHA / Paged Attention
- Runtime dispatch for >4 GB KV cache in batch prefill (#2893)
- top_k_per_row prefill fix for `batched_token_num > 4096` (#2901)
- gfx942/gfx950 PA PS kernel update with `stride_scale_page` write in asm_pa (#2796)
- fmha_fwd_v3: silence false warning when `use_asm_v3` is disabled (#2744)
- `indexer_k_quant_and_cache` preshuffled layout support (#2879)
- car prefill kernel error fix for SGLang (#2745)
Triton path
- Gluon-optimized MoE Int8 SmoothQuant kernel for small K (#2441)
- Triton MHA UT reduction (#2612)
- Adapt model benchmarking scripts to the new `bench_mha.py` CLI (#2673)
- Triton fallback for MI455 GPT-OSS and DSFP4 (#2657)
- GLM-5 70k+300 GEMM configs for gfx942 (#2743)
- Triton MoE GEMM shared memory exhaustion fix by reducing stage count (#2723)
- Drop GLM5 Triton tuned GEMM (#2803)
FlyDSL
- BF16 GEMM configs added/retuned for 6 models (#2733)
- `AITER_CONFIGS` for FlyDSL AOT defaults (#2756)
- `if const_expr` introduction (#2776)
- if/else compatibility across versions (#2740)
- A8W4 MoE update (#2726)
- bf16 GEMM implementation and tuned config update (#2634)
- A8W8 FlyDSL tune fix (#2809)
- Linear attention rebase for new FlyDSL version (#2746)
Architecture enablement
- RDNA4 (gfx1200/gfx1201): FP8 support added (#2621)
- MI355X (gfx950): continued maturation across MoE, MLA, FMHA paths
- MI350 (gfx950): MLA PS mode coverage expanded
- MI308 (gfx942): MLA PS mode nhead 8/2 (#2852), i8gemm tuning (#2590)
- MI300X (gfx942): gemma rmsnorm quant fusion (#2853), `gemm_a16w16` torch tune (#2860)
Quality and safety fixes
- pandas FutureWarning suppression and pybind11 type hint mismatch (#2980)
- Linux import errors no longer swallowed (#3049)
- `std::unordered_map` → `SynchronizedCache` for thread safety in CK paths (#2221)
- ctypes C-ABI error bridging to prevent worker crashes during kernel build (#2498)
- `fused_qk_norm_group_quant` stride error check fix (#2637)
- `fused_dynamic_mxfp4_quant_moe_sort_hip` EP fix (#2759)
- `fused_gemm_afp4wfp4_a16w16` LDS exhaustion fix under ASYNC_COPY (#2784)
- opus.hpp build time optimization kernel template (single-header C++ template, up to 61x faster builds vs standard torch extension)
Release engineering
- `torch_pin` + `torch_index_url` workflow inputs for release-build CI (#2875)
- manylinux_2_28 wheel matrix standardized: ROCm 7.0/7.1/7.2 × Python...
v0.1.13-rc5
AITER v0.1.13-rc5
Fifth release candidate for v0.1.13, focused on adding asm_fmoe kernels for gfx950 (no bf16->fp8 quantization required) while removing RC4's Kimi int4 MoE changes.
Changes vs RC4
Reverted:
- kimi a16wi4 moe support (#2863) — deferred to v0.1.14
- fix splitk buffer dispatch (#3050) — only needed by #2863
Cherry-picked from main:
- Introduce asm fmoe kernels that do not require bf16->fp8 quantization (#2262) — new gfx950-only kernels behind the `AITER_XBFLOAT16=1` env var (default off)
- [Bugfix] Suppress pandas FutureWarning and fix pybind11 type hint mismatch (#2980)
Validation (mi355-gpu-15, GSM8K 3-shot, flexible-extract)
| Model | Score | Threshold | Result |
|---|---|---|---|
| DeepSeek-R1-0528 | 0.9454 | 0.94 | PASS |
| MiniMax-M2.5 | 0.9295 | 0.92 | PASS |
| Qwen3-235B-FP8 | 0.8802 | 0.87 | PASS |
Wheel Matrix
6 wheels for ROCm 7.0 / 7.1 / 7.2 x Python 3.10 / 3.12, manylinux_2_28 ABI. All built against torch==2.10 (rocm7.0/7.1) or torch==2.11 (rocm7.2 - torch 2.10 was removed from PyTorch's rocm7.2 index).
Cherry-pick Audit Summary (PR #2262)
PR #2262 introduces a new code path that is off by default and gfx950-only:
- Triple-gated: `quant_type==per_1x128` + gfx950 + `AITER_XBFLOAT16=1` env var
- Public C++ API unchanged
- New `*_pybind.cu` shims and pre-compiled `.co` HSA binaries for gfx950
- Zero merge conflicts on release/v0.1.13
- No follow-up correctness fixes on main
Existing MI300/MI308/MI450 deployments and unset-env MI355X deployments are unaffected.
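As a usage sketch of the opt-in (these notes do not specify exactly when AITER reads the variable, so setting it before the serving process starts, or before importing aiter, is the safe choice):

```python
import os

# Opt in to the bf16-native asm fmoe path on gfx950 (default off, per the triple gate above).
os.environ["AITER_XBFLOAT16"] = "1"

import aiter  # fmoe dispatch on gfx950 with per_1x128 quant can now take the new kernels
```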
v0.1.13-rc2
AITER v0.1.13-rc2
Release candidate 2 for v0.1.13. Pre-release — please smoke-test downstream and report back before final tag.
Changes since rc1
5 cherry-picks onto release/v0.1.13:
Bug Fixes
- #2983 — [MLA] Fix nhead=32 non-persistent decode crash on gfx950: corrects the decode dispatch condition for MLA attention when `nhead=32` (e.g., Kimi-K2.5). Without this fix, gfx950 takes the non-persistent path and crashes during decode.
- #2879 — Support preshuffled layout in `indexer_k_quant_and_cache`: adds preshuffled weight layout support to blockscale GEMM and the KV cache indexer, fixing a blocker for DI/SA inference paths.
New Features
- #3005 — [Silo] Bulk merge kernel fixes + features: adds 5 new Triton kernels — `causal_conv1d_update_single_token`, `fused_rearrange_sigmoid_gdr`, `fused_fp8_quant`, `pa_mqa_logits`, and gated delta rule decode optimizations. Includes corresponding op tests.
Config & Tuning
- #3004 — [Silo] Bulk merge tuned configs: adds MI355X (gfx950) tuned configs for Kimi-K2, GLM-4.7, and Qwen3-Next-80B across GEMM and FMoE kernels.
- #3024 — [Silo] Add configs missing from bulk merge #3004: adds 6375 MI355X GEMM tunings for DeepSeek-V3.2 plus MiniMax-M2.5 FMoE tunings. Deduplicates cross-file shape collisions (best us per shape wins; see the sketch below).
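The dedup rule amounts to a "keep the fastest tuning per shape" pass over the merged CSVs. A minimal sketch of that idea (the column names `M`, `N`, `K`, `us` and the file names are illustrative placeholders, not the exact schema of AITER's tuned-config files):

```python
import pandas as pd

# Merge tuned-config CSVs from multiple sources, then keep only the best (lowest-latency)
# entry per GEMM shape.
merged = pd.concat([pd.read_csv(p) for p in ["configs_a.csv", "configs_b.csv"]])
best = (merged.sort_values("us")                              # lower microseconds is better
              .drop_duplicates(subset=["M", "N", "K"], keep="first"))
best.to_csv("tuned_configs.csv", index=False)
```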
Files changed (rc1 → rc2)
- 44 files, +13k / -2k lines
- 12 new CSV config files / updates
- 5 new Triton kernels + 3 new test files
- 2 C++ kernel files (MLA + cache)
Compatibility Matrix
| Component | Requirement |
|---|---|
| Container ABI | vllm/vllm-openai-rocm:v0.19.1 (Ubuntu 22, glibc ≤ 2.35, libstdc++ ≤ GLIBCXX_3.4.30) |
| PyTorch | torch==2.10.0+rocm7.1 (matches vllm-openai-rocm:v0.19.1; wheels are ABI-pinned to this build) |
| GPU arch | gfx942 (MI300X / MI325X), gfx950 (MI355X) |
| ROCm | 7.0 / 7.1 / 7.2 (pick wheel matching your runtime) |
| Python | 3.10 / 3.12 |
| vLLM | Recommend latest main with PR vllm-project/vllm#40754 merged. |
| SGLang | Recommend ≥ v0.5.10. If using CUDA-graph + custom all-reduce on MI300X / MI355X, use a base image with ROCm ≥ 7.2.1. |
Breaking Changes since v0.1.12.post2
None. Same as rc1.
Known Issues
Same as rc1 — see v0.1.13-rc1 release notes for details on the HIP graph capture issue (ROCm 7.2.0 + custom all-reduce).
Wheels
6 prebuilt wheels with PREBUILD_KERNELS=1 for gfx942 (MI300X/MI325X) + gfx950 (MI355X), manylinux_2_28 ABI, torch==2.10.0+rocm7.1 pin:
| ROCm | Python 3.10 | Python 3.12 |
|---|---|---|
| 7.2 | amd_aiter-0.1.13rc2+rocm7.2.manylinux.2.28-cp310-... | amd_aiter-0.1.13rc2+rocm7.2.manylinux.2.28-cp312-... |
| 7.1 | ...+rocm7.1.manylinux.2.28-cp310-... | ...+rocm7.1.manylinux.2.28-cp312-... |
| 7.0 | ...+rocm7.0.manylinux.2.28-cp310-... | ...+rocm7.0.manylinux.2.28-cp312-... |
Validation Status
- ATOM 5-model accuracy: rc1 validated (pending rc2 revalidation)
- vLLM ABI smoke: pending
- MLA nhead=32 decode (#2983): pending silicon verification
- Perf delta vs rc1: pending
Upgrade from rc1
pip install --pre --force-reinstall <wheel-url>
python -c "import aiter; print(aiter.__version__)"  # expect: 0.1.13rc2+rocm7.X.manylinux.2.28

Tagged from release/v0.1.13 HEAD = ab62c65757c4c41cb24c14b8e925a776c6124892.
v0.1.13-rc1
AITER v0.1.13-rc1
Release candidate for v0.1.13. Pre-release — please smoke-test downstream and report back before final tag.
Compatibility Matrix
| Component | Requirement |
|---|---|
| Container ABI | vllm/vllm-openai-rocm:v0.19.1 (Ubuntu 22, glibc ≤ 2.35, libstdc++ ≤ GLIBCXX_3.4.30) |
| PyTorch | torch==2.10.0+rocm7.1 (matches vllm-openai-rocm:v0.19.1; wheels are ABI-pinned to this build) |
| GPU arch | gfx942 (MI300X / MI325X), gfx950 (MI355X) |
| ROCm | 7.0 / 7.1 / 7.2 (pick wheel matching your runtime) |
| Python | 3.10 / 3.12 |
| vLLM | Recommend latest main with PR vllm-project/vllm#40754 merged. Older vLLM that calls gemm_a4w4(A, B, A_scale, B_scale, out, ...) will silently pass out as bias on MXFP4 — see Breaking Changes. |
| SGLang | Recommend ≥ v0.5.10. If using CUDA-graph + custom all-reduce on MI300X / MI355X, use a base image with ROCm ≥ 7.2.1 (e.g. rocm/pytorch:rocm7.2.2_ubuntu22.04_py3.10_pytorch_release_2.9.1) — see Known Issues. |
Breaking Changes since v0.1.12.post2
None new in this release. For context, the gemm_a4w4 API changed in late 2025 (PR #1679, before v0.1.12) from

`gemm_a4w4(A, B, A_scale, B_scale, out, bpreshuffle=True)`

to

`y = gemm_a4w4(A, B, A_scale, B_scale, dtype=out_dtype, bpreshuffle=True)`

Both v0.1.12.post2 and v0.1.13-rc1 already use the new signature. vLLM PR vllm-project/vllm#40754 fixes vLLM's call site; install both AITER and a vLLM build that includes that PR for MXFP4 (Quark OCP MX) models.
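A call-site sketch of that migration, assuming the op is exposed as `aiter.gemm_a4w4` (the tensor dtype and the wrapper function are illustrative, not AITER's documented API surface):

```python
import torch
import aiter

def mxfp4_gemm(A, B, A_scale, B_scale):
    # Old (pre-#1679) call sites allocated `out` themselves and passed it positionally:
    #   out = torch.empty(..., dtype=torch.bfloat16, device="cuda")
    #   gemm_a4w4(A, B, A_scale, B_scale, out, bpreshuffle=True)
    # New signature: request the output dtype and take the returned tensor instead.
    return aiter.gemm_a4w4(A, B, A_scale, B_scale, dtype=torch.bfloat16, bpreshuffle=True)
```

Older vLLM call sites that still pass `out` positionally end up feeding it into a later parameter on MXFP4 models, which is the silent out-as-bias hazard called out in the Compatibility Matrix above.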
Known Issues
- HIP graph capture crash on ROCm 7.2.0 with custom all-reduce (sgl-project/sglang#23580 / #23581, #2857 / #2941).
  - Real root cause (confirmed by the ROCm team in the #2857 thread): ROCm 7.2.0's `hipEventQuery` ignores `THREAD_LOCAL` capture mode, so the NCCL watchdog on another thread invalidates the in-flight HIP graph capture. This triggers `hipErrorStreamCaptureInvalidated`; the next decode replay raises `HSA_STATUS_ERROR_EXCEPTION 0x1016`, and all TP scheduler subprocesses SIGABRT. AITER's `IPCBufferPool` change in v0.1.12.post1 only widened the race window; it did not introduce the bug.
  - Fix: upgrade to ROCm ≥ 7.2.1 (the runtime fix is in 7.2.1 and later; no AITER or PyTorch rebuild needed). Recommended base: `rocm/pytorch:rocm7.2.2_ubuntu22.04_py3.10_pytorch_release_2.9.1`.
  - Alternative workaround (only if pinned to ROCm 7.2.0): set `SGLANG_USE_AITER_AR=false` (sgl-project/sglang#23581 made this the SGLang default). AITER attention / MoE / fused RMSNorm paths remain enabled.
  - This is not an AITER-side bug, so v0.1.13 ships no code change for it. Tracking issue #2941 is closed accordingly.
Upgrade Checklist (from v0.1.12.post2)
# 1. Install rc1 wheel (--pre needed because rc1 is PEP 440 pre-release)
pip install --pre --force-reinstall https://github.com/ROCm/aiter/releases/download/v0.1.13-rc1/amd_aiter-0.1.13rc1+rocm7.2.manylinux.2.28-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl
# 2. Verify
python -c "import aiter; print(aiter.__version__)" # expect: 0.1.13rc1+rocm7.2.manylinux.2.28
# 3. Single-request smoke before opening to traffic
# (use your existing serving stack with ONE short request first, watch for HSA exceptions or shape errors)

Note: rc1 wheel filenames omit the .torch210 suffix that v0.1.12.post2 wheels carried. The torch ABI is unchanged — wheels are still built against torch==2.10.0+rocm7.1 (verified in build logs). The filename suffix will return in a future release.
What's new vs v0.1.12.post2
release/v0.1.13 is based on commit e039817a (selected from ATOM benchmark dashboard — ATOM commit 4f3c57970c showed 14/14 accuracy pass on this AITER head) plus 4 cherry-picked fixes:
- #2893 — fix(fmha): support >4GB KV cache in batch prefill via runtime dispatch
- #2901 — Fix top_k_per_row_prefill err when batched_token_num > 4096
- #2904 — revert gptoss tuned config (rolls back a perf-regressing tune)
- #2875 — ci(release): torch_pin + torch_index_url workflow inputs (build infra)
Total delta from v0.1.12.post2: ~118 commits in the release/v0.1.13 base plus the cherry-picks listed above.
ATOM accuracy validation (GSM8K 3-shot, MI355X 8-GPU)
All 5 tracked models pass thresholds on this rc:
| Model | Score (flex-extract) | Threshold | Result |
|---|---|---|---|
| DeepSeek-R1-0528 | 0.9553 | 0.94 | PASS |
| Kimi-K2.5-MXFP4 | 0.9340 | 0.93 | PASS |
| MiniMax-M2.5 | 0.9325 | 0.92 | PASS |
| GLM-5-FP8 | 0.9386 | 0.93 | PASS |
| Qwen3-235B-A22B-FP8 | 0.8772 | 0.87 | PASS |
Test environment: rocm/atom-dev:latest container, AITER 0.1.13rc1+rocm7.2, lm_eval local-completions with tokenized_requests=False, num_concurrent=65.
CI validation (release/v0.1.13)
- aiter-test (Standard Tests 10/10) — PASS
- sglang_downstream — PASS
- vLLM Benchmark — PASS (no regression vs v0.1.12.post2)
- atom-test (DeepSeek-R1-0528 MI300X+MI355X, gpt-oss-120b MI355X) — PASS
- Flash Attention Integration — PASS
- OPUS Test — PASS
- Black + Ruff — PASS
Wheels
6 prebuilt wheels with PREBUILD_KERNELS=1 for gfx942 (MI300X/MI325X) + gfx950 (MI355X), manylinux_2_28 ABI, torch==2.10.0+rocm7.1 pin (matches vllm/vllm-openai-rocm:v0.19.1 PyTorch ABI):
| ROCm | Python 3.10 | Python 3.12 |
|---|---|---|
| 7.2 | amd_aiter-0.1.13rc1+rocm7.2.manylinux.2.28-cp310-... | amd_aiter-0.1.13rc1+rocm7.2.manylinux.2.28-cp312-... |
| 7.1 | ...+rocm7.1.manylinux.2.28-cp310-... | ...+rocm7.1.manylinux.2.28-cp312-... |
| 7.0 | ...+rocm7.0.manylinux.2.28-cp310-... | ...+rocm7.0.manylinux.2.28-cp312-... |
What we need from downstream
- vLLM ml-ci-internal: smoke test DeepSeek-R1, gpt-oss-120b, Llama serving paths
- SGLang downstream: integration test against rc1 wheel
- ATOM standalone: confirm production benchmarks unchanged
If smoke is green, we'll cut v0.1.13 final from the same source commit. If issues surface, fix → rc2.
Tagged from release/v0.1.13 HEAD = 930c94120459bb352e1d7c68349b331b06397280.
v0.1.12.post2
AITER v0.1.12.post2
Post-release fix for v0.1.12 addressing two production blockers and a wheel ABI compatibility issue with vLLM containers.
Key Fixes
- Fix `c10_hip_check_implementation` undefined symbol (#2843, #2846 by @lingpeng-jin)
  Removed `C10_HIP_KERNEL_LAUNCH_CHECK()` from `gated_rmsnorm_quant_kernels.cu` — the macro signature drifted between PyTorch versions and broke loading in vllm/vllm-openai-rocm containers.
- Fix DSR1 + MI300X "GEMM is not supported" crash (#2864, vllm#39485, #2645 by @eppaneamd)
  Cherry-picked GEMM dispatch fix from main; surfaces the correct kernel for DeepSeek-R1 on gfx942 in vLLM serving.
- manylinux2_28 + torch 2.10 ABI pin
  All wheels now built on `pytorch/manylinux2_28-builder` (AlmaLinux 8 + devtoolset-11) with a `torch==2.10.0+rocm7.X` pin. Resolves:
  - `GLIBCXX_3.4.32` symbol mismatch with vllm/vllm-openai-rocm:v0.19.1 (Ubuntu 22 base)
  - `c10::cuda::getCurrentCUDAStream(signed char)` namespace masquerade drift between PyTorch 2.10 and 2.11
Validation
End-to-end DeepSeek-R1 serving in vllm/vllm-openai-rocm:v0.19.1:
- MI300X (gfx942) — PASS
- MI355X (gfx950) — PASS
Both architectures served from the same wheel.
Wheels
6 prebuilt wheels with PREBUILD_KERNELS=1 for gfx942 (MI300X/MI325X) + gfx950 (MI355X):
| ROCm | Python 3.10 | Python 3.12 |
|---|---|---|
| 7.2 | amd_aiter-0.1.12.post2+rocm7.2.manylinux_2_28.torch210-cp310 |
amd_aiter-0.1.12.post2+rocm7.2.manylinux_2_28.torch210-cp312 |
| 7.1 | amd_aiter-0.1.12.post2+rocm7.1.manylinux_2_28.torch210-cp310 |
amd_aiter-0.1.12.post2+rocm7.1.manylinux_2_28.torch210-cp312 |
| 7.0 | amd_aiter-0.1.12.post2+rocm7.0.manylinux_2_28.torch210-cp310 |
amd_aiter-0.1.12.post2+rocm7.0.manylinux_2_28.torch210-cp312 |
Install
pip install amd_aiter-0.1.12.post2+rocm7.2.manylinux_2_28.torch210-cp312-cp312-linux_x86_64.whl

Built from commit 28a7b6a5c85f4f34d266c8bbb9af6045b93f28fd (tag v0.1.12.post2).
v0.1.12.post1
AITER v0.1.12.post1
Post-release fix for v0.1.12 with 22 additional commits.
Key Fix
- Fix `fused_qk_rmsnorm_group_quant` stride check — relaxes `is_contiguous()` to `stride(1)==1` (last-dim contiguous), fixing crashes on DeepSeek MLA-architecture models via ATOM (#2671)
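For context, a minimal standalone illustration (plain PyTorch, not AITER code) of why the relaxed check matters: a tensor can be last-dim contiguous without being fully contiguous.

```python
import torch

x = torch.randn(8, 16)
y = x[:, :8]                    # column slice: strides stay (16, 1)

print(y.is_contiguous())        # False, the stricter check rejects tensors like this
print(y.stride(-1) == 1)        # True, last-dim contiguous is all the relaxed check requires
```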
Accuracy Validation (gsm8k 3-shot, ATOM upstream methodology)
All 5 models pass ATOM CI accuracy thresholds on MI355X (8x GPU):
| Model | TP | flexible-extract | Threshold | Baseline | Result |
|---|---|---|---|---|---|
| DeepSeek-R1-0528 | 8 | 0.9568 | 0.94 | 0.9553 | PASS |
| GLM-5-FP8 | 8 | 0.9439 | 0.93 | 0.9545 | PASS |
| Kimi-K2.5-MXFP4 | 4 | 0.9378 | 0.93 | 0.9409 | PASS |
| MiniMax-M2.5 | 2 | 0.9431 | 0.92 | 0.9401 | PASS |
| Qwen3-235B-A22B-Instruct-2507-FP8 | 8 | 0.9052 | 0.87 | 0.909 | PASS |
Test environment: ATOM 0.1.3.dev52, AITER 0.1.12.post1+rocm7.2.1, lm-eval with local-completions, tokenized_requests=False, num_concurrent=65.
Wheels
6 prebuilt wheels with PREBUILD_KERNELS=1 for gfx942 (MI300X/MI325X) + gfx950 (MI355X):
| ROCm | Python 3.10 | Python 3.12 |
|---|---|---|
| 7.2.1 | amd_aiter-0.1.12.post1+rocm7.2.1-cp310 |
amd_aiter-0.1.12.post1+rocm7.2.1-cp312 |
| 7.1.1 | amd_aiter-0.1.12.post1+rocm7.1.1-cp310 |
amd_aiter-0.1.12.post1+rocm7.1.1-cp312 |
| 7.0.2 | amd_aiter-0.1.12.post1+rocm7.0.2-cp310 |
amd_aiter-0.1.12.post1+rocm7.0.2-cp312 |
Install
pip install amd_aiter-0.1.12.post1+rocm7.2.1-cp312-cp312-linux_x86_64.whl

Built from commit 7b57073 (tag v0.1.12.post1).
v0.1.12-rc1
AITER v0.1.12 Release Notes
Release Date: 2026-04-10
Previous Release: v0.1.11.post1 (2026-03-05)
Commits: 334 (excluding release branch maintenance)
Supported GPU Architectures: gfx942 (MI300X/MI325X), gfx950 (MI355X)
Highlights
OPUS Migration -- Replacing CK Tile Primitives.
A major effort migrated internal kernel code from CK Tile APIs to the new OPUS (Operator Utility for Shader) abstraction layer. This includes replacing CK Tile in activation kernels (#2589), HIP kernels (#2533), allreduce (#2107), and type conversion primitives (#2331). OPUS also gained tiled scaled MFMA (#2384), finfo class (#2330), cast/numeric_limits enhancements (#2110), moe_sorting_opus (#2077), gfx950 smem transpose load (#2480), and comprehensive unit tests (#2017, #2040, #2127). This migration decouples AITER from CK internals and establishes OPUS as the portable device-code foundation.
FlyDSL Integration for MoE Kernels.
FlyDSL, AMD's high-performance domain-specific language, is now a first-class AITER dependency. Initial A4W4 MoE kernel support was imported (#2113) and enhanced (#2390), split-k GEMM was added (#2536), A4W4 MoE kernels were optimized (#2581), correctness and precision issues in split-k HGEMM were fixed (#2567), and FlyDSL was added to install requirements (#2430). The dependency was upgraded to v0.1.2 (#2635).
MLA (Multi-head Latent Attention) Enhancements.
MLA received extensive feature additions: HipKittens-based nhead=128 kernel (#2039), gfx950 A8W8 qh32 kernel (#1912), MLA persistent kernel LSE output (#2440), LSE-aware dispatch (#2378), FP8 return-LSE support (#2144), metadata split reference code (#2177), fast metadata update for decoding (#2215), and MI350-specific PS mode improvements including nhead=8/mtp=4 (#2461) and nhead64-to-nhead32 folding (#2570). Multiple NaN and accuracy bugs were also fixed (#2106, #2128, #2319).
ctypes Kernel Binding Refactoring.
Kernel dispatch was systematically migrated from pybind11 to ctypes, reducing build complexity and improving JIT build reliability. This includes the foundational ctypes binding refactor (#2255), paged attention ctypes migration (#2395), MoE ASM ctypes migration (#2341), int64 ctypes support (#2486), and a fix for ctypes JIT build issues with asm_topksoftmax (#2603).
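A generic illustration of that dispatch pattern (the library name `libaiter_kernels.so` and the `launch_topk_softmax` symbol below are hypothetical placeholders, not AITER's actual internals):

```python
import ctypes
import torch

# Load a pre-built shared object once and declare the launcher's C signature.
_lib = ctypes.CDLL("libaiter_kernels.so")          # hypothetical library name
_launch = _lib.launch_topk_softmax                 # hypothetical launcher symbol
_launch.restype = ctypes.c_int                     # plain C return code instead of a C++ exception
_launch.argtypes = [ctypes.c_void_p, ctypes.c_void_p, ctypes.c_int64, ctypes.c_void_p]

def topk_softmax(inp: torch.Tensor, out: torch.Tensor) -> None:
    stream = torch.cuda.current_stream().cuda_stream    # raw HIP stream handle as an int
    rc = _launch(ctypes.c_void_p(inp.data_ptr()),
                 ctypes.c_void_p(out.data_ptr()),
                 ctypes.c_int64(inp.numel()),
                 ctypes.c_void_p(stream))
    if rc != 0:
        raise RuntimeError(f"kernel launch failed with code {rc}")
```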
CK Dependency Removal for FMHA.
Flash MHA forward (#2353) and backward v3 (#2250) kernels had their Composable Kernel dependencies removed, and a build-time ENABLE_CK option was added (#2074) enabling fully CK-free builds. The torch dependency was also removed from the MHA shared library build (#2501). These changes reduce build times and external dependency complexity.
Warp Size Generalization.
HIP kernels were updated to support variable warp sizes rather than hardcoding warp_size=64. A WARP_SIZE macro was added to the common header for both host and device use (#2525), and topksoftmax, grouptopk, cache, and sample kernels were updated (#2599). This is essential for cross-architecture portability between CDNA (warp_size=64) and future targets.
Allreduce Refactoring and Fusion.
The custom allreduce path was refactored to support prefill-phase collective operations (#2453), an allreduce+rmsnorm+quant fusion pass was added (#1990), GPT-OSS-120B hidden_size=2880 support was enabled in fused allreduce rmsnorm (#2329), numerical accuracy was improved (#2586), a double-buffer option was added for cross_device_reduce_1stage (#2064), and CUDA graph capture compatibility was fixed (#2075).
Sage Attention v2 and Flash Attention Improvements.
Triton-based Sage Attention v2 received multiple updates: MXFP4 Q*K support (#2066), optimizations (#2045), stride fixes (#2117), mask fix (#2158), and a consolidated patch (#2240). Flash Attention v3 gained hipgraph support for KV cache (#2096), configurable Triton configs via environment variable (#2000), Windows build support (#2433), and integration CI (#1974).
RDNA Architecture Support.
AITER expanded beyond data center GPUs with gfx1150/1151 RDNA arch registration (#2014), improved RDNA config selection for Flash Attention (#2397) and general kernels (#2402), and RDNA CI infrastructure (#2222).
New Features
Attention & MLA
- Introduce HipKittens-based nhead=128 MLA kernel (#2039)
- Add gfx950 MLA A8W8 qh32 kernel (#1912)
- Add LSE output support for MLA decode qseqlen=1 persistent kernel (#2440)
- Add LSE-aware kernel dispatch for MLA (#2378)
- MLA PS mode FP8 with return LSE for nhead=128,4 (#2144)
- MLA PS mode add metadata split reference code (#2177)
- Add decode_update_mla_metadata_v1 for fast metadata update in decoding (#2215)
- MI350 MLA PS mode support nhead=8, mtp=4 (#2461)
- MI350 MLA PS mode fold nhead64,2 to nhead32,4 kernel (#2570)
- Add head_num=40 for MLA FP8 reduce kernel for Qwen3.5 (#2481)
- Upload mla_a8w8_qh64_qseqlen4_gqaratio16 config for MI300 (#2042)
- Add FP8 hdim=256 tile for batch prefill kernel (#2549)
- Support per_block for PA PS kernels (#2053)
- Add sliding window support for Triton sink attention (#2505)
- CK MHA backward: add sink attention score gradient support (#2321)
- MHA forward v3 hdim128 support per-tensor FP8 for MI300/MI308 (#2105)
- CK Tile FMHA backward use persistent kernels in deterministic mode (#2216)
- Optimize flash attention forward (#2265)
- Sage Attention v2: Q*K in MXFP4 (#2066)
- Sage Attention v2 patch (#2240)
- Hipgraph support for fav3 KV cache (#2096)
- Add FLASH_ATTENTION_FWD_TRITON_AMD_CONFIG_JSON env var support (#2000)
- Flash Attention Triton Windows build support (#2433)
MoE Kernels
- FlyDSL A4W4 MoE kernel import (#2113)
- FlyDSL A4W4 MoE support and kernel update (#2390)
- FlyDSL A4W4 MoE kernel optimization (#2581)
- Add FlyDSL split-k GEMM (#2536)
- Triton smoothquant int8 MoE kernel (#2049)
- Introduce ASM 64x256 kernels for MI300 (#2404)
- Introduce 64x256 fmoe kernels (#2279)
- Support topk_softmax with shared expert scoring function (#2356)
- Group_topk: moe_fused_gate support non-power-of-2 experts (192/96) (#2604)
- Update topk.py to support non-power-of-2 experts (Kimi-K2) for long contexts (#2359)
- Add moe_smooth_per_token_scaled_quant v1 and v2 (#2295)
- Add ASM topsoftmax support 384x8 (#2130)
- Support strided gating_score for topk_softmax (#2124)
GEMM
- Add FP8 blockscale ASM kernel (#2142)
- CK Tile A8W8 blockscale GEMM with preshuffleB support (#1954)
- Add Triton A8W8 split-k support (#2180)
- Add compiler configurations for bpreshuffle CK Tile modules (#2069)
- Add 32x128 and 64x128 ASM kernels for Qwen3-next TP4 (#2285)
- Add precision fix and gelu kernels for 64x256 (#2471)
- MI325 support gfx942 i8gemm tilesize 112x256 (#2006)
- Add igemm kernel for MI325 (#1968)
- Enable hipblaslt FP8 tuning (#2212)
- Add f32 MFMA support for 32x32x2f32 and 16x16x4f32 (#2070)
- Add fast gelu activation (#2220)
Fused Kernels
- Add fused_qk_norm_group_quant kernel (#2527)
- Add fused_qknorm HIP kernel (#2442)
- Fuse RMS + RoPE + block quantization kernel (#2027)
- Optimize fused_qk_norm_rope_blkquant kernel (#2206)
- Add allreduce+rmsnorm+quant fusion pass (#1990)
- Support GPT-OSS-120B hidden_size=2880 in fused allreduce rmsnorm (#2329)
- Add mhc_post HIP kernel (#2479)
- Add mhc_pre HIP kernel (mhc_pre_gemm_sqrsum, mhc_pre_big_fuse) (#2136)
- Add fused_qk_norm_rope_cache_quant rotary_dim parameter (#2199)
- Top-K Top-P sampling kernel optimization (#2034)
RDNA Support
- Adding gfx1150/51 to RDNA arch (#2014)
- Improve RDNA config selection for Flash Attention (#2397)
- Improve config selection for RDNA GPUs (#2402)
Other Features
- HIP causal conv1d decode kernel (#2084)
- PA decode gluon AOT C++ API (#2085)
- Support naive mrope in get_rope (#2292)
- Support FP8/MXFP4-quantized activation dtype (#2188)
- Support value_cache 5D shuffle layout with GPT-OSS-120B precision tests (#2217)
- Generate KV prefix preshuffle (#2288)
- Support dim(-1) allgather (#2162)
- Add ep, pp, dp group interface (#2137)
- Respect AITER_LOG_LEVEL for C++ stdout prints (#2086)
- Identify device name by chip ID (#2325)
- Support comments in tuned config CSV files (#2422)
- Defer expensive build operations to build_ext.run() (#1973)
- Hipgraph support: correct arg_size type from int to size_t (#2163)
- Add double-buffer option for cross_device_reduce_1stage (#2064)
- Use unreg path for custom all-reduce during CUDA graph capture (#2075)
Performance
Tuned Configs
- Add Kimi-K2.5 tuned configs for MI355X (#2619)
- Add DSv3-MXFP4 tuned configs for MI355X (#2616)
- Retune Kimi K2 MoE configs (#2625)
- Replace CK MoE config in TP4 configs (#2626)
- Add GLM-5 tuned configs (#2518)
- Add Qwen3.5 FP8 and A8W8 blockscale GEMM tuned configs (#2324)
- Tuned Qwen3.5 GEMM (#2485)
- Add tuned CSV files for GEMM and MoE to accelerate Kimi-K2 (#2290)
- Add MI355X (gfx950) tuned GEMM configs for FP4 and FP8 shapes (#2037)
- Tune 493 new FP4 GEMM shapes for LLM inference (#2092)
- Add new GEMM configuration files for various matrix sizes (#2024)
- GEMM and MoE tuning for DeepSeek-R1 InferenceX FP4 (#2261)
- Tune Triton GEMM kernel for MI355 DSV3 DP+EP configuration (#2016)
- MI325 igemm ASM tuning (#2125)
- Add blockPerCu support for CK Tile GEMMs and CK Tile MoE tuning (#2313)
- Update dsv3 ptpc A8W8 GEMM config (#2253)
- Add GEMM-A16W16-ATOMIC-N=256-K=6144 Triton GEMM tune config (#2213)
- Update gfx950 PA PS kernels and wire stride_scale_blk in asm_pa (#2569)
- Update gfx942 PA PS kernels and wire stride_scale_blk in asm_pa (#2522)
- Add more MoE/GEMM configs (#2506)
- Fix MoE stage2 tune config (#2438)
- Fix MoE GEMM tuned config (#2463)
- Remove duplicate tuned configs (#2219)
- Add FlyDSL split-k GEMM with Kimi-2 BF16 tuned config (#2536)
- Fix GEMM test failures and retune with latest Triton (#2434)
Kernel Optimizations
v0.1.9
What's Changed
- Fix sink error for asm fmha by @LJ-underdog in #1652
- add guard in case pynccl init failed by @valarLip in #1671
- One shot pa by @fsx950223 in #1670
- fix(pa_ps): fix pa_ps_asm .co for gfx950 by @dbyoung18 in #1669
- modify test_bf16gemm_test by @amd-ruitang3 in #1678
- [FIX/CI] Fix ruff CI check by @Boss2002n in #1675
- fix mha bwd golden perf issue by @JaxChen29 in #1666
- topk uplift v1 by @steamedMantou in #1662
- fix missing return in mha_bwd by @yuguo68 in #1688
- Remove the input parameter "out" in gemm_a4w4 by @junhaha666 in #1679
- fwd v3 hd192 optimize inst alignment for causal mode by @shay-li77 in #1663
- fix swa case mismatch by @JaxChen29 in #1694
- fixing the fp4 gemm tune script Exception caused by csv title inconsistency with code by @hongxiayang in #1686
- CI: Migrate Triton tests to aiter-1gpu-runner by @gyohuangxin in #1690
- add ntile 128 for a8 blkQ moe 1 stage by @zufayu in #1695
- Optimize RoPE in the cases that hdim is small. by @ruanjm in #1698
- rm garbage from whl by @amd-ruitang3 in #1696
- enhance prebuild logic by @zufayu in #1672
- LLfp4 qr cap for atom by @amirumoAMD in #1673
- [MLA] MLA conditions rewrite by @Zzz9990 in #1665
- [MLA] fake non persistent fix dp causal by @Zzz9990 in #1677
- add two fp4 tune shapes and tuned config by @hongxiayang in #1687
- Dev/a8w4 and a8w8splitk by @yadaish in #1667
- bf16_gemm_clean_in_kl by @amd-ruitang3 in #1700
- fix tuner by @valarLip in #1701
- add gen_fake for 4 gemm operators by @mqhc2020 in #1456
- fix llvm issue by @valarLip in #1703
- feat: Adaptive topk algorithm selection based on input characteristics by @ClementLinCF in #1578
- fix mha bwd build error by @JaxChen29 in #1705
- fix moe bug when pipever=v1 and nblk=64 by @lalala-sh in #1707
- fix by @valarLip in #1710
New Contributors
- @JaxChen29 made their first contribution in #1666
- @amirumoAMD made their first contribution in #1673
Full Changelog: v0.1.8...v0.1.9
v0.1.7
What's Changed
- add concat_and_cache_mla kernel by @yzhou103 in #1194
- Add triton_metadata_redirect module with with_metadata_path decorator by @jwu10003 in #1172
- [bug] fix qr when variable input by @lihaoyang-amd in #1191
- fix test_concat_cache_mla by @yzhou103 in #1198
- Update vllm_benchmark.yaml to use TW cluster to build vllm image by @gyohuangxin in #1200
- add hipblaslt swizzle feature by @yixionghuo in #1169
- mha fwd v3 gfx950 support dim_q=192 dim_v=128 by @minmengdie in #1188
- Fix FA cpp api multi target build by @slippedJim in #1196
- CI: parallel build Aiter whl packages for Python 3.10 and 3.12 by @gyohuangxin in #1204
- CI: Move some tests back to TW cluster by @gyohuangxin in #1199
- CI: Add timeout and retry when installing the python deps by @gyohuangxin in #1210
- add aiter namespace to rope by @valarLip in #1202
- update test_gemm_a16w16 by @amd-ruitang3 in #1205
- [fea]: custom all gather by @TennyWang1223 in #1207
- [TRITON] Add Positional Encoding (PE) support to Triton MHA kernels by @brunomazzottiamd in #1184
- Tune gemm op bf16 by @yzhou103 in #1190
- Add mha varlen fake for different from mha by @ZhangLirong-amd in #1214
- CI: Use the pre-built sglang image instead of building the sglang image from source. by @gyohuangxin in #1224
- [TRITON] Fix fp8 bmm op unit test bug on MI350 by @lucas-santos-amd in #1219
- CI: Change the image when building the Aiter release python3.12 package by @gyohuangxin in #1225
- CI: Add runner selection to workflow for flexible build host choice in Aiter release CI by @gyohuangxin in #1229
- [CK_TILE] FMHA BWD Optimizations for D48 for GFX950 by @DDEle in #1180
- Remove FA cpp api dependence of pandas by @slippedJim in #1230
- CI: Add unique tag after the names of uploaded packages by @gyohuangxin in #1236
- A8w8 asm codegen and tune by @yzhou103 in #1161
- Opt concat cache mla by @yzhou103 in #1239
- avoid copy ck by @valarLip in #1247
- CI: Add GPU_ARCH options in Aiter release CI by @gyohuangxin in #1253
- [MI35x] fix gfx950 build error by @yzhou103 in #1250
- update bf16 gemm asm by @amd-ruitang3 in #1238
- CI: Fix issues in Aiter release CI by @gyohuangxin in #1255
- add ninja into requirements by @amd-ruitang3 in #1254
- Wrapper gemm to fix get_config lru cache break by @ZhangLirong-amd in #1249
- Fix build bdist wheel error by @yzhou103 in #1256
- Added in GQA and 64-bit indexing by @kesavanramakrishnan in #1226
- Refine ca by @valarLip in #1263
- update mi308 fmoe fp16 asm,MI35x by @amd-ruitang3 in #1201
- Introducing opus by @carlushuang in #1264
- ca_refactor_fix by @valarLip in #1268
- fix_fp4_quant_dtype by @lalala-sh in #1271
- add sample use outer exponential by @junhaha666 in #1267
- Fix rowwise a8w8 gemm in swizzled hipb_mm by @kliuae-amd in #1258
- CI: Use TW cluster to run sglang tests by @gyohuangxin in #1273
- Catchall PR for all 355_wip related changes by @azaidy in #1148
- [MI35X] fix core check by @amd-ruitang3 in #1276
- Refactor gemm bf16 tuner by @yzhou103 in #1275
- CI: Operators tuning pipelines by @gyohuangxin in #1163
- fix the problem that v3's performance is worse than ck's by @minmengdie in #1237
- fix the python mha test run_perftest error by @minmengdie in #1277
- Reuse custom decorator in core and torch guard by @ZhangLirong-amd in #1278
- integrate deep gemm by @lalala-sh in #1265
- add a tuned fp4 gemm ds config and insert entries in untuned config by @hongxiayang in #1243
- Enable large batch size and optimization of non-Ragged batching by @valechen in #1269
- added a few more fw ds f4 untuned and tuned shapes in order to use asm module or kernel by @hongxiayang in #1298
- CI: Optimize autotuning pipeline and inital the docs by @gyohuangxin in #1286
- topk per row kernel by @ukannika in #1262
- fix aot by @fsx950223 in #1279
- Fix ATOM fp8 model quant fail issue in torch compile by @ZhangLirong-amd in #1299
- feat - pa_fwd support block map with stride in num_kv_heads_dim by @alibaba-miji in #1301
- Fix how to update accumulator for dot_scaled by @zhanglx13 in #1297
- CI: Optimize autotuning pipeline docs by @gyohuangxin in #1300
- Fix the lint issue by @gyohuangxin in #1307
- fix fwd perf calc error by @minmengdie in #1305
- add the asm kernel performance of fwd and bwd by @minmengdie in #1270
- Fused TopK and Sigmoid kernel by @samremes in #1251
- Ar rms by @TennyWang1223 in #1290
- Dsv32 cache by @junhaha666 in #1314
- Fix displaying supported architectures by @HollowMan6 in #1316
- using standalone pybind by @valarLip in #1317
- Enable mha bwd hd192_hd128 by @slippedJim in #1308
- CI: Add pre-check status check by @gyohuangxin in #1252
- [CK_TILE] fmha: Add backward pass support for padded inputs by @Jeff-Huang in #1212
- Mla splitkv enhance split alg inte by @valarLip in #1233
- Fix gemm tuner error mi350 by @yzhou103 in #1313
- CI: Skip triton setup in Aiter standard/multigpu tests and add retries when setting up triton by @gyohuangxin in #1325
- Fix global variable torch_fp8 initialization caused issue by @huizhougit in #1322
- [FEAT] [Triton] Add transpose scale to the triton fused_rms_fp8_group_quant by @tjtanaa in #1291
- [Triton] 355 wip Llama FP4 triton fusion + TP8 triton decode shape tunning by @k50112113 in #1315
- Kernel naming: add reusable constexpr repr helper and testing it with gemm_a16w16 by @Boss2002n in #1260
- Merge tuned file by @yzhou103 in #1327
- fix graph_breaks by return tensor for bool op by @ZhangLirong-amd in #1333
- fix_bf16gemm_asm by @amd-ruitang3 in #1329
- Improve Memory Usage in MLA by @ruanjm in #1338
- fix tune error caused by merge tuned_file by @yzhou103 in #1342
- rm rocblas op in aiter by @yzhou103 in #1337
- [Triton] DS a16w8 GEMM and fused reduce_rms_fp8_group_quant by @k50112113 in #1328
- Add block_m=16 for a8w8_ck_moe_blockscale by @huaiguxu in #1081
- Add Fused RMSNorm + FP8 Per-tensor Static Quantization Triton Kernel by @farlukas in #1330
- [TRITON] GEMM kernels nomenclature changes by @Boss2002n in #1283
- Temporarily run aiter standard and multigpu tests on the TW cluster, will switch back once the mirror registry is ready. by @gyohuangxin in #1359
- [Triton] Disable failing lean attention tests by @cagrikymk in #1357
- [Fix] update ck to fix fp4 gemm issue by @gino-lu in #1361
- add config by @valarLip in #1355
- add how_v3_bf16_cvt control to the Python API by @minmengdie in #1351
- [fix]: car 6 rank coredump by @TennyWang1223 in #1335
- Wrapper_flash_attn_backward custom op to avoid functionalize fallback and fix guard logic by @ZhangLirong-amd in #1348
- [TRITON] GEMM kernels nomenclature changes by @Boss2002n in https://github.com/ROCm/aiter/pull...