Tags · iree-org/iree

iree-3.12.0rc20260702

[DispatchCreation] Hoist scalar tensor.extract and tensor.extract_sli…

…ce (#24552)

This PR aims to :
- Enable hoisting tensor.extract ops that read from a scalar tensor
already outside the dispatch.
- Loosen the requirement to hoist tensor.extract_slice ops.

### Context
I encountered this error when working on a modified version of LFM2.5. I
attach a reproducer which captures the idea.

A causal mask path stayed fused inside a QK matmul dispatch, which
produced large vectors due to tile size propagation in an mmt4d ukernel.

```
mask_slice_qk_repro.mlir:45:18: error: One or more operations with large vector sizes (32768 bytes) were found:

	%scores_3d = torch.aten.bmm %q, %k : !torch.vtensor<[16,?,64],f32>, !torch.vtensor<[16,64,256],f32> -> !torch.vtensor<[16,?,256],f32>
                 ^
<unknown>:0: note:   %cst = arith.constant dense<0xFF800000> : vector<16x16x16x16xf32>

<unknown>:0: note:   %cst_0 = arith.constant dense<0.000000e+00> : vector<16x16x16x16xf32>

<unknown>:0: note:   %cst_1 = arith.constant dense<0> : vector<16x16x16x16xi8>
```

This originated due to the mask's slicing offset coming from a
tensor.extract and subsequently producing a scalar metadata chain that
remained inside the dispatch.

### Proposed fix
- `HoistUniformScalarComputePass` can accept more "candidate ops" than
just arith ops. I just included tensor.extract since its what I ran
into. `isUniformScalarForDispatch` is still in charge to verify that the
candidate op is hoistable, so i added the logic to check the
tensor.extract ops.
- `IREE::Flow::isOffsetSizeAndStrideMappableToFlow` got split into two:
`isOffsetSizeAndStrideStructurallyMappableToFlow` just checks if the
slice can be represented as one flat contiguous byte range, and
`isOffsetSizeAndStrideMappableToFlow` checks for that and the additional
tensor.extract provenance.
- `isHoistableOp` in HoistEncodingOps.cpp was rejecting extract slice
ops whose offset, size, and stride where produced by an extract op due
to calling `isOffsetSizeAndStrideMappableToFlow` on them. Now it calls
`isOffsetSizeAndStrideStructurallyMappableToFlow`.

### Additional Notes:
- I tried to not interfere with the codebase's original intentions.
- Since the extract and extract_slice make it out of the dispatch, the
large vectors never occur. I thought this was the right way to address
the root cause of the problem.
- Inspecting the mmt4d ukernel tile size propagation, it seems that the
problematic large vectors originated due to propagating a pack op tiling
config to the outer dims of an accumulator, which should not happen
afaiu. I could work on that separate issue if it is of interest.

<details>
<summary>mlir reproducer</summary>

```mlir
module @module {
  func.func @forward(
      %query: !torch.vtensor<[1,16,?,64],f32>,
      %key: !torch.vtensor<[1,16,64,256],f32>,
      %mask: !torch.vtensor<[256,256],ui8>,
      %positions: !torch.vtensor<[1],si64>)
      -> !torch.vtensor<[1,16,?,256],f32>
      attributes {torch.assume_strict_symbolic_shapes} {
    %s = torch.symbolic_int "s" {min_val = 1, max_val = 256} : !torch.int
    torch.bind_symbolic_shape %query, [%s], affine_map<()[s0] -> (1, 16, s0, 64)> : !torch.vtensor<[1,16,?,64],f32>

    %int0 = torch.constant.int 0
    %int1 = torch.constant.int 1
    %int2 = torch.constant.int 2
    %int16 = torch.constant.int 16
    %int64 = torch.constant.int 64
    %int256 = torch.constant.int 256
    %int-1 = torch.constant.int -1

    %seq_len = torch.aten.size.int %query, %int2 : !torch.vtensor<[1,16,?,64],f32>, !torch.int -> !torch.int
    %pos_tensor = torch.aten.select.int %positions, %int0, %int-1 : !torch.vtensor<[1],si64>, !torch.int, !torch.int -> !torch.vtensor<[],si64>
    %pos = torch.aten.item %pos_tensor : !torch.vtensor<[],si64> -> !torch.int

    %end = torch.aten.add.int %pos, %seq_len : !torch.int, !torch.int -> !torch.int
    %bool_dtype = torch.constant.int 11
    %mask_bool = torch.prims.convert_element_type %mask, %bool_dtype : !torch.vtensor<[256,256],ui8>, !torch.int -> !torch.vtensor<[256,256],i1>
    %mask_slice = torch.aten.slice.Tensor %mask_bool, %int0, %pos, %end, %int1 : !torch.vtensor<[256,256],i1>, !torch.int, !torch.int, !torch.int, !torch.int -> !torch.vtensor<[?,256],i1>
    torch.bind_symbolic_shape %mask_slice, [%s], affine_map<()[s0] -> (s0, 256)> : !torch.vtensor<[?,256],i1>

    %float-Inf = torch.constant.float 0xFFF0000000000000
    %float0 = torch.constant.float 0.000000e+00
    %f32_dtype = torch.constant.int 6
    %none = torch.constant.none
    %cpu = torch.constant.device "cpu"
    %neg_inf = torch.aten.scalar_tensor %float-Inf, %f32_dtype, %none, %cpu, %none : !torch.float, !torch.int, !torch.none, !torch.Device, !torch.none -> !torch.vtensor<[],f32>
    %mask_bias = torch.aten.where.ScalarSelf %mask_slice, %float0, %neg_inf : !torch.vtensor<[?,256],i1>, !torch.float, !torch.vtensor<[],f32> -> !torch.vtensor<[?,256],f32>
    torch.bind_symbolic_shape %mask_bias, [%s], affine_map<()[s0] -> (s0, 256)> : !torch.vtensor<[?,256],f32>

    %q_shape = torch.prim.ListConstruct %int16, %seq_len, %int64 : (!torch.int, !torch.int, !torch.int) -> !torch.list<int>
    %q = torch.aten.view %query, %q_shape : !torch.vtensor<[1,16,?,64],f32>, !torch.list<int> -> !torch.vtensor<[16,?,64],f32>
    torch.bind_symbolic_shape %q, [%s], affine_map<()[s0] -> (16, s0, 64)> : !torch.vtensor<[16,?,64],f32>

    %k_shape = torch.prim.ListConstruct %int16, %int64, %int256 : (!torch.int, !torch.int, !torch.int) -> !torch.list<int>
    %k = torch.aten.view %key, %k_shape : !torch.vtensor<[1,16,64,256],f32>, !torch.list<int> -> !torch.vtensor<[16,64,256],f32>
    %scores_3d = torch.aten.bmm %q, %k : !torch.vtensor<[16,?,64],f32>, !torch.vtensor<[16,64,256],f32> -> !torch.vtensor<[16,?,256],f32>
    torch.bind_symbolic_shape %scores_3d, [%s], affine_map<()[s0] -> (16, s0, 256)> : !torch.vtensor<[16,?,256],f32>

    %scores_shape = torch.prim.ListConstruct %int1, %int16, %seq_len, %int256 : (!torch.int, !torch.int, !torch.int, !torch.int) -> !torch.list<int>
    %scores = torch.aten.view %scores_3d, %scores_shape : !torch.vtensor<[16,?,256],f32>, !torch.list<int> -> !torch.vtensor<[1,16,?,256],f32>
    torch.bind_symbolic_shape %scores, [%s], affine_map<()[s0] -> (1, 16, s0, 256)> : !torch.vtensor<[1,16,?,256],f32>

    %result = torch.aten.add.Tensor %scores, %mask_bias, %int1 : !torch.vtensor<[1,16,?,256],f32>, !torch.vtensor<[?,256],f32>, !torch.int -> !torch.vtensor<[1,16,?,256],f32>
    torch.bind_symbolic_shape %result, [%s], affine_map<()[s0] -> (1, 16, s0, 256)> : !torch.vtensor<[1,16,?,256],f32>
    return %result : !torch.vtensor<[1,16,?,256],f32>
  }
}

```
</details>

Compile command:
```
iree-compile \
  mask_slice_qk_repro.mlir  \
  -o mask_slice_qk_repro.vmfb  \
  --iree-input-type=auto \
  --iree-hal-target-device=local \
  --iree-opt-data-tiling=true \
  --iree-llvmcpu-enable-ukernels=all \
  --iree-hal-local-target-device-backends=llvm-cpu \
  --iree-hal-local-host-device-backends=llvm-cpu \
  --iree-llvmcpu-target-cpu-features=host 
```

Assisted by Codex 5.5

---------

Signed-off-by: Juan Ignacio Pisula <pisula@roofline.ai>

Jul 1, 2026
af08a7c
zip
tar.gz
Notes
Downloads

iree-3.12.0rc20260701

[CUDA][LLVMGPU] Fix sm_120 WGP params and add BF16 mma.sync coverage (#…

…24648)

## Description

Fixes a correctness issue in `getSM120WgpDetails()` carried over from
the Ampere placeholder, consolidates BF16 test coverage into existing
architecture-scoped files, and extends sm_120 tests with a
mixed-precision case.

- **BF16 mma.sync fix**: `NV_MMA_SYNC_F32_16x8x16_BF16` was missing from
sm_120's `mmaOps` despite being supported from Ampere onwards. sm_121
already advertises this intrinsic. Without this fix, bf16 × bf16 → f32
matmuls silently fall back to SIMT tiling on sm_120.

- **Test consolidation**: Per the reviewer's suggestion, tests no longer
have their own per-architecture files. Instead they are folded into the
existing architecture-scoped test files.

## Testcases

1. config_tile_and_fuse_sm80.mlir / config_tile_and_fuse_sm120.mlir:
each gets a new BF16 section verifying that bf16 × bf16 → f32 selects
`NV_MMA_SYNC_F32_16x8x16_BF16` at the config stage.

1. config_tile_and_fuse_sm120.mlir: adds a mixed-precision f32 × bf16 →
f32 case to verify the SIMT fallback path when no MMA intrinsic matches
the operand types.

1. pipeline_tile_and_fuse_mma_sync.mlir: adds a shared BF16 pipeline
lowering section that covers both sm_80 and sm_120, verifying
`nvgpu.mma.sync` is generated with the correct shape.

1. target_device_features.mlir: expands the SM120 metadata check to a
full wgp attribute check (mma list, workgroup sizes, memory limits),
matching the granularity of the existing SM89 check.

1. bf16_mma_sm80.mlir: deleted — coverage moved into the files above.

---------

Signed-off-by: weimin023 <tnwilly@gmail.com>

Jun 30, 2026
a4d0287
zip
tar.gz
Notes
Downloads

iree-3.12.0rc20260630

[Metal] Fix indirect dispatch offset for sub-allocated parameter buff…

…ers (#24644)

iree_hal_metal_command_buffer_prepare_dispatch resolved the
indirect-dispatch workgroup-count buffer offset as just
config.workgroup_count_ref.offset, dropping
iree_hal_buffer_byte_offset(buffer) -- the base offset of the parameter
buffer within its backing allocation. This is inconsistent with the
sibling descriptor path in the same function (which adds byte_offset)
and with every other backend: Vulkan and amdgpu route all offsets
through a shared resolver that adds byte_offset, and the local HAL
resolves through iree_hal_buffer_map_range. When the indirect-parameter
buffer is a sub-allocation with a non-zero base offset, the dispatch
read the three workgroup-count uint32s from the wrong address and ran a
wrong grid.

The bug was latent because the Metal allocator returns standalone root
buffers (byte_offset==0) and the cross-backend CTS indirect-parameters
tests allocate the parameter buffer directly at offset 0.

Add byte_offset so the source offset matches the descriptor path and the
other backends:
workgroups_offset =
iree_hal_buffer_byte_offset(config.workgroup_count_ref.buffer) +
config.workgroup_count_ref.offset;

Adds CTS regression test
DispatchIndirectParametersTest.SubAllocatedParameterBuffer, which
references the workgroup counts through an iree_hal_buffer_subspan at a
non-zero base offset (placing deliberately-wrong counts at offset 0 so a
base-offset drop reads a deterministic grid instead of garbage). This is
the only test shape that catches the bug, and it exercises it on every
backend that supports sub-allocated buffers (Metal, Vulkan, amdgpu).

Signed-off-by: Alex Vasile
<48962821+Alex-Vasile@users.noreply.github.com>

Signed-off-by: Alex Vasile <48962821+Alex-Vasile@users.noreply.github.com>

Jun 29, 2026
540008c
zip
tar.gz
Notes
Downloads

iree-3.12.0rc20260629

Integrate LLVM to llvm/llvm-project@22da7f929139 (#24629)

Clean integrate, no carry-over revert.

Bump to
[llvm/llvm-project@22da7f929139](llvm/llvm-project@22da7f929139).

Signed-off-by: Ege Beysel <beyselege@gmail.com>

Jun 25, 2026
fca4d82
zip
tar.gz
Notes
Downloads

iree-3.12.0rc20260628

Integrate LLVM to llvm/llvm-project@22da7f929139 (#24629)

Clean integrate, no carry-over revert.

Bump to
[llvm/llvm-project@22da7f929139](llvm/llvm-project@22da7f929139).

Signed-off-by: Ege Beysel <beyselege@gmail.com>

Jun 25, 2026
fca4d82
zip
tar.gz
Notes
Downloads

iree-3.12.0rc20260627

Integrate LLVM to llvm/llvm-project@22da7f929139 (#24629)

Clean integrate, no carry-over revert.

Bump to
[llvm/llvm-project@22da7f929139](llvm/llvm-project@22da7f929139).

Signed-off-by: Ege Beysel <beyselege@gmail.com>

Jun 25, 2026
fca4d82
zip
tar.gz
Notes
Downloads

iree-3.12.0rc20260626

Integrate LLVM to llvm/llvm-project@22da7f929139 (#24629)

Clean integrate, no carry-over revert.

Bump to
[llvm/llvm-project@22da7f929139](llvm/llvm-project@22da7f929139).

Signed-off-by: Ege Beysel <beyselege@gmail.com>

Jun 25, 2026
fca4d82
zip
tar.gz
Notes
Downloads

iree-3.12.0rc20260625

[GlobalOpt] Fix crash in RaiseSpecialOps on buffer-semantics named ops (

#24626)

## Problem

Passing a `linalg.matmul` (or any named contraction/convolution op) with
**memref / buffer operands** to `iree-compile` crashes with a hard
assert instead of a clean diagnostic:

```
Assertion failed: (index < size() && "invalid index into type range"),
function operator[], file TypeRange.h, line 156.
...
NamedImplicitCastOpConversion<linalg::ContractionOpInterface>::matchAndRewrite(...)
```

`NamedImplicitCastOpConversion::matchAndRewrite` in
`GlobalOptimization/RaiseSpecialOps.cpp` reasons about the op's results
(`getResultTypes()[0]`) and rewrites its body region, both of which
assume tensor semantics. A buffer-semantics linalg op writes to an
output buffer and has **zero results**, so `getResultTypes()[0]` indexes
an empty `TypeRange` and aborts.

## Fix

Bail out via `hasPureTensorSemantics()` before touching results,
matching the guards already used elsewhere in this file (e.g. lines 48,
231). The op is left untouched and compilation proceeds normally.

## Verification (local, llvm-cpu)

A/B on identical current `main` sources, exact reproducer from the
issue:

| Build | `iree-compile ... memref matmul` |
|---|---|
| unfixed (`main`) | exit **134**, assert in
`NamedImplicitCastOpConversion` |
| fixed (this PR) | exit **0**, compiles cleanly to a valid `.vmfb` |

- Added lit test `@matmul_memref_no_crash` in `raise_special_ops.mlir`.
- Full `raise_special_ops.mlir` lit suite passes.

Fixes #24624

Signed-off-by: Alex-Wengg <hanweng9@gmail.com>

Jun 24, 2026
a337264
zip
tar.gz
Notes
Downloads

iree-3.12.0rc20260624

Integrate LLVM to llvm/llvm-project@9550cd76cade (#24623)

Clean integrate, no reverts carried over.

Signed-off-by: Ege Beysel <beyselege@gmail.com>

Jun 23, 2026
a45adea
zip
tar.gz
Notes
Downloads

iree-3.12.0rc20260623

[NFC] Remove the llvm-external-projects/iree-dialects directory (#24466…

…) (#24612)

## Overview

Removes the `llvm-external-projects/iree-dialects` directory now that
nothing in the compiler or runtime depends on it.

This is the final step of #24466. The production dependencies were
retired in #24543 and predecessors:
- the softmax matcher was reimplemented natively in
`GlobalOptimization/RaiseSpecialOps`,
- `ErrorCheckingTrackingListener` was moved into `Codegen/Common/`,
- `DropSchedulePass` was moved into `Codegen/Common/`,
- the callback-matcher transform ops
(`transform.iree.register_match_callbacks` / `match_callback` /
`take_first` / `emit_remark`) and their tests were dropped.

## Changes

- Delete `llvm-external-projects/iree-dialects/`.
- Drop the `iree_llvm_add_external_project(mlir-iree-dialects ...)` call
from the top-level `CMakeLists.txt`.
- Remove the dead iree-dialects target conversion from `bazel_to_cmake`.
- Drop the `IREEDialects*` Python sources + include path from
`compiler/bindings/python/CMakeLists.txt`, the `_ireeDialects`
`CMakeExtension` from `compiler/setup.py`, and the `iree_dialects.rst`
API doc.
- Remove the `iree-dialects-doc` build step from the website doc
generator.
- Remove the "Test iree-dialects" CI steps and the `check-iree-dialects`
targets from the ASan/UBSan scripts (and the now-unused
`test_iree_dialects.sh`).

## Testing

- **Local (Windows + MSVC, Ninja, Release):** CMake configure is clean
after the external-project removal only `stablehlo` remains as an
external LLVM project and a full-tree grep finds zero dangling
references. The full `iree-opt` target builds, links, and runs
(`iree-opt --version` → LLVM 23.0.0git).

---------

Signed-off-by: Alex-Wengg <hanweng9@gmail.com>

Jun 22, 2026
ac077d8
zip
tar.gz
Notes
Downloads

PreviousNext

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

iree-3.12.0rc20260702

iree-3.12.0rc20260701

iree-3.12.0rc20260630

iree-3.12.0rc20260629

iree-3.12.0rc20260628

iree-3.12.0rc20260627

iree-3.12.0rc20260626

iree-3.12.0rc20260625

iree-3.12.0rc20260624

iree-3.12.0rc20260623

Uh oh!

Tags: iree-org/iree