Tags: iree-org/iree
Tags
[DispatchCreation] Hoist scalar tensor.extract and tensor.extract_sli… …ce (#24552) This PR aims to : - Enable hoisting tensor.extract ops that read from a scalar tensor already outside the dispatch. - Loosen the requirement to hoist tensor.extract_slice ops. ### Context I encountered this error when working on a modified version of LFM2.5. I attach a reproducer which captures the idea. A causal mask path stayed fused inside a QK matmul dispatch, which produced large vectors due to tile size propagation in an mmt4d ukernel. ``` mask_slice_qk_repro.mlir:45:18: error: One or more operations with large vector sizes (32768 bytes) were found: %scores_3d = torch.aten.bmm %q, %k : !torch.vtensor<[16,?,64],f32>, !torch.vtensor<[16,64,256],f32> -> !torch.vtensor<[16,?,256],f32> ^ <unknown>:0: note: %cst = arith.constant dense<0xFF800000> : vector<16x16x16x16xf32> <unknown>:0: note: %cst_0 = arith.constant dense<0.000000e+00> : vector<16x16x16x16xf32> <unknown>:0: note: %cst_1 = arith.constant dense<0> : vector<16x16x16x16xi8> ``` This originated due to the mask's slicing offset coming from a tensor.extract and subsequently producing a scalar metadata chain that remained inside the dispatch. ### Proposed fix - `HoistUniformScalarComputePass` can accept more "candidate ops" than just arith ops. I just included tensor.extract since its what I ran into. `isUniformScalarForDispatch` is still in charge to verify that the candidate op is hoistable, so i added the logic to check the tensor.extract ops. - `IREE::Flow::isOffsetSizeAndStrideMappableToFlow` got split into two: `isOffsetSizeAndStrideStructurallyMappableToFlow` just checks if the slice can be represented as one flat contiguous byte range, and `isOffsetSizeAndStrideMappableToFlow` checks for that and the additional tensor.extract provenance. - `isHoistableOp` in HoistEncodingOps.cpp was rejecting extract slice ops whose offset, size, and stride where produced by an extract op due to calling `isOffsetSizeAndStrideMappableToFlow` on them. Now it calls `isOffsetSizeAndStrideStructurallyMappableToFlow`. ### Additional Notes: - I tried to not interfere with the codebase's original intentions. - Since the extract and extract_slice make it out of the dispatch, the large vectors never occur. I thought this was the right way to address the root cause of the problem. - Inspecting the mmt4d ukernel tile size propagation, it seems that the problematic large vectors originated due to propagating a pack op tiling config to the outer dims of an accumulator, which should not happen afaiu. I could work on that separate issue if it is of interest. <details> <summary>mlir reproducer</summary> ```mlir module @module { func.func @forward( %query: !torch.vtensor<[1,16,?,64],f32>, %key: !torch.vtensor<[1,16,64,256],f32>, %mask: !torch.vtensor<[256,256],ui8>, %positions: !torch.vtensor<[1],si64>) -> !torch.vtensor<[1,16,?,256],f32> attributes {torch.assume_strict_symbolic_shapes} { %s = torch.symbolic_int "s" {min_val = 1, max_val = 256} : !torch.int torch.bind_symbolic_shape %query, [%s], affine_map<()[s0] -> (1, 16, s0, 64)> : !torch.vtensor<[1,16,?,64],f32> %int0 = torch.constant.int 0 %int1 = torch.constant.int 1 %int2 = torch.constant.int 2 %int16 = torch.constant.int 16 %int64 = torch.constant.int 64 %int256 = torch.constant.int 256 %int-1 = torch.constant.int -1 %seq_len = torch.aten.size.int %query, %int2 : !torch.vtensor<[1,16,?,64],f32>, !torch.int -> !torch.int %pos_tensor = torch.aten.select.int %positions, %int0, %int-1 : !torch.vtensor<[1],si64>, !torch.int, !torch.int -> !torch.vtensor<[],si64> %pos = torch.aten.item %pos_tensor : !torch.vtensor<[],si64> -> !torch.int %end = torch.aten.add.int %pos, %seq_len : !torch.int, !torch.int -> !torch.int %bool_dtype = torch.constant.int 11 %mask_bool = torch.prims.convert_element_type %mask, %bool_dtype : !torch.vtensor<[256,256],ui8>, !torch.int -> !torch.vtensor<[256,256],i1> %mask_slice = torch.aten.slice.Tensor %mask_bool, %int0, %pos, %end, %int1 : !torch.vtensor<[256,256],i1>, !torch.int, !torch.int, !torch.int, !torch.int -> !torch.vtensor<[?,256],i1> torch.bind_symbolic_shape %mask_slice, [%s], affine_map<()[s0] -> (s0, 256)> : !torch.vtensor<[?,256],i1> %float-Inf = torch.constant.float 0xFFF0000000000000 %float0 = torch.constant.float 0.000000e+00 %f32_dtype = torch.constant.int 6 %none = torch.constant.none %cpu = torch.constant.device "cpu" %neg_inf = torch.aten.scalar_tensor %float-Inf, %f32_dtype, %none, %cpu, %none : !torch.float, !torch.int, !torch.none, !torch.Device, !torch.none -> !torch.vtensor<[],f32> %mask_bias = torch.aten.where.ScalarSelf %mask_slice, %float0, %neg_inf : !torch.vtensor<[?,256],i1>, !torch.float, !torch.vtensor<[],f32> -> !torch.vtensor<[?,256],f32> torch.bind_symbolic_shape %mask_bias, [%s], affine_map<()[s0] -> (s0, 256)> : !torch.vtensor<[?,256],f32> %q_shape = torch.prim.ListConstruct %int16, %seq_len, %int64 : (!torch.int, !torch.int, !torch.int) -> !torch.list<int> %q = torch.aten.view %query, %q_shape : !torch.vtensor<[1,16,?,64],f32>, !torch.list<int> -> !torch.vtensor<[16,?,64],f32> torch.bind_symbolic_shape %q, [%s], affine_map<()[s0] -> (16, s0, 64)> : !torch.vtensor<[16,?,64],f32> %k_shape = torch.prim.ListConstruct %int16, %int64, %int256 : (!torch.int, !torch.int, !torch.int) -> !torch.list<int> %k = torch.aten.view %key, %k_shape : !torch.vtensor<[1,16,64,256],f32>, !torch.list<int> -> !torch.vtensor<[16,64,256],f32> %scores_3d = torch.aten.bmm %q, %k : !torch.vtensor<[16,?,64],f32>, !torch.vtensor<[16,64,256],f32> -> !torch.vtensor<[16,?,256],f32> torch.bind_symbolic_shape %scores_3d, [%s], affine_map<()[s0] -> (16, s0, 256)> : !torch.vtensor<[16,?,256],f32> %scores_shape = torch.prim.ListConstruct %int1, %int16, %seq_len, %int256 : (!torch.int, !torch.int, !torch.int, !torch.int) -> !torch.list<int> %scores = torch.aten.view %scores_3d, %scores_shape : !torch.vtensor<[16,?,256],f32>, !torch.list<int> -> !torch.vtensor<[1,16,?,256],f32> torch.bind_symbolic_shape %scores, [%s], affine_map<()[s0] -> (1, 16, s0, 256)> : !torch.vtensor<[1,16,?,256],f32> %result = torch.aten.add.Tensor %scores, %mask_bias, %int1 : !torch.vtensor<[1,16,?,256],f32>, !torch.vtensor<[?,256],f32>, !torch.int -> !torch.vtensor<[1,16,?,256],f32> torch.bind_symbolic_shape %result, [%s], affine_map<()[s0] -> (1, 16, s0, 256)> : !torch.vtensor<[1,16,?,256],f32> return %result : !torch.vtensor<[1,16,?,256],f32> } } ``` </details> Compile command: ``` iree-compile \ mask_slice_qk_repro.mlir \ -o mask_slice_qk_repro.vmfb \ --iree-input-type=auto \ --iree-hal-target-device=local \ --iree-opt-data-tiling=true \ --iree-llvmcpu-enable-ukernels=all \ --iree-hal-local-target-device-backends=llvm-cpu \ --iree-hal-local-host-device-backends=llvm-cpu \ --iree-llvmcpu-target-cpu-features=host ``` Assisted by Codex 5.5 --------- Signed-off-by: Juan Ignacio Pisula <pisula@roofline.ai>
[CUDA][LLVMGPU] Fix sm_120 WGP params and add BF16 mma.sync coverage (#… …24648) ## Description Fixes a correctness issue in `getSM120WgpDetails()` carried over from the Ampere placeholder, consolidates BF16 test coverage into existing architecture-scoped files, and extends sm_120 tests with a mixed-precision case. - **BF16 mma.sync fix**: `NV_MMA_SYNC_F32_16x8x16_BF16` was missing from sm_120's `mmaOps` despite being supported from Ampere onwards. sm_121 already advertises this intrinsic. Without this fix, bf16 × bf16 → f32 matmuls silently fall back to SIMT tiling on sm_120. - **Test consolidation**: Per the reviewer's suggestion, tests no longer have their own per-architecture files. Instead they are folded into the existing architecture-scoped test files. ## Testcases 1. config_tile_and_fuse_sm80.mlir / config_tile_and_fuse_sm120.mlir: each gets a new BF16 section verifying that bf16 × bf16 → f32 selects `NV_MMA_SYNC_F32_16x8x16_BF16` at the config stage. 1. config_tile_and_fuse_sm120.mlir: adds a mixed-precision f32 × bf16 → f32 case to verify the SIMT fallback path when no MMA intrinsic matches the operand types. 1. pipeline_tile_and_fuse_mma_sync.mlir: adds a shared BF16 pipeline lowering section that covers both sm_80 and sm_120, verifying `nvgpu.mma.sync` is generated with the correct shape. 1. target_device_features.mlir: expands the SM120 metadata check to a full wgp attribute check (mma list, workgroup sizes, memory limits), matching the granularity of the existing SM89 check. 1. bf16_mma_sm80.mlir: deleted — coverage moved into the files above. --------- Signed-off-by: weimin023 <tnwilly@gmail.com>
[Metal] Fix indirect dispatch offset for sub-allocated parameter buff… …ers (#24644) iree_hal_metal_command_buffer_prepare_dispatch resolved the indirect-dispatch workgroup-count buffer offset as just config.workgroup_count_ref.offset, dropping iree_hal_buffer_byte_offset(buffer) -- the base offset of the parameter buffer within its backing allocation. This is inconsistent with the sibling descriptor path in the same function (which adds byte_offset) and with every other backend: Vulkan and amdgpu route all offsets through a shared resolver that adds byte_offset, and the local HAL resolves through iree_hal_buffer_map_range. When the indirect-parameter buffer is a sub-allocation with a non-zero base offset, the dispatch read the three workgroup-count uint32s from the wrong address and ran a wrong grid. The bug was latent because the Metal allocator returns standalone root buffers (byte_offset==0) and the cross-backend CTS indirect-parameters tests allocate the parameter buffer directly at offset 0. Add byte_offset so the source offset matches the descriptor path and the other backends: workgroups_offset = iree_hal_buffer_byte_offset(config.workgroup_count_ref.buffer) + config.workgroup_count_ref.offset; Adds CTS regression test DispatchIndirectParametersTest.SubAllocatedParameterBuffer, which references the workgroup counts through an iree_hal_buffer_subspan at a non-zero base offset (placing deliberately-wrong counts at offset 0 so a base-offset drop reads a deterministic grid instead of garbage). This is the only test shape that catches the bug, and it exercises it on every backend that supports sub-allocated buffers (Metal, Vulkan, amdgpu). Signed-off-by: Alex Vasile <48962821+Alex-Vasile@users.noreply.github.com> Signed-off-by: Alex Vasile <48962821+Alex-Vasile@users.noreply.github.com>
Integrate LLVM to llvm/llvm-project@22da7f929139 (#24629) Clean integrate, no carry-over revert. Bump to [llvm/llvm-project@22da7f929139](llvm/llvm-project@22da7f929139). Signed-off-by: Ege Beysel <beyselege@gmail.com>
Integrate LLVM to llvm/llvm-project@22da7f929139 (#24629) Clean integrate, no carry-over revert. Bump to [llvm/llvm-project@22da7f929139](llvm/llvm-project@22da7f929139). Signed-off-by: Ege Beysel <beyselege@gmail.com>
Integrate LLVM to llvm/llvm-project@22da7f929139 (#24629) Clean integrate, no carry-over revert. Bump to [llvm/llvm-project@22da7f929139](llvm/llvm-project@22da7f929139). Signed-off-by: Ege Beysel <beyselege@gmail.com>
Integrate LLVM to llvm/llvm-project@22da7f929139 (#24629) Clean integrate, no carry-over revert. Bump to [llvm/llvm-project@22da7f929139](llvm/llvm-project@22da7f929139). Signed-off-by: Ege Beysel <beyselege@gmail.com>
[GlobalOpt] Fix crash in RaiseSpecialOps on buffer-semantics named ops ( #24626) ## Problem Passing a `linalg.matmul` (or any named contraction/convolution op) with **memref / buffer operands** to `iree-compile` crashes with a hard assert instead of a clean diagnostic: ``` Assertion failed: (index < size() && "invalid index into type range"), function operator[], file TypeRange.h, line 156. ... NamedImplicitCastOpConversion<linalg::ContractionOpInterface>::matchAndRewrite(...) ``` `NamedImplicitCastOpConversion::matchAndRewrite` in `GlobalOptimization/RaiseSpecialOps.cpp` reasons about the op's results (`getResultTypes()[0]`) and rewrites its body region, both of which assume tensor semantics. A buffer-semantics linalg op writes to an output buffer and has **zero results**, so `getResultTypes()[0]` indexes an empty `TypeRange` and aborts. ## Fix Bail out via `hasPureTensorSemantics()` before touching results, matching the guards already used elsewhere in this file (e.g. lines 48, 231). The op is left untouched and compilation proceeds normally. ## Verification (local, llvm-cpu) A/B on identical current `main` sources, exact reproducer from the issue: | Build | `iree-compile ... memref matmul` | |---|---| | unfixed (`main`) | exit **134**, assert in `NamedImplicitCastOpConversion` | | fixed (this PR) | exit **0**, compiles cleanly to a valid `.vmfb` | - Added lit test `@matmul_memref_no_crash` in `raise_special_ops.mlir`. - Full `raise_special_ops.mlir` lit suite passes. Fixes #24624 Signed-off-by: Alex-Wengg <hanweng9@gmail.com>
Integrate LLVM to llvm/llvm-project@9550cd76cade (#24623) Clean integrate, no reverts carried over. Signed-off-by: Ege Beysel <beyselege@gmail.com>
[NFC] Remove the llvm-external-projects/iree-dialects directory (#24466… …) (#24612) ## Overview Removes the `llvm-external-projects/iree-dialects` directory now that nothing in the compiler or runtime depends on it. This is the final step of #24466. The production dependencies were retired in #24543 and predecessors: - the softmax matcher was reimplemented natively in `GlobalOptimization/RaiseSpecialOps`, - `ErrorCheckingTrackingListener` was moved into `Codegen/Common/`, - `DropSchedulePass` was moved into `Codegen/Common/`, - the callback-matcher transform ops (`transform.iree.register_match_callbacks` / `match_callback` / `take_first` / `emit_remark`) and their tests were dropped. ## Changes - Delete `llvm-external-projects/iree-dialects/`. - Drop the `iree_llvm_add_external_project(mlir-iree-dialects ...)` call from the top-level `CMakeLists.txt`. - Remove the dead iree-dialects target conversion from `bazel_to_cmake`. - Drop the `IREEDialects*` Python sources + include path from `compiler/bindings/python/CMakeLists.txt`, the `_ireeDialects` `CMakeExtension` from `compiler/setup.py`, and the `iree_dialects.rst` API doc. - Remove the `iree-dialects-doc` build step from the website doc generator. - Remove the "Test iree-dialects" CI steps and the `check-iree-dialects` targets from the ASan/UBSan scripts (and the now-unused `test_iree_dialects.sh`). ## Testing - **Local (Windows + MSVC, Ninja, Release):** CMake configure is clean after the external-project removal only `stablehlo` remains as an external LLVM project and a full-tree grep finds zero dangling references. The full `iree-opt` target builds, links, and runs (`iree-opt --version` → LLVM 23.0.0git). --------- Signed-off-by: Alex-Wengg <hanweng9@gmail.com>
PreviousNext