loom-dataflow is a sub-module of the loom project. It provides an MLIR-backed compiler pipeline for exploring spatial hardware mappings and generating constraint models for dataflow accelerators. The pipeline lowers tensor-level kernels through a series of analysis and transformation passes, culminating in bufferized IR annotated with hardware mapping and memory allocation decisions.

    lib/
      analysis/           - Static memory analysis library (shared by passes and debug CLI)
      dataflow-dialect/   - TableGen + C++ for the df MLIR dialect
      loom-dialect/       - TableGen + C++ for the loom MLIR dialect
      modules/            - Hardware topology compositions (2D mesh, torus, ring chains)
      passes/
        common/           - Shared analysis utilities (hardware discovery, affine utils, etc.)
        loom-opt/         - Core transformation passes
        lcs/              - Loom Compute Schedule: staged ETG builder and constraint expressions
        tt-opt/           - Post-bufferization TT optimization passes
      pipeline/           - High-level C++ API and Python bindings (pybind11)
      resources/          - Primitive hardware resource models (SRAM banks, rings, chains)
    tool/
      loom-opt/single_stage/  - Single-stage CLI drivers for each pipeline pass
      tt-opt/single_stage/    - CLI driver for tt-opt
      dataflow-dialect/       - Dataflow dialect utilities
      resource-system/        - Hardware resource demos
      loom-lsp-server/        - LSP server for IDE support
    test/
      Passes/mm_2Dmesh/       - Primary regression test (matrix multiply on 2D mesh)
      Passes/flashattn_2Dmesh/
      Passes/mm_ibmring/
      Dialect/                - Dialect syntax and semantics tests

- CMake ≥ 3.20, Ninja, a C++17 compiler, and `lld` (or another linker if you override `LLVM_USE_LINKER`).
- An installed LLVM/MLIR build that exports CMake packages. The scripts default to `MLIR_DIR=/opt/llvm-mlir/lib/cmake/mlir`.
- `lit` or `llvm-lit` on `PATH` for CTest-driven MLIR tests (`pipx install lit` is the recommended route).

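Before configuring, it can help to confirm that the prerequisite tools are visible on `PATH`. The script below is a minimal sketch and not part of the repository: the helper name `find_tools` is hypothetical, and the executable names (for example `ld.lld` for lld) are assumptions that may differ between distributions. It does not check the CMake version.

```python
import shutil

# Candidate executable names per prerequisite. These names are assumptions;
# adjust them for your distribution.
REQUIRED = {
    "cmake": ["cmake"],
    "ninja": ["ninja"],
    "lld": ["ld.lld", "lld"],
    "lit": ["lit", "llvm-lit"],
}


def find_tools(required=REQUIRED, which=shutil.which):
    """Map each prerequisite to the first candidate found on PATH, or None."""
    found = {}
    for name, candidates in required.items():
        found[name] = next((p for p in map(which, candidates) if p), None)
    return found


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(f"{name:6} {path or 'MISSING'}")
```

Any entry reported as `MISSING` points at a prerequisite from the list above that still needs installing or adding to `PATH`.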
Quick install (Linux/Debian):
sudo apt install cmake build-essential ninja-build lld

git clone https://github.com/llvm/llvm-project.git $HOME/llvm-project
cd $HOME/llvm-project && mkdir build && cd build
cmake -G Ninja ../llvm \
-DLLVM_ENABLE_PROJECTS=mlir \
-DLLVM_BUILD_EXAMPLES=ON \
-DLLVM_TARGETS_TO_BUILD="Native" \
-DCMAKE_BUILD_TYPE=RelWithDebInfo \
-DLLVM_ENABLE_ASSERTIONS=ON \
-DLLVM_ENABLE_LLD=ON \
-DMLIR_INCLUDE_INTEGRATION_TESTS=ON \
-DCMAKE_INSTALL_PREFIX=/opt/llvm-mlir \
-DLLVM_BUILD_UTILS=ON -DLLVM_INSTALL_UTILS=ON
cmake --build . --target check-mlir
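ninja install

Then build loom-dataflow with the wrapper script:

./build.sh

Flags such as `--mlir-dir=/path/to/mlir` and `--llvm-lit=/path/to/lit` override defaults. Run `./build.sh --help` for the full list.

Alternatively, configure and build manually with CMake:
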
mkdir -p build && cd build
cmake -G Ninja .. \
-DCMAKE_BUILD_TYPE=Release \
-DMLIR_DIR=/opt/llvm-mlir/lib/cmake/mlir \
-DLLVM_EXTERNAL_LIT=$(command -v lit || command -v llvm-lit) \
-DLLVM_USE_LINKER=lld
cmake --build . --config Release

`./setup_ide.sh` performs a clean Debug build and emits `build/compile_commands.json` for IntelliSense.

All binaries live under `build/tool/` after a build. The full pipeline is exercised by `run_pipeline.sh` (the regression test).

| Step | Tool | Purpose |
|---|---|---|
| 1 | `tensor_canonicalize` | Specialize linalg destination operands; fold redundant `tensor.extract_slice` |
| 2 | `memory_binding` | Bind physical memory to tensor ops via `loom.alloc` annotations |
| 3 | `enumerate_hw_mapping` | Enumerate spatial mappings from `df.spatial_dim` declarations to `affine.parallel` iterators |
| 4 | `analyze_reuse` | Annotate each `loom.subview` with `loom.reuse` (spatial / temporal / sequential dims) |
| 5 | `enumerate_copy_broadcast` | Enumerate per-copy broadcast choices; annotate `memref.alloc` with `loom.alloc` |
| 6 | `staged_etg` | Build Staged Execution Task Graph → JSON constraint model |
| 7 | `canonicalize` | Materialize symbolic block sizes into concrete constants; canonicalize IR |
| 8 | `one_shot_bufferize` | One-shot bufferization: tensor ops → memref ops |
| 9 | `tt-opt` | Convert zero-initialized linalg matmul ops to Loom ops and fold redundant zero fills |

Not in the default pipeline (under active development):
- `hoist_block_loading`: hoist block loading operations from innermost loops to outer loop levels

cd third_party/loom-dataflow
./run_pipeline.sh

# Step 1
build/tool/loom-opt/single_stage/tensor_canonicalize \
--input test/Passes/mqa_decode/IR/00_from_helion_frontend.mlir \
> test/Passes/mqa_decode/IR/01_tensor_canonicalized.mlir
# Step 2
build/tool/loom-opt/single_stage/memory_binding \
--input test/Passes/mqa_decode/IR/01_tensor_canonicalized.mlir \
> test/Passes/mqa_decode/IR/02_explicit_memory_access.mlir
# Step 3
build/tool/loom-opt/single_stage/enumerate_hw_mapping \
--input test/Passes/mqa_decode/IR/02_explicit_memory_access.mlir \
--hw_spec /root/loom/third_party/loom-mlar/tests/2d_mesh/2d_mesh_torus.mlir \
> test/Passes/mqa_decode/IR/03_after_hardware_mapping.mlir
# Step 4
build/tool/loom-opt/single_stage/analyze_reuse \
--input test/Passes/mqa_decode/IR/03_after_hardware_mapping.mlir \
> test/Passes/mqa_decode/IR/04_after_reuse_analyzation.mlir
# Step 5
build/tool/loom-opt/single_stage/enumerate_copy_broadcast \
--input test/Passes/mqa_decode/IR/04_after_reuse_analyzation.mlir \
> test/Passes/mqa_decode/IR/05_after_enumerate_broadcast.mlir
# Step 6 — emits JSON constraint model
build/tool/loom-opt/single_stage/staged_etg \
--input test/Passes/mqa_decode/IR/05_after_enumerate_broadcast.mlir \
--hw_spec /root/loom/third_party/loom-mlar/tests/2d_mesh/2d_mesh_torus.mlir \
--output test/Passes/mqa_decode/constraint_space/staged_etg_dump.json
# Step 7
build/tool/loom-opt/single_stage/canonicalize \
--input test/Passes/mqa_decode/IR/05_after_enumerate_broadcast.mlir \
> test/Passes/mqa_decode/IR/06_after_canonicalize.mlir
# Step 8
build/tool/loom-opt/single_stage/one_shot_bufferize \
--input test/Passes/mqa_decode/IR/06_after_canonicalize.mlir \
> test/Passes/mqa_decode/IR/07_after_osb.mlir
# Step 9
build/tool/tt-opt/single_stage/tt-opt \
--input test/Passes/mqa_decode/IR/07_after_osb.mlir \
> test/Passes/mqa_decode/IR/08_tt-opt.mlir

`tensor_canonicalize` (step 1)

- Purpose: Identify and fold redundant elementwise accumulation patterns into `linalg` output operands (e.g., `matmul(A,B,fill(0)) + add(iter_args)` → `matmul(A,B,iter_args)`). Then remove no-op `tensor.extract_slice` operations.
- Implementation: `lib/passes/loom-opt/src/linalg_destination_specialization_pass.cpp`, `fold_redundant_extract_slice_pass.cpp`
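
The accumulation fold is valid because a destination-style matmul adds its product onto the destination operand, so starting from zero and adding the accumulator afterwards gives the same result as starting from the accumulator. The pure-Python check below illustrates that identity only. It is not the pass, and the helpers `matmul`, `add`, and `zeros` are hypothetical stand-ins for the corresponding ops.

```python
def matmul(a, b, init):
    """Destination-style matmul on lists of lists: returns init + a @ b."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [init[i][j] + sum(a[i][k] * b[k][j] for k in range(inner))
         for j in range(cols)]
        for i in range(rows)
    ]


def add(x, y):
    """Elementwise sum of two equally shaped matrices."""
    return [[p + q for p, q in zip(rx, ry)] for rx, ry in zip(x, y)]


def zeros(rows, cols):
    return [[0] * cols for _ in range(rows)]


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
acc = [[10, 20], [30, 40]]  # stands in for the loop-carried iter_args value

before = add(matmul(A, B, zeros(2, 2)), acc)  # matmul(A,B,fill(0)) + add(iter_args)
after = matmul(A, B, acc)                     # matmul(A,B,iter_args)
assert before == after
```

The folded form saves one zero fill and one elementwise add per loop iteration.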

`memory_binding` (step 2)

- Purpose: Transform bufferization patterns into `loom` dialect operations that bind physical memory allocations to tensor semantics for downstream dataflow analysis.
- Implementation: `lib/passes/loom-opt/src/memory_binding_pass.cpp`

`enumerate_hw_mapping` (step 3)

- Purpose: Enumerate all valid assignments of `df.spatial_dim` hardware dimensions to the outermost `affine.parallel` iterators. The pass clones the function once per mapping, annotates inner loops with `loom.mapped_to`, and inserts outer `affine.for` wave loops when the mesh size does not cover the iteration space in one shot.
- Implementation: `lib/passes/loom-opt/src/triton_shared_spatial_mapping_pass.cpp`
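
The enumeration can be pictured as follows. This is a sketch under stated assumptions, not the pass's actual algorithm: it assumes each hardware dimension binds to a distinct iterator, and it treats the wave count as the ceiling of iterator extent over hardware dimension size. The function name, mesh sizes, and loop extents are all hypothetical.

```python
from itertools import permutations
from math import ceil


def enumerate_mappings(hw_dims, iterators):
    """Yield one dict per candidate mapping: hardware dim -> (iterator, waves).

    Assumption: each hardware dimension is bound to a distinct iterator.
    waves > 1 means the iterator extent exceeds the hardware dimension size,
    so an outer sequential loop with that many steps would be needed.
    """
    names = list(hw_dims)
    for chosen in permutations(iterators, len(names)):
        yield {
            hw: (it, ceil(iterators[it] / hw_dims[hw]))
            for hw, it in zip(names, chosen)
        }


mesh = {"x": 4, "y": 2}    # hypothetical 4x2 mesh
loops = {"i": 8, "j": 4}   # hypothetical parallel iterator extents
mappings = list(enumerate_mappings(mesh, loops))
# mappings[0] binds x to i (2 waves) and y to j (2 waves);
# mappings[1] binds x to j (1 wave) and y to i (4 waves).
```

Each yielded dictionary corresponds to one cloned function variant in the description above.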

`analyze_reuse` (step 4)

- Purpose: Attach a `loom.reuse` dictionary to each `loom.subview` describing how its offset varies with surrounding iterators (spatial / temporal / sequential). Records `reuse_type` (`no_reuse` / `total_reuse`) and volume per dimension.
- Implementation: `lib/passes/loom-opt/src/analyze_reuse.cpp`
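
As a rough illustration of the classification, the sketch below assumes a simple rule: if a subview's offset does not depend on an iterator, every value of that iterator touches the same data (`total_reuse`); otherwise there is `no_reuse`. The rule, the function name `classify_reuse`, and the dictionary layout are assumptions made for illustration. The real attribute format may differ, and the volume field is omitted.

```python
def classify_reuse(offset_iterators, surrounding):
    """Classify reuse of one subview with respect to each surrounding iterator.

    offset_iterators: names of iterators that appear in the subview offset.
    surrounding: dict mapping iterator name -> kind
                 ("spatial", "temporal", or "sequential").
    Assumed rule: an iterator absent from the offset gives total_reuse,
    an iterator present in the offset gives no_reuse.
    """
    result = {}
    for name, kind in surrounding.items():
        reuse = "no_reuse" if name in offset_iterators else "total_reuse"
        result[name] = {"kind": kind, "reuse_type": reuse}
    return result


# Hypothetical matmul tile: the A tile is indexed by (i, k) but not by j.
loops = {"i": "spatial", "j": "spatial", "k": "temporal"}
a_reuse = classify_reuse({"i", "k"}, loops)
# a_reuse["j"] reports total_reuse along the spatial iterator j.
```

Dimensions with total spatial reuse are the ones a later stage can consider for broadcasting.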

`enumerate_copy_broadcast` (step 5)

- Purpose: For each `loom.copy_to_tensor`, enumerate whether the copy should use local memory or broadcast along dimensions with total spatial reuse. Annotates `memref.alloc` with `loom.alloc` carrying the candidate set. Supports `--analysis-only` mode to annotate without cloning.
- Implementation: `lib/passes/loom-opt/src/enumerate_copy_broadcast.cpp`
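
The size of the resulting search space is easiest to see with a small enumeration. The sketch below assumes each copy independently picks either `local` or a broadcast along exactly one dimension with total spatial reuse; whether the pass also considers multi-dimension broadcasts is not stated here. The names `copy_candidates`, `enumerate_variants`, and the copy labels are hypothetical.

```python
from itertools import product


def copy_candidates(total_reuse_spatial_dims):
    """Choices for one copy: local memory, or broadcast along one reuse dim."""
    return ["local"] + [f"broadcast:{d}" for d in total_reuse_spatial_dims]


def enumerate_variants(copies):
    """copies: dict copy name -> list of spatial dims with total reuse.

    Returns one dict per combination of per-copy choices.
    """
    names = list(copies)
    options = [copy_candidates(copies[n]) for n in names]
    return [dict(zip(names, combo)) for combo in product(*options)]


variants = enumerate_variants({"load_A": ["y"], "load_B": ["x"], "load_C": []})
# 2 choices for load_A, 2 for load_B, 1 for load_C: 4 variants in total.
```

Because the count is a product over copies, it grows quickly, which is one reason an annotate-only mode that records candidates without cloning is useful.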

`staged_etg` (step 6)

- Purpose: Traverse the annotated IR and construct a Staged Execution Task Graph. Emits a JSON constraint model describing compute and communication schedules for use by the SMT solver in the broader loom pipeline.
- Implementation: `lib/passes/lcs/src/staged_etg_builder.cpp`; CLI driver: `tool/loom-opt/single_stage/staged_etg_main.cpp`

`canonicalize` (step 7)

- Purpose: Replace `loom.sym` symbolic block-size variables with concrete `arith.constant` values from the SMT solver result map. Variants for which no feasible solution exists are dropped with a diagnostic warning.
- Implementation: `lib/passes/loom-opt/src/materialize.cpp`
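
The materialization step amounts to a lookup plus a filter. The sketch below models it with plain dictionaries. The data layout is an assumption (a missing or `None` solver entry stands for an infeasible variant), and `materialize`, the variant names, and the symbol names are hypothetical.

```python
import warnings


def materialize(variants, solutions):
    """Bind each variant's symbolic block sizes to solver-chosen constants.

    variants:  dict variant name -> list of symbol names it uses.
    solutions: dict variant name -> {symbol: int}; a missing entry or None
               means no feasible assignment was found (assumed encoding).
    """
    concrete = {}
    for name, symbols in variants.items():
        values = solutions.get(name)
        if values is None:
            warnings.warn(f"dropping variant {name!r}: no feasible solution")
            continue
        concrete[name] = {sym: values[sym] for sym in symbols}
    return concrete


result = materialize(
    {"v0": ["bm", "bn"], "v1": ["bm"]},
    {"v0": {"bm": 32, "bn": 64}, "v1": None},
)
# result == {"v0": {"bm": 32, "bn": 64}}; v1 is dropped with a warning.
```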

`one_shot_bufferize` (step 8)

- Purpose: Run MLIR's one-shot bufferization to lower tensor-level IR to memref-based IR in a single pass, using the `loom` dialect's custom bufferization interface.
- Implementation: `lib/loom-dialect/Transforms/BufferizableOpInterfaceImpl.h`; CLI driver: `tool/loom-opt/single_stage/one_shot_bufferize_main.cpp`

`tt-opt` (step 9)

- Purpose: Convert same-block zero-initialized `linalg.matmul` and `linalg.batch_matmul` ops to `loom.matmul` and `loom.batch_matmul`, then remove redundant zero `linalg.fill` ops feeding remaining destination-style `linalg` ops when there is no intervening use.
- Implementation: `lib/passes/tt-opt/src/convert_zero_fill_linalg_matmul_to_loom_pass.cpp`, `lib/passes/tt-opt/src/fold_zero_fill_linalg_pass.cpp`

`hoist_block_loading` (not in the default pipeline)

- Purpose: Hoist block loading operations from innermost `affine.for` loops to outer loop levels to reduce redundant memory accesses. Identifies `loom.alloc` + `loom.copy_to_tensor` loading block patterns and clones the function per loading block.
- Status: Builds but is not part of the default pipeline. Updates pending.
- Implementation: `lib/passes/loom-opt/src/hoist_block_loading.cpp`, `lib/passes/common/src/block_loading_pattern.cpp`
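
The benefit of hoisting can be shown with a toy loop nest. The sketch below counts block loads for a two-level nest in which the loaded block is assumed to depend only on the outer iterator, so the load is invariant in the inner loop. It models the effect only, not the pass's legality analysis, and `count_loads` is a hypothetical helper.

```python
def count_loads(outer, inner, hoisted):
    """Count block loads for a 2-deep loop nest.

    Assumption: the loaded block depends only on the outer iterator i, so the
    load is invariant in the inner loop and can move up one level.
    """
    loads = 0
    for i in range(outer):
        if hoisted:
            loads += 1        # load block(i) once per outer iteration
        for j in range(inner):
            if not hoisted:
                loads += 1    # load block(i) again for every j
            # ... compute on block(i) and j would go here ...
    return loads


baseline = count_loads(4, 8, hoisted=False)   # 32 loads
optimized = count_loads(4, 8, hoisted=True)   # 4 loads
```

In this example the load count drops by a factor equal to the inner trip count.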

`static_memory_analyser` is a standalone CLI for the memory analysis pass used internally by `memory_binding`. It parses an MLIR file and dumps the virtual buffer allocation plan (bucket grouping, coloring, liveness).

build/tool/loom-opt/single_stage/static_memory_analyser --input <file.mlir>

`benchmark.sh` measures tool execution time with statistical analysis (mean, median, min, max, std dev across multiple runs).

./benchmark.sh --warmup=3 --runs=10 -- build/tool/loom-opt/single_stage/tensor_canonicalize \
--input test/Passes/mqa_decode/IR/00_from_helion_frontend.mlir

# Run all CTest-driven MLIR tests
cd build && ctest --output-on-failure
# Run the full end-to-end regression pipeline
cd third_party/loom-dataflow && ./run_pipeline.sh

Test cases are under `test/Passes/` (`mm_2Dmesh`, `flashattn_2Dmesh`, `mm_ibmring`) and `test/Dialect/`.

- `lit` not found: install via `pipx install lit` (preferred) or provide `--llvm-lit=/path/to/lit`.
- `MLIRConfig.cmake` missing: export `MLIR_DIR` to point at your LLVM/MLIR installation.
- IntelliSense gaps: rerun `./setup_ide.sh` so that `compile_commands.json` stays in sync with TableGen-generated headers.

See LICENSE.