Flatten pass removes block results required by `br_on_*` instructions #6924

tanishiking · 2024-09-10T11:44:33Z

WasmGC introduces new instructions: br_on_cast, br_on_cast_fail, br_on_non_null. These instructions branch to a specified label, passing the operand with a target type.

For example, in the following code, (br_on_non_null $labelNotNull (local.get $x)) branches to the (outside of) block $labelNotNull with the value (ref $obj) on the stack if $x is non-null.

When we apply the flatten pass to the code contains those instructions, the pass emits an invalid code:

(module
 (type $obj (struct))
 (func $f (export "f")
  (param $x (ref null $obj))
  (result (ref null $obj))
  (block $labelNotNull (result (ref null $obj))
   (br_on_non_null $labelNotNull
    (local.get $x)
   )
   (local.get $x)
  )
 )
)

$ wasm-opt -all --flatten test.wat -S -o -
[wasm-validator error in function f] break type must be a subtype of the target block type, on
(block $labelNotNull
 (local.set $1
  (local.get $0)
 )
 (br_on_non_null $labelNotNull
  (local.get $1)
 )
 (local.set $2
  (local.get $0)
 )
 (local.set $3
  (local.get $2)
 )
)
Fatal: error after opts

It is invalid because, while the break type of br_on_non_null is ref $obj, the block type is notype(?).

related: #6724

Should we lower br_on_non_null into ref.is_non_null and something, or can we somehow keep the block type through flatten pass?

The text was updated successfully, but these errors were encountered:

kripken · 2024-09-10T16:51:51Z

This is a current limitation of flatten, I believe. In general the idea of flatten was to remove block results in order to simplify the IR. But things like br_on_* fundamentally require block results. So we'd need to either change the goal of flatten, or lower br_on_* into branches and locals, using ref.is_non_null etc., as you said, though that would be less efficient code, unfortunately.

I'm curious what you are using flatten for? In Binaryen itself we've used it in wasm2js, but that isn't important for WasmGC code anyhow (since wasm2js doesn't support GC in general, yet, and no current plans to).

tanishiking · 2024-09-11T03:13:09Z

Thanks @kripken. I'm interested in using flatten for LoopInvariantCodeMotion. I've noticed that while LICM moves entire expressions, it doesn't move nested sub-expressions, and it seems that flattening helps LICM as commented:

binaryen/src/passes/LoopInvariantCodeMotion.cpp

Lines 21 to 22 in 432ed1c

    
           // Flattening is not necessary here, but may help (as separating 
        
           // out expressions may allow moving at least part of a larger whole).

binaryen/src/passes/LoopInvariantCodeMotion.cpp

Lines 212 to 213 in 432ed1c

    
           // too, as with more nesting moving code is harder - so something 
        
           // like -O --flatten --licm -O may be best).

I'm not yet sure how impactful LICM is on our artifact (generated from the Scala.js wasm-backend), I'm just exploring how LICM moves our code 😄

Also, I tried to flatten on compiler side (by inserting local.set and local.get for each instructions that I wanna make an expression), but couldn't make binaryen IR flattened, and not sure it's flattened structure survives through -O passes.

though that would be less efficient code, unfortunately.

Yeah, that's not great unless we can transform those instructions back to br_on_* later :(

kripken · 2024-09-11T16:19:01Z

I see, interesting.

About licm, it is not enabled by default because in practice it doesn't seem that helpful: VMs will do licm themselves, and it makes the code larger (it adds locals), so in general it increases size without speeding things up. Though, it might help baseline tiers and interpreters, perhaps. But I'd be interested to hear thoughts or data otherwise!

not sure it's flattened structure survives through -O passes.

Hmm, yeah, in general the passes will "un-flatten" as that is more efficient. You might experiment with running licm as the first pass, though. (But, again, I am not sure how useful licm is, as I mentioned above.)

Previously, we used mutable array-based interface tables (itables) without a specific reason. However, accessing mutable arrays can lead to missed optimization opportunities. For instance, loop-invariant code motion (LICM) won't be able to move certain parts of interface dispatching operations out of loops: When we invoke a method using interface table dispatching, it requires retrieving an itable from the class's itable array, which is considered non-pure. In fact, reading from an array is marked as unsafe to move in wasm-opt's LICM implementation: https://github.com/WebAssembly/binaryen/blob/34ad6a7598e662e9ff357987f2c81fde1e05c522/src/ir/effects.h#L207-L210 https://github.com/WebAssembly/binaryen/blob/34ad6a7598e662e9ff357987f2c81fde1e05c522/src/passes/LoopInvariantCodeMotion.cpp#L125-L128 (Note: We may not run wasm-opt's LICM in favor of the VM's LICM, as discussed in WebAssembly/binaryen#6924) This commit changes itables to be immutable structs, making itable retrieval a pure operation. Here's a comparison of benchmark results between commit 4c9494e and this change: | | array itables | struct itables | | ----------- | ------------- | -------------- | | sha512 | 12128.73213 | 12107.94142 | | queens | 3.632068223 | 3.655815179 | | list | 84.61223471 | 65.69730564 | | richards | 92.12833022 | 91.19028905 | | cd | 46267.10318 | 46574.81215 | | gcbench | 116540.568 | 112594.324 | | tracerFloat | 2569.675034 | 2551.227714 | | tracer | 1725.591249 | 1697.907466 | | sudoku | 14863.33909 | 12066.23882 | | nbody | 366388.3292 | 199625.971 | | permute | 1006.215249 | 983.9858283 | | deltaBlue | 2205.002528 | 2320.038418 | | kmeans | 1752457.829 | 1754425.973 | | brainfuck | 36983.97583 | 36093.7837 | | json | 1205.92764 | 1191.768986 | | bounce | 21.14001572 | 20.8851082 | While most benchmark results remain largely unchanged, some tests such as `list`, `sudoku`, and `nbody` show improved performance compared to the array-based itables. This improvement may be attributed to the VM's optimization capabilities. **immutable array?** In this change, we implement itables using immutable structs However, we can use immutable arrays as an alternative. We implemented and benchmarked itables using immutable arrays, and the results show nearly same performance as the immutable struct-based implementation. WebAssembly/gc#250 v8/v8@6e36e3e Let's proceed with the immutable struct-based implementation for now because - While V8 seems to optimize immutable array-related operations, other VMs may not offer similar optimizations (?) - Currently, wasm-opt doesn't take into account the mutability of arrays in its optimization process.

Previously, we used mutable array-based interface tables (itables) without a specific reason. However, accessing mutable arrays can lead to missed optimization opportunities. For instance, loop-invariant code motion (LICM) won't be able to move certain parts of interface dispatching operations out of loops: When we invoke a method using interface table dispatching, it requires retrieving an itable from the class's itable array, which is considered non-pure. In fact, reading from an array is marked as unsafe to move in wasm-opt's LICM implementation: https://github.com/WebAssembly/binaryen/blob/34ad6a7598e662e9ff357987f2c81fde1e05c522/src/ir/effects.h#L207-L210 https://github.com/WebAssembly/binaryen/blob/34ad6a7598e662e9ff357987f2c81fde1e05c522/src/passes/LoopInvariantCodeMotion.cpp#L125-L128 (Note: We may not run wasm-opt's LICM in favor of the VM's LICM, as discussed in WebAssembly/binaryen#6924) This commit changes itables to be immutable structs, making itable retrieval a pure operation. Here's a comparison of benchmark results between commit 4c9494e and this change: | | array itables | struct itables | | ----------- | ------------- | -------------- | | sha512 | 12128.73213 | 12107.94142 | | queens | 3.632068223 | 3.655815179 | | list | 84.61223471 | 65.69730564 | | richards | 92.12833022 | 91.19028905 | | cd | 46267.10318 | 46574.81215 | | gcbench | 116540.568 | 112594.324 | | tracerFloat | 2569.675034 | 2551.227714 | | tracer | 1725.591249 | 1697.907466 | | sudoku | 14863.33909 | 12066.23882 | | nbody | 366388.3292 | 199625.971 | | permute | 1006.215249 | 983.9858283 | | deltaBlue | 2205.002528 | 2320.038418 | | kmeans | 1752457.829 | 1754425.973 | | brainfuck | 36983.97583 | 36093.7837 | | json | 1205.92764 | 1191.768986 | | bounce | 21.14001572 | 20.8851082 | While most benchmark results remain same, some tests such as `list`, `sudoku`, and `nbody` show improved performance compared to the array-based itables (may be thanks to the VM's optimization capabilities). **immutable array?** In this change, we implement itables using immutable structs However, we can use immutable arrays as an alternative. We implemented and benchmarked itables using immutable arrays, and the results show nearly same performance as the immutable struct-based implementation. WebAssembly/gc#250 v8/v8@6e36e3e Let's proceed with the immutable struct-based implementation for now because - While V8 seems to optimize immutable array-related operations, other VMs may not offer similar optimizations (?) - Currently, wasm-opt doesn't take into account the mutability of arrays in its optimization process.

tanishiking · 2024-09-18T03:00:20Z

in practice it doesn't seem that helpful: VMs will do licm themselves, and it makes the code larger (it adds locals), so in general it increases size without speeding things up.

Thanks, that’s interesting. It seems like it's better to leave LICM optimizations to the VMs.

We recently moved the itables implementation from a mutable array to an immutable struct in this PR, and I've noticed some performance improvements in loop-heavy benchmarks (I'm not sure if these gains are directly due to LICM optimizations by the VMs, though).

I'll close the issue for now. Thank you very much, @kripken, for answering my questions!

tanishiking mentioned this issue Sep 17, 2024

Wasm: Make itables immutable struct scala-js/scala-js#5038

Merged

tanishiking closed this as completed Sep 18, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Flatten pass removes block results required by `br_on_*` instructions #6924

Flatten pass removes block results required by `br_on_*` instructions #6924

tanishiking commented Sep 10, 2024

kripken commented Sep 10, 2024

tanishiking commented Sep 11, 2024 •

edited

Loading

kripken commented Sep 11, 2024

tanishiking commented Sep 18, 2024

Flatten pass removes block results required by br_on_* instructions #6924

Flatten pass removes block results required by br_on_* instructions #6924

Comments

tanishiking commented Sep 10, 2024

kripken commented Sep 10, 2024

tanishiking commented Sep 11, 2024 • edited Loading

kripken commented Sep 11, 2024

tanishiking commented Sep 18, 2024

Flatten pass removes block results required by `br_on_*` instructions #6924

Flatten pass removes block results required by `br_on_*` instructions #6924

tanishiking commented Sep 11, 2024 •

edited

Loading