Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flatten pass removes block results required by br_on_* instructions #6924

Closed
tanishiking opened this issue Sep 10, 2024 · 4 comments
Closed

Comments

@tanishiking
Copy link
Contributor

WasmGC introduces new instructions: br_on_cast, br_on_cast_fail, br_on_non_null. These instructions branch to a specified label, passing the operand with a target type.

For example, in the following code, (br_on_non_null $labelNotNull (local.get $x)) branches to the (outside of) block $labelNotNull with the value (ref $obj) on the stack if $x is non-null.

When we apply the flatten pass to the code contains those instructions, the pass emits an invalid code:

(module
 (type $obj (struct))
 (func $f (export "f")
  (param $x (ref null $obj))
  (result (ref null $obj))
  (block $labelNotNull (result (ref null $obj))
   (br_on_non_null $labelNotNull
    (local.get $x)
   )
   (local.get $x)
  )
 )
)
$ wasm-opt -all --flatten test.wat -S -o -
[wasm-validator error in function f] break type must be a subtype of the target block type, on
(block $labelNotNull
 (local.set $1
  (local.get $0)
 )
 (br_on_non_null $labelNotNull
  (local.get $1)
 )
 (local.set $2
  (local.get $0)
 )
 (local.set $3
  (local.get $2)
 )
)
Fatal: error after opts

It is invalid because, while the break type of br_on_non_null is ref $obj, the block type is notype(?).

related: #6724

Should we lower br_on_non_null into ref.is_non_null and something, or can we somehow keep the block type through flatten pass?

@kripken
Copy link
Member

kripken commented Sep 10, 2024

This is a current limitation of flatten, I believe. In general the idea of flatten was to remove block results in order to simplify the IR. But things like br_on_* fundamentally require block results. So we'd need to either change the goal of flatten, or lower br_on_* into branches and locals, using ref.is_non_null etc., as you said, though that would be less efficient code, unfortunately.

I'm curious what you are using flatten for? In Binaryen itself we've used it in wasm2js, but that isn't important for WasmGC code anyhow (since wasm2js doesn't support GC in general, yet, and no current plans to).

@tanishiking
Copy link
Contributor Author

tanishiking commented Sep 11, 2024

Thanks @kripken. I'm interested in using flatten for LoopInvariantCodeMotion. I've noticed that while LICM moves entire expressions, it doesn't move nested sub-expressions, and it seems that flattening helps LICM as commented:

// Flattening is not necessary here, but may help (as separating
// out expressions may allow moving at least part of a larger whole).

// too, as with more nesting moving code is harder - so something
// like -O --flatten --licm -O may be best).

I'm not yet sure how impactful LICM is on our artifact (generated from the Scala.js wasm-backend), I'm just exploring how LICM moves our code 😄

Also, I tried to flatten on compiler side (by inserting local.set and local.get for each instructions that I wanna make an expression), but couldn't make binaryen IR flattened, and not sure it's flattened structure survives through -O passes.


though that would be less efficient code, unfortunately.

Yeah, that's not great unless we can transform those instructions back to br_on_* later :(

@kripken
Copy link
Member

kripken commented Sep 11, 2024

I see, interesting.

About licm, it is not enabled by default because in practice it doesn't seem that helpful: VMs will do licm themselves, and it makes the code larger (it adds locals), so in general it increases size without speeding things up. Though, it might help baseline tiers and interpreters, perhaps. But I'd be interested to hear thoughts or data otherwise!

not sure it's flattened structure survives through -O passes.

Hmm, yeah, in general the passes will "un-flatten" as that is more efficient. You might experiment with running licm as the first pass, though. (But, again, I am not sure how useful licm is, as I mentioned above.)

tanishiking added a commit to tanishiking/scala-js that referenced this issue Sep 17, 2024
Previously, we used mutable array-based interface tables (itables) without a specific reason.
However, accessing mutable arrays can lead to missed optimization opportunities.
For instance, loop-invariant code motion (LICM) won't be able to move certain parts of interface dispatching operations out of loops:
When we invoke a method using interface table dispatching, it requires retrieving an itable from the class's itable array, which is considered non-pure.

In fact, reading from an array is marked as unsafe to move in wasm-opt's LICM implementation:

https://github.com/WebAssembly/binaryen/blob/34ad6a7598e662e9ff357987f2c81fde1e05c522/src/ir/effects.h#L207-L210
https://github.com/WebAssembly/binaryen/blob/34ad6a7598e662e9ff357987f2c81fde1e05c522/src/passes/LoopInvariantCodeMotion.cpp#L125-L128

(Note: We may not run wasm-opt's LICM in favor of the VM's LICM, as discussed in WebAssembly/binaryen#6924)

This commit changes itables to be immutable structs, making itable retrieval a pure operation.
Here's a comparison of benchmark results between commit 4c9494e and this change:

|             | array itables | struct itables |
| ----------- | ------------- | -------------- |
| sha512      | 12128.73213   | 12107.94142    |
| queens      | 3.632068223   | 3.655815179    |
| list        | 84.61223471   | 65.69730564    |
| richards    | 92.12833022   | 91.19028905    |
| cd          | 46267.10318   | 46574.81215    |
| gcbench     | 116540.568    | 112594.324     |
| tracerFloat | 2569.675034   | 2551.227714    |
| tracer      | 1725.591249   | 1697.907466    |
| sudoku      | 14863.33909   | 12066.23882    |
| nbody       | 366388.3292   | 199625.971     |
| permute     | 1006.215249   | 983.9858283    |
| deltaBlue   | 2205.002528   | 2320.038418    |
| kmeans      | 1752457.829   | 1754425.973    |
| brainfuck   | 36983.97583   | 36093.7837     |
| json        | 1205.92764    | 1191.768986    |
| bounce      | 21.14001572   | 20.8851082     |

While most benchmark results remain largely unchanged, some tests such as `list`, `sudoku`, and `nbody` show improved performance compared to the array-based itables. This improvement may be attributed to the VM's optimization capabilities.

**immutable array?**
In this change, we implement itables using immutable structs
However, we can use immutable arrays as an alternative.
We implemented and benchmarked itables using immutable arrays, and the results show nearly same performance as the immutable struct-based implementation.

WebAssembly/gc#250
v8/v8@6e36e3e

Let's proceed with the immutable struct-based implementation for now because

- While V8 seems to optimize immutable array-related operations, other VMs may not offer similar optimizations (?)
- Currently, wasm-opt doesn't take into account the mutability of arrays in its optimization process.
tanishiking added a commit to tanishiking/scala-js that referenced this issue Sep 17, 2024
Previously, we used mutable array-based interface tables (itables) without a specific reason.
However, accessing mutable arrays can lead to missed optimization opportunities.
For instance, loop-invariant code motion (LICM) won't be able to move certain parts of interface dispatching operations out of loops:
When we invoke a method using interface table dispatching, it requires retrieving an itable from the class's itable array, which is considered non-pure.

In fact, reading from an array is marked as unsafe to move in wasm-opt's LICM implementation:

https://github.com/WebAssembly/binaryen/blob/34ad6a7598e662e9ff357987f2c81fde1e05c522/src/ir/effects.h#L207-L210
https://github.com/WebAssembly/binaryen/blob/34ad6a7598e662e9ff357987f2c81fde1e05c522/src/passes/LoopInvariantCodeMotion.cpp#L125-L128

(Note: We may not run wasm-opt's LICM in favor of the VM's LICM, as discussed in WebAssembly/binaryen#6924)

This commit changes itables to be immutable structs, making itable retrieval a pure operation.
Here's a comparison of benchmark results between commit 4c9494e and this change:

|             | array itables | struct itables |
| ----------- | ------------- | -------------- |
| sha512      | 12128.73213   | 12107.94142    |
| queens      | 3.632068223   | 3.655815179    |
| list        | 84.61223471   | 65.69730564    |
| richards    | 92.12833022   | 91.19028905    |
| cd          | 46267.10318   | 46574.81215    |
| gcbench     | 116540.568    | 112594.324     |
| tracerFloat | 2569.675034   | 2551.227714    |
| tracer      | 1725.591249   | 1697.907466    |
| sudoku      | 14863.33909   | 12066.23882    |
| nbody       | 366388.3292   | 199625.971     |
| permute     | 1006.215249   | 983.9858283    |
| deltaBlue   | 2205.002528   | 2320.038418    |
| kmeans      | 1752457.829   | 1754425.973    |
| brainfuck   | 36983.97583   | 36093.7837     |
| json        | 1205.92764    | 1191.768986    |
| bounce      | 21.14001572   | 20.8851082     |

While most benchmark results remain same, some tests such as `list`, `sudoku`, and `nbody` show improved performance compared to the array-based itables (may be thanks to the VM's optimization capabilities).

**immutable array?**
In this change, we implement itables using immutable structs
However, we can use immutable arrays as an alternative.
We implemented and benchmarked itables using immutable arrays, and the results show nearly same performance as the immutable struct-based implementation.

WebAssembly/gc#250
v8/v8@6e36e3e

Let's proceed with the immutable struct-based implementation for now because

- While V8 seems to optimize immutable array-related operations, other VMs may not offer similar optimizations (?)
- Currently, wasm-opt doesn't take into account the mutability of arrays in its optimization process.
tanishiking added a commit to tanishiking/scala-js that referenced this issue Sep 17, 2024
Previously, we used mutable array-based interface tables (itables) without a specific reason.
However, accessing mutable arrays can lead to missed optimization opportunities.
For instance, loop-invariant code motion (LICM) won't be able to move certain parts of interface dispatching operations out of loops:
When we invoke a method using interface table dispatching, it requires retrieving an itable from the class's itable array, which is considered non-pure.

In fact, reading from an array is marked as unsafe to move in wasm-opt's LICM implementation:

https://github.com/WebAssembly/binaryen/blob/34ad6a7598e662e9ff357987f2c81fde1e05c522/src/ir/effects.h#L207-L210
https://github.com/WebAssembly/binaryen/blob/34ad6a7598e662e9ff357987f2c81fde1e05c522/src/passes/LoopInvariantCodeMotion.cpp#L125-L128

(Note: We may not run wasm-opt's LICM in favor of the VM's LICM, as discussed in WebAssembly/binaryen#6924)

This commit changes itables to be immutable structs, making itable retrieval a pure operation.
Here's a comparison of benchmark results between commit 4c9494e and this change:

|             | array itables | struct itables |
| ----------- | ------------- | -------------- |
| sha512      | 12128.73213   | 12107.94142    |
| queens      | 3.632068223   | 3.655815179    |
| list        | 84.61223471   | 65.69730564    |
| richards    | 92.12833022   | 91.19028905    |
| cd          | 46267.10318   | 46574.81215    |
| gcbench     | 116540.568    | 112594.324     |
| tracerFloat | 2569.675034   | 2551.227714    |
| tracer      | 1725.591249   | 1697.907466    |
| sudoku      | 14863.33909   | 12066.23882    |
| nbody       | 366388.3292   | 199625.971     |
| permute     | 1006.215249   | 983.9858283    |
| deltaBlue   | 2205.002528   | 2320.038418    |
| kmeans      | 1752457.829   | 1754425.973    |
| brainfuck   | 36983.97583   | 36093.7837     |
| json        | 1205.92764    | 1191.768986    |
| bounce      | 21.14001572   | 20.8851082     |

While most benchmark results remain same, some tests such as `list`, `sudoku`, and `nbody` show improved performance compared to the array-based itables (may be thanks to the VM's optimization capabilities).

**immutable array?**
In this change, we implement itables using immutable structs
However, we can use immutable arrays as an alternative.
We implemented and benchmarked itables using immutable arrays, and the results show nearly same performance as the immutable struct-based implementation.

WebAssembly/gc#250
v8/v8@6e36e3e

Let's proceed with the immutable struct-based implementation for now because

- While V8 seems to optimize immutable array-related operations, other VMs may not offer similar optimizations (?)
- Currently, wasm-opt doesn't take into account the mutability of arrays in its optimization process.
@tanishiking
Copy link
Contributor Author

in practice it doesn't seem that helpful: VMs will do licm themselves, and it makes the code larger (it adds locals), so in general it increases size without speeding things up.

Thanks, that’s interesting. It seems like it's better to leave LICM optimizations to the VMs.

We recently moved the itables implementation from a mutable array to an immutable struct in this PR, and I've noticed some performance improvements in loop-heavy benchmarks (I'm not sure if these gains are directly due to LICM optimizations by the VMs, though).

I'll close the issue for now. Thank you very much, @kripken, for answering my questions!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants