spec: Export indirection#2
Conversation
| a world of pain when they start mattering more. | ||
| --> | ||
|
|
||
| “Stack height” is the height of the implicit stack. Stack height is a sum of: |
There was a problem hiding this comment.
Is it on purpose that you're counting u32 and u64 the same, and that you're counting only one per activation frame while afaict it's equivalent to 2 u64? (return address and saved rbp, not spilled local registers as it's already counted just below)
There was a problem hiding this comment.
Yes, it is purposeful. Stack height here is a number of “entries” in WebAssembly specification parlance, and the actual machine quantities would be accounted for by setting up the machine stack in such a way that any combination of stacklimit entries would fit.
There was a problem hiding this comment.
The need to handle the locals here specially is somewhat unfortunate, but I don't really see any way out besides having a level indirection and allocating the locals on the heap for each call activation entry.
There was a problem hiding this comment.
Hmm so I don't think we actually need to count stack in number of wasm-spec-"entries", do we?
My thoughts are, sure it'll probably be hard to count in number of bytes (because currently we're making instrumentation before compiling, so we don't have type analysis on the wasm stack and don't know what is an u32 and what is an u64), but we could count in "u64 or less" sizes rather than the "u128 or less" that you suggest, thus gaining a 2x factor in number of actually-allowed memory for the same host-reserved memory usage (as function activation frames should be pretty rare).
Does that make sense?
There was a problem hiding this comment.
WASM simd spec adds values that are 128 bit long though, and we might add support for it soon.
There was a problem hiding this comment.
Ugh :/ and I guess these would be on the value stack with no specific eg. drop instruction just for them?
If we're thinking of adding support for that soon, I guess we can just count in terms of u128 and reconsider on the day we'll migrate this to be compiler-handled checks
There was a problem hiding this comment.
I guess the instrumentation could be parametrized by the size of certain entries, and we definitely know the layout of the stack at its maximum height too, as it is determined entirely by the instructions that are part of the block.
The problem with that approach, though is that it isn't clear how that would translate to e.g. interpreters that might store a pointer to heap allocated entry on the stack or something similar.
Either way, I suggest I slap a TODO on this for now and we come back to this once the spec shapes up more.
| 0. Assert: Stack height is 0, before any stack operations occur. | ||
| 9. Before function invocation: | ||
| * Charge gas for each local of `funcaddr` and execution of this `call` instruction; | ||
| * Require that the stack height required to execute `funcaddr` does not exceed `stacklimit`. |
There was a problem hiding this comment.
| * Require that the stack height required to execute `funcaddr` does not exceed `stacklimit`. | |
| * Require that the stack height required to execute `funcaddr` until the next `stack` call does not exceed `stacklimit`. |
There was a problem hiding this comment.
I avoided referencing stack and gas functions somewhat purposefully – so that the text here makes sense regardless of whatever specific instrumentation strategy is chosen. That might change in the future, but I'm thinking that a better approach would be to just adjust the execution semantics for the call instruction so that it checks the limit reserves the slots in the implicit stack.
Do you think that would make sense?
There was a problem hiding this comment.
You're right! I was originally thinking of functions that look like rep push / rep pop that'd require stack calls inside the function; but upon more thought this is actually not possible because wasm blocks need to leave the block at the same stack height as it was when they were entered, so there should never be any stack call in the middle of the function and we don't need this change.
| (func (export "main") | ||
| (call $stack (i32.const 6)) | ||
| (call $gas (i64.const 6)) | ||
| (call $main) |
There was a problem hiding this comment.
@matklad I think you were pretty concerned about potential runtime performance implications of doing something along these lines – would it help if this was a return_call (i.e. a tail-call), pretty much ending up as a very cheap jmp and no extra VM stack use once this gadget is done executing?
This would require us to implement the tail call extension, but it sounds like a good idea regardless, and sounds like it'd be pretty easy/high value extension to implement.
There was a problem hiding this comment.
My main concerns was not so much performance, as complexity -- adding trampolines feels like it'll require a bunch of fiddly code in the implementation.
There was a problem hiding this comment.
Got it. I don’t think the complexity is a concern. The export indirection can be an entirely separate transformation pass. Executed exactly once, before any other analysis or instrumentation, it sets up the module such that the regular analysis/instrumentation can entirely ignore any complications that arise due to indirect or external calls. Instrumenting these trampolines as-if they are regular WebAssembly code is sufficient to produce precise and accurate accounting.
That said, export indirection can pose a problem to how the specification is written.
On one hand export indirection is just a tool to make it possible to write the specification such that exports and table.elems aren’t an exceptional case, on another instrumentation does need to avoid instrumenting calls to the trampolines introduced by export indirection (these trampolines are intended to be transparent after all). So far this property comes out almost by definition. It isn’t possible to insert any instrumentation before a host-VM call, unless the embedder is modified to do something like this.
When functions are exported, there is no easy way to introduce code to account for these functions’ operations before those operations are executed. At least not without an intervention of the VM. Export indirection is probably the simplest mechanism that would allow implementing proper accounting purely within WASM.
89831c2 to
2c47767
Compare
|
Taking one step back: do we actually need to specify export indirection? I cannot think of a way export indirection can affect wasm-virtual-machine behavior. The only way I can think of it would be if we wanted to specify at exactly which amd64 instruction the execution should stop when hitting a limit, but I feel like we should avoid speccing that, just saying "if any limit is hit by the limitless execution trace, execution could stop at any point in the program but the output must be LimitExceeded with the amount of used gas specced" That said, export-indirection.mkd feels like nice implementation documentation and the tests you added as good implementation tests, but they don't feel like something we should spec out to me. |
|
Export indirection is an implementation choice, yes. I felt it is worthwhile to start working from here since it is very independent of everything else here, is easy to both describe and implement and is entirely optional. The alternatives are pretty well understood too, I feel, and it is pretty clear to me that indirection is the mechanism we want to continue using for the time being at least. I guess I could just move the document away from |
f397e5f to
1444ce2
Compare
|
I pushed up a commit that sketches out most of the implementation for the indirection transform. There are a couple of learnings here that I think are worth bringing up – first is that we’ll need to contribute to I can say for sure though that splitting out things into analyze and transform phases seems to work quite okay at least for this transformation. I do definitely see ourselves wanting a nicer framework for walking through module and modifying only the interesting parts of it. Otherwise we end up with a 700-LoC modules for each analysis/transform that mostly are just traversing the structure and converting from one representation to the other. |
c1056cc to
de5cc09
Compare
The API for this crate is not great for custom use-cases like this, but I guess if you close your eyes it is workable enough.
de5cc09 to
9498b34
Compare
|
I have re-evaluated use of crates such as I have also took opportunity to adjust a little bit the approach in our test suite. I went ahead and integrated |
| let mut insta = insta::Settings::new(); | ||
| insta.set_prepend_module_to_snapshot(false); | ||
| insta.set_snapshot_path( | ||
| self.test_path | ||
| .canonicalize() | ||
| .expect("unsable to canonicalize the test path") | ||
| .parent() | ||
| .expect("could not get the parent directory of the test path") | ||
| .join("snaps") | ||
| ); | ||
| insta.set_snapshot_suffix(directive.display()); | ||
| let (line, _) = directive.span().linecol_in(self.wast_data); | ||
| insta.bind(|| { | ||
| let _hook = std::panic::set_hook(Box::new(|_| {})); | ||
| let result = std::panic::catch_unwind(|| { | ||
| insta::_macro_support::assert_snapshot( | ||
| // Creates a ReferenceValue::Named variant | ||
| insta::_macro_support::ReferenceValue::Named(Some( | ||
| self.test_name.clone().into(), | ||
| )), | ||
| &output, | ||
| env!("CARGO_MANIFEST_DIR"), | ||
| "", | ||
| &self.test_name, | ||
| &self.test_path.display().to_string(), | ||
| line as u32 + 1, // lines are 0-indexed here | ||
| directive.display(), | ||
| ) | ||
| }); | ||
| let _ = std::panic::take_hook(); | ||
| match result { | ||
| Ok(Ok(_)) => self.pass(), | ||
| Ok(Err(error)) => return self.fail(&*error), | ||
| Err(_panic) => return self.fail(Error::Insta), | ||
| } | ||
| }); |
There was a problem hiding this comment.
expect-test might be a bit more friendlier to such custom processing than insta
https://docs.rs/expect-test/latest/expect_test/macro.expect_file.html
There was a problem hiding this comment.
expect-test has a lot of the same problems as insta – it prefers to panic, and it still outputs diffs unconditionally to stdout/err. I went ahead and wrote a custom thing instead, which is ~the same order-of-degree of code as the ready-made crates and does not require any hacks.
I am curious if there are dependencies between exports and call_indirect? As far as I understand it, both exports and indirect calls face the same fundamental problem: we can't really instrument the call-site. Could/should we use the same mechanism to handle both? |
Well, implementation in nearcore wise, my proposal that exports, the start function and table elements (i.e. That said, there is a problem that I still need to think about/figure out with regards to you see that each trampoline will contain |
|
Ah, so this indeed covers
🤔 this makes me like "worse is better" solution of forgoing indirects and just charging on function entry ("too late", in some sense) more :) |
These snapshot crates have a strong preference towards panicking and outputting their diffs to stdout/err unconditionally, while also not really providing sufficient flexibility to integrate them into our vector-as-output shceme. It seemed easy enough to implement our own, so here we are.
Hm, but then we must make sure the stack is over-provisioned sufficiently to fit all the locals a function would want (hard problem as seen with our secondary stack check), and also have the number of per-function-locals limited to low-enough-numbers that their initialization isn’t taking a huge amount of time before we can charge gas for that. If supporting multiple runtimes (without an ability to modify some of them) was not a concern, I would really just create a separate custom section in the wasm module and have our VM implementation charge the relevant fees during function’s prologue. Unfortunately I don’t see good alternative approaches at the moment. I will split out the supporting code out into a separate PR, so that this PR doesn’t block any parallel work. |
I'm not sure we actually have this concern? AFAICT we're only using wasmer2 in production, and other runtimes are only ever used in order to run over the archive, develop or run tests. These three other use cases don't actually need performance, so I think we could live with:
This actually matches our initial idea of "write a spec gas instrumentation then make it production-ready", though it also means that there may be more work on this front than anticipated. |
|
Even if we don't share the same implementation for gas/stack counting beetween the runtimes (thus having to ensure that the multiple implementations are all 100% spec compliant) the spec must be implementable either way. I believe the current spec allows to charge gas/account for stack using either approach, but it isn’t 100% clear to me that it will remain so indefinitely. If base webassembly spec needed to add some trapping runtime validation during execution of a function call, then there’d be a possibility of non-deterministic execution. |
|
Ah, so I was looking at a wrong place, I was expecting So even today we must specify whether these traps must occur before our charges or after. As spec is written right now, these trapping conditions must be evaluated before any stack operations or gas charges occur as part of the function call. This in practice prohibits use of these indirection trampolines for indirect calls, I… think? |
|
Hmmmm I think your link mostly means that we must specify whether we charge before or after runtime-typechecking the indirectly-called function? IMO if we say that the runtime-typeck must happen before the gas charges (IIUC, how the spec is currently written), then it means that both the options I'm suggesting can still be used correctly: one instrumenter that charges "too late" (after locals init) for development and testing, and one patched compiler that charges "just right" (after runtime-typeck that happens before the push rip/jmp as instrumentation would be in the prologue, and before locals init) Does that make sense? |
When functions are exported, the only way to account for these functions participating in host-to-VM calls is by introducing a trampoline indirection and placing some calls to instrumentation callbacks in said trampoline.