Implement lowered-then-lifted functions #4327
Conversation
I was originally going to include lifted-then-lowered functions in the same component here as well, but that snowballed quite a lot, so I opted to split this out and get it in first.
Subscribe to Label Action: cc @peterhuene
This issue or pull request has been labeled "wasmtime:api"; the following users have been cc'd because of this label.
Nice!
// store: *mut dyn Store,
// flags: [u8; component.num_runtime_component_instances],
Suggested change:
- // flags: [u8; component.num_runtime_component_instances],
+ // flags: [VMComponentFlags; component.num_runtime_component_instances],
align(u32::from(ret.ptr.size())),
size(store) = cmul(2, ret.ptr.size()),
size(flags) = ret.num_runtime_component_instances,
Better not to open-code / bake in the assumption that the flags will always be one byte large, especially since we wrap them up in a `VMComponentFlags` data abstraction elsewhere:
Suggested change:
- size(flags) = ret.num_runtime_component_instances,
+ size(flags) = cmul(size_of_vm_component_flags(), ret.num_runtime_component_instances),
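To make the reviewer's point concrete, here is a minimal sketch that assumes nothing about the real Wasmtime offsets code: `VMComponentFlags` is modeled as a hypothetical one-byte wrapper and `flags_array_size` is an illustrative helper, not an existing API.

```rust
// Sketch only: compute the flags-array size from the VMComponentFlags type
// itself instead of assuming it is one byte, so a future change to the flags
// representation cannot silently break the layout math.
use core::mem::size_of;

/// Hypothetical stand-in for Wasmtime's VMComponentFlags wrapper.
#[allow(dead_code)]
#[repr(transparent)]
struct VMComponentFlags(u8);

/// Illustrative helper: size in bytes of the per-instance flags array.
fn flags_array_size(num_runtime_component_instances: u32) -> u32 {
    size_of::<VMComponentFlags>() as u32 * num_runtime_component_instances
}

fn main() {
    // With a one-byte flags type this matches the hard-coded version,
    // but it stays correct if VMComponentFlags ever grows.
    assert_eq!(flags_array_size(4), 4);
}
```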
Force-pushed from 7b03ae3 to 292c82b:
This commit is a few features bundled into one, culminating in the implementation of lowered-then-lifted functions for the component model. It's probably not going to be used all that often, but this is possible within a valid component, so Wasmtime needs to do something relatively reasonable. The main things implemented in this commit are:

* Component instances are now assigned a `RuntimeComponentInstanceIndex` to differentiate each one. This will be used in the future to detect fusion (one instance lowering a function from another instance). For now it's used to allocate separate `VMComponentFlags` for each internal component instance.
* The `CoreExport<FuncIndex>` of lowered functions was changed to a `CoreDef` since technically a lowered function can use another lowered function as the callee. This ended up being not too difficult to plumb through as everything else was already in place.
* A need arose to compile host-to-wasm trampolines which weren't already present. Currently wasm in a component is always entered through a host-to-wasm trampoline, but core wasm modules are the source of all the trampolines. In the case of a lowered-then-lifted function there may not actually be any core wasm modules, so component objects now contain necessary trampolines not otherwise provided by the core wasm objects. This feature required splitting a new function into the `Compiler` trait for creating a host-to-wasm trampoline. After doing this, core wasm compilation was also updated to leverage it, which further enabled compiling trampolines in parallel as opposed to the previous synchronous compilation.
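The `CoreDef` bullet above can be pictured with a small hypothetical sketch. These definitions are illustrative stand-ins, not Wasmtime's real `FuncIndex`/`LoweredIndex`/`CoreDef` types; they only show why a general callee representation admits lowered-then-lifted chains.

```rust
// All types here are illustrative stand-ins, not Wasmtime's real definitions.

/// Index of a function exported from some core wasm instance.
#[derive(Clone, Copy)]
struct FuncIndex(u32);

/// Index of a function created by `canon lower`.
#[derive(Clone, Copy)]
struct LoweredIndex(u32);

/// Hypothetical "definition of a core function" used as a lowering's callee:
/// either a plain core-module export or another lowered function, which is
/// what makes lowered-then-lifted functions representable.
enum CoreDef {
    Export(FuncIndex),
    Lowered(LoweredIndex),
}

fn main() {
    // A lowered function whose callee is itself another lowered function.
    let callee = CoreDef::Lowered(LoweredIndex(0));
    match callee {
        CoreDef::Export(FuncIndex(i)) => println!("calls core export #{i}"),
        CoreDef::Lowered(LoweredIndex(i)) => println!("calls lowered function #{i}"),
    }
}
```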
Force-pushed from 292c82b to 3687fb3:
* Implement lowered-then-lifted functions (same commit message as the previous push)
* Review comments
@alexcrichton I'm doing some historical benchmarking to look at how various features have improved Wasmtime + Cranelift over time, and I found a compile-time regression that I've bisected to this PR.

Is this expected (I see a note above about compiling more trampolines)? The regression persists to today, so I'm definitely interested in ways we could ameliorate this!
Hm, no, this PR shouldn't have affected core wasm compilation; I will try to dig in further and see what caused this.
Hm, ok, I am unable to reproduce this with what I'm running locally. The main change that this PR had on core wasm is that core wasm trampolines are now compiled in parallel, whereas previously they were compiled serially. Could that parallelism be affecting your measurements somehow, perhaps? It may be that the parallelism here isn't a great idea for the short-lived task of compiling small trampolines.
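For context on the trade-off being raised here, this is a self-contained sketch (not Wasmtime's code; `compile_trampoline` is a made-up stand-in) of compiling many tiny jobs serially versus through a Rayon parallel iterator:

```rust
// Sketch: serial vs. Rayon-parallel compilation of many small jobs.
// For tiny, short-lived tasks the fork/join and scheduling overhead can
// outweigh the parallel speedup, which is the concern raised in this thread.
use rayon::prelude::*;

// Stand-in for compiling one host-to-wasm trampoline.
fn compile_trampoline(index: usize) -> Vec<u8> {
    vec![index as u8; 16]
}

fn compile_serial(count: usize) -> Vec<Vec<u8>> {
    (0..count).map(compile_trampoline).collect()
}

fn compile_parallel(count: usize) -> Vec<Vec<u8>> {
    (0..count).into_par_iter().map(compile_trampoline).collect()
}

fn main() {
    // Same results either way; only the scheduling differs.
    assert_eq!(compile_serial(8), compile_parallel(8));
}
```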
Weird -- here's what I see:
Parameters: 12-core/24-thread system (Ryzen 3900X), frequency governor locked at 2.8 GHz, Linux/x86-64, built with Rust 1.63.0, nothing else running on the system (ssh into a headless box).
Notably, the user CPU time (the sum of all threads' time) is about the same or a little better in "after" (c1b) than in "before" (df1), so parallelism is perhaps the culprit here.
Doing some more experimentation on my system:
@alexcrichton, would you be OK with reverting the parallel compilation of trampolines that this PR introduced? Even if this regression appears only on some systems or in some parallelism scenarios, I think these numbers are too big to ignore; and while from first principles the parallel-iterator runtime should do the right thing, it seems to be falling flat here...
Personally, something is weird enough here that I don't think it warrants reverting this and sweeping it under the rug. I initially struggled to reproduce this on an arm64 and an x86_64 system, so I don't think it is a systemic issue. Testing just on the x86_64 machine I have, I tried variations of
The problem here seems to stem from
I can confirm that this diff:

diff --git a/crates/wasmtime/src/module.rs b/crates/wasmtime/src/module.rs
index 5524ebb73..1494bed18 100644
--- a/crates/wasmtime/src/module.rs
+++ b/crates/wasmtime/src/module.rs
@@ -372,8 +372,9 @@ impl Module {
let tunables = &engine.config().tunables;
let functions = mem::take(&mut translation.function_body_inputs);
let functions = functions.into_iter().collect::<Vec<_>>();
- let funcs = engine
- .run_maybe_parallel(functions, |(index, func)| {
+ let mut funcs = None;
+ rayon::scope(|_| {
+ funcs = Some(engine.run_maybe_parallel(functions, |(index, func)| {
let offset = func.body.range().start;
engine
.compiler()
@@ -389,9 +390,9 @@ impl Module {
"failed to compile wasm function {index}{name} at offset {offset:#x}"
)
})
- })?
- .into_iter()
- .collect();
+ }));
+ });
+ let funcs = funcs.unwrap()?.into_iter().collect();
// Collect all the function results into a final ELF object.
let mut obj = engine.compiler().object()?;

Applied on df15025, this produces the slowdown I'm seeing. Unsure what's happening within rayon...
I don't really know what's going on here. I dug into the actual compile time of
Wow, this is wild: sure enough, I see the same when run with that environment variable. I agree this seems like a Rayon bug of some sort: a dependence on there being a power-of-two number of worker threads (some sort of work-distribution imbalance, maybe?). As a very short-term solution I will benchmark with this environment variable set, but I will file a bug upstream and see what folks think.
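For reference, and assuming the environment variable under discussion is Rayon's standard thread-count control, here is a minimal sketch of pinning the worker count either from the shell or in code (the count of 16 is arbitrary):

```rust
// Sketch: two ways to pin Rayon's worker-thread count while chasing a
// scheduling-sensitive slowdown.
//
// Option 1 (shell):   RAYON_NUM_THREADS=16 ./my-benchmark
// Option 2 (in code): configure the global pool before any parallel work runs.
use rayon::prelude::*;

fn main() -> Result<(), rayon::ThreadPoolBuildError> {
    rayon::ThreadPoolBuilder::new()
        .num_threads(16) // arbitrary; pick a power of two to test the hypothesis
        .build_global()?;

    // Any subsequent parallel iterator now uses exactly 16 workers.
    let total: u64 = (0..1_000u64).into_par_iter().map(|x| x * x).sum();
    println!("{total}");
    Ok(())
}
```

Either approach makes the worker count explicit, which should make the power-of-two hypothesis easy to test.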
This commit is a few features bundled into one, culminating in the implementation of lowered-then-lifted functions for the component model. It's probably not going to be used all that often, but this is possible within a valid component, so Wasmtime needs to do something relatively reasonable. The main things implemented in this commit are:

* Component instances are now assigned a `RuntimeComponentInstanceIndex` to differentiate each one. This will be used in the future to detect fusion (one instance lowering a function from another instance). For now it's used to allocate separate `VMComponentFlags` for each internal component instance.

* The `CoreExport<FuncIndex>` of lowered functions was changed to a `CoreDef` since technically a lowered function can use another lowered function as the callee. This ended up being not too difficult to plumb through as everything else was already in place.

* A need arose to compile host-to-wasm trampolines which weren't already present. Currently wasm in a component is always entered through a host-to-wasm trampoline, but core wasm modules are the source of all the trampolines. In the case of a lowered-then-lifted function there may not actually be any core wasm modules, so component objects now contain necessary trampolines not otherwise provided by the core wasm objects. This feature required splitting a new function into the `Compiler` trait for creating a host-to-wasm trampoline. After doing this, core wasm compilation was also updated to leverage it, which further enabled compiling trampolines in parallel as opposed to the previous synchronous compilation.
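As a rough illustration of the trampoline point above, here is a hypothetical sketch; it is not Wasmtime's real `Compiler` trait, and `TrampolineCompiler`, `FuncSignature`, `compile_all_trampolines`, and `DummyCompiler` are all made-up names:

```rust
// Hypothetical sketch: a dedicated trait method for host-to-wasm trampolines
// lets both core-module and component compilation request trampolines, and the
// caller can compile the requests in parallel.
use rayon::prelude::*;

/// Illustrative stand-in for a function type signature.
#[derive(Clone)]
pub struct FuncSignature {
    pub params: usize,
    pub results: usize,
}

pub trait TrampolineCompiler: Sync {
    /// Compile one host-to-wasm trampoline for `sig`, returning machine-code bytes.
    fn compile_host_to_wasm_trampoline(&self, sig: &FuncSignature) -> anyhow::Result<Vec<u8>>;
}

/// Compile every required trampoline in parallel rather than one at a time.
pub fn compile_all_trampolines<C: TrampolineCompiler>(
    compiler: &C,
    sigs: &[FuncSignature],
) -> anyhow::Result<Vec<Vec<u8>>> {
    sigs.par_iter()
        .map(|sig| compiler.compile_host_to_wasm_trampoline(sig))
        .collect()
}

/// A trivial implementation just to show the call pattern.
struct DummyCompiler;

impl TrampolineCompiler for DummyCompiler {
    fn compile_host_to_wasm_trampoline(&self, sig: &FuncSignature) -> anyhow::Result<Vec<u8>> {
        // Pretend the generated code size depends on the signature.
        Ok(vec![0u8; 8 + sig.params + sig.results])
    }
}

fn main() -> anyhow::Result<()> {
    let sigs = vec![FuncSignature { params: 2, results: 1 }; 4];
    let code = compile_all_trampolines(&DummyCompiler, &sigs)?;
    assert_eq!(code.len(), 4);
    Ok(())
}
```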