Skip to content

Implement lowered-then-lifted functions #4327

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

alexcrichton
Copy link
Member

This commit is a few features bundled into one, culminating in the
implementation of lowered-then-lifted functions for the component model.
It's probably not going to be used all that often but this is possible
within a valid component so Wasmtime needs to do something relatively
reasonable. The main things implemented in this commit are:

  • Component instances are now assigned a RuntimeComponentInstanceIndex
    to differentiate each one. This will be used in the future to detect
    fusion (one instance lowering a function from another instance). For
    now it's used to allocate separate VMComponentFlags for each
    internal component instance.

  • The CoreExport<FuncIndex> of lowered functions was changed to a
    CoreDef since technically a lowered function can use another lowered
    function as the callee. This ended up being not too difficult to plumb
    through as everything else was already in place.

  • A need arose to compile host-to-wasm trampolines which weren't already
    present. Currently wasm in a component is always entered through a
    host-to-wasm trampoline but core wasm modules are the source of all
    the trampolines. In the case of a lowered-then-lifted function there
    may not actually be any core wasm modules, so component objects now
    contain necessary trampolines not otherwise provided by the core wasm
    objects. This feature required splitting a new function into the
    Compiler trait for creating a host-to-wasm trampoline. After doing
    this core wasm compilation was also updated to leverage this which
    further enabled compiling trampolines in parallel as opposed to the
    previous synchronous compilation.

@alexcrichton
Copy link
Member Author

I was originally going to include lifted-then-lowered functions in the same component here as well, but that has snowballed quite a lot so I opted to split this out and get this in first.

@github-actions github-actions bot added the wasmtime:api Related to the API of the `wasmtime` crate itself label Jun 27, 2022
@github-actions
Copy link

Subscribe to Label Action

cc @peterhuene

This issue or pull request has been labeled: "wasmtime:api"

Thus the following users have been cc'd because of the following labels:

  • peterhuene: wasmtime:api

To subscribe or unsubscribe from this label, edit the .github/subscribe-to-label.json configuration file.

Learn more.

Copy link
Member

@fitzgen fitzgen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

// store: *mut dyn Store,
// flags: [u8; component.num_runtime_component_instances],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// flags: [u8; component.num_runtime_component_instances],
// flags: [VMComponentFlags; component.num_runtime_component_instances],

align(u32::from(ret.ptr.size())),
size(store) = cmul(2, ret.ptr.size()),
size(flags) = ret.num_runtime_component_instances,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better not to open code / bake in the assumption that the flags will always be one byte large, especially since we wrap them up in a VMComponentFlags data abstraction elsewhere:

Suggested change
size(flags) = ret.num_runtime_component_instances,
size(flags) = cmul(size_of_vm_component_flags(), ret.num_runtime_component_instances),

@alexcrichton alexcrichton enabled auto-merge (squash) June 28, 2022 14:59
@alexcrichton alexcrichton force-pushed the component-lower-then-lift branch 2 times, most recently from 7b03ae3 to 292c82b Compare June 28, 2022 18:14
This commit is a few features bundled into one, culminating in the
implementation of lowered-then-lifted functions for the component model.
It's probably not going to be used all that often but this is possible
within a valid component so Wasmtime needs to do something relatively
reasonable. The main things implemented in this commit are:

* Component instances are now assigned a `RuntimeComponentInstanceIndex`
  to differentiate each one. This will be used in the future to detect
  fusion (one instance lowering a function from another instance). For
  now it's used to allocate separate `VMComponentFlags` for each
  internal component instance.

* The `CoreExport<FuncIndex>` of lowered functions was changed to a
  `CoreDef` since technically a lowered function can use another lowered
  function as the callee. This ended up being not too difficult to plumb
  through as everything else was already in place.

* A need arose to compile host-to-wasm trampolines which weren't already
  present. Currently wasm in a component is always entered through a
  host-to-wasm trampoline but core wasm modules are the source of all
  the trampolines. In the case of a lowered-then-lifted function there
  may not actually be any core wasm modules, so component objects now
  contain necessary trampolines not otherwise provided by the core wasm
  objects. This feature required splitting a new function into the
  `Compiler` trait for creating a host-to-wasm trampoline. After doing
  this core wasm compilation was also updated to leverage this which
  further enabled compiling trampolines in parallel as opposed to the
  previous synchronous compilation.
@alexcrichton alexcrichton force-pushed the component-lower-then-lift branch from 292c82b to 3687fb3 Compare June 28, 2022 18:16
@alexcrichton alexcrichton merged commit c1b3962 into bytecodealliance:main Jun 28, 2022
@alexcrichton alexcrichton deleted the component-lower-then-lift branch June 28, 2022 18:50
afonso360 pushed a commit to afonso360/wasmtime that referenced this pull request Jun 30, 2022
* Implement lowered-then-lifted functions

This commit is a few features bundled into one, culminating in the
implementation of lowered-then-lifted functions for the component model.
It's probably not going to be used all that often but this is possible
within a valid component so Wasmtime needs to do something relatively
reasonable. The main things implemented in this commit are:

* Component instances are now assigned a `RuntimeComponentInstanceIndex`
  to differentiate each one. This will be used in the future to detect
  fusion (one instance lowering a function from another instance). For
  now it's used to allocate separate `VMComponentFlags` for each
  internal component instance.

* The `CoreExport<FuncIndex>` of lowered functions was changed to a
  `CoreDef` since technically a lowered function can use another lowered
  function as the callee. This ended up being not too difficult to plumb
  through as everything else was already in place.

* A need arose to compile host-to-wasm trampolines which weren't already
  present. Currently wasm in a component is always entered through a
  host-to-wasm trampoline but core wasm modules are the source of all
  the trampolines. In the case of a lowered-then-lifted function there
  may not actually be any core wasm modules, so component objects now
  contain necessary trampolines not otherwise provided by the core wasm
  objects. This feature required splitting a new function into the
  `Compiler` trait for creating a host-to-wasm trampoline. After doing
  this core wasm compilation was also updated to leverage this which
  further enabled compiling trampolines in parallel as opposed to the
  previous synchronous compilation.

* Review comments
@cfallin
Copy link
Member

cfallin commented Aug 17, 2022

@alexcrichton I'm doing some historical benchmarking to look at how various features have improved Wasmtime + Cranelift over time, and I found a regression in compile time that I've bisected to this PR.

In particular, compiling spidermonkey.wasm (with wasmtime compile on the command line measured using hyperfine, on a machine with CPU frequency locked to 2.8GHz and turbo-boost disabled):

  • commit df1502531d437b2105effb990a7caa28796aa985, just prior to this PR's merge: 1.23s
  • commit c1b3962f7b9fc1c193e1b9709db64c455699a295, this PR's merge: 1.43s, or 16% slower

Is this expected (I see a note above about compiling more trampolines)?

The regression persists to today, so I'm definitely interested in ways we could ameliorate this!

@alexcrichton
Copy link
Member Author

Hm no this PR shouldn't have affected core wasm compilation, I will try to dig in further and see what this caused.

@alexcrichton
Copy link
Member Author

alexcrichton commented Aug 17, 2022

Hm ok I am unable to reproduce this. I'm running:

$ hyperfine --warmup 5 "./before compile ./benches/instantiation/spidermonkey.wasm" "./after compile ./benches/instantiation/spidermonkey.wasm"

where that spidermonkey.wasm just happens to be some old copy I have locally. On an arm64 machine this reports "after" being faster by a fairly variable amount if I run the hyperfine command multiple times. On an x86_64 machine it's waffling between either "after" or "before" being faster.

The main change that this PR had on core wasm is that core wasm trampolines are now compiled in parallel whereas they were compiled serially previously. Could that parallelism be affecting your measurements somehow perhaps? It may be that the parallelism here isn't a great idea for the short-lived task of compiling small trampolines.

@cfallin
Copy link
Member

cfallin commented Aug 17, 2022

Weird -- here's what I see:

[cfallin@xap]~/work/wasmtime-clean% hyperfine --warmup 5 './df15025 compile ../wasm-tests/spidermonkey.wasm' './c1b3962 compile ../wasm-tests/spidermonkey.wasm'
Benchmark 1: ./df15025 compile ../wasm-tests/spidermonkey.wasm
  Time (mean ± σ):      1.413 s ±  0.050 s    [User: 20.426 s, System: 0.605 s]
  Range (min … max):    1.349 s …  1.492 s    10 runs

Benchmark 2: ./c1b3962 compile ../wasm-tests/spidermonkey.wasm
  Time (mean ± σ):      1.649 s ±  0.024 s    [User: 20.312 s, System: 0.579 s]
  Range (min … max):    1.636 s …  1.715 s    10 runs

Summary
  './df15025 compile ../wasm-tests/spidermonkey.wasm' ran
    1.17 ± 0.04 times faster than './c1b3962 compile ../wasm-tests/spidermonkey.wasm'

Parameters: 12-core/24-thread system (Ryzen 3900X), frequency governor locked at 2.8GHz, Linux/x86-64, built with Rust 1.63.0, nothing else running on system (ssh into headless box).

@cfallin
Copy link
Member

cfallin commented Aug 17, 2022

Notably the user CPU time (sum of all threads' time) is ~about the same or a little better in "after" (c1b) than "before" (df1), so parallelism is perhaps the culprit here.

@cfallin
Copy link
Member

cfallin commented Aug 17, 2022

Doing some more experimentation on my system:

  • with 12 cores (so, one SMT sibling on each core, removing the effects of SMT), a 26% regression:
[cfallin@xap]~/work/wasmtime-clean% hyperfine --warmup 5 'taskset 0xfff ./df15025 compile ../wasm-tests/spidermonkey.wasm' 'taskset 0xfff ./c1b3962 compile ../wasm-tests/spidermonkey.wasm'
Benchmark 1: taskset 0xfff ./df15025 compile ../wasm-tests/spidermonkey.wasm
  Time (mean ± σ):      1.694 s ±  0.012 s    [User: 14.586 s, System: 0.422 s]
  Range (min … max):    1.679 s …  1.717 s    10 runs

Benchmark 2: taskset 0xfff ./c1b3962 compile ../wasm-tests/spidermonkey.wasm
  Time (mean ± σ):      2.135 s ±  0.020 s    [User: 14.597 s, System: 0.429 s]
  Range (min … max):    2.109 s …  2.175 s    10 runs

Summary
  'taskset 0xfff ./df15025 compile ../wasm-tests/spidermonkey.wasm' ran
    1.26 ± 0.01 times faster than 'taskset 0xfff ./c1b3962 compile ../wasm-tests/spidermonkey.wasm'
  • with the first 6 cores (one of two chiplets, eliminating any interconnect or NUMA-related issues), a 31% regression:
[cfallin@xap]~/work/wasmtime-clean% hyperfine --warmup 5 'taskset 0x3f ./df15025 compile ../wasm-tests/spidermonkey.wasm' 'taskset 0x3f ./c1b3962 compile ../wasm-tests/spidermonkey.wasm'
Benchmark 1: taskset 0x3f ./df15025 compile ../wasm-tests/spidermonkey.wasm
  Time (mean ± σ):      2.862 s ±  0.018 s    [User: 14.378 s, System: 0.375 s]
  Range (min … max):    2.841 s …  2.907 s    10 runs

Benchmark 2: taskset 0x3f ./c1b3962 compile ../wasm-tests/spidermonkey.wasm
  Time (mean ± σ):      3.748 s ±  0.018 s    [User: 14.368 s, System: 0.377 s]
  Range (min … max):    3.736 s …  3.795 s    10 runs

Summary
  'taskset 0x3f ./df15025 compile ../wasm-tests/spidermonkey.wasm' ran
    1.31 ± 0.01 times faster than 'taskset 0x3f ./c1b3962 compile ../wasm-tests/spidermonkey.wasm'

@alexcrichton , would you be OK with reverting the parallel compilation of trampolines that this PR introduced? Even if this regression appears only on some systems or in some parallelism scenarios, I think these numbers are too big to ignore and while from first principles the parallel-iter runtime should do the right thing, it seems to be falling flat here...

@alexcrichton
Copy link
Member Author

alexcrichton commented Aug 17, 2022

Personally something is weird enough here that I don't think it warrants reverting this and sweeping it under the rug. I initially struggled to reproduce on an arm64 and x86_64 system so I don't think that this is a systemic issue. Testing just on the x86_64 machine I have I tried variations of RAYON_NUM_THREADS=N and 1,2,4,5,8,16 showed no perf difference in the before/after. Surprinsingly though 3,6,7 all showed 20%+ regressions. It seems like some CPU configurations are probably not great at the parallelism, but this seems like a deeper issue with our parallel strategy in general rather than specific to just the trampolines which happened to change here.

@alexcrichton
Copy link
Member Author

The problem here seems to stem from rayon::join. If I move out compilation of the trampolines entirely from the rayon::join and compile all trampolines serially I still observe the slowdown locally at 6 threads. This is where one closure to rayon::join does a parallel compilation of all of a module's functions, and the other closure does nothing. Seems like we may need to avoid rayon::join, but I'll try to dig a bit more.

@alexcrichton
Copy link
Member Author

I can confirm that this diff:

diff --git a/crates/wasmtime/src/module.rs b/crates/wasmtime/src/module.rs
index 5524ebb73..1494bed18 100644
--- a/crates/wasmtime/src/module.rs
+++ b/crates/wasmtime/src/module.rs
@@ -372,8 +372,9 @@ impl Module {
         let tunables = &engine.config().tunables;
         let functions = mem::take(&mut translation.function_body_inputs);
         let functions = functions.into_iter().collect::<Vec<_>>();
-        let funcs = engine
-            .run_maybe_parallel(functions, |(index, func)| {
+        let mut funcs = None;
+        rayon::scope(|_| {
+            funcs = Some(engine.run_maybe_parallel(functions, |(index, func)| {
                 let offset = func.body.range().start;
                 engine
                     .compiler()
@@ -389,9 +390,9 @@ impl Module {
                             "failed to compile wasm function {index}{name} at offset {offset:#x}"
                         )
                     })
-            })?
-            .into_iter()
-            .collect();
+            }));
+        });
+        let funcs = funcs.unwrap()?.into_iter().collect();

         // Collect all the function results into a final ELF object.
         let mut obj = engine.compiler().object()?;

on df15025 produces the slowdown I'm seeing. Unsure what's happening within rayon...

@alexcrichton
Copy link
Member Author

I don't really know what's going on here. I dug into the actual compile time of SLOW=1 vs not with the above patch. Summing the time spent compiling each individual function shows a 2.34% difference in time which doesn't account for the 20% wall time seen for the entire command. My best guess is that something is tickling rayon in just the wrong fashion here, although I have no idea what.

@cfallin
Copy link
Member

cfallin commented Aug 17, 2022

Wow, this is wild: sure enough, when run with RAYON_NUM_THREADS=8 on my system, the regression disappears (+/- 1% delta, within noise).

I agree this seems like a Rayon bug of some sort -- a dependence on there being a power-of-two number of worker threads (some sort of work-distribution imbalance maybe?).

As a very short-term solution I will benchmark with this environment variable set; but I will file a bug upstream and see what folks think.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
wasmtime:api Related to the API of the `wasmtime` crate itself
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants