Implement lowered-then-lifted functions #4327

alexcrichton · 2022-06-27T15:39:15Z

This commit is a few features bundled into one, culminating in the
implementation of lowered-then-lifted functions for the component model.
It's probably not going to be used all that often but this is possible
within a valid component so Wasmtime needs to do something relatively
reasonable. The main things implemented in this commit are:

Component instances are now assigned a RuntimeComponentInstanceIndex
to differentiate each one. This will be used in the future to detect
fusion (one instance lowering a function from another instance). For
now it's used to allocate separate VMComponentFlags for each
internal component instance.
The CoreExport<FuncIndex> of lowered functions was changed to a
CoreDef since technically a lowered function can use another lowered
function as the callee. This ended up being not too difficult to plumb
through as everything else was already in place.
A need arose to compile host-to-wasm trampolines which weren't already
present. Currently wasm in a component is always entered through a
host-to-wasm trampoline but core wasm modules are the source of all
the trampolines. In the case of a lowered-then-lifted function there
may not actually be any core wasm modules, so component objects now
contain necessary trampolines not otherwise provided by the core wasm
objects. This feature required splitting a new function into the
Compiler trait for creating a host-to-wasm trampoline. After doing
this core wasm compilation was also updated to leverage this which
further enabled compiling trampolines in parallel as opposed to the
previous synchronous compilation.

alexcrichton · 2022-06-27T15:41:24Z

I was originally going to include lifted-then-lowered functions in the same component here as well, but that has snowballed quite a lot so I opted to split this out and get this in first.

github-actions · 2022-06-27T16:47:46Z

Subscribe to Label Action

cc @peterhuene

This issue or pull request has been labeled: "wasmtime:api"

Thus the following users have been cc'd because of the following labels:

peterhuene: wasmtime:api

To subscribe or unsubscribe from this label, edit the .github/subscribe-to-label.json configuration file.

Learn more.

fitzgen

Nice!

fitzgen · 2022-06-27T18:22:08Z

crates/environ/src/component/vmcomponent_offsets.rs

 //      store: *mut dyn Store,
+//      flags: [u8; component.num_runtime_component_instances],


Suggested change

// flags: [u8; component.num_runtime_component_instances],

// flags: [VMComponentFlags; component.num_runtime_component_instances],

fitzgen · 2022-06-27T18:22:54Z

crates/environ/src/component/vmcomponent_offsets.rs

            align(u32::from(ret.ptr.size())),
            size(store) = cmul(2, ret.ptr.size()),
+            size(flags) = ret.num_runtime_component_instances,


better not to open code / bake in the assumption that the flags will always be one byte large, especially since we wrap them up in a VMComponentFlags data abstraction elsewhere:

Suggested change

size(flags) = ret.num_runtime_component_instances,

size(flags) = cmul(size_of_vm_component_flags(), ret.num_runtime_component_instances),

This commit is a few features bundled into one, culminating in the implementation of lowered-then-lifted functions for the component model. It's probably not going to be used all that often but this is possible within a valid component so Wasmtime needs to do something relatively reasonable. The main things implemented in this commit are: * Component instances are now assigned a `RuntimeComponentInstanceIndex` to differentiate each one. This will be used in the future to detect fusion (one instance lowering a function from another instance). For now it's used to allocate separate `VMComponentFlags` for each internal component instance. * The `CoreExport<FuncIndex>` of lowered functions was changed to a `CoreDef` since technically a lowered function can use another lowered function as the callee. This ended up being not too difficult to plumb through as everything else was already in place. * A need arose to compile host-to-wasm trampolines which weren't already present. Currently wasm in a component is always entered through a host-to-wasm trampoline but core wasm modules are the source of all the trampolines. In the case of a lowered-then-lifted function there may not actually be any core wasm modules, so component objects now contain necessary trampolines not otherwise provided by the core wasm objects. This feature required splitting a new function into the `Compiler` trait for creating a host-to-wasm trampoline. After doing this core wasm compilation was also updated to leverage this which further enabled compiling trampolines in parallel as opposed to the previous synchronous compilation.

* Implement lowered-then-lifted functions This commit is a few features bundled into one, culminating in the implementation of lowered-then-lifted functions for the component model. It's probably not going to be used all that often but this is possible within a valid component so Wasmtime needs to do something relatively reasonable. The main things implemented in this commit are: * Component instances are now assigned a `RuntimeComponentInstanceIndex` to differentiate each one. This will be used in the future to detect fusion (one instance lowering a function from another instance). For now it's used to allocate separate `VMComponentFlags` for each internal component instance. * The `CoreExport<FuncIndex>` of lowered functions was changed to a `CoreDef` since technically a lowered function can use another lowered function as the callee. This ended up being not too difficult to plumb through as everything else was already in place. * A need arose to compile host-to-wasm trampolines which weren't already present. Currently wasm in a component is always entered through a host-to-wasm trampoline but core wasm modules are the source of all the trampolines. In the case of a lowered-then-lifted function there may not actually be any core wasm modules, so component objects now contain necessary trampolines not otherwise provided by the core wasm objects. This feature required splitting a new function into the `Compiler` trait for creating a host-to-wasm trampoline. After doing this core wasm compilation was also updated to leverage this which further enabled compiling trampolines in parallel as opposed to the previous synchronous compilation. * Review comments

cfallin · 2022-08-17T18:28:42Z

@alexcrichton I'm doing some historical benchmarking to look at how various features have improved Wasmtime + Cranelift over time, and I found a regression in compile time that I've bisected to this PR.

In particular, compiling spidermonkey.wasm (with wasmtime compile on the command line measured using hyperfine, on a machine with CPU frequency locked to 2.8GHz and turbo-boost disabled):

commit df1502531d437b2105effb990a7caa28796aa985, just prior to this PR's merge: 1.23s
commit c1b3962f7b9fc1c193e1b9709db64c455699a295, this PR's merge: 1.43s, or 16% slower

Is this expected (I see a note above about compiling more trampolines)?

The regression persists to today, so I'm definitely interested in ways we could ameliorate this!

alexcrichton · 2022-08-17T18:34:16Z

Hm no this PR shouldn't have affected core wasm compilation, I will try to dig in further and see what this caused.

alexcrichton · 2022-08-17T18:50:33Z

Hm ok I am unable to reproduce this. I'm running:

$ hyperfine --warmup 5 "./before compile ./benches/instantiation/spidermonkey.wasm" "./after compile ./benches/instantiation/spidermonkey.wasm"

where that spidermonkey.wasm just happens to be some old copy I have locally. On an arm64 machine this reports "after" being faster by a fairly variable amount if I run the hyperfine command multiple times. On an x86_64 machine it's waffling between either "after" or "before" being faster.

The main change that this PR had on core wasm is that core wasm trampolines are now compiled in parallel whereas they were compiled serially previously. Could that parallelism be affecting your measurements somehow perhaps? It may be that the parallelism here isn't a great idea for the short-lived task of compiling small trampolines.

cfallin · 2022-08-17T18:59:57Z

Weird -- here's what I see:

[cfallin@xap]~/work/wasmtime-clean% hyperfine --warmup 5 './df15025 compile ../wasm-tests/spidermonkey.wasm' './c1b3962 compile ../wasm-tests/spidermonkey.wasm'
Benchmark 1: ./df15025 compile ../wasm-tests/spidermonkey.wasm
  Time (mean ± σ):      1.413 s ±  0.050 s    [User: 20.426 s, System: 0.605 s]
  Range (min … max):    1.349 s …  1.492 s    10 runs

Benchmark 2: ./c1b3962 compile ../wasm-tests/spidermonkey.wasm
  Time (mean ± σ):      1.649 s ±  0.024 s    [User: 20.312 s, System: 0.579 s]
  Range (min … max):    1.636 s …  1.715 s    10 runs

Summary
  './df15025 compile ../wasm-tests/spidermonkey.wasm' ran
    1.17 ± 0.04 times faster than './c1b3962 compile ../wasm-tests/spidermonkey.wasm'

Parameters: 12-core/24-thread system (Ryzen 3900X), frequency governor locked at 2.8GHz, Linux/x86-64, built with Rust 1.63.0, nothing else running on system (ssh into headless box).

cfallin · 2022-08-17T19:02:25Z

Notably the user CPU time (sum of all threads' time) is ~about the same or a little better in "after" (c1b) than "before" (df1), so parallelism is perhaps the culprit here.

cfallin · 2022-08-17T20:25:39Z

Doing some more experimentation on my system:

with 12 cores (so, one SMT sibling on each core, removing the effects of SMT), a 26% regression:

[cfallin@xap]~/work/wasmtime-clean% hyperfine --warmup 5 'taskset 0xfff ./df15025 compile ../wasm-tests/spidermonkey.wasm' 'taskset 0xfff ./c1b3962 compile ../wasm-tests/spidermonkey.wasm'
Benchmark 1: taskset 0xfff ./df15025 compile ../wasm-tests/spidermonkey.wasm
  Time (mean ± σ):      1.694 s ±  0.012 s    [User: 14.586 s, System: 0.422 s]
  Range (min … max):    1.679 s …  1.717 s    10 runs

Benchmark 2: taskset 0xfff ./c1b3962 compile ../wasm-tests/spidermonkey.wasm
  Time (mean ± σ):      2.135 s ±  0.020 s    [User: 14.597 s, System: 0.429 s]
  Range (min … max):    2.109 s …  2.175 s    10 runs

Summary
  'taskset 0xfff ./df15025 compile ../wasm-tests/spidermonkey.wasm' ran
    1.26 ± 0.01 times faster than 'taskset 0xfff ./c1b3962 compile ../wasm-tests/spidermonkey.wasm'

with the first 6 cores (one of two chiplets, eliminating any interconnect or NUMA-related issues), a 31% regression:

[cfallin@xap]~/work/wasmtime-clean% hyperfine --warmup 5 'taskset 0x3f ./df15025 compile ../wasm-tests/spidermonkey.wasm' 'taskset 0x3f ./c1b3962 compile ../wasm-tests/spidermonkey.wasm'
Benchmark 1: taskset 0x3f ./df15025 compile ../wasm-tests/spidermonkey.wasm
  Time (mean ± σ):      2.862 s ±  0.018 s    [User: 14.378 s, System: 0.375 s]
  Range (min … max):    2.841 s …  2.907 s    10 runs

Benchmark 2: taskset 0x3f ./c1b3962 compile ../wasm-tests/spidermonkey.wasm
  Time (mean ± σ):      3.748 s ±  0.018 s    [User: 14.368 s, System: 0.377 s]
  Range (min … max):    3.736 s …  3.795 s    10 runs

Summary
  'taskset 0x3f ./df15025 compile ../wasm-tests/spidermonkey.wasm' ran
    1.31 ± 0.01 times faster than 'taskset 0x3f ./c1b3962 compile ../wasm-tests/spidermonkey.wasm'

@alexcrichton , would you be OK with reverting the parallel compilation of trampolines that this PR introduced? Even if this regression appears only on some systems or in some parallelism scenarios, I think these numbers are too big to ignore and while from first principles the parallel-iter runtime should do the right thing, it seems to be falling flat here...

alexcrichton · 2022-08-17T21:33:53Z

Personally something is weird enough here that I don't think it warrants reverting this and sweeping it under the rug. I initially struggled to reproduce on an arm64 and x86_64 system so I don't think that this is a systemic issue. Testing just on the x86_64 machine I have I tried variations of RAYON_NUM_THREADS=N and 1,2,4,5,8,16 showed no perf difference in the before/after. Surprinsingly though 3,6,7 all showed 20%+ regressions. It seems like some CPU configurations are probably not great at the parallelism, but this seems like a deeper issue with our parallel strategy in general rather than specific to just the trampolines which happened to change here.

alexcrichton · 2022-08-17T21:53:50Z

The problem here seems to stem from rayon::join. If I move out compilation of the trampolines entirely from the rayon::join and compile all trampolines serially I still observe the slowdown locally at 6 threads. This is where one closure to rayon::join does a parallel compilation of all of a module's functions, and the other closure does nothing. Seems like we may need to avoid rayon::join, but I'll try to dig a bit more.

alexcrichton · 2022-08-17T22:01:56Z

I can confirm that this diff:

diff --git a/crates/wasmtime/src/module.rs b/crates/wasmtime/src/module.rs
index 5524ebb73..1494bed18 100644
--- a/crates/wasmtime/src/module.rs
+++ b/crates/wasmtime/src/module.rs
@@ -372,8 +372,9 @@ impl Module {
         let tunables = &engine.config().tunables;
         let functions = mem::take(&mut translation.function_body_inputs);
         let functions = functions.into_iter().collect::<Vec<_>>();
-        let funcs = engine
-            .run_maybe_parallel(functions, |(index, func)| {
+        let mut funcs = None;
+        rayon::scope(|_| {
+            funcs = Some(engine.run_maybe_parallel(functions, |(index, func)| {
                 let offset = func.body.range().start;
                 engine
                     .compiler()
@@ -389,9 +390,9 @@ impl Module {
                             "failed to compile wasm function {index}{name} at offset {offset:#x}"
                         )
                     })
-            })?
-            .into_iter()
-            .collect();
+            }));
+        });
+        let funcs = funcs.unwrap()?.into_iter().collect();

         // Collect all the function results into a final ELF object.
         let mut obj = engine.compiler().object()?;

on df15025 produces the slowdown I'm seeing. Unsure what's happening within rayon...

alexcrichton · 2022-08-17T22:51:05Z

I don't really know what's going on here. I dug into the actual compile time of SLOW=1 vs not with the above patch. Summing the time spent compiling each individual function shows a 2.34% difference in time which doesn't account for the 20% wall time seen for the entire command. My best guess is that something is tickling rayon in just the wrong fashion here, although I have no idea what.

cfallin · 2022-08-17T23:28:54Z

Wow, this is wild: sure enough, when run with RAYON_NUM_THREADS=8 on my system, the regression disappears (+/- 1% delta, within noise).

I agree this seems like a Rayon bug of some sort -- a dependence on there being a power-of-two number of worker threads (some sort of work-distribution imbalance maybe?).

As a very short-term solution I will benchmark with this environment variable set; but I will file a bug upstream and see what folks think.

alexcrichton requested a review from fitzgen June 27, 2022 15:39

alexcrichton mentioned this pull request Jun 27, 2022

Tracking issue for implementing the component model #4185

Closed

42 tasks

github-actions bot added the wasmtime:api Related to the API of the `wasmtime` crate itself label Jun 27, 2022

fitzgen approved these changes Jun 27, 2022

View reviewed changes

alexcrichton enabled auto-merge (squash) June 28, 2022 14:59

alexcrichton force-pushed the component-lower-then-lift branch 2 times, most recently from 7b03ae3 to 292c82b Compare June 28, 2022 18:14

alexcrichton force-pushed the component-lower-then-lift branch from 292c82b to 3687fb3 Compare June 28, 2022 18:16

alexcrichton merged commit c1b3962 into bytecodealliance:main Jun 28, 2022

alexcrichton deleted the component-lower-then-lift branch June 28, 2022 18:50

cfallin mentioned this pull request Aug 17, 2022

Performance regression with non-power-of-2 number of threads rayon-rs/rayon#968

Open

		// store: *mut dyn Store,
		// flags: [u8; component.num_runtime_component_instances],

	// flags: [u8; component.num_runtime_component_instances],
	// flags: [VMComponentFlags; component.num_runtime_component_instances],

	size(flags) = ret.num_runtime_component_instances,
	size(flags) = cmul(size_of_vm_component_flags(), ret.num_runtime_component_instances),

Implement lowered-then-lifted functions #4327

Implement lowered-then-lifted functions #4327

Uh oh!

Conversation

alexcrichton commented Jun 27, 2022

Uh oh!

alexcrichton commented Jun 27, 2022

Uh oh!

github-actions bot commented Jun 27, 2022

Subscribe to Label Action

Uh oh!

fitzgen left a comment

Choose a reason for hiding this comment

Uh oh!

fitzgen Jun 27, 2022

Choose a reason for hiding this comment

Uh oh!

fitzgen Jun 27, 2022

Choose a reason for hiding this comment

Uh oh!

cfallin commented Aug 17, 2022

Uh oh!

alexcrichton commented Aug 17, 2022

Uh oh!

alexcrichton commented Aug 17, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cfallin commented Aug 17, 2022

Uh oh!

cfallin commented Aug 17, 2022

Uh oh!

cfallin commented Aug 17, 2022

Uh oh!

alexcrichton commented Aug 17, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alexcrichton commented Aug 17, 2022

Uh oh!

alexcrichton commented Aug 17, 2022

Uh oh!

alexcrichton commented Aug 17, 2022

Uh oh!

cfallin commented Aug 17, 2022

Uh oh!

Uh oh!

alexcrichton commented Aug 17, 2022 •

edited

Loading

alexcrichton commented Aug 17, 2022 •

edited

Loading