
Shouldn't we use TypedArray.set instead of a for loop for copying arraybuffers? #22


Closed
FallingSnow opened this issue Aug 8, 2023 · 21 comments · Fixed by #29

@FallingSnow

Set is much faster, even on small arrays.

Code in question:

ringbuf.js/js/ringbuf.js

Lines 346 to 350 in 016d963

_copy(input, offset_input, output, offset_output, size) {
for (let i = 0; i < size; i++) {
output[offset_output + i] = input[offset_input + i];
}
}

Benchmark: input is 10,000 F32, and subInput is 100 F32.
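For context, a rough sketch of the kind of comparison measured (illustrative code, not the exact benchmark):

const input = new Float32Array(10_000).map(() => Math.random());
const subInput = input.subarray(0, 100);
const output = new Float32Array(10_000);

// element-wise copy, as _copy() does today
for (let i = 0; i < subInput.length; i++) {
  output[i] = subInput[i];
}

// single built-in call; typically much faster, even for 100 elements
output.set(subInput, 0);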

@padenot
Owner

padenot commented Aug 9, 2023

We can in some capacity. The problem is that we need to copy part of a TypedArray to part of a TypedArray, since it's a circular buffer.

set takes two parameters: the input array, and an offset at which to write the data in the destination array. In particular, it doesn't take an input offset, an input size, or anything of that kind.

The way to do this in JavaScript is to take a DataView on the TypedArray. Unfortunately, this isn't acceptable for this data structure, because it means that objects will be created, and this causes garbage collection. The objects would be extremely short-lived, and modern JS engines have generational GCs, but it's important to create exactly no objects here.

set alone is not enough to express everything this ring buffer needs:

  • It's important to be able to copy starting from an input offset (for the second copy when wrapping around the end of the buffer, and also when authors use the offset parameter)
  • It's important to be able to copy only a certain number of elements from the input, from that offset up to a specified size

We can make it so set(...) is used when possible.

When running the test suite (which is admittedly a bit contrived, with workloads that aren't that realistic), this fast-path can be taken in about 28% of the _copy(...) calls.

We can also add two new methods that always use set and create a DataView, to be used on the side of the ring buffer that doesn't need to be real-time; this might be useful.
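As a rough illustration only (a sketch of the idea, not the actual patch in the linked PR), such a fast path could look like this, falling back to the element-wise loop whenever offsets or sizes make set() unusable without creating views:

_copy(input, offset_input, output, offset_output, size) {
  // Fast path: the whole input is copied, so set() works without
  // creating a subarray or DataView (and thus without GC pressure).
  if (offset_input === 0 && size === input.length) {
    output.set(input, offset_output);
    return;
  }
  // General case: partial copy from an input offset.
  for (let i = 0; i < size; i++) {
    output[offset_output + i] = input[offset_input + i];
  }
}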

@padenot
Owner

padenot commented Aug 9, 2023

If anybody can try this PR on a "real" workload and can report results, I'd be happy to have a look at the data.

@FallingSnow
Author

FallingSnow commented Aug 9, 2023

Ah, that makes sense. While .subarray() is copy-free, it is not GC-free, and I forgot about the GC requirements of this library.

I believe one way around this is to avoid wrapping pushes that would require 2 separate copies into the ring buffer. Instead, an artificial end pointer would be set and we would write to the beginning of the ring buffer (given there is enough room). The artificial end pointer could then be cleared after the read pointer passes it. This would create some "wasted" space at the end of the ring buffer, but given the 4x-15x copy time improvement it may be worth it for some fantasy workload I can't think of right now lol.

(Diagram: ring buffer with an artificial end pointer.)
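A rough, hypothetical sketch of that push logic (none of these names exist in ringbuf.js; this only illustrates the idea):

// Hypothetical: avoid a wrapped (two-part) write by "wasting" the tail.
push(data) {
  if (this.write + data.length > this.capacity) {
    // Not enough contiguous room before the physical end: mark an
    // artificial end and restart at index 0 (assuming enough free space there).
    this.artificialEnd = this.write;
    this.write = 0;
  }
  this.storage.set(data, this.write); // one contiguous set(), no split copy
  this.write += data.length;
  // The reader clears this.artificialEnd once its read pointer passes it.
}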

However, this is most likely out of the scope of this library, and I'm not sure the complexity is worth it, but it is interesting to think about.

Thank you for your "fast path" additions.

@kyr0
Contributor

kyr0 commented Jan 12, 2025

Let me add that with loop unrolling, the JIT has an easier time identifying SIMD/vectorization opportunities and applying other optimizations too, making ringbuf.js ~20% faster with my JIT-optimized implementation (not changing anything algorithmically):

_copy(input, offset_input, output, offset_output, size) { 
  let i = 0;
  const unroll_factor = 4; 
  for (; i <= size - unroll_factor; i += unroll_factor) {
    output[offset_output + i] = input[offset_input + i];
    output[offset_output + i + 1] = input[offset_input + i + 1];
    output[offset_output + i + 2] = input[offset_input + i + 2];
    output[offset_output + i + 3] = input[offset_input + i + 3];
  }

  // handle any remaining elements
  for (; i < size; i++) {
    output[offset_output + i] = input[offset_input + i];
  }
}

Benchmark reproduction link

@padenot
Owner

padenot commented Jan 13, 2025

@kyr0, encouraging, do you have a patch on ringbuf.js lying around that you could use for a PR, or just a link to a patch?

Unrolling by e.g. 16 also yields faster results here. http://paul.cx/public/bench-unroll.html is a standalone page that tries various things.

@kyr0
Contributor

kyr0 commented Jan 13, 2025

@kyr0, encouraging, do you have a patch on ringbuf.js lying around that you could use for a PR, or just a link to a patch?

Unrolling by e.g. 16 also yields faster results here. http://paul.cx/public/bench-unroll.html is a standalone page that tries various things.

@padenot As to my understanding, the optimizations the JIT can do depend on the specific underlying hardware. If your CPU supports SSE3 or SSE4 (Intel), AVX, or modern ARM NEON, then you might see better results unrolling by 8 or even 16. For WebAssembly, for example, the working group defined v128 (128-bit-wide instructions) as the best common fit. This translates well to most bare-metal CPUs, and this is why I chose 4 over 16: 4 times 32 bit fits well in one 128-bit SIMD register. As we don't know the type of hardware most users have, and as all SIMD instruction sets wider than 128 bit also support 128-bit operations, I'd suggest sticking with unrolling by 4, because this strategy has the highest probability of leading to performance gains for the majority of users.

And yes, I JIT-optimized all for loops in RingBuf.js meanwhile, and ported the whole project to TypeScript.

https://github.com/kyr0/microbe/tree/main/src/ringbuf

I put the Apache 2 license and credited your name, but this is not an official release, and I don't want to release it under my name, as it is your "baby" and your work. I'm not sure if you would like anything TypeScript as a PR. Please let me know. I can certainly also backport my impl. as a PR in pure JavaScript (I love both languages, started with JS in 1999 and TS three years after its first release :). In general, I tend to start projects with TypeScript these days, so I simply ported it all.

The project I need RingBuf for is part of a bigger spare-time project of mine, and an experiment for its feasibility in 2025 :) It worked out exceptionally well. I think you might find it interesting too... my goal was maximum performance for realtime audio DSP tasks; including enabling multithreading.

I used Rust to implement a very simple Audio Engine and compile it to WebAssembly. Now WebAssembly can support multithreading these days. If you want to have SIMD support, and Atomics support, you may enable that as well. All of this can be integrated and enabled with full Rust compiler optimizations support, if you are opting in for a custom/nightly Rust runtime with a pinned version:

https://github.com/kyr0/microbe/blob/main/.cargo/config.toml#L2

Rust has a very nice crate, Rayon; -- it allows to run typical operations on data structures transparently multithreaded and atomically synced. A little while back, wasm-bindgen-rayon was released (I think by Google). I used that to enable Rayon with WebAssembly in Rust.

Now I thought: "I want to build my own DAW for the web platform. But when I tried in 2015, 2019, 2021, I stepped into performance bottlenecks again and again."; So I had a "stupid" idea on Friday... what, if I implement my own Audio Engine "natively" in highly optimized WebAssembly using Rust and pass the audio buffers produced continuously to an AudioWorkletProcessorNode via a RingBuffer that uses Atomics and therefore is lock-free? Wouldn't that allow me to use all cores of a computer for DSP and guarantee jitter/underrun-free processing?

To get there, I had to re-implement parts of RingBuf.js in Rust (the parts that write). So I did: https://github.com/kyr0/microbe/blob/main/src/ringbuf.rs

This is a 1:1 clean room implementation; as you can see I also use Atomics here.
In the Audio Engine, I continuously produce an audio signal: https://github.com/kyr0/microbe/blob/main/src/engine.rs#L214 and to prevent the main loop from hanging (busy-wait), I do a little Promise callback ever when there's a good moment:
https://github.com/kyr0/microbe/blob/main/src/engine.rs#L273
(comments are a bit all over the place atm; it's a sandbox codebase...)

This is basically just a very basic subtractive synth atm. and not a DAW with a sequencer etc.; but I had to start somewhere..
The buffers produced are written to the RingBuffer via the SharedArrayBuffer:
https://github.com/kyr0/microbe/blob/main/src/engine.rs#L259

Vite is so kind to provide the necessary headers, if you ask it to do so: https://github.com/kyr0/microbe/blob/main/vite.config.ts#L19

...and it also bundles and loads both the resulting WebAssembly module and the AudioWorklet code without any hassle:
https://github.com/kyr0/microbe/blob/main/src/index.ts#L2

The only thing I do with the Web Audio API is receiving the audio signal via RingBuf.js and the SharedArrayBuffer in the AudioWorklet: https://github.com/kyr0/microbe/blob/main/src/index.ts#L82

I then forward the signal to AudioContext.destination:
https://github.com/kyr0/microbe/blob/main/src/signal-forwarder.ts#L30

And that's it... it currently computes 4096 oscillators in parallel on my machine, using all 8 cores efficiently, using high efficient assembly/vector intrinsics, deterministic timing, GC-free critical path, no heap allocations, no locks, and almost copy-free regarding the buffers. This is absolutely fantastic. I still can't believe it! I can finally implement my DAW from scratch! And it could be as performant as Reaper/Ableton/Logic Pro X/FL Studio etc. natively -- maybe even outperforming them, as any code written is automatically optimized for SIMD and multicore processing.

I'm absolutely stunned by this! I expected much more trouble and headache -- especially with the RingBuffer and different Web Audio implementations. But, in my limited tests, it works really, really well -- both in the most recent versions of Chrome, Firefox and Edge. So... thanks a lot for your outstanding work over all of the years (!!!).

The only artifacts audible are coming from my bad code in the Audio Engine ;)) But the number crunching/heavy lifting works really, really well. It also works fine for all typical buffer sizes and sampling rates - independent of the Web Audio buffer sizes.. :) (well... not really, but "virtually"). I have some stuff to work on still... like Firefox not liking my way of waking up the main loop/breaking busy-wait... but that should all be fixable.

Here is a live demo: https://stackblitz.com/~/github.com/kyr0/microbe?file=src/signal-forwarder.ts:L30

@padenot
Owner

padenot commented Jan 14, 2025

@kyr0, encouraging, do you have a patch on ringbuf.js lying around that you could use for a PR, or just a link to a patch?
Unrolling by e.g. 16 also yields faster results here. http://paul.cx/public/bench-unroll.html is a standalone page that tries various things.

@padenot As to my understanding, the optimizations the JIT can do depend on the specific underlying hardware. If your CPU supports SSE3 or SSE4 (Intel), AVX, or modern ARM NEON, then you might see better results unrolling by 8 or even 16. For WebAssembly, for example, the working group defined v128 (128-bit-wide instructions) as the best common fit. This translates well to most bare-metal CPUs, and this is why I chose 4 over 16: 4 times 32 bit fits well in one 128-bit SIMD register. As we don't know the type of hardware most users have, and as all SIMD instruction sets wider than 128 bit also support 128-bit operations, I'd suggest sticking with unrolling by 4, because this strategy has the highest probability of leading to performance gains for the majority of users.

I checked with SpiderMonkey engineers (I work at Mozilla). We do not unroll loops automatically. https://bugzilla.mozilla.org/show_bug.cgi?id=1039458 is an old patch that implemented it, but we haven't merged it. I don't think anybody auto-vectorizes loops either, which is orthogonal to unrolling but somewhat related.

That said, it's extremely common to unroll SSE2 loops by 4 themselves, processing e.g. 16 f32 at once. What is important is to not spill to the stack, but that's easy to verify by looking at the assembly. But we're in JS here, so that's not something we can do.

And yes, I JIT-optimized all for loops in RingBuf.js meanwhile, and ported the whole project to TypeScript.

kyr0/microbe@main/src/ringbuf

I put the Apache 2 license and credited your name, but this is not an official release, and I don't want to release it under my name, as it is your "baby" and your work. I'm not sure if you would like anything TypeScript as a PR. Please let me know. I can certainly also backport my impl. as a PR in pure JavaScript (I love both languages, started with JS in 1999 and TS three years after its first release :). In general, I tend to start projects with TypeScript these days, so I simply ported it all.

I'd be happy to at least have types, or to port it to ts altogether, I also do both, depending, but prefer ts if I can.

The project I need RingBuf for is part of a bigger spare-time project of mine, and an experiment for its feasibility in 2025 :) It worked out exceptionally well. I think you might find it interesting too... my goal was maximum performance for realtime audio DSP tasks; including enabling multithreading.

I used Rust to implement a very simple Audio Engine and compile it to WebAssembly. Now WebAssembly can support multithreading these days. If you want to have SIMD support, and Atomics support, you may enable that as well. All of this can be integrated and enabled with full Rust compiler optimizations support, if you are opting in for a custom/nightly Rust runtime with a pinned version:

Yes, I've done plenty of this in 2018, when Mozilla pioneered SAB + WASM with Rust and we had AudioWorkletProcessor working; it works amazingly well, including SIMD, w/ rayon or other concurrency libs, etc.

[... snip the long part about the architecture... ]

Very cool. Keep in mind that the Worker threads aren't high priority, so you need to keep some buffering around and can't reach the lowest latency possible.

If you restrict your processing to the AudioWorkletProcessor.process method, the code will run synchronously off the real-time audio callback of the system, and you can have more or less the same latencies as native DAWs, because we're using the same system-level things a DAW uses (whether it is thread priority class, regular thread priority management, etc. it depends on the OS and OS version).
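To illustrate that pattern, a minimal consumer-side sketch (assuming ringbuf.js's RingBuffer and AudioReader helpers; the processor name, option names, and stereo output are illustrative assumptions):

import { RingBuffer, AudioReader } from "ringbuf.js";

// All audio work happens inside process(), which runs synchronously on the
// real-time audio thread; the producer fills the SharedArrayBuffer elsewhere.
class ForwarderProcessor extends AudioWorkletProcessor {
  constructor(options) {
    super();
    const sab = options.processorOptions.sab; // SharedArrayBuffer posted from the main thread
    this._reader = new AudioReader(new RingBuffer(sab, Float32Array));
    this._interleaved = new Float32Array(128 * 2); // one render quantum, stereo interleaved
  }
  process(inputs, outputs) {
    this._reader.dequeue(this._interleaved); // pull what the producer has written
    const [left, right] = outputs[0];        // assumes a single stereo output
    for (let i = 0; i < left.length; i++) {
      left[i] = this._interleaved[2 * i];
      right[i] = this._interleaved[2 * i + 1];
    }
    return true; // keep the processor alive
  }
}
registerProcessor("forwarder-processor", ForwarderProcessor);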

@kyr0
Contributor

kyr0 commented Jan 14, 2025

I checked with SpiderMonkey engineers (I work at Mozilla). We do not unroll loops automatically.

Interesting! TurboFan in v8, at-a-glance, seems to do so: https://github.com/v8/v8/blob/main/src/compiler/loop-unrolling.cc
They even have an unrolling_count_heuristic to determine the best unrolling factor. The loop is unrolled unrolling_count times, and they take care of leftover iterations. A NodeVector named copies holds the unrolled copies of the loop body.

I don't think anybody auto-vectorize loops either

Hmm, I re-checked: https://github.com/v8/v8/blob/main/src/compiler/revectorizer.cc
It seems like SLPTree builds a tree of operations for potential vectorization and nodes with SIMD 128 operations are combined into SIMD 256 operations. PackNode packs into larger SIMD units. VectorizeTree replaces 128-bit operations with their 256-bit counterparts. SIMPLE_SIMD_OP, SIMD_SHIFT_OP, etc. define SIMD operations for vectorization.
https://github.com/v8/v8/blob/main/src/compiler/revectorizer.cc#L403

Do you agree that my understanding of this is correct? I'd be truly happy to learn if I overlooked something. The deeper I understand stuff, the fewer issues I get with my user-land code :))

That said, it's extremely common to unroll SSE2 loops by 4 themselves, processing e.g. 16 f32 at once.

Reading the linked implementation, I come to the conclusion that factor 4 is safer for smaller or less computationally intensive loops. Factor 8 balances performance and code size for medium-sized / compute-intensive loops. If the loop dominates execution time, higher unrolling (8 or 16) might be a good idea. This could be tested with profiling to make a more informed decision. For SIMD optimizations, the specific implementation seems to suggest going with a factor of either 4 or 8, because of the code's focus on 128- and 256-bit-wide instruction optimizations.

However.. I'm not a pro in all of this. I'm just reading the code and trying to jump to conclusions. And this is only v8.. I haven't checked the SpiderMonkey code nor JSC. I had to mess with v8 in the past because of an engine embedding project, so I'm a bit more familiar with their codebase.

I checked with SpiderMonkey engineers (I work at Mozilla). We do not unroll loops automatically.

Oh, cool! Thank you - that's really interesting! This means that I probably over-extended my understanding a bit too much by generalizing that this would be something, every optimizer implementation would do these days.

btw.: It has been my dream since 2003 to work for Mozilla someday. But I never really tried to apply.. xD

Yes, I've done plenty of this in 2018, when Mozilla pioneered SAB + WASM with Rust and we had AudioWorkletProcessor working; it works amazingly well, including SIMD, w/ rayon or other concurrency libs, etc.

That's super cool! Do you still have code published from those days? It would be a pleasure to read! I guess I can learn a lot there.

Very cool. Keep in mind that the Worker threads aren't high priority, so you need to keep some buffering around and can't reach the lowest latency possible.

Thank you! Yeah, I already realized this painfully yesterday. Sometimes, rather randomly, the buffer would underrun. So I checked whether that was a priority issue and read a bit on how the audio thread is usually implemented. I found, for example, https://github.com/chromium/chromium/blob/main/chromecast/media/audio/audio_io_thread.cc#L22 -- rendering/GPU threads are abstracted as base::ThreadType::kDisplayCritical, and workers can even be base::ThreadType::kBackground, suggesting they may be scheduled as less important than even the default (depending on the specific OS kernel scheduler, I guess), so we can expect to lag behind when code is executed in Web Workers. The worst part is that we cannot know by how much our code is delayed. I ended up adding quite a bit of safety margin: bufferSize * channels /* channels, interleaved */ * 12 /* safety margin for scheduling delay */ -- this made sure that even on low battery, under high CPU load (96%+), with the high-performance cores dropping in performance because of heat and low battery, I would not experience stutters. (beware: "Works on my machine!" ;))
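For illustration, that sizing could be sketched like this (variable names are made up; getStorageForCapacity is assumed to be ringbuf.js's capacity helper):

import { RingBuffer } from "ringbuf.js";

// Stereo interleaved frames with a 12x margin to absorb Worker scheduling delay.
const bufferSize = 128;     // frames per render quantum
const channels = 2;         // interleaved stereo
const safetyMargin = 12;    // covers worst-case scheduling jitter observed
const capacity = bufferSize * channels * safetyMargin;

const sab = RingBuffer.getStorageForCapacity(capacity, Float32Array);
const rb = new RingBuffer(sab, Float32Array);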

If you restrict your processing to the AudioWorkletProcessor.process method, the code will run synchronously off the real-time audio callback of the system, and you can have more or less the same latencies as native DAWs, because we're using the same system-level things a DAW uses (whether it is thread priority class, regular thread priority management, etc. it depends on the OS and OS version).

That's a good insight! I think my impl. should currently fulfill the criteria of not running any audio processing on the main thread. Reading from SAB only happens in the AudioWorklet: https://github.com/kyr0/microbe/blob/main/src/signal-forwarder.ts#L30

I'd be happy to at least have types, or to port it to ts altogether, I also do both, depending, but prefer ts if I can.

Oh, then we can just start with my codebase. I ported every LoC 1:1 already, except two deprecated methods I didn't cover (the snake-case named ones). As mentioned earlier, the loop unroll has been impl. in all loops.

https://github.com/kyr0/microbe/blob/main/src/ringbuf/audioqueue.ts#L27
https://github.com/kyr0/microbe/blob/main/src/ringbuf/audioqueue.ts#L58
https://github.com/kyr0/microbe/blob/main/src/ringbuf/ringbuf.ts#L354

Would you accept a "huge" PR? xD

@kyr0
Contributor

kyr0 commented Jan 14, 2025

Thinking about it further, and taking into consideration that this task is in fact a heavier workload, I added comparison tests for unroll 8 and unroll 16 to the benchmark:
https://jsbm.dev/tXzKxfJECopKg

Conclusions:

  • With this workload, the order of performance is: Safari (Fastest), Chrome, Firefox.
  • Smaller buffer sizes are overall faster of course.
  • Scaling is non-linear for all engines, bigger tasks scale better.
  • In general unroll factor 16 is the fastest across all ticket sizes and engines.
  • The performance difference between engines suggests that some do more optimization than others (obviously).
  • Therefore, because some engines entirely lack SIMD/vectorization optimizations, my initial assumption, that loop unrolling only works because of vectorization, is wrong. It probably works so well because of inlining and caching.
  • Once you introduce branching/control logic in your loops, inlining/caching isn't possible anymore, so all these nice gains are lost.
  • Some engines do loop unrolling, but interestingly, they seem not to be very good at it. Otherwise baseline, which doesn't have manual loop unrolling, would be as fast as a loop-unrolled variant. Manual loop unrolling remains a great idea, but it should be factor 16, not 4.
  • Based on these stats, the performance gains for the factor-16 unroll over baseline are between 300% and 349% -- so ~325%, or 3.25x faster on average, compared to baseline.
  • Of course, the original code should remain in a comment so the code stays readable. An unroll by 16 is hard to understand at a glance if you have never encountered one.

==> Nice learning! I'll change the code to unroll by 16.
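For reference, a 16x unroll follows the same pattern as the 4x version above, just with sixteen assignments per iteration (sketch, not the final patched code):

_copy(input, offset_input, output, offset_output, size) {
  let i = 0;
  const unroll_factor = 16;
  for (; i <= size - unroll_factor; i += unroll_factor) {
    output[offset_output + i] = input[offset_input + i];
    output[offset_output + i + 1] = input[offset_input + i + 1];
    output[offset_output + i + 2] = input[offset_input + i + 2];
    output[offset_output + i + 3] = input[offset_input + i + 3];
    output[offset_output + i + 4] = input[offset_input + i + 4];
    output[offset_output + i + 5] = input[offset_input + i + 5];
    output[offset_output + i + 6] = input[offset_input + i + 6];
    output[offset_output + i + 7] = input[offset_input + i + 7];
    output[offset_output + i + 8] = input[offset_input + i + 8];
    output[offset_output + i + 9] = input[offset_input + i + 9];
    output[offset_output + i + 10] = input[offset_input + i + 10];
    output[offset_output + i + 11] = input[offset_input + i + 11];
    output[offset_output + i + 12] = input[offset_input + i + 12];
    output[offset_output + i + 13] = input[offset_input + i + 13];
    output[offset_output + i + 14] = input[offset_input + i + 14];
    output[offset_output + i + 15] = input[offset_input + i + 15];
  }
  // handle any remaining elements
  for (; i < size; i++) {
    output[offset_output + i] = input[offset_input + i];
  }
}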

Stats:

let ticketSizeDivOpsUnroll16 = {  // smaller = faster
 safari1M :     1000000/3604,     // 277
 chrome1M :     1000000/1980,     // 505
 firefox1M :    1000000/976,      // 1024
   
 safari10k :    10000/418841,     // 0.02
 chrome10k :    10000/272068,     // 0.03
 firefox10k :   10000/168300,     // 0.05
   
 safari5k :     5000/1304507,     // 0.003
 chrome5k :     5000/547179,      // 0.009
 firefox5k :    5000/335512,      // 0.01
}


let ticketSizeDivOpsUnroll8 = {
 safari1M :     1000000/2267,     // 441
 chrome1M :     1000000/1178,     // 848
 firefox1M :    1000000/707,      // 1414
   
 safari10k :    10000/272068,     // 0.03
 chrome10k :    10000/120952,     // 0.08
 firefox10k :   10000/72013,      // 0.13
   
 safari5k :     5000/547179,      // 0.009
 chrome5k :     5000/244790,      // 0.02
 firefox5k :    5000/144840,      // 0.03
}

let ticketSizeDivOpsUnroll4 = {
 safari1M :     1000000/1457,     // 686
 chrome1M :     1000000/590,      // 1694
 firefox1M :    1000000/400,      // 2500
   
 safari10k :    10000/168300,     // 0.05
 chrome10k :    10000/64260,      // 0.15
 firefox10k :   10000/40097,      // 0.2
   
 safari5k :     5000/335512,      // 0.01
 chrome5k :     5000/126038,      // 0.03
 firefox5k :    5000/82113,       // 0.06
}

(Benchmark screenshots for ticket sizes 1M, 10k, and 5k, in Safari, Chrome, and Firefox.)

@kyr0
Contributor

kyr0 commented Jan 14, 2025

When optimizing deinterleave in audioqueue.js, I was wondering if the performance gains were the same with a smaller ticket size of 128 and a slightly different workload.

https://jsbm.dev/yDm8bhZsKYrju

Turns out that:

  • The original workload (copy) scales well with ticket size 128. Unroll factor 16 is (and remains) the fastest across all engines.
  • With the new workload (deinterleave) / changed algo, we see a drastically different picture.
  • Unroll factor 4 is the fastest across all engines.
  • Safari is 2.3x faster than Firefox, which is 4x faster than Chrome in this case.
  • All are ~2x faster than baseline w/o unrolling.

Lessons learned: Always thoroughly profile every little workload change... we should stick with unroll factor 4 for deinterleave.
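A sketch of what a 4x-unrolled deinterleave can look like (illustrative only; the actual audioqueue.js signature may differ), assuming an interleaved input buffer and one planar output array per channel:

function deinterleave(input, outputs) {
  const channels = outputs.length;
  const frames = input.length / channels;
  for (let ch = 0; ch < channels; ch++) {
    const out = outputs[ch];
    let i = 0;
    for (; i <= frames - 4; i += 4) {
      out[i] = input[i * channels + ch];
      out[i + 1] = input[(i + 1) * channels + ch];
      out[i + 2] = input[(i + 2) * channels + ch];
      out[i + 3] = input[(i + 3) * channels + ch];
    }
    // remaining frames
    for (; i < frames; i++) {
      out[i] = input[i * channels + ch];
    }
  }
}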

(Benchmark screenshots for Safari, Firefox, and Chrome.)

@kyr0
Contributor

kyr0 commented Jan 14, 2025

Interleave does not profit much from unrolling and the results diverge per-engine.

  • Chrome is the fastest and benefits most from factor 4.
  • Safari is fastest with unroll factor 16.
  • Firefox suffers from unrolling here. It should use baseline.
  • Because none of them significantly benefit from loop unrolling, and engine detection would be necessary so that none of them suffers, I'd probably stick with baseline here.

https://jsbm.dev/0kwhJsBG4CLDy

(Benchmark screenshots for Chrome, Safari, and Firefox.)

@kyr0
Contributor

kyr0 commented Jan 14, 2025

I updated and tested the TypeScript port of RingBuf: https://github.com/kyr0/microbe/blob/main/src/ringbuf/ringbuf.ts#L354 with all loop-unrolling changes, and it seems to work flawlessly.

@padenot
Owner

padenot commented Jan 15, 2025

I think we don't do enough iterations for the right JIT tier to kick in here, and using means isn't statistically sound, because we haven't looked at the data distribution.

https://paul.cx/public/bench-unroll.html runs a lot more workloads and plots histograms; please feel free to adapt it. We see some slight bimodality in the distribution, showing that the mean isn't what we want here.

In my own testing, I see Firefox always faster than Chrome w/ unroll by 16, on both Linux x86_64 and macOS aarch64, and Chrome always faster than Firefox without unroll. This is on an AMD Ryzen Threadripper PRO 7975WX, with Firefox Nightly and Chrome Dev.

I see Safari twice faster than Firefox on macOS aarch64, which is already 20% faster than Chrome. This is on a M1 Max, Firefox Nightly, Safari Tech Preview, Chrome Canary.

But all in all, unrolling by 16x is the sweet spot, it seems like.

@kyr0
Contributor

kyr0 commented Jan 15, 2025

That's a good point. Of course, when the test runs for a prolonged amount of time I'd also expect the optimizer to pick better strategies. But the code of interleave() and deinterleave() is hard-coded to 128 samples only and, to my understanding, runs intermittently in practice: {user-land control logic, some invocations and context switching} ... deinterleave() ... {user-land control logic, some invocations and context switching} deinterleave(). Also, the algorithm changes slightly: there are multiplications and different assignments going on which are not happening in copy.

Only when I put exactly that code to the test did the differences between the algorithms of copy, deinterleave and interleave show up, suggesting different unroll factors to be more or less effective. For copy, 16 was always the best. But even if I put, for example, 10000 as the size instead of 128 in https://jsbm.dev/0kwhJsBG4CLDy for deinterleave or interleave, the sweet spot is not factor 16 but 4.

I'm also fine with just assuming that unroll by 16 might be the best strategy and settle for it in my PR.

I think it would be interesting to know exactly what the underlying cause of this is, so that we can build a better mental model. Besides the tests not being exactly statistically sound (for example, every time I reload https://paul.cx/public/bench-unroll.html I get completely different results -- I guess we need more iterations and the mean across whole test runs; I did that in https://github.com/jsheaven/perf), I currently suspect that even if the optimizer could find better strategies over many invocations, only so many optimizations are available for each very specific implementation. For some reason, the best optimizations for interleave and deinterleave are often found with unroll by 4 only, while in the case of copy the ops differ (and the ticket size is larger). Maybe that unlocks optimization potential because specific, more performant instructions can be used with the ops in the loop body of copy, so unroll factor 16 seems to be the sweet spot for that algorithm, while with the ops in the loop bodies of the other functions this isn't the case.

Because in my tests, neither the number of samples nor the number of invocations made a difference for interleave and deinterleave -- the results would always show that for these specific implementations, 16 doesn't work that well, across all engines, except Firefox with deinterleave.

I guess one way to find out would be to copy my test code that tests the specific algos to your test code to verify. I'll probably be able to do that tonight. I like the way you use Plotly there.

btw: as said, I'm fine with just assuming that unroll by 16 might be the best strategy and settling for it in my PR. I'm just always very curious about getting the mental model exactly right in my head -- even if that means extensive research. I've found this can save me from future pain.

@padenot
Owner

padenot commented Jan 15, 2025

For a good JIT tier to be reached, the code has to run a certain amount of time. The good code is kept even though something else executes in the meantime.

If you want to understand what is going on, just e.g. profile your chrome instance and load the result in perfetto. Zoom during the compilation stage and there will be markers saying what is happening. It looks like this:

(Screenshot of a Perfetto trace with JIT compilation markers.)

@kyr0
Contributor

kyr0 commented Jan 16, 2025

Nice!! Thank you! That is insightful!

If I'm reading the graph correctly, deinterleave is optimized once by TurboFan within a 1-minute recording. I only selected the timeframe where the audio worker thread was actively playing back audio.

(Perfetto trace screenshots.)

I also recorded for a shorter time (15 seconds) and it also only shows one optimization pass; there seem to be no further re-optimizations down the road. But as I've only known the tools for a few minutes, I may have misinterpreted the graph.

Also, I copied your code and locally checked with the deinterleave algo. Here for your verification:
https://github.com/kyr0/microbe/blob/main/src/ringbuf/deinterleave-bench.html

While implementing this, I found that:

  • performance.now() is too coarse for measuring such small amounts of time, obviously leading to "0 time consumed" results; I added batch execution (sketched after this list)
  • I pre-generate the random buffer data: each run gets its own random data, but no random generation happens while benchmarking
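A sketch of the batched measurement (illustrative; the function name is made up, and the point is only that a single 128-sample call is too short to time directly):

function benchBatched(fn, iterations = 10_000) {
  const start = performance.now();
  for (let i = 0; i < iterations; i++) {
    fn();
  }
  const elapsed = performance.now() - start;
  return elapsed / iterations; // mean time per call, in milliseconds
}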

Okay, here are the results (Apple MacBook Air M3, arm64):

(Benchmark screenshots for Chrome, Safari, and Firefox.)

Of course, with every reload the stats look a little different; but in 95%+ of my tests, unroll by 4 won in all engines for deinterleave.

For copy, unroll by 16 always wins in all engines.

@kyr0
Contributor

kyr0 commented Jan 16, 2025

Double-checking on a Linux machine, Intel Core i5-13500, x86-64:

(Benchmark screenshots for Chrome and Firefox.)

Just to clarify: This is only for deinterleave and interleave -- copy works best with unroll by 16.

How do we want to proceed? I could just PR my TypeScript code, upgrading the impl. in this repo 1:1 to TypeScript, line by line, plus the optimizations (reading the latest graphs, I'd suggest 16 for copy and 4 for the others, but as you prefer!), if that would seem "mergeable" to you.

@padenot
Owner

padenot commented Jan 16, 2025

Yeah, just send the PR over. As long as the interface doesn't change, downstream consumers aren't affected, and they can still get a .js file, we're good. Let's check that all tests, examples, and docs are updated and working.

Thanks for all the work, in any case.

@kyr0
Contributor

kyr0 commented Jan 16, 2025

@padenot You're welcome! I'm happy that you've built this wonderful library which I can build my whole DAW on! Well, and it's also really cool that I could learn a few more things in our conversation here :)

@padenot
Owner

padenot commented Apr 16, 2025

This is now live in https://www.npmjs.com/package/ringbuf.js/v/0.4.0!

@FallingSnow
Author

Thanks kyr0 and padenot!
