Shouldn't we use TypedArray.set instead of a for loop for copying arraybuffers? #22
We can in some capacity. The problem is that we need to copy part of a source buffer into part of a destination buffer. The way to do this in JavaScript is to take a subarray of the source and set it into the destination at an offset. This is not enough to express everything that's needed for this ring buffer: the copy can wrap around the end of the internal storage, in which case it has to be split up.

We can make it so that the non-wrapping case takes a fast path based on set. When running the test suite (which is admittedly a bit contrived, with workloads that aren't that realistic), this fast path can be taken in about 28% of the cases. We can also add two new methods that always use set.
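A minimal sketch of that fast-path idea, assuming a hypothetical ring-buffer layout (the names below are illustrative, not the actual ringbuf.js internals): when the destination region doesn't cross the end of the storage, one bulk set of a subarray suffices; when it wraps, the copy is split in two.

```js
// Hypothetical sketch; `storage` is the backing Float32Array, `writeIdx` the
// current write index, `capacity` the storage length. Availability checks are
// assumed to have happened already.
function copyIntoRing(storage, writeIdx, capacity, src, srcOffset, count) {
  if (writeIdx + count <= capacity) {
    // Fast path: the region doesn't wrap, a single bulk copy via set().
    storage.set(src.subarray(srcOffset, srcOffset + count), writeIdx);
  } else {
    // Wrapping case: split into two bulk copies.
    const firstPart = capacity - writeIdx;
    storage.set(src.subarray(srcOffset, srcOffset + firstPart), writeIdx);
    storage.set(src.subarray(srcOffset + firstPart, srcOffset + count), 0);
  }
  return (writeIdx + count) % capacity; // new write index
}
```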
If anybody can try this PR on a "real" workload and can report results, I'd be happy to have a look at the data.
Ah, that makes sense. I believe one way around this is to avoid wrapping pushes that would require two separate copies into the ring buffer. Instead, an artificial end pointer would be set and we would write to the beginning of the ring buffer (given there is enough room). The artificial end pointer could then be cleared after the read pointer passes it. This would create some "wasted" space at the end of the ring buffer, but given the 4x - 15x copy-time improvement it may be worth it for some fantasy workload I can't think of right now lol. However, this is most likely out of the scope of this library and I'm not sure the complexity is worth it, but it is interesting to think about. Thank you for your "fast path" additions.
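A rough sketch of that idea under stated assumptions (it resembles a "bip buffer" layout; the field names are illustrative and availability/room checks are omitted): a write that would wrap instead records a watermark and starts over at index 0, so every copy stays contiguous at the cost of some wasted space at the tail.

```js
// Illustrative only, not part of ringbuf.js. `rb` holds: storage (Float32Array),
// capacity, write (write index), watermark (artificial end, or -1 when unset).
function writeNoWrap(rb, src) {
  const count = src.length;
  if (rb.write + count <= rb.capacity) {
    rb.storage.set(src, rb.write);   // fits before the physical end
    rb.write += count;
  } else {
    rb.watermark = rb.write;         // reader must jump back to 0 from here
    rb.storage.set(src, 0);          // single contiguous copy at the front
    rb.write = count;
  }
}

// Reader side: when the read index reaches rb.watermark, reset it to 0 and
// clear the watermark; the space between watermark and capacity is "wasted".
```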
@kyr0, that's encouraging. Do you have a patch against ringbuf.js lying around that you could turn into a PR, or a link to a patch? Unrolling by e.g. 16 also yields faster results here. http://paul.cx/public/bench-unroll.html is a standalone page that tries various things.
@padenot As far as I understand it, the optimizations the JIT can do depend on the specific underlying hardware. If your CPU supports SSE3 or SSE4 (Intel), AMD AVX, or modern ARM NEON (Apple Silicon), you might see better results unrolling by 8 or even 16. For WebAssembly, for example, the working group defined v128 (128-bit wide instructions) as the best common fit. This translates well to most bare-metal CPUs, and it is why I chose 4 over 16: four 32-bit values fit exactly into one 128-bit SIMD register. As we don't know what hardware most users have, and as all SIMD instruction sets wider than 128 bit also support 128-bit operations, I'd suggest sticking with unrolling by 4, because this strategy has the highest probability of yielding performance gains for the majority of users.

And yes, I have meanwhile JIT-optimized all for loops in RingBuf.js and ported the whole project to TypeScript: https://github.com/kyr0/microbe/tree/main/src/ringbuf I kept the Apache 2 license and credited your name, but this is not an official release, and I don't want to release it under my name, as it is your "baby" and your work. I'm not sure whether you would want anything TypeScript as a PR; please let me know. I can certainly also backport my implementation as a PR in pure JavaScript (I love both languages; I started with JS in 1999 and with TS three years after its first release :). In general I tend to start projects with TypeScript these days, so I simply ported it all.

The project I need RingBuf for is part of a bigger spare-time project of mine, and an experiment in its feasibility in 2025 :) It worked out exceptionally well, and I think you might find it interesting too. My goal was maximum performance for realtime audio DSP tasks, including multithreading. I used Rust to implement a very simple audio engine and compiled it to WebAssembly. WebAssembly can support multithreading these days, and if you want SIMD and Atomics support, you can enable that as well. All of this can be integrated with full Rust compiler optimizations if you opt in to a custom/nightly Rust toolchain with a pinned version: https://github.com/kyr0/microbe/blob/main/.cargo/config.toml#L2

Rust also has a very nice crate, Rayon; it lets typical operations on data structures run transparently multithreaded and atomically synced. A little while back, wasm-bindgen-rayon was released (I think by Google), and I used that to enable Rayon with WebAssembly in Rust. Now I thought: "I want to build my own DAW for the web platform, but when I tried in 2015, 2019, and 2021, I ran into performance bottlenecks again and again." So I had a "stupid" idea on Friday: what if I implement my own audio engine "natively" in highly optimized WebAssembly using Rust and continuously pass the produced audio buffers to an AudioWorkletProcessor via a ring buffer that uses Atomics and is therefore lock-free? Wouldn't that allow me to use all cores of a computer for DSP and guarantee jitter/underrun-free processing?

To get there, I had to re-implement the writing parts of RingBuf.js in Rust, so I did: https://github.com/kyr0/microbe/blob/main/src/ringbuf.rs This is a 1:1 clean-room implementation; as you can see, I also use Atomics here. At the moment it is basically just a very basic subtractive synth, not a DAW with a sequencer etc., but I had to start somewhere.
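For reference, a minimal sketch of a copy loop unrolled by 4 (illustrative only, not the exact code in the ringbuf port): the main loop handles four elements per iteration, and a trailing scalar loop handles the remainder.

```js
// Copy `count` elements from src[srcOffset..] to dst[dstOffset..],
// unrolled by a factor of 4, with a scalar tail for the remainder.
function copyUnroll4(dst, dstOffset, src, srcOffset, count) {
  const main = count - (count % 4);
  let i = 0;
  for (; i < main; i += 4) {
    dst[dstOffset + i]     = src[srcOffset + i];
    dst[dstOffset + i + 1] = src[srcOffset + i + 1];
    dst[dstOffset + i + 2] = src[srcOffset + i + 2];
    dst[dstOffset + i + 3] = src[srcOffset + i + 3];
  }
  for (; i < count; i++) {
    dst[dstOffset + i] = src[srcOffset + i];
  }
}
```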
Vite is kind enough to provide the necessary headers if you ask it to: https://github.com/kyr0/microbe/blob/main/vite.config.ts#L19 and it also bundles and loads both the resulting WebAssembly module and the AudioWorklet code without any hassle.

The only thing I do with the Web Audio API is receive the audio signal via RingBuf.js and the SharedArrayBuffer in the AudioWorklet: https://github.com/kyr0/microbe/blob/main/src/index.ts#L82 and then forward the signal to AudioContext.destination. And that's it... it currently computes 4096 oscillators in parallel on my machine, using all 8 cores efficiently, with highly efficient assembly/vector intrinsics, deterministic timing, a GC-free critical path, no heap allocations, no locks, and almost no buffer copies. This is absolutely fantastic. I still can't believe it! I can finally implement my DAW from scratch! And it could be as performant as Reaper/Ableton/Logic Pro X/FL Studio etc. natively, maybe even outperforming them, as any code written is automatically optimized for SIMD and multicore processing.

I'm absolutely stunned by this! I expected much more trouble and headache, especially with the ring buffer and different Web Audio implementations. But in my limited tests it works really, really well in the most recent versions of Chrome, Firefox, and Edge. So... thanks a lot for your outstanding work over all these years (!!!). The only audible artifacts come from my bad code in the audio engine ;)) But the number crunching/heavy lifting works really, really well. It also works fine for all typical buffer sizes and sampling rates, independent of the Web Audio buffer sizes... :) (well, not really, but "virtually"). I still have some things to work on, like Firefox not liking my way of waking up the main loop/breaking the busy-wait, but that should all be fixable.

Here is a live demo: https://stackblitz.com/~/github.com/kyr0/microbe?file=src/signal-forwarder.ts:L30
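For readers unfamiliar with the consumer side of this setup, here is a minimal sketch of an AudioWorkletProcessor draining a SharedArrayBuffer-backed ring buffer into its output. This is a sketch under assumptions, not the actual signal-forwarder.ts from the project above; the ringbuf.js names used (RingBuffer, AudioReader, dequeue) should be checked against the library's current API.

```js
// Assumes the SharedArrayBuffer was created elsewhere and handed to the node
// via processorOptions; a SAB keeps referring to the same shared memory after
// structured cloning, so producer and consumer see the same storage.
import { RingBuffer, AudioReader } from "ringbuf.js";

class SignalForwarder extends AudioWorkletProcessor {
  constructor(options) {
    super();
    const sab = options.processorOptions.sab;
    this.reader = new AudioReader(new RingBuffer(sab, Float32Array));
    this.tmp = new Float32Array(128); // one render quantum, mono
  }
  process(inputs, outputs) {
    const channel = outputs[0][0];
    // Drain up to one render quantum; if the producer fell behind, the rest
    // of the output stays at its default value (silence).
    const read = this.reader.dequeue(this.tmp);
    channel.set(this.tmp.subarray(0, read));
    return true; // keep the processor alive
  }
}
registerProcessor("signal-forwarder", SignalForwarder);
```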
I checked with SpiderMonkey engineers (I work at Mozilla). We do not unroll loops automatically. https://bugzilla.mozilla.org/show_bug.cgi?id=1039458 is an old patch that implemented it, but we haven't merged it. I don't think anybody auto-vectorizes loops either, which is orthogonal to unrolling but somewhat related. That said, when writing SSE2 loops by hand it's extremely common to unroll them by 4, processing e.g. 16 f32 at once. What is important is to not spill to the stack, but that's easy to verify by looking at the assembly. We're in JS here though, so that's not something we can do.
I'd be happy to at least have types, or to port it to TS altogether. I also do both, depending on the project, but prefer TS if I can.
Yes, I've done plenty of this in 2018, when Mozilla pioneered SAB + WASM with Rust, and we had [... snip the long part about the architecture ...] Very cool. Keep in mind that Worker threads aren't high priority, so you need to keep some buffering around and can't reach the lowest latency possible. If you restrict your processing to the ...
Interesting! At a glance, TurboFan in V8 seems to do so: https://github.com/v8/v8/blob/main/src/compiler/loop-unrolling.cc
Hmm, I re-checked: https://github.com/v8/v8/blob/main/src/compiler/revectorizer.cc Do you agree that my understanding of this is correct? I'd be truly happy to learn if I overlooked something. The deeper I understand this stuff, the fewer issues I run into with my user-land code :))
Reading the linked implementation, I come to the conclusion that a factor of 4 is safer for smaller or less computationally intensive loops, while a factor of 8 balances performance and code size for medium-sized, compute-intensive loops. If the loop dominates execution time, higher unrolling (8 or 16) might be a good idea. This could be tested with profiling to make a more informed decision. For SIMD optimizations, the implementation seems to suggest going with a factor of either 4 or 8, because of the code's focus on 128- and 256-bit wide instructions. However, I'm not a pro at all of this; I'm just reading the code and trying to draw conclusions. And this is only V8; I haven't checked the SpiderMonkey code nor JSC. I had to mess with V8 in the past for an engine-embedding project, so I'm a bit more familiar with their codebase.
Oh, cool! Thank you, that's really interesting! This means I probably over-generalized by assuming this would be something every optimizer implementation does these days. By the way, it has been my dream since 2003 to work for Mozilla someday, but I never really tried to apply... xD
That's super cool! Do you still have code published from those days? It would be a pleasure to read; I guess I could learn a lot there.
Thank you! Yeah, I already realized this painfully yesterday. Sometimes, rather randomly, the buffer would underrun. So I checked whether that was a priority issue and read a bit about how the audio thread is usually implemented, and found, for example, https://github.com/chromium/chromium/blob/main/chromecast/media/audio/audio_io_thread.cc#L22 Rendering/GPU threads are abstracted as base::ThreadType::kDisplayCritical, and workers can even be base::ThreadType::kBackground, suggesting they are scheduled with less importance than even the default, depending on the specific OS kernel scheduler, I guess. So we can expect to lag behind when code is executed in Web Workers. The worst part is that we cannot know by how much our code is delayed. I ended up adding quite a bit of safety margin:
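A back-of-the-envelope sketch of how such a safety margin could be sized (the numbers are illustrative assumptions, not values from the project): keep enough audio queued that the producer can be late by a worst-case scheduling delay without the AudioWorklet underrunning.

```js
// Illustrative margin calculation; tweak the assumed worst-case delay to taste.
const sampleRate = 48000;             // frames per second
const quantum = 128;                  // Web Audio render quantum, in frames
const worstCaseLatenessMs = 20;       // assumed worst-case worker scheduling delay
const marginFrames = Math.ceil((worstCaseLatenessMs / 1000) * sampleRate);
const minQueuedFrames = quantum + marginFrames;
console.log(
  `keep at least ${minQueuedFrames} frames ` +
  `(~${((minQueuedFrames / sampleRate) * 1000).toFixed(1)} ms) in the ring buffer`
);
```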
That's a good insight! I think my implementation currently fulfills the criterion of not running any audio processing on the main thread. Reading from the SAB only happens in the AudioWorklet: https://github.com/kyr0/microbe/blob/main/src/signal-forwarder.ts#L30
Oh, then we can just start with my codebase. I already ported every LoC 1:1, except for two deprecated methods I didn't cover (the snake_case-named ones). As mentioned earlier, the loop unrolling has been implemented in all loops: https://github.com/kyr0/microbe/blob/main/src/ringbuf/audioqueue.ts#L27 Would you accept a "huge" PR? xD
Thinking about it further, and taking into consideration that this task is in fact a heavier workload, I added comparison tests for unroll 8 and unroll 16 to the test. Conclusions:
==> Nice learning! I'll change the code to unroll by 16. Stats:

```js
let ticketSizeDivOpsUnroll16 = { // smaller = faster
  safari1M   : 1000000 / 3604,    // 277
  chrome1M   : 1000000 / 1980,    // 505
  firefox1M  : 1000000 / 976,     // 1024
  safari10k  : 10000 / 418841,    // 0.02
  chrome10k  : 10000 / 272068,    // 0.03
  firefox10k : 10000 / 168300,    // 0.05
  safari5k   : 5000 / 1304507,    // 0.003
  chrome5k   : 5000 / 547179,     // 0.009
  firefox5k  : 5000 / 335512,     // 0.01
}

let ticketSizeDivOpsUnroll8 = {
  safari1M   : 1000000 / 2267,    // 441
  chrome1M   : 1000000 / 1178,    // 848
  firefox1M  : 1000000 / 707,     // 1414
  safari10k  : 10000 / 272068,    // 0.03
  chrome10k  : 10000 / 120952,    // 0.08
  firefox10k : 10000 / 72013,     // 0.13
  safari5k   : 5000 / 547179,     // 0.009
  chrome5k   : 5000 / 244790,     // 0.02
  firefox5k  : 5000 / 144840,     // 0.03
}

let ticketSizeDivOpsUnroll4 = {
  safari1M   : 1000000 / 1457,    // 686
  chrome1M   : 1000000 / 590,     // 1694
  firefox1M  : 1000000 / 400,     // 2500
  safari10k  : 10000 / 168300,    // 0.05
  chrome10k  : 10000 / 64260,     // 0.15
  firefox10k : 10000 / 40097,     // 0.2
  safari5k   : 5000 / 335512,     // 0.01
  chrome5k   : 5000 / 126038,     // 0.03
  firefox5k  : 5000 / 82113,      // 0.06
}
```

[Charts in the original comment: "Ticket size: 1M", "Ticket size: 10k", "Ticket size: 5k"]
When optimizing deinterleave in audioqueue.js, I was wondering whether the performance gains were the same with a smaller ticket size of 128 and a slightly different workload: https://jsbm.dev/yDm8bhZsKYrju It turns out that:
Lessons learned: always thoroughly profile every little workload change... we should stick with an unroll factor of 4 for deinterleave.
Interleave does not profit much from unrolling, and the results diverge per engine.
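To make the access pattern concrete, here is a sketch of a deinterleave unrolled by 4 (illustrative only, not the audioqueue.js code): the reads are strided by the channel count, which plausibly explains why it responds to unrolling differently than a plain contiguous copy.

```js
// Split an interleaved Float32Array into per-channel output arrays.
// `outputs[ch]` must be pre-allocated with `input.length / channelCount` slots.
function deinterleaveUnroll4(input, channelCount, outputs) {
  const frames = input.length / channelCount;
  const main = frames - (frames % 4);
  for (let ch = 0; ch < channelCount; ch++) {
    const out = outputs[ch];
    let f = 0;
    for (; f < main; f += 4) {
      out[f]     = input[f * channelCount + ch];
      out[f + 1] = input[(f + 1) * channelCount + ch];
      out[f + 2] = input[(f + 2) * channelCount + ch];
      out[f + 3] = input[(f + 3) * channelCount + ch];
    }
    for (; f < frames; f++) {
      out[f] = input[f * channelCount + ch];
    }
  }
}
```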
I updated and tested the TypeScript port of RingBuf, https://github.com/kyr0/microbe/blob/main/src/ringbuf/ringbuf.ts#L354, with all the loop unrolling changes, and it seems to work flawlessly.
I think we don't run enough iterations for the right JIT tier to kick in here, and using means isn't statistically sound, because we haven't looked at the data distribution. https://paul.cx/public/bench-unroll.html runs a lot more workloads and plots histograms; please feel free to adapt it. We see some slight bimodality in the distribution, showing that the mean isn't what we want here.

In my own testing, I see Firefox always faster than Chrome with unroll by 16, on both Linux x86_64 and macOS aarch64, and Chrome always faster than Firefox without unrolling. This is on an AMD Ryzen Threadripper PRO 7975WX, comparing Firefox Nightly and Chrome Dev. I see Safari twice as fast as Firefox on macOS aarch64, and Firefox there is already 20% faster than Chrome; this is on an M1 Max, with Firefox Nightly, Safari Technology Preview, and Chrome Canary. But all in all, unrolling by 16 seems to be the sweet spot.
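Since the distributions are bimodal, here is a sketch of how one might summarize per-iteration timings without relying on the mean (a generic helper, not part of the benchmark page): collect one sample per run and report order statistics such as the median and a high percentile.

```js
// Summarize timing samples (in ms) with robust order statistics.
function summarize(samplesMs) {
  const s = Array.from(samplesMs).sort((a, b) => a - b);
  const pct = (p) => s[Math.min(s.length - 1, Math.floor(p * (s.length - 1)))];
  return { n: s.length, min: s[0], median: pct(0.5), p90: pct(0.9), max: s[s.length - 1] };
}

// Usage: time each iteration individually instead of averaging one long run.
const samples = [];
for (let i = 0; i < 1000; i++) {
  const t0 = performance.now();
  // ... run the copy variant under test here ...
  samples.push(performance.now() - t0);
}
console.log(summarize(samples));
```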
That's a good point. Of course, when the test runs for a prolonged amount of time, I'd also expect the optimizer to pick better strategies. But the code of ... Only when I put exactly that code to the test did the difference in the algorithms of ... show up. I'm also fine with just assuming that unroll by 16 might be the best strategy and settling for it in my PR, but I think it would be interesting to know exactly what the underlying cause of this is, so that we can build a better mental model.

Besides the tests not being exactly statistically sound (for example, every time I reload https://paul.cx/public/bench-unroll.html I get completely different results; I guess we need more iterations and to take the mean across whole test runs, which is what I did in https://github.com/jsheaven/perf), I currently suspect that even if the optimizer could find better strategies over many invocations, only so many optimizations are available for each very specific implementation. For some reason it seems the best optimizations are often found with unroll by 4 only for ... because in my tests, neither the number of samples nor the number of invocations made a difference for ...

I guess one way to find out would be to copy my test code that exercises the specific algorithms into your test code to verify; I'll probably be able to do that tonight. I like the way you use Plotly there. By the way, I'm just always very curious about getting the mental model exactly right in my head, even if it means doing extensive research. I've found that this can save me from future pain.
For a good JIT tier to be reached, the code has to run for a certain amount of time. The good code is kept even if something else executes in the meantime. If you want to understand what is going on, just profile your Chrome instance, for example, and load the result in Perfetto. Zoom in during the compilation stage and there will be markers saying what is happening. It looks like this:
Nice!! Thank you! That is insightful! If I'm reading the graph correctly, ... [profiler screenshots] I also recorded for a shorter time (15 seconds) and it likewise shows only one optimization pass; there seem to be no further re-optimizations down the road. But as I've only known these tools for a few minutes, it could be that I misinterpreted the graph. Also, I copied your code and locally checked with the ... While implementing this, I found that:
Okay, here are the results (Apple MacBook Air M3, arm64): Of course, with every reload the stats look a little different, but in 95%+ of my tests unroll by 4 won in all engines for ... For ...
Double-checking on a Linux machine, Intel Core i5-13500, x86-64: Just to clarify, this is only for ... How do we want to proceed? I could just PR my TypeScript code, upgrading the implementation in this repo 1:1 to TypeScript, line by line, plus the optimizations (and reading the latest graphs, I'd suggest 16 for ...).
Yeah, just send the PR over. As long as the interface doesn't change, downstream consumers aren't affected, and they can still get a ... Thanks for all the work, in any case.
@padenot You're welcome! I'm happy that you've built this wonderful library, which I can build my whole DAW on! And it's also really cool that I could learn a few more things in our conversation here :)
This is now live in https://www.npmjs.com/package/ringbuf.js/v/0.4.0!
Thanks kyr0 and padenot!
TypedArray.set is much faster, even on small arrays.
Code in question:
ringbuf.js/js/ringbuf.js
Lines 346 to 350 in 016d963
Benchmark:

input is 10,000 F32, and subInput is 100 F32.
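A rough reconstruction of the kind of micro-benchmark described above (the original harness from the screenshot isn't available, so the loop counts and setup here are assumptions): copy a 100-element Float32Array into a 10,000-element one using set() versus an index loop.

```js
// Hypothetical micro-benchmark sketch; timings will vary per engine.
const input = new Float32Array(10_000);
const subInput = new Float32Array(100).map(() => Math.random());

function copyWithSet()  { input.set(subInput, 0); }
function copyWithLoop() { for (let i = 0; i < subInput.length; i++) input[i] = subInput[i]; }

for (const [name, fn] of [["set", copyWithSet], ["loop", copyWithLoop]]) {
  const t0 = performance.now();
  for (let i = 0; i < 100000; i++) fn();
  console.log(`${name}: ${(performance.now() - t0).toFixed(1)} ms for 100k copies`);
}
```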