Coroutines vs setjmp #5
-
I wanted to first thank you for making this great work freely available. I have a few questions about the performance of coroutines vs setjmp. Pablo Halpern mentioned in his talk that continuation-stealing thread pools are usually implemented with setjmp. If I understand correctly, you opted for coroutines because they intrinsically do the same thing. My question is about performance: is the coroutine variant faster than setjmp? My gut feeling is that coroutines simply have too much overhead at the moment and setjmp will be faster, but you have probably tested it. EDIT: Looking into coroutines a bit more, it makes sense to prefer them over setjmp, as you elegantly avoid the cactus-stack problem. Actually, I think you are on to something: in fork-join parallelism the lifetime of a forked coroutine is limited to the scope of the calling coroutine, so a sufficiently smart compiler could avoid dynamic allocation and place the coroutine activation frame on the stack. I think your solution is one of the few really reasonable use cases for coroutines.
Replies: 2 comments
-
Hi there! Thanks for your thoughts. I have not actually explored setjmp; it looks like it could be faster, but it may cause problems when working with objects that have non-trivial destructors. My benchmarking at the moment suggests coroutines currently have an overhead somewhere between 8 and 15 times that of a regular function call, and they often prevent inlining. Currently I am avoiding the cactus stack with a lot of dynamic allocation; I am working to reduce these allocations, as compilers are not very good at eliding them at the moment.
-
Your info about setjmp is correct: it does not perform stack unwinding, so you would need to do that manually. One of the benefits of fork-join parallelism, however, is that coroutines should eventually be inlinable, because destruction happens within the same function scope in which they are created (forked). My tests show that at least GCC is not yet smart enough to notice this, even when I use RAII to destroy my coroutine object unconditionally in the destructor. If the compiler can guarantee that a coroutine never outlives its enclosing scope, it should be able to allocate the coroutine frame (not the activation frame) on the stack, making coroutines faster by a huge margin; so it may just be a matter of leaning back and waiting for compiler implementers to catch up. If my insight is correct, you do not actually need to bother with a cactus stack: the compiler's coroutine transformation allocates the fixed-size coroutine frame on the heap, while the activation frame lives on the stack of whichever thread is running the continuation. This effectively solves the cactus-stack problem in the nicest way possible.