Skip to content

Audit asyncio thread safety #128002

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
9 tasks done
kumaraditya303 opened this issue Dec 16, 2024 · 12 comments
Open
9 tasks done

Audit asyncio thread safety #128002

kumaraditya303 opened this issue Dec 16, 2024 · 12 comments

Comments

@kumaraditya303
Copy link
Contributor

kumaraditya303 commented Dec 16, 2024

The following functions needs to be made thread safe and atomic for free-threading for both pure python and C implementation:

  • asyncio._enter_task
  • asyncio._leave_task
  • asyncio._register_task
  • asyncio._unregister_task
  • asyncio._swap_current_task
  • asyncio.current_task
  • asyncio.all_tasks

Note that some of these were made thread safe in C impl #120974

The following classes needs to be thread safe for free-threading in C implementation:

  • asyncio.Task
  • asyncio.Future

Both of these classes are documented to be not thread safe but currently calling methods on these classes from multiple threads can crash the interpreter. The pure python implementation cannot crash the interpreter when called from multiple threads so changes are only needed for C implementation. Before making these thread safe in C I would gather some numbers for how much a difference the C implementation makes in free-threading and if it isn't much we can just disable the C extension for free-threading.

cc @colesbury

Linked PRs

@colesbury
Copy link
Contributor

You may be already planning on this, but I think it'd be helpful to have a summary of the current thread safety issues in these classes and functions.

@kumaraditya303
Copy link
Contributor Author

kumaraditya303 commented Dec 18, 2024

I have gathered data of performance of pure python implementation and C implementation.

Without free-threading

Benchmark C implementation Pure Python implementation
async_tree_cpu_io_mixed_tg 478 ms 746 ms: 1.56x slower
async_tree_io_tg 620 ms 1.00 sec: 1.61x slower
async_tree_cpu_io_mixed 493 ms 796 ms: 1.62x slower
async_tree_io 631 ms 1.03 sec: 1.63x slower
async_tree_memoization_tg 314 ms 580 ms: 1.85x slower
async_tree_memoization 338 ms 632 ms: 1.87x slower
async_tree_none 272 ms 522 ms: 1.92x slower
async_tree_none_tg 252 ms 489 ms: 1.94x slower
Geometric mean (ref) 1.50x slower

With free-threading

Benchmark C implementation Pure Python implementation
async_tree_cpu_io_mixed_tg 580 ms 951 ms: 1.64x slower
async_tree_cpu_io_mixed 627 ms 1.03 sec: 1.65x slower
async_tree_io_tg 765 ms 1.28 sec: 1.67x slower
async_tree_io 810 ms 1.37 sec: 1.69x slower
async_tree_memoization_tg 433 ms 805 ms: 1.86x slower
async_tree_memoization 472 ms 884 ms: 1.87x slower
async_tree_none 372 ms 733 ms: 1.97x slower
async_tree_none_tg 333 ms 679 ms: 2.04x slower
Geometric mean (ref) 1.53x slower

From this data, it looks like there's on average slowdown of 1.50x when disabling the C implementation as such it is too large of performance loss to consider dropping the C implementation even in case of free-threading.
I was hoping that the difference would be less so we could disable the C implementation for free-threading and reduce the changes but it isn't worth the perf loss.

Some thread safety Issues in C implementation

Here's an outline of some of the thread safety issues, I'll be adding more to this list.

  • Critical sections needs to be added methods of asyncio.Task and asyncio.Future otherwise they can crash the interpreter in C implementation. Even though tasks are documented to be not thread safe but still it shouldn't crash the interpreter. Example: tasks could be concurrently cancelled while it's stepping through the coroutine in another thread. The cancellation counter isn't atomic so it could lead to possible loss of cancellations requests.

    _asyncio_Task_uncancel_impl(TaskObj *self)
    /*[clinic end generated code: output=58184d236a817d3c input=68f81a4b90b46be2]*/
    /*[clinic end generated code]*/
    {
    if (self->task_num_cancels_requested > 0) {
    self->task_num_cancels_requested -= 1;
    if (self->task_num_cancels_requested == 0) {
    self->task_must_cancel = 0;
    }
    }
    return PyLong_FromLong(self->task_num_cancels_requested);

  • The eager_tasks set could be concurrently mutated while it's being iterated over in another thread in asyncio.all_tasks.

    // First add eager tasks to the set so that we don't miss
    // any tasks which graduates from eager to non-eager
    PyObject *eager_iter = PyObject_GetIter(state->eager_tasks);

  • asyncio.Handle needs to be thread safe because it can be called from any thread with loop.call_soon_threadsafe.

  • The internal asyncio linked list uses borrowed references to task as such can be concurrently deallocated while list is being read in another thread as deallocation doesn't hold the state lock.

Performance bottlenecks in free-threading

Consider an async server which uses multiple threads to run one independent loop per thread.

  • asyncio currently relies on globals dicts and sets like current_tasks, eager_tasks, non_asyncio_tasks etc which would perform poorly with free-threading. Moving these to per thread would help in performance scaling.
  • Use non python data structures, this was an overall performance win when I implemented double linked implementation because we can it avoids a lot of reference counting at expense of some locking which could possibly be further optimized to compare exchanges.
  • maybe defer reference counting on the loop objects which are the source of most reference counting operations.

@itamaro
Copy link
Contributor

itamaro commented Dec 18, 2024

  • asyncio currently relies on globals dicts and sets like current_tasks, eager_tasks, non_asyncio_tasks etc which would perform poorly with free-threading. Moving these to per thread would help in performance scaling.

I vaguely remember past discussions about moving the global state to live on the loop. I don't remember what was the conclusion (or if there was a conclusion), but if we do that, that would resolve most thread-safety issues, wouldn't it?

@kumaraditya303
Copy link
Contributor Author

kumaraditya303 commented Dec 18, 2024

I vaguely remember past discussions about moving the global state to live on the loop. I don't remember what was the conclusion (or if there was a conclusion), but if we do that, that would resolve most thread-safety issues, wouldn't it?

That has backwards compatibility issues with custom event loops, currently asyncio handles it for all loops. Moving things to event loop may improve performance for free-threading but will causes performance loss in normal builds and in cases where just a single loop runs because of extra attribute lookups and indirection and we cannot use non-python data structures like the new double linked-lists implementation.

kumaraditya303 added a commit that referenced this issue Jan 4, 2025
* Makes `_asyncio.Task` and `_asyncio.Future` thread-safe by adding critical sections
* Add assertions to check for thread safety checking locking of object by critical sections in internal functions
* Make `_asyncio.all_tasks` thread safe when eager tasks are used
* Add a thread safety test
WolframAlph pushed a commit to WolframAlph/cpython that referenced this issue Jan 4, 2025
)

* Makes `_asyncio.Task` and `_asyncio.Future` thread-safe by adding critical sections
* Add assertions to check for thread safety checking locking of object by critical sections in internal functions
* Make `_asyncio.all_tasks` thread safe when eager tasks are used
* Add a thread safety test
srinivasreddy pushed a commit to srinivasreddy/cpython that referenced this issue Jan 8, 2025
)

* Makes `_asyncio.Task` and `_asyncio.Future` thread-safe by adding critical sections
* Add assertions to check for thread safety checking locking of object by critical sections in internal functions
* Make `_asyncio.all_tasks` thread safe when eager tasks are used
* Add a thread safety test
@colesbury
Copy link
Contributor

@kumaraditya303 - I think we're missing a _PyObject_SetMaybeWeakref on Task objects. Without it, the _Py_TryIncref may fail in non-owning threads even when the total reference count is not zero. We can either call it in register_task or when the task object is created.

@kumaraditya303
Copy link
Contributor Author

I think we're missing a _PyObject_SetMaybeWeakref on Task objects. Without it, the _Py_TryIncref may fail in non-owning threads even when the total reference count is not zero.

So I think that would be the case when asyncio.all_tasks is called for a loop running in a different thread right?

We can either call it in register_task or when the task object is created.

I have a side question: judging from the function name _PyObject_SetMaybeWeakref shouldn't it be called for all objects which are weakrefable automatically. Are there reasons why this isn't done?

@colesbury
Copy link
Contributor

So I think that would be the case when asyncio.all_tasks is called for a loop running in a different thread right?

Yes. In that case, I think asyncio.all_tasks may currently miss some tasks.

judging from the function name _PyObject_SetMaybeWeakref shouldn't it be called for all objects which are weakrefable automatically

It's called on objects when weakrefs are actually created. Most objects that are weakrefable do not actually have any weakrefs. _PyObject_SetMaybeWeakref forces a slightly slower deallocation path. The extra overhead is not much, but it would add up to a few percent if it were enabled on every object.

@kumaraditya303
Copy link
Contributor Author

It's called on objects when weakrefs are actually created. Most objects that are weakrefable do not actually have any weakrefs. _PyObject_SetMaybeWeakref forces a slightly slower deallocation path. The extra overhead is not much, but it would add up to a few percent if it were enabled on every object.

That makes sense, created #128885 to fix this.

ambv added a commit that referenced this issue Feb 6, 2025
srinivasreddy pushed a commit to srinivasreddy/cpython that referenced this issue Feb 7, 2025
cmaloney pushed a commit to cmaloney/cpython that referenced this issue Feb 8, 2025
kumaraditya303 added a commit that referenced this issue Feb 10, 2025
@kumaraditya303
Copy link
Contributor Author

Currently, asyncio has two modes of task execution, the usual lazy (default) and eager execution.

lazy execution

With #128869, lazy tasks use per thread linked list of tasks so in the general case there should be no contention around registering and unregistering of tasks.

eager execution

Currently, asyncio stores all eager tasks in a global strong set of tasks 1 so when multiple threads are executing under eager execution, there would be contention around adding and removing tasks from this set.
Note that I had implemented linked list implementation only for lazy tasks in #107803.
There have been discussion of making eager execution the default in future https://discuss.python.org/t/make-asyncio-eager-task-factory-default/75164 so this would become more important.

There are a few ways to improve performance under free-threading when eager execution I can think of:

  • Use a per-thread strong set of tasks, this would eliminate contention of multiple threads, however in free-threading sets always acquire its lock even when it's not shared across threads so there might be still some perf left behind.
  • Implement the same linked list implementation for native tasks under eager execution, this would give best performance as we can eliminate using set altogether so there would be some memory savings too and for rare third party tasks which don't inherit from native, we will have fallback set as it is now.
    • Add asyncio_eager_tasks_head to the thread state.
    • When a task initially starts executing in eager mode, add it to asyncio_eager_tasks_head. 2
    • After stepping through the task, if it eagerly completes, remove it from asyncio_eager_tasks_head and return after some book-keeping. 3
    • If the task is not done, remove that from asyncio_eager_tasks_head and add that to asyncio_tasks_head and continue lazy execution. From here lazy execution takes over and that is already made performant.
    • In lazy execution there were issues if the task survives the lifetime of thread, we needed another asyncio_tasks_head on the interp state to add such tasks, however since eager tasks are executed synchronously there should not be any such issue as any lingering task which doesn't end up completing eagerly will be added to asyncio_tasks_head anyways.

I am leaning towards the second approach as it will be most performant and lock free in the general case but I would like to hear from @itamaro as I think Instagram currently uses eager execution by default so they would know more.

Also @colesbury for any alternatives for better performance in free-threading that I might have missed.

Footnotes

  1. https://github.com/python/cpython/blob/798f8d3ea9f54ee80a684bff74f67ee7d99e35c2/Modules/_asynciomodule.c#L2194

  2. https://github.com/python/cpython/blob/798f8d3ea9f54ee80a684bff74f67ee7d99e35c2/Modules/_asynciomodule.c#L3574

  3. https://github.com/python/cpython/blob/798f8d3ea9f54ee80a684bff74f67ee7d99e35c2/Modules/_asynciomodule.c#L3607

@kumaraditya303
Copy link
Contributor Author

The second approach would also make it much easier for external introspection to inspect eager tasks

cc @ambv @pablogsal

@itamaro
Copy link
Contributor

itamaro commented Feb 16, 2025

The second approach sounds reasonable to me!

In cinder 3.10, I think all native tasks (eager and lazy) are stored in one linked list.

@kumaraditya303
Copy link
Contributor Author

I created #130518 to store eager tasks in the same linked list as lazy tasks, it is much simpler and less changes are required.

seehwan pushed a commit to seehwan/cpython that referenced this issue Apr 16, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Todo
Development

No branches or pull requests

3 participants