Skip to content

Adding an internal C-API for dynamic arrays #125543

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
ZeroIntensity opened this issue Oct 15, 2024 · 16 comments
Closed

Adding an internal C-API for dynamic arrays #125543

ZeroIntensity opened this issue Oct 15, 2024 · 16 comments
Assignees
Labels
extension-modules C modules in the Modules dir topic-C-API type-feature A feature request or enhancement

Comments

@ZeroIntensity
Copy link
Member

ZeroIntensity commented Oct 15, 2024

Feature or enhancement

Proposal:

Per the leak introduced in #124776, it would be nice if we had an internal structure for dynamically allocated arrays, similar to how we have _Py_hashtable_t.

cc @Eclips4 for bikeshedding the name and API (_Py_dynarray_t right now)

Has this already been discussed elsewhere?

No response given

Links to previous discussion of this feature:

Discussed on discord

Linked PRs

@ZeroIntensity ZeroIntensity added type-feature A feature request or enhancement extension-modules C modules in the Modules dir topic-C-API labels Oct 15, 2024
@ZeroIntensity ZeroIntensity self-assigned this Oct 15, 2024
@Eclips4 Eclips4 changed the title Adding an internal API for dynamic arrays Adding an internal C-API for dynamic arrays Oct 15, 2024
@Eclips4
Copy link
Member

Eclips4 commented Oct 15, 2024

Note for others: the general idea is to have a structure like a Python list that does not affect the reference counter state._Py_hashtable_t is an example of an analogue of Python dict that does not affect reference counter.

@colesbury
Copy link
Contributor

How would this API address the leak in #124776?

@ZeroIntensity
Copy link
Member Author

From my understanding, (part of) the leak in that PR was caused by incorrectly setting up a dynamic array--that could have been prevented with a proper internal API for it.

@vstinner
Copy link
Member

I'm not convinced that we need such API. Do you have more examples of code which would benefit of such API?

@ZeroIntensity
Copy link
Member Author

There are a number of array-like structures throughout the interpreter, namely on the interpreter and thread state off the top of my head. It would be nice if we didn't have to reinvent the wheel every time we wanted to do that :)

The idea here is to get a nice unified interface in place, and then migrate some (but not all!) of the internal dynamic arrays to _PyDynArray, which should make maintaining easier.

@vstinner
Copy link
Member

There are a number of array-like structures throughout the interpreter, namely on the interpreter and thread state off the top of my head.

Do you have concrete examples? Can you point me to the code?

@ZeroIntensity
Copy link
Member Author

Some that I found: _PyErr_StackItem, _PyStackChunk, PyInterpreterState.pythreads, PyInterpreterState.executor_list_head, PyInterpreterState.mem_free_queue.head, _PyRuntime.pyinterpreters. Granted, most of these are linked lists rather than C arrays, but I'm guessing that's just because they're much easier to implement.

My issue with this approach of creating a new linked list every time we need a new array internally is that a). linked lists have lower performance when used in this way (e.g. indexing is O(n) rather than O(1), and you generally have to make an allocation for every addition) and b). without a unified interface, we have to deal with fixing things on a case-by-case basis (for example, most of the structures that I linked had to each get their own mutex field for free threading--wouldn't it have been better to just do that once?)

@colesbury
Copy link
Contributor

This is still not the level of explanation and concrete motivating examples that I'd like to see before adding a new abstraction. It's not clear to me that any of these fields are better suited to a generic dynamic array API than their currently implementations. _PyStackChunk and PyInterpreterState.mem_free_queue.head in particular are linked lists of fixed-size chunks.

linked lists have lower performance when used in this way (e.g. indexing is O(n) rather than O(1), and you generally have to make an allocation for every addition)

Most of these examples do not require indexing arbitrary elements. They also do not require an allocation for every addition. The linked list pointers are embedded in the structures.

I don't think a generic _Py_dynarray_t is a good fit even for the fields where we may want to use an array in the future, like PyInterpreterState.pythreads and possibly _PyRuntime.pyinterpreters. A major motivation would be to avoid require locking while iterating over them, but that requires special handling of allocation and resizing, which isn't a good fit for a generic dynamic array API. Additionally, we probably don't want terminating a thread to require an O(N) _PyDynArray_Remove call.

@ZeroIntensity
Copy link
Member Author

Yeah, the ones that I found certainly shouldn't all be dynamic arrays--I was just making a point that we have a lot of array-like structures internally :)

Agreeing with most of what you said--I don't really want to target this toward the interpreter core (though, I'm sure there are cases somewhere). The main beneficiaries here would be stdlib extensions. For example, asyncio has a linked list for futures, which causes complicated changes like this to exist for improvements.

@kumaraditya303
Copy link
Contributor

For example, asyncio has a linked list for futures, which causes #121264 to exist for improvements.

How would a dynamic array help here? asyncio uses linked list and there are good reasons for it, it avoids allocations, has fast addition and deletions and doesn't need traverse the list while adding duplicate entries, just check the pointers.

@ZeroIntensity
Copy link
Member Author

it avoids allocations, has fast addition and deletions and doesn't need traverse the list while adding duplicate entries, just check the pointers.

So does a dynamic array :)

The main upside in regards to that PR is that the change would be much simpler, really.

@kumaraditya303
Copy link
Contributor

So does a dynamic array :)

How? doesn't it need to resize the array? also how can you check for duplicate entries without traversing the array?

@colesbury
Copy link
Contributor

Yeah, the ones that I found certainly shouldn't all be dynamic arrays--I was just making a point that we have a lot of array-like structures internally

I think Victor and I are still looking for concrete examples where using the proposed API would be an improvement. Victor's original question was "Do you have more examples of code which would benefit of such API?" (and I am also looking for an answer to that same question).

By concrete examples, I mean:

  • Specific fields and code that would definitely (in your opinion) benefit from this code change. Don't include a mix of examples that may or may not benefit.
  • For each of the examples, describe how we currently use those fields and why the the dynamic array API would be better.

At this point, if you have a prototype of the dynamic array API, you should actually prototype the code changes for one or two of the fields to show how the usage would change.

One or two good examples are much better than six vague examples.

@ZeroIntensity
Copy link
Member Author

How? doesn't it need to resize the array? also how can you check for duplicate entries without traversing the array?

It does, but not often. The more it needs to get resized, the less it will need to. I'm not exactly certain how asyncio is checking duplicate entires in the linked list without traversing, but I imagine if it's just checking prev and next, the same can be done by just indexing.

I think Victor and I are still looking for concrete examples where using the proposed API would be an improvement.

I don't know. My impression from when I originally proposed the idea was that people had their own places where it would benefit, but I don't know where those places are. There's enough contention here that I don't think the change is worth pursuing, so I'm going to close this.

@ZeroIntensity ZeroIntensity closed this as not planned Won't fix, can't repro, duplicate, stale Oct 20, 2024
@vstinner
Copy link
Member

I don't think that replacing a linked list with an array just to reuse code is a good rationale. I'm quite sure that if a linked list is used, there is a reason for that :-)

@ZeroIntensity
Copy link
Member Author

In some cases, yeah, but C arrays are hard--I'm almost certain that linked lists have been retrofitted in some places simply because they're easier. Let's keep this idea in the back of our minds in case more uses come up :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
extension-modules C modules in the Modules dir topic-C-API type-feature A feature request or enhancement
Projects
None yet
Development

No branches or pull requests

5 participants