Support for numpy.ndarray and pandas.Series with any python object as entry #4444

Open
wants to merge 14 commits into
base: master
from

Conversation

philastrophist

This change adds support for generating numpy.ndarray and pandas.Series with any Python object as an element.
Effectively, Hypothesis can now generate np.array([MyObject()], dtype=object).
The first use case for this is with pandas and Pandera, where it is possible, and sometimes required, to have columns which themselves contain structured datatypes.
Pandera seems to be waiting for this change to support PythonDict, PythonTypedDict, PythonNamedTuple etc.

  • Accept dtype.kind = 'O' in from_dtype
  • Add the base case of any type
  • Use .iat instead of .iloc to set values in pandas strategies (this allows setting of dictionaries as elements etc)
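For readers unfamiliar with object-dtype containers, here is a minimal sketch of the kind of value the PR aims to generate. It uses only plain NumPy and pandas, not the new strategies; `MyObject` is a hypothetical class used purely for illustration.

```python
import numpy as np
import pandas as pd


class MyObject:
    """Hypothetical user-defined class, used only for illustration."""

    def __init__(self, value):
        self.value = value


# An object-dtype array stores references to arbitrary Python objects.
arr = np.array([MyObject(1), MyObject(2)], dtype=object)

# An object-dtype Series can likewise mix dicts and custom instances.
series = pd.Series([{"a": 1}, MyObject(3)], dtype=object)

print(arr.dtype.kind)  # the one-character kind code that from_dtype inspects
print(series.dtype)
```

The `dtype.kind` value printed for `arr` is the `'O'` code that the first bullet refers to.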

@Zac-HD Zac-HD requested a review from tybug June 26, 2025 02:08
Shaun Read added 3 commits July 2, 2025 14:46
@philastrophist
Author

Some form of timeout error in CI

@Zac-HD
Member

Zac-HD commented Jul 3, 2025

@tybug FAILED hypothesis-python/tests/watchdog/test_database.py::test_database_listener_directory_move - Exception: timing out after waiting 1s for condition lambda: set(events) on Windows CI

(I've hit retry, should be OK soon 🤞)

Member

@Zac-HD Zac-HD left a comment

Thanks so much for your PR, Shaun!

This is looking good, and I'm excited to ship it soon! Small comments below about testing and code-comments; and I can always push something to the changelog when I work out what I wanted for that.

Comment on lines 217 to 227
- raise InvalidArgument(f"No strategy inference for {dtype}")
+ raise InvalidArgument(f"No strategy inference for {dtype}") # pragma: no cover
Member

I think we can still hit this with void-dtype; can we add a covering test? (e.g. replacing the deleted test case?)
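As background for the void-dtype suggestion, the sketch below shows how NumPy reports the `kind` code for object and unstructured void dtypes. It does not call Hypothesis; whether `from_dtype` actually reaches the `InvalidArgument` branch for a void dtype is the reviewer's expectation, not something this snippet verifies. The choice of `"V8"` is an assumption for illustration.

```python
import numpy as np

object_dtype = np.dtype(object)
void_dtype = np.dtype("V8")  # unstructured 8-byte void dtype (illustrative choice)

# dtype.kind is a one-character code: 'O' for object, 'V' for void.
for dt in (object_dtype, void_dtype):
    print(repr(dt), "kind:", dt.kind, "fields:", dt.names)
```

An unstructured void dtype has no named fields, which distinguishes it from a structured dtype that also reports kind `'V'`.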

Comment on lines 641 to 643
- data[c.name].iloc[i] = value
+ data[c.name].iat[i] = value # noqa: PD009
Member

This makes sense in the context of this PR, but it'd be great to write a one-or-two-sentence comment above the line explaining why we use .iat over .iloc here, for the benefit of future maintainers.

@@ -40,7 +40,6 @@ def e(a, **kwargs):
 e(nps.array_shapes, min_dims=33),
 e(nps.array_shapes, max_dims=33),
 e(nps.arrays, dtype=float, shape=(0.5,)),
-e(nps.arrays, dtype=object, shape=1),
Member

consider a void-dtype test case here to maintain coverage?

Comment on lines 50 to 52
pdst.series(elements=st.just(anything), dtype=object).filter(
lambda x: not x.empty
)
Member

Rather than a filter, I think we could pass index=range_indexes(min_size=1), right?
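A minimal sketch of the suggested approach: constrain the index so that an empty series is never generated, instead of filtering empties out afterwards. To stay independent of this PR's object-dtype support, the sketch uses `dtype=int`; the function name `check_never_empty` and the settings values are assumptions for illustration.

```python
from hypothesis import given, settings
from hypothesis.extra import pandas as pdst


@settings(max_examples=25, deadline=None)
@given(pdst.series(dtype=int, index=pdst.range_indexes(min_size=1)))
def check_never_empty(series):
    # The index has at least one entry, so no .filter(...) is needed.
    assert not series.empty


check_never_empty()
```

Constraining the index avoids discarding generated examples, which a filter would do every time an empty series is drawn.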

Comment on lines 3 to 11
This version adds support for generating numpy.ndarray and pandas.Series with any python object as an element.
Effectively, hypothesis can now generate ``np.array([MyObject()], dtype=object)``.
The first use-case for this is with Pandas and Pandera where it is possible and sometimes required to have columns which themselves contain structured datatypes.
Pandera seems to be waiting for this change to support ``PythonDict, PythonTypedDict, PythonNamedTuple`` etc.

---

- Accept ``dtype.kind = 'O'`` in ``from_dtype``
- Use ``.iat`` instead of ``.iloc`` to set values in pandas strategies
Member

(apologies for this comment, it's late at night & I don't really know what I want to do instead, but thought it better to send a review now than wait until later)

I'd like to rework this note, to focus more tightly on the specific changes - as prose, not dot-points - and then afterwards note why this is valuable, with pandera only mentioned as one possible case for structured data within a pandas series. I'd also include cross-references to each class you mention, and (optional but encouraged) a thank-you note to yourself at the end of the changelog ("Thanks to Shaun Read for identifying and fixing these issues!" or similar).

Author

Great, ok I've reworded the release notes and implemented all the suggestions

@philastrophist
Author

Some interesting error is occurring outside of the changes in this PR...

@philastrophist philastrophist requested a review from Zac-HD July 3, 2025 09:16
@tybug
Member

tybug commented Jul 3, 2025

sorry for dropping the requested review here, I'd want to be confident I understand the pandas interactions first and I don't have that requisite knowledge at the moment 😅

That failure might be a real crosshair failure, but I'm not sure it's worth pursuing with such a non-reproducer.

@philastrophist
Author

sorry for dropping the requested review here, I'd want to be confident I understand the pandas interactions first and I don't have that requisite knowledge at the moment 😅

As far as I understand, at and iat are more basic indexers than loc and iloc, in that they can only access a single entry rather than possibly a subset of entries.
But ignoring vector access here, loc will transform a dict into a series and then set it. There's an interesting note in the pandas source here:

# TODO(EA): ExtensionBlock.setitem this causes issues with
# setting for extensionarrays that store dicts. Need to decide
# if it's worth supporting that.

Seems to be vaguely related.

But the important points are:

  1. loc transforms the given values, which stops us from inserting dicts into a series using iloc/loc. This may or may not be a bug. Either way, editing this logic within pandas is likely to be fraught, and it's difficult to tell what other transforms might be applied.
  2. at is the intended way to set single values within a dataframe/series according to the docs. It's technically faster, but more importantly it doesn't perform any checks or transformations on the value, so the logic is a lot simpler. The reason ruff warns against it is that "iloc is more idiomatic and versatile". We know that, in our use case, we will only ever be setting a series element by integer index, which is what iat is for.

From the docstrings:

DataFrame.iat : Access a single value for a row/column label pair by integer position(s).
DataFrame.iloc : Access a group of rows and columns by integer position(s).
Similar to ``iloc``, in that both provide integer-based lookups. Use
    ``iat`` if you only need to get or set a single value in a DataFrame
    or Series.

Demonstration:

import pandas as pd

s = pd.Series([1, 2, 3], dtype=object)  # object dtype so we don't get mismatch warnings

s.iloc[0] = {'a': 1}
print('series with iloc:\n', s)
print('entry type with iloc:', type(s.iloc[0]))

s.iat[0] = {'a': 1}
print('with iat:\n', s)
print('entry type with iat:', type(s.iat[0]))

prints out:

series with iloc:
 0    a    1
dtype: int64
1                      2
2                      3
dtype: object
entry type with iloc: <class 'pandas.core.series.Series'>
with iat:
 0    {'a': 1}
1           2
2           3
dtype: object
entry type with iat: <class 'dict'>
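The same idea can be sketched for the column-filling pattern touched by this PR. This is a loose illustration under stated assumptions, not the actual Hypothesis implementation: the helper `fill_columns` and its arguments are hypothetical, and only the `.iat` path is exercised.

```python
import pandas as pd


def fill_columns(rows, column_names):
    """Build a DataFrame cell by cell, storing each value unchanged."""
    n = len(rows)
    # One object-dtype Series per column, pre-filled with None.
    data = {name: pd.Series([None] * n, dtype=object) for name in column_names}
    for i, row in enumerate(rows):
        for name in column_names:
            # .iat sets exactly one cell by integer position.
            data[name].iat[i] = row[name]
    return pd.DataFrame(data)


rows = [{"meta": {"a": 1}, "tag": "x"}, {"meta": {"b": 2}, "tag": "y"}]
df = fill_columns(rows, ["meta", "tag"])
print(type(df["meta"].iat[0]))
```

Because each assignment targets a single cell, the dict values end up stored as dicts in the resulting column.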
