Support for numpy.ndarray and pandas.Series with any python object as entry #4444

philastrophist · 2025-06-20T12:57:02Z

This change adds support for generating numpy.ndarray and pandas.Series with arbitrary Python objects as elements.
In effect, Hypothesis can now generate np.array([MyObject()], dtype=object).
The first use case is Pandas with Pandera, where columns that themselves contain structured data types are possible and sometimes required.
Pandera seems to be waiting for this change to support PythonDict, PythonTypedDict, PythonNamedTuple, etc.
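
For context, a minimal NumPy-only sketch of the kind of array described above. `MyObject` is a hypothetical placeholder class, and nothing here calls Hypothesis itself; it only shows that such arrays report `dtype.kind == 'O'`, which is the kind `from_dtype` is changed to accept.

```python
import numpy as np


class MyObject:
    """Hypothetical placeholder for any user-defined Python object."""


# Fill an object array slot by slot so NumPy never tries to unpack the values.
arr = np.empty(3, dtype=object)
arr[0] = MyObject()
arr[1] = {"a": 1}
arr[2] = None

print(arr.dtype.kind)  # 'O'
print([type(x).__name__ for x in arr])
```

Assigning into a pre-allocated array is a deliberate choice here: it avoids NumPy interpreting sequence-like elements as extra array dimensions.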

- Accept dtype.kind = 'O' in from_dtype
- Add the base case of any type
- Use `.iat` instead of `.iloc` to set values in pandas strategies (this allows setting dictionaries as elements, etc.)

…rage since we now actually cover all types and this line is now not covered

philastrophist · 2025-07-02T15:39:16Z

Some form of timeout error in CI

Zac-HD · 2025-07-03T04:34:58Z

@tybug FAILED hypothesis-python/tests/watchdog/test_database.py::test_database_listener_directory_move - Exception: timing out after waiting 1s for condition lambda: set(events) on Windows CI

(I've hit retry, should be OK soon 🤞)

Zac-HD

Thanks so much for your PR, Shaun!

This is looking good, and I'm excited to ship it soon! Small comments below about testing and code-comments; and I can always push something to the changelog when I work out what I wanted for that.

Zac-HD · 2025-07-03T04:38:42Z

hypothesis-python/src/hypothesis/extra/numpy.py

-        raise InvalidArgument(f"No strategy inference for {dtype}")
+        raise InvalidArgument(f"No strategy inference for {dtype}")  # pragma: no cover


I think we can still hit this with void-dtype; can we add a covering test? (e.g. replacing the deleted test case?)
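
As a point of reference for the void-dtype suggestion, NumPy reports kind `'V'` for void dtypes, which is distinct from the `'O'` kind this PR starts accepting. The sketch below is only an illustration: `classify` is a hypothetical helper that imitates a kind-based dispatch with a fallthrough error branch, and it raises `ValueError` rather than Hypothesis's `InvalidArgument`. It is not the actual `from_dtype` implementation.

```python
import numpy as np


def classify(dtype):
    """Hypothetical kind-based dispatch with a fallthrough error branch."""
    dtype = np.dtype(dtype)
    handled = {"b": "boolean", "i": "signed integer", "f": "float", "O": "object"}
    if dtype.kind in handled:
        return handled[dtype.kind]
    # Any kind not listed above ends up here, including void ('V').
    raise ValueError(f"No strategy inference for {dtype}")


print(np.dtype("V8").kind)  # 'V'
print(classify(object))     # 'object'

try:
    classify("V8")
except ValueError as err:
    print("fallthrough reached:", err)
```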

Zac-HD · 2025-07-03T04:42:44Z

hypothesis-python/src/hypothesis/extra/pandas/impl.py

-                            data[c.name].iloc[i] = value
+                            data[c.name].iat[i] = value  # noqa: PD009


This makes sense in the context of this PR, but it'd be great to write a one-or-two-sentence comment above the line explaining why we use .iat over .iloc here, for the benefit of future maintainers.

Zac-HD · 2025-07-03T04:43:50Z

hypothesis-python/tests/numpy/test_argument_validation.py

@@ -40,7 +40,6 @@ def e(a, **kwargs):
        e(nps.array_shapes, min_dims=33),
        e(nps.array_shapes, max_dims=33),
        e(nps.arrays, dtype=float, shape=(0.5,)),
-        e(nps.arrays, dtype=object, shape=1),


consider a void-dtype test case here to maintain coverage?


Zac-HD · 2025-07-03T04:46:34Z

hypothesis-python/tests/pandas/test_series.py

+        pdst.series(elements=st.just(anything), dtype=object).filter(
+            lambda x: not x.empty
+        )


Rather than a filter, I think we could pass index=range_indexes(min_size=1), right?
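
A minimal sketch of that suggestion, assuming the existing `hypothesis.extra.pandas` API (`series` and `range_indexes`). It uses an integer dtype rather than `dtype=object` so that it does not depend on this PR; `check_never_empty` is a hypothetical test name.

```python
import hypothesis.extra.pandas as pdst
from hypothesis import given, settings, strategies as st

# Constrain the index size up front instead of filtering out empty series.
non_empty_series = pdst.series(
    elements=st.integers(min_value=0, max_value=10),
    dtype=int,
    index=pdst.range_indexes(min_size=1),
)


@settings(max_examples=25, deadline=None, database=None)
@given(non_empty_series)
def check_never_empty(s):
    assert not s.empty


check_never_empty()
```

Constraining the index means no generated examples are discarded, whereas `.filter` rejects them after the fact.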

Zac-HD · 2025-07-03T05:35:38Z

hypothesis-python/RELEASE.rst

+This version adds support for generating numpy.ndarray and pandas.Series with any python object as an element.
+Effectively, hypothesis can now generate ``np.array([MyObject()], dtype=object)``.
+The first use-case for this is with Pandas and Pandera where it is possible and sometimes required to have columns which themselves contain structured datatypes.
+Pandera seems to be waiting for this change to support ``PythonDict, PythonTypedDict, PythonNamedTuple`` etc.
+
+---
+
+- Accept ``dtype.kind = 'O'`` in ``from_dtype``
+- Use ``.iat`` instead of ``.iloc`` to set values in pandas strategies


(apologies for this comment, it's late at night & I don't really know what I want to do instead, but thought it better to send a review now than wait until later)

I'd like to rework this note, to focus more tightly on the specific changes - as prose, not dot-points - and then afterwards note why this is valuable, with pandera only mentioned as one possible case for structured data within a pandas series. I'd also include cross-references to each class you mention, and (optional but encouraged) a thank-you note to yourself at the end of the changelog ("Thanks to Shaun Read for identifying and fixing these issues!" or similar).

Great, ok I've reworded the release notes and implemented all the suggestions

philastrophist · 2025-07-03T09:16:06Z

Some interesting error is occurring outside of the changes in this PR...

tybug · 2025-07-03T20:37:49Z

sorry for dropping the requested review here, I'd want to be confident I understand the pandas interactions first and I don't have that requisite knowledge at the moment 😅

That failure might be a real crosshair failure, but I'm not sure it's worth pursuing with such a non-reproducer.

philastrophist · 2025-07-04T06:37:26Z

> sorry for dropping the requested review here, I'd want to be confident I understand the pandas interactions first and I don't have that requisite knowledge at the moment 😅

As far as I understand, at and iat are more basic indexers than loc and iloc, in that they can only access a single entry rather than possibly a subset of entries.
But ignoring vector access here, loc will transform dicts into a Series and then set them. There's an interesting note in their source here:

# TODO(EA): ExtensionBlock.setitem this causes issues with
# setting for extensionarrays that store dicts. Need to decide
# if it's worth supporting that.

Seems to be vaguely related.

But the important points are:

- loc transforms the given values, which stops us from inserting dicts into a Series using iloc/loc. This may or may not be a bug. Either way, editing this logic within pandas is likely to be fraught, and it's difficult to tell what other transforms might be applied.
- at is the intended way to set single values within a DataFrame/Series according to the docs. It's technically faster, but more importantly it doesn't perform any checks or transformations on the value, so the logic is a lot simpler. The reason ruff warns against it is that "iloc is more idiomatic and versatile". We know that in our use case we will only ever be setting a Series element by integer index, which is what iat is for.

From the docstrings:

DataFrame.iat : Access a single value for a row/column label pair by integer position(s).
DataFrame.iloc : Access a group of rows and columns by integer position(s).
Similar to ``iloc``, in that both provide integer-based lookups. Use
    ``iat`` if you only need to get or set a single value in a DataFrame
    or Series.

Demonstration:

import pandas as pd

s = pd.Series([1, 2, 3], dtype=object)  # object dtype so we don't get mismatch warnings

s.iloc[0] = {'a': 1}
print('series with iloc:\n', s)
print('entry type with iloc:', type(s.iloc[0]))

s.iat[0] = {'a': 1}
print('with iat:\n', s)
print('entry type with iat:', type(s.iat[0]))

prints out:

series with iloc:
 0    a    1
dtype: int64
1                      2
2                      3
dtype: object
entry type with iloc: <class 'pandas.core.series.Series'>
with iat:
 0    {'a': 1}
1           2
2           3
dtype: object
entry type with iat: <class 'dict'>
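
Building on the demonstration, a small sketch of how this could be checked in a test. `set_scalar` and `Marker` are hypothetical names for illustration, not part of pandas or Hypothesis, and the assertions only cover the `.iat` path, since the `.iloc` behaviour may differ between pandas versions.

```python
import pandas as pd


class Marker:
    """Hypothetical user-defined object to store in a Series."""


def set_scalar(series, position, value):
    """Store `value` at an integer position using the scalar accessor."""
    series.iat[position] = value
    return series


s = pd.Series([1, 2, 3], dtype=object)
payload = {"a": 1}
marker = Marker()
set_scalar(s, 0, payload)
set_scalar(s, 1, marker)

# The dict stays a dict, and the custom object comes back unchanged.
assert isinstance(s.iat[0], dict)
assert s.iat[1] is marker
```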

Shaun Read added 7 commits June 20, 2025 13:48

- Accept dtype.kind = 'O' in from_dtype

77fc61e

- Use `.iat` instead of `.iloc` to set values in pandas strategies

ruff: yes we really want .iat

895e360

linting

8fe26b8

add test for failing coverage and dtypes

d2bf820

linter

918a9f0

rst mistakes and linting

b514789

comparable datatypes only

07f8ea0

Zac-HD requested a review from tybug June 26, 2025 02:08

Shaun Read added 3 commits July 2, 2025 14:46

still keep the else line to catch unknown dtypes but remove from cove…

aa0ab3f

…rage since we now actually cover all types and this line is now not covered

make test agree with the from_dtype strategy

b9380b7

formatting

e0c2909

Zac-HD reviewed Jul 3, 2025


Shaun Read added 4 commits July 3, 2025 09:19

addressed comments :)

6aea8bc

formatting

b65e335

formatting

212a628

Got rst sytnax wrong again...

088d272

philastrophist requested a review from Zac-HD July 3, 2025 09:16
