fix: avoid "Unable to determine type" warning with JSON columns in to_dataframe
#1876
Conversation
I might actually want to do something in …
Right now the behavior is inconsistent across REST and BQ Storage API.
Marking as …
Actually, I think this needs a few more tests. I'm testing manually with …
# Prefer the JSON type built in to pyarrow (added in 19.0.0), if available.
# Otherwise, fall back to db-dtypes, where JSONArrowType was added in 1.4.0,
# but since an older db-dtypes might be installed, fall back to string after that.
# TODO(https://github.com/pandas-dev/pandas/issues/60958): switch to
# pyarrow.json_(pyarrow.string()) if available and supported by pandas.
if hasattr(db_dtypes, "JSONArrowType"):
    json_arrow_type = db_dtypes.JSONArrowType()
else:
    json_arrow_type = pyarrow.string()
This is the key change. Mostly aligns with bigframes, but we've left off pyarrow.json_(pyarrow.string()) because of pandas-dev/pandas#60958.
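For context, here's a minimal standalone sketch of the fallback chain the diff implements; the `json_compatible_type` helper name is hypothetical and only illustrates the decision order:

import pyarrow

try:
    import db_dtypes  # optional here; JSONArrowType arrived in db-dtypes 1.4.0
except ImportError:
    db_dtypes = None

def json_compatible_type() -> pyarrow.DataType:
    # Hypothetical helper: pick an Arrow type for BigQuery JSON columns.
    if db_dtypes is not None and hasattr(db_dtypes, "JSONArrowType"):
        # An Arrow extension type backed by string storage, tagged as JSON.
        return db_dtypes.JSONArrowType()
    # Older (or missing) db-dtypes: plain strings still carry the
    # serialized JSON text, just without the extension-type tag.
    return pyarrow.string()

field = pyarrow.field("json_col", json_compatible_type())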
Marking as … Edit: Mailed #2144
I've added regression tests for #1580.
tests/system/test_arrow.py
"json_array_col", | ||
] | ||
assert table.shape == (0, 5) | ||
assert list(table.field("struct_col").type.names) == ["json_field", "int_field"] |
Test failure:
____________________ test_to_arrow_query_with_empty_results ____________________
bigquery_client =
def test_to_arrow_query_with_empty_results(bigquery_client):
    """
    JSON regression test for https://github.com/googleapis/python-bigquery/issues/1580.
    """
    job = bigquery_client.query(
        """
        select
            123 as int_col,
            '' as string_col,
            to_json('{}') as json_col,
            struct(to_json('[]') as json_field, -1 as int_field) as struct_col,
            [to_json('null')] as json_array_col,
        from unnest([])
        """
    )
    table = job.to_arrow()
    assert list(table.column_names) == [
        "int_col",
        "string_col",
        "json_col",
        "struct_col",
        "json_array_col",
    ]
    assert table.shape == (0, 5)
>   assert list(table.field("struct_col").type.names) == ["json_field", "int_field"]
E   AttributeError: 'pyarrow.lib.StructType' object has no attribute 'names'
Need to update this to support older pyarrow.
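One portable option is to iterate the struct type's fields rather than use the newer `.names` attribute. A small sketch, assuming `StructType` iteration (which long predates `.names`):

import pyarrow

# Compatible with older pyarrow: StructType is iterable over its Fields,
# whereas the .names attribute is a newer addition.
struct_type = pyarrow.struct(
    [("json_field", pyarrow.string()), ("int_field", pyarrow.int64())]
)
assert [field.name for field in struct_type] == ["json_field", "int_field"]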
# but we'd like this to map as closely to the BQ Storage API as
# possible, which uses the string() dtype, as JSON support in BigQuery
# predates JSON support in Arrow by several years.
"JSON": pyarrow.string,
Mapping to pa.string won't achieve a round trip, will it? Meaning a value saved locally won't be identifiable as JSON when read back. Does that matter to bigframes?
BigQuery sets metadata on the Field that can be used to determine this type. I don't want to diverge from BigQuery Storage Read API behavior.
In bigframes and pandas-gbq, we have the BigQuery schema available to disambiguate and customize the pandas types.
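As a rough sketch of what that schema-based disambiguation can look like: the `json_aware_dtypes` helper below is hypothetical, and it assumes db-dtypes' `JSONDtype` accepts serialized JSON strings via `astype`.

import db_dtypes

def json_aware_dtypes(schema):
    # `schema` is a sequence of google.cloud.bigquery.SchemaField objects.
    # Hypothetical helper: map top-level JSON columns to db-dtypes' JSONDtype,
    # since the Arrow payload alone only shows them as strings.
    return {
        field.name: db_dtypes.JSONDtype()
        for field in schema
        if field.field_type == "JSON" and field.mode != "REPEATED"
    }

# Usage (assumed): rows = client.query(sql).result()
# df = rows.to_dataframe().astype(json_aware_dtypes(rows.schema))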
I can investigate if such an override is also possible here.
I made some progress plumbing through a `json_type` everywhere it would need to go to be able to override this, but once I got to `to_arrow_iterable`, it kinda breaks down. There we very much just return the pages we get from the BQ Storage Read API. I don't really want to override that, as it adds new layers of complexity to what was a relatively straightforward internal API.
I'd prefer to leave this as-is without the change to allow overriding the arrow type.
If there are still objections, I can try the same but just with the pandas data type. That gets a bit awkward when it comes to struct, though.
Thanks for digging into it.
It might be important to have this `json_arrow_type` feature for the work Chelsea is doing in bigframes and pandas-gbq. I'll give this another try, but I think since it's a feature it should be a separate PR.
Started #2149 but I'm hitting some roadblocks. Will pause for now.
…o_dataframe`

* add regression tests for empty dataframe
* fix arrow test to be compatible with old pyarrow
assert list(df.columns) == [
    "int_col",
    "string_col",
    "json_col",
    "struct_col",
    "json_array_col",
]
assert len(df.index) == 0
#QUESTION: The test suite for `test_to_arrow_query_with_empty_results` is more robust than this one. Is there a reason for the difference?
We're using `object` dtype in pandas for STRUCT and ARRAY columns right now. This means there's not much we can inspect about subfields/subtypes like we can with Arrow.
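To illustrate (a toy example, not library code): with `object` dtype the struct values are just Python dicts, so the dtype itself carries no subfield information.

import pandas

s = pandas.Series([{"json_field": "[]", "int_field": -1}])
print(s.dtype)          # object
print(type(s.iloc[0]))  # <class 'dict'>; subfield types aren't inspectable via the dtype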
Related: I'd like to introduce a `dtype_backend` parameter like pandas has to pandas-gbq: googleapis/python-bigquery-pandas#621
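For reference, this mirrors the `dtype_backend` knob pandas already exposes on its own readers (pandas >= 2.0); a tiny illustration:

import io
import pandas

# pandas' readers accept dtype_backend to get Arrow-backed columns,
# e.g. int64[pyarrow] instead of NumPy int64.
df = pandas.read_csv(io.StringIO("a,b\n1,x\n"), dtype_backend="pyarrow")
print(df.dtypes)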
In the meantime, I tend to use:

df = (
    results
    .to_arrow(bqstorage_client=bqstorage_client)
    .to_pandas(types_mapper=lambda type_: pandas.ArrowDtype(type_))
)

instead of `to_dataframe()` in my personal projects, as this will give me the more structured Arrow types all the time.
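A nice side effect of that approach (assuming pandas >= 2.2, which added the struct accessor for ArrowDtype columns): struct subfields stay inspectable from the resulting DataFrame.

# With ArrowDtype-backed columns, subfields can be pulled out directly:
int_fields = df["struct_col"].struct.field("int_field")
print(df["struct_col"].dtype)  # e.g. struct<json_field: ..., int_field: int64>[pyarrow]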
LGTM.
Included a question for my own edification.
Not a blocker.
Approved.
…o_dataframe` (#1876)

* add regression tests for empty dataframe
* fix arrow test to be compatible with old pyarrow
Based on #2144, which should merge first.
Fixes #1580
🦕