-
Notifications
You must be signed in to change notification settings - Fork 47
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Improve local data validation #1598
Conversation
@@ -504,17 +503,3 @@ def test_read_pandas_with_bigframes_dataframe(): | |||
ValueError, match=re.escape("read_pandas() expects a pandas.DataFrame") | |||
): | |||
session.read_pandas(df) | |||
|
|||
|
|||
def test_read_pandas_inline_w_noninlineable_type_raises_error(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe you don't want to remove this test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
everything is inlinable now, no way to get this error anymore
bigframes/core/local_data.py
Outdated
def _adapt_pandas_series( | ||
series: pandas.Series, | ||
) -> tuple[Union[pa.ChunkedArray, pa.Array], bigframes.dtypes.Dtype]: | ||
if series.dtype == np.dtype("O"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Attempting to identify geo_dtype by trying conversion on all object columns seems risky because pandas also classifies PyArrow types as object. I'm concerned this could lead to non-geographic PyArrow columns being incorrectly identified as geo_dtype.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, valid concern, especially since we do not control geopandas constructor. Probably should try pyarrow conversion first, and try geopandas as a fallback
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, changed to fallback to attempting object->geo only after other conversions fail.
return cls(total_bytes=table.nbytes, row_count=table.num_rows) | ||
|
||
|
||
_MANAGED_STORAGE_TYPES_OVERRIDES: dict[bigframes.dtypes.Dtype, pa.DataType] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that we use ManagedArrowTable (leveraging PyArrow) for local data, should we centralize our conversion logic based on that? Specifically, DataFrameAndLabels
also handles conversions (Pandas <-> BigQuery), unifying these under a single class seems beneficial. If we standardize on PyArrow as the intermediate "transfer station," the Pandas-to-BigQuery conversion could simply become a Pandas -> Arrow -> BigQuery process, simplifying the overall system.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, my plan is that everything goes through local managed storage, to ensure consistent normalization and validation.
return pa.array(series, type=pa.string()), bigframes.dtypes.GEO_DTYPE | ||
try: | ||
return _adapt_arrow_array(pa.array(series)) | ||
except Exception as e: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of catching a generic Exception, what specific exception types should I expect when performing geographic data type conversions? Could TypeError be one of them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a couple TypeError, but also ArrowInvalid. I don't really trust it not to change though? the docs don't specify what error types they will give you.
Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:
Fixes #<issue_number_goes_here> 🦕