
feat: Improve local data validation #1598

Merged
merged 7 commits into main from local_data_2
Apr 8, 2025

Conversation

TrevorBergeron
Contributor

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes #<issue_number_goes_here> 🦕

@product-auto-label product-auto-label bot added size: m Pull request size is medium. api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. labels Apr 5, 2025
@TrevorBergeron TrevorBergeron marked this pull request as ready for review April 7, 2025 18:36
@TrevorBergeron TrevorBergeron requested review from a team as code owners April 7, 2025 18:36
@@ -504,17 +503,3 @@ def test_read_pandas_with_bigframes_dataframe():
ValueError, match=re.escape("read_pandas() expects a pandas.DataFrame")
):
session.read_pandas(df)


def test_read_pandas_inline_w_noninlineable_type_raises_error():
Contributor

maybe you don't want to remove this test?

Contributor (Author)


Everything is inlinable now; there's no way to get this error anymore.

def _adapt_pandas_series(
series: pandas.Series,
) -> tuple[Union[pa.ChunkedArray, pa.Array], bigframes.dtypes.Dtype]:
if series.dtype == np.dtype("O"):
Contributor

Attempting to identify geo_dtype by trying conversion on all object columns seems risky because pandas also classifies PyArrow types as object. I'm concerned this could lead to non-geographic PyArrow columns being incorrectly identified as geo_dtype.

Contributor (Author)


Yeah, valid concern, especially since we do not control the geopandas constructor. We should probably try the pyarrow conversion first and use geopandas as a fallback.

Contributor (Author)

OK, changed to fall back to attempting the object -> geo conversion only after other conversions fail.

return cls(total_bytes=table.nbytes, row_count=table.num_rows)


_MANAGED_STORAGE_TYPES_OVERRIDES: dict[bigframes.dtypes.Dtype, pa.DataType] = {
Contributor


Now that we use ManagedArrowTable (leveraging PyArrow) for local data, should we centralize our conversion logic based on that? Specifically, DataFrameAndLabels also handles conversions (Pandas <-> BigQuery), unifying these under a single class seems beneficial. If we standardize on PyArrow as the intermediate "transfer station," the Pandas-to-BigQuery conversion could simply become a Pandas -> Arrow -> BigQuery process, simplifying the overall system.

Contributor (Author)


Yes, my plan is that everything goes through local managed storage, to ensure consistent normalization and validation.

return pa.array(series, type=pa.string()), bigframes.dtypes.GEO_DTYPE
try:
return _adapt_arrow_array(pa.array(series))
except Exception as e:
Contributor


Instead of catching a generic Exception, what specific exception types should I expect when performing geographic data type conversions? Could TypeError be one of them?

Contributor (Author)


There are a couple of TypeErrors, but also ArrowInvalid. I don't really trust it not to change, though; the docs don't specify what error types they will raise.

@TrevorBergeron TrevorBergeron merged commit 815e471 into main Apr 8, 2025
24 checks passed
@TrevorBergeron TrevorBergeron deleted the local_data_2 branch April 8, 2025 19:55