
feat: Improve local data validation #1598

Merged
merged 7 commits into main from local_data_2
Apr 8, 2025

Conversation

TrevorBergeron
Contributor

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes #<issue_number_goes_here> 🦕

@product-auto-label product-auto-label bot added size: m Pull request size is medium. api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. labels Apr 5, 2025
@TrevorBergeron TrevorBergeron marked this pull request as ready for review April 7, 2025 18:36
@TrevorBergeron TrevorBergeron requested review from a team as code owners April 7, 2025 18:36
@@ -504,17 +503,3 @@ def test_read_pandas_with_bigframes_dataframe():
ValueError, match=re.escape("read_pandas() expects a pandas.DataFrame")
):
session.read_pandas(df)


def test_read_pandas_inline_w_noninlineable_type_raises_error():
Contributor

maybe you don't want to remove this test?

Contributor (Author)


Everything is inlinable now; there's no way to get this error anymore.

def _adapt_pandas_series(
series: pandas.Series,
) -> tuple[Union[pa.ChunkedArray, pa.Array], bigframes.dtypes.Dtype]:
if series.dtype == np.dtype("O"):
Contributor

Attempting to identify geo_dtype by trying conversion on all object columns seems risky because pandas also classifies PyArrow types as object. I'm concerned this could lead to non-geographic PyArrow columns being incorrectly identified as geo_dtype.

Contributor (Author)


Yeah, valid concern, especially since we do not control the geopandas constructor. We should probably try the pyarrow conversion first and use geopandas as a fallback.

Contributor (Author)

OK, changed to fall back to attempting the object -> geo conversion only after other conversions fail.

return cls(total_bytes=table.nbytes, row_count=table.num_rows)


_MANAGED_STORAGE_TYPES_OVERRIDES: dict[bigframes.dtypes.Dtype, pa.DataType] = {
Contributor


Now that we use ManagedArrowTable (leveraging PyArrow) for local data, should we centralize our conversion logic based on that? Specifically, DataFrameAndLabels also handles conversions (Pandas <-> BigQuery), unifying these under a single class seems beneficial. If we standardize on PyArrow as the intermediate "transfer station," the Pandas-to-BigQuery conversion could simply become a Pandas -> Arrow -> BigQuery process, simplifying the overall system.

Contributor (Author)


Yes, my plan is that everything goes through local managed storage, to ensure consistent normalization and validation.

return pa.array(series, type=pa.string()), bigframes.dtypes.GEO_DTYPE
try:
return _adapt_arrow_array(pa.array(series))
except Exception as e:
Contributor


Instead of catching a generic Exception, what specific exception types should I expect when performing geographic data type conversions? Could TypeError be one of them?

Contributor (Author)


There are a couple of TypeErrors, but also ArrowInvalid. I don't really trust it not to change, though; the docs don't specify what error types they will raise.

@TrevorBergeron TrevorBergeron merged commit 815e471 into main Apr 8, 2025
24 checks passed
@TrevorBergeron TrevorBergeron deleted the local_data_2 branch April 8, 2025 19:55