Skip to content

fix: fix pandas.cut errors with empty bins #1499

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Mar 18, 2025
Merged

Conversation

chelsea-lin
Copy link
Contributor

Fixes internal issue 403638910 🦕

@chelsea-lin chelsea-lin requested review from a team as code owners March 17, 2025 23:36
@chelsea-lin chelsea-lin requested a review from tswast March 17, 2025 23:36
@product-auto-label product-auto-label bot added size: m Pull request size is medium. api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. labels Mar 17, 2025
@chelsea-lin chelsea-lin requested a review from sycai March 17, 2025 23:36
window_spec=window_specs.unbound(),
)
op = agg_ops.CutOp(bins, right=right, labels=labels)
if isinstance(bins, typing.Iterable) and len(as_index) == 0:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we directly return the [pd.NA] * len(x) result from the elif branch len(list(bins)) == 0?

In that case we don't need the additional if-else branch at the bottom of the function, and the logic looks more straightforward. Plus, we don't need to alter the behavior of CutOp.output_type() in aggregations.py too

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of the elif branch can turn to the [pd.NA] * len(x) result, such as

  • 1st elif for pd.IntervalIndex.from_tuples([]).
  • 2nd elif for []
  • 4st elif for [1]
    Because of that, CutOp.output_type() might return different result for each case above.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My main concern is that the "if" check on line 88 has quite some overlap with line 49 + line 56. It seems we are doing the same check repeatedly.

Anyway, this shouldn't affect the overall functionality.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just refactor these if branches for more readable. With this refactor, we also catches another edge cases bug. Thanks!

@chelsea-lin chelsea-lin requested a review from sycai March 18, 2025 06:21
window_spec=window_specs.unbound(),
)
op = agg_ops.CutOp(bins, right=right, labels=labels)
if isinstance(bins, typing.Iterable) and len(as_index) == 0:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My main concern is that the "if" check on line 88 has quite some overlap with line 49 + line 56. It seems we are doing the same check repeatedly.

Anyway, this shouldn't affect the overall functionality.

@chelsea-lin chelsea-lin force-pushed the main_chelsealin_cutbug branch from dc098b3 to d67531a Compare March 18, 2025 17:33
@chelsea-lin chelsea-lin merged commit 434fb5d into main Mar 18, 2025
22 of 24 checks passed
@chelsea-lin chelsea-lin deleted the main_chelsealin_cutbug branch March 18, 2025 18:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. size: m Pull request size is medium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants