perf: Join op discards child ordering in unordered mode #923

Merged
merged 8 commits into from
Sep 23, 2024

TrevorBergeron
Contributor

…er mode

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes #<issue_number_goes_here> 🦕

@product-auto-label product-auto-label bot added size: s Pull request size is small. api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. labels Aug 23, 2024
@TrevorBergeron TrevorBergeron marked this pull request as ready for review August 23, 2024 23:52
@TrevorBergeron TrevorBergeron requested review from a team as code owners August 23, 2024 23:52
@TrevorBergeron TrevorBergeron requested a review from tswast August 23, 2024 23:52
        join=node.join,
    )
else:
    left_unordered = self.compile_unordered_ir(node.left_child)
Collaborator

I'm curious whether we need to do anything to handle the sort=True behavior of the pandas join. I'm guessing "no", as I suspect it would be implemented as a sort after the join.

Also, let's add a comment explaining this optimization:

Suggested change
- left_unordered = self.compile_unordered_ir(node.left_child)
+ # In general, joins are an ordering destroying operation.
+ # With ordering_mode = "partial", make this explicit. In
+ # this case, we don't need to provide a deterministic ordering.
+ left_unordered = self.compile_unordered_ir(node.left_child)
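
The branch under review can be pictured with a small toy model. This is only a sketch: `ToyCompiler`, `compile_ordered`, `compile_unordered`, `compile_join`, and `sorts_performed` are hypothetical names invented for illustration, not the bigframes compiler or its API. It assumes that producing a deterministic child ordering costs a sort and that the non-strict path simply skips it.

```python
class ToyCompiler:
    """Hypothetical stand-in for a compiler with strict and non-strict modes."""

    def __init__(self, strict: bool):
        self.strict = strict
        self.sorts_performed = 0  # counts the ordering work done on children

    def compile_ordered(self, rows):
        # Assumption: a deterministic ordering is obtained by sorting on the
        # position stored in the first tuple element.
        self.sorts_performed += 1
        return sorted(rows, key=lambda row: row[0])

    def compile_unordered(self, rows):
        # No ordering is computed; rows pass through as they arrive.
        return list(rows)

    def compile_join(self, left, right):
        # Strict mode orders both children; non-strict mode skips that work.
        child = self.compile_ordered if self.strict else self.compile_unordered
        l_rows, r_rows = child(left), child(right)
        return [
            (lk, lv, rv)
            for (_, lk, lv) in l_rows
            for (_, rk, rv) in r_rows
            if lk == rk
        ]


# Rows are (position, key, value); the left side arrives out of position order.
left = [(1, "b", "L-b"), (0, "a", "L-a")]
right = [(0, "a", "R-a"), (1, "b", "R-b")]

strict = ToyCompiler(strict=True)
relaxed = ToyCompiler(strict=False)
strict_rows = strict.compile_join(left, right)
relaxed_rows = relaxed.compile_join(left, right)

print(strict_rows, strict.sorts_performed)
print(relaxed_rows, relaxed.sorts_performed)
```

Both modes return the same set of rows. Only the row order and the amount of ordering work differ, which is the trade the optimization makes.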

Contributor Author

added comment

Contributor Author

And yes, sort=True sorts as an additional operation and so will still apply: https://github.com/googleapis/python-bigquery-dataframes/blob/main/bigframes/core/blocks.py#L2073-L2080
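
A minimal sketch of why a separate sort step makes the join's internal ordering irrelevant. The helpers `join_unordered` and `join` are hypothetical and written in plain Python; they are not the code at the linked lines of blocks.py. The sketch assumes unique join keys so that sorting on the key fully determines the output.

```python
def join_unordered(left, right):
    """Inner join on the first tuple element, with no promise about row order."""
    right_by_key = {}
    for key, value in right:
        right_by_key.setdefault(key, []).append(value)
    return [
        (key, l_value, r_value)
        for key, l_value in left
        for r_value in right_by_key.get(key, [])
    ]


def join(left, right, sort=False):
    result = join_unordered(left, right)
    # Hypothetical analogue of sort=True: an extra sort on the join key,
    # applied after the join rather than inside it.
    return sorted(result, key=lambda row: row[0]) if sort else result


left = [(3, "c"), (1, "a"), (2, "b")]
right = [(2, "B"), (3, "C"), (1, "A")]
left_other_order = list(reversed(left))

# Without the sort, the output follows whatever order the left child had.
unsorted_a = join(left, right)
unsorted_b = join(left_other_order, right)

# With the sort, the child order no longer matters.
sorted_a = join(left, right, sort=True)
sorted_b = join(left_other_order, right, sort=True)
print(unsorted_a != unsorted_b, sorted_a == sorted_b)
```

Because the sort runs on the join output, dropping the ordering of the join's inputs does not change what a caller asking for sorted output receives.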

Comment on lines 280 to 289
if self.strict:
    compiled_ordered = [
        self.compile_ordered_ir(node) for node in node.children
    ]
    return concat_impl.concat_ordered(compiled_ordered)
else:
    compiled_unordered = [
        self.compile_unordered_ir(node) for node in node.children
    ]
    return concat_impl.concat_unordered(compiled_unordered).as_ordered_ir()
Collaborator

This optimization worries me a little more than the join optimization. Could we update some documentation for concat to call out this fact? I don't find this intuitive.

Contributor Author

Hmm, it could be confusing if concat doesn't preserve ordering. Reverting this aspect of the change; only join will drop child ordering for now.
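
To make the concern concrete, here is a toy contrast between an order-preserving concat and one that does no ordering work. `toy_concat_ordered` and `toy_concat_unordered` are hypothetical helpers that only echo the names in the diff; they are not the `concat_impl` functions. The sketch assumes each child row carries an explicit position.

```python
def toy_concat_ordered(children):
    """Keep child order first, then each row's position within its child."""
    tagged = [
        (child_index, pos, value)
        for child_index, child in enumerate(children)
        for pos, value in child
    ]
    tagged.sort(key=lambda item: (item[0], item[1]))
    return [value for _, _, value in tagged]


def toy_concat_unordered(children):
    """Return the same rows without consulting positions at all."""
    return [value for child in children for _, value in child]


# Rows are (position, value); the first child arrives out of position order.
first = [(1, "a1"), (0, "a0")]
second = [(0, "b0")]

ordered = toy_concat_ordered([first, second])
unordered = toy_concat_unordered([first, second])
print(ordered)
print(unordered)
```

Both results contain the same rows, but only the ordered variant matches the row order that users typically expect from a concat.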

@TrevorBergeron TrevorBergeron changed the title perf: Concat and Join ops don't compute child ordering in partial ord… perf: Join op discards child ordering in unordered mode Sep 4, 2024
@TrevorBergeron TrevorBergeron merged commit 1b5b0ee into main Sep 23, 2024
22 of 23 checks passed
@TrevorBergeron TrevorBergeron deleted the less_order branch September 23, 2024 21:19
Labels
api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. size: s Pull request size is small.

3 participants