Skip to content

fix: hoist order_by clauses from aggregations into window during GroupedTable.mutate() and Agg.over() #11314

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

NickCrews
Copy link
Contributor

@NickCrews NickCrews commented Jun 6, 2025

Fixes #11307.

When you have either of

t.group_by("groupby").mutate(t.val.first(order_by="orderby"))

or

t.mutate(t.val.first(order_by="orderby").over(group_by="groupby"))

the order_by from the ops.First should get hoisted up to the resulting ops.Window. Similar for all aggregations that have an order_by, such as Last, ArrayCollect, and GroupConcat

eg both of the above should be equivalent to

t.mutate(t.val.first().over(group_by="groupby", order_by="orderby"))

This is a second attempt to #11308 that solves the issue in general.

'or t.mutate(t.c.first(order_by="d").over(group_by="a", order_by="b"))'
'because the "order_by" is specified twice, and is ambiguous.'
"Only specify it once, in either place."
)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could be convinced to make the op ordering take precedence over the window ordering, since it is on the more specific value. eg you might do t.group_by("a").order_by("b").mutate(c=t.c.first(order_by="d"), e=t.e.collect(), f=t.f.last()), and you want e and f to be ordered by the default of "b", but to override this default for "c", and order it by "d".

If a user gets this error message they are going to be perplexed, whereas if we just pick a consistent behavior, then they won't see it and we won't make rare uses illegal. eg if a user has to programmatically detect if an order by is given multiple times, that would be awful for them.

Copy link
Member

@cpcloud cpcloud Jun 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the most obvious behavior is to append the existing ordering. Essentially: order by on a mutate inserts the passed keys to all window/aggregates as the inner-most key, so any function calls would append order_by keys to the tuple from any parent order_bys. This generalizes to work for specified additional keys as well as no keys specified.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has precedent in the window_merge_frames rule and has existed since before the present rewrite system.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That said, I do think the behavior of "use both keys, but give the inner most key precedent" is probably the way to go.

Let's make that change after the 10.6 release, and I will merge this PR with behavior that matches the existing behavior (which is to always place window order_by keys first, even in expressions like t.col.first(order_by="foo").over(order_by="bar")).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comment below. I am a strong -1 on merging the order bys together. I am fine with the more specific one taking precedence and totally overriding the rest, but merging opens up a lot of ambiguity.

@NickCrews NickCrews force-pushed the agg-order-by-into-window branch 3 times, most recently from 86e80cb to c9ea35d Compare June 7, 2025 00:59
@NickCrews NickCrews marked this pull request as ready for review June 7, 2025 01:00
@NickCrews
Copy link
Contributor Author

Assuming that CI passes, this is ready for review!

cpcloud pushed a commit that referenced this pull request Jun 7, 2025
…11315)

Now tests can reference these easily if they rely on one of these
datatypes

I wanted to use this in #11314
to mark the test case that relies on .collect()
@NickCrews NickCrews force-pushed the agg-order-by-into-window branch from c9ea35d to 34a6d06 Compare June 9, 2025 17:58


def test_duplicate_order_by_errors():
"""If you specify the order_by argument multiple times, it should raise an error."""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not clear to me this should be the behavior. It seems like it would be equally valid to combine the order bys.

This change here seems unrelated to the goal of the PR, unless I'm misunderstanding something. I'll remove this in a separate commit, and we can discuss these changes in a separate PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was me being conservative: I didn't know what we wanted the behavior to be, so I didn't want people to rely upon anything, and this leaves the door open for us to choose whatever we want later.

There is some other issue in our bug tracker if you did two consecutive .order_by()s on a table before Ibis would combine them, but then we decided that this had bad semantics and so we made it so the final call overrode all previous calls. I didn't want to repeat that breaking change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But your plan sounds fine to me, this doesn't seem like a huge deal, I'd love to get this fix in and then sort out this detail later

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, here is that PR, and it looks like we actually DO merge order bys. But I still think that is not the best behavior.

def window_wrap_reduction(_: ops.Analytic | ops.Reduction, window: LegacyWindowBuilder):
# Wrap analytic and reduction functions in a window function, merging their ordering.
# Used in the value.over() API.
order_by_specified_in_op = ("order_by" in _.argnames) and bool(_.order_by)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was an attempt to future proof us for when we add some aggregation in the future that has ordering. I want this logic to also apply to that new op, because I KNOW that we will forget to add that new op to the test I'm adding in this PR ;).

I'm relying on the assumption that the new op will have an argument called order_by. This is slightly brittle, if we made it so all ordered aggregation classes explicitly subclassed from some Ordered ABC then this check could be a little more official.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, let's defer this until 11 entirely. I think there's some hairy parts here we may need to shave down, yak style :)

@cpcloud cpcloud marked this pull request as draft June 15, 2025 10:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tests Issues or PRs related to tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bug: order_by is not preserved in .first() aggregate when used in window functions
2 participants