-
Notifications
You must be signed in to change notification settings - Fork 652
fix: hoist order_by clauses from aggregations into window during GroupedTable.mutate() and Agg.over() #11314
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
1a12e82
to
1564981
Compare
1564981
to
6638b32
Compare
'or t.mutate(t.c.first(order_by="d").over(group_by="a", order_by="b"))' | ||
'because the "order_by" is specified twice, and is ambiguous.' | ||
"Only specify it once, in either place." | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could be convinced to make the op ordering take precedence over the window ordering, since it is on the more specific value. eg you might do t.group_by("a").order_by("b").mutate(c=t.c.first(order_by="d"), e=t.e.collect(), f=t.f.last())
, and you want e
and f
to be ordered by the default of "b", but to override this default for "c", and order it by "d".
If a user gets this error message they are going to be perplexed, whereas if we just pick a consistent behavior, then they won't see it and we won't make rare uses illegal. eg if a user has to programmatically detect if an order by is given multiple times, that would be awful for them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the most obvious behavior is to append the existing ordering. Essentially: order by on a mutate inserts the passed keys to all window/aggregates as the inner-most key, so any function calls would append order_by
keys to the tuple from any parent order_by
s. This generalizes to work for specified additional keys as well as no keys specified.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has precedent in the window_merge_frames
rule and has existed since before the present rewrite system.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That said, I do think the behavior of "use both keys, but give the inner most key precedent" is probably the way to go.
Let's make that change after the 10.6 release, and I will merge this PR with behavior that matches the existing behavior (which is to always place window order_by keys first, even in expressions like t.col.first(order_by="foo").over(order_by="bar")
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my comment below. I am a strong -1 on merging the order bys together. I am fine with the more specific one taking precedence and totally overriding the rest, but merging opens up a lot of ambiguity.
86e80cb
to
c9ea35d
Compare
Assuming that CI passes, this is ready for review! |
…pedTable.mutate() and Agg.over()
c9ea35d
to
34a6d06
Compare
|
||
|
||
def test_duplicate_order_by_errors(): | ||
"""If you specify the order_by argument multiple times, it should raise an error.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not clear to me this should be the behavior. It seems like it would be equally valid to combine the order bys.
This change here seems unrelated to the goal of the PR, unless I'm misunderstanding something. I'll remove this in a separate commit, and we can discuss these changes in a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was me being conservative: I didn't know what we wanted the behavior to be, so I didn't want people to rely upon anything, and this leaves the door open for us to choose whatever we want later.
There is some other issue in our bug tracker if you did two consecutive .order_by()s on a table before Ibis would combine them, but then we decided that this had bad semantics and so we made it so the final call overrode all previous calls. I didn't want to repeat that breaking change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But your plan sounds fine to me, this doesn't seem like a huge deal, I'd love to get this fix in and then sort out this detail later
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, here is that PR, and it looks like we actually DO merge order bys. But I still think that is not the best behavior.
def window_wrap_reduction(_: ops.Analytic | ops.Reduction, window: LegacyWindowBuilder): | ||
# Wrap analytic and reduction functions in a window function, merging their ordering. | ||
# Used in the value.over() API. | ||
order_by_specified_in_op = ("order_by" in _.argnames) and bool(_.order_by) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was an attempt to future proof us for when we add some aggregation in the future that has ordering. I want this logic to also apply to that new op, because I KNOW that we will forget to add that new op to the test I'm adding in this PR ;).
I'm relying on the assumption that the new op will have an argument called order_by. This is slightly brittle, if we made it so all ordered aggregation classes explicitly subclassed from some Ordered ABC then this check could be a little more official.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, let's defer this until 11 entirely. I think there's some hairy parts here we may need to shave down, yak style :)
Fixes #11307.
When you have either of
or
the order_by from the ops.First should get hoisted up to the resulting ops.Window. Similar for all aggregations that have an order_by, such as Last, ArrayCollect, and GroupConcat
eg both of the above should be equivalent to
This is a second attempt to #11308 that solves the issue in general.