-
Notifications
You must be signed in to change notification settings - Fork 228
Should window default to expanding
with a sort
?
#1069
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
I think it makes sense to adopt the same defaults as SQL, since as you point out it doesn't make much sense otherwise. On a perhaps (un)related note, I'm not a fan of the implicit window functions. I was surprised when I became of them in another thread with aljazerzen. I would rather have the window function be specified but get rid of the from foo
derive [timechunk = s"toStartOfInterval(timestamp, interval 3 hour)"]
group [keyword, timechunk] (
aggregate [sumQuantity = sum quantity]
)
group [timechunk] (
sort [-sumQuantity]
window [
chunkRank = rank
]
)
sort [timechunk, sumQuantity]
filter chunkRank <= 5 I know that's not how our EDIT: From the playground apparently Given the description of the query in English, I'm not sure you want a window function in this case though:
Actually, the more I'm looking into this I don't think I understand the current queries. I'll respond in a separate comment as it might get a bit lenghty. |
I'm trying to understand the query as per description:
Original SQL with raw as (select
timechunk,
keyword,
sumQuantity,
rank() over (
partition by timechunk
order by sumQuantity desc
) as rank
FROM (
select
toStartOfInterval(timestamp, interval 3 hour) as timechunk,
keyword,
sum(quantity) as sumQuantity
from foo
group by toStartOfInterval(timestamp, interval 3 hour), keyword
order by 3
)
ORDER BY 1, 4) SELECT * from raw where rank <= 5 Factoring out the with grouped as (
select
toStartOfInterval(timestamp, interval 3 hour) as timechunk,
keyword,
sum(quantity) as sumQuantity
from foo
group by
toStartOfInterval(timestamp, interval 3 hour),
keyword
order by 3
),
raw as (
select
timechunk,
keyword,
sumQuantity,
rank() over (
partition by timechunk
order by sumQuantity desc
) as rank
from grouped
order by 1, 4
)
SELECT *
from raw
where rank <= 5 This is somewhat tangential to the question but I think this is a good example of why I believe in PRQL we should disallow inline (sub)queries in transforms such as FROM, JOIN, UNION - it makes things very confusing and breaks the linear dataflow. Given the existence of CTEs / table blocks, I argue it's ok to force people to put their transform logic there so that the reader then has a directed dataflow. Returning to the question, in the I'm wondering whether the following query might not solve the problem as stated and be the simplest and most idiomatic PRQL? from foo
derive [timechunk = s"toStartOfInterval(timestamp, interval 3 hour)"]
group [keyword, timechunk] (
aggregate [sumQuantity = sum quantity]
)
group [timechunk] (
sort [-sumQuantity]
take 5
) The generated SQL for this is the following and I can't tell if it's correct or not because I can't picture in my mind how the ROW_NUMBER() window function interacts with the GROUP BY of the query that it's embedded in. Would be good if we had some sample data to test these with. WITH table_0 AS (
SELECT
keyword,
toStartOfInterval(timestamp, interval 3 hour) AS timechunk,
SUM(quantity) AS "sumQuantity",
ROW_NUMBER() OVER (
PARTITION BY toStartOfInterval(timestamp, interval 3 hour)
ORDER BY
SUM(quantity) DESC
) AS _rn_84
FROM
foo
GROUP BY
keyword,
toStartOfInterval(timestamp, interval 3 hour)
)
SELECT
table_0.*
FROM
table_0
WHERE
_rn_84 <= 5 |
This is brilliant — @aljazerzen spends a great deal of energy building the |
Yeah, I put this in the bucket of "It's better. But it's less familiar with a mental model of SQL". I'm fine opening an issue and collecting thoughts; these issues are a delicate tradeoff rather than something we can discuss our way to a solution. FWIW your final example — using the |
That's not true: when using LAG, you want the whole window but also want to have order. Same with RANK. SELECT
id,
value,
LAG(value) OVER (ORDER BY id ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING),
RANK() OVER (ORDER BY value ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
FROM test; This works the same as it would with expanding window (which is the default), because window functions that map So I think that being explicit is good in this case and we should stick to being predictable instead inheriting SQL defaults. |
Some notes on how I think about windowing (consider this an optional read, I want to formulate my thoughts). There are two types of windowing:
|
Hey folks, I'm the person who posted the example that motivated this ticket, thanks for your time and effort engaging with my use case! A new wrinkle I'm thinking about: the clients of my software also expect to see the number of "other" records that didn't have one of the top-k-ranked values. Can you think of a natural/easy/feasible way to adjust the PRQL to include that information? e.g.: Table schemaCREATE TABLE foo (
timestamp timestamp not null,
colour varchar(16) not null
) Human intent query"Top 3 colours by day" Desired result:
|
I believe you are want to do this: table daily_colors = (
from foo
group [day, colour] (aggregate cnt = count)
group [day] (sort [-cnt] | derive rnk = row_number)
)
table top3 = (
from daily_colors
filter rnk <= 3
)
table others = (
from daily_colors
filter rnk > 3
group [day] (aggregate cnt = count)
derive color = '[other]'
)
# just concat the two tables together.
# we currenty don't have union implemented
# and this is the workaround.
# yes, is's ugly
from s"SELECT * FROM top3 UNION ALL SELECT * FROM others" And something seems off with produced SQL. The first table should not fit into one CTE... |
(This example is about an actual window function, where frame does not matter. In my query, row_number is not within a window so default is used: the whole group) |
Briefly on one point — and sorry for not having responded above
...is absolutely right and I agree with the cases ( And more broadly, I'm not sure there's a reasonable standard for defaults beyond having things fully explicit |
I agree the majority case is quite different — which I guess is why SQL has different defaults. I'm not sure how to make a standard for those — both types technically reduce n values to 1 value for each row. (which agrees with your perspective on having the same default for both) |
Well yes - you can think about passing only a rolling frame to rank, but it's implemented as an actual window function and it is not affected by frames at all. These are all equivalent:
The last one especially does not make sense, but in SQL this doesn't even matter. (edit: fixed CURRENT ROW)
I also think that. And this means that we can choose our default. |
My message above is wrong — I was thinking of ranking one column (e.g. Not sure if I'm still thinking about it wrong, but I would have thought that
...every row would be rank 1, because it's always the highest value in its frame? Though BigQuery doesn't allow this when I attempted to test it — |
Yes, that probably because of what I'm saying. PG simply ignores the frame, while BQ apparently throws an error. Which means we should probably throw an error too... This thread has many tangential examples and discussions. I'll open an issue about this last thing, but we need some closure on discussion about the frame default when sorting. |
I got the query to work in Postgres, I just had to add the following line to select [d, cnt, c] Final: table daily_colors = (
from foo
group [d, c] (aggregate cnt = count)
group [d] (sort [-cnt] | derive rnk = row_number)
)
table top3 = (
from daily_colors
filter rnk <= 3
select [d, cnt, c]
)
table others = (
from daily_colors
filter rnk > 3
group [d] (aggregate cnt = count)
derive c = '[other]'
)
# just concat the two tables together.
# we currenty don't have union implemented
# and this is the workaround.
# yes, is's ugly
from s"SELECT * FROM top3 UNION ALL SELECT * FROM others"
sort [d, -cnt] I also added the sort [d, -cnt] at the end because the results were in a random order. Interestingly this actually also produced an error because it got translated to ORDER BY *, * DESC because of the opaqueness of that s-string Final SQL example I ran: with foo(d, c) as (
values (1, 'r'), (1, 'r'), (1, 'r'), (1, 'b'), (1, 'b'), (1, 'g'), (1, 'c'), (1, 'p'),
(2, 'r'), (2, 'r'), (2, 'b'), (2, 'b'), (2, 'b'), (2, 'g'), (2, 'c')
),
daily_colors AS (
SELECT
d,
c,
COUNT(*) AS cnt,
ROW_NUMBER() OVER (
PARTITION BY d
ORDER BY
COUNT(*) DESC ROWS BETWEEN UNBOUNDED PRECEDING
AND UNBOUNDED FOLLOWING
) AS rnk
FROM
foo
GROUP BY
d,
c
),
others AS (
SELECT
d,
COUNT(*) AS cnt,
'[other]' AS c
FROM
daily_colors
WHERE
rnk > 3
GROUP BY
d
),
top3 AS (
SELECT
d,
cnt,
c
FROM
daily_colors
WHERE
rnk <= 3
),
table_1 AS (
SELECT
*
FROM
top3
UNION
ALL
SELECT
*
FROM
others
)
SELECT
*
FROM
table_1 AS table_0
ORDER BY
d,
cnt DESC which produced
|
I had the same expectation as @max-sixty that every row would be rank 1 in the example below:
I tested this in postgres with: with foo(i, n) as (
values (1, 300), (2, 50000), (3, 1),
(4, 20), (5, 600000), (6, 4000)
)
select *
, RANK() OVER (ORDER BY n)
, RANK() OVER (ORDER BY n ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as rank2
, RANK() OVER (ORDER BY n ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as rank3
, RANK() OVER (ORDER BY n ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) as rank4
, SUM(n) OVER (ORDER BY n)
, SUM(n) OVER (ORDER BY n ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as sum2
, SUM(n) OVER (ORDER BY n ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as sum3
, SUM(n) OVER (ORDER BY n ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) as sum4
from foo
order by i The |
Because Imagine you'd have: def rank(values: List(number)) -> List(number):
return [i for i, v in enumerate(values)]
def sum(values: List(number)) -> number:
s = 0
for v in values:
s += v
return s For implementation of window functions, you'd check if return value of the function is
Note 1: second option is probably not implemented like this (having O(n^2)) - you can use cumsum, but this is just an illustration. Note 2: rank could be implemented as aggregation:
... but this is much less more efficient and does not hold for all such functions, such as NTILE. Note 3: both my RANK impls are simplified, they don't handle ties. It could be that you can't implement actual RANK as aggregation. In summary: actual window functions don't use window frame (or even reject it BQ). So I suggest we default to no frame (that it the whole table/window/group). |
Thanks. I understand there may be performance implications and some combinations may not be feasible. My complaint is more about how this is documented and communicated. I admit I probably did SQL for 10 years before I ever even heard of WINDOW functions but I was not aware of this distinction until now. Looking over the Postgresql documentation again now, I see it is in there but IMHO it's quite subtle:
I guess the implication is that the rest of the functions in that list DO NOT consider the "window frame". There's also 4.2.8. Window Function Calls which has a section
I just skimmed that section now but there's nothing jumping out at me that says that the optional frame_clause only applies to some window functions. I think maybe one way this could be made clearer would be to call first_value, last_value, and nth_value FRAME FUNCTIONS which are similar to WINDOW FUNCTIONS and use the same I think that's how I'm going to try and remember it internally. I find this a pretty tricky space with many footguns lying around. I agree with you that we should try and make this consistent in PRQL. I'll need to step away from this for a bit to think about what I think would be a better behaviour to try and keep it logical and consistent for our users. Two related functions where I've also gotten tripped up in the past and spent hours debugging queries is with |
So to confirm my SQL understanding, it seems there are three categories of window / "analytic" functions:
References:
Is that right? Footnotes
|
That sums it up well and is my understanding as well. |
If that's the case, then I can see a few options for our defaults:
I would probably vote for |
I vote +2 on the full window (which is also the current behavior). |
(I realize my numbers didn't agree with my titles, so fixed above) Full window was also my vote; do you think it's viable to also let dialects like BQ work with the "Most window functions" group? I think we could even just hardcode them in the compiler to not emit window clauses, assuming there isn't a cleverer way. |
I think we can close this as decided, please reopen for any disagreements! |
Currently we default to a window clause of including the whole window — aka
rows:..
SQL's default window definition changes depending on whether there's an
ORDER BY
, between:rows:..0
orexpanding:true
when anORDER BY
existsrows:..
when noORDER BY
existsSo we force that to be explicit. I'm not that confident we're doing the right thing here; the SQL default might be surprising, but you literally never want the whole window when supplying
ORDER BY
; or you wouldn't supply theORDER BY
.For context, this came up in a Discord discussion around this example:
to PRQL:
If we had the same defaults as SQL, the PRQL could instead be:
The text was updated successfully, but these errors were encountered: