Skip to content

Should window default to expanding with a sort? #1069

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
max-sixty opened this issue Oct 22, 2022 · 24 comments
Closed

Should window default to expanding with a sort? #1069

max-sixty opened this issue Oct 22, 2022 · 24 comments
Labels
language-design Changes to PRQL-the-language

Comments

@max-sixty
Copy link
Member

max-sixty commented Oct 22, 2022

Currently we default to a window clause of including the whole window — aka rows:..

SQL's default window definition changes depending on whether there's an ORDER BY, between:

  • expanding — aka rows:..0 or expanding:true when an ORDER BY exists
  • or the whole window — aka rows:.. when no ORDER BY exists

So we force that to be explicit. I'm not that confident we're doing the right thing here; the SQL default might be surprising, but you literally never want the whole window when supplying ORDER BY; or you wouldn't supply the ORDER BY.

For context, this came up in a Discord discussion around this example:

with raw as (select 
  timechunk, 
  keyword, 
  sumQuantity, 
  rank() over (
    partition by timechunk 
    order by sumQuantity desc
  ) as rank
FROM (
  select 
    toStartOfInterval(timestamp, interval 3 hour) as timechunk,
    keyword,
    sum(quantity) as sumQuantity
    from foo
    group by toStartOfInterval(timestamp, interval 3 hour), keyword
    order by 3
) 
ORDER BY 1, 4) SELECT * from raw where rank <= 5

to PRQL:

from foo
derive [timechunk = s"toStartOfInterval(timestamp, interval 3 hour)"]
group [keyword, timechunk] (
  aggregate [sumQuantity = sum quantity]
)
group [timechunk] (
  sort sumQuantity
  window expanding:true (
    derive rank = rank
  )
)
sort [timechunk, sumQuantity]
filter rank <= 5

If we had the same defaults as SQL, the PRQL could instead be:

from foo
derive [timechunk = s"toStartOfInterval(timestamp, interval 3 hour)"]
group [keyword, timechunk] (
  aggregate [sumQuantity = sum quantity]
)
group [timechunk] (
  sort sumQuantity
+  derive rank = rank
-  window expanding:true (
-    derive rank = rank
-  )
)
sort [timechunk, sumQuantity]
filter rank <= 5
@max-sixty max-sixty added the language-design Changes to PRQL-the-language label Oct 22, 2022
@snth
Copy link
Member

snth commented Oct 24, 2022

I think it makes sense to adopt the same defaults as SQL, since as you point out it doesn't make much sense otherwise.


On a perhaps (un)related note, I'm not a fan of the implicit window functions. I was surprised when I became of them in another thread with aljazerzen. I would rather have the window function be specified but get rid of the derive like we have for aggregate. So my proposal would be (I changed the sort to DESC since that seems to have got lost since the Discord thread and I also renamed chunkRank since I found rank = rank confusing):

from foo
derive [timechunk = s"toStartOfInterval(timestamp, interval 3 hour)"]
group [keyword, timechunk] (
  aggregate [sumQuantity = sum quantity]
)
group [timechunk] (
  sort [-sumQuantity]
  window [
    chunkRank = rank
  ]
)
sort [timechunk, sumQuantity]
filter chunkRank <= 5

I know that's not how our window works at the moment but bringing it up here again as it's a good example.

EDIT: From the playground apparently sort -sumQuantity doesn't work and it has to be sort [-sumQuantity].


Given the description of the query in English, I'm not sure you want a window function in this case though:

And the query in English is: For every three-hour bucket of timestamp, show the five most interesting keywords, where interesting is defined as the largest sum(quantity)

Actually, the more I'm looking into this I don't think I understand the current queries. I'll respond in a separate comment as it might get a bit lenghty.

@snth
Copy link
Member

snth commented Oct 24, 2022

I'm trying to understand the query as per description:

And the query in English is: For every three-hour bucket of timestamp, show the five most interesting keywords, where interesting is defined as the largest sum(quantity)

Original SQL

with raw as (select 
  timechunk, 
  keyword, 
  sumQuantity, 
  rank() over (
    partition by timechunk 
    order by sumQuantity desc
  ) as rank
FROM (
  select 
    toStartOfInterval(timestamp, interval 3 hour) as timechunk,
    keyword,
    sum(quantity) as sumQuantity
    from foo
    group by toStartOfInterval(timestamp, interval 3 hour), keyword
    order by 3
) 
ORDER BY 1, 4) SELECT * from raw where rank <= 5

Factoring out the FROM clause into a separate CTE:

with grouped as (
    select 
        toStartOfInterval(timestamp, interval 3 hour) as timechunk,
        keyword,
        sum(quantity) as sumQuantity
    from foo
    group by 
        toStartOfInterval(timestamp, interval 3 hour), 
        keyword
    order by 3
),
raw as (
    select 
        timechunk, 
        keyword, 
        sumQuantity, 
        rank() over (
            partition by timechunk 
            order by sumQuantity desc
         ) as rank
    from grouped
    order by 1, 4
) 
SELECT * 
from raw 
where rank <= 5

This is somewhat tangential to the question but I think this is a good example of why I believe in PRQL we should disallow inline (sub)queries in transforms such as FROM, JOIN, UNION - it makes things very confusing and breaks the linear dataflow. Given the existence of CTEs / table blocks, I argue it's ok to force people to put their transform logic there so that the reader then has a directed dataflow.


Returning to the question, in the raw CTE, does the RANK() not need to be over all rows in the partition? If the rows come streamed in in descending order of sumQuantity then wouldn't a simple ROW_NUMBER() be enough? (I'm not sure of the details tbh as I'm one of those that gets the details of the row selections in SQL Window clauses wrong all the time.)

I'm wondering whether the following query might not solve the problem as stated and be the simplest and most idiomatic PRQL?

from foo
derive [timechunk = s"toStartOfInterval(timestamp, interval 3 hour)"]
group [keyword, timechunk] (
  aggregate [sumQuantity = sum quantity]
)
group [timechunk] (
  sort [-sumQuantity]
  take 5
)

The generated SQL for this is the following and I can't tell if it's correct or not because I can't picture in my mind how the ROW_NUMBER() window function interacts with the GROUP BY of the query that it's embedded in. Would be good if we had some sample data to test these with.

WITH table_0 AS (
  SELECT
    keyword,
    toStartOfInterval(timestamp, interval 3 hour) AS timechunk,
    SUM(quantity) AS "sumQuantity",
    ROW_NUMBER() OVER (
      PARTITION BY toStartOfInterval(timestamp, interval 3 hour)
      ORDER BY
        SUM(quantity) DESC
    ) AS _rn_84
  FROM
    foo
  GROUP BY
    keyword,
    toStartOfInterval(timestamp, interval 3 hour)
)
SELECT
  table_0.*
FROM
  table_0
WHERE
  _rn_84 <= 5

@max-sixty
Copy link
Member Author

I'm wondering whether the following query might not solve the problem as stated and be the simplest and most idiomatic PRQL?

from foo
derive [timechunk = s"toStartOfInterval(timestamp, interval 3 hour)"]
group [keyword, timechunk] (
  aggregate [sumQuantity = sum quantity]
)
group [timechunk] (
  sort [-sumQuantity]
  take 5
)

This is brilliant — @aljazerzen spends a great deal of energy building the group & take interaction, and I don't use it! Thanks @snth. I may put this in the canonical examples.

@max-sixty
Copy link
Member Author

On a perhaps (un)related note, I'm not a fan of the implicit window functions. I was surprised when I became of them in another thread with aljazerzen.

Yeah, I put this in the bucket of "It's better. But it's less familiar with a mental model of SQL". I'm fine opening an issue and collecting thoughts; these issues are a delicate tradeoff rather than something we can discuss our way to a solution.

FWIW your final example — using the group & take transforms — is also using implicit window functions!

@aljazerzen
Copy link
Member

the SQL default might be surprising, but you literally never want the whole window when supplying ORDER BY; or you wouldn't supply the ORDER BY.

That's not true: when using LAG, you want the whole window but also want to have order. Same with RANK.

SELECT 
  id,
  value,
  LAG(value) OVER (ORDER BY id ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING),
  RANK() OVER (ORDER BY value ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
FROM test;

This works the same as it would with expanding window (which is the default), because window functions that map column -> column are not affected by range or rows at all.

So I think that being explicit is good in this case and we should stick to being predictable instead inheriting SQL defaults.

@aljazerzen
Copy link
Member

Some notes on how I think about windowing (consider this an optional read, I want to formulate my thoughts).

There are two types of windowing:

  • actual window functions that produce the whole column at once (LAG, RANK, NTILE, ...). They map n rows into n rows by definition, which means that specifying a window frame does not make sense, since they always use and produce the whole window at once.
  • aggregation functions that are computed for each row, from multiple rows (SUM, COUNT, FIRST_VALUE). They map n values into one, which means that it does make sense to adjust window frame i.e. which rows are used when computing a row value.

side note: if an aggregation window uses full window, the result can be broadcast to all rows, since it will be the same anyways.

@acruise
Copy link

acruise commented Dec 8, 2022

Hey folks, I'm the person who posted the example that motivated this ticket, thanks for your time and effort engaging with my use case!

A new wrinkle I'm thinking about: the clients of my software also expect to see the number of "other" records that didn't have one of the top-k-ranked values. Can you think of a natural/easy/feasible way to adjust the PRQL to include that information? e.g.:

Table schema

CREATE TABLE foo (
  timestamp timestamp not null,
  colour varchar(16) not null
)

Human intent query

"Top 3 colours by day"

Desired result:

  • 2001-02-03:
    • red: 5
    • blue: 7
    • brown: 9
    • [other]: 22
  • 2001-02-03:
    • red: 6
    • blue: 9
    • brown: 12
    • [other]: 34

@aljazerzen
Copy link
Member

I believe you are want to do this:

table daily_colors = (
    from foo
    group [day, colour] (aggregate cnt = count)
    group [day] (sort [-cnt] | derive rnk = row_number)
)
table top3 = (
    from daily_colors
    filter rnk <= 3
)
table others = (
    from daily_colors
    filter rnk > 3
    group [day] (aggregate cnt = count)
    derive color = '[other]'
)

# just concat the two tables together.
# we currenty don't have union implemented 
# and this is the workaround.
# yes, is's ugly
from s"SELECT * FROM top3 UNION ALL SELECT * FROM others"

And something seems off with produced SQL. The first table should not fit into one CTE...

@aljazerzen
Copy link
Member

(This example is about an actual window function, where frame does not matter. In my query, row_number is not within a window so default is used: the whole group)

@max-sixty
Copy link
Member Author

Briefly on one point — and sorry for not having responded above

That's not true: when using LAG, you want the whole window but also want to have order. Same with RANK.

...is absolutely right and I agree with the cases (LEAD is an even better example than LAG which technically would work with a rolling windows IIUC).

And more broadly, I'm not sure there's a reasonable standard for defaults beyond having things fully explicit

@max-sixty
Copy link
Member Author

max-sixty commented Dec 8, 2022

There are two types of windowing:

* actual window functions that produce the whole column at once (LAG, RANK, NTILE, ...). They map n rows into n rows by definition, which means that specifying a window frame does not make sense, since they always use and produce the whole window at once.

* aggregation functions that are computed for each row, from multiple rows (SUM, COUNT, FIRST_VALUE). They map n values into one, which means that it does make sense to adjust window frame i.e. which rows are used when computing a row value.

One slight refinement — it can still be helpful to have a window function in the first set of functions. Say you want "what's the rank of my score relative to the scores I've seen before" — then you want a RANK() OVER (ORDER BY date). (edit below)

I agree the majority case is quite different — which I guess is why SQL has different defaults. I'm not sure how to make a standard for those — both types technically reduce n values to 1 value for each row. (which agrees with your perspective on having the same default for both)

@aljazerzen
Copy link
Member

aljazerzen commented Dec 8, 2022

"what's the rank of my score relative to the scores I've seen before"

Well yes - you can think about passing only a rolling frame to rank, but it's implemented as an actual window function and it is not affected by frames at all. These are all equivalent:

RANK() OVER (ORDER BY date)
RANK() OVER (ORDER BY id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
RANK() OVER (ORDER BY id ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
RANK() OVER (ORDER BY id ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)

The last one especially does not make sense, but in SQL this doesn't even matter.

(edit: fixed CURRENT ROW)


I'm not sure there's a reasonable standard for defaults beyond having things fully explicit

I also think that. And this means that we can choose our default.

@max-sixty
Copy link
Member Author

My message above is wrong — I was thinking of ranking one column (e.g. score) in a rolling window of another (e.g. date). SQL doesn't do that.

Not sure if I'm still thinking about it wrong, but I would have thought that

RANK() OVER (ORDER BY id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT

...every row would be rank 1, because it's always the highest value in its frame?

Though BigQuery doesn't allow this when I attempted to test it —  Window framing clause is not allowed for analytic function rank at [4:31]. (Maybe that's what you're saying)

@aljazerzen
Copy link
Member

Yes, that probably because of what I'm saying. PG simply ignores the frame, while BQ apparently throws an error. Which means we should probably throw an error too...

This thread has many tangential examples and discussions. I'll open an issue about this last thing, but we need some closure on discussion about the frame default when sorting.

@snth
Copy link
Member

snth commented Dec 9, 2022

I got the query to work in Postgres, I just had to add the following line to top3:

    select [d, cnt, c]

Final:

table daily_colors = (
    from foo
    group [d, c] (aggregate cnt = count)
    group [d] (sort [-cnt] | derive rnk = row_number)
)
table top3 = (
    from daily_colors
    filter rnk <= 3
    select [d, cnt, c]
)
table others = (
    from daily_colors
    filter rnk > 3
    group [d] (aggregate cnt = count)
    derive c = '[other]'
)
# just concat the two tables together.
# we currenty don't have union implemented 
# and this is the workaround.
# yes, is's ugly
from s"SELECT * FROM top3 UNION ALL SELECT * FROM others"
sort [d, -cnt]

I also added the

sort [d, -cnt]

at the end because the results were in a random order. Interestingly this actually also produced an error because it got translated to

ORDER BY *, * DESC

because of the opaqueness of that s-string from.

Final SQL example I ran:

with foo(d, c) as (
    values (1, 'r'), (1, 'r'), (1, 'r'), (1, 'b'), (1, 'b'), (1, 'g'), (1, 'c'), (1, 'p'),
        (2, 'r'), (2, 'r'), (2, 'b'), (2, 'b'), (2, 'b'), (2, 'g'), (2, 'c')
),
daily_colors AS (
  SELECT
    d,
    c,
    COUNT(*) AS cnt,
    ROW_NUMBER() OVER (
      PARTITION BY d
      ORDER BY
        COUNT(*) DESC ROWS BETWEEN UNBOUNDED PRECEDING
        AND UNBOUNDED FOLLOWING
    ) AS rnk
  FROM
    foo
  GROUP BY
    d,
    c
),
others AS (
  SELECT
    d,
    COUNT(*) AS cnt,
    '[other]' AS c
  FROM
    daily_colors
  WHERE
    rnk > 3
  GROUP BY
    d
),
top3 AS (
  SELECT
    d,
    cnt,
    c
  FROM
    daily_colors
  WHERE
    rnk <= 3
),
table_1 AS (
  SELECT
    *
  FROM
    top3
  UNION
  ALL
  SELECT
    *
  FROM
    others
)
SELECT
  *
FROM
  table_1 AS table_0
ORDER BY
  d,
  cnt DESC

which produced

1	3	r
1	2	[other]
1	2	b
1	1	p
2	3	b
2	2	r
2	1	g
2	1	[other]

@snth
Copy link
Member

snth commented Dec 9, 2022

I had the same expectation as @max-sixty that every row would be rank 1 in the example below:

Not sure if I'm still thinking about it wrong, but I would have thought that

RANK() OVER (ORDER BY id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT

...every row would be rank 1, because it's always the highest value in its frame?

I tested this in postgres with:

with foo(i, n) as (
    values (1, 300), (2, 50000), (3, 1),
        (4, 20), (5, 600000), (6, 4000)
)
select *
    , RANK() OVER (ORDER BY n)
    , RANK() OVER (ORDER BY n ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as rank2
    , RANK() OVER (ORDER BY n ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as rank3
    , RANK() OVER (ORDER BY n ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) as rank4
    , SUM(n) OVER (ORDER BY n)
    , SUM(n) OVER (ORDER BY n ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as sum2
    , SUM(n) OVER (ORDER BY n ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as sum3
    , SUM(n) OVER (ORDER BY n ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) as sum4
from foo
order by i

which produces
image

The SUM() examples behave as I expect and respect the supplied frames, so why does RANK() just ignore the frames as you said @aljazerzen ? That's very surprising and confusing.

@aljazerzen
Copy link
Member

Because RANK computes the whole column at once while SUM computes an aggregate at each of the rows.

Imagine you'd have:

def rank(values: List(number)) -> List(number):
    return [i for i, v in enumerate(values)]

def sum(values: List(number)) -> number:
    s = 0
    for v in values:
        s += v
    return s

For implementation of window functions, you'd check if return value of the function is List or not:

  • if it is, call it only once and use output as the new column,
  • if it is not, iterate over rows and call the function once for each row. Here you can limit what you pass into the call by the window frame.

Note 1: second option is probably not implemented like this (having O(n^2)) - you can use cumsum, but this is just an illustration.

Note 2: rank could be implemented as aggregation:

def rank(values: List(number)) -> number:
    return len(values)

... but this is much less more efficient and does not hold for all such functions, such as NTILE.

Note 3: both my RANK impls are simplified, they don't handle ties. It could be that you can't implement actual RANK as aggregation.


In summary: actual window functions don't use window frame (or even reject it BQ).

So I suggest we default to no frame (that it the whole table/window/group).

@snth
Copy link
Member

snth commented Dec 9, 2022

Thanks. I understand there may be performance implications and some combinations may not be feasible. My complaint is more about how this is documented and communicated. I admit I probably did SQL for 10 years before I ever even heard of WINDOW functions but I was not aware of this distinction until now. Looking over the Postgresql documentation again now, I see it is in there but IMHO it's quite subtle:

9.22. Window Functions

Note that first_value, last_value, and nth_value consider only the rows within the “window frame”, which by default contains the rows from the start of the partition through the last peer of the current row. ...
...
When an aggregate function is used as a window function, it aggregates over the rows within the current row's window frame.

I guess the implication is that the rest of the functions in that list DO NOT consider the "window frame".

There's also 4.2.8. Window Function Calls which has a section

The optional frame_clause can be one of
...

I just skimmed that section now but there's nothing jumping out at me that says that the optional frame_clause only applies to some window functions.

I think maybe one way this could be made clearer would be to call first_value, last_value, and nth_value FRAME FUNCTIONS which are similar to WINDOW FUNCTIONS and use the same OVER (...) syntax with the addition that FRAME FUNCTIONS can also accept an optional frame_clause. Moreover AGGREGATE FUNCTIONS that act over a OVER (...) clause are actually FRAME FUNCTIONS.

I think that's how I'm going to try and remember it internally.


I find this a pretty tricky space with many footguns lying around. I agree with you that we should try and make this consistent in PRQL. I'll need to step away from this for a bit to think about what I think would be a better behaviour to try and keep it logical and consistent for our users.


Two related functions where I've also gotten tripped up in the past and spent hours debugging queries is with FIRST_VALUE and LAST_VALUE. I don't recall the details now but I vaguely recall that you're better off with FIRST_VALUE () OVER (ORDER BY ... DESC) than simply LAST_VALUE() OVER (ORDER BY ...) but I don't recall why. It also had something to do with the FRAME.

@max-sixty
Copy link
Member Author

max-sixty commented Dec 9, 2022

So to confirm my SQL understanding, it seems there are three categories of window / "analytic" functions:

  • Aggregate analytic functions — sum, count, etc — aggregate functions with a window clause
    • By default, these use RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW (edit:) if an ORDER BY exists
    • ...or the full window if there's no ORDER BY
  • Most "window functions"
    • Two types: Navigation functions such as LAG and Numbering functions such as RANK
    • In BQ, these don't accept window clauses, raising an error — they always operate over the full window
    • Postgres docs seem to imply the same default, (though by omission, and I haven't tested it)
  • _VALUE window functions — FIRST_VALUE, LAST_VALUE, NTH_VALUE
    • These do accept window clauses
    • In BQ & postgres these use the same RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW default (reference below for postgres, and I tested it myself on BQ1)

      Note that first_value, last_value, and nth_value consider only the rows within the “window frame”, which by default contains the rows from the start of the partition through the last peer of the current row. This is likely to give unhelpful results for last_value and sometimes also nth_value. You can redefine the frame by adding a suitable frame specification (RANGE, ROWS or GROUPS) to the OVER clause. See Section 4.2.8 for more information about frame specifications.

References:

Is that right?

Footnotes

  1. BQ states "You can't use a window frame clause with navigation functions and numbering functions" in their docs, and lists the _VALUE functions in Navigation functions. But it also contains examples of _VALUE functions using window clauses, and they work when testing

@snth
Copy link
Member

snth commented Dec 10, 2022

That sums it up well and is my understanding as well.

@max-sixty
Copy link
Member Author

max-sixty commented Dec 10, 2022

If that's the case, then I can see a few options for our defaults:

  1. Inherit SQL defaults
    • Everything works now. The defaults are arguably normally the thing folks want. But the defaults are complicated and confusing.
  2. Default to a full window
    • Until we encode the SQL defaults, some dialects won't be able to do LAG / LEAD, because they'll raise an error when we try and supply a window clause.
    • ...though maybe we could add a type which will not write a window clause, because we know the default is the full window
    • It's a very simple standard
  3. Try and come up with some better standard
    • Based on @aljazerzen 's points, I don't think this is viable — any standard that does what you want most of the time ends up being overly complicated to document and understand

I would probably vote for 1 2: Full Window if we think it's possible to work around using something like types in the near future (and ofc happy to help with the work). I could also see resting on 2 1: Inherit SQL Defaults for a while being reasonable, before later switching.

@aljazerzen
Copy link
Member

I vote +2 on the full window (which is also the current behavior).

@max-sixty
Copy link
Member Author

(I realize my numbers didn't agree with my titles, so fixed above)

Full window was also my vote; do you think it's viable to also let dialects like BQ work with the "Most window functions" group? I think we could even just hardcode them in the compiler to not emit window clauses, assuming there isn't a cleverer way.

@max-sixty
Copy link
Member Author

I think we can close this as decided, please reopen for any disagreements!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
language-design Changes to PRQL-the-language
Projects
None yet
Development

No branches or pull requests

4 participants