Skip to content

Ordering, sorting and nulls #2622

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
aljazerzen opened this issue May 24, 2023 · 16 comments
Open

Ordering, sorting and nulls #2622

aljazerzen opened this issue May 24, 2023 · 16 comments
Labels
language-design Changes to PRQL-the-language

Comments

@aljazerzen
Copy link
Member

aljazerzen commented May 24, 2023

The question is: where should nulls values end up in when sort-ing a relation?

A followup for #2617

This justification was made by dissecting "what sorting is" down to formal mathematical (unwritten) definition of relations and PRQL. So bear with me, it may look like it's getting way to broad and abstract, but I believe that semantics of the language will remain as concise and orthogonal as they are now only if we really know what we are designing.


As a start, I'd suggest that instead of thinking about where the NULLs should be, we design this by figuring out what comparison operation should return for null values and derive the behavior of sort from that.

So, these are the key questions:

1    < 1     ->   false  (obviously)
1    < 2     ->   true   (obviously)
1    < null  ->   ?
2    < 1     ->   false  (obviously)
2    < 2     ->   false  (obviously)
2    < null  ->   ?
null < 1     ->   ?
null < 2     ->   ?
null < null  ->   ?

There are three ways of filling this out:

  1. null < x for every non-null x,
  2. x < null for every non-null x,
  3. false for all comparisons to null.

Option 1. and 2. are straight forward: if null is less than every other value, this translates to DESC NULLS LAST and ASC NULLS FIRST. If null is greater than every other value, it's the other way around.

But option 3 is interesting. What happens if null is not less, greater or equal to any number?

This is (of course) well researched in math and it is named a (strict) partial ordering. The < operator (they call it 'relation', but I'm not going to do that to the reader) must be ireflexive, anti-symmetric and transitive. With these properties it makes sense to define that a "sorted sequence" has every pair a < b in such positions that a is before b.

But this partial ordering does not imply that there is a single unique sorted sequence.

For example, take a set of products and a operator "is made from". You can have a chain "wood < wooden board < desk". But there is also "wood < wooden pillar < house". So these are two separate sorted chains with a common predecessor. But there is no relation between desk and house, so they cannot be merged into a single chain. Nevertheless, you can construct a sorted sequence such that every element comes after all its components:

wood, wooden board, desk, wooden pillar, house

The sorted sequence of this ordering is not unique as you could also have:

wood, wooden board, wooden pillar, desk, house

... and the desired property would still hold.

In practice, we are usually dealing with sets where at least one of the following is true: a < b or b < a or a == b. In words this means roughly "you can compare any two elements". Such constructs are called "total ordering" and provide a way to construct a unique sequence of elements where for every two consecutive elements a and b, it holds that a < b.

Fortunately, we don't need to deal with the most general case from the theory of sorting. Our base case are sets that have a total ordering (i.e. ints, strings, bools).

When nulls are added to the set, it becomes a partial ordering, because (using option 3 above) null is not greater, equal or less to any other value. Sequences constructed from this partial ordering could contain nulls at any position, as that would not violate any of the < operators.

When we use NULLS FIRST/LAST in SQL, we extend the < operator to say "nulls are less than any other value" or "nulls are greater than any other value", depending on what we need. The result is a total ordering which has the unique sorted sequence that we want.

So, for the purposes of PRQL, I suggest that we separate the "sorting" from the "extending < operator so partial ordering becomes total ordering":

from tracks
sort (nulls_last billing_state)

Function nulls_last would take any primitive value and map it to something (let's imagine integers) such that nulls will be last and all other values would preserve order from the initial type. But that's formal definition, in practice we'd compile to NULLS FIRST or NULLS LAST. We would also have to make sure that both of these get compiled correctly:

from tracks
sort (nulls_last (-billing_state))  # DESC NULLS LAST
sort (-(nulls_last billing_state))  # DESC NULLS FIRST

What this implies is that using plain sort billing_state is attempting to use a partial order for sorting. We can decide on a behavior here:

  • allow it. Partial ordering may have nulls at any position, which means that both NULLS LAST and NULLS FIRST are "correct" because PRQL does not guarantee anything here.
  • deny it. We can make a nice error message saying "this may contain nulls, use nulls_first or nulls_last. A bit less convenient, but much more reproducible.
@max-sixty
Copy link
Member

max-sixty commented May 24, 2023

Thanks for an "@aljazerzen epic"! 😁

An incomplete quick question clarifying one thing, partly because I got confused by reading docs on NULLS LAST out there:

IIUC NULLS LAST doesn't imply a general ordering of nulls against other values. In particular, it has opposite orderings for ASC & DESC. For ASC they're bigger than other values, for DESC, they're smaller than other values.

(Clearly it implies an ordering conditional on the value of ASC / DESC — we have to put the nulls somewhere. And in e.g. BigQuery, Nulls are always smaller than other values without specifying NULLS ...)

Is that your understanding too?

From the DuckDB issue:

CREATE TABLE integers(i integer);
INSERT INTO integers VALUES (1), (2), (3), (NULL);
SELECT * FROM integers ORDER BY i;
┌───────┐
│   i   │
│ int32 │
├───────┤
│     1 │
│     2 │
│     3 │
│  NULL │
└───────┘

With NULLS LAST, the desired values are returned instead:

SELECT * FROM integers ORDER BY i NULLS LAST LIMIT 1;
┌───────┐
│   i   │
│ int32 │
├───────┤
│     1 │
└───────┘
SELECT * FROM integers ORDER BY i DESC NULLS LAST LIMIT 1;
┌───────┐
│   i   │
│ int32 │
├───────┤
│     3 │
└───────┘

@max-sixty
Copy link
Member

deny it. We can make a nice error message saying "this may contain nulls, use nulls_first or nulls_last. A bit less convenient, but much more reproducible.

FWIW I think this would be much too restrictive for a language like PRQL. But given the prevalence of columns which a) don't have nulls and b) aren't typed as not having nulls — that would add a bump for users without fairly sparse benefits IMO.

I could imagine it being good for a more precise representation, such as an IR — forcing the higher level representation to make a reasonable default...

@eitsupi
Copy link
Member

eitsupi commented May 25, 2023

IIUC NULLS LAST doesn't imply a general ordering of nulls against other values. In particular, it has opposite orderings for ASC & DESC. For ASC they're bigger than other values, for DESC, they're smaller than other values.

I think the argument here is that this is related to the fact that a comparison to NULL always returns NULL in SQL.

duckdb> select 1 = NULL;
┌────────────┐
│ (1 = NULL) │
╞════════════╡
│            │
└────────────┘
Elapsed: 7 ms

duckdb> select 1 != NULL;
┌─────────────┐
│ (1 != NULL) │
╞═════════════╡
│             │
└─────────────┘
Elapsed: 6 ms

duckdb> select 1 != 2;
┌──────────┐
│ (1 != 2) │
╞══════════╡
│     true │
└──────────┘
Elapsed: 6 ms

On the other hand, PRQL is expected to always return true or false in comparison to null.

PRQL, on the other hand, treats `null` as a value, which means that:
- `null == null` evaluates to `true`,
- `null != null` evaluates to `false`,
- distinct column cannot contain multiple `null` values.
```prql
from employees
filter first_name == null
filter null != last_name
```

@aljazerzen
Copy link
Member Author

IIUC NULLS LAST doesn't imply a general ordering of nulls against other values. In particular, it has opposite orderings for ASC & DESC. For ASC they're bigger than other values, for DESC, they're smaller than other values.

Yes, I doesn't imply general ordering, just an ordering for this particular ORDER BY clause. If I had to write down format semantics of ORDER BY in SQL, I would define new < operator for every combination of type being compared, ASC/DESC and NULLS FIRST/LAST.


I think the argument here is that this is related to the fact that a comparison to NULL always returns NULL in SQL.

Oh yes, of course - I didn't mention that, but <, >, <=, >= and == should always a boolean, otherwise semantics get much more complicated. Fortunately, we vowed not to inherit things like this from SQL, which means that PRQL's null is quite a different thing than SQL's NULL.

@aljazerzen
Copy link
Member Author

@max-sixty FWIW I think this would be much too restrictive for a language like PRQL. But given the prevalence of columns which a) don't have nulls and b) aren't typed as not having nulls — that would add a bump for users without fairly sparse benefits IMO.

That's true, being strict here does not have much benefit and would be quite inconvenient.

What do you think about defaulting to infer all columns as not-null? In such scenario, the error about "missing nulls_last" would only show up for columns that were explicitly marked as nullable.

I have a feeling that this would still be too inconvenient.

@max-sixty
Copy link
Member

What do you think about defaulting to infer all columns as not-null? In such scenario, the error about "missing nulls_last" would only show up for columns that were explicitly marked as nullable.

I have a feeling that this would still be too inconvenient.

I do think this is better — I think it mostly depends on how flexible the assignment of nullable / non-nullable columns is. We can def see how it looks.


Assuming we don't force users to be explicit, which is the design we most prefer? I agree that there's no perfect choice, we should choose something though (which could be "defer to the DB")...

@aljazerzen
Copy link
Member Author

aljazerzen commented May 26, 2023

Well, my current idea on how to declare tables looks like this:

# table is just a declaration without a value
let foo <{
   id = int,
   name = str || null,
}>

Ordering enforcement would look like this:

from foo
sort name    # err: name can be null, use nulls_first or nulls_last

So the 3 viable options are:

  1. null < x for every non-null x (this would be DuckDB's option 1),
  2. null > x for every non-null x (this would be DuckDB's option 2),
  3. infer all columns to be non-null by default and require using nulls_first when a column is marked as nullable

Note that there is no DuckDB's option 3 or 4 on the table, because they have different ordering depending whether you are using DESC or ASC - which is mixing two things that should be orthogonal.

@max-sixty max-sixty mentioned this issue May 27, 2023
@max-sixty
Copy link
Member

So the 3 viable options are:

1. `null < x` for every non-null x (this would be [DuckDB's option 1](https://github.com/PRQL/prql/pull/2617#issuecomment-1560839169)),

2. `null > x` for every non-null x (this would be [DuckDB's option 2](https://github.com/PRQL/prql/pull/2617#issuecomment-1560839169)),

3. infer all columns to be non-null by default and require using `nulls_first` when a column is marked as nullable

Note that there is no DuckDB's option 3 or 4 on the table, because they have different ordering depending whether you are using DESC or ASC - which is mixing two things that should be orthogonal.

Re (3) — at the moment we don't have an ability to specify column types, and even when we add that, most users won't be specifying columns, so we do need to decide some way to proceed without knowing the type of the column. (Or lmk if we disagree there?)

Apart from this, I don't have a particularly strong view. I'm not enamored by (1) or (2) — I don't yet understand why nulls should necessarily always be bigger / smaller than non-null values, vs being treated differently. The DuckDB approach of "nulls always go at the end" seemed reasonable to me.

How would we evaluate 5 < null? If we're not doing three-valued logic at all, then the answer becomes easier, but I'm not sure if that's viable with SQL databases as targets?

@aljazerzen
Copy link
Member Author

That is partially true: if we go for option (3), we can delay this decision to when we support column type definitions.

It is not entirely true, because we should think about what happens when users sort by a column that is inferred to be non-null, the compiler figures it doesn't have to bother with "where do null values go" it outputs a query that assumes there will be no null values, and then there are null values.

If we wanted to be precise, we'd compile to a query that validates that the input conforms to the type definitions we inferred (or were defined). If they do not hold, entire query may be invalid / may produce errors somewhere else, which means we should abort as soon as possible.

I really don't want to go down the road of "oh it's fine, let's execute the query anyway", because many guarantees of what a PRQL query will output cannot hold anymore, if it has been compiled for a different scenario than what it is used in. In such case, we cannot formally define where the nulls should go, because formally, there should be no nulls!

Actually, the more I think about it, we probably cannot afford to infer columns non-null by default.


With currently proposed semantics, 5 < null should compile to COALESCE(5 < NULL, FALSE).

@max-sixty max-sixty added the needs-discussion Undecided dilemma label Jun 3, 2023
This was referenced Jun 3, 2023
@max-sixty
Copy link
Member

As discussed on dev call — we all quite liked "If it's not sortable, it goes at the end".

This would imply NULLS LAST

We're going to think more about this, if we agree we can re-merge NULLS LAST. (which is not un-revertable but we should try and avoid...)

@aljazerzen
Copy link
Member Author

aljazerzen commented Jun 4, 2023

I agree, out of all options, NULLS LAST is the most convenient, by far. I'll try to find a formulation that would implement that, while also being counted as "format semantics definition".

This here is a working draft, I'll update it later and maybe move it into the book.

Definitions

  • an ordering is expressed as a function with:

    • signature a<T> b<T> -> bool (non-null, we use notation a < b),
    • is ireflexive: not (a < a)
    • is asymmetric: if a < b then not b < a
    • is transitive: if a < b && b < c then a < c
  • a sorting is an array of arrays, where for every pair a<T> and b<T> it holds that:

    • if a < b, then a and b are placed in same sub-array and a is positioned before b,
    • if not (a < b || b < a), a and b are placed in separate sub-arrays,
    • order of sub-arrays is determined by the cardinality of sub-array's item type, from most to least.

Note: the types of sorting sub-arrays must be computable statically. Their cardinality is associates with the number of elements in a type.

Theorem: every ordering has a unique sorting, iff each of the sub-arrays has a unique cardinality.

  • function sort takes an ordering and returns a flattened sorting,
  • function min takes an ordering and returns last element of first sub-array of the sorting,
  • function max takes an ordering and returns first element of first sub-array of the sorting,

Examples:

int

Values:
1, 2, 3

Ordering:
1 < 2
2 < 3
1 < 3

Sorting:
[
  [1, 2, 3]
]

int || null

Values:
1, 2, 3, null

Ordering:
1 < 2
2 < 3
1 < 3

Sorting:
[
  [1, 2, 3], # type `int`, cardinality 2^64
  [null],    # type `null`, cardinality 1
]

sort: [1, 2, 3, null]
min: 1
max: 3

some weird enum, partially comparable

Values:
A, B, C, x, y, null

Ordering:
A < B
B < C
A < C
x < y

Sorting:
[
  [A, B, C], # type `A | B | C`, cardinality 3
  [x, y],    # type `x | y`, cardinality 2
  [null],    # type `null`, cardinality 1
]

sort: [A, B, C, x, y, null]
min: A
max: C

Notes

There is no such thing as "ascending" or "descending" sorting. You can, however, change the ordering, such that iff a < b in old sorting then in the new sorting b < a. This will reverse each of the sorting sub-arrays, but will not change the order of the main sorting array.

TODO

  1. I didn't yet think about what happens when a single field of the tuple provided to the sort functions is null. It's obvious what we want to happen, but I need to formulate how we are comparing tuples.

@aljazerzen aljazerzen added the language-design Changes to PRQL-the-language label Jun 8, 2023
@max-sixty
Copy link
Member

I agree with these semantics! In these terms — what defines the ordering of the sub-arrays themselves? Why does the sub-array with nulls go last?

A similar way of formulating this is "There's a hegemonic type for each column. Anything not comparable to that type goes at the end." That accounts for nulls, but not for different incomparable Enums.

@max-sixty
Copy link
Member

Do we want to merge the NULLS LAST? Is this refining how we're conceptualizing it or making a decision on whether to proceed?

(I'm happy to resuscitate that PR if we are)

@aljazerzen
Copy link
Member Author

what defines the ordering of the sub-arrays themselves?

order of sub-arrays is determined by the cardinality of sub-array's item type, from most to least.


A similar way of formulating this is "There's a hegemonic type for each column. Anything not comparable to that type goes at the end." That accounts for nulls, but not for different incomparable Enums.

Yea, that would work. I didn't want to give nulls any special status here. To keep things orthogonal :D


Do we want to merge the NULLS LAST?

Yes, I think this case is strong enough to justify NULLS LAST.

@max-sixty
Copy link
Member

FYI because of #2754 (comment), this is no longer urgent / a blocker for 0.9.0...

max-sixty added a commit to max-sixty/prql that referenced this issue Aug 1, 2023
We rolled this back, awaiting the outcome of PRQL#2622
max-sixty added a commit to max-sixty/prql that referenced this issue Aug 1, 2023
We rolled this back, awaiting the outcome of PRQL#2622
@aljazerzen aljazerzen removed the needs-discussion Undecided dilemma label Oct 24, 2023
@lukapeschke
Copy link
Contributor

Hello @max-sixty is there a way today to specify NULLS LAST when using sort ? I have not been able to figure out if there was a way to use an s-string for that

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
language-design Changes to PRQL-the-language
Projects
None yet
Development

No branches or pull requests

4 participants