Description
The question is: where should null
s values end up in when sort
-ing a relation?
A followup for #2617
This justification was made by dissecting "what sorting is" down to formal mathematical (unwritten) definition of relations and PRQL. So bear with me, it may look like it's getting way to broad and abstract, but I believe that semantics of the language will remain as concise and orthogonal as they are now only if we really know what we are designing.
As a start, I'd suggest that instead of thinking about where the NULLs should be, we design this by figuring out what comparison operation should return for null values and derive the behavior of sort from that.
So, these are the key questions:
1 < 1 -> false (obviously)
1 < 2 -> true (obviously)
1 < null -> ?
2 < 1 -> false (obviously)
2 < 2 -> false (obviously)
2 < null -> ?
null < 1 -> ?
null < 2 -> ?
null < null -> ?
There are three ways of filling this out:
null < x
for every non-null x,x < null
for every non-null x,false
for all comparisons to null.
Option 1. and 2. are straight forward: if null is less than every other value, this translates to DESC NULLS LAST and ASC NULLS FIRST. If null is greater than every other value, it's the other way around.
But option 3 is interesting. What happens if null is not less, greater or equal to any number?
This is (of course) well researched in math and it is named a (strict) partial ordering. The <
operator (they call it 'relation', but I'm not going to do that to the reader) must be ireflexive, anti-symmetric and transitive. With these properties it makes sense to define that a "sorted sequence" has every pair a < b
in such positions that a is before b.
But this partial ordering does not imply that there is a single unique sorted sequence.
For example, take a set of products and a operator "is made from". You can have a chain "wood < wooden board < desk". But there is also "wood < wooden pillar < house". So these are two separate sorted chains with a common predecessor. But there is no relation between desk and house, so they cannot be merged into a single chain. Nevertheless, you can construct a sorted sequence such that every element comes after all its components:
wood, wooden board, desk, wooden pillar, house
The sorted sequence of this ordering is not unique as you could also have:
wood, wooden board, wooden pillar, desk, house
... and the desired property would still hold.
In practice, we are usually dealing with sets where at least one of the following is true: a < b or b < a or a == b
. In words this means roughly "you can compare any two elements". Such constructs are called "total ordering" and provide a way to construct a unique sequence of elements where for every two consecutive elements a and b, it holds that a < b
.
Fortunately, we don't need to deal with the most general case from the theory of sorting. Our base case are sets that have a total ordering (i.e. ints, strings, bools).
When nulls are added to the set, it becomes a partial ordering, because (using option 3 above) null is not greater, equal or less to any other value. Sequences constructed from this partial ordering could contain nulls at any position, as that would not violate any of the <
operators.
When we use NULLS FIRST/LAST in SQL, we extend the <
operator to say "nulls are less than any other value" or "nulls are greater than any other value", depending on what we need. The result is a total ordering which has the unique sorted sequence that we want.
So, for the purposes of PRQL, I suggest that we separate the "sorting" from the "extending < operator so partial ordering becomes total ordering":
from tracks
sort (nulls_last billing_state)
Function nulls_last
would take any primitive value and map it to something (let's imagine integers) such that nulls will be last and all other values would preserve order from the initial type. But that's formal definition, in practice we'd compile to NULLS FIRST or NULLS LAST. We would also have to make sure that both of these get compiled correctly:
from tracks
sort (nulls_last (-billing_state)) # DESC NULLS LAST
sort (-(nulls_last billing_state)) # DESC NULLS FIRST
What this implies is that using plain sort billing_state
is attempting to use a partial order for sorting. We can decide on a behavior here:
- allow it. Partial ordering may have nulls at any position, which means that both NULLS LAST and NULLS FIRST are "correct" because PRQL does not guarantee anything here.
- deny it. We can make a nice error message saying "this may contain nulls, use
nulls_first
ornulls_last
. A bit less convenient, but much more reproducible.