-
Notifications
You must be signed in to change notification settings - Fork 228
Ordering, sorting and nulls #2622
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Thanks for an "@aljazerzen epic"! 😁 An incomplete quick question clarifying one thing, partly because I got confused by reading docs on IIUC (Clearly it implies an ordering conditional on the value of Is that your understanding too? From the DuckDB issue:
|
FWIW I think this would be much too restrictive for a language like PRQL. But given the prevalence of columns which a) don't have nulls and b) aren't typed as not having nulls — that would add a bump for users without fairly sparse benefits IMO. I could imagine it being good for a more precise representation, such as an IR — forcing the higher level representation to make a reasonable default... |
I think the argument here is that this is related to the fact that a comparison to duckdb> select 1 = NULL;
┌────────────┐
│ (1 = NULL) │
╞════════════╡
│ │
└────────────┘
Elapsed: 7 ms
duckdb> select 1 != NULL;
┌─────────────┐
│ (1 != NULL) │
╞═════════════╡
│ │
└─────────────┘
Elapsed: 6 ms
duckdb> select 1 != 2;
┌──────────┐
│ (1 != 2) │
╞══════════╡
│ true │
└──────────┘
Elapsed: 6 ms On the other hand, PRQL is expected to always return prql/web/book/src/language-features/null.md Lines 18 to 28 in 2e03033
|
Yes, I doesn't imply general ordering, just an ordering for this particular ORDER BY clause. If I had to write down format semantics of ORDER BY in SQL, I would define new
Oh yes, of course - I didn't mention that, but |
That's true, being strict here does not have much benefit and would be quite inconvenient. What do you think about defaulting to infer all columns as not-null? In such scenario, the error about "missing nulls_last" would only show up for columns that were explicitly marked as nullable. I have a feeling that this would still be too inconvenient. |
I do think this is better — I think it mostly depends on how flexible the assignment of nullable / non-nullable columns is. We can def see how it looks. Assuming we don't force users to be explicit, which is the design we most prefer? I agree that there's no perfect choice, we should choose something though (which could be "defer to the DB")... |
Well, my current idea on how to declare tables looks like this:
Ordering enforcement would look like this:
So the 3 viable options are:
Note that there is no DuckDB's option 3 or 4 on the table, because they have different ordering depending whether you are using DESC or ASC - which is mixing two things that should be orthogonal. |
Re (3) — at the moment we don't have an ability to specify column types, and even when we add that, most users won't be specifying columns, so we do need to decide some way to proceed without knowing the type of the column. (Or lmk if we disagree there?) Apart from this, I don't have a particularly strong view. I'm not enamored by (1) or (2) — I don't yet understand why nulls should necessarily always be bigger / smaller than non-null values, vs being treated differently. The DuckDB approach of "nulls always go at the end" seemed reasonable to me. How would we evaluate |
That is partially true: if we go for option (3), we can delay this decision to when we support column type definitions. It is not entirely true, because we should think about what happens when users sort by a column that is inferred to be non-null, the compiler figures it doesn't have to bother with "where do null values go" it outputs a query that assumes there will be no null values, and then there are null values. If we wanted to be precise, we'd compile to a query that validates that the input conforms to the type definitions we inferred (or were defined). If they do not hold, entire query may be invalid / may produce errors somewhere else, which means we should abort as soon as possible. I really don't want to go down the road of "oh it's fine, let's execute the query anyway", because many guarantees of what a PRQL query will output cannot hold anymore, if it has been compiled for a different scenario than what it is used in. In such case, we cannot formally define where the nulls should go, because formally, there should be no nulls! Actually, the more I think about it, we probably cannot afford to infer columns non-null by default. With currently proposed semantics, |
As discussed on dev call — we all quite liked "If it's not sortable, it goes at the end". This would imply We're going to think more about this, if we agree we can re-merge |
I agree, out of all options, This here is a working draft, I'll update it later and maybe move it into the book. Definitions
Note: the types of sorting sub-arrays must be computable statically. Their cardinality is associates with the number of elements in a type. Theorem: every ordering has a unique sorting, iff each of the sub-arrays has a unique cardinality.
Examples:int
int || null
some weird enum, partially comparable
NotesThere is no such thing as "ascending" or "descending" sorting. You can, however, change the ordering, such that iff TODO
|
I agree with these semantics! In these terms — what defines the ordering of the sub-arrays themselves? Why does the sub-array with nulls go last? A similar way of formulating this is "There's a hegemonic type for each column. Anything not comparable to that type goes at the end." That accounts for nulls, but not for different incomparable Enums. |
Do we want to merge the (I'm happy to resuscitate that PR if we are) |
Yea, that would work. I didn't want to give nulls any special status here. To keep things orthogonal :D
Yes, I think this case is strong enough to justify NULLS LAST. |
FYI because of #2754 (comment), this is no longer urgent / a blocker for 0.9.0... |
We rolled this back, awaiting the outcome of PRQL#2622
We rolled this back, awaiting the outcome of PRQL#2622
Hello @max-sixty is there a way today to specify |
The question is: where should
null
s values end up in whensort
-ing a relation?A followup for #2617
This justification was made by dissecting "what sorting is" down to formal mathematical (unwritten) definition of relations and PRQL. So bear with me, it may look like it's getting way to broad and abstract, but I believe that semantics of the language will remain as concise and orthogonal as they are now only if we really know what we are designing.
As a start, I'd suggest that instead of thinking about where the NULLs should be, we design this by figuring out what comparison operation should return for null values and derive the behavior of sort from that.
So, these are the key questions:
There are three ways of filling this out:
null < x
for every non-null x,x < null
for every non-null x,false
for all comparisons to null.Option 1. and 2. are straight forward: if null is less than every other value, this translates to DESC NULLS LAST and ASC NULLS FIRST. If null is greater than every other value, it's the other way around.
But option 3 is interesting. What happens if null is not less, greater or equal to any number?
This is (of course) well researched in math and it is named a (strict) partial ordering. The
<
operator (they call it 'relation', but I'm not going to do that to the reader) must be ireflexive, anti-symmetric and transitive. With these properties it makes sense to define that a "sorted sequence" has every paira < b
in such positions that a is before b.But this partial ordering does not imply that there is a single unique sorted sequence.
For example, take a set of products and a operator "is made from". You can have a chain "wood < wooden board < desk". But there is also "wood < wooden pillar < house". So these are two separate sorted chains with a common predecessor. But there is no relation between desk and house, so they cannot be merged into a single chain. Nevertheless, you can construct a sorted sequence such that every element comes after all its components:
The sorted sequence of this ordering is not unique as you could also have:
... and the desired property would still hold.
In practice, we are usually dealing with sets where at least one of the following is true:
a < b or b < a or a == b
. In words this means roughly "you can compare any two elements". Such constructs are called "total ordering" and provide a way to construct a unique sequence of elements where for every two consecutive elements a and b, it holds thata < b
.Fortunately, we don't need to deal with the most general case from the theory of sorting. Our base case are sets that have a total ordering (i.e. ints, strings, bools).
When nulls are added to the set, it becomes a partial ordering, because (using option 3 above) null is not greater, equal or less to any other value. Sequences constructed from this partial ordering could contain nulls at any position, as that would not violate any of the
<
operators.When we use NULLS FIRST/LAST in SQL, we extend the
<
operator to say "nulls are less than any other value" or "nulls are greater than any other value", depending on what we need. The result is a total ordering which has the unique sorted sequence that we want.So, for the purposes of PRQL, I suggest that we separate the "sorting" from the "extending < operator so partial ordering becomes total ordering":
Function
nulls_last
would take any primitive value and map it to something (let's imagine integers) such that nulls will be last and all other values would preserve order from the initial type. But that's formal definition, in practice we'd compile to NULLS FIRST or NULLS LAST. We would also have to make sure that both of these get compiled correctly:What this implies is that using plain
sort billing_state
is attempting to use a partial order for sorting. We can decide on a behavior here:nulls_first
ornulls_last
. A bit less convenient, but much more reproducible.The text was updated successfully, but these errors were encountered: