
Improve multi-column sorting for primitive arrays by avoiding multi-column sort #7532

Open
Dandandan opened this issue May 20, 2025 · 7 comments
Labels
enhancement (Any new improvement worthy of an entry in the changelog), performance

Comments

@Dandandan
Contributor

Dandandan commented May 20, 2025

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
When sorting by multiple columns, e.g. two u32 columns, we take the approach of creating a LexicalComparator and sorting with it.
However, this has higher overhead than sorting a single primitive array.

Describe the solution you'd like

We can try packing multiple arrays into one, or a smaller number of, wider primitive arrays.

We can pack the items (based on the sort options) into the next biggest type:
(u16, u16) -> u32
(u16, f32) -> u64
(u8, u32) -> u64
(u32, u32) -> u64
(u64, u32) -> u128
(u64, u64) -> u128
...etc.

and sort on that. This is similar to the RowFormat, but creating the packed arrays should have much lower overhead, and sorting is faster on primitives than on variable-sized data.
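As an illustration, here is a minimal sketch of the packing idea for two non-null u32 columns sorted ascending (lexsort_packed_u32 is a hypothetical helper; nulls, descending order, and other sort options are ignored):

```rust
// Hypothetical sketch: pack two non-null u32 columns into one u64 key per
// row and argsort the keys. The first (more significant) column goes in the
// upper 32 bits, so comparing u64 keys equals comparing (a[i], b[i]) pairs.
fn lexsort_packed_u32(a: &[u32], b: &[u32]) -> Vec<u32> {
    assert_eq!(a.len(), b.len());
    let keys: Vec<u64> = a
        .iter()
        .zip(b)
        .map(|(&x, &y)| ((x as u64) << 32) | y as u64)
        .collect();
    // Argsort: sort row indices by their packed key.
    let mut indices: Vec<u32> = (0..a.len() as u32).collect();
    indices.sort_unstable_by_key(|&i| keys[i as usize]);
    indices
}

fn main() {
    let a = [1_u32, 0, 1];
    let b = [5_u32, 9, 2];
    // Row order after sorting: (0, 9), (1, 2), (1, 5).
    assert_eq!(lexsort_packed_u32(&a, &b), vec![1, 2, 0]);
}
```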

Even when the columns can't be packed into a single column, sorting can still be faster, as packing produces fewer columns to sort on.

I believe we can extend this idea further for larger types and more columns by converting the inputs into a fixed number of u64 limbs / fixed-size arrays, so the compiler can optimize the comparisons. E.g. for a four-column sort on i64 types we can convert the input to Vec<[u64; 4]> and sort based on that.
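A sketch of that shape, assuming each row has already been encoded into order-preserving u64 limbs (argsort_limbs is hypothetical). [u64; N] already implements Ord lexicographically, so the comparator is monomorphized and the compiler can emit straight-line comparison code:

```rust
// Hypothetical sketch: argsort rows that were pre-encoded into fixed-size
// arrays of u64 limbs. [u64; N]'s derived Ord compares limbs lexicographically.
fn argsort_limbs<const N: usize>(rows: &[[u64; N]]) -> Vec<u32> {
    let mut indices: Vec<u32> = (0..rows.len() as u32).collect();
    indices.sort_unstable_by_key(|&i| rows[i as usize]);
    indices
}

fn main() {
    // A four-column i64 sort would first encode each row into a [u64; 4].
    let rows = [[1_u64, 2, 0, 0], [0, 9, 0, 0], [1, 1, 0, 0]];
    assert_eq!(argsort_limbs(&rows), vec![1, 2, 0]);
}
```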

@Dandandan added the enhancement and performance labels May 20, 2025
@Dandandan
Contributor Author

Dandandan commented May 20, 2025

Let's see if AI (jules.google.com) can implement this!


@Dandandan
Contributor Author

It gave me some rough code to work with and it's pretty promising: a ~3x improvement for sorting two f32 columns!

lexsort (f32, f32) 2^10 time:   [10.427 µs 10.511 µs 10.636 µs]
                        change: [−62.783% −62.528% −62.225%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) high mild
  5 (5.00%) high severe

lexsort (f32, f32) 2^12 time:   [44.021 µs 44.111 µs 44.207 µs]
                        change: [−66.312% −66.081% −65.916%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

lexsort (f32, f32) nulls 2^10
                        time:   [10.292 µs 10.357 µs 10.471 µs]
                        change: [−64.731% −64.558% −64.340%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high severe

lexsort (f32, f32) nulls 2^12
                        time:   [40.983 µs 41.311 µs 41.726 µs]
                        change: [−68.945% −68.599% −68.210%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  9 (9.00%) high mild
  4 (4.00%) high severe

lexsort (f32, f32) 2^12 limit 10
                        time:   [14.866 µs 14.955 µs 15.079 µs]
                        change: [−31.880% −30.753% −29.423%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) high mild
  6 (6.00%) high severe

lexsort (f32, f32) 2^12 limit 100
                        time:   [15.472 µs 15.526 µs 15.612 µs]
                        change: [−34.326% −34.083% −33.757%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

lexsort (f32, f32) 2^12 limit 1000
                        time:   [22.139 µs 22.263 µs 22.444 µs]
                        change: [−52.587% −52.268% −51.968%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe

@Dandandan
Contributor Author

Thinking out loud: an alternative would be improving the row format so that it can more efficiently create / sort fixed-width data (e.g. viewing the data as a Vec<u64>, Vec<u128>, or Vec<[u64; N]>) 🤔
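For example, a sketch under the assumption that the row format emitted fixed 16-byte rows (argsort_fixed16 is hypothetical):

```rust
// Hypothetical sketch: if the row format produced fixed 16-byte rows, the
// sort could decode each row into a u128 and compare integers instead of
// byte slices. Big-endian decoding preserves the byte-wise comparison order
// the row format is built on.
fn argsort_fixed16(rows: &[u8]) -> Vec<u32> {
    assert_eq!(rows.len() % 16, 0);
    let keys: Vec<u128> = rows
        .chunks_exact(16)
        .map(|chunk| u128::from_be_bytes(chunk.try_into().unwrap()))
        .collect();
    let mut indices: Vec<u32> = (0..keys.len() as u32).collect();
    indices.sort_unstable_by_key(|&i| keys[i as usize]);
    indices
}
```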

@tustvold
Contributor

I think the key here is avoiding a codegen explosion; a specialized fixed-width row format sounds promising.

@Dandandan
Contributor Author

Another option would be to reuse the existing FixedLengthEncodings, try to pack everything into u64 / u128, and use the existing lexsort_to_indices to handle the rest (sorting one or multiple packed arrays) 🤔
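Roughly this shape (a sketch; the packing step is elided, and sort_packed is a hypothetical helper wrapping the existing arrow_ord::sort::lexsort_to_indices):

```rust
use std::sync::Arc;

use arrow_array::{UInt32Array, UInt64Array};
use arrow_ord::sort::{lexsort_to_indices, SortColumn};

// Hypothetical sketch: after packing the original columns into one (or a few)
// u64 arrays, hand the packed columns to the existing lexsort_to_indices.
fn sort_packed(packed: UInt64Array) -> UInt32Array {
    let columns = [SortColumn {
        values: Arc::new(packed),
        options: None, // ordering is already baked into the packed encoding
    }];
    lexsort_to_indices(&columns, None).unwrap()
}
```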

@tustvold
Contributor

tustvold commented May 21, 2025

I'd be wary of making arrow-ord depend on arrow-row, but there is definitely the possibility of copying the encoding mechanism used for integers and floats and using that to pack into u64 and u128 respectively. The complex logic in arrow-row is there to handle variable-width types, e.g. strings, which I suspect wouldn't be relevant here.
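For reference, the integer and float encodings in question look roughly like this (a sketch of the mechanism, not the actual arrow-row API; ascending, non-null case):

```rust
// Order-preserving encodings in the spirit of arrow-row's fixed-length
// encoding (a sketch, not the actual API). After encoding, *unsigned*
// comparison of the result matches the natural order of the input.

/// i64 -> u64 whose unsigned order matches the signed order: flip the sign bit.
fn encode_i64(v: i64) -> u64 {
    (v as u64) ^ (1 << 63)
}

/// f32 -> u32 whose unsigned order matches the IEEE 754 total order:
/// flip all bits of negative values, flip only the sign bit of positive ones.
fn encode_f32(v: f32) -> u32 {
    let bits = v.to_bits();
    if bits >> 31 == 1 {
        !bits
    } else {
        bits ^ (1 << 31)
    }
}

fn main() {
    assert!(encode_i64(-1) < encode_i64(0));
    assert!(encode_f32(-1.5) < encode_f32(-0.5));
    assert!(encode_f32(-0.5) < encode_f32(2.0));
}
```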

@Dandandan
Contributor Author

Good to know. If there is considerable overlap, it might be possible to extract it to some common place; we can start with just copying.
