arrow-select: add support for merging primitive dictionary values #7519


Open
wants to merge 1 commit into base: main

Conversation

asubiotto
Contributor

Previously, should_merge_dictionary_values would always return false in the ptr_eq closure creation match arm for types that were not {Large}{Utf8,Binary}. This could lead to excessive memory usage.

Which issue does this PR close?

Closes #7518

What changes are included in this PR?

Updates the match arm in should_merge_dictionary_values so it no longer short-circuits on primitive types. Also views primitive values as their byte representations in order to reuse the Interner pipeline already used for the bytes types.
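The byte-representation idea can be sketched as follows. This is a hypothetical, self-contained illustration, not the actual arrow-select code; `merge_primitive_dictionary` is an invented name. Each primitive is viewed as its native little-endian bytes, so the same byte-oriented interning used for Utf8/Binary dictionaries applies unchanged:

```rust
use std::collections::HashMap;

/// Hypothetical sketch of byte-representation interning for primitive
/// dictionary values (names invented for illustration, not the arrow-rs API).
/// Returns remapped keys plus the deduplicated values, stored as bytes.
fn merge_primitive_dictionary(values: &[i32]) -> (Vec<u32>, Vec<Vec<u8>>) {
    let mut lookup: HashMap<Vec<u8>, u32> = HashMap::new();
    let mut merged: Vec<Vec<u8>> = Vec::new();
    let mut keys = Vec::with_capacity(values.len());
    for v in values {
        // View the primitive as its little-endian byte representation so a
        // byte-oriented interner can deduplicate it like a binary value.
        let bytes = v.to_le_bytes().to_vec();
        let key = *lookup.entry(bytes.clone()).or_insert_with(|| {
            merged.push(bytes);
            (merged.len() - 1) as u32
        });
        keys.push(key);
    }
    (keys, merged)
}

fn main() {
    // Duplicate values across input dictionaries collapse to one entry.
    let (keys, dict) = merge_primitive_dictionary(&[10, 20, 10, 30, 20]);
    assert_eq!(keys, vec![0, 1, 0, 2, 1]);
    assert_eq!(dict.len(), 3);
    println!("merged dictionary holds {} values", dict.len());
}
```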

Are there any user-facing changes?

No

@asubiotto
Contributor Author

asubiotto commented May 16, 2025

This causes a regression vs main on the newly-added struct concatenation benchmark:

concat struct with int32 and dicts
                        time:   [8.5127 µs 8.5323 µs 8.5561 µs]
                        change: [+83.335% +84.097% +84.823%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)

I assume this is because merging is more computationally expensive. I'm going to play around with different batch sizes etc. to get a better idea of the performance implications. However, we should also factor in memory savings.

EDIT: After fixing the values slice length (see #7520 (comment)), the regression is more manageable. This is how it stacks up with varying batch sizes and batch counts vs #7517 (I had to increase the sample size to 10k since there seemed to be a lot of noise):

concat struct with int32 and dicts size=1024 count=2
                        time:   [4.0092 µs 4.0102 µs 4.0113 µs]
                        change: [+2.0662% +2.1281% +2.1905%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1205 outliers among 10000 measurements (12.05%)
  9 (0.09%) low severe
  51 (0.51%) low mild
  808 (8.08%) high mild
  337 (3.37%) high severe

concat struct with int32 and dicts size=1024 count=100
                        time:   [103.41 µs 103.85 µs 104.32 µs]
                        change: [+11.622% +12.187% +12.701%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 427 outliers among 10000 measurements (4.27%)
  110 (1.10%) low mild
  55 (0.55%) high mild
  262 (2.62%) high severe

concat struct with int32 and dicts size=8192 count=2
                        time:   [11.803 µs 12.095 µs 12.391 µs]
                        change: [+35.546% +39.202% +43.054%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 568 outliers among 10000 measurements (5.68%)
  113 (1.13%) high mild
  455 (4.55%) high severe

Benchmarking concat struct with int32 and dicts size=8192 count=100: Warming up for 3.0000 s
Warning: Unable to complete 10000 samples in 5.0s. You may wish to increase target time to 6.5s, or reduce sample count to 7700.
concat struct with int32 and dicts size=8192 count=100
                        time:   [564.06 µs 564.59 µs 565.14 µs]
                        change: [-5.3148% -4.9430% -4.5692%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 812 outliers among 10000 measurements (8.12%)
  446 (4.46%) low mild
  259 (2.59%) high mild
  107 (1.07%) high severe

So to summarize: this approach is slower (in the general case) for concatenating primitive dictionaries in exchange for reduced memory. In our case, this improves execution speed downstream, so I would say the regression in microbenchmarks is worth it, but I'd like to hear other opinions.

@tustvold
Contributor

tustvold commented May 17, 2025

See also #7468 (although I suspect this formulation avoids the issue there).

Edit: as an aside, I am curious why you are using primitive dictionaries; the performance hit is huge and in most cases the memory savings are marginal. Curious what I am missing; I've always viewed them as supported but esoteric.

@asubiotto
Contributor Author

Thanks @tustvold, I wasn't aware of that PR.

In our case we don't do any computations on these columns but care about the memory savings, since we run in memory-constrained environments. Our data for these columns specifically has very few unique values (0.03% is a recent number). Additionally, our schema is deeply nested and these columns are usually found in the leaves (within lists of structs), so the memory savings per batch are magnified. Granted, our idea was to have these be REE, but that wasn't working for some reason (I can't remember why; I should experiment when I have some time).
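The memory argument can be sketched with back-of-envelope numbers. Everything here is illustrative: the row count and widths are invented, and only the ~0.03% unique-value ratio comes from the comment above; `dictionary_footprint` is a hypothetical helper, not an arrow-rs API.

```rust
/// Approximate footprint of a dictionary column: one key per row plus one
/// stored entry per dictionary value (buffers only; validity/offsets ignored).
fn dictionary_footprint(rows: u64, dict_entries: u64, key_width: u64, value_width: u64) -> u64 {
    rows * key_width + dict_entries * value_width
}

fn main() {
    let rows: u64 = 1_000_000; // illustrative row count
    let key_width: u64 = 4; // Int32 keys
    let value_width: u64 = 8; // e.g. Int64 values
    let unique = (rows as f64 * 0.0003) as u64; // ~0.03% unique values

    // Without merging, concatenation can retain every input dictionary's
    // values, degrading toward one stored entry per row.
    let unmerged = dictionary_footprint(rows, rows, key_width, value_width);
    // With merging, each distinct value is stored once.
    let merged = dictionary_footprint(rows, unique, key_width, value_width);

    println!("unmerged ~{} bytes, merged ~{} bytes", unmerged, merged);
    assert!(merged * 2 < unmerged);
}
```

Under these assumptions the merged column is dominated by the keys buffer, which is why low-cardinality leaf columns see the largest savings.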

Let me know how you'd like to proceed. I guess one point in favor of this PR is that if someone is using primitive dictionary batches it is likely that they value memory over perf.

Labels
arrow Changes to the arrow crate
Development

Successfully merging this pull request may close these issues.

Allow merging primitive dictionary values in concat and interleave kernels
2 participants