arrow-select: add support for merging primitive dictionary values #7519


Open
wants to merge 1 commit into base: main

Conversation

asubiotto
Contributor

Previously, should_merge_dictionary_values would always return false in the ptr_eq closure creation match arm for types that were not {Large}{Utf8,Binary}. This could lead to excessive memory usage.

Which issue does this PR close?

Closes #7518

What changes are included in this PR?

Updates the match arm in should_merge_dictionary_values so it no longer short-circuits on primitive types. Also views primitive values as their byte representations in order to reuse the Interner pipeline already used for the bytes types.
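The byte-representation idea can be sketched as follows. This is a hypothetical, self-contained illustration, not the actual arrow-select code; `merge_primitive_dictionary` is an invented name. Each primitive is viewed as its native little-endian bytes, so the same byte-oriented interning used for Utf8/Binary dictionaries applies unchanged:

```rust
use std::collections::HashMap;

/// Hypothetical sketch of byte-representation interning for primitive
/// dictionary values (names invented for illustration, not the arrow-rs API).
/// Returns remapped keys plus the deduplicated values, stored as bytes.
fn merge_primitive_dictionary(values: &[i32]) -> (Vec<u32>, Vec<Vec<u8>>) {
    let mut lookup: HashMap<Vec<u8>, u32> = HashMap::new();
    let mut merged: Vec<Vec<u8>> = Vec::new();
    let mut keys = Vec::with_capacity(values.len());
    for v in values {
        // View the primitive as its little-endian byte representation so a
        // byte-oriented interner can deduplicate it like a binary value.
        let bytes = v.to_le_bytes().to_vec();
        let key = *lookup.entry(bytes.clone()).or_insert_with(|| {
            merged.push(bytes);
            (merged.len() - 1) as u32
        });
        keys.push(key);
    }
    (keys, merged)
}

fn main() {
    // Duplicate values across input dictionaries collapse to one entry.
    let (keys, dict) = merge_primitive_dictionary(&[10, 20, 10, 30, 20]);
    assert_eq!(keys, vec![0, 1, 0, 2, 1]);
    assert_eq!(dict.len(), 3);
    println!("merged dictionary holds {} values", dict.len());
}
```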

Are there any user-facing changes?

No

@asubiotto
Contributor Author

asubiotto commented May 16, 2025

This causes a regression vs main on the newly-added struct concatenation benchmark:

concat struct with int32 and dicts
                        time:   [8.5127 µs 8.5323 µs 8.5561 µs]
                        change: [+83.335% +84.097% +84.823%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)

I assume this is because merging is more computationally expensive. I'm going to play around with different batch sizes etc. to get a better idea of the performance implications. However, we should also factor in memory savings.

EDIT: After fixing the values slice length (see #7520 (comment)), the regression is more manageable. This is how it stacks up with varying batch sizes and batch counts vs #7517 (I had to increase the sample size to 10k since there seemed to be a lot of noise):

concat struct with int32 and dicts size=1024 count=2
                        time:   [4.0092 µs 4.0102 µs 4.0113 µs]
                        change: [+2.0662% +2.1281% +2.1905%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1205 outliers among 10000 measurements (12.05%)
  9 (0.09%) low severe
  51 (0.51%) low mild
  808 (8.08%) high mild
  337 (3.37%) high severe

concat struct with int32 and dicts size=1024 count=100
                        time:   [103.41 µs 103.85 µs 104.32 µs]
                        change: [+11.622% +12.187% +12.701%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 427 outliers among 10000 measurements (4.27%)
  110 (1.10%) low mild
  55 (0.55%) high mild
  262 (2.62%) high severe

concat struct with int32 and dicts size=8192 count=2
                        time:   [11.803 µs 12.095 µs 12.391 µs]
                        change: [+35.546% +39.202% +43.054%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 568 outliers among 10000 measurements (5.68%)
  113 (1.13%) high mild
  455 (4.55%) high severe

Benchmarking concat struct with int32 and dicts size=8192 count=100: Warming up for 3.0000 s
Warning: Unable to complete 10000 samples in 5.0s. You may wish to increase target time to 6.5s, or reduce sample count to 7700.
concat struct with int32 and dicts size=8192 count=100
                        time:   [564.06 µs 564.59 µs 565.14 µs]
                        change: [-5.3148% -4.9430% -4.5692%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 812 outliers among 10000 measurements (8.12%)
  446 (4.46%) low mild
  259 (2.59%) high mild
  107 (1.07%) high severe

So to summarize: this approach is slower (in the general case) for concatenating primitive dictionaries in exchange for reduced memory. In our case, this improves execution speed downstream, so I would say the regression in microbenchmarks is worth it, but I'd like to hear other opinions.

@tustvold
Contributor

tustvold commented May 17, 2025

See also #7468 (although I suspect this formulation avoids the issue there).

Edit: as an aside, I am curious why you are using primitive dictionaries; the performance hit is huge and in most cases the memory savings are marginal. Curious what I am missing; I've always viewed them as supported but esoteric.

@asubiotto
Contributor Author

Thanks @tustvold, I wasn't aware of that PR.

In our case we don't do any computations on these columns but care about the memory savings, since we run in memory-constrained environments. Our data for these columns specifically has very few unique values (0.03% is a recent number). Additionally, our schema is deeply nested and these columns are usually found in the leaves (within lists of structs), so the memory savings per batch are magnified. Granted, our idea was to have these be REE, but that wasn't working for some reason (I can't remember why; I should experiment when I have some time).
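The memory argument can be sketched with back-of-envelope numbers. Everything here is illustrative: the row count and widths are invented, and only the ~0.03% unique-value ratio comes from the comment above; `dictionary_footprint` is a hypothetical helper, not an arrow-rs API.

```rust
/// Approximate footprint of a dictionary column: one key per row plus one
/// stored entry per dictionary value (buffers only; validity/offsets ignored).
fn dictionary_footprint(rows: u64, dict_entries: u64, key_width: u64, value_width: u64) -> u64 {
    rows * key_width + dict_entries * value_width
}

fn main() {
    let rows: u64 = 1_000_000; // illustrative row count
    let key_width: u64 = 4; // Int32 keys
    let value_width: u64 = 8; // e.g. Int64 values
    let unique = (rows as f64 * 0.0003) as u64; // ~0.03% unique values

    // Without merging, concatenation can retain every input dictionary's
    // values, degrading toward one stored entry per row.
    let unmerged = dictionary_footprint(rows, rows, key_width, value_width);
    // With merging, each distinct value is stored once.
    let merged = dictionary_footprint(rows, unique, key_width, value_width);

    println!("unmerged ~{} bytes, merged ~{} bytes", unmerged, merged);
    assert!(merged * 2 < unmerged);
}
```

Under these assumptions the merged column is dominated by the keys buffer, which is why low-cardinality leaf columns see the largest savings.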

Let me know how you'd like to proceed. I guess one point in favor of this PR is that if someone is using primitive dictionary batches it is likely that they value memory over perf.

Labels
arrow Changes to the arrow crate
Development

Successfully merging this pull request may close these issues.

Allow merging primitive dictionary values in concat and interleave kernels
2 participants