-
Notifications
You must be signed in to change notification settings - Fork 930
arrow-select: add support for merging primitive dictionary values #7519
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Previously, should_merge_dictionaries would always return false in the ptr_eq closure creation match arm for types that were not {Large}{Utf8,Binary}. This could lead to excessive memory usage.
4a2e763
to
62c4028
Compare
EDIT: After fixing the values slice length (see #7520 (comment)). the regression is more manageable and this is how it stacks up with varying batch sizes and batch counts vs #7517 (had to increase sample size to 10k since there seemed to be a lot of noise)
So to summarize: this approach is slower (in the general case) for concatenating primitive dictionaries in exchange for reduced memory. In our case, this improves execution speed downstream so I would say this regression in microbenchmarks is worth it but I'd like to hear other opinions. |
See also #7468 (although I suspect this formulation avoids the issue there). Edit: as an aside I am curious why you are using primitive dictionaries, the performance hit is huge and in most cases the memory savings marginal... Curious what I am missing, I've always viewed them as supported but esoteric... |
Thanks @tustvold, I wasn't aware of that PR. In our case we don't do any computations on these columns but care about the memory savings since we run in memory-constrained environments. Our data for these columns specifically has very few unique values (0.03% is a recent number). Additionally, our schema is deeply nested and these columns are usually found in the leaves (within lists of structs) so memory savings per batch is magnified. Granted, our idea was to have these be REE but that wasn't working for some reason (can't remember why but I should experiment when I have some time). Let me know how you'd like to proceed. I guess one point in favor of this PR is that if someone is using primitive dictionary batches it is likely that they value memory over perf. |
Previously, should_merge_dictionaries would always return false in the ptr_eq closure creation match arm for types that were not {Large}{Utf8,Binary}. This could lead to excessive memory usage.
Which issue does this PR close?
Closes #7518
What changes are included in this PR?
Update to the match arm in
should_merge_dictionary_values
to not short circuit on primitive types. Also uses primitive byte representations to reuse theInterner
pipeline used for the bytes types.Are there any user-facing changes?
No