Consider a default max_statistics_truncate_length and max_column_index_truncate_length #7490

Open
alamb opened this issue May 12, 2025 · 0 comments
Labels
enhancement Any new improvement worthy of a entry in the changelog parquet Changes to the parquet crate

alamb commented May 12, 2025

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
By default, the arrow-rs parquet writer stores the full min and max values for any column that has statistics enabled in the page metadata.

For large binary/string columns (think JSON blobs), this means that two potentially large values (a min and a max) are stored both in the file-level metadata and in each page header.

This can lead to pathological cases such as those described in

It is possible to control the maximum size of the values using

  1. WriterPropertiesBuilder::set_statistics_truncate_length
  2. WriterPropertiesBuilder::set_column_index_truncate_length

However, these values currently default to None (unlimited).
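
Until a default exists, a writer can opt in explicitly. The following is a minimal sketch, assuming the `parquet` crate is a dependency; the limit of 128 is only the value floated in this issue, not an existing default.

```rust
use parquet::file::properties::WriterProperties;

fn main() {
    // Assumption: 128 bytes is an illustrative limit, not a crate default.
    let props = WriterProperties::builder()
        // Cap min/max values stored in the statistics.
        .set_statistics_truncate_length(Some(128))
        // Cap min/max values stored in the column index.
        .set_column_index_truncate_length(Some(128))
        .build();

    // The getters report the configured limits.
    assert_eq!(props.statistics_truncate_length(), Some(128));
    assert_eq!(props.column_index_truncate_length(), Some(128));
}
```

Passing `None` to either setter keeps the current unlimited behavior.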

I also think it is unlikely that the full min/max values for large string columns will provide significantly better pruning than truncated ones.

Describe the solution you'd like
I propose we set the default statistics truncate length to a non-None value to avoid pathological cases.

Describe alternatives you've considered
I would propose picking a value like 128 that is long enough to capture all primitive data types and "short" strings.

We can (and should) also document the default better
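
For intuition on why a short limit still gives usable bounds, here is a standard-library-only sketch of byte-level truncation. The names `truncate_min` and `truncate_max` are hypothetical and this is not the crate's implementation; a real version for UTF-8 strings would also need to respect character boundaries.

```rust
/// Truncate a lower bound. Any prefix of a byte string sorts at or before
/// the original, so the result is still a valid minimum.
fn truncate_min(value: &[u8], max_len: usize) -> Vec<u8> {
    value[..value.len().min(max_len)].to_vec()
}

/// Truncate an upper bound. A bare prefix would sort before the original,
/// so the last byte that is not 0xFF is incremented and anything after it
/// is dropped. Returns None when no such byte exists; a caller would then
/// keep the untruncated value.
fn truncate_max(value: &[u8], max_len: usize) -> Option<Vec<u8>> {
    if value.len() <= max_len {
        return Some(value.to_vec());
    }
    let mut prefix = value[..max_len].to_vec();
    while let Some(last) = prefix.pop() {
        if last < u8::MAX {
            prefix.push(last + 1);
            return Some(prefix);
        }
    }
    None
}

fn main() {
    let min = b"aaaaaaaa-json-blob";
    let max = b"zzzy-long-json-blob";

    let tmin = truncate_min(min, 4);
    let tmax = truncate_max(max, 4).expect("prefix has a byte below 0xFF");

    // The truncated values still bracket the real ones, so pruning based
    // on them stays correct, only slightly less selective.
    assert!(tmin.as_slice() <= min.as_slice());
    assert!(tmax.as_slice() >= max.as_slice());
}
```

The truncated bounds are looser than the originals, so a reader may scan a few extra pages, but it never skips data that matches.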

Additional context

@alamb alamb added parquet Changes to the parquet crate enhancement Any new improvement worthy of a entry in the changelog labels May 12, 2025
@alamb alamb changed the title Consider a default Statistics max_truncate_lenght Consider a default Statistics max_truncate_length May 12, 2025
@alamb alamb changed the title Consider a default Statistics max_truncate_length Consider a default max_statistics_truncate_length and max_column_index_statistics_struncate_length May 12, 2025
@alamb alamb changed the title Consider a default max_statistics_truncate_length and max_column_index_statistics_struncate_length Consider a default max_statistics_truncate_length and max_column_index_truncate_length May 12, 2025