Consider a default max_statistics_truncate_length and max_column_index_truncate_length
#7490
Labels: enhancement, parquet
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
By default, the arrow-rs parquet writer saves the entire actual min and max values for any column that has statistics enabled into the page metadata.
For large binary/string columns (think JSON blobs), this means that two potentially large values (a min and a max) are stored in both the file-level metadata and each page header.
This can lead to pathological cases such as the one described in "arrow-rs can't be read with pyarrow" (#7489).
It is possible to control the maximum size of the values using
WriterPropertiesBuilder::set_statistics_truncate_length
WriterPropertiesBuilder::set_column_index_truncate_length
However, the values currently default to None (unlimited).
I also think it is unlikely that the actual min/max values for large string columns will add significantly better pruning.
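For reference, opting in to truncation today might look like the sketch below. It assumes a `parquet` crate version in which both setters named above take an `Option<usize>`; the helper name `truncating_writer_properties` is hypothetical, and the limit is whatever value the caller chooses.

```rust
use parquet::file::properties::WriterProperties;

/// Hypothetical helper: build writer properties that cap the stored
/// min/max values at `limit` bytes instead of the unlimited default.
fn truncating_writer_properties(limit: usize) -> WriterProperties {
    WriterProperties::builder()
        // Caps min/max values in the statistics (assumed Option<usize> argument).
        .set_statistics_truncate_length(Some(limit))
        // Caps min/max values written to the column index.
        .set_column_index_truncate_length(Some(limit))
        .build()
}
```

The resulting properties would then be passed to the writer as usual, for example `truncating_writer_properties(128)`.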
Describe the solution you'd like
I propose we set the default statistics truncate length to a non-None value to avoid pathological cases.
Describe alternatives you've considered
I would propose picking a value like 128 that is long enough to capture all primitive data types and "short" strings.
We can (and should) also document the default better.
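To illustrate why a truncated bound is still usable for pruning, here is a minimal, std-only sketch of byte-wise truncation. It is not the parquet crate's implementation; the function names are hypothetical, and it ignores UTF-8 character boundaries, which a real string implementation would need to respect.

```rust
/// Truncate a minimum value to at most `max_len` bytes.
/// Any prefix of `value` compares <= `value` under bytewise ordering,
/// so the prefix is still a valid lower bound.
fn truncate_min(value: &[u8], max_len: usize) -> Vec<u8> {
    value[..value.len().min(max_len)].to_vec()
}

/// Truncate a maximum value to at most `max_len` bytes.
/// A bare prefix would sort *below* the real maximum, so the last byte
/// that is not 0xFF is incremented to keep the result an upper bound.
/// Returns `None` when no shorter upper bound exists.
fn truncate_max(value: &[u8], max_len: usize) -> Option<Vec<u8>> {
    if value.len() <= max_len {
        return Some(value.to_vec());
    }
    let mut prefix = value[..max_len].to_vec();
    while let Some(last) = prefix.last().copied() {
        if last == 0xFF {
            // 0xFF cannot be incremented; drop it and try the previous byte.
            prefix.pop();
        } else {
            *prefix.last_mut().unwrap() = last + 1;
            return Some(prefix);
        }
    }
    None
}
```

With a limit such as 128, fixed-width primitive values pass through unchanged, while a multi-kilobyte JSON blob shrinks to a short bound that is looser than the exact value but still correct.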
Additional context
arrow-rs can't be read with pyarrow #7489
arrow-rs can't be read with pyarrow arrow#46404
arro3 can't be read with pyarrow kylebarron/arro3#324