Files containing binary data with >=8_388_855 bytes per row written with arrow-rs can't be read with pyarrow #7489

Open
jonded94 opened this issue May 12, 2025 · 6 comments

jonded94 commented May 12, 2025

Hello,

internally, we wrote our own library that wraps arrow-rs to make it usable from Python.
A similar wrapper is publicly available as arro3, which I used here for a minimal reproducible example:

import pyarrow.parquet
import arro3.io

# One binary value per row, each just above 8 MiB
row_lengths = [8388855, 8388924, 8388853, 8388880, 8388876, 8388879]

schema = pyarrow.schema([pyarrow.field("html", pyarrow.binary())])
data = [{"html": b"0" * length} for length in row_lengths]

t = pyarrow.Table.from_pylist(data, schema=schema)

path = "/tmp/foo.parquet"
with open(path, "wb") as file:
    for b in t.to_batches():
        arro3.io.write_parquet(b, file, max_row_group_size=len(data) - 3)  # 3 rows per row group, i.e. two row groups

reader = pyarrow.parquet.ParquetFile(path)
for i in range(2):
    print(len(reader.read_row_group(i)))

This code writes a bit of dummy binary data through arrow-rs. Reading it back with pyarrow results in:

  File "pyarrow/_parquet.pyx", line 1655, in pyarrow._parquet.ParquetReader.read_row_group
  File "pyarrow/_parquet.pyx", line 1691, in pyarrow._parquet.ParquetReader.read_row_groups
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
OSError: Couldn't deserialize thrift: No more data to read.
Deserializing page header failed.

Observations

  • Reading the same file with arro3 or with our own internal library wrapping arrow-rs works just fine
  • Reading the same file with duckdb also works just fine
  • Slightly reducing the amount of binary data per row makes the error disappear (8_388_855 bytes per row, or 25_166_565 bytes per row group, or more seems to be the problematic amount; see the quick arithmetic check after this list)
  • The issue is reproducible with pyarrow versions 18.1.0, 19.0.1 and 20.0.0
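
As a rough sanity check (this is an assumption on my part, not something the error message states): if each data page header carries untruncated min/max statistics, a single row of 8_388_855 bytes would contribute roughly twice that much data to the header, which lands just above 16 MiB:

row_size = 8_388_855          # smallest per-row size that still triggers the error
limit = 16 * 1024 * 1024      # 16 MiB
min_plus_max = 2 * row_size   # untruncated min + max statistics, each one full row value
print(min_plus_max, ">", limit, "->", min_plus_max > limit)  # 16777710 > 16777216 -> True
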
alamb commented May 12, 2025

Here is a related ticket in arrow which I think describes the same symptoms and has a workaround.

jonded94 commented May 13, 2025

@alamb I tried setting statistics_truncate_length as well as column_index_truncate_length, but for some reason, this didn't enable pyarrow to read the new file.

I used this Rust code to initialize a Parquet Writer:

#[pyclass]
pub struct ParquetFileWriter {
    writer: Mutex<Option<ArrowWriter<FileWriter>>>,
}

impl ParquetFileWriter {
    #[allow(clippy::too_many_arguments)]
    fn try_new(
        file: FileWriter,
        schema: Schema,
        target_rows_per_row_group: NonZeroUsize,
        column_compression: Option<HashMap<ColumnPath, Compression>>,
        compression: Option<Compression>,
        statistics_enabled: Option<EnabledStatistics>,
        statistics_truncate_length: Option<NonZeroUsize>,
        column_index_truncate_length: Option<NonZeroUsize>,
    ) -> Result<Self, DiscoParquetError> {
        let props = {
            let mut builder = WriterProperties::builder()
                .set_max_row_group_size(target_rows_per_row_group.get())
                .set_key_value_metadata(Some(convert_arrow_metadata_to_parquet_metadata(
                    schema.metadata.clone(),
                )))
                .set_statistics_truncate_length(statistics_truncate_length.map(|value| value.get()))
                .set_column_index_truncate_length(
                    column_index_truncate_length.map(|value| value.get()),
                );
            dbg!(statistics_truncate_length);
            dbg!(column_index_truncate_length);
            if let Some(compression) = compression {
                builder = builder.set_compression(compression);
            }
            if let Some(column_compression) = column_compression {
                for (column_path, compression) in column_compression.into_iter() {
                    builder = builder.set_column_compression(column_path, compression);
                }
            }
            if let Some(statistics_enabled) = statistics_enabled {
                builder = builder.set_statistics_enabled(statistics_enabled);
            }

            builder.build()
        };

        Ok(Self {
            writer: Mutex::new(Some(ArrowWriter::try_new_with_options(
                file,
                SchemaRef::new(schema),
                ArrowWriterOptions::new().with_properties(props),
            )?)),
        })
    }

    fn write_batch(&self, batch: RecordBatch) -> Result<(), DiscoParquetError> {
        if let Some(file) = self.writer.lock()?.as_mut() {
            file.write(&batch)?;
            Ok(())
        } else {
            Err(PyValueError::new_err("File is already closed.").into())
        }
    }
}

And I used this pytest code to verify which test cases are readable with pyarrow and which are not:

import contextlib
from pathlib import Path

import pyarrow
import pyarrow.parquet
import pytest

# EnabledStatistics and ParquetFileWriter come from our internal library wrapping arrow-rs

@pytest.mark.parametrize(
    "statistics_level,statistics_truncate_length,expected_fail",
    [
        (EnabledStatistics.NONE, 1, False),
        (EnabledStatistics.CHUNK, 1024, False),
        (EnabledStatistics.CHUNK, 16 * 1024 * 1024, False),
        (EnabledStatistics.CHUNK, None, False),
        (EnabledStatistics.PAGE, 1024, False),
        (EnabledStatistics.PAGE, 16 * 1024 * 1024, True),
        (EnabledStatistics.PAGE, None, True),
    ],
)
def test_page_statistics_pyarrow_compatibility(
    statistics_level: EnabledStatistics, statistics_truncate_length: int | None, expected_fail: bool, tmp_path: Path
) -> None:
    # 16 MiB per value, to push the statistics in the page headers definitely above the 16 MiB that pyarrow doesn't support right now (version 20.0.0)
    length = 16 * 1024 * 1024

    schema = pyarrow.schema([pyarrow.field("data", pyarrow.binary())])
    data = [{"data": b"0" * length} for _ in range(2)]

    b = pyarrow.RecordBatch.from_pylist(data, schema=schema)

    path = tmp_path / "test.parquet"
    with ParquetFileWriter(
        path,
        schema,
        statistics_enabled=statistics_level,
        statistics_truncate_length=statistics_truncate_length,
        column_index_truncate_length=statistics_truncate_length,
    ) as writer:
        writer.write_batch(b)

    reader = pyarrow.parquet.ParquetFile(path)
    with pytest.raises(OSError) if expected_fail else contextlib.nullcontext():
        reader.read_row_group(0)

Expectation

  • With EnabledStatistics == None, it should never fail, since no statistics are written at all
  • With EnabledStatistics == Chunk, it should never fail regardless of the length to which the statistics are truncated, since pyarrow appears to be able to read ColumnChunk/RowGroup-level statistics that are arbitrarily large
  • With EnabledStatistics == Page, it should fail whenever the statistics are not truncated at all or are truncated only to a large value (16 MiB, for example), but it should not fail when they are truncated to a much smaller value (1024)

Reality

Every expectation holds true except one: it still fails with EnabledStatistics == Page when truncation is set to a low value. Note that I added dbg! statements in the Rust code to force printing what statistics_truncate_length and column_index_truncate_length are set to (and yes, it doesn't matter if I set them to a very low value such as 1).

[_lib/parquet/parquet_writer.rs:168:13] statistics_truncate_length = Some(
    1024,
)
[_lib/parquet/parquet_writer.rs:169:13] column_index_truncate_length = Some(
    1024,
)

tests/test_writers.py:307 (test_page_statistics_pyarrow_compatibility[statistics_level4-1024-False])
statistics_level = EnabledStatistics.PAGE, statistics_truncate_length = 1024
expected_fail = False
tmp_path = PosixPath('/tmp/pytest-of-[...]/pytest-5/test_page_statistics_pyarrow_c4')

[...]
        reader = pyarrow.parquet.ParquetFile(path)
        with pytest.raises(OSError) if expected_fail else contextlib.nullcontext():
>           reader.read_row_group(0)

test_writers.py:343: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../.venv/lib/python3.12/site-packages/pyarrow/parquet/core.py:467: in read_row_group
    return self.reader.read_row_group(i, column_indices=column_indices,
pyarrow/_parquet.pyx:1655: in pyarrow._parquet.ParquetReader.read_row_group
    ???
pyarrow/_parquet.pyx:1691: in pyarrow._parquet.ParquetReader.read_row_groups
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   OSError: Couldn't deserialize thrift: No more data to read.
E   Deserializing page header failed.

pyarrow/error.pxi:92: OSError

Parquet file which is still unreadable with pyarrow: https://limewire.com/d/Cb2YQ#T76kY0beY7

jonded94 commented:

@alamb do you have an idea how I could further debug this problem?

alamb commented May 23, 2025

One thing you could maybe do is use the viewer from @XiangpengHao at https://parquet-viewer.xiangpeng.systems/ to see if the changes to the settings you made resulted in the desired changes in the parquet file.

I am sorry but I don't seem to be able to download the file myself. When I try I get

[screenshot of the download error]

alamb commented May 23, 2025

One thought I had was that maybe the statistics are still large if there are many, many data pages (e.g. you are writing row groups with many columns and rows), so even if the individual statistics are truncated the overall size limit is still exceeded?

I think dumping the metadata and looking at what got written is the right next step.
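
Something along these lines might be a starting point (just a sketch, using the path from your original reproduction; it only reads the footer, which apparently still works for the problematic file, and it only shows chunk-level statistics, not the page-level statistics embedded in the data page headers):

import pyarrow.parquet

# The footer is still readable even when the data page headers are not
md = pyarrow.parquet.ParquetFile("/tmp/foo.parquet").metadata
for rg in range(md.num_row_groups):
    col = md.row_group(rg).column(0)
    print(rg, col.total_compressed_size, col.total_uncompressed_size)
    print(col.statistics)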

jonded94 commented:

I am sorry but I don't seem to be able to download the file myself. When I try I get

Yeah, the file upload service I chose only keeps files available for download for a week. Do you know another upload service with longer availability times?

maybe the statistics are still large if there are many many data pages (e.g. you are writing row groups with many columns and rows)

The code snippet I shared (which you could in principle also compile and use, at least the Rust part) actually writes a single-column, two-row parquet file. I think it is as minimal as it gets 😄

I'm unavailable now for at least a week, but I could have a look with this parquet viewer you shared then. Thanks!
