Files containing binary data with >=8_388_855 bytes per row written with arrow-rs can't be read with pyarrow #7489

Open
jonded94 opened this issue May 12, 2025 · 6 comments

jonded94 commented May 12, 2025

Hello,

internally, we wrote our own library that wraps arrow-rs to make it usable from Python.
A similar wrapper is publicly available as arro3, which I used here for a minimal reproducible example:

import pyarrow.parquet
import arro3.io

# One binary value per row, each just above 8 MiB
row_lengths = [8388855, 8388924, 8388853, 8388880, 8388876, 8388879]

schema = pyarrow.schema([pyarrow.field("html", pyarrow.binary())])
data = [{"html": b"0" * length} for length in row_lengths]

t = pyarrow.Table.from_pylist(data, schema=schema)

path = "/tmp/foo.parquet"
with open(path, "wb") as file:
    for b in t.to_batches():
        arro3.io.write_parquet(b, file, max_row_group_size=len(data) - 3)  # 3 rows per row group, i.e. two row groups

reader = pyarrow.parquet.ParquetFile(path)
for i in range(2):
    print(len(reader.read_row_group(i)))

This code writes a bit of dummy binary data through arrow-rs. Reading it back with pyarrow results in:

  File "pyarrow/_parquet.pyx", line 1655, in pyarrow._parquet.ParquetReader.read_row_group
  File "pyarrow/_parquet.pyx", line 1691, in pyarrow._parquet.ParquetReader.read_row_groups
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
OSError: Couldn't deserialize thrift: No more data to read.
Deserializing page header failed.

Observations

  • Reading the same file with arro3 or with our own internal library wrapping arrow-rs works just fine
  • Reading the same file with duckdb also works just fine
  • Slightly reducing the amount of binary data per row makes the error disappear (8_388_855 bytes per row, or 25_166_565 bytes per row group, or more seems to be the problematic amount; see the quick arithmetic check after this list)
  • The issue is reproducible with pyarrow versions 18.1.0, 19.0.1 and 20.0.0
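
As a rough sanity check (this is an assumption on my part, not something the error message states): if each data page header carries untruncated min/max statistics, a single row of 8_388_855 bytes would contribute roughly twice that much data to the header, which lands just above 16 MiB:

row_size = 8_388_855          # smallest per-row size that still triggers the error
limit = 16 * 1024 * 1024      # 16 MiB
min_plus_max = 2 * row_size   # untruncated min + max statistics, each one full row value
print(min_plus_max, ">", limit, "->", min_plus_max > limit)  # 16777710 > 16777216 -> True
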
alamb commented May 12, 2025

Here is a related ticket in arrow which I think describes the same symptoms and has a workaround.

jonded94 commented May 13, 2025

@alamb I tried setting statistics_truncate_length as well as column_index_truncate_length, but for some reason, this didn't enable pyarrow to read the new file.

I used this Rust code to initialize a Parquet Writer:

#[pyclass]
pub struct ParquetFileWriter {
    writer: Mutex<Option<ArrowWriter<FileWriter>>>,
}

impl ParquetFileWriter {
    #[allow(clippy::too_many_arguments)]
    fn try_new(
        file: FileWriter,
        schema: Schema,
        target_rows_per_row_group: NonZeroUsize,
        column_compression: Option<HashMap<ColumnPath, Compression>>,
        compression: Option<Compression>,
        statistics_enabled: Option<EnabledStatistics>,
        statistics_truncate_length: Option<NonZeroUsize>,
        column_index_truncate_length: Option<NonZeroUsize>,
    ) -> Result<Self, DiscoParquetError> {
        let props = {
            let mut builder = WriterProperties::builder()
                .set_max_row_group_size(target_rows_per_row_group.get())
                .set_key_value_metadata(Some(convert_arrow_metadata_to_parquet_metadata(
                    schema.metadata.clone(),
                )))
                .set_statistics_truncate_length(statistics_truncate_length.map(|value| value.get()))
                .set_column_index_truncate_length(
                    column_index_truncate_length.map(|value| value.get()),
                );
            dbg!(statistics_truncate_length);
            dbg!(column_index_truncate_length);
            if let Some(compression) = compression {
                builder = builder.set_compression(compression);
            }
            if let Some(column_compression) = column_compression {
                for (column_path, compression) in column_compression.into_iter() {
                    builder = builder.set_column_compression(column_path, compression);
                }
            }
            if let Some(statistics_enabled) = statistics_enabled {
                builder = builder.set_statistics_enabled(statistics_enabled);
            }

            builder.build()
        };

        Ok(Self {
            writer: Mutex::new(Some(ArrowWriter::try_new_with_options(
                file,
                SchemaRef::new(schema),
                ArrowWriterOptions::new().with_properties(props),
            )?)),
        })
    }

    fn write_batch(&self, batch: RecordBatch) -> Result<(), DiscoParquetError> {
        if let Some(file) = self.writer.lock()?.as_mut() {
            file.write(&batch)?;
            Ok(())
        } else {
            Err(PyValueError::new_err("File is already closed.").into())
        }
    }
}

And I used this pytest code to verify which test cases are readable with pyarrow and which are not:

import contextlib
from pathlib import Path

import pyarrow
import pyarrow.parquet
import pytest

# EnabledStatistics and ParquetFileWriter come from our internal library wrapping arrow-rs

@pytest.mark.parametrize(
    "statistics_level,statistics_truncate_length,expected_fail",
    [
        (EnabledStatistics.NONE, 1, False),
        (EnabledStatistics.CHUNK, 1024, False),
        (EnabledStatistics.CHUNK, 16 * 1024 * 1024, False),
        (EnabledStatistics.CHUNK, None, False),
        (EnabledStatistics.PAGE, 1024, False),
        (EnabledStatistics.PAGE, 16 * 1024 * 1024, True),
        (EnabledStatistics.PAGE, None, True),
    ],
)
def test_page_statistics_pyarrow_compatibility(
    statistics_level: EnabledStatistics, statistics_truncate_length: int | None, expected_fail: bool, tmp_path: Path
) -> None:
    # 16 MiB per value, to push the statistics in the page headers definitely above the 16 MiB that pyarrow doesn't support right now (version 20.0.0)
    length = 16 * 1024 * 1024

    schema = pyarrow.schema([pyarrow.field("data", pyarrow.binary())])
    data = [{"data": b"0" * length} for _ in range(2)]

    b = pyarrow.RecordBatch.from_pylist(data, schema=schema)

    path = tmp_path / "test.parquet"
    with ParquetFileWriter(
        path,
        schema,
        statistics_enabled=statistics_level,
        statistics_truncate_length=statistics_truncate_length,
        column_index_truncate_length=statistics_truncate_length,
    ) as writer:
        writer.write_batch(b)

    reader = pyarrow.parquet.ParquetFile(path)
    with pytest.raises(OSError) if expected_fail else contextlib.nullcontext():
        reader.read_row_group(0)

Expectation

  • With EnabledStatistics == None, it should never fail, since no statistics are written at all
  • With EnabledStatistics == Chunk, it should never fail regardless of the length to which the statistics are truncated, since pyarrow appears to be able to read ColumnChunk/RowGroup-level statistics that are arbitrarily large
  • With EnabledStatistics == Page, it should fail whenever the statistics are not truncated at all or are truncated only to a large value (16 MiB, for example), but it should not fail when they are truncated to a much smaller value (1024)

Reality

Every expectation holds true except one: it still fails with EnabledStatistics == Page when truncation is set to a low value. Note that I added dbg! statements in the Rust code to force printing what statistics_truncate_length and column_index_truncate_length are set to (and yes, it doesn't matter if I set them to a very low value such as 1).

[_lib/parquet/parquet_writer.rs:168:13] statistics_truncate_length = Some(
    1024,
)
[_lib/parquet/parquet_writer.rs:169:13] column_index_truncate_length = Some(
    1024,
)

tests/test_writers.py:307 (test_page_statistics_pyarrow_compatibility[statistics_level4-1024-False])
statistics_level = EnabledStatistics.PAGE, statistics_truncate_length = 1024
expected_fail = False
tmp_path = PosixPath('/tmp/pytest-of-[...]/pytest-5/test_page_statistics_pyarrow_c4')

[...]
        reader = pyarrow.parquet.ParquetFile(path)
        with pytest.raises(OSError) if expected_fail else contextlib.nullcontext():
>           reader.read_row_group(0)

test_writers.py:343: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../.venv/lib/python3.12/site-packages/pyarrow/parquet/core.py:467: in read_row_group
    return self.reader.read_row_group(i, column_indices=column_indices,
pyarrow/_parquet.pyx:1655: in pyarrow._parquet.ParquetReader.read_row_group
    ???
pyarrow/_parquet.pyx:1691: in pyarrow._parquet.ParquetReader.read_row_groups
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   OSError: Couldn't deserialize thrift: No more data to read.
E   Deserializing page header failed.

pyarrow/error.pxi:92: OSError

Parquet file which is still unreadable with pyarrow: https://limewire.com/d/Cb2YQ#T76kY0beY7

jonded94 commented:

@alamb do you have an idea how I could further debug this problem?

alamb commented May 23, 2025

One thing you could maybe do is use the viewer from @XiangpengHao at https://parquet-viewer.xiangpeng.systems/ to see if the changes to the settings you made resulted in the desired changes in the parquet file.

I am sorry but I don't seem to be able to download the file myself. When I try I get

[screenshot of the download error]

alamb commented May 23, 2025

One thought I had was that maybe the statistics are still large if there are many, many data pages (e.g. you are writing row groups with many columns and rows), so even if the individual statistics are truncated the overall size limit is still exceeded?

I think dumping the metadata and looking at what got written is the right next step.
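
Something along these lines might be a starting point (just a sketch, using the path from your original reproduction; it only reads the footer, which apparently still works for the problematic file, and it only shows chunk-level statistics, not the page-level statistics embedded in the data page headers):

import pyarrow.parquet

# The footer is still readable even when the data page headers are not
md = pyarrow.parquet.ParquetFile("/tmp/foo.parquet").metadata
for rg in range(md.num_row_groups):
    col = md.row_group(rg).column(0)
    print(rg, col.total_compressed_size, col.total_uncompressed_size)
    print(col.statistics)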

jonded94 commented:

I am sorry but I don't seem to be able to download the file myself. When I try I get

Yeah, the file upload service I chose only keeps files available for download for a week. Do you know another upload service with longer availability times?

maybe the statistics are still large if there are many many data pages (e.g. you are writing row groups with many columns and rows)

The code snippet I shared (which you could in principle also compile and use, at least the Rust part) actually writes a single-column, two-row parquet file. I think it is as minimal as it gets 😄

I'm unavailable now for at least a week, but I could have a look with this parquet viewer you shared then. Thanks!
