Files containing binary data with >=8_388_855 bytes per row written with arrow-rs can't be read with pyarrow #7489
Comments
Here is a related ticket in arrow which I think describes the same symptoms and has a workaround.
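For context, a minimal sketch of the write-side mitigation that the next comment experiments with: capping statistics and the column index via `WriterProperties`. The concrete lengths are illustrative and not taken from the linked ticket.

```rust
use parquet::file::properties::{EnabledStatistics, WriterProperties};

fn main() {
    // Illustrative values only: cap min/max statistics and column-index entries
    // so the serialized metadata stays small enough for other readers.
    let props = WriterProperties::builder()
        .set_statistics_truncate_length(Some(64))
        .set_column_index_truncate_length(Some(64))
        // Alternatively, keep only chunk-level statistics and drop page-level ones.
        .set_statistics_enabled(EnabledStatistics::Chunk)
        .build();
    // `props` would then be handed to the ArrowWriter, as in the code below.
    let _ = props;
}
```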
@alamb I tried setting the truncate lengths. I used this Rust code to initialize a Parquet writer:

```rust
#[pyclass]
pub struct ParquetFileWriter {
    writer: Mutex<Option<ArrowWriter<FileWriter>>>,
}

impl ParquetFileWriter {
    #[allow(clippy::too_many_arguments)]
    fn try_new(
        file: FileWriter,
        schema: Schema,
        target_rows_per_row_group: NonZeroUsize,
        column_compression: Option<HashMap<ColumnPath, Compression>>,
        compression: Option<Compression>,
        statistics_enabled: Option<EnabledStatistics>,
        statistics_truncate_length: Option<NonZeroUsize>,
        column_index_truncate_length: Option<NonZeroUsize>,
    ) -> Result<Self, DiscoParquetError> {
        let props = {
            let mut builder = WriterProperties::builder()
                .set_max_row_group_size(target_rows_per_row_group.get())
                .set_key_value_metadata(Some(convert_arrow_metadata_to_parquet_metadata(
                    schema.metadata.clone(),
                )))
                .set_statistics_truncate_length(statistics_truncate_length.map(|value| value.get()))
                .set_column_index_truncate_length(
                    column_index_truncate_length.map(|value| value.get()),
                );
            dbg!(statistics_truncate_length);
            dbg!(column_index_truncate_length);
            if let Some(compression) = compression {
                builder = builder.set_compression(compression);
            }
            if let Some(column_compression) = column_compression {
                for (column_path, compression) in column_compression.into_iter() {
                    builder = builder.set_column_compression(column_path, compression);
                }
            }
            if let Some(statistics_enabled) = statistics_enabled {
                builder = builder.set_statistics_enabled(statistics_enabled);
            }
            builder.build()
        };
        Ok(Self {
            writer: Mutex::new(Some(ArrowWriter::try_new_with_options(
                file,
                SchemaRef::new(schema),
                ArrowWriterOptions::new().with_properties(props),
            )?)),
        })
    }

    fn write_batch(&self, batch: RecordBatch) -> Result<(), DiscoParquetError> {
        if let Some(file) = self.writer.lock()?.as_mut() {
            file.write(&batch)?;
            Ok(())
        } else {
            Err(PyValueError::new_err("File is already closed.").into())
        }
    }
}
```

And I used this pytest code to verify which test cases are or are not readable with pyarrow:

```python
@pytest.mark.parametrize(
"statistics_level,statistics_truncate_length,expected_fail",
[
(EnabledStatistics.NONE, 1, False),
(EnabledStatistics.CHUNK, 1024, False),
(EnabledStatistics.CHUNK, 16 * 1024 * 1024, False),
(EnabledStatistics.CHUNK, None, False),
(EnabledStatistics.PAGE, 1024, False),
(EnabledStatistics.PAGE, 16 * 1024 * 1024, True),
(EnabledStatistics.PAGE, None, True),
],
)
def test_page_statistics_pyarrow_compatibility(
statistics_level: EnabledStatistics, statistics_truncate_length: int | None, expected_fail: bool, tmp_path: Path
) -> None:
    # 16 MiB, to get statistics headers definitely above 16 MiB, which pyarrow doesn't support right now (version 20.0.0)
    length = 16 * 1024 * 1024
    schema = pyarrow.schema([pyarrow.field("data", pyarrow.binary())])
    data = [{"data": b"0" * length} for _ in range(2)]
    b = pyarrow.RecordBatch.from_pylist(data, schema=schema)
    path = tmp_path / "test.parquet"
    with ParquetFileWriter(
        path,
        schema,
        statistics_enabled=statistics_level,
        statistics_truncate_length=statistics_truncate_length,
        column_index_truncate_length=statistics_truncate_length,
    ) as writer:
        writer.write_batch(b)
    reader = pyarrow.parquet.ParquetFile(path)
    with pytest.raises(OSError) if expected_fail else contextlib.nullcontext():
        reader.read_row_group(0)
```
Expectation vs. reality: every expectation holds true, besides one. That case still fails, and the attached Parquet file is still unreadable with pyarrow.
@alamb do you have an idea how I could further debug this problem?
One thing maybe you could do is use the viewer from @XiangpengHao at https://parquet-viewer.xiangpeng.systems/ to see if the changes to the settings you made resulted in the desired changes in the parquet file. I am sorry, but I don't seem to be able to download the file myself. When I try, I get: [screenshot]
One thought I had was that maybe the statistics are still large if there are very many data pages (e.g. you are writing row groups with many columns and rows), so even if the statistics are truncated the size limit is still exceeded? I think dumping the metadata and looking at what got written is the right next step.
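A sketch (not from the thread) of how that metadata dump could look with the Rust parquet crate, reading the page index as well since page-level statistics live in the column index; the file path is a placeholder:

```rust
use std::fs::File;

use parquet::file::reader::FileReader;
use parquet::file::serialized_reader::{ReadOptionsBuilder, SerializedFileReader};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholder path: the file produced by the failing test case.
    let file = File::open("test.parquet")?;
    // Ask the reader to also load the page index, where page-level statistics live.
    let options = ReadOptionsBuilder::new().with_page_index().build();
    let reader = SerializedFileReader::new_with_options(file, options)?;
    let metadata = reader.metadata();

    // Page-level statistics are stored in the column index, separate from the
    // per-chunk statistics printed below.
    println!("column index present: {}", metadata.column_index().is_some());

    for (rg_idx, rg) in metadata.row_groups().iter().enumerate() {
        for col in rg.columns() {
            println!(
                "row group {rg_idx}, column {:?}: {} values, chunk statistics present: {}",
                col.column_path(),
                col.num_values(),
                col.statistics().is_some(),
            );
        }
    }
    Ok(())
}
```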
Yeah, the file upload service I chose only keeps files available for download for a week. Do you know another upload service with longer availability?
The code snippet I shared (which you could in principle also compile and use, at least the Rust part) actually writes a single-column, two-row parquet file. I think it is as minimal as it gets 😄 I'm unavailable now for at least a week, but I could have a look with the parquet viewer you shared then. Thanks!
Hello,
internally, we wrote our own library that wraps arrow-rs to make it usable from Python. Such a thing also exists publicly available through arro3, which I used here for a minimal reproducible example. The code writes a bit of dummy binary data through arrow-rs; reading that with pyarrow results in an error.
Observations:
- Reading the file with arro3 or our own internal library wrapping arrow-rs works just fine
- duckdb also works just fine
- Reading fails with pyarrow versions 18.1.0, 19.0.1 and 20.0.0
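The arro3-based snippet itself is not quoted above; for illustration, here is a hypothetical, roughly equivalent reproducer written directly against arrow-rs, using the row size from the issue title, default writer properties, and an arbitrarily chosen file name:

```rust
use std::fs::File;
use std::sync::Arc;

use arrow_array::{ArrayRef, BinaryArray, RecordBatch};
use arrow_schema::{DataType, Field, Schema};
use parquet::arrow::ArrowWriter;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // One non-nullable binary column; two rows of 8_388_855 bytes each,
    // matching the threshold mentioned in the issue title.
    let schema = Arc::new(Schema::new(vec![Field::new("data", DataType::Binary, false)]));
    let value = vec![b'0'; 8_388_855];
    let array: ArrayRef = Arc::new(BinaryArray::from_vec(vec![value.as_slice(), value.as_slice()]));
    let batch = RecordBatch::try_new(schema.clone(), vec![array])?;

    // Default writer properties: page-level statistics enabled, no truncation.
    let file = File::create("repro.parquet")?;
    let mut writer = ArrowWriter::try_new(file, schema, None)?;
    writer.write(&batch)?;
    writer.close()?;
    Ok(())
}
```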