Add support for file row numbers in Parquet readers #7307

jkylling · 2025-03-18T18:37:35Z

Which issue does this PR close?

Closes #7299.

What changes are included in this PR?

In this PR we:

Add configuration to the ArrowReaderBuilder to set a row_number_column used to extend the read RecordBatches with an additional column with file row numbers.
Keep track of the first row number in each row group in the file. This is computed from the file metadata.
Add an ArrayReader to the vector of ArrayReaders reading columns from the Parquet file, if the row_number_column is set in the reader configuration. This is a RowNumberReader, which is a special ArrayReader. It reads no data from the Parquet pages, but uses the first row numbers in the RowGroupMetaData to keep track of progress.
Add some basic tests and fuzz tests of the functionality.
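The bookkeeping in the second point amounts to a running sum over the row counts of the row groups. A minimal sketch, assuming the per-row-group counts have already been pulled out of the file metadata into a slice (the function name `first_row_numbers` is made up for illustration and is not part of the PR):

```rust
/// Computes the file-level row number of the first row of each row group.
/// `num_rows` holds the row count of every row group, in file order.
/// (Hypothetical helper; the PR derives the same values while decoding metadata.)
fn first_row_numbers(num_rows: &[i64]) -> Vec<i64> {
    let mut next = 0i64;
    num_rows
        .iter()
        .map(|n| {
            // The first row of this group comes right after all earlier rows.
            let first = next;
            next += *n;
            first
        })
        .collect()
}
```

For row groups with 3, 5 and 2 rows this gives first row numbers 0, 3 and 8.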

The RowGroupMetaData::first_row_number is Option<i64>, since it is possible that the row number is unknown (I encountered an instance of this when trying to integrate this PR in delta-rs), and it's better if None is used instead of some special integer value.
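To illustrate why an Option is easier to handle than a sentinel, here is a small standalone sketch. `RowGroupMeta` and `row_numbers` are hypothetical stand-ins, not the types in this PR, and the error is a plain String rather than a ParquetError:

```rust
use std::ops::Range;

/// Hypothetical stand-in for the metadata of one row group.
struct RowGroupMeta {
    num_rows: i64,
    /// `None` when the position of this row group in the file is unknown.
    first_row_number: Option<i64>,
}

/// Returns the file row numbers covered by a row group,
/// or an error when the first row number is unknown.
fn row_numbers(meta: &RowGroupMeta) -> Result<Range<i64>, String> {
    let first = meta
        .first_row_number
        .ok_or_else(|| "row group is missing its first row number".to_string())?;
    Ok(first..first + meta.num_rows)
}
```

With a sentinel such as -1, the same function would silently produce a range starting at -1 unless every caller remembered to check for it.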

The performance impact of this PR should be negligible when the row number column is not set. The only additional overhead would be the tracking of the first_row_number of each row group.

Are there any user-facing changes?

We add an additional public method:

ArrowReaderBuilder::with_row_number_column

There are a few breaking changes as we touch a few public interfaces:

RowGroupMetaData::from_thrift and RowGroupMetaData::from_thrift_encrypted take an additional parameter first_row_number: Option<i64>.
The trait RowGroups has an additional method RowGroups::row_groups. Potentially this method could replace the RowGroups::num_rows method or provide a default implementation for it.
An additional error variant ParquetError::RowGroupMetaDataMissingRowNumber.

I'm very open to suggestions on how to reduce the amount of breaking changes.

etseidl · 2025-03-25T22:29:41Z

Thanks for your submission @jkylling, I'll try to get a first pass review done this week. In the meantime please add the Apache license to row_number.rs and correct the other lint errors. 🙏

jkylling · 2025-03-26T07:28:46Z

> Thanks for your submission @jkylling, I'll try to get a first pass review done this week. In the meantime please add the Apache license to row_number.rs and correct the other lint errors. 🙏

Updated. Looking forward to the first review!

I was very confused as to why cargo format did not work properly, but it looks like you are already aware of this (#6179) :)

etseidl

Partial review, just a few nits for now.

parquet/src/arrow/array_reader/builder.rs

parquet/src/arrow/arrow_reader/mod.rs

parquet/src/errors.rs

parquet/src/arrow/array_reader/builder.rs

etseidl

Thanks again @jkylling for taking this on. I've finished my first pass and have only one reservation. Otherwise it looks good and meets the criteria set forth in #7299 (comment).

etseidl · 2025-03-27T22:58:16Z

parquet/src/arrow/array_reader/row_number.rs

+            row_groups: VecDeque::from(
+                row_groups
+                    .into_iter()
+                    .map(TryInto::try_into)
+                    .collect::<Result<Vec<_>>>()?,
+            ),
+        })


I'm finding myself a bit uneasy with adding the first row number to the RowGroupMetaData. Rather than that, could this bit here instead be changed to keep track of the first row number while populating the deque? Is there some wrinkle I'm missing? Might the row groups be filtered before instantiating the RowNumberReader?

Answered my own question...it seems there's some complexity here at least when using the async reader.

Yes, I believe we don't have access to all row groups when creating the array readers.
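A rough sketch of why the starting row numbers have to come from the metadata: if the reader only sees the selected row groups (say the first and third), counting from zero while filling the deque would assign the wrong numbers to the third group. The type below is a simplified stand-in for the RowNumberReader in this PR, not its actual code; it works on (first row number, row count) pairs instead of RowGroupMetaData:

```rust
use std::collections::VecDeque;

/// Simplified stand-in for a row number reader: it holds no page data,
/// only the remaining (next_row_number, rows_left) for each selected row group.
struct RowNumberSketch {
    remaining: VecDeque<(i64, usize)>,
}

impl RowNumberSketch {
    /// `selected` yields (first_row_number, num_rows) for each row group to read.
    fn new(selected: impl IntoIterator<Item = (i64, usize)>) -> Self {
        Self { remaining: selected.into_iter().collect() }
    }

    /// Produces up to `batch_size` row numbers, crossing row group boundaries.
    fn read(&mut self, batch_size: usize) -> Vec<i64> {
        let mut out = Vec::new();
        while out.len() < batch_size {
            let Some((next, left)) = self.remaining.front_mut() else { break };
            let take = (*left).min(batch_size - out.len());
            out.extend(*next..(*next + take as i64));
            *next += take as i64;
            *left -= take;
            let done = *left == 0;
            if done {
                // This row group is exhausted; move on to the next one.
                self.remaining.pop_front();
            }
        }
        out
    }
}
```

Constructed with `[(0, 3), (8, 2)]`, a first `read(4)` yields 0, 1, 2, 8 and a second yields 9: the jump to 8 is only possible because the first row number of the third row group was known up front.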

I took a quick look at the corresponding Parquet reader implementations for Trino and parquet-java.

Trino:

Has a boolean to include a row number column, https://github.com/trinodb/trino/blob/a54d38a30e486a94a365c7f12a94e47beb30b0fa/lib/trino-parquet/src/main/java/io/trino/parquet/reader/ParquetReader.java#L112

Includes this column when the boolean is set: https://github.com/trinodb/trino/blob/a54d38a30e486a94a365c7f12a94e47beb30b0fa/lib/trino-parquet/src/main/java/io/trino/parquet/reader/ParquetReader.java#L337

Has a special block reader for reading row indexes https://github.com/trinodb/trino/blob/a54d38a30e486a94a365c7f12a94e47beb30b0fa/lib/trino-parquet/src/main/java/io/trino/parquet/reader/ParquetReader.java#L385-L393 I believe the positions play a similar role to our RowSelectors.

Gets row indexes from RowGroupInfo, a pruned version of https://github.com/trinodb/trino/blob/a54d38a30e486a94a365c7f12a94e47beb30b0fa/lib/trino-parquet/src/main/java/io/trino/parquet/reader/ParquetReader.java#L456

Populates the fileRowOffset by iterating through the row groups: https://github.com/trinodb/trino/blob/master/lib/trino-parquet/src/main/java/io/trino/parquet/metadata/ParquetMetadata.java#L107-L111

parquet-java:

Has a method for tracking the current row index: https://github.com/apache/parquet-java/blob/7d1fe32c8c972710a9d780ec5e7d1f95d871374d/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetReader.java#L150-L155

This row index is based on an iterator which starts from a row group row index, https://github.com/apache/parquet-java/blob/7d1fe32c8c972710a9d780ec5e7d1f95d871374d/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java#L311-L339

This row group row index is initialized by iterating through the row groups: https://github.com/apache/parquet-java/blob/7d1fe32c8c972710a9d780ec5e7d1f95d871374d/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1654-L1656 (mapping obtained here: https://github.com/apache/parquet-java/blob/7d1fe32c8c972710a9d780ec5e7d1f95d871374d/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1496-L1506)

Their approaches are rather similar to ours.

One takeaway is that the above implementations do not keep the full RowGroupMetaData structs around, as we do by requiring an iterator over RowGroupMetaData in the RowGroups trait. Avoiding that is likely a good idea, as this struct can be quite large. What do you think about changing the RowGroups trait to something like below?

/// A collection of row groups
pub trait RowGroups {
    /// Get the number of rows in this collection
    fn num_rows(&self) -> usize {
        self.row_group_infos().map(|info| info.num_rows).sum()
    }

    /// Returns a [`PageIterator`] for the column chunks with the given leaf column index
    fn column_chunks(&self, i: usize) -> Result<Box<dyn PageIterator>>;

    /// Returns an iterator over the row groups in this collection
    fn row_group_infos(&self) -> Box<dyn Iterator<Item = &RowGroupInfo> + '_>;
}

struct RowGroupInfo {
    num_rows: usize,
    row_index: i64,
}
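For reference, a trimmed-down, standalone version of this proposal that compiles without the parquet crate. It omits column_chunks (which needs PageIterator), and `SelectedRowGroups` is a hypothetical implementor added only to show how the default num_rows would be derived from row_group_infos:

```rust
/// Lightweight per-row-group summary, mirroring the proposed struct.
struct RowGroupInfo {
    num_rows: usize,
    row_index: i64,
}

/// Trimmed-down version of the proposed trait; `column_chunks` is omitted.
trait RowGroups {
    /// Returns an iterator over the row groups in this collection.
    fn row_group_infos(&self) -> Box<dyn Iterator<Item = &RowGroupInfo> + '_>;

    /// Default implementation: sum the row counts of all row groups.
    fn num_rows(&self) -> usize {
        self.row_group_infos().map(|info| info.num_rows).sum()
    }
}

/// Hypothetical in-memory implementor, used only for illustration.
struct SelectedRowGroups {
    infos: Vec<RowGroupInfo>,
}

impl RowGroups for SelectedRowGroups {
    fn row_group_infos(&self) -> Box<dyn Iterator<Item = &RowGroupInfo> + '_> {
        Box::new(self.infos.iter())
    }
}
```

Implementors would then only need to supply row_group_infos, and a row number reader could take the row_index values from the same iterator without holding on to full RowGroupMetaData structs.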

etseidl · 2025-03-27T23:05:36Z

parquet/src/file/metadata/mod.rs

@@ -584,6 +585,11 @@ impl RowGroupMetaData {
        self.num_rows
    }

+    /// Returns the first row number in this row group.


Suggested change

/// Returns the first row number in this row group.

/// Returns the global index number for the first row in this row group.

And perhaps use first_row_index instead? That may be clearer.

Agree. Updated.

alamb · 2025-03-28T15:47:10Z

Thanks @jkylling and @etseidl -- I think we need to be very careful to balance adding new features in the parquet reader with keeping it fast and maintainable. I haven't had a chance to look at this PR yet, but I do worry about performance and complexity.

Add support for file row numbers in Parquet readers

f93d36e

github-actions bot added the parquet (Changes to the parquet crate) label Mar 18, 2025

jkylling mentioned this pull request Mar 18, 2025

Return file row number in Parquet readers #7299

Open

etseidl added the api-change (Changes to the arrow API) and next-major-release (the PR has API changes and is waiting on the next major version) labels Mar 25, 2025

jkylling added 2 commits March 26, 2025 08:21

Add Apache license header to row_number.rs

e485c0b

Run cargo format

2a62009

etseidl reviewed Mar 26, 2025

parquet/src/arrow/array_reader/builder.rs

parquet/src/arrow/arrow_reader/mod.rs

parquet/src/errors.rs

parquet/src/arrow/array_reader/builder.rs

jkylling added 4 commits March 27, 2025 18:02

Change with_row_number_column to take impl Into<String>

fb5126f

Change Option<String> -> Option<&str> in build_array_reader

5350728

Replace ParquetError::RowGroupMetaDataMissingRowNumber with General

188f350

Split test_create_array_reader test into two

37a9d83

etseidl reviewed Mar 27, 2025


first_row_number -> first_row_index

41e38fe
