[C++] An Error Occurred While Reading Parquet File Using C++ - GetRecordBatchReader - Corrupt snappy compressed data. #31992
Comments
Weston Pace / @westonpace:
I did not get any errors and got the expected output:
Does my test program work in your environment?
SnappyCodec::Decompress() called from SerializedPageReader::DecompressIfNeeded() fails if input_len == 0
Could you open a pull request with a test?
IIRC, levels are compressed together with values. If all values are NULL, the page must still have definition levels encoded and compressed, so in any case the compressed length should not be 0. The fix itself looks reasonable to me.
Would it be enough if I place the buggy Parquet file here instead? :) The file was made by a Java library.
@4ertus2 Do you mind opening a pull request against https://github.com/apache/parquet-testing to add this file?
I've checked this file: it holds DataPageV2 pages, whose levels are not compressed together with the page data. arrow-rs also fails to decompress it, but parquet-java can; I don't know the reason here...
…lues buffer is empty (#45252)

### Rationale for this change
In DataPageV2, the levels and the data are not compressed together, so we might get an "empty" data page buffer. When this happens, Snappy C++ fails to decompress the `(input_len == 0, output_len == 0)` data.

### What changes are included in this PR?
Handling the case in `column_reader.cc`.

### Are these changes tested?
* [x] Will add

### Are there any user-facing changes?
Minor fix

* GitHub Issue: #31992

Lead-authored-by: mwish <[email protected]>
Co-authored-by: mwish <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: mwish <[email protected]>
Issue resolved by pull request 45252
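The guard described in the PR can be illustrated with a minimal self-contained sketch. This is not Arrow's actual code: `FakeDecompress`, `DecompressIfNeeded`, and the `Buffer` alias are hypothetical stand-ins for the codec path in `column_reader.cc`. The point is that Snappy rejects a zero-length input as corrupt, so an empty DataPageV2 values buffer (e.g. an all-null column chunk) must be short-circuited before the codec is called.

```cpp
#include <cstdint>
#include <vector>

using Buffer = std::vector<uint8_t>;

// Stand-in for a codec Decompress() that, like Snappy, treats an
// empty input as "Corrupt snappy compressed data."
bool FakeDecompress(const Buffer& in, Buffer* out) {
  if (in.empty()) return false;  // input_len == 0 -> corrupt-data error
  *out = in;                     // placeholder passthrough
  return true;
}

// Guarded version mirroring the idea of the fix: an empty compressed
// values buffer simply yields an empty output, skipping the codec.
bool DecompressIfNeeded(const Buffer& in, Buffer* out) {
  if (in.empty()) {  // DataPageV2 with no compressed values: nothing to do
    out->clear();
    return true;
  }
  return FakeDecompress(in, out);
}
```

Without the guard, the empty buffer reaches the codec and surfaces as the "Corrupt snappy compressed data" error reported in this issue.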
Hi All
When I use Arrow to read a Parquet file as follows:
the status is not OK and an error occurred like this:
When I comment out this statement, the program runs normally and I can read the Parquet file without problems.
The error only occurs when I read multiple columns with `_reader->set_use_threads(true);`; reading a single column does not trigger it.
The test Parquet file is created by pyarrow; it has a single row group with 3,000,000 records.
The Parquet file has 20 columns, including int and string types.
You can create a test Parquet file using the attached Python script.
In my case, I read the columns at indexes 0, 1, 2, 3, 4, 5, and 6.
Reading the file: C++, arrow 7.0.0, snappy 1.1.8
Writing the file: Python 3.8, pyarrow 7.0.0
Looking forward to your reply
Thank you!
@pitrou
@westonpace
Environment: C++, arrow 7.0.0, snappy 1.1.8, arrow 8.0.0,
pyarrow 7.0.0, ubuntu 9.4.0, python3.8
Reporter: yurikoomiga
Original Issue Attachments:
Externally tracked issue: #13186
Note: This issue was originally created as ARROW-16642. Please see the migration documentation for further details.