[C++] An Error Occurred While Reading Parquet File Using C++ - GetRecordBatchReader - Corrupt snappy compressed data. #31992
Comments
Weston Pace / @westonpace:
I did not get any errors and got the expected output:
Does my test program work in your environment?
SnappyCodec::Decompress() called from SerializedPageReader::DecompressIfNeeded() fails if input_len == 0
Could you open a pull request with a test?
IIRC, levels are compressed together with values. If all values are NULL, the page must still have definition levels encoded and compressed, so in any case the compressed length should not be 0. The fix itself looks reasonable to me.
Would it be enough if I place the buggy Parquet file here instead? :) The file was made by a Java library.
@4ertus2 Do you mind opening a pull request against https://github.com/apache/parquet-testing to add this file?
I've checked this file: it holds DataPageV2 pages, whose levels are not compressed together with the page data. arrow-rs also fails to decompress it, but parquet-java can; I don't know the reason here...
…lues buffer is empty (#45252)

### Rationale for this change
In DataPageV2, the levels and the data are not compressed together, so we might get an "empty" data page buffer. When this happens, Snappy C++ fails to decompress the `(input_len == 0, output_len == 0)` data.

### What changes are included in this PR?
Handling the case in `column_reader.cc`.

### Are these changes tested?
* [x] Will add

### Are there any user-facing changes?
Minor fix

* GitHub Issue: #31992

Lead-authored-by: mwish <[email protected]>
Co-authored-by: mwish <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: mwish <[email protected]>
Issue resolved by pull request 45252
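The guard described in the PR can be illustrated with a minimal self-contained sketch. This is not Arrow's actual code: `FakeDecompress`, `DecompressIfNeeded`, and the `Buffer` alias are hypothetical stand-ins for the codec path in `column_reader.cc`. The point is that Snappy rejects a zero-length input as corrupt, so an empty DataPageV2 values buffer (e.g. an all-null column chunk) must be short-circuited before the codec is called.

```cpp
#include <cstdint>
#include <vector>

using Buffer = std::vector<uint8_t>;

// Stand-in for a codec Decompress() that, like Snappy, treats an
// empty input as "Corrupt snappy compressed data."
bool FakeDecompress(const Buffer& in, Buffer* out) {
  if (in.empty()) return false;  // input_len == 0 -> corrupt-data error
  *out = in;                     // placeholder passthrough
  return true;
}

// Guarded version mirroring the idea of the fix: an empty compressed
// values buffer simply yields an empty output, skipping the codec.
bool DecompressIfNeeded(const Buffer& in, Buffer* out) {
  if (in.empty()) {  // DataPageV2 with no compressed values: nothing to do
    out->clear();
    return true;
  }
  return FakeDecompress(in, out);
}
```

Without the guard, the empty buffer reaches the codec and surfaces as the "Corrupt snappy compressed data" error reported in this issue.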
Hi All
When I use Arrow to read a Parquet file as follows:
the status is not OK and an error occurred like this:
When I comment out this statement, the program runs normally and I can read the Parquet file without problems.
The error only occurs when I read multiple columns with `_reader->set_use_threads(true);`; reading a single column does not trigger it.
The test Parquet file is created by pyarrow; it has a single row group with 3,000,000 records.
The Parquet file has 20 columns, including int and string types.
You can create a test Parquet file using the attached Python script.
In my case, I read the columns at indexes 0, 1, 2, 3, 4, 5, and 6.
Reading the file: C++, arrow 7.0.0, snappy 1.1.8
Writing the file: Python 3.8, pyarrow 7.0.0
Looking forward to your reply
Thank you!
@pitrou
@westonpace
Environment: C++, arrow 7.0.0, snappy 1.1.8, arrow 8.0.0,
pyarrow 7.0.0, ubuntu 9.4.0, python3.8
Reporter: yurikoomiga
Original Issue Attachments:
Externally tracked issue: #13186
Note: This issue was originally created as ARROW-16642. Please see the migration documentation for further details.