
An Error Occurred While Reading Parquet File Using C++ - GetRecordBatchReader - Corrupt snappy compressed data. #13186

Open · yurikoomiga opened this issue May 18, 2022 · 6 comments

yurikoomiga commented May 18, 2022

Hi all,

When I read a Parquet file with Arrow as follows:

auto st = parquet::arrow::FileReader::Make(
    arrow::default_memory_pool(),
    parquet::ParquetFileReader::Open(_parquet, _properties), &_reader);
arrow::Status status = _reader->GetRecordBatchReader({_current_group}, _parquet_column_ids, &_rb_batch);
_reader->set_batch_size(65536);
_reader->set_use_threads(true);
status = _rb_batch->ReadNext(&_batch);

the status is not OK and the following error occurs:

IOError: Corrupt snappy compressed data.

When I comment out the statement _reader->set_use_threads(true);, the program runs normally and I can read the Parquet file without problems. The error only occurs when I read multiple columns and use _reader->set_use_threads(true); reading a single column never fails.
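
For reference, a self-contained sketch of this read path (ReadBatches is just an illustrative helper, not the exact code from my application; note that the batch size and threading options are set before GetRecordBatchReader so that the reader it creates picks them up):

#include <memory>
#include <string>
#include <vector>

#include <arrow/api.h>
#include <parquet/arrow/reader.h>
#include <parquet/file_reader.h>

// Illustrative helper: read the given columns of row group 0 and drain
// all record batches, with the threaded read path toggled by use_threads.
arrow::Status ReadBatches(const std::string& path,
                          const std::vector<int>& column_ids,
                          bool use_threads) {
  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(parquet::arrow::FileReader::Make(
      arrow::default_memory_pool(),
      parquet::ParquetFileReader::OpenFile(path, /*memory_map=*/false),
      &reader));
  // Set the options before creating the record batch reader.
  reader->set_batch_size(65536);
  reader->set_use_threads(use_threads);  // true triggers the reported IOError
  std::unique_ptr<arrow::RecordBatchReader> rb_reader;
  ARROW_RETURN_NOT_OK(reader->GetRecordBatchReader({0}, column_ids, &rb_reader));
  std::shared_ptr<arrow::RecordBatch> batch;
  do {
    ARROW_RETURN_NOT_OK(rb_reader->ReadNext(&batch));  // fails here with use_threads=true
  } while (batch != nullptr);
  return arrow::Status::OK();
}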

The test Parquet file was created with pyarrow. I use a single row group with 3,000,000 records, and the file has 20 columns of int and string types.

Reading: C++, Arrow 7.0.0, snappy 1.1.8

Writing: Python 3.8, pyarrow 7.0.0

Looking forward to your reply.

Thank you!

westonpace (Member) commented

This seems like a bug. Can you create a JIRA ticket? Can you attach a sample file that fails to read?

yurikoomiga (Author) commented

Thanks. By the way, I just upgraded Arrow to 8.0.0, and the same error still occurs.

pitrou (Member) commented May 19, 2022

@yurikoomiga Can you post a sample file that fails somewhere? (or code to reproduce the generation of the file)

yurikoomiga (Author) commented

> @yurikoomiga Can you post a sample file that fails somewhere? (or code to reproduce the generation of the file)

Sorry for the late reply.
The sample file is very large, so here is the code that generates it:

import random
import string

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


def create_list(col_type):
    # Generate one random value of the given column type.
    if col_type == "VARCHAR":
        chars = string.ascii_letters + string.digits
        return "".join(random.sample(chars, random.randint(1, 20)))
    elif col_type == "INT":
        return random.randint(1, 65536)


def change_column_type(column_type, data_frame_column):
    # Cast INT columns to int32 so they match the Parquet schema.
    if "INT" in column_type:
        return data_frame_column.astype("int32")
    return data_frame_column


def build_parquet_schema(column_name, column_type):
    table_list = list()
    for index, column in enumerate(column_name):
        if "VARCHAR" in column_type[index]:
            table_list.append((column, pa.string()))
        elif "INT" in column_type[index]:
            table_list.append((column, pa.int32()))
        else:
            table_list.append((column, pa.string()))
    return pa.schema(table_list)


if __name__ == '__main__':
    parquet_file = "test.parquet"

    # 20 columns, alternating string ("VARCHAR") and int32 ("INT").
    column_type, column_name, data_list = list(), list(), list()
    for i in range(0, 20):
        column_name.append("TEST%s" % i)
        column_type.append("VARCHAR" if i % 2 == 0 else "INT")

    table_schema = build_parquet_schema(column_name, column_type)

    # 3,000,000 rows of random data.
    for i in range(0, 3 * 1000 * 1000):
        data_list.append(list(map(create_list, column_type)))

    test_panda_frame = pd.DataFrame(data_list, columns=tuple(column_name))
    for index, column in enumerate(column_name):
        test_panda_frame[column] = change_column_type(column_type[index], test_panda_frame[column])
    table = pa.Table.from_pandas(test_panda_frame, schema=table_schema)
    # row_group_size exceeds the row count, so the file gets a single row group.
    pq.write_table(table, parquet_file, row_group_size=300 * 1000 * 1000)

I ran it on ubuntu 9.4.0 with Python 3.8 and pyarrow 7.0.0.
You can use this script to generate a test.parquet file, then read any combination of multiple columns with _reader->set_use_threads(true); to reproduce the error, for example with the small driver sketched below.
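
For example, a small driver for the ReadBatches sketch from my first comment, reading ten of the twenty columns with threading enabled (again just an illustration, assuming the helper above is available):

#include <iostream>

int main() {
  // Multiple columns plus use_threads=true reproduces the failure; a single
  // column, or use_threads=false, reads the same file cleanly.
  arrow::Status status = ReadBatches("test.parquet",
                                     {0, 1, 2, 3, 4, 5, 6, 7, 8, 9},
                                     /*use_threads=*/true);
  if (!status.ok()) {
    std::cerr << status.ToString() << std::endl;  // IOError: Corrupt snappy compressed data.
    return 1;
  }
  return 0;
}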
@pitrou @westonpace

yurikoomiga (Author) commented

> This seems like a bug. Can you create a JIRA ticket? Can you attach a sample file that fails to read?

I have created a JIRA ticket here: https://issues.apache.org/jira/browse/ARROW-16642

bipinmathew commented
FYI, I am getting the same error using the C++ SDK. I am using Apache Arrow 8.0.0 and snappy 1.1.8. I will try to provide more details as I dive in.
