
An Error Occurred While Reading Parquet File Using C++ - GetRecordBatchReader - Corrupt snappy compressed data. #13186

Open · yurikoomiga opened this issue May 18, 2022 · 6 comments

yurikoomiga commented May 18, 2022

Hi all,

When I read a Parquet file with Arrow as follows:

auto st = parquet::arrow::FileReader::Make(
    arrow::default_memory_pool(),
    parquet::ParquetFileReader::Open(_parquet, _properties), &_reader);
arrow::Status status = _reader->GetRecordBatchReader({_current_group}, _parquet_column_ids, &_rb_batch);
_reader->set_batch_size(65536);
_reader->set_use_threads(true);
status = _rb_batch->ReadNext(&_batch);

the status is not OK and the following error occurs:

IOError: Corrupt snappy compressed data.

When I comment out the statement _reader->set_use_threads(true);, the program runs normally and I can read the Parquet file without problems. The error only occurs when I read multiple columns and use _reader->set_use_threads(true); reading a single column never fails.
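
For reference, a self-contained sketch of this read path (ReadBatches is just an illustrative helper, not the exact code from my application; note that the batch size and threading options are set before GetRecordBatchReader so that the reader it creates picks them up):

#include <memory>
#include <string>
#include <vector>

#include <arrow/api.h>
#include <parquet/arrow/reader.h>
#include <parquet/file_reader.h>

// Illustrative helper: read the given columns of row group 0 and drain
// all record batches, with the threaded read path toggled by use_threads.
arrow::Status ReadBatches(const std::string& path,
                          const std::vector<int>& column_ids,
                          bool use_threads) {
  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(parquet::arrow::FileReader::Make(
      arrow::default_memory_pool(),
      parquet::ParquetFileReader::OpenFile(path, /*memory_map=*/false),
      &reader));
  // Set the options before creating the record batch reader.
  reader->set_batch_size(65536);
  reader->set_use_threads(use_threads);  // true triggers the reported IOError
  std::unique_ptr<arrow::RecordBatchReader> rb_reader;
  ARROW_RETURN_NOT_OK(reader->GetRecordBatchReader({0}, column_ids, &rb_reader));
  std::shared_ptr<arrow::RecordBatch> batch;
  do {
    ARROW_RETURN_NOT_OK(rb_reader->ReadNext(&batch));  // fails here with use_threads=true
  } while (batch != nullptr);
  return arrow::Status::OK();
}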

The test Parquet file was created with pyarrow. I use a single row group with 3,000,000 records, and the file has 20 columns of int and string types.

Reading: C++, Arrow 7.0.0, snappy 1.1.8

Writing: Python 3.8, pyarrow 7.0.0

Looking forward to your reply.

Thank you!

westonpace (Member) commented

This seems like a bug. Can you create a JIRA ticket? Can you attach a sample file that fails to read?

yurikoomiga (Author) commented

Thanks. By the way, I just upgraded Arrow to 8.0.0, and the same error still occurs.

pitrou (Member) commented May 19, 2022

@yurikoomiga Can you post a sample file that fails somewhere? (or code to reproduce the generation of the file)

yurikoomiga (Author) commented

> @yurikoomiga Can you post a sample file that fails somewhere? (or code to reproduce the generation of the file)

Sorry for the late reply.
The sample file is very large, so here is the code that generates it:

import random
import string

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


def create_list(col_type):
    # Generate one random value of the given column type.
    if col_type == "VARCHAR":
        chars = string.ascii_letters + string.digits
        return "".join(random.sample(chars, random.randint(1, 20)))
    elif col_type == "INT":
        return random.randint(1, 65536)


def change_column_type(column_type, data_frame_column):
    # Cast INT columns to int32 so they match the Parquet schema.
    if "INT" in column_type:
        return data_frame_column.astype("int32")
    return data_frame_column


def build_parquet_schema(column_name, column_type):
    table_list = list()
    for index, column in enumerate(column_name):
        if "VARCHAR" in column_type[index]:
            table_list.append((column, pa.string()))
        elif "INT" in column_type[index]:
            table_list.append((column, pa.int32()))
        else:
            table_list.append((column, pa.string()))
    return pa.schema(table_list)


if __name__ == '__main__':
    parquet_file = "test.parquet"

    # 20 columns, alternating string ("VARCHAR") and int32 ("INT").
    column_type, column_name, data_list = list(), list(), list()
    for i in range(0, 20):
        column_name.append("TEST%s" % i)
        column_type.append("VARCHAR" if i % 2 == 0 else "INT")

    table_schema = build_parquet_schema(column_name, column_type)

    # 3,000,000 rows of random data.
    for i in range(0, 3 * 1000 * 1000):
        data_list.append(list(map(create_list, column_type)))

    test_panda_frame = pd.DataFrame(data_list, columns=tuple(column_name))
    for index, column in enumerate(column_name):
        test_panda_frame[column] = change_column_type(column_type[index], test_panda_frame[column])
    table = pa.Table.from_pandas(test_panda_frame, schema=table_schema)
    # row_group_size exceeds the row count, so the file gets a single row group.
    pq.write_table(table, parquet_file, row_group_size=300 * 1000 * 1000)

I ran it on ubuntu 9.4.0 with Python 3.8 and pyarrow 7.0.0.
You can use this script to generate a test.parquet file, then read any combination of multiple columns with _reader->set_use_threads(true); to reproduce the error, for example with the small driver sketched below.
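
For example, a small driver for the ReadBatches sketch from my first comment, reading ten of the twenty columns with threading enabled (again just an illustration, assuming the helper above is available):

#include <iostream>

int main() {
  // Multiple columns plus use_threads=true reproduces the failure; a single
  // column, or use_threads=false, reads the same file cleanly.
  arrow::Status status = ReadBatches("test.parquet",
                                     {0, 1, 2, 3, 4, 5, 6, 7, 8, 9},
                                     /*use_threads=*/true);
  if (!status.ok()) {
    std::cerr << status.ToString() << std::endl;  // IOError: Corrupt snappy compressed data.
    return 1;
  }
  return 0;
}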
@pitrou @westonpace

yurikoomiga (Author) commented

> This seems like a bug. Can you create a JIRA ticket? Can you attach a sample file that fails to read?

I have created a JIRA ticket here: https://issues.apache.org/jira/browse/ARROW-16642

bipinmathew commented
FYI, I am getting the same error using the C++ SDK. I am using Apache Arrow 8.0.0 and snappy 1.1.8. I will try to provide more details as I dive in.
