
[Doc][Python] Mention CSVStreamingReader pitfalls with type inference #28247


Closed
asfimport opened this issue Apr 20, 2021 · 12 comments

Comments


asfimport commented Apr 20, 2021

It looks like Arrow infers the type from the first batch and applies it to all subsequent batches. But that information might not be enough to infer the type correctly for the whole file. In our particular case, Arrow infers a field in the schema as date32 from the first batch, but the next batch has an empty field value that can't be converted to date32.

When I increase the batch size so that such a value falls into the first batch, Arrow sets the string type for that field (not sure why not nullable date32), since it can't be converted to date32, and the whole file is read successfully.

This problem can easily be reproduced with the following code and the attached dataset:

import pyarrow as pa
import pyarrow.csv as pa_csv
import pyarrow.fs as pa_fs

# A larger block size gives type inference more data to look at.
read_options = pa_csv.ReadOptions(block_size=5_000_000)
parse_options = pa_csv.ParseOptions(newlines_in_values=True)
convert_options = pa_csv.ConvertOptions(timestamp_parsers=[''])

with pa_fs.LocalFileSystem().open_input_file("dataset.csv") as file:
    reader = pa_csv.open_csv(
        file,
        read_options=read_options,
        parse_options=parse_options,
        convert_options=convert_options,
    )
    for batch in reader:
        table_batch = pa.Table.from_batches([batch])

Error message:

 for batch in reader:
 File "pyarrow/ipc.pxi", line 497, in __iter__
 File "pyarrow/ipc.pxi", line 531, in pyarrow.lib.RecordBatchReader.read_next_batch
 File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
 pyarrow.lib.ArrowInvalid: In CSV column #23: CSV conversion error to date32[day]: invalid value ''

 
When we use block_size=10_000_000, the file can be read successfully, since the problematic value is then in the first batch.

An error occurs when I try to attach the dataset, so you can download it from Google Drive here.

Reporter: Oleksandr Shevchenko
Assignee: Antoine Pitrou / @pitrou

Related issues:

PRs and other links:

Note: This issue was originally created as ARROW-12482. Please see the migration documentation for further details.


Antoine Pitrou / @pitrou:
Well, this is by design. How could we read the file incrementally if the schema could change until the end of the file?

You have several solutions here:

  • if you want type inference that cannot fail this way, use the non-streaming CSV reader
  • if you just want this particular file to succeed, you can indeed increase the block size as you found out
  • you can also set a column type explicitly using ConvertOptions.column_types


Antoine Pitrou / @pitrou:
Retargeting this as a documentation issue.

@jorisvandenbossche @jonkeane


Oleksandr Shevchenko:
Thanks for a quick reply @pitrou!
Could you also comment on the conversion error? I'm not sure why the empty value can't be converted to null for the date32 type. I tried changing null_values and a bunch of other options but didn't find anything that helps with this particular case.


Antoine Pitrou / @pitrou:
If the empty value is quoted in the CSV file, then it won't be considered null. You may want to check whether that's the case.


Oleksandr Shevchenko:
Yes, you are right: such dates are written as a quoted empty string (""). Is there any way to configure Arrow to parse that as null as well?


Antoine Pitrou / @pitrou:
Not currently, but that could be added as an option.


Oleksandr Shevchenko:
Thanks for the clarification!


Oleksandr Shevchenko:
I also expected that quoted strings like "2015-09-21" would not be converted to date32 but would be read as strings. I tried timestamp_parsers=[''] to disable this, but it looks like it doesn't apply to date types.
Is there any option to disable converting such strings to date32? It would also help avoid the problem with empty value conversion.


Antoine Pitrou / @pitrou:
No, there is no such option. There are unfortunately many different CSV writers out there, all with slightly different conventions. If you want to make sure the data adheres to a given schema, the best approach is to pass explicit column_types.


Joris Van den Bossche / @jorisvandenbossche:
Note that Antoine opened ARROW-12510 for the "quoted value as null" issue.


Oleksandr Shevchenko:
That's great, it will definitely be useful. Thanks!
Are there any plans to add the ability to also disable parsing quoted dates (like "2015-09-21") so they are read as strings?
Something like what timestamp_parsers=[''] allows for timestamp types.


Antoine Pitrou / @pitrou:
Issue resolved by pull request 10132
#10132
