
[Doc][Python] Mention CSVStreamingReader pitfalls with type inference #28247


Closed
asfimport opened this issue Apr 20, 2021 · 12 comments

Comments


asfimport commented Apr 20, 2021

It looks like Arrow infers the type from the first batch and applies it to all subsequent batches. But that information might not be enough to infer the type correctly for the whole file. In our particular case, Arrow infers a field in the schema as date32 from the first batch, but the next batch has an empty field value that can't be converted to date32.

When I increase the batch size so that such a value falls into the first batch, Arrow sets the string type for that field (not sure why not nullable date32), since it can't be converted to date32, and the whole file is read successfully.

This problem can easily be reproduced with the following code and the attached dataset:

import pyarrow as pa
import pyarrow.csv as pa_csv
import pyarrow.fs as pa_fs

# A larger block size gives type inference more data to look at.
read_options = pa_csv.ReadOptions(block_size=5_000_000)
parse_options = pa_csv.ParseOptions(newlines_in_values=True)
convert_options = pa_csv.ConvertOptions(timestamp_parsers=[''])

with pa_fs.LocalFileSystem().open_input_file("dataset.csv") as file:
    reader = pa_csv.open_csv(
        file,
        read_options=read_options,
        parse_options=parse_options,
        convert_options=convert_options,
    )
    for batch in reader:
        table_batch = pa.Table.from_batches([batch])

Error message:

 for batch in reader:
 File "pyarrow/ipc.pxi", line 497, in __iter__
 File "pyarrow/ipc.pxi", line 531, in pyarrow.lib.RecordBatchReader.read_next_batch
 File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
 pyarrow.lib.ArrowInvalid: In CSV column #23: CSV conversion error to date32[day]: invalid value ''

 
When we use block_size=10_000_000, the file can be read successfully, since the problematic value is then in the first batch.

An error occurs when I try to attach the dataset, so you can download it from Google Drive here.

Reporter: Oleksandr Shevchenko
Assignee: Antoine Pitrou / @pitrou

Related issues:

PRs and other links:

Note: This issue was originally created as ARROW-12482. Please see the migration documentation for further details.


Antoine Pitrou / @pitrou:
Well, this is by design. How could we read the file incrementally if the schema could change until the end of the file?

You have several solutions here:

  • if you want type inference that cannot fail this way, use the non-streaming CSV reader
  • if you just want this particular file to succeed, you can indeed increase the block size as you found out
  • you can also set a column type explicitly using ConvertOptions.column_types


Antoine Pitrou / @pitrou:
Retargeting this as a documentation issue.

@jorisvandenbossche @jonkeane


Oleksandr Shevchenko:
Thanks for a quick reply @pitrou!
Could you also comment on the conversion error? I'm not sure why the empty value can't be converted to null for the date32 type. I tried changing null_values and a bunch of other options but didn't find anything that helps with this particular case.


Antoine Pitrou / @pitrou:
If the empty value is quoted in the CSV file, then it won't be considered null. You may want to check whether that's the case.


Oleksandr Shevchenko:
Yes, you are right: such dates are written as a quoted empty string (""). Is there any way to configure Arrow to parse that as null as well?


Antoine Pitrou / @pitrou:
Not currently, but that could be added as an option.


Oleksandr Shevchenko:
Thanks for the clarification!


Oleksandr Shevchenko:
I also expected that quoted strings like "2015-09-21" would not be converted to date32 but would be read as strings. I tried timestamp_parsers=[''] to disable this, but it looks like it doesn't apply to date types.
Is there any option to disable converting such strings to date32? It would also help avoid the problem with empty value conversion.


Antoine Pitrou / @pitrou:
No, there is no such option. There are unfortunately many different CSV writers out there, all with slightly different conventions. If you want to make sure the data adheres to a given schema, the best approach is to pass explicit column_types.


Joris Van den Bossche / @jorisvandenbossche:
Note that Antoine opened ARROW-12510 for the "quoted value as null" issue.


Oleksandr Shevchenko:
That's great, it will definitely be useful. Thanks!
Are there any plans to add the ability to also disable parsing quoted dates (like "2015-09-21") so they are read as strings?
Something like what timestamp_parsers=[''] allows for timestamp types.


Antoine Pitrou / @pitrou:
Issue resolved by pull request 10132
#10132
