[C++] Column type inference in read_csv vs. open_csv. CSV conversion error to null #17278

asfimport · 2020-07-15T01:27:32Z

The open_csv stream does not adjust the inferred column type based on the new data seen in new blocks.

For example, if a column of a CSV file contains only null values in the first few blocks read by the open_csv reader, the column is inferred as Null type. When PyArrow then iterates over later blocks and sees non-null values in that column, it raises an error.

Example Error:

pyarrow.lib.ArrowInvalid: In CSV column #44: CSV conversion error to null: invalid value '-176400'

The problem goes away if ReadOptions with a huge block_size is passed to open_csv, so that the first block covers enough rows to see non-null values. But that negates the whole point of having a stream instead of read_csv.

System info:

PyArrow 0.17.1, Mac OS Catalina, Python 3.7.4

Reporter: Sep Dehpour

Note: This issue was originally created as ARROW-9474. Please see the migration documentation for further details.

asfimport · 2020-07-15T08:58:56Z

Antoine Pitrou / @pitrou:
Well, the aim is to produce a homogeneous stream of record batches that all have the same types. We could perhaps add an option to return record batches with different types.
cc @nealrichardson

pstrzelczak · 2025-04-23T09:46:51Z

> We could perhaps add an option to return record batches with different types.

@pitrou is it possible to add this option any time soon? Without it, I cannot use pyarrow to detect and reconcile differing data types in columns while reading a CSV file in batches.

I am aware this behavior is currently mentioned in the docs as a caveat, added as the resolution of #28247.

pitrou · 2025-04-23T10:27:15Z

An interested contributor would have to submit a PR for that, and it's probably not trivial.

pstrzelczak · 2025-04-24T13:21:27Z

@pitrou thanks! Could you help me find a contributor who may be interested in and capable of introducing this change? IMHO this could be a unique feature of pyarrow compared to similar libraries.

kou · 2025-04-25T01:13:25Z

How about starting a thread on the Apache Arrow mailing list for this?
See also: https://arrow.apache.org/community/#mailing-lists

Someone may be interested in this.

pstrzelczak · 2025-04-25T09:42:11Z

@kou @pitrou I filed a separate enhancement request, since the title of this issue does not clearly describe the need.

pstrzelczak mentioned this issue Apr 25, 2025

[C++] Add an option to RecordBatchReader to infer schema for each record batch fetched separately #46228

Open
