[C++] Column type inference in read_csv vs. open_csv. CSV conversion error to null #17278

Open
asfimport opened this issue Jul 15, 2020 · 6 comments

Comments

@asfimport
Collaborator

The open_csv stream does not adjust the inferred column type based on data seen in later blocks.

For example, if a CSV file has only null values in a column within the first block read by open_csv, that column is inferred as Null type. When PyArrow then iterates over later blocks and encounters non-null values in that column, it raises an error.

Example Error:

pyarrow.lib.ArrowInvalid: In CSV column #44: CSV conversion error to null: invalid value '-176400' 

 

This problem goes away if read options with a very large block size are passed to open_csv, but that defeats the purpose of streaming instead of calling read_csv.

 

System info:

PyArrow 0.17.1, Mac OS Catalina, Python 3.7.4

Reporter: Sep Dehpour

Note: This issue was originally created as ARROW-9474. Please see the migration documentation for further details.

@asfimport
Collaborator Author

Antoine Pitrou / @pitrou:
Well, the aim is to produce a homogeneous stream of record batches that all have the same types. We could perhaps add an option to return record batches with different types.
cc @nealrichardson

@pstrzelczak

pstrzelczak commented Apr 23, 2025

We could perhaps add an option to return record batches with different types.

@pitrou is it possible to add this option any time soon? Its absence blocks me from using pyarrow to detect and reconcile differing data types in columns while reading a CSV file in batches.

I am aware it is currently mentioned in the docs as a caveat here, as a resolution of #28247.

@pitrou
Member

pitrou commented Apr 23, 2025

An interested contributor would have to submit a PR for that, and it's probably not trivial.

@pstrzelczak

pstrzelczak commented Apr 24, 2025

@pitrou thanks! Would you help me find a contributor who may be interested in and capable of introducing this change? IMHO this could be a unique feature of pyarrow compared to similar libraries.

@kou
Member

kou commented Apr 25, 2025

How about starting a thread on [email protected] for this?
See also: https://arrow.apache.org/community/#mailing-lists

Someone may be interested in this.

@pstrzelczak

@kou @pitrou I reported an enhancement request here, as the title of this issue does not clearly describe the need.
