[C++] Column type inference in read_csv vs. open_csv. CSV conversion error to null #17278
Comments
Antoine Pitrou / @pitrou:

> @pitrou, is it possible to add this option any time soon? It blocks me from using PyArrow to detect and reconcile different column data types while reading a CSV file in batches (a manual approach is sketched after this thread). I am aware this is currently mentioned as a caveat in the docs, as part of the resolution of #28247.

An interested contributor would have to submit a PR for that, and it's probably not trivial.

> @pitrou, thanks! Would you help me find a contributor who may be interested in and capable of introducing this change? IMHO this could have been a unique feature of PyArrow compared to similar libraries.

How about starting a thread on the mailing list? Someone may be interested in this.
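The "detect and reconcile different data types" workflow mentioned in the thread is not something open_csv supports today. One manual stopgap (an assumption for illustration, not an existing PyArrow feature) is to read every column as string so per-block inference cannot conflict, then cast each batch to a schema the caller chooses. The file contents, column names, and target types below are hypothetical:

```python
import io
import pyarrow as pa
import pyarrow.csv as csv

# Hypothetical file: column "b" is empty in early rows, numeric later.
data = b"a,b\n" + b"1,\n" * 1000 + b"2,99\n"

# Read everything as string so inference cannot disagree between blocks;
# strings_can_be_null=True turns empty fields into nulls instead of "".
convert = csv.ConvertOptions(
    column_types={"a": pa.string(), "b": pa.string()},
    strings_can_be_null=True,
)
reader = csv.open_csv(
    io.BytesIO(data),
    read_options=csv.ReadOptions(block_size=1024),
    convert_options=convert,
)

# Cast each batch to the schema we actually want; detection logic could
# instead inspect the string batches before deciding on target types.
target = pa.schema([("a", pa.int64()), ("b", pa.int64())])
for batch in reader:
    typed = pa.Table.from_batches([batch]).cast(target)
```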
Original report:

The open_csv stream does not adjust the inferred column type based on new data seen in later blocks. For example, if a CSV file contains only null values for a column in the first block(s) read by open_csv, that column is inferred as null type. When PyArrow then iterates over subsequent blocks and sees non-null values in that column, it crashes.
Example error:
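The original traceback was not captured above. The following is a minimal reproduction sketch; the two-column file, block size, and error text are illustrative assumptions rather than taken from the report:

```python
import io
import pyarrow.csv as csv

# Column "b" is empty (null) for every row in the first block, so its
# type is inferred as null from that block alone; a later block then
# contains "99", which cannot be converted to the null type.
data = b"a,b\n" + b"1,\n" * 1000 + b"2,99\n"

reader = csv.open_csv(
    io.BytesIO(data),
    # A small block size so the reader actually streams in several blocks.
    read_options=csv.ReadOptions(block_size=1024),
)
for batch in reader:
    pass
# Expected failure once a block with a non-null "b" arrives, e.g.:
#   pyarrow.lib.ArrowInvalid: In CSV column #1: CSV conversion error to null ...
```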
This problem is resolved if ReadOptions with a huge block_size is passed to open_csv. But that negates the whole point of having a stream vs. read_csv. (That workaround, and an alternative using explicit column types, are sketched below.)
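A sketch of both options: the huge-block workaround from the report, and an alternative that uses ConvertOptions(column_types=...) to pin the problematic column's type so no per-block inference is needed. The data, sizes, and types are illustrative assumptions:

```python
import io
import pyarrow as pa
import pyarrow.csv as csv

data = b"a,b\n" + b"1,\n" * 1000 + b"2,99\n"

# Workaround from the report: a block size larger than the file means
# type inference sees all the data -- which negates streaming.
reader = csv.open_csv(
    io.BytesIO(data),
    read_options=csv.ReadOptions(block_size=1 << 20),
)

# Alternative: declare the affected column's type up front so inference
# never runs for it, and small blocks keep working.
reader = csv.open_csv(
    io.BytesIO(data),
    read_options=csv.ReadOptions(block_size=1024),
    convert_options=csv.ConvertOptions(column_types={"b": pa.int64()}),
)
table = reader.read_all()
```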
System info:
PyArrow 0.17.1, macOS Catalina, Python 3.7.4
Reporter: Sep Dehpour
Note: This issue was originally created as ARROW-9474. Please see the migration documentation for further details.