[Doc][Python] Mention CSVStreamingReader pitfalls with type inference #28247
It looks like Arrow infers the type from the first batch and applies it to all subsequent batches. But the first batch might not contain enough information to infer the type correctly for the whole file. In our particular case, Arrow infers some field in the schema as date32 from the first batch, but a later batch has an empty field value that can't be converted to date32.
When I increase the batch size so that such a value lands in the first batch, Arrow assigns the string type to that field (not sure why not nullable date32), since the value can't be converted to date32, and the whole file is read successfully.
This problem can be easily reproduced using the following code and the attached dataset:
Error message:
When we use a block_size of 10_000_000, the file can be read successfully, since the problematic value then falls in the first batch. An error occurs when I try to attach the dataset, so you can download it from Google Drive here.
Reporter: Oleksandr Shevchenko
Assignee: Antoine Pitrou / @pitrou
Related issues:
PRs and other links:
Note: This issue was originally created as ARROW-12482. Please see the migration documentation for further details.