Source S3: fix reading jsonl files with nested data #16607
/test connector=conenctors/source-s3
/test connector=connectors/source-s3
Build Passed
Thank you for digging into this, but I'm not sure I really understand what happens there 😄
...tions/connectors/source-s3/unit_tests/sample_files/jsonl/test_file_10_nested_structure.jsonl
...rations/connectors/source-s3/source_s3/source_files_abstract/formats/abstract_file_parser.py
...te-integrations/connectors/source-s3/source_s3/source_files_abstract/formats/jsonl_parser.py
I approve because this fixes the OC issue. As @davydov-d mentioned, there's a lot of fragility in the schema inference mechanism. I hope we get to rework this as part of our efforts in abstracting the file sources.
/publish connector=connectors/source-s3
If you have connectors that successfully published but failed definition generation, follow step 4 here
* airbytehq#531 source s3: fix reading nested jsonl files
* airbytehq#531 source s3: upd changelog
* oncall airbytehq#531 source s3: fix sample file
* auto-bump connector version [ci skip]

Co-authored-by: Octavia Squidington III <[email protected]>
What
https://github.com/airbytehq/oncall/issues/531
Reading a file is a three-step process:
The problem is that the connector cannot convert the pyarrow `struct` type to `object` and back. It uses `large_string` instead, so pyarrow fails when there is a type mismatch.
How
Skip `object`s in the master schema when reading JSON files. We do not distinguish between different object types and currently do not provide a way for the user to describe an object's schema, so it can be literally any object. In this case (by default) pyarrow still reads the data and its type remains inferred. Another option would be to ignore mismatched data types, but this would affect scalar types as well.