Source S3: fix reading jsonl files with nested data #16607

Merged
merged 6 commits into from
Sep 19, 2022

Conversation

davydov-d
Contributor

@davydov-d davydov-d commented Sep 12, 2022

What

https://github.com/airbytehq/oncall/issues/531

Reading a file is a three-step process:

  • first, each file's schema is discovered as pyarrow data types
  • then all of these schemas are converted to JSON schemas and merged, together with the user-provided schema filled in during connection setup, into a master schema
  • finally, the master schema is converted back to a pyarrow schema and passed to the reader as an explicit schema

The problem is that the connector cannot convert the pyarrow struct type to a JSON schema object and back. It uses large_string instead, so pyarrow fails on the resulting type mismatch.
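
To make the lossy round trip concrete, here is a minimal stdlib-only sketch of the three steps. Type names are plain strings standing in for pyarrow types, and the mapping tables and helper names (`ARROW_TO_JSON`, `JSON_TO_ARROW`, `build_master_schema`, `to_reader_schema`) are hypothetical simplifications, not the connector's actual code.

```python
# Hypothetical, simplified type tables; the connector's real mappings may differ.
ARROW_TO_JSON = {"int64": "integer", "double": "number", "bool": "boolean", "string": "string"}
# "object" has no struct counterpart here, so it falls back to large_string.
JSON_TO_ARROW = {"integer": "int64", "number": "double", "boolean": "bool",
                 "string": "string", "object": "large_string"}


def arrow_to_json_type(arrow_type: str) -> str:
    """Step 2a: map one inferred arrow type name to a JSON schema type."""
    if arrow_type.startswith("struct<"):
        return "object"  # the struct's inner layout is lost at this point
    return ARROW_TO_JSON[arrow_type]


def build_master_schema(file_schemas, user_schema):
    """Step 2b: merge every file's converted schema with the user-provided one."""
    master = {}
    for schema in file_schemas:
        for name, arrow_type in schema.items():
            master[name] = arrow_to_json_type(arrow_type)
    master.update(user_schema)
    return master


def to_reader_schema(master):
    """Step 3: convert the master schema back to arrow type names."""
    return {name: JSON_TO_ARROW[json_type] for name, json_type in master.items()}


# Step 1 (assumed result of schema discovery for a file with a nested field).
inferred = {"id": "int64", "meta": "struct<k: string>"}
master = build_master_schema([inferred], user_schema={})
reader_schema = to_reader_schema(master)

# Fields whose explicit type no longer matches what is actually in the file.
mismatched = [name for name in inferred if reader_schema[name] != inferred[name]]
print(mismatched)
```

In this model only the nested field ends up with an explicit type that disagrees with the data, which corresponds to the mismatch the reader trips over.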

How

Skip objects in the master schema when reading JSON files. We do not distinguish between different object types and currently give the user no way to describe an object's schema, so a field can hold literally any object. When such a field is left out of the explicit schema, pyarrow still reads the data and, by default, its type remains inferred. Another option would be to ignore mismatched data types, but that would affect scalar types as well.
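
A rough sketch of the idea, again with strings in place of real pyarrow types. The function name `to_reader_schema_skipping_objects` and the scalar mapping are assumptions made for illustration and are not taken from the diff.

```python
# Hypothetical scalar mapping from JSON schema types to arrow type names.
SCALAR_JSON_TO_ARROW = {"integer": "int64", "number": "double",
                        "boolean": "bool", "string": "string"}


def to_reader_schema_skipping_objects(master):
    """Build the explicit reader schema, leaving object fields out.

    Fields absent from the explicit schema are left for the reader to infer,
    so nested data keeps whatever structure it really has.
    """
    return {
        name: SCALAR_JSON_TO_ARROW[json_type]
        for name, json_type in master.items()
        if json_type != "object"
    }


master = {"id": "integer", "name": "string", "meta": "object"}
explicit = to_reader_schema_skipping_objects(master)
print(explicit)
```

Scalar fields still get an explicit type, so a user-provided override for them continues to apply; only object fields fall back to inference.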

@github-actions github-actions bot added the labels area/connectors (Connector related issues) and area/documentation (Improvements or additions to documentation) Sep 12, 2022
@davydov-d
Contributor Author

/test connector=conenctors/source-s3

@davydov-d
Contributor Author

davydov-d commented Sep 13, 2022

/test connector=conenctors/source-s3

🕑 conenctors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/3043007587
❌ conenctors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/3043007587
🐛

@davydov-d
Contributor Author

davydov-d commented Sep 13, 2022

/test connector=connectors/source-s3

🕑 connectors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/3043080729
✅ connectors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/3043080729
Python tests coverage:

	 Name                                                 Stmts   Miss  Cover   Missing
	 ----------------------------------------------------------------------------------
	 source_acceptance_test/base.py                          10      4    60%   15-18
	 source_acceptance_test/config.py                        83      6    93%   78-80, 84-86
	 source_acceptance_test/conftest.py                     164    164     0%   6-282
	 source_acceptance_test/plugin.py                        48     48     0%   6-104
	 source_acceptance_test/tests/test_core.py              329    111    66%   39, 50-58, 63-70, 74-75, 79-80, 164, 202-219, 228-236, 240-245, 251, 284-289, 327-334, 374-376, 379, 439-448, 477-478, 484, 487, 520-530, 543-568, 573-577
	 source_acceptance_test/tests/test_full_refresh.py       52      2    96%   34, 65
	 source_acceptance_test/tests/test_incremental.py       121     25    79%   21-23, 29-31, 36-43, 48-61, 208-216
	 source_acceptance_test/utils/asserts.py                 37      2    95%   57-58
	 source_acceptance_test/utils/common.py                  77     17    78%   15-16, 24-30, 47-54, 64, 67
	 source_acceptance_test/utils/compare.py                 62     23    63%   21-51, 68, 97-99
	 source_acceptance_test/utils/connector_runner.py       110     48    56%   23-26, 32, 36, 39-64, 67-69, 72-74, 77-79, 82-84, 87-89, 92-110, 144-146
	 source_acceptance_test/utils/json_schema_helper.py     105     13    88%   30-31, 38, 41, 65-68, 96, 120, 190-192
	 ----------------------------------------------------------------------------------
	 TOTAL                                                 1325    463    65%
Name                                                              Stmts   Miss  Cover
-------------------------------------------------------------------------------------
source_s3/source_files_abstract/formats/parquet_spec.py               9      0   100%
source_s3/source_files_abstract/formats/jsonl_spec.py                13      0   100%
source_s3/source_files_abstract/formats/csv_spec.py                  16      0   100%
source_s3/source_files_abstract/formats/avro_spec.py                  5      0   100%
source_s3/s3file.py                                                  37      0   100%
source_s3/s3_utils.py                                                19      0   100%
source_s3/__init__.py                                                 2      0   100%
source_s3/source.py                                                  29      1    97%
source_s3/source_files_abstract/storagefile.py                       23      1    96%
source_s3/source_files_abstract/stream.py                           218     13    94%
source_s3/stream.py                                                  43      3    93%
source_s3/source_files_abstract/formats/abstract_file_parser.py      39      3    92%
source_s3/source_files_abstract/formats/csv_parser.py                76     18    76%
source_s3/source_files_abstract/file_info.py                         26      8    69%
source_s3/utils.py                                                   31     10    68%
source_s3/source_files_abstract/source.py                            37     14    62%
source_s3/source_files_abstract/spec.py                              44     22    50%
source_s3/source_files_abstract/formats/jsonl_parser.py              42     25    40%
source_s3/source_files_abstract/formats/avro_parser.py               38     25    34%
source_s3/source_files_abstract/formats/parquet_parser.py            61     44    28%
-------------------------------------------------------------------------------------
TOTAL                                                               808    187    77%
Name                                                              Stmts   Miss  Cover
-------------------------------------------------------------------------------------
source_s3/source_files_abstract/storagefile.py                       23      0   100%
source_s3/source_files_abstract/spec.py                              44      0   100%
source_s3/source_files_abstract/formats/parquet_spec.py               9      0   100%
source_s3/source_files_abstract/formats/jsonl_spec.py                13      0   100%
source_s3/source_files_abstract/formats/csv_spec.py                  16      0   100%
source_s3/source_files_abstract/formats/avro_spec.py                  5      0   100%
source_s3/source.py                                                  29      0   100%
source_s3/s3file.py                                                  37      0   100%
source_s3/s3_utils.py                                                19      0   100%
source_s3/__init__.py                                                 2      0   100%
source_s3/source_files_abstract/formats/parquet_parser.py            61      1    98%
source_s3/stream.py                                                  43      1    98%
source_s3/source_files_abstract/formats/jsonl_parser.py              42      1    98%
source_s3/source_files_abstract/formats/abstract_file_parser.py      39      1    97%
source_s3/source_files_abstract/source.py                            37      2    95%
source_s3/source_files_abstract/formats/avro_parser.py               38      3    92%
source_s3/source_files_abstract/file_info.py                         26      3    88%
source_s3/source_files_abstract/stream.py                           218     36    83%
source_s3/source_files_abstract/formats/csv_parser.py                76     18    76%
source_s3/utils.py                                                   31      8    74%
-------------------------------------------------------------------------------------
TOTAL                                                               808     74    91%

Build Passed

Test summary info:

All Passed

Contributor

@alafanechere alafanechere left a comment

Thank you for digging into this, but I'm not sure I really understand what happens there 😄

Contributor

@alafanechere alafanechere left a comment

I approve because this fixes the OC issue. As @davydov-d mentioned, there's a lot of fragility in the schema inference mechanism. I hope we get to rework this as part of our efforts in abstracting the file sources.

@davydov-d
Contributor Author

davydov-d commented Sep 19, 2022

/publish connector=connectors/source-s3

🕑 Publishing the following connectors:
connectors/source-s3
https://github.com/airbytehq/airbyte/actions/runs/3080988595


Connector Did it publish? Were definitions generated?
connectors/source-s3

if you have connectors that successfully published but failed definition generation, follow step 4 here ▶️

@davydov-d davydov-d merged commit 4dc394c into master Sep 19, 2022
@davydov-d davydov-d deleted the ddavydov/#531-source-s3-fix-jsonl-nested-structures branch September 19, 2022 09:09
robbinhan pushed a commit to robbinhan/airbyte that referenced this pull request Sep 29, 2022
* airbytehq#531 source s3: fix reading nested jsonl files

* airbytehq#531 source s3: upd changelog

* oncall airbytehq#531 source s3: fix sample file

* auto-bump connector version [ci skip]

Co-authored-by: Octavia Squidington III <[email protected]>
jhammarstedt pushed a commit to jhammarstedt/airbyte that referenced this pull request Oct 31, 2022
* airbytehq#531 source s3: fix reading nested jsonl files

* airbytehq#531 source s3: upd changelog

* oncall airbytehq#531 source s3: fix sample file

* auto-bump connector version [ci skip]

Co-authored-by: Octavia Squidington III <[email protected]>
Labels
area/connectors (Connector related issues), area/documentation (Improvements or additions to documentation), connectors/source/s3
5 participants