🐛 Source S3: Loading of files' metadata #8252

Merged
merged 59 commits into from
Feb 1, 2022

Conversation

antixar
Contributor

@antixar antixar commented Nov 25, 2021

What

The code had the following bottleneck: the LastModified property of each file was loaded using multithreading, which can cause problems when an S3 bucket contains a large number of files.

How

  • Code refactoring:
  1. Load all file properties while listing the bucket items, replacing the multithreaded loading.
  2. Add more informative log messages.
  • Add test suites.
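
A minimal sketch of the first refactoring item, assuming the listing response has the shape returned by S3's `list_objects_v2` (each entry under `Contents` carries `Key`, `Size` and `LastModified`). The `FileInfo` fields and the `iter_file_infos` helper are illustrative names, not necessarily the ones used in this PR.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Dict, Iterable, Iterator


@dataclass(frozen=True)
class FileInfo:
    """Hypothetical container for the metadata returned with each listed object."""

    key: str
    size: int
    last_modified: datetime


def iter_file_infos(pages: Iterable[Dict[str, Any]]) -> Iterator[FileInfo]:
    """Yield FileInfo objects straight from list_objects_v2-style response pages.

    Each "Contents" entry already includes "Key", "Size" and "LastModified",
    so no per-file request (and no thread pool) is needed.
    """
    for page in pages:
        # "Contents" is absent when a page has no matching objects.
        for obj in page.get("Contents", []):
            yield FileInfo(key=obj["Key"], size=obj["Size"], last_modified=obj["LastModified"])


# With boto3 the pages could come from a paginator, for example:
#   client = boto3.client("s3")
#   pages = client.get_paginator("list_objects_v2").paginate(Bucket="my-bucket", Prefix="data/")
# Fake pages stand in here so the sketch runs without network access.
fake_pages = [
    {"Contents": [{"Key": "data/a.csv", "Size": 10, "LastModified": datetime(2021, 11, 25, tzinfo=timezone.utc)}]},
    {},  # a page with no objects
]
infos = list(iter_file_infos(fake_pages))
```

Because the helper is a generator, files are produced as pages arrive instead of being collected by worker threads first.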

Recommended reading order

  1. unit_tests/
  2. integration_tests/
  3. csv_parser.py

Pre-merge Checklist

Community member or Airbyter

  • Grant edit access to maintainers (instructions)
  • Secrets in the connector's spec are annotated with airbyte_secret
  • Unit & integration tests added and passing. Community members, please provide proof of success locally e.g: screenshot or copy-paste unit, integration, and acceptance test output. To run acceptance tests for a Python connector, follow instructions in the README. For java connectors run ./gradlew :airbyte-integrations:connectors:<name>:integrationTest.
  • Code reviews completed
  • Documentation updated
    • Connector's README.md
    • Connector's bootstrap.md. See description and examples
    • Changelog updated in docs/integrations/<source or destination>/<name>.md including changelog. See changelog example
  • PR name follows PR naming conventions

Airbyter

If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.

  • Create a non-forked branch based on this PR and test the below items on it
  • Build is successful
  • Credentials added to Github CI. Instructions.
  • /test connector=connectors/<name> command is passing.
  • New Connector version released on Dockerhub by running the /publish command described here
  • After the new connector version is published, connector version bumped in the seed directory as described here
  • Seed specs have been re-generated by building the platform and committing the changes to the seed spec files, as described here

@antixar antixar temporarily deployed to more-secrets November 25, 2021 14:50 Inactive
@github-actions github-actions bot added the area/connectors Connector related issues label Nov 25, 2021
@antixar antixar self-assigned this Nov 25, 2021
@antixar
Contributor Author

antixar commented Nov 25, 2021

/test connector=connectors/source-s3

🕑 connectors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/1504229322
❌ connectors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/1504229322
🐛 https://gradle.com/s/wrxbbaeqzi5yc

@antixar antixar linked an issue Nov 25, 2021 that may be closed by this pull request
@jrhizor jrhizor temporarily deployed to more-secrets November 25, 2021 15:00 Inactive
@github-actions github-actions bot added the area/documentation Improvements or additions to documentation label Nov 25, 2021
@antixar antixar temporarily deployed to more-secrets November 25, 2021 15:43 Inactive
@antixar
Contributor Author

antixar commented Nov 25, 2021

/test connector=connectors/source-s3

🕑 connectors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/1504399931
❌ connectors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/1504399931
🐛 https://gradle.com/s/fdypicnscjjz4

@jrhizor jrhizor temporarily deployed to more-secrets November 25, 2021 15:46 Inactive
@antixar
Contributor Author

antixar commented Nov 25, 2021

/test connector=connectors/source-s3

🕑 connectors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/1504662348
❌ connectors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/1504662348
🐛 https://gradle.com/s/mnfyvhhpbni7m

@antixar antixar temporarily deployed to more-secrets November 25, 2021 17:05 Inactive
@jrhizor jrhizor temporarily deployed to more-secrets November 25, 2021 17:06 Inactive
@antixar
Contributor Author

antixar commented Nov 25, 2021

/test connector=connectors/source-s3

🕑 connectors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/1504983223
✅ connectors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/1504983223
Python tests coverage:

	 ---------- coverage: platform linux, python 3.8.10-final-0 -----------
	 Name                                                 Stmts   Miss  Cover
	 ------------------------------------------------------------------------
	 source_acceptance_test/__init__.py                       2      0   100%
	 source_acceptance_test/base.py                          10      4    60%
	 source_acceptance_test/config.py                        75      8    89%
	 source_acceptance_test/conftest.py                     108    108     0%
	 source_acceptance_test/plugin.py                        47     47     0%
	 source_acceptance_test/tests/__init__.py                 4      0   100%
	 source_acceptance_test/tests/test_core.py              200     94    53%
	 source_acceptance_test/tests/test_full_refresh.py       38     27    29%
	 source_acceptance_test/tests/test_incremental.py        69     38    45%
	 source_acceptance_test/utils/__init__.py                 6      0   100%
	 source_acceptance_test/utils/asserts.py                 37      2    95%
	 source_acceptance_test/utils/common.py                  41     24    41%
	 source_acceptance_test/utils/compare.py                 62     25    60%
	 source_acceptance_test/utils/connector_runner.py        82     49    40%
	 source_acceptance_test/utils/json_schema_helper.py     115     14    88%
	 ------------------------------------------------------------------------
	 TOTAL                                                  896    440    51%
	 ---------- coverage: platform linux, python 3.8.10-final-0 -----------
	 Name                                                              Stmts   Miss  Cover
	 -------------------------------------------------------------------------------------
	 source_s3/__init__.py                                                 2      0   100%
	 source_s3/s3_utils.py                                                20      0   100%
	 source_s3/s3file.py                                                  37      2    95%
	 source_s3/source.py                                                  23      0   100%
	 source_s3/source_files_abstract/__init__.py                           0      0   100%
	 source_s3/source_files_abstract/file_info.py                         32     11    66%
	 source_s3/source_files_abstract/formats/abstract_file_parser.py      38      2    95%
	 source_s3/source_files_abstract/formats/csv_parser.py               126     23    82%
	 source_s3/source_files_abstract/formats/csv_spec.py                  15      0   100%
	 source_s3/source_files_abstract/formats/parquet_parser.py            62     44    29%
	 source_s3/source_files_abstract/formats/parquet_spec.py               9      0   100%
	 source_s3/source_files_abstract/source.py                            37     14    62%
	 source_s3/source_files_abstract/spec.py                              42     22    48%
	 source_s3/source_files_abstract/storagefile.py                       23      1    96%
	 source_s3/source_files_abstract/stream.py                           182     11    94%
	 source_s3/stream.py                                                  42      3    93%
	 -------------------------------------------------------------------------------------
	 TOTAL                                                               690    133    81%
	 ---------- coverage: platform linux, python 3.8.10-final-0 -----------
	 Name                                                              Stmts   Miss  Cover
	 -------------------------------------------------------------------------------------
	 source_s3/__init__.py                                                 2      0   100%
	 source_s3/s3_utils.py                                                20     13    35%
	 source_s3/s3file.py                                                  37     18    51%
	 source_s3/source.py                                                  23      0   100%
	 source_s3/source_files_abstract/__init__.py                           0      0   100%
	 source_s3/source_files_abstract/file_info.py                         32      5    84%
	 source_s3/source_files_abstract/formats/abstract_file_parser.py      38      0   100%
	 source_s3/source_files_abstract/formats/csv_parser.py               126     22    83%
	 source_s3/source_files_abstract/formats/csv_spec.py                  15      0   100%
	 source_s3/source_files_abstract/formats/parquet_parser.py            62      3    95%
	 source_s3/source_files_abstract/formats/parquet_spec.py               9      0   100%
	 source_s3/source_files_abstract/source.py                            37     15    59%
	 source_s3/source_files_abstract/spec.py                              42     22    48%
	 source_s3/source_files_abstract/storagefile.py                       23      5    78%
	 source_s3/source_files_abstract/stream.py                           182     91    50%
	 source_s3/stream.py                                                  42     30    29%
	 -------------------------------------------------------------------------------------
	 /actions-runner/_work/airbyte/airbyte/airbyte-integrations/connectors/source-s3/.venv/lib/python3.8/site-packages/coverage/data.py:118: CoverageWarning: Data file '/actions-runner/_work/airbyte/airbyte/airbyte-integrations/connectors/source-s3/.coverage.ip-10-0-51-178.9125.575151' doesn't seem to be a coverage data file: Couldn't use data file '/actions-runner/_work/airbyte/airbyte/airbyte-integrations/connectors/source-s3/.coverage.ip-10-0-51-178.9125.575151': no such table: coverage_schema
	 TOTAL                                                               690    224    68%

@antixar antixar temporarily deployed to more-secrets November 25, 2021 19:05 Inactive
@jrhizor jrhizor temporarily deployed to more-secrets November 25, 2021 19:06 Inactive
@antixar antixar requested a review from Phlair November 26, 2021 08:03
Phlair
Phlair previously requested changes Dec 8, 2021
Contributor

@Phlair Phlair left a comment

Some great stuff in here!

  • the file_info class and obtaining last_modified in stream are solid changes. We reduce network calls and potentially even use less memory (due to efficient class storage) 👍
  • adding a way to enforce memory cap during testing, awesome. This could be a great tool in cdk for developers to utilise in tests.
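
One way such a memory cap could be enforced in a test, sketched with the standard library's `tracemalloc`. The `memory_limit` decorator is a hypothetical name, not the helper added in this PR, and `tracemalloc` only sees allocations made through Python's allocators, so memory allocated natively (for example inside PyArrow) would not be counted.

```python
import functools
import tracemalloc


def memory_limit(max_mib: float):
    """Fail the wrapped call if its peak traced allocation exceeds max_mib."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            tracemalloc.start()
            try:
                result = func(*args, **kwargs)
                # get_traced_memory() returns (current, peak) in bytes.
                _, peak = tracemalloc.get_traced_memory()
            finally:
                tracemalloc.stop()
            peak_mib = peak / (1024 * 1024)
            assert peak_mib <= max_mib, f"peak {peak_mib:.1f} MiB exceeds cap of {max_mib} MiB"
            return result

        return wrapper

    return decorator


@memory_limit(max_mib=5)
def count_lines(lines):
    # Consumes the iterable lazily, so memory use stays flat.
    return sum(1 for _ in lines)


total = count_lines(f"row {i}" for i in range(1000))
```

A reader that loads a whole file into memory would trip the assertion, while a streaming reader should stay under the cap.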

Would be great if you could update the docstrings and code comments where appropriate to indicate these logic changes, it will help any new developers to not get confused (e.g. at some point we'll be adding GCP/Azure Blob using this abstract framework).

Not sure on:

  • custom chunking in CSV. Can't/isn't PyArrow already doing this?
  • changes to stream slices, see my comment on that.

Let's discuss those on the comments I made and go from there 👍

(haven't done detailed review of all the new tests yet, will wait for any changes as this could alter those and then do so)

return self.__str__()

@classmethod
def create_by_local_file(cls, filepath: str) -> "FileInfo":
Contributor

Looks like we're only using this method in testing... doesn't make sense as a method of the class imo, should rather be a separate private testing method to instantiate an instance of FileInfo if we need it.

Contributor Author

moved to test folders
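
For illustration, a test-only helper along the lines suggested above might look like the sketch below. The `FileInfo` stand-in and its field names are assumptions, and `file_info_from_local_file` is a hypothetical name for whatever now lives in the test folders.

```python
import os
import tempfile
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class FileInfo:
    # Stand-in for the connector's FileInfo; the real field names may differ.
    key: str
    size: int
    last_modified: datetime


def file_info_from_local_file(filepath: str) -> FileInfo:
    """Test-only helper: build a FileInfo from a file on local disk."""
    stat = os.stat(filepath)
    return FileInfo(
        key=filepath,
        size=stat.st_size,
        last_modified=datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc),
    )


# Write a small CSV in binary mode so the byte count is platform independent.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(b"id,name\n1,a\n")
    path = tmp.name

info = file_info_from_local_file(path)
os.remove(path)
```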

…es_abstract/file_info.py

Co-authored-by: George Claireaux <[email protected]>
@antixar antixar temporarily deployed to more-secrets December 8, 2021 13:26 Inactive
@antixar
Contributor Author

antixar commented Feb 1, 2022

/test connector=connectors/source-s3

🕑 connectors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/1779353901
✅ connectors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/1779353901
No Python unittests run

@antixar antixar temporarily deployed to more-secrets February 1, 2022 16:13 Inactive
@octavia-squidington-iii octavia-squidington-iii temporarily deployed to more-secrets February 1, 2022 16:13 Inactive
@antixar antixar temporarily deployed to more-secrets February 1, 2022 17:19 Inactive
@antixar
Contributor Author

antixar commented Feb 1, 2022

/test connector=connectors/source-s3

🕑 connectors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/1779708516
✅ connectors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/1779708516
Python tests coverage:

	 Name                                                 Stmts   Miss  Cover
	 ------------------------------------------------------------------------
	 source_acceptance_test/__init__.py                       2      0   100%
	 source_acceptance_test/base.py                          10      4    60%
	 source_acceptance_test/config.py                        74      6    92%
	 source_acceptance_test/tests/__init__.py                 4      0   100%
	 source_acceptance_test/tests/test_core.py              275    106    61%
	 source_acceptance_test/tests/test_full_refresh.py       52      2    96%
	 source_acceptance_test/tests/test_incremental.py        69     38    45%
	 source_acceptance_test/utils/__init__.py                 6      0   100%
	 source_acceptance_test/utils/asserts.py                 37      2    95%
	 source_acceptance_test/utils/common.py                  70     17    76%
	 source_acceptance_test/utils/compare.py                 62     23    63%
	 source_acceptance_test/utils/connector_runner.py       110     48    56%
	 source_acceptance_test/utils/json_schema_helper.py     105     13    88%
	 ------------------------------------------------------------------------
	 TOTAL                                                  876    259    70%
	 Name                                                              Stmts   Miss  Cover
	 -------------------------------------------------------------------------------------
	 source_s3/__init__.py                                                 2      0   100%
	 source_s3/s3_utils.py                                                19      0   100%
	 source_s3/s3file.py                                                  37      2    95%
	 source_s3/source.py                                                  28      1    96%
	 source_s3/source_files_abstract/__init__.py                           0      0   100%
	 source_s3/source_files_abstract/file_info.py                         26      8    69%
	 source_s3/source_files_abstract/formats/abstract_file_parser.py      35      2    94%
	 source_s3/source_files_abstract/formats/csv_parser.py                74     18    76%
	 source_s3/source_files_abstract/formats/csv_spec.py                  16      0   100%
	 source_s3/source_files_abstract/formats/parquet_parser.py            61     44    28%
	 source_s3/source_files_abstract/formats/parquet_spec.py               9      0   100%
	 source_s3/source_files_abstract/source.py                            37     14    62%
	 source_s3/source_files_abstract/spec.py                              42     22    48%
	 source_s3/source_files_abstract/storagefile.py                       23      1    96%
	 source_s3/source_files_abstract/stream.py                           184     11    94%
	 source_s3/stream.py                                                  43      3    93%
	 source_s3/utils.py                                                   29     10    66%
	 -------------------------------------------------------------------------------------
	 TOTAL                                                               665    136    80%
	 Name                                                              Stmts   Miss  Cover
	 -------------------------------------------------------------------------------------
	 source_s3/__init__.py                                                 2      0   100%
	 source_s3/s3_utils.py                                                19     13    32%
	 source_s3/s3file.py                                                  37     18    51%
	 source_s3/source.py                                                  28      0   100%
	 source_s3/source_files_abstract/__init__.py                           0      0   100%
	 source_s3/source_files_abstract/file_info.py                         26     10    62%
	 source_s3/source_files_abstract/formats/abstract_file_parser.py      35      0   100%
	 source_s3/source_files_abstract/formats/csv_parser.py                74     18    76%
	 source_s3/source_files_abstract/formats/csv_spec.py                  16      0   100%
	 source_s3/source_files_abstract/formats/parquet_parser.py            61      3    95%
	 source_s3/source_files_abstract/formats/parquet_spec.py               9      0   100%
	 source_s3/source_files_abstract/source.py                            37     15    59%
	 source_s3/source_files_abstract/spec.py                              42     22    48%
	 source_s3/source_files_abstract/storagefile.py                       23      5    78%
	 source_s3/source_files_abstract/stream.py                           184     91    51%
	 source_s3/stream.py                                                  43     30    30%
	 source_s3/utils.py                                                   29      8    72%
	 -------------------------------------------------------------------------------------
	 TOTAL                                                               665    233    65%

@octavia-squidington-iii octavia-squidington-iii temporarily deployed to more-secrets February 1, 2022 17:23 Inactive
@antixar
Contributor Author

antixar commented Feb 1, 2022

/test connector=connectors/source-s3

🕑 connectors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/1779814977
❌ connectors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/1779814977
🐛 https://gradle.com/s/eqazn7stzdnwo

@antixar antixar temporarily deployed to more-secrets February 1, 2022 17:45 Inactive
@octavia-squidington-iii octavia-squidington-iii temporarily deployed to more-secrets February 1, 2022 17:46 Inactive
@antixar
Contributor Author

antixar commented Feb 1, 2022

/test connector=connectors/source-s3

🕑 connectors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/1779910305
❌ connectors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/1779910305
🐛 https://gradle.com/s/hm4nyrgi4zvc4

@antixar antixar temporarily deployed to more-secrets February 1, 2022 18:06 Inactive
@octavia-squidington-iii octavia-squidington-iii temporarily deployed to more-secrets February 1, 2022 18:07 Inactive
@antixar
Contributor Author

antixar commented Feb 1, 2022

/test connector=connectors/source-s3

🕑 connectors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/1780270201
❌ connectors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/1780270201
🐛 https://gradle.com/s/7vle5cins4ylg

@antixar antixar temporarily deployed to more-secrets February 1, 2022 19:24 Inactive
@octavia-squidington-iii octavia-squidington-iii temporarily deployed to more-secrets February 1, 2022 19:25 Inactive
@antixar
Contributor Author

antixar commented Feb 1, 2022

/test connector=connectors/source-s3

🕑 connectors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/1780348324
✅ connectors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/1780348324
Python tests coverage:

	 Name                                                 Stmts   Miss  Cover
	 ------------------------------------------------------------------------
	 source_acceptance_test/__init__.py                       2      0   100%
	 source_acceptance_test/base.py                          10      4    60%
	 source_acceptance_test/config.py                        74      6    92%
	 source_acceptance_test/tests/__init__.py                 4      0   100%
	 source_acceptance_test/tests/test_core.py              275    106    61%
	 source_acceptance_test/tests/test_full_refresh.py       52      2    96%
	 source_acceptance_test/tests/test_incremental.py        69     38    45%
	 source_acceptance_test/utils/__init__.py                 6      0   100%
	 source_acceptance_test/utils/asserts.py                 37      2    95%
	 source_acceptance_test/utils/common.py                  70     17    76%
	 source_acceptance_test/utils/compare.py                 62     23    63%
	 source_acceptance_test/utils/connector_runner.py       110     48    56%
	 source_acceptance_test/utils/json_schema_helper.py     105     13    88%
	 ------------------------------------------------------------------------
	 TOTAL                                                  876    259    70%
	 Name                                                              Stmts   Miss  Cover
	 -------------------------------------------------------------------------------------
	 source_s3/__init__.py                                                 2      0   100%
	 source_s3/s3_utils.py                                                19      0   100%
	 source_s3/s3file.py                                                  37      2    95%
	 source_s3/source.py                                                  28      1    96%
	 source_s3/source_files_abstract/__init__.py                           0      0   100%
	 source_s3/source_files_abstract/file_info.py                         26      8    69%
	 source_s3/source_files_abstract/formats/abstract_file_parser.py      35      2    94%
	 source_s3/source_files_abstract/formats/csv_parser.py                74     18    76%
	 source_s3/source_files_abstract/formats/csv_spec.py                  16      0   100%
	 source_s3/source_files_abstract/formats/parquet_parser.py            61     44    28%
	 source_s3/source_files_abstract/formats/parquet_spec.py               9      0   100%
	 source_s3/source_files_abstract/source.py                            37     14    62%
	 source_s3/source_files_abstract/spec.py                              42     22    48%
	 source_s3/source_files_abstract/storagefile.py                       23      1    96%
	 source_s3/source_files_abstract/stream.py                           184     11    94%
	 source_s3/stream.py                                                  43      3    93%
	 source_s3/utils.py                                                   29     10    66%
	 -------------------------------------------------------------------------------------
	 TOTAL                                                               665    136    80%
	 Name                                                              Stmts   Miss  Cover
	 -------------------------------------------------------------------------------------
	 source_s3/__init__.py                                                 2      0   100%
	 source_s3/s3_utils.py                                                19     13    32%
	 source_s3/s3file.py                                                  37     18    51%
	 source_s3/source.py                                                  28      0   100%
	 source_s3/source_files_abstract/__init__.py                           0      0   100%
	 source_s3/source_files_abstract/file_info.py                         26     10    62%
	 source_s3/source_files_abstract/formats/abstract_file_parser.py      35      0   100%
	 source_s3/source_files_abstract/formats/csv_parser.py                74     18    76%
	 source_s3/source_files_abstract/formats/csv_spec.py                  16      0   100%
	 source_s3/source_files_abstract/formats/parquet_parser.py            61      3    95%
	 source_s3/source_files_abstract/formats/parquet_spec.py               9      0   100%
	 source_s3/source_files_abstract/source.py                            37     15    59%
	 source_s3/source_files_abstract/spec.py                              42     22    48%
	 source_s3/source_files_abstract/storagefile.py                       23      5    78%
	 source_s3/source_files_abstract/stream.py                           184     91    51%
	 source_s3/stream.py                                                  43     30    30%
	 source_s3/utils.py                                                   29      8    72%
	 -------------------------------------------------------------------------------------
	 TOTAL                                                               665    233    65%

@antixar antixar temporarily deployed to more-secrets February 1, 2022 19:42 Inactive
@octavia-squidington-iii octavia-squidington-iii temporarily deployed to more-secrets February 1, 2022 19:43 Inactive
@antixar
Contributor Author

antixar commented Feb 1, 2022

/publish connector=connectors/source-s3

🕑 connectors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/1781019772
✅ connectors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/1781019772

@octavia-squidington-iii octavia-squidington-iii temporarily deployed to more-secrets February 1, 2022 22:23 Inactive
@antixar antixar temporarily deployed to more-secrets February 1, 2022 22:47 Inactive
@antixar antixar merged commit 91eff1d into master Feb 1, 2022
@antixar antixar deleted the antixar/6870-source-s3-csv-memory-leak branch February 1, 2022 22:49
Labels
area/connectors Connector related issues area/documentation Improvements or additions to documentation
Development

Successfully merging this pull request may close these issues.

Memory leak in source-s3 while reading big CSV file
5 participants