-
Notifications
You must be signed in to change notification settings - Fork 4.6k
🐛 Source S3: Loading of files' metadata #8252
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
/test connector=connectors/source-s3
|
/test connector=connectors/source-s3
|
/test connector=connectors/source-s3
|
/test connector=connectors/source-s3
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some great stuff in here!
- the file_info class and obtaining last_modified in stream are solid changes. We reduce network calls and potentially even use less memory (due to efficient class storage) 👍
- adding a way to enforce memory cap during testing, awesome. This could be a great tool in cdk for developers to utilise in tests.
Would be great if you could update the docstrings and code comments where appropriate to indicate these logic changes, it will help any new developers to not get confused (e.g. at some point we'll be adding GCP/Azure Blob using this abstract framework).
Not sure on:
- custom chunking in CSV. Can't/isn't PyArrow already doing this?
- changes to stream slices, see my comment on that.
Let's discuss those on the comments I made and go from there 👍
(haven't done detailed review of all the new tests yet, will wait for any changes as this could alter those and then do so)
return self.__str__() | ||
|
||
@classmethod | ||
def create_by_local_file(cls, filepath: str) -> "FileInfo": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like we're only using this method in testing... doesn't make sense as a method of the class imo, should rather be a separate private testing method to instantiate an instance of FileInfo if we need it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved to test folders
airbyte-integrations/connectors/source-s3/source_s3/source_files_abstract/formats/csv_parser.py
Outdated
Show resolved
Hide resolved
airbyte-integrations/connectors/source-s3/source_s3/source_files_abstract/stream.py
Show resolved
Hide resolved
airbyte-integrations/connectors/source-s3/source_s3/source_files_abstract/stream.py
Outdated
Show resolved
Hide resolved
airbyte-integrations/connectors/source-s3/integration_tests/big_csv_file_test.py
Outdated
Show resolved
Hide resolved
airbyte-integrations/connectors/source-s3/source_s3/source_files_abstract/file_info.py
Outdated
Show resolved
Hide resolved
airbyte-integrations/connectors/source-s3/source_s3/source_files_abstract/file_info.py
Outdated
Show resolved
Hide resolved
airbyte-integrations/connectors/source-s3/source_s3/source_files_abstract/stream.py
Outdated
Show resolved
Hide resolved
…es_abstract/file_info.py Co-authored-by: George Claireaux <[email protected]>
/test connector=connectors/source-s3
|
/test connector=connectors/source-s3
|
/test connector=connectors/source-s3
|
/test connector=connectors/source-s3
|
/test connector=connectors/source-s3
|
/test connector=connectors/source-s3
|
/publish connector=connectors/source-s3
|
What
The code had following bottleneck: Loading of LastModified files' properties was implemented with multithreading. And it can have potential problems where a S3 storage have a lot of files.
How
Recommended reading order
Pre-merge Checklist
Community member or Airbyter
airbyte_secret
./gradlew :airbyte-integrations:connectors:<name>:integrationTest
.README.md
bootstrap.md
. See description and examplesdocs/integrations/<source or destination>/<name>.md
including changelog. See changelog exampleAirbyter
If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.
/test connector=connectors/<name>
command is passing./publish
command described here