🎉 GCS destination: use serialized buffer; compress csv & jsonl #11686

Merged
merged 24 commits into from
Apr 4, 2022

Conversation

@tuliren tuliren commented Apr 4, 2022

What

Config class

  • Each destination config class includes a BlobStorageCredentialConfig<ConfigType>
    • For S3, it is S3CredentialConfig, whose ConfigType is S3CredentialType.
    • For GCS, it is GcsCredentialConfig, whose ConfigType is GcsCredentialType.
  • S3 has two credential config implementations
    • S3AccessKeyCredentialConfig
    • S3InstanceProfileCredentialConfig
  • GCS has one credential config implementation
    • GcsHmacKeyCredentialConfig
  • The GCS credential config can be converted to an S3 credential config.
    • This is because currently we always use the S3 client to communicate with GCS.
    • This may change in the future.
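
The hierarchy above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the type names come from the list above, while the enum values, field names, and the `toS3CredentialConfig` method name are assumptions made for the example.

```java
// Hypothetical sketch of the credential config hierarchy described above.
// Type names follow the PR description; enum values, fields, and method
// names are assumptions for illustration only.
class CredentialConfigSketch {

  enum S3CredentialType { ACCESS_KEY, DEFAULT_PROFILE } // assumed values
  enum GcsCredentialType { HMAC_KEY }                   // assumed value

  // Every destination config holds one of these, parameterized by its type enum.
  interface BlobStorageCredentialConfig<ConfigType> {
    ConfigType getCredentialType();
  }

  interface S3CredentialConfig extends BlobStorageCredentialConfig<S3CredentialType> {}

  interface GcsCredentialConfig extends BlobStorageCredentialConfig<GcsCredentialType> {
    // Assumed method name: lets GCS code hand credentials to an S3 client.
    S3CredentialConfig toS3CredentialConfig();
  }

  static final class S3AccessKeyCredentialConfig implements S3CredentialConfig {
    final String accessKeyId;
    final String secretAccessKey;

    S3AccessKeyCredentialConfig(String accessKeyId, String secretAccessKey) {
      this.accessKeyId = accessKeyId;
      this.secretAccessKey = secretAccessKey;
    }

    @Override
    public S3CredentialType getCredentialType() {
      return S3CredentialType.ACCESS_KEY;
    }
  }

  static final class GcsHmacKeyCredentialConfig implements GcsCredentialConfig {
    final String hmacKeyAccessId;
    final String hmacKeySecret;

    GcsHmacKeyCredentialConfig(String hmacKeyAccessId, String hmacKeySecret) {
      this.hmacKeyAccessId = hmacKeyAccessId;
      this.hmacKeySecret = hmacKeySecret;
    }

    @Override
    public GcsCredentialType getCredentialType() {
      return GcsCredentialType.HMAC_KEY;
    }

    // The HMAC key pair is reused as an S3-style access key pair.
    @Override
    public S3CredentialConfig toS3CredentialConfig() {
      return new S3AccessKeyCredentialConfig(hmacKeyAccessId, hmacKeySecret);
    }
  }

  public static void main(String[] args) {
    GcsCredentialConfig gcs = new GcsHmacKeyCredentialConfig("access-id", "secret");
    S3CredentialConfig s3 = gcs.toS3CredentialConfig();
    System.out.println(gcs.getCredentialType() + " -> " + s3.getCredentialType());
  }
}
```

Keeping the conversion on the GCS side means the S3 code never needs to know about GCS types, which should make it easier to drop the conversion if GCS later gets its own client.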

Recommended reading order

  1. GcsDestination.java
  2. GcsDestinationConfig.java
  3. S3DestinationConfig.java

🚨 User Impact 🚨

  1. The main improvement in this PR is that the GCS destination can now handle a large number of streams. Previously it would throw an out-of-memory error whenever there were too many streams (e.g. 50).
  2. 🚨 The impact for the users is that CSV and JSONL formats will be automatically compressed with GZIP. For example, if an output file was previously abc.csv, it will become abc.csv.gz after this PR.
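
To illustrate both points, here is a minimal sketch of a file-backed, GZIP-compressed buffer, together with how a downstream consumer could read the resulting `.csv.gz` file. It is not the connector's implementation: the class name `GzipFileBufferSketch` and its methods are hypothetical, and only `java.util.zip` and `java.nio.file` from the standard library are used.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch: one buffer per stream, backed by a temp file rather
// than the heap, so memory use stays small even with many streams.
class GzipFileBufferSketch implements AutoCloseable {
  private final Path file;
  private final BufferedWriter writer;

  GzipFileBufferSketch(String prefix) throws IOException {
    this.file = Files.createTempFile(prefix, ".csv.gz");
    this.writer = new BufferedWriter(new OutputStreamWriter(
        new GZIPOutputStream(Files.newOutputStream(file)), StandardCharsets.UTF_8));
  }

  // Appends one CSV line; the data is compressed as it is written to disk.
  void accept(String csvLine) throws IOException {
    writer.write(csvLine);
    writer.write('\n');
  }

  // Finishes the GZIP stream and returns the file that is ready for upload.
  Path finish() throws IOException {
    writer.close();
    return file;
  }

  // Consumer side: wrap the input in GZIPInputStream to read abc.csv.gz as plain CSV lines.
  static List<String> readLines(Path gzFile) throws IOException {
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(
        new GZIPInputStream(Files.newInputStream(gzFile)), StandardCharsets.UTF_8))) {
      List<String> lines = new ArrayList<>();
      String line;
      while ((line = reader.readLine()) != null) {
        lines.add(line);
      }
      return lines;
    }
  }

  // Closes the writer (a no-op if already finished) and deletes the temp file.
  @Override
  public void close() throws IOException {
    writer.close();
    Files.deleteIfExists(file);
  }

  public static void main(String[] args) throws IOException {
    try (GzipFileBufferSketch buffer = new GzipFileBufferSketch("abc")) {
      buffer.accept("id,name");
      buffer.accept("1,alice");
      Path output = buffer.finish();
      System.out.println(readLines(output));
    }
  }
}
```

Tools that read the output only need to add a GZIP decoding step in front of their existing CSV or JSONL parser, as `readLines` does here.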

Pre-merge Checklist

Expand the relevant checklist and delete the others.

Updating a connector

Community member or Airbyter

  • Grant edit access to maintainers (instructions)
  • Secrets in the connector's spec are annotated with airbyte_secret
  • Unit & integration tests added and passing. Community members, please provide proof of success locally e.g: screenshot or copy-paste unit, integration, and acceptance test output. To run acceptance tests for a Python connector, follow instructions in the README. For java connectors run ./gradlew :airbyte-integrations:connectors:<name>:integrationTest.
  • Code reviews completed
  • Documentation updated
    • Connector's README.md
    • Connector's bootstrap.md. See description and examples
    • Changelog updated in docs/integrations/<source or destination>/<name>.md including changelog. See changelog example
  • PR name follows PR naming conventions

Airbyter

If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.

  • Create a non-forked branch based on this PR and test the below items on it
  • Build is successful
  • If new credentials are required for use in CI, add them to GSM. Instructions.
  • /test connector=connectors/<name> command is passing
  • New Connector version released on Dockerhub by running the /publish command described here
  • After the new connector version is published, connector version bumped in the seed directory as described here
  • Seed specs have been re-generated by building the platform and committing the changes to the seed spec files, as described here

@github-actions github-actions bot added the area/connectors Connector related issues label Apr 4, 2022

tuliren commented Apr 4, 2022

/test connector=connectors/destination-s3

🕑 connectors/destination-s3 https://github.com/airbytehq/airbyte/actions/runs/2089174172
✅ connectors/destination-s3 https://github.com/airbytehq/airbyte/actions/runs/2089174172
No Python unittests run

tuliren commented Apr 4, 2022

/test connector=connectors/destination-gcs

🕑 connectors/destination-gcs https://github.com/airbytehq/airbyte/actions/runs/2089174775
❌ connectors/destination-gcs https://github.com/airbytehq/airbyte/actions/runs/2089174775
🐛

tuliren commented Apr 4, 2022

/test connector=connectors/destination-gcs

🕑 connectors/destination-gcs https://github.com/airbytehq/airbyte/actions/runs/2089373624
✅ connectors/destination-gcs https://github.com/airbytehq/airbyte/actions/runs/2089373624
No Python unittests run

tuliren commented Apr 4, 2022

/test connector=connectors/destination-s3

🕑 connectors/destination-s3 https://github.com/airbytehq/airbyte/actions/runs/2089373977
✅ connectors/destination-s3 https://github.com/airbytehq/airbyte/actions/runs/2089373977
No Python unittests run

@tuliren tuliren marked this pull request as ready for review April 4, 2022 10:55
@tuliren tuliren requested review from edgao and ChristopheDuong April 4, 2022 10:55

tuliren commented Apr 4, 2022

@ChristopheDuong, this is the prerequisite for BigQuery changes. It updates many of the GCS classes to be compatible with S3 to directly reuse the S3 code. It also completes the migration for the GCS destination.

@github-actions github-actions bot added the area/documentation Improvements or additions to documentation label Apr 4, 2022

tuliren commented Apr 4, 2022

/test connector=connectors/destination-s3

🕑 connectors/destination-s3 https://github.com/airbytehq/airbyte/actions/runs/2091922283
✅ connectors/destination-s3 https://github.com/airbytehq/airbyte/actions/runs/2091922283
No Python unittests run

tuliren commented Apr 4, 2022

/test connector=connectors/destination-gcs

🕑 connectors/destination-gcs https://github.com/airbytehq/airbyte/actions/runs/2091922489
✅ connectors/destination-gcs https://github.com/airbytehq/airbyte/actions/runs/2091922489
No Python unittests run

@tuliren tuliren changed the title from "🎉 GCS destination: use file serialized buffer" to "🎉 GCS destination: use serialized buffer; compress csv & jsonl" Apr 4, 2022

tuliren commented Apr 4, 2022

I will publish the connector in a follow-up PR.

wallies commented Apr 14, 2022

Who decided it was a good idea for an ETL tool to output to gz? How are tools that use this data afterwards meant to process it?

The impact for the users is that CSV and JSONL formats will be automatically compressed with GZIP. For example, if an output file was previously abc.csv, it will become abc.csv.gz after this PR.

tuliren commented Apr 14, 2022

@wallies, thank you for raising this question and creating this issue. We decided to compress the CSV and JSONL formats based on the most common use case for blob storage: people usually use S3 and GCS just to archive their data, and compression reduces the storage cost.

I admit that this is not friendly to other use cases. I have created an issue here. We should have a new version that provides an option to not compress these formats by the end of this week or early next week.

wallies commented Apr 14, 2022

Much appreciated, @tuliren. I also raised #11872, as we were seeing no file extension at all on any new files.

tuliren commented Apr 22, 2022

@wallies, the new version with an option to not compress CSV and JSONL files has been published. Please give the new version a try. Thanks~~

Labels
area/connectors Connector related issues area/documentation Improvements or additions to documentation
Development

Successfully merging this pull request may close these issues.

Apply buffering changes to GCS Destination
3 participants