Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Destination S3: Deferred deletes on sync success in OVERWRITE #42579

Merged
merged 1 commit into from
Aug 14, 2024

Conversation

gisripa
Copy link
Contributor

@gisripa gisripa commented Jul 27, 2024

What

Deferring the deletes until end of the sync to only delete on receving a COMPLETE for stream status.

How

  • Marking all the existing objects which matches the given pathFormat. This logic is similar to the earlier cleanUpObjects, instead of deleting them immediately, collecting all the object keys marked for delete and sending them to onClose function for actual deletion.

Review guide

  • New methods in BlobStorageOperations interface to support querying and cleaning the objects in a deferred fashion.
  • Storing keys marked for deletion in WriteConfig. Note that it is not thread-safe but onStart and onClose are only called once so nothing else is mutating that list in between.
  • Logical change in S3ConsumerFactory for deferring the deletion action
  • Test verifying that existing objects are left intact on failed attempts in OVERWRITE mode

User Impact

Can this PR be safely reverted and rolled back?

  • YES 💚
  • NO ❌

Copy link

vercel bot commented Jul 27, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
airbyte-docs ✅ Ready (Inspect) Visit Preview 💬 Add feedback Aug 14, 2024 8:40pm

Copy link
Contributor Author

gisripa commented Jul 27, 2024

This stack of pull requests is managed by Graphite. Learn more about stacking.

Join @gisripa and the rest of your teammates on Graphite Graphite

@gisripa gisripa force-pushed the epic-8349-certify-s3/issue-8556-s3-v2-fields-gt branch from 2c71cb7 to d66a66b Compare July 27, 2024 01:41
@gisripa gisripa force-pushed the gireesh/07-26-s3-no-downtime-full-refresh branch 2 times, most recently from 005b367 to f2bdd27 Compare July 29, 2024 05:17
@gisripa gisripa force-pushed the epic-8349-certify-s3/issue-8556-s3-v2-fields-gt branch from d66a66b to 2b602b5 Compare July 29, 2024 05:18
@gisripa gisripa force-pushed the gireesh/07-26-s3-no-downtime-full-refresh branch from f2bdd27 to d1c1c0a Compare July 29, 2024 05:18
@johnny-schmidt johnny-schmidt force-pushed the epic-8349-certify-s3/issue-8556-s3-v2-fields-gt branch from 2b602b5 to cc558fa Compare July 29, 2024 18:02
@gisripa gisripa force-pushed the epic-8349-certify-s3/issue-8556-s3-v2-fields-gt branch from cdf2021 to 32f91da Compare July 29, 2024 18:27
@gisripa gisripa force-pushed the gireesh/07-26-s3-no-downtime-full-refresh branch from d1c1c0a to 3b439ea Compare July 29, 2024 18:27
@gisripa gisripa force-pushed the epic-8349-certify-s3/issue-8556-s3-v2-fields-gt branch from 32f91da to 196731e Compare July 30, 2024 01:14
@gisripa gisripa force-pushed the gireesh/07-26-s3-no-downtime-full-refresh branch 2 times, most recently from 14faf69 to 96cbf8f Compare July 30, 2024 01:30
@gisripa gisripa changed the title s3-no-downtime-full-refresh Destination S3: Deferred deletes on sync success in OVERWRITE Jul 30, 2024
@gisripa gisripa marked this pull request as ready for review July 30, 2024 15:20
@gisripa gisripa requested a review from a team as a July 30, 2024 15:20
Copy link
Contributor

@edgao edgao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, had some fyi/questions

@gisripa gisripa force-pushed the epic-8349-certify-s3/issue-8556-s3-v2-fields-gt branch from 196731e to 1627fdb Compare July 30, 2024 18:40
@gisripa gisripa requested a review from a team as a code owner July 30, 2024 18:40
@gisripa gisripa force-pushed the gireesh/07-26-s3-no-downtime-full-refresh branch from 96cbf8f to 4ac9531 Compare July 30, 2024 18:40
@gisripa gisripa force-pushed the epic-8349-certify-s3/issue-8556-s3-v2-fields-gt branch from e6055c8 to 475d5c2 Compare July 31, 2024 00:14
@gisripa gisripa force-pushed the gireesh/07-26-s3-no-downtime-full-refresh branch from 4ac9531 to 1aaf8f9 Compare July 31, 2024 00:14
@johnny-schmidt johnny-schmidt force-pushed the epic-8349-certify-s3/issue-8556-s3-v2-fields-gt branch 2 times, most recently from 123df63 to 6629bba Compare July 31, 2024 16:46
@gisripa gisripa force-pushed the gireesh/07-26-s3-no-downtime-full-refresh branch from 9b0f98a to 94bb9fe Compare August 12, 2024 23:58
Copy link
Contributor

@edgao edgao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one last question, lgtm once resolved

objectPath: String,
maxKeysPerPage: Int = 1000,
pageConsumer: (List<S3ObjectSummary>) -> Unit
) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are you intentionally keeping the regex matching stuff outside this method? (i.e. the Pattern.compile(getRegexFormat(namespace, streamName, pathFormat)) + objectSumaries.filter { regex.matcher(it).matches() } thing)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah. wasn't sure if all usages are doing filter at the time of refactoring (turns out they seem to be)

@octavia-squidington-iv octavia-squidington-iv requested a review from a team August 13, 2024 15:12
@johnny-schmidt johnny-schmidt force-pushed the epic-8349-certify-s3/issue-8556-s3-v2-fields-gt branch from 51c23f8 to 1844494 Compare August 13, 2024 15:34
* will be filtered with generationId metadata strictly less than [currentGenerationId]
* @return List of keys of the objects
*/
abstract fun listExistingObjects(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not requesting a change to this, but I wonder if in the new CDK we should try to keep the BL out of the storage operations as much as possible.

Like this is really an object search with a metadata comparator?

stream,
outputBucketPath,
pathFormat
writeConfig.objectsFromOldGeneration.addAll(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another raised CDK question: what's the best way to handle accumulated stream state. Is appending to the config the right way? Does it make things less manageable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is only remnant in S3 land, we avoided attaching to WriteConfig (or rather killed that object all together) and saving states at instance level of that StreamOperation class. If there is a "state" machine then the FSM executor should accumulate and pass it along right ?

@gisripa gisripa force-pushed the epic-8349-certify-s3/issue-8556-s3-v2-fields-gt branch from 1844494 to 33e3e0d Compare August 13, 2024 21:26
@gisripa gisripa force-pushed the gireesh/07-26-s3-no-downtime-full-refresh branch from 94bb9fe to 67436cc Compare August 13, 2024 21:27
@johnny-schmidt johnny-schmidt force-pushed the epic-8349-certify-s3/issue-8556-s3-v2-fields-gt branch from 33e3e0d to 1844494 Compare August 13, 2024 21:41
@johnny-schmidt johnny-schmidt force-pushed the epic-8349-certify-s3/issue-8556-s3-v2-fields-gt branch from 1844494 to 194365d Compare August 13, 2024 22:16
@johnny-schmidt johnny-schmidt force-pushed the epic-8349-certify-s3/issue-8556-s3-v2-fields-gt branch 2 times, most recently from 491ad7c to a626bae Compare August 14, 2024 15:54
Base automatically changed from epic-8349-certify-s3/issue-8556-s3-v2-fields-gt to master August 14, 2024 16:41
@gisripa gisripa force-pushed the gireesh/07-26-s3-no-downtime-full-refresh branch 2 times, most recently from 733226f to 8918b13 Compare August 14, 2024 17:47
@gisripa gisripa removed the request for review from a team August 14, 2024 17:54
@gisripa gisripa force-pushed the gireesh/07-26-s3-no-downtime-full-refresh branch 2 times, most recently from 152d411 to d3b7688 Compare August 14, 2024 18:35
@gisripa
Copy link
Contributor Author

gisripa commented Aug 14, 2024

/publish-java-cdk

🕑 https://github.com/airbytehq/airbyte/actions/runs/10393278143
✅ Successfully published Java CDK version=0.44.13!

@gisripa gisripa force-pushed the gireesh/07-26-s3-no-downtime-full-refresh branch from d3b7688 to 1f789b2 Compare August 14, 2024 18:57
@gisripa
Copy link
Contributor Author

gisripa commented Aug 14, 2024

@gisripa gisripa force-pushed the gireesh/07-26-s3-no-downtime-full-refresh branch from 1f789b2 to adfe484 Compare August 14, 2024 20:34
@octavia-squidington-iii octavia-squidington-iii added the area/documentation Improvements or additions to documentation label Aug 14, 2024
@gisripa gisripa merged commit cba3ccd into master Aug 14, 2024
35 checks passed
@gisripa gisripa deleted the gireesh/07-26-s3-no-downtime-full-refresh branch August 14, 2024 21:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/connectors Connector related issues area/documentation Improvements or additions to documentation CDK Connector Development Kit connectors/destination/s3
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants