feat(registry): add remove stale partition job #38165
Conversation
LGTM.
Just wondering how you're going to test it.
Will this change materialize into a job you can manually trigger from Dagster?
Is there a reason we're not bumping the orchestrator package version?
stale_etags = [etag for etag in all_etag_partitions if etag not in all_fresh_etags]
context.log.info(f"Stale etags found: {stale_etags}")
for stale_etag in stale_etags:
nit: you can avoid one iteration if you filter and delete in a single iteration on all_etag_partitions
Ah great call. Will change
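For illustration, a minimal sketch of the single-pass version suggested above, assuming the same variable names as in the surrounding op:
# Sketch only: filter and delete in one pass over all_etag_partitions,
# assuming partition_name and all_fresh_etags are defined as in the op above.
fresh_etags = set(all_fresh_etags)
for etag in all_etag_partitions:
    if etag not in fresh_etags:
        context.log.info(f"Removing stale etag partition: {etag}")
        context.instance.delete_dynamic_partition(partition_name, etag)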
all_fresh_etags = [blob.etag for blob in all_metadata_file_blobs]

all_etag_partitions = context.instance.get_dynamic_partitions(partition_name)
context.instance.get_dynamic_partitions
calls the Dagster backend we want to clean up, right?
Correct! That line asks Dagster "Hey what metadata files do you have partitions for?"
@@ -24,6 +24,35 @@
)


@op(required_resource_keys={"all_metadata_file_blobs"})
def remove_stale_metadata_partitions_op(context):
Any reason not to perform this cleanup at partition insertion time?
Hmm, I wanted to introduce this logic in a separate area right now simply because I'm scared of it, since it's destructive.
If we accidentally delete ALL partitions, it's not the end of the world; it just means we may miss a failed metadata file and have to reingest all existing metadata files to know we're ok.
Also, Dagster doesn't let you bulk delete partition keys, and I'm worried about the time it would take to iterate over 10,000 keys to delete. It may lock up our sensor if deployed inside the add_partition sensor today.
So I wanted to keep it separate for now.
If things are looking good, I want to look at merging it back in.
Does that make sense?
💯 👍 - step by step :)
I've added unit tests, and tested it on my local Dagster. The next step is to trigger the job on production 😬
exactly!
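For context, a minimal sketch of what that manually triggerable job could look like, assuming the op from this PR; all_metadata_file_blobs_resource is a placeholder name for whatever resource definition backs that key:
# Sketch only: wiring the new op into a standalone job that can be launched
# manually from the Dagster UI.
from dagster import job

@job(resource_defs={"all_metadata_file_blobs": all_metadata_file_blobs_resource})
def remove_stale_metadata_partitions_job():
    remove_stale_metadata_partitions_op()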
Oh, that was a miss! I'll update.
After clearing up which ones are getting cleaned up, this makes sense 👍🏻 Interested in how many we're left with; I feel like it should be about half.
""" | ||
This op is responsible for polling for new metadata files and adding their etag to the dynamic partition. | ||
""" |
This should be updated!
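Something along these lines, for example (wording is just a suggestion):
"""
This op is responsible for removing dynamic partitions whose etag no longer matches an existing metadata file.
"""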
all_metadata_file_blobs = context.resources.all_metadata_file_blobs
partition_name = registry_entry.metadata_partitions_def.name

all_fresh_etags = [blob.etag for blob in all_metadata_file_blobs]
Would set subtraction be more appropriate here?
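For example, the stale-etag computation above could become something like this (sketch only, assuming etags are plain hashable strings):
# Sketch only: the same stale-etag computation via set subtraction.
stale_etags = set(all_etag_partitions) - set(all_fresh_etags)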
What
Add a job that lets us remove partition keys that no longer exist
Why
We have > 10,000 partitions, one for every metadata file ever. Likely only about 500 of those reference files that still exist.
Adding this job should let us clean out the noise.
Future
If it works, I'll add it to a nightly job.
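As a rough illustration of that future step, a nightly schedule around the cleanup job might look something like this (sketch only; the cron string is a placeholder and the job name is assumed from the sketch earlier in the thread):
# Sketch only: a nightly schedule wrapping the cleanup job.
from dagster import ScheduleDefinition

remove_stale_metadata_partitions_schedule = ScheduleDefinition(
    job=remove_stale_metadata_partitions_job,  # assumed job name
    cron_schedule="0 3 * * *",  # placeholder: nightly at 03:00
)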