-
Notifications
You must be signed in to change notification settings - Fork 4.5k
Destination Snowflake: Revisiting merge instead of insert+delete #43367
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The latest updates on your projects. Learn more about Vercel for Git ↗︎
|
This stack of pull requests is managed by Graphite. Learn more about stacking. |
2b2e786
to
7755aaa
Compare
7755aaa
to
329ee6b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ngl I was only skimming the diff by the end :P largely trusting our tests to keep us safe here
+I assume you're planning to run some tests in the perf test workspace to try and validate the performance impact? I'm actually very curious what happens there - faker reemitting the whole dataset on every sync is basically the worst case, and I have no idea how merge vs insert+delete will perform there
changes seem reasonable at a glance. had some nitpicks but nothing structural. LGTM pending testing.
...otlin/io/airbyte/integrations/destination/snowflake/typing_deduping/SnowflakeSqlGenerator.kt
Outdated
Show resolved
Hide resolved
...otlin/io/airbyte/integrations/destination/snowflake/typing_deduping/SnowflakeSqlGenerator.kt
Show resolved
Hide resolved
JavaBaseConstants.COLUMN_NAME_AB_EXTRACTED_AT.uppercase().quoted(), | ||
JavaBaseConstants.COLUMN_NAME_AB_GENERATION_ID.uppercase().quoted() | ||
) | ||
val abMetaColumn = JavaBaseConstants.COLUMN_NAME_AB_META.uppercase().quoted() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not include this in columnsForUpdateList
?
...otlin/io/airbyte/integrations/destination/snowflake/typing_deduping/SnowflakeSqlGenerator.kt
Show resolved
Hide resolved
|${selectTypedRecordsFromRawTable(stream, minRawTimestamp, safeCast).replaceIndent(" ")} | ||
|) $sourceTable | ||
|ON | ||
|${pkEqualityMatch.replaceIndent(" ")} $whenMatchedCdcDeleteCondition |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: why not just put whenMatchedCdcDeleteCondition
on its own line + remove the \n
from it? Right now it kind of looks like whenMatchedCdcDeleteCondition
is part of the ON
clause
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah there was an empty line when this condition was missing. wasn't sure if sql compilation failed for this or something else so was trying different things
...otlin/io/airbyte/integrations/destination/snowflake/typing_deduping/SnowflakeSqlGenerator.kt
Outdated
Show resolved
Hide resolved
...otlin/io/airbyte/integrations/destination/snowflake/typing_deduping/SnowflakeSqlGenerator.kt
Show resolved
Hide resolved
...otlin/io/airbyte/integrations/destination/snowflake/typing_deduping/SnowflakeSqlGenerator.kt
Outdated
Show resolved
Hide resolved
48dba70
to
0fa45a9
Compare
bd085c3
to
1aadeb0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm pending switching to an abstract base class for the test
1aadeb0
to
ff96664
Compare
ff96664
to
c96b2cb
Compare
@gisripa This PR is a huge deal for our business. Wanted to say thank you for opening this PR. |
just a heads up, we had to rollback to an older version of destination-snowflake for unrelated reasons. keep an eye on #44526, which will also rerelease this feature. |
What
Fixes: airbytehq/airbyte-internal-issues#9931
See the rabbit hole of linked tickets above
Resurrecting: #31683
Switching back to
MERGE INTO ... WHEN...
for CDC deletions.MERGE
still expects unique records in source dataset(a.k.a table from CTE in our case) to have a deterministic behavior.How
Review guide
Experimented a lot with Kotlin's idioms to deduplicate query string snippets.
Also consolidated the
row_number
window function to be executed on the preceding CTE (saw few millis impact when testing in console but not a great representation by individual runs)User Impact
Users have the ability to opt-in to use

MERGE
.This is intentionally left as non-default for now because of previous history of rollback.
Other details
Tested syncs with both
insert+delete
andmerge
using the toggle option. No improvement in Sync time per se but no indication that warehouse costs would have increased if time spent in query is a representation of the warehouse credits used.Query Profiles
I did not find any evidence that MERGE could increase warehouse activity if Time spent by query is a representation of warehouse credits.
Final Table seeded with 100M rows.
Raw Table generated with 500K data with some random percentage updated and cdc-deleted
Profiles for Insert+Delete+Delete
Insert
Delete from final table duplicated PKs
CDC Deletes
Profile for Merge
Can this PR be safely reverted and rolled back?