
Support incremental dedupe in databricks destination connector #6042


Closed
tuliren opened this issue Sep 14, 2021 · 14 comments
Comments

@tuliren
Contributor

tuliren commented Sep 14, 2021

Tell us about the problem you're trying to solve

Currently the Databricks destination connector only supports the incremental append mode. We want it to support the incremental dedup mode as well.

This relates to #2075, and is a follow-up issue from PR #5998.

Describe the solution you’d like

Use the MERGE command instead of the COPY INTO command to insert new records.

Reference:

COPY INTO SQL command vs. MERGE SQL command

  • The COPY INTO SQL command is designed to support high-performance inserts of new data into Databricks Delta. It does not support updates and deletes.
  • The MERGE SQL command supports inserts, updates, and deletes. It is slower than COPY INTO for inserting data into Delta.
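
To make the direction concrete, here is a rough sketch of the dedup upsert, assuming a staging table already loaded from the S3 files, an illustrative primary key `id`, and the `_airbyte_emitted_at` cursor; the actual connector would derive the key and column list from the configured catalog:

```sql
-- Rough sketch only: upsert the newly staged batch into the final Delta table,
-- keeping one row per primary key. Table, column, and key names are illustrative.
MERGE INTO final_table AS target
USING (
  -- keep only the latest record per primary key within the staged batch
  SELECT id, name, _airbyte_ab_id, _airbyte_emitted_at
  FROM (
    SELECT *,
           ROW_NUMBER() OVER (
             PARTITION BY id
             ORDER BY _airbyte_emitted_at DESC
           ) AS row_num
    FROM staging_table
  ) ranked
  WHERE row_num = 1
) AS source
ON target.id = source.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```

As the reference above notes, the trade-off is that MERGE is slower than COPY INTO for pure inserts, in exchange for the update semantics that dedup needs.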

Describe the alternative you’ve considered or used

N/A

Additional context

N/A

Are you willing to submit a PR?

Yes.

@tuliren tuliren added the type/enhancement New feature or request label Sep 14, 2021
@tuliren tuliren self-assigned this Sep 14, 2021
@sherifnada sherifnada added the area/connectors Connector related issues label Oct 14, 2021
@sherifnada sherifnada moved this to Backlog in GL Roadmap Jan 12, 2022
@shrodingers
Contributor

Just wanted to post here to follow the thread, and to ask if I could help on this matter, since I would really be interested in deduped syncs and dbt transformations with Databricks.

@tuliren
Contributor Author

tuliren commented Jan 14, 2022

@shrodingers, thanks for your comment. We probably won't have bandwidth to implement the incremental dedup mode for Databricks this quarter. So feel free to try it if you are interested.

As mentioned in the description, I think the MERGE command may be useful in the implementation. But we have not thoroughly looked into it yet.

@igrankova igrankova moved this from Backlog to Backlog (unscoped) in GL Roadmap Feb 2, 2022
@shrodingers
Contributor

shrodingers commented Apr 15, 2022

@tuliren Coming back to this one: I started to implement normalization on the Databricks connector (in order to handle append_deduped and basic normalization).
Things were going pretty nicely in my tests, until I realized that the connector uses the catalog schema to generate the tables, via the S3 Parquet writer, which normalizes the first level of columns in the raw data and thus entirely bypasses the _airbyte_data JSON blob syncing. However, all the normalization code relies on _airbyte_data to normalize and then dedup the data (see the sketch of the two table shapes at the end of this comment).
I was wondering what the correct way to handle this would be?

  • Change the file format for staging files from Parquet to another one without a schema? [Breaking for users]
  • Create a new raw Parquet writer for S3? [Also breaking]
  • Allow users to choose between json2avro and dbt normalization? (But that may be quite confusing, I think.)
  • Write the blob alongside the first-level columns? (This conflicts a bit with the json2avro blob that already exists for S3 destinations.)
  • Create a different connector, so that one supports normalization through raw-data S3 syncing and dbt transformation, and the other does not but produces tables normalized at the first level?

I'd be glad to get input on this, since I am not sure what the most useful and simple approach is.
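
For reference, here is a sketch of the two table shapes being discussed: the flattened first-level schema the Parquet writer currently produces, versus the raw layout with the record kept as a JSON blob in _airbyte_data that the normalization code expects (the stream and non-Airbyte column names are illustrative):

```sql
-- Shape currently produced by the S3 Parquet writer: first-level columns
-- generated from the catalog schema (illustrative stream "users").
CREATE TABLE users (
  _airbyte_ab_id STRING,
  _airbyte_emitted_at TIMESTAMP,
  id BIGINT,
  name STRING
);

-- Shape the dbt normalization code expects as its input: the whole record
-- kept as a JSON blob in _airbyte_data, which it parses, normalizes, and dedups.
CREATE TABLE _airbyte_raw_users (
  _airbyte_ab_id STRING,
  _airbyte_emitted_at TIMESTAMP,
  _airbyte_data STRING
);
```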

@shrodingers
Contributor

Managed to get a custom implementation of this connector that supports dbt, extending the S3StreamCopier directly and with a different flow than the current implementation.
There are still some small issues (dangling threads and ignored Python errors during normalization, as well as a full refresh issue on the first run).
I'll be glad to submit a PR, but I think I should probably allow switching between the current implementation and the one compatible with dbt transformations, since the two are really different and incompatible (we need to keep _airbyte_data as a JSON blob to do normalization correctly). Anyway, I'll be happy to help, and/or to get some insight on the more Airbyte-ish way to do things :)

@grishick grishick removed this from GL Roadmap Apr 28, 2022
@grishick grishick added the team/destinations Destinations team's backlog label Sep 27, 2022
@tuliren tuliren removed their assignment Oct 27, 2022
@etsybaev etsybaev self-assigned this Feb 8, 2023
@etsybaev
Contributor

etsybaev commented Feb 8, 2023

It seems like we already have a contributor's PR, but it's huge!
#14445

@shrodingers
Contributor

shrodingers commented Feb 8, 2023

@etsybaev that PR is outdated by now, following changes to the Databricks / S3 connectors that have happened since then, as well as other issues I saw arise while using it in production. I also think it contains some code that has nothing to do with the feature (I may have botched my branch diff and added another source connector plus UI changes by mistake), and a lot of the code is the generated test suite for the transformation acceptance tests. I kept it open as a reference for the parts I managed to test on the normalization models / macros and the connector updates. But I'll be glad to keep working on the subject if I can get some insight on certain points, and to resubmit a cleaner / updated pull request later :)

@grishick
Contributor

grishick commented Feb 9, 2023

Blocked by the re-rollout of v1 normalization.

@marijncv

Hey @grishick, is this issue still blocked? Is it a matter of swapping the COPY INTO statement for MERGE INTO, as the OP suggested, or is there something else that makes the implementation difficult?

I'm considering starting to use Airbyte, and having incremental dedupe for the Databricks destination would be a great plus!

@shrodingers
Contributor

Hello @marijncv, I made a custom connector forked from the official one to achieve incremental dedupe + normalization (with the option to only normalize the first level, to avoid having too many tables in the case of deeply nested data). I will eventually submit a PR at some point, but in the meantime, if you're interested and using self-hosted Airbyte, I'd be able to share it with you.

@marijncv

Hey @shrodingers, that sounds great! Would love to try it out

@asilvis

asilvis commented Aug 4, 2023

@shrodingers can you open this PR and/or point me to your implementation as well? I am very interested in evaluating it.

@shrodingers
Contributor

Hi @marijncv @asilvis, I just got back from vacation and will do so as soon as possible!

shrodingers added a commit to Brigad/airbyte that referenced this issue Aug 17, 2023
@shrodingers
Contributor

Just submitted PR #29510 for review, with the code to enable incremental dedup (I may need a hand with finalization / testing, or with edge cases I didn't spot).

@evantahler
Contributor

Closing this issue so it isn't misleading: with Destinations V2 (#26028), no destination should gain "normalization" via dbt.
