Support incremental dedupe in databricks destination connector #6042
Comments
Just wanted to post here to follow the thread, and to ask whether I could help on this matter, since I would really be interested in dedup and dbt transformations with Databricks.
@shrodingers, thanks for your comment. We probably won't have bandwidth to implement the incremental dedup mode for Databricks this quarter, so feel free to try it if you are interested. As mentioned in the description, I think the…
@tuliren Coming back to this one: I started to implement normalization in the Databricks connector (in order to handle append + dedup and basic normalization).
Would be glad to have input on this, since I am not sure what the most useful and simple approach is.
Managed to get a custom implementation of this connector to support dbt, extending the S3StreamCopier directly and with a different flow than the current implementation.
It seems like we already have a contributor's PR, but it's huge!
@etsybaev This PR is deprecated by now, following some changes to the Databricks / S3 connectors that have landed since then, as well as other issues I saw arise while using it in production. I also think it contains some code that has nothing to do with the feature (I may have botched my branch diff and added another source connector plus UI changes by mistake), and a lot of the code is the generated test suite for the transformation acceptance tests. I kept it open as a reference for the parts I managed to test in the normalization models / macros and the connector updates. But I'll be glad to keep working on the subject if I can get some insights on certain points, and resubmit a cleaner, updated pull request later :)
Blocked by the re-rollout of v1 normalization.
Hey @grishick, is this issue still blocked? Is it just a matter of swapping the `COPY INTO` command for `MERGE`? We are considering starting to use Airbyte, and having incremental dedupe for the Databricks destination would be a great plus!
Hello @marijncv, I made a custom connector, forked from the official one, to achieve incremental dedupe + normalization (with an option to normalize only the first level, to avoid creating too many tables for deeply nested data). I will eventually submit a PR at some point, but in the meantime, if you're interested and using self-hosted Airbyte, I would be able to share it with you.
Hey @shrodingers, that sounds great! Would love to try it out.
@shrodingers, can you open this PR and/or point me to your implementation as well? I am very interested in evaluating it.
Just submitted PR #29510 for review, with the code to enable incremental dedup (it may need a hand with finalization / testing, or with edge cases I didn't spot).
Closing this issue so it is not misleading: with Destinations V2 (#26028), no destinations should gain "normalization" via dbt.
Tell us about the problem you're trying to solve
Currently, the Databricks destination connector only supports the incremental append mode. We want it to support the incremental dedup mode as well.
This relates to #2075, and is a follow-up issue from PR #5998.
Describe the solution you’d like
Use the `MERGE` command instead of the `COPY INTO` command to insert new records.
Reference:
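For illustration, here is a minimal sketch of what the dedup upsert could look like in Databricks SQL. The table and column names (`users_final`, `users_staging`, `id`, `name`, `updated_at`) are hypothetical, not the connector's actual schema; the point is that `MERGE INTO` can upsert by primary key, whereas `COPY INTO` can only append:

```sql
-- Hypothetical sketch of an incremental-dedup upsert on Databricks.
-- Table and column names are illustrative, not what the connector generates.
MERGE INTO users_final AS target
USING (
  -- Deduplicate the staged batch first: keep only the latest row per
  -- primary key (id), ordered by the cursor field (updated_at).
  SELECT id, name, updated_at
  FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS row_num
    FROM users_staging
  ) ranked
  WHERE row_num = 1
) AS source
ON target.id = source.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```

`UPDATE SET *` and `INSERT *` expect the source and target schemas to line up, which is why the inner query projects only the target's columns and drops the helper `row_num`.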
Describe the alternative you’ve considered or used
N/A
Additional context
N/A
Are you willing to submit a PR?
Yes.