Adding SyncMethod "Full Refresh - Deduped + history" #3090

Closed
Fredehagelund92 opened this issue Apr 27, 2021 · 6 comments

@Fredehagelund92

Fredehagelund92 commented Apr 27, 2021

Tell us about the problem you're trying to solve

Currently it is only possible to do the following sync methods:

  • Full Refresh - Append
  • Full Refresh - Overwrite
  • Incremental - Append
  • Incremental - Deduped + history

This might be sufficient for most cases, but sometimes an incremental sync simply isn't possible. Say the source has no updated_at or auto-increment id column; then it is hard to build an incremental sync. Of course you could just do a Full Refresh, but then you won't get history. History can be important since it provides a timeline of when changes happened in source systems that don't track this themselves.

Describe the solution you’d like

I think we need to create a sync method called Full Refresh - Deduped + history. This could be done by comparing a hash between rows. Since it's a full refresh, it also enables us to track hard deletes, for instance via a new column called airbyte_deleted_row.
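
A minimal sketch of the hash-comparison idea (all names and helpers here are hypothetical, not from the Airbyte codebase): each batch keeps a map of primary key to row hash, and the next full refresh is diffed against it to find changed rows and hard deletes.

```python
import hashlib
import json

def row_hash(record: dict) -> str:
    """Stable content hash of a record (hypothetical helper)."""
    return hashlib.sha256(
        json.dumps(record, sort_keys=True, default=str).encode()
    ).hexdigest()

def diff_batches(previous: dict, current_records: list, pk: str):
    """Compare a new full-refresh batch against the previous one.

    `previous` maps primary key -> row hash from the last sync.
    Returns (changed_or_new_records, deleted_pks, new_state).
    """
    new_state = {r[pk]: row_hash(r) for r in current_records}
    # Rows whose hash is new or differs from last time are changes.
    changed = [r for r in current_records
               if previous.get(r[pk]) != new_state[r[pk]]]
    # Keys that existed last time but are gone now are hard deletes.
    deleted = [k for k in previous if k not in new_state]
    return changed, deleted, new_state
```

A second full refresh then only emits the rows that actually changed, plus tombstones for the keys that disappeared.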

I might update this and be more specific once I'm more familiar with the codebase.

Describe the alternative you’ve considered or used

The alternative is to handle this with dbt on top of Full Refresh - Overwrite.


@Fredehagelund92 Fredehagelund92 added the type/enhancement New feature or request label Apr 27, 2021
@marcosmarxm
Member

marcosmarxm commented Apr 29, 2021

Yes, I had this problem on a project before. I'll take a look to measure the effort and see if it is possible. @Fredehagelund92 are you willing to contribute? (no pressure 😬!) I can help you set up the env / check what needs to change if you like

@Fredehagelund92
Author

Hi @marcosmarxm, sure, I'd like to get more familiar with the codebase, so if you can point me in the right direction I can start contributing 👍

@marcosmarxm
Member

@Fredehagelund92 you should start from the Full Refresh - Append code; from there it's a matter of handling the dedupe and the deleted rows in the transformation steps (base-normalization).

  • identify a sync_session_id for each batch of a full refresh
  • flag rows whose primary keys disappear from one batch to the next as _airbyte_deleted rows
  • from there, it's pretty much the same as Incremental - Deduped + history
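
The steps above could be sketched roughly like this (hypothetical function and field names, not actual Airbyte internals):

```python
import uuid
from datetime import datetime, timezone

def process_full_refresh_batch(records, previous_pks, pk="id"):
    """Tag a full-refresh batch with a sync_session_id and emit
    tombstones for primary keys that disappeared since the previous
    batch, so downstream normalization can treat deletes like any
    other deduped+history change. Sketch only.
    """
    session_id = str(uuid.uuid4())  # one id per full-refresh batch
    now = datetime.now(timezone.utc).isoformat()

    # Step 1: every row in this batch shares one sync_session_id.
    tagged = [{**r, "_airbyte_sync_session_id": session_id} for r in records]

    # Step 2: keys seen last batch but missing now become deleted rows.
    current_pks = {r[pk] for r in records}
    tombstones = [
        {pk: missing,
         "_airbyte_sync_session_id": session_id,
         "_airbyte_deleted": True,
         "_airbyte_emitted_at": now}
        for missing in previous_pks - current_pks
    ]
    # Step 3: hand the combined stream to the usual dedupe+history logic.
    return tagged + tombstones, current_pks
```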

@jdclarke5
Contributor

Thumbs up on this. Let me give a use case I have come across for further motivation/context.

As a source, I want to ingest this COVID CSV file: https://github.com/owid/covid-19-data/blob/master/public/data/owid-covid-data.csv. One issue with this source is that old rows are sometimes updated retrospectively, and the CSV does not give an updated time for when this happens. So there is no cursor field to implement an incremental mode which captures all changes. To read in all the changes you need full refresh. On the destination side an overwrite will take all this data and overwrite old rows. What I would like is change tracking on any old rows which have been updated.

The deleted step is a difficult question for me: should it be the source's job to send through information that a row is deleted, or the destination's job to compare the current state to the state the source sends? The latter probably makes the most sense, but it requires the destination to see all primary keys from a source full refresh, so that it can do an ID comparison after all records are seen and know which rows to delete. That's quite different from the other destination sync modes, which can do what they need to do without depending on or knowing about the source sync mode. One way to do this would be to write all rows to a temporary table to avoid storing all the IDs in memory, do the comparison, and make the necessary updates. On the other hand, if you don't need/want delete capability then you can still do record-by-record consumption.
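
The temporary-table idea might look something like this, using sqlite3 as a stand-in warehouse (table and column names are illustrative, not Airbyte's): spill the seen keys to a temp table, then anti-join against the destination to find keys to delete.

```python
import sqlite3

def find_deleted_pks(conn, seen_pks_iter):
    """Spill the primary keys seen in this full refresh to a temp
    table, then anti-join against the destination table to find rows
    that should be marked deleted. Avoids holding all IDs in memory.
    """
    cur = conn.cursor()
    cur.execute("CREATE TEMP TABLE _seen_pks (pk TEXT PRIMARY KEY)")
    cur.executemany("INSERT INTO _seen_pks (pk) VALUES (?)",
                    ((pk,) for pk in seen_pks_iter))
    # Destination rows with no matching seen key were deleted upstream.
    cur.execute("""
        SELECT d.pk FROM destination_table d
        LEFT JOIN _seen_pks s ON s.pk = d.pk
        WHERE s.pk IS NULL
    """)
    return [row[0] for row in cur.fetchall()]
```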

Another note is that there is no need for another field _airbyte_deleted. To delete a row it's enough to set _airbyte_active_row=false and _airbyte_end_at=<deleted_timestamp>.
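
In SCD terms, that just means closing out the currently active version of the row. A sketch, again with sqlite3 as a stand-in and illustrative table/column names:

```python
import sqlite3
from datetime import datetime, timezone

def close_deleted_rows(conn, deleted_pks):
    """Delete rows SCD-style: end the active interval instead of
    adding a separate _airbyte_deleted column. Names are illustrative.
    """
    deleted_at = datetime.now(timezone.utc).isoformat()
    conn.executemany(
        """UPDATE scd_table
           SET _airbyte_active_row = 0, _airbyte_end_at = ?
           WHERE pk = ? AND _airbyte_active_row = 1""",
        [(deleted_at, pk) for pk in deleted_pks],
    )
```

History queries then see the deletion as an ordinary end of validity, with no extra flag column.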

@bleonard bleonard added autoteam team/tse Technical Support Engineers labels Apr 26, 2022
@grishick grishick added the team/destinations Destinations team's backlog label Sep 27, 2022
@marcosmarxm
Member

@evantahler is this still relevant with the Destinations v2 plans?

@evantahler
Contributor

Closing this issue as I don't think that it really fits with the changes we are making to normalization. The SCD (history) tables are going away.
Learn more @ #26028
