Adding SyncMethod "Full Refresh - Deduped + history" #3090

Closed
Fredehagelund92 opened this issue Apr 27, 2021 · 6 comments

@Fredehagelund92

Fredehagelund92 commented Apr 27, 2021

Tell us about the problem you're trying to solve

Currently it is only possible to do the following sync methods:

  • Full Refresh - Append
  • Full Refresh - Overwrite
  • Incremental - Append
  • Incremental - Deduped + history

This might be sufficient for most cases, but sometimes an incremental sync simply isn't possible. Say the source has no updated_at or auto-increment id column; then it is hard to build an incremental sync. Of course you could just do a Full Refresh, but then you won't get history. History can be important since it provides a timeline of when changes happened in source systems that don't track this themselves.

Describe the solution you’d like

I think we need to create a sync method called Full Refresh - Deduped + history. This could be done by comparing a hash between rows. Since it's a full refresh, it also enables us to track hard deletes, for instance via a new column called airbyte_deleted_row.
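
A minimal sketch of the hash-comparison idea (all names and helpers here are hypothetical, not from the Airbyte codebase): each batch keeps a map of primary key to row hash, and the next full refresh is diffed against it to find changed rows and hard deletes.

```python
import hashlib
import json

def row_hash(record: dict) -> str:
    """Stable content hash of a record (hypothetical helper)."""
    return hashlib.sha256(
        json.dumps(record, sort_keys=True, default=str).encode()
    ).hexdigest()

def diff_batches(previous: dict, current_records: list, pk: str):
    """Compare a new full-refresh batch against the previous one.

    `previous` maps primary key -> row hash from the last sync.
    Returns (changed_or_new_records, deleted_pks, new_state).
    """
    new_state = {r[pk]: row_hash(r) for r in current_records}
    # Rows whose hash is new or differs from last time are changes.
    changed = [r for r in current_records
               if previous.get(r[pk]) != new_state[r[pk]]]
    # Keys that existed last time but are gone now are hard deletes.
    deleted = [k for k in previous if k not in new_state]
    return changed, deleted, new_state
```

A second full refresh then only emits the rows that actually changed, plus tombstones for the keys that disappeared.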

I might update this and be more specific once I'm more familiar with the codebase.

Describe the alternative you’ve considered or used

The alternative is to handle this with dbt on top of Full Refresh - Overwrite.


@Fredehagelund92 Fredehagelund92 added the type/enhancement New feature or request label Apr 27, 2021
@marcosmarxm
Member

marcosmarxm commented Apr 29, 2021

Yes, I had this problem on a project before. I'll take a look to measure the effort and see if it is possible. @Fredehagelund92 are you willing to contribute? (no pressure 😬!) I can help you set up the env / check what needs to change if you like

@Fredehagelund92
Author

Hi @marcosmarxm, sure, I'd like to get more familiar with the codebase, so if you can point me in the right direction I can start contributing 👍

@marcosmarxm
Member

@Fredehagelund92 you should start from the Full Refresh - Append code; from there it's a matter of handling the dedupe and the deleted rows in the transformation steps (base-normalization).

  • identify a sync_session_id for each batch of a full refresh
  • flag rows whose primary keys disappear from one batch to the next as _airbyte_deleted rows
  • from there, it's pretty much the same as Incremental - Deduped + history
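
The steps above could be sketched roughly like this (hypothetical function and field names, not actual Airbyte internals):

```python
import uuid
from datetime import datetime, timezone

def process_full_refresh_batch(records, previous_pks, pk="id"):
    """Tag a full-refresh batch with a sync_session_id and emit
    tombstones for primary keys that disappeared since the previous
    batch, so downstream normalization can treat deletes like any
    other deduped+history change. Sketch only.
    """
    session_id = str(uuid.uuid4())  # one id per full-refresh batch
    now = datetime.now(timezone.utc).isoformat()

    # Step 1: every row in this batch shares one sync_session_id.
    tagged = [{**r, "_airbyte_sync_session_id": session_id} for r in records]

    # Step 2: keys seen last batch but missing now become deleted rows.
    current_pks = {r[pk] for r in records}
    tombstones = [
        {pk: missing,
         "_airbyte_sync_session_id": session_id,
         "_airbyte_deleted": True,
         "_airbyte_emitted_at": now}
        for missing in previous_pks - current_pks
    ]
    # Step 3: hand the combined stream to the usual dedupe+history logic.
    return tagged + tombstones, current_pks
```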

@jdclarke5
Contributor

Thumbs up on this. Let me give a use case I have come across for further motivation/context.

As a source, I want to ingest this COVID CSV file: https://github.com/owid/covid-19-data/blob/master/public/data/owid-covid-data.csv. One issue with this source is that old rows are sometimes updated retrospectively, and the CSV does not give an updated time for when this happens. So there is no cursor field to implement an incremental mode which captures all changes. To read in all the changes you need full refresh. On the destination side an overwrite will take all this data and overwrite old rows. What I would like is change tracking on any old rows which have been updated.

The deleted step is a difficult question for me: should it be the source's job to send through information that a row is deleted, or the destination's job to compare the current state to the state the source sends? The latter probably makes the most sense, but it requires the destination to see all primary keys from a source full refresh, so that it can do an ID comparison after all records are seen and know which rows to delete. That's quite different from the other destination sync modes, which can do what they need to do without depending on or knowing about the source sync mode. One way to do this would be to write all rows to a temporary table to avoid storing all the IDs in memory, do the comparison, and make the necessary updates. On the other hand, if you don't need/want delete capability then you can still do record-by-record consumption.
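
The temporary-table idea might look something like this, using sqlite3 as a stand-in warehouse (table and column names are illustrative, not Airbyte's): spill the seen keys to a temp table, then anti-join against the destination to find keys to delete.

```python
import sqlite3

def find_deleted_pks(conn, seen_pks_iter):
    """Spill the primary keys seen in this full refresh to a temp
    table, then anti-join against the destination table to find rows
    that should be marked deleted. Avoids holding all IDs in memory.
    """
    cur = conn.cursor()
    cur.execute("CREATE TEMP TABLE _seen_pks (pk TEXT PRIMARY KEY)")
    cur.executemany("INSERT INTO _seen_pks (pk) VALUES (?)",
                    ((pk,) for pk in seen_pks_iter))
    # Destination rows with no matching seen key were deleted upstream.
    cur.execute("""
        SELECT d.pk FROM destination_table d
        LEFT JOIN _seen_pks s ON s.pk = d.pk
        WHERE s.pk IS NULL
    """)
    return [row[0] for row in cur.fetchall()]
```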

Another note is that there is no need for another field _airbyte_deleted. To delete a row it's enough to set _airbyte_active_row=false and _airbyte_end_at=<deleted_timestamp>.
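
In SCD terms, that just means closing out the currently active version of the row. A sketch, again with sqlite3 as a stand-in and illustrative table/column names:

```python
import sqlite3
from datetime import datetime, timezone

def close_deleted_rows(conn, deleted_pks):
    """Delete rows SCD-style: end the active interval instead of
    adding a separate _airbyte_deleted column. Names are illustrative.
    """
    deleted_at = datetime.now(timezone.utc).isoformat()
    conn.executemany(
        """UPDATE scd_table
           SET _airbyte_active_row = 0, _airbyte_end_at = ?
           WHERE pk = ? AND _airbyte_active_row = 1""",
        [(deleted_at, pk) for pk in deleted_pks],
    )
```

History queries then see the deletion as an ordinary end of validity, with no extra flag column.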

@bleonard bleonard added autoteam team/tse Technical Support Engineers labels Apr 26, 2022
@grishick grishick added the team/destinations Destinations team's backlog label Sep 27, 2022
@marcosmarxm
Member

@evantahler is this still relevant with the Destinations v2 plans?

@evantahler
Contributor

Closing this issue as I don't think that it really fits with the changes we are making to normalization. The SCD (history) tables are going away.
Learn more @ #26028
