
Duplicated data on s3 destination after multiple failed attempts #13692


Closed

nord-ine opened this issue Jun 10, 2022 · 3 comments
Labels: autoteam, community, team/tse (Technical Support Engineers), type/bug (Something isn't working)

Comments

@nord-ine

Environment

  • Airbyte version: 0.35.64-alpha
  • OS Version / Instance: Kubernetes
  • Deployment: Kubernetes
  • Source Connector and version: source-mssql 0.3.22
  • Destination Connector and version: destination-s3 0.3.5
  • Step where error happened: Sync job

Current Behavior

We have some connection syncs that sometimes succeed after multiple failed attempts. I am fine with that (sync time is not important to us). However, the data is duplicated in the destination (S3 bucket) because data written by the failed attempts is not cleaned up by the S3 destination.

Expected Behavior

Keep track of the files written during an attempt, and delete them if the attempt fails. A sketch of this cleanup logic is given below.
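
A minimal sketch of what that could look like, using boto3 directly; the bucket name and the `(key, body)` batching interface are hypothetical illustrations, not the actual destination-s3 connector API:

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-sync-bucket"  # hypothetical bucket name

def run_attempt(batches):
    """Write each (key, body) batch to S3, cleaning up on failure."""
    written_keys = []
    try:
        for key, body in batches:
            s3.put_object(Bucket=BUCKET, Key=key, Body=body)
            written_keys.append(key)
    except Exception:
        # The attempt failed: delete everything this attempt wrote so a
        # retry does not leave duplicated objects in the bucket.
        # delete_objects accepts at most 1000 keys per call.
        for i in range(0, len(written_keys), 1000):
            chunk = written_keys[i:i + 1000]
            s3.delete_objects(
                Bucket=BUCKET,
                Delete={"Objects": [{"Key": k} for k in chunk]},
            )
        raise
```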

Are you willing to submit a PR?

Probably, if the change only concerns the S3 destination connector (not Airbyte Core).

@nord-ine added the needs-triage and type/bug (Something isn't working) labels on Jun 10, 2022
@alafanechere
Contributor

Hey @nord-ine,
The problem you mention is not specific to MSSQL or S3; it applies more broadly to our database connectors.
Airbyte connectors keep track of the offset of the records they consumed in a state object. For database connectors, this state is only stored at the end of a successful sync, which means that on the next sync after a failure, the same data is replicated again. We need to improve this behavior by implementing intermittent checkpointing to reduce the number of duplicates (see the sketch below). This is already available for API connectors but not for database connectors. I created an issue for this improvement here: please subscribe to it to follow updates.

@alafanechere
Contributor

This topic about Airbyte record delivery guarantees might also be helpful.

@programmeddeath1

@alafanechere Hi, I am facing an issue where disk space gets used up over multiple failed attempts. Can you tell me where Airbyte stores this intermediate state object or the intermediate records? I have not been able to find the data even in the Airbyte DB. I removed the Docker containers for Airbyte, but I am still unable to reclaim more than 60 GB of local disk space.
