Skip to content

🎉 File-based CDK: stop the sync on parse errors #31605

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
3 tasks
clnoll opened this issue Oct 19, 2023 · 15 comments · Fixed by #32589
Closed
3 tasks

🎉 File-based CDK: stop the sync on parse errors #31605

clnoll opened this issue Oct 19, 2023 · 15 comments · Fixed by #32589

Comments

@clnoll
Copy link
Contributor

clnoll commented Oct 19, 2023

What

The original S3 connector logged a warning on file parsing issues but continued syncing. To be more consistent with the rest of the sources we actually want parse errors to stop the sync. Now that the fcdk has been in production for a while, it feels safe to make this change.

Suggested Implementation

This will involve changing this line where we yield a log message on record parse errors; instead we'll let the error bubble up and stop the sync.

The change in itself shouldn't be very big, but we anticipate that it will cause some CAT and unit test failures.

We'll also need to migrate source-S3 (and source-gcs, which is in progress) to the new pattern, which will be a breaking change.

Note: this is a breaking change because some syncs that would have succeeded in the past will now fail. It should be treated accordingly wrt Support and rollout.

Acceptance Criteria

  • When a file can't be parsed as the selected file type, we stop the sync with an error - the PR
  • Source S3 is updated to use the version of the fcdk with the new behavior.
  • If Source GCS has been deployed before the change went out, it's updated too.
@clnoll
Copy link
Contributor Author

clnoll commented Oct 24, 2023

Grooming notes:

  • This is technically a breaking change because users may have syncs that would have succeeded but will fail now. We'll want to flag that for Support.

@flash1293
Copy link
Contributor

@clnoll For some cases it might make sense to keep the existing behavior of skipping the file (e.g. see #32092 ). I'm fine with implementing the behavior explained here as the default and leaving it to the parser to do something else as it would cover the use case I have in mind, but did we consider making this behavior a configurable setting on the stream level itself?

@bazarnov bazarnov self-assigned this Nov 6, 2023
@bazarnov bazarnov changed the title File-based CDK: stop the sync on parse errors 🎉 File-based CDK: stop the sync on parse errors Nov 6, 2023
@clnoll
Copy link
Contributor Author

clnoll commented Nov 6, 2023

@flash1293 making it configurable on the stream makes sense to me. I'll put it to the team and update the plan here.

@bazarnov
Copy link
Collaborator

bazarnov commented Nov 6, 2023

@clnoll @flash1293 I'm going to work on this one, where can i find the full list of requirements to cover here?

@clnoll
Copy link
Contributor Author

clnoll commented Nov 6, 2023

@bazarnov I'll be asking the team tomorrow whether we should make this behavior configurable per @flash1293 's suggestion. If so, I'll update the requirements in the "Acceptance Criteria" section of this issue. Once you get started let me know if you need more details on anything.

@clnoll
Copy link
Contributor Author

clnoll commented Nov 7, 2023

@bazarnov @flash1293 we've decided not to surface a user option for skipping unparsable files at this time. What you did with the unstructured parser makes sense, and I can imagine a future where we offer other parsers that handle multiple file types, in which case we'll revisit.

So @bazarnov that means that this ticket is ready to be worked on as-is. If anything is unclear with the requirements/implementation/anything let me know!

@bazarnov
Copy link
Collaborator

bazarnov commented Nov 7, 2023

@clnoll Got it, thank you for the clarification.

@bazarnov bazarnov moved this to Implementation in progress in GL Roadmap Nov 8, 2023
@flash1293
Copy link
Contributor

Note: As the unstructured parser will have a special setting for this (#32092), a test needs to be adjusted before both of these things can be merged. We need to make sure we don't merge both in without fixing these compatibility problems: https://github.com/airbytehq/airbyte/pull/32092/files#diff-d72e22c154f3ccfe26fd05c6db975ea23613234e77f82d5042ba7f68e3533452R236

@bazarnov
Copy link
Collaborator

@flash1293 @clnoll Do we have some examples of the un-parsible files to experiment with? Maybe some on-call issues or anything?

@flash1293
Copy link
Contributor

@bazarnov a csv file is unparseable for the unstructured parsers, so it's a good example

@bazarnov
Copy link
Collaborator

bazarnov commented Nov 14, 2023

@clnoll @flash1293 I don't actually get the point about "raise the error and stop the sync explicitly", could you please give more details about the motivation for this? Why would we need to raise the errors we tried to avoid? It feels like another breaking change for another breaking change to come up later with no reason.

@flash1293
Copy link
Contributor

@clnoll is the expert, but my understanding is that the current state can be very confusing for users. Let's say there are 3 files a connection is syncing. One of them becomes corrupted. The sync still shows as a "success" in the UI (no indication something went wrong), but the data doesn't make it downstream anymore.

It will probably be noticed way later that something is wrong and a human needs to follow the trace from some BI dashboard backwards to notice where things broke. This is one of the things Airbyte tries to solve for users.

Unstructured sources are kind of an outlier here, because they often deal with a large number of messy files. I'm working on extending Airbyte with a notion of a "sync with warnings" to capture this better, but for now I believe the planned approach is the best one to take.

@bazarnov
Copy link
Collaborator

bazarnov commented Nov 14, 2023

@flash1293 I agree with :

The sync still shows as a "success" in the UI (no indication something went wrong), but the data doesn't make it downstream anymore.

But it's more like the UI issue, having these errors silently swallowed by the logs, not the actual connector issues yet.

And disagree on this one:

...but for now, I believe the planned approach is the best one to take.

the best option is to allow the users to observe the issue that happened (as a warning) without breaking the sync, allowing the user to decide whether or not it was the bad file or connector's issues while reading the file which is perfectly formated and suitable for the connector to work with.

@katmarkham @oustynova @lazebnyi
The main question is: do we plan to have something like the "Notification center" for the user where all of the warnings and errors from the syncs they have will be collected, making them take any actions or reach out to the Support Team for more info about the issues happened?

Updated: we have something to consider already, please check: https://docs.google.com/document/d/1KsBf4tdPeook1jGtwjN9I6q_py43d5PYKzV-TOUoff0/edit

@oustynova
Copy link

@bazarnov @flash1293 @katmarkham my voice is for NOT breaking the sync and log the errors.

@girarda
Copy link
Contributor

girarda commented Nov 16, 2023

@bazarnov @oustynova, I have a slightly different take on this issue.

Unstructured data is a special case because the same parser is used for a wide variety of file types.

Other parsers such as csv, json, avro, and parquet should always succeed. Unparseable files are indicative of a user error: either the stream's path is wrong because it includes files that should not be included, or the files are corrupt. Either way, it's preferable to fail and let the user fix the issue than pretend like we synced all the files.

Since only the unstructured parser needs to skip unparseable files, I'm in favor of not even exposing that option on the csv, json, avro, and parquet parsers and always fail. (I don't feel as strongly about this so not a blocker on my end)

Caveat 1: I'm in favor of replacing the failure with a success with warning once we have that capability
Caveat 2: This is all based on the assumption we'll raise a config error and not a system error. We absolutely should not have OC issues created because users try to sync poorly formatted files

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects
Status: On Hold
Development

Successfully merging a pull request may close this issue.

6 participants