-
Notifications
You must be signed in to change notification settings - Fork 4.5k
🎉 File-based CDK: stop the sync on parse errors #31605
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Grooming notes:
|
@clnoll For some cases it might make sense to keep the existing behavior of skipping the file (e.g. see #32092 ). I'm fine with implementing the behavior explained here as the default and leaving it to the parser to do something else as it would cover the use case I have in mind, but did we consider making this behavior a configurable setting on the stream level itself? |
@flash1293 making it configurable on the stream makes sense to me. I'll put it to the team and update the plan here. |
@clnoll @flash1293 I'm going to work on this one, where can i find the full list of requirements to cover here? |
@bazarnov I'll be asking the team tomorrow whether we should make this behavior configurable per @flash1293 's suggestion. If so, I'll update the requirements in the "Acceptance Criteria" section of this issue. Once you get started let me know if you need more details on anything. |
@bazarnov @flash1293 we've decided not to surface a user option for skipping unparsable files at this time. What you did with the unstructured parser makes sense, and I can imagine a future where we offer other parsers that handle multiple file types, in which case we'll revisit. So @bazarnov that means that this ticket is ready to be worked on as-is. If anything is unclear with the requirements/implementation/anything let me know! |
@clnoll Got it, thank you for the clarification. |
Note: As the unstructured parser will have a special setting for this (#32092), a test needs to be adjusted before both of these things can be merged. We need to make sure we don't merge both in without fixing these compatibility problems: https://github.com/airbytehq/airbyte/pull/32092/files#diff-d72e22c154f3ccfe26fd05c6db975ea23613234e77f82d5042ba7f68e3533452R236 |
@flash1293 @clnoll Do we have some examples of the un-parsible files to experiment with? Maybe some on-call issues or anything? |
@bazarnov a csv file is unparseable for the unstructured parsers, so it's a good example |
@clnoll @flash1293 I don't actually get the point about "raise the error and stop the sync explicitly", could you please give more details about the motivation for this? Why would we need to raise the errors we tried to avoid? It feels like another breaking change for another breaking change to come up later with no reason. |
@clnoll is the expert, but my understanding is that the current state can be very confusing for users. Let's say there are 3 files a connection is syncing. One of them becomes corrupted. The sync still shows as a "success" in the UI (no indication something went wrong), but the data doesn't make it downstream anymore. It will probably be noticed way later that something is wrong and a human needs to follow the trace from some BI dashboard backwards to notice where things broke. This is one of the things Airbyte tries to solve for users. Unstructured sources are kind of an outlier here, because they often deal with a large number of messy files. I'm working on extending Airbyte with a notion of a "sync with warnings" to capture this better, but for now I believe the planned approach is the best one to take. |
@flash1293 I agree with :
But it's more like the UI issue, having these errors silently swallowed by the logs, not the actual connector issues yet. And disagree on this one:
the best option is to allow the users to observe the issue that happened (as a warning) without breaking the sync, allowing the user to decide whether or not it was the bad file or connector's issues while reading the file which is perfectly formated and suitable for the connector to work with. @katmarkham @oustynova @lazebnyi Updated: we have something to consider already, please check: https://docs.google.com/document/d/1KsBf4tdPeook1jGtwjN9I6q_py43d5PYKzV-TOUoff0/edit |
@bazarnov @flash1293 @katmarkham my voice is for NOT breaking the sync and log the errors. |
@bazarnov @oustynova, I have a slightly different take on this issue. Unstructured data is a special case because the same parser is used for a wide variety of file types. Other parsers such as csv, json, avro, and parquet should always succeed. Unparseable files are indicative of a user error: either the stream's path is wrong because it includes files that should not be included, or the files are corrupt. Either way, it's preferable to fail and let the user fix the issue than pretend like we synced all the files. Since only the unstructured parser needs to skip unparseable files, I'm in favor of not even exposing that option on the csv, json, avro, and parquet parsers and always fail. (I don't feel as strongly about this so not a blocker on my end) Caveat 1: I'm in favor of replacing the failure with a success with warning once we have that capability |
What
The original S3 connector logged a warning on file parsing issues but continued syncing. To be more consistent with the rest of the sources we actually want parse errors to stop the sync. Now that the fcdk has been in production for a while, it feels safe to make this change.
Suggested Implementation
This will involve changing this line where we yield a log message on record parse errors; instead we'll let the error bubble up and stop the sync.
The change in itself shouldn't be very big, but we anticipate that it will cause some CAT and unit test failures.
We'll also need to migrate source-S3 (and source-gcs, which is in progress) to the new pattern, which will be a breaking change.
Note: this is a breaking change because some syncs that would have succeeded in the past will now fail. It should be treated accordingly wrt Support and rollout.
Acceptance Criteria
The text was updated successfully, but these errors were encountered: