🎉 File-based CDK: stop the sync on parse errors #31605

clnoll · 2023-10-19T11:04:37Z

What

The original S3 connector logged a warning on file parsing issues but continued syncing. To be more consistent with the rest of the sources we actually want parse errors to stop the sync. Now that the fcdk has been in production for a while, it feels safe to make this change.

Suggested Implementation

This will involve changing this line where we yield a log message on record parse errors; instead we'll let the error bubble up and stop the sync.

The change in itself shouldn't be very big, but we anticipate that it will cause some CAT and unit test failures.

We'll also need to migrate source-S3 (and source-gcs, which is in progress) to the new pattern, which will be a breaking change.

Note: this is a breaking change because some syncs that would have succeeded in the past will now fail. It should be treated accordingly wrt Support and rollout.

Acceptance Criteria

When a file can't be parsed as the selected file type, we stop the sync with an error - the PR
Source S3 is updated to use the version of the fcdk with the new behavior.
If Source GCS has been deployed before the change went out, it's updated too.

clnoll · 2023-10-24T17:28:32Z

Grooming notes:

This is technically a breaking change because users may have syncs that would have succeeded but will fail now. We'll want to flag that for Support.

flash1293 · 2023-11-03T10:03:49Z

@clnoll For some cases it might make sense to keep the existing behavior of skipping the file (e.g. see #32092 ). I'm fine with implementing the behavior explained here as the default and leaving it to the parser to do something else as it would cover the use case I have in mind, but did we consider making this behavior a configurable setting on the stream level itself?

clnoll · 2023-11-06T14:39:34Z

@flash1293 making it configurable on the stream makes sense to me. I'll put it to the team and update the plan here.

bazarnov · 2023-11-06T17:51:18Z

@clnoll @flash1293 I'm going to work on this one, where can i find the full list of requirements to cover here?

clnoll · 2023-11-06T22:38:39Z

@bazarnov I'll be asking the team tomorrow whether we should make this behavior configurable per @flash1293 's suggestion. If so, I'll update the requirements in the "Acceptance Criteria" section of this issue. Once you get started let me know if you need more details on anything.

clnoll · 2023-11-07T18:16:15Z

@bazarnov @flash1293 we've decided not to surface a user option for skipping unparsable files at this time. What you did with the unstructured parser makes sense, and I can imagine a future where we offer other parsers that handle multiple file types, in which case we'll revisit.

So @bazarnov that means that this ticket is ready to be worked on as-is. If anything is unclear with the requirements/implementation/anything let me know!

bazarnov · 2023-11-07T19:17:19Z

@clnoll Got it, thank you for the clarification.

flash1293 · 2023-11-08T15:24:02Z

Note: As the unstructured parser will have a special setting for this (#32092), a test needs to be adjusted before both of these things can be merged. We need to make sure we don't merge both in without fixing these compatibility problems: https://github.com/airbytehq/airbyte/pull/32092/files#diff-d72e22c154f3ccfe26fd05c6db975ea23613234e77f82d5042ba7f68e3533452R236

bazarnov · 2023-11-10T12:59:35Z

@flash1293 @clnoll Do we have some examples of the un-parsible files to experiment with? Maybe some on-call issues or anything?

flash1293 · 2023-11-10T13:36:07Z

@bazarnov a csv file is unparseable for the unstructured parsers, so it's a good example

bazarnov · 2023-11-14T10:35:32Z

@clnoll @flash1293 I don't actually get the point about "raise the error and stop the sync explicitly", could you please give more details about the motivation for this? Why would we need to raise the errors we tried to avoid? It feels like another breaking change for another breaking change to come up later with no reason.

flash1293 · 2023-11-14T10:48:24Z

@clnoll is the expert, but my understanding is that the current state can be very confusing for users. Let's say there are 3 files a connection is syncing. One of them becomes corrupted. The sync still shows as a "success" in the UI (no indication something went wrong), but the data doesn't make it downstream anymore.

It will probably be noticed way later that something is wrong and a human needs to follow the trace from some BI dashboard backwards to notice where things broke. This is one of the things Airbyte tries to solve for users.

Unstructured sources are kind of an outlier here, because they often deal with a large number of messy files. I'm working on extending Airbyte with a notion of a "sync with warnings" to capture this better, but for now I believe the planned approach is the best one to take.

bazarnov · 2023-11-14T10:59:34Z

@flash1293 I agree with :

The sync still shows as a "success" in the UI (no indication something went wrong), but the data doesn't make it downstream anymore.

But it's more like the UI issue, having these errors silently swallowed by the logs, not the actual connector issues yet.

And disagree on this one:

...but for now, I believe the planned approach is the best one to take.

the best option is to allow the users to observe the issue that happened (as a warning) without breaking the sync, allowing the user to decide whether or not it was the bad file or connector's issues while reading the file which is perfectly formated and suitable for the connector to work with.

@katmarkham @oustynova @lazebnyi
The main question is: do we plan to have something like the "Notification center" for the user where all of the warnings and errors from the syncs they have will be collected, making them take any actions or reach out to the Support Team for more info about the issues happened?

Updated: we have something to consider already, please check: https://docs.google.com/document/d/1KsBf4tdPeook1jGtwjN9I6q_py43d5PYKzV-TOUoff0/edit

oustynova · 2023-11-14T14:02:59Z

@bazarnov @flash1293 @katmarkham my voice is for NOT breaking the sync and log the errors.

girarda · 2023-11-16T22:03:54Z

@bazarnov @oustynova, I have a slightly different take on this issue.

Unstructured data is a special case because the same parser is used for a wide variety of file types.

Other parsers such as csv, json, avro, and parquet should always succeed. Unparseable files are indicative of a user error: either the stream's path is wrong because it includes files that should not be included, or the files are corrupt. Either way, it's preferable to fail and let the user fix the issue than pretend like we synced all the files.

Since only the unstructured parser needs to skip unparseable files, I'm in favor of not even exposing that option on the csv, json, avro, and parquet parsers and always fail. (I don't feel as strongly about this so not a blocker on my end)

Caveat 1: I'm in favor of replacing the failure with a success with warning once we have that capability
Caveat 2: This is all based on the assumption we'll raise a config error and not a system error. We absolutely should not have OC issues created because users try to sync poorly formatted files

clnoll added the team/extensibility label Oct 19, 2023

oustynova added the team/connectors-python label Oct 27, 2023

flash1293 mentioned this issue Nov 3, 2023

File CDK: Allow skipping unparseable file types #32092

Merged

bazarnov self-assigned this Nov 6, 2023

bazarnov added this to GL Roadmap Nov 6, 2023

bazarnov changed the title ~~File-based CDK: stop the sync on parse errors~~ 🎉 File-based CDK: stop the sync on parse errors Nov 6, 2023

lazebnyi removed the team/extensibility label Nov 6, 2023

bazarnov moved this to Implementation in progress in GL Roadmap Nov 8, 2023

bazarnov moved this from Implementation in progress to On Hold in GL Roadmap Nov 15, 2023

bazarnov mentioned this issue Nov 16, 2023

🎉 Airbyte CDK (File-based CDK): Stop the sync if the record could not be parsed #32589

Merged

lazebnyi linked a pull request Jan 11, 2024 that will close this issue

🎉 Airbyte CDK (File-based CDK): Stop the sync if the record could not be parsed #32589

Merged

bazarnov closed this as completed in #32589 Jan 11, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

🎉 File-based CDK: stop the sync on parse errors #31605

🎉 File-based CDK: stop the sync on parse errors #31605

clnoll commented Oct 19, 2023 •

edited by bazarnov

Loading

clnoll commented Oct 24, 2023

flash1293 commented Nov 3, 2023

clnoll commented Nov 6, 2023 •

edited

Loading

bazarnov commented Nov 6, 2023

clnoll commented Nov 6, 2023

clnoll commented Nov 7, 2023

bazarnov commented Nov 7, 2023

flash1293 commented Nov 8, 2023

bazarnov commented Nov 10, 2023

flash1293 commented Nov 10, 2023

bazarnov commented Nov 14, 2023 •

edited

Loading

flash1293 commented Nov 14, 2023

bazarnov commented Nov 14, 2023 •

edited

Loading

oustynova commented Nov 14, 2023

girarda commented Nov 16, 2023

🎉 File-based CDK: stop the sync on parse errors #31605

🎉 File-based CDK: stop the sync on parse errors #31605

Comments

clnoll commented Oct 19, 2023 • edited by bazarnov Loading

What

Suggested Implementation

Note: this is a breaking change because some syncs that would have succeeded in the past will now fail. It should be treated accordingly wrt Support and rollout.

Acceptance Criteria

clnoll commented Oct 24, 2023

flash1293 commented Nov 3, 2023

clnoll commented Nov 6, 2023 • edited Loading

bazarnov commented Nov 6, 2023

clnoll commented Nov 6, 2023

clnoll commented Nov 7, 2023

bazarnov commented Nov 7, 2023

flash1293 commented Nov 8, 2023

bazarnov commented Nov 10, 2023

flash1293 commented Nov 10, 2023

bazarnov commented Nov 14, 2023 • edited Loading

flash1293 commented Nov 14, 2023

bazarnov commented Nov 14, 2023 • edited Loading

oustynova commented Nov 14, 2023

girarda commented Nov 16, 2023

clnoll commented Oct 19, 2023 •

edited by bazarnov

Loading

clnoll commented Nov 6, 2023 •

edited

Loading

bazarnov commented Nov 14, 2023 •

edited

Loading

bazarnov commented Nov 14, 2023 •

edited

Loading