Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(python-cdk): add user friendly message for encoding errors #44438

Merged
merged 28 commits into from
Aug 28, 2024

Conversation

strosek
Copy link
Contributor

@strosek strosek commented Aug 20, 2024

What

Make it clearer when a CSV file has problems in discover/read because of a file encoding mismatch.
This also improves general error display in the UI thanks to catching AirbyteTracedExceptions and building a readable trace message with the concatenation of each ATE's message.

How

Catching the UnicodeError and throwing an easier to read exception message.

Review guide

  1. csv_parser.py - added try-except for creating a DecodingError from the UnicodeError. Also fixed a mypy error.
  2. exceptions.py - added friendlier error message and DecodingError alias.
  3. default_file_based_stream.py - fix exception throwing chain, clean duplicate log entries
  4. default_file_based_availability_strategy.py - fix exception throwing chain, raise ATE.

User Impact

The user will have more clarity on why a decoding error happens, and how to fix it.

Can this PR be safely reverted and rolled back?

  • YES 💚
  • NO ❌

Copy link

vercel bot commented Aug 20, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
airbyte-docs ✅ Ready (Inspect) Visit Preview 💬 Add feedback Aug 28, 2024 2:21am

@strosek strosek self-assigned this Aug 20, 2024
@octavia-squidington-iii octavia-squidington-iii added the CDK Connector Development Kit label Aug 20, 2024
@strosek strosek requested a review from girarda August 20, 2024 20:47
@strosek strosek marked this pull request as ready for review August 20, 2024 20:47
@octavia-squidington-iii octavia-squidington-iii removed the area/connectors Connector related issues label Aug 21, 2024
@@ -54,7 +55,7 @@ def state(self, value: MutableMapping[str, Any]) -> None:
"""State setter, accept state serialized by state getter."""
self._cursor.set_initial_state(value)

@property
@property # type: ignore # no suitable parent cursor type
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is it not suitable / why can't the typing error be fixed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked at several of the Cursor offered by the IDE, and they didn't seem right. The only Cursor object that fixed the problem was the GCS Cursor, but this is a more generic file as it belongs to the CDK and not the specific GCS source.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the error exactly? Is the GCS cursor not an AbstractFileBasedCursor?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that GCS cursor is not generic, as this file belongs to the CDK. Error is error: Signature of "cursor" incompatible with supertype "Stream" [override]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What’s the expected type from the parent class? Is it really a mismatch? If not, can the comment be more clear about why this is a mypy problem, not with the type

Copy link
Contributor Author

@strosek strosek Aug 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not a mismatch. gcs.Cursor inherits from DefualtFileBasedCursor
, which inherits from AbstractFileBasedCursor. Improved comment.

@strosek
Copy link
Contributor Author

strosek commented Aug 21, 2024

New error message is correctly displayed. In this example, utf16 was used on a file encoded with utf8:
Screenshot 2024-08-26 at 8 47 11 a m

@strosek strosek requested a review from girarda August 22, 2024 03:19
@octavia-squidington-iii octavia-squidington-iii added the area/connectors Connector related issues label Aug 26, 2024
@octavia-squidington-iii octavia-squidington-iii removed the area/connectors Connector related issues label Aug 26, 2024
codeflash-ai bot added a commit that referenced this pull request Aug 26, 2024
…sek/gcs_decode_error_issue_8952`)

Certainly! To optimize the code, one can focus on multiple aspects like reducing the overhead of unnecessary checks, avoiding redundant operations, efficient use of data structures, and minimizing log string concatenations. Here is the optimized version of your given code.



### Optimization Highlights.
1. **Concise Manual Validation**: Combined `quote_char`, `escape_char` validation into a single validator.
2. **Helper Functions**: Utilized a generalized helper function to avoid repetition and handle specific validations concisely.
3. **Import Statements Simplified**: Removed unnecessary imports and rearranged for better readability.
4. **Combined Error Logging**: Concatenated log messages in a single operation to minimize the overhead.
5. **Proper Use of Pythonic Constructs**: Optimized type casting while skipping unnecessary checks and directly handling edge cases.

This optimized version should maintain the functionality while potentially improving performance due to fewer and more efficient operations.
Copy link

codeflash-ai bot commented Aug 26, 2024

⚡️ Codeflash found optimizations for this PR

📄 CsvParser._cast_types() in airbyte-cdk/python/airbyte_cdk/sources/file_based/file_types/csv_parser.py

📈 Performance improved by 18% (0.18x faster)

⏱️ Runtime went down from 1.70 millisecond to 1.43 millisecond

I created a new dependent PR with the suggested changes. Please review:

If you approve, it will be merged into this PR (branch strosek/gcs_decode_error_issue_8952).

with pytest.raises(expected_exc) as exc:
check(capsys, tmp_path, scenario)

if expected_msg:
Copy link
Contributor Author

@strosek strosek Aug 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This block was never executed because previous block exited with the exception. Also, output was None, since the code never returned. Tests should be more thorough now.

Copy link
Contributor

@girarda girarda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks great thanks @strosek 🚢

@strosek
Copy link
Contributor Author

strosek commented Aug 28, 2024

/approve-regression-tests

Check job output.

✅ Approving regression tests

@strosek strosek merged commit fc8cd5a into master Aug 28, 2024
36 checks passed
@strosek strosek deleted the strosek/gcs_decode_error_issue_8952 branch August 28, 2024 17:13
@@ -520,6 +520,17 @@ def test_parse_field_size_larger_than_default_python_maximum(self) -> None:
data_generator = self._read_data()
assert list(data_generator) == [{"header1": "1", "header2": long_string}]

def test_read_data_with_encoding_error(self) -> None:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test ATE raised.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CDK Connector Development Kit connectors/source/gcs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants