Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

✨ [source-github] Bump cdk version and enable RFR for all non-incremental streams #42966

Merged
merged 9 commits into from
Aug 9, 2024

Conversation

brianjlai
Copy link
Contributor

@brianjlai brianjlai commented Aug 2, 2024

What

Enables RFR for all streams that are not already supporting incremental syncs.

How

All github streams are effectively substreams because their parent is either a repository or an organization which is based on the stream config. And for these streams we instantiate a SubstreamResumableFullRefresh cursor. We skip incrementals by checking for the presence of cursor_fields which indicates incremental sync mode.

Sample stream state for substreams:

to copy

User Impact

They should get RFR w/o needing to update anything

Can this PR be safely reverted and rolled back?

  • YES 💚
  • NO ❌

Copy link

vercel bot commented Aug 2, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Skipped Deployment
Name Status Preview Comments Updated (UTC)
airbyte-docs ⬜️ Ignored (Inspect) Visit Preview Aug 8, 2024 8:12pm

@@ -57,6 +58,9 @@ def __init__(self, api_url: str = "https://api.github.com", access_token_type: s
self.api_url = api_url
self.state = {}

if self.cursor_field or len(self.cursor_field) == 0:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little confused by this check - we use the substream RFR cursor if there is a cursor_field or if its length is 0?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, sorry i should've commented, but it's supposed to be not self.cursor_field, I reversed it to test something on regression tests, but i'll change it back for final testing when i move it out of draft

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use the self.supports_incremental property to be checked?

CDK Code

It could be simply as:

if not self.supports_incremental:
    ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep fair point, will change thanks for the suggestion!

@brianjlai brianjlai marked this pull request as ready for review August 4, 2024 04:13
@brianjlai
Copy link
Contributor Author

brianjlai commented Aug 7, 2024

/approve-regression-tests passing tests here: https://github.com/airbytehq/airbyte/actions/runs/10275055124/job/28432993552

Check job output.

✅ Approving regression tests

Copy link
Contributor

@maxi297 maxi297 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CATs where already failing before this change (see this). Should we fix them on master and introduce the fix here to make sure that the CATs will not be failing with this change?

Also, would it be worth adding a mock server test to validate that the state we expect is the right one?

@brianjlai
Copy link
Contributor Author

@maxi297

CATs where already failing before this change (see this). Should we fix them on master and introduce the fix here to make sure that the CATs will not be failing with this change?

Fair, but yeah I don't think it's related since nightly builds have been failing due to a missing record for some time now. But I'm okay waiting or fixing. To me it's relatively low risk

Also, would it be worth adding a mock server test to validate that the state we expect is the right one?

Good suggestion I can add that in

@bazarnov
Copy link
Collaborator

bazarnov commented Aug 8, 2024

It looks good to me; we still need to fix the GitHub source separately, I believe. I'll approve once this is safe-and-sound.

@brianjlai
Copy link
Contributor Author

@bazarnov

It looks good to me; we still need to fix the `GitHub source separately, I believe. I'll approve once this is safe-and-sound.

What is the GitHub source separately? Was this another error you found? The main test failure was an issue with a missing expected primary key record:

AssertionError: Expected to see those primary keys in the actual response for stream workflow_runs but they were not found.
E                   assert not {{('id',): 4871166142}}

@bazarnov
Copy link
Collaborator

bazarnov commented Aug 8, 2024

@brianjlai Nope, this is the same one, I mean we need to fix it first, then merge and continue with this change.

@brianjlai
Copy link
Contributor Author

@bazarnov @maxi297

got it thanks for clarifying.

i dug into it and i'm not 100% sure why, but the old workflow_runs record we expect doesn't seem to exist anymore. Maybe it's no longer retained after a year or something, but I replaced it with a new workflow run within our test repo that does exist as per https://github.com/airbytehq/integration-test/actions this. We'll see if this fixes expected records and nightly builds

@brianjlai
Copy link
Contributor Author

brianjlai commented Aug 8, 2024

/approve-regression-tests passing tests here: https://github.com/airbytehq/airbyte/actions/runs/10275055124/job/28432993552 . subsequent changes were just to fix or add tests

Check job output.

✅ Approving regression tests

Copy link
Contributor

@girarda girarda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@brianjlai brianjlai merged commit 2aaf33e into master Aug 9, 2024
36 checks passed
@brianjlai brianjlai deleted the brian/github_enable_rfr_for_substreams branch August 9, 2024 16:54
LouisAuneau pushed a commit to LouisAuneau/airbyte that referenced this pull request Aug 13, 2024
@szubster
Copy link
Contributor

After updating github source on our instance, to 1.8.2 (with this PR) we observed big regression in sync time.
Especially in IssueTimelineEvents stream.
From running py-spy it seems most of the time is spend in json handling in

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/connectors Connector related issues area/documentation Improvements or additions to documentation connectors/source/github
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants