
Source Notion: Migrate to Low-Code #35974

Merged: 55 commits into master on Apr 15, 2024

Conversation

@ChristoGrab (Contributor) commented Mar 11, 2024

What

This PR migrates source Notion from the Python CDK to the low-code declarative framework. The proposed implementation is a hybrid connector, using the low-code manifest for all streams except the blocks stream, which has a fair amount of custom behavior, including a recursive fetching algorithm that I struggled to implement as a custom component. Issue with more context.

Low-code migration reference

Changes

Non-Breaking

  • Check has been changed from a custom implementation (a request to the /users/me endpoint, which returns the current user's data) to the standard manifest implementation (check stream: pages). Some custom error handling has been lost in the process.
  • Some custom validation checks for the start_date were removed. The format of the start_date is already strictly enforced in the UI, and it is an optional field with a default value, so this felt like an acceptable loss.
  • Retry logic no longer reduces the page size on repeated requests for low-code streams (a rough sketch of the dropped behavior follows after this list). My inclination is that this is not a big functionality loss, as the blocks stream still retains the original behavior, and it is by far the biggest data stream for this connector (everything in Notion is a block, so the vast majority of records will always come from this stream).
  • Removed the call to transform the records read in the parent page stream when fetching slice ids for use in the blocks stream. Transformations are still applied to the records when syncing the low-code pages stream.
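
For context, here is a minimal sketch of the kind of page-size reduction being dropped for the low-code streams. The helper name and the halving heuristic are assumptions for illustration only; this is not the connector's actual retry code.

import requests

DEFAULT_PAGE_SIZE = 100
MIN_PAGE_SIZE = 10

def fetch_with_shrinking_page_size(url: str, headers: dict, max_attempts: int = 5) -> dict:
    # Hypothetical sketch: retry a paginated request, shrinking page_size on each failure.
    page_size = DEFAULT_PAGE_SIZE
    for _ in range(max_attempts):
        response = requests.get(url, headers=headers, params={"page_size": page_size})
        if response.status_code == 200:
            return response.json()
        # On a retryable error (e.g. a 504 timeout), halve the page size before trying again,
        # so the next attempt asks the API for a smaller, cheaper response.
        page_size = max(MIN_PAGE_SIZE, page_size // 2)
    response.raise_for_status()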

🚨🚨 Breaking

  • The production comments stream currently adds the parent page record's cursor field as a page_last_edited_time field to each comment record, which is used as the cursor field for semi-incremental syncs. This means all comments within a parent page's partition share the same cursor value. This has been updated to use per-partition state.

Current:

{"stream_state": {"page_last_edited_time": "2020-01-02"}} <- all `comments` records within a parent slice will share the same cursor value

New:

{"stream_state": "states": [
  {"partition": {"page_id_1": "2020-01-01"}}, 
  "partition": {"page_id_2": "2020-01-02"}}]} <- each `comments` record now compares its actual timestamp to the value of state for its partition

The resulting change is twofold: the format of state has changed to per-partition and the cursor field has been moved from the no-longer appended page_last_edited_time to the record's existing last_edited_time field.

Notes

  • I've done some trimming of unit tests, but have left in the tests that cover the remaining Python code.


@octavia-squidington-iii octavia-squidington-iii added the area/documentation Improvements or additions to documentation label Mar 20, 2024
@ChristoGrab ChristoGrab marked this pull request as ready for review March 21, 2024 22:43
@octavia-squidington-iv octavia-squidington-iv requested review from a team March 21, 2024 22:44
@alafanechere alafanechere added the low-code-migration This connector has been migrated to the low-code CDK label Mar 22, 2024
@clnoll clnoll self-requested a review March 22, 2024 14:34
@clnoll (Contributor) left a comment:

Looks great so far @ChristoGrab! I appreciate the detailed PR description.

Just a few questions in the code, and a few more here:

The pages and databases incremental logic changed from a custom semi-incremental implementation to the Data Feed low-code implementation. This means the requests to these streams are now in descending rather than ascending order. This should have no effect from a user perspective.

The one effect that I'm wondering about is how we'll pick up where we left off if the sync is interrupted.
Since the DataFeed implementation already does this I think we’re okay, but I’m curious whether you’re aware of a special reason that we were syncing this stream in ascending order previously.

Retry logic no longer reduces page size on repeated requests for low-code streams. My inclination is this is not a big functionality loss, as the blocks stream still retains the original behavior, and it is by far the biggest data stream for this connector (everything in Notion is a block, so the vast majority of records will always come from this stream).
This suggests to me that we should regression test against a connection that requires retries.

Can we verify e.g. from logs, that it’s only the blocks stream that was hitting this?

Removed the call to transform the records read in the parent page stream when fetching slice ids for use in the blocks stream. Transformations are still applied to the records when syncing the low-code pages stream.

Is there any external impact as a result of this?

The production comments stream currently adds the parent page record's cursor field as a page_last_edited_time field to each comment record, which is used as the cursor field for semi-incremental syncs. This means all comments within a parent page's partition share the same cursor value. This has been updated to use per-partition state.

Was this necessary for the migration? Can we still handle the old state?

Is there a breaking changes checklist that I should look at?

@@ -33,6 +33,7 @@ acceptance_tests:
configured_catalog_path: "integration_tests/incremental_catalog.json"
future_state:
future_state_path: "integration_tests/abnormal_state.json"
skip_comprehensive_incremental_tests: true
Contributor:
Why are we skipping these?

Contributor Author:

Glad you asked! This test is a frequent pain point for connectors with limited sandbox data and can throw failures even when nothing is wrong; at first glance I had assumed this was just another such case, but I took a deeper look into the test results and am seeing some fishy logs for the Blocks stream. Investigating further and will get back to you soon as I have a clearer sense of what's going on here.

@ChristoGrab (Contributor Author) commented Mar 29, 2024

@cnoll Ok, I think I understand what's going on. The Blocks stream is a doozy, as we not only have no way of filtering or sorting results, but due to the nature of the recursive traversal through the blocks hierarchy, there isn't a defined top-level of records used in slicing.

Instead, an id_stack is populated with the parent stream ids; then as we query those pages, the nested blocks have their ids added to the stack and become the next slice (which will contain child blocks that will become slices for their own children, and down the rabbit hole we go). Right now the stream tracks the cursor_value of each record and uses a custom StateValueWrapper to compare and update the max_cursor_time until there is only one id remaining in the stack; this condition flips an is_finished boolean, which then triggers the final calculation of max_cursor in the StateValueWrapper to actually update the stream's state.
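
For illustration only, here is a rough Python sketch of the traversal pattern described above. The names (id_stack, list_children, and the simplified state handling) are assumptions, not the actual Blocks stream code:

from collections import deque

def read_blocks(client, parent_page_ids, stream_state):
    # Walk the block hierarchy depth-first, tracking the max cursor seen.
    # State is only committed once the stack is drained, mirroring the
    # is_finished / StateValueWrapper behavior described above.
    id_stack = deque(parent_page_ids)
    state_value = stream_state.get("last_edited_time", "")
    max_cursor_time = state_value

    while id_stack:
        block_id = id_stack.pop()
        for block in client.list_children(block_id):  # hypothetical API client call
            max_cursor_time = max(max_cursor_time, block["last_edited_time"])
            if block.get("has_children"):
                id_stack.append(block["id"])  # child blocks become slices for their own children
            # Client-side filtering: only emit records newer than the saved state.
            if block["last_edited_time"] > state_value:
                yield block

    # Only now (stack drained, i.e. "is_finished") does the stream's state advance.
    stream_state["last_edited_time"] = max_cursor_time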

Here are some print statements showing what these values look like when there is an existing state value:

current stream_state: {'last_edited_time': 2022-10-10T04:00:00.000Z}
current StateValueWrapper: stream=<source_notion.streams.Blocks object at 0x7fc71ff9a140> state_value='2022-10-10T04:00:00.000Z' max_cursor_time='2023-10-10T13:51:00.000Z' <- `max_cursor_time` represents the latest cursor value in the read thus far

most recent record_cursor: 2023-10-23T17:30:00.000Z
updated StateValueWrapper.max_cursor_time: 2023-10-23T17:30:00.000 <- this will keep updating and become the new value of state only if `self.is_finished` has been flipped

So like with the other Notion streams, it's functionally a full read operation that uses filtering on subsequent reads to prevent records from emitting. The issue exposed by the comprehensive_incremental_tests is that the value of state is only updated at the end of the read, rather than on each call to get_updated_state. Unfortunately, changing the logic to update the value of state after a slice is processed would be dangerous, since our client-side filtering logic would then start taking effect on the first read and exclude unsorted blocks records that have never been synced.

Ordinarily, the best solution would probably be to set up per-partition semi-incremental logic (like in the Comments stream) so we can update each slice's state as we go and aren't reliant on a first successful sync to set a value for stream_state. My one hesitation in this case is with the inevitable ballooning of the state object for this stream. For reference, our tiny sandbox account with just a few basic sample pages has over 500 blocks records, any number of which could be a potential parent/slice for other blocks. I'm not certain what the memory limitations are for state objects, but I'm concerned about the potential scaling implications on larger Notion accounts, and unfortunately no matter what we do we can't change the fact that there's no server-side filtering we can use to limit the number of requests per sync.

Any thoughts on how to handle this? I'm tempted to say "working as intended!" for now since this is how the connector has always worked, but I'm admittedly biased by a desire to have this ready to ship by Monday 😅

Comment:

I assume the mention should go to @clnoll

Contributor Author:

Whoops haha yes, so sorry!

Contributor:

Wow, thank you for the detailed explanation! This is super helpful.

The issue exposed by the comprehensive_incremental_tests is that the value of state is only updated at the end of the read, rather than on each call to get_updated_state.

It sounds like the main side effect is that we may end up syncing some records more than necessary. Is that right?

In any case I agree with you about the best solution, but also that since this is how it's always worked, we can leave it as-is for now. Would you mind creating an issue for this, in case we decide we want to follow up on it?

Contributor Author:

Yup, that sounds right: if we get halfway through a first sync and there's an interruption, any records that were successfully synced will still be emitted on the subsequent attempt. Put the issue together here: https://github.com/airbytehq/airbyte/issues/36698, just a copy/paste of the investigation to start but I'll do some grooming 👍

Contributor:

@ChristoGrab, is it possible to delete the Blocks stream from the catalog used during incremental tests, or will you enable the tests when they are fixed? #36814

@octavia-squidington-iv octavia-squidington-iv requested a review from a team March 25, 2024 14:50
@ChristoGrab (Contributor Author) commented:

@clnoll Thanks for the thorough review! I'll address the main block of comments one at a time here:

The pages and databases incremental logic changed from a custom semi-incremental implementation to the Data Feed low-code implementation. This means the requests to these streams are now in descending rather than ascending order. This should have no effect from a user perspective.

The one effect that I'm wondering about is how we'll pick up where we left off if the sync is interrupted.
Since the DataFeed implementation already does this I think we’re okay, but I’m curious whether you’re aware of a special reason that we were syncing this stream in ascending order previously.

The previous behavior was somewhat hacky, since the endpoint is very limited when it comes to sorting/filtering, and only allows us to specify the sort order and filter by type: page/database. So "incremental" is a bit of a cheeky misnomer, as every read is essentially a full-refresh operation that simply uses client-side filtering to determine which records to emit. I checked the original PR to see if there was any indication of why the order was set to ascending, but didn't see anything pertinent. My naive guess is that the developer picked it as a seemingly sensible default rather than as a deliberate design choice, given that the existing logic would always request all records regardless of the query order and state value. Sorry I don't have a more certain answer, but on the plus side this means the migration actually has a positive impact for these streams that I failed to note: we now have an actual stop condition that ends the requests once the appropriate cursor value has been reached!
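
To make the stop-condition point concrete, here is a hedged sketch of what a data-feed-style read looks like in principle, with a hypothetical fetch_page helper standing in for the low-code internals:

def read_descending_with_stop(fetch_page, stream_state):
    # Records arrive sorted by last_edited_time descending, so we can stop
    # paginating as soon as we see a record older than the saved cursor.
    # The old ascending, client-side-filtered approach had to read everything.
    cursor = stream_state.get("last_edited_time", "")
    next_cursor = None
    while True:
        records, next_cursor = fetch_page(next_cursor)  # hypothetical paginated fetch
        for record in records:
            if record["last_edited_time"] <= cursor:
                return  # stop condition: everything older has already been synced
            yield record
        if not next_cursor:
            return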

@ChristoGrab (Contributor Author) commented Mar 27, 2024

Retry logic no longer reduces page size on repeated requests for low-code streams. My inclination is this is not a big functionality loss, as the blocks stream still retains the original behavior, and it is by far the biggest data stream for this connector (everything in Notion is a block, so the vast majority of records will always come from this stream).
This suggests to me that we should regression test against a connection that requires retries.

Can we verify e.g. from logs, that it’s only the blocks stream that was hitting this?

Unfortunately my limited access means I can't view customer connections/logs to verify this (Maxime is looking into getting my access upgraded), but I can at least point to the original oncall issue that led to this functionality being added. For some context, the issue was related to frequent 504 errors plaguing an unusually large Notion workspace (~10k pages), a problem exacerbated by unspecified changes going on in the Notion API at the time. Sorry, I know that's not much to go on; if this is a point of concern I can certainly take a stab at implementing the logic as a CustomBackoffStrategy for the low-code streams as well!

@ChristoGrab (Contributor Author) commented Mar 27, 2024

Removed the call to transform the records read in the parent page stream when fetching slice ids for use in the blocks stream. Transformations are still applied to the records when syncing the low-code pages stream.

Is there any external impact as a result of this?

This change should provide a minor boost to performance when syncing the Blocks stream, since we're no longer applying transformations during the parent read_records call and can therefore process those records a tad more efficiently. Aside from that there's no impact; the only value we need from this parent call is the id, which was untouched by the transformation.

@ChristoGrab (Contributor Author) commented Mar 27, 2024

The production comments stream currently adds the parent page record's cursor field as a page_last_edited_time field to each comment record, which is used as the cursor field for semi-incremental syncs. This means all comments within a parent page's partition share the same cursor value. This has been updated to use per-partition state.

Was this necessary for the migration? Can we still handle the old state?

In theory, it would be possible to reproduce the previous behavior by adding a CustomPartitionRouter component. However, although this is a breaking change, I THINK (happy to be wrong here) that it's worth it because it's an improved pattern. The existing Python implementation is hacky: the timestamps on the Comments records are essentially ignored, meaning we always emit every record in a page if the page's cursor value exceeds the value in state. So given this scenario:

State value: page_last_edited_time = 2

Page 1: last_edited_time = 1
  Comment 1: last_edited_time = 1, page_last_edited_time = 1
  Comment 2: last_edited_time = 5, page_last_edited_time = 1
  Comment 3: last_edited_time = 3, page_last_edited_time = 1

Page 2: last_edited_time = 3
  Comment 4: last_edited_time = 1, page_last_edited_time = 3
  Comment 5: last_edited_time = 6, page_last_edited_time = 3
  Comment 6: last_edited_time = 2, page_last_edited_time = 3

In the current implementation, we will emit all comments in Page 2 and ignore all comments in Page 1, because each comment's own last_edited_time is ignored in favor of the appended page_last_edited_time. By switching to per-partition state, we ensure that we are actually tracking and comparing each comment record's timestamp against the state value for that individual page/partition.
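
To illustrate the difference with the scenario above, here is a small sketch using hypothetical helpers (not the manifest itself): the old behavior filters whole pages by the appended page_last_edited_time, while per-partition state compares each comment's own last_edited_time against the state stored for its page:

def old_page_level_filter(pages, state):
    # Old behavior: comments inherit their page's cursor, so whole pages pass or fail together.
    threshold = state.get("page_last_edited_time")
    for page in pages:
        if threshold is None or page["last_edited_time"] > threshold:
            yield from page["comments"]  # Comments 4-6 all emitted, Comments 1-3 all skipped

def new_per_partition_filter(pages, partition_state):
    # New behavior: each comment's own timestamp is compared to its page's stored state.
    for page in pages:
        threshold = partition_state.get(page["id"])
        for comment in page["comments"]:
            if threshold is None or comment["last_edited_time"] > threshold:
                yield comment  # e.g. Comment 2 is now emitted even though Page 1 is "old"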

@clnoll (Contributor) commented Mar 28, 2024

Thanks very much for the clarifications, @ChristoGrab! All makes sense to me.

Let me know the results of your investigation into the skip_comprehensive_incremental_tests failures for the Blocks stream. Happy to take a look at any changes that have come out of it.

@octavia-squidington-iv octavia-squidington-iv requested a review from a team March 29, 2024 18:33
@octavia-squidington-iv octavia-squidington-iv requested a review from a team April 1, 2024 14:36
@ChristoGrab ChristoGrab requested a review from katmarkham April 1, 2024 14:48
@clnoll (Contributor) left a comment:

🎉

@lazebnyi lazebnyi requested a review from tolik0 April 5, 2024 15:49
@tolik0 (Contributor) commented Apr 10, 2024

@ChristoGrab Hi, have you tested these changes using the regression tool?

@tolik0 tolik0 force-pushed the christo/notion-low-code branch from c64082b to 6992880 on April 12, 2024 12:34
@tolik0 (Contributor) commented Apr 12, 2024

Approved. Issue for refactoring improvements: https://github.com/airbytehq/airbyte-internal-issues/issues/7311

@lazebnyi lazebnyi merged commit 98478f0 into master Apr 15, 2024
29 checks passed
@lazebnyi lazebnyi deleted the christo/notion-low-code branch April 15, 2024 16:01