
Source Notion: Migrate to Low-Code #35974

Merged: 55 commits into master on Apr 15, 2024

Conversation

@ChristoGrab (Contributor) commented Mar 11, 2024

What

This PR migrates source Notion from the Python CDK to the low-code declarative framework. The proposed implementation is a hybrid connector, using the low-code manifest for all streams except the blocks stream, which has a fair amount of custom behavior, including a recursive fetching algorithm that I struggled to implement as a custom component. Issue with more context.

Low-code migration reference

Changes

Non-Breaking

  • Check has been changed from a custom implementation (a request to the /users/me endpoint, which returns the current user's data) to the standard manifest implementation (check stream: pages). Some custom error handling has been lost in the process.
  • Some custom validation checks for the start_date were removed. The format of the start_date is already strictly enforced in the UI, and it is an optional field with a default value, so this felt like an acceptable loss.
  • Retry logic no longer reduces the page size on repeated requests for low-code streams (a rough sketch of the dropped behavior follows after this list). My inclination is that this is not a big functionality loss, as the blocks stream still retains the original behavior, and it is by far the biggest data stream for this connector (everything in Notion is a block, so the vast majority of records will always come from this stream).
  • Removed the call to transform the records read in the parent page stream when fetching slice ids for use in the blocks stream. Transformations are still applied to the records when syncing the low-code pages stream.
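
For context, here is a minimal sketch of the kind of page-size reduction being dropped for the low-code streams. The helper name and the halving heuristic are assumptions for illustration only; this is not the connector's actual retry code.

import requests

DEFAULT_PAGE_SIZE = 100
MIN_PAGE_SIZE = 10

def fetch_with_shrinking_page_size(url: str, headers: dict, max_attempts: int = 5) -> dict:
    # Hypothetical sketch: retry a paginated request, shrinking page_size on each failure.
    page_size = DEFAULT_PAGE_SIZE
    for _ in range(max_attempts):
        response = requests.get(url, headers=headers, params={"page_size": page_size})
        if response.status_code == 200:
            return response.json()
        # On a retryable error (e.g. a 504 timeout), halve the page size before trying again,
        # so the next attempt asks the API for a smaller, cheaper response.
        page_size = max(MIN_PAGE_SIZE, page_size // 2)
    response.raise_for_status()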

🚨🚨 Breaking

  • The production comments stream currently adds the parent page record's cursor field as a page_last_edited_time field to each comment record, which is used as the cursor field for semi-incremental syncs. This means all comments within a parent page's partition share the same cursor value. This has been updated to use per-partition state.

Current:

{"stream_state": {"page_last_edited_time": "2020-01-02"}} <- all `comments` records within a parent slice will share the same cursor value

New:

{"stream_state": "states": [
  {"partition": {"page_id_1": "2020-01-01"}}, 
  "partition": {"page_id_2": "2020-01-02"}}]} <- each `comments` record now compares its actual timestamp to the value of state for its partition

The resulting change is twofold: the format of state has changed to per-partition and the cursor field has been moved from the no-longer appended page_last_edited_time to the record's existing last_edited_time field.

Notes

  • I've done some trimming of unit tests, but have left in the tests that cover the remaining Python code.


@octavia-squidington-iii octavia-squidington-iii added the area/documentation Improvements or additions to documentation label Mar 20, 2024
@ChristoGrab ChristoGrab marked this pull request as ready for review March 21, 2024 22:43
@octavia-squidington-iv octavia-squidington-iv requested review from a team March 21, 2024 22:44
@alafanechere alafanechere added the low-code-migration This connector has been migrated to the low-code CDK label Mar 22, 2024
@clnoll clnoll self-requested a review March 22, 2024 14:34
@clnoll (Contributor) left a comment:

Looks great so far @ChristoGrab! I appreciate the detailed PR description.

Just a few questions in the code, and a few more here:

The pages and databases incremental logic changed from a custom semi-incremental implementation to the Data Feed low-code implementation. This means the requests to these streams are now in descending rather than ascending order. This should have no effect from a user perspective.

The one effect that I'm wondering about is how we'll pick up where we left off if the sync is interrupted.
Since the DataFeed implementation already does this I think we’re okay, but I’m curious whether you’re aware of a special reason that we were syncing this stream in ascending order previously.

Retry logic no longer reduces page size on repeated requests for low-code streams. My inclination is this is not a big functionality loss, as the blocks stream still retains the original behavior, and it is by far the biggest data stream for this connector (everything in Notion is a block, so the vast majority of records will always come from this stream).
This suggests to me that we should regression test against a connection that requires retries.

Can we verify e.g. from logs, that it’s only the blocks stream that was hitting this?

Removed the call to transform the records read in the parent page stream when fetching slice ids for use in the blocks stream. Transformations are still applied to the records when syncing the low-code pages stream.

Is there any external impact as a result of this?

The production comments stream currently adds the parent page record's cursor field as a page_last_edited_time field to each comment record, which is used as the cursor field for semi-incremental syncs. This means all comments within a parent page's partition share the same cursor value. This has been updated to use per-partition state.

Was this necessary for the migration? Can we still handle the old state?

Is there a breaking changes checklist that I should look at?

@@ -33,6 +33,7 @@ acceptance_tests:
configured_catalog_path: "integration_tests/incremental_catalog.json"
future_state:
future_state_path: "integration_tests/abnormal_state.json"
skip_comprehensive_incremental_tests: true
Contributor:
Why are we skipping these?

Contributor Author:

Glad you asked! This test is a frequent pain point for connectors with limited sandbox data and can throw failures even when nothing is wrong; at first glance I had assumed this was just another such case, but I took a deeper look into the test results and am seeing some fishy logs for the Blocks stream. Investigating further and will get back to you soon as I have a clearer sense of what's going on here.

@ChristoGrab (Contributor Author) commented Mar 29, 2024

@cnoll Ok, I think I understand what's going on. The Blocks stream is a doozy, as we not only have no way of filtering or sorting results, but due to the nature of the recursive traversal through the blocks hierarchy, there isn't a defined top-level of records used in slicing.

Instead, an id_stack is populated with the parent stream ids; then as we query those pages, the nested blocks have their ids added to the stack and become the next slice (which will contain child blocks that will become slices for their own children, and down the rabbit hole we go). Right now the stream tracks the cursor_value of each record and uses a custom StateValueWrapper to compare and update the max_cursor_time until there is only one id remaining in the stack; this condition flips an is_finished boolean, which then triggers the final calculation of max_cursor in the StateValueWrapper to actually update the stream's state.
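
For illustration only, here is a rough Python sketch of the traversal pattern described above. The names (id_stack, list_children, and the simplified state handling) are assumptions, not the actual Blocks stream code:

from collections import deque

def read_blocks(client, parent_page_ids, stream_state):
    # Walk the block hierarchy depth-first, tracking the max cursor seen.
    # State is only committed once the stack is drained, mirroring the
    # is_finished / StateValueWrapper behavior described above.
    id_stack = deque(parent_page_ids)
    state_value = stream_state.get("last_edited_time", "")
    max_cursor_time = state_value

    while id_stack:
        block_id = id_stack.pop()
        for block in client.list_children(block_id):  # hypothetical API client call
            max_cursor_time = max(max_cursor_time, block["last_edited_time"])
            if block.get("has_children"):
                id_stack.append(block["id"])  # child blocks become slices for their own children
            # Client-side filtering: only emit records newer than the saved state.
            if block["last_edited_time"] > state_value:
                yield block

    # Only now (stack drained, i.e. "is_finished") does the stream's state advance.
    stream_state["last_edited_time"] = max_cursor_time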

Here are some print statements showing what these values look like when there is an existing state value:

current stream_state: {'last_edited_time': 2022-10-10T04:00:00.000Z}
current StateValueWrapper: stream=<source_notion.streams.Blocks object at 0x7fc71ff9a140> state_value='2022-10-10T04:00:00.000Z' max_cursor_time='2023-10-10T13:51:00.000Z' <- `max_cursor_time` represents the latest cursor value in the read thus far

most recent record_cursor: 2023-10-23T17:30:00.000Z
updated StateValueWrapper.max_cursor_time: 2023-10-23T17:30:00.000 <- this will keep updating and become the new value of state only if `self.is_finished` has been flipped

So like with the other Notion streams, it's functionally a full read operation that uses filtering on subsequent reads to prevent records from emitting. The issue exposed by the comprehensive_incremental_tests is that the value of state is only updated at the end of the read, rather than on each call to get_updated_state. Unfortunately, changing the logic to update the value of state after a slice is processed would be dangerous, since our client-side filtering logic would then start taking effect on the first read and exclude unsorted blocks records that have never been synced.

Ordinarily, the best solution would probably be to set up per-partition semi-incremental logic (like in the Comments stream) so we can update each slice's state as we go and aren't reliant on a first successful sync to set a value for stream_state. My one hesitation in this case is with the inevitable ballooning of the state object for this stream. For reference, our tiny sandbox account with just a few basic sample pages has over 500 blocks records, any number of which could be a potential parent/slice for other blocks. I'm not certain what the memory limitations are for state objects, but I'm concerned about the potential scaling implications on larger Notion accounts, and unfortunately no matter what we do we can't change the fact that there's no server-side filtering we can use to limit the number of requests per sync.

Any thoughts on how to handle this? I'm tempted to say "working as intended!" for now since this is how the connector has always worked, but I'm admittedly biased by a desire to have this ready to ship by Monday 😅

Comment:

I assume the mention should go to @clnoll

Contributor Author:

Whoops haha yes, so sorry!

Contributor:

Wow, thank you for the detailed explanation! This is super helpful.

The issue exposed by the comprehensive_incremental_tests is that the value of state is only updated at the end of the read, rather than on each call to get_updated_state.

It sounds like the main side effect is that we may end up syncing some records more than necessary. Is that right?

In any case I agree with you about the best solution, but also that since this is how it's always worked, we can leave it as-is for now. Would you mind creating an issue for this, in case we decide we want to follow up on it?

Contributor Author:

Yup, that sounds right: if we get halfway through a first sync and there's an interruption, any records that were successfully synced will still be emitted on the subsequent attempt. Put the issue together here: https://github.com/airbytehq/airbyte/issues/36698, just a copy/paste of the investigation to start but I'll do some grooming 👍

Contributor:

@ChristoGrab, is it possible to delete the Blocks stream from the catalog used during incremental tests, or will you enable the tests when they are fixed? #36814

@octavia-squidington-iv octavia-squidington-iv requested a review from a team March 25, 2024 14:50
@ChristoGrab (Contributor Author) commented:

@clnoll Thanks for the thorough review! I'll address the main block of comments one at a time here:

The pages and databases incremental logic changed from a custom semi-incremental implementation to the Data Feed low-code implementation. This means the requests to these streams are now in descending rather than ascending order. This should have no effect from a user perspective.

The one effect that I'm wondering about is how we'll pick up where we left off if the sync is interrupted.
Since the DataFeed implementation already does this I think we’re okay, but I’m curious whether you’re aware of a special reason that we were syncing this stream in ascending order previously.

The previous behavior was somewhat hacky, since the endpoint is very limited when it comes to sorting/filtering, and only allows us to specify the sort order and filter by type: page/database. So "incremental" is a bit of a cheeky misnomer, as every read is essentially a full-refresh operation that simply uses client-side filtering to determine which records to emit. I checked the original PR to see if there was any indication of why the order was set to ascending, but didn't see anything pertinent. My naive guess is that the developer picked it as a seemingly sensible default rather than as a deliberate design choice, given that the existing logic would always request all records regardless of the query order and state value. Sorry I don't have a more certain answer, but on the plus side this means the migration actually has a positive impact for these streams that I failed to note: we now have an actual stop condition that ends the requests once the appropriate cursor value has been reached!
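
To make the stop-condition point concrete, here is a hedged sketch of what a data-feed-style read looks like in principle, with a hypothetical fetch_page helper standing in for the low-code internals:

def read_descending_with_stop(fetch_page, stream_state):
    # Records arrive sorted by last_edited_time descending, so we can stop
    # paginating as soon as we see a record older than the saved cursor.
    # The old ascending, client-side-filtered approach had to read everything.
    cursor = stream_state.get("last_edited_time", "")
    next_cursor = None
    while True:
        records, next_cursor = fetch_page(next_cursor)  # hypothetical paginated fetch
        for record in records:
            if record["last_edited_time"] <= cursor:
                return  # stop condition: everything older has already been synced
            yield record
        if not next_cursor:
            return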

@ChristoGrab (Contributor Author) commented Mar 27, 2024

Retry logic no longer reduces page size on repeated requests for low-code streams. My inclination is this is not a big functionality loss, as the blocks stream still retains the original behavior, and it is by far the biggest data stream for this connector (everything in Notion is a block, so the vast majority of records will always come from this stream).
This suggests to me that we should regression test against a connection that requires retries.

Can we verify e.g. from logs, that it’s only the blocks stream that was hitting this?

Unfortunately my limited access means I can't view customer connections/logs to verify this (Maxime is looking into getting my access upgraded), but I can at least point to the original oncall issue that led to this functionality being added. For some context, the issue was related to frequent 504 errors plaguing an unusually large Notion workspace (~10k pages), a problem exacerbated by unspecified changes going on in the Notion API at the time. Sorry, I know that's not much to go on; if this is a point of concern I can certainly take a stab at implementing the logic as a CustomBackoffStrategy for the low-code streams as well!

@ChristoGrab (Contributor Author) commented Mar 27, 2024

Removed the call to transform the records read in the parent page stream when fetching slice ids for use in the blocks stream. Transformations are still applied to the records when syncing the low-code pages stream.

Is there any external impact as a result of this?

This change should provide a minor boost to performance when syncing the Blocks stream, since we're no longer applying transformations during the parent read_records call and can therefore process those records a tad more efficiently. Aside from that there's no impact; the only value we need from this parent call is the id, which was untouched by the transformation.

@ChristoGrab (Contributor Author) commented Mar 27, 2024

The production comments stream currently adds the parent page record's cursor field as a page_last_edited_time field to each comment record, which is used as the cursor field for semi-incremental syncs. This means all comments within a parent page's partition share the same cursor value. This has been updated to use per-partition state.

Was this necessary for the migration? Can we still handle the old state?

In theory, it would be possible to reproduce the previous behavior by adding a CustomPartitionRouter component. However, although this is a breaking change, I THINK (happy to be wrong here) that it's worth it because it's an improved pattern. The existing Python implementation is hacky: the timestamps on the Comments records are essentially ignored, meaning we always emit every record in a page if the page's cursor value exceeds the value in state. So given this scenario:

State value: page_last_edited_time = 2

Page 1: last_edited_time = 1
  Comment 1: last_edited_time = 1, page_last_edited_time = 1
  Comment 2: last_edited_time = 5, page_last_edited_time = 1
  Comment 3: last_edited_time = 3, page_last_edited_time = 1

Page 2: last_edited_time = 3
  Comment 4: last_edited_time = 1, page_last_edited_time = 3
  Comment 5: last_edited_time = 6, page_last_edited_time = 3
  Comment 6: last_edited_time = 2, page_last_edited_time = 3

In the current implementation, we will emit all comments in Page 2 and ignore all comments in Page 1, because each comment's own last_edited_time is ignored in favor of the appended page_last_edited_time. By switching to per-partition state, we ensure that we are actually tracking and comparing each comment record's timestamp against the state value for that individual page/partition.
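
To illustrate the difference with the scenario above, here is a small sketch using hypothetical helpers (not the manifest itself): the old behavior filters whole pages by the appended page_last_edited_time, while per-partition state compares each comment's own last_edited_time against the state stored for its page:

def old_page_level_filter(pages, state):
    # Old behavior: comments inherit their page's cursor, so whole pages pass or fail together.
    threshold = state.get("page_last_edited_time")
    for page in pages:
        if threshold is None or page["last_edited_time"] > threshold:
            yield from page["comments"]  # Comments 4-6 all emitted, Comments 1-3 all skipped

def new_per_partition_filter(pages, partition_state):
    # New behavior: each comment's own timestamp is compared to its page's stored state.
    for page in pages:
        threshold = partition_state.get(page["id"])
        for comment in page["comments"]:
            if threshold is None or comment["last_edited_time"] > threshold:
                yield comment  # e.g. Comment 2 is now emitted even though Page 1 is "old"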

@clnoll (Contributor) commented Mar 28, 2024

Thanks very much for the clarifications, @ChristoGrab! All makes sense to me.

Let me know the results of your investigation into the skip_comprehensive_incremental_tests failures for the Blocks stream. Happy to take a look at any changes that have come out of it.

@octavia-squidington-iv octavia-squidington-iv requested a review from a team March 29, 2024 18:33
@octavia-squidington-iv octavia-squidington-iv requested a review from a team April 1, 2024 14:36
@ChristoGrab ChristoGrab requested a review from katmarkham April 1, 2024 14:48
@clnoll (Contributor) left a comment:

🎉

@lazebnyi lazebnyi requested a review from tolik0 April 5, 2024 15:49
@tolik0 (Contributor) commented Apr 10, 2024

@ChristoGrab Hi, have you tested these changes using the regression tool?

@tolik0 tolik0 force-pushed the christo/notion-low-code branch from c64082b to 6992880 on April 12, 2024 12:34
@tolik0 (Contributor) commented Apr 12, 2024

Approved. Issue for refactoring improvements: https://github.com/airbytehq/airbyte-internal-issues/issues/7311

@lazebnyi lazebnyi merged commit 98478f0 into master Apr 15, 2024
29 checks passed
@lazebnyi lazebnyi deleted the christo/notion-low-code branch April 15, 2024 16:01