[RFR for API Sources] New Python interfaces to support resumable full refresh #37429

brianjlai · 2024-04-19T07:41:12Z

What

This PR primarily updates how the Stream class performs a read. The biggest change is in how we decide which partition of records to retrieve next. Because resumable full refresh (RFR) operates on an unbounded set of pages, unlike incremental syncs with partitioned time windows, we need to change how we determine the next slice.

The primary changes to the flow at a high level are:

Consolidates a lot of the branching logic to be agnostic of the incoming catalog's sync mode, incorporating parts of Ella's changes in Synchronous python CDK: make sync mode agnostic #36999
Instead of stream_slices() being the mechanism for determining the next partition of records to retrieve, I've introduced the concept of a CheckpointReader, whose type is instantiated based on the Stream's implementation
The reader interprets state and passes it back to the ConnectorStateManager by observing the stream's current state.
The CDK is now "state aware", meaning that it actually reads stream state instead of just passing it back to the platform like a black box. I'll go into this in more detail below.

How

Some of the major design changes in the review are:

The new CheckpointReader class, which is now the main way a stream determines the next partition of values to read. For incremental, this continues to be partitions like time windows. For RFR, this is the next page of records, and it can also be parent records for substreams or a single slice {}.
Deprecating IncrementalMixin in favor of StateMixin, since state is also used by RFR streams, which are not incremental. StateMixin is the better name, but I kept the old one for backwards compatibility.
Changing the default Stream.stream_slices() implementation from [None] to [{}]. None is now the indicator to stop iterating over slices.
Adding the supports_checkpointing field to streams. It's needed for two reasons: we need to surface this value to the catalog, and we need it to be overridable because declarative low-code sources delineate checkpointing differently from Python sources.
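
To make the reader concept above concrete, here is a minimal sketch of how such an interface could look. It is not the code in checkpoint_reader.py: `next()` and `get_checkpoint()` are named in this PR, but the `observe()` method name and the two concrete classes (`SliceCheckpointReader`, `PageCheckpointReader`) are assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from typing import Any, Iterable, Mapping, Optional


class CheckpointReader(ABC):
    """Hypothetical sketch of the reader interface, not the CDK's actual code."""

    @abstractmethod
    def next(self) -> Optional[Mapping[str, Any]]:
        """Return the next partition to read, or None to stop iterating."""

    @abstractmethod
    def observe(self, new_state: Mapping[str, Any]) -> None:
        """Record the stream's latest state (method name is an assumption)."""

    @abstractmethod
    def get_checkpoint(self) -> Optional[Mapping[str, Any]]:
        """Return the state to hand back to the state manager."""


class SliceCheckpointReader(CheckpointReader):
    """Incremental-style reader: partitions come from stream_slices()."""

    def __init__(self, stream_slices: Iterable[Mapping[str, Any]]) -> None:
        self._slices = iter(stream_slices)
        self._state: Optional[Mapping[str, Any]] = None

    def next(self) -> Optional[Mapping[str, Any]]:
        # An exhausted iterator yields None, which tells the caller to stop.
        return next(self._slices, None)

    def observe(self, new_state: Mapping[str, Any]) -> None:
        self._state = new_state

    def get_checkpoint(self) -> Optional[Mapping[str, Any]]:
        return self._state


class PageCheckpointReader(CheckpointReader):
    """RFR-style reader: the next partition is the page stored in state."""

    def __init__(self, stream_state: Optional[Mapping[str, Any]] = None) -> None:
        self._state: Mapping[str, Any] = stream_state or {}
        self._first_call = True

    def next(self) -> Optional[Mapping[str, Any]]:
        if self._first_call:
            # {} on a fresh sync, or the saved page when resuming.
            self._first_call = False
            return self._state
        # After the first page, an empty state means there are no more pages.
        return self._state if self._state else None

    def observe(self, new_state: Mapping[str, Any]) -> None:
        self._state = new_state

    def get_checkpoint(self) -> Optional[Mapping[str, Any]]:
        return self._state or {}
```

With the default slices now being [{}], the slice-based reader in this sketch returns {} once and then None, so a stream without custom slices is read exactly once.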

This PR does not implement the work for a substream that requires resumable full refresh. I have sketched out an interface to see if it's possible, which it does appear to be. But substream state management for RFR becomes quite convoluted due to the issue I'll go into below.

Making the CDK and `Stream` class read directly from the connector managed state

This is arguably the most controversial DX and design change compared to before RFR, and it has some impact on connector developers. For RFR to function with the current read_records() method, we need some way of communicating state from the specific connector implementation back to the CDK. To do so, we expect the developer to manage Stream.state. With RFR, the CDK now reads that state to decide what to do next. For example:

if state:
{ "pageNumber": 23 }
then,
continue syncing.

if state:
{}
then,
stop syncing; there are no more pages.

This is a relatively simple example, but it does illustrate that the developer needs to know to emit {} to stop RFR paging. In hubspot, state is updated with self.state = self.next_page_token(response) or {}. It does not feel like an ideal precedent that a developer needs a general awareness of how the checkpoint reader works in order to implement RFR successfully, especially since the CheckpointReader is mostly an internal CDK implementation detail that developers shouldn't need to think about.
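
As a rough illustration of that contract from the developer's side, the sketch below uses a fake three-page API. The `ContactsStream` class, the `PAGES` data and the `sync()` loop are all hypothetical, and `next_page_token()` takes a page number here rather than an HTTP response to keep the example self-contained. Only the `self.state = ... or {}` pattern comes from this PR.

```python
from typing import Any, Dict, Iterable, List, Mapping, MutableMapping, Optional

# Fake three-page API used only for this illustration.
PAGES: Dict[int, List[Dict[str, Any]]] = {
    1: [{"id": 1}, {"id": 2}],
    2: [{"id": 3}],
    3: [{"id": 4}],
}


class ContactsStream:
    """Hypothetical RFR stream; only the state handling mirrors the PR."""

    def __init__(self) -> None:
        self._state: MutableMapping[str, Any] = {}

    @property
    def state(self) -> MutableMapping[str, Any]:
        return self._state

    @state.setter
    def state(self, value: MutableMapping[str, Any]) -> None:
        self._state = value

    def next_page_token(self, page: int) -> Optional[Dict[str, Any]]:
        # Simplified: the real method would inspect an HTTP response.
        return {"pageNumber": page + 1} if page + 1 in PAGES else None

    def read_records(self, stream_slice: Mapping[str, Any]) -> Iterable[Mapping[str, Any]]:
        page = stream_slice.get("pageNumber", 1)
        yield from PAGES[page]
        # Emit {} once the last page is read so the caller knows to stop paging.
        self.state = self.next_page_token(page) or {}


def sync(stream: ContactsStream) -> List[Mapping[str, Any]]:
    """Stand-in for the CDK read loop: keep reading while state names a page."""
    records: List[Mapping[str, Any]] = []
    stream_slice: Optional[Mapping[str, Any]] = stream.state  # {} on a fresh sync
    while stream_slice is not None:
        records.extend(stream.read_records(stream_slice))
        stream_slice = stream.state or None  # {} -> None -> stop
    return records
```

A resumed sync in this sketch only needs the saved state: setting `stream.state = {"pageNumber": 3}` before calling `sync(stream)` reads the last page alone.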

As mentioned earlier, RFR substreams, which would need per-partition cursors, require much more careful reading of the state object for a specific structure. That is an indication that an unstructured map is not the right data type for state.

This only affects connectors built on the legacy Python CDK, since state is managed internally for low-code connectors. I think it's fair to say that, by the 80:20 rule, we are aiming for a more ideal interface for low-code, which makes up the majority of our connectors, rather than for legacy Python.

Alternative:

I also looked into incremental and RFR streams having different versions of the read_records() method. Incremental/full refresh streams would keep the normal read_records() -> iterable, and RFR streams would use read_records() -> iterable + updated stream state. This felt like a step in the wrong direction, however, because we would be moving back to two different runtime flows depending on sync type.
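
A small sketch of what that rejected alternative might have looked like, to show where the second runtime flow comes from. All function names and return values here are hypothetical and were never part of the CDK.

```python
from typing import Any, Iterable, List, Mapping, Tuple

Record = Mapping[str, Any]
State = Mapping[str, Any]


def read_records_incremental(stream_slice: Mapping[str, Any]) -> Iterable[Record]:
    """Existing shape: records only."""
    return [{"id": 1}]


def read_records_rfr(stream_slice: Mapping[str, Any]) -> Tuple[Iterable[Record], State]:
    """Rejected shape: records plus the updated stream state."""
    return [{"id": 1}], {"pageNumber": 2}


def read(stream_slice: Mapping[str, Any], is_rfr: bool) -> Tuple[List[Record], State]:
    # The caller has to branch on the stream type, which is the pair of
    # runtime flows the description wants to avoid.
    if is_rfr:
        records, state = read_records_rfr(stream_slice)
        return list(records), state
    return list(read_records_incremental(stream_slice)), {}
```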

A structured state class:

This feels like the more appropriate long-term solution: a state interface that handles interpreting state and explicitly communicates what to do with it and what the CDK should emit back to the platform. This is well outside what we can do for this project, but it is something I thought about.

Review guide

checkpoint_reader.py
core.py
declarative_stream.py
abstract_source.py
test files

User Impact

noop

Can this PR be safely reverted and rolled back?

YES 💚
NO ❌


vercel · 2024-04-19T07:41:19Z

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Ignored Deployment

Name	Status	Preview	Comments	Updated (UTC)
airbyte-docs	⬜️ Ignored (Inspect)	Visit Preview		May 6, 2024 9:35pm

brianjlai · 2024-04-19T07:47:15Z

airbyte-cdk/python/airbyte_cdk/sources/streams/core.py

@@ -45,10 +52,10 @@ def package_name_from_class(cls: object) -> str:
        raise ValueError(f"Could not find package name for class {cls}")


-class IncrementalMixin(ABC):
-    """Mixin to make stream incremental.
+class StateMixin(ABC):  # Rename to CheckpointMixin?


I renamed this mixin because the state setter/getter are no longer specific to incremental streams. To retain backwards compatibility, I left the old IncrementalMixin in place; it just inherits from this mixin.
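
A minimal sketch of that backwards-compatible rename, assuming the mixin only declares an abstract `state` getter and setter. `LegacyStream` is a hypothetical connector class added for illustration.

```python
from abc import ABC, abstractmethod
from typing import Any, MutableMapping


class StateMixin(ABC):
    """Sketch of the renamed mixin: any stream that tracks state."""

    @property
    @abstractmethod
    def state(self) -> MutableMapping[str, Any]:
        """Current state of the stream."""

    @state.setter
    @abstractmethod
    def state(self, value: MutableMapping[str, Any]) -> None:
        """Replace the state of the stream."""


class IncrementalMixin(StateMixin, ABC):
    """Old name kept for backwards compatibility; adds nothing of its own."""


class LegacyStream(IncrementalMixin):
    """Hypothetical existing connector that still uses the old mixin name."""

    def __init__(self) -> None:
        self._state: MutableMapping[str, Any] = {}

    @property
    def state(self) -> MutableMapping[str, Any]:
        return self._state

    @state.setter
    def state(self, value: MutableMapping[str, Any]) -> None:
        self._state = value
```

Because `IncrementalMixin` subclasses `StateMixin`, existing streams pass an `isinstance(stream, StateMixin)` check without any code change.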


brianjlai · 2024-04-24T09:19:36Z

airbyte-cdk/python/airbyte_cdk/sources/streams/checkpoint/checkpoint_reader.py

+        self._state = new_state
+
+    def get_checkpoint(self) -> Optional[Mapping[str, Any]]:
+        return self._state or {}


Because we call get_checkpoint() at the end of the last slice/page and again at the end of the sync, we end up emitting the same final state twice. We could insert more fields to track state internally within the reader, but I don't think it's worth the hassle.

brianjlai · 2024-04-24T09:20:59Z

airbyte-integrations/connectors/source-hubspot/source_hubspot/streams.py

+    def state(self, value: MutableMapping[str, Any]) -> None:
+        self._state = value
+
+    def read_records(


I left in the implementation that gets the 2 RFR streams working for Hubspot within the new interfaces, to illustrate how to adopt the changes in the CDK.

But before merging I'll remove these from the PR.


brianjlai · 2024-04-30T02:01:13Z

docs/connector-development/cdk-python/resumable-full-refresh-stream.md

@@ -0,0 +1,77 @@
+# Resumable Full Refresh Streams


@girarda fyi I added a docs page for RFR instead of inlining comments on certain methods etc. I still need to proofread the grammar and clean it up a little, but just a heads up since some PR comments called this out.

girarda · 2024-04-29T20:53:20Z

airbyte-cdk/python/airbyte_cdk/sources/streams/core.py

-
-        except AttributeError:
-            state_manager.update_state_for_stream(self.name, self.namespace, stream_state)
+        # todo: This can be consolidated into one ConnectorStateManager.update_and_create_state_message() method, but I want


do we need a follow up issue?

yep created it here https://github.com/airbytehq/airbyte-internal-issues/issues/7500

girarda · 2024-05-01T00:18:24Z

airbyte-cdk/python/airbyte_cdk/sources/streams/checkpoint/checkpoint_reader.py

+    @abstractmethod
+    def get_checkpoint(self) -> Optional[Mapping[str, Any]]:
+        """
+        Retrieves the current state value of the stream


can you add a comment explaining that None means we stop reading?

I think you meant this for CheckpointReader.next(), since None in that case stops parsing, but I'll also comment here that we don't emit state messages if the return value is None either.

girarda

this is great! approved pending reverting the changes to hubspot


brianjlai · 2024-05-06T22:42:55Z

I ran a few local connectors-ci runs of hubspot and a few low-code connectors to confirm that tests passed before the merge.

brianjlai added 3 commits April 8, 2024 18:21

initial checkpoint reader interface/implementations and sketching out…

45be2a9

… how they fit into Stream.read_records()

cleaning up and simplifying checkpoint reader interface

3aacffe

Merge branch 'master' into resumable_full_refresh_python_cdk_new_inte…

67582c2

…rfaces

octavia-squidington-iii added area/connectors Connector related issues CDK Connector Development Kit connectors/source/hubspot labels Apr 19, 2024

vercel bot deployed to Preview April 19, 2024 07:43 View deployment

brianjlai commented Apr 19, 2024

View reviewed changes

clean up interfaces, fix tests, and add new tests across all layers

f0af212

brianjlai requested review from erohmensing and girarda April 19, 2024 07:57

brianjlai added 5 commits April 21, 2024 22:43

fix ci issues and clean out the code a bit

2d69903

format

5484d39

break apart cdk mock server tests into smaller test files and move he…

45ce52d

…lper methods

better incremental state management code and more mock server tests

5f488ab

Merge branch 'master' into resumable_full_refresh_python_cdk_new_inte…

2169498

…rfaces

vercel bot deployed to Preview April 23, 2024 05:55 View deployment

brianjlai added 4 commits April 23, 2024 10:53

remove sync_mode from checkpointer and stream read logic

6c2d596

remove get_final_checkpoint and update tests and simplify Stream read…

f60a0b7

… code

fix mypy

1b18edb

Merge branch 'master' into resumable_full_refresh_python_cdk_new_inte…

514fb3f

…rfaces

brianjlai commented Apr 24, 2024

View reviewed changes

brianjlai marked this pull request as ready for review April 24, 2024 09:24

brianjlai requested review from lazebnyi and oustynova as code owners April 24, 2024 09:24

brianjlai requested a review from a team April 24, 2024 09:24

octavia-squidington-iv requested a review from a team April 24, 2024 09:25

fix tests and when instantiating checkpoint reader use self.state get…

a39bcc6

…ter if it exists

Merge branch 'master' into resumable_full_refresh_python_cdk_new_inte…

28564e4

…rfaces

octavia-squidington-iii added the area/documentation Improvements or additions to documentation label Apr 30, 2024

brianjlai commented Apr 30, 2024

View reviewed changes

vercel bot deployed to Preview April 30, 2024 02:03 View deployment

fix tests and polish documentation a bit

94bc5be

vercel bot deployed to Preview April 30, 2024 20:18 View deployment

girarda reviewed May 1, 2024

View reviewed changes

octavia-squidington-iv requested a review from a team May 1, 2024 00:32

pr feedback and fix bug with empty {} for declarative streams

8ddae37

brianjlai mentioned this pull request May 2, 2024

Update the python connector template to use StateMixin instead of get_updated_state #37745

Closed

1 task

girarda approved these changes May 3, 2024

View reviewed changes

brianjlai added 2 commits May 3, 2024 13:33

adding some code to trigger full ci run for a few test connectors and…

52b1a5e

… fix test name

Merge branch 'master' into resumable_full_refresh_python_cdk_new_inte…

e0e88fb

…rfaces

brianjlai requested a review from a team May 3, 2024 20:36

octavia-squidington-iii added the connectors/source/freshdesk label May 3, 2024

vercel bot deployed to Preview May 3, 2024 20:40 View deployment

remove connector code after running tests against a few different con…

e976383

…nectors

octavia-squidington-iii removed the area/connectors Connector related issues label May 4, 2024

brianjlai removed connectors/source/hubspot connectors/source/freshdesk labels May 4, 2024

brianjlai added 2 commits May 6, 2024 13:53

last small changes before merge

0af07eb

Merge branch 'master' into resumable_full_refresh_python_cdk_new_inte…

8bda880

…rfaces

vercel bot deployed to Preview May 6, 2024 20:58 View deployment

format -_-

87fd05e

brianjlai merged commit d74125b into master May 6, 2024

brianjlai deleted the resumable_full_refresh_python_cdk_new_interfaces branch May 6, 2024 22:41

brianjlai mentioned this pull request May 6, 2024

Synchronous python CDK: make sync mode agnostic #36999

Closed

szemek mentioned this pull request May 11, 2025

[docs] Update Resumable Full Refresh Streams docs to use CheckpointMixin instead of StateMixin #60212

Merged

2 tasks
