[ISSUE #7674] contacts_list_memberships incremental using client filtering #38128

maxi297 · 2024-05-10T20:05:15Z

What

Addressing https://github.com/airbytehq/airbyte-internal-issues/issues/7674

A customer of ours is using this stream, but it emits too many records, leading to some very expensive syncs. We don't have a good way to do server-side filtering, so we will use client-side filtering for now.

Deletes are not required for this.

How

Having ContactsListMemberships inherit from ClientSideIncrementalStream
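
To illustrate the idea, here is a minimal sketch of client-side incremental filtering. It is not the connector's ClientSideIncrementalStream implementation: the function name filter_records_client_side and the sample records are hypothetical, and the cursor field "timestamp" is taken from the discover output in this PR. The API still returns every record; only the ones at or after the saved state are emitted.

```python
from typing import Any, Iterable, Iterator, Mapping, Optional


def filter_records_client_side(
    records: Iterable[Mapping[str, Any]],
    stream_state: Optional[Mapping[str, Any]],
    cursor_field: str = "timestamp",
) -> Iterator[Mapping[str, Any]]:
    """Yield only records whose cursor value is at or after the saved state.

    Hypothetical helper: all records are fetched first, then filtered locally.
    """
    # No state yet (first sync or full refresh): emit everything.
    if not stream_state or cursor_field not in stream_state:
        yield from records
        return

    # State may be stored as a string, so normalize to int before comparing.
    state_value = int(stream_state[cursor_field])
    for record in records:
        if int(record[cursor_field]) >= state_value:
            yield record


# Made-up records for illustration only.
fetched = [
    {"canonical-vid": 1, "timestamp": 1714412405829},
    {"canonical-vid": 2, "timestamp": 1714412405830},
    {"canonical-vid": 3, "timestamp": 1714412405831},
]
emitted = list(filter_records_client_side(fetched, {"timestamp": "1714412405830"}))
print([r["canonical-vid"] for r in emitted])  # prints [2, 3]
```

Because the filtering happens after the fetch, the number of API calls does not change; only the number of emitted records does.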

Review guide

airbyte-integrations/connectors/source-hubspot/source_hubspot/streams.py
airbyte-integrations/connectors/source-hubspot/unit_tests/integrations/test_contacts_list_memberships.py
The rest

User Impact

The stream now supports incremental syncs. Running python main.py discover --config secrets/config.json yields:

{
    "type": "CATALOG",
    "catalog": {
        "streams": [
            <...>
            {
                "name": "contacts_list_memberships",
                "json_schema": <...>
                "supported_sync_modes": [
                    "full_refresh",
                    "incremental"
                ],
                "source_defined_cursor": true,
                "default_cursor_field": [
                    "timestamp"
                ],
                "source_defined_primary_key": [
                    [
                        "canonical-vid"
                    ]
                ]
            }
            <...>
        ]
    }
}

Using main.py read --config secrets/config.json --catalog sample_files/test_catalog.json --debug

test_catalog.json

{
  "streams": [
    {
      "stream": {
        "name": "contacts_list_memberships",
        "json_schema": {},
        "supported_sync_modes": ["full_refresh", "incremental"]
      },
      "sync_mode": "full_refresh",
      "destination_sync_mode": "overwrite"
    }
  ]
}

I get {"type": "LOG", "log": {"level": "INFO", "message": "Read 1025 records from contacts_list_memberships stream"}}

Running the same thing with state { "timestamp": "1714412405830" }, I get {"type": "LOG", "log": {"level": "INFO", "message": "Read 3 records from contacts_list_memberships stream"}}.

Can this PR be safely reverted and rolled back?

YES 💚
NO ❌


vercel · 2024-05-10T20:05:27Z

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Ignored Deployment

Name	Status	Preview	Comments	Updated (UTC)
airbyte-docs	⬜️ Ignored (Inspect)	Visit Preview		May 13, 2024 0:22am

maxi297 · 2024-05-10T20:08:24Z

airbyte-integrations/connectors/source-hubspot/source_hubspot/streams.py

@@ -891,7 +891,7 @@ def filter_by_state(self, stream_state: Mapping[str, Any] = None, record: Mappin
        # save the state
        self.state = {self.cursor_field: int(max_state) if int_field_type else max_state}
        # emmit record if it has bigger cursor value compare to the state (`True` only)
-        return record_value > state_value
+        return record_value >= state_value


This is very much an edge case (a record whose cursor value equals the saved state would otherwise be dropped), but we had a similar discussion for Salesforce here
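
A small sketch of the edge case the diff above addresses, under the assumption that two records can share the same cursor value and that one of them only becomes visible after the state was saved. The helper emitted_ids and the sample data are hypothetical and only mirror the comparison changed in filter_by_state.

```python
import operator
from typing import Any, Callable, List, Mapping


def emitted_ids(
    records: List[Mapping[str, Any]],
    state_value: int,
    compare: Callable[[int, int], bool],
) -> List[int]:
    """Return the ids of records that pass the cursor comparison."""
    return [r["canonical-vid"] for r in records if compare(r["timestamp"], state_value)]


# State saved by a previous sync after emitting record 1.
state_after_first_sync = 100

# Made-up records seen by the next sync.
second_sync = [
    {"canonical-vid": 1, "timestamp": 100},  # already emitted by the previous sync
    {"canonical-vid": 2, "timestamp": 100},  # same cursor value, appeared later
    {"canonical-vid": 3, "timestamp": 101},
]

strict = emitted_ids(second_sync, state_after_first_sync, operator.gt)
inclusive = emitted_ids(second_sync, state_after_first_sync, operator.ge)

print(strict)     # prints [3]
print(inclusive)  # prints [1, 2, 3]
```

With the strict comparison, record 2 is never emitted. The inclusive comparison emits it at the cost of re-emitting record 1, which a destination can deduplicate on the primary key.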

maxi297 · 2024-05-10T20:08:59Z

airbyte-integrations/connectors/source-hubspot/source_hubspot/streams.py

@@ -1369,6 +1369,7 @@ class ContactsAllBase(Stream):
    page_filter = "vidOffset"
    page_field = "vid-offset"
    primary_key = "canonical-vid"
+    limit_field = "count"


The previous implementation was using limit instead of count. This should have no impact, as we pass the default value, which is 100.
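
As a rough illustration of what limit_field controls, here is a sketch that assumes the base stream uses the class attributes limit_field and page_filter as query parameter names. The function build_page_params is hypothetical and is not part of the connector.

```python
from typing import Any, Dict, Optional


def build_page_params(
    limit_field: str,
    limit: int = 100,
    page_filter: str = "vidOffset",
    offset: Optional[int] = None,
) -> Dict[str, Any]:
    """Build query parameters for one page request (hypothetical helper)."""
    # The page size is sent under whatever name limit_field holds.
    params: Dict[str, Any] = {limit_field: limit}
    # The offset is only sent once a previous page has returned one.
    if offset is not None:
        params[page_filter] = offset
    return params


print(build_page_params("limit"))              # prints {'limit': 100}
print(build_page_params("count"))              # prints {'count': 100}
print(build_page_params("count", offset=250))  # prints {'count': 100, 'vidOffset': 250}
```

If the endpoint ignores an unrecognized limit parameter and falls back to the same page size of 100, switching the name to count changes the request without changing the result, which matches the "no impact" expectation above.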

bazarnov

General question: did we gain any speed boost out of this change?

airbyte-integrations/connectors/source-hubspot/source_hubspot/streams.py

maxi297 · 2024-05-10T22:00:19Z

General question: did we gain any speed boost out of this change?

@bazarnov The goal was not a speed boost, and I expect there would be none in this case. The goal is to emit only the records the customer needs, as some customers have too much data on full_refresh, which makes their syncs costly.

brianjlai

Changes to contacts_list_memberships seem reasonable, but I also think that we might as well port contacts_form_submissions and contacts_merged_audit to the same functionality, because all three can be expressed as semi-incremental.

Looking at the graph you mentioned in the RFR PR here, it sounds like the biggest issue with these streams is the cost associated with a full refresh load to the destination warehouse on every attempt.

If so, I'm okay with just making all of them semi-incremental instead of RFR, since that solves the cost issue and reliability is already high enough.

airbyte-integrations/connectors/source-hubspot/source_hubspot/streams.py

maxi297 · 2024-05-13T12:21:39Z

@brianjlai you are right that it would have other benefits (i.e. the stream would be smaller and would therefore put less load on the destination) and that other streams could benefit from this as well.

In terms of sequencing, I would release this change so that our client can leverage it, and we can make the contacts_form_submissions and contacts_merged_audit changes as part of the RFR project. If that works for you in terms of timing, the Critical Connectors team can take it once we get to the RFR project.

[ISSUE #7674] contacts_list_memberships incremental using client filtering

e7bbbfa

maxi297 requested review from a team and brianjlai May 10, 2024 20:05

octavia-squidington-iii added area/connectors Connector related issues connectors/source/hubspot labels May 10, 2024

maxi297 commented May 10, 2024


maxi297 requested a review from bazarnov May 10, 2024 20:09

Update release information and format

9d2892c

octavia-squidington-iii added the area/documentation Improvements or additions to documentation label May 10, 2024

vercel bot deployed to Preview May 10, 2024 20:17 View deployment

bazarnov reviewed May 10, 2024


brianjlai reviewed May 10, 2024


bazarnov approved these changes May 11, 2024


Code review

a3af996

maxi297 merged commit 1855408 into master May 13, 2024

maxi297 deleted the issue-7674/hubspot-contacts-list-membership-client-side-filtering-incremental branch May 13, 2024 12:47
