Add a walkthrough of building a custom Python connector #36743

Merged
13 commits merged into master on Apr 9, 2024

Conversation

girarda
Contributor

@girarda girarda commented Apr 2, 2024

What

Write a Python source connector development walkthrough covering

  • reading a page of data
  • pagination
  • incremental
  • integration tests
  • rate limits
  • concurrency

This can replace two old tutorials


vercel bot commented Apr 2, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

| Name | Status | Updated (UTC) |
| --- | --- | --- |
| airbyte-docs | ✅ Ready | Apr 9, 2024 3:33am |

@octavia-squidington-iii octavia-squidington-iii added the area/documentation Improvements or additions to documentation label Apr 2, 2024
@natikgadzhi
Contributor

Spicy take: should this replace one of the existing tutorials? I don't like having to maintain three.

@@ -58,7 +58,7 @@ and choose the relevant template by using the arrow keys. This will generate a n
Search the generated directory for "TODO"s and follow them to implement your connector. For more detailed walkthroughs and instructions, follow the relevant tutorial:

 - [Speedrun: Building a HTTP source with the CDK](tutorials/cdk-speedrun.md)
-- [Building a HTTP source with the CDK](tutorials/cdk-tutorial-python-http/getting-started.md)
+- [Building a HTTP source with the CDK](tutorials/custom-python-connector/0-getting-started.md)
Contributor Author

update the path because the cdk-tutorial-python-http is deleted

Contributor

Is it deleted in this PR? Make sure that you add redirects for all URLs that you delete, pointing to the new guide. That way we don't lose search traffic.

@@ -50,7 +50,7 @@ cd airbyte-integrations/connector-templates/generator
Next, find all `TODO`s in the generated project directory. They're accompanied by comments explaining what you'll
need to do in order to implement your connector. Upon completing all TODOs properly, you should have a functioning connector.

-Additionally, you can follow [this tutorial](../tutorials/cdk-tutorial-python-http/getting-started.md) for a complete walkthrough of creating an HTTP connector using the Airbyte CDK.
+Additionally, you can follow [this tutorial](../tutorials/custom-python-connector/0-getting-started.md) for a complete walkthrough of creating an HTTP connector using the Airbyte CDK.
Contributor Author

update the path because the cdk-tutorial-python-http is deleted

@@ -72,7 +72,7 @@ Airbyte recommends using the CDK template generator to develop with the CDK. The

For tips on useful Python knowledge, see the [Python Concepts](python-concepts.md) page.

-You can find a complete tutorial for implementing an HTTP source connector in [this tutorial](../tutorials/cdk-tutorial-python-http/getting-started.md)
+You can find a complete tutorial for implementing an HTTP source connector in [this tutorial](../tutorials/custom-python-connector/0-getting-started.md)
Contributor Author

update the path because the cdk-tutorial-python-http is deleted

@@ -26,7 +26,7 @@ See the [catalog guide](https://docs.airbyte.com/understanding-airbyte/beginners

Let's define the stream schema in `source-exchange-rates-tutorial/source_exchange_rates_tutorial/schemas/rates.json`

-You can download the JSON file describing the output schema with all currencies [here](../../tutorials/cdk-tutorial-python-http/exchange_rates_schema.json) for convenience and place it in `schemas/`.
+You can download the JSON file describing the output schema with all currencies [here](./exchange_rates_schema.json) for convenience and place it in `schemas/`.
Contributor Author

update the path because the cdk-tutorial-python-http is deleted

@@ -35,7 +35,7 @@ airbyte-ci connectors --use-remote-secrets=false --name source-exchange-rates-tu

## Next steps:

-Next, we'll add the connector to the [Airbyte platform](https://docs.airbyte.com/connector-development/tutorials/cdk-tutorial-python-http/use-connector-in-airbyte).
+Next, we'll add the connector to the [Airbyte platform](https://docs.airbyte.com/operator-guides/using-custom-connectors).
Contributor Author

I realized there's a better guide to using custom connectors

```
rm unit_tests/test_incremental_streams.py unit_tests/test_source.py unit_tests/test_streams.py
```
Replace the content of `airbyte-integrations/connectors/source-survey-monkey-demo/source_survey_monkey_demo/source.py` with the following template:
Contributor Author

I'm really not a fan of the code in the template. We can do a better job of encouraging composition instead of inheritance


We'll do this by trying to read a single record from the stream, and fail if the connector could not read any.
```python
def check_connection(self, logger, config) -> Tuple[bool, any]:
```
Contributor Author

I feel like the template should provide a sensible implementation by default instead of returning True, None. it's pretty useless
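
To make the "read a single record" check concrete, here is a minimal standalone sketch of what a more useful default could look like. It does not depend on the Airbyte CDK: `read_records` is a hypothetical callable standing in for the stream's read method, and the error messages are illustrative only.

```python
from typing import Any, Callable, Iterable, Mapping, Optional, Tuple


def check_connection(
    read_records: Callable[[], Iterable[Mapping[str, Any]]],
) -> Tuple[bool, Optional[str]]:
    """Return (True, None) if one record can be read, else (False, reason).

    `read_records` is a hypothetical stand-in for the stream's read method.
    """
    try:
        # Pull at most one record; next() stops consuming after the first one.
        first_record = next(iter(read_records()), None)
    except Exception as error:  # e.g. bad credentials or a network failure
        return False, f"Unable to read from the stream: {error}"
    if first_record is None:
        return False, "The stream returned no records."
    return True, None
```

In a real connector the same logic would live in the source's `check_connection` method and read from one of its streams; the sketch only shows the control flow.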


The `get_updated_state` method is used to update the stream's state. We'll set its value to the maximum between the current state's value and the value extracted from the record.
```python
def get_updated_state(self, current_stream_state: MutableMapping[str, Any], latest_record: Mapping[str, Any]) -> Mapping[str, Any]:
```
Contributor Author

I believe this won't be part of the best practices anymore since RFR won't work with connectors that use get_updated_state
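
For reference, the "maximum between the current state's value and the value extracted from the record" logic can be sketched as a standalone function. This is an illustration only: the cursor field name `date_modified` is an assumption, and the comparison assumes cursor values are ISO 8601 strings in a single consistent format, so that string order matches chronological order.

```python
from typing import Any, Mapping

# Hypothetical cursor field; the real connector's field may differ.
CURSOR_FIELD = "date_modified"


def get_updated_state(
    current_stream_state: Mapping[str, Any],
    latest_record: Mapping[str, Any],
) -> Mapping[str, Any]:
    """Keep the larger of the stored cursor value and the record's value."""
    latest_value = latest_record[CURSOR_FIELD]
    current_value = current_stream_state.get(CURSOR_FIELD)
    if current_value is None:
        # First record of the first sync: no previous state to compare with.
        return {CURSOR_FIELD: latest_value}
    return {CURSOR_FIELD: max(current_value, latest_value)}
```

Taking the maximum means an out-of-order record never moves the cursor backwards.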


Let's update the source. The bulk of the change is changing its parent class to `ConcurrentSourceAdapter`, and updating its `__init__` method so it's properly initialized. This requires a little bit of boilerplate:
```python
class SourceSurveyMonkeyDemo(ConcurrentSourceAdapter):
```
Contributor Author

I find the need for the adapter to be pretty sad. I think we're pretty close to being able to recommend using the concurrent cdk interfaces from the get go. We're mostly only missing the concept of a PaginatedRequester.

We did a PoC with Stripe back in October, but never pulled the trigger because the connector had too many issues

```python
yield parent_record
```

This can be solved by implementing the connector using constructs from the concurrent CDK directly instead of wrapping synchronous streams in an adapter. This is left outside of the scope of this tutorial because no production connectors currently implement this.
Contributor Author

this is a very unfortunate broken window. We never got to productionizing support for concurrent substreams.

@girarda girarda marked this pull request as ready for review April 5, 2024 21:12
@girarda
Contributor Author

girarda commented Apr 5, 2024

The build is failing on Vercel but passing locally. Not sure what the issue is.

@girarda girarda requested a review from a team April 8, 2024 13:37
Contributor

@natikgadzhi natikgadzhi left a comment

In general, this is great — this will be an epic onboarding guide for Cristina!

A few things to work on:

  • Formatting
  • Make sure links we delete redirect to this guide
  • Make sure to delete speedrun as well
  • Make issues for further improvements

I will sit down with this later today to get a few changes in and debug why Vercel did not work.

@@ -3,7 +3,7 @@
## CDK Speedrun \(HTTP API Source Creation Any Route\)

This is a blazing fast guide to building an HTTP source connector. Think of it as the TL;DR version
-of [this tutorial.](cdk-tutorial-python-http/getting-started.md)
+of [this tutorial.](custom-python-connector/0-getting-started.md)

If you are a visual learner and want to see a video version of this guide going over each part in
Contributor

We should urge people to use Connector Builder and low-code for most connectors.

I would say, let's delete this guide as well and point it to the new one. Perhaps we can add a little note on the top of it to direct some users to lowcode?

# Getting started
This tutorial will walk you through the creation of a custom Airbyte connector implemented with the Python CDK. This tutorial assumes you're already familiar with Airbyte concepts and you've already built a connector using the [Connector Builder](../../connector-builder-ui/tutorial.mdx).

The Python CDK should be used to implement connectors that require features that are not yet available in the Connector Builder or in the low-code framework. You can use the [Connector Builder compatibility guide](../../connector-builder-ui/connector-builder-compatibility.md) to know whether it is suitable for your needs.
Contributor

<3

The two approaches are equivalent for the Survey Monkey API, but as a rule of thumb, it is preferable to use the links provided by the API when they are available instead of reverse engineering the mechanism. This way, we don't need to modify the connector if the API changes its pagination mechanism, for instance, if it moves to server-side pagination.

:::info
When available, server-side pagination should be preferred over client-side pagination because it carries a lower risk of missing records if the collection is modified while the connector iterates.
Contributor

It's not super clear to junior devs what server-side vs client-side pagination is, and you talk about it quite a bit. Server-side: links provided by the API. Client-side: manually changing page_number or offset.
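
A small sketch may help make the distinction concrete. Both helpers below are hypothetical and independent of the CDK. They assume a response body with a `data` list and an optional `links.next` URL, which is an assumption about the payload shape rather than a documented contract.

```python
from typing import Any, Mapping, Optional


def next_page_from_link(response_body: Mapping[str, Any]) -> Optional[str]:
    """Server-side pagination: follow the next-page link the API returns."""
    links = response_body.get("links") or {}
    return links.get("next")  # None means there are no more pages


def next_page_from_counter(
    response_body: Mapping[str, Any], current_page: int, page_size: int
) -> Optional[int]:
    """Client-side pagination: compute the next page number ourselves."""
    records = response_body.get("data") or []
    if len(records) < page_size:
        # A short (or empty) page is taken to mean this was the last one.
        return None
    return current_page + 1
```

The first helper keeps working if the API changes how it paginates, as long as it still returns a link. The second bakes the page-number scheme into the connector.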

Contributor

@natikgadzhi natikgadzhi left a comment

LGTM overall, I'll look into redirects and merge tonight

@natikgadzhi natikgadzhi merged commit 87f051c into master Apr 9, 2024
25 of 26 checks passed
@natikgadzhi natikgadzhi deleted the alex/python_tutorial branch April 9, 2024 04:08