Skip to content

Improve performance of interpolation in decalarative sources #44027

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Aug 22, 2024

Conversation

szubster
Copy link
Contributor

@szubster szubster commented Aug 14, 2024

What

interpolated_string.py is called thousands or even millions of times in declarative streams.
E.g. in datetime_based_cursor.py.
It is often the case (e.g. in jira source) that there is no evaluation needed, and input is just a string.
When encountered with multiple parent streams this can lead to jinja evaluation taking most of the CPU time.

Jinja compiled templates are also compiled on each evaluation. Which makes them slow.

How

Check whether evaluated string equals input on first try. If so, use input string directly.
Add cache for compiled jinja templates.

Review guide

  1. interpolated_string.py

User Impact

Improved performance.
Example of small run on our jira instance (tests from changes in interpolated_string.py only:
pre:

Executed in  267.01 secs    fish           external
   usr time  153.66 secs    0.11 millis  153.66 secs
   sys time    1.10 secs    1.35 millis    1.10 secs

post:

Executed in  130.13 secs    fish           external
   usr time   24.47 secs    0.18 millis   24.47 secs
   sys time    0.79 secs    2.05 millis    0.78 secs

Difference in test execution for whole interpolated package (with jinja improvements):

before after
single eval 0.75s 0.76s
loop 100 evals on single object 14.93s 1.36s

Can this PR be safely reverted and rolled back?

  • YES 💚
  • NO ❌

Copy link

vercel bot commented Aug 14, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
airbyte-docs ✅ Ready (Inspect) Visit Preview 💬 Add feedback Aug 21, 2024 8:06pm

@CLAassistant
Copy link

CLAassistant commented Aug 14, 2024

CLA assistant check
All committers have signed the CLA.

Copy link
Contributor

@natikgadzhi natikgadzhi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a huge gain 👏

@@ -28,6 +28,7 @@ def __post_init__(self, parameters: Mapping[str, Any]) -> None:
self.default = self.default or self.string
self._interpolation = JinjaInterpolation()
self._parameters = parameters
self._just_string = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
self._just_string = None
self._is_plain_string = None

Naming doesn't sit right with me — something like this perhaps? Up to @artem1205 and @girarda though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this one

@@ -37,6 +38,12 @@ def eval(self, config: Config, **kwargs: Any) -> Any:
:param kwargs: Optional parameters used for interpolation
:return: The interpolated string
"""
if self._just_string:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this does a bit of caching magic, I'd love a comment explaining why you're doing this in eval specifically. I understand you need config to actually interpolate, so this is not insane, but a comment for posterity would be good.

Also, are there scenarios where the first interpolation would render the same string, but later interpolation would render something different? I can't think of it, but is there a world where a jinja expression renders itself / morphs conditionally on an input?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. The change makes sense to me, but let's describe the rationale for posterity.

That being said, thank you for taking the time to contribute to Airbyte @szubster ! <3

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is any (which I doubt) then the output would still not look like input?
There are code-challenges, like https://en.wikipedia.org/wiki/Quine_(computing) but I doubt it's a real-life scenario.
And even if it is, then the logic is still correct :)

@szubster
Copy link
Contributor Author

Thank you for the comments.
I think I have found few more tricks to speed whole jinja interpolation. Not just the simple case.
Give me some time and I will come back to this.

@szubster
Copy link
Contributor Author

Is there some synthetic benchmark or a test suite I could use to test those changes?

@girarda
Copy link
Contributor

girarda commented Aug 15, 2024

@szubster there aren't performance benchmark tests, but depending on where your follow up changes land:

We'll also run regression tests with a few connectors to make sure this doesn't introduce regressions

@szubster szubster temporarily deployed to community-ci-auto August 19, 2024 11:45 — with GitHub Actions Inactive
@szubster szubster temporarily deployed to community-ci-auto August 19, 2024 11:55 — with GitHub Actions Inactive
@szubster szubster changed the title Check for simple strings in interpolated_string and disable eval for them Improve performance of interpolation in decalarative sources Aug 19, 2024
@szubster szubster force-pushed the faster-interpolated-string branch from e128129 to 61a15d3 Compare August 19, 2024 12:01
@szubster szubster force-pushed the faster-interpolated-string branch from 6d9ebc5 to cc27a11 Compare August 20, 2024 13:32
@szubster szubster temporarily deployed to community-ci-auto August 20, 2024 13:32 — with GitHub Actions Inactive
@szubster szubster temporarily deployed to community-ci-auto August 20, 2024 13:32 — with GitHub Actions Inactive
@szubster
Copy link
Contributor Author

All checks are passing now :)

@girarda
Copy link
Contributor

girarda commented Aug 21, 2024

@szubster
Copy link
Contributor Author

There is one failure, but I do not see how it could be related to my change?
Error: Invalid value for '--name': 'source-quickbook' in ....

@natikgadzhi
Copy link
Contributor

It should be quickbooks — must be a typo ;)

@girarda girarda merged commit 9e35a88 into airbytehq:master Aug 22, 2024
56 of 64 checks passed
@girarda
Copy link
Contributor

girarda commented Aug 22, 2024

@szubster this is now available in CDK 4.5.3 🎉

Thank you for the huge improvement <3

@szubster
Copy link
Contributor Author

If you would have any numbers from synthetic tests or anything to share how it improved performance, I would be glad to hear :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CDK Connector Development Kit community
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants