Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bulk-cdk: source output performance improvements #44865

Merged
merged 2 commits into from
Aug 28, 2024
Merged

Conversation

postamar
Copy link
Contributor

What

Informs https://github.com/airbytehq/airbyte-internal-issues/issues/9461

Performance of source-oracle is currently underwhelming. By running it on a t2.medium EC2 linux box with cpulimit -l 30 to limit it to 30% of a cpu core, I was able to deduce that CPU usage was the performance bottleneck.

The linux box was necessary because cpulimit isn't useful on mac for the jvm, it just ignores it! This explains why the connector was so fast on my local airbyte deployment compared to airbyte cloud.

By looking at wall-clock flame graphs collected by async-profiler, I was able to make some educated guesses as to which parts of the code could be improved. This PR contains those improvements.

Here's the before and after, with JDBC-driver stack frames in purple.

Before, only 40% of the time was spent actually fetching data from the source, and the remaining 60% of the time was spent in what's effectively overhead:
Screenshot 2024-08-28 at 15 42 45

Afterwards, the ratios are inverted:
Screenshot 2024-08-28 at 15 43 00

How

This PR has two commits.

Firstly, the SelectQuerier interface has an inner data class Parameters to which val reuseResultObject: Boolean = false has been added. When false as is the default nothing changes relative to how things are now. When true then the next method of the Result of the executeQuery method returns the same instance of ObjectNode from one call to the next. Instead of creating a new one it modifies the existing one in-place.

Furthermore, the JdbcPartitionReader cuts down on the garbage by re-using the AirbyteMessage of type RECORD, including its ObjectNode record data object, instead of re-creating those every single time.

This improves performance because it creates less garbage for the GC to collect in the first place, and also because adding entries into an empty object is expensive enough. Under the hood an ObjectNode is basically a LinkedHashMap and modifying in-place is cheap whereas rebuilding the whole linked list structure isn't as cheap.

Secondly, the StdoutOutputConsumer improves its jackson serialization. Instead of calling ObjectMapper::writeValueAsBytes which re-creates a JsonGenerator (albeit from an object pool) it uses a SequenceWriter which is the preferred way of serializing multiple values.

Review guide

Commit by commit.

User Impact

Bulk-CDK source connector READs should perform better. We'll see if that proves to be the case in the metabase cloud performance dashboard.

Can this PR be safely reverted and rolled back?

  • YES 💚
  • NO ❌

@postamar postamar requested review from a team as code owners August 28, 2024 19:49
Copy link

vercel bot commented Aug 28, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Skipped Deployment
Name Status Preview Comments Updated (UTC)
airbyte-docs ⬜️ Ignored (Inspect) Aug 28, 2024 7:49pm

@octavia-squidington-iii octavia-squidington-iii added the CDK Connector Development Kit label Aug 28, 2024
@@ -335,7 +335,7 @@ data class TestCase(
)
for (actualState in actual!!) {
Assertions.assertTrue(
actualState in expected,
actualState.toString() in expected.map { it.toString() },
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stringification is required because of the deep-copying added to BufferingOutputConsumer, this causes the hash values to change or something, failing the check for no good reason.

Copy link
Contributor

@johnny-schmidt johnny-schmidt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@postamar postamar merged commit abd9da9 into master Aug 28, 2024
33 checks passed
@postamar postamar deleted the postamar/perf-hacks branch August 28, 2024 20:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CDK Connector Development Kit
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants