bulk-cdk: source output performance improvements #44865

postamar · 2024-08-28T19:49:50Z

What

Informs https://github.com/airbytehq/airbyte-internal-issues/issues/9461

Performance of source-oracle is currently underwhelming. By running it on a t2.medium EC2 linux box with cpulimit -l 30 to limit it to 30% of a cpu core, I was able to deduce that CPU usage was the performance bottleneck.

The linux box was necessary because cpulimit isn't useful on mac for the jvm, it just ignores it! This explains why the connector was so fast on my local airbyte deployment compared to airbyte cloud.

By looking at wall-clock flame graphs collected by async-profiler, I was able to make some educated guesses as to which parts of the code could be improved. This PR contains those improvements.

Here's the before and after, with JDBC-driver stack frames in purple.

Before, only 40% of the time was spent actually fetching data from the source, and the remaining 60% of the time was spent in what's effectively overhead:

Afterwards, the ratios are inverted:

How

This PR has two commits.

Firstly, the SelectQuerier interface has an inner data class Parameters to which val reuseResultObject: Boolean = false has been added. When false as is the default nothing changes relative to how things are now. When true then the next method of the Result of the executeQuery method returns the same instance of ObjectNode from one call to the next. Instead of creating a new one it modifies the existing one in-place.

Furthermore, the JdbcPartitionReader cuts down on the garbage by re-using the AirbyteMessage of type RECORD, including its ObjectNode record data object, instead of re-creating those every single time.

This improves performance because it creates less garbage for the GC to collect in the first place, and also because adding entries into an empty object is expensive enough. Under the hood an ObjectNode is basically a LinkedHashMap and modifying in-place is cheap whereas rebuilding the whole linked list structure isn't as cheap.

Secondly, the StdoutOutputConsumer improves its jackson serialization. Instead of calling ObjectMapper::writeValueAsBytes which re-creates a JsonGenerator (albeit from an object pool) it uses a SequenceWriter which is the preferred way of serializing multiple values.

Review guide

Commit by commit.

User Impact

Bulk-CDK source connector READs should perform better. We'll see if that proves to be the case in the metabase cloud performance dashboard.

Can this PR be safely reverted and rolled back?

YES 💚
NO ❌

vercel · 2024-08-28T19:49:55Z

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Skipped Deployment

Name	Status	Preview	Comments	Updated (UTC)
airbyte-docs	⬜️ Ignored (Inspect)			Aug 28, 2024 7:49pm

postamar · 2024-08-28T19:53:28Z

airbyte-cdk/bulk/core/extract/src/test/kotlin/io/airbyte/cdk/read/RootReaderIntegrationTest.kt

@@ -335,7 +335,7 @@ data class TestCase(
            )
            for (actualState in actual!!) {
                Assertions.assertTrue(
-                    actualState in expected,
+                    actualState.toString() in expected.map { it.toString() },


stringification is required because of the deep-copying added to BufferingOutputConsumer, this causes the hash values to change or something, failing the check for no good reason.

johnny-schmidt

LGTM!

Marius Posta added 2 commits August 28, 2024 15:20

bulk-cdk-toolkit-extract-jdbc: reuse objects and generate less garbage

56b9c73

bulk-cdk-core-base: use jackson SequenceWriter in StdoutOutputConsumer

ea5140e

postamar requested review from a team as code owners August 28, 2024 19:49

octavia-squidington-iii added the CDK Connector Development Kit label Aug 28, 2024

postamar commented Aug 28, 2024

View reviewed changes

johnny-schmidt approved these changes Aug 28, 2024

View reviewed changes

postamar merged commit abd9da9 into master Aug 28, 2024
33 checks passed

postamar deleted the postamar/perf-hacks branch August 28, 2024 20:13

postamar mentioned this pull request Aug 29, 2024

bulk-cdk-core-base: optimized record serialization #44880

Merged

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bulk-cdk: source output performance improvements #44865

bulk-cdk: source output performance improvements #44865

postamar commented Aug 28, 2024

vercel bot commented Aug 28, 2024

postamar Aug 28, 2024

johnny-schmidt left a comment

bulk-cdk: source output performance improvements #44865

bulk-cdk: source output performance improvements #44865

Conversation

postamar commented Aug 28, 2024

What

How

Review guide

User Impact

Can this PR be safely reverted and rolled back?

vercel bot commented Aug 28, 2024

postamar Aug 28, 2024

Choose a reason for hiding this comment

johnny-schmidt left a comment

Choose a reason for hiding this comment