bulk-cdk: source output performance improvements #44865
Merged
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
What
Informs https://github.com/airbytehq/airbyte-internal-issues/issues/9461
Performance of source-oracle is currently underwhelming. By running it on a t2.medium EC2 linux box with
cpulimit -l 30
to limit it to 30% of a cpu core, I was able to deduce that CPU usage was the performance bottleneck.The linux box was necessary because cpulimit isn't useful on mac for the jvm, it just ignores it! This explains why the connector was so fast on my local airbyte deployment compared to airbyte cloud.
By looking at wall-clock flame graphs collected by async-profiler, I was able to make some educated guesses as to which parts of the code could be improved. This PR contains those improvements.
Here's the before and after, with JDBC-driver stack frames in purple.
Before, only 40% of the time was spent actually fetching data from the source, and the remaining 60% of the time was spent in what's effectively overhead:

Afterwards, the ratios are inverted:

How
This PR has two commits.
Firstly, the
SelectQuerier
interface has an innerdata class Parameters
to whichval reuseResultObject: Boolean = false
has been added. Whenfalse
as is the default nothing changes relative to how things are now. Whentrue
then thenext
method of theResult
of theexecuteQuery
method returns the same instance ofObjectNode
from one call to the next. Instead of creating a new one it modifies the existing one in-place.Furthermore, the
JdbcPartitionReader
cuts down on the garbage by re-using theAirbyteMessage
of type RECORD, including itsObjectNode
record data object, instead of re-creating those every single time.This improves performance because it creates less garbage for the GC to collect in the first place, and also because adding entries into an empty object is expensive enough. Under the hood an
ObjectNode
is basically aLinkedHashMap
and modifying in-place is cheap whereas rebuilding the whole linked list structure isn't as cheap.Secondly, the
StdoutOutputConsumer
improves its jackson serialization. Instead of callingObjectMapper::writeValueAsBytes
which re-creates aJsonGenerator
(albeit from an object pool) it uses aSequenceWriter
which is the preferred way of serializing multiple values.Review guide
Commit by commit.
User Impact
Bulk-CDK source connector READs should perform better. We'll see if that proves to be the case in the metabase cloud performance dashboard.
Can this PR be safely reverted and rolled back?