Skip to content

Use cheaper operation to estimate json data byte size #13240

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
May 27, 2022

Conversation

tuliren
Copy link
Contributor

@tuliren tuliren commented May 26, 2022

What

How

  • Replace the current byte size calculation with simply the length of the serialized JSON string.
  • The rationale is that the most frequent characters are in ASCII, and they all take only one byte in UTF-8 encoding.

@github-actions github-actions bot added area/platform issues related to the platform area/worker Related to worker labels May 26, 2022
@tuliren tuliren temporarily deployed to more-secrets May 26, 2022 21:16 Inactive
@tuliren tuliren requested review from cgardens and evantahler May 26, 2022 21:19
@tuliren tuliren temporarily deployed to more-secrets May 26, 2022 21:19 Inactive
Copy link
Contributor

@evantahler evantahler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment below, but it's an optimization. This is better than what we have now, so perhaps this is merged as-is, and we can look at an even more memory-safe later.

Comment on lines +139 to +141
public static int getEstimatedByteSize(final JsonNode jsonNode) {
return serialize(jsonNode).length();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we are already playing the game of memory optimization, and we are comfortable with estimates, is there a way we can get the size of the object without re-serializing the instance back to a string again? That's also got overhead...

A rough approximation could be to store a constant for how much memory an empty JsonNode takes in bytes, and subtract that from this instance's size maybe?

Alternatively, before we parse the message, could we get the string length then and store that value as a property of AirbyteMessage... perhaps AirbyteMessage.originalMessageSize

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A rough approximation could be to store a constant for how much memory an empty JsonNode takes in bytes, and subtract that from this instance's size maybe?

By "this instance's size", do you mean the serialized byte size or the size in memory of a JsonNode instance? There is no good way to measure an object's size in memory though. The libraries that can do this usually have large overhead, and are not recommended for production.

It's possible to implement one by tracking the JSON field and value sizes in the JsonNode object itself. However, that requires forking the jackson library and more complicated work.

Alternatively, before we parse the message, could we get the string length then and store that value as a property of AirbyteMessage... perhaps AirbyteMessage.originalMessageSize

That's a good idea. Somewhere in the source, it's probably serializing the data inside message already. So we can store the serialized string size as a field in AirbyteMessage. However, I cannot find where we construct the AirbyteMessage in the upstream.

I will merge this PR as is for now.

@tuliren tuliren merged commit 3dcda7a into master May 27, 2022
@tuliren tuliren deleted the liren/simplify-byte-size-calculation branch May 27, 2022 02:42
@cgardens
Copy link
Contributor

cgardens commented May 27, 2022

looks good.

earlier in the stack, we have the serialized version of the message. we deserialize it and then the serialized version gets garbage collected. if we ever do need the exact counts, we can just keep the serialized version in memory and pass it through the stream and count it directly.

that said, i think your approach makes the right tradeoffs for us now.

@tuliren
Copy link
Contributor Author

tuliren commented May 27, 2022

earlier in the stack, we have the serialized version of the message

Do you know where it is? I'd like to take a look and see if we can cache the size in AirbyteMessage in a field there (as suggested by Evan) to completely avoid doing the serialization in the message tracker.

@cgardens
Copy link
Contributor

yeah! DefaultAirbyteStreamFactory is the class that listens on STDIN for anything emitted from the source.

jscottpolevault pushed a commit to jscottpolevault/airbyte that referenced this pull request Jun 1, 2022
* Simplify byte size estimation

* Format code

* Update comment
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/platform issues related to the platform area/worker Related to worker
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants