-
Notifications
You must be signed in to change notification settings - Fork 4.5k
Use cheaper operation to estimate json data byte size #13240
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment below, but it's an optimization. This is better than what we have now, so perhaps this is merged as-is, and we can look at an even more memory-safe later.
public static int getEstimatedByteSize(final JsonNode jsonNode) { | ||
return serialize(jsonNode).length(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we are already playing the game of memory optimization, and we are comfortable with estimates, is there a way we can get the size of the object without re-serializing the instance back to a string again? That's also got overhead...
A rough approximation could be to store a constant for how much memory an empty JsonNode takes in bytes, and subtract that from this instance's size maybe?
Alternatively, before we parse the message, could we get the string length then and store that value as a property of AirbyteMessage
... perhaps AirbyteMessage.originalMessageSize
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A rough approximation could be to store a constant for how much memory an empty JsonNode takes in bytes, and subtract that from this instance's size maybe?
By "this instance's size", do you mean the serialized byte size or the size in memory of a JsonNode
instance? There is no good way to measure an object's size in memory though. The libraries that can do this usually have large overhead, and are not recommended for production.
It's possible to implement one by tracking the JSON field and value sizes in the JsonNode
object itself. However, that requires forking the jackson library and more complicated work.
Alternatively, before we parse the message, could we get the string length then and store that value as a property of
AirbyteMessage
... perhapsAirbyteMessage.originalMessageSize
That's a good idea. Somewhere in the source, it's probably serializing the data inside message already. So we can store the serialized string size as a field in AirbyteMessage
. However, I cannot find where we construct the AirbyteMessage
in the upstream.
I will merge this PR as is for now.
looks good. earlier in the stack, we have the serialized version of the message. we deserialize it and then the serialized version gets garbage collected. if we ever do need the exact counts, we can just keep the serialized version in memory and pass it through the stream and count it directly. that said, i think your approach makes the right tradeoffs for us now. |
Do you know where it is? I'd like to take a look and see if we can cache the size in |
yeah! |
* Simplify byte size estimation * Format code * Update comment
What
String#getBytes
method.How