Use cheaper operation to estimate json data byte size #13240

Merged
merged 3 commits into from
May 27, 2022
11 changes: 11 additions & 0 deletions airbyte-commons/src/main/java/io/airbyte/commons/json/Jsons.java
@@ -129,6 +129,17 @@ public static byte[] toBytes(final JsonNode jsonNode) {
return serialize(jsonNode).getBytes(Charsets.UTF_8);
}

/**
* Use string length as an estimation for byte size, because all ASCII characters are one byte long
* in UTF-8, and ASCII characters cover most of the use cases. To be more precise, we can convert
* the string to byte[] and use the length of the byte[]. However, this conversion is expensive in
* memory consumption. Given that the byte size of the serialized JSON is already an estimation of
* the actual size of the JSON object, using a cheap operation seems an acceptable compromise.
*/
public static int getEstimatedByteSize(final JsonNode jsonNode) {
return serialize(jsonNode).length();
}
Comment on lines +139 to +141
Contributor

Since we are already playing the game of memory optimization, and we are comfortable with estimates, is there a way we can get the size of the object without re-serializing the instance back to a string again? That's also got overhead...

A rough approximation could be to store a constant for how much memory an empty JsonNode takes in bytes, and subtract that from this instance's size maybe?

Alternatively, before we parse the message, could we get the string length then and store that value as a property of AirbyteMessage... perhaps AirbyteMessage.originalMessageSize

Contributor Author

> A rough approximation could be to store a constant for how much memory an empty JsonNode takes in bytes, and subtract that from this instance's size maybe?

By "this instance's size", do you mean the serialized byte size or the size in memory of a JsonNode instance? There is no good way to measure an object's size in memory though. The libraries that can do this usually have large overhead, and are not recommended for production.

It's possible to implement one by tracking the JSON field and value sizes in the JsonNode object itself. However, that requires forking the jackson library and more complicated work.

> Alternatively, before we parse the message, could we get the string length then and store that value as a property of AirbyteMessage... perhaps AirbyteMessage.originalMessageSize

That's a good idea. Somewhere in the source, it's probably serializing the data inside message already. So we can store the serialized string size as a field in AirbyteMessage. However, I cannot find where we construct the AirbyteMessage in the upstream.

I will merge this PR as is for now.
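
The second suggestion in this thread, recording the size before the message is parsed, can be sketched roughly as follows. This is only an illustration under assumptions: `SizedMessageSketch`, `Sized`, `originalSize`, and `parseAndMeasure` are hypothetical names, the parser is passed in as a plain function rather than tied to any Airbyte or Jackson API, and the raw line is assumed to be available as a `String` at the point of parsing.

```java
import java.util.function.Function;

final class SizedMessageSketch {

  /** Hypothetical holder pairing a parsed value with the length of the raw text it came from. */
  static final class Sized<T> {
    final T value;
    final int originalSize;

    Sized(final T value, final int originalSize) {
      this.value = value;
      this.originalSize = originalSize;
    }
  }

  /**
   * Parses rawLine with the supplied parser and records rawLine.length() alongside the result,
   * so no second serialization is needed later just to estimate the size.
   */
  static <T> Sized<T> parseAndMeasure(final String rawLine, final Function<String, T> parser) {
    return new Sized<>(parser.apply(rawLine), rawLine.length());
  }

  public static void main(final String[] args) {
    // Integer::parseInt stands in for a real JSON parser in this sketch.
    final Sized<Integer> sized = parseAndMeasure("12345", Integer::parseInt);
    System.out.println(sized.value + " came from " + sized.originalSize + " characters");
  }
}
```

Note that a size captured this way would cover the whole raw message, not only the record's `data` field that this PR measures, so the two numbers would not be directly comparable.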


public static Set<String> keys(final JsonNode jsonNode) {
if (jsonNode.isObject()) {
return Jsons.object(jsonNode, new TypeReference<Map<String, Object>>() {}).keySet();
@@ -246,6 +246,12 @@ void testGetStringOrNull() {
assertNull(Jsons.getStringOrNull(json, "xyz"));
}

@Test
void testGetEstimatedByteSize() {
final JsonNode json = Jsons.deserialize("{\"string_key\":\"abc\",\"array_key\":[\"item1\", \"item2\"]}");
assertEquals(Jsons.toBytes(json).length, Jsons.getEstimatedByteSize(json));
}
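
The test above uses ASCII-only input, where the string length and the UTF-8 byte length coincide. A minimal sketch of how the two diverge for non-ASCII text follows. `Utf8SizeSketch` and `utf8Length` are hypothetical names, not part of `Jsons`, and the helper is only one possible way to get an exact count without allocating a `byte[]`.

```java
import java.nio.charset.StandardCharsets;

final class Utf8SizeSketch {

  /** Hypothetical helper: exact UTF-8 byte count of s, computed without allocating a byte[]. */
  static long utf8Length(final String s) {
    long bytes = 0;
    for (int i = 0; i < s.length(); i++) {
      final char c = s.charAt(i);
      if (c < 0x80) {
        bytes += 1; // ASCII
      } else if (c < 0x800) {
        bytes += 2;
      } else if (Character.isSurrogate(c)) {
        final boolean validPair = Character.isHighSurrogate(c)
            && i + 1 < s.length()
            && Character.isLowSurrogate(s.charAt(i + 1));
        if (validPair) {
          bytes += 4; // one supplementary code point, two chars
          i++;
        } else {
          bytes += 1; // assumption: mirrors the '?' substitution done by String.getBytes
        }
      } else {
        bytes += 3; // rest of the Basic Multilingual Plane
      }
    }
    return bytes;
  }

  public static void main(final String[] args) {
    final String[] samples = {"{\"k\":\"abc\"}", "{\"k\":\"\u00e9\"}", "{\"k\":\"\uD83D\uDE00\"}"};
    for (final String s : samples) {
      final int estimate = s.length(); // what getEstimatedByteSize would report for this text
      final int viaBytes = s.getBytes(StandardCharsets.UTF_8).length;
      System.out.println(estimate + " chars, " + viaBytes + " bytes, " + utf8Length(s) + " counted");
    }
  }
}
```

For any string, `length()` is at most the UTF-8 byte count, so the estimate can undercount non-ASCII data but never overcount it.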

private static class ToClass {

@JsonProperty("str")
@@ -111,10 +111,9 @@ private void handleSourceEmittedRecord(final AirbyteRecordMessage recordMessage)
final long currentTotalCount = streamToTotalRecordsEmitted.getOrDefault(streamIndex, 0L);
streamToTotalRecordsEmitted.put(streamIndex, currentTotalCount + 1);

- // todo (cgardens) - pretty wasteful to do an extra serialization just to get size.
- final int numBytes = Jsons.serialize(recordMessage.getData()).getBytes(Charsets.UTF_8).length;
+ final int estimatedNumBytes = Jsons.getEstimatedByteSize(recordMessage.getData());
final long currentTotalStreamBytes = streamToTotalBytesEmitted.getOrDefault(streamIndex, 0L);
- streamToTotalBytesEmitted.put(streamIndex, currentTotalStreamBytes + numBytes);
+ streamToTotalBytesEmitted.put(streamIndex, currentTotalStreamBytes + estimatedNumBytes);
}

/**
@@ -7,7 +7,6 @@
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertTrue;

- import com.google.common.base.Charsets;
import io.airbyte.commons.json.Jsons;
import io.airbyte.config.FailureReason;
import io.airbyte.config.State;
@@ -53,7 +52,7 @@ public void testGetTotalRecordsStatesAndBytesEmitted() {
messageTracker.acceptFromSource(s2);

assertEquals(3, messageTracker.getTotalRecordsEmitted());
- assertEquals(3 * Jsons.serialize(r1.getRecord().getData()).getBytes(Charsets.UTF_8).length, messageTracker.getTotalBytesEmitted());
+ assertEquals(3L * Jsons.getEstimatedByteSize(r1.getRecord().getData()), messageTracker.getTotalBytesEmitted());
assertEquals(2, messageTracker.getTotalStateMessagesEmitted());
}

@@ -112,9 +111,9 @@ public void testEmittedBytesByStream() {
final AirbyteMessage r2 = AirbyteMessageUtils.createRecordMessage(STREAM_2, 2);
final AirbyteMessage r3 = AirbyteMessageUtils.createRecordMessage(STREAM_3, 3);

- final long r1Bytes = Jsons.serialize(r1.getRecord().getData()).getBytes(Charsets.UTF_8).length;
- final long r2Bytes = Jsons.serialize(r2.getRecord().getData()).getBytes(Charsets.UTF_8).length;
- final long r3Bytes = Jsons.serialize(r3.getRecord().getData()).getBytes(Charsets.UTF_8).length;
+ final long r1Bytes = Jsons.getEstimatedByteSize(r1.getRecord().getData());
+ final long r2Bytes = Jsons.getEstimatedByteSize(r2.getRecord().getData());
+ final long r3Bytes = Jsons.getEstimatedByteSize(r3.getRecord().getData());

messageTracker.acceptFromSource(r1);
messageTracker.acceptFromSource(r2);