-
Notifications
You must be signed in to change notification settings - Fork 1.4k
Support string interning / deduplication within packets #11640
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
34017b7
to
8602854
Compare
a25a0f2
to
aa1ebe2
Compare
Kk the undraft commit has the diff since review, but the behavioral changes are:
And then just chores:
|
/perfstar |
5 tasks
SimaTian
reviewed
Apr 29, 2025
SimaTian
reviewed
Apr 29, 2025
Overall I like it. Just some weird questions that are mostly to satisfy my curiosity. |
SimaTian
approved these changes
May 2, 2025
MichalPavlik
approved these changes
May 5, 2025
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Context
When looking at serialization overhead for the RAR node, I found that most of the serialization overhead comes down to reading strings from the payload.
Many strings are shared between task items, such as common metadata values, file names, parent directories.
With the existing translator system, you can manually do this - you effectively need to create a temporary buffer, ID mapping, and offset the lookup depending on the direction. But it's difficult to set up, all child objects would need separate constructors, and it breaks the idea of having
ITranslator
be unidirectional since the write patterns have different ordering requirements - which makes it incredibly easy to accidentally break.Changes Made
This bakes in an opt-in string interning API directly into the
ITranslator
framework. By defining a block like:...all steps are automatically handled.
BinaryWriteTranslator
swaps its writer with a translator provided byInterningWriteTranslator
. When the block is completed, the offset + flush automatically happens. On the read side,BinaryReadTranslator
just reads directly from the stream, usingInterningReadTranslator
to decode any interned strings.A parent translator can enter multiple intern blocks across its lifetime as the buffers are just reset and reused. This makes it possible to use with the existing node pipe implementations, since they are reused across packets.
The
ITranslatable
implementation can selectively choose which strings to intern, and which strings are unlikely to be duplicated.If
_.Intern()
methods are called outside of an enclosing delegate, they simply behave as a non-interned operation.This makes it extremely easy to add string interning / deduplication support for any arbitrary
INodePacket
. See the included example of enabling this for the RAR statefile.Testing
Implementation seen here effectively cuts packet size for RAR request / response packets by ~50-60%.
RAR statefile can hit a ~20% size reduction, but only with additional changes to intern on substrings rather than a path separator (e.g. you run into strings like
file///somepath/foo.dll
where the full path isn't deduplicated due to the prefix). Given this, substring / path API needs more thought as it would be very useful to split on multiple known separators. Currently it's closer to ~10% (tested on cloudbuild and orchard core repos).Basic powershell command to count the diff: