Skip to content

Support string interning / deduplication within packets #11640

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
May 6, 2025

Conversation

ccastanedaucf
Copy link
Contributor

@ccastanedaucf ccastanedaucf commented Mar 27, 2025

Context

When looking at serialization overhead for the RAR node, I found that most of the serialization overhead comes down to reading strings from the payload.

Many strings are shared between task items, such as common metadata values, file names, parent directories.

With the existing translator system, you can manually do this - you effectively need to create a temporary buffer, ID mapping, and offset the lookup depending on the direction. But it's difficult to set up, all child objects would need separate constructors, and it breaks the idea of having ITranslator be unidirectional since the write patterns have different ordering requirements - which makes it incredibly easy to accidentally break.

Changes Made

This bakes in an opt-in string interning API directly into the ITranslator framework. By defining a block like:

translator.WithInterning(StringComparer.Ordinal, translator =>
{
   translator.Translate(ref  _someItem);
   translator.Intern(ref _someString);
   translator.InternDictionary(ref _someDictionary, StringComparer.OrdinalIgnoreCase, translator => new SomeITranslator(translator));
});

...all steps are automatically handled. BinaryWriteTranslator swaps its writer with a translator provided by InterningWriteTranslator. When the block is completed, the offset + flush automatically happens. On the read side, BinaryReadTranslator just reads directly from the stream, using InterningReadTranslator to decode any interned strings.

A parent translator can enter multiple intern blocks across its lifetime as the buffers are just reset and reused. This makes it possible to use with the existing node pipe implementations, since they are reused across packets.

The ITranslatable implementation can selectively choose which strings to intern, and which strings are unlikely to be duplicated.

If _.Intern() methods are called outside of an enclosing delegate, they simply behave as a non-interned operation.

This makes it extremely easy to add string interning / deduplication support for any arbitrary INodePacket. See the included example of enabling this for the RAR statefile.

Testing

Implementation seen here effectively cuts packet size for RAR request / response packets by ~50-60%.

RAR statefile can hit a ~20% size reduction, but only with additional changes to intern on substrings rather than a path separator (e.g. you run into strings like file///somepath/foo.dll where the full path isn't deduplicated due to the prefix). Given this, substring / path API needs more thought as it would be very useful to split on multiple known separators. Currently it's closer to ~10% (tested on cloudbuild and orchard core repos).

Basic powershell command to count the diff:

ls -r *AssemblyReference.cache | select -ExpandProperty Length | Measure-Object -Sum | foreach {$_.Sum / 100000}

Copy link
Member

@rainersigwald rainersigwald left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

@ccastanedaucf ccastanedaucf marked this pull request as ready for review April 10, 2025 17:38
@ccastanedaucf ccastanedaucf force-pushed the dev/chcasta/prop-intern branch 2 times, most recently from 34017b7 to 8602854 Compare April 10, 2025 17:50
@ccastanedaucf ccastanedaucf force-pushed the dev/chcasta/prop-intern branch from a25a0f2 to aa1ebe2 Compare April 10, 2025 18:05
@ccastanedaucf
Copy link
Contributor Author

Kk the undraft commit has the diff since review, but the behavioral changes are:

  • Instead of passing InterningWriteTranslator.Translator to the caller via the delegate, just swap its binary writer for the scope of the delegate. Effectively does the same thing, but simplifies a lot of the setup and routing of calls since everything still goes through the original translator instance.
  • Merged the non-nullable optimization up into ITranslator as a flag to scale down the API.
  • Removed usage in RAR statefile - will do the real switch in a follow up.

And then just chores:

  • Added a bunch of UTs at the BinaryTranslator level. Nothing at the implementation level (aka asserting on the stream contents), but this still validates that strings are deduped + wanted to ensure that the translator doesn't blow up from reuse or ordering edge cases.
  • Documented interning structure, signal flow, everything where ordering is important
  • Resolved active PR comments (no more Console.WriteLine 😅).

@maridematte maridematte self-assigned this Apr 15, 2025
@JanProvaznik
Copy link
Member

/perfstar

@SimaTian
Copy link
Member

Overall I like it. Just some weird questions that are mostly to satisfy my curiosity.
LGTM

@SimaTian SimaTian merged commit 85587d8 into dotnet:main May 6, 2025
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants