Skip to content

Optimise Guid.DecodeByte #116574

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into
base: main
Choose a base branch
from
Open

Conversation

xtqqczze
Copy link
Contributor

@xtqqczze xtqqczze commented Jun 12, 2025

@github-actions github-actions bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Jun 12, 2025
@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Jun 12, 2025
@teo-tsirpanis teo-tsirpanis added area-System.Runtime and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels Jun 12, 2025
@MihaZupan
Copy link
Member

MihuBot/runtime-utils#1118

Method Toolchain Mean Error Ratio
ParseExactD Main 18.6149 ns 0.0806 ns 1.00
ParseExactD PR 14.8572 ns 0.0753 ns 0.80
ctor_str Main 18.5534 ns 0.0249 ns 1.00
ctor_str PR 15.2745 ns 0.3799 ns 0.82
Parse Main 18.6017 ns 0.0772 ns 1.00
Parse PR 14.7625 ns 0.0705 ns 0.79

invalidIfNegative |= result;
return (byte)result;

int c1 = typeof(TChar) == typeof(byte) ? byte.CreateTruncating(ch1) : (int)Math.Min(TChar.CastToUInt32(ch1), 0x7F);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why isn't this just TChar.CastToUInt32(ch1) instead of byte.CreateTruncating(ch1)?

The helper API exists because it's cheaper and easier to inline than T.CreateTruncating(ch1) since we know its either byte or char and so just zero-extending up.

Copy link
Contributor Author

@xtqqczze xtqqczze Jun 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


int c1 = typeof(TChar) == typeof(byte) ? byte.CreateTruncating(ch1) : (int)Math.Min(TChar.CastToUInt32(ch1), 0x7F);
uint upper = lookup[c1];
invalidIfGreaterThan7F |= upper;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I expect it would be cheaper if we did this in a local and only wrote the byref once at the end.

I also expect it'd be cheaper if we kept it as invalidIfNegative. You could still have this method bitwise or it together as unsigned, since it's always 0xFF for invalid. But then you can invalidIfNegative = (sbyte)lookupBitmask which takes care of the write and sign-extension together.

This then ensures the if (invalidIfNegative >= 0) can continue being cheaper to encode/execute since its doing an operation against 0, which can be special cased.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I expect it would be cheaper if we did this in a local and only wrote the byref once at the end.

The byref disappears once DecodeByte is inlined, so using a local would be no cheaper.

I also expect it'd be cheaper if we kept it as invalidIfNegative. You could still have this method bitwise or it together as unsigned, since it's always 0xFF for invalid. But then you can invalidIfNegative = (sbyte)lookupBitmask which takes care of the write and sign-extension together.

I took care to avoid the unnecessary sign-extension, expecting the additional or to be cheaper than 2x movsx.

This then ensures the if (invalidIfNegative >= 0) can continue being cheaper to encode/execute since its doing an operation against 0, which can be special cased.

Compare against zero saves one byte (on x64) for the whole parse, the invalidIfGreaterThan7F approach reduces total code size significantly, see https://gist.github.com/MihuBot/523f631c5a33a1d5ae3fa14ca3e32777.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took care to avoid the unnecessary sign-extension, expecting the additional or to be cheaper than 2x movsx.

These are the diffs for one iteration of DecodeByte once inlined. Not sure why the JIT isn't emitting movsx rax, byte ptr [rax+rcx] for the base.

        lea      rdx, bword ptr [r15+0x04]
        movzx    rsi, word  ptr [rbx+0x16]
        movzx    r8, word  ptr [rbx+0x18]
-       movzx    rax, sil
-       movzx    rax, byte  ptr [rax+rcx]
-       movsx    rax, al
-       movzx    r9, r8b
-       movzx    r9, byte  ptr [r9+rcx]
-       movsx    r9, r9b
-       shl      eax, 4
-       or       eax, r9d
+       cmp      esi, 127
+       cmovg    esi, eax
+       movzx    rsi, byte  ptr [rsi+rdi]
+       or       ecx, esi
+       cmp      r8d, 127
+       cmovg    r8d, eax
+       movzx    r8, byte  ptr [r8+rdi]
+       or       ecx, r8d
+       shl      esi, 4
        or       esi, r8d
-       mov      r8d, -1
-       shr      esi, 8
-       cmovne   eax, r8d
-       or       edi, eax
-       mov      byte  ptr [rdx], al
+       mov      byte  ptr [rdx], sil

Copy link
Contributor Author

@xtqqczze xtqqczze Jun 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also expect it'd be cheaper if we kept it as invalidIfNegative.

012d58e

It’s annoying this requires using (sbyte)Unsafe.Add(ref MemoryMarshal.GetReference(lookup), ch1) to get sign-extension on the memory load.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It’s annoying this requires using (sbyte)Unsafe.Add(ref MemoryMarshal.GetReference(lookup), ch1) to get sign-extension on the memory load.

Opened #116593

Copy link
Member

@tannergooding tannergooding Jun 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure you can just simplify this to (effectively):

int c1 = ...;
int c2 = ...;

ReadOnlySpan<byte> lookup = CharToHexLookup;
int upper = lookup[c1];
int lower = lookup[c2];
int result = (upper << 4) | lower;

resultBitMask |= (upper | lower);
return (byte)result;

Then you can do the isNegative check as:

if ((sbyte)(resultBitMask) >= 0)

This greatly reduces the work and removes any branching considerations, bringing it down to:

movzx edx, byte ptr [r8+2]       ; create c1 and c2 from ch1/ch2
movzx r11d, byte ptr [r8+3]      ;
movzx edx, byte ptr [rdx+r10]    ; lookup of lower and upper from c1/c2
movzx r11d, byte ptr [r11+r10]   ; 
or r9d, edx                      ; or upper and lower into resultBitMask
or r9d, r11d                     ;
shl edx, 4                       ; create result
or edx, r11d                     ;
mov [rcx], dl                    ; write result to destination
lea rcx, [rax+2]                 ;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's possibly you could change the ordering slightly to better take advantage of pipelining too, but likely not that beneficial.

The general point being that we can minimize the additional work while still preserving the ability to optimize later parts of the logic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've kept an approach using sign-extension. This possibly leaves some performance on the table, but it uses simpler logic and if #116593 is resolved, would achieve the best performance.

The alternative is to use a documented piece of unsafe code for indexing that is provably in bounds.

@xtqqczze xtqqczze force-pushed the improve-decode-byte branch from dfc39fc to 012d58e Compare June 12, 2025 18:08
@vcsjones
Copy link
Member

There was a build break in main which is the source of these compilation errors. Merging in main will fix the compilation errors.

@xtqqczze
Copy link
Contributor Author

@MihuBot benchmark Perf_Guid -medium

@MihuBot
Copy link

MihuBot commented Jun 12, 2025

System.Tests.Perf_Guid
BenchmarkDotNet v0.14.1-nightly.20250107.205, Linux Ubuntu 22.04.5 LTS (Jammy Jellyfish)
AMD EPYC 9V74, 1 CPU, 8 logical and 4 physical cores
MediumRun : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
Job=MediumRun  IterationCount=15  LaunchCount=2
WarmupCount=10
Method Toolchain Mean Error Ratio Allocated Alloc Ratio
NewGuid Main 548.9893 ns 0.2231 ns 1.00 - NA
NewGuid PR 549.5465 ns 0.2342 ns 1.00 - NA
ctor_bytes Main 0.5641 ns 0.0043 ns 1.00 - NA
ctor_bytes PR 0.5682 ns 0.0047 ns 1.01 - NA
EqualsSame Main 0.0003 ns 0.0003 ns ? - ?
EqualsSame PR 0.0014 ns 0.0003 ns ? - ?
EqualsNotSame Main 0.0012 ns 0.0007 ns ? - ?
EqualsNotSame PR 0.0009 ns 0.0007 ns ? - ?
EqualsOperator Main 0.0007 ns 0.0004 ns ? - ?
EqualsOperator PR 0.0006 ns 0.0002 ns ? - ?
NotEqualsOperator Main 0.0002 ns 0.0002 ns ? - ?
NotEqualsOperator PR 0.0003 ns 0.0003 ns ? - ?
ParseExactD Main 19.3275 ns 0.4322 ns 1.00 - NA
ParseExactD PR 14.6864 ns 0.0145 ns 0.76 - NA
GuidToString Main 11.5308 ns 0.0785 ns 1.00 96 B 1.00
GuidToString PR 13.3686 ns 1.6937 ns 1.16 96 B 1.00
TryWriteBytes Main 0.1890 ns 0.0560 ns 1.20 - NA
TryWriteBytes PR 0.1955 ns 0.0568 ns 1.24 - NA
ctor_str Main 18.5711 ns 0.0830 ns 1.00 - NA
ctor_str PR 14.7165 ns 0.0583 ns 0.79 - NA
Parse Main 18.6136 ns 0.0802 ns 1.00 - NA
Parse PR 14.6805 ns 0.0282 ns 0.79 - NA
System.Text.Json.Tests.Perf_Guids
BenchmarkDotNet v0.14.1-nightly.20250107.205, Linux Ubuntu 22.04.5 LTS (Jammy Jellyfish)
AMD EPYC 9V74, 1 CPU, 8 logical and 4 physical cores
MediumRun : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
Job=MediumRun  OutlierMode=DontRemove  IterationCount=15
LaunchCount=2  MemoryRandomization=True  WarmupCount=10
Method Toolchain Formatted SkipValidation Mean Error Ratio Allocated Alloc Ratio
WriteGuids Main False False 614.5 μs 4.19 μs 1.00 8.63 KB 1.00
WriteGuids PR False False 609.1 μs 1.21 μs 0.99 8.63 KB 1.00
WriteGuids Main False True 585.8 μs 1.01 μs 1.00 8.63 KB 1.00
WriteGuids PR False True 585.9 μs 1.23 μs 1.00 8.63 KB 1.00
WriteGuids Main True False 846.8 μs 22.95 μs 1.00 8.63 KB 1.00
WriteGuids PR True False 810.1 μs 0.55 μs 0.96 8.63 KB 1.00
WriteGuids Main True True 799.2 μs 0.76 μs 1.00 8.63 KB 1.00
WriteGuids PR True True 799.9 μs 1.72 μs 1.00 8.63 KB 1.00

@xtqqczze
Copy link
Contributor Author

xtqqczze commented Jun 12, 2025

Method Toolchain Mean Error Ratio Allocated Alloc Ratio
ParseExactD Main 19.3275 ns 0.4322 ns 1.00 - NA
ParseExactD PR 14.6864 ns 0.0145 ns 0.76 - NA
ctor_str Main 18.5711 ns 0.0830 ns 1.00 - NA
ctor_str PR 14.7165 ns 0.0583 ns 0.79 - NA
Parse Main 18.6136 ns 0.0802 ns 1.00 - NA
Parse PR 14.6805 ns 0.0282 ns 0.79 - NA

@xtqqczze
Copy link
Contributor Author

@MihuBot benchmark Perf_Guid -medium -arm

@MihuBot
Copy link

MihuBot commented Jun 12, 2025

System.Tests.Perf_Guid
BenchmarkDotNet v0.14.1-nightly.20250107.205, Linux Ubuntu 22.04.5 LTS (Jammy Jellyfish)
Unknown processor, 8 physical cores
MediumRun : .NET 10.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
Job=MediumRun  IterationCount=15  LaunchCount=2
WarmupCount=10
Method Toolchain Mean Error Ratio Allocated Alloc Ratio
NewGuid Main 360.7914 ns 0.2896 ns 1.00 - NA
NewGuid PR 360.4806 ns 0.2844 ns 1.00 - NA
ctor_bytes Main 0.6046 ns 0.0232 ns 1.00 - NA
ctor_bytes PR 0.5415 ns 0.0002 ns 0.90 - NA
EqualsSame Main 0.3042 ns 0.0399 ns 1.03 - NA
EqualsSame PR 0.1276 ns 0.0856 ns 0.43 - NA
EqualsNotSame Main 0.1779 ns 0.1209 ns ? - ?
EqualsNotSame PR 0.0000 ns 0.0000 ns ? - ?
EqualsOperator Main 0.2288 ns 0.0908 ns 1.52 - NA
EqualsOperator PR 0.2377 ns 0.0930 ns 1.58 - NA
NotEqualsOperator Main 0.4406 ns 0.0009 ns 1.000 - NA
NotEqualsOperator PR 0.0000 ns 0.0000 ns 0.000 - NA
ParseExactD Main 18.5836 ns 0.0154 ns 1.00 - NA
ParseExactD PR 13.9214 ns 0.0105 ns 0.75 - NA
GuidToString Main 11.3054 ns 0.0902 ns 1.00 96 B 1.00
GuidToString PR 11.2633 ns 0.0959 ns 1.00 96 B 1.00
TryWriteBytes Main 0.0000 ns 0.0000 ns ? - ?
TryWriteBytes PR 0.0000 ns 0.0000 ns ? - ?
ctor_str Main 18.8576 ns 0.0353 ns 1.00 - NA
ctor_str PR 13.9297 ns 0.0096 ns 0.74 - NA
Parse Main 18.8300 ns 0.0395 ns 1.00 - NA
Parse PR 13.9466 ns 0.0549 ns 0.74 - NA
System.Text.Json.Tests.Perf_Guids
BenchmarkDotNet v0.14.1-nightly.20250107.205, Linux Ubuntu 22.04.5 LTS (Jammy Jellyfish)
Unknown processor, 8 physical cores
MediumRun : .NET 10.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
Job=MediumRun  OutlierMode=DontRemove  IterationCount=15
LaunchCount=2  MemoryRandomization=True  WarmupCount=10
Method Toolchain Formatted SkipValidation Mean Error Ratio Allocated Alloc Ratio
WriteGuids Main False False 708.2 μs 2.76 μs 1.00 8.63 KB 1.00
WriteGuids PR False False 709.6 μs 2.94 μs 1.00 8.63 KB 1.00
WriteGuids Main False True 685.1 μs 2.12 μs 1.00 8.63 KB 1.00
WriteGuids PR False True 681.6 μs 3.56 μs 0.99 8.63 KB 1.00
WriteGuids Main True False 1,083.1 μs 4.41 μs 1.00 17.13 KB 1.00
WriteGuids PR True False 1,079.9 μs 3.04 μs 1.00 17.13 KB 1.00
WriteGuids Main True True 1,091.8 μs 2.72 μs 1.00 17.13 KB 1.00
WriteGuids PR True True 1,093.9 μs 3.57 μs 1.00 17.13 KB 1.00

@xtqqczze
Copy link
Contributor Author

Method Toolchain Mean Error Ratio Allocated Alloc Ratio
ParseExactD Main 18.5836 ns 0.0154 ns 1.00 - NA
ParseExactD PR 13.9214 ns 0.0105 ns 0.75 - NA
ctor_str Main 18.8576 ns 0.0353 ns 1.00 - NA
ctor_str PR 13.9297 ns 0.0096 ns 0.74 - NA
Parse Main 18.8300 ns 0.0395 ns 1.00 - NA
Parse PR 13.9466 ns 0.0549 ns 0.74 - NA

uint c1 = typeof(TChar) == typeof(byte) ? TChar.CastToUInt32(ch1) : Math.Min(TChar.CastToUInt32(ch1), 0x7F);
uint c2 = typeof(TChar) == typeof(byte) ? TChar.CastToUInt32(ch2) : Math.Min(TChar.CastToUInt32(ch2), 0x7F);
int upper = (sbyte)Unsafe.Add(ref MemoryMarshal.GetReference(lookup), c1);
int lower = (sbyte)Unsafe.Add(ref MemoryMarshal.GetReference(lookup), c2);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you saying a redundant zero extension is worth replacing with unsafe code?

The original change happened 2 years ago, Parse benchmarks were improved since then e.g. recent runs:

image

image

I understand that you observe some wins from that, but I think they can wait for a proper JIT change.
We appreciate any change that replaces unsafe with safe, the opposite action require a strong motivation and micro-benchmarks aren't typically the case

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, I think Guid specifically can be made 100% safe code while maintaining same perf.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you saying a redundant zero extension is worth replacing with unsafe code?

In my opinion this is justified by the improvement of ~10% in ParseExactD (#116574 (comment) vs #116574 (comment)).

Any workaround would not achieve the same performance and would complicate the logic. However, given this requires strong motivation and this is best resolved with a JIT change, I have removed the unsafe.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not all 10% is created equal. It’s saving just over a nanosecond worth of time, or about 1-4 cycles on a typical CPU

This is unlikely to ever be the bottleneck in a real world application and is more likely going to be lost to the general noise of the system (as caused by memory latency, dynamic frequency changes, power settings, branch mispredicts, cache misses, disk, network, or other IO, etc)

It’s simply not worth micro-optimizing every possible scenario to that degree, especially when there are worse overall downsides to code readability, maintainability, and other risks.

Copy link
Member

@MihaZupan MihaZupan Jun 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we were trying to optimize this further, we should look at replacing these with a SIMD implementation instead of trying to hyperoptimize the current impl IMO - parse the 32 hex chars + shuffle into the right order.
We already have most of the relevant logic in HexConverter, just a question of whether we feel it's worth the extra code/maintainability cost.
If/when we could say that we don't care about perf on targets without Vector128, it might even be less code overall.

@xtqqczze
Copy link
Contributor Author

@MihuBot benchmark Perf_Guid -medium

@MihuBot
Copy link

MihuBot commented Jun 13, 2025

System.Tests.Perf_Guid
BenchmarkDotNet v0.14.1-nightly.20250107.205, Linux Ubuntu 22.04.5 LTS (Jammy Jellyfish)
AMD EPYC 9V74, 1 CPU, 8 logical and 4 physical cores
MediumRun : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
Job=MediumRun  IterationCount=15  LaunchCount=2
WarmupCount=10
Method Toolchain Mean Error Ratio Allocated Alloc Ratio
NewGuid Main 548.0208 ns 0.6214 ns 1.00 - NA
NewGuid PR 548.1821 ns 0.4214 ns 1.00 - NA
ctor_bytes Main 0.5594 ns 0.0026 ns 1.00 - NA
ctor_bytes PR 0.6190 ns 0.0425 ns 1.11 - NA
EqualsSame Main 0.0005 ns 0.0004 ns ? - ?
EqualsSame PR 0.0002 ns 0.0001 ns ? - ?
EqualsNotSame Main 0.0004 ns 0.0004 ns ? - ?
EqualsNotSame PR 0.0016 ns 0.0004 ns ? - ?
EqualsOperator Main 0.0002 ns 0.0001 ns ? - ?
EqualsOperator PR 0.0001 ns 0.0002 ns ? - ?
NotEqualsOperator Main 0.0004 ns 0.0005 ns ? - ?
NotEqualsOperator PR 0.0006 ns 0.0005 ns ? - ?
ParseExactD Main 18.1742 ns 0.3769 ns 1.00 - NA
ParseExactD PR 15.5857 ns 0.0523 ns 0.86 - NA
GuidToString Main 11.6960 ns 0.3840 ns 1.00 96 B 1.00
GuidToString PR 11.5414 ns 0.0438 ns 0.99 96 B 1.00
TryWriteBytes Main 0.1998 ns 0.0494 ns 1.16 - NA
TryWriteBytes PR 0.1964 ns 0.0527 ns 1.14 - NA
ctor_str Main 18.5513 ns 0.0808 ns 1.00 - NA
ctor_str PR 15.4892 ns 0.0945 ns 0.83 - NA
Parse Main 18.6132 ns 0.0761 ns 1.00 - NA
Parse PR 15.4071 ns 0.0745 ns 0.83 - NA
System.Text.Json.Tests.Perf_Guids
BenchmarkDotNet v0.14.1-nightly.20250107.205, Linux Ubuntu 22.04.5 LTS (Jammy Jellyfish)
AMD EPYC 9V74, 1 CPU, 8 logical and 4 physical cores
MediumRun : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
Job=MediumRun  OutlierMode=DontRemove  IterationCount=15
LaunchCount=2  MemoryRandomization=True  WarmupCount=10
Method Toolchain Formatted SkipValidation Mean Error Ratio Allocated Alloc Ratio
WriteGuids Main False False 609.9 μs 2.02 μs 1.00 8.63 KB 1.00
WriteGuids PR False False 610.1 μs 1.06 μs 1.00 8.63 KB 1.00
WriteGuids Main False True 585.4 μs 0.85 μs 1.00 8.63 KB 1.00
WriteGuids PR False True 586.1 μs 1.33 μs 1.00 8.63 KB 1.00
WriteGuids Main True False 811.5 μs 0.52 μs 1.00 8.63 KB 1.00
WriteGuids PR True False 813.0 μs 0.93 μs 1.00 8.63 KB 1.00
WriteGuids Main True True 798.5 μs 1.42 μs 1.00 8.63 KB 1.00
WriteGuids PR True True 799.8 μs 1.17 μs 1.00 8.63 KB 1.00

@xtqqczze
Copy link
Contributor Author

Method Toolchain Mean Error Ratio Allocated Alloc Ratio
ParseExactD Main 18.1742 ns 0.3769 ns 1.00 - NA
ParseExactD PR 15.5857 ns 0.0523 ns 0.86 - NA
ctor_str Main 18.5513 ns 0.0808 ns 1.00 - NA
ctor_str PR 15.4892 ns 0.0945 ns 0.83 - NA
Parse Main 18.6132 ns 0.0761 ns 1.00 - NA
Parse PR 15.4071 ns 0.0745 ns 0.83 - NA

@xtqqczze xtqqczze marked this pull request as ready for review June 13, 2025 13:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-System.Runtime community-contribution Indicates that the PR has been added by a community member
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants