Optimise Guid.DecodeByte
#116574
Conversation
invalidIfNegative |= result;
return (byte)result;

int c1 = typeof(TChar) == typeof(byte) ? byte.CreateTruncating(ch1) : (int)Math.Min(TChar.CastToUInt32(ch1), 0x7F);
Why isn't this just TChar.CastToUInt32(ch1) instead of byte.CreateTruncating(ch1)? The helper API exists because it's cheaper and easier to inline than T.CreateTruncating(ch1): we know the type is either byte or char, so it just zero-extends up.
|
||
int c1 = typeof(TChar) == typeof(byte) ? byte.CreateTruncating(ch1) : (int)Math.Min(TChar.CastToUInt32(ch1), 0x7F); | ||
uint upper = lookup[c1]; | ||
invalidIfGreaterThan7F |= upper; |
I expect it would be cheaper if we did this in a local and only wrote the byref once at the end.

I also expect it'd be cheaper if we kept it as invalidIfNegative. You could still have this method bitwise-or it together as unsigned, since it's always 0xFF for invalid. But then you can do invalidIfNegative = (sbyte)lookupBitmask, which takes care of the write and the sign extension together.

This then ensures the if (invalidIfNegative >= 0) check can continue being cheaper to encode/execute, since it's doing an operation against 0, which can be special-cased.
> I expect it would be cheaper if we did this in a local and only wrote the byref once at the end.

The byref disappears once DecodeByte is inlined, so using a local would be no cheaper.

> I also expect it'd be cheaper if we kept it as invalidIfNegative. You could still have this method bitwise-or it together as unsigned, since it's always 0xFF for invalid. But then you can do invalidIfNegative = (sbyte)lookupBitmask, which takes care of the write and the sign extension together.

I took care to avoid the unnecessary sign extension, expecting the additional or to be cheaper than 2x movsx.

> This then ensures the if (invalidIfNegative >= 0) check can continue being cheaper to encode/execute, since it's doing an operation against 0, which can be special-cased.

Comparing against zero saves one byte (on x64) for the whole parse, but the invalidIfGreaterThan7F approach reduces total code size significantly; see https://gist.github.com/MihuBot/523f631c5a33a1d5ae3fa14ca3e32777.
> I took care to avoid the unnecessary sign extension, expecting the additional or to be cheaper than 2x movsx.

These are the diffs for one iteration of DecodeByte once inlined. Not sure why the JIT isn't emitting movsx rax, byte ptr [rax+rcx] for the base.
lea rdx, bword ptr [r15+0x04]
movzx rsi, word ptr [rbx+0x16]
movzx r8, word ptr [rbx+0x18]
- movzx rax, sil
- movzx rax, byte ptr [rax+rcx]
- movsx rax, al
- movzx r9, r8b
- movzx r9, byte ptr [r9+rcx]
- movsx r9, r9b
- shl eax, 4
- or eax, r9d
+ cmp esi, 127
+ cmovg esi, eax
+ movzx rsi, byte ptr [rsi+rdi]
+ or ecx, esi
+ cmp r8d, 127
+ cmovg r8d, eax
+ movzx r8, byte ptr [r8+rdi]
+ or ecx, r8d
+ shl esi, 4
or esi, r8d
- mov r8d, -1
- shr esi, 8
- cmovne eax, r8d
- or edi, eax
- mov byte ptr [rdx], al
+ mov byte ptr [rdx], sil
> I also expect it'd be cheaper if we kept it as invalidIfNegative.

It's annoying this requires using (sbyte)Unsafe.Add(ref MemoryMarshal.GetReference(lookup), ch1) to get sign extension on the memory load.
> It's annoying this requires using (sbyte)Unsafe.Add(ref MemoryMarshal.GetReference(lookup), ch1) to get sign extension on the memory load.

Opened #116593
I'm pretty sure you can just simplify this to (effectively):
int c1 = ...;
int c2 = ...;
ReadOnlySpan<byte> lookup = CharToHexLookup;
int upper = lookup[c1];
int lower = lookup[c2];
int result = (upper << 4) | lower;
resultBitMask |= (upper | lower);
return (byte)result;
Then you can do the isNegative check as:
if ((sbyte)(resultBitMask) >= 0)
This greatly reduces the work and removes any branching considerations, bringing it down to:
movzx edx, byte ptr [r8+2] ; create c1 and c2 from ch1/ch2
movzx r11d, byte ptr [r8+3] ;
movzx edx, byte ptr [rdx+r10] ; lookup of lower and upper from c1/c2
movzx r11d, byte ptr [r11+r10] ;
or r9d, edx ; or upper and lower into resultBitMask
or r9d, r11d ;
shl edx, 4 ; create result
or edx, r11d ;
mov [rcx], dl ; write result to destination
lea rcx, [rax+2] ;
It's possible you could change the ordering slightly to better take advantage of pipelining too, but likely not that beneficial.
The general point being that we can minimize the additional work while still preserving the ability to optimize later parts of the logic.
I've kept an approach using sign extension. This possibly leaves some performance on the table, but it uses simpler logic and, if #116593 is resolved, would achieve the best performance. The alternative is to use a documented piece of unsafe code for indexing that is provably in bounds.
There was a build break in
@MihuBot benchmark Perf_Guid -medium
System.Tests.Perf_Guid
System.Text.Json.Tests.Perf_Guids

@MihuBot benchmark Perf_Guid -medium -arm
System.Tests.Perf_Guid
System.Text.Json.Tests.Perf_Guids
uint c1 = typeof(TChar) == typeof(byte) ? TChar.CastToUInt32(ch1) : Math.Min(TChar.CastToUInt32(ch1), 0x7F);
uint c2 = typeof(TChar) == typeof(byte) ? TChar.CastToUInt32(ch2) : Math.Min(TChar.CastToUInt32(ch2), 0x7F);
int upper = (sbyte)Unsafe.Add(ref MemoryMarshal.GetReference(lookup), c1);
int lower = (sbyte)Unsafe.Add(ref MemoryMarshal.GetReference(lookup), c2);
Are you saying a redundant zero extension is worth replacing with unsafe code?

The original change happened 2 years ago, and Parse benchmarks have improved since then, e.g. in recent runs. I understand that you observe some wins from this, but I think they can wait for a proper JIT change. We appreciate any change that replaces unsafe code with safe code; the opposite requires strong motivation, and micro-benchmarks typically aren't enough.
In fact, I think Guid specifically can be made 100% safe code while maintaining the same perf.
> Are you saying a redundant zero extension is worth replacing with unsafe code?

In my opinion this is justified by the improvement of ~10% in ParseExactD (#116574 (comment) vs #116574 (comment)). Any workaround would not achieve the same performance and would complicate the logic. However, given that this requires strong motivation and is best resolved with a JIT change, I have removed the unsafe code.
Not all 10% is created equal. It's saving just over a nanosecond of time, or about 1-4 cycles on a typical CPU. This is unlikely to ever be the bottleneck in a real-world application and is more likely to be lost to the general noise of the system (caused by memory latency, dynamic frequency changes, power settings, branch mispredicts, cache misses, disk, network, or other IO, etc.). It's simply not worth micro-optimizing every possible scenario to that degree, especially when there are broader downsides to code readability, maintainability, and other risks.
If we were trying to optimize this further, we should look at replacing these with a SIMD implementation instead of trying to hyper-optimize the current impl, IMO: parse the 32 hex chars and shuffle them into the right order. We already have most of the relevant logic in HexConverter; it's just a question of whether we feel it's worth the extra code/maintainability cost. If/when we can say that we don't care about perf on targets without Vector128, it might even be less code overall.
@MihuBot benchmark Perf_Guid -medium
System.Tests.Perf_Guid
System.Text.Json.Tests.Perf_Guids
Address regressions from #88267:
Benchmarks: Guid.DecodeByte MihuBot/runtime-utils#1128
Diffs: Guid.DecodeByte MihuBot/runtime-utils#1129