Description
The JIT generates suboptimal x64 code for Math.BigMul(long, long, out long).
Configuration
- .NET 6
- Windows 10
- AMD Ryzen CPU (x64)
Data
Source code:
[MethodImpl(MethodImplOptions.NoInlining | MethodImplOptions.AggressiveOptimization)]
public static void TestBigMul2(ref ulong x, ref ulong y)
{
    x = Math.BigMul(x, y, out y);
}

[MethodImpl(MethodImplOptions.NoInlining | MethodImplOptions.AggressiveOptimization)]
public static void TestBigMul1(ref long x, ref long y)
{
    x = Math.BigMul(x, y, out y);
}
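The disassembly below can be reproduced, for example, with BenchmarkDotNet's DisassemblyDiagnoser; the harness here is only a sketch and not part of the original repro (any disassembly tool gives equivalent listings):

using System;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

// Hypothetical harness (not part of the original repro). DisassemblyDiagnoser
// dumps the JIT-generated machine code for each benchmark method, so the
// inlined Math.BigMul sequence is visible in the report.
[DisassemblyDiagnoser]
public class BigMulDisasm
{
    private ulong _ux = 123, _uy = 456;
    private long _sx = 123, _sy = 456;

    [Benchmark]
    public ulong Unsigned() => Math.BigMul(_ux, _uy, out _uy);

    [Benchmark]
    public long Signed() => Math.BigMul(_sx, _sy, out _sy);
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<BigMulDisasm>();
}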
Compiling TestBigMul2 and TestBigMul1 results in the following machine code:
TestBigMul2(ref ulong, ref ulong):
push rax
mov rax,qword ptr [rcx]
mov qword ptr [rsp+18h],rdx
mov r8,qword ptr [rdx]
lea r9,[rsp]
mov rdx,rax
mulx rax,r10,r8
mov qword ptr [r9],r10
mov rdx,qword ptr [rsp]
mov r8,qword ptr [rsp+18h]
mov qword ptr [r8],rdx
mov qword ptr [rcx],rax
add rsp,8
ret
TestBigMul1(ref long, ref long):
push rax
mov rax,qword ptr [rcx]
mov qword ptr [rsp+18h],rdx
mov r8,qword ptr [rdx]
lea r9,[rsp]
mov rdx,rax
mulx rdx,r10,r8
mov qword ptr [r9],r10
mov r9,qword ptr [rsp]
mov r10,qword ptr [rsp+18h]
mov qword ptr [r10],r9
mov r9,rax
sar r9,3Fh
and r9,r8
sub rdx,r9
sar r8,3Fh
and rax,r8
sub rdx,rax
mov qword ptr [rcx],rdx
add rsp,8
ret
Analysis
The unsigned overload uses a single mulx instruction, as expected.
The signed overload also uses mulx, but with six additional instructions (2x sar, and, sub) to adjust the upper half of the result. This increases the latency from 4 cycles to at least 8 cycles in fully inlined code.
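For illustration, those extra instructions implement the standard correction that derives the signed high half of the product from the unsigned one, roughly equivalent to the following sketch (not necessarily the exact framework source):

// Sketch of the correction the JIT currently emits for the signed overload
// (illustration only). a >> 63 is an arithmetic shift yielding 0 or -1
// (all ones), so each term subtracts the other operand exactly when the
// first operand is negative.
static long BigMulSignedViaUnsigned(long a, long b, out long low)
{
    ulong highUnsigned = Math.BigMul((ulong)a, (ulong)b, out ulong lowUnsigned);
    low = (long)lowUnsigned;
    return (long)highUnsigned - ((a >> 63) & b) - ((b >> 63) & a);
}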
This correction is completely unnecessary, as the x64 architecture has a dedicated instruction for widening signed multiplication: the one-operand imul. The whole sequence of mulx, sar, and, sub, sar, and, sub could thus be replaced by a single imul instruction.
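A hypothetical codegen for the signed case could then look like this (register choices are illustrative; the one-operand imul leaves the 128-bit signed product in rdx:rax):

TestBigMul1(ref long, ref long):
mov rax,qword ptr [rcx]    ; rax = x
mov r8,rdx                 ; save the address of y, since imul clobbers rdx
imul qword ptr [r8]        ; rdx:rax = x * y (signed 128-bit product)
mov qword ptr [r8],rax     ; y = low half
mov qword ptr [rcx],rdx    ; x = high half
ret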
Also in this particular case, both methods use the stack unnecessarily, but that's probably a separate issue.
category:cq
theme:floating-point
skill-level:intermediate
cost:medium
impact:small