Suboptimal x64 codegen for signed Math.BigMul #75594

@tevador

Description

The JIT generates suboptimal x64 code for Math.BigMul(long, long, out long).

Configuration

  • .NET 6
  • Windows 10
  • AMD Ryzen CPU (x64)

Data

Source code:

[MethodImpl(MethodImplOptions.NoInlining | MethodImplOptions.AggressiveOptimization)]
public static void TestBigMul2(ref ulong x, ref ulong y)
{
	x = Math.BigMul(x, y, out y);
}

[MethodImpl(MethodImplOptions.NoInlining | MethodImplOptions.AggressiveOptimization)]
public static void TestBigMul1(ref long x, ref long y)
{
	x = Math.BigMul(x, y, out y);
}

Results in the following machine code:

TestBigMul2(ref ulong, ref ulong):
 push        rax  
 mov         rax,qword ptr [rcx]  
 mov         qword ptr [rsp+18h],rdx  
 mov         r8,qword ptr [rdx]  
 lea         r9,[rsp]  
 mov         rdx,rax  
 mulx        rax,r10,r8  
 mov         qword ptr [r9],r10  
 mov         rdx,qword ptr [rsp]  
 mov         r8,qword ptr [rsp+18h]  
 mov         qword ptr [r8],rdx  
 mov         qword ptr [rcx],rax  
 add         rsp,8  
 ret  

TestBigMul1(ref long, ref long):
 push        rax  
 mov         rax,qword ptr [rcx]  
 mov         qword ptr [rsp+18h],rdx  
 mov         r8,qword ptr [rdx]  
 lea         r9,[rsp]  
 mov         rdx,rax  
 mulx        rdx,r10,r8  
 mov         qword ptr [r9],r10  
 mov         r9,qword ptr [rsp]  
 mov         r10,qword ptr [rsp+18h]  
 mov         qword ptr [r10],r9  
 mov         r9,rax  
 sar         r9,3Fh  
 and         r9,r8  
 sub         rdx,r9  
 sar         r8,3Fh  
 and         rax,r8  
 sub         rdx,rax  
 mov         qword ptr [rcx],rdx  
 add         rsp,8  
 ret

Analysis

The unsigned overload uses a single mulx instruction as expected.

The signed overload also uses mulx, followed by six additional instructions (two sar, and, sub sequences) that adjust the upper half of the result. This increases the latency from 4 cycles to at least 8 cycles in fully inlined code. The adjustment is unnecessary, because the x64 architecture has a dedicated instruction for signed widening multiplication: the one-operand imul. The whole sequence of mulx, sar, and, sub, sar, and, sub could thus be replaced by a single imul instruction.
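For reference, the adjustment that the sar, and, sub sequences perform can be written out in C#. This is only a sketch of the arithmetic visible in the disassembly above, not the actual library source; the class name SignedBigMulSketch and the method name HighViaUnsigned are made up for illustration.

```csharp
using System;

public static class SignedBigMulSketch
{
    // Hypothetical helper: computes the high 64 bits of the signed
    // 128-bit product a * b from the unsigned product, mirroring the
    // mulx + (sar, and, sub) x 2 pattern in the disassembly.
    public static long HighViaUnsigned(long a, long b)
    {
        unchecked
        {
            // Unsigned 64x64 -> 128 multiply; only the high half is needed here.
            ulong high = Math.BigMul((ulong)a, (ulong)b, out _);

            // (a >> 63) is all ones when a is negative, zero otherwise,
            // so each term subtracts the other operand only when needed.
            return (long)high - ((a >> 63) & b) - ((b >> 63) & a);
        }
    }
}
```

For any pair of inputs, SignedBigMulSketch.HighViaUnsigned(a, b) should equal the value returned by Math.BigMul(a, b, out long low). The one-operand imul produces that same high half directly in rdx, with the low half in rax, which is why the correction is redundant on x64.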

Also, in this particular case, both methods use the stack unnecessarily, but that is probably a separate issue.

category:cq
theme:floating-point
skill-level:intermediate
cost:medium
impact:small

Labels

  • area-CodeGen-coreclr: CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
  • in-pr: There is an active PR which will close this issue when it is merged
  • tenet-performance: Performance related issue