Description
The JIT generates suboptimal x64 code for Math.BigMul(long, long, out long).
Configuration
- .NET 6
- Windows 10
- AMD Ryzen CPU (x64)
Data
Source code:
[MethodImpl(MethodImplOptions.NoInlining | MethodImplOptions.AggressiveOptimization)]
public static void TestBigMul2(ref ulong x, ref ulong y)
{
    x = Math.BigMul(x, y, out y);
}

[MethodImpl(MethodImplOptions.NoInlining | MethodImplOptions.AggressiveOptimization)]
public static void TestBigMul1(ref long x, ref long y)
{
    x = Math.BigMul(x, y, out y);
}
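The disassembly below can be reproduced, for example, with BenchmarkDotNet's DisassemblyDiagnoser; the harness here is only a sketch and not part of the original repro (any disassembly tool gives equivalent listings):

using System;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

// Hypothetical harness (not part of the original repro). DisassemblyDiagnoser
// dumps the JIT-generated machine code for each benchmark method, so the
// inlined Math.BigMul sequence is visible in the report.
[DisassemblyDiagnoser]
public class BigMulDisasm
{
    private ulong _ux = 123, _uy = 456;
    private long _sx = 123, _sy = 456;

    [Benchmark]
    public ulong Unsigned() => Math.BigMul(_ux, _uy, out _uy);

    [Benchmark]
    public long Signed() => Math.BigMul(_sx, _sy, out _sy);
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<BigMulDisasm>();
}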
Compiling TestBigMul2 and TestBigMul1 results in the following machine code:
TestBigMul2(ref ulong, ref ulong):
push rax
mov rax,qword ptr [rcx]
mov qword ptr [rsp+18h],rdx
mov r8,qword ptr [rdx]
lea r9,[rsp]
mov rdx,rax
mulx rax,r10,r8
mov qword ptr [r9],r10
mov rdx,qword ptr [rsp]
mov r8,qword ptr [rsp+18h]
mov qword ptr [r8],rdx
mov qword ptr [rcx],rax
add rsp,8
ret
TestBigMul1(ref long, ref long):
push rax
mov rax,qword ptr [rcx]
mov qword ptr [rsp+18h],rdx
mov r8,qword ptr [rdx]
lea r9,[rsp]
mov rdx,rax
mulx rdx,r10,r8
mov qword ptr [r9],r10
mov r9,qword ptr [rsp]
mov r10,qword ptr [rsp+18h]
mov qword ptr [r10],r9
mov r9,rax
sar r9,3Fh
and r9,r8
sub rdx,r9
sar r8,3Fh
and rax,r8
sub rdx,rax
mov qword ptr [rcx],rdx
add rsp,8
ret
Analysis
The unsigned overload uses a single mulx instruction, as expected.
The signed overload also uses mulx, but with six additional instructions (2x sar, and, sub) to adjust the upper half of the result. This increases the latency from 4 cycles to at least 8 cycles in fully inlined code.
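For illustration, those extra instructions implement the standard correction that derives the signed high half of the product from the unsigned one, roughly equivalent to the following sketch (not necessarily the exact framework source):

// Sketch of the correction the JIT currently emits for the signed overload
// (illustration only). a >> 63 is an arithmetic shift yielding 0 or -1
// (all ones), so each term subtracts the other operand exactly when the
// first operand is negative.
static long BigMulSignedViaUnsigned(long a, long b, out long low)
{
    ulong highUnsigned = Math.BigMul((ulong)a, (ulong)b, out ulong lowUnsigned);
    low = (long)lowUnsigned;
    return (long)highUnsigned - ((a >> 63) & b) - ((b >> 63) & a);
}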
This correction is completely unnecessary, as the x64 architecture has a dedicated instruction for widening signed multiplication: the one-operand imul. The whole sequence of mulx, sar, and, sub, sar, and, sub could thus be replaced by a single imul instruction.
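A hypothetical codegen for the signed case could then look like this (register choices are illustrative; the one-operand imul leaves the 128-bit signed product in rdx:rax):

TestBigMul1(ref long, ref long):
mov rax,qword ptr [rcx]    ; rax = x
mov r8,rdx                 ; save the address of y, since imul clobbers rdx
imul qword ptr [r8]        ; rdx:rax = x * y (signed 128-bit product)
mov qword ptr [r8],rax     ; y = low half
mov qword ptr [rcx],rdx    ; x = high half
ret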
Also in this particular case, both methods use the stack unnecessarily, but that's probably a separate issue.
category:cq
theme:floating-point
skill-level:intermediate
cost:medium
impact:small