Simplify codegen of mixed-type checked integer addition and subtraction #15878

Merged

Conversation

HertzDevil
Contributor

Code generation for mixed-type uses of overflow-checked primitive integer addition or subtraction is currently somewhat convoluted, in particular when the receiver is signed and the argument unsigned, or vice versa. The compiler would create an intermediate integer type that has one more bit than the operands, e.g. i9 or i33; most LLVM targets do not like these types, and this leads to some rather unsightly LLVM IR that is hard to optimize.
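
For illustration (this snippet is not from the PR itself), the affected pattern is an ordinary overflow-checked + whose operands have different integer types, e.g. a signed receiver with an unsigned argument:

a = 100_i8
b = 20_u8
a + b # => 120_i8; overflow-checked, so 100_i8 + 200_u8 would raise OverflowError instead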

This PR is a complete overhaul of the way these additions and subtractions are emitted. There are now multiple specialized code paths, chosen according to whether the two operand types share the same signedness or the same width. The following snippets illustrate how each code path can be expressed equivalently in terms of other primitive calls in native Crystal:

fun i8_add_u8(p1 : Int8, p2 : UInt8) : Int8
  p1_biased = (p1 ^ Int8::MIN).to_u8!
  result = p1_biased + p2 # same-type, checked
  result.to_i8! ^ Int8::MIN
end

fun u16_add_i8(p1 : UInt16, p2 : Int8) : UInt16
  p1_biased = p1.to_i16! ^ Int16::MIN
  result = i16_add_i8(p1_biased, p2) # checked, see below
  (result ^ Int16::MIN).to_u16!
end

fun i8_add_u16(p1 : Int8, p2 : UInt16) : Int8
  p1_biased = (p1 ^ Int8::MIN).to_u8!
  result = u8_add_u16(p1_biased, p2) # checked, see below
  result.to_i8! ^ Int8::MIN
end

fun i8_add_i16(p1 : Int8, p2 : Int16) : Int8
  p1_ext = p1.to_i16!
  result = p1_ext &+ p2
  result.to_i8 # checked
end

# the actual optimal call sequence is slightly different,
# probably due to some short-circuit evaluation issue 
fun u8_add_u16(p1 : UInt8, p2 : UInt16) : UInt8
  p2_trunc = p2.to_u8 # checked
  p1 + p2_trunc       # same-type, checked
end

fun i16_add_i8(p1 : Int16, p2 : Int8) : Int16
  p2_ext = p2.to_i16!
  p1 + p2_ext # same-type, checked
end

(Before and after on Compiler Explorer)

The gist here is that mixed-signedness operations are transformed into same-signedness ones by applying a bias to the first operand and switching its signedness, using a bitwise XOR. For example, -0x80_i8..0x7F_i8 maps linearly onto 0x00_u8..0xFF_u8, and vice versa. The same-signedness arithmetic operation that follows will overflow if and only if the original operation does. The result is XOR'ed again afterwards, since the bias is its own inverse. This is the trick that allows mixed-type addition or subtraction without resorting to arcane integer bit widths.
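
To make the bias concrete, here is a minimal sketch (mine, not part of the PR) that evaluates the Int8 mapping at its extremes; applying the same XOR a second time restores the original value:

p((Int8::MIN ^ Int8::MIN).to_u8!) # => 0
p((0_i8 ^ Int8::MIN).to_u8!)      # => 128
p((Int8::MAX ^ Int8::MIN).to_u8!) # => 255

# Boundary check for i8_add_u8 above: 127_i8 + 1_u8 must overflow Int8,
# and indeed the biased same-type addition 255_u8 + 1_u8 overflows UInt8.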

@straight-shoota
Member

It might be nice to put the snippets with the native Crystal equivalents directly into code comments.
And perhaps also a bit of the explanation.

@straight-shoota straight-shoota added this to the 1.17.0 milestone Jun 13, 2025
@straight-shoota straight-shoota merged commit cd78c4b into crystal-lang:master Jun 15, 2025
36 of 37 checks passed
@HertzDevil HertzDevil deleted the perf/mixed-type-add-sub branch June 15, 2025 21:13
@BlobCodes
Contributor

This definitely looks like an improvement, but if the issue was that the types used to perform the arithmetic were inefficient, why not just switch those out?

I tried doing that and I think the resulting assembly looks even better for u8/i8/u16/i16 (I haven't tested other types).
The new functions are sometimes one instruction longer, but only because they need extra MOVs to satisfy the C ABI.

I also added a test function at the end of the provided Godbolt sample that performs several mixed-sign, mixed-size additions, to show how well LLVM can inline and combine the arithmetic operations; the result appears to be more efficient.


The test code shown here performs the operation (u16 + i8) + (u8 + u16) + (i8 + i16),
with all operands read from a pointer.
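
Written out in plain Crystal, the test expression is roughly the following (a hedged reconstruction for readability; the actual Godbolt sample loads the six operands from a pointer rather than taking them as parameters):

def test_sum(a : UInt16, b : Int8, c : UInt8, d : UInt16, e : Int8, f : Int16)
  (a + b) + (c + d) + (e + f) # every + here is mixed-type and overflow-checked
end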

Original x86 ASM (before this MR)
test_sum_of_array_old:
        pushq   %rax
        movzwl  (%rdi), %ecx
        movsbl  2(%rdi), %eax
        addl    %ecx, %eax
        movl    %eax, %ecx
        shll    $15, %ecx
        sarl    $15, %ecx
        cmpl    %eax, %ecx
        jne     .LBB12_10
        testl   %ecx, %ecx
        js      .LBB12_10
        movzbl  4(%rdi), %ecx
        addw    6(%rdi), %cx
        jb      .LBB12_10
        movzwl  %cx, %edx
        cmpl    $256, %edx
        jae     .LBB12_10
        addw    %cx, %ax
        jb      .LBB12_10
        movsbl  8(%rdi), %ecx
        movzwl  10(%rdi), %esi
        leal    (%rcx,%rsi), %edx
        addw    %si, %cx
        jo      .LBB12_10
        addl    $-128, %edx
        movzwl  %dx, %edx
        cmpl    $65279, %edx
        jbe     .LBB12_10
        movzwl  %ax, %eax
        movsbl  %cl, %ecx
        addl    %eax, %ecx
        movl    %ecx, %eax
        shll    $15, %eax
        sarl    $15, %eax
        cmpl    %ecx, %eax
        jne     .LBB12_10
        testl   %eax, %eax
        js      .LBB12_10
        popq    %rcx
        retq
.LBB12_10:
        callq   __crystal_raise_overflow
Current x86 ASM (this MR)
test_sum_of_array_pr:
        pushq   %rax
        movzwl  (%rdi), %ecx
        movsbl  2(%rdi), %eax
        xorl    $32768, %ecx
        addw    %ax, %cx
        jo      .LBB10_7
        movzwl  6(%rdi), %eax
        movzwl  %ax, %edx
        cmpl    $256, %edx
        jae     .LBB10_7
        addb    4(%rdi), %al
        jb      .LBB10_7
        xorl    $32768, %ecx
        movzbl  %al, %eax
        addw    %cx, %ax
        jb      .LBB10_7
        movsbl  8(%rdi), %ecx
        movzwl  10(%rdi), %edx
        movswl  %cx, %ecx
        addl    %edx, %ecx
        movsbl  %cl, %edx
        cmpw    %cx, %dx
        jne     .LBB10_7
        xorl    $32768, %eax
        addw    %cx, %ax
        jo      .LBB10_7
        xorl    $32768, %eax
        popq    %rcx
        retq
.LBB10_7:
        callq   __crystal_raise_overflow
New x86 ASM (always cast to u64/i64)
test_sum_of_array_new:
        pushq   %rax
        movzwl  (%rdi), %eax
        movsbq  2(%rdi), %rcx
        addq    %rax, %rcx
        cmpq    $65536, %rcx
        jae     .LBB13_6
        movzwl  6(%rdi), %edx
        movzbl  4(%rdi), %eax
        addq    %rdx, %rax
        cmpq    $256, %rax
        jae     .LBB13_6
        addq    %rcx, %rax
        cmpq    $65536, %rax
        jae     .LBB13_6
        movsbq  8(%rdi), %rdx
        movswq  10(%rdi), %rcx
        addq    %rdx, %rcx
        movsbq  %cl, %rdx
        cmpq    %rcx, %rdx
        jne     .LBB13_6
        addq    %rcx, %rax
        cmpq    $65536, %rax
        jae     .LBB13_6
        popq    %rcx
        retq
.LBB13_6:
        callq   __crystal_raise_overflow

This PR: https://godbolt.org/z/cajb65ex5
New: https://godbolt.org/z/eTnrYro8s


I haven't tested this with bigger types such as u32 or i128, but I think we could use this to improve performance even further while keeping the codegen code as simple as before.
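
For reference, here is a rough sketch (illustrative only, not BlobCodes' actual patch) of how the always-widen idea could look in the same style as the snippets above; it assumes both operands fit in 64 bits, so UInt64 and Int128 operands would still need another strategy:

fun u8_add_u16_wide(p1 : UInt8, p2 : UInt16) : UInt8
  sum = p1.to_i64! &+ p2.to_i64! # both operands fit in Int64, so &+ cannot wrap
  sum.to_u8                      # checked narrowing back to the receiver type
end

fun u16_add_i8_wide(p1 : UInt16, p2 : Int8) : UInt16
  sum = p1.to_i64! &+ p2.to_i64! # mixed signedness also fits in Int64
  sum.to_u16                     # checked narrowing
end

A single widening pattern like this covers every combination of operand widths up to 32 bits; wider operands would still need the bias trick or a 128-bit intermediate.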
